Zarr
Zarr is a binary file format for chunked, compressed, N-Dimensional array. It is used throughout the PyData ecosystem and especially for climate and biological science applications.
Zarr-Python is the official Python package for reading and writing Zarr arrays. Its main feature is a NumPy-like array that translates array operations into file IO seamlessly. KvikIO provides a GPU backend to Zarr-Python that enables GPUDirect Storage (GDS) seamlessly.
KvikIO supports either zarr-python 2.x or zarr-python 3.x.
However, the API provided in kvikio.zarr
differs based on which version of zarr you have, following the differences between zarr-python 2.x and zarr-python 3.x.
Zarr Python 3.x
Zarr-python includes native support for reading Zarr chunks into device memory if you configure Zarr to use GPUs.
You can use any store, but KvikIO provides kvikio.zarr.GDSStore
to efficiently load data directly into GPU memory.
>>> import zarr
>>> from kvikio.zarr import GDSStore
>>> zarr.config.enable_gpu()
>>> store = GDSStore(root="data.zarr")
>>> z = zarr.create_array(
... store=store, shape=(100, 100), chunks=(10, 10), dtype="float32", overwrite=True
... )
>>> type(z[:10, :10])
cupy.ndarray
Zarr Python 2.x
The following uses zarr-python 2.x, and is an example of how to use the convenience function kvikio.zarr.open_cupy_array()
to create a new Zarr array and how to open an existing Zarr array.
# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
# See file LICENSE for terms.
import cupy
import numpy
import zarr
import kvikio
import kvikio.zarr
def main(path):
a = cupy.arange(20)
# Let's use KvikIO's convenience function `open_cupy_array()` to create
# a new Zarr file on disk. Its semantic is the same as `zarr.open_array()`
# but uses a GDS file store, nvCOMP compression, and CuPy arrays.
z = kvikio.zarr.open_cupy_array(store=path, mode="w", shape=(20,), chunks=(5,))
# `z` is a regular Zarr Array that we can write to as usual
z[0:10] = numpy.arange(0, 10)
# but it also support direct reads and writes of CuPy arrays
z[10:20] = cupy.arange(10, 20)
# Reading `z` returns a CuPy array
assert isinstance(z[:], cupy.ndarray)
assert (a == z[:]).all()
# Normally, we cannot assume that GPU and CPU compressors are compatible.
# E.g., `open_cupy_array()` uses nvCOMP's Snappy GPU compression by default,
# which, as far as we know, isn’t compatible with any CPU compressor. Thus,
# let's re-write our Zarr array using a CPU and GPU compatible compressor.
#
# Warning: it isn't possible to use `CompatCompressor` as a compressor argument
# in Zarr directly. It is only meant for `open_cupy_array()`. However,
# in an example further down, we show how to write using regular Zarr.
z = kvikio.zarr.open_cupy_array(
store=path,
mode="w",
shape=(20,),
chunks=(5,),
compressor=kvikio.zarr.CompatCompressor.lz4(),
)
z[:] = a
# Because we are using a CompatCompressor, it is now possible to open the file
# using Zarr's built-in LZ4 decompressor that uses the CPU.
z = zarr.open_array(path)
# `z` is now read as a regular NumPy array
assert isinstance(z[:], numpy.ndarray)
assert (a.get() == z[:]).all()
# and we can write to is as usual
z[:] = numpy.arange(20, 40)
# And we can read the Zarr file back into a CuPy array.
z = kvikio.zarr.open_cupy_array(store=path, mode="r")
assert isinstance(z[:], cupy.ndarray)
assert (cupy.arange(20, 40) == z[:]).all()
# Similarly, we can also open a file written by regular Zarr.
# Let's write the file without any compressor.
ary = numpy.arange(10)
z = zarr.open(store=path, mode="w", shape=ary.shape, compressor=None)
z[:] = ary
# This works as before where the file is read as a CuPy array
z = kvikio.zarr.open_cupy_array(store=path)
assert isinstance(z[:], cupy.ndarray)
assert (z[:] == cupy.asarray(ary)).all()
# Using a compressor is a bit more tricky since not all CPU compressors
# are GPU compatible. To make sure we use a compable compressor, we use
# the CPU-part of `CompatCompressor.lz4()`.
ary = numpy.arange(10)
z = zarr.open(
store=path,
mode="w",
shape=ary.shape,
compressor=kvikio.zarr.CompatCompressor.lz4().cpu,
)
z[:] = ary
# This works as before where the file is read as a CuPy array
z = kvikio.zarr.open_cupy_array(store=path)
assert isinstance(z[:], cupy.ndarray)
assert (z[:] == cupy.asarray(ary)).all()
if __name__ == "__main__":
main("/tmp/zarr-cupy-nvcomp")