API reference
Dask cuDF implements the Dask DataFrame API with cudf objects used in place of pandas objects. As recommended in the introduction, the best way to use dask_cudf is to use the Dask DataFrame API with the backend set to cudf.
>>> import dask
>>> dask.config.set({"dataframe.backend": "cudf"})
The rest of this page documents the API you might use from dask_cudf explicitly.
Creating and storing DataFrames
Like Dask, Dask-cuDF supports creation of DataFrames from a variety of storage formats. In addition to the methods documented there, Dask-cuDF provides some cuDF-specific methods:
- dask_cudf.from_cudf(data, npartitions=None, chunksize=None, sort=True, name=None)
Create a dask.dataframe.DataFrame from a cudf.DataFrame.
This function is a thin wrapper around dask.dataframe.from_pandas(), accepting the same arguments (described below), except that it operates on cuDF rather than pandas objects.
Construct a Dask DataFrame from a Pandas DataFrame
This splits an in-memory Pandas dataframe into several parts and constructs a dask.dataframe from those parts on which Dask.dataframe can operate in parallel. By default, the input dataframe will be sorted by the index to produce cleanly-divided partitions (with known divisions). To preserve the input ordering, make sure the input index is monotonically-increasing. The sort=False option will also avoid reordering, but will not result in known divisions.
- Parameters:
- data: pandas.DataFrame or pandas.Series
The DataFrame/Series with which to construct a Dask DataFrame/Series
- npartitions: int, optional, default 1
The number of partitions of the index to create. Note that if there are duplicate values or insufficient elements in data.index, the output may have fewer partitions than requested.
- chunksize: int, optional
The desired number of rows per index partition to use. Note that depending on the size and index of the dataframe, actual partition sizes may vary.
- sort: bool
Sort the input by index first to obtain cleanly divided partitions (with known divisions). If False, the input will not be sorted, and all divisions will be set to None. Default is True.
- name: string, optional
An optional keyname for the dataframe. Defaults to hashing the input.
- Returns:
- dask.DataFrame or dask.Series
A dask DataFrame/Series partitioned along the index
- Raises:
- TypeError
If something other than a pandas.DataFrame or pandas.Series is passed in.
See also
dask.dataframe.from_array
Construct a dask.DataFrame from an array that has record dtype
dask.dataframe.read_csv
Construct a dask.DataFrame from a CSV file
Examples
>>> import pandas as pd
>>> from dask.dataframe import from_pandas
>>> df = pd.DataFrame(dict(a=list('aabbcc'), b=list(range(6))),
...                   index=pd.date_range(start='20100101', periods=6))
>>> ddf = from_pandas(df, npartitions=3)
>>> ddf.divisions
(Timestamp('2010-01-01 00:00:00'),
 Timestamp('2010-01-03 00:00:00'),
 Timestamp('2010-01-05 00:00:00'),
 Timestamp('2010-01-06 00:00:00'))
>>> ddf = from_pandas(df.a, npartitions=3)  # Works with Series too!
>>> ddf.divisions
(Timestamp('2010-01-01 00:00:00'),
 Timestamp('2010-01-03 00:00:00'),
 Timestamp('2010-01-05 00:00:00'),
 Timestamp('2010-01-06 00:00:00'))
For on-disk data that are not supported directly in Dask-cuDF, we recommend using one of Dask’s data reading facilities, followed by dask.dataframe.DataFrame.to_backend() with "cudf" to obtain a Dask-cuDF object.