Getting Started with cudf.pandas and Snowflake#

RAPIDS in Snowflake#

RAPIDS is a suite of libraries for executing end-to-end data science pipelines entirely on GPUs. If you have data in a Snowflake table that you want to explore with RAPIDS, you can deploy RAPIDS in Snowflake using Snowpark Container Services.

See Documentation

For the purposes of this example, follow the Run RAPIDS on Snowflake guide before getting started.

NYC Parking Tickets cudf.pandas Example#

If you have data in a Snowflake table, you can accelerate your ETL workflow with cudf.pandas. cudf.pandas accelerates the pandas ecosystem with zero code changes: just load cudf.pandas and you get the benefits of GPU acceleration, with automatic CPU fallback when needed.
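For reference, cudf.pandas can be enabled in a few ways; here is a minimal sketch (in a notebook, use the %load_ext magic shown later in this guide):

# In a script, run `python -m cudf.pandas my_script.py`, or enable it explicitly:
import cudf.pandas

cudf.pandas.install()  # must be called before pandas is imported

import pandas as pd  # pandas operations now run on the GPU, with automatic CPU fallback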

For this example, we have a Snowflake table with the Parking Violations Issued - Fiscal Year 2022 dataset from NYC Open Data.

Get data into a Snowflake table#

To follow along, you will need to load the NYC Parking Violations data into your Snowflake account, and make sure that this data is accessible from the RAPIDS notebook container (Snowpark Container Services) that you deployed following the Run RAPIDS on Snowflake guide.

In a Snowflake SQL worksheet, using the ACCOUNTADMIN role, run:

-- Create a database where the table will live --

CREATE DATABASE CUDF_SNOWFLAKE_EXAMPLE;

USE DATABASE CUDF_SNOWFLAKE_EXAMPLE;

CREATE OR REPLACE FILE FORMAT my_parquet_format
TYPE = 'PARQUET';

CREATE OR REPLACE STAGE my_s3_stage
URL = 's3://rapidsai-data/datasets/nyc_parking/'
FILE_FORMAT = my_parquet_format;

-- Infer schema from parquet file to use when creating table later --
SELECT COLUMN_NAME, TYPE
FROM TABLE(
    INFER_SCHEMA(
        LOCATION => '@my_s3_stage',
        FILE_FORMAT => 'my_parquet_format',
        FILES => ('nyc_parking_violations_2022.parquet')
    )
);

-- Create table using the inferred schema in the previous step --
CREATE OR REPLACE TABLE NYC_PARKING_VIOLATIONS
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
      FROM TABLE(
        INFER_SCHEMA(
        LOCATION => '@my_s3_stage',
        FILE_FORMAT => 'my_parquet_format',
        FILES => ('nyc_parking_violations_2022.parquet')
        )
      ));

-- Get data from the stage into the table --      
COPY INTO NYC_PARKING_VIOLATIONS
FROM @my_s3_stage
FILES = ('nyc_parking_violations_2022.parquet')
FILE_FORMAT = (TYPE = 'PARQUET')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

Ensure access from container#

During the process of deploying RAPIDS in Snowflake, you created a CONTAINER_USER_ROLE. Make sure this role has access to the database, schema, and table where the data lives, so that it can query the data.

-- Ensure the role has USAGE permissions on the database and schema
GRANT USAGE ON DATABASE CUDF_SNOWFLAKE_EXAMPLE TO ROLE CONTAINER_USER_ROLE;
GRANT USAGE ON SCHEMA CUDF_SNOWFLAKE_EXAMPLE.PUBLIC TO ROLE CONTAINER_USER_ROLE; 

-- Ensure the role has SELECT permission on the table
GRANT SELECT ON TABLE CUDF_SNOWFLAKE_EXAMPLE.PUBLIC.NYC_PARKING_VIOLATIONS TO ROLE CONTAINER_USER_ROLE;

Read data and play around#

Now that you have the data in a Snowflake table and the RAPIDS Snowpark container up and running, create a new notebook in the workspace directory (anything added to this directory will persist) and follow the instructions below.

Load cudf.pandas#

In the first cell of your notebook, load the cudf.pandas extension:

%load_ext cudf.pandas
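To verify that the accelerator is active, you can inspect the pandas module; the exact repr text may vary across cudf versions:

import pandas as pd

print(pd)  # e.g. <module 'pandas' (ModuleAccelerator(fast=cudf, slow=pandas))>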

Connect to Snowflake and create a Snowpark session#

import os
from pathlib import Path

from snowflake.snowpark import Session

connection_parameters = {
    "account": os.getenv("SNOWFLAKE_ACCOUNT"),
    "host": os.getenv("SNOWFLAKE_HOST"),
    # OAuth token mounted into every Snowpark Container Services container
    "token": Path("/snowflake/session/token").read_text(),
    "authenticator": "oauth",
    "database": "CUDF_SNOWFLAKE_EXAMPLE",  # the database created earlier
    "schema": "PUBLIC",
    "warehouse": "CONTAINER_HOL_WH",
}

session = Session.builder.configs(connection_parameters).create()

# Check the session
print(
    f"Current session info: Warehouse: {session.get_current_warehouse()}  "
    f"Database: {session.get_current_database()}    "
    f"Schema: {session.get_current_schema()}  "
    f"Role: {session.get_current_role()}"
)
# Get some interesting columns from the table
table = session.table("NYC_PARKING_VIOLATIONS").select(
    "Registration State",
    "Violation Description",
    "Vehicle Body Type",
    "Issue Date",
    "Summons Number",
)
table
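Before pulling the data down, you can optionally sanity-check the row count on the Snowflake side; Snowpark's DataFrame.count() executes the count in the warehouse:

# Optional: count rows server-side before converting to pandas
print(table.count())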

Notice that up to this point, we have a Snowpark DataFrame. To get a pandas DataFrame, we use .to_pandas()

Warning

At the moment, there is a known issue preventing us from accelerating the following step with cudf, and we hope to resolve it soon. In the meantime, we need a workaround to get the data into a pandas DataFrame that cudf.pandas can understand.

from cudf.pandas.module_accelerator import disable_module_accelerator

# Temporarily disable cudf.pandas so Snowpark's to_pandas() returns a
# plain CPU pandas DataFrame
with disable_module_accelerator():
    df = table.to_pandas()

import pandas as pd

# Re-wrap the CPU DataFrame so cudf.pandas can manage it (this will take a few seconds)
df = pd.DataFrame(df)

In the future, the cell above will reduce to simply doing df = table.to_pandas().
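Either way, you can check that df is now a proxy object managed by cudf.pandas; recent cudf versions expose is_proxy_object for this purpose:

import cudf.pandas

# True when df is backed by the cudf.pandas accelerator
print(cudf.pandas.is_proxy_object(df))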

Now we are ready to see cudf.pandas in action. For the record, this dataset has len(df) = 15435607, and you should see the following operations take on the order of milliseconds to run.

Which parking violation is most commonly committed by vehicles from various U.S. states?

Each record in our dataset contains the state of registration of the offending vehicle and the type of parking offence. Let’s say we want to get the most common type of offence for vehicles registered in each state. We can do:

%%time
(
    df[["Registration State", "Violation Description"]]  # get only these two columns
    .value_counts()  # get the count of offences per state and per type of offence
    .groupby("Registration State")  # group by state
    .head(
        1
    )  # get the first row in each group (the type of offence with the largest count)
    .sort_index()  # sort by state name
    .reset_index()
)
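The %%time magic only reports wall-clock time. To confirm which operations actually executed on the GPU, you can rerun a cell under the built-in cudf.pandas profiler magic (the output format varies by cudf version):

%%cudf.pandas.profile
(
    df[["Registration State", "Violation Description"]]
    .value_counts()
    .groupby("Registration State")
    .head(1)
)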

Which vehicle body types are most frequently involved in parking violations? We can count the number of summonses per body type:

%%time

(
    df.groupby(["Vehicle Body Type"])
    .agg({"Summons Number": "count"})
    .rename(columns={"Summons Number": "Count"})
    .sort_values(["Count"], ascending=False)
)

How do parking violations vary across days of the week?

%%time

weekday_names = {
    0: "Monday",
    1: "Tuesday",
    2: "Wednesday",
    3: "Thursday",
    4: "Friday",
    5: "Saturday",
    6: "Sunday",
}

df["Issue Date"] = df["Issue Date"].astype("datetime64[ms]")
df["issue_weekday"] = df["Issue Date"].dt.weekday.map(weekday_names)

df.groupby(["issue_weekday"])["Summons Number"].count().sort_values()
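The result above is sorted by count; if you prefer calendar order, a small reindex using the weekday_names mapping defined above does the trick:

counts = df.groupby(["issue_weekday"])["Summons Number"].count()
print(counts.reindex(list(weekday_names.values())))  # Monday through Sunday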

Conclusion#

With cudf.pandas you can GPU-accelerate workflows that involve data in a Snowflake table, by just reading it into a pandas DataFrame.

When things start to get a little slow, just load cudf.pandas and run your existing code on a GPU!

To learn more, we encourage you to visit rapids.ai/cudf-pandas.