Version: 1.2.4

Connect to Filesystem data

Filesystem data consists of data stored in file formats such as .csv or .parquet, and located in an environment with a folder hierarchy such as Amazon S3, Azure Blob Storage, Google Cloud Storage, or local and networked filesystems. GX can leverage either pandas or Spark to read this data.

To connect to your Filesystem data, you first create a Data Source, which tells GX where your data files reside. You then configure Data Assets for that Data Source to tell GX which sets of records you want to be able to access. Finally, you define Batch Definitions, which allow you to request either all of the records in a Data Asset or a partition of those records based on a specified date.

Create a Data Source

Data Sources tell GX where your data is located and how to connect to it. With Filesystem data this is done by directing GX to the folder or online location that contains the data files. GX supports accessing Filesystem data from Amazon S3, Azure Blob Storage, Google Cloud Storage, and local or networked filesystems.

Prerequisites

Quick access to sample data

All Data Contexts include a built-in pandas_default Data Source. This Data Source gives access to all of the read_*(...) methods available in pandas.

The read_*(...) methods of the pandas_default Data Source allow you to load data into GX without first configuring a Data Source, Data Asset, and Batch Definition. However, the pandas_default Data Source does not save file-reading configurations to the Data Context and is less versatile than a fully configured Data Source, Data Asset, and Batch Definition. It is therefore intended for testing Expectations and exploring data, and is less suited to production and automated workflows.
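
For example, a minimal sketch of this approach is shown below. It assumes a local .csv file at the hypothetical path ./data/yellow_tripdata_sample_2019-01.csv:

Python
import great_expectations as gx

context = gx.get_context()

# Read a .csv file directly into a Batch without configuring a
# Data Source, Data Asset, or Batch Definition first.
# The file path below is a hypothetical example.
batch = context.data_sources.pandas_default.read_csv(
    "./data/yellow_tripdata_sample_2019-01.csv"
)

# Preview the first few records to confirm the data loaded as expected.
print(batch.head())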

Procedure

  1. Define the Data Source's parameters.

    The following information is required when you create a Filesystem Data Source for a local or networked directory:

    • name: A descriptive name used to reference the Data Source. This should be unique within the Data Context.
    • base_directory: The path to the folder that contains the data files, or the root folder of the directory hierarchy that contains the data files.

    If you are using a File Data Context, you can provide a path that is relative to the Data Context's base_directory. Otherwise, you should provide the absolute path to the folder that contains your data.

    In this example, a relative path is defined for a folder that contains taxi trip data for New York City in .csv format:

    Python
    source_folder = "./data"
    data_source_name = "my_filesystem_data_source"
  2. Add a Filesystem Data Source to your Data Context.

    GX can leverage either pandas or Spark as the backend for your Filesystem Data Source. To create your Data Source, execute one of the following sets of code:

    Python
    data_source = context.data_sources.add_pandas_filesystem(
        name=data_source_name, base_directory=source_folder
    )
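
    Alternatively, if you want Spark as the backend, a sketch of the equivalent call (reusing the same name and base_directory variables) is shown below:

    Python
    # Sketch: create the Data Source with Spark instead of pandas
    # as the backend for reading Filesystem data.
    data_source = context.data_sources.add_spark_filesystem(
        name=data_source_name, base_directory=source_folder
    )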

Create a Data Asset

A Data Asset is a collection of related records within a Data Source. These records may be spread across multiple files, but each Data Asset can read only a single file format, which is determined when the Data Asset is created. A Data Source, however, may contain multiple Data Assets covering different file formats and groups of records.

GX provides two types of Data Assets for Filesystem Data Sources: File Data Assets and Directory Data Assets.

File Data Assets are used to retrieve data from individual files in formats such as .csv or .parquet. The file format that a File Data Asset can read is determined when the File Data Asset is created. The specific file that is read is determined by the Batch Definitions that are added to the Data Asset after it is created.

Both Spark and pandas Filesystem Data Sources support File Data Assets for all supported Filesystem environments.
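
Directory Data Assets read one or more files in a folder into a single table and are supported by Spark Filesystem Data Sources. As a sketch, assuming a Spark Filesystem Data Source named data_source and a data folder at the hypothetical path ./data, a Directory Data Asset for .csv files could be added as follows:

Python
# Sketch: add a Directory Data Asset to a Spark Filesystem Data Source.
# The asset name and data_directory below are hypothetical example values.
directory_csv_asset = data_source.add_directory_csv_asset(
    name="taxi_csv_directory", data_directory="./data"
)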

Prerequisites

Procedure

  1. Retrieve your Data Source.

    Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:

    Python
    import great_expectations as gx

    # This example uses a File Data Context which already has
    # a Data Source defined.
    context = gx.get_context()
    data_source_name = "my_filesystem_data_source"
    data_source = context.data_sources.get(data_source_name)
  2. Define your Data Asset's parameters.

    A File Data Asset for files in a local or networked folder hierarchy needs only one piece of information to be created:

    • name: A descriptive name with which to reference the Data Asset. This name should be unique among all Data Assets for the same Data Source.

    This example uses taxi trip data stored in .csv files, so the name "taxi_csv_files" will be used for the Data Asset:

    Python
    asset_name = "taxi_csv_files"
  3. Add the Data Asset to your Data Source.

    A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.

    The following example creates a Data Asset that can read .csv file data:

    Python
    file_csv_asset = data_source.add_csv_asset(name=asset_name)
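
    If your records were stored in .parquet files instead, you would use the corresponding method for that format. The following sketch, with a hypothetical asset name, assumes a pandas Filesystem Data Source:

    Python
    # Sketch: a Data Asset that reads .parquet file data instead of .csv.
    # The asset name here is a hypothetical example.
    file_parquet_asset = data_source.add_parquet_asset(name="taxi_parquet_files")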

Create a Batch Definition

A Batch Definition determines which records in a Data Asset are retrieved for Validation. Batch Definitions can be configured to either provide all of the records in a Data Asset, or to subdivide the Data Asset based on a date.

Batch Definitions for File Data Assets can be configured to return the content of a specific file based on either a file path or a regex match for dates in the name of the file.

Prerequisites

Procedure

  1. Retrieve your Data Asset.

    Replace the value of data_source_name with the name of your Data Source and the value of data_asset_name with the name of your Data Asset in the following code. Then execute it to retrieve an existing Data Source and Data Asset from your Data Context:

    Python
    data_source_name = "my_filesystem_data_source"
    data_asset_name = "my_file_data_asset"
    file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
  2. Add a Batch Definition to the Data Asset.

    A path Batch Definition returns all of the records in a specific data file as a single Batch. A partitioned Batch Definition returns the records of a single file in the Data Asset, selected by matching file names against a regex.

    To define a path Batch Definition you need to provide the following information:

    • name: A name by which you can reference the Batch Definition in the future. This should be unique within the Data Asset.
    • path: The path within the Data Asset of the data file containing the records to return.

    Update the batch_definition_name and batch_definition_path variables and execute the following code to add a path Batch Definition to your Data Asset:

    Python
    batch_definition_name = "yellow_tripdata_sample_2019-01.csv"
    batch_definition_path = "folder_with_data/yellow_tripdata_sample_2019-01.csv"

    batch_definition = file_data_asset.add_batch_definition_path(
        name=batch_definition_name, path=batch_definition_path
    )
  3. Optional. Verify the Batch Definition is valid.

    A path Batch Definition always returns all records in a specific file as a single Batch. Therefore you do not need to provide any additional parameters to retrieve data from a path Batch Definition.

    After retrieving your data you can verify that the Batch Definition is valid by printing the first few retrieved records with batch.head():

    Python
    batch = batch_definition.get_batch()
    print(batch.head())
  4. Optional. Create additional Batch Definitions.

    A Data Asset can have multiple Batch Definitions as long as each Batch Definition has a unique name within that Data Asset. Repeat this procedure to add additional path or partitioned Batch Definitions to your Data Asset.
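
    For example, a partitioned Batch Definition can match files by a date in the file name using a regex with named year and month groups. The following sketch assumes file names like yellow_tripdata_sample_2019-01.csv in the folder_with_data directory; the Batch Definition name and the batch parameter values are hypothetical examples:

    Python
    # Sketch: a partitioned Batch Definition that matches files by the year and
    # month embedded in the file name, e.g. yellow_tripdata_sample_2019-01.csv.
    monthly_batch_definition = file_data_asset.add_batch_definition_monthly(
        name="yellow_tripdata_monthly",
        regex=r"folder_with_data/yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv",
    )

    # Retrieve a specific month's records by passing batch parameters
    # (the values shown are assumed to match the regex groups as strings).
    batch = monthly_batch_definition.get_batch(
        batch_parameters={"year": "2019", "month": "01"}
    )
    print(batch.head())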