The dagster_pandas library provides utilities for using pandas with Dagster and for implementing validation on pandas DataFrames. A good place to start with dagster_pandas is the validation guide.
dagster_pandas.create_dagster_pandas_dataframe_type(name, description=None, columns=None, event_metadata_fn=None, dataframe_constraints=None, loader=None, materializer=None)
Constructs a custom pandas dataframe dagster type.
name (str) – Name of the dagster pandas type.
description (Optional[str]) – A markdown-formatted string, displayed in tooling.
columns (Optional[List[PandasColumn]]) – A list of PandasColumn objects which express dataframe column schemas and constraints.
event_metadata_fn (Optional[Callable[[], Union[Dict[str, Union[str, float, int, Dict, MetadataValue]], List[MetadataEntry]]]]) – A callable which takes your dataframe and returns a dict with string label keys and MetadataValue values. Can optionally return a List[MetadataEntry].
dataframe_constraints (Optional[List[DataFrameConstraint]]) – A list of objects that inherit from DataFrameConstraint. This allows you to express dataframe-level constraints.
loader (Optional[DagsterTypeLoader]) – An instance of a class that inherits from DagsterTypeLoader. If None, we will default to using dataframe_loader.
materializer (Optional[DagsterTypeMaterializer]) – An instance of a class that inherits from DagsterTypeMaterializer. If None, we will default to using dataframe_materializer.
dagster_pandas.RowCountConstraint(num_allowed_rows, error_tolerance=0)
A dataframe constraint that validates the expected count of rows.
dagster_pandas.StrictColumnsConstraint(strict_column_list, enforce_ordering=False)
A dataframe constraint that validates column existence and ordering.
dagster_pandas.PandasColumn(name, constraints=None, is_required=None)
The main API for expressing column level schemas and constraints for your custom dataframe types.
name (str) – Name of the column. This must match up with the column name in the dataframe you expect to receive.
is_required (Optional[bool]) – Flag indicating the optional/required presence of the column. If the column exists, the validate function will validate the column. Defaults to True.
constraints (Optional[List[Constraint]]) – List of constraint objects that indicate the validation rules for the pandas column.
boolean_column(name, non_nullable=False, unique=False, ignore_missing_vals=False, is_required=None)
Simple constructor for PandasColumns that expresses boolean constraints on boolean dtypes.
name (str) – Name of the column. This must match up with the column name in the dataframe you expect to receive.
non_nullable (Optional[bool]) – If true, this column will enforce a constraint that all values in the column are non-null.
unique (Optional[bool]) – If true, this column will enforce a uniqueness constraint on the column values.
ignore_missing_vals (Optional[bool]) – A flag that is passed into most constraints. If true, the constraint will only evaluate non-null data. ignore_missing_vals and non_nullable cannot both be True.
is_required (Optional[bool]) – Flag indicating the optional/required presence of the column. If the column exists, the validate function will validate the column. Defaults to True.
categorical_column(name, categories, of_types=frozenset({'category', 'object'}), non_nullable=False, unique=False, ignore_missing_vals=False, is_required=None)
Simple constructor for PandasColumns that expresses categorical constraints on specified dtypes.
name (str) – Name of the column. This must match up with the column name in the dataframe you expect to receive.
categories (List[Any]) – The valid set of buckets that all values in the column must match.
of_types (Optional[Union[str, Set[str]]]) – The expected dtype[s] that your categories and values must abide by.
non_nullable (Optional[bool]) – If true, this column will enforce a constraint that all values in the column are non-null.
unique (Optional[bool]) – If true, this column will enforce a uniqueness constraint on the column values.
ignore_missing_vals (Optional[bool]) – A flag that is passed into most constraints. If true, the constraint will only evaluate non-null data. ignore_missing_vals and non_nullable cannot both be True.
is_required (Optional[bool]) – Flag indicating the optional/required presence of the column. If the column exists, the validate function will validate the column. Defaults to True.
datetime_column(name, min_datetime=Timestamp('1677-09-21 00:12:43.145224193'), max_datetime=Timestamp('2262-04-11 23:47:16.854775807'), non_nullable=False, unique=False, ignore_missing_vals=False, is_required=None, tz=None)
Simple constructor for PandasColumns that expresses datetime constraints on 'datetime64[ns]' dtypes.
name (str) – Name of the column. This must match up with the column name in the dataframe you expect to receive.
min_datetime (Optional[Union[int,float]]) – The lower bound for values you expect in this column. Defaults to pandas.Timestamp.min.
max_datetime (Optional[Union[int,float]]) – The upper bound for values you expect in this column. Defaults to pandas.Timestamp.max.
non_nullable (Optional[bool]) – If true, this column will enforce a constraint that all values in the column are non-null.
unique (Optional[bool]) – If true, this column will enforce a uniqueness constraint on the column values.
ignore_missing_vals (Optional[bool]) – A flag that is passed into most constraints. If true, the constraint will only evaluate non-null data. ignore_missing_vals and non_nullable cannot both be True.
is_required (Optional[bool]) – Flag indicating the optional/required presence of the column. If the column exists, the validate function will validate the column. Defaults to True.
tz (Optional[str]) – Required timezone for values, e.g. tz='UTC', tz='Europe/Dublin', tz='US/Eastern'. Defaults to None, meaning naive datetime values.
exists(name, non_nullable=False, unique=False, ignore_missing_vals=False, is_required=None)
Simple constructor for PandasColumns that expresses existence constraints.
name (str) – Name of the column. This must match up with the column name in the dataframe you expect to receive.
non_nullable (Optional[bool]) – If true, this column will enforce a constraint that all values in the column are non-null.
unique (Optional[bool]) – If true, this column will enforce a uniqueness constraint on the column values.
ignore_missing_vals (Optional[bool]) – A flag that is passed into most constraints. If true, the constraint will only evaluate non-null data. ignore_missing_vals and non_nullable cannot both be True.
is_required (Optional[bool]) – Flag indicating the optional/required presence of the column. If the column exists, the validate function will validate the column. Defaults to True.
float_column(name, min_value=-inf, max_value=inf, non_nullable=False, unique=False, ignore_missing_vals=False, is_required=None)
Simple constructor for PandasColumns that expresses numeric constraints on float dtypes.
name (str) – Name of the column. This must match up with the column name in the dataframe you expect to receive.
min_value (Optional[Union[int,float]]) – The lower bound for values you expect in this column. Defaults to -float('inf').
max_value (Optional[Union[int,float]]) – The upper bound for values you expect in this column. Defaults to float('inf').
non_nullable (Optional[bool]) – If true, this column will enforce a constraint that all values in the column are non-null.
unique (Optional[bool]) – If true, this column will enforce a uniqueness constraint on the column values.
ignore_missing_vals (Optional[bool]) – A flag that is passed into most constraints. If true, the constraint will only evaluate non-null data. ignore_missing_vals and non_nullable cannot both be True.
is_required (Optional[bool]) – Flag indicating the optional/required presence of the column. If the column exists, the validate function will validate the column. Defaults to True.
integer_column(name, min_value=-inf, max_value=inf, non_nullable=False, unique=False, ignore_missing_vals=False, is_required=None)
Simple constructor for PandasColumns that expresses numeric constraints on integer dtypes.
name (str) – Name of the column. This must match up with the column name in the dataframe you expect to receive.
min_value (Optional[Union[int,float]]) – The lower bound for values you expect in this column. Defaults to -float('inf').
max_value (Optional[Union[int,float]]) – The upper bound for values you expect in this column. Defaults to float('inf').
non_nullable (Optional[bool]) – If true, this column will enforce a constraint that all values in the column are non-null.
unique (Optional[bool]) – If true, this column will enforce a uniqueness constraint on the column values.
ignore_missing_vals (Optional[bool]) – A flag that is passed into most constraints. If true, the constraint will only evaluate non-null data. ignore_missing_vals and non_nullable cannot both be True.
is_required (Optional[bool]) – Flag indicating the optional/required presence of the column. If the column exists, the validate function will validate the column. Defaults to True.
numeric_column(name, min_value=-inf, max_value=inf, non_nullable=False, unique=False, ignore_missing_vals=False, is_required=None)
Simple constructor for PandasColumns that expresses numeric constraints on numeric dtypes.
name (str) – Name of the column. This must match up with the column name in the dataframe you expect to receive.
min_value (Optional[Union[int,float]]) – The lower bound for values you expect in this column. Defaults to -float('inf').
max_value (Optional[Union[int,float]]) – The upper bound for values you expect in this column. Defaults to float('inf').
non_nullable (Optional[bool]) – If true, this column will enforce a constraint that all values in the column are non-null.
unique (Optional[bool]) – If true, this column will enforce a uniqueness constraint on the column values.
ignore_missing_vals (Optional[bool]) – A flag that is passed into most constraints. If true, the constraint will only evaluate non-null data. ignore_missing_vals and non_nullable cannot both be True.
is_required (Optional[bool]) – Flag indicating the optional/required presence of the column. If the column exists, the validate function will validate the column. Defaults to True.
string_column(name, non_nullable=False, unique=False, ignore_missing_vals=False, is_required=None)
Simple constructor for PandasColumns that expresses constraints on string dtypes.
name (str) – Name of the column. This must match up with the column name in the dataframe you expect to receive.
non_nullable (Optional[bool]) – If true, this column will enforce a constraint that all values in the column are non-null.
unique (Optional[bool]) – If true, this column will enforce a uniqueness constraint on the column values.
ignore_missing_vals (Optional[bool]) – A flag that is passed into most constraints. If true, the constraint will only evaluate non-null data. ignore_missing_vals and non_nullable cannot both be True.
is_required (Optional[bool]) – Flag indicating the optional/required presence of the column. If the column exists, the validate function will validate the column. Defaults to True.
dagster_pandas.DataFrame = <dagster.core.types.dagster_type.DagsterType object>
Define a type in dagster. These can be used in the inputs and outputs of ops.
type_check_fn (Callable[[TypeCheckContext, Any], [Union[bool, TypeCheck]]]) – The function that defines the type check. It takes the value flowing through the input or output of the op. If it passes, return either True or a TypeCheck with success set to True. If it fails, return either False or a TypeCheck with success set to False. The first argument must be named context (or, if unused, _, _context, or context_). Use required_resource_keys for access to resources.
key (Optional[str]) – The unique key to identify types programmatically.
The key property always has a value. If you omit the key argument to the init function, it instead receives the value of name. If neither key nor name is provided, a CheckError is thrown.
In the case of a generic type such as List or Optional, this is generated programmatically based on the type parameters.
For most use cases, name should be set and the key argument should not be specified.
name (Optional[str]) – A unique name given by a user. If key is None, key becomes this value. Name is not given in a case where the user does not specify a unique name for this type, such as a generic class.
description (Optional[str]) – A markdown-formatted string, displayed in tooling.
loader (Optional[DagsterTypeLoader]) – An instance of a class that inherits from DagsterTypeLoader and can map config data to a value of this type. Specify this argument if you will need to shim values of this type using the config machinery. As a rule, you should use the @dagster_type_loader decorator to construct these arguments.
materializer (Optional[DagsterTypeMaterializer]) – An instance of a class that inherits from DagsterTypeMaterializer and can persist values of this type. As a rule, you should use the @dagster_type_materializer decorator to construct these arguments.
required_resource_keys (Optional[Set[str]]) – Resource keys required by the type_check_fn.
is_builtin (bool) – Defaults to False. This is used by tools to display or filter built-in types (such as String, Int) to visually distinguish them from user-defined types. Meant for internal use.
kind (DagsterTypeKind) – Defaults to None. This is used to determine the kind of runtime type for InputDefinition and OutputDefinition type checking.
typing_type – Defaults to None. A valid python typing type (e.g. Optional[List[int]]) for the value contained within the DagsterType. Meant for internal use.