Structuring your Dagster project#

Got questions about our recommendations or something to add? Join our GitHub discussion to share how you organize your Dagster code.

Dagster aims to enable teams to ship data pipelines with extraordinary velocity. In this guide, we'll walk through how we recommend structuring larger Dagster projects to help achieve that goal.

At a high level, here are the aspects we'd like to optimize when structuring a complex project:

  • You can quickly get stuff done (e.g., write a new job, fix a breakage, or retire existing data pipelines) without thinking much about where you need to make the change or how it may break something.
  • You can quickly find the relevant code regardless of your familiarity with the related business logic.
  • You can reorganize at your own pace when you feel things have grown too big, without over-optimizing too early.

As your experience with Dagster grows, certain aspects of this guide might no longer apply to your use cases, and you may want to change the structure to adapt to your business needs.


Example file tree#

This guide uses the fully featured project example to walk through our recommendations. This example is a large project that simulates real-world use cases and showcases a wide range of Dagster features. You can read more about this project and how it applies Dagster concept best practices in the example project walkthrough guide.

Below is the complete file tree of the example project.

project_fully_featured
├── Makefile
├── README.md
├── dbt_project
├── project_fully_featured
│   ├── __init__.py
│   ├── assets
│   │   ├── __init__.py
│   │   ├── activity_analytics
│   │   │   ├── __init__.py
│   │   │   └── activity_forecast.py
│   │   ├── core
│   │   │   ├── __init__.py
│   │   │   ├── id_range_for_time.py
│   │   │   └── items.py
│   │   └── recommender
│   │       ├── __init__.py
│   │       ├── comment_stories.py
│   │       ├── recommender_model.py
│   │       ├── user_story_matrix.py
│   │       └── user_top_recommended_stories.py
│   ├── definitions.py
│   ├── jobs.py
│   ├── partitions.py
│   ├── resources
│   │   ├── __init__.py
│   │   ├── common_bucket_s3_pickle_io_manager.py
│   │   ├── duckdb_parquet_io_manager.py
│   │   ├── hn_resource.py
│   │   ├── parquet_io_manager.py
│   │   ├── partition_bounds.py
│   │   └── snowflake_io_manager.py
│   ├── sensors
│   │   ├── __init__.py
│   │   ├── hn_tables_updated_sensor.py
│   │   └── slack_on_failure_sensor.py
│   └── utils
├── project_fully_featured_tests
├── pyproject.toml
├── setup.cfg
├── setup.py
└── tox.ini

Setting up your project#

This project was scaffolded with the dagster project CLI. This tool generates the files and folder structure you need to get started quickly, with the Python packaging already set up.
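For example, the following command scaffolds a new project (the project name here is hypothetical):

dagster project scaffold --name my-dagster-project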

Refer to the Dagster project files reference for more info about the default files in a Dagster project. This reference also includes details about additional configuration files, like dagster.yaml and workspace.yaml.


For assets#

Keep all assets together in an assets/ directory. As your business logic and complexity grow, grouping assets by business domain in multiple directories inside assets/ helps to organize them further.

In this example, we keep all assets together in the project_fully_featured/assets/ directory. This is useful because you can use load_assets_from_package_module or load_assets_from_modules to load assets into your definition, rather than adding each asset to the definition every time you define one. It also helps collaboration: your teammates can quickly navigate to the right place to find the core business logic (i.e., assets) regardless of their familiarity with the codebase. A sketch of this loading pattern follows the file tree below.

├── project_fully_featured
    ...
│   ├── assets
│   │   ├── __init__.py
│   │   ├── activity_analytics
│   │   │   ├── __init__.py
│   │   │   └── activity_forecast.py
│   │   ├── core
│   │   │   ├── __init__.py
│   │   │   ├── id_range_for_time.py
│   │   │   └── items.py
│   │   └── recommender
│   │       ├── __init__.py
│   │       ├── comment_stories.py
│   │       ├── recommender_model.py
│   │       ├── user_story_matrix.py
│   │       └── user_top_recommended_stories.py
        ...
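For example, a minimal sketch of a definitions.py that loads every asset under the assets package in a single call:

# definitions.py — a minimal sketch of loading all assets from the assets package
from dagster import Definitions, load_assets_from_package_module

from . import assets

defs = Definitions(assets=load_assets_from_package_module(assets))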

For schedules and sensors#

In this example, we put sensors and schedules together in the sensors folder. When we build sensors, they are considered policies for when to trigger a particular job. Keeping all the policies together helps us understand what's available when creating jobs.
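As a sketch, a sensor defined as a triggering policy for a job might look like the following (the job, the op, and the update check here are simplified placeholders):

from dagster import RunRequest, SkipReason, job, op, sensor

@op
def refresh_recommendations():
    ...

@job
def story_recommender_job():
    refresh_recommendations()

@sensor(job=story_recommender_job)
def story_recommender_sensor(context):
    # A real sensor would check an external condition, e.g. whether the
    # upstream tables were updated since the last evaluation
    upstream_updated = True
    if upstream_updated:
        yield RunRequest(run_key=None)
    else:
        yield SkipReason("upstream tables not yet updated")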

Note: Certain sensors, like run status sensors, can listen to multiple jobs and do not trigger a job. We recommend keeping these sensors in the top-level definition, as they are often used for alerting and monitoring at the code location level.


For resources#

Make resources reusable and share them across jobs or asset groups.

In this example, we grouped resources (e.g., database connections, Spark sessions, API clients, and I/O managers) in the resources folder, where they are bound to configuration sets that vary based on the environment.

In complex projects, we find it helpful to make resources reusable and to bind them to pre-defined values via configured. This allows your teammates to use a pre-defined resource set or make changes to shared resources, enabling more efficient project development.
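As a sketch, a shared resource can be bound to pre-defined values with configured (the resource and bucket names here are hypothetical):

from dagster import resource

@resource(config_schema={"bucket": str})
def s3_pickle_storage(init_context):
    # A real resource would return a client bound to the configured bucket
    return init_context.resource_config["bucket"]

# Pre-configured variants that teammates can reuse across jobs and asset groups
dev_bucket_storage = s3_pickle_storage.configured({"bucket": "my-dev-bucket"})
prod_bucket_storage = s3_pickle_storage.configured({"bucket": "my-prod-bucket"})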

This pattern also helps you easily execute jobs in different environments without code changes. In this example, we dynamically defined a code location based on the deployment in definitions.py and can keep all code the same across testing, local development, staging, and production. Read more about our recommendations in the Transitioning data pipelines from Development to Production guide.
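For example, a sketch of a definitions.py that switches resource sets by deployment, assuming RESOURCES_PROD, RESOURCES_STAGING, and RESOURCES_LOCAL dicts defined in the resources package:

import os

from dagster import Definitions, load_assets_from_package_module

from . import assets
from .resources import RESOURCES_LOCAL, RESOURCES_PROD, RESOURCES_STAGING

resources_by_deployment_name = {
    "prod": RESOURCES_PROD,
    "staging": RESOURCES_STAGING,
    "local": RESOURCES_LOCAL,
}

deployment_name = os.environ.get("DAGSTER_DEPLOYMENT", "local")

defs = Definitions(
    assets=load_assets_from_package_module(assets),
    resources=resources_by_deployment_name[deployment_name],
)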


For jobs#

When using asset-based data pipelines, we recommend having a jobs.py file that imports the assets, partitions, sensors, etc. to build each job.
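For example, a sketch of jobs.py that builds a partitioned job from an asset group, assuming an hourly_partitions definition lives in partitions.py:

from dagster import AssetSelection, define_asset_job

from .partitions import hourly_partitions

activity_analytics_job = define_asset_job(
    name="activity_analytics_job",
    selection=AssetSelection.groups("activity_analytics"),
    partitions_def=hourly_partitions,
)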

When using ops and graphs#

This project does not include ops or graphs; if it did, here is how we would recommend structuring it.

We recommend having a jobs folder rather than a jobs.py file in this situation. Depending on the types of jobs you have, you can create a separate file for each type of job.

We recommend defining the ops and graphs a job uses in the same file as the job definition.

├── project_with_ops
    ...
│   ├── jobs
│   │   ├── jobs_using_assets.py
│   │   ├── jobs_using_ops_assets.py
│   │   ├── jobs_using_ops.py
│   │   ├── jobs_using_ops_graphs.py
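For instance, a minimal sketch of jobs/jobs_using_ops.py that keeps the ops next to the job they compose (all names here are hypothetical):

from dagster import job, op

@op
def extract():
    return [1, 2, 3]

@op
def load(rows):
    # A real op would write to a warehouse; here we just report the row count
    print(f"loaded {len(rows)} rows")

@job
def ingest_job():
    load(extract())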

For multiple code locations#

So far, we've discussed our recommendations for structuring a large project with a single code location. Dagster also allows you to structure a project with multiple code locations, each with its own definitions. We don't recommend over-abstracting too early; in most cases, one code location should be sufficient. A helpful pattern uses multiple code locations to separate conflicting dependencies, where each definition has its own package requirements (e.g., setup.py) and deployment specs (e.g., Dockerfile).

To include multiple code locations in a single project, you'll need to add a configuration file to your project: a workspace.yaml file for Dagster Open Source, or a dagster_cloud.yaml file for Dagster Cloud.
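For Dagster Open Source, a minimal workspace.yaml sketch pointing at two hypothetical packages might look like this:

load_from:
  - python_package: analytics_code_location
  - python_package: recommender_code_location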

You can see a working example of a Dagster project that has multiple code locations in our cloud-examples/multi-location-project repo.


For tests#

We recommend setting up a separate test folder structure that mirrors the main project (e.g., having a folder for test assets with any applicable subfolders), which contains the unit tests for each of the components of the data pipeline.

Each Dagster component, such as assets, sensors, and resources, can be tested separately. Refer to the Testing in Dagster documentation for more info.
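As a sketch, an asset defined as a plain function can be unit tested by invoking it directly (the asset here is hypothetical):

from dagster import asset

@asset
def doubled_numbers():
    return [n * 2 for n in (1, 2, 3)]

def test_doubled_numbers():
    # Assets that take no context or resources can be called like ordinary
    # functions in unit tests
    assert doubled_numbers() == [2, 4, 6]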


For other projects outside Dagster#

As your data platform evolves, Dagster will enable you to orchestrate other data tools, such as dbt projects or Jupyter notebooks.

To learn more about Dagster's integrations, refer to the integrations page for guidance and integration libraries.

project_fully_featured
├── dbt_project
│   ├── README.md
│   ├── analysis
│   ├── config
│   │   └── profiles.yml
│   ├── data
│   │   └── full_sample.csv
│   ├── dbt_project.yml
│   ├── macros
│   │   ├── aggregate_actions.sql
│   │   └── generate_schema_name.sql
│   ├── models
│   │   ├── activity_analytics
│   │   │   ├── activity_daily_stats.sql
│   │   │   ├── comment_daily_stats.sql
│   │   │   └── story_daily_stats.sql
│   │   ├── schema.yml
│   │   └── sources.yml
│   ├── snapshots
│   ├── target
│   │   └── manifest.json
│   └── tests
│       └── assert_true.sql
├── project_fully_featured
│ ...
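Given the dbt_project layout above, a sketch of loading its models as Dagster assets with the dagster-dbt integration might look like this (the paths assume the tree shown):

from dagster_dbt import load_assets_from_dbt_project

dbt_assets = load_assets_from_dbt_project(
    project_dir="dbt_project",
    profiles_dir="dbt_project/config",
)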