Serverless deployment in Dagster+#

This guide is applicable to Dagster+.

Dagster+ Serverless is a fully managed version of Dagster+, and is the easiest way to get started with Dagster. With Serverless, you can run your Dagster jobs without spinning up any infrastructure.


When to choose Serverless#

Serverless works best with workloads that primarily orchestrate other services or perform light computation. Most workloads fit into this category, especially those that orchestrate third-party SaaS products like cloud data warehouses and ETL tools.

If any of the following apply, choose a Hybrid deployment instead:

  • You require substantial computational resources. For example, training a large machine learning (ML) model in-process.
  • Your dataset is too large to fit in memory. For example, processing a terabyte of training data in-process.
  • You need to distribute computation across many nodes for a single run. Serverless runs currently execute on a single node with 4 vCPU cores.
  • You don't want to add Dagster Labs as a data processor.

Limitations#

Serverless is subject to the following limitations:

  • Maximum of 100 GB of bandwidth per day
  • Maximum of 4500 step-minutes per day
  • Runs receive 4 vCPU cores, 16 GB of RAM and 128 GB of ephemeral disk
  • Code locations receive 0.25 vCPU cores and 1 GB of RAM
  • All Serverless jobs run in the United States
  • Infrastructure cannot be customized or extended, such as using additional containers

Pro customers may request a quota increase by contacting Sales.
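
To judge whether a workload fits within the daily quotas listed above, a back-of-the-envelope estimate can help. The following is a minimal sketch, not an official calculator: the `DailyEstimate` type and both functions are hypothetical names, and it assumes that one step-minute corresponds to one step executing for one minute.

```python
from dataclasses import dataclass

# Daily quotas taken from the limitations list above.
MAX_STEP_MINUTES_PER_DAY = 4500
MAX_BANDWIDTH_GB_PER_DAY = 100


@dataclass
class DailyEstimate:
    """Hypothetical description of one day's workload (not a Dagster+ type)."""

    runs_per_day: int
    steps_per_run: int
    avg_step_minutes: float
    gb_transferred_per_run: float


def quota_usage(estimate: DailyEstimate) -> dict:
    """Return the estimated usage as a fraction of each daily quota."""
    # Assumption: a step-minute is one step executing for one minute.
    step_minutes = (
        estimate.runs_per_day * estimate.steps_per_run * estimate.avg_step_minutes
    )
    bandwidth_gb = estimate.runs_per_day * estimate.gb_transferred_per_run
    return {
        "step_minutes": step_minutes / MAX_STEP_MINUTES_PER_DAY,
        "bandwidth": bandwidth_gb / MAX_BANDWIDTH_GB_PER_DAY,
    }


def fits_serverless_quotas(estimate: DailyEstimate) -> bool:
    """True when every estimated usage stays at or below its quota."""
    return all(fraction <= 1.0 for fraction in quota_usage(estimate).values())
```

For example, an hourly job with 10 steps of about 2 minutes each uses 480 step-minutes per day under these assumptions, roughly a tenth of the quota.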


Getting started with Serverless#

With GitHub#

If you are a GitHub user, our GitHub integration is the fastest way to get started. With a single click, it uses a GitHub app and GitHub Actions to set up a repo containing skeleton code and configuration consistent with Dagster+ best practices.

When you create a new Dagster+ organization, you'll be prompted to choose Serverless or Hybrid deployment. Once activated, our GitHub integration will scaffold a new git repo for you with Serverless and Branch Deployments already configured. Pushing to the main branch will deploy to your prod Serverless deployment. Pull requests will spin up ephemeral branch deployments using the Serverless agent.

If you are importing a Dagster project that's in an existing GitHub repo:
  • The repo must grant workflows Read and write permissions. This setting can be found in the repo under Settings > Actions > General > Workflow permissions. In GitHub Enterprise, these permissions are controlled at the organization level.

  • The automatic setup must be able to merge an initial commit directly to the repo's main branch in order to add the GitHub Actions workflow files. If branch protection rules require changes to go through a pull request, the automatic setup cannot complete.

    • You can temporarily disable the branch protection rules and then re-enable them after the automatic setup completes.

With GitLab#

If you are a GitLab user, our GitLab integration is the fastest way to get started. With a single click, it uses a GitLab app to set up a repo containing skeleton code and CI/CD configuration consistent with Dagster+ best practices.

When you create a new Dagster+ organization, you'll be prompted to choose Serverless or Hybrid deployment. Once activated, our GitLab integration will scaffold a new git repo for you with Serverless and Branch Deployments already configured. Pushing to the main branch will deploy to your prod Serverless deployment. Merge requests will spin up ephemeral branch deployments using the Serverless agent.

Other (Bitbucket or local development)#

If you don't want to use our GitHub or GitLab integrations, we offer a CLI that you can use in another CI environment or on your local machine.

First, create a new project with the Dagster open-source CLI.

The example below uses our quickstart_etl example project. For more information about the examples, visit the Dagster GitHub repository.

pip install dagster
dagster project from-example \
  --name my-dagster-project \
  --example quickstart_etl
If using a different project, ensure that dagster-cloud is included as a dependency in your setup.py or requirements.txt file.

For example, in my-dagster-project/setup.py:

install_requires=[
    "dagster",
    "dagster-cloud",    # add this line
    ...
],

Next, install the dagster-cloud CLI and log in to your org. Note: The CLI requires a recent version of Python 3 and Docker.

pip install dagster-cloud
dagster-cloud configure

You can also configure the dagster-cloud tool noninteractively; see the CLI docs for more information.

Finally, deploy your project with Dagster+ Serverless:

dagster-cloud serverless deploy-python-executable ./my-dagster-project \
  --location-name example \
  --package-name quickstart_etl \
  --python-version 3.12

Note: Windows users should use the deploy command instead of deploy-python-executable.

Adding secrets#

Often you'll need to securely access secrets from your jobs. Dagster+ supports several methods for adding secrets - refer to the Dagster+ environment variables and secrets documentation for more info.

Adding dependencies#

Any dependencies specified in either requirements.txt or setup.py will be installed for you automatically by the Dagster+ Serverless infrastructure.


Customizing the runtime environment#

Dagster+ Serverless packages your code as PEX files and deploys them on Docker images. Using PEX files significantly reduces the time to deploy, since it does not require building a new Docker image and provisioning a new container for every code change. Many apps will work fine with the default Dagster+ Serverless setup. However, some apps may need changes to the runtime environment: to include data files, use a different base image, use a different Python version, or install native dependencies. You can customize the runtime environment using the methods described below.

Including data files#

To add data files to your deployment, use the data files support built into setuptools. This requires adding a package_data or include_package_data keyword to the setup() call in setup.py. For example, given this directory structure:

- setup.py
- my_dagster_project/
  - __init__.py
  - repository.py
  - data/
    - file1.txt
    - file2.csv

If you want to include the data folder, modify your setup.py to add the package_data line:

# setup.py
from setuptools import find_packages, setup

if __name__ == "__main__":
    setup(
        name="my_dagster_project",
        packages=find_packages(exclude=["my_dagster_project_tests"]),
        # Add the following line. Here "data/*" is relative to the my_dagster_project subdirectory.
        package_data={"my_dagster_project": ["data/*"]},
        install_requires=[
            "dagster",
            ...
        ],
    )
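
Once the data files are packaged, your code still has to locate them at runtime relative to the installed package rather than the repo checkout. The following is a minimal sketch of one way to do that with the standard-library `importlib.resources` API (Python 3.9+). The helper name `read_packaged_text` is hypothetical, and the source does not state that this is the recommended approach in Serverless.

```python
from importlib import resources


def read_packaged_text(package: str, *path_parts: str) -> str:
    """Read a text file shipped inside an installed package.

    `package` is the importable package name, and `path_parts` are the
    path segments relative to that package's directory.
    """
    resource = resources.files(package)
    for part in path_parts:
        # Walk one path segment at a time so this works for any Traversable.
        resource = resource.joinpath(part)
    return resource.read_text(encoding="utf-8")


# Example call for the layout shown above (requires the package to be installed):
# contents = read_packaged_text("my_dagster_project", "data", "file1.txt")
```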

Using a different Python version#

Python versions 3.9 through 3.12 are all supported for Serverless deployments. You can specify the version you want by updating your GitHub workflow or using the --python-version command line argument:

  • With GitHub: Change the python_version parameter for the build_deploy_python_executable job in your .github/workflows files. For example:

    - name: Build and deploy Python executable
      if: env.ENABLE_FAST_DEPLOYS == 'true'
      uses: dagster-io/dagster-cloud-action/actions/build_deploy_python_executable@pex-v0.1
      with:
        dagster_cloud_file: "$GITHUB_WORKSPACE/project-repo/dagster_cloud.yaml"
        build_output_dir: "$GITHUB_WORKSPACE/build"
        python_version: "3.10" # Change this value to the desired Python version
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    
  • With the CLI: Add the --python-version CLI argument to the deploy command to specify the desired Python version:

    dagster-cloud serverless deploy-python-executable --location-name=my_location --python-version=3.10
    

Using a different base image or using native dependencies#

Dagster+ runs your code on a Docker image that we build as follows:

  1. The standard Python "slim" Docker image, such as python:3.10-slim, is used as the base.
  2. The dagster-cloud[serverless] module is installed in the image.

As far as possible, add all dependencies by including the corresponding native Python bindings in your setup.py. When that is not possible, you can build and upload a custom base image that will be used to run your Python code.

To build and upload the image, use the command line:

  1. Build your Docker image using docker build or your usual Docker toolchain. Ensure the dagster-cloud[serverless] dependency is included. You can do this by adding the following to your Dockerfile:

    RUN pip install "dagster-cloud[serverless]"
    
  2. Upload your Docker image to Dagster+ using the upload-base-image command. Note that this command prints out the tag used in Dagster+ to identify your image:

    $ dagster-cloud serverless upload-base-image local-image:tag
    
    ...
    To use the uploaded image run: dagster-cloud serverless deploy-python-executable ... --base-image-tag=sha256_518ad2f92b078c63c60e89f0310f13f19d3a1c7ea9e1976d67d59fcb7040d0d6
    
  3. To use a Docker image you have uploaded to Dagster+, supply the tag printed by the above command:

    • With GitHub: Set the SERVERLESS_BASE_IMAGE_TAG environment variable in your GitHub Actions configuration (usually at .github/workflows/dagster-cloud-deploy.yml):

      env:
        DAGSTER_CLOUD_URL: ...
        DAGSTER_CLOUD_API_TOKEN: ...
        SERVERLESS_BASE_IMAGE_TAG: "sha256_518ad2f92b078c63c60e89f0310f13f19d3a1c7ea9e1976d67d59fcb7040d0d6"
      
    • With the CLI: Add the --base-image-tag CLI argument to the deploy command:

      dagster-cloud serverless deploy-python-executable \
        --location-name example \
        --package-name assets_modern_data_stack \
        --base-image-tag sha256_518ad2f92b078c63c60e89f0310f13f19d3a1c7ea9e1976d67d59fcb7040d0d6
      

Disabling PEX-based deploys#

Prior to using PEX files, Dagster+ deployed code using Docker images. This feature is still available. To deploy using a Docker image instead of PEX:

  • With GitHub: Delete the ENABLE_FAST_DEPLOYS: 'true' line in your GitHub Actions configuration (usually at .github/workflows/dagster-cloud-deploy.yml):

    env:
      DAGSTER_CLOUD_URL: ...
      DAGSTER_CLOUD_API_TOKEN: ...
      # ENABLE_FAST_DEPLOYS: 'true' # disabled
    
  • With the CLI: Use the deploy command instead of the deploy-python-executable command:

    dagster-cloud serverless deploy \
      --location-name example \
      --package-name assets_modern_data_stack
    

The deployed Docker image can be customized using either lifecycle hooks or a custom base image.

Lifecycle hooks are the easiest method to set up and do not require any additional infrastructure.

In the root of your repo, you can provide two optional shell scripts: dagster_cloud_pre_install.sh and dagster_cloud_post_install.sh. These will run before and after Python dependencies are installed. They are useful for installing any non-Python dependencies or otherwise configuring your environment.


Transitioning to Hybrid#

If your organization begins to hit the limitations of Serverless, you should transition to a Hybrid deployment. Hybrid deployments allow you to run an agent in your own infrastructure and give you substantially more flexibility and control over the Dagster environment.

To switch to Hybrid, navigate to Status > Agents in your Dagster+ account. On this page, an organization administrator can disable the Serverless agent and view instructions for enabling Hybrid.

After changing the deployment type, you will need to update your code locations' images and configuration to be compatible with the type of Hybrid agent that you chose. Complete the following steps to finalize the transition:


Security and data protection#

Unlike Hybrid, Serverless Deployments on Dagster+ require direct access to your data, secrets and source code.

  • Secrets and source code are built into the image directly. Images are stored in a per-customer container registry with restricted access.
  • User code is securely sandboxed using modern container sandboxing techniques.
  • All production access is governed by industry-standard best practices which are regularly audited.

I/O management on Serverless#

The default I/O manager cannot be used if you are a Serverless user who:
  • Works with personally identifiable information (PII)
  • Works with protected health information (PHI)
  • Has signed a business associate agreement (BAA), or
  • Is otherwise working with data subject to GDPR or similar regulations

In Serverless, code that uses the default I/O manager is automatically adjusted to save data in Dagster+ managed storage. This automatic change is useful because the default file system in Serverless is ephemeral, which means the default I/O manager wouldn't work as expected. However, this automatic change means potentially sensitive data is being stored, not just processed or orchestrated, by Dagster+.

To avoid this behavior, you can:

  • Use an I/O manager that stores data in your infrastructure
  • Write code that doesn't use an I/O manager

Whitelisting / Allowlisting Dagster's IP addresses#

Serverless code will make requests from one of the following IP addresses. You may need to whitelist / allowlist them for services your code interacts with.

34.216.9.66
35.162.181.243
35.83.14.215
44.230.239.14
44.240.64.133
52.34.41.163
52.36.97.173
52.37.188.218
52.38.102.213
52.39.253.102
52.40.171.60
52.89.191.177
54.201.195.80
54.68.25.27
54.71.18.84

Note: Additional IP addresses may be added over time. This list was last updated on October 24, 2024.
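
If the service you are protecting lets you enforce the allowlist in application code, the check can be sketched as follows. This is an illustrative example only: the function name `is_dagster_serverless_ip` is hypothetical, the addresses are copied from the list above, and because more addresses may be added over time, a hard-coded set like this needs to be kept in sync with the published list.

```python
import ipaddress

# Addresses copied from the list above (last updated October 24, 2024).
SERVERLESS_EGRESS_IPS = frozenset(
    ipaddress.ip_address(ip)
    for ip in (
        "34.216.9.66",
        "35.162.181.243",
        "35.83.14.215",
        "44.230.239.14",
        "44.240.64.133",
        "52.34.41.163",
        "52.36.97.173",
        "52.37.188.218",
        "52.38.102.213",
        "52.39.253.102",
        "52.40.171.60",
        "52.89.191.177",
        "54.201.195.80",
        "54.68.25.27",
        "54.71.18.84",
    )
)


def is_dagster_serverless_ip(remote_addr: str) -> bool:
    """Return True if remote_addr is one of the listed Serverless addresses."""
    try:
        return ipaddress.ip_address(remote_addr) in SERVERLESS_EGRESS_IPS
    except ValueError:
        # Not a valid IP address string, so it cannot be on the allowlist.
        return False
```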


Run isolation#

Dagster+ Serverless offers two settings for run isolation: isolated and non-isolated. Non-isolated runs are for iterating quickly and trade isolation for speed. Isolated runs are for production and for compute-heavy assets and jobs.

Isolated runs (default)#

Isolated runs each take place in their own container with their own compute resources: 4 vCPU cores and 16 GB of RAM.

These runs may take up to 3 minutes to start while these resources are provisioned.

When launching runs manually, select Isolate run environment in the Launchpad to launch an isolated run. Scheduled, sensor, and backfill runs are always isolated.

The isolated run toggle in Dagster+

Note: If non-isolated runs aren't enabled (see the section below), the toggle won't appear and all runs will be isolated.

Non-isolated runs#

Non-isolated runs can be enabled or disabled in deployment settings with:

non_isolated_runs:
  enabled: True

Non-isolated runs provide a faster start time by using a standing, shared container for each code location.

They have fewer compute resources: 0.25 vCPU cores and 1 GB of RAM. These resources are shared with other processes for a code location, such as sensors. As a result, it's recommended to use isolated runs for compute-intensive jobs and asset materializations.

To launch a non-isolated run from the Launchpad, uncheck Isolate run environment. When materializing an asset, shift-click Materialize all and uncheck Isolate run environment in the modal.

By default, only one non-isolated run will execute at a time. While a run is in progress, the Launchpad will switch to launching only isolated runs.

This limit can be configured in deployment settings. Use caution when raising it: the limit is in place to help avoid crashes caused by out-of-memory (OOM) errors.

non_isolated_runs:
  enabled: True
  max_concurrent_non_isolated_runs: 1
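
The resource figures in this section suggest a rough rule of thumb for choosing a run mode. The sketch below is only an illustration: the function name `recommend_run_mode` and the `headroom` factor are assumptions made for this example, with the headroom reflecting that non-isolated resources are shared with other code location processes.

```python
# Resource figures quoted in this section.
ISOLATED_VCPU, ISOLATED_RAM_GB = 4.0, 16.0
NON_ISOLATED_VCPU, NON_ISOLATED_RAM_GB = 0.25, 1.0


def recommend_run_mode(est_vcpu: float, est_ram_gb: float, headroom: float = 0.5) -> str:
    """Suggest a run mode from a rough peak-resource estimate.

    `headroom` (an assumed value, not a Dagster+ setting) is the fraction of
    the shared non-isolated container a run is allowed to claim.
    """
    if est_vcpu > ISOLATED_VCPU or est_ram_gb > ISOLATED_RAM_GB:
        # Exceeds what a single Serverless run receives.
        return "hybrid"
    fits_shared_container = (
        est_vcpu <= NON_ISOLATED_VCPU * headroom
        and est_ram_gb <= NON_ISOLATED_RAM_GB * headroom
    )
    return "non-isolated" if fits_shared_container else "isolated"
```

Under these assumptions, a quick 0.1 vCPU, 256 MB test run would be suggested as non-isolated, while anything heavier falls back to an isolated run.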