Azure

New in version 21.04.0.

Tip

For automated Azure infrastructure setup, consider using Batch Forge in Seqera Platform.

Overview

Nextflow provides built-in support for Azure Cloud services, enabling you to:

  • Store and access data using Azure Blob Storage and Azure file shares.

  • Execute workflows using Azure Batch compute service.

Quick start

To run pipelines with Azure Batch:

  1. Create an Azure Batch account in the Azure portal.

  2. Increase the quotas in your Azure Batch account to match your pipeline’s needs.

    Note

    Quotas impact the number of pools, CPUs, and jobs you can create.

  3. Create a storage account and Azure Blob Storage in the same region as the Batch account.

  4. Configure Nextflow to submit processes to Azure Batch. For example:

    process {
        executor = 'azurebatch'
    }
    
    azure {
        storage {
            accountName = '<STORAGE_ACCOUNT_NAME>'
        }
        batch {
            location = '<LOCATION>'
            accountName = '<BATCH_ACCOUNT_NAME>'
            autoPoolMode = true
            allowPoolCreation = true
        }
    }
    

    Replace the following:

    • STORAGE_ACCOUNT_NAME: your Azure Storage account name

    • LOCATION: your Azure region

    • BATCH_ACCOUNT_NAME: your Azure Batch account name

    Note

    The above snippet excludes authentication for Azure services. See Authentication for more information.

  5. Launch your pipeline with the above configuration and add a working directory on Azure Blob Storage:

    nextflow run <PIPELINE_NAME> -w az://<BLOB_STORAGE>/
    

    Replace the following:

    • PIPELINE_NAME: your pipeline, for example, nextflow-io/rnaseq-nf

    • BLOB_STORAGE: your Azure Blob Storage from the storage account defined in your configuration

Tip

You can list Azure regions with:

az account list-locations -o table

Authentication

Nextflow supports three authentication methods for Azure services, listed in order of security and recommended usage:

  • Managed identities: The most secure option, available only within Azure. Uses Azure-managed credentials without storing secrets.

  • Service principals: A secure and flexible method that works across environments. Uses Microsoft Entra credentials for authentication.

  • Access keys: The most basic and least secure method. Relies on direct access keys for authentication.

Required roles

The following role assignments are required to use Azure services:

For Azure storage:

  • Storage Blob Data Reader

  • Storage Blob Data Contributor

For Azure Batch:

  • Batch Data Contributor

To assign roles to a managed identity or service principal, see the Azure documentation.

Managed identities

New in version 24.05.0-edge.

Managed identity is the recommended authentication method when running Nextflow within Azure. It automatically manages credentials without requiring you to store secrets.

When using a managed identity, an Azure resource can authenticate based on what it is rather than using access keys. For example, Nextflow can submit jobs to Azure Batch and access private storage accounts without additional credentials if it is running on Azure Virtual Machines with a managed identity and the necessary permissions.

There are two types of managed identities:

  • System-assigned managed identity

  • User-assigned managed identity

System-assigned managed identity

A system-assigned identity is tied to a specific Azure resource. To use it:

  1. Enable system-assigned managed identity on your Azure resource.

  2. Configure the required role assignments. See Required Roles for more information.

  3. Add the following configuration:

azure {
    managedIdentity {
        system = true
    }
    storage {
        accountName = '<STORAGE_ACCOUNT_NAME>'
    }
    batch {
        accountName = '<BATCH_ACCOUNT_NAME>'
        location = '<BATCH_ACCOUNT_LOCATION>'
    }
}

Replace the following:

  • STORAGE_ACCOUNT_NAME: your Azure storage account name

  • BATCH_ACCOUNT_NAME: your Azure Batch account name

  • BATCH_ACCOUNT_LOCATION: your Azure Batch account location

Tip

Nextflow uses the AZURE_MANAGED_IDENTITY_SYSTEM environment variable if the managed identity is not set in the Nextflow configuration file.

  • AZURE_MANAGED_IDENTITY_SYSTEM: When set to true, enables system-assigned managed identity.

User-assigned managed identity

A user-assigned identity can be shared across multiple Azure resources and its lifecycle is not tied to any specific resource. See the Azure documentation for more information.

To use user-assigned managed identity:

  1. Create a managed identity.

  2. Configure the required role assignments.

  3. Assign the identity to your Azure resource.

  4. Add the following configuration:

azure {
    managedIdentity {
        clientId = '<USER_ASSIGNED_MANAGED_IDENTITY_CLIENT_ID>'
    }
    storage {
        accountName = '<STORAGE_ACCOUNT_NAME>'
    }
    batch {
        accountName = '<BATCH_ACCOUNT_NAME>'
        location = '<BATCH_ACCOUNT_LOCATION>'
    }
}

Replace the following:

  • USER_ASSIGNED_MANAGED_IDENTITY_CLIENT_ID: your user-assigned managed identity client ID

  • STORAGE_ACCOUNT_NAME: your Azure Storage account name

  • BATCH_ACCOUNT_NAME: your Azure Batch account name

  • BATCH_ACCOUNT_LOCATION: your Azure Batch account location

Tip

Nextflow uses the following environment variable if the managed identity client ID is not provided in the Nextflow configuration file:

  • AZURE_MANAGED_IDENTITY_USER: The client ID for a user-assigned managed identity.

Service principals

New in version 22.11.0-edge.

Service principal credentials can be used to access Azure Batch and Storage accounts. Similar to a managed identity, a service principal is an application identity that can have specific permissions and role-based access. Unlike managed identities, service principals require a secret key for authentication. The secret key is less secure but allows you to authenticate with Azure resources when operating outside of the Azure environment, such as from your local machine or a non-Azure cloud. To create a service principal, follow the steps in the Register a Microsoft Entra app and create a service principal guide and add the credentials in the Nextflow configuration file. For example:

azure {
    activeDirectory {
        servicePrincipalId = '<SERVICE_PRINCIPAL_CLIENT_ID>'
        servicePrincipalSecret = '<SERVICE_PRINCIPAL_CLIENT_SECRET>'
        tenantId = '<TENANT_ID>'
    }
    storage {
        accountName = '<STORAGE_ACCOUNT_NAME>'
    }
    batch {
        accountName = '<BATCH_ACCOUNT_NAME>'
        location = '<BATCH_ACCOUNT_LOCATION>'
    }
}

Tip

Nextflow uses the following environment variables if service principal credentials are not set in the Nextflow configuration file:

  • AZURE_CLIENT_ID: your Azure service principal application ID

  • AZURE_CLIENT_SECRET: your Azure service principal secret

  • AZURE_TENANT_ID: your Microsoft Entra tenant ID

Access keys

The basic authentication method uses direct access keys. While simple to set up, it’s less secure than managed identities or service principals because it provides full access to your resources. This approach requires you to obtain and manage the access keys for your Azure Batch and Storage accounts.

You can find your access keys in the Azure portal:

  • For Storage accounts, navigate to your Storage account and select Access keys in the left-hand navigation menu.

  • For Batch accounts, navigate to your Batch account and select Keys in the left-hand navigation menu.

Keep access keys secure and rotate them regularly. Use environment variables instead of hardcoding keys in configuration files, especially in shared or version-controlled environments:

azure {
    storage {
        accountName = '<STORAGE_ACCOUNT_NAME>'
        accountKey = '<STORAGE_ACCOUNT_KEY>'
    }
    batch {
        accountName = '<BATCH_ACCOUNT_NAME>'
        accountKey = '<BATCH_ACCOUNT_KEY>'
        location = '<LOCATION>'
    }
}

You can also use a Shared Access Signature (SAS) token instead of an account key to provide time-limited access to your storage account:

azure.storage.sasToken = '<SAS_TOKEN>'

Tip

When creating a SAS token, ensure you enable Read, Write, Delete, List, Add, Create permissions for the Container and Object resource types. The value of sasToken should be stripped of the leading ? character.
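For example, a minimal sketch of generating such a token with the Azure CLI (assuming the az CLI is installed and you are logged in; the expiry date is a placeholder):

az storage account generate-sas \
    --account-name <STORAGE_ACCOUNT_NAME> \
    --services b \
    --resource-types co \
    --permissions rwdlac \
    --expiry 2026-01-01T00:00Z \
    --output tsv

The command should print the token without a leading ? character, so its output can be used directly as the sasToken value.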

Tip

Nextflow uses the following environment variables if storage settings are not provided in the Nextflow configuration file:

  • AZURE_STORAGE_ACCOUNT_NAME: your Azure Storage account name

  • AZURE_STORAGE_ACCOUNT_KEY: your Azure Storage account key

  • AZURE_BATCH_ACCOUNT_NAME: your Azure Batch account name

  • AZURE_BATCH_ACCOUNT_KEY: your Azure Batch account key

  • AZURE_STORAGE_SAS_TOKEN: your Azure Storage account SAS token
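For example, a minimal sketch of supplying access keys through the environment before launching a pipeline (all values are placeholders):

export AZURE_STORAGE_ACCOUNT_NAME='<STORAGE_ACCOUNT_NAME>'
export AZURE_STORAGE_ACCOUNT_KEY='<STORAGE_ACCOUNT_KEY>'
export AZURE_BATCH_ACCOUNT_NAME='<BATCH_ACCOUNT_NAME>'
export AZURE_BATCH_ACCOUNT_KEY='<BATCH_ACCOUNT_KEY>'
nextflow run <PIPELINE_NAME> -w az://<BLOB_STORAGE>/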

Azure Blob Storage

Azure Storage provides several services for storing and accessing data in Azure. Nextflow primarily supports Azure Blob Storage for most workloads. Limited support is available for Azure file shares for specific use cases.

Nextflow uses the az:// URI scheme to enable transparent access to files stored in Azure Blob Storage. Once Azure Storage is configured, you can use this scheme to reference files in any blob container as if they were local. For example, a file at path /path/to/file.txt in a container named my-container:

file('az://my-container/path/to/file.txt')
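Such paths can be used anywhere a local path is expected, including as workflow inputs. A minimal sketch, where my-container and the file path are placeholders for a container in your configured storage account:

params.input = 'az://my-container/path/to/file.txt'

process COUNT_LINES {
    input:
    path infile

    output:
    stdout

    script:
    """
    wc -l < ${infile}
    """
}

workflow {
    COUNT_LINES(file(params.input))
}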

The Blob Storage account name needs to be provided in the Nextflow configuration file as shown below, along with one of the authentication methods described above:

azure {
    storage {
        accountName = '<BLOB_ACCOUNT_NAME>'
    }
}

Tip

Nextflow uses environment variables if storage settings are not provided in the Nextflow configuration file:

  • AZURE_STORAGE_ACCOUNT_NAME: your Azure Storage account name

  • AZURE_STORAGE_ACCOUNT_KEY: your Azure Storage account access key

  • AZURE_STORAGE_SAS_TOKEN: a shared access signature (SAS) token for Azure Storage access

Azure Batch

Azure Batch is a managed computing service that enables large-scale parallel and high-performance computing (HPC) batch jobs in Azure. Nextflow provides a built-in executor for Azure Batch. It allows you to offload your workflows to the cloud for scalable and cost-effective execution.

Nextflow integrates seamlessly with Azure Batch to:

  • Dynamically create and manage compute pools based on workflow requirements.

  • Automatically scale nodes up and down as needed during workflow execution.

  • Support both regular and low-priority VMs for cost optimization.

  • Manage task dependencies and data transfers between Azure Storage and compute nodes.

This section describes how to configure and use Azure Batch with Nextflow for efficient cloud-based workflow execution.

For comprehensive information about Azure Batch features and capabilities, refer to the official Azure Batch documentation.

Overview

Nextflow integrates with Azure Batch by mapping its execution model to Azure Batch’s structure. A Nextflow process corresponds to an Azure Batch job, and each instance of that process (a Nextflow task) becomes an Azure Batch task. These Azure Batch tasks are executed on compute nodes within an Azure Batch pool, which is a collection of virtual machines that can scale up or down based on an autoscale formula.

Nextflow manages these pools dynamically. You can assign processes to specific, pre-existing pools using the process.queue directive; Nextflow can create the pool if it doesn’t exist and azure.batch.allowPoolCreation is set to true. Alternatively, autoPoolMode enables Nextflow to automatically create multiple pools based on the CPU and memory requirements defined in your processes.

An Azure Batch task is created for each Nextflow process instance. This task first downloads the necessary input files from Azure Blob Storage to its assigned compute node. It then runs the process script. Finally, it uploads any output files back to Azure Blob Storage.

Nextflow ensures that the corresponding Azure Batch jobs are marked as terminated when the entire Nextflow pipeline finishes. The compute pools, if configured for auto-scaling, automatically scale down according to their defined formula and help manage costs. Nextflow can also be configured to delete these node pools after pipeline completion.

To use Azure Batch, set the executor directive to azurebatch in your Nextflow configuration file and add a work directory on Azure Blob Storage:

process {
    executor = 'azurebatch'
}

azure {
    storage {
        accountName = '<STORAGE_ACCOUNT_NAME>'
    }
    batch {
        location = '<LOCATION>'
        accountName = '<BATCH_ACCOUNT_NAME>'
        autoPoolMode = true
        allowPoolCreation = true
    }
}

Finally, launch your pipeline with the above configuration:

nextflow run <PIPELINE_NAME> -w az://<CONTAINER>/

Tip

Nextflow uses the following environment variables if the Batch settings are not provided in the Nextflow config file:

  • AZURE_BATCH_ACCOUNT_NAME: The name of your Azure Batch account.

  • AZURE_BATCH_ACCOUNT_KEY: The access key for your Azure Batch account.

Quotas

Azure Batch enforces quotas on resources like pools, jobs, and VMs.

  • Pools: A maximum number of pools can exist at one time. Nextflow throws an error if it attempts to create a pool when this quota is met. To proceed, you must delete existing pools.

  • Active jobs: There is a limit on the number of active jobs. If this limit is reached, Nextflow throws an error; terminate or delete completed jobs before running additional pipelines.

  • VM cores: Each Azure VM family has a maximum core quota. If you request a VM family for which you lack sufficient core quota, Nextflow still creates the jobs and tasks, but they remain pending because the pool cannot scale. This can cause your pipeline to hang indefinitely.

You can increase your quotas in the Azure portal by opening your Batch account, selecting Quotas in the left-hand menu, and requesting an increase.

Pool management

Warning

Clean up the Batch pools or use auto scaling to avoid any extra charges in the Batch account.

Nextflow supports two approaches for managing Batch pools:

Auto pools

When autoPoolMode is enabled, Nextflow automatically creates pools of compute nodes appropriate for your pipeline.

The cpus and memory directives are used to find the smallest machine type that fits the requested resources in the Azure machine family specified by machineType. If memory is not specified, 1 GB of memory is allocated per CPU. When no options are specified, a single compute node of type Standard_D4_v3 is used.

To use autoPoolMode, enable it in the azure.batch config scope:

azure.batch {
    autoPoolMode = true
    allowPoolCreation = true
}

process {
    machineType = 'Standard_D4_v3'
    cpus = 4
    memory = '16 GB'
}

You can specify multiple machine families per process using glob patterns in the machineType directive:

process.machineType = "Standard_D*d_v5,Standard_E*d_v5"  // D or E v5 machines with data disk

For example, the following process creates a pool of Standard_E8d_v5 machines when using autoPoolMode:

process EXAMPLE_PROCESS {
    machineType "Standard_E*d_v5"
    cpus 8
    memory 8.GB

    script:
    """
    echo "cpus: ${task.cpus}"
    """
}

Note

  • For tasks requesting fewer than 4 CPUs, Nextflow creates pools with 4x the requested CPUs to enable task packing.

  • For tasks requesting fewer than 8 CPUs, Nextflow creates pools with 2x the requested CPUs.

  • Override this behavior by specifying an exact machine type (e.g., Standard_E2d_v5).

  • Use a pattern to exclude certain machines (e.g., standard_*[^p]_v* avoids ARM-based machines).

The pool is not removed when the pipeline terminates unless the configuration setting azure.batch.deletePoolsOnCompletion is enabled in your Nextflow configuration file.
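For example, to remove Nextflow-created pools when the pipeline completes:

azure.batch.deletePoolsOnCompletion = true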

Named pools

Use the queue directive to specify the ID of the Azure Batch compute pool that should run a process, controlling which compute node pool is used for each task in your pipeline.

The pool is expected to already exist in the Batch environment unless azure.batch.allowPoolCreation = true is set in the azure.batch config scope, in which case Nextflow creates the pools on demand.

The configuration details for each pool can be specified in the azure.batch.pools config scope. For example:

azure.batch {
    pools {
        small {
            vmType = 'Standard_D2_v2'
            vmCount = 5
        }
        large {
            vmType = 'Standard_D8_v3'
            vmCount = 2
        }
    }
}

process {
    withLabel: small_jobs {
        queue = 'small'
    }
    withLabel: big_jobs {
        queue = 'large'
    }
}

The above example defines the configuration for two node pools. The first provisions 5 compute nodes of type Standard_D2_v2, the second 2 nodes of type Standard_D8_v3. See the Azure configuration section for the complete list of available configuration options.

Warning

Pool names can only contain alphanumeric, hyphen, and underscore characters. Wrap hyphenated names in quotes, for example, 'pool-1'.

Requirements on pre-existing named pools

When Nextflow is configured to use a pool that already exists in the Batch account, the target pool must satisfy the following requirements:

  • The pool must be declared as dockerCompatible (Container Type property).

  • The task slots per node must match the number of cores for the selected VM. Otherwise, Nextflow returns an error like “Azure Batch pool ‘ID’ slots per node does not match the VM num cores (slots: N, cores: Y)”.

  • Unless you are using Fusion, all tasks must have AzCopy available on the PATH. If azure.batch.copyToolInstallMode = 'node', every node must have the azcopy binary located at $AZ_BATCH_NODE_SHARED_DIR/bin/ (see the sketch below this list).
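For reference, a minimal sketch of the install-mode setting named in the last point above ('node' installs AzCopy once per node when it joins the pool; 'task' installs it for each task execution):

azure.batch.copyToolInstallMode = 'node'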

Auto-scaling

Azure Batch can automatically scale pools based on the parameters you define. With automatic scaling, Batch dynamically adds nodes to a pool as task demands increase and removes compute nodes as task demands decrease.

To enable this feature for pools created by Nextflow, add autoScale = true to the corresponding pool configuration scope. For example:

azure.batch.pools.<POOL_NAME> {
    autoScale = true
    vmCount = 5        // Initial count
    maxVmCount = 50    // Maximum count
}

The default auto-scaling formula initializes the pool with the number of VMs specified by the vmCount option and dynamically scales up the pool based on the number of pending tasks, up to the limit defined by maxVmCount. When no jobs are submitted for execution, the formula automatically scales the pool down to zero nodes to minimize costs. The complete default formula is shown below:

// Get pool lifetime since creation.
lifespan = time() - time("{{poolCreationTime}}");
interval = TimeInterval_Minute * {{scaleInterval}};

// Compute the target nodes based on pending tasks.
$samples = $PendingTasks.GetSamplePercent(interval);
$tasks = $samples < 70 ? max(0, $PendingTasks.GetSample(1)) : max( $PendingTasks.GetSample(1), avg($PendingTasks.GetSample(interval)));
$targetVMs = $tasks > 0 ? $tasks : max(0, $TargetDedicatedNodes/2);
targetPoolSize = max(0, min($targetVMs, {{maxVmCount}}));

// For first interval deploy 1 node, for other intervals scale up/down as per tasks.
$TargetDedicatedNodes = lifespan < interval ? {{vmCount}} : targetPoolSize;
$NodeDeallocationOption = taskcompletion;

The azure.batch.pools.<POOL_NAME>.scaleFormula configuration option can be used to provide custom formulas.
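For example, a sketch of a custom formula adapted from the time-based autoscaling example in the Azure Batch documentation (the pool name large and the node counts are placeholders):

azure.batch.pools.large.scaleFormula = '''
    $curTime = time();
    $workHours = $curTime.hour >= 8 && $curTime.hour < 18;
    $isWeekday = $curTime.weekday >= 1 && $curTime.weekday <= 5;
    $isWorkingWeekdayHour = $workHours && $isWeekday;
    $TargetDedicatedNodes = $isWorkingWeekdayHour ? 20 : 10;
    $NodeDeallocationOption = taskcompletion;
'''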

Task authentication

By default, Nextflow creates SAS tokens for specific containers and passes them to tasks to enable file operations with Azure Storage. SAS tokens expire after 48 hours; the expiration time is configurable via the azure.storage.tokenDuration setting.
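For example, a minimal sketch of shortening the token lifetime (the value is a placeholder duration):

azure.storage.tokenDuration = '24h'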

New in version 25.05.0-edge.

You can also authenticate to Azure Storage using a managed identity when using the Fusion file system.

To do this:

  1. Create a user-assigned managed identity with the Storage Blob Data Contributor role for your storage account.

  2. Attach the user-assigned managed identity to the node pool manually.

  3. Configure Nextflow to use this identity via the azure.managedIdentity.clientId setting:

azure {
    managedIdentity {
        clientId = '<MANAGED_IDENTITY_CLIENT_ID>'
    }
}

Each task authenticates as the managed identity and downloads and uploads files to Azure Storage using these credentials. It is possible to attach more than one managed identity to a pool; Fusion uses the identity specified by azure.managedIdentity.clientId.

Task packing

Each Azure Batch node is allocated a specific number of task slots that determine how many tasks can run concurrently on that node. The number of slots is calculated based on the virtual machine’s available CPUs. Nextflow assigns task slots to each process based on the proportion of total node resources requested through the cpus, memory, and disk directives.

For example, consider a Standard_D4d_v5 machine with 4 vCPUs, 16 GB of memory, and 150 GB local disk:

  • If a process requests cpus 2, memory 8.GB, or disk 75.GB, two task slots are allocated (50% of resources), allowing two tasks to run concurrently on the node.

  • If a process requests cpus 4, memory 16.GB, or disk 150.GB, four task slots are allocated (100% of resources), allowing one task to run on the node.

Resource overprovisioning can occur if tasks consume more than their allocated share of resources. For instance, the node described above may become overloaded and fail if a process that requests cpus 2 uses more than 8 GB of memory or 75 GB of disk space. It’s important to specify resource requirements accurately to ensure optimal performance and prevent task failures.
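For example, a minimal sketch of a process configuration sized to exactly half of a Standard_D4d_v5 node, so that two tasks can safely run concurrently (the ALIGN process name is hypothetical):

process {
    withName: ALIGN {
        machineType = 'Standard_D4d_v5'
        cpus = 2
        memory = 8.GB
        disk = 75.GB
    }
}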

Warning

Tasks may fail if they exceed their allocated resources.

This is particularly important for storage. Azure virtual machines come with fixed storage disks that are not expandable. Tasks will fail if the tasks running on a node collectively request more storage than the machine has available.

Container support

Azure Batch requires the use of containers for every process to ensure portability and reproducibility of your workflows.

Nextflow supports both public and private container images. Public images can be pulled from repositories such as https://community.wave.seqera.io/ without authentication. Private images require authentication: when using a private registry such as Azure Container Registry (ACR), ensure the Batch pool has access by providing the registry credentials in the azure.registry configuration scope:

azure.registry {
    server = '<REGISTRY_SERVER>'     // e.g., 'myregistry.azurecr.io'
    userName = '<REGISTRY_USER>'
    password = '<REGISTRY_PASSWORD>'
}

Tip

Nextflow uses the following environment variables if the registry credentials are not provided in the Nextflow configuration file:

  • AZURE_REGISTRY_USER_NAME: the username for Azure Container Registry authentication

  • AZURE_REGISTRY_PASSWORD: the password for Azure Container Registry authentication

Public container images from Docker Hub or other public registries can be used without additional configuration, even when a private registry is configured. The Docker service on the nodes automatically uses the appropriate registry based on the container image reference in your process container directive.

Virtual networks

Pools can be configured to use virtual networks to connect to your existing network infrastructure.

azure.batch.pools.<POOL_NAME>.virtualNetwork = '<SUBNET_ID>'

The subnet ID must be in the following format:

/subscriptions/<SUBSCRIPTION>/resourceGroups/<GROUP>/providers/Microsoft.Network/virtualNetworks/<VNET>/subnets/<SUBNET>
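For example, a sketch with hypothetical subscription, resource group, and network names:

azure.batch.pools.large.virtualNetwork = '/subscriptions/01234567-89ab-cdef-0123-456789abcdef/resourceGroups/my-rg/providers/Microsoft.Network/virtualNetworks/my-vnet/subnets/my-subnet'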

Warning

Virtual Networks require Microsoft Entra authentication (service principal or managed identity).

Nextflow can submit tasks to existing pools that are already attached to a virtual network without requiring additional configuration.

Start tasks

Start tasks are optional commands that can be run when nodes join the pool. They are useful for setting up the node environment or for running other commands that need to be executed when the node is created.

azure.batch.pools.<POOL_NAME>.startTask {
    script = 'echo "Hello, world!"'
    privileged = true  // optional, default: false
}
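For example, a sketch of a start task that installs a system package on every node as it joins the pool (the pool name and package are hypothetical; privileged mode is assumed to be required for package installation on the default Ubuntu nodes):

azure.batch.pools.large.startTask {
    script = 'apt-get update && apt-get install -y jq'
    privileged = true
}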

Operating systems

When Nextflow creates a pool of compute nodes, it selects:

  • The virtual machine image reference: defines the OS and software to be installed on the node.

  • The Batch node agent SKU (a program that runs on each node and provides the interface between the node and the Azure Batch service).

Together, these settings determine the operating system, version, and available features for each compute node in your pool.

By default, Nextflow creates pool nodes based on Ubuntu 22.04, which is suitable for most bioinformatics workloads. However, you can customize the operating system in your pool configuration to meet specific requirements.

Below are example configurations for common operating systems used in scientific computing:

Ubuntu 22.04 (default)

azure.batch.pools.<POOL_NAME> {
    sku = "batch.node.ubuntu 22.04"
    offer = "ubuntu-hpc"
    publisher = "microsoft-dsvm"
}

CentOS 8

azure.batch.pools.<POOL_NAME> {
    sku = "batch.node.centos 8"
    offer = "centos-container"
    publisher = "microsoft-azure-batch"
}

Advanced features

Azure file shares

New in version nf-azure 0.11.0.

Azure file shares provide fully managed file shares that can be mounted to compute nodes. Files become immediately available in the file system and can be accessed as local files within processes.

The Azure File share must exist in the storage account configured for Blob Storage. You must provide the name of the source Azure file share and mount path (the destination path where the files are mounted). Additional mount options (see the Azure Files documentation) can also be set for further customization of the mounting process.

Configuration:

azure {
    storage {
        fileShares {
            references {  // any name of your choice
                mountPath = '/mnt/references'
                mountOptions = '-o vers=3.0,dir_mode=0777,file_mode=0777,sec=ntlmssp'  // optional
            }
        }
    }
}
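Files in the mounted share can then be read like local files inside any process. A minimal sketch, assuming the references share above contains a hypothetical genome.fa:

process COUNT_SEQUENCES {
    output:
    stdout

    script:
    """
    grep -c '>' /mnt/references/genome.fa
    """
}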

Warning

File shares require storage account key authentication. Managed identity and service principal are not supported.

Hybrid workloads

Nextflow allows multiple executors in the same workflow application. This feature enables the deployment of hybrid workloads in which some jobs are executed on the local computer or a local computing cluster, and some jobs are offloaded to Azure Batch.

Use one or more process selectors in your Nextflow configuration to apply the Azure Batch configuration to the subset of processes that you want to offload. For example:

process {
    withLabel: bigTask {
        executor = 'azurebatch'
        queue = 'my-batch-pool'
        container = 'my/image:tag'
    }
}

azure {
    storage {
        accountName = '<STORAGE_ACCOUNT_NAME>'
        accountKey = '<STORAGE_ACCOUNT_KEY>'
    }
    batch {
        location = '<LOCATION>'
        accountName = '<BATCH_ACCOUNT_NAME>'
        accountKey = '<BATCH_ACCOUNT_KEY>'
    }
}

With the above configuration, processes with the bigTask label run on Azure Batch, while the remaining processes run locally.

Launch the pipeline with the -bucket-dir option to specify an Azure Blob Storage path for the Azure Batch compute jobs. Optionally, use the -work-dir option to specify the local storage for the jobs computed locally:

nextflow run <SCRIPT> -bucket-dir az://my-container/some/path

Warning

  • The Blob Storage path must include a subdirectory, for example, az://container/work, not az://container.

  • Nextflow handles file transfers between local and cloud storage.

  • With Fusion, -bucket-dir is optional.

Advanced configuration

See the Azure configuration section for all available configuration options.