Azure
New in version 21.04.0.
Tip
For automated Azure infrastructure setup, consider using Batch Forge in Seqera Platform.
Overview
Nextflow provides built-in support for Azure cloud services, allowing you to:
Store and access data using Azure Blob Storage and Azure File Shares.
Execute workflows using Azure Batch compute service.
Quick start
Create an Azure Batch account in the Azure portal.
Increase the quotas in your Azure Batch account to match your pipeline's needs. Quotas impact the number of Pools, CPUs, and Jobs you can create.
Create a Storage account and Blob Container in the same region as the Batch account.
Configure Nextflow to submit processes to Azure Batch using configuration, for example:
process {
    executor = 'azurebatch'
}

azure {
    storage {
        accountName = '<STORAGE_ACCOUNT_NAME>'
    }
    batch {
        location = '<LOCATION>'
        accountName = '<BATCH_ACCOUNT_NAME>'
        autoPoolMode = true
        allowPoolCreation = true
    }
}
Replace the following:
STORAGE_ACCOUNT_NAME: your storage account name.
LOCATION: your Azure region.
BATCH_ACCOUNT_NAME: your Batch account name.
Note
The above snippet excludes authentication for Azure services. See Authentication for more information.
Launch your pipeline with the above configuration and add a working directory on Azure Blob Storage:
nextflow run <PIPELINE_NAME> -w az://<CONTAINER>/
Replace the following:
PIPELINE_NAME: your pipeline; for example, nextflow-io/rnaseq-nf.
CONTAINER: your blob container from the storage account defined in your configuration.
Tip
You can list Azure regions with: az account list-locations -o table
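Steps 1–3 above can also be performed with the Azure CLI. The following is a minimal sketch using placeholder names; quota increases must still be requested separately:
az group create --name <RESOURCE_GROUP> --location <LOCATION>
az batch account create --resource-group <RESOURCE_GROUP> --name <BATCH_ACCOUNT_NAME> --location <LOCATION>
az storage account create --resource-group <RESOURCE_GROUP> --name <STORAGE_ACCOUNT_NAME> --location <LOCATION> --sku Standard_LRS
az storage container create --name <CONTAINER> --account-name <STORAGE_ACCOUNT_NAME>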
Authentication
Nextflow supports three authentication methods for Azure services, listed in order of security and recommended usage:
Managed Identities (Most secure, Azure-only): Azure-managed credentials without storing secrets.
Service Principals (Secure, works anywhere): Microsoft Entra credentials for authentication.
Access Keys (Basic): direct access keys for authentication.
Required roles
To use Azure services, the following role assignments are required:
For Azure Storage:
Storage Blob Data Reader
Storage Blob Data Contributor
For Azure Batch:
Batch Data Contributor
To assign roles to a Managed Identity or Service Principal, refer to the official Azure documentation.
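For example, a role can be assigned to the principal ID of a Managed Identity or Service Principal with the Azure CLI. This is a sketch; the IDs and scope are placeholders:
az role assignment create \
  --assignee <PRINCIPAL_ID> \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<SUBSCRIPTION>/resourceGroups/<GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT_NAME>"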
Managed identities
New in version 24.05.0-edge.
Managed Identity is the recommended authentication method when running Nextflow within Azure. It automatically manages credentials without requiring you to store secrets.
When using a Managed Identity, an Azure resource can authenticate based on what it is, rather than using access keys. For example, if Nextflow is running on an Azure Virtual Machine with a managed identity and the necessary permissions, it can submit jobs to Azure Batch and access private storage accounts without requiring additional credentials.
There are two types of managed identities:
System-assigned managed identity
A system-assigned identity is tied to a specific Azure resource. To use it:
Enable system-assigned Managed Identity on your Azure resource.
Configure the required role assignments. See Required roles above for more information.
Add the following configuration:
azure {
managedIdentity {
system = true
}
storage {
accountName = '<STORAGE_ACCOUNT_NAME>'
}
batch {
accountName = '<BATCH_ACCOUNT_NAME>'
location = '<BATCH_ACCOUNT_LOCATION>'
}
}
Replace the following:
STORAGE_ACCOUNT_NAME: The name of your Azure Storage account.
BATCH_ACCOUNT_NAME: The name of your Azure Batch account.
BATCH_ACCOUNT_LOCATION: The location of your Azure Batch account.
Tip
Nextflow will use the following environment variable if the managed identity setting is not provided in the Nextflow config file:
AZURE_MANAGED_IDENTITY_SYSTEM: When set to true, enables system-assigned managed identity.
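For example, assuming Nextflow runs on an Azure Virtual Machine, a system-assigned identity can be enabled on that VM with the Azure CLI (names are placeholders):
az vm identity assign --resource-group <RESOURCE_GROUP> --name <VM_NAME>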
User-assigned managed identity
A user-assigned identity can be shared across multiple Azure resources and its lifecycle is not tied to any specific resource. See Azure Documentation for details.
To use it:
Configure the required role assignments. See Required roles above for more information.
Add the following configuration:
azure {
managedIdentity {
clientId = '<USER_ASSIGNED_MANAGED_IDENTITY_CLIENT_ID>'
}
storage {
accountName = '<STORAGE_ACCOUNT_NAME>'
}
batch {
accountName = '<BATCH_ACCOUNT_NAME>'
location = '<BATCH_ACCOUNT_LOCATION>'
}
}
Replace the following:
USER_ASSIGNED_MANAGED_IDENTITY_CLIENT_ID: The client ID of your user-assigned managed identity.
STORAGE_ACCOUNT_NAME: The name of your Azure Storage account.
BATCH_ACCOUNT_NAME: The name of your Azure Batch account.
BATCH_ACCOUNT_LOCATION: The location of your Azure Batch account.
Tip
Nextflow will use the following environment variable if the managed identity client ID is not provided in the Nextflow config file:
AZURE_MANAGED_IDENTITY_USER: The client ID for a user-assigned managed identity.
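For example, a user-assigned identity can be created and its client ID retrieved with the Azure CLI (the identity name is a placeholder):
az identity create --resource-group <RESOURCE_GROUP> --name <IDENTITY_NAME>
az identity show --resource-group <RESOURCE_GROUP> --name <IDENTITY_NAME> --query clientId -o tsv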
Service principals
New in version 22.11.0-edge.
Service Principal credentials can be used to access Azure Batch and Storage accounts. Similar to a Managed Identity, a Service Principal is an application identity with specific permissions and role-based access. Unlike Managed Identities, Service Principals require a secret key for authentication, which is less secure, but they allow you to authenticate with Azure resources when operating outside of the Azure environment, such as from your local machine or a non-Azure cloud. To create a Service Principal, follow the steps in the official Azure documentation and enter the credentials in the Nextflow configuration file, for example:
azure {
activeDirectory {
servicePrincipalId = '<SERVICE_PRINCIPAL_CLIENT_ID>'
servicePrincipalSecret = '<SERVICE_PRINCIPAL_CLIENT_SECRET>'
tenantId = '<TENANT_ID>'
}
storage {
accountName = '<STORAGE_ACCOUNT_NAME>'
}
batch {
accountName = '<BATCH_ACCOUNT_NAME>'
location = '<BATCH_ACCOUNT_LOCATION>'
}
}
Tip
Nextflow will use the following environment variables if service principal credentials are not provided in the Nextflow config file:
AZURE_CLIENT_ID: The application (client) ID of your Azure Service Principal.
AZURE_CLIENT_SECRET: The secret of your Azure Service Principal.
AZURE_TENANT_ID: The tenant ID of your Microsoft Entra account.
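For example, a Service Principal can be created with the Azure CLI; the command returns the values used in the configuration above (the display name is a placeholder):
# Returns appId (client ID), password (client secret), and tenant (tenant ID).
az ad sp create-for-rbac --name <SP_NAME>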
Access keys
Access keys are the basic authentication method, using direct account keys. While simple, they are less secure than Managed Identities or Service Principals because they provide full access to your resources. This approach requires you to obtain and manage the access keys for your Azure Storage and Batch accounts.
You can find your access keys in the Azure Portal:
For Storage accounts, navigate to your Storage account and select Access keys in the left-hand navigation menu.
For Batch accounts, navigate to your Batch account and select Keys in the left-hand navigation menu.
Keep access keys secure and rotate them regularly, as they provide full access to your resources. To reduce risk, avoid hardcoding keys in configuration files. Use environment variables instead, especially in shared or version-controlled environments:
azure {
storage {
accountName = '<STORAGE_ACCOUNT_NAME>'
accountKey = '<STORAGE_ACCOUNT_KEY>'
}
batch {
accountName = '<BATCH_ACCOUNT_NAME>'
accountKey = '<BATCH_ACCOUNT_KEY>'
location = '<LOCATION>'
}
}
You can also use a shared access signature (SAS) token instead of an account key to provide time-limited access to your storage account:
azure.storage.sasToken = '<SAS_TOKEN>'
Tip
When creating a SAS token, ensure you enable the Read, Write, Delete, List, Add, and Create permissions on the resource types Container and Object.
The value of sasToken should be stripped of the leading ? character.
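As an illustration, an account-level SAS token with these permissions can be generated with the Azure CLI; the expiry value is a placeholder, and the CLI returns the token without the leading ? character:
az storage account generate-sas \
  --account-name <STORAGE_ACCOUNT_NAME> \
  --account-key <STORAGE_ACCOUNT_KEY> \
  --services b \
  --resource-types co \
  --permissions rwdlac \
  --expiry <YYYY-MM-DDTHH:MMZ>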
Tip
Nextflow will use the following environment variables if storage settings are not provided in the Nextflow config file:
AZURE_STORAGE_ACCOUNT_NAME: The name of your Azure Storage account.
AZURE_STORAGE_ACCOUNT_KEY: The key of your Azure Storage account.
AZURE_BATCH_ACCOUNT_NAME: The name of your Azure Batch account.
AZURE_BATCH_ACCOUNT_KEY: The key of your Azure Batch account.
AZURE_STORAGE_SAS_TOKEN: The SAS token of your Azure Storage account.
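For example, instead of hardcoding keys in the configuration file, the credentials can be exported in the shell before launching the pipeline:
export AZURE_STORAGE_ACCOUNT_NAME='<STORAGE_ACCOUNT_NAME>'
export AZURE_STORAGE_ACCOUNT_KEY='<STORAGE_ACCOUNT_KEY>'
export AZURE_BATCH_ACCOUNT_NAME='<BATCH_ACCOUNT_NAME>'
export AZURE_BATCH_ACCOUNT_KEY='<BATCH_ACCOUNT_KEY>'
nextflow run <PIPELINE_NAME> -w az://<CONTAINER>/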
Azure Blob Storage
Azure Storage provides several services for storing and accessing data in Azure. Nextflow primarily supports Azure Blob Storage for most workloads, with limited support for Azure File Shares for specific use cases. This section describes how to configure and use Azure Blob Storage with Nextflow.
Nextflow enables transparent access to files stored in Azure Blob Storage using the az:// URI scheme. Once Azure Storage is configured (including authentication, as detailed in the Authentication section), you can use this scheme to reference files in any blob container as if they were local. For example, a file named foo.txt within a container my-data can be accessed in your Nextflow script as az://my-data/foo.txt.
Here's an example of how to use the az:// URI for a file at path /path/to/file.txt in a container named my-container:
file('az://my-container/path/to/file.txt')
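Blob paths can also be used with channel factories and glob patterns. A brief sketch; the container name and file layout are hypothetical:
Channel
    .fromPath('az://my-container/data/*.fastq.gz') // match blobs by glob pattern
    .view()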
The Blob Storage account name needs to be provided in the Nextflow configuration file as shown below, together with credentials via one of the authentication methods described above:
azure {
storage {
accountName = '<BLOB_ACCOUNT_NAME>'
}
}
Tip
Nextflow will use the following environment variables if storage settings are not provided in the Nextflow config file:
AZURE_STORAGE_ACCOUNT_NAME: The name of your Azure Storage account.
AZURE_STORAGE_ACCOUNT_KEY: The access key for your Azure Storage account.
AZURE_STORAGE_SAS_TOKEN: A shared access signature (SAS) token for Azure Storage access.
Azure Batch
Azure Batch is a managed computing service that enables large-scale parallel and high-performance computing (HPC) batch jobs in Azure. Nextflow provides a built-in executor for Azure Batch, allowing you to offload your workflows to the cloud for scalable and cost-effective execution.
Nextflow integrates seamlessly with Azure Batch to:
Dynamically create and manage compute pools based on workflow requirements.
Automatically scale nodes up and down as needed during workflow execution.
Support both regular and low-priority VMs for cost optimization.
Manage task dependencies and data transfers between Azure Storage and compute nodes.
This section describes how to configure and use Azure Batch with Nextflow for efficient cloud-based workflow execution.
For comprehensive information about Azure Batch features and capabilities, refer to the official Azure Batch documentation.
Overview
Nextflow integrates with Azure Batch by mapping its execution model to Azure Batch’s structure. A Nextflow process corresponds to an Azure Batch job, and each instance of that process (a Nextflow task) becomes an Azure Batch task. These Azure Batch tasks are executed on compute nodes within an Azure Batch pool, which is a collection of virtual machines that can scale up or down based on an autoscale formula.
Nextflow manages these pools dynamically. You can assign processes to specific, pre-existing pools using the process.queue
directive. If a pool doesn’t exist, Nextflow can create it, provided azure.batch.allowPoolCreation
is set to true. Alternatively, autoPoolMode
enables Nextflow to automatically create multiple pools based on the CPU and memory requirements defined in your processes.
For each Nextflow process instance, an Azure Batch task is created. This task first downloads the necessary input files from Azure Blob Storage to its assigned compute node. It then runs the process script. Finally, it uploads any output files back to Azure Blob Storage.
Once the entire Nextflow pipeline finishes, Nextflow ensures that the corresponding Azure Batch jobs are marked as terminated. The compute pools, if configured for auto-scaling, will then automatically scale down according to their defined formula, helping to manage costs. Nextflow can also be configured to delete these node pools after pipeline completion.
To use Azure Batch, simply set the executor
directive to azurebatch
in your Nextflow configuration file and add a working directory on Azure Blob Storage:
process {
executor = 'azurebatch'
}
azure {
storage {
accountName = '<STORAGE_ACCOUNT_NAME>'
}
batch {
location = '<LOCATION>'
accountName = '<BATCH_ACCOUNT_NAME>'
autoPoolMode = true
allowPoolCreation = true
}
}
Finally, launch your pipeline with the above configuration:
nextflow run <PIPELINE_NAME> -w az://<CONTAINER>/
Tip
Nextflow will use the following environment variables if the Batch settings are not provided in the Nextflow config file:
AZURE_BATCH_ACCOUNT_NAME: The name of your Azure Batch account.
AZURE_BATCH_ACCOUNT_KEY: The access key for your Azure Batch account.
Azure Batch Quotas
Azure Batch enforces quotas on resources like pools, jobs, and VMs.
Pools: A maximum number of pools can exist at one time. If Nextflow attempts to create a pool when this quota is met, it will throw an error. To proceed, you must delete existing pools.
Active Jobs: There is a limit on the number of active jobs. Completed jobs should be terminated. If this limit is reached, Nextflow will throw an error. You must terminate or delete jobs before running additional pipelines.
VM Cores: Each Azure VM family has a maximum core quota. If you request a VM family size for which you lack sufficient core quota, Nextflow will create jobs and tasks, but they will remain pending as the pool cannot scale due to the quota. This can cause your pipeline to hang indefinitely.
You can increase your quota on the Azure Portal by going to the Batch account and clicking on the “Quotas” link in the left menu, then requesting an increase in quota numbers.
Pool management
Warning
To avoid any extra charges in the Batch account, remember to clean up the Batch pools or use auto scaling.
Nextflow supports two approaches for managing Batch pools:
Auto pools
When autoPoolMode is enabled, Nextflow automatically creates pools of compute nodes appropriate for your pipeline.
By default, the cpus and memory directives are used to find the smallest machine type that fits the requested resources within the Azure machine family specified by machineType. If memory is not specified, 1 GB of memory is allocated per CPU. When no options are specified, a single compute node of type Standard_D4_v3 is used.
To use autoPoolMode
, you need to enable it in the azure.batch config scope:
azure.batch {
autoPoolMode = true
allowPoolCreation = true
}
process {
machineType = 'Standard_D4_v3'
cpus = 4
memory = '16 GB'
}
You can specify multiple machine families per process using glob patterns in the machineType directive:
process.machineType = "Standard_D*d_v5,Standard_E*d_v5" // D or E v5 machines with data disk
For example, the following process will create a pool of Standard_E8d_v5 machines when using autoPoolMode:
process EXAMPLE_PROCESS {
machineType "Standard_E*d_v5"
cpus 8
memory 8.GB
script:
"""
echo "cpus: ${task.cpus}"
"""
}
Note
For tasks using fewer than 4 CPUs, Nextflow creates pools with 4x CPUs to enable task packing.
For tasks using fewer than 8 CPUs, Nextflow uses 2x CPUs.
Override this by specifying an exact machine type (e.g., Standard_E2d_v5).
Use a glob pattern to avoid certain machines (e.g., standard_*[^p]_v* avoids ARM-based machines).
The pool is not removed when the pipeline terminates, unless the configuration setting azure.batch.deletePoolsOnCompletion
is enabled in your Nextflow configuration file.
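For example, to have Nextflow remove the pools it created once the pipeline completes, combine the option with the settings shown earlier:
azure.batch {
    autoPoolMode = true
    allowPoolCreation = true
    deletePoolsOnCompletion = true
}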
Named pools
To control which compute node pool is used for a specific task in your pipeline, use the queue directive in Nextflow to specify the ID of the Azure Batch compute pool that should run the process.
The pool is expected to be already available in the Batch environment, unless the setting azure.batch.allowPoolCreation = true
is provided in the azure.batch
config scope in the pipeline configuration file. In the latter case, Nextflow will create the pools on-demand.
The configuration details for each pool can be specified using a snippet. For example:
azure.batch {
pools {
small {
vmType = 'Standard_D2_v2'
vmCount = 5
}
large {
vmType = 'Standard_D8_v3'
vmCount = 2
}
}
}
process {
withLabel: small_jobs {
queue = 'small'
}
withLabel: big_jobs {
queue = 'large'
}
}
The above example defines the configuration for two node pools. The first will provision 5 compute nodes of type Standard_D2_v2, the second 2 nodes of type Standard_D8_v3. See the Azure configuration section for the complete list of available configuration options.
Warning
Pool names can only contain alphanumeric, hyphen and underscore characters. Wrap hyphenated names in quotes: 'pool-1'.
Requirements on pre-existing named pools
When Nextflow is configured to use a pool already available in the Batch account, the target pool must satisfy the following requirements:
The pool must be declared as dockerCompatible (Container Type property).
The task slots per node must match the number of cores for the selected VM. Otherwise, Nextflow will return an error like “Azure Batch pool ‘ID’ slots per node does not match the VM num cores (slots: N, cores: Y)”.
Unless you are using Fusion, all tasks must have AzCopy available in the path. If azure.batch.copyToolInstallMode = 'node', every node must have the azcopy binary located at $AZ_BATCH_NODE_SHARED_DIR/bin/.
Auto-scaling
Azure Batch can automatically scale pools based on parameters that you define, saving you time and money. With automatic scaling, Batch dynamically adds nodes to a pool as task demands increase, and removes compute nodes as task demands decrease.
To enable this feature for pools created by Nextflow, add the option autoScale = true
to the corresponding pool configuration scope. For example, when using the autoPoolMode
:
azure.batch.pools.<POOL_NAME> {
autoScale = true
vmCount = 5 // Initial count
maxVmCount = 50 // Maximum count
}
The default auto-scaling formula initializes the pool with the number of VMs specified by the vmCount
option. It then dynamically scales up the pool based on the number of pending tasks, up to the limit defined by maxVmCount
. When no jobs are submitted for execution, the formula automatically scales the pool down to zero nodes to minimize costs. The complete default formula is shown below:
// Get pool lifetime since creation.
lifespan = time() - time("{{poolCreationTime}}");
interval = TimeInterval_Minute * {{scaleInterval}};
// Compute the target nodes based on pending tasks.
$samples = $PendingTasks.GetSamplePercent(interval);
$tasks = $samples < 70 ? max(0, $PendingTasks.GetSample(1)) : max( $PendingTasks.GetSample(1), avg($PendingTasks.GetSample(interval)));
$targetVMs = $tasks > 0 ? $tasks : max(0, $TargetDedicatedNodes/2);
targetPoolSize = max(0, min($targetVMs, {{maxVmCount}}));
// For first interval deploy 1 node, for other intervals scale up/down as per tasks.
$TargetDedicatedNodes = lifespan < interval ? {{vmCount}} : targetPoolSize;
$NodeDeallocationOption = taskcompletion;
Custom formulas can be provided via the azure.batch.pools.<POOL_NAME>.scaleFormula
configuration option.
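For illustration, the snippet below supplies a simple custom formula that scales to the number of pending tasks up to a fixed cap. It reuses the Batch autoscale variables seen in the default formula above and is only a sketch, not a production-ready formula:
azure.batch.pools.<POOL_NAME>.scaleFormula = '''
    // Scale to the current number of pending tasks, capped at 10 nodes.
    $tasks = max(0, $PendingTasks.GetSample(1));
    $TargetDedicatedNodes = min($tasks, 10);
    $NodeDeallocationOption = taskcompletion;
'''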
Task authentication
By default, Nextflow creates SAS tokens for specific containers and passes them to tasks to enable file operations with Azure Storage. The SAS token expires after a set period of time, configurable via the azure.storage.tokenDuration setting, which defaults to 48 hours.
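For example, to reduce the token lifetime to 24 hours:
azure.storage.tokenDuration = '24h'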
New in version 25.05.0-edge.
When using the Fusion file system, you can also authenticate to Azure Storage using a managed identity. This requires:
Creating a user-assigned managed identity with the Storage Blob Data Contributor role for your storage account.
Manually attaching the user-assigned managed identity to the node pool.
Configuring Nextflow to use this identity via the azure.managedIdentity.clientId setting:
azure {
managedIdentity {
clientId = '<MANAGED_IDENTITY_CLIENT_ID>'
}
}
Each task will now authenticate as this managed identity and download and upload files to Azure Storage using these credentials. It is possible to attach more than one managed identity to a pool, and Fusion will use the one specified with the azure.managedIdentity.clientId
setting.
Task Packing
Each Azure Batch node is allocated a specific number of task slots, determining how many tasks can run concurrently on that node. The number of slots is calculated based on the virtual machine's available CPUs. Nextflow intelligently assigns task slots to each process based on the proportion of total node resources requested through the cpus, memory, and disk directives.
For example, consider a Standard_D4d_v5 machine with 4 vCPUs, 16 GB of memory, and 150 GB local disk:
If a process requests cpus 2, memory 8.GB, or disk 75.GB, it will be allocated two task slots (50% of resources), allowing two tasks to run concurrently on the node.
If a process requests cpus 4, memory 16.GB, or disk 150.GB, it will be allocated four task slots (100% of resources), allowing only one task to run on the node.
Resource overprovisioning can occur if tasks consume more than their allocated share of resources. For instance, if a process with cpus 2
uses more than 8 GB of memory or 75 GB of disk space, the node may become overloaded, resulting in performance degradation or task failure. It’s important to accurately specify resource requirements to ensure optimal performance and prevent task failures.
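For example, on the Standard_D4d_v5 node described above, the following illustrative process would be allocated two of the four task slots:
process ALIGN {
    machineType 'Standard_D4d_v5'
    cpus 2
    memory 8.GB
    disk 75.GB

    script:
    """
    echo "using ${task.cpus} CPUs"
    """
}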
Warning
Tasks may fail if they exceed their allocated resources. This is particularly important for storage: Azure virtual machines come with fixed storage disks, which are not expandable. If the tasks on a node together request more storage than the machine has available, tasks will fail.
Container support
Using Azure Batch requires the use of containers for every process to ensure portability and reproducibility of your workflows.
Nextflow supports both public and private container images. While public images can be pulled from repositories like https://community.wave.seqera.io/, private images require authentication. For private registries like Azure Container Registry (ACR), the Batch pool must be authenticated. If Nextflow is responsible for creating the pool, this can be configured by adding ACR credentials to the azure.registry
config scope:
azure.registry {
server = '<REGISTRY_SERVER>' // e.g., 'myregistry.azurecr.io'
userName = '<REGISTRY_USER>'
password = '<REGISTRY_PASSWORD>'
}
Tip
Nextflow will use the following environment variables if the registry credentials are not provided in the Nextflow config file:
AZURE_REGISTRY_USER_NAME: The username for Azure Container Registry authentication.
AZURE_REGISTRY_PASSWORD: The password for Azure Container Registry authentication.
Public container images from Docker Hub or other public registries can still be used without additional configuration, even when a private registry is configured. The Docker service on the nodes will automatically use the appropriate registry based on the container image reference in your process container directive.
Virtual networks
Pools can be configured to use a virtual network to connect to your existing network infrastructure.
azure.batch.pools.<POOL_NAME>.virtualNetwork = '<SUBNET_ID>'
The subnet ID must be in the following format:
/subscriptions/<SUBSCRIPTION>/resourceGroups/<GROUP>/providers/Microsoft.Network/virtualNetworks/<VNET>/subnets/<SUBNET>
Warning
Virtual networks require Microsoft Entra authentication (Service Principal or Managed Identity).
Nextflow can submit tasks to existing pools that are already attached to a virtual network without requiring additional configuration.
Start tasks
Start tasks are optional commands that can be run when nodes join the pool. They are useful for setting up the node environment or for running any other commands that need to be executed when the node is created.
azure.batch.pools.<POOL_NAME>.startTask {
script = 'echo "Hello, world!"'
privileged = true // optional, default: false
}
Operating Systems
When Nextflow creates a pool of compute nodes, it selects:
The virtual machine image reference: defines the OS and software to be installed on the node
The Batch node agent SKU: a program that runs on each node and provides the interface between the node and the Azure Batch service
Together, these settings determine the operating system, version, and available features for each compute node in your pool.
By default, Nextflow creates pool nodes based on Ubuntu 22.04, which is suitable for most bioinformatics workloads. However, you can customize the operating system in your pool configuration to meet specific requirements. Below are example configurations for common operating systems used in scientific computing.
For Ubuntu 22.04 (default):
azure.batch.pools.<POOL_NAME> {
sku = "batch.node.ubuntu 22.04"
offer = "ubuntu-hpc"
publisher = "microsoft-dsvm"
}
CentOS 8:
azure.batch.pools.<POOL_NAME> {
sku = "batch.node.centos 8"
offer = "centos-container"
publisher = "microsoft-azure-batch"
}
Advanced Features
Azure file shares
New in version nf-azure: 0.11.0
Azure file shares provide fully managed file shares that can be mounted to compute nodes. This is useful for sharing reference data or tools across tasks. Files become immediately available in the file system and can be accessed as local files within processes.
The Azure File share must exist in the storage account configured for Blob Storage. The name of the source Azure file share and mount path (the destination path where the files are mounted) must be provided. Additional mount options (see the Azure Files documentation) can be set as well for further customisation of the mounting process.
Configuration:
azure {
storage {
fileShares {
references { // any name of your choice
mountPath = '/mnt/references'
mountOptions = '-o vers=3.0,dir_mode=0777,file_mode=0777,sec=ntlmssp' // optional
}
}
}
}
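With the above configuration, files in the references share appear under /mnt/references on each compute node and can be read like local files. A brief illustrative sketch; the process name and directory contents are hypothetical:
process USE_REFERENCES {
    script:
    """
    ls -l /mnt/references
    """
}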
Warning
File shares require storage account key authentication. Managed Identity and Service Principal are not supported.
Hybrid Workloads
Nextflow allows the use of multiple executors in the same workflow application. This feature enables the deployment of hybrid workloads, in which some jobs are executed on the local computer or a local computing cluster, and some jobs are offloaded to Azure Batch.
To enable this feature, use one or more Process selectors in your Nextflow configuration to apply the Azure Batch configuration to the subset of processes that you want to offload. For example:
process {
withLabel: bigTask {
executor = 'azurebatch'
queue = 'my-batch-pool'
container = 'my/image:tag'
}
}
azure {
storage {
accountName = '<STORAGE_ACCOUNT_NAME>'
accountKey = '<STORAGE_ACCOUNT_KEY>'
}
batch {
location = '<LOCATION>'
accountName = '<BATCH_ACCOUNT_NAME>'
accountKey = '<BATCH_ACCOUNT_KEY>'
}
}
With the above configuration, processes with the bigTask label will run on Azure Batch, while the remaining processes will run on the local computer.
Then launch the pipeline with the -bucket-dir option to specify an Azure Blob Storage path for the jobs computed with Azure Batch, and optionally, use the -work-dir option to specify the local storage for the jobs computed locally:
nextflow run <SCRIPT> -bucket-dir az://my-container/some/path
Warning
The Blob Storage path needs a subdirectory, for example, az://container/work, not az://container.
Nextflow handles file transfers between local and cloud storage.
With Fusion, -bucket-dir is optional.
Advanced configuration
See the Azure configuration section for all available configuration options.