Dynamic resource allocation based on inputs in Nextflow
Nextflow fits my mental model for pipelines perfectly. One of the features I appreciate most is the ability to dynamically assign resources to a process based on various rules, such as the size of input files. This is particularly convenient when running bioinformatics pipelines on HPC systems. In bioinformatics, many tools are not optimized (though many are, and I mean no disrespect). As a result, depending on the input or other circumstances, some tools may require large amounts of resources.
At MGnify, we run a lot of such pipelines, so being able to adjust resources in this way is very handy.
Here’s an example pipeline that adjusts the number of CPUs and memory based on input size:
process TEST {
    // Dynamic directive: the closure is evaluated per task, once input_file is known
    cpus {
        def input_size = input_file.size() // file size in bytes
        if ( input_size >= 100000 ) { // ~100 kB
            return 4
        } else {
            return 2
        }
    }

    // Same rule, applied to memory
    memory {
        def input_size = input_file.size()
        if ( input_size >= 100000 ) { // ~100 kB
            return 2.GB
        } else {
            return 1.GB
        }
    }

    input:
    path input_file

    exec:
    // Print what the task was actually given
    println "input size: ${input_file.size()}"
    println "memory: $task.memory"
    println "cpus: $task.cpus"
}

workflow {
    TEST(file(params.file))
}
The syntax is pretty simple. Each directive takes a closure, because its value has to be evaluated when the input file is provided to the process, so it is computed separately for every task.
Running this on a “small” file:
$ nextflow run main.nf --file small.txt
N E X T F L O W ~ version 24.10.4
Launching `main.nf` [gloomy_laplace] DSL2 - revision: 97b2c8a4b2
executor > local (1)
[e9/0141cb] TEST [100%] 1 of 1 ✔
input size: 8950
memory: 1 GB
cpus: 2
On a “large” file:
$ nextflow run main.nf --file big.txt
N E X T F L O W ~ version 24.10.4
Launching `main.nf` [amazing_bhaskara] DSL2 - revision: 97b2c8a4b2
executor > local (1)
[a8/241b3a] TEST [100%] 1 of 1 ✔
input size: 204802
memory: 2 GB
cpus: 4
You get 1 GB and 2 CPUs for files smaller than 100 kB, and 2 GB and 4 CPUs for bigger files. I think this is pretty neat. The Nextflow DSL may seem strange (especially for bioinformaticians who have mostly only coded in Python), but once you get it, it’s pretty powerful.
It’s also possible to base the rule on the number of lines in a file, or on the number of sequences in a FASTA file.
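As a rough, untested sketch of the sequence-count variant: one way to do it is to count the records before the process runs, using Nextflow's `countFasta()` file method, and pass the count alongside the file as a tuple. The process name `COUNT_TEST`, the `n_seqs` variable and the 1000-sequence threshold are all made up for illustration, and `params.file` is assumed to point to a FASTA file.

```groovy
process COUNT_TEST {
    // Hypothetical rule: 1000 sequences or more gets the larger allocation
    cpus   { n_seqs >= 1000 ? 4 : 2 }
    memory { n_seqs >= 1000 ? 2.GB : 1.GB }

    input:
    // The count arrives as a plain value next to the file itself
    tuple val(n_seqs), path(fasta)

    exec:
    println "sequences: ${n_seqs}"
    println "memory: $task.memory"
    println "cpus: $task.cpus"
}

workflow {
    // Count the FASTA records up front and emit [count, file] tuples
    COUNT_TEST(
        Channel.fromPath(params.file)
            .map { fasta -> [fasta.countFasta(), fasta] }
    )
}
```

Swapping `countFasta()` for `countLines()` should give the line-based version. Counting in the workflow keeps the directive closures down to a simple comparison, at the cost of reading the file once before the task is submitted.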