Create workflow
Up to this point, you’ve seen how OpenProblems uses modular components — dataset loaders, methods, and metrics — to define benchmarking tasks for single-cell analysis. Now, we’ll bring these components together into a complete, executable workflow using Viash. By creating a Viash Nextflow component, we can orchestrate the execution of these modules, making it simple to run a full benchmark, manage dependencies, and generate comprehensive results.
Let’s break down how we build a Nextflow workflow component using Viash. The core idea is to define a `config.vsh.yaml` file that describes our workflow’s inputs, outputs, dependencies, and execution logic, and then write a `main.nf` script containing the Nextflow workflow itself. Viash will take care of the rest, automatically generating the necessary boilerplate code and wiring everything together.
Viash Config for Nextflow Workflows (`config.vsh.yaml`)

The `config.vsh.yaml` file is where we specify the blueprint of our workflow. Here’s a breakdown of the key sections:
src/workflows/run_benchmark/config.vsh.yaml

```yaml
name: run_benchmark                 # (1)
namespace: workflows
argument_groups:
  - name: Inputs                    # (2)
    arguments:
      - name: "--input_train"
        __merge__: /src/api/file_train.yaml
        type: file
        direction: input
        required: true
      - ...
  - name: Outputs                   # (3)
    arguments:
      - name: "--output_scores"
        type: file
        required: true
        direction: output
        description: A yaml file containing the scores of each of the methods
        default: score_uns.yaml
      - ...
resources:                          # (4)
  - type: nextflow_script
    path: main.nf
    entrypoint: run_wf
dependencies:                       # (5)
  - name: h5ad/extract_uns_metadata
  - name: methods/logistic_regression
  - name: metrics/accuracy
  - name: control_methods/true_labels
runners:                            # (6)
  - type: nextflow
```
1. `name` and `namespace`: These fields, as with other component types, uniquely identify your workflow component within the Viash ecosystem.
2. Input arguments: Here we define the input files our workflow will consume. These typically consist of all of the files created by the dataset processor defined on the previous page.
3. Output arguments: The outputs are where the results of your benchmarking workflow will be stored. For benchmarks, these should always be the following:
   - `output_scores`: A yaml containing the scores of the methods on the datasets.
   - `output_method_configs`: A yaml containing the Viash configs of the methods used in the benchmark.
   - `output_metric_configs`: A yaml containing the Viash configs of the metrics used in the benchmark.
   - `output_dataset_info`: A yaml containing the metadata of the datasets used in the benchmark.
   - `output_task_info`: A yaml containing the metadata of the benchmark task itself.
4. `resources`: This section is where you point Viash to your Nextflow workflow definition, which is typically stored in a file named `main.nf` with an entry point named `run_wf`.
5. `dependencies`: This is what makes workflow components truly powerful. Here, you declare the other Viash components that your workflow depends on. These can be methods, metrics, or other utility components. You need to specify each dependency by name, and optionally its repository if it’s not a component from this repository. Once declared, a dependency can be invoked directly from the workflow (see the sketch after this list).
6. `runners`: This tells Viash that it should use Nextflow to run this component.
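To make the connection between `dependencies` and the workflow concrete, here is a short, hypothetical fragment of the kind of code you can write in `main.nf` once the components above are declared. Each dependency is exposed under its component name and can be piped into a channel like any other workflow step; the argument and state names used here (`input_train`, `prediction`, …) are illustrative placeholders rather than a prescribed API:

```groovy
// Hypothetical fragment: dependencies declared in config.vsh.yaml become
// callable workflow steps in main.nf, addressed by their component name.
input_ch
  // methods/logistic_regression is available as `logistic_regression`
  | logistic_regression.run(
      // map state keys onto the component's input arguments (names assumed)
      fromState: [input_train: "input_train", input_test: "input_test"],
      // store the component's output back into the state under "prediction"
      toState: [prediction: "output"]
    )
  // metrics/accuracy is available as `accuracy`
  | accuracy.run(
      fromState: [input_solution: "input_solution", input_prediction: "prediction"],
      toState: [score: "output"]
    )
```

The actual benchmark workflow below runs components through `runEach()` rather than chaining them one by one, but the state-mapping mechanism is the same.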
The Nextflow Workflow (`main.nf`)

The `main.nf` file contains the actual Nextflow workflow that orchestrates the execution of your components. Let’s dissect the key parts of this file:
src/workflows/run_benchmark/main.nf

```groovy
// (1) auto-generated code (not shown)

workflow run_wf {                               // (2)
  take:
  input_ch                                      // (3)

  main:
  dataset_ch = input_ch
    | map{ id, state ->
      [id, state + ["_meta": [join_id: id]]]
    }
    | extract_uns_metadata.run(                 // (4)
      fromState: [input: "input_solution"],
      toState: { id, output, state ->
        state + [
          dataset_uns: readYaml(output.output).uns
        ]
      }
    )

  methods = [                                   // (5)
    true_labels,
    logistic_regression
  ]

  metrics = [
    accuracy
  ]

  score_ch = dataset_ch                         // (6)
    // run all methods
    | runEach(
      components: methods,
      // ...
    )
    // run all metrics
    | runEach(
      components: metrics,
      // ...
    )

  output_ch = ...                               // (7)

  emit:
  output_ch
}
```
1. Auto-Generated Code (not shown): When you build your workflow component with Viash, it automatically injects some boilerplate code at the beginning of your `main.nf`. This code handles tasks like finding the components you declared as dependencies and setting up the initial data flow.
2. Workflow Structure: Your main workflow logic will typically be enclosed within a `workflow` block, named `run_wf` as we specified in our `config.vsh.yaml`. This is the entry point for your workflow.
3. Channels, the data flow backbone: Nextflow uses channels to pass data between different processes (which, in our case, are the Viash components). Channels are like asynchronous queues, allowing components to operate independently and concurrently. Viash workflows use a specific convention for the data passed through channels: each element in a channel is a tuple `[id, state]`, where `id` is a unique identifier for a particular data instance, and `state` is a key-value dictionary that holds all of the data and metadata associated with that `id`.
4. Component Execution with `.run()`: Viash enhances components with a `.run()` method. While a regular component can be executed as a process in a Nextflow workflow directly, `.run()` provides more fine-grained control. You can specify which data gets passed to the component using the `fromState` argument, and how the component’s output gets merged back into the state using the `toState` argument. This is a powerful way to customize how components interact with the workflow’s data.
5. Method and Metric Components: Here, we define the methods and metrics we want to use in our benchmark. These are the components we declared as dependencies in our `config.vsh.yaml`. We can pass these components directly to the `methods` and `metrics` variables, and then use them in the workflow as needed.
6. Running Components with `runEach()`: The `runEach()` function takes a list of components and runs them in parallel, passing the data through the channels. This is where the benchmarking actually happens: we run each method on each dataset, then run each metric on the results, and finally collect the scores (see the sketch after this list for what the omitted `runEach()` arguments might look like).
7. Boilerplate: The rest of the workflow is typically boilerplate code. This will be moved to a separate helper file to avoid code duplication.
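The `// ...` placeholders in the `runEach()` calls above hide the per-run wiring. As a rough sketch of what they might contain, assuming state keys that follow the task API used earlier (`input_train`, `input_test`, `input_solution`) and noting that, depending on your Viash version, the component name may be exposed as `comp.config.name` or `comp.config.functionality.name`:

```groovy
score_ch = dataset_ch
  // run each method on each dataset
  | runEach(
    components: methods,
    // give every run a unique id: <dataset_id>.<method_name>
    id: { id, state, comp ->
      id + "." + comp.config.name
    },
    // map the state onto the method's input arguments (names assumed)
    fromState: { id, state, comp ->
      [
        input_train: state.input_train,
        input_test: state.input_test
      ]
    },
    // record which method ran and where its prediction ended up
    toState: { id, output, state, comp ->
      state + [
        method_id: comp.config.name,
        method_output: output.output
      ]
    }
  )
  // run each metric on each prediction
  | runEach(
    components: metrics,
    id: { id, state, comp ->
      id + "." + comp.config.name
    },
    fromState: [
      input_solution: "input_solution",
      input_prediction: "method_output"
    ],
    toState: { id, output, state, comp ->
      state + [
        metric_id: comp.config.name,
        score: output.output
      ]
    }
  )
```

Giving each run a unique id (dataset id plus method name) is what allows the downstream metric runs and the final score collection to be traced back to a specific dataset and method combination.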
Test Run
To test your workflow, you can use the provided `scripts/run_benchmark/run_test_local.sh` script. This script is designed to run the `run_benchmark` workflow on a small test dataset.
1. Build all Docker containers: Before running the script, make sure you have built all Docker containers for your components by running `scripts/project/build_all_docker_containers.sh`. This will ensure that the necessary dependencies are available when the workflow runs.
2. Edit `run_test_local.sh`: You’ll need to modify the script to match your workflow’s inputs and outputs. The relevant lines are currently marked with TODO comments.
3. Run the script: Once you’ve edited the script, run it from the root of the repository with `./scripts/run_benchmark/run_test_local.sh`.
The script will execute the `run_benchmark` workflow using Nextflow and the Docker profile. After the script finishes, you can examine the output files in the specified `publish_dir` to verify that your workflow is working correctly.
Next Steps
Now that you’ve created a complete benchmarking workflow, you can run it on your own datasets and methods. You can also extend the workflow to include additional metrics, methods, or other components.
We can now run the workflow on the OpenProblems cloud infrastructure, which allows us to benchmark our methods at a larger scale and generate comprehensive results. Get in touch to request access to the platform for launching your workflows.
Once the workflow has finished running and has been reviewed by the results QC team, the results will be published on the OpenProblems website.