# Day 4: Introduction to Automation and Nextflow

## Overview

### Lead Instructor
- [Phelelani Mpangase](https://github.com/phelelani) | [email](mailto:Phelelani.Mpangase@wits.ac.za)

### Co-Instructor(s)
- [Amel Ghouila](https://github.com/amelgh) | [@AmelGhouila](https://twitter.com/AmelGhouila) | [email](mailto:amel.ghouila@gmail.com)

### Helper(s)
- Maria Tsagiopoulou | [@tsayo7](https://twitter.com/tsayo7) | [email](mailto:mariatsayo@gmail.com)

### General Topics
- Introduction to Nextflow
- Use of workflow systems for automation / reproducibility
- Basic syntax of Nextflow
- Transform and execute a workflow in Nextflow

## Schedule
- _**09:00 - 10:00**_ [Introduction to Nextflow](4.Day4.md#-nextflow-a-tutorial-through-examples-)
- _**10:00 - 11:00**_ [Parameters, Channels and Processes](4.Day4.md#3-generalising-and-extending)
- _**11:00 - 11:30**_ _**Coffee break**_
- _**11:30 - 12:30**_ [Docker/Singularity](4.Day4.md#4-nextflow--docker--singularity-containers), [Executors](4.Day4.md#5-executors) and [Channel Operations](4.Day4.md#6-channel-operations)
- _**12:30 - 14:00**_ _**Lunch break**_
- _**14:00 - 15:00**_ [Practical exercises in Nextflow](4.Day4.md#7-practical-execise-using-the-h3abionet-variant-calling-pipeline)
- _**15:00 - 16:00**_ [Practical exercises in Nextflow](4.Day4.md#7-practical-execise-using-the-h3abionet-variant-calling-pipeline)
- _**16:00 - 16:15**_ _**Coffee break**_
- _**16:15 - 18:00**_ [Practical exercises in Nextflow](4.Day4.md#7-practical-execise-using-the-h3abionet-variant-calling-pipeline)

## Learning Objectives
- Find and use Nextflow tool definitions [online](https://www.nextflow.io/docs/latest/index.html).
- Understand how to write Nextflow scripts and definitions for command line tools.
- Understand the concepts of Nextflow `Channels`, `Processes` and `Channel` operators.
- Understand how to handle multiple inputs and outputs in Nextflow.
- Understand Nextflow's configuration file (`nextflow.config`), profiles and input parameters.
- Use Docker/Singularity with Nextflow to provide software dependencies and ensure reproducibility.
- Join Nextflow tools into a workflow.
- Run Nextflow workflows on local, HPC and cloud systems.
*Flow chart summarizing the resources and best practices for development, maintenance, sharing and publishing of reproducible and portable workflows.*
## 1. Introduction

This tutorial is an introduction to Nextflow, primarily through examples. Since the tutorial is brief, it is designed to whet your appetite -- we're only going to dip in and out of some of its features in a superficial way.

**Exercises:** Throughout this tutorial there will be some practical examples. Not all will be covered in class for time reasons, but you can come back and do them.

### 1.1. Workflow Languages

Many scientific applications require:
- Multiple data files
- Multiple applications
- Perhaps different parameters

General purpose languages are not well suited:
- Too low a level of abstraction
- Do not separate workflow from application
- Not reproducible

Nextflow is a Groovy-based language for expressing workflows:
- Portable -- works on most Unix-like systems
- Very easy to install (**NB:** requires Java 7 or 8)
- Scalable
- Supports Docker/Singularity
- Supports a range of scheduling systems

Key Nextflow concepts:
- **Processes**: actual work being done -- usually simple (call the program that does the analysis)
- **Channels**: for communication between processes (inputs, outputs)
- When all inputs are ready, the process is executed.
- Each process runs in its own directory -- files are staged.
- Supports resumption of previous partial runs.

### 1.2. Nextflow Script

First, let's set up a directory where we will do all our Nextflow exercises:

```bash
mkdir $HOME/day4
cd $HOME/day4
```

Then, we download the data we will be using for the exercises:

```bash
wget https://github.com/fpsom/CODATA-RDA-Advanced-Bioinformatics-2019/raw/master/files/data/tutorial.zip
unzip tutorial.zip
```

Type `ls -l` and hit `Enter`.

#### 1.4.2. `timeline`

```bash
nextflow run cleandups.nf -with-timeline
```

#### 1.4.3. `report`

```bash
nextflow run cleandups.nf -with-report
```
**NB:** For debugging, the `-with-trace` option may be useful.

## 2. Groovy

- Can inter-mix Nextflow, Groovy and Java code
- Very powerful, flexible
- Don't need to know much (any?) Groovy, but a little knowledge is a powerful thing

Here we'll do some cookbook Groovy...

### 2.2. Groovy Closures

Closures are anonymous functions -- similar to lambdas in Python.
- Don't want the overhead of naming a function we only use once
- Typically used with higher-order functions -- functions that take other functions as arguments. Very powerful and useful.

Syntax for a closure that takes one argument:

```groovy
{ parm -> expression }
```

This is an anonymous function that takes one parameter (I've called it `parm`, but you can call it whatever you want) and `expression` is a valid expression, probably involving the parameter. Let's look at some examples:

```groovy
{ a -> a*a } (3)
{ a -> a*a + 7*a - 2 } (3)
for (n in 1..5) print( {it*it} (n));
{ x, y -> Math.sqrt(x*x + y*y) } (3,4)
```

OK, I am simplifying a bit here. Closures are a bit more than functions.

Now, what we have seen so far isn't so useful, but the power comes when we have a function that takes another function. Consider this very simplistic example. Suppose we have a program where we are manipulating lists of numbers. Sometimes we want to sum a list; sometimes we want to sum the squares of the numbers; sometimes we want to sum the cubes of the cosines of the numbers. The business of going through the list and summing is the same in all cases. What differs is what we do to the numbers in each case -- so rather than have a separate procedure for each type of summation, we just have one.
But we pass this procedure a function that says what to do:

```groovy
int doX(f, nums) {
    sum = 0
    for ( n in nums ) {
        sum = sum + f(n)
    }
    return sum
}
```

We can call it thus:

```groovy
print doX( {a -> a}, [4,5,16] );
print doX( {a -> a*a}, [4,5,16] );
print doX( { it*it }, [4,5,16] );
m = 10
print doX( {a -> m*a + 2}, [1,2,3] )
```

**NB:** You don't have to name the parameter -- if you don't name the parameter, then the name `it` is assumed.

**Exercise 3:** Look at the sample Groovy code [here](files/data/groovy.nf). Try to understand it and execute it on your machine.

## 3. Generalising and Extending

We'll now extend this example, introducing more powerful features of Nextflow as well as some of the complications of workflow design.

Extending the example:
- Parameterise the input.
- We want output to go to a convenient place.
- The workflow takes in multiple input files -- processes are executed on each in turn.
- Complication: we may need to carry the base name of the input to the final output.
- We can repeat some steps for different parameters.

### 3.1. Parameters

Parameters can be specified in a Nextflow script file:

```nextflow
input_ch = Channel.fromPath(params.data_dir)
```

They can also be passed to the Nextflow command when executing a script:

```bash
nextflow run phylo1.nf --data_dir data/polyseqs.fa
```

During debugging you may want to specify default values:

```nextflow
params.data_dir = 'data'
```

If you run the Nextflow program without parameters, it will use this as a default value; if you give it parameters, the default value is replaced. Of course, as a matter of good practice, default values for parameters are really designed for real parameters of the process (like gap penalties in a multiple sequence alignment) rather than data files.

Nextflow makes a distinction between parameters with a single dash (`-`) and with a double dash (`--`). The single-dash ones come from a small, language-defined set modifying the behaviour of Nextflow -- for example, we've seen `-with-dag` already.
The double-dash parameters are user-defined and completely extensible -- they are used to populate `params`. They modify the behaviour of **your** program.

### 3.2. [Channels](https://www.nextflow.io/docs/latest/channel.html)

Nextflow channels support different data types:
- `file`
- `val`
- `set`

**NB:** `val` is the most generic -- it could be a file name. But sending a `file` provides power, since you can access Groovy's file handling capacity **and**, more importantly, it does staging of files.

#### 3.2.1. Creating channels

```nextflow
Channel.create()
Channel.empty()
Channel.from("blast","plink")
Channel.fromPath("data/*.fa")
Channel.fromFilePairs("data/{YRI,CEU,BEB}.*")
Channel.watchPath("*fa")
```

There are others.

**NB:** The `fromPath` method takes a Unix glob and creates a new channel containing all the files that match the glob. These files are then emitted one by one to processes that use these values. This default semantics can be changed using the channel operators that Nextflow provides, some of which are shown below. There are many, many operations you can do on channels and their contents:

```nextflow
bind
buffer
close
filter
map/reduce
group
join, merge
mix
copy
split
spread
fork
count
min/max/sum
print/view
```

### 3.3. Generalising Our Example

#### 3.3.1. Multiple inputs

```nextflow
#!/usr/bin/env nextflow

params.data_dir = "data"
input_ch = Channel.fromPath("${params.data_dir}/*.bim")

process getIDs {
  input:
    file input from input_ch
  output:
    file "${input.baseName}.ids" into id_ch
    file "$input" into orig_ch
  script:
    "cut -f 2 $input | sort > ${input.baseName}.ids"
}

process getDups {
  input:
    file input from id_ch
  output:
    file "${input.baseName}.dups" into dups_ch
  script:
    out = "${input.baseName}.dups"
    """
    uniq -d $input > $out
    touch ignore
    """
}

process removeDups {
  publishDir "output", pattern: "${badids.baseName}_clean.bim", overwrite: true, mode: 'copy'
  input:
    file badids from dups_ch
    file orig from orig_ch
  output:
    file "${badids.baseName}_clean.bim" into cleaned_ch
  script:
    "grep -v -f $badids $orig > ${badids.baseName}_clean.bim"
}
```

Here the `getIDs` process will execute once for each file found in the initial glob. On a machine with multiple cores, these would probably execute in parallel, and as we'll see later, if you are running on the head node of a cluster, each could run as a separate job.

**NB:** In this version of `getIDs` we name the output file dependent on the input file. This is convenient to do because now we are taking many input files. There is no danger of any name clashes during execution, because each parallel execution of `getIDs` runs in a separate local directory. However, at the end we want to be able to distinguish which output came from which input without having to do detective work -- so we name the files conveniently. Files that get created along the way but are not needed at the end we can name boringly.
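Outside Nextflow, these three processes boil down to a short shell recipe. Here is a minimal sketch you can try on its own; the file `toy.bim` and its IDs are made up for illustration and are not part of the tutorial data:

```shell
# Fake a two-column .bim-like file with a duplicated ID in column 2
printf '1\trs1\n1\trs2\n1\trs2\n1\trs3\n' > toy.bim

# getIDs: extract column 2 and sort it
cut -f 2 toy.bim | sort > toy.ids

# getDups: keep only IDs that occur more than once
uniq -d toy.ids > toy.dups     # contains: rs2

# removeDups: drop every line mentioning a duplicated ID
grep -v -f toy.dups toy.bim > toy_clean.bim   # 2 lines remain
```

Note that `uniq -d` only detects adjacent duplicates, which is why the `sort` in the first step matters.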
```bash
nextflow run cleandups.nf
```

```bash
N E X T F L O W  ~  version 19.07.0
Launching `cleandups.nf` [small_wozniak] - revision: f8696171b0
executor >  local (3)
[6c/1b5ca2] process > getIDs (1)     [100%] 1 of 1 ✔
[74/7d0dc8] process > getDups (1)    [100%] 1 of 1 ✔
[05/51ca59] process > removeDups (1) [100%] 1 of 1 ✔
```

Now I'm going to add a next step -- say we want to split the IDs into groups using `split`, but try different values of splitting.

#### 3.3.2. Multiple parameters

**Exercise 4:** Now try adding a process to our Nextflow example for splitting the file, using different split values (SOLUTION [HERE](files/data/cleandups_inclass_example.nf)), e.g.:

```bash
split -l 400 data.txt dataX
```

will produce files `dataXaa`, `dataXab`, `dataXac` and so on...

```nextflow
splits = [400,500,600]

process splitIDs {
  input:
    file bim from cleaned_ch
    each split from splits
  output:
    file ("*-$split-*") into output_ch
  script:
    "split -l $split $bim ${bim.baseName}-$split- "
}
```

Have a look at the modified Nextflow script [here](files/data/cleandups_channels.nf).

### 3.4. Managing Grouped Files

So far we have seen cases where a stream of files is processed independently. But in many applications there may be matched data sets. We'll now look at an example using a popular bioinformatics tool called `PLINK`. In its most common usage, `PLINK` takes in three related files, typically with the same base name but different suffixes: `.bed`, `.bim`, `.fam`.

Short version of the command:

```bash
plink --bfile /path/YRI --freq --out /tmp/YRI
```

Long version of the command:

```bash
plink --bed YRI.bed --bim YRI.bim --fam YRI.fam --freq --out /tmp/YRI
```

If you don't know what `PLINK` does, don't worry. It's the Swiss Army knife of bioinformatics. The above commands are equivalent (the first is the short-hand for the second when the `bed`, `bim` and `fam` files have the same base).
The command finds frequencies of genome variations -- the output in this example will go into a file called `YRI.frq`.

**Problem:**
- Pass the files on another channel (or channels) to be staged
- Pass the base name as a value, or work it out

**Pros/Cons:**
- Simple
- Needs an extra channel / some gymnastics

Let's recap -- Groovy closures. Simply, a **closure** is an anonymous function:
- Code wrapped in braces `{ }`
- Default argument called `it`

```groovy
[1,2,3].each { print it * it }
[1,2,3].each { num -> print num * num }
```

Similar to lambdas in `Python` and `Java`.

#### 3.4.1. Version 1: `map`

```nextflow
#!/usr/bin/env nextflow

params.dir = "data/pops/"
dir = params.dir
params.pops = ["YRI","CEU","BEB"]

Channel.from(params.pops)
       .map { pop -> [ file("$dir/${pop}.bed"), file("$dir/${pop}.bim"), file("$dir/${pop}.fam") ] }
       .set { plink_data }

plink_data.subscribe { println "$it" }
```

This example takes a stream of values from `params.pops` and for each value (that's what `map` does) it applies a closure that takes a string and produces a tuple of files. That tuple is then bound to a channel called `plink_data`.

**NB:** There are two **distinct** uses of `set`:
- As a channel operator, as shown here
- In an input/output clause of a process

```bash
[data/pops/YRI.bed, data/pops/YRI.bim, data/pops/YRI.fam]
[data/pops/CEU.bed, data/pops/CEU.bim, data/pops/CEU.fam]
[data/pops/BEB.bed, data/pops/BEB.bim, data/pops/BEB.fam]
```

Now let's look at a more realistic example that you can try on your own computer.

**NB:** Since you may not have `plink` on your computer, our code actually fakes the output. If you do have `plink`, you can make the necessary changes.

```nextflow
process getFreq {
  input:
    set file(bed), file(bim), file(fam) from plink_data
  output:
    file "${bed.baseName}.frq" into result
  script:
    """
    plink --bed $bed \
          --bim $bim \
          --fam $fam \
          --freq \
          --out ${bed.baseName}
    """
}
```

Look at [`plink1B.nf`](files/data/plink1B.nf). It's a slightly different way of doing things.
On examples of this size, none of these options is much better or worse than the others, but it's useful to see different ways of doing things for later.

#### 3.4.2. Version 2: `fromFilePairs`

Use `fromFilePairs`:
- Takes a closure used to gather `files` together with the same `key`:

```nextflow
x_ch = Channel.fromFilePairs( files ) { closure }
```

Specify the files as a glob. The closure associates each `file` with a `key`. `fromFilePairs` puts all files with the same key together, and returns a list of pairs (`key`, `list`).

```nextflow
#!/usr/bin/env nextflow

commands = Channel.fromFilePairs("/usr/bin/*", size: -1) { it.baseName[0] }

commands.subscribe {
  k = it[0];
  n = it[1].size();
  println "There are $n files starting with $k";
}
```

Here we use standard globbing to find all the files in the `/usr/bin` directory. The closure takes the first letter of each file name -- all the files with the same letter are put together. The `size` parameter says how many we put together: `-1` means all.

A more complex example -- **default** closure:

```nextflow
Channel
    .fromFilePairs("${params.dir}/*.{bed,fam,bim}", size: 3, flat: true)
    .ifEmpty { error "No matching plink files" }
    .set { plink_data }

plink_data.subscribe { println "$it" }
```

`fromFilePairs`:
- Matches the glob
- The first `*` is taken as the matching key
- For each unique match of the `*` we return:
  - The thing that matches the `*`
  - The list of files that match the glob with that item
- Up to 3 matching files (the default of `size` is 2 -- hence the name)

```bash
[CEU, [data/pops/CEU.bed, data/pops/CEU.bim, data/pops/CEU.fam]]
[YRI, [data/pops/YRI.bed, data/pops/YRI.bim, data/pops/YRI.fam]]
[BEB, [data/pops/BEB.bed, data/pops/BEB.bim, data/pops/BEB.fam]]
```

```nextflow
process checkData {
  input:
    set pop, file(pl_files) from plink_data
  output:
    file "${pl_files[0]}.frq" into result
  script:
    base = pl_files[0].baseName
    "plink --bfile $base --freq --out ${base}"
}
```

*OR*

```nextflow
process checkData {
  input:
    set pop, file(pl_files) from plink_data
  output:
    file "${pop}.frq" into result
  script:
    "plink --bfile $pop --freq --out $pop"
}
```

#### 3.4.3. Version 3: Final version

```nextflow
#!/usr/bin/env nextflow

params.dir = "data/pops/"
dir = params.dir
params.pops = ["YRI","CEU","BEB"]

Channel
    .fromFilePairs("${params.dir}/{YRI,BEB,CEU}.{bed,bim,fam}", size: 3) { file -> file.baseName }
    .filter { key, files -> key in params.pops }
    .set { plink_data }

process checkData {
  input:
    set pop, file(pl_files) from plink_data
  output:
    file "${pop}.frq" into result
  script:
    "plink --bfile $pop --freq --out $pop"
}
```

**Exercise 5:** Have a look at [`weather.nf`](files/data/weather.nf). In the data directory is a set of data files for different years and months. First, use `paste` to combine all the files for the same year and month (`paste` joins files horizontally). Then these new files should be concatenated.

### 3.5. On absolute paths

Great care needs to be taken when referring to absolute paths. Consider the following script. Assuming that local execution is being done, this should work:

```nextflow
input = Channel.fromPath("/data/batch1/myfile.fa")

process show {
  input:
    file data from input
  output:
    file 'see.out'
  script:
    "cp $data /home/scott/answer"
}
```

However, there is a big difference between the two uses of absolute paths. While it might be more appropriate or useful to pass the first path as a parameter, there is no real problem: Nextflow will transparently stage the input files to the working directories as appropriate (generally using hard links). But the second, hard-coded file will cause failures when we try to use Docker.

## 4. Nextflow + [Docker](https://www.nextflow.io/docs/latest/docker.html) & [Singularity](https://www.nextflow.io/docs/latest/singularity.html) Containers

Light-weight virtualisation abstraction layer:
- Currently runs on Unix-like systems (e.g., Linux, macOS).
- Windows support coming...
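Before looking at pulling images, it helps to see how a container is wired into a workflow. The usual way is through `nextflow.config`, which Nextflow reads automatically from the launch directory. A minimal sketch, assuming the `h3agwas-plink` image used in the pull examples (any image name would do):

```nextflow
// nextflow.config -- minimal Docker setup (sketch; image name is illustrative)
docker {
    enabled = true                // run every process inside a Docker container
}

process {
    container = 'quay.io/banshee1221/h3agwas-plink'
}
```

With this in place, an unmodified `nextflow run` will execute each process inside the named container, which is why hard-coded host paths (as in the previous section) break: they are not visible inside the container.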
You can create Docker/Singularity images locally, or get them from repositories.

Getting Docker images from repositories:

```bash
docker pull ubuntu
docker pull quay.io/banshee1221/h3agwas-plink
```

Getting Singularity images from repositories:

```bash
singularity pull docker://ubuntu
singularity pull docker://quay.io/banshee1221/h3agwas-plink
```

Running Docker:

```bash
docker run
```