Things I wished I had known when I started using snakemake

Happy new year and all of that!

I am still VERY late in catching up with papers. This is not helped by the fact that I’ve changed devices (again) and I am now on a Dell machine, with a full GNOMEian experience worthy of a separate entry. Paper newsletters will resume. For real. In the meantime, here goes a post with a few tips for using Snakemake.

I’m making another workflow! This time everything is done using Snakemake. I have a very basic knowledge of python; so far I have mostly managed with bash and R. But since 1. I must learn python better because everybody in the lab uses python-based tools, and 2. people (myself included) are in need of a more automated procedure to go from raw to processed data, I thought a snakemake project could be the perfect excuse to tackle both issues.

And I was not wrong. The only problem is, the learning curve can be really steep if one has never worked with python and/or workflows. There is a real change in the way of thinking compared to other things like scripting and creating functions. During my month and a half learning and tackling stuff with snakemake, I have come across a number of problems that can be extra difficult to solve without some prior knowledge. Below are some random thoughts that might be of help for the (un)initiated. My idea is to provide food for thought at an early stage of getting in touch with snakemake, even though most of this is mentioned one way or another in the official documentation and the endless stream of tutorials and stackoverflows. Also bear in mind that I am still learning!

  • Here is a very basic conceptual guide to snakemake using sandwiches (yes!)
  • To start off, go check the official snakemake workflow repository for inspiration or to not reinvent the wheel! This is how I learned the most during my first days, just reading a bunch of workflows and getting familiar with the structure. A great example is this: https://github.com/snakemake-workflows/rna-seq-kallisto-sleuth
  • One of the first good practices observable in these workflows is the heavy use of config files and separate .smk files. The config file should be the cornerstone of user input, instead of passing things manually when calling snakemake. Here the user should be able to configure most if not all of the parameters relevant for each of the pieces of code that are run in the workflow. The config is also a good place to add static variables such as the species name, the genome build version, the run name, the project name, the maximum number of CPUs allowed, etc. These can later be accessed in a rule with something like genomebuild=config["params"]["genomebuild"]. Get familiar with YAML files and use the config file a lot!
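For instance, a minimal sketch of what this could look like (the file layout, key names and the aligner command are placeholders I made up for illustration):

# config/config.yaml (hypothetical keys)
params:
  runname: "hsymtest1"
  genomebuild: "GRCh38"
  max_cpus: 16

# In the Snakefile / an .smk file
configfile: "config/config.yaml"

rule align:
	input:
		"01_trimmed/{sample}.fq.gz"
	output:
		"02_bams/{sample}.bam"
	params:
		genomebuild=config["params"]["genomebuild"]
	threads: config["params"]["max_cpus"]
	shell:
		'''
		some_aligner --genome {params.genomebuild} --threads {threads} {input} > {output}
		'''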
  • The .smk files are nothing more than sub-files to keep your workflow tidy. The Snakefile can be a gathering point from which to start the workflow, with a collection of paths to the .smk files and a simple rule all that specifies which output files to collect from the workflow. To make the workflow see other .smk files, simply add them as: include: "path/to/your/smkfile.smk" .
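A minimal Snakefile along these lines (the rule file names are invented, and I assume the config holds a list of samples) could be:

configfile: "config/config.yaml"

include: "rules/common.smk"
include: "rules/trimming.smk"
include: "rules/mapping.smk"

# rule all only lists the final files we want; snakemake works out the rest
rule all:
	input:
		expand("02_bams/{sample}.bam", sample=config["samples"])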
  • Among these, another important file is common.smk . Although not strictly necessary (it is just a convention), this is where the internal functions of the workflow are catalogued and defined, as well as where the information in the config file is passed to static variables, turned into objects to play with, etc. Use this too!
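As a sketch, a common.smk could hold something like this (the sample table, its columns and the helper function are assumptions on my part):

# rules/common.smk (hypothetical sketch)
import pandas as pd

# static variables pulled from the config once, for convenience
runname = config["params"]["runname"]
genomebuild = config["params"]["genomebuild"]

# a sample table that drives the wildcards of the workflow
samples = pd.read_csv(config["samples_table"], sep="\t").set_index("sample", drop=False)

def get_fastqs(wildcards):
	# input function: return the two fastq files of a given sample
	return samples.loc[wildcards.sample, ["fq1", "fq2"]].tolist()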
  • At the beginning I got the concepts of inputs for rules and inputs for programs mixed up. Remember that {input} and {output} are first and foremost snakemake’s way to work out which rules connect to which rules. Thus they do not strictly need to appear in the actual command that a rule runs, meaning you do not need explicit mentions like program {input} > {output} in the shell/run section of a rule. While it is true that we often want a program to take in whatever came out of another program, so the concepts of rule inputs and program inputs can overlap, we must remember these are technically separate things. This means we can use inputs and outputs to make rules wait for each other even if there is no apparent connection between them. While this is mentioned in the documentation, it has far more implications than it seems. Sometimes we are interested in the summary or report of a program and not in its actual output. We can define the report as the output of the rule that generates it, and also as the input of a downstream rule that operates on it.
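A hedged sketch of that idea (the tool, the report contents and the paths are invented): the report is what connects the two rules, and the filtered bam never appears in the downstream command.

rule filter_reads:
	input:
		"02_bams/{sample}.bam"
	output:
		bam="03_bams/{sample}_filt.bam",
		report="logs/{sample}_filter_report.txt"
	shell:
		'''
		filter_tool {input} --out {output.bam} --report {output.report}
		'''

rule summarize_filtering:
	input:
		# declaring the report as input makes this rule wait for filter_reads,
		# even though the bam is never touched here
		"logs/{sample}_filter_report.txt"
	output:
		"results/{sample}_filter_summary.txt"
	shell:
		'''
		grep "reads kept" {input} > {output}
		'''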
  • You can specify more than one type of file in input and output:
rule step1:
	input:
		file1="path/to/file1",
		file2="path/to/file2"
	(...)
	shell:
		'''
		command -input1 {input.file1} -input2 {input.file2} ...
		'''

This can also be done without giving them names, in which case they can be accessed as list elements:

rule step1:
	input:
		"path/to/file1",
		"path/to/file2"
	(...)
	shell:
		'''
		command -input1 {input[0]} -input2 {input[1]} ...
		'''
  • Rules can also lack the input or the output (or both), depending on what you need them to do. This opens up a bunch of potential uses too, like the well-known rule all that collects everything you want the pipeline to produce while generating nothing itself, but also rules that must run on their own, such as a command to download and index a database (which a priori does not depend on anything from the user), etc.
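For example, a download rule with no input at all (the URL key and paths are made up):

rule download_genome:
	output:
		"resources/genome.fa.gz"
	params:
		url=config["params"]["genome_url"]
	shell:
		'''
		wget -O {output} {params.url}
		'''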
  • Wildcard constraints are somewhat underrated given their importance. While there is a lot of emphasis on the use of wildcards as a good snakemake practice to propagate the workflow to all your input files/samples/etc., the constraints that help define what must be matched are a bit more obscure in the documentation. It is all nice and good if you have a filename like sample_date_pairend.fq.gz with all its different units of meaning separated as wildcards. But it can (and will) happen that more info gets added in more wildcards, or the pieces get reshuffled, etc. One way to guide snakemake towards what it needs to look for is to define the necessary composition of each wildcard as a regular-expression constraint. For example, the wildcard {pairend} can be constrained to be either just 1 or 2, and this will prevent snakemake from mistakenly reading any other, more complex string as the {pairend} wildcard.
  • When using wildcard constraints, make 100% sure that your regexp of choice can retrieve the intended text (check this manually or on a website like regexr.com). Neglecting this can lead to weird errors where the wildcards are properly identified in the steps that use expand to collect inputs, but not in the actual rule that spits out the specified files. E.g. if we want to use a wildcard {nreads} to indicate a number of reads (e.g. 10000, 250000, …), we should not only specify the character range "[0-9]", but "[0-9]+", where the + means one or more numeric characters. I’ve spent way too much time on a bug related to this.
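Something along these lines (seqtk is just an example command here, and the directory layout is invented):

# global constraints: {pairend} can only be 1 or 2, {nreads} one or more digits
wildcard_constraints:
	pairend="[12]",
	nreads="[0-9]+"

rule subsample:
	input:
		"00_raw/{sample}_{pairend}.fq.gz"
	output:
		"01_subsampled/{sample}_{nreads}reads_{pairend}.fq.gz"
	shell:
		'''
		seqtk sample {input} {wildcards.nreads} | gzip > {output}
		'''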
  • Be careful when using expand(). This function builds a list of desired files by expanding wildcards according to the values passed as arguments to the expand function. This can be used to nicely collect a range of desired inputs for a rule in order to tie everything together in something like a final step, the rule all, etc. But we tend to confuse expand with wildcard propagation. Expand has two main ways of working: in rules without output, it serves to collect all the files as specified by wildcard propagation. In rules WITH output, it generates a list of arguments joined together, separated by spaces. Thus, expand() should not be used when you want to run one instance of the rule per file. Contrary to initial belief, the output of expand is not parsed one by one by snakemake as when propagating wildcards. Instead, the output of expand is a whole list that, if specified within the context of the rule that will be run, is passed as a whole. For example, if we have:
expand("03_bams/{runname}_{organism}_{sublib}_B4_tagged_FINAL_rRNAfilt.bam", runname=runname, organism=organism, sublib=get_sublibs())

This will be something like:

["03_bams/hsymtest1_hydractinia_A_B4_tagged_FINAL_rRNAfilt.bam","03_bams/hsymtest1_hydractinia_B_B4_tagged_FINAL_rRNAfilt.bam","03_bams/hsymtest1_hydractinia_C_B4_tagged_FINAL_rRNAfilt.bam","03_bams/hsymtest1_hydractinia_D_B4_tagged_FINAL_rRNAfilt.bam","03_bams/hsymtest1_hydractinia_E_B4_tagged_FINAL_rRNAfilt.bam"]

And, instead of snakemake doing something like:

command -input 03_bams/hsymtest1_hydractinia_A_B4_tagged_FINAL_rRNAfilt.bam -output ...
command -input 03_bams/hsymtest1_hydractinia_B_B4_tagged_FINAL_rRNAfilt.bam -output ...
command -input 03_bams/hsymtest1_hydractinia_C_B4_tagged_FINAL_rRNAfilt.bam -output ...
command -input 03_bams/hsymtest1_hydractinia_D_B4_tagged_FINAL_rRNAfilt.bam -output ...
command -input 03_bams/hsymtest1_hydractinia_E_B4_tagged_FINAL_rRNAfilt.bam -output ...

It would understand:

command -input 03_bams/hsymtest1_hydractinia_A_B4_tagged_FINAL_rRNAfilt.bam 03_bams/hsymtest1_hydractinia_B_B4_tagged_FINAL_rRNAfilt.bam 03_bams/hsymtest1_hydractinia_C_B4_tagged_FINAL_rRNAfilt.bam 03_bams/hsymtest1_hydractinia_D_B4_tagged_FINAL_rRNAfilt.bam 03_bams/hsymtest1_hydractinia_E_B4_tagged_FINAL_rRNAfilt.bam -output ...

There might be occasions where we are interested in something like this; for example, with tools such as kallisto or salmon, where arguments can be passed down as pairs of positional arguments (e.g. kallisto (...) read1.fq read2.fq ). But overall, in my experience, I have ended up refraining from using expand in this way (you can specify multiple named inputs and outputs instead; see above).
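To make the contrast concrete, here is a hedged sketch of the split I ended up with: plain wildcards in the per-file rule, and expand only in the aggregation step (the filtering command is invented; the file names follow the example above):

# one instance of this rule runs per sublib, thanks to the wildcards
rule filter_rrna:
	input:
		"03_bams/{runname}_{organism}_{sublib}_B4_tagged_FINAL.bam"
	output:
		"03_bams/{runname}_{organism}_{sublib}_B4_tagged_FINAL_rRNAfilt.bam"
	shell:
		'''
		filter_rrna_tool {input} > {output}
		'''

# expand only appears where we want the whole list at once
rule all:
	input:
		expand("03_bams/{runname}_{organism}_{sublib}_B4_tagged_FINAL_rRNAfilt.bam",
			runname=runname, organism=organism, sublib=get_sublibs())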

  • A special one: if the initial input files of your workflow do not follow a naming scheme, use DICTIONARIES to change their names and adapt them to wildcards. For this, what I did was: i) make a parseable table with the path of each input file in one column, and a series of columns with the important bits of information that can be used as wildcards; ii) define a function in your Snakefile or common.smk file that takes a path present in the table and returns the corresponding column values, sorted; and iii) define a dictionary in the Snakefile/common.smk with the new paths as keys (using a comprehension and the function from step ii) and the original paths as values. This dictionary can then be called in the input step using a lambda function. I will detail this in the future.
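Very roughly, and with every file name, column name and command being an assumption of mine, the idea looks like this:

# common.smk (hypothetical sketch)
import pandas as pd

# i) a table with one row per input file: columns path, sample, pairend
units = pd.read_csv("config/units.tsv", sep="\t")

# ii) given an original path, build the tidy, wildcard-friendly name
def tidy_name(path):
	row = units.loc[units["path"] == path].iloc[0]
	return f"00_raw/{row['sample']}_{row['pairend']}.fq.gz"

# iii) dictionary: tidy name -> original path on disk
renaming_dict = {tidy_name(p): p for p in units["path"]}

# the dictionary is then called in the input via a lambda
rule link_raw:
	input:
		lambda wildcards: renaming_dict[f"00_raw/{wildcards.sample}_{wildcards.pairend}.fq.gz"]
	output:
		"00_raw/{sample}_{pairend}.fq.gz"
	shell:
		'''
		ln -s $(realpath {input}) {output}
		'''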
  • Quoting marks in scripts that use variable expansion, or in awk/perl commands, can be a pain, because the string of commands goes through several rounds of interpretation before it is run. The best practice is to specify the whole awk string in a params element of the rule (using r"""STRINGHERE"""), and then quote it the same way we do with {input} or {output}: awk {params.awk:q} {input} > {output} , where :q makes snakemake handle the quoting properly. Regardless of this, remember to escape curly brackets by doubling them, otherwise they could be misinterpreted as unknown wildcards: ${num_reads} would become ${{num_reads}}.
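A sketch of such a rule (the awk program and the file names are made up), following the approach in the carpentries lesson linked below:

rule count_long_reads:
	input:
		"01_trimmed/{sample}.fq"
	output:
		"04_stats/{sample}_longreads.txt"
	params:
		# raw string, braces doubled so snakemake does not read them as wildcards
		awk=r"""BEGIN{{count=0}} NR%4==2 && length($0)>100 {{count++}} END{{print count}}"""
	shell:
		'''
		awk {params.awk:q} {input} > {output}
		'''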
  • If working with snakemake and R almost exclusively, you can use optparse to import variables into the R scripts, passing them as command-line flags from the rule.
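On the snakemake side this could be as simple as the following (the script path and flag names are invented; inside deg_analysis.R they would be parsed with optparse):

rule differential_expression:
	input:
		counts="04_counts/{runname}_counts.tsv"
	output:
		"05_deg/{runname}_deg.tsv"
	params:
		species=config["params"]["species"]
	shell:
		'''
		Rscript scripts/deg_analysis.R --counts {input.counts} --species {params.species} --out {output}
		'''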
  • And, so far, last but not least: as a good rule of thumb, remember to write your rules so they run for any one file, not for every file.

References and links I’ve found useful over time:

https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards
https://snakemake.readthedocs.io/en/stable/tutorial/additional_features.html#constraining-wildcards
https://carpentries-incubator.github.io/snakemake-novice-bioinformatics/13-quoting/index.html
https://stackoverflow.com/questions/72701330/missing-input-files-error-for-expanded-input-in-snakemake
https://edwards.flinders.edu.au/wildcards-in-snakemake/

Acknowledgements:

@Khalidtab@universeodon.com for tips on R optparse(), the expand() function, and config files.
