Does there exist a "best" pipeline for RNA-seq analysis?

Well, the short answer is “NO”.
Nevertheless, I’d still like to write down how I finally arrived at this answer :)

RNA-seq is an extremely flexible tool for studying the transcriptome, and it is becoming more and more popular in every field of biology. To decipher the information encoded in these gigantic amounts of nucleic acid sequence, bioinformaticians have put great effort into developing data analysis pipelines. However, this large and growing number of programs and pipelines also makes it hard for newcomers to choose suitable tools for their own work. Imagine: there are at least 4 steps in a differential gene expression (DGE) analysis (trimming, aligning, quantification, and assessing DGE). If there are 3 different programs for each step (a big underestimate!), then there will be 3 × 3 × 3 × 3 = 81 combinations!
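To make the arithmetic concrete, here is a minimal Python sketch that enumerates every pipeline you could build from 3 candidate programs per step. The tool names are made-up placeholders, not real programs or recommendations.

```python
from itertools import product

# Hypothetical candidates for each step (placeholder names, for illustration only)
steps = {
    "trimming": ["trimmer_A", "trimmer_B", "trimmer_C"],
    "aligning": ["aligner_A", "aligner_B", "aligner_C"],
    "quantification": ["quantifier_A", "quantifier_B", "quantifier_C"],
    "assessing DGE": ["dge_A", "dge_B", "dge_C"],
}

# Every pipeline is one choice per step, so the total is the Cartesian product
pipelines = list(product(*steps.values()))

print(len(pipelines))  # 3 * 3 * 3 * 3 = 81
print(pipelines[0])    # the first combination, one tool per step
```

Adding a fourth candidate to every step would already give 4 × 4 × 4 × 4 = 256 pipelines, which is why testing everything by brute force is rarely practical.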

Well, some might think, “Maybe I can just follow the procedures described in some good articles and everything will be OK!” This strategy works well for most cases in my daily lab life, but it causes additional confusion in RNA-seq analysis. I found that lots of studies used outdated tools, probably because the researchers did the analyses several years ago and/or followed even earlier articles to handle their data. Moreover, the rapid development of RNA-seq tools forces users to take the versions of the programs and databases into account, because they can make a bigger difference than you might imagine!

Here are some things I learned while chasing the answer:

Bear in mind before you start

  1. NEVER EVER trust your tools!
    The Golden Rule of Bioinformatics. Most tools are developed and benchmarked on a limited number of data sets, sometimes from only a few species. Therefore, things can go wrong when they are applied to your own data. For example, I am working on Mimulus transcriptome data, and the mapping rate was a disaster with the default settings of HISAT2. After carefully reading through the HISAT2 manual, I found that HISAT2 was originally developed for the human transcriptome and that its parameters are very stringent! Therefore, parameter tweaking was required in my case (I will cover this in another note).
  1. There is no universally best workflow suitable for all data sets
    Following on from the previous point, the best overall approach for one data set may be sub-optimal for another. Several factors have to be considered, such as the goal of the research, the subsequent assays, and the available resources/databases.
  1. Experimental design is super important!
    There are lots of articles comparing the existing tools and pipelines. However, none of them can save a badly designed experiment.

“Garbage in, garbage out.”

Since RNA-seq is a high-throughput approach, there are some more factors that have to be taken into account, such as batch effects. In addition, since RNA-seq tests tons of genes at the same time (multiple comparisons), replicates (biological vs. technical) and statistical power should be considered carefully. There are articles discussing how many replicates are needed to detect a desired DGE fold change (e.g. Schurch et al., 2016: “How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use?”).
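To show why multiple comparisons matter, here is a minimal sketch of the Benjamini-Hochberg adjustment, one common way to control the false discovery rate when many genes are tested at once. This is my own illustration with made-up p-values, not the method of any particular DGE tool; in practice the DGE package you choose will normally report adjusted p-values for you.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never increase
    # as the raw p-values get smaller
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four genes (illustration only)
raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"raw p = {p:.3f}  adjusted p = {q:.3f}")
```

With only four tests the adjustment is mild, but with tens of thousands of genes a raw p-value of 0.01 can easily stop being significant, which is one reason replicates and statistical power deserve attention at the design stage.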

  1. Document everything
    Just like keeping a lab notebook in the wet lab, recording your processes and tools helps with troubleshooting and improves reproducibility. This is especially important in the rapidly growing field of bioinformatics.
  • Version of the program
  • Version of the database (genome, annotation, etc.)
  • Your own dataset
  • Scripts and code
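One way to make this a habit is to let a small script collect the items above for you. The following is only a sketch under my own assumptions: the helper names `tool_version` and `build_record` are hypothetical, the database, dataset, and script entries are placeholders you would fill in yourself, and it assumes each tool prints its version when called with `--version` (check the manual of each program).

```python
import json
import platform
import subprocess
import sys
from datetime import date


def tool_version(command):
    """Return the first line a version command prints, or None if it cannot be run."""
    try:
        result = subprocess.run(command, capture_output=True, text=True, check=False)
    except OSError:  # e.g. the program is not installed or not on PATH
        return None
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


def build_record(tool_commands, databases, dataset, scripts):
    """Collect tool versions, database versions, dataset, and scripts in one dict."""
    return {
        "date": date.today().isoformat(),
        "platform": platform.platform(),
        "tools": {name: tool_version(cmd) for name, cmd in tool_commands.items()},
        "databases": databases,
        "dataset": dataset,
        "scripts": scripts,
    }


record = build_record(
    tool_commands={
        "python": [sys.executable, "--version"],
        # Assumes hisat2 is on PATH; recorded as None (null in JSON) otherwise
        "hisat2": ["hisat2", "--version"],
    },
    databases={"genome": "PLACEHOLDER_release", "annotation": "PLACEHOLDER_release"},
    dataset="PLACEHOLDER_description_of_my_reads",
    scripts=["PLACEHOLDER_script_name"],
)
print(json.dumps(record, indent=2))
```

Saving this output next to the results of every run gives you something to go back to when a result changes after an update.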

Determine the research goal before choosing tools

As mentioned above, its flexibility makes RNA-seq a powerful tool, but it also complicates the decision-making process. Therefore, the first step is to determine the eventual goal of the study. Here are several common applications of RNA-seq:

  • Differential expression (maybe the most common usage)
  • Variant calling (of course, it is “sequencing” after all!)
  • Transcript identification (e.g. new isoform discovery)

By adding further factors, such as spatial relationships, non-coding RNAs, inter-organism interactions, and cell types (scRNA-seq!), you can easily make the list longer. If you are interested, check the 2019 article by Van den Berge et al.

Also, consider these questions:

  • Are you going to do any downstream analysis? (i.e. is RNA-seq only an intermediate step, or is it the final result?)
  • What is the downstream analysis?

Putting all this information together, you can decide on the accuracy, speed, and cost you need.

Check the basic information about your organism

For example: genome size, genome variation, ploidy, and available annotation.

It’s time to choose the tools (finally)!

  1. Check the comparison articles published recently
    A PubMed search turns up a bunch of articles comparing RNA-seq tools, and different authors always use different data sets (real data, simulated data, or a hybrid) and strategies to test the tools. To make things more complicated, each step of the data analysis can be examined separately. Therefore, before choosing the tools, poor Ph.D. students first have to choose which articles to believe.
    Since tool development progresses very fast, I prefer to read the most recently published articles first, and check the older ones if I find something that appeals to me. Two of the most comprehensive articles are Sahraeian et al., 2017 (the whole analysis procedure) and Baruzzo et al., 2017 (the aligners). As the authors themselves suggest, this kind of study should be updated regularly because it is a fast-developing field.
    After having some ideas in mind, I like to make a checklist of the programs I am interested in, and narrow it down further by the language/system the program is built on. For example, I prefer to try software that runs in R or on Unix first, as those are the two systems I am familiar with.
  1. Use the well-maintained programs and read the manual
    I always check the date of the latest update when I visit the official website of a program. When do authors update their work? Definitely when someone reports bugs or suggests improvements. Therefore, the release log can be a sign of how popular the program is and whether the authors still maintain it. Just like online games, a program will inevitably have some bugs. Having more people use a program is like running a public test in many different ways, so hidden bugs are more likely to be found. In addition, you may find that your question has already been asked and some kind person has answered it.
    Also, read the manual or follow the tutorial on the official website. The authors are the people most familiar with that program on earth! Even though there are lots of other tutorials shared online by independent users, it is hard for newcomers to judge whether every step is correct or whether some details need special attention.
  1. Try the tools used by your colleagues or cohort
    It’s always good to have someone (especially your PI) to discuss things with or to share scripts with!
  1. Compare the results from different tools
    This is one of the most convincing and popular strategies for analyzing RNA-seq data. So start playing with your data in different ways!!
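As a starting point for such a comparison, here is a minimal sketch that measures how much the lists of differentially expressed genes from different tools overlap, using the Jaccard index (shared genes divided by all genes called by either tool). The tool names and gene IDs are made up for illustration; with real data you would load each tool's list of significant genes instead.

```python
from itertools import combinations


def jaccard(genes_a, genes_b):
    """Jaccard index of two gene collections: |intersection| / |union|."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical DGE calls from three tools (made-up gene IDs)
de_calls = {
    "tool_A": {"gene1", "gene2", "gene3", "gene4"},
    "tool_B": {"gene2", "gene3", "gene4", "gene5"},
    "tool_C": {"gene3", "gene4", "gene6"},
}

# Pairwise agreement between tools
for (name_a, genes_a), (name_b, genes_b) in combinations(de_calls.items(), 2):
    print(f"{name_a} vs {name_b}: Jaccard = {jaccard(genes_a, genes_b):.2f}")

# Genes called by every tool are the most robust candidates
consensus = set.intersection(*de_calls.values())
print(sorted(consensus))
```

Genes that only one tool reports are not necessarily wrong, but they are the ones worth a closer look before any follow-up experiment.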