Finding suitable aligners for my project

Owing to the stay-at-home order, I cannot do my wet-lab experiments anymore. To use this time effectively, I revisited my Mimulus time-course data and thought through the questions I listed in my to-do note. I found that I still don’t have good answers to most of them, even though they are frequently brought up in meetings. For example, people always ask something like “Have you considered using…” or “Why don’t you try…”. However, they rarely give a concrete reason for those suggestions (for example, the pros and cons of the tools they suggest compared with the one I originally used).

Yes, the only way to know whether a program suits my data is to test it myself and tweak the parameters if necessary. However, there are numerous programs, so telling others to use some tool without any specific reason or guideline is neither persuasive nor meaningful in a discussion. Of course, if you nod off during the meeting but unfortunately still have to say something, this is one way to do it.

Therefore, one of my recent goals is to form my own answers to those frequently asked questions, especially about how to choose tools. In this post, I will record some thoughts on choosing RNA-seq aligners.

First, let’s see how many aligners are available now. You can find an astonishing summary figure (a super long one!) right in the middle of this article, which covers a topic very relevant to my post: “What is the best NGS alignment software”. Since it was written in 2016, there have surely been updates since then (i.e. more aligners! Anyone interested in the exact number can check the gigantic list on Wikipedia).

Nonetheless, don’t worry! This list can be narrowed down to a length that I can handle. Let’s consider some criteria:

  1. It should be splice-aware, for I would like to align the RNA-seq reads to the genome.
  2. It is still widely used and maintained by the authors.
  3. I prefer the tools with lighter computational load and higher speed.
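
As a minimal sketch, the first two criteria could be written as a simple filter in Python. The `Aligner` class, the `shortlist` helper, and the flags in the table are illustrative assumptions based on my own reading, not the output of any benchmark. The speed criterion is left out because it needs real timing numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aligner:
    name: str
    splice_aware: bool  # can it align reads across exon-exon junctions?
    maintained: bool    # is it still actively developed? (my own judgement)


# Illustrative entries only; extend or correct them for your own survey.
CANDIDATES = [
    Aligner("STAR", splice_aware=True, maintained=True),
    Aligner("HISAT2", splice_aware=True, maintained=True),
    Aligner("TopHat2", splice_aware=True, maintained=False),
    Aligner("Bowtie2", splice_aware=False, maintained=True),
    Aligner("BWA-MEM", splice_aware=False, maintained=True),
]


def shortlist(candidates):
    """Keep only aligners that pass criteria 1 and 2."""
    return [a.name for a in candidates if a.splice_aware and a.maintained]


print(shortlist(CANDIDATES))
```

The point is only that the long list shrinks quickly once each criterion is written down as a yes/no question.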

Then, I checked the latest articles comparing RNA-seq aligners for convincing and up-to-date evidence. Surprisingly, only a few articles address this question, and the latest two are:

2016, Baruzzo et al. “Simulation-based comprehensive benchmarking of RNA-seq aligners”

2019, Raplee et al. “Aligning the Aligners: Comparison of RNA Sequencing Data Alignment and Gene Expression Quantification Tools for Clinical Breast Cancer Research”

I recommend using Baruzzo’s article as the guideline for choosing aligners and tweaking the parameters, for it reports comprehensive investigations with a good experimental design. Moreover, the authors even kindly provide instructions for parameter tweaking in the supplementary information! On the other hand, although Raplee’s article was published later, it has some defects and only compares HISAT2 and STAR. I will discuss that article in another post.

Baruzzo’s article, 2016

Let’s take a closer look at Baruzzo’s article first. Here are some details of the experimental design:

  • Comparing 14 splice-aware algorithms
  • Analyzing performance at 3 levels:
    • base level
    • read level: most relevant for gene-level quantification
    • junction level: important in reconstructing alternative splicing events
  • Default vs. Optimized parameter settings
  • Simulating data:
    • (1) human and (2) malaria parasite Plasmodium falciparum
    • three levels (T1-T3)
      Level  Substitution rate  Indel rate  Error rate  Example
      T1     0.001              0.0001      0.005       Human
      T2     0.005              0.002       0.01        Model organisms
      T3     0.03               0.005       0.02
  • Other analyses, e.g. multimappers, adapters, and two-pass mode
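
To make the difference between the three levels concrete, here is a toy sketch in Python. The definitions below are simplified assumptions of mine (for example, a read counts as correct if at least one base lands at its true position) and may not match the exact metrics in the paper. The coordinates are made up.

```python
def base_level_accuracy(truth, aligned):
    """Fraction of all bases placed at their true genomic position."""
    total = sum(len(read) for read in truth)
    correct = sum(
        1
        for t_read, a_read in zip(truth, aligned)
        for t_pos, a_pos in zip(t_read, a_read)
        if a_pos is not None and t_pos == a_pos
    )
    return correct / total


def read_level_accuracy(truth, aligned):
    """Fraction of reads with at least one correctly placed base (toy rule)."""
    hits = sum(
        1
        for t_read, a_read in zip(truth, aligned)
        if any(a is not None and t == a for t, a in zip(t_read, a_read))
    )
    return hits / len(truth)


def junction_recall(true_junctions, called_junctions):
    """Fraction of true (donor, acceptor) junctions that were recovered."""
    return len(true_junctions & called_junctions) / len(true_junctions)


# Three toy reads of four bases each; None marks an unaligned (clipped) base.
truth = [[100, 101, 102, 103], [200, 201, 500, 501], [900, 901, 902, 903]]
aligned = [[100, 101, 102, 103], [200, 201, None, None], [50, 51, 52, 53]]

print(base_level_accuracy(truth, aligned))      # 6 of 12 bases are correct
print(read_level_accuracy(truth, aligned))      # 2 of 3 reads have a correct base
print(junction_recall({(201, 500)}, set()))     # the only junction was missed
```

In this example the second read is still counted as correct at the read level, although the aligner clipped the part that crosses the junction. That is why an aligner can look fine for gene-level counting and still do poorly at the junction level.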

And yes, there is a one-line suggestion (I guess that’s what most readers look for first when reading this kind of article) in the discussion section:

“Based on this analysis the most reliable general-purpose aligners appear to be CLC, Novoalign, GSNAP, and STAR.”

However, don’t jump to conclusions! Be aware that this sentence is in the discussion section rather than in the abstract or introduction…

One assumption behind this one-line summary is that you would only use the default settings without any tweaking. According to Figure 3 in the article, some aligners perform pretty well once a few parameters are adjusted. Therefore, you may also want to factor alignment speed into the decision (Figure 4 in the article); for example, HISAT2 runs far faster than the others.

Figure 3 in Baruzzo’s article. Note that parameter tweaking is necessary for some of the aligners (e.g. HISAT2). For others, tweaking is optional but can improve the results (e.g. STAR).

Here are several points I found important and took into account after reading:

  1. In fact, all aligners perform well on the T1 libraries with default settings, except for CRAC.

    • That means you can use whatever you like for high-quality human sequencing data.
    • Even for T2 (certain model organisms with good annotations), most aligners can handle the task well with default settings.
  2. Some aligners are remarkably accurate even with very short anchors and without annotation, and therefore outperform others at the junction level. Thus, if isoform discovery is what you want to do, consider these aligners: HISAT, HISAT2, and ContextMap2.

  3. Some aligners are GUI-based (graphical user interface), and some of them require a licence to use.

  4. With annotations, two-pass mode is not required (saves time!).

  5. Some famous aligners (e.g. TopHat2) are not good choices!

In the end, I chose STAR and HISAT2. Here are some notes on this decision:

  • STAR: Pretty fast and well balanced for most situations. Tweaking can further improve the performance.
  • HISAT2: Super fast, but the tweaking is necessary (I have done this before so it’s OK for me).
  • CLC: GUI; I don’t think my laptop can run it very fast.
  • Novoalign: Licence required.
  • GSNAP: Our computing cluster (FARM) does not have the latest version. The version we have is version 2017-11-15. Maybe I can try it in the future.
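
For reference, here is a minimal sketch of how the two chosen aligners might be invoked for paired-end reads, written as small Python helpers that only build the command lines. The helper names, file paths, and thread count are hypothetical, and only basic options are used; the tuned parameters from Baruzzo’s supplementary information are not reproduced here. Following the point about annotations above, `--twopassMode Basic` is added only when the STAR index was built without annotation.

```python
import shlex


def star_command(genome_dir, read1, read2, prefix, threads=8, annotated_index=True):
    """Build a basic STAR command for paired-end reads (nothing is executed)."""
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", read1, read2,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", prefix,
    ]
    if not annotated_index:
        # Without annotated junctions, a second pass can reuse the junctions
        # discovered in the first pass.
        cmd += ["--twopassMode", "Basic"]
    return cmd


def hisat2_command(index_prefix, read1, read2, out_sam, threads=8):
    """Build a basic HISAT2 command for paired-end reads (nothing is executed)."""
    return [
        "hisat2",
        "-p", str(threads),
        "-x", index_prefix,
        "-1", read1,
        "-2", read2,
        "-S", out_sam,
    ]


# Hypothetical file names, for illustration only.
print(shlex.join(star_command("star_index", "s1_R1.fastq", "s1_R2.fastq", "s1_")))
print(shlex.join(hisat2_command("hisat2_index/genome", "s1_R1.fastq", "s1_R2.fastq", "s1.sam")))
```

The returned lists can be passed to `subprocess.run` on a machine where the aligners are installed. Keeping the commands in one place also makes it easier to record which parameters were tweaked.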

Besides, I also considered BBMap, as some forum posts recommended it and it is still being updated. I wonder why it is not included in Baruzzo’s article, because I know JGI is still using BBMap. So maybe I will try it someday.

Raplee’s article, 2019

This article only compared STAR and HISAT2 with default settings, and I will discuss it in another post. Its most important finding is that HISAT2 with default settings does not handle pseudogenes very well, and that this issue becomes very severe under some circumstances. This finding also implies the importance of analyzing data with more than one tool.


Suggestions on Forums

Reading the discussions on forums (SEQanswers and Biostars) gave me some ideas, as lots of users share their experience and results. However, there are some pitfalls, so keep the following points in mind:

  1. The date of the post.
    • Bear in mind the fast-evolving nature of NGS!
  2. Check more posts by the users who responded to the questions you are interested in.
    • To know whether the person is an expert or just a passerby.
  3. The data set and the organism discussed in the post.
    • The strategy could be very different for distinct experimental materials.
  4. Evidence talks.
    • Go through the example in the post carefully. Some of them may not be very reasonable…

Other articles

As I mentioned above, I was surprised to find so few articles comparing aligners. For anyone interested, here are some older articles. It is worth mentioning that none of them includes HISAT or HISAT2, two of the fastest aligners currently:

2014, Wang et al. “Comparisons and performance evaluations of RNA-seq alignment tools”: TopHat, STAR, MapSplice, and GSNAP

2014, Benjamin et al. “Comparing reference-based RNA-Seq mapping methods for non-human primate data”: (splice-aware aligners) TopHat2 and GSNAP

2013, Engström et al. “Systematic evaluation of spliced alignment programs for RNA-seq data”: (splice-aware aligners) GSNAP, MapSplice, PALMapper, ReadsMap, STAR and TopHat

2012, Fonseca et al. “Tools for mapping high-throughput sequencing data”: a long list

Also, you can check the papers that introduced each aligner, because the authors have to persuade editors that their new tool has some advantage over the others. Besides, some articles did not compare the aligners directly, but ran the entire quantification procedure to do the comparison.