Factors affecting the alignment of RNA-seq reads

Even though mapping RNA-seq reads to a reference genome or transcriptome seems intuitive at first glance, many factors can complicate the alignment. Because each aligner outperforms the others only in certain situations, it helps to know which issues you may face during alignment, which of them need careful handling, and which can safely be ignored. For example, if your experimental material comes from a polyploid species, duplicated genes can be the major obstacle to assigning reads to the correct locus.

Biological factors

  1. Splicing (i.e. splice-aware or not)
  2. Polymorphism
  3. Alternative splicing (i.e. isoforms)
    • splice-aware tools $\neq$ able to distinguish isoforms
  4. Low-complexity sequences (e.g. repetitive regions)
  5. Pseudogenes
  6. Homologous gene families
  7. Pathological splicing (e.g. in disease samples)
  8. Contamination
    • From symbiotic organisms or carry-over (e.g. soil on plant material)
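
Several of the biological factors above (low-complexity sequences, pseudogenes, homologous gene families) cause the same symptom: a read matches more than one place in the reference. The following is a minimal sketch of that effect using exact string matching on a made-up toy sequence. The genome, the reads and the helper name `find_all` are all hypothetical and only serve the illustration; real aligners use indexed, mismatch-tolerant search.

```python
def find_all(genome: str, read: str) -> list[int]:
    """Return every 0-based start position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        # Step forward by one base so overlapping matches are also reported.
        start = genome.find(read, start + 1)
    return positions


# Toy reference (assumption): unique flank + CA-repeat + unique flank.
genome = "ACGTTGCT" + "CA" * 6 + "GGATCCTA"

unique_read = "GTTGCT"   # drawn from the unique left flank
repeat_read = "CACACA"   # drawn from the low-complexity repeat

unique_hits = find_all(genome, unique_read)
repeat_hits = find_all(genome, repeat_read)

print("unique read maps to:", unique_hits)
print("repeat read maps to:", repeat_hits)
```

The read from the unique flank has a single placement, while the read from the repeat has several equally good ones, so its true origin cannot be decided from the sequence alone.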

Technical factors

  1. Paired-end vs. single-end sequencing
  2. Strand-specific or not
  3. Sequencing error
  4. Incomplete annotation
    • e.g. unannotated splice junctions
    • e.g. new isoforms
  5. Handling intron-sized gaps
    • The anchor (the portion of a read on one side of a splice junction) can be very short (e.g. 1 bp!)
  6. Adapter sequences
    • “Adapter read-through” occurs when the DNA fragment is shorter than the read length
  7. Contamination
    • From handling (e.g. RNA from the person doing the library prep can be detected)
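
To see why a very short anchor is a problem, consider a read that crosses an exon-exon junction with only a few bases on the far side. The sketch below builds a made-up gene (the exon and intron sequences are hypothetical) and counts how many mismatches arise if the read is aligned without a gap, i.e. running straight into the intron. The function name `unspliced_mismatches` is an illustrative helper, not part of any aligner.

```python
def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Toy gene (assumption): the intron starts with GT and ends with AG.
exon1 = "ATGGCCATTG"
intron = "GTAAGTCCTTTCAG"
exon2 = "GTGCTGAAC"
genome = exon1 + intron + exon2


def unspliced_mismatches(overhang: int, read_len: int = 10) -> int:
    """Mismatches when a junction-spanning read is aligned without a gap.

    `overhang` is the number of read bases that come from exon2 (the anchor).
    """
    if not 0 < overhang < read_len:
        raise ValueError("overhang must be between 1 and read_len - 1")
    left = read_len - overhang
    # The true read: end of exon1 joined directly to the start of exon2.
    read = exon1[-left:] + exon2[:overhang]
    # Ungapped placement: same start, but continuing into the intron.
    start = len(exon1) - left
    return mismatches(read, genome[start:start + read_len])


for k in (1, 4):
    print(f"anchor of {k} bp -> {unspliced_mismatches(k)} mismatch(es) if unspliced")
```

In this toy case a 1 bp anchor aligns into the intron with no mismatches at all, so the spliced and unspliced placements are indistinguishable, whereas a 4 bp anchor already produces mismatches that hint at the junction.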

Other factors you may want to take into account when choosing aligners:

  1. Speed
  2. Computational burden (e.g. memory usage)