Factors affecting the alignment of RNA-seq reads

Even though mapping RNA-seq reads to a reference genome or transcriptome seems straightforward at first glance, many factors can complicate the alignment procedure. Since one aligner can outperform the others in certain situations, it helps to know what we may face during alignment, and to figure out which factors need to be handled with care and which can safely be ignored. For example, if your experimental material comes from a polyploid species, the duplicated genes can make it hard to assign reads to the correct locus.

Biological factors

Splicing (i.e. whether the aligner is splice-aware or not)
Polymorphism
Alternative splicing (i.e. isoforms)
- splice-aware tools $\neq$ able to distinguish isoforms
Low-complexity sequences (e.g. repetitive regions)
Pseudogenes
Homologous gene families
Pathological splicing (e.g. in disease samples)
Contamination
- From symbiotic organisms or carry-over material (e.g. soil on plant material)
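
The point above that splice-aware tools are not automatically able to distinguish isoforms can be illustrated with a small sketch. It assumes a hypothetical gene with two isoforms that differ by one skipped exon; the coordinates, the isoform names and the helper functions are all made up for illustration and do not come from any particular aligner. A read is treated as a list of aligned blocks, and an isoform is "compatible" if the blocks fit its exons and junctions.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) genomic coordinates

# Hypothetical gene: isoform_B skips the middle exon of isoform_A.
ISOFORMS: Dict[str, List[Interval]] = {
    "isoform_A": [(100, 200), (300, 400), (500, 600)],
    "isoform_B": [(100, 200), (500, 600)],
}


def is_compatible(read_blocks: List[Interval], exons: List[Interval]) -> bool:
    """Check whether the aligned blocks of one read fit an isoform's exon chain."""
    first_start, first_end = read_blocks[0]
    # Find the exon that contains the first aligned block.
    for i, (ex_start, ex_end) in enumerate(exons):
        if ex_start <= first_start and first_end <= ex_end:
            break
    else:
        return False
    last = len(read_blocks) - 1
    for j, (b_start, b_end) in enumerate(read_blocks):
        k = i + j  # consecutive blocks must use consecutive exons
        if k >= len(exons):
            return False
        ex_start, ex_end = exons[k]
        if not (ex_start <= b_start and b_end <= ex_end):
            return False
        if j < last and b_end != ex_end:
            return False  # a block before a junction must end at the exon end
        if j > 0 and b_start != ex_start:
            return False  # a block after a junction must start at the exon start
    return True


def compatible_isoforms(read_blocks: List[Interval],
                        isoforms: Dict[str, List[Interval]]) -> List[str]:
    """Return the names of all isoforms the read could have come from."""
    return [name for name, exons in isoforms.items()
            if is_compatible(read_blocks, exons)]


# A read inside the shared first exon: aligned correctly, but uninformative.
print(compatible_isoforms([(120, 170)], ISOFORMS))
# A read spanning the exon 1 / exon 3 junction: only the skipping isoform fits.
print(compatible_isoforms([(180, 200), (500, 530)], ISOFORMS))
```

In this toy case the first read is placed correctly on the genome yet is compatible with both isoforms, so deciding between them needs junction-spanning reads or a separate quantification step.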

Technical factors

Paired-end vs. single-end sequencing
Strand-specific or not
Sequencing error
Incomplete annotation
- e.g. unannotated splice junctions
- e.g. new isoforms
Intron-sized gaps
- The anchor sequence (the part of a read that extends across a splice junction) can be very short (e.g. 1 bp!)
Adapter sequences
- “Adapter read-through” occurs when the DNA fragment is shorter than the read length
Contamination
- From the handling procedures (e.g. RNA from the person doing the library prep can end up in the data)
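
Two of the technical factors above, adapter read-through and very short junction anchors, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the sequencer is modelled as simply reading a fixed number of bases from the fragment and then into the adapter, the adapter string is only an example, and the intron and exon sequences are invented (the intron starts with GT, as canonical introns do).

```python
ADAPTER = "AGATCGGAAGAGC"  # example adapter string, used for illustration only


def sequence_read(fragment: str, adapter: str, read_length: int) -> str:
    """Toy model: read `read_length` bases from the fragment, then the adapter."""
    return (fragment + adapter)[:read_length]


def adapter_bases_in_read(fragment: str, adapter: str, read_length: int) -> int:
    """Number of adapter bases at the 3' end of the read in the toy model."""
    return max(0, min(read_length - len(fragment), len(adapter)))


def short_anchor_is_ambiguous(intron_start: str, next_exon_start: str,
                              anchor_len: int) -> bool:
    """True if the anchor matches the intron as well as it matches the next exon.

    In that case an unspliced alignment (running into the intron) explains
    the read exactly as well as the spliced one.
    """
    return intron_start[:anchor_len] == next_exon_start[:anchor_len]


# Fragment (10 bp) shorter than the read (15 bp): 5 adapter bases are read.
short_fragment = "ACGTACGTAC"
print(sequence_read(short_fragment, ADAPTER, 15))          # ACGTACGTACAGATC
print(adapter_bases_in_read(short_fragment, ADAPTER, 15))  # 5

# Fragment (20 bp) longer than the read: no adapter in the read.
long_fragment = "ACGTACGTACGGTTCCAAGG"
print(adapter_bases_in_read(long_fragment, ADAPTER, 15))   # 0

# Invented junction: the intron starts with GT, the next exon starts with GC.
print(short_anchor_is_ambiguous("GTAAGT", "GCTTCA", 1))    # True
print(short_anchor_is_ambiguous("GTAAGT", "GCTTCA", 2))    # False
```

In this sketch a 1 bp anchor cannot distinguish the spliced from the unspliced placement, while a 2 bp anchor already can; by the same logic, untrimmed adapter bases at the 3' end of a read will not match the reference and have to be trimmed or soft-clipped.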

Other factors you may want to take into account when choosing an aligner:

Speed
Computational burden