Po-Kai's Note

A Ph.D. student's random writing



Running Conda/Bioconda on UCD FARM Cluster (II): Installing Packages Manually by `conda install`

Posted on 2020-11-02 | In FARM, Conda

Conda standardizes the installation of a wide variety of programs and provides two ways to install packages:
(a) installing packages manually with conda install
(b) installing through a YAML file.

This post covers the first method.

What is a Conda environment, and why use it?

Before diving into how to use Conda to install packages, let’s talk about some aspects of the “environment”.

As mentioned in the previous post, Conda has two roles: (1) a package manager for installing programs and controlling their versions and (2) an environment manager. Environment management is especially important for developers because they have to be confident that their applications run smoothly in the different environments (e.g. different dependencies, software versions, or operating systems) used by their colleagues or clients. So you might ask: does the environment really matter to me if I am just a user of packages?

Well, a good understanding of environments will help you avoid some pitfalls when using Conda, and Conda environments can become a great tool for increasing the reproducibility of your analyses. So first, let’s take a look at how Conda acts as an environment manager.

After installation, the installer, either Anaconda or Miniconda, creates a “base” environment that contains a set of packages. All of these packages are ready to use in the base environment, which is quite handy most of the time. However, if you are going to add more tools through Conda, especially very old or very new ones, the settings of the base environment may not fit their requirements. Conda provides a simple way to create an isolated environment that you can use without affecting other environments. You can think of the base environment as the bench in your lab, where you can do most of your experiments; for some special analyses, say, radioactive experiments, you have to go to a dedicated room with particular equipment.

Even for bioinformatics programs that you consider regularly used and up to date, it is still not a good idea to put all of them in the base environment, because dependency conflicts can easily undermine reproducibility. For example, newly installed software may update some of the existing packages in your base environment, and this change may alter the behavior of existing programs so that you can no longer reproduce the results you got a month ago. It would be really painful to have to figure this out when your PI asks what caused the difference… Besides, the official Conda documentation mentions that packages with similar filenames that serve similar purposes may cause problems if they are installed in the same environment. Therefore, create isolated environments for your bioinformatics tools! This also provides an additional benefit: you can export the environment of a specific pipeline for a colleague so that they can run the pipeline without worrying about the setup. This will save both your life and your colleague’s.
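
As a rough sketch of that export step, assuming only the standard `conda env export` and `conda env create` subcommands: the helper names `export_env` and `import_env` and the environment name `rnaseq_pipeline` are made up for illustration and are not part of Conda.

```shell
# Hypothetical helpers; only the conda subcommands inside them are real.

# Write the package list of an existing environment to a YAML file.
# $1 = environment name, $2 = output file
export_env() {
  conda env export --name "$1" --file "$2"
}

# Recreate an environment from such a YAML file (e.g. on a colleague's account).
# $1 = YAML file produced by export_env
import_env() {
  conda env create --file "$1"
}

# Usage (hypothetical names):
#   export_env rnaseq_pipeline rnaseq_pipeline.yml   # you
#   import_env rnaseq_pipeline.yml                   # your colleague
```

The exported file pins the versions that were installed at export time, which is what makes the colleague's copy comparable to yours.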

Here is a very good article for anyone interested in more details of Conda environment management. The author uses very vivid comparisons to explain how Conda works.

Get your tools manually by conda install

Goal: Create a new environment called “aligners”, and install HISAT2 and STAR, two aligners used in my project, in the aligners environment.
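
One way this goal could be sketched, assuming Conda is already set up and that the Bioconda package names `hisat2` and `star` are what you want. The function name `setup_aligners_env` is made up for illustration, and passing the channels with `-c` is just one option (they can also be configured once, globally).

```shell
# Hypothetical helper: create an empty environment, then install the two
# aligners into it by name (no need to activate it first).
# $1 = environment name (defaults to "aligners")
setup_aligners_env() {
  env_name="${1:-aligners}"
  conda create --yes --name "$env_name" &&
  conda install --yes --name "$env_name" -c conda-forge -c bioconda hisat2 star
}

# Usage:
#   setup_aligners_env aligners
#   conda activate aligners    # switch to the new environment
#   conda list                 # check that hisat2 and star are listed
```

Installing by `--name` keeps the base environment untouched, which is the whole point of the isolated environment described above.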


Running Conda/Bioconda on UCD FARM Cluster (I): Install and setup Conda/Bioconda

Posted on 2020-10-26 | In FARM, Conda

What is Conda?

Package, dependency and environment management for any language—Python, R, Ruby, Lua, Scala, Java, JavaScript, C/ C++, FORTRAN, and more.

Conda documentation (docs.conda.io/en/latest)

This introduction from the official website clearly describes two important features of Conda:

  1. Version management
  2. Environment control system

With these features, Conda can:

  1. Give the user full control of the software and its dependencies
  2. Reduce the reliance on administrator privileges, especially on a computer cluster
  3. Increase reproducibility across different machines

Why use Conda on the FARM cluster?

Although a list of software has already been set up on FARM, you may still find that the one you need is not available. It could be so new that the cluster managers have not added it to the list yet, or too old, even though you still need it to reproduce some published results. Of course, you could email the admins and request it, but sometimes they will just ask you to install it yourself. I ran into this situation when I was playing with GATK in 2018: there was only GATK3.6 on FARM, but GATK4 had been out for a while and had a full set of best-practice workflows! However, installing software is not fun for me because packages written in different languages have different installation methods, and checking them one by one can be troublesome. Also, regular users cannot use sudo (administrator privileges required) on FARM, which can be quite frustrating if I want to use a package immediately. Furthermore, if I have to run more than one package in a single analysis, going through all the documentation and READMEs to set up the pipeline can be very time-consuming.

Fortunately, Conda can solve all of the problems above that you may encounter on a computer cluster! It standardizes the way programs are downloaded from different platforms, and you can finish the installation without sudo. Additionally, it provides a very helpful way to create and manage software in isolated environments. These features greatly reduce the effort of maintaining tools on FARM.
(Check the Bioconda article to see the inspiring introduction!)

Relationships between Anaconda, Miniconda, Conda, and Bioconda

Several related things include “conda” in their names, which is kind of confusing for newcomers who are not familiar with this family. Here is a brief introduction to each of them:

  1. Anaconda: A distribution of the Python and R languages. It is a huge toolkit for data science and includes a lot of packages after installation. It comes with a package management system, Conda, that lets the user easily install and manage more packages.

  2. Miniconda: A minimal, bootstrap version of Anaconda that ships with fewer packages but still includes Conda!

  3. Conda: The package management system in Anaconda/Miniconda. You can set up the “channels” from which to download the software you are interested in.

  4. Bioconda: A channel for Conda that specializes in bioinformatics software.

    A longer introduction to Conda channels: Here
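
A minimal sketch of setting up channels, assuming the channel order that the Bioconda documentation recommended around the time of writing (check the current Bioconda instructions before copying this). The function name `setup_bioconda_channels` is made up for illustration.

```shell
# Hypothetical helper: register the channels Conda should search.
# Each --add puts the channel at the top of the list, so the last one
# added (conda-forge) ends up with the highest priority.
setup_bioconda_channels() {
  conda config --add channels defaults &&
  conda config --add channels bioconda &&
  conda config --add channels conda-forge
}

# Usage:
#   setup_bioconda_channels
#   conda config --show channels   # inspect the resulting order
```

These settings are written to your user-level Conda configuration, so they only need to be run once per account.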

Here is the relationship between them:
[Figure: the relationship between Anaconda, Miniconda, Conda, and Bioconda]


Using Singularity on UCD FARM Cluster

Posted on 2020-10-10 | In FARM, Containers

Why use Singularity?

Singularity is a container platform and the only one installed on the FARM cluster because it is considered more secure than Docker, the other (more) famous container platform. So, to get a taste of the benefits of using containers in bioinformatics analyses, I played around with Singularity using some tutorial examples.

I would say there are far more tutorials for Docker than for Singularity. In addition, there are many more bioinformatics container images on Docker Hub, a Docker registry. For example, I found no container image for HISAT2 (a popular aligner for RNA-seq) in Sylabs Cloud, the Singularity registry (as of October 2020, when this post was written), but there were tons of HISAT2 container images on Docker Hub.

Fortunately, Singularity can import Docker container images. That means you can still use the images deposited on Docker Hub! For example, try:

singularity run docker://godlovedc/lolcow

There are some warning messages, but it still works, and you can see the cow telling some weird jokes on your screen.
[Figure: screenshot of the lolcow output]
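
If you would rather keep a local copy of the image than fetch it on every run, a sketch along these lines should work, assuming Singularity 3.x. The function names and the output file name `lolcow.sif` are my own choices for illustration; `cowsay` is one of the programs shipped inside the lolcow image.

```shell
# Hypothetical helpers; only the singularity subcommands inside them are real.

# Download the Docker image and convert it to a local Singularity image file.
fetch_lolcow() {
  singularity pull lolcow.sif docker://godlovedc/lolcow
}

# Run a single program from inside the local image.
cow_says() {
  singularity exec lolcow.sif cowsay "$1"
}

# Usage:
#   fetch_lolcow
#   cow_says moo
```

Pulling an existing image this way does not need sudo, unlike building an image from scratch.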


Run Singularity on FARM: Basic

Below, I go through most of the basic commands while running the example provided by Biocontainers. Here is a brief introduction to those commands.

Create a Singularity image

Well, you cannot do this on FARM because it requires the sudo command (system administrators only). Forget it, or do it on your own computer.


Finding suitable aligners for my project

Posted on 2020-07-02 | In RNA-seq, Dry_lab

Owing to the stay-at-home order, I cannot do my wet lab experiments anymore. To use this time effectively, I revisited my Mimulus time-course data and thought through the questions I had listed in my to-do note. I found that I still don’t have good answers to most of them, even though they are frequently brought up in meetings. For example, people always ask something like, “Have you considered using…” or “Why don’t you try…”. However, they rarely give a concrete reason for those suggestions (for example, the pros and cons of the tools they suggest compared with the one I originally used).

Yes, the only way to know whether a program is suitable for my data is to test it myself and tweak the parameters if necessary. However, there are numerous programs, so simply telling others to use some tool without any specific reason or guideline is neither persuasive nor meaningful in a discussion. Of course, if you nodded off during the meeting but unfortunately still have to say something, this is one way to do it.

Therefore, one of my recent goals is to form my own views on those frequently asked questions, especially about how to choose tools. In this post, I will record some thoughts on choosing RNA-seq aligners.

First, let’s see how many aligners are available now. You can find an astonishing summary figure (i.e. a super long one!!!!) right in the middle of this article, which covers a topic very relevant to my post: “What is the best NGS alignment software”. Since it was written in 2016, I would say there have been some updates since then (i.e. more aligners! Anyone interested in the exact number can check the gigantic list on Wiki).

Nonetheless, don’t worry! This list can be narrowed down to a length that I can handle. Let’s consider some criteria:

  1. It should be splice-aware, because I want to align the RNA-seq reads to the genome.
  2. It is still broadly used and maintained by the authors.
  3. I prefer tools with a lighter computational load and higher speed.

Then, I checked the latest articles comparing RNA-seq aligners for convincing and up-to-date evidence. Surprisingly, only a few articles address this question, and the latest two are:

2016, Baruzzo et al. “Simulation-based comprehensive benchmarking of RNA-seq aligners”

2019, Raplee et al. “Aligning the Aligners: Comparison of RNA Sequencing Data Alignment and Gene Expression Quantification Tools for Clinical Breast Cancer Research”

I recommend using Baruzzo’s article as the guideline for choosing aligners and tweaking parameters, because it presents a comprehensive investigation with a good experimental design. Moreover, the authors kindly provide instructions for parameter tweaking in the supplementary information! On the other hand, although Raplee’s article was published later, it has some flaws and compares only HISAT2 and STAR. I will discuss that article in another post.


Factors affecting the alignment of RNA-seq reads

Posted on 2020-06-19 | In RNA-seq, Dry_lab

Even though mapping RNA-seq reads to a reference genome or transcriptome seems intuitive at first glance, many factors can complicate the alignment procedure. Since one program can outperform the others in certain situations, it is helpful to know what we may face during alignment and to figure out which factors should be handled with care and which can be safely ignored. For example, if your experimental material is a polyploid species, the duplicated genes could be the major obstacle to assigning reads to the correct locus.

Biological factors

  1. Splicing (i.e. splice-aware or not)
  2. Polymorphism
  3. Alternative splicing (i.e. isoforms)
    • splice-aware tools $\neq$ able to distinguish isoforms

Does there exist a "best" pipeline for RNA-seq analysis?

Posted on 2020-06-09 | In RNA-seq, Dry_lab

Well, the short answer is “NO”.
Nevertheless, I’d still like to write down how I arrived at this answer :)


20200428 The first post!

Posted on 2020-04-28

The FIRST post!

Grand opening!

Timo is super cute

The Legend of Zelda is fun

I don’t want to write papers… ugh, so annoyed

Hello World

Posted on 2020-04-28

Welcome to Hexo! This is your very first post. Check documentation for more info. If you get any problems when using Hexo, you can find the answer in troubleshooting or you can ask me on GitHub.

Quick Start

Create a new post

$ hexo new "My New Post"

More info: Writing

Run server

$ hexo server

More info: Server

Generate static files

$ hexo generate

More info: Generating

Deploy to remote sites

$ hexo deploy

More info: Deployment

© 2020 Po-Kai