Building a RNA-seq pipeline in python
- Lacey Westphal
- Mar 14, 2017
- 3 min read
Alright, so I'm not very good at following my own timeline. I said this post would be about my failed logistic regression from the Prosper Loan data, but I built a python RNA-seq pipeline today and I had a lot of fun doing it, so I want to share it first.
Let me give you a very brief summary about what 'RNA-seq' is - it's taking all the RNA from a population of cells (remember the RNA is the message that is translated by the ribosome to make proteins) and sequencing it. Then you go through a series of steps and eventually you end up with a file that contains the genes and the number of RNA sequences that fit into it. So you can get an idea of the amount of RNA that was made at a particular time in a population.
You might want this information if you have a drug that works to prevent a disease but you aren't sure how it works. You treat a population of cells with the drug (and you have a population that is untreated) and you can see if the drug causes any of the gene expression in the population to change. That could be a really great way to generate a hypothesis to test further down the line.
I'm doing RNA-seq for a different reason: I want to find out if a specific protein affects the expression of a set of genes.
In order to process the data there are a few steps:
1. Align the sequenced reads (of which there are around 10 million, each 75 base pairs long) to a reference genome. It doesn't do you much good if you don't know what the sequences are, so aligning them to a genome allows you to have the context.
-For this we use a program called TopHat2 (which also uses Bowtie2 and samtools)
2. Once they are aligned to a genome, you want to count up the reads that fit within certain areas of the genome. In my case, I want to know all the reads that fit within annotated genes.
-For this we use a program called HTSeq
All of the processing steps are completed on the command line, so to save some time and prevent mistakes, I decided to write a pipeline in python that everyone in the lab can use. Right now I have nothing very fancy in the script, but I can add on to it as I see fit (as it breaks) in the future.
I used the subprocess package in Python to run my command line programs. I wanted to be able to provide a list of the sequence files and have it run through the pipeline one by one.
A problem I encountered was that the subprocess.Popen(args) would initiate and then the next line of my code would execute, even if the process wasn't completed. That was the purpose of the proc.wait() line in the code. I'm going to paste the code here with comments so you can see what I ended up with. If you have any suggestions to clean this up, please let me know!
Check it out on gist here.


I'm pretty excited to share this with my colleagues. I think it will save us a lot of tedious work.
コメント