Poisson distribution: Lander & Waterman 1988
Below are summaries written by members of the group:
19 Jan 2016
Johnathan Kaiser, UI
Cui P, Lin Q, Ding F, Xin C, Gong W, et al. (2010) A comparison between ribo-minus RNA-sequencing and polyA-selected RNA-sequencing. Genomics 96: 259–265
Discussion revolved around two different protocols for cDNA library preparation using polyadenylated tails to target transcripts and using rRNA depleted RNA libraries through ribo-minus RNA sequencing (rmRNAseq). rRNA depletion involves hybridization of rRNAs from total RNA sample with labeled probes that are washed out; in this scenario the flow through consists of the rmRNA. Poly-A preparation works similarly, however processed transcripts with hybridized poly-A tails are of interest and the flow through is discarded. Cui et al. found that mRNA sequencing generated a higher percentage of transcripts correlated to exonic regions whereas rmRNAseq generates higher amounts of intronic and intergenic transcripts. Additionally, sequencing RNA through the two methods shows different unique transcripts, all of which are expressed at low levels relative to the abundantly shared transcripts. Background read density for both methods was assessed using the Poisson model which tends to be reliable for modeling genomic sample data. Overall, it was concluded that rmRNAseq is likely better for total coverage of the transcriptome (certainly in picking up expression of non-coding RNAs), however most transcripts, and certainly coding transcripts that are abundantly expressed are shared between the two methods.
26 Jan 2016
Marietta Easterling, WSU
This week we reviewed “RNA-Seq: an assessment of technical reproducibility and comparision with gene expression array” by John C. Marion, Christopher E. Mason, Shrikant M. Mane, Matthew Stephens and Yoav Gild. The paper is comparing gene expression using two different sequencing methods, microarrays and RNA-seq. There are several drawbacks to using microarray technology such as: high background noise, hybridization problems to the probe and probe limitation to which genes can be identified. Most microarray chips are very inexpensive but creating the chip is very expensive and as stated above has many limitations. This will be especially true for research done using nonmodel organisms. RNA-seq technology is not subjected to these limitation and can sequence many samples at the same time to get a lot more reads back as well as expression patterns. The researchers tried to use similar tissue preparation protocols so any differences seen would be in the technologies only. The tissue used was liver and kidneys, to look at differential gene expression between these two technologies.
We discussed some of the finer key points of using RNA-seq such as the lack of lane controls used since the paper was published, to save money by creating more space for samples. Lane controls were normally placed on the inner most lanes because of lane effects. Lane effects generally occur on the edges, the paper discussed this effect but found in their own research that there was some land effect but it was very slight and not a major problem.
The result shows that microarray data are comparative but not absolute. Using RNA-seq 4,959 unique differently expressed genes were identified compared to the array, which was only 1,579. Another important feature of RNA-seq is seeing more subtle changes in gene expression such as alternative splicing. In the results RNA-seq data showed that there is alternative splicing in a gene found in both the kidney and liver tissue. Microarray would only show the expression on the gene which may be equal in both tissue but due to alternative splicing may have different functions. Overall the paper shows that there are many differences between these two technologies. Microarray is a relativity cheap way to look at differential gene expression in model organisms. RNA-seq, though more expensive, can give a greater depth in the understanding of gene expression for any organism, including non-model species.
2 Feb 2016
Summary: “Full-length transcriptome assembly from RNA-Seq data without a reference genome”
De novo transcriptome assembly is the construction of transcripts into the collection of all the transcripts for a species that does not have a sequenced reference genome. RNA-seq methods coupled with de novo transcriptome are powerful tools to discover variation in genetic transcription beyond model organisms. The full diversity of life can now be efficiently probed with sequencing technology. However, massively parallel sequencing technology utilizes relatively short sequences of fragments to probe gene transcription. These short fragments provide a computational challenge to assemble an accurate representation of an experimental transcriptome. For organisms with reference genomes sequence alignment bypasses this challenge altogether. Complex mathematical strategies for accurate assembly must be used with de novo assembly however. Trinity is a computer program that efficiently accomplishes assembly with robust and “greedy” searching. Trinity can assemble most of even the lowest abundance transcripts. When compared side-by-side with other mapping-first and assembly-first protocols Trinity’s performance was comparable to mapping-first protocols. Trinity identified splice variations and isoforms in mouse and whitefly (only unfinished reference genome available).
9 Feb 2016
Joshua Brindley, WSU
Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., & Pachter, L. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature protocols, 7(3), 562-578.
This week’s discussion revolved around a suite of programs, Bowtie, TopHat, and Cufflinks, used to analyze differential gene and transcript expression of RNA-seq data. First Bowtie is used to quickly align the short reads of raw data. The TopHat program is used to align reads to a reference genome. An advantage to TopHat is its ability to identify splice variants and reads that span exon junctions. TopHat is limited due to its needs for a quality reference genome and reliance on Illuminia or SOLiD sequence data. Once aligned to the reference the Cufflinks package is used. First Cufflinks assembles the transcripts. Within Cufflinks a program, Cuffcompare compares Cufflink assemblies to the annotated reference to sort out new genens from known genes. Novel genes and transcripts should be validated with cloning and PCR. Once transcripts are assembled Cuffmerge merges the transcript assemblies together for a final transcriptome assembly. A separate program, Cuffdiff, then calculates expression in different replicates and tests the significance of the observed change in expression between replicates. While some transcripts might show differential expression, this could be insignificant in total gene expression. Beyond measuring expression Cuffdiff also identifies genes that are differentially sliced or differentially regulated. The results from Cuffdiff can then be easily visualized with the CummeRbund package. This package is used to create plots of the data and transforms the Cuffdiff data in files compatible with R. Overall Bowtie, TopHat, and Cufflinks are a user friendly set of tools that makes the analysis of large RNA-seq data sets achievable by a broader group of scientists.
Mark Smithson, WSU
Rapaport, F., Khanin, R., Liang, Y., Pirun, M., Krek, A., Zumbo, P., … & Betel, D. (2013). Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol, 14(9), R95.
Last week, we concluded that Tuxedo’s algorithm for differential expression analysis, Cuff-diff, is easy-to use, and effective for quantifying the differential expression of transcripts with multiple isoforms. This week we observed that other widely used programs outperform Cuff diff in a few aspects of differential gene expression quantification (Rappaport et al). Given the same control RNA libraries, Cuff diff’s was outperformed by other widely-used software for differential expression analysis.
We first discussed the three major theoretical steps of differential expression analysis (normalization, statistical modeling and tests for differential expression) and how these steps differ among current software for differential expression analysis. Normalization involves a transformation of the raw read data to remove the effects of fragment length, number of reads, and other technical aspects of sequencing that produce noise. All software evaluated in Rappaport et al performed comparably in the normalization step, despite their different algorithms (TMM, RPKM, FPKM, etc). Our discussion of the statistical modeling of gene expression focused on the poisson and negative binomial distributions. We emphasized that the assumptions of the poisson distribution (mean = variance) are often violated by RNA seq. data in that the variation in expression of a gene often surpasses its mean expression. DESeq and edgeR use the negative binomial distribution which includes a dispersion factor to account for differences in the mean and variance of expression. When reviewing the various statistical tests used for differential expression, we were unsure about how many degrees of freedom to assume.
Finally we reviewed results of an experimental comparison of the different software programs. Although Cuffdiff was the only software able to detect differential isoforms, it had more false positives, poor signal to noise vs. P value correlations and lower AUC scores than the other tested software. DEseq and edgeR performed slightly better with few false positives and consistent AUC sores. However these two programs still showed poor signal-to-noise vs. P value correlations. Limmavoom, Poissonseq and Bayseq all had better signal-to-noise vs. P value correlations but Bayseq and EdgeR require replicated samples. Overall each program showed some advantages and disadvantages. For a complete summary of results see Table 2 of Rappaport et al 2013.
Allison Kolbe, WSU
Crowley JJ, Zhabotynsky V, Sun W, Huang S, et al. Analyses of allele-specific gene expression in highly divergent mouse crosses identifies pervasive allelic imbalance. Nat Genet. 2015. Apr;47(4):353-60. doi: 10.1038/ng.3222.
This week we reviewed a paper utilizing RNA-seq to study allele-specific gene expression. Allele specific gene expression is defined as one allele being expressed more than the other in a heterozygous individual. This study used three inbred mouse lines and made all possible crosses between them to see what effects this genetic diversity would have on allele specific expression. The authors state that allelic imbalance requires presence of regulatory variants acting in cis, and would not result from trans. They performed a principal component analysis on expression data derived from multiple tissue types from each of the strains, as well as brain tissue from parents and F1 crosses. They found that genetic background drives gene expression differences, and notably that F1 crosses formed near perfect intermediates between parental lines in the PCA. Overall, it seemed that strain and parent-of-origin did not have huge effect on allele imbalance.
Additionally, the authors were interested in maternal/paternal effects. Maternally/paternally influenced genes identified by this study were consistent with those previously described in the literature. One cross had a higher density of paternally expressed alleles, indicating a slight strain x parent-of-origin effect here. The authors saw a global bias toward paternal alleles, even when removing the 95 alleles that were positively identified as imprinted in this study. They also found that allelic imbalance is clustered on chromosomes, and that genes with consistent paternal expression are closer to CpG islands. This suggests a potential mechanism for regulation of allele specificity. The paper evaluated the paternal effects of brain tissues in depth, but we discussed whether the same trend would be observed in other tissue types. Finally, we speculated about the evolutionary advantage of these paternal effects – perhaps they provide a mechanism to balance the strength of maternal effects in the event that the maternal effects confer maladaptive traits in a new environment.
Anthony Brown, WSU
Conesa, A, Madrigal, P, Tarazona, S. et al. A survey of best practices for RNA-seq data analysis. Genome Biology 2016 17:13 DOI: 10.1186/s13059-016-0881-8
This week we discussed a paper on the best practices for RNA-seq data analysis. We first reviewed the different types of library preparation techniques for RNA-seq: poly(A) enrichment and ribo minus. Poly(A) enrichment entails capturing polyadenylated transcripts with oligo (dT) beads, whereas a ribo minus preparation entails specifically targeting and removing ribosomal RNA. While poly(A) enrichment leads to more fragments that map back to gene regions, ribo minus should be used if other RNAs (like microRNAs or long non-coding RNAs) are of interest. There are other situations where ribo minus has a clear advantage as well. Because bacterial transcripts are not polyadenylated, ribo minus should be used for bacterial transcriptomics studies. The same goes for studies where only limited or highly degraded RNA is available; ribo minus is the clear choice in these instances because of the higher input requirements for poly(A) enrichment and that poly(A) enrichment on highly degraded RNA will lead to a bias toward sequencing the three prime end of polyadenylated transcripts.
We also discussed the benefits of using a stranded library preparation kit over an unstranded preparation kit. When using an unstranded kit (meaning that it is uncertain which strand was transcribed), overlapping transcripts can be difficult or impossible to disentangle, which could lead to overestimation of gene expression and/or poor transcriptome assembly. A similar issue can arise with microRNAs (which can bind to complementary transcripts and subsequently prevent or limit translation). With an unstranded kit, it would not be possible to distinguish between an mRNA and a complementary microRNA.
Finally, we discussed biological replication in RNAseq experiments. The authors recommended having at least three biological replicates per treatment group. According to the authors, RNAseq experiments without replicates only have “exploratory” value, but the point was brought up in our discussion that RNAseq data without replicates could be used for transcriptome assembly. We also discussed the need for replicates at a higher level in order to generalize conclusions. Our consensus was that multiple population replicates for each treatment are preferred, but this is not always possible depending on the study system.
In conclusion, there are many different ways to perform and analyze RNAseq experiments. There isn’t one pipeline that works best for every system and every question, so it is important for biologists to consider the different options very carefully to come up with the most sensible experimental design.
Marcus Hooker, WSU
Huang, Y., Chain, F., Panchal, M., et al. (2016). Transcriptome profiling of immune tissues reveals habitat-specific gene expression between lake and river sticklebacks. Molecular Ecology, 25, 943-958.
This week we discussed a paper on differential gene expression in three-spined sticklebacks collected from parapatric population pairs found in rivers and adjacent lakes. Colonization of these habitats has happened recurrently and has given rise to different phenotypes in each habitat that are consistent across the geographical distribution of the species. One possible hypothesis the authors put forward for this trend is that differing parasite loads in lakes versus rivers, with lake populations having higher parasite abundance than rivers, has created the observed habitat-specific phenotypes. To understand the differences between lake and river populations, the authors collected fish from 4 natural populations pairs (3 population pairs in Europe and 1 in Canada) and extracted total RNA from head kidney and spleen tissue, both of which are involved in immune activity, to generate 77 complete transcriptomes. Reads were quality filtered before mapping onto the known stickleback transcriptome, differential expression (DE) was identified using EdgeR, and gene function was assessed by comparing the results of the DE analysis to annotated stickleback and zebrafish Gene Ontology (GO) databases. Principal components analysis separated gene expression profiles from head kidney versus spleen tissue in the first axis while the second axis separated European from Canadian population pairs. Habitat-specific expression was found in 139 of the observed 16,397 expressed genes, only 105 of the 139 genes had annotated GO term and of these, only 8 had specific immune system functions. While this is a small number, other genes may be involved during parasite infection, other than those associated with various immune functions. Previous laboratory-controlled studies by Lenz et al. 2013 and Haase et al. 2014 looked at differential expression in parasite-challenged versus control conditions and in head kidney versus gill tissue, respectively. Of the 1057 DE genes found by Lenz et al. (2013), ten also showed differential expression in the current study. Haase et al. (2014) found 1060 DE genes in the head kidney and 1415 in the gill, of which six and 25 overlapped with the current study, respectively. Even though few DE genes were uncovered in the current study, the presence of overlapping DE genes when compared to other studies on parasite driven differential expression suggests that, while parasite burden alone is not likely to be the cause of the observed phenotypic differences between lake and river populations of three-spined sticklebacks, it was at least one factor driving the differences between habitat types.
Wade Roberts, WSU
Smith S., Bernatchez L., Beheregaray L.B. 2013. RNA-seq analysis reveals extensive transcriptional plasticity to temperature stress in a freshwater fish species. BMC Genomics 14:375.
In the current study, crimson-spotted rainbowfish (Melanotaenia duboulayi) were used to begin examining genetic responses to increasing temperatures. The authors aim was to provide an initial investigation of the transcriptomic response of these fish to thermal stress. While the species is distributed over several ecoregions, the study collected rainbowfish from only 1 population. Two groups were used for the experiment: a control group (n=6) maintained at 21’C and a treatment group that was gradually acclimated to 33’C (increasing temperatures of 2’C per day). mRNA was sequenced using an Illumina HiSeq2000 using 101bp, paired-end reads. Transcriptome assembly was performed using multiple kmers in Velvet/Oases and the best assembly was chosen with optimal balance between median, mean, and N50 contig lengths. Differential expression analyses were performed using two mapping strategies (Bowtie and BWA) and two statistical programs (edgeR and DESeq), in four different combinations. Finally, contigs were annotated using blastx and Blast2GO against the NCBI database.
The four different differential expression and mapping combinations produced very different numbers of significant DE contigs. BWA-edgeR produced the most (14,076) while Bowtie-DESeq produced the least (5,577). Overall, there were 4,251 contigs identified by all four approaches and were used to blast against the reference database. Over half (2,106) of these contigs were successfully annotated. As expected, heat shock proteins (Hsp) were upregulated in the stressed fish along with transcripts involved in catabolism and lipid metabolism. Additionally, transcripts identified as Calnexin, NADH dehydrogenase, and glutathione S-transferase were upregulated. The authors suggest these three genes may be involved in long-term reallocation of energy resources during heat stress, while Hsp genes may be involved in response to short-term exposure to heat stress. Further, genes involved in the regulation of metabolic functions and developmental processes show some levels of differential regulation and could be additional candidate genes for long-term adaptation. This study clearly represents a straightforward pilot study with standard RNAseq methods used to produce initial results towards understanding transcriptional response to heat stress. Additional sampling of other populations will further strengthen the ability to draw conclusions about strategies used to acclimate to rising temperatures in the oceans.
Alexandra Fraik WSU
Lappalainen, Tuuli, et al. “Epistatic selection between coding and regulatory variation in human evolution and disease.” The American Journal of Human Genetics 89.3 (2011): 459-463.
This week’s discussion revolved around a study investigating the modifier effects of cis-regulatory variation on the penetrance of deleterious coding variants. The model of epistasis predicts that deleterious single nucleotide variant (SNVs) will be derived when a putatively new, deleterious coding SNV arises on the expressed haplotype in a coding SNV and regulatory SNV heterozygous pair. One thought of how this may occur is through allelic imbalance of the coding and regulatory SNVs. In other words, if a putatively deleterious coding SNV is linked with a regulatory SNV in high allele frequency and a new, putatively deleterious coding mutation arises, this could produce a double heterozygote coding-regulatory SNV with severe, deleterious phenotypic consequences. To test if this was the case, this study utilized re-sequencing data from 60 individuals of European origin and 58 of Nigerian to look provided by the 1000 genomes project to look at sequence-level genetic variation. Using expression quantitative trait loci (eQTLs) they mapped gene expression data in cis to look at common regulatory variation. To capture expression data for rare regulatory variants, they investigated whether this allele-specific pattern of expression could be observed using RNA-sequencing of the 60 European-descent individuals.
Looking at these expressional data, they found there was little functional coding variation on haplotypes that were more highly expressed. In other words, selective pressures of epistasis result in a deficiency of negative coding variation on the more highly expressed regulatory haplotype. Looking at synonymous versus non-synonymous SNVs, they found that the more highly expressed haplotypes of non-synonymous SNVs carried fewer derived alleles compared to synonymous SNVs. Our discussion mainly revolved around the model described in the paper used to describe the predictions of the epistasis model. We spent a long time speculating the fitness curves of the double-heterozygotes produced in figure 1. These were used to describe the gain of expression and loss of expression regulatory-coding SNV allelic combinations. More clarity regarding the method by which these curves were produced may have better guided the discussion.