NGS Data Analysis

We start differential expression (DE) analysis with read counts and sample information. Counts are generated from reads overlapping features (using featureCounts, for example); sample information covers the treatments and phenotypes of interest, i.e. why you did the experiment in the first place.

DESeq2 uses a SummarizedExperiment to hold count data, which was a little confusing for me, not knowing what SE objects should look like. Coming from edgeR, which takes a basic matrix of read counts and a condition vector as input, I had a hard time getting the 'Quick Start' going, ironically.
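For anyone else arriving from edgeR with just a count matrix, here is a minimal sketch of building the object directly from a matrix with DESeqDataSetFromMatrix(). The toy counts, sample names and the 'condition' column are made up for illustration; the one thing that really matters is that the columns of the count matrix and the rows of the sample table are in the same order.

```r
library(DESeq2)

# Toy inputs (hypothetical): 4 genes x 4 samples of integer counts
counts <- matrix(c(10L,  0L, 5L, 20L,
                   12L,  1L, 7L, 18L,
                   30L,  2L, 4L, 25L,
                   28L,  0L, 6L, 22L),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("sample", 1:4)))

# Sample information: one row per sample, rownames matching the count columns
coldata <- data.frame(condition = factor(c("ctrl", "ctrl", "trt", "trt")),
                      row.names = colnames(counts))

# Sanity check: same samples, same order
stopifnot(identical(colnames(counts), rownames(coldata)))

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ condition)
dds
```

Printing dds gives a summary like the one further down this post.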

This webpage was helpful in working through issues. Going from a matrix of counts I encountered:

Error in round(assay(se)) : non-numeric argument to mathematical function

which was basically telling me that the count matrix contained something non-numeric. This is not a fault of rownames, which I tested first. Best to go back to the input counts, reformat and fix them. I had spaces instead of tabs, and a final row recording where my joined counts came from (I'm not that familiar with join on the command line...), so I removed them and it worked OK. Good idea: go back to the source data if errors keep occurring and you cannot see the issue in R.
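If you would rather find the culprit from within R, something along these lines should point at the offending rows. It is only a sketch on a made-up table: the 'source' row stands in for my stray final line, and all the names are hypothetical.

```r
# Hypothetical count table, read in as text, with a trailing provenance row
raw <- data.frame(
  sample1 = c("10", "0", "5", "counts_a.txt"),
  sample2 = c("12", "1", "7", "counts_b.txt"),
  row.names = c("gene1", "gene2", "gene3", "source"),
  stringsAsFactors = FALSE
)

# as.numeric() turns anything it cannot parse into NA; use that to flag cells
parsed <- suppressWarnings(sapply(raw, as.numeric))
rownames(parsed) <- rownames(raw)
bad <- which(is.na(parsed), arr.ind = TRUE)
bad_rows <- unique(rownames(parsed)[bad[, "row"]])
print(bad_rows)

# Drop the flagged rows and store what is left as integer counts
clean <- parsed[!rownames(parsed) %in% bad_rows, , drop = FALSE]
storage.mode(clean) <- "integer"
```

Note this drops every row containing an unparseable value, so look at bad_rows before trusting the cleaned matrix; a real gene turning up there means the file needs fixing at source.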

> dds
class: DESeqDataSet
dim: 63677 52
exptData(0):
assays(1): counts
rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
ENSG00000273493
rowData metadata column names(0):
colnames(52): APR001_T1 APR002_T1 ... APR083_T1 APR084_T1
colData names(3): SampleName respU tissU

I proceeded as normal and it worked fine. I say fine: we had 99 samples as input but only used 52, because no phenotype information was available for the rest. If you design an experiment (I wasn't involved) and you miss things like this, you only have yourself to blame.
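Cutting the counts down to the samples that have phenotype information can be sketched as below. The toy data are invented; SampleName and respU are the column names from my colData, but the values are placeholders.

```r
# Hypothetical toy inputs: 5 samples counted, phenotype known for only 3
counts <- matrix(1:20, nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("S", 1:5)))
pheno <- data.frame(SampleName = c("S4", "S1", "S3"),
                    respU      = c("R", "NR", "R"),
                    stringsAsFactors = FALSE)

# Keep only samples present in both, in the order of the count columns
keep       <- intersect(colnames(counts), pheno$SampleName)
counts_sub <- counts[, keep, drop = FALSE]
coldata    <- pheno[match(keep, pheno$SampleName), , drop = FALSE]
rownames(coldata) <- coldata$SampleName

# Count columns and sample-table rows must line up before building the dataset
stopifnot(identical(colnames(counts_sub), rownames(coldata)))
```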

Next I ran a few diagnostics. They seem pretty good in DESeq2. They are based on the rlog transform, which was still running after quite some time, so I used the fast=TRUE approximation, which took seconds.

One of the interesting parts of the analysis is that we have FFPE and fresh-frozen tissue samples (tumor), and so I was interested to see the difference and if the tissues could be used together. Heatmap says NO!

Apologies for the small image, but the basic info is there: the bottom left corner is all "T2" (fresh-frozen, 12 samples), then a small T1 subgroup (bottom left of the 'main' group, 8 samples), then 'the rest', minus the top right cluster of T1s. This indicates that FFPE and fresh-frozen samples cannot be analysed together, or at least that they are not compatible sources of read counts.
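The idea behind that heatmap is just sample-to-sample distances plus hierarchical clustering. Here is a rough base-R sketch on simulated numbers, not my data: in a real run the matrix would be the transformed values, e.g. assay(rld), and the two groups and sample names here are invented to mimic a strong tissue effect.

```r
set.seed(1)

# Hypothetical stand-in for a transformed expression matrix (genes x samples):
# two groups of 4 samples with clearly different means
mat <- cbind(matrix(rnorm(200, mean = 5), ncol = 4),
             matrix(rnorm(200, mean = 8), ncol = 4))
colnames(mat) <- c(paste0("FFPE_", 1:4), paste0("FrFro_", 1:4))

# dist() works on rows, so transpose to get distances between samples
sample_dist <- dist(t(mat))

# Cluster the samples and cut the tree into two groups
hc     <- hclust(sample_dist)
groups <- cutree(hc, k = 2)
print(table(cluster = groups, tissue = sub("_.*", "", names(groups))))

# Symmetric heatmap of the distance matrix
heatmap(as.matrix(sample_dist), symm = TRUE)
```

If the clusters line up with tissue type rather than with the phenotype you care about, as they did for me, that is a sign the tissue effect dominates.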

Also, no clustering occurred between responders and non-responders, and so 'better' (read: different) phenotypes were sought. We had information on chemo, or chemo+drug, and so these, along with sex as a batch effect, can be used in a design = ~ sex + chemo + drug + chemo:drug. We have age also, but that is not as useful as we have <50 samples and I would worry about that sort of thing.
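To see what that design actually asks the model to estimate, it helps to expand it with base R's model.matrix(). This is a sketch on a made-up, fully crossed sample table; the factor levels ("F"/"M", "no"/"yes") are assumptions, not my real coding.

```r
# Hypothetical sample table; column names follow the design formula
coldata <- data.frame(
  sex   = factor(c("F", "M", "F", "M", "F", "M", "F", "M")),
  chemo = factor(c("no", "no", "yes", "yes", "no", "no", "yes", "yes")),
  drug  = factor(c("no", "no", "no", "no", "yes", "yes", "yes", "yes"))
)

design <- ~ sex + chemo + drug + chemo:drug

# One column per coefficient the model will estimate
mm <- model.matrix(design, data = coldata)
print(colnames(mm))

# The design matrix should be full rank, otherwise some terms cannot be estimated
stopifnot(qr(mm)$rank == ncol(mm))
```

The rank check is worth running on the real sample table: if, say, every sample had received chemo, the chemo column would be constant and the chemo and chemo:drug terms could not be estimated separately.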

NGS Data Analysis - CBT

Monday, 5 January 2015

DESeq2 Setup and Analysis