[aroma.affymetrix] Gene ST array summarisation by probeset?

2017-04-07 Thread Sophie Marion de Procé
Dear all,

I'm analysing a Rat Gene ST 2.1 array. I would like to filter the dataset 
using thresholds of expression for a minimum proportion of samples in each 
group.
I've been following the paper by Rodrigo-Domingo et al. (2014) Reproducible 
probe-level analysis of the Affymetrix Exon 1.0 ST array with 
R/Bioconductor.

They have two steps to the filtering step, first they filter probesets and 
then the transcripts. I'm getting stuck at the filtering step (code chunk 
number 7: intensityFiltering in the R code provided in this paper), as I 
can't find a way to summarise the Gene St array at the probeset level in 
order to do the first step. I have used plm <- RmaPlm(can) to summarise my 
data at the transcript level, but the plmPs  <- RmaPlm(csN, mergeGroups = 
FALSE) seems to summarise by transcripts as well.

So my questions are:
1) Is it possible to summarise a Gene ST array at the probeset level? If 
yes, how?
2) Less specific to the aroma-affymetrix package, but is it necessary to 
have the probeset-level dataset in order to filter present/absent 
probes/transcripts? What would be an appropriate workflow for this?

Thank you very much for your help,
Best wishes,
Sophie.

Here is the code chunk from that paper:

###
### code chunk number 7: intensityFiltering
###

# ** user-defined groups; default: groups defined by the treatment column 
in SampleInformation.txt
sample.info$number <- seq(1, nrow(sample.info))
groups <- split(sample.info$number, sample.info$treatment)

# check whether the filtering is already performed, perform it otherwise
# at probeset level
if(file.exists(file = paste(output.folder, "/PresentProbesets", ds, ".Rdata"
, sep = ""))){
 load(file = paste(output.folder, "/PresentProbesets", ds, ".Rdata", sep = 
""))
} else {
 # remove cross-hybridising probesets
 non.crosshyb.probesets <- probesets.NetAffx.32$probeset_id[probesets.
NetAffx.32$crosshyb_type == 1] 
 if(!exists("exFit")){
  load(file= paste(output.folder, "CoreExonIntensities.Rdata", sep = "/"))
 }
 exon.intensities <- exFit[exFit$groupName %in% non.crosshyb.probesets, -c(3
:5)]
 rm("exFit")
 gc()

 # ** user-defined criteria for absent/present probesets
 # ** criterion 1: probeset absent/present in a group of samples: present 
in at least half of the samples
 present.exons <- lapply(groups, FUN = function(group){
   if(length(group) > 1){
 apply(log2(exon.intensities[, group + 2]) < 3, 1, sum)/length(group) < 
0.5
   } else {
 log2(exon.intensities[, group + 2]) > 3
   }
 })
 present.exons <- t(do.call(rbind, present.exons))  # convert to dataframe
 rownames(present.exons) <- exon.intensities$groupName  # use probeset 
identities

 # ** criterion 2: probeset absent/present in the dataset
 # remove probesets not present in any of the groups
 absent.exons <- apply(present.exons, 1, AllFalse)
 probesets.to.keep <- absent.exons[absent.exons == FALSE]
 probesets.to.keep <- as.factor(names(probesets.to.keep))
 n.present.exons <- length(probesets.to.keep)
 rm(list = c("absent.exons"))
 save(probesets.to.keep, file = paste(output.folder, "/PresentProbesets", ds
, ".Rdata", sep = ""))
}
# check whether transcript filtering has been performed; perform it if not
if(file.exists(file = paste(output.folder, "/PresentTranscripts", ds, 
".Rdata", sep = ""))){
 load(file = paste(output.folder, "/PresentTranscripts", ds, ".Rdata", sep = 
""))
} else {
 if(!exists("exon.intensities")){
  load(file= paste(output.folder, "CoreExonIntensities.Rdata", sep = "/"))
  exon.intensities <- exFit[exFit$groupName %in% non.crosshyb.probesets, -c(
3:5)]
  rm("exFit")
  gc()
 }

 # ** user-defined criteria for absent transcripts:
 # criterion 1: half or more of probesets of transcript present in sample
 # --> transcript present in sample
 core.transcripts <- unique(exon.intensities$unitName)
 # create a list of transcript clusters present/absent in each sample
 present.genes<- lapply(core.transcripts, FUN = function(gene){
   apply(log2(exon.intensities[exon.intensities$unitName == gene, -c(1:2)]) 
< 3, 2, sum)/
   length(exon.intensities[exon.intensities$unitName == gene,]$groupName) < 
0.5
 })# FALSE: gene not present in sample
 names(present.genes) <- core.transcripts 
 present.genes <- do.call(rbind, present.genes) # convert to logical matrix 

 # criterion 2: transcript present in at least half of the samples of a 
group 
 # --> transcript present in the group
 present.genes.in.group <- lapply(groups, FUN = function(group){
   if(length(group) > 1){
 apply(present.genes[ , group], 1, sum)/length(group) >= 0.5
   } else {
 present.genes[ , group]
   }
 }) 
 present.genes.in.group <- do.call(rbind, present.genes.in.group) # logical 
matrix
 present.genes.in.group <- t(present.genes.in.group)
 # keep genes only present in at least two groups
 transcripts.to.keep <- apply(present.genes.in.group, 1, TwoOrMoreTrue)
 transcripts.to.keep <- n

Re: [aroma.affymetrix] AffymetrixCdfFile$byChipType -- Argument 'nbrOfUnits' contains 1 NA value(s)

2017-04-05 Thread Sophie Marion de Procé
Dear Guido,

Thank you for your reply, I followed your much simpler code and it worked!
So it seems I got mixed up with the folders. Or it could be that I should
have set the working directory with setwd(). I think I assumed that
aroma.affymetrix would use the wd variable to find the appropriate files. I
was surprised it didn't say that it couldn't locate the file as in other
posts in the mailing list.
In any case, it works now!

Thanks for your help,
Best wishes,
Sophie.


Code that worked:
setwd("~/Documents/AITMAN_GROUP/KO_SHR_rat_strains/Transcriptomes/aroma_affymetrix_wd")
library(aroma.affymetrix)
chipType <- "RaGene-2_1-st"
cdf <- AffymetrixCdfFile$byChipType(chipType, tags=c("byTranscript-fsetid",
"pd.ragene.2.1.st"))



On 5 April 2017 at 08:48, Hooiveld, Guido  wrote:

> Hi,
>
> Creating the CDF object using the same binary CDF file works fine in my
> hands… ?? See below.
>
>
>
> It has been a while since I extensively used aroma.affymetrix, but I
> noticed the main difference between your and my code is that you
> specifically set in R all relevant directories (cel.directory,
> annotation.data, etc), whereas I just set the working directory (to
> “aroma.affy.test”). That was the only directory I set/specified.
>
>
>
> Before starting R/aroma.affymetrix  I downloaded the binary CDF file, and
> within this (working) directory I created the relevant CDF directory (i.e.
> D:\aroma.affy.test\annotationData\chipTypes\RaGene-2_1-st), and then
> copied the CDF into that dir.
>
> Next I ran the code below.
>
>
>
> Note that I am on a Windows machine. Also note that since I don’t have
> some RaGene 2.1 CEL files I could not test whether the subsequent
> normalization goes fine.
>
>
>
> HTH,
>
> Guido
>
>
>
> > setwd("D:\\aroma.affy.test")
>
> >
>
> > library(aroma.affymetrix)
>
> > chipType <- "RaGene-2_1-st"
>
> > cdf <- AffymetrixCdfFile$byChipType(chipType,
> tags=c("byTranscript-fsetid", "pd.ragene.2.1.st"))
>
> > print(cdf)
>
> AffymetrixCdfFile:
>
> Path: annotationData/chipTypes/RaGene-2_1-st
>
> Filename: RaGene-2_1-st,byTranscript-fsetid,pd.ragene.2.1.st.cdf
>
> File size: 15.00 MiB (15728222 bytes)
>
> Chip type: RaGene-2_1-st,byTranscript-fsetid,pd.ragene.2.1.st
>
> File format: v4 (binary; XDA)
>
> Dimension: 1190x1190
>
> Number of cells: 1416100
>
> Number of units: 36685
>
> Cells per unit: 38.60
>
> Number of QC units: 0
>
> >
>
> >
>
>
>
> > sessionInfo()
>
> R version 3.3.1 Patched (2016-10-18 r71535)
>
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> Running under: Windows 7 x64 (build 7601) Service Pack 1
>
>
>
> locale:
>
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United
> States.1252
>
> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
>
>
> [5] LC_TIME=English_United States.1252
>
>
>
> attached base packages:
>
> [1] stats graphics  grDevices utils datasets  methods   base
>
>
>
> other attached packages:
>
> [1] aroma.light_3.4.0  aroma.affymetrix_3.1.0 affxparser_1.46.0
> aroma.core_3.1.0
>
> [5] R.devices_2.15.1   R.filesets_2.11.0  R.utils_2.5.0
> R.oo_1.21.0
>
> [9] R.methodsS3_1.7.1
>
>
>
> loaded via a namespace (and not attached):
>
> [1] matrixStats_0.52.0 codetools_0.2-15   listenv_0.6.0
> future_1.4.0   digest_0.6.12
>
>  [6] R.huge_0.9.0   PSCBS_0.62.0   tools_3.3.1
> R.cache_0.12.0 parallel_3.3.1
>
> [11] base64enc_0.1-3aroma.apd_0.6.0R.rsp_0.40.0
> globals_0.9.0  DNAcopy_1.48.0
>
> >
>
>
>
> -
>
> Guido Hooiveld, PhD
>
> Nutrition, Metabolism & Genomics Group
>
> Division of Human Nutrition & Health
>
> Wageningen University
>
> *Visiting address*:   *Mail address*:
>
> HELIX (Building 124), room 2048
>
> Stippeneng 4   PO Box 17
>
> 6708 WE Wageningen  6700 AA Wageningen
>
> the Netherlands   the Netherlands
>
>
>
> tel: (+) 31 317 485788 <+31%20317%20485%20788>
>
> fax: (+) 31 317 483342 <+31%20317%20483%20342>
>
> email:  guido.hooiv...@wur.nl
>
> internet:   http://www.wur.nl/nmg/hooiveld
>
> internet:   http://www.wur.nl/nmg
>
> http://scholar.google.com/citations?user=qFHaMnoJ
>
> http://www.researcherid.com/rid/F-4912-2010
>
>
>
> *From:* aroma-affymetrix@googleg

[aroma.affymetrix] AffymetrixCdfFile$byChipType -- Argument 'nbrOfUnits' contains 1 NA value(s)

2017-04-03 Thread Sophie Marion de Procé
Dear all,

I'm new to aroma.affymetrix and I seem to be stuck at the cdf file step 
using AffymetrixCdfFile$byChipType.
I've set up the folders and files as advised, and got the cdf file for 
RaGene-2_1-st 
from http://nmg-r.bioinformatics.nl/NuGO_R.html.
My command to create the cdf object fails with the error (detailed 
below): Argument 'nbrOfUnits' contains 1 NA value(s).

Would you have any suggestion of what I'm doing wrong?

Thanks very much for your help,
Best wishes,
Sophie.


Here is my code:
require(aroma.affymetrix)
require(biomaRt)
require(GenomeGraphs)

wd <- 
"/Users/sdeproce/Documents/AITMAN_GROUP/KO_SHR_rat_strains/Transcriptomes/aroma_affymetrix_wd"
ds <- "myDataSet"
cel.directory <- 
"/Users/sdeproce/Documents/AITMAN_GROUP/KO_SHR_rat_strains/Transcriptomes/aroma_affymetrix_wd/rawData/myDataSet/RaGene-2_1-st"
annotation.data<-"/Users/sdeproce/Documents/AITMAN_GROUP/KO_SHR_rat_strains/Transcriptomes/aroma_affymetrix_wd/annotationData/chipTypes/RaGene-2_1-st"
output.folder <- 
"/Users/sdeproce/Documents/AITMAN_GROUP/KO_SHR_rat_strains/Transcriptomes/aroma_affymetrix_wd/output"
library.file 
<-"/Users/sdeproce/Documents/AITMAN_GROUP/KO_SHR_rat_strains/Transcriptomes/aroma_affymetrix_wd/annotationData/chipTypes/RaGene-2_1-st/RaGene-2_1-st,byTranscript-fsetid,pd.ragene.2.1.st.cdf"
chipType<-"RaGene-2_1-st"
sample.info <-read.table(file=paste(cel.directory,"SampleInformation.txt", 
sep="/"),sep="\t", header=TRUE)
 
transcript.clusters.NetAffx.36 <- 
read.csv(file=paste(annotation.data,"RaGene-2_1-st-v1.na36.rn5.transcript.csv",sep="/"),
 
skip=22)
probesets.NetAffx.36 
<-read.csv(file=paste(annotation.data,"RaGene-2_1-st-v1.na36.rn5.probeset.csv",sep
 
="/"), skip=21)

cdf <- 
AffymetrixCdfFile$byChipType(chipType,tags="byTranscript-fsetid","pd.ragene.2.1.st")
 
# FAILS


And the error message I get:
[2017-04-03 17:02:35] Exception: Argument 'nbrOfUnits' contains 1 NA 
value(s).

  at #10. getNumerics.Arguments(static, ..., asMode = "integer", disallow = 
disallow)
  - getNumerics.Arguments() is in environment 'R.utils'

  at #09. getNumerics(static, ..., asMode = "integer", disallow = disallow)
  - getNumerics() is in environment 'R.utils'

  at #08. getIntegers.Arguments(static, ..., length = length)
  - getIntegers.Arguments() is in environment 'R.utils'

  at #07. getIntegers(static, ..., length = length)
  - getIntegers() is in environment 'R.utils'

  at #06. getInteger.Arguments(static, ...)
  - getInteger.Arguments() is in environment 'R.utils'

  at #05. getInteger(static, ...)
  - getInteger() is in environment 'R.utils'
  - originating from ''

  at #04. Arguments$getInteger(nbrOfUnits, range = c(0, Inf))
  - Arguments$getInteger() is local of the calling function

  at #03. byChipType.UnitAnnotationDataFile(static, ...)
  - byChipType.UnitAnnotationDataFile() is in environment 
'aroma.core'

  at #02. byChipType(static, ...)
  - byChipType() is in environment 'aroma.core'
  - originating from ''

  at #01. AffymetrixCdfFile$byChipType(chipType, tags = 
"byTranscript-fsetid", 
  "pd.ragene.2.1.st")
  - AffymetrixCdfFile$byChipType() is local of the calling function

Error: Argument 'nbrOfUnits' contains 1 NA value(s).
In addition: Warning message:
In storage.mode(x) <- asMode : NAs introduced by coercion


Finally, my sessionInfo():
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] grid  stats graphics  grDevices utils datasets  methods   
base 

other attached packages:
 [1] GenomeGraphs_1.34.0biomaRt_2.30.0 aroma.light_3.4.0 
 aroma.affymetrix_3.1.0 aroma.core_3.1.0   R.devices_2.15.1   
R.filesets_2.11.0  R.utils_2.5.0  R.oo_1.21.0   
 affxparser_1.46.0  R.methodsS3_1.7.1 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10   GenomeInfoDb_1.10.3XVector_0.14.1   
  bitops_1.0-6   base64enc_0.1-3tools_3.3.2 
   zlibbioc_1.20.0digest_0.6.12 
 aroma.apd_0.6.0   
[10] memoise_1.0.0  RSQLite_1.1-2  lattice_0.20-35 
   R.cache_0.12.0 Matrix_1.2-8   DBI_0.6   
 parallel_3.3.2 R.rsp_0.40.0   
S4Vectors_0.12.2  
[19] PSCBS_0.62.0   globals_0.9.0  IRanges_2.8.2   
   stats4_3.3.2   Biobase_2.34.0 
listenv_0.6.0  DNAcopy_1.48.0 AnnotationDbi_1.36.2 
  XML_3.98-1.6  
[28] codetools_0.2-15   matrixStats_0.51.0 
BiocGenerics_0.20.0GenomicRanges_1.26.4   
SummarizedExperiment_1.4.0 future_1.4.0   R.huge_0.9.0