[aroma.affymetrix] Gene ST array summarisation by probeset?
Dear all, I'm analysing a Rat Gene ST 2.1 array. I would like to filter the dataset using thresholds of expression for a minimum proportion of samples in each group. I've been following the paper by Rodrigo-Domingo et al. (2014) Reproducible probe-level analysis of the Affymetrix Exon 1.0 ST array with R/Bioconductor. They have two steps to the filtering step, first they filter probesets and then the transcripts. I'm getting stuck at the filtering step (code chunk number 7: intensityFiltering in the R code provided in this paper), as I can't find a way to summarise the Gene St array at the probeset level in order to do the first step. I have used plm <- RmaPlm(can) to summarise my data at the transcript level, but the plmPs <- RmaPlm(csN, mergeGroups = FALSE) seems to summarise by transcripts as well. So my questions are: 1) Is it possible to summarise a Gene ST array at the probeset level? If yes, how? 2) Less specific to the aroma-affymetrix package, but is it necessary to have the probeset-level dataset in order to filter present/absent probes/transcripts? What would be an appropriate workflow for this? Thank you very much for your help, Best wishes, Sophie. Here is the code chunk from that paper: ### ### code chunk number 7: intensityFiltering ### # ** user-defined groups; default: groups defined by the treatment column in SampleInformation.txt sample.info$number <- seq(1, nrow(sample.info)) groups <- split(sample.info$number, sample.info$treatment) # check whether the filtering is already performed, perform it otherwise # at probeset level if(file.exists(file = paste(output.folder, "/PresentProbesets", ds, ".Rdata" , sep = ""))){ load(file = paste(output.folder, "/PresentProbesets", ds, ".Rdata", sep = "")) } else { # remove cross-hybridising probesets non.crosshyb.probesets <- probesets.NetAffx.32$probeset_id[probesets. NetAffx.32$crosshyb_type == 1] if(!exists("exFit")){ load(file= paste(output.folder, "CoreExonIntensities.Rdata", sep = "/")) } exon.intensities <- exFit[exFit$groupName %in% non.crosshyb.probesets, -c(3 :5)] rm("exFit") gc() # ** user-defined criteria for absent/present probesets # ** criterion 1: probeset absent/present in a group of samples: present in at least half of the samples present.exons <- lapply(groups, FUN = function(group){ if(length(group) > 1){ apply(log2(exon.intensities[, group + 2]) < 3, 1, sum)/length(group) < 0.5 } else { log2(exon.intensities[, group + 2]) > 3 } }) present.exons <- t(do.call(rbind, present.exons)) # convert to dataframe rownames(present.exons) <- exon.intensities$groupName # use probeset identities # ** criterion 2: probeset absent/present in the dataset # remove probesets not present in any of the groups absent.exons <- apply(present.exons, 1, AllFalse) probesets.to.keep <- absent.exons[absent.exons == FALSE] probesets.to.keep <- as.factor(names(probesets.to.keep)) n.present.exons <- length(probesets.to.keep) rm(list = c("absent.exons")) save(probesets.to.keep, file = paste(output.folder, "/PresentProbesets", ds , ".Rdata", sep = "")) } # check whether transcript filtering has been performed; perform it if not if(file.exists(file = paste(output.folder, "/PresentTranscripts", ds, ".Rdata", sep = ""))){ load(file = paste(output.folder, "/PresentTranscripts", ds, ".Rdata", sep = "")) } else { if(!exists("exon.intensities")){ load(file= paste(output.folder, "CoreExonIntensities.Rdata", sep = "/")) exon.intensities <- exFit[exFit$groupName %in% non.crosshyb.probesets, -c( 3:5)] rm("exFit") gc() } # ** user-defined criteria for absent transcripts: # criterion 1: half or more of probesets of transcript present in sample # --> transcript present in sample core.transcripts <- unique(exon.intensities$unitName) # create a list of transcript clusters present/absent in each sample present.genes<- lapply(core.transcripts, FUN = function(gene){ apply(log2(exon.intensities[exon.intensities$unitName == gene, -c(1:2)]) < 3, 2, sum)/ length(exon.intensities[exon.intensities$unitName == gene,]$groupName) < 0.5 })# FALSE: gene not present in sample names(present.genes) <- core.transcripts present.genes <- do.call(rbind, present.genes) # convert to logical matrix # criterion 2: transcript present in at least half of the samples of a group # --> transcript present in the group present.genes.in.group <- lapply(groups, FUN = function(group){ if(length(group) > 1){ apply(present.genes[ , group], 1, sum)/length(group) >= 0.5 } else { present.genes[ , group] } }) present.genes.in.group <- do.call(rbind, present.genes.in.group) # logical matrix present.genes.in.group <- t(present.genes.in.group) # keep genes only present in at least two groups transcripts.to.keep <- apply(present.genes.in.group, 1, TwoOrMoreTrue) transcripts.to.keep <- n
Re: [aroma.affymetrix] AffymetrixCdfFile$byChipType -- Argument 'nbrOfUnits' contains 1 NA value(s)
Dear Guido, Thank you for your reply, I followed your much simpler code and it worked! So it seems I got mixed up with the folders. Or it could be that I should have set the working directory with setwd(). I think I assumed that aroma.affymetrix would use the wd variable to find the appropriate files. I was surprised it didn't say that it couldn't locate the file as in other posts in the mailing list. In any case, it works now! Thanks for your help, Best wishes, Sophie. Code that worked: setwd("~/Documents/AITMAN_GROUP/KO_SHR_rat_strains/Transcriptomes/aroma_affymetrix_wd") library(aroma.affymetrix) chipType <- "RaGene-2_1-st" cdf <- AffymetrixCdfFile$byChipType(chipType, tags=c("byTranscript-fsetid", "pd.ragene.2.1.st")) On 5 April 2017 at 08:48, Hooiveld, Guido wrote: > Hi, > > Creating the CDF object using the same binary CDF file works fine in my > hands… ?? See below. > > > > It has been a while since I extensively used aroma.affymetrix, but I > noticed the main difference between your and my code is that you > specifically set in R all relevant directories (cel.directory, > annotation.data, etc), whereas I just set the working directory (to > “aroma.affy.test”). That was the only directory I set/specified. > > > > Before starting R/aroma.affymetrix I downloaded the binary CDF file, and > within this (working) directory I created the relevant CDF directory (i.e. > D:\aroma.affy.test\annotationData\chipTypes\RaGene-2_1-st), and then > copied the CDF into that dir. > > Next I ran the code below. > > > > Note that I am on a Windows machine. Also note that since I don’t have > some RaGene 2.1 CEL files I could not test whether the subsequent > normalization goes fine. > > > > HTH, > > Guido > > > > > setwd("D:\\aroma.affy.test") > > > > > > library(aroma.affymetrix) > > > chipType <- "RaGene-2_1-st" > > > cdf <- AffymetrixCdfFile$byChipType(chipType, > tags=c("byTranscript-fsetid", "pd.ragene.2.1.st")) > > > print(cdf) > > AffymetrixCdfFile: > > Path: annotationData/chipTypes/RaGene-2_1-st > > Filename: RaGene-2_1-st,byTranscript-fsetid,pd.ragene.2.1.st.cdf > > File size: 15.00 MiB (15728222 bytes) > > Chip type: RaGene-2_1-st,byTranscript-fsetid,pd.ragene.2.1.st > > File format: v4 (binary; XDA) > > Dimension: 1190x1190 > > Number of cells: 1416100 > > Number of units: 36685 > > Cells per unit: 38.60 > > Number of QC units: 0 > > > > > > > > > > > sessionInfo() > > R version 3.3.1 Patched (2016-10-18 r71535) > > Platform: x86_64-w64-mingw32/x64 (64-bit) > > Running under: Windows 7 x64 (build 7601) Service Pack 1 > > > > locale: > > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United > States.1252 > > [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C > > > [5] LC_TIME=English_United States.1252 > > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > > other attached packages: > > [1] aroma.light_3.4.0 aroma.affymetrix_3.1.0 affxparser_1.46.0 > aroma.core_3.1.0 > > [5] R.devices_2.15.1 R.filesets_2.11.0 R.utils_2.5.0 > R.oo_1.21.0 > > [9] R.methodsS3_1.7.1 > > > > loaded via a namespace (and not attached): > > [1] matrixStats_0.52.0 codetools_0.2-15 listenv_0.6.0 > future_1.4.0 digest_0.6.12 > > [6] R.huge_0.9.0 PSCBS_0.62.0 tools_3.3.1 > R.cache_0.12.0 parallel_3.3.1 > > [11] base64enc_0.1-3aroma.apd_0.6.0R.rsp_0.40.0 > globals_0.9.0 DNAcopy_1.48.0 > > > > > > > - > > Guido Hooiveld, PhD > > Nutrition, Metabolism & Genomics Group > > Division of Human Nutrition & Health > > Wageningen University > > *Visiting address*: *Mail address*: > > HELIX (Building 124), room 2048 > > Stippeneng 4 PO Box 17 > > 6708 WE Wageningen 6700 AA Wageningen > > the Netherlands the Netherlands > > > > tel: (+) 31 317 485788 <+31%20317%20485%20788> > > fax: (+) 31 317 483342 <+31%20317%20483%20342> > > email: guido.hooiv...@wur.nl > > internet: http://www.wur.nl/nmg/hooiveld > > internet: http://www.wur.nl/nmg > > http://scholar.google.com/citations?user=qFHaMnoJ > > http://www.researcherid.com/rid/F-4912-2010 > > > > *From:* aroma-affymetrix@googleg
[aroma.affymetrix] AffymetrixCdfFile$byChipType -- Argument 'nbrOfUnits' contains 1 NA value(s)
Dear all, I'm new to aroma.affymetrix and I seem to be stuck at the cdf file step using AffymetrixCdfFile$byChipType. I've set up the folders and files as advised, and got the cdf file for RaGene-2_1-st from http://nmg-r.bioinformatics.nl/NuGO_R.html. My command to create the cdf object fails with the error (detailed below): Argument 'nbrOfUnits' contains 1 NA value(s). Would you have any suggestion of what I'm doing wrong? Thanks very much for your help, Best wishes, Sophie. Here is my code: require(aroma.affymetrix) require(biomaRt) require(GenomeGraphs) wd <- "/Users/sdeproce/Documents/AITMAN_GROUP/KO_SHR_rat_strains/Transcriptomes/aroma_affymetrix_wd" ds <- "myDataSet" cel.directory <- "/Users/sdeproce/Documents/AITMAN_GROUP/KO_SHR_rat_strains/Transcriptomes/aroma_affymetrix_wd/rawData/myDataSet/RaGene-2_1-st" annotation.data<-"/Users/sdeproce/Documents/AITMAN_GROUP/KO_SHR_rat_strains/Transcriptomes/aroma_affymetrix_wd/annotationData/chipTypes/RaGene-2_1-st" output.folder <- "/Users/sdeproce/Documents/AITMAN_GROUP/KO_SHR_rat_strains/Transcriptomes/aroma_affymetrix_wd/output" library.file <-"/Users/sdeproce/Documents/AITMAN_GROUP/KO_SHR_rat_strains/Transcriptomes/aroma_affymetrix_wd/annotationData/chipTypes/RaGene-2_1-st/RaGene-2_1-st,byTranscript-fsetid,pd.ragene.2.1.st.cdf" chipType<-"RaGene-2_1-st" sample.info <-read.table(file=paste(cel.directory,"SampleInformation.txt", sep="/"),sep="\t", header=TRUE) transcript.clusters.NetAffx.36 <- read.csv(file=paste(annotation.data,"RaGene-2_1-st-v1.na36.rn5.transcript.csv",sep="/"), skip=22) probesets.NetAffx.36 <-read.csv(file=paste(annotation.data,"RaGene-2_1-st-v1.na36.rn5.probeset.csv",sep ="/"), skip=21) cdf <- AffymetrixCdfFile$byChipType(chipType,tags="byTranscript-fsetid","pd.ragene.2.1.st") # FAILS And the error message I get: [2017-04-03 17:02:35] Exception: Argument 'nbrOfUnits' contains 1 NA value(s). at #10. getNumerics.Arguments(static, ..., asMode = "integer", disallow = disallow) - getNumerics.Arguments() is in environment 'R.utils' at #09. getNumerics(static, ..., asMode = "integer", disallow = disallow) - getNumerics() is in environment 'R.utils' at #08. getIntegers.Arguments(static, ..., length = length) - getIntegers.Arguments() is in environment 'R.utils' at #07. getIntegers(static, ..., length = length) - getIntegers() is in environment 'R.utils' at #06. getInteger.Arguments(static, ...) - getInteger.Arguments() is in environment 'R.utils' at #05. getInteger(static, ...) - getInteger() is in environment 'R.utils' - originating from '' at #04. Arguments$getInteger(nbrOfUnits, range = c(0, Inf)) - Arguments$getInteger() is local of the calling function at #03. byChipType.UnitAnnotationDataFile(static, ...) - byChipType.UnitAnnotationDataFile() is in environment 'aroma.core' at #02. byChipType(static, ...) - byChipType() is in environment 'aroma.core' - originating from '' at #01. AffymetrixCdfFile$byChipType(chipType, tags = "byTranscript-fsetid", "pd.ragene.2.1.st") - AffymetrixCdfFile$byChipType() is local of the calling function Error: Argument 'nbrOfUnits' contains 1 NA value(s). In addition: Warning message: In storage.mode(x) <- asMode : NAs introduced by coercion Finally, my sessionInfo(): R version 3.3.2 (2016-10-31) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: OS X El Capitan 10.11.6 locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] grid stats graphics grDevices utils datasets methods base other attached packages: [1] GenomeGraphs_1.34.0biomaRt_2.30.0 aroma.light_3.4.0 aroma.affymetrix_3.1.0 aroma.core_3.1.0 R.devices_2.15.1 R.filesets_2.11.0 R.utils_2.5.0 R.oo_1.21.0 affxparser_1.46.0 R.methodsS3_1.7.1 loaded via a namespace (and not attached): [1] Rcpp_0.12.10 GenomeInfoDb_1.10.3XVector_0.14.1 bitops_1.0-6 base64enc_0.1-3tools_3.3.2 zlibbioc_1.20.0digest_0.6.12 aroma.apd_0.6.0 [10] memoise_1.0.0 RSQLite_1.1-2 lattice_0.20-35 R.cache_0.12.0 Matrix_1.2-8 DBI_0.6 parallel_3.3.2 R.rsp_0.40.0 S4Vectors_0.12.2 [19] PSCBS_0.62.0 globals_0.9.0 IRanges_2.8.2 stats4_3.3.2 Biobase_2.34.0 listenv_0.6.0 DNAcopy_1.48.0 AnnotationDbi_1.36.2 XML_3.98-1.6 [28] codetools_0.2-15 matrixStats_0.51.0 BiocGenerics_0.20.0GenomicRanges_1.26.4 SummarizedExperiment_1.4.0 future_1.4.0 R.huge_0.9.0