Re: [R] FW: Selecting undefined column of a data frame (was [BioC] read.phenoData vs read.AnnotatedDataFrame)
Resolution:

To avoid bugs in code due to typos of data frame column names that can occur when using the '$' extractor,

    > foo <- data.frame(Filename = c("a", "b"))
    > foo$FileName
    NULL

a past alternative was to use foo[, "FileName"] instead of foo$FileName. However, this too now silently returns NULL.

    > foo[, "FileName"]
    NULL

A modest and simple modification is to use TRUE for the row index argument.

    > foo[T, "FileName"]
    Error in `[.data.frame`(foo, T, "FileName") : undefined columns selected

An error is issued, and the misspelled column name can more easily be found in debugging the issue.

    > all.equal(foo$Filename, foo[T, "Filename"])
    [1] TRUE

The two accessor methods yield the same result when column names are spelled correctly.

    > all.equal(iris$Species, iris[T, "Species"])
    [1] TRUE

Other solutions no doubt exist. Currently a single argument to [.data.frame will throw an error if the argument does not match a column name.

    > foo["FileName"]
    Error in `[.data.frame`(foo, "FileName") : undefined columns selected

    > sessionInfo()
    R version 2.5.1 (2007-06-27)
    i386-pc-mingw32
    locale:
    LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
    attached base packages:
    [1] stats graphics grDevices utils datasets methods
    [7] base

Steven McKinney
Statistician
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre
email: smckinney +at+ bccrc +dot+ ca
tel: 604-675-8000 x7561
BCCRC Molecular Oncology
675 West 10th Ave, Floor 4
Vancouver B.C. V5Z 1L3
Canada
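The row-index workaround depends on which branch of [.data.frame happens to run. A version-independent alternative is to check the name explicitly before extracting. The following is only a sketch of that idea; `strict_col` is a hypothetical helper name, not part of base R.

```r
# Hypothetical helper (not part of base R): extract one column by name,
# stopping with an informative error unless the name matches exactly.
strict_col <- function(df, name) {
  stopifnot(is.data.frame(df), is.character(name), length(name) == 1L)
  if (!name %in% names(df)) {
    stop("undefined column selected: '", name, "'", call. = FALSE)
  }
  df[[name]]
}

foo <- data.frame(Filename = c("a", "b"), stringsAsFactors = FALSE)
strict_col(foo, "Filename")    # returns the column
# strict_col(foo, "FileName")  # would stop, naming the misspelled column
```

If the row-index form is used instead, writing TRUE rather than T is slightly safer, since T is an ordinary variable that can be reassigned.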
[R] FW: Selecting undefined column of a data frame (was [BioC] read.phenoData vs read.AnnotatedDataFrame)
Hi all,

What are current methods people use in R to identify mis-spelled column names when selecting columns from a data frame?

Alice Johnson recently tackled this issue (see [BioC] posting below). Due to a mis-spelled column name ("FileName" instead of "Filename") which produced no warning, Alice spent a fair amount of time tracking down this bug. With my fumbling fingers I'll be tracking down such a bug soon too.

Is there any options() setting, or debug technique that will flag data frame column extractions that reference a non-existent column? It seems to me that the [.data.frame extractor used to throw an error if given a mis-spelled variable name, and I still see lines of code in [.data.frame such as

    if (any(is.na(cols)))
        stop("undefined columns selected")

In R 2.5.1 a NULL is silently returned.

    > foo <- data.frame(Filename = c("a", "b"))
    > foo[, "FileName"]
    NULL

Has something changed so that the code lines

    if (any(is.na(cols)))
        stop("undefined columns selected")

in [.data.frame no longer work properly (if I am understanding the intention properly)? If not, could [.data.frame check an options() variable setting (say warn.undefined.colnames) and throw a warning if a non-existent column name is referenced?

    > sessionInfo()
    R version 2.5.1 (2007-06-27)
    powerpc-apple-darwin8.9.1
    locale:
    en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
    attached base packages:
    [1] stats graphics grDevices utils datasets methods base
    other attached packages:
    plotrix lme4 Matrix lattice
    2.2-3 0.99875-4 0.999375-0 0.16-2

Steven McKinney
Statistician
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre

-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Johnstone, Alice
Sent: Wed 8/1/2007 7:20 PM
To: [EMAIL PROTECTED]
Subject: Re: [BioC] read.phenoData vs read.AnnotatedDataFrame

For interest sake, I have found out why I wasn't getting my expected results when using read.AnnotatedDataFrame.

Turns out the error was made in the ReadAffy command, where I specified the filenames to be read from my AnnotatedDataFrame object. There was a typo error with a capital N ($FileName) rather than lowercase n ($Filename) as in my target file..whoops. However this meant the filename argument was ignored without the error message(!) and instead of using the information in the AnnotatedDataFrame object (which included filenames, but not alphabetically) it read the .cel files in alphabetical order from the working directory - hence the wrong file was given the wrong label (given by the order of Annotated object) and my comparisons were confused without being obvious as to why or where.

Our solution: specify that filename is as.character so assignment of file to target is correct (after correcting $Filename) now that using read.AnnotatedDataFrame rather than read.phenoData.

    Data <- ReadAffy(filenames = as.character(pData(pd)$Filename), phenoData = pd)

Hurrah!

It may be beneficial to others, that if the filename argument isn't specified, that filenames are read from the phenoData object if included here.

Thanks!

-----Original Message-----
From: Martin Morgan [mailto:[EMAIL PROTECTED]
Sent: Thursday, 26 July 2007 11:49 a.m.
To: Johnstone, Alice
Cc: [EMAIL PROTECTED]
Subject: Re: [BioC] read.phenoData vs read.AnnotatedDataFrame

Hi Alice --

"Johnstone, Alice" [EMAIL PROTECTED] writes:

> Using R2.5.0 and Bioconductor I have been following code to analysis Affymetrix expression data: 2 treatments vs control. The original code was run last year and used the read.phenoData command, however with the newer version I get the error message
>
> Warning messages:
> read.phenoData is deprecated, use read.AnnotatedDataFrame instead
> The phenoData class is deprecated, use AnnotatedDataFrame (with ExpressionSet) instead
>
> I use the read.AnnotatedDataFrame command, but when it comes to the end of the analysis the comparison of the treatment to the controls gets mixed up compared to what you get using the original read.phenoData ie it looks like the 3 groups get labelled wrong and so the comparisons are different (but they can still be matched up).
>
> My questions are,
> 1) do you need to set up your target file differently when using read.AnnotatedDataFrame - what is the standard format?

I can't quite tell where things are going wrong for you, so it would help if you can narrow down where the problem occurs. I think read.AnnotatedDataFrame should be comparable to read.phenoData. Does pData(pd) look right? What about pData(Data) and pData(eset.rma)? It's not important but pData(pd)$Target is the same as pd$Target. Since the analysis is on eset.rma, it probably makes sense to use the pData from there to construct your design matrix
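The options() idea raised in this thread can be prototyped outside [.data.frame. What follows is a sketch under stated assumptions: `warn.undefined.colnames` is the option name floated in the original question and is not defined by base R, and `col_or_warn` is a made-up wrapper.

```r
# Hypothetical sketch: neither the option `warn.undefined.colnames` nor the
# function `col_or_warn` exists in base R; both are illustration only.
col_or_warn <- function(df, name) {
  missing_col <- !name %in% names(df)
  if (missing_col && isTRUE(getOption("warn.undefined.colnames"))) {
    warning("column '", name, "' not found; returning NULL", call. = FALSE)
  }
  df[[name]]   # NULL when the column does not exist
}

old <- options(warn.undefined.colnames = TRUE)   # switch the check on
foo <- data.frame(Filename = c("a", "b"), stringsAsFactors = FALSE)
res <- withCallingHandlers(
  col_or_warn(foo, "FileName"),
  warning = function(w) {
    message("caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
options(old)   # restore the previous setting
```

A wrapper like this only helps code that calls it; it does not change what `$` or `[` do elsewhere.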
Re: [R] FW: Selecting undefined column of a data frame (was [BioC] read.phenoData vs read.AnnotatedDataFrame)
I see now that for my example

    > foo <- data.frame(Filename = c("a", "b"))
    > foo[, "FileName"]
    NULL

the issue is in this clause of the [.data.frame extractor. The lines

    if (drop && length(y) == 1L)
        return(.subset2(y, 1L))

return the NULL result just before the error check

    cols <- names(y)
    if (any(is.na(cols)))
        stop("undefined columns selected")

is performed. Is this intended behaviour, or has a logical bug crept into the [.data.frame extractor?

    if (missing(i)) {
        if (missing(j) && drop && length(x) == 1L)
            return(.subset2(x, 1L))
        y <- if (missing(j)) x else .subset(x, j)
        if (drop && length(y) == 1L)
            return(.subset2(y, 1L))  ## This returns a result before undefined columns check is done. Is this intended?
        cols <- names(y)
        if (any(is.na(cols)))
            stop("undefined columns selected")
        if (any(duplicated(cols)))
            names(y) <- make.unique(cols)
        nrow <- .row_names_info(x, 2L)
        if (drop && !mdrop && nrow == 1L)
            return(structure(y, class = NULL, row.names = NULL))
        else
            return(structure(y, class = oldClass(x), row.names = .row_names_info(x, 0L)))
    }

    > sessionInfo()
    R version 2.5.1 (2007-06-27)
    powerpc-apple-darwin8.9.1
    locale:
    en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
    attached base packages:
    [1] stats graphics grDevices utils datasets methods base
    other attached packages:
    plotrix lme4 Matrix lattice
    2.2-3 0.99875-4 0.999375-0 0.16-2

Should this discussion move to R-devel?

Steven McKinney
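The ordering question can be made concrete with a small stand-in. The function below is a simplified illustration written for this purpose, assuming only the behaviour of `.subset` and `.subset2`; it is not the base R source and not a proposed patch, and `cols_checked_first` is a hypothetical name.

```r
# Simplified stand-in for the missing(i) branch discussed above, with the
# name check moved ahead of the drop shortcut. Illustration only.
cols_checked_first <- function(x, j, drop = TRUE) {
  y <- .subset(x, j)            # list-style subsetting; unknown names become NA
  cols <- names(y)
  if (any(is.na(cols))) stop("undefined columns selected")
  if (drop && length(y) == 1L) return(.subset2(y, 1L))
  structure(y, class = oldClass(x), row.names = .row_names_info(x, 0L))
}

foo <- data.frame(Filename = c("a", "b"), stringsAsFactors = FALSE)
ok  <- cols_checked_first(foo, "Filename")
bad <- tryCatch(cols_checked_first(foo, "FileName"),
                error = function(e) conditionMessage(e))
```

With the check first, a misspelled single-column request stops instead of returning NULL, which is the behaviour the message asks about.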
Re: [R] FW: Selecting undefined column of a data frame (was [BioC] read.phenoData vs read.AnnotatedDataFrame)
You are reading the wrong part of the code for your argument list:

    > foo["FileName"]
    Error in `[.data.frame`(foo, "FileName") : undefined columns selected

[.data.frame is one of the most complex functions in R, and does many different things depending on which arguments are supplied.
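The point that [.data.frame behaves differently depending on which arguments are supplied can be seen with correctly spelled names alone. This short sketch uses only base R; the variable names are arbitrary.

```r
foo <- data.frame(Filename = c("a", "b"), stringsAsFactors = FALSE)

one_arg <- foo["Filename"]                  # list-style index: stays a data frame
two_arg <- foo[, "Filename"]                # matrix-style index: column is dropped to a vector
kept    <- foo[, "Filename", drop = FALSE]  # matrix-style index, dropping suppressed

class(one_arg)
class(two_arg)
class(kept)
```

The one-argument and two-argument forms take different paths through the function, which is why a check seen in one path need not apply to the other.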
Re: [R] FW: Selecting undefined column of a data frame (was [BioC] read.phenoData vs read.AnnotatedDataFrame)
Thanks Prof Ripley,

I used double indexing (if I understand the doc correctly) so my call was

    > foo[, "FileName"]

I traced through each line of `[.data.frame` following the sequence of commands executed for my call. In the code section

    if (missing(i)) {
        if (missing(j) && drop && length(x) == 1L)
            return(.subset2(x, 1L))
        y <- if (missing(j)) x else .subset(x, j)
        if (drop && length(y) == 1L)
            return(.subset2(y, 1L))  ## This returns a result before undefined columns check is done. Is this intended?
        cols <- names(y)
        if (any(is.na(cols)))
            stop("undefined columns selected")
        if (any(duplicated(cols)))
            names(y) <- make.unique(cols)
        nrow <- .row_names_info(x, 2L)
        if (drop && !mdrop && nrow == 1L)
            return(structure(y, class = NULL, row.names = NULL))
        else
            return(structure(y, class = oldClass(x), row.names = .row_names_info(x, 0L)))
    }

the return happened after execution of

    if (drop && length(y) == 1L)
        return(.subset2(y, 1L))

before the check on column names. Shouldn't the check on column names

    cols <- names(y)
    if (any(is.na(cols)))
        stop("undefined columns selected")

occur before

    if (drop && length(y) == 1L)
        return(.subset2(y, 1L))

rather than after?
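For readers who want to repeat this kind of inspection on their own installation, the source of the method can be searched without stepping through it. This is a minimal sketch using base R only; the number and position of matches will differ between R versions.

```r
# Deparse the installed method and locate the name check in its source.
src  <- deparse(`[.data.frame`)
hits <- grep("undefined columns selected", src, fixed = TRUE)

length(hits)   # how many times the check appears in this R version
src[hits]      # the matching source lines
```

Comparing the lines around each match with the branch taken for a given call shows whether the check is reached before any early return.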
Re: [R] FW: Selecting undefined column of a data frame (was [BioC] read.phenoData vs read.AnnotatedDataFrame)
I've since seen your followup; a more detailed explanation may help.

The path through the code for your argument list does not go where you quoted, and there is a reason for it.

Generally when you extract in R and ask for a non-existent index you get NA or NULL as the result (and no warning), e.g.

    > y <- list(x=1, y=2)
    > y[["z"]]
    NULL

Because data frames 'must' have (column) names, they are a partial exception and when the result is a data frame you get an error if it would contain undefined columns.

But in the case of foo[, "FileName"], the result is a single column and so will not have a name: there seems no reason to be different from

    > foo[["FileName"]]
    NULL
    > foo$FileName
    NULL

which similarly select a single column. At one time they were different in R, for no documented reason.
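The general rule described in this reply, that a non-existent index yields NA or NULL without a warning, can be checked directly. The sketch below uses only base R and avoids the foo[, "FileName"] form, whose result is the point under discussion.

```r
y <- list(x = 1, y = 2)   # a plain list
v <- c(x = 1, y = 2)      # a named atomic vector
foo <- data.frame(Filename = c("a", "b"), stringsAsFactors = FALSE)

is.null(y[["z"]])            # missing list element by name
is.null(y$z)                 # same with the $ extractor
all(is.na(v["z"]))           # missing name in an atomic vector gives NA
is.null(foo[["FileName"]])   # data frame treated as a list
is.null(foo$FileName)
```

None of these calls signals a warning or an error, so a typo in a name passes through silently unless the code checks for it.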
Re: [R] FW: Selecting undefined column of a data frame (was [BioC] read.phenoData vs read.AnnotatedDataFrame)
-Original Message- From: Prof Brian Ripley [mailto:[EMAIL PROTECTED] Sent: Fri 8/3/2007 1:05 PM To: Steven McKinney Cc: r-help@stat.math.ethz.ch Subject: Re: [R] FW: Selecting undefined column of a data frame (was [BioC] read.phenoData vs read.AnnotatedDataFrame) I've since seen your followup a more detailed explanation may help. The path through the code for your argument list does not go where you quoted, and there is a reason for it. Using a copy of [.data.frame with browser() I have traced the flow of execution. (My copy with the browser command is at the end of this email) foo[, FileName] Called from: `[.data.frame`(foo, , FileName) Browse[1] n debug: mdrop - missing(drop) Browse[1] n debug: Narg - nargs() - (!mdrop) Browse[1] n debug: if (Narg 3) { if (!mdrop) warning(drop argument will be ignored) if (missing(i)) return(x) if (is.matrix(i)) return(as.matrix(x)[i]) y - NextMethod([) cols - names(y) if (!is.null(cols) any(is.na(cols))) stop(undefined columns selected) if (any(duplicated(cols))) names(y) - make.unique(cols) return(structure(y, class = oldClass(x), row.names = .row_names_info(x, 0L))) } Browse[1] n debug: if (missing(i)) { if (missing(j) drop length(x) == 1L) return(.subset2(x, 1L)) y - if (missing(j)) x else .subset(x, j) if (drop length(y) == 1L) return(.subset2(y, 1L)) cols - names(y) if (any(is.na(cols))) stop(undefined columns selected) if (any(duplicated(cols))) names(y) - make.unique(cols) nrow - .row_names_info(x, 2L) if (drop !mdrop nrow == 1L) return(structure(y, class = NULL, row.names = NULL)) else return(structure(y, class = oldClass(x), row.names = .row_names_info(x, 0L))) } Browse[1] n debug: if (missing(j) drop length(x) == 1L) return(.subset2(x, 1L)) Browse[1] n debug: y - if (missing(j)) x else .subset(x, j) Browse[1] n debug: if (drop length(y) == 1L) return(.subset2(y, 1L)) Browse[1] n NULL So `[.data.frame` is exiting after executing + if (drop length(y) == 1L) + return(.subset2(y, 1L)) ## This returns a result before 
undefined columns check is done. Is this intended? Couldn't the error check + cols - names(y) + if (any(is.na(cols))) + stop(undefined columns selected) be done before the above return()? What would break if the error check on column names was done before returning a NULL result due to incorrect column name spelling? Why should foo[, FileName] NULL differ from foo[seq(nrow(foo)), FileName] Error in `[.data.frame`(foo, seq(nrow(foo)), FileName) : undefined columns selected Thank you for your explanations. Generally when you extract in R and ask for an non-existent index you get NA or NULL as the result (and no warning), e.g. y - list(x=1, y=2) y[[z]] NULL Because data frames 'must' have (column) names, they are a partial exception and when the result is a data frame you get an error if it would contain undefined columns. But in the case of foo[, FileName], the result is a single column and so will not have a name: there seems no reason to be different from foo[[FileName]] NULL foo$FileName NULL which similarly select a single column. At one time they were different in R, for no documented reason. On Fri, 3 Aug 2007, Prof Brian Ripley wrote: You are reading the wrong part of the code for your argument list: foo[FileName] Error in `[.data.frame`(foo, FileName) : undefined columns selected [.data.frame is one of the most complex functions in R, and does many different things depending on which arguments are supplied. On Fri, 3 Aug 2007, Steven McKinney wrote: Hi all, What are current methods people use in R to identify mis-spelled column names when selecting columns from a data frame? Alice Johnson recently tackled this issue (see [BioC] posting below). Due to a mis-spelled column name (FileName instead of Filename) which produced no warning, Alice spent a fair amount of time tracking down this bug. With my fumbling fingers I'll be tracking down such a bug soon too. 
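The early exit discussed in this message can be reproduced outside `[.data.frame` with the same internal helpers it calls. The following is a minimal sketch, assuming only base R. `strict_cols` is a hypothetical helper, not part of R and not the actual `[.data.frame` code; it only illustrates the reordering asked about above, with the name check moved ahead of the drop-to-vector return.

```r
foo <- data.frame(Filename = c("a", "b"))

# List-style subsetting by a name that does not exist yields a length-one
# list whose single name is NA and whose single element is NULL.
y <- .subset(foo, "FileName")
length(y)                   # 1
is.na(names(y))             # TRUE
is.null(.subset2(y, 1L))    # TRUE: the value an early return would hand back

# Hypothetical variant of the column-only branch: check the names first,
# then drop to a vector when a single column was selected.
strict_cols <- function(x, j, drop = TRUE) {
    y <- .subset(x, j)
    if (any(is.na(names(y))))
        stop("undefined columns selected")
    if (drop && length(y) == 1L)
        return(.subset2(y, 1L))
    structure(y, class = oldClass(x), row.names = .row_names_info(x, 0L))
}

ok  <- strict_cols(foo, "Filename")              # the column itself
bad <- tryCatch(strict_cols(foo, "FileName"),    # misspelled name
                error = function(e) conditionMessage(e))
bad   # "undefined columns selected"
```

Whether such a reordering is desirable is exactly the point under debate in the thread; the sketch only shows that the check and the drop are independent steps.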
Re: [R] FW: Selecting undefined column of a data frame (was [BioC] read.phenoData vs read.AnnotatedDataFrame)
> What would break is that three methods for doing the same thing would give different answers.
>
> Please do have the courtesy to actually read the detailed explanation you are given.

Sorry Prof. Ripley, I am attempting to read carefully, as this issue has deeper coding/debugging implications, and as you point out, [.data.frame is one of the most complex functions in R, so please bear with me. This change in behaviour has taken away a side-effect debugging tool, discussed below.

> On Fri, 3 Aug 2007, Steven McKinney wrote:
>
> -----Original Message-----
> From: Prof Brian Ripley [mailto:[EMAIL PROTECTED]]
> Sent: Fri 8/3/2007 1:05 PM
> To: Steven McKinney
> Cc: r-help@stat.math.ethz.ch
> Subject: Re: [R] FW: Selecting undefined column of a data frame (was [BioC] read.phenoData vs read.AnnotatedDataFrame)
>
> I've since seen your followup; a more detailed explanation may help. The path through the code for your argument list does not go where you quoted, and there is a reason for it.
>
> Generally when you extract in R and ask for a non-existent index you get NA or NULL as the result (and no warning), e.g.
>
>     > y <- list(x=1, y=2)
>     > y[["z"]]
>     NULL
>
> Because data frames 'must' have (column) names, they are a partial exception and when the result is a data frame you get an error if it would contain undefined columns.
>
> But in the case of foo[, "FileName"], the result is a single column and so will not have a name: there seems no reason to be different from
>
>     > foo[["FileName"]]
>     NULL
>     > foo$FileName
>     NULL
>
> which similarly select a single column. At one time they were different in R, for no documented reason.

This difference provided a side-effect debugging tool, in that where

    > bar <- foo[, "FileName"]

used to throw an error, alerting as to a typo, it now does not.

Having been burned by NULL results due to typos in code lines using the `$` extractor such as

    > bar <- foo$FileName

I learned to use

    > bar <- foo[, "FileName"]

to help cut down on typo bugs. With the ubiquity of camelCase object names, this is a constant typing bug hazard.

I am wondering what to do now to double check spelling when accessing columns of a data frame.

If [.data.frame stays as is, can a debug mechanism be implemented in R that forces strict adherence to existing list names in debug mode? This would also help debug typos in camelCase names when using the `$` and `[[` extractors and accessors.

Are there other debugging tools already in R that can help point out such camelCase list element name typos?

> On Fri, 3 Aug 2007, Prof Brian Ripley wrote:
>
> You are reading the wrong part of the code for your argument list:
>
>     > foo["FileName"]
>     Error in `[.data.frame`(foo, "FileName") : undefined columns selected
>
> [.data.frame is one of the most complex functions in R, and does many different things depending on which arguments are supplied.
>
> On Fri, 3 Aug 2007, Steven McKinney wrote:
>
> Hi all,
>
> What are current methods people use in R to identify mis-spelled column names when selecting columns from a data frame?
>
> Alice Johnstone recently tackled this issue (see [BioC] posting below).
>
> Due to a mis-spelled column name ("FileName" instead of "Filename") which produced no warning, Alice spent a fair amount of time tracking down this bug. With my fumbling fingers I'll be tracking down such a bug soon too.
>
> Is there any options() setting, or debug technique that will flag data frame column extractions that reference a non-existent column? It seems to me that the [.data.frame extractor used to throw an error if given a mis-spelled variable name, and I still see lines of code in [.data.frame such as
>
>     if (any(is.na(cols))) stop("undefined columns selected")
>
> In R 2.5.1 a NULL is silently returned.
>
>     > foo <- data.frame(Filename = c("a", "b"))
>     > foo[, "FileName"]
>     NULL
>
> Has something changed so that the code lines
>
>     if (any(is.na(cols))) stop("undefined columns selected")
>
> in [.data.frame no longer work properly (if I am understanding the intention properly)?
>
> If not, could [.data.frame check an options() variable setting (say warn.undefined.colnames) and throw a warning if a non-existent column name is referenced?
>
>     > sessionInfo()
>     R version 2.5.1 (2007-06-27)
>     powerpc-apple-darwin8.9.1
>
>     locale:
>     en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
>
>     attached base packages:
>     [1] stats graphics grDevices utils datasets methods base
>
>     other attached packages:
>        plotrix       lme4     Matrix  lattice
>          2.2-3  0.99875-4 0.999375-0   0.16-2
>
> Steven McKinney
>
> Statistician
> Molecular Oncology and Breast Cancer Program
> British Columbia Cancer Research Centre
>
> email: smckinney +at+ bccrc +dot+ ca
> tel: 604-675-8000 x7561
>
> BCCRC
> Molecular Oncology
> 675 West 10th Ave, Floor 4
> Vancouver B.C.
> V5Z 1L3
> Canada

-- 
Brian D. Ripley
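The opt-in warning proposed in this message can be approximated in user code without changing `[.data.frame`. This is only a sketch under stated assumptions: `get_col` is a hypothetical helper, and although the option name `warn.undefined.colnames` comes from the suggestion above, base R does not define or consult any such option.

```r
# Hypothetical accessor: `get_col` and the option name are illustrative only;
# base R defines no option called "warn.undefined.colnames".
get_col <- function(df, name) {
    stopifnot(is.data.frame(df), is.character(name), length(name) == 1L)
    if (!(name %in% names(df)) &&
        isTRUE(getOption("warn.undefined.colnames"))) {
        warning(sprintf("column '%s' not found; available: %s",
                        name, paste(names(df), collapse = ", ")),
                call. = FALSE)
    }
    df[[name]]   # exact matching; NULL when the column is absent
}

foo <- data.frame(Filename = c("a", "b"))

# With the (hypothetical) option switched on, the typo is reported.
old <- options(warn.undefined.colnames = TRUE)
msg <- tryCatch(get_col(foo, "FileName"),
                warning = function(w) conditionMessage(w))
options(old)   # restore the previous (unset) state

# With the option unset, the accessor behaves like foo[["FileName"]].
quiet <- get_col(foo, "FileName")
```

Keeping the check behind an option means existing code that relies on a silent NULL is unaffected unless the user asks for the stricter behaviour.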
Re: [R] FW: Selecting undefined column of a data frame (was [BioC]read.phenoData vs read.AnnotatedDataFrame)
I suspect you'll get some creative answers, but if all you're worried about is whether a column exists before you do something with it, what's wrong with:

    nm <- ... ## a character vector of names
    if(!all(nm %in% names(yourdata))) ## complain
    else ## do something

I think this is called defensive programming.

Bert Gunter
Genentech

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Steven McKinney
Sent: Friday, August 03, 2007 10:38 AM
To: r-help@stat.math.ethz.ch
Subject: [R] FW: Selecting undefined column of a data frame (was [BioC]read.phenoData vs read.AnnotatedDataFrame)

Hi all,

What are current methods people use in R to identify mis-spelled column names when selecting columns from a data frame?

Alice Johnstone recently tackled this issue (see [BioC] posting below).

Due to a mis-spelled column name ("FileName" instead of "Filename") which produced no warning, Alice spent a fair amount of time tracking down this bug. With my fumbling fingers I'll be tracking down such a bug soon too.

Is there any options() setting, or debug technique that will flag data frame column extractions that reference a non-existent column? It seems to me that the [.data.frame extractor used to throw an error if given a mis-spelled variable name, and I still see lines of code in [.data.frame such as

    if (any(is.na(cols))) stop("undefined columns selected")

In R 2.5.1 a NULL is silently returned.

    > foo <- data.frame(Filename = c("a", "b"))
    > foo[, "FileName"]
    NULL

Has something changed so that the code lines

    if (any(is.na(cols))) stop("undefined columns selected")

in [.data.frame no longer work properly (if I am understanding the intention properly)?

If not, could [.data.frame check an options() variable setting (say warn.undefined.colnames) and throw a warning if a non-existent column name is referenced?
    > sessionInfo()
    R version 2.5.1 (2007-06-27)
    powerpc-apple-darwin8.9.1

    locale:
    en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
       plotrix       lme4     Matrix  lattice
         2.2-3  0.99875-4 0.999375-0   0.16-2

Steven McKinney

Statistician
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre

email: smckinney +at+ bccrc +dot+ ca
tel: 604-675-8000 x7561

BCCRC
Molecular Oncology
675 West 10th Ave, Floor 4
Vancouver B.C.
V5Z 1L3
Canada

-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Johnstone, Alice
Sent: Wed 8/1/2007 7:20 PM
To: [EMAIL PROTECTED]
Subject: Re: [BioC] read.phenoData vs read.AnnotatedDataFrame

For interest's sake, I have found out why I wasn't getting my expected results when using read.AnnotatedDataFrame.

Turns out the error was made in the ReadAffy command, where I specified the filenames to be read from my AnnotatedDataFrame object. There was a typo error with a capital N ($FileName) rather than lowercase n ($Filename) as in my target file... whoops. However this meant the filename argument was ignored without the error message(!) and instead of using the information in the AnnotatedDataFrame object (which included filenames, but not alphabetically) it read the .cel files in alphabetical order from the working directory - hence the wrong file was given the wrong label (given by the order of Annotated object) and my comparisons were confused without being obvious as to why or where.

Our solution: specify that filename is as.character so assignment of file to target is correct (after correcting $Filename) now that using read.AnnotatedDataFrame rather than read.phenoData.

    Data <- ReadAffy(filenames=as.character(pData(pd)$Filename), phenoData=pd)

Hurrah!

It may be beneficial to others, that if the filename argument isn't specified, that filenames are read from the phenoData object if included here.

Thanks!
-----Original Message-----
From: Martin Morgan [mailto:[EMAIL PROTECTED]]
Sent: Thursday, 26 July 2007 11:49 a.m.
To: Johnstone, Alice
Cc: [EMAIL PROTECTED]
Subject: Re: [BioC] read.phenoData vs read.AnnotatedDataFrame

Hi Alice --

Johnstone, Alice [EMAIL PROTECTED] writes:

> Using R2.5.0 and Bioconductor I have been following code to analyse Affymetrix expression data: 2 treatments vs control. The original code was run last year and used the read.phenoData command, however with the newer version I get the error message
>
>     Warning messages:
>     read.phenoData is deprecated, use read.AnnotatedDataFrame instead
>     The phenoData class is deprecated, use AnnotatedDataFrame (with ExpressionSet) instead
>
> I use the read.AnnotatedDataFrame command, but when it comes to the end of the analysis the comparison of the treatment to the controls gets mixed up compared to what you get using the original read.phenoData, i.e. it looks like the 3 groups get labelled wrong and so the comparisons are different (but they can still be matched up
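The defensive check outlined at the top of this message can be filled in as a small reusable function. This is a sketch, not a recommendation from the thread's authors: `require_cols` and the `targets` data frame are hypothetical names introduced here for illustration, and the column names merely echo the Filename/FileName typo described above.

```r
# Hypothetical helper: stop with the list of requested names that are not
# columns of `yourdata`; return TRUE invisibly when all are present.
require_cols <- function(yourdata, nm) {
    missing_nm <- setdiff(nm, names(yourdata))
    if (length(missing_nm) > 0L) {
        stop("columns not found: ", paste(missing_nm, collapse = ", "),
             call. = FALSE)
    }
    invisible(TRUE)
}

# Illustrative data only.
targets <- data.frame(Filename  = c("a.cel", "b.cel"),
                      Treatment = c("ctl", "trt"))

require_cols(targets, c("Filename", "Treatment"))    # passes silently

err <- tryCatch(require_cols(targets, c("FileName", "Treatment")),
                error = function(e) conditionMessage(e))
err   # "columns not found: FileName"
```

Calling such a check once, before the names are used, reports every misspelled name at the point of the mistake rather than wherever a NULL later causes trouble.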
Re: [R] FW: Selecting undefined column of a data frame (was [BioC]read.phenoData vs read.AnnotatedDataFrame)
Hi Bert,

-----Original Message-----
From: Bert Gunter [mailto:[EMAIL PROTECTED]]
Sent: Fri 8/3/2007 3:19 PM
To: Steven McKinney; r-help@stat.math.ethz.ch
Subject: RE: [R] FW: Selecting undefined column of a data frame (was [BioC]read.phenoData vs read.AnnotatedDataFrame)

> I suspect you'll get some creative answers, but if all you're worried about is whether a column exists before you do something with it, what's wrong with:
>
>     nm <- ... ## a character vector of names
>     if(!all(nm %in% names(yourdata))) ## complain
>     else ## do something
>
> I think this is called defensive programming.

This is a good example of good defensive programming. I do indeed check variable/object names whenever obtaining them from an external source (user input, file input, a list in code).

I was able to practice a defensive programming style in the past by using

    > bar <- foo[, "FileName"]

instead of

    > bar <- foo$FileName

but this has changed recently, so I need to figure out some other mechanisms.

R is such a productive language, but this change will lead many of us to chase elusive typos that used to get revealed.

I'm hoping that some kind of explicit data frame variable checking mechanism might be introduced since we've lost this one. It would also be great to have such a mechanism to help catch list access and extraction errors. Why should foo$FileName always quietly return NULL?

I'm not sure why the following incongruity is okay.

    > foo <- matrix(1:4, nrow = 2)
    > dimnames(foo) <- list(NULL, c("a", "b"))
    > bar <- foo[, "A"]
    Error: subscript out of bounds
    > foo.df <- as.data.frame(foo)
    > foo.df
      a b
    1 1 3
    2 2 4
    > bar <- foo.df[, "A"]
    > bar
    NULL

It is a lot of extra typing to wrap every command in extra code, but more of that will need to happen going forward.
Steve McKinney
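The matrix versus list-style contrast raised in this message can be checked with a few lines of base R. This is a minimal sketch: the object names `m` and `m_df` are introduced here, and the two-index data frame form `m_df[, "A"]` is deliberately left out because, as the thread itself records, its behaviour has differed between R versions.

```r
m <- matrix(1:4, nrow = 2, dimnames = list(NULL, c("a", "b")))

# Indexing a matrix by a column name that does not exist is an error.
m_res <- tryCatch(m[, "A"], error = function(e) "error")
m_res               # "error"

# The list-style extractors on the equivalent data frame return NULL silently.
m_df <- as.data.frame(m)
is.null(m_df$A)       # TRUE
is.null(m_df[["A"]])  # TRUE
```

Because `$` and `[[` inherit list semantics, a wrapper that checks `names()` first is the only way, in these forms, to turn a misspelled name into an error.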