Re: [Rd] read.csv
Gene names being misinterpreted by spreadsheet software (read.csv is no different) is a classic issue in bioinformatics. It seems like every practitioner runs into it sooner or later. E.g.

https://pubmed.ncbi.nlm.nih.gov/15214961/
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7
https://www.nature.com/articles/d41586-021-02211-4
https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates

On Tue, Apr 16, 2024 at 3:46 AM jing hua zhao wrote:
>
> Dear R-developers,
>
> I came across a somewhat unexpected behaviour of read.csv() which is trivial but
> worthwhile to note. My data involves a protein named "1433E", but to save
> space I dropped the quotes, so the file becomes
>
> Gene,SNP,prot,log10p
> YWHAE,13:62129097_C_T,1433E,7.35
> YWHAE,4:72617557_T_TA,1433E,7.73
>
> Both read.csv() and readr::read_csv() read the prot(ein) name as numeric 1433
> (possibly confused by scientific notation), which I only noticed when I
> tried to combine data,
>
> all_data <- data.frame()
> for (protein in proteins[1:7])
> {
>    cat(protein,":\n")
>    f <- paste0(protein,".csv")
>    if(file.exists(f))
>    {
>      p <- read.csv(f)
>      print(p)
>      if(nrow(p)>0) all_data <- bind_rows(all_data,p)
>    }
> }
>
> proteins[1:7]
> [1] "1433B" "1433E" "1433F" "1433G" "1433S" "1433T" "1433Z"
>
> dplyr::bind_rows() failed to work due to incompatible types; nevertheless,
> rbind() went ahead without warnings.
>
> Best wishes,
>
> Jing Hua

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] read.csv
Tangentially, your code will be more efficient if you collect the data frames in a *list* one by one and then call bind_rows() or do.call(rbind, ...) once, after you have accumulated all of the information (see chapter 2 of the _R Inferno_). This may or may not be practically important in your particular case.

Burns, Patrick. 2012. The R Inferno. Lulu.com. http://www.burns-stat.com/pages/Tutor/R_inferno.pdf.
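A minimal sketch of the list-then-combine pattern suggested above, in base R. The function name read_protein_files() and its dir argument are hypothetical, introduced only for illustration; extra arguments are passed straight through to read.csv().

```r
# Hypothetical helper: read every existing <protein>.csv into a list,
# then combine the pieces once at the end instead of growing a data frame.
read_protein_files <- function(proteins, dir = ".", ...) {
  files <- file.path(dir, paste0(proteins, ".csv"))
  files <- files[file.exists(files)]            # skip proteins without a file
  tables <- lapply(files, read.csv, ...)        # one data frame per file
  tables <- Filter(function(p) nrow(p) > 0, tables)  # drop empty tables
  if (length(tables) == 0) return(data.frame())
  do.call(rbind, tables)                        # single combine step
}
```

With dplyr loaded, the last line could instead be dplyr::bind_rows(tables), which would also report incompatible column types rather than coercing them.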
Re: [Rd] read.csv
As an aside, the odd format does not seem to bother data.table::fread(), which also happens to be my personally preferred workhorse for these tasks:

> fname <- "/tmp/r/filename.csv"
> read.csv(fname)
   Gene             SNP prot log10p
1 YWHAE 13:62129097_C_T 1433   7.35
2 YWHAE 4:72617557_T_TA 1433   7.73
> data.table::fread(fname)
    Gene             SNP  prot log10p
1: YWHAE 13:62129097_C_T 1433E   7.35
2: YWHAE 4:72617557_T_TA 1433E   7.73
> readr::read_csv(fname)
Rows: 2 Columns: 4
── Column specification ──
Delimiter: ","
chr (2): Gene, SNP
dbl (2): prot, log10p

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 2 × 4
  Gene  SNP              prot log10p
1 YWHAE 13:62129097_C_T  1433   7.35
2 YWHAE 4:72617557_T_TA  1433   7.73
>

That's on Linux, everything current but dev version of data.table.

Dirk

--
dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
Re: [Rd] read.csv
On 16/04/2024 7:36 a.m., Rui Barradas wrote:
> Hello,
>
> I wrote a file with that content and read it back with
>
> read.csv("filename.csv", as.is = TRUE)
>
> There were no problems, it all worked as expected.

What platform are you on? I got the same output as Jing Hua:

Input filename.csv:

Gene,SNP,prot,log10p
YWHAE,13:62129097_C_T,1433E,7.35
YWHAE,4:72617557_T_TA,1433E,7.73

Output:

> read.csv("filename.csv")
   Gene             SNP prot log10p
1 YWHAE 13:62129097_C_T 1433   7.35
2 YWHAE 4:72617557_T_TA 1433   7.73

Duncan Murdoch
Re: [Rd] read.csv
Hum... This boils down to

> as.numeric("1.23e")
[1] 1.23
> as.numeric("1.23e-")
[1] 1.23
> as.numeric("1.23e+")
[1] 1.23

which in turn comes from this code in src/main/util.c (function R_strtod)

    if (*p == 'e' || *p == 'E') {
        int expsign = 1;
        switch(*++p) {
        case '-': expsign = -1;
        case '+': p++;
        default: ;
        }
        for (n = 0; *p >= '0' && *p <= '9'; p++) n = (n < MAX_EXPONENT_PREFIX) ? n * 10 + (*p - '0') : n;
        expn += expsign * n;
    }

which sets the exponent to zero even if the for loop terminates immediately, i.e. when no digits follow the 'e' and its optional sign.

This might qualify as a bug, as it differs from the C function strtod which accepts

"A sequence of digits, optionally containing a decimal-point character (.), optionally followed by an exponent part (an e or E character followed by an optional sign and a sequence of digits)."

[Of course, there would be nothing to stop e.g. "1433E1" from being converted to numeric.]

-pd

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd@cbs.dk  Priv: pda...@gmail.com
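To make the difference concrete, here is a small sketch in R of a stricter check that follows the strtod wording quoted above: an exponent marker must be followed by at least one digit. The helper is_strict_decimal() is hypothetical (it is not part of R), and it deliberately ignores hexadecimal literals, Inf/NaN and surrounding whitespace.

```r
# Hypothetical helper (not part of R): TRUE when a string is a plain decimal
# literal whose exponent part, if present, contains at least one digit.
is_strict_decimal <- function(x) {
  pattern <- "^[+-]?([0-9]+\\.?[0-9]*|\\.[0-9]+)([eE][+-]?[0-9]+)?$"
  grepl(pattern, x)
}

candidates <- c("1433E", "1433E1", "1.23e-", "1.23e+", "1.23e-4", "7.35")
print(data.frame(value = candidates, strict = is_strict_decimal(candidates)))
```

Under this check "1433E", "1.23e-" and "1.23e+" are rejected, while "1433E1", "1.23e-4" and "7.35" pass, so an identifier such as "1433E1" would still look numeric to any reader that guesses types.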
Re: [Rd] read.csv
Hello,

I wrote a file with that content and read it back with

read.csv("filename.csv", as.is = TRUE)

There were no problems, it all worked as expected.

Hope this helps,

Rui Barradas
Re: [Rd] read.csv
You may want to consider helping read.csv() (and alternative reading functions) by supplying column-type information instead of relying on heuristic type guessing, which appears to fail here because of the nature of your data.

Other storage formats can store type info. That is generally safer and may be an option too.

I think this was more of an email for r-help than r-devel.

Dirk

--
dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
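As an illustration of supplying column-type information, the sketch below writes the two example rows from this thread to a temporary file and reads them back with a named colClasses entry, so that only the prot column is forced to character and the remaining columns are still guessed. The temporary file and the variable names are assumptions made for the example.

```r
# Recreate the example file from the thread in a temporary location.
csv_text <- c(
  "Gene,SNP,prot,log10p",
  "YWHAE,13:62129097_C_T,1433E,7.35",
  "YWHAE,4:72617557_T_TA,1433E,7.73"
)
f <- tempfile(fileext = ".csv")
writeLines(csv_text, f)

# Force 'prot' to be read as text; other columns keep automatic conversion.
p <- read.csv(f, colClasses = c(prot = "character"))
str(p)
```

With readr, the corresponding argument would be col_types, for example col_types = readr::cols(prot = readr::col_character()); data.table::fread() also accepts colClasses.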
[Rd] read.csv
Dear R-developers,

I came across a somewhat unexpected behaviour of read.csv() which is trivial but worthwhile to note. My data involves a protein named "1433E", but to save space I dropped the quotes, so the file becomes

Gene,SNP,prot,log10p
YWHAE,13:62129097_C_T,1433E,7.35
YWHAE,4:72617557_T_TA,1433E,7.73

Both read.csv() and readr::read_csv() read the prot(ein) name as numeric 1433 (possibly confused by scientific notation), which I only noticed when I tried to combine data,

all_data <- data.frame()
for (protein in proteins[1:7])
{
   cat(protein,":\n")
   f <- paste0(protein,".csv")
   if(file.exists(f))
   {
     p <- read.csv(f)
     print(p)
     if(nrow(p)>0) all_data <- bind_rows(all_data,p)
   }
}

proteins[1:7]
[1] "1433B" "1433E" "1433F" "1433G" "1433S" "1433T" "1433Z"

dplyr::bind_rows() failed to work due to incompatible types; nevertheless, rbind() went ahead without warnings.

Best wishes,

Jing Hua
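The silent rbind() behaviour described in this message can be reproduced without any files. In the sketch below the two data frames stand in for what read.csv() would return for two different proteins; the second row's log10p value is made up for illustration.

```r
# 'a' mimics a file whose prot column was converted to numeric 1433;
# 'b' mimics a file whose prot column stayed character.
a <- data.frame(prot = 1433, log10p = 7.35)
b <- data.frame(prot = "1433B", log10p = 5.1, stringsAsFactors = FALSE)

# Base rbind() coerces the numeric column to character without a warning,
# so the lost "E" goes unnoticed.
combined <- rbind(a, b)
str(combined)
```

Here combined$prot ends up as the character vector "1433", "1433B". A stricter combiner such as dplyr::bind_rows() reports the type mismatch instead, which is how the problem surfaced in the first place.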