Re: [Rd] read.csv

2024-04-16 Thread Reed A. Cartwright
Gene names being misinterpreted by spreadsheet software (read.csv is
no different) is a classic issue in bioinformatics. It seems like
every practitioner ends up encountering this issue in due time. E.g.

https://pubmed.ncbi.nlm.nih.gov/15214961/

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7

https://www.nature.com/articles/d41586-021-02211-4

https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates



__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] read.csv

2024-04-16 Thread Ben Bolker
  Tangentially, your code will be more efficient if you add the data 
files to a *list* one by one and then apply bind_rows or 
do.call(rbind,...) after you have accumulated all of the information 
(see chapter 2 of the _R Inferno_). This may or may not be practically 
important in your particular case.


Burns, Patrick. 2012. The R Inferno. Lulu.com. 
http://www.burns-stat.com/pages/Tutor/R_inferno.pdf.
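
A minimal sketch of that pattern, in base R only. The helper name read_all and its dir argument are made up for illustration, and pinning prot to character is an assumption added so that the pieces have matching column types:

```r
# Hypothetical helper: read each existing "<protein>.csv" into a list,
# then combine everything once at the end instead of growing a data frame.
read_all <- function(proteins, dir = ".") {
  files <- file.path(dir, paste0(proteins, ".csv"))
  files <- files[file.exists(files)]
  # Keep 'prot' as text so names such as "1433E" are not parsed as numbers.
  pieces <- lapply(files, read.csv, colClasses = c(prot = "character"))
  # Drop empty files, mirroring the nrow(p) > 0 check in the original loop.
  pieces <- Filter(function(p) nrow(p) > 0, pieces)
  do.call(rbind, pieces)  # returns NULL if no file was found
}
```

Calling read_all(proteins[1:7]) would then replace the loop; dplyr::bind_rows(pieces) could be used in place of do.call(rbind, pieces) if dplyr is loaded.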





Re: [Rd] read.csv

2024-04-16 Thread Dirk Eddelbuettel


As an aside, the odd format does not seem to bother data.table::fread() which
also happens to be my personally preferred workhorse for these tasks:

> fname <- "/tmp/r/filename.csv"
> read.csv(fname)
   Gene             SNP prot log10p
1 YWHAE 13:62129097_C_T 1433   7.35
2 YWHAE 4:72617557_T_TA 1433   7.73
> data.table::fread(fname)
     Gene             SNP   prot log10p
   <char>          <char> <char>  <num>
1:  YWHAE 13:62129097_C_T  1433E   7.35
2:  YWHAE 4:72617557_T_TA  1433E   7.73
> readr::read_csv(fname)
Rows: 2 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Gene, SNP
dbl (2): prot, log10p

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 2 × 4
  Gene  SNP              prot log10p
  <chr> <chr>           <dbl>  <dbl>
1 YWHAE 13:62129097_C_T  1433   7.35
2 YWHAE 4:72617557_T_TA  1433   7.73
> 

That's on Linux, everything current but dev version of data.table.

Dirk

-- 
dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org



Re: [Rd] read.csv

2024-04-16 Thread Duncan Murdoch

On 16/04/2024 7:36 a.m., Rui Barradas wrote:

> Hello,
>
> I wrote a file with that content and read it back with
>
> read.csv("filename.csv", as.is = TRUE)
>
> There were no problems, it all worked as expected.

What platform are you on?  I got the same output as Jing Hua:

Input filename.csv:

Gene,SNP,prot,log10p
YWHAE,13:62129097_C_T,1433E,7.35
YWHAE,4:72617557_T_TA,1433E,7.73

Output:

> read.csv("filename.csv")
   Gene             SNP prot log10p
1 YWHAE 13:62129097_C_T 1433   7.35
2 YWHAE 4:72617557_T_TA 1433   7.73

Duncan Murdoch



Re: [Rd] read.csv

2024-04-16 Thread peter dalgaard
Hum...

This boils down to

> as.numeric("1.23e")
[1] 1.23
> as.numeric("1.23e-")
[1] 1.23
> as.numeric("1.23e+")
[1] 1.23

which in turn comes from this code in src/main/util.c (function R_strtod)

    if (*p == 'e' || *p == 'E') {
        int expsign = 1;
        switch(*++p) {
        case '-': expsign = -1;
        case '+': p++;
        default: ;
        }
        for (n = 0; *p >= '0' && *p <= '9'; p++)
            n = (n < MAX_EXPONENT_PREFIX) ? n * 10 + (*p - '0') : n;
        expn += expsign * n;
    }

which silently treats the exponent as zero when the for loop terminates
immediately, i.e. when no digits follow the 'e' or 'E'.

This might qualify as a bug, as it differs from the C function strtod which 
accepts

"A sequence of digits, optionally containing a decimal-point character (.), 
optionally followed by an exponent part (an e or E character followed by an 
optional sign and a sequence of digits)."

[Of course, there would be nothing to stop e.g. "1433E1" from being converted 
to numeric.]
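
For anyone who wants to screen a column before trusting the conversion, here is a minimal sketch of a stricter check in R. It assumes a simplified version of the grammar quoted above (no hexadecimal, Inf/NaN/NA, or surrounding whitespace), and is_strict_number is a hypothetical helper name, not something in base R:

```r
# Hypothetical helper: TRUE only for plain decimal numbers whose exponent
# part, if present, contains at least one digit (as C's strtod requires).
is_strict_number <- function(x) {
  grepl("^[+-]?([0-9]+\\.?[0-9]*|\\.[0-9]+)([eE][+-]?[0-9]+)?$", x)
}

is_strict_number(c("1433E", "1433E1", "1.23e-", "7.35"))
# FALSE TRUE FALSE TRUE
```

A column could then be kept as character whenever any of its non-missing entries fails this check.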

-pd



-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd@cbs.dk  Priv: pda...@gmail.com



Re: [Rd] read.csv

2024-04-16 Thread Rui Barradas


Hello,

I wrote a file with that content and read it back with


read.csv("filename.csv", as.is = TRUE)


There were no problems, it all worked as expected.

Hope this helps,

Rui Barradas






Re: [Rd] read.csv

2024-04-16 Thread Dirk Eddelbuettel



You may want to help read.csv() (and alternative reading functions) by
supplying column-type information instead of relying on heuristic guesses,
which appear to fail here because of the nature of your data.
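
As a minimal sketch of supplying the type up front (base R only; the inline csv_text stands in for the file on disk, and only the prot column is pinned while the others are still guessed):

```r
# The three lines from the original report, used here instead of a file.
csv_text <- c(
  "Gene,SNP,prot,log10p",
  "YWHAE,13:62129097_C_T,1433E,7.35",
  "YWHAE,4:72617557_T_TA,1433E,7.73"
)

# A named colClasses vector fixes the type of 'prot' and leaves the
# remaining columns to the usual type guessing.
p <- read.csv(text = csv_text, colClasses = c(prot = "character"))
str(p)
```

readr and data.table offer comparable arguments (col_types and colClasses respectively).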

Other storage formats can store type info. That is generally safer and may be
an option too.
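
One such option within R itself is the RDS format, which serialises the object together with its column types. A minimal sketch, with a temporary file as a placeholder path:

```r
# One row from the original report, built directly with the intended types.
p <- data.frame(Gene = "YWHAE", SNP = "13:62129097_C_T",
                prot = "1433E", log10p = 7.35,
                stringsAsFactors = FALSE)

f <- tempfile(fileext = ".rds")  # placeholder path
saveRDS(p, f)
q <- readRDS(f)

str(q)  # 'prot' comes back as character; nothing is re-guessed on reading
```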

I think this was more of an email for r-help than r-devel.

Dirk

-- 
dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org



[Rd] read.csv

2024-04-16 Thread jing hua zhao
Dear R-developers,

I came across a somewhat unexpected behaviour of read.csv() which is trivial
but worth noting. My data involve a protein named "1433E", but to save space
I dropped the quotes, so the file becomes:

Gene,SNP,prot,log10p
YWHAE,13:62129097_C_T,1433E,7.35
YWHAE,4:72617557_T_TA,1433E,7.73

Both read.csv() and readr::read_csv() read the prot(ein) name as the numeric
value 1433 (possibly confused by scientific notation), which I only noticed
when I tried to combine the data:

library(dplyr)  # provides bind_rows()

all_data <- data.frame()
for (protein in proteins[1:7])
{
  cat(protein, ":\n")
  f <- paste0(protein, ".csv")
  if (file.exists(f))
  {
    p <- read.csv(f)
    print(p)
    if (nrow(p) > 0) all_data <- bind_rows(all_data, p)
  }
}

proteins[1:7]
[1] "1433B" "1433E" "1433F" "1433G" "1433S" "1433T" "1433Z"

dplyr::bind_rows() failed because of incompatible column types, whereas
rbind() went ahead without warnings.
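
The rbind() side of this can be reproduced in base R. The two one-row data frames below are illustrative stand-ins for what read.csv() returned for two different files, with log10p values borrowed from the example above:

```r
# "1433E" was parsed as the number 1433; "1433B" stayed a string.
a <- data.frame(prot = 1433, log10p = 7.35)
b <- data.frame(prot = "1433B", log10p = 7.73)

combined <- rbind(a, b)
str(combined$prot)  # character: the numeric 1433 was coerced to "1433"

# A cheap guard before combining: compare the column classes explicitly.
same_types <- identical(sapply(a, class), sapply(b, class))
same_types          # FALSE
```

A check of this kind flags the mismatch before any rows are combined, whichever combining function is used afterwards.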

Best wishes,


Jing Hua
