Re: [R] RODBC and Excel: Wrong Data Type Assumed on Import
Gabor Grothendieck [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] You could try using the COM interface rather than the ODBC interface. Try code such as this: library(RDCOMClient) xls - COMCreate(Excel.Application) xls[[Workbooks]]$Open(MySpreadsheet.xls) sheet - xls[[ActiveSheet]] mydata - sheet[[UsedRange]][[value]] xls$Quit() # convert mydata to a character matrix mydata.char - matrix(unlist(mydata), nc = length(xx)) Gabor, Thank you for that suggestion. I try to avoid COM, but it seems to work well with this problem. Because I have empty cells, which are treated as NULLS, the unlist didn't quite work. Here's what I did: library(RDCOMClient) xls - COMCreate(Excel.Application) xls[[Workbooks]]$Open(U:/efg/lab/R/Plasmid/construct list.xls) sheet - xls[[ActiveSheet]] mydata - sheet[[UsedRange]][[value]] xls$Quit() for (column in 1:length(mydata)) { cat(column, , length(mydata[[column]]), , length(unlist(mydata[[column]])), \n) } The results show that while mydata is a list of columns, if you unlist each column you'll be short by the number of NULL values. 1 1251 1251 2 1251 1198 3 1251 870 4 1251 327 5 1251 1250 This seemed a bit crude to fix that problem (can someone suggest a more elegant way?): mymatrix - NULL for (column in 1:length(mydata)) { # Use lappy to replace NULLs with strings, column-by-column mymatrix - cbind(mymatrix, lapply(mydata[[column]], function(cell) { ifelse(is.null(cell), , cell) } )) } # Fix column names colnames(mymatrix) - mymatrix[1,] mymatrix - mymatrix[-1,] mymatrix[273:276,] Plasmid Number Plasmid Concentration Comments Lost [1,] 274yxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxy 1 ug/ul 4 mg [2,] 275a xyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyx 1 ug/2 ul [3,] 275b xyxyxyxyxyxyxyxyxyxyxyxyx 1 ug/5 ul [4,] 276xyxyxyxyxyxyxyxyxyxyxyxyxyxy1 ug/5 ul Assumed Lost Thank you for preserving 275a and 275b as the names here. So, I'd recommend RDCOMClient over RODBC with Excel files. Being lucky shouldn't be part of processing Excel files. efg __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] RODBC and Excel: Wrong Data Type Assumed on Import
On 11/4/05, Earl F. Glynn [EMAIL PROTECTED] wrote: Gabor Grothendieck [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] You could try using the COM interface rather than the ODBC interface. Try code such as this: library(RDCOMClient) xls - COMCreate(Excel.Application) xls[[Workbooks]]$Open(MySpreadsheet.xls) sheet - xls[[ActiveSheet]] mydata - sheet[[UsedRange]][[value]] xls$Quit() # convert mydata to a character matrix mydata.char - matrix(unlist(mydata), nc = length(xx)) Gabor, Thank you for that suggestion. I try to avoid COM, but it seems to work well with this problem. Because I have empty cells, which are treated as NULLS, the unlist didn't quite work. Here's what I did: library(RDCOMClient) xls - COMCreate(Excel.Application) xls[[Workbooks]]$Open(U:/efg/lab/R/Plasmid/construct list.xls) sheet - xls[[ActiveSheet]] mydata - sheet[[UsedRange]][[value]] xls$Quit() for (column in 1:length(mydata)) { cat(column, , length(mydata[[column]]), , length(unlist(mydata[[column]])), \n) } The results show that while mydata is a list of columns, if you unlist each column you'll be short by the number of NULL values. 1 1251 1251 2 1251 1198 3 1251 870 4 1251 327 5 1251 1250 This seemed a bit crude to fix that problem (can someone suggest a more elegant way?): mymatrix - NULL for (column in 1:length(mydata)) { # Use lappy to replace NULLs with strings, column-by-column mymatrix - cbind(mymatrix, lapply(mydata[[column]], function(cell) { ifelse(is.null(cell), , cell) } )) } # Fix column names colnames(mymatrix) - mymatrix[1,] mymatrix - mymatrix[-1,] mymatrix[273:276,] Plasmid Number Plasmid Concentration Comments Lost [1,] 274yxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxy 1 ug/ul 4 mg [2,] 275a xyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyx 1 ug/2 ul [3,] 275b xyxyxyxyxyxyxyxyxyxyxyxyx 1 ug/5 ul [4,] 276xyxyxyxyxyxyxyxyxyxyxyxyxyxy1 ug/5 ul Assumed Lost Thank you for preserving 275a and 275b as the names here. So, I'd recommend RDCOMClient over RODBC with Excel files. Being lucky shouldn't be part of processing Excel files. You could try something like this which turns the data into a textConnection which is read using read.table: con - textConnection(do.call(paste, mydata)) dd - read.table(con, header = TRUE, na.strings = NULL, as.is = TRUE) You might need to vary the arguments to read.table depending on what it is you want to get out. Also, I have assumed that none of the strings contain spaces though using a sep= arg on paste and read.table could handle that too. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] RODBC and Excel: Wrong Data Type Assumed on Import
Hi As I now exclusively use copy paste method to transfer data from Excel to R I tried it and I got correctly a factor column when there were some non numeric data in Excel. Ctrl-C in Excel mydf-read.delim(clipboard) in R Are you sure that a respective column in Excel has values 275a and 275b in it? If yes I had tried to define colClasses vector for your columns. HTH Petr On 2 Nov 2005 at 12:45, Earl F. Glynn wrote: To: r-help@stat.math.ethz.ch From: Earl F. Glynn [EMAIL PROTECTED] Date sent: Wed, 2 Nov 2005 12:45:53 -0600 Subject:[R] RODBC and Excel: Wrong Data Type Assumed on Import The first column in my Excel sheet has mostly numbers but I need to treat it as character data: library(RODBC) channel - odbcConnectExcel(U:/efg/lab/R/Plasmid/construct list.xls) plasmid - sqlFetch(channel,Sheet1, as.is=TRUE) odbcClose(channel) names(plasmid) [1] Plasmid Number PlasmidConcentration Comments Lost # How is the type decided? I need a character type. class(plasmid$Plasmid Number) [1] numeric typeof(plasmid$Plasmid Number) [1] double plasmid$Plasmid Number[273:276] [1] 274 NA NA 276 The two NAs are supposed to be 275a and 275b. I tried the as.is=TRUE but that didn't help. I consulted Section 4, Relational databases, in the R Data Import/Export document (for Version 2.2.0). Section 4.2.2, Data types, was not helpful. In particular, this did not seem helpful: The more comprehensive of the R interface packages hide the type conversion issues from the user. Section 4.3.2, Package RODBC, provided a simple example of using ODBC .. with a(sic) Excel spreadsheet but is silent on how to control the data type on import. Could the documentation be expanded to address this issue? I really need to show Plasmid 275a and Plasmid 275b instead of Plasmid NA. Thanks for any help with this. efg -- Earl F. Glynn Scientific Programmer Bioinformatics Department Stowers Institute for Medical Research __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Petr Pikal [EMAIL PROTECTED] __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] RODBC and Excel: Wrong Data Type Assumed on Import
From my experience (somewhat of a guess): 1. Excel uses the first 16 rows of data to determine if a column is numeric or character. The data type which is most common in the first 16 rows will then be used for the whole column. If you sort the data so that at least the first 9 rows have character data, you may find this allows the data to be interpreted as character. There is supposedly a registy setting that can control how many lines to use (instead of 16), but I have not had success with the setting. I suspect that ODBC uses JET4, which may be the real source of the problem. See more here: http://www.dicks-blog.com/archives/2004/06/03/external-data-mixed-data-types/ 2. The gregmisc bundle has a different read.xls function that uses a Perl script (xls2csv) and seems to be safer with mixed-type columns. Requires a working version of Perl. Best, Kevin Wright The first column in my Excel sheet has mostly numbers but I need to treat it as character data: library(RODBC) http://tolstoy.newcastle.edu.au/R/help/05/09/11324.html#14938qlink1 * channel - odbcConnectExcel(U:/efg/lab/R/Plasmid/construct list.xls) * * plasmid - sqlFetch(channel,Sheet1, as.is=TRUE) * * odbcClose(channel) * names(plasmid) [1] Plasmid Number Plasmid Concentration Comments Lost # How is the type decided? I need a character type. class(plasmid$Plasmid Number) [1] numeric typeof(plasmid$Plasmid Number) [1] double plasmid$Plasmid Number[273:276] [1] 274 NA NA 276 The two NAs are supposed to be 275a and 275b. I tried the as.is=TRUE but that didn't help. I consulted Section 4, Relational databases, in the R Data Import/Export document (for Version 2.2.0). Section 4.2.2, Data types, was not helpful. In particular, this did not seem helpful: The more comprehensive of the R interface packages hide the type conversion issues from the user. Section 4.3.2, Package RODBC, provided a simple example of using ODBC .. with a(sic) Excel spreadsheet but is silent on how to control the data type on import. Could the documentation be expanded to address this issue? I really need to show Plasmid 275a and Plasmid 275b instead of Plasmid NA. Thanks for any help with this. efg -- Earl F. Glynn Scientific Programmer Bioinformatics Department [[alternative HTML version deleted]] __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] RODBC and Excel: Wrong Data Type Assumed on Import
Kevin Wright [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] From my experience (somewhat of a guess): Excel uses the first 16 rows of data to determine if a column is numeric or character. The data type which is most common in the first 16 rows will then be used for the whole column. I ran some experiments trying to force RODBC to read column 1 of my worksheet as character data (the data are mostly numbers with two exceptions, 275a and 275b, as mentioned earlier). Here's the base code: library(RODBC) channel - odbcConnectExcel(U:/efg/lab/R/Krumlauf-Plasmid/construct list.xls) plasmid - sqlFetch(channel,Sheet1, as.is=TRUE) odbcClose(channel) names(plasmid) [1] Plasmid Number PlasmidConcentration Comments Lost When Excel Sheet1 has rows 2:13 as an X to attempt to force treatment of column 1 as character data: class(plasmid$Plasmid Number) [1] numeric typeof(plasmid$Plasmid Number) [1] double plasmid$Plasmid Number[1:20] [1] NA NA NA NA NA NA NA NA NA NA NA NA 2 3 4 5 6 7 8 9 Why would any software with 12 consecutive X character strings assume the data are purely numeric? Add one more X so rows 2:14 have an X to attempt to force treatment of column 1 as character data: class(plasmid$Plasmid Number) [1] character typeof(plasmid$Plasmid Number) [1] character plasmid$Plasmid Number[1:20] [1] X X X X X X X X X X X X X NA NA NA NA NA NA NA So RODBC now recognizes character Xs in column 1 and then declares all numbers as invalid? These are incredibly (bad) assumptions. I say this is a bug, but it may be an ODBC problem and not one with R. And if this is not an official bug, then it's a serious design problem. Minimally, this issue should be described in the R Data Import/Export document, which everyone is told to read before asking a question. It's frustrating when packages like this work for toy problems, and the documentation never mentions the pitfalls of real data. The gregmisc bundle has a different read.xls function that uses a Perl script (xls2csv) and seems to be safer with mixed-type columns. Requires a working version of Perl. Thanks for this suggestion, but I think I'll just convert the Excel spreadsheet to a .csv and maintain it in that format. efg __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] RODBC and Excel: Wrong Data Type Assumed on Import
You could try using the COM interface rather than the ODBC interface. Try code such as this: library(RDCOMClient) xls - COMCreate(Excel.Application) xls[[Workbooks]]$Open(MySpreadsheet.xls) sheet - xls[[ActiveSheet]] mydata - sheet[[UsedRange]][[value]] xls$Quit() # convert mydata to a character matrix mydata.char - matrix(unlist(mydata), nc = length(xx)) On 11/3/05, Kevin Wright [EMAIL PROTECTED] wrote: From my experience (somewhat of a guess): 1. Excel uses the first 16 rows of data to determine if a column is numeric or character. The data type which is most common in the first 16 rows will then be used for the whole column. If you sort the data so that at least the first 9 rows have character data, you may find this allows the data to be interpreted as character. There is supposedly a registy setting that can control how many lines to use (instead of 16), but I have not had success with the setting. I suspect that ODBC uses JET4, which may be the real source of the problem. See more here: http://www.dicks-blog.com/archives/2004/06/03/external-data-mixed-data-types/ 2. The gregmisc bundle has a different read.xls function that uses a Perl script (xls2csv) and seems to be safer with mixed-type columns. Requires a working version of Perl. Best, Kevin Wright The first column in my Excel sheet has mostly numbers but I need to treat it as character data: library(RODBC) http://tolstoy.newcastle.edu.au/R/help/05/09/11324.html#14938qlink1 * channel - odbcConnectExcel(U:/efg/lab/R/Plasmid/construct list.xls) * * plasmid - sqlFetch(channel,Sheet1, as.is=TRUE) * * odbcClose(channel) * names(plasmid) [1] Plasmid Number Plasmid Concentration Comments Lost # How is the type decided? I need a character type. class(plasmid$Plasmid Number) [1] numeric typeof(plasmid$Plasmid Number) [1] double plasmid$Plasmid Number[273:276] [1] 274 NA NA 276 The two NAs are supposed to be 275a and 275b. I tried the as.is=TRUE but that didn't help. I consulted Section 4, Relational databases, in the R Data Import/Export document (for Version 2.2.0). Section 4.2.2, Data types, was not helpful. In particular, this did not seem helpful: The more comprehensive of the R interface packages hide the type conversion issues from the user. Section 4.3.2, Package RODBC, provided a simple example of using ODBC .. with a(sic) Excel spreadsheet but is silent on how to control the data type on import. Could the documentation be expanded to address this issue? I really need to show Plasmid 275a and Plasmid 275b instead of Plasmid NA. Thanks for any help with this. efg -- Earl F. Glynn Scientific Programmer Bioinformatics Department [[alternative HTML version deleted]] __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] RODBC and Excel: Wrong Data Type Assumed on Import
The first column in my Excel sheet has mostly numbers but I need to treat it as character data: library(RODBC) channel - odbcConnectExcel(U:/efg/lab/R/Plasmid/construct list.xls) plasmid - sqlFetch(channel,Sheet1, as.is=TRUE) odbcClose(channel) names(plasmid) [1] Plasmid Number PlasmidConcentration Comments Lost # How is the type decided? I need a character type. class(plasmid$Plasmid Number) [1] numeric typeof(plasmid$Plasmid Number) [1] double plasmid$Plasmid Number[273:276] [1] 274 NA NA 276 The two NAs are supposed to be 275a and 275b. I tried the as.is=TRUE but that didn't help. I consulted Section 4, Relational databases, in the R Data Import/Export document (for Version 2.2.0). Section 4.2.2, Data types, was not helpful. In particular, this did not seem helpful: The more comprehensive of the R interface packages hide the type conversion issues from the user. Section 4.3.2, Package RODBC, provided a simple example of using ODBC .. with a(sic) Excel spreadsheet but is silent on how to control the data type on import. Could the documentation be expanded to address this issue? I really need to show Plasmid 275a and Plasmid 275b instead of Plasmid NA. Thanks for any help with this. efg -- Earl F. Glynn Scientific Programmer Bioinformatics Department Stowers Institute for Medical Research __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html