subject:"\[R\] RODBC and Excel\: Wrong Data Type Assumed on Import"

Re: [R] RODBC and Excel: Wrong Data Type Assumed on Import

2005-11-04 Thread Earl F. Glynn

Gabor Grothendieck [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
 You could try using the COM interface rather than the ODBC
 interface.  Try code such as this:

 library(RDCOMClient)
 xls - COMCreate(Excel.Application)
 xls[[Workbooks]]$Open(MySpreadsheet.xls)
 sheet - xls[[ActiveSheet]]
 mydata - sheet[[UsedRange]][[value]]
 xls$Quit()

 # convert mydata to a character matrix
 mydata.char - matrix(unlist(mydata), nc = length(xx))

Gabor,

Thank you for that suggestion.  I try to avoid COM, but it seems to work
well with this problem.

Because I have empty cells, which are treated as NULLS, the unlist didn't
quite work.

Here's what I did:

library(RDCOMClient)
xls - COMCreate(Excel.Application)
xls[[Workbooks]]$Open(U:/efg/lab/R/Plasmid/construct list.xls)
sheet - xls[[ActiveSheet]]
mydata - sheet[[UsedRange]][[value]]
xls$Quit()

for (column in 1:length(mydata))
{
  cat(column,  , length(mydata[[column]]),  ,
length(unlist(mydata[[column]])), \n)
}

The results show that while mydata is a list of columns, if you unlist each
column you'll be short by the number of NULL values.

1   1251   1251
2   1251   1198
3   1251   870
4   1251   327
5   1251   1250

This seemed a bit crude to fix that problem (can someone suggest a more
elegant way?):

mymatrix - NULL
for (column in 1:length(mydata))
{
 # Use lappy to replace NULLs with  strings, column-by-column
  mymatrix - cbind(mymatrix, lapply(mydata[[column]], function(cell) {
ifelse(is.null(cell), , cell) } ))
}
# Fix column names
colnames(mymatrix) - mymatrix[1,]
mymatrix - mymatrix[-1,]

 mymatrix[273:276,]
 Plasmid Number Plasmid
Concentration Comments Lost
[1,] 274yxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxy 1 ug/ul
4 mg   
[2,] 275a xyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyx 1 ug/2
ul  
[3,] 275b xyxyxyxyxyxyxyxyxyxyxyxyx   1 ug/5
ul  
[4,] 276xyxyxyxyxyxyxyxyxyxyxyxyxyxy1 ug/5
ul  Assumed Lost

Thank you for preserving 275a and 275b as the names here.

So, I'd recommend RDCOMClient over RODBC with Excel files.  Being lucky
shouldn't be part of processing Excel files.

efg

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Re: [R] RODBC and Excel: Wrong Data Type Assumed on Import

2005-11-04 Thread Gabor Grothendieck

On 11/4/05, Earl F. Glynn [EMAIL PROTECTED] wrote:
 Gabor Grothendieck [EMAIL PROTECTED] wrote in message
 news:[EMAIL PROTECTED]
  You could try using the COM interface rather than the ODBC
  interface.  Try code such as this:
 
  library(RDCOMClient)
  xls - COMCreate(Excel.Application)
  xls[[Workbooks]]$Open(MySpreadsheet.xls)
  sheet - xls[[ActiveSheet]]
  mydata - sheet[[UsedRange]][[value]]
  xls$Quit()
 
  # convert mydata to a character matrix
  mydata.char - matrix(unlist(mydata), nc = length(xx))

 Gabor,

 Thank you for that suggestion.  I try to avoid COM, but it seems to work
 well with this problem.

 Because I have empty cells, which are treated as NULLS, the unlist didn't
 quite work.

 Here's what I did:

 library(RDCOMClient)
 xls - COMCreate(Excel.Application)
 xls[[Workbooks]]$Open(U:/efg/lab/R/Plasmid/construct list.xls)
 sheet - xls[[ActiveSheet]]
 mydata - sheet[[UsedRange]][[value]]
 xls$Quit()

 for (column in 1:length(mydata))
 {
  cat(column,  , length(mydata[[column]]),  ,
 length(unlist(mydata[[column]])), \n)
 }

 The results show that while mydata is a list of columns, if you unlist each
 column you'll be short by the number of NULL values.

 1   1251   1251
 2   1251   1198
 3   1251   870
 4   1251   327
 5   1251   1250

 This seemed a bit crude to fix that problem (can someone suggest a more
 elegant way?):

 mymatrix - NULL
 for (column in 1:length(mydata))
 {
  # Use lappy to replace NULLs with  strings, column-by-column
  mymatrix - cbind(mymatrix, lapply(mydata[[column]], function(cell) {
 ifelse(is.null(cell), , cell) } ))
 }
 # Fix column names
 colnames(mymatrix) - mymatrix[1,]
 mymatrix - mymatrix[-1,]

  mymatrix[273:276,]
 Plasmid Number Plasmid
 Concentration Comments Lost
 [1,] 274yxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxy 1 ug/ul
 4 mg   
 [2,] 275a xyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyxyx 1 ug/2
 ul  
 [3,] 275b xyxyxyxyxyxyxyxyxyxyxyxyx   1 ug/5
 ul  
 [4,] 276xyxyxyxyxyxyxyxyxyxyxyxyxyxy1 ug/5
 ul  Assumed Lost

 Thank you for preserving 275a and 275b as the names here.

 So, I'd recommend RDCOMClient over RODBC with Excel files.  Being lucky
 shouldn't be part of processing Excel files.


You could try something like this which turns the data into
a textConnection which is read using read.table:

con - textConnection(do.call(paste, mydata))
dd - read.table(con, header = TRUE, na.strings = NULL, as.is = TRUE)

You might need to vary the arguments to read.table depending on what
it is you want to get out.  Also, I have assumed that none of the
strings contain
spaces though using a sep= arg on paste and read.table could handle that
too.

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Re: [R] RODBC and Excel: Wrong Data Type Assumed on Import

2005-11-03 Thread Petr Pikal

Hi

As I now exclusively use copy paste method to transfer data from 
Excel to R I tried it and I got correctly a factor column when there 
were some non numeric data in Excel.

Ctrl-C in Excel
mydf-read.delim(clipboard) in R


Are you sure that a respective column in Excel has values 275a and 
275b in it? 

If yes I had tried to define colClasses vector for your columns.

HTH
Petr


On 2 Nov 2005 at 12:45, Earl F. Glynn wrote:

To: r-help@stat.math.ethz.ch
From:   Earl F. Glynn [EMAIL PROTECTED]
Date sent:  Wed, 2 Nov 2005 12:45:53 -0600
Subject:[R] RODBC and Excel:  Wrong Data Type Assumed on Import

 The first column in my Excel sheet has mostly numbers but I need to
 treat it as character data:
 
  library(RODBC)
  channel - odbcConnectExcel(U:/efg/lab/R/Plasmid/construct
  list.xls) plasmid - sqlFetch(channel,Sheet1, as.is=TRUE)
  odbcClose(channel)
 
  names(plasmid)
 [1] Plasmid Number PlasmidConcentration  Comments
 Lost
 
 # How is the type decided?  I need a character type.
  class(plasmid$Plasmid Number)
 [1] numeric
  typeof(plasmid$Plasmid Number)
 [1] double
 
  plasmid$Plasmid Number[273:276]
 [1] 274  NA  NA 276
 
 The two NAs are supposed to be 275a and 275b.  I tried the
 as.is=TRUE but that didn't help.
 
 I consulted Section 4, Relational databases, in the R Data
 Import/Export document (for Version 2.2.0).
 
 Section 4.2.2, Data types, was not helpful.  In particular, this did
 not seem helpful:  The more comprehensive of the R interface packages
 hide the type conversion issues from the user.
 
 Section 4.3.2, Package RODBC, provided a simple example of using ODBC
 .. with a(sic) Excel spreadsheet but is silent on how to control the
 data type on import.  Could the documentation be expanded to address
 this issue?
 
 I really need to show Plasmid 275a and Plasmid 275b instead of
 Plasmid NA.
 
 Thanks for any help with this.
 
 efg
 --
 Earl F. Glynn
 Scientific Programmer
 Bioinformatics Department
 Stowers Institute for Medical Research
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide!
 http://www.R-project.org/posting-guide.html

Petr Pikal
[EMAIL PROTECTED]

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Re: [R] RODBC and Excel: Wrong Data Type Assumed on Import

2005-11-03 Thread Kevin Wright

From my experience (somewhat of a guess):

1.

Excel uses the first 16 rows of data to determine if a column is numeric or
character. The data type which is most common in the first 16 rows will then
be used for the whole column. If you sort the data so that at least the
first 9 rows have character data, you may find this allows the data to be
interpreted as character. There is supposedly a registy setting that can
control how many lines to use (instead of 16), but I have not had success
with the setting. I suspect that ODBC uses JET4, which may be the real
source of the problem. See more here:
http://www.dicks-blog.com/archives/2004/06/03/external-data-mixed-data-types/

2.

The gregmisc bundle has a different read.xls function that uses a Perl
script (xls2csv) and seems to be safer with mixed-type columns.
Requires a working version of Perl.

Best,

Kevin Wright



The first column in my Excel sheet has mostly numbers but I need to treat it
as character data:

 library(RODBC)
http://tolstoy.newcastle.edu.au/R/help/05/09/11324.html#14938qlink1
* channel - odbcConnectExcel(U:/efg/lab/R/Plasmid/construct list.xls) *
* plasmid - sqlFetch(channel,Sheet1, as.is=TRUE) *
* odbcClose(channel) *

 names(plasmid)

[1] Plasmid Number Plasmid Concentration Comments Lost

# How is the type decided? I need a character type.
 class(plasmid$Plasmid Number)

[1] numeric
 typeof(plasmid$Plasmid Number)

[1] double

 plasmid$Plasmid Number[273:276]

[1] 274 NA NA 276

The two NAs are supposed to be 275a and 275b. I tried the as.is=TRUE but
that didn't help.

I consulted Section 4, Relational databases, in the R Data Import/Export
document (for Version 2.2.0).

Section 4.2.2, Data types, was not helpful. In particular, this did not seem
helpful: The more comprehensive of the R interface packages hide the type
conversion issues from the user.

Section 4.3.2, Package RODBC, provided a simple example of using ODBC ..
with a(sic) Excel spreadsheet but is silent on how to control the data type
on import. Could the documentation be expanded to address this issue?

I really need to show Plasmid 275a and Plasmid 275b instead of Plasmid
NA.

Thanks for any help with this.

efg

--
Earl F. Glynn
Scientific Programmer
Bioinformatics Department

[[alternative HTML version deleted]]

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Re: [R] RODBC and Excel: Wrong Data Type Assumed on Import

2005-11-03 Thread Earl F. Glynn

Kevin Wright [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
 From my experience (somewhat of a guess):

 Excel uses the first 16 rows of data to determine if a column is numeric
or
 character. The data type which is most common in the first 16 rows will
then
 be used for the whole column.

I ran some experiments trying to force RODBC to read column 1 of my
worksheet as character data (the data are mostly numbers with two
exceptions, 275a and 275b, as mentioned earlier).



Here's the base code:



 library(RODBC)

 channel - odbcConnectExcel(U:/efg/lab/R/Krumlauf-Plasmid/construct
list.xls)

 plasmid - sqlFetch(channel,Sheet1, as.is=TRUE)

 odbcClose(channel)

 names(plasmid)

[1] Plasmid Number PlasmidConcentration  Comments
Lost



When Excel Sheet1 has rows 2:13 as an X to attempt to force treatment of
column 1 as character data:



 class(plasmid$Plasmid Number)

[1] numeric

 typeof(plasmid$Plasmid Number)

[1] double

 plasmid$Plasmid Number[1:20]

 [1] NA NA NA NA NA NA NA NA NA NA NA NA  2  3  4  5  6  7  8  9



Why would any software with 12 consecutive X character strings assume
the data are purely numeric?



Add one more X so rows 2:14 have an X to attempt to force treatment of
column 1 as character data:



 class(plasmid$Plasmid Number)

[1] character

 typeof(plasmid$Plasmid Number)

[1] character

 plasmid$Plasmid Number[1:20]

 [1] X X X X X X X X X X X X X NA  NA  NA  NA  NA
NA  NA



So RODBC now recognizes character Xs in column 1 and then declares all
numbers as invalid?  These are incredibly (bad) assumptions.



I say this is a bug, but it may be an ODBC problem and not one with R.
And if this is not an official bug, then it's a serious design problem.
Minimally, this issue should be described in the R Data Import/Export
document, which everyone is told to read before asking a question.



It's frustrating when packages like this work for toy problems, and the
documentation never mentions the pitfalls of real data.


 The gregmisc bundle has a different read.xls function that uses a Perl
 script (xls2csv) and seems to be safer with mixed-type columns.
 Requires a working version of Perl.

Thanks for this suggestion, but I think I'll just convert the Excel
spreadsheet to a .csv and maintain it in that format.

efg

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Re: [R] RODBC and Excel: Wrong Data Type Assumed on Import

2005-11-03 Thread Gabor Grothendieck

You could try using the COM interface rather than the ODBC
interface.  Try code such as this:

library(RDCOMClient)
xls - COMCreate(Excel.Application)
xls[[Workbooks]]$Open(MySpreadsheet.xls)
sheet - xls[[ActiveSheet]]
mydata - sheet[[UsedRange]][[value]]
xls$Quit()

# convert mydata to a character matrix
mydata.char - matrix(unlist(mydata), nc = length(xx))



On 11/3/05, Kevin Wright [EMAIL PROTECTED] wrote:
 From my experience (somewhat of a guess):

 1.

 Excel uses the first 16 rows of data to determine if a column is numeric or
 character. The data type which is most common in the first 16 rows will then
 be used for the whole column. If you sort the data so that at least the
 first 9 rows have character data, you may find this allows the data to be
 interpreted as character. There is supposedly a registy setting that can
 control how many lines to use (instead of 16), but I have not had success
 with the setting. I suspect that ODBC uses JET4, which may be the real
 source of the problem. See more here:
 http://www.dicks-blog.com/archives/2004/06/03/external-data-mixed-data-types/

 2.

 The gregmisc bundle has a different read.xls function that uses a Perl
 script (xls2csv) and seems to be safer with mixed-type columns.
 Requires a working version of Perl.

 Best,

 Kevin Wright



 The first column in my Excel sheet has mostly numbers but I need to treat it
 as character data:

  library(RODBC)
 http://tolstoy.newcastle.edu.au/R/help/05/09/11324.html#14938qlink1
 * channel - odbcConnectExcel(U:/efg/lab/R/Plasmid/construct list.xls) *
 * plasmid - sqlFetch(channel,Sheet1, as.is=TRUE) *
 * odbcClose(channel) *

  names(plasmid)

 [1] Plasmid Number Plasmid Concentration Comments Lost

 # How is the type decided? I need a character type.
  class(plasmid$Plasmid Number)

 [1] numeric
  typeof(plasmid$Plasmid Number)

 [1] double

  plasmid$Plasmid Number[273:276]

 [1] 274 NA NA 276

 The two NAs are supposed to be 275a and 275b. I tried the as.is=TRUE but
 that didn't help.

 I consulted Section 4, Relational databases, in the R Data Import/Export
 document (for Version 2.2.0).

 Section 4.2.2, Data types, was not helpful. In particular, this did not seem
 helpful: The more comprehensive of the R interface packages hide the type
 conversion issues from the user.

 Section 4.3.2, Package RODBC, provided a simple example of using ODBC ..
 with a(sic) Excel spreadsheet but is silent on how to control the data type
 on import. Could the documentation be expanded to address this issue?

 I really need to show Plasmid 275a and Plasmid 275b instead of Plasmid
 NA.

 Thanks for any help with this.

 efg

 --
 Earl F. Glynn
 Scientific Programmer
 Bioinformatics Department

[[alternative HTML version deleted]]

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

[R] RODBC and Excel: Wrong Data Type Assumed on Import

2005-11-02 Thread Earl F. Glynn

The first column in my Excel sheet has mostly numbers but I need to treat it
as character data:

 library(RODBC)
 channel - odbcConnectExcel(U:/efg/lab/R/Plasmid/construct list.xls)
 plasmid - sqlFetch(channel,Sheet1, as.is=TRUE)
 odbcClose(channel)

 names(plasmid)
[1] Plasmid Number PlasmidConcentration  Comments
Lost

# How is the type decided?  I need a character type.
 class(plasmid$Plasmid Number)
[1] numeric
 typeof(plasmid$Plasmid Number)
[1] double

 plasmid$Plasmid Number[273:276]
[1] 274  NA  NA 276

The two NAs are supposed to be 275a and 275b.  I tried the as.is=TRUE but
that didn't help.

I consulted Section 4, Relational databases, in the R Data Import/Export
document (for Version 2.2.0).

Section 4.2.2, Data types, was not helpful.  In particular, this did not
seem helpful:  The more comprehensive of the R interface packages hide the
type conversion issues from the user.

Section 4.3.2, Package RODBC, provided a simple example of using ODBC ..
with a(sic) Excel spreadsheet but is silent on how to control the data type
on import.  Could the documentation be expanded to address this issue?

I really need to show Plasmid 275a and Plasmid 275b instead of Plasmid
NA.

Thanks for any help with this.

efg
--
Earl F. Glynn
Scientific Programmer
Bioinformatics Department
Stowers Institute for Medical Research

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Re: [R] RODBC and Excel: Wrong Data Type Assumed on Import

Re: [R] RODBC and Excel: Wrong Data Type Assumed on Import

Re: [R] RODBC and Excel: Wrong Data Type Assumed on Import

Re: [R] RODBC and Excel: Wrong Data Type Assumed on Import

Re: [R] RODBC and Excel: Wrong Data Type Assumed on Import

Re: [R] RODBC and Excel: Wrong Data Type Assumed on Import

[R] RODBC and Excel: Wrong Data Type Assumed on Import

7 matches

Site Navigation

Mail list logo

Footer information