Re: [R] Creating a Data Frame from an XML

2013-01-22 Thread Ben Tupper

On Jan 22, 2013, at 3:11 PM, Adam Gabbert wrote:

> Hello,
>
> I'm attempting to read information from an XML into a data frame in R using
> the XML package. I am unable to get the data into a data frame as I would
> like.  I have some sample code below.
>
> *XML Code:*
>
> Header...
>
> Data I want in a data frame:
>
> <data>
>   <row BRAND="GMC" NUM="1" YEAR="1999" VALUE="1" />
>   <row BRAND="FORD" NUM="1" YEAR="2000" VALUE="12000" />
>   <row BRAND="GMC" NUM="1" YEAR="2001" VALUE="12500" />
>   <row BRAND="FORD" NUM="1" YEAR="2002" VALUE="13000" />
>   <row BRAND="GMC" NUM="1" YEAR="2003" VALUE="14000" />
>   <row BRAND="FORD" NUM="1" YEAR="2004" VALUE="17000" />
>   <row BRAND="GMC" NUM="1" YEAR="2005" VALUE="15000" />
>   <row BRAND="GMC" NUM="1" YEAR="1967" VALUE="PRICLESS" />
>   <row BRAND="FORD" NUM="1" YEAR="2007" VALUE="17500" />
>   <row BRAND="GMC" NUM="1" YEAR="2008" VALUE="22000" />
> </data>
>
> *R Code:*
>
> doc <- xmlInternalTreeParse("Sample2.xml")
> top <- xmlRoot(doc)
> xmlName(top)
> names(top)
> art <- top[["row"]]
> art
>
> *Output:*
>
> art
> <row BRAND="GMC" NUM="1" YEAR="1999" VALUE="1"/>
>
> This is where I am having difficulties.  I am unable to access additional
> rows (e.g. <row BRAND="GMC" NUM="1" YEAR="1967" VALUE="PRICLESS" />),
> and I am unable to access the individual entries to actually create the
> data frame.  The data frame I would like is as follows:
>
> BRAND  NUM  YEAR  VALUE
> GMC    1    1999  1
> FORD   2    2000  12000
> GMC    1    2001  12500
> etc.
>
> Any help or suggestions would be appreciated.  Conversely, my eventual goal
> would be to take a data frame and write it into an XML in the previously
> shown format.
>
 
Hi,

You are so close!

You have a number of nodes with the name 'row'.  The [[ function selects just
one item from a list, and when several items share that name it returns
just the first.  So you really want to use the [ function instead, which
returns every matching node, and then select by position using [[.

library(XML)

> s <- c(" <data>",
+        " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1999\" VALUE=\"1\" />",
+        " <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2000\" VALUE=\"12000\" />",
+        " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2001\" VALUE=\"12500\" />",
+        " <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2002\" VALUE=\"13000\" />",
+        " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2003\" VALUE=\"14000\" />",
+        " <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2004\" VALUE=\"17000\" />",
+        " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2005\" VALUE=\"15000\" />",
+        " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1967\" VALUE=\"PRICLESS\" />",
+        " <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2007\" VALUE=\"17500\" />",
+        " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2008\" VALUE=\"22000\" />",
+        " </data>")

> x <- xmlRoot(xmlTreeParse(s, asText = TRUE, useInternalNodes = TRUE))

> x["row"][[1]]
<row BRAND="GMC" NUM="1" YEAR="1999" VALUE="1"/>

> x["row"][[2]]
<row BRAND="FORD" NUM="1" YEAR="2000" VALUE="12000"/>
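
Returning every child with a matching name is something the XML package's [ method does for nodes, as the x["row"] results above show. On an ordinary R list, [ with a name stops at the first match just as [[ does, so a logical index is needed to get all of them. A minimal sketch on a plain list follows; the list `nodes` and its contents are made up for illustration and are not part of the original post.

```r
# Hypothetical stand-in for the parsed node: a plain list with repeated names
nodes <- list(row = "GMC", row = "FORD", row = "GMC")

# [[ with a name returns the first element with that name
first_only <- nodes[["row"]]

# On a plain list, [ with a name also keeps only the first match ...
by_name <- nodes["row"]

# ... so select every match with a logical index, then pick one by position
all_rows <- nodes[names(nodes) == "row"]
second   <- all_rows[[2]]

length(by_name)    # 1
length(all_rows)   # 3
second             # "FORD"
```

The two-step pattern is the same as above: first collect everything named 'row', then index into that collection by position.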

Your rows are set up so the attributes have the values you want - use xmlAttrs 
to retrieve them.

> xmlAttrs(x["row"][[2]])
  BRAND     NUM    YEAR   VALUE 
 "FORD"     "1"  "2000" "12000" 


You can use lapply to iterate through each row and apply the xmlAttrs function.
You'll end up with a list of character vectors.

> y <- lapply(x["row"], xmlAttrs)
> str(y)
List of 10
 $ row: Named chr [1:4] "GMC" "1" "1999" "1"
  ..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
 $ row: Named chr [1:4] "FORD" "1" "2000" "12000"
  ..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
 $ row: Named chr [1:4] "GMC" "1" "2001" "12500"
  ..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
.
.
.

Next make a character matrix using do.call and rbind ...

> m <- do.call(rbind, y)
> str(m)
 chr [1:10, 1:4] "GMC" "FORD" "GMC" "FORD" "GMC" "FORD" "GMC" "GMC" "FORD" ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:10] "row" "row" "row" "row" ...
  ..$ : chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"

And then on to a data.frame...

> d <- as.data.frame(m, stringsAsFactors = FALSE)
> str(d)
'data.frame':   10 obs. of  4 variables:
 $ BRAND: chr  "GMC" "FORD" "GMC" "FORD" ...
 $ NUM  : chr  "1" "1" "1" "1" ...
 $ YEAR : chr  "1999" "2000" "2001" "2002" ...
 $ VALUE: chr  "1" "12000" "12500" "13000" ...
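
Every column of d is character because XML attribute values are text. If numeric columns are wanted, one possible follow-up is to let type.convert() guess a type for each column. The sketch below builds a three-row stand-in for m (an assumption, so that it runs on its own without the XML package) and converts it. VALUE stays character because of the non-numeric PRICLESS entry.

```r
# Assumed stand-in for the character matrix m: three of the ten rows
m <- rbind(c(BRAND = "GMC",  NUM = "1", YEAR = "1999", VALUE = "1"),
           c(BRAND = "FORD", NUM = "1", YEAR = "2000", VALUE = "12000"),
           c(BRAND = "GMC",  NUM = "1", YEAR = "1967", VALUE = "PRICLESS"))

d <- as.data.frame(m, stringsAsFactors = FALSE)

# Convert column by column; as.is = TRUE keeps text as character, not factor
d[] <- lapply(d, type.convert, as.is = TRUE)
str(d)

# If VALUE must be numeric, non-numeric entries become NA (warning suppressed here)
value_num <- suppressWarnings(as.numeric(d$VALUE))
```

Whether PRICLESS should end up as NA or be handled some other way depends on what the data is for, so the last line is only one option.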

Cheers,
Ben




> Thank you
>
> AG
>
>   [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Ben Tupper
Bigelow Laboratory for Ocean Sciences
180 McKown Point Rd. P.O. Box 475
West Boothbay Harbor, Maine   04575-0475 
http://www.bigelow.org



Re: [R] Creating a Data Frame from an XML

2013-01-22 Thread Gabor Grothendieck
On Tue, Jan 22, 2013 at 3:11 PM, Adam Gabbert <adamjgabb...@gmail.com> wrote:
> Hello,
>
> I'm attempting to read information from an XML into a data frame in R using
> the XML package. I am unable to get the data into a data frame as I would
> like.  I have some sample code below.
>
> *XML Code:*
>
> Header...
>
> Data I want in a data frame:
>
> <data>
>   <row BRAND="GMC" NUM="1" YEAR="1999" VALUE="1" />
>   <row BRAND="FORD" NUM="1" YEAR="2000" VALUE="12000" />
>   <row BRAND="GMC" NUM="1" YEAR="2001" VALUE="12500" />
>   <row BRAND="FORD" NUM="1" YEAR="2002" VALUE="13000" />
>   <row BRAND="GMC" NUM="1" YEAR="2003" VALUE="14000" />
>   <row BRAND="FORD" NUM="1" YEAR="2004" VALUE="17000" />
>   <row BRAND="GMC" NUM="1" YEAR="2005" VALUE="15000" />
>   <row BRAND="GMC" NUM="1" YEAR="1967" VALUE="PRICLESS" />
>   <row BRAND="FORD" NUM="1" YEAR="2007" VALUE="17500" />
>   <row BRAND="GMC" NUM="1" YEAR="2008" VALUE="22000" />
> </data>
>
> *R Code:*
>
> doc <- xmlInternalTreeParse("Sample2.xml")
> top <- xmlRoot(doc)
> xmlName(top)
> names(top)
> art <- top[["row"]]
> art

This will get a data frame of character columns:

> as.data.frame(t(xpathSApply(doc, "//row", xmlAttrs)),
+               stringsAsFactors = FALSE)
   BRAND NUM YEAR    VALUE
1    GMC   1 1999        1
2   FORD   1 2000    12000
3    GMC   1 2001    12500
4   FORD   1 2002    13000
5    GMC   1 2003    14000
6   FORD   1 2004    17000
7    GMC   1 2005    15000
8    GMC   1 1967 PRICLESS
9   FORD   1 2007    17500
10   GMC   1 2008    22000
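
For anyone without Sample2.xml at hand, the same one-liner can be tried on XML held in a string. This is only a sketch: the three-row text below is an assumed stand-in for the file, and the names txt, attrs and df are made up here. It needs the XML package installed.

```r
library(XML)

# Assumed stand-in for Sample2.xml: same <data>/<row> layout, three rows only
txt <- paste0(
  "<data>",
  "<row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1999\" VALUE=\"1\" />",
  "<row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2000\" VALUE=\"12000\" />",
  "<row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1967\" VALUE=\"PRICLESS\" />",
  "</data>")
doc <- xmlParse(txt, asText = TRUE)

# One attribute vector per <row>; with equal-length vectors the result
# simplifies to a matrix with one column per row element
attrs <- xpathSApply(doc, "//row", xmlAttrs)

# Transpose so that each <row> becomes a data frame row
df <- as.data.frame(t(attrs), stringsAsFactors = FALSE)
df
```

The simplification to a matrix relies on every row carrying the same set of attributes. If some rows were missing an attribute, xpathSApply would return a list instead and t() would not give the intended shape.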


--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [R] Creating a Data Frame from an XML

2013-01-22 Thread arun


Hi,

Maybe this also helps:
s <- c(" <data>",
       " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1999\" VALUE=\"1\" />",
       " <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2000\" VALUE=\"12000\" />",
       " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2001\" VALUE=\"12500\" />",
       " <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2002\" VALUE=\"13000\" />",
       " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2003\" VALUE=\"14000\" />",
       " <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2004\" VALUE=\"17000\" />",
       " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2005\" VALUE=\"15000\" />",
       " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1967\" VALUE=\"PRICLESS\" />",
       " <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2007\" VALUE=\"17500\" />",
       " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2008\" VALUE=\"22000\" />",
       " </data>")


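# Blank out everything except digits and capital letters, then trim the ends.
# This works here because every attribute name and value uses only those characters.
Lines1 <- gsub("^\\s+| \\s+$", "", gsub("[^0-9A-Z]", " ", s))
dat1 <- read.table(text = Lines1[Lines1 != ""], sep = "", header = FALSE,
                   stringsAsFactors = FALSE)
# Even-numbered columns hold the values, odd-numbered ones the attribute names
dat1New <- dat1[, seq(2, ncol(dat1), by = 2)]
colnames(dat1New) <- unlist(unique(dat1[, seq(1, ncol(dat1), by = 2)]))


str(dat1New)
#'data.frame':    10 obs. of  4 variables:
# $ BRAND: chr  "GMC" "FORD" "GMC" "FORD" ...
# $ NUM  : int  1 1 1 1 1 1 1 1 1 1
# $ YEAR : int  1999 2000 2001 2002 2003 2004 2005 1967 2007 2008
# $ VALUE: chr  "1" "12000" "12500" "13000" ...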

#or

# Capture the four quoted attribute values from each row;
# the outer gsub blanks the <data> and </data> lines
Lines2 <- gsub(" <.*", "",
               gsub("^.*=\"(.*)\"\\s+.*=\"(.*)\"\\s+.*=\"(.*)\"\\s+.*=\"(.*)\".*",
                    "\\1 \\2 \\3 \\4", s))
dat2 <- read.table(text = Lines2[Lines2 != "" & Lines2 != " "], sep = "",
                   header = FALSE, stringsAsFactors = FALSE)
# Column names are taken from dat1, built in the first approach
colnames(dat2) <- unlist(unique(dat1[, seq(1, ncol(dat1), by = 2)]))


str(dat2)
#'data.frame':    10 obs. of  4 variables:
# $ BRAND: chr  "GMC" "FORD" "GMC" "FORD" ...
# $ NUM  : int  1 1 1 1 1 1 1 1 1 1
# $ YEAR : int  1999 2000 2001 2002 2003 2004 2005 1967 2007 2008
# $ VALUE: chr  "1" "12000" "12500" "13000" ...


head(dat2,3)
#  BRAND NUM YEAR VALUE
#1   GMC   1 1999     1
#2  FORD   1 2000 12000
#3   GMC   1 2001 12500
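
For the reverse direction mentioned in the original post (a data frame written back out in the same <data>/<row> layout), plain string building may be enough. The following is a rough sketch rather than a tested solution for arbitrary data: the data frame df, the helper xml_escape and the output layout are all assumptions made for illustration, and only base R is used.

```r
# Hypothetical data frame in the shape produced above
df <- data.frame(BRAND = c("GMC", "FORD"), NUM = c(1L, 1L),
                 YEAR = c(1999L, 2000L), VALUE = c("1", "12000"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: escape characters that cannot appear raw
# inside a double-quoted XML attribute value
xml_escape <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  gsub("\"", "&quot;", x, fixed = TRUE)
}

# One vector of NAME="value" strings per column
pairs <- Map(function(nm, col) sprintf("%s=\"%s\"", nm, xml_escape(as.character(col))),
             names(df), df)

# Paste the columns together row by row and wrap each row in a <row ... /> tag
rows <- paste0("  <row ", do.call(paste, unname(pairs)), " />")
xml_lines <- c("<data>", rows, "</data>")

writeLines(xml_lines)   # pass a file name as second argument to save instead
```

The XML package's own node-building functions (for example newXMLNode with its attrs argument, followed by saveXML) would be the more robust route if the output needs to be guaranteed well-formed.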


A.K.

