Re: [R] Loading problem with XML_1.9

2007-07-09 Thread Duncan Temple Lang

Weijun and I corresponded off-list so that I could
get a copy of the data.

On a relatively modest machine (2G of RAM, 10G of swap,
dual-core 1GHz 64-bit AMDs), the code below takes approximately
100 seconds. It is not optimized in any particular way, so
there is room for improvement.

library(XML)

doc <- xmlTreeParse("mi1.txt.gz", useInternalNodes = TRUE)
mols <- getNodeSet(doc, "//molecule")

ans <-
  lapply(mols,
         function(node, targets) {
           # The leading "." keeps each sub-query inside this molecule node.
           names <- as.character(xpathApply(node, ".//name/text()", xmlValue))
           if(any(names %in% targets))
             xpathApply(node, ".//moleculeName", xmlValue)
           else
             character()
         }, c("Vps4b", "SKD1", "frm-1"))

# Keep only the molecules that matched at least one target.
ans <- ans[sapply(ans, length) > 0]


We can read the file without uncompressing it first, which
probably slows things down slightly.
The parsing of the tree takes about 20 seconds and occupies
approximately 1G (very roughly).
Then we find all the <molecule> nodes, of which there
are 25452.  Then we loop over each of these,
do a sub-query using XPath, and see whether
the text children of its <name> nodes are in the set
of interest (entirely made up for my test). If so, we
fetch the content of any <moleculeName> within this
<molecule> node.
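
To make the scoping of those sub-queries concrete, here is a minimal
sketch on a toy document. The nesting and the two toy records are
assumptions for illustration (the real file is not reproduced here);
only the function and element names come from the code above. As far
as I know, with current versions of the XML package a path starting
with "//" is evaluated from the document root even when a node is
passed in, whereas ".//" stays inside that node.

```r
library(XML)

# Toy data: the element nesting is assumed, not taken from the real file.
txt <- paste0(
  "<molecules>",
  "<molecule><moleculeName>A</moleculeName>",
  "<name>Vps4b</name><name>SKD1</name></molecule>",
  "<molecule><moleculeName>B</moleculeName>",
  "<name>RAC1</name></molecule>",
  "</molecules>")

doc  <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)
mols <- getNodeSet(doc, "//molecule")

# Absolute path: counts <name> nodes in the whole document for every molecule.
abs_counts <- sapply(mols, function(node) length(getNodeSet(node, "//name")))

# Relative path: counts only the <name> nodes below the given molecule.
rel_counts <- sapply(mols, function(node) length(getNodeSet(node, ".//name")))

print(abs_counts)
print(rel_counts)
```

If the two vectors differ, the absolute form is leaking matches from
other records into each molecule.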

It would be nice if we could build the hash for targets
just once.
And we could get clever with the XPath query to try to do
the matching and selection in one query. This might actually slow things
down.
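
As a rough sketch of what such a single query might look like (not
benchmarked, and the toy document and its nesting are assumptions),
the target names can be pasted into one XPath predicate. It returns a
flat character vector rather than one list element per molecule, and
it assumes the targets contain no single quotes.

```r
library(XML)

# Toy data: the element nesting is assumed, not taken from the real file.
txt <- paste0(
  "<molecules>",
  "<molecule><moleculeName>A</moleculeName>",
  "<name>Vps4b</name><name>SKD1</name></molecule>",
  "<molecule><moleculeName>B</moleculeName>",
  "<name>RAC1</name></molecule>",
  "</molecules>")
doc <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)

targets <- c("Vps4b", "SKD1", "frm-1")

# Builds: .//name = 'Vps4b' or .//name = 'SKD1' or .//name = 'frm-1'
# In XPath 1.0 a node-set equals a string if any node in it has that value.
pred  <- paste(sprintf(".//name = '%s'", targets), collapse = " or ")
query <- sprintf("//molecule[%s]//moleculeName", pred)

hits <- xpathSApply(doc, query, xmlValue)
print(hits)
```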

(There are garbage collection issues with XPath sub-queries
for which I am still deciding on the optimal strategy.)

So perhaps the lesson here is that, for those working with XML,
XPath is worth trying before more specialized approaches,
and that large XML data files can fit into memory. The tree
does not use contiguous memory, so nodes can be squeezed into
available spaces.

D.



Re: [R] Loading problem with XML_1.9

2007-07-08 Thread Luo Weijun
Thanks, Dr. Lang,
I used xmlEventParse() with the branches concept as you
suggested. It really works, and the memory issue is
gone. Now I can query large XML files from within R.
But here is another problem: it is too slow (a simple
query has not finished after 1.5h). The number of
relevant records is very limited, but the
whole XML file has more than 500 thousand
similarly-structured records, and the parser has to go
through all of them to find the matches. Attached
is part of the XML file with two records. I am trying
to retrieve the content of <moleculeName> nodes from
<molecule> records where <name> nodes bear specific
gene names.
Is it possible to first locate nodes based on their content
(or xmlValue) rather than their names (since the names are
the same in all records) and then parse the XML
record locally? Would a query based on XPath be faster
in this case? I understand that we do have a
facility in the XML package for XPath-based queries,
called getNodeSet(). But that requires reading the
whole XML tree into memory first, which is not
feasible for my large XML file. Or can I call
XML::XPath statements using your R-Perl interface
package? Any suggestions/thoughts? Thank you!
Weijun


Part of my XML file: 


20119043
protein20
1009020
20GI6677981
SKD120
Vps4b20
8030489C12Rik20
Mouse homologue of yeast Vacuolar protein sorting 4 (Vps4);
Suppressor of potassium transport defect 1. Member of
mammalian class E Vps proteins involved in endosomal
transport; AAA-type ATPase.20
Mouse homologue of yeast  Vacuolar protein sorting 4 (Vps4);
Suppressor of potassium  transport defect 1. Member of
mammalian class E Vps proteins involved in endosomal
transport; AAA-type ATPase.20

337974miClust80


200

20158189434SBP1
00
20158217953mVps2
00



30116226
protein30
960630
30HGNC9859
RAP1GDS130
GDS130
MGC11885930
MGC11886130

300

93569116280RAC1
00
104132103040RHOA
00
12181874726MBIP
00


Re: [R] Loading problem with XML_1.9

2007-07-07 Thread Duncan Temple Lang

Well, as you mention at the end of the mail,
several people have given you suggestions about
how to solve the problem using different approaches.
You might search on the Web for how to install a 64 bit version of libxml2?
Using xmlTreeParse(, useInternalNodes = TRUE) is an approach
to reducing the memory consumption as is using the handlers
argument. And if size is really the issue, you should consider
the SAX model which is very memory efficient and made available
via the xmlEventParse() function in the XML package.
It even provides the concept of branches, which gives a
hybrid of SAX and DOM-style parsing.
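
A minimal sketch of the branches idea, on assumed toy data: the
element names follow the later code in this thread, but the nesting
(with <name> and <moleculeName> as direct children of <molecule>),
the toy records and the target list are all made up for illustration.
Each complete <molecule> subtree is handed to the branch function as
a small tree, inspected, and then dropped, so the whole document is
never held in memory at once.

```r
library(XML)

# Toy data: the element nesting is assumed, not taken from the real file.
txt <- paste0(
  "<molecules>",
  "<molecule><moleculeName>A</moleculeName>",
  "<name>Vps4b</name><name>SKD1</name></molecule>",
  "<molecule><moleculeName>B</moleculeName>",
  "<name>RAC1</name></molecule>",
  "</molecules>")

targets <- c("Vps4b", "SKD1", "frm-1")
found   <- character()

# Branch handler: called once per complete <molecule> subtree.
molecule <- function(node, ...) {
  kids <- xmlChildren(node)
  nms  <- vapply(kids[names(kids) == "name"], xmlValue, character(1))
  if (any(nms %in% targets)) {
    hit <- vapply(kids[names(kids) == "moleculeName"], xmlValue, character(1))
    found <<- c(found, unname(hit))   # accumulate matches outside the handler
  }
  invisible(NULL)
}

invisible(xmlEventParse(txt, handlers = list(),
                        branches = list(molecule = molecule),
                        asText = TRUE))
print(found)
```

For a real file, the first argument would be the file name and
asText would be left at its default.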

However, to solve the problem of the xmlMemDisplay
symbol not being found, you can look for where
that is used and remove it. It is in src/DocParse.c
in the routine RS_XML_MemoryShow().  You can remove
the line
  xmlMemDisplay(stderr)
or indeed the entire routine.  Then re-install and
reload the package.

 D.



__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Loading problem with XML_1.9

2007-07-07 Thread Luo Weijun
Hello Dr. Lang and all,
I posted this message to the R-help mailing list, but haven't
solved my problem so far. Could you help me
look at it?
I have a loading problem with XML_1.9 under 64-bit
R 2.3.1 for Mac OS X, which I got from
http://R.research.att.com/. XML_1.9 works fine under
32-bit R 2.5.0. I thought it could be an installation
problem, so I tried install.packages and biocLite.
Every time the package installed fine, except for the
warning messages below:
ld64 warning: in /usr/lib/libxml2.dylib, file does not
contain requested architecture
ld64 warning: in /usr/lib/libz.dylib, file does not
contain requested architecture
ld64 warning: in /usr/lib/libiconv.dylib, file does
not contain requested architecture
ld64 warning: in /usr/lib/libz.dylib, file does not
contain requested architecture
ld64 warning: in /usr/lib/libxml2.dylib, file does not
contain requested architecture

Here are the error messages I got when XML is loaded:
> library(XML)
Error in dyn.load(x, as.logical(local),
as.logical(now)) : 
unable to load shared library
'/usr/local/lib64/R/library/XML/libs/XML.so':
  dlopen(/usr/local/lib64/R/library/XML/libs/XML.so,
6): Symbol not found: _xmlMemDisplay
  Referenced from:
/usr/local/lib64/R/library/XML/libs/XML.so
  Expected in: flat namespace
Error: .onLoad failed in 'loadNamespace' for 'XML'
Error: package/namespace load failed for 'XML'

Session information
> sessionInfo()
Version 2.3.1 Patched (2006-06-27 r38447) 
powerpc64-apple-darwin8.7.0 

attached base packages:
[1] "methods"   "stats" "graphics"  "grDevices"
"utils" "datasets" 
[7] "base" 

Prof Brian Ripley also suggested that I may not have
a 64-bit version of libxml2 installed. If that's the
problem, where do I get it and where/how do I install it?
The reason I need to use R64 is that I hit a memory
limit with the 32-bit version of R when I load
some very large XML trees (the data file is about
800M). Martin suggested that I use the 'handler'
argument of xmlTreeParse. I tried 'handler' with
useInternalNodes=T, but I still got this memory
problem with the 32-bit version of R. Please tell me what I
can do now. Thank you so much!
Weijun



Re: [R] Loading problem with XML_1.9

2007-06-28 Thread Martin Morgan
Weijun --

If memory is a problem, you might try using the 'handler' argument of
xmlTreeParse. This provides access to each node as it is processed, so
that you can, for instance, choose to ignore nodes, or save only
numeric values, or ... I'm not sure whether the entire document is
read into a C 'external pointer', or whether the savings is just in
the R representation of the document.
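
A small sketch of the handler mechanism, on made-up data: the
<description> element is a hypothetical name standing in for any
bulky node you do not need. A handler named after an element is
called for each such node, and returning NULL drops that node from
the resulting R tree. Whether this also reduces memory use on the C
side is the open question mentioned above, so treat this only as an
illustration of the mechanism.

```r
library(XML)

# Toy data; <description> is a hypothetical element name.
txt <- paste0(
  "<molecules>",
  "<molecule><name>Vps4b</name>",
  "<description>long free text we do not need</description></molecule>",
  "</molecules>")

handlers <- list(
  # Returning NULL discards the node instead of adding it to the tree.
  description = function(node, ...) NULL
)

# asTree = TRUE asks for the parsed tree back rather than the handlers.
doc  <- xmlTreeParse(txt, asText = TRUE, handlers = handlers, asTree = TRUE)
mol  <- xmlRoot(doc)[[1]]
kept <- names(xmlChildren(mol))
print(kept)
```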

Also, depending on how you use the resulting document, you might want
to watch out for the memory leak mentioned in
http://www.omegahat.org/RSXML/Changes

Martin

Luo Weijun <[EMAIL PROTECTED]> writes:

> Hello all,
> I have loading problem with XML_1.9 under 64 bit
> R2.3.1, which I got from http://R.research.att.com/.
> XML_1.9 works fine under 32 bit R2.5.0. I thought that
> could be installation problem, and I tried
> install.packages or biocLite, every time the package
> installed fine, except some warning messages below:
> ld64 warning: in /usr/lib/libxml2.dylib, file does not
> contain requested architecture
> ld64 warning: in /usr/lib/libz.dylib, file does not
> contain requested architecture
> ld64 warning: in /usr/lib/libiconv.dylib, file does
> not contain requested architecture
> ld64 warning: in /usr/lib/libz.dylib, file does not
> contain requested architecture
> ld64 warning: in /usr/lib/libxml2.dylib, file does not
> contain requested architecture
>
> Here is the error messages I got, when XML is loaded:
>> library(XML)
> Error in dyn.load(x, as.logical(local),
> as.logical(now)) : 
> unable to load shared library
> '/usr/local/lib64/R/library/XML/libs/XML.so':
>   dlopen(/usr/local/lib64/R/library/XML/libs/XML.so,
> 6): Symbol not found: _xmlMemDisplay
>   Referenced from:
> /usr/local/lib64/R/library/XML/libs/XML.so
>   Expected in: flat namespace
> Error: .onLoad failed in 'loadNamespace' for 'XML'
> Error: package/namespace load failed for 'XML'
>
> I understand that it has been pointed out that
> Sys.getenv("PATH") needs to be revised in the file
> XML/R/zzz.R, but I can't even find that file under
> XML/R/ directory. Does anybody have any idea what
> might be the problem, and how to solve it? Thanks a
> lot!
> BTW, the reason I need to use R64 is that I have
> memory limitation issue with R 32 bit version when I
> load some very large XML trees. 
>
> Session information
>> sessionInfo()
> Version 2.3.1 Patched (2006-06-27 r38447) 
> powerpc64-apple-darwin8.7.0 
>
> attached base packages:
> [1] "methods"   "stats" "graphics"  "grDevices"
> "utils" "datasets" 
> [7] "base" 
>
> Weijun
>

-- 
Martin Morgan
Bioconductor / Computational Biology
http://bioconductor.org



Re: [R] Loading problem with XML_1.9

2007-06-27 Thread Prof Brian Ripley

Please don't post to multiple lists: I have removed the BioC-devel list.
This is about MacOS X, and the appropriate list is R-sig-mac.

There is no intrinsic 64-bit problem: package XML 1.9-0 (sic) works fine 
on 64-bit versions of Solaris and Linux.  Most likely there was an 
installation problem, and you do not have a 64-bit version of libxml2 
installed or in the run-time library path.
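
One rough way to check this from within R is sketched below. The
library path is the one named in the ld64 warnings quoted further
down; the use of the system 'file' utility is an assumption about
what is installed, and the block simply skips that step when the
utility or the library is missing.

```r
# Pointer size tells us whether this R build is 32-bit or 64-bit.
bits <- 8 * .Machine$sizeof.pointer
cat("R is running as a", bits, "bit process\n")

# Path taken from the linker warnings; adjust for your own system.
lib <- "/usr/lib/libxml2.dylib"

# Ask the 'file' utility (if present) which architectures the library holds.
if (nzchar(Sys.which("file")) && file.exists(lib)) {
  print(system2("file", shQuote(lib), stdout = TRUE))
} else {
  cat("Skipping library check:", lib, "or the 'file' utility was not found\n")
}
```

If the library lists no 64-bit architecture while R reports 64, the
package will link against a library it cannot use at load time.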


On Wed, 27 Jun 2007, Luo Weijun wrote:


> Hello all,
> I have loading problem with XML_1.9 under 64 bit
> R2.3.1, which I got from http://R.research.att.com/.

For MacOS X, unstated.

> XML_1.9 works fine under 32 bit R2.5.0. I thought that
> could be installation problem, and I tried
> install.packages or biocLite, every time the package
> installed fine, except some warning messages below:
> ld64 warning: in /usr/lib/libxml2.dylib, file does not
> contain requested architecture
> ld64 warning: in /usr/lib/libz.dylib, file does not
> contain requested architecture
> ld64 warning: in /usr/lib/libiconv.dylib, file does
> not contain requested architecture
> ld64 warning: in /usr/lib/libz.dylib, file does not
> contain requested architecture
> ld64 warning: in /usr/lib/libxml2.dylib, file does not
> contain requested architecture
>
> Here is the error messages I got, when XML is loaded:
>> library(XML)
> Error in dyn.load(x, as.logical(local),
> as.logical(now)) :
>    unable to load shared library
> '/usr/local/lib64/R/library/XML/libs/XML.so':
>  dlopen(/usr/local/lib64/R/library/XML/libs/XML.so,
> 6): Symbol not found: _xmlMemDisplay
>  Referenced from:
> /usr/local/lib64/R/library/XML/libs/XML.so
>  Expected in: flat namespace
> Error: .onLoad failed in 'loadNamespace' for 'XML'
> Error: package/namespace load failed for 'XML'
>
> I understand that it has been pointed out that
> Sys.getenv("PATH") needs to be revised in the file
> XML/R/zzz.R, but I can't even find that file under
> XML/R/ directory. Does anybody have any idea what
> might be the problem, and how to solve it? Thanks a
> lot!
> BTW, the reason I need to use R64 is that I have
> memory limitation issue with R 32 bit version when I
> load some very large XML trees.
>
> Session information
>> sessionInfo()
> Version 2.3.1 Patched (2006-06-27 r38447)
> powerpc64-apple-darwin8.7.0
>
> attached base packages:
> [1] "methods"   "stats" "graphics"  "grDevices"
> "utils" "datasets"
> [7] "base"
>
> Weijun


--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595