Re: [R] Loading problem with XML_1.9

2007-07-09 Thread Duncan Temple Lang

Weijun and I corresponded off-list so that I could
get a copy of the data.

On a relatively modest machine with 2G of RAM, 10G swap,
dual core 1Ghz 64bit AMDs, the code below takes approximately
100 seconds. It is not optimized in any particular way, so
there is room for improvement.

doc <- xmlTreeParse("mi1.txt.gz", useInternal = TRUE)
mols <- getNodeSet(doc, "//molecule")

ans =
lapply(mols,
       function(node, targets) {
           names = as.character(xpathApply(node, "//name/text()",
                                           xmlValue))
           if(any(names %in% targets))
               xpathApply(node, "//moleculeName", xmlValue)
           else
               character()
       }, c("Vps4b", "SKD1", "frm-1"))

ans = ans[sapply(ans, length) > 0]


We can read the file without uncompressing which probably
slows things down slightly.
The parsing of the tree takes about 20 seconds and occupies
approximately 1G (very roughly).
Then we find all the molecule nodes, of which there
are 25452.  Then we loop over each of these,
do a sub-query using XPath, and see if any of
the text children of the name nodes are in the set
of interest (entirely made up for my test), and if so
fetch the content of any moleculeName within this
molecule node.

It would be nice if we could build the hash for targets
just once.
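
One possible reading of "build the hash just once", sketched in base R. As far as I know, %in% hashes its right-hand side on every call, so a lookup table built ahead of time avoids repeating that work for each molecule. The names target_set and has_target are made up for this illustration and are not part of the XML package.

```r
targets <- c("Vps4b", "SKD1", "frm-1")

# Build the lookup table once: an R environment is a hash table keyed by name.
target_set <- new.env(hash = TRUE, parent = emptyenv())
for (nm in targets) assign(nm, TRUE, envir = target_set)

# TRUE if any of the candidate names is one of the targets.
has_target <- function(candidates, set = target_set) {
  candidates <- candidates[nzchar(candidates)]  # exists() rejects empty names
  any(vapply(candidates, exists, logical(1), envir = set, inherits = FALSE))
}

has_target(c("SKD1", "8030489C12Rik"))   # TRUE
has_target(c("RAP1GDS1", "GDS1"))        # FALSE
has_target(character())                  # FALSE
```

Whether this beats a plain %in% for only three targets is an open question; it would need timing on the real file.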
And we could get clever with the XPath query to try to do
the matching and selection in one query. This might actually slow things
down.
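
A minimal sketch of what such a single query could look like, on a cut-down two-record document. The wrapping molecules element and the shortened records are assumptions for the example; only the element names come from the sample posted in this thread. The predicate compares name/text() rather than name, because each name node also carries a prov child whose text would otherwise be part of the compared string.

```r
library(XML)

# Two cut-down records; the root element name is hypothetical.
txt <- '<molecules>
<molecule>
<name>SKD1<prov><im><imid>20</imid></im></prov></name>
<name>Vps4b<prov><im><imid>20</imid></im></prov></name>
<interaction><moleculeName>SBP1</moleculeName></interaction>
<interaction><moleculeName>mVps2</moleculeName></interaction>
</molecule>
<molecule>
<name>RAP1GDS1<prov><im><imid>30</imid></im></prov></name>
<interaction><moleculeName>RAC1</moleculeName></interaction>
</molecule>
</molecules>'
doc <- xmlParse(txt, asText = TRUE)

targets <- c("Vps4b", "SKD1", "frm-1")

# Build  name/text()='Vps4b' or name/text()='SKD1' or ...
# (assumes no target contains a single quote).
pred  <- paste(sprintf("name/text()='%s'", targets), collapse = " or ")
query <- sprintf("//molecule[%s]//moleculeName", pred)

# One pass: select matching molecules and pull out their moleculeName values.
partners <- xpathSApply(doc, query, xmlValue)
partners   # "SBP1" "mVps2"
```

This returns one flat vector rather than one result per molecule, so it loses the grouping that the lapply() version keeps.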

(There are garbage collection issues with XPath sub-queries
for which I am still deciding on the optimal strategy.)

So perhaps the lesson here is that, for those working with XML,
XPath is worth trying before more specialized approaches,
and that large XML data files can fit into memory. The tree
does not use contiguous memory, so nodes can be squeezed into
available spaces.

D.



Re: [R] Loading problem with XML_1.9

2007-07-08 Thread Duncan Temple Lang

Well, as you mention at the end of the mail,
several people have given you suggestions about
how to solve the problem using different approaches.
You might search on the Web for how to install a 64 bit version of libxml2?
Using xmlTreeParse(, useInternalNodes = TRUE) is an approach
to reducing the memory consumption as is using the handlers
argument. And if size is really the issue, you should consider
the SAX model which is very memory efficient and made available
via the xmlEventParse() function in the XML package.
It even provides the concept of branches, which gives a
hybrid of SAX and DOM-style parsing.
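
A rough sketch of the branches idea on a tiny inline document, assuming the records look like the sample posted in this thread. The wrapping molecules element and the helper names own_text, on_molecule and found are made up for the example. The parser streams through the input, and each time a complete molecule element has been read it hands that sub-tree to the branch function as a node.

```r
library(XML)

# Two cut-down records; the root element name is hypothetical.
txt <- '<molecules>
<molecule>
<name>SKD1<prov><im><imid>20</imid></im></prov></name>
<name>Vps4b<prov><im><imid>20</imid></im></prov></name>
<interaction><moleculeName>SBP1</moleculeName></interaction>
<interaction><moleculeName>mVps2</moleculeName></interaction>
</molecule>
<molecule>
<name>RAP1GDS1<prov><im><imid>30</imid></im></prov></name>
<interaction><moleculeName>RAC1</moleculeName></interaction>
</molecule>
</molecules>'

targets <- c("Vps4b", "SKD1", "frm-1")
found <- list()

# Text held directly in a node, ignoring child elements such as prov.
own_text <- function(node) {
  kids <- xmlChildren(node)
  paste(vapply(kids[names(kids) == "text"], xmlValue, character(1)),
        collapse = "")
}

# Called once for each complete molecule element.
on_molecule <- function(node, ...) {
  kids <- xmlChildren(node)
  nms <- vapply(kids[names(kids) == "name"], own_text, character(1))
  if (any(nms %in% targets)) {
    partners <- vapply(kids[names(kids) == "interaction"],
                       function(i) xmlValue(xmlChildren(i)[["moleculeName"]]),
                       character(1))
    found[[length(found) + 1L]] <<- unname(partners)
  }
  invisible(NULL)
}

invisible(xmlEventParse(txt, handlers = NULL, asText = TRUE,
                        branches = list(molecule = on_molecule)))
found
```

Only one molecule sub-tree is held as a node at a time, which is where the memory saving comes from; the price is that every record is still visited.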

However, to solve the problem of the xmlMemDisplay
symbol not being found, you can look for where
that is used and remove it. It is in src/DocParse.c
in the routine RS_XML_MemoryShow().  You can remove
the line
  xmlMemDisplay(stderr)
or indeed the entire routine.  Then re-install and
reload the package.

 D.



__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Loading problem with XML_1.9

2007-07-08 Thread Luo Weijun
Thanks, Dr. Lang,
I used xmlEventParse() with the branches concept as you
suggested. It really works, and the memory issue is
gone. Now I can query large XML files from within R.
But here is another problem: it is too slow (a simple
query had not finished after 1.5 h). The
number of relevant records is very limited, but the
whole XML file has more than 500 thousand
similarly structured records, and the parser has to go
through all of them to find the matches. Attached
is part of the XML file, with two records. I am trying
to retrieve the content of moleculeName nodes from
molecule records whose name nodes bear specific
gene names.
Is it possible to first locate records based on node content (or
xmlValue) rather than node names (since they are the
same in all records), and then parse the XML
record locally? Would a query based on XPath be faster
in this case? I understand that we do have a
facility in the XML package for XPath-based queries,
called getNodeSet(). But that requires reading the
whole XML tree into memory first, which is not
feasible for my large XML file. Or can I call
XML::XPath statements using your R-Perl interface
package? Any suggestions/thoughts? Thank you!
Weijun


Part of my XML file: 

<molecule>
<prov><im><imid>20</imid></im></prov><moleculeID>119043</moleculeID>
<moleculeType>protein<prov><im><imid>20</imid></im></prov></moleculeType>
<organismID>10090<prov><im><imid>20</imid></im></prov></organismID>
<id><prov><im><imid>20</imid></im></prov><idType>GI</idType><idValue>6677981</idValue></id>
<name>SKD1<prov><im><imid>20</imid></im></prov></name>
<name>Vps4b<prov><im><imid>20</imid></im></prov></name>
<name>8030489C12Rik<prov><im><imid>20</imid></im></prov></name>
<description><distribution><value>Mouse homologue of
yeast Vacuolar protein sorting 4 (Vps4); Suppressor of
potassium transport defect 1. Member of mammalian class E
Vps proteins involved in endosomal transport; AAA-type
ATPase.<prov><im><imid>20</imid></im></prov></value><value>Mouse
homologue of yeast  Vacuolar protein sorting 4
(Vps4); Suppressor of potassium  transport defect 1.
Member of  mammalian class E Vps proteins
involved in endosomal transport; AAA-type
ATPase.<prov><im><imid>20</imid></im></prov></value></distribution></description>
<orthologue>
<method><methodID>337974</methodID><methodName>miClust80</methodName></method>
</orthologue>
<variant>
<prov><im><imid>20</imid></im></prov><variantID>0</variantID>
</variant>
<interaction><interactionRef>201581</interactionRef><moleculeRef>89434</moleculeRef><moleculeName>SBP1</moleculeName>
<selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
<interaction><interactionRef>201582</interactionRef><moleculeRef>17953</moleculeRef><moleculeName>mVps2</moleculeName>
<selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
</molecule>

<molecule>
<prov><im><imid>30</imid></im></prov><moleculeID>116226</moleculeID>
<moleculeType>protein<prov><im><imid>30</imid></im></prov></moleculeType>
<organismID>9606<prov><im><imid>30</imid></im></prov></organismID>
<id><prov><im><imid>30</imid></im></prov><idType>HGNC</idType><idValue>9859</idValue></id>
<name>RAP1GDS1<prov><im><imid>30</imid></im></prov></name>
<name>GDS1<prov><im><imid>30</imid></im></prov></name>
<name>MGC118859<prov><im><imid>30</imid></im></prov></name>
<name>MGC118861<prov><im><imid>30</imid></im></prov></name>
<variant>
<prov><im><imid>30</imid></im></prov><variantID>0</variantID>
</variant>
<interaction><interactionRef>93569</interactionRef><moleculeRef>116280</moleculeRef><moleculeName>RAC1</moleculeName>
<selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
<interaction><interactionRef>104132</interactionRef><moleculeRef>103040</moleculeRef><moleculeName>RHOA</moleculeName>
<selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
<interaction><interactionRef>121818</interactionRef><moleculeRef>74726</moleculeRef><moleculeName>MBIP</moleculeName>
<selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
</molecule>


Re: [R] Loading problem with XML_1.9

2007-07-07 Thread Luo Weijun
Hello Dr. Lang and all,
I posted this message to the R-help mailing list, but have not
solved my problem so far. Could you help me
look at it?
I have a loading problem with XML_1.9 under 64-bit
R 2.3.1 for Mac OS X, which I got from
http://R.research.att.com/. XML_1.9 works fine under
32-bit R 2.5.0. I thought it could be an installation
problem, so I tried install.packages and biocLite;
each time the package installed fine, except for the
warning messages below:
ld64 warning: in /usr/lib/libxml2.dylib, file does not contain requested architecture
ld64 warning: in /usr/lib/libz.dylib, file does not contain requested architecture
ld64 warning: in /usr/lib/libiconv.dylib, file does not contain requested architecture
ld64 warning: in /usr/lib/libz.dylib, file does not contain requested architecture
ld64 warning: in /usr/lib/libxml2.dylib, file does not contain requested architecture

Here are the error messages I got when XML is loaded:
> library(XML)
Error in dyn.load(x, as.logical(local), as.logical(now)) :
  unable to load shared library '/usr/local/lib64/R/library/XML/libs/XML.so':
  dlopen(/usr/local/lib64/R/library/XML/libs/XML.so, 6): Symbol not found: _xmlMemDisplay
  Referenced from: /usr/local/lib64/R/library/XML/libs/XML.so
  Expected in: flat namespace
Error: .onLoad failed in 'loadNamespace' for 'XML'
Error: package/namespace load failed for 'XML'

Session information
> sessionInfo()
Version 2.3.1 Patched (2006-06-27 r38447)
powerpc64-apple-darwin8.7.0

attached base packages:
[1] "methods"   "stats"     "graphics"  "grDevices" "utils"     "datasets"
[7] "base"

Prof Brian Ripley also suggested that the cause could be
that I don’t have a 64-bit version of libxml2
installed. If that’s the problem, where can I get it, and where and
how do I install it?
The reason I need to use 64-bit R is that I run into a memory
limitation with the 32-bit version of R when I load
some very large XML trees (the data file is about
800M). Martin suggested that I use the 'handlers'
argument of xmlTreeParse. I tried 'handlers' with
useInternalNodes=T, but I still got this memory
problem with the 32-bit version of R. Please tell me what I
can do now. Thank you so much!
Weijun



   




Re: [R] Loading problem with XML_1.9

2007-06-28 Thread Martin Morgan
Weijun --

If memory is a problem, you might try using the 'handlers' argument of
xmlTreeParse. This provides access to each node as it is processed, so
that you can, for instance, choose to ignore nodes, or save only
numeric values, or ... I'm not sure whether the entire document is
read into a C 'external pointer', or whether the savings is just in
the R representation of the document.
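
A small sketch of the idea, assuming a cut-down record shaped like the sample in this thread (the wrapping molecules element, the shortened description text and the name drop_node are made up for the example). A handler is looked up by element name, and a handler that returns NULL drops that node, with everything below it, from the tree that R builds.

```r
library(XML)

# One cut-down record; the root element name is hypothetical.
txt <- '<molecules>
<molecule>
<name>SKD1</name>
<description><distribution><value>Mouse homologue of yeast Vps4.</value></distribution></description>
<interaction><moleculeName>SBP1</moleculeName></interaction>
</molecule>
</molecules>'

# Returning NULL from a handler discards the node it was called for.
drop_node <- function(node, ...) NULL

doc <- xmlTreeParse(txt, asText = TRUE, asTree = TRUE,
                    handlers = list(description = drop_node))

# The description sub-tree is no longer among the molecule's children.
mol <- xmlRoot(doc)[["molecule"]]
names(xmlChildren(mol))
```

As noted above, this trims the R-level tree; it does not by itself say how much memory the C-level parse used along the way.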

Also, depending on how you use the resulting document, you might want
to watch out for the memory leak mentioned in
http://www.omegahat.org/RSXML/Changes

Martin

Luo Weijun [EMAIL PROTECTED] writes:

> Hello all,
> I have loading problem with XML_1.9 under 64 bit
> R2.3.1, which I got from http://R.research.att.com/.
> XML_1.9 works fine under 32 bit R2.5.0. I thought that
> could be installation problem, and I tried
> install.packages or biocLite, every time the package
> installed fine, except some warning messages below:
> ld64 warning: in /usr/lib/libxml2.dylib, file does not contain requested architecture
> ld64 warning: in /usr/lib/libz.dylib, file does not contain requested architecture
> ld64 warning: in /usr/lib/libiconv.dylib, file does not contain requested architecture
> ld64 warning: in /usr/lib/libz.dylib, file does not contain requested architecture
> ld64 warning: in /usr/lib/libxml2.dylib, file does not contain requested architecture
>
> Here is the error messages I got, when XML is loaded:
> > library(XML)
> Error in dyn.load(x, as.logical(local), as.logical(now)) :
>   unable to load shared library '/usr/local/lib64/R/library/XML/libs/XML.so':
>   dlopen(/usr/local/lib64/R/library/XML/libs/XML.so, 6): Symbol not found: _xmlMemDisplay
>   Referenced from: /usr/local/lib64/R/library/XML/libs/XML.so
>   Expected in: flat namespace
> Error: .onLoad failed in 'loadNamespace' for 'XML'
> Error: package/namespace load failed for 'XML'
>
> I understand that it has been pointed out that
> Sys.getenv("PATH") needs to be revised in the file
> XML/R/zzz.R, but I can’t even find that file under
> XML/R/ directory. Does anybody have any idea what
> might be the problem, and how to solve it? Thanks a
> lot!
> BTW, the reason I need to use R64 is that I have
> memory limitation issue with R 32 bit version when I
> load some very large XML trees.
>
> Session information
> > sessionInfo()
> Version 2.3.1 Patched (2006-06-27 r38447)
> powerpc64-apple-darwin8.7.0
>
> attached base packages:
> [1] "methods"   "stats"     "graphics"  "grDevices" "utils"     "datasets"
> [7] "base"
>
> Weijun


-- 
Martin Morgan
Bioconductor / Computational Biology
http://bioconductor.org



Re: [R] Loading problem with XML_1.9

2007-06-27 Thread Prof Brian Ripley

Please don't post to multiple lists: I have removed the BioC-devel list.
This is about MacOS X, and the appropriate list is R-sig-mac.

There is no intrinsic 64-bit problem: package XML 1.9-0 (sic) works fine 
on 64-bit versions of Solaris and Linux.  Most likely there was an 
installation problem, and you do not have a 64-bit version of libxml2 
installed or in the run-time library path.


On Wed, 27 Jun 2007, Luo Weijun wrote:


> Hello all,
> I have loading problem with XML_1.9 under 64 bit
> R2.3.1, which I got from http://R.research.att.com/.

For MacOS X, unstated.




--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595