Re: [R] Loading problem with XML_1.9
Weijun and I corresponded off-list so that I could get a copy of the data.

On a relatively modest machine with 2G of RAM, 10G swap, dual core 1GHz 64-bit AMDs, the code below takes approximately 100 seconds. It is not optimized in any particular way, so there is room for improvement.

    doc <- xmlTreeParse("mi1.txt.gz", useInternalNodes = TRUE)
    mols <- getNodeSet(doc, "//molecule")
    ans = lapply(mols, function(node, targets) {
              # text children of the <name> nodes within this molecule
              names = as.character(xpathApply(node, ".//name/text()", xmlValue))
              if(any(names %in% targets))
                  xpathApply(node, ".//moleculeName", xmlValue)
              else
                  character()
          }, c("Vps4b", "SKD1", "frm-1"))
    # keep only the molecules that matched
    ans = ans[sapply(ans, length) > 0]

We can read the file without uncompressing it, which probably slows things down slightly. Parsing the tree takes about 20 seconds and occupies approximately 1G (very roughly). Then we find all the molecule nodes, of which there are 25452. Then we loop over each of these, do a sub-query using XPath to see whether the text children of the name nodes are in the set of interest (entirely made up for my test), and if so fetch the content of any moleculeName within this molecule node.

It would be nice if we could build the hash for targets just once. And we could get clever with the XPath query to try to do the matching and selection in one query. This might actually slow things down. (There are garbage collection issues with XPath sub-queries for which I am still deciding on the optimal strategy.)

So perhaps the lesson here is that, for those working with XML, XPath is worth using before more specialized approaches, and that large XML data files can fit into memory. The tree does not use contiguous memory, so nodes can be squeezed into available spaces.

 D.
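The "one query" idea mentioned in the reply above could look roughly like the sketch below. It is only an illustration under stated assumptions: the two-record sample, the wrapper element `molecules`, and the variable names are made up here, and the `<prov>` children from the real records are left out. The name test is pushed into an XPath predicate so that only matching molecule nodes come back to R, and the follow-up query uses a relative path (`.//`) so that it stays inside each matched molecule. Whether this is faster than the lapply() version on the real file has not been measured.

```r
library(XML)

# Hypothetical sample mirroring the record structure quoted in the thread
# (the <prov> children are omitted for brevity).
txt <- '<molecules>
  <molecule>
    <name>SKD1</name><name>Vps4b</name>
    <interaction><moleculeName>SBP1</moleculeName></interaction>
    <interaction><moleculeName>mVps2</moleculeName></interaction>
  </molecule>
  <molecule>
    <name>RAP1GDS1</name>
    <interaction><moleculeName>RAC1</moleculeName></interaction>
  </molecule>
</molecules>'

doc <- xmlParse(txt, asText = TRUE)
targets <- c("Vps4b", "SKD1", "frm-1")

# Build a predicate such as: text()='Vps4b' or text()='SKD1' or ...
# (assumes the target names contain no single quotes)
pred <- paste(sprintf("text()='%s'", targets), collapse = " or ")

# Variant 1: one query, flat result across all matching molecules
partners <- xpathSApply(doc, sprintf("//molecule[name[%s]]//moleculeName", pred),
                        xmlValue)

# Variant 2: select the matching molecules once, then query each one
# with a relative path to keep the results grouped per molecule
mols <- getNodeSet(doc, sprintf("//molecule[name[%s]]", pred))
ans <- lapply(mols, function(m) xpathSApply(m, ".//moleculeName", xmlValue))

print(partners)
```

Variant 1 loses the grouping by molecule; variant 2 keeps it at the cost of one sub-query per matching record, which should be cheap when few records match.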
Re: [R] Loading problem with XML_1.9
Well, as you mention at the end of the mail, several people have given you suggestions about how to solve the problem using different approaches. You might search on the Web for how to install a 64-bit version of libxml2.

Using xmlTreeParse(, useInternalNodes = TRUE) is one approach to reducing memory consumption, as is using the handlers argument. And if size is really the issue, you should consider the SAX model, which is very memory efficient and is made available via the xmlEventParse() function in the XML package. It even provides the concept of branches, which gives a hybrid of SAX and DOM-style parsing.

However, to solve the problem of the xmlMemDisplay symbol not being found, you can look for where it is used and remove it. It is in src/DocParse.c in the routine RS_XML_MemoryShow(). You can remove the line

    xmlMemDisplay(stderr)

or indeed the entire routine. Then re-install and reload the package.

 D.

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
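The xmlEventParse() plus branches approach described in the reply above can be sketched as follows. This is a hedged illustration, not the code anyone in the thread actually ran: the sample document, the `molecules` wrapper element, and the names `targets`, `hits`, and `molecule_handler` are all hypothetical, and the `<prov>` children of the real records are omitted. The parser builds a small subtree for each `molecule` element only, hands it to the branch function, and the function copies out the values it needs before moving on.

```r
library(XML)

# Hypothetical sample mirroring the record structure discussed in the thread
txt <- '<molecules>
  <molecule>
    <name>SKD1</name><name>Vps4b</name>
    <interaction><moleculeName>SBP1</moleculeName></interaction>
    <interaction><moleculeName>mVps2</moleculeName></interaction>
  </molecule>
  <molecule>
    <name>RAP1GDS1</name>
    <interaction><moleculeName>RAC1</moleculeName></interaction>
  </molecule>
</molecules>'

targets <- c("Vps4b", "SKD1", "frm-1")
hits <- list()

# Called once per <molecule> subtree; 'node' is only used inside this call,
# so everything of interest is extracted as plain character vectors here.
molecule_handler <- function(node, ...) {
  kids <- xmlChildren(node)
  nms <- vapply(kids[names(kids) == "name"], xmlValue, character(1))
  if (any(nms %in% targets)) {
    inter <- kids[names(kids) == "interaction"]
    partners <- vapply(inter, function(i) xmlValue(i[["moleculeName"]]),
                       character(1))
    hits[[length(hits) + 1L]] <<- unname(partners)
  }
  invisible(NULL)
}

invisible(xmlEventParse(txt, handlers = NULL,
                        branches = list(molecule = molecule_handler),
                        asText = TRUE))

print(hits)
```

With the real records, xmlValue() on a `<name>` node would also pick up the text nested in its `<prov>` child, so the name test would need to look at the text child only.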
Re: [R] Loading problem with XML_1.9
Thanks, Dr. Lang,

I used xmlEventParse() + the branches concept as you suggested. It really works, and the memory issue is gone. Now I can query large XML files from within R.

But here is another problem: it is too slow (a simple query had not finished after 1.5h). The number of relevant records is very limited, but the whole XML file has more than 500 thousand similarly-structured records, and the parser has to go through all of them to find the matches. Attached is part of the XML file with two records. I am trying to retrieve the content of moleculeName nodes from molecule records where name nodes bear specific gene names.

Is it possible to locate records based on node content (or xmlValue) rather than node names (since they are the same in all records) first, and then parse the XML record locally? Would a query based on XPath be faster in this case? I understand that we do have a facility in the XML package for XPath-based queries, called getNodeSet(). But that requires reading the whole XML tree into memory first, which is not feasible for my large XML file. Or can I call XML::XPath statements using your R-Perl interface package?

Any suggestions/thoughts? Thank you!

Weijun

Part of my XML file:

    <molecule>
    <prov><im><imid>20</imid></im></prov><moleculeID>119043</moleculeID>
    <moleculeType>protein<prov><im><imid>20</imid></im></prov></moleculeType>
    <organismID>10090<prov><im><imid>20</imid></im></prov></organismID>
    <id><prov><im><imid>20</imid></im></prov><idType>GI</idType><idValue>6677981</idValue></id>
    <name>SKD1<prov><im><imid>20</imid></im></prov></name>
    <name>Vps4b<prov><im><imid>20</imid></im></prov></name>
    <name>8030489C12Rik<prov><im><imid>20</imid></im></prov></name>
    <description><distribution><value>Mouse homologue of yeast Vacuolar protein sorting 4 (Vps4); Suppressor of potassium transport defect 1. Member of mammalian class E Vps proteins involved in endosomal transport; AAA-type ATPase.<prov><im><imid>20</imid></im></prov></value><value>Mouse homologue of yeast Vacuolar protein sorting 4 (Vps4); Suppressor of potassium transport defect 1. Member of mammalian class E Vps proteins involved in endosomal transport; AAA-type ATPase.<prov><im><imid>20</imid></im></prov></value></distribution></description>
    <orthologue>
    <method><methodID>337974</methodID><methodName>miClust80</methodName></method>
    </orthologue>
    <variant>
    <prov><im><imid>20</imid></im></prov><variantID>0</variantID>
    </variant>
    <interaction><interactionRef>201581</interactionRef><moleculeRef>89434</moleculeRef><moleculeName>SBP1</moleculeName>
    <selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
    <interaction><interactionRef>201582</interactionRef><moleculeRef>17953</moleculeRef><moleculeName>mVps2</moleculeName>
    <selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
    </molecule>
    <molecule>
    <prov><im><imid>30</imid></im></prov><moleculeID>116226</moleculeID>
    <moleculeType>protein<prov><im><imid>30</imid></im></prov></moleculeType>
    <organismID>9606<prov><im><imid>30</imid></im></prov></organismID>
    <id><prov><im><imid>30</imid></im></prov><idType>HGNC</idType><idValue>9859</idValue></id>
    <name>RAP1GDS1<prov><im><imid>30</imid></im></prov></name>
    <name>GDS1<prov><im><imid>30</imid></im></prov></name>
    <name>MGC118859<prov><im><imid>30</imid></im></prov></name>
    <name>MGC118861<prov><im><imid>30</imid></im></prov></name>
    <variant>
    <prov><im><imid>30</imid></im></prov><variantID>0</variantID>
    </variant>
    <interaction><interactionRef>93569</interactionRef><moleculeRef>116280</moleculeRef><moleculeName>RAC1</moleculeName>
    <selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
    <interaction><interactionRef>104132</interactionRef><moleculeRef>103040</moleculeRef><moleculeName>RHOA</moleculeName>
    <selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
    <interaction><interactionRef>121818</interactionRef><moleculeRef>74726</moleculeRef><moleculeName>MBIP</moleculeName>
    <selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
    </molecule>
Re: [R] Loading problem with XML_1.9
Hello Dr. Lang and all,

I posted this message to the R-help mailing list, but haven't solved my problem so far. Therefore, could you help me look at it?

I have a loading problem with XML_1.9 under 64 bit R2.3.1 for Mac OS X, which I got from http://R.research.att.com/. XML_1.9 works fine under 32 bit R2.5.0. I thought it could be an installation problem, so I tried install.packages and biocLite. Every time the package installed fine, except for the warning messages below:

    ld64 warning: in /usr/lib/libxml2.dylib, file does not contain requested architecture
    ld64 warning: in /usr/lib/libz.dylib, file does not contain requested architecture
    ld64 warning: in /usr/lib/libiconv.dylib, file does not contain requested architecture
    ld64 warning: in /usr/lib/libz.dylib, file does not contain requested architecture
    ld64 warning: in /usr/lib/libxml2.dylib, file does not contain requested architecture

Here are the error messages I got when XML is loaded:

    > library(XML)
    Error in dyn.load(x, as.logical(local), as.logical(now)) :
            unable to load shared library '/usr/local/lib64/R/library/XML/libs/XML.so':
      dlopen(/usr/local/lib64/R/library/XML/libs/XML.so, 6): Symbol not found: _xmlMemDisplay
      Referenced from: /usr/local/lib64/R/library/XML/libs/XML.so
      Expected in: flat namespace
    Error: .onLoad failed in 'loadNamespace' for 'XML'
    Error: package/namespace load failed for 'XML'

Session information:

    > sessionInfo()
    Version 2.3.1 Patched (2006-06-27 r38447)
    powerpc64-apple-darwin8.7.0

    attached base packages:
    [1] methods stats graphics grDevices utils datasets
    [7] base

Prof Brian Ripley also suggested that this could be because I don't have a 64-bit version of libxml2 installed. Where can I get it, and where/how do I install it, if that's the problem?

The reason I need to use R64 is that I hit a memory limitation with the 32 bit version of R when I load some very large XML trees (the data file is about 800M). Martin suggested that I use the 'handler' argument of xmlTreeParse. I tried 'handler' with useInternalNodes=T, but I still got this memory problem with the 32 bit version of R.

Please tell me what I can do now. Thank you so much!

Weijun
Re: [R] Loading problem with XML_1.9
Weijun --

If memory is a problem, you might try using the 'handler' argument of xmlTreeParse. This provides access to each node as it is processed, so that you can, for instance, choose to ignore nodes, or save only numeric values, or ... I'm not sure whether the entire document is read into a C 'external pointer', or whether the savings are just in the R representation of the document.

Also, depending on how you use the resulting document, you might want to watch out for the memory leak mentioned in

http://www.omegahat.org/RSXML/Changes

Martin

Luo Weijun [EMAIL PROTECTED] writes:

> Hello all,
> I have a loading problem with XML_1.9 under 64 bit R2.3.1, which I got
> from http://R.research.att.com/. XML_1.9 works fine under 32 bit
> R2.5.0. I thought it could be an installation problem, so I tried
> install.packages and biocLite. Every time the package installed fine,
> except for the warning messages below:
>
> ld64 warning: in /usr/lib/libxml2.dylib, file does not contain requested architecture
> ld64 warning: in /usr/lib/libz.dylib, file does not contain requested architecture
> ld64 warning: in /usr/lib/libiconv.dylib, file does not contain requested architecture
> ld64 warning: in /usr/lib/libz.dylib, file does not contain requested architecture
> ld64 warning: in /usr/lib/libxml2.dylib, file does not contain requested architecture
>
> Here are the error messages I got when XML is loaded:
>
> > library(XML)
> Error in dyn.load(x, as.logical(local), as.logical(now)) :
>         unable to load shared library '/usr/local/lib64/R/library/XML/libs/XML.so':
>   dlopen(/usr/local/lib64/R/library/XML/libs/XML.so, 6): Symbol not found: _xmlMemDisplay
>   Referenced from: /usr/local/lib64/R/library/XML/libs/XML.so
>   Expected in: flat namespace
> Error: .onLoad failed in 'loadNamespace' for 'XML'
> Error: package/namespace load failed for 'XML'
>
> I understand that it has been pointed out that Sys.getenv("PATH")
> needs to be revised in the file XML/R/zzz.R, but I can't even find
> that file under the XML/R/ directory.
> Does anybody have any idea what might be the problem, and how to solve
> it? Thanks a lot!
> BTW, the reason I need to use R64 is that I hit a memory limitation
> with the 32 bit version of R when I load some very large XML trees.
>
> Session information
> > sessionInfo()
> Version 2.3.1 Patched (2006-06-27 r38447)
> powerpc64-apple-darwin8.7.0
>
> attached base packages:
> [1] methods stats graphics grDevices utils datasets
> [7] base
>
> Weijun

--
Martin Morgan
Bioconductor / Computational Biology
http://bioconductor.org
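As a rough illustration of the 'handler' suggestion above, the sketch below drops one kind of element while the R-level tree is built. It relies on the behaviour, described in the xmlTreeParse documentation, that a handler returning NULL removes the node from the resulting tree. The sample document and the names `drop_node`, `mol`, and `kept` are hypothetical, and the sketch says nothing about how much memory the C-level parse itself uses, which is the open question raised in the message.

```r
library(XML)

# Hypothetical one-record sample; <description> stands in for bulky content
txt <- '<molecules>
  <molecule>
    <name>SKD1</name>
    <description>Mouse homologue of yeast Vps4</description>
    <interaction><moleculeName>SBP1</moleculeName></interaction>
  </molecule>
</molecules>'

# A handler named after an element is called for each such node;
# returning NULL discards the node (and its subtree) from the R tree.
drop_node <- function(node, ...) NULL

doc <- xmlTreeParse(txt, asText = TRUE, asTree = TRUE,
                    handlers = list(description = drop_node))

mol <- xmlRoot(doc)[["molecule"]]
kept <- names(xmlChildren(mol))
print(kept)
```

asTree = TRUE asks for the parsed tree back rather than the handlers object, which is what xmlTreeParse returns by default when handlers are supplied.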
Re: [R] Loading problem with XML_1.9
Please don't post to multiple lists: I have removed the BioC-devel list. This is about MacOS X, and the appropriate list is R-sig-mac.

There is no intrinsic 64-bit problem: package XML 1.9-0 (sic) works fine on 64-bit versions of Solaris and Linux. Most likely there was an installation problem, and you do not have a 64-bit version of libxml2 installed or in the run-time library path.

On Wed, 27 Jun 2007, Luo Weijun wrote:

> Hello all,
> I have loading problem with XML_1.9 under 64 bit R2.3.1, which I got
> from http://R.research.att.com/.

For MacOS X, unstated.

> XML_1.9 works fine under 32 bit R2.5.0.

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595