Re: [R] Loading problem with XML_1.9
Weijun and I corresponded off-list so that I could get a copy of the data. On a relatively modest machine with 2G of RAM, 10G swap, and dual-core 1GHz 64-bit AMDs, the code below takes approximately 100 seconds. It is not optimized in any particular way, so there is room for improvement.

    doc <- xmlTreeParse("mi1.txt.gz", useInternalNodes = TRUE)
    mols <- getNodeSet(doc, "//molecule")
    ans = lapply(mols, function(node, targets) {
              # ".//" keeps each sub-query inside the current <molecule> node
              names = as.character(xpathApply(node, ".//name/text()", xmlValue))
              if(any(names %in% targets))
                  xpathApply(node, ".//moleculeName", xmlValue)
              else
                  character()
          }, c("Vps4b", "SKD1", "frm-1"))
    ans = ans[sapply(ans, length) > 0]

We can read the file without uncompressing it, which probably slows things down slightly. Parsing the tree takes about 20 seconds and occupies approximately 1G (very roughly). Then we find all the <molecule> nodes, of which there are 25452. Then we loop over each of these, run a sub-query using XPath, and check whether the text children of its <name> nodes are in the set of interest (entirely made up for my test); if so, we fetch the content of any <moleculeName> within this node.

It would be nice if we could build the hash for targets just once. And we could get clever with the XPath query to try to do the matching and selection in one query. This might actually slow things down. (There are garbage collection issues with XPath sub-queries for which I am still deciding on the optimal strategy.)

So perhaps the lesson here is that, for those working with XML, XPath is worth using before more specialized approaches, and that large XML data files can fit into memory. The tree does not use contiguous memory, so nodes can be squeezed into available spaces.

 D.
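The idea above of doing the matching and the selection in a single XPath query could be sketched roughly as follows. This is only an illustration under stated assumptions: the small inline document is made up, the nesting of <name> and <moleculeName> inside <molecule> is a guess based on the code in this thread, and the target names must not contain single quotes. Whether it is faster than the lapply() loop on the real file is untested.

```r
library(XML)

# Made-up stand-in for the real file; only the element names come from the thread.
txt <- "<molecules>
  <molecule><moleculeName>m1</moleculeName><name>SKD1</name><name>Vps4b</name></molecule>
  <molecule><moleculeName>m2</moleculeName><name>RAC1</name></molecule>
</molecules>"
doc <- xmlParse(txt, asText = TRUE)

targets <- c("Vps4b", "SKD1", "frm-1")

# Build one predicate of the form: .//name = 'Vps4b' or .//name = 'SKD1' or ...
# (assumes no target contains a single quote)
pred  <- paste(sprintf(".//name = '%s'", targets), collapse = " or ")
query <- sprintf("//molecule[%s]//moleculeName", pred)

# One pass over the tree: select <moleculeName> only inside matching <molecule> nodes.
hits <- xpathSApply(doc, query, xmlValue)
```

The predicate is assembled once from the targets, so the per-node sub-queries and the repeated %in% lookups disappear; the cost moves into the XPath engine.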
Re: [R] Loading problem with XML_1.9
Thanks, Dr. Lang,

I used xmlEventParse() with the branches concept as you suggested. It really works, and the memory issue is gone. Now I can query large XML files from within R.

But here is another problem: it is too slow (a simple query has not finished after 1.5 hours). The number of relevant records is very limited, but the whole XML file has more than 500 thousand similarly structured records, and the parser has to go through all of them to find the matches. Attached is part of the XML file with two records. I am trying to retrieve the content of <moleculeName> nodes from <molecule> records where <name> nodes bear specific gene names.

Is it possible to locate records based on node content (or xmlValue) rather than node names (since they are the same in all records) first, and then parse the XML record locally? Would a query based on XPath be faster in this case? I understand that the XML package has a facility for XPath-based queries, called getNodeSet(). But that requires reading the whole XML tree into memory first, which is not feasible for my large XML file. Or can I call XML::XPath statements using your R-Perl interface package? Any suggestions/thoughts? Thank you!

Weijun

Part of my XML file:

    20119043 protein20 1009020 20GI6677981 SKD120 Vps4b20 8030489C12Rik20 Mouse homologue of yeast Vacuolar protein sorting 4 (Vps4); Suppressor of potassium transport defect 1. Member of mammalian class E Vps proteins involved in endosomal transport; AAA-type ATPase.20Mouse homologue of yeast Vacuolar protein sorting 4 (Vps4); Suppressor of potassium transport defect 1. Member of mammalian class E Vps proteins involved in endosomal transport; AAA-type ATPase.20 337974miClust80 200 20158189434SBP1 00 20158217953mVps2 00

    30116226 protein30 960630 30HGNC9859 RAP1GDS130 GDS130 MGC11885930 MGC11886130 300 93569116280RAC1 00 104132103040RHOA 00 12181874726MBIP 00
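For readers following along, the xmlEventParse() plus branches approach described in this message might look roughly like the sketch below. It is not the code actually used in the thread: the inline document is made up, the flat layout of <name> and <moleculeName> directly under <molecule> is an assumption, and onMolecule and found are hypothetical names.

```r
library(XML)

# Made-up stand-in for the real file; only the element names come from the thread.
txt <- "<molecules>
  <molecule><moleculeName>m1</moleculeName><name>SKD1</name><name>Vps4b</name></molecule>
  <molecule><moleculeName>m2</moleculeName><name>RAC1</name></molecule>
</molecules>"

targets <- c("Vps4b", "SKD1")
found <- character()

# Branch function: called once for each complete <molecule> subtree,
# so only one record is held as a tree at any time.
onMolecule <- function(node, ...) {
  kids <- xmlChildren(node)
  tags <- vapply(kids, xmlName, character(1))
  vals <- vapply(kids, xmlValue, character(1))
  if (any(vals[tags == "name"] %in% targets))
    found <<- c(found, unname(vals[tags == "moleculeName"]))
  invisible(NULL)
}

# SAX-style pass with no ordinary event handlers, only the branch.
invisible(xmlEventParse(txt, handlers = list(),
                        branches = list(molecule = onMolecule),
                        asText = TRUE))
```

Every record still has to be visited, which is consistent with the slow run times reported here; the branch only bounds memory, not the amount of work.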
Re: [R] Loading problem with XML_1.9
Well, as you mention at the end of the mail, several people have given you suggestions about how to solve the problem using different approaches. You might search the Web for how to install a 64-bit version of libxml2.

Using xmlTreeParse(, useInternalNodes = TRUE) is one approach to reducing the memory consumption, as is using the handlers argument. And if size is really the issue, you should consider the SAX model, which is very memory efficient and is made available via the xmlEventParse() function in the XML package. It even provides the concept of branches, a hybrid of SAX and DOM-style parsing.

However, to solve the problem of the xmlMemDisplay symbol not being found, you can look for where it is used and remove it. It is in src/DocParse.c, in the routine RS_XML_MemoryShow(). You can remove the line

    xmlMemDisplay(stderr)

or indeed the entire routine. Then re-install and reload the package.

 D.

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Loading problem with XML_1.9
Hello Dr. Lang and all,

I posted this message to the R-help mailing list, but haven't solved my problem so far. Therefore, could you help me look at it?

I have a loading problem with XML_1.9 under 64-bit R2.3.1 for Mac OS X, which I got from http://R.research.att.com/. XML_1.9 works fine under 32-bit R2.5.0. I thought it could be an installation problem, and I tried install.packages and biocLite. Every time the package installed fine, except for the warning messages below:

    ld64 warning: in /usr/lib/libxml2.dylib, file does not contain requested architecture
    ld64 warning: in /usr/lib/libz.dylib, file does not contain requested architecture
    ld64 warning: in /usr/lib/libiconv.dylib, file does not contain requested architecture
    ld64 warning: in /usr/lib/libz.dylib, file does not contain requested architecture
    ld64 warning: in /usr/lib/libxml2.dylib, file does not contain requested architecture

Here are the error messages I got when XML is loaded:

    > library(XML)
    Error in dyn.load(x, as.logical(local), as.logical(now)) :
      unable to load shared library '/usr/local/lib64/R/library/XML/libs/XML.so':
      dlopen(/usr/local/lib64/R/library/XML/libs/XML.so, 6): Symbol not found: _xmlMemDisplay
      Referenced from: /usr/local/lib64/R/library/XML/libs/XML.so
      Expected in: flat namespace
    Error: .onLoad failed in 'loadNamespace' for 'XML'
    Error: package/namespace load failed for 'XML'

Session information:

    > sessionInfo()
    Version 2.3.1 Patched (2006-06-27 r38447)
    powerpc64-apple-darwin8.7.0

    attached base packages:
    [1] "methods"   "stats"     "graphics"  "grDevices" "utils"     "datasets"
    [7] "base"

Prof Brian Ripley also suggested that I might not have a 64-bit version of libxml2 installed. Where do I get it, and where/how do I install it, if that's the problem?

The reason I need to use R64 is that I hit a memory limit with the 32-bit version of R when I load some very large XML trees (the data file is about 800M). Martin suggested that I use the 'handler' argument of xmlTreeParse. I tried 'handler' with useInternalNodes=T, but I still got this memory problem with the 32-bit version of R. Please tell me what I can do now. Thank you so much!

Weijun
Re: [R] Loading problem with XML_1.9
Weijun --

If memory is a problem, you might try using the 'handlers' argument of xmlTreeParse. This provides access to each node as it is processed, so that you can, for instance, choose to ignore nodes, or save only numeric values, or ... I'm not sure whether the entire document is read into a C 'external pointer', or whether the savings are just in the R representation of the document.

Also, depending on how you use the resulting document, you might want to watch out for the memory leak mentioned in http://www.omegahat.org/RSXML/Changes

Martin

Luo Weijun <[EMAIL PROTECTED]> writes:

> Hello all,
> I have loading problem with XML_1.9 under 64 bit R2.3.1, which I got from http://R.research.att.com/. XML_1.9 works fine under 32 bit R2.5.0. I thought that could be installation problem, and I tried install.packages or biocLite, every time the package installed fine, except some warning messages below:
> ld64 warning: in /usr/lib/libxml2.dylib, file does not contain requested architecture
> ld64 warning: in /usr/lib/libz.dylib, file does not contain requested architecture
> ld64 warning: in /usr/lib/libiconv.dylib, file does not contain requested architecture
> ld64 warning: in /usr/lib/libz.dylib, file does not contain requested architecture
> ld64 warning: in /usr/lib/libxml2.dylib, file does not contain requested architecture
>
> Here is the error messages I got, when XML is loaded:
>> library(XML)
> Error in dyn.load(x, as.logical(local), as.logical(now)) :
>   unable to load shared library '/usr/local/lib64/R/library/XML/libs/XML.so':
>   dlopen(/usr/local/lib64/R/library/XML/libs/XML.so, 6): Symbol not found: _xmlMemDisplay
>   Referenced from: /usr/local/lib64/R/library/XML/libs/XML.so
>   Expected in: flat namespace
> Error: .onLoad failed in 'loadNamespace' for 'XML'
> Error: package/namespace load failed for 'XML'
>
> I understand that it has been pointed out that Sys.getenv("PATH") needs to be revised in the file XML/R/zzz.R, but I can't even find that file under XML/R/ directory. Does anybody have any idea what might be the problem, and how to solve it? Thanks a lot!
> BTW, the reason I need to use R64 is that I have memory limitation issue with R 32 bit version when I load some very large XML trees.
>
> Session information
>> sessionInfo()
> Version 2.3.1 Patched (2006-06-27 r38447)
> powerpc64-apple-darwin8.7.0
>
> attached base packages:
> [1] "methods"   "stats"     "graphics"  "grDevices" "utils"     "datasets"
> [7] "base"
>
> Weijun

-- 
Martin Morgan
Bioconductor / Computational Biology
http://bioconductor.org
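As a rough illustration of the handlers suggestion above, the sketch below drops every <drop> element while the document is being converted to R objects. The inline document and the element names keep and drop are made up for the example. It does not settle the open question in the message, namely whether this lowers peak memory or only shrinks the R-level result.

```r
library(XML)

# Made-up document; <drop> stands for nodes we do not want to keep.
txt <- "<doc><keep>1</keep><drop>large payload</drop><keep>2</keep></doc>"

# A handler named after an element is called for each such node;
# returning NULL removes the node from the resulting tree.
handlers <- list(drop = function(node, ...) NULL)

# asTree = TRUE asks for the tree itself rather than the handlers object.
tree <- xmlTreeParse(txt, asText = TRUE, handlers = handlers, asTree = TRUE)

kept <- names(xmlChildren(xmlRoot(tree)))
```

The same pattern extends to handlers that return a trimmed or modified copy of the node instead of NULL, for example to keep only numeric values.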
Re: [R] Loading problem with XML_1.9
Please don't post to multiple lists: I have removed the BioC-devel list. This is about MacOS X, and the appropriate list is R-sig-mac.

There is no intrinsic 64-bit problem: package XML 1.9-0 (sic) works fine on 64-bit versions of Solaris and Linux. Most likely there was an installation problem, and you do not have a 64-bit version of libxml2 installed or in the run-time library path.

On Wed, 27 Jun 2007, Luo Weijun wrote:

> Hello all,
> I have loading problem with XML_1.9 under 64 bit R2.3.1, which I got from http://R.research.att.com/.

For MacOS X, unstated.

-- 
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595