RE: OutOfMemoryException while Indexing an XML file/PdfParser
Thanks Matt, I am working on using xpdf as you suggested. I get an error at the following statement; could you elaborate on it?

String[] cmd = new String[] { PATH_TO_XPDF, "-enc", "UTF-8", "-q", filename, "-" };

I defined PATH_TO_XPDF as "c:/xpdf/pdftotext.exe", the rest remaining the same. I get an error saying something about incompatible types, File and String, which I could not understand. Thanks again!

Pinky

Matt Tucker [EMAIL PROTECTED] wrote:

Rob,

We ran into this problem too, and our solution was to use a native PDF text extractor (PDFBox just can't seem to handle large PDFs well). Basically, we try to parse with the native app first, and if that fails, we parse with PDFBox. We used: http://www.foolabs.com/xpdf/

A code snippet for using this is:

String[] cmd = new String[] { PATH_TO_XPDF, "-enc", "UTF-8", "-q", filename, "-" };
Process p = Runtime.getRuntime().exec(cmd);
BufferedInputStream bis = new BufferedInputStream(p.getInputStream());
InputStreamReader reader = new InputStreamReader(bis, "UTF-8");
StringWriter out = new StringWriter();
char[] buf = new char[512];
int len;
while ((len = reader.read(buf)) >= 0) {
    out.write(buf, 0, len);
}
reader.close();

Regards,
Matt
RE: OutOfMemoryException while Indexing an XML file
We are aware of DOM limitations/memory problems, but I am using SAX to parse the file and index elements and attributes in my content handler.

Thanks,
Rob
RE: OutOfMemoryException while Indexing an XML file/PdfParser
I am having a similar problem, but indexing PDF documents using the PDFBox parser (available at www.pdfbox.com). I get an exception saying: Exception in thread "main" java.lang.OutOfMemoryError. Has anybody implemented the above code? Any help appreciated. Thanks!

PI

Rob Outar [EMAIL PROTECTED] wrote:

We are aware of DOM limitations/memory problems, but I am using SAX to parse the file and index elements and attributes in my content handler.
RE: OutOfMemoryException while Indexing an XML file/PdfParser
Rob,

We ran into this problem too, and our solution was to use a native PDF text extractor (PDFBox just can't seem to handle large PDFs well). Basically, we try to parse with the native app first, and if that fails, we parse with PDFBox. We used: http://www.foolabs.com/xpdf/

A code snippet for using this is:

String[] cmd = new String[] { PATH_TO_XPDF, "-enc", "UTF-8", "-q", filename, "-" };
Process p = Runtime.getRuntime().exec(cmd);
BufferedInputStream bis = new BufferedInputStream(p.getInputStream());
InputStreamReader reader = new InputStreamReader(bis, "UTF-8");
StringWriter out = new StringWriter();
char[] buf = new char[512];
int len;
while ((len = reader.read(buf)) >= 0) {
    out.write(buf, 0, len);
}
reader.close();

Regards,
Matt

-----Original Message-----
From: Pinky Iyer [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, February 18, 2003 5:23 PM
To: Lucene Users List
Subject: RE: OutOfMemoryException while Indexing an XML file/PdfParser

I am having a similar problem, but indexing PDF documents using the PDFBox parser (available at www.pdfbox.com). I get an exception saying: Exception in thread "main" java.lang.OutOfMemoryError. Has anybody implemented the above code? Any help appreciated. Thanks!

PI
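The loop in Matt's snippet is easy to get wrong when retyped (the archive lost the `>=` in the while condition). As a hedged illustration, here is the same buffered-read pattern as a self-contained helper; the `readAll` name and the in-memory stream standing in for `p.getInputStream()` are mine, not from the thread:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.StringWriter;

public class StreamReadDemo {

    // Drains a byte stream into a String using the same 512-char
    // buffered loop as Matt's snippet (note ">= 0": read() returns -1
    // only at end of stream).
    static String readAll(InputStream in, String charset) throws IOException {
        InputStreamReader reader =
                new InputStreamReader(new BufferedInputStream(in), charset);
        StringWriter out = new StringWriter();
        char[] buf = new char[512];
        int len;
        while ((len = reader.read(buf)) >= 0) {
            out.write(buf, 0, len);
        }
        reader.close();
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for p.getInputStream(): an in-memory UTF-8 stream.
        InputStream in = new ByteArrayInputStream("extracted text".getBytes("UTF-8"));
        System.out.println(readAll(in, "UTF-8")); // prints "extracted text"
    }
}
```

With a real xpdf invocation, `readAll(p.getInputStream(), "UTF-8")` would collect pdftotext's standard output, since the trailing `-` argument tells pdftotext to write the extracted text there instead of to a file.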
RE: OutOfMemoryException while Indexing an XML file/PdfParser
I am aware of the issues with parsing certain PDF documents. I am currently working on refactoring PDFBox to deal with large documents; you will see this in the next release. I would like to thank people for the feedback and for sending problem documents.

Ben Litchfield
http://www.pdfbox.org

On Tue, 18 Feb 2003, Pinky Iyer wrote:

I am having a similar problem, but indexing PDF documents using the PDFBox parser (available at www.pdfbox.com). I get an exception saying: Exception in thread "main" java.lang.OutOfMemoryError. Has anybody implemented the above code? Any help appreciated. Thanks!

PI
RE: OutOfMemoryException while Indexing an XML file
-----Original Message-----
From: Rob Outar [mailto:[EMAIL PROTECTED]]
Sent: Friday, 14 February 2003 14:13
To: Lucene Users List
Subject: OutOfMemoryException while Indexing an XML file

Hi all,

I was using the sample code provided, I believe by Doug Cutting, to index an XML file. The XML file was 2 megs (kinda large), but while adding fields to the Document object I got an OutOfMemoryException. I work with XML files a lot; I can easily parse that 2 meg file into a DOM tree, and I can't imagine a Lucene document being larger than a DOM tree. Pasted below is the SAX handler.

[...code...]

Try adding -Xmx256M as an argument for java to increase the heap size in memory.

Marcel
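Marcel's flag raises the JVM heap ceiling above the 2003-era default of 64 megs. A minimal sketch (the class name is mine, not from the thread) to confirm what ceiling the running JVM actually received:

```java
public class HeapCheck {
    public static void main(String[] args) {
        // Runtime.maxMemory() reports the heap ceiling in bytes,
        // i.e. the value the -Xmx flag controls.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("max heap: " + (maxBytes / (1024 * 1024)) + " MB");
    }
}
```

Running it as `java -Xmx256M HeapCheck` should report roughly 256 MB, versus roughly 64 MB with no flag on JVMs of that era.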
Re: OutOfMemoryException while Indexing an XML file
Nothing in the code snippet you sent would cause that exception. If I were you, I'd run it under a profiler to quickly see where the leak is. You can even use something free like JMP.

Otis

--- Rob Outar [EMAIL PROTECTED] wrote:

Hi all,

I was using the sample code provided, I believe by Doug Cutting, to index an XML file. The XML file was 2 megs (kinda large), but while adding fields to the Document object I got an OutOfMemoryException. I work with XML files a lot; I can easily parse that 2 meg file into a DOM tree, and I can't imagine a Lucene document being larger than a DOM tree. Pasted below is the SAX handler.

public class XMLDocumentBuilder extends DefaultHandler {

    /** A buffer for each XML element */
    private StringBuffer elementBuffer = new StringBuffer();
    private Document mDocument;

    public void buildDocument(Document doc, String xmlFile)
            throws IOException, SAXException {
        this.mDocument = doc;
        SAXReader.parse(xmlFile, this);
    }

    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        elementBuffer.setLength(0);
        if (atts != null) {
            for (int i = 0; i < atts.getLength(); i++) {
                String attname = atts.getLocalName(i);
                mDocument.add(new Field(attname, atts.getValue(i), true, true, true));
            }
        }
    }

    // called when character data is found
    public void characters(char[] text, int start, int length) {
        elementBuffer.append(text, start, length);
    }

    public void endElement(String uri, String localName, String qName) {
        mDocument.add(Field.Text(localName, elementBuffer.toString()));
    }

    public Document getDocument() {
        return mDocument;
    }
}

Any help would be appreciated.

Thanks,
Rob
RE: OutOfMemoryException while Indexing an XML file
So to the best of your knowledge the Lucene Document object should not cause the exception, even though the XML file is huge and thousands of fields are being added to it?

Thanks,
Rob

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Friday, February 14, 2003 8:21 AM
To: Lucene Users List
Subject: Re: OutOfMemoryException while Indexing an XML file

Nothing in the code snippet you sent would cause that exception. If I were you, I'd run it under a profiler to quickly see where the leak is. You can even use something free like JMP.

Otis
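Rob's worry is well founded in one respect: if a single Document instance is reused across thousands of files, its field list only ever grows, and no garbage collector can help. A hedged sketch of the safer per-file pattern, using a plain stand-in class (all names are mine) rather than the real Lucene Document:

```java
import java.util.ArrayList;
import java.util.List;

public class PerFileDocumentDemo {

    // Stand-in for a Lucene Document: just collects field strings.
    static class Doc {
        final List<String> fields = new ArrayList<String>();
        void add(String field) { fields.add(field); }
    }

    public static void main(String[] args) {
        String[] files = { "a.xml", "b.xml", "c.xml" };
        for (String file : files) {
            // A fresh Doc per file: when the iteration ends, no live
            // reference remains and the GC can reclaim it. Reusing one
            // Doc across all files would accumulate fields indefinitely.
            Doc doc = new Doc();
            doc.add("filename=" + file);
            System.out.println(file + ": " + doc.fields.size() + " field(s)");
        }
    }
}
```

Tatu makes the same point later in the thread: it is usually a dangling reference to already-processed state, not the garbage collector, that keeps memory from being reclaimed.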
RE: OutOfMemoryException while Indexing an XML file
I had this problem when using Xerces to parse XML documents. The problem, I think, lies in the Java garbage collector. The way I solved it was to create a shell script that invokes a java program for each XML file, adding it to the index. Hope this helps...

Aaron

------ Original Message ------
From: Rob Outar [EMAIL PROTECTED]
Reply-To: Lucene Users List [EMAIL PROTECTED]
Date: Fri, 14 Feb 2003 08:43:34 -0500

Forgot to mention I am indexing thousands of XML files. I ran a little test to see if that file was the problem, but it was able to be indexed after some time, and memory usage was huge. I think maybe because I index these files one after the other, something is not getting cleaned up, leading to the exception.

Thanks,
Rob
Re: OutOfMemoryException while Indexing an XML file
On Friday 14 February 2003 07:27, Aaron Galea wrote:
> I had this problem when using xerces to parse xml documents. The problem I
> think lies in the Java garbage collector. The way I solved it was to create

It's unlikely that GC is the culprit. Current ones are good at purging objects that are unreachable, and they only throw an OutOfMemory exception when they really have no other choice. Usually it's the app that holds dangling references that prevent GC from collecting objects that are no longer useful.

However, it's good to note that Xerces (and DOM parsers in general) generally use more memory than the input XML files they process; this is because they usually have to keep the whole document structure in memory, and there is overhead on top of the text segments. So it's likely to be at least 2 * input file size (files usually use UTF-8, which most of the time is 1 byte per char; in memory, 16-bit Unicode chars are used for performance), plus some additional overhead for storing element structure information and all that. And since the default max Java heap size is 64 megs, big XML files can cause problems.

More likely, however, is that references to already processed DOM trees are not nulled in a loop, or something like that? Especially if doing one JVM process per item solves the problem.

> a shell script that invokes a java program for each xml file that adds it
> to the index.

-+ Tatu +-
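Tatu's "2 * input file size" estimate can be sanity-checked in a few lines: an ASCII character occupies 1 byte in a UTF-8 file but 2 bytes as a Java char in memory. A small sketch (class and variable names are mine):

```java
import java.io.UnsupportedEncodingException;

public class XmlMemoryEstimate {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String ascii = "<doc>hello</doc>";
        // On disk: 1 byte per ASCII char in UTF-8.
        int onDiskBytes = ascii.getBytes("UTF-8").length;
        // In memory: Java chars are 16-bit, so 2 bytes each.
        int inMemoryBytes = ascii.length() * 2;
        System.out.println("file: " + onDiskBytes
                + " bytes, in-memory chars: " + inMemoryBytes + " bytes");
        // For pure-ASCII XML the in-memory text alone is ~2x the file size,
        // before any DOM node overhead is counted.
    }
}
```

This is only the text; the element-structure overhead Tatu mentions comes on top, which is why a 2 meg file can comfortably push a 64 meg heap once thousands of such documents (or their fields) are kept reachable.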