RE: OutOfMemoryException while Indexing an XML file/PdfParser

2003-02-19 Thread Pinky Iyer

Thanks Matt, I am working on using xpdf as you suggested, but I get an error at the
following statement. Could you elaborate on it?

String[] cmd = new String[] {
    PATH_TO_XPDF,
    "-enc", "UTF-8", "-q", filename, "-" };

I defined PATH_TO_XPDF as "c:/xpdf/pdftotext.exe" and left the rest the same. I get a
compile error about incompatible types, File and String, which I could not understand!
Thanks again!
Pinky
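
A likely cause of the incompatible-types error: every element of the command array
must be a String, so if the PDF is held as a java.io.File it has to be converted with
getPath() before it goes into the array. A minimal sketch, with a hypothetical
pdfFile variable and hypothetical paths:

    import java.io.File;
    import java.util.Arrays;

    public class CmdArrayFix {
        public static void main(String[] args) {
            // Hypothetical setup: pdfFile stands in for however the caller holds the PDF.
            File pdfFile = new File("c:/xpdf/test.pdf");
            String PATH_TO_XPDF = "c:/xpdf/pdftotext.exe";

            // A File cannot be an element of a String[]; use its path instead.
            String filename = pdfFile.getPath();
            String[] cmd = new String[] {
                PATH_TO_XPDF, "-enc", "UTF-8", "-q", filename, "-" };
            System.out.println(Arrays.asList(cmd));
        }
    }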
 Matt Tucker [EMAIL PROTECTED] wrote:
 [Matt's message of Feb 18 appears in full later in this thread; quoted text trimmed.]


RE: OutOfMemoryException while Indexing an XML file

2003-02-18 Thread Rob Outar
We are aware of DOM limitations/memory problems, but I am using SAX to parse
the file and index elements and attributes in my content handler.

Thanks,

Rob

-Original Message-
From: Tatu Saloranta [mailto:[EMAIL PROTECTED]]
Sent: Friday, February 14, 2003 8:18 PM
To: Lucene Users List
Subject: Re: OutOfMemoryException while Indexing an XML file

[Tatu's reply appears in full at the end of this thread; quoted text trimmed.]




RE: OutOfMemoryException while Indexing an XML file/PdfParser

2003-02-18 Thread Pinky Iyer

I am having a similar problem, but indexing PDF documents using the PDFBox parser
(available at www.pdfbox.com). I get an exception saying: Exception in thread "main"
java.lang.OutOfMemoryError. Has anybody implemented the above code? Any help is
appreciated!
Thanks!
PI
 Rob Outar [EMAIL PROTECTED] wrote:
 [Rob's message of Feb 18 appears in full above, and the earlier replies it quotes
 appear elsewhere in this thread; quoted text trimmed.]


RE: OutOfMemoryException while Indexing an XML file/PdfParser

2003-02-18 Thread Matt Tucker
Rob,

We ran into this problem too, and our solution was to use a native PDF
text extractor (PDFBox just can't seem to handle large PDFs well).
Basically, we try to parse with the native app first, and if that fails,
we parse with PDFBox. We used:

http://www.foolabs.com/xpdf/

A code snippet for using this is:

String[] cmd = new String[] {
    PATH_TO_XPDF,
    "-enc", "UTF-8", "-q", filename, "-" };
Process p = Runtime.getRuntime().exec(cmd);
BufferedInputStream bis = new BufferedInputStream(p.getInputStream());
InputStreamReader reader = new InputStreamReader(bis, "UTF-8");
StringWriter out = new StringWriter();
char[] buf = new char[512];
int len;
while ((len = reader.read(buf)) >= 0) {
    out.write(buf, 0, len);
}
reader.close();
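
For reference, a self-contained sketch of how the snippet might be packaged. The
class name, the main-method demo, and the waitFor() call are additions beyond the
snippet above, and PATH_TO_XPDF is assumed to point at a local pdftotext install:

    import java.io.BufferedInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.StringWriter;

    public class XpdfTextExtractor {

        // Assumed install location; adjust for your machine.
        private static final String PATH_TO_XPDF = "c:/xpdf/pdftotext.exe";

        public static String extract(String filename)
                throws IOException, InterruptedException {
            // "-" as the output file makes pdftotext write the text to stdout.
            String[] cmd = new String[] {
                PATH_TO_XPDF, "-enc", "UTF-8", "-q", filename, "-" };
            Process p = Runtime.getRuntime().exec(cmd);

            // Read the extracted text from the process's stdout as UTF-8.
            BufferedInputStream bis = new BufferedInputStream(p.getInputStream());
            InputStreamReader reader = new InputStreamReader(bis, "UTF-8");
            StringWriter out = new StringWriter();
            char[] buf = new char[512];
            int len;
            while ((len = reader.read(buf)) >= 0) {
                out.write(buf, 0, len);
            }
            reader.close();

            p.waitFor();  // let the child process finish before returning
            return out.toString();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(extract(args[0]));
        }
    }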

Regards,
Matt

 -Original Message-
 From: Pinky Iyer [mailto:[EMAIL PROTECTED]]
 Sent: Tuesday, February 18, 2003 5:23 PM
 To: Lucene Users List
 Subject: RE: OutOfMemoryException while Indexing an XML file/PdfParser
 
 [Pinky's message of Feb 18 appears in full above, along with the earlier replies
 it quotes; quoted text trimmed.]




RE: OutOfMemoryException while Indexing an XML file/PdfParser

2003-02-18 Thread Ben Litchfield

I am aware of the issues with parsing certain PDF documents.  I am
currently working on refactoring PDFBox to deal with large documents.  You
will see this in the next release.  I would like to thank people for
feedback and sending problem documents.

Ben Litchfield
http://www.pdfbox.org


On Tue, 18 Feb 2003, Pinky Iyer wrote:

 [Pinky's message of Feb 18 appears in full above; quoted text trimmed.]




RE: OutOfMemoryException while Indexing an XML file

2003-02-14 Thread Marcel Stor
 -Original Message-
 From: Rob Outar [mailto:[EMAIL PROTECTED]] 
 Sent: Friday, 14 February 2003 14:13
 To: Lucene Users List
 Subject: OutOfMemoryException while Indexing an XML file
 
 
 Hi all,
 
   I was using the sample code provided, I believe, by Doug Cutting to index an
 XML file. The XML file was 2 megs (kinda large), but while adding fields to the
 Document object I got an OutOfMemoryException. I work with XML files a lot and
 can easily parse that 2 meg file into a DOM tree; I can't imagine a Lucene
 document being larger than a DOM tree. Pasted below is the SAX handler.
[...code...]

Try adding -Xmx256M as a command-line argument to java to increase the maximum
heap size.

Marcel
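
For instance (the indexer class name here is hypothetical):

    java -Xmx256M XmlIndexer

A quick way to confirm the setting took effect is to print the maximum heap the
JVM will use; this check is a sketch, not part of Marcel's suggestion:

    public class HeapCheck {
        public static void main(String[] args) {
            // Prints the maximum heap size the JVM will attempt to use, in megabytes.
            long maxMB = Runtime.getRuntime().maxMemory() / (1024 * 1024);
            System.out.println("Max heap: " + maxMB + " MB");
        }
    }

Run it as java -Xmx256M HeapCheck and it should report roughly 256 MB.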






Re: OutOfMemoryException while Indexing an XML file

2003-02-14 Thread Otis Gospodnetic
Nothing in the code snippet you sent would cause that exception.
If I were you I'd run it under a profiler to quickly see where the leak
is.  You can even use something free like JMP (the Java Memory Profiler).

Otis
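
If memory serves, JMP loads as a JVMPI agent via the -Xrun mechanism, so the
invocation would look roughly like the line below; treat the exact option name as
an assumption to verify against the JMP documentation, and the indexer class name
is hypothetical:

    java -Xrunjmp XmlIndexer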

--- Rob Outar [EMAIL PROTECTED] wrote:
 Hi all,
 
   I was using the sample code provided, I believe, by Doug Cutting to index
 an XML file. The XML file was 2 megs (kinda large), but while adding fields
 to the Document object I got an OutOfMemoryException. I work with XML files
 a lot and can easily parse that 2 meg file into a DOM tree; I can't imagine
 a Lucene document being larger than a DOM tree. Pasted below is the SAX
 handler.
 
 public class XMLDocumentBuilder extends DefaultHandler {
 
     /** A buffer for the character data of the current XML element. */
     private StringBuffer elementBuffer = new StringBuffer();
 
     private Document mDocument;
 
     public void buildDocument(Document doc, String xmlFile)
             throws IOException, SAXException {
         this.mDocument = doc;
         SAXReader.parse(xmlFile, this);
     }
 
     public void startElement(String uri, String localName, String qName,
             Attributes atts) {
         elementBuffer.setLength(0);
         if (atts != null) {
             // Add each attribute as a stored, indexed, tokenized field.
             for (int i = 0; i < atts.getLength(); i++) {
                 String attname = atts.getLocalName(i);
                 mDocument.add(new Field(attname, atts.getValue(i),
                         true, true, true));
             }
         }
     }
 
     // Called when character data is found.
     public void characters(char[] text, int start, int length) {
         elementBuffer.append(text, start, length);
     }
 
     public void endElement(String uri, String localName, String qName) {
         mDocument.add(Field.Text(localName, elementBuffer.toString()));
     }
 
     public Document getDocument() {
         return mDocument;
     }
 }
 
 Any help would be appreciated.
 
 Thanks,
 
 Rob
 
 




RE: OutOfMemoryException while Indexing an XML file

2003-02-14 Thread Rob Outar
So, to the best of your knowledge, the Lucene Document object should not cause
the exception, even though the XML file is huge and thousands of fields are
being added to it?

Thanks,

Rob


-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Friday, February 14, 2003 8:21 AM
To: Lucene Users List
Subject: Re: OutOfMemoryException while Indexing an XML file

[Otis's reply and the quoted original message appear in full above; quoted text
trimmed.]




RE: OutOfMemoryException while Indexing an XML file

2003-02-14 Thread Aaron Galea
I had this problem when using Xerces to parse XML documents. The problem, I think,
lies in the Java garbage collector. The way I solved it was to create a shell script
that invokes a Java program for each XML file, which adds it to the index (see the
sketch below).

Hope this helps...

Aaron
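
A minimal sketch of that shell driver, with hypothetical file layout, classpath,
and class names; because every file gets a fresh JVM, nothing can accumulate
across files:

    #!/bin/sh
    # Run the indexer in a new JVM for each XML file.
    for f in docs/*.xml; do
        java -cp lucene.jar:. XmlIndexer "$f"
    done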
-- Original Message --
From: Rob Outar [EMAIL PROTECTED]
Reply-To: Lucene Users List [EMAIL PROTECTED]
Date:  Fri, 14 Feb 2003 08:43:34 -0500

Forgot to mention I am indexing thousands of XML files.  I ran a little test to
see if that one file was the problem, but it could be indexed after some time,
though memory usage was huge.  I think that because I index these files one
after the other, something is not getting cleaned up, leading to the exception.

Thanks,

Rob


-Original Message-
From: Rob Outar [mailto:[EMAIL PROTECTED]]
Sent: Friday, February 14, 2003 8:25 AM
To: Lucene Users List
Subject: RE: OutOfMemoryException while Indexing an XML file

[Rob's and Otis's earlier messages appear in full above; quoted text trimmed.]




Re: OutOfMemoryException while Indexing an XML file

2003-02-14 Thread Tatu Saloranta
On Friday 14 February 2003 07:27, Aaron Galea wrote:
 I had this problem when using xerces to parse xml documents. The problem I
 think lies in the Java garbage collector. The way I solved it was to create

It's unlikely that GC is the culprit. Current collectors are good at purging
objects that are unreachable, and only throw an OutOfMemoryError when they
really have no other choice.
Usually it's the app that has some dangling references to objects that prevent
GC from collecting objects that are no longer useful.

However, it's good to note that Xerces (and DOM parsers in general) generally
use more memory than the input XML files they process; this is because they
usually have to keep the whole document structure in memory, and there is
overhead on top of the text segments. So it's likely to be at least 2 * input
file size (files usually use UTF-8, which most of the time uses 1 byte per
char; in memory, 16-bit UCS-2 chars are used for performance, so a 2 meg file
becomes roughly 4 megs of character data alone), plus some additional overhead
for storing element structure information and all that.

And since the default max Java heap size is 64 megs, big XML files can cause
problems.

More likely, however, is that references to already-processed DOM trees are not
nulled in a loop, or something like that. Especially if doing one JVM process
per file solves the problem.

 a shell script that invokes a java program for each xml file that adds it
 to the index.

-+ Tatu +-
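
To make the last point concrete, a minimal sketch of such a driver loop, reusing
the XMLDocumentBuilder from Rob's post; the writer (assumed to be an open Lucene
IndexWriter) and xmlFiles names are hypothetical:

    // Assumed: writer is an open org.apache.lucene.index.IndexWriter and
    // xmlFiles holds the paths of the files to index; both are hypothetical.
    for (int i = 0; i < xmlFiles.length; i++) {
        Document doc = new Document();
        XMLDocumentBuilder builder = new XMLDocumentBuilder();
        builder.buildDocument(doc, xmlFiles[i]);
        writer.addDocument(doc);
        // doc and builder become unreachable here, so the GC can reclaim them;
        // holding each Document in a collection that outlives the loop would not.
    }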

