Thanks Tim!!  You helped me find the defect in my code.

Yes, I'm using one BodyContentHandler.  When I changed my code to create a
new BodyContentHandler for each XML file I'm parsing, I no longer see the
OOM.  It is weird that I see this issue with XML files only.

For completeness, can you confirm whether there is any issue in reusing a
single instance of AutoDetectParser and Metadata throughout the life of my
application?  The reason I'm reusing a single instance is to cut down on
overhead (I have yet to time this).
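
For reference, here is roughly what my per-file extract looks like now (a
minimal sketch with illustrative names; the parser and metadata are still
shared, and only the handler is created per file):

    import java.io.File;

    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class Extractor {
        // Reused for the life of the application -- this is the part I'd like confirmed.
        private final AutoDetectParser parser = new AutoDetectParser();
        private final Metadata metadata = new Metadata();

        public String extract(File file) throws Exception {
            // New handler per file; -1 disables the default write limit.
            BodyContentHandler contentHandler = new BodyContentHandler(-1);
            TikaInputStream stream = TikaInputStream.get(file);
            try {
                parser.parse(stream, contentHandler, metadata);
                return contentHandler.toString();
            } finally {
                stream.close();
            }
        }
    }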

Steve


On Mon, Feb 8, 2016 at 8:33 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> In your actual code, are you using one BodyContentHandler for all of your
> files?  Or are you creating a new BodyContentHandler for each file?  If the
> former, then, yes, there’s a problem with your code; if the latter, that’s
> not something I’ve seen before.
>
>
>
> *From:* Steven White [mailto:swhite4...@gmail.com]
> *Sent:* Monday, February 08, 2016 4:56 PM
> *To:* user@tika.apache.org
> *Subject:* Re: Preventing OutOfMemory exception
>
>
>
> Hi Tim,
>
>
>
> The code I showed is a minimal example that reproduces the issue I'm
> running into: memory keeps on growing.
>
>
>
> In production, the loop that you see will read files off a file system and
> parse them using logic close to what I showed.  I use
> contentHandler.toString() to get back the raw text so I can save it.  Even
> if I get rid of that call, I run into OOM.
>
>
>
> Note that if I test the exact same code against PDF or PPT or ODP or RTF
> (I still have far more formats to test), I do *NOT* see the OOM issue even
> when I increase the loop to 1000 -- memory usage remains steady and
> stable.  This is why in my original email I asked whether there is an issue
> with XML files or with my code, such as failing to close / release
> something.
>
>
>
> Here is the full call stack when I get the OOM:
>
>   Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at java.lang.StringBuffer.ensureCapacityImpl(StringBuffer.java:338)
>     at java.lang.StringBuffer.append(StringBuffer.java:114)
>     at java.io.StringWriter.write(StringWriter.java:106)
>     at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:93)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:136)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.TextContentHandler.characters(TextContentHandler.java:55)
>     at org.apache.tika.sax.TeeContentHandler.characters(TeeContentHandler.java:102)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.xerces.parsers.AbstractSAXParser.characters(Unknown Source)
>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>     at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
>     at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
>     at javax.xml.parsers.SAXParser.parse(Unknown Source)
>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
>
>
>
> Thanks
>
>
>
> Steve
>
>
>
>
>
> On Mon, Feb 8, 2016 at 3:07 PM, Allison, Timothy B. <talli...@mitre.org>
> wrote:
>
> I’m not sure why you’d want to append document contents across documents
> into one handler.  Typically, you’d use a new ContentHandler and new
> Metadata object for each parse.  Calling “toString()” does not clear the
> content handler, so by your final loop you will have 20 copies of the
> extracted content in it.
>
>
>
> There shouldn’t be any difference across file types in the fact that you
> are appending a new copy of the extracted text with each loop.  You might
> not be seeing the memory growth if your other file types aren’t big enough
> and if you are only doing 20 loops.
>
>
>
> But the larger question…what are you trying to accomplish?
>
>
>
> *From:* Steven White [mailto:swhite4...@gmail.com]
> *Sent:* Monday, February 08, 2016 1:38 PM
> *To:* user@tika.apache.org
> *Subject:* Preventing OutOfMemory exception
>
>
>
> Hi everyone,
>
>
>
> I'm integrating Tika with my application and need your help to figure out
> if the OOM I'm getting is due to the way I'm using Tika or if it is an
> issue with parsing XML files.
>
>
>
> The following example code causes an OOM on the 7th iteration with -Xmx2g.
> The test passes with -Xmx4g.  The XML file I'm trying to parse is 51 MB in
> size.  I do not see this issue with the other file types I have tested so
> far.  Memory usage keeps growing with XML files but stays constant with
> other file types.
>
>
>
>     import java.io.File;
>
>     import org.apache.tika.io.TikaInputStream;
>     import org.apache.tika.metadata.Metadata;
>     import org.apache.tika.parser.AutoDetectParser;
>     import org.apache.tika.sax.BodyContentHandler;
>
>     public class Extractor {
>         // A single handler, parser, and metadata instance reused for every file.
>         private BodyContentHandler contentHandler = new BodyContentHandler(-1);
>         private AutoDetectParser parser = new AutoDetectParser();
>         private Metadata metadata = new Metadata();
>
>         public String extract(File file) throws Exception {
>             TikaInputStream stream = TikaInputStream.get(file);
>             try {
>                 parser.parse(stream, contentHandler, metadata);
>                 return contentHandler.toString();
>             }
>             finally {
>                 stream.close();
>             }
>         }
>     }
>
>     public static void main(String[] args) throws Exception {
>         Extractor extractor = new Extractor();
>         File file = new File("C:\\temp\\test.xml");
>         for (int i = 0; i < 20; i++) {
>             extractor.extract(file);
>         }
>     }
>
>
>
> Any idea if this is an issue with XML files or if the issue is in my code?
>
>
>
> Thanks
>
>
>
> Steve
>
>
>
>
>
