Thanks Tim!! You helped me find the defect in my code. Yes, I'm using one BodyContentHandler. When I changed my code to create a new BodyContentHandler for each XML file I'm parsing, I no longer see the OOM. It is weird that I see this issue with XML files only.
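For anyone who finds this thread later, here is a rough sketch of why the shared handler kept growing. It is not Tika code: a plain java.io.StringWriter (the class that shows up in the stack trace further down) stands in for BodyContentHandler, and HandlerReuseDemo, fakeParse, lengthAfterSharedSink and lengthAfterFreshSink are made-up names used only for illustration.

```java
import java.io.StringWriter;

public class HandlerReuseDemo {

    // Hypothetical stand-in for a parse: appends the "extracted text" to the sink.
    static void fakeParse(String text, StringWriter sink) {
        sink.write(text);
    }

    // One sink reused for every call, like a single shared BodyContentHandler.
    static int lengthAfterSharedSink(String text, int loops) {
        StringWriter shared = new StringWriter();
        for (int i = 0; i < loops; i++) {
            fakeParse(text, shared);
            shared.toString(); // reading the text back does not clear the buffer
        }
        return shared.getBuffer().length();
    }

    // A new sink for every call, like a new BodyContentHandler per file.
    static int lengthAfterFreshSink(String text, int loops) {
        int last = 0;
        for (int i = 0; i < loops; i++) {
            StringWriter fresh = new StringWriter();
            fakeParse(text, fresh);
            last = fresh.getBuffer().length();
        }
        return last;
    }

    public static void main(String[] args) {
        String text = "0123456789";
        System.out.println(lengthAfterSharedSink(text, 20));
        System.out.println(lengthAfterFreshSink(text, 20));
    }
}
```

With a 10-character input and 20 loops, the shared writer ends up holding 200 characters, while each fresh writer holds only 10.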
For completeness, can you confirm whether there is any issue with re-using a single instance of AutoDetectParser and Metadata throughout the life of my application? The reason I'm reusing a single instance is to cut down on overhead (I have yet to time this).

Steve

On Mon, Feb 8, 2016 at 8:33 PM, Allison, Timothy B. <talli...@mitre.org> wrote:

> In your actual code, are you using one BodyContentHandler for all of your
> files? Or are you creating a new BodyContentHandler for each file? If the
> former, then, yes, there’s a problem with your code; if the latter, that’s
> not something I’ve seen before.
>
> *From:* Steven White [mailto:swhite4...@gmail.com]
> *Sent:* Monday, February 08, 2016 4:56 PM
> *To:* user@tika.apache.org
> *Subject:* Re: Preventing OutOfMemory exception
>
> Hi Tim,
>
> The code I showed is a minimal example to show the issue I'm running
> into, which is: memory keeps on growing.
>
> In production, the loop that you see will read files off a file system and
> parse them using logic close to what I showed. I use
> contentHandler.toString() to get back the raw text so I can save it. Even
> if I get rid of that call, I run into OOM.
>
> Note that if I test the exact same code against PDF or PPT or ODP or RTF
> (I still have far more formats to test) I do *NOT* see the OOM issue even
> when I increase the loop to 1000 -- memory usage remains steady and
> stable. This is why in my original email I asked if there is an issue with
> XML files or with my code, such as failing to close / release
> something.
>
> Here is the full call stack when I get the OOM:
>
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at java.lang.StringBuffer.ensureCapacityImpl(StringBuffer.java:338)
>     at java.lang.StringBuffer.append(StringBuffer.java:114)
>     at java.io.StringWriter.write(StringWriter.java:106)
>     at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:93)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:136)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.TextContentHandler.characters(TextContentHandler.java:55)
>     at org.apache.tika.sax.TeeContentHandler.characters(TeeContentHandler.java:102)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.xerces.parsers.AbstractSAXParser.characters(Unknown Source)
>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>     at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
>     at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
>     at javax.xml.parsers.SAXParser.parse(Unknown Source)
>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
>
> Thanks
>
> Steve
>
> On Mon, Feb 8, 2016 at 3:07 PM, Allison, Timothy B. <talli...@mitre.org>
> wrote:
>
> I’m not sure why you’d want to append document contents across documents
> into one handler. Typically, you’d use a new ContentHandler and new
> Metadata object for each parse. Calling “toString()” does not clear the
> content handler, and you should have 20 copies of the extracted content on
> your final loop.
>
> There shouldn’t be any difference across file types in the fact that you
> are appending a new copy of the extracted text with each loop. You might
> not be seeing the memory growth if your other file types aren’t big enough
> and if you are only doing 20 loops.
>
> But the larger question…what are you trying to accomplish?
>
> *From:* Steven White [mailto:swhite4...@gmail.com]
> *Sent:* Monday, February 08, 2016 1:38 PM
> *To:* user@tika.apache.org
> *Subject:* Preventing OutOfMemory exception
>
> Hi everyone,
>
> I'm integrating Tika with my application and need your help to figure out
> if the OOM I'm getting is due to the way I'm using Tika or if it is an
> issue with parsing XML files.
>
> The following example code causes an OOM on the 7th iteration with -Xmx2g.
> The test will pass with -Xmx4g. The XML file I'm trying to parse is 51 MB
> in size. I do not see this issue with the other file types that I have
> tested so far. Memory usage keeps on growing with XML file types, but stays
> constant with other file types.
>
> import java.io.File;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.sax.BodyContentHandler;
>
> public class Extractor {
>     // One handler instance is shared by every extract() call
>     private BodyContentHandler contentHandler = new BodyContentHandler(-1);
>     private AutoDetectParser parser = new AutoDetectParser();
>     private Metadata metadata = new Metadata();
>
>     public String extract(File file) throws Exception {
>         TikaInputStream stream = TikaInputStream.get(file);
>         try {
>             parser.parse(stream, contentHandler, metadata);
>             return contentHandler.toString();
>         }
>         finally {
>             stream.close();
>         }
>     }
> }
>
> public static void main(...) {
>     Extractor extractor = new Extractor();
>     File file = new File("C:\\temp\\test.xml");
>     for (int i = 0; i < 20; i++) {
>         extractor.extract(file);
>     }
> }
>
> Any idea if this is an issue with XML files or if the issue is in my code?
>
> Thanks
>
> Steve