I am using Apache Tika to extract text from PPT/PPTX files. Tika uses Apache POI to do the extraction.
I compared the processing time and memory usage of POI against Aspose (www.aspose.com). The processing time and memory requirement for Tika (i.e. POI) are almost double those of Aspose. Is POI really using streaming to parse files? Why does it take so much more memory than Aspose, which I thought reads the whole file into memory?

I found this thread, http://lucene.472066.n3.nabble.com/Large-xls-files-always-loaded-into-memory-td646710.html, where the Tika founder says that POI does not stream input files. That thread is quite old. Is it still the case?

My goal is to minimize the memory requirement. Here is my code:

    ParseContext context = new ParseContext();
    Detector detector = new DefaultDetector();
    Parser parser = new AutoDetectParser(detector);
    context.set(Parser.class, parser);
    Metadata metadata = new Metadata();
    File file = new File("temp.ppt");
    URL url = file.toURI().toURL();
    // All extracted text is collected in this in-memory buffer
    OutputStream outputStream = new ByteArrayOutputStream();
    InputStream input = TikaInputStream.get(url, metadata);
    ContentHandler handler = new BodyContentHandler(outputStream);
    parser.parse(input, handler, metadata, context);
    String extractedText = outputStream.toString();

It looks like the whole extracted text is written to the output stream and then copied into a String, which may be one reason for the large memory consumption. How can I keep memory usage as low as possible?

Any response will be appreciated.

Thanks,
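
For reference, here is a minimal sketch of one way to avoid holding the extracted text in memory, assuming the text can go straight to a file or another sink instead of a String. StreamingTextHandler is a hypothetical helper, not part of Tika. It relies only on the standard SAX ContentHandler interface that parser.parse() accepts. It addresses only the output buffer, not whatever POI allocates internally while reading the presentation, and unlike BodyContentHandler it does not filter out text that appears outside the body (such as the title).

```java
import java.io.IOException;
import java.io.Writer;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not a Tika class): forwards each SAX character
// event to a Writer as it arrives, so no full copy of the text is kept.
class StreamingTextHandler extends DefaultHandler {
    private final Writer out;

    StreamingTextHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            // Write the chunk immediately instead of accumulating it
            out.write(ch, start, length);
        } catch (IOException e) {
            throw new SAXException("Failed to write extracted text", e);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
        // Keep whitespace so words from separate elements do not run together
        characters(ch, start, length);
    }

    @Override
    public void endDocument() throws SAXException {
        try {
            out.flush();
        } catch (IOException e) {
            throw new SAXException("Failed to flush extracted text", e);
        }
    }
}
```

With the code above, this would replace the BodyContentHandler, for example parser.parse(input, new StreamingTextHandler(writer), metadata, context), where writer is a buffered file Writer that the caller opens and closes. BodyContentHandler also has a constructor that takes a Writer, which may be the simpler option if the body-only filtering is wanted.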
