I have the same question. When I call Tika from Python, CPU usage goes to 100% and then the process crashes.

2012/11/10 Norman M <[email protected]>

> I am using Apache Tika to extract text from PPT/PPTX files.
>
> Tika uses Apache POI to extract the text.
>
> I compared processing time and memory usage for POI vs. Aspose
> (www.aspose.com).
>
> The processing time and memory requirement for Tika (i.e. POI) are almost
> double those of Aspose.
>
> Is POI really using streaming to parse files? Why does it take much more
> memory than Aspose, which I thought reads the whole file into memory?
>
> I found this thread
> http://lucene.472066.n3.nabble.com/Large-xls-files-always-loaded-into-memory-td646710.html
> where the Tika founder says that POI does not stream input files. That
> thread is quite old. Is it still the case?
>
> My goal is to minimize the memory requirement.
>
> Here is my code:
>
>     ParseContext context = new ParseContext();
>     Detector detector = new DefaultDetector();
>     Parser parser = new AutoDetectParser(detector);
>     context.set(Parser.class, parser);
>     Metadata metadata = new Metadata();
>
>     File file = new File("temp.ppt");
>     URL url = file.toURI().toURL();
>     OutputStream outputStream = new ByteArrayOutputStream();
>
>     InputStream input = TikaInputStream.get(url, metadata);
>     ContentHandler handler = new BodyContentHandler(outputStream);
>
>     parser.parse(input, handler, metadata, context);
>
>     String extractedText = outputStream.toString();
>
> It looks like the whole extracted text is written to the output stream,
> which may be the reason for the large memory consumption. How can I keep
> memory usage as low as possible?
>
> Any response will be appreciated.
>
> Thanks,
