Re: Is Tika really using streaming to parse files?

goog cheng Fri, 09 Nov 2012 21:28:42 -0800

I have the same question. When I call Tika from Python, CPU usage goes to
100% and then it crashes...



2012/11/10 Norman M <[email protected]>

> I am using Apache Tika to extract text from PPT/PPTX files.
>
> Tika uses Apache POI to extract the text.
>
> I compared processing time and memory usage for POI vs Aspose
> (www.aspose.com).
>
> The processing time and memory requirement for Tika (i.e., POI) are almost
> double those of Aspose.
>
> Is POI really using streaming to parse files? Why does it take much more
> memory than Aspose, which I thought reads the whole file into memory?
>
> I found this thread:
> http://lucene.472066.n3.nabble.com/Large-xls-files-always-loaded-into-memory-td646710.html
> where the Tika founder says that POI does not stream input files. That
> thread is quite old. Is it still the case?
>
> My goal is to minimize the memory requirement.
>
> Here is my code:
>
> ParseContext context = new ParseContext();
> Detector detector = new DefaultDetector();
> Parser parser = new AutoDetectParser(detector);
> context.set(Parser.class, parser);
> Metadata metadata = new Metadata();
>
> File file = new File("temp.ppt");
> URL url = file.toURI().toURL();
> ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
>
> InputStream input = TikaInputStream.get(url, metadata);
> ContentHandler handler = new BodyContentHandler(outputStream);
>
> parser.parse(input, handler, metadata, context);
>
> String extractedText = outputStream.toString();
>
> It looks like the whole extracted text is written to the output stream,
> which may be the reason for the large memory consumption. How can I keep
> memory usage as low as possible?
>
>  Any response will be appreciated.
>
> Thanks,
>
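
On the output side of the question above: the ByteArrayOutputStream (and the
String built from it) holds the full extracted text in memory. A minimal
sketch of the alternative is a SAX ContentHandler that forwards each chunk of
text to a Writer as soon as the parser reports it. StreamingTextHandler and
StreamingTextHandlerDemo are hypothetical names, not Tika classes, and the
demo fires the SAX events by hand instead of calling a real parser. As far as
I know, Tika's BodyContentHandler also accepts a Writer or OutputStream, so
pointing it at a file stream should have a similar effect. This only avoids
buffering the output; it does nothing about whatever the parser itself holds
in memory while reading the file.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of Tika): writes each chunk of text to a
// Writer as soon as it is reported, so the extracted text is never
// accumulated in a single in-memory buffer by the handler itself.
class StreamingTextHandler extends DefaultHandler {
    private final Writer out;

    StreamingTextHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            out.write(ch, start, length);
        } catch (IOException e) {
            throw new SAXException("Failed to write extracted text", e);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        try {
            out.flush();
        } catch (IOException e) {
            throw new SAXException("Failed to flush extracted text", e);
        }
    }
}

class StreamingTextHandlerDemo {
    public static void main(String[] args) throws Exception {
        // A StringWriter keeps the demo self-contained. In real use this would
        // be a BufferedWriter on a file, so the text goes to disk, not the heap.
        StringWriter sink = new StringWriter();
        StreamingTextHandler handler = new StreamingTextHandler(sink);

        // Simulate the SAX events a parser would emit for two runs of text.
        handler.startDocument();
        char[] first = "Slide one. ".toCharArray();
        handler.characters(first, 0, first.length);
        char[] second = "Slide two.".toCharArray();
        handler.characters(second, 0, second.length);
        handler.endDocument();

        System.out.println(sink);
    }
}
```

With a real parser, the handler would be passed in place of the
BodyContentHandler in the quoted code, with the Writer opened on a file and
closed after parse() returns.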
