On Fri, 9 Nov 2012, Norman M wrote:
I am using Apache Tika to extract text from PPT/PPTX files.
Is POI really using streaming to parse files?
Some bits are. XLS file processing is stream based; for PPT the whole file
gets processed and then the text parts are located and picked out.
File file = new File("temp.ppt");
URL url = file.toURI().toURL();
OutputStream outputStream = new ByteArrayOutputStream();
InputStream input = TikaInputStream.get(url, metadata);
Is there a reason why you're not passing the file to TikaInputStream, but
going via the URL instead?
ContentHandler handler = new BodyContentHandler(outputStream);
parser.parse(input, handler, metadata, context);
String extractedText = outputStream.toString();
The text you extract will probably be fairly small, but the code above
means it all has to be buffered first. You might want to look at
processing the SAX events as they come in rather than buffering
everything, to reduce memory use, especially for very large amounts of text.
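
As a rough illustration of processing the SAX events as they arrive, here is a
minimal sketch that uses only the JDK's SAX classes. The class name
StreamingTextHandler is hypothetical (it is not part of Tika or POI), and the
main method feeds events by hand as a stand-in for a real parser. It forwards
each chunk of text to a Writer immediately instead of collecting it.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of Tika or POI): forwards each chunk of
// text to a Writer as soon as the SAX event arrives, keeping no copy itself.
class StreamingTextHandler extends DefaultHandler {
    private final Writer sink;

    StreamingTextHandler(Writer sink) {
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            // Write only the slice the parser handed us, then let it go.
            sink.write(ch, start, length);
        } catch (IOException e) {
            throw new SAXException("Failed to write extracted text", e);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        try {
            sink.flush();
        } catch (IOException e) {
            throw new SAXException("Failed to flush extracted text", e);
        }
    }

    public static void main(String[] args) throws SAXException {
        // Stand-in for the events a parser would fire. A StringWriter is
        // used only for the demo; a FileWriter or similar avoids buffering.
        StringWriter out = new StringWriter();
        StreamingTextHandler handler = new StreamingTextHandler(out);
        handler.startDocument();
        char[] chunk = "Slide one text".toCharArray();
        handler.characters(chunk, 0, chunk.length);
        handler.endDocument();
        System.out.println(out);
    }
}
```

With Tika, a handler like this would be passed to parser.parse(...) in place of
the buffering one, backed by a file or network Writer. Tika emits XHTML, so
wrapping it as new BodyContentHandler(handler) should limit the output to the
body text; this sketch also ignores ignorable whitespace events.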
Nick