Avoid multiple passes over the input stream in Microsoft parsers
----------------------------------------------------------------
Key: TIKA-63
URL: https://issues.apache.org/jira/browse/TIKA-63
Project: Tika
Issue Type: Improvement
Components: general
Reporter: Jukka Zitting
Assignee: Jukka Zitting
Fix For: 0.1-incubator
The current Excel, Word, and PowerPoint parsers make multiple passes over the
given input stream: first to read document metadata, and then to extract text
content. We can avoid reading the stream twice by building a single
POIFSFileSystem instance in each parser class and using it as the source of
both the metadata and the text content, since these Office documents are
parsed into memory in any case.
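
The idea can be illustrated with a minimal sketch that uses only the Java
standard library. It is not the actual Tika or POI code: InMemoryDocument is a
hypothetical stand-in for POIFSFileSystem, the "name=value" line format is a
toy replacement for the OLE2 container, and the entry names "title" and "body"
are made up for the example. What it shows is the shape of the change: the
stream is consumed once, and both extraction steps read from the in-memory
structure.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class SinglePassSketch {

    /** Hypothetical stand-in for POIFSFileSystem: the whole input held in memory. */
    static final class InMemoryDocument {
        private final Map<String, String> entries = new LinkedHashMap<>();

        /** Consumes the stream exactly once. Toy format: one "name=value" entry per line. */
        InMemoryDocument(InputStream in) throws IOException {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            String content = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
            for (String line : content.split("\n")) {
                int eq = line.indexOf('=');
                if (eq > 0) {
                    entries.put(line.substring(0, eq), line.substring(eq + 1));
                }
            }
        }

        String entry(String name) {
            return entries.get(name);
        }
    }

    /** Metadata step: reads from the in-memory document, not from the stream. */
    public static String extractTitle(InMemoryDocument doc) {
        return doc.entry("title");
    }

    /** Text step: reuses the same in-memory document. */
    public static String extractText(InMemoryDocument doc) {
        return doc.entry("body");
    }

    /** Single pass over the stream, then both extractions. Returns {title, text}. */
    public static String[] parse(InputStream in) throws IOException {
        InMemoryDocument doc = new InMemoryDocument(in);
        return new String[] { extractTitle(doc), extractText(doc) };
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "title=Report\nbody=Hello world\n".getBytes(StandardCharsets.UTF_8);
        String[] result = parse(new ByteArrayInputStream(data));
        System.out.println("title: " + result[0]);
        System.out.println("text:  " + result[1]);
    }
}
```

In the real parsers the container would be a POIFSFileSystem built from the
given input stream, so the stream would not need to support mark/reset or be
reopened.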