Refactor Excel extractor to parse per sheet and add hyperlink support
---------------------------------------------------------------------
Key: TIKA-132
URL: https://issues.apache.org/jira/browse/TIKA-132
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 0.1-incubating
Reporter: Niall Pemberton
Priority: Minor
In the excel record stream, hyperlink records come at the end of the sheet,
after the cell value records. This is a problem for the current streaming
implementation of the excel parser since it means the hyperlink cannot be
output when a cell is being processed.
Jukka suggested the following on the mailing list:
"How about if the streaming Excel parser maintained a sparse in-memory table of
the contents of the sheet that is currently being parsed and would only
generate the respective SAX events once the sheet has been parsed? Since we can
focus on only the information that's relevant to Tika clients, the memory
requirements sould be moderate even for huge sheets (i.e. much less than the
file size even for a single-sheet file). This should satisfy the low memory
footprint requirements reasonably well while allowing us to generate more
accurate output."
See here: http://tika.markmail.org/message/ac3kgujkcrgqyb4i
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.