[ https://issues.apache.org/jira/browse/TIKA-132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Niall Pemberton updated TIKA-132:
---------------------------------
Attachment: TIKA-132-ExcelExtractor-refactor-v2.patch
Apologies - attaching a second patch, with minor changes:
- make the visibility of methods in the new private static inner classes consistent
- use row/column parameter names, as POI does, rather than rowNo/columnNo
> Refactor Excel extractor to parse per sheet and add hyperlink support
> ---------------------------------------------------------------------
>
> Key: TIKA-132
> URL: https://issues.apache.org/jira/browse/TIKA-132
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 0.1-incubating
> Reporter: Niall Pemberton
> Priority: Minor
> Attachments: TIKA-132-ExcelExtractor-refactor-v2.patch
>
>
> In the Excel record stream, hyperlink records come at the end of the sheet,
> after the cell value records. This is a problem for the current streaming
> implementation of the Excel parser, since it means the hyperlink cannot be
> output when a cell is being processed.
> Jukka suggested the following on the mailing list:
> "How about if the streaming Excel parser maintained a sparse in-memory table
> of the contents of the sheet that is currently being parsed and would only
> generate the respective SAX events once the sheet has been parsed? Since we
> can focus on only the information that's relevant to Tika clients, the memory
> requirements should be moderate even for huge sheets (i.e. much less than the
> file size even for a single-sheet file). This should satisfy the low memory
> footprint requirements reasonably well while allowing us to generate more
> accurate output."
> See here: http://tika.markmail.org/message/ac3kgujkcrgqyb4i
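
For illustration, a minimal sketch of the buffering idea quoted above. It is not the code in the attached patch: the class name SparseSheetBuffer, its methods, and the string-based output are hypothetical, and a real extractor would presumably emit SAX events to a ContentHandler (with proper escaping) rather than build a string. Cell text and hyperlinks are stored by row and column as records arrive, and nothing is output until the sheet ends.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: buffers one sheet sparsely, emits it when the sheet ends. */
class SparseSheetBuffer {

    /** Text and optional hyperlink target of a single cell. */
    private static final class Cell {
        String text; // null if only a hyperlink record was seen
        String link; // null until a hyperlink record arrives
    }

    // Sparse table: row -> (column -> cell). Only cells that were seen are stored.
    private final TreeMap<Integer, TreeMap<Integer, Cell>> rows = new TreeMap<>();

    private Cell cell(int row, int column) {
        return rows.computeIfAbsent(row, r -> new TreeMap<>())
                   .computeIfAbsent(column, c -> new Cell());
    }

    /** Called for each cell value record. */
    void addText(int row, int column, String text) {
        cell(row, column).text = text;
    }

    /** Called for each hyperlink record, which arrives after the cell records. */
    void addHyperlink(int row, int column, String url) {
        cell(row, column).link = url;
    }

    /** Called at the end of the sheet: renders the buffered cells, then clears them. */
    String flush() {
        StringBuilder out = new StringBuilder();
        for (Map<Integer, Cell> columns : rows.values()) {
            out.append("<tr>");
            for (Cell c : columns.values()) {
                String text = (c.text == null) ? "" : c.text;
                out.append("<td>");
                if (c.link != null) {
                    // Escaping is omitted to keep the sketch short.
                    out.append("<a href=\"").append(c.link).append("\">")
                       .append(text).append("</a>");
                } else {
                    out.append(text);
                }
                out.append("</td>");
            }
            out.append("</tr>");
        }
        rows.clear(); // release memory before the next sheet
        return out.toString();
    }
}
```

Because the maps are sorted, rows and columns come out in order regardless of the order in which records were added, and memory use is bounded by the number of non-empty cells in a single sheet.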