[ https://issues.apache.org/jira/browse/TIKA-132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Niall Pemberton updated TIKA-132:
---------------------------------
Attachment: TIKA-132-ExcelExtractor-refactor-v1.patch
Attaching a patch to refactor ExcelExtractor as per Jukka's suggestion. A few
points to note:
- Maintains "linked-lists" of Rows and Cells (each Row/Cell has a reference to
the next Row/Cell)
- Hyperlink support is currently commented out because it depends on unreleased
POI features - the affected lines are marked with FIXME
- Empty sheets are ignored - is this OK?
- Links still do not appear in the output when using WriteOutContentHandler,
because the link is an "href" attribute of an <a> element rather than character
content - is this correct?
To try out the hyperlink support, uncomment the relevant lines and use a POI
version built from the latest Subversion trunk.
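For readers without the patch to hand, the linked-list buffering described in the points above can be sketched roughly as follows. This is only an illustration, not the code in the attached patch: the class names (BufferedCell, BufferedRow, BufferedSheet) and their methods are hypothetical, and it assumes cell records arrive in row order with hyperlink records following them.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical buffered cell; the names here are illustrative only. */
class BufferedCell {
    final int column;
    final String text;
    String href;        // filled in later, when the hyperlink record arrives
    BufferedCell next;  // next cell in the same row

    BufferedCell(int column, String text) {
        this.column = column;
        this.text = text;
    }
}

/** Hypothetical row holding a singly linked list of cells. */
class BufferedRow {
    final int index;
    BufferedCell firstCell;
    BufferedCell lastCell;
    BufferedRow next;   // next non-empty row in the sheet

    BufferedRow(int index) {
        this.index = index;
    }

    void append(BufferedCell cell) {
        if (firstCell == null) {
            firstCell = cell;
        } else {
            lastCell.next = cell;
        }
        lastCell = cell;
    }

    BufferedCell find(int column) {
        for (BufferedCell c = firstCell; c != null; c = c.next) {
            if (c.column == column) {
                return c;
            }
        }
        return null;
    }
}

/** Buffers one sheet; output is produced only after the whole sheet is read. */
class BufferedSheet {
    BufferedRow firstRow;
    BufferedRow lastRow;

    // Assumption: cell records arrive in row order, so a new row always goes last.
    void addCell(int row, int column, String text) {
        if (lastRow == null || lastRow.index != row) {
            BufferedRow r = new BufferedRow(row);
            if (firstRow == null) {
                firstRow = r;
            } else {
                lastRow.next = r;
            }
            lastRow = r;
        }
        lastRow.append(new BufferedCell(column, text));
    }

    // Hyperlink records come after the cell records, so look the cell up again.
    void addHyperlink(int row, int column, String href) {
        for (BufferedRow r = firstRow; r != null; r = r.next) {
            if (r.index == row) {
                BufferedCell c = r.find(column);
                if (c != null) {
                    c.href = href;
                }
                return;
            }
        }
    }

    boolean isEmpty() {
        return firstRow == null;
    }

    // Walk the lists in order; a linked cell is rendered as "text <href>".
    List<String> flush() {
        List<String> out = new ArrayList<>();
        for (BufferedRow r = firstRow; r != null; r = r.next) {
            for (BufferedCell c = r.firstCell; c != null; c = c.next) {
                out.add(c.href == null ? c.text : c.text + " <" + c.href + ">");
            }
        }
        return out;
    }
}
```

In this sketch a sheet for which isEmpty() returns true would simply be skipped, which corresponds to the "empty sheets are ignored" behaviour questioned above.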
> Refactor Excel extractor to parse per sheet and add hyperlink support
> ---------------------------------------------------------------------
>
> Key: TIKA-132
> URL: https://issues.apache.org/jira/browse/TIKA-132
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 0.1-incubating
> Reporter: Niall Pemberton
> Priority: Minor
> Attachments: TIKA-132-ExcelExtractor-refactor-v1.patch
>
>
> In the Excel record stream, hyperlink records come at the end of the sheet,
> after the cell value records. This is a problem for the current streaming
> implementation of the Excel parser, since it means the hyperlink cannot be
> output when a cell is being processed.
> Jukka suggested the following on the mailing list:
> "How about if the streaming Excel parser maintained a sparse in-memory table
> of the contents of the sheet that is currently being parsed and would only
> generate the respective SAX events once the sheet has been parsed? Since we
> can focus on only the information that's relevant to Tika clients, the memory
> requirements should be moderate even for huge sheets (i.e. much less than the
> file size even for a single-sheet file). This should satisfy the low memory
> footprint requirements reasonably well while allowing us to generate more
> accurate output."
> See here: http://tika.markmail.org/message/ac3kgujkcrgqyb4i
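One possible reading of the suggestion quoted above, as a minimal sketch rather than the actual Tika implementation: a sparse table keyed by row and column that emits SAX events only once the sheet is complete. SparseSheet and TextOnlyHandler are hypothetical names, and the table/tr/td/a element names are assumed for illustration. The second class only copies character events, which shows why an "href" attribute would not appear in plain-text output from a handler that works that way.

```java
import java.util.TreeMap;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sparse table for one sheet: only non-empty cells are stored. */
class SparseSheet {
    private static final class Cell {
        final String text;
        String href; // set later, when the hyperlink record is seen

        Cell(String text) {
            this.text = text;
        }
    }

    // row index -> (column index -> cell), both kept in sorted order
    private final TreeMap<Integer, TreeMap<Integer, Cell>> rows = new TreeMap<>();

    void setText(int row, int col, String text) {
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(col, new Cell(text));
    }

    void setHyperlink(int row, int col, String href) {
        TreeMap<Integer, Cell> cells = rows.get(row);
        Cell cell = (cells == null) ? null : cells.get(col);
        if (cell != null) {
            cell.href = href;
        }
    }

    /** Emits SAX events for the whole sheet; an empty sheet emits nothing. */
    void emit(ContentHandler handler) throws SAXException {
        if (rows.isEmpty()) {
            return;
        }
        Attributes none = new AttributesImpl();
        handler.startElement("", "table", "table", none);
        for (TreeMap<Integer, Cell> cells : rows.values()) {
            handler.startElement("", "tr", "tr", none);
            for (Cell cell : cells.values()) {
                handler.startElement("", "td", "td", none);
                if (cell.href != null) {
                    // The link target travels as an attribute, not as characters.
                    AttributesImpl attrs = new AttributesImpl();
                    attrs.addAttribute("", "href", "href", "CDATA", cell.href);
                    handler.startElement("", "a", "a", attrs);
                }
                handler.characters(cell.text.toCharArray(), 0, cell.text.length());
                if (cell.href != null) {
                    handler.endElement("", "a", "a");
                }
                handler.endElement("", "td", "td");
            }
            handler.endElement("", "tr", "tr");
        }
        handler.endElement("", "table", "table");
    }
}

/** Collects character data only, ignoring element attributes. */
class TextOnlyHandler extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("td".equals(qName)) {
            text.append('\t');
        } else if ("tr".equals(qName)) {
            text.append('\n');
        }
    }
}
```

Because only non-empty cells are stored, memory use in this sketch grows with the number of populated cells rather than with the sheet dimensions, which is the property the quoted suggestion relies on.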