Re: Streaming vs. other features in parsers

Niall Pemberton Fri, 21 Mar 2008 07:51:37 -0700

On Thu, Mar 20, 2008 at 5:05 PM, Niall Pemberton
<[EMAIL PROTECTED]> wrote:
>
> On Thu, Mar 20, 2008 at 2:56 AM, Jukka Zitting <[EMAIL PROTECTED]> wrote:
>  > Hi,
>  >
>  >
>  >  On Thu, Mar 20, 2008 at 4:11 AM, Niall Pemberton
>  >  <[EMAIL PROTECTED]> wrote:
>  >  > On Wed, Mar 19, 2008 at 4:58 PM, Jukka Zitting <[EMAIL PROTECTED]> wrote:
>  >
>  > >  >  I was looking at implementing link extraction for Excel files, and
>  >  >  >  found out that the link information is only available at the end of
>  >  >  >  the file as a special "cell X links to URI Y" record. The parser could
>  >  >
>  >  >  It's probably academic, but I believe they come at the end of each
>  >  >  sheet, rather than file.
>  >
>  >  You're right, good point!
>  >
>  >  PDF parsing can typically be streamed one page at a time, i.e. you
>  >  need to parse a whole page to be able to render the output, and this
>  >  is something we might want to consider doing also for Excel sheets:
>  >
>  >  How about if the streaming Excel parser maintained a sparse in-memory
>  >  table of the contents of the sheet that is currently being parsed and
>  >  would only generate the respective SAX events once the sheet has been
>  >  parsed? Since we can focus on only the information that's relevant to
>  >  Tika clients, the memory requirements should be moderate even for huge
>  >  sheets (i.e. much less than the file size even for a single-sheet
>  >  file). This should satisfy the low memory footprint requirements
>  >  reasonably well while allowing us to generate more accurate output.
>  >
>  >
>  >  >  I didn't think link support was in the latest POI release and was only
>  >  >  added a few weeks ago:
>  >  >  http://svn.apache.org/viewvc/poi/trunk/src/java/org/apache/poi/hssf/record/HyperlinkRecord.java
>  >  >
>  >  >  Not trying to make any point, just wondering whether I got this wrong
>  >  >  or you found another way or you tried the latest POI from svn?
>  >
>  >  I'm using POI trunk.
>  >
>  >
>  >  >  I think a low-memory-footprint parser still has value, despite this
>  >  >  drawback - I'm pretty sure that where I work lack of hyperlink support
>  >  >  is not an issue. Is there not room for two implementations in Tika?
>  >
>  >  There certainly is, my main concerns are just the duplicate maintenance
>  >  effort and the added configuration complexity.
>  >
>  >  Would the above sheet-by-sheet streaming option work for your
>  >  requirements?
>
>  Sounds good to me. I'll put a patch together.


I've created a JIRA ticket and attached a patch:
  https://issues.apache.org/jira/browse/TIKA-132

Suggestions welcome. If you don't like how it resolves this, I can
work up another patch.
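
For anyone following along, here is a rough sketch of the sheet-by-sheet
buffering idea from Jukka's quoted proposal: keep a sparse table of the
cells seen so far, attach link records whenever they turn up, and only
emit SAX events once the sheet is complete. This is an illustration
only, not the code in the TIKA-132 patch. SheetBuffer, addCell, addLink
and flush are hypothetical names, the table/tr/td/a output shape is an
assumption, and the POI record handling that would feed the buffer is
left out.

```java
import java.util.TreeMap;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical per-sheet buffer; not a Tika or POI class. */
class SheetBuffer {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** One buffered cell: its text and, once a link record is seen, its URI. */
    private static final class Cell {
        String text = "";
        String uri; // stays null unless addLink() is called for this cell
    }

    // Sparse table: only rows and columns that actually hold content get an entry.
    private final TreeMap<Integer, TreeMap<Integer, Cell>> rows = new TreeMap<>();

    private Cell cell(int row, int col) {
        return rows.computeIfAbsent(row, r -> new TreeMap<>())
                   .computeIfAbsent(col, c -> new Cell());
    }

    /** Called for each cell value as it is read from the sheet. */
    void addCell(int row, int col, String text) {
        cell(row, col).text = text;
    }

    /** Called when a "cell X links to URI Y" record shows up, possibly late. */
    void addLink(int row, int col, String uri) {
        cell(row, col).uri = uri;
    }

    /** Called at the end of a sheet: emit everything in row/column order, then reset. */
    void flush(ContentHandler out) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        out.startElement(XHTML, "table", "table", none);
        for (TreeMap<Integer, Cell> row : rows.values()) {
            out.startElement(XHTML, "tr", "tr", none);
            for (Cell c : row.values()) {
                out.startElement(XHTML, "td", "td", none);
                if (c.uri != null) {
                    AttributesImpl href = new AttributesImpl();
                    href.addAttribute("", "href", "href", "CDATA", c.uri);
                    out.startElement(XHTML, "a", "a", href);
                }
                out.characters(c.text.toCharArray(), 0, c.text.length());
                if (c.uri != null) {
                    out.endElement(XHTML, "a", "a");
                }
                out.endElement(XHTML, "td", "td");
            }
            out.endElement(XHTML, "tr", "tr");
        }
        out.endElement(XHTML, "table", "table");
        rows.clear(); // memory is bounded by the largest single sheet
    }
}
```

The nested TreeMap keeps cells in row and column order no matter when
they were added, so a link record that only arrives at the end of the
sheet still lands on the right cell before anything is written out.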

Niall

>  Niall
>
>
>
>  > Alternatively, we could avoid much duplication by making
>  >  the sheet-by-sheet feature a configurable mode of the normal streaming
>  >  Excel parser instead of using a separate parser class.
>  >
>  >  BR,
>  >
>  >  Jukka Zitting
>  >
>
