[jira] Commented: (TIKA-109) WordParser fails on some Word files

Dave Meikle (JIRA) Fri, 04 Jan 2008 08:10:58 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555960#action_12555960
 ]


Dave Meikle commented on TIKA-109:
----------------------------------

The problem is that the current code does not follow the contract of the Word 
format. Comparing the start and end of a TextPiece and a CHPX will not work.

There are three options here:

a) we remove the code to check if text is marked as deleted and just loop 
around the text pieces outputting each one's content - this will allow the 
extraction of the text to be fast but will include text marked as deleted in 
the output

b) we utilise POI to load the full document up, as the POI code will handle 
extracting the CHPX and PAPX required to make both the text and style available 
- this will take longer than option a but will allow text marked as deleted to 
be excluded from the output, as well as presenting the rest of the formatting 
options known by POI

c) we use the POI internal model to do the least amount of extraction to make 
the text and style available, and add the required code to use this.
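
To make the trade-off between these options concrete, here is a minimal sketch 
in plain Java. It does not use POI: the Run class and both method names are 
hypothetical stand-ins for whatever model ends up holding the text and its 
"marked as deleted" flag, so treat it as an illustration of the behaviour 
rather than of the actual parser code.

```java
import java.util.Arrays;
import java.util.List;

public class DeletedTextSketch {

    // Hypothetical stand-in for a run of text plus its "marked as deleted" flag.
    // In options b and c this information would come from the CHPX data.
    static final class Run {
        final String text;
        final boolean markedDeleted;

        Run(String text, boolean markedDeleted) {
            this.text = text;
            this.markedDeleted = markedDeleted;
        }
    }

    // Option a: output every piece of text, ignoring the deleted flag.
    static String extractAll(List<Run> runs) {
        StringBuilder out = new StringBuilder();
        for (Run run : runs) {
            out.append(run.text);
        }
        return out.toString();
    }

    // Options b and c: skip runs whose style marks them as deleted.
    static String extractVisible(List<Run> runs) {
        StringBuilder out = new StringBuilder();
        for (Run run : runs) {
            if (!run.markedDeleted) {
                out.append(run.text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run("Hello ", false),
                new Run("cruel ", true),   // marked as deleted in the document
                new Run("world", false));

        System.out.println(extractAll(runs));     // prints "Hello cruel world"
        System.out.println(extractVisible(runs)); // prints "Hello world"
    }
}
```

The two methods differ only in the deleted check; the real cost difference 
between the options lies in how much of the document has to be loaded to 
obtain that flag in the first place.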

Whilst I am not strongly in favour of any particular approach, if excluding 
text marked as deleted from the output is a requirement, I would suggest 
using approach b. I say this because it will allow us to utilise 
the existing POI code (on which we currently have a hard dependency anyway) to 
make this information available. If we use approach c we are then maintaining 
this code separately from POI and will not benefit from any fixes/changes there.

That said, if the community would prefer to go with option c, I am happy to make 
the change.

Cheers,
Dave

> WordParser fails on some Word files
> -----------------------------------
>
>                 Key: TIKA-109
>                 URL: https://issues.apache.org/jira/browse/TIKA-109
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.1-incubating
>         Environment: Windows XP
> Java(TM) SE Runtime Environment (build 1.6.0_03-b05)
>            Reporter: Mats Norén
>         Attachments: fil6.doc
>
>
> WordParser fails on some Word files. A negative value is sent to 
> TextPiece.substring in POI for some corner case in the algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
