[ 
https://issues.apache.org/jira/browse/TIKA-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633487#comment-14633487
 ] 

Tim Allison edited comment on TIKA-1238 at 7/20/15 3:34 PM:
------------------------------------------------------------

The stacktrace is related to my original problem, but it actually shows an 
inconsistency in POI's handling of {{UnsupportedEncodingException}}.  POI has a 
try-catch block for that exception only on the first choice for guessing the 
7-bit encoding.  The second and third choices take whatever value could be 
pulled out of the header or the HTML meta http-equiv and call 
{{set7BitEncoding(charset)}} without the try-catch block.

Turns out another problem is that, of course, {{Charset.forName()}} can throw 
an {{UnsupportedCharsetException}} (not {{UnsupportedEncodingException}})...so 
that's not even checked for in POI's code.  And, while we're defending against 
trying to create a charset from whatever value we find in msg/html headers or 
codepoint values, we should also add {{IllegalCharsetNameException}} to the 
catch block...or just go for {{IllegalArgumentException}} and be done with it. :)
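
To illustrate the point about exception types, here is a minimal sketch of a guarded lookup.  The class and method names ({{CharsetLookup}}, {{tryCharset}}) are hypothetical and are not taken from POI or Tika; the only real API used is {{java.nio.charset.Charset.forName()}}.  Both {{UnsupportedCharsetException}} and {{IllegalCharsetNameException}} extend {{IllegalArgumentException}}, so a single catch should cover them:

```java
import java.nio.charset.Charset;

// Hypothetical helper, not POI's or Tika's actual code.
public class CharsetLookup {

    /**
     * Returns the named charset, or null if the name is null, syntactically
     * illegal, or not supported by this JVM.
     */
    static Charset tryCharset(String name) {
        if (name == null) {
            return null;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException (malformed name)
            // and UnsupportedCharsetException (well-formed but unknown).
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(tryCharset("UTF-8"));             // a supported name
        System.out.println(tryCharset("x-no-such-charset")); // legal name, unsupported
        System.out.println(tryCharset("bad name!"));         // illegal characters
    }
}
```

Returning null here is just one design choice for the sketch; a caller could equally log the bad value and move on to the next candidate.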

As an immediate fix at the Tika level, we can duplicate POI's 
{{guess7BitEncoding}} but add the try-catch blocks.  I'll open an issue in 
POI's bugtracker, though, to fix this at the POI level too.
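
As a rough sketch of what "add the try-catch blocks" could look like, the following tries each candidate charset name in priority order and guards every attempt, not just the first.  This is not POI's {{guess7BitEncoding}} and makes no claim about its internals; {{GuardedEncodingGuess}}, {{firstUsable}}, and the candidate strings are all assumptions made for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; names and structure are assumptions.
public class GuardedEncodingGuess {

    /**
     * Returns the first candidate name that resolves to a charset,
     * or the fallback if none of them does.
     */
    static Charset firstUsable(Charset fallback, String... candidates) {
        for (String name : candidates) {
            if (name == null) {
                continue; // nothing was found for this choice
            }
            try {
                return Charset.forName(name.trim());
            } catch (IllegalArgumentException e) {
                // Malformed or unsupported value pulled from the message;
                // ignore it and fall through to the next choice.
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Pretend the first value was garbage and the second was usable.
        Charset cs = firstUsable(StandardCharsets.US_ASCII,
                "x-bogus-123", "UTF-8");
        System.out.println(cs);
    }
}
```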

Test files will be very helpful.  If you can share, please do.


was (Author: talli...@mitre.org):
The stacktrace is related to my original problem, but it actually shows an 
inconsistency in POI's handling of {{UnsupportedEncodingException}}.  POI has a 
try-catch block for that exception only on the first choice for guessing the 
7-bit encoding.  The second and third choices take whatever value could be 
pulled out of the header or the HTML meta http-equiv and call 
{{set7BitEncoding(charset)}} without the try-catch block.

As an immediate fix at the Tika level, we can duplicate POI's 
{{guess7BitEncoding}} but add the try-catch blocks.  I'll open an issue in 
POI's bugtracker, though, to fix this at the POI level too.

Test files will be very helpful.  If you can share, please do.

> Update OutlookExtractor to handle codepage identification more rigorously
> -------------------------------------------------------------------------
>
>                 Key: TIKA-1238
>                 URL: https://issues.apache.org/jira/browse/TIKA-1238
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.10
>
>
> Since OutlookExtractor's codepage detection chunk was written, POI's HSMF has 
> added more robust capabilities for identifying codepages in Outlook .msg 
> files.  As a first step to integrating those improvements, I'll copy and 
> paste some of POI's code into OutlookExtractor.  As a second step, I'll 
> expose more of HSMF's capabilities within POI and then factor out the 
> duplicate code in Tika.



