[jira] Updated: (TIKA-339) HtmlParser & TXTParser should not use language returned by CharsetDetector if language hint has been provided

Ken Krugler (JIRA) Tue, 01 Dec 2009 15:57:45 -0800

     [ https://issues.apache.org/jira/browse/TIKA-339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Ken Krugler updated TIKA-339:
-----------------------------

    Attachment: TIKA-339.patch

Patch that uses metadata.add(CharsetDetector language) in the TXTParser, and 
skips this entirely in the HtmlParser.

Plus tests.

> HtmlParser & TXTParser should not use language returned by CharsetDetector if 
> language hint has been provided
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-339
>                 URL: https://issues.apache.org/jira/browse/TIKA-339
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.6
>            Reporter: Ken Krugler
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: TIKA-339.patch
>
>
> Currently, the code that calls CharsetDetector in both TXTParser and 
> HtmlParser replaces any incoming language in the metadata map whenever 
> the detector returns a language.
> Given the low reliability of this language result, it should only be used 
> when no language has been provided. A provided language typically comes 
> from the HTTP response header or (for the HtmlParser) a meta tag or 
> some other tag attribute. In all those cases, the incoming language is more 
> accurate than the guess by the CharsetDetector.
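
The rule described in the issue (keep a provided language, fall back to the detector's guess only when none exists) can be sketched roughly as below. This is a minimal illustration, not the attached patch: a plain Map stands in for Tika's Metadata object, and the LanguageHint class, the LANGUAGE key, and applyDetectedLanguage are hypothetical names introduced only for this example.

```java
import java.util.HashMap;
import java.util.Map;

public class LanguageHint {
    // Hypothetical key; stands in for whatever metadata key carries the language.
    static final String LANGUAGE = "language";

    /**
     * Stores the detector's language guess only when the metadata does not
     * already carry a language hint. Returns true if the guess was stored.
     */
    static boolean applyDetectedLanguage(Map<String, String> metadata, String detected) {
        if (detected == null || detected.isEmpty()) {
            return false; // the detector had no opinion
        }
        String provided = metadata.get(LANGUAGE);
        if (provided != null && !provided.isEmpty()) {
            return false; // a hint (HTTP header, meta tag, ...) wins over the guess
        }
        metadata.put(LANGUAGE, detected);
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> hinted = new HashMap<>();
        hinted.put(LANGUAGE, "fr");           // e.g. taken from a response header
        applyDetectedLanguage(hinted, "en");  // guess is ignored
        System.out.println(hinted.get(LANGUAGE));   // prints "fr"

        Map<String, String> unhinted = new HashMap<>();
        applyDetectedLanguage(unhinted, "en"); // no hint, so the guess is used
        System.out.println(unhinted.get(LANGUAGE)); // prints "en"
    }
}
```

Treating an empty string the same as a missing hint is an assumption of this sketch; a parser could equally decide that an explicitly empty value should block the detector's guess.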

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
