Hans Brende created TIKA-2771:
---------------------------------

             Summary: enableInputFilter() wrecks charset detection for some short html documents
                 Key: TIKA-2771
                 URL: https://issues.apache.org/jira/browse/TIKA-2771
             Project: Tika
          Issue Type: Bug
          Components: detector
    Affects Versions: 1.19.1
            Reporter: Hans Brende


When I try to run the CharsetDetector on 
http://w3c.github.io/microdata-rdf/tests/0065.html I get the very strange most 
confident result of "IBM500" with a confidence of 60 when I enable the input 
filter, *even if I set the declared encoding to UTF-8*.

This can be replicated with the following code:

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.tika.parser.txt.CharsetDetector;

CharsetDetector detect = new CharsetDetector();
detect.enableInputFilter(true);       // strip markup before detection
detect.setDeclaredEncoding("UTF-8");  // hint that should favor UTF-8
detect.setText(("<!DOCTYPE html>\n" +
        "<div>\n" +
        "  <div itemscope itemtype=\"http://schema.org/Person\" id=\"amanda\" itemref=\"a b\"></div>\n" +
        "  <p id=\"a\">Name: <span itemprop=\"name\">Amanda</span></p>\n" +
        "  <p id=\"b\" itemprop=\"band\">Jazz Band</p>\n" +
        "</div>").getBytes(StandardCharsets.UTF_8));
// Print every candidate match, most confident first
Arrays.stream(detect.detectAll()).forEach(System.out::println);
{code}

which prints:
{noformat}
Match of IBM500 in fr with confidence 60
Match of UTF-8 with confidence 57
Match of ISO-8859-9 in tr with confidence 50
Match of ISO-8859-1 in en with confidence 50
Match of ISO-8859-2 in cs with confidence 12
Match of Big5 in zh with confidence 10
Match of EUC-KR in ko with confidence 10
Match of EUC-JP in ja with confidence 10
Match of GB18030 in zh with confidence 10
Match of Shift_JIS in ja with confidence 10
Match of UTF-16LE with confidence 10
Match of UTF-16BE with confidence 10
{noformat}

Note that if I do not set the declared encoding to UTF-8, the result is even 
worse, with UTF-8 falling from a confidence of 57 to 15. 

This is screwing up 1 out of 84 of my online microdata extraction tests over in 
Any23 (as that particular page is being rendered into complete gibberish), so I 
had to implement some hacky workarounds which I'd like to remove if possible.
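
For illustration only, a workaround of this kind might look like the sketch below. It is not the code used in Any23, and the names Utf8Fallback, isStrictUtf8 and chooseCharset are hypothetical. The assumption is that when the declared encoding is UTF-8 and the bytes decode as UTF-8 without errors, the declared encoding can be trusted over the detector's top match. It uses only the JDK, with the detector's result passed in as a plain string.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika or Any23.
final class Utf8Fallback {

    /** Returns true if the bytes decode as UTF-8 with no malformed sequences. */
    static boolean isStrictUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /**
     * Assumed policy: keep a declared UTF-8 when the bytes are valid UTF-8,
     * otherwise fall back to whatever the detector reported.
     */
    static String chooseCharset(byte[] bytes, String declared, String detected) {
        if ("UTF-8".equalsIgnoreCase(declared) && isStrictUtf8(bytes)) {
            return "UTF-8";
        }
        return detected;
    }

    public static void main(String[] args) {
        byte[] html = "<p id=\"a\">Name: <span itemprop=\"name\">Amanda</span></p>"
                .getBytes(StandardCharsets.UTF_8);
        // The detector's top match ("IBM500") is overridden by the valid declared UTF-8.
        System.out.println(chooseCharset(html, "UTF-8", "IBM500"));
    }
}
```

A check like this only papers over the problem for UTF-8 input. It does nothing for documents in other encodings, which is why a fix in the detector itself would be preferable.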



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
