[jira] [Comment Edited] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672520#comment-16672520
 ] 

Hans Brende edited comment on TIKA-2771 at 11/2/18 3:19 AM:


Just had another thought: when the input filter is enabled, it strips 
everything within "<" and ">" brackets (i.e. 0x3C and 0x3E), correct? But doing 
so *presupposes* an ASCII-compatible encoding! Thus, if a significant number of 
matching "<" and ">" symbols are found, you *already* know it can't be IBM500! 
("<" and ">" in IBM500 are 0x4C and 0x6E, respectively, whereas 0x3C and 0x3E 
map to the control characters U+0014 and U+009E.) I assume you could extend 
this logic to other ASCII-incompatible charsets as well.
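The byte-level check described here could be sketched as follows (an illustration of the idea with invented names, not ICU4J's or Tika's actual code):

```java
import java.nio.charset.StandardCharsets;

// Sketch of the proposed heuristic; AsciiMarkupHint is an invented name.
// If the input contains matched ASCII "<" (0x3C) and ">" (0x3E) pairs,
// stripping them as markup presupposed an ASCII-compatible encoding, so
// EBCDIC charsets like IBM500 can be ruled out up front.
public final class AsciiMarkupHint {
    public static boolean looksLikeAsciiMarkup(byte[] input, int minPairs) {
        int open = 0;
        int pairs = 0;
        for (byte b : input) {
            if (b == 0x3C) {                    // ASCII '<' (0x4C in IBM500)
                open++;
            } else if (b == 0x3E && open > 0) { // ASCII '>' (0x6E in IBM500)
                open--;
                pairs++;
            }
        }
        return pairs >= minPairs;
    }

    public static void main(String[] args) {
        byte[] html = "<p>Name: Amanda</p>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeAsciiMarkup(html, 2)); // prints "true"
    }
}
```

A detector could run such a check before scoring any ASCII-incompatible charset and simply skip those candidates when enough matched pairs are found.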


was (Author: hansbrende):
Just had another thought: when the input filter is enabled, it strips 
everything within "<" and ">" brackets (i.e. 0x3C and 0x3E), correct? But doing 
so *presupposes* an ASCII-compatible encoding! Thus, if a significant number of 
matching "<" and ">" symbols are found, you *already* know it can't be IBM500! 
("<" and ">" in IBM500 are 0x4C and 0x6E, respectively.) I assume you could 
extend this logic to other ASCII-incompatible charsets as well.

> enableInputFilter() wrecks charset detection for some short html documents
> --
>
> Key: TIKA-2771
> URL: https://issues.apache.org/jira/browse/TIKA-2771
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.19.1
>Reporter: Hans Brende
>Priority: Critical
>
> When I try to run the CharsetDetector on 
> http://w3c.github.io/microdata-rdf/tests/0065.html I get the very strange 
> most confident result of "IBM500" with a confidence of 60 when I enable the 
> input filter, *even if I set the declared encoding to UTF-8*.
> This can be replicated with the following code:
> {code:java}
> CharsetDetector detect = new CharsetDetector();
> detect.enableInputFilter(true);
> detect.setDeclaredEncoding("UTF-8");
> detect.setText(("<body>\n" +
> "\n" +
> "  <div itemscope itemtype=\"http://schema.org/Person\" id=\"amanda\" itemref=\"a b\">\n" +
> "  <p id=\"a\">Name: <span itemprop=\"name\">Amanda</span></p>\n" +
> "  <p id=\"b\" itemprop=\"band\">Jazz Band</p>\n" +
> "</body>").getBytes(StandardCharsets.UTF_8));
> Arrays.stream(detect.detectAll()).forEach(System.out::println);
> Arrays.stream(detect.detectAll()).forEach(System.out::println);
> {code}
> which prints:
> {noformat}
> Match of IBM500 in fr with confidence 60
> Match of UTF-8 with confidence 57
> Match of ISO-8859-9 in tr with confidence 50
> Match of ISO-8859-1 in en with confidence 50
> Match of ISO-8859-2 in cs with confidence 12
> Match of Big5 in zh with confidence 10
> Match of EUC-KR in ko with confidence 10
> Match of EUC-JP in ja with confidence 10
> Match of GB18030 in zh with confidence 10
> Match of Shift_JIS in ja with confidence 10
> Match of UTF-16LE with confidence 10
> Match of UTF-16BE with confidence 10
> {noformat}
> Note that if I do not set the declared encoding to UTF-8, the result is even 
> worse, with UTF-8 falling from a confidence of 57 to 15. 
> This is screwing up 1 out of 84 of my online microdata extraction tests over 
> in Any23 (as that particular page is being rendered into complete gibberish), 
> so I had to implement some hacky workarounds which I'd like to remove if 
> possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672520#comment-16672520
 ] 

Hans Brende commented on TIKA-2771:
---

Just had another thought: when the input filter is enabled, it strips 
everything within "<" and ">" brackets (i.e. 0x3C and 0x3E), correct? But doing 
so *presupposes* an ASCII-compatible encoding! Thus, if a significant number of 
matching "<" and ">" symbols are found, you *already* know it can't be IBM500! 
("<" and ">" in IBM500 are 0x4C and 0x6E, respectively.) I assume you could 
extend this logic to other ASCII-incompatible charsets as well.



[jira] [Comment Edited] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672134#comment-16672134
 ] 

Hans Brende edited comment on TIKA-2771 at 11/1/18 9:47 PM:


I mean, because otherwise, if you're doing n-gram detection for IBM500 and you 
don't include adjustments to the confidence based on the length of the input, 
you should also be doing n-gram detection for UTF-8! Because if n-gram 
detection for IBM500 interpreted as "fr" gives a confidence of 60, then UTF-8 
interpreted as "en" would probably give a confidence of at least 60 as well! 
(EDIT: Or rather 50, since I see that you've used n-gram detection for 
ISO-8859-1 and gotten a score of 50 for "en". But still, with the declared 
encoding being set to UTF-8, and given the 50 score from n-gram detection, that 
would land you at a score of (100 + 50) / 2 = 75, which would win. But in any 
case, I think that decreasing n-gram confidence across the board for very short 
text would be a good idea.)
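The arithmetic above can be mirrored in a small hypothetical helper (the blending rule is an assumption made for illustration, not ICU4J's actual scoring code):

```java
// Hypothetical scoring helper mirroring the (100 + 50) / 2 = 75 arithmetic
// above; an illustration of the argument, not ICU4J's real confidence code.
public final class ConfidenceBlend {
    // When an n-gram match agrees with the declared encoding, average in a
    // 100-point bonus; otherwise return the raw n-gram score unchanged.
    public static int blended(int ngramScore, boolean matchesDeclaredEncoding) {
        return matchesDeclaredEncoding ? (100 + ngramScore) / 2 : ngramScore;
    }

    public static void main(String[] args) {
        System.out.println(blended(50, true));  // prints "75": declared UTF-8 wins
        System.out.println(blended(60, false)); // prints "60": IBM500, no declaration
    }
}
```

Under this rule a UTF-8 n-gram score of 50 plus a matching declared encoding (75) would beat IBM500's undeclared 60, which is the outcome argued for above.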


was (Author: hansbrende):
I mean, because otherwise, if you're doing n-gram detection for IBM500 and you 
don't include adjustments to the confidence based on the length of the input, 
you should also be doing n-gram detection for UTF-8! Because if n-gram 
detection for IBM500 interpreted as "fr" gives a confidence of 60, then UTF-8 
interpreted as "en" would probably give a confidence of at least 60 as well!



[jira] [Commented] (TIKA-2769) Error while using tika-app on some docs

2018-11-01 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672233#comment-16672233
 ] 

Tim Allison commented on TIKA-2769:
---

Until we can support glossary documents in POI, I added a check+log to skip 
them instead of throwing a ClassCastException.

> Error while using tika-app on some docs
> ---
>
> Key: TIKA-2769
> URL: https://issues.apache.org/jira/browse/TIKA-2769
> Project: Tika
>  Issue Type: Bug
>Reporter: IvanSorokin
>Priority: Major
> Attachments: Examples.zip, Заявление в экспертную организацию.docx
>
>
> I tried to open some files using the ingest plugin for Elasticsearch, and 
> directly with tika-app-19.jar, and got an error.
>  StackTrace: [https://pastebin.com/CHfFMc52]





[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672203#comment-16672203
 ] 

Hans Brende commented on TIKA-2771:
---

(Source: https://w3techs.com/technologies/overview/character_encoding/all )



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672196#comment-16672196
 ] 

Hans Brende commented on TIKA-2771:
---

Oh... and probably the best hint of all that this is not IBM500 is that 
approximately 0.0% of html markup worldwide is encoded in IBM500.



[jira] [Comment Edited] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672134#comment-16672134
 ] 

Hans Brende edited comment on TIKA-2771 at 11/1/18 8:52 PM:


I mean, because otherwise, if you're doing n-gram detection for IBM500 and you 
don't include adjustments to the confidence based on the length of the input, 
you should also be doing n-gram detection for UTF-8! Because if n-gram 
detection for IBM500 interpreted as "fr" gives a confidence of 60, then UTF-8 
interpreted as "en" would probably give a confidence of at least 60 as well!


was (Author: hansbrende):
I mean, because otherwise, if you're doing n-gram detection for IBM500 and you 
don't include adjustments to the confidence based on the length of the input, 
you should also be doing n-gram detection for UTF-8! Because if n-gram 
detection for IBM500 interpreted as "fr" gives a confidence of 60, then UTF-8 
interpreted as "en" should give a confidence of at least 60 as well!



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672178#comment-16672178
 ] 

Hans Brende commented on TIKA-2771:
---

One good hint that this is not IBM500 is that *all* of the characters are 
printable US-ASCII characters, i.e., in the range 0x20 to 0x7E (whereas most of 
IBM500's printable characters are non-ASCII-printable). An even better hint 
that this is not IBM500 is that *all* of the characters are in the range of 
ASCII corresponding to *letters* (plus the space and colon characters).

Hope that helps.
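Both hints could be expressed as simple byte-range predicates (a sketch with invented helper names, not Tika code):

```java
// Sketch of the two hints above (invented helper, not Tika or ICU4J code):
// (1) every byte is printable US-ASCII (0x20-0x7E), and (2) more narrowly,
// the bytes are only letters, spaces, and colons. Genuine IBM500 text would
// almost never satisfy either, since most of its printable characters fall
// outside the printable ASCII byte range.
public final class AsciiRangeHint {
    public static boolean allPrintableAscii(byte[] input) {
        for (byte b : input) {
            int v = b & 0xFF;
            if (v < 0x20 || v > 0x7E) {
                return false;
            }
        }
        return true;
    }

    public static boolean onlyLettersSpacesColons(byte[] input) {
        for (byte b : input) {
            char c = (char) (b & 0xFF);
            if (!Character.isLetter(c) && c != ' ' && c != ':') {
                return false;
            }
        }
        return true;
    }
}
```

Running the narrower check on the filtered text above ("Name: Amanda", "Jazz Band", etc.) would pass, which is strong evidence against any EBCDIC interpretation.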



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672134#comment-16672134
 ] 

Hans Brende commented on TIKA-2771:
---

I mean, because otherwise, if you're doing n-gram detection for IBM500 and you 
don't include adjustments to the confidence based on the length of the input, 
you should also be doing n-gram detection for UTF-8! Because if n-gram 
detection for IBM500 interpreted as "fr" gives a confidence of 60, then UTF-8 
interpreted as "en" should give a confidence of at least 60 as well!



[jira] [Comment Edited] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672116#comment-16672116
 ] 

Hans Brende edited comment on TIKA-2771 at 11/1/18 8:12 PM:


Not sure if this is a contributing factor, but peering into the source code 
reveals that the IBM500 detector is based on ngram detection with a space 
character of 0x20. But the space character for IBM500 is actually 0x40. 

Also, it appears that the confidence for IBM500 is obtained by multiplying the 
raw fractional percentage of ngram hits by 300%. Is that number arbitrary? 
Shouldn't the "confidence" decrease by a lot if the length of the input is very 
small, and therefore not very statistically significant?
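A length-aware adjustment along these lines might look like this (the 300% scaling is taken from the observation above; the damping formula and the 100-byte threshold are invented for illustration and are not ICU4J's actual behavior):

```java
// Hypothetical length-aware damping of an n-gram confidence score.
// The 300% scaling mirrors the behavior described above; the linear
// damping and 100-byte threshold are assumptions for illustration only.
public final class LengthAdjustedNgram {
    public static int confidence(double ngramHitRatio, int inputLength) {
        // Raw score: fractional n-gram hit ratio scaled by 300%, capped at 100.
        int raw = (int) Math.min(100, ngramHitRatio * 300);
        // Short inputs are statistically weak evidence, so damp the score:
        // full weight only once the filtered input reaches 100 bytes.
        double damping = Math.min(1.0, inputLength / 100.0);
        return (int) (raw * damping);
    }
}
```

With such damping, a 20% hit ratio on a long input still scores 60, but the same ratio on a 50-byte input scores only 30, letting the declared encoding win on short documents.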


was (Author: hansbrende):
Not sure if this is a contributing factor, but peering into the source code 
reveals that IBM500 is based on ngrams with a space character of 0x20. But the 
space character for IBM500 is actually 0x40. 

Also, it appears that the confidence for IBM500 is obtained by multiplying the 
raw fractional percentage of ngram hits by 300%. Is that number arbitrary? 
Shouldn't the "confidence" decrease by a lot if the length of the input is very 
small, and therefore not very statistically significant?



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672116#comment-16672116
 ] 

Hans Brende commented on TIKA-2771:
---

Not sure if this is a contributing factor, but peering into the source code 
reveals that IBM500 is based on ngrams with a space character of 0x20. But the 
space character for IBM500 is actually 0x40. 

Also, it appears that the confidence for IBM500 is obtained by multiplying the 
raw fractional percentage of ngram hits by 300%. Is that number arbitrary? 
Shouldn't the "confidence" decrease by a lot if the length of the input is very 
small, and therefore not very statistically significant?



[jira] [Comment Edited] (TIKA-2750) Update regression corpus

2018-11-01 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671965#comment-16671965
 ] 

Tim Allison edited comment on TIKA-2750 at 11/1/18 6:22 PM:


I just attached the output of counting pairs of "mime" and "mime-detected" in 
last month's common crawl.  This is tangentially related to this issue, and 
might yield something interesting...although I didn't see anything that 
requires immediate attention.

Many, many thanks to [~wastl-nagel] and Common Crawl for running Tika detection 
as part of the crawl and then including that info in the indices!!!


was (Author: talli...@mitre.org):
I just attached the output of counting pairs of "mime" and "mime-detected" in 
last month's common crawl.  This is tangentially related to this issue, and 
might yield something interesting...although I didn't see anything that 
requires immediate attention.

> Update regression corpus
> 
>
> Key: TIKA-2750
> URL: https://issues.apache.org/jira/browse/TIKA-2750
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: CC-MAIN-2018-39-mimes-charsets-by-tld.zip, 
> CC-MAIN-2018-39-mimes-v-detected.zip
>
>
> I think we've had great success with the current data on our regression 
> corpus.  I'd like to refresh some data from Common Crawl with three primary 
> goals:
> 1) include more interesting documents (e.g. down sample English UTF-8 
> text/html)
> 2) include more recent documents (perhaps newer features in PDFs? definitely 
> more ooxml)
> 3) identify and re-fetch truncated documents from the original site -- 
> CommonCrawl truncates docs at 1 MB.  I think some truncated documents have 
> been quite useful, similar to fuzzing, for identifying serious problems with 
> some of our parsers.  However, it would be useful to have more complete 
> files, esp. for PDFs.  In short, we should keep some truncated documents, but 
> I'd also like to get more complete docs.





[jira] [Commented] (TIKA-2750) Update regression corpus

2018-11-01 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671965#comment-16671965
 ] 

Tim Allison commented on TIKA-2750:
---

I just attached the output of counting pairs of "mime" and "mime-detected" in 
last month's common crawl.  This is tangentially related to this issue, and 
might yield something interesting...although I didn't see anything that 
requires immediate attention.

> Update regression corpus
> 
>
> Key: TIKA-2750
> URL: https://issues.apache.org/jira/browse/TIKA-2750
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: CC-MAIN-2018-39-mimes-charsets-by-tld.zip, 
> CC-MAIN-2018-39-mimes-v-detected.zip
>
>
> I think we've had great success with the current data on our regression 
> corpus.  I'd like to refresh some data from Common Crawl with three primary 
> goals:
> 1) include more interesting documents (e.g. downsample English UTF-8 
> text/html)
> 2) include more recent documents (perhaps newer features in PDFs? definitely 
> more ooxml)
> 3) identify and re-fetch truncated documents from the original site -- 
> CommonCrawl truncates docs at 1 MB.  I think some truncated documents have 
> been quite useful, similar to fuzzing, for identifying serious problems with 
> some of our parsers.  However, it would be useful to have more complete 
> files, esp. for PDFs.  In short, we should keep some truncated documents, but 
> I'd also like to get more complete docs.





[jira] [Updated] (TIKA-2750) Update regression corpus

2018-11-01 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2750:
--
Attachment: CC-MAIN-2018-39-mimes-v-detected.zip

> Update regression corpus
> 
>
> Key: TIKA-2750
> URL: https://issues.apache.org/jira/browse/TIKA-2750
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: CC-MAIN-2018-39-mimes-charsets-by-tld.zip, 
> CC-MAIN-2018-39-mimes-v-detected.zip
>
>
> I think we've had great success with the current data on our regression 
> corpus.  I'd like to refresh some data from Common Crawl with three primary 
> goals:
> 1) include more interesting documents (e.g. downsample English UTF-8 
> text/html)
> 2) include more recent documents (perhaps newer features in PDFs? definitely 
> more ooxml)
> 3) identify and re-fetch truncated documents from the original site -- 
> CommonCrawl truncates docs at 1 MB.  I think some truncated documents have 
> been quite useful, similar to fuzzing, for identifying serious problems with 
> some of our parsers.  However, it would be useful to have more complete 
> files, esp. for PDFs.  In short, we should keep some truncated documents, but 
> I'd also like to get more complete docs.





[jira] [Created] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)
Hans Brende created TIKA-2771:
-

 Summary: enableInputFilter() wrecks charset detection for some 
short html documents
 Key: TIKA-2771
 URL: https://issues.apache.org/jira/browse/TIKA-2771
 Project: Tika
  Issue Type: Bug
  Components: detector
Affects Versions: 1.19.1
Reporter: Hans Brende


When I try to run the CharsetDetector on 
http://w3c.github.io/microdata-rdf/tests/0065.html I get the very strange most 
confident result of "IBM500" with a confidence of 60 when I enable the input 
filter, *even if I set the declared encoding to UTF-8*.

This can be replicated with the following code:

{code:java}
CharsetDetector detect = new CharsetDetector();
detect.enableInputFilter(true);
detect.setDeclaredEncoding("UTF-8");
detect.setText(("<!DOCTYPE html>\n" +
"<html>\n" +
"  <div itemscope itemtype=\"http://schema.org/Person\" id=\"amanda\" 
itemref=\"a b\">\n" +
"  <p id=\"a\">Name: <span itemprop=\"name\">Amanda</span></p>\n" +
"  <div id=\"b\" itemprop=\"band\">Jazz Band</div>\n" +
"</html>").getBytes(StandardCharsets.UTF_8));
Arrays.stream(detect.detectAll()).forEach(System.out::println);
{code}

which prints:
{noformat}
Match of IBM500 in fr with confidence 60
Match of UTF-8 with confidence 57
Match of ISO-8859-9 in tr with confidence 50
Match of ISO-8859-1 in en with confidence 50
Match of ISO-8859-2 in cs with confidence 12
Match of Big5 in zh with confidence 10
Match of EUC-KR in ko with confidence 10
Match of EUC-JP in ja with confidence 10
Match of GB18030 in zh with confidence 10
Match of Shift_JIS in ja with confidence 10
Match of UTF-16LE with confidence 10
Match of UTF-16BE with confidence 10
{noformat}

Note that if I do not set the declared encoding to UTF-8, the result is even 
worse, with UTF-8 falling from a confidence of 57 to 15. 

This is screwing up 1 out of 84 of my online microdata extraction tests over in 
Any23 (as that particular page is being rendered into complete gibberish), so I 
had to implement some hacky workarounds which I'd like to remove if possible.
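One possible mitigation, sketched below purely as an illustration (the class, method, and threshold are hypothetical, not part of the ICU/Tika CharsetDetector API): since the input filter strips text between "<" and ">" (bytes 0x3C/0x3E), it already presupposes an ASCII-compatible encoding, so a pre-check for matching bracket bytes could rule out ASCII-incompatible candidates such as IBM500 (where "<" and ">" are 0x4C and 0x6E) before trusting their scores:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical pre-check, not part of the CharsetDetector API: if the input
// contains a significant number of matching 0x3C/0x3E byte pairs, the markup
// brackets themselves imply an ASCII-compatible encoding, so ASCII-incompatible
// candidates such as IBM500 (where '<' is 0x4C and '>' is 0x6E) could be
// discarded or penalized.
public class AsciiMarkupCheck {

    /** Returns true if the bytes look like ASCII-compatible markup. */
    public static boolean looksLikeAsciiMarkup(byte[] bytes) {
        int open = 0, pairs = 0;
        for (byte b : bytes) {
            if (b == 0x3C) {                    // ASCII '<'
                open++;
            } else if (b == 0x3E && open > 0) { // ASCII '>' closing a bracket
                open--;
                pairs++;
            }
        }
        return pairs >= 3;  // threshold is an arbitrary illustration
    }

    public static void main(String[] args) {
        byte[] html = "<html><body><p>hi</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(looksLikeAsciiMarkup(html));
    }
}
```

With a check like this, an EBCDIC-family match on bracket-heavy input could be dropped before picking the most confident result.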





Using code coverage metrics to help winnow our large scale regression corpus

2018-11-01 Thread Tim Allison
Rohan and Tobias,
   This isn't quite a question about fuzzing, but I suspect you might
be able to help with this:

https://issues.apache.org/jira/browse/TIKA-2750?focusedCommentId=16671472=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16671472

Cheers,

  Tim


[jira] [Commented] (TIKA-2750) Update regression corpus

2018-11-01 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671472#comment-16671472
 ] 

Tim Allison commented on TIKA-2750:
---

I'd like to remove "boring" and/or basically duplicative documents from our 
regression corpus.  We already effectively remove exact dupes because we store 
files by their hashes.

Unfortunately, I don't have a great definition of boring (aside from 
ascii/UTF-8 English text files), and I recognize that "boring" today may not be 
"boring" tomorrow if a given document contains a feature that our parsers 
ignore at the moment.

Ideally, if two documents exercise the same lines of code, I'd want to remove 
one of them.

Could we use JaCoCo or something similar to identify documents that exercise 
similar code paths?  Or, more generally, can we measure coverage of our code 
base for a given set of documents fairly easily?

I have zero experience in this realm and welcome input!
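One way to make the "same lines of code" idea concrete, assuming per-document line coverage has already been collected somehow (e.g. by running each file through the parsers under a coverage agent) — all names here are illustrative, and this is a greedy sketch rather than a real tool:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;

// Sketch of the winnowing idea: given per-document line coverage (however it
// was collected), greedily keep a document only if it covers at least one line
// that no previously kept document covers. Names here are illustrative.
public class CorpusWinnower {

    /** Returns the names of documents worth keeping. */
    public static List<String> winnow(Map<String, BitSet> coverageByDoc) {
        BitSet covered = new BitSet();
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, BitSet> e : coverageByDoc.entrySet()) {
            BitSet novel = (BitSet) e.getValue().clone();
            novel.andNot(covered);   // lines this document covers that nothing kept so far covers
            if (!novel.isEmpty()) {
                kept.add(e.getKey());
                covered.or(e.getValue());
            }
        }
        return kept;
    }
}
```

A document whose coverage is entirely subsumed by earlier documents is dropped; order-sensitivity is the usual caveat of greedy set cover.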

> Update regression corpus
> 
>
> Key: TIKA-2750
> URL: https://issues.apache.org/jira/browse/TIKA-2750
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: CC-MAIN-2018-39-mimes-charsets-by-tld.zip
>
>
> I think we've had great success with the current data on our regression 
> corpus.  I'd like to re-fresh some data from common crawl with three primary 
> goals:
> 1) include more interesting documents (e.g. down sample English UTF-8 
> text/html)
> 2) include more recent documents (perhaps newer features in PDFs? definitely 
> more ooxml)
> 3) identify and re-fetch truncated documents from the original site -- 
> CommonCrawl truncates docs at 1 MB.  I think some truncated documents have 
> been quite useful, similar to fuzzing, for identifying serious problems with 
> some of our parsers.  However, it would be useful to have more complete 
> files, esp. for PDFs.  In short, we should keep some truncated documents, but 
> I'd also like to get more complete docs.





[jira] [Commented] (TIKA-2760) LinkContentHandler does not report hyperlinks

2018-11-01 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671452#comment-16671452
 ] 

Markus Jelsma commented on TIKA-2760:
-

Hello [~davemeikle],

Of course! I cannot understand why I did not see this; I am so sorry to have 
bothered anybody with this nuisance.

My apologies,
Markus

> LinkContentHandler does not report hyperlinks
> -
>
> Key: TIKA-2760
> URL: https://issues.apache.org/jira/browse/TIKA-2760
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: TIKA-2760 - Test for Outlinks.diff, TIKA-2760.patch, 
> ronaldmcdonald-nolinks.html
>
>
> Nutch uses LinkContentHandler for collecting hyperlinks, and it does not report 
> any hyperlinks for http://www.ronaldmcdonaldhouse.co.uk/, which I'll also 
> attach to this ticket.
> Debugging LinkContentHandler to print element names in startElement reveals 
> that only very few HTML elements get reported, which I think is incorrect.
> Our own parser in Nutch uses a custom ContentHandler and does report many 
> elements, including hyperlinks.
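For comparison, here is a bare-bones custom ContentHandler of the sort the quoted description mentions, written against the JDK's built-in SAX API. This is illustrative only — it is neither Tika's LinkContentHandler nor the Nutch implementation, and it only handles well-formed XHTML:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Minimal sketch of a custom link-collecting handler: record the href
// attribute of every <a> element seen during parsing.
public class HrefCollector extends DefaultHandler {

    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("a".equalsIgnoreCase(qName)) {
            String href = atts.getValue("href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    public List<String> getLinks() {
        return links;
    }

    /** Convenience: parse an XHTML string and return all collected hrefs. */
    public static List<String> extract(String xhtml) throws Exception {
        HrefCollector collector = new HrefCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.getLinks();
    }
}
```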





[jira] [Closed] (TIKA-2760) LinkContentHandler does not report hyperlinks

2018-11-01 Thread Markus Jelsma (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed TIKA-2760.
---

> LinkContentHandler does not report hyperlinks
> -
>
> Key: TIKA-2760
> URL: https://issues.apache.org/jira/browse/TIKA-2760
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: TIKA-2760 - Test for Outlinks.diff, TIKA-2760.patch, 
> ronaldmcdonald-nolinks.html
>
>
> Nutch uses LinkContentHandler for collecting hyperlinks, and it does not report 
> any hyperlinks for http://www.ronaldmcdonaldhouse.co.uk/, which I'll also 
> attach to this ticket.
> Debugging LinkContentHandler to print element names in startElement reveals 
> that only very few HTML elements get reported, which I think is incorrect.
> Our own parser in Nutch uses a custom ContentHandler and does report many 
> elements, including hyperlinks.





[jira] [Resolved] (TIKA-2760) LinkContentHandler does not report hyperlinks

2018-11-01 Thread Markus Jelsma (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved TIKA-2760.
-
Resolution: Not A Problem

> LinkContentHandler does not report hyperlinks
> -
>
> Key: TIKA-2760
> URL: https://issues.apache.org/jira/browse/TIKA-2760
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: TIKA-2760 - Test for Outlinks.diff, TIKA-2760.patch, 
> ronaldmcdonald-nolinks.html
>
>
> Nutch uses LinkContentHandler for collecting hyperlinks, and it does not report 
> any hyperlinks for http://www.ronaldmcdonaldhouse.co.uk/, which I'll also 
> attach to this ticket.
> Debugging LinkContentHandler to print element names in startElement reveals 
> that only very few HTML elements get reported, which I think is incorrect.
> Our own parser in Nutch uses a custom ContentHandler and does report many 
> elements, including hyperlinks.





[jira] [Closed] (TIKA-2770) Convert EnviHeader "map info" from UTM to LatLon

2018-11-01 Thread Kristen Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kristen Cheung closed TIKA-2770.

Resolution: Fixed

> Convert EnviHeader "map info" from UTM to LatLon
> 
>
> Key: TIKA-2770
> URL: https://issues.apache.org/jira/browse/TIKA-2770
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 2.0.0
>Reporter: Kristen Cheung
>Priority: Major
>
> Would like to create conversion logic for coordinates from the Universal 
> Transverse Mercator (UTM) to the Geographic (Lat/Lon) coordinate system in 
> Apache Tika's extraction logic, found in 
> tika/tika-parsers/src/main/java/org/apache/tika/parser/envi/EnviHeaderParser.java





[jira] [Created] (TIKA-2770) Convert EnviHeader "map info" from UTM to LatLon

2018-11-01 Thread Kristen Cheung (JIRA)
Kristen Cheung created TIKA-2770:


 Summary: Convert EnviHeader "map info" from UTM to LatLon
 Key: TIKA-2770
 URL: https://issues.apache.org/jira/browse/TIKA-2770
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 2.0.0
Reporter: Kristen Cheung


Would like to create conversion logic for coordinates from the Universal 
Transverse Mercator (UTM) to the Geographic (Lat/Lon) coordinate system in 
Apache Tika's extraction logic, found in 
tika/tika-parsers/src/main/java/org/apache/tika/parser/envi/EnviHeaderParser.java
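For illustration, a minimal self-contained sketch of the inverse conversion using Snyder's Transverse Mercator series on the WGS84 ellipsoid; the class and method names are hypothetical and not part of EnviHeaderParser:

```java
// Sketch of inverse UTM -> (lat, lon) using Snyder's series expansion on the
// WGS84 ellipsoid. Class/method names are illustrative only.
public class UtmToLatLon {

    private static final double A = 6378137.0;          // WGS84 semi-major axis (m)
    private static final double F = 1 / 298.257223563;  // WGS84 flattening
    private static final double K0 = 0.9996;            // UTM scale factor

    /** Converts UTM (zone, hemisphere, easting, northing) to {lat, lon} in degrees. */
    public static double[] toLatLon(int zone, boolean northern, double easting, double northing) {
        double e2 = F * (2 - F);      // first eccentricity squared
        double ep2 = e2 / (1 - e2);   // second eccentricity squared
        double x = easting - 500000.0;                                  // remove false easting
        double y = northern ? northing : northing - 10000000.0;         // remove false northing

        double m = y / K0;            // meridional arc length
        double mu = m / (A * (1 - e2 / 4 - 3 * e2 * e2 / 64 - 5 * e2 * e2 * e2 / 256));
        double e1 = (1 - Math.sqrt(1 - e2)) / (1 + Math.sqrt(1 - e2));

        // footprint latitude
        double phi1 = mu
                + (3 * e1 / 2 - 27 * Math.pow(e1, 3) / 32) * Math.sin(2 * mu)
                + (21 * e1 * e1 / 16 - 55 * Math.pow(e1, 4) / 32) * Math.sin(4 * mu)
                + (151 * Math.pow(e1, 3) / 96) * Math.sin(6 * mu);

        double sin1 = Math.sin(phi1), cos1 = Math.cos(phi1), tan1 = Math.tan(phi1);
        double c1 = ep2 * cos1 * cos1;
        double t1 = tan1 * tan1;
        double n1 = A / Math.sqrt(1 - e2 * sin1 * sin1);                // prime vertical radius
        double r1 = A * (1 - e2) / Math.pow(1 - e2 * sin1 * sin1, 1.5); // meridional radius
        double d = x / (n1 * K0);

        double lat = phi1 - (n1 * tan1 / r1)
                * (d * d / 2
                - (5 + 3 * t1 + 10 * c1 - 4 * c1 * c1 - 9 * ep2) * Math.pow(d, 4) / 24
                + (61 + 90 * t1 + 298 * c1 + 45 * t1 * t1 - 252 * ep2 - 3 * c1 * c1)
                        * Math.pow(d, 6) / 720);
        double lon0 = Math.toRadians((zone - 1) * 6 - 180 + 3);         // zone central meridian
        double lon = lon0 + (d
                - (1 + 2 * t1 + c1) * Math.pow(d, 3) / 6
                + (5 - 2 * c1 + 28 * t1 - 3 * c1 * c1 + 8 * ep2 + 24 * t1 * t1)
                        * Math.pow(d, 5) / 120) / cos1;

        return new double[] { Math.toDegrees(lat), Math.toDegrees(lon) };
    }
}
```

E.g. zone 31N with easting 500000 and northing 0 maps back to latitude 0°, longitude 3° (the zone's central meridian).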





[jira] [Commented] (TIKA-2760) LinkContentHandler does not report hyperlinks

2018-11-01 Thread Dave Meikle (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671189#comment-16671189
 ] 

Dave Meikle commented on TIKA-2760:
---

Hi [~markus17],

Looking at the Nutch code I can see that TikaParser has logic to honour the 
setting in the robots metadata.  As this page is setting _nofollow,_ the parser 
doesn't add the links found by Tika's LinkContentHandler to the outlinks.

If you remove the nofollow from the HTML file's metadata, you'll see it all flow 
through into Nutch, e.g. changing:

{{<meta name="robots" content="nofollow">}}

to

{{<meta name="robots" content="follow">}}

It should all flow through as normal.

Cheers,
Dave

 

> LinkContentHandler does not report hyperlinks
> -
>
> Key: TIKA-2760
> URL: https://issues.apache.org/jira/browse/TIKA-2760
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: TIKA-2760 - Test for Outlinks.diff, TIKA-2760.patch, 
> ronaldmcdonald-nolinks.html
>
>
> Nutch uses LinkContentHandler for collecting hyperlinks, and it does not report 
> any hyperlinks for http://www.ronaldmcdonaldhouse.co.uk/, which I'll also 
> attach to this ticket.
> Debugging LinkContentHandler to print element names in startElement reveals 
> that only very few HTML elements get reported, which I think is incorrect.
> Our own parser in Nutch uses a custom ContentHandler and does report many 
> elements, including hyperlinks.


