[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-08 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680557#comment-16680557 ] Hans Brende commented on TIKA-2771: --- [~talli...@apache.org] Great! I will definitely check that out.

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-08 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680404#comment-16680404 ] Nick Burch commented on TIKA-2771: -- I'm not sure we do. We have documents along with the encoding that

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-08 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680452#comment-16680452 ] Tim Allison commented on TIKA-2771: --- [~HansBrende], funny you mention that...as [~gagravarr] pointed out

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-08 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680378#comment-16680378 ] Hans Brende commented on TIKA-2771: --- [~talli...@apache.org] Does Tika have a corpus of documents paired

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-06 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677340#comment-16677340 ] Hans Brende commented on TIKA-2771: --- [~talli...@apache.org] I've implemented my ideas for charset

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-06 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676926#comment-16676926 ] Hans Brende commented on TIKA-2771: --- One thing I am sure of, however, is that if your chances of getting

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-06 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676887#comment-16676887 ] Hans Brende commented on TIKA-2771: --- Compare to the following analogous test for ISO-8859-1 variants:

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676109#comment-16676109 ] Hans Brende commented on TIKA-2771: --- [~talli...@apache.org] I did a little experimentation with each of

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675880#comment-16675880 ] Hans Brende commented on TIKA-2771: --- [~wave] Yep, just ran the following {code:java} IntStream.range(0,

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675839#comment-16675839 ] Tim Allison commented on TIKA-2771: --- I was thinking something similar...

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675828#comment-16675828 ] Hans Brende commented on TIKA-2771: --- [~talli...@apache.org] Ah, you're correct as regards the byteMap.

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675740#comment-16675740 ] Tim Allison commented on TIKA-2771: --- Got it. Thank you. bq. which calls: match(det, ngrams, byteMap,

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675708#comment-16675708 ] Hans Brende commented on TIKA-2771: --- [~talli...@apache.org] I'm not sure which all of the charsets are

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675527#comment-16675527 ] Tim Allison commented on TIKA-2771: --- I'm happy enough adding this check into EBCDIC500. Are there any

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675520#comment-16675520 ] Tim Allison commented on TIKA-2771: --- When I add a {{tagsWereStripped}}, and have the EBCDIC500 charsets

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675511#comment-16675511 ] Tim Allison commented on TIKA-2771: --- Let me try again. I _think_ I've re-engaged my brain before I

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-02 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673481#comment-16673481 ] Hans Brende commented on TIKA-2771: --- (Also relating to my last thought, on the subject of "waiting for

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-02 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673429#comment-16673429 ] Hans Brende commented on TIKA-2771: --- (For my last thought, I'd recommend taking a look at this:

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-02 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673400#comment-16673400 ] Hans Brende commented on TIKA-2771: --- [~talli...@apache.org] I totally understand not wanting to modify

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673220#comment-16673220 ] Tim Allison commented on TIKA-2771: --- Let me re-engage brain before typing again... sorry.

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-02 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673215#comment-16673215 ] Hans Brende commented on TIKA-2771: --- [~talli...@apache.org] IBM500 (a.k.a. EBCDIC 500) is an EBCDIC

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673188#comment-16673188 ] Tim Allison commented on TIKA-2771: --- [~HansBrende], thank you for raising this issue and sharing this

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672520#comment-16672520 ] Hans Brende commented on TIKA-2771: --- Just had another thought: when the input filter is enabled, it

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672203#comment-16672203 ] Hans Brende commented on TIKA-2771: --- (Source:

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672196#comment-16672196 ] Hans Brende commented on TIKA-2771: --- Oh... and probably the best hint of all that this is not IBM500 is

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672178#comment-16672178 ] Hans Brende commented on TIKA-2771: --- One good hint that this is not IBM500 is that *all* of the

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672134#comment-16672134 ] Hans Brende commented on TIKA-2771: --- I mean, because otherwise, if you're doing n-gram detection for

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672116#comment-16672116 ] Hans Brende commented on TIKA-2771: --- Not sure if this is a contributing factor, but peering into the