[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680557#comment-16680557
]
Hans Brende commented on TIKA-2771:
---
[~talli...@apache.org] Great! I will definitely check that out.
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680404#comment-16680404
]
Nick Burch commented on TIKA-2771:
---
I'm not sure we do. We have documents along with the encoding that
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680452#comment-16680452
]
Tim Allison commented on TIKA-2771:
---
[~HansBrende], funny you mention that...as [~gagravarr] pointed out
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680378#comment-16680378
]
Hans Brende commented on TIKA-2771:
---
[~talli...@apache.org] Does Tika have a corpus of documents paired
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677340#comment-16677340
]
Hans Brende commented on TIKA-2771:
---
[~talli...@apache.org] I've implemented my ideas for charset
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676926#comment-16676926
]
Hans Brende commented on TIKA-2771:
---
One thing I am sure of, however, is that if your chances of getting
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676887#comment-16676887
]
Hans Brende commented on TIKA-2771:
---
Compare to the following analogous test for ISO-8859-1 variants:
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676109#comment-16676109
]
Hans Brende commented on TIKA-2771:
---
[~talli...@apache.org] I did a little experimentation with each of
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675880#comment-16675880
]
Hans Brende commented on TIKA-2771:
---
[~wave] Yep, just ran the following
{code:java}
IntStream.range(0,
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675839#comment-16675839
]
Tim Allison commented on TIKA-2771:
---
I was thinking something similar...
> enableInputFilter() wrecks
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675828#comment-16675828
]
Hans Brende commented on TIKA-2771:
---
[~talli...@apache.org] Ah, you're correct as regards the byteMap.
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675740#comment-16675740
]
Tim Allison commented on TIKA-2771:
---
Got it. Thank you.
bq. which calls: match(det, ngrams, byteMap,
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675708#comment-16675708
]
Hans Brende commented on TIKA-2771:
---
[~talli...@apache.org] I'm not sure which all of the charsets are
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675527#comment-16675527
]
Tim Allison commented on TIKA-2771:
---
I'm happy enough adding this check into EBCDIC500. Are there any
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675520#comment-16675520
]
Tim Allison commented on TIKA-2771:
---
When I add a {{tagsWereStripped}}, and have the EBCDIC500 charsets
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675511#comment-16675511
]
Tim Allison commented on TIKA-2771:
---
Let me try again. I _think_ I've re-engaged my brain before I
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673481#comment-16673481
]
Hans Brende commented on TIKA-2771:
---
(Also relating to my last thought, on the subject of "waiting for
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673429#comment-16673429
]
Hans Brende commented on TIKA-2771:
---
(For my last thought, I'd recommend taking a look at this:
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673400#comment-16673400
]
Hans Brende commented on TIKA-2771:
---
[~talli...@apache.org] I totally understand not wanting to modify
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673220#comment-16673220
]
Tim Allison commented on TIKA-2771:
---
let me re-engage brain before typing again...sorry.
>
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673215#comment-16673215
]
Hans Brende commented on TIKA-2771:
---
[~talli...@apache.org] IBM500 (a.k.a. EBCDIC 500) is an EBCDIC
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673188#comment-16673188
]
Tim Allison commented on TIKA-2771:
---
[~HansBrende], thank you for raising this issue and sharing this
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672520#comment-16672520
]
Hans Brende commented on TIKA-2771:
---
Just had another thought: when the input filter is enabled, it
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672203#comment-16672203
]
Hans Brende commented on TIKA-2771:
---
(Source:
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672196#comment-16672196
]
Hans Brende commented on TIKA-2771:
---
Oh... and probably the best hint of all that this is not IBM500 is
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672178#comment-16672178
]
Hans Brende commented on TIKA-2771:
---
One good hint that this is not IBM500 is that *all* of the
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672134#comment-16672134
]
Hans Brende commented on TIKA-2771:
---
I mean, because otherwise, if you're doing n-gram detection for
[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672116#comment-16672116
]
Hans Brende commented on TIKA-2771:
---
Not sure if this is a contributing factor, but peering into the
28 matches