[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676109#comment-16676109 ] Hans Brende commented on TIKA-2771: --- [~talli...@apache.org] I did a little experimentation with each of

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675880#comment-16675880 ] Hans Brende commented on TIKA-2771: --- [~wave] Yep, just ran the following {code:java} IntStream.range(0,

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675839#comment-16675839 ] Tim Allison commented on TIKA-2771: --- I was thinking something similar... > enableInputFilter() wrecks

[jira] [Comment Edited] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675828#comment-16675828 ] Hans Brende edited comment on TIKA-2771 at 11/5/18 10:44 PM: -

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675828#comment-16675828 ] Hans Brende commented on TIKA-2771: --- [~talli...@apache.org] Ah, you're correct as regards the byteMap.

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675740#comment-16675740 ] Tim Allison commented on TIKA-2771: --- Got it. Thank you. bq. which calls: match(det, ngrams, byteMap,

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675708#comment-16675708 ] Hans Brende commented on TIKA-2771: --- [~talli...@apache.org] I'm not sure which all of the charsets are

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675527#comment-16675527 ] Tim Allison commented on TIKA-2771: --- I'm happy enough adding this check into EBCDIC500. Are there any

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675520#comment-16675520 ] Tim Allison commented on TIKA-2771: --- When I add a {{tagsWereStripped}}, and have the EBCDIC500 charsets

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675511#comment-16675511 ] Tim Allison commented on TIKA-2771: --- Let me try again. I _think_ I've re-engaged my brain before I

[jira] [Created] (TIKA-2772) Problem if cell contains quotation marks (")

2018-11-05 Thread ionut hodor (JIRA)
ionut hodor created TIKA-2772: - Summary: Problem if cell contains quotation marks (") Key: TIKA-2772 URL: https://issues.apache.org/jira/browse/TIKA-2772 Project: Tika Issue Type: Bug

[jira] [Commented] (TIKA-2750) Update regression corpus

2018-11-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675342#comment-16675342 ] Tim Allison commented on TIKA-2750: --- To my query above about jacoco, see the responses by Tobias Ospelt

[jira] [Commented] (TIKA-2750) Update regression corpus

2018-11-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675322#comment-16675322 ] Tim Allison commented on TIKA-2750: --- I just added charset and lang by tld in last month's CommonCrawl

[jira] [Updated] (TIKA-2750) Update regression corpus

2018-11-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2750: -- Attachment: CC-MAIN-2018-39-charset_lang_by_tld.zip > Update regression corpus >

[jira] [Commented] (TIKA-2765) Regression extracting text from corrupted docx files

2018-11-05 Thread Luis Filipe Nassif (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675281#comment-16675281 ] Luis Filipe Nassif commented on TIKA-2765: -- POI-62886 created. Thanks [~talli...@apache.org] and

[jira] [Commented] (TIKA-2767) Problem with import xlsx with null cells

2018-11-05 Thread ionut hodor (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674833#comment-16674833 ] ionut hodor commented on TIKA-2767: --- Hi [~davemeikle], thank you for answering me. I attached 2 files,

[jira] [Updated] (TIKA-2767) Problem with import xlsx with null cells

2018-11-05 Thread ionut hodor (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ionut hodor updated TIKA-2767: -- Attachment: exampleXLS.xls exampleXLSX.xlsx > Problem with import xlsx with null cells

[jira] [Commented] (TIKA-2767) Problem with import xlsx with null cells

2018-11-05 Thread ionut hodor (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674826#comment-16674826 ] ionut hodor commented on TIKA-2767: --- Hi [~davemeikle] I have 2 examples for you > Problem with import

[jira] [Issue Comment Deleted] (TIKA-2767) Problem with import xlsx with null cells

2018-11-05 Thread ionut hodor (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ionut hodor updated TIKA-2767: -- Comment: was deleted (was: Hi [~davemeikle] I have 2 examples for you) > Problem with import xlsx with