[jira] [Comment Edited] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672520#comment-16672520 ] Hans Brende edited comment on TIKA-2771 at 11/2/18 3:19 AM: Just had another

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672520#comment-16672520 ] Hans Brende commented on TIKA-2771: --- Just had another thought: when the input filter is enabled, it

[jira] [Comment Edited] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672134#comment-16672134 ] Hans Brende edited comment on TIKA-2771 at 11/1/18 9:47 PM: I mean, because

[jira] [Commented] (TIKA-2769) Error while using tika-app on some docs

2018-11-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672233#comment-16672233 ] Tim Allison commented on TIKA-2769: --- Until we can support glossary documents in POI, I added a check+log

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672203#comment-16672203 ] Hans Brende commented on TIKA-2771: --- (Source:

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672196#comment-16672196 ] Hans Brende commented on TIKA-2771: --- Oh... and probably the best hint of all that this is not IBM500 is

[jira] [Comment Edited] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672134#comment-16672134 ] Hans Brende edited comment on TIKA-2771 at 11/1/18 8:52 PM: I mean, because

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672178#comment-16672178 ] Hans Brende commented on TIKA-2771: --- One good hint that this is not IBM500 is that *all* of the

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672134#comment-16672134 ] Hans Brende commented on TIKA-2771: --- I mean, because otherwise, if you're doing n-gram detection for

[jira] [Comment Edited] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672116#comment-16672116 ] Hans Brende edited comment on TIKA-2771 at 11/1/18 8:12 PM: Not sure if this

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672116#comment-16672116 ] Hans Brende commented on TIKA-2771: --- Not sure if this is a contributing factor, but peering into the

[jira] [Comment Edited] (TIKA-2750) Update regression corpus

2018-11-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671965#comment-16671965 ] Tim Allison edited comment on TIKA-2750 at 11/1/18 6:22 PM: I just attached

[jira] [Commented] (TIKA-2750) Update regression corpus

2018-11-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671965#comment-16671965 ] Tim Allison commented on TIKA-2750: --- I just attached the output of counting pairs of "mime" and

[jira] [Updated] (TIKA-2750) Update regression corpus

2018-11-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2750: -- Attachment: CC-MAIN-2018-39-mimes-v-detected.zip > Update regression corpus >

[jira] [Created] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)
Hans Brende created TIKA-2771: - Summary: enableInputFilter() wrecks charset detection for some short html documents Key: TIKA-2771 URL: https://issues.apache.org/jira/browse/TIKA-2771 Project: Tika

Using code coverage metrics to help winnow our large scale regression corpus

2018-11-01 Thread Tim Allison
Rohan and Tobias, This isn't quite a question about fuzzing, but I suspect you might be able to help with this: https://issues.apache.org/jira/browse/TIKA-2750?focusedCommentId=16671472&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16671472 Cheers, Tim

[jira] [Commented] (TIKA-2750) Update regression corpus

2018-11-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671472#comment-16671472 ] Tim Allison commented on TIKA-2750: --- I'd like to remove "boring" and/or basically duplicative documents

[jira] [Commented] (TIKA-2760) LinkContentHandler does not report hyperlinks

2018-11-01 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671452#comment-16671452 ] Markus Jelsma commented on TIKA-2760: - Hello [~davemeikle], Of course! I cannot understand why I did

[jira] [Closed] (TIKA-2760) LinkContentHandler does not report hyperlinks

2018-11-01 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed TIKA-2760. --- > LinkContentHandler does not report hyperlinks > - > >

[jira] [Resolved] (TIKA-2760) LinkContentHandler does not report hyperlinks

2018-11-01 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved TIKA-2760. - Resolution: Not A Problem > LinkContentHandler does not report hyperlinks >

[jira] [Closed] (TIKA-2770) Convert EnviHeader "map info" from UTM to LatLon

2018-11-01 Thread Kristen Cheung (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kristen Cheung closed TIKA-2770. Resolution: Fixed > Convert EnviHeader "map info" from UTM to LatLon >

[jira] [Created] (TIKA-2770) Convert EnviHeader "map info" from UTM to LatLon

2018-11-01 Thread Kristen Cheung (JIRA)
Kristen Cheung created TIKA-2770: Summary: Convert EnviHeader "map info" from UTM to LatLon Key: TIKA-2770 URL: https://issues.apache.org/jira/browse/TIKA-2770 Project: Tika Issue Type:

[jira] [Commented] (TIKA-2760) LinkContentHandler does not report hyperlinks

2018-11-01 Thread Dave Meikle (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671189#comment-16671189 ] Dave Meikle commented on TIKA-2760: --- Hi [~markus17], Looking at the Nutch code I can see that