[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086045#comment-16086045 ] Tim Allison commented on TIKA-2428: --- bq. Our algorithm for recovering deleted files often recovers

[jira] [Comment Edited] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085640#comment-16085640 ] Tim Allison edited comment on TIKA-2428 at 7/13/17 4:48 PM: Thank you,

[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Luis Filipe Nassif (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085965#comment-16085965 ] Luis Filipe Nassif commented on TIKA-2428: -- bq. If bytes skipped is more than requested, we've hit

[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Luis Filipe Nassif (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085665#comment-16085665 ] Luis Filipe Nassif commented on TIKA-2428: -- I just put the stacktrace, you found the cause. But

[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085742#comment-16085742 ] Tim Allison commented on TIKA-2428: --- bq. But I understood it can skip more than are remaining in the

[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Luis Filipe Nassif (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085852#comment-16085852 ] Luis Filipe Nassif commented on TIKA-2428: -- Strange, I don't think the javadocs allow that. Maybe

[jira] [Comment Edited] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085742#comment-16085742 ] Tim Allison edited comment on TIKA-2428 at 7/13/17 3:14 PM: bq. But I

[jira] [Commented] (TIKA-2042) MBOX file detected wrongly as text/html

2017-07-13 Thread Matthew Caruana Galizia (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085709#comment-16085709 ] Matthew Caruana Galizia commented on TIKA-2042: --- I'd like to ask for this issue to be

[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085730#comment-16085730 ] Tim Allison commented on TIKA-2428: --- I wonder why I didn't see this in our common crawl/govdocs1 corpus?

[jira] [Commented] (TIKA-2042) MBOX file detected wrongly as text/html

2017-07-13 Thread Matthew Caruana Galizia (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085842#comment-16085842 ] Matthew Caruana Galizia commented on TIKA-2042: --- [~gagravarr] thank you - that fixes the

[jira] [Updated] (TIKA-2042) MBOX file detected wrongly as text/html

2017-07-13 Thread Matthew Caruana Galizia (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Caruana Galizia updated TIKA-2042: -- Attachment: mbox_email_section.txt Sample of one of the message sections from

[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085881#comment-16085881 ] Tim Allison commented on TIKA-2428: --- bq. Maybe there is an issue with IOUtils.skipFully() Y, completely.

[jira] [Commented] (TIKA-2042) MBOX file detected wrongly as text/html

2017-07-13 Thread Luis Filipe Nassif (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085880#comment-16085880 ] Luis Filipe Nassif commented on TIKA-2042: -- This problem is very very recurrent. I think we should

[jira] [Updated] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2017-07-13 Thread Matthew Caruana Galizia (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Caruana Galizia updated TIKA-879: - Attachment: mbox_email_section.txt As described in TIKA-2042, the attached file

RE: [jira] [Created] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Allison, Timothy B.
> Sorry [~talli...@apache.org]! Just now having time to test against our
> forensic test corpus...

Let us know what else you find...perhaps give tika-eval a try on a sample of docs? Between this and TIKA-2042, at least I won't have time to forget my signing key's password. :) Onward to 1.17!

[jira] [Comment Edited] (TIKA-2042) MBOX file detected wrongly as text/html

2017-07-13 Thread Matthew Caruana Galizia (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085847#comment-16085847 ] Matthew Caruana Galizia edited comment on TIKA-2042 at 7/13/17 3:13 PM:

[jira] [Commented] (TIKA-2042) MBOX file detected wrongly as text/html

2017-07-13 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085747#comment-16085747 ] Nick Burch commented on TIKA-2042: -- [~mcaruanagalizia] I've added some more patterns in

[jira] [Updated] (TIKA-2042) MBOX file detected wrongly as text/html

2017-07-13 Thread Matthew Caruana Galizia (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Caruana Galizia updated TIKA-2042: -- Attachment: mbox_header.txt Header attached with identifying information

[jira] [Comment Edited] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085742#comment-16085742 ] Tim Allison edited comment on TIKA-2428 at 7/13/17 2:12 PM: bq. But I

[jira] [Commented] (TIKA-2042) MBOX file detected wrongly as text/html

2017-07-13 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085838#comment-16085838 ] Hudson commented on TIKA-2042: -- FAILURE: Integrated in Jenkins build Tika-trunk #1331 (See

[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085863#comment-16085863 ] Tim Allison commented on TIKA-2428: --- https://bz.apache.org/bugzilla/show_bug.cgi?id=61294 > EMFParser

[jira] [Comment Edited] (TIKA-2042) MBOX file detected wrongly as text/html

2017-07-13 Thread Matthew Caruana Galizia (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085709#comment-16085709 ] Matthew Caruana Galizia edited comment on TIKA-2042 at 7/13/17 2:22 PM:

[jira] [Created] (TIKA-2429) Upgrade to POI 3.17-beta2 when available

2017-07-13 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2429:
    Summary: Upgrade to POI 3.17-beta2 when available
    Key: TIKA-2429
    URL: https://issues.apache.org/jira/browse/TIKA-2429
    Project: Tika
    Issue Type: Improvement

[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085877#comment-16085877 ] Tim Allison commented on TIKA-2428: --- bq. I don't think the javadocs allow that. I think the javadocs

[jira] [Commented] (TIKA-2042) MBOX file detected wrongly as text/html

2017-07-13 Thread Luis Filipe Nassif (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085883#comment-16085883 ] Luis Filipe Nassif commented on TIKA-2042: -- See Tika-879. Looks like widening the magic search

[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Luis Filipe Nassif (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086318#comment-16086318 ] Luis Filipe Nassif commented on TIKA-2428: -- That would be very nice! > EMFParser loops forever

[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086436#comment-16086436 ] Tim Allison commented on TIKA-2428: --- https://bz.apache.org/bugzilla/show_bug.cgi?id=61295 I suspect

[jira] [Created] (TIKA-2430) Add at least dev test capability to run Tika against corrupted files in our test suite

2017-07-13 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2430:
    Summary: Add at least dev test capability to run Tika against corrupted files in our test suite
    Key: TIKA-2430
    URL: https://issues.apache.org/jira/browse/TIKA-2430

[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085979#comment-16085979 ] Tim Allison commented on TIKA-2428: --- Sorry. I misunderstood. Right. That's my belief. > EMFParser

[jira] [Comment Edited] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085640#comment-16085640 ] Tim Allison edited comment on TIKA-2428 at 7/13/17 12:34 PM: - Thank you,

[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085640#comment-16085640 ] Tim Allison commented on TIKA-2428: --- Thank you, [~lfcnassif], for reporting this and finding the cause.