[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-06-02 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545383#comment-17545383 ] Tim Allison commented on TIKA-3710: --- Thank you, [~lfcnassif]! > HTML document detected incorrect as

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-06-01 Thread Luís Filipe Nassif (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545156#comment-17545156 ] Luís Filipe Nassif commented on TIKA-3710: -- Seems good to me [~tallison] ! > HTML document

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-06-01 Thread Hudson (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545044#comment-17545044 ] Hudson commented on TIKA-3710: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #624 (See

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-06-01 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17544964#comment-17544964 ] Tim Allison commented on TIKA-3710: --- I just committed and pushed this. Please let me know if there are

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539607#comment-17539607 ] Tim Allison commented on TIKA-3710: --- The current main block is 40, which is intentionally below RFC822.

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539594#comment-17539594 ] Nick Burch commented on TIKA-3710: -- As a "normal" html file wouldn't start with these snippets, and

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539590#comment-17539590 ] Tim Allison commented on TIKA-3710: --- Sounds good. What do you think of breaking those out into a higher

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539582#comment-17539582 ] Nick Burch commented on TIKA-3710: -- I was thinking we'd do (open)h1(close) or (open)h1(space) to cover

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539580#comment-17539580 ] Tim Allison commented on TIKA-3710: --- This works on the test file: {noformat}

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539574#comment-17539574 ] Tim Allison commented on TIKA-3710: --- Sorry, that comment must have referred to the patterns in that

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-18 Thread Sam Stephens (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539054#comment-17539054 ] Sam Stephens commented on TIKA-3710: {quote}The h1 isn't quite as unique as we might like, and maybe

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-18 Thread Sam Stephens (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539051#comment-17539051 ] Sam Stephens commented on TIKA-3710: Is it valid for a message/rfc822 message to have a bunch of

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-18 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538974#comment-17538974 ] Tim Allison commented on TIKA-3710: --- The hiccup is this point in the mimetypes.xml file. {noformat}

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-18 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538963#comment-17538963 ] Tim Allison commented on TIKA-3710: --- Thank you, [~nick]. I was being imprecise on {{h1}}, we actually do

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-18 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538896#comment-17538896 ] Nick Burch commented on TIKA-3710: -- The h1 isn't quite as unique as we might like, and maybe not as good

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-18 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538885#comment-17538885 ] Tim Allison commented on TIKA-3710: --- As I look at our mime type for html, we do include {{h1}} at offset

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-17 Thread Sam Stephens (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538524#comment-17538524 ] Sam Stephens commented on TIKA-3710: Note that I exclude org.apache.tika.parser.mail.RFC822Parser as a

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-04-01 Thread Sam Stephens (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516142#comment-17516142 ] Sam Stephens commented on TIKA-3710: The HTML document is exactly what you see there; these documents

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-04-01 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515921#comment-17515921 ] Tim Allison commented on TIKA-3710: --- Did the original html file actually have an html header? Or did it