[
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545383#comment-17545383
]
Tim Allison commented on TIKA-3710:
---
Thank you, [~lfcnassif]!
> HTML document detected incorrect as
[
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545156#comment-17545156
]
Luís Filipe Nassif commented on TIKA-3710:
--
Seems good to me [~tallison] !
> HTML document
[
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545044#comment-17545044
]
Hudson commented on TIKA-3710:
--
SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #624 (See
[
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17544964#comment-17544964
]
Tim Allison commented on TIKA-3710:
---
I just committed and pushed this. Please let me know if there are
[
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539607#comment-17539607
]
Tim Allison commented on TIKA-3710:
---
The current main block is 40, which is intentionally below RFC822.
[
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539594#comment-17539594
]
Nick Burch commented on TIKA-3710:
--
As a "normal" html file wouldn't start with these snippets, and
[
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539590#comment-17539590
]
Tim Allison commented on TIKA-3710:
---
Sounds good. What do you think of breaking those out into a higher
[
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539582#comment-17539582
]
Nick Burch commented on TIKA-3710:
--
I was thinking we'd do (open)h1(close) or (open)h1(space) to cover
[
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539580#comment-17539580
]
Tim Allison commented on TIKA-3710:
---
This works on the test file:
{noformat}
[
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539574#comment-17539574
]
Tim Allison commented on TIKA-3710:
---
Sorry, that comment must have referred to the patterns in that
[
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539054#comment-17539054
]
Sam Stephens commented on TIKA-3710:
{quote}The h1 isn't quite as unique as we might like, and maybe
[
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539051#comment-17539051
]
Sam Stephens commented on TIKA-3710:
Is it valid for a message/rfc822 message to have a bunch of
[
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538974#comment-17538974
]
Tim Allison commented on TIKA-3710:
---
The hiccup is this point in the mimetypes.xml file.
{noformat}
[
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538963#comment-17538963
]
Tim Allison commented on TIKA-3710:
---
Thank you, [~nick]. I was being imprecise on {{h1}}, we actually do
[
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538896#comment-17538896
]
Nick Burch commented on TIKA-3710:
--
The h1 isn't quite as unique as we might like, and maybe not as good
[
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538885#comment-17538885
]
Tim Allison commented on TIKA-3710:
---
As I look at our mime type for html, we do include {{h1}} at offset
[
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538524#comment-17538524
]
Sam Stephens commented on TIKA-3710:
Note that I exclude org.apache.tika.parser.mail.RFC822Parser as a
[
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516142#comment-17516142
]
Sam Stephens commented on TIKA-3710:
The HTML document is exactly what you see there; these documents
[
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515921#comment-17515921
]
Tim Allison commented on TIKA-3710:
---
Did the original html file actually have an html header? Or did it