[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16573561#comment-16573561
]
Tim Allison commented on TIKA-2673:
---
That'd be great, [~wastl-nagel]! My second crawl only pulled in
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16573406#comment-16573406
]
Sebastian Nagel commented on TIKA-2673:
---
Hi [~talli...@apache.org], I'm also about to fetch these
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16573355#comment-16573355
]
Tim Allison commented on TIKA-2673:
---
I tried to download the URLs with Nutch, and then I 'dumped' the
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571207#comment-16571207
]
Gerard Bouchar commented on TIKA-2673:
--
Yes, the pages for which fetching failed are not included in
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570587#comment-16570587
]
Tim Allison commented on TIKA-2673:
---
[~gbouchar], on the evaluation, it looks like 3 of the files have
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569913#comment-16569913
]
Gerard Bouchar commented on TIKA-2673:
--
[~talli...@apache.org], thank you very much for merging my
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568567#comment-16568567
]
Hudson commented on TIKA-2673:
--
SUCCESS: Integrated in Jenkins build tika-branch-1x #67 (See
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568549#comment-16568549
]
Hudson commented on TIKA-2673:
--
SUCCESS: Integrated in Jenkins build Tika-trunk #1536 (See
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568465#comment-16568465
]
Hudson commented on TIKA-2673:
--
UNSTABLE: Integrated in Jenkins build tika-2.x-windows #292 (See
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568398#comment-16568398
]
Tim Allison commented on TIKA-2673:
---
[~gbouchar], I'm sorry if I missed it in the above, but would you
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546231#comment-16546231
]
Sebastian Nagel commented on TIKA-2673:
---
Ok, this should set the declared encoding (after a look
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546190#comment-16546190
]
Gerard Bouchar commented on TIKA-2673:
--
If anyone wants to replicate the experiment with a different
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546162#comment-16546162
]
Gerard Bouchar commented on TIKA-2673:
--
[~wastl-nagel]: I used our internal fork of Nutch, with
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545135#comment-16545135
]
Sebastian Nagel commented on TIKA-2673:
---
Hi [~gbouchar], one question: did you evaluate Tika's
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542763#comment-16542763
]
Gerard Bouchar commented on TIKA-2673:
--
I pushed my data to a branch on GitHub:
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542016#comment-16542016
]
Tim Allison commented on TIKA-2673:
---
W00t! Thank you for your evaluation, [~gbouchar]! Would you be
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541906#comment-16541906
]
Gerard Bouchar commented on TIKA-2673:
--
I made a pull request on GitHub:
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541307#comment-16541307
]
Gerard Bouchar commented on TIKA-2673:
--
[~talli...@apache.org]: great, thank you very much! Of
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535147#comment-16535147
]
Hudson commented on TIKA-2673:
--
SUCCESS: Integrated in Jenkins build tika-branch-1x #56 (See
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535145#comment-16535145
]
Hudson commented on TIKA-2673:
--
SUCCESS: Integrated in Jenkins build Tika-trunk #1517 (See
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535130#comment-16535130
]
Hudson commented on TIKA-2673:
--
UNSTABLE: Integrated in Jenkins build tika-2.x-windows #282 (See
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535041#comment-16535041
]
Tim Allison commented on TIKA-2673:
---
I've added this to both 'master' and 'branch_1x'. Let me know if
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534992#comment-16534992
]
Tim Allison commented on TIKA-2673:
---
[~gbouchar], thank you for contributing this! I won't have time to
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520281#comment-16520281
]
Gerard Bouchar commented on TIKA-2673:
--
> If you'd like to contribute a StrictHTMLEncodingDetector,
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520137#comment-16520137
]
Gerard Bouchar commented on TIKA-2673:
--
I think at least the following tests should be included, and
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520131#comment-16520131
]
Gerard Bouchar commented on TIKA-2673:
--
[This blog post by the WHATWG on character
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520127#comment-16520127
]
Gerard Bouchar commented on TIKA-2673:
--
Another part of the specification I think we should respect
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520117#comment-16520117
]
Gerard Bouchar commented on TIKA-2673:
--
[~talli...@apache.org] I think at least the utf16 test
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519802#comment-16519802
]
Hudson commented on TIKA-2673:
--
SUCCESS: Integrated in Jenkins build Tika-trunk #1512 (See
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519793#comment-16519793
]
Hudson commented on TIKA-2673:
--
UNSTABLE: Integrated in Jenkins build tika-2.x-windows #277 (See
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519748#comment-16519748
]
Hudson commented on TIKA-2673:
--
FAILURE: Integrated in Jenkins build tika-branch-1x #50 (See
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519708#comment-16519708
]
Hudson commented on TIKA-2673:
--
FAILURE: Integrated in Jenkins build Tika-trunk #1511 (See
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519704#comment-16519704
]
Hudson commented on TIKA-2673:
--
FAILURE: Integrated in Jenkins build tika-2.x-windows #276 (See
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519679#comment-16519679
]
Tim Allison commented on TIKA-2673:
---
[~gbouchar], thank you for these unit tests! I've added them and
[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519100#comment-16519100
]
Gerard Bouchar commented on TIKA-2673:
--
Another important part of the specification that is ignored