[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-08-08 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573561#comment-16573561
 ] 

Tim Allison commented on TIKA-2673:
---

That'd be great, [~wastl-nagel]!  My second crawl only pulled in 48329 files.  
I trust that you know how to configure Nutch better than I do!

If you'd be willing to scp the raw {{dumped}} files or segments to our vm, I'll 
grant you access.

> HtmlEncodingDetector doesn't follow the specification
> -
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Assignee: Tim Allison
>Priority: Major
> Fix For: 1.19, 2.0.0
>
> Attachments: HtmlEncodingDetectorTest.java, 
> StrictHtmlEncodingDetector.tar.gz, image-2018-07-13-11-28-16-657.png
>
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather 
> the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails at 
> detecting the right charset.
> I am attaching the test cases to this issue: 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-08-08 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573406#comment-16573406
 ] 

Sebastian Nagel commented on TIKA-2673:
---

Hi [~talli...@apache.org], I'm also about to fetch these URLs/pages (30% done 
right now). I'll prepare Nutch segments and also WARC files.



[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-08-08 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573355#comment-16573355
 ] 

Tim Allison commented on TIKA-2673:
---

I tried to download the URLs with Nutch, and then I 'dumped' the crawled data 
to literal files, but I only got ~32k files.  It looks like I failed to configure 
PKIX correctly for HTTPS; I will try again.



[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-08-07 Thread Gerard Bouchar (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16571207#comment-16571207
 ] 

Gerard Bouchar commented on TIKA-2673:
--

Yes, the pages for which fetching failed are not included in the non-chrome 
files. The analysis is based on the pages that were successfully fetched and 
parsed with all the strategies. (When an error was thrown during fetching in 
Chrome, the charset is marked as "unknown", but the URL is still included.)

If you want to redo the experiment yourself, I would advise taking the 200k 
URLs and keeping only the ones for which fetching and parsing succeeded 
and the resulting document was actual HTML.



[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-08-06 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16570587#comment-16570587
 ] 

Tim Allison commented on TIKA-2673:
---

[~gbouchar], on the evaluation, it looks like three of the files contain the same 
105,956 URLs, but {{segment_big_chrome_charsets.jsonl.xz}} has ~200k...  
Should I ignore that one?  Second point on the evaluation: I really like how 
you classified "correct", "similar" and "wrong"...this continues to be an 
ongoing pain, but it is necessary.

bq. I think most people want an encoding detector that "just works" by default.
Yes, I agree.  My thinking is that if we migrate to the newer detector, we'd 
specify it correctly in the SPI file as we do now with html->universal->icu4j.  
That would then be "just works" by default.  Until that point, though, users 
would have to specify the newer detector, and we can show them that they ought 
to include icu4j after the newer detector... Let me think about this some more.
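
To make the ordering idea concrete, here is a minimal, self-contained sketch of a detector chain in the spirit of html->universal->icu4j. It is not Tika's actual API: the {{Detector}} interface, the {{DetectorChain}} class and the "first non-null answer wins" rule are assumptions made for illustration only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: these names are hypothetical, not Tika classes.
class DetectorChain {

    /** Hypothetical stand-in for an encoding detector; returns null when undecided. */
    interface Detector {
        Charset detect(byte[] head);
    }

    /** Asks each detector in order and returns the first non-null answer. */
    static Charset detect(List<Detector> chain, byte[] head, Charset fallback) {
        for (Detector detector : chain) {
            Charset charset = detector.detect(head);
            if (charset != null) {
                return charset;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // A markup-based detector that found no declaration, followed by a
        // statistical detector that always has an opinion.
        Detector markup = bytes -> null;
        Detector statistical = bytes -> StandardCharsets.ISO_8859_1;
        byte[] head = "<html><body>text</body></html>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(Arrays.asList(markup, statistical), head, StandardCharsets.UTF_8));
    }
}
```

Under this assumption the order matters: a statistical detector placed first would always answer, so the more precise markup-based detector has to come before it.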

bq.  I can make a pull request for a separate encoding detector using only the 
BOM. 
I don't feel strongly about this.  Let's wait to see if there's a need.  Thank 
you!



[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-08-06 Thread Gerard Bouchar (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569913#comment-16569913
 ] 

Gerard Bouchar commented on TIKA-2673:
--

[~talli...@apache.org], thank you very much for merging my work!

{quote}would you be able to share the urls you used for your study?
{quote}
See [my comment above|#comment-16542763] for the list of URLs I used, together 
with the testing methodology, and the code used to analyze the results.  The 
URL lists are really not big, just a few MB when compressed. You can [download 
the compressed jsonl data 
files|https://github.com/GerardBouchar/tika/tree/TIKA-2673-benchmark/benchmark],
 and then extract the URLs using jq, for instance:
{code}
< segment_big_icu_charsets.jsonl.xz xz -d | jq -r '.url'
{code}



{quote}Please do check that I got the updates right in both master and 
branch_1x (which we'll use for Tika 1.19).{quote}

Everything looks correct to me. 👍


{quote}I made the decision to remove the FullStandardEncodingDetector and the 
StandardICU4JEncodingDetector because we try to keep encoding detectors 
composable. I was actually wondering if we should pull out the BOM detection 
into a separate detector, but let's leave it as is for now.{quote}

OK. I can make a pull request for a separate encoding detector using only the 
BOM. But anyway, I think it would be nice and helpful to provide a pre-composed 
charset detector that includes all the required steps. I think most people want 
an encoding detector that "just works" by default. 
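
As a rough illustration of what a BOM-only detector could look like, here is a minimal sketch. It is not the code from the pull request: the {{BomSniffer}} class and its {{sniff}} method are hypothetical names, and it only recognizes the three byte order marks for UTF-8, UTF-16BE and UTF-16LE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: BomSniffer is a hypothetical name, not a Tika class.
class BomSniffer {

    /** Returns the charset indicated by a leading byte order mark, or null if there is none. */
    static Charset sniff(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null; // no BOM: let the next detector in the chain decide
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'p', '>'};
        System.out.println(sniff(withBom));
        System.out.println(sniff(new byte[] {'<', 'p', '>'}));
    }
}
```

Returning null when no BOM is present keeps such a detector composable: it stays silent and leaves the decision to whatever comes next.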



[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-08-03 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568567#comment-16568567
 ] 

Hudson commented on TIKA-2673:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #67 (See 
[https://builds.apache.org/job/tika-branch-1x/67/])
TIKA-2673 -- add StandardHtmlEncodingDetector via Gerard Bouchar (tallison: 
[https://github.com/apache/tika/commit/6badaead79e3350414536a5e4972871f66e97e90])
* (delete) 
tika-parsers/src/main/java/org/apache/tika/parser/html/StrictHtmlEncodingDetector.java
* (add) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/charsets/XUserDefinedCharset.java
* (add) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardHtmlEncodingDetector.java
* (add) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/CharsetDetectionResult.java
* (add) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/MetaProcessor.java
* (add) 
tika-parsers/src/test/java/org/apache/tika/parser/html/StandardHtmlEncodingDetectorTest.java
* (delete) 
tika-parsers/src/test/java/org/apache/tika/parser/html/StrictHtmlEncodingDetectorTest.java
* (add) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/CharsetAliases.java
* (add) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/PreScanner.java
* (add) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/charsets/ReplacementCharset.java
TIKA-2673 -- fix forbidden-apis failure and retro-fit for branch_1x (tallison: 
[https://github.com/apache/tika/commit/4475b726db9e81d2daaddc2d74908141540301da])
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/html/StandardHtmlEncodingDetectorTest.java
* (edit) tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java




[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-08-03 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568549#comment-16568549
 ] 

Hudson commented on TIKA-2673:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1536 (See 
[https://builds.apache.org/job/Tika-trunk/1536/])
TIKA-2673 Fix race condition in CharsetAliases (gbouchar: 
[https://github.com/apache/tika/commit/7d96565e25b3d607b91f9384e9df48b235423d0b])
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/CharsetAliases.java
TIKA-2673 Remove wildcard imports (gbouchar: 
[https://github.com/apache/tika/commit/82a1c61d628e604663b3b73326d44fbf57177af3])
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/CharsetDetectionResult.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/charsets/XUserDefinedCharset.java
TIKA-2673 PreScanner: use read() instead of skip(long) (gbouchar: 
[https://github.com/apache/tika/commit/c27f53b8cb4918b957c782e3c856000f105c6d46])
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/PreScanner.java
TIKA-2673 Make the read limit in StandardHtmlEncodingDetector (gbouchar: 
[https://github.com/apache/tika/commit/e7cda261f6aa2eb0102a8ffb53e454a4f3655bdd])
* (add) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/FullStandardEncodingDetector.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardIcu4JEncodingDetector.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardHtmlEncodingDetector.java
TIKA-2673 -- small modifications (tallison: 
[https://github.com/apache/tika/commit/f8f5e23841d23cfcaa13bfba6dccf7f44f33fdd5])
* (delete) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/FullStandardEncodingDetector.java
* (delete) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardIcu4JEncodingDetector.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/PreScanner.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardHtmlEncodingDetector.java
TIKA-2673 -- fix forbidden-apis failures (tallison: 
[https://github.com/apache/tika/commit/f8a8447550afcff941700375774ad3cb92db4d8b])
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/html/StandardHtmlEncodingDetectorTest.java




[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-08-03 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568465#comment-16568465
 ] 

Hudson commented on TIKA-2673:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #292 (See 
[https://builds.apache.org/job/tika-2.x-windows/292/])
TIKA-2673 Fix race condition in CharsetAliases (gbouchar: rev 
7d96565e25b3d607b91f9384e9df48b235423d0b)
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/CharsetAliases.java
TIKA-2673 Remove wildcard imports (gbouchar: rev 
82a1c61d628e604663b3b73326d44fbf57177af3)
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/charsets/XUserDefinedCharset.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/CharsetDetectionResult.java
TIKA-2673 PreScanner: use read() instead of skip(long) (gbouchar: rev 
c27f53b8cb4918b957c782e3c856000f105c6d46)
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/PreScanner.java
TIKA-2673 Make the read limit in StandardHtmlEncodingDetector (gbouchar: rev 
e7cda261f6aa2eb0102a8ffb53e454a4f3655bdd)
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardIcu4JEncodingDetector.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardHtmlEncodingDetector.java
* (add) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/FullStandardEncodingDetector.java
TIKA-2673 -- small modifications (tallison: rev 
f8f5e23841d23cfcaa13bfba6dccf7f44f33fdd5)
* (delete) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/FullStandardEncodingDetector.java
* (delete) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardIcu4JEncodingDetector.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardHtmlEncodingDetector.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/PreScanner.java
TIKA-2673 -- fix forbidden-apis failures (tallison: rev 
f8a8447550afcff941700375774ad3cb92db4d8b)
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/html/StandardHtmlEncodingDetectorTest.java




[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-08-03 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568398#comment-16568398
 ] 

Tim Allison commented on TIKA-2673:
---

[~gbouchar], I'm sorry if I missed it in the above, but would you be able to 
share the urls you used for your study, or...better yet... scp/sftp the files 
to our regression vm?



[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-07-17 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546231#comment-16546231
 ] 

Sebastian Nagel commented on TIKA-2673:
---

OK, this should set the declared encoding (after a look into 
o.a.tika.parser.txt.Icu4jEncodingDetector). TIKA-431 deprecates the usage of 
Metadata.CONTENT_ENCODING to hold the charset. Once TIKA-974 is done, one has 
to parse the charset out of Metadata.CONTENT_TYPE. And thanks for the evaluation 
results. I plan to fetch the URLs again to reproduce the test, as I need 
reliable charset detection as a precondition for language detection.



[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-07-17 Thread Gerard Bouchar (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546190#comment-16546190
 ] 

Gerard Bouchar commented on TIKA-2673:
--

If anyone wants to replicate the experiment with a different setting, I cannot 
provide our version of Nutch, but here is [the Node script I used to fetch 
pages in 
Chrome|https://gist.github.com/GerardBouchar/3ef6ff9de8f0f1b06656c0d8275c0757] 
and get the detected charset.



[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-07-17 Thread Gerard Bouchar (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546162#comment-16546162
 ] 

Gerard Bouchar commented on TIKA-2673:
--

[~wastl-nagel]: I used our internal fork of Nutch, with Tika's ICU detector 
(org.apache.tika.parser.txt.Icu4jEncodingDetector). I used Tika's metadata to 
set the declared charset (Metadata.CONTENT_ENCODING), extracting the charset 
from the HTTP Content-Type header with the following regex: 
_charset=\\s*\"?([^\\s;\"]*)_

(By the way, this is a strange way to do it, as the HTTP Content-Encoding header 
is meant to declare a compression algorithm, not a charset.)

 

I tested it that way because this is how our internal crawler worked, and I 
wanted to compare new results to our current baseline.
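
For readers who want to try the regex above, here is a minimal, self-contained sketch of the extraction step. The {{ContentTypeCharset}} class and its {{extract}} method are hypothetical names and not code from Nutch or Tika; the case-insensitive flag is an added assumption, and the step that stores the result in the metadata is left out.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: class and method names are hypothetical.
class ContentTypeCharset {

    // The regex quoted above, as a Java string literal.
    // CASE_INSENSITIVE is an assumption added here so that "Charset=" also matches.
    private static final Pattern CHARSET =
            Pattern.compile("charset=\\s*\"?([^\\s;\"]*)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset label found in a Content-Type header value, if any. */
    static Optional<String> extract(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        Matcher matcher = CHARSET.matcher(contentType);
        if (matcher.find() && !matcher.group(1).isEmpty()) {
            return Optional.of(matcher.group(1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extract("text/html; charset=ISO-8859-1"));
        System.out.println(extract("text/html; charset=\"utf-8\""));
        System.out.println(extract("text/html"));
    }
}
```

The extracted label is returned as written in the header; mapping it to an actual charset (aliases, unsupported names) would be a separate step.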



[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-07-16 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545135#comment-16545135
 ] 

Sebastian Nagel commented on TIKA-2673:
---

Hi [~gbouchar], one question: did you evaluate Tika's modified version of ICU's 
encoding detector or the original version? And did you call 
setDeclaredEncoding(...)? Background: ICU's charset detector does not take the 
"declared" encoding (from HTTP or HTML metadata) into account - the 
implementation contradicts the documentation 
([setDeclaredEncoding(...)|http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html#setDeclaredEncoding-java.lang.String-]
 vs. 
[CharsetDetector.java|http://source.icu-project.org/repos/icu/trunk/icu4j/main/classes/core/src/com/ibm/icu/text/CharsetDetector.java]).
 Tika uses a [patched 
version|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java#L320]
 (TIKA-335) of the ICU detector which boosts the declared encoding if it is also 
detected as a possible encoding. That's very important for single-byte encodings, 
e.g., of the ISO-8859-x family. Also note: Nutch uses the original ICU version (and 
also does not set the declared encoding).
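
The idea of boosting a declared encoding can be sketched as follows. This is not the patched CharsetDetector code: the {{Candidate}} class, the {{pick}} method and the boost value of 10 are hypothetical and only illustrate the principle that a declaration tips the balance between candidates the detector already considers possible.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: not the ICU or Tika implementation.
class DeclaredEncodingBoost {

    /** A charset name with the confidence (0-100) a statistical detector gave it. */
    static final class Candidate {
        final String name;
        final int confidence;

        Candidate(String name, int confidence) {
            this.name = name;
            this.confidence = confidence;
        }
    }

    // Hypothetical boost value, chosen for illustration.
    static final int BOOST = 10;

    /** Picks the best candidate, favouring the declared encoding if it is among them. */
    static String pick(List<Candidate> candidates, String declared) {
        String best = null;
        int bestScore = -1;
        for (Candidate candidate : candidates) {
            int score = candidate.confidence;
            if (declared != null && declared.equalsIgnoreCase(candidate.name)) {
                score = Math.min(100, score + BOOST);
            }
            if (score > bestScore) {
                bestScore = score;
                best = candidate.name;
            }
        }
        return best; // null when there are no candidates
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
                new Candidate("ISO-8859-1", 50),
                new Candidate("ISO-8859-15", 45));
        System.out.println(pick(candidates, null));
        System.out.println(pick(candidates, "iso-8859-15"));
    }
}
```

A declared encoding that the detector did not consider possible at all has no effect in this sketch, which mirrors the "if also detected as possible encoding" condition described above.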



[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-07-13 Thread Gerard Bouchar (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542763#comment-16542763
 ] 

Gerard Bouchar commented on TIKA-2673:
--

I pushed my data to a branch on github: 
[https://github.com/GerardBouchar/tika/blob/TIKA-2673-benchmark/benchmark/chrome.ipynb]

This is not very clean, but now you can see which data I used, and what exactly 
I analyzed.



[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-07-12 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542016#comment-16542016
 ] 

Tim Allison commented on TIKA-2673:
---

W00t!  Thank you for your evaluation, [~gbouchar]!  Would you be able to 
re-attach the results or find another way to share them with us?



[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-07-12 Thread Gerard Bouchar (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541906#comment-16541906
 ] 

Gerard Bouchar commented on TIKA-2673:
--

I made a pull request on github: https://github.com/apache/tika/pull/242



[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-07-12 Thread Gerard Bouchar (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541307#comment-16541307
 ] 

Gerard Bouchar commented on TIKA-2673:
--

[~talli...@apache.org]: great, thank you very much! Of course I agree to it 
being merged. I'm sorry for forgetting the license header in the first place.

I have done more work on this over the last few days. I am going to make a pull 
request to include my latest changes.

We have conducted internal testing on this, and have seen great results. We 
selected a random subset of ~100 000 URLs from a Nutch segment, fetched them once 
with Nutch, and parsed them using different strategies. We fetched the same URLs 
using puppeteer (a headless Chrome), and compared the detected charsets. Here 
are the results:

!https://confluence.qwant.ninja/confluence/download/attachments/25790597/image2018-7-11_16-50-32.png?version=1&modificationDate=1531320645751&api=v2!

standard_noparse is a composite detector that chains, in this order, a version 
of my detector that only takes the BOM and HTTP headers into account, the 
existing HtmlEncodingDetector, and Icu4jEncodingDetector.

standard is a composite detector that chains the latest version of my detector 
with Icu4jEncodingDetector.
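
The chaining used by both configurations can be sketched roughly as follows. This is only an illustration and not Tika's API: the `Detector` interface and the `DetectorChain.firstMatch` helper are hypothetical names, and the sketch assumes that a detector returns null when it cannot decide, so that the next detector in the chain gets a chance.

```java
import java.nio.charset.Charset;
import java.util.List;

// Hypothetical stand-in for an encoding detector: returns null when undecided.
interface Detector {
    Charset detect(byte[] input);
}

final class DetectorChain {
    private DetectorChain() {
    }

    // Ask each detector in order; the first non-null answer wins.
    static Charset firstMatch(List<Detector> chain, byte[] input) {
        for (Detector detector : chain) {
            Charset charset = detector.detect(input);
            if (charset != null) {
                return charset;
            }
        }
        // Nobody could decide.
        return null;
    }
}
```

With this shape, a strict detector that only answers when it is sure can be placed first, and a probabilistic detector that always answers can be placed last as the fallback.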

Pages labeled "correct" are those for which Tika detected the same charset as 
Chrome. "similar" means that, although incorrect, the detected charset is close 
to the one detected by Chrome (ISO-8859-1 instead of WINDOWS-1254, for 
instance). "wrong" means that the detected charset was not close to the one 
detected by Chrome.

> HtmlEncodingDetector doesn't follow the specification
> -
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Attachments: HtmlEncodingDetectorTest.java, 
> StrictHtmlEncodingDetector.tar.gz
>
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather 
> the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails at 
> detecting the right charset.
> I am attaching the test cases to this issue: 





[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-07-06 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535147#comment-16535147
 ] 

Hudson commented on TIKA-2673:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #56 (See 
[https://builds.apache.org/job/tika-branch-1x/56/])
TIKA-2673 -- add StrictHtmlEncodingDetector, contributed by Gerard (tallison: 
[https://github.com/apache/tika/commit/525889a4f928d1d448c6aaf6b1ddc19081e07404])
* (add) 
tika-parsers/src/main/java/org/apache/tika/parser/html/StrictHtmlEncodingDetector.java
* (add) 
tika-parsers/src/main/resources/org/apache/tika/parser/html/whatwg-encoding-labels.tsv
* (add) 
tika-parsers/src/test/java/org/apache/tika/parser/html/StrictHtmlEncodingDetectorTest.java


> HtmlEncodingDetector doesn't follow the specification
> -
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Attachments: HtmlEncodingDetectorTest.java, 
> StrictHtmlEncodingDetector.tar.gz
>
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather 
> the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails at 
> detecting the right charset.
> I am attaching the test cases to this issue: 





[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-07-06 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535145#comment-16535145
 ] 

Hudson commented on TIKA-2673:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1517 (See 
[https://builds.apache.org/job/Tika-trunk/1517/])
TIKA-2673 -- add StrictHtmlEncodingDetector, contributed by Gerard (tallison: 
[https://github.com/apache/tika/commit/790c1248207371e6cb2a3e7a1ec3a021503ec7a4])
* (add) 
tika-parsers/src/main/java/org/apache/tika/parser/html/StrictHtmlEncodingDetector.java
* (add) 
tika-parsers/src/main/resources/org/apache/tika/parser/html/whatwg-encoding-labels.tsv
* (add) 
tika-parsers/src/test/java/org/apache/tika/parser/html/StrictHtmlEncodingDetectorTest.java


> HtmlEncodingDetector doesn't follow the specification
> -
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Attachments: HtmlEncodingDetectorTest.java, 
> StrictHtmlEncodingDetector.tar.gz
>
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather 
> the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails at 
> detecting the right charset.
> I am attaching the test cases to this issue: 





[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-07-06 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535130#comment-16535130
 ] 

Hudson commented on TIKA-2673:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #282 (See 
[https://builds.apache.org/job/tika-2.x-windows/282/])
TIKA-2673 -- add StrictHtmlEncodingDetector, contributed by Gerard (tallison: 
rev 790c1248207371e6cb2a3e7a1ec3a021503ec7a4)
* (add) 
tika-parsers/src/main/java/org/apache/tika/parser/html/StrictHtmlEncodingDetector.java
* (add) 
tika-parsers/src/main/resources/org/apache/tika/parser/html/whatwg-encoding-labels.tsv
* (add) 
tika-parsers/src/test/java/org/apache/tika/parser/html/StrictHtmlEncodingDetectorTest.java


> HtmlEncodingDetector doesn't follow the specification
> -
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Attachments: HtmlEncodingDetectorTest.java, 
> StrictHtmlEncodingDetector.tar.gz
>
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather 
> the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails at 
> detecting the right charset.
> I am attaching the test cases to this issue: 





[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-07-06 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535041#comment-16535041
 ] 

Tim Allison commented on TIKA-2673:
---

I've added this to both 'master' and 'branch_1x'.  Let me know if you disagree 
with this or would like to make modifications.

> HtmlEncodingDetector doesn't follow the specification
> -
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Attachments: HtmlEncodingDetectorTest.java, 
> StrictHtmlEncodingDetector.tar.gz
>
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather 
> the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails at 
> detecting the right charset.
> I am attaching the test cases to this issue: 





[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-07-06 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16534992#comment-16534992
 ] 

Tim Allison commented on TIKA-2673:
---

[~gbouchar], thank you for contributing this!  I won't have time to run the 
regression tests any time soon.  Would you be ok if I added your 
StrictHtmlEncodingDetector to Tika now?  Users would then be able to configure 
Tika to use it via tika-config.xml.  If you're ok with this, is it ok if I add 
the Apache Software License 2.0 headers to your main class, test class and .tsv 
files?

 

Thank you, again!

> HtmlEncodingDetector doesn't follow the specification
> -
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Attachments: HtmlEncodingDetectorTest.java, 
> StrictHtmlEncodingDetector.tar.gz
>
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather 
> the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails at 
> detecting the right charset.
> I am attaching the test cases to this issue: 





[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-06-22 Thread Gerard Bouchar (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520281#comment-16520281
 ] 

Gerard Bouchar commented on TIKA-2673:
--

> If you'd like to contribute a StrictHTMLEncodingDetector, we could compare 
> the performance of that with what we have on our 1TB regression corpus.

I agree that testing on real-world data is the key for such a problem. We are 
going to conduct our own testing internally, but more testing can only be 
beneficial.

I am attaching my first attempt at writing a StrictHtmlEncodingDetector for 
tika. The code is still quite messy, but I tried to write a lot of tests, and 
have 99% code coverage. It should take into account user-defined metadata, 
unicode BOM, and meta tags according to the specification. It is meant to be 
used in a composite encoding detector, with an existing probabilistic detector 
such as Icu4jEncodingDetector as fallback.
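
To illustrate the BOM step only, here is a minimal sketch. It is not the code from the attached archive: `BomSniffer` is a hypothetical name, and the sketch assumes that the first bytes of the stream are already available in an array. It returns null when there is no BOM, so that later steps (meta tags, then a fallback detector) can take over.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSniffer {
    private BomSniffer() {
    }

    // Returns the charset signalled by a leading byte order mark, or null if none is present.
    static Charset sniff(byte[] b) {
        if (b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }
}
```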

I would be very curious to see how it performs on your corpus, compared to a 
real modern browser.

 

[^StrictHtmlEncodingDetector.tar.gz]

> HtmlEncodingDetector doesn't follow the specification
> -
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Attachments: HtmlEncodingDetectorTest.java, 
> StrictHtmlEncodingDetector.tar.gz
>
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather 
> the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails at 
> detecting the right charset.
> I am attaching the test cases to this issue: 





[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-06-22 Thread Gerard Bouchar (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520137#comment-16520137
 ] 

Gerard Bouchar commented on TIKA-2673:
--

I think at least the following tests should be included, and pass:

{{
@Test
public void bom() throws IOException {
    // A BOM should have precedence over the meta
    // let assertCharset encode the string in the expected charset
    assertCharset("\ufeff", StandardCharsets.UTF_8);
    assertCharset("\ufeff", StandardCharsets.UTF_16LE);
    assertCharset("\ufeff", StandardCharsets.UTF_16BE);
}

@Test
public void utf16() throws IOException {
    // According to the specification 'If charset is a UTF-16 encoding,
    // then set charset to UTF-8.'
    assertCharset("", StandardCharsets.UTF_8);
}

@Test
public void macintoshEncoding() throws IOException {
    // The spec's 'macintosh' encoding is available in Java as x-MacRoman
    assertCharset("", Charset.forName("x-MacRoman"));
}

@Test
public void iso88591() throws IOException {
    // In the spec, iso-8859-1 is an alias for WINDOWS-1252
    assertWindows1252("");
}
}}

> HtmlEncodingDetector doesn't follow the specification
> -
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Attachments: HtmlEncodingDetectorTest.java
>
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather 
> the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails at 
> detecting the right charset.
> I am attaching the test cases to this issue: 





[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-06-22 Thread Gerard Bouchar (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520131#comment-16520131
 ] 

Gerard Bouchar commented on TIKA-2673:
--

[This blog post by the whatwg on character 
encodings|https://blog.whatwg.org/the-road-to-html-5-character-encoding] is an 
interesting read that explains the gist of the specification and some of the 
reasons motivating it.

[These 
tests|https://www.w3.org/International/tests/repository/html5/the-input-byte-stream/results-basics]
 show how well browsers respect the specification.

> HtmlEncodingDetector doesn't follow the specification
> -
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Attachments: HtmlEncodingDetectorTest.java
>
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather 
> the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails at 
> detecting the right charset.
> I am attaching the test cases to this issue: 





[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-06-22 Thread Gerard Bouchar (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520127#comment-16520127
 ] 

Gerard Bouchar commented on TIKA-2673:
--

Another part of the specification I think we should respect is [character 
encoding names and labels|https://encoding.spec.whatwg.org/#names-and-labels]. 
Several labels are mapped to encodings with different names, and I think 
using the labels in this table makes more sense than using the ones defined by 
Java (which were not meant to be used in HTML, or to be in any way compatible 
with HTML). 

> HtmlEncodingDetector doesn't follow the specification
> -
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Attachments: HtmlEncodingDetectorTest.java
>
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather 
> the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails at 
> detecting the right charset.
> I am attaching the test cases to this issue: 





[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-06-22 Thread Gerard Bouchar (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520117#comment-16520117
 ] 

Gerard Bouchar commented on TIKA-2673:
--

[~talli...@apache.org] I think at least the utf16 test shouldn't be ignored. 
HtmlEncodingDetector does its detection using regular expressions on the byte 
stream decoded as ASCII. So if the file were actually in UTF-16 (an encoding 
that uses at least two bytes per character and is not compatible with ASCII), 
the regular expression wouldn't have matched in the first place. Decoding the 
file as UTF-16 will therefore almost certainly result in garbled text. [The 
specification|https://html.spec.whatwg.org/multipage/parsing.html#the-input-byte-stream] 
was written by people with experience of real-world misuses of character 
encodings on the web, so I think we can confidently trust it concerning various 
edge cases. 
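
The argument can be made concrete with a small sketch. It is neither the existing HtmlEncodingDetector code nor the specification's prescan (which is a byte-level state machine): `MetaCharsetSketch` and its deliberately simplified regular expression are assumptions made for illustration. It shows that a meta tag found by ASCII matching can never come from a file that is really UTF-16, which is why a UTF-16 label found this way is replaced by UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharsetSketch {
    // Simplified pattern for illustration; real markup has many more variations.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_:.-]*)",
            Pattern.CASE_INSENSITIVE);

    private MetaCharsetSketch() {
    }

    static Charset fromMeta(byte[] head) {
        // The tag can only be found if its bytes are ASCII-compatible.
        String ascii = new String(head, StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(ascii);
        if (!m.find()) {
            return null;
        }
        String label = m.group(1).toLowerCase(Locale.ROOT);
        // A UTF-16 label that was matched as ASCII cannot be true: use UTF-8.
        // (Only the "utf-16*" labels are handled in this sketch.)
        if (label.startsWith("utf-16")) {
            return StandardCharsets.UTF_8;
        }
        return Charset.isSupported(label) ? Charset.forName(label) : null;
    }
}
```

Feeding this sketch the same meta tag encoded as real UTF-16LE bytes yields no match at all, because every ASCII character is followed by a zero byte.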

> HtmlEncodingDetector doesn't follow the specification
> -
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Attachments: HtmlEncodingDetectorTest.java
>
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather 
> the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails at 
> detecting the right charset.
> I am attaching the test cases to this issue: 





[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-06-21 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519802#comment-16519802
 ] 

Hudson commented on TIKA-2673:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1512 (See 
[https://builds.apache.org/job/Tika-trunk/1512/])
TIKA-2673 -- unit tests for stricter adherence to spec via Gerard (tallison: 
[https://github.com/apache/tika/commit/5eec28ae0203820364dbcdef58335fd64aeb90ec])
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlEncodingDetectorTest.java


> HtmlEncodingDetector doesn't follow the specification
> -
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Attachments: HtmlEncodingDetectorTest.java
>
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather 
> the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails at 
> detecting the right charset.
> I am attaching the test cases to this issue: 





[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-06-21 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519793#comment-16519793
 ] 

Hudson commented on TIKA-2673:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #277 (See 
[https://builds.apache.org/job/tika-2.x-windows/277/])
TIKA-2673 -- unit tests for stricter adherence to spec via Gerard (tallison: 
rev 5eec28ae0203820364dbcdef58335fd64aeb90ec)
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlEncodingDetectorTest.java


> HtmlEncodingDetector doesn't follow the specification
> -
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Attachments: HtmlEncodingDetectorTest.java
>
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather 
> the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails at 
> detecting the right charset.
> I am attaching the test cases to this issue: 





[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-06-21 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519748#comment-16519748
 ] 

Hudson commented on TIKA-2673:
--

FAILURE: Integrated in Jenkins build tika-branch-1x #50 (See 
[https://builds.apache.org/job/tika-branch-1x/50/])
TIKA-2673 -- unit tests for stricter adherence to spec via Gerard (tallison: 
[https://github.com/apache/tika/commit/df9ed8260c91800baa202a748b0ff3854937ff5f])
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java
* (add) 
tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlEncodingDetectorTest.java
TIKA-2673 -- unit tests for stricter adherence to spec via Gerard (tallison: 
[https://github.com/apache/tika/commit/c6f7b45ae6ace89ee2398f251c97dd23d220355b])
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlEncodingDetectorTest.java


> HtmlEncodingDetector doesn't follow the specification
> -
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Attachments: HtmlEncodingDetectorTest.java
>
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather 
> the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails at 
> detecting the right charset.
> I am attaching the test cases to this issue: 





[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-06-21 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519708#comment-16519708
 ] 

Hudson commented on TIKA-2673:
--

FAILURE: Integrated in Jenkins build Tika-trunk #1511 (See 
[https://builds.apache.org/job/Tika-trunk/1511/])
TIKA-2673 -- unit tests for stricter adherence to spec via Gerard (tallison: 
[https://github.com/apache/tika/commit/b688afa01939a1f32775a7e7f0797a8ea466c612])
* (add) 
tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlEncodingDetectorTest.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java


> HtmlEncodingDetector doesn't follow the specification
> -
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Attachments: HtmlEncodingDetectorTest.java
>
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather 
> the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails at 
> detecting the right charset.
> I am attaching the test cases to this issue: 





[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-06-21 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519704#comment-16519704
 ] 

Hudson commented on TIKA-2673:
--

FAILURE: Integrated in Jenkins build tika-2.x-windows #276 (See 
[https://builds.apache.org/job/tika-2.x-windows/276/])
TIKA-2673 -- unit tests for stricter adherence to spec via Gerard (tallison: 
rev b688afa01939a1f32775a7e7f0797a8ea466c612)
* (add) 
tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlEncodingDetectorTest.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java


> HtmlEncodingDetector doesn't follow the specification
> -
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Attachments: HtmlEncodingDetectorTest.java
>
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather 
> the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails at 
> detecting the right charset.
> I am attaching the test cases to this issue: 





[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-06-21 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519679#comment-16519679
 ] 

Tim Allison commented on TIKA-2673:
---

[~gbouchar], thank you for these unit tests!  I've added them and made the easy 
fixes where I could.  As you know, to do a full parse is non-trivial, and I'd 
like evidence from some corpus that the effort is worth it.  

 

If you'd like to contribute a StrictHTMLEncodingDetector, we could compare the 
performance of that with what we have on our 1TB regression corpus.

 

If you'd like access to our VM either to run your own comparisons or to help us 
curate it and make it more representative of modern websites with diverse 
languages and encodings, let me know.

> HtmlEncodingDetector doesn't follow the specification
> -
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Attachments: HtmlEncodingDetectorTest.java
>
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather 
> the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails at 
> detecting the right charset.
> I am attaching the test cases to this issue: 





[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-06-21 Thread Gerard Bouchar (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519100#comment-16519100
 ] 

Gerard Bouchar commented on TIKA-2673:
--

Another important part of the specification that is ignored by 
HtmlEncodingDetector is [paragraph 4.2 of the encoding 
spec|https://encoding.spec.whatwg.org/#names-and-labels], which specifies the 
labels to use for encodings. HtmlEncodingDetector simply uses 
[Charset.forName|https://docs.oracle.com/javase/8/docs/api/java/nio/charset/Charset.html#forName-java.lang.String-].

The specification maps some labels to DIFFERENT charsets. For instance 
"ISO-8859-1" is mapped to the WINDOWS-1252 encoding. 

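The difference can be illustrated with a small sketch. The table below is only a tiny excerpt of the labels listed in the Encoding Standard, and `WhatwgLabelSketch` is a hypothetical name, not a Tika class. The sketch assumes the JVM provides the windows-1252 charset, which common JDKs do.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class WhatwgLabelSketch {
    // Excerpt only: the table in the Encoding Standard is much longer.
    private static final Map<String, String> LABELS = new HashMap<>();

    static {
        LABELS.put("iso-8859-1", "windows-1252");
        LABELS.put("latin1", "windows-1252");
        LABELS.put("ascii", "windows-1252");
        LABELS.put("windows-1252", "windows-1252");
        LABELS.put("utf-8", "UTF-8");
        LABELS.put("utf8", "UTF-8");
    }

    private WhatwgLabelSketch() {
    }

    // Looks a label up case-insensitively after trimming surrounding whitespace.
    // Returns null for labels that are not in the (excerpted) table.
    static Charset forLabel(String label) {
        String key = label.trim().toLowerCase(Locale.ROOT);
        String name = LABELS.get(key);
        return name == null ? null : Charset.forName(name);
    }
}
```

Here `WhatwgLabelSketch.forLabel("ISO-8859-1")` yields the windows-1252 charset, whereas `Charset.forName("ISO-8859-1")` yields Java's ISO-8859-1 charset.
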
> HtmlEncodingDetector doesn't follow the specification
> -
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Attachments: HtmlEncodingDetectorTest.java
>
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather 
> the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails at 
> detecting the right charset.
> I am attaching the test cases to this issue: 


