[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573561#comment-16573561 ]

Tim Allison commented on TIKA-2673:
-----------------------------------

That'd be great, [~wastl-nagel]! My second crawl only pulled in 48329 files. I trust that you know how to configure Nutch better than I do! If you'd be willing to scp the raw {{dumped}} files or segments to our vm, I'll grant you access.

> HtmlEncodingDetector doesn't follow the specification
> -----------------------------------------------------
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
> Issue Type: Bug
> Reporter: Gerard Bouchar
> Assignee: Tim Allison
> Priority: Major
> Fix For: 1.19, 2.0.0
> Attachments: HtmlEncodingDetectorTest.java, StrictHtmlEncodingDetector.tar.gz, image-2018-07-13-11-28-16-657.png
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where HtmlEncodingDetector differs from the specification, and thus fails at detecting the right charset.
> I am attaching the test cases to this issue:

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573406#comment-16573406 ]

Sebastian Nagel commented on TIKA-2673:
---------------------------------------

Hi [~talli...@apache.org], I'm also about to fetch these URLs/pages (30% done right now). I'll prepare Nutch segments and also WARC files.
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573355#comment-16573355 ]

Tim Allison commented on TIKA-2673:
-----------------------------------

I tried to download the urls with Nutch, and then I 'dumped' the crawled data to literal files, but I only have ~32k. Looks like I failed to configure pkix correctly for https...will try again.
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16571207#comment-16571207 ]

Gerard Bouchar commented on TIKA-2673:
--------------------------------------

Yes, the pages for which fetching failed are not included in the non-chrome files. The analysis is based on the pages that were successfully fetched and parsed with all the strategies. (When an error was thrown during fetching in chrome, the charset is marked as "unknown", but the URL is still included.)

If you want to redo the experiment yourself, I would advise taking the 200k URLs and then keeping only the ones for which fetching and parsing succeeded and the resulting document was actual HTML.
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16570587#comment-16570587 ]

Tim Allison commented on TIKA-2673:
-----------------------------------

[~gbouchar],

On the evaluation, it looks like 3 of the files have the same urls: 105,956, but {{segment_big_chrome_charsets.jsonl.xz}} has ~200k... Should I ignore that one?

Second point on the evaluation, I really like how you classified "correct", "similar" and "wrong"...this continues to be an ongoing pain, but it is necessary.

bq. I think most people want an encoding detector that "just works" by default.

Y, I agree. My thinking is that if we migrate to the newer detector, we'd specify it correctly in the SPI file as we do now with html->universal->icu4j. That would then be "just works" by default. Until that point, though, users would have to specify the newer detector, and we can show them that they ought to include icu4j after the newer detector... Let me think about this some more.

bq. I can make a pull request for a separate encoding detector using only the BOM.

I don't feel strongly about this. Let's wait to see if there's a need.

Thank you!
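The html->universal->icu4j ordering mentioned above can be pictured as a chain in which the first detector that returns an answer wins. The sketch below is only an illustration of that idea under stated assumptions: `DetectorChain` and its nested `Detector` interface are hypothetical names, not Tika's own types, and the byte-array signature is a simplification.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;

final class DetectorChain {
    /** Hypothetical stand-in for an encoding detector: returns null when it cannot decide. */
    interface Detector {
        Charset detect(byte[] bytes);
    }

    private final List<Detector> detectors;

    DetectorChain(Detector... detectors) {
        this.detectors = Arrays.asList(detectors);
    }

    /** Asks each detector in order and returns the first non-null answer, or null if none decides. */
    Charset detect(byte[] bytes) {
        for (Detector d : detectors) {
            Charset c = d.detect(bytes);
            if (c != null) {
                return c;
            }
        }
        return null;
    }
}
```

With this shape, putting a statistical detector last (as suggested above for icu4j) means it only runs when the stricter detectors before it give no answer.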
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569913#comment-16569913 ]

Gerard Bouchar commented on TIKA-2673:
--------------------------------------

[~talli...@apache.org], thank you very much for merging my work!

{quote}would you be able to share the urls you used for your study?{quote}

See [my comment above|#comment-16542763] for the list of URLs I used, together with the testing methodology, and the code used to analyze the results. The URL lists are really not big, just a few MB when compressed. You can [download the compressed jsonl data files|https://github.com/GerardBouchar/tika/tree/TIKA-2673-benchmark/benchmark], and then extract the URLs using jq, for instance:

{code}
< segment_big_icu_charsets.jsonl.xz xz -d | jq -r '.url'
{code}

{quote}Please do check that I got the updates right in both master and branch_1x (which we'll use for Tika 1.19).{quote}

Everything looks correct to me. 👍

{quote}I made the decision to remove the FullStandardEncodingDetector and the StandardICU4JEncodingDetector because we try to keep encoding detectors composable. I was actually wondering if we should pull out the BOM detection into a separate detector, but let's leave it as is for now.{quote}

OK. I can make a pull request for a separate encoding detector using only the BOM. But anyway, I think it would be nice and helpful to provide a pre-composed charset detector that includes all the required steps. I think most people want an encoding detector that "just works" by default.
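For readers wondering what a detector "using only the BOM" would involve, here is a minimal sketch. It checks the three byte order marks covered by the WHATWG Encoding Standard (UTF-8, UTF-16BE, UTF-16LE). The class name `BomSniffer` is hypothetical and this is not the code that was merged into Tika.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSniffer {
    /** Returns the charset announced by a leading byte order mark, or null if there is none. */
    static Charset sniff(byte[] head) {
        // UTF-8 BOM: EF BB BF
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // UTF-16 big-endian BOM: FE FF
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        // UTF-16 little-endian BOM: FF FE
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }
}
```

Returning null for "no BOM" lets such a detector sit at the front of a chain and defer to later detectors.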
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568567#comment-16568567 ]

Hudson commented on TIKA-2673:
------------------------------

SUCCESS: Integrated in Jenkins build tika-branch-1x #67 (See [https://builds.apache.org/job/tika-branch-1x/67/])

TIKA-2673 -- add StandardHtmlEncodingDetector via Gerard Bouchar (tallison: [https://github.com/apache/tika/commit/6badaead79e3350414536a5e4972871f66e97e90])
* (delete) tika-parsers/src/main/java/org/apache/tika/parser/html/StrictHtmlEncodingDetector.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/charsets/XUserDefinedCharset.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardHtmlEncodingDetector.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/CharsetDetectionResult.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/MetaProcessor.java
* (add) tika-parsers/src/test/java/org/apache/tika/parser/html/StandardHtmlEncodingDetectorTest.java
* (delete) tika-parsers/src/test/java/org/apache/tika/parser/html/StrictHtmlEncodingDetectorTest.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/CharsetAliases.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/PreScanner.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/charsets/ReplacementCharset.java

TIKA-2673 -- fix forbidden-apis failure and retro-fit for branch_1x (tallison: [https://github.com/apache/tika/commit/4475b726db9e81d2daaddc2d74908141540301da])
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/html/StandardHtmlEncodingDetectorTest.java
* (edit) tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568549#comment-16568549 ]

Hudson commented on TIKA-2673:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1536 (See [https://builds.apache.org/job/Tika-trunk/1536/])

TIKA-2673 Fix race condition in CharsetAliases (gbouchar: [https://github.com/apache/tika/commit/7d96565e25b3d607b91f9384e9df48b235423d0b])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/CharsetAliases.java

TIKA-2673 Remove wildcard imports (gbouchar: [https://github.com/apache/tika/commit/82a1c61d628e604663b3b73326d44fbf57177af3])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/CharsetDetectionResult.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/charsets/XUserDefinedCharset.java

TIKA-2673 PreScanner: use read() instead of skip(long) (gbouchar: [https://github.com/apache/tika/commit/c27f53b8cb4918b957c782e3c856000f105c6d46])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/PreScanner.java

TIKA-2673 Make the read limit in StandardHtmlEncodingDetector (gbouchar: [https://github.com/apache/tika/commit/e7cda261f6aa2eb0102a8ffb53e454a4f3655bdd])
* (add) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/FullStandardEncodingDetector.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardIcu4JEncodingDetector.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardHtmlEncodingDetector.java

TIKA-2673 -- small modifications (tallison: [https://github.com/apache/tika/commit/f8f5e23841d23cfcaa13bfba6dccf7f44f33fdd5])
* (delete) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/FullStandardEncodingDetector.java
* (delete) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardIcu4JEncodingDetector.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/PreScanner.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardHtmlEncodingDetector.java

TIKA-2673 -- fix forbidden-apis failures (tallison: [https://github.com/apache/tika/commit/f8a8447550afcff941700375774ad3cb92db4d8b])
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/html/StandardHtmlEncodingDetectorTest.java
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568465#comment-16568465 ]

Hudson commented on TIKA-2673:
------------------------------

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #292 (See [https://builds.apache.org/job/tika-2.x-windows/292/])

TIKA-2673 Fix race condition in CharsetAliases (gbouchar: rev 7d96565e25b3d607b91f9384e9df48b235423d0b)
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/CharsetAliases.java

TIKA-2673 Remove wildcard imports (gbouchar: rev 82a1c61d628e604663b3b73326d44fbf57177af3)
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/charsets/XUserDefinedCharset.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/CharsetDetectionResult.java

TIKA-2673 PreScanner: use read() instead of skip(long) (gbouchar: rev c27f53b8cb4918b957c782e3c856000f105c6d46)
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/PreScanner.java

TIKA-2673 Make the read limit in StandardHtmlEncodingDetector (gbouchar: rev e7cda261f6aa2eb0102a8ffb53e454a4f3655bdd)
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardIcu4JEncodingDetector.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardHtmlEncodingDetector.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/FullStandardEncodingDetector.java

TIKA-2673 -- small modifications (tallison: rev f8f5e23841d23cfcaa13bfba6dccf7f44f33fdd5)
* (delete) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/FullStandardEncodingDetector.java
* (delete) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardIcu4JEncodingDetector.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardHtmlEncodingDetector.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/PreScanner.java

TIKA-2673 -- fix forbidden-apis failures (tallison: rev f8a8447550afcff941700375774ad3cb92db4d8b)
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/html/StandardHtmlEncodingDetectorTest.java
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568398#comment-16568398 ]

Tim Allison commented on TIKA-2673:
-----------------------------------

[~gbouchar], I'm sorry if I missed it in the above, but would you be able to share the urls you used for your study, or...better yet... scp/sftp the files to our regression vm?
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546231#comment-16546231 ]

Sebastian Nagel commented on TIKA-2673:
---------------------------------------

Ok, this should set the declared encoding (after a look into o.a.tika.parser.txt.Icu4jEncodingDetector). TIKA-431 deprecates the usage of Metadata.CONTENT_ENCODING to hold the charset. After TIKA-974 is done, one has to parse the charset out of Metadata.CONTENT_TYPE.

And thanks for the evaluation results. I plan to fetch the URLs again to reproduce the test, as I need reliable charset detection as a precondition for language detection.
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546190#comment-16546190 ]

Gerard Bouchar commented on TIKA-2673:
--------------------------------------

If anyone wants to replicate the experiment with a different setting, I cannot provide our version of nutch, but here is [the node script I used to fetch pages in chrome|https://gist.github.com/GerardBouchar/3ef6ff9de8f0f1b06656c0d8275c0757] and get the detected charset.
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546162#comment-16546162 ]

Gerard Bouchar commented on TIKA-2673:
--------------------------------------

[~wastl-nagel]: I used our internal fork of nutch, with Tika's ICU (org.apache.tika.parser.txt.Icu4jEncodingDetector). I used tika's metadata to set the declared charset (Metadata.CONTENT_ENCODING), extracting the charset from the HTTP Content-Type header with the following regex: _charset=\\s*\"?([^\\s;\"]*)_.

(By the way, this is a strange way to do it, as the HTTP header Content-Encoding is not meant to declare a charset, but a compression algorithm.)

I tested it that way because this is how our internal crawler worked, and I wanted to compare new results to our current baseline.
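As an illustration of the regex quoted above, the following sketch applies it to a Content-Type header value. The class name `ContentTypeCharset` is hypothetical, and the case-insensitive flag is an assumption added here; the comment does not say whether the crawler matched case-insensitively.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContentTypeCharset {
    // Regex from the comment above. CASE_INSENSITIVE is an assumption of this sketch.
    private static final Pattern CHARSET =
            Pattern.compile("charset=\\s*\"?([^\\s;\"]*)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset label found in a Content-Type value, if any. */
    static Optional<String> extract(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET.matcher(contentType);
        // The capture group can match the empty string (e.g. "charset=;"), so reject that case.
        if (m.find() && !m.group(1).isEmpty()) {
            return Optional.of(m.group(1));
        }
        return Optional.empty();
    }
}
```

The extracted label is still only a label: it would need to be resolved to an actual charset (and may be unknown or misspelled) before being passed on as the declared encoding.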
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545135#comment-16545135 ]

Sebastian Nagel commented on TIKA-2673:
---------------------------------------

Hi [~gbouchar], one question: did you evaluate Tika's modified version of ICU's encoding detector or the original version? And did you call setDeclaredEncoding(...)?

Background: ICU's charset detector does not take the "declared" encoding (from HTTP or HTML metadata) into account - the implementation contradicts the documentation ([setDeclaredEncoding(...)|http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html#setDeclaredEncoding-java.lang.String-] vs. [CharsetDetector.java|http://source.icu-project.org/repos/icu/trunk/icu4j/main/classes/core/src/com/ibm/icu/text/CharsetDetector.java]). Tika uses a [patched version|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java#L320] (TIKA-335) of the ICU detector which boosts the declared encoding if it is also detected as a possible encoding. That's very important for single-byte encodings, e.g., of the ISO-8859-x family.

Also note: Nutch uses the ICU version (and also does not set the declared encoding).
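To make the "boost the declared encoding if it is also detected" idea concrete, here is a small sketch. It is not Tika's patched CharsetDetector: the class `DeclaredEncodingBoost`, the `Candidate` type, and the boost amount passed by the caller are all hypothetical, chosen only to show why the boost matters when several single-byte encodings score almost the same.

```java
import java.util.List;

final class DeclaredEncodingBoost {
    /** Hypothetical candidate: a charset name with a statistical confidence score. */
    static final class Candidate {
        final String name;
        final int confidence;

        Candidate(String name, int confidence) {
            this.name = name;
            this.confidence = confidence;
        }
    }

    /**
     * Returns the best-scoring candidate name after adding `boost` to the candidate
     * that matches the declared encoding. A declared encoding that was not detected
     * as a candidate at all has no effect. Returns null for an empty candidate list.
     */
    static String pick(List<Candidate> candidates, String declared, int boost) {
        String best = null;
        int bestScore = Integer.MIN_VALUE;
        for (Candidate c : candidates) {
            int score = c.confidence;
            if (declared != null && declared.equalsIgnoreCase(c.name)) {
                score += boost;
            }
            if (score > bestScore) {
                bestScore = score;
                best = c.name;
            }
        }
        return best;
    }
}
```

For example, with candidates ISO-8859-1 (30) and windows-1252 (35), a declared ISO-8859-1 and a boost of 10 tips the result to ISO-8859-1, whereas a declared UTF-8 changes nothing because UTF-8 is not among the candidates.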
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542763#comment-16542763 ]

Gerard Bouchar commented on TIKA-2673:
--------------------------------------

I pushed my data to a branch on github: [https://github.com/GerardBouchar/tika/blob/TIKA-2673-benchmark/benchmark/chrome.ipynb]

This is not very clean, but now you can see which data I used, and what exactly I analyzed.
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542016#comment-16542016 ]

Tim Allison commented on TIKA-2673:
-----------------------------------

W00t! Thank you for your evaluation, [~gbouchar]! Would you be able to re-attach the results or find another way to share them with us?
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541906#comment-16541906 ]

Gerard Bouchar commented on TIKA-2673:
--------------------------------------

I made a pull request on github: https://github.com/apache/tika/pull/242
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541307#comment-16541307 ]

Gerard Bouchar commented on TIKA-2673:
--------------------------------------

[~talli...@apache.org]: great, thank you very much! Of course I agree for it to be merged. I'm sorry for forgetting the license header in the first place.

I have done more work on this in the last few days. I am going to make a pull request to include my latest changes.

We have conducted internal testing on this, and have seen great results. We selected a random subset of ~100 000 URLs from a nutch segment, fetched it once in nutch, and parsed it using different strategies. We fetched the same URLs using puppeteer (a headless chrome), and compared the charsets detected. Here are the results:

!https://confluence.qwant.ninja/confluence/download/attachments/25790597/image2018-7-11_16-50-32.png?version=1&modificationDate=1531320645751&api=v2!

standard_noparse is a composite detector with a version of my detector that just takes into account the BOM and HTTP headers, chained with the existing HtmlEncodingDetector, chained with Icu4JEncodingDetector.

standard is a composite detector with the latest version of my detector, chained with Icu4JEncodingDetector.

Labeled as "correct" are the pages for which chrome and tika detected the same charset. "similar" means that although incorrect, the detected charset is close to the one detected by chrome (ISO-8859-1 instead of WINDOWS-1254, for instance). "wrong" means that the detected charset was not close to the one detected by chrome.
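The three labels described above could be computed along the following lines. This is a sketch under a loud assumption: the thread does not say how "close" was defined in the benchmark, so the `FAMILY` table below is a hypothetical stand-in that groups a few Latin single-byte charsets, using the ISO-8859-1 / WINDOWS-1254 pair from the comment as its only grounded entry. The class name `CharsetVerdict` is likewise hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class CharsetVerdict {
    enum Verdict { CORRECT, SIMILAR, WRONG }

    // Hypothetical grouping of "close" charsets; the benchmark's real table is not given in this thread.
    private static final Map<String, String> FAMILY = new HashMap<>();
    static {
        FAMILY.put("ISO-8859-1", "latin");
        FAMILY.put("WINDOWS-1252", "latin");
        FAMILY.put("WINDOWS-1254", "latin");
    }

    /** Compares the charset reported by chrome (the reference) with the one reported by tika. */
    static Verdict classify(String chrome, String tika) {
        String c = chrome.toUpperCase(Locale.ROOT);
        String t = tika.toUpperCase(Locale.ROOT);
        if (c.equals(t)) {
            return Verdict.CORRECT;
        }
        String family = FAMILY.get(c);
        if (family != null && family.equals(FAMILY.get(t))) {
            return Verdict.SIMILAR;
        }
        return Verdict.WRONG;
    }
}
```

Comparing upper-cased names only catches spelling differences in case; a real comparison would presumably also normalize charset aliases before deciding.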
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535147#comment-16535147 ] Hudson commented on TIKA-2673: -- SUCCESS: Integrated in Jenkins build tika-branch-1x #56 (See [https://builds.apache.org/job/tika-branch-1x/56/]) TIKA-2673 -- add StrictHtmlEncodingDetector, contributed by Gerard (tallison: [https://github.com/apache/tika/commit/525889a4f928d1d448c6aaf6b1ddc19081e07404]) * (add) tika-parsers/src/main/java/org/apache/tika/parser/html/StrictHtmlEncodingDetector.java * (add) tika-parsers/src/main/resources/org/apache/tika/parser/html/whatwg-encoding-labels.tsv * (add) tika-parsers/src/test/java/org/apache/tika/parser/html/StrictHtmlEncodingDetectorTest.java
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535145#comment-16535145 ] Hudson commented on TIKA-2673: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1517 (See [https://builds.apache.org/job/Tika-trunk/1517/]) TIKA-2673 -- add StrictHtmlEncodingDetector, contributed by Gerard (tallison: [https://github.com/apache/tika/commit/790c1248207371e6cb2a3e7a1ec3a021503ec7a4]) * (add) tika-parsers/src/main/java/org/apache/tika/parser/html/StrictHtmlEncodingDetector.java * (add) tika-parsers/src/main/resources/org/apache/tika/parser/html/whatwg-encoding-labels.tsv * (add) tika-parsers/src/test/java/org/apache/tika/parser/html/StrictHtmlEncodingDetectorTest.java
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535130#comment-16535130 ] Hudson commented on TIKA-2673: -- UNSTABLE: Integrated in Jenkins build tika-2.x-windows #282 (See [https://builds.apache.org/job/tika-2.x-windows/282/]) TIKA-2673 -- add StrictHtmlEncodingDetector, contributed by Gerard (tallison: rev 790c1248207371e6cb2a3e7a1ec3a021503ec7a4) * (add) tika-parsers/src/main/java/org/apache/tika/parser/html/StrictHtmlEncodingDetector.java * (add) tika-parsers/src/main/resources/org/apache/tika/parser/html/whatwg-encoding-labels.tsv * (add) tika-parsers/src/test/java/org/apache/tika/parser/html/StrictHtmlEncodingDetectorTest.java
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535041#comment-16535041 ] Tim Allison commented on TIKA-2673: --- I've added this to both 'master' and 'branch_1x'. Let me know if you disagree with this or would like to make modifications.
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16534992#comment-16534992 ] Tim Allison commented on TIKA-2673: --- [~gbouchar], thank you for contributing this! I won't have time to run the regression tests any time soon. Would you be ok if I added your StrictHtmlEncodingDetector to Tika now? Users would then be able to configure Tika to use it via tika-config.xml. If you're ok with this, is it ok if I add the Apache Software License 2.0 headers to your main class, test class and .tsv files? Thank you, again!
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520281#comment-16520281 ] Gerard Bouchar commented on TIKA-2673: --
> If you'd like to contribute a StrictHTMLEncodingDetector, we could compare the performance of that with what we have on our 1TB regression corpus.

I agree that testing on real-world data is key for such a problem. We are going to conduct our own testing internally, but more testing can only be beneficial.

I am attaching my first attempt at writing a StrictHtmlEncodingDetector for Tika. The code is still quite messy, but I tried to write a lot of tests, and have 99% code coverage. It should take into account user-defined metadata, the Unicode BOM, and meta tags, according to the specification. It is meant to be used in a composite encoding detector, with an existing probabilistic detector such as Icu4jEncodingDetector as a fallback.

I would be very curious to see how it performs on your corpus, compared to a real modern browser.

[^StrictHtmlEncodingDetector.tar.gz]
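As an illustration of the BOM step mentioned above, here is a minimal sketch of BOM sniffing. It is not the code from the attached archive; the class and method names (`BomSniffer`, `sniff`) are hypothetical. It only assumes the three byte order marks that the HTML specification recognizes: EF BB BF for UTF-8, FE FF for UTF-16BE and FF FE for UTF-16LE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: returns the charset announced by a leading BOM, or null.
class BomSniffer {
    static Charset sniff(byte[] b) {
        // UTF-8 BOM: EF BB BF
        if (b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // UTF-16 big-endian BOM: FE FF
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        // UTF-16 little-endian BOM: FF FE
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null; // no BOM: let the next step (meta prescan, fallback) decide
    }

    public static void main(String[] args) {
        // Encoding U+FEFF in each charset produces exactly that charset's BOM.
        System.out.println(sniff("\ufeff".getBytes(StandardCharsets.UTF_8)));
        System.out.println(sniff("\ufeff".getBytes(StandardCharsets.UTF_16BE)));
        System.out.println(sniff("\ufeff".getBytes(StandardCharsets.UTF_16LE)));
        System.out.println(sniff("plain".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

Returning null rather than a default keeps the helper usable as one link in a composite detector, where a later detector handles inputs without a BOM.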
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520137#comment-16520137 ] Gerard Bouchar commented on TIKA-2673: --
I think at least the following tests should be included, and pass:
{{
@Test
public void bom() throws IOException {
    // A BOM should have precedence over the meta
    // let assertCharset encode the string in the expected charset
    assertCharset("\ufeff", StandardCharsets.UTF_8);
    assertCharset("\ufeff", StandardCharsets.UTF_16LE);
    assertCharset("\ufeff", StandardCharsets.UTF_16BE);
}

@Test
public void utf16() throws IOException {
    // According to the specification 'If charset is a UTF-16 encoding, then set charset to UTF-8.'
    assertCharset("", StandardCharsets.UTF_8);
}

@Test
public void macintoshEncoding() throws IOException {
    // The spec's "macintosh" label should resolve to the x-MacRoman charset
    assertCharset("", Charset.forName("x-MacRoman"));
}

@Test
public void iso88591() throws IOException {
    // In the spec, iso-8859-1 is an alias for WINDOWS-1252
    assertWindows1252("");
}
}}
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520131#comment-16520131 ] Gerard Bouchar commented on TIKA-2673: -- [This blog post by the WHATWG on character encodings|https://blog.whatwg.org/the-road-to-html-5-character-encoding] is an interesting read that explains the gist of the specification and some of the reasons motivating it. [These tests|https://www.w3.org/International/tests/repository/html5/the-input-byte-stream/results-basics] show how well browsers respect the specification.
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520127#comment-16520127 ] Gerard Bouchar commented on TIKA-2673: -- Another part of the specification I think we should respect is [character encoding names and labels|https://encoding.spec.whatwg.org/#names-and-labels]. Several labels are mapped to charsets with different names, and I think using the labels in this table makes more sense than using the ones defined by Java (which were not meant to be used in HTML, or to be in any way compatible with HTML).
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520117#comment-16520117 ] Gerard Bouchar commented on TIKA-2673: --
[~talli...@apache.org] I think at least the utf16 test shouldn't be ignored. HtmlEncodingDetector does its detection using regular expressions on the byte stream decoded as ASCII. If the file were actually in UTF-16 (an encoding that uses at least two bytes per character and is not compatible with ASCII), the regular expression would not have matched in the first place. So when a meta tag declaring UTF-16 is matched, the file cannot really be UTF-16, and decoding it as UTF-16 will almost certainly result in garbled text.

[The specification|https://html.spec.whatwg.org/multipage/parsing.html#the-input-byte-stream] was written by people with experience of real-world misuses of character encodings on the web, so I think we can confidently trust it concerning the various edge cases.
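The argument above can be checked with a small experiment. The sketch below is not HtmlEncodingDetector's code: the class `MetaCharsetPrescan` and its deliberately simplified regular expression are assumptions made for illustration, and it resolves labels with Charset.forName only to keep the example short. It shows that a meta declaration is found when the bytes are ASCII-compatible, that the same declaration is invisible when the bytes really are UTF-16, and how the specification's rule ('If charset is a UTF-16 encoding, then set charset to UTF-8') would be applied.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified meta prescan; the pattern is NOT the one Tika uses.
class MetaCharsetPrescan {
    private static final Pattern META = Pattern.compile(
            "(?i)<meta[^>]*charset\\s*=\\s*[\"']?([a-z0-9_\\-]+)");

    // Decodes the bytes as ASCII and looks for a declared charset label.
    static String declaredLabel(byte[] bytes) {
        String ascii = new String(bytes, StandardCharsets.US_ASCII);
        Matcher m = META.matcher(ascii);
        return m.find() ? m.group(1) : null;
    }

    static Charset resolve(byte[] bytes) {
        String label = declaredLabel(bytes);
        if (label == null) {
            return null;
        }
        Charset charset;
        try {
            charset = Charset.forName(label); // simplification: no WHATWG label table
        } catch (IllegalArgumentException e) {
            return null; // illegal or unsupported charset name
        }
        // Spec rule: a meta-declared UTF-16 encoding is replaced by UTF-8,
        // because the declaration could only be read from ASCII-compatible bytes.
        if (charset.name().startsWith("UTF-16")) {
            return StandardCharsets.UTF_8;
        }
        return charset;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta charset=\"UTF-16\"></head></html>";
        // ASCII-compatible bytes: the label is found, then overridden to UTF-8.
        System.out.println(resolve(html.getBytes(StandardCharsets.US_ASCII)));
        // Genuine UTF-16 bytes: NUL bytes separate the letters, so nothing matches.
        System.out.println(declaredLabel(html.getBytes(StandardCharsets.UTF_16LE)));
    }
}
```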
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519802#comment-16519802 ] Hudson commented on TIKA-2673: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1512 (See [https://builds.apache.org/job/Tika-trunk/1512/]) TIKA-2673 -- unit tests for stricter adherence to spec via Gerard (tallison: [https://github.com/apache/tika/commit/5eec28ae0203820364dbcdef58335fd64aeb90ec]) * (edit) tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlEncodingDetectorTest.java
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519793#comment-16519793 ] Hudson commented on TIKA-2673: -- UNSTABLE: Integrated in Jenkins build tika-2.x-windows #277 (See [https://builds.apache.org/job/tika-2.x-windows/277/]) TIKA-2673 -- unit tests for stricter adherence to spec via Gerard (tallison: rev 5eec28ae0203820364dbcdef58335fd64aeb90ec) * (edit) tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlEncodingDetectorTest.java
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519748#comment-16519748 ] Hudson commented on TIKA-2673: -- FAILURE: Integrated in Jenkins build tika-branch-1x #50 (See [https://builds.apache.org/job/tika-branch-1x/50/]) TIKA-2673 -- unit tests for stricter adherence to spec via Gerard (tallison: [https://github.com/apache/tika/commit/df9ed8260c91800baa202a748b0ff3854937ff5f]) * (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java * (add) tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlEncodingDetectorTest.java TIKA-2673 -- unit tests for stricter adherence to spec via Gerard (tallison: [https://github.com/apache/tika/commit/c6f7b45ae6ace89ee2398f251c97dd23d220355b]) * (edit) tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlEncodingDetectorTest.java
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519708#comment-16519708 ] Hudson commented on TIKA-2673: -- FAILURE: Integrated in Jenkins build Tika-trunk #1511 (See [https://builds.apache.org/job/Tika-trunk/1511/]) TIKA-2673 -- unit tests for stricter adherence to spec via Gerard (tallison: [https://github.com/apache/tika/commit/b688afa01939a1f32775a7e7f0797a8ea466c612]) * (add) tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlEncodingDetectorTest.java * (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519704#comment-16519704 ] Hudson commented on TIKA-2673: -- FAILURE: Integrated in Jenkins build tika-2.x-windows #276 (See [https://builds.apache.org/job/tika-2.x-windows/276/]) TIKA-2673 -- unit tests for stricter adherence to spec via Gerard (tallison: rev b688afa01939a1f32775a7e7f0797a8ea466c612) * (add) tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlEncodingDetectorTest.java * (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519679#comment-16519679 ] Tim Allison commented on TIKA-2673: --- [~gbouchar], thank you for these unit tests! I've added them and made the easy fixes where I could. As you know, to do a full parse is non-trivial, and I'd like evidence from some corpus that the effort is worth it. If you'd like to contribute a StrictHTMLEncodingDetector, we could compare the performance of that with what we have on our 1TB regression corpus. If you'd like access to our VM either to run your own comparisons or to help us curate it and make it more representative of modern websites with diverse languages and encodings, let me know.
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519100#comment-16519100 ] Gerard Bouchar commented on TIKA-2673: -- Another important part of the specification that is ignored by HtmlEncodingDetector is [paragraph 4.2 of the encoding spec|https://encoding.spec.whatwg.org/#names-and-labels], which specifies the labels (aliases) to use for encodings. HtmlEncodingDetector simply uses [Charset.forName|https://docs.oracle.com/javase/8/docs/api/java/nio/charset/Charset.html#forName-java.lang.String-]. The specification maps some labels to DIFFERENT charsets. For instance, "ISO-8859-1" is mapped to the WINDOWS-1252 encoding.
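To make the difference concrete, here is a minimal sketch comparing Charset.forName with a label lookup in the style of the encoding specification. The class `WhatwgLabels` and its tiny hand-written table are hypothetical; only a handful of labels from the specification's table are included, and a complete implementation would load the full table. Labels are matched case-insensitively after trimming surrounding whitespace (String.trim is used here as an approximation of the specification's ASCII-whitespace stripping).

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical, partial label table in the style of the WHATWG Encoding Standard.
class WhatwgLabels {
    private static final Map<String, String> LABEL_TO_ENCODING = new HashMap<>();
    static {
        // A few of the labels the specification assigns to windows-1252.
        LABEL_TO_ENCODING.put("iso-8859-1", "windows-1252");
        LABEL_TO_ENCODING.put("latin1", "windows-1252");
        LABEL_TO_ENCODING.put("ascii", "windows-1252");
        LABEL_TO_ENCODING.put("us-ascii", "windows-1252");
        LABEL_TO_ENCODING.put("windows-1252", "windows-1252");
        // Two of the labels for UTF-8.
        LABEL_TO_ENCODING.put("utf-8", "UTF-8");
        LABEL_TO_ENCODING.put("utf8", "UTF-8");
    }

    // Returns the charset for a label, or null if the label is not in the table.
    static Charset forLabel(String label) {
        String key = label.trim().toLowerCase(Locale.ROOT);
        String name = LABEL_TO_ENCODING.get(key);
        return name == null ? null : Charset.forName(name);
    }

    public static void main(String[] args) {
        // Java resolves the label literally...
        System.out.println(Charset.forName("ISO-8859-1").name());
        // ...while the table maps the same label to a different charset.
        System.out.println(forLabel(" ISO-8859-1 ").name());
    }
}
```

Returning null for unknown labels, rather than falling back to Charset.forName, keeps Java-only aliases from being accepted as if they were valid HTML labels.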