[jira] [Commented] (TIKA-2582) Tesseract 4.0 includes a FF character by default, breaking parsers
[ https://issues.apache.org/jira/browse/TIKA-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419998#comment-16419998 ]

Tim Allison commented on TIKA-2582:
-----------------------------------

Y, all my fault. Sorry, and thank you!

> Tesseract 4.0 includes a FF character by default, breaking parsers
> ------------------------------------------------------------------
>
>                 Key: TIKA-2582
>                 URL: https://issues.apache.org/jira/browse/TIKA-2582
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17
>            Reporter: Ewan Mellor
>            Priority: Major
>             Fix For: 1.18, 2.0.0
>
>
> Tesseract 4.0 includes a change to use form feed characters to separate pages
> by default in its text output. Previous versions used no separator unless you
> specified the include_page_breaks option.
> This confuses any parser that is not expecting the FF.
> ODFParserTest.testOO2Metadata fails, because it is expecting the output of a
> blank image to be the empty string, but now the FF is there.
> I haven't seen any other failures, but I expect that user code will now see
> either FF or U+FFFD where they are not expecting it (SafeContentHandler
> replaces the FF with U+FFFD when converting to text to XML).
> We should set the appropriate Tesseract options to disable this behavior
> unless explicitly requested by user code, to avoid the change in behavior.
> For reference, the Tesseract change is as follows:
> {quote}commit 2cc531e6bf0288fc8a9ad1c123a252395f00bf56
> Merge: 3bb573ae aa6eb6bd
> Author: zdenop
> Date: Tue Sep 19 08:41:08 2017 +0200
>     Merge pull request #1140 from stweil/pagebreak
>     Remove Tesseract parameter "include_page_breaks" and use FF by default
> commit aa6eb6bd466101a3b89880f87580471a7694359d
> Author: Stefan Weil
> Date: Mon Jun 12 19:42:45 2017 +0200
>     Remove Tesseract parameter "include_page_breaks" and use FF by default
>     Now Tesseract adds a page break (normally form feed) by default.
>     It is still possible to suppress page breaks by setting an empty
>     page_separator.
>     Signed-off-by: Stefan Weil
> {quote}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
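
For readers wondering why the report mentions U+FFFD: form feed (U+000C) is not among the characters permitted by XML 1.0, so a handler that sanitizes text on its way into XML has to drop or replace it. The following is a minimal sketch of that kind of replacement. It is not Tika's SafeContentHandler code; the class and method names (XmlCharSketch, isValidXml10, sanitize) are made up for illustration.

```java
// Illustrative sketch only; NOT the SafeContentHandler implementation.
// All names here are hypothetical.
final class XmlCharSketch {

    // The Char production from the XML 1.0 specification.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replace every code point that XML 1.0 forbids with U+FFFD.
    static String sanitize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp ->
                sb.appendCodePoint(isValidXml10(cp) ? cp : 0xFFFD));
        return sb.toString();
    }

    public static void main(String[] args) {
        // A lone form feed stands in for the blank-image output described above.
        String ocrText = "\f";
        System.out.println(sanitize(ocrText).equals("\uFFFD")); // prints "true"
    }
}
```

Under these assumptions, a test that expects the empty string for a blank image would see a one-character string instead, which matches the ODFParserTest.testOO2Metadata failure described in the report.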
[jira] [Commented] (TIKA-2582) Tesseract 4.0 includes a FF character by default, breaking parsers
[ https://issues.apache.org/jira/browse/TIKA-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419798#comment-16419798 ]

Ewan Mellor commented on TIKA-2582:
-----------------------------------

Build failures were not this change; they were from TIKA-2621 which went through at the same time.
[jira] [Commented] (TIKA-2582) Tesseract 4.0 includes a FF character by default, breaking parsers
[ https://issues.apache.org/jira/browse/TIKA-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419781#comment-16419781 ]

Hudson commented on TIKA-2582:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1463 (See [https://builds.apache.org/job/Tika-trunk/1463/])
Fix for TIKA-2582 contributed by ewanmellor. (commits: [https://github.com/apache/tika/commit/65defb20301d40397e94076a4b2011688cb94637])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
[jira] [Commented] (TIKA-2582) Tesseract 4.0 includes a FF character by default, breaking parsers
[ https://issues.apache.org/jira/browse/TIKA-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419683#comment-16419683 ]

Hudson commented on TIKA-2582:
------------------------------

FAILURE: Integrated in Jenkins build tika-branch-1x #14 (See [https://builds.apache.org/job/tika-branch-1x/14/])
Fix for TIKA-2582 contributed by ewanmellor. (tallison: [https://github.com/apache/tika/commit/d1526d053f91497ac7bcd4509f1555f4347377d6])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
[jira] [Commented] (TIKA-2582) Tesseract 4.0 includes a FF character by default, breaking parsers
[ https://issues.apache.org/jira/browse/TIKA-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419668#comment-16419668 ]

Hudson commented on TIKA-2582:
------------------------------

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #226 (See [https://builds.apache.org/job/tika-2.x-windows/226/])
Fix for TIKA-2582 contributed by ewanmellor. (commits: rev 65defb20301d40397e94076a4b2011688cb94637)
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
[jira] [Commented] (TIKA-2582) Tesseract 4.0 includes a FF character by default, breaking parsers
[ https://issues.apache.org/jira/browse/TIKA-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419580#comment-16419580 ]

ASF GitHub Bot commented on TIKA-2582:
--------------------------------------

tballison closed pull request #222: Fix for TIKA-2582 contributed by ewanmellor.
URL: https://github.com/apache/tika/pull/222

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java b/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
index c8c8bc93e..d8723dd34 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
@@ -91,6 +91,9 @@
     // factor by which image is to be scaled.
     private int resize = 900;
 
+    // See setPageSeparator.
+    private String pageSeparator = "";
+
     // whether or not to preserve interword spacing
     private boolean preserveInterwordSpacing = false;
 
@@ -255,6 +258,25 @@ public void setPageSegMode(String pageSegMode) {
         this.pageSegMode = pageSegMode;
     }
 
+    /**
+     * @see #setPageSeparator(String pageSeparator)
+     */
+    public String getPageSeparator() {
+        return pageSeparator;
+    }
+
+    /**
+     * The page separator to use in plain text output. This corresponds to Tesseract's page_separator config option.
+     * The default here is the empty string (i.e. no page separators). Note that this is also the default in
+     * Tesseract 3.x, but in Tesseract 4.0 the default is to use the form feed control character. We are overriding
+     * Tesseract 4.0's default here.
+     *
+     * @param pageSeparator
+     */
+    public void setPageSeparator(String pageSeparator) {
+        this.pageSeparator = pageSeparator;
+    }
+
     /**
      * Whether or not to maintain interword spacing. Default is false.
      *
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
index 08847fd74..3e15c4495 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
@@ -468,6 +468,7 @@ private void doOCR(File input, File output, TesseractOCRConfig config) throws IO
         String[] cmd = { config.getTesseractPath() + getTesseractProg(), input.getPath(), output.getPath(), "-l",
                 config.getLanguage(), "-psm", config.getPageSegMode(),
                 config.getOutputType().name().toLowerCase(Locale.US),
+                "-c", "page_separator=" + config.getPageSeparator(),
                 "-c",
                 (config.getPreserveInterwordSpacing())? "preserve_interword_spaces=1" : "preserve_interword_spaces=0"};
         ProcessBuilder pb = new ProcessBuilder(cmd);

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
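
The one-line change to doOCR in the diff above can be shown in isolation. The sketch below is a reduced, standalone version of the command assembly, assuming only the "-l" and "-c page_separator=..." arguments taken from the diff. The class and method names (TesseractCommandSketch, buildCommand) are hypothetical, the other arguments from the real command are omitted, and nothing here invokes Tesseract.

```java
import java.util.ArrayList;
import java.util.List;

// Reduced sketch of the argument list built in doOCR; names are hypothetical.
final class TesseractCommandSketch {

    static List<String> buildCommand(String tesseractProg, String inputPath,
                                     String outputBase, String language,
                                     String pageSeparator) {
        List<String> cmd = new ArrayList<>();
        cmd.add(tesseractProg);
        cmd.add(inputPath);
        cmd.add(outputBase);
        cmd.add("-l");
        cmd.add(language);
        // An empty separator yields the single argument "page_separator=",
        // which the Tesseract commit message describes as the way to
        // suppress page breaks.
        cmd.add("-c");
        cmd.add("page_separator=" + pageSeparator);
        return cmd;
    }

    public static void main(String[] args) {
        // Default in the patch: empty separator, i.e. no page breaks.
        System.out.println(buildCommand("tesseract", "in.png", "out", "eng", ""));
    }
}
```

Since a list like this would be handed to ProcessBuilder as discrete arguments rather than through a shell, neither the empty value nor a control character such as form feed needs quoting.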
[jira] [Commented] (TIKA-2582) Tesseract 4.0 includes a FF character by default, breaking parsers
[ https://issues.apache.org/jira/browse/TIKA-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372029#comment-16372029 ]

ASF GitHub Bot commented on TIKA-2582:
--------------------------------------

ewanmellor opened a new pull request #222: Fix for TIKA-2582 contributed by ewanmellor.
URL: https://github.com/apache/tika/pull/222

Tesseract 4.0 includes a change to use form feed characters to separate pages
by default in its text output. Previous versions used no separator unless you
specified the include_page_breaks option.

This confuses any parser that is not expecting the FF.
ODFParserTest.testOO2Metadata fails, because it is expecting the output of a
blank image to be the empty string, but now the FF is there.

I haven't seen any other failures, but I expect that user code will now see
either FF or U+FFFD where they are not expecting it (SafeContentHandler
replaces the FF with U+FFFD when converting to text to XML).

Fix this by setting Tesseract's page_separator option to the empty string.
This will preserve the no-page-breaks behavior with both Tesseract 3.x and 4.0.

Also, add an option TesseractOCRConfig.pageSeparator so that user code can
request the FF or any other separator, if they want it.
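
As a rough illustration of why user code might opt back in to a separator through the new option: once a known separator is present in the plain-text output, the text can be cut into per-page chunks. The sketch below assumes the caller already holds the OCR text as a String and knows which separator was configured. PageSplitSketch and splitPages are hypothetical names, not part of Tika or Tesseract.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; not part of Tika or Tesseract.
final class PageSplitSketch {

    // Split OCR text on the configured separator. An empty separator
    // (the default introduced by the patch) means the text is one chunk.
    static List<String> splitPages(String text, String separator) {
        if (separator.isEmpty()) {
            return Collections.singletonList(text);
        }
        // Pattern.quote keeps separators containing regex metacharacters literal;
        // the -1 limit preserves empty pages in the middle of the text.
        List<String> pages = new ArrayList<>(
                Arrays.asList(text.split(Pattern.quote(separator), -1)));
        // Drop the empty trailing chunk left when the text ends with the separator.
        if (!pages.isEmpty() && pages.get(pages.size() - 1).isEmpty()) {
            pages.remove(pages.size() - 1);
        }
        return pages;
    }

    public static void main(String[] args) {
        System.out.println(splitPages("first page\fsecond page\f", "\f"));
    }
}
```

Note that, per the description above, text that passes through SafeContentHandler would carry U+FFFD rather than FF, so a caller relying on this approach would probably want a separator that survives XML conversion.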