[jira] [Commented] (TIKA-2582) Tesseract 4.0 includes a FF character by default, breaking parsers

2018-03-29 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419998#comment-16419998
 ] 

Tim Allison commented on TIKA-2582:
---

Yes, all my fault.  Sorry, and thank you!

> Tesseract 4.0 includes a FF character by default, breaking parsers
> --
>
> Key: TIKA-2582
> URL: https://issues.apache.org/jira/browse/TIKA-2582
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.17
> Reporter: Ewan Mellor
> Priority: Major
> Fix For: 1.18, 2.0.0
>
>
> Tesseract 4.0 includes a change to use form feed characters to separate pages 
> by default in its text output. Previous versions used no separator unless you 
> specified the include_page_breaks option.
>
> This confuses any parser that is not expecting the FF. 
> ODFParserTest.testOO2Metadata fails, because it is expecting the output of a 
> blank image to be the empty string, but now the FF is there.
>
> I haven't seen any other failures, but I expect that user code will now see 
> either FF or U+FFFD where it is not expecting it (SafeContentHandler 
> replaces the FF with U+FFFD when converting text to XML).
>
> We should set the appropriate Tesseract options to disable this behavior 
> unless explicitly requested by user code, to avoid the change in behavior.
>
> For reference, the Tesseract change is as follows:
> {quote}commit 2cc531e6bf0288fc8a9ad1c123a252395f00bf56
> Merge: 3bb573ae aa6eb6bd
> Author: zdenop
> Date: Tue Sep 19 08:41:08 2017 +0200
>
>     Merge pull request #1140 from stweil/pagebreak
>
>     Remove Tesseract parameter "include_page_breaks" and use FF by default
>
> commit aa6eb6bd466101a3b89880f87580471a7694359d
> Author: Stefan Weil
> Date: Mon Jun 12 19:42:45 2017 +0200
>
>     Remove Tesseract parameter "include_page_breaks" and use FF by default
>
>     Now Tesseract adds a page break (normally form feed) by default.
>     It is still possible to suppress page breaks by setting an empty
>     page_separator.
>
>     Signed-off-by: Stefan Weil
> {quote}
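
To make the failure mode in the description concrete, here is a minimal, self-contained sketch. It does not use Tika's SafeContentHandler; sanitizeForXml is a hypothetical stand-in that replaces any character outside the XML 1.0 Char range with U+FFFD, which is the behavior the description attributes to SafeContentHandler. The class and method names are assumptions made for illustration only.

```java
class FormFeedDemo {
    /** True if the code point is allowed by the XML 1.0 "Char" production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Hypothetical stand-in (not Tika code): replace non-XML characters with U+FFFD. */
    static String sanitizeForXml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> sb.appendCodePoint(isXmlChar(cp) ? cp : 0xFFFD));
        return sb.toString();
    }

    public static void main(String[] args) {
        String withoutSeparator = "";   // blank image, no page separator
        String withFormFeed = "\f";     // blank image, form feed appended as page separator

        // A test that expects the empty string passes for the first and fails for the second.
        System.out.println(withoutSeparator.isEmpty());
        System.out.println(withFormFeed.isEmpty());

        // U+000C is not a legal XML 1.0 character, so it is replaced on the way to XML.
        System.out.println(sanitizeForXml(withFormFeed).equals("\uFFFD"));
    }
}
```

The form feed (U+000C) falls below 0x20 and is not one of the three permitted control characters (tab, line feed, carriage return), which is why it cannot pass through to XML output unchanged.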



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2582) Tesseract 4.0 includes a FF character by default, breaking parsers

2018-03-29 Thread Ewan Mellor (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419798#comment-16419798
 ] 

Ewan Mellor commented on TIKA-2582:
---

The build failures were not caused by this change; they came from TIKA-2621, 
which went in at the same time.



[jira] [Commented] (TIKA-2582) Tesseract 4.0 includes a FF character by default, breaking parsers

2018-03-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419781#comment-16419781
 ] 

Hudson commented on TIKA-2582:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1463 (See [https://builds.apache.org/job/Tika-trunk/1463/])
Fix for TIKA-2582 contributed by ewanmellor. (commits: [https://github.com/apache/tika/commit/65defb20301d40397e94076a4b2011688cb94637])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java




[jira] [Commented] (TIKA-2582) Tesseract 4.0 includes a FF character by default, breaking parsers

2018-03-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419683#comment-16419683
 ] 

Hudson commented on TIKA-2582:
--

FAILURE: Integrated in Jenkins build tika-branch-1x #14 (See [https://builds.apache.org/job/tika-branch-1x/14/])
Fix for TIKA-2582 contributed by ewanmellor. (tallison: [https://github.com/apache/tika/commit/d1526d053f91497ac7bcd4509f1555f4347377d6])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java




[jira] [Commented] (TIKA-2582) Tesseract 4.0 includes a FF character by default, breaking parsers

2018-03-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419668#comment-16419668
 ] 

Hudson commented on TIKA-2582:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #226 (See [https://builds.apache.org/job/tika-2.x-windows/226/])
Fix for TIKA-2582 contributed by ewanmellor. (commits: rev 65defb20301d40397e94076a4b2011688cb94637)
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java




[jira] [Commented] (TIKA-2582) Tesseract 4.0 includes a FF character by default, breaking parsers

2018-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419580#comment-16419580
 ] 

ASF GitHub Bot commented on TIKA-2582:
--

tballison closed pull request #222: Fix for TIKA-2582 contributed by ewanmellor.
URL: https://github.com/apache/tika/pull/222
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java b/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
index c8c8bc93e..d8723dd34 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
@@ -91,6 +91,9 @@
     // factor by which image is to be scaled.
     private int resize = 900;
 
+    // See setPageSeparator.
+    private String pageSeparator = "";
+
     // whether or not to preserve interword spacing
     private boolean preserveInterwordSpacing = false;
 
@@ -255,6 +258,25 @@ public void setPageSegMode(String pageSegMode) {
         this.pageSegMode = pageSegMode;
     }
 
+    /**
+     * @see #setPageSeparator(String pageSeparator)
+     */
+    public String getPageSeparator() {
+        return pageSeparator;
+    }
+
+    /**
+     * The page separator to use in plain text output.  This corresponds to Tesseract's page_separator config option.
+     * The default here is the empty string (i.e. no page separators).  Note that this is also the default in
+     * Tesseract 3.x, but in Tesseract 4.0 the default is to use the form feed control character.  We are overriding
+     * Tesseract 4.0's default here.
+     *
+     * @param pageSeparator
+     */
+    public void setPageSeparator(String pageSeparator) {
+        this.pageSeparator = pageSeparator;
+    }
+
     /**
      * Whether or not to maintain interword spacing.  Default is false.
      *
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
index 08847fd74..3e15c4495 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
@@ -468,6 +468,7 @@ private void doOCR(File input, File output, TesseractOCRConfig config) throws IO
         String[] cmd = { config.getTesseractPath() + getTesseractProg(), input.getPath(), output.getPath(), "-l",
                 config.getLanguage(), "-psm", config.getPageSegMode(),
                 config.getOutputType().name().toLowerCase(Locale.US),
+                "-c", "page_separator=" + config.getPageSeparator(),
                 "-c",
                 (config.getPreserveInterwordSpacing())? "preserve_interword_spaces=1" : "preserve_interword_spaces=0"};
         ProcessBuilder pb = new ProcessBuilder(cmd);
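
To see the effect of the doOCR change in isolation, the following sketch rebuilds the argument list outside Tika. buildCommand and all of its parameters are hypothetical names chosen for illustration; only the argument order and the "-l", "-psm", "-c page_separator=..." and "-c preserve_interword_spaces=..." arguments are taken from the diff above. The sketch only builds the list and never launches Tesseract.

```java
import java.util.ArrayList;
import java.util.List;

class TesseractCommandSketch {
    /** Hypothetical helper that mirrors the argument order used in the patched doOCR. */
    static List<String> buildCommand(String tesseract, String input, String outputBase,
                                     String language, String pageSegMode, String outputType,
                                     String pageSeparator, boolean preserveInterwordSpacing) {
        List<String> cmd = new ArrayList<>();
        cmd.add(tesseract);
        cmd.add(input);
        cmd.add(outputBase);
        cmd.add("-l");
        cmd.add(language);
        cmd.add("-psm");
        cmd.add(pageSegMode);
        cmd.add(outputType);
        // The new argument: an empty value asks Tesseract for no page separator at all.
        cmd.add("-c");
        cmd.add("page_separator=" + pageSeparator);
        cmd.add("-c");
        cmd.add("preserve_interword_spaces=" + (preserveInterwordSpacing ? "1" : "0"));
        return cmd;
    }

    public static void main(String[] args) {
        // Default from the patch: empty separator, so no form feed in the text output.
        System.out.println(buildCommand("tesseract", "page.png", "page", "eng", "1", "txt", "", false));
    }
}
```

Because each argument is a separate list element, an empty separator still produces a well-formed "page_separator=" argument rather than being dropped by a shell.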


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (TIKA-2582) Tesseract 4.0 includes a FF character by default, breaking parsers

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372029#comment-16372029
 ] 

ASF GitHub Bot commented on TIKA-2582:
--

ewanmellor opened a new pull request #222: Fix for TIKA-2582 contributed by ewanmellor.
URL: https://github.com/apache/tika/pull/222
 
 
   Tesseract 4.0 includes a change to use form feed characters to separate
   pages by default in its text output. Previous versions used no separator
   unless you specified the include_page_breaks option.
   
   This confuses any parser that is not expecting the FF.
   ODFParserTest.testOO2Metadata fails, because it is expecting the output of
   a blank image to be the empty string, but now the FF is there.
   
   I haven't seen any other failures, but I expect that user code will now see
   either FF or U+FFFD where it is not expecting it (SafeContentHandler
   replaces the FF with U+FFFD when converting text to XML).
   
   Fix this by setting Tesseract's page_separator option to the empty string.
   This will preserve the no-page-breaks behavior with both Tesseract 3.x and
   4.0.
   
   Also, add an option TesseractOCRConfig.pageSeparator so that user code can
   request the FF or any other separator, if they want it.
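
If user code does opt back in to a separator, it can split the OCR text into pages on that separator. This is a minimal sketch under stated assumptions: splitPages is a hypothetical helper, not part of Tika, and the sample string stands in for real OCR output produced with the separator set to the form feed character.

```java
import java.util.ArrayList;
import java.util.List;

class PageSplitSketch {
    /**
     * Hypothetical helper: split text on a literal separator.
     * Text after a trailing separator is empty and is not reported as an extra page.
     */
    static List<String> splitPages(String ocrText, String separator) {
        List<String> pages = new ArrayList<>();
        if (separator.isEmpty()) {
            // No separator configured: the whole text is one page.
            pages.add(ocrText);
            return pages;
        }
        int start = 0;
        int idx;
        while ((idx = ocrText.indexOf(separator, start)) >= 0) {
            pages.add(ocrText.substring(start, idx));
            start = idx + separator.length();
        }
        if (start < ocrText.length()) {
            pages.add(ocrText.substring(start));
        }
        return pages;
    }

    public static void main(String[] args) {
        // Stand-in for OCR output with a form feed written after each page.
        String text = "page one\n\fpage two\n\f";
        List<String> pages = splitPages(text, "\f");
        System.out.println(pages.size());
    }
}
```

Searching for the literal separator with indexOf, rather than using a regular expression, keeps the helper correct for any separator string a caller might configure.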

