[jira] [Commented] (TIKA-4363) Duplicate text when OCR and extractMarkedContent (PDFParserConfig) enabled

2025-01-15 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17913320#comment-17913320
 ] 

Hudson commented on TIKA-4363:
--

SUCCESS: Integrated in Jenkins build Tika » tika-branch_3x-jdk11 #1944 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-branch_3x-jdk11/1944/])
TIKA-4363: refactor (tilman: 
[https://github.com/apache/tika/commit/939eff71140043446d48ad58c16e09e4291c89b0])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFMarkedContent2XHTML.java


> Duplicate text when OCR and extractMarkedContent (PDFParserConfig) enabled
> --
>
> Key: TIKA-4363
> URL: https://issues.apache.org/jira/browse/TIKA-4363
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Alexey Pismenskiy
>Assignee: Tim Allison
>Priority: Major
> Attachments: MarkedPdfDuplicateTextWithTesseract.pdf, 
> tika-conf-override.xml
>
>
> Extracting a marked PDF while OCR (Tesseract) and extractMarkedContent are 
> enabled and the ocrStrategy defaults to "auto" within the PDFParser 
> causes duplicate text extraction.
> Attached are an example configuration and a marked PDF file that 
> reproduce the issue with the following test: 
> {{@Test}}
> {{public void testPDFDuplicate() throws Exception {}}
> {{  String tikaConfigFileName = "/test-documents/tika-conf-override.xml";}}
> {{  TikaConfig tikaConfig = new TikaConfig(getClass().getResourceAsStream(tikaConfigFileName));}}
> {{  Tika tika = new Tika(tikaConfig);}}
> {{  String issueFile = "/test-documents/MarkedPdfDuplicateTextWithTesseract.pdf";}}
> {{  URL resource = getClass().getResource(issueFile);}}
> {{  assert resource != null;}}
> {{  try (InputStream issueStream = resource.openStream()) {}}
> {{    String issueContent = tika.parseToString(issueStream);}}
> {{    System.out.println(issueContent);}}
> {{    assertTrue(issueContent.contains("aabb6ba1-34ab-4af2"));}}
> {{    assertEquals(1, StringUtils.countMatches(issueContent, "aabb6ba1-34ab-4af2"), "Does not contain the expected number of occurrences");}}
> {{  }}}
> {{}}}
>  
> PDFParser.java:214
>  * This is where it checks for the extractMarkedContent flag and will go into 
> the PDFMarkedContent2XHTML class.
>  
> AbstractPDF2XHTML.java:791 - 806
>  * In this code, totalCharsPerPage is never updated by 
> PDFMarkedContent2XHTML, so the page meets the conditions for OCR even 
> though text has already been extracted.
> One thing to note: if we turn off extractMarkedContent, processing goes into 
> PDF2XHTML at PDFParser.java:219 and the variable totalCharsPerPage is 
> updated properly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4363) Duplicate text when OCR and extractMarkedContent (PDFParserConfig) enabled

2025-01-15 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17913312#comment-17913312
 ] 

Hudson commented on TIKA-4363:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk17 #603 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk17/603/])
TIKA-4363: refactor (tilman: 
[https://github.com/apache/tika/commit/657e75b53b82b03d5e296c23687f2e913e0ba4ac])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFMarkedContent2XHTML.java




[jira] [Commented] (TIKA-4363) Duplicate text when OCR and extractMarkedContent (PDFParserConfig) enabled

2025-01-15 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17913307#comment-17913307
 ] 

Hudson commented on TIKA-4363:
--

SUCCESS: Integrated in Jenkins build Tika » tika-branch_2x-jdk11 #585 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-branch_2x-jdk11/585/])
TIKA-4363: refactor (tilman: 
[https://github.com/apache/tika/commit/636f57b40ad610f5dfbc8dce203a0b251ccff56d])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFMarkedContent2XHTML.java




[jira] [Commented] (TIKA-4363) Duplicate text when OCR and extractMarkedContent (PDFParserConfig) enabled

2025-01-14 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17913168#comment-17913168
 ] 

Tilman Hausherr commented on TIKA-4363:
---

Maybe I misunderstood the question... 
{code:java}
COSBase object = ((COSObject) pageBase).getObject();
if (object instanceof COSDictionary) {
    int index = document.getPages().indexOf(new PDPage((COSDictionary) object)) + 1;
    System.out.println("page: " + index);
}
{code}
Also I don't understand why currentPageRef is used with a new type ObjectRef 
instead of just using COSObject or COSBase to have a unique key for MCID. (I 
made a TODO comment about that)



[jira] [Commented] (TIKA-4363) Duplicate text when OCR and extractMarkedContent (PDFParserConfig) enabled

2025-01-14 Thread Alexey Pismenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17913073#comment-17913073
 ] 

Alexey Pismenskiy commented on TIKA-4363:
-

The getCurrentPageNo() relies on the pageIndex variable that only gets updated 
inside the processPages() method. In 2.9.2, pageIndex lives inside the 
AbstractPDF2XHTML class and is only updated inside the processPages() of that 
class. The issue is that the PDFMarkedContent2XHTML overrides processPages() 
and does not update pageIndex. I think doing something with the page refs makes 
sense.
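
The override problem described above can be illustrated with a minimal sketch. The class names {{BaseHandler}} and {{MarkedContentHandler}} are hypothetical stand-ins, not the Tika classes, and the 1-based counter is an assumption made only for this example.

```java
/** Hypothetical stand-in for a base handler that tracks the current page. */
class BaseHandler {
    protected int pageIndex = 0;

    void processPages(int pageCount) {
        for (int i = 0; i < pageCount; i++) {
            pageIndex = i + 1; // the only place the counter advances
            processPage();
        }
    }

    void processPage() {
        // per-page work would go here
    }

    int getCurrentPageNo() {
        return pageIndex;
    }
}

/** Hypothetical subclass that replaces the page loop and never updates the counter. */
class MarkedContentHandler extends BaseHandler {
    @Override
    void processPages(int pageCount) {
        // Walks some other structure instead of iterating pages,
        // so pageIndex keeps its initial value.
    }

    public static void main(String[] args) {
        BaseHandler base = new BaseHandler();
        base.processPages(3);
        System.out.println("base: " + base.getCurrentPageNo());

        BaseHandler marked = new MarkedContentHandler();
        marked.processPages(3);
        System.out.println("marked: " + marked.getCurrentPageNo());
    }
}
```

In this sketch the subclass reports a stale page number after processing, which is the kind of symptom the comment describes.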



[jira] [Commented] (TIKA-4363) Duplicate text when OCR and extractMarkedContent (PDFParserConfig) enabled

2024-12-13 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17905507#comment-17905507
 ] 

Tilman Hausherr commented on TIKA-4363:
---

getCurrentPageNo()



[jira] [Commented] (TIKA-4363) Duplicate text when OCR and extractMarkedContent (PDFParserConfig) enabled

2024-12-13 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17905506#comment-17905506
 ] 

Tim Allison commented on TIKA-4363:
---

[~tilman], is there an easy way to figure out which page number we're on at 
this line?

The knuckle-dragging method would be to call getPages() before processing the 
marked content and then do a lookup in that array to see if the object ref is 
in that list???


https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFMarkedContent2XHTML.java#L302
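
A rough sketch of the lookup idea above, written without any PDF library so it stays self-contained: collect the page references once, in document order, and map each to its 1-based page number. {{PageNumberIndex}} and its methods are hypothetical names for illustration; the key type is left generic because the thread does not settle what the page reference type should be.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: maps a page reference to its 1-based page number. */
class PageNumberIndex<K> {
    private final Map<K, Integer> numberByRef = new HashMap<>();

    /** Builds the index once from the page references in document order. */
    PageNumberIndex(List<K> pageRefsInOrder) {
        for (int i = 0; i < pageRefsInOrder.size(); i++) {
            // putIfAbsent keeps the first page number if a reference repeats
            numberByRef.putIfAbsent(pageRefsInOrder.get(i), i + 1);
        }
    }

    /** Returns the 1-based page number, or -1 if the reference is unknown. */
    int pageNumberOf(K ref) {
        return numberByRef.getOrDefault(ref, -1);
    }
}
```

The key type must implement equals() and hashCode() consistently for this to work; building the map up front avoids a linear scan of the page list for every marked-content reference.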



[jira] [Commented] (TIKA-4363) Duplicate text when OCR and extractMarkedContent (PDFParserConfig) enabled

2024-12-13 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17905503#comment-17905503
 ] 

Tim Allison commented on TIKA-4363:
---

Thank you for opening this and explaining the problem in detail.

As I look at PDFMarkedContent2XHTML, I'm reminded that that handler builds the 
text from the structure tree root. 

{noformat}
//TODO: figure out when we're crossing page boundaries during the recursion
// step above and do the page by page processing then...rather than dumping this
// all here.
{noformat}

The current code does not calculate which content from the structure tree root 
appears on which page. In short, it currently has no way of computing 
{{totalCharsPerPage}}.

The right solution is to do the {{TODO}}. Maybe we could do a minimal-effort 
algorithm that keeps a tally of "totalCharsPerPage" based on the 
currentPageRef??? 

Short of that, maybe turn off OCR if the codepath goes through 
PDFMarkedContent2XHTML#processPages()?
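
The tally idea above could look roughly like the following. This is only a sketch under stated assumptions: {{PageCharTally}} is a hypothetical helper, the string key stands in for whatever identifies the current page, and the {{needsOcr}} threshold rule is an assumed simplification rather than the condition Tika actually applies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Tika: counts extracted characters per page reference. */
class PageCharTally {
    // The key stands in for a unique page identifier such as the current page reference.
    private final Map<String, Integer> charsPerPage = new LinkedHashMap<>();

    /** Records text emitted while the structure-tree walk is on the given page. */
    void add(String pageRef, String text) {
        charsPerPage.merge(pageRef, text.length(), Integer::sum);
    }

    /** Characters seen so far for a page; 0 if the walk never touched it. */
    int charsOn(String pageRef) {
        return charsPerPage.getOrDefault(pageRef, 0);
    }

    /** Assumed rule: OCR only pages that yielded fewer than minChars characters. */
    boolean needsOcr(String pageRef, int minChars) {
        return charsOn(pageRef) < minChars;
    }

    public static void main(String[] args) {
        PageCharTally tally = new PageCharTally();
        tally.add("12 0 R", "aabb6ba1-34ab-4af2");
        tally.add("12 0 R", " more text");
        System.out.println(tally.charsOn("12 0 R"));
        System.out.println(tally.needsOcr("12 0 R", 10));
        System.out.println(tally.needsOcr("15 0 R", 10));
    }
}
```

With a tally like this, a page whose marked content already produced text would not fall through to OCR, while a page the structure tree never touched still would.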


