[jira] [Commented] (TIKA-2342) Broken words
[ https://issues.apache.org/jira/browse/TIKA-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17906522#comment-17906522 ]

Hudson commented on TIKA-2342:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-branch_3x-jdk11 #1918 (See [https://ci-builds.apache.org/job/Tika/job/tika-branch_3x-jdk11/1918/])
TIKA-2342: suppport PDFBox IgnoreContentStreamSpaceGlyphs; add test; remove dead code line (tilman: [https://github.com/apache/tika/commit/fb1f238a25e4be680ab92ae5684583857c33ddbe])
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/resources/test-documents/testContentStreamSpaceGlyphs.pdf
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
TIKA-2342: remove dead code line (tilman: [https://github.com/apache/tika/commit/771fd1b848cca1ea70851627ab404a6d464de8e4])
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
TIKA-2342: add comment (tilman: [https://github.com/apache/tika/commit/067c3f84e755568421c198208dc550c8e3abe96a])
* (edit) tika-parent/pom.xml

> Broken words
>
> Key: TIKA-2342
> URL: https://issues.apache.org/jira/browse/TIKA-2342
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Tika app and Tika server
> Reporter: Nino Skopac
> Assignee: Tilman Hausherr
> Priority: Major
> Fix For: 3.0.1, 4.0.0
>
> Original PDF text: "Each certified or noncertified member"
> Tika extracted text: "Each certifi ed or noncertifi ed member"

--
This message was sent by Atlassian Jira (v8.20.10#820010)
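
For readers unfamiliar with this class of bug, the following is a toy model only. It is not PDFBox's or Tika's code, and every name in it (SpaceGlyphSketch, Chunk, sample, naive, byPosition, the 1.5 gap threshold, the coordinates) is made up for illustration. It assumes, judging by the option name IgnoreContentStreamSpaceGlyphs, that the idea is to disregard space glyphs stored in the PDF content stream and to place word breaks from glyph positions instead. In the sample data, a space glyph sits in the stream after the "fi" ligature, yet the next chunk is drawn directly against it, so no gap is visible on the page.

```java
import java.util.Arrays;
import java.util.List;

public class SpaceGlyphSketch {

    /** Hypothetical text chunk: its characters, left edge and width (same units). */
    static final class Chunk {
        final String text;
        final double x;
        final double width;

        Chunk(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** Made-up layout: the first space glyph is cancelled by positioning, the second is real. */
    static List<Chunk> sample() {
        return Arrays.asList(
                new Chunk("certi", 0, 25),
                new Chunk("fi", 25, 6),   // ligature, right edge at x = 31
                new Chunk(" ", 31, 3),    // space glyph present in the stream
                new Chunk("ed", 31, 11),  // but drawn at x = 31, so no visible gap
                new Chunk(" ", 42, 3),
                new Chunk("or", 45, 10)); // drawn 3 units after "ed": a visible gap
    }

    /** Trusts the stream: every chunk, space glyphs included, is emitted as-is. */
    static String naive(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            sb.append(c.text);
        }
        return sb.toString();
    }

    /** Skips space glyphs and inserts a space only where the horizontal gap is wide enough. */
    static String byPosition(List<Chunk> chunks, double gapThreshold) {
        StringBuilder sb = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : chunks) {
            if (c.text.trim().isEmpty()) {
                continue; // ignore space glyphs from the stream
            }
            if (prev != null && c.x - (prev.x + prev.width) > gapThreshold) {
                sb.append(' ');
            }
            sb.append(c.text);
            prev = c;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(naive(sample()));           // certifi ed or
        System.out.println(byPosition(sample(), 1.5)); // certified or
    }
}
```

The sketch only shows why a stored space glyph and a visible gap can disagree; how the real option is exposed and configured is defined by the commits listed above, not by this example.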
[jira] [Commented] (TIKA-2342) Broken words
[ https://issues.apache.org/jira/browse/TIKA-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17906510#comment-17906510 ]

Hudson commented on TIKA-2342:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk17 #580 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk17/580/])
TIKA-2342: remove dead code line (tilman: [https://github.com/apache/tika/commit/5bbbf92f264a38f480743bf6755dd0dfce52f56b])
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
TIKA-2342: add comment (tilman: [https://github.com/apache/tika/commit/b38ebdf4ff0fbf2f11e9caf7fa724c0b1a20424b])
* (edit) tika-parent/pom.xml
[jira] [Commented] (TIKA-2342) Broken words
[ https://issues.apache.org/jira/browse/TIKA-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17906492#comment-17906492 ]

Tilman Hausherr commented on TIKA-2342:
---------------------------------------

I haven't made the changes to 2.* because that one is EOL in April 2025. (I can still do it if wanted.)
[jira] [Commented] (TIKA-2342) Broken words
[ https://issues.apache.org/jira/browse/TIKA-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17906475#comment-17906475 ]

Hudson commented on TIKA-2342:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk17 #579 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk17/579/])
TIKA-2342: suppport PDFBox IgnoreContentStreamSpaceGlyphs; add test; remove dead code line (tilman: [https://github.com/apache/tika/commit/c4885fae7111e748b9a7cfeee86cd78ebea7f600])
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/resources/test-documents/testContentStreamSpaceGlyphs.pdf
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
[jira] [Commented] (TIKA-2342) Broken words
[ https://issues.apache.org/jira/browse/TIKA-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17906335#comment-17906335 ]

Tilman Hausherr commented on TIKA-2342:
---------------------------------------

Reopened to add the new option
[jira] [Commented] (TIKA-2342) Broken words
[ https://issues.apache.org/jira/browse/TIKA-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15992739#comment-15992739 ]

Nino Skopac commented on TIKA-2342:
-----------------------------------

I've traced it down to a PDFBox issue: https://issues.apache.org/jira/browse/PDFBOX-3774

Thank you, Tim.
[jira] [Commented] (TIKA-2342) Broken words
[ https://issues.apache.org/jira/browse/TIKA-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15986466#comment-15986466 ]

Nino Skopac commented on TIKA-2342:
-----------------------------------

Awesome, will do. Thanks for the prompt reply!
[jira] [Commented] (TIKA-2342) Broken words
[ https://issues.apache.org/jira/browse/TIKA-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15986442#comment-15986442 ]

Tim Allison commented on TIKA-2342:
-----------------------------------

Welcome to PDFs! This _may_ be fixable at the PDFBox level.

See: https://wiki.apache.org/tika/Troubleshooting%20Tika#PDF_Text_Problems

If you can reproduce this with pure PDFBox, please open an issue on their JIRA.

More generally, see: https://wiki.apache.org/tika/PDFParser%20(Apache%20PDFBox)
