[jira] [Commented] (TIKA-2342) Broken words

2024-12-17 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17906522#comment-17906522
 ] 

Hudson commented on TIKA-2342:
--

SUCCESS: Integrated in Jenkins build Tika » tika-branch_3x-jdk11 #1918 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-branch_3x-jdk11/1918/])
TIKA-2342: support PDFBox IgnoreContentStreamSpaceGlyphs; add test; remove 
dead code line (tilman: 
[https://github.com/apache/tika/commit/fb1f238a25e4be680ab92ae5684583857c33ddbe])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/resources/test-documents/testContentStreamSpaceGlyphs.pdf
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
TIKA-2342: remove dead code line (tilman: 
[https://github.com/apache/tika/commit/771fd1b848cca1ea70851627ab404a6d464de8e4])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
TIKA-2342: add comment (tilman: 
[https://github.com/apache/tika/commit/067c3f84e755568421c198208dc550c8e3abe96a])
* (edit) tika-parent/pom.xml


> Broken words
>
> Key: TIKA-2342
> URL: https://issues.apache.org/jira/browse/TIKA-2342
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Tika app and Tika server
> Reporter: Nino Skopac
> Assignee: Tilman Hausherr
> Priority: Major
> Fix For: 3.0.1, 4.0.0
>
> Original PDF text: "Each certified or noncertified member"
> Tika extracted text: "Each certifi ed or noncertifi ed member"



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-2342) Broken words

2024-12-17 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17906510#comment-17906510
 ] 

Hudson commented on TIKA-2342:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk17 #580 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk17/580/])
TIKA-2342: remove dead code line (tilman: 
[https://github.com/apache/tika/commit/5bbbf92f264a38f480743bf6755dd0dfce52f56b])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
TIKA-2342: add comment (tilman: 
[https://github.com/apache/tika/commit/b38ebdf4ff0fbf2f11e9caf7fa724c0b1a20424b])
* (edit) tika-parent/pom.xml




[jira] [Commented] (TIKA-2342) Broken words

2024-12-17 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17906492#comment-17906492
 ] 

Tilman Hausherr commented on TIKA-2342:
---

I haven't made the changes to 2.* because that one is EOL in April 2025. (I 
can still do it if wanted.)



[jira] [Commented] (TIKA-2342) Broken words

2024-12-17 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17906475#comment-17906475
 ] 

Hudson commented on TIKA-2342:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk17 #579 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk17/579/])
TIKA-2342: support PDFBox IgnoreContentStreamSpaceGlyphs; add test; remove 
dead code line (tilman: 
[https://github.com/apache/tika/commit/c4885fae7111e748b9a7cfeee86cd78ebea7f600])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/resources/test-documents/testContentStreamSpaceGlyphs.pdf
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java




[jira] [Commented] (TIKA-2342) Broken words

2024-12-17 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17906335#comment-17906335
 ] 

Tilman Hausherr commented on TIKA-2342:
---

Reopened to add the new option

> Broken words
>
> Key: TIKA-2342
> URL: https://issues.apache.org/jira/browse/TIKA-2342
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Tika app and Tika server
> Reporter: Nino Skopac
> Assignee: Tilman Hausherr
> Priority: Major
>
> Original PDF text: "Each certified or noncertified member"
> Tika extracted text: "Each certifi ed or noncertifi ed member"





[jira] [Commented] (TIKA-2342) Broken words

2017-05-02 Thread Nino Skopac (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15992739#comment-15992739
 ] 

Nino Skopac commented on TIKA-2342:
---

I've traced it down to PDFBox issue: 
https://issues.apache.org/jira/browse/PDFBOX-3774

Thank you Tim.

> Broken words
>
> Key: TIKA-2342
> URL: https://issues.apache.org/jira/browse/TIKA-2342
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.14
> Environment: Tika app and Tika server
> Reporter: Nino Skopac
>
> Original PDF text: "Each certified or noncertified member"
> Tika extracted text: "Each certifi ed or noncertifi ed member"



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (TIKA-2342) Broken words

2017-04-27 Thread Nino Skopac (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15986466#comment-15986466
 ] 

Nino Skopac commented on TIKA-2342:
---

Awesome, will do. Thanks for the prompt reply!



[jira] [Commented] (TIKA-2342) Broken words

2017-04-27 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15986442#comment-15986442
 ] 

Tim Allison commented on TIKA-2342:
---

Welcome to PDFs!  This _may_ be fixable at the PDFBox level.  See: 
https://wiki.apache.org/tika/Troubleshooting%20Tika#PDF_Text_Problems 

If you can reproduce this with pure PDFBox, please open an issue on their JIRA.

and more generally: 
https://wiki.apache.org/tika/PDFParser%20(Apache%20PDFBox)
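
When reproducing a report like this one, it can help to confirm that the extracted text differs from the expected text only by inserted spaces, and to list which words were split. The following is a minimal sketch of such a check using only the Java standard library. The class and method names (BrokenWordCheck, differsOnlyByWhitespace, brokenWords) are hypothetical and are not part of Tika or PDFBox; the two strings are the ones from the issue description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika or PDFBox.
public class BrokenWordCheck {

    /** Removes every whitespace character from the input. */
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    /** True when both strings are identical once all whitespace is removed. */
    static boolean differsOnlyByWhitespace(String expected, String extracted) {
        return stripWhitespace(expected).equals(stripWhitespace(extracted));
    }

    /** Lists expected words that do not appear as whole tokens in the extracted text. */
    static List<String> brokenWords(String expected, String extracted) {
        List<String> extractedTokens = Arrays.asList(extracted.trim().split("\\s+"));
        List<String> broken = new ArrayList<>();
        for (String word : expected.trim().split("\\s+")) {
            if (!extractedTokens.contains(word)) {
                broken.add(word);
            }
        }
        return broken;
    }

    public static void main(String[] args) {
        String expected = "Each certified or noncertified member";
        String extracted = "Each certifi ed or noncertifi ed member";
        System.out.println(differsOnlyByWhitespace(expected, extracted)); // true
        System.out.println(brokenWords(expected, extracted)); // [certified, noncertified]
    }
}
```

If the first check is true and the second list is non-empty, the extractor has most likely inserted spaces inside words rather than dropped or altered characters, which matches the symptom reported here.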


