[jira] [Commented] (PDFBOX-5540) export:text creates jibberish / malformed output

2022-11-17 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635509#comment-17635509
 ] 

Tilman Hausherr commented on PDFBOX-5540:
-

No, unless you have free time, because we'd still need another test before 
release.

> export:text creates jibberish / malformed output
> 
>
> Key: PDFBOX-5540
> URL: https://issues.apache.org/jira/browse/PDFBOX-5540
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.16, 2.0.27, 3.0.0 PDFBox
> Environment: Same on Windows, Linux and macOS
>Reporter: Alfons
>Assignee: Tilman Hausherr
>Priority: Minor
>  Labels: regression
> Fix For: 2.0.28, 3.0.0 PDFBox
>
> Attachments: PDFBOX-5540.pdf.txt, test.pdf, test.txt
>
>
> I am using PDFBox as part of Tika, and some PDFs produce unreadable output. 
> Copying text from Adobe / macOS Preview / browsers works as expected.
> I have also tried "re-encoding" the PDF by editing and saving it with 
> Acrobat, in case the original PDF creator was at fault, and running PDFBox 
> with different encodings, but the output remained mostly unchanged.
> I attached the PDF and the text it produces. I ran PDFBox via the CLI as follows:
> {code:java}
> root % java -jar pdfbox-app-3.0.0-alpha3.jar export:text -i test.pdf          
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font 
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font 
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font 
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font 
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-5544) Add error-prone to build

2022-11-17 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635505#comment-17635505
 ] 

Tilman Hausherr commented on PDFBOX-5544:
-

We don't use git; we use svn.

> Add error-prone to build
> 
>
> Key: PDFBOX-5544
> URL: https://issues.apache.org/jira/browse/PDFBOX-5544
> Project: PDFBox
>  Issue Type: Improvement
>Reporter: Ahmad
>Priority: Minor
>
> Would this project welcome a contribution to add 
> [https://errorprone.info|https://errorprone.info/]?
> The work would include [https://errorprone.info/docs/installation] and a 
> one-time fix of any warnings and errors found.
> Contributors and committers would then have an additional quality check, 
> similar to Spotbugs & PMD.
> Error Prone is [already used by many projects at the 
> ASF|https://issues.apache.org/jira/issues/?jql=text%20~%20%22error-prone%22].






[jira] [Comment Edited] (PDFBOX-5544) Add error-prone to build

2022-11-17 Thread Ahmad (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635414#comment-17635414
 ] 

Ahmad edited comment on PDFBOX-5544 at 11/17/22 3:37 PM:
-

This could be added as a GitHub pre-commit check so that it's not private. WDYT?


was (Author: JIRAUSER298200):
This could be added as a git precommit check so that it's not private. WDYT?




[jira] [Commented] (PDFBOX-5544) Add error-prone to build

2022-11-17 Thread Ahmad (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635414#comment-17635414
 ] 

Ahmad commented on PDFBOX-5544:
---

This could be added as a git precommit check so that it's not private. WDYT?




[jira] [Commented] (PDFBOX-5540) export:text creates jibberish / malformed output

2022-11-17 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635337#comment-17635337
 ] 

Tim Allison commented on PDFBOX-5540:
-

Should I kick that off now?
