[jira] [Commented] (PDFBOX-5540) export:text creates jibberish / malformed output
[ https://issues.apache.org/jira/browse/PDFBOX-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635509#comment-17635509 ]

Tilman Hausherr commented on PDFBOX-5540:
-----------------------------------------

No, unless you have free time, because we'd still need another test before release.

> export:text creates jibberish / malformed output
> ------------------------------------------------
>
>                 Key: PDFBOX-5540
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5540
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.16, 2.0.27, 3.0.0 PDFBox
>         Environment: Same on Windows, Linux and macOS
>            Reporter: Alfons
>            Assignee: Tilman Hausherr
>            Priority: Minor
>              Labels: regression
>             Fix For: 2.0.28, 3.0.0 PDFBox
>
>         Attachments: PDFBOX-5540.pdf.txt, test.pdf, test.txt
>
>
> Using PDFBox as part of Tika and having issues with some PDFs outputting
> unreadable content. Copying text from Adobe / macOS Preview / Browsers works
> as expected.
> I have also tried "re-encoding" the PDF by editing and saving it with
> Acrobat, thinking it could be an issue with their original PDF creator and
> using pdfbox with different encodings, but output mostly remained unchanged.
> I attached the PDF and text it produces.
> Running PDFBox via CLI as follows:
> {code:java}
> root % java -jar pdfbox-app-3.0.0-alpha3.jar export:text -i test.pdf
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

[jira] [Commented] (PDFBOX-5544) Add error-prone to build
[ https://issues.apache.org/jira/browse/PDFBOX-5544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635505#comment-17635505 ]

Tilman Hausherr commented on PDFBOX-5544:
-----------------------------------------

We don't use git, we use svn.

> Add error-prone to build
> ------------------------
>
>                 Key: PDFBOX-5544
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5544
>             Project: PDFBox
>          Issue Type: Improvement
>            Reporter: Ahmad
>            Priority: Minor
>
> Would this project welcome a contribution to add
> [https://errorprone.info|https://errorprone.info/] ?
> The work would include [https://errorprone.info/docs/installation] and fixing
> any warnings and errors found one time.
> Contributors and committers would then have an additional quality check,
> similar to Spotbugs & PMD.
> Error Prone is [already used by many projects at the
> ASF|https://issues.apache.org/jira/issues/?jql=text%20~%20%22error-prone%22].

[jira] [Comment Edited] (PDFBOX-5544) Add error-prone to build
[ https://issues.apache.org/jira/browse/PDFBOX-5544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635414#comment-17635414 ]

Ahmad edited comment on PDFBOX-5544 at 11/17/22 3:37 PM:
---------------------------------------------------------

This could be added as a github precommit check so that it's not private. WDYT?

was (Author: JIRAUSER298200):
This could be added as a git precommit check so that it's not private. WDYT?

[jira] [Commented] (PDFBOX-5544) Add error-prone to build
[ https://issues.apache.org/jira/browse/PDFBOX-5544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635414#comment-17635414 ]

Ahmad commented on PDFBOX-5544:
-------------------------------

This could be added as a git precommit check so that it's not private. WDYT?

[jira] [Commented] (PDFBOX-5540) export:text creates jibberish / malformed output
[ https://issues.apache.org/jira/browse/PDFBOX-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635337#comment-17635337 ]

Tim Allison commented on PDFBOX-5540:
-------------------------------------

Should I kick that off now?