Public bug reported: PDF Creation Problem
Bob Swanson [email protected]
26 June 2018

This file is part of the test package:
http://swansongrp.com/misc/testcase.zip

I have been able to demonstrate PDF printing issues with LibreOffice
and web browsers. (For contrast, I have also used the "wkhtmltopdf"
command-line utility output.)

USING LIBREOFFICE
-----------------
(Base file: mytest.odt)

This problem was originally associated with a LibreOffice file
containing mixed font usages. When printed with "cups-pdf", most of the
displayed text could not be selected in "evince" and extracted text was
garbage (The PDFBox Java code could not extract reasonable text from
the PDF file.)

See: https://issues.apache.org/jira/browse/PDFBOX-4250
(working environment described in that bug report)

It is much easier to demonstrate this problem without using PDFBox
Java. To demonstrate, view any of the resulting PDF files with
"evince". When viewing a PDF test file, simply press CTRL-A to select
all text. Then "paste" the selected area into a text editor (Gedit or
VIM, for instance) to see the resulting plain text. (Okular did no
better than evince)

The following results occur:

o) With original testcase: mytest3_cups_pdf.pdf, created from
   LibreOffice using cups-pdf
   Only one line highlights, but its text is correct. This particular
   line uses a "standard" PDF font. The other lines are not
   highlighted, and are not placed on clipboard. This was the test
   that failed in PDFBox Java text extraction.
   Evince shows the many embedded fonts.

o) With original testcase: mytest_libreoffice_direct.pdf created from
   LibreOffice using its "built in" PDF creation option
   ALL lines highlight, and when pasted, all text is present.
   Evince shows the many embedded fonts.

USING CHROMIUM BROWSER
----------------------
(Base file: mytest.html)

I created several lines using different fonts, as an HTML file.
Viewed in Chromium browser, then printed.

o) File: mytest_html_cups_pdf.pdf, was printed from Chromium using the
   "cups-pdf" "printer".
   All lines appear in the PDF, and can be selected. But when pasted,
   all resulting text is garbage.
   Only one font embedded: "No name".

o) File: mytest_html_save_as_pdf.pdf, was printed from Chromium using
   the "save as file" option.
   All lines appear in the PDF, and can be selected. All text
   (including text added by the PDF creator) is present.
   Evince shows the many embedded fonts.

(In the HTML cases, fonts used are no doubt those already installed on
my Ubuntu system. The HTML code asked for fonts that may not be
present, and probably were substituted.)

USING BRAVE BROWSER
-------------------
(Base file: mytest.html)

Same testcase as for Chromium browser.
Viewed in Brave browser, then printed.

o) File: mytest_html_brave_cups_pdf.pdf, was printed from Brave using
   the "cups-pdf" "printer".
   All lines appear in the PDF. However, when everything is selected,
   every character is highlighted EXCEPT the initial "T" on the first
   line. When pasted, all resulting text is garbage.
   Only one font embedded: "No name".

o) File: mytest_html_brave_save_as_pdf.pdf, was printed from Brave
   using the "save as file" option.
   All lines appear in the PDF, and can be selected. All text is
   present. (No text added by Brave).
   Evince shows the many embedded fonts.

(Same notes may apply regarding fonts installed on Ubuntu system)

USING WKHTMLTOPDF COMMAND
-------------------------
(Base file: mytest.html)

Same testcase as for browsers. I'm using this example to show that
multiple font test output can be created in different ways.

Command: wkhtmltopdf mytest.html mytest_wk.pdf

o) File: mytest_wk.pdf
   All lines appear in the PDF, and fonts are sometimes quite
   different than those shown in the browsers (they may actually be
   more correct). All content can be highlighted and can be pasted as
   text. Several text lines, however, contain additional whitespace
   (tabs).
   Evince shows the many embedded fonts.
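
The manual check used throughout (select all in evince, paste into an
editor, look at the result) can probably be approximated from the
command line. The following is only a rough sketch, not part of the
test package: it assumes poppler's "pdftotext" utility is installed
and that you supply a few words you know appear in the test document.
The function names and the word-matching heuristic are hypothetical,
and pdftotext output may not match evince's clipboard text exactly.

```python
import subprocess
import sys


def extract_text(pdf_path):
    """Return the text pdftotext extracts; the "-" argument sends it to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def match_ratio(text, expected_words):
    """Fraction of expected words found in the text (case-insensitive).

    This is an illustrative heuristic: a ratio near 0 suggests the
    extracted text is garbage, a ratio near 1 suggests it is usable.
    """
    if not expected_words:
        return 0.0
    lowered = text.lower()
    found = sum(1 for word in expected_words if word.lower() in lowered)
    return found / len(expected_words)


def main(argv):
    # Usage: python check_pdf_text.py "word1 word2 ..." file1.pdf [file2.pdf ...]
    if len(argv) < 3:
        print('usage: check_pdf_text.py "expected words" file.pdf [...]')
        return
    expected = argv[1].split()
    for path in argv[2:]:
        ratio = match_ratio(extract_text(path), expected)
        print(f"{path}: {ratio:.0%} of expected words found")


if __name__ == "__main__":
    main(sys.argv)
```

Running it over all of the PDF files in the test package with the same
word list would give a side-by-side comparison of the cups-pdf output
against the "direct" and "save as file" output.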
NOTES
-----
The "creator" name embedded in the metadata for these PDF files varies
considerably, and it is unclear to me whether the same engine is being
used by these various packages. It is clear, at least, that cups-pdf is
using Ghostscript for PDF creation.

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: cups-pdf 2.6.1-21
ProcVersionSignature: Ubuntu 4.13.0-45.50~16.04.1-generic 4.13.16
Uname: Linux 4.13.0-45-generic x86_64
ApportVersion: 2.20.1-0ubuntu2.18
Architecture: amd64
CurrentDesktop: Unity
Date: Wed Jun 27 15:36:32 2018
InstallationDate: Installed on 2017-05-16 (406 days ago)
InstallationMedia: Ubuntu 16.04.2 LTS "Xenial Xerus" - Release amd64 (20170215.2)
SourcePackage: cups-pdf
UpgradeStatus: No upgrade log present (probably fresh install)

** Affects: cups-pdf (Ubuntu)
     Importance: Undecided
         Status: New

** Tags: amd64 apport-bug xenial

** Attachment added: "ZIP file with README and several testcases"
   https://bugs.launchpad.net/bugs/1778988/+attachment/5157183/+files/testcase.zip

--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1778988

Title:
  After PDF Files created by cups-pdf, cannot extract text from them

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/cups-pdf/+bug/1778988/+subscriptions

--
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
