Re: When will the next version from the 3.x line be available?
Hi,

On 27.06.23 at 15:10, Brangs, Erik wrote:
> Hi,
>
> version 2.0.28 of PDFBox was released recently. Will there also be a new
> version from the 3.x line in the near future?

First of all, there will be another 2.0 release, hopefully tomorrow.

> Andreas Lehmkühler mentioned a possible beta1 release last month
> ( https://lists.apache.org/thread/0bgg6pd4d48qd49bxsdgvb9vsxr9r3v6 ).

Yes, due to some personal issues I had to postpone the 3.0 release. Once 2.0.29 is out, I'm going to target the first beta of 3.0.

Andreas

> We are interested in the 3.x line because of the reduced memory usage and
> because it is tested on newer Java versions.
>
> Kind regards
> Erik Brangs
>
> *** Suchen. Finden. Entdecken. Deutsche Nationalbibliothek ***

-
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org
Re: Does preflight check for "character encoding"?
Thanks Tim! I just ran veraPDF on 140K PDFs using the UA1 flag; it gave me a 9+ GB XML file that I'm parsing now.

Here's the output of running a file where the CMap is missing entirely (see the end of this message); I ran this with the "a1" flag.

I'll try your tika-eval.jar on this file. Previously I ran the Python Tika against this pile of files. I'll look for the "out of vocabulary" statistic in that report.

-susan

On Tue, Jun 27, 2023 at 12:25 PM Tim Allison wrote:
> Over on Apache Tika (via PDFBox!), we report the number of characters
> without Unicode mappings, and, if you add our tika-eval jar, you can also
> get an "out of vocabulary" statistic that is an indicator that extracted
> text is garbage. Happy to chat over on u...@tika.apache.org on either of
> those topics.
>
> Would be interesting to see if veraPDF is also extracting unmapped Unicode
> chars...missing/broken fonts etc.

--
Susan Borda
Digital Preservation Projects Manager
Digital Preservation Unit
University of Michigan Libraries
Buhr Building
sbo...@umich.edu
*My office phone number is temporarily disconnected while I work remotely due to COVID-19. Please contact me via email.*

/Users/sborda/Desktop/PDFbook/0636920021483-master-examples/examples/BritishLibrary-PDF_Assessment_v1.3.pdf

A Level A conforming file shall specify the value of pdfaid:conformance as A.
PDFAIdentification
conformance == "A"
root/document[0]/metadata[0](545 0 obj PDMetadata)/XMPPackage[0]/Identification[0]
The "conformance" property of the PDF/A Identification Schema is B instead of "A" for PDF/A-1a conforming file.

The font dictionary shall include a ToUnicode entry whose value is a CMap stream object that maps character codes to Unicode values, as described in PDF Reference 5.9, unless the font meets any of the following three conditions:
(*) fonts that use the predefined encodings MacRomanEncoding, MacExpertEncoding or WinAnsiEncoding, or that use the predefined Identity-H or Identity-V CMaps;
(*) Type 1 fonts whose character names are taken from the Adobe standard Latin character set or the set of named characters in the Symbol font, as defined in PDF Reference Appendix D;
(*) Type 0 fonts whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1 or Adobe-Korea1 character collections.
Glyph
toUnicode != null
root/document[0]/pages[0](551 0 obj PDPage)/contentStream[0]/operators[19]/usedGlyphs[0](OBPMDO+Arial OBPMDO+Arial 3 0 0)
The font does not define Unicode character map
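For anyone reading along, the ToUnicode rule quoted in the report can be sketched roughly as below. This is only an illustration of the rule's logic, not how veraPDF or PDFBox implement it. The font is modelled as a plain Python dict, and the keys `GlyphNames` and `CIDCollection`, the parameter `standard_glyph_names`, and both function names are hypothetical names made up for this sketch.

```python
# Rough sketch of the ToUnicode rule's logic, under simplifying assumptions.
# A font is modelled as a plain dict; this is NOT a real PDF parser.

PREDEFINED_ENCODINGS = {"MacRomanEncoding", "MacExpertEncoding", "WinAnsiEncoding"}
PREDEFINED_CMAPS = {"Identity-H", "Identity-V"}
CJK_COLLECTIONS = {"Adobe-GB1", "Adobe-CNS1", "Adobe-Japan1", "Adobe-Korea1"}


def needs_to_unicode(font, standard_glyph_names=frozenset()):
    """Return True if none of the three exemptions in the rule applies."""
    encoding = font.get("Encoding")
    # Exemption 1: predefined encodings or predefined Identity CMaps.
    if encoding in PREDEFINED_ENCODINGS or encoding in PREDEFINED_CMAPS:
        return False
    # Exemption 2: Type 1 fonts using only standard glyph names.
    # "GlyphNames" is a hypothetical key; the caller supplies the name set.
    if font.get("Subtype") == "Type1":
        names = set(font.get("GlyphNames", ()))
        if names and names <= set(standard_glyph_names):
            return False
    # Exemption 3: Type 0 fonts whose descendant CIDFont uses a listed
    # character collection ("CIDCollection" is a hypothetical key).
    if font.get("Subtype") == "Type0" and font.get("CIDCollection") in CJK_COLLECTIONS:
        return False
    return True


def violates_rule(font, standard_glyph_names=frozenset()):
    """A font fails the rule if it needs a ToUnicode entry but has none."""
    return needs_to_unicode(font, standard_glyph_names) and "ToUnicode" not in font


# Hypothetical font dictionaries (the real subtype and encoding of the
# Arial subset in the report are not known from the output above).
no_mapping = {"Subtype": "TrueType", "BaseFont": "OBPMDO+Arial"}
win_ansi = {"Subtype": "TrueType", "Encoding": "WinAnsiEncoding"}
with_cmap = {"Subtype": "TrueType", "ToUnicode": "<CMap stream>"}

print(violates_rule(no_mapping), violates_rule(win_ansi), violates_rule(with_cmap))
```

A font with neither an exempting encoding nor a ToUnicode entry is the case that makes copied text come out as garbage, which matches what the report flags for the Arial subset.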
Re: Does preflight check for "character encoding"?
Over on Apache Tika (via PDFBox!), we report the number of characters without Unicode mappings, and, if you add our tika-eval jar, you can also get an "out of vocabulary" statistic that is an indicator that extracted text is garbage. Happy to chat over on u...@tika.apache.org on either of those topics.

Would be interesting to see if veraPDF is also extracting unmapped Unicode chars...missing/broken fonts etc.

On Tue, Jun 27, 2023 at 11:30 AM Susan Borda wrote:
> Thanks Tillman, exactly the info I needed.
Re: Does preflight check for "character encoding"?
Thanks Tilman, exactly the info I needed.

On Mon, Jun 26, 2023 at 10:21 PM Tilman Hausherr wrote:
> Hi,
>
> PDFBox preflight only checks for PDF/A-1b, not for any accessibility
> topics. Maybe your PDF isn't meant to be accessible, to prevent scraping.
> Try https://verapdf.org/
>
> Tilman
>
> On 26.06.2023 19:36, Susan Borda wrote:
> > Hi All-
> >
> > I'd like to check PDFs that have character encoding issues; does
> > Preflight do that? I checked the accessibility of a PDF file in Adobe Pro
> > and it gave me a "Character encoding -Failed" message. When I checked
> > this same file in Preflight I got this:
> >
> > Jun 26, 2023 1:24:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDICCBased ensureDisplayProfile
> > WARNING: ICC profile is Perceptual, ignoring, treating as Display class
> > Jun 26, 2023 1:24:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDICCBased ensureDisplayProfile
> > WARNING: ICC profile is Perceptual, ignoring, treating as Display class
> > Jun 26, 2023 1:24:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDICCBased ensureDisplayProfile
> > WARNING: ICC profile is Perceptual, ignoring, treating as Display class
> > The file BritishLibrary-PDF_Assessment_v1.3.pdf is a valid PDF/A-1b file
> >
> > When I try to copy/paste the text from this PDF it's all garbage and the
> > CMap is missing.
> >
> > Any advice would be greatly appreciated.
> >
> > Thanks,
> > susan

--
Susan Borda
Digital Preservation Projects Manager
Digital Preservation Unit
University of Michigan Libraries
Buhr Building
sbo...@umich.edu
When will the next version from the 3.x line be available?
Hi,

version 2.0.28 of PDFBox was released recently. Will there also be a new version from the 3.x line in the near future? Andreas Lehmkühler mentioned a possible beta1 release last month ( https://lists.apache.org/thread/0bgg6pd4d48qd49bxsdgvb9vsxr9r3v6 ).

We are interested in the 3.x line because of the reduced memory usage and because it is tested on newer Java versions.

Kind regards
Erik Brangs

*** Suchen. Finden. Entdecken. Deutsche Nationalbibliothek ***

--
Erik Brangs
Deutsche Nationalbibliothek
Informationstechnik
Adickesallee 1
60322 Frankfurt am Main
Telefon: +49 69 1525-1792
Telefax: +49 69 1525-1799
mailto:e.bra...@dnb.de
http://www.dnb.de