Re: When will the next version from the 3.x line be available?

2023-06-27 Thread Andreas Lehmkühler

Hi,

On 27.06.23 at 15:10, Brangs, Erik wrote:

> Hi,
>
> version 2.0.28 of PDFBox was released recently. Will there also be a new
> version from the 3.x line in the near future?

First of all, there will be another 2.0 release, hopefully tomorrow.

> Andreas Lehmkühler mentioned a possible beta1 release last month (
> https://lists.apache.org/thread/0bgg6pd4d48qd49bxsdgvb9vsxr9r3v6 ).

Yes, due to some personal issues I had to postpone the 3.0 release. Once
2.0.29 is out, I'm going to target the first beta of 3.0.

Andreas

> We are interested in the 3.x line because of the reduced memory usage and
> because it is tested on newer Java versions.
>
> Kind regards
> Erik Brangs
> *** Search. Find. Discover. Deutsche Nationalbibliothek ***



-
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org



Re: Does preflight check for "character encoding"?

2023-06-27 Thread Susan Borda
Thanks Tim!

I just ran veraPDF on 140K PDFs using the UA1 flag; it gave me a 9+ GB XML
file that I'm parsing now.
Here's the output from running a file where the CMap is missing entirely. I
ran this one with the "a1" flag.

I'll try your tika-eval.jar on this file. Previously I ran the Python Tika
against this pile of files. I'll look for the "out of vocabulary" statistic
in that report.
-susan

On Tue, Jun 27, 2023 at 12:25 PM Tim Allison wrote:

> Over on Apache Tika (via PDFBox!), we report the number of characters
> without Unicode mappings, and, if you add our tika-eval jar, you can also
> get an "out of vocabulary" statistic that is an indicator that extracted
> text is garbage. Happy to chat over on u...@tika.apache.org on either of
> those topics.
>
> Would be interesting to see if veraPDF is also extracting unmapped Unicode
> chars...missing/broken fonts etc.
>
> On Tue, Jun 27, 2023 at 11:30 AM Susan Borda wrote:
>
> > Thanks Tilman, exactly the info I needed.
> >
> > On Mon, Jun 26, 2023 at 10:21 PM Tilman Hausherr wrote:
> >
> > > Hi,
> > > PDFBox preflight only checks for PDF/A-1b, not for any accessibility
> > > topics. Maybe your PDF isn't meant to be accessible to prevent scraping.
> > > Try https://verapdf.org/
> > > Tilman
> > >
> > > On 26.06.2023 19:36, Susan Borda wrote:
> > > > Hi All-
> > > > I'd like to check PDFs that have character encoding issues, does Preflight
> > > > do that? I checked the accessibility of a pdf file in Adobe Pro and it gave
> > > > me a "Character encoding -Failed" message. When I checked this same file in
> > > > Preflight I got this:
> > > >
> > > > Jun 26, 2023 1:24:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDICCBased ensureDisplayProfile
> > > > WARNING: ICC profile is Perceptual, ignoring, treating as Display class
> > > > Jun 26, 2023 1:24:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDICCBased ensureDisplayProfile
> > > > WARNING: ICC profile is Perceptual, ignoring, treating as Display class
> > > > Jun 26, 2023 1:24:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDICCBased ensureDisplayProfile
> > > > WARNING: ICC profile is Perceptual, ignoring, treating as Display class
> > > > The file BritishLibrary-PDF_Assessment_v1.3.pdf is a valid PDF/A-1b file
> > > >
> > > > When I try to copy/paste the text from this PDF it's all garbage and the
> > > > CMap is missing.
> > > >
> > > > Any advice would be greatly appreciated.
> > > > Thanks,
> > > > susan
> > >
> > >
> > >
> > > -
> > > To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> > > For additional commands, e-mail: users-h...@pdfbox.apache.org
> > >
> > >
> >
> > --
> > Susan Borda
> > Digital Preservation Projects Manager
> > Digital Preservation Unit
> > University of Michigan Libraries
> > Buhr Building
> > sbo...@umich.edu
> > *My office phone number is temporarily disconnected while I work remotely
> > due to COVID-19. Please contact me via email.*
> >
>


-- 
Susan Borda
Digital Preservation Projects Manager
Digital Preservation Unit
University of Michigan Libraries
Buhr Building
sbo...@umich.edu
*My office phone number is temporarily disconnected while I work remotely
due to COVID-19. Please contact me via email.*

veraPDF output:

File: /Users/sborda/Desktop/PDFbook/0636920021483-master-examples/examples/BritishLibrary-PDF_Assessment_v1.3.pdf

Failed rule:
  Description: A Level A conforming file shall specify the value of pdfaid:conformance as A.
  Object: PDFAIdentification
  Test: conformance == "A"
  Context: root/document[0]/metadata[0](545 0 obj PDMetadata)/XMPPackage[0]/Identification[0]
  Error: The "conformance" property of the PDF/A Identification Schema is B instead of "A" for PDF/A-1a conforming file.

Failed rule:
  Description: The font dictionary shall include a ToUnicode entry whose value is a CMap stream object that maps character codes to Unicode values, as described in PDF Reference 5.9, unless the font meets any of the following three conditions:
    (*) fonts that use the predefined encodings MacRomanEncoding, MacExpertEncoding or WinAnsiEncoding, or that use the predefined Identity-H or Identity-V CMaps;
    (*) Type 1 fonts whose character names are taken from the Adobe standard Latin character set or the set of named characters in the Symbol font, as defined in PDF Reference Appendix D;
    (*) Type 0 fonts whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1 or Adobe-Korea1 character collections.
  Object: Glyph
  Test: toUnicode != null
  Context: root/document[0]/pages[0](551 0 obj PDPage)/contentStream[0]/operators[19]/usedGlyphs[0](OBPMDO+Arial OBPMDO+Arial 3 0  0)
  Error: The font does not define Unicode character map
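
A multi-gigabyte XML report is easier to handle with a streaming parser than
by loading it into memory. The following is a minimal sketch using the JDK's
StAX API. It assumes the report contains "rule" elements that carry "status"
and "clause" attributes; those names, the class name FailedRuleCounter, and
the sample data are assumptions for illustration and should be checked
against the actual report before use.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class FailedRuleCounter {

    /**
     * Streams through the report and counts failed "rule" elements per clause.
     * Element and attribute names are assumptions about the report layout.
     */
    public static Map<String, Integer> countFailedRules(Reader report)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The report is a plain local file, so DTDs and external entities are not needed.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);

        Map<String, Integer> counts = new TreeMap<>();
        XMLStreamReader xml = factory.createXMLStreamReader(report);
        try {
            while (xml.hasNext()) {
                // Only start tags matter here; all text content is skipped.
                if (xml.next() == XMLStreamConstants.START_ELEMENT
                        && "rule".equals(xml.getLocalName())
                        && "failed".equals(xml.getAttributeValue(null, "status"))) {
                    String clause = xml.getAttributeValue(null, "clause");
                    counts.merge(clause == null ? "(no clause)" : clause, 1, Integer::sum);
                }
            }
        } finally {
            xml.close();
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical miniature report; the clause values are placeholders.
        String sample = "<report>"
                + "<rule clause=\"example-1\" status=\"failed\"/>"
                + "<rule clause=\"example-1\" status=\"failed\"/>"
                + "<rule clause=\"example-2\" status=\"passed\"/>"
                + "</report>";
        System.out.println(countFailedRules(new StringReader(sample)));
    }
}
```

For a real report, the StringReader would be replaced by a buffered reader
over the file, so that only the current parse event is held in memory.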


  

Re: Does preflight check for "character encoding"?

2023-06-27 Thread Tim Allison
Over on Apache Tika (via PDFBox!), we report the number of characters
without Unicode mappings, and, if you add our tika-eval jar, you can also
get an "out of vocabulary" statistic that is an indicator that extracted
text is garbage. Happy to chat over on u...@tika.apache.org on either of
those topics.

Would be interesting to see if veraPDF is also extracting unmapped Unicode
chars...missing/broken fonts etc.

On Tue, Jun 27, 2023 at 11:30 AM Susan Borda wrote:

> Thanks Tilman, exactly the info I needed.
>
> On Mon, Jun 26, 2023 at 10:21 PM Tilman Hausherr wrote:
>
> > Hi,
> > PDFBox preflight only checks for PDF/A-1b, not for any accessibility
> > topics. Maybe your PDF isn't meant to be accessible to prevent scraping.
> > Try https://verapdf.org/
> > Tilman
> >
> > On 26.06.2023 19:36, Susan Borda wrote:
> > > Hi All-
> > > I'd like to check PDFs that have character encoding issues, does Preflight
> > > do that? I checked the accessibility of a pdf file in Adobe Pro and it gave
> > > me a "Character encoding -Failed" message. When I checked this same file in
> > > Preflight I got this:
> > >
> > > Jun 26, 2023 1:24:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDICCBased ensureDisplayProfile
> > > WARNING: ICC profile is Perceptual, ignoring, treating as Display class
> > > Jun 26, 2023 1:24:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDICCBased ensureDisplayProfile
> > > WARNING: ICC profile is Perceptual, ignoring, treating as Display class
> > > Jun 26, 2023 1:24:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDICCBased ensureDisplayProfile
> > > WARNING: ICC profile is Perceptual, ignoring, treating as Display class
> > > The file BritishLibrary-PDF_Assessment_v1.3.pdf is a valid PDF/A-1b file
> > >
> > > When I try to copy/paste the text from this PDF it's all garbage and the
> > > CMap is missing.
> > >
> > > Any advice would be greatly appreciated.
> > > Thanks,
> > > susan
>
> --
> Susan Borda
> Digital Preservation Projects Manager
> Digital Preservation Unit
> University of Michigan Libraries
> Buhr Building
> sbo...@umich.edu
> *My office phone number is temporarily disconnected while I work remotely
> due to COVID-19. Please contact me via email.*
>
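
The "out of vocabulary" idea mentioned in this message can be illustrated
with a toy calculation: the share of extracted word tokens that do not
appear in a reference word list. This is a simplified sketch of the general
concept only, not tika-eval's implementation; the class name OovRatio, the
tokenization rule, and the tiny word list are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class OovRatio {

    /**
     * Returns the fraction of word tokens in the text that are missing from
     * the vocabulary, or 0.0 when the text contains no word tokens.
     */
    public static double oovRatio(String text, Set<String> vocabulary) {
        int total = 0;
        int unknown = 0;
        // Lowercase the text and split on every run of non-letter characters.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // a leading separator yields one empty token
            }
            total++;
            if (!vocabulary.contains(token)) {
                unknown++;
            }
        }
        return total == 0 ? 0.0 : (double) unknown / total;
    }

    public static void main(String[] args) {
        // Hypothetical word list; a real check would use a full dictionary.
        Set<String> vocabulary = new HashSet<>(
                Arrays.asList("the", "file", "is", "a", "valid", "pdf"));
        System.out.println(oovRatio("The file is a valid PDF", vocabulary));
        System.out.println(oovRatio("Xqz vbnk the wwrt", vocabulary));
    }
}
```

A ratio close to 1.0 for text in a known language suggests that the
extracted characters do not map to real words, which is the symptom
described for PDFs with missing Unicode mappings.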


Re: Does preflight check for "character encoding"?

2023-06-27 Thread Susan Borda
Thanks Tilman, exactly the info I needed.

On Mon, Jun 26, 2023 at 10:21 PM Tilman Hausherr wrote:

> Hi,
> PDFBox preflight only checks for PDF/A-1b, not for any accessibility
> topics. Maybe your PDF isn't meant to be accessible to prevent scraping.
> Try https://verapdf.org/
> Tilman
>
> On 26.06.2023 19:36, Susan Borda wrote:
> > Hi All-
> > I'd like to check PDFs that have character encoding issues, does Preflight
> > do that? I checked the accessibility of a pdf file in Adobe Pro and it gave
> > me a "Character encoding -Failed" message. When I checked this same file in
> > Preflight I got this:
> >
> > Jun 26, 2023 1:24:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDICCBased ensureDisplayProfile
> > WARNING: ICC profile is Perceptual, ignoring, treating as Display class
> > Jun 26, 2023 1:24:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDICCBased ensureDisplayProfile
> > WARNING: ICC profile is Perceptual, ignoring, treating as Display class
> > Jun 26, 2023 1:24:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDICCBased ensureDisplayProfile
> > WARNING: ICC profile is Perceptual, ignoring, treating as Display class
> > The file BritishLibrary-PDF_Assessment_v1.3.pdf is a valid PDF/A-1b file
> >
> > When I try to copy/paste the text from this PDF it's all garbage and the
> > CMap is missing.
> >
> > Any advice would be greatly appreciated.
> > Thanks,
> > susan

-- 
Susan Borda
Digital Preservation Projects Manager
Digital Preservation Unit
University of Michigan Libraries
Buhr Building
sbo...@umich.edu
*My office phone number is temporarily disconnected while I work remotely
due to COVID-19. Please contact me via email.*


When will the next version from the 3.x line be available?

2023-06-27 Thread Brangs, Erik
Hi,

version 2.0.28 of PDFBox was released recently. Will there also be a new 
version from the 3.x line in the near future?

Andreas Lehmkühler mentioned a possible beta1 release last month ( 
https://lists.apache.org/thread/0bgg6pd4d48qd49bxsdgvb9vsxr9r3v6 ).

We are interested in the 3.x line because of the reduced memory usage and 
because it is tested on newer Java versions.


Kind regards
Erik Brangs
*** Search. Find. Discover. Deutsche Nationalbibliothek ***

-- 
Erik Brangs
Deutsche Nationalbibliothek
Informationstechnik
Adickesallee 1
60322 Frankfurt am Main
Phone: +49 69 1525-1792
Fax: +49 69 1525-1799
mailto:e.bra...@dnb.de
http://www.dnb.de