Over on Apache Tika (via PDFBox!), we report the number of characters without Unicode mappings, and, if you add our tika-eval jar, you can also get an "out of vocabulary" statistic that is an indicator that extracted text is garbage. Happy to chat over on u...@tika.apache.org on either of those topics.
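To make the two signals above concrete, here is a minimal sketch in Python of how one might approximate them on text that has already been extracted. This is not Tika's or tika-eval's implementation: the function names, the tiny word list, and the choice of "suspicious" character categories (U+FFFD, private-use, unassigned, control) are all assumptions made for illustration.

```python
import re
import unicodedata

# Tiny illustrative vocabulary (an assumption); a real check would use a
# large per-language word list.
COMMON_WORDS = {
    "the", "of", "and", "a", "to", "in", "is", "that", "it", "for",
    "this", "file", "pdf", "text", "was", "with", "as", "on", "be", "are",
}


def suspicious_char_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are U+FFFD, private-use,
    unassigned, or control characters (a rough proxy for unmapped glyphs)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1
        for c in chars
        if c == "\ufffd" or unicodedata.category(c) in {"Co", "Cn", "Cc"}
    )
    return bad / len(chars)


def oov_ratio(text: str, vocabulary=COMMON_WORDS) -> float:
    """Fraction of alphabetic tokens that are not in the vocabulary."""
    # [^\W\d_]+ matches runs of letters only (no digits or underscores).
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", text)]
    if not tokens:
        return 0.0
    missing = sum(1 for t in tokens if t not in vocabulary)
    return missing / len(tokens)
```

A high value from either function suggests the extracted text deserves a closer look. Sensible thresholds depend on the corpus and language, so none are suggested here.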
Would be interesting to see if veraPDF is also extracting unmapped Unicode chars...missing/broken fonts etc.

On Tue, Jun 27, 2023 at 11:30 AM Susan Borda <sbo...@umich.edu> wrote:

> Thanks Tilman, exactly the info I needed.
>
> On Mon, Jun 26, 2023 at 10:21 PM Tilman Hausherr <thaush...@t-online.de> wrote:
>
> > Hi,
> >
> > PDFBox preflight only checks for PDF/A-1b, not for any accessibility
> > topics. Maybe your PDF isn't meant to be accessible to prevent scraping.
> > Try https://verapdf.org/
> >
> > Tilman
> >
> > On 26.06.2023 19:36, Susan Borda wrote:
> > > Hi All-
> > > I'd like to check PDFs that have character encoding issues, does Preflight
> > > do that? I checked the accessibility of a pdf file in Adobe Pro and it gave
> > > me a "Character encoding -Failed" message. When I checked this same file in
> > > Preflight I got this:
> > >
> > > Jun 26, 2023 1:24:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDICCBased ensureDisplayProfile
> > > WARNING: ICC profile is Perceptual, ignoring, treating as Display class
> > > Jun 26, 2023 1:24:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDICCBased ensureDisplayProfile
> > > WARNING: ICC profile is Perceptual, ignoring, treating as Display class
> > > Jun 26, 2023 1:24:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDICCBased ensureDisplayProfile
> > > WARNING: ICC profile is Perceptual, ignoring, treating as Display class
> > > The file BritishLibrary-PDF_Assessment_v1.3.pdf is a valid PDF/A-1b file
> > >
> > > When I try to copy/paste the text from this PDF it's all garbage and the
> > > CMap is missing.
> > >
> > > Any advice would be greatly appreciated.
> > > Thanks,
> > > susan
>
> --
> Susan Borda
> Digital Preservation Projects Manager
> Digital Preservation Unit
> University of Michigan Libraries
> Buhr Building
> sbo...@umich.edu
> *My office phone number is temporarily disconnected while I work remotely
> due to COVID-19. Please contact me via email.*