Re: PDF to PDF/A conversion

Tilman Hausherr Wed, 06 Jul 2016 09:34:59 -0700

On 06.07.2016 at 13:22, Paul Bergstrom wrote:

Hi!


I'm totally new to Apache PDFBox, so please bear with any stupid questions :-)

In my work I do some digital archiving where I usually OCR scanned PDF-images 
with Tesseract and then do the conversion from PDF to PDF/A-1b with Ghostscript.

However, there has recently been a change in the OCR specifications - don't 
really know when and exactly how - but the consequences are that Ghostscript 
is now mangling and altering the OCR text so it can't be used. As far as I 
understand, it has something to do with the ToUnicode CMap processing.

However I tried some other software to do the conversion and the problem does 
not occur there. That's why I also would like to try to do the conversion with 
PDFBox to see what happens.

The problem is I have absolutely no idea how to do this. I'm not really into 
Java-based software. Can it be done, and how is it done? Preferably from the 
Linux command line.

I saw this https://pdfbox.apache.org/1.8/cookbook/pdfacreation.html but I can't 
make any sense out of it.

Is something like this possible:

java -jar pdfbox-app-x.y.z.jar Convert [OPTIONS] <inputfile> [outputfile] 
(where options might be the compatibility level)?

We don't have a tool that converts PDF to PDF/A-1b (there are commercial tools that do that, e.g. from Callas or PDF-Tools). It might be possible to implement this if the flaws are known in advance. Usually the metadata and the output intent are missing, but there might be much more.
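
As a rough way to see whether those two usual flaws apply to a given file, a small stand-alone Java program along the following lines might help. It is only a sketch and does not use PDFBox: the class name PdfaQuickCheck and the method hasKey are made up for this example, and it merely scans the raw bytes for the /Metadata and /OutputIntents keys. It assumes those keys are stored uncompressed, so it can miss entries held in object streams, and a hit says nothing about whether the content is valid for PDF/A-1b.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox: a crude pre-check, not a validator.
public class PdfaQuickCheck {

    // Returns true if the given key name occurs anywhere in the raw PDF bytes.
    // ISO-8859-1 maps every byte to one char, so binary data is not corrupted.
    static boolean hasKey(byte[] pdf, String key) {
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        return text.contains(key);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfaQuickCheck.java <file.pdf>");
            System.exit(2);
        }
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        // Assumption: both keys appear uncompressed in the file.
        System.out.println("/Metadata present:      " + hasKey(pdf, "/Metadata"));
        System.out.println("/OutputIntents present: " + hasKey(pdf, "/OutputIntents"));
    }
}
```

With Java 11 or later this should run directly from the command line as "java PdfaQuickCheck.java scan.pdf" (scan.pdf being a placeholder file name). If either line prints false, that item would most likely have to be added before the file could pass as PDF/A-1b, but other problems may still remain.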


Tilman

