Hi! I'm totally new to Apache PDFBox, so please bear with any stupid questions :-)
In my work I do some digital archiving, where I usually OCR scanned PDF images with Tesseract and then convert from PDF to PDF/A-1b with Ghostscript. However, there has recently been a change in the OCR specifications (I don't really know when or exactly how), and the consequence is that Ghostscript now mangles and alters the OCR text so that it can't be used. As far as I understand, it has something to do with the ToUnicode CMap processing.

I tried some other software for the conversion and the problem does not occur there. That's why I would also like to try the conversion with PDFBox to see what happens. The problem is that I have absolutely no idea how to do this, and I'm not really into Java-based software.

Can it be done, and how is it done? Preferably from the Linux command line. I saw this page, https://pdfbox.apache.org/1.8/cookbook/pdfacreation.html, but I can't make any sense of it. Is something like this possible (where the options might include the compatibility level)?

java -jar pdfbox-app-x.y.z.jar Convert [OPTIONS] <inputfile> [outputfile]

Many thanks for your effort!

Best regards
Paul Bergström
Sweden

