Hi Paul,

I just finished doing exactly that for the same reason (archiving documents). I wrote a Java program that can be called from the command line. It would probably need to be tweaked to suit your exact needs, but I would be happy to share it and provide some help with customizing it.

I'm away from my desk right now, but I will follow up with you later today.

Mick Davis

On Jul 6, 2016 8:30 AM, "Paul Bergstrom" wrote:

> Hi!
>
> I'm totally new to Apache PDFBox, so please bear with any stupid questions :-)
>
> In my work I do some digital archiving, where I usually OCR scanned
> PDF images with Tesseract and then do the conversion from PDF to PDF/A-1b
> with Ghostscript.
>
> However, there has recently been a change in the OCR specifications -
> I don't really know when or exactly how - and the consequence is that
> Ghostscript now mangles and alters the OCR text so it can't be used. As
> I understand it, it has something to do with the ToUnicode CMap processing.
>
> I tried some other software to do the conversion, and the problem
> does not occur there. That's why I would also like to try the
> conversion with PDFBox to see what happens.
>
> The problem is that I have absolutely no idea how to do this. I'm not really
> into Java-based software. Can it be done, and how is it done? Preferably from
> the Linux command line.
>
> I saw this https://pdfbox.apache.org/1.8/cookbook/pdfacreation.html but I
> can't make any sense of it.
>
> Is something like this possible:
>
> java -jar pdfbox-app-x.y.z.jar Convert [OPTIONS] <inputfile> [outputfile]
> (where the options might be the compatibility level)?
>
> Many thanks for your effort!
>
> Best regards
>
> Paul Bergström
> Sweden

