On 08.11.2018 at 15:54, JZ Q wrote:
Hi everyone,

I used the following code (lib version 2.0.12) to extract text from a
PDF file. It appears that the digit "3" is occasionally read as "6"; for
example, E4283211 becomes E4286211.

Is this normal? Is the code using OCR? Thanks.

No, PDFBox doesn't have OCR (but Tika has it as an option).

It could be that your PDF is an image with an invisible OCR text layer. Please upload your PDF somewhere and post a link to it (don't attach it).

Tilman
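
If the PDF does carry an invisible OCR text layer, any misread digits are stored in the file itself, and the extraction code only returns what is there. As a small diagnostic, the sketch below compares a value read off the page with the value returned by text extraction and reports where they differ. It uses only the Java standard library; the class and method names (OcrMisreadCheck, mismatchPositions) are hypothetical and not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

public class OcrMisreadCheck {

    // Returns the 0-based indices at which the two strings differ.
    // Assumption: a simple position-by-position comparison is enough;
    // if the lengths differ, the extra trailing positions count as mismatches.
    public static List<Integer> mismatchPositions(String expected, String extracted) {
        List<Integer> positions = new ArrayList<>();
        int max = Math.max(expected.length(), extracted.length());
        for (int i = 0; i < max; i++) {
            boolean inBoth = i < expected.length() && i < extracted.length();
            if (!inBoth || expected.charAt(i) != extracted.charAt(i)) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // Values from the question: what the page shows vs. what was extracted.
        System.out.println(mismatchPositions("E4283211", "E4286211")); // prints [4]
    }
}
```

Running this over several affected values would show whether the substitution is always "3" to "6", which is the kind of systematic error an OCR layer tends to produce.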



PDFTextStripper pdfStripper = new PDFTextStripper();
pdfStripper.setStartPage(i);
pdfStripper.setEndPage(i);

String text = pdfStripper.getText(pdDoc);
String[] docxLines = text.split(System.lineSeparator());
for (String line : docxLines) {


