Re: PDFTextStripper() does not extract text correct

Tilman Hausherr Thu, 08 Nov 2018 20:14:10 -0800

On 08.11.2018 at 15:54, JZ Q wrote:

Hi everyone,


I used the following code (lib version 2.0.12) to extract text from a
PDF file. It appears that the number "3" is occasionally extracted as "6";
for example, E4283211 becomes E4286211.

Is this normal? Is the code using OCR? Thanks.


No, PDFBox doesn't have OCR (but Tika has it as an option).

It could be that your PDF is an image with invisible OCR text. Please link to your PDF somewhere (don't attach it).
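
If known-good values are available, a small stand-alone comparison can show whether each mismatch is a single-character confusion, which would fit text produced by OCR rather than by the extraction step. This is only a minimal sketch in plain Java; it does not use PDFBox, and the class and method names (CharDiff, diff) are hypothetical, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: reports where two strings differ.
class CharDiff {

    /** Returns one entry per index at which the two equal-length strings differ. */
    static List<String> diff(String expected, String extracted) {
        if (expected.length() != extracted.length()) {
            throw new IllegalArgumentException("strings must have equal length");
        }
        List<String> differences = new ArrayList<>();
        for (int i = 0; i < expected.length(); i++) {
            char e = expected.charAt(i);
            char x = extracted.charAt(i);
            if (e != x) {
                // Record the zero-based index and the substituted characters.
                differences.add(i + ": '" + e + "' -> '" + x + "'");
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        // The example from the question: expected value vs. extracted value.
        System.out.println(diff("E4283211", "E4286211"));
    }
}
```

For the example from the question, this reports a single difference at index 4, where '3' was replaced by '6'.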


Tilman



PDFTextStripper pdfStripper = new PDFTextStripper();
pdfStripper.setStartPage(i);
pdfStripper.setEndPage(i);

String text = pdfStripper.getText(pdDoc);
String[] docxLines = text.split(System.lineSeparator());
for (String line : docxLines) {



