Re: Extract text from PDF, wrong sort order

Gilad Denneboom Sat, 16 Jan 2016 03:59:09 -0800

The internal structure of a PDF file does not always correspond to the
visual structure you see on the page, or to what we would interpret as the
"correct" structure. PDFBox extracts the former, but you can write your own
text parser that does it differently, for example by sorting the characters
on the page according to their x,y location. This is not a simple task, though.
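
To give a rough idea of what sorting by x,y location could look like, here
is a minimal sketch in plain Java. It is not PDFBox code: the Glyph class,
the toText method and the lineTolerance parameter are hypothetical names made
up for this illustration. In a real parser the positions would probably come
from PDFBox's TextPosition objects (getUnicode(), getXDirAdj(), getYDirAdj()),
collected for example in a PDFTextStripper subclass. The sketch assumes y
grows downward and that text on the same visual line differs in y by no more
than the tolerance.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class PositionSort {

    /** Hypothetical stand-in for a positioned chunk of text on a page. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups chunks into visual lines by y, then orders each line by x.
     * Assumes y grows downward; lineTolerance is the largest y difference
     * still treated as "same line" (measured from the line's first chunk).
     */
    static String toText(List<Glyph> glyphs, float lineTolerance) {
        // Work on a copy so the caller's list keeps its original order.
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        // Walk top to bottom and start a new line when y jumps too far.
        List<List<Glyph>> lines = new ArrayList<>();
        float lineY = 0f;
        for (Glyph g : sorted) {
            if (lines.isEmpty() || Math.abs(g.y - lineY) > lineTolerance) {
                lines.add(new ArrayList<>());
                lineY = g.y;
            }
            lines.get(lines.size() - 1).add(g);
        }

        // Within each line, order left to right and join with spaces.
        StringBuilder out = new StringBuilder();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            if (out.length() > 0) {
                out.append('\n');
            }
            for (int i = 0; i < line.size(); i++) {
                if (i > 0) {
                    out.append(' ');
                }
                out.append(line.get(i).text);
            }
        }
        return out.toString();
    }
}
```

Even this simple version shows why the task is hard in general: a fixed
tolerance breaks on multi-column layouts, rotated text, superscripts and
tables whose cells do not share a baseline, so a real document will likely
need more than this.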


On Sat, Jan 16, 2016 at 12:52 PM, Diogo Ribeiro <[email protected]>
wrote:

> Hi guys,
>
> I'm using PDFBox 1.8.10 to extract some text from a PDF (see attachment).
>
> The output lines are not correctly sorted.
>
> Got:
>
> 1/435 S LOPES CÂNDIDO FELIX LOPESABEL DIA 27-09-1964
> FRANCISCA MARIA DIAS
>
> Was expecting:
>
> 1/435 ABEL DIAS LOPES CÂNDIDO FELIX LOPES 27-09-1964
> FRANCISCA MARIA DIAS
>
> My simple code:
>
>         PDDocument pdf = PDDocument.load(new File(FILE_PATH));
>
>         PDFTextStripper stripper = new PDFTextStripper();
>
>         stripper.setStartPage(1);
>         stripper.setEndPage(1);
>         stripper.setSortByPosition(true);
>
>         String plainText = stripper.getText(pdf);
>
>         System.out.println(plainText);
>
>
> Thanks in advance.
