The internal structure of a PDF file does not always correspond to the visual structure you see on the page, or to what we would interpret as the "correct" structure. PDFBox extracts the former, but you can write your own text parser that does it differently, for example by sorting the characters on the page according to their x,y location. This is not a simple task, though.
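
To give a rough idea of what sorting by x,y location involves, here is a minimal sketch in plain Java. It does not call PDFBox at all: `PositionSorter`, `Chunk` and `toLines` are hypothetical names made up for this example, and the coordinates in `main` are invented. In PDFBox the positions could probably be collected by overriding `processTextPosition(TextPosition)` in a `PDFTextStripper` subclass and reading `getXDirAdj()` / `getYDirAdj()`, but that wiring is left out here. The sketch assumes y grows downward and that each chunk is already a whole word or phrase, so it simply joins chunks with a space.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class PositionSorter {

    /** Hypothetical stand-in for one positioned piece of text on a page. */
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups chunks into lines (y values within yTolerance of the first
     * chunk of the line), then orders each line left to right by x.
     * Assumes y grows downward, so smaller y means higher on the page.
     */
    static List<String> toLines(List<Chunk> chunks, float yTolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble(c -> c.y));

        List<String> lines = new ArrayList<>();
        List<Chunk> current = new ArrayList<>();
        float lineY = 0f;
        for (Chunk c : sorted) {
            // A jump in y larger than the tolerance starts a new line.
            if (!current.isEmpty() && Math.abs(c.y - lineY) > yTolerance) {
                lines.add(joinByX(current));
                current.clear();
            }
            if (current.isEmpty()) {
                lineY = c.y;
            }
            current.add(c);
        }
        if (!current.isEmpty()) {
            lines.add(joinByX(current));
        }
        return lines;
    }

    /** Sorts one line's chunks by x and joins them with single spaces. */
    private static String joinByX(List<Chunk> line) {
        line.sort(Comparator.comparingDouble(c -> c.x));
        StringBuilder sb = new StringBuilder();
        for (Chunk c : line) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Invented coordinates, deliberately listed out of visual order.
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("27-09-1964", 400f, 100.0f));
        chunks.add(new Chunk("FRANCISCA MARIA DIAS", 200f, 112.0f));
        chunks.add(new Chunk("ABEL DIAS LOPES", 60f, 100.3f));
        chunks.add(new Chunk("1/435", 10f, 99.8f));
        for (String line : toLines(chunks, 2f)) {
            System.out.println(line);
        }
    }
}
```

Even this toy version has to pick a tolerance for "same line", and a real parser would also need to handle multiple columns, rotated text, and deciding where spaces belong between individual characters, which is where most of the difficulty lies.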

On Sat, Jan 16, 2016 at 12:52 PM, Diogo Ribeiro <[email protected]> wrote:
> Hi guys,
>
> I'm using PDFBox 1.8.10 to extract some text from a PDF (see attachment).
>
> The output lines are not correctly sorted.
>
> Got:
>
> 1/435 S LOPES CÂNDIDO FELIX LOPESABEL DIA 27-09-1964
> FRANCISCA MARIA DIAS
>
> Was expecting:
>
> 1/435 ABEL DIAS LOPES CÂNDIDO FELIX LOPES 27-09-1964
> FRANCISCA MARIA DIAS
>
> My simple code:
>
> PDDocument pdf = PDDocument.load(new File(FILE_PATH));
>
> PDFTextStripper stripper = new PDFTextStripper();
>
> stripper.setStartPage(1);
> stripper.setEndPage(1);
> stripper.setSortByPosition(true);
>
> String plainText = stripper.getText(pdf);
>
> System.out.println(plainText);
>
> Thanks in advance.

