Re: Wrong extracted text order from a PDF

Jukka Zitting Mon, 04 Apr 2011 02:15:48 -0700

Hi,

On 04/02/2011 03:24 PM, Hesham G. wrote:

> I have a PDF file that I am extracting data from it using PDFBox
> v1.5. If i copy text from it manually like: "SUPPLY FAN | G0320
> B11-14998" to Notepad, it is copied fine ... But in PDFBox it is read
> like this: "SUPPLY FAN | B11-14998G0320" ... Many other text does the
> same thing. You can test a 1 page sample PDF here :
> http://www.4shared.com/document/XDzWQFyY/wrong_extracted_text_sample.html

Enabling the sortByPosition option [1] in the text extraction typically
helps solve problems like this. See also the equivalent -sort option of
the ExtractText command [2].

[1] http://pdfbox.apache.org/apidocs/org/apache/pdfbox/util/PDFTextStripper.html#setSortByPosition(boolean)

[2] http://pdfbox.apache.org/commandlineutilities/ExtractText.html
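
To illustrate why sorting by position can help, here is a small
conceptual sketch in plain Java. It is not PDFBox code and does not
claim to reproduce PDFBox's actual sorting logic: the Chunk class, the
method names and the coordinates are all made up for the example. It
only shows the general idea that a PDF may store text runs in a
different order than they appear on the page, and that ordering them by
(y, x) position restores the reading order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Conceptual sketch only, NOT PDFBox code. Chunk, sample, streamOrder and
// visualOrder are hypothetical names invented for this illustration.
class SortByPositionSketch {

    // A run of text plus the (assumed) position it is drawn at on the page.
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Chunks in the order the file happens to store them. The coordinates
    // are invented: all three runs sit on the same line (same y).
    static List<Chunk> sample() {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("SUPPLY FAN |", 10, 100));
        chunks.add(new Chunk("B11-14998", 150, 100)); // drawn to the right, stored first
        chunks.add(new Chunk("G0320", 100, 100));
        return chunks;
    }

    // Joins the chunk texts with single spaces, in list order.
    static String join(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    // Text as stored in the file (what unsorted extraction would yield here).
    static String streamOrder(List<Chunk> chunks) {
        return join(chunks);
    }

    // Text after ordering the chunks by line (y), then left to right (x).
    static String visualOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        return join(sorted);
    }

    public static void main(String[] args) {
        System.out.println(streamOrder(sample())); // SUPPLY FAN | B11-14998 G0320
        System.out.println(visualOrder(sample())); // SUPPLY FAN | G0320 B11-14998
    }
}
```

With PDFBox itself none of this needs to be written by hand: calling
setSortByPosition(true) on the PDFTextStripper before getText() should
be the only change required.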

--
Jukka Zitting
