On 29.05.2017 at 08:56, Kunal Kashyap wrote:
I am trying to read text data from a PDF file using the PDFBox API. I want to skip all chart data and images in the output .txt file. Can anyone help me with this? I also want to extract the data with its alignment preserved. Please find attached the sample PDF file and a sample .txt file (this is my desired output).

Please have a look at the ExtractTextByArea.java example in the source code download; it shows how to extract text from a predefined area.
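
A minimal sketch of that approach, assuming PDFBox 2.x on the classpath. The class name AreaTextSketch, the region name "body" and the rectangle coordinates are made up for illustration; you would choose a rectangle that leaves out the charts. As far as I know, the rectangle is given in points and measured from the top-left corner of the page.

```java
import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

public class AreaTextSketch {

    /** Returns the text found inside the given rectangle on one page. */
    static String textInArea(PDDocument document, int pageIndex, Rectangle2D area)
            throws IOException {
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.addRegion("body", area);           // "body" is an arbitrary region name
        PDPage page = document.getPage(pageIndex);  // zero-based page index
        stripper.extractRegions(page);
        return stripper.getTextForRegion("body");
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the PDF file (PDFBox 2.x loading API)
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            // Illustrative values only: x, y, width, height
            Rectangle2D area = new Rectangle2D.Float(10, 280, 275, 60);
            System.out.println(textInArea(document, 0, area));
        }
    }
}
```

Several regions can be registered on the same stripper with different names, which is one way to collect the text blocks around a chart while skipping the chart itself.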

There is no way in PDF to "exclude tables" because there is no table concept in PDF like in HTML. It's just a bunch of lines with text. You would need heuristics to guess what's a table and what isn't.
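
To illustrate what such a heuristic could look like, here is a deliberately crude sketch in plain Java that works on already extracted text lines. The rule (a line with at least three tokens, at least half of them numeric, is treated as a table row) and the names TableLineFilter, looksLikeTableRow and dropTableRows are assumptions made up for this example, not anything PDFBox provides. It will misclassify some lines and would need tuning for real documents.

```java
import java.util.ArrayList;
import java.util.List;

public class TableLineFilter {

    /** Crude guess: several tokens, at least half of them numeric, means "table row". */
    static boolean looksLikeTableRow(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        String[] tokens = trimmed.split("\\s+");
        if (tokens.length < 3) {
            return false;
        }
        int numeric = 0;
        for (String token : tokens) {
            // Matches things like 2016, 1,234.50, -3, 98% or (12)
            if (token.matches("[-+(]?\\d[\\d.,]*[%)]?")) {
                numeric++;
            }
        }
        return numeric * 2 >= tokens.length;
    }

    /** Returns the input lines without those that look like table rows. */
    static List<String> dropTableRows(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (!looksLikeTableRow(line)) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

A position-based heuristic (for example, looking for text that lines up in columns across several rows) would likely be more reliable, but it needs the character coordinates rather than just the extracted strings.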

Regarding the order, call setSortByPosition(true) on the text stripper.
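
A short sketch of how that call fits into a plain extraction, again assuming PDFBox 2.x. The class name SortedTextSketch and the helper extractSorted are hypothetical names chosen for this example.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class SortedTextSketch {

    /** Extracts all text, ordered by position on the page rather than by content stream order. */
    static String extractSorted(PDDocument document) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setSortByPosition(true);
        return stripper.getText(document);
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the PDF file (PDFBox 2.x loading API)
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            System.out.println(extractSorted(document));
        }
    }
}
```

Without the setSortByPosition(true) call, the text comes out in the order it was written into the page's content stream, which does not have to match the visual reading order.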

If you want exact positions of everything, have a look at the PrintTextLocations.java example.

Tilman

