On 29.05.2017 at 08:56, Kunal Kashyap wrote:
I am trying to read text data from a PDF file using the PDFBox API. I
want to skip all chart data and images in the output .txt file.
Can anyone help me with this? I also want to extract the data with
proper alignment.
Please find attached the sample PDF file and a sample .txt file (this is
my desired output file).
Please have a look at the ExtractTextByArea.java example in the source
code download, this will allow you to extract from a predefined area.
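
To illustrate the idea behind extracting from a predefined area, here is a minimal sketch in plain Java. It does not use PDFBox at all: the Fragment type and the textInRegion helper are hypothetical names made up for this illustration, and the coordinates are assumed to be page coordinates of each text piece. The real example in the source download should be preferred for actual extraction.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: keeps only text inside a rectangle.
class RegionFilterSketch {

    /** A piece of text and the x/y position of its origin on the page (illustrative type). */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns the text of all fragments whose origin lies inside the region, in input order. */
    static List<String> textInRegion(List<Fragment> fragments, Rectangle2D region) {
        List<String> result = new ArrayList<>();
        for (Fragment f : fragments) {
            if (region.contains(f.x, f.y)) {
                result.add(f.text);
            }
        }
        return result;
    }
}
```

Choosing a rectangle that covers only the body text and leaves out the area where a chart sits is one way to keep chart labels out of the output, provided the layout is the same on every page.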
There is no way in PDF to "exclude tables" because there is no table
concept in PDF like in HTML. It's just a bunch of lines with text. You
would need heuristics to guess what's a table and what isn't.
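
As a rough idea of what such a heuristic could look like, the sketch below guesses whether one line of text is a table row by counting unusually wide horizontal gaps between its words. This is only one possible approach under stated assumptions: the class, the method names, and the thresholds are all hypothetical, the word spans are assumed to be given as {startX, endX} pairs already sorted left to right, and real documents will need tuning.

```java
// Hypothetical heuristic, not part of PDFBox: flags lines that look like table rows.
class TableRowHeuristicSketch {

    /**
     * Counts the gaps between neighbouring words that are wider than minGap.
     * Each entry of spans is {startX, endX}; spans must be sorted left to right.
     */
    static int countWideGaps(double[][] spans, double minGap) {
        int wide = 0;
        for (int i = 1; i < spans.length; i++) {
            double gap = spans[i][0] - spans[i - 1][1];
            if (gap > minGap) {
                wide++;
            }
        }
        return wide;
    }

    /**
     * Treats a line as a table row if the wide gaps split it into
     * at least minColumns column-like groups.
     */
    static boolean looksLikeTableRow(double[][] spans, double minGap, int minColumns) {
        return countWideGaps(spans, minGap) + 1 >= minColumns;
    }
}
```

Lines flagged this way could then be dropped from the output. Expect false positives (for example, justified text with wide spacing) and false negatives (tables with narrow columns).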
Regarding order, call setSortByPosition(true) on the PDFTextStripper.
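
To show what sorting by position means conceptually, here is a small plain-Java sketch that orders text pieces top to bottom and then left to right. It is not PDFBox's actual implementation: the Fragment type, the toLines method, and the yTolerance parameter are hypothetical, and y is assumed to grow downward on the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not PDFBox code: rebuilds reading order from positions.
class SortByPositionSketch {

    /** A piece of text with the x/y position of its origin (illustrative type). */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Groups fragments into lines by y (within yTolerance), each line ordered by x. */
    static List<String> toLines(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<String> lines = new ArrayList<>();
        List<Fragment> current = new ArrayList<>();
        for (Fragment f : sorted) {
            // Start a new line when this fragment is too far below the line's first fragment.
            if (!current.isEmpty() && Math.abs(f.y - current.get(0).y) > yTolerance) {
                lines.add(joinLeftToRight(current));
                current = new ArrayList<>();
            }
            current.add(f);
        }
        if (!current.isEmpty()) {
            lines.add(joinLeftToRight(current));
        }
        return lines;
    }

    /** Sorts one line's fragments by x and joins their text with single spaces. */
    private static String joinLeftToRight(List<Fragment> line) {
        line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : line) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }
}
```

Without this kind of sorting, text comes out in the order it is stored in the PDF content stream, which need not match the visual reading order.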
If you want exact positions of everything, have a look at the
PrintTextLocations.java example.
Tilman