Re: Issues regarding PDFBOX

Tilman Hausherr Mon, 29 May 2017 08:26:32 -0700

On 29.05.2017 at 08:56, Kunal Kashyap wrote:

I am trying to read text from a PDF file using the PDFBox API. I want to skip all chart data and images in the output .txt file. Can anyone help me with this? I also want to extract the data with proper alignment. Attached are the sample PDF file and a sample .txt file (this is my desired output).

Please have a look at the ExtractTextByArea.java example in the source code download; it will allow you to extract text from a predefined area.

There is no way in PDF to "exclude tables" because there is no table concept in PDF like in HTML. It's just a bunch of lines with text. You would need heuristics to guess what's a table and what isn't.
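To give an idea of what such a heuristic could look like, here is a minimal sketch in plain Java. It is only an illustration and not part of PDFBox: the class TableGuess, its methods and the tolerance value are all hypothetical, and it assumes you have already collected, for every text line, the left x-coordinates of its text chunks (chunks being runs of text separated by a large horizontal gap). A line is guessed to be a table row if it has at least two chunks and shares at least two column positions with the line above or below it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a PDFBox class.
class TableGuess {

    /** True if at least minShared x-positions of a also occur in b, within tol units. */
    static boolean sharesColumns(double[] a, double[] b, double tol, int minShared) {
        int shared = 0;
        for (double xa : a) {
            for (double xb : b) {
                if (Math.abs(xa - xb) <= tol) {
                    shared++;
                    break; // count each position of a at most once
                }
            }
        }
        return shared >= minShared;
    }

    /**
     * lines.get(i) holds the left x-coordinates of the text chunks on line i,
     * ordered top to bottom. Returns, per line, whether it looks like a table row.
     */
    static boolean[] guessTableRows(List<double[]> lines, double tol) {
        boolean[] result = new boolean[lines.size()];
        for (int i = 0; i < lines.size(); i++) {
            double[] cur = lines.get(i);
            if (cur.length < 2) {
                continue; // a single chunk is treated as ordinary text
            }
            boolean likePrev = i > 0 && sharesColumns(cur, lines.get(i - 1), tol, 2);
            boolean likeNext = i + 1 < lines.size() && sharesColumns(cur, lines.get(i + 1), tol, 2);
            result[i] = likePrev || likeNext;
        }
        return result;
    }

    public static void main(String[] args) {
        List<double[]> lines = new ArrayList<>();
        lines.add(new double[] {72});             // paragraph line
        lines.add(new double[] {72, 200, 330});   // table header
        lines.add(new double[] {72, 201, 329.5}); // table data
        lines.add(new double[] {72});             // paragraph line
        // prints [false, true, true, false]
        System.out.println(Arrays.toString(guessTableRows(lines, 2.0)));
    }
}
```

A rule this simple will misfire on things like multi-column body text or single-row tables, so expect to tune it against your own documents.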


Regarding order, use the setSortByPosition() method of PDFTextStripper.
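In PDFBox itself this is just a call to setSortByPosition(true) on the stripper before extracting. To illustrate the idea behind sorting by position, here is a conceptual sketch in plain Java. It is not PDFBox's actual algorithm: the class PositionSort, the Chunk type and the yTol parameter are hypothetical, and it assumes y is measured from the top of the page and grows downward. Chunks are grouped into lines by their y-coordinate, then each line is ordered left to right.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration, not a PDFBox class.
class PositionSort {

    /** A piece of text with the position of its start (y grows downward, by assumption). */
    static final class Chunk {
        final double x;
        final double y;
        final String text;

        Chunk(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Groups chunks into lines (top to bottom) and orders each line left to right. */
    static List<String> toLines(List<Chunk> chunks, double yTol) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y));

        List<String> lines = new ArrayList<>();
        List<Chunk> current = new ArrayList<>();
        for (Chunk c : sorted) {
            // Start a new line once a chunk sits clearly below the first chunk of the current line.
            if (!current.isEmpty() && c.y - current.get(0).y > yTol) {
                lines.add(joinLine(current));
                current = new ArrayList<>();
            }
            current.add(c);
        }
        if (!current.isEmpty()) {
            lines.add(joinLine(current));
        }
        return lines;
    }

    /** Orders the chunks of one line by x and joins them with single spaces. */
    static String joinLine(List<Chunk> line) {
        line.sort(Comparator.comparingDouble((Chunk c) -> c.x));
        StringBuilder sb = new StringBuilder();
        for (Chunk c : line) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        // Chunks arrive in content stream order, which need not be reading order.
        chunks.add(new Chunk(120, 100.0, "world"));
        chunks.add(new Chunk(110, 120.0, "line"));
        chunks.add(new Chunk(72, 100.5, "Hello"));
        chunks.add(new Chunk(72, 120.0, "Second"));
        // prints [Hello world, Second line]
        System.out.println(toLines(chunks, 2.0));
    }
}
```

This shows why the setting matters: text in a PDF content stream can be stored in any order, so without sorting the extracted text may not follow the visual layout.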

If you want exact positions of everything, have a look at the PrintTextLocations.java example.


Tilman


