Re: Context aware text extraction

David Hoffer Thu, 30 Dec 2010 08:17:45 -0800

Yeah, I'm having some luck... it's not elegant, but it's working.

What I'm doing is looking for the table header text and finding its
starting and ending X positions. Then, because I know roughly how wide my
table is, I extract all the subsequent rows that fall within that X range.
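
As a rough illustration of that filtering step, here is a minimal sketch in plain Java. It does not call PDFBox itself: the Line class, the rowsUnderHeader method, and the idea of measuring the range from the header's starting X plus a caller-supplied width are all assumptions made for the example. In a real extractor the positions would have to come from PDFBox's text position data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class TableColumnFilter {

    /** Hypothetical holder for one line of text and its horizontal extent. */
    static final class Line {
        final String text;
        final float startX;
        final float endX;

        Line(String text, float startX, float endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    /**
     * Finds the first line matching the header pattern, then collects the
     * following lines that lie within [header.startX, header.startX + tableWidth].
     * Stops after maxRows rows, since the real table boundary is unknown.
     */
    static List<String> rowsUnderHeader(List<Line> lines, Pattern header,
                                        float tableWidth, int maxRows) {
        List<String> rows = new ArrayList<>();
        boolean headerFound = false;
        float minX = 0f;
        float maxX = 0f;

        for (Line line : lines) {
            if (!headerFound) {
                if (header.matcher(line.text).find()) {
                    headerFound = true;
                    minX = line.startX;
                    maxX = line.startX + tableWidth; // width supplied by the caller
                }
                continue;
            }
            if (rows.size() >= maxRows) {
                break;
            }
            // Keep only lines that sit entirely inside the table's X range.
            if (line.startX >= minX && line.endX <= maxX) {
                rows.add(line.text);
            }
        }
        return rows;
    }
}
```

With the two-table example from the original question, a row that starts at Table 2's X position is kept for Table 2 and skipped for Table 1, even though plain text output would left-align it.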


It has several limitations that are not ideal:
- Sometimes the table header (the text that gives me a unique string to look
for) spans two or three rows... I can't handle this.  Btw, I use a regex for
the header text because you can't be certain how many spaces will be in the
string.
- It would be nice if it could figure out how wide the table is. The PDF has
the boundary/rectangle info, but I don't know how to get at it, so I am
telling it how wide the table is.
- I have to tell it the maximum number of table rows, because again I
don't know how to get the boundary/rectangle info that would tell me where
the table ends.
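
For the header regex mentioned in the first point, one possible way to tolerate a varying number of spaces is sketched below. The HeaderPatterns class and flexibleWhitespace method are hypothetical names for this example. It assumes the header words are always separated by at least one whitespace character.

```java
import java.util.regex.Pattern;

class HeaderPatterns {

    /**
     * Builds a pattern that matches the header words in order, separated by
     * one or more whitespace characters. Each word is quoted so characters
     * such as '.' are matched literally.
     */
    static Pattern flexibleWhitespace(String headerText) {
        String[] words = headerText.trim().split("\\s+");
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                regex.append("\\s+");
            }
            regex.append(Pattern.quote(words[i]));
        }
        return Pattern.compile(regex.toString());
    }
}
```

For example, a pattern built from "Part No. Qty" would also match the same three words when the extracted text has several spaces between them.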

Other than this...it's working.

-Dave

P.S. The newer iText has a context-aware parsing strategy... but it costs
thousands of dollars, which is too rich for me.


On Thu, Dec 30, 2010 at 8:37 AM, Kevin Brown <[email protected]> wrote:

> Any luck with this? I couldn't figure out a way to do this with PDFBox, or
> anything else.
>
> The best tool I've ever seen is something called BCL Jade, which allows you
> to extract zones by selecting them. It's not programmable and is no longer
> supported or sold!
>
> On Tue, Dec 28, 2010 at 11:26 PM, David Hoffer <[email protected]> wrote:
>
> > Hi, I'm new to PDFBox and need to do PDF text extraction, but the standard
> > PDFTextStripper behavior isn't what I need.  The problem with
> > PDFTextStripper is that it left-aligns all the output, so you have no way
> > of knowing what horizontal position the text came from.
> >
> > I have to extract text from (small) tables within the document and I need
> > to know which table the data came from.  A simple example might be:
> >
> > Table 1        Table2
> > 1 2 3 4         1 2 3 4
> >                   5 6 7 8
> >
> > PDFTextStripper can output all 3 rows of this document, and rows 1 & 2 are
> > fine, but it will left-align row 3 so there is no way of knowing that it
> > was part of Table 2 and not Table 1.
> >
> > What I can't show is that there is table formatting (rectangles) around
> > all the tables.
> >
> > How can I use PDFBox to extract the data while keeping it context aware?
> > Ideally, getting each table (I know the text at the top) and then extracting
> > text like PDFTextStripper does would be great.
> >
> > What's the best way to do this?
> >
> > -Dave
> >
>
