Context aware text extraction

David Hoffer Tue, 28 Dec 2010 20:26:33 -0800

Hi, I'm new to PDFBox and need to do PDF text extraction, but the standard
PDFTextStripper behavior isn't what I need.  The problem with
PDFTextStripper is that it left-aligns all the output, so you have no way of
knowing the horizontal position the text came from.


I have to extract text from (small) tables within the document and I need to
know which table the data came from.  A simple example might be:

Table 1        Table 2
1 2 3 4        1 2 3 4
               5 6 7 8

PDFTextStripper can output all 3 rows of this document.  Rows 1 & 2 are
fine, but it will left-align row 3, so there is no way of knowing that it was
part of Table 2 and not Table 1.

What I can't show here is that there is table formatting (rectangles) around
all the tables.

How can I use PDFBox to extract the data while keeping it context aware?
Ideally I would locate each table (I know the text at the top of each one)
and then extract its text the way PDFTextStripper does.
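
To make the desired result concrete, here is a rough sketch in plain Java.
It is not PDFBox code: it assumes each piece of text already comes with x/y
page coordinates and that each table's bounding rectangle is known.  The
names TableRegionSketch, Fragment and groupByRegion, and all the coordinates,
are made up for illustration.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TableRegionSketch {

    /** Hypothetical positioned piece of text: content plus page coordinates. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Assigns each fragment to the first region whose rectangle contains
     * the fragment's (x, y) position. Fragments outside every region are
     * dropped. Region order and fragment order are preserved.
     */
    static Map<String, List<String>> groupByRegion(
            Map<String, Rectangle2D> regions, List<Fragment> fragments) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String name : regions.keySet()) {
            result.put(name, new ArrayList<>());
        }
        for (Fragment f : fragments) {
            for (Map.Entry<String, Rectangle2D> e : regions.entrySet()) {
                if (e.getValue().contains(f.x, f.y)) {
                    result.get(e.getKey()).add(f.text);
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed table rectangles (x, y, width, height); values are invented.
        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        regions.put("Table 1", new Rectangle2D.Double(0, 0, 100, 60));
        regions.put("Table 2", new Rectangle2D.Double(120, 0, 100, 60));

        // The three rows from the example above, with invented positions.
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment("1 2 3 4", 10, 20));
        fragments.add(new Fragment("1 2 3 4", 130, 20));
        fragments.add(new Fragment("5 6 7 8", 130, 40));

        System.out.println(groupByRegion(regions, fragments));
    }
}
```

With output grouped like this, the "5 6 7 8" row would be listed under
Table 2 instead of being indistinguishable from a Table 1 row.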

What's the best way to do this?

-Dave
