Re: Context aware text extraction

Kevin Brown Thu, 30 Dec 2010 07:37:54 -0800

Any luck with this? I couldn't figure out a way to do this with PDFBox or
anything else.


The best tool I've ever seen is something called BCL Jade, which lets you
extract zones by selecting them. It's not programmable, and it's no longer
supported or sold!

On Tue, Dec 28, 2010 at 11:26 PM, David Hoffer <[email protected]> wrote:

> Hi, I'm new to PDFBox and need to do PDF text extraction, but the standard
> PDFTextStripper behavior isn't what I need.  The problem with
> PDFTextStripper is that it left-aligns all the output, so you have no way of
> knowing the horizontal position the text came from.
>
> I have to extract text from (small) tables within the document and I need
> to know which table the data came from.  A simple example might be:
>
> Table 1        Table2
> 1 2 3 4         1 2 3 4
>                   5 6 7 8
>
> PDFTextStripper can output all 3 rows of this document, and rows 1 & 2 are
> fine, but it will left-align row 3, so there is no way of knowing that it
> was part of Table 2 and not Table 1.
>
> What I can't show is that there is table formatting (rectangles) around
> all the tables.
>
> How can I use PDFBox to extract the data while keeping it context aware?
> Ideally, getting each table (I know the text at the top of each) and then
> extracting its text the way PDFTextStripper does would be great.
>
> What's the best way to do this?
>
> -Dave
>
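
For what it's worth, the zone idea can be sketched in plain Java once each
piece of text comes with its page coordinates. This is only a rough sketch
under assumptions: the Fragment type, the groupByZone helper and the zone
rectangles are all hypothetical names and made-up numbers, and getting the
coordinates out of PDFBox is not shown here. It only illustrates assigning
each fragment to the table rectangle that contains it, using Dave's example.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TableZoneSketch {

    /** Hypothetical holder: a run of text plus the x/y of its starting point. */
    public static final class Fragment {
        final String text;
        final double x;
        final double y;

        public Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Assigns each fragment to the first zone whose rectangle contains its
     * starting point. Fragments outside every zone are dropped.
     */
    public static Map<String, List<String>> groupByZone(
            Map<String, Rectangle2D> zones, List<Fragment> fragments) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String name : zones.keySet()) {
            result.put(name, new ArrayList<>());
        }
        for (Fragment f : fragments) {
            for (Map.Entry<String, Rectangle2D> zone : zones.entrySet()) {
                if (zone.getValue().contains(f.x, f.y)) {
                    result.get(zone.getKey()).add(f.text);
                    break;
                }
            }
        }
        return result;
    }

    /** The two-table example from the question, with made-up coordinates. */
    public static Map<String, List<String>> demo() {
        Map<String, Rectangle2D> zones = new LinkedHashMap<>();
        zones.put("Table 1", new Rectangle2D.Double(0, 0, 100, 50));
        zones.put("Table 2", new Rectangle2D.Double(120, 0, 100, 50));

        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment("1 2 3 4", 10, 10));   // row 1, left table
        fragments.add(new Fragment("1 2 3 4", 130, 10));  // row 1, right table
        fragments.add(new Fragment("5 6 7 8", 130, 25));  // row 2, right table
        return groupByZone(zones, fragments);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With those made-up coordinates the "5 6 7 8" row ends up under Table 2
rather than being left-aligned into Table 1. The zone rectangles would
still have to come from somewhere, e.g. measured by hand or located from
the known heading text.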
