Re: PDFTextStripper.processTextPosition

Villu Ruusmann Fri, 19 Feb 2010 08:21:38 -0800

Hello there,

>
> I read the link you sent me. It is beyond my understanding of PDFs
> and PDFBoxTextStripper.
> I am trying to parse this content from the PDF. With 0.8,
> PDFTextStripper.processTextPosition()
> was called once for every column value (e.g. "Mt. Pleasant, SC 29466-8583").
>


First of all, your assumption that every "field" should result in
exactly one invocation of
PDFTextStripper#processTextPosition(TextPosition) is too naive when it
comes to real-world PDF documents.

Maybe it helps if you consider that a PDF document is not required to
contain any "white space" characters at all. Imagine a PDF document
that prints "Hello World". When this document is rendered by
conforming PDF software (for example, Acrobat Reader), the software
may first draw the string "Hello", leave some horizontal space, and
then draw the string "World". When this document is processed with
PDFBox's utilities such as PDFTextStripper, there would be two
invocations of PDFTextStripper#processTextPosition(TextPosition): the
first for the string "Hello" and the second for the string "World". It
is the responsibility of the application that consumes those
TextPositions to figure out (by comparing their relative positions on
the page) that they should be combined to yield "Hello World".
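
A minimal sketch of that combining step, in plain Java without any
PDFBox dependency. The Fragment class is a hypothetical stand-in for
the values you would copy out of each TextPosition (for example from
getXDirAdj and getWidthDirAdj), and the gapTolerance threshold is an
assumption that has to be tuned to the font size of your document:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the data taken from one TextPosition.
final class Fragment {
    final String text;
    final float x;      // left edge, e.g. from getXDirAdj()
    final float width;  // advance width, e.g. from getWidthDirAdj()

    Fragment(String text, float x, float width) {
        this.text = text;
        this.x = x;
        this.width = width;
    }
}

final class LineJoiner {
    // Joins fragments that are assumed to lie on the same text line.
    // A space is inserted whenever the horizontal gap between two
    // neighbouring fragments is larger than gapTolerance.
    static String join(List<Fragment> fragments, float gapTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble(f -> f.x));

        StringBuilder result = new StringBuilder();
        Fragment previous = null;
        for (Fragment current : sorted) {
            if (previous != null) {
                float gap = current.x - (previous.x + previous.width);
                if (gap > gapTolerance) {
                    result.append(' ');
                }
            }
            result.append(current.text);
            previous = current;
        }
        return result.toString();
    }
}
```

The same method works whether the fragments are whole words or single
characters, because only the gaps between neighbours matter.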

> So I thought I would use the getYDirAdj and getXDirAdj methods to sort them
> and take the values.
> Now I do not know where each of those column values ends. E.g. how will I
> know that "Mt. Pleasant,
> SC 29466-8583" is from one "field" if I get one character at a time?
> setSortByPosition(true) also
> doesn't work with processTextPosition(). Could you please tell me if
> there is a better way to do that?
>

The sample you sent me revealed a rather complex table structure.
Assuming this is a fixed layout, you can obtain "fields" if you define
the bounding box of each cell (x, y, width, height), collect all the
TextPositions that fall into that region, and finally join the
collected TextPositions into the result string. You are correct that
you must use TextPosition#getXDirAdj, #getYDirAdj, #getWidthDirAdj to
do the job.
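
The bounding box approach could be sketched roughly as follows, again
in plain Java without PDFBox. Glyph and Cell are hypothetical helper
classes, not part of the PDFBox API; the glyph coordinates are assumed
to come from getXDirAdj, getYDirAdj and getWidthDirAdj, and glyphs on
the same text line are assumed to report exactly the same y value,
which real documents do not always guarantee:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the data taken from one TextPosition.
final class Glyph {
    final String text;
    final float x, y, width;

    Glyph(String text, float x, float y, float width) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
    }
}

// Hypothetical table cell described by its bounding box.
final class Cell {
    final float x, y, width, height;

    Cell(float x, float y, float width, float height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    // A glyph belongs to the cell if its origin lies inside the box.
    boolean contains(Glyph g) {
        return g.x >= x && g.x < x + width
            && g.y >= y && g.y < y + height;
    }

    // Collects the glyphs inside the cell, orders them top to bottom
    // and left to right, and joins them into one string. A space is
    // added at line changes and at gaps wider than gapTolerance.
    String extract(List<Glyph> glyphs, float gapTolerance) {
        List<Glyph> inside = new ArrayList<>();
        for (Glyph g : glyphs) {
            if (contains(g)) {
                inside.add(g);
            }
        }
        inside.sort(Comparator.<Glyph>comparingDouble(g -> g.y)
                .thenComparingDouble(g -> g.x));

        StringBuilder result = new StringBuilder();
        Glyph previous = null;
        for (Glyph current : inside) {
            if (previous != null) {
                boolean newLine = current.y != previous.y;
                boolean wideGap =
                    current.x - (previous.x + previous.width) > gapTolerance;
                if (newLine || wideGap) {
                    result.append(' ');
                }
            }
            result.append(current.text);
            previous = current;
        }
        return result.toString();
    }
}
```

You would define one Cell per table cell of the fixed layout and call
extract with all the glyphs collected from the page.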

PDF really isn't a good choice for data storage or exchange. You would
be better off if you could obtain this data in some structured format
such as XML.


VR
