Re: How to keep PDF format when extracting text

Jack Bush Fri, 27 May 2011 06:51:42 -0700

Hi Elbin,

Would it be too much to ask you to point me to where the sample code for
this area is?


Thanks a lot,

Jack


----- Original Message ----
From: Elbin Elias <[email protected]>
To: [email protected]
Sent: Fri, 27 May, 2011 11:18:28 PM
Subject: Re: How to keep PDF format when extracting text

Hi Jack

Try extractByArea instead of getText. There is also sample code that
demonstrates it.
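
For anyone reading along, here is a rough sketch of the idea behind
extracting by area. It is plain Java and does not call PDFBox: the Fragment
class, the extractRegion method, the coordinates and the sample strings are
all made up for illustration. In PDFBox 1.x the real work is done by
PDFTextStripperByArea (addRegion, extractRegions, getTextForRegion), and
PDFTextStripper.setSortByPosition(true) is worth trying when rows come out
in the wrong order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RegionTextSketch {

    /** Hypothetical stand-in for one piece of text and its top-left corner. */
    static final class Fragment {
        final String text;
        final double x;
        final double y; // measured downwards from the top of the page

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Returns the text inside the given rectangle, one line per row,
     * rows ordered top to bottom and fragments ordered left to right.
     * Fragments within lineTolerance of a row's first fragment share that row.
     */
    static String extractRegion(List<Fragment> fragments,
                                double left, double top,
                                double width, double height,
                                double lineTolerance) {
        // 1. Keep only fragments whose corner lies inside the rectangle.
        List<Fragment> inside = new ArrayList<>();
        for (Fragment f : fragments) {
            if (f.x >= left && f.x < left + width
                    && f.y >= top && f.y < top + height) {
                inside.add(f);
            }
        }

        // 2. Order top to bottom so fragments of one row become neighbours.
        inside.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        // 3. Split into rows, then order each row left to right.
        StringBuilder out = new StringBuilder();
        List<Fragment> row = new ArrayList<>();
        for (Fragment f : inside) {
            if (!row.isEmpty() && f.y - row.get(0).y > lineTolerance) {
                appendRow(out, row);
                row.clear();
            }
            row.add(f);
        }
        appendRow(out, row);
        return out.toString();
    }

    private static void appendRow(StringBuilder out, List<Fragment> row) {
        if (row.isEmpty()) {
            return;
        }
        row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        for (int i = 0; i < row.size(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(row.get(i).text);
        }
        out.append('\n');
    }

    public static void main(String[] args) {
        // Made-up fragments, deliberately stored out of reading order.
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment("42", 200, 101));
        page.add(new Fragment("Name", 50, 80));
        page.add(new Fragment("Value", 200, 80));
        page.add(new Fragment("Page footer", 50, 700)); // outside the region
        page.add(new Fragment("row1", 50, 100));
        System.out.print(extractRegion(page, 0, 50, 400, 200, 3));
    }
}
```

With the made-up fragments in main, the heading row comes out before the
data row, and the footer is dropped because it lies outside the rectangle.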

Regards
Elbin

On Fri, May 27, 2011 at 3:02 PM, Jack Bush <[email protected]> wrote:

> Hi Eric,
>
> Thanks for responding back to my call for assistance.
>
> I am only extracting text from a PDF file. The rows of data have been
> moved around and the heading is at the bottom of the rows of data,
> possibly from a table. The order of the page has also gone out of sync.
>
> Here is an example of the file that I am try to extract from
> http://www.homepriceguide.com.au/saturday_auction_results/Adelaide.pdf
>
> I am only interested in the stats in the middle of the page.
>
> Thanks again,
>
> Jack
> ----- Original Message ----
> From: Eric Douglas <[email protected]>
> To: [email protected]
> Sent: Fri, 27 May, 2011 12:28:52 AM
> Subject: RE: How to keep PDF format when extracting text
>
> This sounds a bit vague. "PDF format" sounds like you're creating a PDF,
> but your description sounds more like you're getting text from a PDF and
> trying to make it look like it does in the PDF. Are you trying to modify
> a PDF, or are you just losing font information on extracted text?
> Is the font information embedded?
> Do you have any samples of your text extraction code or a PDF you're
> extracting from?
>
>
> -----Original Message-----
> From: Jack Bush [mailto:[email protected]]
> Sent: Thursday, May 26, 2011 10:12 AM
> To: [email protected]
> Subject: How to keep PDF format when extracting text
>
> Hi All,
>
> I have no problem extracting text from a PDF document using
> pdfbox-app-1.5.0.jar, but found that the format has been lost. I also
> downloaded fontbox-1.5.0.jar and jempbox-1.5.0.jar, but I am not sure how
> to use them to make the format of the extracted text file as close to the
> original PDF file as possible.
>
> Are there any good documents on this topic that use recent jars? I found
> some material through Google, but it either uses a much earlier version
> (0.8) of pdfbox or the explanation is insufficient to follow. It is not
> in the PDFBox FAQ.
>
> Is there a mailing list archive I could look up?
>
> Many thanks,
>
> Jack
>
>
>


-- 
Thanks & Regards
Elbin K Elias
