Re: How to keep PDF format when extracting text

Elbin Elias Fri, 27 May 2011 06:55:15 -0700

The sample code is under pdfbox\examples\util
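
For anyone following the thread, here is a minimal sketch of the input an area-based extraction needs: a rectangle describing the part of the page to read. The helper `band` and the class name `RegionSketch` are hypothetical and not part of PDFBox. The page size (roughly A4, in PDF points) and the "middle half of the page" fractions are assumptions for illustration, as is the top-left origin with y growing downward; check these against your own PDF and PDFBox version. The PDFBox calls appear only in comments so the block compiles with the JDK alone.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox: builds the rectangle for a
// horizontal band of a page, given as fractions of the page height.
final class RegionSketch {

    // topFraction and bottomFraction are measured from the top of the page
    // (0.0 = top edge, 1.0 = bottom edge). Assumes y grows downward.
    static Rectangle2D.Double band(double pageWidth, double pageHeight,
                                   double topFraction, double bottomFraction) {
        if (topFraction < 0 || bottomFraction > 1 || topFraction >= bottomFraction) {
            throw new IllegalArgumentException("need 0 <= top < bottom <= 1");
        }
        double y = pageHeight * topFraction;
        double height = pageHeight * (bottomFraction - topFraction);
        // Full page width, starting at the left edge.
        return new Rectangle2D.Double(0, y, pageWidth, height);
    }

    public static void main(String[] args) {
        // Assumed page size: roughly A4 in PDF points (595 x 842).
        // Assumed region: the middle half of the page.
        Rectangle2D.Double stats = band(595, 842, 0.25, 0.75);
        System.out.println(stats);

        // With a PDFBox 1.x jar on the classpath, the rectangle would be
        // used roughly like this (page is a PDPage from the document):
        //   PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        //   stripper.setSortByPosition(true);
        //   stripper.addRegion("stats", stats);
        //   stripper.extractRegions(page);
        //   String text = stripper.getTextForRegion("stats");
    }
}
```

Restricting extraction to one region sidesteps headings and footers that land in the wrong place, and sorting by position within the region tends to keep table rows in reading order. The fractions will need tuning by trial against the actual page.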

On Fri, May 27, 2011 at 3:51 PM, Jack Bush <[email protected]> wrote:


> Hi Elbin,
>
> Could you point me to where the sample code for this area is?
>
> Thanks a lot,
>
> Jack
>
>
> ----- Original Message ----
> From: Elbin Elias <[email protected]>
> To: [email protected]
> Sent: Fri, 27 May, 2011 11:18:28 PM
> Subject: Re: How to keep PDF format when extracting text
>
> Hi Jack
>
> Try extractByArea instead of getText. There is also sample code that
> demonstrates it.
>
> Regards
> Elbin
>
> On Fri, May 27, 2011 at 3:02 PM, Jack Bush <[email protected]> wrote:
>
> > Hi Eric,
> >
> > Thanks for responding to my call for assistance.
> >
> > I am only extracting text from a PDF file. The rows of data have been
> > moved around and the heading ends up below the rows of data, possibly
> > because they come from a table. The order of the page has also gone
> > out of sync.
> >
> > Here is an example of the file that I am trying to extract from:
> > http://www.homepriceguide.com.au/saturday_auction_results/Adelaide.pdf
> >
> > I am only interested in the stats in the middle of the page.
> >
> > Thanks again,
> >
> > Jack
> > ----- Original Message ----
> > From: Eric Douglas <[email protected]>
> > To: [email protected]
> > Sent: Fri, 27 May, 2011 12:28:52 AM
> > Subject: RE: How to keep PDF format when extracting text
> >
> > This sounds a bit vague. "PDF format" sounds like you're creating a
> > PDF, but your description sounds more like you're getting text from a
> > PDF and trying to make it look like it does in the PDF. Are you trying
> > to modify a PDF, or are you just losing font information on extracted
> > text?
> > Is the font information embedded?
> > Do you have any samples of your text extraction code or a PDF you're
> > extracting from?
> >
> >
> > -----Original Message-----
> > From: Jack Bush [mailto:[email protected]]
> > Sent: Thursday, May 26, 2011 10:12 AM
> > To: [email protected]
> > Subject: How to keep PDF format when extracting text
> >
> > Hi All,
> >
> > I have no problem extracting text from a PDF document using
> > pdfbox-app-1.5.0.jar, but found that the format has been lost. I also
> > downloaded fontbox-1.5.0.jar and jempbox-1.5.0.jar, but am not sure
> > how to use them to make the format of the extracted text file as close
> > to the original PDF file as possible.
> >
> > Is there any good documentation on this topic that uses recent jars? I
> > found some material through Google, but it either uses a much earlier
> > version (0.8) of pdfbox or the explanation is insufficient to follow.
> > It is not in the PDFBox FAQ.
> >
> > Do you have an archived mailing list I could look up?
> >
> > Many thanks,
> >
> > Jack
> >
> >
> >
>
>
> --
> Thanks & Regards
> Elbin K Elias
>
>


-- 
Thanks & Regards
Elbin K Elias