pdfbox\examples\util

On Fri, May 27, 2011 at 3:51 PM, Jack Bush <[email protected]> wrote:
> Hi Elbin,
>
> Is it too much to ask if you could point me to where the sample code
> for this area is?
>
> Thanks a lot,
>
> Jack
>
> ----- Original Message ----
> From: Elbin Elias <[email protected]>
> To: [email protected]
> Sent: Fri, 27 May, 2011 11:18:28 PM
> Subject: Re: How to keep PDF format when extracting text
>
> Hi Jack
>
> Try extractByArea instead of getText. There is also sample code
> explaining the same.
>
> Regards
> Elbin
>
> On Fri, May 27, 2011 at 3:02 PM, Jack Bush <[email protected]> wrote:
>
> > Hi Eric,
> >
> > Thanks for responding to my call for assistance.
> >
> > I am extracting text from a PDF file only. The rows of data have been
> > moved around and the heading is at the bottom of the rows of data,
> > possibly from a table. The order of the page has also gone out of sync.
> >
> > Here is an example of the file that I am trying to extract from:
> > http://www.homepriceguide.com.au/saturday_auction_results/Adelaide.pdf
> >
> > I am only interested in the stats in the middle of the page.
> >
> > Thanks again,
> >
> > Jack
> >
> > ----- Original Message ----
> > From: Eric Douglas <[email protected]>
> > To: [email protected]
> > Sent: Fri, 27 May, 2011 12:28:52 AM
> > Subject: RE: How to keep PDF format when extracting text
> >
> > This sounds a bit vague. PDF format sounds like you're creating a PDF,
> > but your description sounds more like you're getting text from a PDF
> > and trying to make it look like it does in the PDF. Are you trying to
> > modify a PDF or are you just losing font information on extracted text?
> > Is the font information embedded?
> > Do you have any samples of your text extraction code or a PDF you're
> > extracting?
> >
> > -----Original Message-----
> > From: Jack Bush [mailto:[email protected]]
> > Sent: Thursday, May 26, 2011 10:12 AM
> > To: [email protected]
> > Subject: How to keep PDF format when extracting text
> >
> > Hi All,
> >
> > I have no problem extracting text from a pdf document using
> > pdfbox-app-1.5.0.jar but found that the format has been lost. I also
> > downloaded fontbox-1.5.0.jar and jempbox-1.5.0.jar but am not sure how
> > to use them to make the format of the extracted text file as close to
> > the original pdf file as possible.
> >
> > Are there any good documents around on this topic that use recent jars?
> > I found some material from Google but it either uses a much earlier
> > version (0.8) of pdfbox or the explanation is insufficient to follow.
> > It is not in the PDFBox FAQ.
> >
> > Do you have an archived mailing list I could look up?
> >
> > Many thanks,
> >
> > Jack

--
Thanks & Regards
Elbin K Elias
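
For anyone following the extractByArea suggestion: in PDFBox 1.x the
region-based stripper is org.apache.pdfbox.util.PDFTextStripperByArea
(addRegion, extractRegions, getTextForRegion), and the bundled
ExtractTextByArea example lives in the examples package named at the top of
this message. The sketch below does not call PDFBox. It is a plain-Java
illustration of the idea only, assuming each piece of text is available with
page coordinates: keep what falls inside a rectangle, then rebuild rows top to
bottom and left to right. The names RegionTextSketch, Fragment, textInRegion
and lineTolerance are all hypothetical, and this is not how PDFBox itself is
implemented.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustration only: hypothetical types, not part of PDFBox.
final class RegionTextSketch {

    /** Hypothetical stand-in for one positioned piece of text on a page. */
    static final class Fragment {
        final String text;
        final double x; // distance from the left edge of the page
        final double y; // distance from the top edge of the page

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Returns the text whose origin lies inside the region, one output line
     * per visual row. Fragments whose y differs from the first fragment of
     * the current row by at most lineTolerance are treated as the same row.
     */
    static String textInRegion(List<Fragment> fragments,
                               Rectangle2D region,
                               double lineTolerance) {
        // 1. Discard everything outside the area of interest.
        List<Fragment> kept = new ArrayList<>();
        for (Fragment f : fragments) {
            if (region.contains(f.x, f.y)) {
                kept.add(f);
            }
        }

        // 2. Order top to bottom so rows come out in reading order.
        kept.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        // 3. Group into rows, then order each row left to right.
        StringBuilder out = new StringBuilder();
        List<Fragment> row = new ArrayList<>();
        for (Fragment f : kept) {
            if (!row.isEmpty() && f.y - row.get(0).y > lineTolerance) {
                appendRow(out, row);
                row.clear();
            }
            row.add(f);
        }
        if (!row.isEmpty()) {
            appendRow(out, row);
        }
        return out.toString();
    }

    private static void appendRow(StringBuilder out, List<Fragment> row) {
        row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        for (int i = 0; i < row.size(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(row.get(i).text);
        }
        out.append('\n');
    }
}
```

With a real PDF, the rectangle would be chosen to cover only the block of
interest (for the file discussed here, the stats in the middle of the page),
so headings and rows outside it never get mixed into the output.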

