OK Elbin, I will take up your suggestion.
Thank you once again,

Jack

----- Original Message ----
From: Elbin Elias <[email protected]>
To: [email protected]
Sent: Sun, 29 May, 2011 12:17:34 AM
Subject: Re: How to keep PDF format when extracting text

Hi Jack

Glad to hear that this works for you. For the issue with the delimiter, I
would suggest you use a space as the delimiter instead of a pipe.

Thanks
Elbin

On Sat, May 28, 2011 at 4:10 PM, Jack Bush <[email protected]> wrote:

> Hi Elbin,
>
> Excellent. Below is the code that has successfully converted only the
> required rows of PDF data to text:
>
> parser.parse();
> cosDoc = parser.getDocument();
> pdDoc = new PDDocument(cosDoc);
> pdfStripper = new PDFTextStripperByArea();
> pdfStripper.setSortByPosition( true );
> Rectangle rect = new Rectangle( 10, 250, 750, 550 );
> pdfStripper.addRegion( "class1", rect );
> List allPages = pdDoc.getDocumentCatalog().getAllPages();
> PDPage firstPage = (PDPage)allPages.get( 0 );
> pdfStripper.extractRegions( firstPage );
> System.out.println( "Text in the area:" + rect );
> System.out.println( pdfStripper.getTextForRegion( "class1" ) );
>
> However, I need to go a step further by splitting up each row of data
> with a pipe ('|') delimiter, to capture the values (made up of words and
> spaces) which represent the content of each column. Below is an example:
>
> Current data
> -----------------
> Suburb Address Type Price Result Agent
> Fairyland 10 Rochester St 3 br h $500,000 VB My Real Estate Agent
>
> Desired outcome
> ----------------------
> Suburb Address Type Price Result Agent
> Fairyland|10 Rochester St|3 br h|$500,000|VB|My Real Estate Agent
>
> Is this possible? Or do I need to define additional layers of rectangles
> for each column? If so, any suggestions on how this could be achieved?
>
> We are nearly there.
>
> Many thanks again,
>
> Jack
>
>
> ----- Original Message ----
> From: Elbin Elias <[email protected]>
> To: [email protected]
> Sent: Fri, 27 May, 2011 11:54:48 PM
> Subject: Re: How to keep PDF format when extracting text
>
> pdfbox\examples\util
>
> On Fri, May 27, 2011 at 3:51 PM, Jack Bush <[email protected]> wrote:
>
> > Hi Elbin,
> >
> > Is it too much to ask if you could point me to where the sample code
> > for this area is?
> >
> > Thanks a lot,
> >
> > Jack
> >
> >
> > ----- Original Message ----
> > From: Elbin Elias <[email protected]>
> > To: [email protected]
> > Sent: Fri, 27 May, 2011 11:18:28 PM
> > Subject: Re: How to keep PDF format when extracting text
> >
> > Hi Jack
> >
> > Try extractByArea instead of getText. There is also sample code
> > explaining the same.
> >
> > Regards
> > Elbin
> >
> > On Fri, May 27, 2011 at 3:02 PM, Jack Bush <[email protected]> wrote:
> >
> > > Hi Eric,
> > >
> > > Thanks for responding to my call for assistance.
> > >
> > > I am extracting text from a PDF file only. The rows of data have been
> > > moved around and the heading is down at the bottom of the rows of
> > > data, possibly from a table. The order of the page has also gone out
> > > of sync.
> > >
> > > Here is an example of the file that I am trying to extract from:
> > > http://www.homepriceguide.com.au/saturday_auction_results/Adelaide.pdf
> > >
> > > I am only interested in the stats in the middle of the page.
> > >
> > > Thanks again,
> > >
> > > Jack
> > >
> > > ----- Original Message ----
> > > From: Eric Douglas <[email protected]>
> > > To: [email protected]
> > > Sent: Fri, 27 May, 2011 12:28:52 AM
> > > Subject: RE: How to keep PDF format when extracting text
> > >
> > > This sounds a bit vague. PDF format sounds like you're creating a PDF,
> > > but your description sounds more like you're getting text from a PDF
> > > and trying to make it look like it does in the PDF. Are you trying to
> > > modify a PDF or are you just losing font information on extracted
> > > text?
> > > Is the font information embedded?
> > > Do you have any samples of your text extraction code or a PDF you're
> > > extracting?
> > >
> > >
> > > -----Original Message-----
> > > From: Jack Bush [mailto:[email protected]]
> > > Sent: Thursday, May 26, 2011 10:12 AM
> > > To: [email protected]
> > > Subject: How to keep PDF format when extracting text
> > >
> > > Hi All,
> > >
> > > I have no problem extracting text from a pdf document using
> > > pdfbox-app-1.5.0.jar but found that the format has been lost. I also
> > > downloaded fontbox-1.5.0.jar and jempbox-1.5.0.jar but am not sure how
> > > to use them to improve the format of the extracted text file to be as
> > > close to the original pdf file as possible.
> > >
> > > Are there any good documents around on this topic that use recent
> > > jars? I found some material from Google but it is either using a much
> > > earlier version (0.8) of pdfbox or the explanation is insufficient to
> > > follow. It is not in the PDFBox FAQ.
> > >
> > > Do you have an archived mailing list I could look up?
> > >
> > > Many thanks,
> > >
> > > Jack
> > >
> >
> > --
> > Thanks & Regards
> > Elbin K Elias
> >
>
> --
> Thanks & Regards
> Elbin K Elias
>

--
Thanks & Regards
Elbin K Elias
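
On the pipe-delimiting question raised in the thread, one possible alternative to defining a rectangle per column is to post-process each extracted line with a regular expression. The following is only a sketch under stated assumptions: the class name RowSplitter, the method toPipeDelimited, and the row pattern itself are hypothetical and inferred from the single sample row quoted in the thread. Real rows (missing prices, other type codes, addresses without a street number) would probably need a looser pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RowSplitter {
    // Assumed row layout, inferred from the one sample row in the thread:
    // suburb, street address (starting with a number), type ("3 br h"),
    // price ("$500,000"), result code ("VB"), agent name.
    private static final Pattern ROW = Pattern.compile(
        "^(.+?) (\\d+\\S* .+?) (\\d+ br \\w+) (\\$[\\d,]+) (\\S+) (.+)$");

    // Returns the row with '|' between columns, or the line unchanged when it
    // does not fit the assumed layout (for example the heading line).
    static String toPipeDelimited(String line) {
        Matcher m = ROW.matcher(line.trim());
        if (!m.matches()) {
            return line;
        }
        List<String> columns = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            columns.add(m.group(i));
        }
        return String.join("|", columns);
    }

    public static void main(String[] args) {
        // Stand-in for the text returned by getTextForRegion( "class1" ).
        String extracted =
            "Suburb Address Type Price Result Agent\n"
            + "Fairyland 10 Rochester St 3 br h $500,000 VB My Real Estate Agent";
        for (String line : extracted.split("\\r?\\n")) {
            System.out.println(toPipeDelimited(line));
        }
    }
}
```

Because the columns in the extracted text are separated by single spaces, a plain split on whitespace cannot tell a column boundary from a space inside a value; the pattern instead relies on recognisable shapes such as the leading street number and the dollar sign. If the column shapes are not that regular, adding one region per column with addRegion, as Jack suggests, may be the more robust route.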

