Re: How to keep PDF format when extracting text

Jack Bush Mon, 30 May 2011 06:34:52 -0700

OK Elbin,

I will take up your suggestion.


Thank you once again,

Jack



----- Original Message ----
From: Elbin Elias <[email protected]>
To: [email protected]
Sent: Sun, 29 May, 2011 12:17:34 AM
Subject: Re: How to keep PDF format when extracting text

Hi Jack

Glad to hear that this works for you. For the delimiter issue, I would
suggest using a space as the delimiter instead of a pipe.

Thanks
Elbin
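
A plain space delimiter cannot separate every column of the sample row quoted
below, because values such as the address and the agent name contain single
spaces themselves. One possible way to get the pipe-delimited outcome is a
regular expression over each extracted line. This is only a minimal sketch,
assuming every row has the same shape as the single sample row in this thread
(suburb, an address starting with a digit, a type like "3 br h", a price
starting with '$', a one-word result code, then the agent). The class name
RowSplitter and the method toPipeDelimited are hypothetical and are not part
of PDFBox.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RowSplitter {
    // Assumed row shape, inferred only from the sample row in this thread:
    // suburb | address (starts with a digit) | type ("3 br h") |
    // price ("$500,000") | result (one word) | agent (rest of the line)
    private static final Pattern ROW = Pattern.compile(
        "^(\\D+?)\\s+(\\d.*?)\\s+(\\d+ br \\w+)\\s+(\\$[\\d,]+)\\s+(\\S+)\\s+(.+)$");

    /** Returns the row with '|' between columns, or null if the line does not fit the assumed shape. */
    public static String toPipeDelimited(String row) {
        Matcher m = ROW.matcher(row.trim());
        if (!m.matches()) {
            return null; // for example the header line, which contains no digits
        }
        StringBuilder out = new StringBuilder(m.group(1));
        for (int i = 2; i <= m.groupCount(); i++) {
            out.append('|').append(m.group(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String row = "Fairyland 10 Rochester St 3 br h  $500,000    VB    My Real Estate Agent";
        System.out.println(toPipeDelimited(row));
    }
}
```

Each line returned by getTextForRegion( "class1" ) could be passed through
toPipeDelimited. Rows that do not fit the assumed shape come back as null, so
they can be logged and checked by hand. If the real data varies more than the
sample does, defining one rectangle per column is likely the more robust route.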

On Sat, May 28, 2011 at 4:10 PM, Jack Bush <[email protected]> wrote:

> Hi Elbin,
>
> Excellent. Below is the code that successfully converted only the
> required rows of PDF data to text:
>
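>            // Needs java.awt.Rectangle, java.util.List,
>            // org.apache.pdfbox.pdmodel.PDDocument, org.apache.pdfbox.pdmodel.PDPage
>            // and org.apache.pdfbox.util.PDFTextStripperByArea (PDFBox 1.5.0).
>            parser.parse();
>            cosDoc = parser.getDocument();
>            pdDoc = new PDDocument(cosDoc);
>            pdfStripper = new PDFTextStripperByArea();
>            pdfStripper.setSortByPosition( true );
>            // Region covering only the required rows: x, y, width, height
>            Rectangle rect = new Rectangle( 10, 250, 750, 550 );
>            pdfStripper.addRegion( "class1", rect );
>            List allPages = pdDoc.getDocumentCatalog().getAllPages();
>            PDPage firstPage = (PDPage)allPages.get( 0 );
>            // Extract the text that falls inside the registered region
>            pdfStripper.extractRegions( firstPage );
>            System.out.println( "Text in the area:" + rect );
>            System.out.println( pdfStripper.getTextForRegion( "class1" ) );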
>
> However, I need to go a step further and split each row of data with a
> pipe ('|') delimiter, to capture the values (made up of words and spaces)
> that represent the content of each column. Below is an example:
>
> Current data
> -----------------
> Suburb  Address            Type  Price      Result  Agent
> Fairyland 10 Rochester St 3 br h  $500,000    VB    My Real Estate Agent
>
> Desired outcome
> ----------------------
> Suburb  Address            Type  Price      Result  Agent
> Fairyland|10 Rochester St|3 br h|$500,000|VB|My Real Estate Agent
>
> Is this possible? Or do I need to define additional layers of rectangles
> for each column? If so, any suggestions on how this could be achieved?
>
> We are nearly there.
>
> Many thanks again,
>
> Jack
>
>
> ----- Original Message ----
> From: Elbin Elias <[email protected]>
> To: [email protected]
>  Sent: Fri, 27 May, 2011 11:54:48 PM
> Subject: Re: How to keep PDF format when extracting text
>
> pdfbox\examples\util
>
> On Fri, May 27, 2011 at 3:51 PM, Jack Bush <[email protected]>
> wrote:
>
> > Hi Elbin,
> >
> > Would it be too much to ask you to point me to where the sample code
> > for this area is?
> >
> > Thanks a lot,
> >
> > Jack
> >
> >
> > ----- Original Message ----
> > From: Elbin Elias <[email protected]>
> > To: [email protected]
> >  Sent: Fri, 27 May, 2011 11:18:28 PM
> > Subject: Re: How to keep PDF format when extracting text
> >
> > Hi Jack
> >
> > Try extractByArea instead of getText. There is also sample code
> > explaining the same.
> >
> > Regards
> > Elbin
> >
> > On Fri, May 27, 2011 at 3:02 PM, Jack Bush <[email protected]>
> > wrote:
> >
> > > Hi Eric,
> > >
> > > Thanks for responding back to my call for assistance.
> > >
> > > I am extracting text from a PDF file only. The rows of data have been
> > > moved around and the heading is at the bottom of the rows of data,
> > > possibly from a table. The order of the page has also gone out of sync.
> > >
> > > Here is an example of the file that I am trying to extract from:
> > > http://www.homepriceguide.com.au/saturday_auction_results/Adelaide.pdf
> > >
> > > I am only interested in the stats in the middle of the page.
> > >
> > > Thanks again,
> > >
> > > Jack
> > > ----- Original Message ----
> > > From: Eric Douglas <[email protected]>
> > > To: [email protected]
> > > Sent: Fri, 27 May, 2011 12:28:52 AM
> > > Subject: RE: How to keep PDF format when extracting text
> > >
> > > This sounds a bit vague. "PDF format" sounds like you're creating a PDF,
> > > but your description sounds more like you're getting text from a PDF and
> > > trying to make it look like it does in the PDF. Are you trying to modify
> > > a PDF, or are you just losing font information on extracted text?
> > > Is the font information embedded?
> > > Do you have any samples of your text extraction code or a PDF you're
> > > extracting from?
> > >
> > >
> > > -----Original Message-----
> > > From: Jack Bush [mailto:[email protected]]
> > > Sent: Thursday, May 26, 2011 10:12 AM
> > > To: [email protected]
> > > Subject: How to keep PDF format when extracting text
> > >
> > > Hi All,
> > >
> > > I have no problem extracting text from a PDF document using
> > > pdfbox-app-1.5.0.jar, but found that the format has been lost. I also
> > > downloaded fontbox-1.5.0.jar and jempbox-1.5.0.jar but am not sure how
> > > to use them to make the format of the extracted text file as close to
> > > the original PDF file as possible.
> > >
> > > Is there any good documentation on this topic that uses recent jars? I
> > > found some material through Google, but it either uses a much earlier
> > > version (0.8) of pdfbox or the explanation is insufficient to follow. It
> > > is not in the PDFBox FAQ.
> > >
> > > Do you have an archived mailing list I could look up?
> > >
> > > Many thanks,
> > >
> > > Jack
> > >
> > >
> > >
> >
> >
> > --
> > Thanks & Regards
> > Elbin K Elias
> >
> >
>
>
> --
> Thanks & Regards
> Elbin K Elias
>
>


-- 
Thanks & Regards
Elbin K Elias
