Re: How to keep PDF format when extracting text

Jack Bush Sat, 28 May 2011 07:11:01 -0700

Hi Elbin,

Excellent. Below is the code that has successfully converted only the required 
rows of PDF data to text:


            // Parse the PDF and wrap the low-level COSDocument in a PDDocument.
            parser.parse();
            cosDoc = parser.getDocument();
            pdDoc = new PDDocument(cosDoc);
            // Extract text only from named rectangular regions, sorted by position.
            pdfStripper = new PDFTextStripperByArea();
            pdfStripper.setSortByPosition( true );
            // Region covering the required rows: x, y, width, height.
            Rectangle rect = new Rectangle( 10, 250, 750, 550 );
            pdfStripper.addRegion( "class1", rect );
            List allPages = pdDoc.getDocumentCatalog().getAllPages();
            PDPage firstPage = (PDPage)allPages.get( 0 );
            pdfStripper.extractRegions( firstPage );
            System.out.println( "Text in the area:" + rect );
            System.out.println( pdfStripper.getTextForRegion( "class1" ) );

However, I need to go a step further and split each row of data into pipe
('|') delimited values, where each value (made up of words and spaces)
represents the content of one column. Below is an example:

Current data
-----------------
Suburb   Address            Type   Price       Result  Agent
Fairyland 10 Rochester St 3 br h  $500,000    VB     My Real Estate Agent

Desired outcome
----------------------
Suburb   Address            Type   Price       Result  Agent
Fairyland|10 Rochester St|3 br h|$500,000|VB|My Real Estate Agent

Is this possible? Or do I need to define additional layers of rectangles for
each column? If so, any suggestions on how this could be achieved?
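
One possible alternative to per-column rectangles is to post-process each
extracted line with a regular expression. The sketch below is plain Java and
does not use PDFBox. RowSplitter and toPipeDelimited are hypothetical names,
and the pattern assumes every data row has the shape of the example above:
suburb, an address starting with a digit, a type such as "3 br h", a price
starting with '$', a one-token result code, then the agent. Lines that do not
fit that shape (the heading row, for instance) are returned unchanged, so real
data would likely need a looser pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: post-processes one extracted line.
class RowSplitter {

    // Assumed row layout: suburb, address (starts with a digit), type such as
    // "3 br h", price starting with '$', a one-token result code, then agent.
    private static final Pattern ROW = Pattern.compile(
            "^(\\D+?)\\s+(\\d.*?)\\s+(\\d+ br \\w+)\\s+(\\$[\\d,]+)\\s+(\\S+)\\s+(.+?)\\s*$");

    // Returns the row with '|' between columns, or the line unchanged when it
    // does not match the assumed layout (for example the heading row).
    static String toPipeDelimited(String line) {
        Matcher m = ROW.matcher(line);
        if (!m.matches()) {
            return line;
        }
        StringBuilder sb = new StringBuilder(m.group(1).trim());
        for (int i = 2; i <= m.groupCount(); i++) {
            sb.append('|').append(m.group(i).trim());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the text returned by getTextForRegion( "class1" ).
        String text = "Suburb   Address            Type   Price       Result  Agent\n"
                + "Fairyland 10 Rochester St 3 br h  $500,000    VB     My Real Estate Agent";
        for (String line : text.split("\\r?\\n")) {
            System.out.println(toPipeDelimited(line));
        }
    }
}
```

In the code above, the string returned by getTextForRegion( "class1" ) could
be split into lines and passed through toPipeDelimited in the same way. If the
columns cannot be told apart by their content, one region per column is
probably the more robust route.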

We are nearly there.

Many thanks again,

Jack


----- Original Message ----
From: Elbin Elias <[email protected]>
To: [email protected]
Sent: Fri, 27 May, 2011 11:54:48 PM
Subject: Re: How to keep PDF format when extracting text

pdfbox\examples\util

On Fri, May 27, 2011 at 3:51 PM, Jack Bush <[email protected]> wrote:

> Hi Elbin,
>
> Would it be too much to ask you to point me to where the sample code for
> this area is?
>
> Thanks a lot,
>
> Jack
>
>
> ----- Original Message ----
> From: Elbin Elias <[email protected]>
> To: [email protected]
>  Sent: Fri, 27 May, 2011 11:18:28 PM
> Subject: Re: How to keep PDF format when extracting text
>
> Hi Jack
>
> Try extractByArea instead of getText. There is also sample code explaining
> the same
>
> Regards
> Elbin
>
> On Fri, May 27, 2011 at 3:02 PM, Jack Bush <[email protected]>
> wrote:
>
> > Hi Eric,
> >
> > Thanks for responding back to my call for assistance.
> >
> > I am extracting text from a PDF file only. The rows of data have been
> > moved around and the heading is down at the bottom of the rows of data,
> > possibly from a table. The order of the page has also gone out of sync.
> >
> > Here is an example of the file that I am trying to extract from
> > http://www.homepriceguide.com.au/saturday_auction_results/Adelaide.pdf
> >
> > I am only interested in the stats in the middle of the page.
> >
> > Thanks again,
> >
> > Jack
> > ----- Original Message ----
> > From: Eric Douglas <[email protected]>
> > To: [email protected]
> > Sent: Fri, 27 May, 2011 12:28:52 AM
> > Subject: RE: How to keep PDF format when extracting text
> >
> > This sounds a bit vague.  PDF format sounds like you're creating a PDF, but
> > your description sounds more like you're getting text from a PDF trying to
> > make it look like it does in the PDF.  Are you trying to modify a PDF or are
> > you just losing font information on extracted text?
> > Is the font information embedded?
> > Do you have any samples of your text extraction code or a PDF you're
> > extracting?
> >
> >
> > -----Original Message-----
> > From: Jack Bush [mailto:[email protected]]
> > Sent: Thursday, May 26, 2011 10:12 AM
> > To: [email protected]
> > Subject: How to keep PDF format when extracting text
> >
> > Hi All,
> >
> > I have no problem extracting text from a PDF document using
> > pdfbox-app-1.5.0.jar but found that the format has been lost. I also
> > downloaded fontbox-1.5.0.jar and jempbox-1.5.0.jar but am not sure how to
> > use them to improve the format of the extracted text file to be as close to
> > the original PDF file as possible.
> >
> > Is there any good documentation on this topic that uses recent jars? I
> > found some material from Google, but it either uses a much earlier version
> > (0.8) of pdfbox or the explanation is insufficient to follow. It is not in
> > the PDFBox FAQ.
> >
> > Do you have an archived mailing list I could look up?
> >
> > Many thanks,
> >
> > Jack
> >
> >
> >
>
>
> --
> Thanks & Regards
> Elbin K Elias
>
>


-- 
Thanks & Regards
Elbin K Elias
