Dear Mark,
Thanks for your reply. Unfortunately, I don't understand how your post relates 
to my question. 
I'm new to PDFBox. Would you please elaborate on how to extract the position of 
a specific paragraph using the attached code?
It seems to work with "fields" in the input PDF file, but I'm looking for 
paragraphs. How are the two related?
Kind regards,
Amir




________________________________
 From: "Strein, Mark C CIV USARMY TRADOC ANALYSIS CTR (US)" 
<[email protected]>
To: "[email protected]" <[email protected]>; Amir H. Jadidinejad 
<[email protected]> 
Sent: Monday, August 4, 2014 3:52 PM
Subject: RE: How to find the position of a specific paragraph in the input PDF? 
(UNCLASSIFIED)
 

Classification: UNCLASSIFIED
Caveats: NONE

Morning Sir,
The basic construct for extracting the value in a field is:

if (field.getFullyQualifiedName().equalsIgnoreCase(fullyQualifiedName))
{
    String value = field.getValue();
}

Note: I use fully qualified names (FQN) to prevent errors.

My way of extracting the FQN is as follows (the short version):

private void processField(PDField field, boolean buildPDList) throws IOException
{
    List kids = field.getKids();
    if (kids != null)
    {
        // Non-terminal field: recurse into each child field.
        Iterator kidsIter = kids.iterator();
        while (kidsIter.hasNext())
        {
            Object pdfObj = kidsIter.next();
            if (pdfObj instanceof PDField)
            {
                PDField kid = (PDField) pdfObj;
                processField(kid, buildPDList);
            }
        }
    }
    else
    {
        // Terminal field: print its fully qualified name, or process it.
        if (!buildPDList)
        {
            System.err.println(field.getFullyQualifiedName());
        }
        else
        {
            // other processing
        }
    }
}

Hope that helps.


V/R,
Mark Strein


-----Original Message-----
From: Amir H. Jadidinejad [mailto:[email protected]] 
Sent: Sunday, August 03, 2014 8:53 PM
To: user pdfbox
Subject: How to find the position of a specific paragraph in the input PDF?



I'm going to extract the content of a PDF file using the PDFBox library. The
content should be processed paragraph by paragraph, and for each paragraph I
need its position for follow-up processing. Using the following code, I can
extract the whole content of an input PDF:

PDDocument doc = PDDocument.load(file);
PDFTextStripper stripper = new PDFTextStripper();
String txt = stripper.getText(doc);
doc.close();

I have two problems:

    1. I don't know how to extract the content paragraph by paragraph.
    2. I don't know how to store the position of a paragraph for follow-up
processing (for example, highlighting).
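
To make the two problems concrete, here is a minimal sketch of one possible approach, not a definitive implementation: given text lines that already carry a position, start a new paragraph whenever the vertical gap to the previous line is large, and keep a bounding box per paragraph. The Line and Paragraph classes, the group method, and the gapFactor threshold are all hypothetical names made up for this illustration; they are not part of PDFBox. In PDFBox itself, I believe the per-line positions could be collected by subclassing PDFTextStripper and overriding writeString(String, List<TextPosition>), but that wiring is an assumption and is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: group positioned text lines into paragraphs by vertical gap. */
public class ParagraphGrouper {

    /** Hypothetical holder for one text line and its box (y grows downward). */
    public static final class Line {
        public final String text;
        public final float x, y, width, height;

        public Line(String text, float x, float y, float width, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Hypothetical paragraph: joined text plus the union box of its lines. */
    public static final class Paragraph {
        private final StringBuilder text = new StringBuilder();
        public float left, top, right, bottom;

        Paragraph(Line first) {
            text.append(first.text);
            left = first.x;
            top = first.y;
            right = first.x + first.width;
            bottom = first.y + first.height;
        }

        void add(Line line) {
            text.append(' ').append(line.text);
            left = Math.min(left, line.x);
            top = Math.min(top, line.y);
            right = Math.max(right, line.x + line.width);
            bottom = Math.max(bottom, line.y + line.height);
        }

        public String getText() {
            return text.toString();
        }
    }

    /**
     * Groups lines (assumed to be in reading order) into paragraphs. A new
     * paragraph starts when the gap between the previous line's bottom and
     * this line's top exceeds gapFactor times the previous line's height.
     */
    public static List<Paragraph> group(List<Line> lines, float gapFactor) {
        List<Paragraph> result = new ArrayList<Paragraph>();
        Paragraph current = null;
        Line previous = null;
        for (Line line : lines) {
            boolean startsNew = previous == null
                    || line.y - (previous.y + previous.height) > gapFactor * previous.height;
            if (startsNew) {
                current = new Paragraph(line);
                result.add(current);
            } else {
                current.add(line);
            }
            previous = line;
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data: two close lines, then one after a larger gap.
        List<Line> lines = new ArrayList<Line>();
        lines.add(new Line("First line of paragraph one", 72, 100, 300, 12));
        lines.add(new Line("second line of paragraph one.", 72, 114, 280, 12));
        lines.add(new Line("Paragraph two starts here.", 72, 150, 250, 12));
        for (Paragraph p : group(lines, 0.5f)) {
            System.out.println(p.getText() + " @ [" + p.left + ", " + p.top
                    + ", " + p.right + ", " + p.bottom + "]");
        }
    }
}
```

The stored box of each paragraph is what follow-up processing such as highlighting would use. The gap threshold is a heuristic and would likely need tuning per document; indentation or font changes could be added as further paragraph signals.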

Thanks.

