Please note that I've never used text extraction from PDFBox myself, so
these are just a few ideas for a possible direction:

If you look at the PDFText2HTML class (take it as an example) you can
guess that PDFTextStripper seems to be easily subclassable so you can do
custom processing, like, for example, overriding
writeParagraphStart()/writeParagraphEnd() to be notified when a
paragraph starts and ends. Then you can intercept the text sent to the
writeString() method (accumulate the strings in a StringBuilder until you
have the full paragraph). Once you have a paragraph, you can use Java's
regular expressions (java.util.regex) to match a pattern. Something like:

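// needs java.util.regex.Pattern and java.util.regex.Matcher
// find() instead of Pattern.matches(), which would require the whole paragraph to match
Matcher m = Pattern.compile("^\\((\\d+)\\)\\s").matcher(paragraphText);
if (m.find()) {
    //I have (possibly) found a footnote; m.group(1) holds its number
}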

This would match "(77) " and "(1) " and so on at the start of a
paragraph. The "^" makes sure that the pattern is only matched at the
beginning of the string.
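
To make the accumulate-then-match idea a bit more concrete, here is a
minimal sketch in plain Java. It deliberately contains no PDFBox code:
FootnoteCollector and all of its method names are hypothetical helpers of
my own, not part of PDFBox. The assumption is that you would call them
from your overridden writeParagraphStart(), writeString() and
writeParagraphEnd() in a PDFTextStripper subclass. Untested against a
real PDF, so treat it as a starting point only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a PDFBox class): collects paragraph text and checks for footnotes. */
final class FootnoteCollector {
    // "(", 1 to 9 digits, ")", then one whitespace character, anchored at the start.
    // The 9-digit cap is my own choice so that Integer.parseInt cannot overflow.
    private static final Pattern FOOTNOTE_START = Pattern.compile("^\\((\\d{1,9})\\)\\s");

    private final StringBuilder paragraph = new StringBuilder();
    private final List<Integer> footnoteNumbers = new ArrayList<>();

    /** Intended to be called from an overridden writeParagraphStart(). */
    void paragraphStart() {
        paragraph.setLength(0); // discard the previous paragraph
    }

    /** Intended to be called from an overridden writeString() with each piece of text. */
    void text(String chunk) {
        paragraph.append(chunk);
    }

    /** Intended to be called from an overridden writeParagraphEnd(). */
    void paragraphEnd() {
        int number = footnoteNumber(paragraph.toString());
        if (number >= 0) {
            footnoteNumbers.add(number); // (possibly) a footnote
        }
    }

    /** Returns the footnote number if the text starts with "(<number>) ", otherwise -1. */
    static int footnoteNumber(String paragraphText) {
        Matcher m = FOOTNOTE_START.matcher(paragraphText);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    List<Integer> getFootnoteNumbers() {
        return footnoteNumbers;
    }
}
```

Keep in mind this is only a heuristic: a normal paragraph that happens to
start with "(3) " would be reported as a footnote, too.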

The rest is up to you. Good luck.


On 26.10.2011 21:46:53 Hesham G. wrote:
> Jeremias ,
> 
> Thanks a lot ... That might be helpful, especially since I also want to detect the
> number of the footnote. But how can I get the pattern "(<number>) " in terms of
> the PDFBox language ?
> 
> 
> Best regards ,
> Hesham
> 
> ---------------------------------------------
> Included message :
> 
> 
> > Not reliably, no, because the PDF is not tagged. Together with text
> > extraction you might be able to come up with some heuristics to identify
> > footnotes. Like looking for a pattern "(<number>) " at the beginning of
> > a paragraph, for example. HTH
> > 
> > 
> > On 26.10.2011 19:52:46 Hesham G. wrote:
> >> Maybe my question was not clear enough ... I meant: is there a way to know
> >> that the currently extracted part of the PDF page is the footnote section ?
> >> 
> >> 
> >> Best regards ,
> >> Hesham
> >> 
> >> 
> >> ---------------------------------------------
> >> Included message :
> >> 
> >> > I see PDFBox (current trunk) extracting the footnote text correctly
> >> > from this PDF.  (I just ran the org.apache.pdfbox.ExtractText tool).
> >> > 
> >> > Mike McCandless
> >> > 
> >> > http://blog.mikemccandless.com
> >> > 
> >> > On Wed, Oct 26, 2011 at 8:25 AM, Hesham G. <[email protected]> 
> >> > wrote:
> >> >> Hello ,
> >> >>
> >> >> Is there a way to detect the footnotes section in a PDF file ?
> >> >> Here is a sample 2-page PDF with footnotes:
> >> >> http://www.4shared.com/document/Q03u9SMc/pdf_with_footnotes.html
> >> >>
> >> >>
> >> >> Best regards ,
> >> >> Hesham
> >> >>
> >> >
> > 
> > 
> > 
> > 
> > Jeremias Maerki
> > 
> >




Jeremias Maerki
