Re: Get content of a specific object

Roberto Nibali Thu, 27 Aug 2015 11:18:33 -0700

Hi Maruan

> > And again thanks heaps for your suggestion. It pointed me exactly towards
> > the right direction. Solved it using the following code:
> >
> > PDFTextStripper pdfTextStripper = new PDFTextStripper();
> > String text = pdfTextStripper.getText(srcDoc);
> > String textNormalized = text.replaceAll("\\n", " ").replaceAll("\\s{2}", " ");
> > List<String> metaData = getMetaData(textNormalized);
> > metaData.forEach(s -> System.out.printf("%s = %s%n", s.split("=")));
> >
> > public static List<String> getMetaData(String largeText){
> >    Pattern pattern = Pattern.compile("\\$\\$.*=.*\\s");
> >    Matcher mtch = pattern.matcher(largeText);
> >    List<String> entries = new ArrayList<>();
> >    while (mtch.find()) {
> >        entries.add(mtch.group());
> >    }
> >    return entries;
> > }
> >
> > Works like a charm!
> >
> > Question: Would it be possible to extract the text only from one page (the
> > first one) via the PDFTextStripper API?
>
> you can use PDFTextStripper.setStartPage() and PDFTextStripper.setEndPage()
>
>
Indeed, and it works wonderfully. Now I know why PDFTextStripper has all
those methods ;). Why not just turn the class into a Builder?
Anyway, it works for my case. Strangely enough, PDFTextStripper counts
pages from 1 (setStartPage(1) selects the first page), while
PDDocument.getPage() is zero-based, so index 0 is the first page.


I also did not figure out the semantics of setParagraphStart(String ...).
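
One caveat on my snippet quoted above, for the archives: replaceAll("\\s{2}", " ")
only collapses pairs of whitespace characters, and the greedy ".*" can run
across several entries once the text sits on a single line. Below is a
stricter sketch using only java.util.regex. It assumes entries look like
$$key=value with no whitespace inside the key or the value, which is my guess
at the format and not something PDFBox dictates. The class name MetaDataSketch
and the sample string are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaDataSketch {
    // Assumed entry shape: "$$key=value", no whitespace inside key or value.
    private static final Pattern ENTRY =
            Pattern.compile("\\$\\$([^=\\s]+)=(\\S*)");

    /** Collapses every run of whitespace (line breaks included) into one space. */
    public static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    /** Returns each match as "key=value", without the leading "$$". */
    public static List<String> getMetaData(String largeText) {
        List<String> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(largeText);
        while (m.find()) {
            entries.add(m.group(1) + "=" + m.group(2));
        }
        return entries;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the text returned by PDFTextStripper.getText().
        String sample = "Header\n$$author=Roberto\r\n$$pages=3   trailing text";
        for (String entry : getMetaData(normalize(sample))) {
            String[] kv = entry.split("=", 2); // split on the first '=' only
            System.out.printf("%s = %s%n", kv[0], kv[1]);
        }
    }
}
```

Capturing the key and value in groups keeps each match to a single entry, and
split("=", 2) leaves any further '=' characters in the value untouched.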

Cheers
Roberto
