Re: Get content of a specific object

Tilman Hausherr Thu, 27 Aug 2015 13:11:32 -0700

On 27.08.2015 at 20:17, Roberto Nibali wrote:

Hi Maruan

And again thanks heaps for your suggestion. It pointed me in exactly the
right direction. Solved it using the following code:

PDFTextStripper pdfTextStripper = new PDFTextStripper();
String text = pdfTextStripper.getText(srcDoc);
String textNormalized = text.replaceAll("\\n", " ").replaceAll("\\s{2}", " ");

List<String> metaData = getMetaData(textNormalized);
metaData.forEach(s -> System.out.printf("%s = %s%n", s.split("=")));

public static List<String> getMetaData(String largeText){
    Pattern pattern = Pattern.compile("\\$\\$.*=.*\\s");
    Matcher mtch = pattern.matcher(largeText);
    List<String> entries = new ArrayList<>();
    while (mtch.find()) {
        entries.add(mtch.group());
    }
    return entries;
}

Works like a charm!
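
A note on the pattern above: both ".*" parts are greedy, so if the normalized text really ends up on a single line, one match can run from the first "$$" to the last whitespace after the last "=" and swallow several entries at once. The following is a minimal sketch of a stricter variant, not the code used in this thread. It assumes entries look like "$$key=value" with no whitespace inside the key or the value; the class name MetaDataExtractor and the sample string are made up for illustration, since the actual PDF content is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaDataExtractor {

    // Assumed entry shape: "$$key=value", no whitespace inside key or value.
    // Group 1 is the key, group 2 is the (possibly empty) value.
    private static final Pattern ENTRY = Pattern.compile("\\$\\$([^=\\s]+)=(\\S*)");

    // Returns one {key, value} pair per entry found in the text.
    public static List<String[]> getMetaData(String text) {
        List<String[]> entries = new ArrayList<>();
        Matcher matcher = ENTRY.matcher(text);
        while (matcher.find()) {
            entries.add(new String[] { matcher.group(1), matcher.group(2) });
        }
        return entries;
    }

    public static void main(String[] args) {
        // Hypothetical sample text standing in for the stripped PDF content.
        String sample = "Intro $$author=Jane $$version=1.2 trailing text";
        for (String[] entry : getMetaData(sample)) {
            System.out.printf("%s = %s%n", entry[0], entry[1]);
        }
    }
}
```

Capturing key and value as groups also avoids the later split("="), which would misbehave if a value itself contained "=".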

Question: Would it be possible to extract the text only from one page (the
first one) via the PDFTextStripper API?

You can use PDFTextStripper.setStartPage() and PDFTextStripper.setEndPage().

Indeed, and it works wonderfully. Now I know why PDFTextStripper has all
those methods ;). Why not just convert the class to a Builder pattern?
Anyway, it works for my case. Strangely enough, the PDFTextStripper API
counts pages from 1, while PDDocument.getPage() is 0-based (index 0 is
page 1).


Yes... too late to change that now.

I also did not figure out the semantics of setParagraphStart(String ...).


I suspect it is for derived classes, e.g. PDFText2HTML.

Tilman



