On 27.08.2015 at 20:17, Roberto Nibali wrote:
Hi Maruan
And again, thanks heaps for your suggestion. It pointed me in exactly
the right direction. I solved it using the following code:
    PDFTextStripper pdfTextStripper = new PDFTextStripper();
    String text = pdfTextStripper.getText(srcDoc);
    // Flatten line breaks and collapse double whitespace
    String textNormalized = text.replaceAll("\\n", " ").replaceAll("\\s{2}", " ");
    List<String> metaData = getMetaData(textNormalized);
    // Limit the split to 2 parts so a value containing '=' is printed whole
    metaData.forEach(s -> System.out.printf("%s = %s%n", (Object[]) s.split("=", 2)));

    public static List<String> getMetaData(String largeText) {
        // Entries look like "$$key=value" followed by whitespace
        Pattern pattern = Pattern.compile("\\$\\$.*=.*\\s");
        Matcher mtch = pattern.matcher(largeText);
        List<String> entries = new ArrayList<>();
        while (mtch.find()) {
            entries.add(mtch.group());
        }
        return entries;
    }
Works like a charm!
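One caveat with the pattern above: `.*` is greedy, so once the text has been flattened to a single line, one match can run from the first `$$` to the last whitespace and swallow several entries. A minimal sketch of a stricter variant follows, assuming each entry has the form `$$key=value` and the value contains no whitespace. The class name MetaDataSketch, the method parse and the sample string are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaDataSketch {
    // Assumed entry format: "$$key=value". The key stops at '=' or whitespace,
    // the value stops at the next whitespace character.
    private static final Pattern ENTRY = Pattern.compile("\\$\\$([^=\\s]+)=(\\S*)");

    // Collects all entries in document order, keyed by the text before '='.
    static Map<String, String> parse(String text) {
        Map<String, String> entries = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(text);
        while (m.find()) {
            entries.put(m.group(1), m.group(2));
        }
        return entries;
    }

    public static void main(String[] args) {
        // Hypothetical sample text, not taken from a real PDF
        String sample = "Invoice $$author=Roberto $$year=2015 end of page";
        parse(sample).forEach((k, v) -> System.out.printf("%s = %s%n", k, v));
    }
}
```

Capturing key and value as groups also removes the need for the split("=") step when printing.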
Question: Would it be possible to extract the text from only one page
(the first one) via the PDFTextStripper API?
You can use PDFTextStripper.setStartPage() and PDFTextStripper.setEndPage().
Indeed, and it works wonderfully. Now I know why PDFTextStripper has all
those methods ;). Why not just give the class a builder instead?
Anyway, it works for my case. Strangely enough, PDFTextStripper numbers
pages starting from 1, while PDDocument.getPage() uses index 0 for the
first page.
Yes... too late to change that now.
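A minimal sketch of limiting extraction to the first page, assuming PDFBox 2.x or later (where PDFTextStripper lives in org.apache.pdfbox.text). The class name FirstPageText and the method firstPageText are made up for illustration.

```java
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

class FirstPageText {
    // Returns the text of the first page only.
    static String firstPageText(PDDocument doc) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        // Page numbers here are 1-based and the end page is inclusive
        stripper.setStartPage(1);
        stripper.setEndPage(1);
        return stripper.getText(doc);
    }
}
```

The same page is reached through the 0-based API as doc.getPage(0).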
I also did not figure out the semantics of setParagraphStart(String ...).
I suspect it is for derived classes, e.g. PDFText2HTML.
Tilman