Hi Tilman
On Thu, Aug 27, 2015 at 12:48 AM, Tilman Hausherr <[email protected]>
wrote:
On 27.08.2015 at 00:39, Roberto Nibali wrote:
>
>> Hi
>>
>> I'm looking at a PDF using PDFDebugger and the text I'd like to extract
>> from the PDF is inside the Content of node Root/Pages/Kids/[0]/Contents,
>> according to PDFDebugger. How do I programmatically dig down to this node
>> to extract the Flate-decoded ASCII stream hiding there inside object [3 0
>> R]?
>>
>> The first bytes of the stream's content look as follows in the PDFDebugger:
>>
>> q
>> 1 1 1 rg
>> /a0 gs
>> 14.16 827.76 566.879 -824.879 re
>> f
>> BT
>> 9.9984 0 0 9.9984 70.8 806.64 Tm
>> /f-0-0 1 Tf
>> [ ($) 6 ($) 6 (Do) 6 (s) 20 (s) 20 (i) -17 (er) -7 (n) 6 (r=) -16 (3)
>> 26 ('3) 6 (9) 6 (4'5) 6 (98) ] TJ
>>
>> Any pointers would be most welcome. In the above example, I'd like to
>> extract the text "$$DossierNr"
>>
>
> You could use PDFTextStripper.
>
> I hope you don't ask how to replace $$DossierNr in the PDF. Because that
> would be really tricky, if not impossible.
>
>
And again, thanks heaps for your suggestion. It pointed me in exactly the
right direction. I solved it using the following code:
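// Imports: java.util.ArrayList, java.util.List, java.util.regex.Matcher,
// java.util.regex.Pattern, plus PDFTextStripper from PDFBox.
PDFTextStripper pdfTextStripper = new PDFTextStripper();
String text = pdfTextStripper.getText(srcDoc);
// Collapse every run of whitespace (including line breaks) into one space.
String textNormalized = text.replaceAll("\\s+", " ");
List<String> metaData = getMetaData(textNormalized);
// Split on the first '=' only, so a value containing '=' stays intact.
metaData.forEach(s -> System.out.printf("%s = %s%n", (Object[]) s.split("=", 2)));

public static List<String> getMetaData(String largeText) {
    // Match "$$key=value" up to the next whitespace. This assumes that
    // neither keys nor values contain whitespace.
    Pattern pattern = Pattern.compile("\\$\\$[^\\s=]+=\\S*");
    Matcher mtch = pattern.matcher(largeText);
    List<String> entries = new ArrayList<>();
    while (mtch.find()) {
        entries.add(mtch.group());
    }
    return entries;
}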
Works like a charm!
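For anyone who wants to try the matching step without a PDF at hand, here is a
minimal, self-contained sketch. The class name MetaDataDemo, the sample string
and the second key ($$Customer) are made up for illustration, and the pattern
assumes that keys and values contain no whitespace:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaDataDemo {
    // Assumption: an entry looks like "$$key=value" and neither the key
    // nor the value contains whitespace.
    private static final Pattern ENTRY = Pattern.compile("\\$\\$[^\\s=]+=\\S*");

    // Returns every "$$key=value" entry found in the given text, in order.
    static List<String> getMetaData(String text) {
        List<String> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(text);
        while (m.find()) {
            entries.add(m.group());
        }
        return entries;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the normalized text of a PDF page.
        String sample = "Header $$DossierNr=3'394'598 some text $$Customer=ACME footer";
        for (String entry : getMetaData(sample)) {
            // Limit 2: split on the first '=' only.
            String[] kv = entry.split("=", 2);
            System.out.printf("%s = %s%n", kv[0], kv[1]);
        }
    }
}
```

Because the key and value parts exclude whitespace, several entries on the
same line come back as separate matches rather than one long match.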
Question: Would it be possible to extract the text from just one page (the
first one) via the PDFTextStripper API?
Cheers
Roberto