Re: Get content of a specific object

Roberto Nibali Thu, 27 Aug 2015 05:56:06 -0700

Hi Tilman

On Thu, Aug 27, 2015 at 12:48 AM, Tilman Hausherr <[email protected]>
wrote:


On 27.08.2015 at 00:39, Roberto Nibali wrote:
>
>> Hi
>>
>> I'm looking at a PDF using PDFDebugger, and the text I'd like to extract
>> is inside the content of node Root/Pages/Kids/[0]/Contents, according to
>> PDFDebugger. How do I programmatically dig down to this node to extract
>> the Flate-decoded ASCII stream hiding inside object [3 0 R]?
>>
>> The first bytes of the stream's content look as follows in PDFDebugger:
>>
>> q
>>    1 1 1 rg
>>    /a0 gs
>>    14.16 827.76 566.879 -824.879 re
>>    f
>>    BT
>>      9.9984 0 0 9.9984 70.8 806.64 Tm
>>      /f-0-0 1 Tf
>>      [ ($) 6 ($) 6 (Do) 6 (s) 20 (s) 20 (i) -17 (er) -7 (n) 6 (r=) -16 (3)
>> 26 ('3) 6 (9) 6 (4'5) 6 (98) ] TJ
>>
>> Any pointers would be most welcome. In the above example, I'd like to
>> extract the text "$$DossierNr"
>>
>
> You could use PDFTextStripper.
>
> I hope you're not asking how to replace $$DossierNr in the PDF, because
> that would be really tricky, if not impossible.
>
>
Thanks heaps again for your suggestion. It pointed me in exactly the right
direction. I solved it using the following code:

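import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// PDFTextStripper lives in org.apache.pdfbox.util (PDFBox 1.8) or org.apache.pdfbox.text (2.x)

// srcDoc is the already loaded PDDocument
PDFTextStripper pdfTextStripper = new PDFTextStripper();
String text = pdfTextStripper.getText(srcDoc);
// Collapse line breaks and runs of whitespace into single spaces
String textNormalized = text.replaceAll("\\s+", " ");
List<String> metaData = getMetaData(textNormalized);
// Limit the split to 2 so an '=' inside the value does not produce extra parts
metaData.forEach(s -> System.out.printf("%s = %s%n", (Object[]) s.split("=", 2)));

public static List<String> getMetaData(String largeText){
    // Reluctant quantifiers keep each match to a single "$$key=value" token
    // that ends at the next whitespace
    Pattern pattern = Pattern.compile("\\$\\$.*?=.*?\\s");
    Matcher mtch = pattern.matcher(largeText);
    List<String> entries = new ArrayList<>();
    while (mtch.find()) {
        entries.add(mtch.group().trim());
    }
    return entries;
}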

Works like a charm!
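
For anyone who wants to try the extraction step without a PDF at hand, here is a minimal, self-contained sketch of the same idea run against a made-up sample string. The class name MetaDataDemo, the extract helper, and the "$$Client=ACME" entry are hypothetical and only there for illustration; the pattern also assumes that keys and values contain no whitespace, which may not hold for every document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaDataDemo {
    // Assumption: an entry looks like "$$key=value" with no whitespace inside.
    // The reluctant key group stops at the first '=' of each token.
    private static final Pattern ENTRY = Pattern.compile("\\$\\$(\\S+?)=(\\S*)");

    // Returns one {key, value} pair per "$$key=value" token found in the text.
    static List<String[]> extract(String text) {
        // Collapse line breaks and repeated whitespace, as in the code above
        String normalized = text.replaceAll("\\s+", " ");
        Matcher m = ENTRY.matcher(normalized);
        List<String[]> out = new ArrayList<>();
        while (m.find()) {
            out.add(new String[] { m.group(1), m.group(2) });
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up stand-in for the text returned by PDFTextStripper.getText()
        String sample = "Header\n$$DossierNr=3'394'598\nsome body text\n$$Client=ACME\n";
        for (String[] kv : extract(sample)) {
            System.out.printf("%s = %s%n", kv[0], kv[1]);
        }
    }
}
```

Capturing key and value as separate groups avoids the later split on "=", so a value that itself contains "=" stays in one piece.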

Question: Is it possible to extract the text from only one page (the
first one) via the PDFTextStripper API?

Cheers
Roberto
