Hi,

> On 27.08.2015 at 14:55, Roberto Nibali <[email protected]> wrote:
> 
> Hi Tilman
> 
> On Thu, Aug 27, 2015 at 12:48 AM, Tilman Hausherr <[email protected]>
> wrote:
> 
> On 27.08.2015 at 00:39, Roberto Nibali wrote:
>> 
>>> Hi
>>> 
>>> I'm looking at a PDF using PDFDebugger and the text I'd like to extract
>>> from the PDF is inside the Content of node Root/Pages/Kids/[0]/Contents,
>>> according to PDFDebugger. How do I programmatically dig down to this node
>>> to extract the Flate-decoded ASCII stream hiding there inside object [3 0
>>> R]?
>>> 
>>> The stream's content first bytes look as follows in the PDFDebugger:
>>> 
>>> q
>>>   1 1 1 rg
>>>   /a0 gs
>>>   14.16 827.76 566.879 -824.879 re
>>>   f
>>>   BT
>>>     9.9984 0 0 9.9984 70.8 806.64 Tm
>>>     /f-0-0 1 Tf
>>>     [ ($) 6 ($) 6 (Do) 6 (s) 20 (s) 20 (i) -17 (er) -7 (n) 6 (r=) -16 (3)
>>> 26 ('3) 6 (9) 6 (4'5) 6 (98) ] TJ
>>> 
>>> Any pointers would be most welcome. In the above example, I'd like to
>>> extract the text "$$DossierNr"
>>> 
>> 
>> You could use PDFTextStripper.
>> 
>> I hope you don't ask how to replace $$DossierNr in the PDF. Because that
>> would be really tricky, if not impossible.
>> 
>> 
> And again thanks heaps for your suggestion. It pointed me exactly towards
> the right direction. Solved it using the following code:
> 
> PDFTextStripper pdfTextStripper = new PDFTextStripper();
> String text = pdfTextStripper.getText(srcDoc);
> String textNormalized = text.replaceAll("\\n", " ").replaceAll("\\s{2}", " ");
> List<String> metaData = getMetaData(textNormalized);
> metaData.forEach(s -> System.out.printf("%s = %s%n", s.split("=")));
> 
> public static List<String> getMetaData(String largeText){
>    Pattern pattern = Pattern.compile("\\$\\$.*=.*\\s");
>    Matcher mtch = pattern.matcher(largeText);
>    List<String> entries = new ArrayList<>();
>    while (mtch.find()) {
>        entries.add(mtch.group());
>    }
>    return entries;
> }
> 
> Works like a charm!
> 
> Question: Would it be possible to extract the text only from one page (the
> first one) via the PDFTextStripper API?

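You can use PDFTextStripper.setStartPage() and PDFTextStripper.setEndPage().
Page numbers are 1-based and the end page is inclusive, so setting both to 1
extracts only the first page.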
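
A side note on the getMetaData() pattern quoted above: once the newlines are
replaced with spaces, the greedy ".*" can run from the first "$$" to the last
whitespace in the text, so several entries may collapse into a single match.
Below is a minimal sketch of a stricter variant that uses only java.util.regex.
It assumes that keys consist of word characters and that values contain no
whitespace (as in $$DossierNr=3'394'598). The class name MetaDataSketch and the
"Status" entry in the sample string are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaDataSketch {

    // Assumption: a key is one or more word characters, a value is a run of
    // non-whitespace characters. Adjust if real values may contain spaces.
    private static final Pattern ENTRY = Pattern.compile("\\$\\$(\\w+)=(\\S+)");

    /** Collects every $$key=value token in the text, in order of appearance. */
    public static Map<String, String> getMetaData(String text) {
        Map<String, String> entries = new LinkedHashMap<>();
        Matcher matcher = ENTRY.matcher(text);
        while (matcher.find()) {
            // group(1) is the key without the leading "$$", group(2) the value
            entries.put(matcher.group(1), matcher.group(2));
        }
        return entries;
    }

    public static void main(String[] args) {
        // Hypothetical extracted text; only $$DossierNr comes from the thread.
        String sample = "Header $$DossierNr=3'394'598 some text $$Status=open footer";
        getMetaData(sample).forEach((k, v) -> System.out.printf("%s = %s%n", k, v));
    }
}
```

Because each match stops at the next whitespace, the text does not need to be
flattened first, and the key and value arrive as separate capture groups
instead of being split on "=" afterwards.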

BR Maruan


> 
> Cheers
> Roberto

