Re: More questions about page iteration

David Patterson Tue, 16 May 2017 06:30:44 -0700

Tilman,

Thanks. That was what I had come to realize when the PageLabels were null.


Just out of curiosity, how do page labels get created?

Dave Patterson
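
For the archive: since the printed numbers turned out to be plain footer text
rather than page labels, here is a minimal sketch of pulling them out of page
text that has already been extracted (for example, one page at a time with a
text stripper). The class name, the method name, and the two footer patterns
are my own assumptions based on the "TOC-1" / "Page 1" examples in this
thread. They are not part of PDFBox and would need adjusting for other layouts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FooterLabelFinder {
    // Assumed footer formats, taken from the examples quoted in this thread.
    private static final Pattern LABEL =
            Pattern.compile("(TOC-\\d+|Page \\d+)\\s*$");

    /**
     * Scans the page text from the last line upwards and returns the first
     * line ending that looks like a printed page number, or null if none.
     */
    public static String findFooterLabel(String pageText) {
        if (pageText == null) {
            return null;
        }
        String[] lines = pageText.split("\\R"); // any line-break sequence
        for (int i = lines.length - 1; i >= 0; i--) {
            Matcher m = LABEL.matcher(lines[i]);
            if (m.find()) {
                return m.group(1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String page = "Table of Contents\n1. Startup ...\nSOP 12 TOC-2\n";
        System.out.println(findFooterLabel(page)); // prints TOC-2
    }
}
```

Scanning from the bottom relies on the string only occurring in the footer. A
body line that happens to end in "Page 3" near the bottom of a page would be
picked up as well.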

On Tue, May 16, 2017 at 9:26 AM, Tilman Hausherr <[email protected]>
wrote:

> Sadly for you, that one has nothing to do with page labels. It's really
> just a footer on the page. And there is no concept of "footer" in PDF. It's
> just text at the bottom.
>
> Tilman
>
>
> Am 16.05.2017 um 15:21 schrieb David Patterson:
>
>> They show up when I print the PDF or open it to read it. I want to extract
>> the Table of Contents from each of > 100 PDFs so I can make a super-Table
>> of Contents and allow users to search for the document they need to read.
>> (The file name of the desired contents is not obvious, so with a
>> consolidated Table of Contents, a more novice user can find the content
>> they want to read and open the correct document to see the text.) These are
>> Standard Operating Procedures for a 24x7 production facility, and the
>> operators might need to review what to do in case of a problem.
>>
>> I was hoping that somewhere in the transition from Word (where the
>> documents are authored), through saving as PDF and combining them into
>> Portfolios, some part of the process would have identified it as a page
>> label, but I guess that did not happen.
>>
>> I'm able to find the text of that string since it only occurs in the
>> footer of the page.
>>
>> Thanks.
>>
>> Dave Patterson
>>
>> On Tue, May 16, 2017 at 8:42 AM, Tilman Hausherr <[email protected]>
>> wrote:
>>
>> Am 16.05.2017 um 14:35 schrieb David Patterson:
>>>
>>> Tilman,
>>>>
>>>> The code I tried is:
>>>>
>>>> byte[] bytes = // content of file as a byte array
>>>> PDDocument pdDocument = PDDocument.load( bytes );
>>>> PDDocumentCatalog cat2 = pdDocument.getDocumentCatalog();
>>>> PDPageLabels pageLabels = cat2.getPageLabels();
>>>> if ( pageLabels == null ) {
>>>> System.out.println( "Page labels missing " );
>>>> }
>>>>
>>>>
>>>> I'm getting "Page labels missing" on each document.
>>>>
>>> Then let's go back to the beginning. You mentioned "I've got page numbers
>>> like "TOC-1", "TOC-2", "Page 1"". Where did these show up?
>>>
>>> Tilman
>>>
>>>
>>>
>>>
>>>> I have no idea of, or control over, the process used to convert a Word
>>>> file into a PDF. I just inherited a bunch of PDFs that I'm trying to
>>>> interpret.
>>>>
>>>> Dave Patterson
>>>>
>>>> On Mon, May 15, 2017 at 1:57 PM, Tilman Hausherr <[email protected]>
>>>> wrote:
>>>>
>>>> Am 15.05.2017 um 19:11 schrieb David Patterson:
>>>>
>>>>> Alas, after testing with my documents, the PageLabels is null. :-(
>>>>>
>>>>> But you said it has "TOC-1". This sounds like page labels. You can also
>>>>> try with PDFDebugger; it will show the labels if there are any.
>>>>>
>>>>> Tilman
>>>>>
>>>>>
>>>>>
>>>>> Thank you for the help and encouragement.
>>>>>
>>>>>> Dave Patterson
>>>>>>
>>>>>> On Mon, May 15, 2017 at 12:34 PM, Tilman Hausherr <
>>>>>> [email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> Am 15.05.2017 um 18:30 schrieb David Patterson:
>>>>>>
>>>>>> Tilman,
>>>>>>>
>>>>>>>> Thank you very much. (I feel bad asking some of the questions, but the
>>>>>>>> data is stored in "out of the way" corners that are hard to find.)
>>>>>>>>
>>>>>>>> Don't :-)
>>>>>>>>
>>>>>>>> Is there any documentation that explains how the linkages work? Would
>>>>>>>> it help to have the PDF Standard Document?
>>>>>>>>
>>>>>>>>
>>>>>>> Yes. I read it all the time. The PDFBox API closely follows the PDF
>>>>>>> specification. Here the page labels are linked from the document
>>>>>>> catalog, so the methods used are in the PDDocumentCatalog class. But
>>>>>>> asking was a good decision, as this got you that convenience method
>>>>>>> (which is in PDFDebugger).
>>>>>>>
>>>>>>> Tilman
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> Dave Patterson
>>>>>>>>
>>>>>>>> On Mon, May 15, 2017 at 12:13 PM, Tilman Hausherr <
>>>>>>>> [email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Am 15.05.2017 um 15:20 schrieb David Patterson:
>>>>>>>>
>>>>>>>>>> I've now got my code working to iterate through a PDDocument and
>>>>>>>>>> process it page by page.
>>>>>>>>>>
>>>>>>>>>> Next hurdle: Is there a way to get the page number as printed? I've
>>>>>>>>>> got page numbers like "TOC-1", "TOC-2", "Page 1", ...
>>>>>>>>>>
>>>>>>>>>> How much work is it to get the "TOC-1"?
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>> Dave Patterson
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>         /**
>>>>>>>>>          * Convenience method to get the page label if available.
>>>>>>>>>          *
>>>>>>>>>          * @param document
>>>>>>>>>          * @param pageIndex 0-based page number.
>>>>>>>>>          * @return a page label or null if not available.
>>>>>>>>>          */
>>>>>>>>>         public static String getPageLabel(PDDocument document, int pageIndex)
>>>>>>>>>         {
>>>>>>>>>             PDPageLabels pageLabels;
>>>>>>>>>             try
>>>>>>>>>             {
>>>>>>>>>                 pageLabels = document.getDocumentCatalog().getPageLabels();
>>>>>>>>>             }
>>>>>>>>>             catch (IOException ex)
>>>>>>>>>             {
>>>>>>>>>                 return ex.getMessage();
>>>>>>>>>             }
>>>>>>>>>             if (pageLabels != null)
>>>>>>>>>             {
>>>>>>>>>                 String[] labels = pageLabels.getLabelsByPageIndices();
>>>>>>>>>                 if (labels[pageIndex] != null)
>>>>>>>>>                 {
>>>>>>>>>                     return labels[pageIndex];
>>>>>>>>>                 }
>>>>>>>>>             }
>>>>>>>>>             return null;
>>>>>>>>>         }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ------------------------------------------------------------
>>>>>>>>> ---------
>>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>>> For additional commands, e-mail: [email protected]
