RE: Getting page counts of PDFs

Ralph DiMola via use-livecode Sun, 23 Aug 2020 07:17:44 -0700

PDF Widget will do the trick. I started reading the PDF spec 20 years ago and 
got a giant headache.


Ralph DiMola
IT Director
Evergreen Information Services
rdim...@evergreeninfo.net

-----Original Message-----
From: use-livecode [mailto:use-livecode-boun...@lists.runrev.com] On Behalf Of 
David V Glasgow via use-livecode
Sent: Sunday, August 23, 2020 8:07 AM
To: How to use LiveCode
Cc: David V Glasgow
Subject: Getting page counts of PDFs

Livecoders,

In my day job, some of my income depends on the number of pages in a number 
of PDF documents that I have to read for individual cases. I thought it would 
be fun and useful to write an LC script that would either count the pages of a 
single PDF or (even better) get the page counts for a folder full of PDFs.

I didn’t imagine it would be too hard, because both Mac and Win OSs report the 
page count instantly and accurately in the file information windows.

I discovered that in a small sample of PDFs a line…

<< /Type /Pages /MediaBox [0 0 612 792] /Count 149 /Kids [ 1396 0 R 1397 0 R


...contained the page count, which was a bit confusing because I had read that 
the MediaBox was only about page dimensions. Then I found that some PDFs don’t 
contain that line, or at least not in the clear.
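
A minimal sketch of the "look for the /Count line" idea described above, in Python purely for illustration (the same scan could presumably be ported to an LC script). The function name guess_page_count and both regular expressions are assumptions made for this example, not part of any library. It only works when the page tree is stored in the clear: for PDFs that keep their objects in compressed streams it returns None, which matches the observation that some files don't contain the line.

```python
import re

# Heuristic patterns (assumptions for this sketch, not a real PDF parser).
_PAGES_RE = re.compile(rb"/Type\s*/Pages\b")   # page-tree nodes only, not /Page
_COUNT_RE = re.compile(rb"/Count\s+(\d+)")     # the integer after /Count


def guess_page_count(data):
    """Return the largest /Count found in a /Type /Pages object, or None.

    The root of the page tree carries the total page count, while nested
    /Pages nodes carry smaller subtotals, so the largest value is taken.
    """
    best = None
    # Splitting on "endobj" gives roughly one PDF object per chunk.
    for chunk in data.split(b"endobj"):
        if not _PAGES_RE.search(chunk):
            continue  # skip objects such as /Outlines that also use /Count
        for match in _COUNT_RE.finditer(chunk):
            n = int(match.group(1))
            if best is None or n > best:
                best = n
    return best
```

For a file on disk, the bytes would be read in binary mode first, for example guess_page_count(open(path, "rb").read()). A None result means the count could not be found this way, not that the document is empty.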

There is a general online consensus that reliably finding the page count of a 
PDF involves quite a lot of messing about and parsing, and may involve pretty 
much counting the pages.

I found some code here <http://www.angusj.com/delphitips/pdfpagecount.php> with 
the following walkthrough:

//1.  See if there's a 'Linearization dictionary' for easy parsing.
//    Mostly there isn't so ...
//2.  Locate 'startxref' at end of file
//3.  get 'xref' offset and go to xref table
//4.  depending on version the xref table may or may not be in a compressed
//    stream. If it's in a compressed stream (PDF ver 1.5+) then getting the
//    page number requires a LOT of code which is too convoluted to summarise
//    here. Otherwise it still requires a moderate amount of code ...
//5.  parse the xref table and fill a list with object numbers and offsets
//6.  handle subsections within xref table.
//7.  read 'trailer' section at end of each xref
//8.  store 'Root' object number if found in 'trailer'
//9.  if 'Prev' xref found in 'trailer' - loop back to step 3
//10. locate Root in the object list
//11. locate 'Pages' object from Root
//12. get Count from Pages.
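
Steps 2 to 4 of the walkthrough above can be sketched as follows, again in Python for illustration only. The function names find_xref_offset and xref_is_classic_table and the 1024-byte tail size are assumptions for this example. The sketch stops at deciding whether the cross-reference data is a plain-text table; it does not attempt the compressed-stream case that the walkthrough calls convoluted.

```python
import re


def find_xref_offset(data, tail_size=1024):
    """Step 2 and 3: return the offset given after the last 'startxref'.

    Only the end of the file is searched, because 'startxref' sits just
    before the final %%EOF marker. Returns None if it is not found.
    """
    tail = data[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)", tail)
    return int(matches[-1]) if matches else None


def xref_is_classic_table(data, offset):
    """Step 4: True if the offset points at a plain-text 'xref' table.

    If this is False, the cross-reference data is probably held in a
    stream object (PDF 1.5+) and needs considerably more parsing.
    """
    return data[offset:offset + 4] == b"xref"
```

Reading only the tail is one plausible reason a lookup can be fast even on huge files: nothing here requires scanning the whole document.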


If this is right, how on earth do OSs do it so quickly? Also, and more to the 
point, am I on a fool’s errand to do this with LC? I haven’t seen anything that 
obviously couldn’t be done (I didn’t understand the regex, but assumed with 
effort…). However, parsing huge files just doesn’t look like it would be worth 
the effort, particularly as I can select all the documents, get info, and sum 
the pages in my head.

Cheers,

David Glasgow
_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


