[PHP] Re: PHP class or functions to manipulate PDF metadata?

2009-04-07 Thread O. Lavell
Peter Ford wrote:

> O. Lavell wrote:
>> Peter Ford wrote:

[..]

>>> I do accept that the metadata should be machine-readable: that part of
>>> your project is reasonable and I'm fairly sure that ought to be
>>> possible with something simple. The best bet I found so far is PDFTK
>>> (http://www.pdfhacks.com/pdftk/) which is a command-line tool that you
>>> could presumably call with exec or whatever...
>> 
>> Like I said, this is what I am already doing with the pdfinfo utility
>> from xpdf.
> 
> Sorry - I guess I didn't read that bit carefully enough...

No problem at all, I was really glad someone wanted to share their 
thoughts anyway after it first seemed that no one was interested.

[..]

>> So thank you again for pushing me in that direction, even if
>> unintentionally and despite the fact that what I am doing goes against
>> your judgement ;)
>> 
>> 
> As I know only too well, you can't always choose your customers
> (especially if they choose you...) and you certainly can't control all
> of the sources of data you have to deal with!

Exactly.

> I have spent many hours/days/possibly longer hacking through files that
> are in one form to get data into another, and PDF is the one that always
> makes me nervous :(

So far you, Tedd and I agree on this. The so-called portable document 
format is a rather convoluted thing.

> My judgement is certainly not final, or even particularly important: if
> I had time I would also look into at least getting the metadata with
> pure PHP.
> 
> Good luck...

Thank you. If I did have the time (to spare) I would feel almost obliged 
to try to figure it out. Perhaps in a week or two...


-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



[PHP] Re: PHP class or functions to manipulate PDF metadata?

2009-04-07 Thread Peter Ford
O. Lavell wrote:
> Peter Ford wrote:
> 
>> O. Lavell wrote:
> 
> [..]
> 
>>> Any and all suggestions are welcome. Thank you in advance.
>>>
>> So many people ask about manipulating, editing and generally processing
>> PDF files. In my experience, PDF is a write-once format - any
>> manipulation should have been done in whatever source generated the PDF.
>> I think of a PDF as being a piece of paper: if you want to change the
>> content of a piece of paper it is usually best to chuck it away and
>> start again...
>>
>> Even more so, this would apply to the PDF metadata: metadata is supposed
>> to describe the nature of the document: its author, creation time etc.
>> That sort of data should be maintained with the document and ideally not
>> changed throughout the document's lifetime (like the footer, or
>> end-papers in a physical book)
> 
> Thank you very much for your reply. And it's not that I don't agree with 
> you. Because I do, completely.
> 
> However...
> 
> PDFs often come from sources that can't be bothered to fill in the 
> relevant fields correctly, completely, or at all. For those cases I would 
> like the users of my application to be able to correct the values found 
> in the metadata. Upload the PDF, get a nice little HTML form with 4 or 5 
> values to review or edit. That sort of thing.
> 
>> I do accept that the metadata should be machine-readable: that part of
>> your project is reasonable and I'm fairly sure that ought to be possible
>> with something simple. The best bet I found so far is PDFTK
>> (http://www.pdfhacks.com/pdftk/) which is a command-line tool that you
>> could presumably call with exec or whatever...
> 
> Like I said, this is what I am already doing with the pdfinfo utility 
> from xpdf.

Sorry - I guess I didn't read that bit carefully enough...

> 
> But now that you mentioned pdftk... I just tried it and it does seem to 
> come close to what I want. It is capable of writing a new PDF with the 
> contents of an existing one, with new metadata fed as a text file. So it 
> shouldn't be very hard to write a little PHP around that process.
> 
> Now I need to think a bit more about this approach. Perhaps it can be 
> implemented using only pure PHP, after all. But for the time being, pdftk 
> will do.
> 
> So thank you again for pushing me in that direction, even if 
> unintentionally and despite the fact that what I am doing goes against 
> your judgement ;)
> 

As I know only too well, you can't always choose your customers (especially if
they choose you...) and you certainly can't control all of the sources of data
you have to deal with!
I have spent many hours/days/possibly longer hacking through files that are in
one form to get data into another, and PDF is the one that always makes me
nervous :(
My judgement is certainly not final, or even particularly important: if I had
time I would also look into at least getting the metadata with pure PHP.

Good luck...

-- 
Peter Ford  phone: 01580 89
Developer   fax:   01580 893399
Justcroft International Ltd., Staplehurst, Kent




Re: [PHP] Re: PHP class or functions to manipulate PDF metadata?

2009-04-06 Thread O. Lavell
tedd wrote:

[..]

> All the attempts I have made at opening up a PDF file and then trying
> to make sense of it and put it back together with something changed have
> been absolute failures.
> 
> The algorithm used to make a PDF file reminds me of a replacement-type
> compression technique -- it's not easy to understand what was done.

It's definitely voodoo. And I'm not averse to a little voodoo myself, 
but someone else's voodoo in which you aren't initiated always seems to 
be so much more impenetrable...





[PHP] Re: PHP class or functions to manipulate PDF metadata?

2009-04-06 Thread O. Lavell
Peter Ford wrote:

> O. Lavell wrote:

[..]

>> Any and all suggestions are welcome. Thank you in advance.
>> 
> So many people ask about manipulating, editing and generally processing
> PDF files. In my experience, PDF is a write-once format - any
> manipulation should have been done in whatever source generated the PDF.
> I think of a PDF as being a piece of paper: if you want to change the
> content of a piece of paper it is usually best to chuck it away and
> start again...
> 
> Even more so, this would apply to the PDF metadata: metadata is supposed
> to describe the nature of the document: its author, creation time etc.
> That sort of data should be maintained with the document and ideally not
> changed throughout the document's lifetime (like the footer, or
> end-papers in a physical book)

Thank you very much for your reply. And it's not that I don't agree with 
you. Because I do, completely.

However...

PDFs often come from sources that can't be bothered to fill in the 
relevant fields correctly, completely, or at all. For those cases I would 
like the users of my application to be able to correct the values found 
in the metadata. Upload the PDF, get a nice little HTML form with 4 or 5 
values to review or edit. That sort of thing.

> I do accept that the metadata should be machine-readable: that part of
> your project is reasonable and I'm fairly sure that ought to be possible
> with something simple. The best bet I found so far is PDFTK
> (http://www.pdfhacks.com/pdftk/) which is a command-line tool that you
> could presumably call with exec or whatever...

Like I said, this is what I am already doing with the pdfinfo utility 
from xpdf.

But now that you mentioned pdftk... I just tried it and it does seem to 
come close to what I want. It is capable of writing a new PDF with the 
contents of an existing one, with new metadata fed as a text file. So it 
shouldn't be very hard to write a little PHP around that process.
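
A minimal sketch of such a wrapper, assuming pdftk is on the PATH and that
its update_info operation reads the same InfoBegin/InfoKey/InfoValue layout
that recent dump_data output uses (older releases may omit the InfoBegin
lines). The function names are hypothetical, chosen only for illustration:

```php
<?php
// Sketch only: helper names are hypothetical, not part of pdftk or PHP.

/**
 * Build the text that pdftk's update_info operation reads.
 * Assumes the InfoBegin/InfoKey/InfoValue layout; adjust if your
 * pdftk version expects something different.
 */
function buildPdftkInfo(array $fields): string
{
    $out = '';
    foreach ($fields as $key => $value) {
        // Each value must stay on one line, so collapse whitespace runs.
        $value = trim(preg_replace('/\s+/', ' ', (string) $value));
        $out .= "InfoBegin\nInfoKey: {$key}\nInfoValue: {$value}\n";
    }
    return $out;
}

/** Compose the shell command; every path is escaped before use. */
function pdftkUpdateCommand(string $in, string $infoFile, string $out): string
{
    return sprintf(
        'pdftk %s update_info %s output %s',
        escapeshellarg($in),
        escapeshellarg($infoFile),
        escapeshellarg($out)
    );
}

/** Write the info file, run pdftk, and report whether it exited cleanly. */
function updatePdfMetadata(string $in, string $out, array $fields): bool
{
    $infoFile = tempnam(sys_get_temp_dir(), 'pdfinfo');
    file_put_contents($infoFile, buildPdftkInfo($fields));
    exec(pdftkUpdateCommand($in, $infoFile, $out), $lines, $status);
    unlink($infoFile);
    return $status === 0;
}
```

Something like updatePdfMetadata('in.pdf', 'out.pdf', ['Title' => 'New title'])
would then write a copy with the edited field. Because pdftk writes a new
file rather than editing in place, the caller still has to decide whether to
replace the original.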

Now I need to think a bit more about this approach. Perhaps it can be 
implemented using only pure PHP, after all. But for the time being, pdftk 
will do.

So thank you again for pushing me in that direction, even if 
unintentionally and despite the fact that what I am doing goes against 
your judgement ;)





[PHP] Re: PHP class or functions to manipulate PDF metadata?

2009-04-06 Thread tedd

At 10:06 AM +0100 4/6/09, Peter Ford wrote:
> O. Lavell wrote:
>> Any and all suggestions are welcome. Thank you in advance.
>
> So many people ask about manipulating, editing and generally processing PDF
> files. In my experience, PDF is a write-once format - any manipulation should
> have been done in whatever source generated the PDF. I think of a PDF as being a
> piece of paper: if you want to change the content of a piece of paper it is
> usually best to chuck it away and start again...


That's a good way to put it.

All the attempts I have made at opening up a PDF file and then 
trying to make sense of it and put it back together with something 
changed have been absolute failures.


The algorithm used to make a PDF file reminds me of a 
replacement-type compression technique -- it's not easy to understand 
what was done.


Cheers,

tedd

--
---
http://sperling.com  http://ancientstones.com  http://earthstones.com




[PHP] Re: PHP class or functions to manipulate PDF metadata?

2009-04-06 Thread Peter Ford
O. Lavell wrote:
> Hi group,
> 
> I am looking for an easy way to manipulate (read, write) the metadata 
> (title, subject, keywords, author) in PDF files through PHP.
> 
> Most PHP/PDF solutions I have found so far (through Google) are aimed at 
> constructing PDFs from text and graphics, with lots of fancy features, 
> but most of them omit metadata functions altogether.
> 
> I would also prefer something extremely lightweight that I could just 
> include_once() into my script, i.e. not a module or external program. I 
> am currently using pdfinfo from xpdf-utils, but it has to go.
> 
> My use case is I want to build a database with the metadata of a bunch 
> (many hundreds, perhaps thousands) of PDF files in a directory on the 
> server for easy search, statistics and retrieval. I also want users to be 
> able to make edits to any PDF's metadata from the web.
> 
> If it can be at all avoided, I would rather not have to invent the wheel 
> myself here. I have looked at the Adobe PDF specification a bit and it 
> looks quite... challenging. Or should I say daunting.
> 
> Any and all suggestions are welcome. Thank you in advance.
> 

So many people ask about manipulating, editing and generally processing PDF
files. In my experience, PDF is a write-once format - any manipulation should
have been done in whatever source generated the PDF. I think of a PDF as being a
piece of paper: if you want to change the content of a piece of paper it is
usually best to chuck it away and start again...

Even more so, this would apply to the PDF metadata: metadata is supposed to
describe the nature of the document: its author, creation time etc. That sort
of data should be maintained with the document and ideally not changed
throughout the document's lifetime (like the footer, or end-papers in a physical
book)

I do accept that the metadata should be machine-readable: that part of your
project is reasonable and I'm fairly sure that ought to be possible with
something simple. The best bet I found so far is PDFTK
(http://www.pdfhacks.com/pdftk/) which is a command-line tool that you could
presumably call with exec or whatever...
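
As a rough sketch of that exec approach, assuming pdftk is installed and that
its dump_data operation prints each metadata field as an InfoKey line
followed by an InfoValue line (the function names here are hypothetical):

```php
<?php
// Sketch only: function names are hypothetical.

/**
 * Turn the lines printed by "pdftk file.pdf dump_data" into an array.
 * Assumes each InfoKey line is followed by its InfoValue line; other
 * lines (InfoBegin, NumberOfPages, bookmarks, ...) are ignored.
 */
function parsePdftkDump(array $lines): array
{
    $meta = [];
    $key = null;
    foreach ($lines as $line) {
        if (preg_match('/^InfoKey:\s*(.*)$/', $line, $m)) {
            $key = $m[1];
        } elseif ($key !== null && preg_match('/^InfoValue:\s*(.*)$/', $line, $m)) {
            $meta[$key] = $m[1];
            $key = null;
        }
    }
    return $meta;
}

/** Run pdftk on one file; returns null when the command fails. */
function readPdfMetadata(string $path): ?array
{
    // escapeshellarg() keeps odd file names from breaking the command.
    exec('pdftk ' . escapeshellarg($path) . ' dump_data', $lines, $status);
    return $status === 0 ? parsePdftkDump($lines) : null;
}
```

Keeping the parsing separate from the exec call means the parser can be
checked against saved pdftk output without the binary being present.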


-- 
Peter Ford  phone: 01580 89
Developer   fax:   01580 893399
Justcroft International Ltd., Staplehurst, Kent
