Re: [PHP] SCanning text of PDF documents
2008/5/15 Angelo Zanetti [EMAIL PROTECTED]:

A client of ours wants a solution where, when a PDF document is uploaded, we use PHP to scan the document's contents and save them in a DB. I know you can do this with normal text documents using the file commands and functions. Is it possible with PDF documents? My feeling is NO, but perhaps someone will prove me wrong.

There's a good chance your installation already has pdf2ps and ps2ascii.

-- PHP General Mailing List (http://www.php.net/) To unsubscribe, visit: http://www.php.net/unsub.php
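As a rough illustration of the pdf2ps and ps2ascii suggestion, the sketch below shells out to both tools from PHP. It assumes both Ghostscript utilities are on the PATH and that a reasonably recent PHP is available (7.1 or later, for the type declarations). The function names `buildExtractionCommands` and `extractPdfText` and the temp-file handling are hypothetical, not something from the thread.

```php
<?php
// Sketch only: function names and temp-file handling are illustrative assumptions.

/**
 * Build the two shell commands: PDF -> PostScript, then PostScript -> plain text.
 */
function buildExtractionCommands(string $pdfPath, string $psPath): array
{
    return [
        'pdf2ps ' . escapeshellarg($pdfPath) . ' ' . escapeshellarg($psPath),
        'ps2ascii ' . escapeshellarg($psPath),
    ];
}

/**
 * Run both commands and return the extracted text, or null if either step fails.
 */
function extractPdfText(string $pdfPath): ?string
{
    $psPath = tempnam(sys_get_temp_dir(), 'ps_');
    if ($psPath === false) {
        return null;
    }

    [$toPostScript, $toText] = buildExtractionCommands($pdfPath, $psPath);

    exec($toPostScript, $unused, $status);
    if ($status !== 0) {
        unlink($psPath);
        return null;
    }

    // ps2ascii writes the text to stdout when no output file is given.
    exec($toText, $lines, $status);
    unlink($psPath);

    return $status === 0 ? implode("\n", $lines) : null;
}
```

Calling `extractPdfText()` on the uploaded file's path would then give a string that can be stored in the DB, or null when the tools are missing or the conversion fails. Escaping both paths with `escapeshellarg()` matters here because upload file names are user-controlled.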
Re: [PHP] SCanning text of PDF documents
A reliable solution depends partly on the PDF document itself. Consider, for example, a PDF that contains rotated text, or text that spans several different blocks or pages. My experience with ps2ascii and other Ghostscript-related tools is that sometimes they work quite well and sometimes the output is rather messy.

The most reliable way of extracting text from a PDF is (I think) a product called PDF TET from PDFlib GmbH. Yes, a license costs some money, but you are then able to get almost everything out of the PDF.

http://www.pdflib.com/products/tet/

Maybe some magic with OpenOffice could do the trick as well?

//frank

On 15 May 2008 at 10:19, Angelo Zanetti wrote:

Hi All. This is a quick question. A client of ours wants a solution where, when a PDF document is uploaded, we use PHP to scan the document's contents and save them in a DB. I know you can do this with normal text documents using the file commands and functions. Is it possible with PDF documents? My feeling is NO, but perhaps someone will prove me wrong.

Thanks in advance.
Angelo
Web: http://www.elemental.co.za

Frank Arensmeier
Webmaster IT Development
NIKE Hydraulics AB
Box 1107
631 80 Eskilstuna
Sweden
phone +46 - (0)16 16 82 34
fax +46 - (0)16 13 93 16
[EMAIL PROTECTED]
www.nikehydraulics.se
Re: [PHP] SCanning text of PDF documents
On Thu, May 15, 2008 at 4:19 AM, Angelo Zanetti [EMAIL PROTECTED] wrote:

Hi All. This is a quick question. A client of ours wants a solution where, when a PDF document is uploaded, we use PHP to scan the document's contents and save them in a DB. I know you can do this with normal text documents using the file commands and functions. Is it possible with PDF documents? My feeling is NO, but perhaps someone will prove me wrong.

Thanks in advance.
Angelo
Web: http://www.elemental.co.za

You might want to look at Zend_Pdf.
Re: [PHP] SCanning text of PDF documents
Angelo Zanetti wrote:

Hi All. This is a quick question. A client of ours wants a solution where, when a PDF document is uploaded, we use PHP to scan the document's contents and save them in a DB. I know you can do this with normal text documents using the file commands and functions. Is it possible with PDF documents? My feeling is NO, but perhaps someone will prove me wrong.

Thanks in advance.
Angelo
Web: http://www.elemental.co.za

One thing you'll have to watch is that if the PDF was created by a scanner, then the text in the PDF is actually just an image and cannot be read without OCR. I got stumped on that one for a while when I was doing something similar :)

-- Ray Hauge
www.primateapplications.com
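One way to catch the scanned-PDF case is to check whether text extraction returned anything meaningful before saving it. The sketch below is only a heuristic under stated assumptions: the function name `looksLikeScannedPdf` and the 20-character threshold are made up for illustration, and an image-only PDF is assumed to produce little or no text when run through a text extractor.

```php
<?php
// Heuristic sketch: the function name and the default threshold are assumptions.

/**
 * Return true when the extracted text is (nearly) empty, which suggests the
 * PDF holds page images rather than real text and would need OCR.
 */
function looksLikeScannedPdf(string $extractedText, int $minChars = 20): bool
{
    // Drop all whitespace (spaces, newlines, form feeds) and count what is left.
    $visible = (string) preg_replace('/\s+/', '', $extractedText);

    return strlen($visible) < $minChars;
}
```

A file flagged this way could be queued for OCR or rejected with a message to the uploader instead of being stored with empty contents. Very short but genuine documents would also be flagged, so the threshold needs tuning against real uploads.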
Re: [PHP] SCanning text of PDF documents
On Thu, 2008-05-15 at 20:17 -0500, Ray Hauge wrote:

One thing you'll have to watch is that if the PDF was created by a scanner, then the text on the PDF is actually just an image and cannot be read without OCR. I got stumped on that one for a while when I was doing something similar :)

I love the tables where you have something like the following:

.-----------------.----------------------.
| This is a short | This is a different  |
| paragraph about | piece of content     |
| something in a  | about another thing. |
| table           |                      |
`-----------------^----------------------'

And of course when you cut and paste you get the following:

This is a short This is a different
paragraph about piece of content
something in a about another thing.
table

Oh yes, that's what I expected too. It's not even something you can clean up with a macro. You have to carefully piece the columns back together, or copy/paste one line at a time, or just type it :)

Cheers,
Rob.
-- http://www.interjinn.com
Application and Templating Framework for PHP