Re: [PHP] SCanning text of PDF documents

2008-05-15 Thread David Otton
2008/5/15 Angelo Zanetti:

  A client of ours wants a solution where, when a PDF document is uploaded,
  we use PHP to scan the document's contents and save them in a DB.

  I know you can do this with normal text documents using the file commands
  and functions.

  Is it possible with PDF documents?

  My feeling is NO, but perhaps someone will prove me wrong.

There's a good chance your installation already has pdf2ps and ps2ascii
(both ship with Ghostscript).
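
A minimal sketch of calling ps2ascii from PHP, in case it helps. It assumes
ps2ascii is on the PATH, that your build accepts a PDF as input and writes the
text to stdout, and that PHP is allowed to run exec() on a POSIX shell. The
function names buildPs2AsciiCommand() and extractPdfText() are made up for
this example.

```php
<?php
// Hypothetical helper: build the shell command for ps2ascii.
// escapeshellarg() keeps odd characters in uploaded filenames from
// being interpreted by the shell.
function buildPs2AsciiCommand(string $pdfPath): string
{
    // Assumption: with a single input argument, ps2ascii prints to stdout.
    return 'ps2ascii ' . escapeshellarg($pdfPath) . ' 2>/dev/null';
}

// Hypothetical helper: return the extracted text, or null on failure.
function extractPdfText(string $pdfPath): ?string
{
    if (!is_readable($pdfPath)) {
        return null; // nothing to extract from
    }

    $output = [];
    $status = 1;
    exec(buildPs2AsciiCommand($pdfPath), $output, $status);

    // A non-zero exit status means the tool is missing or the conversion failed.
    if ($status !== 0) {
        return null;
    }

    return implode("\n", $output);
}
```

The returned string can then be saved to the DB with an ordinary prepared
statement. Check for null first, since the tool may be missing or the PDF may
not convert.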

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] SCanning text of PDF documents

2008-05-15 Thread Frank Arensmeier
A reliable solution depends partly on the PDF document itself.
Consider whether your PDF contains rotated text or text that spans
several different blocks/pages. My experience with ps2ascii and
other Ghostscript-related tools is that sometimes they work quite
well, and sometimes the output is rather messy.


The most reliable way of extracting text from a PDF is (I think) a
product called PDF TET from PDFlib GmbH. Yes, a license costs some
money, but you are then able to get almost everything out of the PDF.


http://www.pdflib.com/products/tet/

Maybe some magic with OpenOffice could do the trick as well?

//frank

On 15 May 2008, at 10:19, Angelo Zanetti wrote:


Hi All.

This is a quick question.

A client of ours wants a solution where, when a PDF document is
uploaded, we use PHP to scan the document's contents and save them in a DB.

I know you can do this with normal text documents using the file
commands and functions.

Is it possible with PDF documents?

My feeling is NO, but perhaps someone will prove me wrong.

Thanks in advance.

Angelo

Web: http://www.elemental.co.za



Frank Arensmeier
Webmaster  IT Development

NIKE Hydraulics AB
Box 1107
631 80 Eskilstuna
Sweden

phone +46 - (0)16 16 82 34
fax +46 - (0)16 13 93 16
www.nikehydraulics.se



Re: [PHP] SCanning text of PDF documents

2008-05-15 Thread Eric Butera
You might want to look at Zend_Pdf.




Re: [PHP] SCanning text of PDF documents

2008-05-15 Thread Ray Hauge

One thing you'll have to watch is that if the PDF was created by a 
scanner, then the text on the PDF is actually just an image and cannot 
be read without OCR.  I got stumped on that one for a while when I was 
doing something similar :)
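
One cheap way to catch that case is to look at how much real text came out of
the extraction step. This is only a heuristic sketch: needsOcr() and the
20-character threshold are invented for illustration, and it assumes the
extractor returns mostly ASCII text.

```php
<?php
// Hypothetical heuristic: if almost no letters or digits were extracted,
// the PDF is probably a scanned image and needs OCR instead.
function needsOcr(string $extractedText, int $minChars = 20): bool
{
    // Count letters and digits only; whitespace and form feeds
    // (page breaks) do not count as content.
    $count = preg_match_all('/[A-Za-z0-9]/', $extractedText);

    return $count === false || $count < $minChars;
}
```

A scanned document will usually come back empty or as a handful of page-break
characters, so it falls under the threshold. A very short but genuine text PDF
will be flagged as well, so treat a true result as "check this one by hand"
rather than proof.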


--
Ray Hauge
www.primateapplications.com




Re: [PHP] SCanning text of PDF documents

2008-05-15 Thread Robert Cummings

On Thu, 2008-05-15 at 20:17 -0500, Ray Hauge wrote:

 One thing you'll have to watch is that if the PDF was created by a 
 scanner, then the text on the PDF is actually just an image and cannot 
 be read without OCR.  I got stumped on that one for a while when I was 
 doing something similar :)

I love the tables where you have something like the following:

.-----------------.----------------------.
| This is a short | This is a different  |
| paragraph about | piece of content     |
| something in a  | about another thing. |
| table           |                      |
`-----------------^----------------------'

And of course when you cut and paste you get the following:

This is a short This is a different
paragraph about piece of content
something in a about another thing.
table

Oh yes, that's what I expected too. It's not even something you can
clean with a macro. You have to carefully piece them back together, or
copy/paste one line at a time, or just type it :)
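
For what it's worth, if the extraction tool happens to keep the column
alignment (the pasted text above does not), the two cells can be separated
again by cutting every line at a fixed offset. A rough sketch follows;
splitTwoColumns() and the offset of 16 are assumptions made up for this
example.

```php
<?php
// Hypothetical helper: split fixed-width lines into two columns at a known
// character offset and join each column back into running text.
function splitTwoColumns(array $lines, int $boundary): array
{
    $left = [];
    $right = [];

    foreach ($lines as $line) {
        $leftCell  = trim(substr($line, 0, $boundary));
        // The cast covers older PHP versions, where substr() past the end
        // of the string returns false instead of ''.
        $rightCell = trim((string) substr($line, $boundary));

        if ($leftCell !== '') {
            $left[] = $leftCell;
        }
        if ($rightCell !== '') {
            $right[] = $rightCell;
        }
    }

    return [implode(' ', $left), implode(' ', $right)];
}

// Usage: the table from above, with the left column padded to 16 characters.
$lines = [
    str_pad('This is a short', 16) . 'This is a different',
    str_pad('paragraph about', 16) . 'piece of content',
    str_pad('something in a', 16) . 'about another thing.',
    'table',
];
[$first, $second] = splitTwoColumns($lines, 16);
```

This only works when the spacing survives extraction and you know where the
column boundary sits. Once the whitespace has been collapsed, as in the pasted
text, there is nothing left to cut on.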

Cheers,
Rob.
-- 
http://www.interjinn.com
Application and Templating Framework for PHP

