"Shashwat Anand" <[email protected]> wrote

I need to build a database from some PDFs. I need to extract the logos as well
as the information (i.e. name, address) beneath each logo and store it in the
database. The logo can be either text or a picture, as shown in two
screenshots from one of the sample PDF files:
http://imagebin.org/77378
http://imagebin.org/77379

You could try PDFMiner to extract the content directly from the PDF using
Python.
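
A minimal sketch of that idea, assuming the pdfminer.six fork of PDFMiner is
installed and that the picture logos show up as figure elements in the page
layout (I have not checked this against the sample PDFs). The helper names
text_below and page_items and the max_gap threshold are made up for
illustration. A logo that is itself text would appear as an ordinary text
block, so it would need some other rule to identify it.

```python
def text_below(logo_bbox, text_blocks, max_gap=100.0):
    """Return the text of blocks lying beneath logo_bbox, top to bottom.

    Boxes are (x0, y0, x1, y1) in PDF coordinates, where y grows upward,
    so "beneath" means the block's top edge (y1) sits at or below the
    logo's bottom edge (y0). max_gap is an assumed cut-off in points.
    """
    lx0, ly0, lx1, _ = logo_bbox
    hits = []
    for (x0, y0, x1, y1), text in text_blocks:
        overlaps_horizontally = x0 < lx1 and x1 > lx0
        gap = ly0 - y1
        if overlaps_horizontally and 0 <= gap <= max_gap:
            hits.append((y1, text.strip()))
    # Highest block first, i.e. reading order down the page.
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return [text for _, text in hits]


def page_items(pdf_path):
    """Yield (figure_bboxes, text_blocks) for each page of the PDF.

    Imports pdfminer.six lazily so text_below can be used without it.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTFigure, LTTextContainer

    for page in extract_pages(pdf_path):
        figures, texts = [], []
        for element in page:
            if isinstance(element, LTFigure):
                figures.append(element.bbox)
            elif isinstance(element, LTTextContainer):
                texts.append((element.bbox, element.get_text()))
        yield figures, texts
```

For each page you would call text_below(figure_bbox, texts) for every figure
box and write the returned lines to the database.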

Would converting to HTML be a good option? Later on I need to apply some image
processing too. What would be the ideal approach?

Converting to HTML (assuming you have a tool to do that!) may be better,
since there is a wider choice of tools and more experience to help you.
Alternatively, there are various commercial tools for converting PDF into
Word etc.
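
Once you have HTML, the standard library can already do a lot. The sketch
below assumes the converter writes each picture logo as an <img> tag followed
by the text that belongs to it; real converter output may be laid out quite
differently. LogoTextParser and logos_with_text are names invented for this
example.

```python
from html.parser import HTMLParser


class LogoTextParser(HTMLParser):
    """Collect each <img> src together with the text that follows it."""

    def __init__(self):
        super().__init__()
        self.entries = []  # one dict per image: {"src": ..., "lines": [...]}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing <img ... /> tags.
        if tag == "img":
            self.entries.append({"src": dict(attrs).get("src"), "lines": []})

    def handle_data(self, data):
        text = data.strip()
        # Text before the first image is ignored.
        if text and self.entries:
            self.entries[-1]["lines"].append(text)


def logos_with_text(html):
    """Return a list of {"src", "lines"} dicts, one per <img> in html."""
    parser = LogoTextParser()
    parser.feed(html)
    parser.close()
    return parser.entries
```

Each entry gives you the image file to pass to your image processing step and
the lines of text to store alongside it.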

I've never personally had to extract data from a PDF; I've always had access
to the source documents, so I can't comment on how effective each approach
is...

--
Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/

_______________________________________________
Tutor maillist  -  [email protected]
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
