I used PDFMiner and was pretty satisfied with the text portions: I retrieved all the text and was able to manipulate it as I wished. However, I failed on the image part. So technically my question reduces to this: if a PDF document follows a fixed pattern, with one image per page and some verbose text below it, how can I retrieve both the images and the text without loss?
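
For reference, here is a minimal sketch of one way the "one image per page, text below it" pattern might be handled. It assumes the maintained pdfminer.six fork rather than the 2009 PDFMiner release, and uses its `extract_pages`, `ImageWriter`, and layout classes. The names `text_below` and `iter_pages` are hypothetical helpers made up for this illustration. It only finds embedded raster images; a logo drawn as vector graphics or set as text would not be picked up this way.

```python
from typing import Iterator, List, Tuple

# (x0, y0, x1, y1) in PDF coordinates, with the origin at the bottom-left.
BBox = Tuple[float, float, float, float]


def text_below(image_bbox: BBox, blocks: List[Tuple[BBox, str]]) -> List[str]:
    """Return the text blocks lying below the image, in top-to-bottom order."""
    image_bottom = image_bbox[1]
    below = [(bbox, text) for bbox, text in blocks if bbox[3] <= image_bottom]
    # A larger y is higher on the page, so sort by top edge, descending.
    below.sort(key=lambda item: item[0][3], reverse=True)
    return [text.strip() for _, text in below]


def iter_pages(pdf_path: str, image_dir: str) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (saved_image_filenames, texts_below_first_image) for each page."""
    # Imported lazily so text_below() works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.image import ImageWriter
    from pdfminer.layout import LTFigure, LTImage, LTTextContainer

    writer = ImageWriter(image_dir)

    def find_images(element):
        # Images are usually nested inside LTFigure containers.
        if isinstance(element, LTImage):
            yield element
        elif isinstance(element, LTFigure):
            for child in element:
                yield from find_images(child)

    for page in extract_pages(pdf_path):
        images, blocks = [], []
        for element in page:
            if isinstance(element, LTTextContainer):
                blocks.append((element.bbox, element.get_text()))
            else:
                images.extend(find_images(element))
        saved = [writer.export_image(image) for image in images]
        if images:
            texts = text_below(images[0].bbox, blocks)
        else:
            # Assumption: on a page with no image, keep all of its text.
            texts = [text.strip() for _, text in blocks]
        yield saved, texts
```

Calling `iter_pages("sample.pdf", "logos")` (both names are placeholders) would then give, per page, the image files written to disk and the text found beneath the first image, which could be inserted into the database row by row.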
On Tue, Dec 29, 2009 at 2:59 PM, Alan Gauld <[email protected]> wrote:

> "Shashwat Anand" <[email protected]> wrote
>
>> I need to make a database from some PDFs. I need to extract logos as well as
>> the information (i.e. name,address) beneath the logo and fill it up in
>> database. The logo can be text as well as picture as shown in two of the
>> screenshots of one of the sample PDF file:
>> http://imagebin.org/77378
>> http://imagebin.org/77379
>
> You could try PDFMiner to extract direct from the PDF using Python.
>
>> Will converting to html a good option? Later on I need to apply some image
>> processing too. What should be the ideal way towards it ?
>
> Converting to html (assuming you have a tool to do that!) may be better
> since there are a wider choice of tools and more experience to help you.
> Or there are various commercial tools for converting PDF into Word etc.
>
> I've never personally had to extract data from a PDF, I've always had access
> to the source documents so I can't comment on how effective each approach
> is...
>
> --
> Alan Gauld
> Author of the Learn to Program web site
> http://www.alan-g.me.uk/
>
> _______________________________________________
> Tutor maillist - [email protected]
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor
