Re: [Tutor] How to Scrape Text from PDFs

Malcolm Herbert Wed, 19 Jun 2019 01:37:08 -0700

This isn't a Python-related response, sorry (I'm still learning Python 
myself); it's more a set of questions about the nature of the PDF and where I 
might start looking to solve the problem, were it mine.


The URLs that you are intending to match - are they themselves clickable when 
you open the PDF in another reader?  If so, then you might have better luck 
looking for the PDF element that provides that capability rather than trying to 
text-scrape to recover them.

Although unlikely inside a URL, text in a PDF can be laid out on the page in a 
completely arbitrary manner and to properly do PDF-to-text conversion you may 
need to track position on the page for each glyph as well as the font mapping 
vector - a glyph of an 'A' for instance might not actually be mapped to the 
ASCII/Unicode for 'A' ... all of which can make this a complete nightmare for 
the unwary.
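
To make the point above concrete, here is a toy sketch in Python. It is not 
how any real PDF library works: the (x, y, code) tuples and the to_unicode 
dictionary are made up for illustration, and real PDFs need far more care 
(kerning, rotated text, baselines that differ slightly within one line).

```python
# Toy model, all values hypothetical: each glyph is (x, y, code), listed in
# the order it was drawn, which need not be reading order.
glyphs = [(40, 700, 3), (10, 700, 1), (30, 700, 2), (20, 700, 2)]

# The font's own code-to-character mapping; code 1 is not ASCII 'h' by itself.
to_unicode = {1: "h", 2: "t", 3: "p"}


def glyphs_to_text(glyphs, to_unicode):
    """Group glyphs by baseline (y), order each line by x, then map codes."""
    lines = {}
    for x, y, code in glyphs:
        # Codes with no known mapping become U+FFFD rather than a wrong guess.
        lines.setdefault(y, []).append((x, to_unicode.get(code, "\ufffd")))
    # PDF y coordinates grow upwards, so the largest y is the top line.
    ordered = sorted(lines.items(), reverse=True)
    return "\n".join("".join(ch for _, ch in sorted(row)) for _, row in ordered)


print(glyphs_to_text(glyphs, to_unicode))
```

Without both the position sort and the mapping step, the same glyphs would 
come out in drawing order and as raw codes.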

So - when I last looked at generating a PDF with a live link element, this was 
implemented as blue underlined text (to make it look like a link) with an 
invisible box placed over the top which contained the PDF magic to make that do 
what I wanted when the user clicked on it.

I would suspect that what you might want would be a Python library that can 
pull apart a PDF into its structural elements and then hunt through there for 
the appropriate "URL box" or whatever it's called ...
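
As far as I understand it, the "URL box" is a link annotation whose action 
dictionary carries a /URI entry. Below is a crude, standard-library-only 
sketch that scans the raw bytes of a PDF for such entries. It assumes the 
annotation dictionaries are stored uncompressed and unencrypted, and it 
ignores hex strings and nested parentheses, so treat it as a starting point 
only; a real PDF library (pypdf, for example) should be more robust. The 
function name and the sample bytes are mine, made up for illustration.

```python
import re

# Matches "/URI (literal string)" as it appears in a link annotation's action
# dictionary. Escaped characters are allowed; nested parentheses are not.
URI_ENTRY = re.compile(rb"/URI\s*\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def find_link_uris(pdf_bytes):
    """Return URI targets found in uncompressed link annotations."""
    uris = []
    for match in URI_ENTRY.finditer(pdf_bytes):
        raw = match.group(1)
        # Undo the PDF escapes for parentheses and backslashes.
        raw = re.sub(rb"\\([()\\])", rb"\1", raw)
        uris.append(raw.decode("latin-1"))
    return uris


# Hypothetical fragment of a PDF body containing one link annotation.
sample = (
    b"<< /Type /Annot /Subtype /Link /Rect [72 700 200 712] "
    b"/A << /S /URI /URI (https://example.com/a\\(1\\)) >> >>"
)
print(find_link_uris(sample))
```

On a real file you would pass in something like open("some.pdf", "rb").read() 
(file name hypothetical). If it returns nothing, the annotations are probably 
inside compressed object streams and you will need a proper parser.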

Hope that helps,
Malcolm

-- 
Malcolm Herbert
[email protected]
