Re: [Tutor] How to Scrape Text from PDFs

William Ray Wing via Tutor Mon, 17 Jun 2019 13:36:36 -0700


> On Jun 17, 2019, at 1:30 AM, Cem Vardar <[email protected]> wrote:
> 
> Hello,
> 
> I have been working on an assignment that was described to me as “fairly 
> trivial” for a couple of days now. I have some PDF files that contain links 
> to some websites, and I need to extract these links from the files using 
> Python. I would be very glad if someone could point me in the direction of 
> some resources that would give me the essential skills specific to this task.
>


Unfortunately, a PDF can contain anything from almost PostScript to a bitmap.  
But let's assume your PDFs are of the almost PostScript flavor.  In that case 
you can simply read them as text, and then use Python’s standard string 
searching to look for http:// or https://.  Each time you find one, stop and 
parse (again with string handling) the URL, looking for where it ends: just 
after one of the typical terminators (e.g. .com, .net, .org etc.), or at the 
next whitespace or closing parenthesis if a path follows.
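Here is a minimal sketch of that search, in case it helps.  It assumes the 
URLs sit uncompressed in the file (for instance inside parentheses, as in 
"/URI (http://...)"), which will not be true of every PDF.  The names 
find_urls and find_urls_in_file are just ones I made up for illustration, and 
I used a regular expression rather than plain string methods to find where 
each URL ends.

```python
import re

# Assumption: a URL ends at whitespace, a bracket, a parenthesis or a quote.
URL_PATTERN = re.compile(rb"https?://[^\s()<>\[\]{}\"']+")


def find_urls(pdf_bytes):
    """Return the unique URLs found in raw PDF bytes, in order of first appearance."""
    seen = []
    for match in URL_PATTERN.finditer(pdf_bytes):
        # latin-1 maps every byte to a character, so decoding never fails.
        url = match.group().decode("latin-1")
        if url not in seen:
            seen.append(url)
    return seen


def find_urls_in_file(path):
    """Read a file in binary mode and return the URLs found in it."""
    with open(path, "rb") as handle:
        return find_urls(handle.read())
```

You would call it as find_urls_in_file("some_file.pdf"), where the file name 
is a placeholder.  If it returns an empty list, the links are probably inside 
compressed streams and this approach will not see them.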

It might help to cheat a bit: open one of the PDFs in a standard text editor, 
search for http://, and see what turns up.  I’ll bet it will be fairly clear.
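If your editor balks at binary files, roughly the same peek can be done from 
Python.  This is only a sketch: show_context is a hypothetical helper name, 
and the 40-byte window is an arbitrary choice.

```python
def show_context(pdf_bytes, needle=b"http", width=40):
    """Return the text surrounding each occurrence of needle in the raw bytes."""
    snippets = []
    start = pdf_bytes.find(needle)
    while start != -1:
        lo = max(0, start - width)
        hi = min(len(pdf_bytes), start + len(needle) + width)
        # latin-1 maps every byte to a character, so decoding never fails.
        snippets.append(pdf_bytes[lo:hi].decode("latin-1"))
        start = pdf_bytes.find(needle, start + 1)
    return snippets
```

Read the file with open(path, "rb").read(), pass the result in, and print 
each snippet to see what surrounds the URLs in your particular files.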

Bill

> Sincerely,
> Cem
> _______________________________________________
> Tutor maillist  -  [email protected]
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor

