> On Jun 17, 2019, at 1:30 AM, Cem Vardar <cem...@hotmail.com> wrote:
>
> Hello,
>
> I have been working on an assignment that was described to me as “fairly
> trivial” for a couple of days now. I have some PDF files that have links to
> some websites and I need to extract these links from these files by using
> Python. I would be very glad if someone could point me in the direction of
> some resources that would give me the essential skills specific for this task.
>
Unfortunately, a PDF can contain anything from near-PostScript text to a bitmap. But let's assume your PDFs are of the near-PostScript flavor. In that case you can simply read them as text and use Python's standard string searching to look for http:// or https://. Each time you find one, stop and parse the URL (again with string handling), scanning forward for where it ends: a typical domain ending (e.g. .com, .net, .org, etc.) plus any path after it, up to the next delimiter such as whitespace or a closing parenthesis.

It might help to cheat a bit: open one of the PDFs in a plain text editor, search for http://, and see what turns up. I'll bet it will be fairly clear.

Bill

> Sincerely,
> Cem
> _______________________________________________
> Tutor maillist - Tutor@python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor
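
A minimal sketch of the search described above, assuming the PDF stores its links as plain, uncompressed text (a file whose streams are compressed or that is a scanned image will yield nothing this way). The names extract_urls and extract_urls_from_file are made up for illustration, and the sketch ends each URL at the first whitespace, parenthesis, or bracket rather than at a list of domain endings, since that also keeps any path after the domain.

```python
import re

# A URL starts with http:// or https:// and runs until the first byte that
# is whitespace, a parenthesis, an angle bracket, or a square bracket.
# (Assumption: links appear as readable text, e.g. "/URI (http://...)".)
URL_PATTERN = re.compile(rb"https?://[^\s()<>\[\]]+")


def extract_urls(data: bytes) -> list[str]:
    """Return the distinct URLs found in raw bytes, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(data):
        # latin-1 maps every byte to a character, so decoding cannot fail.
        url = match.group().decode("latin-1")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def extract_urls_from_file(path: str) -> list[str]:
    """Read a file in binary mode and return the URLs found in it."""
    with open(path, "rb") as handle:
        return extract_urls(handle.read())
```

Calling extract_urls_from_file on the path to one of your PDFs returns a list of strings. If the list comes back empty even though the links are clickable in a viewer, the link data is probably compressed inside the file and a dedicated PDF library would be the next thing to try.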