On Mon, Jun 22, 2015 at 12:11:30PM +0200, Timo wrote:
> On 21-06-15 at 22:04, Joshua Valdez wrote:
> > I'm having trouble making this script work to scrape information from a
> > series of Wikipedia articles.
> >
> > What I'm trying to do is iterate over a series of wiki URLs and pull out
> > the page links on a wiki portal category (e.g.
> > https://en.wikipedia.org/wiki/Category:Electronic_design).
>
> Instead of scraping the webpage, I'd have a look at the API. This should
> give much better and more reliable results than relying on parsing HTML.
>
> https://www.mediawiki.org/wiki/API:Main_page
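To make that concrete, here is a minimal sketch of what Timo is suggesting, using only the standard library against the API's `list=categorymembers` query (the `action`, `cmtitle`, `cmlimit`, `format` and `cmcontinue` parameters are all documented MediaWiki API parameters; the function names and the User-Agent string are just my own choices for the example):

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_query(category, cmcontinue=None):
    """Build the query parameters for one categorymembers request."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,       # e.g. "Category:Electronic_design"
        "cmlimit": "500",          # max page titles per request
        "format": "json",
    }
    if cmcontinue:
        # Continuation token from the previous response, if any.
        params["cmcontinue"] = cmcontinue
    return params


def category_members(category):
    """Yield page titles in a category, following continuation tokens."""
    cont = None
    while True:
        url = API_URL + "?" + urllib.parse.urlencode(build_query(category, cont))
        # Wikimedia asks API clients to send a descriptive User-Agent.
        req = urllib.request.Request(url, headers={"User-Agent": "tutor-example/0.1"})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        cont = data.get("continue", {}).get("cmcontinue")
        if cont is None:
            break


if __name__ == "__main__":
    for title in category_members("Category:Electronic_design"):
        print(title)
```

One request fetches up to 500 titles at a time, which is far kinder to the servers than downloading and parsing the rendered category pages one by one.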
Seconded, thirded and fourthed! Please don't scrape Wikipedia. It is hard
enough for them to deal with bandwidth requirements and remain responsive
for browsers without badly-written bots trying to suck down pieces of the
site.

Use the API. Not only is it the polite thing to do, but it protects you
too: Wikipedia is entitled to block your bot if they think it is not
following the rules.

> You can try out the huge number of different options (with short
> descriptions) on the sandbox page:
>
> https://en.wikipedia.org/wiki/Special:ApiSandbox

-- 
Steve
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor