Re: [Tutor] Retrieve data

Steven D'Aprano Tue, 12 Apr 2011 16:19:54 -0700

[email protected] wrote:

Hello everyone,


I would like to retrieve data, and especially the temperature and the weather from 
http://www.nytimes.com/weather. And I don't know how to do so.

Consider whether the NY Times terms and conditions permit such automated scraping of their web site.

Be careful you do not abuse their hospitality by hammering their website unnecessarily (say, by checking the weather eighty times a minute).
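
One way to stay polite is to throttle your own requests. Here is a minimal sketch, assuming Python 3; the Throttle class and the ten-minute interval in the usage comment are illustrative choices of mine, not anything the NY Times specifies.

```python
import time


class Throttle:
    """Allow at most one call per `min_interval` seconds (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous call, None before the first one

    def wait(self):
        """Block until `min_interval` seconds have passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage (assumed interval): call throttle.wait() before every download.
# throttle = Throttle(600)   # at most one request every ten minutes
# throttle.wait()
```

Weather pages change slowly, so a long interval costs you nothing and keeps you well away from "eighty times a minute".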

Consider whether they have a public API for downloading data directly. If so, use that. Otherwise:

Use the urllib2 and urllib modules to download the raw HTML source of the page you are interested in. You may need to use them to login, to set cookies, set the referer [sic], submit data via forms, change the user-agent... it's a PITA. Better to use an API if the web site offers one.
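
As a rough sketch of the download step: the code below assumes Python 3, where the functionality of urllib2 lives in urllib.request. The helper names and the user-agent string are made up for illustration, and whether the site needs cookies or a particular user-agent at all is an assumption you would have to check yourself.

```python
import http.cookiejar
import urllib.request

URL = "http://www.nytimes.com/weather"  # the page named in the question


def build_opener_with_cookies():
    """Return an opener that remembers cookies between requests, plus its jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_request(url, user_agent="weather-script/0.1"):
    """Create a request carrying a custom User-Agent header (hypothetical value)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, opener, timeout=10):
    """Download `url` and return the page source as text."""
    with opener.open(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 if the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage (performs a real network request, so it is left commented out):
# opener, jar = build_opener_with_cookies()
# html = fetch(URL, opener)
```

Logging in or submitting forms needs more than this (POST data, possibly hidden form fields), which is exactly why an API is preferable when one exists.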

Use the htmllib module to parse the source looking for the information you are after. If their HTML is crap, as it so often is with commercial websites that should know better, download and install BeautifulSoup, and use that for parsing the HTML.
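
To give an idea of what the parsing step looks like, here is a small sketch. It assumes Python 3, where htmllib no longer exists and html.parser is the standard library parser. The sample markup and the class names "temp" and "conditions" are invented for the example; the real page will be laid out differently, so look at its source first.

```python
from html.parser import HTMLParser

# Hypothetical markup, NOT the real structure of the NY Times page.
SAMPLE_HTML = """
<div class="forecast">
  <span class="temp">61&deg;F</span>
  <span class="conditions">Partly cloudy</span>
</div>
"""


class SpanCollector(HTMLParser):
    """Collect the text of <span> elements whose class is in `wanted_classes`."""

    def __init__(self, wanted_classes):
        super().__init__(convert_charrefs=True)  # turn &deg; etc. into characters
        self.wanted = set(wanted_classes)
        self.current = None   # class name of the span being read, if any
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            for name in (dict(attrs).get("class") or "").split():
                if name in self.wanted:
                    self.current = name
                    self.found[name] = ""
                    break

    def handle_endtag(self, tag):
        if tag == "span":
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.found[self.current] += data


def extract(html, wanted_classes=("temp", "conditions")):
    """Return a dict mapping each wanted class name to its stripped text."""
    parser = SpanCollector(wanted_classes)
    parser.feed(html)
    parser.close()
    return {name: text.strip() for name, text in parser.found.items()}


print(extract(SAMPLE_HTML))
```

This only copes with reasonably well-formed pages and non-nested spans. If the real HTML is badly broken, that is the point at which BeautifulSoup earns its keep.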

Don't be tempted to use regexes for parsing the HTML. That is the wrong solution. Regexes *seem* like a good idea for parsing HTML, and for simple tasks they are quick to program, but they invariably end up being ten times as much work as a proper HTML parser.
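
A small demonstration of why, assuming Python 3 and some invented markup: the five snippets below all mean the same thing to a browser, but the obvious regex only copes with the first one. The pattern, the variants and the function names are all illustrative.

```python
import re
from html.parser import HTMLParser

# The "obvious" regex for a hypothetical <span class="temp"> element.
NAIVE = re.compile(r'<span class="temp">(.*?)</span>')

# Equivalent ways the same element might be written.
VARIANTS = [
    '<span class="temp">61F</span>',
    "<span class='temp'>61F</span>",         # single quotes
    '<span id="t" class="temp">61F</span>',  # extra attribute first
    '<SPAN CLASS="temp">61F</SPAN>',         # upper-case tag and attribute
    '<span class="temp">\n61F\n</span>',     # content split over several lines
]


class TempParser(HTMLParser):
    """Collect text found inside <span class="temp"> elements."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        # The parser has already lower-cased tag and attribute names.
        if tag == "span" and dict(attrs).get("class") == "temp":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text.append(data)


def with_regex(html):
    match = NAIVE.search(html)
    return match.group(1).strip() if match else None


def with_parser(html):
    parser = TempParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.text).strip() or None


for variant in VARIANTS:
    print(repr(with_regex(variant)), repr(with_parser(variant)))
```

You can of course patch the regex for each of these cases, but every patch makes it longer and harder to read, and the next site redesign will find a case you missed.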

If the content you are after requires Javascript, you're probably out of luck.



--
Steven

_______________________________________________
Tutor maillist  -  [email protected]
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
