[email protected] wrote:
Hello everyone,

I would like to retrieve data, especially the temperature and the weather, from 
http://www.nytimes.com/weather, and I don't know how to do so.

Consider whether the NY Times terms and conditions permit such automated scraping of their web site.

Be careful you do not abuse their hospitality by hammering their web site unnecessarily (say, by checking the weather eighty times a minute).
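One way to avoid hammering the site is to put a small throttle in front of whatever does the fetching. What follows is only a rough sketch: the Throttle class and the ten-minute interval are made up for illustration, not something the NY Times asks for, so pick an interval that suits the site.

```python
import time


class Throttle:
    """Allow at most one call per `min_interval` seconds (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous call, None before the first one

    def wait(self):
        """Sleep just long enough to respect the interval; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay:
                self._sleep(delay)
        self._last = self._clock()
        return delay


# Assumed interval: one request every ten minutes is plenty for weather data.
throttle = Throttle(min_interval=600)
```

Call throttle.wait() immediately before each download; the first call returns at once and later calls block until the interval has passed.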

Consider whether they have a public API for downloading data directly. If so, use that. Otherwise:

Use the urllib2 and urllib modules to download the raw HTML source of the page you are interested in. You may need to use them to log in, to set cookies, set the referer [sic], submit data via forms, change the user-agent... it's a PITA. Better to use an API if the web site offers one.
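As a minimal sketch of the download step: the code below is written for Python 3, where urllib2 and urllib were merged into urllib.request and urllib.parse. The function names build_request and fetch_html and the User-Agent string are my own inventions for illustration, and logging in, cookies and form submission are left out.

```python
import urllib.request


def build_request(url, user_agent="weather-check/0.1 (hypothetical script)", referer=None):
    """Build a Request carrying a custom User-Agent and, optionally, a Referer."""
    headers = {"User-Agent": user_agent}
    if referer:
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)


def fetch_html(url, timeout=10, **kwargs):
    """Download `url` and return its body decoded to text."""
    request = build_request(url, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset the server declares; fall back to UTF-8 if it gives none.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Something like fetch_html("http://www.nytimes.com/weather") should then give you the page source as a string, assuming the site allows it and the page needs no login.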

Use the htmllib module to parse the source looking for the information you are after. If their HTML is crap, as it so often is with commercial websites that should know better, download and install BeautifulSoup, and use that for parsing the HTML.
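To give an idea of what the parsing step looks like, here is a small sketch using html.parser, the standard-library parser in Python 3 (htmllib is Python 2 only). The class names "temp" and "sky" and the sample HTML are invented; I have not checked what the NY Times page actually uses, so look at the real source and adjust.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element carrying a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self._depth = 0       # greater than 0 while inside a matching element
        self._chunks = []     # text pieces of the element being collected
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1  # nested tag inside the element we are collecting
        elif self.wanted_class in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.results.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def text_by_class(html, wanted_class):
    """Return the text of each element whose class attribute includes `wanted_class`."""
    parser = ClassTextExtractor(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Invented sample markup, standing in for the downloaded page.
SAMPLE = ('<div><span class="temp now">72&deg;F</span>'
          '<span class="sky">Partly <b>cloudy</b></span></div>')
print(text_by_class(SAMPLE, "temp"))
print(text_by_class(SAMPLE, "sky"))
```

This copes with nested tags and entity references, but it still assumes the tags are reasonably well balanced, which is exactly where BeautifulSoup earns its keep.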

Don't be tempted to use regexes for parsing the HTML. That is the wrong solution. Regexes *seem* like a good idea for parsing HTML, and for simple tasks they are quick to program, but they invariably end up being ten times as much work as a proper HTML parser.

If the content you are after is generated by JavaScript, you're probably out of luck: these modules only download the HTML source, they do not run scripts.


--
Steven

