On 11/10/12 07:35, Benjamin Fishbein wrote:
> I've been scraping info from a website with a url program I wrote. But
> now I can't open their webpage, no matter which web browser I use. I
> think they've somehow blocked me. How can I get back in? Is it a
> temporary block?
How the hell would we know??? Ask the people running the web site. If you have been breaking the terms and conditions of the web site, you could have broken the law (computer trespass). I don't say this because I approve of or agree with the law, but when you scrape websites with anything other than a browser, that's the chance you take.
> And can I get in with the same computer from a different wifi?
*rolls eyes*

You've been blocked once. You want to get blocked again?

A lot of this depends on what the data is, why it was put on the web in the first place, and what you intend doing with it. Wait a week and see if the block is undone. Then:

* If the web site gives you an official API for fetching data, USE IT.

* If not, keep to their web site T&C. If the T&C allows scraping under conditions (usually something along the lines of limiting how fast you can scrape, or at what times), OBEY THOSE CONDITIONS and don't be selfish.

* If you think the webmaster will be reasonable, ask permission first. (I don't recommend that you volunteer the information that you were already blocked once.) If he's not a dick, he'll probably say yes, under conditions (again, usually to do with time and speed).

* If you insist on disregarding their T&C, don't be a dick about it. Always be an ethical scraper. If the police come knocking, at least you can say that you tried to avoid any harm from your actions. It could make the difference between jail and a good behaviour bond.

  - Make sure you download slowly: pause for at least a few seconds between each download, or even a minute or three.
  - Limit the rate at which you download: you might be on high-speed ADSL2, but the faster you slurp files from the website, the less bandwidth they have for others.
  - Use a cache so you aren't hitting the website again and again for the same files.
  - Obey robots.txt.

Consider using a random pause of (say) 0 to 90 seconds between downloads to more accurately mimic a human using a browser. Also consider changing your user-agent. Ethical scraping suggests putting your contact details in the user-agent string. Defensive scraping suggests mimicking Internet Explorer as much as possible. (A sketch putting these pieces together follows at the end of this message.)

More about ethical scraping:

http://stackoverflow.com/questions/4384493/how-can-i-ethically-and-legally-scrape-data-from-a-public-web-site

--
Steven
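[Editor's note: a minimal sketch of the "ethical scraper" checklist above, assuming Python 3's standard library (urllib.request and urllib.robotparser). The user-agent string, contact address, cache directory, site URL and delay range are placeholders; substitute your own.]

import hashlib
import os
import random
import time
import urllib.request
import urllib.robotparser

# Hypothetical values -- substitute your own.
USER_AGENT = "myscraper/0.1 (contact: you@example.com)"  # contact details in the UA, per ethical scraping
CACHE_DIR = "cache"
BASE = "http://www.example.com"

# Obey robots.txt: fetch it once, then check every URL against it.
robots = urllib.robotparser.RobotFileParser()
robots.set_url(BASE + "/robots.txt")
robots.read()

def fetch(url):
    """Fetch a URL politely: cache first, robots.txt, random pause."""
    # Use a cache so you never hit the site twice for the same file.
    os.makedirs(CACHE_DIR, exist_ok=True)
    cache_file = os.path.join(CACHE_DIR, hashlib.md5(url.encode()).hexdigest())
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            return f.read()
    # Don't fetch anything robots.txt forbids.
    if not robots.can_fetch(USER_AGENT, url):
        raise RuntimeError("robots.txt forbids fetching %s" % url)
    # Random pause of 0-90 seconds to mimic a human using a browser.
    time.sleep(random.uniform(0, 90))
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        data = response.read()
    with open(cache_file, "wb") as f:
        f.write(data)
    return data

Because the pause runs before every uncached download, this also caps the average request rate at well under two per minute, which covers the rate-limiting point as a side effect.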