On 28 January 2010 15:06, 李琴 <q...@ica.stc.sh.cn> wrote:
> Hi all,
> I have built a LocalWiki. Now I want its data to stay consistent with
> Wikipedia, and one task I need to do is fetch the updates from Wikipedia.
> I get the URLs by analyzing the RSS feed
> (http://zh.wikipedia.org/w/index.php?title=Special:%E6%9C%80%E8%BF%91%E6%9B%B4%E6%94%B9&feed=rss)
> and then get the full HTML content of the edit box by opening each of
> these URLs and clicking 'edit this page'. ....
> Is that because I visit too frequently and my IP address has been
> blocked, or is the network just too slow?
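The polling step described above (fetch the Special:RecentChanges RSS feed, pull out the per-page links) could be sketched like this. The feed URL is the one from the message; the bot name in the User-Agent is a hypothetical placeholder, and I'm assuming the feed follows the standard RSS 2.0 item/link layout:

```python
# Minimal sketch: parse a MediaWiki RecentChanges RSS feed and collect the
# <link> of each <item>. Assumes standard RSS 2.0 structure.
import urllib.request
import xml.etree.ElementTree as ET

# Feed URL taken from the original message.
FEED_URL = ("http://zh.wikipedia.org/w/index.php"
            "?title=Special:%E6%9C%80%E8%BF%91%E6%9B%B4%E6%94%B9&feed=rss")

def extract_links(rss_xml: str) -> list:
    """Return the <link> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("link") for item in root.iter("item")]

def fetch_recent_change_links() -> list:
    # Identify the client in the User-Agent (bot name is a placeholder).
    req = urllib.request.Request(
        FEED_URL, headers={"User-Agent": "LocalWikiSyncBot/0.1"})
    with urllib.request.urlopen(req) as resp:
        return extract_links(resp.read().decode("utf-8"))
```

This only lists changed pages; fetching and parsing each page's edit box is a separate (and, as the reply below notes, traffic-heavy) step.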
李琴, well.. that's web scraping, which is a poor technique: error-prone, and it generates lots of traffic. One thing a robot must do is read and follow the http://zh.wikipedia.org/robots.txt file (you should probably read it too). As a general rule of the Internet, a "rude" robot will be banned by the site admins. It would be a good idea to announce your bot as a bot in the User-Agent string. Good bot behavior is to read a website like a human would. I don't know, maybe 10 requests a minute? I don't know this "Wikipedia" site's rules about it. What you are suffering could be automatic or manual throttling, triggered because an abusive number of requests was detected from your IP.

"Wikipedia" seems to provide full dumps of its wiki, but they are probably unusable for you, since they are gigantic :-/ ... trying to rebuild Wikipedia on your PC from a snapshot would be like summoning Cthulhu in a teapot. But.. I don't know, maybe the zh version is smaller, or your resources are powerful enough. One feels that what you have built carries a severe overhead (a waste of resources) and that there must be better ways to do it...

--
-- ℱin del ℳensaje.

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
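The polite-robot advice in the reply (honor robots.txt, identify the bot in the User-Agent, throttle to roughly 10 requests a minute) can be sketched in Python with the standard library. The bot name, contact address, and 6-second delay are illustrative assumptions, not anything Wikipedia prescribes:

```python
# Sketch of a "polite" fetcher: checks robots.txt rules, sends an identifying
# User-Agent, and sleeps between requests to stay near 10 requests/minute.
import time
import urllib.request
from urllib.robotparser import RobotFileParser

USER_AGENT = "LocalWikiSyncBot/0.1 (contact: you@example.org)"  # placeholder identity
MIN_DELAY = 6.0  # seconds between requests, i.e. ~10 requests per minute (assumption)

class PoliteFetcher:
    def __init__(self, robots_txt: str, min_delay: float = MIN_DELAY):
        # robots_txt is the already-downloaded text of /robots.txt.
        self.parser = RobotFileParser()
        self.parser.parse(robots_txt.splitlines())
        self.min_delay = min_delay
        self._last = 0.0

    def allowed(self, url: str) -> bool:
        """True if robots.txt permits this User-Agent to fetch the URL."""
        return self.parser.can_fetch(USER_AGENT, url)

    def fetch(self, url: str) -> bytes:
        if not self.allowed(url):
            raise PermissionError("robots.txt disallows " + url)
        # Simple rate limit: wait out the remainder of the minimum delay.
        wait = self.min_delay - (time.monotonic() - self._last)
        if wait > 0:
            time.sleep(wait)
        self._last = time.monotonic()
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            return resp.read()
```

A fixed sleep is the crudest possible throttle; it is just meant to show the shape of "read the site like a human" rather than a production-grade scheduler.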