On 28 January 2010 15:06, 李琴 <q...@ica.stc.sh.cn> wrote:
> Hi all,
>  I have built a LocalWiki. Now I want its data to stay consistent with
> Wikipedia, and one task I need to do is to fetch the updated data from
> Wikipedia. I get the URLs by parsing the RSS feed
> (http://zh.wikipedia.org/w/index.php?title=Special:%E6%9C%80%E8%BF%91%E6%9B%B4%E6%94%B9&feed=rss)
> and then get the full HTML content of the edit box for each URL by
> opening it and clicking 'edit this page'.
....
> Is that because I visit it too frequently and my IP address has been
> blocked, or because the network is too slow?

李琴, well.. that's web scraping, which is a poor technique, one that is
error-prone and generates a lot of traffic.

One thing a robot must do is read and follow the
http://zh.wikipedia.org/robots.txt file (you should probably read it
too).
As a general rule of the Internet, a "rude" robot will get banned by the
site admins.
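
For example, something along these lines (just my sketch, using Python's
standard urllib.robotparser; the bot name and contact address are
placeholders, not anything official) checks robots.txt before fetching:

    # Check whether a given URL may be fetched, according to robots.txt.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://zh.wikipedia.org/robots.txt")
    rp.read()  # download and parse the robots.txt file

    user_agent = "LocalWikiSyncBot/0.1 (contact: you@example.com)"  # placeholder
    feed_url = ("http://zh.wikipedia.org/w/index.php"
                "?title=Special:%E6%9C%80%E8%BF%91%E6%9B%B4%E6%94%B9&feed=rss")
    if rp.can_fetch(user_agent, feed_url):
        print("allowed to fetch the recent-changes feed")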

It would be a good idea to announce your bot as a bot in the User-Agent
string. Good bot behavior is to read a website at a human-like pace; I
don't know, maybe 10 requests a minute? I don't know this "Wikipedia"
site's rules about it.

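Roughly, in Python (again just a sketch, not the site's official rules;
the User-Agent string and the 6-second delay, about 10 requests per
minute, are my own guesses at "polite"):

    # Fetch URLs while identifying the bot and throttling the request rate.
    import time
    import urllib.request

    HEADERS = {"User-Agent": "LocalWikiSyncBot/0.1 (contact: you@example.com)"}

    def polite_get(url, delay=6.0):
        """Fetch one URL with an identifying User-Agent, then pause ~6 s."""
        req = urllib.request.Request(url, headers=HEADERS)
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
        time.sleep(delay)  # roughly 10 requests per minute
        return body

    # e.g. pull the recent-changes feed mentioned in the original mail
    feed = polite_get("http://zh.wikipedia.org/w/index.php"
                      "?title=Special:%E6%9C%80%E8%BF%91%E6%9B%B4%E6%94%B9&feed=rss")
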
What you are suffering could be automatic or manual throttling, because
an abusive number of requests has been detected from your IP.

"Wikipedia" seems to provide full dumps of its wiki, but they are
probably unusable for you, since they are gigantic :-/. Trying to
rebuild Wikipedia on your PC from a snapshot would be like summoning
Cthulhu in a teapot. But.. I don't know, maybe the zh version is
smaller, or your resources are powerful enough. One feels that what you
have built has severe overhead (a waste of resources) and that there
must be better ways to do it...
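
If you do try a dump, at least stream it instead of loading it whole.
Something like this (my sketch; it assumes a locally downloaded zhwiki
pages-articles dump, and the filename is only a guess) walks the
compressed XML without keeping it all in memory:

    # Stream a compressed MediaWiki XML dump and count its <page> elements.
    import bz2
    import xml.etree.ElementTree as ET

    def count_pages(path):
        pages = 0
        with bz2.open(path, "rb") as f:
            for _event, elem in ET.iterparse(f, events=("end",)):
                if elem.tag.endswith("}page"):  # tags carry the export namespace
                    pages += 1
                    elem.clear()  # drop the finished subtree to bound memory use
        return pages

    print(count_pages("zhwiki-latest-pages-articles.xml.bz2"))  # filename is a guess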



-- 
End of Message.
