On 10/16/10 8:40 PM, Fred Bauder wrote:
>
> http://mastersofmedia.hum.uva.nl/2010/10/16/wikipedia-we-have-a-google-refresh-problem/

The linked blog post laments the lag between the removal of vandalism on 
Wikipedia and its removal in Google's indices and cached data.

There is a way to mitigate that problem -- there are protocols to let 
Google know about recently changed pages. I'm assuming that we have no 
arrangement in place already for them to crawl recent changes for all of 
Wikipedia?

In any case, the more interesting goal is not so much to mitigate 
vandalism as to increase the coverage and timeliness of the whole 
collection.

Anyway, the standard way to do this is Sitemaps:

   http://www.sitemaps.org/

As the name suggests, Sitemaps were originally intended as hints about 
site structure, but search engines like Google now use them as a sort of 
feed of recently changed pages.

   http://www.sitemaps.org/faq.php#faq_submitting_changes

Google doesn't accept something sensible like RSS or XMPP even from 
other top 50 websites, unless you happen to be Six Apart or Twitter. 
Still, we could ask, since Daniel Kinzler has a working demo of recent 
changes via XMPP.

Alternatively, we could either transform the XMPP stream into a 
Sitemaps-compatible structure, or generate both kinds of output at the 
same time. I assume, famous last words, that the really heavy lifting is 
already done since we have a recent changes feature.
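
To make the transformation concrete, here is a minimal sketch in Python. 
It assumes the recent changes arrive as (URL, timestamp) pairs with 
timezone-aware timestamps; that input shape and the name build_sitemap 
are hypothetical, not anything taken from MediaWiki or from Daniel's 
demo. Only the urlset/url/loc/lastmod elements and the namespace come 
from the Sitemaps protocol itself.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(changes):
    """Turn recent changes into a Sitemaps XML string.

    `changes` is an iterable of (url, timestamp) pairs, where each
    timestamp is a timezone-aware datetime. This input shape is an
    assumption made for illustration only.
    """
    # A page edited several times should appear once, with its newest edit.
    latest = {}
    for url, ts in changes:
        if url not in latest or ts > latest[url]:
            latest[url] = ts

    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    # Most recently changed pages first.
    for url, ts in sorted(latest.items(), key=lambda kv: kv[1], reverse=True):
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        # lastmod uses W3C Datetime format; emit it in UTC.
        utc = ts.astimezone(timezone.utc)
        ET.SubElement(entry, "lastmod").text = utc.strftime("%Y-%m-%dT%H:%M:%SZ")

    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    sample = [
        ("https://en.wikipedia.org/wiki/A",
         datetime(2010, 10, 16, 20, 0, tzinfo=timezone.utc)),
        ("https://en.wikipedia.org/wiki/A",
         datetime(2010, 10, 16, 22, 0, tzinfo=timezone.utc)),
    ]
    print(build_sitemap(sample))
```

A real version would also have to split the output into several files 
plus a Sitemap index, since the protocol caps a single Sitemap at 
50,000 URLs.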

I don't know if I'm committing any resources to this (I'm still busy 
with other stuff for the next two months at least) but I happen to know 
a lot about this from an aborted project at another employer, so I have 
always wanted to actually use that knowledge.

-- 
Neil Kandalgaonkar  |) <[email protected]>

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
