Brian wrote:
> I think what the toolserver guys are saying is that they've got the
> data (e.g., a replica of the master database) and they are willing to
> expand operations to include larger-scale computations, and so yes
> they are willing to become more "research oriented". They just need
> the extra hardware of course. I think it's difficult to estimate how
> much but here are some applications that I would like to make or see
> made sooner or later:
> 
> * WikiBlame - A Lucene index of the history of all projects that can
> instantly find the authors of a pasted snippet. I'm not clear on the
> memory requirements of hosting an app like this after the index is
> created, but the index will be terabyte-size at 35% of the text dump.

Note that WikiTrust can do this too, and will probably go into testing soon. For
now, the database for WikiTrust will be off-site, but if it goes live on
Wikipedia, the hardware would be run at the main WMF cluster, and not on the
toolserver.
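The snippet-to-author lookup described in the WikiBlame bullet above could be sketched roughly like this. This is a toy in-memory version with made-up names and data; a real tool would build a Lucene index over the full revision history rather than a Python dict:

```python
# Toy sketch of a shingle index for WikiBlame-style snippet lookup.
# All revision data below is hypothetical.
from collections import defaultdict

def shingles(text, k=4):
    """Split text into overlapping k-word shingles (lowercased)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def build_index(revisions):
    """Map each shingle to the set of (author, rev_id) pairs containing it."""
    index = defaultdict(set)
    for rev_id, author, text in revisions:
        for sh in shingles(text):
            index[sh].add((author, rev_id))
    return index

def blame(index, snippet):
    """Rank (author, rev_id) pairs by how many snippet shingles they match."""
    counts = defaultdict(int)
    for sh in shingles(snippet):
        for hit in index.get(sh, ()):
            counts[hit] += 1
    return sorted(counts.items(), key=lambda kv: -kv[1])

# Hypothetical revisions: (rev_id, author, text)
revs = [
    (1, "Alice", "the quick brown fox jumps over the lazy dog"),
    (2, "Bob",   "a completely different sentence about wiki history"),
]
index = build_index(revs)
print(blame(index, "quick brown fox jumps over")[0][0][0])  # -> Alice
```

The memory question in the bullet is exactly the weak point of this naive approach: the dict holds every shingle of every revision, which is why a disk-backed Lucene index is the realistic choice at terabyte scale.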

> * WikiBlame for images - an image similarity algorithm over all images
> in all projects that can find all places a given image is being used.
> I believe there is a one-time major cpu cost when first analyzing the
> images and then a much lesser realtime comparison cost. Again, the
> memory requirements of hosting such an app are unclear.

That would be very nice to have...
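The "one-time major CPU cost, cheap comparisons afterwards" split in the image-similarity bullet is exactly what perceptual hashing gives you. A minimal sketch of an average hash, with tiny hand-made 4x4 "images" standing in for real downscaled grayscale thumbnails (everything here is illustrative, not a proposed implementation):

```python
# Toy average-hash sketch: the expensive step is hashing every image once;
# comparing two images is then a cheap Hamming distance on small bit vectors.

def average_hash(pixels):
    """Hash a grayscale grid: one bit per pixel, 1 if above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)

def hamming(h1, h2):
    """Number of differing bits; a small distance means similar images."""
    return sum(a != b for a, b in zip(h1, h2))

img_a = [[10, 10, 200, 200],
         [10, 10, 200, 200],
         [10, 10, 200, 200],
         [10, 10, 200, 200]]
img_b = [[12,  9, 205, 198],   # img_a with slight noise added
         [11, 10, 199, 201],
         [ 9, 11, 202, 200],
         [10, 12, 201, 199]]

print(hamming(average_hash(img_a), average_hash(img_b)))  # -> 0, near-identical
```

Hosting-wise this is attractive: only the fixed-size hashes need to stay in memory, not the images themselves.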

> * A vandalism classifier bot that uses the entire history of a wiki in
> order to predict whether the current edit is vandalism. Basically, a
> major extension of existing published work on automatically detecting
> vandalism, which only used several hundred edits. This would require
> major cpu resources for training but very little cost for real-time
> classification.

Pretty big for a toolserver project. But an excellent research topic!
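The cost profile Brian describes (heavy training, nearly free real-time classification) can be illustrated with a tiny hand-rolled logistic regression. Features, thresholds, and the edit samples below are all invented for illustration; published vandalism-detection work uses far richer signals:

```python
# Sketch of the train-once / classify-cheaply split for a vandalism
# classifier. Training loops over history (the costly part); classifying
# one edit is a single dot product.
import math

def features(edit):
    """Two toy features: fraction of uppercase chars, and size change (KB)."""
    text = edit["added_text"]
    upper = sum(c.isupper() for c in text) / max(len(text), 1)
    return [1.0, upper, edit["size_delta"] / 1000.0]  # leading bias term

def train(edits, labels, epochs=500, lr=0.5):
    """Plain-Python logistic regression via gradient descent."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for e, y in zip(edits, labels):
            x = features(e)
            s = max(-30.0, min(30.0, sum(wi * xi for wi, xi in zip(w, x))))
            p = 1 / (1 + math.exp(-s))
            for i in range(len(w)):
                w[i] += lr * (y - p) * x[i]
    return w

def is_vandalism(w, edit):
    """Real-time check: one dot product per edit."""
    return sum(wi * xi for wi, xi in zip(w, features(edit))) > 0

train_edits = [
    {"added_text": "YOU ALL SUCK!!!", "size_delta": -2000},   # label 1: vandalism
    {"added_text": "HAHAHA WIKIPEDIA", "size_delta": -500},   # label 1: vandalism
    {"added_text": "Added a sourced paragraph.", "size_delta": 300},
    {"added_text": "Fixed a typo in the lead.", "size_delta": 2},
]
labels = [1, 1, 0, 0]
w = train(train_edits, labels)
print(is_vandalism(w, {"added_text": "LOL DELETED EVERYTHING", "size_delta": -1500}))
```

Training over the entire edit history is where the CPU budget goes; the learned weight vector itself is tiny, which matches Brian's point that real-time classification costs almost nothing.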

> * Dumps, including extended dump formats such as a natural language
> parse of the full text of the recent version of a wiki made readily
> available for researchers.
> 
> Finally, there are many worthwhile projects that have been presented
> at past Wikimanias or published in the literature that deserve to be
> kept up to date as the encyclopedia continues to grow. Permanent
> hosting for such projects would be a worthwhile goal, as would
> reaching out to these researchers. If the foundation can afford such
> an endeavor, the hardware cost is actually not that great. Perhaps
> datacenter fees are.

Please don't forget that the toolserver is NOT run by the Wikimedia Foundation.
It's run by Wikimedia Germany, which has maybe a tenth of the foundation's
budget. If the foundation is interested in supporting us further, that's great,
we just need to keep responsibilities clear: is the foundation running a
project, or is the foundation helping us (Wikimedia Germany) to run a project?...

-- daniel

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l