Brian wrote:
> I think what the toolserver guys are saying is that they've got the
> data (e.g., a replica of the master database) and they are willing to
> expand operations to include larger-scale computations, and so yes
> they are willing to become more "research oriented". They just need
> the extra hardware, of course. I think it's difficult to estimate how
> much, but here are some applications that I would like to make or see
> made sooner or later:
>
> * WikiBlame - a Lucene index of the history of all projects that can
> instantly find the authors of a pasted snippet. I'm not clear on the
> memory requirements of hosting an app like this after the index is
> created, but the index will be terabyte-sized at 35% of the text dump.
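The core of the WikiBlame idea above — attributing a pasted snippet to the revision (and hence author) that first introduced it — can be sketched without Lucene as a linear scan over a revision history; the Lucene index is what would make this instant at wiki scale. Everything below (the `Revision` type, the sample history) is an invented illustration, not WikiBlame's actual design.

```python
# Minimal sketch of snippet attribution: walk revisions in chronological
# order and report the first one whose text contains the snippet. A real
# WikiBlame would consult a Lucene-style inverted index instead of
# scanning full revision texts.
from dataclasses import dataclass

@dataclass
class Revision:
    rev_id: int
    author: str
    text: str

def blame_snippet(history, snippet):
    """Return the first revision whose text contains `snippet`, or None."""
    needle = " ".join(snippet.split())  # normalize whitespace
    for rev in history:                 # assumed to be in chronological order
        if needle in " ".join(rev.text.split()):
            return rev
    return None

# Hypothetical three-revision history of one article.
history = [
    Revision(1, "Alice", "Stub about toolservers."),
    Revision(2, "Bob",   "Stub about toolservers. They replicate the master database."),
    Revision(3, "Carol", "Toolservers replicate the master database and host tools."),
]

rev = blame_snippet(history, "replicate the master database")
print(rev.rev_id, rev.author)  # the snippet first appears in revision 2, by Bob
```

The index would map terms to revisions, so the per-query cost is a lookup plus a containment check, not a full scan — which is why the open question is memory for the index, not CPU per query.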
Note that WikiTrust can do this too, and will probably go into testing soon. For now, the database for WikiTrust will be off-site, but if it goes live on Wikipedia, the hardware would be run at the main WMF cluster, not on the toolserver.

> * WikiBlame for images - an image similarity algorithm over all images
> in all projects that can find all places a given image is being used.
> I believe there is a one-time major CPU cost when first analyzing the
> images and then a much lesser realtime comparison cost. Again, the
> memory requirements of hosting such an app are unclear.

That would be very nice to have...

> * A vandalism classifier bot that uses the entire history of a wiki in
> order to predict whether the current edit is vandalism. Basically, a
> major extension of existing published work on automatically detecting
> vandalism, which only used several hundred edits. This would require
> major CPU resources for training but very little cost for real-time
> classification.

Pretty big for a toolserver project. But an excellent research topic!

> * Dumps, including extended dump formats such as a natural language
> parse of the full text of the recent version of a wiki, made readily
> available for researchers.
>
> Finally, there are many worthwhile projects that have been presented
> at past Wikimanias or published in the literature that deserve to be
> kept up to date as the encyclopedia continues to grow. Permanent
> hosting for such projects would be a worthwhile goal, as would
> reaching out to these researchers. If the foundation can afford such
> an endeavor, the hardware cost is actually not that great. Perhaps
> datacenter fees are.

Please don't forget that the toolserver is NOT run by the Wikimedia Foundation. It's run by Wikimedia Germany, which has maybe a tenth of the Foundation's budget.
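The one-time-analysis / cheap-comparison split in the image-WikiBlame bullet above is exactly what perceptual fingerprints give you: hash every image once (the expensive pass), then compare tiny fingerprints at query time. Here is a toy "average hash" sketch on plain grayscale pixel grids; a real system would first decode and downscale actual image files, and the sample grids are invented for illustration.

```python
# Toy perceptual fingerprinting: hash each image once (the one-time CPU
# cost), then compare cheap fixed-size fingerprints at query time.

def average_hash(pixels):
    """Bit fingerprint: 1 for each pixel brighter than the image's mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits; small distance means similar images."""
    return bin(a ^ b).count("1")

# Hypothetical 4x4 grayscale thumbnails: a smooth brightness ramp, the
# same ramp with one unit of noise, and an unrelated checkerboard.
ramp    = [[(r + c) ** 2 for c in range(4)] for r in range(4)]
noisy   = [[(r + c) ** 2 + (1 if (r + c) % 2 else -1) for c in range(4)]
           for r in range(4)]
checker = [[255 if (r + c) % 2 else 0 for c in range(4)] for r in range(4)]

d_similar  = hamming(average_hash(ramp), average_hash(noisy))
d_distinct = hamming(average_hash(ramp), average_hash(checker))
print(d_similar, d_distinct)  # the noisy near-duplicate lands far closer than the checkerboard
```

With fingerprints precomputed for all images, finding every place a given image is used becomes a nearest-neighbor search over small integers, which keeps the realtime cost low, as the bullet suggests.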
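The train-expensive / classify-cheap shape of the vandalism-classifier bullet above can be sketched as a naive Bayes model over a few crude per-edit features. All feature names, thresholds, and training examples here are invented for illustration; the published work the bullet refers to uses far richer feature sets and real labelled edits.

```python
# Sketch of history-trained vandalism detection: naive Bayes over a few
# boolean edit features. Features and training data are toy assumptions.
import math
import re

def features(old_text, new_text, comment):
    """Crude boolean features of a single edit (illustrative only)."""
    return {
        "blanked":    len(new_text) < 0.2 * max(len(old_text), 1),
        "shouting":   bool(re.search(r"[A-Z]{6,}", new_text)),
        "no_comment": comment.strip() == "",
        "bad_word":   "stupid" in new_text.lower(),  # toy stand-in for a word list
    }

class NaiveBayes:
    def __init__(self):
        self.counts = {True: {}, False: {}}  # label -> feature -> count of True
        self.totals = {True: 0, False: 0}    # label -> number of examples

    def train(self, examples):
        # The expensive pass: one sweep over labelled history.
        for feats, label in examples:
            self.totals[label] += 1
            for name, value in feats.items():
                if value:
                    self.counts[label][name] = self.counts[label].get(name, 0) + 1

    def p_vandalism(self, feats):
        # The cheap realtime pass: a handful of lookups per edit.
        scores = {}
        n = sum(self.totals.values())
        for label in (True, False):
            logp = math.log((self.totals[label] + 1) / (n + 2))  # smoothed prior
            for name, value in feats.items():
                k = self.counts[label].get(name, 0)
                p_true = (k + 1) / (self.totals[label] + 2)      # Laplace smoothing
                logp += math.log(p_true if value else 1 - p_true)
            scores[label] = logp
        m = max(scores.values())
        z = sum(math.exp(s - m) for s in scores.values())
        return math.exp(scores[True] - m) / z

# Toy labelled history: (features, is_vandalism).
training = [
    (features("long article text here", "", ""), True),
    (features("long article text here", "WAS HERE LOL STUPID", ""), True),
    (features("text", "text plus a sourced sentence.", "add ref"), False),
    (features("text", "text with a typo fixed.", "typo"), False),
]
model = NaiveBayes()
model.train(training)

suspect = features("a decent paragraph of prose", "", "")
print(model.p_vandalism(suspect))  # a blanking edit scores as likely vandalism
```

Training touches the entire history once, which is where the major CPU cost lives; scoring a live edit is a few dictionary lookups, matching the "very little cost for real-time classification" claim.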
If the foundation is interested in supporting us further, that's great; we just need to keep responsibilities clear: is the foundation running a project, or is the foundation helping us (Wikimedia Germany) to run a project?...

-- daniel

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
