Hi, I think this is definitely a great idea that will save lots of researchers a ton of work.

Cheers,

-- Carlos Castillo (ChaTo)

On Tue, Dec 10, 2013, at 11:57 AM, Dario Taraborelli wrote:
> (cross-posting Sebastiano’s post from the analytics list, this may be of
> interest to both the wikidata and wiki-research-l communities)
>
> Begin forwarded message:
>
> > From: Sebastiano Vigna <[email protected]>
> > Subject: [Analytics] Distributing an official graph
> > Date: December 9, 2013 at 10:09:31 PM PST
> >
> > [Reposted from private discussion after Dario's request]
> >
> > My problem is that of exploring the graph structure of Wikipedia
> >
> > 1) easily;
> > 2) reproducibly;
> > 3) in a way that does not depend on parsing artifacts.
> >
> > Presently, when people want to do this they either do their own parsing of
> > the dumps, or they use the SQL data, or they download a dataset like
> >
> > http://law.di.unimi.it/webdata/enwiki-2013/
> >
> > which has everything "cooked up".
> >
> > My frustration in the last few days came when trying to add the category
> > links. I didn't realize (well, it's not well documented) that bliki
> > extracts all links and renders them in HTML *except* for the category
> > links, which are instead accessible programmatically. Once I got there, I
> > was able to make some progress.
> >
> > Nonetheless, I think that the graph of Wikipedia connections (hyperlinks
> > and category links) is really a mine of information and it is a pity that a
> > lot of huffing and puffing is necessary to do something as simple as a
> > reverse visit of the category links from "People" to get, actually, all
> > people pages (this is a bit more complicated--there are many false
> > positives, but after a couple of fixes it worked quite well).
> >
> > Moreover, one continuously has this feeling of walking on eggshells: a
> > small change in bliki, a small change in the XML format and everything
> > might stop working in such a subtle manner that you realize it only after a
> > long time.
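
The reverse visit of the category links described above (starting from "People" and walking back to every page that reaches it) can be sketched as a breadth-first search on the transposed graph. This is only a minimal illustration in plain Python, not the code Sebastiano used and not WebGraph; the function name `reverse_visit` and the toy edge list are hypothetical, and the false-positive filtering he mentions is not attempted.

```python
from collections import deque


def reverse_visit(edges, root):
    """Return every node from which `root` is reachable.

    `edges` is an iterable of (source, target) pairs, for example
    (page, category) or (subcategory, parent category).
    """
    # Build the transposed graph: for each target, the nodes that point to it.
    predecessors = {}
    for source, target in edges:
        predecessors.setdefault(target, []).append(source)

    # Standard breadth-first search, following arcs backwards.
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for pred in predecessors.get(node, ()):
            if pred not in seen:
                seen.add(pred)
                queue.append(pred)

    seen.discard(root)  # report only the nodes found, not the starting point
    return seen


# Hypothetical toy data, not taken from any Wikipedia dump.
toy_edges = [
    ("Category:Scientists", "Category:People"),
    ("Category:Physicists", "Category:Scientists"),
    ("Albert Einstein", "Category:Physicists"),
    ("Marie Curie", "Category:Scientists"),
    ("Rome", "Category:Cities"),
]
people = reverse_visit(toy_edges, "Category:People")
```

On real data the result would also contain the intermediate category nodes, so a namespace filter would be needed to keep only article pages.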
> >
> > I was wondering if Wikimedia would be interested in distributing the
> > Wikipedia graph in compressed form. That would be the "official" Wikipedia
> > graph--the benefits, in particular for people working on leveraging
> > semantic information from Wikipedia, would be really significant.
> >
> > I would (obviously) propose to use our Java framework, WebGraph, which is
> > actually quite standard in distributing large (well, actually much larger)
> > graphs, such as ClueWeb09 http://lemurproject.org/clueweb09/, ClueWeb12
> > http://lemurproject.org/clueweb12/ and the recent Common Web Crawl
> > http://webdatacommons.org/hyperlinkgraph/index.html. But any format is OK,
> > even a pair of integers per line. The advantages of a binary compressed
> > form are reduced network utilization, instantaneous availability of the
> > information, etc.
> >
> > Probably it would be useful to actually distribute several graphs with the
> > same dataset--e.g., the category links, the content links, etc. It is
> > straightforward, using WebGraph, to build a union (i.e., a superposition)
> > of any set of such graphs and use it transparently as a single graph.
> >
> > In my mind the distributed graph should have a contiguous ID space, say,
> > induced by the lexicographical order of the titles (possibly placing
> > template pages at the start or at the end of the ID space). We should
> > provide graphs, and a bidirectional node<->title map. All such information
> > would use about 300M of space for the current English Wikipedia. People
> > could then associate pages to nodes using the title as a key.
> >
> > But this last part is just rambling. :)
> >
> > Let me know if you people are interested. We can of course take care of the
> > process of cooking up the information once it is out of the SQL database.
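
The contiguous ID space, the bidirectional node<->title map and the "pair of integers per line" fallback format mentioned above could look roughly like the following. This is a hedged sketch in plain Python under several assumptions: the helper names (`build_id_maps`, `to_arc_lines`, `union_arcs`) are hypothetical, the tab separator is an arbitrary choice, titles are ordered by Unicode code point (Python's default string order), and none of this uses or reproduces WebGraph's own formats.

```python
def build_id_maps(titles):
    """Assign contiguous IDs 0..n-1 in lexicographic order of the titles.

    Returns (id_to_title, title_to_id): a list for node -> title and a
    dict for title -> node, i.e. the bidirectional map.
    """
    id_to_title = sorted(set(titles))
    title_to_id = {title: i for i, title in enumerate(id_to_title)}
    return id_to_title, title_to_id


def to_arc_lines(title_edges, title_to_id):
    """Render edges as sorted, de-duplicated 'source<TAB>target' lines."""
    arcs = sorted({(title_to_id[s], title_to_id[t]) for s, t in title_edges})
    return [f"{s}\t{t}" for s, t in arcs]


def union_arcs(*arc_line_lists):
    """Superpose several graphs that share the same ID space."""
    merged = set()
    for lines in arc_line_lists:
        for line in lines:
            s, t = line.split("\t")
            merged.add((int(s), int(t)))
    return sorted(merged)


# Hypothetical toy data, not taken from any Wikipedia dump.
titles = ["Rome", "Albert Einstein", "Category:People", "Physics"]
id_to_title, title_to_id = build_id_maps(titles)

content_links = [("Albert Einstein", "Physics"), ("Rome", "Albert Einstein")]
category_links = [("Albert Einstein", "Category:People")]

content_lines = to_arc_lines(content_links, title_to_id)
category_lines = to_arc_lines(category_links, title_to_id)
combined = union_arcs(content_lines, category_lines)
```

Because both graphs are written against the same title-derived IDs, the union is just a set union of arcs, and a page can be looked up in either direction through `title_to_id` and `id_to_title`.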
> >
> > Ciao,
> >
> > seba
> >
> >
> > _______________________________________________
> > Analytics mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/analytics
> >
> _______________________________________________
> Wiki-research-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
