(cross-posting Sebastiano’s post from the analytics list; this may be of interest to both the wikidata and wiki-research-l communities)
Begin forwarded message:

> From: Sebastiano Vigna <[email protected]>
> Subject: [Analytics] Distributing an official graph
> Date: December 9, 2013 at 10:09:31 PM PST
>
> [Reposted from private discussion after Dario's request]
>
> My problem is that of exploring the graph structure of Wikipedia
>
> 1) easily;
> 2) reproducibly;
> 3) in a way that does not depend on parsing artifacts.
>
> Presently, when people want to do this they either do their own parsing of
> the dumps, or they use the SQL data, or they download a dataset like
>
> http://law.di.unimi.it/webdata/enwiki-2013/
>
> which has everything "cooked up".
>
> My frustration in the last few days was when trying to add the category
> links. I didn't realize (well, it's not well documented) that bliki extracts
> all links and renders them in HTML *except* for the category links, which are
> instead accessible programmatically. Once I got there, I was able to make
> some progress.
>
> Nonetheless, I think that the graph of Wikipedia connections (hyperlinks and
> category links) is really a mine of information and it is a pity that a lot
> of huffing and puffing is necessary to do something as simple as a reverse
> visit of the category links from "People" to get, actually, all people pages
> (this is a bit more complicated--there are many false positives, but after a
> couple of fixes it worked quite well).
>
> Moreover, one continuously has this feeling of walking on eggshells: a small
> change in bliki, a small change in the XML format and everything might stop
> working in such a subtle manner that you realize it only after a long time.
>
> I was wondering if Wikimedia would be interested in distributing the
> Wikipedia graph in compressed form. That would be the "official" Wikipedia
> graph--the benefits, in particular for people working on leveraging semantic
> information from Wikipedia, would be really significant.
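
The "reverse visit of the category links" mentioned above can be sketched as a breadth-first search on the transposed category graph. This is only an illustration: the (member, category) pair layout, the function name reverse_visit, and the toy data are assumptions, not part of bliki, WebGraph, or the Wikipedia dumps.

```python
from collections import defaultdict, deque


def reverse_visit(edges, root):
    """Return every node from which `root` is reachable along `edges`.

    `edges` is an iterable of (member, category) pairs meaning
    "member belongs to category". This layout is an assumption.
    """
    # Transpose the graph: map each category to its direct members.
    members = defaultdict(list)
    for member, category in edges:
        members[category].append(member)

    # Breadth-first search over the transposed graph, starting at root.
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for m in members[node]:
            if m not in seen:
                seen.add(m)
                queue.append(m)
    seen.discard(root)
    return seen


# Hypothetical toy data, not taken from any real dump.
category_links = [
    ("Category:Scientists", "Category:People"),
    ("Category:Physicists", "Category:Scientists"),
    ("Marie Curie", "Category:Physicists"),
    ("Ada Lovelace", "Category:Scientists"),
    ("Category:Physicists", "Category:Physics"),
    ("Radioactivity", "Category:Physics"),
]

people = reverse_visit(category_links, "Category:People")
# Keep only article pages, dropping the intermediate categories.
pages = sorted(n for n in people if not n.startswith("Category:"))
print(pages)
```

On real data a plain visit like this would still pick up the false positives the message mentions, so some filtering of categories would be needed on top of it.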
>
> I would (obviously) propose to use our Java framework, WebGraph, which is
> actually quite standard for distributing large (well, actually much larger)
> graphs, such as ClueWeb09 http://lemurproject.org/clueweb09/, ClueWeb12
> http://lemurproject.org/clueweb12/ and the recent Common Web Crawl
> http://webdatacommons.org/hyperlinkgraph/index.html. But any format is OK,
> even a pair of integers per line. The advantage of a binary compressed form
> is reduced network utilization, instantaneous availability of the
> information, etc.
>
> Probably it would be useful to actually distribute several graphs with the
> same dataset--e.g., the category links, the content links, etc. It is
> straightforward, using WebGraph, to build a union (i.e., a superposition) of
> any set of such graphs and use it transparently as a single graph.
>
> In my mind the distributed graph should have a contiguous ID space, say,
> induced by the lexicographical order of the titles (possibly placing template
> pages at the start or at the end of the ID space). We should provide graphs,
> and a bidirectional node<->title map. All such information would use about
> 300M of space for the current English Wikipedia. People could then associate
> pages with nodes using the title as a key.
>
> But this last part is just rambling. :)
>
> Let me know if you people are interested. We can of course take care of the
> process of cooking up the information once it is out of the SQL database.
>
> Ciao,
>
> seba
>
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
_______________________________________________
Wiki-research-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
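
For readers who want a concrete picture of the layout proposed in the message above (contiguous IDs induced by title order, a bidirectional node<->title map, several graphs over the same ID space, one pair of integers per line), here is a minimal sketch. It does not use WebGraph; all function names and the toy titles are hypothetical, and Python's default string ordering (by Unicode code point) is assumed as the lexicographical order.

```python
def build_id_maps(titles):
    """Assign contiguous IDs 0..n-1 in sorted order of the titles.

    Returns (title_to_id, id_to_title); together they form the
    bidirectional node<->title map.
    """
    id_to_title = sorted(set(titles))
    title_to_id = {t: i for i, t in enumerate(id_to_title)}
    return title_to_id, id_to_title


def to_arcs(links, title_to_id):
    """Translate (source title, target title) pairs into sorted integer arcs."""
    return sorted({(title_to_id[s], title_to_id[t]) for s, t in links})


def union(*graphs):
    """Superpose several arc lists that share the same ID space."""
    return sorted(set().union(*graphs))


# Hypothetical toy data, not taken from any real dump.
titles = ["Rome", "Italy", "Category:Cities", "Category:Countries"]
content_links = [("Rome", "Italy"), ("Italy", "Rome")]
category_links = [("Rome", "Category:Cities"), ("Italy", "Category:Countries")]

title_to_id, id_to_title = build_id_maps(titles)
content_graph = to_arcs(content_links, title_to_id)
category_graph = to_arcs(category_links, title_to_id)
combined = union(content_graph, category_graph)

# Plain-text serialization: one "source target" pair of integers per line.
lines = "\n".join(f"{s} {t}" for s, t in combined)
print(lines)
```

Because every graph shares one ID space, the union is a simple set operation on arcs, and any node can be mapped back to its page through id_to_title.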
