Re: [Wiki-research-l] Distributing the Wikipedia category/pagelink graph

Carlos Castillo Sun, 15 Dec 2013 02:20:39 -0800

Hi,

I think this is definitely a great idea that will save lots of
researchers a ton of work.


Cheers,

-- 
Carlos Castillo (ChaTo)

On Tue, Dec 10, 2013, at 11:57 AM, Dario Taraborelli wrote:
> (cross-posting Sebastiano’s post from the analytics list, this may be of
> interest to both the wikidata and wiki-research-l communities)
> 
> Begin forwarded message:
> 
> > From: Sebastiano Vigna <[email protected]>
> > Subject: [Analytics] Distributing an official graph
> > Date: December 9, 2013 at 10:09:31 PM PST
> > 
> > [Reposted from private discussion after Dario's request]
> > 
> > My problem is that of exploring the graph structure of Wikipedia
> > 
> > 1) easily;
> > 2) reproducibly;
> > 3) in a way that does not depend on parsing artifacts.
> > 
> > Presently, when people want to do this they either do their own parsing of
> > the dumps, or they use the SQL data, or they download a dataset like
> > 
> > http://law.di.unimi.it/webdata/enwiki-2013/
> > 
> > which has everything "cooked up".
> > 
> > My frustration in the last few days was when trying to add the category 
> > links. I didn't realize (well, it's not well documented) that bliki
> > extracts all links and renders them in HTML *except* for the category links,
> > which are instead accessible programmatically. Once I got there, I was able
> > to make some progress.
> > 
> > Nonetheless, I think that the graph of Wikipedia connections (hyperlinks 
> > and category links) is really a mine of information and it is a pity that a 
> > lot of huffing and puffing is necessary to do something as simple as a 
> > reverse visit of the category links from "People" to get, actually, all 
> > people pages (this is a bit more complicated--there are many false 
> > positives, but after a couple of fixes worked quite well).
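A minimal sketch of the reverse visit described above, assuming the category links are available as (member, category) pairs. The function name, the pair layout and the toy titles are illustrative assumptions, not part of any dump format, and the false-positive fixes mentioned in the message are not modeled.

```python
from collections import deque


def reverse_visit(category_links, root):
    """Return every node from which `root` is reachable via category links.

    category_links: iterable of (member, category) pairs, meaning that
    `member` (a page or a subcategory) belongs to `category`.
    """
    # Build the transposed graph: category -> its direct members.
    members = {}
    for member, category in category_links:
        members.setdefault(category, []).append(member)

    # Breadth-first visit starting from the root category.
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in members.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)

    seen.discard(root)
    return seen


# Toy data (hypothetical titles, not taken from a real dump).
links = [
    ("Category:Scientists", "Category:People"),
    ("Category:Physicists", "Category:Scientists"),
    ("Ada Lovelace", "Category:Scientists"),
    ("Enrico Fermi", "Category:Physicists"),
    ("Rome", "Category:Cities"),
]
people = reverse_visit(links, "Category:People")
```

The result contains both subcategories and pages; telling them apart (for instance by a title prefix) would be a separate step.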
> > 
> > Moreover, one continuously has this feeling of walking on eggshells: a
> > small change in bliki, a small change in the XML format and everything
> > might stop working in such a subtle manner that you realize it only after a
> > long time.
> > 
> > I was wondering if Wikimedia would be interested in distributing in 
> > compressed form the Wikipedia graph. That would be the "official" Wikipedia 
> > graph--the benefits, in particular for people working on leveraging 
> > semantic information from Wikipedia, would be really significant.
> > 
> > I would (obviously) propose to use our Java framework, WebGraph, which is 
> > actually quite standard in distributing large (well, actually much larger) 
> > graphs, such as ClueWeb09 http://lemurproject.org/clueweb09/, ClueWeb12 
> > http://lemurproject.org/clueweb12/ and the recent Common Web Crawl 
> > http://webdatacommons.org/hyperlinkgraph/index.html. But any format is OK, 
> > even a pair of integers per line. The advantage of a binary compressed form 
> > is reduced network utilization, instantaneous availability of the 
> > information, etc.
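For the plain-text alternative ("a pair of integers per line"), a reader could be as small as the sketch below. The whitespace separator, the skipping of blank and "#" lines, and the function name are assumptions made for illustration; the message does not specify any of them.

```python
def read_edge_list(lines):
    """Parse 'source target' integer pairs into sorted adjacency lists."""
    successors = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments (an assumed convention)
        source, target = map(int, line.split())
        successors.setdefault(source, []).append(target)
        successors.setdefault(target, [])  # make sure sinks appear as nodes
    # Remove duplicate arcs and sort each successor list.
    return {node: sorted(set(succ)) for node, succ in successors.items()}


# Toy input: one arc per line, with a blank line and a repeated arc.
sample = ["0 1", "0 2", "2 1", "", "2 1"]
graph = read_edge_list(sample)
```

In practice `lines` could be an open file object, since iterating over a file yields its lines.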
> > 
> > Probably it would be useful to actually distribute several graphs with the 
> > same dataset--e.g., the category links, the content links, etc. It is
> > immediate, using WebGraph, to build a union (i.e., a superposition) of any 
> > set of such graphs and use it transparently as a single graph.
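To illustrate what a union (superposition) of graphs over the same node IDs means, here is a small sketch in plain Python. It does not use WebGraph's own API, and unlike the transparent union described above it materializes the merged graph eagerly; the dictionary representation and the names are assumptions.

```python
def union(*graphs):
    """Superpose graphs given as {node: iterable of successors}."""
    merged = {}
    for graph in graphs:
        for node, successors in graph.items():
            # An arc is in the union if it appears in at least one graph.
            merged.setdefault(node, set()).update(successors)
    return {node: sorted(succ) for node, succ in merged.items()}


# Toy graphs sharing the same ID space (hypothetical data).
page_links = {0: [1], 1: [2], 2: []}
category_links = {0: [2], 1: [2], 2: []}
combined = union(page_links, category_links)
```

This only works because both graphs refer to the same nodes by the same integers, which is why a shared ID space matters.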
> > 
> > In my mind the distributed graph should have a contiguous ID space, say, 
> > induced by the lexicographical order of the titles (possibly placing 
> > template pages at the start or at the end of the ID space). We should 
> > provide graphs, and a bidirectional node<->title map. All such information 
> > would use about 300M of space for the current English Wikipedia. People 
> > could then associate pages to nodes using the title as a key.
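One possible reading of the ID space proposed above, as a sketch: sort the titles and use each title's position as its node ID. Python's default string ordering (by code point) stands in for "lexicographical order" here, and the special placement of template pages is left out; both are simplifying assumptions, as are the function and variable names.

```python
def build_title_maps(titles):
    """Return (id_to_title, title_to_id) for a contiguous ID space."""
    ordered = sorted(set(titles))  # code-point order, duplicates removed
    id_to_title = ordered  # a list: the node ID is the index
    title_to_id = {title: node for node, title in enumerate(ordered)}
    return id_to_title, title_to_id


# Toy titles (hypothetical), including one duplicate.
titles = ["Rome", "Ada Lovelace", "Enrico Fermi", "Rome"]
id_to_title, title_to_id = build_title_maps(titles)
```

With these two structures, a page can be looked up by title to get its node, and a node can be mapped back to its title by indexing the list.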
> > 
> > But this last part is just rambling. :)
> > 
> > Let me know if you people are interested. We can of course take care of the 
> > process of cooking up the information once it is out of the SQL database.
> > 
> > Ciao,
> > 
> >                                     seba
> > 
> > 
> > _______________________________________________
> > Analytics mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/analytics
> 
> 
> _______________________________________________
> Wiki-research-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l