Worth noting: the license for this data is compatible with the Apache License for software, since it is CC Attribution 3.0 (CC BY 3.0):
http://creativecommons.org/licenses/by/3.0/legalcode

as linked from their download dir:

http://www-nlp.stanford.edu/pubs/crosswikis-data.tar.bz2/

On Sat, May 19, 2012 at 12:50 PM, Dan Brickley <[email protected]> wrote:
> Just noticed this handy-looking dataset,
>
> http://googleresearch.blogspot.com/2012/05/from-words-to-concepts-and-back.html
> "From Words to Concepts and Back: Dictionaries for Linking Text,
> Entities and Ideas"
>
> Excerpt, "How do we represent concepts? Our approach piggybacks on the
> unique titles of entries from an encyclopedia, which are mostly proper
> and common noun phrases. We consider each individual Wikipedia article
> as representing a concept (an entity or an idea), identified by its
> URL. Text strings that refer to concepts were collected using the
> publicly available hypertext of anchors (the text you click on in a
> web link) that point to each Wikipedia page, thus drawing on the vast
> link structure of the web. For every English article we harvested the
> strings associated with its incoming hyperlinks from the rest of
> Wikipedia, the greater web, and also anchors of parallel, non-English
> Wikipedia pages. Our dictionaries are cross-lingual, and any concept
> deemed too fine can be broadened to a desired level of generality
> using Wikipedia's groupings of articles into hierarchical categories.
>
> The data set contains triples, each consisting of (i) text, a short,
> raw natural language string; (ii) url, a related concept, represented
> by an English Wikipedia article's canonical location; and (iii) count,
> an integer indicating the number of times text has been observed
> connected with the concept's url. Our database thus includes weights
> that measure degrees of association." [...]
>
> I figured this should be of interest to a good few Mahout users, so
> passing it along...
>
> cheers,
>
> Dan
>

--
  -jake
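
P.S. For anyone who wants to experiment with the (text, url, count) triples described in the excerpt, here is a minimal Python sketch of turning raw counts into per-string association weights. The function name `association_weights`, the in-memory sample triples (with made-up counts), and the choice to normalize by each string's total count are all my own assumptions for illustration; this is not the actual file layout of the download, nor necessarily how the authors compute their weights.

```python
from collections import defaultdict


def association_weights(triples):
    """Map each text string to {url: share of that string's total count}.

    `triples` is an iterable of (text, url, count) tuples, mirroring the
    triple structure described in the excerpt (hypothetical in-memory form).
    """
    # First pass: total count observed for each text string.
    totals = defaultdict(int)
    for text, _url, count in triples:
        totals[text] += count

    # Second pass: divide each count by its string's total.
    weights = defaultdict(dict)
    for text, url, count in triples:
        share = count / totals[text]
        weights[text][url] = weights[text].get(url, 0.0) + share
    return {text: dict(urls) for text, urls in weights.items()}


# Made-up counts, for illustration only (not taken from the dataset).
sample = [
    ("jaguar", "http://en.wikipedia.org/wiki/Jaguar", 6),
    ("jaguar", "http://en.wikipedia.org/wiki/Jaguar_Cars", 3),
    ("jaguar", "http://en.wikipedia.org/wiki/Mac_OS_X_v10.2", 1),
]
weights = association_weights(sample)
```

Normalizing per string gives a rough estimate of how likely each concept is for a given anchor text, which is one simple way to use the counts as "degrees of association".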
