This is nice. Thanks!

On Sat, May 19, 2012 at 1:38 PM, Jake Mannix <[email protected]> wrote:
> Nice to note: the license for this data is very nicely compatible with the
> Apache License for software, as it's CC-3.0 Attribution:
>
> http://creativecommons.org/licenses/by/3.0/legalcode
>
> as linked from their download dir:
>
> http://www-nlp.stanford.edu/pubs/crosswikis-data.tar.bz2/
>
> On Sat, May 19, 2012 at 12:50 PM, Dan Brickley <[email protected]> wrote:
>
>> Just noticed this handy-looking dataset,
>>
>> http://googleresearch.blogspot.com/2012/05/from-words-to-concepts-and-back.html
>> "From Words to Concepts and Back: Dictionaries for Linking Text,
>> Entities and Ideas"
>>
>> Excerpt, "How do we represent concepts? Our approach piggybacks on the
>> unique titles of entries from an encyclopedia, which are mostly proper
>> and common noun phrases. We consider each individual Wikipedia article
>> as representing a concept (an entity or an idea), identified by its
>> URL. Text strings that refer to concepts were collected using the
>> publicly available hypertext of anchors (the text you click on in a
>> web link) that point to each Wikipedia page, thus drawing on the vast
>> link structure of the web. For every English article we harvested the
>> strings associated with its incoming hyperlinks from the rest of
>> Wikipedia, the greater web, and also anchors of parallel, non-English
>> Wikipedia pages. Our dictionaries are cross-lingual, and any concept
>> deemed too fine can be broadened to a desired level of generality
>> using Wikipedia's groupings of articles into hierarchical categories.
>>
>> The data set contains triples, each consisting of (i) text, a short,
>> raw natural language string; (ii) url, a related concept, represented
>> by an English Wikipedia article's canonical location; and (iii) count,
>> an integer indicating the number of times text has been observed
>> connected with the concept's url. Our database thus includes weights
>> that measure degrees of association." [...]
>>
>> I figured this should be of interest to a good few Mahout users, so
>> passing it along...
>>
>> cheers,
>>
>> Dan
>>
>
> --
>
> -jake

--
Lance Norskog
[email protected]
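
For anyone who wants to experiment with the (text, url, count) triples described in the quoted excerpt, here is a minimal Python sketch of one way to turn raw counts into per-string association weights. It is only an illustration: the tab-separated line layout, the function names parse_triples and association_weights, and the sample rows are all assumptions made for this example, not the actual format or contents of the released files.

```python
from collections import defaultdict


def parse_triples(lines):
    """Yield (text, url, count) tuples.

    Assumes one triple per line, tab-separated as text<TAB>url<TAB>count.
    This layout is hypothetical; the real files may be laid out differently.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        text, url, count = line.split("\t")
        yield text, url, int(count)


def association_weights(triples):
    """Map each text string to {url: share of that string's total count}."""
    counts = defaultdict(lambda: defaultdict(int))
    for text, url, count in triples:
        counts[text][url] += count

    weights = {}
    for text, by_url in counts.items():
        total = sum(by_url.values())
        if total:  # guard against a string whose counts are all zero
            weights[text] = {url: c / total for url, c in by_url.items()}
    return weights


# Made-up sample rows, for illustration only (not taken from the dataset).
sample = [
    "jaguar\thttp://en.wikipedia.org/wiki/Jaguar\t30",
    "jaguar\thttp://en.wikipedia.org/wiki/Jaguar_Cars\t10",
    "big apple\thttp://en.wikipedia.org/wiki/New_York_City\t5",
]
weights = association_weights(parse_triples(sample))
```

With the sample rows, weights["jaguar"] splits 0.75 / 0.25 between the two URLs. Normalizing per string is just one reading of "degrees of association"; normalizing per URL instead would answer the reverse question of which strings most often name a given concept.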
