This is different in that it uses the entire web as well as Wikipedia.

On Sun, May 20, 2012 at 1:32 PM, John Stewart <[email protected]> wrote:
> Reminds me of Explicit Semantic Analysis (ESA), which in some
> circumstances appears to perform as well or better compared to missing data
> approaches (LSA, topic modelling).
>
> http://code.google.com/p/research-esa/
>
> jds
> ________________________________________
> From: Dan Brickley [[email protected]]
> Sent: Saturday, May 19, 2012 3:50 PM
> To: [email protected]
> Subject: Wikipedia things/strings dataset
>
> Just noticed this handy-looking dataset,
>
> http://googleresearch.blogspot.com/2012/05/from-words-to-concepts-and-back.html
> "From Words to Concepts and Back: Dictionaries for Linking Text,
> Entities and Ideas"
>
> Excerpt, "How do we represent concepts? Our approach piggybacks on the
> unique titles of entries from an encyclopedia, which are mostly proper
> and common noun phrases. We consider each individual Wikipedia article
> as representing a concept (an entity or an idea), identified by its
> URL. Text strings that refer to concepts were collected using the
> publicly available hypertext of anchors (the text you click on in a
> web link) that point to each Wikipedia page, thus drawing on the vast
> link structure of the web. For every English article we harvested the
> strings associated with its incoming hyperlinks from the rest of
> Wikipedia, the greater web, and also anchors of parallel, non-English
> Wikipedia pages. Our dictionaries are cross-lingual, and any concept
> deemed too fine can be broadened to a desired level of generality
> using Wikipedia's groupings of articles into hierarchical categories.
>
> The data set contains triples, each consisting of (i) text, a short,
> raw natural language string; (ii) url, a related concept, represented
> by an English Wikipedia article's canonical location; and (iii) count,
> an integer indicating the number of times text has been observed
> connected with the concept's url. Our database thus includes weights
> that measure degrees of association. " [...]
>
> I figured this should be of interest to a good few Mahout users, so
> passing it along...
>
> cheers,
>
> Dan
>
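For anyone who wants to experiment with the (text, url, count) triples
described in the excerpt, here is a rough sketch of turning raw counts into
per-string association weights. It assumes a tab-separated layout of
text, url, count per line, which is only a guess for illustration and may
not match the distributed files. The function names (load_triples,
concept_weights) and the sample rows are hypothetical as well.

```python
from collections import defaultdict


def load_triples(lines):
    """Parse 'text<TAB>url<TAB>count' lines into (text, url, count) tuples.

    The tab-separated layout is an assumption, not the documented format.
    """
    triples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        text, url, count = line.split("\t")
        triples.append((text, url, int(count)))
    return triples


def concept_weights(triples):
    """Return {text: {url: share of that text's total count}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for text, url, count in triples:
        counts[text][url] += count
    weights = {}
    for text, by_url in counts.items():
        total = sum(by_url.values())
        weights[text] = {url: c / total for url, c in by_url.items()}
    return weights


# Made-up rows for illustration only; not taken from the real data set.
SAMPLE = [
    "jaguar\thttp://en.wikipedia.org/wiki/Jaguar\t30",
    "jaguar\thttp://en.wikipedia.org/wiki/Jaguar_Cars\t10",
    "big cat\thttp://en.wikipedia.org/wiki/Jaguar\t5",
]

weights = concept_weights(load_triples(SAMPLE))
# Pick the concept most strongly associated with the string "jaguar".
best = max(weights["jaguar"], key=weights["jaguar"].get)
print(best, weights["jaguar"][best])
```

Dividing each count by the total for its string gives a simple relative
frequency per concept. That is just one way to read the counts as "degrees
of association"; the post itself does not prescribe a normalization.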
