This is different in that it uses the entire web as well as Wikipedia.

On Sun, May 20, 2012 at 1:32 PM, John Stewart <[email protected]> wrote:

> Reminds me of Explicit Semantic Analysis (ESA), which in some
> circumstances appears to perform as well as or better than missing data
> approaches (LSA, topic modelling).
>
> http://code.google.com/p/research-esa/
>
> jds
> ________________________________________
> From: Dan Brickley [[email protected]]
> Sent: Saturday, May 19, 2012 3:50 PM
> To: [email protected]
> Subject: Wikipedia things/strings dataset
>
> Just noticed this handy-looking dataset,
>
> http://googleresearch.blogspot.com/2012/05/from-words-to-concepts-and-back.html
> "From Words to Concepts and Back: Dictionaries for Linking Text,
> Entities and Ideas"
>
> Excerpt, "How do we represent concepts? Our approach piggybacks on the
> unique titles of entries from an encyclopedia, which are mostly proper
> and common noun phrases. We consider each individual Wikipedia article
> as representing a concept (an entity or an idea), identified by its
> URL. Text strings that refer to concepts were collected using the
> publicly available hypertext of anchors (the text you click on in a
> web link) that point to each Wikipedia page, thus drawing on the vast
> link structure of the web. For every English article we harvested the
> strings associated with its incoming hyperlinks from the rest of
> Wikipedia, the greater web, and also anchors of parallel, non-English
> Wikipedia pages. Our dictionaries are cross-lingual, and any concept
> deemed too fine can be broadened to a desired level of generality
> using Wikipedia's groupings of articles into hierarchical categories.
>
> The data set contains triples, each consisting of (i) text, a short,
> raw natural language string; (ii) url, a related concept, represented
> by an English Wikipedia article's canonical location; and (iii) count,
> an integer indicating the number of times text has been observed
> connected with the concept's url. Our database thus includes weights
> that measure degrees of association." [...]
>
> I figured this should be of interest to a good few Mahout users, so
> passing it along...
>
> cheers,
>
> Dan
>
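
For anyone who wants to try the (text, url, count) triples described in the quoted excerpt, here is a rough sketch of turning raw counts into per-string association weights. It assumes a tab-separated "text<TAB>url<TAB>count" layout, which is a guess and may not match the released files. The function names and the two sample rows are made up for illustration.

```python
from collections import defaultdict


def load_triples(lines):
    """Parse 'text<TAB>url<TAB>count' lines into {text: {url: count}}.

    The tab-separated layout is an assumption for this sketch,
    not the documented format of the released dataset.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        text, url, count = line.split("\t")
        counts[text][url] += int(count)
    return counts


def association(counts, text):
    """Return {url: share of observations} for one string, or {} if unseen."""
    by_url = counts.get(text, {})
    total = sum(by_url.values())
    if not total:
        return {}
    return {url: n / total for url, n in by_url.items()}


# Made-up sample rows, for illustration only.
sample = [
    "jaguar\thttp://en.wikipedia.org/wiki/Jaguar\t30",
    "jaguar\thttp://en.wikipedia.org/wiki/Jaguar_Cars\t10",
]
counts = load_triples(sample)
weights = association(counts, "jaguar")
```

Dividing each count by the total for its string gives a simple relative weight per concept. Anything fancier, such as smoothing or thresholding rare anchors, is left out here.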
