Reminds me of Explicit Semantic Analysis (ESA), which in some circumstances appears to perform as well as or better than missing-data approaches (LSA, topic modelling).

http://code.google.com/p/research-esa/
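Tangentially, here is a minimal sketch of the ESA idea: represent a text as a weighted vector over "concepts" (Wikipedia articles), then compare texts in that concept space. The toy three-article corpus and concept names below are made up for illustration; real ESA indexes the full Wikipedia.

```python
# Minimal ESA sketch over a toy "concept" corpus. Real ESA uses whole
# Wikipedia articles as concepts; these three documents are stand-ins.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Each document plays the role of one Wikipedia article (= one concept).
concepts = ["Jaguar", "Car", "Amazon rainforest"]
articles = [
    "the jaguar is a large cat native to the americas",
    "a car is a wheeled motor vehicle used for transportation",
    "the amazon rainforest is a moist broadleaf forest in south america",
]

vectorizer = TfidfVectorizer()
# Rows = concepts, columns = terms; this is the inverted index ESA builds.
concept_term = vectorizer.fit_transform(articles)

def esa_vector(text):
    """Project a text into concept space: the TF-IDF vector of the text
    times the term-concept matrix, i.e. how strongly it evokes each concept."""
    tfidf = vectorizer.transform([text])
    return (tfidf @ concept_term.T).toarray()[0]

v = esa_vector("a big cat prowling the forest")
for concept, weight in sorted(zip(concepts, v), key=lambda p: -p[1]):
    print(f"{concept}: {weight:.3f}")
```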
jds

________________________________________
From: Dan Brickley [[email protected]]
Sent: Saturday, May 19, 2012 3:50 PM
To: [email protected]
Subject: Wikipedia things/strings dataset

Just noticed this handy-looking dataset,

http://googleresearch.blogspot.com/2012/05/from-words-to-concepts-and-back.html

"From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas"

Excerpt,

"How do we represent concepts? Our approach piggybacks on the unique titles of entries from an encyclopedia, which are mostly proper and common noun phrases. We consider each individual Wikipedia article as representing a concept (an entity or an idea), identified by its URL. Text strings that refer to concepts were collected using the publicly available hypertext of anchors (the text you click on in a web link) that point to each Wikipedia page, thus drawing on the vast link structure of the web. For every English article we harvested the strings associated with its incoming hyperlinks from the rest of Wikipedia, the greater web, and also anchors of parallel, non-English Wikipedia pages.

Our dictionaries are cross-lingual, and any concept deemed too fine can be broadened to a desired level of generality using Wikipedia's groupings of articles into hierarchical categories.

The data set contains triples, each consisting of (i) text, a short, raw natural language string; (ii) url, a related concept, represented by an English Wikipedia article's canonical location; and (iii) count, an integer indicating the number of times text has been observed connected with the concept's url. Our database thus includes weights that measure degrees of association."

[...]

I figured this should be of interest to a good few Mahout users, so passing it along...

cheers,

Dan
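For anyone who wants to poke at the (text, url, count) triples Dan quotes above, here is a minimal sketch of loading them into a string-to-concept lookup with the counts normalized into P(concept | string). The tab-separated layout, column order, and file name are assumptions about the released format, not something stated in the announcement.

```python
# Hedged sketch: load (text, url, count) triples and turn raw counts into
# conditional probabilities P(concept | string). File format is assumed TSV.
import csv
from collections import defaultdict

def load_dictionary(path):
    """Map each surface string to a list of (wikipedia_url, probability),
    sorted from most to least likely concept."""
    counts = defaultdict(dict)
    with open(path, newline="", encoding="utf-8") as f:
        for text, url, count in csv.reader(f, delimiter="\t"):
            counts[text][url] = counts[text].get(url, 0) + int(count)
    lookup = {}
    for text, urls in counts.items():
        total = sum(urls.values())
        lookup[text] = sorted(
            ((url, n / total) for url, n in urls.items()),
            key=lambda pair: -pair[1],
        )
    return lookup

# Usage: most likely concepts for an ambiguous string.
# d = load_dictionary("dictionary.tsv")   # hypothetical file name
# print(d["jaguar"][:3])  # e.g. the cat, the car maker, the OS release
```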
