This is nice. Thanks!

On Sat, May 19, 2012 at 1:38 PM, Jake Mannix <[email protected]> wrote:
> Nice to note: the license for this data is very nicely compatible with the
> Apache License for software, as it's CC-3.0 Attribution:
>
>  http://creativecommons.org/licenses/by/3.0/legalcode
>
> as linked from their download dir:
>
>  http://www-nlp.stanford.edu/pubs/crosswikis-data.tar.bz2/
>
> On Sat, May 19, 2012 at 12:50 PM, Dan Brickley <[email protected]> wrote:
>
>> Just noticed this handy-looking dataset,
>>
>> http://googleresearch.blogspot.com/2012/05/from-words-to-concepts-and-back.html
>> "From Words to Concepts and Back: Dictionaries for Linking Text,
>> Entities and Ideas"
>>
>> Excerpt, "How do we represent concepts? Our approach piggybacks on the
>> unique titles of entries from an encyclopedia, which are mostly proper
>> and common noun phrases. We consider each individual Wikipedia article
>> as representing a concept (an entity or an idea), identified by its
>> URL. Text strings that refer to concepts were collected using the
>> publicly available hypertext of anchors (the text you click on in a
>> web link) that point to each Wikipedia page, thus drawing on the vast
>> link structure of the web. For every English article we harvested the
>> strings associated with its incoming hyperlinks from the rest of
>> Wikipedia, the greater web, and also anchors of parallel, non-English
>> Wikipedia pages. Our dictionaries are cross-lingual, and any concept
>> deemed too fine can be broadened to a desired level of generality
>> using Wikipedia's groupings of articles into hierarchical categories.
>>
>> The data set contains triples, each consisting of (i) text, a short,
>> raw natural language string; (ii) url, a related concept, represented
>> by an English Wikipedia article's canonical location; and (iii) count,
>> an integer indicating the number of times text has been observed
>> connected with the concept's url. Our database thus includes weights
>> that measure degrees of association." [...]
>>
>> I figured this should be of interest to a good few Mahout users, so
>> passing it along...
>>
>> cheers,
>>
>> Dan
>>
>
>
>
> --
>
>  -jake
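
For anyone who wants to try the (text, url, count) triples described in the
excerpt above, here is a minimal sketch of turning the counts into
per-string association weights. It assumes the triples are already loaded as
Python tuples; the sample rows and the association_weights name are made up
for illustration and are not taken from the data set or from Mahout.

```python
from collections import defaultdict


def association_weights(triples):
    """Estimate P(url | text) from (text, url, count) triples.

    Hypothetical helper: counts for a repeated (text, url) pair are summed,
    then each count is divided by the total for that text string.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for text, url, count in triples:
        counts[text][url] += count

    weights = {}
    for text, by_url in counts.items():
        total = sum(by_url.values())
        weights[text] = {url: c / total for url, c in by_url.items()}
    return weights


# Made-up sample triples, for illustration only (not real data set rows).
sample = [
    ("jaguar", "http://en.wikipedia.org/wiki/Jaguar", 6),
    ("jaguar", "http://en.wikipedia.org/wiki/Jaguar_Cars", 3),
    ("jaguar", "http://en.wikipedia.org/wiki/Mac_OS_X_v10.2", 1),
]

weights = association_weights(sample)
# Pick the concept most strongly associated with the string "jaguar".
best_url = max(weights["jaguar"], key=weights["jaguar"].get)
```

Normalizing per text string is only one possible reading of "degrees of
association"; normalizing per url instead would give the likelihood of a
string given a concept.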



-- 
Lance Norskog
[email protected]
