Thanks. That final one is now in torrent-land.

The RDF SKOS stuff reminds me: RDF seems pointless without a confidence factor per triple. It would be cool to take a large self-inconsistent RDF-triple graph and grind out a set of confidence factors that makes it consistent. "This stuff is good, this stuff is generally solid, that stuff: poisonous lies!"
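
To make the confidence-factor idea a bit more concrete, here is a rough sketch, not a worked-out method. It rests on two assumptions that are mine and not part of RDF itself: every triple comes tagged with the source that asserted it, and every (subject, predicate) pair is treated as functional, so two different objects for the same pair count as a contradiction. The function name estimate_confidence and the starting trust of 0.5 are made up for illustration. Sources that agree with the majority gain trust, and triples backed by trusted sources gain confidence, repeated for a fixed number of rounds.

```python
from collections import defaultdict


def estimate_confidence(claims, rounds=20):
    """Estimate a confidence per triple and a trust score per source.

    claims: iterable of (source, subject, predicate, object) tuples.
    Assumption: each (subject, predicate) pair can have only one true
    object, so competing objects are treated as mutually inconsistent.
    """
    supporters = defaultdict(set)  # triple -> sources asserting it
    for source, s, p, o in claims:
        supporters[(s, p, o)].add(source)

    # Group competing triples by their (subject, predicate) key.
    groups = defaultdict(list)
    for triple in supporters:
        groups[triple[:2]].append(triple)

    # Every source starts out equally (and only moderately) trusted.
    trust = {src: 0.5 for srcs in supporters.values() for src in srcs}
    confidence = {}

    for _ in range(rounds):
        # A triple's confidence is its share of the trust in its group.
        for triples in groups.values():
            scores = {t: sum(trust[src] for src in supporters[t])
                      for t in triples}
            total = sum(scores.values())
            for t in triples:
                confidence[t] = (scores[t] / total if total
                                 else 1.0 / len(triples))

        # A source's trust is the mean confidence of what it asserted.
        asserted = defaultdict(list)
        for t, srcs in supporters.items():
            for src in srcs:
                asserted[src].append(confidence[t])
        trust = {src: sum(v) / len(v) for src, v in asserted.items()}

    return confidence, trust
```

Under the functional assumption, the confidences inside each group sum to 1, so keeping only triples above 0.5 leaves at most one object per (subject, predicate) pair. Real RDF graphs have many non-functional predicates, so the notion of "inconsistent" would need to come from an ontology or rule set rather than from this shortcut.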

On Fri, Jul 8, 2011 at 3:23 PM, Dan Brickley <[email protected]> wrote:
> On 9 July 2011 00:03, Lance Norskog <[email protected]> wrote:
>> Ratings and more generally "parallel universe" or "dual space" or
>> "dyadic" (but that is other things): Correspondences between samples
>> in two different parallel spaces.
>>
>> A mail corpus has different kinds of gleanable knowledge: word/subject
>> line correspondences, authoritative mail v.s. conversational, reply-to
>> is a one-way relationship in the same space, time series aspects, and
>> more. It would be a good base for an examples/ set of several
>> algorithms and interpreting all concepts. That's a trimester course.
>
> On the mail front, just to mention I finally found a tool capable of
> downloading my over-sized Gmail mail archives:
>
> http://toroid.org/ams/etc/gmail-imap-mirror
>
> It also preserves tags. And since I've got Gmail filters tagging
> almost every incoming mail, at least those from lists, then this
> creates over time a nice repository of associations from posters to
> mailing lists to folder label tags. I've been thinking to throw this
> soup into Mahout but haven't really thought through exactly what to
> try first. I thought it would be nice to cluster tags and people, for
> example.
>
> Another family of dataset I mentioned recently: there are lots of
> "Linked Data" collections opening up, from libraries, museums and
> other public sector bodies. See cloud visualisation at
> http://richard.cyganiak.de/2007/10/lod/ or the directory of datasets
> at http://ckan.net/ from which this is generated.
>
> The library/museum and cultural heritage collections in this scene
> often use SKOS, which is an RDF vocabulary for representing topics
> (thesaurus-like stuff). So you get some interesting structure there,
> and often a database that is a set of records which are tagged with
> one or more SKOS URI. So again, not classic recommendation dataset but
> a lot still worth digging into.
>
> Bibliographic data: http://ckan.net/group/bibliographic
>
> Nearby: draft report from W3C Linked Library incubator group,
> http://lists.w3.org/Archives/Public/public-lld/2011Jun/0084.html ...
> including lengthy 'vocabularies and datasets' report,
> http://www.w3.org/2005/Incubator/lld/wiki/Vocabulary_and_Dataset
>
> More on SKOS, see http://www.w3.org/2004/02/skos/
>
> British Library's "national bibliography" at
> http://www.bl.uk/bibliographic/datasamples.html
> http://ckan.net/package/jiscopenbib-bl_bnb-1
> http://openbiblio.net/2010/11/22/querying-the-british-national-bibliography/
>
> ...this gives you about 3 million records. Some of which data surfaces
> in http://bibliographica.org/ ... and nearby there are things like
> http://openlibrary.org/ not to mention DBpedia.org, Freebase.com and
> similar.
>
> So while there might be a shortage of classic "users x items"
> entertainment-content recommender datasets, there are many many other
> interesting collections being released as open data, on weekly basis.
>
> cheers,
>
> Dan
>
>
> ps. many Twitter crawls have vanished, but take a close look at the
> html for http://snap.stanford.edu/data/twitter7.html ...
>

--
Lance Norskog
[email protected]
