(switching to the user list, and rejigging for readability) I was asking earlier about LogLikelihood (http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html) and sketched an application; now I've come back to ask for a few more pointers.
Recap: I have a couple of pretty large (3M and 12M record) bibliographic datasets, both of which have fairly cryptic subject codes plus short textual titles applied to a lot of books. We're trying to match these subject codes against other collections that are described only with a few simple web2-style tags, so the hope is to see whether the topics can be augmented with indicative keywords (and maybe later phrases) extracted from document titles. On the technical side, Lucene/Solr is already in use, so ideally I'd find a way to apply Mahout's LogLikelihood to term vectors imported from Lucene indices.

On 29 July 2011 20:58, Ted Dunning <[email protected]> wrote:
> The short answer, however, is that LLR as produced by Mahout should be
> plenty for your problem. It should be pretty easy to produce a list of
> interesting key words for each subject code and these are reasonably likely
> to do a good job of retrieving documents.

Thanks for the sanity check. I glossed over a bit of detail around the subject codes: they're actually mainly Library of Congress Subject Headings (LCSH), or local extensions based on LCSH, so they themselves have both text and some structure.

Example record (excerpts):

  title: The boy who harnessed the wind : creating currents of electricity and hope.
  summary: An enterprising teenager in Malawi builds a windmill from scraps he finds around his village and brings electricity, and a future, to his family.
  subject(s):
    Windmills -- Malawi.
    Inventors -- Malawi.
    Water-supply, Rural -- Malawi.
    Rural electrification -- Malawi.
    Electric power production -- Malawi.
    Kamkwamba, William, 1987-
    Malawi -- Rural conditions.

The summary isn't always there. Sometimes we also have a more code-like subject classification (e.g. Dewey Decimal Classification, or Library of Congress Classification) alongside these thesaurus-like, textually oriented controlled subject terms. One thing I didn't mention is that there are a few thousand of these subject values.

This might not be a great example, but perhaps it'll do: the title and summary both contain "electricity", whereas we only have a subject heading with the base phrase "Electric power production". Now perhaps overwhelmingly many other books filed under "Electric power production -- Malawi." or "Electric power production -- Netherlands" etc. also contain the word "electricity" (and perhaps we can assume the subjects have been canonicalised to throw away that final geographic qualifier / facet?). I'm not sure which pieces from the Mahout toolkit can be plugged together to produce a set of annotations associating "electricity", and other words (or ideally short phrases from the titles), with "Electric power production". (I've put a small sketch of how I understand the LLR counting below, after the aside.)

Aside: these LCSH subject codes can often, these days, be canonicalised to URIs hosted by the Library of Congress; see http://id.loc.gov/authorities/ and nearby. So http://id.loc.gov/authorities/sh85146930 is "Windmills", for example, and that entry in turn links (machine-readably, with SKOS) to http://id.loc.gov/authorities/sh85146874 for "Wind power". However "Windmills -- Malawi" doesn't show up there yet, although its unregistered use does show up in http://authorities.loc.gov/ ... There is also substructure in the system; the geographical facet comes last, so we can derive "Rural electrification" from "Rural electrification -- Malawi", which maps to http://id.loc.gov/authorities/sh85115910 .... My reason to mention this is that bibliographic datasets are increasingly using such URL IDs to indicate subject codes of various kinds.
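Back from that aside: to make the "electricity" vs. "Electric power production" example concrete, here's my understanding of the 2x2 contingency table that the LLR test is computed over, as a tiny standalone Java sketch. The counts are invented, and I believe the same computation is what Mahout ships in org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio() -- please correct me if I've got that wrong:

  public class LlrSketch {

    public static void main(String[] args) {
      // All counts below are invented, just to exercise the formula:
      long k11 = 120;        // docs under "Electric power production" containing "electricity"
      long k12 = 900;        // docs containing "electricity" but filed under other subjects
      long k21 = 380;        // docs under the subject whose title/summary lacks the word
      long k22 = 3_000_000;  // everything else
      System.out.println("LLR = " + logLikelihoodRatio(k11, k12, k21, k22));
    }

    // G^2 statistic, computed via the x*log(x) "entropy" shortcut
    // (as far as I can tell, this is how Mahout's LogLikelihood does it).
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
      double rowEntropy = entropy(k11 + k12, k21 + k22);
      double colEntropy = entropy(k11 + k21, k12 + k22);
      double matEntropy = entropy(k11, k12, k21, k22);
      // clamp at zero to guard against floating-point round-off
      return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
    }

    static double entropy(long... counts) {
      long total = 0;
      double xLogXSum = 0.0;
      for (long c : counts) {
        total += c;
        xLogXSum += xLogX(c);
      }
      return xLogX(total) - xLogXSum;
    }

    static double xLogX(long x) {
      return x == 0 ? 0.0 : x * Math.log(x);
    }
  }

If I've understood it right, a high score for the pair ("electricity", "Electric power production") just says the word is surprisingly concentrated in that subject's titles/summaries relative to the rest of the corpus, which is exactly the "interesting key word" property I'm after.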
So, backing up to the initial goal of "producing a list of interesting key words for each subject code".

My Mahout experience to date has mostly been running the Taste recommender code on a single machine, although I've dabbled with SVD/Lanczos on a real Hadoop cluster, and I have the "In Action" book. I'm assuming / planning to use Hadoop for this work.

Getting vectors out of the Lucene/Solr index seems straightforward; something like:

  mahout lucene.vector --dir $BASE/solr/data/index/ --output bib/vecs \
    --field label --idField id --dictOut bib/dict.out --norm 2

given a schema.xml tweaked to keep term vectors:

  <field name="label" type="text" indexed="true" stored="true"
         termVectors="true" required="true" multiValued="false" />

...but beyond this, I'm lacking direction. There are a lot of pieces, and I'm not yet seeing quite how to plug them together to address this problem. Can you sketch a story that'll point me to the right pieces to read up on and use? (I've put my own rough guess at the counting step in a pps below.)

cheers,

Dan

> I would add one step of automated relevance feedback by also extracting key
> terms by doing a search for documents using the first set of keywords for a
> particular subject code. Then use the top 20 or so documents in the subject
> code versus the top 20 or so documents not in the subject code. This will
> provide a more focused set of keywords that are likely to perform more
> accurately than the first set. I would keep both sets separately so that
> you can use either one at will.

ps. I look forward to getting to a state where I could make such improvements!
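pps. In case it helps make the question concrete, here is roughly the counting/ranking step I'm imagining once the term vectors are out of Lucene. Everything in it is a stand-in (the method shape, the count maps, the parameter names are all mine), and I'm assuming the LLR lives in org.apache.mahout.math.stats.LogLikelihood -- shout if I've grabbed the wrong end of the stick:

  import java.util.ArrayList;
  import java.util.Comparator;
  import java.util.List;
  import java.util.Map;

  import org.apache.mahout.math.stats.LogLikelihood;  // from mahout-math, as I understand it

  // Sketch: for one subject code, score every title/summary term by LLR against
  // the rest of the corpus, and keep the top few as candidate keywords.
  public class SubjectKeywords {

    // termDocCountInSubject: term -> number of docs under this subject containing it
    // termDocCountOverall:   term -> number of docs in the whole corpus containing it
    // docsInSubject:         number of docs filed under this subject
    // docsTotal:             number of docs in the corpus
    static List<String> topKeywords(Map<String, Long> termDocCountInSubject,
                                    Map<String, Long> termDocCountOverall,
                                    long docsInSubject, long docsTotal, int howMany) {
      List<String> terms = new ArrayList<>(termDocCountInSubject.keySet());
      terms.sort(Comparator.comparingDouble((String term) -> {
        long k11 = termDocCountInSubject.get(term);                    // subject & term
        long k12 = termDocCountOverall.getOrDefault(term, k11) - k11;  // term, other subjects
        long k21 = docsInSubject - k11;                                // subject, no term
        long k22 = docsTotal - docsInSubject - k12;                    // neither
        return LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
      }).reversed());
      return terms.subList(0, Math.min(howMany, terms.size()));
    }
  }

If that's roughly the right shape, the part I'm still missing is how to get those per-subject term counts out of the lucene.vector output (the dictionary plus vector files), and whether there's an existing Mahout job that already does this sort of counting, or whether I should write my own map/reduce pass for it.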
