(switching to the user list, and rejigging for readability) I was asking earlier about LogLikelihood (http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html) and sketched an application; now I've come back to ask for a few more pointers.
Recap: I have a couple of pretty large (3M and 12M record) bibliographic datasets, both of which have fairly cryptic subject codes plus short textual titles applied to a lot of books. We're trying to match these subject codes against other collections that are described only with a few simple web2-style tags, so the hope is to see whether the topics can be augmented with indicative keywords (and maybe later phrases) extracted from document titles. On the technical side, Lucene/Solr is already in use, so ideally I'd find a way to apply Mahout's LogLikelihood to term vectors imported from Lucene indices.

On 29 July 2011 20:58, Ted Dunning <[email protected]> wrote:
> The short answer, however, is that LLR as produced by Mahout should be
> plenty for your problem. It should be pretty easy to produce a list of
> interesting key words for each subject code and these are reasonably likely
> to do a good job of retrieving documents.

Thanks for the sanity check. I glossed over a bit of detail around the subject codes: they're actually mainly Library of Congress Subject Headings (LCSH), or local extensions based on LCSH, so they themselves have both text and some structure.

Example record (excerpts):

  title: The boy who harnessed the wind : creating currents of electricity and hope.
  summary: An enterprising teenager in Malawi builds a windmill from scraps he finds around his village and brings electricity, and a future, to his family.
  subject(s):
    Windmills -- Malawi.
    Inventors -- Malawi.
    Water-supply, Rural -- Malawi.
    Rural electrification -- Malawi.
    Electric power production -- Malawi.
    Kamkwamba, William, 1987-
    Malawi -- Rural conditions.

The summary isn't always there. Sometimes we also have a more code-like subject classification (e.g. Dewey Decimal Classification, or Library of Congress Classification) alongside these thesaurus-like, textually oriented controlled subject terms. One thing I didn't mention is that there are a few thousand of these subject values.

This might not be a great example, but perhaps it'll do: the title and summary both contain "electricity", whereas we only have a subject heading with the base phrase "Electric power production". Now perhaps overwhelmingly many other books filed under "Electric power production -- Malawi." or "Electric power production -- Netherlands" etc. also contain the word "electricity" (and perhaps we can assume the subjects have been canonicalised to throw away that final geographic qualifier / facet?). I'm not sure which pieces from the Mahout toolkit can be plugged together to produce a set of annotations associating "electricity", and other words (or ideally short phrases from the titles), with "Electric power production". (I've put a small sketch of how I understand the LLR counting below, after the aside.)

Aside: these LCSH subject codes can often, these days, be canonicalised to URIs hosted by the Library of Congress; see http://id.loc.gov/authorities/ and nearby. So http://id.loc.gov/authorities/sh85146930 is "Windmills", for example, and that entry in turn links (machine-readably, with SKOS) to http://id.loc.gov/authorities/sh85146874 for "Wind power". However "Windmills -- Malawi" doesn't show up there yet, although its unregistered use does show up in http://authorities.loc.gov/ ... There is also substructure in the system; the geographical facet comes last, so we can derive "Rural electrification" from "Rural electrification -- Malawi", which maps to http://id.loc.gov/authorities/sh85115910 .... My reason to mention this is that bibliographic datasets are increasingly using such URL IDs to indicate subject codes of various kinds.
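Back from that aside: to make the "electricity" vs. "Electric power production" example concrete, here's my understanding of the 2x2 contingency table that the LLR test is computed over, as a tiny standalone Java sketch. The counts are invented, and I believe the same computation is what Mahout ships in org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio() -- please correct me if I've got that wrong:

  public class LlrSketch {

    public static void main(String[] args) {
      // All counts below are invented, just to exercise the formula:
      long k11 = 120;        // docs under "Electric power production" containing "electricity"
      long k12 = 900;        // docs containing "electricity" but filed under other subjects
      long k21 = 380;        // docs under the subject whose title/summary lacks the word
      long k22 = 3_000_000;  // everything else
      System.out.println("LLR = " + logLikelihoodRatio(k11, k12, k21, k22));
    }

    // G^2 statistic, computed via the x*log(x) "entropy" shortcut
    // (as far as I can tell, this is how Mahout's LogLikelihood does it).
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
      double rowEntropy = entropy(k11 + k12, k21 + k22);
      double colEntropy = entropy(k11 + k21, k12 + k22);
      double matEntropy = entropy(k11, k12, k21, k22);
      // clamp at zero to guard against floating-point round-off
      return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
    }

    static double entropy(long... counts) {
      long total = 0;
      double xLogXSum = 0.0;
      for (long c : counts) {
        total += c;
        xLogXSum += xLogX(c);
      }
      return xLogX(total) - xLogXSum;
    }

    static double xLogX(long x) {
      return x == 0 ? 0.0 : x * Math.log(x);
    }
  }

If I've understood it right, a high score for the pair ("electricity", "Electric power production") just says the word is surprisingly concentrated in that subject's titles/summaries relative to the rest of the corpus, which is exactly the "interesting key word" property I'm after.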
So, backing up to the initial goal of "producing a list of interesting key words for each subject code".

My Mahout experience to date has mostly been running the Taste recommender code on a single machine, although I've dabbled with SVD/Lanczos on a real Hadoop cluster, and I have the "In Action" book. I'm assuming / planning to use Hadoop for this work.

Getting vectors out of the Lucene/Solr index seems straightforward; something like:

  mahout lucene.vector --dir $BASE/solr/data/index/ --output bib/vecs \
    --field label --idField id --dictOut bib/dict.out --norm 2

given a schema.xml tweaked to keep term vectors:

  <field name="label" type="text" indexed="true" stored="true"
         termVectors="true" required="true" multiValued="false" />

...but beyond this, I'm lacking direction. There are a lot of pieces, and I'm not yet seeing quite how to plug them together to address this problem. Can you sketch a story that'll point me to the right pieces to read up on and use? (I've put my own rough guess at the counting step in a pps below.)

cheers,

Dan

> I would add one step of automated relevance feedback by also extracting key
> terms by doing a search for documents using the first set of keywords for a
> particular subject code. Then use the top 20 or so documents in the subject
> code versus the top 20 or so documents not in the subject code. This will
> provide a more focused set of keywords that are likely to perform more
> accurately than the first set. I would keep both sets separately so that
> you can use either one at will.

ps. I look forward to getting to a state where I could make such improvements!
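pps. In case it helps make the question concrete, here is roughly the counting/ranking step I'm imagining once the term vectors are out of Lucene. Everything in it is a stand-in (the method shape, the count maps, the parameter names are all mine), and I'm assuming the LLR lives in org.apache.mahout.math.stats.LogLikelihood -- shout if I've grabbed the wrong end of the stick:

  import java.util.ArrayList;
  import java.util.Comparator;
  import java.util.List;
  import java.util.Map;

  import org.apache.mahout.math.stats.LogLikelihood;  // from mahout-math, as I understand it

  // Sketch: for one subject code, score every title/summary term by LLR against
  // the rest of the corpus, and keep the top few as candidate keywords.
  public class SubjectKeywords {

    // termDocCountInSubject: term -> number of docs under this subject containing it
    // termDocCountOverall:   term -> number of docs in the whole corpus containing it
    // docsInSubject:         number of docs filed under this subject
    // docsTotal:             number of docs in the corpus
    static List<String> topKeywords(Map<String, Long> termDocCountInSubject,
                                    Map<String, Long> termDocCountOverall,
                                    long docsInSubject, long docsTotal, int howMany) {
      List<String> terms = new ArrayList<>(termDocCountInSubject.keySet());
      terms.sort(Comparator.comparingDouble((String term) -> {
        long k11 = termDocCountInSubject.get(term);                    // subject & term
        long k12 = termDocCountOverall.getOrDefault(term, k11) - k11;  // term, other subjects
        long k21 = docsInSubject - k11;                                // subject, no term
        long k22 = docsTotal - docsInSubject - k12;                    // neither
        return LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
      }).reversed());
      return terms.subList(0, Math.min(howMany, terms.size()));
    }
  }

If that's roughly the right shape, the part I'm still missing is how to get those per-subject term counts out of the lucene.vector output (the dictionary plus vector files), and whether there's an existing Mahout job that already does this sort of counting, or whether I should write my own map/reduce pass for it.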
