I have a set of some millions of (bibliographic) records, which have a field with what might be called "word-like" values: thousands of semi-controlled words and phrases are in use. I want to create vector representations of these, initially for clustering experiments.
As far as I can see, there are two paths for doing this in Mahout, and I have trouble choosing between them.

Option 1) Use the bin/mahout utility: first run 'mahout seqdirectory' on a directory with one file per record, each file containing only the field values I want to work with. Then run seq2sparse, which takes the (key=docid, value=text from file) pairs from seqdirectory and vectorizes them, presumably using --analyzerName to pass in a custom analyzer that ensures my field values are tokenized appropriately (e.g. maybe I use underscore_to_express_multi_word_values in the text input).

Option 2) Build the whole thing in custom Java code, likely starting from the same seqdirectory raw materials generated with bin/mahout, or even doing the seqdirectory generation from Java (or bypassing it?). SequenceFileTokenizerMapper and DictionaryVectorizer seem relevant. Also https://github.com/tdunning/MiA/blob/master/src/main/java/mia/clustering/ch09/NewsKMeansClustering.java and https://github.com/tdunning/MiA/blob/master/src/main/java/mia/clustering/ch09/ReutersToSparseVectors.java are starting points, I guess.

I'm not sure which option to explore. The first is attractive because - why write code if there is already a utility that more-or-less does what I need? The second is attractive because it offers more control, and because pretending my (semi)structured data is plain text seems rather bogus and introduces scope for lots of bugs around parsing/tokenizing the field values. Besides, I'll need a custom analyzer anyway. Still, my bias is towards re-using "bin/mahout seqdirectory" if I can, because I know the Hadoop aspect 'just works' there, whereas that would be new territory for me if I coded it up myself.

Suggestions very welcome.

Dan

ps. Wrote this while starting on Option 1 with a subset of 100k records, which just finished.
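For what it's worth, the input-preparation step for Option 1 is simple enough to sketch in plain JDK code - one text file per record, named after the record id, containing only the field values to vectorize. Everything here (class name, the sample record id and headings) is invented for illustration; it just shows the directory layout that 'mahout seqdirectory' expects, where the file name becomes the sequence-file key:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of preparing the Option 1 input for
// 'mahout seqdirectory': one plain-text file per record, containing
// only the field values that should be vectorized.
public class SeqdirectoryInputWriter {

    public static void writeRecords(Path dir, Map<String, String> records)
            throws IOException {
        Files.createDirectories(dir);
        for (Map.Entry<String, String> e : records.entrySet()) {
            // seqdirectory uses the file name as the key (e.g. /7977561.txt),
            // so name each file after its record id
            Files.write(dir.resolve(e.getKey() + ".txt"),
                        e.getValue().getBytes(StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> records = new LinkedHashMap<>();
        records.put("7977561", "Development_economics. Macroeconomics.");
        writeRecords(Paths.get("records"), records);
    }
}
```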
To give an indication of the values,

MAHOUT_LOCAL=true mahout seqdumper --seqFile seq2/chunk-1

shows k/v pairs like:

Key: /9999067.txt: Value: History_Sources_Study_and_teaching. History_Research. History_Research_Methodology. History_Study_and_teaching_(Higher)_United_States. Research.
Key: /999381.txt: Value: African_Americans_Education. School_integration_United_States. Educational_equalization_United_States. Academic_achievement.
Key: /7977028.txt: Value: Homosexuality_and_education_United_States. Education,_Elementary_Social_aspects_United_States. Education,_Elementary_United_States_Curricula. Sex_instruction_United_States.
Key: /7977561.txt: Value: Development_economics. Macroeconomics.
Key: /7977659.txt: Value: Haitian_Americans_Ethnic_identity. Haitian_Americans_Social_conditions. Immigrants_United_States_Social_conditions. Transnationalism. Haitians_Interviews.

In the source text, I've used spaces to separate the different values; each is a controlled word or phrase based on Library of Congress subject headings, and the intent is to treat them as complex atoms, i.e. the next step in "Option 1" is to find an analyzer that will preserve 'Immigrants_United_States_Social_conditions' as a single feature/dimension in my dictionary.
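The tokenization needed here is essentially whitespace splitting plus stripping the trailing period that ends each heading. A stdlib sketch of that intended behavior follows - not the actual Lucene analyzer (a real implementation would presumably wrap something like Lucene's whitespace tokenizer in an Analyzer passed via --analyzerName); class and method names are invented, and it only demonstrates that underscore-joined headings survive as single atoms:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the tokenization the custom analyzer must perform:
// split on whitespace only, and strip the '.' terminating each
// heading, so each underscore-joined phrase stays one token.
public class HeadingTokenizerSketch {

    public static List<String> tokenize(String fieldText) {
        List<String> tokens = new ArrayList<>();
        for (String raw : fieldText.trim().split("\\s+")) {
            // drop the trailing period that separates headings in the source
            String token = raw.endsWith(".")
                    ? raw.substring(0, raw.length() - 1)
                    : raw;
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // a heading like Immigrants_United_States_Social_conditions
        // must come out as a single feature/dimension
        System.out.println(tokenize(
                "Immigrants_United_States_Social_conditions. Transnationalism."));
    }
}
```

Any standard analyzer that lowercases, splits on punctuation, or applies stopword removal would shred these headings, which is why the default seq2sparse analyzer won't do.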
