I think you are on the right track.

A third option is custom code that does feature hashing.  That trades
interpretability for faster processing, because you don't need to make
a pass through the data to build a dictionary.
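If you go that route, Mahout's encoder classes do most of the work.  A
minimal sketch, assuming the 0.5-era
org.apache.mahout.vectorizer.encoders package and one space-separated
subject string per record:

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class HashedSubjects {
      // Hash each heading straight into a fixed-size vector.  No
      // dictionary pass is needed, but you cannot map an index back
      // to the heading that produced it.
      public static Vector encode(String fieldText) {
        StaticWordValueEncoder enc = new StaticWordValueEncoder("subject");
        Vector v = new RandomAccessSparseVector(10000);
        for (String heading : fieldText.split("\\s+")) {
          enc.addToVector(heading, v);
        }
        return v;
      }
    }

You pick the vector size up front; collisions just fold rare features
together, which clustering usually tolerates.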

I often find it very helpful to build a search index on data like this
as well.  After you do the clustering, you can update the docs with
cluster membership and use the cluster id as an interesting search term.
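Something like this, assuming a Lucene 3.x index of the records and a
hypothetical "cluster" field name:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class ClusterTagger {
      // Attach the cluster id to a record's document so that a query
      // like cluster:42 pulls back everything in that cluster.
      public static void addClusterField(Document doc, int clusterId) {
        doc.add(new Field("cluster", String.valueOf(clusterId),
                          Field.Store.YES, Field.Index.NOT_ANALYZED));
      }
    }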

On Wed, Oct 5, 2011 at 12:28 AM, Dan Brickley <[email protected]> wrote:

> I have a set of some millions of (bibliographic) records, which have a
> field with what might be called "word-like" values: there are
> thousands of semi-controlled words and phrases used. I want to create
> vector representations of these, initially for clustering experiments.
>
> As far as I can see, there are two paths for doing this in Mahout, and
> I have trouble choosing between them.
>
> Option 1.)
> I could use the bin/mahout utility, first by using 'mahout
> seqdirectory' on a directory with one file per record, each file
> containing only the field values I want to work with. Then I would use
> seq2sparse, which takes the (key=docid, value=text from file) pairs
> from my seqdirectory and vectorizes them, presumably using --analyzerName
> to pass in a custom analyzer that ensures my field values are
> tokenized appropriately (e.g. maybe I use
> underscore_to_express_multi_word_values in the text input).
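A whitespace tokenizer nearly suffices for that, since the values are
space-separated and multi-word headings are underscore-joined.  A
minimal sketch of such an analyzer, assuming the Lucene 3.x API (the
class name is hypothetical; PatternReplaceFilter comes from the
analyzers contrib jar).  It also strips the trailing period each
heading carries in the dump below:

    import java.io.Reader;
    import java.util.regex.Pattern;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.pattern.PatternReplaceFilter;
    import org.apache.lucene.util.Version;

    // Splits on whitespace only, so a heading like
    // Immigrants_United_States_Social_conditions stays one token.
    public class SubjectAnalyzer extends Analyzer {
      private static final Pattern TRAILING_DOT = Pattern.compile("\\.$");

      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_31, reader);
        return new PatternReplaceFilter(ts, TRAILING_DOT, "", false);
      }
    }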
>
> Option 2.)
> Instead I could build the entire thing from custom Java code, likely
> starting from the same seqdirectory raw materials generated with
> bin/mahout; or even doing the seqdirectory generation from Java (or
> bypassing it?). SequenceFileTokenizerMapper and DictionaryVectorizer
> seem relevant.  Also
>
> https://github.com/tdunning/MiA/blob/master/src/main/java/mia/clustering/ch09/NewsKMeansClustering.java
> and
>
> https://github.com/tdunning/MiA/blob/master/src/main/java/mia/clustering/ch09/ReutersToSparseVectors.java
> are starting points, I guess.
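For the custom-code path, the first half is just tokenizing the
seqdirectory output with your own analyzer.  A rough sketch against
the Mahout 0.5-era API (paths hypothetical, SubjectAnalyzer as
sketched above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.vectorizer.DocumentProcessor;

    public class TokenizeRecords {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // (docid, text) pairs in, (docid, token stream) pairs out
        DocumentProcessor.tokenizeDocuments(new Path("records-seq"),
            SubjectAnalyzer.class, new Path("records-tokenized"), conf);
        // DictionaryVectorizer.createTermFrequencyVectors(...) then
        // turns the tokenized docs into sparse vectors; see the
        // ReutersToSparseVectors example above for a full invocation.
      }
    }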
>
>
> I'm not sure which option to explore. The first is attractive: why
> write code if there is already a utility that more-or-less does
> what I need? The second is attractive because it offers more control,
> and because pretending my (semi)structured data is plain text seems
> rather bogus, and introduces scope for lots of bugs around parsing /
> tokenizing the field values. Besides, I'll need a custom analyzer
> anyway. However, my bias is towards re-using "bin/mahout seqdirectory"
> if I can, because I know the Hadoop aspect 'just works' there, whereas
> it would be new territory for me if I coded this up myself.
>
> Suggestions very welcome.
>
> Dan
>
>
> ps. I wrote this while starting on Option 1 with a subset of 100k
> records, which just finished. To give an indication of the values:
>
> MAHOUT_LOCAL=true mahout seqdumper --seqFile seq2/chunk-1
> ...shows key/value pairs like:
>
> Key: /9999067.txt: Value: History_Sources_Study_and_teaching.
> History_Research. History_Research_Methodology.
> History_Study_and_teaching_(Higher)_United_States. Research.
> Key: /999381.txt: Value: African_Americans_Education.
> School_integration_United_States.
> Educational_equalization_United_States. Academic_achievement.
> Key: /7977028.txt: Value: Homosexuality_and_education_United_States.
> Education,_Elementary_Social_aspects_United_States.
> Education,_Elementary_United_States_Curricula.
> Sex_instruction_United_States.
> Key: /7977561.txt: Value: Development_economics. Macroeconomics.
> Key: /7977659.txt: Value: Haitian_Americans_Ethnic_identity.
> Haitian_Americans_Social_conditions.
> Immigrants_United_States_Social_conditions. Transnationalism.
> Haitians_Interviews.
>
> ... in the source text, I've used spaces to separate the different
> values; each is a controlled word or phrase based on Library of
> Congress subject headings, and the intent is to treat them as complex
> atoms, i.e. the next step in "Option 1" is to find an analyzer that will
> preserve 'Immigrants_United_States_Social_conditions' as a single
> feature/dimension in my dictionary.
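A quick way to check that an analyzer keeps the headings intact,
assuming the Lucene 3.x token stream API and the hypothetical
SubjectAnalyzer sketched earlier:

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class AnalyzerCheck {
      public static void main(String[] args) throws Exception {
        TokenStream ts = new SubjectAnalyzer().tokenStream("subjects",
            new StringReader(
                "Immigrants_United_States_Social_conditions. Research."));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          System.out.println(term.toString());
        }
        // expected output:
        //   Immigrants_United_States_Social_conditions
        //   Research
      }
    }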
>
