I have a set of some millions of (bibliographic) records, which have a field with what might be called "word-like" values: thousands of semi-controlled words and phrases are in use. I want to create vector representations of these, initially for clustering experiments.
As far as I can see, there are two paths for doing this in Mahout, and I have trouble choosing between them.

Option 1) Use the bin/mahout utility: first run 'mahout seqdirectory' on a directory with one file per record, each file containing only the field values I want to work with. Then run seq2sparse, which takes the (key=docid, value=text from file) pairs from seqdirectory and vectorizes them, presumably using --analyzerName to pass in a custom analyzer that ensures my field values are tokenized appropriately (e.g. maybe I use underscore_to_express_multi_word_values in the text input).

Option 2) Build the whole thing in custom Java code, likely starting from the same seqdirectory raw materials generated with bin/mahout, or even doing the seqdirectory generation from Java (or bypassing it?). SequenceFileTokenizerMapper and DictionaryVectorizer seem relevant. Also https://github.com/tdunning/MiA/blob/master/src/main/java/mia/clustering/ch09/NewsKMeansClustering.java and https://github.com/tdunning/MiA/blob/master/src/main/java/mia/clustering/ch09/ReutersToSparseVectors.java are starting points, I guess.

I'm not sure which option to explore. The first is attractive because - why write code if there is already a utility that more-or-less does what I need? The second is attractive because it offers more control, and because pretending my (semi)structured data is plain text seems rather bogus and introduces scope for lots of bugs around parsing/tokenizing the field values. Besides, I'll need a custom analyzer anyway. Still, my bias is towards re-using "bin/mahout seqdirectory" if I can, because I know the Hadoop aspect 'just works' there, whereas that would be new territory for me if I coded it up myself.

Suggestions very welcome.

Dan

ps. Wrote this while starting on Option 1 with a subset of 100k records, which just finished.
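For what it's worth, the input-preparation step for Option 1 is simple enough to sketch in plain JDK code - one text file per record, named after the record id, containing only the field values to vectorize. Everything here (class name, the sample record id and headings) is invented for illustration; it just shows the directory layout that 'mahout seqdirectory' expects, where the file name becomes the sequence-file key:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of preparing the Option 1 input for
// 'mahout seqdirectory': one plain-text file per record, containing
// only the field values that should be vectorized.
public class SeqdirectoryInputWriter {

    public static void writeRecords(Path dir, Map<String, String> records)
            throws IOException {
        Files.createDirectories(dir);
        for (Map.Entry<String, String> e : records.entrySet()) {
            // seqdirectory uses the file name as the key (e.g. /7977561.txt),
            // so name each file after its record id
            Files.write(dir.resolve(e.getKey() + ".txt"),
                        e.getValue().getBytes(StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> records = new LinkedHashMap<>();
        records.put("7977561", "Development_economics. Macroeconomics.");
        writeRecords(Paths.get("records"), records);
    }
}
```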
To give an indication of the values,

MAHOUT_LOCAL=true mahout seqdumper --seqFile seq2/chunk-1

shows k/v pairs like:

Key: /9999067.txt: Value: History_Sources_Study_and_teaching. History_Research. History_Research_Methodology. History_Study_and_teaching_(Higher)_United_States. Research.
Key: /999381.txt: Value: African_Americans_Education. School_integration_United_States. Educational_equalization_United_States. Academic_achievement.
Key: /7977028.txt: Value: Homosexuality_and_education_United_States. Education,_Elementary_Social_aspects_United_States. Education,_Elementary_United_States_Curricula. Sex_instruction_United_States.
Key: /7977561.txt: Value: Development_economics. Macroeconomics.
Key: /7977659.txt: Value: Haitian_Americans_Ethnic_identity. Haitian_Americans_Social_conditions. Immigrants_United_States_Social_conditions. Transnationalism. Haitians_Interviews.

In the source text, I've used spaces to separate the different values; each is a controlled word or phrase based on Library of Congress subject headings, and the intent is to treat them as complex atoms, i.e. the next step in "Option 1" is to find an analyzer that will preserve 'Immigrants_United_States_Social_conditions' as a single feature/dimension in my dictionary.
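The tokenization needed here is essentially whitespace splitting plus stripping the trailing period that ends each heading. A stdlib sketch of that intended behavior follows - not the actual Lucene analyzer (a real implementation would presumably wrap something like Lucene's whitespace tokenizer in an Analyzer passed via --analyzerName); class and method names are invented, and it only demonstrates that underscore-joined headings survive as single atoms:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the tokenization the custom analyzer must perform:
// split on whitespace only, and strip the '.' terminating each
// heading, so each underscore-joined phrase stays one token.
public class HeadingTokenizerSketch {

    public static List<String> tokenize(String fieldText) {
        List<String> tokens = new ArrayList<>();
        for (String raw : fieldText.trim().split("\\s+")) {
            // drop the trailing period that separates headings in the source
            String token = raw.endsWith(".")
                    ? raw.substring(0, raw.length() - 1)
                    : raw;
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // a heading like Immigrants_United_States_Social_conditions
        // must come out as a single feature/dimension
        System.out.println(tokenize(
                "Immigrants_United_States_Social_conditions. Transnationalism."));
    }
}
```

Any standard analyzer that lowercases, splits on punctuation, or applies stopword removal would shred these headings, which is why the default seq2sparse analyzer won't do.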
