Hello! When I run ...

  mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
    -i /user/hadoop/url_tokenised_text.test.seqdir.sparse/tokenized-documents \
    -o /user/hadoop/url_tokenised_text.test.llr__without_preprocess \
    -a org.apache.mahout.vectorizer.DefaultAnalyzer \
    --maxNGramSize 3 --minSupport 100 --maxRed 400

...things work for me, but I notice that in the first
CollocDriver.generateCollocations pass (the generation of subgrams)
_everything_ is going to the one reducer.

The end result of the run is:

drwxr-xr-x   - hadoop supergroup          0 2011-12-23 05:58 /user/hadoop/url_tokenised_text.test.llr__without_preprocess/ngrams
-rw-r--r--   3 hadoop supergroup          0 2011-12-23 05:58 /user/hadoop/url_tokenised_text.test.llr__without_preprocess/ngrams/_SUCCESS
-rw-r--r--   3 hadoop supergroup    9509097 2011-12-23 05:56 /user/hadoop/url_tokenised_text.test.llr__without_preprocess/ngrams/part-r-00000
-rw-r--r--   3 hadoop supergroup    9523286 2011-12-23 05:56 /user/hadoop/url_tokenised_text.test.llr__without_preprocess/ngrams/part-r-00001
-rw-r--r--   3 hadoop supergroup    9517959 2011-12-23 05:56 /user/hadoop/url_tokenised_text.test.llr__without_preprocess/ngrams/part-r-00002
-- SNIP --
-rw-r--r--   3 hadoop supergroup    9557408 2011-12-23 05:57 /user/hadoop/url_tokenised_text.test.llr__without_preprocess/ngrams/part-r-00397
-rw-r--r--   3 hadoop supergroup    9530757 2011-12-23 05:57 /user/hadoop/url_tokenised_text.test.llr__without_preprocess/ngrams/part-r-00398
-rw-r--r--   3 hadoop supergroup    9502667 2011-12-23 05:58 /user/hadoop/url_tokenised_text.test.llr__without_preprocess/ngrams/part-r-00399
drwxr-xr-x   - hadoop supergroup          0 2011-12-23 05:55 /user/hadoop/url_tokenised_text.test.llr__without_preprocess/subgrams
-rw-r--r--   3 hadoop supergroup          0 2011-12-23 05:55 /user/hadoop/url_tokenised_text.test.llr__without_preprocess/subgrams/_SUCCESS
-rw-r--r--   3 hadoop supergroup        128 2011-12-23 04:40 /user/hadoop/url_tokenised_text.test.llr__without_preprocess/subgrams/part-r-00000
-rw-r--r--   3 hadoop supergroup 9998117111 2011-12-23 04:45 /user/hadoop/url_tokenised_text.test.llr__without_preprocess/subgrams/part-r-00001
-rw-r--r--   3 hadoop supergroup        128 2011-12-23 04:40 /user/hadoop/url_tokenised_text.test.llr__without_preprocess/subgrams/part-r-00002
-rw-r--r--   3 hadoop supergroup        128 2011-12-23 04:40 /user/hadoop/url_tokenised_text.test.llr__without_preprocess/subgrams/part-r-00003
-- SNIP --
-rw-r--r--   3 hadoop supergroup        128 2011-12-23 04:40 /user/hadoop/url_tokenised_text.test.llr__without_preprocess/subgrams/part-r-00397
-rw-r--r--   3 hadoop supergroup        128 2011-12-23 04:40 /user/hadoop/url_tokenised_text.test.llr__without_preprocess/subgrams/part-r-00398
-rw-r--r--   3 hadoop supergroup        128 2011-12-23 04:40 /user/hadoop/url_tokenised_text.test.llr__without_preprocess/subgrams/part-r-00399

Before I start to poke around, does anyone agree this looks wrong?

I'm running a 0.6-SNAPSHOT I cloned today from github. I was considering
trying 0.5, but a quick look at recent changes doesn't seem to suggest this
code has changed in a while...

Cheers,
Mat
