Hello!

When I run ...

mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
-i /user/hadoop/url_tokenised_text.test.seqdir.sparse/tokenized-documents \
-o /user/hadoop/url_tokenised_text.test.llr__without_preprocess \
-a org.apache.mahout.vectorizer.DefaultAnalyzer \
--maxNGramSize 3 --minSupport 100 --maxRed 400

...things work for me, but I notice that in the first
CollocDriver.generateCollocations pass (the generation of subgrams)
_everything_ is going to a single reducer.

The end result of the run is:

drwxr-xr-x   - hadoop supergroup          0 2011-12-23 05:58
/user/hadoop/url_tokenised_text.test.llr__without_preprocess/ngrams
-rw-r--r--   3 hadoop supergroup          0 2011-12-23 05:58
/user/hadoop/url_tokenised_text.test.llr__without_preprocess/ngrams/_SUCCESS
-rw-r--r--   3 hadoop supergroup    9509097 2011-12-23 05:56
/user/hadoop/url_tokenised_text.test.llr__without_preprocess/ngrams/part-r-00000
-rw-r--r--   3 hadoop supergroup    9523286 2011-12-23 05:56
/user/hadoop/url_tokenised_text.test.llr__without_preprocess/ngrams/part-r-00001
-rw-r--r--   3 hadoop supergroup    9517959 2011-12-23 05:56
/user/hadoop/url_tokenised_text.test.llr__without_preprocess/ngrams/part-r-00002
-- SNIP --
-rw-r--r--   3 hadoop supergroup    9557408 2011-12-23 05:57
/user/hadoop/url_tokenised_text.test.llr__without_preprocess/ngrams/part-r-00397
-rw-r--r--   3 hadoop supergroup    9530757 2011-12-23 05:57
/user/hadoop/url_tokenised_text.test.llr__without_preprocess/ngrams/part-r-00398
-rw-r--r--   3 hadoop supergroup    9502667 2011-12-23 05:58
/user/hadoop/url_tokenised_text.test.llr__without_preprocess/ngrams/part-r-00399
drwxr-xr-x   - hadoop supergroup          0 2011-12-23 05:55
/user/hadoop/url_tokenised_text.test.llr__without_preprocess/subgrams
-rw-r--r--   3 hadoop supergroup          0 2011-12-23 05:55
/user/hadoop/url_tokenised_text.test.llr__without_preprocess/subgrams/_SUCCESS
-rw-r--r--   3 hadoop supergroup        128 2011-12-23 04:40
/user/hadoop/url_tokenised_text.test.llr__without_preprocess/subgrams/part-r-00000
-rw-r--r--   3 hadoop supergroup 9998117111 2011-12-23 04:45
/user/hadoop/url_tokenised_text.test.llr__without_preprocess/subgrams/part-r-00001
-rw-r--r--   3 hadoop supergroup        128 2011-12-23 04:40
/user/hadoop/url_tokenised_text.test.llr__without_preprocess/subgrams/part-r-00002
-rw-r--r--   3 hadoop supergroup        128 2011-12-23 04:40
/user/hadoop/url_tokenised_text.test.llr__without_preprocess/subgrams/part-r-00003
-- SNIP --
-rw-r--r--   3 hadoop supergroup        128 2011-12-23 04:40
/user/hadoop/url_tokenised_text.test.llr__without_preprocess/subgrams/part-r-00397
-rw-r--r--   3 hadoop supergroup        128 2011-12-23 04:40
/user/hadoop/url_tokenised_text.test.llr__without_preprocess/subgrams/part-r-00398
-rw-r--r--   3 hadoop supergroup        128 2011-12-23 04:40
/user/hadoop/url_tokenised_text.test.llr__without_preprocess/subgrams/part-r-00399
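
To put a rough number on the skew, here is a minimal Python sketch that computes what fraction of the listed bytes the largest part file holds. The sizes are copied from the listing above; the snipped entries are left out rather than guessed, and the names subgram_sizes, ngram_sizes and largest_share are just illustrative.

```python
# Byte sizes copied from the listing in this post. Only the entries actually
# shown are included; the "-- SNIP --" entries are omitted, not guessed.
subgram_sizes = {
    "part-r-00000": 128,
    "part-r-00001": 9998117111,
    "part-r-00002": 128,
    "part-r-00003": 128,
    "part-r-00397": 128,
    "part-r-00398": 128,
    "part-r-00399": 128,
}

ngram_sizes = {
    "part-r-00000": 9509097,
    "part-r-00001": 9523286,
    "part-r-00002": 9517959,
    "part-r-00397": 9557408,
    "part-r-00398": 9530757,
    "part-r-00399": 9502667,
}


def largest_share(sizes):
    """Return (file name, fraction of total bytes) for the largest part file."""
    total = sum(sizes.values())
    name = max(sizes, key=sizes.get)
    return name, sizes[name] / total


for label, sizes in (("subgrams", subgram_sizes), ("ngrams", ngram_sizes)):
    name, share = largest_share(sizes)
    print(f"{label}: {name} holds {share:.4%} of the listed bytes")
```

For evenly spread output each of the listed files should hold roughly an equal share, which is what the ngrams directory shows; in subgrams, nearly all of the listed bytes sit in part-r-00001.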

Before I start to poke around, does anyone agree this looks wrong?

I'm running a 0.6-SNAPSHOT build that I cloned today from GitHub. I was
considering trying 0.5, but a quick look at recent changes doesn't suggest
this code has changed in a while...

Cheers,
Mat
