Hi,
I'm using Mahout to vectorize and cluster data consisting of short
texts. So far I have done vectorizing on a single multi-core machine
and been quite happy with the results. However, now we are doing a
lot of small adjustments to increase the quality of the results and thus
would like to tighten the feedback loop, i.e. get vectors more quickly.
Does anyone have a good reference setup for an Amazon EMR configuration for
such a task? I tried with 6 m1.small instances, but terminated the job
after 24 hrs, because I thought there was something wrong with the setup. I
pretty much followed the guides in the Mahout wiki for the basic setup.
In the test case, my seq file size was 50MB and previous seq2sparse runs
have produced around 400k vectors from that data.
The rest of the configuration was as follows:
- mahout v0.7
- 6 instances, instance type default (m1.small)
- numReducers 6
- maxNGramSize 2
Does this sound right (24 hrs and more to come...) for the given data
size? How much improvement should I expect if I use m1.large instances
instead? Any other recommendations? :-)
br, Matti