Hi Ted / Konstantin,

Thanks for the feedback. You are correct that some reducers are doing nothing, but many are doing real work, albeit for a very short period of time. I'll run this with 1 reducer and post back my results.
Cheers,
Tim

On Tue, Sep 6, 2011 at 3:33 PM, Konstantin Shmakov <[email protected]> wrote:
> K-means has mappers, combiners and reducers, and my experience with
> k-means is that the mappers and combiners are responsible for most of the
> performance. In fact, most k-means jobs will use 1 reducer by default.
>
> Did you verify that the multiple reducers are actually doing something?
>
> Mappers write each point with its cluster assignment, combiners read these
> points and produce intermediate centroids, and the reducer does a minimal
> amount of work producing the final centroids from the intermediate ones. One
> possibility is that by specifying #reducers you partially disable the
> combiner optimization that runs on the mapper nodes - this could result in
> more shuffling and data sent between nodes.
>
> -- Konstantin
>
> On Tue, Sep 6, 2011 at 9:40 AM, Ted Dunning <[email protected]> wrote:
> > It could also have to do with context switch rate, memory set size or
> > memory bandwidth contention. Having too many threads of certain kinds
> > can cause contention on all kinds of resources.
> >
> > Without detailed internal diagnostics, these can be very hard to tease
> > out. Looking at load average is a good first step. If you can get to
> > some memory diagnostics about cache miss rates, you might be able to
> > get further.
> >
> > On Tue, Sep 6, 2011 at 10:50 AM, Timothy Potter <[email protected]> wrote:
> >> Hi,
> >>
> >> I'm running a distributed k-means clustering job in a 16 node EC2 cluster
> >> (xlarge instances). I've experimented with 3 reducers per node
> >> (mapred.reduce.tasks=48) and 2 reducers per node (mapred.reduce.tasks=32).
> >> In one of my runs (k=120) on 6m vectors with roughly 20k dimensions, I've
> >> seen a 25% improvement in job performance using 2 reducers per node instead
> >> of 3 (~45 mins to do 10 iterations with 32 reducers vs. ~1 hour with 48
> >> reducers). The input data and initial clusters are the same in both cases.
> >>
> >> My sense was that maybe I was over-utilizing resources with 3 reducers per
> >> node, but in fact the load average remains healthy (< 4 on xlarge instances
> >> with 4 virtual cores) and the cluster does not swap or do anything obvious
> >> like that. So the only other variable I can think of is that the number and
> >> size of output files from one iteration being sent as input to the next
> >> iteration may have something to do with the performance difference. As
> >> further evidence for this hunch, running the job with vectors of 11k
> >> dimensions, the improvement was only about 10% -- so the performance gain
> >> is larger with more data.
> >>
> >> Does anyone else have any insights into what might be leading to this
> >> result?
> >>
> >> Cheers,
> >> Tim
>
> --
> ksh:
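
[Editor's note] To make Konstantin's point about combiners concrete, below is a minimal sketch of a combiner that folds the points (or partial sums) emitted on a map node into a single (count, vector-sum) record per cluster id, which is why an effective combiner sharply reduces the data shuffled to the reducer. This is not Mahout's actual implementation; the class name and the tab/comma text encoding of the values are hypothetical and chosen only to keep the example self-contained.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Simplified k-means combiner sketch (hypothetical encoding, not Mahout's classes).
// Each value encodes "count<TAB>v0,v1,...,vD": a single point (count=1) or a
// previously combined partial sum. The combiner merges everything it sees for a
// cluster into one partial record, so only one record per cluster per map task
// needs to be shuffled to the reducer.
public class CentroidCombiner extends Reducer<IntWritable, Text, IntWritable, Text> {

  @Override
  protected void reduce(IntWritable clusterId, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    long count = 0;
    double[] sum = null;
    for (Text v : values) {
      String[] parts = v.toString().split("\t");
      long c = Long.parseLong(parts[0]);
      String[] coords = parts[1].split(",");
      if (sum == null) {
        sum = new double[coords.length];
      }
      for (int i = 0; i < coords.length; i++) {
        sum[i] += Double.parseDouble(coords[i]);
      }
      count += c;
    }
    if (sum == null) {
      return; // nothing to emit for this key
    }
    // Emit a single partial (count, vector sum) per cluster for this map task.
    StringBuilder sb = new StringBuilder(Long.toString(count)).append('\t');
    for (int i = 0; i < sum.length; i++) {
      if (i > 0) {
        sb.append(',');
      }
      sb.append(sum[i]);
    }
    ctx.write(clusterId, new Text(sb.toString()));
  }
}

Such a class would be registered with job.setCombinerClass(CentroidCombiner.class). Since Hadoop may run a combiner zero, one, or several times, the input and output value formats are kept identical so partials can be merged repeatedly; the reducer then only has to merge a handful of partials per cluster and divide the sum by the count, regardless of how many reduce tasks are configured.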
