I'm re-running it right now on a 4-node cluster of EC2 xlarge instances with 3 reducers per node and 4GB max heap per child ... none are swapping and all have a load average around 3 ... will post results once I have them.
Intuitively, your comment about all points being assigned to one cluster makes sense, because we get through the map tasks and all the reducers except one very quickly ... and then it bogs down. Thanks!

On Thu, Feb 24, 2011 at 4:23 PM, Ted Dunning <[email protected]> wrote:

> We should probably have an option to down-sample large clusters to make
> the PDF computation faster.
>
> On Thu, Feb 24, 2011 at 3:09 PM, Jeff Eastman <[email protected]> wrote:
>
> > Again, if most of your points are being assigned to a single cluster,
> > that reducer will be bogged down observing them all.
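For anyone following along, here's a rough sketch of the kind of down-sampling Ted is suggesting (purely hypothetical code on my part, not anything that exists in Mahout): cap how many points the reducer actually observes per cluster by keeping a uniform reservoir sample, then run the expensive PDF computation over the sample instead of the full cluster.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: bound the number of points a reducer observes for a
// single (possibly huge) cluster via reservoir sampling (Algorithm R), so
// the PDF computation runs over at most maxPoints points.
public final class ClusterDownsampler {

  private ClusterDownsampler() {}

  // Returns a uniform random sample of at most maxPoints from the stream.
  public static <T> List<T> reservoirSample(Iterable<T> points, int maxPoints, Random rng) {
    List<T> reservoir = new ArrayList<>(maxPoints);
    int seen = 0;
    for (T point : points) {
      seen++;
      if (reservoir.size() < maxPoints) {
        reservoir.add(point);
      } else {
        // Keep each new point with probability maxPoints / seen, evicting a
        // random current member; this yields a uniform sample of the stream.
        int j = rng.nextInt(seen);
        if (j < maxPoints) {
          reservoir.set(j, point);
        }
      }
    }
    return reservoir;
  }
}

The nice property is that it's single-pass, so it would fit the reducer's Iterable of points without buffering the whole cluster in the 4GB child heap.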
