Thanks a lot Ted. I think there's some preprocessing I can do to remove some outliers, which may reduce my matrix size considerably. I'll also check out some SVD techniques.

On 9 Mar 2013 17:16, "Ted Dunning" <[email protected]> wrote:
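For the SVD idea mentioned above, a minimal sketch of truncated-SVD dimensionality reduction (plain NumPy on a small toy matrix; Mahout's own SVD jobs run on distributed sparse matrices, and the sizes and rank here are invented for illustration):

```python
import numpy as np

# Toy stand-in for the sparse item-feature matrix (rows = items).
rng = np.random.default_rng(42)
A = rng.random((20, 30))
A[A < 0.9] = 0.0  # zero out most entries, mimicking a very sparse matrix

# Truncated SVD: keep the top-k singular vectors and project each item
# into a k-dimensional dense space before clustering.
k = 5
U, s, Vt = np.linalg.svd(A, full_matrices=False)
reduced = U[:, :k] * s[:k]  # 20 items x k dense features

print(reduced.shape)  # (20, 5)
```

Clustering the `reduced` rows instead of the raw 30M-wide sparse vectors trades some reconstruction error for far cheaper distance computations.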
> The new streaming k-means should be able to handle that data pretty
> efficiently. My guess is that on a single 16 core machine it should be
> able to complete the clustering in 10 minutes or so. That is extrapolation
> and thus could be wildly off, of course.
>
> You definitely mean sparse. 30 M / 20 M = 1.5 non-zero features per row.
> That may be a problem. Or it might make the clustering fairly trivial.
>
> Dan,
>
> That code isn't checked into trunk yet, I think. Can you comment on
> where working code can be found on github?
>
> On Sat, Mar 9, 2013 at 6:36 AM, Colum Foley <[email protected]> wrote:
>
> > I have approximately 20 million items and a feature vector of approx 30
> > million in length, very sparse.
> >
> > Would you have any suggestions for other clustering algorithms I should
> > look at?
> >
> > Thanks,
> > Colum
> >
> > On 8 Mar 2013, at 22:51, Ted Dunning <[email protected]> wrote:
> >
> > > You are beginning to exit the realm of reasonable applicability for
> > > normal k-means algorithms here.
> > >
> > > How much data do you have?
> > >
> > > On Fri, Mar 8, 2013 at 9:46 AM, Colum Foley <[email protected]> wrote:
> > >
> > >> Hi All,
> > >>
> > >> When I run KMeans clustering on a cluster, I notice that when I have
> > >> "large" values for k (i.e. approx >1000) I get loads of Hadoop write
> > >> errors:
> > >>
> > >> INFO hdfs.DFSClient: Exception in createBlockOutputStream
> > >> java.net.SocketTimeoutException: 69000 millis timeout while waiting
> > >> for channel to be ready for read. ch : java.nio.channels.SocketChannel
> > >>
> > >> This continues indefinitely, and lots of part-0xxxxx files of around
> > >> 30 kB each are produced.
> > >>
> > >> If I reduce the value of k it runs fine. Furthermore, if I run it in
> > >> local mode with high values of k it runs fine.
> > >>
> > >> The command I am using is as follows:
> > >>
> > >> mahout kmeans -i FeatureVectorsMahoutFormat -o ClusterResults \
> > >>   --clusters tmp \
> > >>   -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
> > >>   -cd 1.0 -x 20 -cl -k 10000
> > >>
> > >> I am running Mahout 0.7.
> > >>
> > >> Are there some performance parameters I need to tune for Mahout when
> > >> dealing with large volumes of data?
> > >>
> > >> Thanks,
> > >> Colum
> > >>
> > >
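As a rough illustration of the one-pass idea behind the streaming k-means Ted mentions (this is a simplified 2-D sketch, not Mahout's implementation; a real streaming clusterer also grows its distance threshold and collapses nearby sketch centroids):

```python
import math
import random

def streaming_sketch(points, threshold):
    """One pass over the data: assign each point to the nearest existing
    centroid if it lies within `threshold`, otherwise open a new centroid.
    Centroids are kept as running sums so updates are O(1)."""
    centroids = []  # list of (sum_x, sum_y, count)
    for x, y in points:
        best, best_d = None, float("inf")
        for i, (sx, sy, n) in enumerate(centroids):
            d = math.hypot(x - sx / n, y - sy / n)
            if d < best_d:
                best, best_d = i, d
        if best is not None and best_d <= threshold:
            sx, sy, n = centroids[best]
            centroids[best] = (sx + x, sy + y, n + 1)
        else:
            centroids.append((x, y, 1))
    return [(sx / n, sy / n) for sx, sy, n in centroids]

random.seed(0)
# Two tight clusters around (0, 0) and (10, 10).
pts = [(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(50)]
pts += [(random.gauss(10, 0.1), random.gauss(10, 0.1)) for _ in range(50)]
print(len(streaming_sketch(pts, threshold=2.0)))  # 2 sketch centroids
```

The appeal for data like Colum's is that each point is touched once, so the cost scales with the number of non-zero entries rather than with k times the number of iterations.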
