Thanks a lot Ted. I think there's some preprocessing I can do to remove
some outliers, which may reduce my matrix size considerably. I'll also
check out some SVD techniques.
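For anyone following along, a minimal sketch of what "SVD techniques" for dimensionality reduction before clustering might look like (this is an illustration, not code from the thread; it uses a tiny dense toy matrix via NumPy, whereas real data this sparse and this large would need a sparse, truncated solver):

```python
import numpy as np

# Toy stand-in for the thread's data: very sparse rows (the thread
# estimates ~1.5 non-zeros per row; we plant 2 per row here).
rng = np.random.default_rng(0)
n_rows, n_cols = 200, 500
X = np.zeros((n_rows, n_cols))
for i in range(n_rows):
    cols = rng.choice(n_cols, size=2, replace=False)
    X[i, cols] = 1.0

# Full SVD on the toy matrix; keep only the top-k singular directions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 20  # target dimensionality, chosen arbitrarily for illustration
X_reduced = U[:, :k] * s[:k]  # rows projected into k-dimensional space

print(X_reduced.shape)  # much narrower input for k-means
```

The reduced rows can then be fed to a clustering algorithm in place of the original 30M-wide vectors; at the original scale a randomized or iterative truncated SVD on a sparse matrix format would be needed instead of a full dense decomposition.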
On 9 Mar 2013 17:16, "Ted Dunning" <[email protected]> wrote:

> The new streaming k-means should be able to handle that data pretty
> efficiently.  My guess is that on a single 16-core machine it should be
> able to complete the clustering in 10 minutes or so.  That is extrapolation
> and thus could be wildly off, of course.
>
> You definitely mean sparse.  30 M / 20 M = 1.5 non-zero features per row.
>  That may be a problem.  Or it might make the clustering fairly trivial.
>
> Dan,
>
> That code isn't checked into trunk yet, I think.  Can you comment on
> where working code can be found on github?
>
> On Sat, Mar 9, 2013 at 6:36 AM, Colum Foley <[email protected]> wrote:
>
> > I have approximately 20 million items and a feature vector approximately
> > 30 million in length, very sparse.
> >
> > Would you have any suggestions for other clustering algorithms I should
> > look at ?
> >
> > Thanks,
> > Colum
> >
> > On 8 Mar 2013, at 22:51, Ted Dunning <[email protected]> wrote:
> >
> > > You are beginning to exit the realm of reasonable applicability for
> > normal
> > > k-means algorithms here.
> > >
> > > How much data do you have?
> > >
> > > On Fri, Mar 8, 2013 at 9:46 AM, Colum Foley <[email protected]>
> > wrote:
> > >
> > >> Hi All,
> > >>
> > >> When I run KMeans clustering on a cluster, I notice that when I have
> > >> "large" values for k (i.e. approx >1000) I get loads of Hadoop write
> > >> errors:
> > >>
> > >> INFO hdfs.DFSClient: Exception in createBlockOutputStream
> > >> java.net.SocketTimeoutException: 69000 millis timeout while waiting
> > >> for channel to be ready for read. ch : java.nio.channels.SocketChannel
> > >>
> > >> This continues indefinitely, and lots of part-0xxxxx files of around
> > >> 30 KB each are produced.
> > >>
> > >> If I reduce the value for k it runs fine. Furthermore, if I run it in
> > >> local mode with high values of k it runs fine.
> > >>
> > >> The command I am using is as follows:
> > >>
> > >> mahout kmeans -i FeatureVectorsMahoutFormat -o ClusterResults
> > >> --clusters tmp -dm
> > >> org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd
> > >> 1.0 -x 20 -cl -k 10000
> > >>
> > >> I am running mahout 0.7.
> > >>
> > >> Are there some performance parameters I need to tune for mahout when
> > >> dealing with large volumes of data?
> > >>
> > >> Thanks,
> > >> Colum
> > >>
> >
>
