Hmmm... has ballkmeans already made it into the 0.8 release? I can't find it in the list of available programs when calling the mahout binary...
2013/10/3 Ted Dunning <[email protected]>

> What you are seeing here are the cluster centroids themselves, not the
> cluster assignments.
>
> Streaming k-means is a single-pass algorithm to derive these centroids.
> Typically, the next step is to cluster these centroids using ball k-means.
> *Those* results can then be applied back to the original (or new) input
> vectors to get cluster assignments for individual input vectors.
>
> I don't have command line specifics handy, but you seem to have done very
> well already at figuring out the details.
>
> On Oct 3, 2013, at 7:30 AM, Jens Bonerz wrote:
>
> > I created a series of scripts to try out streamingkmeans in mahout and
> > increased the number of clusters to a high amount as suggested by Ted.
> > Everything seems to work. However, I can't figure out how to access the
> > actual cluster data at the end of the process.
> >
> > It just gives me output that I cannot really understand... I would expect
> > my product_ids to be referenced to cluster ids...
> >
> > Example of the procedure's output:
> >
> > hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
> > Input Path: file:MahoutCluster/part-r-00000
> > Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
> > Key: 0: Value: key = 8678, weight = 3.00, vector = {37:26.83479118347168,6085:8.162049293518066,4785:10.44443130493164,2493:19.677349090576172,2494:16.06648826599121,9659:9.568963050842285,20877:9.307...
> > Key: 1: Value: key = 3118, weight = 14.00, vector = {19457:5.646900812784831,8774:4.746263821919759,9738:1.022495985031128,13301:5.762300491333008,14947:0.6774413585662842,8787:6.841406504313151,14958...
> > Key: 2: Value: key = 2867, weight = 3.00, vector = {15873:10.955257415771484,1615:4.029662132263184,20963:4.979445934295654,3978:5.611329555511475,7950:8.364990234375,8018:8.68657398223877,15433:7.959...
> > Key: 3: Value: key = 6295, weight = 1.00, vector = {17113:10.955257415771484,15347:9.568963050842285,15348:10.955257415771484,19845:7.805374622344971,7945:10.262109756469727,15356:18.090286254882812,1...
> > Key: 4: Value: key = 6725, weight = 4.00, vector = {10570:7.64715051651001,14915:6.126943588256836,14947:4.064648151397705,14330:9.414812088012695,18271:2.7172491550445557,14335:19.677349090576172,143...
> > Key: 5: ......
> >
> > This is my recipe:
> >
> > --------------------------------------------------------------------------------------------------------
> > Step 1
> > Create a seqfile from my data with Python. It's the product_id (key) and
> > the short normalized description (value) that is written into the
> > sequence file.
> >
> > --------------------------------------------------------------------------------------------------------
> > Step 2
> > Create vectors from that data with the following command:
> >
> > mahout seq2sparse \
> >   -i productClusterSequenceData/productClusterSequenceData.seq \
> >   -o productClusterSequenceData/vectors
> >
> > --------------------------------------------------------------------------------------------------------
> > Step 3
> > Cluster the vectors using streamingkmeans with this command:
> >
> > mahout streamingkmeans \
> >   -i productClusterSequenceData/vectors/tfidf-vectors \
> >   -o MahoutCluster \
> >   --tempDir /tmp \
> >   -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
> >   -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
> >   -k 10000 -km 500000
> >
> > --------------------------------------------------------------------------------------------------------
> > Step 4
> > Export the streamingkmeans cluster data into a textfile (for an extract
> > of the result see above):
> >
> > mahout seqdumper \
> >   -i MahoutCluster > similarProducts.txt
> >
> > What am I missing?
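The assignment step Ted describes (mapping each original vector back to its nearest centroid) is not shown with a Mahout command in this thread. A minimal single-machine sketch of that step, using toy sparse `{index: weight}` dicts standing in for the dumped tf-idf vectors and centroids (the product ids and values below are made up; the cosine-distance logic mirrors the CosineDistanceMeasure chosen in Step 3):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity over sparse {index: weight} vectors."""
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def assign(vectors, centroids):
    """Map each product_id to the index of its nearest centroid."""
    return {
        pid: min(range(len(centroids)),
                 key=lambda c: cosine_distance(v, centroids[c]))
        for pid, v in vectors.items()
    }

# Toy data shaped like the seqdumper output above (values are fabricated).
centroids = [{37: 26.8, 6085: 8.2}, {19457: 5.6, 8774: 4.7}]
vectors = {"prod_1": {37: 10.0, 6085: 3.0}, "prod_2": {19457: 2.0}}
print(assign(vectors, centroids))  # {'prod_1': 0, 'prod_2': 1}
```

This is only an illustration of the final assignment pass, not a replacement for running ball k-means over the centroids first.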
> > 2013/10/3 Ted Dunning <[email protected]>
> >
> > > Yes. That will work.
> > >
> > > The sketch will then contain 10,000 x log N centroids. If N = 10^9,
> > > log N ≈ 30, so the sketch will have about 300,000 weighted centroids
> > > in it. The final clustering will have to process these centroids to
> > > produce the desired 5,000 clusters. Since 300,000 is a relatively
> > > small number of data points, this clustering step should proceed
> > > relatively quickly.
> > >
> > > On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz <[email protected]> wrote:
> > >
> > > > Thx for your elaborate answer.
> > > >
> > > > So if the upper bound on the final number of clusters is unknown at
> > > > the beginning, what would happen if I define a very high number that
> > > > is guaranteed to be greater than the estimated number of clusters?
> > > > For example, if I set it to 10,000 clusters when an estimate of
> > > > 5,000 is likely, will that work?
> > > >
> > > > 2013/10/2 Ted Dunning <[email protected]>
> > > >
> > > > > The way that the new streaming k-means works is that there is a
> > > > > first sketch pass which only requires an upper bound on the final
> > > > > number of clusters you will want. It adaptively creates more or
> > > > > fewer clusters depending on the data and your bound. This sketch
> > > > > is guaranteed to be computed within at most one map-reduce pass.
> > > > > There is a threaded version that runs (fast) on a single machine.
> > > > > The threaded version is liable to be faster than the map-reduce
> > > > > version for moderate or smaller data sizes.
> > > > >
> > > > > That sketch can then be used to do all kinds of things that rely
> > > > > on Euclidean distance and still get results within a small factor
> > > > > of the same algorithm applied to all of the data. Typically this
> > > > > second phase is a ball k-means algorithm, but it could easily be a
> > > > > dp-means algorithm [1] if you want a variable number of clusters.
> > > > > Indeed, you could run many dp-means passes with different values
> > > > > of lambda on the same sketch. Note that the sketch is small enough
> > > > > that in-memory clustering is entirely viable and is very fast.
> > > > >
> > > > > For the problem you describe, however, you probably don't need the
> > > > > sketch approach at all and can probably apply ball k-means or
> > > > > dp-means directly. Running many k-means clusterings with differing
> > > > > values of k should be entirely feasible as well with such data
> > > > > sizes.
> > > > >
> > > > > [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
> > > > >
> > > > > On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz <[email protected]> wrote:
> > > > >
> > > > > > Isn't the streaming k-means just a different approach to crunch
> > > > > > through the data? In other words, the result of streaming
> > > > > > k-means should be comparable to using k-means in multiple
> > > > > > chained map-reduce cycles?
> > > > > >
> > > > > > I just read a paper about k-means clustering and its underlying
> > > > > > algorithm.
> > > > > >
> > > > > > According to that paper, k-means relies on a preknown/predefined
> > > > > > number of clusters as an input parameter.
> > > > > >
> > > > > > Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
> > > > > >
> > > > > > In my current scenario, however, the number of clusters is
> > > > > > unknown at the beginning.
> > > > > >
> > > > > > Maybe k-means is just not the right algorithm for clustering
> > > > > > similar products based on their short description text? What
> > > > > > else could I use?
> > > > > >
> > > > > > 2013/10/1 Ted Dunning <[email protected]>
> > > > > >
> > > > > > > At such small sizes, I would guess that the sequential
> > > > > > > versions of streaming k-means or ball k-means would be better
> > > > > > > options.
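Ted's sketch-size estimate above (k x log N weighted centroids) can be checked in a couple of lines; his "log N ≈ 30" for N = 10^9 implies a base-2 logarithm:

```python
import math

k = 10_000          # upper bound on the final cluster count given to the sketch
N = 1_000_000_000   # assumed number of input points (N = 10^9 in the email)

log_n = math.log2(N)            # ≈ 29.9, i.e. "log N ≈ 30"
sketch_size = k * round(log_n)  # ≈ 300,000 weighted centroids
print(round(log_n), sketch_size)  # 30 300000
```

The point of the estimate is that 300,000 centroids comfortably fit in memory, so the final ball k-means step over the sketch is cheap.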
> > > > > > > On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hello all,
> > > > > > > >
> > > > > > > > I am currently trying to create clusters from a group of
> > > > > > > > 50,000 strings that contain product descriptions (around
> > > > > > > > 70-100 characters each).
> > > > > > > >
> > > > > > > > That group of 50,000 consists of roughly 5,000 individual
> > > > > > > > products and ten varying product descriptions per product.
> > > > > > > > The product descriptions are already prepared for clustering
> > > > > > > > and contain a normalized brand name, product model number,
> > > > > > > > etc.
> > > > > > > >
> > > > > > > > What would be a good approach to maximise the number of
> > > > > > > > found clusters? (The best possible value would be 5,000
> > > > > > > > clusters with 10 products each.)
> > > > > > > >
> > > > > > > > I adapted the reuters cluster script to read in my data and
> > > > > > > > managed to create a first set of clusters. However, I have
> > > > > > > > not managed to maximise the cluster count.
> > > > > > > >
> > > > > > > > The question is: what do I need to tweak with regard to the
> > > > > > > > available mahout settings, so the clusters are created as
> > > > > > > > precisely as possible?
> > > > > > > >
> > > > > > > > Many regards!
> > > > > > > > Jens
> > > > > > > >
> > > > > > > > --
> > > > > > > > View this message in context:
> > > > > > > > http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
> > > > > > > > Sent from the Mahout User List mailing list archive at Nabble.com.
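For the variable-number-of-clusters case discussed in this thread, the dp-means algorithm from the Kulis & Jordan paper linked above [1] is simple enough to sketch directly. The following is a minimal single-machine illustration with plain Euclidean distance and made-up toy points, not Mahout's implementation: any point farther than lambda from every existing centroid spawns a new cluster, so k emerges from the data.

```python
import math

def dp_means(points, lam, iters=10):
    """Batch dp-means (Kulis & Jordan, ICML 2012): a point farther than
    lam from every centroid starts a new cluster of its own."""
    centroids = [list(points[0])]
    assignments = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            dists = [math.dist(p, c) for c in centroids]
            j = min(range(len(dists)), key=dists.__getitem__)
            if dists[j] > lam:           # too far from everything: new cluster
                centroids.append(list(p))
                assignments[i] = len(centroids) - 1
            else:
                assignments[i] = j
        # Recompute each centroid as the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = [sum(xs) / len(members) for xs in zip(*members)]
    return assignments, centroids

# Two well-separated toy groups; a lambda between the within-group and
# between-group distances recovers two clusters without specifying k.
pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
assignments, centroids = dp_means(pts, lam=3.0)
print(len(centroids))  # 2
```

As Ted notes, lambda plays the role k plays in plain k-means, so in practice one would rerun this cheaply over the in-memory sketch with several lambda values.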
