I'd love to contribute to the community. However, I am not proficient in Java and unfortunately have no dev environment for it. If it were written in Python, PHP, or C++, that would be a different story.
But if you can steer me in the right direction, I will try to suggest a patch.

2013/10/6 Ted Dunning <[email protected]>

> It is there, at the very least as part of the streaming k-means code. The abbreviation bkm has been used in the past.
>
> Looking at the code just now, I don't find any command line invocation of bkm. It should be quite simple to write one, and it would be very handy to have a way to run streaming k-means without a map-reduce step as well. As such, it might be good to have a new option to streaming k-means to use just bkm in a single thread, to use threaded streaming k-means on a single machine, or to use map-reduce streaming k-means.
>
> You up for trying to make a patch?
>
> Sent from my iPhone
>
> On Oct 6, 2013, at 12:37, Jens Bonerz <[email protected]> wrote:
>
> > Hmmm... has ballkmeans already made it into the 0.8 release? I can't find it in the list of available programs when calling the mahout binary...
> >
> > 2013/10/3 Ted Dunning <[email protected]>
> >
> >> What you are seeing here are the cluster centroids themselves, not the cluster assignments.
> >>
> >> Streaming k-means is a single-pass algorithm to derive these centroids. Typically, the next step is to cluster these centroids using ball k-means. *Those* results can then be applied back to the original (or new) input vectors to get cluster assignments for individual input vectors.
> >>
> >> I don't have command line specifics handy, but you seem to have done very well already at figuring out the details.
> >>
> >> On Oct 3, 2013, at 7:30 AM, Jens Bonerz wrote:
> >>
> >>> I created a series of scripts to try out streamingkmeans in Mahout and increased the number of clusters to a high amount as suggested by Ted. Everything seems to work. However, I can't figure out how to access the actual cluster data at the end of the process.
> >>>
> >>> It just gives me output that I cannot really understand... I would expect my product_ids being referenced to cluster ids...
> >>>
> >>> Example of the procedure's output:
> >>>
> >>> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
> >>> Input Path: file:MahoutCluster/part-r-00000
> >>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
> >>> Key: 0: Value: key = 8678, weight = 3.00, vector = {37:26.83479118347168,6085:8.162049293518066,4785:10.44443130493164,2493:19.677349090576172,2494:16.06648826599121,9659:9.568963050842285,20877:9.307...
> >>> Key: 1: Value: key = 3118, weight = 14.00, vector = {19457:5.646900812784831,8774:4.746263821919759,9738:1.022495985031128,13301:5.762300491333008,14947:0.6774413585662842,8787:6.841406504313151,14958...
> >>> Key: 2: Value: key = 2867, weight = 3.00, vector = {15873:10.955257415771484,1615:4.029662132263184,20963:4.979445934295654,3978:5.611329555511475,7950:8.364990234375,8018:8.68657398223877,15433:7.959...
> >>> Key: 3: Value: key = 6295, weight = 1.00, vector = {17113:10.955257415771484,15347:9.568963050842285,15348:10.955257415771484,19845:7.805374622344971,7945:10.262109756469727,15356:18.090286254882812,1...
> >>> Key: 4: Value: key = 6725, weight = 4.00, vector = {10570:7.64715051651001,14915:6.126943588256836,14947:4.064648151397705,14330:9.414812088012695,18271:2.7172491550445557,14335:19.677349090576172,143...
> >>> Key: 5: ......
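Each record in the dump above is an (IntWritable, CentroidWritable) pair: the key is an index, and the value is one sketch centroid with its own key, a weight (roughly, how many input vectors it absorbed) and the centroid vector. These are not product-to-cluster assignments yet. A minimal Java sketch for reading them back programmatically might look like the following; CentroidWritable and the part-file path are taken from the dump itself, but getCentroid() and the rest of the wiring are assumptions to verify against the Mahout 0.8 source.

// Minimal sketch: read the streaming k-means output shown above.
// Assumes 0.8-style classes (IntWritable keys, CentroidWritable values);
// verify class and method names against your Mahout version.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable;

public class DumpSketchCentroids {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path input = new Path("MahoutCluster/part-r-00000"); // sketch output from the streamingkmeans step
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, input, conf);
    try {
      IntWritable key = new IntWritable();
      CentroidWritable value = new CentroidWritable();
      while (reader.next(key, value)) {
        // Each record is one sketch centroid: an index, a weight
        // (how much input it absorbed) and the centroid vector itself.
        System.out.println(key.get() + " -> " + value.getCentroid());
      }
    } finally {
      reader.close();
    }
  }
}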
> >>> This is my recipe:
> >>>
> >>> --------------------------------------------------------------------------------------------------------
> >>> Step 1
> >>> Create a seqfile from my data with Python. It's the product_id (key) and the short normalized description (value) that is written into the sequence file.
> >>>
> >>> --------------------------------------------------------------------------------------------------------
> >>> Step 2
> >>> Create vectors from that data with the following command:
> >>>
> >>> mahout seq2sparse \
> >>>   -i productClusterSequenceData/productClusterSequenceData.seq \
> >>>   -o productClusterSequenceData/vectors
> >>>
> >>> --------------------------------------------------------------------------------------------------------
> >>> Step 3
> >>> Cluster the vectors using streamingkmeans with this command:
> >>>
> >>> mahout streamingkmeans \
> >>>   -i productClusterSequenceData/vectors/tfidf-vectors \
> >>>   -o MahoutCluster \
> >>>   --tempDir /tmp \
> >>>   -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
> >>>   -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
> >>>   -k 10000 -km 500000
> >>>
> >>> --------------------------------------------------------------------------------------------------------
> >>> Step 4
> >>> Export the streamingkmeans cluster data into a textfile (for an extract of the result see above):
> >>>
> >>> mahout seqdumper \
> >>>   -i MahoutCluster > similarProducts.txt
> >>>
> >>> What am I missing?
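The step missing from the recipe is the second phase Ted describes above: cluster the sketch centroids with ball k-means, then assign each product's tf-idf vector to its nearest final centroid, which yields the product_id-to-cluster_id mapping. A rough in-memory Java sketch, reusing the paths from steps 2-4, might look like this; BallKMeans, BruteSearch and the searcher calls follow the 0.8 streaming k-means code, but the exact constructors, method signatures and part-file names are assumptions that need checking against the source.

// Rough sketch of the missing "second pass": ball k-means over the sketch
// centroids, then nearest-centroid assignment of every product vector.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.clustering.streaming.cluster.BallKMeans;
import org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable;
import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.math.Centroid;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.math.neighborhood.BruteSearch;
import org.apache.mahout.math.neighborhood.UpdatableSearcher;

public class FinalClustering {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);

    // 1. Load the sketch centroids produced by the streamingkmeans step.
    List<Centroid> sketch = new ArrayList<Centroid>();
    SequenceFile.Reader sketchReader =
        new SequenceFile.Reader(fs, new Path("MahoutCluster/part-r-00000"), conf);
    IntWritable idx = new IntWritable();
    CentroidWritable cw = new CentroidWritable();
    while (sketchReader.next(idx, cw)) {
      // Copy the centroid, since the Writable instance is reused between calls.
      sketch.add(cw.getCentroid().clone());
    }
    sketchReader.close();

    // 2. Run ball k-means over the weighted sketch centroids to get the final
    //    clusters (5000 clusters and 20 iterations are placeholder values;
    //    the cosine distance measure simply mirrors step 3).
    BallKMeans bkm = new BallKMeans(new BruteSearch(new CosineDistanceMeasure()), 5000, 20);
    UpdatableSearcher finalCentroids = bkm.cluster(sketch);

    // 3. Assign every original tf-idf product vector to its nearest final
    //    centroid (only one part file is read here for brevity).
    SequenceFile.Reader vectorReader = new SequenceFile.Reader(
        fs, new Path("productClusterSequenceData/vectors/tfidf-vectors/part-r-00000"), conf);
    Text productId = new Text();
    VectorWritable vec = new VectorWritable();
    while (vectorReader.next(productId, vec)) {
      Centroid nearest = (Centroid) finalCentroids.search(vec.get(), 1).get(0).getValue();
      System.out.println(productId + "\t" + nearest.getIndex());
    }
    vectorReader.close();
  }
}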
> >>> 2013/10/3 Ted Dunning <[email protected]>
> >>>
> >>>> Yes. That will work.
> >>>>
> >>>> The sketch will then contain 10,000 x log N centroids. If N = 10^9, log N ≈ 30, so the sketch will have about 300,000 weighted centroids in it. The final clustering will have to process these centroids to produce the desired 5,000 clusters. Since 300,000 is a relatively small number of data points, this clustering step should proceed relatively quickly.
> >>>>
> >>>> On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz <[email protected]> wrote:
> >>>>
> >>>>> Thanks for your elaborate answer.
> >>>>>
> >>>>> So if the upper bound on the final number of clusters is unknown at the beginning, what would happen if I define a very high number that is guaranteed to be greater than the estimated number of clusters? For example, if I set it to 10,000 clusters when an estimate of 5,000 is likely, will that work?
> >>>>>
> >>>>> 2013/10/2 Ted Dunning <[email protected]>
> >>>>>
> >>>>>> The way that the new streaming k-means works is that there is a first sketch pass which only requires an upper bound on the final number of clusters you will want. It adaptively creates more or fewer clusters depending on the data and your bound. This sketch is guaranteed to be computed within at most one map-reduce pass. There is a threaded version that runs (fast) on a single machine. The threaded version is liable to be faster than the map-reduce version for moderate or smaller data sizes.
> >>>>>>
> >>>>>> That sketch can then be used to do all kinds of things that rely on Euclidean distance and still get results within a small factor of the same algorithm applied to all of the data. Typically this second phase is a ball k-means algorithm, but it could easily be a dp-means algorithm [1] if you want a variable number of clusters. Indeed, you could run many dp-means passes with different values of lambda on the same sketch. Note that the sketch is small enough that in-memory clustering is entirely viable and is very fast.
> >>>>>>
> >>>>>> For the problem you describe, however, you probably don't need the sketch approach at all and can probably apply ball k-means or dp-means directly. Running many k-means clusterings with differing values of k should be entirely feasible as well with such data sizes.
> >>>>>>
> >>>>>> [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
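To make the dp-means idea in [1] concrete: instead of fixing k, a penalty parameter lambda decides when a point is too far from every existing centroid and should open a new cluster, so lambda indirectly controls how many clusters come out. The following is a deliberately tiny, single-pass Java toy with made-up data; it is not Mahout code, and the real algorithm in the paper iterates Lloyd-style until convergence.

// Toy, single-pass illustration of the dp-means rule from [1].
import java.util.ArrayList;
import java.util.List;

public class DpMeansSketch {
  static double squaredDistance(double[] a, double[] b) {
    double d = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      d += diff * diff;
    }
    return d;
  }

  public static void main(String[] args) {
    double lambda = 1.0; // penalty: larger lambda -> fewer clusters
    double[][] points = { {0.0, 0.1}, {0.2, 0.0}, {5.0, 5.1}, {5.2, 4.9}, {9.0, 0.2} };

    List<double[]> centroids = new ArrayList<double[]>();
    List<Integer> counts = new ArrayList<Integer>();

    for (double[] p : points) {
      // Find the nearest existing centroid.
      int best = -1;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centroids.size(); c++) {
        double d = squaredDistance(p, centroids.get(c));
        if (d < bestDist) { bestDist = d; best = c; }
      }
      if (best == -1 || bestDist > lambda) {
        // Too far from everything: open a new cluster at this point.
        centroids.add(p.clone());
        counts.add(1);
      } else {
        // Otherwise fold the point into the nearest centroid (running mean).
        double[] c = centroids.get(best);
        int n = counts.get(best) + 1;
        for (int i = 0; i < c.length; i++) {
          c[i] += (p[i] - c[i]) / n;
        }
        counts.set(best, n);
      }
    }
    System.out.println("lambda = " + lambda + " -> " + centroids.size() + " clusters");
  }
}

Re-running a pass like this with different values of lambda over the same in-memory sketch is the kind of experiment Ted suggests above.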
> >>>>>> On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz <[email protected]> wrote:
> >>>>>>
> >>>>>>> Isn't the streaming k-means just a different approach to crunch through the data? In other words, the result of streaming k-means should be comparable to using k-means in multiple chained map-reduce cycles?
> >>>>>>>
> >>>>>>> I just read a paper about k-means clustering and its underlying algorithm.
> >>>>>>>
> >>>>>>> According to that paper, k-means relies on a pre-known/predefined number of clusters as an input parameter.
> >>>>>>>
> >>>>>>> Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
> >>>>>>>
> >>>>>>> In my current scenario, however, the number of clusters is unknown at the beginning.
> >>>>>>>
> >>>>>>> Maybe k-means is just not the right algorithm for clustering similar products based on their short description text? What else could I use?
> >>>>>>>
> >>>>>>> 2013/10/1 Ted Dunning <[email protected]>
> >>>>>>>
> >>>>>>>> At such small sizes, I would guess that the sequential versions of streaming k-means or ball k-means would be better options.
> >>>>>>>>
> >>>>>>>> On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Hello all,
> >>>>>>>>>
> >>>>>>>>> I am currently trying to create clusters from a group of 50,000 strings that contain product descriptions (around 70-100 characters each).
> >>>>>>>>>
> >>>>>>>>> That group of 50,000 consists of roughly 5,000 individual products and ten varying product descriptions per product. The product descriptions are already prepared for clustering and contain a normalized brand name, product model number, etc.
> >>>>>>>>>
> >>>>>>>>> What would be a good approach to maximise the number of found clusters (the best possible value would be 5,000 clusters with 10 products each)?
> >>>>>>>>>
> >>>>>>>>> I adapted the reuters cluster script to read in my data and managed to create a first set of clusters. However, I have not managed to maximise the cluster count.
> >>>>>>>>>
> >>>>>>>>> The question is: what do I need to tweak with regard to the available Mahout settings, so the clusters are created as precisely as possible?
> >>>>>>>>>
> >>>>>>>>> Many regards!
> >>>>>>>>> Jens
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> View this message in context: http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
> >>>>>>>>> Sent from the Mahout User List mailing list archive at Nabble.com.
>
> <http://www.hightechmg.com>
