Re: Preserve contents of keys after running k-means

Pat Ferrel Sat, 06 Jul 2013 09:55:12 -0700

OK, squeaky wheel alert...

When I use kmeans I'm interested primarily in the cluster membership but almost 
as much in the distance to the centroid for ordering purposes. I'd also like 
the cluster list to contain any secondary vector ids that I've used for the 
vectors, like names. Pdf makes sense for fuzzy clustering, where it takes the 
place of distance to centroid in ordering. If all these optional values were 
considered property lists that are attached to the named or ided vector then 
they might be kept with the vectors too so they would follow them through any 
further processing.
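
To make the request above concrete, here is a minimal sketch of a cluster list that keeps the distance to the centroid and an arbitrary property list with each vector. It is plain Python, not Mahout code; `Member`, `assign`, and the "name" property are hypothetical names used only for illustration, and Euclidean distance is an assumption.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Member:
    """One clustered vector plus the optional values that travel with it."""
    vector: tuple
    distance: float                                  # distance to the assigned centroid
    properties: dict = field(default_factory=dict)   # e.g. {"name": "12345"}


def assign(vectors, centroids):
    """Assign each (properties, vector) pair to its nearest centroid.

    Returns {cluster_index: [Member, ...]} with each member list ordered
    by increasing distance to the centroid.
    """
    clusters = {i: [] for i in range(len(centroids))}
    for props, vec in vectors:
        dists = [math.dist(vec, c) for c in centroids]
        best = min(range(len(centroids)), key=dists.__getitem__)
        clusters[best].append(Member(vec, dists[best], dict(props)))
    for members in clusters.values():
        members.sort(key=lambda m: m.distance)
    return clusters
```

Because the properties ride along inside each `Member`, any later step that reorders or filters the members still has the names available without a separate lookup.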


On Jul 5, 2013, at 10:28 PM, Andrew Musselman <[email protected]> 
wrote:

I want to have the core feature of k-means which is to find out which vectors 
landed in what cluster, and I'm open to discussion beyond that.

Best
Andrew

On Jul 5, 2013, at 5:43 PM, Pat Ferrel <[email protected]> wrote:

> I think https://issues.apache.org/jira/browse/MAHOUT-1030 may be the wrong 
> issue #. 
> 
> The problem is that the Names from NamedVectorWritable are not used in the 
> cluster map after kmeans. You need to maintain your own map of your vector 
> name to internal Mahout id ints. NamedVectors work all the way through from 
> vector creation out of raw docs, TFIDF weighting, etc but the Names are not 
> used in id-ing the list of vectors assigned to clusters. 
> 
> It's been on my wish list for Mahout to get general, universal support for 
> named vectors or, better yet, property vectors (where any number of 
> properties can be attached to a vector). A truly scalable non-DB string<->int 
> index creation and lookup (mapreduce version) is doable but not trivial. If 
> you don't have too many for an in-memory hashmap you have a much easier time 
> of it.  
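
For the easy case described above, where all names fit in memory, a two-way string<->int index could look roughly like the sketch below. It is plain Python rather than anything from Mahout; `NameIndex`, `id_for`, and `name_for` are hypothetical names, and assigning ids in first-seen order is an assumption.

```python
class NameIndex:
    """In-memory two-way map between vector names and dense int ids."""

    def __init__(self):
        self._to_id = {}     # name -> int id
        self._to_name = []   # int id -> name (list position is the id)

    def id_for(self, name):
        """Return the id for name, assigning the next free int on first use."""
        if name not in self._to_id:
            self._to_id[name] = len(self._to_name)
            self._to_name.append(name)
        return self._to_id[name]

    def name_for(self, int_id):
        """Return the name that was assigned int_id."""
        return self._to_name[int_id]
```

Build the index while writing the vectors, cluster on the int ids, then call `name_for` on the ids found in the cluster output. This does not address the scalable mapreduce version, which is the non-trivial part.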
> 
> 
> On Jul 5, 2013, at 2:53 PM, Ted Dunning <[email protected]> wrote:
> 
> Andrew,
> 
> I was being somewhat stupid.  You are talking about a parallel program.
> There is no single counter.
> 
> The row number is what I was referring to.  Each process will have
> consecutive row numbers starting at 0.  These rows will correspond to a
> sequence of rows in the original data.  If you can cause each process to
> record these id's as they go by, you have the thing you need.
> 
> I haven't looked at this code in several years, however, so my suggestions
> may well be quite far from reasonable.
> 
> 
> 
> On Fri, Jul 5, 2013 at 2:34 PM, Andrew Musselman <[email protected]
>> wrote:
> 
>> Ted, I'm having a tough time finding the "internal ids" you mentioned.
>> Where are they output?
>> 
>> Thanks
>> 
>> 
>> On Fri, Jul 5, 2013 at 2:10 PM, Andrew Musselman <
>> [email protected]
>>> wrote:
>> 
>>> :)
>>> 
>>> Aha, we were only looking in the points directory, not inside the
>>> clustered points directory.  So if I understand, you're suggesting that
>> we
>>> use the key at the beginning of the clustered points as a one-to-one map.
>>> The number of unique keys in the output doesn't seem to line up with
>> that
>>> in the input.
>>> 
>>> We may do our dumb idea for now until we get a better handle on how the
>>> output is written.
>>> 
>>> Thanks!
>>> 
>>> 
>>> On Fri, Jul 5, 2013 at 1:57 PM, Ted Dunning <[email protected]>
>> wrote:
>>> 
>>>> Andrew,
>>>> 
>>>> That is a pretty clever solution.
>>>> 
>>>> I think that you can get by with a simpler solution by noting how the
>>>> internal id's are assigned (sequentially, I think).
>>>> 
>>>> 
>>>> 
>>>> On Fri, Jul 5, 2013 at 1:53 PM, Andrew Musselman <
>>>> [email protected]
>>>>> wrote:
>>>> 
>>>>> So how are people working around this without patching 0.7?
>>>> Downgrading to
>>>>> 0.6?
>>>>> 
>>>>> We're on a cluster where we don't have admin rights to patch Mahout.
>>>>> 
>>>>> Our dumb idea now is to hash the concatenated values of each vector
>> and
>>>>> pair that up with our original ids, then run another process on the
>>>> points
>>>>> results to hash the results, then join up on hash value to pull id
>>>> together
>>>>> with cluster #.
>>>>> 
>>>>> Anyone have a nicer solution to this at hand?
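
The hash-and-join workaround described above can be sketched as follows. This is plain Python over in-memory lists, not a Mahout or Hadoop job; `vector_hash` and `join_ids_to_clusters` are hypothetical names, SHA-1 is an arbitrary choice of hash, and the sketch assumes the vector values come back from clustering exactly as they went in.

```python
import hashlib


def vector_hash(vec):
    """Hash the concatenated values of a vector.

    Every value is formatted as a float so that (0, 3) and (0.0, 3.0)
    produce the same string and therefore the same hash.
    """
    text = ",".join(repr(float(x)) for x in vec)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def join_ids_to_clusters(keyed_input, clustered_points):
    """Join original ids to cluster numbers on the vector hash.

    keyed_input:      iterable of (original_id, vector)
    clustered_points: iterable of (cluster_id, vector)
    Returns {original_id: cluster_id}.
    """
    ids_by_hash = {}
    for original_id, vec in keyed_input:
        ids_by_hash.setdefault(vector_hash(vec), []).append(original_id)

    result = {}
    for cluster_id, vec in clustered_points:
        for original_id in ids_by_hash.get(vector_hash(vec), []):
            result[original_id] = cluster_id
    return result
```

Ids whose vectors are identical share a hash and all receive the same cluster number. If the values change in any way between input and output, for example through rounding, the hashes will not match and those ids drop out of the join.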
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, Jul 5, 2013 at 1:02 PM, Suneel Marthi <
>> [email protected]
>>>>>> wrote:
>>>>> 
>>>>>> Andrew,
>>>>>> 
>>>>>> This feature was available prior to Mahout 0.7 (clustering had
>> support
>>>>> for
>>>>>> Named Vectors) and was broken later. While this may not be fixed in
>>>> the
>>>>>> soon to be Mahout 0.8, there is a JIRA that's open for this -
>>>>>> https://issues.apache.org/jira/browse/MAHOUT-1030 that's been
>>>> targeted
>>>>>> for 0.9. Please feel free to submit a patch if you would like to
>> take
>>>> a
>>>>>> shot at it.
>>>>>> 
>>>>>> Suneel
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ________________________________
>>>>>> From: Andrew Musselman <[email protected]>
>>>>>> To: [email protected]
>>>>>> Sent: Friday, July 5, 2013 3:05 PM
>>>>>> Subject: Preserve contents of keys after running k-means
>>>>>> 
>>>>>> 
>>>>>> Hi list
>>>>>> 
>>>>>> We are trying to do some k-means clustering and are wondering if
>>>> there's
>>>>> an
>>>>>> easy way to preserve the contents of the keys for the input records.
>>>>>> 
>>>>>> E.g.
>>>>>> 
>>>>>> 12345: (0,3,79,80)
>>>>>> 98765: (1,4,98,90)
>>>>>> 
>>>>>> where the vectors being clustered are the tuples and the keys are
>> some
>>>>> id.
>>>>>> 
>>>>>> When we run clusterdump with pointsDir specified we have the vectors
>>>> but
>>>>>> not the keys.  We're looking at NamedVector as a path to this
>>>> solution,
>>>>> as
>>>>>> well as looking at a mapping file between ordered integers and the
>>>> ids in
>>>>>> order.
>>>>>> 
>>>>>> Thanks for any advice.
>>>>>> 
>>>>>> Best
>>>>>> Andrew
>
