Re: Preserve contents of keys after running k-means

Andrew Musselman Fri, 05 Jul 2013 22:30:24 -0700

I want the core feature of k-means, which is finding out which vectors 
landed in which cluster, and I'm open to discussion beyond that.


Best
Andrew
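
For a small data set, the in-memory map between vector names and integer ids that Pat describes in the quoted message below could look roughly like this. It is a minimal sketch in plain Python, not Mahout code: the NameIndex class, the sample records, and the shape of the cluster assignments are all assumptions made for illustration.

```python
class NameIndex:
    """Bidirectional in-memory map between string names and integer ids."""

    def __init__(self):
        self._id_by_name = {}   # name -> int id
        self._name_by_id = []   # int id -> name (list index is the id)

    def id_for(self, name):
        """Return the id for name, assigning the next sequential id if new."""
        if name not in self._id_by_name:
            self._id_by_name[name] = len(self._name_by_id)
            self._name_by_id.append(name)
        return self._id_by_name[name]

    def name_for(self, int_id):
        """Return the name that was registered under int_id."""
        return self._name_by_id[int_id]


index = NameIndex()

# Hypothetical input records in the "key: vector" shape from the original post.
records = [("12345", (0, 3, 79, 80)), ("98765", (1, 4, 98, 90))]

# Replace each string key with an integer id before clustering.
keyed_by_int = [(index.id_for(name), vec) for name, vec in records]

# Hypothetical clustering result: (cluster id, integer vector id) pairs.
assignments = [(0, 0), (1, 1)]

# Translate the integer ids back to the original keys afterwards.
named = [(index.name_for(vid), cluster) for cluster, vid in assignments]
```

This only works while all names fit in one process's memory, which is the easy case Pat mentions; it does not address the scalable mapreduce version.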

On Jul 5, 2013, at 5:43 PM, Pat Ferrel <[email protected]> wrote:

> I think https://issues.apache.org/jira/browse/MAHOUT-1030 may be the wrong 
> issue #. 
> 
> The problem is that the Names from NamedVectorWritable are not used in the 
> cluster map after kmeans. You need to maintain your own map of your vector 
> name to internal Mahout id ints. NamedVectors work all the way through from 
> vector creation out of raw docs, TFIDF weighting, etc but the Names are not 
> used in id-ing the list of vectors assigned to clusters. 
> 
> General, universal support for named vectors, or better yet property vectors 
> (where any number of properties can be attached to a vector), has been on my 
> wish list for Mahout. A truly scalable non-DB string<->int index creation 
> and lookup (mapreduce version) is doable but not trivial. If you have few 
> enough names to fit in an in-memory hashmap, you have a much easier time 
> of it.
> 
> 
> On Jul 5, 2013, at 2:53 PM, Ted Dunning <[email protected]> wrote:
> 
> Andrew,
> 
> I was being somewhat stupid.  You are talking about a parallel program.
> There is no single counter.
> 
> The row number is what I was referring to.  Each process will have
> consecutive row numbers starting at 0.  These rows will correspond to a
> sequence of rows in the original data.  If you can cause each process to
> record these ids as they go by, you have what you need.
> 
> I haven't looked at this code in several years, however, so my suggestions
> may well be quite far from reasonable.
> 
> 
> 
> On Fri, Jul 5, 2013 at 2:34 PM, Andrew Musselman <[email protected]> wrote:
> 
>> Ted, I'm having a tough time finding the "internal ids" you mentioned.
>> Where are they output?
>> 
>> Thanks
>> 
>> 
>> On Fri, Jul 5, 2013 at 2:10 PM, Andrew Musselman <[email protected]> wrote:
>> 
>>> :)
>>> 
>>> Aha, we were only looking in the points directory, not inside the
>>> clustered points directory.  So if I understand, you're suggesting that we
>>> use the key at the beginning of the clustered points as a one-to-one map.
>>> The number of unique keys in the output doesn't seem to line up with that
>>> in the input.
>>> 
>>> We may do our dumb idea for now until we get a better handle on how the
>>> output is written.
>>> 
>>> Thanks!
>>> 
>>> 
>>> On Fri, Jul 5, 2013 at 1:57 PM, Ted Dunning <[email protected]> wrote:
>>> 
>>>> Andrew,
>>>> 
>>>> That is a pretty clever solution.
>>>> 
>>>> I think that you can get by with a simpler solution by noting how the
>>>> internal id's are assigned (sequentially, I think).
>>>> 
>>>> 
>>>> 
>>>> On Fri, Jul 5, 2013 at 1:53 PM, Andrew Musselman <[email protected]> wrote:
>>>> 
>>>>> So how are people working around this without patching 0.7? Downgrading
>>>>> to 0.6?
>>>>> 
>>>>> We're on a cluster where we don't have admin rights to patch Mahout.
>>>>> 
>>>>> Our dumb idea now is to hash the concatenated values of each vector and
>>>>> pair that up with our original ids, then run another process on the points
>>>>> results to hash the results, then join up on hash value to pull id together
>>>>> with cluster #.
>>>>> 
>>>>> Anyone have a nicer solution to this at hand?
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, Jul 5, 2013 at 1:02 PM, Suneel Marthi <[email protected]> wrote:
>>>>> 
>>>>>> Andrew,
>>>>>> 
>>>>>> This feature was available prior to Mahout 0.7 (clustering had support for
>>>>>> Named Vectors) and was broken later. While this may not be fixed in the
>>>>>> soon-to-be-released Mahout 0.8, there is an open JIRA for this,
>>>>>> https://issues.apache.org/jira/browse/MAHOUT-1030, that's been targeted
>>>>>> for 0.9. Please feel free to submit a patch if you would like to take a
>>>>>> shot at it.
>>>>>> 
>>>>>> Suneel
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ________________________________
>>>>>> From: Andrew Musselman <[email protected]>
>>>>>> To: [email protected]
>>>>>> Sent: Friday, July 5, 2013 3:05 PM
>>>>>> Subject: Preserve contents of keys after running k-means
>>>>>> 
>>>>>> 
>>>>>> Hi list
>>>>>> 
>>>>>> We are trying to do some k-means clustering and are wondering if there's an
>>>>>> easy way to preserve the contents of the keys for the input records.
>>>>>> 
>>>>>> E.g.
>>>>>> 
>>>>>> 12345: (0,3,79,80)
>>>>>> 98765: (1,4,98,90)
>>>>>> 
>>>>>> where the vectors being clustered are the tuples and the keys are some id.
>>>>>> 
>>>>>> When we run clusterdump with pointsDir specified we have the vectors but
>>>>>> not the keys.  We're looking at NamedVector as a path to this solution, as
>>>>>> well as looking at a mapping file between ordered integers and the ids in
>>>>>> order.
>>>>>> 
>>>>>> Thanks for any advice.
>>>>>> 
>>>>>> Best
>>>>>> Andrew
>
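
The hash-and-join workaround described earlier in the thread (hash each vector's concatenated values, pair the hash with the original id, hash the clustered points the same way, then join on the hash) can be sketched as follows. This is plain Python for illustration only, not Mahout code: the function names, the sample data, and the (cluster, vector) output shape are assumptions. It also assumes both sides serialize values identically, and that identical vectors always land in the same cluster, since they share one hash.

```python
import hashlib


def vector_hash(vector):
    """Hash the concatenated values of a vector into a hex digest."""
    # Normalise every value to float so 0 and 0.0 serialise the same way.
    text = ",".join(repr(float(v)) for v in vector)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def join_ids_to_clusters(inputs, clustered):
    """Pair each original id with the cluster its vector was assigned to.

    inputs:    iterable of (record id, vector)
    clustered: iterable of (cluster number, vector)
    Returns a list of (record id, cluster number); the cluster is None
    when no clustered vector has a matching hash.
    """
    cluster_by_hash = {vector_hash(vec): cluster for cluster, vec in clustered}
    return [(rid, cluster_by_hash.get(vector_hash(vec))) for rid, vec in inputs]


# Hypothetical input records, using the keys from the original post.
inputs = [("12345", (0, 3, 79, 80)), ("98765", (1, 4, 98, 90))]

# Hypothetical clustered points, in a different order and without keys.
clustered = [(7, (1.0, 4.0, 98.0, 90.0)), (3, (0.0, 3.0, 79.0, 80.0))]

joined = join_ids_to_clusters(inputs, clustered)
```

On a real cluster the join would run as a separate job over both data sets rather than through an in-memory dictionary; the dictionary here just keeps the sketch short.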
