I think https://issues.apache.org/jira/browse/MAHOUT-1030 may be the wrong 
issue #. 

The problem is that the names from NamedVectorWritable are not used in the 
cluster map after k-means. You need to maintain your own map from your vector 
names to Mahout's internal int ids. NamedVectors work all the way through, from 
vector creation out of raw docs to TF-IDF weighting and so on, but the names 
are not used to identify the vectors assigned to each cluster. 

It's been on my wish list for Mahout: general support for named vectors or, 
better yet, property vectors (where any number of properties can be attached 
to a vector). A truly scalable, non-DB string<->int index creation and lookup 
(a mapreduce version) is doable but not trivial. If you don't have too many 
vectors to fit in an in-memory hashmap, you have a much easier time of it. 
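
For the in-memory case, a rough sketch of what I mean is below. It is plain 
Java and does not touch any Mahout API; the class name NameIndex and its 
methods are made up for illustration. It also assumes you assign the int ids 
yourself when you build the vectors, so that the ids you see in the clustering 
output are the ones you handed out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout: an in-memory two-way index
// between vector names and the int ids handed to the clustering job.
final class NameIndex {
    private final Map<String, Integer> idByName = new HashMap<>();
    private final List<String> nameById = new ArrayList<>();

    // Returns the existing id for a name, or assigns the next sequential one.
    int idFor(String name) {
        Integer existing = idByName.get(name);
        if (existing != null) {
            return existing;
        }
        int id = nameById.size();
        idByName.put(name, id);
        nameById.add(name);
        return id;
    }

    // Reverse lookup: maps an id seen in the clustering output back to its name.
    String nameFor(int id) {
        if (id < 0 || id >= nameById.size()) {
            throw new IllegalArgumentException("unknown id: " + id);
        }
        return nameById.get(id);
    }

    // Number of distinct names indexed so far.
    int size() {
        return nameById.size();
    }
}
```

You would call idFor(name) once per vector while writing the input, keep the 
index around (or serialize it), and call nameFor(id) on each clustered point 
afterwards. This only works while everything fits in one JVM's heap.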
 


On Jul 5, 2013, at 2:53 PM, Ted Dunning <[email protected]> wrote:

Andrew,

I was being somewhat stupid.  You are talking about a parallel program.
There is no single counter.

The row number is what I was referring to.  Each process will have
consecutive row numbers starting at 0.  These rows will correspond to a
sequence of rows in the original data.  If you can cause each process to
record these id's as they go by, you have the thing you need.

I haven't looked at this code in several years, however, so my suggestions
may well be quite far from reasonable.



On Fri, Jul 5, 2013 at 2:34 PM, Andrew Musselman <[email protected]
> wrote:

> Ted, I'm having a tough time finding the "internal ids" you mentioned.
> Where are they output?
> 
> Thanks
> 
> 
> On Fri, Jul 5, 2013 at 2:10 PM, Andrew Musselman <
> [email protected]
>> wrote:
> 
>> :)
>> 
>> Aha, we were only looking in the points directory, not inside the
>> clustered points directory.  So if I understand, you're suggesting that
>> we use the key at the beginning of the clustered points as a one-to-one
>> map.  The number of unique keys in the output doesn't seem to line up
>> with that in the input.
>> 
>> We may do our dumb idea for now until we get a better handle on how the
>> output is written.
>> 
>> Thanks!
>> 
>> 
>> On Fri, Jul 5, 2013 at 1:57 PM, Ted Dunning <[email protected]> wrote:
>> 
>>> Andrew,
>>> 
>>> That is a pretty clever solution.
>>> 
>>> I think that you can get by with a simpler solution by noting how the
>>> internal id's are assigned (sequentially, I think).
>>> 
>>> 
>>> 
>>> On Fri, Jul 5, 2013 at 1:53 PM, Andrew Musselman <
>>> [email protected]
>>>> wrote:
>>> 
>>>> So how are people working around this without patching 0.7?
>>>> Downgrading to 0.6?
>>>> 
>>>> We're on a cluster where we don't have admin rights to patch Mahout.
>>>> 
>>>> Our dumb idea now is to hash the concatenated values of each vector
>>>> and pair that up with our original ids, then run another process on
>>>> the points results to hash the results, then join up on hash value to
>>>> pull id together with cluster #.
>>>> 
>>>> Anyone have a nicer solution to this at hand?
>>>> 
>>>> 
>>>> 
>>>> On Fri, Jul 5, 2013 at 1:02 PM, Suneel Marthi <[email protected]> wrote:
>>>> 
>>>>> Andrew,
>>>>> 
>>>>> This feature was available prior to Mahout 0.7 (clustering had
>>>>> support for Named Vectors) and was broken later. While this may not
>>>>> be fixed in the soon to be Mahout 0.8, there is a JIRA that's open
>>>>> for this - https://issues.apache.org/jira/browse/MAHOUT-1030 that's
>>>>> been targeted for 0.9. Please feel free to submit a patch if you
>>>>> would like to take a shot at it.
>>>>> 
>>>>> Suneel
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> ________________________________
>>>>> From: Andrew Musselman <[email protected]>
>>>>> To: [email protected]
>>>>> Sent: Friday, July 5, 2013 3:05 PM
>>>>> Subject: Preserve contents of keys after running k-means
>>>>> 
>>>>> 
>>>>> Hi list
>>>>> 
>>>>> We are trying to do some k-means clustering and are wondering if
>>>>> there's an easy way to preserve the contents of the keys for the
>>>>> input records.
>>>>> 
>>>>> E.g.
>>>>> 
>>>>> 12345: (0,3,79,80)
>>>>> 98765: (1,4,98,90)
>>>>> 
>>>>> where the vectors being clustered are the tuples and the keys are
>>>>> some id.
>>>>> 
>>>>> When we run clusterdump with pointsDir specified we have the vectors
>>>>> but not the keys.  We're looking at NamedVector as a path to this
>>>>> solution, as well as looking at a mapping file between ordered
>>>>> integers and the ids in order.
>>>>> 
>>>>> Thanks for any advice.
>>>>> 
>>>>> Best
>>>>> Andrew
>>>>> 
>>>> 
>>> 
>> 
>> 
> 
