Re: Preserve contents of keys after running k-means

Andrew Musselman Fri, 05 Jul 2013 14:36:02 -0700

Ted, I'm having a tough time finding the "internal ids" you mentioned.
Where are they output?


Thanks


On Fri, Jul 5, 2013 at 2:10 PM, Andrew Musselman <[email protected]> wrote:

> :)
>
> Aha, we were only looking in the points directory, not inside the
> clustered points directory.  So if I understand, you're suggesting that we
> use the key at the beginning of the clustered points as a one-to-one map.
> The number of unique keys in the output doesn't seem to match the number
> in the input, though.
>
> We may go with our dumb idea for now until we get a better handle on how
> the output is written.
>
> Thanks!
>
>
> On Fri, Jul 5, 2013 at 1:57 PM, Ted Dunning <[email protected]> wrote:
>
>> Andrew,
>>
>> That is a pretty clever solution.
>>
>> I think that you can get by with a simpler solution by noting how the
>> internal ids are assigned (sequentially, I think).
>>
>>
>>
>> On Fri, Jul 5, 2013 at 1:53 PM, Andrew Musselman <[email protected]> wrote:
>>
>> > So how are people working around this without patching 0.7?
>> > Downgrading to 0.6?
>> >
>> > We're on a cluster where we don't have admin rights to patch Mahout.
>> >
>> > Our dumb idea now is to hash the concatenated values of each vector and
>> > pair that hash with our original ids, then run another process on the
>> > points results to hash each clustered vector, then join on the hash
>> > value to pair each id with its cluster #.
>> >
>> > Anyone have a nicer solution to this at hand?
>> >
>> >
>> >
>> > On Fri, Jul 5, 2013 at 1:02 PM, Suneel Marthi <[email protected]> wrote:
>> >
>> > > Andrew,
>> > >
>> > > This feature was available prior to Mahout 0.7 (clustering had support
>> > > for Named Vectors) and was broken later. While this may not be fixed
>> > > in the upcoming Mahout 0.8, there is an open JIRA for this,
>> > > https://issues.apache.org/jira/browse/MAHOUT-1030, that's been
>> > > targeted for 0.9. Please feel free to submit a patch if you would
>> > > like to take a shot at it.
>> > >
>> > > Suneel
>> > >
>> > >
>> > >
>> > >
>> > > ________________________________
>> > >  From: Andrew Musselman <[email protected]>
>> > > To: [email protected]
>> > > Sent: Friday, July 5, 2013 3:05 PM
>> > > Subject: Preserve contents of keys after running k-means
>> > >
>> > >
>> > > Hi list
>> > >
>> > > We are trying to do some k-means clustering and are wondering if
>> > > there's an easy way to preserve the contents of the keys for the
>> > > input records.
>> > >
>> > > E.g.
>> > >
>> > > 12345: (0,3,79,80)
>> > > 98765: (1,4,98,90)
>> > >
>> > > where the vectors being clustered are the tuples and the keys are
>> > > some id.
>> > >
>> > > When we run clusterdump with pointsDir specified we have the vectors
>> > > but not the keys.  We're looking at NamedVector as a path to this
>> > > solution, as well as looking at a mapping file between ordered
>> > > integers and the ids in order.
>> > >
>> > > Thanks for any advice.
>> > >
>> > > Best
>> > > Andrew
>> > >
>> >
>>
>
>
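
The hash-and-join workaround described in this thread can be sketched in a
few lines. This is only an illustration under stated assumptions, not Mahout
code: the function names (vector_hash, build_id_index, join_clusters) are
hypothetical, the ids and vectors are the two examples from the original
question, and the cluster numbers are made up. It also assumes the clustered
points can be parsed back into (cluster, vector) pairs whose values match the
input exactly.

```python
import hashlib


def vector_hash(values):
    """Hash a vector via a canonical string of its values (hypothetical helper)."""
    # Convert every value to float first so that 3 and 3.0 hash identically.
    canonical = ",".join(repr(float(v)) for v in values)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def build_id_index(records):
    """Map each vector hash to the list of original ids that carry that vector."""
    index = {}
    for record_id, values in records:
        index.setdefault(vector_hash(values), []).append(record_id)
    return index


def join_clusters(index, clustered_points):
    """Join (cluster, vector) pairs back to the original ids on the vector hash."""
    assignments = {}
    for cluster_id, values in clustered_points:
        for record_id in index.get(vector_hash(values), []):
            assignments[record_id] = cluster_id
    return assignments


# Input records from the original question: (id, vector).
records = [("12345", (0, 3, 79, 80)), ("98765", (1, 4, 98, 90))]

# Assumed shape of the parsed clustering output: (cluster number, vector).
# The cluster numbers 7 and 2 are invented for the example.
clustered = [(7, (1.0, 4.0, 98.0, 90.0)), (2, (0.0, 3.0, 79.0, 80.0))]

assignments = join_clusters(build_id_index(records), clustered)
```

Records with identical vectors share a hash, so the index keeps a list of ids
per hash; identical vectors should be assigned to the same cluster anyway, so
every id in the list gets that cluster. The join only works if the values
survive the round trip unchanged, since any rounding in the output would
change the hash.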

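The mapping-file idea from the original question (ordered integers paired
with the ids in order) might look like the sketch below. It assumes that the
integer key given to each input record comes back unchanged next to its
cluster in the output, which this thread does not confirm. The function names
and the cluster numbers are hypothetical.

```python
def assign_sequential_keys(records):
    """Replace each original id with a sequential integer key, keeping a lookup table."""
    id_by_key = {}
    keyed_vectors = []
    for key, (record_id, values) in enumerate(records):
        id_by_key[key] = record_id
        keyed_vectors.append((key, values))
    return id_by_key, keyed_vectors


def restore_ids(id_by_key, clustered_keys):
    """Translate (integer key, cluster) pairs back to original ids."""
    return {id_by_key[key]: cluster_id for key, cluster_id in clustered_keys}


# Input records from the original question: (id, vector).
records = [("12345", (0, 3, 79, 80)), ("98765", (1, 4, 98, 90))]
id_by_key, keyed_vectors = assign_sequential_keys(records)

# Assumed shape of the parsed clustering output: (integer key, cluster number).
# The cluster numbers 7 and 2 are invented for the example.
clustered_keys = [(1, 7), (0, 2)]

restored = restore_ids(id_by_key, clustered_keys)
```

If the number of unique keys in the output does not match the input, as
reported earlier in the thread, this lookup would fail or give wrong results,
so it is worth checking the key counts before relying on it.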