:) Aha, we were only looking in the points directory, not inside the clustered points directory. So if I understand correctly, you're suggesting that we use the key at the beginning of the clustered points as a one-to-one map. However, the number of unique keys in the output doesn't seem to line up with the number in the input.
We may do our dumb idea for now until we get a better handle on how the output is written.

Thanks!

On Fri, Jul 5, 2013 at 1:57 PM, Ted Dunning <[email protected]> wrote:

> Andrew,
>
> That is a pretty clever solution.
>
> I think that you can get by with a simpler solution by noting how the
> internal id's are assigned (sequentially, I think).
>
> On Fri, Jul 5, 2013 at 1:53 PM, Andrew Musselman <[email protected]> wrote:
>
> > So how are people working around this without patching 0.7? Downgrading to
> > 0.6?
> >
> > We're on a cluster where we don't have admin rights to patch Mahout.
> >
> > Our dumb idea now is to hash the concatenated values of each vector and
> > pair that up with our original ids, then run another process on the points
> > results to hash the results, then join up on hash value to pull id together
> > with cluster #.
> >
> > Anyone have a nicer solution to this at hand?
> >
> > On Fri, Jul 5, 2013 at 1:02 PM, Suneel Marthi <[email protected]> wrote:
> >
> > > Andrew,
> > >
> > > This feature was available prior to Mahout 0.7 (clustering had support for
> > > Named Vectors) and was broken later. While this may not be fixed in the
> > > soon to be Mahout 0.8, there is a JIRA that's open for this -
> > > https://issues.apache.org/jira/browse/MAHOUT-1030 that's been targeted
> > > for 0.9. Please feel free to submit a patch if you would like to take a
> > > shot at it.
> > >
> > > Suneel
> > >
> > > ________________________________
> > > From: Andrew Musselman <[email protected]>
> > > To: [email protected]
> > > Sent: Friday, July 5, 2013 3:05 PM
> > > Subject: Preserve contents of keys after running k-means
> > >
> > > Hi list
> > >
> > > We are trying to do some k-means clustering and are wondering if there's an
> > > easy way to preserve the contents of the keys for the input records.
> > >
> > > E.g.
> > >
> > > 12345: (0,3,79,80)
> > > 98765: (1,4,98,90)
> > >
> > > where the vectors being clustered are the tuples and the keys are some id.
> > >
> > > When we run clusterdump with pointsDir specified we have the vectors but
> > > not the keys. We're looking at NamedVector as a path to this solution, as
> > > well as looking at a mapping file between ordered integers and the ids in
> > > order.
> > >
> > > Thanks for any advice.
> > >
> > > Best
> > > Andrew
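
For anyone following along, here is a rough sketch of the hash-and-join workaround described in this thread, in plain Python rather than against Mahout itself. It assumes the original ids and vectors, and the (cluster #, vector) pairs from the points output, have already been parsed into Python structures. The function names and the cluster numbers 7 and 2 are made up for illustration. It also assumes the vector values come back from the points output unchanged, so that both sides hash identically.

```python
import hashlib


def vector_hash(values):
    """Hash the concatenated values of one vector into a hex digest."""
    # Normalise every value to float so that 3 and 3.0 hash the same way.
    joined = ",".join(repr(float(v)) for v in values)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def join_ids_to_clusters(id_to_vector, clustered_points):
    """Join original ids to cluster numbers via the hash of each vector.

    id_to_vector: dict mapping original id -> vector (hypothetical input)
    clustered_points: iterable of (cluster number, vector) pairs
    Returns a dict mapping original id -> cluster number.
    """
    # Several ids may share one vector, so keep a list of ids per hash.
    hash_to_ids = {}
    for record_id, vec in id_to_vector.items():
        hash_to_ids.setdefault(vector_hash(vec), []).append(record_id)

    assignments = {}
    for cluster_id, vec in clustered_points:
        for record_id in hash_to_ids.get(vector_hash(vec), []):
            assignments[record_id] = cluster_id
    return assignments


# The two example records from the original question.
inputs = {"12345": (0, 3, 79, 80), "98765": (1, 4, 98, 90)}
# Hypothetical points output: cluster numbers 7 and 2 are invented.
points = [(7, (1.0, 4.0, 98.0, 90.0)), (2, (0.0, 3.0, 79.0, 80.0))]

assignments = join_ids_to_clusters(inputs, points)
print(assignments)
```

One caveat: records with identical vectors share a hash, so they all receive whichever cluster number that vector was assigned. Any rounding or reformatting of values between input and output would also break the join.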
