Ted, I'm having a tough time finding the "internal ids" you mentioned. Where are they output?
Thanks

On Fri, Jul 5, 2013 at 2:10 PM, Andrew Musselman <[email protected]> wrote:

> :)
>
> Aha, we were only looking in the points directory, not inside the
> clustered points directory. So if I understand, you're suggesting that we
> use the key at the beginning of the clustered points as a one-to-one map.
> The number of unique keys in the output doesn't seem to line up with that
> in the input.
>
> We may do our dumb idea for now until we get a better handle on how the
> output is written.
>
> Thanks!
>
> On Fri, Jul 5, 2013 at 1:57 PM, Ted Dunning <[email protected]> wrote:
>
>> Andrew,
>>
>> That is a pretty clever solution.
>>
>> I think that you can get by with a simpler solution by noting how the
>> internal id's are assigned (sequentially, I think).
>>
>> On Fri, Jul 5, 2013 at 1:53 PM, Andrew Musselman <[email protected]> wrote:
>>
>> > So how are people working around this without patching 0.7? Downgrading
>> > to 0.6?
>> >
>> > We're on a cluster where we don't have admin rights to patch Mahout.
>> >
>> > Our dumb idea now is to hash the concatenated values of each vector and
>> > pair that up with our original ids, then run another process on the
>> > points results to hash the results, then join up on hash value to pull
>> > id together with cluster #.
>> >
>> > Anyone have a nicer solution to this at hand?
>> >
>> > On Fri, Jul 5, 2013 at 1:02 PM, Suneel Marthi <[email protected]> wrote:
>> >
>> > > Andrew,
>> > >
>> > > This feature was available prior to Mahout 0.7 (clustering had support
>> > > for Named Vectors) and was broken later. While this may not be fixed in
>> > > the soon to be Mahout 0.8, there is a JIRA that's open for this -
>> > > https://issues.apache.org/jira/browse/MAHOUT-1030 that's been targeted
>> > > for 0.9. Please feel free to submit a patch if you would like to take a
>> > > shot at it.
>> > >
>> > > Suneel
>> > >
>> > > ________________________________
>> > > From: Andrew Musselman <[email protected]>
>> > > To: [email protected]
>> > > Sent: Friday, July 5, 2013 3:05 PM
>> > > Subject: Preserve contents of keys after running k-means
>> > >
>> > > Hi list
>> > >
>> > > We are trying to do some k-means clustering and are wondering if
>> > > there's an easy way to preserve the contents of the keys for the input
>> > > records.
>> > >
>> > > E.g.
>> > >
>> > > 12345: (0,3,79,80)
>> > > 98765: (1,4,98,90)
>> > >
>> > > where the vectors being clustered are the tuples and the keys are some
>> > > id.
>> > >
>> > > When we run clusterdump with pointsDir specified we have the vectors
>> > > but not the keys. We're looking at NamedVector as a path to this
>> > > solution, as well as looking at a mapping file between ordered integers
>> > > and the ids in order.
>> > >
>> > > Thanks for any advice.
>> > >
>> > > Best
>> > > Andrew
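
For anyone reading this thread later, here is a rough sketch of the hash-and-join workaround described above (hash each vector's concatenated values, pair the hash with the original id, hash the clustered points the same way, then join on the hash). It is plain Python and does not touch Mahout at all: the function names `vector_key` and `join_ids_to_clusters`, the cluster numbers 2 and 7, and the assumption that the points have already been parsed into (cluster, vector) pairs are all made up for illustration. Only the two example records come from the original question.

```python
import hashlib
from collections import defaultdict


def vector_key(values):
    """Hash the concatenated values of one vector (hypothetical helper)."""
    # Join with a separator so that (1, 23) and (12, 3) do not collide,
    # and format every value the same way on both sides of the join.
    text = ",".join(format(float(v), ".10g") for v in values)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def join_ids_to_clusters(id_vectors, clustered_points):
    """Map each original id to a cluster number by joining on the vector hash.

    id_vectors:       iterable of (original_id, vector) from the input data
    clustered_points: iterable of (cluster_id, vector), assumed to be already
                      parsed out of the clustering output
    """
    # Several ids can share one hash if their vectors are identical,
    # so keep a list of ids per hash rather than a single id.
    ids_by_hash = defaultdict(list)
    for original_id, vector in id_vectors:
        ids_by_hash[vector_key(vector)].append(original_id)

    assignments = {}
    for cluster_id, vector in clustered_points:
        for original_id in ids_by_hash.get(vector_key(vector), []):
            assignments[original_id] = cluster_id
    return assignments


# The two records from the original question, with made-up cluster numbers.
inputs = [("12345", (0, 3, 79, 80)), ("98765", (1, 4, 98, 90))]
clustered = [(7, (1, 4, 98, 90)), (2, (0, 3, 79, 80))]
result = join_ids_to_clusters(inputs, clustered)
```

One caveat with this approach: records whose vectors are identical share a hash, so they cannot be told apart, although they would normally land in the same cluster anyway. The join also only works if the values read back from the output format to exactly the same strings as the input values.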
