Andrew, I was being somewhat stupid. You are talking about a parallel program. There is no single counter.
The row number is what I was referring to. Each process will have consecutive row numbers starting at 0. These rows will correspond to a sequence of rows in the original data. If you can cause each process to record these id's as they go by, you have the thing you need.

I haven't looked at this code in several years, however, so my suggestions may well be quite far from reasonable.

On Fri, Jul 5, 2013 at 2:34 PM, Andrew Musselman <[email protected]> wrote:

> Ted, I'm having a tough time finding the "internal ids" you mentioned..
> Where are they output?
>
> Thanks
>
>
> On Fri, Jul 5, 2013 at 2:10 PM, Andrew Musselman <[email protected]> wrote:
>
> > :)
> >
> > Aha, we were only looking in the points directory, not inside the
> > clustered points directory. So if I understand, you're suggesting that we
> > use the key at the beginning of the clustered points as a one-to-one map.
> > The number of unique keys in the output doesn't seem to line up with that
> > in the input.
> >
> > We may do our dumb idea for now until we get a better handle on how the
> > output is written.
> >
> > Thanks!
> >
> >
> > On Fri, Jul 5, 2013 at 1:57 PM, Ted Dunning <[email protected]> wrote:
> >
> >> Andrew,
> >>
> >> That is a pretty clever solution.
> >>
> >> I think that you can get by with a simpler solution by noting how the
> >> internal id's are assigned (sequentially, I think).
> >>
> >>
> >> On Fri, Jul 5, 2013 at 1:53 PM, Andrew Musselman <[email protected]> wrote:
> >>
> >> > So how are people working around this without patching 0.7?
> >> > Downgrading to 0.6?
> >> >
> >> > We're on a cluster where we don't have admin rights to patch Mahout.
> >> >
> >> > Our dumb idea now is to hash the concatenated values of each vector and
> >> > pair that up with our original ids, then run another process on the
> >> > points results to hash the results, then join up on hash value to pull
> >> > id together with cluster #.
> >> >
> >> > Anyone have a nicer solution to this at hand?
> >> >
> >> >
> >> > On Fri, Jul 5, 2013 at 1:02 PM, Suneel Marthi <[email protected]> wrote:
> >> >
> >> > > Andrew,
> >> > >
> >> > > This feature was available prior to Mahout 0.7 (clustering had support
> >> > > for Named Vectors) and was broken later. While this may not be fixed in
> >> > > the soon to be Mahout 0.8, there is a JIRA that's open for this -
> >> > > https://issues.apache.org/jira/browse/MAHOUT-1030 that's been targeted
> >> > > for 0.9. Please feel free to submit a patch if you would like to take
> >> > > a shot at it.
> >> > >
> >> > > Suneel
> >> > >
> >> > >
> >> > > ________________________________
> >> > > From: Andrew Musselman <[email protected]>
> >> > > To: [email protected]
> >> > > Sent: Friday, July 5, 2013 3:05 PM
> >> > > Subject: Preserve contents of keys after running k-means
> >> > >
> >> > >
> >> > > Hi list
> >> > >
> >> > > We are trying to do some k-means clustering and are wondering if there's
> >> > > an easy way to preserve the contents of the keys for the input records.
> >> > >
> >> > > E.g.
> >> > >
> >> > > 12345: (0,3,79,80)
> >> > > 98765: (1,4,98,90)
> >> > >
> >> > > where the vectors being clustered are the tuples and the keys are some
> >> > > id.
> >> > >
> >> > > When we run clusterdump with pointsDir specified we have the vectors but
> >> > > not the keys. We're looking at NamedVector as a path to this solution,
> >> > > as well as looking at a mapping file between ordered integers and the
> >> > > ids in order.
> >> > >
> >> > > Thanks for any advice.
> >> > >
> >> > > Best
> >> > > Andrew
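
For reference, the hash-and-join workaround described in the quoted thread could look roughly like the following. This is only a sketch in plain Python, not Mahout code: it assumes the input records and the clustered points have already been exported to simple (id, values) and (cluster, values) pairs, and the function names (vector_key, build_id_index, join_clusters) are made up for illustration. It also assumes the vector values come back from clustering unchanged, so that equal vectors serialize to the same string.

```python
import hashlib
from collections import defaultdict


def vector_key(values):
    """Hash the concatenated values of one vector into a hex digest."""
    # Normalize to float so that 3 and 3.0 produce the same text.
    text = ",".join(repr(float(v)) for v in values)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def build_id_index(records):
    """Map each vector hash to the original ids that share that vector."""
    index = defaultdict(list)
    for record_id, values in records:
        index[vector_key(values)].append(record_id)
    return index


def join_clusters(index, clustered_points):
    """Join clustered points back to original ids on the vector hash."""
    joined = set()
    for cluster_id, values in clustered_points:
        # Identical vectors are indistinguishable here, so every id that
        # shares the vector receives the same cluster number.
        for record_id in index.get(vector_key(values), []):
            joined.add((record_id, cluster_id))
    return sorted(joined)


# Hypothetical data: ids and vectors from the original question,
# cluster numbers invented for the example.
records = [("12345", (0, 3, 79, 80)), ("98765", (1, 4, 98, 90))]
clustered = [(7, (1.0, 4.0, 98.0, 90.0)), (2, (0.0, 3.0, 79.0, 80.0))]

index = build_id_index(records)
assignments = join_clusters(index, clustered)
for record_id, cluster_id in assignments:
    print(record_id, cluster_id)
```

Keeping a list of ids per hash means duplicate vectors do not silently overwrite each other, which is the main weakness of joining on vector contents rather than on a real key.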
