Re: Preserve contents of keys after running k-means

Ted Dunning Fri, 05 Jul 2013 14:54:36 -0700

Andrew,

I was being somewhat stupid.  You are talking about a parallel program.
There is no single counter.


The row number is what I was referring to.  Each process will see
consecutive row numbers starting at 0.  These rows will correspond to a
sequence of rows in the original data.  If you can get each process to
record the ids as they go by, you have the thing you need.
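
A minimal sketch of that bookkeeping, in plain Python rather than Mahout
code. It assumes each parallel task can see its own split number and the
original key of every row it reads. The names record_ids and attach_ids,
the third record, and the cluster assignments are made up for illustration.

```python
# Sketch only: plain Python standing in for parallel tasks; not Mahout code.
from typing import Dict, List, Tuple

Record = Tuple[str, Tuple[float, ...]]  # (original id, vector)
Position = Tuple[int, int]              # (split number, row number in split)


def record_ids(split_no: int, split: List[Record]) -> Dict[Position, str]:
    """Map (split number, row number within the split) to the original id."""
    # Row numbers restart at 0 in every split, so the split number
    # has to be part of the lookup key.
    return {(split_no, row): key for row, (key, _vector) in enumerate(split)}


def attach_ids(id_map: Dict[Position, str],
               assignments: Dict[Position, int]) -> Dict[str, int]:
    """Swap (split, row) positions for original ids in cluster assignments."""
    return {id_map[pos]: cluster for pos, cluster in assignments.items()}


# Two splits of the input; the first two records come from the example
# in this thread, the third is invented.
splits = [
    [("12345", (0, 3, 79, 80)), ("98765", (1, 4, 98, 90))],
    [("55555", (2, 5, 60, 70))],
]

id_map: Dict[Position, str] = {}
for split_no, split in enumerate(splits):
    id_map.update(record_ids(split_no, split))

# Hypothetical clustering output keyed by position instead of original id.
assignments = {(0, 0): 1, (0, 1): 0, (1, 0): 1}
clusters_by_id = attach_ids(id_map, assignments)
```

The split number is part of the key because the row counter restarts in
every process, so a bare row number is not unique across the job.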

I haven't looked at this code in several years, however, so my suggestions
may well be quite far from reasonable.



On Fri, Jul 5, 2013 at 2:34 PM, Andrew Musselman <[email protected]> wrote:

> Ted, I'm having a tough time finding the "internal ids" you mentioned.
> Where are they output?
>
> Thanks
>
>
> On Fri, Jul 5, 2013 at 2:10 PM, Andrew Musselman <[email protected]> wrote:
>
> > :)
> >
> > Aha, we were only looking in the points directory, not inside the
> > clustered points directory.  So if I understand, you're suggesting that we
> > use the key at the beginning of the clustered points as a one-to-one map.
> > The number of unique keys in the output doesn't seem to line up with that
> > in the input.
> >
> > We may do our dumb idea for now until we get a better handle on how the
> > output is written.
> >
> > Thanks!
> >
> >
> > On Fri, Jul 5, 2013 at 1:57 PM, Ted Dunning <[email protected]> wrote:
> >
> >> Andrew,
> >>
> >> That is a pretty clever solution.
> >>
> >> I think that you can get by with a simpler solution by noting how the
> >> internal ids are assigned (sequentially, I think).
> >>
> >>
> >>
> >> On Fri, Jul 5, 2013 at 1:53 PM, Andrew Musselman <[email protected]> wrote:
> >>
> >> > So how are people working around this without patching 0.7?
> >> > Downgrading to 0.6?
> >> >
> >> > We're on a cluster where we don't have admin rights to patch Mahout.
> >> >
> >> > Our dumb idea now is to hash the concatenated values of each vector and
> >> > pair that up with our original ids, then run another process on the
> >> > points results to hash the results, then join on the hash value to pull
> >> > each id together with its cluster #.
> >> >
> >> > Anyone have a nicer solution to this at hand?
> >> >
> >> >
> >> >
> >> > On Fri, Jul 5, 2013 at 1:02 PM, Suneel Marthi <[email protected]> wrote:
> >> >
> >> > > Andrew,
> >> > >
> >> > > This feature was available prior to Mahout 0.7 (clustering had support
> >> > > for Named Vectors) and was broken later. While this may not be fixed in
> >> > > the upcoming Mahout 0.8, there is an open JIRA for it,
> >> > > https://issues.apache.org/jira/browse/MAHOUT-1030, that's been targeted
> >> > > for 0.9. Please feel free to submit a patch if you would like to take a
> >> > > shot at it.
> >> > >
> >> > > Suneel
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > ________________________________
> >> > >  From: Andrew Musselman <[email protected]>
> >> > > To: [email protected]
> >> > > Sent: Friday, July 5, 2013 3:05 PM
> >> > > Subject: Preserve contents of keys after running k-means
> >> > >
> >> > >
> >> > > Hi list
> >> > >
> >> > > We are trying to do some k-means clustering and are wondering if there's
> >> > > an easy way to preserve the contents of the keys for the input records.
> >> > >
> >> > > E.g.
> >> > >
> >> > > 12345: (0,3,79,80)
> >> > > 98765: (1,4,98,90)
> >> > >
> >> > > where the vectors being clustered are the tuples and the keys are some
> >> > > id.
> >> > >
> >> > > When we run clusterdump with pointsDir specified we have the vectors but
> >> > > not the keys.  We're looking at NamedVector as a path to this solution,
> >> > > as well as looking at a mapping file between ordered integers and the
> >> > > ids in order.
> >> > >
> >> > > Thanks for any advice.
> >> > >
> >> > > Best
> >> > > Andrew
> >> > >
> >> >
> >>
> >
> >
>
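
The hashing workaround described in the quoted thread above could look
roughly like the sketch below, in plain Python. It assumes the clustered
points have already been parsed into (vector, cluster) pairs; that parsing
step is not shown. vector_hash, join_on_hash, and the cluster numbers are
made up for illustration.

```python
# Sketch only: joins original ids to cluster numbers via a hash of the vector.
import hashlib
from typing import Dict, Iterable, List, Sequence, Tuple


def vector_hash(vector: Sequence[float]) -> str:
    """Hash the concatenated values of a vector."""
    # Normalise every value to float first so that 3 and 3.0 hash alike,
    # since input and output may format numbers differently.
    text = ",".join(repr(float(v)) for v in vector)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def join_on_hash(
    keyed_vectors: Iterable[Tuple[str, Sequence[float]]],
    clustered_points: Iterable[Tuple[Sequence[float], int]],
) -> Dict[str, int]:
    """Pair each original id with the cluster of the matching vector."""
    ids_by_hash: Dict[str, List[str]] = {}
    for key, vector in keyed_vectors:
        # Identical vectors share a hash, so keep every id seen for it.
        ids_by_hash.setdefault(vector_hash(vector), []).append(key)

    clusters: Dict[str, int] = {}
    for vector, cluster in clustered_points:
        for key in ids_by_hash.get(vector_hash(vector), []):
            clusters[key] = cluster
    return clusters


# Input records from the example in this thread.
keyed = [("12345", (0, 3, 79, 80)), ("98765", (1, 4, 98, 90))]
# Hypothetical parsed clustering output: vectors without keys, in any order.
clustered = [((1.0, 4.0, 98.0, 90.0), 0), ((0.0, 3.0, 79.0, 80.0), 1)]
joined = join_on_hash(keyed, clustered)
```

The join only works if the vector values survive clustering unchanged. Any
rounding or reformatting between input and output would need the same
normalisation applied on both sides before hashing.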
