I want the core feature of k-means, which is finding out which vectors landed in which cluster, and I'm open to discussion beyond that.

Best
Andrew

On Jul 5, 2013, at 5:43 PM, Pat Ferrel <[email protected]> wrote:

> I think https://issues.apache.org/jira/browse/MAHOUT-1030 may be the wrong
> issue #.
>
> The problem is that the Names from NamedVectorWritable are not used in the
> cluster map after kmeans. You need to maintain your own map of your vector
> name to internal Mahout id ints. NamedVectors work all the way through from
> vector creation out of raw docs, TFIDF weighting, etc but the Names are not
> used in id-ing the list of vectors assigned to clusters.
>
> It's been an issue on my wish list for Mahout. To get general universal
> support for named vectors or better yet property vectors (where any number of
> properties can be attached to a vector). A truly scalable non-DB string<->int
> index creation and lookup (mapreduce version) is doable but not trivial. If
> you don't have too many for an in-memory hashmap you have a much easier time
> of it.
>
>
> On Jul 5, 2013, at 2:53 PM, Ted Dunning <[email protected]> wrote:
>
> Andrew,
>
> I was being somewhat stupid. You are talking about a parallel program.
> There is no single counter.
>
> The row number is what I was referring to. Each process will have
> consecutive row numbers starting at 0. These rows will correspond to a
> sequence of rows in the original data. If you can cause each process to
> record these id's as they go by, you have the thing you need.
>
> I haven't looked at this code in several years, however, so my suggestions
> may well be quite far from reasonable.
>
>
>
> On Fri, Jul 5, 2013 at 2:34 PM, Andrew Musselman <[email protected]> wrote:
>
>> Ted, I'm having a tough time finding the "internal ids" you mentioned..
>> Where are they output?
>>
>> Thanks
>>
>>
>> On Fri, Jul 5, 2013 at 2:10 PM, Andrew Musselman <[email protected]> wrote:
>>
>>> :)
>>>
>>> Aha, we were only looking in the points directory, not inside the
>>> clustered points directory.
>>> So if I understand, you're suggesting that we
>>> use the key at the beginning of the clustered points as a one-to-one map.
>>> The number of unique keys in the output doesn't seem to line up with that
>>> in the input.
>>>
>>> We may do our dumb idea for now until we get a better handle on how the
>>> output is written.
>>>
>>> Thanks!
>>>
>>>
>>> On Fri, Jul 5, 2013 at 1:57 PM, Ted Dunning <[email protected]> wrote:
>>>
>>>> Andrew,
>>>>
>>>> That is a pretty clever solution.
>>>>
>>>> I think that you can get by with a simpler solution by noting how the
>>>> internal id's are assigned (sequentially, I think).
>>>>
>>>>
>>>>
>>>> On Fri, Jul 5, 2013 at 1:53 PM, Andrew Musselman <[email protected]> wrote:
>>>>
>>>>> So how are people working around this without patching 0.7?
>>>>> Downgrading to 0.6?
>>>>>
>>>>> We're on a cluster where we don't have admin rights to patch Mahout.
>>>>>
>>>>> Our dumb idea now is to hash the concatenated values of each vector and
>>>>> pair that up with our original ids, then run another process on the
>>>>> points results to hash the results, then join up on hash value to pull
>>>>> id together with cluster #.
>>>>>
>>>>> Anyone have a nicer solution to this at hand?
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jul 5, 2013 at 1:02 PM, Suneel Marthi <[email protected]> wrote:
>>>>>
>>>>>> Andrew,
>>>>>>
>>>>>> This feature was available prior to Mahout 0.7 (clustering had support
>>>>>> for Named Vectors) and was broken later. While this may not be fixed in
>>>>>> the soon to be Mahout 0.8, there is a JIRA that's open for this -
>>>>>> https://issues.apache.org/jira/browse/MAHOUT-1030 that's been targeted
>>>>>> for 0.9. Please feel free to submit a patch if you would like to take a
>>>>>> shot at it.
>>>>>>
>>>>>> Suneel
>>>>>>
>>>>>>
>>>>>> ________________________________
>>>>>> From: Andrew Musselman <[email protected]>
>>>>>> To: [email protected]
>>>>>> Sent: Friday, July 5, 2013 3:05 PM
>>>>>> Subject: Preserve contents of keys after running k-means
>>>>>>
>>>>>>
>>>>>> Hi list
>>>>>>
>>>>>> We are trying to do some k-means clustering and are wondering if
>>>>>> there's an easy way to preserve the contents of the keys for the input
>>>>>> records.
>>>>>>
>>>>>> E.g.
>>>>>>
>>>>>> 12345: (0,3,79,80)
>>>>>> 98765: (1,4,98,90)
>>>>>>
>>>>>> where the vectors being clustered are the tuples and the keys are some
>>>>>> id.
>>>>>>
>>>>>> When we run clusterdump with pointsDir specified we have the vectors but
>>>>>> not the keys. We're looking at NamedVector as a path to this solution,
>>>>>> as well as looking at a mapping file between ordered integers and the
>>>>>> ids in order.
>>>>>>
>>>>>> Thanks for any advice.
>>>>>>
>>>>>> Best
>>>>>> Andrew
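
For anyone reading this thread later, the hash-and-join workaround described above could look roughly like the sketch below. It is only an illustration in plain Python, not Mahout code: the function names (`vector_fingerprint`, `build_id_index`, `join_clusters`) and the cluster numbers 3 and 7 are made up, and it assumes the clustered points have already been parsed into (cluster id, values) pairs and that the values come back from clustering unchanged. The id index is an in-memory dict, so it only fits the "not too many for an in-memory hashmap" case.

```python
import hashlib


def vector_fingerprint(values):
    """Hash the concatenated values of one vector into a hex digest."""
    # Assumption: both sides format every component the same way.
    # Converting to float first makes 3 and 3.0 hash identically.
    joined = ",".join(repr(float(v)) for v in values)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def build_id_index(keyed_vectors):
    """Map fingerprint -> list of original ids (duplicate vectors share one)."""
    index = {}
    for key, values in keyed_vectors:
        index.setdefault(vector_fingerprint(values), []).append(key)
    return index


def join_clusters(id_index, clustered_points):
    """Join (cluster_id, values) pairs back to the original ids by hash."""
    assignments = {}
    for cluster_id, values in clustered_points:
        for key in id_index.get(vector_fingerprint(values), []):
            assignments[key] = cluster_id
    return assignments


# Input records from the original question, keyed by id.
inputs = [("12345", (0, 3, 79, 80)), ("98765", (1, 4, 98, 90))]

# Hypothetical parsed clustering output: (cluster id, vector values).
clustered = [(7, [1.0, 4.0, 98.0, 90.0]), (3, [0.0, 3.0, 79.0, 80.0])]

result = join_clusters(build_id_index(inputs), clustered)
print(result)  # {'98765': 7, '12345': 3}
```

Two caveats with this approach: records with identical vectors share a fingerprint, so every id stored under that fingerprint receives the same cluster, and any rounding or reformatting of values between input and output changes the hash and makes the join miss.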
