If you get near a good solution (which is what the streaming k-means tries to make very likely), then the local convexity should not be spoiled by the L_1 regularization. This is what is behind the guaranteed convergence of SGD, and it is the reason that stepwise regression so often fails ... stepwise is equivalent to L_0 regularization, which breaks convexity. L_1 does nearly the same amount of good while leaving convexity intact.
On Mon, Mar 4, 2013 at 2:44 PM, Sean Owen <[email protected]> wrote:

> So this is constraining the centroids themselves to be 0 in many
> dimensions... interesting. I assume this comes into play when you move
> centroids -- you're not minimizing squared distance anymore but that plus
> an additional penalty. I'd be interested to hear your results Chris if you
> try it -- versus simply using the L1 distance alone. I agree this does seem
> to make it all the more important to run it many times and pick the best
> clustering due to local minima.
>
>
> On Mon, Mar 4, 2013 at 7:18 PM, Ted Dunning <[email protected]> wrote:
>
> > Always good to test, but I think that euclidean distance with L_1
> > regularization is probably more interesting.
> >
> > On Mon, Mar 4, 2013 at 12:00 PM, Chris Harrington <[email protected]> wrote:
> >
> > > So if I'm understanding what you are saying is, simply put, that I should
> > > investigate the use of L_1 as my distance measure during my measuring of
> > > vector distance within a cluster?
> > >
> > > On 1 Mar 2013, at 16:24, Ted Dunning wrote:
> > >
> > > > What Sean says is just right, except that I was (telegraphically) getting
> > > > at a slightly different point with L_1:
> > > >
> > > > On Wed, Feb 27, 2013 at 7:23 AM, Chris Harrington <[email protected]> wrote:
> > > >
> > > >> Is L_1 regularization the same as manhattan distance?
> > > >
> > > > L_1 metric is manhattan distance, yes.
> > > >
> > > > L_1 regularization of k-means refers to something a little bit different.
> > > >
> > > > The idea with regularization is that you add some sort of penalty to the
> > > > function you are optimizing. This penalty pushes the optimization toward a
> > > > solution that you would prefer on some other grounds than just the
> > > > optimization alone. Regularization often helps in solving underdetermined
> > > > systems where there are an infinite number of solutions and we have to pick
> > > > a preferred solution.
> > > >
> > > > There isn't anything that says that you have to be optimizing the same kind
> > > > of function as the regularization. Thus k-means, which is inherently
> > > > optimizing squared error, can quite reasonably be regularized with L_1 (sum
> > > > of the absolute value of the centroids' coefficients).
> > > >
> > > > I haven't tried this at all seriously yet. L_1 regularization tends to
> > > > help drive toward sparsity, but it is normally used in convex problems
> > > > where we can guarantee a findable global optimum. The k-means problem,
> > > > however, is not convex so adding the regularization may screw things up in
> > > > practice. For text-like data, I have a strong intuition that the idealized
> > > > effect of L_1 should be very good, but the pragmatic effect may not be so
> > > > useful.
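To make the quoted idea a bit more concrete: if the per-cluster objective is taken to be sum_i ||x_i - c||^2 + lambda * ||c||_1, then minimizing it coordinate-wise gives a soft-thresholded mean, which is the mechanism that pushes many centroid coordinates to exactly zero. Below is a minimal NumPy sketch of that centroid step (not Mahout code; the function name and the lambda value are just placeholders for illustration):

import numpy as np

def l1_regularized_centroid(points, lam):
    # Closed-form minimizer of  sum_i ||x_i - c||^2 + lam * ||c||_1
    # for one cluster: the plain mean, soft-thresholded coordinate-wise.
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    mean = points.mean(axis=0)
    shrink = lam / (2.0 * n)   # threshold that falls out of the derivative
    return np.sign(mean) * np.maximum(np.abs(mean) - shrink, 0.0)

# Toy "text-like" cluster: mostly-zero coordinates plus a little noise.
cluster = np.array([[0.0, 3.0, 0.1, 0.0],
                    [0.0, 2.8, 0.0, 0.2],
                    [0.1, 3.1, 0.0, 0.0]])
print(l1_regularized_centroid(cluster, lam=1.0))
# -> the small noisy coordinates collapse to exactly 0; the big one only shrinks a little.

Note that the assignment step would still use Euclidean distance; only the centroid update changes. That is the distinction being drawn in the thread between using L_1 as the distance measure and using L_1 as a regularizer on the centroids.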
