Mahalanobis distance, as defined, really depends on a good estimate of the covariance matrix, and that is inherently a dense computation.

Generally, you need to restrict the form of the covariance or put a strong prior on it to get a reasonable estimate in the kind of ill-posed problem you are describing (many more features than documents). One possibility is to assume that the covariance is sparse, with only a few off-diagonal terms. If you have some heuristic for deciding where to allow the non-zeros, you can build an estimator around that; not surprisingly, I recommend LLR (log-likelihood ratio) scoring for finding the interesting entries. Another option is a rank-limited approximation such as a truncated SVD, whose factors are not sparse but require far less storage than a fully dense covariance. Both of these can be viewed as decompositions, but that framing probably only really helps in the SVD case.
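To make the SVD route concrete, here is a rough sketch (my own; nothing like this ships in Mahout, and the class name and fields are made up). If you already have the top-k singular triples of the centered data matrix X ~ U S V', say from a stochastic SVD, then cov ~ V S^2 V' / (n - 1), and the squared Mahalanobis distance restricted to that subspace only needs the k projections of the residual:

import org.apache.mahout.math.Vector;

/**
 * Hypothetical sketch: squared Mahalanobis distance restricted to the
 * top-k singular subspace of the centered data matrix X ~ U * S * V'.
 * rightSingular[i] holds the i-th column of V and sigma[i] the i-th
 * singular value; both are assumed to come from an SVD run elsewhere.
 */
public class LowRankMahalanobis {
  private final Vector[] rightSingular;  // top-k columns of V
  private final double[] sigma;          // top-k singular values
  private final Vector mean;             // column means of the raw data
  private final int n;                   // number of rows (documents)

  public LowRankMahalanobis(Vector[] rightSingular, double[] sigma,
                            Vector mean, int n) {
    this.rightSingular = rightSingular;
    this.sigma = sigma;
    this.mean = mean;
    this.n = n;
  }

  // d' * pinv(C) * d with C ~ V * S^2 * V' / (n - 1)
  public double distanceSquared(Vector x) {
    Vector d = x.minus(mean);
    double sum = 0;
    for (int i = 0; i < sigma.length; i++) {
      double proj = rightSingular[i].dot(d);  // component of d along v_i
      sum += proj * proj * (n - 1) / (sigma[i] * sigma[i]);
    }
    return sum;
  }
}

Restricting to the top-k subspace amounts to using a pseudo-inverse of the covariance, which is the only sensible choice anyway when features outnumber documents and the full covariance is singular.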
If I were working on this problem, though, I would strongly question the need for Mahalanobis distance in the first place. It can be lovely, but it comes from an earlier era, and I doubt its applicability to large, sparse matrices of counts.

On Wed, Jul 20, 2011 at 10:28 AM, Marco Turchi <[email protected]> wrote:

> ...
> I have one last question:
> I need to compute the covariance matrix of some input data in order to
> compute the Mahalanobis distance. In my data, the number of features is
> bigger than the number of docs, and the data are very sparse. I cannot
> compute the covariance matrix using the classic approach, because it
> becomes very expensive in time and memory. Is there an implementation of
> an approximate way to compute it in Mahout? I have had a look in the
> library, but I cannot find one.
>
> thanks for your help
> Marco
>
>
> On 20 Jul 2011, at 19:15, Ted Dunning wrote:
>
>> Well, that is hard to understand without a complete test case. In the
>> first case you have a vector with one element constructed, while in the
>> second case you wind up with two elements and don't show the constructor.
>>
>> Write up a JIRA and attach a unit test.
>>
>> My strong expectation is that this is a case of refrigerator blindness.
>>
>> On Wed, Jul 20, 2011 at 9:22 AM, Marco Turchi <[email protected]> wrote:
>>
>>> Hi
>>> you are right about the sparse vector issue... but I'm constructing
>>> distV in the same way, changing only + and - in the variable mean. In
>>> both cases, I should have the same number of entries in the final
>>> vector.
>>>
>>> Thanks a lot for your help
>>> Marco
>>>
>>>
>>> On 20 Jul 2011, at 17:42, Ted Dunning wrote:
>>>
>>>> You constructed the first vector with a dimension of 1. It looks like
>>>> you constructed the second one with a larger dimension of 2.
>>>>
>>>> When you offset a sparse vector, all of the zeros become non-zero and
>>>> the vector becomes dense. This results in a bunch of cells being
>>>> created.
>>>>
>>>> On Wed, Jul 20, 2011 at 6:28 AM, marco turchi <[email protected]> wrote:
>>>>
>>>>> Dear All,
>>>>> I see strange behaviour when I use the plus method on Vector.
>>>>>
>>>>> I have a RandomAccessSparseVector; if I add a positive number, I get
>>>>> a new Vector where each element is the sum of the old value plus the
>>>>> positive number.
>>>>> But if I add a negative number, the new vector has one more entry:
>>>>>
>>>>> RandomAccessSparseVector distV = new RandomAccessSparseVector(1);
>>>>> distV.setQuick(0, 1);
>>>>> double mean = 1;
>>>>> RandomAccessSparseVector app =
>>>>>     (RandomAccessSparseVector) (distV.plus(mean));
>>>>>
>>>>> the output is
>>>>> {0:2.0}
>>>>>
>>>>> if I have
>>>>> double mean = -1;
>>>>> RandomAccessSparseVector app =
>>>>>     (RandomAccessSparseVector) (distV.plus(mean));
>>>>>
>>>>> the output is
>>>>> {1:1.0,0:-1.0}
>>>>>
>>>>> For sure I'm doing something wrong. Do you have any idea where the
>>>>> problem is?
>>>>>
>>>>> Thanks a lot in advance
>>>>> Marco
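P.S. Here is a self-contained reproduction of the plus() behaviour discussed above (my own sketch, with the cardinalities chosen to match the outputs Marco reports):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class SparsePlusDemo {
  public static void main(String[] args) {
    // Cardinality 1: plus() has exactly one cell to fill in.
    Vector v1 = new RandomAccessSparseVector(1);
    v1.setQuick(0, 1);
    System.out.println(v1.plus(1));   // prints {0:2.0}

    // Cardinality 2 with the value at index 1: plus(-1) also turns the
    // implicit zero at index 0 into -1.0, so two entries appear, which
    // matches the reported output {1:1.0,0:-1.0}.
    Vector v2 = new RandomAccessSparseVector(2);
    v2.setQuick(1, 2);
    System.out.println(v2.plus(-1));
  }
}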
