Thanks Ted. That is good news.

On Jun 7, 2011, at 11:12 PM, Ted Dunning wrote:
> I think that incorporating mean subtraction into the SSVD code should
> be relatively straightforward. The trick is that you have to project
> the original matrix and the mean separately and then combine the
> results. [See the first sketch below.] There are later steps that
> require similar mods, but since it is all based on a simple product
> that can always be factored, it should be straightforward.
>
> On Tue, Jun 7, 2011 at 4:06 PM, Eshwaran Vijaya Kumar
> <[email protected]> wrote:
>> One of the algorithms we would like to try out is PCA, where it seems to be
>> good to avoid subtracting the mean matrix (m m^T) from the data matrix, say
>> A, so as not to destroy sparsity (refer:
>> http://search.lucidimagination.com/search/document/cd4c36c2f27080d/regarding_pca_implementation#2eae2e2861213ae0
>> ). From what I understand of the Lanczos algorithm, it shouldn't be too
>> hard to modify the solver code so that I can pass A and m m^T without
>> combining them into a single matrix and then do repeated multiplications.
>> [See the second sketch below.] Unfortunately, I have not yet had time to
>> look at SSVD, so it would be extremely helpful if someone who has looked
>> at the problem more closely could comment on how to make these (potential?)
>> modifications to the SSVD code to avoid having to deal with dense matrices.
>>
>> Thanks in advance
>> Eshwaran
>>
>> On Jun 6, 2011, at 3:32 AM, Ted Dunning wrote:
>>
>>> I would push for SSVD as well if you want a real SVD.
>>>
>>> Also, I don't think that you lose information about which vectors are which
>>> (or, as Jake put it, "what they mean"). The stochastic decomposition gives a
>>> very accurate estimate of the top-k singular vectors. It does this by using
>>> the random projection to project the top singular vectors into a subspace
>>> and then correcting the results obtained back into the original space.
>>> [See the third sketch below.] This is not the same as simply doing the
>>> decomposition on the random projection and then using that decomposition.
>>>
>>> On Fri, Jun 3, 2011 at 8:16 PM, Eshwaran Vijaya Kumar
>>> <[email protected]> wrote:
>>>
>>>> Hi Jake,
>>>> Thank you for your reply. Good to know that we can use Lanczos. I will
>>>> have to look into the SSVD algorithm more closely to figure out whether
>>>> the information loss is worth the gain in speed (and computational
>>>> efficiency). I guess we will have to run more tests to see which works
>>>> best before deciding which path to take.
>>>>
>>>> Esh
>>>>
>>>> On Jun 3, 2011, at 6:23 PM, Jake Mannix wrote:
>>>>
>>>>> With 50K columns, you're well within the "sweet spot" for traditional SVD
>>>>> via Lanczos, so give it a try.
>>>>>
>>>>> SSVD will probably run faster, but you lose some information on what the
>>>>> singular vectors "mean". If you don't need this information, SSVD may be
>>>>> better for you.
>>>>>
>>>>> What would be awesome for *us* is if you tried both and told us what you
>>>>> found, in terms of performance and relevance. :)
>>>>>
>>>>> -jake
>>>>>
>>>>> On Jun 3, 2011 4:49 PM, "Eshwaran Vijaya Kumar"
>>>>> <[email protected]> wrote:
>>>>>
>>>>> Hello all,
>>>>> We are trying to build a clustering system which will have an SVD
>>>>> component. I believe Mahout has two SVD solvers: DistributedLanczosSolver
>>>>> and SSVD. Could someone give me some tips on which would be the better
>>>>> choice of solver, given that the data will be roughly 100 million rows
>>>>> with roughly 50K dimensions per row (100 million x 50,000)? We will be
>>>>> working with text data, so the resultant matrix should be relatively
>>>>> sparse to begin with.
>>>>>
>>>>> Thanks
>>>>> Eshwaran
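
First sketch. For concreteness, here is a minimal illustration of the
factoring Ted describes: the projection of the mean-centered matrix splits as
(A - 1 m^T) Omega = A Omega - 1 (m^T Omega), so the dense centered matrix
never has to be formed. This is illustrative Java, not the actual Mahout SSVD
code; the class and method names are made up, and a CSR-like sparse layout for
A is assumed.

    public class CenteredProjection {

        /**
         * Computes Y = (A - 1*m^T) * Omega as A*Omega - 1*(m^T*Omega),
         * touching only the nonzero entries of the sparse matrix A.
         *
         * cols[i]/vals[i] hold the column indices and values of row i of A,
         * m is the vector of column means, omega is a dense n x k test matrix.
         */
        static double[][] project(int[][] cols, double[][] vals,
                                  double[] m, double[][] omega) {
            int rows = cols.length, k = omega[0].length;

            // t = m^T * Omega: one k-vector shared by every row of the result.
            double[] t = new double[k];
            for (int j = 0; j < m.length; j++)
                for (int p = 0; p < k; p++)
                    t[p] += m[j] * omega[j][p];

            double[][] y = new double[rows][k];
            for (int i = 0; i < rows; i++) {
                // Sparse row i of A times Omega.
                for (int nz = 0; nz < cols[i].length; nz++) {
                    int j = cols[i][nz];
                    for (int p = 0; p < k; p++)
                        y[i][p] += vals[i][nz] * omega[j][p];
                }
                // Subtract the shared correction 1*(m^T*Omega).
                for (int p = 0; p < k; p++)
                    y[i][p] -= t[p];
            }
            return y;
        }
    }

The later SSVD steps Ted mentions factor the same way, since each is linear in
the centered matrix.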
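
Second sketch. The same idea answers the Lanczos question: the solver only
touches the operator through matrix-vector products, so centering folds in as
a rank-one correction. Using A^T 1 = n m, one gets
(A - 1 m^T)^T (A - 1 m^T) x = A^T(A x) - n m (m^T x). A sketch under the same
sparse layout; the method name is hypothetical, not Mahout's API (Mahout's
DistributedLanczosSolver consumes such products through its own interfaces).

    public class CenteredGramOperator {

        /**
         * Returns (A - 1*m^T)^T (A - 1*m^T) x = A^T(A x) - n * m * (m^T x),
         * using only sparse products with A plus a rank-one correction.
         */
        static double[] timesSquaredCentered(int[][] cols, double[][] vals,
                                             int nCols, double[] m, double[] x) {
            int nRows = cols.length;

            // y = A x (sparse product)
            double[] y = new double[nRows];
            for (int i = 0; i < nRows; i++)
                for (int nz = 0; nz < cols[i].length; nz++)
                    y[i] += vals[i][nz] * x[cols[i][nz]];

            // z = A^T y (sparse product)
            double[] z = new double[nCols];
            for (int i = 0; i < nRows; i++)
                for (int nz = 0; nz < cols[i].length; nz++)
                    z[cols[i][nz]] += vals[i][nz] * y[i];

            // Rank-one correction: z -= n * m * (m^T x)
            double mtx = 0;
            for (int j = 0; j < nCols; j++) mtx += m[j] * x[j];
            for (int j = 0; j < nCols; j++) z[j] -= nRows * m[j] * mtx;
            return z;
        }
    }

Whether the rank-one term carries the factor n depends on whether one iterates
on the centered Gram matrix itself or on the covariance (the Gram matrix
divided by n).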

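Third sketch. The "project into a subspace and correct back" structure Ted
describes for the stochastic decomposition corresponds to the randomized
range-finding step (in the style of Halko et al.). This is a toy, dense,
in-memory sketch, not the distributed Mahout SSVD code: form Y = A Omega for a
Gaussian test matrix Omega, orthonormalize Y to Q, and compute the small
matrix B = Q^T A. An exact SVD of B = U~ S V^T is then cheap, and U = Q U~
carries the singular vectors back into the original space.

    import java.util.Random;

    public class RangeFinder {

        /**
         * Returns B = Q^T A, where Q is an orthonormal basis of range(A * Omega).
         * The SVD of the small k x n matrix B, B = U~ S V^T, is then computed by
         * any dense solver, and U = Q U~ maps the left singular vectors back into
         * the original space -- the "correcting back" step.
         */
        static double[][] smallFactor(double[][] a, int k, long seed) {
            int rows = a.length, cols = a[0].length;
            Random rnd = new Random(seed);

            // Y = A * Omega, with Omega a Gaussian cols x k test matrix.
            double[][] y = new double[rows][k];
            for (int j = 0; j < cols; j++)
                for (int p = 0; p < k; p++) {
                    double w = rnd.nextGaussian();   // Omega[j][p]
                    for (int i = 0; i < rows; i++)
                        y[i][p] += a[i][j] * w;
                }

            // Orthonormalize the columns of Y in place (modified Gram-Schmidt) -> Q.
            for (int p = 0; p < k; p++) {
                for (int q = 0; q < p; q++) {
                    double dot = 0;
                    for (int i = 0; i < rows; i++) dot += y[i][p] * y[i][q];
                    for (int i = 0; i < rows; i++) y[i][p] -= dot * y[i][q];
                }
                double norm = 0;
                for (int i = 0; i < rows; i++) norm += y[i][p] * y[i][p];
                norm = Math.sqrt(norm);
                for (int i = 0; i < rows; i++) y[i][p] /= norm;
            }

            // B = Q^T A (k x n), small enough for an exact in-memory SVD.
            double[][] b = new double[k][cols];
            for (int p = 0; p < k; p++)
                for (int i = 0; i < rows; i++)
                    for (int j = 0; j < cols; j++)
                        b[p][j] += y[i][p] * a[i][j];
            return b;
        }
    }

In practice A stays sparse and distributed and the projection and QR steps are
done blockwise; the sketch only shows why the recovered singular vectors live
in the original space rather than in the random projection.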