I would encourage you to take a stab at a patch on this. You aren't the only person to have expressed interest in scaling PCA, but you aren't a member of a large horde, either.
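As a starting point, here is the identity that makes the patch tractable. For PCA you want the SVD of the centered matrix B = A - 1 m^T, where m is the vector of column means and 1 is a column of ones. B is dense if you ever form it, but every product in the SSVD pipeline can be split so that only the sparse A is touched; for the projection step, Y = B Omega = A Omega - 1 (m^T Omega). A minimal NumPy/SciPy sketch of just that step (illustrative only, toy sizes, not the actual Mahout SSVD code):

    import numpy as np
    from scipy.sparse import random as sparse_random

    n, d, k = 1000, 500, 10                   # toy sizes; the real data is ~1e8 x 5e4
    A = sparse_random(n, d, density=0.01, format="csr")
    m = np.asarray(A.mean(axis=0)).ravel()    # column means, length d
    Omega = np.random.randn(d, k)             # random projection matrix

    # Y = (A - 1 m^T) Omega = A Omega - 1 (m^T Omega); broadcasting supplies the
    # rank-one correction, so the dense centered matrix is never formed.
    Y = A @ Omega - m @ Omega

    # sanity check against the direct (dense, small-scale-only) computation
    B = A.toarray() - np.outer(np.ones(n), m)
    assert np.allclose(Y, B @ Omega)

The later products factor the same way, e.g. Q^T B = Q^T A - (Q^T 1) m^T, which is why the change should stay mechanical.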
On Wed, Jun 8, 2011 at 7:39 AM, Eshwaran Vijaya Kumar <[email protected]> wrote:
> Thanks Ted. That is good news.
>
> On Jun 7, 2011, at 11:12 PM, Ted Dunning wrote:
>
>> I think that incorporating mean subtraction into the SSVD code should
>> be relatively straightforward. The trick is that you have to project
>> the original matrix and the mean separately and then combine the
>> results. There are later steps that require similar mods, but since it
>> is all based on a simple product that can always be factored, it
>> should be straightforward.
>>
>> On Tue, Jun 7, 2011 at 4:06 PM, Eshwaran Vijaya Kumar
>> <[email protected]> wrote:
>>> One of the algorithms we would like to try out is PCA, where it seems
>>> best to avoid subtracting the mean matrix (m*m^T) from the data
>>> matrix, say A, so as not to destroy sparsity (refer:
>>> http://search.lucidimagination.com/search/document/cd4c36c2f27080d/regarding_pca_implementation#2eae2e2861213ae0
>>> ). From what I understand about the Lanczos algorithm, it shouldn't
>>> be too hard to modify the solver code so that I can pass A and m*m^T
>>> without combining them into a single matrix, and then do repeated
>>> multiplications. Unfortunately, I have not yet had time to look at
>>> SSVD, so it would be extremely helpful if someone who has looked at
>>> the problem more closely could comment on how to make these
>>> (potential?) modifications to the SSVD code to avoid having to deal
>>> with dense matrices.
>>>
>>> Thanks in advance
>>> Eshwaran
>>>
>>> On Jun 6, 2011, at 3:32 AM, Ted Dunning wrote:
>>>
>>>> I would push for SSVD as well if you want a real SVD.
>>>>
>>>> Also, I don't think that you lose information about which vectors
>>>> are which (or, as Jake put it, "what they mean"). The stochastic
>>>> decomposition gives a very accurate estimate of the top-k singular
>>>> vectors. It does this by using the random projection to project the
>>>> top singular vectors into a sub-space and then correcting the
>>>> results obtained back into the original space. This is not the same
>>>> as simply doing the decomposition on the random projection and then
>>>> using that decomposition.
>>>>
>>>> On Fri, Jun 3, 2011 at 8:16 PM, Eshwaran Vijaya Kumar
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi Jake,
>>>>> Thank you for your reply. Good to know that we can use Lanczos. I
>>>>> will have to look into the SSVD algorithm more closely to figure
>>>>> out whether the information loss is worth the gain in speed (and
>>>>> computational efficiency). I guess we will have to run more tests
>>>>> to see which works best before deciding which path to take.
>>>>>
>>>>> Esh
>>>>>
>>>>> On Jun 3, 2011, at 6:23 PM, Jake Mannix wrote:
>>>>>
>>>>>> With 50k columns, you're well within the "sweet spot" for
>>>>>> traditional SVD via Lanczos, so give it a try.
>>>>>>
>>>>>> SSVD will probably run faster, but you lose some information on
>>>>>> what the singular vectors "mean". If you don't need this
>>>>>> information, SSVD may be better for you.
>>>>>>
>>>>>> What would be awesome for *us* is if you tried both and told us
>>>>>> what you found, in terms of performance and relevance. :)
>>>>>>
>>>>>> -jake
>>>>>>
>>>>>> On Jun 3, 2011 4:49 PM, "Eshwaran Vijaya Kumar"
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Hello all,
>>>>>> We are trying to build a clustering system which will have an SVD
>>>>>> component. I believe Mahout has two SVD solvers:
>>>>>> DistributedLanczosSolver and SSVD. Could someone give me some tips
>>>>>> on which would be the better choice of solver, given that the data
>>>>>> will be roughly 100 million rows with roughly 50K dimensions per
>>>>>> row (100 million x 50,000)? We will be working with text data, so
>>>>>> the resulting matrix should be relatively sparse to begin with.
>>>>>>
>>>>>> Thanks
>>>>>> Eshwaran
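A footnote on the Lanczos route discussed in the thread above: since B^T B = A^T A - n m m^T (with n the number of rows), the solver never needs the centered matrix either; it only needs a matrix-times-vector callback that applies m m^T as a rank-one correction after the sparse products. A small SciPy sketch of that callback (illustrative; in Mahout the same arithmetic would go into the timesSquared-style path of the distributed matrix):

    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import LinearOperator, eigsh

    n, d = 1000, 500                          # toy sizes
    A = sparse_random(n, d, density=0.01, format="csr")
    m = np.asarray(A.mean(axis=0)).ravel()    # column means

    # (A - 1 m^T)^T (A - 1 m^T) v == A^T (A v) - n (m . v) m,
    # so A stays sparse and m m^T is applied as a rank-one update.
    def centered_gram_matvec(v):
        return A.T @ (A @ v) - n * (m @ v) * m

    op = LinearOperator((d, d), matvec=centered_gram_matvec, dtype=np.float64)
    vals, vecs = eigsh(op, k=10)              # Lanczos on the implicit B^T B

The eigenvalues that come back are the squared singular values of the centered matrix, obtained without ever densifying anything.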

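And to unpack Ted's point above about SSVD correcting results back into the original space: the stochastic method orthonormalizes the projected range of A and then lifts the small factorization back, so the returned U and V live in A's own row and column spaces rather than in the random subspace. A bare-bones dense sketch of that recipe (the basic Halko/Martinsson/Tropp scheme, minus the blocking and refinements a production SSVD adds):

    import numpy as np

    def randomized_svd_sketch(A, k, p=10):
        """Project, orthonormalize, then correct back to the original space."""
        Omega = np.random.randn(A.shape[1], k + p)  # oversampled random test matrix
        Q, _ = np.linalg.qr(A @ Omega)              # orthonormal basis for range(A Omega)
        Ub, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
        return Q @ Ub[:, :k], s[:k], Vt[:k, :]      # lift left vectors back into R^n

    # exact-rank toy case: recovery is essentially exact
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 80))
    U, s, Vt = randomized_svd_sketch(A, k=5)
    assert np.allclose(A, (U * s) @ Vt)

On an exactly rank-k matrix the recovery is exact up to rounding, which is the sense in which the singular vectors keep their meaning; for the real 100 million x 50,000 matrix the same steps run out of core, which is what the SSVD job is for.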