Sidenote: I was able to process a matrix with 3B non-zeros in 3 hours on a
6-machine cluster with Mahout's SSVD.

On 13.12.2013 22:50, Dmitriy Lyubimov wrote:
> PS: If I am not mistaken, Nathan's largest experiment (wiki-all) crunched
> ~8B non-zero elements for a sparse matrix of geometry 37M x 38M, and it took
> him 22 hours to compute on his setup (4 EC2 large worker nodes?) with 1 power
> iteration (quite good accuracy). But an analytical extrapolation to 16-32
> nodes looks fairly good to me for a problem of that size. ~30 machines is
> not an extraordinary cluster by any measure today.
> 
> 
> On Fri, Dec 13, 2013 at 1:17 PM, Dmitriy Lyubimov <[email protected]> wrote:
> 
>>
>>
>>
>> On Fri, Dec 13, 2013 at 12:42 PM, Ron Ayoub <[email protected]> wrote:
>>
>>> I'm doing some up-front research on implementing LSI and on the choice of
>>> tools. I understand Mahout provides an out-of-core implementation of
>>> Stochastic SVD. The web site uses the words 'reasonable size problems'.
>>> Would a sparse matrix of 1,000,000 x 1,000,000 with some 250,000,000
>>> nonzero entries be out of the question?
>>
>>
>> For a performance/accuracy assessment, Nathan's dissertation [1], pp. 139
>> and onwards, is so far the best source I know of.
>>
>> Nathan compares performance and assesses bottlenecks on at least two
>> interesting data sets -- wiki and wiki-max. He experienced the bottleneck
>> in the matrix multiplication operation (though he may have done the testing
>> before certain improvements were applied to the matrix-matrix part of the
>> power iterations -- I am still hazy on that).
>>
>> [1]
>> http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf
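>>
>> To show where those matrix-matrix products sit, here is a rough,
>> single-machine NumPy sketch of randomized SVD with power iterations in the
>> spirit of Halko et al. Mahout's SSVD is a distributed, out-of-core
>> implementation, so take the names and structure here as illustrative only:
>>
>> import numpy as np
>>
>> def rand_svd_sketch(A, k, p=10, q=1):
>>     """Rough randomized SVD sketch; k = target rank, p = oversampling,
>>     q = number of power iterations."""
>>     m, n = A.shape
>>     # Random projection: the first big matrix-matrix product, A * Omega.
>>     omega = np.random.randn(n, k + p)
>>     Q, _ = np.linalg.qr(A @ omega)
>>     # Each power iteration costs two more products with A -- this is the
>>     # step discussed above as the bottleneck in the distributed setting.
>>     for _ in range(q):
>>         Q, _ = np.linalg.qr(A.T @ Q)
>>         Q, _ = np.linalg.qr(A @ Q)
>>     # The projected problem B is only (k+p) x n and is cheap to decompose.
>>     B = Q.T @ A
>>     Uhat, s, Vt = np.linalg.svd(B, full_matrices=False)
>>     return (Q @ Uhat)[:, :k], s[:k], Vt[:k, :]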
>>
>> I have great hope that this matrix-multiplication bottleneck could be
>> further addressed by punting MapReduce out of the equation and replacing it
>> with Bagel or GraphX broadcast operations in the upcoming Spark 0.9. I plan
>> to address that in the Mahout-on-Spark part of the code, but I am still
>> waiting for the Spark project to rehash its graph-based computation
>> approach (there is a sense that GraphX should be superior to the existing
>> Bagel API in Spark when it comes to broadcasting techniques).
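>>
>> As a very rough illustration of the broadcast idea (not Mahout's actual
>> code, and written against the plain Spark RDD API rather than Bagel or
>> GraphX): the small dense factor is shipped to every worker once, so each
>> partition of the big sparse matrix computes its slice of the product
>> locally and no shuffle is needed.
>>
>> from pyspark import SparkContext
>> import numpy as np
>>
>> sc = SparkContext("local[*]", "broadcast-matmul-sketch")
>>
>> # rows: RDD of (row index, sparse row as a list of (col, value)) pairs.
>> rows = sc.parallelize([(0, [(2, 1.0), (5, 3.0)]),
>>                        (1, [(0, 2.0)])])
>>
>> n, l = 6, 3                    # toy sizes: matrix width and sketch width
>> omega = np.random.randn(n, l)  # small dense factor, fits in memory
>> omega_bc = sc.broadcast(omega) # sent to each worker once
>>
>> def times_omega(row):
>>     i, entries = row
>>     y = np.zeros(l)
>>     for j, v in entries:       # y_i = sum_j a_ij * omega_j (row of A*Omega)
>>         y += v * omega_bc.value[j]
>>     return (i, y)
>>
>> Y = rows.map(times_omega)      # distributed A * Omega with no shuffle
>> print(Y.collect())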
>>
>>
>>> If so, what tools out there can do that? For instance, ARPACK.
>>
>>
>> AFAIK nobody to date has cared to do the comparison with ARPACK.
>>
>>
>>> Regardless, how does Mahout SSVD compare to ARPACK? These seem to be the
>>> options out there that I have found. Thanks.
