If it's very sparse, you can try https://issues.apache.org/jira/browse/MAHOUT-703
Instead of minimizing reconstruction error, it tries to enforce that the words in your document rank higher than words that are not present in it.

Some example results from this approach:
https://docs.google.com/present/edit?id=0AQC247eq7Jp5ZGZ6NXpyOWhfMjlmM2pzdjRkZw&authkey=CNj2h98P&hl=en_US

On Fri, Jun 3, 2011 at 4:48 PM, Eshwaran Vijaya Kumar <[email protected]> wrote:

> Hello all,
> We are trying to build a clustering system which will have an SVD
> component. I believe Mahout has two SVD solvers: DistributedLanczosSolver
> and SSVD. Could someone give me some tips on which would be a better choice
> of a solver given that the size of the data will be roughly 100 million rows
> with each row having roughly 50 K dimensions (100 million X 50000 ). We will
> be working with text data so the resultant matrix should be relatively
> sparse to begin with.
>
> Thanks
> Eshwaran

--
Yee Yang Li Hector
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)
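
To make the ranking idea in the reply concrete (words in a document should score higher than words not in it, rather than the scores reproducing the term vector exactly), here is a minimal Python sketch. It is not the MAHOUT-703 code: the function names, the hinge margin of 1.0 and the toy scores are all assumptions made up for illustration, and a pairwise hinge loss is only one common way to express such a ranking objective.

```python
def reconstruction_error(doc_vector, scores):
    """Squared error between the document's term weights and the model scores."""
    return sum((doc_vector.get(term, 0.0) - s) ** 2 for term, s in scores.items())


def pairwise_ranking_loss(doc_terms, scores, margin=1.0):
    """Hinge loss over every (present word, absent word) pair.

    A pair contributes nothing once the present word outscores the
    absent word by at least `margin` (an assumed hyperparameter).
    """
    present = [t for t in scores if t in doc_terms]
    absent = [t for t in scores if t not in doc_terms]
    loss = 0.0
    for p in present:
        for a in absent:
            loss += max(0.0, margin - (scores[p] - scores[a]))
    return loss


# Hypothetical toy document and model scores, for illustration only.
doc_vector = {"svd": 1.0, "mahout": 1.0}
doc_terms = set(doc_vector)

good_scores = {"svd": 2.5, "mahout": 1.8, "banana": 0.2, "guitar": -0.4}
bad_scores = {"svd": 0.1, "mahout": 1.8, "banana": 0.2, "guitar": -0.4}

print("ranking loss (good):", pairwise_ranking_loss(doc_terms, good_scores))
print("ranking loss (bad): ", pairwise_ranking_loss(doc_terms, bad_scores))
print("reconstruction error (good):", reconstruction_error(doc_vector, good_scores))
```

With `good_scores` every present word beats every absent word by more than the margin, so the ranking loss is zero even though the scores are far from the original term weights and the reconstruction error is not. With `bad_scores`, "svd" is scored below "banana", so the ranking loss becomes positive.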
