Appreciate the replies!

> Yes this problem has been pretty much beaten to shreds. In
> fact so much so i wrote it into troubleshooting in section
> 5 of the manual
> (https://cwiki.apache.org/confluence/download/attachments/27832158/SSVD-CLI.pdf?version=17&modificationDate=1349999085000).

Aha, it looks like I had an out-of-date version of that
file! I grabbed it from here:

     https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.data/SSVD-CLI.pdf

linked to from this page:

     https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.html

It looks like the FAQ section hadn't been written yet in
that version.

> Perhaps I can suggest as a first measure to run a simple
> local MR job on your file which just counts # of rows in
> every map split. You should not see any that is less than
> k+p (110?). Since you are using local mode and not actual
> hdfs blocks, there may be some irregularities.

Indeed, this was the problem: all but the last split
contained 889 rows, but the final one held only 107, which
is below k+p. I tinkered with the parameters and this got me
sorted; specifically, I added the following to my ‘JobConf’:

     JobConf conf = new JobConf();
     // force larger splits so the last one isn't too small
     conf.setLong("mapred.min.split.size", 75570350L);

where ‘75570350L’ was an empirically-derived ‘large-enough’
number. With that change made, the SSVD completed
successfully.

> Also since random matrices exhibit just as much variance
> in every direction, random projection will not be able to
> reduce problem efficiently. (meaning the singular vectors
> of the final solution will be all over the place compared
> to technically optimal solution). Tests on random matrices
> are not meaningful for precision assessment purposes; only
> inputs with good spectrum decay are (as in tests). But it
> looks like many people are trying to do just that.

Oh, right … I didn't have the real data available but wanted
to get some idea of the feasibility of using the Mahout SSVD
on input that was vaguely the right size … I didn't expect
anything meaningful to come out :~}

I'm going to get the actual data ready and run it ‘for real’
now, which ought to produce something a bit more
interesting.
