Appreciate the replies!

> Yes this problem has been pretty much beaten to shreds. In
> fact so much so i wrote it into troubleshooting in section
> 5 of the manual
> (https://cwiki.apache.org/confluence/download/attachments/27832158/SSVD-CLI.pdf?version=17&modificationDate=1349999085000).
Aha, it looks like I had an out-of-date version of that file! I grabbed it
from here:

https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.data/SSVD-CLI.pdf

linked to from this page:

https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.html

The FAQ section wasn't yet written, it looks like.

> Perhaps I can suggest as a first measure to run a simple
> local MR job on your file which just counts # of rows in
> every map split. You should not see any that is less than
> k+p (110?). Since you are using local mode and not actual
> hdfs blocks, there may be some irregularities.

Indeed, this was the problem: I saw that all but the last split contained
889 rows … but that the final one was of size 107. I tinkered with the
parameters and this got me sorted; specifically, I added the following to
my ‘JobConf’:

    JobConf conf = new JobConf();
    conf.setLong("mapred.min.split.size", 75570350L);

where ‘75570350L’ was an empirically-derived ‘large-enough’ number. With
that change made, the SSVD completed successfully.

> Also since random matrices exhibit just as much variance
> in every direction, random projection will not be able to
> reduce problem efficiently. (meaning the singular vectors
> of the final solution will be all over the place compared
> to technically optimal solution). Tests on random matrices
> are not meaningful for precision assessment purposes; only
> inputs with good spectrum decay are (as in tests). But it
> looks like many people are trying to do just that.

Oh, right … I didn't have the real data available, but wanted to get some
idea of the feasibility of using the Mahout SSVD on input that was vaguely
the right size … I didn't expect anything meaningful to come out :~}

I'm going to get the actual data ready and run it ‘for real’ now, which
ought to produce something a bit more interesting.
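For anyone hitting the same failure: the per-split row-count constraint above comes down to plain arithmetic. Rows get packed a fixed number per split, the short tail split holds the remainder, and SSVD needs at least k+p rows in every split. The sketch below only illustrates that; the class name and the total row count are made up, while the 889 rows per full split, the 107-row tail, and k+p = 110 are the figures from this thread.

```java
// Sketch: why a short tail split breaks SSVD's per-split requirement.
// Hadoop packs roughly a fixed number of rows into each map split; the
// last split gets whatever remains, and SSVD needs >= k+p rows in every
// split. Only 889, 107 and k+p = 110 come from the thread; the total row
// count here is hypothetical.
public class SplitCheck {

    /** Rows in the smallest split when totalRows are consumed
     *  rowsPerSplit at a time and the tail split takes the remainder. */
    static long smallestSplitRows(long totalRows, long rowsPerSplit) {
        long rem = totalRows % rowsPerSplit;
        return rem == 0 ? rowsPerSplit : rem;
    }

    public static void main(String[] args) {
        long kPlusP = 110;                 // k+p, as in the thread
        long rowsPerSplit = 889;           // observed rows per full split
        long totalRows = 40 * 889 + 107;   // hypothetical: 40 full splits + tail

        long smallest = smallestSplitRows(totalRows, rowsPerSplit);
        System.out.println(smallest);            // 107, the failing tail split
        System.out.println(smallest < kPlusP);   // true -> SSVD's first pass fails
    }
}
```

Raising `mapred.min.split.size` past the file size, as above, simply forces everything into one split, which trivially satisfies the constraint.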
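The quoted point about random test inputs can also be seen numerically: a Gaussian random matrix has no preferred directions, so its singular spectrum is nearly flat and the leading singular vectors are poorly determined. The following is a standalone illustration, not Mahout code; the matrix sizes and seed are arbitrary, and it uses a textbook Jacobi eigensolver on AᵀA rather than any SSVD machinery.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch: a Gaussian random matrix has an almost flat singular spectrum.
// Singular values of A are the square roots of the eigenvalues of A^T A,
// computed here with a plain cyclic Jacobi sweep. Sizes and seed are
// arbitrary; this is illustration only, not Mahout code.
public class FlatSpectrum {

    /** Eigenvalues (ascending) of a symmetric matrix, via cyclic Jacobi rotations. */
    static double[] symEig(double[][] s) {
        int n = s.length;
        for (int sweep = 0; sweep < 50; sweep++) {
            for (int p = 0; p < n - 1; p++) {
                for (int q = p + 1; q < n; q++) {
                    if (Math.abs(s[p][q]) < 1e-12) continue;
                    // Rotation angle that zeroes the (p, q) entry
                    double th = 0.5 * Math.atan2(2 * s[p][q], s[q][q] - s[p][p]);
                    double c = Math.cos(th), t = Math.sin(th);
                    for (int k = 0; k < n; k++) {        // rotate rows p and q
                        double a = s[p][k], b = s[q][k];
                        s[p][k] = c * a - t * b;
                        s[q][k] = t * a + c * b;
                    }
                    for (int k = 0; k < n; k++) {        // rotate columns p and q
                        double a = s[k][p], b = s[k][q];
                        s[k][p] = c * a - t * b;
                        s[k][q] = t * a + c * b;
                    }
                }
            }
        }
        double[] ev = new double[n];
        for (int i = 0; i < n; i++) ev[i] = s[i][i];
        Arrays.sort(ev);
        return ev;
    }

    /** Singular values (ascending) of an m x n matrix, via eig(A^T A). */
    static double[] singularValues(double[][] a) {
        int m = a.length, n = a[0].length;
        double[][] g = new double[n][n];                 // g = A^T A
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int r = 0; r < m; r++)
                    g[i][j] += a[r][i] * a[r][j];
        double[] ev = symEig(g);
        double[] sv = new double[n];
        for (int i = 0; i < n; i++) sv[i] = Math.sqrt(Math.max(ev[i], 0));
        return sv;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int m = 200, n = 6;
        double[][] a = new double[m][n];
        for (double[] row : a)
            for (int j = 0; j < n; j++) row[j] = rnd.nextGaussian();

        double[] sv = singularValues(a);
        double ratio = sv[n - 1] / sv[0];   // largest over smallest singular value
        // For a tall Gaussian matrix this ratio stays small (roughly
        // (1 + sqrt(n/m)) / (1 - sqrt(n/m)), about 1.4 here): no direction
        // dominates, so a rank-k truncation is not a meaningful accuracy test.
        System.out.println(ratio < 3.0);
    }
}
```

An input with real structure (good spectrum decay) would instead show a large ratio between the leading and trailing singular values, which is exactly what makes a low-rank SSVD worthwhile.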