Just to follow up: I now have my real data, which is much sparser than the totally-random data … and, unsurprisingly, it exhibits a good bit more regularity, so it compresses to the point that the on-disc SequenceFile is small enough to be handled as a single map split, which, of course, means that the problem I was experiencing doesn't arise at all.
Incidentally, with the random data, I *did* get the same behaviour
when I ran on a ‘real’ Hadoop cluster. (It's the full Hadoop stack,
running on a single box.)

On Thu, Feb 14, 2013 at 9:56 AM, K.D.P. Ross <[email protected]> wrote:
> Appreciate the replies!
>
>> Yes, this problem has been pretty much beaten to shreds. In
>> fact, so much so that I wrote it into the troubleshooting
>> section (section 5) of the manual
>> (https://cwiki.apache.org/confluence/download/attachments/27832158/SSVD-CLI.pdf?version=17&modificationDate=1349999085000).
>
> Aha, it looks like I had an out-of-date version of that
> file! I grabbed it from here:
>
>     https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.data/SSVD-CLI.pdf
>
> linked to from this page:
>
>     https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.html
>
> It looks like the FAQ section hadn't yet been written.
>
>> Perhaps I can suggest, as a first measure, running a simple
>> local MR job on your file which just counts the number of
>> rows in every map split. You should not see any that is
>> smaller than k+p (110?). Since you are using local mode and
>> not actual HDFS blocks, there may be some irregularities.
>
> Indeed, this was the problem: I saw that all but the last
> split contained 889 rows … but that the final one held only
> 107. I tinkered with the parameters and this got me sorted;
> specifically, I added the following to my ‘JobConf’:
>
>     // Force larger input splits so that no split holds
>     // fewer than k+p rows.
>     JobConf conf = new JobConf();
>     conf.setLong("mapred.min.split.size", 75570350L);
>
> where ‘75570350L’ was an empirically derived ‘large enough’
> number. With that change made, the SSVD completed
> successfully.
>
>> Also, since random matrices exhibit just as much variance
>> in every direction, random projection will not be able to
>> reduce the problem efficiently (meaning the singular vectors
>> of the final solution will be all over the place compared
>> to the technically optimal solution). Tests on random
>> matrices are not meaningful for precision-assessment
>> purposes; only inputs with good spectrum decay are (as in
>> tests). But it looks like many people are trying to do just
>> that.
>
> Oh, right … I didn't have the real data available but wanted
> to get some idea of the feasibility of using the Mahout SSVD
> on input that was vaguely the right size … I didn't expect
> anything meaningful to come out :~}
>
> I'm going to get the actual data ready and run it ‘for real’
> now, which ought to produce something a bit more
> interesting.
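P.S. In case it's useful to anyone else who trips over this, the
row-counting diagnostic is only a few dozen lines. Here's a rough
sketch against the old-style ‘mapred’ API (to match the ‘JobConf’
snippet above); the class name ‘SplitRowCounter’ and the per-split
key built from ‘map.input.file’/‘map.input.start’ are just my own
choices, and it assumes the input is an ordinary SequenceFile:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;

    public class SplitRowCounter {

      // Tags every row with an identifier for the split it came
      // from; 'map.input.file' and 'map.input.start' are set per
      // task by the old-style FileInputFormat machinery.
      public static class CountMapper extends MapReduceBase
          implements Mapper<Writable, Writable, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1L);
        private Text splitId;

        @Override
        public void configure(JobConf job) {
          splitId = new Text(job.get("map.input.file", "?") + "@"
              + job.get("map.input.start", "?"));
        }

        @Override
        public void map(Writable key, Writable value,
            OutputCollector<Text, LongWritable> out, Reporter reporter)
            throws IOException {
          out.collect(splitId, ONE);
        }
      }

      // Sums the per-split row counts; the output should show no
      // split smaller than k + p.
      public static class SumReducer extends MapReduceBase
          implements Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        public void reduce(Text key, Iterator<LongWritable> values,
            OutputCollector<Text, LongWritable> out, Reporter reporter)
            throws IOException {
          long rows = 0L;
          while (values.hasNext()) {
            rows += values.next().get();
          }
          out.collect(key, new LongWritable(rows));
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(SplitRowCounter.class);
        conf.setJobName("split-row-counter");
        conf.setInputFormat(SequenceFileInputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setMapperClass(CountMapper.class);
        conf.setReducerClass(SumReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        conf.setNumReduceTasks(1);
        JobClient.runJob(conf);
      }
    }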
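And rather than hunting for an empirically ‘large-enough’ constant
like ‘75570350L’, one could, I think, simply pin the minimum split
size to the file's actual length, which forces the whole (small)
input into one split; ‘inputPath’ here is a hypothetical placeholder
for the real input location:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    // Hypothetical setup: 'inputPath' points at the SSVD input
    // SequenceFile.
    JobConf conf = new JobConf();
    Path inputPath = new Path("/path/to/input.seq");
    FileSystem fs = inputPath.getFileSystem(conf);
    long fileBytes = fs.getFileStatus(inputPath).getLen();
    // With the minimum split size at least the file length, the
    // input becomes a single split, so no short final split can
    // fall below k+p rows.
    conf.setLong("mapred.min.split.size", fileBytes);

Of course, this only makes sense while the input comfortably fits
one mapper; on genuinely large inputs you'd instead want splits
merely big enough that every one of them holds at least k+p rows.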
