Just to follow up: I now have my real data, which is much
sparser than the totally-random data … and, unsurprisingly,
it exhibits a good bit more regularity, so it's compressible
to the point that the on-disc SequenceFile is small enough
that there's only a single map task, which, of course, means
that the problem that I was experiencing doesn't arise at
all.
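
As an aside, if you want to check up front how many splits
Hadoop will carve an input into (and how large each one is),
something along the following lines seems to work. It's only
a sketch against the old ‘org.apache.hadoop.mapred’ API, and
the class name ‘SplitCheck’ is just for illustration:

     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.io.Writable;
     import org.apache.hadoop.mapred.FileInputFormat;
     import org.apache.hadoop.mapred.InputSplit;
     import org.apache.hadoop.mapred.JobConf;
     import org.apache.hadoop.mapred.SequenceFileInputFormat;

     public class SplitCheck {
       public static void main(String[] args) throws Exception {
         JobConf conf = new JobConf();
         FileInputFormat.setInputPaths(conf, new Path(args[0]));
         // Ask the input format how it would carve up the file;
         // the second argument is only a hint, not a mandate.
         InputSplit[] splits =
             new SequenceFileInputFormat<Writable, Writable>()
                 .getSplits(conf, 1);
         System.out.println(splits.length + " split(s)");
         for (InputSplit s : splits) {
           System.out.println(s + "  length=" + s.getLength());
         }
       }
     }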

Incidentally, with the random data, I *did* get the same
behaviour when I ran on a ‘real’ Hadoop cluster (it's the
full Hadoop stack, running on a single box).

On Thu, Feb 14, 2013 at 9:56 AM, K.D.P. Ross <[email protected]> wrote:
> Appreciate the replies!
>
>> Yes, this problem has been pretty much beaten to shreds; in
>> fact, so much so that I wrote it into the troubleshooting
>> in section 5 of the manual
>> (https://cwiki.apache.org/confluence/download/attachments/27832158/SSVD-CLI.pdf?version=17&modificationDate=1349999085000).
>
> Aha, it looks like I had an out-of-date version of that
> file! I grabbed it from here:
>
>      https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.data/SSVD-CLI.pdf
>
> linked to from this page:
>
>      https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.html
>
> It looks like the FAQ section hadn't been written yet in
> that version.
>
>> Perhaps I can suggest, as a first measure, running a simple
>> local MR job on your file which just counts the # of rows
>> in every map split. You should not see any that are smaller
>> than k+p (110?). Since you are using local mode and not
>> actual HDFS blocks, there may be some irregularities.
>
> Indeed, this was the problem: I saw that all but the last
> split contained 889 rows … but that the final one held only
> 107 rows, i.e. fewer than k+p. I tinkered with the
> parameters and this got me sorted; specifically, I added the
> following to my ‘JobConf’:
>
>      // Force larger (hence fewer) splits so that no split
>      // ends up with fewer than k+p rows.
>      JobConf conf = new JobConf();
>      conf.setLong("mapred.min.split.size", 75570350L);
>
> where ‘75570350L’ was an empirically derived ‘large-enough’
> number. With that change made, the SSVD completed
> successfully.
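
Since the per-split row count comes up a lot, here is roughly
what such a counting job's mapper can look like. This is only
a sketch against the old ‘org.apache.hadoop.mapred’ API; the
class name ‘SplitRowCount’ is made up, though
‘map.input.file’ and ‘map.input.start’ are the standard
old-API properties identifying the split a map task is
reading:

     import java.io.IOException;
     import org.apache.hadoop.io.IntWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.io.Writable;
     import org.apache.hadoop.mapred.JobConf;
     import org.apache.hadoop.mapred.MapReduceBase;
     import org.apache.hadoop.mapred.Mapper;
     import org.apache.hadoop.mapred.OutputCollector;
     import org.apache.hadoop.mapred.Reporter;

     // Emits one (split id, row count) pair per map task. Run
     // it map-only (setNumReduceTasks(0)) over the SequenceFile
     // and scan the output for any count below k+p.
     public class SplitRowCount extends MapReduceBase
         implements Mapper<Writable, Writable, Text, IntWritable> {

       private String splitId = "?";
       private int rows = 0;
       private OutputCollector<Text, IntWritable> out;

       public void configure(JobConf job) {
         // File and byte offset of the split this task reads.
         splitId = job.get("map.input.file", "?")
                 + "@" + job.get("map.input.start", "?");
       }

       public void map(Writable key, Writable value,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
         out = output;  // keep a handle so close() can emit
         ++rows;
       }

       public void close() throws IOException {
         if (out != null) {
           out.collect(new Text(splitId), new IntWritable(rows));
         }
       }
     }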
>
>> Also, since random matrices exhibit just as much variance
>> in every direction, random projection will not be able to
>> reduce the problem efficiently (meaning the singular
>> vectors of the final solution will be all over the place
>> compared to the technically optimal solution). Tests on
>> random matrices are not meaningful for precision-assessment
>> purposes; only inputs with good spectrum decay are (as in
>> the tests). But it looks like many people are trying to do
>> just that.
>
> Oh, right … I didn't have the real data available but wanted
> to get some idea of the feasibility of using the Mahout SSVD
> on input that was vaguely the right size … I didn't expect
> anything meaningful to come out :~}
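
(To put a number on the spectrum point: if the singular
values sigma_1 >= … >= sigma_n are all roughly some common
sigma, the best rank-k approximation still leaves squared
Frobenius error sum_{i>k} sigma_i^2 ≈ (n − k)·sigma^2 behind,
i.e. it captures only about k/n of the total mass, so no
k-dimensional projection, randomised or otherwise, can
summarise the matrix well. With decay like sigma_i ∝ i^(-a)
for a > 1/2, the tail sum is small and k vectors capture
nearly everything; that's the ‘good spectrum decay’ being
asked for.)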
>
> I'm going to get the actual data ready and run it ‘for real’
> now, which ought to produce something a bit more
> interesting.
