So - it's always a little embarrassing saying "I don't understand", but here goes. I can't claim to have strong linear algebra skills, and don't mind admitting that, but I (hopefully) have a rough idea of what's going on. I have now got to the basic stage where I've at least put a matrix into the Hadoop SVD/Lanczos implementation (i.e. https://issues.apache.org/jira/browse/MAHOUT-180) and got something out again. But then I hit a wall...

My problem is that I was imagining the results would be three factored matrices (which when multiplied would reproduce the original, and from which I could take the left-most columns, per various SVD tutorials). Instead, I get:

    11/02/25 10:03:11 INFO decomposer.DistributedLanczosSolver: Persisting 10 eigenVectors and eigenValues to: outpath/rawEigenvectors

which, when unpacked with http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization-reading.html, gives me:

    key 0 value: {0:-0.5695508206727358,1:-0.4285601649419706,2:-0.3882489326234163,3:-0.584132531205635}
    key 1 value: {0:-0.2721655269759087,2:0.13608276348795434,3:-0.9525793444156804}
    key 2 value: {0:0.022062855712982308,1:0.7148365251006306,2:0.6927736693876481,3:0.09266399399452617}
    key 3 value: {0:0.783849515338196,1:0.3919247576690979,2:-0.3919247576690983,3:-0.2799462554779274}
    key 4 value: {0:0.557690858476082,1:-0.5791405068790082,2:0.5898653310804713,3:-0.0750737694102418}
    key 5 value: {0:-0.22447685082502516,1:-0.5158243682948284,2:0.8253228908193994,3:0.04875951606025693}
    key 6 value: {0:0.13483997249264842,1:-0.13483997249264842,2:-0.9438798074485389,3:0.26967994498529685}
    key 7 value: {0:-0.6758100682735698,1:0.693089266667016,2:0.2503084269657081,3:-0.01592832197702688}
    key 8 value: {0:0.4104908741187378,1:0.26436466202790915,2:0.32473155620641553,3:-0.8100357918881508}
    ....

i.e. a single grid of values.

Now http://en.wikipedia.org/wiki/Singular_value_decomposition#Relation_to_eigenvalue_decomposition and http://www.scribd.com/doc/7017586/Gorrell-Webb tell me that these are intimately related to the three SVD matrices; however, for a novice the connection isn't entirely clear. I'll copy details of the specific job / data I tried below, but the basic issue is, I guess, more one of documentation for tool-oriented rather than math-oriented users. So consider this a case study in misunderstanding. If the answer is "you need to (re)learn a bit more maths", that's a fine outcome.
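
For what it's worth, here is how I currently read the relation on that Wikipedia page, as a small Python sketch rather than anything authoritative: if A = U S V^T, then the columns of V are eigenvectors of A^T A and the squared singular values are its eigenvalues. The sketch uses plain power iteration (not Lanczos, and not Mahout's code) on the 6x4 ratings matrix given further down in this mail; the helper names are my own. Whether the rows of rawEigenvectors correspond to these v vectors is exactly the part I'm unsure about.

```python
import math

# Ratings matrix from the igvita tutorial (6 seasons x 4 users).
A = [
    [5, 5, 0, 5],
    [5, 0, 3, 4],
    [3, 4, 0, 3],
    [0, 0, 5, 3],
    [5, 4, 4, 5],
    [5, 4, 5, 5],
]

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def top_singular_triplet(M, iterations=200):
    """Power iteration on M^T M; returns (u, sigma, v) for the largest singular value."""
    Mt = transpose(M)
    v = [1.0] * len(M[0])
    for _ in range(iterations):
        w = matvec(Mt, matvec(M, v))  # w = (M^T M) v
        n = norm(w)
        v = [x / n for x in w]        # renormalise to unit length
    Mv = matvec(M, v)
    sigma = norm(Mv)                  # sigma^2 is the eigenvalue of M^T M for v
    u = [x / sigma for x in Mv]       # matching left singular vector
    return u, sigma, v

u, sigma, v = top_singular_triplet(A)

# Check the eigenvector property: (A^T A) v should equal sigma^2 * v.
AtAv = matvec(transpose(A), matvec(A, v))
residual = max(abs(a - sigma ** 2 * b) for a, b in zip(AtAv, v))

print("sigma   :", sigma)
print("v       :", v)
print("u       :", u)
print("residual:", residual)
```

This only recovers the leading triplet; the remaining ones would need deflation or a proper eigensolver, which I haven't attempted here.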
If I get my head around this I'll try to reflect what I learn back into the Wiki.

So I was inspired to dig into SVD by running across a few friendly tutorials like http://www.igvita.com/2007/01/15/svd-recommendation-system-in-ruby/ and I tried to stick with their example for my original test, and also to walk through it in Matlab/Octave. So my matrix was (in matlab-ese), from their 'Family Guy Seasons x Users', where the elements are specific ratings by users for seasons:

    A = [5,5,0,5;
         5,0,3,4;
         3,4,0,3;
         0,0,5,3;
         5,4,4,5;
         5,4,5,5]

I converted it to Mahout binary using the tool at http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html with the following input csv data:

    0,0,5.0
    0,1,5.0
    0,3,5.0
    1,0,5.0
    1,2,3.0
    1,3,4.0
    2,0,3.0
    2,1,4.0
    2,3,3.0
    3,2,5.0
    3,3,3.0
    4,0,5.0
    4,1,4.0
    4,2,4.0
    4,3,5.0
    5,0,5.0
    5,1,4.0
    5,2,5.0
    5,3,5.0

(note I just skip the zero'd elements; is that appropriate/correct?)

On the Hadoop cluster I blundered my way into the following:

    hadoop jar ./mahout-examples-0.5-SNAPSHOT-job.jar \
      org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver \
      --input svdoutput.mht --output outpath --numRows 6 --numCols 4 --rank 10

...which is where I got the values given at the start of this mail.

I've poked around in Octave with http://www.mathworks.com/help/techdoc/ref/eig.html and http://www.mathworks.com/help/techdoc/ref/svd.html but I've really hit my limit here, I think.

Thanks for any pointers or other advice,

cheers,

Dan

ps. re wiki documentation, how do you all feel about continuing to use the example in http://www.igvita.com/2007/01/15/svd-recommendation-system-in-ruby/ ? Maybe it would be good to have Matlab equivalents in there too?
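
pps. In case it helps anyone reproduce this, the CSV input above is just the non-zero entries of A written as row,col,value triplets with 0-based indices. A minimal Python sketch of that conversion follows; the function name is my own, and it assumes the converter treats missing entries as zeros, which I have not confirmed.

```python
# Ratings matrix from the igvita tutorial (6 seasons x 4 users).
A = [
    [5, 5, 0, 5],
    [5, 0, 3, 4],
    [3, 4, 0, 3],
    [0, 0, 5, 3],
    [5, 4, 4, 5],
    [5, 4, 5, 5],
]

def to_triplets(matrix):
    """Yield (row, col, value) for every non-zero entry, row by row."""
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value != 0:  # zero entries are skipped (assumed implicit)
                yield i, j, float(value)

# One "row,col,value" line per non-zero entry, in the same style as the CSV above.
lines = ["%d,%d,%.1f" % t for t in to_triplets(A)]
print("\n".join(lines))
```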
