On Jul 6, 2010, at 2:24 AM, Jake Mannix wrote:

> It turns out that the number of rows isn't actually used in the SVD code at
> all (you can put in any number for this parameter), but this is an artifact
> of the particular choice of spitting out only the right singular vectors.
> NumCols is indeed necessary, but there's an ugly trick to figure it out
> too: run it with numCols = anything, and the first time you run, you'll get
> an exception which tells you what the cardinality of the vectors is. This
> is the true numCols to use.
>
> This should probably be fixed, as this is ugly as sin. Easy fix is: remove
> numRows (add it back when it becomes necessary, if ever), and make numCols
> optional, calculating it on the fly by fetching the first chunk of the
> SequenceFile from HDFS and finding out the dim of the vector.

Hmm, I was looking at the code, and numRows is passed into DistributedMatrix, etc., so it seemed like it was needed.

>
> Glad to see some other committers playing with the SVD code finally - I
> should have pretended I left those hacks in on purpose specifically to see
> when y'all would use it and mention how horrible it was. :P
>

Your hacks beat my non-existent SVD code!

> -jake
>
> On Mon, Jul 5, 2010 at 11:59 PM, Grant Ingersoll <[email protected]> wrote:
>
>> Trying out SVD for the first time and trying to make sense of the
>> parameters...
>>
>> Am I missing a more obvious way to get the number of rows to give to SVD
>> than to iterate through the whole sequence file of vectors and count them
>> up? Assuming a sufficiently large vector file, don't I need a M/R job to do
>> this? Likewise, one would have to do this for the --numCols as well, right?
>> In reality, I suppose it would be useful to have a utility that checked to
>> make sure all the vectors in a file are the same cardinality, right?
>>
>> Just trying to get my head around the practical side of running SVD.
>>
>>
>> Thanks,
>> Grant
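
For illustration only, here is a minimal single-process sketch of the two ideas discussed in this thread: peeking at the first vector to get numCols, and making one full pass to count rows while checking that every vector has the same cardinality. It is not Mahout code. The function names `peek_num_cols` and `scan_dimensions` are hypothetical, the vectors are assumed to be plain in-memory sequences rather than records read from a SequenceFile, and `len()` stands in for whatever cardinality accessor the real vector type exposes.

```python
from typing import Iterable, Sequence, Tuple


def peek_num_cols(vectors: Iterable[Sequence[float]]) -> int:
    """Return the cardinality of the first vector only (no full scan)."""
    first = next(iter(vectors), None)
    if first is None:
        raise ValueError("no vectors found")
    return len(first)


def scan_dimensions(vectors: Iterable[Sequence[float]]) -> Tuple[int, int]:
    """Count rows in one pass and verify all vectors share one cardinality."""
    num_rows = 0
    num_cols = None
    for vec in vectors:
        if num_cols is None:
            # The first vector fixes the expected cardinality.
            num_cols = len(vec)
        elif len(vec) != num_cols:
            raise ValueError(
                f"row {num_rows} has cardinality {len(vec)}, expected {num_cols}"
            )
        num_rows += 1
    if num_cols is None:
        raise ValueError("no vectors found")
    return num_rows, num_cols
```

The peek is cheap but trusts that the rest of the file is consistent. The full scan touches every vector, so for a sufficiently large file on HDFS the same logic would presumably have to run as a distributed job rather than in a single process.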
