On Jul 6, 2010, at 2:24 AM, Jake Mannix wrote:

> It turns out that the number of rows isn't actually used in the SVD code at
> all (you can put in any number for this parameter), but this is an artifact
> of the particular choice of spitting out only the right singular vectors.
> NumCols is indeed necessary, but there's an ugly trick to figure it out
> too: run it with numCols = anything, and the first time you run, you'll get
> an exception which tells you what the cardinality of the vectors is.  This
> is the true numCols to use.
> 
> This should probably be fixed, as this is ugly as sin.  Easy fix is: remove
> numRows (add it back when it becomes necessary, if ever), and make numCols
> optional, calculating it on the fly by fetching the first chunk of the
> SequenceFile from HDFS and finding out the dim of the vector.

Hmm, I was looking at the code and it is passed into DistributedMatrix, etc., 
so it seemed like it was needed.
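
A minimal sketch of the two ideas in this thread: reading numCols off the first vector, and checking in a single pass that every vector has the same cardinality (which also yields the row count). This is plain Python over an in-memory iterable, not Mahout code; the function names are hypothetical, and a real version would read the vectors from the SequenceFile on HDFS instead.

```python
from itertools import chain


def infer_num_cols(vectors):
    """Peek at the first vector and return (num_cols, stream).

    The returned stream still yields every vector, including the first,
    so the caller can go on to process the full input.
    """
    it = iter(vectors)
    try:
        first = next(it)
    except StopIteration:
        raise ValueError("no vectors in input") from None
    return len(first), chain([first], it)


def check_cardinality(vectors):
    """Return (num_rows, num_cols), raising if any vector's length differs."""
    num_cols, stream = infer_num_cols(vectors)
    num_rows = 0
    for row_index, vec in enumerate(stream):
        if len(vec) != num_cols:
            raise ValueError(
                f"row {row_index} has cardinality {len(vec)}, expected {num_cols}"
            )
        num_rows += 1
    return num_rows, num_cols
```

Calling `infer_num_cols` alone only touches the first record, which is the cheap path suggested above for making numCols optional; `check_cardinality` is the full scan, and for a sufficiently large file it would have to run as a distributed job rather than a local loop.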

> 
> Glad to see some other committers playing with the SVD code finally - I
> should have pretended I left those hacks in on purpose specifically to see
> when y'all would use it and mention how horrible it was. :P
> 

Your hacks beat my non-existent SVD code!

>  -jake
> 
> On Mon, Jul 5, 2010 at 11:59 PM, Grant Ingersoll <[email protected]> wrote:
> 
>> Trying out SVD for the first time and trying to make sense of the
>> parameters...
>> 
>> Am I missing a more obvious way to get the number of rows to give to SVD
>> than to iterate through the whole sequence file of vectors and count them
>> up?  Assuming a sufficiently large vector file, don't I need a M/R job to do
>> this?  Likewise, one would have to do this for the --numCols as well, right?
>> In reality, I suppose it would be useful to have a utility that checked to
>> make sure all the vectors in a file are the same cardinality, right?
>> 
>> Just trying to get my head around the practical side of running SVD.
>> 
>> 
>> Thanks,
>> Grant
