On Jul 5, 2010, at 7:14 PM, Ted Dunning wrote:

> On Mon, Jul 5, 2010 at 2:59 PM, Grant Ingersoll <[email protected]> wrote:
> 
>> Trying out SVD for the first time and trying to make sense of the
>> parameters...
>> 
>> Am I missing a more obvious way to get the number of rows to give to SVD
>> than to iterate through the whole sequence file of vectors and count them
>> up?
> 
> 
> Pretty much.  But you can also integrate that task into the production of
> the vectors.
> 
> 
>> Assuming a sufficiently large vector file, don't I need an M/R job to do
>> this?  Likewise, one would have to do this for the --numCols as well, right?
>> In reality, I suppose it would be useful to have a utility that checked to
>> make sure all the vectors in a file are the same cardinality, right?
>> 
> 
> Yes and no.  The number of rows should be the number of documents you
> vectorized.  The number of columns should be the number of distinct terms
> that you observed in vectorizing.  Both should be pretty easily available.

Yeah, I can count the rows with the VectorDumper, but that doesn't really scale.
Just wondering if I was missing some tool that people are using.
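
For reference, here is a minimal sketch of the suggestion quoted above, where the counting is folded into vector production instead of a second pass over the file. It is plain Python, not Mahout code: the DimensionTracker name, the dict-based sparse vectors (term index -> weight), and the assumption that term indices are assigned densely from 0 are all hypothetical, chosen only to illustrate the idea.

```python
from typing import Dict, Iterable, Iterator

SparseVector = Dict[int, float]  # assumed layout: term index -> weight


class DimensionTracker:
    """Records row and column counts while vectors stream past."""

    def __init__(self) -> None:
        self.num_rows = 0  # one row per vector (document) observed
        self.num_cols = 0  # largest term index observed, plus one

    def observe(self, vectors: Iterable[SparseVector]) -> Iterator[SparseVector]:
        # Pass each vector through unchanged so the caller can still write
        # it out; the counts are updated as a side effect of iteration.
        for vec in vectors:
            self.num_rows += 1
            if vec:
                self.num_cols = max(self.num_cols, max(vec) + 1)
            yield vec


if __name__ == "__main__":
    docs = [{0: 1.0, 3: 2.0}, {}, {7: 0.5}]
    tracker = DimensionTracker()
    for vec in tracker.observe(docs):
        pass  # a real job would write vec to its output here
    print(tracker.num_rows, tracker.num_cols)
```

Once the loop finishes, tracker.num_rows and tracker.num_cols hold the values one would pass as the row and column counts. In a distributed job the same bookkeeping would presumably have to be combined across tasks, which this sketch does not attempt.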
