On Mon, Jul 5, 2010 at 2:59 PM, Grant Ingersoll <[email protected]> wrote:

> Trying out SVD for the first time and trying to make sense of the
> parameters...
>
> Am I missing a more obvious way to get the number of rows to give to SVD
> than to iterate through the whole sequence file of vectors and count them
> up?


Pretty much.  But you can also integrate that task into the production of
the vectors.


> Assuming a sufficiently large vector file, don't I need a M/R job to do
> this?  Likewise, one would have to do this for the --numCols as well, right?
>  In reality, I suppose it would be useful to have a utility that checked to
> make sure all the vectors in a file are the same cardinality, right?
>

Yes and no.  The number of rows should be the number of documents you
vectorized.  The number of columns should be the number of distinct terms
that you observed in vectorizing.  Both should be pretty easily available.
With sparse vectors, we don't care quite as much about the size of the
vector and often set it to a "large enough" value.
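
As a rough illustration of getting both numbers as by-products of
vectorizing, here is a minimal sketch in plain Python (not Mahout code).
The function name `vectorize` and the dict-based sparse rows are
assumptions made for the example only.

```python
from collections import Counter

def vectorize(docs):
    """Turn tokenized documents into sparse rows, counting as we go.

    Hypothetical helper: returns (rows, num_rows, num_cols).
    """
    term_ids = {}  # term -> column index, assigned the first time a term is seen
    rows = []      # one sparse row per document: {column index: term count}
    for tokens in docs:
        counts = Counter(tokens)
        row = {term_ids.setdefault(t, len(term_ids)): c for t, c in counts.items()}
        rows.append(row)
    # Rows = documents vectorized, columns = distinct terms observed.
    return rows, len(rows), len(term_ids)

docs = [["svd", "rows", "svd"], ["rows", "columns"]]
rows, num_rows, num_cols = vectorize(docs)
print(num_rows, num_cols)
```

Because the rows are sparse, declaring a cardinality larger than
`num_cols` costs nothing in storage, which is why a "large enough" value
is usually acceptable.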

The other major approach is to use random projection to get fixed length
vectors of known and predetermined size out.  This is the strategy I use in
the SGD code and it makes a lot of things much, much easier because you can
set the cardinality of the vectors involved ahead of time.  It makes
converting a vector back into terms much harder, though.
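
One simple way to get fixed-length vectors of this kind is to hash each
term into a predetermined number of slots. The sketch below is plain
Python, not the SGD code itself; the function name `hashed_vector`, the
use of MD5 as the hash, and the `probes` parameter are all assumptions
made for illustration.

```python
import hashlib

def hashed_vector(tokens, cardinality=16, probes=2):
    """Encode tokens into a dense vector whose length is fixed up front.

    Hypothetical helper: each token is added at `probes` hashed locations.
    """
    vec = [0.0] * cardinality
    for t in tokens:
        for p in range(probes):
            # A stable hash (unlike Python's built-in hash()) so results repeat across runs.
            digest = hashlib.md5(f"{p}:{t}".encode("utf-8")).digest()
            index = int.from_bytes(digest[:8], "big") % cardinality
            vec[index] += 1.0
    return vec

v = hashed_vector(["svd", "rows", "svd"])
print(len(v), sum(v))
```

The length is always `cardinality`, regardless of how many distinct
terms show up, so no counting pass is needed. The cost is the one noted
above: different terms can share a slot, so there is no direct way to
map a slot back to a term.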
