On Mon, Jul 5, 2010 at 2:59 PM, Grant Ingersoll <[email protected]> wrote:
> Trying out SVD for the first time and trying to make sense of the
> parameters...
>
> Am I missing a more obvious way to get the number of rows to give to SVD
> than to iterate through the whole sequence file of vectors and count them
> up?

Pretty much. But you can also integrate that task into the production of the vectors.

> Assuming a sufficiently large vector file, don't I need a M/R job to do
> this? Likewise, one would have to do this for the --numCols as well, right?
> In reality, I suppose it would be useful to have a utility that checked to
> make sure all the vectors in a file are the same cardinality, right?

Yes and no. The number of rows should be the number of documents you vectorized. The number of columns should be the number of distinct terms that you observed while vectorizing. Both should be pretty easily available from the vectorization step itself.

With sparse vectors, we don't care quite as much about the declared size of the vector and often set it to a "large enough" value.

The other major approach is to use random projection to get fixed-length vectors of a known, predetermined size. This is the strategy I use in the SGD code, and it makes a lot of things much, much easier because you can set the cardinality of the vectors involved ahead of time. It makes converting a vector back into terms much harder, though.
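
To make the two options concrete, here is a rough sketch in plain Python. It is not Mahout code. The function names, the toy documents, and the use of CRC32 as the hash are all assumptions for illustration. The first function counts rows and distinct terms while it builds the vectors, so no second pass over the data is needed. The second uses a simple hashed encoding as a stand-in for the fixed-cardinality idea: the vector size is chosen up front, and collisions are what make it hard to map a column back to a term.

```python
import zlib
from collections import defaultdict


def vectorize_with_dictionary(docs):
    """Build sparse term-count rows and record the matrix shape as a side effect.

    Hypothetical helper: returns (rows, num_rows, num_cols, term_index).
    """
    term_index = {}  # term -> column index, grows as new terms are seen
    rows = []
    for doc in docs:
        row = defaultdict(int)
        for term in doc.split():
            # Assign the next free column to a term the first time we see it.
            col = term_index.setdefault(term, len(term_index))
            row[col] += 1
        rows.append(dict(row))
    # Rows = documents vectorized, columns = distinct terms observed.
    return rows, len(rows), len(term_index), term_index


def vectorize_with_hashing(docs, cardinality=16):
    """Build sparse rows whose width is fixed ahead of time (hashed encoding).

    Hypothetical helper: different terms may collide in one column, so the
    mapping from column back to term is lossy.
    """
    rows = []
    for doc in docs:
        row = defaultdict(int)
        for term in doc.split():
            # CRC32 is deterministic across runs, unlike Python's built-in hash().
            col = zlib.crc32(term.encode("utf-8")) % cardinality
            row[col] += 1
        rows.append(dict(row))
    return rows


# Toy input, made up for this example.
docs = ["the cat sat", "the dog sat down", "a cat and a dog"]

rows, num_rows, num_cols, term_index = vectorize_with_dictionary(docs)
hashed_rows = vectorize_with_hashing(docs, cardinality=16)
```

With the dictionary version, `num_rows` and `num_cols` are the values you would hand to the solver. With the hashed version, the column count is simply whatever `cardinality` you picked, and only the row count needs tracking.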
