On Jul 5, 2010, at 7:14 PM, Ted Dunning wrote:

> On Mon, Jul 5, 2010 at 2:59 PM, Grant Ingersoll <[email protected]> wrote:
>
>> Trying out SVD for the first time and trying to make sense of the
>> parameters...
>>
>> Am I missing a more obvious way to get the number of rows to give to SVD
>> than to iterate through the whole sequence file of vectors and count them
>> up?
>
> Pretty much. But you can also integrate that task into the production of
> the vectors.
>
>> Assuming a sufficiently large vector file, don't I need a M/R job to do
>> this? Likewise, one would have to do this for the --numCols as well, right?
>> In reality, I suppose it would be useful to have a utility that checked to
>> make sure all the vectors in a file are the same cardinality, right?
>
> Yes and no. The number of rows should be the number of documents you
> vectorized. The number of columns should be the number of distinct terms
> that you observed in vectorizing. Both should be pretty easily available.

Yeah, I can count the rows w/ the VectorDumper, but that doesn't really scale. Just wondering if I was missing some tool that people are using.
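
For illustration only, here is a minimal sketch of the kind of utility discussed above: one streaming pass that counts the rows and checks that every vector has the same cardinality. It is not Mahout code and does not read sequence files. The function name `count_rows_and_check_cardinality` is hypothetical, and the vectors are assumed to arrive as plain Python sequences from some reader of your own.

```python
from typing import Iterable, Sequence, Tuple


def count_rows_and_check_cardinality(
    vectors: Iterable[Sequence[float]],
) -> Tuple[int, int]:
    """Return (num_rows, num_cols) for a stream of equal-length vectors.

    Hypothetical helper: vectors are plain sequences, not Mahout Vectors.
    Raises ValueError at the first vector whose length differs from the
    length of the first vector seen.
    """
    num_rows = 0
    num_cols = None  # unknown until the first vector is read
    for row_index, vec in enumerate(vectors):
        size = len(vec)
        if num_cols is None:
            num_cols = size
        elif size != num_cols:
            raise ValueError(
                f"row {row_index} has cardinality {size}, expected {num_cols}"
            )
        num_rows += 1
    # An empty stream has zero rows and, by convention here, zero columns.
    return num_rows, (num_cols or 0)


# Usage: any iterable works, so the data never has to fit in memory.
rows, cols = count_rows_and_check_cardinality(
    iter([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]])
)
```

Memory use stays constant, but this is still a single sequential pass, so it scales no better than dumping the file. For a sufficiently large input, the same logic would have to run as a distributed job, or the counts would be recorded while the vectors are produced, as suggested in the quoted reply.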
