I think that 400K dimensions isn't a big problem if you have enough data so
that you start to see good overlap.  If your bigrams are sparse enough that
you don't get overlap, then this won't work.  SVD could be very helpful to
smooth the data in those cases.

The biggest implementation problem will be the storage required for the
centroid vectors if number of clusters x vector size gets too large.

Another option is to use hashed feature vectors.  These will retain
essentially all of the information in the larger vectors but will keep your
centroids at a more moderate size.  This also avoids needing a pass over
your data to assign vector locations.
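
A rough sketch of what that looks like with Mahout's hashed encoders follows
(the package names are from a recent trunk and may differ in your release;
the vector size and the bigrams here are purely illustrative):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class HashedBigramSketch {
  public static void main(String[] args) {
    // Fix the hashed dimensionality up front - no pass over the corpus is
    // needed to assign a column to each distinct bigram.
    int numFeatures = 100000;  // illustrative size, tune for your data
    StaticWordValueEncoder encoder = new StaticWordValueEncoder("bigram");

    Vector v = new RandomAccessSparseVector(numFeatures);
    for (String bigram : new String[] {"apache mahout", "lanczos solver"}) {
      // Each bigram is hashed into a small number of locations in the
      // fixed-size vector, so you never need a dictionary of all 380K+
      // distinct bigrams.
      encoder.addToVector(bigram, v);
    }
    System.out.println(v.getNumNondefaultElements() + " non-zero entries");
  }
}

The encoders also let you use more than one probe per feature if hash
collisions are a concern.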

On Thu, Mar 17, 2011 at 8:52 AM, Timothy Potter <thelabd...@gmail.com> wrote:

> Hi Ted,
>
> Regarding your comment: "For clustering purposes, you probably don't even
> need SVD here ..."
>
> I was just experimenting with vectors that have 20K dimensions, but my end
> goal was to run the SVD on n-gram vectors that have roughly 380K
> dimensions. Do you still think SVD is not needed for this situation? My
> thought was to get the n-gram vectors down to a more manageable size, and
> SVD seemed like what I needed.
>
> Cheers,
> Tim
>
> On Mon, Mar 14, 2011 at 3:56 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>
> > Thanks for the clarification Jake.
> >
> > The end goal is to run the SVD against my n-gram vectors, which have 380K
> > dimensions.
> >
> > I'll update the wiki once I have this working.
> >
> > Tim
> >
> >
> > On Mon, Mar 14, 2011 at 1:09 PM, Jake Mannix <jake.man...@gmail.com> wrote:
> >
> >>
> >> On Sun, Mar 13, 2011 at 6:47 PM, Timothy Potter <thelabd...@gmail.com> wrote:
> >>
> >>> Looking for a little clarification on using SVD to reduce the dimensions
> >>> of my vectors for clustering ...
> >>>
> >>> Using the ASF mail archives for Mahout-588, I have 6,076,937 tfidf
> >>> vectors with 20,444 dimensions. I successfully ran Mahout SVD on the
> >>> vectors using:
> >>>
> >>> bin/mahout svd -i
> >>> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
> >>>    -o /asf-mail-archives/mahout-0.4/svd \
> >>>    --rank 100 --numCols 20444 --numRows 6076937 --cleansvd true
> >>>
> >>> This produced 87 eigenvectors of size 20,444. I'm not clear as to why
> >>> only
> >>> 87, but I'm assuming that has something to do with Lanczos???
> >>>
> >>
> >> Hi Timothy,
> >>
> >>   The LanczosSolver looks for 100 eigenvectors, but then does some
> >> cleanup after the fact: convergence issues and numeric overflow can cause
> >> some eigenvectors to show up twice - the last step in Mahout SVD is to
> >> remove these spurious eigenvectors (and also any which just don't appear
> >> to be "eigen" enough, i.e., they don't satisfy the eigenvector criterion
> >> with high enough fidelity).
> >>
> >>   If you really need more eigenvectors, you can try re-running with
> >> rank=150,
> >> and then take the top 100 out of however many you get out.
> >>
> >>> So then I proceeded to transpose the SVD output using:
> >>>
> >>> bin/mahout transpose -i /mnt/dev/svd/cleanEigenvectors --numCols 20444
> >>> --numRows 87
> >>>
> >>> Next, I tried to run transpose on my original vectors using:
> >>>
> >>> transpose -i
> >>> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors
> >>> --numCols 20444 --numRows 6076937
> >>>
> >>>
> >> So the problem with this is that the tfidf-vectors is a
> >> SequenceFile<Text,VectorWritable> - which is fine for input into
> >> DistributedLanczosSolver (which just needs <Writable,VectorWritable>
> >> pairs), but not so fine for being really considered a "matrix" - you need
> >> to run the RowIdJob on these tfidf-vectors first.  This will normalize
> >> your SequenceFile<Text,VectorWritable> into a
> >> SequenceFile<IntWritable,VectorWritable> and a
> >> SequenceFile<IntWritable,Text> (where the original one is the join of
> >> these new ones, on the new int key).
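
For reference, a minimal sketch of driving RowIdJob programmatically, using
the input path from earlier in the thread (the output path and the
org.apache.mahout.utils.vectors package location are assumptions to check
against your release; the job should leave matrix and docIndex sequence
files under the output directory):

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.utils.vectors.RowIdJob;

public class RowIdSketch {
  public static void main(String[] args) throws Exception {
    // Re-key the SequenceFile<Text,VectorWritable> tfidf-vectors into
    //   <output>/matrix   : SequenceFile<IntWritable,VectorWritable>
    //   <output>/docIndex : SequenceFile<IntWritable,Text>
    // where docIndex maps each new int row id back to the original Text key.
    ToolRunner.run(new RowIdJob(), new String[] {
        "-i", "/asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors",
        "-o", "/asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix"  // output path is hypothetical
    });
  }
}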
> >>
> >> Hope that helps.
> >>
> >>   -jake
> >>
> >
> >
>
