On Aug 13, 2011 6:15 PM, "Eshwaran Vijaya Kumar" <[email protected]>
wrote:
>
>
> On Aug 13, 2011, at 2:11 PM, Dmitriy Lyubimov wrote:
>
> > NP.
> >
> > thanks for testing it out.
> >
> > I would appreciate it if you could let me know how it goes with non-full
> > rank decomposition and perhaps at a larger scale.
> >
>
> Sure thing.
> > One thing to keep in mind is that it projects the input into an m x (k+p)
> > _dense_ matrix, assuming that k+p is much smaller than the number of
> > non-zero elements in a sparse row vector. If that is not the case, you
> > would actually create more computation, not less, with a random
> > projection. One person tried to use it with m = millions, but the rows
> > were so sparse that there were only a handful (~10 on average) of
> > non-zero items per row (somewhat typical for user ratings), yet he tried
> > to compute hundreds of singular values, which of course created more
> > intermediate work than something like Lanczos probably would. That's not
> > a good application of this method.
>
> So this is a bit surprising: in my situation, k would be relatively low
> (< 20). Since I am working with text data, I suspect that the rows are
> pretty sparse, although I have not instrumented the row non-zero element
> distributions yet. Based on your notes, I was planning to set k + p = 500
> (or less, depending on the width of the matrix) so that I would get
> reasonably good singular vectors. I guess I will do some more tuning.

Text data (LSA) should be fine; that's what I use it for. Even if your rows
are very sparse, it is still fine. The stochastic projection only serves to
reduce work in favor of speed while losing some precision. If your rows are
too sparse, you just don't get much of a speed benefit, that's all.
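
To make the cost argument concrete, here is a minimal sketch (plain Java,
hypothetical names, not the actual Mahout code) of what projecting a single
sparse row against a dense n x (k+p) random matrix involves: the output is
always a dense row of length k+p, so with ~10 non-zeros per row and
k + p = 500, the "reduced" row is both bigger and more expensive to produce
than the original one.

class ProjectionCostSketch {
  // Minimal sketch, not Mahout code: project one sparse row (given as
  // parallel index/value arrays) against a dense random matrix omega of
  // geometry n x (k+p). Per-row cost is roughly nnz * (k+p) multiply-adds,
  // and the result always stores k+p doubles regardless of input sparsity.
  static double[] projectSparseRow(int[] indices, double[] values,
                                   double[][] omega, int kp) {
    double[] y = new double[kp];               // dense k+p output row
    for (int t = 0; t < indices.length; t++) {
      double v = values[t];
      double[] omegaRow = omega[indices[t]];   // row of omega for this non-zero
      for (int c = 0; c < kp; c++) {
        y[c] += v * omegaRow[c];
      }
    }
    return y;
  }
}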

On the other hand, assuming LSA, you might also put terms, not documents,
into the rows; if you have a large corpus, that would help a lot, although
it is not worth it in my case.

>
> > Another thing is that you need to have good singular value decay in
> > your data; otherwise this method will be surprisingly far from the true
> > vectors (in my experiments).
> >
>
> I am not too sure off hand whether this is true for my dataset.
>
Any dataset with trends in it should be fine; only a completely random
dataset has no trends. I am just saying that testing this method for
accuracy on random data is not a good benchmark, unless it is a full
decomposition where the dimensionality reduction does not kick in.
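
As a rough illustration (a hypothetical helper, not anything in Mahout), the
"decay" in question is just how fast the leading singular values fall off;
a ratio close to 1 means a flat spectrum, which is what purely random data
gives you and where the stochastic projection loses the most accuracy.

class DecaySketch {
  // Rough heuristic sketch: given singular values sorted in descending
  // order, report how much the k-th value has dropped relative to the
  // first. A ratio near 1 means little decay (the bad case for this
  // method); a small ratio means the dominant subspace is well separated.
  static double decayRatio(double[] singularValues, int k) {
    return singularValues[k - 1] / singularValues[0];
  }
}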

>
> > -d
> >
> >
> > On Sat, Aug 13, 2011 at 1:48 PM, Eshwaran Vijaya Kumar <
> > [email protected]> wrote:
> >
> >> Dmitriy,
> >> That sounds great. I eagerly await the patch.
> >> Thanks
> >> Esh
> >> On Aug 13, 2011, at 1:37 PM, Dmitriy Lyubimov wrote:
> >>
> >>> OK, I got u0 working.
> >>>
> >>> The problem is of course that something called the BBt job has to be
> >>> coerced to have 1 reducer (which is fine: every mapper won't yield
> >>> more than an upper-triangular matrix of (k+p) x (k+p) geometry, so
> >>> even if you end up having thousands of them, the reducer will sum them
> >>> up just fine).
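
(A toy sketch of that reduce step, with hypothetical names and not the
actual Mahout BBt reducer: each mapper contributes only the upper triangle
of a small (k+p) x (k+p) partial product, and a single reducer simply adds
the blocks element-wise, so the work stays tiny no matter how many mappers
ran.)

class BBtSumSketch {
  // Toy illustration, not Mahout's reducer: sum upper-triangular
  // (k+p) x (k+p) partial products produced by the mappers. Since k+p is
  // small (tens to a few hundred), one reducer handles thousands of these
  // blocks without trouble.
  static double[][] sumUpperTriangular(Iterable<double[][]> partials, int kp) {
    double[][] bbt = new double[kp][kp];
    for (double[][] part : partials) {
      for (int i = 0; i < kp; i++) {
        for (int j = i; j < kp; j++) {   // only the upper triangle is filled
          bbt[i][j] += part[i][j];
        }
      }
    }
    return bbt;
  }
}
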
> >>>
> >>> It apparently worked before because the configuration holds 1 reducer
> >>> by default if it is not set explicitly; I am not quite sure whether it
> >>> is something in the Hadoop MR client or a Mahout change that now
> >>> precludes it from working.
> >>>
> >>> Anyway, I have a patch (really a one-liner), and an example equivalent
> >>> to yours worked fine for me with 3 reducers.
> >>>
> >>> Also, the tests request 3 reducers as well, but the reason it works in
> >>> the tests and not in distributed mapred is that local mapred doesn't
> >>> support multiple reducers. I investigated this issue before, and
> >>> apparently there were a couple of patches floating around, but for some
> >>> reason those changes did not make it into cdh3u0.
> >>>
> >>> I will publish the patch in a JIRA shortly and will commit it
> >>> Sunday-ish.
> >>>
> >>> Thanks.
> >>> -d
> >>>
> >>>
> >>> On Fri, Aug 5, 2011 at 7:06 PM, Eshwaran Vijaya Kumar <
> >>> [email protected]> wrote:
> >>>
> >>>> OK. To add more info to this, I tried setting the number of reducers
> >>>> to 1, and now I don't get that particular error. The singular values
> >>>> and the left and right singular vectors do appear to be correct
> >>>> (verified using Matlab).
> >>>>
> >>>> On Aug 5, 2011, at 1:55 PM, Eshwaran Vijaya Kumar wrote:
> >>>>
> >>>>> All,
> >>>>> I am trying to test Stochastic SVD and am facing some errors; it
> >>>>> would be great if someone could clarify what is going on. I am trying
> >>>>> to feed the solver a DistributedRowMatrix with the exact same
> >>>>> parameters that the test in LocalSSVDSolverSparseSequentialTest uses,
> >>>>> i.e., generate a 1000 x 100 DRM with SequentialSparseVectors and then
> >>>>> ask for blockHeight = 251, p (oversampling) = 60, k (rank) = 40. I
> >>>>> get the following error:
> >>>>>
> >>>>> Exception in thread "main" java.io.IOException: Unexpected overrun in
> >>>>> upper triangular matrix files
> >>>>>      at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.loadUpperTriangularMatrix(SSVDSolver.java:471)
> >>>>>      at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:268)
> >>>>>      at com.mozilla.SSVDCli.run(SSVDCli.java:89)
> >>>>>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>>>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> >>>>>      at com.mozilla.SSVDCli.main(SSVDCli.java:129)
> >>>>>      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>>>      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>>>      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>>>      at java.lang.reflect.Method.invoke(Method.java:597)
> >>>>>      at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> >>>>>
> >>>>> Also, I am using CDH3 with Mahout recompiled to work with CDH3 jars.
> >>>>>
> >>>>> Thanks
> >>>>> Esh
> >>>>>
> >>>>
> >>>>
> >>
> >>
>
