Ted,
sorry for my stupid question of the day: what does the "out-of-core" term
mean?
 On Aug 16, 2011 2:18 PM, "Ted Dunning" <[email protected]> wrote:
> I have several in-memory implementations almost ready to publish.
>
> These provide a straightforward implementation of the original SSVD
> algorithm from the Martinsson and Halko paper, a version that avoids QR and
> LQ decompositions, and an out-of-core version that only keeps a
> moderate-sized amount of data in memory at any time.
>
> My hangup at this point is getting my Cholesky decomposition reliable for
> rank-deficient inputs.
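>
> (For anyone following along: the core of the basic randomized algorithm in
> that paper is small enough to sketch with commons-math 2.x classes. This is
> only an illustration of the projection / QR / small-SVD steps, not any of the
> implementations mentioned above; it assumes the input has at least k+p rows.)
>
>   import java.util.Random;
>   import org.apache.commons.math.linear.Array2DRowRealMatrix;
>   import org.apache.commons.math.linear.QRDecompositionImpl;
>   import org.apache.commons.math.linear.RealMatrix;
>   import org.apache.commons.math.linear.SingularValueDecompositionImpl;
>
>   public class RandomizedSvdSketch {
>     // Returns {U, S, V} with k+p columns; truncate to the first k for rank k.
>     public static RealMatrix[] ssvd(RealMatrix a, int k, int p, long seed) {
>       int m = a.getRowDimension();
>       int n = a.getColumnDimension();
>       Random rnd = new Random(seed);
>       // Omega: n x (k+p) Gaussian test matrix
>       RealMatrix omega = new Array2DRowRealMatrix(n, k + p);
>       for (int i = 0; i < n; i++) {
>         for (int j = 0; j < k + p; j++) {
>           omega.setEntry(i, j, rnd.nextGaussian());
>         }
>       }
>       RealMatrix y = a.multiply(omega);               // sample of the range of A, m x (k+p)
>       RealMatrix q = new QRDecompositionImpl(y).getQ()
>           .getSubMatrix(0, m - 1, 0, k + p - 1);      // keep the thin orthonormal basis of Y
>       RealMatrix b = q.transpose().multiply(a);       // small (k+p) x n matrix
>       SingularValueDecompositionImpl svd = new SingularValueDecompositionImpl(b);
>       RealMatrix u = q.multiply(svd.getU());          // lift left singular vectors back to R^m
>       return new RealMatrix[] { u, svd.getS(), svd.getV() };
>     }
>   }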
>
> On Tue, Aug 16, 2011 at 1:57 PM, Eshwaran Vijaya Kumar <
> [email protected]> wrote:
>
>> I have decided to do something similar: run the pipeline in memory and not
>> invoke map-reduce for small datasets, which I think will handle the issue.
>> Thanks again for clearing that up.
>> Esh
>>
>> On Aug 16, 2011, at 1:45 PM, Dmitriy Lyubimov wrote:
>>
>> > PPS Mahout also has an in-memory SVD solver migrated from Colt, which is
>> > BTW what I am using in local tests to assert SSVD results. It starts to
>> > feel slow pretty quickly, though, and sometimes produces errors (I think
>> > it starts feeling slow at 10k x 1k inputs).
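>> >
>> > (In case it helps: a rough sketch of that kind of assertion, using the
>> > Colt-derived org.apache.mahout.math.SingularValueDecomposition; the method
>> > names follow that port, so double-check them against your Mahout version.)
>> >
>> >   import org.apache.mahout.math.Matrix;
>> >   import org.apache.mahout.math.SingularValueDecomposition;
>> >
>> >   // Compare the first k singular values from SSVD against the dense
>> >   // in-memory solver, within a relative tolerance eps.
>> >   public static void assertCloseToDenseSvd(Matrix a, double[] ssvdSigma,
>> >                                            int k, double eps) {
>> >     double[] reference = new SingularValueDecomposition(a).getSingularValues();
>> >     for (int i = 0; i < k; i++) {
>> >       if (Math.abs(reference[i] - ssvdSigma[i]) > eps * reference[i]) {
>> >         throw new AssertionError("singular value " + i + " differs: "
>> >             + reference[i] + " vs " + ssvdSigma[i]);
>> >       }
>> >     }
>> >   }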
>> >
>> > On Tue, Aug 16, 2011 at 12:52 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> >
>> >> also, with data as small as this, the stochastic noise ratio would be
>> >> significant (as in the law of large numbers), so if you really think you
>> >> might need to handle inputs that small, you had better write a pipeline
>> >> that detects this as a corner case and just runs an in-memory
>> >> decomposition. In fact, I think dense matrices up to 100,000 rows can be
>> >> quite comfortably computed in-memory (Ted knows much more about the
>> >> practical limits of tools like R or even something as simple as
>> >> apache.math).
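>> >>
>> >> (A minimal sketch of that corner-case routing; the row threshold just
>> >> echoes the rough guidance above, and runDistributedSsvd is a placeholder
>> >> hook, not an existing Mahout call.)
>> >>
>> >>   import org.apache.mahout.math.Matrix;
>> >>   import org.apache.mahout.math.SingularValueDecomposition;
>> >>
>> >>   abstract class SvdDispatch {
>> >>     static final int IN_MEMORY_ROW_LIMIT = 100000;  // rough limit from above
>> >>
>> >>     void decompose(Matrix a, int k, int p) {
>> >>       if (a.numRows() <= IN_MEMORY_ROW_LIMIT) {
>> >>         // tiny input: dense in-memory decomposition, no map-reduce overhead
>> >>         SingularValueDecomposition svd = new SingularValueDecomposition(a);
>> >>         // use svd.getU(), svd.getV(), svd.getSingularValues() from here
>> >>       } else {
>> >>         runDistributedSsvd(a, k, p);  // hand off to the map-reduce pipeline
>> >>       }
>> >>     }
>> >>
>> >>     abstract void runDistributedSsvd(Matrix a, int k, int p);
>> >>   }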
>> >>
>> >> -d
>> >>
>> >>
>> >> On Tue, Aug 16, 2011 at 12:46 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> >>
>> >>> Yep, that's what I figured. You have 193 rows or so, but distributed
>> >>> between 7 files, so they are small and would generate several mappers,
>> >>> and there are probably some among them with a small row count.
>> >>>
>> >>> See my other email. This method is for big data, big files. If you want
>> >>> to automate handling of small files, you can probably add an intermediate
>> >>> step with some heuristic that merges together all files shorter than,
>> >>> say, 1 MB.
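>> >>>
>> >>> (Something like this, as a sketch -- plain Hadoop SequenceFile API,
>> >>> nothing Mahout-specific; the 1 MB threshold is just the heuristic above,
>> >>> and it assumes the small files share the same key/value classes.)
>> >>>
>> >>>   import java.io.IOException;
>> >>>   import org.apache.hadoop.conf.Configuration;
>> >>>   import org.apache.hadoop.fs.FileStatus;
>> >>>   import org.apache.hadoop.fs.FileSystem;
>> >>>   import org.apache.hadoop.fs.Path;
>> >>>   import org.apache.hadoop.io.SequenceFile;
>> >>>   import org.apache.hadoop.io.Writable;
>> >>>   import org.apache.hadoop.util.ReflectionUtils;
>> >>>
>> >>>   // Fold every sequence file under inputDir smaller than ~1 MB into one
>> >>>   // merged file so that no mapper ends up with just a handful of rows.
>> >>>   public static void mergeSmallSequenceFiles(Configuration conf, Path inputDir,
>> >>>                                              Path merged) throws IOException {
>> >>>     FileSystem fs = FileSystem.get(conf);
>> >>>     SequenceFile.Writer writer = null;
>> >>>     try {
>> >>>       for (FileStatus stat : fs.listStatus(inputDir)) {
>> >>>         if (stat.isDir() || stat.getLen() >= (1L << 20)) {
>> >>>           continue;  // leave directories and big files alone
>> >>>         }
>> >>>         SequenceFile.Reader reader = new SequenceFile.Reader(fs, stat.getPath(), conf);
>> >>>         try {
>> >>>           Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
>> >>>           Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
>> >>>           if (writer == null) {
>> >>>             writer = SequenceFile.createWriter(fs, conf, merged,
>> >>>                 reader.getKeyClass(), reader.getValueClass());
>> >>>           }
>> >>>           while (reader.next(key, value)) {
>> >>>             writer.append(key, value);  // copy rows verbatim into the merged file
>> >>>           }
>> >>>         } finally {
>> >>>           reader.close();
>> >>>         }
>> >>>       }
>> >>>     } finally {
>> >>>       if (writer != null) {
>> >>>         writer.close();
>> >>>       }
>> >>>     }
>> >>>   }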
>> >>>
>> >>> -d
>> >>>
>> >>>
>> >>>
>> >>> On Tue, Aug 16, 2011 at 12:43 PM, Eshwaran Vijaya Kumar <
>> >>> [email protected]> wrote:
>> >>>
>> >>>> The number of mappers is 7. The DFS block size is 128 MB; the reason I
>> >>>> think there are 7 mappers being used is that I am using a Pig script to
>> >>>> generate the sequence file of Vectors, and that script runs 7 reducers.
>> >>>> I am not setting minSplitSize, though.
>> >>>>
>> >>>> On Aug 16, 2011, at 12:15 PM, Dmitriy Lyubimov wrote:
>> >>>>
>> >>>>> Hm. This is not common at all.
>> >>>>>
>> >>>>> This error would surface if a map split can't accumulate at least k+p
>> >>>>> rows.
>> >>>>>
>> >>>>> That's another requirement which is usually a non-issue -- any
>> >>>>> precomputed split must contain at least k+p rows, which would normally
>> >>>>> fail to hold only if the matrix is extra wide and dense, in which case
>> >>>>> --minSplitSize must be used to avoid this.
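>> >>>>>
>> >>>>> For an extra-wide input that would look something along these lines
>> >>>>> (option names from memory -- please check "mahout ssvd --help" on your
>> >>>>> build; the paths and the split size are placeholders):
>> >>>>>
>> >>>>>   mahout ssvd --input /path/to/drm --output /path/to/ssvd-out \
>> >>>>>     --rank 4 --oversampling 20 --blockHeight 30000 \
>> >>>>>     --minSplitSize 134217728 --reduceTasks 1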
>> >>>>>
>> >>>>> But in your case, the matrix is so small it must fit in one split. Can
>> >>>>> you please verify how many mappers the job generates?
>> >>>>>
>> >>>>> If it's more than 1, then something fishy is going on with hadoop.
>> >>>>> Otherwise, something is fishy with the input (it's either not 293 rows,
>> >>>>> or k+p is more than 293).
>> >>>>>
>> >>>>> -d
>> >>>>>
>> >>>>> On Tue, Aug 16, 2011 at 11:39 AM, Eshwaran Vijaya Kumar <
>> >>>>> [email protected]> wrote:
>> >>>>>
>> >>>>>>
>> >>>>>> On Aug 16, 2011, at 10:35 AM, Dmitriy Lyubimov wrote:
>> >>>>>>
>> >>>>>>> This is an unusually small input. What's the block size? Use large
>> >>>>>>> blocks (such as 30,000). The block size can't be less than k+p.
>> >>>>>>>
>> >>>>>>
>> >>>>>> I did set blockSize to 30,000 (as recommended in the PDF that you
>> >>>>>> wrote up). As for the input size, the reason for using such a small
>> >>>>>> input is that it is easier to test and verify the map-reduce pipeline
>> >>>>>> against my in-memory implementation of the algorithm.
>> >>>>>>
>> >>>>>>> Can you please cut and paste the actual log of the qjob tasks that
>> >>>>>>> failed? This is a front-end error, but the actual problem is in the
>> >>>>>>> backend, ranging anywhere from hadoop problems to algorithm problems.
>> >>>>>> Sure. Refer http://esh.pastebin.mozilla.org/1302059
>> >>>>>> Input is a DistributedRowMatrix 293 X 236, k = 4, p = 40,
>> >>>>>> numReduceTasks = 1, blockHeight = 30,000. Reducing to p = 20 ensures
>> >>>>>> the job goes through...
>> >>>>>>
>> >>>>>> Thanks again
>> >>>>>> Esh
>> >>>>>>
>> >>>>>>
>> >>>>>>> On Aug 16, 2011 9:44 AM, "Eshwaran Vijaya Kumar" <
>> >>>>>> [email protected]>
>> >>>>>>> wrote:
>> >>>>>>>> Thanks again. I am using 0.5 right now. We will try to patch it up
>> >>>>>>>> and see how it performs. In the meantime, I am having another
>> >>>>>>>> (possibly user?) error: I have a 260 X 230 matrix. I set k+p = 40,
>> >>>>>>>> and it fails with:
>> >>>>>>>>
>> >>>>>>>> Exception in thread "main" java.io.IOException: Q job unsuccessful.
>> >>>>>>>>   at org.apache.mahout.math.hadoop.stochasticsvd.QJob.run(QJob.java:349)
>> >>>>>>>>   at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:262)
>> >>>>>>>>   at org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli.run(SSVDCli.java:91)
>> >>>>>>>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> >>>>>>>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>> >>>>>>>>   at org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli.main(SSVDCli.java:131)
>> >>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> >>>>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >>>>>>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>> >>>>>>>>   at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>> >>>>>>>>   at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>> >>>>>>>>   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
>> >>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> >>>>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >>>>>>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>> >>>>>>>>   at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Suppose I set k+p to be much lower, say around 20; then it works
>> >>>>>>>> fine. Is it just that my dataset is of low rank, or is there
>> >>>>>>>> something else going on here?
>> >>>>>>>>
>> >>>>>>>> Thanks
>> >>>>>>>> Esh
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Aug 14, 2011, at 1:47 PM, Dmitriy Lyubimov wrote:
>> >>>>>>>>
>> >>>>>>>>> ... I need to allow some time for review before pushing to the ASF
>> >>>>>>>>> repo)..
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> On Sun, Aug 14, 2011 at 1:47 PM, Dmitriy Lyubimov <
>> >>>> [email protected]>
>> >>>>>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> The patch is posted as MAHOUT-786.
>> >>>>>>>>>>
>> >>>>>>>>>> Also, 0.6 trunk with the patch applied is here:
>> >>>>>>>>>> https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-786
>> >>>>>>>>>>
>> >>>>>>>>>> I will commit to the ASF repo tomorrow night (even though it is
>> >>>>>>>>>> extremely simple, I need
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Sat, Aug 13, 2011 at 1:48 PM, Eshwaran Vijaya Kumar <
>> >>>>>>>>>> [email protected]> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>> Dmitriy,
>> >>>>>>>>>>> That sounds great. I eagerly await the patch.
>> >>>>>>>>>>> Thanks
>> >>>>>>>>>>> Esh
>> >>>>>>>>>>> On Aug 13, 2011, at 1:37 PM, Dmitriy Lyubimov wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> Ok, I got u0 working.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> The problem is of course that something called the BBt job has to
>> >>>>>>>>>>>> be coerced to have 1 reducer (that's fine: every mapper won't
>> >>>>>>>>>>>> yield more than an upper-triangular matrix of (k+p) x (k+p)
>> >>>>>>>>>>>> geometry, so even if you end up having thousands of them, the
>> >>>>>>>>>>>> reducer would sum them up just fine).
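>> >>>>>>>>>>>>
>> >>>>>>>>>>>> (Conceptually the reduce side only has to do something like the
>> >>>>>>>>>>>> following -- an illustrative sketch, not the actual BBt job code;
>> >>>>>>>>>>>> here each partial is stored packed, row-major, upper-triangular.)
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>   // Sum many (k+p) x (k+p) upper-triangular partials into one.
>> >>>>>>>>>>>>   static double[] sumUpperTriangular(Iterable<double[]> partials, int kp) {
>> >>>>>>>>>>>>     int packedLen = kp * (kp + 1) / 2;   // only the upper triangle is kept
>> >>>>>>>>>>>>     double[] acc = new double[packedLen];
>> >>>>>>>>>>>>     for (double[] part : partials) {
>> >>>>>>>>>>>>       for (int i = 0; i < packedLen; i++) {
>> >>>>>>>>>>>>         acc[i] += part[i];               // element-wise sum, order doesn't matter
>> >>>>>>>>>>>>       }
>> >>>>>>>>>>>>     }
>> >>>>>>>>>>>>     return acc;  // memory stays O((k+p)^2) no matter how many partials arrive
>> >>>>>>>>>>>>   }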
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> It worked before, apparently, because the configuration held 1
>> >>>>>>>>>>>> reducer by default if not set explicitly; I am not quite sure
>> >>>>>>>>>>>> whether it's something in the hadoop MR client or a mahout change
>> >>>>>>>>>>>> that now precludes it from working.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Anyway, I got a patch (really a one-liner), and an example
>> >>>>>>>>>>>> equivalent to yours worked fine for me with 3 reducers.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Also, the tests request 3 reducers as well, but the reason it
>> >>>>>>>>>>>> works in tests and not in distributed mapred is that local mapred
>> >>>>>>>>>>>> doesn't support multiple reducers. I investigated this issue
>> >>>>>>>>>>>> before, and apparently there were a couple of patches floating
>> >>>>>>>>>>>> around, but for some reason those changes did not take hold in
>> >>>>>>>>>>>> cdh3u0.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I will publish the patch in a jira shortly and will commit it
>> >>>>>>>>>>>> Sunday-ish.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Thanks.
>> >>>>>>>>>>>> -d
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Fri, Aug 5, 2011 at 7:06 PM, Eshwaran Vijaya Kumar <
>> >>>>>>>>>>>> [email protected]> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> OK. So to add more info to this: I tried setting the number of
>> >>>>>>>>>>>>> reducers to 1, and now I don't get that particular error. The
>> >>>>>>>>>>>>> singular values and the left and right singular vectors also
>> >>>>>>>>>>>>> appear to be correct (verified using Matlab).
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Aug 5, 2011, at 1:55 PM, Eshwaran Vijaya Kumar wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> All,
>> >>>>>>>>>>>>>> I am trying to test Stochastic SVD and am facing some errors;
>> >>>>>>>>>>>>>> it would be great if someone could clarify what is going on. I
>> >>>>>>>>>>>>>> am trying to feed the solver a DistributedRowMatrix with the
>> >>>>>>>>>>>>>> exact same parameters that the test in
>> >>>>>>>>>>>>>> LocalSSVDSolverSparseSequentialTest uses, i.e., generate a 1000
>> >>>>>>>>>>>>>> X 100 DRM with SequentialSparseVectors and then ask for
>> >>>>>>>>>>>>>> blockHeight = 251, p (oversampling) = 60, k (rank) = 40. I get
>> >>>>>>>>>>>>>> the following error:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Exception in thread "main" java.io.IOException: Unexpected overrun in upper triangular matrix files
>> >>>>>>>>>>>>>>   at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.loadUpperTriangularMatrix(SSVDSolver.java:471)
>> >>>>>>>>>>>>>>   at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:268)
>> >>>>>>>>>>>>>>   at com.mozilla.SSVDCli.run(SSVDCli.java:89)
>> >>>>>>>>>>>>>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> >>>>>>>>>>>>>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>> >>>>>>>>>>>>>>   at com.mozilla.SSVDCli.main(SSVDCli.java:129)
>> >>>>>>>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >>>>>>>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> >>>>>>>>>>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >>>>>>>>>>>>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>> >>>>>>>>>>>>>>   at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Also, I am using CDH3 with Mahout recompiled to work with CDH3 jars.
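>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> (For reference, a minimal sketch of how a DRM-style sequence
>> >>>>>>>>>>>>>> file of sparse vectors can be written -- IntWritable keys and
>> >>>>>>>>>>>>>> VectorWritable values, as DistributedRowMatrix uses; the random
>> >>>>>>>>>>>>>> fill and density here are purely illustrative.)
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>   import java.util.Random;
>> >>>>>>>>>>>>>>   import org.apache.hadoop.conf.Configuration;
>> >>>>>>>>>>>>>>   import org.apache.hadoop.fs.FileSystem;
>> >>>>>>>>>>>>>>   import org.apache.hadoop.fs.Path;
>> >>>>>>>>>>>>>>   import org.apache.hadoop.io.IntWritable;
>> >>>>>>>>>>>>>>   import org.apache.hadoop.io.SequenceFile;
>> >>>>>>>>>>>>>>   import org.apache.mahout.math.SequentialAccessSparseVector;
>> >>>>>>>>>>>>>>   import org.apache.mahout.math.Vector;
>> >>>>>>>>>>>>>>   import org.apache.mahout.math.VectorWritable;
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>   // Write a rows x cols DRM of sparse row vectors to a sequence file.
>> >>>>>>>>>>>>>>   public static void writeRandomDrm(Configuration conf, Path out,
>> >>>>>>>>>>>>>>                                     int rows, int cols) throws Exception {
>> >>>>>>>>>>>>>>     FileSystem fs = FileSystem.get(conf);
>> >>>>>>>>>>>>>>     SequenceFile.Writer w = SequenceFile.createWriter(fs, conf, out,
>> >>>>>>>>>>>>>>         IntWritable.class, VectorWritable.class);
>> >>>>>>>>>>>>>>     Random rnd = new Random(1234L);
>> >>>>>>>>>>>>>>     try {
>> >>>>>>>>>>>>>>       for (int i = 0; i < rows; i++) {
>> >>>>>>>>>>>>>>         Vector row = new SequentialAccessSparseVector(cols);
>> >>>>>>>>>>>>>>         for (int j = 0; j < cols; j++) {
>> >>>>>>>>>>>>>>           if (rnd.nextDouble() < 0.1) {      // ~10% non-zeros per row
>> >>>>>>>>>>>>>>             row.set(j, rnd.nextGaussian());
>> >>>>>>>>>>>>>>           }
>> >>>>>>>>>>>>>>         }
>> >>>>>>>>>>>>>>         w.append(new IntWritable(i), new VectorWritable(row));
>> >>>>>>>>>>>>>>       }
>> >>>>>>>>>>>>>>     } finally {
>> >>>>>>>>>>>>>>       w.close();
>> >>>>>>>>>>>>>>     }
>> >>>>>>>>>>>>>>   }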
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Thanks
>> >>>>>>>>>>>>>> Esh
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>
>> >>>>
>> >>>
>> >>
>>
>>
