Re: reproducibility

Koobas Sun, 17 Mar 2013 07:13:33 -0700

Understood.
Thanks a lot.


On Sun, Mar 17, 2013 at 9:57 AM, Sean Owen <[email protected]> wrote:

> If an algorithm has a stochastic/random element, no, it won't necessarily
> produce the same result, by design. If you can fix the seed of the random
> number generator, you should get the same result. Except that if the
> process is multi-threaded or distributed, even that doesn't guarantee it --
> the RNG could be accessed in a different order. Even if you can control
> your code it can be hard to control the RNGs in third-party libraries. Even
> in a deterministic single-threaded program Java's floating point results
> are not guaranteed to be the same across platforms (unless you use
> strictfp).
>
> ALS definitely has a random starting point, so reproducibility is not
> guaranteed even from the top. If you fix the random seed in the context of
> this project's unit tests, you *should* get the same result since I think
> it manages to use no third-party RNGs and runs a test from a fixed starting
> point in 1 thread.
>
> KNN does not have a stochastic element. I think you would get the same
> results on one platform, unless I'm missing something.
>
> I don't think exact reproducibility is an issue, certainly not at scale,
> where the entire computation is distributed over a complex cluster
> environment. Most ML is about guessing at what's not known anyway. As long
> as very small differences in the input make only very small differences in
> the outcome, differing FP behavior will make no or vanishingly small
> difference.
>
> The only place where I think FP reproducibility matters -- of the sort that
> numerical libraries care about -- is in under/overflow issues. But that is
> solved by moving into a log space or something. You would never want to
> depend on the nth significant digit of a float mattering.
>
>
>
>
> On Sun, Mar 17, 2013 at 1:43 PM, Koobas <[email protected]> wrote:
>
> > I am asking the basic reproducibility question.
> > If I run twice on the same dataset, with the same hardware setup, will I
> > always get the same results?
> > Or is there any chance that on two different runs, the same user will get
> > slightly different suggestions?
> > I am mostly revolving in the space of numerical libraries, where
> > reproducibility is, sort of, a big deal.
> > Maybe it's not much of a concern in machine learning.
> > I am just curious.
> >
> >
> > On Sun, Mar 17, 2013 at 8:46 AM, Sean Owen <[email protected]> wrote:
> >
> > > What's your question? ALS has a random starting point which changes the
> > > results a bit. Not sure about KNN though.
> > >
> > >
> > >
> > > On Sun, Mar 17, 2013 at 3:03 AM, Koobas <[email protected]> wrote:
> > >
> > > > Can anybody shed any light on the issue of reproducibility in Mahout,
> > > > with and without Hadoop, specifically in the context of kNN and ALS
> > > > recommenders?
> > > >
> > >
> >
>
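
For anyone reading the archive, here is a minimal sketch of the fixed-seed point made in the quoted reply. It assumes plain java.util.Random in a single thread, not Mahout's own RNG handling, and the class and method names (SeedDemo, draw) are made up for illustration. It does not cover the multi-threaded or distributed case, where the order of RNG access can differ between runs.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative only: SeedDemo and draw() are hypothetical names, not Mahout code.
class SeedDemo {
    // Draws n doubles from a java.util.Random created with the given seed.
    static double[] draw(long seed, int n) {
        Random rng = new Random(seed);
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = rng.nextDouble();
        }
        return out;
    }

    public static void main(String[] args) {
        double[] a = draw(42L, 5);
        double[] b = draw(42L, 5); // same seed as a
        double[] c = draw(43L, 5); // different seed
        System.out.println("same seed, same draws: " + Arrays.equals(a, b));
        System.out.println("different seed, same draws: " + Arrays.equals(a, c));
    }
}
```

With a random starting point, as in ALS, the seed plays the role that 42L plays here: fix it and a single-threaded run starts from the same place each time.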

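On the underflow remark in the quoted reply ("moving into a log space"), a rough sketch of what that can look like, assuming the quantity of interest is a product of many small probabilities. The names (LogSpaceDemo, product, logProduct) and the choice of 1000 factors of 0.1 are hypothetical, picked only to force an underflow.

```java
import java.util.Arrays;

// Illustrative only: hypothetical names, not taken from Mahout.
class LogSpaceDemo {
    // Naive product; the running value underflows to 0.0 once it falls
    // below the smallest positive double.
    static double product(double[] probs) {
        double p = 1.0;
        for (double x : probs) {
            p *= x;
        }
        return p;
    }

    // Same quantity kept in log space: the sum of the logs stays finite.
    static double logProduct(double[] probs) {
        double s = 0.0;
        for (double x : probs) {
            s += Math.log(x);
        }
        return s;
    }

    public static void main(String[] args) {
        double[] probs = new double[1000];
        Arrays.fill(probs, 0.1);
        System.out.println("naive product: " + product(probs));
        System.out.println("log of product: " + logProduct(probs));
    }
}
```

The naive product loses all information once it hits zero, while the log-space sum can still be compared against other candidates.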