Understood. Thanks a lot.
On Sun, Mar 17, 2013 at 9:57 AM, Sean Owen <[email protected]> wrote:

> If an algorithm has a stochastic/random element, no, it won't necessarily
> produce the same result, by design. If you can fix the seed of the random
> number generator, you should get the same result. Except that if the
> process is multi-threaded or distributed, even that doesn't guarantee it --
> the RNG could be accessed in a different order. Even if you can control
> your code it can be hard to control the RNGs in third-party libraries. Even
> in a deterministic single-threaded program Java's floating point results
> are not guaranteed to be the same across platforms (unless you use
> strictfp).
>
> ALS definitely has a random starting point, so reproducibility is not
> guaranteed even from the top. If you fix the random seed in the context of
> this project's unit tests, you *should* get the same result since I think
> it manages to use no third-party RNGs and runs a test from a fixed starting
> point in 1 thread.
>
> KNN does not have a stochastic element. I think you would get the same
> results on one platform, unless I'm missing something.
>
> I don't think exact reproducibility is an issue. Certainly not at scale,
> where the entire computation is distributed over a complex cluster
> environment. Most ML is about guessing at what's not known anyway. As long
> as very small differences make only very small differences in the outcome,
> differing FP behavior will make no or vanishingly small difference.
>
> The only place where I think FP reproducibility matters -- of the sort that
> numerical libraries care about -- is in under/overflow issues. But that is
> solved by moving into a log space or something. You would never want to
> depend on the nth significant digit of a float mattering.
>
>
> On Sun, Mar 17, 2013 at 1:43 PM, Koobas <[email protected]> wrote:
>
> > I am asking the basic reproducibility question.
> > If I run twice on the same dataset, with the same hardware setup, will I
> > always get the same results?
> > Or is there any chance that on two different runs, the same user will get
> > slightly different suggestions?
> > I mostly work in the space of numerical libraries, where
> > reproducibility is, sort of, a big deal.
> > Maybe it's not much of a concern in machine learning.
> > I am just curious.
> >
> >
> > On Sun, Mar 17, 2013 at 8:46 AM, Sean Owen <[email protected]> wrote:
> >
> > > What's your question? ALS has a random starting point which changes the
> > > results a bit. Not sure about KNN though.
> > >
> > >
> > > On Sun, Mar 17, 2013 at 3:03 AM, Koobas <[email protected]> wrote:
> > >
> > > > Can anybody shed any light on the issue of reproducibility in Mahout,
> > > > with and without Hadoop, specifically in the context of kNN and ALS
> > > > recommenders?
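For anyone reading this thread later, here is a rough sketch of two of the points above: fixing the RNG seed to get a repeatable random starting point in a single thread, and moving into log space to avoid underflow. This is only an illustration under stated assumptions. The names SeedDemo, randomInit, and logSumExp are made up for this example and are not Mahout APIs, and randomInit is just a stand-in for a random starting point like the one ALS uses, not how Mahout actually initializes it.

```java
import java.util.Arrays;
import java.util.Random;

public class SeedDemo {

    // Hypothetical stand-in for a random starting point (e.g. initial factors).
    // java.util.Random specifies its algorithm, so the same seed yields the
    // same sequence when the generator is used from a single thread.
    public static double[] randomInit(int n, long seed) {
        Random rng = new Random(seed);
        double[] v = new double[n];
        for (int i = 0; i < n; i++) {
            v[i] = rng.nextGaussian();
        }
        return v;
    }

    // Computes log(exp(a) + exp(b)) without leaving log space, so that very
    // negative inputs do not underflow to zero before the log is taken.
    public static double logSumExp(double a, double b) {
        double max = Math.max(a, b);
        if (Double.isInfinite(max)) {
            return max; // both -infinity, or at least one +infinity
        }
        return max + Math.log(Math.exp(a - max) + Math.exp(b - max));
    }

    public static void main(String[] args) {
        // Same seed, same thread: the two starting points match exactly.
        double[] first = randomInit(5, 42L);
        double[] second = randomInit(5, 42L);
        System.out.println("same seed equal: " + Arrays.equals(first, second));

        // Naive evaluation underflows: exp(-1000) is 0.0 in double precision.
        double naive = Math.log(Math.exp(-1000) + Math.exp(-1000));
        System.out.println("naive:     " + naive);
        System.out.println("log space: " + logSumExp(-1000, -1000));
    }
}
```

Note that this only covers the single-threaded case. If several threads share one Random, the order of calls (and so the values each thread sees) can still vary between runs, which is the caveat raised above.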
