Re: Does the kFold in Spark always give you the same split?

2015-01-30 Thread Sean Owen
Are you using SGD for logistic regression? There's a random element
there too, by nature. I looked into the code and see that you can't
set a seed, but actually, the sampling is done with a fixed seed per
partition anyway. Hm.
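
To illustrate what a fixed seed per partition implies, here is a minimal sketch in plain Python. It is not Spark's code: the helper names and the seed scheme (base seed plus partition index) are assumptions made up for this example. The point is only that such a sample is repeatable for a given data layout, but depends on how the data is cut into partitions.

```python
import random


def sample_partitions(partitions, fraction, seed):
    """Bernoulli-sample every partition with its own seeded RNG.

    Assumption for illustration: partition i is seeded with seed + i.
    """
    sampled = []
    for index, partition in enumerate(partitions):
        rng = random.Random(seed + index)
        sampled.extend(x for x in partition if rng.random() < fraction)
    return sampled


def split(data, num_partitions):
    """Cut a list into contiguous chunks of near-equal size."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


data = list(range(1000))
run_a = sample_partitions(split(data, 4), 0.1, seed=42)
run_b = sample_partitions(split(data, 4), 0.1, seed=42)
run_c = sample_partitions(split(data, 8), 0.1, seed=42)

print(run_a == run_b)  # same layout and same seed: identical sample
print(run_a == run_c)  # different layout: usually a different sample
```

If the cluster and the local machine partition the input differently, a scheme like this would pick different rows even though the seed never changes.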

In general you would not expect these algorithms to produce the same
result, given the stochastic nature. In this particular case, I'm not
sure if you can or should be able to get the implementation to act
deterministically. Even if the overt use of randomness is seed-able,
there may be some non-determinism in the distributed nature of the
processing that is having an effect.
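
One concrete source of such non-determinism is floating-point addition, which is not associative: the same numbers summed in a different grouping can give a different result. The sketch below is plain Python with a hypothetical `partitioned_sum` helper, not anything from Spark. It sums each partition first and then combines the partial sums, as a distributed aggregation might.

```python
def partitioned_sum(partitions):
    """Sum each partition, then add the partial sums left to right."""
    partials = []
    for partition in partitions:
        subtotal = 0.0
        for x in partition:
            subtotal += x
        partials.append(subtotal)

    total = 0.0
    for partial in partials:
        total += partial
    return total


# The same three values, split across two partitions in two different ways.
layout_a = [[0.1, 0.2], [0.3]]
layout_b = [[0.1], [0.2, 0.3]]

print(partitioned_sum(layout_a))  # 0.6000000000000001
print(partitioned_sum(layout_b))  # 0.6
```

A gradient summed over many records behaves the same way, so tiny differences of this kind can accumulate over iterations and show up as a slightly different model.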

On Fri, Jan 30, 2015 at 7:27 PM, Jianguo Li wrote:
> Thanks. I did specify a seed parameter.
>
> It seems that the problem is not caused by kFold. I ran another
> experiment without cross validation: I built a model with the training
> data and then tested the model on the test data. However, the accuracy still
> varies from one run to another. Interestingly, this only happens when I run
> the experiment on our cluster. If I run the experiment on my local machine,
> I can reproduce the result each time. Has anybody encountered a similar
> issue before?
>
> Thanks,
>
> Jianguo
>
> On Fri, Jan 30, 2015 at 11:22 AM, Sean Owen wrote:
>>
>> Have a look at the source code for MLUtils.kFold. Yes, there is a
>> random element. That's good; you want the folds to be randomly chosen.
>> Note there is a seed parameter, as in a lot of the APIs, that lets you
>> fix the RNG seed and so get the same result every time, if you need
>> to.
>>
>> On Fri, Jan 30, 2015 at 4:12 PM, Jianguo Li wrote:
>> > Hi,
>> >
>> > I am using the utility function kFold provided in Spark for doing k-fold
>> > cross validation using logistic regression. However, each time I run the
>> > experiment, I get a different result. Since everything else stays
>> > constant, I was wondering if this is due to the kFold function I used.
>> > Does anyone know if kFold gives you a different split on a data set
>> > each time you call it?
>> >
>> > Thanks,
>> >
>> > Jianguo
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Does the kFold in Spark always give you the same split?

2015-01-30 Thread Jianguo Li
Thanks. I did specify a seed parameter.

It seems that the problem is not caused by kFold. I ran another
experiment without cross validation: I built a model with the training
data and then tested the model on the test data. However, the accuracy
still varies from one run to another. Interestingly, this only happens when
I run the experiment on our cluster. If I run the experiment on my local
machine, I can reproduce the result each time. Has anybody encountered
a similar issue before?

Thanks,

Jianguo

On Fri, Jan 30, 2015 at 11:22 AM, Sean Owen wrote:

> Have a look at the source code for MLUtils.kFold. Yes, there is a
> random element. That's good; you want the folds to be randomly chosen.
> Note there is a seed parameter, as in a lot of the APIs, that lets you
> fix the RNG seed and so get the same result every time, if you need
> to.
>
> On Fri, Jan 30, 2015 at 4:12 PM, Jianguo Li wrote:
> > Hi,
> >
> > I am using the utility function kFold provided in Spark for doing k-fold
> > cross validation using logistic regression. However, each time I run the
> > experiment, I get a different result. Since everything else stays
> > constant, I was wondering if this is due to the kFold function I used.
> > Does anyone know if kFold gives you a different split on a data set
> > each time you call it?
> >
> > Thanks,
> >
> > Jianguo
>


Re: Does the kFold in Spark always give you the same split?

2015-01-30 Thread Sean Owen
Have a look at the source code for MLUtils.kFold. Yes, there is a
random element. That's good; you want the folds to be randomly chosen.
Note there is a seed parameter, as in a lot of the APIs, that lets you
fix the RNG seed and so get the same result every time, if you need
to.
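
As a rough illustration of why a seed makes the folds repeatable, here is a minimal k-fold sketch in plain Python. It is not the MLUtils.kFold implementation: the `k_fold` helper and its shuffle-then-slice strategy are assumptions for this example. It only mirrors the idea that the same seed yields the same (training, validation) pairs.

```python
import random


def k_fold(data, num_folds, seed):
    """Return a list of (training, validation) pairs.

    Every item lands in the validation set of exactly one fold.
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)

    # Deal the shuffled items round-robin into num_folds buckets.
    folds = [shuffled[i::num_folds] for i in range(num_folds)]

    pairs = []
    for i, validation in enumerate(folds):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        pairs.append((training, validation))
    return pairs


first = k_fold(range(10), num_folds=3, seed=7)
second = k_fold(range(10), num_folds=3, seed=7)
print(first == second)  # True: same seed, same folds
```

Changing the seed reshuffles the data and so gives a different set of folds, which is what you want when no seed is fixed.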

On Fri, Jan 30, 2015 at 4:12 PM, Jianguo Li wrote:
> Hi,
>
> I am using the utility function kFold provided in Spark for doing k-fold
> cross validation using logistic regression. However, each time I run the
> experiment, I get a different result. Since everything else stays
> constant, I was wondering if this is due to the kFold function I used. Does
> anyone know if the kFold gives you a different split on a data set each time
> you call it?
>
> Thanks,
>
> Jianguo




Does the kFold in Spark always give you the same split?

2015-01-30 Thread Jianguo Li
Hi,

I am using the utility function kFold provided in Spark for doing k-fold
cross validation using logistic regression. However, each time I run the
experiment, I get a different result. Since everything else stays
constant, I was wondering if this is due to the kFold function I used. Does
anyone know if the kFold gives you a different split on a data set each
time you call it?

Thanks,

Jianguo