Re: [scikit-learn] Inconsistent Logistic Regression fit results

2016-08-15 Thread m...@sebastianraschka.com
Hi, Chris,
have you set the random seed to a specific, constant integer value? Note that 
the default in LogisticRegression is random_state=None. Setting it to some 
arbitrary number like 123 may help if you haven’t done so yet.
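
For example, a minimal sketch on made-up data (replace X and y with your own):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data just to illustrate; swap in your real X and y
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = (X[:, 0] + rng.randn(100) > 0).astype(int)

# With a fixed random_state, repeated fits give identical coefficients
clf_a = LogisticRegression(random_state=123).fit(X, y)
clf_b = LogisticRegression(random_state=123).fit(X, y)
print(np.allclose(clf_a.coef_, clf_b.coef_))  # True
```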

Best,
Sebastian



> On Aug 15, 2016, at 5:27 PM, Chris Cameron  wrote:
> 
> Hi all,
> 
> Using the same X and y values sklearn.linear_model.LogisticRegression.fit() 
> is providing me with inconsistent results.
> 
> The documentation for sklearn.linear_model.LogisticRegression states that “It 
> is thus not uncommon, to have slightly different results for the same input 
> data.” I am experiencing this, however the fix of using a smaller “tol” 
> parameter isn’t providing me with a consistent fit.
> 
> The code I’m using:
> 
> def log_run(logreg_x, logreg_y):
>     logreg_x['pass_fail'] = logreg_y
>     df_train, df_test = train_test_split(logreg_x, random_state=0)
>     y_train = df_train.pass_fail.as_matrix()
>     y_test = df_test.pass_fail.as_matrix()
>     del df_train['pass_fail']
>     del df_test['pass_fail']
>     log_reg_fit = LogisticRegression(class_weight='balanced',
>                                      tol=0.1).fit(df_train, y_train)
>     predicted = log_reg_fit.predict(df_test)
>     accuracy = accuracy_score(y_test, predicted)
>     kappa = cohen_kappa_score(y_test, predicted)
> 
>     return [kappa, accuracy]
> 
> 
> I’ve gone out of my way to be sure the test and train data is the same for 
> each run, so I don’t think there should be random shuffling going on.
> 
> Example output:
> ---
> log_run(df_save, y)
> Out[32]: [0.027728, 0.5]
> 
> log_run(df_save, y)
> Out[33]: [0.027728, 0.5]
> 
> log_run(df_save, y)
> Out[34]: [0.11347517730496456, 0.58337]
> 
> log_run(df_save, y)
> Out[35]: [0.042553191489361743, 0.55004]
> 
> log_run(df_save, y)
> Out[36]: [-0.07407407407407407, 0.51672]
> 
> log_run(df_save, y)
> Out[37]: [0.042553191489361743, 0.55004]
> 
> A little information on the problem DataFrame:
> ---
> len(df_save)
> Out[40]: 240
> 
> len(df_save.columns)
> Out[41]: 18
> 
> 
> If I omit this particular column, the Kappa no longer fluctuates:
> 
> df_save['abc'].head()
> Out[42]: 
> 0    0.026316
> 1    0.33
> 2    0.015152
> 3    0.010526
> 4    0.125000
> Name: abc, dtype: float64
> 
> 
> Does anyone have ideas on how I can figure this out? Is there some 
> randomness/shuffling still going on I missed?
> 
> 
> Thanks!
> Chris

___
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Inconsistent Logistic Regression fit results

2016-08-15 Thread m...@sebastianraschka.com
Hm, it was worth a try. What happens if you change the solver to something 
other than liblinear? Does the issue still persist?
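
For example, something along these lines (synthetic stand-in data, otherwise the same settings as your snippet):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; swap in your df_train.values / y_train
rng = np.random.RandomState(0)
X = rng.randn(120, 4)
y = (X.sum(axis=1) > 0).astype(int)

# lbfgs is a deterministic solver, so repeated fits should agree exactly
coefs = [LogisticRegression(class_weight='balanced', tol=0.1,
                            solver='lbfgs').fit(X, y).coef_
         for _ in range(3)]
print(all(np.array_equal(coefs[0], c) for c in coefs))  # True
```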


Btw, scikit-learn works with NumPy arrays, not NumPy matrices. It’s probably 
unrelated to your issue, but I’d recommend setting

    y_train = df_train.pass_fail.values
    y_test = df_test.pass_fail.values

instead of

    y_train = df_train.pass_fail.as_matrix()
    y_test = df_test.pass_fail.as_matrix()


Also, try passing NumPy arrays to the fit method:

    log_reg_fit = LogisticRegression(...).fit(df_train.values, y_train)

and

    predicted = log_reg_fit.predict(df_test.values)

and so forth.
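
I.e., the whole pattern would look something like this (tiny made-up frame; the column names are just placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny illustrative training frame; use your real df_train instead
df_train = pd.DataFrame({'f1': [0.1, 0.4, 0.35, 0.8],
                         'f2': [1.0, 0.5, 0.2, 0.9]})
y_train = np.array([0, 0, 1, 1])

# .values hands scikit-learn a plain ndarray instead of a DataFrame
log_reg_fit = LogisticRegression().fit(df_train.values, y_train)
predicted = log_reg_fit.predict(df_train.values)
print(predicted.shape)  # (4,)
```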





> On Aug 15, 2016, at 6:00 PM, Chris Cameron  wrote:
> 
> Sebastian,
> 
> That doesn’t do it. With the function:
> 
> def log_run(logreg_x, logreg_y):
>     logreg_x['pass_fail'] = logreg_y
>     df_train, df_test = train_test_split(logreg_x, random_state=0)
>     y_train = df_train.pass_fail.as_matrix()
>     y_test = df_test.pass_fail.as_matrix()
>     del df_train['pass_fail']
>     del df_test['pass_fail']
>     log_reg_fit = LogisticRegression(class_weight='balanced',
>                                      tol=0.1,
>                                      random_state=0).fit(df_train, y_train)
>     predicted = log_reg_fit.predict(df_test)
>     accuracy = accuracy_score(y_test, predicted)
>     kappa = cohen_kappa_score(y_test, predicted)
> 
>     return [kappa, accuracy]
> 
> I’m still seeing:
> log_run(df_save, y)
> Out[7]: [-0.054421768707483005, 0.48334]
> 
> log_run(df_save, y)
> Out[8]: [0.042553191489361743, 0.55004]
> 
> log_run(df_save, y)
> Out[9]: [0.042553191489361743, 0.55004]
> 
> log_run(df_save, y)
> Out[10]: [0.027728, 0.5]
> 
> 
> Chris
> 
Re: [scikit-learn] update pydata schedule

2016-08-18 Thread m...@sebastianraschka.com
Sorry for the previous email, please disregard. It was a reminder to myself 
and I somehow sent it to the wrong recipient.

Sent from my iPhone


___
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn