Hm, it was worth a try. What happens if you change the solver to something
other than liblinear? Does the issue still persist?
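For example, something along these lines (just a sketch based on your snippet; 'lbfgs' is one arbitrary alternative, 'newton-cg' or 'sag' would do as well):

> # same settings as in your snippet, only the solver is swapped
> log_reg_fit = LogisticRegression(class_weight='balanced',
>                                  tol=0.1,
>                                  solver='lbfgs',
>                                  random_state=0).fit(df_train, y_train)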
Btw., scikit-learn works with NumPy arrays, not NumPy matrices. It’s probably
unrelated to your issue, but I’d recommend setting
> y_train = df_train.pass_fail.values
> y_test = df_test.pass_fail.values
instead of
> y_train = df_train.pass_fail.as_matrix()
> y_test = df_test.pass_fail.as_matrix()
Also, try passing NumPy arrays to the fit method:
> log_reg_fit = LogisticRegression(...).fit(df_train.values, y_train)
and
> predicted = log_reg_fit.predict(df_test.values)
and so forth.
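Putting these suggestions together, the function could look roughly like this (just a sketch; the solver choice is an arbitrary example, as above):

> from sklearn.linear_model import LogisticRegression
> from sklearn.metrics import accuracy_score, cohen_kappa_score
> from sklearn.cross_validation import train_test_split  # sklearn.model_selection in 0.18+
>
> def log_run(logreg_x, logreg_y):
>     logreg_x['pass_fail'] = logreg_y
>     df_train, df_test = train_test_split(logreg_x, random_state=0)
>     y_train = df_train.pass_fail.values
>     y_test = df_test.pass_fail.values
>     del df_train['pass_fail']
>     del df_test['pass_fail']
>     # solver='lbfgs' is just one alternative to the liblinear default
>     log_reg_fit = LogisticRegression(class_weight='balanced',
>                                      tol=0.1,
>                                      solver='lbfgs',
>                                      random_state=0).fit(df_train.values, y_train)
>     predicted = log_reg_fit.predict(df_test.values)
>     accuracy = accuracy_score(y_test, predicted)
>     kappa = cohen_kappa_score(y_test, predicted)
>     return [kappa, accuracy]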
> On Aug 15, 2016, at 6:00 PM, Chris Cameron wrote:
>
> Sebastian,
>
> That doesn’t do it. With the function:
>
> def log_run(logreg_x, logreg_y):
>     logreg_x['pass_fail'] = logreg_y
>     df_train, df_test = train_test_split(logreg_x, random_state=0)
>     y_train = df_train.pass_fail.as_matrix()
>     y_test = df_test.pass_fail.as_matrix()
>     del(df_train['pass_fail'])
>     del(df_test['pass_fail'])
>     log_reg_fit = LogisticRegression(class_weight='balanced',
>                                      tol=0.1,
>                                      random_state=0).fit(df_train, y_train)
>     predicted = log_reg_fit.predict(df_test)
>     accuracy = accuracy_score(y_test, predicted)
>     kappa = cohen_kappa_score(y_test, predicted)
>
>     return [kappa, accuracy]
>
> I’m still seeing:
> log_run(df_save, y)
> Out[7]: [-0.054421768707483005, 0.48334]
>
> log_run(df_save, y)
> Out[8]: [0.042553191489361743, 0.55004]
>
> log_run(df_save, y)
> Out[9]: [0.042553191489361743, 0.55004]
>
> log_run(df_save, y)
> Out[10]: [0.027728, 0.5]
>
>
> Chris
>
>> On Aug 15, 2016, at 3:42 PM, [email protected] wrote:
>>
>> Hi, Chris,
>> have you set the random seed to a specific, constant integer value? Note that
>> the default in LogisticRegression is random_state=None. Setting it to some
>> arbitrary number like 123 may help if you haven’t done so yet.
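>> For example, with the settings from your snippet:
>>
>> log_reg = LogisticRegression(class_weight='balanced', tol=0.1,
>>                              random_state=123)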
>>
>> Best,
>> Sebastian
>>
>>
>>
>>> On Aug 15, 2016, at 5:27 PM, Chris Cameron wrote:
>>>
>>> Hi all,
>>>
>>> Using the same X and y values, sklearn.linear_model.LogisticRegression.fit()
>>> is giving me inconsistent results.
>>>
>>> The documentation for sklearn.linear_model.LogisticRegression states that
>>> “It is thus not uncommon, to have slightly different results for the same
>>> input data.” I am experiencing this; however, the fix of using a smaller
>>> “tol” parameter isn’t providing me with a consistent fit.
>>>
>>> The code I’m using:
>>>
>>> def log_run(logreg_x, logreg_y):
>>>     logreg_x['pass_fail'] = logreg_y
>>>     df_train, df_test = train_test_split(logreg_x, random_state=0)
>>>     y_train = df_train.pass_fail.as_matrix()
>>>     y_test = df_test.pass_fail.as_matrix()
>>>     del(df_train['pass_fail'])
>>>     del(df_test['pass_fail'])
>>>     log_reg_fit = LogisticRegression(class_weight='balanced',
>>>                                      tol=0.1).fit(df_train, y_train)
>>>     predicted = log_reg_fit.predict(df_test)
>>>     accuracy = accuracy_score(y_test, predicted)
>>>     kappa = cohen_kappa_score(y_test, predicted)
>>>
>>>     return [kappa, accuracy]
>>>
>>>
>>> I’ve gone out of my way to make sure the train and test data are the same for
>>> each run, so I don’t think there should be any random shuffling going on.
>>>
>>> Example output:
>>> ---
>>> log_run(df_save, y)
>>> Out[32]: [0.027728, 0.5]
>>>
>>> log_run(df_save, y)
>>> Out[33]: [0.027728, 0.5]
>>>
>>> log_run(df_save, y)
>>> Out[34]: [0.11347517730496456, 0.58337]
>>>
>>> log_run(df_save, y)
>>> Out[35]: [0.042553191489361743, 0.55004]
>>>
>>> log_run(df_save, y)
>>> Out[36]: [-0.07407407407407407, 0.51672]
>>>
>>> log_run(df_save, y)
>>> Out[37]: [0.042553191489361743, 0.55004]
>>>
>>> A little information on the problem DataFrame:
>>> ---
>>> len(df_save)
>>> Out[40]: 240
>>>
>>> len(df_save.columns)
>>> Out[41]: 18
>>>
>>>
>>> If I omit this