RE: Effects problems in logistic regression

2014-12-22 Thread Franco Barrientos
Thanks again DB Tsai, LogisticRegressionWithLBFGS works for me!

 

From: Franco Barrientos [mailto:franco.barrien...@exalitica.com]
Sent: Thursday, December 18, 2014 16:42
To: 'DB Tsai'
CC: 'Sean Owen'; user@spark.apache.org
Subject: RE: Effects problems in logistic regression

 

Thanks, I will try.

 

From: DB Tsai [mailto:dbt...@dbtsai.com]
Sent: Thursday, December 18, 2014 16:24
To: Franco Barrientos
CC: Sean Owen; user@spark.apache.org
Subject: Re: Effects problems in logistic regression

 

Can you try LogisticRegressionWithLBFGS? I verified that it converges to the
same result as R's glmnet package trained without regularization. The problem
with LogisticRegressionWithSGD is that it converges very slowly, and it is
often very sensitive to the step size, which can lead to a wrong answer.
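
One way to see the step-size sensitivity is to compare the very first gradient
on a raw feature with the gradient on the same feature after standardization.
The sketch below is plain Python, not Spark code. The toy amounts and labels
are made up for illustration (they are not the data from this thread), and
first_gradient and standardize are hypothetical helpers.

```python
import math

def first_gradient(xs, ys):
    """Gradient of the mean log-loss w.r.t. the weight at w = 0, b = 0.

    At the starting point every predicted probability is 0.5, so the
    gradient is mean((0.5 - y) * x) and scales directly with x.
    """
    n = len(xs)
    return sum((0.5 - y) * x for x, y in zip(xs, ys)) / n

def standardize(xs):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return [(x - mu) / sigma for x in xs]

# Toy data (assumed): amounts spanning 0..59102, binary labels.
amounts = [0.0, 100.0, 500.0, 1000.0, 2000.0, 5000.0, 20000.0, 59102.0]
labels = [0, 0, 0, 1, 0, 0, 1, 1]

grad_raw = first_gradient(amounts, labels)
grad_std = first_gradient(standardize(amounts), labels)

print("gradient on raw feature:         ", grad_raw)
print("gradient on standardized feature:", grad_std)
```

With a step size of 1.0, a single update would move the raw-feature weight by
thousands, while the standardized weight moves by at most 0.5. A step size
that suits one scale can therefore be badly wrong for another.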

 

The regularization logic in MLlib is not entirely correct: it penalizes the
intercept. In general, with really high regularization, all the coefficients
should be zero except the intercept. In logistic regression, the non-zero
intercept can be understood as the prior probability of each class, and in
linear regression it is the mean of the response. I'll have a PR to fix this
issue.
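
The link between the intercept and the class prior can be checked directly:
when every coefficient is zero, the log-loss is minimized by an intercept
equal to the log-odds of the positive-class frequency. A small sketch in plain
Python, with made-up labels (an assumption for illustration) and a
hypothetical helper intercept_only_fit:

```python
import math

def intercept_only_fit(labels):
    """Intercept of a logistic model whose coefficients are all zero.

    With no features left, the maximum-likelihood intercept is the
    log-odds (logit) of the fraction of positive labels.
    """
    p = sum(labels) / len(labels)
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up labels: 2 positives out of 10, so the prior is 0.2.
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
b = intercept_only_fit(labels)

print("intercept:", b)
print("sigmoid(intercept):", sigmoid(b))  # recovers the prior, about 0.2
```

If the intercept is penalized as well, strong regularization pulls it toward
zero and the predicted probability toward 0.5 instead of the prior.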





Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

 

On Thu, Dec 18, 2014 at 10:50 AM, Franco Barrientos 
franco.barrien...@exalitica.com wrote:

Yes, without the “amounts” variables the results are similar. When I put in
other variables, it's fine.

 

From: Sean Owen [mailto:so...@cloudera.com]
Sent: Thursday, December 18, 2014 14:22
To: Franco Barrientos
CC: user@spark.apache.org
Subject: Re: Effects problems in logistic regression

 

Are you sure this is an apples-to-apples comparison? For example, does your SAS 
process normalize or otherwise transform the data first? 

 

Is the optimization configured similarly in both cases -- same regularization, 
etc.?

 

Are you sure you are pulling out the intercept correctly? It is a separate 
value from the logistic regression model in Spark.
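
On the normalization question: if one tool standardizes the feature before
fitting and the other does not, the two sets of weights describe the same
model on different scales and can be converted into each other. The sketch
below is plain Python with hypothetical function names. The mean of 1158 comes
from the original post, the standard deviation of 3000.0 is an assumed value
for illustration only, and the SAS pair is read as (intercept, slope).

```python
def to_standardized_scale(w, b, mu, sigma):
    """Convert raw-scale (w, b) to weights for the feature (x - mu) / sigma."""
    return w * sigma, b + w * mu

def to_original_scale(w_std, b_std, mu, sigma):
    """Convert weights fitted on (x - mu) / sigma back to the raw scale."""
    w = w_std / sigma
    return w, b_std - w * mu

mu = 1158.0      # mean reported in the original post
sigma = 3000.0   # assumed standard deviation (not given in the thread)

# Raw-scale weights reported from SAS: intercept -2.6604, slope 0.000245.
w_std, b_std = to_standardized_scale(0.000245, -2.6604, mu, sigma)
w_raw, b_raw = to_original_scale(w_std, b_std, mu, sigma)

x = 5000.0
z_raw = w_raw * x + b_raw
z_std = w_std * (x - mu) / sigma + b_std
print(w_std, b_std)   # weights on the standardized scale
print(z_raw, z_std)   # same linear predictor on either scale
```

Comparing weights across tools only makes sense after both sets have been
brought to the same scale in this way.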

 

On Thu, Dec 18, 2014 at 4:34 PM, Franco Barrientos 
franco.barrien...@exalitica.com wrote:

Hi all!

 

I have a problem with LogisticRegressionWithSGD. When I train on a data set
with one variable (which is the amount of an item) and an intercept, I get
weights of (-0.4021, -207.1749) for the intercept and the variable,
respectively. This doesn't make sense to me, because I ran a logistic
regression on the same data in SAS and got the weights (-2.6604, 0.000245).

 

The range of this variable is from 0 to 59102, with a mean of 1158.

 

The problem appears when I calculate the probability for each user in the
data set: the probability is zero or near zero in many cases, because when
Spark calculates exp(-1*(-0.4021+(-207.1749)*amount)) the result is a very
large number, in fact infinity for Spark.
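
The overflow itself can be avoided by evaluating the logistic function in a
form that never exponentiates a large positive number. Below is a minimal
sketch in plain Python (not Spark code; stable_sigmoid is a hypothetical
helper). Note that Python's math.exp raises OverflowError where the JVM
returns infinity. The sketch does not fix the implausible weights; it only
shows how the probability behaves for each reported weight pair at the mean
amount, reading each pair as (intercept, slope).

```python
import math

def stable_sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)) without overflow.

    For z >= 0, exp(-z) is at most 1. For z < 0, the equivalent form
    exp(z) / (1 + exp(z)) is used, so exp only sees negative inputs.
    """
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

amount = 1158.0  # the mean amount from the post

# Weights reported from LogisticRegressionWithSGD.
z_sgd = -0.4021 + (-207.1749) * amount
# Weights reported from SAS.
z_sas = -2.6604 + 0.000245 * amount

try:
    math.exp(-z_sgd)          # the exponent used by the naive formula
    naive_overflows = False
except OverflowError:
    naive_overflows = True

p_sgd = stable_sigmoid(z_sgd)
p_sas = stable_sigmoid(z_sas)
print("naive exp overflows:", naive_overflows)
print("P(y=1) with SGD weights:", p_sgd)
print("P(y=1) with SAS weights:", p_sas)
```

With the SGD weights the probability underflows to 0.0 at the mean amount,
which matches the "zero or near zero" symptom, while the SAS weights give a
probability of roughly 0.085 there.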

 

How should I treat this variable? Or why did this happen?

 

Thanks,

 

Franco Barrientos
Data Scientist


Re: Effects problems in logistic regression

2014-12-22 Thread DB Tsai
Sounds great.


Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai



Re: Effects problems in logistic regression

2014-12-18 Thread Sean Owen
Are you sure this is an apples-to-apples comparison? For example, does your
SAS process normalize or otherwise transform the data first?

Is the optimization configured similarly in both cases -- same
regularization, etc.?

Are you sure you are pulling out the intercept correctly? It is a separate
value from the logistic regression model in Spark.


RE: Effects problems in logistic regression

2014-12-18 Thread Franco Barrientos
Thanks, I will try.

 
