I hope they will come up with 1.4 before Spark Summit in mid-June.
On 31 May 2015 10:07, "Joseph Bradley" <jos...@databricks.com> wrote:

> Spark 1.4 should be available next month, but I'm not sure about the exact
> date.
> Your interpretation of high lambda is reasonable.  "High" lambda is really
> data-dependent.
> "lambda" is the same as the "regParam" in Spark, available in all recent
> Spark versions.
>
> On Fri, May 29, 2015 at 5:35 AM, mélanie gallois <
> melanie.galloi...@gmail.com> wrote:
>
>> When will Spark 1.4 be available, exactly?
>> Regarding "Model selection can be achieved through a high lambda,
>> resulting in lots of zeros in the coefficients": do you mean that setting
>> a high lambda as a parameter of the logistic regression keeps only a few
>> significant variables and "deletes" the others by giving them zero
>> coefficients? What is a high lambda for you?
>> Is lambda a parameter available only in Spark 1.4, or can I use it in
>> Spark 1.3 as well?
>>
>> 2015-05-23 0:04 GMT+02:00 Joseph Bradley <jos...@databricks.com>:
>>
>>> If you want to select specific variable combinations by hand, then you
>>> will need to modify the dataset before passing it to the ML algorithm.  The
>>> DataFrame API should make that easy to do.
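>>>
>>> For example, a minimal sketch (the column names here are made up):
>>>
>>>   // Keep only the label column and the variables you want to try.
>>>   val subset = df.select("label", "age", "income")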
>>>
>>> If you want an ML algorithm to select variables automatically, then I
>>> would recommend using L1 regularization for now, and possibly elastic net
>>> after 1.4 is released, per DB's suggestion.
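>>>
>>> For L1 in the current RDD-based API, a rough sketch would be (the
>>> regParam value is illustrative):
>>>
>>>   import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
>>>   import org.apache.spark.mllib.optimization.L1Updater
>>>
>>>   // Swap in the L1 updater so the penalty drives some weights to zero.
>>>   val lrL1 = new LogisticRegressionWithSGD()
>>>   lrL1.optimizer.
>>>     setUpdater(new L1Updater).
>>>     setRegParam(0.1)
>>>   val model = lrL1.run(trainingData)  // trainingData: RDD[LabeledPoint]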
>>>
>>> If you want detailed model statistics similar to what R provides, I've
>>> created a JIRA for discussing how we should add that functionality to
>>> MLlib.  Those types of stats will be added incrementally, but feedback
>>> would be great for prioritization:
>>> https://issues.apache.org/jira/browse/SPARK-7674
>>>
>>> To answer your question: "How are the weights calculated: is there a
>>> correlation calculation with the variable of interest?"
>>> --> Weights are calculated as with all logistic regression algorithms,
>>> by using convex optimization to minimize a regularized log loss.
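>>>
>>> Concretely, the optimizer minimizes an objective of roughly this form (a
>>> sketch, with labels y_i in {-1, +1} and weight vector w):
>>>
>>>   (1/n) * sum_i log(1 + exp(-y_i * w^T x_i)) + lambda * R(w)
>>>
>>> where R(w) is the regularizer (e.g. ||w||_1 for L1), so the weights come
>>> from minimizing this joint objective rather than from per-variable
>>> correlations with the label.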
>>>
>>> Good luck!
>>> Joseph
>>>
>>> On Fri, May 22, 2015 at 1:07 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>
>>>> In Spark 1.4, logistic regression with elastic net is implemented in the
>>>> ML pipeline framework. Model selection can be achieved through a high
>>>> lambda, resulting in lots of zeros in the coefficients.
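>>>>
>>>> A minimal sketch of what that might look like with the new API (the
>>>> parameter values and the name trainingDF are illustrative):
>>>>
>>>>   import org.apache.spark.ml.classification.LogisticRegression
>>>>
>>>>   // elasticNetParam = 1.0 is a pure L1 penalty; a large regParam
>>>>   // (lambda) pushes many coefficients to exactly zero.
>>>>   val lr = new LogisticRegression()
>>>>     .setElasticNetParam(1.0)
>>>>     .setRegParam(0.3)
>>>>   val model = lr.fit(trainingDF)  // DataFrame with label/features columns
>>>>   // Variables with zero weight in model.weights were effectively dropped.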
>>>>
>>>> Sincerely,
>>>>
>>>> DB Tsai
>>>> -------------------------------------------------------
>>>> Blog: https://www.dbtsai.com
>>>>
>>>>
>>>> On Fri, May 22, 2015 at 1:19 AM, SparknewUser
>>>> <melanie.galloi...@gmail.com> wrote:
>>>> > I am new to MLlib and to Spark. (I use Scala.)
>>>> >
>>>> > I'm trying to understand how LogisticRegressionWithLBFGS and
>>>> > LogisticRegressionWithSGD work.
>>>> > I usually use R for logistic regression, but now I am doing it in
>>>> > Spark to be able to analyze big data.
>>>> >
>>>> > The model only returns weights and an intercept. My problem is that I
>>>> > have no information about which variables are significant and which
>>>> > ones I should delete to improve my model. I only have the confusion
>>>> > matrix and the AUC to evaluate the performance.
>>>> >
>>>> > Is there any way to get information about the variables I put in my
>>>> > model?
>>>> > How can I try different variable combinations? Do I have to modify
>>>> > the original dataset (e.g. delete one or several columns)?
>>>> > How are the weights calculated: is there a correlation calculation
>>>> > with the variable of interest?
>>>> >
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> *Mélanie*
>>
>
>
