[jira] [Commented] (SPARK-27352) Apply for translation of the Chinese version, I hope to get authorization!

2019-04-07 Thread Teng Peng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811835#comment-16811835
 ] 

Teng Peng commented on SPARK-27352:
---

I would say go ahead and send a PR to add the link to the doc. This might be a 
good place for the link: [https://spark.apache.org/docs/latest/index.html]

> Apply for translation of the Chinese version, I hope to get authorization! 
> ---
>
> Key: SPARK-27352
> URL: https://issues.apache.org/jira/browse/SPARK-27352
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Yuan Yifan
>Priority: Minor
>
> Hello everyone, we are [ApacheCN|https://www.apachecn.org/], an open-source 
> community in China, focusing on Big Data and AI.
> Recently, we have been making progress on translating Spark documents.
>  - [Source Of Document|https://github.com/apachecn/spark-doc-zh]
>  - [Document Preview|http://spark.apachecn.org/]
> There are several reasons:
>  *1. The English level of many Chinese users is not very good.*
>  *2. Network problems, you know (China's magic network)!*
>  *3. Online blogs are very messy.*
> We are very willing to do some Chinese localization for your project. If 
> possible, please give us some authorization.
> Yifan Yuan from Apache CN
> You may contact me by mail [tsingjyuj...@163.com|mailto:tsingjyuj...@163.com] 
> for more details



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27352) Apply for translation of the Chinese version, I hope to get authorization!

2019-04-07 Thread Teng Peng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811765#comment-16811765
 ] 

Teng Peng commented on SPARK-27352:
---

Correct me if I am wrong, but I do not think any authorization is required for 
translation into other languages. 




[jira] [Created] (SPARK-24907) Migrate JDBC data source to DataSource API v2

2018-07-24 Thread Teng Peng (JIRA)
Teng Peng created SPARK-24907:
-

 Summary: Migrate JDBC data source to DataSource API v2
 Key: SPARK-24907
 URL: https://issues.apache.org/jira/browse/SPARK-24907
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.3.0
Reporter: Teng Peng









[jira] [Updated] (SPARK-23507) Migrate file-based data sources to data source v2

2018-06-23 Thread Teng Peng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Peng updated SPARK-23507:
--
Issue Type: Umbrella  (was: Improvement)

> Migrate file-based data sources to data source v2
> -
>
> Key: SPARK-23507
> URL: https://issues.apache.org/jira/browse/SPARK-23507
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Priority: Major
>
> Migrate file-based data sources to data source v2, including:
>  # Parquet
>  # ORC
>  # Json
>  # CSV
>  # JDBC
>  # Text






[jira] [Issue Comment Deleted] (SPARK-18755) Add Randomized Grid Search to Spark ML

2018-06-23 Thread Teng Peng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Peng updated SPARK-18755:
--
Comment: was deleted

(was: [~yuhaoyan] Is this what you are looking for: after we build the grid, we 
randomly select a few points in the grid based on the 
searchRatio*totalNumofPoints?

If yes, I am thinking if it is necessary to extend trait Params and write Set 
Get function for searchRatio, which might be over-engineering. )

> Add Randomized Grid Search to Spark ML
> --
>
> Key: SPARK-18755
> URL: https://issues.apache.org/jira/browse/SPARK-18755
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Major
>
> Randomized grid search implements a randomized search over parameters, where 
> each setting is sampled from a distribution over possible parameter values. 
> This has two main benefits over an exhaustive search:
> 1. A budget can be chosen independently of the number of parameters and 
> possible values.
> 2. Adding parameters that do not influence the performance does not decrease 
> efficiency.
> Randomized grid search usually gives results similar to an exhaustive search, 
> while its run time is drastically lower.
> For more background, please refer to:
> sklearn: http://scikit-learn.org/stable/modules/grid_search.html
> http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/
> http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
> https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/.
> There are two ways to implement this in Spark, as I see it:
> 1. Add searchRatio to ParamGridBuilder and conduct sampling directly during 
> build. Only one new public function is required.
> 2. Add a trait RandomizedSearch and create new classes RandomizedCrossValidator 
> and RandomizedTrainValidationSplit, which can be complicated since we need to 
> deal with the models.
> I'd prefer option 1 as it's much simpler and more straightforward; we can 
> support randomized grid search with a minimal change.
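Option 1 above can be sketched in plain Python. This is a hypothetical `build_random_grid` helper illustrating the idea (build the full Cartesian grid, then keep a `searchRatio` fraction of its points), not Spark's actual ParamGridBuilder API:

```python
import itertools
import random

def build_random_grid(param_values, search_ratio, seed=42):
    """Build the full Cartesian grid, then keep a random fraction of it.

    param_values: dict mapping param name -> list of candidate values.
    search_ratio: fraction of grid points to keep (0 < search_ratio <= 1).
    """
    names = list(param_values)
    full_grid = [dict(zip(names, combo))
                 for combo in itertools.product(*param_values.values())]
    k = max(1, int(search_ratio * len(full_grid)))
    return random.Random(seed).sample(full_grid, k)

# 3 x 4 = 12 grid points; a searchRatio of 0.5 keeps 6 of them.
grid = build_random_grid(
    {"regParam": [0.01, 0.1, 1.0], "maxIter": [10, 50, 100, 200]},
    search_ratio=0.5,
)
print(len(grid))  # 6
```

The evaluator then fits one model per sampled point exactly as it would for the full grid, which is why no new model-handling code is needed for this option.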






[jira] [Updated] (SPARK-22911) Migrate structured streaming sources to new DataSourceV2 APIs

2018-06-21 Thread Teng Peng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Peng updated SPARK-22911:
--
Issue Type: Umbrella  (was: Improvement)

> Migrate structured streaming sources to new DataSourceV2 APIs
> -
>
> Key: SPARK-22911
> URL: https://issues.apache.org/jira/browse/SPARK-22911
> Project: Spark
>  Issue Type: Umbrella
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>Priority: Major
>







[jira] [Commented] (SPARK-24516) PySpark Bindings for K8S - make Python 3 the default

2018-06-11 Thread Teng Peng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16508945#comment-16508945
 ] 

Teng Peng commented on SPARK-24516:
---

+1.

> PySpark Bindings for K8S - make Python 3 the default
> 
>
> Key: SPARK-24516
> URL: https://issues.apache.org/jira/browse/SPARK-24516
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.0
>Reporter: Ondrej Kokes
>Priority: Minor
>
> Initial PySpark-k8s bindings have just been resolved (SPARK-23984), but the 
> default Python version there is 2. While you can override this by setting it 
> to 3, I think we should have sensible defaults.
> Python 3 has been around for ten years and is the clear successor, Python 2 
> has only 18 months left in terms of support. There isn't a good reason to 
> suggest Python 2 should be used, not in 2018 and not when both versions are 
> supported.
> The relevant commit [is 
> here|https://github.com/apache/spark/commit/1a644afbac35c204f9ad55f86999319a9ab458c6#diff-6e882d5561424e7e6651eb46f10104b8R194],
>  the version is also [in the 
> documentation|https://github.com/apache/spark/commit/1a644afbac35c204f9ad55f86999319a9ab458c6#diff-b5527f236b253e0d9f5db5164bdb43e9R643].
>  






[jira] [Updated] (SPARK-21199) Its not possible to impute Vector types

2018-06-10 Thread Teng Peng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Peng updated SPARK-21199:
--
Component/s: (was: Spark Core)
 ML

> Its not possible to impute Vector types
> ---
>
> Key: SPARK-21199
> URL: https://issues.apache.org/jira/browse/SPARK-21199
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0, 2.1.1
>Reporter: Franklyn Dsouza
>Priority: Major
>
> There are cases where nulls end up in vector columns in dataframes. Currently 
> there is no way to fill in these nulls because it's not possible to create a 
> literal vector column expression using lit().
> Also, the entire PySpark ML API will fail when it encounters nulls, so this 
> makes it hard to work with the data.
> I think that either vector support should be added to the imputer, or vectors 
> should be supported in column expressions so they can be used in a coalesce.
> [~mlnick]
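The semantics being asked for can be shown with a minimal plain-Python sketch (a hypothetical `impute_vectors` helper, not a PySpark API): fill missing vectors with the element-wise mean of the observed ones, or with an explicit literal vector.

```python
def impute_vectors(rows, fill=None):
    """Replace missing (None) vectors with the element-wise mean of the
    non-null vectors, or with an explicit fill vector if provided."""
    present = [v for v in rows if v is not None]
    if fill is None:
        dim = len(present[0])
        fill = [sum(v[i] for v in present) / len(present) for i in range(dim)]
    return [list(fill) if v is None else v for v in rows]

rows = [[1.0, 2.0], None, [3.0, 4.0]]
print(impute_vectors(rows))  # [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
```

In Spark terms this is either Imputer learning a fill vector, or a coalesce against a literal vector column; today neither path works for VectorType.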






[jira] [Issue Comment Deleted] (SPARK-22943) OneHotEncoder supports manual specification of categorySizes

2018-06-10 Thread Teng Peng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Peng updated SPARK-22943:
--
Comment: was deleted

(was: This issue looks quiet interesting, but can you be more specific about 
"consistent and foreseeable conversion"? Can you give an example that current 
implementation does not handle well?)

> OneHotEncoder supports manual specification of categorySizes
> 
>
> Key: SPARK-22943
> URL: https://issues.apache.org/jira/browse/SPARK-22943
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> OHE should support configurable categorySizes, like n_values in 
> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html,
> which allows consistent and foreseeable conversion.
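Why a fixed size helps can be sketched in a few lines of plain Python (an illustrative `one_hot` helper, not Spark's OneHotEncoder): pinning the vector width keeps the output schema stable even when a particular dataset happens not to contain the largest category.

```python
def one_hot(index, size):
    """Encode a category index as a fixed-length one-hot vector.

    Pinning `size` (rather than inferring it from the data actually seen)
    keeps the output width stable across datasets -- the "consistent and
    foreseeable conversion" the issue asks for.
    """
    if not 0 <= index < size:
        raise ValueError(f"index {index} out of range for size {size}")
    vec = [0] * size
    vec[index] = 1
    return vec

print(one_hot(1, 4))  # [0, 1, 0, 0]
```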






[jira] [Commented] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator

2018-06-06 Thread Teng Peng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503217#comment-16503217
 ] 

Teng Peng commented on SPARK-24431:
---

[~Ben2018] The article makes sense to me. It seems the current behavior follows 
"Case 2: TP is not 0", but it sets precision = 1 when recall = 0. (See 
[https://github.com/apache/spark/blob/734ed7a7b397578f16549070f350215bde369b3c/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L110]
 )

Have you checked out SPARK-21806 and its discussion on JIRA? 

Let's hear [~srowen]'s opinions.

> wrong areaUnderPR calculation in BinaryClassificationEvaluator 
> ---
>
> Key: SPARK-24431
> URL: https://issues.apache.org/jira/browse/SPARK-24431
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Xinyong Tian
>Priority: Major
>
> My problem: I am using CrossValidator(estimator=LogisticRegression(...), ..., 
>  evaluator=BinaryClassificationEvaluator(metricName='areaUnderPR')) to 
> select the best model. When the regParam in logistic regression is very high, no 
> variable is selected (no model), i.e. every row's prediction is the same, e.g. 
> equal to the event rate (baseline frequency). But at this point, 
> BinaryClassificationEvaluator scores the areaUnderPR highest. As a result the best 
> model selected is no model at all.
> The reason is the following: with no model, the precision-recall curve has 
> only two points. At recall = 0, precision should be set to zero, while the 
> software sets it to 1; at recall = 1, precision is the event rate. As a result, 
> the areaUnderPR will be close to 0.5 (my event rate is very low), which is the 
> maximum.
> The solution is to set precision = 0 when recall = 0.
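The effect of the two conventions can be checked numerically. A minimal plain-Python sketch (trapezoidal rule over the two-point PR curve, assuming an event rate of 0.01 as a stand-in for the reporter's rare-positive data):

```python
def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (recall, precision) points."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        area += (x2 - x1) * (y1 + y2) / 2.0
    return area

event_rate = 0.01  # rare positives, as in the report

# Current convention: precision forced to 1 at recall = 0.
current = trapezoid_area([(0.0, 1.0), (1.0, event_rate)])
# Proposed fix: precision = 0 at recall = 0.
proposed = trapezoid_area([(0.0, 0.0), (1.0, event_rate)])

print(current, proposed)  # 0.505 0.005
```

With the current convention the constant "no model" classifier gets areaUnderPR ≈ 0.5 regardless of the data, which can beat genuine models when positives are rare; with the proposed fix it gets roughly event_rate / 2.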






[jira] [Comment Edited] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator

2018-06-02 Thread Teng Peng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499109#comment-16499109
 ] 

Teng Peng edited comment on SPARK-24431 at 6/2/18 6:48 PM:
---

I am trying to understand this description. What's your definition of event 
rate? Is it defined as (TP + FN)/(TP + FP + TN + FN)?

Can you take a look at the test "binary evaluation metrics for RDD where all 
examples have negative label"? Is this an extreme case close to what you 
have?

Also, 0.5 is not the maximum of areaUnderPR, which could reach 1.0. 


was (Author: teng peng):
I am trying to understand this description. What's your definition of event 
rate? Is it defined as (TP+FN)/(TP + FP + TN + FN).

Can you take a look at the test ""binary evaluation metrics for RDD where all 
examples have negative label"? Is this an extreme case that close to what you 
have?

Also, 0.5 is not the maximum of AreaunderPR, which could attain 1.0. 




[jira] [Comment Edited] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator

2018-06-02 Thread Teng Peng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499109#comment-16499109
 ] 

Teng Peng edited comment on SPARK-24431 at 6/2/18 6:47 PM:
---

I am trying to understand this description. What's your definition of event 
rate? Is it defined as (TP + FN)/(TP + FP + TN + FN)?

Can you take a look at the test "binary evaluation metrics for RDD where all 
examples have negative label"? Is this an extreme case close to what you 
have?

Also, 0.5 is not the maximum of areaUnderPR, which could reach 1.0. 


was (Author: teng peng):
I am trying to understand this description. What's your definition of event 
rate? Is it defined as (TP+FN)/(TP + FP + TN + FN).

 

Also, 0.5 is not the maximum of AreaunderPR, which could attain 1.0. 




[jira] [Comment Edited] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator

2018-06-02 Thread Teng Peng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499109#comment-16499109
 ] 

Teng Peng edited comment on SPARK-24431 at 6/2/18 6:47 PM:
---

I am trying to understand this description. What's your definition of event 
rate? Is it defined as (TP + FN)/(TP + FP + TN + FN)?

Can you take a look at the test "binary evaluation metrics for RDD where all 
examples have negative label"? Is this an extreme case close to what you 
have?

Also, 0.5 is not the maximum of areaUnderPR, which could reach 1.0. 


was (Author: teng peng):
I am trying to understand this description. What's your definition of event 
rate? Is it defined as (TP+FN)/(TP + FP + TN + FN).

 Can you take a look at the test ""binary evaluation metrics for RDD where all 
examples have negative label"? Is this an extreme case that close to what you 
have?

Also, 0.5 is not the maximum of AreaunderPR, which could attain 1.0. 




[jira] [Comment Edited] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator

2018-06-02 Thread Teng Peng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499109#comment-16499109
 ] 

Teng Peng edited comment on SPARK-24431 at 6/2/18 6:46 PM:
---

I am trying to understand this description. What's your definition of event 
rate? Is it defined as (TP + FN)/(TP + FP + TN + FN)?

Also, 0.5 is not the maximum of areaUnderPR, which could reach 1.0. 


was (Author: teng peng):
I am trying to understand this description. What's your definition of event 
rate? Is it defined as (TP+FN)/(TP + FP + TN + FN).




[jira] [Commented] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator

2018-06-02 Thread Teng Peng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499109#comment-16499109
 ] 

Teng Peng commented on SPARK-24431:
---

I am trying to understand this description. What's your definition of event 
rate? Is it defined as (TP + FN)/(TP + FP + TN + FN)?




[jira] [Issue Comment Deleted] (SPARK-24391) to_json/from_json should support arrays of primitives, and more generally all JSON

2018-05-26 Thread Teng Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Peng updated SPARK-24391:
--
Comment: was deleted

(was: My plan is to follow the Spark-19849 & Spark-21513 to support more 
primitive types. I will start with StringType to see how it goes.)

> to_json/from_json should support arrays of primitives, and more generally all 
> JSON 
> ---
>
> Key: SPARK-24391
> URL: https://issues.apache.org/jira/browse/SPARK-24391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Sam Kitajima-Kimbrel
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-19849 and 
> https://issues.apache.org/jira/browse/SPARK-21513 brought support for more 
> column types to functions.to_json/from_json, but I also have cases where I'd 
> like to simply (de)serialize an array of primitives to/from JSON when 
> outputting to certain destinations, which does not work:
> {code:java}
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> import spark.implicits._
> import spark.implicits._
> scala> val df = Seq("[1, 2, 3]").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> val schema = new ArrayType(IntegerType, false)
> schema: org.apache.spark.sql.types.ArrayType = ArrayType(IntegerType,false)
> scala> df.select(from_json($"a", schema))
> org.apache.spark.sql.AnalysisException: cannot resolve 'jsontostructs(`a`)' 
> due to data type mismatch: Input schema array must be a struct or an 
> array of structs.;;
> 'Project [jsontostructs(ArrayType(IntegerType,false), a#3, 
> Some(America/Los_Angeles)) AS jsontostructs(a)#10]
> scala> val arrayDf = Seq(Array(1, 2, 3)).toDF("arr")
> arrayDf: org.apache.spark.sql.DataFrame = [arr: array]
> scala> arrayDf.select(to_json($"arr"))
> org.apache.spark.sql.AnalysisException: cannot resolve 'structstojson(`arr`)' 
> due to data type mismatch: Input type array must be a struct, array of 
> structs or a map or array of map.;;
> 'Project [structstojson(arr#19, Some(America/Los_Angeles)) AS 
> structstojson(arr)#26]
> {code}
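Until native support lands, the usual workaround with from_json is to wrap the bare array in an object, parse with a struct schema, then extract the field. A plain-Python sketch of that wrap-and-unwrap shape (Python's json module parses bare arrays natively, so the wrapper here is purely illustrative):

```python
import json

def parse_primitive_array(s):
    """Wrap a bare JSON array in an object, parse it with an object-level
    parser, then unwrap -- the classic workaround when a JSON reader only
    accepts structs at the top level."""
    wrapped = json.loads('{"arr": %s}' % s)  # {"arr": [1, 2, 3]}
    return wrapped["arr"]

print(parse_primitive_array("[1, 2, 3]"))  # [1, 2, 3]
```

In Spark SQL terms that corresponds to concatenating the string into `{"arr": ...}`, calling from_json with a single-field struct schema, and selecting the field afterward.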






[jira] [Commented] (SPARK-24391) to_json/from_json should support arrays of primitives, and more generally all JSON

2018-05-26 Thread Teng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491745#comment-16491745
 ] 

Teng Peng commented on SPARK-24391:
---

My plan is to follow SPARK-19849 and SPARK-21513 to support more primitive 
types. I will start with StringType to see how it goes.




[jira] [Commented] (SPARK-24269) Infer nullability rather than declaring all columns as nullable

2018-05-20 Thread Teng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482156#comment-16482156
 ] 

Teng Peng commented on SPARK-24269:
---

Yes, I understand this rationale too. However, I am inclined to take a more 
conservative position: we should not aggressively set "NOT NULL" during 
inference. If we do, the result might not be what users want.

> Infer nullability rather than declaring all columns as nullable
> ---
>
> Key: SPARK-24269
> URL: https://issues.apache.org/jira/browse/SPARK-24269
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, CSV and JSON datasource set the *nullable* flag to true 
> independently from data itself during schema inferring.
> JSON: 
> https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala#L126
> CSV: 
> https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L51
> For example, source dataset has schema:
> {code}
> root
>  |-- item_id: integer (nullable = false)
>  |-- country: string (nullable = false)
>  |-- state: string (nullable = false)
> {code}
> If we save it and read again the schema of the inferred dataset is
> {code}
> root
>  |-- item_id: integer (nullable = true)
>  |-- country: string (nullable = true)
>  |-- state: string (nullable = true)
> {code}
> The ticket aims to set the nullable flag more precisely during schema 
> inference, based on the data that is read.






[jira] [Commented] (SPARK-24269) Infer nullability rather than declaring all columns as nullable

2018-05-19 Thread Teng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481812#comment-16481812
 ] 

Teng Peng commented on SPARK-24269:
---

Does it make sense to infer nullability from JSON and CSV? 

> Infer nullability rather than declaring all columns as nullable
> ---
>
> Key: SPARK-24269
> URL: https://issues.apache.org/jira/browse/SPARK-24269
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, the CSV and JSON datasources set the *nullable* flag to true 
> independently of the data itself during schema inference.
> JSON: 
> https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala#L126
> CSV: 
> https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L51
> For example, source dataset has schema:
> {code}
> root
>  |-- item_id: integer (nullable = false)
>  |-- country: string (nullable = false)
>  |-- state: string (nullable = false)
> {code}
> If we save it and read again the schema of the inferred dataset is
> {code}
> root
>  |-- item_id: integer (nullable = true)
>  |-- country: string (nullable = true)
>  |-- state: string (nullable = true)
> {code}
> The ticket aims to set the nullable flag more precisely during schema 
> inference, based on the data that is read.






[jira] [Commented] (SPARK-22943) OneHotEncoder supports manual specification of categorySizes

2018-05-02 Thread Teng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461006#comment-16461006
 ] 

Teng Peng commented on SPARK-22943:
---

This issue looks quite interesting, but can you be more specific about 
"consistent and foreseeable conversion"? Can you give an example that the 
current implementation does not handle well?

> OneHotEncoder supports manual specification of categorySizes
> 
>
> Key: SPARK-22943
> URL: https://issues.apache.org/jira/browse/SPARK-22943
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> OHE should support configurable categorySizes, as n-values in  
> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html.
>  which allows consistent and foreseeable conversion.
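
To illustrate what a manually specified category size buys (a plain-Python sketch with hypothetical names, not the Spark ML API): the output width is fixed up front, so datasets containing different subsets of the categories still encode consistently.

```python
def one_hot(index, category_size):
    """Encode a category index as a fixed-width one-hot vector.

    With a manually specified category_size, two datasets that contain
    different subsets of the categories still produce vectors of the
    same width -- a "consistent and foreseeable" conversion.
    (Hypothetical helper, not the Spark ML API.)
    """
    if not 0 <= index < category_size:
        raise ValueError(f"index {index} out of range for size {category_size}")
    vec = [0.0] * category_size
    vec[index] = 1.0
    return vec

# Dataset A only sees categories {0, 1}; dataset B sees {0, 2}.
# With category_size=4 both still encode to 4-dimensional vectors.
print(one_hot(1, 4))  # [0.0, 1.0, 0.0, 0.0]
print(one_hot(2, 4))  # [0.0, 0.0, 1.0, 0.0]
```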






[jira] [Commented] (SPARK-23180) RFormulaModel should have labels member

2018-05-02 Thread Teng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460967#comment-16460967
 ] 

Teng Peng commented on SPARK-23180:
---

Can you give me an example of 1. the current workaround and 2. the proposed 
solution? I will look into it. 

> RFormulaModel should have labels member
> ---
>
> Key: SPARK-23180
> URL: https://issues.apache.org/jira/browse/SPARK-23180
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Kevin Kuo
>Priority: Major
>
> Like {{StringIndexerModel}}, {{RFormulaModel}} should have a {{labels}} 
> member to facilitate constructing the appropriate {{IndexToString}} 
> transformer to get the string labels back. The current workaround is to 
> perform a transform and then parse the schema, which is tedious.






[jira] [Updated] (SPARK-23171) Reduce the time costs of the rule runs that do not change the plans

2018-04-28 Thread Teng Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Peng updated SPARK-23171:
--
Issue Type: Improvement  (was: Umbrella)

> Reduce the time costs of the rule runs that do not change the plans 
> 
>
> Key: SPARK-23171
> URL: https://issues.apache.org/jira/browse/SPARK-23171
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Below are the time stats of the Analyzer/Optimizer rules. Try to improve the 
> rules and reduce the time costs, especially for the runs that do not change 
> the plans.
> {noformat}
> === Metrics of Analyzer/Optimizer Rules ===
> Total number of runs = 175827
> Total time: 20.699042877 seconds
> Rule                                                                            Total Time  Effective Time  Total Runs  Effective Runs
> org.apache.spark.sql.catalyst.optimizer.ColumnPruning                           2340563794  1338268224      1875        761
> org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution                 1632672623  1625071881      788         37
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions       1395087131  347339931       1982        38
> org.apache.spark.sql.catalyst.optimizer.PruneFilters                            1177711364  21344174        1590        3
> org.apache.spark.sql.catalyst.optimizer.Optimizer$OptimizeSubqueries            1145135465  1131417128      285         39
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences               1008347217  663112062       1982        616
> org.apache.spark.sql.catalyst.optimizer.ReorderJoin                             767024424   693001699       1590        132
> org.apache.spark.sql.catalyst.analysis.Analyzer$FixNullability                  598524650   40802876        742         12
> org.apache.spark.sql.catalyst.analysis.DecimalPrecision                         595384169   436153128       1982        211
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery                 548178270   459695885       1982        49
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts           423002864   139869503       1982        86
> org.apache.spark.sql.catalyst.optimizer.BooleanSimplification                   405544962   17250184        1590        7
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin                383837603   284174662       1590        708
> org.apache.spark.sql.catalyst.optimizer.RemoveRedundantAliases                  372901885   3362332         1590        9
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints             364628214   343815519       285         192
> org.apache.spark.sql.execution.datasources.FindDataSourceTable                  303293296   285344766       1982        233
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions                233195019   92648171        1982        294
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$FunctionArgumentConversion  220568919   73932736        1982        38
> org.apache.spark.sql.catalyst.optimizer.NullPropagation                         207976072   9072305         1590        26
> 

[jira] [Updated] (SPARK-23171) Reduce the time costs of the rule runs that do not change the plans

2018-04-28 Thread Teng Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Peng updated SPARK-23171:
--
Issue Type: Umbrella  (was: Improvement)

> Reduce the time costs of the rule runs that do not change the plans 
> 
>
> Key: SPARK-23171
> URL: https://issues.apache.org/jira/browse/SPARK-23171
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Below are the time stats of the Analyzer/Optimizer rules. Try to improve the 
> rules and reduce the time costs, especially for the runs that do not change 
> the plans.
> {noformat}
> === Metrics of Analyzer/Optimizer Rules ===
> Total number of runs = 175827
> Total time: 20.699042877 seconds
> Rule                                                                            Total Time  Effective Time  Total Runs  Effective Runs
> org.apache.spark.sql.catalyst.optimizer.ColumnPruning                           2340563794  1338268224      1875        761
> org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution                 1632672623  1625071881      788         37
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions       1395087131  347339931       1982        38
> org.apache.spark.sql.catalyst.optimizer.PruneFilters                            1177711364  21344174        1590        3
> org.apache.spark.sql.catalyst.optimizer.Optimizer$OptimizeSubqueries            1145135465  1131417128      285         39
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences               1008347217  663112062       1982        616
> org.apache.spark.sql.catalyst.optimizer.ReorderJoin                             767024424   693001699       1590        132
> org.apache.spark.sql.catalyst.analysis.Analyzer$FixNullability                  598524650   40802876        742         12
> org.apache.spark.sql.catalyst.analysis.DecimalPrecision                         595384169   436153128       1982        211
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery                 548178270   459695885       1982        49
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts           423002864   139869503       1982        86
> org.apache.spark.sql.catalyst.optimizer.BooleanSimplification                   405544962   17250184        1590        7
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin                383837603   284174662       1590        708
> org.apache.spark.sql.catalyst.optimizer.RemoveRedundantAliases                  372901885   3362332         1590        9
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints             364628214   343815519       285         192
> org.apache.spark.sql.execution.datasources.FindDataSourceTable                  303293296   285344766       1982        233
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions                233195019   92648171        1982        294
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$FunctionArgumentConversion  220568919   73932736        1982        38
> org.apache.spark.sql.catalyst.optimizer.NullPropagation                         207976072   9072305         1590        26
> 

[jira] [Commented] (SPARK-24024) Fix deviance calculations in GLM to handle corner cases

2018-04-19 Thread Teng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443892#comment-16443892
 ] 

Teng Peng commented on SPARK-24024:
---

I will first reproduce the issue, check how R handles this, and see whether 
any other fixes are needed.
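
For reference, the corner case and the convention used by R-style GLM implementations can be reproduced in a plain-Python sketch (illustrative only, not the actual Spark fix): the term y * log(y / mu) is taken to be 0 at y = 0, its limiting value.

```python
import math

def ylogy(y, mu):
    # Convention used by R and most GLM implementations:
    # y * log(y / mu) -> 0 as y -> 0, so define the term as 0 at y = 0.
    # The naive expression raises a domain error (log of 0) there.
    return 0.0 if y == 0.0 else y * math.log(y / mu)

def poisson_deviance(y, mu, weight):
    # Per-observation deviance contribution for a Poisson-style GLM,
    # mirroring 2.0 * weight * (y * log(y / mu) - (y - mu)).
    return 2.0 * weight * (ylogy(y, mu) - (y - mu))

print(poisson_deviance(0.0, 0.5, 1.0))  # 1.0 -- finite instead of an error
print(round(poisson_deviance(2.0, 1.0, 1.0), 6))  # 0.772589
```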

> Fix deviance calculations in GLM to handle corner cases
> ---
>
> Key: SPARK-24024
> URL: https://issues.apache.org/jira/browse/SPARK-24024
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Teng Peng
>Priority: Minor
>
> It is reported by Spark users that the deviance calculation does not handle 
> a corner case, so the correct model summary cannot be obtained. The user 
> has confirmed that the issue is in
> override def deviance(y: Double, mu: Double, weight: Double): Double = {
>  2.0 * weight * (y * math.log(y / mu) - (y - mu))
>  }
> when y = 0.
>  
> The user also mentioned there are many other places where he believes we 
> should check for the same thing.






[jira] [Created] (SPARK-24024) Fix deviance calculations in GLM to handle corner cases

2018-04-19 Thread Teng Peng (JIRA)
Teng Peng created SPARK-24024:
-

 Summary: Fix deviance calculations in GLM to handle corner cases
 Key: SPARK-24024
 URL: https://issues.apache.org/jira/browse/SPARK-24024
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: Teng Peng


It is reported by Spark users that the deviance calculation does not handle a 
corner case, so the correct model summary cannot be obtained. The user has 
confirmed that the issue is in

override def deviance(y: Double, mu: Double, weight: Double): Double = {
 2.0 * weight * (y * math.log(y / mu) - (y - mu))
 }

when y = 0.

 

The user also mentioned there are many other places where he believes we should 
check for the same thing.






[jira] [Commented] (SPARK-23740) Add FPGrowth Param for filtering out very common items

2018-03-25 Thread Teng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16413199#comment-16413199
 ] 

Teng Peng commented on SPARK-23740:
---

I suppose `beforehand` means before the itemsets have been generated, correct?

If so, it seems we have two approaches here:
 # Add a new filter condition in `genFreqItems`, but this is in MLlib, not ML.
 # Filter the input dataset before we call mllibFP. Then we would have to 
implement a method similar to `genFreqItems` in MLlib. Does this look good to you?
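
Approach 2 (filtering the input dataset before mining) can be sketched in plain Python (hypothetical names, illustrative only): drop items whose support exceeds a ceiling before generating itemsets.

```python
from collections import Counter

def filter_common_items(transactions, max_support=0.999):
    """Drop items appearing in more than max_support of transactions.

    Near-ubiquitous items carry little information but blow up the
    number of generated itemsets, so filtering them beforehand makes
    frequent-itemset mining much faster. (Sketch; hypothetical names,
    not the proposed Spark API.)
    """
    n = len(transactions)
    counts = Counter(item for t in transactions for item in set(t))
    too_common = {i for i, c in counts.items() if c / n > max_support}
    return [[i for i in t if i not in too_common] for t in transactions]

txns = [["a", "b"], ["a", "c"], ["a", "b", "c"], ["a"]]
# "a" appears in 100% of transactions -> filtered out at max_support=0.99
print(filter_common_items(txns, max_support=0.99))
# [['b'], ['c'], ['b', 'c'], []]
```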

> Add FPGrowth Param for filtering out very common items
> --
>
> Key: SPARK-23740
> URL: https://issues.apache.org/jira/browse/SPARK-23740
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> It would be handy to have a Param in FPGrowth for filtering out very common 
> items.  This is from a use case where the dataset had items appearing in 
> 99.9%+ of the rows.  These common items were useless, but they caused the 
> algorithm to generate many unnecessary itemsets.  Filtering useless common 
> items beforehand can make the algorithm much faster.






[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2018-03-20 Thread Teng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407451#comment-16407451
 ] 

Teng Peng edited comment on SPARK-19208 at 3/21/18 4:44 AM:


[~timhunter] Has the Jira ticket been opened? I believe the new API for 
statistical info would be a great improvement.


was (Author: teng peng):
[~timhunter] Has the Jira ticket been opened? I believe this would be a great 
improvement.

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Major
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} use 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However, {{MultivariateOnlineSummarizer}} also computes extra unused 
> statistics. This slows down the task and makes it more prone to OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (the max of absolute values).
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz).
> After the modification in the PR, the above example runs successfully.
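
The point that {{MaxAbsScaler}} needs only a single array can be sketched as a minimal online summarizer (plain-Python sketch with a hypothetical class name, not Spark's implementation):

```python
class MaxAbsSummarizer:
    """Online summarizer that tracks only the per-feature max |value|.

    Compared with a summarizer that maintains mean, M2, L1, min, max,
    nnz, etc., this keeps a single array, so memory stays one array of
    size #features. (Hypothetical sketch, not Spark's implementation.)
    """

    def __init__(self, num_features):
        self.max_abs = [0.0] * num_features

    def add(self, vector):
        for i, v in enumerate(vector):
            a = abs(v)
            if a > self.max_abs[i]:
                self.max_abs[i] = a
        return self  # allow chaining, akin to merging partial summaries

s = MaxAbsSummarizer(3)
s.add([1.0, -5.0, 2.0]).add([-3.0, 4.0, 0.0])
print(s.max_abs)  # [3.0, 5.0, 2.0]
```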






[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2018-03-20 Thread Teng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407451#comment-16407451
 ] 

Teng Peng commented on SPARK-19208:
---

[~timhunter] Has the Jira ticket been opened? I believe this would be a great 
improvement.

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Major
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} use 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However, {{MultivariateOnlineSummarizer}} also computes extra unused 
> statistics. This slows down the task and makes it more prone to OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (the max of absolute values).
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz).
> After the modification in the PR, the above example runs successfully.






[jira] [Commented] (SPARK-23537) Logistic Regression without standardization

2018-03-05 Thread Teng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387276#comment-16387276
 ] 

Teng Peng commented on SPARK-23537:
---

This is a quite interesting question and I do not have an answer yet: do we 
need standardization for L-BFGS in the first place?

> Logistic Regression without standardization
> ---
>
> Key: SPARK-23537
> URL: https://issues.apache.org/jira/browse/SPARK-23537
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Optimizer
>Affects Versions: 2.0.2, 2.2.1
>Reporter: Jordi
>Priority: Major
> Attachments: non-standardization.log, standardization.log
>
>
> I'm trying to train a Logistic Regression model using Spark 2.2.1. I prefer 
> not to use standardization, since all my features are binary, built with the 
> hashing trick (2^20 sparse vector).
> I trained two models to compare results. I expected to end up with two 
> similar models, since it seems that internally the optimizer performs 
> standardization and "de-standardization" (when it's deactivated) in order to 
> improve the convergence.
> Here you have the code I used:
> {code:java}
> val lr = new org.apache.spark.ml.classification.LogisticRegression()
> .setRegParam(0.05)
> .setElasticNetParam(0.0)
> .setFitIntercept(true)
> .setMaxIter(5000)
> .setStandardization(false)
> val model = lr.fit(data)
> {code}
> The results are disturbing me: I end up with two significantly different models.
> *Standardization:*
> Training time: 8min.
> Iterations: 37
> Intercept: -4.386090107224499
> Max weight: 4.724752299455218
> Min weight: -3.560570478164854
> Mean weight: -0.049325201841722795
> l1 norm: 116710.39522171849
> l2 norm: 402.2581552373957
> Non zero weights: 128084
> Non zero ratio: 0.12215042114257812
> Last 10 LBFGS Val and Grand Norms:
> {code:java}
> 18/02/27 17:14:45 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 8.00e-07) 
> 0.000559057
> 18/02/27 17:14:50 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 3.94e-07) 
> 0.000267527
> 18/02/27 17:14:54 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.62e-07) 
> 0.000205888
> 18/02/27 17:14:59 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.36e-07) 
> 0.000144173
> 18/02/27 17:15:04 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 7.74e-08) 
> 0.000140296
> 18/02/27 17:15:09 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.52e-08) 
> 0.000122709
> 18/02/27 17:15:13 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.78e-08) 
> 3.08789e-05
> 18/02/27 17:15:18 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.66e-09) 
> 2.23806e-05
> 18/02/27 17:15:23 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 4.31e-09) 
> 1.47422e-05
> 18/02/27 17:15:28 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 9.17e-10) 
> 2.37442e-05
> {code}
> *No standardization:*
> Training time: 7h 14 min.
> Iterations: 4992
> Intercept: -4.216690468849263
> Max weight: 0.41930559767624725
> Min weight: -0.5949182537565524
> Mean weight: -1.2659769019012E-6
> l1 norm: 14.262025330648694
> l2 norm: 1.2508777025612263
> Non zero weights: 128955
> Non zero ratio: 0.12298107147216797
> Last 10 LBFGS Val and Grand Norms:
> {code:java}
> 18/02/28 00:28:56 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 2.17e-07) 
> 0.217581
> 18/02/28 00:29:01 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.88e-07) 
> 0.185812
> 18/02/28 00:29:06 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.33e-07) 
> 0.214570
> 18/02/28 00:29:11 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 8.62e-08) 
> 0.489464
> 18/02/28 00:29:16 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.90e-07) 
> 0.178448
> 18/02/28 00:29:21 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 7.91e-08) 
> 0.172527
> 18/02/28 00:29:26 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.38e-07) 
> 0.189389
> 18/02/28 00:29:31 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.13e-07) 
> 0.480678
> 18/02/28 00:29:36 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.75e-07) 
> 0.184529
> 18/02/28 00:29:41 INFO LBFGS: Val and Grad Norm: 0.559319 (rel: 8.90e-08) 
> 0.154329
> {code}
> Am I missing something?
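
For context on the standardize/"de-standardize" step mentioned in the description, the back-transformation itself is simple (plain-Python sketch, illustrative only). Note that this identity only preserves the *unpenalized* objective; with a regularization term the standardized and unstandardized problems genuinely differ, which is one reason the two runs can diverge.

```python
def destandardize(coefs_std, stddevs):
    """Map coefficients learned on scaled features x / sigma back to the
    original feature space: beta = beta_std / sigma.

    Without a penalty, fitting on standardized features and mapping back
    gives the same model; a penalty applied in the standardized space is
    a different objective. (Illustrative sketch, not Spark's code.)
    """
    return [b / s for b, s in zip(coefs_std, stddevs)]

print(destandardize([2.0, -1.0], [4.0, 0.5]))  # [0.5, -2.0]
```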






[jira] [Created] (SPARK-23578) Add multicolumn support for Binarizer

2018-03-03 Thread Teng Peng (JIRA)
Teng Peng created SPARK-23578:
-

 Summary: Add multicolumn support for Binarizer
 Key: SPARK-23578
 URL: https://issues.apache.org/jira/browse/SPARK-23578
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: Teng Peng


[SPARK-20542] added an API so that Bucketizer can bin multiple columns. Based 
on that change, multi-column support is added for Binarizer.






[jira] [Commented] (SPARK-20133) User guide for spark.ml.stat.ChiSquareTest

2017-11-20 Thread Teng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260095#comment-16260095
 ] 

Teng Peng commented on SPARK-20133:
---

I believe the documentation, including the user guide and example script, is 
done at https://spark.apache.org/docs/2.2.0/ml-statistics.html

> User guide for spark.ml.stat.ChiSquareTest
> --
>
> Key: SPARK-20133
> URL: https://issues.apache.org/jira/browse/SPARK-20133
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Add new user guide section for spark.ml.stat, and document ChiSquareTest.  
> This may involve adding new example scripts.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22449) Add BIC for GLM

2017-11-20 Thread Teng Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Peng resolved SPARK-22449.
---
Resolution: Later

> Add BIC for GLM
> ---
>
> Key: SPARK-22449
> URL: https://issues.apache.org/jira/browse/SPARK-22449
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Teng Peng
>Priority: Minor
>
> Currently, we only have AIC for GLM. BIC is another "similar" criterion 
> widely used and implemented in all major statistical tools.
> Positive reasons: 
> 1. Completeness.
> 2. Useful for some users.
> Negative reasons:
> 1. Not sure how many users would actually use BIC.
> Possible implementation:
> 1. Duplicate AIC's methods and calculate the penalty term independently. 
> Pros: safe & consistent. Cons: duplication.
> 2. Let AIC & BIC share the log-likelihood via the same method and calculate 
> the penalty term independently.
> Pros: similar to scikit-learn, no duplication. Cons: less safe & consistent.
> Reference:
> 1. 
> https://stats.stackexchange.com/questions/577/is-there-any-reason-to-prefer-the-aic-or-bic-over-the-other
> 2. http://users.stat.umn.edu/~yangx374/papers/Pre-Print_2003-10_Biometrika.pdf
> Thoughts?
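
Possible implementation 2 (shared log-likelihood, independent penalty terms) can be sketched as follows (plain Python, hypothetical function names):

```python
import math

def aic(log_likelihood, k):
    # AIC penalty: 2 per parameter.
    return -2.0 * log_likelihood + 2.0 * k

def bic(log_likelihood, k, n):
    # BIC penalty: log(n) per parameter. Both criteria consume the same
    # log-likelihood and differ only in the penalty term, so the
    # likelihood computation can be shared between them.
    return -2.0 * log_likelihood + math.log(n) * k

ll, k, n = -120.0, 3, 100
print(aic(ll, k))               # 246.0
print(round(bic(ll, k, n), 3))  # 253.816
```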






[jira] [Issue Comment Deleted] (SPARK-22359) Improve the test coverage of window functions

2017-11-06 Thread Teng Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Peng updated SPARK-22359:
--
Comment: was deleted

(was: [~jiangxb] If I can have 1 test as a reference, I will figure out the rest 
myself.)

> Improve the test coverage of window functions
> -
>
> Key: SPARK-22359
> URL: https://issues.apache.org/jira/browse/SPARK-22359
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>
> There are already quite a few integration tests using window functions, but 
> the unit test coverage for window functions is not ideal.
> We'd like to test the following aspects:
> * Specifications
> ** different partition clauses (none, one, multiple)
> ** different order clauses (none, one, multiple, asc/desc, nulls first/last)
> * Frames and their combinations
> ** OffsetWindowFunctionFrame
> ** UnboundedWindowFunctionFrame
> ** SlidingWindowFunctionFrame
> ** UnboundedPrecedingWindowFunctionFrame
> ** UnboundedFollowingWindowFunctionFrame
> * Aggregate function types
> ** Declarative
> ** Imperative
> ** UDAF
> * Spilling
> ** Cover the conditions that WindowExec should spill at least once 






[jira] [Commented] (SPARK-22359) Improve the test coverage of window functions

2017-11-06 Thread Teng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16241392#comment-16241392
 ] 

Teng Peng commented on SPARK-22359:
---

[~jiangxb] If I can have 1 test as a reference, I will figure out the rest 
myself.

> Improve the test coverage of window functions
> -
>
> Key: SPARK-22359
> URL: https://issues.apache.org/jira/browse/SPARK-22359
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>
> There are already quite a few integration tests using window functions, but 
> the unit test coverage for window functions is not ideal.
> We'd like to test the following aspects:
> * Specifications
> ** different partition clauses (none, one, multiple)
> ** different order clauses (none, one, multiple, asc/desc, nulls first/last)
> * Frames and their combinations
> ** OffsetWindowFunctionFrame
> ** UnboundedWindowFunctionFrame
> ** SlidingWindowFunctionFrame
> ** UnboundedPrecedingWindowFunctionFrame
> ** UnboundedFollowingWindowFunctionFrame
> * Aggregate function types
> ** Declarative
> ** Imperative
> ** UDAF
> * Spilling
> ** Cover the conditions that WindowExec should spill at least once 






[jira] [Issue Comment Deleted] (SPARK-11502) Word2VecSuite needs appropriate checks

2017-11-05 Thread Teng Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Peng updated SPARK-11502:
--
Comment: was deleted

(was: I am interested in this one. My plan is to compare the test against 1. 
other ML tests 2. other Word2Vec library's tests.)

> Word2VecSuite needs appropriate checks
> --
>
> Key: SPARK-11502
> URL: https://issues.apache.org/jira/browse/SPARK-11502
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Imran Rashid
>Priority: Minor
>
> When the random number generator was changed slightly in SPARK-10116, 
> {{ml.feature.Word2VecSuite}} started failing.  The tests were updated to have 
> "magic" values, to at least provide minimal characterization tests preventing 
> more behavior changes.  However, the tests really need to be improved to have 
> more sensible checks.
> Note that a brute-force search over seeds from 0 to 1000 failed to find one 
> that worked -- the input data seems to be carefully chosen for the previous 
> random number generator / seed combo.






[jira] [Commented] (SPARK-20077) Documentation for ml.stats.Correlation

2017-11-05 Thread Teng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16239846#comment-16239846
 ] 

Teng Peng commented on SPARK-20077:
---

[~srowen] On this page https://spark.apache.org/docs/latest/ml-statistics.html, 
we have Pearson and Spearman coefficients. Just want to make sure: maybe we 
need something other than this?

Correlation computes the correlation matrix for the input Dataset of Vectors 
using the specified method. The output will be a DataFrame that contains the 
correlation matrix of the column of vectors.

import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

val data = Seq(
  Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),
  Vectors.dense(4.0, 5.0, 0.0, 3.0),
  Vectors.dense(6.0, 7.0, 0.0, 8.0),
  Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
)

val df = data.map(Tuple1.apply).toDF("features")
val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
println("Pearson correlation matrix:\n" + coeff1.toString)

val Row(coeff2: Matrix) = Correlation.corr(df, "features", "spearman").head
println("Spearman correlation matrix:\n" + coeff2.toString)
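For readers without a Spark session at hand, the difference between the two methods can be sketched in plain Python. This is a hand-rolled illustration of the textbook definitions, not Spark's implementation (and it skips tie handling in the ranks):

```python
def pearson(x, y):
    # Pearson: covariance of x and y divided by the product of their std devs.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(x):
    # Rank values 1..n (no tie handling, for simplicity).
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(x, y):
    # Spearman: the Pearson correlation of the ranks.
    return pearson(ranks(x), ranks(y))

x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 4.0, 9.0, 16.0]   # monotone but nonlinear in x
print(spearman(x, y))       # close to 1.0: perfect monotone association
print(pearson(x, y))        # below 1.0: the linear fit is imperfect
```

The monotone-but-nonlinear example shows exactly when the two coefficients disagree, which is the point of offering both methods.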



> Documentation for ml.stats.Correlation
> --
>
> Key: SPARK-20077
> URL: https://issues.apache.org/jira/browse/SPARK-20077
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Priority: Minor
>
> Now that (Pearson) correlations are available in spark.ml, we need to write 
> some documentation to go along with this feature. It can simply be looking at 
> the unit tests for example right now.






[jira] [Updated] (SPARK-22449) Add BIC for GLM

2017-11-04 Thread Teng Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Peng updated SPARK-22449:
--
Description: 
Currently, we only have AIC for GLM. BIC is another "similar" criterion widely 
used and implemented in all major statistical tools.

Positive reasons: 
1. Completeness.
2. Useful for some users.

Negative reasons:
1. Not sure how many users would actually use BIC.

Possible Implementation:
1. Duplicate AIC's methods. Calculate penalty term independently. Pros: safe & 
consistent. Cons: duplication.
2. Let AIC & BIC share the log likelihood by a same method. Calculate penalty 
term independently.
Pros: similar to scikit learn. No duplication. Cons: less safe & consistent.

Reference:
1. 
https://stats.stackexchange.com/questions/577/is-there-any-reason-to-prefer-the-aic-or-bic-over-the-other
2. http://users.stat.umn.edu/~yangx374/papers/Pre-Print_2003-10_Biometrika.pdf

Thoughts?

  was:
Currently, we only have AIC for GLM. BIC is another "similar" criterion widely 
used and implemented in all major statistical tools.

Positive reasons: 
1. Completeness.
2. Useful for some users.

Negative reasons:
1. Not sure how many users would actually use BIC.

Possible Implementation:
1. Duplicate almost the same methods for log likelihood part. Calculate penalty 
term independently. Pros: safe & consistent. Cons: duplication.
2. Let AIC & BIC share the log likelihood by a same method. Calculate penalty 
term independently.
Pros: similar to scikit learn. No duplication. Cons: less safe & consistent.

Reference:
1. 
https://stats.stackexchange.com/questions/577/is-there-any-reason-to-prefer-the-aic-or-bic-over-the-other
2. http://users.stat.umn.edu/~yangx374/papers/Pre-Print_2003-10_Biometrika.pdf

Thoughts?


> Add BIC for GLM
> ---
>
> Key: SPARK-22449
> URL: https://issues.apache.org/jira/browse/SPARK-22449
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Teng Peng
>Priority: Minor
>
> Currently, we only have AIC for GLM. BIC is another "similar" criterion 
> widely used and implemented in all major statistical tools.
> Positive reasons: 
> 1. Completeness.
> 2. Useful for some users.
> Negative reasons:
> 1. Not sure how many users would actually use BIC.
> Possible Implementation:
> 1. Duplicate AIC's methods. Calculate penalty term independently. Pros: safe 
> & consistent. Cons: duplication.
> 2. Let AIC & BIC share the log likelihood by a same method. Calculate penalty 
> term independently.
> Pros: similar to scikit learn. No duplication. Cons: less safe & consistent.
> Reference:
> 1. 
> https://stats.stackexchange.com/questions/577/is-there-any-reason-to-prefer-the-aic-or-bic-over-the-other
> 2. http://users.stat.umn.edu/~yangx374/papers/Pre-Print_2003-10_Biometrika.pdf
> Thoughts?






[jira] [Created] (SPARK-22449) Add BIC for GLM

2017-11-04 Thread Teng Peng (JIRA)
Teng Peng created SPARK-22449:
-

 Summary: Add BIC for GLM
 Key: SPARK-22449
 URL: https://issues.apache.org/jira/browse/SPARK-22449
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: Teng Peng
Priority: Minor


Currently, we only have AIC for GLM. BIC is another "similar" criterion widely 
used and implemented in all major statistical tools.

Positive reasons: 
1. Completeness.
2. Useful for some users.

Negative reasons:
1. Not sure how many users would actually use BIC.

Possible Implementation:
1. Duplicate almost the same methods for log likelihood part. Calculate penalty 
term independently. Pros: safe & consistent. Cons: duplication.
2. Let AIC & BIC share the log likelihood by a same method. Calculate penalty 
term independently.
Pros: similar to scikit learn. No duplication. Cons: less safe & consistent.

Reference:
1. 
https://stats.stackexchange.com/questions/577/is-there-any-reason-to-prefer-the-aic-or-bic-over-the-other
2. http://users.stat.umn.edu/~yangx374/papers/Pre-Print_2003-10_Biometrika.pdf

Thoughts?
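For reference, the two criteria share the -2*ln(L) term and differ only in the penalty, which is the crux of the implementation options above. A minimal plain-Python sketch using the textbook definitions (not Spark's internals):

```python
from math import log

def aic(log_likelihood, k):
    # AIC = 2k - 2*ln(L), with k the number of fitted parameters.
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    # BIC = k*ln(n) - 2*ln(L), with n the number of observations.
    # The shared -2*ln(L) term is why AIC and BIC could reuse one
    # log-likelihood method, as option 2 above suggests.
    return k * log(n) - 2 * log_likelihood

ll, k, n = -120.0, 3, 50
print(aic(ll, k))     # 246.0
print(bic(ll, k, n))  # 240 + 3*ln(50), about 251.74
```

Note that BIC penalizes parameters more heavily than AIC whenever ln(n) > 2, i.e. for any sample with more than about 7 observations.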






[jira] [Commented] (SPARK-18755) Add Randomized Grid Search to Spark ML

2017-11-04 Thread Teng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16239389#comment-16239389
 ] 

Teng Peng commented on SPARK-18755:
---

[~yuhaoyan] Is this what you are looking for: after we build the grid, we 
randomly select a few points in the grid based on the 
searchRatio*totalNumofPoints?

If so, I am wondering whether it is necessary to extend trait Params and write 
set/get functions for searchRatio, which might be over-engineering. 

> Add Randomized Grid Search to Spark ML
> --
>
> Key: SPARK-18755
> URL: https://issues.apache.org/jira/browse/SPARK-18755
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>
> Randomized Grid Search  implements a randomized search over parameters, where 
> each setting is sampled from a distribution over possible parameter values. 
> This has two main benefits over an exhaustive search:
> 1. A budget can be chosen independent of the number of parameters and 
> possible values.
> 2. Adding parameters that do not influence the performance does not decrease 
> efficiency.
> Randomized grid search usually gives similar results to an exhaustive search, 
> while its run time is drastically lower.
> For more background, please refer to:
> sklearn: http://scikit-learn.org/stable/modules/grid_search.html
> http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/
> http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
> https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/.
> There are two ways to implement this in Spark as I see it:
> 1. Add searchRatio to ParamGridBuilder and conduct sampling directly during 
> build. Only 1 new public function is required.
> 2. Add a trait RandomizedSearch and create new classes RandomizedCrossValidator 
> and RandomizedTrainValidationSplit, which can be complicated since we need to 
> deal with the models.
> I'd prefer option 1 as it's much simpler and more straightforward. We can 
> support randomized grid search with a minimal change.
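Option 1, sampling directly while building the grid, can be sketched in plain Python. The name searchRatio comes from the proposal; the dict-based grid is a stand-in for ParamGridBuilder's output, not Spark's actual API:

```python
import itertools
import random

def build_random_grid(param_values, search_ratio, seed=42):
    # Exhaustive grid: the cartesian product of all candidate values.
    names = sorted(param_values)
    full_grid = [dict(zip(names, combo))
                 for combo in itertools.product(*(param_values[n] for n in names))]
    # Randomized search: keep only searchRatio * total points, sampled
    # uniformly without replacement (at least one point is always kept).
    k = max(1, int(len(full_grid) * search_ratio))
    return random.Random(seed).sample(full_grid, k)

params = {"regParam": [0.01, 0.1, 1.0], "elasticNetParam": [0.0, 0.5, 1.0]}
grid = build_random_grid(params, search_ratio=0.5)
print(len(grid))  # 4 of the 9 combinations
```

Because the sampling happens at build time, the downstream CrossValidator or TrainValidationSplit needs no changes at all, which is the appeal of option 1.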






[jira] [Commented] (SPARK-22433) Linear regression R^2 train/test terminology related

2017-11-03 Thread Teng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237884#comment-16237884
 ] 

Teng Peng commented on SPARK-22433:
---

Thanks for the quick response, Sean. I am glad this issue is discussed in Spark 
community.

I understand how important coherence is, and it's the users' decision to do what 
they believe is appropriate. 

I just want to propose a one-line change: change eval.setMetricName("r2") to 
"mse" in test("cross validation with linear regression"). Then we would not 
leave the impression of "Wait, what? Spark officially cross-validates on R^2?" 
> Linear regression R^2 train/test terminology related 
> -
>
> Key: SPARK-22433
> URL: https://issues.apache.org/jira/browse/SPARK-22433
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Teng Peng
>Priority: Minor
>
> Traditional statistics is traditional statistics. Their goal, framework, and 
> terminologies are not the same as ML. However, in linear regression related 
> components, this distinction is not clear, which is reflected:
> 1. regressionMetric + regressionEvaluator : 
> * R2 shouldn't be there. 
> * A better name "regressionPredictionMetric".
> 2. LinearRegressionSuite: 
> * Shouldn't test R2 and residuals on test data. 
> * There is no train set and test set in this setting.
> 3. Terminology: there is no "linear regression with L1 regularization". 
> Linear regression is linear. Adding a penalty term, then it is no longer 
> linear. Just call it "LASSO", "ElasticNet".
> There are more. I am working on correcting them.
> They are not breaking anything, but it does not make one feel good to see the 
> basic distinction is blurred.






[jira] [Commented] (SPARK-22433) Linear regression R^2 train/test terminology related

2017-11-03 Thread Teng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237774#comment-16237774
 ] 

Teng Peng commented on SPARK-22433:
---

What I agree with you on: be coherent, and we prefer the ML-oriented standard.

What I want to add: be coherent, and we prefer the ML-oriented standard only if 
we are talking about ML. If we are talking about traditional statistics, we 
should stick to the established standard of traditional statistics.

What I want to explain:
1.
ML world: there are a training set and a test set. We use them to evaluate 
whether our models have good prediction performance; without them, overfitting 
is unavoidable.
Traditional statistics world: there is no training set and test set, because the 
goal is interpretation of models, not prediction performance. 
R^2 belongs to the framework of traditional statistics, and it has nothing to do 
with prediction-related goals. If we are using R^2, we are in the domain of 
traditional statistics. If our goal is interpretation, then we look at R^2. 

2.
The regressionMetric and regressionEvaluator are designed for ML-related goals 
using the linear regression approach (which might be useful as a benchmark), so 
these two are actually in the domain of the ML world, not traditional 
statistics. However, R^2 is mixed into them, and this mixture appears 
everywhere. Look at test("cross validation with linear regression"): R^2 is 
evaluated by cross-validation, and the larger the better. This is a 
misunderstanding of what R^2 is.

The bottom line: there is a clear distinction between traditional statistics 
and ML. If something belongs to traditional statistics, then we should not mix 
them with ML.

> Linear regression R^2 train/test terminology related 
> -
>
> Key: SPARK-22433
> URL: https://issues.apache.org/jira/browse/SPARK-22433
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Teng Peng
>Priority: Minor
>
> Traditional statistics is traditional statistics. Their goal, framework, and 
> terminologies are not the same as ML. However, in linear regression related 
> components, this distinction is not clear, which is reflected:
> 1. regressionMetric + regressionEvaluator : 
> * R2 shouldn't be there. 
> * A better name "regressionPredictionMetric".
> 2. LinearRegressionSuite: 
> * Shouldn't test R2 and residuals on test data. 
> * There is no train set and test set in this setting.
> 3. Terminology: there is no "linear regression with L1 regularization". 
> Linear regression is linear. Adding a penalty term, then it is no longer 
> linear. Just call it "LASSO", "ElasticNet".
> There are more. I am working on correcting them.
> They are not breaking anything, but it does not make one feel good to see the 
> basic distinction is blurred.






[jira] [Updated] (SPARK-22433) Linear regression R^2 train/test terminology related

2017-11-02 Thread Teng Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Peng updated SPARK-22433:
--
Description: 
Traditional statistics is traditional statistics. Their goal, framework, and 
terminologies are not the same as ML. However, in linear regression related 
components, this distinction is not clear, which is reflected:
1. regressionMetric + regressionEvaluator : 
* R2 shouldn't be there. 
* A better name "regressionPredictionMetric".

2. LinearRessionSuite: 
* Shouldn't test R2 and residuals on test data. 
* There is no train set and test set in this setting.

3. Terminology: there is no "linear regression with L1 regularization". Linear 
regression is linear. Adding a penalty term, then it is no longer linear. Just 
call it "LASSO", "ElasticNet".

There are more. I am working on correcting them.

They are not breaking anything, but it does not make one feel good to see the 
basic distinction is blurred.

  was:
Traditional statistics is traditional statistics. Their goal, framework, and 
terminologies are not the same as ML. However, in linear regression related 
components, this distinction is not clear, which is reflected:
1. regressionMetric + regressionEvaluator : 
* R2 shouldn't be there. 
* A better name "regressionPredictionMetric".

2. LinearregRessionSuite: 
* Shouldn't test R2 and residuals on test data. 
* There is no train set and test set in this setting.

3. Terminology: there is no "linear regression with L1 regularization". Linear 
regression is linear. Adding a penalty term, then it is no longer linear. Just 
call it "LASSO", "ElasticNet".

There are more. I am working on correcting them.

They are not breaking anything, but it does not make one feel good to see the 
basic distinction is blurred.


> Linear regression R^2 train/test terminology related 
> -
>
> Key: SPARK-22433
> URL: https://issues.apache.org/jira/browse/SPARK-22433
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Teng Peng
>Priority: Minor
>
> Traditional statistics is traditional statistics. Their goal, framework, and 
> terminologies are not the same as ML. However, in linear regression related 
> components, this distinction is not clear, which is reflected:
> 1. regressionMetric + regressionEvaluator : 
> * R2 shouldn't be there. 
> * A better name "regressionPredictionMetric".
> 2. LinearRessionSuite: 
> * Shouldn't test R2 and residuals on test data. 
> * There is no train set and test set in this setting.
> 3. Terminology: there is no "linear regression with L1 regularization". 
> Linear regression is linear. Adding a penalty term, then it is no longer 
> linear. Just call it "LASSO", "ElasticNet".
> There are more. I am working on correcting them.
> They are not breaking anything, but it does not make one feel good to see the 
> basic distinction is blurred.






[jira] [Updated] (SPARK-22433) Linear regression R^2 train/test terminology related

2017-11-02 Thread Teng Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Peng updated SPARK-22433:
--
Description: 
Traditional statistics is traditional statistics. Their goal, framework, and 
terminologies are not the same as ML. However, in linear regression related 
components, this distinction is not clear, which is reflected:
1. regressionMetric + regressionEvaluator : 
* R2 shouldn't be there. 
* A better name "regressionPredictionMetric".

2. LinearRegressionSuite: 
* Shouldn't test R2 and residuals on test data. 
* There is no train set and test set in this setting.

3. Terminology: there is no "linear regression with L1 regularization". Linear 
regression is linear. Adding a penalty term, then it is no longer linear. Just 
call it "LASSO", "ElasticNet".

There are more. I am working on correcting them.

They are not breaking anything, but it does not make one feel good to see the 
basic distinction is blurred.

  was:
Traditional statistics is traditional statistics. Their goal, framework, and 
terminologies are not the same as ML. However, in linear regression related 
components, this distinction is not clear, which is reflected:
1. regressionMetric + regressionEvaluator : 
* R2 shouldn't be there. 
* A better name "regressionPredictionMetric".

2. LinearRessionSuite: 
* Shouldn't test R2 and residuals on test data. 
* There is no train set and test set in this setting.

3. Terminology: there is no "linear regression with L1 regularization". Linear 
regression is linear. Adding a penalty term, then it is no longer linear. Just 
call it "LASSO", "ElasticNet".

There are more. I am working on correcting them.

They are not breaking anything, but it does not make one feel good to see the 
basic distinction is blurred.


> Linear regression R^2 train/test terminology related 
> -
>
> Key: SPARK-22433
> URL: https://issues.apache.org/jira/browse/SPARK-22433
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Teng Peng
>Priority: Minor
>
> Traditional statistics is traditional statistics. Their goal, framework, and 
> terminologies are not the same as ML. However, in linear regression related 
> components, this distinction is not clear, which is reflected:
> 1. regressionMetric + regressionEvaluator : 
> * R2 shouldn't be there. 
> * A better name "regressionPredictionMetric".
> 2. LinearRegressionSuite: 
> * Shouldn't test R2 and residuals on test data. 
> * There is no train set and test set in this setting.
> 3. Terminology: there is no "linear regression with L1 regularization". 
> Linear regression is linear. Adding a penalty term, then it is no longer 
> linear. Just call it "LASSO", "ElasticNet".
> There are more. I am working on correcting them.
> They are not breaking anything, but it does not make one feel good to see the 
> basic distinction is blurred.






[jira] [Updated] (SPARK-22433) Linear regression R^2 train/test terminology related

2017-11-02 Thread Teng Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Peng updated SPARK-22433:
--
Description: 
Traditional statistics is traditional statistics. Their goal, framework, and 
terminologies are not the same as ML. However, in linear regression related 
components, this distinction is not clear, which is reflected:
1. regressionMetric + regressionEvaluator : 
* R2 shouldn't be there. 
* A better name "regressionPredictionMetric".

2. LinearregRessionSuite: 
* Shouldn't test R2 and residuals on test data. 
* There is no train set and test set in this setting.

3. Terminology: there is no "linear regression with L1 regularization". Linear 
regression is linear. Adding a penalty term, then it is no longer linear. Just 
call it "LASSO", "ElasticNet".

There are more. I am working on correcting them.

They are not breaking anything, but it does not make one feel good to see the 
basic distinction is blurred.

  was:
Traditional statistics is traditional statistics. Their goal, framework, and 
terminologies are not the same as ML. However, in linear regression related 
components, this distinction is not clear, which is reflected:
1. regressionMetric + regressionEvaluator : 
* R2 shouldn't be there. 
* A better name "regressionPredictionMetric".
2. LinearregRessionSuite: 
* Shouldn't test R2 and residuals on test data. 
* There is no train set and test set in this setting.
3. Terminology: there is no "linear regression with L1 regularization". Linear 
regression is linear. Adding a penalty term, then it is no longer linear. Just 
call it "LASSO", "ElasticNet".

There are more. I am working on correcting them.

They are not breaking anything, but it does not make one feel good to see the 
basic distinction is blurred.


> Linear regression R^2 train/test terminology related 
> -
>
> Key: SPARK-22433
> URL: https://issues.apache.org/jira/browse/SPARK-22433
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Teng Peng
>Priority: Minor
>
> Traditional statistics is traditional statistics. Their goal, framework, and 
> terminologies are not the same as ML. However, in linear regression related 
> components, this distinction is not clear, which is reflected:
> 1. regressionMetric + regressionEvaluator : 
> * R2 shouldn't be there. 
> * A better name "regressionPredictionMetric".
> 2. LinearregRessionSuite: 
> * Shouldn't test R2 and residuals on test data. 
> * There is no train set and test set in this setting.
> 3. Terminology: there is no "linear regression with L1 regularization". 
> Linear regression is linear. Adding a penalty term, then it is no longer 
> linear. Just call it "LASSO", "ElasticNet".
> There are more. I am working on correcting them.
> They are not breaking anything, but it does not make one feel good to see the 
> basic distinction is blurred.






[jira] [Created] (SPARK-22433) Linear regression R^2 train/test terminology related

2017-11-02 Thread Teng Peng (JIRA)
Teng Peng created SPARK-22433:
-

 Summary: Linear regression R^2 train/test terminology related 
 Key: SPARK-22433
 URL: https://issues.apache.org/jira/browse/SPARK-22433
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: Teng Peng
Priority: Minor


Traditional statistics is traditional statistics. Their goal, framework, and 
terminologies are not the same as ML. However, in linear regression related 
components, this distinction is not clear, which is reflected:
1. regressionMetric + regressionEvaluator : 
* R2 shouldn't be there. 
* A better name "regressionPredictionMetric".
2. LinearregRessionSuite: 
* Shouldn't test R2 and residuals on test data. 
* There is no train set and test set in this setting.
3. Terminology: there is no "linear regression with L1 regularization". Linear 
regression is linear. Adding a penalty term, then it is no longer linear. Just 
call it "LASSO", "ElasticNet".

There are more. I am working on correcting them.

They are not breaking anything, but it does not make one feel good to see the 
basic distinction is blurred.


