[jira] [Commented] (SPARK-30670) Pipes for PySpark

2020-01-30 Thread Vincent (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027242#comment-17027242
 ] 

Vincent commented on SPARK-30670:
-

I just had a look, but transform does not allow for `*args` and `**kwargs`. Is 
there a reason for this? To me this feels like it is not feature-complete.
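
For what it's worth, the single-argument signature can already be worked around 
by closing over the extra arguments, e.g. with a lambda or functools.partial. A 
minimal sketch, assuming a PySpark version where DataFrame.transform is 
available and reusing the add_session / remove_outliers helpers from the 
description below:

{code:java}
from functools import partial

# transform() only passes the DataFrame itself, so extra parameters
# have to be captured in a closure or a partial:
result = (ddf
    .transform(lambda df: add_session(df, session_threshold=900))
    .transform(partial(remove_outliers, min_n_rows=11)))
{code}

It works, but `.pipe(add_session, session_threshold=900)` would read more 
naturally, which is what this ticket asks for.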

> Pipes for PySpark
> -
>
> Key: SPARK-30670
> URL: https://issues.apache.org/jira/browse/SPARK-30670
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Vincent
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I would propose adding a `pipe` method to the Spark DataFrame. It allows for 
> a functional programming pattern, inspired by the tidyverse, that is 
> currently missing. The pandas community also recently adopted this pattern, 
> documented [here|https://tomaugspurger.github.io/method-chaining.html].
> This is the idea. Suppose you had this:
> {code:java}
> from pyspark.sql import Window
> import pyspark.sql.functions as sf
>
> # file that has [user, date, timestamp, eventtype]
> ddf = spark.read.parquet("")
>
> w_user = Window().partitionBy("user")
> w_user_time = Window().partitionBy("user").orderBy("timestamp")
>
> thres_sesstime = 60 * 15
> min_n_rows = 10
> min_n_sessions = 5
>
> clean_ddf = (ddf
>   .withColumn("delta", sf.col("timestamp") - sf.lag("timestamp").over(w_user_time))
>   .withColumn("new_session", (sf.col("delta") > thres_sesstime).cast("integer"))
>   .withColumn("session", sf.sum(sf.col("new_session")).over(w_user_time))
>   .drop("new_session")
>   .drop("delta")
>   .withColumn("nrow_user", sf.count(sf.col("timestamp")).over(w_user))
>   .withColumn("nrow_user_date", sf.approx_count_distinct(sf.col("date")).over(w_user))
>   .filter(sf.col("nrow_user") > min_n_rows)
>   .filter(sf.col("nrow_user_date") > min_n_sessions)
>   .drop("nrow_user")
>   .drop("nrow_user_date"))
> {code}
> The code works and it is somewhat clear: we add a session to the dataframe 
> and then use it to remove outliers. The issue is that this chain of commands 
> can get quite long, so it might be better to turn it into functions.
> {code:java}
> def add_session(dataf, session_threshold=60 * 15):
>     w_user_time = Window().partitionBy("user").orderBy("timestamp")
>
>     return (dataf
>         .withColumn("delta", sf.col("timestamp") - sf.lag("timestamp").over(w_user_time))
>         .withColumn("new_session", (sf.col("delta") > session_threshold).cast("integer"))
>         .withColumn("session", sf.sum(sf.col("new_session")).over(w_user_time))
>         .drop("new_session")
>         .drop("delta"))
>
> def remove_outliers(dataf, min_n_rows=10, min_n_sessions=5):
>     w_user = Window().partitionBy("user")
>
>     return (dataf
>         .withColumn("nrow_user", sf.count(sf.col("timestamp")).over(w_user))
>         .withColumn("nrow_user_date", sf.approx_count_distinct(sf.col("date")).over(w_user))
>         .filter(sf.col("nrow_user") > min_n_rows)
>         .filter(sf.col("nrow_user_date") > min_n_sessions)
>         .drop("nrow_user")
>         .drop("nrow_user_date"))
> {code}
> The issue lies not in these functions. These functions are great! You can 
> unit test them and they really give nice verbs that function as an 
> abstraction. The issue is in how you now need to apply them. 
> {code:java}
> remove_outliers(add_session(ddf, session_threshold=1000), min_n_rows=11)
> {code}
> It'd be much nicer to perhaps allow for this:
> {code:java}
> (ddf
>   .pipe(add_session, session_threshold=900)
>   .pipe(remove_outliers, min_n_rows=11))
> {code}
> The cool thing about this is that you can very easily allow for method 
> chaining, and you also get a great way to split high-level code from 
> low-level code. You still allow changes at the high level by exposing keyword 
> arguments, but you can easily find the lower-level code while debugging 
> because the details are contained in their own functions.
> For code maintenance, I've relied on this pattern a lot personally. But so 
> far, I've always monkey-patched Spark to be able to do this.
> {code:java}
> from pyspark.sql import DataFrame
>
> def pipe(self, func, *args, **kwargs):
>     return func(self, *args, **kwargs)
>
> # attach the method so every DataFrame gets .pipe()
> DataFrame.pipe = pipe
> {code}
> Could I perhaps add these few lines of code to the codebase?
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26449) Missing Dataframe.transform API in Python API

2020-01-30 Thread Vincent (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027241#comment-17027241
 ] 

Vincent commented on SPARK-26449:
-

Is there a reason why transform does not accept `*args` and `**kwargs`?

> Missing Dataframe.transform API in Python API
> -
>
> Key: SPARK-26449
> URL: https://issues.apache.org/jira/browse/SPARK-26449
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Hanan Shteingart
>Assignee: Erik Christiansen
>Priority: Minor
> Fix For: 3.0.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I would like to chain custom transformations as is suggested in this [blog 
> post|https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55]
> This would allow writing something like the following:
>  
>  
> {code:java}
> from pyspark.sql.functions import lit
>
> def with_greeting(df):
>     return df.withColumn("greeting", lit("hi"))
>
> def with_something(df, something):
>     return df.withColumn("something", lit(something))
>
> data = [("jose", 1), ("li", 2), ("liz", 3)]
> source_df = spark.createDataFrame(data, ["name", "age"])
>
> actual_df = (source_df
>     .transform(with_greeting)
>     .transform(lambda df: with_something(df, "crazy")))
>
> actual_df.show()
> +----+---+--------+---------+
> |name|age|greeting|something|
> +----+---+--------+---------+
> |jose|  1|      hi|    crazy|
> |  li|  2|      hi|    crazy|
> | liz|  3|      hi|    crazy|
> +----+---+--------+---------+
> {code}
> The only thing needed to accomplish this is the following simple method for 
> DataFrame:
> {code:java}
> from pyspark.sql.dataframe import DataFrame 
> def transform(self, f): 
> return f(self) 
> DataFrame.transform = transform
> {code}
> I volunteer to do the pull request if approved (at least the Python part).
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30670) Pipes for PySpark

2020-01-29 Thread Vincent (Jira)
Vincent created SPARK-30670:
---

 Summary: Pipes for PySpark
 Key: SPARK-30670
 URL: https://issues.apache.org/jira/browse/SPARK-30670
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.4
Reporter: Vincent


I would propose adding a `pipe` method to the Spark DataFrame. It allows for a 
functional programming pattern, inspired by the tidyverse, that is currently 
missing. The pandas community also recently adopted this pattern, documented 
[here|https://tomaugspurger.github.io/method-chaining.html].

This is the idea. Suppose you had this:


{code:java}
from pyspark.sql import Window
import pyspark.sql.functions as sf

# file that has [user, date, timestamp, eventtype]
ddf = spark.read.parquet("")

w_user = Window().partitionBy("user")
w_user_time = Window().partitionBy("user").orderBy("timestamp")

thres_sesstime = 60 * 15
min_n_rows = 10
min_n_sessions = 5

clean_ddf = (ddf
  .withColumn("delta", sf.col("timestamp") - sf.lag("timestamp").over(w_user_time))
  .withColumn("new_session", (sf.col("delta") > thres_sesstime).cast("integer"))
  .withColumn("session", sf.sum(sf.col("new_session")).over(w_user_time))
  .drop("new_session")
  .drop("delta")
  .withColumn("nrow_user", sf.count(sf.col("timestamp")).over(w_user))
  .withColumn("nrow_user_date", sf.approx_count_distinct(sf.col("date")).over(w_user))
  .filter(sf.col("nrow_user") > min_n_rows)
  .filter(sf.col("nrow_user_date") > min_n_sessions)
  .drop("nrow_user")
  .drop("nrow_user_date"))
{code}
The code works and it is somewhat clear: we add a session to the dataframe and 
then use it to remove outliers. The issue is that this chain of commands can 
get quite long, so it might be better to turn it into functions.
{code:java}
def add_session(dataf, session_threshold=60 * 15):
    w_user_time = Window().partitionBy("user").orderBy("timestamp")

    return (dataf
        .withColumn("delta", sf.col("timestamp") - sf.lag("timestamp").over(w_user_time))
        .withColumn("new_session", (sf.col("delta") > session_threshold).cast("integer"))
        .withColumn("session", sf.sum(sf.col("new_session")).over(w_user_time))
        .drop("new_session")
        .drop("delta"))

def remove_outliers(dataf, min_n_rows=10, min_n_sessions=5):
    w_user = Window().partitionBy("user")

    return (dataf
        .withColumn("nrow_user", sf.count(sf.col("timestamp")).over(w_user))
        .withColumn("nrow_user_date", sf.approx_count_distinct(sf.col("date")).over(w_user))
        .filter(sf.col("nrow_user") > min_n_rows)
        .filter(sf.col("nrow_user_date") > min_n_sessions)
        .drop("nrow_user")
        .drop("nrow_user_date"))
{code}
The issue lies not in these functions. These functions are great! You can unit 
test them and they really give nice verbs that function as an abstraction. The 
issue is in how you now need to apply them. 
{code:java}
remove_outliers(add_session(ddf, session_threshold=1000), min_n_rows=11)
{code}
It'd be much nicer to perhaps allow for this:
{code:java}
(ddf
  .pipe(add_session, session_threshold=900)
  .pipe(remove_outliers, min_n_rows=11))
{code}
The cool thing about this is that you can very easily allow for method 
chaining, and you also get a great way to split high-level code from low-level 
code. You still allow changes at the high level by exposing keyword arguments, 
but you can easily find the lower-level code while debugging because the 
details are contained in their own functions.

For code maintenance, I've relied on this pattern a lot personally. But so far, 
I've always monkey-patched Spark to be able to do this.
{code:java}
from pyspark.sql import DataFrame

def pipe(self, func, *args, **kwargs):
    return func(self, *args, **kwargs)

# attach the method so every DataFrame gets .pipe()
DataFrame.pipe = pipe
{code}
Could I perhaps add these few lines of code to the codebase?

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27087) Inability to access to column alias in pyspark

2019-03-07 Thread Vincent (JIRA)
Vincent created SPARK-27087:
---

 Summary: Inability to access to column alias in pyspark
 Key: SPARK-27087
 URL: https://issues.apache.org/jira/browse/SPARK-27087
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Vincent


In pyspark I have the following:
{code:java}
import pyspark.sql.functions as F
cc = F.lit(1).alias("A")

print(cc)
print(cc._jc.toString())
{code}

I get:
{noformat}
Column
1 AS `A`
{noformat}

Is there any way for me to just print "A" from cc? It seems I'm unable to 
extract the alias programmatically from the column object.

Also, I think that in Spark SQL in Scala, if I print "cc" it would just print 
"A" instead, so this seems like a bug or a missing feature to me.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25412) FeatureHasher would change the value of output feature

2018-09-13 Thread Vincent (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613181#comment-16613181
 ] 

Vincent commented on SPARK-25412:
-

Thanks, Nick, for the reply.

So the trade-off is between a highly sparse vector (by increasing the 
numFeatures size) and the risk of losing certain features to conflicting hash 
values (since changing the value/meaning of those features amounts to making 
them useless), correct?

> FeatureHasher would change the value of output feature
> --
>
> Key: SPARK-25412
> URL: https://issues.apache.org/jira/browse/SPARK-25412
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Vincent
>Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo on 
> the hashed value is used to determine the vector index, it's suggested to use 
> a large integer value as the numFeature parameter
> we found several issues regarding current implementation: 
>  # Cannot get the feature name back by its index after featureHasher 
> transform, for example. when getting feature importance from decision tree 
> training followed by a FeatureHasher
>  # when index conflict, which is a great chance to happen especially when 
> 'numFeature' is relatively small, its value would be changed with a new 
> valued (sum of current and old value)
>  #  to avoid confliction, we should set the 'numFeature' with a large number, 
> highly sparse vector increase the computation complexity of model training



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25412) FeatureHasher would change the value of output feature

2018-09-11 Thread Vincent (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611628#comment-16611628
 ] 

Vincent commented on SPARK-25412:
-

[~nick.pentre...@gmail.com] thanks.

> FeatureHasher would change the value of output feature
> --
>
> Key: SPARK-25412
> URL: https://issues.apache.org/jira/browse/SPARK-25412
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Vincent
>Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo on 
> the hashed value is used to determine the vector index, it's suggested to use 
> a large integer value as the numFeature parameter
> we found several issues regarding current implementation: 
>  # Cannot get the feature name back by its index after featureHasher 
> transform, for example. when getting feature importance from decision tree 
> training followed by a FeatureHasher
>  # when index conflict, which is a great chance to happen especially when 
> 'numFeature' is relatively small, its value would be changed with a new 
> valued (sum of current and old value)
>  #  to avoid confliction, we should set the 'numFeature' with a large number, 
> highly sparse vector increase the computation complexity of model training



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25412) FeatureHasher would change the value of output feature

2018-09-11 Thread Vincent (JIRA)
Vincent created SPARK-25412:
---

 Summary: FeatureHasher would change the value of output feature
 Key: SPARK-25412
 URL: https://issues.apache.org/jira/browse/SPARK-25412
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.3.1
Reporter: Vincent


In the current implementation of FeatureHasher.transform, a simple modulo on 
the hashed value is used to determine the vector index, and it is suggested to 
use a large integer value as the numFeatures parameter.

We found several issues with the current implementation:
 # The feature name cannot be recovered from its index after the FeatureHasher 
transform, for example when getting feature importances from a decision tree 
trained after a FeatureHasher.
 # When indices conflict, which has a great chance of happening especially when 
'numFeatures' is relatively small, the value is replaced by a new value (the 
sum of the current and old values); see the sketch below.
 # To avoid conflicts, 'numFeatures' has to be set to a large number, but the 
resulting highly sparse vectors increase the computation complexity of model 
training.
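
A minimal PySpark sketch of the collision behaviour described in item 2, using 
a deliberately tiny numFeatures so that some of the four input columns are 
guaranteed to share an index (which columns actually collide depends on the 
hash, so treat the output as illustrative only):

{code:java}
from pyspark.ml.feature import FeatureHasher

df = spark.createDataFrame([(2.0, 3.0, 5.0, 7.0)], ["a", "b", "c", "d"])

# With only 2 buckets, at least two of the 4 columns must map to the same
# index; the colliding features are accumulated into a single vector slot,
# so their original values can no longer be told apart.
hasher = FeatureHasher(inputCols=["a", "b", "c", "d"],
                       outputCol="features",
                       numFeatures=2)
hasher.transform(df).select("features").show(truncate=False)
{code}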



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25364) a better way to handle vector index and sparsity in FeatureHasher implementation ?

2018-09-09 Thread Vincent (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608612#comment-16608612
 ] 

Vincent commented on SPARK-25364:
-

Duplicate; closing this Jira.

> a better way to handle vector index and sparsity in FeatureHasher 
> implementation ?
> --
>
> Key: SPARK-25364
> URL: https://issues.apache.org/jira/browse/SPARK-25364
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Vincent
>Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo on 
> the hashed value is used to determine the vector index, it's suggested to use 
> a large integer value as the numFeature parameter
> we found several issues regarding current implementation: 
>  # Cannot get the feature name back by its index after featureHasher 
> transform, for example. when getting feature importance from decision tree 
> training followed by a FeatureHasher
>  # when index conflict, which is a great chance to happen especially when 
> 'numFeature' is relatively small, its value would be updated with the sum of 
> current and old value, ie, the value of the conflicted feature vector would 
> be change by this module.
>  #  to avoid confliction, we should set the 'numFeature' with a large number, 
> highly sparse vector increase the computation complexity of model training
> we are working on fixing these problems due to our business need, thinking it 
> might or might not be an issue for others as well, we'd like to hear from the 
> community.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25364) a better way to handle vector index and sparsity in FeatureHasher implementation ?

2018-09-09 Thread Vincent (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent resolved SPARK-25364.
-
Resolution: Duplicate

> a better way to handle vector index and sparsity in FeatureHasher 
> implementation ?
> --
>
> Key: SPARK-25364
> URL: https://issues.apache.org/jira/browse/SPARK-25364
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Vincent
>Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo on 
> the hashed value is used to determine the vector index, it's suggested to use 
> a large integer value as the numFeature parameter
> we found several issues regarding current implementation: 
>  # Cannot get the feature name back by its index after featureHasher 
> transform, for example. when getting feature importance from decision tree 
> training followed by a FeatureHasher
>  # when index conflict, which is a great chance to happen especially when 
> 'numFeature' is relatively small, its value would be updated with the sum of 
> current and old value, ie, the value of the conflicted feature vector would 
> be change by this module.
>  #  to avoid confliction, we should set the 'numFeature' with a large number, 
> highly sparse vector increase the computation complexity of model training
> we are working on fixing these problems due to our business need, thinking it 
> might or might not be an issue for others as well, we'd like to hear from the 
> community.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25365) a better way to handle vector index and sparsity in FeatureHasher implementation ?

2018-09-07 Thread Vincent (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16606746#comment-16606746
 ] 

Vincent commented on SPARK-25365:
-

[~nick.pentre...@gmail.com] Thanks.

> a better way to handle vector index and sparsity in FeatureHasher 
> implementation ?
> --
>
> Key: SPARK-25365
> URL: https://issues.apache.org/jira/browse/SPARK-25365
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Vincent
>Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo on 
> the hashed value is used to determine the vector index, it's suggested to use 
> a large integer value as the numFeature parameter
> we found several issues regarding current implementation: 
>  # Cannot get the feature name back by its index after featureHasher 
> transform, for example. when getting feature importance from decision tree 
> training followed by a FeatureHasher
>  # when index conflict, which is a great chance to happen especially when 
> 'numFeature' is relatively small, its value would be updated with the sum of 
> current and old value, ie, the value of the conflicted feature vector would 
> be change by this module.
>  #  to avoid confliction, we should set the 'numFeature' with a large number, 
> highly sparse vector increase the computation complexity of model training
> we are working on fixing these problems due to our business need, thinking it 
> might or might not be an issue for others as well, we'd like to hear from the 
> community.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25364) a better way to handle vector index and sparsity in FeatureHasher implementation ?

2018-09-07 Thread Vincent (JIRA)
Vincent created SPARK-25364:
---

 Summary: a better way to handle vector index and sparsity in 
FeatureHasher implementation ?
 Key: SPARK-25364
 URL: https://issues.apache.org/jira/browse/SPARK-25364
 Project: Spark
  Issue Type: Question
  Components: ML
Affects Versions: 2.3.1
Reporter: Vincent


In the current implementation of FeatureHasher.transform, a simple modulo on 
the hashed value is used to determine the vector index, and it is suggested to 
use a large integer value as the numFeatures parameter.

We found several issues with the current implementation:
 # The feature name cannot be recovered from its index after the FeatureHasher 
transform, for example when getting feature importances from a decision tree 
trained after a FeatureHasher.
 # When indices conflict, which has a great chance of happening especially when 
'numFeatures' is relatively small, the value is updated with the sum of the 
current and old values, i.e. the value of the conflicting feature in the vector 
is changed by this modulo mapping.
 # To avoid conflicts, 'numFeatures' has to be set to a large number, but the 
resulting highly sparse vectors increase the computation complexity of model 
training.

We are working on fixing these problems for our business needs. Since it might 
or might not be an issue for others as well, we'd like to hear from the 
community.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25365) a better way to handle vector index and sparsity in FeatureHasher implementation ?

2018-09-07 Thread Vincent (JIRA)
Vincent created SPARK-25365:
---

 Summary: a better way to handle vector index and sparsity in 
FeatureHasher implementation ?
 Key: SPARK-25365
 URL: https://issues.apache.org/jira/browse/SPARK-25365
 Project: Spark
  Issue Type: Question
  Components: ML
Affects Versions: 2.3.1
Reporter: Vincent


In the current implementation of FeatureHasher.transform, a simple modulo on 
the hashed value is used to determine the vector index, and it is suggested to 
use a large integer value as the numFeatures parameter.

We found several issues with the current implementation:
 # The feature name cannot be recovered from its index after the FeatureHasher 
transform, for example when getting feature importances from a decision tree 
trained after a FeatureHasher.
 # When indices conflict, which has a great chance of happening especially when 
'numFeatures' is relatively small, the value is updated with the sum of the 
current and old values, i.e. the value of the conflicting feature in the vector 
is changed by this modulo mapping.
 # To avoid conflicts, 'numFeatures' has to be set to a large number, but the 
resulting highly sparse vectors increase the computation complexity of model 
training.

We are working on fixing these problems for our business needs. Since it might 
or might not be an issue for others as well, we'd like to hear from the 
community.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25034) possible triple memory consumption in fetchBlockSync()

2018-08-06 Thread Vincent (JIRA)
Vincent created SPARK-25034:
---

 Summary: possible triple memory consumption in fetchBlockSync()
 Key: SPARK-25034
 URL: https://issues.apache.org/jira/browse/SPARK-25034
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0, 2.2.2, 2.4.0
Reporter: Vincent


Hello

In the code of _fetchBlockSync()_ in _BlockTransferService_, we have:
 
{code:java}
val ret = ByteBuffer.allocate(data.size.toInt)
ret.put(data.nioByteBuffer())
ret.flip()
result.success(new NioManagedBuffer(ret)) 
{code}

In some cases, the _data_ variable is a _NettyManagedBuffer_, whose underlying 
netty representation is a _CompositeByteBuffer_.

Going through the code above in this configuration, assuming that the variable 
_data_ holds N bytes:
1) we allocate a full buffer of N bytes in _ret_
2) calling _data.nioByteBuffer()_ on a  _CompositeByteBuffer_ will trigger a 
full merge of all the composite buffers, which will allocate  *again* a full 
buffer of N bytes
3) we copy to _ret_ the data byte by byte

This means that at some point the N bytes of data exist 3 times in memory.
Is this really necessary?
It is unclear to me why we have to process the data at all, given that we 
receive a _ManagedBuffer_ and we want to return a _ManagedBuffer_.
Is there something I'm missing here? It seems this whole operation could be 
done with 0 copies.
The only upside is that the new buffer will have merged all the composite 
buffer's arrays, but it is really not clear whether this is intended. In any 
case this could be done with a peak memory of 2N rather than 3N.

Cheers!
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24968) Configurable Chunksize in ChunkedByteBufferOutputStream

2018-07-31 Thread Vincent (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent resolved SPARK-24968.
-
Resolution: Fixed

> Configurable Chunksize in ChunkedByteBufferOutputStream
> ---
>
> Key: SPARK-24968
> URL: https://issues.apache.org/jira/browse/SPARK-24968
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.0, 2.4.0
>Reporter: Vincent
>Priority: Minor
>
> Hello,
> it seems that when creating a _ChunkedByteBufferOutputStream,_ the chunk size 
> is always configured to be 4MB. I suggest we make it configurable via spark 
> conf. This would allow to solve issues like SPARK-24917 (by increasing the 
> chunk size to a bigger value in the conf).
> What do you think ?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24968) Configurable Chunksize in ChunkedByteBufferOutputStream

2018-07-31 Thread Vincent (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563467#comment-16563467
 ] 

Vincent commented on SPARK-24968:
-

Indeed, they are closely related.

I'll close this ticket.

> Configurable Chunksize in ChunkedByteBufferOutputStream
> ---
>
> Key: SPARK-24968
> URL: https://issues.apache.org/jira/browse/SPARK-24968
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.0, 2.4.0
>Reporter: Vincent
>Priority: Minor
>
> Hello,
> it seems that when creating a _ChunkedByteBufferOutputStream,_ the chunk size 
> is always configured to be 4MB. I suggest we make it configurable via spark 
> conf. This would allow to solve issues like SPARK-24917 (by increasing the 
> chunk size to a bigger value in the conf).
> What do you think ?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24968) Configurable Chunksize in ChunkedByteBufferOutputStream

2018-07-30 Thread Vincent (JIRA)
Vincent created SPARK-24968:
---

 Summary: Configurable Chunksize in ChunkedByteBufferOutputStream
 Key: SPARK-24968
 URL: https://issues.apache.org/jira/browse/SPARK-24968
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0, 2.2.2, 2.4.0
Reporter: Vincent


Hello,

It seems that when creating a _ChunkedByteBufferOutputStream_, the chunk size 
is always configured to be 4 MB. I suggest we make it configurable via Spark 
conf. This would help solve issues like SPARK-24917 (by increasing the chunk 
size to a bigger value in the conf).

What do you think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24917) Sending a partition over netty results in 2x memory usage

2018-07-25 Thread Vincent (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent updated SPARK-24917:

Description: 
Hello

while investigating some OOM errors in Spark 2.2 [(here's my call 
stack)|https://image.ibb.co/hHa2R8/sparkOOM.png], I find the following behavior 
happening, which I think is weird:
 * a request happens to send a partition over network
 * this partition is 1.9 GB and is persisted in memory
 * this partition is apparently stored in a ByteBufferBlockData, that is made 
of a ChunkedByteBuffer, which is a list of (lots of) ByteBuffer of 4 MB each.
 * the call to toNetty() is supposed to only wrap all the arrays and not 
allocate any memory
 * yet the call stack shows that netty is allocating memory and is trying to 
consolidate all the chunks into one big 1.9GB array
 * this means that at this point the memory footprint is 2x the size of the 
actual partition (which is huge when the partition is 1.9GB)

Is this transient allocation expected?

After digging, it turns out that the actual copy is due to [this 
method|https://github.com/netty/netty/blob/4.0/buffer/src/main/java/io/netty/buffer/Unpooled.java#L260]
 in netty. If my initial buffer is made of more than DEFAULT_MAX_COMPONENTS 
(16) components it will trigger a re-allocation of all the buffer. This netty 
issue was fixed in this recent change : 
[https://github.com/netty/netty/commit/9b95b8ee628983e3e4434da93fffb893edff4aa2]

 

As a result, is it possible to benefit from this change somehow in spark 2.2 
and above? I don't know how the netty dependencies are handled for spark

 

NB: it seems this ticket: [https://jira.apache.org/jira/browse/SPARK-24307] 
somewhat changed the approach for Spark 2.4 by bypassing the netty buffer 
altogether. However, as written in that ticket, this approach *still* needs to 
have the *entire* block serialized in memory, so this would be a downgrade 
compared to fixing the netty issue when your buffer is < 2 GB.

 

Thanks!

 

 

  was:
Hello

while investigating some OOM errors in Spark 2.2 [(here's my call 
stack)|https://image.ibb.co/hHa2R8/sparkOOM.png], I find the following behavior 
happening, which I think is weird:
 * a request happens to send a partition over network
 * this partition is 1.9 GB and is persisted in memory
 * this partition is apparently stored in a ByteBufferBlockData, that is made 
of a ChunkedByteBuffer, which is a list of (lots of) ByteBuffer of 4 MB each.
 * the call to toNetty() is supposed to only wrap all the arrays and not 
allocate any memory
 * yet the call stack shows that netty is allocating memory and is trying to 
consolidate all the chunks into one big 1.9GB array
 * this means that at this point the memory footprint is 2x the size of the 
actual partition (which is huge when the partition is 1.9GB)

Is this transient allocation expected?

After digging, it turns out that the actual copy is due to [this 
method|https://github.com/netty/netty/blob/4.0/buffer/src/main/java/io/netty/buffer/Unpooled.java#L260]
 in netty. If my initial buffer is made of more than DEFAULT_MAX_COMPONENTS 
(16) components it will trigger a re-allocation of all the buffer. This netty 
issue was fixed in this recent change : 
[https://github.com/netty/netty/commit/9b95b8ee628983e3e4434da93fffb893edff4aa2]

 

As a result, is it possible to benefit from this change somehow in spark 2.2 
and above? I don't know how the netty dependencies are handled for spark

 

NB: it seems this ticket: [https://jira.apache.org/jira/browse/SPARK-24307] 
fixes the issue for spark 2.4 by bypassing netty buffer altogether

 

Thanks!

 

 


> Sending a partition over netty results in 2x memory usage
> -
>
> Key: SPARK-24917
> URL: https://issues.apache.org/jira/browse/SPARK-24917
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.2
>Reporter: Vincent
>Priority: Major
>
> Hello
> while investigating some OOM errors in Spark 2.2 [(here's my call 
> stack)|https://image.ibb.co/hHa2R8/sparkOOM.png], I find the following 
> behavior happening, which I think is weird:
>  * a request happens to send a partition over network
>  * this partition is 1.9 GB and is persisted in memory
>  * this partition is apparently stored in a ByteBufferBlockData, that is made 
> of a ChunkedByteBuffer, which is a list of (lots of) ByteBuffer of 4 MB each.
>  * the call to toNetty() is supposed to only wrap all the arrays and not 
> allocate any memory
>  * yet the call stack shows that netty is allocating memory and is trying to 
> consolidate all the chunks into one big 1.9GB array
>  * this means that at this point the memory footprint is 2x the size of the 
> actual partition (which is huge when the partition is 1.9GB)
> Is this transient allocation expected?
> 

[jira] [Updated] (SPARK-24917) Sending a partition over netty results in 2x memory usage

2018-07-25 Thread Vincent (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent updated SPARK-24917:

Description: 
Hello

while investigating some OOM errors in Spark 2.2 [(here's my call 
stack)|https://image.ibb.co/hHa2R8/sparkOOM.png], I find the following behavior 
happening, which I think is weird:
 * a request happens to send a partition over network
 * this partition is 1.9 GB and is persisted in memory
 * this partition is apparently stored in a ByteBufferBlockData, that is made 
of a ChunkedByteBuffer, which is a list of (lots of) ByteBuffer of 4 MB each.
 * the call to toNetty() is supposed to only wrap all the arrays and not 
allocate any memory
 * yet the call stack shows that netty is allocating memory and is trying to 
consolidate all the chunks into one big 1.9GB array
 * this means that at this point the memory footprint is 2x the size of the 
actual partition (which is huge when the partition is 1.9GB)

Is this transient allocation expected?

After digging, it turns out that the actual copy is due to [this 
method|https://github.com/netty/netty/blob/4.0/buffer/src/main/java/io/netty/buffer/Unpooled.java#L260]
 in netty. If my initial buffer is made of more than DEFAULT_MAX_COMPONENTS 
(16) components it will trigger a re-allocation of all the buffer. This netty 
issue was fixed in this recent change : 
[https://github.com/netty/netty/commit/9b95b8ee628983e3e4434da93fffb893edff4aa2]

 

As a result, is it possible to benefit from this change somehow in spark 2.2 
and above? I don't know how the netty dependencies are handled for spark

 

NB: it seems this ticket: [https://jira.apache.org/jira/browse/SPARK-24307] 
fixes the issue for spark 2.4 by bypassing netty buffer altogether

 

Thanks!

 

 

  was:
Hello

while investigating some OOM errors in Spark 2.2 [(here's my call 
stack)|https://image.ibb.co/hHa2R8/sparkOOM.png], I find the following behavior 
happening, which I think is weird:
 * a request happens to send a partition over network
 * this partition is 1.9 GB and is persisted in memory
 * this partition is apparently stored in a ByteBufferBlockData, that is made 
of a ChunkedByteBuffer, which is a list of (lots of) ByteBuffer of 4 MB each.
 * the call to toNetty() is supposed to only wrap all the arrays and not 
allocate any memory
 * yet the call stack shows that netty is allocating memory and is trying to 
consolidate all the chunks into one big 1.9GB array
 * this means that at this point the memory footprint is 2x the size of the 
actual partition (which is huge when the partition is 1.9GB)

Is this transient allocation expected?

After digging, it turns out that the actual copy is due to [this 
method|https://github.com/netty/netty/blob/4.0/buffer/src/main/java/io/netty/buffer/Unpooled.java#L260]
 in netty. If my initial buffer is made of more than DEFAULT_MAX_COMPONENTS 
(16) components it will trigger a re-allocation of all the buffer. This netty 
issue was fixed in this recent change : 
[https://github.com/netty/netty/commit/9b95b8ee628983e3e4434da93fffb893edff4aa2]

 

As a result, is it possible to benefit from this change somehow in spark 2.2 
and above? I don't know how the netty dependencies are handled for spark

 

Thanks!

 

 


> Sending a partition over netty results in 2x memory usage
> -
>
> Key: SPARK-24917
> URL: https://issues.apache.org/jira/browse/SPARK-24917
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.2
>Reporter: Vincent
>Priority: Major
>
> Hello
> while investigating some OOM errors in Spark 2.2 [(here's my call 
> stack)|https://image.ibb.co/hHa2R8/sparkOOM.png], I find the following 
> behavior happening, which I think is weird:
>  * a request happens to send a partition over network
>  * this partition is 1.9 GB and is persisted in memory
>  * this partition is apparently stored in a ByteBufferBlockData, that is made 
> of a ChunkedByteBuffer, which is a list of (lots of) ByteBuffer of 4 MB each.
>  * the call to toNetty() is supposed to only wrap all the arrays and not 
> allocate any memory
>  * yet the call stack shows that netty is allocating memory and is trying to 
> consolidate all the chunks into one big 1.9GB array
>  * this means that at this point the memory footprint is 2x the size of the 
> actual partition (which is huge when the partition is 1.9GB)
> Is this transient allocation expected?
> After digging, it turns out that the actual copy is due to [this 
> method|https://github.com/netty/netty/blob/4.0/buffer/src/main/java/io/netty/buffer/Unpooled.java#L260]
>  in netty. If my initial buffer is made of more than DEFAULT_MAX_COMPONENTS 
> (16) components it will trigger a re-allocation of all the buffer. This netty 
> issue was fixed in this recent 

[jira] [Created] (SPARK-24917) Sending a partition over netty results in 2x memory usage

2018-07-25 Thread Vincent (JIRA)
Vincent created SPARK-24917:
---

 Summary: Sending a partition over netty results in 2x memory usage
 Key: SPARK-24917
 URL: https://issues.apache.org/jira/browse/SPARK-24917
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.2
Reporter: Vincent


Hello

While investigating some OOM errors in Spark 2.2 [(here's my call 
stack)|https://image.ibb.co/hHa2R8/sparkOOM.png], I found the following 
behavior, which I think is weird:
 * a request happens to send a partition over the network
 * this partition is 1.9 GB and is persisted in memory
 * this partition is apparently stored in a ByteBufferBlockData, which is made 
of a ChunkedByteBuffer, i.e. a list of (lots of) ByteBuffers of 4 MB each
 * the call to toNetty() is supposed to only wrap all the arrays and not 
allocate any memory
 * yet the call stack shows that netty is allocating memory and trying to 
consolidate all the chunks into one big 1.9 GB array
 * this means that at this point the memory footprint is 2x the size of the 
actual partition (which is huge when the partition is 1.9 GB)

Is this transient allocation expected?

After digging, it turns out that the actual copy is due to [this 
method|https://github.com/netty/netty/blob/4.0/buffer/src/main/java/io/netty/buffer/Unpooled.java#L260]
 in netty. If my initial buffer is made of more than DEFAULT_MAX_COMPONENTS 
(16) components, it triggers a re-allocation of the whole buffer. This netty 
issue was fixed in this recent change: 
[https://github.com/netty/netty/commit/9b95b8ee628983e3e4434da93fffb893edff4aa2]

 

As a result, is it possible to benefit from this change somehow in Spark 2.2 
and above? I don't know how the netty dependencies are handled for Spark.

 

Thanks!

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22096) use aggregateByKeyLocally to save one stage in calculating ItemFrequency in NaiveBayes

2017-09-21 Thread Vincent (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent updated SPARK-22096:

Attachment: performance data for NB.png

> use aggregateByKeyLocally to save one stage in calculating ItemFrequency in 
> NaiveBayes
> --
>
> Key: SPARK-22096
> URL: https://issues.apache.org/jira/browse/SPARK-22096
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Vincent
>Priority: Minor
> Attachments: performance data for NB.png
>
>
> NaiveBayes currently uses aggregateByKey followed by a collect to calculate 
> frequency for each feature/label. We can implement a new function 
> 'aggregateByKeyLocally' in RDD that merges locally on each mapper before 
> sending results to a reducer to save one stage.
> We tested on NaiveBayes and see ~16% performance gain with these changes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22096) use aggregateByKeyLocally to save one stage in calculating ItemFrequency in NaiveBayes

2017-09-21 Thread Vincent (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent updated SPARK-22096:

Description: 
NaiveBayes currently uses aggregateByKey followed by a collect to calculate 
frequency for each feature/label. We can implement a new function 
'aggregateByKeyLocally' in RDD that merges locally on each mapper before 
sending results to a reducer to save one stage.
We tested on NaiveBayes and see ~16% performance gain with these changes.
[^performance data for NB.png]

  was:
NaiveBayes currently uses aggregateByKey followed by a collect to calculate 
frequency for each feature/label. We can implement a new function 
'aggregateByKeyLocally' in RDD that merges locally on each mapper before 
sending results to a reducer to save one stage.
We tested on NaiveBayes and see ~16% performance gain with these changes.


> use aggregateByKeyLocally to save one stage in calculating ItemFrequency in 
> NaiveBayes
> --
>
> Key: SPARK-22096
> URL: https://issues.apache.org/jira/browse/SPARK-22096
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Vincent
>Priority: Minor
> Attachments: performance data for NB.png
>
>
> NaiveBayes currently uses aggregateByKey followed by a collect to calculate 
> frequency for each feature/label. We can implement a new function 
> 'aggregateByKeyLocally' in RDD that merges locally on each mapper before 
> sending results to a reducer to save one stage.
> We tested on NaiveBayes and see ~16% performance gain with these changes.
> [^performance data for NB.png]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22096) use aggregateByKeyLocally to save one stage in calculating ItemFrequency in NaiveBayes

2017-09-21 Thread Vincent (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent updated SPARK-22096:

Description: 
NaiveBayes currently uses aggregateByKey followed by a collect to calculate 
frequency for each feature/label. We can implement a new function 
'aggregateByKeyLocally' in RDD that merges locally on each mapper before 
sending results to a reducer to save one stage.
We tested on NaiveBayes and see ~16% performance gain with these changes.

  was:
NaiveBayes currently uses aggregateByKey followed by a collect to calculate 
frequency for each feature/label. We can implement a new function 
'aggregateByKeyLocally' in RDD that merges locally on each mapper before 
sending results to a reducer to save one stage.
We tested on NaiveBayes and see ~20% performance gain with these changes.


> use aggregateByKeyLocally to save one stage in calculating ItemFrequency in 
> NaiveBayes
> --
>
> Key: SPARK-22096
> URL: https://issues.apache.org/jira/browse/SPARK-22096
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Vincent
>Priority: Minor
>
> NaiveBayes currently uses aggregateByKey followed by a collect to calculate 
> frequency for each feature/label. We can implement a new function 
> 'aggregateByKeyLocally' in RDD that merges locally on each mapper before 
> sending results to a reducer to save one stage.
> We tested on NaiveBayes and see ~16% performance gain with these changes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22098) Add aggregateByKeyLocally in RDD

2017-09-21 Thread Vincent (JIRA)
Vincent created SPARK-22098:
---

 Summary: Add aggregateByKeyLocally in RDD
 Key: SPARK-22098
 URL: https://issues.apache.org/jira/browse/SPARK-22098
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Vincent
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22096) use aggregateByKeyLocally to save one stage in calculating ItemFrequency in NaiveBayes

2017-09-21 Thread Vincent (JIRA)
Vincent created SPARK-22096:
---

 Summary: use aggregateByKeyLocally to save one stage in 
calculating ItemFrequency in NaiveBayes
 Key: SPARK-22096
 URL: https://issues.apache.org/jira/browse/SPARK-22096
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: Vincent
Priority: Minor


NaiveBayes currently uses aggregateByKey followed by a collect to calculate 
frequency for each feature/label. We can implement a new function 
'aggregateByKeyLocally' in RDD that merges locally on each mapper before 
sending results to a reducer to save one stage.
We tested on NaiveBayes and see ~20% performance gain with these changes.
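
For illustration, PySpark's existing RDD.reduceByKeyLocally already follows the 
same local-merge idea for the plain reduce case; the sketch below contrasts it 
with the current aggregateByKey + collect pattern (the toy data and the 
add_vectors helper are made up for the example, and aggregateByKeyLocally 
itself is the proposal, not an existing API):

{code:java}
# Toy (label, per-feature count vector) pairs standing in for the NaiveBayes input.
rdd = sc.parallelize([
    (0, [1.0, 0.0, 2.0]),
    (0, [0.0, 1.0, 1.0]),
    (1, [3.0, 1.0, 0.0]),
])

def add_vectors(a, b):
    # element-wise sum of the per-label feature counts
    return [x + y for x, y in zip(a, b)]

# Current pattern: aggregateByKey introduces a shuffle stage, then collect().
by_label = rdd.aggregateByKey([0.0, 0.0, 0.0], add_vectors, add_vectors).collect()

# reduceByKeyLocally merges on each mapper and returns a dict straight to the
# driver, which is the "save one stage" idea behind the proposed aggregateByKeyLocally.
by_label_local = rdd.reduceByKeyLocally(add_vectors)
{code}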



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-14 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125353#comment-16125353
 ] 

Vincent commented on SPARK-21688:
-

Sorry for the late reply.
Yes, it's simple and easy to check the env variables in the code, but I don't 
think that's the right thing to do.
First, I still believe that if a user decides to run on native BLAS to speed up 
their application, they should be aware of the proper settings, as mentioned in 
https://issues.apache.org/jira/browse/SPARK-21305; they can set 1, 2, or any 
other number of threads for native BLAS that gives them better performance.
Second, there are a bunch of BLAS variants (MKL, OpenBLAS, ATLAS, cuBLAS, 
etc.); each one has a different variable name for this setting, and checking 
all these variant settings in the code doesn't seem right.
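
For reference, the thread caps can at least be forwarded from the Spark side 
through executor environment variables; a minimal sketch assuming MKL or 
OpenBLAS is the native backend (the variable name differs per BLAS 
implementation, which is exactly the problem described above):

{code:java}
from pyspark import SparkConf

# spark.executorEnv.<NAME> forwards an environment variable to every executor.
conf = (SparkConf()
        .set("spark.executorEnv.MKL_NUM_THREADS", "1")        # Intel MKL
        .set("spark.executorEnv.OPENBLAS_NUM_THREADS", "1"))  # OpenBLAS
{code}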

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
>Priority: Minor
> Attachments: ddot unitest.png, mllib svm training.png, 
> native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
>
> In the current MLlib SVM implementation, we found that the CPU is not fully 
> utilized; one reason is that f2j BLAS is hard-coded in the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305), with proper settings 
> native BLAS is generally better than f2j at the unit-test level. Here we make 
> the BLAS operations in SVM go through MKL BLAS and get an end-to-end 
> performance report showing that in most cases native BLAS outperforms f2j 
> BLAS by up to 50%.
> So, we suggest removing those fixed f2j calls and going for native BLAS if 
> available. If this proposal is acceptable, we will move on to benchmark other 
> algorithms impacted.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121624#comment-16121624
 ] 

Vincent commented on SPARK-21688:
-

Okay, yes, true. It can still run without issue, but we are just offering 
another choice for those who want a 50% speedup or more by using native BLAS 
in their case; they can also stick to F2J with a simple setting in the Spark 
configuration.

The problem of default thread settings has been discussed in 
https://issues.apache.org/jira/browse/SPARK-21305. I believe it's non-trivial, 
but it seems to be a common issue for all native BLAS implementations, and 
there's no good solution to it for now.

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: ddot unitest.png, mllib svm training.png, 
> native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
>
> in current mllib SVM implementation, we found that the CPU is not fully 
> utilized, one reason is that f2j blas is set to be used in the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305) that with proper 
> settings, native blas is generally better than f2j on the uni-test level, 
> here we make the blas operations in SVM go with MKL blas and get an end to 
> end performance report showing that in most cases native blas outperformance 
> f2j blas up to 50%.
> So, we suggest removing those f2j-fixed calling and going for native blas if 
> available. If this proposal is acceptable, we will move on to benchmark other 
> algorithms impacted. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121589#comment-16121589
 ] 

Vincent commented on SPARK-21688:
-

[~srowen] Thanks for your comments. I think if a user decides to use native 
BLAS, they should be aware of the threading configuration impacts; checking 
this env variable in MLlib doesn't make sense. And no, we didn't just present 
the best-case result; instead, we took the average of three runs for each 
case. The results show that for small datasets native BLAS might not have an 
advantage over f2j, but the gap is small, and we would expect big-data 
processing to be the more common case here.

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: ddot unitest.png, mllib svm training.png, 
> native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
>
> in current mllib SVM implementation, we found that the CPU is not fully 
> utilized, one reason is that f2j blas is set to be used in the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305) that with proper 
> settings, native blas is generally better than f2j on the uni-test level, 
> here we make the blas operations in SVM go with MKL blas and get an end to 
> end performance report showing that in most cases native blas outperformance 
> f2j blas up to 50%.
> So, we suggest removing those f2j-fixed calling and going for native blas if 
> available. If this proposal is acceptable, we will move on to benchmark other 
> algorithms impacted. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121254#comment-16121254
 ] 

Vincent commented on SPARK-21688:
-

And if native BLAS is left with its default multi-threading setting, it could 
impact other operations on the JVM, as shown in native-trywait.png in the 
attached files.

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: ddot unitest.png, mllib svm training.png, 
> native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
>
> In the current MLlib SVM implementation we found that the CPU is not fully
> utilized; one reason is that f2j BLAS is hard-coded for the HingeGradient
> computation. As we found earlier
> (https://issues.apache.org/jira/browse/SPARK-21305), with proper settings
> native BLAS is generally better than f2j at the unit-test level, so here we
> made the BLAS operations in SVM go through MKL BLAS and obtained an end-to-end
> performance report showing that in most cases native BLAS outperforms f2j BLAS
> by up to 50%.
> So we suggest removing those f2j-fixed calls and going with native BLAS if
> available. If this proposal is acceptable, we will move on to benchmark the
> other affected algorithms.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Vincent (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent updated SPARK-21688:

Attachment: native-trywait.png

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: ddot unitest.png, mllib svm training.png, 
> native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
>
> In the current MLlib SVM implementation we found that the CPU is not fully
> utilized; one reason is that f2j BLAS is hard-coded for the HingeGradient
> computation. As we found earlier
> (https://issues.apache.org/jira/browse/SPARK-21305), with proper settings
> native BLAS is generally better than f2j at the unit-test level, so here we
> made the BLAS operations in SVM go through MKL BLAS and obtained an end-to-end
> performance report showing that in most cases native BLAS outperforms f2j BLAS
> by up to 50%.
> So we suggest removing those f2j-fixed calls and going with native BLAS if
> available. If this proposal is acceptable, we will move on to benchmark the
> other affected algorithms.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Vincent (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent updated SPARK-21688:

Attachment: (was: uni-test on ddot.png)

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: ddot unitest.png, mllib svm training.png, svm1.png, 
> svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
>
> In the current MLlib SVM implementation we found that the CPU is not fully
> utilized; one reason is that f2j BLAS is hard-coded for the HingeGradient
> computation. As we found earlier
> (https://issues.apache.org/jira/browse/SPARK-21305), with proper settings
> native BLAS is generally better than f2j at the unit-test level, so here we
> made the BLAS operations in SVM go through MKL BLAS and obtained an end-to-end
> performance report showing that in most cases native BLAS outperforms f2j BLAS
> by up to 50%.
> So we suggest removing those f2j-fixed calls and going with native BLAS if
> available. If this proposal is acceptable, we will move on to benchmark the
> other affected algorithms.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Vincent (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent updated SPARK-21688:

Attachment: ddot unitest.png

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: ddot unitest.png, mllib svm training.png, svm1.png, 
> svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
>
> In the current MLlib SVM implementation we found that the CPU is not fully
> utilized; one reason is that f2j BLAS is hard-coded for the HingeGradient
> computation. As we found earlier
> (https://issues.apache.org/jira/browse/SPARK-21305), with proper settings
> native BLAS is generally better than f2j at the unit-test level, so here we
> made the BLAS operations in SVM go through MKL BLAS and obtained an end-to-end
> performance report showing that in most cases native BLAS outperforms f2j BLAS
> by up to 50%.
> So we suggest removing those f2j-fixed calls and going with native BLAS if
> available. If this proposal is acceptable, we will move on to benchmark the
> other affected algorithms.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121236#comment-16121236
 ] 

Vincent commented on SPARK-21688:
-

Uploaded data we collected earlier: a unit test on ddot. For data sizes greater 
than 100, native BLAS normally has the advantage; if the size is smaller than 
100, f2j is the better choice.
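
A minimal sketch of the kind of ddot micro-benchmark behind this comparison (sizes, 
iteration counts and object names are illustrative; netlib-java is assumed to be on 
the classpath):

{code:scala}
import com.github.fommil.netlib.{BLAS => NetlibBLAS, F2jBLAS}

object DdotBench {
  // Time `iters` calls of ddot on the given implementation, in nanoseconds.
  def time(blas: NetlibBLAS, x: Array[Double], y: Array[Double], iters: Int): Long = {
    val start = System.nanoTime()
    var i = 0
    var acc = 0.0
    while (i < iters) { acc += blas.ddot(x.length, x, 1, y, 1); i += 1 }
    if (acc == Double.MinValue) println(acc)  // keep the JIT from eliminating the loop
    System.nanoTime() - start
  }

  def main(args: Array[String]): Unit = {
    val f2j = new F2jBLAS
    val native = NetlibBLAS.getInstance()   // MKL/OpenBLAS if installed, otherwise f2j again
    for (n <- Seq(10, 100, 1000, 100000)) {
      val x = Array.fill(n)(math.random)
      val y = Array.fill(n)(math.random)
      println(s"n=$n  f2j=${time(f2j, x, y, 10000) / 1e6} ms  native=${time(native, x, y, 10000) / 1e6} ms")
    }
  }
}
{code}

The crossover point will naturally shift with the BLAS build and thread settings.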

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: mllib svm training.png, svm1.png, svm2.png, 
> svm-mkl-1.png, svm-mkl-2.png, uni-test on ddot.png
>
>
> In the current MLlib SVM implementation we found that the CPU is not fully
> utilized; one reason is that f2j BLAS is hard-coded for the HingeGradient
> computation. As we found earlier
> (https://issues.apache.org/jira/browse/SPARK-21305), with proper settings
> native BLAS is generally better than f2j at the unit-test level, so here we
> made the BLAS operations in SVM go through MKL BLAS and obtained an end-to-end
> performance report showing that in most cases native BLAS outperforms f2j BLAS
> by up to 50%.
> So we suggest removing those f2j-fixed calls and going with native BLAS if
> available. If this proposal is acceptable, we will move on to benchmark the
> other affected algorithms.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Vincent (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent updated SPARK-21688:

Attachment: uni-test on ddot.png

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: mllib svm training.png, svm1.png, svm2.png, 
> svm-mkl-1.png, svm-mkl-2.png, uni-test on ddot.png
>
>
> In the current MLlib SVM implementation we found that the CPU is not fully
> utilized; one reason is that f2j BLAS is hard-coded for the HingeGradient
> computation. As we found earlier
> (https://issues.apache.org/jira/browse/SPARK-21305), with proper settings
> native BLAS is generally better than f2j at the unit-test level, so here we
> made the BLAS operations in SVM go through MKL BLAS and obtained an end-to-end
> performance report showing that in most cases native BLAS outperforms f2j BLAS
> by up to 50%.
> So we suggest removing those f2j-fixed calls and going with native BLAS if
> available. If this proposal is acceptable, we will move on to benchmark the
> other affected algorithms.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121209#comment-16121209
 ] 

Vincent commented on SPARK-21688:
-

Currently there are places in ML/MLlib, such as mllib SVM, where BLAS operations 
(dot, axpy, etc.) are bound to f2j, so there is no chance to use native BLAS. We 
understand f2j was chosen for the level-1 BLAS calls because of performance 
concerns, but those mainly stem from the multi-threading behaviour of native BLAS; 
with proper settings we won't be bothered by that issue. So maybe we should change 
the f2j-bound calls in the current implementation. [~srowen]
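
To make the distinction concrete, here is a simplified sketch (not Spark's actual 
internal wrapper; names are illustrative) of the difference between pinning a 
level-1 call to the pure-Java F2jBLAS and dispatching through netlib-java's loader, 
which uses a native library such as MKL or OpenBLAS when one is installed:

{code:scala}
import com.github.fommil.netlib.{BLAS => NetlibBLAS, F2jBLAS}

object Level1Blas {
  private val f2j: NetlibBLAS  = new F2jBLAS              // always the pure-Java implementation
  private val auto: NetlibBLAS = NetlibBLAS.getInstance() // native if available, else f2j fallback

  // y := a*x + y, either pinned to f2j or dispatched to whatever netlib-java loaded.
  def axpy(a: Double, x: Array[Double], y: Array[Double], useNative: Boolean): Unit = {
    val blas = if (useNative) auto else f2j
    blas.daxpy(x.length, a, x, 1, y, 1)
  }
}
{code}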

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: mllib svm training.png, svm1.png, svm2.png, 
> svm-mkl-1.png, svm-mkl-2.png
>
>
> In the current MLlib SVM implementation we found that the CPU is not fully
> utilized; one reason is that f2j BLAS is hard-coded for the HingeGradient
> computation. As we found earlier
> (https://issues.apache.org/jira/browse/SPARK-21305), with proper settings
> native BLAS is generally better than f2j at the unit-test level, so here we
> made the BLAS operations in SVM go through MKL BLAS and obtained an end-to-end
> performance report showing that in most cases native BLAS outperforms f2j BLAS
> by up to 50%.
> So we suggest removing those f2j-fixed calls and going with native BLAS if
> available. If this proposal is acceptable, we will move on to benchmark the
> other affected algorithms.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121113#comment-16121113
 ] 

Vincent edited comment on SPARK-21688 at 8/10/17 6:13 AM:
--

Attached SVM profiling data and training comparison data for both the F2J and 
the MKL solutions.


was (Author: vincexie):
profiling

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: mllib svm training.png, svm1.png, svm2.png, 
> svm-mkl-1.png, svm-mkl-2.png
>
>
> In the current MLlib SVM implementation we found that the CPU is not fully
> utilized; one reason is that f2j BLAS is hard-coded for the HingeGradient
> computation. As we found earlier
> (https://issues.apache.org/jira/browse/SPARK-21305), with proper settings
> native BLAS is generally better than f2j at the unit-test level, so here we
> made the BLAS operations in SVM go through MKL BLAS and obtained an end-to-end
> performance report showing that in most cases native BLAS outperforms f2j BLAS
> by up to 50%.
> So we suggest removing those f2j-fixed calls and going with native BLAS if
> available. If this proposal is acceptable, we will move on to benchmark the
> other affected algorithms.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Vincent (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent updated SPARK-21688:

Attachment: svm1.png
svm2.png
svm-mkl-1.png
svm-mkl-2.png

profiling

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: mllib svm training.png, svm1.png, svm2.png, 
> svm-mkl-1.png, svm-mkl-2.png
>
>
> In the current MLlib SVM implementation we found that the CPU is not fully
> utilized; one reason is that f2j BLAS is hard-coded for the HingeGradient
> computation. As we found earlier
> (https://issues.apache.org/jira/browse/SPARK-21305), with proper settings
> native BLAS is generally better than f2j at the unit-test level, so here we
> made the BLAS operations in SVM go through MKL BLAS and obtained an end-to-end
> performance report showing that in most cases native BLAS outperforms f2j BLAS
> by up to 50%.
> So we suggest removing those f2j-fixed calls and going with native BLAS if
> available. If this proposal is acceptable, we will move on to benchmark the
> other affected algorithms.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-09 Thread Vincent (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent updated SPARK-21688:

Attachment: mllib svm training.png

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: mllib svm training.png
>
>
> In the current MLlib SVM implementation we found that the CPU is not fully
> utilized; one reason is that f2j BLAS is hard-coded for the HingeGradient
> computation. As we found earlier
> (https://issues.apache.org/jira/browse/SPARK-21305), with proper settings
> native BLAS is generally better than f2j at the unit-test level, so here we
> made the BLAS operations in SVM go through MKL BLAS and obtained an end-to-end
> performance report showing that in most cases native BLAS outperforms f2j BLAS
> by up to 50%.
> So we suggest removing those f2j-fixed calls and going with native BLAS if
> available. If this proposal is acceptable, we will move on to benchmark the
> other affected algorithms.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-09 Thread Vincent (JIRA)
Vincent created SPARK-21688:
---

 Summary: performance improvement in mllib SVM with native BLAS 
 Key: SPARK-21688
 URL: https://issues.apache.org/jira/browse/SPARK-21688
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.2.0
 Environment: 4 nodes: 1 master node, 3 worker nodes
model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
Memory : 180G
num of core per node: 10
Reporter: Vincent


In the current MLlib SVM implementation we found that the CPU is not fully 
utilized; one reason is that f2j BLAS is hard-coded for the HingeGradient 
computation. As we found earlier (https://issues.apache.org/jira/browse/SPARK-21305), 
with proper settings native BLAS is generally better than f2j at the unit-test 
level, so here we made the BLAS operations in SVM go through MKL BLAS and 
obtained an end-to-end performance report showing that in most cases native BLAS 
outperforms f2j BLAS by up to 50%.
So we suggest removing those f2j-fixed calls and going with native BLAS if 
available. If this proposal is acceptable, we will move on to benchmark the 
other affected algorithms.
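
For context, a simplified sketch (not Spark's actual HingeGradient code; netlib-java 
is assumed) of the per-sample hinge-gradient update where these level-1 BLAS calls sit:

{code:scala}
import com.github.fommil.netlib.BLAS

object HingeGradientSketch {
  private val blas = BLAS.getInstance()   // native (e.g. MKL) when available, f2j otherwise

  /** Adds this sample's hinge-loss gradient to cumGradient and returns the sample's loss. */
  def addGradient(features: Array[Double], label: Double,
                  weights: Array[Double], cumGradient: Array[Double]): Double = {
    val margin = blas.ddot(features.length, features, 1, weights, 1)
    val labelScaled = 2 * label - 1.0                 // map {0, 1} labels to {-1, +1}
    if (1.0 > labelScaled * margin) {
      // cumGradient += (-labelScaled) * features
      blas.daxpy(features.length, -labelScaled, features, 1, cumGradient, 1)
      1.0 - labelScaled * margin
    } else {
      0.0
    }
  }
}
{code}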



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20988) Convert logistic regression to new aggregator framework

2017-06-14 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048769#comment-16048769
 ] 

Vincent commented on SPARK-20988:
-

okay, no problem :)

> Convert logistic regression to new aggregator framework
> ---
>
> Key: SPARK-20988
> URL: https://issues.apache.org/jira/browse/SPARK-20988
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Use the hierarchy from SPARK-19762 for logistic regression optimization



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20988) Convert logistic regression to new aggregator framework

2017-06-13 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048593#comment-16048593
 ] 

Vincent commented on SPARK-20988:
-

Oops. I have finished the conversion part, but there is still other work to do, 
namely using BLAS in the multinomial gradient update as explained in SPARK-17134. 
If you have already started it, I will go check LinearSVC :)

> Convert logistic regression to new aggregator framework
> ---
>
> Key: SPARK-20988
> URL: https://issues.apache.org/jira/browse/SPARK-20988
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Use the hierarchy from SPARK-19762 for logistic regression optimization



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20988) Convert logistic regression to new aggregator framework

2017-06-12 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047370#comment-16047370
 ] 

Vincent commented on SPARK-20988:
-

I can work on this if no one is working on it now :)

> Convert logistic regression to new aggregator framework
> ---
>
> Key: SPARK-20988
> URL: https://issues.apache.org/jira/browse/SPARK-20988
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Use the hierarchy from SPARK-19762 for logistic regression optimization



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21058) potential SVD optimization

2017-06-11 Thread Vincent (JIRA)
Vincent created SPARK-21058:
---

 Summary: potential SVD optimization
 Key: SPARK-21058
 URL: https://issues.apache.org/jira/browse/SPARK-21058
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.1.1
Reporter: Vincent


In the current implementation, computeSVD computes the SVD of a matrix A by 
first computing A^T*A and then taking the SVD of that Gramian matrix. We found 
that the Gramian matrix computation is the hot spot of the overall SVD 
computation. While the SVD on the Gramian matrix benefits the computation for a 
skinny matrix, for a non-skinny matrix it can become a huge overhead. So, is it 
possible to offer another option that computes the SVD on the original matrix 
instead of the Gramian matrix? We could decide which way to go based on the 
ratio between numCols and numRows, or simply from a user setting.
We have observed a handsome gain on a toy dataset from running the SVD on the 
original matrix instead of the Gramian matrix. If the proposal is acceptable, we 
will start to work on the patch and gather more performance data.
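
For reference, this is how computeSVD is invoked today (Scala API; `sc` is an 
existing SparkContext and the data is illustrative). The proposal above would add 
a knob for choosing between the Gramian path and an SVD on the original matrix:

{code:scala}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 10.0)
))
val mat = new RowMatrix(rows)

// k = number of leading singular values/vectors to keep.
val svd = mat.computeSVD(2, computeU = true)
println(svd.s)   // singular values
{code}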



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21049) why do we need computeGramianMatrix when computing SVD

2017-06-10 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045536#comment-16045536
 ] 

Vincent commented on SPARK-21049:
-

[~srowen] Thanks, that's right. But we quite often find that the matrix is not 
skinny, and a lot of time is spent computing the Gramian matrix. In such cases, 
if we compute the SVD on the original matrix we see at least a 5x speedup. So I 
wonder whether it's possible to add an option here that lets the user choose 
between the Gramian and the original matrix. After all, users know their data 
better. What do you think?

> why do we need computeGramianMatrix when computing SVD
> --
>
> Key: SPARK-21049
> URL: https://issues.apache.org/jira/browse/SPARK-21049
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.1
>Reporter: Vincent
>
> computeSVD computes the SVD of a matrix A by first computing A^T*A and then
> taking the SVD of that Gramian matrix. We found that the Gramian matrix
> computation is the hot spot of the overall SVD computation, but, per my
> understanding, we could simply do the SVD on the original matrix. The singular
> vectors of the Gramian matrix are the same as the right singular vectors of the
> original matrix A, while its singular values are the squares of those of the
> original matrix. Why do we run the SVD on the Gramian matrix then?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21049) why do we need computeGramianMatrix when computing SVD

2017-06-10 Thread Vincent (JIRA)
Vincent created SPARK-21049:
---

 Summary: why do we need computeGramianMatrix when computing SVD
 Key: SPARK-21049
 URL: https://issues.apache.org/jira/browse/SPARK-21049
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.1.1
Reporter: Vincent


computeSVD computes the SVD of a matrix A by first computing A^T*A and then 
taking the SVD of that Gramian matrix. We found that the Gramian matrix 
computation is the hot spot of the overall SVD computation, but, per my 
understanding, we could simply do the SVD on the original matrix. The singular 
vectors of the Gramian matrix are the same as the right singular vectors of the 
original matrix A, while its singular values are the squares of those of the 
original matrix. Why do we run the SVD on the Gramian matrix then?
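
For reference, the linear-algebra identity behind the description (standard SVD 
algebra, not taken from the JIRA itself): the eigenvectors of the Gramian are the 
right singular vectors of A, its eigenvalues are the squares of A's singular 
values, and computeSVD recovers the singular values by taking square roots.

{code}
A = U \Sigma V^{\top} \;\Rightarrow\; A^{\top} A = V \Sigma^{2} V^{\top},
\qquad \sigma_i(A) = \sqrt{\lambda_i(A^{\top} A)}
{code}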



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator

2017-05-07 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000176#comment-16000176
 ] 

Vincent commented on SPARK-17134:
-

I will submit a PR for this issue soon.

> Use level 2 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-17134
> URL: https://issues.apache.org/jira/browse/SPARK-17134
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>
> Multinomial logistic regression uses LogisticAggregator class for gradient 
> updates. We should look into refactoring MLOR to use level 2 BLAS operations 
> for the updates. Performance testing should be done to show improvements.
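
For readers unfamiliar with the terminology, a small illustrative sketch (not 
Spark's LogisticAggregator; netlib-java assumed) of the kind of level-2 call 
(dgemv) meant here, computing all class margins in one call instead of one ddot 
per class:

{code:scala}
import com.github.fommil.netlib.BLAS

object MlorMargins {
  // Coefficients laid out column-major as a (numClasses x numFeatures) matrix.
  def margins(coefficients: Array[Double], features: Array[Double], numClasses: Int): Array[Double] = {
    val numFeatures = features.length
    val out = new Array[Double](numClasses)
    // out := 1.0 * A * x + 0.0 * out
    BLAS.getInstance().dgemv("N", numClasses, numFeatures, 1.0, coefficients, numClasses,
      features, 1, 0.0, out, 1)
    out
  }
}
{code}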



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19852) StringIndexer.setHandleInvalid should have another option 'new': Python API and docs

2017-03-07 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900463#comment-15900463
 ] 

Vincent commented on SPARK-19852:
-

I can work on this issue, since it is related to SPARK-17498

> StringIndexer.setHandleInvalid should have another option 'new': Python API 
> and docs
> 
>
> Key: SPARK-19852
> URL: https://issues.apache.org/jira/browse/SPARK-19852
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Update Python API for StringIndexer so setHandleInvalid doc is correct.  This 
> will probably require:
> * putting HandleInvalid within StringIndexer to update its built-in doc (See 
> Bucketizer for an example.)
> * updating API docs and maybe the guide



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7132) Add fit with validation set to spark.ml GBT

2017-02-22 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877781#comment-15877781
 ] 

Vincent commented on SPARK-7132:


Hi All, any update on this issue?

> Add fit with validation set to spark.ml GBT
> ---
>
> Key: SPARK-7132
> URL: https://issues.apache.org/jira/browse/SPARK-7132
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In spark.mllib GradientBoostedTrees, we have a method runWithValidation which 
> takes a validation set.  We should add that to the spark.ml API.
> This will require a bit of thinking about how the Pipelines API should handle 
> a validation set (since Transformers and Estimators only take 1 input 
> DataFrame).  The current plan is to include an extra column in the input 
> DataFrame which indicates whether the row is for training, validation, etc.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14682) Provide evaluateEachIteration method or equivalent for spark.ml GBTs

2017-02-21 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15875618#comment-15875618
 ] 

Vincent commented on SPARK-14682:
-

any update?

> Provide evaluateEachIteration method or equivalent for spark.ml GBTs
> 
>
> Key: SPARK-14682
> URL: https://issues.apache.org/jira/browse/SPARK-14682
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> spark.mllib GradientBoostedTrees provides an evaluateEachIteration method. We
> should provide that or an equivalent for spark.ml.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19590) Update the document for QuantileDiscretizer in pyspark

2017-02-13 Thread Vincent (JIRA)
Vincent created SPARK-19590:
---

 Summary: Update the document for QuantileDiscretizer in pyspark
 Key: SPARK-19590
 URL: https://issues.apache.org/jira/browse/SPARK-19590
 Project: Spark
  Issue Type: Documentation
  Components: ML, PySpark
Affects Versions: 2.1.0
Reporter: Vincent


Document the changes in pyspark for 
https://issues.apache.org/jira/browse/SPARK-17219



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17498) StringIndexer.setHandleInvalid should have another option 'new'

2017-02-08 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15857600#comment-15857600
 ] 

Vincent commented on SPARK-17498:
-

I can take the issue and make a PR

> StringIndexer.setHandleInvalid should have another option 'new'
> ---
>
> Key: SPARK-17498
> URL: https://issues.apache.org/jira/browse/SPARK-17498
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Miroslav Balaz
>
> That will map an unseen label to the maximum known label + 1; IndexToString
> would map that back to "" or NA, if there is something like that in Spark.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18023) Adam optimizer

2016-11-20 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682491#comment-15682491
 ] 

Vincent commented on SPARK-18023:
-

Thanks [~mlnick], that's really what we need. When I wrote the code for Adagrad, 
I did find some conflicts with the original design. These new optimizers do not 
share a common API with what we have now in MLlib, and they also have a different 
workflow, so it's hard to fit them in and make a good PR without changing the 
original design; for now I just made a separate package instead.

> Adam optimizer
> --
>
> Key: SPARK-18023
> URL: https://issues.apache.org/jira/browse/SPARK-18023
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Vincent
>Priority: Minor
>
> SGD methods can converge incredibly slowly, or diverge, if their learning rate
> alpha is set inappropriately. Many alternative methods have been proposed to
> produce desirable convergence with less dependence on hyperparameter settings,
> and to help escape local optima, e.g. Momentum, NAG (Nesterov's Accelerated
> Gradient), Adagrad, RMSProp, etc.
> Among these, Adam is one of the popular algorithms; it performs first-order
> gradient-based optimization of stochastic objective functions. It has been
> shown to be well suited for problems with large data and/or many parameters,
> and for problems with noisy and/or sparse gradients, and it is computationally
> efficient. Refer to this paper for details.
> In fact, TensorFlow has implemented most of the adaptive optimization methods
> mentioned, and we have seen that Adam outperforms most SGD methods in certain
> cases, such as a very sparse dataset in an FM model.
> It would be nice for Spark to have these adaptive optimization methods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17055) add groupKFold to CrossValidator

2016-11-02 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15631286#comment-15631286
 ] 

Vincent commented on SPARK-17055:
-

[~srowen] No offense. Maybe we can invite more people to have a look at this 
issue? I saw that [~mengxr], [~josephkb] and [~sethah] have done work similar to 
this. What do you think of this one? Do you all agree that we should drop it?

P.S. Leaving the coding aside for now, it probably needs more work to make a 
complete PR. :)

Thanks.

> add groupKFold to CrossValidator
> 
>
> Key: SPARK-17055
> URL: https://issues.apache.org/jira/browse/SPARK-17055
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincent
>Priority: Minor
>
> The current CrossValidator only supports k-fold, which randomly divides all the
> samples into k groups. But when data is gathered from different subjects and we
> want to avoid over-fitting, we want to hold out samples with certain labels from
> the training data and put them into the validation fold, i.e. ensure that the
> same label does not appear in both the training and the test sets.
> Mainstream packages like Sklearn already support such a cross-validation
> method.
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)
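
As a concrete illustration of the requested semantics (a sketch only, not the 
actual PR; column and function names are made up), a group-aware k-fold can be 
expressed over a DataFrame by hashing the group column into folds so that a group 
never spans both sides of a split:

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{abs, col, hash}

def groupKFold(df: DataFrame, groupCol: String, k: Int): Seq[(DataFrame, DataFrame)] = {
  // Every row with the same group value lands in the same fold.
  val withFold = df.withColumn("fold", abs(hash(col(groupCol))) % k)
  (0 until k).map { i =>
    val training   = withFold.filter(col("fold") =!= i).drop("fold")
    val validation = withFold.filter(col("fold") === i).drop("fold")
    (training, validation)
  }
}
{code}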



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17055) add groupKFold to CrossValidator

2016-11-01 Thread Vincent (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent updated SPARK-17055:

Summary: add groupKFold to CrossValidator  (was: add labelKFold to 
CrossValidator)

> add groupKFold to CrossValidator
> 
>
> Key: SPARK-17055
> URL: https://issues.apache.org/jira/browse/SPARK-17055
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincent
>Priority: Minor
>
> The current CrossValidator only supports k-fold, which randomly divides all the
> samples into k groups. But when data is gathered from different subjects and we
> want to avoid over-fitting, we want to hold out samples with certain labels from
> the training data and put them into the validation fold, i.e. ensure that the
> same label does not appear in both the training and the test sets.
> Mainstream packages like Sklearn already support such a cross-validation
> method.
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator

2016-11-01 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15625108#comment-15625108
 ] 

Vincent commented on SPARK-17055:
-

[~sowen] Okay, hmm, I guess we have some misunderstanding here. 
[~remi.delas...@gmail.com] reviewed the code and gave feedback that we should be 
prepared to accommodate other folding methods if any appear. But my opinion was 
that we should align with the current design in MLlib, because: 1. we don't have 
that many folding methods in MLlib so far; 2. changing the API as Remi proposed 
would have an impact on current cross-validation usage, and I think it'd be 
better to get the nod from committers for that.
So I suggest reopening this ticket, because this folding method is useful in 
practice and, from what I have learned, some people in the community actually 
need/use this PR when they use Spark ML. :)

> add labelKFold to CrossValidator
> 
>
> Key: SPARK-17055
> URL: https://issues.apache.org/jira/browse/SPARK-17055
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincent
>Priority: Minor
>
> The current CrossValidator only supports k-fold, which randomly divides all the
> samples into k groups. But when data is gathered from different subjects and we
> want to avoid over-fitting, we want to hold out samples with certain labels from
> the training data and put them into the validation fold, i.e. ensure that the
> same label does not appear in both the training and the test sets.
> Mainstream packages like Sklearn already support such a cross-validation
> method.
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator

2016-10-31 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15624114#comment-15624114
 ] 

Vincent commented on SPARK-17055:
-

[~srowen] May I ask why we closed this issue? It'd be helpful for us to 
understand the current guidelines if we are to implement more features in 
ML/MLlib. Thanks.

> add labelKFold to CrossValidator
> 
>
> Key: SPARK-17055
> URL: https://issues.apache.org/jira/browse/SPARK-17055
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincent
>Priority: Minor
>
> The current CrossValidator only supports k-fold, which randomly divides all the
> samples into k groups. But when data is gathered from different subjects and we
> want to avoid over-fitting, we want to hold out samples with certain labels from
> the training data and put them into the validation fold, i.e. ensure that the
> same label does not appear in both the training and the test sets.
> Mainstream packages like Sklearn already support such a cross-validation
> method.
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18023) Adam optimizer

2016-10-20 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15590951#comment-15590951
 ] 

Vincent commented on SPARK-18023:
-

I can start with Adam, and then maybe the other Ada methods after that.

> Adam optimizer
> --
>
> Key: SPARK-18023
> URL: https://issues.apache.org/jira/browse/SPARK-18023
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Vincent
>Priority: Minor
>
> SGD methods can converge incredibly slowly, or diverge, if their learning rate
> alpha is set inappropriately. Many alternative methods have been proposed to
> produce desirable convergence with less dependence on hyperparameter settings,
> and to help escape local optima, e.g. Momentum, NAG (Nesterov's Accelerated
> Gradient), Adagrad, RMSProp, etc.
> Among these, Adam is one of the popular algorithms; it performs first-order
> gradient-based optimization of stochastic objective functions. It has been
> shown to be well suited for problems with large data and/or many parameters,
> and for problems with noisy and/or sparse gradients, and it is computationally
> efficient. Refer to this paper for details.
> In fact, TensorFlow has implemented most of the adaptive optimization methods
> mentioned, and we have seen that Adam outperforms most SGD methods in certain
> cases, such as a very sparse dataset in an FM model.
> It would be nice for Spark to have these adaptive optimization methods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18023) Adam optimizer

2016-10-20 Thread Vincent (JIRA)
Vincent created SPARK-18023:
---

 Summary: Adam optimizer
 Key: SPARK-18023
 URL: https://issues.apache.org/jira/browse/SPARK-18023
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Vincent
Priority: Minor


SGD methods can converge incredibly slowly, or diverge, if their learning rate 
alpha is set inappropriately. Many alternative methods have been proposed to 
produce desirable convergence with less dependence on hyperparameter settings, 
and to help escape local optima, e.g. Momentum, NAG (Nesterov's Accelerated 
Gradient), Adagrad, RMSProp, etc.
Among these, Adam is one of the popular algorithms; it performs first-order 
gradient-based optimization of stochastic objective functions. It has been shown 
to be well suited for problems with large data and/or many parameters, and for 
problems with noisy and/or sparse gradients, and it is computationally efficient. 
Refer to this paper for details.

In fact, TensorFlow has implemented most of the adaptive optimization methods 
mentioned, and we have seen that Adam outperforms most SGD methods in certain 
cases, such as a very sparse dataset in an FM model.

It would be nice for Spark to have these adaptive optimization methods.
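
For reference, the update rule in question, as a minimal self-contained sketch of 
one Adam step on dense arrays (the standard Kingma & Ba formulation; names and 
defaults are illustrative, not a Spark API):

{code:scala}
object AdamStep {
  /** In-place Adam update of theta given the gradient at step t (t starts at 1). */
  def step(theta: Array[Double], grad: Array[Double],
           m: Array[Double], v: Array[Double], t: Int,
           alpha: Double = 0.001, beta1: Double = 0.9,
           beta2: Double = 0.999, eps: Double = 1e-8): Unit = {
    var i = 0
    while (i < theta.length) {
      m(i) = beta1 * m(i) + (1 - beta1) * grad(i)              // biased first-moment estimate
      v(i) = beta2 * v(i) + (1 - beta2) * grad(i) * grad(i)    // biased second-moment estimate
      val mHat = m(i) / (1 - math.pow(beta1, t))               // bias correction
      val vHat = v(i) / (1 - math.pow(beta2, t))
      theta(i) -= alpha * mHat / (math.sqrt(vHat) + eps)
      i += 1
    }
  }
}
{code}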



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-10-09 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559597#comment-15559597
 ] 

Vincent commented on SPARK-17219:
-

No problem. I will try to submit another PR based on above discussions.

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>Assignee: Vincent
> Fix For: 2.1.0
>
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attached Titanic CSV data and trying to bin the "age" column
> using the QuantileDiscretizer with 10 bins specified. The age column has a lot
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestion would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has null even 
> though the fit data did not. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-10-07 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557034#comment-15557034
 ] 

Vincent commented on SPARK-17219:
-

[~josephkb] [~srowen] [~timhunter] let me know what I can do to help if there 
is anything.

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>Assignee: Vincent
> Fix For: 2.1.0
>
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attached Titanic CSV data and trying to bin the "age" column
> using the QuantileDiscretizer with 10 bins specified. The age column has a lot
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestion would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has null even 
> though the fit data did not. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-10-07 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557021#comment-15557021
 ] 

Vincent commented on SPARK-17219:
-

In this PR (https://github.com/apache/spark/pull/14858), NaN values are always 
put into the last bucket.
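
A small usage sketch of the behaviour described in that PR (illustrative only; 
`spark` is an existing SparkSession and the data/column names are made up):

{code:scala}
import org.apache.spark.ml.feature.QuantileDiscretizer

val df = spark.createDataFrame(
  Seq((0, 18.0), (1, 25.0), (2, 40.0), (3, 63.0), (4, Double.NaN), (5, Double.NaN))
).toDF("id", "age")

val discretizer = new QuantileDiscretizer()
  .setInputCol("age")
  .setOutputCol("ageBucket")
  .setNumBuckets(3)

// Under the behaviour described in the linked PR, rows whose "age" is NaN are
// assigned to one extra bucket at the end instead of producing NaN split points.
discretizer.fit(df).transform(df).show()
{code}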

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>Assignee: Vincent
> Fix For: 2.1.0
>
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attached Titanic CSV data and trying to bin the "age" column
> using the QuantileDiscretizer with 10 bins specified. The age column has a lot
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestion would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has null even 
> though the fit data did not. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17498) StringIndexer.setHandleInvalid should have another option 'new'

2016-09-12 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15483514#comment-15483514
 ] 

Vincent edited comment on SPARK-17498 at 9/12/16 8:55 AM:
--

Here is how we (cc [~qhuang]) look at this issue; correct me if there is any 
misunderstanding, [~miro.balaz].
val df = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), 
(5, "c")), 2)
val indexer = new StringIndexer().fit(df)
When transform is called on a new dataframe with unseen labels, say,
val dfNew = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "e"), (3, "d")), 2)
indexer.transform(dfNew)
it should return 3, 4 for the labels "d", "e" instead of skipping/deleting the 
new incoming labels, and IndexToString should return NaN for these added indexes 
3, 4. (A self-contained version of this example is sketched below.)

[~yanboliang] [~srowen] [~josephkb] What do you think of this issue? Currently it 
can either skip the unseen label or throw an error in such a case; do you think 
we should add such a 'new' handling option for StringIndexer as proposed?
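
A self-contained version of the example above (illustrative; `spark` is an 
existing SparkSession, and the input/output column settings omitted in the 
snippet above are added here):

{code:scala}
import org.apache.spark.ml.feature.StringIndexer

val df = spark.createDataFrame(
  Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)

val dfNew = spark.createDataFrame(
  Seq((0, "a"), (1, "b"), (2, "e"), (3, "d"))
).toDF("id", "category")

// Today this either throws (default handleInvalid = "error") or drops the rows
// (handleInvalid = "skip"); the proposed 'new' option would instead map "d" and "e"
// to fresh indices (3, 4).
indexer.transform(dfNew).show()
{code}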


was (Author: vincexie):
Here is what we cc [~qhuang] see about this issue
and correct me if any misunderstanding [~miro.balaz]
val df= sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), 
(5, "c")), 2)
val indexer = new StringIndexer().fit(df)
when transform is call on a new dataframe with unseen label, 
say, 
val dfNew = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "e"), (3, "d")), 2)
indexer.transform(dfNew)
should return 3, 4 for label "d", "e" instead of skipping/deleting the new 
incoming labels, and IndexToString  should return NaN for these added indexes 
3, 4

[~yanboliang] [~srowen] [~josephkb] what do you think of this issue? Currently 
it can either skip the unseen label or throw an error in such case, do you 
think we should add such 'new' way of handler as proposed for StringIndexer?

> StringIndexer.setHandleInvalid should have another option 'new'
> ---
>
> Key: SPARK-17498
> URL: https://issues.apache.org/jira/browse/SPARK-17498
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Miroslav Balaz
>
> That will map an unseen label to the maximum known label + 1; IndexToString
> would map that back to "" or NA, if there is something like that in Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17498) StringIndexer.setHandleInvalid should have another option 'new'

2016-09-12 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15483514#comment-15483514
 ] 

Vincent edited comment on SPARK-17498 at 9/12/16 8:43 AM:
--

Here is what we cc [~qhuang] see about this issue
and correct me if any misunderstanding [~miro.balaz]
val df= sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), 
(5, "c")), 2)
val indexer = new StringIndexer().fit(df)
when transform is call on a new dataframe with unseen label, 
say, 
val dfNew = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "e"), (3, "d")), 2)
indexer.transform(dfNew)
should return 3, 4 for label "d", "e" instead of skipping/deleting the new 
incoming labels, and IndexToString  should return NaN for these added indexes 
3, 4

[~yanboliang] [~srowen] [~josephkb] what do you think of this issue? Currently 
it can either skip the unseen label or throw an error in such case, do you 
think we should add such 'new' way of handler as proposed for StringIndexer?


was (Author: vincexie):
Here is what we cc [~qhuang] see about this issue
and correct me if any misunderstanding [~miro.balaz]
val df= sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), 
(5, "c")), 2)
val indexer = new StringIndexer().fit(df)
when transform is call on a new dataframe with unseen label, 
say, 
val dfNew = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "e"), (3, "d")), 2)
indexer.transform(dfNew)
should return 3, 4 for label "d", "e" instead of skipping/deleting the new 
incoming labels, and IndexToString  should return NaN for these added indexes 
3, 4

[~yanboliang] [~srowen] [~josephkb] what do you think of this issue? Currently 
it can either skip the unseen label or throw an error for such case, do you 
think we should add such 'new' way of handler for StringIndexer?

> StringIndexer.setHandleInvalid should have another option 'new'
> ---
>
> Key: SPARK-17498
> URL: https://issues.apache.org/jira/browse/SPARK-17498
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Miroslav Balaz
>
> That will map an unseen label to the maximum known label + 1; IndexToString
> would map that back to "" or NA, if there is something like that in Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17498) StringIndexer.setHandleInvalid should have another option 'new'

2016-09-12 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15483514#comment-15483514
 ] 

Vincent commented on SPARK-17498:
-

Here is what we cc [~qhuang] see about this issue
and correct me if any misunderstanding [~miro.balaz]
val df= sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), 
(5, "c")), 2)
val indexer = new StringIndexer().fit(df)
when transform is call on a new dataframe with unseen label, 
say, 
val dfNew = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "e"), (3, "d")), 2)
indexer.transform(dfNew)
should return 3, 4 for label "d", "e" instead of skipping/deleting the new 
incoming labels, and IndexToString  should return NaN for these added indexes 
3, 4

[~yanboliang] [~srowen] [~josephkb] what do you think of this issue? Currently 
it can either skip the unseen label or throw an error for such case, do you 
think we should add such 'new' way of handler for StringIndexer?

> StringIndexer.setHandleInvalid should have another option 'new'
> ---
>
> Key: SPARK-17498
> URL: https://issues.apache.org/jira/browse/SPARK-17498
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Miroslav Balaz
>
> That will map an unseen label to the maximum known label + 1, and IndexToString would 
> map it back to "" or NA if there is something like that in Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6680) Be able to specify IP for spark-shell (spark driver), blocker for Docker integration

2016-09-09 Thread YSMAL Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15476947#comment-15476947
 ] 

YSMAL Vincent commented on SPARK-6680:
--

Hi, with Docker you can get rid of this hostname alias by using the 
{code}--hostname spark-master{code} option when starting the container.
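
As a side note on pinning the driver address itself, here is a minimal Scala sketch of 
setting the address the driver advertises via configuration (the IP below is only a 
placeholder; whether this is sufficient for the container setup described in this issue 
is a separate question):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Advertise an explicit, routable address for the driver instead of the
// container-internal hostname alias; 172.17.0.2 is only a placeholder.
val conf = new SparkConf()
  .setAppName("shell-in-docker")
  .set("spark.driver.host", "172.17.0.2")

val sc = new SparkContext(conf)
{code}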


> Be able to specify IP for spark-shell (spark driver), blocker for Docker 
> integration
> ---
>
> Key: SPARK-6680
> URL: https://issues.apache.org/jira/browse/SPARK-6680
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 1.3.0
> Environment: Docker.
>Reporter: Egor Pakhomov
>Priority: Minor
>  Labels: core, deploy, docker
>
> Suppose I have 3 Docker containers: spark_master, spark_worker and 
> spark_shell. In Docker, the public IP of a container gets an alias like 
> "fgsdfg454534" that is only visible inside that container. When Spark uses it for 
> communication, the other containers receive this alias and don't know what to do 
> with it. That's why I used SPARK_LOCAL_IP for the master and worker. But it 
> doesn't work for the Spark driver (tested with spark-shell; I haven't tried other 
> types of drivers). The Spark driver advertises the "fgsdfg454534" alias for itself, 
> and then nobody can address it. I've worked around it in 
> https://github.com/epahomov/docker-spark, but it would be better if it were 
> solved at the Spark code level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-08-29 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15445873#comment-15445873
 ] 

Vincent commented on SPARK-17219:
-

Cool. I will refine the patch. Thanks [~srowen] :)

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attached titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column has a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestion would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has null even 
> though the fit data did not. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-08-29 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15445858#comment-15445858
 ] 

Vincent commented on SPARK-17219:
-

Yes, the discretizer can do it easily, especially if only QuantileDiscretizer is in 
question. But the same changes should also be applied to other discretizers in the 
future, such as the MDLPDiscretizer that Barry mentioned.

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attache titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column as a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected. It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestions would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has null even 
> though the fit data did not. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-08-29 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15445808#comment-15445808
 ] 

Vincent commented on SPARK-17219:
-

Then we have to shift this work to the user, who needs to filter out NaN values 
if they somehow end up with NaN in their splits from functions such as quantile. 
We would add a check before setSplits on Bucketizer, throwing an error if a 
NaN split is found, while putting NaN values from the input data into an extra bucket in 
Bucketizer.transform. Sounds good?
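
A rough sketch in plain Scala of the two pieces just described (hypothetical helpers, 
not the actual Bucketizer code):
{code}
// Assumes splits start with Double.NegativeInfinity and end with Double.PositiveInfinity.
def checkSplits(splits: Array[Double]): Unit =
  require(!splits.exists(_.isNaN),
    s"NaN value found in splits [${splits.mkString(", ")}]; " +
      "remove or impute NaN cut points before calling setSplits")

def bucketIndex(value: Double, splits: Array[Double]): Int =
  if (value.isNaN) {
    splits.length - 1                          // extra bucket reserved for NaN inputs
  } else {
    val i = splits.indexWhere(value < _)       // first split strictly above the value
    if (i < 0) splits.length - 2 else i - 1    // regular buckets are 0 .. splits.length - 2
  }
{code}
With splits like [-Infinity, 15.0, ..., Infinity], every finite value keeps its usual 
bucket and only NaN inputs land in the trailing extra bucket.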

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attache titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column as a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected. It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestions would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has null even 
> though the fit data did not. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-08-29 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15445768#comment-15445768
 ] 

Vincent commented on SPARK-17219:
-

[~srowen] Hi all, per the discussion, I thought we were going to handle NaN splits, 
i.e. allow NaN to exist in the splits rather than rejecting them. Although NaN 
splits don't really make sense, one argument for processing them is that it makes 
the Bucketizer more robust. If instead we reject NaN splits, it would be better to 
give the user an error message saying that the error is due to a NaN value found in 
the splits. In terms of complexity, rejecting or accepting NaN splits costs 
the same.
What do you think?

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attache titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column as a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected. It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestions would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has null even 
> though the fit data did not. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-08-25 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436873#comment-15436873
 ] 

Vincent commented on SPARK-17219:
-

Okay, thanks. 
So that means we actually won't offer users any option: we will always put NaN in an 
extra bucket. I think we should document it, since users might otherwise find it 
confusing; they should be informed that they need to handle the extra NaN bucket.

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attache titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column as a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected. It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestions would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has null even 
> though the fit data did not. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-08-25 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436870#comment-15436870
 ] 

Vincent commented on SPARK-17219:
-

Yes, if we want to make this scenario more general to all Bucketizer cases, I 
guess you should change the title. Currently the Bucketizer's usage is limited, so 
it won't have a big impact on the current code base if we make some changes to the 
API.

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attache titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column as a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected. It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestions would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has null even 
> though the fit data did not. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-08-25 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436856#comment-15436856
 ] 

Vincent commented on SPARK-17219:
-

[~srowen] Sorry Owen, by 'keep it to one behavior', do you mean we just 
add one extra bucket whenever Bucketizer finds NaN in a cutpoints vector, and 
let users handle NaN removal/error handling before feeding data to the 
Bucketizer?

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attache titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column as a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected. It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestions would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has null even 
> though the fit data did not. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-08-25 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436804#comment-15436804
 ] 

Vincent commented on SPARK-17219:
-

If so, we would have to add this option within Bucketizer, right?

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attache titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column as a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected. It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestions would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has null even 
> though the fit data did not. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-08-25 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436553#comment-15436553
 ] 

Vincent commented on SPARK-17219:
-

I can work on this issue if no one else is on it  :)

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attache titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column as a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected. It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestions would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has null even 
> though the fit data did not. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-08-24 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436143#comment-15436143
 ] 

Vincent commented on SPARK-17219:
-

For this scenario, we can add a new parameter to QuantileDiscretizer, a 
nullStrategy param as Barry mentioned. R actually supports this kind of option 
with an "na.rm" flag that lets the user either remove NaN elements before 
computing quantiles or throw an error (the default). So I think it's a nice thing to 
have in Spark too.
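
Until such a parameter exists, the "remove" strategy can be approximated with the 
current API by filtering NaN rows before fitting; a sketch, assuming a DataFrame df 
with a double column "age":
{code}
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.functions.col

// Approximate an na.rm-style "remove" strategy: drop NaN rows before the splits
// are computed, so approxQuantile never sees them ("age" is a placeholder column).
val nonNan = df.filter(!col("age").isNaN)

val model = new QuantileDiscretizer()
  .setInputCol("age")
  .setOutputCol("age_bin")
  .setNumBuckets(10)
  .fit(nonNan)

// model.transform(df) would still encounter the NaN rows; how those get bucketed
// is exactly the open question in this thread.
{code}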

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attache titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column as a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected. It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestions would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has null even 
> though the fit data did not. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-08-24 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436136#comment-15436136
 ] 

Vincent commented on SPARK-17219:
-

For cases where only null and non-null buckets are needed, I guess we don't need 
to call QuantileDiscretizer to do that.

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attache titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column as a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected. It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestions would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has null even 
> though the fit data did not. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17055) add labelKFold to CrossValidator

2016-08-23 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432463#comment-15432463
 ] 

Vincent edited comment on SPARK-17055 at 8/23/16 10:34 AM:
---

Sorry for the late reply. Yes, I just learned they intend to rename it to GroupKFold, 
or something like that. Personally I think it's fine to keep it the way it is, 
though it can still be a bit confusing when someone first uses it before 
understanding the idea behind it.

As for applications, take face recognition as an example. The features are, say, 
eyes, nose, lips, etc., and the training data are collected from a number of different 
people. This method can create subject-independent folds, so we can train the 
model on features from one group of people and use the data from the remaining 
people for validation (see the sketch below). It will improve the generalization 
ability of the model and help avoid over-fitting.

It's a useful method, available in sklearn, and caret is currently working on adding 
this feature.
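
For illustration, a rough sketch in plain Scala of the fold-assignment idea mentioned 
above (hypothetical, not the actual CrossValidator code): every row that shares a 
label/subject ends up on the same side of each split.
{code}
// Hypothetical sketch: assign each distinct label (e.g. a subject id) to one of k
// folds, so no label appears in both the training and the validation part of a split.
val k = 3
val labels = Seq("alice", "bob", "carol", "dave", "erin")
val foldOfLabel = labels.zipWithIndex.map { case (l, i) => l -> (i % k) }.toMap

val data = Seq(("alice", 0.1), ("bob", 0.7), ("alice", 0.3), ("erin", 0.9))
val folds = data.groupBy { case (label, _) => foldOfLabel(label) }
// For fold f: validation = folds(f), training = all other folds; all of "alice"'s
// rows always stay together. A real implementation would also balance fold sizes.
{code}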


was (Author: vincexie):
sorry for late reply. Yes, I just knew they intend to rename it to GroupKFold, 
or something like that. Though personally I think it's fine to keep the way it 
is, though, it could be still kinda confusing when someone first uses it before 
understanding the idea behinds it.

As for application, take face recognition as an example. features are, say, 
eyes, nose, lips etc. training data are obtained from a number of different 
person, this method can create subject independent folds, so we can train the 
model with features from certain group of people and take the data from the 
rest of group of people for validation. it will enhance the generic ability of 
the model and avoid over-fitting.

it's a useful method, seen in sklearn, and currently caret is on the way trying 
to add this feature.

> add labelKFold to CrossValidator
> 
>
> Key: SPARK-17055
> URL: https://issues.apache.org/jira/browse/SPARK-17055
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincent
>Priority: Minor
>
> Current CrossValidator only supports k-fold, which randomly divides all the 
> samples in k groups of samples. But in cases when data is gathered from 
> different subjects and we want to avoid over-fitting, we want to hold out 
> samples with certain labels from training data and put them into validation 
> fold, i.e. we want to ensure that the same label is not in both testing and 
> training sets.
> Mainstream packages like Sklearn already supports such cross validation 
> method. 
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17055) add labelKFold to CrossValidator

2016-08-23 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432463#comment-15432463
 ] 

Vincent edited comment on SPARK-17055 at 8/23/16 9:18 AM:
--

sorry for late reply. Yes, I just knew they intend to rename it to GroupKFold, 
or something like that. Though personally I think it's fine to keep the way it 
is, though, it could be still kinda confusing when someone first uses it before 
understanding the idea behinds it.

As for application, take face recognition as an example. features are, say, 
eyes, nose, lips etc. training data are obtained from a number of different 
person, this method can create subject independent folds, so we can train the 
model with features from certain group of people and take the data from the 
rest of group of people for validation. it will enhance the generic ability of 
the model and avoid over-fitting.

it's a useful method, seen in sklearn, and currently caret is on the way trying 
to add this feature.


was (Author: vincexie):
sorry for late reply. Yes, I just knew they intend to rename it to GroupKFold, 
or something like that. Though personally I think it's fine to keep the way it 
is, though, it could be still kinda confusing when someone first uses it before 
understanding the idea behinds it.

As for application, take face recognition as an example. features are, say, 
eyes, nose, lips etc. training data are obtained from a number of different 
person, this method can create subject independent folds, so we can train the 
model with features from certain group of people and take the data from the 
rest of group of people for validation. it will enhance the generic ability of 
the model and avoid over-fitting.

it's a useful method, seen in sklearn, and currently caret is on the way add 
this feature.

> add labelKFold to CrossValidator
> 
>
> Key: SPARK-17055
> URL: https://issues.apache.org/jira/browse/SPARK-17055
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincent
>Priority: Minor
>
> Current CrossValidator only supports k-fold, which randomly divides all the 
> samples in k groups of samples. But in cases when data is gathered from 
> different subjects and we want to avoid over-fitting, we want to hold out 
> samples with certain labels from training data and put them into validation 
> fold, i.e. we want to ensure that the same label is not in both testing and 
> training sets.
> Mainstream packages like Sklearn already supports such cross validation 
> method. 
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator

2016-08-23 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432463#comment-15432463
 ] 

Vincent commented on SPARK-17055:
-

sorry for late reply. Yes, I just knew they intend to rename it to GroupKFold, 
or something like that. Though personally I think it's fine to keep the way it 
is, though, it could be still kinda confusing when someone first uses it before 
understanding the idea behinds it.

As for application, take face recognition as an example. features are, say, 
eyes, nose, lips etc. training data are obtained from a number of different 
person, this method can create subject independent folds, so we can train the 
model with features from certain group of people and take the data from the 
rest of group of people for validation. it will enhance the generic ability of 
the model and avoid over-fitting.

it's a useful method, seen in sklearn, and currently caret is on the way add 
this feature.

> add labelKFold to CrossValidator
> 
>
> Key: SPARK-17055
> URL: https://issues.apache.org/jira/browse/SPARK-17055
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincent
>Priority: Minor
>
> Current CrossValidator only supports k-fold, which randomly divides all the 
> samples in k groups of samples. But in cases when data is gathered from 
> different subjects and we want to avoid over-fitting, we want to hold out 
> samples with certain labels from training data and put them into validation 
> fold, i.e. we want to ensure that the same label is not in both testing and 
> training sets.
> Mainstream packages like Sklearn already supports such cross validation 
> method. 
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17055) add labelKFold to CrossValidator

2016-08-22 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430381#comment-15430381
 ] 

Vincent edited comment on SPARK-17055 at 8/22/16 9:14 AM:
--

Well, a better model will have better CV performance on validation data with 
unseen labels, so the final selected model will be relatively better at 
predicting samples with unseen categories/labels in real-world cases.


was (Author: vincexie):
well, a better model will have a better cv performance on data with unseen 
labels, so the final selected model will have a relatively better capability on 
predicting samples with unseen categories/labels in real case.

> add labelKFold to CrossValidator
> 
>
> Key: SPARK-17055
> URL: https://issues.apache.org/jira/browse/SPARK-17055
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincent
>Priority: Minor
>
> Current CrossValidator only supports k-fold, which randomly divides all the 
> samples in k groups of samples. But in cases when data is gathered from 
> different subjects and we want to avoid over-fitting, we want to hold out 
> samples with certain labels from training data and put them into validation 
> fold, i.e. we want to ensure that the same label is not in both testing and 
> training sets.
> Mainstream packages like Sklearn already supports such cross validation 
> method. 
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator

2016-08-22 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430381#comment-15430381
 ] 

Vincent commented on SPARK-17055:
-

well, a better model will have a better cv performance on data with unseen 
labels, so the final selected model will have a relatively better capability on 
predicting samples with unseen categories/labels in real case.

> add labelKFold to CrossValidator
> 
>
> Key: SPARK-17055
> URL: https://issues.apache.org/jira/browse/SPARK-17055
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincent
>Priority: Minor
>
> Current CrossValidator only supports k-fold, which randomly divides all the 
> samples in k groups of samples. But in cases when data is gathered from 
> different subjects and we want to avoid over-fitting, we want to hold out 
> samples with certain labels from training data and put them into validation 
> fold, i.e. we want to ensure that the same label is not in both testing and 
> training sets.
> Mainstream packages like Sklearn already supports such cross validation 
> method. 
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data

2016-08-18 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15426276#comment-15426276
 ] 

Vincent commented on SPARK-17086:
-

Agree! [~srowen]

> QuantileDiscretizer throws InvalidArgumentException (parameter splits given 
> invalid value) on valid data
> 
>
> Key: SPARK-17086
> URL: https://issues.apache.org/jira/browse/SPARK-17086
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
>
> I discovered this bug when working with a build from the master branch (which 
> I believe is 2.1.0). This used to work fine when running spark 1.6.2.
> I have a dataframe with an "intData" column that has values like 
> {code}
> 1 3 2 1 1 2 3 2 2 2 1 3
> {code}
> I have a stage in my pipeline that uses the QuantileDiscretizer to produce 
> equal weight splits like this
> {code}
> new QuantileDiscretizer()
> .setInputCol("intData")
> .setOutputCol("intData_bin")
> .setNumBuckets(10)
> .fit(df)
> {code}
> But when that gets run it (incorrectly) throws this error:
> {code}
> parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, 
> 3.0, Infinity]
> {code}
> I don't think that there should be duplicate splits generated should there be?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data

2016-08-18 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15426261#comment-15426261
 ] 

Vincent edited comment on SPARK-17086 at 8/18/16 10:57 AM:
---

[~srowen] in the example you just took, yes, it will return [-Infinity, 1.0, 
2.0, 3.0, Infinity]. that's also the result we saw on spark-1.6.2


was (Author: vincexie):
[~srowen] in the example you just took, yes, it will return [-Infinity, 1.0, 
2.0, 3.0, Infinity]. that's also result we saw on spark-1.6.2

> QuantileDiscretizer throws InvalidArgumentException (parameter splits given 
> invalid value) on valid data
> 
>
> Key: SPARK-17086
> URL: https://issues.apache.org/jira/browse/SPARK-17086
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
>
> I discovered this bug when working with a build from the master branch (which 
> I believe is 2.1.0). This used to work fine when running spark 1.6.2.
> I have a dataframe with an "intData" column that has values like 
> {code}
> 1 3 2 1 1 2 3 2 2 2 1 3
> {code}
> I have a stage in my pipeline that uses the QuantileDiscretizer to produce 
> equal weight splits like this
> {code}
> new QuantileDiscretizer()
> .setInputCol("intData")
> .setOutputCol("intData_bin")
> .setNumBuckets(10)
> .fit(df)
> {code}
> But when that gets run it (incorrectly) throws this error:
> {code}
> parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, 
> 3.0, Infinity]
> {code}
> I don't think that there should be duplicate splits generated should there be?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data

2016-08-18 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15426261#comment-15426261
 ] 

Vincent commented on SPARK-17086:
-

[~srowen] in the example you just took, yes, it will return [-Infinity, 1.0, 
2.0, 3.0, Infinity]. that's also result we saw on spark-1.6.2

> QuantileDiscretizer throws InvalidArgumentException (parameter splits given 
> invalid value) on valid data
> 
>
> Key: SPARK-17086
> URL: https://issues.apache.org/jira/browse/SPARK-17086
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
>
> I discovered this bug when working with a build from the master branch (which 
> I believe is 2.1.0). This used to work fine when running spark 1.6.2.
> I have a dataframe with an "intData" column that has values like 
> {code}
> 1 3 2 1 1 2 3 2 2 2 1 3
> {code}
> I have a stage in my pipeline that uses the QuantileDiscretizer to produce 
> equal weight splits like this
> {code}
> new QuantileDiscretizer()
> .setInputCol("intData")
> .setOutputCol("intData_bin")
> .setNumBuckets(10)
> .fit(df)
> {code}
> But when that gets run it (incorrectly) throws this error:
> {code}
> parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, 
> 3.0, Infinity]
> {code}
> I don't think that there should be duplicate splits generated should there be?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data

2016-08-18 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15426246#comment-15426246
 ] 

Vincent commented on SPARK-17086:
-

[~yanboliang] Yes, that case was actually handled in Spark 1.6.2.
[~srowen] Currently it throws an IllegalArgumentException in this case:
java.lang.IllegalArgumentException: quantileDiscretizer_07696c9dca6c parameter 
splits given invalid value

So what I'm doing now is to add a check before calling approxQuantile: 
if the distinct input data count is less than numBuckets, we simply return 
an array with the distinct elements as splits; for cases where the number of 
distinct input values is greater than or equal to numBuckets, we go through 
approxQuantile as it does now to generate the splits set. 

For example, with the input data shown in this issue, we would output the splits: 
[-Infinity, 1.0, 2.0, 3.0, Infinity]

What do you think? [~yanboliang] [~srowen]
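
A condensed sketch of that guard in Scala (illustrative only, not the exact patch), 
assuming a DataFrame df whose input column is of double type:
{code}
import org.apache.spark.sql.DataFrame

// Sketch of the proposed pre-check: with fewer distinct values than requested
// buckets, build the splits from the distinct values instead of approxQuantile.
def candidateSplits(df: DataFrame, inputCol: String, numBuckets: Int): Array[Double] = {
  val distinctCol = df.select(inputCol).distinct()
  if (distinctCol.count() < numBuckets) {
    val interior = distinctCol.collect().map(_.getDouble(0)).sorted
    Double.NegativeInfinity +: interior :+ Double.PositiveInfinity
  } else {
    val probes = (1 until numBuckets).map(_.toDouble / numBuckets).toArray
    val quantiles = df.stat.approxQuantile(inputCol, probes, 0.001).distinct
    Double.NegativeInfinity +: quantiles :+ Double.PositiveInfinity
  }
}
// For the data in this issue (1 3 2 1 1 2 3 2 2 2 1 3) and numBuckets = 10 this
// yields [-Infinity, 1.0, 2.0, 3.0, Infinity].
{code}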

> QuantileDiscretizer throws InvalidArgumentException (parameter splits given 
> invalid value) on valid data
> 
>
> Key: SPARK-17086
> URL: https://issues.apache.org/jira/browse/SPARK-17086
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
>
> I discovered this bug when working with a build from the master branch (which 
> I believe is 2.1.0). This used to work fine when running spark 1.6.2.
> I have a dataframe with an "intData" column that has values like 
> {code}
> 1 3 2 1 1 2 3 2 2 2 1 3
> {code}
> I have a stage in my pipeline that uses the QuantileDiscretizer to produce 
> equal weight splits like this
> {code}
> new QuantileDiscretizer()
> .setInputCol("intData")
> .setOutputCol("intData_bin")
> .setNumBuckets(10)
> .fit(df)
> {code}
> But when that gets run it (incorrectly) throws this error:
> {code}
> parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, 
> 3.0, Infinity]
> {code}
> I don't think that there should be duplicate splits generated should there be?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data

2016-08-17 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15424076#comment-15424076
 ] 

Vincent commented on SPARK-17086:
-

Confirmed the issue doesn't exist on Spark 1.6.2.
I will work on this issue.

> QuantileDiscretizer throws InvalidArgumentException (parameter splits given 
> invalid value) on valid data
> 
>
> Key: SPARK-17086
> URL: https://issues.apache.org/jira/browse/SPARK-17086
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
>
> I discovered this bug when working with a build from the master branch (which 
> I believe is 2.1.0). This used to work fine when running spark 1.6.2.
> I have a dataframe with an "intData" column that has values like 
> {code}
> 1 3 2 1 1 2 3 2 2 2 1 3
> {code}
> I have a stage in my pipeline that uses the QuantileDiscretizer to produce 
> equal weight splits like this
> {code}
> new QuantileDiscretizer()
> .setInputCol("intData")
> .setOutputCol("intData_bin")
> .setNumBuckets(10)
> .fit(df)
> {code}
> But when that gets run it (incorrectly) throws this error:
> {code}
> parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, 
> 3.0, Infinity]
> {code}
> I don't think that there should be duplicate splits generated should there be?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator

2016-08-15 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15420867#comment-15420867
 ] 

Vincent commented on SPARK-17055:
-

One of the most common tasks is to fit a "model" to a set of training data so 
as to be able to make reliable predictions on data it has not been trained on. 
labelKFold can be used to test the model's ability to generalize by evaluating 
its performance on a class of data not used for training, which is assumed to 
approximate the typical unseen data that the model will encounter.

> add labelKFold to CrossValidator
> 
>
> Key: SPARK-17055
> URL: https://issues.apache.org/jira/browse/SPARK-17055
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincent
>Priority: Minor
>
> Current CrossValidator only supports k-fold, which randomly divides all the 
> samples in k groups of samples. But in cases when data is gathered from 
> different subjects and we want to avoid over-fitting, we want to hold out 
> samples with certain labels from training data and put them into validation 
> fold, i.e. we want to ensure that the same label is not in both testing and 
> training sets.
> Mainstream packages like Sklearn already supports such cross validation 
> method. 
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17055) add labelKFold to CrossValidator

2016-08-14 Thread Vincent (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent updated SPARK-17055:

Description: 
Current CrossValidator only supports k-fold, which randomly divides all the 
samples in k groups of samples. But in cases when data is gathered from 
different subjects and we want to avoid over-fitting, we want to hold out 
samples with certain labels from training data and put them into validation 
fold, i.e. we want to ensure that the same label is not in both testing and 
training sets.

Mainstream packages like Sklearn already supports such cross validation method. 
(http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)

  was:
Current CrossValidator only supports k-fold, which randomly divides all the 
samples in k groups of samples. But in cases when data is gathered from 
different subjects and we want to avoid over-fitting, we want to hold out 
samples with certain labels from training data and put them into validation 
fold, i.e. we want to ensure that the same label is not in both testing and 
training sets.

Mainstream package like Sklearn already supports such cross validation method. 


> add labelKFold to CrossValidator
> 
>
> Key: SPARK-17055
> URL: https://issues.apache.org/jira/browse/SPARK-17055
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincent
>Priority: Minor
>
> Current CrossValidator only supports k-fold, which randomly divides all the 
> samples in k groups of samples. But in cases when data is gathered from 
> different subjects and we want to avoid over-fitting, we want to hold out 
> samples with certain labels from training data and put them into validation 
> fold, i.e. we want to ensure that the same label is not in both testing and 
> training sets.
> Mainstream packages like Sklearn already supports such cross validation 
> method. 
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17055) add labelKFold to CrossValidator

2016-08-14 Thread Vincent (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent updated SPARK-17055:

Affects Version/s: (was: 2.0.0)

> add labelKFold to CrossValidator
> 
>
> Key: SPARK-17055
> URL: https://issues.apache.org/jira/browse/SPARK-17055
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincent
>Priority: Minor
>
> Current CrossValidator only supports k-fold, which randomly divides all the 
> samples in k groups of samples. But in cases when data is gathered from 
> different subjects and we want to avoid over-fitting, we want to hold out 
> samples with certain labels from training data and put them into validation 
> fold, i.e. we want to ensure that the same label is not in both testing and 
> training sets.
> Mainstream package like Sklearn already supports such cross validation 
> method. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17055) add labelKFold to CrossValidator

2016-08-14 Thread Vincent (JIRA)
Vincent created SPARK-17055:
---

 Summary: add labelKFold to CrossValidator
 Key: SPARK-17055
 URL: https://issues.apache.org/jira/browse/SPARK-17055
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 2.0.0
Reporter: Vincent
Priority: Minor


Current CrossValidator only supports k-fold, which randomly divides all the 
samples in k groups of samples. But in cases when data is gathered from 
different subjects and we want to avoid over-fitting, we want to hold out 
samples with certain labels from training data and put them into validation 
fold, i.e. we want to ensure that the same label is not in both testing and 
training sets.

Mainstream package like Sklearn already supports such cross validation method. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org