[jira] [Updated] (SPARK-26166) CrossValidator.fit() bug, training and validation dataset may overlap

2018-11-29 Thread Xinyong Tian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinyong Tian updated SPARK-26166:
-
Description: 
In pyspark.ml.tuning.CrossValidator.fit(), after the random column is added:

df = dataset.select("*", rand(seed).alias(randCol))

the code should add:

df = df.checkpoint()

If df is not checkpointed, it is recomputed each time the training and 
validation DataFrames are created. The row order of df, which rand(seed) 
depends on, is not deterministic, so the random column value for a given row 
can differ between recomputations even with a fixed seed. Note that 
checkpoint() cannot be replaced with cache(): when a node fails, the cached 
table is recomputed, so the random numbers can differ.

This is especially likely to be a problem when the input 'dataset' DataFrame 
results from a query with a 'where' clause; see below.

[https://dzone.com/articles/non-deterministic-order-for-select-with-limit]
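The hazard above can be illustrated without Spark. Below is a minimal pure-Python sketch (a hypothetical stand-in, not Spark's actual rand() implementation): a seeded generator hands out values by row position, so if the rows arrive in a different order when the input is recomputed, the same logical row receives a different "random" value, and a fold split on that column can leak rows between train and validation.

```python
import random

def add_rand_column(rows, seed):
    # Mimics assigning a seeded random column by position:
    # the i-th row in iteration order gets the i-th draw.
    rng = random.Random(seed)
    return {row: rng.random() for row in rows}

rows = ["a", "b", "c", "d"]
first = add_rand_column(rows, seed=42)

# Same seed, but the source is re-read in a different physical order
# (as can happen when an unpersisted DataFrame is recomputed).
reordered = ["c", "a", "d", "b"]
second = add_rand_column(reordered, seed=42)

# Row "a" gets a different value on recomputation despite the fixed seed.
print(first["a"], second["a"], first["a"] != second["a"])
```

Checkpointing materializes the column once, so every later read sees the same values regardless of recomputation order.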

 

 

 

  was:
In pyspark.ml.tuning.CrossValidator.fit(), after the random column is added:

df = dataset.select("*", rand(seed).alias(randCol))

the code should add:

df = df.checkpoint()

If df is not checkpointed, it is recomputed each time the training and 
validation DataFrames are created. The row order of df, which rand(seed) 
depends on, is not deterministic, so the random column value for a given row 
can differ between recomputations even with a fixed seed. Note that 
checkpoint() cannot be replaced with cache(): when a node fails, a cached 
table might still be recomputed, so the random numbers could differ.

This is especially likely to be a problem when the input 'dataset' DataFrame 
results from a query with a 'where' clause; see below.

[https://dzone.com/articles/non-deterministic-order-for-select-with-limit]

 

 

 


> CrossValidator.fit() bug, training and validation dataset may overlap
> 
>
> Key: SPARK-26166
> URL: https://issues.apache.org/jira/browse/SPARK-26166
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Xinyong Tian
>Priority: Major
>
> In pyspark.ml.tuning.CrossValidator.fit(), after the random column is added:
> df = dataset.select("*", rand(seed).alias(randCol))
> the code should add:
> df = df.checkpoint()
> If df is not checkpointed, it is recomputed each time the training and 
> validation DataFrames are created. The row order of df, which rand(seed) 
> depends on, is not deterministic, so the random column value for a given row 
> can differ between recomputations even with a fixed seed. Note that 
> checkpoint() cannot be replaced with cache(): when a node fails, the cached 
> table is recomputed, so the random numbers can differ.
> This is especially likely to be a problem when the input 'dataset' DataFrame 
> results from a query with a 'where' clause; see below.
> [https://dzone.com/articles/non-deterministic-order-for-select-with-limit]
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26166) CrossValidator.fit() bug, training and validation dataset may overlap

2018-11-29 Thread Xinyong Tian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinyong Tian updated SPARK-26166:
-
Description: 
In pyspark.ml.tuning.CrossValidator.fit(), after the random column is added:

df = dataset.select("*", rand(seed).alias(randCol))

the code should add:

df = df.checkpoint()

If df is not checkpointed, it is recomputed each time the training and 
validation DataFrames are created. The row order of df, which rand(seed) 
depends on, is not deterministic, so the random column value for a given row 
can differ between recomputations even with a fixed seed. Note that 
checkpoint() cannot be replaced with cache(): when a node fails, a cached 
table might still be recomputed, so the random numbers could differ.

This is especially likely to be a problem when the input 'dataset' DataFrame 
results from a query with a 'where' clause; see below.

[https://dzone.com/articles/non-deterministic-order-for-select-with-limit]

 

 

 

  was:
In pyspark.ml.tuning.CrossValidator.fit(), after the random column is added:

df = dataset.select("*", rand(seed).alias(randCol))

the code should add:

df.cache()

If df is not cached, it is re-evaluated each time the training and validation 
DataFrames are created. The row order of df, which rand(seed) depends on, is 
not deterministic, so the random column value for a given row can differ 
between recomputations even with a fixed seed.

This is especially likely to be a problem when the input 'dataset' DataFrame 
results from a query with a 'where' clause; see below.

https://dzone.com/articles/non-deterministic-order-for-select-with-limit

 


> CrossValidator.fit() bug, training and validation dataset may overlap
> 
>
> Key: SPARK-26166
> URL: https://issues.apache.org/jira/browse/SPARK-26166
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Xinyong Tian
>Priority: Major
>
> In pyspark.ml.tuning.CrossValidator.fit(), after the random column is added:
> df = dataset.select("*", rand(seed).alias(randCol))
> the code should add:
> df = df.checkpoint()
> If df is not checkpointed, it is recomputed each time the training and 
> validation DataFrames are created. The row order of df, which rand(seed) 
> depends on, is not deterministic, so the random column value for a given row 
> can differ between recomputations even with a fixed seed. Note that 
> checkpoint() cannot be replaced with cache(): when a node fails, a cached 
> table might still be recomputed, so the random numbers could differ.
> This is especially likely to be a problem when the input 'dataset' DataFrame 
> results from a query with a 'where' clause; see below.
> [https://dzone.com/articles/non-deterministic-order-for-select-with-limit]
>  
>  
>  






[jira] [Created] (SPARK-26166) CrossValidator.fit() bug, training and validation dataset may overlap

2018-11-25 Thread Xinyong Tian (JIRA)
Xinyong Tian created SPARK-26166:


 Summary: CrossValidator.fit() bug, training and validation dataset 
may overlap
 Key: SPARK-26166
 URL: https://issues.apache.org/jira/browse/SPARK-26166
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.3.0
Reporter: Xinyong Tian


In pyspark.ml.tuning.CrossValidator.fit(), after the random column is added:

df = dataset.select("*", rand(seed).alias(randCol))

the code should add:

df.cache()

If df is not cached, it is re-evaluated each time the training and validation 
DataFrames are created. The row order of df, which rand(seed) depends on, is 
not deterministic, so the random column value for a given row can differ 
between recomputations even with a fixed seed.

This is especially likely to be a problem when the input 'dataset' DataFrame 
results from a query with a 'where' clause; see below.

https://dzone.com/articles/non-deterministic-order-for-select-with-limit

 






[jira] [Created] (SPARK-25441) calculate term frequency in CountVectorizer()

2018-09-15 Thread Xinyong Tian (JIRA)
Xinyong Tian created SPARK-25441:


 Summary: calculate term frequency in CountVectorizer()
 Key: SPARK-25441
 URL: https://issues.apache.org/jira/browse/SPARK-25441
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.3.1
Reporter: Xinyong Tian


Currently CountVectorizer() cannot output TF (term frequency). I hope there 
will be such an option.

TF is defined as at https://en.m.wikipedia.org/wiki/Tf–idf

 

Example:

>>> df = spark.createDataFrame(
...     [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])],
...     ["label", "raw"])
>>> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
>>> model = cv.fit(df)
>>> model.transform(df).limit(1).show(truncate=False)

label        raw           vectors 

0            [a, b, c]       (3,[0,1,2],[1.0,1.0,1.0])

 

Instead I want

0            [a, b, c]       (3,[0,1,2],[0.33,0.33,0.33])

i.e., each vector divided by its sum (here 3), so the sum of the new vector is 
1 for every row (document).
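The requested normalization can be sketched in plain Python on the sparse (indices, values) pairs shown above. This is a hypothetical helper illustrating the behavior, not an existing CountVectorizer option:

```python
def to_term_frequency(indices, counts):
    # Divide each raw count by the row's total count so the values
    # sum to 1.0: raw counts become relative term frequencies (TF).
    total = float(sum(counts))
    return indices, [c / total for c in counts]

# Rows from the example above: CountVectorizer counts per document.
row0 = to_term_frequency([0, 1, 2], [1.0, 1.0, 1.0])  # doc 0: [a, b, c]
row1 = to_term_frequency([0, 1, 2], [2.0, 2.0, 1.0])  # doc 1: [a, b, b, c, a]

print(row0[1])  # each count divided by 3
print(row1[1])  # [0.4, 0.4, 0.2]
```

In Spark itself, a similar effect could be approximated by post-processing the CountVectorizer output with a UDF that rescales each sparse vector by its sum.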

 






[jira] [Commented] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator

2018-06-06 Thread Xinyong Tian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504202#comment-16504202
 ] 

Xinyong Tian commented on SPARK-24431:
--

I also feel it is reasonable to set the first point as (0, p). In fact, as long 
as it is not (0, 1), the areaUnderPR will be small enough for a model that 
predicts the same p for all examples, so cross-validation will not select such 
a model.

> wrong areaUnderPR calculation in BinaryClassificationEvaluator 
> ---
>
> Key: SPARK-24431
> URL: https://issues.apache.org/jira/browse/SPARK-24431
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Xinyong Tian
>Priority: Major
>
> My problem: I am using CrossValidator(estimator=LogisticRegression(...), ..., 
> evaluator=BinaryClassificationEvaluator(metricName='areaUnderPR')) to select 
> the best model. When the regParam in logistic regression is very high, no 
> variable is selected (no model), i.e., every row's prediction is the same, 
> e.g., equal to the event rate (baseline frequency). But at this point, 
> BinaryClassificationEvaluator sets the areaUnderPR highest. As a result, the 
> best model selected is a no-model.
> The reason is the following. With no model, the precision-recall curve has 
> only two points: at recall = 0, precision should be set to zero, while the 
> software sets it to 1; at recall = 1, precision is the event rate. As a 
> result, the areaUnderPR will be close to 0.5 (my event rate is very low), 
> which is the maximum.
> The solution is to set precision = 0 when recall = 0.






[jira] [Commented] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator

2018-06-05 Thread Xinyong Tian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502794#comment-16502794
 ] 

Xinyong Tian commented on SPARK-24431:
--

I read more about the first point of the PR curve:
https://classeval.wordpress.com/introduction/introduction-to-the-precision-recall-plot/
In the above example, when the predicted probability for each row is set to 
0.01, only one point on the PR curve is defined, i.e., recall = 1, 
precision = 0.01. According to the website, the first part of the PR curve 
should be a horizontal line from the 2nd point (the only point, (1, 0.01), 
here), so the first point should be (0, 0.01). In this way, the no-model 
areaUnderPR = 0.01, instead of ~0.5.

> wrong areaUnderPR calculation in BinaryClassificationEvaluator 
> ---
>
> Key: SPARK-24431
> URL: https://issues.apache.org/jira/browse/SPARK-24431
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Xinyong Tian
>Priority: Major
>
> My problem: I am using CrossValidator(estimator=LogisticRegression(...), ..., 
> evaluator=BinaryClassificationEvaluator(metricName='areaUnderPR')) to select 
> the best model. When the regParam in logistic regression is very high, no 
> variable is selected (no model), i.e., every row's prediction is the same, 
> e.g., equal to the event rate (baseline frequency). But at this point, 
> BinaryClassificationEvaluator sets the areaUnderPR highest. As a result, the 
> best model selected is a no-model.
> The reason is the following. With no model, the precision-recall curve has 
> only two points: at recall = 0, precision should be set to zero, while the 
> software sets it to 1; at recall = 1, precision is the event rate. As a 
> result, the areaUnderPR will be close to 0.5 (my event rate is very low), 
> which is the maximum.
> The solution is to set precision = 0 when recall = 0.






[jira] [Commented] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator

2018-06-05 Thread Xinyong Tian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502761#comment-16502761
 ] 

Xinyong Tian commented on SPARK-24431:
--

Your understanding of event rate is what I meant.
I understand that the max areaUnderPR can be 1. What I meant is that 0.5 is the 
max areaUnderPR for the grid I searched. For example, say there is a dataset 
with event rate 0.01 and the best model's areaUnderPR is 0.30. Without any 
model, we can set the predicted probability for each row to 0.01. This is the 
situation when there is too much regularization. The problem is that, in this 
situation, BinaryClassificationEvaluator will calculate the areaUnderPR as 
0.50 (for the reason, see the original description), which is better than the 
best model. This is not what we want.

> wrong areaUnderPR calculation in BinaryClassificationEvaluator 
> ---
>
> Key: SPARK-24431
> URL: https://issues.apache.org/jira/browse/SPARK-24431
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Xinyong Tian
>Priority: Major
>
> My problem: I am using CrossValidator(estimator=LogisticRegression(...), ..., 
> evaluator=BinaryClassificationEvaluator(metricName='areaUnderPR')) to select 
> the best model. When the regParam in logistic regression is very high, no 
> variable is selected (no model), i.e., every row's prediction is the same, 
> e.g., equal to the event rate (baseline frequency). But at this point, 
> BinaryClassificationEvaluator sets the areaUnderPR highest. As a result, the 
> best model selected is a no-model.
> The reason is the following. With no model, the precision-recall curve has 
> only two points: at recall = 0, precision should be set to zero, while the 
> software sets it to 1; at recall = 1, precision is the event rate. As a 
> result, the areaUnderPR will be close to 0.5 (my event rate is very low), 
> which is the maximum.
> The solution is to set precision = 0 when recall = 0.






[jira] [Created] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator

2018-05-30 Thread Xinyong Tian (JIRA)
Xinyong Tian created SPARK-24431:


 Summary: wrong areaUnderPR calculation in 
BinaryClassificationEvaluator 
 Key: SPARK-24431
 URL: https://issues.apache.org/jira/browse/SPARK-24431
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.2.0
Reporter: Xinyong Tian


My problem: I am using CrossValidator(estimator=LogisticRegression(...), ..., 
evaluator=BinaryClassificationEvaluator(metricName='areaUnderPR')) to select 
the best model. When the regParam in logistic regression is very high, no 
variable is selected (no model), i.e., every row's prediction is the same, 
e.g., equal to the event rate (baseline frequency). But at this point, 
BinaryClassificationEvaluator sets the areaUnderPR highest. As a result, the 
best model selected is a no-model.

The reason is the following. With no model, the precision-recall curve has 
only two points: at recall = 0, precision should be set to zero, while the 
software sets it to 1; at recall = 1, precision is the event rate. As a 
result, the areaUnderPR will be close to 0.5 (my event rate is very low), 
which is the maximum.

The solution is to set precision = 0 when recall = 0.
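The effect of the first PR point can be checked with a small pure-Python trapezoid calculation. This is a sketch of the area computation under stated assumptions, not Spark's actual evaluator code; it compares the reported behavior, the fix proposed here, and a horizontal-extension convention:

```python
def area_under_pr(points):
    # Trapezoidal area under a piecewise-linear precision-recall curve;
    # `points` are (recall, precision) pairs sorted by increasing recall.
    return sum((r1 - r0) * (p0 + p1) / 2.0
               for (r0, p0), (r1, p1) in zip(points, points[1:]))

event_rate = 0.01  # a "no model" predicts this constant for every row

# Reported behavior: precision pinned to 1.0 at recall = 0.
buggy = area_under_pr([(0.0, 1.0), (1.0, event_rate)])

# Fix proposed here: precision = 0 at recall = 0.
fixed = area_under_pr([(0.0, 0.0), (1.0, event_rate)])

# Alternative convention: extend horizontally to (0, event_rate).
flat = area_under_pr([(0.0, event_rate), (1.0, event_rate)])

print(buggy, fixed, flat)  # ~0.505, ~0.005, ~0.01
```

With the first point at (0, 1), the constant-prediction model scores about 0.505, beating any real model on a rare-event dataset; either alternative keeps its score near the event rate.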


