[jira] [Updated] (SPARK-39544) setPredictionCol for OneVsRest does not persist when saving model to disk

2022-06-21 Thread koba (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koba updated SPARK-39544:
-
Issue Type: Bug  (was: Improvement)

> setPredictionCol for OneVsRest does not persist when saving model to disk
> -
>
> Key: SPARK-39544
> URL: https://issues.apache.org/jira/browse/SPARK-39544
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.2.1, 3.3.0
> Environment: Python 3.6, Spark 3.2
> Reporter: koba
> Priority: Major
>
> A custom rawPredictionCol name set on OneVsRest does not persist after saving 
> and loading a trained model: the loaded model reports the default 
> 'rawPrediction'. This becomes an issue when I try to stack multiple 
> One Vs Rest models in a pipeline. Code example below. 
> {code:python}
> from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel
> data_path = "/sample_multiclass_classification_data.txt"
> df = spark.read.format("libsvm").load(data_path)
> lr = LinearSVC(regParam=0.01)
> # set the name of rawPrediction column
> ovr = OneVsRest(classifier=lr, rawPredictionCol='raw_prediction')
> print(ovr.getRawPredictionCol())
> model = ovr.fit(df)
> model_path = 'temp' + "/ovr_model"
> # save and read back in
> model.write().overwrite().save(model_path)
> model2 = OneVsRestModel.load(model_path)
> model2.getRawPredictionCol()
> # Output:
> # raw_prediction
> # 'rawPrediction'
> {code}
>  
>  
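A minimal sketch of how the round trip reported above could be checked for more than one param at a time. It is plain Python with no Spark dependency; the helper name changed_params and the predictionCol entry are hypothetical and only there for illustration, while the rawPredictionCol values are the ones from the output in the report. In practice the two dicts would be filled from the getters on the model before saving and after loading.

```python
from typing import Any, Dict, Tuple


def changed_params(before: Dict[str, Any], after: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {name: (value_before, value_after)} for every param that differs."""
    changed = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changed[name] = (old, new)
    return changed


# Values for rawPredictionCol are taken from the report; predictionCol is a
# hypothetical second param that survives the round trip unchanged.
before_save = {"rawPredictionCol": "raw_prediction", "predictionCol": "prediction"}
after_load = {"rawPredictionCol": "rawPrediction", "predictionCol": "prediction"}

print(changed_params(before_save, after_load))
# {'rawPredictionCol': ('raw_prediction', 'rawPrediction')}
```

An empty result would mean every compared param survived the save and load.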



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39544) setPredictionCol for OneVsRest does not persist when saving model to disk

2022-06-21 Thread koba (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koba updated SPARK-39544:
-
Description: 
The naming of rawPredictionCol in OneVsRest does not persist after saving and 
loading a trained model. This becomes an issue when I try to stack multiple One 
Vs Rest models in a pipeline. Code example below. 
{code:python}
from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel

data_path = "/sample_multiclass_classification_data.txt"
df = spark.read.format("libsvm").load(data_path)
lr = LinearSVC(regParam=0.01)

# set the name of rawPrediction column
ovr = OneVsRest(classifier=lr, rawPredictionCol='raw_prediction')
print(ovr.getRawPredictionCol())

model = ovr.fit(df)
model_path = 'temp' + "/ovr_model"

# save and read back in
model.write().overwrite().save(model_path)
model2 = OneVsRestModel.load(model_path)
model2.getRawPredictionCol()

# Output:
# raw_prediction
# 'rawPrediction'
{code}
 

 

  was:
The naming of rawPredictionCol in OneVsRest does not persist after saving and 
loading a trained model. This becomes an issue when I try to stack multiple One 
Vs Rest models in a pipeline. Code example below. 
{code:java}
from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel

data_path = "/sample_multiclass_classification_data.txt"
df = spark.read.format("libsvm").load(data_path)lr = LinearSVC(regParam=0.01)

# set the name of rawPrediction column
ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')
print(ovr.getRawPredictionCol())model = ovr.fit(df)model_path = 'temp' + 
"/ovr_model"

# save and read back in
model.write().overwrite().save(model_path)
model2 = OneVsRestModel.load(model_path)
model2.getRawPredictionCol()

Output:
raw_prediction
'rawPrediction' {code}
 

 








[jira] [Updated] (SPARK-39544) setPredictionCol for OneVsRest does not persist when saving model to disk

2022-06-21 Thread koba (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koba updated SPARK-39544:
-
Description: 
The naming of rawPredictionCol in OneVsRest does not persist after saving and 
loading a trained model. This becomes an issue when I try to stack multiple One 
Vs Rest models in a pipeline. Code example below. 
{code:python}
from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel

data_path = "/sample_multiclass_classification_data.txt"
df = spark.read.format("libsvm").load(data_path)
lr = LinearSVC(regParam=0.01)

# set the name of rawPrediction column
ovr = OneVsRest(classifier=lr, rawPredictionCol='raw_prediction')
print(ovr.getRawPredictionCol())

model = ovr.fit(df)
model_path = 'temp' + "/ovr_model"

# save and read back in
model.write().overwrite().save(model_path)
model2 = OneVsRestModel.load(model_path)
model2.getRawPredictionCol()

# Output:
# raw_prediction
# 'rawPrediction'
{code}
 

 

  was:
The naming of rawPredictionCol in OneVsRest does not persist after saving and 
loading a trained model. This becomes an issue when I try to stack multiple One 
Vs Rest models in a pipeline. Code example below. 
{code:java}
from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel

data_path = "/sample_multiclass_classification_data.txt"
df = spark.read.format("libsvm").load(data_path)lr = LinearSVC(regParam=0.01)

# set the name of rawPrediction column
ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')
print(ovr.getRawPredictionCol())

model = ovr.fit(df)model_path = 'temp' + "/ovr_model"

# save and read back in
model.write().overwrite().save(model_path)
model2 = OneVsRestModel.load(model_path)
model2.getRawPredictionCol()

Output:
raw_prediction
'rawPrediction' {code}
 

 








[jira] [Updated] (SPARK-39544) setPredictionCol for OneVsRest does not persist when saving model to disk

2022-06-21 Thread koba (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koba updated SPARK-39544:
-
Description: 
The naming of rawPredictionCol in OneVsRest does not persist after saving and 
loading a trained model. This becomes an issue when I try to stack multiple One 
Vs Rest models in a pipeline. Code example below. 
{code:python}
from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel

data_path = "/sample_multiclass_classification_data.txt"
df = spark.read.format("libsvm").load(data_path)
lr = LinearSVC(regParam=0.01)

# set the name of rawPrediction column
ovr = OneVsRest(classifier=lr, rawPredictionCol='raw_prediction')
print(ovr.getRawPredictionCol())

model = ovr.fit(df)
model_path = 'temp' + "/ovr_model"

# save and read back in
model.write().overwrite().save(model_path)
model2 = OneVsRestModel.load(model_path)
model2.getRawPredictionCol()

# Output:
# raw_prediction
# 'rawPrediction'
{code}
 

 

  was:
The naming of rawPredictionCol in OneVsRest does not persist after saving and 
loading a trained model. This becomes an issue when I try to stack multiple One 
Vs Rest models in a pipeline. Code example below. 
{code:java}
from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel
data_path = "/sample_multiclass_classification_data.txt"
df = spark.read.format("libsvm").load(data_path)lr = LinearSVC(regParam=0.01)

# set the name of rawPrediction column
ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')
print(ovr.getRawPredictionCol())model = ovr.fit(df)model_path = 'temp' + 
"/ovr_model"

# save and read back in
model.write().overwrite().save(model_path)
model2 = OneVsRestModel.load(model_path)
model2.getRawPredictionCol()

Output:
raw_prediction
'rawPrediction' {code}
 

 








[jira] [Updated] (SPARK-39544) setPredictionCol for OneVsRest does not persist when saving model to disk

2022-06-21 Thread koba (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koba updated SPARK-39544:
-
Description: 
The naming of rawPredictionCol in OneVsRest does not persist after saving and 
loading a trained model. This becomes an issue when I try to stack multiple One 
Vs Rest models in a pipeline. Code example below. 
{code:python}
from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel

data_path = "/sample_multiclass_classification_data.txt"
df = spark.read.format("libsvm").load(data_path)
lr = LinearSVC(regParam=0.01)

# set the name of rawPrediction column
ovr = OneVsRest(classifier=lr, rawPredictionCol='raw_prediction')
print(ovr.getRawPredictionCol())

model = ovr.fit(df)
model_path = 'temp' + "/ovr_model"

# save and read back in
model.write().overwrite().save(model_path)
model2 = OneVsRestModel.load(model_path)
model2.getRawPredictionCol()

# Output:
# raw_prediction
# 'rawPrediction'
{code}
 

 

  was:
The naming of rawPredictionCol in OneVsRest does not persist after saving and 
loading a trained model. This becomes an issue when I try to stack multiple One 
Vs Rest models in a pipeline. Code example below. 

{{from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel}}
{{data_path = "/sample_multiclass_classification_data.txt"}}
{{{}df = spark.read.format("libsvm").load(data_path){}}}{{{}lr = 
LinearSVC(regParam=0.01){}}}
{{# set the name of rawPrediction column}}
{{ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')}}
{{{}print(ovr.getRawPredictionCol()){}}}{{{}model = 
ovr.fit(df){}}}{{{}model_path = 'temp' + "/ovr_model"{}}}
{{model.write().overwrite().save(model_path)}}
{{model2 = OneVsRestModel.load(model_path)}}
{{model2.getRawPredictionCol()}}

{{Output:}}

{{raw_prediction }}

{{'rawPrediction'}}

 








[jira] [Updated] (SPARK-39544) setPredictionCol for OneVsRest does not persist when saving model to disk

2022-06-21 Thread koba (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koba updated SPARK-39544:
-
Description: 
The naming of rawPredictionCol in OneVsRest does not persist after saving and 
loading a trained model. This becomes an issue when I try to stack multiple One 
Vs Rest models in a pipeline. Code example below. 

{{from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel}}
{{data_path = "/sample_multiclass_classification_data.txt"}}
{{df = spark.read.format("libsvm").load(data_path)}}
{{lr = LinearSVC(regParam=0.01)}}
{{# set the name of rawPrediction column}}
{{ovr = OneVsRest(classifier=lr, rawPredictionCol='raw_prediction')}}
{{print(ovr.getRawPredictionCol())}}
{{model = ovr.fit(df)}}
{{model_path = 'temp' + "/ovr_model"}}
{{model.write().overwrite().save(model_path)}}
{{model2 = OneVsRestModel.load(model_path)}}
{{model2.getRawPredictionCol()}}

{{Output:}}

{{raw_prediction}}

{{'rawPrediction'}}

 

  was:
The naming of rawPredictionCol in OneVsRest does not persist after saving and 
loading a trained model. This becomes an issue when I try to stack multiple One 
Vs Rest models in a pipeline. Code example below. 

{{from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel}}
{{data_path = "/sample_multiclass_classification_data.txt"}}
{{{}df = spark.read.format("libsvm").load(data_path){}}}{{{}lr = 
LinearSVC(regParam=0.01){}}}
{{# set the name of rawPrediction column}}
{{ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')}}
{{{}print(ovr.getRawPredictionCol()){}}}{{{}model = 
ovr.fit(df){}}}{{{}model_path = 'temp' + "/ovr_model"{}}}
{{model.write().overwrite().save(model_path)}}
{{model2 = OneVsRestModel.load(model_path)}}
{{model2.getRawPredictionCol()}}

{{Output:}}

{{raw_prediction }}{{'rawPrediction'}}

 








[jira] [Updated] (SPARK-39544) setPredictionCol for OneVsRest does not persist when saving model to disk

2022-06-21 Thread koba (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koba updated SPARK-39544:
-
Description: 
The naming of rawPredictionCol in OneVsRest does not persist after saving and 
loading a trained model. This becomes an issue when I try to stack multiple One 
Vs Rest models in a pipeline. Code example below. 

{{from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel}}
{{data_path = "/sample_multiclass_classification_data.txt"}}
{{df = spark.read.format("libsvm").load(data_path)}}
{{lr = LinearSVC(regParam=0.01)}}
{{# set the name of rawPrediction column}}
{{ovr = OneVsRest(classifier=lr, rawPredictionCol='raw_prediction')}}
{{print(ovr.getRawPredictionCol())}}
{{model = ovr.fit(df)}}
{{model_path = 'temp' + "/ovr_model"}}
{{model.write().overwrite().save(model_path)}}
{{model2 = OneVsRestModel.load(model_path)}}
{{model2.getRawPredictionCol()}}

{{Output:}}

{{raw_prediction}}

{{'rawPrediction'}}

 

  was:
The naming of `rawPredictionCol` in `OneVsRest` does not persist after saving 
and loading a trained model. This becomes an issue when I try to stack multiple 
One Vs Rest models in a pipeline. Code example below. 

{{```}}

{{from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel}}
{{data_path = "/sample_multiclass_classification_data.txt"}}
{{{}df = spark.read.format("libsvm").load(data_path){}}}{{{}lr = 
LinearSVC(regParam=0.01){}}}
{{# set the name of rawPrediction column}}
{{ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')}}
{{{}print(ovr.getRawPredictionCol()){}}}{{{}model = 
ovr.fit(df){}}}{{{}model_path = 'temp' + "/ovr_model"{}}}
{{model.write().overwrite().save(model_path)}}
{{model2 = OneVsRestModel.load(model_path)}}
{{model2.getRawPredictionCol()}}

{{Output:}}

{{raw_prediction }}{{'rawPrediction'}}

{{```}}








[jira] [Created] (SPARK-39544) setPredictionCol for OneVsRest does not persist when saving model to disk

2022-06-21 Thread koba (Jira)
koba created SPARK-39544:


Summary: setPredictionCol for OneVsRest does not persist when saving model to disk
Key: SPARK-39544
URL: https://issues.apache.org/jira/browse/SPARK-39544
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 3.3.0, 3.2.1, 3.2.0, 3.1.2, 3.1.1, 3.1.0, 3.0.3, 3.0.2, 3.0.1, 3.0.0
Environment: Python 3.6, Spark 3.2
Reporter: koba


The naming of `rawPredictionCol` in `OneVsRest` does not persist after saving 
and loading a trained model. This becomes an issue when I try to stack multiple 
One Vs Rest models in a pipeline. Code example below. 

{{from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel}}
{{data_path = "/sample_multiclass_classification_data.txt"}}
{{df = spark.read.format("libsvm").load(data_path)}}
{{lr = LinearSVC(regParam=0.01)}}
{{# set the name of rawPrediction column}}
{{ovr = OneVsRest(classifier=lr, rawPredictionCol='raw_prediction')}}
{{print(ovr.getRawPredictionCol())}}
{{model = ovr.fit(df)}}
{{model_path = 'temp' + "/ovr_model"}}
{{model.write().overwrite().save(model_path)}}
{{model2 = OneVsRestModel.load(model_path)}}
{{model2.getRawPredictionCol()}}

{{Output:}}

{{raw_prediction}}

{{'rawPrediction'}}






[jira] [Created] (SPARK-29952) Pandas UDFs do not support vectors as input

2019-11-18 Thread koba (Jira)
koba created SPARK-29952:


 Summary: Pandas UDFs do not support vectors as input
 Key: SPARK-29952
 URL: https://issues.apache.org/jira/browse/SPARK-29952
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 2.4.3
Reporter: koba


Currently, pandas UDFs do not support columns of vectors as input, only columns 
of arrays. This means that feature columns containing dense or sparse vectors, 
such as those generated by CountVectorizer, are not supported by pandas UDFs 
out of the box; the vectors must be converted into arrays first. This is not 
documented anywhere, and I had to find it out by trial and error. Below is an 
example. 

 
{code:python}
from pyspark.sql.functions import udf, pandas_udf
import pyspark.sql.functions as F
from pyspark.ml.linalg import DenseVector, Vectors, VectorUDT
from pyspark.sql.types import *
import numpy as np

columns = ['features', 'id']
vals = [
    (DenseVector([1, 2, 1, 3]), 1),
    (DenseVector([2, 2, 1, 3]), 2)
]

sdf = spark.createDataFrame(vals, columns)
sdf.show()

+-----------------+---+
|         features| id|
+-----------------+---+
|[1.0,2.0,1.0,3.0]|  1|
|[2.0,2.0,1.0,3.0]|  2|
+-----------------+---+
{code}
{code:python}
@udf(returnType=ArrayType(FloatType()))
def vector_to_array(v):
    # convert column of vectors into column of arrays
    a = v.values.tolist()
    return a

sdf = sdf.withColumn('features_array', vector_to_array('features'))
sdf.show()
sdf.dtypes

+-----------------+---+--------------------+
|         features| id|      features_array|
+-----------------+---+--------------------+
|[1.0,2.0,1.0,3.0]|  1|[1.0, 2.0, 1.0, 3.0]|
|[2.0,2.0,1.0,3.0]|  2|[2.0, 2.0, 1.0, 3.0]|
+-----------------+---+--------------------+

[('features', 'vector'), ('id', 'bigint'), ('features_array', 'array<float>')]
{code}
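One caveat on the conversion step: the UDF above reads v.values, which holds every entry of a DenseVector but only the stored (non-zero) entries of a SparseVector. A minimal sketch, in plain Python with no Spark dependency, of the expansion a sparse vector needs before it can be treated as a fixed-length array. The function name sparse_to_dense and the sample values are hypothetical and not part of the example in this report.

```python
from typing import List, Sequence


def sparse_to_dense(size: int, indices: Sequence[int], values: Sequence[float]) -> List[float]:
    """Expand a sparse (size, indices, values) triple into a dense list of floats."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * size
    for i, v in zip(indices, values):
        if not 0 <= i < size:
            raise IndexError(f"index {i} is out of range for size {size}")
        dense[i] = float(v)
    return dense


# Hypothetical length-4 vector with stored entries at positions 0 and 3.
print(sparse_to_dense(4, [0, 3], [1.0, 3.0]))
# [1.0, 0.0, 0.0, 3.0]
```

Inside the UDF itself, v.toArray().tolist() gives the dense form for both DenseVector and SparseVector.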
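{code:python}
import pandas as pd

@pandas_udf(LongType())
def _pandas_udf(v):
    # v is a pandas Series with one array per row; take the mean of each array
    res = []
    for i in v:
        res.append(i.mean())
    return pd.Series(res)

sdf.select(_pandas_udf('features_array')).show()

+---------------------------+
|_pandas_udf(features_array)|
+---------------------------+
|                          1|
|                          2|
+---------------------------+
{code}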
But if I use the vector column, I get the following error.
{code:java}
sdf.select(_pandas_udf('features')).show()

---
Py4JJavaError Traceback (most recent call last)
 in 
 13 
 14 
---> 15 sdf.select(_pandas_udf('features')).show()

~/.pyenv/versions/anaconda3-5.3.1/lib/python3.7/site-packages/pyspark/sql/dataframe.py
 in show(self, n, truncate, vertical)
376 """
377 if isinstance(truncate, bool) and truncate:
--> 378 print(self._jdf.showString(n, 20, vertical))
379 else:
380 print(self._jdf.showString(n, int(truncate), vertical))

~/.pyenv/versions/3.4.4/lib/python3.4/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py
 in __call__(self, *args)
   1255 answer = self.gateway_client.send_command(command)
   1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259 for temp_arg in temp_args:

~/.pyenv/versions/anaconda3-5.3.1/lib/python3.7/site-packages/pyspark/sql/utils.py
 in deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

~/.pyenv/versions/3.4.4/lib/python3.4/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py
 in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(

Py4JJavaError: An error occurred while calling o2635.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 156.0 failed 1 times, most recent failure: Lost task 0.0 in stage 156.0 
(TID 606, localhost, executor driver): java.lang.UnsupportedOperationException: 
Unsupported data type: 
struct<type:tinyint,size:int,indices:array<int>,values:array<double>>
at 
org.apache.spark.sql.execution.arrow.ArrowUtils$.toArrowType(ArrowUtils.scala:56)
at 
org.apache.spark.sql.execution.arrow.ArrowUtils$.toArrowField(ArrowUtils.scala:92)
at 
org.apache.spark.sql.execution.arrow.ArrowUtils$$anonfun$toArrowSchema$1.apply(ArrowUtils.scala:116)
at 
org.apache.spark.sql.execution.arrow.ArrowUtils$$anonfun$toArrowSchema$1.apply(ArrowUtils.scala:115)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)