[jira] [Commented] (SPARK-38027) Undefined link function causing error in GLM that uses Tweedie family

2022-01-25 Thread Evan Zamir (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482128#comment-17482128
 ] 

Evan Zamir commented on SPARK-38027:


Looking into this further, I think the issue arises when the model is serialized, 
either by logging it or by persisting it to disk. From my logs:

2022-01-25 14:21:33,664 root ERROR An error occurred while calling 
o1538.toString.
: java.util.NoSuchElementException: Failed to find a default value for link
at 
org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756)
at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753)
at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41)
at org.apache.spark.ml.param.Params.$(params.scala:762)
at org.apache.spark.ml.param.Params.$$(params.scala:762)
at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41)
at 
org.apache.spark.ml.regression.GeneralizedLinearRegressionModel.toString(GeneralizedLinearRegression.scala:1117)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
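
The failure happens inside GeneralizedLinearRegressionModel.toString, which py4j invokes whenever the Python-side model object is converted to a string (for example by logging.info(model)). Not part of the original report, but a minimal workaround sketch, assuming a fitted PySpark model named model from the snippet in the issue description: log individual attributes instead of the model object itself, so the JVM toString is never called.

{code:python}
import logging

# Hypothetical workaround sketch: str(model) triggers the JVM-side toString,
# which is where "Failed to find a default value for link" is raised, so log
# specific attributes of the fitted model instead.
# model = glr.fit(df)  # assumed from the snippet in the issue description
logging.info("coefficients=%s intercept=%s", model.coefficients, model.intercept)
{code}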


> Undefined link function causing error in GLM that uses Tweedie family
> -
>
> Key: SPARK-38027
> URL: https://issues.apache.org/jira/browse/SPARK-38027
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.1.2
> Environment: Running on Mac OS X Monterey
>Reporter: Evan Zamir
>Priority: Major
>  Labels: GLM, pyspark
>
> I am trying to use GLM regression with a Tweedie distribution so I can model 
> insurance use cases. I have set up a very simple example adapted from the docs:
> {code:python}
> import logging
>
> from py4j.protocol import Py4JJavaError
> from pyspark.ml.linalg import Vectors
> from pyspark.ml.regression import GeneralizedLinearRegression
>
> def create_fake_losses_data(self):
>     # Minimal reproduction: fit a Tweedie GLM with an offset column.
>     df = self._spark.createDataFrame([
>         ("a", 100.0, 12, 1, Vectors.dense(0.0, 0.0)),
>         ("b", 0.0, 12, 1, Vectors.dense(1.0, 2.0)),
>         ("c", 0.0, 12, 1, Vectors.dense(0.0, 0.0)),
>         ("d", 2000.0, 12, 1, Vectors.dense(1.0, 1.0)),
>     ], ["user", "label", "offset", "weight", "features"])
>     logging.info(df.collect())
>     setattr(self, 'fake_data', df)
>     try:
>         glr = GeneralizedLinearRegression(
>             family="tweedie", variancePower=1.5, linkPower=-1,
>             offsetCol='offset')
>         glr.setRegParam(0.3)
>         model = glr.fit(df)
>         logging.info(model)
>     except Py4JJavaError as e:
>         print(e)
>     return self
> {code}
> This causes the following error:
> *py4j.protocol.Py4JJavaError: An error occurred while calling o99.toString.
> : java.util.NoSuchElementException: Failed to find a default value for link*
> at 
> org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756)
> at scala.Option.getOrElse(Option.scala:189)
> at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756)
> at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753)
> at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41)
> at org.apache.spark.ml.param.Params.$(params.scala:762)
> at org.apache.spark.ml.param.Params.$$(params.scala:762)
> at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41)
> at 
> org.apache.spark.ml.regression.GeneralizedLinearRegressionModel.toString(GeneralizedLinearRegression.scala:1117)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:282)
> at 
> 

[jira] [Created] (SPARK-38027) Undefined link function causing error in GLM that uses Tweedie family

2022-01-25 Thread Evan Zamir (Jira)
Evan Zamir created SPARK-38027:
--

 Summary: Undefined link function causing error in GLM that uses 
Tweedie family
 Key: SPARK-38027
 URL: https://issues.apache.org/jira/browse/SPARK-38027
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 3.1.2
 Environment: Running on Mac OS X Monterey
Reporter: Evan Zamir


I am trying to use GLM regression with a Tweedie distribution so I can model 
insurance use cases. I have set up a very simple example adapted from the docs:


{code:python}
import logging

from py4j.protocol import Py4JJavaError
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GeneralizedLinearRegression

def create_fake_losses_data(self):
    # Minimal reproduction: fit a Tweedie GLM with an offset column.
    df = self._spark.createDataFrame([
        ("a", 100.0, 12, 1, Vectors.dense(0.0, 0.0)),
        ("b", 0.0, 12, 1, Vectors.dense(1.0, 2.0)),
        ("c", 0.0, 12, 1, Vectors.dense(0.0, 0.0)),
        ("d", 2000.0, 12, 1, Vectors.dense(1.0, 1.0)),
    ], ["user", "label", "offset", "weight", "features"])
    logging.info(df.collect())
    setattr(self, 'fake_data', df)
    try:
        glr = GeneralizedLinearRegression(
            family="tweedie", variancePower=1.5, linkPower=-1,
            offsetCol='offset')
        glr.setRegParam(0.3)
        model = glr.fit(df)
        logging.info(model)
    except Py4JJavaError as e:
        print(e)
    return self
{code}

This causes the following error:

*py4j.protocol.Py4JJavaError: An error occurred while calling o99.toString.
: java.util.NoSuchElementException: Failed to find a default value for link*
at 
org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756)
at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753)
at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41)
at org.apache.spark.ml.param.Params.$(params.scala:762)
at org.apache.spark.ml.param.Params.$$(params.scala:762)
at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41)
at 
org.apache.spark.ml.regression.GeneralizedLinearRegressionModel.toString(GeneralizedLinearRegression.scala:1117)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)


I was under the assumption that the default value for link is None if it is not 
otherwise defined.
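
Not part of the original report, but a small sketch of how the default can be inspected from PySpark with the generic Params API; the estimator configuration is taken from the snippet above, everything else is illustrative:

{code:python}
from pyspark.ml.regression import GeneralizedLinearRegression

glr = GeneralizedLinearRegression(
    family="tweedie", variancePower=1.5, linkPower=-1, offsetCol="offset")

# Generic Params introspection available on every PySpark estimator/model.
print(glr.hasDefault(glr.link))  # is a default value registered for 'link'?
print(glr.isDefined(glr.link))   # is 'link' either explicitly set or defaulted?
{code}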
 
 





[jira] [Created] (SPARK-26387) Parallelism seems to cause difference in CrossValidation model metrics

2018-12-17 Thread Evan Zamir (JIRA)
Evan Zamir created SPARK-26387:
--

 Summary: Parallelism seems to cause difference in CrossValidation 
model metrics
 Key: SPARK-26387
 URL: https://issues.apache.org/jira/browse/SPARK-26387
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Affects Versions: 2.3.2, 2.3.1
Reporter: Evan Zamir


I can only reproduce this issue when running Spark on different Amazon EMR 
versions, but it seems that between Spark 2.3.1 and 2.3.2 (corresponding to EMR 
versions 5.17/5.18) the presence of the parallelism parameter causes the AUC 
metric to increase. Literally, I run the exact same code with and without 
parallelism and the AUC of my models (logistic regression) changes 
significantly. I can't find a previous bug report relating to this, so I'm 
posting this as new.
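
Not from the original report, but a minimal sketch of the kind of comparison described above, assuming a binary-labeled DataFrame named train_df with label and features columns; all other names are illustrative:

{code:python}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(labelCol="label", featuresCol="features")
grid = ParamGridBuilder().addGrid(lr.regParam, [0.0, 0.1]).build()
evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")

# Identical pipeline; only the parallelism parameter differs between runs.
for parallelism in (1, 4):
    cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=3,
                        parallelism=parallelism, seed=42)
    print(parallelism, cv.fit(train_df).avgMetrics)
{code}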





[jira] [Updated] (SPARK-24866) Artifactual ROC scores when scaling up Random Forest classifier

2018-07-19 Thread Evan Zamir (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Evan Zamir updated SPARK-24866:
---
Description: I'm encountering a very strange behavior that I can't explain 
away other than a bug somewhere. I'm creating RF models on Amazon EMR, normally 
using 1 Core instance. On these models, I have been consistently getting ROCs 
(during CV) ~0.55-0.60 (not good models obviously, but that's not the point 
here). After learning that Spark 2.3 introduced a parallelism parameter for the 
CV class, I decided to implement that and see if increasing the number of Core 
instances could also help speed up the models (which take several hours, 
sometimes up to a full day). To make a long story short, I have seen that on 
some of my datasets simply increasing the number of Core instances (i.e. 2), 
the ROC scores (*bestValidationMetric*) increase tremendously to the range of 
0.85-0.95. For the life of me I can't figure out why simply increasing the 
number of instances (with absolutely no changes to code), would have this 
effect. I don't know if this is a Spark problem or somehow EMR, but I figured 
I'd post here and see if anyone has an idea for me.   (was: I'm encountering a 
very strange behavior that I can't explain away other than a bug somewhere. I'm 
creating RF models on Amazon EMR, normally using 1 Core instance. On these 
models, I have been consistently getting ROCs (during CV) ~0.55-0.60 (not good 
models obviously, but that's not the point here). After learning that Spark 2.3 
introduced a parallelism parameter for the CV class, I decided to implement 
that and see if increasing the number of Core instances could also help speed 
up the models (which take several hours, sometimes up to a full day). To make a 
long story short, I have seen that on some of my datasets simply increasing the 
number of Core instances (i.e. 2), the ROC scores increase tremendously to the 
range of 0.85-0.95. For the life of me I can't figure out why simply increasing 
the number of instances (with absolutely no changes to code), would have this 
effect. I don't know if this is a Spark problem or somehow EMR, but I figured 
I'd post here and see if anyone has an idea for me. )

> Artifactual ROC scores when scaling up Random Forest classifier
> ---
>
> Key: SPARK-24866
> URL: https://issues.apache.org/jira/browse/SPARK-24866
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Evan Zamir
>Priority: Minor
>
> I'm encountering a very strange behavior that I can't explain away other than 
> a bug somewhere. I'm creating RF models on Amazon EMR, normally using 1 Core 
> instance. On these models, I have been consistently getting ROCs (during CV) of 
> ~0.55-0.60 (not good models obviously, but that's not the point here). After 
> learning that Spark 2.3 introduced a parallelism parameter for the CV class, 
> I decided to implement that and see if increasing the number of Core 
> instances could also help speed up the models (which take several hours, 
> sometimes up to a full day). To make a long story short, I have seen that on 
> some of my datasets, simply increasing the number of Core instances (e.g. to 2) 
> increases the ROC scores (*bestValidationMetric*) tremendously, to the range of 
> 0.85-0.95. For the life of me I can't figure out why simply increasing the 
> number of instances (with absolutely no changes to the code) would have this 
> effect. I don't know if this is a Spark problem or somehow EMR, but I figured 
> I'd post here and see if anyone has an idea for me. 





[jira] [Created] (SPARK-24866) Artifactual ROC scores when scaling up Random Forest classifier

2018-07-19 Thread Evan Zamir (JIRA)
Evan Zamir created SPARK-24866:
--

 Summary: Artifactual ROC scores when scaling up Random Forest 
classifier
 Key: SPARK-24866
 URL: https://issues.apache.org/jira/browse/SPARK-24866
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.3.0
Reporter: Evan Zamir


I'm encountering a very strange behavior that I can't explain away other than a 
bug somewhere. I'm creating RF models on Amazon EMR, normally using 1 Core 
instance. On these models, I have been consistently getting ROCs (during CV) of 
~0.55-0.60 (not good models obviously, but that's not the point here). After 
learning that Spark 2.3 introduced a parallelism parameter for the CV class, I 
decided to implement that and see if increasing the number of Core instances 
could also help speed up the models (which take several hours, sometimes up to 
a full day). To make a long story short, I have seen that on some of my 
datasets, simply increasing the number of Core instances (e.g. to 2) increases 
the ROC scores tremendously, to the range of 0.85-0.95. For the life of me I 
can't figure out why simply increasing the number of instances (with absolutely 
no changes to the code) would have this effect. I don't know if this is a Spark 
problem or somehow EMR, but I figured I'd post here and see if anyone has an 
idea for me. 
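
A hedged suggestion, not part of the report: before attributing the jump to the cluster size, one way to rule out ordinary run-to-run variation is to pin the seeds on both the classifier and the cross-validator so repeated runs are comparable. A sketch under that assumption, with rf, grid, and train_df as placeholder names:

{code:python}
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestClassifier(labelCol="label", featuresCol="features", seed=42)
grid = ParamGridBuilder().addGrid(rf.numTrees, [50, 100]).build()
evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")

cv = CrossValidator(estimator=rf, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3,
                    parallelism=2, seed=42)
print(cv.fit(train_df).avgMetrics)  # average ROC AUC per grid point across folds
{code}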





[jira] [Commented] (SPARK-23684) mode append function not working

2018-03-15 Thread Evan Zamir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400823#comment-16400823
 ] 

Evan Zamir commented on SPARK-23684:


Yes, you're right. Feel free to close this.

> mode append function not working 
> -
>
> Key: SPARK-23684
> URL: https://issues.apache.org/jira/browse/SPARK-23684
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.0
>Reporter: Evan Zamir
>Priority: Minor
>
> {{df.write.mode('append').jdbc(url, table, properties=\{"driver": 
> "org.postgresql.Driver"}) }}
> produces the following error and does not write to an existing table:
> {{2018-03-14 11:00:08,332 root ERROR An error occurred while calling 
> o894.jdbc.}}
> {{: scala.MatchError: null}}
> {{ at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:62)}}
> {{ at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472)}}
> {{ at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)}}
> {{ at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)}}
> {{ at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)}}
> {{ at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)}}
> {{ at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)}}
> {{ at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)}}
> {{ at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)}}
> {{ at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)}}
> {{ at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)}}
> {{ at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)}}
> {{ at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)}}
> {{ at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)}}
> {{ at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)}}
> {{ at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)}}
> {{ at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:461)}}
> {{ at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)}}
> {{ at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)}}
> {{ at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)}}
> {{ at java.lang.reflect.Method.invoke(Method.java:498)}}
> {{ at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)}}
> {{ at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)}}
> {{ at py4j.Gateway.invoke(Gateway.java:280)}}
> {{ at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)}}
> {{ at py4j.commands.CallCommand.execute(CallCommand.java:79)}}
> {{ at py4j.GatewayConnection.run(GatewayConnection.java:214)}}
> {{ at java.lang.Thread.run(Thread.java:745)}}
> However,
> {{df.write.jdbc(url, table, properties=\{"driver": 
> "org.postgresql.Driver"},mode='append')}}
> does not produce an error and adds a row to an existing table.





[jira] [Created] (SPARK-23684) mode append function not working

2018-03-14 Thread Evan Zamir (JIRA)
Evan Zamir created SPARK-23684:
--

 Summary: mode append function not working 
 Key: SPARK-23684
 URL: https://issues.apache.org/jira/browse/SPARK-23684
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.2.0
Reporter: Evan Zamir


{{df.write.mode('append').jdbc(url, table, properties=\{"driver": 
"org.postgresql.Driver"}) }}

produces the following error and does not write to an existing table:

{{2018-03-14 11:00:08,332 root ERROR An error occurred while calling 
o894.jdbc.}}
{{: scala.MatchError: null}}
{{ at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:62)}}
{{ at 
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472)}}
{{ at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)}}
{{ at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)}}
{{ at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)}}
{{ at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)}}
{{ at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)}}
{{ at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)}}
{{ at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)}}
{{ at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)}}
{{ at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)}}
{{ at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)}}
{{ at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)}}
{{ at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)}}
{{ at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)}}
{{ at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)}}
{{ at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:461)}}
{{ at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)}}
{{ at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)}}
{{ at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)}}
{{ at java.lang.reflect.Method.invoke(Method.java:498)}}
{{ at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)}}
{{ at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)}}
{{ at py4j.Gateway.invoke(Gateway.java:280)}}
{{ at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)}}
{{ at py4j.commands.CallCommand.execute(CallCommand.java:79)}}
{{ at py4j.GatewayConnection.run(GatewayConnection.java:214)}}
{{ at java.lang.Thread.run(Thread.java:745)}}

However,

{{df.write.jdbc(url, table, properties=\{"driver": 
"org.postgresql.Driver"},mode='append')}}

does not produce an error and adds a row to an existing table.
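
For reference, a self-contained sketch of the working call pattern described just above: pass mode='append' directly to jdbc() rather than going through df.write.mode('append'). The URL, table name, and connection properties are placeholders:

{code:python}
# Sketch of the form that works: mode is passed to jdbc() itself.
jdbc_url = "jdbc:postgresql://localhost:5432/mydb"  # placeholder
df.write.jdbc(jdbc_url, "my_table",
              mode="append",
              properties={"driver": "org.postgresql.Driver",
                          "user": "spark", "password": "secret"})
{code}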





[jira] [Created] (SPARK-23631) Add summary to RandomForestClassificationModel

2018-03-08 Thread Evan Zamir (JIRA)
Evan Zamir created SPARK-23631:
--

 Summary: Add summary to RandomForestClassificationModel
 Key: SPARK-23631
 URL: https://issues.apache.org/jira/browse/SPARK-23631
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.3.0
Reporter: Evan Zamir


I'm using the RandomForestClassificationModel and noticed that there is no 
summary attribute like there is for LogisticRegressionModel. Specifically, I'd 
like to have the ROC and PR curves. Is that on the Spark roadmap anywhere? Is 
there a reason it hasn't been implemented?
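
Not part of the request, but a sketch of a common workaround until such a summary exists: compute the curve metrics from the model's predictions with BinaryClassificationEvaluator. rf_model and test_df are assumed names:

{code:python}
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# rf_model = RandomForestClassifier(...).fit(train_df)  # assumed
predictions = rf_model.transform(test_df)

auc_roc = BinaryClassificationEvaluator(metricName="areaUnderROC").evaluate(predictions)
auc_pr = BinaryClassificationEvaluator(metricName="areaUnderPR").evaluate(predictions)
print(auc_roc, auc_pr)
{code}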





[jira] [Created] (SPARK-20182) Dot in DataFrame Column title causes errors

2017-03-31 Thread Evan Zamir (JIRA)
Evan Zamir created SPARK-20182:
--

 Summary: Dot in DataFrame Column title causes errors
 Key: SPARK-20182
 URL: https://issues.apache.org/jira/browse/SPARK-20182
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.0
Reporter: Evan Zamir


I did a search and saw this issue pop up before, and while it seemed like it 
had been solved before 2.1, I'm still seeing an error.

{code:python}
emp = spark.createDataFrame([(["Joe", "Bob", "Mary"],),
                             (["Mike", "Matt", "Stacy"],)],
                            ["first.names"])

print(emp.collect())

emp.select(['first.names']).alias('first')
{code}

[Row(first.names=['Joe', 'Bob', 'Mary']), Row(first.names=['Mike', 'Matt', 
'Stacy'])]
Py4JJavaError Traceback (most recent call last)
/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:

/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:

Py4JJavaError: An error occurred while calling o1734.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`first.names`' given 
input columns: [first.names];;
'Project ['first.names]
+- LogicalRDD [first.names#466]

at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:296)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1121)
at sun.reflect.GeneratedMethodAccessor52.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at 
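
Not from the original report: the usual workaround for a dotted column name is to quote the whole name in backticks, which makes the analyzer treat the dot as part of the identifier rather than as a struct-field access. A small sketch, assuming the emp DataFrame from the snippet above:

{code:python}
from pyspark.sql.functions import col

# Backticks make `first.names` a single identifier instead of first.names
# (a field access on a struct column named first).
emp.select(col("`first.names`")).show()

# Renaming the column up front sidesteps the problem entirely.
emp.withColumnRenamed("first.names", "first_names").select("first_names").show()
{code}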

[jira] [Created] (SPARK-17923) dateFormat unexpected kwarg to df.write.csv

2016-10-13 Thread Evan Zamir (JIRA)
Evan Zamir created SPARK-17923:
--

 Summary: dateFormat unexpected kwarg to df.write.csv
 Key: SPARK-17923
 URL: https://issues.apache.org/jira/browse/SPARK-17923
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.0
Reporter: Evan Zamir
Priority: Minor


Calling like this:
{code}writer.csv(path, header=header, sep=sep, compression=compression, 
dateFormat=date_format){code}

Getting the following error:
{code}TypeError: csv() got an unexpected keyword argument 'dateFormat'{code}

The error occurs when {code}date_format='yyyy-MM-dd'{code} is passed as an 
argument.
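
A sketch of a possible workaround on 2.0.x, not taken from the report: supply dateFormat through option() on the writer before calling csv(), since it is a data source option even where the csv() signature does not accept it as a keyword argument. df and path are placeholder names:

{code:python}
# dateFormat set via option() instead of as a csv() keyword argument.
(df.write
   .option("header", "true")
   .option("dateFormat", "yyyy-MM-dd")
   .csv(path))
{code}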






[jira] [Comment Edited] (SPARK-17508) Setting weightCol to None in ML library causes an error

2016-09-14 Thread Evan Zamir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15491186#comment-15491186
 ] 

Evan Zamir edited comment on SPARK-17508 at 9/14/16 6:53 PM:
-

Honestly, if the documentation was just more explicit, users wouldn't be so 
confused. But when it says {{weightCol=None}}, there's only one way we can 
interpret that in Python, and it happens to produce an error. Why doesn't 
someone just change the docstring to read {{weightCol=""}} (which apparently is 
the way one has to write the code to run without error)?


was (Author: zamir.e...@gmail.com):
Honestly, if the documentation was just more explicit, users wouldn't be so 
confused. But when it says {{weightCol=None}}, there's only one way we can 
interpret that in Python, and it happens to produce an error. Why doesn't 
someone just change the docstring to read {{weightCol=""}}?

> Setting weightCol to None in ML library causes an error
> ---
>
> Key: SPARK-17508
> URL: https://issues.apache.org/jira/browse/SPARK-17508
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Evan Zamir
>Priority: Minor
>
> The following code runs without error:
> {code}
> spark = SparkSession.builder.appName('WeightBug').getOrCreate()
> df = spark.createDataFrame(
> [
> (1.0, 1.0, Vectors.dense(1.0)),
> (0.0, 1.0, Vectors.dense(-1.0))
> ],
> ["label", "weight", "features"])
> lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight")
> model = lr.fit(df)
> {code}
> My expectation from reading the documentation is that setting weightCol=None 
> should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors:
> Traceback (most recent call last):
>   File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in 
> 
> model = lr.fit(df)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line 
> 64, in fit
> return self._fit(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 213, in _fit
> java_model = self._fit_java(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
> return self._java_obj.fit(dataset._jdf)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit.
> : java.lang.NullPointerException
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:211)
>   at java.lang.Thread.run(Thread.java:745)
> Process finished with exit code 1





[jira] [Comment Edited] (SPARK-17508) Setting weightCol to None in ML library causes an error

2016-09-14 Thread Evan Zamir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15491186#comment-15491186
 ] 

Evan Zamir edited comment on SPARK-17508 at 9/14/16 6:52 PM:
-

Honestly, if the documentation was just more explicit, users wouldn't be so 
confused. But when it says {{weightCol=None}}, there's only one way we can 
interpret that in Python, and it happens to produce an error. Why doesn't 
someone just change the docstring to read {{weightCol=""}}?


was (Author: zamir.e...@gmail.com):
Honestly, if the documentation was just more explicit, users wouldn't be so 
confused. But when it says `weightCol=None`, there's only one way we can 
interpret that in Python, and it happens to produce an error. Why doesn't 
someone just change the docstring to read `weightCol=""`?

> Setting weightCol to None in ML library causes an error
> ---
>
> Key: SPARK-17508
> URL: https://issues.apache.org/jira/browse/SPARK-17508
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Evan Zamir
>Priority: Minor
>
> The following code runs without error:
> {code}
> spark = SparkSession.builder.appName('WeightBug').getOrCreate()
> df = spark.createDataFrame(
> [
> (1.0, 1.0, Vectors.dense(1.0)),
> (0.0, 1.0, Vectors.dense(-1.0))
> ],
> ["label", "weight", "features"])
> lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight")
> model = lr.fit(df)
> {code}
> My expectation from reading the documentation is that setting weightCol=None 
> should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors:
> Traceback (most recent call last):
>   File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in 
> 
> model = lr.fit(df)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line 
> 64, in fit
> return self._fit(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 213, in _fit
> java_model = self._fit_java(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
> return self._java_obj.fit(dataset._jdf)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit.
> : java.lang.NullPointerException
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:211)
>   at java.lang.Thread.run(Thread.java:745)
> Process finished with exit code 1





[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error

2016-09-14 Thread Evan Zamir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15491186#comment-15491186
 ] 

Evan Zamir commented on SPARK-17508:


Honestly, if the documentation was just more explicit, users wouldn't be so 
confused. But when it says `weightCol=None`, there's only one way we can 
interpret that in Python, and it happens to produce an error. Why doesn't 
someone just change the docstring to read `weightCol=""`?

> Setting weightCol to None in ML library causes an error
> ---
>
> Key: SPARK-17508
> URL: https://issues.apache.org/jira/browse/SPARK-17508
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Evan Zamir
>Priority: Minor
>
> The following code runs without error:
> {code}
> spark = SparkSession.builder.appName('WeightBug').getOrCreate()
> df = spark.createDataFrame(
> [
> (1.0, 1.0, Vectors.dense(1.0)),
> (0.0, 1.0, Vectors.dense(-1.0))
> ],
> ["label", "weight", "features"])
> lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight")
> model = lr.fit(df)
> {code}
> My expectation from reading the documentation is that setting weightCol=None 
> should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors:
> Traceback (most recent call last):
>   File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in 
> 
> model = lr.fit(df)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line 
> 64, in fit
> return self._fit(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 213, in _fit
> java_model = self._fit_java(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
> return self._java_obj.fit(dataset._jdf)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit.
> : java.lang.NullPointerException
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:211)
>   at java.lang.Thread.run(Thread.java:745)
> Process finished with exit code 1





[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error

2016-09-12 Thread Evan Zamir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485782#comment-15485782
 ] 

Evan Zamir commented on SPARK-17508:


[~bryanc] Oh, that helps a lot! I've been writing very light wrappers around 
Spark functions and it wasn't clear to me whether I could keep weightCol as an 
optional parameter. At least now I can reason about how to do it better.

I guess this isn't so much a bug then, as it is a feature request. So if 
someone wants to close the issue or reclassify, that would make sense. I can 
only imagine I'm not the only Spark user who has been miffed by this.
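
Not from the thread, but a small sketch of the wrapper pattern being discussed: only forward weightCol to the estimator when the caller actually supplies a column name, so that None never reaches the JVM side. All names are illustrative:

{code:python}
from pyspark.ml.classification import LogisticRegression

def make_lr(max_iter=5, reg_param=0.0, weight_col=None):
    """Build a LogisticRegression, treating weight_col=None as 'no weights'."""
    kwargs = dict(maxIter=max_iter, regParam=reg_param)
    if weight_col:  # only pass the param when a real column name is given
        kwargs["weightCol"] = weight_col
    return LogisticRegression(**kwargs)

weighted = make_lr(weight_col="weight").fit(df)  # uses the weight column
unweighted = make_lr().fit(df)                   # weightCol never set
{code}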

> Setting weightCol to None in ML library causes an error
> ---
>
> Key: SPARK-17508
> URL: https://issues.apache.org/jira/browse/SPARK-17508
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Evan Zamir
>
> The following code runs without error:
> {code}
> spark = SparkSession.builder.appName('WeightBug').getOrCreate()
> df = spark.createDataFrame(
> [
> (1.0, 1.0, Vectors.dense(1.0)),
> (0.0, 1.0, Vectors.dense(-1.0))
> ],
> ["label", "weight", "features"])
> lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight")
> model = lr.fit(df)
> {code}
> My expectation from reading the documentation is that setting weightCol=None 
> should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors:
> Traceback (most recent call last):
>   File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in 
> 
> model = lr.fit(df)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line 
> 64, in fit
> return self._fit(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 213, in _fit
> java_model = self._fit_java(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
> return self._java_obj.fit(dataset._jdf)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit.
> : java.lang.NullPointerException
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:211)
>   at java.lang.Thread.run(Thread.java:745)
> Process finished with exit code 1





[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error

2016-09-12 Thread Evan Zamir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484861#comment-15484861
 ] 

Evan Zamir commented on SPARK-17508:


Just ran the same snippet of code setting weightCol="" and that runs without 
error. It's only when I set weightCol=None that I get an error.

> Setting weightCol to None in ML library causes an error
> ---
>
> Key: SPARK-17508
> URL: https://issues.apache.org/jira/browse/SPARK-17508
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Evan Zamir
>
> The following code runs without error:
> {code}
> spark = SparkSession.builder.appName('WeightBug').getOrCreate()
> df = spark.createDataFrame(
> [
> (1.0, 1.0, Vectors.dense(1.0)),
> (0.0, 1.0, Vectors.dense(-1.0))
> ],
> ["label", "weight", "features"])
> lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight")
> model = lr.fit(df)
> {code}
> My expectation from reading the documentation is that setting weightCol=None 
> should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors:
> Traceback (most recent call last):
>   File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in 
> 
> model = lr.fit(df)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line 
> 64, in fit
> return self._fit(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 213, in _fit
> java_model = self._fit_java(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
> return self._java_obj.fit(dataset._jdf)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit.
> : java.lang.NullPointerException
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:211)
>   at java.lang.Thread.run(Thread.java:745)
> Process finished with exit code 1





[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error

2016-09-12 Thread Evan Zamir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484850#comment-15484850
 ] 

Evan Zamir commented on SPARK-17508:


Yep, I'm running 2.0.0. You can see in the error messages above that it's 
running 2.0.0. Can you try running the same code snippet and see if it works 
for you?

> Setting weightCol to None in ML library causes an error
> ---
>
> Key: SPARK-17508
> URL: https://issues.apache.org/jira/browse/SPARK-17508
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Evan Zamir
>
> The following code runs without error:
> {code}
> spark = SparkSession.builder.appName('WeightBug').getOrCreate()
> df = spark.createDataFrame(
> [
> (1.0, 1.0, Vectors.dense(1.0)),
> (0.0, 1.0, Vectors.dense(-1.0))
> ],
> ["label", "weight", "features"])
> lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight")
> model = lr.fit(df)
> {code}
> My expectation from reading the documentation is that setting weightCol=None 
> should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors:
> Traceback (most recent call last):
>   File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in 
> 
> model = lr.fit(df)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line 
> 64, in fit
> return self._fit(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 213, in _fit
> java_model = self._fit_java(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
> return self._java_obj.fit(dataset._jdf)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit.
> : java.lang.NullPointerException
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:211)
>   at java.lang.Thread.run(Thread.java:745)
> Process finished with exit code 1





[jira] [Created] (SPARK-17508) Setting weightCol to None in ML library causes an error

2016-09-12 Thread Evan Zamir (JIRA)
Evan Zamir created SPARK-17508:
--

 Summary: Setting weightCol to None in ML library causes an error
 Key: SPARK-17508
 URL: https://issues.apache.org/jira/browse/SPARK-17508
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.0
Reporter: Evan Zamir


The following code runs without error:

{code}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('WeightBug').getOrCreate()
df = spark.createDataFrame(
    [
        (1.0, 1.0, Vectors.dense(1.0)),
        (0.0, 1.0, Vectors.dense(-1.0))
    ],
    ["label", "weight", "features"])
lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight")
model = lr.fit(df)
{code}

My expectation from reading the documentation is that setting weightCol=None 
should treat all weights as 1.0 (regardless of whether a column exists). 
However, the same code with weightCol set to None causes the following errors:

Traceback (most recent call last):

  File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in 
model = lr.fit(df)
  File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line 
64, in fit
return self._fit(dataset)
  File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
line 213, in _fit
java_model = self._fit_java(dataset)
  File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
line 210, in _fit_java
return self._java_obj.fit(dataset._jdf)
  File 
"/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
 line 933, in __call__
  File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 
63, in deco
return f(*a, **kw)
  File 
"/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
 line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit.
: java.lang.NullPointerException
at 
org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264)
at 
org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259)
at 
org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)


Process finished with exit code 1





