[jira] [Commented] (SPARK-38027) Undefined link function causing error in GLM that uses Tweedie family
[ https://issues.apache.org/jira/browse/SPARK-38027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482128#comment-17482128 ] Evan Zamir commented on SPARK-38027: Looking into this further, I think the issue arises when serializing the model, either by logging it or by persisting it to disk. From my logs: 2022-01-25 14:21:33,664 root ERROR An error occurred while calling o1538.toString. : java.util.NoSuchElementException: Failed to find a default value for link at org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756) at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753) at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41) at org.apache.spark.ml.param.Params.$(params.scala:762) at org.apache.spark.ml.param.Params.$$(params.scala:762) at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41) at org.apache.spark.ml.regression.GeneralizedLinearRegressionModel.toString(GeneralizedLinearRegression.scala:1117) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) > Undefined link function causing error in GLM that uses Tweedie family > - > > Key: SPARK-38027 > URL: https://issues.apache.org/jira/browse/SPARK-38027 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.1.2 > Environment: Running on Mac OS X Monterey >Reporter: Evan Zamir >Priority: Major > Labels: GLM, pyspark > > I am trying to use the GLM regression with a Tweedie distribution so I can > model insurance use cases. I have set up a very simple example adapted from > the docs: > {code:python} > def create_fake_losses_data(self): > df = self._spark.createDataFrame([ > ("a", 100.0, 12, 1, Vectors.dense(0.0, 0.0)), > ("b", 0.0, 12, 1, Vectors.dense(1.0, 2.0)), > ("c", 0.0, 12, 1, Vectors.dense(0.0, 0.0)), > ("d", 2000.0, 12, 1, Vectors.dense(1.0, 1.0)), ], ["user", > "label", "offset", "weight", "features"]) > logging.info(df.collect()) > setattr(self, 'fake_data', df) > try: > glr = GeneralizedLinearRegression( > family="tweedie", variancePower=1.5, linkPower=-1, > offsetCol='offset') > glr.setRegParam(0.3) > model = glr.fit(df) > logging.info(model) > except Py4JJavaError as e: > print(e) > return self > {code} > This causes the following error: > *py4j.protocol.Py4JJavaError: An error occurred while calling o99.toString. 
> : java.util.NoSuchElementException: Failed to find a default value for link* > at > org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756) > at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753) > at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41) > at org.apache.spark.ml.param.Params.$(params.scala:762) > at org.apache.spark.ml.param.Params.$$(params.scala:762) > at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41) > at > org.apache.spark.ml.regression.GeneralizedLinearRegressionModel.toString(GeneralizedLinearRegression.scala:1117) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:282) > at >
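Editor's note: a minimal workaround sketch for the failure above, not from the original thread. It assumes the glr and df objects from the reporter's snippet and forces str(model) to run in our own frame, so the JVM-side toString() failure can actually be caught:

{code:python}
import logging

from py4j.protocol import Py4JJavaError

# Sketch under assumed names (glr and df come from the reporter's code).
model = glr.fit(df)
try:
    # str() triggers the JVM-side toString(), which reads the undefined
    # "link" param for tweedie models and raises through Py4J.
    logging.info(str(model))
except Py4JJavaError:
    # Fall back to attributes that do not go through toString().
    logging.info("coefficients=%s intercept=%s",
                 model.coefficients, model.intercept)
{code}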
[jira] [Created] (SPARK-38027) Undefined link function causing error in GLM that uses Tweedie family
Evan Zamir created SPARK-38027: -- Summary: Undefined link function causing error in GLM that uses Tweedie family Key: SPARK-38027 URL: https://issues.apache.org/jira/browse/SPARK-38027 Project: Spark Issue Type: Bug Components: ML Affects Versions: 3.1.2 Environment: Running on Mac OS X Monterey Reporter: Evan Zamir I am trying to use the GLM regression with a Tweedie distribution so I can model insurance use cases. I have set up a very simple example adapted from the docs: {code:python} def create_fake_losses_data(self): df = self._spark.createDataFrame([ ("a", 100.0, 12, 1, Vectors.dense(0.0, 0.0)), ("b", 0.0, 12, 1, Vectors.dense(1.0, 2.0)), ("c", 0.0, 12, 1, Vectors.dense(0.0, 0.0)), ("d", 2000.0, 12, 1, Vectors.dense(1.0, 1.0)), ], ["user", "label", "offset", "weight", "features"]) logging.info(df.collect()) setattr(self, 'fake_data', df) try: glr = GeneralizedLinearRegression( family="tweedie", variancePower=1.5, linkPower=-1, offsetCol='offset') glr.setRegParam(0.3) model = glr.fit(df) logging.info(model) except Py4JJavaError as e: print(e) return self {code} This causes the following error: *py4j.protocol.Py4JJavaError: An error occurred while calling o99.toString. : java.util.NoSuchElementException: Failed to find a default value for link* at org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756) at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753) at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41) at org.apache.spark.ml.param.Params.$(params.scala:762) at org.apache.spark.ml.param.Params.$$(params.scala:762) at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41) at org.apache.spark.ml.regression.GeneralizedLinearRegressionModel.toString(GeneralizedLinearRegression.scala:1117) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) I was under the assumption that the default value for link is None, if not defined otherwise. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
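Editor's note: one way to check the reporter's assumption about defaults directly from PySpark; an illustrative sketch, not part of the original report:

{code:python}
from pyspark.ml.regression import GeneralizedLinearRegression

# For family="tweedie" the link is driven by linkPower, and "link"
# genuinely has no default, consistent with the
# NoSuchElementException in the stack trace above.
glr = GeneralizedLinearRegression(family="tweedie", variancePower=1.5,
                                  linkPower=-1, offsetCol="offset")
print(glr.explainParam("link"))       # ends with "(undefined)" when unset
print(glr.explainParam("linkPower"))  # shows the current value, -1
{code}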
[jira] [Created] (SPARK-26387) Parallelism seems to cause difference in CrossValidation model metrics
Evan Zamir created SPARK-26387: -- Summary: Parallelism seems to cause difference in CrossValidation model metrics Key: SPARK-26387 URL: https://issues.apache.org/jira/browse/SPARK-26387 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 2.3.2, 2.3.1 Reporter: Evan Zamir I can only reproduce this issue when running Spark on different Amazon EMR versions, but it seems that between Spark 2.3.1 and 2.3.2 (corresponding to EMR versions 5.17/5.18) the presence of the parallelism parameter was causing the AUC metric to increase. I literally run exactly the same code with and without parallelism, and the AUC of my models (logistic regression) changes significantly. I can't find a previous bug report relating to this, so I'm posting this as new. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
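Editor's note: the report has no snippet, so here is a hedged, self-contained sketch (toy data, assumed names) of the comparison being described — the same CrossValidator run with and without parallelism, seeded so the fold splits match:

{code:python}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cv-parallelism-check").getOrCreate()
# Tiny stand-in dataset (assumption: the reporter's real data is EMR-scale).
train = spark.createDataFrame(
    [(float(i % 2), Vectors.dense(float(i), float(i % 3))) for i in range(60)],
    ["label", "features"])

lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")

# Identical configs except for parallelism; with a fixed seed the fold
# splits should match, so avgMetrics ought to agree between the two runs.
cv_serial = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                           evaluator=evaluator, numFolds=3, seed=42)
cv_parallel = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                             evaluator=evaluator, numFolds=3, seed=42,
                             parallelism=2)
print(cv_serial.fit(train).avgMetrics)
print(cv_parallel.fit(train).avgMetrics)
{code}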
[jira] [Updated] (SPARK-24866) Artifactual ROC scores when scaling up Random Forest classifier
[ https://issues.apache.org/jira/browse/SPARK-24866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Evan Zamir updated SPARK-24866: --- Description: I'm encountering a very strange behavior that I can't explain away other than a bug somewhere. I'm creating RF models on Amazon EMR, normally using 1 Core instance. On these models, I have been consistently getting ROCs (during CV) ~0.55-0.60 (not good models obviously, but that's not the point here). After learning that Spark 2.3 introduced a parallelism parameter for the CV class, I decided to implement that and see if increasing the number of Core instances could also help speed up the models (which take several hours, sometimes up to a full day). To make a long story short, I have seen that on some of my datasets, simply by increasing the number of Core instances (e.g., to 2), the ROC scores (*bestValidationMetric*) increase tremendously, to the range of 0.85-0.95. For the life of me I can't figure out why simply increasing the number of instances (with absolutely no changes to the code) would have this effect. I don't know if this is a Spark problem or somehow EMR, but I figured I'd post here and see if anyone has an idea for me. (was: I'm encountering a very strange behavior that I can't explain away other than a bug somewhere. I'm creating RF models on Amazon EMR, normally using 1 Core instance. On these models, I have been consistently getting ROCs (during CV) ~0.55-0.60 (not good models obviously, but that's not the point here). After learning that Spark 2.3 introduced a parallelism parameter for the CV class, I decided to implement that and see if increasing the number of Core instances could also help speed up the models (which take several hours, sometimes up to a full day). To make a long story short, I have seen that on some of my datasets, simply by increasing the number of Core instances (e.g., to 2), the ROC scores increase tremendously, to the range of 0.85-0.95. For the life of me I can't figure out why simply increasing the number of instances (with absolutely no changes to the code) would have this effect. I don't know if this is a Spark problem or somehow EMR, but I figured I'd post here and see if anyone has an idea for me. ) > Artifactual ROC scores when scaling up Random Forest classifier > --- > > Key: SPARK-24866 > URL: https://issues.apache.org/jira/browse/SPARK-24866 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Evan Zamir >Priority: Minor > > I'm encountering a very strange behavior that I can't explain away other than > a bug somewhere. I'm creating RF models on Amazon EMR, normally using 1 Core > instance. On these models, I have been consistently getting ROCs (during CV) > ~0.55-0.60 (not good models obviously, but that's not the point here). After > learning that Spark 2.3 introduced a parallelism parameter for the CV class, > I decided to implement that and see if increasing the number of Core > instances could also help speed up the models (which take several hours, > sometimes up to a full day). To make a long story short, I have seen that on > some of my datasets, simply by increasing the number of Core instances (e.g., to 2), > the ROC scores (*bestValidationMetric*) increase tremendously, to the range of > 0.85-0.95. For the life of me I can't figure out why simply increasing the > number of instances (with absolutely no changes to the code) would have this > effect. 
I don't know if this is a Spark problem or somehow EMR, but I figured > I'd post here and see if anyone has an idea for me. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24866) Artifactual ROC scores when scaling up Random Forest classifier
Evan Zamir created SPARK-24866: -- Summary: Artifactual ROC scores when scaling up Random Forest classifier Key: SPARK-24866 URL: https://issues.apache.org/jira/browse/SPARK-24866 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.3.0 Reporter: Evan Zamir I'm encountering a very strange behavior that I can't explain away other than a bug somewhere. I'm creating RF models on Amazon EMR, normally using 1 Core instance. On these models, I have been consistently getting ROCs (during CV) ~0.55-0.60 (not good models obviously, but that's not the point here). After learning that Spark 2.3 introduced a parallelism parameter for the CV class, I decided to implement that and see if increasing the number of Core instances could also help speed up the models (which take several hours, sometimes up to a full day). To make a long story short, I have seen that on some of my datasets, simply by increasing the number of Core instances (e.g., to 2), the ROC scores increase tremendously, to the range of 0.85-0.95. For the life of me I can't figure out why simply increasing the number of instances (with absolutely no changes to the code) would have this effect. I don't know if this is a Spark problem or somehow EMR, but I figured I'd post here and see if anyone has an idea for me. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
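Editor's note: continuing the sketch under SPARK-26387 above (cv_parallel, grid, and train are assumed from it), the *bestValidationMetric* the reporter refers to can be recovered from avgMetrics, which is ordered like the param grid:

{code:python}
# Editor's sketch, reusing names from the sketch above.
cvModel = cv_parallel.fit(train)

# avgMetrics lines up with estimatorParamMaps, so zipping recovers the
# best cross-validation metric together with the params that produced it.
best_metric, best_params = max(zip(cvModel.avgMetrics, grid),
                               key=lambda pair: pair[0])
print("best ROC AUC:", best_metric)
print("best params:", {p.name: v for p, v in best_params.items()})
{code}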
[jira] [Commented] (SPARK-23684) mode append function not working
[ https://issues.apache.org/jira/browse/SPARK-23684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400823#comment-16400823 ] Evan Zamir commented on SPARK-23684: Yes, you're right. Feel free to close this. > mode append function not working > - > > Key: SPARK-23684 > URL: https://issues.apache.org/jira/browse/SPARK-23684 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.2.0 >Reporter: Evan Zamir >Priority: Minor > > {{df.write.mode('append').jdbc(url, table, properties=\{"driver": > "org.postgresql.Driver"}) }} > produces the following error and does not write to an existing table: > {{2018-03-14 11:00:08,332 root ERROR An error occurred while calling > o894.jdbc.}} > {{: scala.MatchError: null}} > {{ at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:62)}} > {{ at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472)}} > {{ at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)}} > {{ at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)}} > {{ at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)}} > {{ at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)}} > {{ at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)}} > {{ at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)}} > {{ at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)}} > {{ at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)}} > {{ at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)}} > {{ at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)}} > {{ at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)}} > {{ at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)}} > {{ at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)}} > {{ at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)}} > {{ at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:461)}} > {{ at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)}} > {{ at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)}} > {{ at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)}} > {{ at java.lang.reflect.Method.invoke(Method.java:498)}} > {{ at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)}} > {{ at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)}} > {{ at py4j.Gateway.invoke(Gateway.java:280)}} > {{ at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)}} > {{ at py4j.commands.CallCommand.execute(CallCommand.java:79)}} > {{ at py4j.GatewayConnection.run(GatewayConnection.java:214)}} > {{ at java.lang.Thread.run(Thread.java:745)}} > However, > {{df.write.jdbc(url, table, properties=\{"driver": > "org.postgresql.Driver"},mode='append')}} > does not produce an error and adds a row to an existing table. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23684) mode append function not working
Evan Zamir created SPARK-23684: -- Summary: mode append function not working Key: SPARK-23684 URL: https://issues.apache.org/jira/browse/SPARK-23684 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.2.0 Reporter: Evan Zamir {{df.write.mode('append').jdbc(url, table, properties=\{"driver": "org.postgresql.Driver"}) }} produces the following error and does not write to an existing table: {{2018-03-14 11:00:08,332 root ERROR An error occurred while calling o894.jdbc.}} {{: scala.MatchError: null}} {{ at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:62)}} {{ at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472)}} {{ at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)}} {{ at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)}} {{ at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)}} {{ at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)}} {{ at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)}} {{ at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)}} {{ at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)}} {{ at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)}} {{ at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)}} {{ at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)}} {{ at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)}} {{ at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)}} {{ at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)}} {{ at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)}} {{ at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:461)}} {{ at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)}} {{ at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)}} {{ at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)}} {{ at java.lang.reflect.Method.invoke(Method.java:498)}} {{ at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)}} {{ at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)}} {{ at py4j.Gateway.invoke(Gateway.java:280)}} {{ at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)}} {{ at py4j.commands.CallCommand.execute(CallCommand.java:79)}} {{ at py4j.GatewayConnection.run(GatewayConnection.java:214)}} {{ at java.lang.Thread.run(Thread.java:745)}} However, {{df.write.jdbc(url, table, properties=\{"driver": "org.postgresql.Driver"},mode='append')}} does not produce an error and adds a row to an existing table. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
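Editor's note: a self-contained sketch of the two call styles from the report, with placeholder connection details; only the second form succeeded for the reporter on 2.2.0:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "value"])

# Hypothetical connection details; substitute real ones.
url = "jdbc:postgresql://localhost:5432/mydb"
table = "my_table"
props = {"driver": "org.postgresql.Driver"}

# Failed with scala.MatchError: null in the report (Spark 2.2.0):
# df.write.mode('append').jdbc(url, table, properties=props)

# Worked in the report:
df.write.jdbc(url, table, mode='append', properties=props)
{code}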
[jira] [Created] (SPARK-23631) Add summary to RandomForestClassificationModel
Evan Zamir created SPARK-23631: -- Summary: Add summary to RandomForestClassificationModel Key: SPARK-23631 URL: https://issues.apache.org/jira/browse/SPARK-23631 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.3.0 Reporter: Evan Zamir I'm using the RandomForestClassificationModel and noticed that there is no summary attribute like there is for LogisticRegressionModel. Specifically, I'd like to have the roc and pr curves. Is that on the Spark roadmap anywhere? Is there a reason it hasn't been implemented? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
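Editor's note: until such a summary exists, the scalar metrics behind the roc/pr curves can be computed directly with an evaluator; a hedged sketch on toy data, not from the original report:

{code:python}
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Tiny stand-in data with the "label"/"features" columns the ML API expects.
data = spark.createDataFrame(
    [(float(i % 2), Vectors.dense(float(i), float(i % 5))) for i in range(40)],
    ["label", "features"])
train, test = data.randomSplit([0.7, 0.3], seed=1)

model = RandomForestClassifier().fit(train)
predictions = model.transform(test)

# Scalar stand-ins for the missing summary's roc/pr curves; the evaluator
# reads the model's "rawPrediction" column by default.
for metric in ("areaUnderROC", "areaUnderPR"):
    evaluator = BinaryClassificationEvaluator(metricName=metric)
    print(metric, "=", evaluator.evaluate(predictions))
{code}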
[jira] [Created] (SPARK-20182) Dot in DataFrame Column title causes errors
Evan Zamir created SPARK-20182: -- Summary: Dot in DataFrame Column title causes errors Key: SPARK-20182 URL: https://issues.apache.org/jira/browse/SPARK-20182 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.1.0 Reporter: Evan Zamir I did a search and saw this issue pop up before, and while it seemed like it had been solved before 2.1, I'm still seeing an error. {code:python} emp = spark.createDataFrame([(["Joe", "Bob", "Mary"],), (["Mike", "Matt", "Stacy"],)], ["first.names"]) print(emp.collect()) emp.select(['first.names']).alias('first') {code} [Row(first.names=['Joe', 'Bob', 'Mary']), Row(first.names=['Mike', 'Matt', 'Stacy'])] Py4JJavaError Traceback (most recent call last) /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 62 try: ---> 63 return f(*a, **kw) 64 except py4j.protocol.Py4JJavaError as e: /usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 318 "An error occurred while calling {0}{1}{2}.\n". --> 319 format(target_id, ".", name), value) 320 else: Py4JJavaError: An error occurred while calling o1734.select. : org.apache.spark.sql.AnalysisException: cannot resolve '`first.names`' given input columns: [first.names];; 'Project ['first.names] +- LogicalRDD [first.names#466] at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:296) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67) at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822) at org.apache.spark.sql.Dataset.select(Dataset.scala:1121) at sun.reflect.GeneratedMethodAccessor52.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at
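Editor's note: the usual way around the dotted-column error above (not from the original report) is to escape the name in backticks, so the parser treats it as one column named "first.names" rather than field "names" inside a struct column "first":

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
emp = spark.createDataFrame([(["Joe", "Bob", "Mary"],),
                             (["Mike", "Matt", "Stacy"],)],
                            ["first.names"])

# Backticks tell the analyzer this is a single column whose name
# happens to contain a dot.
emp.select(col("`first.names`").alias("first")).show()
{code}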
[jira] [Created] (SPARK-17923) dateFormat unexpected kwarg to df.write.csv
Evan Zamir created SPARK-17923: -- Summary: dateFormat unexpected kwarg to df.write.csv Key: SPARK-17923 URL: https://issues.apache.org/jira/browse/SPARK-17923 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0 Reporter: Evan Zamir Priority: Minor Calling it like this: {code}writer.csv(path, header=header, sep=sep, compression=compression, dateFormat=date_format){code} Getting the following error: {code}TypeError: csv() got an unexpected keyword argument 'dateFormat'{code} This error comes after being called with {code}date_format='yyyy-MM-dd'{code} as an argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
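Editor's note: a hedged workaround sketch for 2.0.0 — route dateFormat through option(), which the DataFrameWriter accepts even where the csv() keyword is missing; the output path is a placeholder:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = (spark.createDataFrame([("2016-10-13",)], ["d"])
           .selectExpr("cast(d as date) as d"))

# Pass dateFormat as a writer option instead of a csv() keyword;
# "/tmp/dates_csv" is an assumed output path.
(df.write
   .option("header", "true")
   .option("dateFormat", "yyyy-MM-dd")
   .csv("/tmp/dates_csv"))
{code}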
[jira] [Comment Edited] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15491186#comment-15491186 ] Evan Zamir edited comment on SPARK-17508 at 9/14/16 6:53 PM: - Honestly, if the documentation was just more explicit, users wouldn't be so confused. But when it says {{weightCol=None}}, there's only one way we can interpret that in Python, and it happens to produce an error. Why doesn't someone just change the docstring to read {{weightCol=""}} (which apparently is the way one has to write the code to run without error)? was (Author: zamir.e...@gmail.com): Honestly, if the documentation was just more explicit, users wouldn't be so confused. But when it says {{weightCol=None}}, there's only one way we can interpret that in Python, and it happens to produce an error. Why doesn't someone just change the docstring to read {{weightCol=""}}? > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir >Priority: Minor > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). > However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
> : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15491186#comment-15491186 ] Evan Zamir edited comment on SPARK-17508 at 9/14/16 6:52 PM: - Honestly, if the documentation was just more explicit, users wouldn't be so confused. But when it says {{weightCol=None}}, there's only one way we can interpret that in Python, and it happens to produce an error. Why doesn't someone just change the docstring to read {{weightCol=""}}? was (Author: zamir.e...@gmail.com): Honestly, if the documentation was just more explicit, users wouldn't be so confused. But when it says `weightCol=None`, there's only one way we can interpret that in Python, and it happens to produce an error. Why doesn't someone just change the docstring to read `weightCol=""`? > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir >Priority: Minor > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). > However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
> : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15491186#comment-15491186 ] Evan Zamir commented on SPARK-17508: Honestly, if the documentation was just more explicit, users wouldn't be so confused. But when it says `weightCol=None`, there's only one way we can interpret that in Python, and it happens to produce an error. Why doesn't someone just change the docstring to read `weightCol=""`? > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir >Priority: Minor > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). > However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
> : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485782#comment-15485782 ] Evan Zamir commented on SPARK-17508: [~bryanc] Oh, that helps a lot! I've been writing very light wrappers around Spark functions and it wasn't clear to me whether I could keep weightCol as an optional parameter. At least now I can reason about how to do it better. I guess this isn't so much a bug then, as it is a feature request. So if someone wants to close the issue or reclassify, that would make sense. I can only imagine I'm not the only Spark user who has been miffed by this. > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). > However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
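Editor's note: the light-wrapper approach mentioned in this comment, sketched under the assumption that the right behavior is to omit weightCol entirely when it is None rather than forward it:

{code:python}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

def make_lr(weight_col=None, **kwargs):
    # Forward weightCol only when provided; passing an explicit None
    # through to the JVM is what triggers the NullPointerException.
    if weight_col is not None:
        kwargs["weightCol"] = weight_col
    return LogisticRegression(**kwargs)

spark = SparkSession.builder.appName('WeightBug').getOrCreate()
df = spark.createDataFrame(
    [(1.0, 1.0, Vectors.dense(1.0)), (0.0, 1.0, Vectors.dense(-1.0))],
    ["label", "weight", "features"])

make_lr(maxIter=5, regParam=0.0).fit(df)                       # weights default to 1.0
make_lr(weight_col="weight", maxIter=5, regParam=0.0).fit(df)  # explicit weights
{code}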
> : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484861#comment-15484861 ] Evan Zamir commented on SPARK-17508: Just ran the same snippet of code setting weightCol="" and that runs without error. It's only when I set weightCol=None that I get an error. > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). > However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. > : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484850#comment-15484850 ] Evan Zamir commented on SPARK-17508: Yep, I'm running 2.0.0. You can see in the error messages above that it's running 2.0.0. Can you try running the same code snippet and see if it works for you? > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). > However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. > : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17508) Setting weightCol to None in ML library causes an error
Evan Zamir created SPARK-17508: -- Summary: Setting weightCol to None in ML library causes an error Key: SPARK-17508 URL: https://issues.apache.org/jira/browse/SPARK-17508 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0 Reporter: Evan Zamir The following code runs without error: {code} spark = SparkSession.builder.appName('WeightBug').getOrCreate() df = spark.createDataFrame( [ (1.0, 1.0, Vectors.dense(1.0)), (0.0, 1.0, Vectors.dense(-1.0)) ], ["label", "weight", "features"]) lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") model = lr.fit(df) {code} My expectation from reading the documentation is that setting weightCol=None should treat all weights as 1.0 (regardless of whether a column exists). However, the same code with weightCol set to None causes the following errors: Traceback (most recent call last): File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in model = lr.fit(df) File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line 64, in fit return self._fit(dataset) File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 213, in _fit java_model = self._fit_java(dataset) File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 210, in _fit_java return self._java_obj.fit(dataset._jdf) File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__ File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. : java.lang.NullPointerException at org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) at org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) at org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:211) at java.lang.Thread.run(Thread.java:745) Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org