[jira] [Commented] (SPARK-4328) Python serialization updates make Python ML API more brittle to types
[ https://issues.apache.org/jira/browse/SPARK-4328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205458#comment-14205458 ]

Joseph K. Bradley commented on SPARK-4328:
------------------------------------------

[~atalwalkar] Thanks for pointing this out!

Python serialization updates make Python ML API more brittle to types
---------------------------------------------------------------------

Key: SPARK-4328
URL: https://issues.apache.org/jira/browse/SPARK-4328
Project: Spark
Issue Type: Improvement
Components: MLlib, PySpark
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

In Spark 1.1, you could create a LabeledPoint with labels specified as integers, and then use it with LinearRegression. This was broken by the Python API updates since then. E.g., this code runs in the 1.1 branch but not in the current master:

{code}
from pyspark.mllib.regression import *
import numpy
features = numpy.ndarray((3))
data = sc.parallelize([LabeledPoint(1, features)])
LinearRegressionWithSGD.train(data)
{code}

Recommendation: Allow users to use integers from Python.

The error message you get is:

{code}
py4j.protocol.Py4JJavaError: An error occurred while calling o55.trainLinearRegressionModelWithSGD.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 3.0 failed 1 times, most recent failure: Lost task 7.0 in stage 3.0 (TID 15, localhost): java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double
    at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
    at org.apache.spark.mllib.api.python.SerDe$LabeledPointPickler.construct(PythonMLLibAPI.scala:727)
    at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
    at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
    at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
    at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
    at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:804)
    at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:803)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1309)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
    at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
    at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4328) Python serialization updates make Python ML API more brittle to types
[ https://issues.apache.org/jira/browse/SPARK-4328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205457#comment-14205457 ]

Joseph K. Bradley commented on SPARK-4328:
------------------------------------------

both related to Python API SerDe updates
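
The ClassCastException in the issue description looks consistent with how pickle encodes the label: a Python int and a Python float are written with different opcodes, and the trace shows the JVM-side unpickler handing LabeledPointPickler.construct an Integer where it unboxes a Double. The following is a minimal sketch, not Spark code, that makes the difference visible with the standard library alone and shows one possible caller-side workaround: coercing the label to float before building the LabeledPoint. The helper names label_opcodes and as_double_label are hypothetical, and pickle protocol 2 is an assumption chosen for illustration.

```python
import numbers
import pickle
import pickletools


def label_opcodes(label, protocol=2):
    """Return the names of the pickle opcodes used to encode `label`.

    Hypothetical helper for illustration; protocol 2 is an assumption.
    """
    payload = pickle.dumps(label, protocol)
    return [opcode.name for opcode, _arg, _pos in pickletools.genops(payload)]


def as_double_label(label):
    """Coerce a numeric label to a Python float (hypothetical helper).

    A float is pickled with a float opcode rather than an int opcode,
    so the receiving side sees a double-typed value.
    """
    if not isinstance(label, numbers.Real):
        raise TypeError("label must be a real number, got %r" % (label,))
    return float(label)


# An int label and a float label are encoded differently on the wire.
print(label_opcodes(1))                   # ['PROTO', 'BININT1', 'STOP']
print(label_opcodes(as_double_label(1)))  # ['PROTO', 'BINFLOAT', 'STOP']

# Assumed usage with the snippet from the issue (untested here, needs a
# running SparkContext `sc`):
#   data = sc.parallelize([LabeledPoint(as_double_label(1), features)])
```

This only sidesteps the problem on the caller's side. The recommendation in the issue, accepting integers from Python, would still have to be addressed in the API itself.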