[jira] [Commented] (SPARK-4328) Python serialization updates make Python ML API more brittle to types

2014-11-10 Thread Joseph K. Bradley (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-4328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205458#comment-14205458 ]

Joseph K. Bradley commented on SPARK-4328:
--

[~atalwalkar]  Thanks for pointing this out!

 Python serialization updates make Python ML API more brittle to types
 -

 Key: SPARK-4328
 URL: https://issues.apache.org/jira/browse/SPARK-4328
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 In Spark 1.1, you could create a LabeledPoint with labels specified as 
 integers, and then use it with LinearRegression.  This was broken by the 
 Python API updates since then.  E.g., this code runs in the 1.1 branch but 
 not in the current master:
 {code}
 from pyspark.mllib.regression import *
 import numpy
 features = numpy.ndarray((3))
 data = sc.parallelize([LabeledPoint(1, features)])
 LinearRegressionWithSGD.train(data)
 {code}
 Recommendation: Allow users to use integers from Python.
 The error message you get is:
 {code}
 py4j.protocol.Py4JJavaError: An error occurred while calling 
 o55.trainLinearRegressionModelWithSGD.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 
 in stage 3.0 failed 1 times, most recent failure: Lost task 7.0 in stage 3.0 
 (TID 15, localhost): java.lang.ClassCastException: java.lang.Integer cannot 
 be cast to java.lang.Double
   at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
   at 
 org.apache.spark.mllib.api.python.SerDe$LabeledPointPickler.construct(PythonMLLibAPI.scala:727)
   at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
   at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
   at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
   at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
   at 
 org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:804)
   at 
 org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:803)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1309)
   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}
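 A minimal sketch of what the recommendation implies on the Python side: coerce whatever numeric label the user supplies to a plain {{float}} before the point is pickled, so the JVM unpickler always receives a Double. The helper name below is hypothetical, not part of the pyspark API:

```python
# Hypothetical sketch: normalize user-supplied labels to float before they are
# pickled and shipped to the JVM, so unboxToDouble never sees a java.lang.Integer.
# normalize_label is illustrative only; it is not a real pyspark function.

def normalize_label(label):
    """Coerce an int, bool, or numpy scalar label to a plain Python float."""
    return float(label)

print(normalize_label(1))      # an int label becomes 1.0
print(normalize_label(2.5))    # float labels pass through unchanged
```

 Applying this coercion inside the LabeledPoint constructor (rather than asking callers to do it) would restore the 1.1 behavior shown in the repro above.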



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4328) Python serialization updates make Python ML API more brittle to types

2014-11-10 Thread Joseph K. Bradley (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-4328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205457#comment-14205457 ]

Joseph K. Bradley commented on SPARK-4328:
--

Both are related to Python API SerDe updates.
