[jira] [Commented] (SPARK-4113) Python UDF on ArrayType
[ https://issues.apache.org/jira/browse/SPARK-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188007#comment-14188007 ]

Apache Spark commented on SPARK-4113:
-------------------------------------

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2973

Python UDF on ArrayType
-----------------------

Key: SPARK-4113
URL: https://issues.apache.org/jira/browse/SPARK-4113
Project: Spark
Issue Type: Bug
Components: PySpark, SQL
Affects Versions: 1.2.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker
Fix For: 1.2.0

From Matei: I have a table where column c is of type array<int>. However, the following set of commands fails:

sqlContext.registerFunction("py_func", lambda a: len(a))
%sql select py_func(c) from some_temp

Error in SQL statement:
java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 252.0 failed 4 times, most recent failure: Lost task 2.3 in stage 252.0 (TID 8454, ip-10-0-157-104.us-west-2.compute.internal): net.razorvine.pickle.PickleException: couldn't introspect javabean: java.lang.IllegalArgumentException: wrong number of arguments
    net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603)
    net.razorvine.pickle.Pickler.dispatch(Pickler.java:299)
    net.razorvine.pickle.Pickler.save(Pickler.java:125)
    net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392)
    net.razorvine.pickle.Pickler.dispatch(Pickler.java:195)
    net.razorvine.pickle.Pickler.save(Pickler.java:125)
    net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392)
    net.razorvine.pickle.Pickler.dispatch(Pickler.java:195)
    net.razorvine.pickle.Pickler.save(Pickler.java:125)
    net.razorvine.pickle.Pickler.dump(Pickler.java:95)

The same function works if I select a Row from my table into Python and call it on its third column.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
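For context, the Python side of the failing UDF is trivial; the error occurs when the JVM tries to pickle the array column's elements back to Python. A minimal standalone sketch of the UDF logic (the sample column values are invented; in PySpark the function would be registered via sqlContext.registerFunction, as in the repro above):

```python
# Hypothetical standalone version of the UDF body from the report: the Python
# side simply computes len() over the array (list) value of column c per row.
def py_func(a):
    return len(a)

# Invented sample values standing in for an array<int> column:
column_c = [[1, 2, 3], [4, 5], []]

# Equivalent of "select py_func(c) from some_temp", applied row by row:
results = [py_func(v) for v in column_c]
```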
[jira] [Commented] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188012#comment-14188012 ]

Xiangrui Meng commented on SPARK-3080:
--------------------------------------

Thanks for confirming the issue! I guess this could be a serialization issue. Did you observe any executor loss during the computation, or in-memory cached RDDs switching to on-disk storage?

[~derenrich] Which public dataset are you using? Could you also let me know all the ALS parameters and custom Spark settings you used? Thanks!

[~ilganeli] If you do need to run ALS on the full dataset, I recommend using the new ALS implementation at https://github.com/mengxr/spark-als/blob/master/src/main/scala/org/apache/spark/ml/SimpleALS.scala
It should perform better. But it is not merged yet.

ArrayIndexOutOfBoundsException in ALS for Large datasets
--------------------------------------------------------

Key: SPARK-3080
URL: https://issues.apache.org/jira/browse/SPARK-3080
Project: Spark
Issue Type: Bug
Components: MLlib
Reporter: Burak Yavuz

The stack trace is below:

{quote}
java.lang.ArrayIndexOutOfBoundsException: 2716
    org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
    scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)
    org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)
    org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)
    org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
    org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
    scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
    org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
    org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
    scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
{quote}

This happened after the dataset was sub-sampled.

Dataset properties: ~12B ratings
Setup: 55 r3.8xlarge EC2 instances
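The top frame shows updateBlock indexing past the end of a per-block factor array (index 2716). A hypothetical, Spark-free sketch of that failure mode (the block structure here is invented for illustration and is not MLlib's actual block layout; only the index 2716 comes from the trace):

```python
# Hypothetical sketch: a block holds factor vectors for a fixed number of
# local IDs. A rating whose local index falls at or past the block's length
# triggers the Java-side ArrayIndexOutOfBoundsException seen in the trace.
RANK = 10  # invented factor dimension
block_factors = [[0.0] * RANK for _ in range(2716)]  # valid indices: 0..2715

def lookup_factors(local_index, factors):
    if local_index >= len(factors):
        # analogous to java.lang.ArrayIndexOutOfBoundsException: 2716
        raise IndexError(local_index)
    return factors[local_index]
```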
[jira] [Created] (SPARK-4129) Performance tuning in MultivariateOnlineSummarizer
DB Tsai created SPARK-4129:
---------------------------

Summary: Performance tuning in MultivariateOnlineSummarizer
Key: SPARK-4129
URL: https://issues.apache.org/jira/browse/SPARK-4129
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: DB Tsai

In MultivariateOnlineSummarizer, breeze's activeIterator is used to loop through the nonzero elements in the vector. However, activeIterator doesn't perform well due to lots of overhead. In this PR, a native while loop is used for both DenseVector and SparseVector.

Benchmark results with 20 executors on the mnist8m dataset:

Before:
  DenseVector:  48.2 seconds
  SparseVector: 16.3 seconds
After:
  DenseVector:  17.8 seconds
  SparseVector: 11.2 seconds

Since MultivariateOnlineSummarizer is used in several places, the overall performance gain in the MLlib library will be significant with this PR.
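The idea of the change can be sketched outside Scala. A hypothetical Python version (function and variable names invented; the real change lives in MultivariateOnlineSummarizer's Scala code): replace a generic active-element iterator with a plain index loop over the sparse vector's backing (indices, values) arrays, skipping explicit zeros.

```python
# Hypothetical sketch of the described optimization: walk a sparse vector's
# backing arrays directly with a while loop, instead of going through a
# generic active-element iterator with per-element overhead.
def add_sparse_sample(indices, values, sums, counts):
    i = 0
    while i < len(indices):
        v = values[i]
        if v != 0.0:  # only nonzero elements contribute to the summary
            j = indices[i]
            sums[j] += v
            counts[j] += 1
        i += 1
    return sums, counts
```

The dense case is the same loop over all positions; the win in both cases comes from avoiding iterator allocation and dispatch per element.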
[jira] [Commented] (SPARK-4129) Performance tuning in MultivariateOnlineSummarizer
[ https://issues.apache.org/jira/browse/SPARK-4129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188037#comment-14188037 ]

Apache Spark commented on SPARK-4129:
-------------------------------------

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/2992

Performance tuning in MultivariateOnlineSummarizer
--------------------------------------------------

Key: SPARK-4129
URL: https://issues.apache.org/jira/browse/SPARK-4129
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: DB Tsai

In MultivariateOnlineSummarizer, breeze's activeIterator is used to loop through the nonzero elements in the vector. However, activeIterator doesn't perform well due to lots of overhead. In this PR, a native while loop is used for both DenseVector and SparseVector.

Benchmark results with 20 executors on the mnist8m dataset:

Before:
  DenseVector:  48.2 seconds
  SparseVector: 16.3 seconds
After:
  DenseVector:  17.8 seconds
  SparseVector: 11.2 seconds

Since MultivariateOnlineSummarizer is used in several places, the overall performance gain in the MLlib library will be significant with this PR.