[jira] [Commented] (SPARK-4113) Python UDF on ArrayType

2014-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188007#comment-14188007
 ] 

Apache Spark commented on SPARK-4113:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2973

 Python UDF on ArrayType
 --

 Key: SPARK-4113
 URL: https://issues.apache.org/jira/browse/SPARK-4113
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.2.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker
 Fix For: 1.2.0


 From Matei:
 I have a table where column c is of type array<int>. However, the following 
 set of commands fails:
 sqlContext.registerFunction("py_func", lambda a: len(a))
 %sql select py_func(c) from some_temp
 Error in SQL statement: java.lang.RuntimeException: 
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in 
 stage 252.0 failed 4 times, most recent failure: Lost task 2.3 in stage 252.0 
 (TID 8454, ip-10-0-157-104.us-west-2.compute.internal): 
 net.razorvine.pickle.PickleException: couldn't introspect javabean: 
 java.lang.IllegalArgumentException: wrong number of arguments
 net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603)
 net.razorvine.pickle.Pickler.dispatch(Pickler.java:299)
 net.razorvine.pickle.Pickler.save(Pickler.java:125)
 net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392)
 net.razorvine.pickle.Pickler.dispatch(Pickler.java:195)
 net.razorvine.pickle.Pickler.save(Pickler.java:125)
 net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392)
 net.razorvine.pickle.Pickler.dispatch(Pickler.java:195)
 net.razorvine.pickle.Pickler.save(Pickler.java:125)
 net.razorvine.pickle.Pickler.dump(Pickler.java:95)
 The same function works if I select a Row from my table into Python and call 
 it on its third column.
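For context, a minimal end-to-end reproduction of the report might look like the sketch below. It uses the 1.2-era PySpark API; the table setup and contents are assumptions for illustration (only py_func, c, and some_temp come from the report).

{code}
# Hypothetical reproduction sketch (Spark 1.2-era PySpark API); the table
# contents below are made up for illustration.
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="spark-4113-repro")
sqlContext = SQLContext(sc)

# Build a small table with an array<int> column named c.
rows = sc.parallelize([Row(c=[1, 2, 3]), Row(c=[4, 5])])
sqlContext.inferSchema(rows).registerTempTable("some_temp")

# Register the Python UDF and apply it to the array column, as in the report.
sqlContext.registerFunction("py_func", lambda a: len(a))
print(sqlContext.sql("SELECT py_func(c) FROM some_temp").collect())
{code}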






[jira] [Commented] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets

2014-10-28 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188012#comment-14188012
 ] 

Xiangrui Meng commented on SPARK-3080:
--

Thanks for confirming the issue! My guess is that this could be a serialization 
problem. Did you observe any executor loss during the computation, or any 
in-memory cached RDDs falling back to on-disk storage?

[~derenrich] Which public dataset are you using? Could you also let me know all 
the ALS parameters and custom Spark settings you used? Thanks!

[~ilganeli] If you do need to run ALS on the full dataset, I recommend using 
the new ALS implementation at

https://github.com/mengxr/spark-als/blob/master/src/main/scala/org/apache/spark/ml/SimpleALS.scala

It should perform better, but it has not been merged yet.
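For reference, the ALS parameters being asked about are the arguments to ALS.train. A hypothetical PySpark invocation that spells them out (the input path and parameter values are placeholders, not the reporter's settings) might look like:

{code}
# Hypothetical illustration of the ALS parameters in question; the input path
# and parameter values are placeholders, not taken from this issue.
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="als-parameters-example")

# Ratings as (user, product, rating) triples.
ratings = (sc.textFile("hdfs:///path/to/ratings.csv")
             .map(lambda line: line.split(","))
             .map(lambda p: Rating(int(p[0]), int(p[1]), float(p[2]))))

model = ALS.train(
    ratings,
    rank=20,        # number of latent factors
    iterations=10,  # number of ALS iterations
    lambda_=0.01,   # regularization parameter
    blocks=-1)      # number of user/product blocks (-1 lets Spark choose)
{code}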

 ArrayIndexOutOfBoundsException in ALS for Large datasets
 

 Key: SPARK-3080
 URL: https://issues.apache.org/jira/browse/SPARK-3080
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Burak Yavuz

 The stack trace is below:
 {quote}
 java.lang.ArrayIndexOutOfBoundsException: 2716
 
 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
 scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 
 org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)
 
 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)
 
 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)
 
 org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
 
 org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
 
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
 org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 {quote}
 This happened after the dataset was sub-sampled.
 Dataset properties: ~12B ratings
 Setup: 55 r3.8xlarge EC2 instances






[jira] [Created] (SPARK-4129) Performance tuning in MultivariateOnlineSummarizer

2014-10-28 Thread DB Tsai (JIRA)
DB Tsai created SPARK-4129:
--

 Summary: Performance tuning in MultivariateOnlineSummarizer
 Key: SPARK-4129
 URL: https://issues.apache.org/jira/browse/SPARK-4129
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: DB Tsai


In MultivariateOnlineSummarizer, breeze's activeIterator is used to loop 
through the nonzero elements in the vector. However, activeIterator doesn't 
perform well due to its significant overhead. In this PR, a native while loop 
is used for both DenseVector and SparseVector.

The benchmark result with 20 executors on the mnist8m dataset:

Before:
DenseVector: 48.2 seconds
SparseVector: 16.3 seconds

After:
DenseVector: 17.8 seconds
SparseVector: 11.2 seconds

Since MultivariateOnlineSummarizer is used in several places, the overall 
performance gain in the MLlib library from this PR will be significant.
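The actual change is in the Scala implementation of MultivariateOnlineSummarizer; purely as a conceptual sketch of the idea (walk the vector's underlying arrays with an explicit loop rather than a generic iterator), the pattern looks roughly like this in Python using pyspark.mllib.linalg vectors, not the actual patch:

{code}
# Conceptual sketch only: the real change lives in the Scala implementation of
# MultivariateOnlineSummarizer. This just illustrates iterating a vector's
# underlying arrays with an explicit loop instead of a generic iterator.
from pyspark.mllib.linalg import DenseVector, SparseVector

def sum_nonzero(v):
    """Sum the nonzero entries of a DenseVector or SparseVector."""
    total = 0.0
    if isinstance(v, SparseVector):
        # A SparseVector keeps only its nonzero entries in a values array.
        values = v.values
        i = 0
        while i < len(values):
            total += values[i]
            i += 1
    else:
        values = v.toArray()
        i = 0
        while i < len(values):
            if values[i] != 0.0:
                total += values[i]
            i += 1
    return total

print(sum_nonzero(SparseVector(5, [1, 3], [2.0, 4.0])))  # 6.0
print(sum_nonzero(DenseVector([1.0, 0.0, 3.0])))         # 4.0
{code}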







[jira] [Commented] (SPARK-4129) Performance tuning in MultivariateOnlineSummarizer

2014-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188037#comment-14188037
 ] 

Apache Spark commented on SPARK-4129:
-

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/2992

 Performance tuning in MultivariateOnlineSummarizer
 --

 Key: SPARK-4129
 URL: https://issues.apache.org/jira/browse/SPARK-4129
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: DB Tsai

 In MultivariateOnlineSummarizer, breeze's activeIterator is used to loop 
 through the nonzero elements in the vector. However, activeIterator doesn't 
 perform well due to its significant overhead. In this PR, a native while loop 
 is used for both DenseVector and SparseVector.
 The benchmark result with 20 executors on the mnist8m dataset:
 Before:
 DenseVector: 48.2 seconds
 SparseVector: 16.3 seconds
 After:
 DenseVector: 17.8 seconds
 SparseVector: 11.2 seconds
 Since MultivariateOnlineSummarizer is used in several places, the overall 
 performance gain in the MLlib library from this PR will be significant.





