[jira] [Commented] (SPARK-3721) Broadcast Variables above 2GB break in PySpark

2014-10-05 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159453#comment-14159453
 ] 

Davies Liu commented on SPARK-3721:
---

[~bmiller1] Here is a PR for this issue: https://github.com/apache/spark/pull/2659, 
could you help test it?

 Broadcast Variables above 2GB break in PySpark
 --

 Key: SPARK-3721
 URL: https://issues.apache.org/jira/browse/SPARK-3721
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Brad Miller
Assignee: Davies Liu

 The bug displays 3 unique failure modes in PySpark, all of which seem to be 
 related to broadcast variable size. Note that the tests below ran python 
 2.7.3 on all machines and used the Spark 1.1.0 binaries.
 **BLOCK 1** [no problem]
 {noformat}
 import cPickle
 from pyspark import SparkContext
 def check_pre_serialized(size):
     msg = cPickle.dumps(range(2 ** size))
     print 'serialized length:', len(msg)
     bvar = sc.broadcast(msg)
     print 'length recovered from broadcast variable:', len(bvar.value)
     print 'correct value recovered:', msg == bvar.value
     bvar.unpersist()
 
 def check_unserialized(size):
     msg = range(2 ** size)
     bvar = sc.broadcast(msg)
     print 'correct value recovered:', msg == bvar.value
     bvar.unpersist()
 
 SparkContext.setSystemProperty('spark.executor.memory', '15g')
 SparkContext.setSystemProperty('spark.cores.max', '5')
 sc = SparkContext('spark://crosby.research.intel-research.net:7077',
                   'broadcast_bug')
 {noformat}
 **BLOCK 2**  [no problem]
 {noformat}
 check_pre_serialized(20)
  serialized length: 9374656
  length recovered from broadcast variable: 9374656
  correct value recovered: True
 {noformat}
 **BLOCK 3**  [no problem]
 {noformat}
 check_unserialized(20)
  correct value recovered: True
 {noformat}
 **BLOCK 4**  [no problem]
 {noformat}
 check_pre_serialized(27)
  serialized length: 1499501632
  length recovered from broadcast variable: 1499501632
  correct value recovered: True
 {noformat}
 **BLOCK 5**  [no problem]
 {noformat}
 check_unserialized(27)
  correct value recovered: True
 {noformat}
 **BLOCK 6**  **[ERROR 1: unhandled error from cPickle.dumps inside 
 sc.broadcast]**
 {noformat}
 check_pre_serialized(28)
 .
  /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj)
  354
  355 def dumps(self, obj):
  --> 356 return cPickle.dumps(obj, 2)
  357
  358 loads = cPickle.loads
 
  SystemError: error return without exception set
 {noformat}
 **BLOCK 7**  [no problem]
 {noformat}
 check_unserialized(28)
  correct value recovered: True
 {noformat}
 **BLOCK 8**  **[ERROR 2: no error occurs and *incorrect result* is returned]**
 {noformat}
 check_pre_serialized(29)
  serialized length: 6331339840
  length recovered from broadcast variable: 2036372544
  correct value recovered: False
 {noformat}
 **BLOCK 9**  **[ERROR 3: unhandled error from zlib.compress inside 
 sc.broadcast]**
 {noformat}
 check_unserialized(29)
 ..
  /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj)
  418 
  419 def dumps(self, obj):
  --> 420 return zlib.compress(self.serializer.dumps(obj), 1)
  421 
  422 def loads(self, obj):
  
  OverflowError: size does not fit in an int
 {noformat}
 **BLOCK 10**  [ERROR 1]
 {noformat}
 check_pre_serialized(30)
 ...same as above...
 {noformat}
 **BLOCK 11**  [ERROR 3]
 {noformat}
 check_unserialized(30)
 ...same as above...
 {noformat}
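 A quick arithmetic note on ERROR 2: the length recovered in BLOCK 8 is exactly the 
 true serialized length reduced modulo 2^32, which suggests the payload length is 
 being squeezed through a 32-bit field somewhere on the broadcast path:
 {noformat}
 6331339840 - 2^32 = 6331339840 - 4294967296 = 2036372544
 {noformat}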






[jira] [Commented] (SPARK-3721) Broadcast Variables above 2GB break in PySpark

2014-10-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159454#comment-14159454
 ] 

Apache Spark commented on SPARK-3721:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2659

 Broadcast Variables above 2GB break in PySpark
 --

 Key: SPARK-3721
 URL: https://issues.apache.org/jira/browse/SPARK-3721
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Brad Miller
Assignee: Davies Liu

 The bug displays 3 unique failure modes in PySpark, all of which seem to be 
 related to broadcast variable size. Note that the tests below ran python 
 2.7.3 on all machines and used the Spark 1.1.0 binaries.
 **BLOCK 1** [no problem]
 {noformat}
 import cPickle
 from pyspark import SparkContext
 def check_pre_serialized(size):
     msg = cPickle.dumps(range(2 ** size))
     print 'serialized length:', len(msg)
     bvar = sc.broadcast(msg)
     print 'length recovered from broadcast variable:', len(bvar.value)
     print 'correct value recovered:', msg == bvar.value
     bvar.unpersist()
 
 def check_unserialized(size):
     msg = range(2 ** size)
     bvar = sc.broadcast(msg)
     print 'correct value recovered:', msg == bvar.value
     bvar.unpersist()
 
 SparkContext.setSystemProperty('spark.executor.memory', '15g')
 SparkContext.setSystemProperty('spark.cores.max', '5')
 sc = SparkContext('spark://crosby.research.intel-research.net:7077',
                   'broadcast_bug')
 {noformat}
 **BLOCK 2**  [no problem]
 {noformat}
 check_pre_serialized(20)
  serialized length: 9374656
  length recovered from broadcast variable: 9374656
  correct value recovered: True
 {noformat}
 **BLOCK 3**  [no problem]
 {noformat}
 check_unserialized(20)
  correct value recovered: True
 {noformat}
 **BLOCK 4**  [no problem]
 {noformat}
 check_pre_serialized(27)
  serialized length: 1499501632
  length recovered from broadcast variable: 1499501632
  correct value recovered: True
 {noformat}
 **BLOCK 5**  [no problem]
 {noformat}
 check_unserialized(27)
  correct value recovered: True
 {noformat}
 **BLOCK 6**  **[ERROR 1: unhandled error from cPickle.dumps inside 
 sc.broadcast]**
 {noformat}
 check_pre_serialized(28)
 .
  /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj)
  354
  355 def dumps(self, obj):
  --> 356 return cPickle.dumps(obj, 2)
  357
  358 loads = cPickle.loads
 
  SystemError: error return without exception set
 {noformat}
 **BLOCK 7**  [no problem]
 {noformat}
 check_unserialized(28)
  correct value recovered: True
 {noformat}
 **BLOCK 8**  **[ERROR 2: no error occurs and *incorrect result* is returned]**
 {noformat}
 check_pre_serialized(29)
  serialized length: 6331339840
  length recovered from broadcast variable: 2036372544
  correct value recovered: False
 {noformat}
 **BLOCK 9**  **[ERROR 3: unhandled error from zlib.compress inside 
 sc.broadcast]**
 {noformat}
 check_unserialized(29)
 ..
  /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj)
  418 
  419 def dumps(self, obj):
  --> 420 return zlib.compress(self.serializer.dumps(obj), 1)
  421 
  422 def loads(self, obj):
  
  OverflowError: size does not fit in an int
 {noformat}
 **BLOCK 10**  [ERROR 1]
 {noformat}
 check_pre_serialized(30)
 ...same as above...
 {noformat}
 **BLOCK 11**  [ERROR 3]
 {noformat}
 check_unserialized(30)
 ...same as above...
 {noformat}






[jira] [Commented] (SPARK-3801) Make app dir cleanup more efficient

2014-10-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159458#comment-14159458
 ] 

Apache Spark commented on SPARK-3801:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/2660

 Make app dir cleanup more efficient
 ---

 Key: SPARK-3801
 URL: https://issues.apache.org/jira/browse/SPARK-3801
 Project: Spark
  Issue Type: Improvement
Reporter: Liang-Chi Hsieh
Priority: Minor

 The newly merged, more conservative app directory cleanup can be made more 
 efficient. Since there is no need to collect the newer files, using 'exists' 
 instead of 'filter' is more efficient because redundant items are skipped.
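 For illustration, a minimal sketch of the idea (the helper below is hypothetical, 
 not the actual worker code): 'exists' short-circuits at the first match, whereas 
 'filter' materializes every matching entry before the emptiness check.
 {code}
 import java.io.File
 
 // Hypothetical helper: does `dir` contain any entry modified after `cutoffMs`?
 def hasRecentEntries(dir: File, cutoffMs: Long): Boolean = {
   val entries = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty[File])
   // Before: entries.filter(_.lastModified() > cutoffMs).nonEmpty  // builds the whole list of matches
   // After: stops scanning as soon as one sufficiently new entry is found
   entries.exists(_.lastModified() > cutoffMs)
 }
 {code}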






[jira] [Created] (SPARK-3802) Scala version is wrong in dev/audit-release/blank_sbt_build/build.sbt

2014-10-05 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-3802:
-

 Summary: Scala version is wrong in 
dev/audit-release/blank_sbt_build/build.sbt
 Key: SPARK-3802
 URL: https://issues.apache.org/jira/browse/SPARK-3802
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
Priority: Minor


In dev/audit-release/blank_sbt_build/build.sbt, scalaVersion indicates 2.9.3 
but I think 2.10.4 is correct.
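
The presumed one-line fix, assuming 2.10.4 is indeed the intended Scala version for 
that audit build:

{code}
// dev/audit-release/blank_sbt_build/build.sbt
scalaVersion := "2.10.4"
{code}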






[jira] [Commented] (SPARK-3802) Scala version is wrong in dev/audit-release/blank_sbt_build/build.sbt

2014-10-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159501#comment-14159501
 ] 

Apache Spark commented on SPARK-3802:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2661

 Scala version is wrong in dev/audit-release/blank_sbt_build/build.sbt
 -

 Key: SPARK-3802
 URL: https://issues.apache.org/jira/browse/SPARK-3802
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
Priority: Minor

 In dev/audit-release/blank_sbt_build/build.sbt, scalaVersion indicates 2.9.3 
 but I think 2.10.4 is correct.






[jira] [Updated] (SPARK-3803) ArrayIndexOutOfBoundsException found in executing computePrincipalComponents

2014-10-05 Thread Masaru Dobashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masaru Dobashi updated SPARK-3803:
--
Description: 
When I executed computePrincipalComponents method of RowMatrix, I got 
java.lang.ArrayIndexOutOfBoundsException.

{quote}
14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at 
RDDFunctions.scala:111
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 
(TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161

org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460)

org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114)

org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113)

scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)

scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)

scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{quote}

The RowMatrix instance was generated from the result of TF-IDF like the 
following.

{quote}
scala> val hashingTF = new HashingTF()
scala> val tf = hashingTF.transform(texts)
scala> import org.apache.spark.mllib.feature.IDF
scala> tf.cache()
scala> val idf = new IDF().fit(tf)
scala> val tfidf: RDD[Vector] = idf.transform(tf)

scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
scala> val mat = new RowMatrix(tfidf)
scala> val pc = mat.computePrincipalComponents(2)
{quote}

 ArrayIndexOutOfBoundsException found in executing computePrincipalComponents
 

 Key: SPARK-3803
 URL: https://issues.apache.org/jira/browse/SPARK-3803
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Masaru Dobashi

 When I executed computePrincipalComponents method of RowMatrix, I got 
 java.lang.ArrayIndexOutOfBoundsException.
 {quote}
 14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at 
 RDDFunctions.scala:111
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 
 (TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161
 
 org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460)
 
 org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114)
 
 org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113)
 
 scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
 
 scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
 

[jira] [Updated] (SPARK-3803) ArrayIndexOutOfBoundsException found in executing computePrincipalComponents

2014-10-05 Thread Masaru Dobashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masaru Dobashi updated SPARK-3803:
--
Description: 
When I executed computePrincipalComponents method of RowMatrix, I got 
java.lang.ArrayIndexOutOfBoundsException.

{code}
14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at 
RDDFunctions.scala:111
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 
(TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161

org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460)

org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114)

org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113)

scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)

scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)

scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{code}

The RowMatrix instance was generated from the result of TF-IDF like the 
following.

{code}
scala> val hashingTF = new HashingTF()
scala> val tf = hashingTF.transform(texts)
scala> import org.apache.spark.mllib.feature.IDF
scala> tf.cache()
scala> val idf = new IDF().fit(tf)
scala> val tfidf: RDD[Vector] = idf.transform(tf)

scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
scala> val mat = new RowMatrix(tfidf)
scala> val pc = mat.computePrincipalComponents(2)
{code}


  was:
When I executed computePrincipalComponents method of RowMatrix, I got 
java.lang.ArrayIndexOutOfBoundsException.

{quote}
14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at 
RDDFunctions.scala:111
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 
(TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161

org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460)

org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114)

org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113)

scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)

scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)

scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)


[jira] [Commented] (SPARK-3794) Building spark core fails with specific hadoop version

2014-10-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159524#comment-14159524
 ] 

Sean Owen commented on SPARK-3794:
--

The real problem here is that commons-io is not a dependency of Spark, and 
should not be, but it began to be used a few days ago in commit 
https://github.com/apache/spark/commit/cf1d32e3e1071829b152d4b597bf0a0d7a5629a2 
for SPARK-1860. So Spark is accidentally depending on the version of Commons IO 
brought in by third-party dependencies. 

I will propose a PR that removes this usage in favor of Guava or Java APIs.
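
For illustration only, a sketch of the general direction (plain java.io, no 
commons-io); this is not the content of the actual PR:

{code}
import java.io.File

// Recursively collect a directory and everything beneath it, roughly what
// FileUtils.listFilesAndDirs(dir, TrueFileFilter.TRUE, TrueFileFilter.TRUE) yields.
def listFilesAndDirs(dir: File): Seq[File] = {
  val children = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty[File])
  dir +: children.flatMap { f =>
    if (f.isDirectory) listFilesAndDirs(f) else Seq(f)
  }
}
{code}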

 Building spark core fails with specific hadoop version
 --

 Key: SPARK-3794
 URL: https://issues.apache.org/jira/browse/SPARK-3794
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
 Environment: Mac OS X 10.9.5
Reporter: cocoatomo
  Labels: spark

 At the commit cf1d32e3e1071829b152d4b597bf0a0d7a5629a2, building Spark core 
 results in a compilation error when certain Hadoop versions are specified.
 To reproduce this issue, execute the following command with 
 hadoop.version set to 1.1.0, 1.1.1, 1.1.2, 1.2.0, 1.2.1, or 2.2.0.
 {noformat}
 $ cd ./core
 $ mvn -Dhadoop.version=<hadoop.version> -DskipTests clean compile
 ...
 [ERROR] 
 /Users/tomohiko/MyRepos/Scala/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:720:
  value listFilesAndDirs is not a member of object 
 org.apache.commons.io.FileUtils
 [ERROR]   val files = FileUtils.listFilesAndDirs(dir, 
 TrueFileFilter.TRUE, TrueFileFilter.TRUE)
 [ERROR] ^
 {noformat}
 Because the compilation uses commons-io version 2.1, and the 
 FileUtils#listFilesAndDirs method was only added in commons-io version 2.2, the 
 compilation always fails.
 FileUtils#listFilesAndDirs → 
 http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/FileUtils.html#listFilesAndDirs%28java.io.File,%20org.apache.commons.io.filefilter.IOFileFilter,%20org.apache.commons.io.filefilter.IOFileFilter%29
 Because hadoop-client in those problematic versions depends on commons-io 
 2.1, not 2.4, we should assume that commons-io is at version 2.1.






[jira] [Commented] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.

2014-10-05 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159559#comment-14159559
 ] 

Oleg Zhurakousky commented on SPARK-3561:
-

Sandy, one other thing:
While I understand the reasoning for changes to the title and the description 
of the JIRA, it would probably be better to coordinate this with the original 
submitter before making such changes in the future (similar to the way Patrick 
suggested in SPARK-3174). This would alleviate potential discrepancies in the 
overall message and intentions of the JIRA. 
Anyway, I’ve edited both the title and the description taking into 
consideration your edits.

 Expose pluggable architecture to facilitate native integration with 
 third-party execution environments.
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark _integrates with external resource-managing platforms_ such 
 as Apache Hadoop YARN and Mesos to facilitate 
 execution of Spark DAG in a distributed environment provided by those 
 platforms. 
 However, this integration is tightly coupled within Spark's implementation 
 making it rather difficult to introduce integration points with 
 other resource-managing platforms without constant modifications to Spark's 
 core (see comments below for more details). 
 In addition, Spark _does not provide any integration points to a third-party 
 **DAG-like** and **DAG-capable** execution environments_ native 
 to those platforms, thus limiting access to some of their native features 
 (e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management 
 and monitoring and more) as well as specialization aspects of
 such execution environments (open source and proprietary). As an example, 
 inability to gain access to such features is starting to affect Spark's 
 viability in large-scale, batch, and/or ETL applications. 
 Introducing a pluggable architecture would solve both of the issues mentioned 
 above, ultimately benefitting Spark's technology and community by allowing it 
 to venture into co-existence and collaboration with a variety of existing Big 
 Data platforms as well as the ones yet to come to market.
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - as a non-public api (@DeveloperAPI).
 The trait will define only 4 operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via 
 master URL as _execution-context:foo.bar.MyJobExecutionContext_ with default 
 implementation containing the existing code from SparkContext, thus allowing 
 current 
 (corresponding) methods of SparkContext to delegate to such implementation 
 ensuring binary and source compatibility with older versions of Spark.  
 An integrator will now have the option to provide a custom implementation of 
 DefaultExecutionContext, by either implementing it from scratch or extending 
 from DefaultExecutionContext.
 Please see the attached design doc and pull request for more details.
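 For readers skimming the thread, here is a rough sketch of the shape such a trait 
 could take. The signatures below are illustrative guesses based on this description 
 only, not the API from the design doc or the pull request:
 {code}
 import scala.reflect.ClassTag
 
 import org.apache.spark.{SparkContext, TaskContext}
 import org.apache.spark.broadcast.Broadcast
 import org.apache.spark.rdd.RDD
 
 // Illustrative only: method shapes mirror the SparkContext methods they would
 // delegate to; the real proposal may differ.
 trait JobExecutionContext {
   def hadoopFile[K: ClassTag, V: ClassTag](
       sc: SparkContext, path: String, minPartitions: Int): RDD[(K, V)]
 
   def newAPIHadoopFile[K: ClassTag, V: ClassTag](
       sc: SparkContext, path: String): RDD[(K, V)]
 
   def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]
 
   def runJob[T, U: ClassTag](
       sc: SparkContext,
       rdd: RDD[T],
       func: (TaskContext, Iterator[T]) => U,
       partitions: Seq[Int]): Array[U]
 }
 {code}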






[jira] [Closed] (SPARK-3801) Make app dir cleanup more efficient

2014-10-05 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh closed SPARK-3801.
--
Resolution: Duplicate

 Make app dir cleanup more efficient
 ---

 Key: SPARK-3801
 URL: https://issues.apache.org/jira/browse/SPARK-3801
 Project: Spark
  Issue Type: Improvement
Reporter: Liang-Chi Hsieh
Priority: Minor

 The newly merged, more conservative app directory cleanup can be made more 
 efficient. Since there is no need to collect the newer files, using 'exists' 
 instead of 'filter' is more efficient because redundant items are skipped.






[jira] [Commented] (SPARK-3801) Make app dir cleanup more efficient

2014-10-05 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159561#comment-14159561
 ] 

Liang-Chi Hsieh commented on SPARK-3801:


I think so. Thanks.

 Make app dir cleanup more efficient
 ---

 Key: SPARK-3801
 URL: https://issues.apache.org/jira/browse/SPARK-3801
 Project: Spark
  Issue Type: Improvement
Reporter: Liang-Chi Hsieh
Priority: Minor

 The newly merged, more conservative app directory cleanup can be made more 
 efficient. Since there is no need to collect the newer files, using 'exists' 
 instead of 'filter' is more efficient because redundant items are skipped.






[jira] [Commented] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.

2014-10-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159573#comment-14159573
 ] 

Sean Owen commented on SPARK-3561:
--

I'd be interested to see a more specific motivating use case. Is this about 
using Tez, for example, and where does it help to stack Spark on Tez on YARN? Or 
MR2, etc.? Spark Core and Tez overlap, to be sure, and I'm not sure how much 
value it adds to run one on the other. Kind of like running Oracle on MySQL or 
something. For whatever it is: is it maybe not more natural to integrate the 
feature into Spark itself?

It would be great if this were all just a matter of one extra trait and 
interface. In practice I suspect there are a number of hidden assumptions 
throughout the code that may leak through attempts at this abstraction. 

I am definitely asking rather than asserting, curious to see more specifics 
about the upside.

 Expose pluggable architecture to facilitate native integration with 
 third-party execution environments.
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark _integrates with external resource-managing platforms_ such 
 as Apache Hadoop YARN and Mesos to facilitate 
 execution of Spark DAG in a distributed environment provided by those 
 platforms. 
 However, this integration is tightly coupled within Spark's implementation 
 making it rather difficult to introduce integration points with 
 other resource-managing platforms without constant modifications to Spark's 
 core (see comments below for more details). 
 In addition, Spark _does not provide any integration points to a third-party 
 **DAG-like** and **DAG-capable** execution environments_ native 
 to those platforms, thus limiting access to some of their native features 
 (e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management 
 and monitoring and more) as well as specialization aspects of
 such execution environments (open source and proprietary). As an example, 
 inability to gain access to such features is starting to affect Spark's 
 viability in large-scale, batch, and/or ETL applications. 
 Introducing a pluggable architecture would solve both of the issues mentioned 
 above, ultimately benefitting Spark's technology and community by allowing it 
 to venture into co-existence and collaboration with a variety of existing Big 
 Data platforms as well as the ones yet to come to market.
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - as a non-public api (@DeveloperAPI).
 The trait will define only 4 operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via 
 master URL as _execution-context:foo.bar.MyJobExecutionContext_ with default 
 implementation containing the existing code from SparkContext, thus allowing 
 current 
 (corresponding) methods of SparkContext to delegate to such implementation 
 ensuring binary and source compatibility with older versions of Spark.  
 An integrator will now have the option to provide a custom implementation of 
 DefaultExecutionContext, by either implementing it from scratch or extending 
 from DefaultExecutionContext.
 Please see the attached design doc and pull request for more details.






[jira] [Commented] (SPARK-3007) Add Dynamic Partition support to Spark Sql hive

2014-10-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159578#comment-14159578
 ] 

Apache Spark commented on SPARK-3007:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/2663

 Add Dynamic Partition support  to  Spark Sql hive
 ---

 Key: SPARK-3007
 URL: https://issues.apache.org/jira/browse/SPARK-3007
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: baishuo
 Fix For: 1.2.0









[jira] [Updated] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.

2014-10-05 Thread Oleg Zhurakousky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleg Zhurakousky updated SPARK-3561:

Description: 
Currently Spark _integrates with external resource-managing platforms_ such as 
Apache Hadoop YARN and Mesos to facilitate 
execution of Spark DAG in a distributed environment provided by those 
platforms. 

However, this integration is tightly coupled within Spark's implementation 
making it rather difficult to introduce integration points with 
other resource-managing platforms without constant modifications to Spark's 
core (see comments below for more details). 

In addition, Spark _does not provide any integration points to a third-party 
**DAG-like** and **DAG-capable** execution environments_ native 
to those platforms, thus limiting access to some of their native features 
(e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management 
and monitoring and more) as well as specialization aspects of
such execution environments (open source and proprietary). As an example, 
inability to gain access to such features is starting to affect Spark's 
viability in large-scale, batch, and/or ETL applications. 

Introducing a pluggable architecture would solve both of the issues mentioned 
above, ultimately benefitting Spark's technology and community by allowing it to 
venture into co-existence and collaboration with a variety of existing Big Data 
platforms as well as the ones yet to come to market.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
as a non-public api (@DeveloperAPI).
The trait will define 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method directly maps to the corresponding methods in current version of 
SparkContext. JobExecutionContext implementation will be accessed by 
SparkContext via 
master URL as _execution-context:foo.bar.MyJobExecutionContext_ with default 
implementation containing the existing code from SparkContext, thus allowing 
current 
(corresponding) methods of SparkContext to delegate to such implementation 
ensuring binary and source compatibility with older versions of Spark.  
An integrator will now have the option to provide a custom implementation of 
DefaultExecutionContext, by either implementing it from scratch or extending 
from DefaultExecutionContext.
Please see the attached design doc and pull request for more details.

  was:
Currently Spark _integrates with external resource-managing platforms_ such as 
Apache Hadoop YARN and Mesos to facilitate 
execution of Spark DAG in a distributed environment provided by those 
platforms. 

However, this integration is tightly coupled within Spark's implementation 
making it rather difficult to introduce integration points with 
other resource-managing platforms without constant modifications to Spark's 
core (see comments below for more details). 

In addition, Spark _does not provide any integration points to a third-party 
**DAG-like** and **DAG-capable** execution environments_ native 
to those platforms, thus limiting access to some of their native features 
(e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management 
and monitoring and more) as well as specialization aspects of
such execution environments (open source and proprietary). As an example, 
inability to gain access to such features are starting to affect Spark's 
viability in large scale, batch 
and/or ETL applications. 

Introducing a pluggable architecture would solve both of the issues mentioned 
above ultimately benefitting Spark's technology and community by allowing it to 
venture into co-existence and collaboration with a variety of existing Big Data 
platforms as well as the once yet to come to the market.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
as a non-public api (@DeveloperAPI).
The trait will define 4 only operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method directly maps to the corresponding methods in current version of 
SparkContext. JobExecutionContext implementation will be accessed by 
SparkContext via 
master URL as _execution-context:foo.bar.MyJobExecutionContext_ with default 
implementation containing the existing code from SparkContext, thus allowing 
current 
(corresponding) methods of SparkContext to delegate to such implementation 
ensuring binary and source compatibility with older versions of Spark.  
An integrator will now have an option to provide custom implementation of 
DefaultExecutionContext by either implementing it from scratch or extending 
form DefaultExecutionContext.
Please see the attached design doc and pull request for more details.


 Expose pluggable architecture to facilitate native integration with 
 third-party execution environments.
 

[jira] [Updated] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.

2014-10-05 Thread Oleg Zhurakousky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleg Zhurakousky updated SPARK-3561:

Description: 
Currently Spark _integrates with external resource-managing platforms_ such as 
Apache Hadoop YARN and Mesos to facilitate 
execution of Spark DAG in a distributed environment provided by those 
platforms. 

However, this integration is tightly coupled within Spark's implementation 
making it rather difficult to introduce integration points with 
other resource-managing platforms without constant modifications to Spark's 
core (see comments below for more details). 

In addition, Spark _does not provide any integration points to a third-party 
**DAG-like** and **DAG-capable** execution environments_ native 
to those platforms, thus limiting access to some of their native features 
(e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management 
and monitoring and more) as well as specialization aspects of
such execution environments (open source and proprietary). As an example, 
inability to gain access to such features is starting to affect Spark's 
viability in large-scale, batch, and/or ETL applications. 

Introducing a pluggable architecture would solve both of the issues mentioned 
above, ultimately benefitting Spark's technology and community by allowing it to 
venture into co-existence and collaboration with a variety of existing Big Data 
platforms as well as the ones yet to come to market.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
as a non-public api (@DeveloperAPI).
The trait will define 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method directly maps to the corresponding methods in current version of 
SparkContext. JobExecutionContext implementation will be accessed by 
SparkContext via 
master URL as _execution-context:foo.bar.MyJobExecutionContext_ with default 
implementation containing the existing code from SparkContext, thus allowing 
current 
(corresponding) methods of SparkContext to delegate to such implementation 
ensuring binary and source compatibility with older versions of Spark.  
An integrator will now have the option to provide a custom implementation of 
JobExecutionContext, by either implementing it from scratch or extending from 
DefaultExecutionContext.
Please see the attached design doc and pull request for more details.

  was:
Currently Spark _integrates with external resource-managing platforms_ such as 
Apache Hadoop YARN and Mesos to facilitate 
execution of Spark DAG in a distributed environment provided by those 
platforms. 

However, this integration is tightly coupled within Spark's implementation 
making it rather difficult to introduce integration points with 
other resource-managing platforms without constant modifications to Spark's 
core (see comments below for more details). 

In addition, Spark _does not provide any integration points to a third-party 
**DAG-like** and **DAG-capable** execution environments_ native 
to those platforms, thus limiting access to some of their native features 
(e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management 
and monitoring and more) as well as specialization aspects of
such execution environments (open source and proprietary). As an example, 
inability to gain access to such features are starting to affect Spark's 
viability in large scale, batch 
and/or ETL applications. 

Introducing a pluggable architecture would solve both of the issues mentioned 
above ultimately benefitting Spark's technology and community by allowing it to 
venture into co-existence and collaboration with a variety of existing Big Data 
platforms as well as the once yet to come to the market.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
as a non-public api (@DeveloperAPI).
The trait will define 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method directly maps to the corresponding methods in current version of 
SparkContext. JobExecutionContext implementation will be accessed by 
SparkContext via 
master URL as _execution-context:foo.bar.MyJobExecutionContext_ with default 
implementation containing the existing code from SparkContext, thus allowing 
current 
(corresponding) methods of SparkContext to delegate to such implementation 
ensuring binary and source compatibility with older versions of Spark.  
An integrator will now have an option to provide custom implementation of 
DefaultExecutionContext by either implementing it from scratch or extending 
form DefaultExecutionContext.
Please see the attached design doc and pull request for more details.


 Expose pluggable architecture to facilitate native integration with 
 third-party execution environments.
 

[jira] [Updated] (SPARK-3803) ArrayIndexOutOfBoundsException found in executing computePrincipalComponents

2014-10-05 Thread Masaru Dobashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masaru Dobashi updated SPARK-3803:
--
Description: 
When I executed computePrincipalComponents method of RowMatrix, I got 
java.lang.ArrayIndexOutOfBoundsException.

{code}
14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at 
RDDFunctions.scala:111
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 
(TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161

org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460)

org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114)

org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113)

scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)

scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)

scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{code}

The RowMatrix instance was generated from the result of TF-IDF like the 
following.

{code}
scala> val hashingTF = new HashingTF()
scala> val tf = hashingTF.transform(texts)
scala> import org.apache.spark.mllib.feature.IDF
scala> tf.cache()
scala> val idf = new IDF().fit(tf)
scala> val tfidf: RDD[Vector] = idf.transform(tf)

scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
scala> val mat = new RowMatrix(tfidf)
scala> val pc = mat.computePrincipalComponents(2)
{code}

I think this was because I created the HashingTF instance with the default 
numFeatures, and an Array is used in the RowMatrix#computeGramianMatrix method, 
like the following.

{code}
  /**
   * Computes the Gramian matrix `A^T A`.
   */
  def computeGramianMatrix(): Matrix = {
    val n = numCols().toInt
    val nt: Int = n * (n + 1) / 2

    // Compute the upper triangular part of the gram matrix.
    val GU = rows.treeAggregate(new BDV[Double](new Array[Double](nt)))(
      seqOp = (U, v) => {
        RowMatrix.dspr(1.0, v, U.data)
        U
      }, combOp = (U1, U2) => U1 += U2)

    RowMatrix.triuToFull(n, GU.data)
  }
{code} 

When the size of the Vectors generated by TF-IDF is too large, nt ends up with 
an undesirable value (and the Array used in treeAggregate with an undesirable 
size), since n * (n + 1) / 2 exceeds Int.MaxValue.


Is this surmise correct?

And, of course, I could avoid this situation by creating the HashingTF instance 
with a smaller numFeatures.
But this seems not to be a fundamental solution.
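
As a quick numerical check of this surmise (assuming the default numFeatures of 
2^20, which I believe HashingTF uses in 1.1.0), the Int arithmetic does wrap:

{code}
val n = 1 << 20                              // 1048576 columns
val nt: Int = n * (n + 1) / 2                // overflows and wraps to 524288
val exact: Long = n.toLong * (n + 1) / 2     // 549756338176, far above Int.MaxValue (2147483647)
// So the Array[Double](nt) handed to treeAggregate is far too small, and dspr
// indexing into it (e.g. at 4878161 in the trace above) goes out of bounds.
{code}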


  was:
When I executed computePrincipalComponents method of RowMatrix, I got 
java.lang.ArrayIndexOutOfBoundsException.

{code}
14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at 
RDDFunctions.scala:111
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 
(TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 

[jira] [Updated] (SPARK-3803) ArrayIndexOutOfBoundsException found in executing computePrincipalComponents

2014-10-05 Thread Masaru Dobashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masaru Dobashi updated SPARK-3803:
--
Description: 
When I executed computePrincipalComponents method of RowMatrix, I got 
java.lang.ArrayIndexOutOfBoundsException.

{code}
14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at 
RDDFunctions.scala:111
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 
(TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161

org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460)

org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114)

org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113)

scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)

scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)

scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{code}

The RowMatrix instance was generated from the result of TF-IDF like the 
following.

{code}
scala> val hashingTF = new HashingTF()
scala> val tf = hashingTF.transform(texts)
scala> import org.apache.spark.mllib.feature.IDF
scala> tf.cache()
scala> val idf = new IDF().fit(tf)
scala> val tfidf: RDD[Vector] = idf.transform(tf)

scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
scala> val mat = new RowMatrix(tfidf)
scala> val pc = mat.computePrincipalComponents(2)
{code}

I think this was because I created the HashingTF instance with the default 
numFeatures, and an Array is used in the RowMatrix#computeGramianMatrix method, 
like the following.

{code}
  /**
   * Computes the Gramian matrix `A^T A`.
   */
  def computeGramianMatrix(): Matrix = {
    val n = numCols().toInt
    val nt: Int = n * (n + 1) / 2

    // Compute the upper triangular part of the gram matrix.
    val GU = rows.treeAggregate(new BDV[Double](new Array[Double](nt)))(
      seqOp = (U, v) => {
        RowMatrix.dspr(1.0, v, U.data)
        U
      }, combOp = (U1, U2) => U1 += U2)

    RowMatrix.triuToFull(n, GU.data)
  }
{code} 

When the size of the Vectors generated by TF-IDF is too large, nt ends up with 
an undesirable value (and the Array used in treeAggregate with an undesirable 
size), since n * (n + 1) / 2 exceeds Int.MaxValue.


Is this surmise correct?

And, of course, I could avoid this situation by creating the HashingTF instance 
with a smaller numFeatures.
But this does not seem to be a fundamental solution.


  was:
When I executed computePrincipalComponents method of RowMatrix, I got 
java.lang.ArrayIndexOutOfBoundsException.

{code}
14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at 
RDDFunctions.scala:111
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 
(TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 

[jira] [Updated] (SPARK-3803) ArrayIndexOutOfBoundsException found in executing computePrincipalComponents

2014-10-05 Thread Masaru Dobashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masaru Dobashi updated SPARK-3803:
--
Description: 
When I executed the computePrincipalComponents method of RowMatrix, I got a 
java.lang.ArrayIndexOutOfBoundsException.

{code}
14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at 
RDDFunctions.scala:111
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 
(TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161

org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460)

org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114)

org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113)

scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)

scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)

scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{code}

The RowMatrix instance was generated from the result of TF-IDF like the 
following.

{code}
scala> val hashingTF = new HashingTF()
scala> val tf = hashingTF.transform(texts)
scala> import org.apache.spark.mllib.feature.IDF
scala> tf.cache()
scala> val idf = new IDF().fit(tf)
scala> val tfidf: RDD[Vector] = idf.transform(tf)

scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
scala> val mat = new RowMatrix(tfidf)
scala> val pc = mat.computePrincipalComponents(2)
{code}

I think this was because I created the HashingTF instance with the default 
numFeatures, and an Array is used in the RowMatrix#computeGramianMatrix method, 
like the following.

{code}
  /**
   * Computes the Gramian matrix `A^T A`.
   */
  def computeGramianMatrix(): Matrix = {
    val n = numCols().toInt
    val nt: Int = n * (n + 1) / 2

    // Compute the upper triangular part of the gram matrix.
    val GU = rows.treeAggregate(new BDV[Double](new Array[Double](nt)))(
      seqOp = (U, v) => {
        RowMatrix.dspr(1.0, v, U.data)
        U
      }, combOp = (U1, U2) => U1 += U2)

    RowMatrix.triuToFull(n, GU.data)
  }
{code} 

When the size of the Vectors generated by TF-IDF is too large, nt ends up with 
an undesirable value (and the Array used in treeAggregate with an undesirable 
size), since n * (n + 1) / 2 exceeds Int.MaxValue.


Is this surmise correct?

And, of course, I could avoid this situation by creating the HashingTF instance 
with a smaller numFeatures.
But this may not be a fundamental solution.


  was:
GrWhen I executed computePrincipalComponents method of RowMatrix, I got 
java.lang.ArrayIndexOutOfBoundsException.

{code}
14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at 
RDDFunctions.scala:111
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 
(TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 

[jira] [Updated] (SPARK-3803) ArrayIndexOutOfBoundsException found in executing computePrincipalComponents

2014-10-05 Thread Masaru Dobashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masaru Dobashi updated SPARK-3803:
--
Description: 
When I executed the computePrincipalComponents method of RowMatrix, I got 
java.lang.ArrayIndexOutOfBoundsException.

{code}
14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at 
RDDFunctions.scala:111
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 
(TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161

org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460)

org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114)

org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113)

scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)

scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)

scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)

org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{code}

The RowMatrix instance was generated from the result of TF-IDF like the 
following.

{code}
scala> val hashingTF = new HashingTF()
scala> val tf = hashingTF.transform(texts)
scala> import org.apache.spark.mllib.feature.IDF
scala> tf.cache()
scala> val idf = new IDF().fit(tf)
scala> val tfidf: RDD[Vector] = idf.transform(tf)

scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
scala> val mat = new RowMatrix(tfidf)
scala> val pc = mat.computePrincipalComponents(2)
{code}

I think this was because I created the HashingTF instance with the default 
numFeatures and an Array (whose size is an Int) is used in the 
RowMatrix#computeGramianMatrix method, like the following.

{code}
  /**
   * Computes the Gramian matrix `A^T A`.
   */
  def computeGramianMatrix(): Matrix = {
    val n = numCols().toInt
    val nt: Int = n * (n + 1) / 2

    // Compute the upper triangular part of the gram matrix.
    val GU = rows.treeAggregate(new BDV[Double](new Array[Double](nt)))(
      seqOp = (U, v) => {
        RowMatrix.dspr(1.0, v, U.data)
        U
      }, combOp = (U1, U2) => U1 += U2)

    RowMatrix.triuToFull(n, GU.data)
  }
{code} 

When the Vectors generated by TF-IDF are too large, nt ends up with an 
undesirable value (and the Array used in treeAggregate gets an undesirable size),
since n * (n + 1) / 2 exceeds Int.MaxValue.


Is this surmise correct?

Of course, I could avoid this situation by creating the HashingTF instance 
with a smaller numFeatures, but that may not be a fundamental solution.


  was:
When I executed the computePrincipalComponents method of RowMatrix, I got 
java.lang.ArrayIndexOutOfBoundsException.

{code}
14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at 
RDDFunctions.scala:111
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 
(TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161

[jira] [Commented] (SPARK-3803) ArrayIndexOutOfBoundsException found in executing computePrincipalComponents

2014-10-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159587#comment-14159587
 ] 

Sean Owen commented on SPARK-3803:
--

I agree with your assessment. It would take some work, though not terribly 
much, to rewrite this method to correctly handle A with more than 46340 
columns. At n = 46340, the Gramian already consumes about 8.5GB of memory, so 
it's getting rather big to realistically use in core anyway. At the least, an 
error should be raised if n is too large. Does anyone else think this should be 
supported, though? It would be nice, but is it practically helpful?
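As a quick sanity check of those figures, a minimal sketch in plain Scala, 
assuming doubles and the packed upper-triangular buffer used by treeAggregate:

{code}
// Largest n for which n * (n + 1) still fits in an Int, i.e. for which
// nt = n * (n + 1) / 2 can be computed without overflow.
val maxN = Iterator.from(1).takeWhile(n => n.toLong * (n + 1) <= Int.MaxValue).toSeq.last
// maxN == 46340

// Memory for the packed upper triangle of doubles aggregated per partition.
val nt = maxN.toLong * (maxN + 1) / 2   // 1073720970 entries
val bytes = nt * 8L                     // ~8.6e9 bytes, roughly the 8.5GB figure above
{code}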

 ArrayIndexOutOfBoundsException found in executing computePrincipalComponents
 

 Key: SPARK-3803
 URL: https://issues.apache.org/jira/browse/SPARK-3803
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Masaru Dobashi

 When I executed computePrincipalComponents method of RowMatrix, I got 
 java.lang.ArrayIndexOutOfBoundsException.
 {code}
 14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at 
 RDDFunctions.scala:111
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 
 (TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161
 
 org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460)
 
 org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114)
 
 org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113)
 
 scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
 
 scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
 scala.collection.Iterator$class.foreach(Iterator.scala:727)
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 
 scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
 scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
 
 scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
 scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
 
 org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)
 
 org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)
 
 org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
 
 org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
 org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 {code}
 The RowMatrix instance was generated from the result of TF-IDF like the 
 following.
 {code}
 scala> val hashingTF = new HashingTF()
 scala> val tf = hashingTF.transform(texts)
 scala> import org.apache.spark.mllib.feature.IDF
 scala> tf.cache()
 scala> val idf = new IDF().fit(tf)
 scala> val tfidf: RDD[Vector] = idf.transform(tf)
 scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
 scala> val mat = new RowMatrix(tfidf)
 scala> val pc = mat.computePrincipalComponents(2)
 {code}
 I think this was because I created HashingTF instance with default 
 numFeatures and Array is used in RowMatrix#computeGramianMatrix method
 like the following.
 {code}
   /**
    * Computes the Gramian matrix `A^T A`.
    */
   def computeGramianMatrix(): Matrix = {
     val n = numCols().toInt
     val nt: Int = n * (n + 1) / 2
     // Compute the upper triangular part of the gram matrix.
     val GU = rows.treeAggregate(new BDV[Double](new Array[Double](nt)))(
       seqOp = (U, v) => 

[jira] [Updated] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.

2014-10-05 Thread Oleg Zhurakousky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleg Zhurakousky updated SPARK-3561:

Description: 
Currently Spark provides integration with external resource-managers such as 
Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current 
architecture of Spark-on-YARN can be enhanced to provide significantly better 
utilization of cluster resources for large scale, batch and/or ETL applications 
when run alongside other applications (Spark and others) and services in YARN. 

Proposal: 
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
a gateway and a delegate to Hadoop execution environment - as a non-public api 
(@DeveloperAPI) not exposed to end users of Spark. 
The trait will define only 4 operations: 
* hadoopFile 
* newAPIHadoopFile 
* broadcast 
* runJob 

Each method directly maps to the corresponding method in the current version of 
SparkContext. The JobExecutionContext implementation will be accessed by 
SparkContext via a master URL of the form 
execution-context:foo.bar.MyJobExecutionContext, with the default implementation 
containing the existing code from SparkContext, thus allowing the current 
(corresponding) methods of SparkContext to delegate to that implementation. An 
integrator will now have the option to provide a custom implementation of 
DefaultExecutionContext by either implementing it from scratch or extending 
from DefaultExecutionContext. 

Please see the attached design doc for more details. 
Pull Request will be posted shortly as well
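To make the proposal easier to picture, below is a rough sketch of what such a 
trait could look like; the method signatures merely mirror the corresponding 
SparkContext methods in Spark 1.1 (with the SparkContext passed in explicitly) 
and are an assumption, not the API from the attached design doc or pull request.

{code}
import scala.reflect.ClassTag

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.InputFormat
import org.apache.hadoop.mapreduce.{InputFormat => NewInputFormat}

import org.apache.spark.{SparkContext, TaskContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Hypothetical delegate trait: each method shadows the SparkContext method of
// the same name so that SparkContext can forward to a pluggable implementation.
trait JobExecutionContext {

  def hadoopFile[K, V](
      sc: SparkContext,
      path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int): RDD[(K, V)]

  def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
      sc: SparkContext,
      path: String,
      fClass: Class[F],
      kClass: Class[K],
      vClass: Class[V],
      conf: Configuration): RDD[(K, V)]

  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]

  def runJob[T, U: ClassTag](
      sc: SparkContext,
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit): Unit
}
{code}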

  was:
Currently Spark _integrates with external resource-managing platforms_ such as 
Apache Hadoop YARN and Mesos to facilitate 
execution of Spark DAG in a distributed environment provided by those 
platforms. 

However, this integration is tightly coupled within Spark's implementation 
making it rather difficult to introduce integration points with 
other resource-managing platforms without constant modifications to Spark's 
core (see comments below for more details). 

In addition, Spark _does not provide any integration points to a third-party 
**DAG-like** and **DAG-capable** execution environments_ native 
to those platforms, thus limiting access to some of their native features 
(e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management 
and monitoring and more) as well as specialization aspects of
such execution environments (open source and proprietary). As an example, 
the inability to gain access to such features is starting to affect Spark's 
viability in large scale, batch and/or ETL applications. 

Introducing a pluggable architecture would solve both of the issues mentioned 
above, ultimately benefitting Spark's technology and community by allowing it to 
venture into co-existence and collaboration with a variety of existing Big Data 
platforms as well as the ones yet to come to the market.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
as a non-public api (@DeveloperAPI).
The trait will define 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method directly maps to the corresponding method in the current version of 
SparkContext. The JobExecutionContext implementation will be accessed by 
SparkContext via a master URL of the form 
_execution-context:foo.bar.MyJobExecutionContext_, with the default 
implementation containing the existing code from SparkContext, thus allowing the 
current (corresponding) methods of SparkContext to delegate to that 
implementation, ensuring binary and source compatibility with older versions of 
Spark. An integrator will now have the option to provide a custom implementation 
of JobExecutionContext by either implementing it from scratch or extending from 
DefaultExecutionContext.
Please see the attached design doc and pull request for more details.


 Expose pluggable architecture to facilitate native integration with 
 third-party execution environments.
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal: 
 The proposed approach would introduce a pluggable JobExecutionContext 

[jira] [Updated] (SPARK-3597) MesosSchedulerBackend does not implement `killTask`

2014-10-05 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3597:
-
Assignee: Brenden Matthews

 MesosSchedulerBackend does not implement `killTask`
 ---

 Key: SPARK-3597
 URL: https://issues.apache.org/jira/browse/SPARK-3597
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.1.0
Reporter: Brenden Matthews
Assignee: Brenden Matthews
 Fix For: 1.1.1, 1.2.0


 The MesosSchedulerBackend class does not implement `killTask`, and therefore 
 results in exceptions like this:
 14/09/19 01:52:53 ERROR TaskSetManager: Task 238 in stage 1.0 failed 4 times; 
 aborting job
 14/09/19 01:52:53 INFO TaskSchedulerImpl: Cancelling stage 1
 14/09/19 01:52:53 INFO DAGScheduler: Could not cancel tasks for stage 1
 java.lang.UnsupportedOperationException
   at 
 org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
   at 
 org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:194)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:192)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:192)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:192)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:185)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:185)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1211)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1197)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1197)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1197)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3597) MesosSchedulerBackend does not implement `killTask`

2014-10-05 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3597.

  Resolution: Fixed
   Fix Version/s: 1.2.0
Target Version/s: 1.1.1, 1.2.0

 MesosSchedulerBackend does not implement `killTask`
 ---

 Key: SPARK-3597
 URL: https://issues.apache.org/jira/browse/SPARK-3597
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.1.0
Reporter: Brenden Matthews
Assignee: Brenden Matthews
 Fix For: 1.1.1, 1.2.0


 The MesosSchedulerBackend class does not implement `killTask`, and therefore 
 results in exceptions like this:
 14/09/19 01:52:53 ERROR TaskSetManager: Task 238 in stage 1.0 failed 4 times; 
 aborting job
 14/09/19 01:52:53 INFO TaskSchedulerImpl: Cancelling stage 1
 14/09/19 01:52:53 INFO DAGScheduler: Could not cancel tasks for stage 1
 java.lang.UnsupportedOperationException
   at 
 org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
   at 
 org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:194)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:192)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:192)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:192)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:185)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:185)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1211)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1197)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1197)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1197)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads.

2014-10-05 Thread Oleg Zhurakousky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleg Zhurakousky updated SPARK-3561:

Summary: Native Hadoop/YARN integration for batch/ETL workloads.  (was: 
Expose pluggable architecture to facilitate native integration with third-party 
execution environments.)

 Native Hadoop/YARN integration for batch/ETL workloads.
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal: 
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark. 
 The trait will define 4 only operations: 
 * hadoopFile 
 * newAPIHadoopFile 
 * broadcast 
 * runJob 
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have an option to provide custom implementation of 
 DefaultExecutionContext by either implementing it from scratch or extending 
 form DefaultExecutionContext. 
 Please see the attached design doc for more details. 
 Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads.

2014-10-05 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159593#comment-14159593
 ] 

Oleg Zhurakousky commented on SPARK-3561:
-

After giving it some thought, I am changing the title and the description back 
to the original as it would be more appropriate to discuss whatever question 
anyone may have via comments.

 Native Hadoop/YARN integration for batch/ETL workloads.
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal: 
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark. 
 The trait will define 4 only operations: 
 * hadoopFile 
 * newAPIHadoopFile 
 * broadcast 
 * runJob 
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have an option to provide custom implementation of 
 DefaultExecutionContext by either implementing it from scratch or extending 
 form DefaultExecutionContext. 
 Please see the attached design doc for more details. 
 Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2445) MesosExecutorBackend crashes in fine grained mode

2014-10-05 Thread Brenden Matthews (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159602#comment-14159602
 ] 

Brenden Matthews edited comment on SPARK-2445 at 10/5/14 5:45 PM:
--

[~drexin] can you confirm that SPARK-3535 resolves your issue?


was (Author: brenden):
[~drexin] can you confirm that https://issues.apache.org/jira/browse/SPARK-3535 
resolves your issue?

 MesosExecutorBackend crashes in fine grained mode
 -

 Key: SPARK-2445
 URL: https://issues.apache.org/jira/browse/SPARK-2445
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.0.0
Reporter: Dario Rexin

 When multiple instances of the MesosExecutorBackend are running on the same 
 slave, they will have the same executorId assigned (equal to the mesos 
 slaveId), but will have a different port (which is randomly assigned). 
 Because of this, a new BlockManager cannot be registered, because one is 
 already registered with the same executorId but a different BlockManagerId. 
 More description and a fix can be found in this PR on GitHub:
 https://github.com/apache/spark/pull/1358



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2445) MesosExecutorBackend crashes in fine grained mode

2014-10-05 Thread Brenden Matthews (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159602#comment-14159602
 ] 

Brenden Matthews commented on SPARK-2445:
-

[~drexin] can you confirm that https://issues.apache.org/jira/browse/SPARK-3535 
resolves your issue?

 MesosExecutorBackend crashes in fine grained mode
 -

 Key: SPARK-2445
 URL: https://issues.apache.org/jira/browse/SPARK-2445
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.0.0
Reporter: Dario Rexin

 When multiple instances of the MesosExecutorBackend are running on the same 
 slave, they will have the same executorId assigned (equal to the mesos 
 slaveId), but will have a different port (which is randomly assigned). 
 Because of this, a new BlockManager cannot be registered, because one is 
 already registered with the same executorId but a different BlockManagerId. 
 More description and a fix can be found in this PR on GitHub:
 https://github.com/apache/spark/pull/1358



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3804) Output of Generator expressions is not stable after serialization.

2014-10-05 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-3804:
---

 Summary: Output of Generator expressions is not stable after 
serialization.
 Key: SPARK-3804
 URL: https://issues.apache.org/jira/browse/SPARK-3804
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust


The output of a generator expression (such as explode) can change after 
serialization.  This caused a nasty data corruption bug.  While this bug has 
since been addressed, it would be good to fix this since it violates the general 
contract of query plans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-10-05 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159707#comment-14159707
 ] 

Andrew Ash commented on SPARK-1860:
---

Filed as SPARK-3805

 Standalone Worker cleanup should not clean up running executors
 ---

 Key: SPARK-1860
 URL: https://issues.apache.org/jira/browse/SPARK-1860
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Priority: Blocker

 The default values of the standalone worker cleanup code clean up all 
 application data every 7 days. This includes jars that were added to any 
 executors that happen to be running for longer than 7 days, hitting streaming 
 jobs especially hard.
 Executors' log/data folders should not be cleaned up if they're still 
 running. Until then, this behavior should not be enabled by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3805) Enable Standalone worker cleanup by default

2014-10-05 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-3805:
-

 Summary: Enable Standalone worker cleanup by default
 Key: SPARK-3805
 URL: https://issues.apache.org/jira/browse/SPARK-3805
 Project: Spark
  Issue Type: Task
Reporter: Andrew Ash


Now that SPARK-1860 is fixed we should be able to turn on 
{{spark.worker.cleanup.enabled}} by default



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3805) Enable Standalone worker cleanup by default

2014-10-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159711#comment-14159711
 ] 

Apache Spark commented on SPARK-3805:
-

User 'ash211' has created a pull request for this issue:
https://github.com/apache/spark/pull/2664

 Enable Standalone worker cleanup by default
 ---

 Key: SPARK-3805
 URL: https://issues.apache.org/jira/browse/SPARK-3805
 Project: Spark
  Issue Type: Task
Reporter: Andrew Ash

 Now that SPARK-1860 is fixed we should be able to turn on 
 {{spark.worker.cleanup.enabled}} by default



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3464) Graceful decommission of executors

2014-10-05 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159712#comment-14159712
 ] 

Andrew Ash commented on SPARK-3464:
---

[~pwendell] what JIRA replaced this one?

 Graceful decommission of executors
 --

 Key: SPARK-3464
 URL: https://issues.apache.org/jira/browse/SPARK-3464
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Sandy Ryza
Assignee: Andrew Or

 In most cases, even when an application is utilizing only a small fraction of 
 its available resources, executors will still have tasks running or blocks 
 cached.  It would be useful to have a mechanism for waiting for running tasks 
 on an executor to finish and migrating its cached blocks elsewhere before 
 discarding it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3166) Custom serialisers can't be shipped in application jars

2014-10-05 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-3166:
--
Target Version/s: 1.2.0

 Custom serialisers can't be shipped in application jars
 ---

 Key: SPARK-3166
 URL: https://issues.apache.org/jira/browse/SPARK-3166
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Graham Dennis

 Spark cannot currently use a custom serialiser that is shipped with the 
 application jar. Trying to do this causes a java.lang.ClassNotFoundException 
 when trying to instantiate the custom serialiser in the Executor processes. 
 This occurs because Spark attempts to instantiate the custom serialiser 
 before the application jar has been shipped to the Executor process. A 
 reproduction of the problem is available here: 
 https://github.com/GrahamDennis/spark-custom-serialiser
 I've verified this problem in Spark 1.0.2, and Spark master and 1.1 branches 
 as of August 21, 2014.  This issue is related to SPARK-2878, and my fix for 
 that issue (https://github.com/apache/spark/pull/1890) also solves this.  My 
 pull request was not merged because it adds the user jar to the Executor 
 processes' class path at launch time.  Such a significant change was thought 
 by [~rxin] to require more QA, and should be considered for inclusion in 1.2 
 at the earliest.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3755) Do not bind port 1 - 1024 to server in spark

2014-10-05 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159737#comment-14159737
 ] 

Andrew Ash commented on SPARK-3755:
---

I think you can only edit the title and description of a ticket when the ticket 
is open (and this is now closed)

 Do not bind port 1 - 1024 to server in spark
 

 Key: SPARK-3755
 URL: https://issues.apache.org/jira/browse/SPARK-3755
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: wangfei
Assignee: wangfei
 Fix For: 1.1.1, 1.2.0


 A non-root user using a port in the range 1-1024 to start the jetty server 
 will get the exception java.net.SocketException: Permission denied



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3369) Java mapPartitions Iterator-Iterable is inconsistent with Scala's Iterator-Iterator

2014-10-05 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-3369:
--
Labels: breaking_change  (was: )

 Java mapPartitions Iterator-Iterable is inconsistent with Scala's 
 Iterator-Iterator
 -

 Key: SPARK-3369
 URL: https://issues.apache.org/jira/browse/SPARK-3369
 Project: Spark
  Issue Type: Improvement
  Components: Java API
Affects Versions: 1.0.2
Reporter: Sean Owen
Priority: Critical
  Labels: breaking_change
 Attachments: FlatMapIterator.patch


 {{mapPartitions}} in the Scala RDD API takes a function that transforms an 
 {{Iterator}} to an {{Iterator}}: 
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
 In the Java RDD API, the equivalent is a FlatMapFunction, which operates on 
 an {{Iterator}} but is required to return an {{Iterable}}, which is a 
 stronger condition and appears inconsistent. It's a problematic inconsistency 
 because it seems to require copying all of the input into memory in 
 order to create an object that can be iterated many times, since the input 
 does not afford this itself.
 Similarly for other {{mapPartitions*}} methods and other 
 {{*FlatMapFunctions}}s in Java.
 (Is there a reason for this difference that I'm overlooking?)
 If I'm right that this was inadvertent inconsistency, then the big issue here 
 is that of course this is part of a public API. Workarounds I can think of:
 Promise that Spark will only call {{iterator()}} once, so implementors can 
 use a hacky {{IteratorIterable}} that returns the same {{Iterator}}.
 Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the 
 desired signature, and deprecate existing ones.
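As an illustration of the first workaround above, a minimal sketch (not Spark 
code; the class name is hypothetical) of such an {{IteratorIterable}}, written 
in Scala against the Java interfaces:

{code}
// An Iterable that hands back the one underlying Iterator on every call to
// iterator(); it is only safe if the caller iterates it exactly once.
class IteratorIterable[T](underlying: java.util.Iterator[T]) extends java.lang.Iterable[T] {
  override def iterator(): java.util.Iterator[T] = underlying
}
{code}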



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3792) enable JavaHiveQLSuite

2014-10-05 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3792.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2652
[https://github.com/apache/spark/pull/2652]

 enable JavaHiveQLSuite
 --

 Key: SPARK-3792
 URL: https://issues.apache.org/jira/browse/SPARK-3792
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3645) Make caching using SQL commands eager by default, with the option of being lazy

2014-10-05 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3645.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2513
[https://github.com/apache/spark/pull/2513]

 Make caching using SQL commands eager by default, with the option of being 
 lazy
 ---

 Key: SPARK-3645
 URL: https://issues.apache.org/jira/browse/SPARK-3645
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Critical
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3776) Wrong conversion to Catalyst for Option[Product]

2014-10-05 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3776.
-
Resolution: Fixed

Issue resolved by pull request 2641
[https://github.com/apache/spark/pull/2641]

 Wrong conversion to Catalyst for Option[Product]
 

 Key: SPARK-3776
 URL: https://issues.apache.org/jira/browse/SPARK-3776
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Renat Yusupov
 Fix For: 1.2.0


 Method ScalaReflection.convertToCatalyst make wrong conversion for 
 Option[Product] data.
 For example:
 case class A(intValue: Int)
 case class B(optionA: Option[A])
 val b = B(Some(A(5)))
 ScalaReflection.convertToCatalyst(b) returns Seq(A(5)) instead of Seq(Seq(5))



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3794) Building spark core fails due to inadvertent dependency on Commons IO

2014-10-05 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3794.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2662
[https://github.com/apache/spark/pull/2662]

 Building spark core fails due to inadvertent dependency on Commons IO
 -

 Key: SPARK-3794
 URL: https://issues.apache.org/jira/browse/SPARK-3794
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
 Environment: Mac OS X 10.9.5
Reporter: cocoatomo
  Labels: spark
 Fix For: 1.2.0


 At the commit cf1d32e3e1071829b152d4b597bf0a0d7a5629a2, building spark core 
 results in a compilation error when we specify certain hadoop versions.
 To reproduce this issue, execute the following command with 
 hadoop.version=1.1.0, 1.1.1, 1.1.2, 1.2.0, 1.2.1, or 2.2.0.
 {noformat}
 $ cd ./core
 $ mvn -Dhadoop.version=<hadoop.version> -DskipTests clean compile
 ...
 [ERROR] 
 /Users/tomohiko/MyRepos/Scala/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:720:
  value listFilesAndDirs is not a member of object 
 org.apache.commons.io.FileUtils
 [ERROR]   val files = FileUtils.listFilesAndDirs(dir, 
 TrueFileFilter.TRUE, TrueFileFilter.TRUE)
 [ERROR] ^
 {noformat}
 Because that compilation uses commons-io version 2.1 and the 
 FileUtils#listFilesAndDirs method was added in commons-io version 2.2, the 
 compilation always fails.
 FileUtils#listFilesAndDirs → 
 http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/FileUtils.html#listFilesAndDirs%28java.io.File,%20org.apache.commons.io.filefilter.IOFileFilter,%20org.apache.commons.io.filefilter.IOFileFilter%29
 Because hadoop-client in those problematic versions depends on commons-io 
 2.1, not 2.4, we should assume that commons-io is version 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3568) Add metrics for ranking algorithms

2014-10-05 Thread Shuo Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuo Xiang updated SPARK-3568:
--
Description: 
Include widely-used metrics for ranking algorithms, including:
 - Mean Average Precision
 - Precision@n: top-n precision
 - Discounted cumulative gain (DCG) and NDCG 

This implementation attempts to create a new class called *RankingMetrics* 
under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and 
label pairs) as *RDD[Array[Double], Array[Double]]*. The following methods will 
be implemented:

{code:title=RankingMetrics.scala|borderStyle=solid}
class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) {
  /* Returns the precsion@k for each query */
  lazy val precAtK: RDD[Array[Double]]

  /* Returns the average precision for each query */
  lazy val avePrec: RDD[Double]

  /*Returns the mean average precision (MAP) of all the queries*/
  lazy val meanAvePrec: Double

  /*Returns the normalized discounted cumulative gain for each query */
  lazy val ndcg: RDD[Double]

  /* Returns the mean NDCG of all the queries */
  lazy val meanNdcg: Double
}
{code}


  was:
Include widely-used metrics for ranking algorithms, including:
 - Mean Average Precision
 - Precision@n: top-n precision
 - Discounted cumulative gain (DCG) and NDCG 

This implementation attempts to create a new class called *RankingMetrics* 
under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and 
label pairs) as *RDD[Array[Double], Array[Double]]*. The following methods will 
be implemented:

 - *averagePrecision(position: Int): Double*: this is the precision@position
 - *meanAveragePrecision*: the average of precision@n for all values of n
 - *ndcg*




 Add metrics for ranking algorithms
 --

 Key: SPARK-3568
 URL: https://issues.apache.org/jira/browse/SPARK-3568
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Shuo Xiang
Assignee: Shuo Xiang

 Include widely-used metrics for ranking algorithms, including:
  - Mean Average Precision
  - Precision@n: top-n precision
  - Discounted cumulative gain (DCG) and NDCG 
 This implementation attempts to create a new class called *RankingMetrics* 
 under *org.apache.spark.mllib.evaluation*, which accepts input (prediction 
 and label pairs) as *RDD[Array[Double], Array[Double]]*. The following 
 methods will be implemented:
 {code:title=RankingMetrics.scala|borderStyle=solid}
 class RankingMetrics(predictionAndLabels: RDD[(Array[Double], 
 Array[Double])]) {
   /* Returns the precsion@k for each query */
   lazy val precAtK: RDD[Array[Double]]
   /* Returns the average precision for each query */
   lazy val avePrec: RDD[Double]
   /*Returns the mean average precision (MAP) of all the queries*/
   lazy val meanAvePrec: Double
   /*Returns the normalized discounted cumulative gain for each query */
   lazy val ndcg: RDD[Double]
   /* Returns the mean NDCG of all the queries */
   lazy val meanNdcg: Double
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3568) Add metrics for ranking algorithms

2014-10-05 Thread Shuo Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuo Xiang updated SPARK-3568:
--
Description: 
Include common metrics for ranking algorithms 
(http://www-nlp.stanford.edu/IR-book/), including:
 - Mean Average Precision
 - Precision@n: top-n precision
 - Discounted cumulative gain (DCG) and NDCG 

This implementation attempts to create a new class called *RankingMetrics* 
under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and 
label pairs) as *RDD[Array[Double], Array[Double]]*. The following methods will 
be implemented:

{code:title=RankingMetrics.scala|borderStyle=solid}
class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) {
  /* Returns the precsion@k for each query */
  lazy val precAtK: RDD[Array[Double]]

  /* Returns the average precision for each query */
  lazy val avePrec: RDD[Double]

  /*Returns the mean average precision (MAP) of all the queries*/
  lazy val meanAvePrec: Double

  /*Returns the normalized discounted cumulative gain for each query */
  lazy val ndcg: RDD[Double]

  /* Returns the mean NDCG of all the queries */
  lazy val meanNdcg: Double
}
{code}


  was:
Include common metrics for ranking 
algorithms(http://www-nlp.stanford.edu/IR-book/), including:
 - Mean Average Precision
 - Precision@n: top-n precision
 - Discounted cumulative gain (DCG) and NDCG 

This implementation attempts to create a new class called *RankingMetrics* 
under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and 
label pairs) as *RDD[Array[Double], Array[Double]]*. The following methods will 
be implemented:

{code:title=RankingMetrics.scala|borderStyle=solid}
class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) {
  /* Returns the precsion@k for each query */
  lazy val precAtK: RDD[Array[Double]]

  /* Returns the average precision for each query */
  lazy val avePrec: RDD[Double]

  /*Returns the mean average precision (MAP) of all the queries*/
  lazy val meanAvePrec: Double

  /*Returns the normalized discounted cumulative gain for each query */
  lazy val ndcg: RDD[Double]

  /* Returns the mean NDCG of all the queries */
  lazy val meanNdcg: Double
}
{code}



 Add metrics for ranking algorithms
 --

 Key: SPARK-3568
 URL: https://issues.apache.org/jira/browse/SPARK-3568
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Shuo Xiang
Assignee: Shuo Xiang

 Include common metrics for ranking algorithms 
 (http://www-nlp.stanford.edu/IR-book/), including:
  - Mean Average Precision
  - Precision@n: top-n precision
  - Discounted cumulative gain (DCG) and NDCG 
 This implementation attempts to create a new class called *RankingMetrics* 
 under *org.apache.spark.mllib.evaluation*, which accepts input (prediction 
 and label pairs) as *RDD[Array[Double], Array[Double]]*. The following 
 methods will be implemented:
 {code:title=RankingMetrics.scala|borderStyle=solid}
 class RankingMetrics(predictionAndLabels: RDD[(Array[Double], 
 Array[Double])]) {
   /* Returns the precsion@k for each query */
   lazy val precAtK: RDD[Array[Double]]
   /* Returns the average precision for each query */
   lazy val avePrec: RDD[Double]
   /*Returns the mean average precision (MAP) of all the queries*/
   lazy val meanAvePrec: Double
   /*Returns the normalized discounted cumulative gain for each query */
   lazy val ndcg: RDD[Double]
   /* Returns the mean NDCG of all the queries */
   lazy val meanNdcg: Double
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3568) Add metrics for ranking algorithms

2014-10-05 Thread Shuo Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuo Xiang updated SPARK-3568:
--
Description: 
Include common metrics for ranking 
algorithms(http://www-nlp.stanford.edu/IR-book/), including:
 - Mean Average Precision
 - Precision@n: top-n precision
 - Discounted cumulative gain (DCG) and NDCG 

This implementation attempts to create a new class called *RankingMetrics* 
under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and 
label pairs) as *RDD[Array[Double], Array[Double]]*. The following methods will 
be implemented:

{code:title=RankingMetrics.scala|borderStyle=solid}
class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) {
  /* Returns the precsion@k for each query */
  lazy val precAtK: RDD[Array[Double]]

  /* Returns the average precision for each query */
  lazy val avePrec: RDD[Double]

  /*Returns the mean average precision (MAP) of all the queries*/
  lazy val meanAvePrec: Double

  /*Returns the normalized discounted cumulative gain for each query */
  lazy val ndcg: RDD[Double]

  /* Returns the mean NDCG of all the queries */
  lazy val meanNdcg: Double
}
{code}


  was:
Include widely-used metrics for ranking algorithms, including:
 - Mean Average Precision
 - Precision@n: top-n precision
 - Discounted cumulative gain (DCG) and NDCG 

This implementation attempts to create a new class called *RankingMetrics* 
under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and 
label pairs) as *RDD[Array[Double], Array[Double]]*. The following methods will 
be implemented:

{code:title=RankingMetrics.scala|borderStyle=solid}
class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) {
  /* Returns the precsion@k for each query */
  lazy val precAtK: RDD[Array[Double]]

  /* Returns the average precision for each query */
  lazy val avePrec: RDD[Double]

  /*Returns the mean average precision (MAP) of all the queries*/
  lazy val meanAvePrec: Double

  /*Returns the normalized discounted cumulative gain for each query */
  lazy val ndcg: RDD[Double]

  /* Returns the mean NDCG of all the queries */
  lazy val meanNdcg: Double
}
{code}



 Add metrics for ranking algorithms
 --

 Key: SPARK-3568
 URL: https://issues.apache.org/jira/browse/SPARK-3568
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Shuo Xiang
Assignee: Shuo Xiang

 Include common metrics for ranking 
 algorithms(http://www-nlp.stanford.edu/IR-book/), including:
  - Mean Average Precision
  - Precision@n: top-n precision
  - Discounted cumulative gain (DCG) and NDCG 
 This implementation attempts to create a new class called *RankingMetrics* 
 under *org.apache.spark.mllib.evaluation*, which accepts input (prediction 
 and label pairs) as *RDD[Array[Double], Array[Double]]*. The following 
 methods will be implemented:
 {code:title=RankingMetrics.scala|borderStyle=solid}
 class RankingMetrics(predictionAndLabels: RDD[(Array[Double], 
 Array[Double])]) {
   /* Returns the precsion@k for each query */
   lazy val precAtK: RDD[Array[Double]]
   /* Returns the average precision for each query */
   lazy val avePrec: RDD[Double]
   /*Returns the mean average precision (MAP) of all the queries*/
   lazy val meanAvePrec: Double
   /*Returns the normalized discounted cumulative gain for each query */
   lazy val ndcg: RDD[Double]
   /* Returns the mean NDCG of all the queries */
   lazy val meanNdcg: Double
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3806) minor bug and exception in CliSuite

2014-10-05 Thread wangfei (JIRA)
wangfei created SPARK-3806:
--

 Summary: minor bug and exception in CliSuite
 Key: SPARK-3806
 URL: https://issues.apache.org/jira/browse/SPARK-3806
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei


CliSuite throws an exception as follows:
Exception in thread "Thread-6" java.lang.IndexOutOfBoundsException: 6
at 
scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
at 
org.apache.spark.sql.hive.thriftserver.CliSuite.org$apache$spark$sql$hive$thriftserver$CliSuite$$captureOutput$1(CliSuite.scala:67)
at 
org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78)
at 
org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78)
at scala.sys.process.ProcessLogger$$anon$1.out(ProcessLogger.scala:96)
at 
scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135)
at 
scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135)
at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:175)
at scala.sys.process.BasicIO$.processLinesFully(BasicIO.scala:179)
at 
scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:164)
at 
scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:162)
at 
scala.sys.process.ProcessBuilderImpl$Simple$$anonfun$3.apply$mcV$sp(ProcessBuilderImpl.scala:73)
at scala.sys.process.ProcessImpl$Spawn$$anon$1.run(ProcessImpl.scala:22)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3805) Enable Standalone worker cleanup by default

2014-10-05 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159837#comment-14159837
 ] 

Patrick Wendell commented on SPARK-3805:


I commented on the PR - but because this deletes user data, I don't think it's 
a good idea to enable it by default.

 Enable Standalone worker cleanup by default
 ---

 Key: SPARK-3805
 URL: https://issues.apache.org/jira/browse/SPARK-3805
 Project: Spark
  Issue Type: Task
Reporter: Andrew Ash

 Now that SPARK-1860 is fixed we should be able to turn on 
 {{spark.worker.cleanup.enabled}} by default



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads.

2014-10-05 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159838#comment-14159838
 ] 

Patrick Wendell commented on SPARK-3561:


[~ozhurakousky] do you have a timeline for posting a more complete design doc 
for what the idea is with this?

 Native Hadoop/YARN integration for batch/ETL workloads.
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal: 
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark. 
 The trait will define only 4 operations: 
 * hadoopFile 
 * newAPIHadoopFile 
 * broadcast 
 * runJob 
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have the option to provide a custom JobExecutionContext, 
 either by implementing it from scratch or by extending DefaultExecutionContext. 
 Please see the attached design doc for more details. 
 Pull Request will be posted shortly as well
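
To make the shape of this proposal easier to picture, here is a rough, editorial sketch of such a trait. The signatures are simplified versions of the corresponding SparkContext methods and are not taken from the actual patch; passing the SparkContext explicitly is just one possible way to let an implementation delegate back to Spark.

{code:title=JobExecutionContextSketch.scala|borderStyle=solid}
import scala.reflect.ClassTag

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.InputFormat
import org.apache.hadoop.mapreduce.{InputFormat => NewInputFormat}
import org.apache.spark.{SparkContext, TaskContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Hypothetical gateway trait; the real developer API would mirror SparkContext exactly.
trait JobExecutionContext {
  def hadoopFile[K, V](
      sc: SparkContext,
      path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int): RDD[(K, V)]

  def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
      sc: SparkContext,
      path: String,
      fClass: Class[F],
      kClass: Class[K],
      vClass: Class[V],
      conf: Configuration): RDD[(K, V)]

  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]

  def runJob[T, U: ClassTag](
      sc: SparkContext,
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit): Unit
}
{code}

Under this reading, a master URL of the form execution-context:foo.bar.MyJobExecutionContext would simply name the implementation class to load, with DefaultExecutionContext preserving the existing SparkContext behavior.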



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3561) Allow for pluggable execution contexts in Spark

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3561:
---
Summary: Allow for pluggable execution contexts in Spark  (was: Native 
Hadoop/YARN integration for batch/ETL workloads.)

 Allow for pluggable execution contexts in Spark
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal: 
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark. 
 The trait will define only 4 operations: 
 * hadoopFile 
 * newAPIHadoopFile 
 * broadcast 
 * runJob 
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have the option to provide a custom JobExecutionContext, 
 either by implementing it from scratch or by extending DefaultExecutionContext. 
 Please see the attached design doc for more details. 
 Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark

2014-10-05 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159840#comment-14159840
 ] 

Patrick Wendell commented on SPARK-3561:


I also changed the title here that reflects the current design doc and JIRA. We 
have a culture in the project of having JIRA titles reflect accurately the 
current proposal. We can change it again if a new doc causes the scope of this 
to change.

[~ozhurakousky] I'd prefer not to change the title back until there is a new 
design proposed. The problem is that people are confusing this with SPARK-3174 
and SPARK-3797.

 Allow for pluggable execution contexts in Spark
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal: 
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark. 
 The trait will define only 4 operations: 
 * hadoopFile 
 * newAPIHadoopFile 
 * broadcast 
 * runJob 
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have the option to provide a custom JobExecutionContext, 
 either by implementing it from scratch or by extending DefaultExecutionContext. 
 Please see the attached design doc for more details. 
 Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3561) Allow for pluggable execution contexts in Spark

2014-10-05 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159840#comment-14159840
 ] 

Patrick Wendell edited comment on SPARK-3561 at 10/6/14 3:55 AM:
-

I also changed the title here that reflects the current design doc and pull 
request. We have a culture in the project of having JIRA titles reflect 
accurately the current proposal. We can change it again if a new doc causes the 
scope of this to change.

[~ozhurakousky] I'd prefer not to change the title back until there is a new 
design proposed. The problem is that people are confusing this with SPARK-3174 
and SPARK-3797.


was (Author: pwendell):
I also changed the title here that reflects the current design doc and JIRA. We 
have a culture in the project of having JIRA titles reflect accurately the 
current proposal. We can change it again if a new doc causes the scope of this 
to change.

[~ozhurakousky] I'd prefer not to change the title back until there is a new 
design proposed. The problem is that people are confusing this with SPARK-3174 
and SPARK-3797.

 Allow for pluggable execution contexts in Spark
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal: 
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark. 
 The trait will define only 4 operations: 
 * hadoopFile 
 * newAPIHadoopFile 
 * broadcast 
 * runJob 
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have the option to provide a custom JobExecutionContext, 
 either by implementing it from scratch or by extending DefaultExecutionContext. 
 Please see the attached design doc for more details. 
 Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3761) Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3761:
---
Priority: Major  (was: Blocker)

 Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4
 -

 Key: SPARK-3761
 URL: https://issues.apache.org/jira/browse/SPARK-3761
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Igor Tkachenko

 I have Scala code:
 val master = "spark://server address:7077"
 val sc = new SparkContext(new SparkConf()
   .setMaster(master)
   .setAppName("SparkQueryDemo 01")
   .set("spark.executor.memory", "512m"))
 val count2 = sc.textFile("hdfs://server address:8020/tmp/data/risk/account.txt")
   .filter(line => line.contains("Word"))
   .count()
 I've got such an error:
 [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to 
 stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception 
 failure in TID 6 on host server address: 
 java.lang.ClassNotFoundException: SimpleApp$$anonfun$1
 My dependencies:
 object Version {
   val spark = "1.0.0-cdh5.1.0"
 }
 object Library {
   val sparkCore = "org.apache.spark" % "spark-assembly_2.10" % Version.spark
 }
 My OS is Win 7, sbt 13.5, Scala 2.10.4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3806) minor bug and exception in CliSuite

2014-10-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159842#comment-14159842
 ] 

Apache Spark commented on SPARK-3806:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/2666

 minor bug and exception in CliSuite
 ---

 Key: SPARK-3806
 URL: https://issues.apache.org/jira/browse/SPARK-3806
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei

 CliSuite throws an exception as follows:
 Exception in thread "Thread-6" java.lang.IndexOutOfBoundsException: 6
   at 
 scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
   at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
   at 
 org.apache.spark.sql.hive.thriftserver.CliSuite.org$apache$spark$sql$hive$thriftserver$CliSuite$$captureOutput$1(CliSuite.scala:67)
   at 
 org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78)
   at 
 org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78)
   at scala.sys.process.ProcessLogger$$anon$1.out(ProcessLogger.scala:96)
   at 
 scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135)
   at 
 scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135)
   at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:175)
   at scala.sys.process.BasicIO$.processLinesFully(BasicIO.scala:179)
   at 
 scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:164)
   at 
 scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:162)
   at 
 scala.sys.process.ProcessBuilderImpl$Simple$$anonfun$3.apply$mcV$sp(ProcessBuilderImpl.scala:73)
   at scala.sys.process.ProcessImpl$Spawn$$anon$1.run(ProcessImpl.scala:22)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3761) Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3761:
---
Component/s: Spark Core

 Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4
 -

 Key: SPARK-3761
 URL: https://issues.apache.org/jira/browse/SPARK-3761
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Igor Tkachenko
Priority: Blocker

 I have Scala code:
 val master = "spark://server address:7077"
 val sc = new SparkContext(new SparkConf()
   .setMaster(master)
   .setAppName("SparkQueryDemo 01")
   .set("spark.executor.memory", "512m"))
 val count2 = sc.textFile("hdfs://server address:8020/tmp/data/risk/account.txt")
   .filter(line => line.contains("Word"))
   .count()
 I've got such an error:
 [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to 
 stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception 
 failure in TID 6 on host server address: 
 java.lang.ClassNotFoundException: SimpleApp$$anonfun$1
 My dependencies:
 object Version {
   val spark = "1.0.0-cdh5.1.0"
 }
 object Library {
   val sparkCore = "org.apache.spark" % "spark-assembly_2.10" % Version.spark
 }
 My OS is Win 7, sbt 13.5, Scala 2.10.4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3761) Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3761:
---
Priority: Critical  (was: Major)

 Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4
 -

 Key: SPARK-3761
 URL: https://issues.apache.org/jira/browse/SPARK-3761
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Igor Tkachenko
Priority: Critical

 I have Scala code:
 val master = "spark://server address:7077"
 val sc = new SparkContext(new SparkConf()
   .setMaster(master)
   .setAppName("SparkQueryDemo 01")
   .set("spark.executor.memory", "512m"))
 val count2 = sc.textFile("hdfs://server address:8020/tmp/data/risk/account.txt")
   .filter(line => line.contains("Word"))
   .count()
 I've got such an error:
 [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to 
 stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception 
 failure in TID 6 on host server address: 
 java.lang.ClassNotFoundException: SimpleApp$$anonfun$1
 My dependencies:
 object Version {
   val spark = "1.0.0-cdh5.1.0"
 }
 object Library {
   val sparkCore = "org.apache.spark" % "spark-assembly_2.10" % Version.spark
 }
 My OS is Win 7, sbt 13.5, Scala 2.10.4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3806) minor bug in CliSuite

2014-10-05 Thread wangfei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangfei updated SPARK-3806:
---
Summary: minor bug in CliSuite  (was: minor bug and exception in CliSuite)

 minor bug in CliSuite
 -

 Key: SPARK-3806
 URL: https://issues.apache.org/jira/browse/SPARK-3806
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei

 CliSuite throws an exception as follows:
 Exception in thread "Thread-6" java.lang.IndexOutOfBoundsException: 6
   at 
 scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
   at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
   at 
 org.apache.spark.sql.hive.thriftserver.CliSuite.org$apache$spark$sql$hive$thriftserver$CliSuite$$captureOutput$1(CliSuite.scala:67)
   at 
 org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78)
   at 
 org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78)
   at scala.sys.process.ProcessLogger$$anon$1.out(ProcessLogger.scala:96)
   at 
 scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135)
   at 
 scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135)
   at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:175)
   at scala.sys.process.BasicIO$.processLinesFully(BasicIO.scala:179)
   at 
 scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:164)
   at 
 scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:162)
   at 
 scala.sys.process.ProcessBuilderImpl$Simple$$anonfun$3.apply$mcV$sp(ProcessBuilderImpl.scala:73)
   at scala.sys.process.ProcessImpl$Spawn$$anon$1.run(ProcessImpl.scala:22)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3783) The type parameters for SparkContext.accumulable are inconsistent Accumulable itself

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3783:
---
Assignee: Nathan Kronenfeld

 The type parameters for SparkContext.accumulable are inconsistent Accumulable 
 itself
 

 Key: SPARK-3783
 URL: https://issues.apache.org/jira/browse/SPARK-3783
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Nathan Kronenfeld
Assignee: Nathan Kronenfeld
Priority: Minor
 Fix For: 1.2.0

   Original Estimate: 10m
  Remaining Estimate: 10m

 SparkContext.accumulable takes type parameters [T, R] - and passes them to 
 accumulable, in that order.
 Accumulable takes type parameters [R, T].
 So T for SparkContext.accumulable corresponds with R for Accumulable and vice 
 versa.
 Minor, but very confusing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3783) The type parameters for SparkContext.accumulable are inconsistent Accumulable itself

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3783.

   Resolution: Fixed
Fix Version/s: 1.2.0

https://github.com/apache/spark/pull/2637

 The type parameters for SparkContext.accumulable are inconsistent Accumulable 
 itself
 

 Key: SPARK-3783
 URL: https://issues.apache.org/jira/browse/SPARK-3783
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Nathan Kronenfeld
Priority: Minor
 Fix For: 1.2.0

   Original Estimate: 10m
  Remaining Estimate: 10m

 SparkContext.accumulable takes type parameters [T, R] - and passes them to 
 accumulable, in that order.
 Accumulable takes type parameters [R, T].
 So T for SparkContext.accumulable corresponds with R for Accumulable and vice 
 versa.
 Minor, but very confusing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3782) AkkaUtils directly using log4j

2014-10-05 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159846#comment-14159846
 ] 

Patrick Wendell commented on SPARK-3782:


Would you mind pasting the exact exception? Also, what happens if you set 
spark.akka.logLifecycleEvents to true - does that work as a workaround?

 AkkaUtils directly using log4j
 --

 Key: SPARK-3782
 URL: https://issues.apache.org/jira/browse/SPARK-3782
 Project: Spark
  Issue Type: Bug
Reporter: Martin Gilday

 AkkaUtils calls setLevel on log4j's Logger directly. This causes issues when 
 using another SLF4J backend such as Logback, because log4j-over-slf4j.jar's 
 implementation of this class does not provide this method on Logger.
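
For illustration only, the kind of call described above looks roughly like this; it is not the actual AkkaUtils source, and the logger name is made up:

{code:title=Log4jDirectCall.scala|borderStyle=solid}
import org.apache.log4j.{Level, Logger}

object Log4jDirectCall {
  def main(args: Array[String]): Unit = {
    // Fine against real log4j. Per the report above, log4j-over-slf4j's Logger
    // does not provide setLevel, so a call like this breaks when Logback backs SLF4J.
    Logger.getLogger("akka.remote").setLevel(Level.ERROR)
  }
}
{code}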



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3782) Direct use of log4j in AkkaUtils interferes with certain logging configurations

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3782:
---
Summary: Direct use of log4j in AkkaUtils interferes with certain logging 
configurations   (was: AkkaUtils directly using log4j)

 Direct use of log4j in AkkaUtils interferes with certain logging 
 configurations 
 

 Key: SPARK-3782
 URL: https://issues.apache.org/jira/browse/SPARK-3782
 Project: Spark
  Issue Type: Bug
Reporter: Martin Gilday

 AkkaUtils calls setLevel on log4j's Logger directly. This causes issues when 
 using another SLF4J backend such as Logback, because log4j-over-slf4j.jar's 
 implementation of this class does not provide this method on Logger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3314) Script creation of AMIs

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3314.

Resolution: Fixed
  Assignee: Patrick Wendell

So I think the original goal of this JIRA was just to provide some way for 
users to re-create our AMI, which has been solved by the work in spark-ec2. 

https://github.com/mesos/spark-ec2/blob/v3/create_image.sh

Nick - maybe we can open a new issue with your work? It might be nice to see a 
design proposal for that as well.

 Script creation of AMIs
 ---

 Key: SPARK-3314
 URL: https://issues.apache.org/jira/browse/SPARK-3314
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: holdenk
Assignee: Patrick Wendell
Priority: Minor

 The current Spark AMIs have been built up over time. It would be useful to 
 provide a script which can be used to bootstrap from a fresh Amazon AMI. We 
 could also update the AMIs in the project at the same time to be based on a 
 newer version so we don't have to wait so long for the security updates to be 
 installed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3694:
---
Assignee: (was: Patrick Wendell)

 Allow printing object graph of tasks/RDD's with a debug flag
 

 Key: SPARK-3694
 URL: https://issues.apache.org/jira/browse/SPARK-3694
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
  Labels: starter

 This would be useful for debugging extra references inside of RDDs.
 Here is an example for inspiration:
 http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html
 We'd want to print this trace for both the RDD serialization inside of the 
 DAGScheduler and the task serialization in the TaskSetManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1190) Do not initialize log4j if slf4j log4j backend is not being used

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1190:
---
Assignee: Patrick Wendell  (was: Patrick Cogan)

 Do not initialize log4j if slf4j log4j backend is not being used
 

 Key: SPARK-1190
 URL: https://issues.apache.org/jira/browse/SPARK-1190
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 0.9.1, 1.0.0


 I already have a patch here; I just need to test it and commit. IIRC there were 
 some issues with the Maven build.
 https://github.com/apache/incubator-spark/pull/573
 https://github.com/pwendell/incubator-spark/commit/66594e88e5be50fca073a7ef38fa62db4082b3c8
 initialization with: java.lang.StackOverflowError
   at java.lang.ThreadLocal.access$400(ThreadLocal.java:72)
   at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:376)
   at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:261)
   at java.lang.ThreadLocal.get(ThreadLocal.java:146)
   at java.lang.StringCoding.deref(StringCoding.java:63)
   at java.lang.StringCoding.encode(StringCoding.java:330)
   at java.lang.String.getBytes(String.java:916)
   at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
   at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
   at java.io.File.exists(File.java:813)
   at sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:1080)
   at sun.misc.URLClassPath$FileLoader.findResource(URLClassPath.java:1047)
   at sun.misc.URLClassPath.findResource(URLClassPath.java:176)
   at java.net.URLClassLoader$2.run(URLClassLoader.java:551)
   at java.net.URLClassLoader$2.run(URLClassLoader.java:549)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findResource(URLClassLoader.java:548)
   at java.lang.ClassLoader.getResource(ClassLoader.java:1147)
   at org.apache.spark.Logging$class.initializeLogging(Logging.scala:109)
   at 
 org.apache.spark.Logging$class.initializeIfNecessary(Logging.scala:97)
   at org.apache.spark.Logging$class.log(Logging.scala:36)
   at org.apache.spark.util.Utils$.log(Utils.scala:47)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3805) Enable Standalone worker cleanup by default

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3805:
---
Component/s: Deploy

 Enable Standalone worker cleanup by default
 ---

 Key: SPARK-3805
 URL: https://issues.apache.org/jira/browse/SPARK-3805
 Project: Spark
  Issue Type: Task
  Components: Deploy
Reporter: Andrew Ash

 Now that SPARK-1860 is fixed we should be able to turn on 
 {{spark.worker.cleanup.enabled}} by default



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3174) Provide elastic scaling within a Spark application

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3174:
---
Component/s: YARN
 Spark Core

 Provide elastic scaling within a Spark application
 --

 Key: SPARK-3174
 URL: https://issues.apache.org/jira/browse/SPARK-3174
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.0.2
Reporter: Sandy Ryza
Assignee: Andrew Or
 Attachments: SPARK-3174design.pdf


 A common complaint with Spark in a multi-tenant environment is that 
 applications have a fixed allocation that doesn't grow and shrink with their 
 resource needs.  We're blocked on YARN-1197 for dynamically changing the 
 resources within executors, but we can still allocate and discard whole 
 executors.
 I think it would be useful to have some heuristics that
 * Request more executors when many pending tasks are building up
 * Request more executors when RDDs can't fit in memory
 * Discard executors when few tasks are running / pending and there's not much 
 in memory
 Bonus points: migrate blocks from executors we're about to discard to 
 executors with free space.
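
Purely as an editorial illustration of the heuristics listed above (none of these hooks exist in Spark yet; the snapshot fields and the thresholds are made up):

{code:title=ScalingHeuristicSketch.scala|borderStyle=solid}
// Hypothetical view of the cluster; field names are illustrative, not a Spark API.
case class ClusterSnapshot(
    pendingTasks: Int,
    runningTasks: Int,
    cachedBytesNeeded: Long,
    storageMemoryFree: Long)

object ScalingHeuristicSketch {
  /** Positive: request an executor; negative: release one; zero: do nothing. */
  def desiredExecutorDelta(s: ClusterSnapshot): Int = {
    if (s.pendingTasks > 2 * s.runningTasks) 1              // backlog of pending tasks is building up
    else if (s.cachedBytesNeeded > s.storageMemoryFree) 1   // RDDs do not fit in memory
    else if (s.pendingTasks == 0 && s.runningTasks == 0) -1 // idle, little running or pending
    else 0
  }
}
{code}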



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3765) Add test information to sbt build docs

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3765:
---
Summary: Add test information to sbt build docs  (was: add testing with sbt 
to doc)

 Add test information to sbt build docs
 --

 Key: SPARK-3765
 URL: https://issues.apache.org/jira/browse/SPARK-3765
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.1.0
Reporter: wangfei





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3034) [HIve] java.sql.Date cannot be cast to java.sql.Timestamp

2014-10-05 Thread Venkata Ramana G (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159998#comment-14159998
 ] 

Venkata Ramana G commented on SPARK-3034:
-

This requires adding another data type, DateType. Modifications are needed in 
the parser, in the data type definitions, and in DataType conversion to and 
from Timestamp and String, along with compatibility with the Date type 
supported in Hive 0.12.0 and with Date UDFs.

I have started working on this.

 [HIve] java.sql.Date cannot be cast to java.sql.Timestamp
 -

 Key: SPARK-3034
 URL: https://issues.apache.org/jira/browse/SPARK-3034
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.0.2
Reporter: pengyanhong
Priority: Blocker

 Running a simple HiveQL query via yarn-cluster produced the error below:
 {quote}
 Exception in thread Thread-2 java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:199)
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 0.0:127 failed 3 times, most recent failure: Exception failure in TID 
 141 on host A01-R06-I147-41.jd.local: java.lang.ClassCastException: 
 java.sql.Date cannot be cast to java.sql.Timestamp
 
 org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaTimestampObjectInspector.getPrimitiveWritableObject(JavaTimestampObjectInspector.java:33)
 
 org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:251)
 
 org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:486)
 
 org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:439)
 
 org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:423)
 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:200)
 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:192)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 org.apache.spark.scheduler.Task.run(Task.scala:51)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 java.lang.Thread.run(Thread.java:662)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 

[jira] [Commented] (SPARK-3731) RDD caching stops working in pyspark after some time

2014-10-05 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160002#comment-14160002
 ] 

Patrick Wendell commented on SPARK-3731:


[~davies] any chance you can take a look at this?

 RDD caching stops working in pyspark after some time
 

 Key: SPARK-3731
 URL: https://issues.apache.org/jira/browse/SPARK-3731
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
 Environment: Linux, 32bit, both in local mode or in standalone 
 cluster mode
Reporter: Milan Straka
Assignee: Davies Liu
 Attachments: spark-3731.log, spark-3731.py, spark-3731.txt.bz2, 
 worker.log


 Consider a file F which when loaded with sc.textFile and cached takes up 
 slightly more than half of free memory for RDD cache.
 When in PySpark the following is executed:
   1) a = sc.textFile(F)
   2) a.cache().count()
   3) b = sc.textFile(F)
   4) b.cache().count()
 and then the following is repeated (for example 10 times):
   a) a.unpersist().cache().count()
   b) b.unpersist().cache().count()
 after some time, there are no RDD cached in memory.
 Also, since that time, no other RDD ever gets cached (the worker always 
 reports something like WARN CacheManager: Not enough space to cache 
 partition rdd_23_5 in memory! Free memory is 277478190 bytes., even if 
 rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that 
 all executors have 0MB memory used (which is consistent with the CacheManager 
 warning).
 When doing the same in Scala, everything works perfectly.
 I understand that this is a vague description, but I do not know how to 
 describe the problem better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3731) RDD caching stops working in pyspark after some time

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3731:
---
Priority: Critical  (was: Major)

 RDD caching stops working in pyspark after some time
 

 Key: SPARK-3731
 URL: https://issues.apache.org/jira/browse/SPARK-3731
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
 Environment: Linux, 32bit, both in local mode or in standalone 
 cluster mode
Reporter: Milan Straka
Assignee: Davies Liu
Priority: Critical
 Attachments: spark-3731.log, spark-3731.py, spark-3731.txt.bz2, 
 worker.log


 Consider a file F which when loaded with sc.textFile and cached takes up 
 slightly more than half of free memory for RDD cache.
 When in PySpark the following is executed:
   1) a = sc.textFile(F)
   2) a.cache().count()
   3) b = sc.textFile(F)
   4) b.cache().count()
 and then the following is repeated (for example 10 times):
   a) a.unpersist().cache().count()
   b) b.unpersist().cache().count()
 after some time, there are no RDD cached in memory.
 Also, since that time, no other RDD ever gets cached (the worker always 
 reports something like WARN CacheManager: Not enough space to cache 
 partition rdd_23_5 in memory! Free memory is 277478190 bytes., even if 
 rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that 
 all executors have 0MB memory used (which is consistent with the CacheManager 
 warning).
 When doing the same in Scala, everything works perfectly.
 I understand that this is a vague description, but I do not know how to 
 describe the problem better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3731) RDD caching stops working in pyspark after some time

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3731:
---
Target Version/s: 1.1.1, 1.2.0

 RDD caching stops working in pyspark after some time
 

 Key: SPARK-3731
 URL: https://issues.apache.org/jira/browse/SPARK-3731
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
 Environment: Linux, 32bit, both in local mode or in standalone 
 cluster mode
Reporter: Milan Straka
Assignee: Davies Liu
 Attachments: spark-3731.log, spark-3731.py, spark-3731.txt.bz2, 
 worker.log


 Consider a file F which when loaded with sc.textFile and cached takes up 
 slightly more than half of free memory for RDD cache.
 When in PySpark the following is executed:
   1) a = sc.textFile(F)
   2) a.cache().count()
   3) b = sc.textFile(F)
   4) b.cache().count()
 and then the following is repeated (for example 10 times):
   a) a.unpersist().cache().count()
   b) b.unpersist().cache().count()
 after some time, there are no RDD cached in memory.
 Also, since that time, no other RDD ever gets cached (the worker always 
 reports something like WARN CacheManager: Not enough space to cache 
 partition rdd_23_5 in memory! Free memory is 277478190 bytes., even if 
 rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that 
 all executors have 0MB memory used (which is consistent with the CacheManager 
 warning).
 When doing the same in Scala, everything works perfectly.
 I understand that this is a vague description, but I do not know how to 
 describe the problem better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3731) RDD caching stops working in pyspark after some time

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3731:
---
Assignee: Davies Liu

 RDD caching stops working in pyspark after some time
 

 Key: SPARK-3731
 URL: https://issues.apache.org/jira/browse/SPARK-3731
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
 Environment: Linux, 32bit, both in local mode or in standalone 
 cluster mode
Reporter: Milan Straka
Assignee: Davies Liu
 Attachments: spark-3731.log, spark-3731.py, spark-3731.txt.bz2, 
 worker.log


 Consider a file F which when loaded with sc.textFile and cached takes up 
 slightly more than half of free memory for RDD cache.
 When in PySpark the following is executed:
   1) a = sc.textFile(F)
   2) a.cache().count()
   3) b = sc.textFile(F)
   4) b.cache().count()
 and then the following is repeated (for example 10 times):
   a) a.unpersist().cache().count()
   b) b.unpersist().cache().count()
 after some time, there are no RDD cached in memory.
 Also, since that time, no other RDD ever gets cached (the worker always 
 reports something like WARN CacheManager: Not enough space to cache 
 partition rdd_23_5 in memory! Free memory is 277478190 bytes., even if 
 rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that 
 all executors have 0MB memory used (which is consistent with the CacheManager 
 warning).
 When doing the same in Scala, everything works perfectly.
 I understand that this is a vague description, but I do not know how to 
 describe the problem better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3742) Link to Spark UI sometimes fails when using H/A RM's

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3742:
---
Summary: Link to Spark UI sometimes fails when using H/A RM's  (was: 
SparkUI page unable to open)

 Link to Spark UI sometimes fails when using H/A RM's
 

 Key: SPARK-3742
 URL: https://issues.apache.org/jira/browse/SPARK-3742
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: meiyoula

 When running an application on YARN, the hyperlink on the YARN page sometimes 
 fails to jump to the Spark UI page.
 The error message is: This is standby RM. Redirecting to the current active 
 RM: http://vm-181:8088/proxy/application_1409206382122_0037



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3732:
---
Component/s: YARN

 Yarn Client: Add option to NOT System.exit() at end of main()
 -

 Key: SPARK-3732
 URL: https://issues.apache.org/jira/browse/SPARK-3732
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.1.0
Reporter: Sotos Matzanas
   Original Estimate: 1h
  Remaining Estimate: 1h

 We would like to add the ability to create and submit Spark jobs 
 programmatically via Scala/Java. We have found a way to hack this and submit 
 jobs via YARN, but since 
 org.apache.spark.deploy.yarn.Client.main()
 exits with either 0 or 1 at the end, our own program exits as well. 
 We would like to add an optional Spark conf param to NOT call System.exit() 
 at the end of main().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3742) SparkUI page unable to open

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3742:
---
Component/s: (was: Spark Core)
 YARN

 SparkUI page unable to open
 ---

 Key: SPARK-3742
 URL: https://issues.apache.org/jira/browse/SPARK-3742
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: meiyoula

 When running an application on YARN, the hyperlink on the YARN page sometimes 
 fails to jump to the Spark UI page.
 The error message is: This is standby RM. Redirecting to the current active 
 RM: http://vm-181:8088/proxy/application_1409206382122_0037



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2885) All-pairs similarity via DIMSUM

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2885:
---
Component/s: MLlib

 All-pairs similarity via DIMSUM
 ---

 Key: SPARK-2885
 URL: https://issues.apache.org/jira/browse/SPARK-2885
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Reza Zadeh
Assignee: Reza Zadeh
 Fix For: 1.2.0

 Attachments: SimilarItemsSmallTest.java


 Build all-pairs similarity algorithm via DIMSUM. 
 Given a dataset of sparse vector data, the all-pairs similarity problem is to 
 find all similar vector pairs according to a similarity function such as 
 cosine similarity, and a given similarity score threshold. Sometimes, this 
 problem is called a “similarity join”.
 The brute force approach of considering all pairs quickly breaks, since it 
 scales quadratically. For example, for a million vectors, it is not feasible 
 to check all roughly trillion pairs to see if they are above the similarity 
 threshold. Having said that, there exist clever sampling techniques to focus 
 the computational effort on those pairs that are above the similarity 
 threshold, which makes the problem feasible.
 DIMSUM has a single parameter (called gamma) to tradeoff computation time vs 
 accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of 
 computation vs accuracy from low computation to high accuracy. For a very 
 large gamma, all cosine similarities are computed exactly with no sampling.
 Current PR:
 https://github.com/apache/spark/pull/1778
 Justification for adding to MLlib:
 - All-pairs similarity is missing from MLlib and has been requested several 
 times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see 
 https://github.com/apache/spark/pull/1778#issuecomment-51300825)
 - Algorithm is used in large-scale production at Twitter. e.g. see 
 https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
  . Twitter also open-sourced their version in scalding: 
 https://github.com/twitter/scalding/pull/833
 - When used with the gamma parameter set high, this algorithm becomes the 
 normalized gramian matrix, which is useful in RowMatrix alongside the 
 computeGramianMatrix method already in RowMatrix
 More details about usage at Twitter: 
 https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
 For correctness proof, see Theorem 4.3 in 
 http://stanford.edu/~rezab/papers/dimsum.pdf
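
As an editorial usage sketch, assuming the PR above exposes the DIMSUM computation as a threshold-taking columnSimilarities method on RowMatrix (the tiny dataset and the threshold value are illustrative only):

{code:title=DimsumSketch.scala|borderStyle=solid}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object DimsumSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dimsum-sketch"))

    // Rows of a sparse matrix; we want cosine similarity between its columns.
    val rows = sc.parallelize(Seq(
      Vectors.sparse(4, Seq((0, 1.0), (2, 3.0))),
      Vectors.sparse(4, Seq((1, 2.0), (3, 1.0))),
      Vectors.sparse(4, Seq((0, 4.0), (1, 1.0), (3, 2.0)))
    ))
    val mat = new RowMatrix(rows)

    // A positive threshold enables the sampling described above;
    // 0.0 computes every pairwise similarity exactly.
    val similarities = mat.columnSimilarities(0.1)
    similarities.entries.collect().foreach(println)

    sc.stop()
  }
}
{code}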



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2331) SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2331:
---
Summary: SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]  (was: 
SparkContext.emptyRDD has wrong return type)

 SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]
 --

 Key: SPARK-2331
 URL: https://issues.apache.org/jira/browse/SPARK-2331
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Ian Hummel

 The return type for SparkContext.emptyRDD is EmptyRDD[T].
 It should be RDD[T].  That means you have to add extra type annotations on 
 code like the below (which creates a union of RDDs over some subset of paths 
 in a folder)
 val rdds = Seq("a", "b", "c").foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) ⇒
   rdd.union(sc.textFile(path))
 }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3782) Direct use of log4j in AkkaUtils interferes with certain logging configurations

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3782:
---
Component/s: Spark Core

 Direct use of log4j in AkkaUtils interferes with certain logging 
 configurations 
 

 Key: SPARK-3782
 URL: https://issues.apache.org/jira/browse/SPARK-3782
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Martin Gilday

 AkkaUtils calls setLevel on log4j's Logger directly. This causes issues when 
 using another SLF4J backend such as Logback, because log4j-over-slf4j.jar's 
 implementation of this class does not provide this method on Logger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3694:
---
Component/s: Spark Core

 Allow printing object graph of tasks/RDD's with a debug flag
 

 Key: SPARK-3694
 URL: https://issues.apache.org/jira/browse/SPARK-3694
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell
  Labels: starter

 This would be useful for debugging extra references inside of RDDs.
 Here is an example for inspiration:
 http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html
 We'd want to print this trace for both the RDD serialization inside of the 
 DAGScheduler and the task serialization in the TaskSetManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2331) SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2331:
---
Component/s: Spark Core

 SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]
 --

 Key: SPARK-2331
 URL: https://issues.apache.org/jira/browse/SPARK-2331
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Ian Hummel

 The return type for SparkContext.emptyRDD is EmptyRDD[T].
 It should be RDD[T]. Otherwise you have to add an explicit type annotation to 
 code like the example below (which unions RDDs built from a subset of paths in 
 a folder):
 val rdds = Seq(a, b, c).foldLeft[RDD[String]](sc.emptyRDD[String]) { 
 (rdd, path) ⇒
   rdd.union(sc.textFile(path))
 }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Updated] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag

2014-10-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3694:
---
Issue Type: Improvement  (was: Bug)

 Allow printing object graph of tasks/RDD's with a debug flag
 

 Key: SPARK-3694
 URL: https://issues.apache.org/jira/browse/SPARK-3694
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
  Labels: starter

 This would be useful for debugging extra references held inside RDDs.
 Here is an example for inspiration:
 http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html
 We'd want to print this trace for both the RDD serialization inside the 
 DAGScheduler and the task serialization in the TaskSetManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (SPARK-3568) Add metrics for ranking algorithms

2014-10-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160022#comment-14160022
 ] 

Apache Spark commented on SPARK-3568:
-

User 'coderxiang' has created a pull request for this issue:
https://github.com/apache/spark/pull/2667

 Add metrics for ranking algorithms
 --

 Key: SPARK-3568
 URL: https://issues.apache.org/jira/browse/SPARK-3568
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Shuo Xiang
Assignee: Shuo Xiang

 Include common metrics for ranking algorithms 
 (http://www-nlp.stanford.edu/IR-book/), including:
  - Mean Average Precision
  - Precision@n: top-n precision
  - Discounted cumulative gain (DCG) and NDCG 
 This implementation adds a new class, *RankingMetrics*, under 
 *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label 
 pairs) as *RDD[(Array[Double], Array[Double])]*. The following methods will be 
 implemented:
 {code:title=RankingMetrics.scala|borderStyle=solid}
 class RankingMetrics(predictionAndLabels: RDD[(Array[Double], 
 Array[Double])]) {
   /* Returns the precision@k for each query */
   lazy val precAtK: RDD[Array[Double]]
   /* Returns the average precision for each query */
   lazy val avePrec: RDD[Double]
   /* Returns the mean average precision (MAP) of all the queries */
   lazy val meanAvePrec: Double
   /* Returns the normalized discounted cumulative gain for each query */
   lazy val ndcg: RDD[Double]
   /* Returns the mean NDCG of all the queries */
   lazy val meanNdcg: Double
 }
 {code}
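 For illustration, a hypothetical usage of the proposed class (the data below is 
 made up, with item ids encoded as doubles to match the proposed input type):
 {code}
 // Each element pairs one query's predicted ranking with its relevant item ids.
 val predictionAndLabels = sc.parallelize(Seq(
   (Array(1.0, 6.0, 2.0, 7.0, 8.0), Array(1.0, 2.0, 3.0, 4.0, 5.0)),
   (Array(4.0, 1.0, 5.0, 6.0, 2.0), Array(1.0, 2.0, 3.0))
 ))
 val metrics = new RankingMetrics(predictionAndLabels)
 println(s"MAP = ${metrics.meanAvePrec}, mean NDCG = ${metrics.meanNdcg}")
 {code}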



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Issue Comment Deleted] (SPARK-3034) [Hive] java.sql.Date cannot be cast to java.sql.Timestamp

2014-10-05 Thread Venkata Ramana G (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venkata Ramana G updated SPARK-3034:

Comment: was deleted

(was: Requires adding another data type, DateType. Modifications are required 
in the parser, in the datatype addition itself, and in DataType conversion to 
and from Timestamp and String, plus compatibility with the Date type supported 
in Hive 0.12.0 and with Date UDFs.

Started working on the same.)
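For illustration, a minimal reproduction of the failing cast reported in the 
stack trace below (plain Scala, independent of the Spark/Hive code paths):
{code}
// java.sql.Date and java.sql.Timestamp are sibling subclasses of java.util.Date,
// so a Date value handed to code expecting a Timestamp fails at the cast.
val v: Any = java.sql.Date.valueOf("2014-10-05")
val ts = v.asInstanceOf[java.sql.Timestamp]  // throws ClassCastException
{code}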

 [Hive] java.sql.Date cannot be cast to java.sql.Timestamp
 -

 Key: SPARK-3034
 URL: https://issues.apache.org/jira/browse/SPARK-3034
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.0.2
Reporter: pengyanhong
Priority: Blocker

 Running a simple HiveQL query via yarn-cluster fails with the error below:
 {quote}
 Exception in thread Thread-2 java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:199)
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 0.0:127 failed 3 times, most recent failure: Exception failure in TID 
 141 on host A01-R06-I147-41.jd.local: java.lang.ClassCastException: 
 java.sql.Date cannot be cast to java.sql.Timestamp
 
 org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaTimestampObjectInspector.getPrimitiveWritableObject(JavaTimestampObjectInspector.java:33)
 
 org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:251)
 
 org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:486)
 
 org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:439)
 
 org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:423)
 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:200)
 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:192)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 org.apache.spark.scheduler.Task.run(Task.scala:51)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 java.lang.Thread.run(Thread.java:662)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)