[jira] [Updated] (SPARK-3410) The priority of the shutdown hook for ApplicationMaster should not be an integer literal
[ https://issues.apache.org/jira/browse/SPARK-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3410: -- Summary: The priority of the shutdown hook for ApplicationMaster should not be an integer literal (was: The priority of shutdownhook for ApplicationMaster should not be integer literal, rather than refer constant.) The priority of the shutdown hook for ApplicationMaster should not be an integer literal Key: SPARK-3410 URL: https://issues.apache.org/jira/browse/SPARK-3410 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.1.0 Reporter: Kousuke Saruta Priority: Minor In ApplicationMaster, the priority of the shutdown hook is set to the literal 30, which is expected to be higher than the shutdown hook priority of o.a.h.FileSystem. FileSystem exposes its shutdown hook priority as a public constant named SHUTDOWN_HOOK_PRIORITY, so I think it's better to define the priority of ApplicationMaster's shutdown hook in terms of this constant. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
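A minimal sketch of the suggested change, assuming Hadoop's {{ShutdownHookManager}} and {{FileSystem.SHUTDOWN_HOOK_PRIORITY}} are on the classpath (the actual ApplicationMaster registration code differs):
{code}
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.util.ShutdownHookManager

// Express the priority relative to FileSystem.SHUTDOWN_HOOK_PRIORITY instead of
// a bare literal; higher-priority hooks run first, so this hook provably runs
// before FileSystem's own shutdown hook.
ShutdownHookManager.get().addShutdownHook(new Runnable {
  override def run(): Unit = {
    // ApplicationMaster cleanup would go here
  }
}, FileSystem.SHUTDOWN_HOOK_PRIORITY + 1)
{code}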
[jira] [Created] (SPARK-3412) Add Missing Types for Row API
Cheng Hao created SPARK-3412: Summary: Add Missing Types for Row API Key: SPARK-3412 URL: https://issues.apache.org/jira/browse/SPARK-3412 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2491) When an OOM is thrown, the executor does not stop properly.
[ https://issues.apache.org/jira/browse/SPARK-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122579#comment-14122579 ] Guoqiang Li commented on SPARK-2491: The executor runs multiple tasks at the same time; after {{System.exit}} is called, [DiskBlockManager.scala#L144|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L144] will delete the Spark local dirs. When an OOM is thrown, the executor does not stop properly. -- Key: SPARK-2491 URL: https://issues.apache.org/jira/browse/SPARK-2491 Project: Spark Issue Type: Bug Components: YARN Reporter: Guoqiang Li The executor log:
{code}
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill %p"
#   Executing /bin/sh -c "kill 44942"...
14/07/15 10:38:29 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
14/07/15 10:38:29 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Connection manager future execution context-6,5,main]
java.lang.OutOfMemoryError: Java heap space
    at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
    at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:94)
    at org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176)
    at org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63)
    at org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109)
    at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:125)
    at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:122)
    at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117)
    at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
14/07/15 10:38:29 WARN HadoopRDD: Exception in RecordReader.close()
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
    at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
    at java.io.FilterInputStream.close(FilterInputStream.java:181)
    at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
    at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:243)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
    at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
    at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
    at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
    at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
-
14/07/15 10:38:30 INFO Executor: Running task ID 969
14/07/15 10:38:30 INFO BlockManager: Found block broadcast_0 locally
14/07/15 10:38:30 INFO HadoopRDD: Input split: hdfs://10dian72.domain.test:8020/input/lbs/recommend/toona/rating/20140712/part-7:0+68016537
14/07/15 10:38:30 ERROR Executor: Exception in task ID 969
java.io.FileNotFoundException: /yarn/nm/usercache/spark/appcache/application_1404728465401_0070/spark-local-20140715103235-ffda/2e/merged_shuffle_4_85_0 (No such file or directory)
    at java.io.FileOutputStream.open(Native Method)
    at
[jira] [Created] (SPARK-3413) Spark Blocked due to Executor lost in FIFO MODE
Patrick Liu created SPARK-3413: -- Summary: Spark Blocked due to Executor lost in FIFO MODE Key: SPARK-3413 URL: https://issues.apache.org/jira/browse/SPARK-3413 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.2 Reporter: Patrick Liu I run Spark on YARN, with the Spark scheduler in FIFO mode and 80 worker instances set up. However, as time passes, some workers are lost (killed by the JVM on OOM, etc.), but some tasks are still assigned to those executors. Obviously those tasks will never finish, so their stage never finishes, and the later stages are blocked. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
[ https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122611#comment-14122611 ] Lukas Nalezenec edited comment on SPARK-3369 at 9/5/14 8:44 AM: Hi, this looks like a serious issue to me. How about breaking backward compatibility and making it right in Spark 1.2? (BTW: I found the issue; I had to do a nasty workaround in my code.) was (Author: lukas.nalezenec): Hi, it looks like serious issue for me. How about break backward compatibility and make it right in Spark 1.2 ? (BTW: I found the issue, I had to do nasty workaround in my code). http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/#comments Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator - Key: SPARK-3369 URL: https://issues.apache.org/jira/browse/SPARK-3369 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.0.2 Reporter: Sean Owen Attachments: FlatMapIterator.patch {{mapPartitions}} in the Scala RDD API takes a function that transforms an {{Iterator}} to an {{Iterator}}: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an {{Iterator}} but is required to return an {{Iterable}}, which is a stronger condition and appears inconsistent. It's a problematic inconsistency, because this seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself. Similarly for the other {{mapPartitions*}} methods and other {{*FlatMapFunction}}s in Java. (Is there a reason for this difference that I'm overlooking?) If I'm right that this was an inadvertent inconsistency, then the big issue here is that of course this is part of a public API. Workarounds I can think of: Promise that Spark will only call {{iterator()}} once, so implementors can use a hacky {{IteratorIterable}} that returns the same {{Iterator}}. Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the desired signature, and deprecate the existing ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
[ https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122611#comment-14122611 ] Lukas Nalezenec commented on SPARK-3369: Hi, this looks like a serious issue to me. How about breaking backward compatibility and making it right in Spark 1.2? (BTW: I found the issue; I had to do a nasty workaround in my code.) http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/#comments Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator - Key: SPARK-3369 URL: https://issues.apache.org/jira/browse/SPARK-3369 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.0.2 Reporter: Sean Owen Attachments: FlatMapIterator.patch {{mapPartitions}} in the Scala RDD API takes a function that transforms an {{Iterator}} to an {{Iterator}}: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an {{Iterator}} but is required to return an {{Iterable}}, which is a stronger condition and appears inconsistent. It's a problematic inconsistency, because this seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself. Similarly for the other {{mapPartitions*}} methods and other {{*FlatMapFunction}}s in Java. (Is there a reason for this difference that I'm overlooking?) If I'm right that this was an inadvertent inconsistency, then the big issue here is that of course this is part of a public API. Workarounds I can think of: Promise that Spark will only call {{iterator()}} once, so implementors can use a hacky {{IteratorIterable}} that returns the same {{Iterator}}. Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the desired signature, and deprecate the existing ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122699#comment-14122699 ] Alexander Ulanov commented on SPARK-3403: - I managed to compile OpenBLAS with MinGW64 and `USE_THREAD=0`, and got a single-threaded dll. With this dll my tests didn't fail and seem to execute properly. Thank you for the suggestion! 1) Do you think the same issue will remain on Linux? 2) What are the performance implications of using single-threaded OpenBLAS through breeze? NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java) - Key: SPARK-3403 URL: https://issues.apache.org/jira/browse/SPARK-3403 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2 Environment: Setup: Windows 7, x64 libraries for netlib-java (as described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and MinGW64 precompiled dlls. Reporter: Alexander Ulanov Fix For: 1.1.0 Attachments: NativeNN.scala Code:
{code}
val model = NaiveBayes.train(train)
val predictionAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
predictionAndLabels.foreach(println)
{code}
Result: the program crashes with: Process finished with exit code -1073741819 (0xC0000005) after displaying the first prediction -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2491) When an OOM is thrown, the executor does not stop properly.
[ https://issues.apache.org/jira/browse/SPARK-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122579#comment-14122579 ] Guoqiang Li edited comment on SPARK-2491 at 9/5/14 9:00 AM: The executor runs multiple tasks at the same time. When one task throws an exception, {{System.exit}} is called, and [DiskBlockManager.scala#L144|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L144] will delete the Spark local dirs. The other tasks will then throw a {{FileNotFoundException}}. was (Author: gq): Executor running multiple tasks at the same time,after {{System.exit}} is called,[DiskBlockManager.scala#L144|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L144] will delete Spark local dirs. When an OOM is thrown, the executor does not stop properly. -- Key: SPARK-2491 URL: https://issues.apache.org/jira/browse/SPARK-2491 Project: Spark Issue Type: Bug Components: YARN Reporter: Guoqiang Li The executor log:
{code}
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill %p"
#   Executing /bin/sh -c "kill 44942"...
14/07/15 10:38:29 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
14/07/15 10:38:29 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Connection manager future execution context-6,5,main]
java.lang.OutOfMemoryError: Java heap space
    at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
    at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:94)
    at org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176)
    at org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63)
    at org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109)
    at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:125)
    at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:122)
    at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117)
    at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
14/07/15 10:38:29 WARN HadoopRDD: Exception in RecordReader.close()
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
    at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
    at java.io.FilterInputStream.close(FilterInputStream.java:181)
    at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
    at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:243)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
    at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
    at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
    at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
    at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
-
14/07/15 10:38:30 INFO Executor: Running task ID 969
14/07/15 10:38:30 INFO BlockManager: Found block broadcast_0 locally
14/07/15 10:38:30 INFO HadoopRDD: Input split:
[jira] [Created] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names
Cheng Lian created SPARK-3414: - Summary: Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names Key: SPARK-3414 URL: https://issues.apache.org/jira/browse/SPARK-3414 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Cheng Lian Priority: Critical Paste the following snippet into {{spark-shell}} (Hive support needed) to reproduce this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql(
  """
  SELECT name, message
  FROM rawLogs
  JOIN (
    SELECT name FROM logFiles
  ) files
  ON rawLogs.filename = files.name
  """)

srdd.registerTempTable("boom")

sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
    Join Inner, Some(('rawLogs.filename = name#2))
     LowerCaseSchema
      Subquery rawlogs
       SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
     Subquery files
      Project [name#2]
       LowerCaseSchema
        Subquery logfiles
         SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during the analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed once. When {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog:
{code}
Join Inner, Some(('rawLogs.filename = 'files.name))
 UnresolvedRelation None, rawLogs, None
 Subquery files
  Project ['name]
   UnresolvedRelation None, logFiles, None
{code}
Attributes referenced in the join operator are not lowercased yet. And then, when {{select * from boom}} is being analyzed, the input logical plan is:
{code}
Project [*]
 UnresolvedRelation None, boom, None
{code}
Here the unresolved relation points to the unanalyzed logical plan of {{srdd}}, which is later discovered by rule {{ResolveRelations}}:
{code}
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 Project [*]                          Project [*]
!UnresolvedRelation None, boom, None  LowerCaseSchema
!                                      Subquery boom
!                                       Project ['name,'message]
!                                        Join Inner, Some(('rawLogs.filename = 'files.name))
!                                         LowerCaseSchema
!                                          Subquery rawlogs
!                                           SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
!                                         Subquery files
!                                          Project ['name]
!                                           LowerCaseSchema
!                                            Subquery logfiles
!                                             SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Because the {{CaseInsensitiveAttributeReferences}} batch happens before the {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) is not lowercased, and this causes the resolution failure. A reasonable fix for this could be to always register the analyzed logical plan to the catalog when registering temporary tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
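A minimal sketch of the fix suggested above, with illustrative names ({{analyzer}}, {{catalog}}) rather than the exact 1.0.x API: run the analyzer before the plan is stored, so later lookups of the temporary table resolve an already-analyzed tree.
{code}
// Hypothetical registration path; the real SQLContext wiring differs.
def registerTempTable(name: String, plan: LogicalPlan): Unit = {
  // Run all Analyzer batches (including CaseInsensitiveAttributeReferences)
  // up front, instead of storing the raw unanalyzed plan.
  catalog.registerTable(None, name, analyzer(plan))
}
{code}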
[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names
[ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3414: -- Description: Paste the following snippet into {{spark-shell}} (Hive support needed) to reproduce this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql(
  """
  SELECT name, message
  FROM rawLogs
  JOIN (
    SELECT name FROM logFiles
  ) files
  ON rawLogs.filename = files.name
  """)

srdd.registerTempTable("boom")

sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
    Join Inner, Some(('rawLogs.filename = name#2))
     LowerCaseSchema
      Subquery rawlogs
       SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
     Subquery files
      Project [name#2]
       LowerCaseSchema
        Subquery logfiles
         SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during the analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed before the {{Resolution}} batch. When {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog:
{code}
Join Inner, Some(('rawLogs.filename = 'files.name))
 UnresolvedRelation None, rawLogs, None
 Subquery files
  Project ['name]
   UnresolvedRelation None, logFiles, None
{code}
Attributes referenced in the join operator are not lowercased yet. And then, when {{select * from boom}} is being analyzed, the input logical plan is:
{code}
Project [*]
 UnresolvedRelation None, boom, None
{code}
Here the unresolved relation points to the unanalyzed logical plan of {{srdd}}, which is later discovered by rule {{ResolveRelations}}:
{code}
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 Project [*]                          Project [*]
!UnresolvedRelation None, boom, None  LowerCaseSchema
!                                      Subquery boom
!                                       Project ['name,'message]
!                                        Join Inner, Some(('rawLogs.filename = 'files.name))
!                                         LowerCaseSchema
!                                          Subquery rawlogs
!                                           SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
!                                         Subquery files
!                                          Project ['name]
!                                           LowerCaseSchema
!                                            Subquery logfiles
!                                             SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Because the {{CaseInsensitiveAttributeReferences}} batch happens before the {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) is not lowercased, and this causes the resolution failure. A reasonable fix for this could be to always register the analyzed logical plan to the catalog when registering temporary tables.
was: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable(rawLogs) sc.makeRDD(Seq.empty[LogFile]).registerTempTable(logFiles) val srdd = sql( SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name ) srdd.registerTempTable(boom) sql(select * from boom) {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1]
[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names
[ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3414: -- Description: Paste the following snippet into {{spark-shell}} (Hive support needed) to reproduce this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql(
  """
  SELECT name, message
  FROM rawLogs
  JOIN (
    SELECT name FROM logFiles
  ) files
  ON rawLogs.filename = files.name
  """)

srdd.registerTempTable("boom")

sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
    Join Inner, Some(('rawLogs.filename = name#2))
     LowerCaseSchema
      Subquery rawlogs
       SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
     Subquery files
      Project [name#2]
       LowerCaseSchema
        Subquery logfiles
         SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during the analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed once. When {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog:
{code}
Join Inner, Some(('rawLogs.filename = 'files.name))
 UnresolvedRelation None, rawLogs, None
 Subquery files
  Project ['name]
   UnresolvedRelation None, logFiles, None
{code}
Attributes referenced in the join operator are not lowercased yet. And then, when {{select * from boom}} is being analyzed, the input logical plan is:
{code}
Project [*]
 UnresolvedRelation None, boom, None
{code}
Here the unresolved relation points to the unanalyzed logical plan of {{srdd}}, which is later discovered by rule {{ResolveRelations}}:
{code}
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 Project [*]                          Project [*]
!UnresolvedRelation None, boom, None  LowerCaseSchema
!                                      Subquery boom
!                                       Project ['name,'message]
!                                        Join Inner, Some(('rawLogs.filename = 'files.name))
!                                         LowerCaseSchema
!                                          Subquery rawlogs
!                                           SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
!                                         Subquery files
!                                          Project ['name]
!                                           LowerCaseSchema
!                                            Subquery logfiles
!                                             SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Because the {{CaseInsensitiveAttributeReferences}} batch happens before the {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) is not lowercased, and this causes the resolution failure. A reasonable fix for this could be to always register the analyzed logical plan to the catalog when registering temporary tables.
was: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable(rawLogs) sc.makeRDD(Seq.empty[LogFile]).registerTempTable(logFiles) val srdd = sql( SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name ) srdd.registerTempTable(boom) sql(select * from boom) {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at
[jira] [Comment Edited] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
[ https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122611#comment-14122611 ] Lukas Nalezenec edited comment on SPARK-3369 at 9/5/14 9:17 AM: Hi, this looks like a serious issue to me. How about breaking backward compatibility and making it right in Spark 1.2? (BTW: I found the issue; I had to do a workaround in my code.) was (Author: lukas.nalezenec): Hi, it looks like serious issue for me. How about break backward compatibility and make it right in Spark 1.2 ? (BTW: I found the issue, I had to do nasty workaround in my code). Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator - Key: SPARK-3369 URL: https://issues.apache.org/jira/browse/SPARK-3369 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.0.2 Reporter: Sean Owen Attachments: FlatMapIterator.patch {{mapPartitions}} in the Scala RDD API takes a function that transforms an {{Iterator}} to an {{Iterator}}: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an {{Iterator}} but is required to return an {{Iterable}}, which is a stronger condition and appears inconsistent. It's a problematic inconsistency, because this seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself. Similarly for the other {{mapPartitions*}} methods and other {{*FlatMapFunction}}s in Java. (Is there a reason for this difference that I'm overlooking?) If I'm right that this was an inadvertent inconsistency, then the big issue here is that of course this is part of a public API. Workarounds I can think of: Promise that Spark will only call {{iterator()}} once, so implementors can use a hacky {{IteratorIterable}} that returns the same {{Iterator}}. Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the desired signature, and deprecate the existing ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names
[ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3414: -- Summary: Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names (was: Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names Key: SPARK-3414 URL: https://issues.apache.org/jira/browse/SPARK-3414 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Cheng Lian Priority: Critical Paste the following snippet into {{spark-shell}} (Hive support needed) to reproduce this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql(
  """
  SELECT name, message
  FROM rawLogs
  JOIN (
    SELECT name FROM logFiles
  ) files
  ON rawLogs.filename = files.name
  """)

srdd.registerTempTable("boom")

sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
    Join Inner, Some(('rawLogs.filename = name#2))
     LowerCaseSchema
      Subquery rawlogs
       SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
     Subquery files
      Project [name#2]
       LowerCaseSchema
        Subquery logfiles
         SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during the analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed before the {{Resolution}} batch. When {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog:
{code}
Join Inner, Some(('rawLogs.filename = 'files.name))
 UnresolvedRelation None, rawLogs, None
 Subquery files
  Project ['name]
   UnresolvedRelation None, logFiles, None
{code}
Notice that attributes referenced in the join operator (esp. {{rawLogs}}) are not lowercased yet. And then, when {{select * from boom}} is being analyzed, its input logical plan is:
{code}
Project [*]
 UnresolvedRelation None, boom, None
{code}
Here the unresolved relation points to the unanalyzed logical plan of {{srdd}}, which is later discovered by rule {{ResolveRelations}}:
{code}
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 Project [*]                          Project [*]
!UnresolvedRelation None, boom, None  LowerCaseSchema
!                                      Subquery boom
!                                       Project ['name,'message]
!                                        Join Inner, Some(('rawLogs.filename = 'files.name))
!                                         LowerCaseSchema
!                                          Subquery rawlogs
!                                           SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
!                                         Subquery files
!                                          Project ['name]
!                                           LowerCaseSchema
!                                            Subquery logfiles
!                                             SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Because the {{CaseInsensitiveAttributeReferences}} batch happens before the {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) is not lowercased, and this causes the resolution failure. A reasonable fix for this could be to always register the analyzed logical plan to the catalog when registering temporary tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names
[ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3414: -- Description: Paste the following snippet into {{spark-shell}} (Hive support needed) to reproduce this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql(
  """
  SELECT name, message
  FROM rawLogs
  JOIN (
    SELECT name FROM logFiles
  ) files
  ON rawLogs.filename = files.name
  """)

srdd.registerTempTable("boom")

sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
    Join Inner, Some(('rawLogs.filename = name#2))
     LowerCaseSchema
      Subquery rawlogs
       SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
     Subquery files
      Project [name#2]
       LowerCaseSchema
        Subquery logfiles
         SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during the analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed before the {{Resolution}} batch. When {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog:
{code}
Join Inner, Some(('rawLogs.filename = 'files.name))
 UnresolvedRelation None, rawLogs, None
 Subquery files
  Project ['name]
   UnresolvedRelation None, logFiles, None
{code}
Notice that attributes referenced in the join operator (esp. {{rawLogs}}) are not lowercased yet. And then, when {{select * from boom}} is being analyzed, its input logical plan is:
{code}
Project [*]
 UnresolvedRelation None, boom, None
{code}
Here the unresolved relation points to the unanalyzed logical plan of {{srdd}}, which is later discovered by rule {{ResolveRelations}}:
{code}
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 Project [*]                          Project [*]
!UnresolvedRelation None, boom, None  LowerCaseSchema
!                                      Subquery boom
!                                       Project ['name,'message]
!                                        Join Inner, Some(('rawLogs.filename = 'files.name))
!                                         LowerCaseSchema
!                                          Subquery rawlogs
!                                           SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
!                                         Subquery files
!                                          Project ['name]
!                                           LowerCaseSchema
!                                            Subquery logfiles
!                                             SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Because the {{CaseInsensitiveAttributeReferences}} batch happens before the {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) is not lowercased, and this causes the resolution failure. A reasonable fix for this could be to always register the analyzed logical plan to the catalog when registering temporary tables.
was: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable(rawLogs) sc.makeRDD(Seq.empty[LogFile]).registerTempTable(logFiles) val srdd = sql( SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name ) srdd.registerTempTable(boom) sql(select * from boom) {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd
[jira] [Commented] (SPARK-3412) Add Missing Types for Row API
[ https://issues.apache.org/jira/browse/SPARK-3412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122729#comment-14122729 ] Apache Spark commented on SPARK-3412: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/2284 Add Missing Types for Row API - Key: SPARK-3412 URL: https://issues.apache.org/jira/browse/SPARK-3412 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122699#comment-14122699 ] Alexander Ulanov edited comment on SPARK-3403 at 9/5/14 9:53 AM: - I managed to compile OpenBLAS with MinGW64 and `USE_THREAD=0`, and got a single-threaded dll. With this dll my tests didn't fail and seem to execute properly. Thank you for the suggestion! 1) Do you think the same issue will remain on Linux? 2) What are the performance implications of using single-threaded OpenBLAS through breeze? 3) I didn't get any performance improvement with native libraries versus Java arrays; my matrices are of size up to 10K-20K. Is that expected? was (Author: avulanov): I managed to compile OpenBLAS with MINGW64 and `USE_THREAD=0`. I got single threaded dll. With this dll my tests didn't fail and seem to be executed properly. Thank you for suggestion! 1)Do you think that the same issue will remain in Linux? 2)What are the performance implications when using single threaded OpenBLAS through breeze? NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java) - Key: SPARK-3403 URL: https://issues.apache.org/jira/browse/SPARK-3403 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2 Environment: Setup: Windows 7, x64 libraries for netlib-java (as described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and MinGW64 precompiled dlls. Reporter: Alexander Ulanov Fix For: 1.1.0 Attachments: NativeNN.scala Code:
{code}
val model = NaiveBayes.train(train)
val predictionAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
predictionAndLabels.foreach(println)
{code}
Result: the program crashes with: Process finished with exit code -1073741819 (0xC0000005) after displaying the first prediction -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
[ https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122611#comment-14122611 ] Lukas Nalezenec edited comment on SPARK-3369 at 9/5/14 11:59 AM: - Hi, this looks like a serious issue to me. How about breaking backward compatibility and making it right in Spark 1.2? API users would only need to add x.iterator() to their code. (BTW: I found the issue; I had to do a workaround in my code.) was (Author: lukas.nalezenec): Hi, it looks like serious issue for me. How about break backward compatibility and make it right in Spark 1.2 ? (BTW: I found the issue, I had to do workaround in my code). Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator - Key: SPARK-3369 URL: https://issues.apache.org/jira/browse/SPARK-3369 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.0.2 Reporter: Sean Owen Attachments: FlatMapIterator.patch {{mapPartitions}} in the Scala RDD API takes a function that transforms an {{Iterator}} to an {{Iterator}}: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an {{Iterator}} but is required to return an {{Iterable}}, which is a stronger condition and appears inconsistent. It's a problematic inconsistency, because this seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself. Similarly for the other {{mapPartitions*}} methods and other {{*FlatMapFunction}}s in Java. (Is there a reason for this difference that I'm overlooking?) If I'm right that this was an inadvertent inconsistency, then the big issue here is that of course this is part of a public API. Workarounds I can think of: Promise that Spark will only call {{iterator()}} once, so implementors can use a hacky {{IteratorIterable}} that returns the same {{Iterator}}. Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the desired signature, and deprecate the existing ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3415) Using sys.stderr in pyspark results in error
Ward Viaene created SPARK-3415: -- Summary: Using sys.stderr in pyspark results in error Key: SPARK-3415 URL: https://issues.apache.org/jira/browse/SPARK-3415 Project: Spark Issue Type: Bug Reporter: Ward Viaene Using sys.stderr in pyspark results in:
{code}
  File "/home/spark-1.1/dist/python/pyspark/cloudpickle.py", line 660, in save_file
    from ..transport.adapter import SerializingAdapter
ValueError: Attempted relative import beyond toplevel package
{code}
Code to reproduce (copy-paste the code into pyspark):
{code}
import sys

class TestClass(object):
    def __init__(self, out = sys.stderr):
        self.out = out
    def getOne(self):
        return 'one'

def f():
    print type(t)
    return 'ok'

t = TestClass()
a = [ 1, 2, 3, 4, 5 ]
b = sc.parallelize(a)
b.map(lambda x: f()).first()
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
[ https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122921#comment-14122921 ] Sean Owen commented on SPARK-3369: -- The API change is unlikely to happen. Making a bunch of flatMap2 methods is really ugly. I suppose you could try wrapping the Iterator in this:
{code}
import java.util.Iterator;

public class IteratorIterable<T> implements Iterable<T> {

  private final Iterator<T> iterator;
  private boolean consumed;

  public IteratorIterable(Iterator<T> iterator) {
    this.iterator = iterator;
  }

  @Override
  public Iterator<T> iterator() {
    if (consumed) {
      throw new IllegalStateException("Iterator already consumed");
    }
    consumed = true;
    return iterator;
  }
}
{code}
If, as I suspect, Spark actually only calls iterator() once, this will work, and this may be the most tolerable workaround until Spark 2.x. If it doesn't work, and iterator() is called multiple times, this will fail fast and at least we'd know. Can you try something like this? Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator - Key: SPARK-3369 URL: https://issues.apache.org/jira/browse/SPARK-3369 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.0.2 Reporter: Sean Owen Attachments: FlatMapIterator.patch {{mapPartitions}} in the Scala RDD API takes a function that transforms an {{Iterator}} to an {{Iterator}}: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an {{Iterator}} but is required to return an {{Iterable}}, which is a stronger condition and appears inconsistent. It's a problematic inconsistency, because this seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself. Similarly for the other {{mapPartitions*}} methods and other {{*FlatMapFunction}}s in Java. (Is there a reason for this difference that I'm overlooking?) If I'm right that this was an inadvertent inconsistency, then the big issue here is that of course this is part of a public API. Workarounds I can think of: Promise that Spark will only call {{iterator()}} once, so implementors can use a hacky {{IteratorIterable}} that returns the same {{Iterator}}. Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the desired signature, and deprecate the existing ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3415) Using sys.stderr in pyspark results in error
[ https://issues.apache.org/jira/browse/SPARK-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ward Viaene updated SPARK-3415: --- Component/s: PySpark Using sys.stderr in pyspark results in error Key: SPARK-3415 URL: https://issues.apache.org/jira/browse/SPARK-3415 Project: Spark Issue Type: Bug Components: PySpark Reporter: Ward Viaene Labels: python Using sys.stderr in pyspark results in:
{code}
  File "/home/spark-1.1/dist/python/pyspark/cloudpickle.py", line 660, in save_file
    from ..transport.adapter import SerializingAdapter
ValueError: Attempted relative import beyond toplevel package
{code}
Code to reproduce (copy-paste the code into pyspark):
{code}
import sys

class TestClass(object):
    def __init__(self, out = sys.stderr):
        self.out = out
    def getOne(self):
        return 'one'

def f():
    print type(t)
    return 'ok'

t = TestClass()
a = [ 1, 2, 3, 4, 5 ]
b = sc.parallelize(a)
b.map(lambda x: f()).first()
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3415) Using sys.stderr in pyspark results in error
[ https://issues.apache.org/jira/browse/SPARK-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ward Viaene updated SPARK-3415: --- Labels: python (was: ) Using sys.stderr in pyspark results in error Key: SPARK-3415 URL: https://issues.apache.org/jira/browse/SPARK-3415 Project: Spark Issue Type: Bug Components: PySpark Reporter: Ward Viaene Labels: python Using sys.stderr in pyspark results in:
{code}
  File "/home/spark-1.1/dist/python/pyspark/cloudpickle.py", line 660, in save_file
    from ..transport.adapter import SerializingAdapter
ValueError: Attempted relative import beyond toplevel package
{code}
Code to reproduce (copy-paste the code into pyspark):
{code}
import sys

class TestClass(object):
    def __init__(self, out = sys.stderr):
        self.out = out
    def getOne(self):
        return 'one'

def f():
    print type(t)
    return 'ok'

t = TestClass()
a = [ 1, 2, 3, 4, 5 ]
b = sc.parallelize(a)
b.map(lambda x: f()).first()
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3415) Using sys.stderr in pyspark results in error
[ https://issues.apache.org/jira/browse/SPARK-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ward Viaene updated SPARK-3415: --- Affects Version/s: 1.1.0 1.0.2 Using sys.stderr in pyspark results in error Key: SPARK-3415 URL: https://issues.apache.org/jira/browse/SPARK-3415 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.2, 1.1.0 Reporter: Ward Viaene Labels: python Using sys.stderr in pyspark results in:
{code}
  File "/home/spark-1.1/dist/python/pyspark/cloudpickle.py", line 660, in save_file
    from ..transport.adapter import SerializingAdapter
ValueError: Attempted relative import beyond toplevel package
{code}
Code to reproduce (copy-paste the code into pyspark):
{code}
import sys

class TestClass(object):
    def __init__(self, out = sys.stderr):
        self.out = out
    def getOne(self):
        return 'one'

def f():
    print type(t)
    return 'ok'

t = TestClass()
a = [ 1, 2, 3, 4, 5 ]
b = sc.parallelize(a)
b.map(lambda x: f()).first()
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2430) Standardized Clustering Algorithm API and Framework
[ https://issues.apache.org/jira/browse/SPARK-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123005#comment-14123005 ] Yu Ishikawa commented on SPARK-2430: Hi [~rnowling], {quote} The community had suggested looking into scikit-learn's API, so that is a good idea. {quote} I agree with that idea. {quote} May I suggest you work on SPARK-2966 / SPARK-2429 first? {quote} All right, I will try it! Thanks Standardized Clustering Algorithm API and Framework -- Key: SPARK-2430 URL: https://issues.apache.org/jira/browse/SPARK-2430 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Priority: Minor Recently, there has been a chorus of voices on the mailing lists about adding new clustering algorithms to MLlib. To support these additions, we should develop a common framework and API to reduce code duplication and keep the APIs consistent. At the same time, we can also expand the current API to incorporate requested features such as arbitrary distance metrics or pre-computed distance matrices. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3377) Don't mix metrics from different applications, otherwise we cannot distinguish them
[ https://issues.apache.org/jira/browse/SPARK-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3377: -- Summary: Don't mix metrics from different applications, otherwise we cannot distinguish them (was: Don't mix metrics from different applications) Don't mix metrics from different applications, otherwise we cannot distinguish them - Key: SPARK-3377 URL: https://issues.apache.org/jira/browse/SPARK-3377 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta Priority: Critical I'm using the codahale-based MetricsSystem of Spark with JMX or Graphite, and I saw the following two problems: (1) When applications with the same spark.app.name run on a cluster at the same time, some metric names jumble together, e.g. SparkPi.DAGScheduler.stage.failedStages. (2) When two or more executors run on the same machine, the JVM metrics of the executors jumble, e.g. the current implementation cannot distinguish which executor a metric such as jvm.memory belongs to. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
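A sketch of the kind of namespacing that would address both problems, prefixing every metric with the application ID and executor ID (illustrative only, not the actual MetricsSystem code):
{code}
// "SparkPi.DAGScheduler.stage.failedStages" collides across applications with
// the same name; "app-20140905-0001.3.jvm.memory" identifies app and executor.
def qualifiedMetricName(appId: String, executorId: String, name: String): String =
  Seq(appId, executorId, name).mkString(".")
{code}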
[jira] [Resolved] (SPARK-3260) Yarn - pass acls along with executor launch
[ https://issues.apache.org/jira/browse/SPARK-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-3260. -- Resolution: Fixed Fix Version/s: 1.2.0 Target Version/s: 1.2.0 (was: 1.1.1, 1.2.0) Yarn - pass acls along with executor launch --- Key: SPARK-3260 URL: https://issues.apache.org/jira/browse/SPARK-3260 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Thomas Graves Assignee: Thomas Graves Fix For: 1.2.0 In https://github.com/apache/spark/pull/1196 I added passing the Spark view and modify ACLs into YARN. Unfortunately we are only passing them into the application master; I missed passing them in when we launch individual containers (executors). We need to modify ExecutorRunnable.startContainer to set the ACLs in the ContainerLaunchContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
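A minimal sketch of the missing step, assuming the view/modify ACL map is already built the same way it is for the AM ({{ContainerLaunchContext.setApplicationACLs}} is the standard YARN API for this):
{code}
import java.util.{Map => JMap}
import org.apache.hadoop.yarn.api.records.{ApplicationAccessType, ContainerLaunchContext}

// In ExecutorRunnable.startContainer, apply the same ACLs to each executor's
// launch context that were already applied to the application master's.
def applyAcls(ctx: ContainerLaunchContext, acls: JMap[ApplicationAccessType, String]): Unit = {
  ctx.setApplicationACLs(acls)
}
{code}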
[jira] [Resolved] (SPARK-3375) spark on yarn container allocation issues
[ https://issues.apache.org/jira/browse/SPARK-3375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-3375. -- Resolution: Fixed Fix Version/s: 1.2.0 spark on yarn container allocation issues - Key: SPARK-3375 URL: https://issues.apache.org/jira/browse/SPARK-3375 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Thomas Graves Assignee: Thomas Graves Priority: Blocker Fix For: 1.2.0 It looks like if YARN doesn't get the containers immediately, it stops asking for them, and the YARN application hangs, never getting any executors. This was introduced by https://github.com/apache/spark/pull/2169 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2966) Add an approximation algorithm for hierarchical clustering to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123029#comment-14123029 ] Yu Ishikawa commented on SPARK-2966: Hi [~rnowling], {quote} Based on my reading of the Spark contribution guidelines ( https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark ), I think that the Spark community would prefer to have one good implementation of an algorithm instead of multiple similar algorithms. {quote} I got it. Thank you for letting me know. {quote} would you like to take the lead on the hierarchical clustering? {quote} I'd love to! I read Freeman's example code. I like it, so I will try to implement it, and I would like you to review it. Thanks Add an approximation algorithm for hierarchical clustering to MLlib --- Key: SPARK-2966 URL: https://issues.apache.org/jira/browse/SPARK-2966 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Yu Ishikawa Priority: Minor A hierarchical clustering algorithm is a useful unsupervised learning method. Koga et al. proposed a highly scalable hierarchical clustering algorithm in (1). I would like to implement this method. I suggest adding an approximate hierarchical clustering algorithm to MLlib. I'd like this to be assigned to me. h3. Reference # Fast agglomerative hierarchical clustering algorithm using Locality-Sensitive Hashing http://dl.acm.org/citation.cfm?id=1266811 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3416) Add matrix operations for large data set
Yu Ishikawa created SPARK-3416: -- Summary: Add matrix operations for large data set Key: SPARK-3416 URL: https://issues.apache.org/jira/browse/SPARK-3416 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Yu Ishikawa I think matrix operations for large data sets would be helpful. There is a method to multiply an RDD-based matrix by a local matrix. However, there is no method to operate on an RDD-based matrix and another RDD-based matrix, such as: - multiplication - addition / subtraction - power - scalar multiplication -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
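For context, a sketch of the asymmetry described above (assuming an existing SparkContext {{sc}}): MLlib's {{RowMatrix.multiply}} already covers the distributed-times-local case, while the distributed-times-distributed operations listed in this issue have no counterpart yet.
{code}
import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// What exists today: multiply an RDD-based matrix by a *local* matrix.
val rows = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
val distributed = new RowMatrix(rows)                       // 2 x 2, RDD-backed
val local = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0)) // identity, column-major
val product: RowMatrix = distributed.multiply(local)

// What this issue asks for has no API yet, e.g. something like
// distributed.multiply(otherDistributed) or distributed.add(otherDistributed).
{code}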
[jira] [Created] (SPARK-3417) Use of old-style classes in pyspark
Matthew Rocklin created SPARK-3417: -- Summary: Use of old-style classes in pyspark Key: SPARK-3417 URL: https://issues.apache.org/jira/browse/SPARK-3417 Project: Spark Issue Type: Bug Reporter: Matthew Rocklin Priority: Minor pyspark seems to use old-style classes {code} class Foo: {code} These are relatively ancient and should be replaced by {code} class Foo(object): {code} Many newer libraries depend on this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3417) Use of old-style classes in pyspark
[ https://issues.apache.org/jira/browse/SPARK-3417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123059#comment-14123059 ] Apache Spark commented on SPARK-3417: - User 'mrocklin' has created a pull request for this issue: https://github.com/apache/spark/pull/2288 Use of old-style classes in pyspark --- Key: SPARK-3417 URL: https://issues.apache.org/jira/browse/SPARK-3417 Project: Spark Issue Type: Bug Reporter: Matthew Rocklin Priority: Minor pyspark seems to use old-style classes {code} class Foo: {code} These are relatively ancient and should be replaced by {code} class Foo(object): {code} Many newer libraries depend on this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2966) Add an approximation algorithm for hierarchical clustering to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123077#comment-14123077 ] RJ Nowling commented on SPARK-2966: --- Wonderful! If I can help or when you're ready for reviews, let me know! Add an approximation algorithm for hierarchical clustering to MLlib --- Key: SPARK-2966 URL: https://issues.apache.org/jira/browse/SPARK-2966 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Yu Ishikawa Priority: Minor A hierarchical clustering algorithm is a useful unsupervised learning method. Koga et al. proposed a highly scalable hierarchical clustering algorithm in (1). I would like to implement this method. I suggest adding an approximate hierarchical clustering algorithm to MLlib. I'd like this to be assigned to me. h3. Reference # Fast agglomerative hierarchical clustering algorithm using Locality-Sensitive Hashing http://dl.acm.org/citation.cfm?id=1266811 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3369) Java mapPartitions Iterator-Iterable is inconsistent with Scala's Iterator-Iterator
[ https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3369: --- Priority: Critical (was: Major) Target Version/s: 1.2.0 Java mapPartitions Iterator-Iterable is inconsistent with Scala's Iterator-Iterator - Key: SPARK-3369 URL: https://issues.apache.org/jira/browse/SPARK-3369 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.0.2 Reporter: Sean Owen Priority: Critical Attachments: FlatMapIterator.patch {{mapPartitions}} in the Scala RDD API takes a function that transforms an {{Iterator}} to an {{Iterator}}: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an {{Iterator}} but is required to return an {{Iterable}}, which is a stronger condition and appears inconsistent. It's a problematic inconsistency because this seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself. Similarly for other {{mapPartitions*}} methods and other {{*FlatMapFunctions}}s in Java. (Is there a reason for this difference that I'm overlooking?) If I'm right that this was an inadvertent inconsistency, then the big issue here is that of course this is part of a public API. Workarounds I can think of: Promise that Spark will only call {{iterator()}} once, so implementors can use a hacky {{IteratorIterable}} that returns the same {{Iterator}}. Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the desired signature, and deprecate existing ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
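The first workaround above can be made concrete with a small sketch (a hack by design, and only safe under the proposed promise that Spark calls {{iterator()}} exactly once per partition):
{code}
import scala.collection.JavaConverters._

// An Iterable that hands out the single Iterator it wraps. This deliberately
// violates the Iterable contract (a second iterator() call returns the same,
// possibly exhausted, iterator), which is why it only works if Spark promises
// to call iterator() once.
class IteratorIterable[T](underlying: Iterator[T]) extends java.lang.Iterable[T] {
  override def iterator(): java.util.Iterator[T] = underlying.asJava
}
{code}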
[jira] [Updated] (SPARK-3174) Under YARN, add and remove executors based on load
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3174: --- Assignee: Andrew Or Under YARN, add and remove executors based on load -- Key: SPARK-3174 URL: https://issues.apache.org/jira/browse/SPARK-3174 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.0.2 Reporter: Sandy Ryza Assignee: Andrew Or A common complaint with Spark in a multi-tenant environment is that applications have a fixed allocation that doesn't grow and shrink with their resource needs. We're blocked on YARN-1197 for dynamically changing the resources within executors, but we can still allocate and discard whole executors. I think it would be useful to have some heuristics that * Request more executors when many pending tasks are building up * Request more executors when RDDs can't fit in memory * Discard executors when few tasks are running / pending and there's not much in memory Bonus points: migrate blocks from executors we're about to discard to executors with free space. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3174) Under YARN, add and remove executors based on load
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123137#comment-14123137 ] Patrick Wendell commented on SPARK-3174: We should come up with a design doc for this. One other feature I saw on the user list was allowing the user to ask for more executors directly... that could be an interesting hook to expose as well, and maybe we just have a heuristic that calls this hook. It's also worth seeing how well this translates to Mesos and Standalone mode. Under YARN, add and remove executors based on load -- Key: SPARK-3174 URL: https://issues.apache.org/jira/browse/SPARK-3174 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.0.2 Reporter: Sandy Ryza Assignee: Andrew Or A common complaint with Spark in a multi-tenant environment is that applications have a fixed allocation that doesn't grow and shrink with their resource needs. We're blocked on YARN-1197 for dynamically changing the resources within executors, but we can still allocate and discard whole executors. I think it would be useful to have some heuristics that * Request more executors when many pending tasks are building up * Request more executors when RDDs can't fit in memory * Discard executors when few tasks are running / pending and there's not much in memory Bonus points: migrate blocks from executors we're about to discard to executors with free space. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
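Ahead of the design doc, here is a deliberately toy sketch of the first heuristic in the description (all names and thresholds invented), just to make the pending-task signal concrete:
{code}
// Toy heuristic: size the executor count to the task backlog, bounded by a
// configured range. A real implementation would add hysteresis, RDD memory
// pressure, and the user-facing "ask for more executors" hook mentioned above.
def desiredExecutors(pendingTasks: Int, runningTasks: Int,
                     tasksPerExecutor: Int, minExecutors: Int, maxExecutors: Int): Int = {
  val needed = math.ceil((pendingTasks + runningTasks).toDouble / tasksPerExecutor).toInt
  math.min(maxExecutors, math.max(minExecutors, needed))
}
{code}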
[jira] [Commented] (SPARK-3399) Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR
[ https://issues.apache.org/jira/browse/SPARK-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123143#comment-14123143 ] Davies Liu commented on SPARK-3399: --- Could you give an example to show the problem? Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR Key: SPARK-3399 URL: https://issues.apache.org/jira/browse/SPARK-3399 Project: Spark Issue Type: Bug Components: PySpark Reporter: Kousuke Saruta Some tests for PySpark make temporary files on /tmp of the local file system, but if the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR is set in spark-env.sh, the tests expect temporary files to be on the FileSystem configured in core-site.xml even though the actual files are on the local file system. I think we should ignore HADOOP_CONF_DIR and YARN_CONF_DIR. If we need those variables in some tests, we should set those variables in such tests specifically. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3399) Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR
[ https://issues.apache.org/jira/browse/SPARK-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123167#comment-14123167 ] Kousuke Saruta commented on SPARK-3399: --- Some tests for pyspark, for instance rdd.py, use NamedTemporaryFile to create input data. NamedTemporaryFile creates a temporary file in /tmp on the local filesystem. rdd.py is kicked by the pyspark script in python/run-tests. If we set the environment variables HADOOP_CONF_DIR or YARN_CONF_DIR in spark-env.sh before testing, the pyspark command loads values from those variables. After loading those values, Spark expects the input data to be on the filesystem configured by the environment variables. Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR Key: SPARK-3399 URL: https://issues.apache.org/jira/browse/SPARK-3399 Project: Spark Issue Type: Bug Components: PySpark Reporter: Kousuke Saruta Some tests for PySpark make temporary files on /tmp of the local file system, but if the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR is set in spark-env.sh, the tests expect temporary files to be on the FileSystem configured in core-site.xml even though the actual files are on the local file system. I think we should ignore HADOOP_CONF_DIR and YARN_CONF_DIR. If we need those variables in some tests, we should set those variables in such tests specifically. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3418) Additional BLAS and Local Sparse Matrix support
Burak Yavuz created SPARK-3418: -- Summary: Additional BLAS and Local Sparse Matrix support Key: SPARK-3418 URL: https://issues.apache.org/jira/browse/SPARK-3418 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Burak Yavuz Currently MLlib doesn't have Level-2 and Level-3 BLAS support. For Multi-Model training, adding support for Level-3 BLAS functions is vital. In addition, as most real data is sparse, support for Local Sparse Matrices will also be added, as supporting sparse matrices will save a lot of memory and will lead to better performance. The ability to left multiply a dense matrix with a sparse matrix, i.e. `C := alpha * A * B + beta * C` where `A` is a sparse matrix will also be added. However, `B` and `C` will remain as Dense Matrices for now. I will post performance comparisons with other libraries that support sparse matrices such as Breeze and Matrix-toolkits-JAVA (MTJ) in the comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
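To make the {{C := alpha * A * B + beta * C}} operation concrete, here is a minimal reference sketch (not the MLlib implementation this issue will add) of a CSC-sparse {{A}} times dense {{B}}, accumulating into dense {{C}}, all column-major as described above:
{code}
// Reference gemm: C := alpha * A * B + beta * C, with A sparse in CSC form
// (values / rowIndices / colPtrs) and B, C dense column-major arrays.
def sparseDenseGemm(
    alpha: Double, m: Int, k: Int, n: Int,
    values: Array[Double], rowIndices: Array[Int], colPtrs: Array[Int], // A: m x k
    b: Array[Double],                                                   // B: k x n
    beta: Double,
    c: Array[Double]): Unit = {                                         // C: m x n, updated in place
  var i = 0
  while (i < c.length) { c(i) *= beta; i += 1 }   // C := beta * C
  var col = 0
  while (col < n) {                               // each column of B / C
    var j = 0
    while (j < k) {                               // column j of A pairs with B(j, col)
      val bVal = alpha * b(col * k + j)
      var p = colPtrs(j)
      while (p < colPtrs(j + 1)) {                // nonzeros in column j of A
        c(col * m + rowIndices(p)) += values(p) * bVal
        p += 1
      }
      j += 1
    }
    col += 1
  }
}
{code}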
[jira] [Updated] (SPARK-3418) [MLlib] Additional BLAS and Local Sparse Matrix support
[ https://issues.apache.org/jira/browse/SPARK-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-3418: --- Summary: [MLlib] Additional BLAS and Local Sparse Matrix support (was: Additional BLAS and Local Sparse Matrix support) [MLlib] Additional BLAS and Local Sparse Matrix support --- Key: SPARK-3418 URL: https://issues.apache.org/jira/browse/SPARK-3418 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Burak Yavuz Currently MLlib doesn't have Level-2 and Level-3 BLAS support. For Multi-Model training, adding support for Level-3 BLAS functions is vital. In addition, as most real data is sparse, support for Local Sparse Matrices will also be added, as supporting sparse matrices will save a lot of memory and will lead to better performance. The ability to left multiply a dense matrix with a sparse matrix, i.e. `C := alpha * A * B + beta * C` where `A` is a sparse matrix will also be added. However, `B` and `C` will remain as Dense Matrices for now. I will post performance comparisons with other libraries that support sparse matrices such as Breeze and Matrix-toolkits-JAVA (MTJ) in the comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3399) Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR
[ https://issues.apache.org/jira/browse/SPARK-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123195#comment-14123195 ] Davies Liu commented on SPARK-3399: --- Thanks for the explanation. I still cannot reproduce the problem by putting HADOOP_CONF_DIR and YARN_CONF_DIR into conf/spark-env.sh; run-tests still runs successfully. Did I miss something? Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR Key: SPARK-3399 URL: https://issues.apache.org/jira/browse/SPARK-3399 Project: Spark Issue Type: Bug Components: PySpark Reporter: Kousuke Saruta Some tests for PySpark make temporary files on /tmp of the local file system, but if the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR is set in spark-env.sh, the tests expect temporary files to be on the FileSystem configured in core-site.xml even though the actual files are on the local file system. I think we should ignore HADOOP_CONF_DIR and YARN_CONF_DIR. If we need those variables in some tests, we should set those variables in such tests specifically. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names
[ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3414: -- Description: Paste the following snippet into {{spark-shell}} (Hive support needed) to reproduce this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql("""
  SELECT name, message
  FROM rawLogs
  JOIN (
    SELECT name
    FROM logFiles
  ) files
  ON rawLogs.filename = files.name
""")

srdd.registerTempTable("boom")

sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
    Join Inner, Some(('rawLogs.filename = name#2))
     LowerCaseSchema
      Subquery rawlogs
       SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
     Subquery files
      Project [name#2]
       LowerCaseSchema
        Subquery logfiles
         SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during the analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed before the {{Resolution}} batch. And when {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog:
{code}
Join Inner, Some(('rawLogs.filename = 'files.name))
 UnresolvedRelation None, rawLogs, None
 Subquery files
  Project ['name]
   UnresolvedRelation None, logFiles, None
{code}
Notice that the attributes referenced in the join operator (esp. {{rawLogs}}) are not lowercased yet. Then, when {{select * from boom}} is analyzed, its input logical plan is:
{code}
Project [*]
 UnresolvedRelation None, boom, None
{code}
Here the unresolved relation points to the unanalyzed logical plan of {{srdd}} above, which is only discovered later by the rule {{ResolveRelations}}, and is thus never touched by {{CaseInsensitiveAttributeReferences}} at all:
{code}
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 Project [*]                           Project [*]
!UnresolvedRelation None, boom, None   LowerCaseSchema
!                                       Subquery boom
!                                        Project ['name,'message]
!                                         Join Inner, Some(('rawLogs.filename = 'files.name))
!                                          LowerCaseSchema
!                                           Subquery rawlogs
!                                            SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
!                                          Subquery files
!                                           Project ['name]
!                                            LowerCaseSchema
!                                             Subquery logfiles
!                                              SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Because the {{CaseInsensitiveAttributeReferences}} batch happens before the {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) is never lowercased, which causes the resolution failure. A reasonable fix for this could be to always register the analyzed logical plan in the catalog when registering temporary tables.
was: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql("""
  SELECT name, message
  FROM rawLogs
  JOIN (
    SELECT name
    FROM logFiles
  ) files
  ON rawLogs.filename = files.name
""")

srdd.registerTempTable("boom")

sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
    Join Inner, Some(('rawLogs.filename = name#2))
[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names
[ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3414: -- Description: Paste the following snippet into {{spark-shell}} (Hive support needed) to reproduce this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql("""
  SELECT name, message
  FROM rawLogs
  JOIN (
    SELECT name
    FROM logFiles
  ) files
  ON rawLogs.filename = files.name
""")

srdd.registerTempTable("boom")

sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
    Join Inner, Some(('rawLogs.filename = name#2))
     LowerCaseSchema
      Subquery rawlogs
       SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
     Subquery files
      Project [name#2]
       LowerCaseSchema
        Subquery logfiles
         SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during the analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed before the {{Resolution}} batch. And when {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog:
{code}
Join Inner, Some(('rawLogs.filename = 'files.name))
 UnresolvedRelation None, rawLogs, None
 Subquery files
  Project ['name]
   UnresolvedRelation None, logFiles, None
{code}
Notice that the attributes referenced in the join operator (esp. {{rawLogs}}) are not lowercased yet. Then, when {{select * from boom}} is analyzed, its input logical plan is:
{code}
Project [*]
 UnresolvedRelation None, boom, None
{code}
Here the unresolved relation points to the unanalyzed logical plan of {{srdd}} above, which is only discovered later by the rule {{ResolveRelations}} and thus never touched by {{CaseInsensitiveAttributeReferences}} at all, so {{rawLogs.filename}} is not lowercased:
{code}
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 Project [*]                           Project [*]
!UnresolvedRelation None, boom, None   LowerCaseSchema
!                                       Subquery boom
!                                        Project ['name,'message]
!                                         Join Inner, Some(('rawLogs.filename = 'files.name))
!                                          LowerCaseSchema
!                                           Subquery rawlogs
!                                            SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
!                                          Subquery files
!                                           Project ['name]
!                                            LowerCaseSchema
!                                             Subquery logfiles
!                                              SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
A reasonable fix for this could be to always register the analyzed logical plan in the catalog when registering temporary tables.
was: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql("""
  SELECT name, message
  FROM rawLogs
  JOIN (
    SELECT name
    FROM logFiles
  ) files
  ON rawLogs.filename = files.name
""")

srdd.registerTempTable("boom")

sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
    Join Inner, Some(('rawLogs.filename = name#2))
     LowerCaseSchema
      Subquery rawlogs
       SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
[jira] [Commented] (SPARK-3399) Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR
[ https://issues.apache.org/jira/browse/SPARK-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123200#comment-14123200 ] Kousuke Saruta commented on SPARK-3399: --- Ah, did you set fs.defaultFS to something like hdfs:// in the core-site.xml on the path configured by HADOOP_CONF_DIR or YARN_CONF_DIR? Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR Key: SPARK-3399 URL: https://issues.apache.org/jira/browse/SPARK-3399 Project: Spark Issue Type: Bug Components: PySpark Reporter: Kousuke Saruta Some tests for PySpark make temporary files on /tmp of the local file system, but if the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR is set in spark-env.sh, the tests expect temporary files to be on the FileSystem configured in core-site.xml even though the actual files are on the local file system. I think we should ignore HADOOP_CONF_DIR and YARN_CONF_DIR. If we need those variables in some tests, we should set those variables in such tests specifically. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3369) Java mapPartitions Iterator-Iterable is inconsistent with Scala's Iterator-Iterator
[ https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123203#comment-14123203 ] Nicholas Chammas commented on SPARK-3369: - {quote} The API change is unlikely to happen. {quote} Is that due to the stable API promise made in 1.0? I guess if we're stuck with this API, we should at least tag this issue somehow for review during Spark 2.0 development. API changes should be fair game then. Also, would PySpark be affected at all by these changes? Java mapPartitions Iterator-Iterable is inconsistent with Scala's Iterator-Iterator - Key: SPARK-3369 URL: https://issues.apache.org/jira/browse/SPARK-3369 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.0.2 Reporter: Sean Owen Priority: Critical Attachments: FlatMapIterator.patch {{mapPartitions}} in the Scala RDD API takes a function that transforms an {{Iterator}} to an {{Iterator}}: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an {{Iterator}} but is required to return an {{Iterable}}, which is a stronger condition and appears inconsistent. It's a problematic inconsistency because this seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself. Similarly for other {{mapPartitions*}} methods and other {{*FlatMapFunctions}}s in Java. (Is there a reason for this difference that I'm overlooking?) If I'm right that this was an inadvertent inconsistency, then the big issue here is that of course this is part of a public API. Workarounds I can think of: Promise that Spark will only call {{iterator()}} once, so implementors can use a hacky {{IteratorIterable}} that returns the same {{Iterator}}. Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the desired signature, and deprecate existing ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1823) ExternalAppendOnlyMap can still OOM if one key is very large
[ https://issues.apache.org/jira/browse/SPARK-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123213#comment-14123213 ] Andrew Ash commented on SPARK-1823: --- // This was not fixed in Spark 1.1 and should be bumped to Spark 1.2. [~pwendell] wrote about this issue on dev@ Aug 25th: {quote} We might create a new JIRA for it, but it doesn't exist yet. We'll create JIRA's for the major 1.2 issues at the beginning of September. {quote} ExternalAppendOnlyMap can still OOM if one key is very large Key: SPARK-1823 URL: https://issues.apache.org/jira/browse/SPARK-1823 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Andrew Or Fix For: 1.1.0 If the values for one key do not collectively fit into memory, then the map will still OOM when you merge the spilled contents back in. This is a problem especially for PySpark, since we hash the keys (Python objects) before a shuffle, and there are only so many integers out there in the world, so there could potentially be many collisions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3399) Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR
[ https://issues.apache.org/jira/browse/SPARK-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123214#comment-14123214 ] Davies Liu commented on SPARK-3399: --- Given fs.defaultFS as hdfs://, saveAsTextFile() will save the files into HDFS, but other parts of the code assume that the files are saved in the local filesystem, so the test cases fail. Do I understand correctly? Then this patch makes sense, thanks for your work! Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR Key: SPARK-3399 URL: https://issues.apache.org/jira/browse/SPARK-3399 Project: Spark Issue Type: Bug Components: PySpark Reporter: Kousuke Saruta Some tests for PySpark make temporary files on /tmp of the local file system, but if the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR is set in spark-env.sh, the tests expect temporary files to be on the FileSystem configured in core-site.xml even though the actual files are on the local file system. I think we should ignore HADOOP_CONF_DIR and YARN_CONF_DIR. If we need those variables in some tests, we should set those variables in such tests specifically. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1823) ExternalAppendOnlyMap can still OOM if one key is very large
[ https://issues.apache.org/jira/browse/SPARK-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1823: - Target Version/s: 1.2.0 Affects Version/s: (was: 1.0.0) 1.1.0 1.0.2 Fix Version/s: (was: 1.1.0) ExternalAppendOnlyMap can still OOM if one key is very large Key: SPARK-1823 URL: https://issues.apache.org/jira/browse/SPARK-1823 Project: Spark Issue Type: Bug Affects Versions: 1.0.2, 1.1.0 Reporter: Andrew Or If the values for one key do not collectively fit into memory, then the map will still OOM when you merge the spilled contents back in. This is a problem especially for PySpark, since we hash the keys (Python objects) before a shuffle, and there are only so many integers out there in the world, so there could potentially be many collisions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1823) ExternalAppendOnlyMap can still OOM if one key is very large
[ https://issues.apache.org/jira/browse/SPARK-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123235#comment-14123235 ] Andrew Or commented on SPARK-1823: -- Thanks Andrew, I have updated the versions. ExternalAppendOnlyMap can still OOM if one key is very large Key: SPARK-1823 URL: https://issues.apache.org/jira/browse/SPARK-1823 Project: Spark Issue Type: Bug Affects Versions: 1.0.2, 1.1.0 Reporter: Andrew Or If the values for one key do not collectively fit into memory, then the map will still OOM when you merge the spilled contents back in. This is a problem especially for PySpark, since we hash the keys (Python objects) before a shuffle, and there are only so many integers out there in the world, so there could potentially be many collisions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3174) Under YARN, add and remove executors based on load
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123238#comment-14123238 ] Sandy Ryza commented on SPARK-3174: --- I've been putting a little bit of thought into this - I'll work on a design doc and post it here. Under YARN, add and remove executors based on load -- Key: SPARK-3174 URL: https://issues.apache.org/jira/browse/SPARK-3174 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.0.2 Reporter: Sandy Ryza Assignee: Andrew Or A common complaint with Spark in a multi-tenant environment is that applications have a fixed allocation that doesn't grow and shrink with their resource needs. We're blocked on YARN-1197 for dynamically changing the resources within executors, but we can still allocate and discard whole executors. I think it would be useful to have some heuristics that * Request more executors when many pending tasks are building up * Request more executors when RDDs can't fit in memory * Discard executors when few tasks are running / pending and there's not much in memory Bonus points: migrate blocks from executors we're about to discard to executors with free space. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3399) Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR
[ https://issues.apache.org/jira/browse/SPARK-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123237#comment-14123237 ] Kousuke Saruta commented on SPARK-3399: --- Yes, I meant something like what you mentioned. Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR Key: SPARK-3399 URL: https://issues.apache.org/jira/browse/SPARK-3399 Project: Spark Issue Type: Bug Components: PySpark Reporter: Kousuke Saruta Some tests for PySpark make temporary files on /tmp of the local file system, but if the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR is set in spark-env.sh, the tests expect temporary files to be on the FileSystem configured in core-site.xml even though the actual files are on the local file system. I think we should ignore HADOOP_CONF_DIR and YARN_CONF_DIR. If we need those variables in some tests, we should set those variables in such tests specifically. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped
[ https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123253#comment-14123253 ] Tathagata Das commented on SPARK-2892: -- I intended it to be ERROR to catch such issues where receivers don't stop properly. If a program is supposed to shut down after stopping the streaming context, then this is probably not much of a problem, as everything in Spark is torn down anyway. But if the SparkContext is going to be reused, then this is indeed a problem. Socket Receiver does not stop when streaming context is stopped --- Key: SPARK-2892 URL: https://issues.apache.org/jira/browse/SPARK-2892 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.2 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Running NetworkWordCount with {quote} ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); Thread.sleep(6) {quote} gives the following error {quote} 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 10047 ms on localhost (1/1) 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at ReceiverTracker.scala:275) finished in 10.056 s 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at ReceiverTracker.scala:275, took 10.179263 s 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been terminated 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not deregistered, Map(0 - ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,)) 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after time 1407375433000 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler 14/08/06 18:37:13 INFO StreamingContext: StreamingContext stopped successfully 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost: {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3399) Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR
[ https://issues.apache.org/jira/browse/SPARK-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3399. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2270 [https://github.com/apache/spark/pull/2270] Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR Key: SPARK-3399 URL: https://issues.apache.org/jira/browse/SPARK-3399 Project: Spark Issue Type: Bug Components: PySpark Reporter: Kousuke Saruta Fix For: 1.2.0 Some tests for PySpark make temporary files on /tmp of the local file system, but if the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR is set in spark-env.sh, the tests expect temporary files to be on the FileSystem configured in core-site.xml even though the actual files are on the local file system. I think we should ignore HADOOP_CONF_DIR and YARN_CONF_DIR. If we need those variables in some tests, we should set those variables in such tests specifically. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2491) When an OOM is thrown,the executor does not stop properly.
[ https://issues.apache.org/jira/browse/SPARK-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123326#comment-14123326 ] Josh Rosen commented on SPARK-2491: --- Ah, I see. It looks like we don't want to display exceptions thrown from other tasks during a shutdown, since those exceptions are likely triggered by the unclean shutdown itself rather than real errors, and thus will be confusing to users who read the logs. When an OOM is thrown,the executor does not stop properly. -- Key: SPARK-2491 URL: https://issues.apache.org/jira/browse/SPARK-2491 Project: Spark Issue Type: Bug Components: YARN Reporter: Guoqiang Li The executor log: {code} # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError=kill %p # Executing /bin/sh -c kill 44942... 14/07/15 10:38:29 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM 14/07/15 10:38:29 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Connection manager future execution context-6,5,main] java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.init(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:331) at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:94) at org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176) at org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63) at org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:125) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:122) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 14/07/15 10:38:29 WARN HadoopRDD: Exception in RecordReader.close() java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703) at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619) at java.io.FilterInputStream.close(FilterInputStream.java:181) at org.apache.hadoop.util.LineReader.close(LineReader.java:150) at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:243) at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226) at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63) at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97) at org.apache.spark.scheduler.Task.run(Task.scala:51) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) - 14/07/15 10:38:30 INFO Executor: Running task ID 969 14/07/15 10:38:30 INFO BlockManager: Found block broadcast_0 locally 14/07/15 10:38:30 INFO HadoopRDD: Input split: hdfs://10dian72.domain.test:8020/input/lbs/recommend/toona/rating/20140712/part-7:0+68016537 14/07/15 10:38:30 ERROR Executor: Exception in task ID 969 java.io.FileNotFoundException: /yarn/nm/usercache/spark/appcache/application_1404728465401_0070/spark-local-20140715103235-ffda/2e/merged_shuffle_4_85_0 (No such file or directory) at java.io.FileOutputStream.open(Native Method) at
[jira] [Created] (SPARK-3419) Scheduler shouldn't delay running a task when executors don't reside at any of its preferred locations
Sandy Ryza created SPARK-3419: - Summary: Scheduler shouldn't delay running a task when executors don't reside at any of its preferred locations Key: SPARK-3419 URL: https://issues.apache.org/jira/browse/SPARK-3419 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3420) Using Sphinx to generate API docs for PySpark
Davies Liu created SPARK-3420: - Summary: Using Sphinx to generate API docs for PySpark Key: SPARK-3420 URL: https://issues.apache.org/jira/browse/SPARK-3420 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Sphinx can generate better documentation than epydoc, so let's move to Sphinx. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3160) Simplify DecisionTree data structure for training
[ https://issues.apache.org/jira/browse/SPARK-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3160: - Description: Improvement: code clarity Currently, we maintain a tree structure, a flat array of nodes, and a parentImpurities array. Proposed fix: Maintain everything within a growing tree structure. For this, we could have a “LearningNode extends Node” setup where the LearningNode holds metadata for learning (such as impurities). The test-time model could be extracted from this training-time model, so that extra information (such as impurities) does not have to be kept after training. This would let us eliminate the flat array of nodes, thus saving storage when we do not grow a full tree. It would also potentially make it easier to pass subtrees to compute nodes for local training. was: Improvement: code clarity Currently, we maintain a tree structure, a flat array of nodes, and a parentImpurities array. Proposed fix: Maintain everything within a growing tree structure. For this, we could have a “LearningNode extends Node” setup where the LearningNode holds metadata for learning (such as impurities). The test-time model could be extracted from this training-time model, so that extra information (such as impurities) does not have to be kept after training. Simplify DecisionTree data structure for training - Key: SPARK-3160 URL: https://issues.apache.org/jira/browse/SPARK-3160 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Minor Improvement: code clarity Currently, we maintain a tree structure, a flat array of nodes, and a parentImpurities array. Proposed fix: Maintain everything within a growing tree structure. For this, we could have a “LearningNode extends Node” setup where the LearningNode holds metadata for learning (such as impurities). The test-time model could be extracted from this training-time model, so that extra information (such as impurities) does not have to be kept after training. This would let us eliminate the flat array of nodes, thus saving storage when we do not grow a full tree. It would also potentially make it easier to pass subtrees to compute nodes for local training. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
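A hedged sketch of the proposed structure (field names simplified; MLlib's actual {{Node}} differs): the training-time {{LearningNode}} carries impurity while the tree grows, and collapses into a plain {{Node}} once training finishes, so neither a flat node array nor a parentImpurities array needs to be kept.
{code}
// Simplified test-time node: only what prediction needs.
class Node(val prediction: Double, val isLeaf: Boolean,
           val leftChild: Option[Node], val rightChild: Option[Node])

// Training-time node: adds learning metadata (impurity here) and growable
// children, following the "LearningNode extends Node" idea in the description.
class LearningNode(pred: Double, leaf: Boolean, var impurity: Double,
                   var left: Option[LearningNode] = None,
                   var right: Option[LearningNode] = None)
    extends Node(pred, leaf, None, None) {
  // Strip the learning metadata to produce the lean test-time tree.
  def toNode: Node = new Node(pred, leaf, left.map(_.toNode), right.map(_.toNode))
}
{code}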
[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation
[ https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123390#comment-14123390 ] Andrew Ash commented on SPARK-3280: --- [~joshrosen] do you have a theory for the cause of the dropoff between 2800 and 3200 partitions in your chart? My interpretation is that both shuffle implementations behave similarly in this scenario up to ~1600 after which the hash based starts falling behind, then there's another step difference at 3200 where it hits a severe dropoff. I'm interested in the right third of the chart. A couple theories: - more partitions = more stuff in memory concurrently = GC pressure. Sort-based can stream and do merge sort, but hash-based needs to build the hash all at once then spill it - more partitions = more concurrent spills = disk thrashing while writing to lots of files concurrently, exacerbated if the test was on spinnies instead of SSDs. Maybe the sort-based merges spills while writing to disk so ends up writing fewer spill files concurrently. Also the chart is a little unclear, is the y-axis time in seconds? Made sort-based shuffle the default implementation -- Key: SPARK-3280 URL: https://issues.apache.org/jira/browse/SPARK-3280 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Attachments: hash-sort-comp.png sort-based shuffle has lower memory usage and seems to outperform hash-based in almost all of our testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3421) StructField.toString should quote the name field to allow arbitrary character as struct field name
Cheng Lian created SPARK-3421: - Summary: StructField.toString should quote the name field to allow arbitrary character as struct field name Key: SPARK-3421 URL: https://issues.apache.org/jira/browse/SPARK-3421 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Cheng Lian The original use case is something like this:
{code}
// JSON snippet with illegal characters in field names
val json =
  """{ "a(b)": { "c(d)": "hello" } }""" ::
  """{ "a(b)": { "c(d)": "world" } }""" :: Nil

val jsonSchemaRdd = sqlContext.jsonRDD(sparkContext.makeRDD(json))
jsonSchemaRdd.saveAsParquetFile("/tmp/file.parquet")

java.lang.Exception: java.lang.RuntimeException: Unsupported dataType: StructType(ArrayBuffer(StructField(a(b),StructType(ArrayBuffer(StructField(c(d),StringType,true))),true))), [1.37] failure: `,' expected but `(' found
{code}
The reason is that the {{DataType}} parser only allows {{\[a-zA-Z0-9_\]*}} as a struct field name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
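As a hedged sketch of the title's suggestion (simplified stand-in types, not the Catalyst code): quoting the name on the way out, and accepting a quoted token on the way in, would let names like {{a(b)}} survive the round trip.
{code}
// Simplified stand-in for StructField.toString: backtick-quote the field name
// so the serialized schema can be parsed back even when the name contains
// characters outside [a-zA-Z0-9_], e.g. "a(b)".
case class FieldSketch(name: String, dataType: String, nullable: Boolean) {
  override def toString: String = s"StructField(`$name`,$dataType,$nullable)"
}

// FieldSketch("a(b)", "StringType", true).toString
//   ==> StructField(`a(b)`,StringType,true)
{code}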
[jira] [Updated] (SPARK-2714) DAGScheduler should log jobid when runJob finishes
[ https://issues.apache.org/jira/browse/SPARK-2714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2714: -- Summary: DAGScheduler should log jobid when runJob finishes (was: DAGScheduler logs jobid when runJob finishes) DAGScheduler should log jobid when runJob finishes -- Key: SPARK-2714 URL: https://issues.apache.org/jira/browse/SPARK-2714 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: YanTang Zhai Priority: Minor DAGScheduler logs jobid when runJob finishes -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2714) DAGScheduler should log jobid when runJob finishes
[ https://issues.apache.org/jira/browse/SPARK-2714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2714: -- Description: When DAGScheduler concurrently runs multiple jobs, SparkContext only logs "Job finished" and logs to the same file, which doesn't tell who is who. It's difficult to find out which job has finished or how much time it has taken from multiple "Job finished: ..., took ... s" logs. (was: DAGScheduler logs jobid when runJob finishes) DAGScheduler should log jobid when runJob finishes -- Key: SPARK-2714 URL: https://issues.apache.org/jira/browse/SPARK-2714 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: YanTang Zhai Priority: Minor When DAGScheduler concurrently runs multiple jobs, SparkContext only logs "Job finished" and logs to the same file, which doesn't tell who is who. It's difficult to find out which job has finished or how much time it has taken from multiple "Job finished: ..., took ... s" logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
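A minimal sketch of the requested change (the names {{jobId}}, {{callSite}}, and {{startNanos}} are illustrative, not DAGScheduler's actual fields): include the job id in the completion line so concurrent jobs can be told apart.
{code}
// Before: "Job finished: <callSite>, took <t> s" is ambiguous with concurrent jobs.
// After:  "Job <id> finished: <callSite>, took <t> s".
def logJobFinished(jobId: Int, callSite: String, startNanos: Long): Unit = {
  val seconds = (System.nanoTime() - startNanos) / 1e9
  println(f"Job $jobId finished: $callSite, took $seconds%.6f s")
}
{code}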
[jira] [Updated] (SPARK-2491) When an OOM is thrown,the executor does not stop properly.
[ https://issues.apache.org/jira/browse/SPARK-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2491: - Affects Version/s: 1.1.0 When an OOM is thrown,the executor does not stop properly. -- Key: SPARK-2491 URL: https://issues.apache.org/jira/browse/SPARK-2491 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Guoqiang Li The executor log: {code} # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError=kill %p # Executing /bin/sh -c kill 44942... 14/07/15 10:38:29 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM 14/07/15 10:38:29 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Connection manager future execution context-6,5,main] java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.init(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:331) at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:94) at org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176) at org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63) at org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:125) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:122) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 14/07/15 10:38:29 WARN HadoopRDD: Exception in RecordReader.close() java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703) at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619) at java.io.FilterInputStream.close(FilterInputStream.java:181) at org.apache.hadoop.util.LineReader.close(LineReader.java:150) at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:243) at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226) at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63) at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 
- 14/07/15 10:38:30 INFO Executor: Running task ID 969 14/07/15 10:38:30 INFO BlockManager: Found block broadcast_0 locally 14/07/15 10:38:30 INFO HadoopRDD: Input split: hdfs://10dian72.domain.test:8020/input/lbs/recommend/toona/rating/20140712/part-7:0+68016537 14/07/15 10:38:30 ERROR Executor: Exception in task ID 969 java.io.FileNotFoundException: /yarn/nm/usercache/spark/appcache/application_1404728465401_0070/spark-local-20140715103235-ffda/2e/merged_shuffle_4_85_0 (No such file or directory) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.init(FileOutputStream.java:221) at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116) at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:177) at
[jira] [Commented] (SPARK-2491) When an OOM is thrown, the executor does not stop properly.
[ https://issues.apache.org/jira/browse/SPARK-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123447#comment-14123447 ] Andrew Or commented on SPARK-2491: -- [~witgo] This doesn't seem to be specific to YARN. I'm changing the component to Spark Core. Let me know if there's a reason I should change it back. When an OOM is thrown, the executor does not stop properly. -- Key: SPARK-2491 URL: https://issues.apache.org/jira/browse/SPARK-2491 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3174) Under YARN, add and remove executors based on load
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-3174: -- Attachment: SPARK-3174design.pdf Under YARN, add and remove executors based on load -- Key: SPARK-3174 URL: https://issues.apache.org/jira/browse/SPARK-3174 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.0.2 Reporter: Sandy Ryza Assignee: Andrew Or Attachments: SPARK-3174design.pdf A common complaint with Spark in a multi-tenant environment is that applications have a fixed allocation that doesn't grow and shrink with their resource needs. We're blocked on YARN-1197 for dynamically changing the resources within executors, but we can still allocate and discard whole executors. I think it would be useful to have some heuristics that * Request more executors when many pending tasks are building up * Request more executors when RDDs can't fit in memory * Discard executors when few tasks are running / pending and there's not much in memory Bonus points: migrate blocks from executors we're about to discard to executors with free space. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
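To make the first two heuristics concrete, a scaling policy of roughly this shape could derive a target executor count from the task backlog. This is only an illustrative sketch; every name and the sizing rule itself are assumptions, not taken from the attached design doc.
{code}
// Hypothetical load-based sizing rule: grow while a backlog exists, shrink when idle.
def desiredExecutors(pendingTasks: Int,
                     runningTasks: Int,
                     tasksPerExecutor: Int,
                     currentExecutors: Int,
                     minExecutors: Int,
                     maxExecutors: Int): Int = {
  // Executors needed to give every queued and running task a slot.
  val needed = math.ceil((pendingTasks + runningTasks).toDouble / tasksPerExecutor).toInt
  if (pendingTasks > 0) {
    // Pending tasks are building up: request more, up to the cap.
    math.min(maxExecutors, math.max(currentExecutors, needed))
  } else {
    // Few tasks running or pending: release idle executors, down to the floor.
    math.max(minExecutors, needed)
  }
}
{code}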
[jira] [Commented] (SPARK-3174) Under YARN, add and remove executors based on load
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123504#comment-14123504 ] Sandy Ryza commented on SPARK-3174: --- Posted a high-level design doc. Under YARN, add and remove executors based on load -- Key: SPARK-3174 URL: https://issues.apache.org/jira/browse/SPARK-3174 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.0.2 Reporter: Sandy Ryza Assignee: Andrew Or Attachments: SPARK-3174design.pdf A common complaint with Spark in a multi-tenant environment is that applications have a fixed allocation that doesn't grow and shrink with their resource needs. We're blocked on YARN-1197 for dynamically changing the resources within executors, but we can still allocate and discard whole executors. I think it would be useful to have some heuristics that * Request more executors when many pending tasks are building up * Request more executors when RDDs can't fit in memory * Discard executors when few tasks are running / pending and there's not much in memory Bonus points: migrate blocks from executors we're about to discard to executors with free space. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names
[ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123538#comment-14123538 ] Apache Spark commented on SPARK-3414: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/2293 Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names Key: SPARK-3414 URL: https://issues.apache.org/jira/browse/SPARK-3414 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Cheng Lian Priority: Critical Paste the following snippet into {{spark-shell}} (needs Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") val srdd = sql("SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name") srdd.registerTempTable("boom") sql("select * from boom") {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) Subquery files Project [name#2] LowerCaseSchema Subquery logfiles SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during the analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed before the {{Resolution}} batch. And when {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog: {code} Join Inner, Some(('rawLogs.filename = 'files.name)) UnresolvedRelation None, rawLogs, None Subquery files Project ['name] UnresolvedRelation None, logFiles, None {code} Notice that the attributes referenced in the join operator (esp. {{rawLogs}}) are not lowercased yet. And then, when {{select * from boom}} is analyzed, its input logical plan is: {code} Project [*] UnresolvedRelation None, boom, None {code} Here the unresolved relation points to the unanalyzed logical plan of {{srdd}} above, which is later discovered by rule {{ResolveRelations}}, thus not touched by {{CaseInsensitiveAttributeReferences}} at all, and {{rawLogs.filename}} is thus not lowercased: {code} === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === Project [*]Project [*] ! UnresolvedRelation None, boom, NoneLowerCaseSchema ! Subquery boom ! Project ['name,'message] ! Join Inner, Some(('rawLogs.filename = 'files.name)) !LowerCaseSchema ! Subquery rawlogs ! SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) !Subquery files ! Project ['name] ! LowerCaseSchema ! 
Subquery logfiles !SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} A reasonable fix for this could be to always register the analyzed logical plan to the catalog when registering temporary tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
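A sketch of what that fix amounts to: push the plan through the analyzer before it is stored, so the catalog only ever holds resolved, case-normalized plans. The types and wiring below are stand-ins for illustration, not the actual Catalyst or SQLContext code.
{code}
// Minimal stand-ins for illustration only; not Catalyst's real types.
trait LogicalPlan
trait Catalog { def registerTable(db: Option[String], name: String, plan: LogicalPlan): Unit }

class TempTableRegistration(catalog: Catalog, analyze: LogicalPlan => LogicalPlan) {
  // Problematic shape: the raw, unanalyzed plan goes into the catalog, so the
  // CaseInsensitiveAttributeReferences batch never rewrites references like rawLogs.filename.
  def registerRaw(name: String, plan: LogicalPlan): Unit =
    catalog.registerTable(None, name, plan)

  // Sketch of the suggested fix: analyze first, store the resolved plan.
  def registerAnalyzed(name: String, plan: LogicalPlan): Unit =
    catalog.registerTable(None, name, analyze(plan))
}
{code}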
[jira] [Commented] (SPARK-3421) StructField.toString should quote the name field to allow arbitrary characters as struct field names
[ https://issues.apache.org/jira/browse/SPARK-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123537#comment-14123537 ] Apache Spark commented on SPARK-3421: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/2291 StructField.toString should quote the name field to allow arbitrary characters as struct field names -- Key: SPARK-3421 URL: https://issues.apache.org/jira/browse/SPARK-3421 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Cheng Lian The original use case is something like this: {code} // JSON snippet with illegal characters in field names val json = """{ "a(b)": { "c(d)": "hello" } }""" :: """{ "a(b)": { "c(d)": "world" } }""" :: Nil val jsonSchemaRdd = sqlContext.jsonRDD(sparkContext.makeRDD(json)) jsonSchemaRdd.saveAsParquetFile("/tmp/file.parquet") java.lang.Exception: java.lang.RuntimeException: Unsupported dataType: StructType(ArrayBuffer(StructField(a(b),StructType(ArrayBuffer(StructField(c(d),StringType,true))),true))), [1.37] failure: `,' expected but `(' found {code} The reason is that the {{DataType}} parser only allows {{\[a-zA-Z0-9_\]*}} as a struct field name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
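The direction the title suggests: quote the name when printing a field and let the parser accept a quoted name, so a round-trip survives characters outside {{[a-zA-Z0-9_]}}. A toy sketch under those assumptions, not the actual Catalyst {{DataType}} parser:
{code}
// Toy field type; the quoting idea is the point, not the exact syntax.
case class Field(name: String, dataType: String, nullable: Boolean = true) {
  // Quote the name so "a(b)" serializes unambiguously.
  override def toString: String = s"StructField('$name',$dataType,$nullable)"
}

// Parsing side: a quoted name may contain anything except the quote character,
// instead of the bare [a-zA-Z0-9_]* identifier that rejects "a(b)".
val fieldPattern = """StructField\('([^']*)',([^,]+),(true|false)\)""".r

// Field("a(b)", "StringType").toString now matches fieldPattern and round-trips.
{code}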
[jira] [Commented] (SPARK-2537) Workaround Timezone specific Hive tests
[ https://issues.apache.org/jira/browse/SPARK-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123556#comment-14123556 ] Cheng Lian commented on SPARK-2537: --- PR [#1440|https://github.com/apache/spark/pull/1440] fixes this issue. Workaround Timezone specific Hive tests --- Key: SPARK-2537 URL: https://issues.apache.org/jira/browse/SPARK-2537 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.1, 1.1.0 Reporter: Cheng Lian Priority: Minor Several Hive tests in {{HiveCompatibilitySuite}} are timezone-sensitive: - {{timestamp_1}} - {{timestamp_2}} - {{timestamp_3}} - {{timestamp_udf}} Their answers differ between different timezones. Caching golden answers naively causes build failures in other timezones. Currently these tests are blacklisted. A not so clever solution is to cache golden answers of all timezones for these tests, then select the right version for the current build according to the system timezone. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
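At test time, that per-timezone cache would be consulted roughly like this; the file layout is invented purely for illustration.
{code}
import java.io.File
import java.util.TimeZone

// Resolve the golden answer recorded for the timezone of the current build,
// e.g. golden-tz/timestamp_1/America_Los_Angeles.out
def goldenAnswerFile(testName: String): File = {
  val tz = TimeZone.getDefault.getID.replace('/', '_')
  new File(s"src/test/resources/golden-tz/$testName/$tz.out")
}
{code}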
[jira] [Resolved] (SPARK-2537) Workaround Timezone specific Hive tests
[ https://issues.apache.org/jira/browse/SPARK-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-2537. --- Resolution: Fixed Fix Version/s: 1.1.0 Target Version/s: 1.1.0 Workaround Timezone specific Hive tests --- Key: SPARK-2537 URL: https://issues.apache.org/jira/browse/SPARK-2537 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.1, 1.1.0 Reporter: Cheng Lian Priority: Minor Fix For: 1.1.0 Several Hive tests in {{HiveCompatibilitySuite}} are timezone-sensitive: - {{timestamp_1}} - {{timestamp_2}} - {{timestamp_3}} - {{timestamp_udf}} Their answers differ between different timezones. Caching golden answers naively causes build failures in other timezones. Currently these tests are blacklisted. A not so clever solution is to cache golden answers of all timezones for these tests, then select the right version for the current build according to the system timezone. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2099) Report TaskMetrics for running tasks
[ https://issues.apache.org/jira/browse/SPARK-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123584#comment-14123584 ] Andrew Ash commented on SPARK-2099: --- I just gave this a run-through and most of the metrics above are live-updated, but the GC one isn't. I think that's because it's not included in updateAggregateMetrics here: https://github.com/apache/spark/pull/1056/files#diff-1f32bcb61f51133bd0959a4177a066a5R175 Should I open a new ticket to make GC time live-updated? That's the metric I was most excited to see live-updating. Report TaskMetrics for running tasks Key: SPARK-2099 URL: https://issues.apache.org/jira/browse/SPARK-2099 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Critical Fix For: 1.1.0 Spark currently collects a set of helpful task metrics, like shuffle bytes written and GC time, and displays them on the app web UI. These are only collected and displayed for tasks that have completed. This makes them unsuited to perhaps the situation where they would be most useful - determining what's going wrong in currently running tasks. Reporting metrics progress for running tasks would probably require adding an executor-driver heartbeat that reports metrics for all tasks currently running on the executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2099) Report TaskMetrics for running tasks
[ https://issues.apache.org/jira/browse/SPARK-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123587#comment-14123587 ] Sandy Ryza commented on SPARK-2099: --- Yeah, unfortunately I haven't had the chance to add that one in yet. A new ticket would be great. Report TaskMetrics for running tasks Key: SPARK-2099 URL: https://issues.apache.org/jira/browse/SPARK-2099 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Critical Fix For: 1.1.0 Spark currently collects a set of helpful task metrics, like shuffle bytes written and GC time, and displays them on the app web UI. These are only collected and displayed for tasks that have completed. This makes them unsuited to perhaps the situation where they would be most useful - determining what's going wrong in currently running tasks. Reporting metrics progress for running tasks would probably require adding an executor-driver heartbeat that reports metrics for all tasks currently running on the executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3418) [MLlib] Additional BLAS and Local Sparse Matrix support
[ https://issues.apache.org/jira/browse/SPARK-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123634#comment-14123634 ] Apache Spark commented on SPARK-3418: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/2294 [MLlib] Additional BLAS and Local Sparse Matrix support --- Key: SPARK-3418 URL: https://issues.apache.org/jira/browse/SPARK-3418 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Burak Yavuz Currently MLlib doesn't have Level-2 and Level-3 BLAS support. For Multi-Model training, adding support for Level-3 BLAS functions is vital. In addition, as most real data is sparse, support for Local Sparse Matrices will also be added, as supporting sparse matrices will save a lot of memory and will lead to better performance. The ability to left multiply a dense matrix with a sparse matrix, i.e. `C := alpha * A * B + beta * C` where `A` is a sparse matrix will also be added. However, `B` and `C` will remain as Dense Matrices for now. I will post performance comparisons with other libraries that support sparse matrices such as Breeze and Matrix-toolkits-JAVA (MTJ) in the comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
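For the sparse-dense case called out above, `C := alpha * A * B + beta * C` with A stored column-compressed reduces to scattering each nonzero of A across a column of C. A minimal sketch over raw column-major arrays, using the usual CSC field names rather than whatever the final MLlib API exposes:
{code}
// C := alpha * A * B + beta * C, where A (m x p) is CSC-sparse and B (p x n),
// C (m x n) are dense, column-major arrays.
def gemm(alpha: Double, m: Int, p: Int, n: Int,
         colPtrs: Array[Int], rowIndices: Array[Int], values: Array[Double],
         b: Array[Double], beta: Double, c: Array[Double]): Unit = {
  var i = 0
  while (i < c.length) { c(i) *= beta; i += 1 }   // scale C by beta first
  var j = 0
  while (j < n) {                                 // column j of B and C
    var l = 0
    while (l < p) {                               // row l of B pairs with column l of A
      val bv = alpha * b(j * p + l)
      if (bv != 0.0) {
        var idx = colPtrs(l)
        while (idx < colPtrs(l + 1)) {            // nonzeros in column l of A
          c(j * m + rowIndices(idx)) += values(idx) * bv
          idx += 1
        }
      }
      l += 1
    }
    j += 1
  }
}
{code}
Because the inner loop only touches stored nonzeros, the cost scales with nnz(A) * n rather than m * p * n, which is where the memory and performance win over a dense gemm comes from.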
[jira] [Updated] (SPARK-3211) .take() is OOM-prone when there are empty partitions
[ https://issues.apache.org/jira/browse/SPARK-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3211: - Target Version/s: 1.1.1, 1.2.0 .take() is OOM-prone when there are empty partitions Key: SPARK-3211 URL: https://issues.apache.org/jira/browse/SPARK-3211 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Andrew Ash Filed on dev@ on 22 August by [~pnepywoda]: {quote} On line 777 https://github.com/apache/spark/commit/42571d30d0d518e69eecf468075e4c5a823a2ae8#diff-1d55e54678eff2076263f2fe36150c17R771 the logic for take() reads ALL partitions if the first one (or first k) are empty. This has actually led to OOMs when we had many partitions (thousands) and unfortunately the first one was empty. Wouldn't a better implementation strategy be numPartsToTry = partsScanned * 2 instead of numPartsToTry = totalParts - 1 (this doubling is similar to most memory allocation strategies) Thanks! - Paul {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
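As a sketch, the proposed scan loop looks like this; {{num}}, {{totalParts}}, and {{runOnPartitions}} are simplified stand-ins for the driver-side code take() actually uses.
{code}
// Keep doubling the number of partitions scanned per round, instead of jumping
// from the first partition straight to (totalParts - 1) when it comes back empty.
var partsScanned = 1
val buf = scala.collection.mutable.ArrayBuffer[Int]()
buf ++= runOnPartitions(0 until 1)                 // initial attempt on partition 0
while (buf.length < num && partsScanned < totalParts) {
  val numPartsToTry = partsScanned * 2             // was: totalParts - 1
  val upTo = math.min(partsScanned + numPartsToTry, totalParts)
  buf ++= runOnPartitions(partsScanned until upTo)
  partsScanned = upTo
}
val taken = buf.take(num)
{code}
With doubling, a take() over thousands of mostly-empty partitions launches O(log totalParts) rounds and never materializes results from all partitions at once.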
[jira] [Commented] (SPARK-3416) Add matrix operations for large data sets
[ https://issues.apache.org/jira/browse/SPARK-3416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123695#comment-14123695 ] Yu Ishikawa commented on SPARK-3416: We discussed this issue on the dev list thread: http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Add-multiplying-large-scale-matrices-td8291.html Add matrix operations for large data sets Key: SPARK-3416 URL: https://issues.apache.org/jira/browse/SPARK-3416 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Yu Ishikawa I think matrix operations for large data sets would be helpful. There is a method to multiply an RDD-based matrix and a local matrix. However, there is no method that operates on an RDD-based matrix and another RDD-based matrix. - multiplication - addition / subtraction - power - scalar multiply -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
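As a baseline for the RDD-by-RDD multiplication, coordinate-format entries can be joined on the shared dimension and the partial products summed per output cell. A sketch; block-partitioned formulations scale better, but this shows the shape of the computation:
{code}
import org.apache.spark.rdd.RDD

// A and B as (row, col, value) triples; returns the entries of A * B keyed by (row, col).
def multiplyCoordinate(a: RDD[(Long, Long, Double)],
                       b: RDD[(Long, Long, Double)]): RDD[((Long, Long), Double)] = {
  val aByCol = a.map { case (i, k, v) => (k, (i, v)) }   // key A by its column index
  val bByRow = b.map { case (k, j, w) => (k, (j, w)) }   // key B by its row index
  aByCol.join(bByRow)                                    // all pairs sharing index k
    .map { case (_, ((i, v), (j, w))) => ((i, j), v * w) }
    .reduceByKey(_ + _)                                  // sum the partial products
}
{code}
Addition and subtraction are simpler still: key both sides by (row, col) and combine with a full outer join or reduceByKey over the union.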
[jira] [Resolved] (SPARK-3082) yarn.Client.logClusterResourceDetails throws NPE if requested queue doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved SPARK-3082. --- Resolution: Fixed Fix Version/s: 1.1.0 yarn.Client.logClusterResourceDetails throws NPE if requested queue doesn't exist - Key: SPARK-3082 URL: https://issues.apache.org/jira/browse/SPARK-3082 Project: Spark Issue Type: Bug Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Minor Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3423) Implement BETWEEN support for regular SQL parser
William Benton created SPARK-3423: - Summary: Implement BETWEEN support for regular SQL parser Key: SPARK-3423 URL: https://issues.apache.org/jira/browse/SPARK-3423 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: William Benton Priority: Minor The HQL parser supports BETWEEN but the SQLParser currently does not. It would be great if it did. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
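One conventional way to add BETWEEN to a combinator-based parser is to desugar it at parse time into a conjunction of two comparisons, i.e. {{e BETWEEN lo AND hi}} becomes {{e >= lo AND e <= hi}}. A self-contained toy version follows; this is not the actual SqlParser grammar, and the expression tree is a stand-in:
{code}
import scala.util.parsing.combinator.JavaTokenParsers

sealed trait Expr
case class Col(name: String) extends Expr
case class Lit(value: Double) extends Expr
case class GtEq(left: Expr, right: Expr) extends Expr
case class LtEq(left: Expr, right: Expr) extends Expr
case class And(left: Expr, right: Expr) extends Expr

object BetweenParser extends JavaTokenParsers {
  def term: Parser[Expr] =
    floatingPointNumber ^^ (s => Lit(s.toDouble)) | ident ^^ (Col(_))

  // e BETWEEN lo AND hi  ==>  And(e >= lo, e <= hi)
  def between: Parser[Expr] =
    term ~ ("BETWEEN" ~> term) ~ ("AND" ~> term) ^^ {
      case e ~ lo ~ hi => And(GtEq(e, lo), LtEq(e, hi))
    }
}

// BetweenParser.parseAll(BetweenParser.between, "age BETWEEN 18 AND 65")
// yields And(GtEq(Col(age),Lit(18.0)),LtEq(Col(age),Lit(65.0)))
{code}
Desugaring at parse time means the rest of the planner never needs a BETWEEN node at all, which keeps the change local to the grammar.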
[jira] [Commented] (SPARK-3215) Add remote interface for SparkContext
[ https://issues.apache.org/jira/browse/SPARK-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123749#comment-14123749 ] Marcelo Vanzin commented on SPARK-3215: --- I updated the prototype to include a Java API and not to use SparkConf in the API. Add remote interface for SparkContext - Key: SPARK-3215 URL: https://issues.apache.org/jira/browse/SPARK-3215 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Marcelo Vanzin Labels: hive Attachments: RemoteSparkContext.pdf A quick description of the issue: as part of running Hive jobs on top of Spark, it's desirable to have a SparkContext that is running in the background and listening for job requests for a particular user session. Running multiple contexts in the same JVM is not a very good solution. Not only does SparkContext currently have issues sharing the same JVM among multiple instances, but running all contexts in one JVM also turns that JVM into a huge bottleneck in the system. So I'm proposing a solution where we have a SparkContext that is running in a separate process, and listening for requests from the client application via some RPC interface (most probably Akka). I'll attach a document shortly with the current proposal. Let's use this bug to discuss the proposal and any other suggestions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3423) Implement BETWEEN support for regular SQL parser
[ https://issues.apache.org/jira/browse/SPARK-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123814#comment-14123814 ] William Benton commented on SPARK-3423: --- (PR is here: https://github.com/apache/spark/pull/2295 ) Implement BETWEEN support for regular SQL parser Key: SPARK-3423 URL: https://issues.apache.org/jira/browse/SPARK-3423 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: William Benton Assignee: William Benton Priority: Minor The HQL parser supports BETWEEN but the SQLParser currently does not. It would be great if it did. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles
[ https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123883#comment-14123883 ] Kostas Sakellis commented on SPARK-1239: [~pwendell] I'd like to take a crack at this since it is affecting one of our customers. Don't fetch all map output statuses at each reducer during shuffles --- Key: SPARK-1239 URL: https://issues.apache.org/jira/browse/SPARK-1239 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Andrew Or Fix For: 1.1.0 Instead we should modify the way we fetch map output statuses to take both a mapper and a reducer - or we should just piggyback the statuses on each task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
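In message terms, the change described is to scope the status request (or the piggybacked payload) to a single reducer rather than the whole shuffle. A sketch of the two shapes; both message names are hypothetical:
{code}
// Today's shape (sketch): each reducer pulls statuses for every map partition,
// which is O(maps) data per reducer and O(maps * reducers) traffic overall.
case class GetMapOutputStatuses(shuffleId: Int)

// Proposed shape: ask only for the map outputs one reduce partition will read.
case class GetMapOutputStatusesForReducer(shuffleId: Int, reduceId: Int)
{code}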
[jira] [Commented] (SPARK-3423) Implement BETWEEN support for regular SQL parser
[ https://issues.apache.org/jira/browse/SPARK-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124238#comment-14124238 ] Apache Spark commented on SPARK-3423: - User 'willb' has created a pull request for this issue: https://github.com/apache/spark/pull/2295 Implement BETWEEN support for regular SQL parser Key: SPARK-3423 URL: https://issues.apache.org/jira/browse/SPARK-3423 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: William Benton Assignee: William Benton Priority: Minor The HQL parser supports BETWEEN but the SQLParser currently does not. It would be great if it did. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2334) Attribute Error calling PipelinedRDD.id() in pyspark
[ https://issues.apache.org/jira/browse/SPARK-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124239#comment-14124239 ] Apache Spark commented on SPARK-2334: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2296 Attribute Error calling PipelinedRDD.id() in pyspark Key: SPARK-2334 URL: https://issues.apache.org/jira/browse/SPARK-2334 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0, 1.1.0 Reporter: Diana Carroll Calling the id() function of a PipelinedRDD causes an error in PySpark. (Works fine in Scala.) The second id() call here fails, the first works: {code} r1 = sc.parallelize([1,2,3]) r1.id() r2 = r1.map(lambda i: i+1) r2.id() {code} Error: {code} --- AttributeError Traceback (most recent call last) <ipython-input-31-a0cf66fcf645> in <module>() ----> 1 r2.id() /usr/lib/spark/python/pyspark/rdd.py in id(self) 180 A unique ID for this RDD (within its SparkContext). 181 """ --> 182 return self._id AttributeError: 'PipelinedRDD' object has no attribute '_id' {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3211) .take() is OOM-prone when there are empty partitions
[ https://issues.apache.org/jira/browse/SPARK-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3211: - Assignee: Andrew Ash .take() is OOM-prone when there are empty partitions Key: SPARK-3211 URL: https://issues.apache.org/jira/browse/SPARK-3211 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Andrew Ash Assignee: Andrew Ash Fix For: 1.1.1, 1.2.0 Filed on dev@ on 22 August by [~pnepywoda]: {quote} On line 777 https://github.com/apache/spark/commit/42571d30d0d518e69eecf468075e4c5a823a2ae8#diff-1d55e54678eff2076263f2fe36150c17R771 the logic for take() reads ALL partitions if the first one (or first k) are empty. This has actually led to OOMs when we had many partitions (thousands) and unfortunately the first one was empty. Wouldn't a better implementation strategy be numPartsToTry = partsScanned * 2 instead of numPartsToTry = totalParts - 1 (this doubling is similar to most memory allocation strategies) Thanks! - Paul {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3211) .take() is OOM-prone when there are empty partitions
[ https://issues.apache.org/jira/browse/SPARK-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3211. -- Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 .take() is OOM-prone when there are empty partitions Key: SPARK-3211 URL: https://issues.apache.org/jira/browse/SPARK-3211 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Andrew Ash Assignee: Andrew Ash Fix For: 1.1.1, 1.2.0 Filed on dev@ on 22 August by [~pnepywoda]: {quote} On line 777 https://github.com/apache/spark/commit/42571d30d0d518e69eecf468075e4c5a823a2ae8#diff-1d55e54678eff2076263f2fe36150c17R771 the logic for take() reads ALL partitions if the first one (or first k) are empty. This has actually led to OOMs when we had many partitions (thousands) and unfortunately the first one was empty. Wouldn't a better implementation strategy be numPartsToTry = partsScanned * 2 instead of numPartsToTry = totalParts - 1 (this doubling is similar to most memory allocation strategies) Thanks! - Paul {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3424) KMeans Plus Plus is too slow
Derrick Burns created SPARK-3424: Summary: KMeans Plus Plus is too slow Key: SPARK-3424 URL: https://issues.apache.org/jira/browse/SPARK-3424 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.2 Reporter: Derrick Burns The KMeansPlusPlus algorithm is implemented in time O(m k^2), where m is the number of rounds of the KMeansParallel algorithm and k is the number of clusters. This can be dramatically improved by maintaining the distance to the closest cluster center from round to round and incrementally updating that value for each point. This incremental update takes O(1) time per point, which reduces the running time of KMeansPlusPlus to O(m k). For large k, this is significant. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
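A sketch of the proposed bookkeeping: carry each point's squared distance to its nearest chosen center across rounds, so picking the next center costs one O(n) pass with an O(1) update per point. Plain unweighted arrays are used here for clarity; the real seeding step would be weighted and distributed.
{code}
import scala.util.Random

// k-means++ seeding with incrementally maintained closest-center distances.
object KMeansPlusPlusSketch {
  def squaredDist(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0; var i = 0
    while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
    s
  }

  def chooseCenters(points: Array[Array[Double]], k: Int, rnd: Random): Array[Array[Double]] = {
    val centers = new Array[Array[Double]](k)
    centers(0) = points(rnd.nextInt(points.length))
    // Each point's squared distance to its closest chosen center so far.
    val minDist2 = points.map(p => squaredDist(p, centers(0)))
    for (c <- 1 until k) {
      // Sample the next center with probability proportional to D(x)^2.
      var r = rnd.nextDouble() * minDist2.sum
      var i = 0
      while (i < points.length - 1 && r > minDist2(i)) { r -= minDist2(i); i += 1 }
      centers(c) = points(i)
      // The incremental update: each point only compares against the new center.
      var j = 0
      while (j < points.length) {
        minDist2(j) = math.min(minDist2(j), squaredDist(points(j), centers(c)))
        j += 1
      }
    }
    centers
  }
}
{code}
Without the cached distances, every round would recompute each point's distance to all centers chosen so far, which is where the extra factor of k comes from.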
[jira] [Updated] (SPARK-3411) Improve load-balancing of concurrently-submitted drivers across workers
[ https://issues.apache.org/jira/browse/SPARK-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] WangTaoTheTonic updated SPARK-3411: --- Summary: Improve load-balancing of concurrently-submitted drivers across workers (was: Optimize the schedule procedure in Master) Improve load-balancing of concurrently-submitted drivers across workers --- Key: SPARK-3411 URL: https://issues.apache.org/jira/browse/SPARK-3411 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic Priority: Minor If the waiting driver array is too big, the drivers in it will be dispatched to the first worker we get (if it has enough resources), with or without randomization. We should do randomization every time we dispatch a driver, in order to better balance drivers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3411) Improve load-balancing of concurrently-submitted drivers across workers
[ https://issues.apache.org/jira/browse/SPARK-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] WangTaoTheTonic updated SPARK-3411: --- Description: If the waiting driver array is too big, the drivers in it will be dispatched to the first worker we get (if it has enough resources), with or without randomization. We should do randomization every time we dispatch a driver, in order to better balance drivers. Update (2014/9/6): Doing shuffle is much slower, so we use round robin to avoid it. was: If the waiting driver array is too big, the drivers in it will be dispatched to the first worker we get (if it has enough resources), with or without randomization. We should do randomization every time we dispatch a driver, in order to better balance drivers. Improve load-balancing of concurrently-submitted drivers across workers --- Key: SPARK-3411 URL: https://issues.apache.org/jira/browse/SPARK-3411 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic Priority: Minor If the waiting driver array is too big, the drivers in it will be dispatched to the first worker we get (if it has enough resources), with or without randomization. We should do randomization every time we dispatch a driver, in order to better balance drivers. Update (2014/9/6): Doing shuffle is much slower, so we use round robin to avoid it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
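A sketch of that round-robin dispatch; field and helper names approximate the Master's scheduling loop rather than quoting it.
{code}
// Advance a persistent cursor over the alive workers so consecutive waiting
// drivers land on different workers, with no per-dispatch shuffle of the list.
var curPos = 0
for (driver <- waitingDrivers.toList) {          // iterate a snapshot; we mutate below
  var launched = false
  var numWorkersVisited = 0
  while (!launched && numWorkersVisited < aliveWorkers.size) {
    val worker = aliveWorkers(curPos)
    if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
      launchDriver(worker, driver)               // hypothetical helper
      waitingDrivers -= driver
      launched = true
    }
    curPos = (curPos + 1) % aliveWorkers.size    // cursor survives across drivers
    numWorkersVisited += 1
  }
}
{code}
Because the cursor is not reset between drivers, a burst of submissions spreads across the cluster instead of piling onto the first worker that happens to have capacity.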
[jira] [Commented] (SPARK-3361) Expand PEP 8 checks to include EC2 script and Python examples
[ https://issues.apache.org/jira/browse/SPARK-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124293#comment-14124293 ] Apache Spark commented on SPARK-3361: - User 'nchammas' has created a pull request for this issue: https://github.com/apache/spark/pull/2297 Expand PEP 8 checks to include EC2 script and Python examples - Key: SPARK-3361 URL: https://issues.apache.org/jira/browse/SPARK-3361 Project: Spark Issue Type: Sub-task Components: Build Reporter: Nicholas Chammas Assignee: Nicholas Chammas Priority: Minor Fix For: 1.1.0 Via {{tox.ini}}, expand the PEP 8 checks to include the EC2 script and all Python examples. That should cover everything. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org