[jira] [Resolved] (SPARK-2778) Add unit tests for Yarn integration
[ https://issues.apache.org/jira/browse/SPARK-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2778. Resolution: Fixed Fix Version/s: 1.2.0 Fixed by: https://github.com/apache/spark/pull/2257 Add unit tests for Yarn integration --- Key: SPARK-2778 URL: https://issues.apache.org/jira/browse/SPARK-2778 Project: Spark Issue Type: Test Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Fix For: 1.2.0 It would be nice to add some Yarn integration tests to the unit tests in Spark; Yarn provides a MiniYARNCluster class that can be used to spawn a cluster locally. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
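For context, a minimal sketch of what a MiniYARNCluster-based test harness looks like. The cluster API (`MiniYARNCluster`, `init`, `start`, `getConfig`) is Hadoop's test API; the Spark-side wiring is elided and hypothetical:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.hadoop.yarn.server.MiniYARNCluster

object YarnIntegrationSketch {
  def main(args: Array[String]): Unit = {
    // Spin up a one-NodeManager YARN cluster inside the test JVM.
    val yarnCluster = new MiniYARNCluster("spark-yarn-tests", 1, 1, 1)
    yarnCluster.init(new YarnConfiguration())
    yarnCluster.start()
    try {
      // The RM address is assigned dynamically; it must be handed to the
      // Spark application under test so it targets this cluster.
      val config: Configuration = yarnCluster.getConfig
      val rmAddress = config.get(YarnConfiguration.RM_ADDRESS)
      println(s"MiniYARNCluster RM at $rmAddress")
      // ... launch a Spark app in yarn-cluster mode, assert on final status ...
    } finally {
      yarnCluster.stop()
    }
  }
}
```

This requires the hadoop-yarn server test artifacts on the test classpath, which is presumably the bulk of the work in the linked pull request.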
[jira] [Commented] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147465#comment-14147465 ] Patrick Wendell commented on SPARK-3687: Can you perform a jstack on the executor when it is hanging? We usually only post things on JIRA like this once a specific issue has been debugged a bit more, but if you can produce a jstack of the hung executor we can keep it open :) Spark hang while processing more than 100 sequence files Key: SPARK-3687 URL: https://issues.apache.org/jira/browse/SPARK-3687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Ziv Huang In my application, I read more than 100 sequence files into a JavaPairRDD, perform flatMap to get another JavaRDD, and then use takeOrdered to get the result. It is quite often (but not always) that Spark hangs while executing some of the 110th-130th tasks. The job can hang for several hours, maybe forever (I can't wait for its completion). When the Spark job hangs, I can't find any error message anywhere, and I can't kill the job from the web UI. The current workaround is to use coalesce to reduce the number of partitions to be processed. I never get a hung job if the number of partitions to be processed is no greater than 80.
[jira] [Resolved] (SPARK-3576) Provide script for creating the Spark AMI from scratch
[ https://issues.apache.org/jira/browse/SPARK-3576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3576. Resolution: Fixed This was fixed in spark-ec2 itself Provide script for creating the Spark AMI from scratch -- Key: SPARK-3576 URL: https://issues.apache.org/jira/browse/SPARK-3576 Project: Spark Issue Type: Bug Components: EC2 Reporter: Patrick Wendell Assignee: Patrick Wendell
[jira] [Updated] (SPARK-3288) All fields in TaskMetrics should be private and use getters/setters
[ https://issues.apache.org/jira/browse/SPARK-3288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3288: --- Assignee: (was: Andrew Or) All fields in TaskMetrics should be private and use getters/setters --- Key: SPARK-3288 URL: https://issues.apache.org/jira/browse/SPARK-3288 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Patrick Wendell Labels: starter This is particularly bad because we expose this as a developer API. Technically a library could create a TaskMetrics object, change the values inside of it, and pass it on to someone else. It can be written pretty compactly like below:
{code}
/**
 * Number of bytes written for the shuffle by this task
 */
@volatile private var _shuffleBytesWritten: Long = _
def incrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten += value
def decrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten -= value
def shuffleBytesWritten = _shuffleBytesWritten
{code}
[jira] [Updated] (SPARK-3288) All fields in TaskMetrics should be private and use getters/setters
[ https://issues.apache.org/jira/browse/SPARK-3288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3288: --- Labels: starter (was: ) All fields in TaskMetrics should be private and use getters/setters --- Key: SPARK-3288 URL: https://issues.apache.org/jira/browse/SPARK-3288 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Patrick Wendell Labels: starter This is particularly bad because we expose this as a developer API. Technically a library could create a TaskMetrics object, change the values inside of it, and pass it on to someone else. It can be written pretty compactly like below:
{code}
/**
 * Number of bytes written for the shuffle by this task
 */
@volatile private var _shuffleBytesWritten: Long = _
def incrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten += value
def decrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten -= value
def shuffleBytesWritten = _shuffleBytesWritten
{code}
[jira] [Commented] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147504#comment-14147504 ] Ziv Huang commented on SPARK-3687: -- The following is the jstack dump of one executor when it hangs:
{code}
"File appending thread for /opt/spark-1.1.0-bin-hadoop2.4/work/app-20140925150845-0007/2/stderr" daemon prio=10 tid=0x7ffe0c002800 nid=0x18a3 runnable [0x7ffebc402000]
   java.lang.Thread.State: RUNNABLE
    at java.io.FileInputStream.readBytes(Native Method)
    at java.io.FileInputStream.read(FileInputStream.java:272)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    - locked 0xfaeee1d0 (a java.lang.UNIXProcess$ProcessPipeInputStream)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
    at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)

"File appending thread for /opt/spark-1.1.0-bin-hadoop2.4/work/app-20140925150845-0007/2/stdout" daemon prio=10 tid=0x7ffe0c004000 nid=0x18a2 runnable [0x7ffebc503000]
   java.lang.Thread.State: RUNNABLE
    at java.io.FileInputStream.readBytes(Native Method)
    at java.io.FileInputStream.read(FileInputStream.java:272)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    - locked 0xfaeec108 (a java.lang.UNIXProcess$ProcessPipeInputStream)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
    at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)

"process reaper" daemon prio=10 tid=0x7ffe0c001000 nid=0x1868 runnable [0x7ffecc0c7000]
   java.lang.Thread.State: RUNNABLE
    at java.lang.UNIXProcess.waitForProcessExit(Native Method)
    at java.lang.UNIXProcess.access$500(UNIXProcess.java:54)
    at java.lang.UNIXProcess$4.run(UNIXProcess.java:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

"ExecutorRunner for app-20140925150845-0007/2" daemon prio=10 tid=0x7ffe7011b800 nid=0x1866 in Object.wait() [0x7ffebc705000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on 0xfaee9df8 (a java.lang.UNIXProcess)
    at java.lang.Object.wait(Object.java:503)
    at java.lang.UNIXProcess.waitFor(UNIXProcess.java:263)
    - locked 0xfaee9df8 (a java.lang.UNIXProcess)
    at org.apache.spark.deploy.worker.ExecutorRunner.fetchAndRunExecutor(ExecutorRunner.scala:164)
    at org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:63)

"Attach Listener" daemon prio=10 tid=0x7ffe84001000 nid=0x170f waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

"sparkWorker-akka.actor.default-dispatcher-16" daemon prio=10 tid=0x7ffe68214800 nid=0x13a3 waiting on condition [0x7ffebc806000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for 0xfd614a78 (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
    at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

"sparkWorker-akka.actor.default-dispatcher-15" daemon prio=10 tid=0x7ffe7011e000 nid=0x13a2 waiting on condition [0x7ffebc604000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native
{code}
[jira] [Created] (SPARK-3688) LogicalPlan can't resolve column correctly
Yi Tian created SPARK-3688: -- Summary: LogicalPlan can't resolve column correctly Key: SPARK-3688 URL: https://issues.apache.org/jira/browse/SPARK-3688 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Yi Tian How to reproduce this problem: create a table:
{quote}
create table test (a string, b string);
{quote}
execute sql:
{quote}
select a.b, count(1) from test a join test t group by a.b;
{quote}
[jira] [Commented] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147513#comment-14147513 ] Ziv Huang commented on SPARK-3687: -- Just a few minutes ago I ran a job twice, processing 203 sequence files. Both times I saw the job hanging with different behavior from before: 1. the Spark master web UI shows that the job finished with state FAILED after 3.x minutes; 2. the job stage web UI still hangs, and the execution duration is still accumulating. Hope this information helps debugging :) Spark hang while processing more than 100 sequence files Key: SPARK-3687 URL: https://issues.apache.org/jira/browse/SPARK-3687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Ziv Huang In my application, I read more than 100 sequence files into a JavaPairRDD, perform flatMap to get another JavaRDD, and then use takeOrdered to get the result. It is quite often (but not always) that Spark hangs while executing some of the 110th-130th tasks. The job can hang for several hours, maybe forever (I can't wait for its completion). When the Spark job hangs, I can't find any error message anywhere, and I can't kill the job from the web UI. The current workaround is to use coalesce to reduce the number of partitions to be processed. I never get a hung job if the number of partitions to be processed is no greater than 80.
[jira] [Resolved] (SPARK-3422) JavaAPISuite.getHadoopInputSplits isn't used anywhere
[ https://issues.apache.org/jira/browse/SPARK-3422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved SPARK-3422. --- Resolution: Fixed JavaAPISuite.getHadoopInputSplits isn't used anywhere - Key: SPARK-3422 URL: https://issues.apache.org/jira/browse/SPARK-3422 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Trivial
[jira] [Commented] (SPARK-3688) LogicalPlan can't resolve column correctly
[ https://issues.apache.org/jira/browse/SPARK-3688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147520#comment-14147520 ] Yi Tian commented on SPARK-3688: As we know, Hive supports complex column data types like struct, so it is hard to resolve a reference like a.b.c. I think we should add some checks on the data type of the attribute, like:
{quote}
option.dataType.isInstanceOf[StructType]
{quote}
LogicalPlan can't resolve column correctly --- Key: SPARK-3688 URL: https://issues.apache.org/jira/browse/SPARK-3688 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Yi Tian How to reproduce this problem: create a table:
{quote}
create table test (a string, b string);
{quote}
execute sql:
{quote}
select a.b, count(1) from test a join test t group by a.b;
{quote}
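To illustrate the proposed check with a self-contained sketch (the `SimpleType` and `Attribute` classes below are hypothetical stand-ins for Catalyst's `DataType` and attribute types): resolution of the nested part `b` in `a.b` should only be attempted as struct-field access when the candidate attribute's type is actually a struct; otherwise `a` must be a table alias.

```scala
sealed trait SimpleType
case object StringType extends SimpleType
case class StructType(fields: Map[String, SimpleType]) extends SimpleType

case class Attribute(name: String, dataType: SimpleType)

object ResolveSketch {
  // Resolve a dotted reference against the available attributes, mimicking
  // the proposed fix: treat "a.b" as field access only for struct attributes.
  def resolve(path: Seq[String], attrs: Seq[Attribute]): Option[String] =
    path match {
      case Seq(name) =>
        attrs.find(_.name == name).map(_.name)
      case Seq(qualifier, field) =>
        attrs.find(_.name == qualifier) match {
          // Only descend into the attribute when it is actually a struct.
          case Some(Attribute(_, StructType(fields))) if fields.contains(field) =>
            Some(s"$qualifier.$field")
          case _ =>
            // Otherwise "a.b" must mean column b of table alias a,
            // which is resolved elsewhere against the alias's output.
            None
        }
      case _ => None
    }
}
```

With a plain-string column `a`, `resolve(Seq("a", "b"), …)` correctly refuses struct-field resolution, which is the case the reported query exercises.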
[jira] [Comment Edited] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147513#comment-14147513 ] Ziv Huang edited comment on SPARK-3687 at 9/25/14 8:36 AM: --- Just a few minutes ago I ran a job twice, processing 203 sequence files. Both times I saw the job hanging with different behavior than before: 1. the Spark master web UI shows that the job finished with state FAILED after 3.x minutes; 2. the job stage web UI still hangs, and the execution duration is still accumulating. Hope this information helps debugging :) was (Author: taqilabon): Just a few minutes ago I ran a job twice, processing 203 sequence files. Both times I saw the job hanging with different behavior from before: 1. the Spark master web UI shows that the job finished with state FAILED after 3.x minutes; 2. the job stage web UI still hangs, and the execution duration is still accumulating. Hope this information helps debugging :) Spark hang while processing more than 100 sequence files Key: SPARK-3687 URL: https://issues.apache.org/jira/browse/SPARK-3687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Ziv Huang In my application, I read more than 100 sequence files into a JavaPairRDD, perform flatMap to get another JavaRDD, and then use takeOrdered to get the result. It is quite often (but not always) that Spark hangs while executing some of the 110th-130th tasks. The job can hang for several hours, maybe forever (I can't wait for its completion). When the Spark job hangs, I can't find any error message anywhere, and I can't kill the job from the web UI. The current workaround is to use coalesce to reduce the number of partitions to be processed. I never get a hung job if the number of partitions to be processed is no greater than 80.
[jira] [Commented] (SPARK-3651) Consolidate executor maps in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-3651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147619#comment-14147619 ] Apache Spark commented on SPARK-3651: - User 'tigerquoll' has created a pull request for this issue: https://github.com/apache/spark/pull/2533 Consolidate executor maps in CoarseGrainedSchedulerBackend -- Key: SPARK-3651 URL: https://issues.apache.org/jira/browse/SPARK-3651 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Dale Richardson In CoarseGrainedSchedulerBackend, we have:
{code}
private val executorActor = new HashMap[String, ActorRef]
private val executorAddress = new HashMap[String, Address]
private val executorHost = new HashMap[String, String]
private val freeCores = new HashMap[String, Int]
private val totalCores = new HashMap[String, Int]
{code}
We only ever put / remove stuff from these maps together. It would simplify the code if we consolidate these all into one map as we have done in JobProgressListener in https://issues.apache.org/jira/browse/SPARK-2299.
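A sketch of the proposed consolidation: one value class per executor instead of five parallel maps. The `ExecutorData` name is an assumption (it is simply the natural choice here), and Akka's `ActorRef`/`Address` types are stubbed as type aliases to keep the sketch self-contained:

```scala
import scala.collection.mutable.HashMap

object ConsolidatedBackendSketch {
  // Stand-ins for akka.actor.ActorRef / akka.actor.Address.
  type ActorRef = AnyRef
  type Address = AnyRef

  // All per-executor state lives in one record.
  case class ExecutorData(
      actor: ActorRef,
      address: Address,
      host: String,
      var freeCores: Int,
      totalCores: Int)

  private val executorDataMap = new HashMap[String, ExecutorData]

  def register(id: String, data: ExecutorData): Unit =
    executorDataMap(id) = data

  // Removal now touches a single map, so entries can never get out of sync.
  def remove(id: String): Option[ExecutorData] =
    executorDataMap.remove(id)
}
```

The payoff is exactly the failure mode the issue describes: a put or remove can no longer update four maps and forget the fifth.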
[jira] [Created] (SPARK-3689) FileLogger should create new instance of FileSystem regardless of its scheme
Kousuke Saruta created SPARK-3689: - Summary: FileLogger should create new instance of FileSystem regardless of its scheme Key: SPARK-3689 URL: https://issues.apache.org/jira/browse/SPARK-3689 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta FileLogger creates a new instance of FileSystem to avoid the effect of FileSystem#close from another module, but only HDFS is expected. We can use another filesystem for the directory where the event log is stored.
{code}
if (scheme == "hdfs") {
  conf.setBoolean("fs.hdfs.impl.disable.cache", true)
}
{code}
[jira] [Commented] (SPARK-3689) FileLogger should create new instance of FileSystem regardless of its scheme
[ https://issues.apache.org/jira/browse/SPARK-3689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147689#comment-14147689 ] Apache Spark commented on SPARK-3689: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2534 FileLogger should create new instance of FileSystem regardless of its scheme - Key: SPARK-3689 URL: https://issues.apache.org/jira/browse/SPARK-3689 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta FileLogger creates a new instance of FileSystem to avoid the effect of FileSystem#close from another module, but only HDFS is expected. We can use another filesystem for the directory where the event log is stored.
{code}
if (scheme == "hdfs") {
  conf.setBoolean("fs.hdfs.impl.disable.cache", true)
}
{code}
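Hadoop's cache-disable key follows the pattern `fs.<scheme>.impl.disable.cache`, so the hard-coded hdfs check can be generalized. A hedged sketch (the surrounding FileLogger code is paraphrased, not the actual patch):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

object FileLoggerSketch {
  // Disable the FileSystem cache for whatever scheme the log directory uses,
  // not just hdfs, so a FileSystem#close in another module cannot affect us.
  def newFileSystem(logDir: String, conf: Configuration): FileSystem = {
    val uri = new URI(logDir)
    val scheme = Option(uri.getScheme).getOrElse("file")
    conf.setBoolean(s"fs.$scheme.impl.disable.cache", true)
    FileSystem.get(uri, conf)
  }
}
```

With the cache disabled for that scheme, `FileSystem.get` returns a fresh instance rather than the shared cached one.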
[jira] [Commented] (SPARK-3682) Add helpful warnings to the UI
[ https://issues.apache.org/jira/browse/SPARK-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147730#comment-14147730 ] Arun Ahuja commented on SPARK-3682: --- We've been running into a lot of these issues, so this would be very helpful. Could you explain this one, however: "Warn that tasks in a particular stage are spilling a lot, and that the number of partitions should probably be decreased"? Thanks! Add helpful warnings to the UI -- Key: SPARK-3682 URL: https://issues.apache.org/jira/browse/SPARK-3682 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.1.0 Reporter: Sandy Ryza Spark has a zillion configuration options and a zillion different things that can go wrong with a job. Improvements like incremental and better metrics and the proposed Spark replay debugger provide more insight into what's going on under the covers. However, it's difficult for non-advanced users to synthesize this information and understand where to direct their attention. It would be helpful to have some sort of central location on the UI that users could go to for indications about why an app/job is failing or performing poorly. Some helpful messages that we could provide:
* Warn that the tasks in a particular stage are spending a long time in GC.
* Warn that spark.shuffle.memoryFraction does not fit inside the young generation.
* Warn that tasks in a particular stage are very short, and that the number of partitions should probably be decreased.
* Warn that tasks in a particular stage are spilling a lot, and that the number of partitions should probably be decreased.
* Warn that a cached RDD that gets a lot of use does not fit in memory, and a lot of time is being spent recomputing it.
To start, probably two kinds of warnings would be most helpful.
* Warnings at the app level that report on misconfigurations, issues with the general health of executors.
* Warnings at the job level that indicate why a job might be performing slowly.
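As a concrete example of the kind of heuristic the first bullet implies, a GC-time warning might look like the sketch below. The 10% threshold, the `TaskSummary` shape, and the wording are illustrative assumptions, not an actual Spark API:

```scala
object StageHealthSketch {
  case class TaskSummary(durationMs: Long, gcTimeMs: Long)

  // Warn when tasks in a stage spend more than `threshold` of their time in GC.
  def gcWarning(tasks: Seq[TaskSummary], threshold: Double = 0.10): Option[String] = {
    val total = tasks.map(_.durationMs).sum
    val gc = tasks.map(_.gcTimeMs).sum
    if (total > 0 && gc.toDouble / total > threshold)
      Some(f"Tasks spent ${100.0 * gc / total}%.1f%% of their time in GC; " +
        "consider increasing executor memory or the young generation size.")
    else None
  }
}
```

Each proposed warning reduces to a predicate like this over already-collected task metrics, which is what makes a central "health" page feasible.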
[jira] [Commented] (SPARK-3638) Commons HTTP client dependency conflict in extras/kinesis-asl module
[ https://issues.apache.org/jira/browse/SPARK-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147751#comment-14147751 ] Apache Spark commented on SPARK-3638: - User 'aniketbhatnagar' has created a pull request for this issue: https://github.com/apache/spark/pull/2535 Commons HTTP client dependency conflict in extras/kinesis-asl module Key: SPARK-3638 URL: https://issues.apache.org/jira/browse/SPARK-3638 Project: Spark Issue Type: Bug Components: Examples, Streaming Affects Versions: 1.1.0 Reporter: Aniket Bhatnagar Labels: dependencies Followed instructions as mentioned @ https://github.com/apache/spark/blob/master/docs/streaming-kinesis-integration.md and when running the example, I get the following error:
{code}
Caused by: java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
    at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
    at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:114)
    at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:99)
    at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:29)
    at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97)
    at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:181)
    at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:119)
    at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:103)
    at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:136)
    at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:117)
    at com.amazonaws.services.kinesis.AmazonKinesisAsyncClient.<init>(AmazonKinesisAsyncClient.java:132)
{code}
I believe this
is due to the dependency conflict as described @ http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccajob8btdxks-7-spjj5jmnw0xsnrjwdpcqqtjht1hun6j4z...@mail.gmail.com%3E
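`NoSuchMethodError` on a constructor like this usually means an older httpclient won the dependency resolution. One common application-side workaround is to exclude the transitive httpclient and pin the version the AWS SDK needs; a hedged sbt sketch (version numbers are illustrative, not a verified fix for this issue):

```scala
// build.sbt fragment: force a single httpclient version so the AWS SDK's
// DefaultClientConnectionOperator constructor lookup succeeds at runtime.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.1.0"
    exclude("org.apache.httpcomponents", "httpclient"),
  // A 4.2+ httpclient provides the (SchemeRegistry, DnsResolver) constructor.
  "org.apache.httpcomponents" % "httpclient" % "4.2.6"
)
```

The proper fix, presumably what the pull request does, is to align the versions in Spark's own build rather than in every user application.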
[jira] [Commented] (SPARK-3639) Kinesis examples set master as local
[ https://issues.apache.org/jira/browse/SPARK-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147758#comment-14147758 ] Apache Spark commented on SPARK-3639: - User 'aniketbhatnagar' has created a pull request for this issue: https://github.com/apache/spark/pull/2536 Kinesis examples set master as local Key: SPARK-3639 URL: https://issues.apache.org/jira/browse/SPARK-3639 Project: Spark Issue Type: Bug Components: Examples, Streaming Affects Versions: 1.0.2, 1.1.0 Reporter: Aniket Bhatnagar Priority: Minor Labels: examples The Kinesis examples set the master as local, thus not allowing the examples to be tested on a cluster.
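The usual fix is to leave the master unset in the example and let spark-submit supply it. A sketch of the idea (the class name and app name are illustrative, not the actual example code):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KinesisExampleSketch {
  def main(args: Array[String]): Unit = {
    // No setMaster("local[...]") here: the master comes from
    // spark-submit --master, so the example also runs on a cluster.
    val conf = new SparkConf().setAppName("KinesisWordCount")
    val ssc = new StreamingContext(conf, Seconds(2))
    // ... set up the Kinesis stream and start the context ...
  }
}
```

Run locally with `spark-submit --master local[4] ...` or against a cluster with `--master spark://host:7077`, with no code change.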
[jira] [Created] (SPARK-3690) Closing shuffle writers we swallow more important exception
Egor Pakhomov created SPARK-3690: Summary: Closing shuffle writers we swallow more important exception Key: SPARK-3690 URL: https://issues.apache.org/jira/browse/SPARK-3690 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Egor Pakhomov Priority: Minor Fix For: 1.2.0 ShuffleMapTask, line 75:
{quote}
case e: Exception =>
  if (writer != null) {
    writer.stop(success = false)
  }
  throw e
{quote}
An exception in writer.stop() swallows the important one. I couldn't find the reason for the problem for days. Search the internet for java.io.FileNotFoundException: /local/hd2/yarn/local/usercache/epahomov/appcache/application_1411219858924_12991/spark-local-20140924225309-03f5/21/shuffle_4_12_147 (No such file or directory) - there are plenty of poor guys like me.
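A sketch of the fix: catch and log any secondary failure from the cleanup path so the original task exception propagates. The helper name `withCleanup` and the use of stderr for logging are assumptions; the structure mirrors the quoted ShuffleMapTask handler:

```scala
object SuppressSketch {
  // Run `body`; on failure run `cleanup`, but never let a cleanup
  // exception mask the primary one.
  def withCleanup[T](body: => T)(cleanup: => Unit): T =
    try body
    catch {
      case e: Exception =>
        try cleanup
        catch {
          case suppressed: Exception =>
            // Log and drop: the original exception is the useful one.
            System.err.println(s"Exception during cleanup (suppressed): $suppressed")
        }
        throw e
    }
}
```

Applied to the quoted code, `writer.stop(success = false)` becomes the `cleanup` argument, so a FileNotFoundException from stop() is logged instead of replacing the task's real failure.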
[jira] [Updated] (SPARK-3690) Closing shuffle writers we swallow more important exception
[ https://issues.apache.org/jira/browse/SPARK-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pakhomov updated SPARK-3690: - Description: ShuffleMapTask, line 75:
{code:title=Bar.java|borderStyle=solid}
case e: Exception =>
  if (writer != null) {
    writer.stop(success = false)
  }
  throw e
{code}
An exception in writer.stop() swallows the important one. I couldn't find the reason for the problem for days. Search the internet for java.io.FileNotFoundException: /local/hd2/yarn/local/usercache/epahomov/appcache/application_1411219858924_12991/spark-local-20140924225309-03f5/21/shuffle_4_12_147 (No such file or directory) - there are plenty of poor guys like me. was: ShuffleMapTask, line 75:
{quote}
case e: Exception =>
  if (writer != null) {
    writer.stop(success = false)
  }
  throw e
{quote}
An exception in writer.stop() swallows the important one. I couldn't find the reason for the problem for days. Search the internet for java.io.FileNotFoundException: /local/hd2/yarn/local/usercache/epahomov/appcache/application_1411219858924_12991/spark-local-20140924225309-03f5/21/shuffle_4_12_147 (No such file or directory) - there are plenty of poor guys like me. Closing shuffle writers we swallow more important exception --- Key: SPARK-3690 URL: https://issues.apache.org/jira/browse/SPARK-3690 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Egor Pakhomov Priority: Minor Fix For: 1.2.0 ShuffleMapTask, line 75:
{code:title=Bar.java|borderStyle=solid}
case e: Exception =>
  if (writer != null) {
    writer.stop(success = false)
  }
  throw e
{code}
An exception in writer.stop() swallows the important one. I couldn't find the reason for the problem for days. Search the internet for java.io.FileNotFoundException: /local/hd2/yarn/local/usercache/epahomov/appcache/application_1411219858924_12991/spark-local-20140924225309-03f5/21/shuffle_4_12_147 (No such file or directory) - there are plenty of poor guys like me.
[jira] [Updated] (SPARK-3690) Closing shuffle writers we swallow more important exception
[ https://issues.apache.org/jira/browse/SPARK-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pakhomov updated SPARK-3690: - Description: ShuffleMapTask, line 75:
{code:title=ShuffleMapTask|borderStyle=solid}
case e: Exception =>
  if (writer != null) {
    writer.stop(success = false)
  }
  throw e
{code}
An exception in writer.stop() swallows the important one. I couldn't find the reason for the problem for days. Search the internet for java.io.FileNotFoundException: /local/hd2/yarn/local/usercache/epahomov/appcache/application_1411219858924_12991/spark-local-20140924225309-03f5/21/shuffle_4_12_147 (No such file or directory) - there are plenty of poor guys like me. was: ShuffleMapTask, line 75:
{code:title=Bar.java|borderStyle=solid}
case e: Exception =>
  if (writer != null) {
    writer.stop(success = false)
  }
  throw e
{code}
An exception in writer.stop() swallows the important one. I couldn't find the reason for the problem for days. Search the internet for java.io.FileNotFoundException: /local/hd2/yarn/local/usercache/epahomov/appcache/application_1411219858924_12991/spark-local-20140924225309-03f5/21/shuffle_4_12_147 (No such file or directory) - there are plenty of poor guys like me. Closing shuffle writers we swallow more important exception --- Key: SPARK-3690 URL: https://issues.apache.org/jira/browse/SPARK-3690 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Egor Pakhomov Priority: Minor Fix For: 1.2.0 ShuffleMapTask, line 75:
{code:title=ShuffleMapTask|borderStyle=solid}
case e: Exception =>
  if (writer != null) {
    writer.stop(success = false)
  }
  throw e
{code}
An exception in writer.stop() swallows the important one. I couldn't find the reason for the problem for days. Search the internet for java.io.FileNotFoundException: /local/hd2/yarn/local/usercache/epahomov/appcache/application_1411219858924_12991/spark-local-20140924225309-03f5/21/shuffle_4_12_147 (No such file or directory) - there are plenty of poor guys like me.
[jira] [Commented] (SPARK-3678) Yarn app name reported in RM is different between cluster and client mode
[ https://issues.apache.org/jira/browse/SPARK-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147835#comment-14147835 ] Thomas Graves commented on SPARK-3678: -- Also note that the spark-submit --name option doesn't work in client mode, at least not for the Spark examples. Yarn app name reported in RM is different between cluster and client mode - Key: SPARK-3678 URL: https://issues.apache.org/jira/browse/SPARK-3678 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Thomas Graves If you launch an application in yarn-cluster mode, the name of the application in the ResourceManager generally shows up as the full name org.apache.spark.examples.SparkHdfsLR. If you start the same app in client mode, it shows up as SparkHdfsLR. We should be consistent between them. I haven't looked at it in detail; perhaps it's only the examples, but I think I've seen this with customer apps also.
[jira] [Issue Comment Deleted] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ziv Huang updated SPARK-3687: - Comment: was deleted (was: Just a few mins ago I ran a job twice, processing 203 sequence files. Both times I saw the job hang with different behavior than before: 1. the web UI of the Spark master shows that the job finished with state FAILED after 3.x mins 2. the job stage web UI still hangs, and the execution duration is still accumulating. Hope this information helps debugging :)) Spark hang while processing more than 100 sequence files Key: SPARK-3687 URL: https://issues.apache.org/jira/browse/SPARK-3687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Ziv Huang In my application, I read more than 100 sequence files into a JavaPairRDD, perform a flatmap to get another JavaRDD, and then use takeOrdered to get the result. Quite often (but not always), Spark hangs while executing some of the 120th-150th tasks. In 1.0.2, the job can hang for several hours, maybe forever (I can't wait for its completion). When the Spark job hangs, I can't kill it from the web UI. In 1.1.0, the job hangs for a couple of minutes (3.x minutes, actually), and then the web UI of the Spark master shows that the job finished with state FAILED. In addition, the job stage web UI still hangs, and the execution duration is still accumulating. For both 1.0.2 and 1.1.0, the job hangs with no error messages anywhere. The current workaround is to use coalesce to reduce the number of partitions to be processed. I never get a hung job if the number of partitions to be processed is no greater than 100. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ziv Huang updated SPARK-3687: - Description: In my application, I read more than 100 sequence files into a JavaPairRDD, perform a flatmap to get another JavaRDD, and then use takeOrdered to get the result. Quite often (but not always), Spark hangs while executing some of the 120th-150th tasks. In 1.0.2, the job can hang for several hours, maybe forever (I can't wait for its completion). When the Spark job hangs, I can't kill it from the web UI. In 1.1.0, the job hangs for a couple of minutes (3.x minutes, actually), and then the web UI of the Spark master shows that the job finished with state FAILED. In addition, the job stage web UI still hangs, and the execution duration is still accumulating. For both 1.0.2 and 1.1.0, the job hangs with no error messages anywhere. The current workaround is to use coalesce to reduce the number of partitions to be processed. I never get a hung job if the number of partitions to be processed is no greater than 100. was: In my application, I read more than 100 sequence files into a JavaPairRDD, perform a flatmap to get another JavaRDD, and then use takeOrdered to get the result. Quite often (but not always), Spark hangs while executing some of the 110th-130th tasks. The job can hang for several hours, maybe forever (I can't wait for its completion). When the Spark job hangs, I can't find any error message anywhere, and I can't kill the job from the web UI. The current workaround is to use coalesce to reduce the number of partitions to be processed. I never get a hung job if the number of partitions to be processed is no greater than 80.
Spark hang while processing more than 100 sequence files Key: SPARK-3687 URL: https://issues.apache.org/jira/browse/SPARK-3687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Ziv Huang In my application, I read more than 100 sequence files into a JavaPairRDD, perform a flatmap to get another JavaRDD, and then use takeOrdered to get the result. Quite often (but not always), Spark hangs while executing some of the 120th-150th tasks. In 1.0.2, the job can hang for several hours, maybe forever (I can't wait for its completion). When the Spark job hangs, I can't kill it from the web UI. In 1.1.0, the job hangs for a couple of minutes (3.x minutes, actually), and then the web UI of the Spark master shows that the job finished with state FAILED. In addition, the job stage web UI still hangs, and the execution duration is still accumulating. For both 1.0.2 and 1.1.0, the job hangs with no error messages anywhere. The current workaround is to use coalesce to reduce the number of partitions to be processed. I never get a hung job if the number of partitions to be processed is no greater than 100. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
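The coalesce workaround described above simply caps how many partitions a stage has to process. A toy, Spark-free sketch of what coalescing 203 inputs down to at most 80 partitions does to the partition count (pure Python, illustrative only; the real RDD.coalesce also tries to preserve data locality, which this ignores):

```python
def coalesce(partitions, max_parts):
    """Group an input list of partitions into at most max_parts buckets,
    mimicking the effect of RDD.coalesce(max_parts) on partition count."""
    n = min(max_parts, len(partitions))
    buckets = [[] for _ in range(n)]
    for i, p in enumerate(partitions):
        buckets[i % n].append(p)  # round-robin assignment
    return buckets

files = [f"part-{i:05d}" for i in range(203)]
grouped = coalesce(files, 80)
print(len(grouped))  # 80 partitions instead of 203
```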
[jira] [Comment Edited] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147504#comment-14147504 ] Ziv Huang edited comment on SPARK-3687 at 9/25/14 3:09 PM: --- The following is the jstack dump of one executor when it hangs (the spark version is 1.1.0):
{code}
"File appending thread for /opt/spark-1.1.0-bin-hadoop2.4/work/app-20140925150845-0007/2/stderr" daemon prio=10 tid=0x7ffe0c002800 nid=0x18a3 runnable [0x7ffebc402000]
   java.lang.Thread.State: RUNNABLE
	at java.io.FileInputStream.readBytes(Native Method)
	at java.io.FileInputStream.read(FileInputStream.java:272)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
	- locked <0xfaeee1d0> (a java.lang.UNIXProcess$ProcessPipeInputStream)
	at java.io.FilterInputStream.read(FilterInputStream.java:107)
	at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
	at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
	at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
	at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
	at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)

"File appending thread for /opt/spark-1.1.0-bin-hadoop2.4/work/app-20140925150845-0007/2/stdout" daemon prio=10 tid=0x7ffe0c004000 nid=0x18a2 runnable [0x7ffebc503000]
   java.lang.Thread.State: RUNNABLE
	at java.io.FileInputStream.readBytes(Native Method)
	at java.io.FileInputStream.read(FileInputStream.java:272)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
	- locked <0xfaeec108> (a java.lang.UNIXProcess$ProcessPipeInputStream)
	at java.io.FilterInputStream.read(FilterInputStream.java:107)
	at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
	at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
	at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
	at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
	at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)

"process reaper" daemon prio=10 tid=0x7ffe0c001000 nid=0x1868 runnable [0x7ffecc0c7000]
   java.lang.Thread.State: RUNNABLE
	at java.lang.UNIXProcess.waitForProcessExit(Native Method)
	at java.lang.UNIXProcess.access$500(UNIXProcess.java:54)
	at java.lang.UNIXProcess$4.run(UNIXProcess.java:227)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

"ExecutorRunner for app-20140925150845-0007/2" daemon prio=10 tid=0x7ffe7011b800 nid=0x1866 in Object.wait() [0x7ffebc705000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0xfaee9df8> (a java.lang.UNIXProcess)
	at java.lang.Object.wait(Object.java:503)
	at java.lang.UNIXProcess.waitFor(UNIXProcess.java:263)
	- locked <0xfaee9df8> (a java.lang.UNIXProcess)
	at org.apache.spark.deploy.worker.ExecutorRunner.fetchAndRunExecutor(ExecutorRunner.scala:164)
	at org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:63)

"Attach Listener" daemon prio=10 tid=0x7ffe84001000 nid=0x170f waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

"sparkWorker-akka.actor.default-dispatcher-16" daemon prio=10 tid=0x7ffe68214800 nid=0x13a3 waiting on condition [0x7ffebc806000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0xfd614a78> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
	at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

"sparkWorker-akka.actor.default-dispatcher-15" daemon prio=10 tid=0x7ffe7011e000 nid=0x13a2 waiting on condition [0x7ffebc604000]
{code}
[jira] [Commented] (SPARK-3690) Closing shuffle writers we swallow more important exception
[ https://issues.apache.org/jira/browse/SPARK-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147864#comment-14147864 ] Egor Pakhomov commented on SPARK-3690: -- https://github.com/apache/spark/pull/2537 Closing shuffle writers we swallow more important exception --- Key: SPARK-3690 URL: https://issues.apache.org/jira/browse/SPARK-3690 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Egor Pakhomov Priority: Minor Fix For: 1.2.0 ShuffleMapTask: line 75
{code:title=ShuffleMapTask|borderStyle=solid}
case e: Exception =>
  if (writer != null) {
    writer.stop(success = false)
  }
  throw e
{code}
An exception in writer.stop() swallows the important one. I couldn't find the reason for my problems for days. Search the internet for java.io.FileNotFoundException: /local/hd2/yarn/local/usercache/epahomov/appcache/application_1411219858924_12991/spark-local-20140924225309-03f5/21/shuffle_4_12_147 (No such file or directory) - there are plenty of poor guys like me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3690) Closing shuffle writers we swallow more important exception
[ https://issues.apache.org/jira/browse/SPARK-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147867#comment-14147867 ] Apache Spark commented on SPARK-3690: - User 'epahomov' has created a pull request for this issue: https://github.com/apache/spark/pull/2537 Closing shuffle writers we swallow more important exception --- Key: SPARK-3690 URL: https://issues.apache.org/jira/browse/SPARK-3690 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Egor Pakhomov Priority: Minor Fix For: 1.2.0 ShuffleMapTask: line 75
{code:title=ShuffleMapTask|borderStyle=solid}
case e: Exception =>
  if (writer != null) {
    writer.stop(success = false)
  }
  throw e
{code}
An exception in writer.stop() swallows the important one. I couldn't find the reason for my problems for days. Search the internet for java.io.FileNotFoundException: /local/hd2/yarn/local/usercache/epahomov/appcache/application_1411219858924_12991/spark-local-20140924225309-03f5/21/shuffle_4_12_147 (No such file or directory) - there are plenty of poor guys like me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2516) Bootstrapping
[ https://issues.apache.org/jira/browse/SPARK-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147875#comment-14147875 ] Yu Ishikawa commented on SPARK-2516: Hi [~mengxr], I would like to work on this issue, if possible. If you have any design for bootstrapping, could you share it with me? Bootstrapping - Key: SPARK-2516 URL: https://issues.apache.org/jira/browse/SPARK-2516 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Support re-sampling and bootstrap estimators in MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147918#comment-14147918 ] Arun Ahuja commented on SPARK-3633: --- Which timeout values were increased to work around this? We have been seeing many more errors with FetchFailed(BlockManagerId(21, and I also see the following exception with that failure:
{code}
java.io.IOException: Failed to list files for dir: /data/09/mapred/local/yarn/nm/usercache/ahujaa01/appcache/application_1403901413406_1926/spark-local-20140925115858-c4a7
	at org.apache.spark.util.Utils$.listFilesSafely(Utils.scala:673)
	at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:685)
{code}
Fetches failure observed after SPARK-2711 - Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.1.0 Reporter: Nishkam Ravi Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s):
{code}
14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120)
14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages
{code}
In order to identify the problem, I carried out change set analysis. As I go back in time, the error message changes to:
{code}
14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files)
	java.io.FileOutputStream.open(Native Method)
	java.io.FileOutputStream.<init>(FileOutputStream.java:221)
	org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)
	org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185)
	org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197)
	org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
	org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
	org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51)
	org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	org.apache.spark.scheduler.Task.run(Task.scala:54)
	org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	java.lang.Thread.run(Thread.java:745)
{code}
All the way until Aug 4th. It turns out the problem changeset is 4fde28c. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147939#comment-14147939 ] Zhan Zhang commented on SPARK-3633: --- Increasing the timeout does not help my case either. I still keep getting fetch errors. Fetches failure observed after SPARK-2711 - Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.1.0 Reporter: Nishkam Ravi Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s):
{code}
14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120)
14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages
{code}
In order to identify the problem, I carried out change set analysis. As I go back in time, the error message changes to:
{code}
14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files)
	java.io.FileOutputStream.open(Native Method)
	java.io.FileOutputStream.<init>(FileOutputStream.java:221)
	org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)
	org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185)
	org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197)
	org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
	org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
	org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51)
	org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	org.apache.spark.scheduler.Task.run(Task.scala:54)
	org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	java.lang.Thread.run(Thread.java:745)
{code}
All the way until Aug 4th. It turns out the problem changeset is 4fde28c. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148004#comment-14148004 ] Arun Ahuja commented on SPARK-3633: --- Also, which timeout setting was useful: spark.akka.timeout or spark.core.connection.ack.wait.timeout? Using GC logging, I see this both when there are many full GCs and, on smaller datasets, when there are not; it is much more frequent in the former case. Fetches failure observed after SPARK-2711 - Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.1.0 Reporter: Nishkam Ravi Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s):
{code}
14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120)
14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages
{code}
In order to identify the problem, I carried out change set analysis. As I go back in time, the error message changes to:
{code}
14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files)
	java.io.FileOutputStream.open(Native Method)
	java.io.FileOutputStream.<init>(FileOutputStream.java:221)
	org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)
	org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185)
	org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197)
	org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
	org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
	org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51)
	org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	org.apache.spark.scheduler.Task.run(Task.scala:54)
	org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	java.lang.Thread.run(Thread.java:745)
{code}
All the way until Aug 4th. It turns out the problem changeset is 4fde28c. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148010#comment-14148010 ] Mayank Bansal commented on SPARK-3561: -- Hi guys, we at eBay are having some issues with cluster utilization while running Spark-on-YARN alongside Hadoop batch workloads. I think this would be nice to try out to see if we overcome this issue. We would be interested in trying this out. Thanks, Mayank Native Hadoop/YARN integration for batch/ETL workloads -- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource managers such as Apache Hadoop YARN, Mesos, etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large-scale batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to the Hadoop execution environment - as a non-public API (@DeveloperApi) not exposed to end users of Spark. The trait will define only 4 operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be accessed by SparkContext via a master URL such as execution-context:foo.bar.MyJobExecutionContext, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to such an implementation. An integrator will now have the option to provide a custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext. Please see the attached design doc for more details. A pull request will be posted shortly as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
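The pluggable-context idea above can be sketched as an abstract interface plus a default implementation that an integrator could replace. This is only an illustration of the proposal's shape in Python, not the actual Scala trait; everything except the four listed operation names is hypothetical:

```python
from abc import ABC, abstractmethod

class JobExecutionContext(ABC):
    """Gateway to an execution environment; mirrors the 4 operations the
    proposal lists (hadoopFile, newAPIHadoopFile, broadcast, runJob)."""
    @abstractmethod
    def hadoop_file(self, path): ...
    @abstractmethod
    def new_api_hadoop_file(self, path): ...
    @abstractmethod
    def broadcast(self, value): ...
    @abstractmethod
    def run_job(self, func, data): ...

class DefaultExecutionContext(JobExecutionContext):
    """Default behavior; a custom integrator would subclass or replace this."""
    def hadoop_file(self, path):
        return f"reading {path} with the old Hadoop API"
    def new_api_hadoop_file(self, path):
        return f"reading {path} with the new Hadoop API"
    def broadcast(self, value):
        return value  # no cluster here: a broadcast is just the value
    def run_job(self, func, data):
        return [func(x) for x in data]

def load_context(master_url):
    """Pick the context from a master URL, loosely following the proposed
    'execution-context:foo.bar.MyJobExecutionContext' scheme (simplified)."""
    if master_url.startswith("execution-context:"):
        raise NotImplementedError("custom contexts are not wired up in this sketch")
    return DefaultExecutionContext()

ctx = load_context("local")
print(ctx.run_job(lambda x: x * x, [1, 2, 3]))  # [1, 4, 9]
```

The point of the indirection is that SparkContext-level methods delegate to whichever context the master URL selects, so swapping execution environments requires no changes to user code.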
[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148026#comment-14148026 ] Nishkam Ravi commented on SPARK-3633: - Increasing the value of spark.core.connection.ack.wait.timeout (to 600) worked in my case. Fetches failure observed after SPARK-2711 - Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.1.0 Reporter: Nishkam Ravi Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s):
{code}
14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120)
14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages
{code}
In order to identify the problem, I carried out change set analysis. As I go back in time, the error message changes to:
{code}
14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files)
	java.io.FileOutputStream.open(Native Method)
	java.io.FileOutputStream.<init>(FileOutputStream.java:221)
	org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)
	org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185)
	org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197)
	org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
	org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
	org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51)
	org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	org.apache.spark.scheduler.Task.run(Task.scala:54)
	org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	java.lang.Thread.run(Thread.java:745)
{code}
All the way until Aug 4th. It turns out the problem changeset is 4fde28c. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
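For reference, the workaround reported above is a plain configuration change. One way to apply it, using the value from this thread (600 seconds) rather than a recommended setting:

```
# spark-defaults.conf
spark.core.connection.ack.wait.timeout   600
```

The same value can also be passed per job with spark-submit --conf spark.core.connection.ack.wait.timeout=600.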
[jira] [Commented] (SPARK-546) Support full outer join and multiple join in a single shuffle
[ https://issues.apache.org/jira/browse/SPARK-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148052#comment-14148052 ] Aaron Staple commented on SPARK-546: Hi, I think there are two features requested in this ticket: 1) full outer join 2) an RDD function to join 2 rdds in a single shuffle (e.g. multiJoin function) I’ve implemented #1 in my recent PR, but not #2. I’m happy to implement #2 as well though. Would it make sense to reopen this ticket? File a new ticket? Support full outer join and multiple join in a single shuffle - Key: SPARK-546 URL: https://issues.apache.org/jira/browse/SPARK-546 Project: Spark Issue Type: Improvement Components: Spark Core, Streaming Reporter: Reynold Xin Assignee: Aaron Staple Fix For: 1.2.0 RDD[(K,V)] now supports left/right outer join but not full outer join. Also it'd be nice to provide a way for users to join multiple RDDs on the same key in a single shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
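Full outer join semantics (feature #1 above) can be stated compactly: keep every key from either side, padding the missing side with None. A minimal Spark-free Python sketch of those semantics (this is not Spark's cogroup-based implementation; the function name and representation are illustrative):

```python
def full_outer_join(left, right):
    """left, right: lists of (key, value) pairs.
    Returns {key: [(left_value, right_value), ...]} keeping keys from
    both sides, with None on whichever side lacks the key."""
    grouped = {}
    for k, v in left:
        grouped.setdefault(k, ([], []))[0].append(v)
    for k, v in right:
        grouped.setdefault(k, ([], []))[1].append(v)
    # Pad an empty side with [None] so every key yields at least one pair,
    # mirroring outer-join behavior.
    return {
        k: [(lv, rv) for lv in (ls or [None]) for rv in (rs or [None])]
        for k, (ls, rs) in grouped.items()
    }

pairs = full_outer_join([("a", 1), ("b", 2)], [("b", 20), ("c", 30)])
print(pairs)  # {'a': [(1, None)], 'b': [(2, 20)], 'c': [(None, 30)]}
```

Feature #2 (joining several RDDs in one shuffle) would generalize the same grouping step to N inputs before producing the padded tuples.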
[jira] [Created] (SPARK-3691) Provide a mini cluster for testing system built on Spark
Xuefu Zhang created SPARK-3691: -- Summary: Provide a mini cluster for testing system built on Spark Key: SPARK-3691 URL: https://issues.apache.org/jira/browse/SPARK-3691 Project: Spark Issue Type: Test Components: Spark Core Affects Versions: 1.1.0 Reporter: Xuefu Zhang Most Hadoop components, such as MR, DFS, Tez, and YARN, provide a mini cluster that can be used to test external systems that rely on those frameworks, such as Pig and Hive. While Spark's local mode can be used for such testing and is friendly for debugging, it's too far from a real Spark cluster, and a lot of problems cannot be discovered. Thus, an equivalent of the Hadoop MR mini cluster in Spark would be very helpful in testing systems such as Hive/Pig on Spark. Spark's local-cluster was considered for this purpose, but it doesn't fit well because it requires a Spark installation on the box where the tests run. Also, local-cluster isn't exposed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2932) Move MasterFailureTest out of main source directory
[ https://issues.apache.org/jira/browse/SPARK-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2932: - Fix Version/s: 1.2.0 Move MasterFailureTest out of main source directory - Key: SPARK-2932 URL: https://issues.apache.org/jira/browse/SPARK-2932 Project: Spark Issue Type: Task Components: Streaming Reporter: Marcelo Vanzin Priority: Trivial Fix For: 1.2.0 Currently, MasterFailureTest.scala lives in streaming/src/main, which means it ends up in the published streaming jar. It's only used by other test code, and although it also provides a main() entry point, that's also only usable for testing, so the code should probably be moved to the test directory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2932) Move MasterFailureTest out of main source directory
[ https://issues.apache.org/jira/browse/SPARK-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-2932. -- Resolution: Fixed Move MasterFailureTest out of main source directory - Key: SPARK-2932 URL: https://issues.apache.org/jira/browse/SPARK-2932 Project: Spark Issue Type: Task Components: Streaming Reporter: Marcelo Vanzin Priority: Trivial Fix For: 1.2.0 Currently, MasterFailureTest.scala lives in streaming/src/main, which means it ends up in the published streaming jar. It's only used by other test code, and although it also provides a main() entry point, that's also only usable for testing, so the code should probably be moved to the test directory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3691) Provide a mini cluster for testing system built on Spark
[ https://issues.apache.org/jira/browse/SPARK-3691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148071#comment-14148071 ] Xuefu Zhang commented on SPARK-3691: cc [~sandyr] Provide a mini cluster for testing system built on Spark Key: SPARK-3691 URL: https://issues.apache.org/jira/browse/SPARK-3691 Project: Spark Issue Type: Test Components: Spark Core Affects Versions: 1.1.0 Reporter: Xuefu Zhang Most Hadoop components, such as MR, DFS, Tez, and YARN, provide a mini cluster that can be used to test external systems that rely on those frameworks, such as Pig and Hive. While Spark's local mode can be used for such testing and is friendly for debugging, it's too far from a real Spark cluster, and a lot of problems cannot be discovered. Thus, an equivalent of the Hadoop MR mini cluster in Spark would be very helpful in testing systems such as Hive/Pig on Spark. Spark's local-cluster was considered for this purpose, but it doesn't fit well because it requires a Spark installation on the box where the tests run. Also, local-cluster isn't exposed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3691) Provide a mini cluster for testing system built on Spark
[ https://issues.apache.org/jira/browse/SPARK-3691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148071#comment-14148071 ] Xuefu Zhang edited comment on SPARK-3691 at 9/25/14 6:21 PM: - cc [~sandyr], [~rxin] was (Author: xuefuz): cc [~sandyr] Provide a mini cluster for testing system built on Spark Key: SPARK-3691 URL: https://issues.apache.org/jira/browse/SPARK-3691 Project: Spark Issue Type: Test Components: Spark Core Affects Versions: 1.1.0 Reporter: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1823) ExternalAppendOnlyMap can still OOM if one key is very large
[ https://issues.apache.org/jira/browse/SPARK-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148208#comment-14148208 ] Josh Rosen edited comment on SPARK-1823 at 9/25/14 7:42 PM: SPARK-3074 is a related issue for large group-bys in PySpark. was (Author: joshrosen): SPARK-3074 is a related issue for PySpark. ExternalAppendOnlyMap can still OOM if one key is very large Key: SPARK-1823 URL: https://issues.apache.org/jira/browse/SPARK-1823 Project: Spark Issue Type: Bug Affects Versions: 1.0.2, 1.1.0 Reporter: Andrew Or If the values for one key do not collectively fit into memory, then the map will still OOM when you merge the spilled contents back in. This is a problem especially for PySpark, since we hash the keys (Python objects) before a shuffle, and there are only so many integers out there in the world, so there could potentially be many collisions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1823) ExternalAppendOnlyMap can still OOM if one key is very large
[ https://issues.apache.org/jira/browse/SPARK-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148208#comment-14148208 ] Josh Rosen commented on SPARK-1823: --- SPARK-3074 is a related issue for PySpark. ExternalAppendOnlyMap can still OOM if one key is very large Key: SPARK-1823 URL: https://issues.apache.org/jira/browse/SPARK-1823 Project: Spark Issue Type: Bug Affects Versions: 1.0.2, 1.1.0 Reporter: Andrew Or -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
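The collision mechanism the ticket describes can be seen with nothing but Java's own String hashing: the strings "Aa" and "BB" famously share a hash code, so any scheme that groups records by a 32-bit hash of their key merges them into one bucket. A minimal, illustrative sketch (plain Java, not PySpark or Spark code):

```java
import java.util.*;

// Illustrative only: distinct keys can share a hash code, so grouping by hashed
// keys (as PySpark does before a shuffle) can pile unrelated keys into one bucket.
public class HashCollisionDemo {
    // Group strings by their 32-bit hash, the way a hash-based shuffle buckets keys.
    static Map<Integer, List<String>> groupByHash(List<String> keys) {
        Map<Integer, List<String>> buckets = new HashMap<>();
        for (String k : keys) {
            buckets.computeIfAbsent(k.hashCode(), h -> new ArrayList<>()).add(k);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // "Aa" and "BB" both hash to 2112 in Java, so they land in the same bucket;
        // "Cc" hashes differently and stays alone.
        Map<Integer, List<String>> buckets = groupByHash(Arrays.asList("Aa", "BB", "Cc"));
        System.out.println(buckets.get("Aa".hashCode()));
    }
}
```

If one such merged bucket holds many values, the whole bucket must be buffered together when spilled contents are merged back in, which is exactly the OOM scenario above.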
[jira] [Commented] (SPARK-2546) Configuration object thread safety issue
[ https://issues.apache.org/jira/browse/SPARK-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148244#comment-14148244 ] Andrew Ash commented on SPARK-2546: --- Another proposed fix: extend JobConf as a shim and replace the Hadoop one with one that's threadsafe Configuration object thread safety issue Key: SPARK-2546 URL: https://issues.apache.org/jira/browse/SPARK-2546 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1 Reporter: Andrew Ash Assignee: Josh Rosen Priority: Critical // observed in 0.9.1 but expected to exist in 1.0.1 as well This ticket is copy-pasted from a thread on the dev@ list: {quote} We discovered a very interesting bug in Spark at work last week in Spark 0.9.1 — that the way Spark uses the Hadoop Configuration object is prone to thread safety issues. I believe it still applies in Spark 1.0.1 as well. Let me explain: Observations - Was running a relatively simple job (read from Avro files, do a map, do another map, write back to Avro files) - 412 of 413 tasks completed, but the last task was hung in RUNNING state - The 412 successful tasks completed in median time 3.4s - The last hung task didn't finish even in 20 hours - The executor with the hung task was responsible for 100% of one core of CPU usage - Jstack of the executor attached (relevant thread pasted below) Diagnosis After doing some code spelunking, we determined the issue was concurrent use of a Configuration object for each task on an executor. In Hadoop each task runs in its own JVM, but in Spark multiple tasks can run in the same JVM, so the single-threaded access assumptions of the Configuration object no longer hold in Spark. The specific issue is that the AvroRecordReader actually _modifies_ the JobConf it's given when it's instantiated! It adds a key for the RPC protocol engine in the process of connecting to the Hadoop FileSystem. 
When many tasks start at the same time (like at the start of a job), many tasks are adding this configuration item to the one Configuration object at once. Internally Configuration uses a java.lang.HashMap, which isn't threadsafe… The below post is an excellent explanation of what happens in the situation where multiple threads insert into a HashMap at the same time. http://mailinator.blogspot.com/2009/06/beautiful-race-condition.html The gist is that you have a thread following a cycle of linked list nodes indefinitely. This exactly matches our observations of the 100% CPU core and also the final location in the stack trace. So it seems the way Spark shares a Configuration object between task threads in an executor is incorrect. We need some way to prevent concurrent access to a single Configuration object. Proposed fix We can clone the JobConf object in HadoopRDD.getJobConf() so each task gets its own JobConf object (and thus Configuration object). The optimization of broadcasting the Configuration object across the cluster can remain, but on the other side I think it needs to be cloned for each task to allow for concurrent access. I'm not sure of the performance implications, but the comments suggest that the Configuration object is ~10KB so I would expect a clone of the object to be relatively speedy. Has this been observed before? Does my suggested fix make sense? I'd be happy to file a Jira ticket and continue discussion there for the right way to fix. Thanks! Andrew P.S. For others seeing this issue, our temporary workaround is to enable spark.speculation, which retries failed (or hung) tasks on other machines. 
{noformat} Executor task launch worker-6 daemon prio=10 tid=0x7f91f01fe000 nid=0x54b1 runnable [0x7f92d74f1000] java.lang.Thread.State: RUNNABLE at java.util.HashMap.transfer(HashMap.java:601) at java.util.HashMap.resize(HashMap.java:581) at java.util.HashMap.addEntry(HashMap.java:879) at java.util.HashMap.put(HashMap.java:505) at org.apache.hadoop.conf.Configuration.set(Configuration.java:803) at org.apache.hadoop.conf.Configuration.set(Configuration.java:783) at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:1662) at org.apache.hadoop.ipc.RPC.setProtocolEngine(RPC.java:193) at org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(NameNodeProxies.java:343) at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:168) at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:129) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:436) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:403) at
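The clone-per-task fix proposed in the thread can be sketched with a plain map standing in for Hadoop's Configuration (which is likewise HashMap-backed and not thread-safe). All names below are hypothetical stand-ins, not Spark's actual internals:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the clone-per-task fix from the thread: each task mutates its own
// copy of the configuration, so the shared HashMap is never written concurrently.
public class ClonedConfDemo {
    // One shared, broadcast-style configuration (stand-in for a Configuration/JobConf).
    static final Map<String, String> sharedConf =
        new HashMap<>(Map.of("fs.default.name", "hdfs://nn:8020"));

    // The fix: clone before mutating, the way HadoopRDD.getJobConf() was proposed to.
    static Map<String, String> confForTask(int taskId) {
        Map<String, String> taskConf = new HashMap<>(sharedConf);  // cheap defensive copy (~10KB per the thread)
        taskConf.put("task.id", String.valueOf(taskId));           // task-local mutation, like AvroRecordReader's RPC key
        return taskConf;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<Map<String, String>>> tasks = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            final int id = i;
            tasks.add(pool.submit(() -> confForTask(id)));
        }
        for (Future<Map<String, String>> f : tasks) f.get();
        pool.shutdown();
        // The shared map was never written by the task threads, so no corrupted buckets
        // and no thread stuck looping over a cyclic HashMap chain.
        System.out.println("shared entries: " + sharedConf.size());
    }
}
```

The race in the report only needs two threads resizing the same HashMap at once; cloning removes the shared mutable state entirely rather than adding locking around it.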
[jira] [Commented] (SPARK-1241) Support sliding in RDD
[ https://issues.apache.org/jira/browse/SPARK-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148281#comment-14148281 ] Frens Jan Rumph commented on SPARK-1241: Hi, I'm investigating use of Spark for matching patterns in symbolically represented time series, where the sliding functionality available on Scala iterators would make life a lot easier. From the ticket I'd say this functionality is implemented (status resolved, fix version 1.0.0, ...), but I can't find it in the docs and the PR indicates that this functionality hasn't made it into Spark (just yet ...). Is this functionality available? Cheers, Frens Background: I want to compare strings of segments to other (larger) strings of segments. As segment strings may be split up over partitions, the more straightforward approaches I could come up with don't work. Support sliding in RDD -- Key: SPARK-1241 URL: https://issues.apache.org/jira/browse/SPARK-1241 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.0.0 Sliding is useful for operations like creating n-grams, calculating total variation, numerical integration, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
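For readers wondering what semantics the ticket asks for: Scala's Iterator.sliding(n) yields every contiguous window of n elements, and the implementation appears to have landed in MLlib's RDDFunctions rather than Spark core, which would explain why it is missing from the core docs. A local, plain-Java sketch of the windowing itself (illustrative, not the Spark API):

```java
import java.util.*;

// Local illustration of "sliding" semantics: every contiguous window of n
// elements, useful e.g. for n-grams over a token stream.
public class SlidingDemo {
    static <T> List<List<T>> sliding(List<T> xs, int n) {
        List<List<T>> out = new ArrayList<>();
        // One window per start index, stopping when a full window no longer fits.
        for (int i = 0; i + n <= xs.size(); i++) {
            out.add(new ArrayList<>(xs.subList(i, i + n)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Bigrams over a token stream: [a, b], [b, c], [c, d]
        System.out.println(sliding(Arrays.asList("a", "b", "c", "d"), 2));
    }
}
```

On an RDD the hard part is exactly the commenter's concern: windows that straddle partition boundaries, which is why a distributed implementation must exchange the first n-1 elements of each partition with its predecessor.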
[jira] [Created] (SPARK-3692) RBF Kernel implementation to SVM
Ekrem Aksoy created SPARK-3692: -- Summary: RBF Kernel implementation to SVM Key: SPARK-3692 URL: https://issues.apache.org/jira/browse/SPARK-3692 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Ekrem Aksoy Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3692) RBF Kernel implementation to SVM
[ https://issues.apache.org/jira/browse/SPARK-3692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ekrem Aksoy updated SPARK-3692: --- Description: The Radial Basis Function is another type of kernel that can be used instead of a linear kernel in SVM. RBF Kernel implementation to SVM Key: SPARK-3692 URL: https://issues.apache.org/jira/browse/SPARK-3692 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Ekrem Aksoy Priority: Minor The Radial Basis Function is another type of kernel that can be used instead of a linear kernel in SVM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3693) Cached Hadoop RDD always return rows with the same value
Xuefu Zhang created SPARK-3693: -- Summary: Cached Hadoop RDD always return rows with the same value Key: SPARK-3693 URL: https://issues.apache.org/jira/browse/SPARK-3693 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang While trying RDD caching, it's found that caching a Hadoop RDD causes data correctness issues. The following code snippet demonstrates the usage: {code} public final class Test { private static final Pattern SPACE = Pattern.compile(" "); public static void main(String[] args) throws Exception { SparkConf sparkConf = new SparkConf().setAppName("Test"); JavaSparkContext ctx = new JavaSparkContext(sparkConf); ... JavaPairRDD<BytesWritable, BytesWritable> input = ctx.hadoopRDD(jobConf, CombineHiveInputClass.class, WritableComparable.class, Writable.class); input = input.cache(); input.foreach(new VoidFunction<Tuple2<BytesWritable, BytesWritable>>() { @Override public void call(Tuple2<BytesWritable, BytesWritable> row) throws Exception { if (row._1() != null) { System.out.println("Key: " + row._1()); } if (row._2() != null) { System.out.println("Value: " + row._2()); } } }); ctx.stop(); } } {code} In this case, row._2() always gives the same value. If we disable caching by removing input.cache(), the program gives the expected rows. Further analysis shows that MemoryStore (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L236) is storing references to the (key, value) pairs returned by HadoopRDD.getNext() (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L220), but this method always returns the same (key, value) object references; each getNext() call only updates the values inside those objects. When there are no more records, the (key, value) objects are filled with empty strings (no values) in CombineFileRecordReader. As all pairs in MemoryStore.vector refer to the same (key, value) objects, all values become NULL. MemoryStore should probably store a copy of the (key, value) pair rather than keeping a reference to it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
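The reuse mechanism in this report can be reproduced without Hadoop at all. The sketch below models a RecordReader that mutates one record object in place, and shows why caching references yields identical rows while copying first does not. All names are illustrative, not Spark's:

```java
import java.util.*;

// Minimal model of the bug: like Hadoop's RecordReader, this "reader" reuses one
// mutable record object and only overwrites its contents on each next().
public class ReusedRecordDemo {
    static class Record { String value; }

    static List<Record> readAndCache(String[] data, boolean copy) {
        Record reused = new Record();
        List<Record> cache = new ArrayList<>();
        for (String v : data) {
            reused.value = v;                  // getNext() mutates the record in place
            if (copy) {
                Record r = new Record();       // copy-before-cache workaround
                r.value = reused.value;
                cache.add(r);
            } else {
                cache.add(reused);             // MemoryStore-style: cache the reference
            }
        }
        return cache;
    }

    public static void main(String[] args) {
        String[] rows = {"r1", "r2", "r3"};
        // Without copying, every cached element is the same object, so every row
        // shows the final value -- the "same value for every row" symptom above.
        for (Record r : readAndCache(rows, false)) System.out.println("no copy: " + r.value);
        for (Record r : readAndCache(rows, true))  System.out.println("copied:  " + r.value);
    }
}
```

In Spark terms, the copy branch corresponds to mapping each Writable to a fresh object before calling cache(), which is the workaround discussed in the comments on this issue.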
[jira] [Updated] (SPARK-3693) Cached Hadoop RDD always return rows with the same value
[ https://issues.apache.org/jira/browse/SPARK-3693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated SPARK-3693: --- Description: While trying RDD caching, it's found that caching a Hadoop RDD causes data correctness issues. The following code snippet demonstrates the usage: {code} public final class Test { public static void main(String[] args) throws Exception { SparkConf sparkConf = new SparkConf().setAppName("Test"); JavaSparkContext ctx = new JavaSparkContext(sparkConf); ... JavaPairRDD<BytesWritable, BytesWritable> input = ctx.hadoopRDD(jobConf, CombineHiveInputClass.class, WritableComparable.class, Writable.class); input = input.cache(); input.foreach(new VoidFunction<Tuple2<BytesWritable, BytesWritable>>() { @Override public void call(Tuple2<BytesWritable, BytesWritable> row) throws Exception { if (row._1() != null) { System.out.println("Key: " + row._1()); } if (row._2() != null) { System.out.println("Value: " + row._2()); } } }); ctx.stop(); } } {code} In this case, row._2() always gives the same value. If we disable caching by removing input.cache(), the program gives the expected rows. Further analysis shows that MemoryStore (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L236) is storing references to the (key, value) pairs returned by HadoopRDD.getNext() (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L220), but this method always returns the same (key, value) object references; each getNext() call only updates the values inside those objects. When there are no more records, the (key, value) objects are filled with empty strings (no values) in CombineFileRecordReader. As all pairs in MemoryStore.vector refer to the same (key, value) objects, all values become NULL. MemoryStore should probably store a copy of the (key, value) pair rather than keeping a reference to it. 
was: (the same description, except that it also contained the unused line {{private static final Pattern SPACE = Pattern.compile(" ");}} at the top of Test)
Cached Hadoop RDD always return rows with the same value Key: SPARK-3693 URL: https://issues.apache.org/jira/browse/SPARK-3693 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang
[jira] [Commented] (SPARK-3693) Cached Hadoop RDD always return rows with the same value
[ https://issues.apache.org/jira/browse/SPARK-3693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148295#comment-14148295 ] Xuefu Zhang commented on SPARK-3693: cc [~rxin], [~sandyr] Cached Hadoop RDD always return rows with the same value Key: SPARK-3693 URL: https://issues.apache.org/jira/browse/SPARK-3693 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3693) Cached Hadoop RDD always return rows with the same value
[ https://issues.apache.org/jira/browse/SPARK-3693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148303#comment-14148303 ] Sandy Ryza commented on SPARK-3693: --- Spark's documentation actually makes a note of this. {code} * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each * record, directly caching the returned RDD will create many references to the same object. * If you plan to directly cache Hadoop writable objects, you should first copy them using * a `map` function. {code} Having MemoryStore make a copy would degrade performance in situations where a copy isn't needed. While it's not pretty, I don't see a better approach than the current behavior, where the user needs to explicitly make a copy. Though we could possibly provide some utility to help if it requires a lot of user boilerplate? Cached Hadoop RDD always return rows with the same value Key: SPARK-3693 URL: https://issues.apache.org/jira/browse/SPARK-3693 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3693) Cached Hadoop RDD always return rows with the same value
[ https://issues.apache.org/jira/browse/SPARK-3693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148306#comment-14148306 ] Reynold Xin commented on SPARK-3693: Just responded to you offline as well. This is a known issue. The problem is that not all record types implement copy. As a matter of fact, even some common ones like Avro records don't handle it well. For now the workaround is really for the user to call copy, since the user knows what record type they are using. Cached Hadoop RDD always return rows with the same value Key: SPARK-3693 URL: https://issues.apache.org/jira/browse/SPARK-3693 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3693) Cached Hadoop RDD always return rows with the same value
[ https://issues.apache.org/jira/browse/SPARK-3693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3693. Resolution: Duplicate Cached Hadoop RDD always return rows with the same value Key: SPARK-3693 URL: https://issues.apache.org/jira/browse/SPARK-3693 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3693) Cached Hadoop RDD always return rows with the same value
[ https://issues.apache.org/jira/browse/SPARK-3693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148318#comment-14148318 ] Xuefu Zhang commented on SPARK-3693: Thanks, guys. We are fine with the workaround. Cached Hadoop RDD always return rows with the same value Key: SPARK-3693 URL: https://issues.apache.org/jira/browse/SPARK-3693 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3550) Disable automatic rdd caching in python api for relevant learners
[ https://issues.apache.org/jira/browse/SPARK-3550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Staple resolved SPARK-3550. - Resolution: Fixed Disable automatic rdd caching in python api for relevant learners - Key: SPARK-3550 URL: https://issues.apache.org/jira/browse/SPARK-3550 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Aaron Staple The python mllib api automatically caches training rdds. However, the NaiveBayes, ALS, and DecisionTree learners do not require external caching to prevent repeated RDD re-evaluation during learning. NaiveBayes only evaluates its input RDD once, while ALS and DecisionTree internally persist transformations of their input RDDs. For these learners, we should disable the automatic caching in the python mllib api. See discussion here: https://github.com/apache/spark/pull/2362#issuecomment-55637953 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3550) Disable automatic rdd caching in python api for relevant learners
[ https://issues.apache.org/jira/browse/SPARK-3550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148347#comment-14148347 ] Aaron Staple commented on SPARK-3550: - This has been addressed in another commit: https://github.com/apache/spark/commit/fce5e251d636c788cda91345867e0294280c074d See comment here: https://github.com/apache/spark/pull/2412#issuecomment-56865408 Disable automatic rdd caching in python api for relevant learners - Key: SPARK-3550 URL: https://issues.apache.org/jira/browse/SPARK-3550 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Aaron Staple The python mllib api automatically caches training rdds. However, the NaiveBayes, ALS, and DecisionTree learners do not require external caching to prevent repeated RDD re-evaluation during learning. NaiveBayes only evaluates its input RDD once, while ALS and DecisionTree internally persist transformations of their input RDDs. For these learners, we should disable the automatic caching in the python mllib api. See discussion here: https://github.com/apache/spark/pull/2362#issuecomment-55637953 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3488) cache deserialized python RDDs before iterative learning
[ https://issues.apache.org/jira/browse/SPARK-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Staple resolved SPARK-3488. - Resolution: Won't Fix cache deserialized python RDDs before iterative learning Key: SPARK-3488 URL: https://issues.apache.org/jira/browse/SPARK-3488 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Aaron Staple When running an iterative learning algorithm, it makes sense that the input RDD be cached for improved performance. When learning is applied to a python RDD, currently the python RDD is always cached, then in scala that cached RDD is mapped to an uncached deserialized RDD, and the uncached RDD is passed to the learning algorithm. Instead the deserialized RDD should be cached. This was originally discussed here: https://github.com/apache/spark/pull/2347#issuecomment-55181535 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3690) Closing shuffle writers we swallow more important exception
[ https://issues.apache.org/jira/browse/SPARK-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148365#comment-14148365 ] Josh Rosen commented on SPARK-3690: --- For additional context, here's the mailing list thread: http://apache-spark-user-list.1001560.n3.nabble.com/java-io-FileNotFoundException-in-usercache-td15135.html Closing shuffle writers we swallow more important exception --- Key: SPARK-3690 URL: https://issues.apache.org/jira/browse/SPARK-3690 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Egor Pakhomov Priority: Minor Fix For: 1.2.0 ShuffleMapTask: line 75 {code:title=ShuffleMapTask|borderStyle=solid}
case e: Exception =>
  if (writer != null) {
    writer.stop(success = false)
  }
  throw e
{code} The exception in writer.stop() swallows the important one. I couldn't find the reason for the problem for days. Search the internet for java.io.FileNotFoundException: /local/hd2/yarn/local/usercache/epahomov/appcache/application_1411219858924_12991/spark-local-20140924225309-03f5/21/shuffle_4_12_147 (No such file or directory) and you'll find plenty of poor guys like me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3690) Closing shuffle writers we swallow more important exception
[ https://issues.apache.org/jira/browse/SPARK-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3690. --- Resolution: Fixed Issue resolved by pull request 2537 [https://github.com/apache/spark/pull/2537] Closing shuffle writers we swallow more important exception --- Key: SPARK-3690 URL: https://issues.apache.org/jira/browse/SPARK-3690 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Egor Pakhomov Priority: Minor Fix For: 1.2.0 ShuffleMapTask: line 75 {code:title=ShuffleMapTask|borderStyle=solid}
case e: Exception =>
  if (writer != null) {
    writer.stop(success = false)
  }
  throw e
{code} The exception in writer.stop() swallows the important one. I couldn't find the reason for the problem for days. Search the internet for java.io.FileNotFoundException: /local/hd2/yarn/local/usercache/epahomov/appcache/application_1411219858924_12991/spark-local-20140924225309-03f5/21/shuffle_4_12_147 (No such file or directory) and you'll find plenty of poor guys like me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
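The bug pattern behind SPARK-3690 is generic: when cleanup code inside an error handler itself throws, the new exception replaces the original one, hiding the root cause. A small Python sketch of the problem and one common fix (catching and merely recording cleanup failures); `writer_stop` is a made-up stand-in for `writer.stop(success = false)`, not a Spark API:

```python
def writer_stop():
    # Stand-in for writer.stop(success=False), which itself fails here.
    raise OSError("shuffle file not found")


def task_swallowing():
    """The buggy shape: cleanup in the handler masks the real failure."""
    try:
        raise RuntimeError("the real failure")
    except RuntimeError:
        writer_stop()  # raises OSError, replacing the original exception
        raise


def task_preserving():
    """The fix: don't let a failing cleanup mask the root cause."""
    try:
        raise RuntimeError("the real failure")
    except RuntimeError:
        try:
            writer_stop()
        except OSError as cleanup_err:
            # Record the secondary failure instead of propagating it.
            print("ignoring cleanup failure:", cleanup_err)
        raise
```

Calling `task_swallowing()` surfaces the OSError from cleanup, exactly like the FileNotFoundException hiding the real error for days; `task_preserving()` re-raises the original RuntimeError. (On the JVM, `Throwable.addSuppressed` serves the same purpose.)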
[jira] [Updated] (SPARK-3661) spark.driver.memory is ignored in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-3661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3661: - Priority: Critical (was: Major) spark.driver.memory is ignored in cluster mode -- Key: SPARK-3661 URL: https://issues.apache.org/jira/browse/SPARK-3661 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical This is related to https://issues.apache.org/jira/browse/SPARK-3653, but for the config. Note that `spark.executor.memory` is fine because we pass the Spark system properties to the driver after it has started. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3682) Add helpful warnings to the UI
[ https://issues.apache.org/jira/browse/SPARK-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-3682: -- Description: Spark has a zillion configuration options and a zillion different things that can go wrong with a job. Improvements like incremental and better metrics and the proposed spark replay debugger provide more insight into what's going on under the covers. However, it's difficult for non-advanced users to synthesize this information and understand where to direct their attention. It would be helpful to have some sort of central location on the UI users could go to that would provide indications about why an app/job is failing or performing poorly. Some helpful messages that we could provide: * Warn that the tasks in a particular stage are spending a long time in GC. * Warn that spark.shuffle.memoryFraction does not fit inside the young generation. * Warn that tasks in a particular stage are very short, and that the number of partitions should probably be decreased. * Warn that tasks in a particular stage are spilling a lot, and that the number of partitions should probably be increased. * Warn that a cached RDD that gets a lot of use does not fit in memory, and a lot of time is being spent recomputing it. To start, probably two kinds of warnings would be most helpful. * Warnings at the app level that report on misconfigurations, issues with the general health of executors. * Warnings at the job level that indicate why a job might be performing slowly. was: Spark has a zillion configuration options and a zillion different things that can go wrong with a job. Improvements like incremental and better metrics and the proposed spark replay debugger provide more insight into what's going on under the covers. However, it's difficult for non-advanced users to synthesize this information and understand where to direct their attention. 
It would be helpful to have some sort of central location on the UI users could go to that would provide indications about why an app/job is failing or performing poorly. Some helpful messages that we could provide: * Warn that the tasks in a particular stage are spending a long time in GC. * Warn that spark.shuffle.memoryFraction does not fit inside the young generation. * Warn that tasks in a particular stage are very short, and that the number of partitions should probably be decreased. * Warn that tasks in a particular stage are spilling a lot, and that the number of partitions should probably be decreased. * Warn that a cached RDD that gets a lot of use does not fit in memory, and a lot of time is being spent recomputing it. To start, probably two kinds of warnings would be most helpful. * Warnings at the app level that report on misconfigurations, issues with the general health of executors. * Warnings at the job level that indicate why a job might be performing slowly. Add helpful warnings to the UI -- Key: SPARK-3682 URL: https://issues.apache.org/jira/browse/SPARK-3682 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.1.0 Reporter: Sandy Ryza Spark has a zillion configuration options and a zillion different things that can go wrong with a job. Improvements like incremental and better metrics and the proposed spark replay debugger provide more insight into what's going on under the covers. However, it's difficult for non-advanced users to synthesize this information and understand where to direct their attention. It would be helpful to have some sort of central location on the UI users could go to that would provide indications about why an app/job is failing or performing poorly. Some helpful messages that we could provide: * Warn that the tasks in a particular stage are spending a long time in GC. * Warn that spark.shuffle.memoryFraction does not fit inside the young generation. 
* Warn that tasks in a particular stage are very short, and that the number of partitions should probably be decreased. * Warn that tasks in a particular stage are spilling a lot, and that the number of partitions should probably be increased. * Warn that a cached RDD that gets a lot of use does not fit in memory, and a lot of time is being spent recomputing it. To start, probably two kinds of warnings would be most helpful. * Warnings at the app level that report on misconfigurations, issues with the general health of executors. * Warnings at the job level that indicate why a job might be performing slowly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3682) Add helpful warnings to the UI
[ https://issues.apache.org/jira/browse/SPARK-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148379#comment-14148379 ] Sandy Ryza commented on SPARK-3682: --- Oops, that should have read increased. When a task fetches more shuffle data from the previous stage than it can fit in memory, it needs to spill the extra data to disk. Increasing the number of partitions makes it so that each task will be responsible for dealing with less data and will need to spill less. Add helpful warnings to the UI -- Key: SPARK-3682 URL: https://issues.apache.org/jira/browse/SPARK-3682 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.1.0 Reporter: Sandy Ryza Spark has a zillion configuration options and a zillion different things that can go wrong with a job. Improvements like incremental and better metrics and the proposed spark replay debugger provide more insight into what's going on under the covers. However, it's difficult for non-advanced users to synthesize this information and understand where to direct their attention. It would be helpful to have some sort of central location on the UI users could go to that would provide indications about why an app/job is failing or performing poorly. Some helpful messages that we could provide: * Warn that the tasks in a particular stage are spending a long time in GC. * Warn that spark.shuffle.memoryFraction does not fit inside the young generation. * Warn that tasks in a particular stage are very short, and that the number of partitions should probably be decreased. * Warn that tasks in a particular stage are spilling a lot, and that the number of partitions should probably be increased. * Warn that a cached RDD that gets a lot of use does not fit in memory, and a lot of time is being spent recomputing it. To start, probably two kinds of warnings would be most helpful. 
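The reasoning in that comment — more partitions means each task fetches less shuffle data, so it spills less — can be sketched with back-of-the-envelope arithmetic. All numbers and the `spilled_bytes` model below are illustrative, not Spark's actual accounting:

```python
def spilled_bytes(total_shuffle_bytes, num_partitions, task_memory_bytes):
    """Rough model: each task fetches total/partitions bytes and
    spills to disk whatever exceeds its memory budget."""
    per_task = total_shuffle_bytes / num_partitions
    return max(0, per_task - task_memory_bytes)


total = 100 * 1024**3   # 100 GiB of shuffle data from the previous stage
mem = 512 * 1024**2     # 512 MiB of memory per task
# With 100 partitions each task handles 1 GiB and spills half of it;
# with 400 partitions each task handles 256 MiB and spills nothing.
few = spilled_bytes(total, 100, mem)
many = spilled_bytes(total, 400, mem)
```

Under this toy model, quadrupling the partition count takes each task from spilling 512 MiB to spilling nothing — the intuition behind the "increase the number of partitions" warning.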
* Warnings at the app level that report on misconfigurations, issues with the general health of executors. * Warnings at the job level that indicate why a job might be performing slowly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2377) Create a Python API for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148385#comment-14148385 ] Davies Liu commented on SPARK-2377: --- [~giwa] I have also started to work on this (based on your branch) and will send out a WIP PR soon. Create a Python API for Spark Streaming --- Key: SPARK-2377 URL: https://issues.apache.org/jira/browse/SPARK-2377 Project: Spark Issue Type: New Feature Components: PySpark, Streaming Reporter: Nicholas Chammas Assignee: Kenichi Takagiwa [Spark Streaming|http://spark.apache.org/docs/latest/streaming-programming-guide.html] currently offers APIs in Scala and Java. It would be a great feature to add a Python API as well. This is probably a large task that will span many issues if undertaken. This ticket should provide some place to track overall progress towards an initial Python API for Spark Streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag
Patrick Wendell created SPARK-3694: -- Summary: Allow printing object graph of tasks/RDD's with a debug flag Key: SPARK-3694 URL: https://issues.apache.org/jira/browse/SPARK-3694 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Assignee: Patrick Wendell This would be useful for debugging extra references inside of RDDs. Here is an example for inspiration: http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag
[ https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3694: --- Description: This would be useful for debugging extra references inside of RDDs. Here is an example for inspiration: http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html We'd want to print this trace for both the RDD serialization inside of the DAGScheduler and the task serialization in the TaskSetManager. was: This would be useful for debugging extra references inside of RDDs. Here is an example for inspiration: http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html Allow printing object graph of tasks/RDD's with a debug flag Key: SPARK-3694 URL: https://issues.apache.org/jira/browse/SPARK-3694 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Assignee: Patrick Wendell This would be useful for debugging extra references inside of RDDs. Here is an example for inspiration: http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html We'd want to print this trace for both the RDD serialization inside of the DAGScheduler and the task serialization in the TaskSetManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
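For illustration, a walker in the spirit of the linked ObjectGraphWalker can be sketched in a few lines of Python: a breadth-first search over object references, collecting everything reachable from a closure- or task-like root. The `Closure` class and `reachable_objects` helper are hypothetical stand-ins, not the proposed Spark implementation:

```python
from collections import deque


def reachable_objects(root):
    """BFS over an object graph via __dict__ and common containers,
    returning every object reachable from `root`."""
    seen, queue = {id(root)}, deque([root])
    found = []
    while queue:
        obj = queue.popleft()
        found.append(obj)
        children = []
        if hasattr(obj, "__dict__"):
            children.extend(obj.__dict__.values())
        elif isinstance(obj, (list, tuple, set)):
            children.extend(obj)
        elif isinstance(obj, dict):
            children.extend(obj.values())
        for child in children:
            if id(child) not in seen:   # dedupe by identity to handle cycles
                seen.add(id(child))
                queue.append(child)
    return found


class Closure:
    def __init__(self):
        self.big_ref = [0, 1, 2, 3, 4]  # an "extra reference" we want to spot


# Walking a closure-like object surfaces the accidentally captured list.
objs = reachable_objects(Closure())
```

Printing such a trace at serialization time would show exactly which extra references a task or RDD closure is dragging along.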
[jira] [Comment Edited] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148411#comment-14148411 ] Andrew Ash edited comment on SPARK-3032 at 9/25/14 10:35 PM: - This bug prevents people from doing further testing of sort-based shuffle on the rest of the 1.1.x series. Is this a good candidate for a backport to 1.1.1 or a later 1.1 hotfix? was (Author: aash): This bug prevents people from doing testing of sort-based shuffle on the rest of the 1.1.x series. Is this a good candidate for a backport to 1.1? Potential bug when running sort-based shuffle with sorting using TimSort Key: SPARK-3032 URL: https://issues.apache.org/jira/browse/SPARK-3032 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.1.0 Reporter: Saisai Shao Assignee: Saisai Shao Priority: Blocker When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, with (String, String) as the key and value types, we always hit this issue: {noformat} java.lang.IllegalArgumentException: Comparison method violates its general contract! 
at org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) at org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) at org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) at org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) at org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) at org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) {noformat} It seems the current partitionKeyComparator, which uses the String hashcode as the key comparator, breaks some sorting contracts. I also tested using Int as the key type, and that passes the test, since the hashcode of an Int is itself. So potentially partitionDiff + the String hashcode may break the sorting contracts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148411#comment-14148411 ] Andrew Ash commented on SPARK-3032: --- This bug prevents people from doing testing of sort-based shuffle on the rest of the 1.1.x series. Is this a good candidate for a backport to 1.1? Potential bug when running sort-based shuffle with sorting using TimSort Key: SPARK-3032 URL: https://issues.apache.org/jira/browse/SPARK-3032 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.1.0 Reporter: Saisai Shao Assignee: Saisai Shao Priority: Blocker When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, with (String, String) as the key and value types, we always hit this issue: {noformat} java.lang.IllegalArgumentException: Comparison method violates its general contract! at org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) at org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) at org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) at org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) at org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) at org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at 
org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) {noformat} It seems the current partitionKeyComparator, which uses the String hashcode as the key comparator, breaks some sorting contracts. I also tested using Int as the key type, and that passes the test, since the hashcode of an Int is itself. So potentially partitionDiff + the String hashcode may break the sorting contracts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
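The "Comparison method violates its general contract" error is TimSort detecting that a comparator is not a valid total order. One classic way a hashcode-based comparator breaks transitivity on the JVM is 32-bit overflow in the `a.hashCode() - b.hashCode()` idiom; the following Python snippet simulates Java's wraparound to show the effect (an illustration of the failure class, not Spark's actual comparator):

```python
def to_int32(x):
    """Wrap an integer to Java's 32-bit signed int range."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def subtract_compare(a, b):
    # The buggy pattern: compare hash codes by subtraction, which can
    # overflow a 32-bit int and flip the sign of the result.
    return to_int32(a - b)


a, b, c = -2_000_000_000, 0, 2_000_000_000
# subtract_compare says a < b and b < c, but a - c = -4e9 wraps around to a
# positive 32-bit value, so the comparator claims a > c. Transitivity is
# broken, and Java's TimSort throws IllegalArgumentException when it
# notices such an inconsistency during a merge.
```

With these hash values the three pairwise comparisons cannot all be satisfied by any total order, which is exactly the condition the exception reports.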
[jira] [Resolved] (SPARK-1484) MLlib should warn if you are using an iterative algorithm on non-cached data
[ https://issues.apache.org/jira/browse/SPARK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1484. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2347 [https://github.com/apache/spark/pull/2347] MLlib should warn if you are using an iterative algorithm on non-cached data Key: SPARK-1484 URL: https://issues.apache.org/jira/browse/SPARK-1484 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Matei Zaharia Fix For: 1.2.0 Not sure what the best way to warn is, but even printing to the log is probably fine. We may want to print at the end of the training run as well as the beginning to make it more visible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1484) MLlib should warn if you are using an iterative algorithm on non-cached data
[ https://issues.apache.org/jira/browse/SPARK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1484: - Assignee: Aaron Staple MLlib should warn if you are using an iterative algorithm on non-cached data Key: SPARK-1484 URL: https://issues.apache.org/jira/browse/SPARK-1484 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Matei Zaharia Assignee: Aaron Staple Fix For: 1.2.0 Not sure what the best way to warn is, but even printing to the log is probably fine. We may want to print at the end of the training run as well as the beginning to make it more visible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3584) sbin/slaves doesn't work when we use password authentication for SSH
[ https://issues.apache.org/jira/browse/SPARK-3584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3584. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Kousuke Saruta Target Version/s: 1.2.0 (was: 1.1.1, 1.2.0) Resolved by: https://github.com/apache/spark/pull/2444 sbin/slaves doesn't work when we use password authentication for SSH Key: SPARK-3584 URL: https://issues.apache.org/jira/browse/SPARK-3584 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Fix For: 1.2.0 In sbin/slaves, the ssh command runs in the background, but with password authentication a backgrounded ssh command doesn't work, so sbin/slaves doesn't work. I also suggest an improvement for sbin/slaves. In the current implementation, the slaves file is tracked by Git, but it can be edited by users, so we should provide slaves.template instead of slaves. The default slaves file has one entry, localhost, so we should use localhost as the default host list. I modified sbin/slaves to choose localhost as the default host list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2377) Create a Python API for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148512#comment-14148512 ] Apache Spark commented on SPARK-2377: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2538 Create a Python API for Spark Streaming --- Key: SPARK-2377 URL: https://issues.apache.org/jira/browse/SPARK-2377 Project: Spark Issue Type: New Feature Components: PySpark, Streaming Reporter: Nicholas Chammas Assignee: Kenichi Takagiwa [Spark Streaming|http://spark.apache.org/docs/latest/streaming-programming-guide.html] currently offers APIs in Scala and Java. It would be a great feature to add a Python API as well. This is probably a large task that will span many issues if undertaken. This ticket should provide some place to track overall progress towards an initial Python API for Spark Streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148529#comment-14148529 ] Xiangrui Meng commented on SPARK-1405: -- [~Guoqiang Li] and [~pedrorodriguez], since there are already 4~5 implementations of LDA on Spark and [~dlwh] is also interested in one with partial labels, we do need to coordinate to avoid duplicate effort. I think the TODOs are: 0. Make progress updates frequently. 1. Test Joey's implementation and Guoqiang's (both on GraphX) on some common datasets. We also need to verify the correctness of the output, by comparing the result with some single machine solvers. 2. Discuss the public APIs in MLlib. Because GraphX is an alpha component, we should not expose GraphX APIs in MLlib. See my previous comments on the input and model types. 3. Have a standard implementation of LDA with Gibbs Sampling in MLlib. The target is v1.2, which means it should be merged by the end of Nov. Improvements can be made in future releases. Could you share your timeline? Thanks! parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Xusen Yin Labels: features Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Different from current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1405: - Assignee: Guoqiang Li (was: Xusen Yin) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Labels: features Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Different from current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1405: - Shepherd: Xiangrui Meng parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Labels: features Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Different from current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1241) Support sliding in RDD
[ https://issues.apache.org/jira/browse/SPARK-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148549#comment-14148549 ] Xiangrui Meng commented on SPARK-1241: -- This is implemented in MLlib: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala . You can check the discussion about where to put it here: https://github.com/apache/spark/pull/136 . Support sliding in RDD -- Key: SPARK-1241 URL: https://issues.apache.org/jira/browse/SPARK-1241 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.0.0 Sliding is useful for operations like creating n-grams, calculating total variation, numerical integration, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
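The sliding operation itself can be sketched independently of Spark: a window of fixed size slid one element at a time over a sequence, e.g. for building n-grams. A plain-Python illustration of the semantics (not the SlidingRDD API, which additionally handles windows that straddle partition boundaries):

```python
def sliding(seq, window):
    """All contiguous windows of the given size, in order."""
    return [tuple(seq[i:i + window]) for i in range(len(seq) - window + 1)]


# Trigrams over a token stream:
tokens = ["to", "be", "or", "not", "to", "be"]
trigrams = sliding(tokens, 3)
# The same primitive supports total variation (windows of size 2 and
# pairwise differences) and simple numerical integration schemes.
```

The distributed version has to do extra work only at partition edges, where a window needs the first `window - 1` elements of the next partition.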
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148555#comment-14148555 ] Pedro Rodriguez commented on SPARK-1405: [~mengxr], definitely a good idea to be coordinated about it. I have been working with Evan so have been giving status updates and making todos with him. I will post here on progress updates as well. I have been working on creating a design doc/reference which you can find here: https://docs.google.com/document/d/13MfroPXEEGKgaQaZlHkg1wdJMtCN5d8aHJuVkiOrOK4/edit?usp=sharing It is, for a large part, a way for us/me to keep notes while working, but I would like to take some of it and convert it to documentation. It primarily contains 1. Relevant links to papers/code/repositories 2. Thorough explanation/documentation of LDA and motivation behind the graph implementation (Joey's version) 3. Testing steps (which data sets on what) 4. Current todos (perhaps we should post them here primarily and update doc for consistency). 1. Currently I am working on testing (unit test functions and correctness testing), refactoring, and extending Joey's implementation. The objective for this week is to have the mini-test running (a set of ~10 documents which acts as a sanity check). Goal for early next week is to be running on NIPS. I think the majority of time to get there will be putting the dataset in a parseable format (remove equations, stop words...) and ensuring that the result looks correct. To that end, we plan on running the same datasets through Graphlab for benchmarking machine/ML performance and a python implementation for ML performance/correctness. Once we are there, the plan is to start looking at running on wikipedia. 2. 
The code I am currently working on lives here: https://github.com/EntilZha/spark https://github.com/EntilZha/spark/blob/LDA/graphx/src/main/scala/org/apache/spark/graphx/lib/TopicModeling.scala which is within GraphX, with the other graph based algorithms. 3. Prior to knowing about Joey's graph implementation, I wrote my own for a final project. I stopped working on it since the graph implementation should be more performant. Probably a good point of discussion if there should be a standard and graph implementation together. When you reference standard implementation, is there a particular implementation you are referring to that I can look at? TLDR timeline: End of this week: mini-dataset for sanity check + refactoring code + unit testing Next week: Format NIPS for input + run NIPS data set on Spark, GraphLab, and Python LDA. I will be away from Berkeley at a conference, but hope to still get those done. From there, we would like to get running on larger datasets for performance testing. parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Labels: features Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike the current machine learning algorithms in MLlib, which use optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), and a Gibbs sampling core.
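The "Gibbs sampling core" mentioned above ultimately reduces, per word token, to drawing a topic index from an unnormalized categorical distribution. As a hedged sketch (in Java, not taken from any of the implementations discussed here), this is the inner sampling step; the weight formula in the comment is the standard collapsed-Gibbs conditional, stated for context only:

```java
import java.util.Random;

public class GibbsStep {
    // Draw index i with probability proportional to weights[i]. In collapsed
    // Gibbs for LDA, weights[k] is roughly (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta),
    // where the n_* are count statistics excluding the current token.
    static int sampleIndex(double[] weights, Random rng) {
        double total = 0;
        for (double w : weights) total += w;
        double u = rng.nextDouble() * total;
        double acc = 0;
        for (int i = 0; i < weights.length; i++) {
            acc += weights[i];
            if (u < acc) return i;
        }
        return weights.length - 1; // guard against floating-point rounding
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double[] w = {0.1, 0.7, 0.2};
        int[] counts = new int[3];
        for (int i = 0; i < 10000; i++) counts[sampleIndex(w, rng)]++;
        // Index 1 should dominate, matching its weight.
        System.out.println(counts[1] > counts[0] && counts[1] > counts[2]);
    }
}
```

A sampler pass resamples every token's topic this way and updates the count tables; the graph-based version distributes those count tables over GraphX vertices.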
[jira] [Resolved] (SPARK-2634) MapOutputTrackerWorker.mapStatuses should be thread-safe
[ https://issues.apache.org/jira/browse/SPARK-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2634. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 1541 [https://github.com/apache/spark/pull/1541] MapOutputTrackerWorker.mapStatuses should be thread-safe Key: SPARK-2634 URL: https://issues.apache.org/jira/browse/SPARK-2634 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Shixiong Zhu Labels: easyfix Fix For: 1.2.0 MapOutputTrackerWorker.mapStatuses will be used concurrently, so it should be a thread-safe Map.
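The fix class here is simple: a map read and written by multiple threads must be a concurrent map. As an illustrative sketch (types are simplified stand-ins, not Spark's actual `mapStatuses` signature), swapping `HashMap` for `ConcurrentHashMap` is the usual minimal change:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A plain HashMap mutated from fetch threads can corrupt its internal
// structure; ConcurrentHashMap makes put/get safe under concurrency.
public class ThreadSafeStatuses {
    private final Map<Integer, String> mapStatuses = new ConcurrentHashMap<>();

    public void update(int shuffleId, String status) { mapStatuses.put(shuffleId, status); }

    public String get(int shuffleId) { return mapStatuses.get(shuffleId); }

    public static void main(String[] args) throws InterruptedException {
        ThreadSafeStatuses t = new ThreadSafeStatuses();
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) t.update(id * 1000 + j, "ok");
            });
            threads[i].start();
        }
        for (Thread th : threads) th.join();
        System.out.println(t.get(0)); // prints "ok"; no lost updates or corruption
    }
}
```

Note that `ConcurrentHashMap` only makes individual operations atomic; compound check-then-act sequences still need `putIfAbsent`/`compute` or external synchronization.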
[jira] [Commented] (SPARK-2546) Configuration object thread safety issue
[ https://issues.apache.org/jira/browse/SPARK-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148645#comment-14148645 ] Josh Rosen commented on SPARK-2546: --- JobConf has a _ton_ of methods and it's not clear whether we can get away with synchronizing only some of them. I'm going to look into using Scala macro annotations (http://docs.scala-lang.org/overviews/macros/annotations.html) to create a {{@synchronizeAll}} macro for adding synchronization to all methods of a class. Configuration object thread safety issue Key: SPARK-2546 URL: https://issues.apache.org/jira/browse/SPARK-2546 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1 Reporter: Andrew Ash Assignee: Josh Rosen Priority: Critical // observed in 0.9.1 but expected to exist in 1.0.1 as well This ticket is copy-pasted from a thread on the dev@ list: {quote} We discovered a very interesting bug in Spark at work last week in Spark 0.9.1 — that the way Spark uses the Hadoop Configuration object is prone to thread safety issues. I believe it still applies in Spark 1.0.1 as well. Let me explain: Observations - Was running a relatively simple job (read from Avro files, do a map, do another map, write back to Avro files) - 412 of 413 tasks completed, but the last task was hung in RUNNING state - The 412 successful tasks completed in median time 3.4s - The last hung task didn't finish even in 20 hours - The executor with the hung task was responsible for 100% of one core of CPU usage - Jstack of the executor attached (relevant thread pasted below) Diagnosis After doing some code spelunking, we determined the issue was concurrent use of a Configuration object for each task on an executor. In Hadoop each task runs in its own JVM, but in Spark multiple tasks can run in the same JVM, so the single-threaded access assumptions of the Configuration object no longer hold in Spark. 
The specific issue is that the AvroRecordReader actually _modifies_ the JobConf it's given when it's instantiated! It adds a key for the RPC protocol engine in the process of connecting to the Hadoop FileSystem. When many tasks start at the same time (like at the start of a job), many tasks are adding this configuration item to the one Configuration object at once. Internally Configuration uses a java.util.HashMap, which isn't threadsafe… The below post is an excellent explanation of what happens in the situation where multiple threads insert into a HashMap at the same time. http://mailinator.blogspot.com/2009/06/beautiful-race-condition.html The gist is that you have a thread following a cycle of linked list nodes indefinitely. This exactly matches our observations of the 100% CPU core and also the final location in the stack trace. So it seems the way Spark shares a Configuration object between task threads in an executor is incorrect. We need some way to prevent concurrent access to a single Configuration object. Proposed fix We can clone the JobConf object in HadoopRDD.getJobConf() so each task gets its own JobConf object (and thus Configuration object). The optimization of broadcasting the Configuration object across the cluster can remain, but on the other side I think it needs to be cloned for each task to allow for concurrent access. I'm not sure of the performance implications, but the comments suggest that the Configuration object is ~10KB so I would expect a clone on the object to be relatively speedy. Has this been observed before? Does my suggested fix make sense? I'd be happy to file a Jira ticket and continue discussion there for the right way to fix. Thanks! Andrew P.S. For others seeing this issue, our temporary workaround is to enable spark.speculation, which retries failed (or hung) tasks on other machines. 
{noformat} Executor task launch worker-6 daemon prio=10 tid=0x7f91f01fe000 nid=0x54b1 runnable [0x7f92d74f1000] java.lang.Thread.State: RUNNABLE at java.util.HashMap.transfer(HashMap.java:601) at java.util.HashMap.resize(HashMap.java:581) at java.util.HashMap.addEntry(HashMap.java:879) at java.util.HashMap.put(HashMap.java:505) at org.apache.hadoop.conf.Configuration.set(Configuration.java:803) at org.apache.hadoop.conf.Configuration.set(Configuration.java:783) at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:1662) at org.apache.hadoop.ipc.RPC.setProtocolEngine(RPC.java:193) at org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(NameNodeProxies.java:343) at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:168) at
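The proposed fix above (clone the JobConf per task) can be sketched in miniature. This is a hedged illustration with a plain `Map` standing in for Hadoop's `Configuration`; the names (`getJobConf`, the config keys) mirror the discussion but are illustrative, not Spark's actual code:

```java
import java.util.HashMap;
import java.util.Map;

// Keep the broadcast optimization, but hand each task a defensive copy,
// so a record reader that mutates its config (as AvroRecordReader does)
// cannot race with other tasks sharing the executor JVM.
public class PerTaskConf {
    private final Map<String, String> broadcastConf = new HashMap<>();

    PerTaskConf() {
        broadcastConf.put("fs.defaultFS", "hdfs://namenode:8020"); // illustrative entry
    }

    // Called once per task; a Configuration is ~10KB, so copying is cheap.
    Map<String, String> getJobConf() {
        return new HashMap<>(broadcastConf);
    }

    public static void main(String[] args) {
        PerTaskConf shared = new PerTaskConf();
        Map<String, String> task1 = shared.getJobConf();
        Map<String, String> task2 = shared.getJobConf();
        task1.put("rpc.engine", "ProtobufRpcEngine"); // task-local mutation
        System.out.println(task2.containsKey("rpc.engine")); // prints false: no cross-task leak
    }
}
```

The trade-off is one copy per task versus eliminating the shared-mutable-state hazard entirely; since tasks mutate the config only rarely, copy-on-read is the simpler invariant to maintain than fine-grained locking.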
[jira] [Created] (SPARK-3695) Enable to show host and port in block fetch failure
Adrian Wang created SPARK-3695: -- Summary: Enable to show host and port in block fetch failure Key: SPARK-3695 URL: https://issues.apache.org/jira/browse/SPARK-3695 Project: Spark Issue Type: Bug Components: Input/Output Reporter: Adrian Wang Priority: Minor
[jira] [Commented] (SPARK-3695) Enable to show host and port in block fetch failure
[ https://issues.apache.org/jira/browse/SPARK-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148656#comment-14148656 ] Apache Spark commented on SPARK-3695: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/2539 Enable to show host and port in block fetch failure --- Key: SPARK-3695 URL: https://issues.apache.org/jira/browse/SPARK-3695 Project: Spark Issue Type: Bug Components: Input/Output Reporter: Adrian Wang Priority: Minor
[jira] [Commented] (SPARK-2546) Configuration object thread safety issue
[ https://issues.apache.org/jira/browse/SPARK-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148657#comment-14148657 ] Josh Rosen commented on SPARK-2546: --- A synchronization wrapper (whether written by hand or generated using macros) might introduce an unwanted runtime dependency on the exact compile-time version of Hadoop that we used. For example, say we compile against Hadoop 1.x and run on Hadoop 1.y (where y > x) and the runtime version of JobConf contains methods that were not present in the version that we wrapped at compile-time. What happens in this case? Before we explore this option, I should probably re-visit SPARK-2585 to see if I can understand why the patch seemed to introduce a performance regression, since that approach is Hadoop version agnostic. Configuration object thread safety issue Key: SPARK-2546 URL: https://issues.apache.org/jira/browse/SPARK-2546 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1 Reporter: Andrew Ash Assignee: Josh Rosen Priority: Critical // observed in 0.9.1 but expected to exist in 1.0.1 as well This ticket is copy-pasted from a thread on the dev@ list: {quote} We discovered a very interesting bug in Spark at work last week in Spark 0.9.1 — that the way Spark uses the Hadoop Configuration object is prone to thread safety issues. I believe it still applies in Spark 1.0.1 as well. 
Let me explain: Observations - Was running a relatively simple job (read from Avro files, do a map, do another map, write back to Avro files) - 412 of 413 tasks completed, but the last task was hung in RUNNING state - The 412 successful tasks completed in median time 3.4s - The last hung task didn't finish even in 20 hours - The executor with the hung task was responsible for 100% of one core of CPU usage - Jstack of the executor attached (relevant thread pasted below) Diagnosis After doing some code spelunking, we determined the issue was concurrent use of a Configuration object for each task on an executor. In Hadoop each task runs in its own JVM, but in Spark multiple tasks can run in the same JVM, so the single-threaded access assumptions of the Configuration object no longer hold in Spark. The specific issue is that the AvroRecordReader actually _modifies_ the JobConf it's given when it's instantiated! It adds a key for the RPC protocol engine in the process of connecting to the Hadoop FileSystem. When many tasks start at the same time (like at the start of a job), many tasks are adding this configuration item to the one Configuration object at once. Internally Configuration uses a java.util.HashMap, which isn't threadsafe… The below post is an excellent explanation of what happens in the situation where multiple threads insert into a HashMap at the same time. http://mailinator.blogspot.com/2009/06/beautiful-race-condition.html The gist is that you have a thread following a cycle of linked list nodes indefinitely. This exactly matches our observations of the 100% CPU core and also the final location in the stack trace. So it seems the way Spark shares a Configuration object between task threads in an executor is incorrect. We need some way to prevent concurrent access to a single Configuration object. Proposed fix We can clone the JobConf object in HadoopRDD.getJobConf() so each task gets its own JobConf object (and thus Configuration object). 
The optimization of broadcasting the Configuration object across the cluster can remain, but on the other side I think it needs to be cloned for each task to allow for concurrent access. I'm not sure of the performance implications, but the comments suggest that the Configuration object is ~10KB so I would expect a clone on the object to be relatively speedy. Has this been observed before? Does my suggested fix make sense? I'd be happy to file a Jira ticket and continue discussion there for the right way to fix. Thanks! Andrew P.S. For others seeing this issue, our temporary workaround is to enable spark.speculation, which retries failed (or hung) tasks on other machines. {noformat} Executor task launch worker-6 daemon prio=10 tid=0x7f91f01fe000 nid=0x54b1 runnable [0x7f92d74f1000] java.lang.Thread.State: RUNNABLE at java.util.HashMap.transfer(HashMap.java:601) at java.util.HashMap.resize(HashMap.java:581) at java.util.HashMap.addEntry(HashMap.java:879) at java.util.HashMap.put(HashMap.java:505) at org.apache.hadoop.conf.Configuration.set(Configuration.java:803) at org.apache.hadoop.conf.Configuration.set(Configuration.java:783) at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:1662) at
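The "{{@synchronizeAll}}" idea from the earlier comment (synchronize every method without enumerating them at compile time) can be illustrated on the JVM with a dynamic proxy. This is a hedged sketch, not the macro approach actually discussed: `java.lang.reflect.Proxy` only works for interfaces, and JobConf is a concrete class, which is exactly why a macro or a version-agnostic fix like per-task cloning was on the table. The `Conf`/`SimpleConf` names here are invented for the example:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.HashMap;
import java.util.Map;

public class SynchronizeAll {
    // A toy stand-in for a configuration type; any interface works.
    interface Conf {
        void set(String key, String value);
        String get(String key);
    }

    static class SimpleConf implements Conf {
        private final Map<String, String> entries = new HashMap<>();
        public void set(String key, String value) { entries.put(key, value); }
        public String get(String key) { return entries.get(key); }
    }

    // Route every method of the interface through one lock, without listing
    // the methods; new methods added in a later version are covered too,
    // because dispatch is resolved reflectively at runtime.
    @SuppressWarnings("unchecked")
    static <T> T synchronizeAll(Class<T> iface, T target) {
        final Object lock = new Object();
        InvocationHandler handler = (proxy, method, args) -> {
            synchronized (lock) {
                return method.invoke(target, args);
            }
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[]{iface}, handler);
    }

    public static void main(String[] args) {
        Conf conf = synchronizeAll(Conf.class, new SimpleConf());
        conf.set("rpc.engine", "x");
        System.out.println(conf.get("rpc.engine")); // prints x
    }
}
```

A compile-time wrapper (hand-written or macro-generated) would instead pin the wrapped method set to the compile-time Hadoop version, which is the version-skew hazard the comment raises.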
[jira] [Commented] (SPARK-2532) Fix issues with consolidated shuffle
[ https://issues.apache.org/jira/browse/SPARK-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148666#comment-14148666 ] Andrew Ash commented on SPARK-2532: --- [~pwendell] should we close this ticket and track the individual items separately? It sounds like we should expect consolidated shuffle to work in 1.1 and any issues should have separate tickets filed for them. I know [~mridulm80] has several fixes on his branch that should be cherry picked over at some point as well though. Fix issues with consolidated shuffle Key: SPARK-2532 URL: https://issues.apache.org/jira/browse/SPARK-2532 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.1.0 Environment: All Reporter: Mridul Muralidharan Assignee: Mridul Muralidharan Priority: Critical Will file PR with changes as soon as merge is done (earlier merge became outdated in 2 weeks unfortunately :) ). Consolidated shuffle is broken in multiple ways in spark : a) Task failure(s) can cause the state to become inconsistent. b) Multiple revert's or combination of close/revert/close can cause the state to be inconsistent. (As part of exception/error handling). c) Some of the api in block writer causes implementation issues - for example: a revert is always followed by close : but the implementation tries to keep them separate, resulting in surface for errors. d) Fetching data from consolidated shuffle files can go badly wrong if the file is being actively written to : it computes length by subtracting next offset from current offset (or length if this is last offset)- the latter fails when fetch is happening in parallel to write. Note, this happens even if there are no task failures of any kind ! This usually results in stream corruption or decompression errors.
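Item (d) above is worth making concrete. In consolidated shuffle, each map task's output is a segment of one shared file, and a segment's length is derived from the offset index; the last segment's length is derived from the current file length. As a hedged sketch (simplified data structures, not Spark's actual shuffle code), here is why a concurrent append corrupts the read:

```java
import java.util.ArrayList;
import java.util.List;

// One offset per map task in a shared consolidated file. Segment i's length
// is offsets[i+1] - offsets[i], or fileLength - offsets[i] for the last
// segment. Reading the *current* file length while a writer is still
// appending makes the last segment appear longer than it really is.
public class SegmentIndex {
    final List<Long> offsets = new ArrayList<>();
    long fileLength;

    long segmentLength(int i) {
        long end = (i + 1 < offsets.size()) ? offsets.get(i + 1) : fileLength;
        return end - offsets.get(i);
    }

    public static void main(String[] args) {
        SegmentIndex idx = new SegmentIndex();
        idx.offsets.add(0L);
        idx.offsets.add(100L);
        idx.fileLength = 250L;
        System.out.println(idx.segmentLength(1)); // prints 150: correct once writes finish

        idx.fileLength = 300L; // a concurrent append is in progress...
        System.out.println(idx.segmentLength(1)); // prints 200: the reader would over-read,
        // pulling half-written bytes -> stream corruption / decompression errors
    }
}
```

This matches the ticket's note that the failure needs no task failures at all: a fetch merely overlapping a write is enough.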
[jira] [Updated] (SPARK-3688) LogicalPlan can't resolve column correctly
[ https://issues.apache.org/jira/browse/SPARK-3688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Tian updated SPARK-3688: --- Description: How to reproduce this problem: create a table: {code} create table test (a string, b string); {code} execute sql: {code} select a.b ,count(1) from test a join test t group by a.b; {code} was: How to reproduce this problem: create a table: {quote} create table test (a string, b string); {quote} execute sql: {quote} select a.b ,count(1) from test a join test t group by a.b; {quote} LogicalPlan can't resolve column correctly --- Key: SPARK-3688 URL: https://issues.apache.org/jira/browse/SPARK-3688 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Yi Tian How to reproduce this problem: create a table: {code} create table test (a string, b string); {code} execute sql: {code} select a.b ,count(1) from test a join test t group by a.b; {code}
[jira] [Resolved] (SPARK-3686) flume.SparkSinkSuite.Success is flaky
[ https://issues.apache.org/jira/browse/SPARK-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3686. Resolution: Fixed Resolved by: https://github.com/apache/spark/pull/2531 flume.SparkSinkSuite.Success is flaky - Key: SPARK-3686 URL: https://issues.apache.org/jira/browse/SPARK-3686 Project: Spark Issue Type: Bug Components: Streaming Reporter: Patrick Wendell Assignee: Hari Shreedharan Priority: Blocker {code} Error Message 4000 did not equal 5000 Stacktrace sbt.ForkMain$ForkError: 4000 did not equal 5000 at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:498) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1559) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:416) at org.apache.spark.streaming.flume.sink.SparkSinkSuite.org$apache$spark$streaming$flume$sink$SparkSinkSuite$$assertChannelIsEmpty(SparkSinkSuite.scala:195) at org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply$mcV$sp(SparkSinkSuite.scala:54) at org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40) at org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158) at org.scalatest.Suite$class.withFixture(Suite.scala:1121) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167) at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167) at org.scalatest.FunSuite.runTest(FunSuite.scala:1559) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:200) at org.scalatest.FunSuite.runTests(FunSuite.scala:1559) at org.scalatest.Suite$class.run(Suite.scala:1423) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1559) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:204) at org.scalatest.FunSuite.run(FunSuite.scala:1559) at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:444) at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:651) at sbt.ForkMain$Run$2.call(ForkMain.java:294) at sbt.ForkMain$Run$2.call(ForkMain.java:284) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} Example test result 
(this will stop working in a few days): https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/719/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.streaming.flume.sink/SparkSinkSuite/Success_with_ack/