[jira] [Updated] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency
[ https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-30602: - Description: In a large deployment of Spark compute infrastructure, Spark shuffle is becoming a potential scaling bottleneck and a source of inefficiency in the cluster. When running Spark on YARN at large scale, people usually enable the Spark external shuffle service and store the intermediate shuffle files on HDD. Because the number of blocks generated for a particular shuffle grows quadratically with the size of the shuffled data (# mappers and # reducers grow linearly with the data size, but # blocks is # mappers * # reducers), one general trend we have observed is that the more data a Spark application processes, the smaller the block size becomes. In a few production clusters we have seen, the average shuffle block size is only 10s of KBs. Because of the inefficiency of performing random reads on HDD for small amounts of data, the overall efficiency of the Spark external shuffle services serving the shuffle blocks degrades as an increasing number of Spark applications process an increasing amount of data. In addition, because the Spark external shuffle service is a shared service in a multi-tenant cluster, the inefficiency of one Spark application can propagate to other applications as well. In this ticket, we propose a solution to improve Spark shuffle efficiency in the above-mentioned environments with push-based shuffle. With push-based shuffle, the shuffle is performed at the end of the mappers, and blocks are pre-merged and moved towards the reducers. In our prototype implementation, we have seen significant efficiency improvements when performing large shuffles. We take a Spark-native approach to achieve this, i.e., extending Spark's existing Netty-based shuffle protocol and the behaviors of Spark mappers, reducers and drivers. This way, we can bring the benefits of more efficient shuffle to Spark without incurring the dependency or overhead of either a specialized storage layer or external infrastructure pieces. Link to dev mailing list discussion: [http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html] was: In a large deployment of a Spark compute infrastructure, Spark shuffle is becoming a potential scaling bottleneck and a source of inefficiency in the cluster. When doing Spark on YARN for a large-scale deployment, people usually enable Spark external shuffle service and store the intermediate shuffle files on HDD. Because the number of blocks generated for a particular shuffle grows quadratically compared to the size of shuffled data (# mappers and reducers grows linearly with the size of shuffled data, but # blocks is # mappers * # reducers), one general trend we have observed is that the more data a Spark application processes, the smaller the block size becomes. In a few production clusters we have seen, the average shuffle block size is only 10s of KBs. Because of the inefficiency of performing random reads on HDD for small amount of data, the overall efficiency of the Spark external shuffle services serving the shuffle blocks degrades as we see an increasing # of Spark applications processing an increasing amount of data. In addition, because Spark external shuffle service is a shared service in a multi-tenancy cluster, the inefficiency with one Spark application could propagate to other applications as well.
In this ticket, we propose a solution to improve Spark shuffle efficiency in above mentioned environments with push-based shuffle. With push-based shuffle, shuffle is performed at the end of mappers and blocks get pre-merged and move towards reducers. In our prototype implementation, we have seen significant efficiency improvements when performing large shuffles. We take a Spark-native approach to achieve this, i.e., extending Spark's existing shuffle netty protocol, and the behaviors of Spark mappers, reducers and drivers. This way, we can bring the benefits of more efficient shuffle in Spark without incurring the dependency or overhead of either specialized storage layer or external infrastructure pieces. Link to dev mailing list discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html > SPIP: Support push-based shuffle to improve shuffle efficiency > -- > > Key: SPARK-30602 > URL: https://issues.apache.org/jira/browse/SPARK-30602 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core > Affects Versions: 3.1.0 > Reporter: Min Shen > Priority: Major > Labels: release-notes > Attachments: Screen Shot
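The quadratic-growth claim above can be made concrete with a small back-of-the-envelope calculation; every number below is hypothetical and chosen only to illustrate the trend.
{code:scala}
// If mappers and reducers each scale linearly with the shuffled data, the
// block count (mappers * reducers) scales quadratically, so the average
// block size shrinks as data grows.
object ShuffleBlockMath {
  def avgBlockSizeBytes(shuffleBytes: Long, mappers: Int, reducers: Int): Long =
    shuffleBytes / (mappers.toLong * reducers)

  def main(args: Array[String]): Unit = {
    // 1 TB shuffled across 5,000 mappers x 5,000 reducers:
    // 25M blocks of ~43 KB each -- random-read hostile on HDD.
    println(avgBlockSizeBytes(1L << 40, 5000, 5000))
    // 10x the data with 10x the mappers and reducers: 100x the blocks,
    // so the average block shrinks to ~4 KB.
    println(avgBlockSizeBytes(10L * (1L << 40), 50000, 50000))
  }
}
{code}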
[jira] [Updated] (SPARK-20986) Reset table's statistics after PruneFileSourcePartitions rule.
[ https://issues.apache.org/jira/browse/SPARK-20986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-20986: - Description: After the PruneFileSourcePartitions rule runs, the table's statistics need to be reset, because the rule can filter out unnecessary partitions and the old statistics no longer match the data that will actually be scanned. (was: After PruneFileSourcePartitions rule, It needs reset table's statistics because PruneFileSourcePartitions can filter some unnecessary partitions. So the statistics need to be changed.) > Reset table's statistics after PruneFileSourcePartitions rule. > -- > > Key: SPARK-20986 > URL: https://issues.apache.org/jira/browse/SPARK-20986 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.1.0 > Reporter: Lianhui Wang > Assignee: Apache Spark > > After the PruneFileSourcePartitions rule runs, the table's statistics need to > be reset, because the rule can filter out unnecessary partitions and the old > statistics no longer match the data that will actually be scanned.
[jira] [Created] (SPARK-20986) Reset table's statistics after PruneFileSourcePartitions rule.
Lianhui Wang created SPARK-20986: Summary: Reset table's statistics after PruneFileSourcePartitions rule. Key: SPARK-20986 URL: https://issues.apache.org/jira/browse/SPARK-20986 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Lianhui Wang After the PruneFileSourcePartitions rule runs, the table's statistics need to be reset, because the rule can filter out unnecessary partitions and the old statistics no longer match the data that will actually be scanned.
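A hedged sketch of the idea with stand-in types rather than Spark's actual Catalyst classes; everything named here (PartitionFiles, prunedSizeInBytes, the sample predicate) is illustrative:
{code:scala}
// Sketch only: after partition pruning, the relation's size estimate should
// cover just the surviving partitions, not the whole table.
case class PartitionFiles(values: Map[String, String], sizeInBytes: Long)

def prunedSizeInBytes(
    partitions: Seq[PartitionFiles],
    keep: Map[String, String] => Boolean): Long =
  partitions.filter(p => keep(p.values)).map(_.sizeInBytes).sum

// Example: pruning to dt=2017-06-01 shrinks the estimate from ~900 GB to
// 64 MB, which can re-enable optimizations a stale table-level estimate
// would have ruled out.
val parts = Seq(
  PartitionFiles(Map("dt" -> "2017-06-01"), 64L * 1024 * 1024),
  PartitionFiles(Map("dt" -> "2017-06-02"), 900L * 1024 * 1024 * 1024))
val newSizeInBytes = prunedSizeInBytes(parts, _.get("dt").contains("2017-06-01"))
{code}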
[jira] [Updated] (SPARK-15616) CatalogRelation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-15616: - Summary: CatalogRelation should fallback to HDFS size of partitions that are involved in Query if statistics are not available. (was: Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.) > CatalogRelation should fallback to HDFS size of partitions that are involved > in Query if statistics are not available. > -- > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL > Reporter: Lianhui Wang > > Currently, if some partitions of a partitioned table are used in a join > operation, we rely on the table size returned by the Metastore to decide > whether we can convert the operation to a broadcast join. > If a Filter can prune some partitions, Hive prunes partitions before deciding > to use broadcast joins, according to the HDFS size of the partitions that are > involved in the query. SparkSQL needs the same capability, which can improve > join performance for partitioned tables.
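A hedged sketch of the proposed fallback, not Spark's actual implementation: when the metastore has no statistics, sum the HDFS sizes of just the partition directories the query touches and compare that against the broadcast threshold. The warehouse path is made up, and the 10 MB constant stands in for the default of spark.sql.autoBroadcastJoinThreshold.
{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sum the on-HDFS size of only the partitions that survive pruning.
def partitionsSizeOnHdfs(conf: Configuration, partitionDirs: Seq[Path]): Long =
  partitionDirs.map { dir =>
    val fs = dir.getFileSystem(conf)
    fs.getContentSummary(dir).getLength // total bytes under this partition dir
  }.sum

val conf = new Configuration()
val prunedPartitions = Seq(new Path("hdfs:///warehouse/sales/dt=2016-10-01"))
// Decide on broadcast from the pruned size, not the whole-table size.
val canBroadcast = partitionsSizeOnHdfs(conf, prunedPartitions) <= 10L * 1024 * 1024
{code}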
[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables
[ https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948578#comment-15948578 ] Lianhui Wang commented on SPARK-14560: -- [~darshankhamar123] [~mhornbech] Are you using ExternalAppendOnlyMap or ExternalSorter? This issue is about those classes, not about UnsafeInMemorySorter. You can get more information about memory usage from DEBUG-level logs. > Cooperative Memory Management for Spillables > > > Key: SPARK-14560 > URL: https://issues.apache.org/jira/browse/SPARK-14560 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 1.6.1 > Reporter: Imran Rashid > Assignee: Lianhui Wang > Fix For: 2.0.0 > > > SPARK-10432 introduced cooperative memory management for SQL operators that > can spill; however, {{Spillable}} s used by the old RDD api still do not > cooperate. This can lead to memory starvation, in particular on a > shuffle-to-shuffle stage, eventually resulting in errors like: > {noformat} > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Memory used in task 3081 > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Acquired by > org.apache.spark.shuffle.sort.ShuffleExternalSorter@69ab0291: 32.0 KB > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317230346 bytes of memory > were used by task 3081 but are not associated with specific consumers > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317263114 bytes of memory > are used for execution and 1710484 bytes of memory are used for storage > 16/03/28 08:59:54 ERROR executor.Executor: Managed memory leak detected; size > = 1317230346 bytes, TID = 3081 > 16/03/28 08:59:54 ERROR executor.Executor: Exception in task 533.0 in stage > 3.0 (TID 3081) > java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0 > at > org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > This can happen anytime the shuffle read side requires more memory than what > is available for the task. Since the shuffle-read side doubles its memory > request each time, it can easily end up acquiring all of the available > memory, even if it does not use it. Eg., say that after the final spill, the > shuffle-read side requires 10 MB more memory, and there is 15 MB of memory > available. But if it starts at 2 MB, it will double to 4, 8, and then > request 16 MB of memory, and in fact get all available 15 MB. Since the 15 > MB of memory is sufficient, it will not spill, and will continue holding on > to all available memory. But this leaves *no* memory available for the > shuffle-write side.
Since the shuffle-write side cannot request the > shuffle-read side to free up memory, this leads to an OOM. > The simple solution is to make {{Spillable}} implement {{MemoryConsumer}} as > well, so RDDs can benefit from the cooperative memory management introduced > by SPARK-10342. > Note that an additional improvement would be for the shuffle-read side to > simply release unused memory, without spilling, in case that would leave > enough memory, and only spill if that was inadequate. However that can come > as a later improvement. > *Workaround*: You can set > {{spark.shuffle.spill.numElementsForceSpillThreshold=N}} to force spilling to > occur every {{N}} elements, thus preventing the shuffle-read side from ever > grabbing all of the available memory. However, this requires careful tuning > of {{N}} to specific workloads: too big, and you will still get an OOM; too > small, and there will be so much spilling that performance will suffer > drastically. Furthermore, this workaround uses an *undocumented* > configuration with *no compatibility guarantees* for future versions of Spark.
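The workaround above, written out as it might be applied; the threshold value is a hypothetical starting point, and the property is internal and undocumented, exactly as the caveats warn:
{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("force-spill-workaround")
  // Hypothetical N: force a spill every 5M records so the shuffle-read side
  // cannot grab all execution memory. Too big risks OOM; too small causes
  // excessive spilling. Undocumented setting, no compatibility guarantees.
  .set("spark.shuffle.spill.numElementsForceSpillThreshold", "5000000")
{code}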
[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables
[ https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15690589#comment-15690589 ] Lianhui Wang commented on SPARK-14560: -- Can you provide some DEBUG-level logs from TaskMemoryManager? I think they can tell us who is using the memory. > Cooperative Memory Management for Spillables > > > Key: SPARK-14560 > URL: https://issues.apache.org/jira/browse/SPARK-14560 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 1.6.1 > Reporter: Imran Rashid > Assignee: Lianhui Wang > Fix For: 2.0.0 > > > SPARK-10432 introduced cooperative memory management for SQL operators that > can spill; however, {{Spillable}} s used by the old RDD api still do not > cooperate. This can lead to memory starvation, in particular on a > shuffle-to-shuffle stage, eventually resulting in errors like: > {noformat} > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Memory used in task 3081 > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Acquired by > org.apache.spark.shuffle.sort.ShuffleExternalSorter@69ab0291: 32.0 KB > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317230346 bytes of memory > were used by task 3081 but are not associated with specific consumers > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317263114 bytes of memory > are used for execution and 1710484 bytes of memory are used for storage > 16/03/28 08:59:54 ERROR executor.Executor: Managed memory leak detected; size > = 1317230346 bytes, TID = 3081 > 16/03/28 08:59:54 ERROR executor.Executor: Exception in task 533.0 in stage > 3.0 (TID 3081) > java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0 > at > org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > This can happen anytime the shuffle read side requires more memory than what > is available for the task. Since the shuffle-read side doubles its memory > request each time, it can easily end up acquiring all of the available > memory, even if it does not use it. Eg., say that after the final spill, the > shuffle-read side requires 10 MB more memory, and there is 15 MB of memory > available. But if it starts at 2 MB, it will double to 4, 8, and then > request 16 MB of memory, and in fact get all available 15 MB. Since the 15 > MB of memory is sufficient, it will not spill, and will continue holding on > to all available memory. But this leaves *no* memory available for the > shuffle-write side. Since the shuffle-write side cannot request the > shuffle-read side to free up memory, this leads to an OOM.
> The simple solution is to make {{Spillable}} implement {{MemoryConsumer}} as > well, so RDDs can benefit from the cooperative memory management introduced > by SPARK-10342. > Note that an additional improvement would be for the shuffle-read side to > simply release unused memory, without spilling, in case that would leave > enough memory, and only spill if that was inadequate. However that can come > as a later improvement. > *Workaround*: You can set > {{spark.shuffle.spill.numElementsForceSpillThreshold=N}} to force spilling to > occur every {{N}} elements, thus preventing the shuffle-read side from ever > grabbing all of the available memory. However, this requires careful tuning > of {{N}} to specific workloads: too big, and you will still get an OOM; too > small, and there will be so much spilling that performance will suffer > drastically. Furthermore, this workaround uses an *undocumented* > configuration with *no compatibility guarantees* for future versions of Spark.
[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables
[ https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15670007#comment-15670007 ] Lianhui Wang commented on SPARK-14560: -- Now this issue is not in Branch-1.6. You can see https://github.com/apache/spark/pull/13027 for branch-1.6. Thanks. > Cooperative Memory Management for Spillables > > > Key: SPARK-14560 > URL: https://issues.apache.org/jira/browse/SPARK-14560 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Imran Rashid >Assignee: Lianhui Wang > Fix For: 2.0.0 > > > SPARK-10432 introduced cooperative memory management for SQL operators that > can spill; however, {{Spillable}} s used by the old RDD api still do not > cooperate. This can lead to memory starvation, in particular on a > shuffle-to-shuffle stage, eventually resulting in errors like: > {noformat} > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Memory used in task 3081 > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Acquired by > org.apache.spark.shuffle.sort.ShuffleExternalSorter@69ab0291: 32.0 KB > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317230346 bytes of memory > were used by task 3081 but are not associated with specific consumers > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317263114 bytes of memory > are used for execution and 1710484 bytes of memory are used for storage > 16/03/28 08:59:54 ERROR executor.Executor: Managed memory leak detected; size > = 1317230346 bytes, TID = 3081 > 16/03/28 08:59:54 ERROR executor.Executor: Exception in task 533.0 in stage > 3.0 (TID 3081) > java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0 > at > org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > This can happen anytime the shuffle read side requires more memory than what > is available for the task. Since the shuffle-read side doubles its memory > request each time, it can easily end up acquiring all of the available > memory, even if it does not use it. Eg., say that after the final spill, the > shuffle-read side requires 10 MB more memory, and there is 15 MB of memory > available. But if it starts at 2 MB, it will double to 4, 8, and then > request 16 MB of memory, and in fact get all available 15 MB. Since the 15 > MB of memory is sufficient, it will not spill, and will continue holding on > to all available memory. But this leaves *no* memory available for the > shuffle-write side. Since the shuffle-write side cannot request the > shuffle-read side to free up memory, this leads to an OOM. 
> The simple solution is to make {{Spillable}} implement {{MemoryConsumer}} as > well, so RDDs can benefit from the cooperative memory management introduced > by SPARK-10342. > Note that an additional improvement would be for the shuffle-read side to > simply release unused memory, without spilling, in case that would leave > enough memory, and only spill if that was inadequate. However that can come > as a later improvement. > *Workaround*: You can set > {{spark.shuffle.spill.numElementsForceSpillThreshold=N}} to force spilling to > occur every {{N}} elements, thus preventing the shuffle-read side from ever > grabbing all of the available memory. However, this requires careful tuning > of {{N}} to specific workloads: too big, and you will still get an OOM; too > small, and there will be so much spilling that performance will suffer > drastically. Furthermore, this workaround uses an *undocumented* > configuration with *no compatibility guarantees* for future versions of Spark.
[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15622645#comment-15622645 ] Lianhui Wang commented on SPARK-15616: -- For 2.0, I have created a new branch https://github.com/lianhuiwang/spark/tree/2.0_partition_broadcast. You can retry again. > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > Currently if some partitions of a partitioned table are used in join > operation we rely on Metastore returned size of table to calculate if we can > convert the operation to Broadcast join. > if Filter can prune some partitions, Hive can prune partition before > determining to use broadcast joins according to HDFS size of partitions that > are involved in Query.So sparkSQL needs it that can improve join's > performance for partitioned table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15618476#comment-15618476 ] Lianhui Wang commented on SPARK-15616: -- I have updated the code and fixed the problem that you have pointed out. Thanks. I think you can try again. > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > Currently if some partitions of a partitioned table are used in join > operation we rely on Metastore returned size of table to calculate if we can > convert the operation to Broadcast join. > if Filter can prune some partitions, Hive can prune partition before > determining to use broadcast joins according to HDFS size of partitions that > are involved in Query.So sparkSQL needs it that can improve join's > performance for partitioned table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-15616: - Comment: was deleted (was: I have updated the code and fixed the problem that you have pointed out. Thanks. I think you can try again.) > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > Currently if some partitions of a partitioned table are used in join > operation we rely on Metastore returned size of table to calculate if we can > convert the operation to Broadcast join. > if Filter can prune some partitions, Hive can prune partition before > determining to use broadcast joins according to HDFS size of partitions that > are involved in Query.So sparkSQL needs it that can improve join's > performance for partitioned table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613846#comment-15613846 ] Lianhui Wang edited comment on SPARK-15616 at 10/28/16 12:54 AM: - Yes, I think it can. But now the PR is based on branch 2.0, not branch master. I will update it based on branch master later. I think you can try this PR to do broadcast join. Please tell me any feedback. Thanks. was (Author: lianhuiwang): Yes, I think it can. But now the PR is based on branch 2.0, not branch master. I will update it based on branch master later. > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > Currently if some partitions of a partitioned table are used in join > operation we rely on Metastore returned size of table to calculate if we can > convert the operation to Broadcast join. > if Filter can prune some partitions, Hive can prune partition before > determining to use broadcast joins according to HDFS size of partitions that > are involved in Query.So sparkSQL needs it that can improve join's > performance for partitioned table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613846#comment-15613846 ] Lianhui Wang commented on SPARK-15616: -- Yes, I think it can. But now the PR is based on branch 2.0, not branch master. I will update it based on branch master later. > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > Currently if some partitions of a partitioned table are used in join > operation we rely on Metastore returned size of table to calculate if we can > convert the operation to Broadcast join. > if Filter can prune some partitions, Hive can prune partition before > determining to use broadcast joins according to HDFS size of partitions that > are involved in Query.So sparkSQL needs it that can improve join's > performance for partitioned table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15610455#comment-15610455 ] Lianhui Wang commented on SPARK-15616: -- If the filter is on a partition key, this issue can handle it for broadcast joins. If the filter is on a non-partition key, it is not covered by this issue; that case probably needs the cost-based optimizer. > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL > Reporter: Lianhui Wang > > Currently, if some partitions of a partitioned table are used in a join > operation, we rely on the table size returned by the Metastore to decide > whether we can convert the operation to a broadcast join. > If a Filter can prune some partitions, Hive prunes partitions before deciding > to use broadcast joins, according to the HDFS size of the partitions that are > involved in the query. SparkSQL needs the same capability, which can improve > join performance for partitioned tables.
[jira] [Comment Edited] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie
[ https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387145#comment-15387145 ] Lianhui Wang edited comment on SPARK-2666 at 7/21/16 4:38 AM: -- Thanks. I think what [~irashid] said is more about non-external shuffle, but in our use cases we usually use the YARN external shuffle service. For 0 --> 1 --> 2, if it hits a shuffle fetch failure while running stage 2, say on executor A, it needs to regenerate the map output of stage 1 that was on executor A, but it does not rerun stage 0 on executor A. So I think we can first handle FetchFailed on the YARN external shuffle service (maybe connection timeout, out of memory, etc.); I think many users have hit FetchFailed on the YARN external shuffle service. As [~tgraves] said before, if a stage fails because of FetchFailed, Spark currently reruns 1) all the tasks that have not yet succeeded in the failed stage (including the ones that could still be running). This causes many duplicate running tasks in the failed stage: once there is a FetchFailed, all the unsuccessful tasks of the failed stage are rerun. So I think our first target is: for the YARN external shuffle service, when a stage fails because of FetchFailed, decrease the number of rerun tasks in the failed stage. As I pointed out before, the best way is, like MapReduce, to just resubmit the map stage behind the failed stage: 1. When a FetchFailed happens in a task, the task does not finish; it continues to fetch the other results and just reports the ShuffleBlockId of the FetchFailed to the DAGScheduler. The other running tasks of this stage do the same. 2. The DAGScheduler receives the ShuffleBlockId of the FetchFailed and resubmits the task for that ShuffleBlockId. Once that task has finished, it registers its map output with the MapOutputTracker. 3. The task that hit the FetchFailed polls the MapOutputTracker for the missing map output on every heartbeat. Once step 2 is finished, the task can get the map output successfully and fetch the missing results. But there is a deadlock if the task from step 2 cannot run because there are no slots for it; in that situation we should kill some running tasks to make room. In addition, I find that https://issues.apache.org/jira/browse/SPARK-14649 did it for 2): it only reruns the failed tasks and waits for the ones still running in the failed stage. The disadvantage of SPARK-14649 is that the other running tasks of the failed stage may need a long time to rerun when they spend time fetching the other results. was (Author: lianhuiwang): I think what [~irashid] said is more about non-external shuffle, but in our use cases we usually use the YARN external shuffle service. For 0 --> 1 --> 2, if it hits a shuffle fetch failure while running stage 2, say on executor A, it needs to regenerate the map output of stage 1 that was on executor A, but it does not rerun stage 0 on executor A. So I think we can first handle FetchFailed on the YARN external shuffle service (maybe connection timeout, out of memory, etc.); I think many users have hit FetchFailed on the YARN external shuffle service. As [~tgraves] said before, if a stage fails because of FetchFailed, Spark currently reruns 1) all the tasks that have not yet succeeded in the failed stage (including the ones that could still be running). This causes many duplicate running tasks in the failed stage: once there is a FetchFailed, all the unsuccessful tasks of the failed stage are rerun.
So I think our first target is: for the YARN external shuffle service, when a stage fails because of FetchFailed, decrease the number of rerun tasks in the failed stage. As I pointed out before, the best way is, like MapReduce, to just resubmit the map stage behind the failed stage: 1. When a FetchFailed happens in a task, the task does not finish; it continues to fetch the other results and just reports the ShuffleBlockId of the FetchFailed to the DAGScheduler. The other running tasks of this stage do the same. 2. The DAGScheduler receives the ShuffleBlockId of the FetchFailed and resubmits the task for that ShuffleBlockId. Once that task has finished, it registers its map output with the MapOutputTracker. 3. The task that hit the FetchFailed polls the MapOutputTracker for the missing map output on every heartbeat. Once step 2 is finished, the task can get the map output successfully and fetch the missing results. But there is a deadlock if the task from step 2 cannot run because there are no slots for it; in that situation we should kill some running tasks to make room. In addition, I find that https://issues.apache.org/jira/browse/SPARK-14649 did it for 2): it only reruns the failed tasks and waits for the ones still running in the failed stage. The disadvantage of SPARK-14649 is that the other running tasks of the failed stage may need a long time to rerun when they spend
[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie
[ https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387145#comment-15387145 ] Lianhui Wang commented on SPARK-2666: - I think what [~irashid] said is more about non-external shuffle, but in our use cases we usually use the YARN external shuffle service. For 0 --> 1 --> 2, if it hits a shuffle fetch failure while running stage 2, say on executor A, it needs to regenerate the map output of stage 1 that was on executor A, but it does not rerun stage 0 on executor A. So I think we can first handle FetchFailed on the YARN external shuffle service (maybe connection timeout, out of memory, etc.); I think many users have hit FetchFailed on the YARN external shuffle service. As [~tgraves] said before, if a stage fails because of FetchFailed, Spark currently reruns 1) all the tasks that have not yet succeeded in the failed stage (including the ones that could still be running). This causes many duplicate running tasks in the failed stage: once there is a FetchFailed, all the unsuccessful tasks of the failed stage are rerun. So I think our first target is: for the YARN external shuffle service, when a stage fails because of FetchFailed, decrease the number of rerun tasks in the failed stage. As I pointed out before, the best way is, like MapReduce, to just resubmit the map stage behind the failed stage: 1. When a FetchFailed happens in a task, the task does not finish; it continues to fetch the other results and just reports the ShuffleBlockId of the FetchFailed to the DAGScheduler. The other running tasks of this stage do the same. 2. The DAGScheduler receives the ShuffleBlockId of the FetchFailed and resubmits the task for that ShuffleBlockId. Once that task has finished, it registers its map output with the MapOutputTracker. 3. The task that hit the FetchFailed polls the MapOutputTracker for the missing map output on every heartbeat. Once step 2 is finished, the task can get the map output successfully and fetch the missing results. But there is a deadlock if the task from step 2 cannot run because there are no slots for it; in that situation we should kill some running tasks to make room. In addition, I find that https://issues.apache.org/jira/browse/SPARK-14649 did it for 2): it only reruns the failed tasks and waits for the ones still running in the failed stage. The disadvantage of SPARK-14649 is that the other running tasks of the failed stage may need a long time to rerun when they spend time fetching the other results. > Always try to cancel running tasks when a stage is marked as zombie > --- > > Key: SPARK-2666 > URL: https://issues.apache.org/jira/browse/SPARK-2666 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core > Reporter: Lianhui Wang > > There are some situations in which the scheduler can mark a task set as a > "zombie" before the task set has completed all of its tasks. For example: > (a) When a task fails b/c of a {{FetchFailed}} > (b) When a stage completes because two different attempts create all the > ShuffleMapOutput, though no attempt has completed all its tasks (at least, > this *should* result in the task set being marked as zombie, see SPARK-10370) > (there may be others, I'm not sure if this list is exhaustive.) > Marking a taskset as zombie prevents any *additional* tasks from getting > scheduled, however it does not cancel all currently running tasks. We should > cancel all running tasks to avoid wasting resources (and also to make the behavior > a little more clear to the end user).
Rather than canceling tasks in each > case piecemeal, we should refactor the scheduler so that these two actions > are always taken together -- canceling tasks should go hand-in-hand with > marking the taskset as zombie. > Some implementation notes: > * We should change {{taskSetManager.isZombie}} to be private and put it > behind a method like {{markZombie}} or something. > * marking a stage as zombie before all the tasks have completed does *not* > necessarily mean the stage attempt has failed. In case (a), the stage > attempt has failed, but in stage (b) we are not canceling b/c of a failure, > rather just b/c no more tasks are needed. > * {{taskScheduler.cancelTasks}} always marks the task set as zombie. > However, it also has some side-effects like logging that the stage has failed > and creating a {{TaskSetFailed}} event, which we don't want eg. in case (b) > when nothing has failed. So it may need some additional refactoring to go > along w/ {{markZombie}}. > * {{SchedulerBackend}}'s are free to not implement {{killTask}}, so we need > to be sure to catch the {{UnsupportedOperationException}} s > * Testing this *might* benefit from SPARK-10372
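A hedged sketch of the refactor outlined in the implementation notes above, using simplified stand-in types (not the real TaskSetManager or SchedulerBackend internals): the zombie flag becomes private, and marking zombie and canceling running tasks happen behind one entry point.
{code:scala}
// Sketch only; names, fields and signatures are stand-ins, not Spark's.
trait SchedulerBackendSketch {
  def killTask(taskId: Long, interruptThread: Boolean): Unit
}

class TaskSetManagerSketch(backend: SchedulerBackendSketch) {
  private var isZombie = false            // no new tasks may launch once set
  private var runningTasks = Set.empty[Long]

  /** Mark the task set as zombie and cancel its running tasks together. */
  def markZombie(): Unit = {
    isZombie = true
    runningTasks.foreach { tid =>
      try backend.killTask(tid, interruptThread = false)
      catch {
        // Backends are free to not implement killTask.
        case _: UnsupportedOperationException => ()
      }
    }
  }
}
{code}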
[jira] [Updated] (SPARK-16649) Push partition predicates down into metastore for OptimizeMetadataOnlyQuery
[ https://issues.apache.org/jira/browse/SPARK-16649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-16649: - Description: SPARK-6910 added support for pushing partition predicates down into the Hive metastore for table scans. OptimizeMetadataOnlyQuery should also push partition predicates down into the metastore. (was: SPARK-6910 has supported for pushing partition predicates down into the Hive metastore for table scan. So it also should push partition predicates dow into metastore for OptimizeMetadataOnlyQuery.) > Push partition predicates down into metastore for OptimizeMetadataOnlyQuery > --- > > Key: SPARK-16649 > URL: https://issues.apache.org/jira/browse/SPARK-16649 > Project: Spark > Issue Type: Improvement > Components: SQL > Reporter: Lianhui Wang > > SPARK-6910 added support for pushing partition predicates down into the Hive > metastore for table scans. OptimizeMetadataOnlyQuery should also push > partition predicates down into the metastore.
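The kind of query this targets, sketched against a hypothetical Hive table logs partitioned by (dt, hour); with the pushdown, only the matching partitions are requested from the metastore instead of listing every partition and filtering on the driver.
{code:scala}
import org.apache.spark.sql.SparkSession

// Requires a Hive metastore; the table and partition columns are made up.
val spark = SparkSession.builder
  .appName("metadata-only-partition-pushdown")
  .enableHiveSupport()
  .getOrCreate()

// Aggregate over a partition column with a partition predicate: the dt filter
// should be pushed into the metastore's partition-listing call.
spark.sql("SELECT MAX(hour) FROM logs WHERE dt = '2016-07-20'").show()
{code}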
[jira] [Updated] (SPARK-16649) Push partition predicates down into metastore for OptimizeMetadataOnlyQuery
[ https://issues.apache.org/jira/browse/SPARK-16649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-16649: - Summary: Push partition predicates down into metastore for OptimizeMetadataOnlyQuery (was: Push partition predicates down into the Hive metastore for OptimizeMetadataOnlyQuery) > Push partition predicates down into metastore for OptimizeMetadataOnlyQuery > --- > > Key: SPARK-16649 > URL: https://issues.apache.org/jira/browse/SPARK-16649 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > SPARK-6910 has supported for pushing partition predicates down into the Hive > metastore for table scan. So it also should push partition predicates dow > into metastore for OptimizeMetadataOnlyQuery. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16649) Push partition predicates down into the Hive metastore for OptimizeMetadataOnlyQuery
Lianhui Wang created SPARK-16649: Summary: Push partition predicates down into the Hive metastore for OptimizeMetadataOnlyQuery Key: SPARK-16649 URL: https://issues.apache.org/jira/browse/SPARK-16649 Project: Spark Issue Type: Improvement Components: SQL Reporter: Lianhui Wang SPARK-6910 added support for pushing partition predicates down into the Hive metastore for table scans. OptimizeMetadataOnlyQuery should also push partition predicates down into the metastore.
[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie
[ https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385510#comment-15385510 ] Lianhui Wang commented on SPARK-2666: - [~tgraves] Sorry for the late reply. In https://github.com/apache/spark/pull/1572, all running tasks are killed before we resubmit for a FetchFailed. But [~kayousterhout] said that it keeps the remaining tasks, because the running tasks may hit fetch failures from different map outputs than the original fetch failure. I think the best way is, like MapReduce, to just resubmit the map stage behind the failed stage: if the reduce stage hits a FetchFailed, it just reports the FetchFailed to the DAGScheduler and keeps fetching the other results. Then the reduce stage polls the output status of the FetchFailed on every heartbeat, like https://github.com/apache/spark/pull/3430. [~tgraves] What do you think about this? Thanks. > Always try to cancel running tasks when a stage is marked as zombie > --- > > Key: SPARK-2666 > URL: https://issues.apache.org/jira/browse/SPARK-2666 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core > Reporter: Lianhui Wang > > There are some situations in which the scheduler can mark a task set as a > "zombie" before the task set has completed all of its tasks. For example: > (a) When a task fails b/c of a {{FetchFailed}} > (b) When a stage completes because two different attempts create all the > ShuffleMapOutput, though no attempt has completed all its tasks (at least, > this *should* result in the task set being marked as zombie, see SPARK-10370) > (there may be others, I'm not sure if this list is exhaustive.) > Marking a taskset as zombie prevents any *additional* tasks from getting > scheduled, however it does not cancel all currently running tasks. We should > cancel all running tasks to avoid wasting resources (and also to make the behavior > a little more clear to the end user). Rather than canceling tasks in each > case piecemeal, we should refactor the scheduler so that these two actions > are always taken together -- canceling tasks should go hand-in-hand with > marking the taskset as zombie. > Some implementation notes: > * We should change {{taskSetManager.isZombie}} to be private and put it > behind a method like {{markZombie}} or something. > * marking a stage as zombie before all the tasks have completed does *not* > necessarily mean the stage attempt has failed. In case (a), the stage > attempt has failed, but in stage (b) we are not canceling b/c of a failure, > rather just b/c no more tasks are needed. > * {{taskScheduler.cancelTasks}} always marks the task set as zombie. > However, it also has some side-effects like logging that the stage has failed > and creating a {{TaskSetFailed}} event, which we don't want eg. in case (b) > when nothing has failed. So it may need some additional refactoring to go > along w/ {{markZombie}}. > * {{SchedulerBackend}}'s are free to not implement {{killTask}}, so we need > to be sure to catch the {{UnsupportedOperationException}} s > * Testing this *might* benefit from SPARK-10372
[jira] [Updated] (SPARK-16497) Don't throw an exception if drop non-existent TABLE/VIEW/Function/Partitions
[ https://issues.apache.org/jira/browse/SPARK-16497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-16497: - Summary: Don't throw an exception if drop non-existent TABLE/VIEW/Function/Partitions (was: Don't throw an exception for drop non-exist TABLE/VIEW/Function/Partitions) > Don't throw an exception if drop non-existent TABLE/VIEW/Function/Partitions > > > Key: SPARK-16497 > URL: https://issues.apache.org/jira/browse/SPARK-16497 > Project: Spark > Issue Type: Improvement > Components: SQL > Reporter: Lianhui Wang > > From > https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.exec.drop.ignorenonexistent, > Hive uses 'hive.exec.drop.ignorenonexistent' (default=true) to avoid reporting > an error if DROP TABLE/VIEW/PARTITION/INDEX/TEMPORARY FUNCTION specifies a > non-existent table/view. SparkSQL should support this as well.
[jira] [Created] (SPARK-16497) Don't throw an exception for drop non-exist TABLE/VIEW/Function/Partitions
Lianhui Wang created SPARK-16497: Summary: Don't throw an exception for drop non-exist TABLE/VIEW/Function/Partitions Key: SPARK-16497 URL: https://issues.apache.org/jira/browse/SPARK-16497 Project: Spark Issue Type: Improvement Components: SQL Reporter: Lianhui Wang From https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.exec.drop.ignorenonexistent, Hive uses 'hive.exec.drop.ignorenonexistent' (default=true) to avoid reporting an error if DROP TABLE/VIEW/PARTITION/INDEX/TEMPORARY FUNCTION specifies a non-existent table/view. SparkSQL should support this as well.
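A sketch of the behavior the ticket asks for (table names hypothetical): IF EXISTS is already tolerant, and with this change the plain forms would follow Hive's hive.exec.drop.ignorenonexistent=true default instead of failing.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("drop-nonexistent").getOrCreate()

spark.sql("DROP TABLE IF EXISTS never_created")  // no-op, never throws
// What this ticket proposes: these should also be tolerated rather than
// raising an error when the object does not exist.
spark.sql("DROP TABLE never_created")
spark.sql("DROP VIEW never_created_view")
{code}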
[jira] [Updated] (SPARK-16456) Reuse the uncorrelated scalar subqueries with the same logical plan in a query
[ https://issues.apache.org/jira/browse/SPARK-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-16456: - Summary: Reuse the uncorrelated scalar subqueries with the same logical plan in a query (was: Reuse the uncorrelated scalar subqueries with with the same logical plan in a query) > Reuse the uncorrelated scalar subqueries with the same logical plan in a query > -- > > Key: SPARK-16456 > URL: https://issues.apache.org/jira/browse/SPARK-16456 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > In TPCDS Q14, the same physical plan of uncorrelated scalar subqueries from a > CTE could be executed multiple times, we should re-use the same result to > avoid the duplicated computing on the same uncorrelated scalar subquery. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16456) Reuse the uncorrelated scalar subqueries with the same logical plan in a query
[ https://issues.apache.org/jira/browse/SPARK-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-16456: - Description: In TPCDS Q14, the same physical plan of uncorrelated scalar subqueries from a CTE could be executed multiple times; we should reuse the same result to avoid duplicate computation. (was: In TPCDS Q14, the same physical plan of uncorrelated scalar subqueries from a CTE could be executed multiple times; we should reuse the same result to avoid duplicate computation on the same uncorrelated scalar subquery.) > Reuse the uncorrelated scalar subqueries with the same logical plan in a query > -- > > Key: SPARK-16456 > URL: https://issues.apache.org/jira/browse/SPARK-16456 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > In TPCDS Q14, the same physical plan of uncorrelated scalar subqueries from a > CTE could be executed multiple times; we should reuse the same result to > avoid duplicate computation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
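A small illustration of the duplication, assuming a SparkSession named spark and a hypothetical sales table: the same uncorrelated scalar subquery appears twice and, without reuse, is planned and executed twice.

{code:scala}
spark.sql("""
  SELECT *
  FROM sales
  WHERE amount   > (SELECT avg(amount) FROM sales)
     OR quantity > (SELECT avg(amount) FROM sales)
""")
{code}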
[jira] [Updated] (SPARK-15752) Optimize metadata only query that has an aggregate whose children are deterministic project or filter operators
[ https://issues.apache.org/jira/browse/SPARK-15752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-15752: - Summary: Optimize metadata only query that has an aggregate whose children are deterministic project or filter operators (was: support optimization for metadata only queries) > Optimize metadata only query that has an aggregate whose children are > deterministic project or filter operators > --- > > Key: SPARK-15752 > URL: https://issues.apache.org/jira/browse/SPARK-15752 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > When a query only uses metadata (for example, the partition key), it can return results > based on metadata without scanning files. Hive did this in HIVE-1003. > design document: > https://docs.google.com/document/d/1Bmi4-PkTaBQ0HVaGjIqa3eA12toKX52QaiUyhb6WQiM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
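As an example of a metadata-only query, assume a SparkSession named spark and a hypothetical table events partitioned by dt; both queries below touch only the partition column, so they can be answered from catalog metadata without scanning any data files:

{code:scala}
spark.sql("SELECT DISTINCT dt FROM events")
spark.sql("SELECT max(dt) FROM events")
{code}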
[jira] [Updated] (SPARK-16302) Set the right number of partitions for reading data from a local collection.
[ https://issues.apache.org/jira/browse/SPARK-16302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-16302: - Summary: Set the right number of partitions for reading data from a local collection. (was: Set the default number of partitions for reading data from a local collection.) > Set the right number of partitions for reading data from a local collection. > > > Key: SPARK-16302 > URL: https://issues.apache.org/jira/browse/SPARK-16302 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Lianhui Wang > > Query: val df = Seq[(Int, Int)]().toDF("key", "value").count always uses > defaultParallelism tasks, so it runs empty or very small tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16302) Set the default number of partitions for reading data from a local collection.
[ https://issues.apache.org/jira/browse/SPARK-16302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-16302: - Summary: Set the default number of partitions for reading data from a local collection. (was: LocalTableScanExec always use defaultParallelism tasks even though it is very small seq.) > Set the default number of partitions for reading data from a local collection. > -- > > Key: SPARK-16302 > URL: https://issues.apache.org/jira/browse/SPARK-16302 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Lianhui Wang > > Query: val df = Seq[(Int, Int)]().toDF("key", "value").count always uses > defaultParallelism tasks, so it runs empty or very small tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16302) LocalTableScanExec always use defaultParallelism tasks even though it is very small seq.
Lianhui Wang created SPARK-16302: Summary: LocalTableScanExec always use defaultParallelism tasks even though it is very small seq. Key: SPARK-16302 URL: https://issues.apache.org/jira/browse/SPARK-16302 Project: Spark Issue Type: Bug Components: SQL Reporter: Lianhui Wang Query: val df = Seq[(Int, Int)]().toDF("key", "value").count always uses defaultParallelism tasks, so it runs empty or very small tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
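A minimal reproduction of the report, assuming a SparkSession named spark with its implicits imported:

{code:scala}
import spark.implicits._

val df = Seq[(Int, Int)]().toDF("key", "value")
df.count()              // schedules spark.sparkContext.defaultParallelism tasks for zero rows
df.rdd.getNumPartitions // shows the partition count backing the local scan
{code}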
[jira] [Created] (SPARK-15988) Implement DDL commands: CREATE/DROP TEMPORARY MACRO
Lianhui Wang created SPARK-15988: Summary: Implement DDL commands: CREATE/DROP TEMPORARY MACRO Key: SPARK-15988 URL: https://issues.apache.org/jira/browse/SPARK-15988 Project: Spark Issue Type: Improvement Components: SQL Reporter: Lianhui Wang In https://issues.apache.org/jira/browse/HIVE-2655, Hive implemented CREATE/DROP TEMPORARY MACRO. So I think Spark SQL should support it too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
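For reference, the macro syntax introduced by HIVE-2655 looks like the following; this is a sketch of what Spark SQL would accept if it adopted the same commands (assuming a SparkSession named spark):

{code:scala}
spark.sql("CREATE TEMPORARY MACRO SIGMOID (x DOUBLE) 1.0 / (1.0 + EXP(-x))")
spark.sql("SELECT SIGMOID(2.0)")
spark.sql("DROP TEMPORARY MACRO SIGMOID")
{code}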
[jira] [Updated] (SPARK-15752) support optimization for metadata only queries
[ https://issues.apache.org/jira/browse/SPARK-15752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-15752: - Description: When a query only uses metadata (for example, the partition key), it can return results based on metadata without scanning files. Hive did this in HIVE-1003. design document: https://docs.google.com/document/d/1Bmi4-PkTaBQ0HVaGjIqa3eA12toKX52QaiUyhb6WQiM/edit?usp=sharing was: When a query only uses metadata (for example, the partition key), it can return results based on metadata without scanning files. Hive did this in HIVE-1003. > support optimization for metadata only queries > -- > > Key: SPARK-15752 > URL: https://issues.apache.org/jira/browse/SPARK-15752 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > When a query only uses metadata (for example, the partition key), it can return results > based on metadata without scanning files. Hive did this in HIVE-1003. > design document: > https://docs.google.com/document/d/1Bmi4-PkTaBQ0HVaGjIqa3eA12toKX52QaiUyhb6WQiM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15756) Support command 'create table stored as orcfile/parquetfile/avrofile'
[ https://issues.apache.org/jira/browse/SPARK-15756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-15756: - Description: Now Spark SQL supports 'create table src stored as orc/parquet/avro' for ORC/Parquet/Avro tables. But Hive supports both forms: 'stored as orc/parquet/avro' and 'stored as orcfile/parquetfile/avrofile'. So, like Hive, Spark SQL should support 'stored as orcfile/parquetfile/avrofile'. (was: in https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/IOConstants.java, we can know Hive support create table stored as orcfile/parquetfile/avrofile.) > Support command 'create table stored as orcfile/parquetfile/avrofile' > - > > Key: SPARK-15756 > URL: https://issues.apache.org/jira/browse/SPARK-15756 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: marymwu >Priority: Minor > > Now Spark SQL supports 'create table src stored as orc/parquet/avro' for > ORC/Parquet/Avro tables. But Hive supports both forms: 'stored as > orc/parquet/avro' and 'stored as orcfile/parquetfile/avrofile'. So, like Hive, > Spark SQL should support 'stored as orcfile/parquetfile/avrofile'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
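A short illustration of the two forms, assuming a SparkSession named spark; Hive accepts both, while Spark SQL at the time only accepted the first:

{code:scala}
spark.sql("CREATE TABLE src_orc  (key INT) STORED AS ORC")     // supported
spark.sql("CREATE TABLE src_orc2 (key INT) STORED AS ORCFILE") // requested by this ticket
{code}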
[jira] [Updated] (SPARK-15756) Support command 'create table stored as orcfile/parquetfile/avrofile'
[ https://issues.apache.org/jira/browse/SPARK-15756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-15756: - Summary: Support command 'create table stored as orcfile/parquetfile/avrofile' (was: Support create table stored as orcfile/parquetfile/avrofile) > Support command 'create table stored as orcfile/parquetfile/avrofile' > - > > Key: SPARK-15756 > URL: https://issues.apache.org/jira/browse/SPARK-15756 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: marymwu >Priority: Minor > > in > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/IOConstants.java, > we can know Hive support create table stored as orcfile/parquetfile/avrofile. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15756) Support create table stored as orcfile/parquetfile/avrofile
[ https://issues.apache.org/jira/browse/SPARK-15756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-15756: - Description: in https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/IOConstants.java, we can know Hive support create table stored as orcfile/parquetfile/avrofile. (was: in https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/IOConstants.java, we can know Hive support ) > Support create table stored as orcfile/parquetfile/avrofile > --- > > Key: SPARK-15756 > URL: https://issues.apache.org/jira/browse/SPARK-15756 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: marymwu >Priority: Minor > > in > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/IOConstants.java, > we can know Hive support create table stored as orcfile/parquetfile/avrofile. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15756) Support create table stored as orcfile/parquetfile/avrofile
[ https://issues.apache.org/jira/browse/SPARK-15756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-15756: - Description: in https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/IOConstants.java, we can know Hive support > Support create table stored as orcfile/parquetfile/avrofile > --- > > Key: SPARK-15756 > URL: https://issues.apache.org/jira/browse/SPARK-15756 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: marymwu >Priority: Minor > > in > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/IOConstants.java, > we can know Hive support -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15756) Support create table stored as orcfile/parquetfile/avrofile
[ https://issues.apache.org/jira/browse/SPARK-15756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-15756: - Summary: Support create table stored as orcfile/parquetfile/avrofile (was: SQL “stored as orcfile” cannot be supported while hive supports both keywords "orc" and "orcfile") > Support create table stored as orcfile/parquetfile/avrofile > --- > > Key: SPARK-15756 > URL: https://issues.apache.org/jira/browse/SPARK-15756 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: marymwu >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15752) support optimization for metadata only queries
[ https://issues.apache.org/jira/browse/SPARK-15752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-15752: - Description: When a query only uses metadata (for example, the partition key), it can return results based on metadata without scanning files. Hive did this in HIVE-1003. (was: When a query just uses metadata (for example, the partition key), it can return results based on metadata without scanning files. Hive did this in HIVE-1003.) > support optimization for metadata only queries > -- > > Key: SPARK-15752 > URL: https://issues.apache.org/jira/browse/SPARK-15752 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > When a query only uses metadata (for example, the partition key), it can return results > based on metadata without scanning files. Hive did this in HIVE-1003. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15752) support optimization for metadata only queries
[ https://issues.apache.org/jira/browse/SPARK-15752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-15752: - Description: When a query just uses metadata (for example, the partition key), it can return results based on metadata without scanning files. Hive did this in HIVE-1003. (was: When a query just uses metadata, it need not scan files; we should support this like HIVE-1003.) > support optimization for metadata only queries > -- > > Key: SPARK-15752 > URL: https://issues.apache.org/jira/browse/SPARK-15752 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > When a query just uses metadata (for example, the partition key), it can return > results based on metadata without scanning files. Hive did this in HIVE-1003. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15752) support optimization for metadata only queries
[ https://issues.apache.org/jira/browse/SPARK-15752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-15752: - Summary: support optimization for metadata only queries (was: support optimization for metadata only queries.) > support optimization for metadata only queries > -- > > Key: SPARK-15752 > URL: https://issues.apache.org/jira/browse/SPARK-15752 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > When a query just uses metadata, it need not scan files; we should > support this like HIVE-1003. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15752) support optimization for metadata only queries.
Lianhui Wang created SPARK-15752: Summary: support optimization for metadata only queries. Key: SPARK-15752 URL: https://issues.apache.org/jira/browse/SPARK-15752 Project: Spark Issue Type: Bug Components: SQL Reporter: Lianhui Wang When a query just uses metadata, it need not scan files; we should support this like HIVE-1003. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15752) support optimization for metadata only queries.
[ https://issues.apache.org/jira/browse/SPARK-15752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-15752: - Issue Type: Improvement (was: Bug) > support optimization for metadata only queries. > --- > > Key: SPARK-15752 > URL: https://issues.apache.org/jira/browse/SPARK-15752 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > When a query just uses metadata, it need not scan files; we should > support this like HIVE-1003. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15664) Replace FileSystem.get(conf) with path.getFileSystem(conf) when removing CheckpointFile in MLlib
[ https://issues.apache.org/jira/browse/SPARK-15664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-15664: - Description: if sparkContext.set CheckpointDir to another Dir that is not default FileSystem, it will throw exception when > Replace FileSystem.get(conf) with path.getFileSystem(conf) when removing > CheckpointFile in MLlib > > > Key: SPARK-15664 > URL: https://issues.apache.org/jira/browse/SPARK-15664 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Lianhui Wang > > if sparkContext.set CheckpointDir to another Dir that is not default > FileSystem, it will throw exception when -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15664) Replace FileSystem.get(conf) with path.getFileSystem(conf) when removing CheckpointFile in MLlib
[ https://issues.apache.org/jira/browse/SPARK-15664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-15664: - Description: If sparkContext.setCheckpointDir points to a directory that is not on the default FileSystem, it will throw an exception when removing checkpoint files in MLlib. (was: if sparkContext.set CheckpointDir to another Dir that is not default FileSystem, it will throw exception when ) > Replace FileSystem.get(conf) with path.getFileSystem(conf) when removing > CheckpointFile in MLlib > > > Key: SPARK-15664 > URL: https://issues.apache.org/jira/browse/SPARK-15664 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Lianhui Wang > > If sparkContext.setCheckpointDir points to a directory that is not on the default > FileSystem, it will throw an exception when removing checkpoint files in MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
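A minimal sketch of the bug and the fix using the Hadoop FileSystem API (the checkpoint path below is hypothetical):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val checkpointFile = new Path("hdfs://other-cluster/checkpoint/rdd-42")

// Buggy: always resolves fs.defaultFS; deleting a path that lives on a
// different filesystem then fails with a "Wrong FS" IllegalArgumentException.
val wrongFs: FileSystem = FileSystem.get(conf)

// Fixed: resolve the filesystem from the path itself.
val fs: FileSystem = checkpointFile.getFileSystem(conf)
fs.delete(checkpointFile, true)
{code}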
[jira] [Updated] (SPARK-15664) Replace FileSystem.get(conf) with path.getFileSystem(conf) when removing CheckpointFile in MLlib
[ https://issues.apache.org/jira/browse/SPARK-15664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-15664: - Component/s: MLlib > Replace FileSystem.get(conf) with path.getFileSystem(conf) when removing > CheckpointFile in MLlib > > > Key: SPARK-15664 > URL: https://issues.apache.org/jira/browse/SPARK-15664 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Lianhui Wang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15664) Replace FileSystem.get(conf) with path.getFileSystem(conf) when removing CheckpointFile in MLlib
Lianhui Wang created SPARK-15664: Summary: Replace FileSystem.get(conf) with path.getFileSystem(conf) when removing CheckpointFile in MLlib Key: SPARK-15664 URL: https://issues.apache.org/jira/browse/SPARK-15664 Project: Spark Issue Type: Bug Reporter: Lianhui Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15649) avoid serializing MetastoreRelation in HiveTableScanExec
Lianhui Wang created SPARK-15649: Summary: avoid serializing MetastoreRelation in HiveTableScanExec Key: SPARK-15649 URL: https://issues.apache.org/jira/browse/SPARK-15649 Project: Spark Issue Type: Improvement Reporter: Lianhui Wang Priority: Minor In HiveTableScanExec, schema is a lazy val that depends on relation.attributeMap, so the MetastoreRelation gets serialized along with the task binary bytes. We can avoid serializing the MetastoreRelation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
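The serialization cost can be demonstrated outside Spark. Below is a minimal, self-contained sketch (stand-in classes, not Spark's HiveTableScanExec or MetastoreRelation): a lazy schema forces the heavy relation to be kept as a field and shipped with the operator, while an eagerly computed schema lets the relation be dropped after construction.

{code:scala}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Heavy, serializable stand-in for MetastoreRelation.
class HeavyRelation extends Serializable {
  val blob = new Array[Byte](1 << 20) // ~1 MB of state
  def schemaNames: Seq[String] = Seq("key", "value")
}

// Lazy schema: it may be evaluated after construction, so `relation`
// must stay a field and is dragged into the serialized operator.
class ScanLazy(val relation: HeavyRelation) extends Serializable {
  lazy val schema: Seq[String] = relation.schemaNames
}

// Eager schema: `relation` is used only during construction and never
// stored, so serializing the operator ships just the small schema.
class ScanEager(relation: HeavyRelation) extends Serializable {
  val schema: Seq[String] = relation.schemaNames
}

def serializedSize(o: AnyRef): Int = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(o)
  oos.close()
  bos.size
}

// serializedSize(new ScanLazy(new HeavyRelation()))  // over 1 MB
// serializedSize(new ScanEager(new HeavyRelation())) // a few hundred bytes
{code}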
[jira] [Updated] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-15616: - Description: Currently, if some partitions of a partitioned table are used in a join operation, we rely on the Metastore-reported table size to decide whether we can convert the operation to a broadcast join. If a Filter can prune some partitions, Hive prunes partitions before deciding to use a broadcast join, based on the HDFS size of the partitions involved in the query. Spark SQL needs the same capability, which can improve join performance for partitioned tables. was: Currently if some partitions of a partitioned table are used in join operation we rely on Metastore returned size of table to calculate if we can convert the operation to Broadcast join. if Filter can prune some partitions, Hive can prune partition before determining to use broadcast joins according to HDFS size of partitions that are involved in Query.So sparkSQL needs it that can improve join's performance for partitioned table. > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > Currently, if some partitions of a partitioned table are used in a join > operation, we rely on the Metastore-reported table size to decide whether we can > convert the operation to a broadcast join. > If a Filter can prune some partitions, Hive prunes partitions before > deciding to use a broadcast join, based on the HDFS size of the partitions that > are involved in the query. Spark SQL needs the same capability, which can improve > join performance for partitioned tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
Lianhui Wang created SPARK-15616: Summary: Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available. Key: SPARK-15616 URL: https://issues.apache.org/jira/browse/SPARK-15616 Project: Spark Issue Type: Improvement Components: SQL Reporter: Lianhui Wang Currently, if some partitions of a partitioned table are used in a join operation, we rely on the Metastore-reported table size to decide whether we can convert the operation to a broadcast join. If a Filter can prune some partitions, Hive prunes partitions before deciding to use a broadcast join, based on the HDFS size of the partitions involved in the query. Spark SQL needs the same capability, which can improve join performance for partitioned tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
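A hedged sketch of how the fallback could compute the size of only the partitions that survive pruning, using standard Hadoop APIs (the helper name is hypothetical):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sum the on-disk size of just the pruned-in partition directories and use
// the total for the broadcast-join decision instead of the whole-table size.
def prunedPartitionsSize(conf: Configuration, partitionPaths: Seq[Path]): Long =
  partitionPaths.map { p =>
    p.getFileSystem(conf).getContentSummary(p).getLength
  }.sum

// e.g. broadcast when the sum is below spark.sql.autoBroadcastJoinThreshold
{code}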
[jira] [Updated] (SPARK-15335) Implement TRUNCATE TABLE Command
[ https://issues.apache.org/jira/browse/SPARK-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-15335: - Summary: Implement TRUNCATE TABLE Command (was: In Spark 2.0 TRUNCATE TABLE is unsupported) > Implement TRUNCATE TABLE Command > > > Key: SPARK-15335 > URL: https://issues.apache.org/jira/browse/SPARK-15335 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Weizhong >Priority: Minor > > Spark version based on commit b3930f74a0929b2cdcbbe5cbe34f0b1d35eb01cc, test > result is like below: > {noformat} > spark-sql> create table truncateTT(c string); > 16/05/16 10:23:15 INFO execution.SparkSqlParser: Parsing command: create > table truncateTT(c string) > 16/05/16 10:23:15 INFO metastore.HiveMetaStore: 0: get_database: default > 16/05/16 10:23:15 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr > cmd=get_database: default > 16/05/16 10:23:15 INFO metastore.HiveMetaStore: 0: get_database: default > 16/05/16 10:23:15 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr > cmd=get_database: default > 16/05/16 10:23:15 INFO metastore.HiveMetaStore: 0: create_table: > Table(tableName:truncatett, dbName:default, owner:root, > createTime:1463365395, lastAccessTime:0, retention:0, > sd:StorageDescriptor(cols:[FieldSchema(name:c, type:string, comment:null)], > location:null, inputFormat:org.apache.hadoop.mapred.TextInputFormat, > outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, > compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, > serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, > parameters:{serialization.format=1}), bucketCols:[], sortCols:[], > parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], > skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{}, > viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, > privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, > rolePrivileges:null)) > 16/05/16 10:23:15 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr > cmd=create_table: Table(tableName:truncatett, dbName:default, owner:root, > createTime:1463365395, lastAccessTime:0, retention:0, > sd:StorageDescriptor(cols:[FieldSchema(name:c, type:string, comment:null)], > location:null, inputFormat:org.apache.hadoop.mapred.TextInputFormat, > outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, > compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, > serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, > parameters:{serialization.format=1}), bucketCols:[], sortCols:[], > parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], > skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{}, > viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, > privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, > rolePrivileges:null)) > 16/05/16 10:23:15 INFO common.FileUtils: Creating directory if it doesn't > exist: hdfs://vm001:9000/opt/apache/spark/spark-warehouse/truncatett > 16/05/16 10:23:16 INFO spark.SparkContext: Starting job: processCmd at > CliDriver.java:376 > 16/05/16 10:23:16 INFO scheduler.DAGScheduler: Got job 1 (processCmd at > CliDriver.java:376) with 1 output partitions > 16/05/16 10:23:16 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 > (processCmd at CliDriver.java:376) > 16/05/16 10:23:16 INFO scheduler.DAGScheduler: Parents of final stage: List() > 
16/05/16 10:23:16 INFO scheduler.DAGScheduler: Missing parents: List() > 16/05/16 10:23:16 INFO scheduler.DAGScheduler: Submitting ResultStage 1 > (MapPartitionsRDD[5] at processCmd at CliDriver.java:376), which has no > missing parents > 16/05/16 10:23:16 INFO memory.MemoryStore: Block broadcast_1 stored as values > in memory (estimated size 3.2 KB, free 1823.2 MB) > 16/05/16 10:23:16 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as > bytes in memory (estimated size 1965.0 B, free 1823.2 MB) > 16/05/16 10:23:16 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in > memory on 192.168.151.146:47228 (size: 1965.0 B, free: 1823.2 MB) > 16/05/16 10:23:16 INFO spark.SparkContext: Created broadcast 1 from broadcast > at DAGScheduler.scala:1012 > 16/05/16 10:23:16 INFO scheduler.DAGScheduler: Submitting 1 missing tasks > from ResultStage 1 (MapPartitionsRDD[5] at processCmd at CliDriver.java:376) > 16/05/16 10:23:16 INFO cluster.YarnScheduler: Adding task set 1.0 with 1 tasks > 16/05/16 10:23:16 INFO scheduler.TaskSetManager: Starting task 0.0 in stage > 1.0 (TID 1, vm001, partition 0, PROCESS_LOCAL, 5387 bytes) > 16/05/16 10:23:16 INFO cluster.YarnClientSchedulerBackend:
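Once implemented, the command would behave as below (a sketch, assuming a SparkSession named spark and the table from the log above):

{code:scala}
spark.sql("CREATE TABLE truncateTT (c STRING)")
spark.sql("INSERT INTO truncateTT VALUES ('a')")
spark.sql("TRUNCATE TABLE truncateTT") // removes all rows, keeps the table definition
{code}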
[jira] [Updated] (SPARK-15246) Fix code style and improve volatile for SPARK-4452
[ https://issues.apache.org/jira/browse/SPARK-15246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-15246: - Summary: Fix code style and improve volatile for SPARK-4452 (was: Fix code style and improve volatile for Spillable) > Fix code style and improve volatile for SPARK-4452 > --- > > Key: SPARK-15246 > URL: https://issues.apache.org/jira/browse/SPARK-15246 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Lianhui Wang > > For SPARK-4452: > 1. Fix code style. > 2. Remove volatile from the elementsRead method because only one thread > uses it. > 3. Avoid volatile on _elementsRead because the collection increments > _elementsRead every time it inserts an element; that is very expensive, so we can > avoid it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15246) Fix code style and improve volatile for Spillable
Lianhui Wang created SPARK-15246: Summary: Fix code style and improve volatile for Spillable Key: SPARK-15246 URL: https://issues.apache.org/jira/browse/SPARK-15246 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Reporter: Lianhui Wang For SPARK-4452: 1. Fix code style. 2. Remove volatile from the elementsRead method because only one thread uses it. 3. Avoid volatile on _elementsRead because the collection increments _elementsRead every time it inserts an element; that is very expensive, so we can avoid it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
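The reasoning in points 2 and 3 can be shown as a small sketch (not Spark's real Spillable, which has more members): the counter is written only by the single task thread that owns the collection, so plain non-volatile access suffices and avoids a memory fence on every inserted element.

{code:scala}
abstract class Spillable {
  private var _elementsRead: Long = 0L // no @volatile: single writer, same-thread reader
  protected def elementsRead: Long = _elementsRead
  protected def addElementsRead(): Unit = { _elementsRead += 1 }
}
{code}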
[jira] [Created] (SPARK-14705) support Multiple FileSystem for YARN STAGING DIR
Lianhui Wang created SPARK-14705: Summary: support Multiple FileSystem for YARN STAGING DIR Key: SPARK-14705 URL: https://issues.apache.org/jira/browse/SPARK-14705 Project: Spark Issue Type: Bug Components: YARN Reporter: Lianhui Wang SPARK-13063 made the Spark YARN staging dir configurable, but it only supports the default FileSystem. When there are many clusters, different clusters can use different FileSystems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
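A sketch of the intended usage, assuming the staging-dir property introduced by SPARK-13063 (spark.yarn.stagingDir) and a hypothetical second cluster; this ticket asks that such a fully qualified, non-default-FileSystem path be honored:

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.yarn.stagingDir", "hdfs://clusterB/user/alice/.sparkStaging")
{code}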
[jira] [Closed] (SPARK-12322) recompute a cached RDD partition when fetching its block fails
[ https://issues.apache.org/jira/browse/SPARK-12322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang closed SPARK-12322. Resolution: Invalid > recompute a cached RDD partition when fetching its block fails > -- > > Key: SPARK-12322 > URL: https://issues.apache.org/jira/browse/SPARK-12322 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Lianhui Wang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12322) recompute a cached RDD partition when fetching its block fails
Lianhui Wang created SPARK-12322: Summary: recompute a cached RDD partition when fetching its block fails Key: SPARK-12322 URL: https://issues.apache.org/jira/browse/SPARK-12322 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Lianhui Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4621) Shuffle index can cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io
[ https://issues.apache.org/jira/browse/SPARK-4621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-4621: Summary: Shuffle index can cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io (was: when sort- based shuffle, Cache recently finished shuffle index can reduce indexFile's io) > Shuffle index can cached for SortShuffleManager in ExternalShuffle in order > to reduce indexFile's io > - > > Key: SPARK-4621 > URL: https://issues.apache.org/jira/browse/SPARK-4621 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Lianhui Wang >Priority: Minor > > In IndexShuffleBlockManager, we can use an LRU cache to store recently finished > shuffle indexes, which can reduce index-file I/O. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4621) Shuffle index can be cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io
[ https://issues.apache.org/jira/browse/SPARK-4621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-4621: Description: In ExternalShuffle, we can use an LRU cache to store recently finished shuffle indexes, which can reduce index-file I/O. At first, I will implement it for ExternalShuffle; later I will add it to IndexShuffleBlockResolver if I can. (was: In IndexShuffleBlockManager, we can use an LRU cache to store recently finished shuffle indexes, which can reduce index-file I/O.) > Shuffle index can be cached for SortShuffleManager in ExternalShuffle in > order to reduce indexFile's io > > > Key: SPARK-4621 > URL: https://issues.apache.org/jira/browse/SPARK-4621 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Lianhui Wang >Assignee: Apache Spark >Priority: Minor > > In ExternalShuffle, we can use an LRU cache to store recently finished shuffle > indexes, which can reduce index-file I/O. At first, I will implement it for > ExternalShuffle; later I will add it to IndexShuffleBlockResolver if I can. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4621) Shuffle index can be cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io
[ https://issues.apache.org/jira/browse/SPARK-4621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-4621: Summary: Shuffle index can be cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io (was: Shuffle index can cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io) > Shuffle index can be cached for SortShuffleManager in ExternalShuffle in > order to reduce indexFile's io > > > Key: SPARK-4621 > URL: https://issues.apache.org/jira/browse/SPARK-4621 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Lianhui Wang >Priority: Minor > > In IndexShuffleBlockManager, we can use an LRU cache to store recently finished > shuffle indexes, which can reduce index-file I/O. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
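A hedged sketch of the proposed cache using Guava (types and sizes are illustrative, not the actual ExternalShuffleBlockResolver code): repeated fetches for the same map output hit the cache instead of re-reading the index file from disk.

{code:scala}
import java.io.{DataInputStream, File, FileInputStream}
import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

// Parsed form of one shuffle index file: the partition offsets it stores.
final case class ShuffleIndexInfo(offsets: Array[Long])

def parseIndexFile(f: File): ShuffleIndexInfo = {
  val in = new DataInputStream(new FileInputStream(f))
  try ShuffleIndexInfo(Array.fill((f.length / 8).toInt)(in.readLong()))
  finally in.close()
}

// LRU cache keyed by index file; least-recently-used entries are evicted.
val indexCache: LoadingCache[File, ShuffleIndexInfo] =
  CacheBuilder.newBuilder()
    .maximumSize(1024)
    .build(new CacheLoader[File, ShuffleIndexInfo] {
      override def load(f: File): ShuffleIndexInfo = parseIndexFile(f)
    })
{code}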
[jira] [Updated] (SPARK-12130) Replace shuffleManagerClass with shortShuffleMgrNames in ExternalShuffleBlockResolver
[ https://issues.apache.org/jira/browse/SPARK-12130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-12130: - Component/s: YARN Shuffle > Replace shuffleManagerClass with shortShuffleMgrNames in > ExternalShuffleBlockResolver > - > > Key: SPARK-12130 > URL: https://issues.apache.org/jira/browse/SPARK-12130 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Reporter: Lianhui Wang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12130) Replace shuffleManagerClass with shortShuffleMgrNames in ExternalShuffleBlockResolver
Lianhui Wang created SPARK-12130: Summary: Replace shuffleManagerClass with shortShuffleMgrNames in ExternalShuffleBlockResolver Key: SPARK-12130 URL: https://issues.apache.org/jira/browse/SPARK-12130 Project: Spark Issue Type: Bug Reporter: Lianhui Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11252) andrewor14
[ https://issues.apache.org/jira/browse/SPARK-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-11252: - Summary: andrewor14 (was: ExternalShuffleClient should release connection after it had completed to fetch blocks from yarn's NameManager) > andrewor14 > --- > > Key: SPARK-11252 > URL: https://issues.apache.org/jira/browse/SPARK-11252 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Reporter: Lianhui Wang > > Executors' ExternalShuffleClient keeps its connection to YARN's > NodeManager until the application has completed, so the NodeManager > ends up holding many socket connections. > In order to reduce network pressure on the NodeManager's shuffle service, the > connection to the shuffle service should be closed after registerWithShuffleServer > or fetchBlocks has completed in ExternalShuffleClient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11252) ShuffleClient should release connection after fetching blocks had been completed in yarn's external shuffle
[ https://issues.apache.org/jira/browse/SPARK-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-11252: - Summary: ShuffleClient should release connection after fetching blocks had been completed in yarn's external shuffle (was: andrewor14 ) > ShuffleClient should release connection after fetching blocks had been > completed in yarn's external shuffle > --- > > Key: SPARK-11252 > URL: https://issues.apache.org/jira/browse/SPARK-11252 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Reporter: Lianhui Wang > > Executors' ExternalShuffleClient keeps its connection to YARN's > NodeManager until the application has completed, so the NodeManager > ends up holding many socket connections. > In order to reduce network pressure on the NodeManager's shuffle service, the > connection to the shuffle service should be closed after registerWithShuffleServer > or fetchBlocks has completed in ExternalShuffleClient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11252) ExternalShuffleClient should release connection after it had completed to fetch blocks from yarn's NameManager
Lianhui Wang created SPARK-11252: Summary: ExternalShuffleClient should release connection after it had completed to fetch blocks from yarn's NameManager Key: SPARK-11252 URL: https://issues.apache.org/jira/browse/SPARK-11252 Project: Spark Issue Type: Bug Reporter: Lianhui Wang Executors' ExternalShuffleClient keeps its connection to YARN's NodeManager until the application has completed, so the NodeManager ends up holding many socket connections. In order to reduce network pressure on the NodeManager's shuffle service, the connection to the shuffle service should be closed after registerWithShuffleServer or fetchBlocks has completed in ExternalShuffleClient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
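The shape of the fix can be sketched with a stand-in client type (not Spark's real ExternalShuffleClient API): release the connection as soon as the fetch completes instead of holding it for the application's lifetime.

{code:scala}
trait ShuffleClient extends AutoCloseable {
  def fetchBlocks(host: String, port: Int, blockIds: Seq[String]): Unit
}

def fetchAndRelease(client: ShuffleClient, host: String, port: Int,
                    blocks: Seq[String]): Unit = {
  try client.fetchBlocks(host, port, blocks)
  finally client.close() // frees the NodeManager-side socket promptly
}
{code}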
[jira] [Created] (SPARK-11026) spark.yarn.user.classpath.first doesn't work for remote addJars
Lianhui Wang created SPARK-11026: Summary: spark.yarn.user.classpath.first doesn't work for remote addJars Key: SPARK-11026 URL: https://issues.apache.org/jira/browse/SPARK-11026 Project: Spark Issue Type: Bug Reporter: Lianhui Wang When spark.yarn.user.classpath.first=true and addJars is an HDFS path, we need to add YARN's linkName for the addJars entries to the system classpath. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10775) add search keywords in history page ui
Lianhui Wang created SPARK-10775: Summary: add search keywords in history page ui Key: SPARK-10775 URL: https://issues.apache.org/jira/browse/SPARK-10775 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Lianhui Wang Priority: Minor Add a search button to the history page UI so that we can search applications by keywords, for example: date or time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie
[ https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902459#comment-14902459 ] Lianhui Wang edited comment on SPARK-2666 at 9/22/15 12:22 PM: --- [~imranr] Thanks, I have taken a look at https://github.com/squito/spark/pull/4. This PR does not cancel the running tasks of a completed stage. I think that when one task set of a stage has finished in TaskSchedulerImpl.taskSetFinished, the running tasks of that stage should be killed. was (Author: lianhuiwang): [~imranr] Thanks, I have taken a look at https://github.com/squito/spark/pull/4, and I think its logic is right. > Always try to cancel running tasks when a stage is marked as zombie > --- > > Key: SPARK-2666 > URL: https://issues.apache.org/jira/browse/SPARK-2666 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Reporter: Lianhui Wang > > There are some situations in which the scheduler can mark a task set as a > "zombie" before the task set has completed all of its tasks. For example: > (a) When a task fails b/c of a {{FetchFailed}} > (b) When a stage completes because two different attempts create all the > ShuffleMapOutput, though no attempt has completed all its tasks (at least, > this *should* result in the task set being marked as zombie, see SPARK-10370) > (there may be others, I'm not sure if this list is exhaustive.) > Marking a taskset as zombie prevents any *additional* tasks from getting > scheduled; however, it does not cancel all currently running tasks. We should > cancel all running tasks to avoid wasting resources (and also to make the behavior > a little more clear to the end user). Rather than canceling tasks in each > case piecemeal, we should refactor the scheduler so that these two actions > are always taken together -- canceling tasks should go hand-in-hand with > marking the taskset as zombie. > Some implementation notes: > * We should change {{taskSetManager.isZombie}} to be private and put it > behind a method like {{markZombie}} or something. > * marking a stage as zombie before all the tasks have completed does *not* > necessarily mean the stage attempt has failed. In case (a), the stage > attempt has failed, but in case (b) we are not canceling b/c of a failure, > rather just b/c no more tasks are needed. > * {{taskScheduler.cancelTasks}} always marks the task set as zombie. > However, it also has some side-effects like logging that the stage has failed > and creating a {{TaskSetFailed}} event, which we don't want eg. in case (b) > when nothing has failed. So it may need some additional refactoring to go > along w/ {{markZombie}}. > * {{SchedulerBackend}}'s are free to not implement {{killTask}}, so we need > to be sure to catch the {{UnsupportedOperationException}}s > * Testing this *might* benefit from SPARK-10372 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie
[ https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902459#comment-14902459 ] Lianhui Wang commented on SPARK-2666: - [~imranr] Thanks, I have taken a look at https://github.com/squito/spark/pull/4, and I think its logic is right. It is OK except for the unit test. > Always try to cancel running tasks when a stage is marked as zombie > --- > > Key: SPARK-2666 > URL: https://issues.apache.org/jira/browse/SPARK-2666 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Reporter: Lianhui Wang > > There are some situations in which the scheduler can mark a task set as a > "zombie" before the task set has completed all of its tasks. For example: > (a) When a task fails b/c of a {{FetchFailed}} > (b) When a stage completes because two different attempts create all the > ShuffleMapOutput, though no attempt has completed all its tasks (at least, > this *should* result in the task set being marked as zombie, see SPARK-10370) > (there may be others, I'm not sure if this list is exhaustive.) > Marking a taskset as zombie prevents any *additional* tasks from getting > scheduled; however, it does not cancel all currently running tasks. We should > cancel all running tasks to avoid wasting resources (and also to make the behavior > a little more clear to the end user). Rather than canceling tasks in each > case piecemeal, we should refactor the scheduler so that these two actions > are always taken together -- canceling tasks should go hand-in-hand with > marking the taskset as zombie. > Some implementation notes: > * We should change {{taskSetManager.isZombie}} to be private and put it > behind a method like {{markZombie}} or something. > * marking a stage as zombie before all the tasks have completed does *not* > necessarily mean the stage attempt has failed. In case (a), the stage > attempt has failed, but in case (b) we are not canceling b/c of a failure, > rather just b/c no more tasks are needed. > * {{taskScheduler.cancelTasks}} always marks the task set as zombie. > However, it also has some side-effects like logging that the stage has failed > and creating a {{TaskSetFailed}} event, which we don't want eg. in case (b) > when nothing has failed. So it may need some additional refactoring to go > along w/ {{markZombie}}. > * {{SchedulerBackend}}'s are free to not implement {{killTask}}, so we need > to be sure to catch the {{UnsupportedOperationException}}s > * Testing this *might* benefit from SPARK-10372 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie
[ https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902459#comment-14902459 ] Lianhui Wang edited comment on SPARK-2666 at 9/22/15 12:11 PM: --- [~imranr] Thanks, I have taken a look at https://github.com/squito/spark/pull/4, and I think its logic is right. was (Author: lianhuiwang): [~imranr] Thanks, I have taken a look at https://github.com/squito/spark/pull/4, and I think its logic is right. It is OK except for the unit test. > Always try to cancel running tasks when a stage is marked as zombie > --- > > Key: SPARK-2666 > URL: https://issues.apache.org/jira/browse/SPARK-2666 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Reporter: Lianhui Wang > > There are some situations in which the scheduler can mark a task set as a > "zombie" before the task set has completed all of its tasks. For example: > (a) When a task fails b/c of a {{FetchFailed}} > (b) When a stage completes because two different attempts create all the > ShuffleMapOutput, though no attempt has completed all its tasks (at least, > this *should* result in the task set being marked as zombie, see SPARK-10370) > (there may be others, I'm not sure if this list is exhaustive.) > Marking a taskset as zombie prevents any *additional* tasks from getting > scheduled; however, it does not cancel all currently running tasks. We should > cancel all running tasks to avoid wasting resources (and also to make the behavior > a little more clear to the end user). Rather than canceling tasks in each > case piecemeal, we should refactor the scheduler so that these two actions > are always taken together -- canceling tasks should go hand-in-hand with > marking the taskset as zombie. > Some implementation notes: > * We should change {{taskSetManager.isZombie}} to be private and put it > behind a method like {{markZombie}} or something. > * marking a stage as zombie before all the tasks have completed does *not* > necessarily mean the stage attempt has failed. In case (a), the stage > attempt has failed, but in case (b) we are not canceling b/c of a failure, > rather just b/c no more tasks are needed. > * {{taskScheduler.cancelTasks}} always marks the task set as zombie. > However, it also has some side-effects like logging that the stage has failed > and creating a {{TaskSetFailed}} event, which we don't want eg. in case (b) > when nothing has failed. So it may need some additional refactoring to go > along w/ {{markZombie}}. > * {{SchedulerBackend}}'s are free to not implement {{killTask}}, so we need > to be sure to catch the {{UnsupportedOperationException}}s > * Testing this *might* benefit from SPARK-10372 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN if master not provided in command line
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629409#comment-14629409 ] Lianhui Wang commented on SPARK-8646: - Yes, when I use this command: ./bin/spark-submit ./pi.py yarn-client 10, YARN's client does not upload pyspark.zip, so it cannot work. I submitted a PR that resolves this problem based on the master branch. PySpark does not run on YARN if master not provided in command line --- Key: SPARK-8646 URL: https://issues.apache.org/jira/browse/SPARK-8646 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Environment: SPARK_HOME=local/path/to/spark1.4install/dir also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib Spark apps are submitted with the command: $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client data_transform contains a main method, and the rest of the args are parsed in my own code. Reporter: Juliet Hougland Attachments: executor.log, pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log Running pyspark jobs results in a 'no module named pyspark' error when run in yarn-client mode in Spark 1.4. [I believe this JIRA represents the change that introduced this error.| https://issues.apache.org/jira/browse/SPARK-6869 ] This does not represent a binary compatible change to Spark. Scripts that worked on previous Spark versions (i.e. commands that use spark-submit) should continue to work without modification between minor versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8646) PySpark does not run on YARN if master not provided in command line
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629409#comment-14629409 ] Lianhui Wang edited comment on SPARK-8646 at 7/16/15 8:25 AM: -- Yes, when I use this command: ./bin/spark-submit ./pi.py yarn-client 10, YARN's client does not upload pyspark.zip, so it cannot work. I submitted a PR that resolves this problem based on the master branch. There are some problems on the spark-1.4.0 branch because it finds the pyspark libraries in SparkSubmit, not in Client. was (Author: lianhuiwang): Yes, when I use this command: ./bin/spark-submit ./pi.py yarn-client 10, YARN's client does not upload pyspark.zip, so it cannot work. I submitted a PR that resolves this problem based on the master branch. PySpark does not run on YARN if master not provided in command line --- Key: SPARK-8646 URL: https://issues.apache.org/jira/browse/SPARK-8646 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Environment: SPARK_HOME=local/path/to/spark1.4install/dir also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib Spark apps are submitted with the command: $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client data_transform contains a main method, and the rest of the args are parsed in my own code. Reporter: Juliet Hougland Attachments: executor.log, pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log Running pyspark jobs results in a 'no module named pyspark' error when run in yarn-client mode in Spark 1.4. [I believe this JIRA represents the change that introduced this error.| https://issues.apache.org/jira/browse/SPARK-6869 ] This does not represent a binary compatible change to Spark. Scripts that worked on previous Spark versions (i.e. commands that use spark-submit) should continue to work without modification between minor versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8646) PySpark does not run on YARN if master not provided in command line
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629409#comment-14629409 ] Lianhui Wang edited comment on SPARK-8646 at 7/16/15 8:26 AM: -- Yes, when I use this command: ./bin/spark-submit ./pi.py yarn-client 10, YARN's client does not upload pyspark.zip, so it cannot work. I submitted a PR that resolves this problem based on the master branch. There are some problems on the spark-1.4.0 branch because it finds the pyspark libraries in SparkSubmit, not in Client. If this is needed in spark-1.4.0, I will take a look at it later. was (Author: lianhuiwang): Yes, when I use this command: ./bin/spark-submit ./pi.py yarn-client 10, YARN's client does not upload pyspark.zip, so it cannot work. I submitted a PR that resolves this problem based on the master branch. There are some problems on the spark-1.4.0 branch because it finds the pyspark libraries in SparkSubmit, not in Client. PySpark does not run on YARN if master not provided in command line --- Key: SPARK-8646 URL: https://issues.apache.org/jira/browse/SPARK-8646 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Environment: SPARK_HOME=local/path/to/spark1.4install/dir also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib Spark apps are submitted with the command: $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client data_transform contains a main method, and the rest of the args are parsed in my own code. Reporter: Juliet Hougland Attachments: executor.log, pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log Running pyspark jobs results in a 'no module named pyspark' error when run in yarn-client mode in Spark 1.4. [I believe this JIRA represents the change that introduced this error.| https://issues.apache.org/jira/browse/SPARK-6869 ] This does not represent a binary compatible change to Spark. Scripts that worked on previous Spark versions (i.e. commands that use spark-submit) should continue to work without modification between minor versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN if master not provided in command line
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629103#comment-14629103 ] Lianhui Wang commented on SPARK-8646:
-
Yes, when we set master=yarn-client in pyspark/SparkContext.py, it does not take effect, so spark-submit considers master=local. I will look at it.

PySpark does not run on YARN if master not provided in command line
---
Key: SPARK-8646
URL: https://issues.apache.org/jira/browse/SPARK-8646
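For illustration only, a sketch of the submission style this comment describes: a hypothetical pi_positional.py that takes the master as a positional argument (the old example style), so spark-submit itself never sees yarn-client and the YARN Client never ships pyspark.zip.

{code}
# pi_positional.py -- hypothetical illustration, not the shipped example.
# Taking the master from argv means only the application, not spark-submit,
# knows the job is meant for YARN.
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    master = sys.argv[1]             # e.g. "yarn-client", invisible to spark-submit
    sc = SparkContext(master, "Pi")  # master is set inside the app, too late
    # ... job body ...
    sc.stop()
{code}

Passing the master on the command line instead (./bin/spark-submit --master yarn-client ./pi.py 10) lets spark-submit upload the PySpark archives before the application starts.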
[jira] [Issue Comment Deleted] (SPARK-8646) PySpark does not run on YARN if master not provided in command line
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-8646:
Comment: was deleted (was: yes, when we set master=yarn-client on pyspark/SparkContext.py, it do not take effect. so spark-submit consider master=local. i will look at it. )

PySpark does not run on YARN if master not provided in command line
---
Key: SPARK-8646
URL: https://issues.apache.org/jira/browse/SPARK-8646
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624521#comment-14624521 ] Lianhui Wang commented on SPARK-8646:
-
[~juliet] From your spark1.4-verbose.log, I see that master=local[*]. So perhaps you configured spark.master=local in spark-defaults.conf? The other possibility is that your data_transform.py calls sparkConf.set("spark.master", "local"). Can you check whether either of these is the case?

PySpark does not run on YARN
Key: SPARK-8646
URL: https://issues.apache.org/jira/browse/SPARK-8646
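A minimal sketch of how to check, from inside the application, which of those sources actually won:

{code}
# Minimal check: print the master the context actually resolved, to confirm
# whether spark-defaults.conf or a sparkConf.set("spark.master", ...) call
# in the application is overriding the intended yarn-client value.
from pyspark import SparkConf, SparkContext

conf = SparkConf()            # also picks up spark-defaults.conf entries
sc = SparkContext(conf=conf)
print(sc.master)              # e.g. "local[*]" if a local setting won
sc.stop()
{code}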
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625725#comment-14625725 ] Lianhui Wang commented on SPARK-8646:
-
[~juliet] Can you provide your spark-submit command? I think the correct command in Spark 1.4 is:
$SPARK_HOME/bin/spark-submit --master yarn-client outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex4/
Is it the same as your command?

PySpark does not run on YARN
Key: SPARK-8646
URL: https://issues.apache.org/jira/browse/SPARK-8646
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14620466#comment-14620466 ] Lianhui Wang commented on SPARK-8646:
-
[~j_houg] Can you add --verbose to the spark-submit command and look at the value of spark.submit.pyArchives? From your logs, I see that the pyArchive files (pyspark.zip and py4j-0.8.2.1-src.zip) are not uploaded. You can also check whether these two zips exist under the SPARK_HOME/python/lib path.

PySpark does not run on YARN
Key: SPARK-8646
URL: https://issues.apache.org/jira/browse/SPARK-8646
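A small sketch of the suggested check, assuming SPARK_HOME is exported; the archive names are the ones mentioned in the comment above:

{code}
# Verify the PySpark archives exist under $SPARK_HOME/python/lib before
# submitting in yarn-client mode.
import os

lib = os.path.join(os.environ["SPARK_HOME"], "python", "lib")
for archive in ("pyspark.zip", "py4j-0.8.2.1-src.zip"):
    path = os.path.join(lib, archive)
    print(path, "exists" if os.path.exists(path) else "MISSING")
{code}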
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603973#comment-14603973 ] Lianhui Wang commented on SPARK-8646:
-
From [~juliet]'s logs, I think you are missing the Python 'pandas.algos' module, which pyspark does not provide. I think you need to install it on the nodes.

PySpark does not run on YARN
Key: SPARK-8646
URL: https://issues.apache.org/jira/browse/SPARK-8646
[jira] [Updated] (SPARK-6954) ExecutorAllocationManager can end up requesting a negative number of executors
[ https://issues.apache.org/jira/browse/SPARK-6954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-6954:
Summary: ExecutorAllocationManager can end up requesting a negative number of executors (was: tdwadmin)

ExecutorAllocationManager can end up requesting a negative number of executors
--
Key: SPARK-6954
URL: https://issues.apache.org/jira/browse/SPARK-6954
Project: Spark
Issue Type: Bug
Components: YARN
Affects Versions: 1.3.1
Reporter: Cheolsoo Park
Assignee: Sandy Ryza
Labels: yarn
Fix For: 1.3.2, 1.4.0
Attachments: with_fix.png, without_fix.png

I have a simple test case for dynamic allocation on YARN that fails with the following stack trace:
{code}
15/04/16 00:52:14 ERROR Utils: Uncaught exception in thread spark-dynamic-executor-allocation-0
java.lang.IllegalArgumentException: Attempted to request a negative number of executor(s) -21 from the cluster manager. Please specify a positive number!
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:338)
at org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1137)
at org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294)
at org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263)
at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230)
at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189)
at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
at org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}
My test is as follows:
# Start spark-shell with a single executor.
# Run a {{select count(\*)}} query. The number of executors rises as the input size is non-trivial.
# After the job finishes, the number of executors falls as most of them become idle.
# Rerun the same query again, and the request to add executors fails with the above error.
In fact, the job itself continues to run with whatever executors it already has, but it never gets more executors unless the shell is closed and restarted.
This error only happens when I configure {{executorIdleTimeout}} very small. For example, I can reproduce it with the following configs:
{code}
spark.dynamicAllocation.executorIdleTimeout 5
spark.dynamicAllocation.schedulerBacklogTimeout 5
{code}
Although I can simply increase {{executorIdleTimeout}} to something like 60 secs to avoid the error, I think this is still a bug to be fixed.
The root cause seems to be that {{numExecutorsPending}} accidentally becomes negative if executors are killed too aggressively (i.e. {{executorIdleTimeout}} is too small), because under that circumstance the new target # of executors can be smaller than the current # of executors. When that happens, {{ExecutorAllocationManager}} ends up trying to add a negative number of executors, which throws an exception.
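A toy Python sketch of that failure mode and the guard that avoids the exception; the names here are hypothetical simplifications, not Spark's actual fields:

{code}
# Toy model of the SPARK-6954 bug, not Spark's code. If idle executors are
# killed aggressively, the newly computed target can drop below what is
# already running plus pending, making the naive delta negative.
def executors_to_request(target, running, pending):
    delta = target - (running + pending)
    # Guard: never ask the cluster manager for a negative number of executors.
    return max(delta, 0)

# Without the max(), this would request -21 executors and throw.
print(executors_to_request(target=4, running=20, pending=5))  # prints 0
{code}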
[jira] [Created] (SPARK-8430) in Yarn's shuffle service ExternalShuffleBlockResolver should support UnsafeShuffleManager
Lianhui Wang created SPARK-8430:
---
Summary: in Yarn's shuffle service ExternalShuffleBlockResolver should support UnsafeShuffleManager
Key: SPARK-8430
URL: https://issues.apache.org/jira/browse/SPARK-8430
Project: Spark
Issue Type: Bug
Components: YARN
Reporter: Lianhui Wang
[jira] [Updated] (SPARK-8381) reuse typeConvert when convert Seq[Row] to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-8381:
Description: The method CatalystTypeConverters.convertToCatalyst is slow, so for batch conversion we should be using the converter produced by createToCatalystConverter. (was: This method CatalystTypeConverters.convertToCatalyst is slow, and for batch conversion we should be using converter produced by createToCatalystConverter.)

reuse typeConvert when convert Seq[Row] to catalyst type
Key: SPARK-8381
URL: https://issues.apache.org/jira/browse/SPARK-8381
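The proposed pattern, sketched in Python with hypothetical names (the real code is Scala's CatalystTypeConverters): resolve the converter once per batch rather than once per row.

{code}
# Hypothetical illustration of reusing a converter for batch conversion.
def create_converter(data_type):
    # One-time dispatch on the type, returning a specialized function;
    # stands in for createToCatalystConverter.
    return {"int": int, "string": str}[data_type]

rows = ["1", "2", "3"]

# Slow shape: the type dispatch happens again for every row, the way a
# per-element convertToCatalyst call would.
slow = [create_converter("int")(row) for row in rows]

# Fast shape: hoist the dispatch out of the loop and reuse the converter.
converter = create_converter("int")
fast = [converter(row) for row in rows]

assert slow == fast == [1, 2, 3]
{code}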
[jira] [Created] (SPARK-8381) reuse-typeConvert when convert Seq[Row] to CatalystType
Lianhui Wang created SPARK-8381:
---
Summary: reuse-typeConvert when convert Seq[Row] to CatalystType
Key: SPARK-8381
URL: https://issues.apache.org/jira/browse/SPARK-8381
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Lianhui Wang

The method CatalystTypeConverters.convertToCatalyst is slow, and for batch conversion we should be using the converter produced by createToCatalystConverter.
[jira] [Updated] (SPARK-8381) reuse typeConvert when convert Seq[Row] to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-8381:
Summary: reuse typeConvert when convert Seq[Row] to catalyst type (was: reuse-typeConvert when convert Seq[Row] to catalyst type)

reuse typeConvert when convert Seq[Row] to catalyst type
Key: SPARK-8381
URL: https://issues.apache.org/jira/browse/SPARK-8381
[jira] [Updated] (SPARK-8381) reuse-typeConvert when convert Seq[Row] to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-8381:
Summary: reuse-typeConvert when convert Seq[Row] to catalyst type (was: reuse-typeConvert when convert Seq[Row] to CatalystType)

reuse-typeConvert when convert Seq[Row] to catalyst type
Key: SPARK-8381
URL: https://issues.apache.org/jira/browse/SPARK-8381
[jira] [Commented] (SPARK-6700) flaky test: run Python application in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14480988#comment-14480988 ] Lianhui Wang commented on SPARK-6700:
-
I do not think this is related to SPARK-6506, because YarnClusterSuite sets SPARK_HOME. I just ran the YarnClusterSuite test, and the Python application test in YarnClusterSuite passed. [~davies] can you post your unit-test.log or appMaster.log?

flaky test: run Python application in yarn-cluster mode
Key: SPARK-6700
URL: https://issues.apache.org/jira/browse/SPARK-6700
Project: Spark
Issue Type: Bug
Components: Tests
Reporter: Davies Liu
Assignee: Lianhui Wang
Priority: Critical
Labels: test, yarn

org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-cluster mode
Failing for the past 1 build (Since Failed#2025 ) Took 12 sec.
Error Message
{code}
Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties, --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp) exited with code 1
Stacktrace
sbt.ForkMain$ForkError: Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties, --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp) exited with code 1
at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122)
at org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259)
at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160)
at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
at org.scalatest.Suite$class.run(Suite.scala:1424)
at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at
{code}
[jira] [Comment Edited] (SPARK-6700) flaky test: run Python application in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14480988#comment-14480988 ] Lianhui Wang edited comment on SPARK-6700 at 4/6/15 6:49 AM:
-
I do not think this is related to SPARK-6506, because YarnClusterSuite sets SPARK_HOME. I just ran the YarnClusterSuite test, and the Python application test in YarnClusterSuite passed. [~davies] can you post your unit-test.log or appMaster.log? In addition, I think you can try again, because there may be other errors causing it to fail.

was (Author: lianhuiwang): i do not think this is related to SPARK-6506 because YarnClusterSuite setted SPARK_HOME. Just now i run YarnClusterSuite test,but i got python application test in YarnClusterSuite is successfully.[~davies] can you report your unit-test.log or appMaster.log?

flaky test: run Python application in yarn-cluster mode
Key: SPARK-6700
URL: https://issues.apache.org/jira/browse/SPARK-6700
[jira] [Comment Edited] (SPARK-6506) python support yarn cluster mode requires SPARK_HOME to be set
[ https://issues.apache.org/jira/browse/SPARK-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14381844#comment-14381844 ] Lianhui Wang edited comment on SPARK-6506 at 3/26/15 1:18 PM:
--
Hi [~tgraves], I am running 1.3.0. If I do not set SPARK_HOME on every node, I get the following exception in every executor:
Error from python worker: /usr/bin/python: No module named pyspark
PYTHONPATH was: /data/yarnenv/local/usercache/lianhui/filecache/296/spark-assembly-1.3.0-hadoop2.2.0.jar/python
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
From the exception, I can see that the pyspark bundled in spark.jar on the NodeManager cannot be used, and I do not know why the pyspark inside spark.jar cannot be used. [~andrewor14] can you help me? So I think for now we should add the Spark dirs to PYTHONPATH or set SPARK_HOME on every node.

was (Author: lianhuiwang): hi [~tgraves] I use 1.3.0 to run. if i donot set SPARK_HOME at every node, i get the following exception in every executor: Error from python worker: /usr/bin/python: No module named pyspark PYTHONPATH was: /data/yarnenv/local/usercache/lianhui/filecache/296/spark-assembly-1.3.0-hadoop2.2.0.jar/python java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) from the exception, i can find that pyspark of the spark.jar in nodeManager cannot be worked. and i donot know why it is. [~andrewor14] can you help me? so i think now we should put spark dirs to PYTHONPATH or SPARK_HOME at every node.

python support yarn cluster mode requires SPARK_HOME to be set
--
Key: SPARK-6506
URL: https://issues.apache.org/jira/browse/SPARK-6506
Project: Spark
Issue Type: Bug
Components: YARN
Affects Versions: 1.3.0
Reporter: Thomas Graves

We added support for python running in yarn cluster mode in https://issues.apache.org/jira/browse/SPARK-5173, but it requires that SPARK_HOME be set in the environment variables for application master and executor. It doesn't have to be set to anything real, but it fails if it's not set. See the command at the end of: https://github.com/apache/spark/pull/3976
[jira] [Commented] (SPARK-6506) python support yarn cluster mode requires SPARK_HOME to be set
[ https://issues.apache.org/jira/browse/SPARK-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14381844#comment-14381844 ] Lianhui Wang commented on SPARK-6506:
-
Hi [~tgraves], I am running 1.3.0. If I do not set SPARK_HOME on every node, I get the following exception in every executor:
Error from python worker: /usr/bin/python: No module named pyspark
PYTHONPATH was: /data/yarnenv/local/usercache/lianhui/filecache/296/spark-assembly-1.3.0-hadoop2.2.0.jar/python
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
From the exception, I can see that the pyspark bundled in spark.jar on the NodeManager cannot be used, and I do not know why that is. [~andrewor14] can you help me? So I think for now we should set SPARK_HOME to the Spark dirs on every node.

python support yarn cluster mode requires SPARK_HOME to be set
Key: SPARK-6506
URL: https://issues.apache.org/jira/browse/SPARK-6506
[jira] [Comment Edited] (SPARK-6506) python support yarn cluster mode requires SPARK_HOME to be set
[ https://issues.apache.org/jira/browse/SPARK-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14381844#comment-14381844 ] Lianhui Wang edited comment on SPARK-6506 at 3/26/15 1:17 PM:
--
Hi [~tgraves], I am running 1.3.0. If I do not set SPARK_HOME on every node, I get the following exception in every executor:
Error from python worker: /usr/bin/python: No module named pyspark
PYTHONPATH was: /data/yarnenv/local/usercache/lianhui/filecache/296/spark-assembly-1.3.0-hadoop2.2.0.jar/python
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
From the exception, I can see that the pyspark bundled in spark.jar on the NodeManager cannot be used, and I do not know why that is. [~andrewor14] can you help me? So I think for now we should add the Spark dirs to PYTHONPATH or set SPARK_HOME on every node.

was (Author: lianhuiwang): hi [~tgraves] I use 1.3.0 to run. if i donot set SPARK_HOME at every node, i get the following exception in every executor: Error from python worker: /usr/bin/python: No module named pyspark PYTHONPATH was: /data/yarnenv/local/usercache/lianhui/filecache/296/spark-assembly-1.3.0-hadoop2.2.0.jar/python java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) from the exception, i can find that pyspark of the spark.jar in nodeManager cannot be worked. and i donot know why it is. [~andrewor14] can you help me? so i think now we should put spark dirs to SPARK_HOME at every node.

python support yarn cluster mode requires SPARK_HOME to be set
Key: SPARK-6506
URL: https://issues.apache.org/jira/browse/SPARK-6506
[jira] [Created] (SPARK-6103) remove unused class to import in EdgeRDDImpl
Lianhui Wang created SPARK-6103:
---
Summary: remove unused class to import in EdgeRDDImpl
Key: SPARK-6103
URL: https://issues.apache.org/jira/browse/SPARK-6103
Project: Spark
Issue Type: Bug
Components: GraphX
Reporter: Lianhui Wang
[jira] [Comment Edited] (SPARK-6056) Unlimit offHeap memory use cause RM killing the container
[ https://issues.apache.org/jira/browse/SPARK-6056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341498#comment-14341498 ] Lianhui Wang edited comment on SPARK-6056 at 2/28/15 12:36 PM:
---
[~carlmartin] What is your executor's memory? It is OK when I run your test using 1g. In addition, I used the Spark 1.2.0 release version. [~adav] yes, thanks.

was (Author: lianhuiwang): [~carlmartin] what is your executor's memory? that is ok when i run your test using 1g. in addition, i use spark 1.2.0 release version.

Unlimit offHeap memory use cause RM killing the container
-
Key: SPARK-6056
URL: https://issues.apache.org/jira/browse/SPARK-6056
Project: Spark
Issue Type: Bug
Components: Shuffle, Spark Core
Affects Versions: 1.2.1
Reporter: SaintBacchus

No matter whether we set `preferDirectBufs` or limit the number of threads, Spark cannot limit the use of off-heap memory. At line 269 of the class 'AbstractNioByteChannel' in netty-4.0.23.Final, Netty allocates an off-heap memory buffer of the same size as the one in the heap. So however many buffers you want to transfer, the same amount of off-heap memory will be allocated. But once the allocated memory size reaches the overhead memory capacity set in YARN, the executor will be killed. I wrote simple code to test it:
```scala
val bufferRdd = sc.makeRDD(0 to 10, 10).map(x => new Array[Byte](10*1024*1024)).persist
bufferRdd.count
val part = bufferRdd.partitions(0)
val sparkEnv = SparkEnv.get
val blockMgr = sparkEnv.blockManager
val blockOption = blockMgr.get(RDDBlockId(bufferRdd.id, part.index))
val resultIt = blockOption.get.data.asInstanceOf[Iterator[Array[Byte]]]
val len = resultIt.map(_.length).sum
```
If multiple threads are used to compute len, the physical memory will soon exceed the limit set by spark.yarn.executor.memoryOverhead.
[jira] [Commented] (SPARK-6056) Unlimit offHeap memory use cause RM killing the container
[ https://issues.apache.org/jira/browse/SPARK-6056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341498#comment-14341498 ] Lianhui Wang commented on SPARK-6056:
-
[~carlmartin] What is your executor's memory? It is OK when I run your test using 1g. In addition, I used the Spark 1.2.0 release version.

Unlimit offHeap memory use cause RM killing the container
Key: SPARK-6056
URL: https://issues.apache.org/jira/browse/SPARK-6056
[jira] [Commented] (SPARK-6056) Unlimit offHeap memory use cause RM killing the container
[ https://issues.apache.org/jira/browse/SPARK-6056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341260#comment-14341260 ] Lianhui Wang commented on SPARK-6056:
-
[~adav] From the information you gave, when preferDirectBufs is set to false, I think we also need to set io.netty.allocator.numDirectArenas=0 to avoid direct allocation, right? [~carlmartin] I think you can try the method [~adav] described before.

Unlimit offHeap memory use cause RM killing the container
Key: SPARK-6056
URL: https://issues.apache.org/jira/browse/SPARK-6056
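A sketch of the combined settings discussed in this thread; whether they fully bound off-heap allocation depends on the Netty version, so treat this as a starting point rather than a guaranteed fix:

{code}
# Sketch of the workaround discussed above (PySpark syntax for brevity).
from pyspark import SparkConf

conf = (SparkConf()
        # Prefer heap buffers in the shuffle transport.
        .set("spark.shuffle.io.preferDirectBufs", "false")
        # Disable Netty's pooled direct arenas entirely.
        .set("spark.executor.extraJavaOptions",
             "-Dio.netty.allocator.numDirectArenas=0")
        # Leave headroom (MB) for whatever off-heap use remains.
        .set("spark.yarn.executor.memoryOverhead", "1024"))
{code}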
[jira] [Updated] (SPARK-5763) Sort-based Groupby and Join to resolve skewed data
[ https://issues.apache.org/jira/browse/SPARK-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-5763:
Description: SPARK-4644 provides a way to resolve skewed data, but when more of the keys are skewed, I think the approach in SPARK-4644 is inappropriate. So we can use sort-merge to resolve skewed group-by and skewed join. Because SPARK-2926 implements merge-sort, we can implement sort-merge for skew on top of SPARK-2926. I have implemented sort-merge-groupby and it performs very well on skewed data in my tests. Later I will implement sort-merge-join to resolve skewed joins. [~rxin] [~sandyr] [~andrewor14] what are your opinions on this?
(was: In SPARK-4644, it provide a way to resolve skewed data. But when we has more keys that are skewed, I think that the way in SPARK-4644 is inappropriate. So we can use sort-merge to resolve skewed-groupby and skewed-join.because SPARK-2926 implement merge-sort, we can implement sort-merge for skewed based on SPARK-2926. And i have implemented sort-merge-groupby and it is very well for skewed data in my test.Later i will implement sort-merge-join to resolve skewed-join. [~rxin] [~sandyr] [~andrewor14] how about your opinion about this?)

Sort-based Groupby and Join to resolve skewed data
--
Key: SPARK-5763
URL: https://issues.apache.org/jira/browse/SPARK-5763
Project: Spark
Issue Type: Improvement
Reporter: Lianhui Wang
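A toy illustration of why sort-merge handles skew: once records arrive sorted by key, each group can be reduced in one streaming pass instead of buffering every value of a hot key in a hash map (Python here purely for brevity; the proposal itself targets Spark's Scala shuffle machinery):

{code}
# Toy sort-based group-by: O(1) extra memory per group, even for hot keys.
from itertools import groupby
from operator import itemgetter

records = [("hot", 1)] * 5 + [("rare", 2), ("hot", 3)]
records.sort(key=itemgetter(0))  # in Spark, sorted runs would come from SPARK-2926

for key, group in groupby(records, key=itemgetter(0)):
    print(key, sum(v for _, v in group))  # hot 8, rare 2
{code}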
[jira] [Updated] (SPARK-5763) Sort-based Groupby and Join to resolve skewed data
[ https://issues.apache.org/jira/browse/SPARK-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-5763:
Description: SPARK-4644 provides a way to resolve skewed data, but when more of the keys are skewed, I think the approach in SPARK-4644 is inappropriate. So we can use sort-merge to resolve skewed group-by and skewed join. Because SPARK-2926 implements merge-sort, we can implement sort-merge for skew on top of SPARK-2926. I have implemented sort-merge-groupby and it performs very well on skewed data in my tests. Later I will implement sort-merge-join to resolve skewed joins. [~rxin] [~sandyr] [~andrewor14] what is your opinion on this?
(was: In SPARK-4644, it provide a way to resolve skewed data. But when we has more keys that are skewed, I think that the way in SPARK-4644 is inappropriate. So we can use sort-merge to resolve skewed-groupby and skewed-join.because SPARK-2926 implement merge-sort, we can implement sort-merge for skewed based on SPARK-2926. And i have implemented sort-merge-groupby and it is very well for skewed data in my test.Later i will implement sort-merge-join to resolve skewed-join.)

Sort-based Groupby and Join to resolve skewed data
Key: SPARK-5763
URL: https://issues.apache.org/jira/browse/SPARK-5763
[jira] [Comment Edited] (SPARK-5721) Propagate missing external shuffle service errors to client
[ https://issues.apache.org/jira/browse/SPARK-5721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317895#comment-14317895 ] Lianhui Wang edited comment on SPARK-5721 at 2/12/15 9:39 AM:
--
Yes, I think it is the same as SPARK-5759. Now, in https://github.com/apache/spark/pull/4554, I log the container ID and host name for the exception.

was (Author: lianhuiwang): yes, i think it is same as SPARK-5759. now in https://github.com/apache/spark/pull/4554 i report the container ID and host name for exception.

Propagate missing external shuffle service errors to client
---
Key: SPARK-5721
URL: https://issues.apache.org/jira/browse/SPARK-5721
Project: Spark
Issue Type: Bug
Components: Spark Core, YARN
Reporter: Kostas Sakellis

When spark.shuffle.service.enabled=true, the YARN AM expects to find an aux service running in the NodeManager. If it cannot find one, an exception like this appears in the app master logs:
{noformat}
Exception in thread ContainerLauncher #0 Exception in thread ContainerLauncher #1 java.lang.Error: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:168)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:206)
at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:110)
at org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:65)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
... 2 more
java.lang.Error: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:168)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:206)
at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:110)
at org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:65)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
... 2 more
{noformat}
We should propagate this error to the driver (in yarn-client mode) because it is otherwise unclear why the expected number of executors is not starting up.
[jira] [Created] (SPARK-5763) Sort-based Groupby and Join to resolve skewed data
Lianhui Wang created SPARK-5763:
---
Summary: Sort-based Groupby and Join to resolve skewed data
Key: SPARK-5763
URL: https://issues.apache.org/jira/browse/SPARK-5763
Project: Spark
Issue Type: Improvement
Reporter: Lianhui Wang

SPARK-4644 provides a way to resolve skewed data, but when more of the keys are skewed, I think the approach in SPARK-4644 is inappropriate. So we can use sort-merge to resolve skewed group-by and skewed join. Because SPARK-2926 implements merge-sort, we can implement sort-merge for skew on top of SPARK-2926. I have implemented sort-merge-groupby and it performs very well on skewed data in my tests. Later I will implement sort-merge-join to resolve skewed joins.
[jira] [Created] (SPARK-5759) ExecutorRunnable should catch YarnException while NMClient start container
Lianhui Wang created SPARK-5759:
---
Summary: ExecutorRunnable should catch YarnException while NMClient start container
Key: SPARK-5759
URL: https://issues.apache.org/jira/browse/SPARK-5759
Project: Spark
Issue Type: Bug
Components: YARN
Reporter: Lianhui Wang

Sometimes, for various reasons, an exception is thrown while NMClient starts a container. For example, if we do not configure spark_shuffle on some machines, it throws: java.lang.Error: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist. Because YarnAllocator uses a ThreadPoolExecutor to start containers, we cannot tell which container or hostname threw the exception. I think we should catch YarnException in ExecutorRunnable when starting the container; if an exception occurs, we can then report the container ID and hostname of the failed container.
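Pseudologic for the proposed handling, sketched in Python with hypothetical names (ExecutorRunnable itself is Scala):

{code}
# Hypothetical sketch: catch the start-container failure where the container
# and host are still in scope, so the log line can identify the failed machine.
import logging

def start_container(nm_client, container_id, host):
    try:
        nm_client.start_container(container_id)
    except Exception as exc:  # stand-in for YarnException
        logging.error("Failed to start container %s on host %s: %s",
                      container_id, host, exc)
        raise
{code}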
[jira] [Comment Edited] (SPARK-5227) InputOutputMetricsSuite input metrics when reading text file with multiple splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 profiles
[ https://issues.apache.org/jira/browse/SPARK-5227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312163#comment-14312163 ] Lianhui Wang edited comment on SPARK-5227 at 2/9/15 12:16 PM: -- The function that computes the split size in Hadoop's FileInputFormat is:
{code}
protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
  return Math.max(minSize, Math.min(goalSize, blockSize));
}
{code}
where long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits), minSize = 1, and blockSize = getConf().getLong("fs.local.block.size", 32 * 1024 * 1024). So when numSplits = 2 and totalSize is very large, goalSize > blockSize. In that case the actual splitSize is blockSize, not goalSize, and the error seen in this PR occurs.

was (Author: lianhuiwang): The split size in Hadoop's FileInputFormat:
{code}
protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
  return Math.max(minSize, Math.min(goalSize, blockSize));
}
{code}
where long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits), minSize = 1, and blockSize = getConf().getLong("fs.local.block.size", 32 * 1024 * 1024). So when numSplits = 2 and totalSize is very large, goalSize > blockSize. In that case the actual splitSize is blockSize, not goalSize, and the error seen in this PR occurs.

InputOutputMetricsSuite input metrics when reading text file with multiple splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 profiles - Key: SPARK-5227 URL: https://issues.apache.org/jira/browse/SPARK-5227 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: Josh Rosen Priority: Blocker Labels: flaky-test The InputOutputMetricsSuite "input metrics when reading text file with multiple splits" test has been failing consistently in our new {{branch-1.2}} Jenkins SBT build: https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.2-SBT/14/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/testReport/junit/org.apache.spark.metrics/InputOutputMetricsSuite/input_metrics_when_reading_text_file_with_multiple_splits/ Here's the error message:
{code}
ArrayBuffer(32, 32, 32, ... [several hundred more repeated 32s; the listing is truncated in the original message] ...
{code}
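To make the split-size arithmetic above concrete, here is a minimal self-contained Scala sketch; the 512 MB input size and the object name are assumptions for illustration, not Hadoop or Spark code:
{code}
// Sketch reproducing FileInputFormat's split-size arithmetic (hypothetical values).
object SplitSizeDemo {
  // Same formula as Hadoop's computeSplitSize.
  def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
    math.max(minSize, math.min(goalSize, blockSize))

  def main(args: Array[String]): Unit = {
    val totalSize = 512L * 1024 * 1024   // assumed input size: 512 MB
    val numSplits = 2                    // splits requested by the caller
    val minSize   = 1L
    val blockSize = 32L * 1024 * 1024    // fs.local.block.size default: 32 MB

    val goalSize  = totalSize / (if (numSplits == 0) 1 else numSplits)  // 256 MB
    val splitSize = computeSplitSize(goalSize, minSize, blockSize)      // 32 MB

    // goalSize > blockSize, so blockSize wins: 16 splits instead of the 2 requested.
    println(s"goalSize=$goalSize splitSize=$splitSize actualSplits=${totalSize / splitSize}")
  }
}
{code}
With these assumed numbers, Math.min picks the 32 MB block size over the 256 MB goal size, so the file is read in 16 splits rather than the 2 requested, which is the situation the comment describes.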
[jira] [Created] (SPARK-5687) TaskResultGetter needs to catch OutOfMemoryError and report failure when it cannot fetch results.
Lianhui Wang created SPARK-5687: --- Summary: TaskResultGetter needs to catch OutOfMemoryError and report failure when it cannot fetch results. Key: SPARK-5687 URL: https://issues.apache.org/jira/browse/SPARK-5687 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Lianhui Wang In enqueueSuccessfulTask a separate thread fetches the task result; if the result is very large, that thread may throw an OutOfMemoryError. If we do not catch the OutOfMemoryError, the DAGScheduler never learns the status of the task. A second case: when totalResultSize > maxResultSize and the large result cannot be fetched, we need to report a failed status to the DAGScheduler; if no status is reported for the task, the DAGScheduler likewise never learns its status.
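As a rough sketch of the first case above (catching the OutOfMemoryError so the scheduler is told the task failed), consider the following Scala outline; Scheduler, taskSucceeded, and taskFailed are hypothetical stand-ins, not Spark's actual TaskResultGetter API:
{code}
import java.util.concurrent.Executors

object ResultFetcherSketch {
  // Hypothetical scheduler interface standing in for the DAGScheduler callbacks.
  trait Scheduler {
    def taskSucceeded(taskId: Long, result: Array[Byte]): Unit
    def taskFailed(taskId: Long, reason: String): Unit
  }

  private val pool = Executors.newFixedThreadPool(4)

  // Fetch the task result on a separate thread, as enqueueSuccessfulTask does.
  def enqueueSuccessfulTask(scheduler: Scheduler, taskId: Long)(fetch: () => Array[Byte]): Unit =
    pool.execute(new Runnable {
      override def run(): Unit =
        try {
          // Deserializing a very large result can exhaust the heap here.
          scheduler.taskSucceeded(taskId, fetch())
        } catch {
          // Without this catch the fetcher thread dies silently and the
          // scheduler never learns the task's status.
          case oom: OutOfMemoryError =>
            scheduler.taskFailed(taskId, s"OutOfMemoryError while fetching result: ${oom.getMessage}")
        }
    })
}
{code}
The same reporting path could also serve the second case: when totalResultSize > maxResultSize, calling taskFailed instead of silently dropping the result would let the scheduler mark the task as failed.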
[jira] [Commented] (SPARK-5227) InputOutputMetricsSuite input metrics when reading text file with multiple splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 profiles
[ https://issues.apache.org/jira/browse/SPARK-5227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312163#comment-14312163 ] Lianhui Wang commented on SPARK-5227: - The split size in Hadoop's FileInputFormat:
{code}
protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
  return Math.max(minSize, Math.min(goalSize, blockSize));
}
{code}
where long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits), minSize = 1, and blockSize = getConf().getLong("fs.local.block.size", 32 * 1024 * 1024). So when numSplits = 2 and totalSize is very large, goalSize > blockSize. In that case the actual splitSize is blockSize, not goalSize, and the error seen in this PR occurs.

InputOutputMetricsSuite input metrics when reading text file with multiple splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 profiles - Key: SPARK-5227 URL: https://issues.apache.org/jira/browse/SPARK-5227 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: Josh Rosen Priority: Blocker Labels: flaky-test The InputOutputMetricsSuite "input metrics when reading text file with multiple splits" test has been failing consistently in our new {{branch-1.2}} Jenkins SBT build: https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.2-SBT/14/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/testReport/junit/org.apache.spark.metrics/InputOutputMetricsSuite/input_metrics_when_reading_text_file_with_multiple_splits/ Here's the error message:
{code}
ArrayBuffer(32, 32, 32, ... [several hundred more repeated 32s; the listing is truncated in the original message] ...
{code}
[jira] [Updated] (SPARK-5687) TaskResultGetter needs to catch OutOfMemoryError.
[ https://issues.apache.org/jira/browse/SPARK-5687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-5687: Description: In enqueueSuccessfulTask a separate thread fetches the task result; if the result is very large, that thread may throw an OutOfMemoryError. If we do not catch the OutOfMemoryError, the DAGScheduler never learns the status of the task. (was: In enqueueSuccessfulTask a separate thread fetches the task result; if the result is very large, that thread may throw an OutOfMemoryError. If we do not catch the OutOfMemoryError, the DAGScheduler never learns the status of the task. A second case: when totalResultSize > maxResultSize and the large result cannot be fetched, we need to report a failed status to the DAGScheduler; if no status is reported for the task, the DAGScheduler likewise never learns its status.) TaskResultGetter needs to catch OutOfMemoryError. --- Key: SPARK-5687 URL: https://issues.apache.org/jira/browse/SPARK-5687 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Lianhui Wang In enqueueSuccessfulTask a separate thread fetches the task result; if the result is very large, that thread may throw an OutOfMemoryError. If we do not catch the OutOfMemoryError, the DAGScheduler never learns the status of the task.