[jira] [Updated] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2021-04-16 Thread Lianhui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-30602:
-
Description: 
In a large deployment of Spark compute infrastructure, shuffle is becoming a 
potential scaling bottleneck and a source of inefficiency in the cluster. In 
large-scale Spark-on-YARN deployments, people usually enable the Spark external 
shuffle service and store the intermediate shuffle files on HDD. The number of 
blocks generated for a particular shuffle grows quadratically with the size of 
the shuffled data (the numbers of mappers and reducers grow linearly with the 
data size, but the number of blocks is # mappers * # reducers), so one general 
trend we have observed is that the more data a Spark application processes, the 
smaller the block size becomes. In a few production clusters we have seen, the 
average shuffle block size is only tens of KBs. Because random reads of small 
amounts of data on HDD are inefficient, the overall efficiency of the Spark 
external shuffle services serving shuffle blocks degrades as an increasing 
number of Spark applications process an increasing amount of data. In addition, 
because the external shuffle service is a shared service in a multi-tenant 
cluster, the inefficiency caused by one Spark application can propagate to other 
applications as well.

In this ticket, we propose push-based shuffle to improve Spark shuffle 
efficiency in the environments described above. With push-based shuffle, shuffle 
blocks are pushed at the end of the map tasks and get pre-merged as they move 
towards the reducers. In our prototype implementation, we have seen significant 
efficiency improvements when performing large shuffles. We take a Spark-native 
approach: we extend Spark's existing Netty-based shuffle protocol and the 
behaviors of Spark mappers, reducers, and the driver. This way, we can bring the 
benefits of more efficient shuffle to Spark without incurring the dependency or 
overhead of either a specialized storage layer or external infrastructure pieces.

 

Link to dev mailing list discussion: 
[http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html]
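
To make the quadratic block-count growth concrete, here is a small illustrative 
Scala sketch; the job sizes and mapper/reducer counts below are hypothetical, not 
numbers taken from the proposal:

{code:scala}
object ShuffleBlockMath {
  def main(args: Array[String]): Unit = {
    // Hypothetical jobs: (total shuffled bytes, # mappers, # reducers).
    // Mappers and reducers grow roughly linearly with data size, so the
    // block count (# mappers * # reducers) grows quadratically.
    val jobs = Seq(
      (100L * 1024 * 1024 * 1024, 1000, 1000),    // ~100 GB shuffled
      (1024L * 1024 * 1024 * 1024, 10000, 10000)  // ~1 TB shuffled
    )
    for ((bytes, mappers, reducers) <- jobs) {
      val blocks = mappers.toLong * reducers
      val avgBlockKb = bytes.toDouble / blocks / 1024
      println(f"shuffled=${bytes / (1L << 30)} GB, blocks=$blocks%,d, avg block=$avgBlockKb%.1f KB")
    }
  }
}
{code}

At the 1 TB scale the average block is roughly 10 KB, which is exactly the regime 
where random HDD reads through the external shuffle service become the bottleneck.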


> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>  Labels: release-notes
> Attachments: Screen Shot 

[jira] [Updated] (SPARK-20986) Reset table's statistics after PruneFileSourcePartitions rule.

2017-06-05 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-20986:
-
Description: After the PruneFileSourcePartitions rule runs, the table's 
statistics need to be reset, because the rule can filter out unnecessary 
partitions and the original statistics no longer reflect the data that will 
actually be read.  (was: After 
PruneFileSourcePartitions rule, It needs reset table's statistics because 
PruneFileSourcePartitions can filter some unnecessary partitions. So the 
statistics need to be changed.)

> Reset table's statistics after PruneFileSourcePartitions rule.
> --
>
> Key: SPARK-20986
> URL: https://issues.apache.org/jira/browse/SPARK-20986
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Lianhui Wang
>Assignee: Apache Spark
>
> After the PruneFileSourcePartitions rule runs, the table's statistics need to 
> be reset, because the rule can filter out unnecessary partitions and the 
> original statistics no longer reflect the data that will actually be read.
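
As a rough illustration of the intended behavior (toy types only; this is not 
Spark's CatalogStatistics or the actual Catalyst rule), the statistics after 
pruning should be recomputed from the retained partitions:

{code:scala}
// Toy model of the idea; these are not Spark's internal types.
case class PartitionInfo(spec: String, sizeInBytes: Long)
case class TableStats(sizeInBytes: Long)

object PrunedStats {
  // Statistics recomputed from only the partitions that survive pruning.
  def afterPruning(retained: Seq[PartitionInfo]): TableStats =
    TableStats(sizeInBytes = retained.map(_.sizeInBytes).sum)

  def main(args: Array[String]): Unit = {
    val all = Seq(
      PartitionInfo("dt=2017-06-01", 40L << 20),
      PartitionInfo("dt=2017-06-02", 60L << 20),
      PartitionInfo("dt=2017-06-03", 50L << 20)
    )
    // e.g. a query with WHERE dt = '2017-06-03' keeps only one partition
    val retained = all.filter(_.spec == "dt=2017-06-03")
    println(s"before pruning: ${afterPruning(all)}, after pruning: ${afterPruning(retained)}")
  }
}
{code}

With the reduced sizeInBytes, downstream decisions such as converting a join to a 
broadcast join see the pruned size rather than the full table size.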






[jira] [Created] (SPARK-20986) Reset table's statistics after PruneFileSourcePartitions rule.

2017-06-05 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-20986:


 Summary: Reset table's statistics after PruneFileSourcePartitions 
rule.
 Key: SPARK-20986
 URL: https://issues.apache.org/jira/browse/SPARK-20986
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Lianhui Wang


After the PruneFileSourcePartitions rule runs, the table's statistics need to be 
reset, because the rule can filter out unnecessary partitions and the original 
statistics no longer reflect the data that will actually be read.






[jira] [Updated] (SPARK-15616) CatalogRelation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2017-06-04 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-15616:
-
Summary: CatalogRelation should fallback to HDFS size of partitions that 
are involved in Query if statistics are not available.  (was: Metastore 
relation should fallback to HDFS size of partitions that are involved in Query 
if statistics are not available.)

> CatalogRelation should fallback to HDFS size of partitions that are involved 
> in Query if statistics are not available.
> --
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently, if some partitions of a partitioned table are used in a join, we 
> rely on the table size returned by the Metastore to decide whether the 
> operation can be converted to a broadcast join. 
> When a filter can prune some partitions, Hive prunes them before making the 
> broadcast-join decision, based on the HDFS size of only the partitions that 
> are involved in the query. Spark SQL should do the same, which would improve 
> join performance for partitioned tables.
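
A hedged sketch of the proposed fallback, assuming the Hadoop FileSystem API is 
on the classpath; the partition paths and threshold below are illustrative 
inputs, not Spark's actual code path:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object PartitionSizeFallback {
  // Sum the on-disk size of only the partitions involved in the query.
  def involvedPartitionsSize(partitionDirs: Seq[String], conf: Configuration): Long =
    partitionDirs.map { dir =>
      val path = new Path(dir)
      path.getFileSystem(conf).getContentSummary(path).getLength
    }.sum

  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Hypothetical partition directories selected by the query's partition filter.
    val pruned = Seq(
      "hdfs:///warehouse/sales/dt=2017-06-01",
      "hdfs:///warehouse/sales/dt=2017-06-02")
    val broadcastThreshold = 10L * 1024 * 1024  // default of spark.sql.autoBroadcastJoinThreshold
    val size = involvedPartitionsSize(pruned, conf)
    println(s"pruned size = $size bytes, broadcastable = ${size <= broadcastThreshold}")
  }
}
{code}

The point is that the size check uses only the partitions the query touches, not 
the whole table.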






[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables

2017-03-30 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948578#comment-15948578
 ] 

Lianhui Wang commented on SPARK-14560:
--

[~darshankhamar123] [~mhornbech] Are you using ExternalAppendOnlyMap or 
ExternalSorter? This issue is about those classes, not about UnsafeInMemorySorter.
You can get more information about memory usage from DEBUG-level logs.
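
If it helps, DEBUG logging for that class can be enabled like this in spark-shell 
(a sketch using the log4j 1.x API that Spark shipped with at the time; on 
executors the equivalent log4j.properties entry is needed instead):

{code:scala}
// spark-shell (driver side). For executors, add the equivalent line to log4j.properties:
//   log4j.logger.org.apache.spark.memory.TaskMemoryManager=DEBUG
import org.apache.log4j.{Level, Logger}
Logger.getLogger("org.apache.spark.memory.TaskMemoryManager").setLevel(Level.DEBUG)
{code}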

> Cooperative Memory Management for Spillables
> 
>
> Key: SPARK-14560
> URL: https://issues.apache.org/jira/browse/SPARK-14560
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Imran Rashid
>Assignee: Lianhui Wang
> Fix For: 2.0.0
>
>
> SPARK-10432 introduced cooperative memory management for SQL operators that 
> can spill; however, {{Spillable}} s used by the old RDD api still do not 
> cooperate.  This can lead to memory starvation, in particular on a 
> shuffle-to-shuffle stage, eventually resulting in errors like:
> {noformat}
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Memory used in task 3081
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Acquired by 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter@69ab0291: 32.0 KB
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317230346 bytes of memory 
> were used by task 3081 but are not associated with specific consumers
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317263114 bytes of memory 
> are used for execution and 1710484 bytes of memory are used for storage
> 16/03/28 08:59:54 ERROR executor.Executor: Managed memory leak detected; size 
> = 1317230346 bytes, TID = 3081
> 16/03/28 08:59:54 ERROR executor.Executor: Exception in task 533.0 in stage 
> 3.0 (TID 3081)
> java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This can happen anytime the shuffle read side requires more memory than what 
> is available for the task.  Since the shuffle-read side doubles its memory 
> request each time, it can easily end up acquiring all of the available 
> memory, even if it does not use it.  E.g., say that after the final spill, the 
> shuffle-read side requires 10 MB more memory, and there is 15 MB of memory 
> available.  But if it starts at 2 MB, it will double to 4, 8, and then 
> request 16 MB of memory, and in fact get all available 15 MB.  Since the 15 
> MB of memory is sufficient, it will not spill, and will continue holding on 
> to all available memory.  But this leaves *no* memory available for the 
> shuffle-write side.  Since the shuffle-write side cannot request the 
> shuffle-read side to free up memory, this leads to an OOM.
> The simple solution is to make {{Spillable}} implement {{MemoryConsumer}} as 
> well, so RDDs can benefit from the cooperative memory management introduced 
> by SPARK-10342.
> Note that an additional improvement would be for the shuffle-read side to 
> simply release unused memory, without spilling, in case that would leave 
> enough memory, and only spill if that was inadequate.  However that can come 
> as a later improvement.
> *Workaround*:  You can set 
> {{spark.shuffle.spill.numElementsForceSpillThreshold=N}} to force spilling to 
> occur every {{N}} elements, thus preventing the shuffle-read side from ever 
> grabbing all of the available memory.  However, this requires careful tuning 
> of {{N}} to specific workloads: too big, and you will still get an OOM; too 
> small, and there will be so much spilling that performance will suffer 
> drastically.  Furthermore, this workaround uses an *undocumented* 
> configuration with *no compatibility guarantees* for future versions of spark.
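
For reference, a sketch of applying the workaround described above when building 
the SparkConf; the threshold value is a placeholder that needs per-workload 
tuning, and the property is the undocumented one named in the issue:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

object SpillThresholdWorkaround {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("spill-threshold-workaround")
      // Undocumented knob from the issue text: force a spill every N elements so the
      // shuffle-read side cannot hold on to all execution memory. N = 5000000 is only
      // a placeholder and must be tuned per workload.
      .set("spark.shuffle.spill.numElementsForceSpillThreshold", "5000000")
    val sc = new SparkContext(conf)
    try {
      // ... job code ...
    } finally {
      sc.stop()
    }
  }
}
{code}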




[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables

2016-11-23 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15690589#comment-15690589
 ] 

Lianhui Wang commented on SPARK-14560:
--

Could you provide some DEBUG-level logs from TaskMemoryManager? They should tell 
us who is using the memory.

> Cooperative Memory Management for Spillables
> 
>
> Key: SPARK-14560
> URL: https://issues.apache.org/jira/browse/SPARK-14560
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Imran Rashid
>Assignee: Lianhui Wang
> Fix For: 2.0.0




[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables

2016-11-16 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15670007#comment-15670007
 ] 

Lianhui Wang commented on SPARK-14560:
--

This fix is not in branch-1.6. See https://github.com/apache/spark/pull/13027 
for the branch-1.6 backport. Thanks.

> Cooperative Memory Management for Spillables
> 
>
> Key: SPARK-14560
> URL: https://issues.apache.org/jira/browse/SPARK-14560
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Imran Rashid
>Assignee: Lianhui Wang
> Fix For: 2.0.0




[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-31 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15622645#comment-15622645
 ] 

Lianhui Wang commented on SPARK-15616:
--

For 2.0, I have created a new branch, 
https://github.com/lianhuiwang/spark/tree/2.0_partition_broadcast. You can retry 
with it.

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently if some partitions of a partitioned table are used in join 
> operation we rely on Metastore returned size of table to calculate if we can 
> convert the operation to Broadcast join. 
> if Filter can prune some partitions, Hive can prune partition before 
> determining to use broadcast joins according to HDFS size of partitions that 
> are involved in Query.So sparkSQL needs it that can improve join's 
> performance for partitioned table.






[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-29 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15618476#comment-15618476
 ] 

Lianhui Wang commented on SPARK-15616:
--

I have updated the code and fixed the problem that you have pointed out. 
Thanks. I think you can try again.

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently if some partitions of a partitioned table are used in join 
> operation we rely on Metastore returned size of table to calculate if we can 
> convert the operation to Broadcast join. 
> if Filter can prune some partitions, Hive can prune partition before 
> determining to use broadcast joins according to HDFS size of partitions that 
> are involved in Query.So sparkSQL needs it that can improve join's 
> performance for partitioned table.






[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-29 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15618475#comment-15618475
 ] 

Lianhui Wang commented on SPARK-15616:
--

I have updated the code and fixed the problem that you have pointed out. 
Thanks. I think you can try again.

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently if some partitions of a partitioned table are used in join 
> operation we rely on Metastore returned size of table to calculate if we can 
> convert the operation to Broadcast join. 
> if Filter can prune some partitions, Hive can prune partition before 
> determining to use broadcast joins according to HDFS size of partitions that 
> are involved in Query.So sparkSQL needs it that can improve join's 
> performance for partitioned table.






[jira] [Issue Comment Deleted] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-29 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-15616:
-
Comment: was deleted

(was: I have updated the code and fixed the problem that you have pointed out. 
Thanks. I think you can try again.)

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently if some partitions of a partitioned table are used in join 
> operation we rely on Metastore returned size of table to calculate if we can 
> convert the operation to Broadcast join. 
> if Filter can prune some partitions, Hive can prune partition before 
> determining to use broadcast joins according to HDFS size of partitions that 
> are involved in Query.So sparkSQL needs it that can improve join's 
> performance for partitioned table.






[jira] [Comment Edited] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-27 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613846#comment-15613846
 ] 

Lianhui Wang edited comment on SPARK-15616 at 10/28/16 12:54 AM:
-

Yes, I think it can. But right now the PR is based on branch-2.0, not master; I 
will rebase it onto master later. You can try this PR for the broadcast join. 
Please share any feedback. Thanks.


was (Author: lianhuiwang):
Yes, I think it can. But now the PR is based on branch 2.0, not branch master. 
I will update it based on branch master later.

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently if some partitions of a partitioned table are used in join 
> operation we rely on Metastore returned size of table to calculate if we can 
> convert the operation to Broadcast join. 
> if Filter can prune some partitions, Hive can prune partition before 
> determining to use broadcast joins according to HDFS size of partitions that 
> are involved in Query.So sparkSQL needs it that can improve join's 
> performance for partitioned table.






[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-27 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613846#comment-15613846
 ] 

Lianhui Wang commented on SPARK-15616:
--

Yes, I think it can. But now the PR is based on branch 2.0, not branch master. 
I will update it based on branch master later.

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently if some partitions of a partitioned table are used in join 
> operation we rely on Metastore returned size of table to calculate if we can 
> convert the operation to Broadcast join. 
> if Filter can prune some partitions, Hive can prune partition before 
> determining to use broadcast joins according to HDFS size of partitions that 
> are involved in Query.So sparkSQL needs it that can improve join's 
> performance for partitioned table.






[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-26 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15610455#comment-15610455
 ] 

Lianhui Wang commented on SPARK-15616:
--

If the filter is on a partition key, this issue covers it for broadcast join.
If the filter is on a non-partition column, that is not included in this issue; 
it probably needs the cost-based optimizer.
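
A small spark-shell example of the distinction (table and column names are made 
up): the first query filters on the partition key dt, so pruning can shrink the 
size estimate before the broadcast-join decision; the second filters on a regular 
column and is out of scope here:

{code:scala}
// spark-shell; `fact` is assumed to be partitioned by dt.

// 1) Filter on the partition key: pruning applies before the broadcast-join decision.
spark.sql("SELECT f.*, d.name FROM fact f JOIN dim d ON f.key = d.key WHERE f.dt = '2016-10-26'")

// 2) Filter on a non-partition column: not covered by this issue; it would need
//    cost-based estimates instead of partition pruning.
spark.sql("SELECT f.*, d.name FROM fact f JOIN dim d ON f.key = d.key WHERE f.amount > 100")
{code}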

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently if some partitions of a partitioned table are used in join 
> operation we rely on Metastore returned size of table to calculate if we can 
> convert the operation to Broadcast join. 
> if Filter can prune some partitions, Hive can prune partition before 
> determining to use broadcast joins according to HDFS size of partitions that 
> are involved in Query.So sparkSQL needs it that can improve join's 
> performance for partitioned table.






[jira] [Comment Edited] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie

2016-07-20 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387145#comment-15387145
 ] 

Lianhui Wang edited comment on SPARK-2666 at 7/21/16 4:38 AM:
--

Thanks. I think what [~irashid] said is more about non-external shuffle. But in 
our use cases we usually use the YARN external shuffle service. For 0 --> 1 --> 2, 
if we hit a shuffle fetch failure while running stage 2, say on executor A, we 
need to regenerate the map output of stage 1 that was on executor A, but we do 
not need to rerun stage 0 on executor A.
So I think we can first handle FetchFailed on the YARN external shuffle service 
(caused by, e.g., connection timeouts or out of memory). I think many users have 
hit FetchFailed with the YARN external shuffle service.
As [~tgraves] said before, when a stage fails because of FetchFailed, we 
currently rerun (1) all tasks that have not yet succeeded in the failed stage, 
including the ones that may still be running. This causes many duplicate running 
tasks for the failed stage: once there is a FetchFailed, all unsuccessful tasks 
of the failed stage are rerun.
So our first target is: for the YARN external shuffle service, when a stage 
fails because of FetchFailed, reduce the number of tasks that are rerun. As I 
pointed out before, the best way is like MapReduce: we just resubmit the map 
stage of the failed stage.
1. When a FetchFailed happens in a task, the task does not finish; it continues 
fetching other results and just reports the ShuffleBlockId of the FetchFailed to 
the DAGScheduler. Other running tasks of this stage behave the same way.
2. The DAGScheduler receives the ShuffleBlockId of the FetchFailed and resubmits 
the task that produced it. Once that task has finished, its map output is 
registered with the MapOutputTracker.
3. The task that hit the FetchFailed asks the MapOutputTracker for the missing 
map output on every heartbeat. Once step 2 is finished, the task gets the map 
output and fetches the previously failed results.
However, there is a deadlock if the task from step 2 cannot run because there 
are no free slots; in that situation some running tasks should be killed to make 
room. In addition, I find that https://issues.apache.org/jira/browse/SPARK-14649 
did (2): it only reruns the failed tasks and waits for the ones still running in 
the failed stage. The disadvantage of SPARK-14649 is that the other running 
tasks of the failed stage may need a long time to rerun when they spend time 
fetching other tasks' results.
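
A purely hypothetical Scala sketch of the three-step flow described above; every 
name below is made up for illustration and none of this is Spark's real 
scheduler API:

{code:scala}
// Hypothetical message types for the proposed flow; not Spark internals.
case class ShuffleBlockRef(shuffleId: Int, mapId: Int)
case class FetchFailedReport(taskId: Long, block: ShuffleBlockRef)

class ToyMapOutputTracker {
  private var ready = Set.empty[ShuffleBlockRef]
  def register(block: ShuffleBlockRef): Unit = synchronized { ready += block }
  def isAvailable(block: ShuffleBlockRef): Boolean = synchronized { ready(block) }
}

class ToyDAGScheduler(tracker: ToyMapOutputTracker) {
  // Step 2: on a FetchFailed report, resubmit only the map task that produced the
  // lost block, then register its fresh output with the tracker.
  def onFetchFailed(report: FetchFailedReport): Unit = {
    val regenerated = report.block            // pretend the map task reran successfully
    tracker.register(regenerated)
  }
}

object ProposedFetchFailedFlow {
  def main(args: Array[String]): Unit = {
    val tracker = new ToyMapOutputTracker
    val scheduler = new ToyDAGScheduler(tracker)
    val lost = ShuffleBlockRef(shuffleId = 0, mapId = 7)

    // Step 1: the reduce task keeps running and reports the failed block instead of dying.
    scheduler.onFetchFailed(FetchFailedReport(taskId = 3081L, block = lost))

    // Step 3: on a later heartbeat the reduce task sees the output is back and re-fetches it.
    println(s"block available again: ${tracker.isAvailable(lost)}")
  }
}
{code}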



was (Author: lianhuiwang):
I think what [~irashid] said is more about non-external shuffle. But in our use 
cases we usually use Yarn-external shuffle service. for 0 --> 1 --> 2,  if it 
hitx a shuffle fetch failure while running stage 2, say on executor A. So it 
needs to regenerate the map output for stage 1 that was on executor A. But it 
don't rerun for stage 0 on executor A.
So i think we can firstly handle with FetchFailed on Yarn-external shuffle 
service(maybe connection timeout, out of memory, etc). I think many users have 
met FetchFailed on Yarn-external shuffle service.
as [~tgraves] said before, Now If the stages fails because FetchFailed, it 
rerun 1) all the ones not succeeded yet in the failed stage (including the ones 
that could still be running). So it cause many duplicate running tasks of 
failed stage. Once there is a FetchFailed, it will rerun all the unsuccessful 
tasks of the failed stage. 
Until now, i think our first target is for Yarn-external shuffle service if the 
stages fails because FetchFailed it should decrease the number of rerunning 
tasks of the failed stage. As i pointed out before that the best way is like 
Mapreduce we just resubmit the map stage of failed stage. 
1. When FetchFailed has happened on task, the task don't be finished and 
continue to fetch other results. It just report the ShuffleBlockId of 
FetchFailed to DAGScheduler. other running tasks of this stage did like this 
task.
2. DAGScheduler receive the ShuffleBlockId of FetchFailed and resubmit the task 
for the ShuffleBlockId. Once the task has been finished, it will register the 
map output to MapOutputTracker.
3. The task that has FetchFailed before get the map output of FetchFailed from 
MapOutputTracker every hearbeat. Once step-2 is finished. The task can get the 
map output of FetchFailed successfully and will fetch the results of 
FetchFailed.
But there is a dead lock if the tasks of Step-2 can not be run because there is 
no slots for it.Under this situation it should kill some running tasks for it. 
In addition, i find that https://issues.apache.org/jira/browse/SPARK-14649 did 
it for 2) it only run the failed ones and wait for the ones still running in 
failed stage. The disadvantage of SPARK-14649 is that other running tasks of 
the failed stage maybe need a long time to rerun when they spend 

[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie

2016-07-20 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387145#comment-15387145
 ] 

Lianhui Wang commented on SPARK-2666:
-

I think what [~irashid] said is more about non-external shuffle. But in our use 
cases we usually use Yarn-external shuffle service. for 0 --> 1 --> 2,  if it 
hitx a shuffle fetch failure while running stage 2, say on executor A. So it 
needs to regenerate the map output for stage 1 that was on executor A. But it 
don't rerun for stage 0 on executor A.
So i think we can firstly handle with FetchFailed on Yarn-external shuffle 
service(maybe connection timeout, out of memory, etc). I think many users have 
met FetchFailed on Yarn-external shuffle service.
as [~tgraves] said before, Now If the stages fails because FetchFailed, it 
rerun 1) all the ones not succeeded yet in the failed stage (including the ones 
that could still be running). So it cause many duplicate running tasks of 
failed stage. Once there is a FetchFailed, it will rerun all the unsuccessful 
tasks of the failed stage. 
Until now, i think our first target is for Yarn-external shuffle service if the 
stages fails because FetchFailed it should decrease the number of rerunning 
tasks of the failed stage. As i pointed out before that the best way is like 
Mapreduce we just resubmit the map stage of failed stage. 
1. When FetchFailed has happened on task, the task don't be finished and 
continue to fetch other results. It just report the ShuffleBlockId of 
FetchFailed to DAGScheduler. other running tasks of this stage did like this 
task.
2. DAGScheduler receive the ShuffleBlockId of FetchFailed and resubmit the task 
for the ShuffleBlockId. Once the task has been finished, it will register the 
map output to MapOutputTracker.
3. The task that has FetchFailed before get the map output of FetchFailed from 
MapOutputTracker every hearbeat. Once step-2 is finished. The task can get the 
map output of FetchFailed successfully and will fetch the results of 
FetchFailed.
But there is a dead lock if the tasks of Step-2 can not be run because there is 
no slots for it.Under this situation it should kill some running tasks for it. 
In addition, i find that https://issues.apache.org/jira/browse/SPARK-14649 did 
it for 2) it only run the failed ones and wait for the ones still running in 
failed stage. The disadvantage of SPARK-14649 is that other running tasks of 
the failed stage maybe need a long time to rerun when they spend time to fetch 
other's results.


> Always try to cancel running tasks when a stage is marked as zombie
> ---
>
> Key: SPARK-2666
> URL: https://issues.apache.org/jira/browse/SPARK-2666
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Reporter: Lianhui Wang
>
> There are some situations in which the scheduler can mark a task set as a 
> "zombie" before the task set has completed all of its tasks.  For example:
> (a) When a task fails b/c of a {{FetchFailed}}
> (b) When a stage completes because two different attempts create all the 
> ShuffleMapOutput, though no attempt has completed all its tasks (at least, 
> this *should* result in the task set being marked as zombie, see SPARK-10370)
> (there may be others, I'm not sure if this list is exhaustive.)
> Marking a taskset as zombie prevents any *additional* tasks from getting 
> scheduled, however it does not cancel all currently running tasks.  We should 
> cancel all running tasks to avoid wasting resources (and also to make the behavior 
> a little more clear to the end user).  Rather than canceling tasks in each 
> case piecemeal, we should refactor the scheduler so that these two actions 
> are always taken together -- canceling tasks should go hand-in-hand with 
> marking the taskset as zombie.
> Some implementation notes:
> * We should change {{taskSetManager.isZombie}} to be private and put it 
> behind a method like {{markZombie}} or something.
> * marking a stage as zombie before the all tasks have completed does *not* 
> necessarily mean the stage attempt has failed.  In case (a), the stage 
> attempt has failed, but in stage (b) we are not canceling b/c of a failure, 
> rather just b/c no more tasks are needed.
> * {{taskScheduler.cancelTasks}} always marks the task set as zombie.  
> However, it also has some side-effects like logging that the stage has failed 
> and creating a {{TaskSetFailed}} event, which we don't want eg. in case (b) 
> when nothing has failed.  So it may need some additional refactoring to go 
> along w/ {{markZombie}}.
> * {{SchedulerBackend}}'s are free to not implement {{killTask}}, so we need 
> to be sure to catch the {{UnsupportedOperationException}} s
> * Testing this *might* benefit from SPARK-10372
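
A hypothetical sketch of the refactoring suggested in the notes above (toy types, 
not the real TaskSetManager/TaskScheduler API), showing how marking a task set as 
zombie and cancelling its running tasks could be coupled in one method:

{code:scala}
// Toy types for illustration only; not the real TaskSetManager / SchedulerBackend.
trait ToyBackend { def killTask(taskId: Long): Unit }

class ToyTaskSetManager(backend: ToyBackend, runningTasks: Set[Long]) {
  private var zombie = false                 // kept private, as suggested above
  def isZombie: Boolean = zombie

  // Marking the task set as zombie and cancelling its running tasks always happen together.
  def markZombie(): Unit = {
    zombie = true
    runningTasks.foreach { tid =>
      try backend.killTask(tid)
      catch { case _: UnsupportedOperationException => () }  // backends may not implement killTask
    }
  }
}
{code}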




[jira] [Updated] (SPARK-16649) Push partition predicates down into metastore for OptimizeMetadataOnlyQuery

2016-07-20 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-16649:
-
Description: SPARK-6910 added support for pushing partition predicates down into 
the Hive metastore for table scans. We should also push partition predicates 
down into the metastore for OptimizeMetadataOnlyQuery.  (was: SPARK-6910 
has supported for  pushing partition predicates down into the Hive metastore 
for table scan. So it also should push partition predicates dow into metastore 
for OptimizeMetadataOnlyQuery.)

> Push partition predicates down into metastore for OptimizeMetadataOnlyQuery
> ---
>
> Key: SPARK-16649
> URL: https://issues.apache.org/jira/browse/SPARK-16649
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> SPARK-6910 added support for pushing partition predicates down into the Hive 
> metastore for table scans. We should also push partition predicates down 
> into the metastore for OptimizeMetadataOnlyQuery.
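
For context, a spark-shell example of the kind of query OptimizeMetadataOnlyQuery 
targets (the table name is made up; the rule is controlled by 
spark.sql.optimizer.metadataOnly):

{code:scala}
// spark-shell; `events` is assumed to be a Hive table partitioned by dt.
spark.conf.set("spark.sql.optimizer.metadataOnly", "true")

// The aggregate and the predicate only touch the partition column, so the result can
// be answered from partition metadata; pushing the predicate into the metastore means
// only the matching partitions are listed at all.
spark.sql("SELECT MAX(dt) FROM events WHERE dt >= '2016-07-01'").show()
{code}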






[jira] [Updated] (SPARK-16649) Push partition predicates down into metastore for OptimizeMetadataOnlyQuery

2016-07-20 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-16649:
-
Summary: Push partition predicates down into metastore for 
OptimizeMetadataOnlyQuery  (was: Push partition predicates down into the Hive 
metastore for OptimizeMetadataOnlyQuery)

> Push partition predicates down into metastore for OptimizeMetadataOnlyQuery
> ---
>
> Key: SPARK-16649
> URL: https://issues.apache.org/jira/browse/SPARK-16649
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> SPARK-6910 has supported for  pushing partition predicates down into the Hive 
> metastore for table scan. So it also should push partition predicates down 
> into metastore for OptimizeMetadataOnlyQuery.






[jira] [Created] (SPARK-16649) Push partition predicates down into the Hive metastore for OptimizeMetadataOnlyQuery

2016-07-20 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-16649:


 Summary: Push partition predicates down into the Hive metastore 
for OptimizeMetadataOnlyQuery
 Key: SPARK-16649
 URL: https://issues.apache.org/jira/browse/SPARK-16649
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Lianhui Wang


SPARK-6910 added support for pushing partition predicates down into the Hive 
metastore for table scans. We should also push partition predicates down into 
the metastore for OptimizeMetadataOnlyQuery.






[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie

2016-07-20 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385510#comment-15385510
 ] 

Lianhui Wang commented on SPARK-2666:
-

[~tgraves] Sorry for the late reply. In https://github.com/apache/spark/pull/1572, 
all running tasks are killed before we resubmit for a FetchFailed. But 
[~kayousterhout] said that we keep the remaining tasks because the running tasks 
may hit fetch failures from different map outputs than the original fetch 
failure. 
I think the best way is like MapReduce: we just resubmit the map stage of the 
failed stage. If the reduce stage hits a FetchFailed, it just reports the 
FetchFailed to the DAGScheduler and keeps fetching other results. Then the 
reduce stage asks for the output status of the FetchFailed on every heartbeat, 
as in https://github.com/apache/spark/pull/3430.
[~tgraves] What do you think about this? Thanks. 

> Always try to cancel running tasks when a stage is marked as zombie
> ---
>
> Key: SPARK-2666
> URL: https://issues.apache.org/jira/browse/SPARK-2666
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Reporter: Lianhui Wang
>
> There are some situations in which the scheduler can mark a task set as a 
> "zombie" before the task set has completed all of its tasks.  For example:
> (a) When a task fails b/c of a {{FetchFailed}}
> (b) When a stage completes because two different attempts create all the 
> ShuffleMapOutput, though no attempt has completed all its tasks (at least, 
> this *should* result in the task set being marked as zombie, see SPARK-10370)
> (there may be others, I'm not sure if this list is exhaustive.)
> Marking a taskset as zombie prevents any *additional* tasks from getting 
> scheduled, however it does not cancel all currently running tasks.  We should 
> cancel all running tasks to avoid wasting resources (and also to make the behavior 
> a little more clear to the end user).  Rather than canceling tasks in each 
> case piecemeal, we should refactor the scheduler so that these two actions 
> are always taken together -- canceling tasks should go hand-in-hand with 
> marking the taskset as zombie.
> Some implementation notes:
> * We should change {{taskSetManager.isZombie}} to be private and put it 
> behind a method like {{markZombie}} or something.
> * marking a stage as zombie before the all tasks have completed does *not* 
> necessarily mean the stage attempt has failed.  In case (a), the stage 
> attempt has failed, but in stage (b) we are not canceling b/c of a failure, 
> rather just b/c no more tasks are needed.
> * {{taskScheduler.cancelTasks}} always marks the task set as zombie.  
> However, it also has some side-effects like logging that the stage has failed 
> and creating a {{TaskSetFailed}} event, which we don't want eg. in case (b) 
> when nothing has failed.  So it may need some additional refactoring to go 
> along w/ {{markZombie}}.
> * {{SchedulerBackend}}'s are free to not implement {{killTask}}, so we need 
> to be sure to catch the {{UnsupportedOperationException}} s
> * Testing this *might* benefit from SPARK-10372






[jira] [Updated] (SPARK-16497) Don't throw an exception if drop non-existent TABLE/VIEW/Function/Partitions

2016-07-12 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-16497:
-
Summary: Don't throw an exception if drop non-existent 
TABLE/VIEW/Function/Partitions  (was: Don't throw an exception for drop 
non-exist TABLE/VIEW/Function/Partitions)

> Don't throw an exception if drop non-existent TABLE/VIEW/Function/Partitions
> 
>
> Key: SPARK-16497
> URL: https://issues.apache.org/jira/browse/SPARK-16497
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Per 
> https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.exec.drop.ignorenonexistent,
> Hive uses 'hive.exec.drop.ignorenonexistent' (default=true) to avoid reporting 
> an error if DROP TABLE/VIEW/PARTITION/INDEX/TEMPORARY FUNCTION specifies a 
> non-existent table/view. Spark SQL should support this as well.
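
For comparison, the behavior users typically rely on today (spark-shell example, 
table name made up): IF EXISTS already makes the drop a no-op for a missing 
table; the proposal is to offer the same leniency Hive provides when IF EXISTS 
is omitted:

{code:scala}
// spark-shell; `tmp_results` is a made-up table name.

// Already a no-op if the table does not exist.
spark.sql("DROP TABLE IF EXISTS tmp_results")

// Without IF EXISTS this currently raises an error for a non-existent table; the proposal
// is to tolerate it, matching Hive's hive.exec.drop.ignorenonexistent=true behaviour.
spark.sql("DROP TABLE tmp_results")
{code}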






[jira] [Created] (SPARK-16497) Don't throw an exception for drop non-exist TABLE/VIEW/Function/Partitions

2016-07-12 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-16497:


 Summary: Don't throw an exception for drop non-exist 
TABLE/VIEW/Function/Partitions
 Key: SPARK-16497
 URL: https://issues.apache.org/jira/browse/SPARK-16497
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Lianhui Wang


From 
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.exec.drop.ignorenonexistent,
 Hive uses 'hive.exec.drop.ignorenonexistent' (default=true) to avoid reporting an 
error if DROP TABLE/VIEW/PARTITION/INDEX/TEMPORARY FUNCTION specifies a 
non-existent table/view. Spark SQL should support the same behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16456) Reuse the uncorrelated scalar subqueries with the same logical plan in a query

2016-07-08 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-16456:
-
Summary: Reuse the uncorrelated scalar subqueries with the same logical 
plan in a query  (was: Reuse the uncorrelated scalar subqueries with with the 
same logical plan in a query)

> Reuse the uncorrelated scalar subqueries with the same logical plan in a query
> --
>
> Key: SPARK-16456
> URL: https://issues.apache.org/jira/browse/SPARK-16456
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> In TPCDS Q14, the same physical plan for an uncorrelated scalar subquery from a 
> CTE can be executed multiple times; we should reuse the result to 
> avoid recomputing the same uncorrelated scalar subquery.
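
The reuse idea in miniature: this is plain Scala memoisation, not Spark's Catalyst rule. {{canonicalKey}} stands in for a canonicalised subquery plan and {{execute}} for running that plan once; both names are assumptions made for the sketch.

{code:scala}
import scala.collection.mutable

// Run each distinct (canonicalised) scalar subquery once and share the result.
class ScalarSubqueryCache[K, V] {
  private val results = mutable.Map.empty[K, V]
  def getOrExecute(canonicalKey: K)(execute: => V): V =
    results.getOrElseUpdate(canonicalKey, execute)
}

object ScalarSubqueryCacheDemo extends App {
  val cache = new ScalarSubqueryCache[String, Long]
  var executions = 0
  def runSubquery(): Long = { executions += 1; 42L }

  // Two occurrences of the same subquery plan lead to a single execution.
  val a = cache.getOrExecute("SELECT max(x) FROM t")(runSubquery())
  val b = cache.getOrExecute("SELECT max(x) FROM t")(runSubquery())
  assert(a == 42L && b == 42L && executions == 1)
}
{code}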



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16456) Reuse the uncorrelated scalar subqueries with the same logical plan in a query

2016-07-08 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-16456:
-
Description: In TPCDS Q14, the same physical plan for an uncorrelated scalar 
subquery from a CTE can be executed multiple times; we should reuse the result 
to avoid the duplicated computation.  (was: In TPCDS Q14, the same 
physical plan of uncorrelated scalar subqueries from a CTE could be executed 
multiple times, we should re-use the same result  to avoid the duplicated 
computing on the same uncorrelated scalar subquery.)

> Reuse the uncorrelated scalar subqueries with the same logical plan in a query
> --
>
> Key: SPARK-16456
> URL: https://issues.apache.org/jira/browse/SPARK-16456
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> In TPCDS Q14, the same physical plan for an uncorrelated scalar subquery from a 
> CTE can be executed multiple times; we should reuse the result to 
> avoid the duplicated computation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15752) Optimize metadata only query that has an aggregate whose children are deterministic project or filter operators

2016-07-06 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-15752:
-
Summary: Optimize metadata only query that has an aggregate whose children 
are deterministic project or filter operators  (was: support optimization for 
metadata only queries)

> Optimize metadata only query that has an aggregate whose children are 
> deterministic project or filter operators
> ---
>
> Key: SPARK-15752
> URL: https://issues.apache.org/jira/browse/SPARK-15752
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> When a query only uses metadata (for example, partition keys), it can return 
> results based on metadata without scanning files. Hive did this in HIVE-1003.
> design document:
> https://docs.google.com/document/d/1Bmi4-PkTaBQ0HVaGjIqa3eA12toKX52QaiUyhb6WQiM/edit?usp=sharing
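
A runnable illustration of the query shape this targets. The paths and table name are made up, and whether the data scan is actually skipped depends on the Spark version and on the optimisation proposed here being available and enabled.

{code:scala}
import org.apache.spark.sql.SparkSession

object MetadataOnlyQuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("metadata-only-sketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Write a small table partitioned by `dt`, so `dt` lives in directory names.
    Seq((1, "2016-06-01"), (2, "2016-06-02"))
      .toDF("id", "dt")
      .write.mode("overwrite").partitionBy("dt").parquet("/tmp/metadata_only_demo")

    spark.read.parquet("/tmp/metadata_only_demo").createOrReplaceTempView("t")

    // Only the partition column is referenced, so in principle the answer can
    // come from partition metadata alone, with no data files scanned.
    spark.sql("SELECT DISTINCT dt FROM t").show()

    spark.stop()
  }
}
{code}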



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16302) Set the right number of partitions for reading data from a local collection.

2016-06-29 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-16302:
-
Summary: Set the right number of partitions for reading data from a local 
collection.  (was: Set the default number of partitions for reading data from a 
local collection.)

> Set the right number of partitions for reading data from a local collection.
> 
>
> Key: SPARK-16302
> URL: https://issues.apache.org/jira/browse/SPARK-16302
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Lianhui Wang
>
> For the query {{val df = Seq[(Int, Int)]().toDF("key", "value").count}}, Spark 
> always uses defaultParallelism tasks, so it runs empty or very small tasks.
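
A quick way to observe the reported behaviour. The partition counts printed depend on the master setting and on whether the running Spark build already contains the fix, so the comments describe the intent rather than guaranteed output.

{code:scala}
import org.apache.spark.sql.SparkSession

object LocalTableScanPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("local-table-scan-sketch")
      .master("local[8]") // defaultParallelism = 8 here
      .getOrCreate()
    import spark.implicits._

    // An empty local DataFrame: counting it should not need 8 tasks.
    val df = Seq.empty[(Int, Int)].toDF("key", "value")
    println(df.rdd.getNumPartitions) // reported behaviour: defaultParallelism
    println(df.count())              // 0

    // The proposed fix amounts to sizing partitions from the data itself,
    // roughly math.max(1, math.min(rows.length, defaultParallelism)).
    spark.stop()
  }
}
{code}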



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16302) Set the default number of partitions for reading data from a local collection.

2016-06-29 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-16302:
-
Summary: Set the default number of partitions for reading data from a local 
collection.  (was: LocalTableScanExec always use defaultParallelism tasks even 
though it is very small seq.)

> Set the default number of partitions for reading data from a local collection.
> --
>
> Key: SPARK-16302
> URL: https://issues.apache.org/jira/browse/SPARK-16302
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Lianhui Wang
>
> For the query {{val df = Seq[(Int, Int)]().toDF("key", "value").count}}, Spark 
> always uses defaultParallelism tasks, so it runs empty or very small tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16302) LocalTableScanExec always use defaultParallelism tasks even though it is very small seq.

2016-06-29 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-16302:


 Summary: LocalTableScanExec always use defaultParallelism tasks 
even though it is very small seq.
 Key: SPARK-16302
 URL: https://issues.apache.org/jira/browse/SPARK-16302
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Lianhui Wang


For the query {{val df = Seq[(Int, Int)]().toDF("key", "value").count}}, Spark always uses 
defaultParallelism tasks, so it runs empty or very small tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15988) Implement DDL commands: CREATE/DROP TEMPORARY MACRO

2016-06-16 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-15988:


 Summary: Implement DDL commands: CREATE/DROP TEMPORARY MACRO 
 Key: SPARK-15988
 URL: https://issues.apache.org/jira/browse/SPARK-15988
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Lianhui Wang


In https://issues.apache.org/jira/browse/HIVE-2655, Hive implemented 
CREATE/DROP TEMPORARY MACRO, so I think Spark SQL should support it as well. 
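
The Hive syntax (from HIVE-2655) that this ticket proposes Spark SQL accept, shown only as SQL strings; running them through spark.sql would require the parser and command support that the ticket is asking for. The macro name and body are illustrative.

{code:scala}
object TemporaryMacroSketch {
  // Hive's macro DDL, as plain SQL text.
  val createMacro = "CREATE TEMPORARY MACRO sigmoid(x DOUBLE) 1.0 / (1.0 + EXP(-x))"
  val useMacro    = "SELECT sigmoid(2.0)"
  val dropMacro   = "DROP TEMPORARY MACRO sigmoid"
}
{code}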



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15752) support optimization for metadata only queries

2016-06-04 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-15752:
-
Description: 
When a query only uses metadata (for example, partition keys), it can return 
results based on metadata without scanning files. Hive did this in HIVE-1003.
design document:
https://docs.google.com/document/d/1Bmi4-PkTaBQ0HVaGjIqa3eA12toKX52QaiUyhb6WQiM/edit?usp=sharing

  was:when query only use metadata (example: partition key), it can return 
results based on metadata without scaning files. Hive did it in HIVE-1003.


> support optimization for metadata only queries
> --
>
> Key: SPARK-15752
> URL: https://issues.apache.org/jira/browse/SPARK-15752
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> When a query only uses metadata (for example, partition keys), it can return 
> results based on metadata without scanning files. Hive did this in HIVE-1003.
> design document:
> https://docs.google.com/document/d/1Bmi4-PkTaBQ0HVaGjIqa3eA12toKX52QaiUyhb6WQiM/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15756) Support command 'create table stored as orcfile/parquetfile/avrofile'

2016-06-03 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-15756:
-
Description: Spark SQL currently supports 'create table src stored as 
orc/parquet/avro' for ORC/Parquet/Avro tables, but Hive supports both 
'stored as orc/parquet/avro' and 'stored as 
orcfile/parquetfile/avrofile'. Like Hive, Spark SQL should also support 'stored as 
orcfile/parquetfile/avrofile'.  (was: in 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/IOConstants.java,
 we can know Hive support create table stored as orcfile/parquetfile/avrofile.)

> Support command 'create table stored as orcfile/parquetfile/avrofile'
> -
>
> Key: SPARK-15756
> URL: https://issues.apache.org/jira/browse/SPARK-15756
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: marymwu
>Priority: Minor
>
> Spark SQL currently supports 'create table src stored as orc/parquet/avro' for 
> ORC/Parquet/Avro tables, but Hive supports both 'stored as 
> orc/parquet/avro' and 'stored as orcfile/parquetfile/avrofile'. Like Hive, 
> Spark SQL should also support 'stored as orcfile/parquetfile/avrofile'.
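
The two spellings side by side, as SQL strings only. Whether the *FILE variants parse depends on the Spark version, so this records the intended usage rather than guaranteed behaviour; the table names are illustrative.

{code:scala}
object StoredAsSketch {
  // Accepted today by Spark SQL.
  val shortForm = "CREATE TABLE src (key INT, value STRING) STORED AS ORC"
  // Accepted by Hive and requested by this ticket for Spark SQL.
  val longForm  = "CREATE TABLE src2 (key INT, value STRING) STORED AS ORCFILE"
}
{code}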



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15756) Support command 'create table stored as orcfile/parquetfile/avrofile'

2016-06-03 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-15756:
-
Summary: Support command 'create table stored as 
orcfile/parquetfile/avrofile'  (was: Support create table stored as 
orcfile/parquetfile/avrofile)

> Support command 'create table stored as orcfile/parquetfile/avrofile'
> -
>
> Key: SPARK-15756
> URL: https://issues.apache.org/jira/browse/SPARK-15756
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: marymwu
>Priority: Minor
>
> From 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/IOConstants.java,
>  we can see that Hive supports create table stored as orcfile/parquetfile/avrofile.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15756) Support create table stored as orcfile/parquetfile/avrofile

2016-06-03 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-15756:
-
Description: From 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/IOConstants.java,
 we can see that Hive supports create table stored as orcfile/parquetfile/avrofile.  
(was: in 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/IOConstants.java,
 we can know Hive support )

> Support create table stored as orcfile/parquetfile/avrofile
> ---
>
> Key: SPARK-15756
> URL: https://issues.apache.org/jira/browse/SPARK-15756
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: marymwu
>Priority: Minor
>
> From 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/IOConstants.java,
>  we can see that Hive supports create table stored as orcfile/parquetfile/avrofile.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15756) Support create table stored as orcfile/parquetfile/avrofile

2016-06-03 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-15756:
-
Description: in 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/IOConstants.java,
 we can know Hive support 

> Support create table stored as orcfile/parquetfile/avrofile
> ---
>
> Key: SPARK-15756
> URL: https://issues.apache.org/jira/browse/SPARK-15756
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: marymwu
>Priority: Minor
>
> in 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/IOConstants.java,
>  we can know Hive support 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15756) Support create table stored as orcfile/parquetfile/avrofile

2016-06-03 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-15756:
-
Summary: Support create table stored as orcfile/parquetfile/avrofile  (was: 
SQL “stored as orcfile” cannot be supported while hive supports both keywords 
"orc" and "orcfile")

> Support create table stored as orcfile/parquetfile/avrofile
> ---
>
> Key: SPARK-15756
> URL: https://issues.apache.org/jira/browse/SPARK-15756
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: marymwu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15752) support optimization for metadata only queries

2016-06-03 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-15752:
-
Description: When a query only uses metadata (for example, partition keys), it can 
return results based on metadata without scanning files. Hive did this in 
HIVE-1003.  (was: when query just has use metadata (example: partition key), it 
can return results based on metadata without scaning files. Hive did it in 
HIVE-1003.)

> support optimization for metadata only queries
> --
>
> Key: SPARK-15752
> URL: https://issues.apache.org/jira/browse/SPARK-15752
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> When a query only uses metadata (for example, partition keys), it can return 
> results based on metadata without scanning files. Hive did this in HIVE-1003.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15752) support optimization for metadata only queries

2016-06-03 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-15752:
-
Description: When a query only uses metadata (for example, partition keys), it 
can return results based on metadata without scanning files. Hive did this in 
HIVE-1003.  (was: when query just has use metadata, so it can not scan files. 
we need to support it like HIVE-1003.)

> support optimization for metadata only queries
> --
>
> Key: SPARK-15752
> URL: https://issues.apache.org/jira/browse/SPARK-15752
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> When a query only uses metadata (for example, partition keys), it can return 
> results based on metadata without scanning files. Hive did this in HIVE-1003.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15752) support optimization for metadata only queries

2016-06-03 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-15752:
-
Summary: support optimization for metadata only queries  (was: support 
optimization for metadata only queries.)

> support optimization for metadata only queries
> --
>
> Key: SPARK-15752
> URL: https://issues.apache.org/jira/browse/SPARK-15752
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> When a query only needs metadata, it should not have to scan files; we need to 
> support this as Hive does in HIVE-1003.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15752) support optimization for metadata only queries.

2016-06-03 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-15752:


 Summary: support optimization for metadata only queries.
 Key: SPARK-15752
 URL: https://issues.apache.org/jira/browse/SPARK-15752
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Lianhui Wang


When a query only needs metadata, it should not have to scan files; we need to support 
this as Hive does in HIVE-1003.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15752) support optimization for metadata only queries.

2016-06-03 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-15752:
-
Issue Type: Improvement  (was: Bug)

> support optimization for metadata only queries.
> ---
>
> Key: SPARK-15752
> URL: https://issues.apache.org/jira/browse/SPARK-15752
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> When a query only needs metadata, it should not have to scan files; we need to 
> support this as Hive does in HIVE-1003.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15664) Replace FileSystem.get(conf) with path.getFileSystem(conf) when removing CheckpointFile in MLlib

2016-05-31 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-15664:
-
Description: if sparkContext.set CheckpointDir to another Dir that is not 
default FileSystem, it will throw exception when 

> Replace FileSystem.get(conf) with path.getFileSystem(conf) when removing 
> CheckpointFile in MLlib
> 
>
> Key: SPARK-15664
> URL: https://issues.apache.org/jira/browse/SPARK-15664
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Lianhui Wang
>
> if sparkContext.set CheckpointDir to another Dir that is not default 
> FileSystem, it will throw exception when 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15664) Replace FileSystem.get(conf) with path.getFileSystem(conf) when removing CheckpointFile in MLlib

2016-05-31 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-15664:
-
Description: If sparkContext.setCheckpointDir points to a directory that is not on the 
default FileSystem, an exception is thrown when removing a CheckpointFile in 
MLlib.  (was: if sparkContext.set CheckpointDir to another Dir that is not 
default FileSystem, it will throw exception when )

> Replace FileSystem.get(conf) with path.getFileSystem(conf) when removing 
> CheckpointFile in MLlib
> 
>
> Key: SPARK-15664
> URL: https://issues.apache.org/jira/browse/SPARK-15664
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Lianhui Wang
>
> If sparkContext.setCheckpointDir points to a directory that is not on the default 
> FileSystem, an exception is thrown when removing a CheckpointFile in MLlib.
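
A minimal sketch of the fix pattern, assuming a checkpoint path that may live on a non-default FileSystem; {{deleteCheckpointFile}} is a stand-in name, not MLlib's actual method.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object CheckpointCleanupSketch {
  def deleteCheckpointFile(checkpointFile: String, conf: Configuration): Unit = {
    val path = new Path(checkpointFile)

    // Buggy pattern: FileSystem.get(conf) returns the *default* FileSystem and
    // fails with "Wrong FS" when the path belongs to another scheme/authority.
    // val fs = FileSystem.get(conf)

    // Fixed pattern: resolve the FileSystem that actually owns this path.
    val fs = path.getFileSystem(conf)
    fs.delete(path, true) // recursive delete of the checkpoint data
  }
}
{code}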



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15664) Replace FileSystem.get(conf) with path.getFileSystem(conf) when removing CheckpointFile in MLlib

2016-05-31 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-15664:
-
Component/s: MLlib

> Replace FileSystem.get(conf) with path.getFileSystem(conf) when removing 
> CheckpointFile in MLlib
> 
>
> Key: SPARK-15664
> URL: https://issues.apache.org/jira/browse/SPARK-15664
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Lianhui Wang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15664) Replace FileSystem.get(conf) with path.getFileSystem(conf) when removing CheckpointFile in MLlib

2016-05-31 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-15664:


 Summary: Replace FileSystem.get(conf) with 
path.getFileSystem(conf) when removing CheckpointFile in MLlib
 Key: SPARK-15664
 URL: https://issues.apache.org/jira/browse/SPARK-15664
 Project: Spark
  Issue Type: Bug
Reporter: Lianhui Wang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15649) avoid to serialize MetastoreRelation in HiveTableScanExec

2016-05-30 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-15649:


 Summary: avoid to serialize MetastoreRelation in HiveTableScanExec
 Key: SPARK-15649
 URL: https://issues.apache.org/jira/browse/SPARK-15649
 Project: Spark
  Issue Type: Improvement
Reporter: Lianhui Wang
Priority: Minor


In HiveTableScanExec, schema is lazy and depends on relation.attributeMap, 
so the MetastoreRelation gets serialized along with the task binary 
bytes. We can avoid serializing the MetastoreRelation.
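
A toy illustration of the idea, not Spark's actual HiveTableScanExec: compute the small derived field eagerly on the driver and keep the heavyweight relation out of the serialized task via @transient. The class names are stand-ins.

{code:scala}
// Stand-in for MetastoreRelation: large, and not needed on executors.
case class HeavyRelation(attributes: Seq[String])

class ScanNodeSketch(@transient private val relation: HeavyRelation) extends Serializable {
  // Eagerly captured on the driver; only this small Seq ships with the task,
  // instead of a lazy val that would drag `relation` into the serialized bytes.
  val schema: Seq[String] = relation.attributes
}
{code}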



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-05-27 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-15616:
-
Description: 
Currently, if some partitions of a partitioned table are used in a join operation, 
we rely on the table size returned by the Metastore to decide whether we can convert the 
operation to a broadcast join. 
When a Filter can prune some partitions, Hive prunes them before 
deciding whether to use a broadcast join, based on the HDFS size of the partitions that 
are involved in the query. Spark SQL needs the same capability to improve join performance 
for partitioned tables.

  was:
Currently if some partitions of a partitioned table are used in join operation 
we rely on Metastore returned size of table to calculate if we can convert the 
operation to Broadcast join. 
if Filter can prune some partitions, Hive can prune partition before 
determining to use broadcast joins according to HDFS size of partitions that 
are involved in Query.So sparkSQL needs it that can improve join's performance 
for partitioned table.


> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently, if some partitions of a partitioned table are used in a join 
> operation, we rely on the table size returned by the Metastore to decide whether 
> we can convert the operation to a broadcast join. 
> When a Filter can prune some partitions, Hive prunes them before 
> deciding whether to use a broadcast join, based on the HDFS size of the partitions that 
> are involved in the query. Spark SQL needs the same capability to improve join 
> performance for partitioned tables.
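
A hedged sketch of the fallback, assuming the pruned partitions' storage locations are already known; {{partitionLocations}} and the method name are stand-ins, not Spark's API.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object PrunedPartitionSizeSketch {
  // Sum the on-disk size of only the partitions that survive pruning, and use
  // that (rather than the whole-table metastore size) for the broadcast-join
  // decision when per-partition statistics are missing.
  def prunedPartitionsSizeInBytes(partitionLocations: Seq[String],
                                  conf: Configuration): Long =
    partitionLocations.map { location =>
      val path = new Path(location)
      val fs = path.getFileSystem(conf)
      fs.getContentSummary(path).getLength // bytes under this partition directory
    }.sum
}
{code}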



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-05-27 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-15616:


 Summary: Metastore relation should fallback to HDFS size of 
partitions that are involved in Query if statistics are not available.
 Key: SPARK-15616
 URL: https://issues.apache.org/jira/browse/SPARK-15616
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Lianhui Wang


Currently, if some partitions of a partitioned table are used in a join operation, 
we rely on the table size returned by the Metastore to decide whether we can convert the 
operation to a broadcast join. 
When a Filter can prune some partitions, Hive prunes them before 
deciding whether to use a broadcast join, based on the HDFS size of the partitions that 
are involved in the query. Spark SQL needs the same capability to improve join performance 
for partitioned tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15335) Implement TRUNCATE TABLE Command

2016-05-18 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-15335:
-
Summary: Implement TRUNCATE TABLE Command  (was: In Spark 2.0 TRUNCATE 
TABLE is unsupported)

> Implement TRUNCATE TABLE Command
> 
>
> Key: SPARK-15335
> URL: https://issues.apache.org/jira/browse/SPARK-15335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Weizhong
>Priority: Minor
>
> Spark version based on commit b3930f74a0929b2cdcbbe5cbe34f0b1d35eb01cc; the test 
> result is as below:
> {noformat}
> spark-sql> create table truncateTT(c string);
> 16/05/16 10:23:15 INFO execution.SparkSqlParser: Parsing command: create 
> table truncateTT(c string)
> 16/05/16 10:23:15 INFO metastore.HiveMetaStore: 0: get_database: default
> 16/05/16 10:23:15 INFO HiveMetaStore.audit: ugi=root  ip=unknown-ip-addr  
> cmd=get_database: default   
> 16/05/16 10:23:15 INFO metastore.HiveMetaStore: 0: get_database: default
> 16/05/16 10:23:15 INFO HiveMetaStore.audit: ugi=root  ip=unknown-ip-addr  
> cmd=get_database: default   
> 16/05/16 10:23:15 INFO metastore.HiveMetaStore: 0: create_table: 
> Table(tableName:truncatett, dbName:default, owner:root, 
> createTime:1463365395, lastAccessTime:0, retention:0, 
> sd:StorageDescriptor(cols:[FieldSchema(name:c, type:string, comment:null)], 
> location:null, inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
> serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> parameters:{serialization.format=1}), bucketCols:[], sortCols:[], 
> parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], 
> skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{}, 
> viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, 
> privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, 
> rolePrivileges:null))
> 16/05/16 10:23:15 INFO HiveMetaStore.audit: ugi=root  ip=unknown-ip-addr  
> cmd=create_table: Table(tableName:truncatett, dbName:default, owner:root, 
> createTime:1463365395, lastAccessTime:0, retention:0, 
> sd:StorageDescriptor(cols:[FieldSchema(name:c, type:string, comment:null)], 
> location:null, inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
> serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> parameters:{serialization.format=1}), bucketCols:[], sortCols:[], 
> parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], 
> skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{}, 
> viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, 
> privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, 
> rolePrivileges:null))   
> 16/05/16 10:23:15 INFO common.FileUtils: Creating directory if it doesn't 
> exist: hdfs://vm001:9000/opt/apache/spark/spark-warehouse/truncatett
> 16/05/16 10:23:16 INFO spark.SparkContext: Starting job: processCmd at 
> CliDriver.java:376
> 16/05/16 10:23:16 INFO scheduler.DAGScheduler: Got job 1 (processCmd at 
> CliDriver.java:376) with 1 output partitions
> 16/05/16 10:23:16 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 
> (processCmd at CliDriver.java:376)
> 16/05/16 10:23:16 INFO scheduler.DAGScheduler: Parents of final stage: List()
> 16/05/16 10:23:16 INFO scheduler.DAGScheduler: Missing parents: List()
> 16/05/16 10:23:16 INFO scheduler.DAGScheduler: Submitting ResultStage 1 
> (MapPartitionsRDD[5] at processCmd at CliDriver.java:376), which has no 
> missing parents
> 16/05/16 10:23:16 INFO memory.MemoryStore: Block broadcast_1 stored as values 
> in memory (estimated size 3.2 KB, free 1823.2 MB)
> 16/05/16 10:23:16 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as 
> bytes in memory (estimated size 1965.0 B, free 1823.2 MB)
> 16/05/16 10:23:16 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in 
> memory on 192.168.151.146:47228 (size: 1965.0 B, free: 1823.2 MB)
> 16/05/16 10:23:16 INFO spark.SparkContext: Created broadcast 1 from broadcast 
> at DAGScheduler.scala:1012
> 16/05/16 10:23:16 INFO scheduler.DAGScheduler: Submitting 1 missing tasks 
> from ResultStage 1 (MapPartitionsRDD[5] at processCmd at CliDriver.java:376)
> 16/05/16 10:23:16 INFO cluster.YarnScheduler: Adding task set 1.0 with 1 tasks
> 16/05/16 10:23:16 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 
> 1.0 (TID 1, vm001, partition 0, PROCESS_LOCAL, 5387 bytes)
> 16/05/16 10:23:16 INFO cluster.YarnClientSchedulerBackend: 

[jira] [Updated] (SPARK-15246) Fix code style and improve volatile for SPARK-4452

2016-05-10 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-15246:
-
Summary:  Fix code style and improve volatile for SPARK-4452  (was:  Fix 
code style and improve volatile for Spillable)

>  Fix code style and improve volatile for SPARK-4452
> ---
>
> Key: SPARK-15246
> URL: https://issues.apache.org/jira/browse/SPARK-15246
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Lianhui Wang
>
> Follow-up for SPARK-4452:
> 1. Fix code style.
> 2. Remove volatile from the elementsRead method because only one thread 
> uses it.
> 3. Avoid volatile on _elementsRead because the collection increments 
> _elementsRead every time it inserts an element, which is very expensive, so we can avoid 
> it.
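
A toy sketch of points 2 and 3 above, not Spark's actual Spillable: the counter has a single writer (the task thread doing inserts), so a plain var avoids paying for a volatile write on every record.

{code:scala}
// Single-writer counter: incremented only by the task thread that inserts
// records, so it does not need to be @volatile on the hot path.
abstract class ToySpillable {
  // before: @volatile private[this] var _elementsRead = 0L
  private[this] var _elementsRead = 0L

  protected def elementsRead: Long = _elementsRead
  protected def addElementsRead(): Unit = { _elementsRead += 1 }
  protected def resetElementsRead(): Unit = { _elementsRead = 0L }
}
{code}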



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15246) Fix code style and improve volatile for Spillable

2016-05-10 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-15246:


 Summary:  Fix code style and improve volatile for Spillable
 Key: SPARK-15246
 URL: https://issues.apache.org/jira/browse/SPARK-15246
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Lianhui Wang


Follow-up for SPARK-4452:
1. Fix code style.
2. Remove volatile from the elementsRead method because only one thread 
uses it.
3. Avoid volatile on _elementsRead because the collection increments 
_elementsRead every time it inserts an element, which is very expensive, so we can avoid 
it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14705) support Multiple FileSystem for YARN STAGING DIR

2016-04-18 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-14705:


 Summary: support Multiple FileSystem for YARN STAGING DIR
 Key: SPARK-14705
 URL: https://issues.apache.org/jira/browse/SPARK-14705
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Lianhui Wang


SPARK-13063 made the Spark YARN staging dir configurable, but it 
only supports the default FileSystem. When there are many clusters, the staging dir can be on a 
different FileSystem for each cluster.
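
A sketch of the resolution logic, assuming the staging directory is configured as a full URI (e.g. via spark.yarn.stagingDir); the method name is illustrative, not the actual code in yarn/Client.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object StagingDirResolutionSketch {
  // The staging dir may point at a FileSystem other than fs.defaultFS
  // (e.g. hdfs://other-cluster/user/foo), so resolve the FileSystem from that
  // path instead of always using the default one.
  def stagingFileSystem(stagingDir: Option[String], conf: Configuration): FileSystem =
    stagingDir match {
      case Some(dir) => new Path(dir).getFileSystem(conf)
      case None      => FileSystem.get(conf) // fall back to the default FileSystem
    }
}
{code}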



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12322) recompute an cached RDD partition when getting its block is failed

2015-12-14 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang closed SPARK-12322.

Resolution: Invalid

> recompute an cached RDD partition when getting its block is failed
> --
>
> Key: SPARK-12322
> URL: https://issues.apache.org/jira/browse/SPARK-12322
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Lianhui Wang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12322) recompute an cached RDD partition when getting its block is failed

2015-12-14 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-12322:


 Summary: recompute an cached RDD partition when getting its block 
is failed
 Key: SPARK-12322
 URL: https://issues.apache.org/jira/browse/SPARK-12322
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Lianhui Wang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4621) Shuffle index can cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io

2015-12-12 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-4621:

Summary: Shuffle index can cached for SortShuffleManager in ExternalShuffle 
in order to  reduce indexFile's io  (was: when sort- based shuffle, Cache 
recently finished shuffle index can reduce indexFile's io)

> Shuffle index can cached for SortShuffleManager in ExternalShuffle in order 
> to  reduce indexFile's io
> -
>
> Key: SPARK-4621
> URL: https://issues.apache.org/jira/browse/SPARK-4621
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Lianhui Wang
>Priority: Minor
>
> In IndexShuffleBlockManager, we can use an LRU cache to store recently finished 
> shuffle indexes, which can reduce the index file's I/O.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4621) Shuffle index can be cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io

2015-12-12 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-4621:

Description: In the external shuffle service, we can use an LRU cache to store recently 
finished shuffle indexes, which can reduce the index file's I/O. At first, I will 
implement it for the external shuffle service; later I will add it to 
IndexShuffleBlockResolver if I can.  (was: in IndexShuffleBlockManager, we can 
use LRUCache to store recently finished shuffle index and that can reduce 
indexFile's io.)

> Shuffle index can be cached for SortShuffleManager in ExternalShuffle in 
> order to  reduce indexFile's io
> 
>
> Key: SPARK-4621
> URL: https://issues.apache.org/jira/browse/SPARK-4621
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Lianhui Wang
>Assignee: Apache Spark
>Priority: Minor
>
> In the external shuffle service, we can use an LRU cache to store recently finished shuffle 
> indexes, which can reduce the index file's I/O. At first, I will implement it for 
> the external shuffle service; later I will add it to IndexShuffleBlockResolver if I can.
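
Only the caching shape, with {{readIndexFile}} as a stand-in for parsing a shuffle index file; the real change would live in the external shuffle service's block resolver.

{code:scala}
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Access-ordered LinkedHashMap as a small LRU cache of parsed index offsets,
// so repeated fetches against the same map output skip re-reading the index
// file from disk.
class ShuffleIndexCache(maxEntries: Int) {
  private val cache = new JLinkedHashMap[String, Array[Long]](16, 0.75f, true) {
    override def removeEldestEntry(e: JMap.Entry[String, Array[Long]]): Boolean =
      size() > maxEntries
  }

  def getOffsets(indexFile: String)(readIndexFile: String => Array[Long]): Array[Long] =
    synchronized {
      val cached = cache.get(indexFile)
      if (cached != null) cached
      else {
        val offsets = readIndexFile(indexFile)
        cache.put(indexFile, offsets)
        offsets
      }
    }
}
{code}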



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4621) Shuffle index can be cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io

2015-12-12 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-4621:

Summary: Shuffle index can be cached for SortShuffleManager in 
ExternalShuffle in order to  reduce indexFile's io  (was: Shuffle index can 
cached for SortShuffleManager in ExternalShuffle in order to  reduce 
indexFile's io)

> Shuffle index can be cached for SortShuffleManager in ExternalShuffle in 
> order to  reduce indexFile's io
> 
>
> Key: SPARK-4621
> URL: https://issues.apache.org/jira/browse/SPARK-4621
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Lianhui Wang
>Priority: Minor
>
> In IndexShuffleBlockManager, we can use an LRU cache to store recently finished 
> shuffle indexes, which can reduce the index file's I/O.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12130) Replace shuffleManagerClass with shortShuffleMgrNames in ExternalShuffleBlockResolver

2015-12-03 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-12130:
-
Component/s: YARN
 Shuffle

> Replace shuffleManagerClass with shortShuffleMgrNames in 
> ExternalShuffleBlockResolver
> -
>
> Key: SPARK-12130
> URL: https://issues.apache.org/jira/browse/SPARK-12130
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Reporter: Lianhui Wang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12130) Replace shuffleManagerClass with shortShuffleMgrNames in ExternalShuffleBlockResolver

2015-12-03 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-12130:


 Summary: Replace shuffleManagerClass with shortShuffleMgrNames in 
ExternalShuffleBlockResolver
 Key: SPARK-12130
 URL: https://issues.apache.org/jira/browse/SPARK-12130
 Project: Spark
  Issue Type: Bug
Reporter: Lianhui Wang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11252) andrewor14

2015-10-22 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-11252:
-
Summary: andrewor14   (was: ExternalShuffleClient should release connection 
after it had completed to fetch blocks from yarn's NameManager)

> andrewor14 
> ---
>
> Key: SPARK-11252
> URL: https://issues.apache.org/jira/browse/SPARK-11252
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Reporter: Lianhui Wang
>
> Each executor's ExternalShuffleClient keeps its connection to YARN's 
> NodeManager until the application completes, so the NodeManager 
> ends up holding many socket connections.
> In order to reduce the network pressure on the NodeManager's shuffle service, the 
> connection to the shuffle service should be closed once registerWithShuffleServer 
> or fetchBlocks has completed in ExternalShuffleClient.
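
A toy model of the intent only; {{ShuffleConnection}} and {{openConnection}} are stand-ins, not Spark's TransportClient/ExternalShuffleClient API. The point is simply that the connection is released as soon as the fetch finishes.

{code:scala}
// Open, fetch, and always release -- instead of holding the NodeManager
// socket for the lifetime of the application.
trait ShuffleConnection extends AutoCloseable {
  def fetch(blockIds: Seq[String]): Unit
}

object FetchAndReleaseSketch {
  def fetchBlocks(openConnection: () => ShuffleConnection, blockIds: Seq[String]): Unit = {
    val conn = openConnection()
    try conn.fetch(blockIds)   // registerWithShuffleServer would follow the same pattern
    finally conn.close()       // release the shuffle-service connection promptly
  }
}
{code}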



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11252) ShuffleClient should release connection after fetching blocks had been completed in yarn's external shuffle

2015-10-22 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-11252:
-
Summary: ShuffleClient should release connection after fetching blocks had 
been completed in yarn's external shuffle  (was: andrewor14 )

> ShuffleClient should release connection after fetching blocks had been 
> completed in yarn's external shuffle
> ---
>
> Key: SPARK-11252
> URL: https://issues.apache.org/jira/browse/SPARK-11252
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Reporter: Lianhui Wang
>
> Each executor's ExternalShuffleClient keeps its connection to YARN's 
> NodeManager until the application completes, so the NodeManager 
> ends up holding many socket connections.
> In order to reduce the network pressure on the NodeManager's shuffle service, the 
> connection to the shuffle service should be closed once registerWithShuffleServer 
> or fetchBlocks has completed in ExternalShuffleClient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11252) ExternalShuffleClient should release connection after it had completed to fetch blocks from yarn's NameManager

2015-10-21 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-11252:


 Summary: ExternalShuffleClient should release connection after it 
had completed to fetch blocks from yarn's NameManager
 Key: SPARK-11252
 URL: https://issues.apache.org/jira/browse/SPARK-11252
 Project: Spark
  Issue Type: Bug
Reporter: Lianhui Wang


Each executor's ExternalShuffleClient keeps its connection to YARN's 
NodeManager until the application completes, so the NodeManager 
ends up holding many socket connections.
In order to reduce the network pressure on the NodeManager's shuffle service, the 
connection to the shuffle service should be closed once registerWithShuffleServer 
or fetchBlocks has completed in ExternalShuffleClient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11026) spark.yarn.user.classpath.first doesn't work for remote addJars

2015-10-09 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-11026:


 Summary: spark.yarn.user.classpath.first doesn't work for remote 
addJars
 Key: SPARK-11026
 URL: https://issues.apache.org/jira/browse/SPARK-11026
 Project: Spark
  Issue Type: Bug
Reporter: Lianhui Wang


When spark.yarn.user.classpath.first=true and the jars passed via addJars are on an HDFS path, 
the YARN link names of those jars need to be added to the system classpath.
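
A hedged sketch of how the classpath entry could be derived, assuming YARN localises each --jars entry under its file-name link name; {{classpathEntryFor}} is an illustrative helper, not the actual code in yarn/Client.

{code:scala}
import java.io.File
import java.net.URI

object UserClasspathSketch {
  // With spark.yarn.user.classpath.first=true, a jar given as an hdfs:// URI is
  // localised by YARN into the container working directory under its link name
  // (the file name), so that name -- not the remote URI -- belongs on the
  // container classpath.
  def classpathEntryFor(jar: String): String = {
    val uri = new URI(jar)
    if (uri.getScheme == null || uri.getScheme == "local") jar // already a node-local path
    else new File(uri.getPath).getName                         // YARN link name of the localised jar
  }
}
{code}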



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10775) add search keywords in history page ui

2015-09-23 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-10775:


 Summary: add search keywords in history page ui
 Key: SPARK-10775
 URL: https://issues.apache.org/jira/browse/SPARK-10775
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Lianhui Wang
Priority: Minor


Add a search box to the history page UI so that we can search for applications by 
keyword, for example by date or time.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie

2015-09-22 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902459#comment-14902459
 ] 

Lianhui Wang edited comment on SPARK-2666 at 9/22/15 12:22 PM:
---

[~imranr] Thanks, I have taken a look at https://github.com/squito/spark/pull/4. 
This PR does not cancel the running tasks of a completed stage. I think that when a 
stage's task set has finished in TaskSchedulerImpl.taskSetFinished, the running 
tasks of that stage should be killed.


was (Author: lianhuiwang):
[~imranr] Thanks, I have taken a look at https://github.com/squito/spark/pull/4, 
and I think its logic is right.

> Always try to cancel running tasks when a stage is marked as zombie
> ---
>
> Key: SPARK-2666
> URL: https://issues.apache.org/jira/browse/SPARK-2666
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Reporter: Lianhui Wang
>
> There are some situations in which the scheduler can mark a task set as a 
> "zombie" before the task set has completed all of its tasks.  For example:
> (a) When a task fails b/c of a {{FetchFailed}}
> (b) When a stage completes because two different attempts create all the 
> ShuffleMapOutput, though no attempt has completed all its tasks (at least, 
> this *should* result in the task set being marked as zombie, see SPARK-10370)
> (there may be others, I'm not sure if this list is exhaustive.)
> Marking a taskset as zombie prevents any *additional* tasks from getting 
> scheduled, however it does not cancel all currently running tasks.  We should 
> cancel all running tasks to avoid wasting resources (and also to make the behavior 
> a little more clear to the end user).  Rather than canceling tasks in each 
> case piecemeal, we should refactor the scheduler so that these two actions 
> are always taken together -- canceling tasks should go hand-in-hand with 
> marking the taskset as zombie.
> Some implementation notes:
> * We should change {{taskSetManager.isZombie}} to be private and put it 
> behind a method like {{markZombie}} or something.
> * marking a stage as zombie before all the tasks have completed does *not* 
> necessarily mean the stage attempt has failed.  In case (a), the stage 
> attempt has failed, but in stage (b) we are not canceling b/c of a failure, 
> rather just b/c no more tasks are needed.
> * {{taskScheduler.cancelTasks}} always marks the task set as zombie.  
> However, it also has some side-effects like logging that the stage has failed 
> and creating a {{TaskSetFailed}} event, which we don't want eg. in case (b) 
> when nothing has failed.  So it may need some additional refactoring to go 
> along w/ {{markZombie}}.
> * {{SchedulerBackend}}s are free to not implement {{killTask}}, so we need 
> to be sure to catch the {{UnsupportedOperationException}}s
> * Testing this *might* benefit from SPARK-10372



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie

2015-09-22 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902459#comment-14902459
 ] 

Lianhui Wang commented on SPARK-2666:
-

[~imranr] Thanks, I have taken a look at https://github.com/squito/spark/pull/4, 
and I think its logic is right; it is OK except for the unit test.

> Always try to cancel running tasks when a stage is marked as zombie
> ---
>
> Key: SPARK-2666
> URL: https://issues.apache.org/jira/browse/SPARK-2666
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Reporter: Lianhui Wang
>
> There are some situations in which the scheduler can mark a task set as a 
> "zombie" before the task set has completed all of its tasks.  For example:
> (a) When a task fails b/c of a {{FetchFailed}}
> (b) When a stage completes because two different attempts create all the 
> ShuffleMapOutput, though no attempt has completed all its tasks (at least, 
> this *should* result in the task set being marked as zombie, see SPARK-10370)
> (there may be others, I'm not sure if this list is exhaustive.)
> Marking a taskset as zombie prevents any *additional* tasks from getting 
> scheduled, however it does not cancel all currently running tasks.  We should 
> cancel all running tasks to avoid wasting resources (and also to make the behavior 
> a little more clear to the end user).  Rather than canceling tasks in each 
> case piecemeal, we should refactor the scheduler so that these two actions 
> are always taken together -- canceling tasks should go hand-in-hand with 
> marking the taskset as zombie.
> Some implementation notes:
> * We should change {{taskSetManager.isZombie}} to be private and put it 
> behind a method like {{markZombie}} or something.
> * marking a stage as zombie before all the tasks have completed does *not* 
> necessarily mean the stage attempt has failed.  In case (a), the stage 
> attempt has failed, but in stage (b) we are not canceling b/c of a failure, 
> rather just b/c no more tasks are needed.
> * {{taskScheduler.cancelTasks}} always marks the task set as zombie.  
> However, it also has some side-effects like logging that the stage has failed 
> and creating a {{TaskSetFailed}} event, which we don't want eg. in case (b) 
> when nothing has failed.  So it may need some additional refactoring to go 
> along w/ {{markZombie}}.
> * {{SchedulerBackend}}s are free to not implement {{killTask}}, so we need 
> to be sure to catch the {{UnsupportedOperationException}}s
> * Testing this *might* benefit from SPARK-10372



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie

2015-09-22 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902459#comment-14902459
 ] 

Lianhui Wang edited comment on SPARK-2666 at 9/22/15 12:11 PM:
---

[~imranr] Thanks, I have taken a look at https://github.com/squito/spark/pull/4, 
and I think its logic is right.


was (Author: lianhuiwang):
[~imranr] Thanks, I have taken a look at https://github.com/squito/spark/pull/4, 
and I think its logic is right; it is OK except for the unit test.

> Always try to cancel running tasks when a stage is marked as zombie
> ---
>
> Key: SPARK-2666
> URL: https://issues.apache.org/jira/browse/SPARK-2666
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Reporter: Lianhui Wang
>
> There are some situations in which the scheduler can mark a task set as a 
> "zombie" before the task set has completed all of its tasks.  For example:
> (a) When a task fails b/c of a {{FetchFailed}}
> (b) When a stage completes because two different attempts create all the 
> ShuffleMapOutput, though no attempt has completed all its tasks (at least, 
> this *should* result in the task set being marked as zombie, see SPARK-10370)
> (there may be others, I'm not sure if this list is exhaustive.)
> Marking a taskset as zombie prevents any *additional* tasks from getting 
> scheduled, however it does not cancel all currently running tasks.  We should 
> cancel all running tasks to avoid wasting resources (and also to make the behavior 
> a little more clear to the end user).  Rather than canceling tasks in each 
> case piecemeal, we should refactor the scheduler so that these two actions 
> are always taken together -- canceling tasks should go hand-in-hand with 
> marking the taskset as zombie.
> Some implementation notes:
> * We should change {{taskSetManager.isZombie}} to be private and put it 
> behind a method like {{markZombie}} or something.
> * marking a stage as zombie before all the tasks have completed does *not* 
> necessarily mean the stage attempt has failed.  In case (a), the stage 
> attempt has failed, but in stage (b) we are not canceling b/c of a failure, 
> rather just b/c no more tasks are needed.
> * {{taskScheduler.cancelTasks}} always marks the task set as zombie.  
> However, it also has some side-effects like logging that the stage has failed 
> and creating a {{TaskSetFailed}} event, which we don't want eg. in case (b) 
> when nothing has failed.  So it may need some additional refactoring to go 
> along w/ {{markZombie}}.
> * {{SchedulerBackend}}s are free to not implement {{killTask}}, so we need 
> to be sure to catch the {{UnsupportedOperationException}}s
> * Testing this *might* benefit from SPARK-10372



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8646) PySpark does not run on YARN if master not provided in command line

2015-07-16 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629409#comment-14629409
 ] 

Lianhui Wang commented on SPARK-8646:
-

Yes, when I use this command: ./bin/spark-submit ./pi.py yarn-client 10, 
YARN's Client does not upload pyspark.zip, so it cannot work. I have submitted a 
PR that resolves this problem based on the master branch. 


 PySpark does not run on YARN if master not provided in command line
 ---

 Key: SPARK-8646
 URL: https://issues.apache.org/jira/browse/SPARK-8646
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.4.0
 Environment: SPARK_HOME=local/path/to/spark1.4install/dir
 also with
 SPARK_HOME=local/path/to/spark1.4install/dir
 PYTHONPATH=$SPARK_HOME/python/lib
 Spark apps are submitted with the command:
 $SPARK_HOME/bin/spark-submit outofstock/data_transform.py 
 hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
 data_transform contains a main method, and the rest of the args are parsed in 
 my own code.
Reporter: Juliet Hougland
 Attachments: executor.log, pi-test.log, 
 spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, 
 spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, 
 spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log


 Running pyspark jobs results in a "no module named pyspark" error when run in 
 yarn-client mode in Spark 1.4.
 [I believe this JIRA represents the change that introduced this error.| 
 https://issues.apache.org/jira/browse/SPARK-6869 ]
 This does not represent a binary compatible change to Spark. Scripts that 
 worked on previous Spark versions (i.e. commands that use spark-submit) should 
 continue to work without modification between minor versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8646) PySpark does not run on YARN if master not provided in command line

2015-07-16 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629409#comment-14629409
 ] 

Lianhui Wang edited comment on SPARK-8646 at 7/16/15 8:25 AM:
--

Yes. When I use this command: ./bin/spark-submit ./pi.py yarn-client 10, 
YARN's client does not upload pyspark.zip, so it cannot work. I have submitted 
a PR that resolves this problem, based on the master branch. There are some 
problems on the spark-1.4.0 branch because it finds the pyspark libraries in 
SparkSubmit, not in Client. 



was (Author: lianhuiwang):
Yes. When I use this command: ./bin/spark-submit ./pi.py yarn-client 10, 
YARN's client does not upload pyspark.zip, so it cannot work. I have submitted 
a PR that resolves this problem, based on the master branch. 


 PySpark does not run on YARN if master not provided in command line
 ---

 Key: SPARK-8646
 URL: https://issues.apache.org/jira/browse/SPARK-8646
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.4.0
 Environment: SPARK_HOME=local/path/to/spark1.4install/dir
 also with
 SPARK_HOME=local/path/to/spark1.4install/dir
 PYTHONPATH=$SPARK_HOME/python/lib
 Spark apps are submitted with the command:
 $SPARK_HOME/bin/spark-submit outofstock/data_transform.py 
 hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
 data_transform contains a main method, and the rest of the args are parsed in 
 my own code.
Reporter: Juliet Hougland
 Attachments: executor.log, pi-test.log, 
 spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, 
 spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, 
 spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log


 Running pyspark jobs results in a "no module named pyspark" error when run in 
 yarn-client mode in Spark 1.4.
 [I believe this JIRA represents the change that introduced this error.| 
 https://issues.apache.org/jira/browse/SPARK-6869 ]
 This does not represent a binary compatible change to Spark. Scripts that 
 worked on previous Spark versions (i.e. commands that use spark-submit) should 
 continue to work without modification between minor versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8646) PySpark does not run on YARN if master not provided in command line

2015-07-16 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629409#comment-14629409
 ] 

Lianhui Wang edited comment on SPARK-8646 at 7/16/15 8:26 AM:
--

Yes. When I use this command: ./bin/spark-submit ./pi.py yarn-client 10, 
YARN's client does not upload pyspark.zip, so it cannot work. I have submitted 
a PR that resolves this problem, based on the master branch. There are some 
problems on the spark-1.4.0 branch because it finds the pyspark libraries in 
SparkSubmit, not in Client. If this is needed in spark-1.4.0, I will take a 
look at it later.



was (Author: lianhuiwang):
Yes. When I use this command: ./bin/spark-submit ./pi.py yarn-client 10, 
YARN's client does not upload pyspark.zip, so it cannot work. I have submitted 
a PR that resolves this problem, based on the master branch. There are some 
problems on the spark-1.4.0 branch because it finds the pyspark libraries in 
SparkSubmit, not in Client. 


 PySpark does not run on YARN if master not provided in command line
 ---

 Key: SPARK-8646
 URL: https://issues.apache.org/jira/browse/SPARK-8646
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.4.0
 Environment: SPARK_HOME=local/path/to/spark1.4install/dir
 also with
 SPARK_HOME=local/path/to/spark1.4install/dir
 PYTHONPATH=$SPARK_HOME/python/lib
 Spark apps are submitted with the command:
 $SPARK_HOME/bin/spark-submit outofstock/data_transform.py 
 hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
 data_transform contains a main method, and the rest of the args are parsed in 
 my own code.
Reporter: Juliet Hougland
 Attachments: executor.log, pi-test.log, 
 spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, 
 spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, 
 spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log


 Running pyspark jobs results in a "no module named pyspark" error when run in 
 yarn-client mode in Spark 1.4.
 [I believe this JIRA represents the change that introduced this error.| 
 https://issues.apache.org/jira/browse/SPARK-6869 ]
 This does not represent a binary compatible change to Spark. Scripts that 
 worked on previous Spark versions (i.e. commands that use spark-submit) should 
 continue to work without modification between minor versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8646) PySpark does not run on YARN if master not provided in command line

2015-07-15 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629103#comment-14629103
 ] 

Lianhui Wang commented on SPARK-8646:
-

Yes. When we set master=yarn-client in pyspark/SparkContext.py, it does not 
take effect, so spark-submit considers master=local. I will look into it. 

 PySpark does not run on YARN if master not provided in command line
 ---

 Key: SPARK-8646
 URL: https://issues.apache.org/jira/browse/SPARK-8646
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.4.0
 Environment: SPARK_HOME=local/path/to/spark1.4install/dir
 also with
 SPARK_HOME=local/path/to/spark1.4install/dir
 PYTHONPATH=$SPARK_HOME/python/lib
 Spark apps are submitted with the command:
 $SPARK_HOME/bin/spark-submit outofstock/data_transform.py 
 hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
 data_transform contains a main method, and the rest of the args are parsed in 
 my own code.
Reporter: Juliet Hougland
 Attachments: executor.log, pi-test.log, 
 spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, 
 spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, 
 spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log


 Running pyspark jobs results in a "no module named pyspark" error when run in 
 yarn-client mode in Spark 1.4.
 [I believe this JIRA represents the change that introduced this error.| 
 https://issues.apache.org/jira/browse/SPARK-6869 ]
 This does not represent a binary compatible change to Spark. Scripts that 
 worked on previous Spark versions (i.e. commands that use spark-submit) should 
 continue to work without modification between minor versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-8646) PySpark does not run on YARN if master not provided in command line

2015-07-15 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-8646:

Comment: was deleted

(was: Yes. When we set master=yarn-client in pyspark/SparkContext.py, it does 
not take effect, so spark-submit considers master=local. I will look into it.)

 PySpark does not run on YARN if master not provided in command line
 ---

 Key: SPARK-8646
 URL: https://issues.apache.org/jira/browse/SPARK-8646
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.4.0
 Environment: SPARK_HOME=local/path/to/spark1.4install/dir
 also with
 SPARK_HOME=local/path/to/spark1.4install/dir
 PYTHONPATH=$SPARK_HOME/python/lib
 Spark apps are submitted with the command:
 $SPARK_HOME/bin/spark-submit outofstock/data_transform.py 
 hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
 data_transform contains a main method, and the rest of the args are parsed in 
 my own code.
Reporter: Juliet Hougland
 Attachments: executor.log, pi-test.log, 
 spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, 
 spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, 
 spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log


 Running pyspark jobs results in a "no module named pyspark" error when run in 
 yarn-client mode in Spark 1.4.
 [I believe this JIRA represents the change that introduced this error.| 
 https://issues.apache.org/jira/browse/SPARK-6869 ]
 This does not represent a binary compatible change to Spark. Scripts that 
 worked on previous Spark versions (i.e. commands that use spark-submit) should 
 continue to work without modification between minor versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-13 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624521#comment-14624521
 ] 

Lianhui Wang commented on SPARK-8646:
-

[~juliet] From your spark1.4-verbose.log, I can see that master=local[*]. So 
perhaps you configured spark.master=local in spark-defaults.conf? Another 
possibility is that your data_transform.py calls 
sparkConf.set("spark.master", "local"). Can you check whether either of these 
is the case?

 PySpark does not run on YARN
 

 Key: SPARK-8646
 URL: https://issues.apache.org/jira/browse/SPARK-8646
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.4.0
 Environment: SPARK_HOME=local/path/to/spark1.4install/dir
 also with
 SPARK_HOME=local/path/to/spark1.4install/dir
 PYTHONPATH=$SPARK_HOME/python/lib
 Spark apps are submitted with the command:
 $SPARK_HOME/bin/spark-submit outofstock/data_transform.py 
 hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
 data_transform contains a main method, and the rest of the args are parsed in 
 my own code.
Reporter: Juliet Hougland
 Attachments: executor.log, pi-test.log, 
 spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, 
 spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, 
 spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log


 Running pyspark jobs results in a "no module named pyspark" error when run in 
 yarn-client mode in Spark 1.4.
 [I believe this JIRA represents the change that introduced this error.| 
 https://issues.apache.org/jira/browse/SPARK-6869 ]
 This does not represent a binary compatible change to Spark. Scripts that 
 worked on previous Spark versions (i.e. commands that use spark-submit) should 
 continue to work without modification between minor versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-13 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625725#comment-14625725
 ] 

Lianhui Wang commented on SPARK-8646:
-

[~juliet] Can you provide your spark-submit command? 
I think the correct command in Spark 1.4 is $SPARK_HOME/bin/spark-submit 
--master yarn-client outofstock/data_transform.py 
hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex4/
Is it the same as your command?

 PySpark does not run on YARN
 

 Key: SPARK-8646
 URL: https://issues.apache.org/jira/browse/SPARK-8646
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.4.0
 Environment: SPARK_HOME=local/path/to/spark1.4install/dir
 also with
 SPARK_HOME=local/path/to/spark1.4install/dir
 PYTHONPATH=$SPARK_HOME/python/lib
 Spark apps are submitted with the command:
 $SPARK_HOME/bin/spark-submit outofstock/data_transform.py 
 hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
 data_transform contains a main method, and the rest of the args are parsed in 
 my own code.
Reporter: Juliet Hougland
 Attachments: executor.log, pi-test.log, 
 spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, 
 spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, 
 spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log


 Running pyspark jobs results in a "no module named pyspark" error when run in 
 yarn-client mode in Spark 1.4.
 [I believe this JIRA represents the change that introduced this error.| 
 https://issues.apache.org/jira/browse/SPARK-6869 ]
 This does not represent a binary compatible change to Spark. Scripts that 
 worked on previous Spark versions (i.e. commands that use spark-submit) should 
 continue to work without modification between minor versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-09 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14620466#comment-14620466
 ] 

Lianhui Wang commented on SPARK-8646:
-

[~j_houg] Can you add --verbose to the spark-submit command and check the value 
of spark.submit.pyArchives?
From your logs, I can see that it does not upload the pyArchive files, 
i.e. pyspark.zip and py4j-0.8.2.1-src.zip. You can also check whether these two 
zips are present under SPARK_HOME/python/lib.

 PySpark does not run on YARN
 

 Key: SPARK-8646
 URL: https://issues.apache.org/jira/browse/SPARK-8646
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.4.0
 Environment: SPARK_HOME=local/path/to/spark1.4install/dir
 also with
 SPARK_HOME=local/path/to/spark1.4install/dir
 PYTHONPATH=$SPARK_HOME/python/lib
 Spark apps are submitted with the command:
 $SPARK_HOME/bin/spark-submit outofstock/data_transform.py 
 hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
 data_transform contains a main method, and the rest of the args are parsed in 
 my own code.
Reporter: Juliet Hougland
 Attachments: pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, 
 spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, 
 spark1.4-SPARK_HOME-set.log


 Running pyspark jobs results in a "no module named pyspark" error when run in 
 yarn-client mode in Spark 1.4.
 [I believe this JIRA represents the change that introduced this error.| 
 https://issues.apache.org/jira/browse/SPARK-6869 ]
 This does not represent a binary compatible change to Spark. Scripts that 
 worked on previous Spark versions (i.e. commands that use spark-submit) should 
 continue to work without modification between minor versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-06-26 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603973#comment-14603973
 ] 

Lianhui Wang commented on SPARK-8646:
-

From [~juliet]'s logs, I think the Python 'pandas.algos' module is missing; 
pyspark does not provide it. I think you need to install it on the nodes.

 PySpark does not run on YARN
 

 Key: SPARK-8646
 URL: https://issues.apache.org/jira/browse/SPARK-8646
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.4.0
 Environment: SPARK_HOME=local/path/to/spark1.4install/dir
 also with
 SPARK_HOME=local/path/to/spark1.4install/dir
 PYTHONPATH=$SPARK_HOME/python/lib
 Spark apps are submitted with the command:
 $SPARK_HOME/bin/spark-submit outofstock/data_transform.py 
 hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
 data_transform contains a main method, and the rest of the args are parsed in 
 my own code.
Reporter: Juliet Hougland
 Attachments: pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, 
 spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, 
 spark1.4-SPARK_HOME-set.log


 Running pyspark jobs results in a "no module named pyspark" error when run in 
 yarn-client mode in Spark 1.4.
 [I believe this JIRA represents the change that introduced this error.| 
 https://issues.apache.org/jira/browse/SPARK-6869 ]
 This does not represent a binary compatible change to Spark. Scripts that 
 worked on previous Spark versions (i.e. commands that use spark-submit) should 
 continue to work without modification between minor versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6954) ExecutorAllocationManager can end up requesting a negative number of executors

2015-06-19 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-6954:

Summary: ExecutorAllocationManager can end up requesting a negative number 
of executors  (was: tdwadmin)

 ExecutorAllocationManager can end up requesting a negative number of executors
 --

 Key: SPARK-6954
 URL: https://issues.apache.org/jira/browse/SPARK-6954
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.1
Reporter: Cheolsoo Park
Assignee: Sandy Ryza
  Labels: yarn
 Fix For: 1.3.2, 1.4.0

 Attachments: with_fix.png, without_fix.png


 I have a simple test case for dynamic allocation on YARN that fails with the 
 following stack trace-
 {code}
 15/04/16 00:52:14 ERROR Utils: Uncaught exception in thread 
 spark-dynamic-executor-allocation-0
 java.lang.IllegalArgumentException: Attempted to request a negative number of 
 executor(s) -21 from the cluster manager. Please specify a positive number!
   at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:338)
   at 
 org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1137)
   at 
 org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294)
   at 
 org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263)
   at 
 org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
   at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
   at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 My test is as follows-
 # Start spark-shell with a single executor.
 # Run a {{select count(\*)}} query. The number of executors rises as input 
 size is non-trivial.
 # After the job finishes, the number of  executors falls as most of them 
 become idle.
 # Rerun the same query again, and the request to add executors fails with the 
 above error. In fact, the job itself continues to run with whatever executors 
 it already has, but it never gets more executors unless the shell is closed 
 and restarted. 
 In fact, this error only happens when I configure {{executorIdleTimeout}} 
 very small. For eg, I can reproduce it with the following configs-
 {code}
 spark.dynamicAllocation.executorIdleTimeout 5
 spark.dynamicAllocation.schedulerBacklogTimeout 5
 {code}
 Although I can simply increase {{executorIdleTimeout}} to something like 60 
 secs to avoid the error, I think this is still a bug to be fixed.
 The root cause seems that {{numExecutorsPending}} accidentally becomes 
 negative if executors are killed too aggressively (i.e. 
 {{executorIdleTimeout}} is too small) because under that circumstance, the 
 new target # of executors can be smaller than the current # of executors. 
 When that happens, {{ExecutorAllocationManager}} ends up trying to add a 
 negative number of executors, which throws an exception.
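
To make the suggested guard concrete, here is a tiny, hypothetical Scala sketch; 
the object and parameter names are invented and this is not the actual 
ExecutorAllocationManager code.

{code}
// Hypothetical sketch: never ask the cluster manager for a negative delta.
object AllocationGuardSketch {
  def executorsToAdd(targetTotal: Int, pending: Int, running: Int): Int = {
    val delta = targetTotal - (pending + running)
    math.max(delta, 0) // clamp, so a shrunken target cannot produce a negative request
  }

  def main(args: Array[String]): Unit = {
    // e.g. the target dropped to 4 while 25 executors are still pending or running
    println(executorsToAdd(targetTotal = 4, pending = 10, running = 15)) // prints 0
  }
}
{code}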



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8430) in Yarn's shuffle service ExternalShuffleBlockResolver should support UnsafeShuffleManager

2015-06-18 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-8430:
---

 Summary: in Yarn's shuffle service ExternalShuffleBlockResolver 
should support UnsafeShuffleManager
 Key: SPARK-8430
 URL: https://issues.apache.org/jira/browse/SPARK-8430
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Lianhui Wang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8381) reuse typeConvert when convert Seq[Row] to catalyst type

2015-06-15 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-8381:

Description: The method CatalystTypeConverters.convertToCatalyst is slow, so 
for batch conversion we should use the converter produced by 
createToCatalystConverter.  (was: The method 
CatalystTypeConverters.convertToCatalyst is slow, and for batch conversion we 
should use the converter produced by createToCatalystConverter.)
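
For context, here is a minimal sketch of the reuse pattern this description asks 
for. The names below are invented stand-ins; the real CatalystTypeConverters API 
is internal to Spark SQL and is not reproduced here.

{code}
// Minimal sketch of converter reuse: build one converter per batch from the
// schema, then apply it to every row. createConverterSketch stands in for
// something like createToCatalystConverter and simply returns a reusable
// function closed over the schema (real conversion logic elided).
object ConverterReuseSketch {
  def createConverterSketch(schema: Seq[String]): Any => Any =
    value => value // identity keeps the sketch runnable

  def convertBatch(schema: Seq[String], rows: Seq[Any]): Seq[Any] = {
    val converter = createConverterSketch(schema) // built once per batch
    rows.map(converter)                           // reused for every row
  }
}
{code}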

 reuse typeConvert when convert Seq[Row] to catalyst type
 

 Key: SPARK-8381
 URL: https://issues.apache.org/jira/browse/SPARK-8381
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Lianhui Wang

 The method CatalystTypeConverters.convertToCatalyst is slow, so for batch 
 conversion we should use the converter produced by createToCatalystConverter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8381) reuse-typeConvert when convert Seq[Row] to CatalystType

2015-06-15 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-8381:
---

 Summary: reuse-typeConvert when convert Seq[Row] to CatalystType
 Key: SPARK-8381
 URL: https://issues.apache.org/jira/browse/SPARK-8381
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Lianhui Wang


The method CatalystTypeConverters.convertToCatalyst is slow, and for batch 
conversion we should use the converter produced by createToCatalystConverter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8381) reuse typeConvert when convert Seq[Row] to catalyst type

2015-06-15 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-8381:

Summary: reuse typeConvert when convert Seq[Row] to catalyst type  (was: 
reuse-typeConvert when convert Seq[Row] to catalyst type)

 reuse typeConvert when convert Seq[Row] to catalyst type
 

 Key: SPARK-8381
 URL: https://issues.apache.org/jira/browse/SPARK-8381
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Lianhui Wang

 The method CatalystTypeConverters.convertToCatalyst is slow, and for batch 
 conversion we should use the converter produced by createToCatalystConverter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8381) reuse-typeConvert when convert Seq[Row] to catalyst type

2015-06-15 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-8381:

Summary: reuse-typeConvert when convert Seq[Row] to catalyst type  (was: 
reuse-typeConvert when convert Seq[Row] to CatalystType)

 reuse-typeConvert when convert Seq[Row] to catalyst type
 

 Key: SPARK-8381
 URL: https://issues.apache.org/jira/browse/SPARK-8381
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Lianhui Wang

 The method CatalystTypeConverters.convertToCatalyst is slow, and for batch 
 conversion we should use the converter produced by createToCatalystConverter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6700) flaky test: run Python application in yarn-cluster mode

2015-04-06 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14480988#comment-14480988
 ] 

Lianhui Wang commented on SPARK-6700:
-

I do not think this is related to SPARK-6506, because YarnClusterSuite sets 
SPARK_HOME. I just ran the YarnClusterSuite test, and the Python application 
test in YarnClusterSuite passed for me. [~davies] Can you share your 
unit-test.log or appMaster.log?

 flaky test: run Python application in yarn-cluster mode 
 

 Key: SPARK-6700
 URL: https://issues.apache.org/jira/browse/SPARK-6700
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Davies Liu
Assignee: Lianhui Wang
Priority: Critical
  Labels: test, yarn

 org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in 
 yarn-cluster mode
 Failing for the past 1 build (Since Failed#2025 )
 Took 12 sec.
 Error Message
 {code}
 Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
 Stacktrace
 sbt.ForkMain$ForkError: Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at 

[jira] [Comment Edited] (SPARK-6700) flaky test: run Python application in yarn-cluster mode

2015-04-06 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14480988#comment-14480988
 ] 

Lianhui Wang edited comment on SPARK-6700 at 4/6/15 6:49 AM:
-

I do not think this is related to SPARK-6506, because YarnClusterSuite sets 
SPARK_HOME. I just ran the YarnClusterSuite test, and the Python application 
test in YarnClusterSuite passed for me. [~davies] Can you share your 
unit-test.log or appMaster.log? In addition, I think you can try again, because 
other errors may have caused the failure. 


was (Author: lianhuiwang):
I do not think this is related to SPARK-6506, because YarnClusterSuite sets 
SPARK_HOME. I just ran the YarnClusterSuite test, and the Python application 
test in YarnClusterSuite passed for me. [~davies] Can you share your 
unit-test.log or appMaster.log?

 flaky test: run Python application in yarn-cluster mode 
 

 Key: SPARK-6700
 URL: https://issues.apache.org/jira/browse/SPARK-6700
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Davies Liu
Assignee: Lianhui Wang
Priority: Critical
  Labels: test, yarn

 org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in 
 yarn-cluster mode
 Failing for the past 1 build (Since Failed#2025 )
 Took 12 sec.
 Error Message
 {code}
 Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
 Stacktrace
 sbt.ForkMain$ForkError: Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   

[jira] [Comment Edited] (SPARK-6506) python support yarn cluster mode requires SPARK_HOME to be set

2015-03-26 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14381844#comment-14381844
 ] 

Lianhui Wang edited comment on SPARK-6506 at 3/26/15 1:18 PM:
--

Hi [~tgraves], I am running 1.3.0. If I do not set SPARK_HOME on every node, I 
get the following exception in every executor:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  
/data/yarnenv/local/usercache/lianhui/filecache/296/spark-assembly-1.3.0-hadoop2.2.0.jar/python
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

From the exception, I can see that the pyspark packaged inside spark.jar on the 
NodeManager does not work, and I do not know why. [~andrewor14] Can you help me?
So I think for now we should put the Spark directories on PYTHONPATH or set 
SPARK_HOME on every node.


was (Author: lianhuiwang):
Hi [~tgraves], I am running 1.3.0. If I do not set SPARK_HOME on every node, I 
get the following exception in every executor:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  
/data/yarnenv/local/usercache/lianhui/filecache/296/spark-assembly-1.3.0-hadoop2.2.0.jar/python
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

From the exception, I can see that the pyspark packaged inside spark.jar on the 
NodeManager does not work, and I do not know why. [~andrewor14] Can you help me?
So I think for now we should put the Spark directories on PYTHONPATH or set 
SPARK_HOME on every node.

 python support yarn cluster mode requires SPARK_HOME to be set
 --

 Key: SPARK-6506
 URL: https://issues.apache.org/jira/browse/SPARK-6506
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.0
Reporter: Thomas Graves

 We added support for python running in yarn cluster mode in 
 https://issues.apache.org/jira/browse/SPARK-5173, but it requires that 
 SPARK_HOME be set in the environment variables for application master and 
 executor.  It doesn't have to be set to anything real but it fails if its not 
 set.  See the command at the end of: https://github.com/apache/spark/pull/3976



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6506) python support yarn cluster mode requires SPARK_HOME to be set

2015-03-26 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14381844#comment-14381844
 ] 

Lianhui Wang commented on SPARK-6506:
-

Hi [~tgraves], I am running 1.3.0. If I do not set SPARK_HOME on every node, I 
get the following exception in every executor:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  
/data/yarnenv/local/usercache/lianhui/filecache/296/spark-assembly-1.3.0-hadoop2.2.0.jar/python
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

From the exception, I can see that the pyspark packaged inside spark.jar on the 
NodeManager does not work, and I do not know why. [~andrewor14] Can you help me?
So I think for now we should set SPARK_HOME to the Spark directory on every node.

 python support yarn cluster mode requires SPARK_HOME to be set
 --

 Key: SPARK-6506
 URL: https://issues.apache.org/jira/browse/SPARK-6506
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.0
Reporter: Thomas Graves

 We added support for python running in yarn cluster mode in 
 https://issues.apache.org/jira/browse/SPARK-5173, but it requires that 
 SPARK_HOME be set in the environment variables for application master and 
 executor.  It doesn't have to be set to anything real but it fails if its not 
 set.  See the command at the end of: https://github.com/apache/spark/pull/3976



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6506) python support yarn cluster mode requires SPARK_HOME to be set

2015-03-26 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14381844#comment-14381844
 ] 

Lianhui Wang edited comment on SPARK-6506 at 3/26/15 1:17 PM:
--

Hi [~tgraves], I am running 1.3.0. If I do not set SPARK_HOME on every node, I 
get the following exception in every executor:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  
/data/yarnenv/local/usercache/lianhui/filecache/296/spark-assembly-1.3.0-hadoop2.2.0.jar/python
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

From the exception, I can see that the pyspark packaged inside spark.jar on the 
NodeManager does not work, and I do not know why. [~andrewor14] Can you help me?
So I think for now we should put the Spark directories on PYTHONPATH or set 
SPARK_HOME on every node.


was (Author: lianhuiwang):
Hi [~tgraves], I am running 1.3.0. If I do not set SPARK_HOME on every node, I 
get the following exception in every executor:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  
/data/yarnenv/local/usercache/lianhui/filecache/296/spark-assembly-1.3.0-hadoop2.2.0.jar/python
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

From the exception, I can see that the pyspark packaged inside spark.jar on the 
NodeManager does not work, and I do not know why. [~andrewor14] Can you help me?
So I think for now we should set SPARK_HOME to the Spark directory on every node.

 python support yarn cluster mode requires SPARK_HOME to be set
 --

 Key: SPARK-6506
 URL: https://issues.apache.org/jira/browse/SPARK-6506
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.0
Reporter: Thomas Graves

 We added support for python running in yarn cluster mode in 
 https://issues.apache.org/jira/browse/SPARK-5173, but it requires that 
 SPARK_HOME be set in the environment variables for application master and 
 executor.  It doesn't have to be set to anything real but it fails if its not 
 set.  See the command at the end of: https://github.com/apache/spark/pull/3976



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6103) remove unused class to import in EdgeRDDImpl

2015-03-01 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-6103:
---

 Summary: remove unused class to import in EdgeRDDImpl
 Key: SPARK-6103
 URL: https://issues.apache.org/jira/browse/SPARK-6103
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Lianhui Wang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6056) Unlimit offHeap memory use cause RM killing the container

2015-02-28 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341498#comment-14341498
 ] 

Lianhui Wang edited comment on SPARK-6056 at 2/28/15 12:36 PM:
---

[~carlmartin] What is your executor memory? It works fine when I run your test 
with 1g. In addition, I use the Spark 1.2.0 release version. 
[~adav] Yes, thanks.


was (Author: lianhuiwang):
[~carlmartin] What is your executor memory? It works fine when I run your test 
with 1g. In addition, I use the Spark 1.2.0 release version. 

 Unlimit offHeap memory use cause RM killing the container
 -

 Key: SPARK-6056
 URL: https://issues.apache.org/jira/browse/SPARK-6056
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.2.1
Reporter: SaintBacchus

 No matter whether `preferDirectBufs` is set or the number of threads is 
 limited, Spark cannot limit the use of off-heap memory.
 At line 269 of the class 'AbstractNioByteChannel' in netty-4.0.23.Final, 
 Netty allocates an off-heap memory buffer of the same size as the heap buffer.
 So however many buffers you want to transfer, the same amount of off-heap 
 memory will be allocated.
 But once the allocated memory size reaches the overhead memory capacity set 
 in YARN, the executor will be killed.
 I wrote a simple piece of code to test it:
 ```scala
 import org.apache.spark.SparkEnv
 import org.apache.spark.storage.RDDBlockId
 val bufferRdd = sc.makeRDD(0 to 10, 10).map(x => new Array[Byte](10*1024*1024)).persist
 bufferRdd.count
 val part = bufferRdd.partitions(0)
 val sparkEnv = SparkEnv.get
 val blockMgr = sparkEnv.blockManager
 val blockOption = blockMgr.get(RDDBlockId(bufferRdd.id, part.index))
 val resultIt = blockOption.get.data.asInstanceOf[Iterator[Array[Byte]]]
 val len = resultIt.map(_.length).sum
 ```
 If multiple threads are used to read len, the physical memory will soon exceed 
 the limit set by spark.yarn.executor.memoryOverhead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6056) Unlimit offHeap memory use cause RM killing the container

2015-02-28 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341498#comment-14341498
 ] 

Lianhui Wang commented on SPARK-6056:
-

[~carlmartin] What is your executor memory? It works fine when I run your test 
with 1g. In addition, I use the Spark 1.2.0 release version. 

 Unlimit offHeap memory use cause RM killing the container
 -

 Key: SPARK-6056
 URL: https://issues.apache.org/jira/browse/SPARK-6056
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.2.1
Reporter: SaintBacchus

 No matter whether `preferDirectBufs` is set or the number of threads is 
 limited, Spark cannot limit the use of off-heap memory.
 At line 269 of the class 'AbstractNioByteChannel' in netty-4.0.23.Final, 
 Netty allocates an off-heap memory buffer of the same size as the heap buffer.
 So however many buffers you want to transfer, the same amount of off-heap 
 memory will be allocated.
 But once the allocated memory size reaches the overhead memory capacity set 
 in YARN, the executor will be killed.
 I wrote a simple piece of code to test it:
 ```scala
 import org.apache.spark.SparkEnv
 import org.apache.spark.storage.RDDBlockId
 val bufferRdd = sc.makeRDD(0 to 10, 10).map(x => new Array[Byte](10*1024*1024)).persist
 bufferRdd.count
 val part = bufferRdd.partitions(0)
 val sparkEnv = SparkEnv.get
 val blockMgr = sparkEnv.blockManager
 val blockOption = blockMgr.get(RDDBlockId(bufferRdd.id, part.index))
 val resultIt = blockOption.get.data.asInstanceOf[Iterator[Array[Byte]]]
 val len = resultIt.map(_.length).sum
 ```
 If multiple threads are used to read len, the physical memory will soon exceed 
 the limit set by spark.yarn.executor.memoryOverhead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6056) Unlimit offHeap memory use cause RM killing the container

2015-02-27 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341260#comment-14341260
 ] 

Lianhui Wang commented on SPARK-6056:
-

[~adav] From the information you gave, when preferDirectBufs is set to false, I 
think we also need to set io.netty.allocator.numDirectArenas=0 to avoid direct 
allocation. Right?
[~carlmartin] I think you can try the approach that [~adav] described above.
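
For reference, here is an untested sketch of how the two settings discussed above 
might be passed together. spark.shuffle.io.preferDirectBufs and 
spark.executor.extraJavaOptions are existing Spark configuration keys, but whether 
this combination actually bounds Netty's off-heap usage is exactly what is being 
debated here, so treat it as an assumption.

{code}
import org.apache.spark.SparkConf

// Untested sketch: turn off preferred direct buffers for the shuffle transport
// and pass the Netty arena system property to executors as an extra JVM option.
val conf = new SparkConf()
  .set("spark.shuffle.io.preferDirectBufs", "false")
  .set("spark.executor.extraJavaOptions", "-Dio.netty.allocator.numDirectArenas=0")
{code}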

 Unlimit offHeap memory use cause RM killing the container
 -

 Key: SPARK-6056
 URL: https://issues.apache.org/jira/browse/SPARK-6056
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.2.1
Reporter: SaintBacchus

 No matter whether `preferDirectBufs` is set or the number of threads is 
 limited, Spark cannot limit the use of off-heap memory.
 At line 269 of the class 'AbstractNioByteChannel' in netty-4.0.23.Final, 
 Netty allocates an off-heap memory buffer of the same size as the heap buffer.
 So however many buffers you want to transfer, the same amount of off-heap 
 memory will be allocated.
 But once the allocated memory size reaches the overhead memory capacity set 
 in YARN, the executor will be killed.
 I wrote a simple piece of code to test it:
 ```scala
 import org.apache.spark.SparkEnv
 import org.apache.spark.storage.RDDBlockId
 val bufferRdd = sc.makeRDD(0 to 10, 10).map(x => new Array[Byte](10*1024*1024)).persist
 bufferRdd.count
 val part = bufferRdd.partitions(0)
 val sparkEnv = SparkEnv.get
 val blockMgr = sparkEnv.blockManager
 val blockOption = blockMgr.get(RDDBlockId(bufferRdd.id, part.index))
 val resultIt = blockOption.get.data.asInstanceOf[Iterator[Array[Byte]]]
 val len = resultIt.map(_.length).sum
 ```
 If multiple threads are used to read len, the physical memory will soon exceed 
 the limit set by spark.yarn.executor.memoryOverhead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5763) Sort-based Groupby and Join to resolve skewed data

2015-02-12 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-5763:

Description: 
In SPARK-4644, a way to resolve skewed data is provided. But when more keys are 
skewed, I think the approach in SPARK-4644 is inappropriate. Instead, we can use 
sort-merge to resolve skewed-groupby and skewed-join. Because SPARK-2926 
implements merge-sort, we can implement sort-merge for skewed data based on 
SPARK-2926. I have implemented sort-merge-groupby, and it works very well for 
skewed data in my tests. Later I will implement sort-merge-join to resolve 
skewed-join.
[~rxin] [~sandyr] [~andrewor14] how about your opinions on this?

  was:
In SPARK-4644, a way to resolve skewed data is provided. But when more keys are 
skewed, I think the approach in SPARK-4644 is inappropriate. Instead, we can use 
sort-merge to resolve skewed-groupby and skewed-join. Because SPARK-2926 
implements merge-sort, we can implement sort-merge for skewed data based on 
SPARK-2926. I have implemented sort-merge-groupby, and it works very well for 
skewed data in my tests. Later I will implement sort-merge-join to resolve 
skewed-join.
[~rxin] [~sandyr] [~andrewor14] how about your opinion on this?


 Sort-based Groupby and Join to resolve skewed data
 --

 Key: SPARK-5763
 URL: https://issues.apache.org/jira/browse/SPARK-5763
 Project: Spark
  Issue Type: Improvement
Reporter: Lianhui Wang

 In SPARK-4644, a way to resolve skewed data is provided. But when more keys 
 are skewed, I think the approach in SPARK-4644 is inappropriate. Instead, we 
 can use sort-merge to resolve skewed-groupby and skewed-join. Because 
 SPARK-2926 implements merge-sort, we can implement sort-merge for skewed data 
 based on SPARK-2926. I have implemented sort-merge-groupby, and it works very 
 well for skewed data in my tests. Later I will implement sort-merge-join to 
 resolve skewed-join.
 [~rxin] [~sandyr] [~andrewor14] how about your opinions on this?
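
As a rough, invented illustration only (not the implementation proposed in this 
ticket), the sort-then-scan shape of a sort-based group-by can be sketched with 
existing Spark primitives: repartition and sort within partitions, then walk each 
sorted partition once and emit a group per run of equal keys. This sketch does not 
address the spilling details that matter for truly skewed keys.

{code}
import scala.reflect.ClassTag
import org.apache.spark.{HashPartitioner, SparkContext}

// Invented illustration: partition by key, sort within each partition, then
// emit one (key, values) group per run of equal keys in the sorted iterator.
def sortBasedGroupBy[K : Ordering : ClassTag, V : ClassTag](
    sc: SparkContext, data: Seq[(K, V)], numPartitions: Int) = {
  sc.parallelize(data)
    .repartitionAndSortWithinPartitions(new HashPartitioner(numPartitions))
    .mapPartitions { iter =>
      val buffered = iter.buffered
      new Iterator[(K, Seq[V])] {
        def hasNext: Boolean = buffered.hasNext
        def next(): (K, Seq[V]) = {
          val key = buffered.head._1
          val values = scala.collection.mutable.ArrayBuffer.empty[V]
          while (buffered.hasNext && buffered.head._1 == key) {
            values += buffered.next()._2
          }
          (key, values.toSeq)
        }
      }
    }
}
{code}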



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5763) Sort-based Groupby and Join to resolve skewed data

2015-02-12 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-5763:

Description: 
In SPARK-4644, a way to resolve skewed data is provided. But when more keys are 
skewed, I think the approach in SPARK-4644 is inappropriate. Instead, we can use 
sort-merge to resolve skewed-groupby and skewed-join. Because SPARK-2926 
implements merge-sort, we can implement sort-merge for skewed data based on 
SPARK-2926. I have implemented sort-merge-groupby, and it works very well for 
skewed data in my tests. Later I will implement sort-merge-join to resolve 
skewed-join.
[~rxin] [~sandyr] [~andrewor14] how about your opinion on this?

  was:In SPARK-4644, a way to resolve skewed data is provided. But when more 
keys are skewed, I think the approach in SPARK-4644 is inappropriate. Instead, 
we can use sort-merge to resolve skewed-groupby and skewed-join. Because 
SPARK-2926 implements merge-sort, we can implement sort-merge for skewed data 
based on SPARK-2926. I have implemented sort-merge-groupby, and it works very 
well for skewed data in my tests. Later I will implement sort-merge-join to 
resolve skewed-join.


 Sort-based Groupby and Join to resolve skewed data
 --

 Key: SPARK-5763
 URL: https://issues.apache.org/jira/browse/SPARK-5763
 Project: Spark
  Issue Type: Improvement
Reporter: Lianhui Wang

 SPARK-4644 provides one way to resolve skewed data, but when many keys are skewed, I think that approach is inappropriate. Instead, we can use sort-merge to resolve skewed group-by and skewed join. Because SPARK-2926 implements merge-sort, we can build the sort-merge handling of skewed keys on top of it. I have already implemented sort-merge-groupby, and it works very well on skewed data in my tests. Later I will implement sort-merge-join to resolve skewed joins.
 [~rxin] [~sandyr] [~andrewor14] What is your opinion on this?






[jira] [Comment Edited] (SPARK-5721) Propagate missing external shuffle service errors to client

2015-02-12 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317895#comment-14317895
 ] 

Lianhui Wang edited comment on SPARK-5721 at 2/12/15 9:39 AM:
--

Yes, I think it is the same as SPARK-5759. In https://github.com/apache/spark/pull/4554 I now log the container ID and host name for the exception.


was (Author: lianhuiwang):
Yes, I think it is the same as SPARK-5759. In https://github.com/apache/spark/pull/4554 I now report the container ID and host name for the exception.

 Propagate missing external shuffle service errors to client
 ---

 Key: SPARK-5721
 URL: https://issues.apache.org/jira/browse/SPARK-5721
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Reporter: Kostas Sakellis

 When spark.shuffle.service.enabled=true, the YARN AM expects to find an aux service running in each NodeManager. If it cannot find one, an exception like the following appears in the application master logs.
 {noformat}
 Exception in thread ContainerLauncher #0 Exception in thread 
 ContainerLauncher #1 java.lang.Error: 
 org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The 
 auxService:spark_shuffle does not exist
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The 
 auxService:spark_shuffle does not exist
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
   at 
 org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:168)
   at 
 org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
   at 
 org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:206)
   at 
 org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:110)
   at 
 org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:65)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   ... 2 more
 java.lang.Error: 
 org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The 
 auxService:spark_shuffle does not exist
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The 
 auxService:spark_shuffle does not exist
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
   at 
 org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:168)
   at 
 org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
   at 
 org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:206)
   at 
 org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:110)
   at 
 org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:65)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   ... 2 more
 {noformat}
 We should propagate this error to the driver (in yarn-client mode); otherwise it is unclear why the expected number of executors is not starting up.
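As an illustration only (not the actual change made in Spark), the AM could inspect the cause chain of the launch failure and turn the missing aux service into an actionable message before surfacing it; the helper below is hypothetical:

{code}
import org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException

object AuxServiceErrorSketch {
  // Walk the cause chain; if the spark_shuffle aux service is missing, return a hint
  // that can be logged on the AM or propagated to the driver instead of a bare stack trace.
  def explainLaunchFailure(t: Throwable): Option[String] =
    Iterator.iterate(t)(_.getCause).takeWhile(_ != null).collectFirst {
      case e: InvalidAuxServiceException =>
        s"${e.getMessage}. Is the spark_shuffle aux service configured in yarn-site.xml on " +
        "every NodeManager? It is required when spark.shuffle.service.enabled=true."
    }
}
{code}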






[jira] [Created] (SPARK-5763) Sort-based Groupby and Join to resolve skewed data

2015-02-12 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-5763:
---

 Summary: Sort-based Groupby and Join to resolve skewed data
 Key: SPARK-5763
 URL: https://issues.apache.org/jira/browse/SPARK-5763
 Project: Spark
  Issue Type: Improvement
Reporter: Lianhui Wang


SPARK-4644 provides one way to resolve skewed data, but when many keys are skewed, I think that approach is inappropriate. Instead, we can use sort-merge to resolve skewed group-by and skewed join. Because SPARK-2926 implements merge-sort, we can build the sort-merge handling of skewed keys on top of it. I have already implemented sort-merge-groupby, and it works very well on skewed data in my tests. Later I will implement sort-merge-join to resolve skewed joins.






[jira] [Created] (SPARK-5759) ExecutorRunnable should catch YarnException while NMClient start container

2015-02-11 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-5759:
---

 Summary: ExecutorRunnable should catch YarnException while 
NMClient start container
 Key: SPARK-5759
 URL: https://issues.apache.org/jira/browse/SPARK-5759
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Lianhui Wang


Sometimes, for various reasons, an exception is thrown while NMClient starts a container. For example, if spark_shuffle is not configured on some machines, container start fails with:
java.lang.Error: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist.
Because YarnAllocator uses a ThreadPoolExecutor to start containers, we cannot tell which container or hostname threw the exception. I think ExecutorRunnable should catch YarnException when it starts a container; when an exception occurs, we can then report the container ID and hostname of the failed container, as in the sketch below.
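A minimal sketch of the proposed catch, assuming an NMClient, Container, and ContainerLaunchContext are already prepared the way ExecutorRunnable prepares them (the wrapper method and error message are illustrative, not the actual patch):

{code}
import java.nio.ByteBuffer

import org.apache.hadoop.yarn.api.records.{Container, ContainerLaunchContext}
import org.apache.hadoop.yarn.client.api.NMClient
import org.apache.hadoop.yarn.exceptions.YarnException

object ContainerStartSketch {
  // Start the container and, on failure, rethrow with the container id and host attached
  // so the failing NodeManager can be identified from the AM log.
  def startContainerOrExplain(
      nmClient: NMClient,
      container: Container,
      ctx: ContainerLaunchContext): java.util.Map[String, ByteBuffer] = {
    try {
      nmClient.startContainer(container, ctx)
    } catch {
      case e: YarnException =>
        throw new RuntimeException(
          s"Failed to start executor container ${container.getId} " +
          s"on host ${container.getNodeId.getHost}: ${e.getMessage}", e)
    }
  }
}
{code}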






[jira] [Comment Edited] (SPARK-5227) InputOutputMetricsSuite input metrics when reading text file with multiple splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 profiles

2015-02-09 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312163#comment-14312163
 ] 

Lianhui Wang edited comment on SPARK-5227 at 2/9/15 12:16 PM:
--

The split size is computed in Hadoop's FileInputFormat as:
protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
  return Math.max(minSize, Math.min(goalSize, blockSize));
}
where
long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
minSize = 1, and
blockSize = getConf().getLong("fs.local.block.size", 32 * 1024 * 1024);
So when numSplits = 2 and totalSize is very large, goalSize > blockSize; in that case the actual splitSize is blockSize, not goalSize, and that is when the error reported in this issue happens. A worked example follows below.
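A small worked example of those numbers (the 256 MB input size is hypothetical, chosen only to make the comparison concrete):

{code}
object SplitSizeSketch {
  // Mirrors Hadoop's FileInputFormat.computeSplitSize: max(minSize, min(goalSize, blockSize)).
  def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
    math.max(minSize, math.min(goalSize, blockSize))

  def main(args: Array[String]): Unit = {
    val blockSize = 32L * 1024 * 1024        // fs.local.block.size default used here
    val minSize   = 1L
    val totalSize = 256L * 1024 * 1024       // hypothetical large input
    val numSplits = 2
    val goalSize  = totalSize / math.max(numSplits, 1)
    // goalSize (128 MB) > blockSize (32 MB), so the split size falls back to 32 MB,
    // giving 8 splits rather than the 2 that were requested.
    println(computeSplitSize(goalSize, minSize, blockSize)) // 33554432
  }
}
{code}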


was (Author: lianhuiwang):
The split size is computed in Hadoop's FileInputFormat as:
protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
  return Math.max(minSize, Math.min(goalSize, blockSize));
}
where
long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
minSize = 1, and
blockSize = getConf().getLong("fs.local.block.size", 32 * 1024 * 1024);
So when numSplits = 2 and totalSize is very large, goalSize > blockSize; in that case the actual splitSize is blockSize, not goalSize, and that is when the error reported in this issue happens.

 InputOutputMetricsSuite input metrics when reading text file with multiple 
 splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 
 profiles
 -

 Key: SPARK-5227
 URL: https://issues.apache.org/jira/browse/SPARK-5227
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
Reporter: Josh Rosen
Priority: Blocker
  Labels: flaky-test

 The InputOutputMetricsSuite  input metrics when reading text file with 
 multiple splits test has been failing consistently in our new {{branch-1.2}} 
 Jenkins SBT build: 
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.2-SBT/14/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/testReport/junit/org.apache.spark.metrics/InputOutputMetricsSuite/input_metrics_when_reading_text_file_with_multiple_splits/
 Here's the error message
 {code}
 ArrayBuffer(32, 32, 32, 32, 32, 32, 32, 32, ...)   [a long run of repeated 32s; the quoted output is truncated at this point in the archive]

[jira] [Created] (SPARK-5687) in TaskResultGetter need to catch OutOfMemoryError and report failed when it cannot fetch results.

2015-02-09 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-5687:
---

 Summary: in TaskResultGetter need to catch OutOfMemoryError and 
report failed when it cannot  fetch results.
 Key: SPARK-5687
 URL: https://issues.apache.org/jira/browse/SPARK-5687
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Lianhui Wang


In enqueueSuccessfulTask, a separate thread fetches the task result. If the result is very large, fetching it may throw an OutOfMemoryError; if we do not catch the OutOfMemoryError, the DAGScheduler does not know the status of this task.
Another case: when totalResultSize > maxResultSize and the large result cannot be fetched, we need to report a failed status to the DAGScheduler. If no status is reported for the task, the DAGScheduler again does not know its status. A sketch of the pattern is shown below.
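A sketch of the pattern only (the method and callback names below are placeholders, not TaskResultGetter's real API): run the fetch on a pool thread and convert an OutOfMemoryError into an explicit task-failed report instead of letting the thread die silently.

{code}
import java.util.concurrent.Executors

import scala.util.control.NonFatal

object ResultFetchSketch {
  private val pool = Executors.newFixedThreadPool(4)

  def enqueueSuccessfulTask(
      taskId: Long,
      fetchResult: () => Array[Byte],
      reportSuccess: (Long, Array[Byte]) => Unit,
      reportFailure: (Long, Throwable) => Unit): Unit = {
    pool.execute(new Runnable {
      override def run(): Unit = {
        try {
          reportSuccess(taskId, fetchResult())
        } catch {
          // Fetching or deserializing a very large result can throw OutOfMemoryError; without
          // this catch the pool thread dies and the scheduler never hears about the task again.
          case oom: OutOfMemoryError => reportFailure(taskId, oom)
          case NonFatal(e)           => reportFailure(taskId, e)
        }
      }
    })
  }
}
{code}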






[jira] [Commented] (SPARK-5227) InputOutputMetricsSuite input metrics when reading text file with multiple splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 profiles

2015-02-09 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312163#comment-14312163
 ] 

Lianhui Wang commented on SPARK-5227:
-

The split size is computed in Hadoop's FileInputFormat as:
protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
  return Math.max(minSize, Math.min(goalSize, blockSize));
}
where
long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
minSize = 1, and
blockSize = getConf().getLong("fs.local.block.size", 32 * 1024 * 1024);
So when numSplits = 2 and totalSize is very large, goalSize > blockSize; in that case the actual splitSize is blockSize, not goalSize, and that is when the error reported in this issue happens.

 InputOutputMetricsSuite input metrics when reading text file with multiple 
 splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 
 profiles
 -

 Key: SPARK-5227
 URL: https://issues.apache.org/jira/browse/SPARK-5227
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
Reporter: Josh Rosen
Priority: Blocker
  Labels: flaky-test

 The InputOutputMetricsSuite  input metrics when reading text file with 
 multiple splits test has been failing consistently in our new {{branch-1.2}} 
 Jenkins SBT build: 
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.2-SBT/14/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/testReport/junit/org.apache.spark.metrics/InputOutputMetricsSuite/input_metrics_when_reading_text_file_with_multiple_splits/
 Here's the error message
 {code}
 ArrayBuffer(32, 32, 32, 32, 32, 32, 32, 32, ...)   [a long run of repeated 32s; the quoted output is truncated at this point in the archive]

[jira] [Updated] (SPARK-5687) in TaskResultGetter need to catch OutOfMemoryError.

2015-02-09 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-5687:

Description: In enqueueSuccessfulTask, a separate thread fetches the task result. If the result is very large, fetching it may throw an OutOfMemoryError; if we do not catch the OutOfMemoryError, the DAGScheduler does not know the status of this task.  (was: In enqueueSuccessfulTask, a separate thread fetches the task result. If the result is very large, fetching it may throw an OutOfMemoryError; if we do not catch the OutOfMemoryError, the DAGScheduler does not know the status of this task.
Another case: when totalResultSize > maxResultSize and the large result cannot be fetched, we need to report a failed status to the DAGScheduler. If no status is reported for the task, the DAGScheduler again does not know its status.)

 in TaskResultGetter need to catch OutOfMemoryError.
 ---

 Key: SPARK-5687
 URL: https://issues.apache.org/jira/browse/SPARK-5687
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Lianhui Wang

 In enqueueSuccessfulTask, a separate thread fetches the task result. If the result is very large, fetching it may throw an OutOfMemoryError; if we do not catch the OutOfMemoryError, the DAGScheduler does not know the status of this task.





