[GitHub] spark pull request: [SPARK-4962] [CORE] Put TaskScheduler.start ba...

2015-09-02 Thread YanTangZhai
Github user YanTangZhai closed the pull request at:

https://github.com/apache/spark/pull/3810


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-25 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-71405599
  
@JoshRosen I don't think just calling rdd.partitions on the final RDD could 
achieve our goal. Furthermore, rdd.partitions has been called before:
470 // Check to make sure we are not launching a task on a partition that 
does not exist.
471 val maxPartitions = rdd.partitions.length
However, that does not cover some scenarios, such as the example I contrived.
To avoid the thread-safety issue, do you think we could use another method to 
get parent stages that does not mutate any global map, or could we just use 
another method, like the getParentPartitions I committed before, to get the 
partitions directly?





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-24 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-71308409
  
@JoshRosen I've brought this up to date with master. Thanks.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-20 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-70628086
  
@JoshRosen Thanks. I've updated it per your comments. Please review again. 
However, there are merge conflicts. I will resolve them if this approach is 
accepted.





[GitHub] spark pull request: [SPARK-5316] [CORE] DAGScheduler may make shuf...

2015-01-19 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/4105

[SPARK-5316] [CORE] DAGScheduler may make shuffleToMapStage leak if 
getParentStages fails

DAGScheduler may make shuffleToMapStage leak if getParentStages fails.
If getParentStages throws an exception, for example because an input path 
does not exist, DAGScheduler fails to handle the job submission, but records 
may already have been put into shuffleToMapStage during getParentStages. 
Those records are never cleaned up.
A simple job that reproduces this:
```
val inputFile1 = ... // Input path does not exist when this job submits
val inputFile2 = ...
val outputFile = ...
val conf = new SparkConf()
val sc = new SparkContext(conf)
val rdd1 = sc.textFile(inputFile1)
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 1)
val rdd2 = sc.textFile(inputFile2)
  .flatMap(line => line.split(","))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 1)
try {
  val rdd3 = new PairRDDFunctions(rdd1).join(rdd2, 1)
  rdd3.saveAsTextFile(outputFile)
} catch {
  case e: Exception =>
    logError(e)
}
// print the information of DAGScheduler's shuffleToMapStage to check
// whether it still has uncleaned records.
...
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-5316

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4105.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4105


commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-06T13:07:08Z

Merge pull request #1 from apache/master

update

commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-20T13:14:08Z

Merge pull request #3 from apache/master

Update

commit 8a0010691b669495b4c327cf83124cabb7da1405
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-12T06:54:58Z

Merge pull request #6 from apache/master

Update

commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-16T12:03:22Z

Merge pull request #7 from apache/master

Update

commit 76d40277d51f709247df1d3734093bf2c047737d
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-10-20T12:52:22Z

Merge pull request #8 from apache/master

update

commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-04T09:00:31Z

Merge pull request #9 from apache/master

Update

commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-11T03:18:24Z

Merge pull request #10 from apache/master

Update

commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-01T11:23:56Z

Merge pull request #11 from apache/master

Update

commit 718afebe364bd54ac33be425e24183eb1c76b5d3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-05T11:08:31Z

Merge pull request #12 from apache/master

update

commit e4c2c0a18bdc78cc17823cbc2adf3926944e6bc5
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-24T03:15:22Z

Merge pull request #15 from apache/master

update

commit d4bca32bf4b06d3694a5de3cf5b69bac606dda39
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-31T03:50:26Z

Merge pull request #19 from apache/master

Update

commit 5041b3574dc89cd1e8a8d46590d2aba4c050de92
Author: YanTangZhai hakeemz...@tencent.com
Date:   2015-01-12T12:33:20Z

Merge pull request #24 from apache/master

update

commit e2880f919dd54b43e0c53657a0f2d02880f47aa3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2015-01-19T09:14:27Z

Merge pull request #27 from apache/master

Update

commit 50291ca23192b3f05f572a60f68fcae0b66d5ffd
Author: yantangzhai tyz0...@163.com
Date:   2015-01-19T11:12:16Z

[SPARK-5316] [CORE] DAGScheduler may make shuffleToMapStage leak if 
getParentStages fails







[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-19 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-70481411
  
@JoshRosen Thanks for your comments. I've updated it. I directly use 
getParentStages, which calls the RDDs' getPartitions, before sending the 
JobSubmitted event. Is that OK?





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-14 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-69916653
  
@JoshRosen I've updated it. Please review again. Thanks.





[GitHub] spark pull request: [SPARK-4962] [CORE] Put TaskScheduler.start ba...

2015-01-13 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3810#issuecomment-69716974
  
@srowen I've updated this PR and resolved the conflict. Please review again. 
Thanks.
Let me address three points:
1. "I am not sure the description makes a case that it's significant enough 
to bother..."
Let me give two examples:
(1) I launched ./bin/spark-sql from the command line in yarn-client mode with 
these resource requests:
spark.executor.instances 100
spark.executor.memory 4g
spark.executor.cores 1
However, I didn't enter a SQL query immediately, because I was interrupted 
(for example, called into an important meeting, or off firefighting in our 
cluster); sometimes I even forgot to enter a query at all. The application 
then ran all night, holding 100 * 4g * 12h of memory and 100 * 1 * 12h of 
cores, while doing nothing.
(2) A SparkContext was initialized with spark.executor.instances 100, 
spark.executor.memory 4g, and spark.executor.cores 1, and HadoopRDD scanned 
11596 files, taking 29.253s to compute splits, before DAGScheduler submitted 
the job. During those 29s, 100 * 4g of memory and 100 * 1 cores sat idle.
2. "There are several new API methods and changes here."
SparkContext first gets the applicationId from the taskScheduler and uses it 
to initialize the blockManager and eventLogger. Then the dagScheduler runs 
the job and submits resource requests to the cluster master.
Getting the applicationId and submitting resource requests to the cluster 
master are split into two methods.
3. "My overall impression is that this adds different code paths and 
behaviors in different modes for little gain."
I'm sorry that I couldn't find Mesos APIs that would let me split getting the 
applicationId and submitting resource requests to the cluster master into two 
methods. Thus slow start of the application is currently supported only in 
YARN mode.
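
To make the proposed split concrete, here is a minimal sketch under stated 
assumptions: AppLifecycle, requestResources, and SlowStartDriver are 
illustrative names, not the PR's API (the PR itself renames 
Client.submitApplication to createApplication in the YARN backend, as a diff 
later in this thread shows).

```scala
// Minimal sketch of the two-phase start; names here are illustrative.
trait AppLifecycle {
  /** Register with the cluster master and return only an application id. */
  def createApplication(): String
  /** Ask the cluster master for executors, just before stages actually run. */
  def requestResources(): Unit
}

class SlowStartDriver(backend: AppLifecycle) {
  // The id is needed early, to initialize block management and event logging.
  val appId: String = backend.createApplication()

  def runJob(job: () => Unit): Unit = {
    backend.requestResources() // cluster resources are held only from here on
    job()
  }
}
```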





[GitHub] spark pull request: [SPARK-5163] [CORE] Load properties from confi...

2015-01-11 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3963#issuecomment-69523350
  
@pwendell Ok. Thank you very much. I'll close this PR.





[GitHub] spark pull request: [SPARK-5163] [CORE] Load properties from confi...

2015-01-11 Thread YanTangZhai
Github user YanTangZhai closed the pull request at:

https://github.com/apache/spark/pull/3963





[GitHub] spark pull request: [SPARK-4962] [CORE] Put TaskScheduler.start ba...

2015-01-11 Thread YanTangZhai
Github user YanTangZhai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3810#discussion_r22776305
  
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -55,13 +57,9 @@ private[spark] class Client(
    */
 
   /**
-   * Submit an application running our ApplicationMaster to the ResourceManager.
-   *
-   * The stable Yarn API provides a convenience method (YarnClient#createApplication) for
-   * creating applications and setting up the application submission context. This was not
-   * available in the alpha API.
+   * Create an application running our ApplicationMaster to the ResourceManager.
    */
-  override def submitApplication(): ApplicationId = {
+  override def createApplication(): ApplicationId = {
--- End diff --

SparkContext first gets the applicationId from the taskScheduler and uses it 
to initialize the blockManager and eventLogger. Then the dagScheduler runs 
the job and submits resource requests to the cluster master.
Getting the applicationId and submitting resource requests to the cluster 
master are split into two methods.





[GitHub] spark pull request: [SPARK-4962] [CORE] Put TaskScheduler.start ba...

2015-01-11 Thread YanTangZhai
Github user YanTangZhai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3810#discussion_r22776416
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -333,9 +333,15 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
       new SparkException("DAGScheduler cannot be initialized due to %s".format(e.getMessage))
   }
 
-  // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
-  // constructor
-  taskScheduler.start()
+  if (conf.getBoolean("spark.scheduler.app.slowstart", false) && master == "yarn-client") {
--- End diff --

I'm sorry that I couldn't find Mesos APIs that would let me split getting the 
applicationId and submitting resource requests to the cluster master into two 
methods.





[GitHub] spark pull request: [SPARK-4962] [CORE] Put TaskScheduler.start ba...

2015-01-11 Thread YanTangZhai
Github user YanTangZhai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3810#discussion_r22776371
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -333,9 +333,15 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
       new SparkException("DAGScheduler cannot be initialized due to %s".format(e.getMessage))
   }
 
-  // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
-  // constructor
-  taskScheduler.start()
+  if (conf.getBoolean("spark.scheduler.app.slowstart", false) && master == "yarn-client") {
--- End diff --

I'm sorry that I couldn't find Mesos APIs that would let me split getting the 
applicationId and submitting resource requests to the cluster master into two 
methods.





[GitHub] spark pull request: [SPARK-5163] [CORE] Load properties from confi...

2015-01-08 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/3963

[SPARK-5163] [CORE] Load properties from configuration file for example 
spark-defaults.conf when creating SparkConf object

I create and run a Spark program that does not use SparkSubmit.
When I create a SparkConf object with `new SparkConf()`, it does not 
automatically load properties from a configuration file such as 
spark-defaults.conf.
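
For programs that bypass spark-submit, a workaround in the meantime is to 
load the file by hand. This is a hedged sketch, not the PR's change; the 
default path and the whitespace-separated "key value" format are assumptions 
about a standard Spark layout.

```scala
import java.io.FileInputStream
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.spark.SparkConf

// Load spark-defaults.conf manually and fold it into a SparkConf.
def confWithDefaults(path: String =
    sys.env.getOrElse("SPARK_HOME", ".") + "/conf/spark-defaults.conf"): SparkConf = {
  val props = new Properties()
  val in = new FileInputStream(path)
  try props.load(in) finally in.close() // "key value" pairs parse as Properties
  val conf = new SparkConf()            // still picks up spark.* system properties
  props.asScala.foreach { case (k, v) => conf.setIfMissing(k, v.trim) }
  conf
}
```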

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-5163

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3963.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3963


commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-06T13:07:08Z

Merge pull request #1 from apache/master

update

commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-20T13:14:08Z

Merge pull request #3 from apache/master

Update

commit 8a0010691b669495b4c327cf83124cabb7da1405
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-12T06:54:58Z

Merge pull request #6 from apache/master

Update

commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-16T12:03:22Z

Merge pull request #7 from apache/master

Update

commit 76d40277d51f709247df1d3734093bf2c047737d
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-10-20T12:52:22Z

Merge pull request #8 from apache/master

update

commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-04T09:00:31Z

Merge pull request #9 from apache/master

Update

commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-11T03:18:24Z

Merge pull request #10 from apache/master

Update

commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-01T11:23:56Z

Merge pull request #11 from apache/master

Update

commit 718afebe364bd54ac33be425e24183eb1c76b5d3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-05T11:08:31Z

Merge pull request #12 from apache/master

update

commit e4c2c0a18bdc78cc17823cbc2adf3926944e6bc5
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-24T03:15:22Z

Merge pull request #15 from apache/master

update

commit d4bca32bf4b06d3694a5de3cf5b69bac606dda39
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-31T03:50:26Z

Merge pull request #19 from apache/master

Update

commit ac9579ca434f559bf173ad219bd04b48a7db226f
Author: yantangzhai tyz0...@163.com
Date:   2015-01-09T03:17:51Z

[SPARK-5163] [CORE] Load properties from configuration file for example 
spark-defaults.conf when creating SparkConf object







[GitHub] spark pull request: [SPARK-5007] [CORE] Try random port when start...

2015-01-08 Thread YanTangZhai
Github user YanTangZhai closed the pull request at:

https://github.com/apache/spark/pull/3845





[GitHub] spark pull request: [SPARK-5007] [CORE] Try random port when start...

2015-01-08 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3845#issuecomment-69282504
  
@andrewor14 @rxin Oh, I see. Thank you very much.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-31 Thread YanTangZhai
Github user YanTangZhai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3794#discussion_r22376680
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -178,7 +178,7 @@ abstract class RDD[T: ClassTag](
   // Our dependencies and partitions will be gotten by calling subclass's 
methods below, and will
   // be overwritten when we're checkpointed
   private var dependencies_ : Seq[Dependency[_]] = null
-  @transient private var partitions_ : Array[Partition] = null
+  @transient private var partitions_ : Array[Partition] = getPartitions
--- End diff --

Sorry. This approach may cause an error as follows:
Exception in thread "main" java.lang.NullPointerException
    at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:191)
    at com.google.common.collect.MapMakerInternalMap.put(MapMakerInternalMap.java:3499)
    at org.apache.spark.rdd.HadoopRDD$.putCachedMetadata(HadoopRDD.scala:273)
    at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:151)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:173)
    at org.apache.spark.rdd.RDD.<init>(RDD.scala:181)
    at org.apache.spark.rdd.HadoopRDD.<init>(HadoopRDD.scala:97)
    at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:561)
    at org.apache.spark.SparkContext.textFile(SparkContext.scala:471)
since jobConfCacheKey has not been initialized at that time.
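
The trap here is Scala's initialization order: a superclass val initializer 
that calls an overridden method runs before the subclass's own fields are 
assigned. A self-contained illustration of the same failure class (the names 
are made up, not Spark's):

```scala
// Base's constructor runs first and eagerly calls the subclass's method.
abstract class Base {
  val partitions: Array[Int] = getPartitions // evaluated during Base's init
  protected def getPartitions: Array[Int]
}

class Sub extends Base {
  private val cacheKey = "jobConf"           // still null while Base runs
  protected def getPartitions: Array[Int] = {
    require(cacheKey != null, "cacheKey not initialized yet")
    Array(1, 2, 3)
  }
}

// new Sub() fails inside require(), mirroring the NPE in the trace above.
```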





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-31 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68438167
  
@JoshRosen Thanks for your comments. I've updated it according to your 
comments and contrived a simple example as follows:
```scala
val inputfile1 = "./testin/in_1.txt"
val inputfile2 = "./testin/in_2.txt"
val tempfile = "./testtmp"
val outputfile = "./testout"
val sc = new SparkContext(new SparkConf())
sc.textFile(inputfile1)
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 1)
  .map { kv => kv._1 + "," + kv._2.toString }
  .saveAsTextFile(tempfile)
val wordCounts1 = sc.textFile(tempfile)
val wordCounts2 = sc.textFile(inputfile2)
val wordCounts = wordCounts1.union(wordCounts2)
wordCounts.map { line =>
    val kv = line.split(",")
    (kv(0), Integer.parseInt(kv(1)))
  }
  .reduceByKey(_ + _, 1)
  .map { kv => kv._1 + "," + kv._2.toString }
  .saveAsTextFile(outputfile)
```
./testin/in_1.txt (23 bytes) and ./testin/in_2.txt (19 bytes) are both local 
files.
- Before optimization,
  - job1: new stage creation took 0.729638 s, of which HadoopRDD.getPartitions 
took 0.710247 s.
  - job2: new stage creation took 0.882241 s, of which HadoopRDD.getPartitions 
took 0.850668 + 0.023490 s.
- After optimization,
  - job1: HadoopRDD.getPartitions took 0.802133 s; new stage creation took 
0.029328 s.
  - job2: HadoopRDD.getPartitions took 0.464713 + 0.022568 s; new stage 
creation took 0.001773 s.





[GitHub] spark pull request: [SPARK-5007] [CORE] Try random port when start...

2014-12-30 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/3845

[SPARK-5007] [CORE] Try random port when startServiceOnPort to reduce the 
chance of port collision

When multiple Spark programs are submitted on the same node (a so-called 
springboard machine), the ports of their SparkUIs range from 4040 (the 
default) up to 4056, and Spark programs submitted later can fail because of 
SparkUI port collisions.
The chance of collision can be reduced by setting spark.ui.port or 
spark.port.maxRetries.
However, I think it's better to try a random port in startServiceOnPort to 
reduce the chance of port collision.
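
The idea, in a minimal sketch (this is not Spark's Utils.startServiceOnPort; 
the retry count and port range are assumptions): instead of probing port, 
port + 1, port + 2, ..., jump to a random port on each collision.

```scala
import java.net.{BindException, ServerSocket}
import scala.util.Random

// Bind to a random non-privileged port, retrying on collision.
def startOnRandomPort(maxRetries: Int = 16): ServerSocket = {
  var attempt = 0
  while (attempt < maxRetries) {
    val port = 1024 + Random.nextInt(65536 - 1024) // 1024..65535
    try {
      return new ServerSocket(port) // success: caller owns the socket
    } catch {
      case _: BindException => attempt += 1 // collision: pick another port
    }
  }
  throw new BindException(s"Could not bind after $maxRetries random attempts")
}
```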

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-5007

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3845.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3845


commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-06T13:07:08Z

Merge pull request #1 from apache/master

update

commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-20T13:14:08Z

Merge pull request #3 from apache/master

Update

commit 8a0010691b669495b4c327cf83124cabb7da1405
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-12T06:54:58Z

Merge pull request #6 from apache/master

Update

commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-16T12:03:22Z

Merge pull request #7 from apache/master

Update

commit 76d40277d51f709247df1d3734093bf2c047737d
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-10-20T12:52:22Z

Merge pull request #8 from apache/master

update

commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-04T09:00:31Z

Merge pull request #9 from apache/master

Update

commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-11T03:18:24Z

Merge pull request #10 from apache/master

Update

commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-01T11:23:56Z

Merge pull request #11 from apache/master

Update

commit 718afebe364bd54ac33be425e24183eb1c76b5d3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-05T11:08:31Z

Merge pull request #12 from apache/master

update

commit e4c2c0a18bdc78cc17823cbc2adf3926944e6bc5
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-24T03:15:22Z

Merge pull request #15 from apache/master

update

commit 2fb4f4450230fee09ff8932eb107f09ef72f2402
Author: yantangzhai tyz0...@163.com
Date:   2014-12-30T13:41:59Z

[SPARK-5007] [CORE] Try random port when startServiceOnPort to reduce the 
chance of port collision







[GitHub] spark pull request: [SPARK-4692] [SQL] Support ! boolean logic ope...

2014-12-30 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3555#issuecomment-68425639
  
@marmbrus I've updated it. Please review again.





[GitHub] spark pull request: [SPARK-4962] [CORE] Put TaskScheduler.start ba...

2014-12-26 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/3810

[SPARK-4962] [CORE] Put TaskScheduler.start back in SparkContext to shorten 
cluster resources occupation period

When a SparkContext object is instantiated, TaskScheduler is started and some 
resources are allocated from the cluster. However, these resources may not be 
used for the moment, for example while DAGScheduler.JobSubmitted is still 
processing, so they are wasted during this period. Thus, we want to put 
TaskScheduler.start back to shorten the cluster-resource occupation period, 
especially on a busy cluster. TaskScheduler could be started just before the 
stages run.
We can analyse and compare the resource occupation period before and after 
the optimization.
TaskScheduler.start execution time: [time1__]
DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or 
TaskScheduler.start) execution time: [time2_]
HadoopRDD.getPartitions execution time: [time3___]
Stages execution time: [time4_]
The cluster-resource occupation period before the optimization is 
[time2_][time3___][time4_].
The cluster-resource occupation period after the optimization is 
[time3___][time4_].
In summary, the cluster-resource occupation period after the optimization is 
shorter than before. If HadoopRDD.getPartitions could also be moved forward 
(SPARK-4961), the period may shrink further, to [time4_]. This resource 
saving is important for a busy cluster.

The main purpose of this PR is to reduce resource waste on a busy cluster.
For example, a process initializes a SparkContext instance, reads a few files 
from HDFS or many records from PostgreSQL, and then calls an RDD's collect to 
submit a job.
When SparkContext is initialized, an app is submitted to the cluster and some 
resources are held by this app. These resources are not really used until a 
job is submitted by an RDD action, so the resources held from initialization 
to actual use can be considered wasted.
If the app is submitted when SparkContext is initialized, all the resources 
it needs may be granted before the job runs, and the job can then run 
efficiently without resource constraints.
Conversely, if the app is submitted only when the job is submitted, its 
resources may be granted at different times, and the job may run less 
efficiently while some resources are still being applied for.
Thus I use a configuration parameter spark.scheduler.app.slowstart (default 
false) to let users make the tradeoff between economy and efficiency; a usage 
sketch follows below.
There are 9 kinds of master URL and 6 kinds of SchedulerBackend.
LocalBackend and SimrSchedulerBackend don't need a deferred start, since it 
makes no difference for them.
SparkClusterSchedulerBackend (yarn-standalone or yarn-cluster) does not defer 
the start, since the app must be submitted in advance by SparkSubmit.
CoarseMesosSchedulerBackend and MesosSchedulerBackend could defer the start.
YarnClientSchedulerBackend (yarn-client) could defer the start.
For now, this PR puts TaskScheduler.start back only for yarn-client mode.
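
As a usage sketch (assuming the configuration key shown in the diffs 
elsewhere in this thread), opting in would look like:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Opt in to the deferred (slow) start; the default of false keeps the
// existing behavior. Per the PR, this only takes effect in yarn-client mode.
val conf = new SparkConf()
  .setMaster("yarn-client")
  .setAppName("slow-start-demo")
  .set("spark.scheduler.app.slowstart", "true")
val sc = new SparkContext(conf)
// TaskScheduler.start, and hence the resource request to YARN, now happens
// just before the first job's stages run instead of at construction time.
```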

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-4962

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3810.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3810


commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-06T13:07:08Z

Merge pull request #1 from apache/master

update

commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-20T13:14:08Z

Merge pull request #3 from apache/master

Update

commit 8a0010691b669495b4c327cf83124cabb7da1405
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-12T06:54:58Z

Merge pull request #6 from apache/master

Update

commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-16T12:03:22Z

Merge pull request #7 from apache/master

Update

commit 76d40277d51f709247df1d3734093bf2c047737d
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-10-20T12:52:22Z

Merge pull request #8 from apache/master

update

commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-04T09:00:31Z

Merge pull request #9 from apache/master

Update

commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-11T03:18:24Z

Merge pull request #10 from apache/master

Update

commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12

[GitHub] spark pull request: [SPARK-4723] [CORE] To abort the stages which ...

2014-12-24 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3786#issuecomment-68082514
  
@markhamstra Thanks for your comment. I will dig into why the stage is 
attempted so many times.





[GitHub] spark pull request: [SPARK-4723] [CORE] To abort the stages which ...

2014-12-24 Thread YanTangZhai
Github user YanTangZhai closed the pull request at:

https://github.com/apache/spark/pull/3786





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-24 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/3794

[SPARK-4961] [CORE] Put HadoopRDD.getPartitions forward to reduce 
DAGScheduler.JobSubmitted processing time

HadoopRDD.getPartitions is computed lazily, inside DAGScheduler.JobSubmitted.
If the input dir is large, getPartitions may take a long time; in our cluster 
it ranges from 0.029s to 766.699s. While one JobSubmitted event is being 
processed, the others must wait. Thus, we want to move 
HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted 
processing time, so that other JobSubmitted events don't have to wait as 
long. A HadoopRDD object could compute its partitions when it is 
instantiated.
We can analyse and compare the execution time before and after the 
optimization.
TaskScheduler.start execution time: [time1__]
DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or 
TaskScheduler.start) execution time: [time2_]
HadoopRDD.getPartitions execution time: [time3___]
Stages execution time: [time4_]
(1) The app has only one job.
The job's execution time before the optimization is 
[time1__][time2_][time3___][time4_].
The job's execution time after the optimization is 
[time1__][time3___][time2_][time4_].
In summary, if the app has only one job, the total execution time is the same 
before and after the optimization.
(2) The app has 4 jobs.
Before the optimization,
job1 execution time is [time2_][time3___][time4_],
job2 execution time is [time2__][time3___][time4_],
job3 execution time is [time2][time3___][time4_],
job4 execution time is [time2_][time3___][time4_].
After the optimization,
job1 execution time is [time3___][time2_][time4_],
job2 execution time is [time3___][time2__][time4_],
job3 execution time is [time3___][time2_][time4_],
job4 execution time is [time3___][time2__][time4_].
In summary, if the app has multiple jobs, the average execution time after 
the optimization is less than before.
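
Without the patch, a driver program can approximate this effect by hand. A 
workaround sketch (the input path is illustrative), forcing split computation 
on the driver thread before any job is submitted:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Touch rdd.partitions eagerly so HadoopRDD.getPartitions runs here, on the
// driver thread, instead of inside the DAGScheduler's JobSubmitted handler.
val sc = new SparkContext(new SparkConf().setAppName("eager-splits"))
val rdd = sc.textFile("hdfs:///user/demo/large-inputdir") // illustrative path
rdd.partitions                               // computes and caches splits now
val total = rdd.map(_.length).reduce(_ + _)  // the job no longer pays that cost
```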

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-4961

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3794.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3794


commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-06T13:07:08Z

Merge pull request #1 from apache/master

update

commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-20T13:14:08Z

Merge pull request #3 from apache/master

Update

commit 8a0010691b669495b4c327cf83124cabb7da1405
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-12T06:54:58Z

Merge pull request #6 from apache/master

Update

commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-16T12:03:22Z

Merge pull request #7 from apache/master

Update

commit 76d40277d51f709247df1d3734093bf2c047737d
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-10-20T12:52:22Z

Merge pull request #8 from apache/master

update

commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-04T09:00:31Z

Merge pull request #9 from apache/master

Update

commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-11T03:18:24Z

Merge pull request #10 from apache/master

Update

commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-01T11:23:56Z

Merge pull request #11 from apache/master

Update

commit 718afebe364bd54ac33be425e24183eb1c76b5d3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-05T11:08:31Z

Merge pull request #12 from apache/master

update

commit e4c2c0a18bdc78cc17823cbc2adf3926944e6bc5
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-24T03:15:22Z

Merge pull request #15 from apache/master

update

commit 5601a8b1458c9a7317a2e4e0463358f0a054c181
Author: yantangzhai tyz0...@163.com
Date:   2014-12-25T03:17:57Z

[SPARK-4961] [CORE] Put HadoopRDD.getPartitions forward to reduce 
DAGScheduler.JobSubmitted processing time





[GitHub] spark pull request: [SPARK-3545] Put HadoopRDD.getPartitions forwa...

2014-12-23 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/2409#issuecomment-68021964
  
@JoshRosen Thanks. I will divide this JIRA/PR into two JIRAs/PRs.





[GitHub] spark pull request: [SPARK-4946] [CORE] Using AkkaUtils.askWithRep...

2014-12-23 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/3785

[SPARK-4946] [CORE] Using AkkaUtils.askWithReply in 
MapOutputTracker.askTracker to reduce the chance of communication failures

Using AkkaUtils.askWithReply in MapOutputTracker.askTracker to reduce the 
chance of communication failures.
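
AkkaUtils.askWithReply's exact signature isn't shown in this thread, so here 
is a generic retry-on-timeout ask in the same spirit; the helper's name and 
defaults are assumptions, not Spark's code.

```scala
import akka.actor.ActorRef
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

// Ask an actor and retry on timeout instead of failing on the first miss.
def askWithRetries[T](actor: ActorRef, message: Any,
    attempts: Int = 3, timeout: FiniteDuration = 30.seconds): T = {
  implicit val t: Timeout = Timeout(timeout)
  var lastError: Throwable = null
  for (_ <- 1 to attempts) {
    try {
      return Await.result(actor ? message, timeout).asInstanceOf[T]
    } catch {
      case e: java.util.concurrent.TimeoutException => lastError = e
    }
  }
  throw new RuntimeException(s"Ask failed after $attempts attempts", lastError)
}
```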

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-4946

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3785.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3785


commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-06T13:07:08Z

Merge pull request #1 from apache/master

update

commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-20T13:14:08Z

Merge pull request #3 from apache/master

Update

commit 8a0010691b669495b4c327cf83124cabb7da1405
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-12T06:54:58Z

Merge pull request #6 from apache/master

Update

commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-16T12:03:22Z

Merge pull request #7 from apache/master

Update

commit 76d40277d51f709247df1d3734093bf2c047737d
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-10-20T12:52:22Z

Merge pull request #8 from apache/master

update

commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-04T09:00:31Z

Merge pull request #9 from apache/master

Update

commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-11T03:18:24Z

Merge pull request #10 from apache/master

Update

commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-01T11:23:56Z

Merge pull request #11 from apache/master

Update

commit 718afebe364bd54ac33be425e24183eb1c76b5d3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-05T11:08:31Z

Merge pull request #12 from apache/master

update

commit e4c2c0a18bdc78cc17823cbc2adf3926944e6bc5
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-24T03:15:22Z

Merge pull request #15 from apache/master

update

commit 9ca65418c4d859b7ded77697e81d09f33a43b9a4
Author: yantangzhai tyz0...@163.com
Date:   2014-12-24T06:17:32Z

[SPARK-4946] [CORE] Using AkkaUtils.askWithReply in 
MapOutputTracker.askTracker to reduce the chance of the communicating problem







[GitHub] spark pull request: [SPARK-4723] [CORE] To abort the stages which ...

2014-12-23 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/3786

[SPARK-4723] [CORE] To abort the stages which have attempted some times

For some reason, a stage may be attempted many times. A threshold could be 
added, and stages that have been attempted more times than the threshold 
could be aborted, as in the sketch below.
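
A minimal sketch of the threshold idea; the field and conf names here are 
hypothetical, not the DAGScheduler's actual ones.

```scala
import scala.collection.mutable

// Hypothetical names throughout; the limit might come from a conf such as
// "spark.stage.maxAttempts".
val maxStageAttempts = 8
val stageAttempts = mutable.Map.empty[Int, Int].withDefaultValue(0)

def onStageResubmit(stageId: Int, abortStage: String => Unit): Unit = {
  stageAttempts(stageId) += 1
  if (stageAttempts(stageId) > maxStageAttempts) {
    abortStage(s"Stage $stageId exceeded $maxStageAttempts attempts; aborting")
  }
}
```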

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-4723

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3786.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3786


commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-06T13:07:08Z

Merge pull request #1 from apache/master

update

commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-20T13:14:08Z

Merge pull request #3 from apache/master

Update

commit 8a0010691b669495b4c327cf83124cabb7da1405
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-12T06:54:58Z

Merge pull request #6 from apache/master

Update

commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-16T12:03:22Z

Merge pull request #7 from apache/master

Update

commit 76d40277d51f709247df1d3734093bf2c047737d
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-10-20T12:52:22Z

Merge pull request #8 from apache/master

update

commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-04T09:00:31Z

Merge pull request #9 from apache/master

Update

commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-11T03:18:24Z

Merge pull request #10 from apache/master

Update

commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-01T11:23:56Z

Merge pull request #11 from apache/master

Update

commit 718afebe364bd54ac33be425e24183eb1c76b5d3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-05T11:08:31Z

Merge pull request #12 from apache/master

update

commit e4c2c0a18bdc78cc17823cbc2adf3926944e6bc5
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-24T03:15:22Z

Merge pull request #15 from apache/master

update

commit 003774ab2dea5c0f6fd70e68c385178cc235d1c2
Author: yantangzhai tyz0...@163.com
Date:   2014-12-24T06:54:17Z

[SPARK-4723] [CORE] To abort the stages which have attempted some times







[GitHub] spark pull request: [SPARK-4692] [SQL] Support ! boolean logic ope...

2014-12-22 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3555#issuecomment-67816709
  
@liancheng I will revert the last space change. Thanks for your comment.





[GitHub] spark pull request: [SPARK-4693] [SQL] PruningPredicates may be wr...

2014-12-18 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3556#issuecomment-67472596
  
@marmbrus Please review again. Thanks.





[GitHub] spark pull request: [SPARK-4692] [SQL] Support ! boolean logic ope...

2014-12-18 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3555#issuecomment-67473028
  
@marmbrus  Please review again. Thanks.





[GitHub] spark pull request: [SPARK-4693] [SQL] PruningPredicates may be wr...

2014-12-17 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3556#issuecomment-67437985
  
@marmbrus Thank you for your comments. I will do it right away.





[GitHub] spark pull request: [WIP] [SPARK-4273] [SQL] Providing ExternalSet...

2014-12-17 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3137#issuecomment-67452153
  
@marmbrus Thanks. I'm also trying another approach to optimize this 
operation. I want to discuss it with you later.





[GitHub] spark pull request: [SPARK-4692] [SQL] Support ! boolean logic ope...

2014-12-02 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/3555

[SPARK-4692] [SQL] Support ! boolean logic operator like NOT

Support the ! boolean logic operator (like NOT) in SQL, as follows:
select * from for_test where !(col1 > col2)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-4692

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3555.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3555


commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-06T13:07:08Z

Merge pull request #1 from apache/master

update

commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-20T13:14:08Z

Merge pull request #3 from apache/master

Update

commit 8a0010691b669495b4c327cf83124cabb7da1405
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-12T06:54:58Z

Merge pull request #6 from apache/master

Update

commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-16T12:03:22Z

Merge pull request #7 from apache/master

Update

commit 76d40277d51f709247df1d3734093bf2c047737d
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-10-20T12:52:22Z

Merge pull request #8 from apache/master

update

commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-04T09:00:31Z

Merge pull request #9 from apache/master

Update

commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-11T03:18:24Z

Merge pull request #10 from apache/master

Update

commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-01T11:23:56Z

Merge pull request #11 from apache/master

Update

commit 92242c7c07d7d9f5aea2111b548a3355f3633a7d
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-02T10:57:59Z

Update HiveQl.scala







[GitHub] spark pull request: [SPARK-4693] [SQL] PruningPredicates may be wr...

2014-12-02 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/3556

[SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an 
empty AttributeSet() references

The SQL query "select * from spark_test::for_test where abs(20141202) is not 
null" has predicates=List(IS NOT NULL 
HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)) and 
partitionKeyIds=AttributeSet(). Its PruningPredicates is List(IS NOT NULL 
HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)), so the 
exception "java.lang.IllegalArgumentException: requirement failed: Partition 
pruning predicates only supported for partitioned tables." is thrown.
The SQL query "select * from spark_test::for_test_partitioned_table where 
abs(20141202) is not null and type_id=11 and platform = 3", on a table with 
partition key insert_date, has predicates=List(IS NOT NULL 
HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202), (type_id#12 = 
11), (platform#8 = 3)) and partitionKeyIds=AttributeSet(insert_date#24). Its 
PruningPredicates is again List(IS NOT NULL 
HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)), even though the 
UDF predicate references no partition key.
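
For illustration, here is a minimal, self-contained Scala sketch of the guard 
this report implies (the Predicate type and all names are simplified 
stand-ins, not the committed HiveStrategies.scala code): a predicate should 
only qualify as a pruning predicate when its references are non-empty and 
fall entirely within the partition keys, so a UDF predicate with an empty 
AttributeSet no longer leaks into the pruning list.

case class Predicate(name: String, references: Set[String])

// Stand-ins for the partitioned-table query discussed above.
val partitionKeyIds = Set("insert_date#24")
val predicates = List(
  Predicate("IS NOT NULL UDFAbs(20141202)", Set.empty),  // empty references
  Predicate("type_id#12 = 11", Set("type_id#12")),
  Predicate("platform#8 = 3", Set("platform#8")))

// The buggy split keeps the UDF predicate because an empty set is trivially
// a subset of the partition keys; the fixed split also demands non-emptiness.
val buggyPruning = predicates.filter(_.references.subsetOf(partitionKeyIds))
val fixedPruning = predicates.filter(p =>
  p.references.nonEmpty && p.references.subsetOf(partitionKeyIds))

println(buggyPruning.map(_.name))  // List(IS NOT NULL UDFAbs(20141202))
println(fixedPruning.map(_.name))  // List() -- nothing prunes on insert_date here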

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-4693

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3556.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3556


commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-06T13:07:08Z

Merge pull request #1 from apache/master

update

commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-20T13:14:08Z

Merge pull request #3 from apache/master

Update

commit 8a0010691b669495b4c327cf83124cabb7da1405
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-12T06:54:58Z

Merge pull request #6 from apache/master

Update

commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-16T12:03:22Z

Merge pull request #7 from apache/master

Update

commit 76d40277d51f709247df1d3734093bf2c047737d
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-10-20T12:52:22Z

Merge pull request #8 from apache/master

update

commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-04T09:00:31Z

Merge pull request #9 from apache/master

Update

commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-11T03:18:24Z

Merge pull request #10 from apache/master

Update

commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-01T11:23:56Z

Merge pull request #11 from apache/master

Update

commit e572b9a754a71da1f5bdb53c283b936ab803def2
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-02T12:27:14Z

Update HiveStrategies.scala




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4676] [SQL] JavaSchemaRDD.schema may th...

2014-12-01 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/3538

[SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if 
sql has null

val jsc = new org.apache.spark.api.java.JavaSparkContext(sc)
val jhc = new org.apache.spark.sql.hive.api.java.JavaHiveContext(jsc)
val nrdd = jhc.hql("select null from spark_test.for_test")
println(nrdd.schema)
Then the following error is thrown:
scala.MatchError: NullType (of class org.apache.spark.sql.catalyst.types.NullType$)
at org.apache.spark.sql.types.util.DataTypeConversions$.asJavaDataType(DataTypeConversions.scala:43)
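
As a self-contained toy of the failure mode (the types below are illustrative 
stand-ins, not Spark's classes): a pattern match that enumerates only the 
concrete data types throws scala.MatchError when handed NullType, and the fix 
is an explicit case.

sealed trait DataType
case object StringType extends DataType
case object IntegerType extends DataType
case object NullType extends DataType

def asJavaDataType(dt: DataType): String = dt match {
  case StringType  => "StringType"
  case IntegerType => "IntegerType"
  case NullType    => "NullType"  // without this case: scala.MatchError: NullType
}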


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark MatchNullType

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3538.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3538


commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-06T13:07:08Z

Merge pull request #1 from apache/master

update

commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-20T13:14:08Z

Merge pull request #3 from apache/master

Update

commit 8a0010691b669495b4c327cf83124cabb7da1405
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-12T06:54:58Z

Merge pull request #6 from apache/master

Update

commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-16T12:03:22Z

Merge pull request #7 from apache/master

Update

commit 76d40277d51f709247df1d3734093bf2c047737d
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-10-20T12:52:22Z

Merge pull request #8 from apache/master

update

commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-04T09:00:31Z

Merge pull request #9 from apache/master

Update

commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-11T03:18:24Z

Merge pull request #10 from apache/master

Update

commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-01T11:23:56Z

Merge pull request #11 from apache/master

Update

commit 896c7b73f0ba1b2d3dccf6fed6410bf077eb3d54
Author: yantangzhai tyz0...@163.com
Date:   2014-12-01T13:08:41Z

fix NullType MatchError in JavaSchemaRDD when sql has null




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4677] [WEB] Add hadoop input time in ta...

2014-12-01 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/3539

[SPARK-4677] [WEB] Add hadoop input time in task webui

Add Hadoop input time to the task web UI, like GC Time, to explicitly show 
the time a task spends reading its input data.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark WebuiInputTime

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3539.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3539


commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-06T13:07:08Z

Merge pull request #1 from apache/master

update

commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-20T13:14:08Z

Merge pull request #3 from apache/master

Update

commit 8a0010691b669495b4c327cf83124cabb7da1405
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-12T06:54:58Z

Merge pull request #6 from apache/master

Update

commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-16T12:03:22Z

Merge pull request #7 from apache/master

Update

commit 76d40277d51f709247df1d3734093bf2c047737d
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-10-20T12:52:22Z

Merge pull request #8 from apache/master

update

commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-04T09:00:31Z

Merge pull request #9 from apache/master

Update

commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-11T03:18:24Z

Merge pull request #10 from apache/master

Update

commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-01T11:23:56Z

Merge pull request #11 from apache/master

Update

commit 3816f8540b947809cb821bcb3af36d7be0210d9c
Author: yantangzhai tyz0...@163.com
Date:   2014-12-01T14:09:24Z

add hadoop input read time in webui




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4677] [WEB] Add hadoop input time in ta...

2014-12-01 Thread YanTangZhai
Github user YanTangZhai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3539#discussion_r21140476
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -238,10 +238,13 @@ class HadoopRDD[K, V](
   val value: V = reader.createValue()
 
   var recordsSinceMetricsUpdate = 0
+  var startTime : Long = 0L
 
   override def getNext() = {
 try {
+  startTime = System.nanoTime
   finished = !reader.next(key, value)
+  inputMetrics.readTime += (System.nanoTime - startTime)
--- End diff --

Oh sorry. It may be expensive. Let me think about it. Thanks.
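
One way to bound that cost, sketched below with illustrative names (a sketch 
only, not this PR's code): sample System.nanoTime for one record out of every 
N and extrapolate, rather than timing every reader.next call.

// Illustrative sampling timer: measure one record in `sampleEvery` and scale
// the sampled duration back up, assuming record read costs are roughly uniform.
val sampleEvery = 100
var recordsRead = 0L
var sampledReadNanos = 0L

def timedRead[A](read: () => A): A = {
  recordsRead += 1
  if (recordsRead % sampleEvery == 0) {
    val start = System.nanoTime  // nanoTime is paid on 1 record in 100
    val record = read()
    sampledReadNanos += System.nanoTime - start
    record
  } else {
    read()
  }
}

def estimatedReadNanos: Long = sampledReadNanos * sampleEvery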


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4677] [WEB] Add hadoop input time in ta...

2014-12-01 Thread YanTangZhai
Github user YanTangZhai closed the pull request at:

https://github.com/apache/spark/pull/3539


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4401] [SQL] RuleExecutor should log tra...

2014-11-14 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/3265

[SPARK-4401] [SQL] RuleExecutor should log trace correct iteration num

RuleExecutor should log trace correct iteration num

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-4401

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3265.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3265


commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-06T13:07:08Z

Merge pull request #1 from apache/master

update

commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-20T13:14:08Z

Merge pull request #3 from apache/master

Update

commit 8a0010691b669495b4c327cf83124cabb7da1405
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-12T06:54:58Z

Merge pull request #6 from apache/master

Update

commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-16T12:03:22Z

Merge pull request #7 from apache/master

Update

commit 76d40277d51f709247df1d3734093bf2c047737d
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-10-20T12:52:22Z

Merge pull request #8 from apache/master

update

commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-04T09:00:31Z

Merge pull request #9 from apache/master

Update

commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-11T03:18:24Z

Merge pull request #10 from apache/master

Update

commit af326f76c46e2d019dc492fafaac7d3468e837b1
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-14T12:23:55Z

Update RuleExecutor.scala




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4401] [SQL] RuleExecutor should log tra...

2014-11-14 Thread YanTangZhai
Github user YanTangZhai closed the pull request at:

https://github.com/apache/spark/pull/3265


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4401] [SQL] RuleExecutor should log tra...

2014-11-14 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3265#issuecomment-63058643
  
@srowen Thanks. I close this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP] [SPARK-4273] [SQL] Providing ExternalSet...

2014-11-06 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/3137

[WIP] [SPARK-4273] [SQL] Providing ExternalSet to avoid OOM when 
count(distinct)

Some tasks may OOM during count(distinct) if they need to process many 
records. CombineSetsAndCountFunction puts all records into an OpenHashSet; if 
it fetches many records, it may occupy a large amount of memory.
I think a data structure ExternalSet, analogous to ExternalAppendOnlyMap, 
could be provided to store OpenHashSet data on disk when its capacity exceeds 
some threshold.
For example, OpenHashSet1 (ohs1) holds [d, b, c, a]. It is spilled to file1 
sorted by hashCode, so file1 contains [a, b, c, d]. The procedure can be 
sketched as follows:
ohs1 [d, b, c, a] => [a, b, c, d] => file1
ohs2 [e, f, g, a] => [a, e, f, g] => file2
ohs3 [e, h, i, g] => [e, g, h, i] => file3
ohs4 [j, h, a] => [a, h, j] => sortedSet
On output, all keys with the same hashCode are put into one OpenHashSet, and 
that set's iterator is consumed. The procedure can be sketched as follows:
file1 -> a -> ohsA; file2 -> a -> ohsA; sortedSet -> a -> ohsA; ohsA -> a;
file1 -> b -> ohsB; ohsB -> b;
file1 -> c -> ohsC; ohsC -> c;
file1 -> d -> ohsD; ohsD -> d;
file2 -> e -> ohsE; file3 -> e -> ohsE; ohsE -> e;
...
I think using the ExternalSet could avoid OOM during count(distinct). 
Comments welcome.
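
To make the spill-and-merge procedure concrete, here is a minimal, 
self-contained Scala sketch of the idea under simplifying assumptions (Java 
serialization, one temp file per spill, all class and method names are 
illustrative); it is not this patch's code. A caller inserts all values and 
then consumes iterator, never holding every record in one in-memory set.

import java.io.{EOFException, File, FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable

class ExternalSet[T <: java.io.Serializable](spillThreshold: Int = 10000) {
  private val current = mutable.HashSet.empty[T]
  private val spillFiles = mutable.ArrayBuffer.empty[File]

  def insert(elem: T): Unit = {
    current += elem
    if (current.size >= spillThreshold) spill()
  }

  private def spill(): Unit = {
    val file = File.createTempFile("external-set-", ".spill")
    file.deleteOnExit()
    val out = new ObjectOutputStream(new FileOutputStream(file))
    // Sorting by hashCode lets the merge phase stream runs of equal codes.
    try current.toSeq.sortBy(_.hashCode).foreach(out.writeObject)
    finally out.close()
    spillFiles += file
    current.clear()
  }

  // Streams one spill file back as a hashCode-sorted iterator.
  private def fileIterator(file: File): Iterator[T] = new Iterator[T] {
    private val in = new ObjectInputStream(new FileInputStream(file))
    private var nextElem = read()
    private def read(): Option[T] =
      try Some(in.readObject().asInstanceOf[T])
      catch { case _: EOFException => in.close(); None }
    def hasNext: Boolean = nextElem.isDefined
    def next(): T = { val e = nextElem.get; nextElem = read(); e }
  }

  def iterator: Iterator[T] = {
    if (current.nonEmpty) spill()  // treat the remainder as one more sorted run
    val runs = spillFiles.toSeq.map(f => fileIterator(f).buffered).filter(_.hasNext)
    // Min-heap on the hashCode of each run's next element.
    val heap = mutable.PriorityQueue.empty[BufferedIterator[T]](
      Ordering.by((it: BufferedIterator[T]) => it.head.hashCode).reverse)
    runs.foreach(r => heap.enqueue(r))

    new Iterator[T] {
      private var pending: Iterator[T] = Iterator.empty
      def hasNext: Boolean = pending.hasNext || heap.nonEmpty
      def next(): T = {
        if (!pending.hasNext) {
          // Pull every element sharing the minimum hashCode into one small
          // set (the per-hashCode OpenHashSet of the sketch above), which
          // deduplicates values that were spilled to several files.
          val code = heap.head.head.hashCode
          val batch = mutable.HashSet.empty[T]
          while (heap.nonEmpty && heap.head.head.hashCode == code) {
            val run = heap.dequeue()
            while (run.hasNext && run.head.hashCode == code) batch += run.next()
            if (run.hasNext) heap.enqueue(run)
          }
          pending = batch.iterator
        }
        pending.next()
      }
    }
  }
}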

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark ExternalAggregate

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3137.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3137


commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-06T13:07:08Z

Merge pull request #1 from apache/master

update

commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-20T13:14:08Z

Merge pull request #3 from apache/master

Update

commit 8a0010691b669495b4c327cf83124cabb7da1405
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-12T06:54:58Z

Merge pull request #6 from apache/master

Update

commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-16T12:03:22Z

Merge pull request #7 from apache/master

Update

commit 76d40277d51f709247df1d3734093bf2c047737d
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-10-20T12:52:22Z

Merge pull request #8 from apache/master

update

commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-04T09:00:31Z

Merge pull request #9 from apache/master

Update

commit eecb499bb10b21d648ae9e6c0282fafcde111994
Author: yantangzhai tyz0...@163.com
Date:   2014-11-06T12:57:29Z

A method to avoid OOM when count(distinct) by providing ExternalSet




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4009][SQL]HiveTableScan should use make...

2014-10-21 Thread YanTangZhai
Github user YanTangZhai closed the pull request at:

https://github.com/apache/spark/pull/2857


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4009][SQL]HiveTableScan should use make...

2014-10-21 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/2857#issuecomment-59915528
  
@marmbrus Thanks. Please disregard it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4009][SQL]HiveTableScan should use make...

2014-10-20 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/2857

[SPARK-4009][SQL]HiveTableScan should use makeRDDForTable instead of 
makeRDDForPartitionedTable for partitioned table when partitionPruningPred is 
None

HiveTableScan should use makeRDDForTable instead of 
makeRDDForPartitionedTable for a partitioned table when partitionPruningPred 
is None.
If a table has many partitions (for example, more than 20 thousand) while 
holding only a little data (for example, less than 512MB), a SQL query 
against the table will produce a correspondingly large number of RDDs, and 
the job submission fails with a Java stack overflow.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-4009

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2857.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2857


commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-06T13:07:08Z

Merge pull request #1 from apache/master

update

commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-20T13:14:08Z

Merge pull request #3 from apache/master

Update

commit 8a0010691b669495b4c327cf83124cabb7da1405
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-12T06:54:58Z

Merge pull request #6 from apache/master

Update

commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-16T12:03:22Z

Merge pull request #7 from apache/master

Update

commit 76d40277d51f709247df1d3734093bf2c047737d
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-10-20T12:52:22Z

Merge pull request #8 from apache/master

update

commit be7882ce16911d018571fa46c1a175d063bdfd03
Author: yantangzhai tyz0...@163.com
Date:   2014-10-20T13:05:44Z

[SPARK-4009][SQL]HiveTableScan should use makeRDDForTable instead of 
makeRDDForPartitionedTable for partitioned table when partitionPruningPred is 
None




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3545] Put HadoopRDD.getPartitions forwa...

2014-09-16 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/2409

[SPARK-3545] Put HadoopRDD.getPartitions forward and put 
TaskScheduler.start back to reduce DAGScheduler.JobSubmitted processing time 
and shorten cluster resources occupation period

We have two problems:
(1) HadoopRDD.getPartitions is computed lazily, inside DAGScheduler's 
JobSubmitted processing. If the input directory is large, getPartitions may 
take a long time; in our cluster it takes anywhere from 0.029s to 766.699s. 
While one JobSubmitted event is being processed, the others must wait. Thus, 
we want to move HadoopRDD.getPartitions forward to reduce 
DAGScheduler.JobSubmitted processing time, so that other JobSubmitted events 
don't have to wait as long. A HadoopRDD object could compute its partitions 
when it is instantiated.
(2) When a SparkContext object is instantiated, the TaskScheduler is started 
and some resources are allocated from the cluster. However, these resources 
may not be used for a while (for example, while DAGScheduler.JobSubmitted is 
still being processed) and are wasted during that period. Thus, we want to 
move TaskScheduler.start back to shorten the cluster resources occupation 
period, especially on a busy cluster. The TaskScheduler could be started just 
before running stages.
We can analyse and compare the execution time before and after the 
optimization.
TaskScheduler.start execution time: [time1__]
DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions and 
TaskScheduler.start) execution time: [time2_]
HadoopRDD.getPartitions execution time: [time3___]
Stages execution time: [time4_]
(1) The app has only one job
(a)
The execution time of the job before the optimization is 
[time1__][time2_][time3___][time4_].
The execution time of the job after the optimization is 
[time3___][time2_][time1__][time4_].
(b)
The cluster resources occupation period before the optimization is 
[time2_][time3___][time4_].
The cluster resources occupation period after the optimization is 
[time4_].
In summary, if the app has only one job, the total execution time is the same 
before and after the optimization, while the cluster resources occupation 
period after the optimization is shorter.
(2) The app has 4 jobs
(a) Before the optimization:
job1 execution time is [time2_][time3___][time4_],
job2 execution time is [time2__][time3___][time4_],
job3 execution time is [time2][time3___][time4_],
job4 execution time is [time2__][time3___][time4_].
After the optimization:
job1 execution time is [time3___][time2_][time1__][time4_],
job2 execution time is [time3___][time2__][time4_],
job3 execution time is [time3___][time2_][time4_],
job4 execution time is [time3___][time2__][time4_].
In summary, if the app has multiple jobs, the average execution time after 
the optimization is shorter, and so is the cluster resources occupation 
period.
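
As a small illustration of problem (1) only (a sketch against the public RDD 
API, not this patch's code): partition computation can be forced eagerly on 
the caller's thread, so the directory-listing cost is paid before job 
submission rather than inside the DAGScheduler's event processing.

// Assumes sc is a live SparkContext; the path is illustrative.
val rdd = sc.textFile("hdfs:///some/large/input/dir")
val numPartitions = rdd.partitions.length  // forces getPartitions (the listing) here
rdd.count()  // the later JobSubmitted no longer pays the listing cost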

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-3545

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2409.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2409


commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-06T13:07:08Z

Merge pull request #1 from apache/master

update

commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-20T13:14:08Z

Merge pull request #3 from apache/master

Update

commit 8a0010691b669495b4c327cf83124cabb7da1405
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-12T06:54:58Z

Merge pull request #6 from apache/master

Update

commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-16T12:03:22Z

Merge pull request #7 from apache/master

Update

commit b88df438033eecbdbe8cad37b2bd4ad3620de6e2
Author: yantangzhai tyz0...@163.com
Date:   2014-09-16T13:22:12Z

[SPARK-3545] Put HadoopRDD.getPartitions forward and put 
TaskScheduler.start back to reduce DAGScheduler.JobSubmitted processing time 
and shorten cluster resources occupation period




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA

[GitHub] spark pull request: [SPARK-3003] FailedStage could not be cancelle...

2014-09-12 Thread YanTangZhai
Github user YanTangZhai closed the pull request at:

https://github.com/apache/spark/pull/1921


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3003] FailedStage could not be cancelle...

2014-09-12 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1921#issuecomment-55371512
  
@andrewor14 If a running stage hits a fetch failure, it is moved from 
runningStages to failedStages, but it is still shown as alive in the web UI. 
I then tried to kill this stage, and it could not be cancelled. I checked 
again: this problem won't occur in the latest Spark version. I will close 
this PR. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2714] DAGScheduler logs jobid when runJ...

2014-09-12 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1617#issuecomment-55376375
  
@andrewor14 Thanks. Please review again.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2643] Stages web ui has ERROR when pool...

2014-09-11 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1854#issuecomment-55254070
  
@jkbradley I will close this PR. Thank you very much.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2643] Stages web ui has ERROR when pool...

2014-09-11 Thread YanTangZhai
Github user YanTangZhai closed the pull request at:

https://github.com/apache/spark/pull/1854


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2715] ExternalAppendOnlyMap adds max li...

2014-09-11 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1618#issuecomment-55254537
  
@andrewor14 Yeah, I see. I will close the PR. If needed, it could be 
reopened. Thank you very much.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2715] ExternalAppendOnlyMap adds max li...

2014-09-11 Thread YanTangZhai
Github user YanTangZhai closed the pull request at:

https://github.com/apache/spark/pull/1618


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3148] Update global variables of HttpBr...

2014-08-21 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/2059#issuecomment-52884506
  
Hi @JoshRosen SparkContext1 creates a broadcastManager and initializes the 
HttpBroadcast object, which creates an HTTP server, a broadcastDir, and so 
on. However, SparkContext2 in the same process won't initialize the 
HttpBroadcast object when creating its broadcastManager, since the 
HttpBroadcast object is marked initialized and will not be initialized again. 
SparkContext1 and SparkContext2 therefore share the same HttpBroadcast 
object. When SparkContext1 stops HttpBroadcast, the HttpBroadcast in 
SparkContext2 is actually stopped as well. And when HttpBroadcast1 cleans up 
files, some files owned by SparkContext2 may be removed, since the two are 
one and the same.
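
To make the sharing concrete, here is a toy of the pattern described above, 
with illustrative names rather than Spark's actual fields: a global 
initialized flag means the second context's initialize call is a no-op, so 
both contexts point at one directory, and either context's stop tears it 
down for both.

import java.io.File
import java.nio.file.Files

object HttpBroadcastLike {
  private var initialized = false
  private var broadcastDir: File = _

  def initialize(): Unit = synchronized {
    if (!initialized) {  // the second SparkContext skips this branch entirely
      broadcastDir = Files.createTempDirectory("broadcast-").toFile
      initialized = true
    }
  }

  def stop(): Unit = synchronized {
    // Stopping on behalf of either context deletes the directory they share.
    if (broadcastDir != null) broadcastDir.delete()
    broadcastDir = null
    initialized = false
  }
}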


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: Update global variables of HttpBroadcast so th...

2014-08-20 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/2058

Update global variables of HttpBroadcast so that multiple SparkContexts can 
coexist

Update global variables of HttpBroadcast so that multiple SparkContexts can 
coexist

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark httpbroadcast

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2058.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2058


commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-06T13:07:08Z

Merge pull request #1 from apache/master

update

commit b9921f37aa13620c5bea82512b63e2b0a73b5ffa
Author: yantangzhai tyz0...@163.com
Date:   2014-08-20T12:56:00Z

Update global variables of HttpBroadcast so that multiple SparkContexts can 
coexist

commit 07d719ff7d77a66a4b67ef84ba9d4e5e881391fb
Author: yantangzhai tyz0...@163.com
Date:   2014-08-20T12:57:19Z

Update global variables of HttpBroadcast so that multiple SparkContexts can 
coexist




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3148] Update global variables of HttpBr...

2014-08-20 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/2059

[SPARK-3148] Update global variables of HttpBroadcast so that multiple 
SparkContexts can coexist

Update global variables of HttpBroadcast so that multiple SparkContexts can 
coexist

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-3148

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2059.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2059


commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-06T13:07:08Z

Merge pull request #1 from apache/master

update

commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-20T13:14:08Z

Merge pull request #3 from apache/master

Update

commit e751ebd22c0683746b8e13c48570bf22b4de45db
Author: yantangzhai tyz0...@163.com
Date:   2014-08-20T14:07:57Z

[SPARK-3148] Update global variables of HttpBroadcast so that multiple 
SparkContexts can coexist

commit 97b34079b4af178ff2bca42c314aeb0e51687167
Author: yantangzhai tyz0...@163.com
Date:   2014-08-20T14:11:34Z

[SPARK-3148] Update global variables of HttpBroadcast so that multiple 
SparkContexts can coexist




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: Update global variables of HttpBroadcast so th...

2014-08-20 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/2058#issuecomment-52783462
  
#2059 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: Update global variables of HttpBroadcast so th...

2014-08-20 Thread YanTangZhai
Github user YanTangZhai closed the pull request at:

https://github.com/apache/spark/pull/2058


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3067] JobProgressPage could not show Fa...

2014-08-15 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/1966

[SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section 
sometimes

JobProgressPage sometimes fails to show the Fair Scheduler Pools section.
SparkContext starts the web UI and then calls postEnvironmentUpdate. If 
JobProgressPage is accessed between the web UI starting and 
postEnvironmentUpdate, the lazy val isFairScheduler evaluates to false and 
stays false, so the Fair Scheduler Pools section never displays.
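
A toy of the timing hazard (illustrative, not the UI code): a lazy val is 
evaluated once, at first access, and keeps that value forever.

var schedulingMode: Option[String] = None        // filled in by a later update
lazy val isFairScheduler: Boolean = schedulingMode.contains("FAIR")

println(isFairScheduler)       // accessed too early: false, and now frozen
schedulingMode = Some("FAIR")  // the real update arrives afterwards
println(isFairScheduler)       // still false: the pools section never shows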

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-3067

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1966.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1966


commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-06T13:07:08Z

Merge pull request #1 from apache/master

update

commit aac7f7b67d83d4175018d58568cfbd1a639e3d7e
Author: yantangzhai tyz0...@163.com
Date:   2014-08-15T09:04:24Z

[SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section 
sometimes




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3003] FailedStage could not be cancelle...

2014-08-13 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/1921

[SPARK-3003] FailedStage could not be cancelled by DAGScheduler when 
cancelJob or cancelStage

When a stage has changed from running to failed, DAGScheduler cannot cancel 
it on cancelJob or cancelStage, because failJobAndIndependentStages only 
cancels running stages and only posts SparkListenerStageCompleted for those.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-3003

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1921.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1921


commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-06T13:07:08Z

Merge pull request #1 from apache/master

update

commit b736bd729713ba6ca23ae901b34cb8523f2d24b2
Author: yantangzhai tyz0...@163.com
Date:   2014-08-13T13:33:24Z

[SPARK-3003] FailedStage could not be cancelled by DAGScheduler when 
cancelJob or cancelStage




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [Spark 2643] Stages web ui has ERROR when pool...

2014-08-08 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/1854

[Spark 2643] Stages web ui has ERROR when pool name is None

14/07/23 16:01:44 WARN servlet.ServletHandler: /stages/
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:313)
at scala.None$.get(Option.scala:311)
at org.apache.spark.ui.jobs.StageTableBase.stageRow(StageTable.scala:132)
at org.apache.spark.ui.jobs.StageTableBase.org$apache$spark$ui$jobs$StageTableBase$$renderStageRow(StageTable.scala:150)
at org.apache.spark.ui.jobs.StageTableBase$$anonfun$toNodeSeq$1.apply(StageTable.scala:52)
at org.apache.spark.ui.jobs.StageTableBase$$anonfun$toNodeSeq$1.apply(StageTable.scala:52)
at org.apache.spark.ui.jobs.StageTableBase$$anonfun$stageTable$1.apply(StageTable.scala:61)
at org.apache.spark.ui.jobs.StageTableBase$$anonfun$stageTable$1.apply(StageTable.scala:61)
at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
at scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
at scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
at scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.xml.NodeBuffer.$amp$plus(NodeBuffer.scala:38)
at scala.xml.NodeBuffer.$amp$plus(NodeBuffer.scala:40)
at org.apache.spark.ui.jobs.StageTableBase.stageTable(StageTable.scala:60)
at org.apache.spark.ui.jobs.StageTableBase.toNodeSeq(StageTable.scala:52)
at org.apache.spark.ui.jobs.JobProgressPage.render(JobProgressPage.scala:91)
at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:65)
at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:65)
at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:70)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:370)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:744)
14/07/23 16:01:44 WARN server.AbstractHttpConnection: /stages/
java.lang.NoSuchMethodError: javax.servlet.http.HttpServletRequest.isAsyncStarted()Z
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:583)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255

[GitHub] spark pull request: [SPARK-2643] Stages web ui has ERROR when pool...

2014-08-08 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1854#issuecomment-51595260
  
@srowen completedStages may contain stages such as ...10, 10, 10, 10, 10, 11, 
18..., while activeStages may contain 1, 10, 5 (each id unique) and 
stageIdToData may contain ...10, 11, 18... (each id unique). When 
completedStages is trimmed, the duplicate entries ...10, 10 may be removed, 
whereas stageIdToData should not remove 11 and 18. If stageIdToData has 
removed 11 and 18, the web UI cannot show the pool name or description of 11 
and 18 under completed stages, and the problem remains.
Please review again, thanks.
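
A sketch of the bookkeeping constraint being argued here (illustrative names, 
not the listener's actual code): when trimming the completed list, an id's 
data may only be dropped if no surviving stage, active or completed, still 
references it.

import scala.collection.mutable

val completedStages = mutable.Buffer(10, 10, 10, 10, 10, 11, 18)
val activeStages = mutable.Set(1, 10, 5)
val stageIdToData = mutable.Map(10 -> "data10", 11 -> "data11", 18 -> "data18")

def trim(n: Int): Unit = {
  val removed = completedStages.take(n)
  completedStages.remove(0, n)
  for (id <- removed.distinct) {
    val stillReferenced = activeStages.contains(id) || completedStages.contains(id)
    if (!stillReferenced) stageIdToData -= id  // 11 and 18 must survive a trim of the 10s
  }
}

trim(5)  // drops five 10s; 10 is still active, 11 and 18 are still completed
println(stageIdToData.keys.toSeq.sorted)  // List(10, 11, 18)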


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2643] Stages web ui has ERROR when pool...

2014-08-08 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1854#issuecomment-51597142
  
@srowen I see, thanks. I will modify. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2643] Stages web ui has ERROR when pool...

2014-08-08 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1854#issuecomment-51601604
  
@srowen Stage 10 will be removed from stageIdToData later: it will be added 
to completedStages or failedStages again, and removed from activeStages, when 
it completes. Some time later it will be removed from stageIdToData by 
trimIfNecessary, since it is no longer in activeStages.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2643] Stages web ui has ERROR when pool...

2014-08-08 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1854#issuecomment-51604647
  
Please review again, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-08-05 Thread YanTangZhai
Github user YanTangZhai closed the pull request at:

https://github.com/apache/spark/pull/1392


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-08-05 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1392#issuecomment-51190110
  
@pwendell Sorry, I'm late. Please disregard this PR since #1734 has been 
closed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-08-05 Thread YanTangZhai
Github user YanTangZhai closed the pull request at:

https://github.com/apache/spark/pull/1244


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2714] DAGScheduler logs jobid when runJ...

2014-07-29 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1617#issuecomment-50564465
  
Hi @markhamstra When DAGScheduler runs multiple jobs concurrently, 
SparkContext only logs "Job finished" lines into the same file, which doesn't 
tell the jobs apart. It's difficult to find out which job has finished, or 
how long it took, from multiple "Job finished: ..., took ... s" log lines.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2714] DAGScheduler logs jobid when runJ...

2014-07-28 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/1617

[SPARK-2714] DAGScheduler logs jobid when runJob finishes

DAGScheduler logs jobid when runJob finishes

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-2714

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1617.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1617


commit 090d90874cc3d2c3f6e884ab7942f7554025535c
Author: yantangzhai tyz0...@163.com
Date:   2014-07-28T13:41:39Z

[SPARK-2714] DAGScheduler logs jobid when runJob finishes

commit fb42f0f831d2ec094f26e7f4d5812c05e8c60e99
Author: yantangzhai tyz0...@163.com
Date:   2014-07-28T13:47:15Z

[SPARK-2714] DAGScheduler logs jobid when runJob finishes




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2715] ExternalAppendOnlyMap adds max li...

2014-07-28 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/1618

[SPARK-2715] ExternalAppendOnlyMap adds max limit of times and max limit of 
disk bytes written for spilling

ExternalAppendOnlyMap adds a max limit on the number of spills and a max 
limit on the disk bytes written while spilling. A task with data skew can 
therefore be made to fail fast instead of running for a long time; see the 
sketch below.
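
A minimal sketch of such limits (names are illustrative, not the PR's actual 
configuration keys); a zero limit disables the check, so default behaviour is 
unchanged.

class SpillGuard(maxSpills: Int = 0, maxDiskBytes: Long = 0L) {
  private var spills = 0
  private var diskBytes = 0L

  // Call once per spill; throws when a configured limit is exceeded.
  def onSpill(bytesWritten: Long): Unit = {
    spills += 1
    diskBytes += bytesWritten
    if (maxSpills > 0 && spills > maxSpills)
      sys.error(s"spilled $spills times (limit $maxSpills): likely data skew, failing fast")
    if (maxDiskBytes > 0 && diskBytes > maxDiskBytes)
      sys.error(s"wrote $diskBytes spill bytes (limit $maxDiskBytes): likely data skew, failing fast")
  }
}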

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-2715

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1618.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1618


commit 3fc60119e9dee8d2a781316ade17812b1367849b
Author: yantangzhai tyz0...@163.com
Date:   2014-07-28T14:22:38Z

[SPARK-2715] ExternalAppendOnlyMap adds max limit of times and max limit of 
disk bytes written for spilling




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2715] ExternalAppendOnlyMap adds max li...

2014-07-28 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1618#issuecomment-50425621
  
Hi @andrewor14 The default values of the two max limits are zero, which keeps 
the original behaviour and does not fail an application that is running 
perfectly fine. If some application has skewed data that we don't expect, it 
will run for a very long time; we would rather have such an application fail 
fast.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2647] DAGScheduler plugs other JobSubmi...

2014-07-27 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1548#issuecomment-50292640
  
@markhamstra Ok. Thank you very much.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2647] DAGScheduler plugs other JobSubmi...

2014-07-27 Thread YanTangZhai
Github user YanTangZhai closed the pull request at:

https://github.com/apache/spark/pull/1548


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2647] DAGScheduler plugs other JobSubmi...

2014-07-24 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1548#issuecomment-5727
  
Hi @markhamstra, you are right. I will think of other ways to solve this 
problem. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2647] DAGScheduler plugs other JobSubmi...

2014-07-23 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/1548

[SPARK-2647] DAGScheduler plugs other JobSubmitted events when processing 
one JobSubmitted event

If several jobs are submitted, DAGScheduler blocks other JobSubmitted 
events while it processes one JobSubmitted event.
For example, one JobSubmitted event is processed as follows and takes a 
long time:
spark-akka.actor.default-dispatcher-67 daemon prio=10 
tid=0x7f75ec001000 nid=0x7dd6 in Object.wait() [0x7f76063e1000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:503)
at org.apache.hadoopcdh3.ipc.Client.call(Client.java:1130)
locked 0x000783b17330 (a org.apache.hadoopcdh3.ipc.Client$Call)
at org.apache.hadoopcdh3.ipc.RPC$Invoker.invoke(RPC.java:241)
at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source)
at sun.reflect.GeneratedMethodAccessor86.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoopcdh3.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:83)
at 
org.apache.hadoopcdh3.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:60)
at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source)
at 
org.apache.hadoopcdh3.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1472)
at 
org.apache.hadoopcdh3.hdfs.DFSClient.getBlockLocations(DFSClient.java:1498)
at 
org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$1.doCall(Cdh3DistributedFileSystem.java:208)
at 
org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$1.doCall(Cdh3DistributedFileSystem.java:204)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getFileBlockLocations(Cdh3DistributedFileSystem.java:204)
at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1812)
at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1797)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:233)
at 
StorageEngineClient.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:141)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:172)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:54)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:54)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:54)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32
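
The dump above shows HadoopRDD.getPartitions performing HDFS RPCs on the 
scheduler's single event-processing thread. Since RDD.partitions caches its 
result, one possible workaround is to force partition computation on the 
submitting thread before the job reaches the DAGScheduler; a minimal sketch, 
assuming sc is an existing SparkContext:

    // Sketch only: materialize partitions eagerly on the caller's thread.
    val rdd = sc.textFile("hdfs://namenode/path").map(line => line.length)
    rdd.partitions   // HadoopRDD.getPartitions (and its HDFS RPCs) runs here
    rdd.count()      // the JobSubmitted event now finds partitions already cached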

[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-21 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1392#issuecomment-49584362
  
Hi @andrewor14 , that's ok. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-13 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/1392

[SPARK-2290] Worker should directly use its own sparkHome instead of 
appDesc.sparkHome when LaunchExecutor

Worker should directly use its own sparkHome instead of appDesc.sparkHome 
when handling LaunchExecutor.
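
A minimal sketch of the intent, with hypothetical variable names rather 
than the exact Worker.scala change:

    // In Worker's LaunchExecutor handling:
    // before (problematic): appDesc.sparkHome is a path on the driver's
    // machine, which may not exist on this worker.
    // val executorSparkHome = appDesc.sparkHome
    // after: use this worker's own installation directory.
    val executorSparkHome = sparkHome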

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-2290

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1392.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1392


commit d3072fc05c7c20ec9d90732db2b9b26a4d27e290
Author: YanTangZhai tyz0...@163.com
Date:   2014-07-13T11:50:14Z

Update ApplicationDescription.scala

commit 78ec6bc8c5d1af64ca21e1a231b47911df6d4f90
Author: YanTangZhai tyz0...@163.com
Date:   2014-07-13T11:52:34Z

Update JsonProtocol.scala

commit 95e6ccc354167117430ce4cb7b2f5063a454ff1d
Author: YanTangZhai tyz0...@163.com
Date:   2014-07-13T11:54:55Z

Update TestClient.scala

commit 508dcb65d04e3f12f99e03572a1cc277e7f1aeca
Author: YanTangZhai tyz0...@163.com
Date:   2014-07-13T11:58:01Z

Update SparkDeploySchedulerBackend.scala

commit 6d6700aaad941779485eee2c35c4ab0cd278529e
Author: YanTangZhai tyz0...@163.com
Date:   2014-07-13T12:01:40Z

Update Worker.scala

commit c360154ae5b03e7854d63573494fc6113295a7ec
Author: YanTangZhai tyz0...@163.com
Date:   2014-07-13T12:04:16Z

Update JsonProtocolSuite.scala

commit 6febb215fb73735760fae957a4e71e2a61c17c77
Author: YanTangZhai tyz0...@163.com
Date:   2014-07-13T12:07:35Z

Update ExecutorRunnerTest.scala




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-13 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1392#issuecomment-48839557
  
#1244


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-13 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1392#issuecomment-48839668
  
fix #1244 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-13 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1244#issuecomment-48839912
  
I've fixed the compile problem. Please review and test again. Thanks very 
much.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2325] Utils.getLocalDir had better chec...

2014-07-13 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1281#issuecomment-48840373
  
Hi @ash211, I think this change is needed. The method Utils.getLocalDir is 
used by components such as HttpBroadcast, which is different from 
DiskBlockManager, so the two problems are distinct. Even though #1274 has 
been merged, this problem still exists. Please review again. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2325] Utils.getLocalDir had better chec...

2014-07-02 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/1281

[SPARK-2325] Utils.getLocalDir had better check the directory and choose a 
good one instead of choosing the first one directly

If the first directory in spark.local.dir is bad, the application will exit 
with the exception:
Exception in thread "main" java.io.IOException: Failed to create a temp 
directory (under /data1/sparkenv/local) after 10 attempts!
at org.apache.spark.util.Utils$.createTempDir(Utils.scala:258)
at 
org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:154)
at 
org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127)
at 
org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
at 
org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
at 
org.apache.spark.broadcast.BroadcastManager.&lt;init&gt;(BroadcastManager.scala:35)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
at org.apache.spark.SparkContext.&lt;init&gt;(SparkContext.scala:202)
at JobTaskJoin$.main(JobTaskJoin.scala:9)
at JobTaskJoin.main(JobTaskJoin.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Utils.getLocalDir had better check each directory and choose a good one 
instead of picking the first one directly. For example, if spark.local.dir is 
/data1/sparkenv/local,/data2/sparkenv/local and the disk data1 is bad while 
the disk data2 is good, we should choose data2 rather than data1.
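
A minimal sketch of such a selection, assuming we probe each configured 
directory for writability (hypothetical helper, not the actual patch):

    import java.io.File
    import java.util.UUID
    import org.apache.spark.SparkConf

    // Return the first configured directory in which a temp subdirectory
    // can actually be created.
    def getUsableLocalDir(conf: SparkConf): String = {
      val dirs = conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(",")
      dirs.find { dir =>
        try {
          val probe = new File(dir, "probe-" + UUID.randomUUID)
          val ok = probe.mkdirs()
          if (ok) probe.delete()
          ok
        } catch { case _: Exception => false }
      }.getOrElse(throw new java.io.IOException(
        "No usable directory found in spark.local.dir: " + dirs.mkString(",")))
    }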

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-2325

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1281.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1281


commit 08424ce408b5e1ee679d15e46ea5b08979511fae
Author: yantangzhai tyz0...@163.com
Date:   2014-07-02T06:55:39Z

[SPARK-2325] Utils.getLocalDir had better check the directory and choose a 
good one instead of choosing the first one directly




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2324] SparkContext should not exit dire...

2014-07-01 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/1274

[SPARK-2324] SparkContext should not exit directly when spark.local.dir is 
a list of multiple paths and one of them has error

The spark.local.dir is configured as a list of multiple paths, for example 
/data1/sparkenv/local,/data2/sparkenv/local. If the disk data2 of the driver 
node has an error, the application will exit, since DiskBlockManager exits 
directly in createLocalDirs. If the disk data2 of a worker node has an error, 
the executor will exit as well.
DiskBlockManager should not exit directly in createLocalDirs when only one 
of the spark.local.dir paths has an error. Since spark.local.dir has multiple 
paths, a problem with one of them should not affect the overall situation.
I think DiskBlockManager could ignore the bad directory in createLocalDirs.
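
A minimal sketch of that behavior, with hypothetical names (the merged fix 
may differ):

    import java.io.File
    import java.util.UUID
    import org.apache.spark.SparkConf

    // Skip directories that cannot be created; fail only when every
    // configured path is unusable.
    def createLocalDirs(conf: SparkConf): Array[File] = {
      val roots = conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(",")
      val created = roots.flatMap { root =>
        try {
          val dir = new File(root, "spark-local-" + UUID.randomUUID)
          if (dir.mkdirs() || dir.isDirectory) Some(dir) else None  // skip the bad disk
        } catch {
          case _: Exception => None  // ignore this path instead of exiting
        }
      }
      if (created.isEmpty) {
        throw new java.io.IOException(
          "Failed to create any local dir in " + roots.mkString(","))
      }
      created
    }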

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-2324

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1274.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1274


commit df086731952c669e12673fd673d829b9fdd790a2
Author: yantangzhai tyz0...@163.com
Date:   2014-07-01T10:39:46Z

[SPARK-2324] SparkContext should not exit directly when spark.local.dir is 
a list of multiple paths and one of them has error




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2324] SparkContext should not exit dire...

2014-07-01 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1274#issuecomment-47737851
  
Thanks @aarondav. I've modified some code. Please review again.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-06-27 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/1244

[SPARK-2290] Worker should directly use its own sparkHome instead of 
appDesc.sparkHome when LaunchExecutor

Worker should directly use its own sparkHome instead of appDesc.sparkHome 
when handling LaunchExecutor.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1244.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1244


commit 05c3a789a00996a5502b78711b44d80e8812fdbb
Author: hakeemzhai hakeemzhai@hakeemzhai.(none)
Date:   2014-06-27T07:42:18Z

[SPARK-2290] Worker should directly use its own sparkHome instead of 
appDesc.sparkHome when LaunchExecutor




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---