[GitHub] spark pull request: [SPARK-4962] [CORE] Put TaskScheduler.start ba...
Github user YanTangZhai closed the pull request at: https://github.com/apache/spark/pull/3810
[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/3794#issuecomment-71405599

@JoshRosen I don't think just calling rdd.partitions on the final RDD would achieve our goal. Furthermore, rdd.partitions is already called there:

470 // Check to make sure we are not launching a task on a partition that does not exist.
471 val maxPartitions = rdd.partitions.length

However, it does not work for some cases, like the example I contrived. To avoid the thread-safety issue, do you think we could use another method to get parent stages that does not mutate any global map, or could we just use another method, like the getParentPartitions I committed before, to get the partitions directly?
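A minimal, self-contained sketch of the side-effect-free traversal being discussed, under the assumption that walking rdd.dependencies and touching rdd.partitions is all such a helper needs to do; the object and method names are illustrative, not the PR's actual code:

```scala
import org.apache.spark.rdd.RDD
import scala.collection.mutable

object EagerPartitions {
  // Force partition computation for an RDD and all of its ancestors without
  // touching any scheduler state such as shuffleToMapStage.
  def getParentPartitions(rdd: RDD[_]): Unit = {
    val visited = mutable.Set[RDD[_]]()
    def visit(r: RDD[_]): Unit = {
      if (visited.add(r)) {
        r.partitions                       // computes and caches getPartitions
        r.dependencies.foreach(d => visit(d.rdd))
      }
    }
    visit(rdd)
  }
}
```

Calling such a helper on the final RDD before sending JobSubmitted would move the slow HadoopRDD.getPartitions work onto the submitting thread while leaving the scheduler's maps untouched.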
[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/3794#issuecomment-71308409 @JoshRosen I've brought this up to date with master. Thanks.
[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/3794#issuecomment-70628086 @JoshRosen Thanks. I've updated it per your comments. Please review again. However, there are merge conflicts; I will resolve them if this approach is accepted.
[GitHub] spark pull request: [SPARK-5316] [CORE] DAGScheduler may make shuf...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/4105

[SPARK-5316] [CORE] DAGScheduler may make shuffleToMapStage leak if getParentStages fails

DAGScheduler may leak entries in shuffleToMapStage if getParentStages fails. If getParentStages throws an exception, for example because an input path does not exist, DAGScheduler fails to handle the job submission, but records may already have been put into shuffleToMapStage during getParentStages. Those records are never cleaned up. A simple job as follows:

```
val inputFile1 = ... // Input path does not exist when this job is submitted
val inputFile2 = ...
val outputFile = ...
val conf = new SparkConf()
val sc = new SparkContext(conf)
val rdd1 = sc.textFile(inputFile1)
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 1)
val rdd2 = sc.textFile(inputFile2)
  .flatMap(line => line.split(","))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 1)
try {
  val rdd3 = new PairRDDFunctions(rdd1).join(rdd2, 1)
  rdd3.saveAsTextFile(outputFile)
} catch {
  case e: Exception => logError(e.getMessage)
}
// print the information of DAGScheduler's shuffleToMapStage to check
// whether it still has uncleaned records.
...
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-5316

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4105.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4105

commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-06T13:07:08Z Merge pull request #1 from apache/master update
commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-20T13:14:08Z Merge pull request #3 from apache/master Update
commit 8a0010691b669495b4c327cf83124cabb7da1405 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-12T06:54:58Z Merge pull request #6 from apache/master Update
commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-16T12:03:22Z Merge pull request #7 from apache/master Update
commit 76d40277d51f709247df1d3734093bf2c047737d Author: YanTangZhai hakeemz...@tencent.com Date: 2014-10-20T12:52:22Z Merge pull request #8 from apache/master update
commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-04T09:00:31Z Merge pull request #9 from apache/master Update
commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-11T03:18:24Z Merge pull request #10 from apache/master Update
commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-01T11:23:56Z Merge pull request #11 from apache/master Update
commit 718afebe364bd54ac33be425e24183eb1c76b5d3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-05T11:08:31Z Merge pull request #12 from apache/master update
commit e4c2c0a18bdc78cc17823cbc2adf3926944e6bc5 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-24T03:15:22Z Merge pull request #15 from apache/master update
commit d4bca32bf4b06d3694a5de3cf5b69bac606dda39 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-31T03:50:26Z Merge pull request #19 from apache/master Update
commit 5041b3574dc89cd1e8a8d46590d2aba4c050de92 Author: YanTangZhai hakeemz...@tencent.com Date: 2015-01-12T12:33:20Z Merge pull request #24 from apache/master update
commit e2880f919dd54b43e0c53657a0f2d02880f47aa3 Author: YanTangZhai hakeemz...@tencent.com Date: 2015-01-19T09:14:27Z Merge pull request #27 from apache/master Update
commit 50291ca23192b3f05f572a60f68fcae0b66d5ffd Author: yantangzhai tyz0...@163.com Date: 2015-01-19T11:12:16Z [SPARK-5316] [CORE] DAGScheduler may make shuffleToMapStage leak if getParentStages fails
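A small sketch of the cleanup this PR is after, under the assumption that it suffices to track which shuffle IDs were registered while building parent stages and to unregister them when stage creation fails; the names below are illustrative stand-ins, not DAGScheduler's actual fields:

```scala
import scala.collection.mutable

object ShuffleRegistrationDemo {
  // Simplified stand-in for DAGScheduler's shuffleId -> map-stage bookkeeping.
  val shuffleToMapStage = mutable.Map[Int, String]()

  // Register shuffles while building parent stages, rolling back on failure so
  // a failed getParentStages leaves no orphaned entries behind.
  def registerParentStages(shuffleIds: Seq[Int])(register: Int => String): Unit = {
    val added = mutable.Buffer[Int]()
    try {
      for (id <- shuffleIds) {
        shuffleToMapStage(id) = register(id)  // may throw, e.g. missing input path
        added += id
      }
    } catch {
      case e: Exception =>
        added.foreach(shuffleToMapStage.remove)  // clean up partial registration
        throw e
    }
  }
}
```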
[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/3794#issuecomment-70481411 @JoshRosen Thanks for your comments. I've updated it. I now directly use getParentStages, which calls the RDDs' getPartitions, before sending the JobSubmitted event. Is that OK?
[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/3794#issuecomment-69916653 @JoshRosen I've updated it. Please review again. Thanks.
[GitHub] spark pull request: [SPARK-4962] [CORE] Put TaskScheduler.start ba...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/3810#issuecomment-69716974

@srowen I've updated this PR and resolved the conflict. Please review again. Thanks. Let me explain three points:

1. "I am not sure the description makes a case that it's significant enough to bother..." Let me give two examples. (1) I started ./bin/spark-sql from the command line in yarn-client mode with these resource requests: spark.executor.instances 100, spark.executor.memory 4g, spark.executor.cores 1. However, I didn't enter a SQL query string immediately, because I was interrupted, for example called away to an important meeting or to fight a fire in our cluster. Sometimes I even forgot to enter the query at all. The application then ran all night holding 100 * 4g * 12h of memory and 100 * 1 * 12h of cores, and did nothing. (2) A SparkContext with spark.executor.instances 100, spark.executor.memory 4g, spark.executor.cores 1 was initialized, and HadoopRDD scanned 11596 files, taking 29.253s to compute splits, before DAGScheduler submitted the job. During that time, 100 * 4g * 29s of memory and 100 * 1 * 29s of cores sat idle.

2. "There are several new API methods and changes here." SparkContext first gets the applicationId from taskScheduler and uses it to initialize blockManager and eventLogger. Then dagScheduler runs the job and submits resource requests to the cluster master. Getting the applicationId and submitting resource requests to the cluster master are split into two methods.

3. "My overall impression is that this adds different code paths and behaviors in different modes for little gain." I'm sorry that I couldn't find Mesos APIs that would let me split getting the applicationId and submitting resource requests to the cluster master into two methods. Thus the slow start of an application is currently supported only in YARN mode.
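A self-contained toy model of the two-phase start described in point 2; the method names mirror this PR's createApplication/submitApplication split, but the class bodies are illustrative, not Spark's actual code:

```scala
// Phase 1 cheaply yields an application ID for components that need it
// (BlockManager, event log); phase 2, run just before the first job,
// actually asks the cluster manager for executors.
trait SchedulerBackend {
  def createApplication(): String
  def submitApplication(): Unit
}

class YarnClientBackend extends SchedulerBackend {
  private var appId: Option[String] = None
  def createApplication(): String = {
    appId = Some("application_1421000000000_0001")  // would come from YARN's RM
    appId.get
  }
  def submitApplication(): Unit = {
    require(appId.isDefined, "createApplication() must be called first")
    println(s"requesting executors for ${appId.get}")  // resources held from here
  }
}

object SlowStartDemo extends App {
  val backend = new YarnClientBackend
  val id = backend.createApplication()  // SparkContext init: ID only, no executors
  // ... driver-side work such as computing partitions happens here ...
  backend.submitApplication()           // cluster resources occupied only from now
}
```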
[GitHub] spark pull request: [SPARK-5163] [CORE] Load properties from confi...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/3963#issuecomment-69523350 @pwendell OK. Thank you very much. I'll close this PR.
[GitHub] spark pull request: [SPARK-5163] [CORE] Load properties from confi...
Github user YanTangZhai closed the pull request at: https://github.com/apache/spark/pull/3963
[GitHub] spark pull request: [SPARK-4962] [CORE] Put TaskScheduler.start ba...
Github user YanTangZhai commented on a diff in the pull request: https://github.com/apache/spark/pull/3810#discussion_r22776305

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -55,13 +57,9 @@ private[spark] class Client(
   /**
-   * Submit an application running our ApplicationMaster to the ResourceManager.
-   *
-   * The stable Yarn API provides a convenience method (YarnClient#createApplication) for
-   * creating applications and setting up the application submission context. This was not
-   * available in the alpha API.
+   * Create an application running our ApplicationMaster to the ResourceManager.
    */
-  override def submitApplication(): ApplicationId = {
+  override def createApplication(): ApplicationId = {
--- End diff --

SparkContext first gets the applicationId from taskScheduler and uses it to initialize blockManager and eventLogger. Then dagScheduler runs the job and submits resource requests to the cluster master. Getting the applicationId and submitting resource requests to the cluster master are split into two methods.
[GitHub] spark pull request: [SPARK-4962] [CORE] Put TaskScheduler.start ba...
Github user YanTangZhai commented on a diff in the pull request: https://github.com/apache/spark/pull/3810#discussion_r22776416

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -333,9 +333,15 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
       new SparkException("DAGScheduler cannot be initialized due to %s".format(e.getMessage))
   }
-  // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
-  // constructor
-  taskScheduler.start()
+  if (conf.getBoolean("spark.scheduler.app.slowstart", false) && master == "yarn-client") {
--- End diff --

I'm sorry that I couldn't find Mesos APIs that would let me split getting the applicationId and submitting resource requests to the cluster master into two methods.
[GitHub] spark pull request: [SPARK-5163] [CORE] Load properties from confi...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/3963

[SPARK-5163] [CORE] Load properties from a configuration file such as spark-defaults.conf when creating a SparkConf object

I create and run a Spark program which does not use SparkSubmit. When I create a SparkConf object with `new SparkConf()`, it does not automatically load properties from a configuration file such as spark-defaults.conf.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-5163

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3963.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3963

commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-06T13:07:08Z Merge pull request #1 from apache/master update
commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-20T13:14:08Z Merge pull request #3 from apache/master Update
commit 8a0010691b669495b4c327cf83124cabb7da1405 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-12T06:54:58Z Merge pull request #6 from apache/master Update
commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-16T12:03:22Z Merge pull request #7 from apache/master Update
commit 76d40277d51f709247df1d3734093bf2c047737d Author: YanTangZhai hakeemz...@tencent.com Date: 2014-10-20T12:52:22Z Merge pull request #8 from apache/master update
commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-04T09:00:31Z Merge pull request #9 from apache/master Update
commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-11T03:18:24Z Merge pull request #10 from apache/master Update
commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-01T11:23:56Z Merge pull request #11 from apache/master Update
commit 718afebe364bd54ac33be425e24183eb1c76b5d3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-05T11:08:31Z Merge pull request #12 from apache/master update
commit e4c2c0a18bdc78cc17823cbc2adf3926944e6bc5 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-24T03:15:22Z Merge pull request #15 from apache/master update
commit d4bca32bf4b06d3694a5de3cf5b69bac606dda39 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-31T03:50:26Z Merge pull request #19 from apache/master Update
commit ac9579ca434f559bf173ad219bd04b48a7db226f Author: yantangzhai tyz0...@163.com Date: 2015-01-09T03:17:51Z [SPARK-5163] [CORE] Load properties from configuration file for example spark-defaults.conf when creating SparkConf object
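A minimal sketch of the requested behavior, assuming spark-defaults.conf's usual whitespace-separated key/value format; DefaultsLoader and the merging logic below are illustrative, not the PR's actual code:

```scala
import java.io.{File, FileInputStream, InputStreamReader}
import java.util.Properties
import scala.collection.JavaConverters._

// Hypothetical helper: read spark-defaults.conf style properties so that a
// SparkConf built without SparkSubmit can still pick up file-based defaults.
object DefaultsLoader {
  def load(path: String): Map[String, String] = {
    val props = new Properties()
    val reader = new InputStreamReader(new FileInputStream(new File(path)), "UTF-8")
    try props.load(reader) finally reader.close()
    props.stringPropertyNames().asScala
      .filter(_.startsWith("spark."))              // only Spark settings apply
      .map(k => k -> props.getProperty(k).trim)
      .toMap
  }
}

// Usage sketch: file values fill in only keys not already set explicitly.
// val conf = new SparkConf()
// DefaultsLoader.load(sys.env("SPARK_HOME") + "/conf/spark-defaults.conf")
//   .foreach { case (k, v) => if (!conf.contains(k)) conf.set(k, v) }
```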
[GitHub] spark pull request: [SPARK-5007] [CORE] Try random port when start...
Github user YanTangZhai closed the pull request at: https://github.com/apache/spark/pull/3845
[GitHub] spark pull request: [SPARK-5007] [CORE] Try random port when start...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/3845#issuecomment-69282504 @andrewor14 @rxin Oh, I see. Thank you very much.
[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...
Github user YanTangZhai commented on a diff in the pull request: https://github.com/apache/spark/pull/3794#discussion_r22376680

--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -178,7 +178,7 @@ abstract class RDD[T: ClassTag](
   // Our dependencies and partitions will be gotten by calling subclass's methods below, and will
   // be overwritten when we're checkpointed
   private var dependencies_ : Seq[Dependency[_]] = null
-  @transient private var partitions_ : Array[Partition] = null
+  @transient private var partitions_ : Array[Partition] = getPartitions
--- End diff --

Sorry. This approach may cause an error as follows:

Exception in thread "main" java.lang.NullPointerException
  at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:191)
  at com.google.common.collect.MapMakerInternalMap.put(MapMakerInternalMap.java:3499)
  at org.apache.spark.rdd.HadoopRDD$.putCachedMetadata(HadoopRDD.scala:273)
  at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:151)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:173)
  at org.apache.spark.rdd.RDD.<init>(RDD.scala:181)
  at org.apache.spark.rdd.HadoopRDD.<init>(HadoopRDD.scala:97)
  at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:561)
  at org.apache.spark.SparkContext.textFile(SparkContext.scala:471)

since jobConfCacheKey has not been initialized at that time.
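A self-contained illustration (not Spark code) of why that NPE happens: a base-class field initializer invokes an overridden method before the subclass's own fields, like jobConfCacheKey, have been assigned:

```scala
// Minimal model of the failure mode above: BaseRDD's field initializer calls
// getPartitions before SubRDD's cacheKey has been assigned.
abstract class BaseRDD {
  def getPartitions: Array[Int]
  // Runs inside BaseRDD's constructor, i.e. before SubRDD's body executes.
  private val partitions_ : Array[Int] = getPartitions
  def partitions: Array[Int] = partitions_
}

class SubRDD extends BaseRDD {
  val cacheKey: String = "jobConf-cache-key"  // still null when getPartitions runs
  def getPartitions: Array[Int] = {
    require(cacheKey != null, "cacheKey not initialized yet")  // fails here
    Array(0, 1, 2)
  }
}

object InitOrderDemo extends App {
  new SubRDD  // throws IllegalArgumentException: cacheKey not initialized yet
}
```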
[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/3794#issuecomment-68438167

@JoshRosen Thanks for your comments. I've updated it according to your comments and contrived a simple example as follows:

```scala
val inputfile1 = "./testin/in_1.txt"
val inputfile2 = "./testin/in_2.txt"
val tempfile = "./testtmp"
val outputfile = "./testout"
val sc = new SparkContext(new SparkConf())
sc.textFile(inputfile1)
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 1)
  .map { kv => kv._1 + "," + kv._2.toString }
  .saveAsTextFile(tempfile)
val wordCounts1 = sc.textFile(tempfile)
val wordCounts2 = sc.textFile(inputfile2)
val wordCounts = wordCounts1.union(wordCounts2)
wordCounts.map { line =>
    val kv = line.split(",")
    (kv(0), Integer.parseInt(kv(1)))
  }
  .reduceByKey(_ + _, 1)
  .map { kv => kv._1 + "," + kv._2.toString }
  .saveAsTextFile(outputfile)
```

./testin/in_1.txt (23 bytes) and ./testin/in_2.txt (19 bytes) are both local files.

Before optimization:
- job1: new stage creation took 0.729638 s, of which HadoopRDD.getPartitions took 0.710247 s.
- job2: new stage creation took 0.882241 s, of which HadoopRDD.getPartitions took 0.850668 + 0.023490 s.

After optimization:
- job1: HadoopRDD.getPartitions took 0.802133 s; new stage creation took 0.029328 s.
- job2: HadoopRDD.getPartitions took 0.464713 + 0.022568 s; new stage creation took 0.001773 s.
[GitHub] spark pull request: [SPARK-5007] [CORE] Try random port when start...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/3845

[SPARK-5007] [CORE] Try a random port in startServiceOnPort to reduce the chance of port collision

When multiple Spark programs are submitted from the same node (a so-called springboard machine), the SparkUI ports (default 4040) range from 4040 to 4056, and the Spark programs submitted later fail because of SparkUI port collisions. The chance of collision can be reduced by setting spark.ui.port or spark.port.maxRetries, but I think it is better for startServiceOnPort to try a random port, which reduces the chance of collision.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-5007

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3845.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3845

commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-06T13:07:08Z Merge pull request #1 from apache/master update
commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-20T13:14:08Z Merge pull request #3 from apache/master Update
commit 8a0010691b669495b4c327cf83124cabb7da1405 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-12T06:54:58Z Merge pull request #6 from apache/master Update
commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-16T12:03:22Z Merge pull request #7 from apache/master Update
commit 76d40277d51f709247df1d3734093bf2c047737d Author: YanTangZhai hakeemz...@tencent.com Date: 2014-10-20T12:52:22Z Merge pull request #8 from apache/master update
commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-04T09:00:31Z Merge pull request #9 from apache/master Update
commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-11T03:18:24Z Merge pull request #10 from apache/master Update
commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-01T11:23:56Z Merge pull request #11 from apache/master Update
commit 718afebe364bd54ac33be425e24183eb1c76b5d3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-05T11:08:31Z Merge pull request #12 from apache/master update
commit e4c2c0a18bdc78cc17823cbc2adf3926944e6bc5 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-24T03:15:22Z Merge pull request #15 from apache/master update
commit 2fb4f4450230fee09ff8932eb107f09ef72f2402 Author: yantangzhai tyz0...@163.com Date: 2014-12-30T13:41:59Z [SPARK-5007] [CORE] Try random port when startServiceOnPort to reduce the chance of port collision
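A self-contained sketch of the random-retry idea (the real startServiceOnPort has a different signature and service-specific bind logic; the port range and retry count below are illustrative):

```scala
import java.net.{BindException, ServerSocket}
import scala.util.Random

object RandomPortDemo {
  // Try startPort first; on collision, retry on random ports in the ephemeral
  // range instead of probing startPort + 1, startPort + 2, ... sequentially,
  // which spreads concurrent drivers apart and lowers collision odds.
  def startOnPort(startPort: Int, maxRetries: Int = 16): (ServerSocket, Int) = {
    var attempt = 0
    while (true) {
      val candidate =
        if (attempt == 0) startPort
        else 1024 + Random.nextInt(65536 - 1024)  // random, not sequential
      try {
        return (new ServerSocket(candidate), candidate)
      } catch {
        case _: BindException if attempt < maxRetries => attempt += 1
      }
    }
    sys.error("unreachable")
  }

  def main(args: Array[String]): Unit = {
    val (socket, port) = startOnPort(4040)
    println(s"SparkUI-style service bound on port $port")
    socket.close()
  }
}
```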
[GitHub] spark pull request: [SPARK-4692] [SQL] Support ! boolean logic ope...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/3555#issuecomment-68425639 @marmbrus I've updated it. Please review again.
[GitHub] spark pull request: [SPARK-4962] [CORE] Put TaskScheduler.start ba...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/3810

[SPARK-4962] [CORE] Put TaskScheduler.start back in SparkContext to shorten the cluster resource occupation period

When a SparkContext object is instantiated, TaskScheduler is started and some resources are allocated from the cluster, even though these resources may not be used for a while, for example while DAGScheduler.JobSubmitted is still being processed. These resources are wasted during that period. Thus we want to defer TaskScheduler.start to shorten the cluster resource occupation period, especially on a busy cluster. TaskScheduler could be started just before running stages. We can analyse and compare the resource occupation period before and after the optimization:

TaskScheduler.start execution time: [time1__]
DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or TaskScheduler.start) execution time: [time2_]
HadoopRDD.getPartitions execution time: [time3___]
Stages execution time: [time4_]

The cluster resource occupation period before the optimization is [time2_][time3___][time4_]. The cluster resource occupation period after the optimization is [time3___][time4_]. In summary, the cluster resource occupation period after the optimization is shorter than before. If HadoopRDD.getPartitions could also be moved forward (SPARK-4961), the period might shrink further, to [time4_]. The resource saving is important for a busy cluster.

The main purpose of this PR is to decrease resource waste on a busy cluster. For example, a process initializes a SparkContext instance, reads a few files from HDFS or many records from PostgreSQL, and then calls an RDD's collect operation to submit a job. When SparkContext is initialized, an app is submitted to the cluster and some resources are held by this app, but they are not really used until a job is submitted by an RDD action. The resources held in the period from initialization to actual use can be considered wasted. If the app is submitted when SparkContext is initialized, all of the resources it needs may be granted before the job runs, so the job runs efficiently without resource constraints. On the contrary, if the app is submitted only when the job is submitted, the resources it needs may be granted at different times, and the job may run less efficiently while some resource requests are still pending. Thus I use a configuration parameter, spark.scheduler.app.slowstart (default false), to let users make this tradeoff between economy and efficiency.

There are 9 kinds of master URL and 6 kinds of SchedulerBackend. LocalBackend and SimrSchedulerBackend don't need the deferred start since it makes no difference for them. SparkClusterSchedulerBackend (yarn-standalone or yarn-cluster) does not defer the start since the app has already been submitted by SparkSubmit. CoarseMesosSchedulerBackend and MesosSchedulerBackend could defer the start. YarnClientSchedulerBackend (yarn-client) could defer the start. For now, this PR defers TaskScheduler.start only in yarn-client mode.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-4962

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3810.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3810

commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-06T13:07:08Z Merge pull request #1 from apache/master update
commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-20T13:14:08Z Merge pull request #3 from apache/master Update
commit 8a0010691b669495b4c327cf83124cabb7da1405 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-12T06:54:58Z Merge pull request #6 from apache/master Update
commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-16T12:03:22Z Merge pull request #7 from apache/master Update
commit 76d40277d51f709247df1d3734093bf2c047737d Author: YanTangZhai hakeemz...@tencent.com Date: 2014-10-20T12:52:22Z Merge pull request #8 from apache/master update
commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-04T09:00:31Z Merge pull request #9 from apache/master Update
commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-11T03:18:24Z Merge pull request #10 from apache/master Update
commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12
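A hedged usage sketch of the flag this PR adds (yarn-client mode only; whether SparkContext exposes the deferral exactly this way is an assumption on top of this PR's build):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SlowStartUsage extends App {
  // spark.scheduler.app.slowstart is the flag introduced by this PR
  // (default false); in yarn-client mode it defers executor requests.
  val conf = new SparkConf()
    .setAppName("slow-start-demo")
    .setMaster("yarn-client")
    .set("spark.scheduler.app.slowstart", "true")
  val sc = new SparkContext(conf)  // gets an appId but holds no executors yet
  // Cluster resources are requested only once the first job is submitted:
  println(sc.textFile("hdfs:///tmp/input").count())
  sc.stop()
}
```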
[GitHub] spark pull request: [SPARK-4723] [CORE] To abort the stages which ...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/3786#issuecomment-68082514 @markhamstra Thanks for your comment. I will analyse more deeply why a stage attempts so many times.
[GitHub] spark pull request: [SPARK-4723] [CORE] To abort the stages which ...
Github user YanTangZhai closed the pull request at: https://github.com/apache/spark/pull/3786
[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/3794

[SPARK-4961] [CORE] Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time

HadoopRDD.getPartitions is computed lazily, inside DAGScheduler's JobSubmitted processing. If the input directory is large, getPartitions may take a long time; in our cluster it needs anywhere from 0.029s to 766.699s. While one JobSubmitted event is being processed, the others must wait. Thus we want to move HadoopRDD.getPartitions forward, out of JobSubmitted processing, so that other JobSubmitted events don't have to wait so long. A HadoopRDD object could compute its partitions when it is instantiated. We can analyse and compare the execution time before and after the optimization:

TaskScheduler.start execution time: [time1__]
DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or TaskScheduler.start) execution time: [time2_]
HadoopRDD.getPartitions execution time: [time3___]
Stages execution time: [time4_]

(1) The app has only one job. The execution time of the job before the optimization is [time1__][time2_][time3___][time4_]. The execution time of the job after the optimization is [time1__][time3___][time2_][time4_]. In summary, if the app has only one job, the total execution time is the same before and after the optimization.

(2) The app has 4 jobs. Before the optimization, job1 execution time is [time2_][time3___][time4_], job2 execution time is [time2__][time3___][time4_], job3 execution time is [time2][time3___][time4_], and job4 execution time is [time2_][time3___][time4_]. After the optimization, job1 execution time is [time3___][time2_][time4_], job2 execution time is [time3___][time2__][time4_], job3 execution time is [time3___][time2_][time4_], and job4 execution time is [time3___][time2__][time4_]. In summary, if the app has multiple jobs, the average execution time after the optimization is less than before.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-4961

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3794.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3794

commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-06T13:07:08Z Merge pull request #1 from apache/master update
commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-20T13:14:08Z Merge pull request #3 from apache/master Update
commit 8a0010691b669495b4c327cf83124cabb7da1405 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-12T06:54:58Z Merge pull request #6 from apache/master Update
commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-16T12:03:22Z Merge pull request #7 from apache/master Update
commit 76d40277d51f709247df1d3734093bf2c047737d Author: YanTangZhai hakeemz...@tencent.com Date: 2014-10-20T12:52:22Z Merge pull request #8 from apache/master update
commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-04T09:00:31Z Merge pull request #9 from apache/master Update
commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-11T03:18:24Z Merge pull request #10 from apache/master Update
commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-01T11:23:56Z Merge pull request #11 from apache/master Update
commit 718afebe364bd54ac33be425e24183eb1c76b5d3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-05T11:08:31Z Merge pull request #12 from apache/master update
commit e4c2c0a18bdc78cc17823cbc2adf3926944e6bc5 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-24T03:15:22Z Merge pull request #15 from apache/master update
commit 5601a8b1458c9a7317a2e4e0463358f0a054c181 Author: yantangzhai tyz0...@163.com Date: 2014-12-25T03:17:57Z [SPARK-4961] [CORE] Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time
[GitHub] spark pull request: [SPARK-3545] Put HadoopRDD.getPartitions forwa...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/2409#issuecomment-68021964 @JoshRosen Thanks. I will divide this JIRA/PR into two JIRAs/PRs.
[GitHub] spark pull request: [SPARK-4946] [CORE] Using AkkaUtils.askWithRep...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/3785

[SPARK-4946] [CORE] Use AkkaUtils.askWithReply in MapOutputTracker.askTracker to reduce the chance of communication problems

Using AkkaUtils.askWithReply in MapOutputTracker.askTracker reduces the chance of a communication problem causing a failure.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-4946

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3785.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3785

commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-06T13:07:08Z Merge pull request #1 from apache/master update
commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-20T13:14:08Z Merge pull request #3 from apache/master Update
commit 8a0010691b669495b4c327cf83124cabb7da1405 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-12T06:54:58Z Merge pull request #6 from apache/master Update
commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-16T12:03:22Z Merge pull request #7 from apache/master Update
commit 76d40277d51f709247df1d3734093bf2c047737d Author: YanTangZhai hakeemz...@tencent.com Date: 2014-10-20T12:52:22Z Merge pull request #8 from apache/master update
commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-04T09:00:31Z Merge pull request #9 from apache/master Update
commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-11T03:18:24Z Merge pull request #10 from apache/master Update
commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-01T11:23:56Z Merge pull request #11 from apache/master Update
commit 718afebe364bd54ac33be425e24183eb1c76b5d3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-05T11:08:31Z Merge pull request #12 from apache/master update
commit e4c2c0a18bdc78cc17823cbc2adf3926944e6bc5 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-24T03:15:22Z Merge pull request #15 from apache/master update
commit 9ca65418c4d859b7ded77697e81d09f33a43b9a4 Author: yantangzhai tyz0...@163.com Date: 2014-12-24T06:17:32Z [SPARK-4946] [CORE] Using AkkaUtils.askWithReply in MapOutputTracker.askTracker to reduce the chance of the communicating problem
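A generic, self-contained model of the retry-on-ask behavior being adopted (this is not AkkaUtils' actual signature; the attempt count and interval are illustrative):

```scala
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

object AskWithRetry {
  // Retry a request/reply call a few times before giving up, instead of a
  // single ask whose one lost message fails the whole lookup.
  def askWithRetry[T](attempts: Int, interval: FiniteDuration)(ask: => T): T = {
    var lastError: Throwable = null
    for (_ <- 1 to attempts) {
      Try(ask) match {
        case Success(v) => return v
        case Failure(e) =>
          lastError = e
          Thread.sleep(interval.toMillis)  // back off before the next attempt
      }
    }
    throw new RuntimeException(s"ask failed after $attempts attempts", lastError)
  }
}

// Usage sketch: AskWithRetry.askWithRetry(3, 1.second) { tracker.askTracker(msg) }
```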
[GitHub] spark pull request: [SPARK-4723] [CORE] To abort the stages which ...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/3786

[SPARK-4723] [CORE] To abort the stages which have attempted too many times

For some reason, a stage may be attempted many times. A threshold could be added so that stages which have been attempted more times than the threshold are aborted.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-4723

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3786.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3786

commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-06T13:07:08Z Merge pull request #1 from apache/master update
commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-20T13:14:08Z Merge pull request #3 from apache/master Update
commit 8a0010691b669495b4c327cf83124cabb7da1405 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-12T06:54:58Z Merge pull request #6 from apache/master Update
commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-16T12:03:22Z Merge pull request #7 from apache/master Update
commit 76d40277d51f709247df1d3734093bf2c047737d Author: YanTangZhai hakeemz...@tencent.com Date: 2014-10-20T12:52:22Z Merge pull request #8 from apache/master update
commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-04T09:00:31Z Merge pull request #9 from apache/master Update
commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-11T03:18:24Z Merge pull request #10 from apache/master Update
commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-01T11:23:56Z Merge pull request #11 from apache/master Update
commit 718afebe364bd54ac33be425e24183eb1c76b5d3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-05T11:08:31Z Merge pull request #12 from apache/master update
commit e4c2c0a18bdc78cc17823cbc2adf3926944e6bc5 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-24T03:15:22Z Merge pull request #15 from apache/master update
commit 003774ab2dea5c0f6fd70e68c385178cc235d1c2 Author: yantangzhai tyz0...@163.com Date: 2014-12-24T06:54:17Z [SPARK-4723] [CORE] To abort the stages which have attempted some times
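A toy model of the proposed guard (the threshold and class names are illustrative; the real change would live in DAGScheduler's stage-resubmission path):

```scala
import scala.collection.mutable

// Track attempts per stage and abort once a threshold is exceeded.
class AttemptTracker(maxAttempts: Int) {
  private val attempts = mutable.Map[Int, Int]().withDefaultValue(0)

  /** Returns true if the stage may be resubmitted, false if it must abort. */
  def recordAttempt(stageId: Int): Boolean = {
    attempts(stageId) += 1
    attempts(stageId) <= maxAttempts
  }
}

object AbortDemo extends App {
  val tracker = new AttemptTracker(maxAttempts = 4)
  val stageId = 7
  var retrying = true
  while (retrying) {
    retrying = tracker.recordAttempt(stageId)
    if (!retrying) println(s"aborting stage $stageId: exceeded 4 attempts")
  }
}
```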
[GitHub] spark pull request: [SPARK-4692] [SQL] Support ! boolean logic ope...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/3555#issuecomment-67816709 @liancheng I will revert the last space change. Thanks for your comment.
[GitHub] spark pull request: [SPARK-4693] [SQL] PruningPredicates may be wr...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/3556#issuecomment-67472596 @marmbrus Please review again. Thanks.
[GitHub] spark pull request: [SPARK-4692] [SQL] Support ! boolean logic ope...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/3555#issuecomment-67473028 @marmbrus Please review again. Thanks.
[GitHub] spark pull request: [SPARK-4693] [SQL] PruningPredicates may be wr...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/3556#issuecomment-67437985 @marmbrus Thank you for your comments. I will do it right away.
[GitHub] spark pull request: [WIP] [SPARK-4273] [SQL] Providing ExternalSet...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/3137#issuecomment-67452153 @marmbrus Thanks. I'm also trying another approach to optimize this operation. I want to discuss it with you later.
[GitHub] spark pull request: [SPARK-4692] [SQL] Support ! boolean logic ope...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/3555

[SPARK-4692] [SQL] Support the ! boolean logic operator like NOT

Support the ! boolean logic operator, like NOT, in SQL, as follows:

select * from for_test where !(col1 col2)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-4692

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3555.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3555

commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-06T13:07:08Z Merge pull request #1 from apache/master update
commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-20T13:14:08Z Merge pull request #3 from apache/master Update
commit 8a0010691b669495b4c327cf83124cabb7da1405 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-12T06:54:58Z Merge pull request #6 from apache/master Update
commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-16T12:03:22Z Merge pull request #7 from apache/master Update
commit 76d40277d51f709247df1d3734093bf2c047737d Author: YanTangZhai hakeemz...@tencent.com Date: 2014-10-20T12:52:22Z Merge pull request #8 from apache/master update
commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-04T09:00:31Z Merge pull request #9 from apache/master Update
commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-11T03:18:24Z Merge pull request #10 from apache/master Update
commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-01T11:23:56Z Merge pull request #11 from apache/master Update
commit 92242c7c07d7d9f5aea2111b548a3355f3633a7d Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-02T10:57:59Z Update HiveQl.scala
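A minimal sketch of the change's shape (Spark's actual HiveQl translation builds catalyst expressions; the AST types here are stand-ins): a leading ! is treated exactly like NOT when building the expression tree:

```scala
// Stand-in expression AST, not Spark's catalyst classes.
sealed trait Expr
case class Col(name: String) extends Expr
case class Gt(left: Expr, right: Expr) extends Expr
case class Not(child: Expr) extends Expr

object UnaryOps {
  // "!" is translated the same way as NOT.
  def parseUnary(op: String, child: Expr): Expr = op match {
    case "NOT" | "!" => Not(child)
    case other       => sys.error(s"unsupported unary operator: $other")
  }
}

// UnaryOps.parseUnary("!", Gt(Col("col1"), Col("col2")))
//   == Not(Gt(Col("col1"), Col("col2")))
```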
[GitHub] spark pull request: [SPARK-4693] [SQL] PruningPredicates may be wr...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/3556

[SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contain an empty AttributeSet() references

The SQL "select * from spark_test::for_test where abs(20141202) is not null" has predicates=List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)) and partitionKeyIds=AttributeSet(). PruningPredicates is List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)). Then the exception "java.lang.IllegalArgumentException: requirement failed: Partition pruning predicates only supported for partitioned tables." is thrown.

The SQL "select * from spark_test::for_test_partitioned_table where abs(20141202) is not null and type_id=11 and platform = 3", with partition key insert_date, has predicates=List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202), (type_id#12 = 11), (platform#8 = 3)) and partitionKeyIds=AttributeSet(insert_date#24). PruningPredicates is List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-4693

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3556.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3556

commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-06T13:07:08Z Merge pull request #1 from apache/master update
commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-20T13:14:08Z Merge pull request #3 from apache/master Update
commit 8a0010691b669495b4c327cf83124cabb7da1405 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-12T06:54:58Z Merge pull request #6 from apache/master Update
commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-16T12:03:22Z Merge pull request #7 from apache/master Update
commit 76d40277d51f709247df1d3734093bf2c047737d Author: YanTangZhai hakeemz...@tencent.com Date: 2014-10-20T12:52:22Z Merge pull request #8 from apache/master update
commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-04T09:00:31Z Merge pull request #9 from apache/master Update
commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-11T03:18:24Z Merge pull request #10 from apache/master Update
commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-01T11:23:56Z Merge pull request #11 from apache/master Update
commit e572b9a754a71da1f5bdb53c283b936ab803def2 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-02T12:27:14Z Update HiveStrategies.scala
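A simplified model of the bug and its fix (references reduced to plain name sets; Spark's real check works on AttributeSet): an empty reference set is a subset of everything, so a constant predicate such as "abs(20141202) is not null" wrongly qualified as partition-pruning; requiring non-empty references fixes it:

```scala
object PruningCheck {
  // A predicate counts as partition-pruning only if it references at least
  // one column and every referenced column is a partition key.
  def isPruningPredicate(references: Set[String], partitionKeys: Set[String]): Boolean =
    references.nonEmpty && references.subsetOf(partitionKeys)
}

// PruningCheck.isPruningPredicate(Set.empty, Set.empty)                   == false (was true)
// PruningCheck.isPruningPredicate(Set("insert_date"), Set("insert_date")) == true
```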
[GitHub] spark pull request: [SPARK-4676] [SQL] JavaSchemaRDD.schema may th...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/3538 [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
val jsc = new org.apache.spark.api.java.JavaSparkContext(sc)
val jhc = new org.apache.spark.sql.hive.api.java.JavaHiveContext(jsc)
val nrdd = jhc.hql("select null from spark_test.for_test")
println(nrdd.schema)
Then the error is thrown as follows:
scala.MatchError: NullType (of class org.apache.spark.sql.catalyst.types.NullType$)
at org.apache.spark.sql.types.util.DataTypeConversions$.asJavaDataType(DataTypeConversions.scala:43)
You can merge this pull request into a Git repository by running: $ git pull https://github.com/YanTangZhai/spark MatchNullType Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3538.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3538
commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-06T13:07:08Z Merge pull request #1 from apache/master update
commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-20T13:14:08Z Merge pull request #3 from apache/master Update
commit 8a0010691b669495b4c327cf83124cabb7da1405 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-12T06:54:58Z Merge pull request #6 from apache/master Update
commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-16T12:03:22Z Merge pull request #7 from apache/master Update
commit 76d40277d51f709247df1d3734093bf2c047737d Author: YanTangZhai hakeemz...@tencent.com Date: 2014-10-20T12:52:22Z Merge pull request #8 from apache/master update
commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-04T09:00:31Z Merge pull request #9 from apache/master Update
commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-11T03:18:24Z Merge pull request #10 from apache/master Update
commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-01T11:23:56Z Merge pull request #11 from apache/master Update
commit 896c7b73f0ba1b2d3dccf6fed6410bf077eb3d54 Author: yantangzhai tyz0...@163.com Date: 2014-12-01T13:08:41Z fix NullType MatchError in JavaSchemaRDD when sql has null
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
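The shape of the fix, as a self-contained sketch with stand-in types (the real conversion is DataTypeConversions.asJavaDataType): the Scala-to-Java type conversion needs an explicit NullType case so a null-typed column no longer falls through to a MatchError.
```
sealed trait ScalaDataType
case object NullType extends ScalaDataType
case object IntegerType extends ScalaDataType

sealed trait JavaDataType
case object JNullType extends JavaDataType
case object JIntegerType extends JavaDataType

def asJavaDataType(t: ScalaDataType): JavaDataType = t match {
  case IntegerType => JIntegerType
  case NullType    => JNullType // without this case: scala.MatchError: NullType
}
```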
[GitHub] spark pull request: [SPARK-4677] [WEB] Add hadoop input time in ta...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/3539 [SPARK-4677] [WEB] Add hadoop input time in task webui Add hadoop input time in task webui like GC Time to explicitly show the time used by task to read input data. You can merge this pull request into a Git repository by running: $ git pull https://github.com/YanTangZhai/spark WebuiInputTime Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3539.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3539 commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-06T13:07:08Z Merge pull request #1 from apache/master update commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-20T13:14:08Z Merge pull request #3 from apache/master Update commit 8a0010691b669495b4c327cf83124cabb7da1405 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-12T06:54:58Z Merge pull request #6 from apache/master Update commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-16T12:03:22Z Merge pull request #7 from apache/master Update commit 76d40277d51f709247df1d3734093bf2c047737d Author: YanTangZhai hakeemz...@tencent.com Date: 2014-10-20T12:52:22Z Merge pull request #8 from apache/master update commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-04T09:00:31Z Merge pull request #9 from apache/master Update commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-11T03:18:24Z Merge pull request #10 from apache/master Update commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-01T11:23:56Z Merge pull request #11 from apache/master Update commit 3816f8540b947809cb821bcb3af36d7be0210d9c Author: yantangzhai tyz0...@163.com Date: 2014-12-01T14:09:24Z add hadoop input read time in webui --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4677] [WEB] Add hadoop input time in ta...
Github user YanTangZhai commented on a diff in the pull request: https://github.com/apache/spark/pull/3539#discussion_r21140476
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -238,10 +238,13 @@ class HadoopRDD[K, V](
       val value: V = reader.createValue()
       var recordsSinceMetricsUpdate = 0
+      var startTime : Long = 0L
       override def getNext() = {
         try {
+          startTime = System.nanoTime
           finished = !reader.next(key, value)
+          inputMetrics.readTime += (System.nanoTime - startTime)
--- End diff --
Oh sorry. It may be expensive. Let me think about it. Thanks.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
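One way to cut the per-record clock cost being flagged here, as a sketch rather than what the PR settled on: sample System.nanoTime on every N-th record and extrapolate.
```
// Sketch: time only every `sampleEvery`-th record and scale up, so the
// System.nanoTime overhead is paid on a fraction of records.
class SampledReadTimer(sampleEvery: Int = 100) {
  private var records = 0L
  private var sampledNanos = 0L

  def timeRead[A](read: => A): A = {
    records += 1
    if (records % sampleEvery == 0) {
      val start = System.nanoTime
      val out = read
      sampledNanos += System.nanoTime - start
      out
    } else read
  }

  // Extrapolated estimate of the total read time.
  def estimatedNanos: Long = sampledNanos * sampleEvery
}
```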
[GitHub] spark pull request: [SPARK-4677] [WEB] Add hadoop input time in ta...
Github user YanTangZhai closed the pull request at: https://github.com/apache/spark/pull/3539 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4401] [SQL] RuleExecutor should log tra...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/3265 [SPARK-4401] [SQL] RuleExecutor should log trace correct iteration num RuleExecutor should log trace correct iteration num You can merge this pull request into a Git repository by running: $ git pull https://github.com/YanTangZhai/spark SPARK-4401 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3265.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3265 commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-06T13:07:08Z Merge pull request #1 from apache/master update commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-20T13:14:08Z Merge pull request #3 from apache/master Update commit 8a0010691b669495b4c327cf83124cabb7da1405 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-12T06:54:58Z Merge pull request #6 from apache/master Update commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-16T12:03:22Z Merge pull request #7 from apache/master Update commit 76d40277d51f709247df1d3734093bf2c047737d Author: YanTangZhai hakeemz...@tencent.com Date: 2014-10-20T12:52:22Z Merge pull request #8 from apache/master update commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-04T09:00:31Z Merge pull request #9 from apache/master Update commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-11T03:18:24Z Merge pull request #10 from apache/master Update commit af326f76c46e2d019dc492fafaac7d3468e837b1 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-14T12:23:55Z Update RuleExecutor.scala --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4401] [SQL] RuleExecutor should log tra...
Github user YanTangZhai closed the pull request at: https://github.com/apache/spark/pull/3265 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4401] [SQL] RuleExecutor should log tra...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/3265#issuecomment-63058643 @srowen Thanks. I close this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [WIP] [SPARK-4273] [SQL] Providing ExternalSet...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/3137 [WIP] [SPARK-4273] [SQL] Providing ExternalSet to avoid OOM when count(distinct)
Some tasks may OOM when computing count(distinct) if they need to process many records. CombineSetsAndCountFunction puts all records into an OpenHashSet; if it fetches many records, it may occupy a large amount of memory. I think a data structure ExternalSet, like ExternalAppendOnlyMap, could be provided to store OpenHashSet data on disk when its capacity exceeds some threshold. For example, OpenHashSet1 (ohs1) has [d, b, c, a]. It is spilled to file1 sorted by hashCode, so file1 contains [a, b, c, d]. The procedure could be indicated as follows:
ohs1 [d, b, c, a] => [a, b, c, d] => file1
ohs2 [e, f, g, a] => [a, e, f, g] => file2
ohs3 [e, h, i, g] => [e, g, h, i] => file3
ohs4 [j, h, a] => [a, h, j] => sortedSet
On output, all keys with the same hashCode are put into one OpenHashSet, and then that OpenHashSet's iterator is consumed. The procedure could be indicated as follows:
file1 -> a -> ohsA; file2 -> a -> ohsA; sortedSet -> a -> ohsA; ohsA -> a;
file1 -> b -> ohsB; ohsB -> b;
file1 -> c -> ohsC; ohsC -> c;
file1 -> d -> ohsD; ohsD -> d;
file2 -> e -> ohsE; file3 -> e -> ohsE; ohsE -> e; ...
I think using the ExternalSet could avoid OOM when computing count(distinct). Comments welcome.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/YanTangZhai/spark ExternalAggregate Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3137.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3137
commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-06T13:07:08Z Merge pull request #1 from apache/master update
commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-20T13:14:08Z Merge pull request #3 from apache/master Update
commit 8a0010691b669495b4c327cf83124cabb7da1405 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-12T06:54:58Z Merge pull request #6 from apache/master Update
commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-16T12:03:22Z Merge pull request #7 from apache/master Update
commit 76d40277d51f709247df1d3734093bf2c047737d Author: YanTangZhai hakeemz...@tencent.com Date: 2014-10-20T12:52:22Z Merge pull request #8 from apache/master update
commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-04T09:00:31Z Merge pull request #9 from apache/master Update
commit eecb499bb10b21d648ae9e6c0282fafcde111994 Author: yantangzhai tyz0...@163.com Date: 2014-11-06T12:57:29Z A method to avoid OOM when count(distinct) by providing ExternalSet
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
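A toy, self-contained sketch of the spill-and-merge shape described above (in-memory vectors stand in for the spill files, a plain Set for OpenHashSet):
```
// Each "spill" stands in for an on-disk file: the set's elements sorted by hashCode.
def spill[A](set: Set[A]): Vector[A] = set.toVector.sortBy(_.hashCode)

// Toy merge: collect equal-hashCode keys from all runs into one small group at a
// time and emit its distinct members; the real proposal would stream these
// groups from the hash-sorted files instead of materializing everything.
def mergeDistinct[A](runs: Seq[Vector[A]]): Iterator[A] =
  runs.flatten.groupBy(_.hashCode).valuesIterator.flatMap(_.distinct)

val file1 = spill(Set("d", "b", "c", "a")) // like [a, b, c, d] above
val file2 = spill(Set("e", "f", "g", "a"))
assert(mergeDistinct(Seq(file1, file2)).toSet ==
  Set("a", "b", "c", "d", "e", "f", "g"))
```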
[GitHub] spark pull request: [SPARK-4009][SQL]HiveTableScan should use make...
Github user YanTangZhai closed the pull request at: https://github.com/apache/spark/pull/2857 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4009][SQL]HiveTableScan should use make...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/2857#issuecomment-59915528 @marmbrus Thanks. Please disregard it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4009][SQL]HiveTableScan should use make...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/2857 [SPARK-4009][SQL]HiveTableScan should use makeRDDForTable instead of makeRDDForPartitionedTable for partitioned table when partitionPruningPred is None
HiveTableScan should use makeRDDForTable instead of makeRDDForPartitionedTable for a partitioned table when partitionPruningPred is None. If a table has many partitions, for example more than 20 thousand, while it holds little data, for example less than 512MB, some SQL queries against the table will produce a huge number of RDDs (roughly one per partition). The job submission would then fail with a Java stack overflow exception.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/YanTangZhai/spark SPARK-4009 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2857.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2857
commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-06T13:07:08Z Merge pull request #1 from apache/master update
commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-20T13:14:08Z Merge pull request #3 from apache/master Update
commit 8a0010691b669495b4c327cf83124cabb7da1405 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-12T06:54:58Z Merge pull request #6 from apache/master Update
commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-16T12:03:22Z Merge pull request #7 from apache/master Update
commit 76d40277d51f709247df1d3734093bf2c047737d Author: YanTangZhai hakeemz...@tencent.com Date: 2014-10-20T12:52:22Z Merge pull request #8 from apache/master update
commit be7882ce16911d018571fa46c1a175d063bdfd03 Author: yantangzhai tyz0...@163.com Date: 2014-10-20T13:05:44Z [SPARK-4009][SQL]HiveTableScan should use makeRDDForTable instead of makeRDDForPartitionedTable for partitioned table when partitionPruningPred is None
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
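The proposed dispatch, sketched with hypothetical stand-in types (the real decision sits in HiveTableScan): with no pruning predicate, scan the table in one pass rather than unioning one RDD per partition.
```
sealed trait ScanPlan
case class WholeTableScan(tablePath: String) extends ScanPlan
case class PerPartitionScan(partitionPaths: Seq[String]) extends ScanPlan

def planHiveScan(partitionPaths: Seq[String],
                 tablePath: String,
                 partitionPruningPred: Option[String]): ScanPlan =
  partitionPruningPred match {
    // No pruning predicate: one scan of the whole table, not a union of
    // tens of thousands of per-partition RDDs.
    case None    => WholeTableScan(tablePath)
    case Some(_) => PerPartitionScan(partitionPaths)
  }
```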
[GitHub] spark pull request: [SPARK-3545] Put HadoopRDD.getPartitions forwa...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/2409 [SPARK-3545] Put HadoopRDD.getPartitions forward and put TaskScheduler.start back to reduce DAGScheduler.JobSubmitted processing time and shorten cluster resources occupation period
We have two problems:
(1) HadoopRDD.getPartitions is lazily processed in DAGScheduler.JobSubmitted. If the input dir is large, getPartitions may take much time. For example, in our cluster it needs from 0.029s to 766.699s. While one JobSubmitted event is being processed, the others must wait. Thus, we want to put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time, so other JobSubmitted events don't need to wait as long. A HadoopRDD object could get its partitions when it is instantiated.
(2) When a SparkContext object is instantiated, TaskScheduler is started and some resources are allocated from the cluster. However, these resources may not be used yet, for example while DAGScheduler.JobSubmitted is still processing. These resources are wasted in this period. Thus, we want to put TaskScheduler.start back to shorten the cluster resources occupation period, especially on a busy cluster. TaskScheduler could be started just before running stages.
We can analyse and compare the execution time before and after optimization:
TaskScheduler.start execution time: [time1__]
DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or TaskScheduler.start) execution time: [time2_]
HadoopRDD.getPartitions execution time: [time3___]
Stages execution time: [time4_]
(1) The app has only one job
(a) The execution time of the job before optimization is [time1__][time2_][time3___][time4_]. The execution time of the job after optimization is [time3___][time2_][time1__][time4_].
(b) The cluster resources occupation period before optimization is [time2_][time3___][time4_]. The cluster resources occupation period after optimization is [time4_].
In summary, if the app has only one job, the total execution time is the same before and after optimization, while the cluster resources occupation period after optimization is shorter than before.
(2) The app has 4 jobs
(a) Before optimization, job1 execution time is [time2_][time3___][time4_], job2 execution time is [time2__][time3___][time4_], job3 execution time is [time2][time3___][time4_], job4 execution time is [time2__][time3___][time4_]. After optimization, job1 execution time is [time3___][time2_][time1__][time4_], job2 execution time is [time3___][time2__][time4_], job3 execution time is [time3___][time2_][time4_], job4 execution time is [time3___][time2__][time4_].
In summary, if the app has multiple jobs, the average execution time after optimization is shorter than before, and so is the cluster resources occupation period.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/YanTangZhai/spark SPARK-3545 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2409.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2409 commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-06T13:07:08Z Merge pull request #1 from apache/master update commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-20T13:14:08Z Merge pull request #3 from apache/master Update commit 8a0010691b669495b4c327cf83124cabb7da1405 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-12T06:54:58Z Merge pull request #6 from apache/master Update commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-16T12:03:22Z Merge pull request #7 from apache/master Update commit b88df438033eecbdbe8cad37b2bd4ad3620de6e2 Author: yantangzhai tyz0...@163.com Date: 2014-09-16T13:22:12Z [SPARK-3545] Put HadoopRDD.getPartitions forward and put TaskScheduler.start back to reduce DAGScheduler.JobSubmitted processing time and shorten cluster resources occupation period --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA
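A miniature of the two proposed moves, with hypothetical names (the real code paths are HadoopRDD.getPartitions and TaskScheduler.start): compute partitions eagerly when the RDD is constructed, off the scheduler's event loop, and start the scheduler only when the first stage is about to run.
```
class EagerPartitionsRDD(listSplits: () => Array[Int]) {
  // Computed at construction time, on the caller's thread, not inside the
  // DAGScheduler event loop.
  val partitions: Array[Int] = listSplits()
}

class LazyStartScheduler(doStart: () => Unit) {
  private var started = false
  // Cluster resources are only grabbed right before the first stage runs.
  def ensureStarted(): Unit = if (!started) { doStart(); started = true }
}
```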
[GitHub] spark pull request: [SPARK-3003] FailedStage could not be cancelle...
Github user YanTangZhai closed the pull request at: https://github.com/apache/spark/pull/1921 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3003] FailedStage could not be cancelle...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1921#issuecomment-55371512 @andrewor14 If a running stage hits a fetch failure, it is moved from runningStages to failedStages, but it is still shown as alive in the web UI. When I then try to kill this stage, it cannot be cancelled. I checked again: this problem won't occur in the latest Spark version. I will close this PR. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2714] DAGScheduler logs jobid when runJ...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1617#issuecomment-55376375 @andrewor14 Thanks. Please review again. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2643] Stages web ui has ERROR when pool...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1854#issuecomment-55254070 @jkbradley I will close this PR. Thank you very much. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2643] Stages web ui has ERROR when pool...
Github user YanTangZhai closed the pull request at: https://github.com/apache/spark/pull/1854 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2715] ExternalAppendOnlyMap adds max li...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1618#issuecomment-55254537 @andrewor14 Yeah, I see. I will close the PR. If needed, it could be reopened. Thank you very much. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2715] ExternalAppendOnlyMap adds max li...
Github user YanTangZhai closed the pull request at: https://github.com/apache/spark/pull/1618 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3148] Update global variables of HttpBr...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/2059#issuecomment-52884506 Hi @JoshRosen SparkContext1 creates a broadcastManager and initializes the HttpBroadcast object, which creates the http server, the broadcastDir and so on. However, SparkContext2 in the same process won't initialize the HttpBroadcast object when creating its broadcastManager, since the HttpBroadcast object is marked initialized and will not be initialized again. So SparkContext1 and SparkContext2 share the same HttpBroadcast object. When SparkContext1 stops HttpBroadcast, the HttpBroadcast in SparkContext2 is actually stopped too, and when SparkContext1's HttpBroadcast cleans up files, some files owned by SparkContext2 may be removed, since they are the same object. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
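The isolation being argued for, sketched with hypothetical names: per-SparkContext broadcast state instead of a process-wide, initialize-once singleton, so one context's stop() and cleanup cannot touch another's server and files.
```
// Hypothetical: per-context state object instead of a global initialized-once object.
final class HttpBroadcastState(val contextId: String) {
  private var stopped = false
  def stop(): Unit = { stopped = true } // only affects this context's server/dir
  def isStopped: Boolean = stopped
}

val state1 = new HttpBroadcastState("sc1")
val state2 = new HttpBroadcastState("sc2")
state1.stop()
assert(!state2.isStopped) // sc2's broadcast files survive sc1's shutdown
```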
[GitHub] spark pull request: Update global variables of HttpBroadcast so th...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/2058 Update global variables of HttpBroadcast so that multiple SparkContexts can coexist Update global variables of HttpBroadcast so that multiple SparkContexts can coexist You can merge this pull request into a Git repository by running: $ git pull https://github.com/YanTangZhai/spark httpbroadcast Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2058.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2058 commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-06T13:07:08Z Merge pull request #1 from apache/master update commit b9921f37aa13620c5bea82512b63e2b0a73b5ffa Author: yantangzhai tyz0...@163.com Date: 2014-08-20T12:56:00Z Update global variables of HttpBroadcast so that multiple SparkContexts can coexist commit 07d719ff7d77a66a4b67ef84ba9d4e5e881391fb Author: yantangzhai tyz0...@163.com Date: 2014-08-20T12:57:19Z Update global variables of HttpBroadcast so that multiple SparkContexts can coexist --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3148] Update global variables of HttpBr...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/2059 [SPARK-3148] Update global variables of HttpBroadcast so that multiple SparkContexts can coexist Update global variables of HttpBroadcast so that multiple SparkContexts can coexist You can merge this pull request into a Git repository by running: $ git pull https://github.com/YanTangZhai/spark SPARK-3148 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2059.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2059 commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-06T13:07:08Z Merge pull request #1 from apache/master update commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-20T13:14:08Z Merge pull request #3 from apache/master Update commit e751ebd22c0683746b8e13c48570bf22b4de45db Author: yantangzhai tyz0...@163.com Date: 2014-08-20T14:07:57Z [SPARK-3148] Update global variables of HttpBroadcast so that multiple SparkContexts can coexist commit 97b34079b4af178ff2bca42c314aeb0e51687167 Author: yantangzhai tyz0...@163.com Date: 2014-08-20T14:11:34Z [SPARK-3148] Update global variables of HttpBroadcast so that multiple SparkContexts can coexist --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Update global variables of HttpBroadcast so th...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/2058#issuecomment-52783462 #2059 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Update global variables of HttpBroadcast so th...
Github user YanTangZhai closed the pull request at: https://github.com/apache/spark/pull/2058 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3067] JobProgressPage could not show Fa...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/1966 [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
Sometimes JobProgressPage cannot show the Fair Scheduler Pools section. SparkContext starts the web UI and only then posts the environment update. If JobProgressPage is accessed between the web UI starting and postEnvironmentUpdate, the lazy val isFairScheduler is evaluated to false and cached, and the Fair Scheduler Pools section never displays again.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/YanTangZhai/spark SPARK-3067 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1966.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1966
commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-06T13:07:08Z Merge pull request #1 from apache/master update
commit aac7f7b67d83d4175018d58568cfbd1a639e3d7e Author: yantangzhai tyz0...@163.com Date: 2014-08-15T09:04:24Z [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
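The race in miniature (hypothetical shape, not the literal UI code): a lazy val snapshots the scheduling mode at first page access, so a hit before postEnvironmentUpdate pins it to false forever, while a def would re-read the current value.
```
var schedulingMode: Option[String] = None // set later by postEnvironmentUpdate

object Buggy { lazy val isFairScheduler: Boolean = schedulingMode.contains("FAIR") }
object Fixed { def isFairScheduler: Boolean = schedulingMode.contains("FAIR") }

assert(!Buggy.isFairScheduler) // page accessed before the update: caches false
schedulingMode = Some("FAIR")  // environment update arrives
assert(!Buggy.isFairScheduler) // stuck with the stale snapshot
assert(Fixed.isFairScheduler)  // re-evaluates and now sees FAIR
```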
[GitHub] spark pull request: [SPARK-3003] FailedStage could not be cancelle...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/1921 [SPARK-3003] FailedStage could not be cancelled by DAGScheduler when cancelJob or cancelStage
If a stage changes from running to failed, DAGScheduler cannot cancel it on cancelJob or cancelStage, since in failJobAndIndependentStages DAGScheduler only cancels running stages and posts SparkListenerStageCompleted for them.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/YanTangZhai/spark SPARK-3003 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1921.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1921
commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-06T13:07:08Z Merge pull request #1 from apache/master update
commit b736bd729713ba6ca23ae901b34cb8523f2d24b2 Author: yantangzhai tyz0...@163.com Date: 2014-08-13T13:33:24Z [SPARK-3003] FailedStage could not be cancelled by DAGScheduler when cancelJob or cancelStage
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
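The gap in miniature (hypothetical bookkeeping): cancellation that walks only runningStages never reaches a stage that has already been moved to failedStages.
```
import scala.collection.mutable

val runningStages = mutable.Set(1, 2)
val failedStages  = mutable.Set(3) // moved here on fetch failure

def cancellableStages(jobStages: Set[Int]): Set[Int] =
  // Walking only runningStages would miss stage 3; including failedStages fixes it.
  (runningStages ++ failedStages).toSet.intersect(jobStages)

assert(cancellableStages(Set(2, 3)) == Set(2, 3))
```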
[GitHub] spark pull request: [Spark 2643] Stages web ui has ERROR when pool...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/1854 [Spark 2643] Stages web ui has ERROR when pool name is None
14/07/23 16:01:44 WARN servlet.ServletHandler: /stages/
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:313)
at scala.None$.get(Option.scala:311)
at org.apache.spark.ui.jobs.StageTableBase.stageRow(StageTable.scala:132)
at org.apache.spark.ui.jobs.StageTableBase.org$apache$spark$ui$jobs$StageTableBase$$renderStageRow(StageTable.scala:150)
at org.apache.spark.ui.jobs.StageTableBase$$anonfun$toNodeSeq$1.apply(StageTable.scala:52)
at org.apache.spark.ui.jobs.StageTableBase$$anonfun$toNodeSeq$1.apply(StageTable.scala:52)
at org.apache.spark.ui.jobs.StageTableBase$$anonfun$stageTable$1.apply(StageTable.scala:61)
at org.apache.spark.ui.jobs.StageTableBase$$anonfun$stageTable$1.apply(StageTable.scala:61)
at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
at scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
at scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
at scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.xml.NodeBuffer.$amp$plus(NodeBuffer.scala:38)
at scala.xml.NodeBuffer.$amp$plus(NodeBuffer.scala:40)
at org.apache.spark.ui.jobs.StageTableBase.stageTable(StageTable.scala:60)
at org.apache.spark.ui.jobs.StageTableBase.toNodeSeq(StageTable.scala:52)
at org.apache.spark.ui.jobs.JobProgressPage.render(JobProgressPage.scala:91)
at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:65)
at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:65)
at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:70)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:370)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:744)
14/07/23 16:01:44 WARN server.AbstractHttpConnection: /stages/
java.lang.NoSuchMethodError: javax.servlet.http.HttpServletRequest.isAsyncStarted()Z
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:583)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255
[GitHub] spark pull request: [SPARK-2643] Stages web ui has ERROR when pool...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1854#issuecomment-51595260 @srowen The completedStages may contain stages as follows: ...10, 10, 10, 10, 10, 11, 18..., while the activeStages may contain 1, 10, 5 (with 10 unique) and the stageIdToData may contain ...10, 11, 18... (with 10 unique). When the completedStages is trimmed, ...10, 10 may be removed, whereas the stageIdToData should not remove 11, 18. If the stageIdToData has removed 11, 18, the web UI cannot show the pool name or description of 11, 18 in completed stages, so this problem still exists. Please review again, thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2643] Stages web ui has ERROR when pool...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1854#issuecomment-51597142 @srowen I see, thanks. I will modify. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2643] Stages web ui has ERROR when pool...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1854#issuecomment-51601604 @srowen Stage 10 will be removed from stageIdToData later, since it will be added into completedStages or failedStages again and removed from activeStages when it completes. Some time later, it will be removed from stageIdToData by trimIfNecessary, since it is no longer in activeStages. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
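The retention rule under discussion, as a small self-contained sketch (hypothetical trim logic, mirroring the 10/11/18 example above): when completedStages is trimmed, a stage's stageIdToData entry is dropped only if the stage id is neither active nor still among the retained completed ids.
```
import scala.collection.mutable

val activeStages    = Set(1, 10, 5)
val completedStages = mutable.Buffer(10, 10, 10, 10, 10, 11, 18)
val stageIdToData   = mutable.Map(10 -> "d10", 11 -> "d11", 18 -> "d18")

def trim(keepLast: Int): Unit = {
  val dropped = completedStages.take(completedStages.size - keepLast)
  completedStages.remove(0, dropped.size)
  val retained = completedStages.toSet ++ activeStages
  // Duplicates of 10 are trimmed, but 11 and 18 keep their data while retained.
  dropped.foreach(id => if (!retained(id)) stageIdToData -= id)
}

trim(keepLast = 3)
assert(stageIdToData.keySet == Set(10, 11, 18))
```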
[GitHub] spark pull request: [SPARK-2643] Stages web ui has ERROR when pool...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1854#issuecomment-51604647 Please review again, thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...
Github user YanTangZhai closed the pull request at: https://github.com/apache/spark/pull/1392 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1392#issuecomment-51190110 @pwendell Sorry, I'm late. Please disregard this PR since #1734 has been closed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...
Github user YanTangZhai closed the pull request at: https://github.com/apache/spark/pull/1244 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2714] DAGScheduler logs jobid when runJ...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1617#issuecomment-50564465 Hi @markhamstra When DAGScheduler runs multiple jobs concurrently, SparkContext only logs "Job finished", and all jobs log to the same file, which doesn't tell who is who. It's difficult to find out which job has finished, or how much time it took, from multiple "Job finished: ..., took ... s" lines. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
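What the patch is after, as a tiny sketch (the exact log format here is hypothetical): carry the job id in the completion line so concurrent jobs can be told apart.
```
def logJobFinished(jobId: Int, seconds: Double): String =
  f"Job $jobId finished, took $seconds%.6f s" // the id disambiguates concurrent jobs

assert(logJobFinished(7, 1.234567) == "Job 7 finished, took 1.234567 s")
```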
[GitHub] spark pull request: [SPARK-2714] DAGScheduler logs jobid when runJ...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/1617 [SPARK-2714] DAGScheduler logs jobid when runJob finishes DAGScheduler logs jobid when runJob finishes You can merge this pull request into a Git repository by running: $ git pull https://github.com/YanTangZhai/spark SPARK-2714 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1617.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1617 commit 090d90874cc3d2c3f6e884ab7942f7554025535c Author: yantangzhai tyz0...@163.com Date: 2014-07-28T13:41:39Z [SPARK-2714] DAGScheduler logs jobid when runJob finishes commit fb42f0f831d2ec094f26e7f4d5812c05e8c60e99 Author: yantangzhai tyz0...@163.com Date: 2014-07-28T13:47:15Z [SPARK-2714] DAGScheduler logs jobid when runJob finishes --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2715] ExternalAppendOnlyMap adds max li...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/1618 [SPARK-2715] ExternalAppendOnlyMap adds max limit of times and max limit of disk bytes written for spilling
ExternalAppendOnlyMap adds a max limit on the number of spills and a max limit on the disk bytes written when spilling. Therefore, a task with data skew can be made to fail fast instead of running for a long time.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/YanTangZhai/spark SPARK-2715 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1618.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1618
commit 3fc60119e9dee8d2a781316ade17812b1367849b Author: yantangzhai tyz0...@163.com Date: 2014-07-28T14:22:38Z [SPARK-2715] ExternalAppendOnlyMap adds max limit of times and max limit of disk bytes written for spilling
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
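A sketch of the two guards with hypothetical names; as the follow-up comment below notes, defaults of zero leave the original behaviour untouched.
```
// Hypothetical limits; 0 means unlimited, preserving current behaviour.
class SpillGuard(maxSpills: Long = 0L, maxDiskBytes: Long = 0L) {
  private var spills = 0L
  private var bytes  = 0L

  def recordSpill(bytesWritten: Long): Unit = {
    spills += 1; bytes += bytesWritten
    if (maxSpills > 0 && spills > maxSpills)
      sys.error(s"spilled $spills times, over limit $maxSpills: failing fast")
    if (maxDiskBytes > 0 && bytes > maxDiskBytes)
      sys.error(s"wrote $bytes spill bytes, over limit $maxDiskBytes: failing fast")
  }
}
```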
[GitHub] spark pull request: [SPARK-2715] ExternalAppendOnlyMap adds max li...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1618#issuecomment-50425621 Hi @andrewor14 The default values of the two max limits are zero, which keeps the original operating mode and does not fail an application that is running perfectly fine. If some application has skewed data, which we don't expect, it will run for a very long time; in that case we want the application to fail fast. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2647] DAGScheduler plugs other JobSubmi...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1548#issuecomment-50292640 @markhamstra Ok. Thank you very much. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2647] DAGScheduler plugs other JobSubmi...
Github user YanTangZhai closed the pull request at: https://github.com/apache/spark/pull/1548 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2647] DAGScheduler plugs other JobSubmi...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1548#issuecomment-5727 Hi @markhamstra , you are right. I will think of other ways to solve this problem. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2647] DAGScheduler plugs other JobSubmi...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/1548 [SPARK-2647] DAGScheduler plugs other JobSubmitted events when processing one JobSubmitted event

If several jobs are submitted, DAGScheduler blocks the other JobSubmitted events while it processes one JobSubmitted event. For example, one JobSubmitted event is processed as follows and takes a long time:

"spark-akka.actor.default-dispatcher-67" daemon prio=10 tid=0x7f75ec001000 nid=0x7dd6 in Object.wait() [0x7f76063e1000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:503)
    at org.apache.hadoopcdh3.ipc.Client.call(Client.java:1130)
    - locked <0x000783b17330> (a org.apache.hadoopcdh3.ipc.Client$Call)
    at org.apache.hadoopcdh3.ipc.RPC$Invoker.invoke(RPC.java:241)
    at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor86.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoopcdh3.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:83)
    at org.apache.hadoopcdh3.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:60)
    at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source)
    at org.apache.hadoopcdh3.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1472)
    at org.apache.hadoopcdh3.hdfs.DFSClient.getBlockLocations(DFSClient.java:1498)
    at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$1.doCall(Cdh3DistributedFileSystem.java:208)
    at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$1.doCall(Cdh3DistributedFileSystem.java:204)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getFileBlockLocations(Cdh3DistributedFileSystem.java:204)
    at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1812)
    at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1797)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:233)
    at StorageEngineClient.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:141)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:172)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:54)
    at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:54)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:54)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
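To make the failure mode concrete, here is a minimal sketch of a reproduction, assuming that Spark of this era computes HadoopRDD partitions inside the DAGScheduler's handling of the JobSubmitted event itself (as the stack trace above shows). The input path, app name, and sleep duration are placeholders, not taken from the PR:

```
import org.apache.spark.{SparkConf, SparkContext}

object JobSubmittedBlocking {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("job-submitted-blocking"))

    // Job 1: computing the HadoopRDD's partitions requires listing every file
    // and block of the input, which runs inside the scheduler's event handling.
    val slowJob = new Thread(new Runnable {
      override def run(): Unit = {
        sc.textFile("hdfs://namenode:8020/huge-input").count() // placeholder path
      }
    })

    // Job 2: trivial and independent of job 1, yet it cannot be scheduled
    // until the event loop finishes with job 1's JobSubmitted event.
    val quickJob = new Thread(new Runnable {
      override def run(): Unit = {
        sc.parallelize(1 to 10).count()
      }
    })

    slowJob.start()
    Thread.sleep(1000) // let job 1's event reach the scheduler first
    quickJob.start()

    slowJob.join()
    quickJob.join()
    sc.stop()
  }
}
```

Because both jobs funnel through the same event loop, quickJob does not start until slowJob's partitions have been listed, even though the two jobs share no dependency.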
[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1392#issuecomment-49584362

Hi @andrewor14, that's ok. Thanks.
[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/1392 [SPARK-2290] Worker should directly use its own sparkHome instead of appDesc.sparkHome when LaunchExecutor

Worker should directly use its own sparkHome instead of appDesc.sparkHome when handling LaunchExecutor.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/YanTangZhai/spark SPARK-2290

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1392.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1392

commit d3072fc05c7c20ec9d90732db2b9b26a4d27e290
Author: YanTangZhai tyz0...@163.com
Date: 2014-07-13T11:50:14Z
    Update ApplicationDescription.scala

commit 78ec6bc8c5d1af64ca21e1a231b47911df6d4f90
Author: YanTangZhai tyz0...@163.com
Date: 2014-07-13T11:52:34Z
    Update JsonProtocol.scala

commit 95e6ccc354167117430ce4cb7b2f5063a454ff1d
Author: YanTangZhai tyz0...@163.com
Date: 2014-07-13T11:54:55Z
    Update TestClient.scala

commit 508dcb65d04e3f12f99e03572a1cc277e7f1aeca
Author: YanTangZhai tyz0...@163.com
Date: 2014-07-13T11:58:01Z
    Update SparkDeploySchedulerBackend.scala

commit 6d6700aaad941779485eee2c35c4ab0cd278529e
Author: YanTangZhai tyz0...@163.com
Date: 2014-07-13T12:01:40Z
    Update Worker.scala

commit c360154ae5b03e7854d63573494fc6113295a7ec
Author: YanTangZhai tyz0...@163.com
Date: 2014-07-13T12:04:16Z
    Update JsonProtocolSuite.scala

commit 6febb215fb73735760fae957a4e71e2a61c17c77
Author: YanTangZhai tyz0...@163.com
Date: 2014-07-13T12:07:35Z
    Update ExecutorRunnerTest.scala
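As a self-contained illustration of the change's intent: the worker's own sparkHome wins over whatever the application description carries. The case class, method names, and paths below are invented for this sketch; they are not the actual Worker or ApplicationDescription API:

```
import java.io.File

// Invented names for illustration only.
case class AppDescription(name: String, sparkHome: Option[String])

class WorkerSketch(val sparkHome: File) {
  // Before the fix (conceptually): appDesc.sparkHome took precedence, so a
  // driver configured with a path that does not exist on this machine could
  // break executor launch.
  def executorSparkHomeOld(appDesc: AppDescription): File =
    appDesc.sparkHome.map(new File(_)).getOrElse(sparkHome)

  // After the fix: the worker always uses its own installation directory.
  def executorSparkHomeNew(appDesc: AppDescription): File = sparkHome
}

object WorkerSketchDemo extends App {
  val worker = new WorkerSketch(new File("/opt/spark"))
  val app = AppDescription("my-app", Some("/home/driver/spark")) // driver-side path
  println(worker.executorSparkHomeOld(app)) // /home/driver/spark (may not exist here)
  println(worker.executorSparkHomeNew(app)) // /opt/spark
}
```

This matters when the driver and the workers have Spark installed under different paths: a driver-side sparkHome need not exist on the worker machine.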
[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1392#issuecomment-48839557

#1244
[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1392#issuecomment-48839668

fix #1244
[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1244#issuecomment-48839912

I've fixed the compile problem. Please review and test again. Thanks very much.
[GitHub] spark pull request: [SPARK-2325] Utils.getLocalDir had better chec...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1281#issuecomment-48840373

Hi @ash211, I think this change is still needed. The method Utils.getLocalDir is used by components such as HttpBroadcast, which is a different code path from DiskBlockManager, so the two problems are distinct. Even though #1274 has been merged, this problem still exists. Please review again. Thanks.
[GitHub] spark pull request: [SPARK-2325] Utils.getLocalDir had better chec...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/1281 [SPARK-2325] Utils.getLocalDir had better check the directory and choose a good one instead of choosing the first one directly

If the first directory of spark.local.dir is bad, the application will exit with the exception:

Exception in thread "main" java.io.IOException: Failed to create a temp directory (under /data1/sparkenv/local) after 10 attempts!
    at org.apache.spark.util.Utils$.createTempDir(Utils.scala:258)
    at org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:154)
    at org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127)
    at org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
    at org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
    at org.apache.spark.broadcast.BroadcastManager.init(BroadcastManager.scala:35)
    at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
    at org.apache.spark.SparkContext.init(SparkContext.scala:202)
    at JobTaskJoin$.main(JobTaskJoin.scala:9)
    at JobTaskJoin.main(JobTaskJoin.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Utils.getLocalDir had better check the directories and choose a good one instead of taking the first one unconditionally. For example, if spark.local.dir is /data1/sparkenv/local,/data2/sparkenv/local and the disk data1 is bad while data2 is good, we should choose data2, not data1.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/YanTangZhai/spark SPARK-2325

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1281.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1281

commit 08424ce408b5e1ee679d15e46ea5b08979511fae
Author: yantangzhai tyz0...@163.com
Date: 2014-07-02T06:55:39Z
    [SPARK-2325] Utils.getLocalDir had better check the directory and choose a good one instead of choosing the first one directly
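A minimal sketch of what "check the directory and choose a good one" could look like, assuming spark.local.dir arrives as a comma-separated string. This is an illustration, not the actual Utils code:

```
import java.io.File
import java.io.IOException

// Walk the configured paths in order and return the first one that exists
// (or can be created) and is writable, instead of blindly returning the
// head of the list.
object LocalDirChooser {
  def chooseLocalDir(localDirs: String): String = {
    val candidates = localDirs.split(",").map(_.trim).filter(_.nonEmpty)
    candidates.find { path =>
      val dir = new File(path)
      (dir.isDirectory || dir.mkdirs()) && dir.canWrite
    }.getOrElse {
      throw new IOException(s"No usable directory among spark.local.dir entries: $localDirs")
    }
  }
}

// If /data1 sits on a failed disk but /data2 is healthy:
//   LocalDirChooser.chooseLocalDir("/data1/sparkenv/local,/data2/sparkenv/local")
//   => "/data2/sparkenv/local"
```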
[GitHub] spark pull request: [SPARK-2324] SparkContext should not exit dire...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/1274 [SPARK-2324] SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error

Suppose spark.local.dir is configured as a list of multiple paths, e.g. /data1/sparkenv/local,/data2/sparkenv/local. If the disk data2 of the driver node has an error, the application will exit since DiskBlockManager exits directly in createLocalDirs. If the disk data2 of a worker node has an error, the executor will exit as well. DiskBlockManager should not exit directly in createLocalDirs when only one of the spark.local.dir paths has an error: since spark.local.dir has multiple paths, a problem with one of them should not bring down the whole application. I think DiskBlockManager could ignore the bad directory in createLocalDirs.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/YanTangZhai/spark SPARK-2324

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1274.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1274

commit df086731952c669e12673fd673d829b9fdd790a2
Author: yantangzhai tyz0...@163.com
Date: 2014-07-01T10:39:46Z
    [SPARK-2324] SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error
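A sketch of the proposed createLocalDirs behavior: skip bad paths and fail only when every configured path is unusable. The names and logging below are illustrative, not the actual DiskBlockManager code:

```
import java.io.File

object LocalDirsSketch {
  // Try to create a subdirectory under each configured path, skip (and log)
  // the ones that fail, and only give up when all of them are unusable.
  def createLocalDirs(localDirs: String): Array[File] = {
    val usable = localDirs.split(",").flatMap { path =>
      try {
        val dir = new File(path, "spark-local-" + System.nanoTime())
        if (dir.mkdirs() || dir.isDirectory) Some(dir)
        else {
          System.err.println(s"Ignoring bad spark.local.dir entry: $path")
          None
        }
      } catch {
        case e: SecurityException =>
          System.err.println(s"Ignoring bad spark.local.dir entry $path: $e")
          None
      }
    }
    require(usable.nonEmpty, s"Failed to create a local dir under any of: $localDirs")
    usable
  }
}
```

With this behavior, the /data2 failure from the description costs one scratch directory instead of the whole application or executor.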
[GitHub] spark pull request: [SPARK-2324] SparkContext should not exit dire...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1274#issuecomment-47737851

Thanks @aarondav. I've modified some code. Please help review again.
[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/1244 [SPARK-2290] Worker should directly use its own sparkHome instead of appDesc.sparkHome when LaunchExecutor

Worker should directly use its own sparkHome instead of appDesc.sparkHome when handling LaunchExecutor.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/YanTangZhai/spark master

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1244.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1244

commit 05c3a789a00996a5502b78711b44d80e8812fdbb
Author: hakeemzhai hakeemzhai@hakeemzhai.(none)
Date: 2014-06-27T07:42:18Z
    [SPARK-2290] Worker should directly use its own sparkHome instead of appDesc.sparkHome when LaunchExecutor