[GitHub] spark pull request #16099: [SPARK-18665][SQL] set statement state to "ERROR"...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/16099#discussion_r165878592 --- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala --- @@ -241,6 +241,8 @@ private[hive] class SparkExecuteStatementOperation( dataTypes = result.queryExecution.analyzed.output.map(_.dataType).toArray } catch { case e: HiveSQLException => +HiveThriftServer2.listener.onStatementError( + statementId, e.getMessage, SparkUtils.exceptionString(e)) --- End diff -- Please catch the exception in ThriftServerPage.errorMessageCell. Without my PR, that function still throws an exception. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16099: [SPARK-18665][SQL] set statement state to "ERROR" after ...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/16099 @gatorsmile two years have passed... I don't know what to say.
[GitHub] spark pull request #16099: [SPARK-18665][SQL] set statement state to "ERROR"...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/16099#discussion_r165877094 --- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala --- @@ -241,6 +241,8 @@ private[hive] class SparkExecuteStatementOperation( dataTypes = result.queryExecution.analyzed.output.map(_.dataType).toArray } catch { case e: HiveSQLException => +HiveThriftServer2.listener.onStatementError( + statementId, e.getMessage, SparkUtils.exceptionString(e)) --- End diff -- OK, but my PR was closed by the community...
[GitHub] spark issue #14129: [SPARK-16280][SQL] Implement histogram_numeric SQL funct...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/14129 Is this pr available?
[GitHub] spark pull request #19756: [SPARK-22527][SQL] Reuse coordinated exchanges if...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/19756#discussion_r151615572 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/Exchange.scala --- @@ -109,3 +109,67 @@ case class ReuseExchange(conf: SQLConf) extends Rule[SparkPlan] { } } } + +/** + * Find out duplicated coordinated exchanges in the spark plan, then use the same exchange for all + * the references. + */ +case class ReuseExchangeWithCoordinator(conf: SQLConf) extends Rule[SparkPlan] { + + // Returns true if a SparkPlan has coordinated ShuffleExchangeExec children. + private def hasCoordinatedExchanges(plan: SparkPlan): Boolean = { +plan.children.nonEmpty && plan.children.forall { + case ShuffleExchangeExec(_, _, Some(_)) => true + case _ => false +} + } + + // Returns true if two sequences of exchanges are producing the same results. + private def hasExchangesWithSameResults( + source: Seq[ShuffleExchangeExec], + target: Seq[ShuffleExchangeExec]): Boolean = { +source.length == target.length && + source.zip(target).forall(x => x._1.withoutCoordinator.sameResult(x._2.withoutCoordinator)) + } + + type CoordinatorSignature = (Int, Long, Option[Int]) + + def apply(plan: SparkPlan): SparkPlan = { +if (!conf.exchangeReuseEnabled) { --- End diff -- ok
[GitHub] spark pull request #19756: [SPARK-22527][SQL] Reuse coordinated exchanges if...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/19756#discussion_r151612678 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/Exchange.scala --- @@ -109,3 +109,67 @@ case class ReuseExchange(conf: SQLConf) extends Rule[SparkPlan] { } } } + +/** + * Find out duplicated coordinated exchanges in the spark plan, then use the same exchange for all + * the references. + */ +case class ReuseExchangeWithCoordinator(conf: SQLConf) extends Rule[SparkPlan] { + + // Returns true if a SparkPlan has coordinated ShuffleExchangeExec children. + private def hasCoordinatedExchanges(plan: SparkPlan): Boolean = { +plan.children.nonEmpty && plan.children.forall { + case ShuffleExchangeExec(_, _, Some(_)) => true + case _ => false +} + } + + // Returns true if two sequences of exchanges are producing the same results. + private def hasExchangesWithSameResults( + source: Seq[ShuffleExchangeExec], + target: Seq[ShuffleExchangeExec]): Boolean = { +source.length == target.length && + source.zip(target).forall(x => x._1.withoutCoordinator.sameResult(x._2.withoutCoordinator)) + } + + type CoordinatorSignature = (Int, Long, Option[Int]) + + def apply(plan: SparkPlan): SparkPlan = { +if (!conf.exchangeReuseEnabled) { --- End diff -- I don't know whether Spark 2.2 still has this bug. I am using Spark 2.1.
[GitHub] spark pull request #19756: [SPARK-22527][SQL] Reuse coordinated exchanges if...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/19756#discussion_r151611386 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/Exchange.scala --- @@ -109,3 +109,67 @@ case class ReuseExchange(conf: SQLConf) extends Rule[SparkPlan] { } } } + +/** + * Find out duplicated coordinated exchanges in the spark plan, then use the same exchange for all + * the references. + */ +case class ReuseExchangeWithCoordinator(conf: SQLConf) extends Rule[SparkPlan] { + + // Returns true if a SparkPlan has coordinated ShuffleExchangeExec children. + private def hasCoordinatedExchanges(plan: SparkPlan): Boolean = { +plan.children.nonEmpty && plan.children.forall { + case ShuffleExchangeExec(_, _, Some(_)) => true + case _ => false +} + } + + // Returns true if two sequences of exchanges are producing the same results. + private def hasExchangesWithSameResults( + source: Seq[ShuffleExchangeExec], + target: Seq[ShuffleExchangeExec]): Boolean = { +source.length == target.length && + source.zip(target).forall(x => x._1.withoutCoordinator.sameResult(x._2.withoutCoordinator)) + } + + type CoordinatorSignature = (Int, Long, Option[Int]) + + def apply(plan: SparkPlan): SparkPlan = { +if (!conf.exchangeReuseEnabled) { --- End diff -- exchangeReuseEnabled still has a bug (SPARK-20295). Can we use a new configuration?
[GitHub] spark issue #18270: [SPARK-21055][SQL] replace grouping__id with grouping_id...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/18270 @gatorsmile
[GitHub] spark issue #19301: [SPARK-22084][SQL] Fix performance regression in aggrega...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/19301 Should `sum(mt_cnt)` and `sum(ele_cnt)` be computed again?
[GitHub] spark issue #19301: [SPARK-22084][SQL] Fix performance regression in aggrega...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/19301 I don't know whether my case can be optimized or not.
[GitHub] spark issue #19301: [SPARK-22084][SQL] Fix performance regression in aggrega...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/19301 my case: ```sql select dt, geohash_of_latlng, sum(mt_cnt), sum(ele_cnt), round(sum(mt_cnt) * 1.0 * 100 / sum(mt_cnt_all), 2), round(sum(ele_cnt) * 1.0 * 100 / sum(ele_cnt_all), 2) from temp.test_geohash_match_parquet group by dt, geohash_of_latlng order by dt, geohash_of_latlng limit 10; ``` before your fix ```java TakeOrderedAndProject(limit=10, orderBy=[dt#502 ASC NULLS FIRST,geohash_of_latlng#507 ASC NULLS FIRST], output=[dt#502,geohash_of_latlng#507,sum(mt_cnt)#521L,sum(ele_cnt)#522L,round((CAST((CAST((CAST(CAST(sum(CAST(mt_cnt AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(21,1)) * CAST(1.0 AS DECIMAL(21,1))) AS DECIMAL(23,1)) * CAST(CAST(100 AS DECIMAL(23,1)) AS DECIMAL(23,1))) AS DECIMAL(38,2)) / CAST(CAST(sum(CAST(mt_cnt_all AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(38,2))), 2)#523,round((CAST((CAST((CAST(CAST(sum(CAST(ele_cnt AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(21,1)) * CAST(1.0 AS DECIMAL(21,1))) AS DECIMAL(23,1)) * CAST(CAST(100 AS DECIMAL(23,1)) AS DECIMAL(23,1))) AS DECIMAL(38,2)) / CAST(CAST(sum(CAST(ele_cnt_all AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(38,2))), 2)#524]) +- *HashAggregate(keys=[dt#502, geohash_of_latlng#507], functions=[sum(cast(mt_cnt#511 as bigint)), sum(cast(ele_cnt#512 as bigint)), sum(cast(mt_cnt#511 as bigint)), sum(cast(mt_cnt_all#513 as bigint)), sum(cast(ele_cnt#512 as bigint)), sum(cast(ele_cnt_all#514 as bigint))]) +- Exchange(coordinator id: 148401229) hashpartitioning(dt#502, geohash_of_latlng#507, 1000), coordinator[target post-shuffle partition size: 200] +- *HashAggregate(keys=[dt#502, geohash_of_latlng#507], functions=[partial_sum(cast(mt_cnt#511 as bigint)), partial_sum(cast(ele_cnt#512 as bigint)), partial_sum(cast(mt_cnt#511 as bigint)), partial_sum(cast(mt_cnt_all#513 as bigint)), partial_sum(cast(ele_cnt#512 as bigint)), partial_sum(cast(ele_cnt_all#514 as bigint))]) +- HiveTableScan [geohash_of_latlng#507, mt_cnt#511, 
ele_cnt#512, mt_cnt_all#513, ele_cnt_all#514, dt#502], MetastoreRelation temp, test_geohash_match_parquet ``` after your fix ```java TakeOrderedAndProject(limit=10, orderBy=[dt#467 ASC NULLS FIRST,geohash_of_latlng#472 ASC NULLS FIRST], output=[dt#467,geohash_of_latlng#472,sum(mt_cnt)#486L,sum(ele_cnt)#487L,round((CAST((CAST((CAST(CAST(sum(CAST(mt_cnt AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(21,1)) * CAST(1.0 AS DECIMAL(21,1))) AS DECIMAL(23,1)) * CAST(CAST(100 AS DECIMAL(23,1)) AS DECIMAL(23,1))) AS DECIMAL(38,2)) / CAST(CAST(sum(CAST(mt_cnt_all AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(38,2))), 2)#488,round((CAST((CAST((CAST(CAST(sum(CAST(ele_cnt AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(21,1)) * CAST(1.0 AS DECIMAL(21,1))) AS DECIMAL(23,1)) * CAST(CAST(100 AS DECIMAL(23,1)) AS DECIMAL(23,1))) AS DECIMAL(38,2)) / CAST(CAST(sum(CAST(ele_cnt_all AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(38,2))), 2)#489]) +- *HashAggregate(keys=[dt#467, geohash_of_latlng#472], functions=[sum(cast(mt_cnt#476 as bigint)), sum(cast(ele_cnt#477 as bigint)), sum(cast(mt_cnt#476 as bigint)), sum(cast(mt_cnt_all#478 as bigint)), sum(cast(ele_cnt#477 as bigint)), sum(cast(ele_cnt_all#479 as bigint))]) +- Exchange(coordinator id: 227998366) hashpartitioning(dt#467, geohash_of_latlng#472, 1000), coordinator[target post-shuffle partition size: 200] +- *HashAggregate(keys=[dt#467, geohash_of_latlng#472], functions=[partial_sum(cast(mt_cnt#476 as bigint)), partial_sum(cast(ele_cnt#477 as bigint)), partial_sum(cast(mt_cnt#476 as bigint)), partial_sum(cast(mt_cnt_all#478 as bigint)), partial_sum(cast(ele_cnt#477 as bigint)), partial_sum(cast(ele_cnt_all#479 as bigint))]) +- HiveTableScan [geohash_of_latlng#472, mt_cnt#476, ele_cnt#477, mt_cnt_all#478, ele_cnt_all#479, dt#467], MetastoreRelation temp, test_geohash_match_parquet ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19301: [SPARK-22084][SQL] Fix performance regression in ...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/19301#discussion_r140155406 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -38,7 +38,7 @@ import org.apache.spark.sql.internal.SQLConf * view resolution, in this way, we are able to get the correct view column ordering and * omit the extra columns that we don't require); *1.2. Else set the child output attributes to `queryOutput`. - * 2. Map the `queryQutput` to view output by index, if the corresponding attributes don't match, + * 2. Map the `queryOutput` to view output by index, if the corresponding attributes don't match, --- End diff -- These two lines look exactly the same?
[GitHub] spark pull request #18193: [SPARK-15616] [SQL] CatalogRelation should fallba...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/18193#discussion_r139861601 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -140,6 +141,62 @@ class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] { } /** + * + * TODO: merge this with PruneFileSourcePartitions after we completely make hive as a data source. + */ +case class PruneHiveTablePartitions( +session: SparkSession) extends Rule[LogicalPlan] with PredicateHelper { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown { +case filter @ Filter(condition, relation: HiveTableRelation) if relation.isPartitioned => + val predicates = splitConjunctivePredicates(condition) + val normalizedFilters = predicates.map { e => +e transform { + case a: AttributeReference => +a.withName(relation.output.find(_.semanticEquals(a)).get.name) +} + } + val partitionSet = AttributeSet(relation.partitionCols) + val pruningPredicates = normalizedFilters.filter { predicate => +!predicate.references.isEmpty && + predicate.references.subsetOf(partitionSet) + } + if (pruningPredicates.nonEmpty && session.sessionState.conf.fallBackToHdfsForStatsEnabled && +session.sessionState.conf.metastorePartitionPruning) { +val prunedPartitions = session.sharedState.externalCatalog.listPartitionsByFilter( + relation.tableMeta.database, + relation.tableMeta.identifier.table, + pruningPredicates, + session.sessionState.conf.sessionLocalTimeZone) +val sizeInBytes = try { + prunedPartitions.map { part => +val totalSize = part.parameters.get(StatsSetupConst.TOTAL_SIZE).map(_.toLong) +val rawDataSize = part.parameters.get(StatsSetupConst.RAW_DATA_SIZE).map(_.toLong) +if (totalSize.isDefined && totalSize.get > 0L) { --- End diff -- I think we should use rawDataSize first, because a 1MB ORC file is equal to a 5MB text file...
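The preference suggested in this comment (raw data size first, on-disk total size second) could be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: the object and method names are hypothetical, and the keys `rawDataSize` and `totalSize` are assumed to be the string values behind `StatsSetupConst.RAW_DATA_SIZE` and `StatsSetupConst.TOTAL_SIZE`.

```scala
import scala.util.Try

// Hypothetical helper, not part of Spark or of the PR under review.
object PartitionSizeSketch {
  // Assumed string values of Hive's StatsSetupConst.RAW_DATA_SIZE / TOTAL_SIZE.
  val RawDataSize = "rawDataSize"
  val TotalSize = "totalSize"

  /** Returns the first positive size found, checking rawDataSize before totalSize. */
  def sizeFromParameters(parameters: Map[String, String]): Option[Long] = {
    // Parse a parameter as Long; missing, unparsable, or non-positive values are ignored.
    def positive(key: String): Option[Long] =
      parameters.get(key).flatMap(v => Try(v.toLong).toOption).filter(_ > 0L)

    positive(RawDataSize).orElse(positive(TotalSize))
  }
}
```

With a compressed columnar format the two values can differ a lot, so the order of the two lookups decides which estimate is used.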
[GitHub] spark pull request #18193: [SPARK-15616] [SQL] CatalogRelation should fallba...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/18193#discussion_r139312866 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -139,6 +138,54 @@ class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] { } /** + * + * TODO: merge this with PruneFileSourcePartitions after we completely make hive as a data source. + */ +case class PruneHiveTablePartitions( +session: SparkSession) extends Rule[LogicalPlan] with PredicateHelper { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown { +case filter @ Filter(condition, relation: HiveTableRelation) if relation.isPartitioned => + val predicates = splitConjunctivePredicates(condition) + val normalizedFilters = predicates.map { e => +e transform { + case a: AttributeReference => +a.withName(relation.output.find(_.semanticEquals(a)).get.name) +} + } + val partitionSet = AttributeSet(relation.partitionCols) + val pruningPredicates = normalizedFilters.filter { predicate => +!predicate.references.isEmpty && + predicate.references.subsetOf(partitionSet) + } + if (pruningPredicates.nonEmpty && session.sessionState.conf.fallBackToHdfsForStatsEnabled && +session.sessionState.conf.metastorePartitionPruning) { +val prunedPartitions = session.sharedState.externalCatalog.listPartitionsByFilter( + relation.tableMeta.database, + relation.tableMeta.identifier.table, + pruningPredicates, + session.sessionState.conf.sessionLocalTimeZone) +val sizeInBytes = try { + prunedPartitions.map { part => +CommandUtils.calculateLocationSize( --- End diff -- I think we should first check whether partition.parameters contains StatsSetupConst.RAW_DATA_SIZE or StatsSetupConst.TOTAL_SIZE. If partition.parameters contains the size of the partition, use it instead of calling getContentSummary on HDFS.
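The idea in this comment, using the size recorded in the partition parameters and only falling back to a filesystem scan when it is missing, might look something like the sketch below. Everything here is illustrative: `Partition` is a hypothetical stand-in for a catalog partition, `scanLocation` stands in for an HDFS call such as `getContentSummary`, and the parameter keys are assumed to match Hive's `StatsSetupConst` constants.

```scala
import scala.util.Try

// Hypothetical sketch, not the code from the pull request.
object PartitionSizeFallbackSketch {
  // Stand-in for a catalog partition: its location and its metastore parameters.
  final case class Partition(location: String, parameters: Map[String, String])

  // Assumed keys behind StatsSetupConst.RAW_DATA_SIZE and StatsSetupConst.TOTAL_SIZE.
  private val SizeKeys = Seq("rawDataSize", "totalSize")

  /**
   * Uses the first positive size recorded in the metastore parameters.
   * `scanLocation` (the expensive filesystem call) runs only when none is recorded.
   */
  def partitionSize(part: Partition, scanLocation: String => Long): Long = {
    val fromMetastore: Option[Long] = SizeKeys.iterator
      .flatMap(key => part.parameters.get(key))
      .flatMap(value => Try(value.toLong).toOption)
      .find(_ > 0L)

    fromMetastore.getOrElse(scanLocation(part.location))
  }
}
```

Passing the scan as a function keeps the sketch free of Hadoop dependencies; in real code that argument would wrap the filesystem lookup.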
[GitHub] spark pull request #19219: [SPARK-21993][SQL][WIP] Close sessionState when f...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/19219#discussion_r139281683 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala --- @@ -42,7 +42,7 @@ class HiveSessionStateBuilder(session: SparkSession, parentState: Option[Session * Create a Hive aware resource loader. */ override protected lazy val resourceLoader: HiveSessionResourceLoader = { -val client: HiveClient = externalCatalog.client.newSession() +val client: HiveClient = externalCatalog.client --- End diff -- newSession is there to isolate SparkSessions in the Spark ThriftServer. You must check whether this is running in the ThriftServer or not.
[GitHub] spark issue #17924: [SPARK-20682][SQL] Support a new faster ORC data source ...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/17924 @dongjoon-hyun I have a question: does this ORC data source reader support a table that contains multiple file formats? For example: table/ day=2017-09-01 (RCFile), day=2017-09-02 (ORCFile). ParquetFileFormat doesn't support this feature. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
[GitHub] spark issue #18270: [SPARK-21055][SQL] replace grouping__id with grouping_id...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/18270 @jinxing64 I think you could revert the changes in Spark and use the same grouping__id logic as Hive, so that the result stays consistent with Hive even though Hive's result is wrong.
[GitHub] spark issue #18270: [SPARK-21055][SQL] replace grouping__id with grouping_id...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/18270 @gatorsmile I already tried to resolve grouping__id in ResolveFunctions, but ResolveFunctions runs after ResolveGroupingAnalytics, and grouping__id may change in ResolveGroupingAnalytics.
[GitHub] spark pull request #18270: [SPARK-21055][SQL] replace grouping__id with grou...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/18270#discussion_r136695110 --- Diff: sql/core/src/test/resources/sql-tests/results/group-analytics.sql.out --- @@ -223,12 +223,19 @@ grouping_id() can only be used with GroupingSets/Cube/Rollup; -- !query 16 -SELECT course, year, grouping__id FROM courseSales GROUP BY CUBE(course, year) +SELECT course, year, grouping__id FROM courseSales GROUP BY CUBE(course, year) ORDER BY grouping__id, course, year --- End diff -- Thanks for the tip.
[GitHub] spark issue #18270: [SPARK-21055][SQL] replace grouping__id with grouping_id...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/18270 Why did it fail? Can't I add ORDER BY? ```java org.scalatest.exceptions.TestFailedException: Expected "...Y CUBE(course, year)[ ORDER BY grouping__id, course, year]", but got "...Y CUBE(course, year)[]" SQL query did not match for query #16 SELECT course, year, grouping__id FROM courseSales GROUP BY CUBE(course, year) ORDER BY grouping__id, course, year ```
[GitHub] spark issue #18270: [SPARK-21055][SQL] replace grouping__id with grouping_id...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/18270 I can't see any comment at 77d4f7c?
[GitHub] spark issue #18270: [SPARK-21055][SQL] replace grouping__id with grouping_id...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/18270 retest this please
[GitHub] spark pull request #18270: [SPARK-21055][SQL] replace grouping__id with grou...
GitHub user cenyuhai reopened a pull request: https://github.com/apache/spark/pull/18270 [SPARK-21055][SQL] replace grouping__id with grouping_id() ## What changes were proposed in this pull request? Spark does not support grouping__id; it has grouping_id() instead. That makes it inconvenient for Hive users to switch to spark-sql, so this PR replaces grouping__id with grouping_id(), and Hive users do not need to alter their scripts. ## How was this patch tested? Tested with SQLQuerySuite.scala. You can merge this pull request into a Git repository by running: $ git pull https://github.com/cenyuhai/spark SPARK-21055 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18270.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18270 commit 36ff72a44ad19efe5bcb2fe461a700d4c54f89ef Author: CenYuhai <yuhai@ele.me> Date: 2017-08-30T15:22:13Z eplace grouping__id with grouping_id()
[GitHub] spark pull request #19087: [SPARK-21055][SQL] replace grouping__id with grou...
Github user cenyuhai closed the pull request at: https://github.com/apache/spark/pull/19087
[GitHub] spark pull request #19087: [SPARK-21055][SQL] replace grouping__id with grou...
GitHub user cenyuhai opened a pull request: https://github.com/apache/spark/pull/19087 [SPARK-21055][SQL] replace grouping__id with grouping_id() ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/cenyuhai/spark SPARK-21055 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19087.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19087 commit 36ff72a44ad19efe5bcb2fe461a700d4c54f89ef Author: CenYuhai <yuhai@ele.me> Date: 2017-08-30T15:22:13Z eplace grouping__id with grouping_id()
[GitHub] spark pull request #18270: [SPARK-21055][SQL] replace grouping__id with grou...
Github user cenyuhai closed the pull request at: https://github.com/apache/spark/pull/18270
[GitHub] spark pull request #18270: [SPARK-21055][SQL] replace grouping__id with grou...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/18270#discussion_r136082775 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -954,6 +951,12 @@ class Analyzer( try { expr transformUp { case GetColumnByOrdinal(ordinal, _) => plan.output(ordinal) +case u @ UnresolvedAttribute(_) if resolver(u.name, VirtualColumn.hiveGroupingIdName) => --- End diff -- I don't think I can do it, because ResolveFunctions runs after ResolveGroupingAnalytics.
[GitHub] spark issue #18270: [SPARK-21055][SQL] replace grouping__id with grouping_id...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/18270 OK, I will update it.
[GitHub] spark pull request #18193: [SPARK-15616] [SQL] CatalogRelation should fallba...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/18193#discussion_r123903846 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -139,6 +138,54 @@ class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] { } } +case class DeterminePartitionedTableStats( +session: SparkSession) extends Rule[LogicalPlan] with PredicateHelper { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown { +case filter @ Filter(condition, relation: CatalogRelation) + if DDLUtils.isHiveTable(relation.tableMeta) && relation.isPartitioned => + val predicates = splitConjunctivePredicates(condition) + val normalizedFilters = predicates.map { e => +e transform { + case a: AttributeReference => +a.withName(relation.output.find(_.semanticEquals(a)).get.name) +} + } + val partitionSet = AttributeSet(relation.partitionCols) + val pruningPredicates = normalizedFilters.filter { predicate => +!predicate.references.isEmpty && + predicate.references.subsetOf(partitionSet) + } + if (pruningPredicates.nonEmpty && session.sessionState.conf.fallBackToHdfsForStatsEnabled && +session.sessionState.conf.metastorePartitionPruning) { +val prunedPartitions = session.sharedState.externalCatalog.listPartitionsByFilter( + relation.tableMeta.database, + relation.tableMeta.identifier.table, + pruningPredicates, + session.sessionState.conf.sessionLocalTimeZone) +val hiveTable = HiveClientImpl.toHiveTable(relation.tableMeta) +val partitions = prunedPartitions.map(HiveClientImpl.toHivePartition(_, hiveTable)) +val sizeInBytes = try { + val hadoopConf = session.sessionState.newHadoopConf() + partitions.map { partition => +val fs: FileSystem = partition.getDataLocation.getFileSystem(hadoopConf) +fs.getContentSummary(partition.getDataLocation).getLength + }.sum --- End diff -- if there are too many partitions, it will be very slow. 
Can you add a check for whether the running sum already exceeds the threshold and, if so, break out early? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
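The early-exit check suggested in this comment could look roughly like the sketch below. It is only an illustration under stated assumptions: `PartitionSizeEstimate`, `sumUntilExceeds`, and the `sizeOf` callback are hypothetical names, and `sizeOf` stands in for the per-partition `fs.getContentSummary(...).getLength` call from the diff. It is not code from the pull request.

```scala
object PartitionSizeEstimate {
  /**
   * Sums sizes lazily and stops as soon as the running total exceeds `threshold`.
   * `sizeOf` is a hypothetical stand-in for the per-partition file system call.
   * Returns the running total at the point where iteration stopped.
   */
  def sumUntilExceeds[A](items: Iterator[A], threshold: Long)(sizeOf: A => Long): Long = {
    var total = 0L
    // Stop asking for sizes once the total is already past the threshold.
    while (items.hasNext && total <= threshold) {
      total += sizeOf(items.next())
    }
    total
  }
}
```

Because the input is an iterator, partitions after the cut-off are never touched, so the number of file system calls is bounded by how quickly the threshold is crossed rather than by the partition count.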
[GitHub] spark issue #16832: [SPARK-19490][SQL] ignore case sensitivity when filterin...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/16832 @taklwu this PR is complete; you can merge it yourself. A committer told me that another PR has already fixed this bug, so my PR will not be merged.
[GitHub] spark issue #18270: [SPARK-21055][SQL] replace grouping__id with grouping_id...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/18270 Why did it fail?
[GitHub] spark pull request #18270: [SPARK-21055][SQL] replace grouping__id with grou...
GitHub user cenyuhai opened a pull request: https://github.com/apache/spark/pull/18270 [SPARK-21055][SQL] replace grouping__id with grouping_id() ## What changes were proposed in this pull request? Spark does not support grouping__id; it has grouping_id() instead. That makes it inconvenient for Hive users to move to spark-sql, so this PR replaces grouping__id with grouping_id() and Hive users do not need to alter their scripts. ## How was this patch tested? Tested with SQLQuerySuite.scala You can merge this pull request into a Git repository by running: $ git pull https://github.com/cenyuhai/spark SPARK-21055 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18270.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18270 commit 6fd567ca8d2d9e302612f281f4143033efa2c156 Author: cenyuhai <261810...@qq.com> Date: 2017-06-11T12:04:05Z replace grouping__id with grouping_id()
[GitHub] spark issue #17362: [SPARK-20033][SQL] support hive permanent function
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/17362 OK, I think @weiqingy's PR will resolve this problem.
[GitHub] spark pull request #17362: [SPARK-20033][SQL] support hive permanent functio...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/17362#discussion_r107828225 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala --- @@ -135,6 +142,35 @@ private[sql] class HiveSessionCatalog( } } + override def loadFunctionResources(resources: Seq[FunctionResource]): Unit = { +logDebug("loading hive permanent function resources") +resources.foreach { resource => + val resourceType = resource.resourceType match { +case JarResource => + ResourceType.JAR +case FileResource => + ResourceType.FILE +case ArchiveResource => + ResourceType.ARCHIVE + } + val uri = if (!Shell.WINDOWS) { +new URI(resource.uri) + } + else { +new Path(resource.uri).toUri + } + val scheme = if (uri.getScheme == null) null else uri.getScheme.toLowerCase + if (scheme == null || scheme == "file") { +functionResourceLoader.loadResource(resource) + } else { +val sessionState = SessionState.get() +val localPath = sessionState.add_resource(resourceType, resource.uri) --- End diff -- No, I just use sessionState.add_resource to download the resources.
[GitHub] spark issue #17362: [SPARK-20033][SQL] support hive permanent function
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/17362 @gatorsmile Hi, Spark only supports Hive UDFs whose resources are on the local disk, or passed via spark-sql --jars xxx.jar, or something similar. But I think Spark doesn't support Hive permanent functions whose resources are in HDFS. CREATE FUNCTION hdfs_udf AS 'xxx.udf' USING JAR 'hdfs:///user/udf/.jar'; My PR just downloads the HDFS resources and adds them to the classpath.
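The local-versus-remote decision described here (and made via `uri.getScheme` in the diff for this PR) can be sketched in isolation as follows. This is a simplified illustration, not the PR's code: `ResourceLocation`, `Local`, `Remote`, and `classify` are hypothetical names, the example paths are made up, and the Windows-specific `Path` handling from the diff is deliberately left out.

```scala
import java.net.URI
import java.util.Locale

object ResourceLocation {
  sealed trait Location
  case object Local extends Location   // no scheme, or file: load directly
  case object Remote extends Location  // anything else (e.g. hdfs): download first

  /** Classifies a resource URI string by its scheme, ignoring case. */
  def classify(resourceUri: String): Location = {
    // getScheme returns null for scheme-less URIs such as plain local paths.
    val scheme = Option(new URI(resourceUri).getScheme).map(_.toLowerCase(Locale.ROOT))
    scheme match {
      case None | Some("file") => Local
      case _                   => Remote
    }
  }
}
```

Under this sketch, a resource classified as `Remote` is the case where the PR would download the file before adding it to the classpath.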
[GitHub] spark pull request #17362: [SPARK-20033][SQL] support hive permanent functio...
GitHub user cenyuhai opened a pull request: https://github.com/apache/spark/pull/17362 [SPARK-20033][SQL] support hive permanent function ## What changes were proposed in this pull request? support hive permanent function ## How was this patch tested? You can merge this pull request into a Git repository by running: $ git pull https://github.com/cenyuhai/spark SPARK-20033 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17362.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17362 commit 703d23e067e2d02f69bd4ec429106a426c6bf132 Author: cenyuhai <cenyu...@didichuxing.com> Date: 2017-03-20T12:36:22Z support hive permanent function
[GitHub] spark issue #16832: [SPARK-19490][SQL] change hive column names to lower cas...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/16832 @rxin I think it is safe; it is only used to check whether the schema contains the columns. Hive columns are not case-sensitive.
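A case-insensitive containment check of the kind mentioned here might be sketched like this. The names `ColumnLookup` and `containsAllIgnoreCase` are hypothetical and the sketch works on plain strings; it does not reproduce the PR's actual change.

```scala
import java.util.Locale

object ColumnLookup {
  /** True if every requested column name appears in the schema, ignoring case. */
  def containsAllIgnoreCase(schemaColumns: Seq[String], requested: Seq[String]): Boolean = {
    // Normalize once so each lookup is a set membership test.
    val lowered = schemaColumns.map(_.toLowerCase(Locale.ROOT)).toSet
    requested.forall(c => lowered.contains(c.toLowerCase(Locale.ROOT)))
  }
}
```

Lower-casing only for the comparison leaves the original column names untouched, which matches the point that the normalization is used solely for the containment check.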
[GitHub] spark pull request #16832: [SPARK-19490][SQL] change column names to lower c...
GitHub user cenyuhai opened a pull request: https://github.com/apache/spark/pull/16832 [SPARK-19490][SQL] change column names to lower case ## What changes were proposed in this pull request? change column names to lower case ## How was this patch tested? You can merge this pull request into a Git repository by running: $ git pull https://github.com/cenyuhai/spark SPARK-19490 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16832.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16832 commit efb0b4bfc5643b5cd513682b035f24b5fa9656e9 Author: cenyuhai <261810...@qq.com> Date: 2017-02-07T11:33:15Z change column names to lower case
[GitHub] spark pull request #16481: [SPARK-19092] [SQL] Save() API of DataFrameWriter...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/16481#discussion_r95538748 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -413,17 +413,22 @@ case class DataSource( relation } - /** Writes the given [[DataFrame]] out to this [[DataSource]]. */ + /** + * Writes the given [[DataFrame]] out to this [[DataSource]]. + * + * @param isForWriteOnly Whether to just write the data without returning a [[BaseRelation]]. + */ def write( mode: SaveMode, - data: DataFrame): BaseRelation = { + data: DataFrame, + isForWriteOnly: Boolean = false): Option[BaseRelation] = { if (data.schema.map(_.dataType).exists(_.isInstanceOf[CalendarIntervalType])) { throw new AnalysisException("Cannot save interval data type into external storage.") } providingClass.newInstance() match { case dataSource: CreatableRelationProvider => -dataSource.createRelation(sparkSession.sqlContext, mode, caseInsensitiveOptions, data) +Some(dataSource.createRelation(sparkSession.sqlContext, mode, caseInsensitiveOptions, data)) --- End diff -- Maybe we can add a parameter here and let the user choose true or false.
[GitHub] spark issue #15109: [SPARK-17501][CORE] Record executor heartbeat timestamp ...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/15109 Ok, I will close this PR. This is not a big problem.
[GitHub] spark pull request #15109: [SPARK-17501][CORE] Record executor heartbeat tim...
Github user cenyuhai closed the pull request at: https://github.com/apache/spark/pull/15109
[GitHub] spark pull request #16099: [SPARK-18665][SQL] set statement state to error a...
GitHub user cenyuhai opened a pull request: https://github.com/apache/spark/pull/16099 [SPARK-18665][SQL] set statement state to error after user canceled job ## What changes were proposed in this pull request? set statement state to error after user canceled job You can merge this pull request into a Git repository by running: $ git pull https://github.com/cenyuhai/spark spark-18665 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16099.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16099 commit e3a9b0896db7371252a66cc6b84bd7db921268c1 Author: cenyuhai <261810...@qq.com> Date: 2016-12-01T10:26:43Z set statement state to error after user canceled job
[GitHub] spark pull request #16097: [SPARK-18665] set job to "ERROR" when job is canc...
Github user cenyuhai closed the pull request at: https://github.com/apache/spark/pull/16097
[GitHub] spark pull request #16097: [SPARK-18665] set job to "ERROR" when job is canc...
GitHub user cenyuhai opened a pull request: https://github.com/apache/spark/pull/16097 [SPARK-18665] set job to "ERROR" when job is canceled ## What changes were proposed in this pull request? set job to "ERROR" when job is canceled You can merge this pull request into a Git repository by running: $ git pull https://github.com/cenyuhai/spark SPARK-18665 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16097.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16097 commit 869eaaf23f79eefbc6a8ff7a7b9efbc4a9f8c6b7 Author: 岑玉海 <261810...@qq.com> Date: 2016-08-21T03:55:04Z Merge pull request #8 from apache/master merge latest code to my fork commit b6b0d0a41c1aa59bc97a0aa438619d903b78b108 Author: 岑玉海 <261810...@qq.com> Date: 2016-09-06T03:03:08Z Merge pull request #9 from apache/master Merge latest code to my fork commit abd7924eab25b6dfdfd78c23a78dadcb3b9fbe1e Author: 岑玉海 <261810...@qq.com> Date: 2016-09-08T17:10:12Z Merge pull request #10 from apache/master Merge latest code to my fork commit 4b460e218244cdb0884e73c5fca29cc43b516972 Author: cenyuhai <cenyu...@didichuxing.com> Date: 2016-09-15T09:25:24Z Merge remote-tracking branch 'remotes/apache/master' commit 22cb0a6f6f60ffae4a449727959cdd2940699f8e Author: 岑玉海 <261810...@qq.com> Date: 2016-12-01T06:09:42Z Merge pull request #12 from apache/master Merge latest code to my branch commit 8b9322d1f8421a1868d8d39472d1b6f3681b4de3 Author: cenyuhai <261810...@qq.com> Date: 2016-12-01T06:36:26Z set statement state to error after user canceled job
[GitHub] spark issue #15041: [SPARK-17488][CORE] TakeAndOrder will OOM when the data ...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/15041 OK, I will close this PR.
[GitHub] spark pull request #15041: [SPARK-17488][CORE] TakeAndOrder will OOM when th...
Github user cenyuhai closed the pull request at: https://github.com/apache/spark/pull/15041
[GitHub] spark issue #15041: [SPARK-17488][CORE] TakeAndOrder will OOM when the data ...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/15041 The driver is fine, but the executor is running out of memory; this method is called on the executor. Our maximum memory limit is 15G.
[GitHub] spark issue #15041: [SPARK-17488][CORE] TakeAndOrder will OOM when the data ...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/15041 For data security reasons, we can't fetch all the data. Every SQL query should be limited to 10 million records, but sometimes it still hits OOM... What I need is to avoid the OOM. @srowen Do you have any ideas?
[GitHub] spark pull request #15041: [SPARK-17488][CORE] TakeAndOrder will OOM when th...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/15041#discussion_r79754097 --- Diff: core/src/main/scala/org/apache/spark/util/collection/Utils.scala --- @@ -30,10 +34,22 @@ private[spark] object Utils { * Returns the first K elements from the input as defined by the specified implicit Ordering[T] * and maintains the ordering. */ - def takeOrdered[T](input: Iterator[T], num: Int)(implicit ord: Ordering[T]): Iterator[T] = { -val ordering = new GuavaOrdering[T] { - override def compare(l: T, r: T): Int = ord.compare(l, r) + def takeOrdered[T](input: Iterator[T], num: Int, + ser: Serializer = SparkEnv.get.serializer)(implicit ord: Ordering[T]): Iterator[T] = { +val context = TaskContext.get() +if (context == null) { + val ordering = new GuavaOrdering[T] { +override def compare(l: T, r: T): Int = ord.compare(l, r) + } + ordering.leastOf(input.asJava, num).iterator.asScala +} else { + val sorter = +new ExternalSorter[T, Any, Any](context, None, None, Some(ord), ser) + sorter.insertAll(input.map(x => (x, null))) --- End diff -- 1. In my case, a user executes a SQL query like "select * from table sort by time limit 1000"; k is very large, so it's an extreme case. I don't need to change RDD.takeOrdered; I will confine the changes to limit.scala. 2. GuavaOrdering will sort all data in memory and then take the top k. If there is enough memory, ExternalSorter will not spill.
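For context on the memory trade-off under discussion, here is a minimal sketch of a purely in-memory top-k selection using a bounded heap. It is an illustration only: `TopK` is a hypothetical name, and this is neither the Guava `leastOf` implementation nor the `ExternalSorter` path from the diff. Its memory use grows with k rather than with the input size, so with a very large k it can still be substantial, which is the situation the comment describes.

```scala
import scala.collection.mutable

object TopK {
  /** Returns the k smallest elements of `input` in ascending order under `ord`. */
  def takeOrdered[T](input: Iterator[T], k: Int)(implicit ord: Ordering[T]): Seq[T] = {
    if (k <= 0) {
      Seq.empty
    } else {
      // Max-heap under `ord`: the head is the largest element currently retained.
      val heap = mutable.PriorityQueue.empty[T](ord)
      input.foreach { x =>
        if (heap.size < k) {
          heap.enqueue(x)
        } else if (ord.lt(x, heap.head)) {
          // x beats the current worst retained element, so swap it in.
          heap.dequeue()
          heap.enqueue(x)
        }
      }
      // The heap's iteration order is unspecified, so sort the survivors.
      heap.toList.sorted(ord)
    }
  }
}
```

A spilling sorter, as proposed in the PR, trades this bounded-by-k memory for disk I/O once the retained data no longer fits in memory.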
[GitHub] spark pull request #15041: [SPARK-17488][CORE] TakeAndOrder will OOM when th...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/15041#discussion_r79753174 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -1384,14 +1385,15 @@ abstract class RDD[T: ClassTag]( * @param ord the implicit ordering for T * @return an array of top elements */ - def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope { + def takeOrdered(num: Int, ser: Serializer = SparkEnv.get.serializer) --- End diff -- Yes, you are right.
[GitHub] spark issue #14969: [SPARK-17406][WEB UI] limit timeline executor events
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/14969 OK, so there is still no guarantee that this will never happen again, because SparkQA can't detect whether the developer has added all the excludes.
[GitHub] spark issue #14969: [SPARK-17406][WEB UI] limit timeline executor events
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/14969 Ah, I was confused by MimaExcludes.scala. I asked @liancheng, and he told me to just add these to MimaExcludes.scala, which is imported from Spark 2.0. I see that your HOTFIX just removes what I added. If I don't add these changes to MimaExcludes.scala, I can't compile the project. Do you know the right way?
[GitHub] spark pull request #15109: [SPARK-17501][CORE] Record executor heartbeat tim...
GitHub user cenyuhai opened a pull request: https://github.com/apache/spark/pull/15109 [SPARK-17501][CORE] Record executor heartbeat timestamp when received heartbeat event. ## What changes were proposed in this pull request? Record executor's latest heartbeat timestamp when receiving heartbeat event. You can merge this pull request into a Git repository by running: $ git pull https://github.com/cenyuhai/spark SPARK-17501 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15109.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15109 commit 869eaaf23f79eefbc6a8ff7a7b9efbc4a9f8c6b7 Author: 岑玉海 <261810...@qq.com> Date: 2016-08-21T03:55:04Z Merge pull request #8 from apache/master merge latest code to my fork commit b6b0d0a41c1aa59bc97a0aa438619d903b78b108 Author: 岑玉海 <261810...@qq.com> Date: 2016-09-06T03:03:08Z Merge pull request #9 from apache/master Merge latest code to my fork commit abd7924eab25b6dfdfd78c23a78dadcb3b9fbe1e Author: 岑玉海 <261810...@qq.com> Date: 2016-09-08T17:10:12Z Merge pull request #10 from apache/master Merge latest code to my fork commit 4b460e218244cdb0884e73c5fca29cc43b516972 Author: cenyuhai <cenyu...@didichuxing.com> Date: 2016-09-15T09:25:24Z Merge remote-tracking branch 'remotes/apache/master' commit 9839193f3ce234525040adcbd91c79eeac067c0e Author: cenyuhai <cenyu...@didichuxing.com> Date: 2016-09-15T09:34:43Z fix registered blockmanager again and again
[GitHub] spark issue #14969: [SPARK-17406][WEB UI] limit timeline executor events
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/14969 OK
[GitHub] spark pull request #14969: [SPARK-17406][WEB UI] limit timeline executor eve...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/14969#discussion_r78360879 --- Diff: core/src/main/scala/org/apache/spark/ui/exec/ExecutorsTab.scala --- @@ -38,47 +37,68 @@ private[ui] class ExecutorsTab(parent: SparkUI) extends SparkUITab(parent, "exec } } +private[ui] case class ExecutorTaskSummary( +var executorId: String, +var totalCores: Int = 0, +var tasksMax: Int = 0, +var tasksActive: Int = 0, +var tasksFailed: Int = 0, +var tasksComplete: Int = 0, +var duration: Long = 0L, +var jvmGCTime: Long = 0L, +var inputBytes: Long = 0L, +var inputRecords: Long = 0L, +var outputBytes: Long = 0L, +var outputRecords: Long = 0L, +var shuffleRead: Long = 0L, +var shuffleWrite: Long = 0L, +var executorLogs: Map[String, String] = Map.empty, +var isAlive: Boolean = true +) + /** * :: DeveloperApi :: * A SparkListener that prepares information to be displayed on the ExecutorsTab */ @DeveloperApi class ExecutorsListener(storageStatusListener: StorageStatusListener, conf: SparkConf) extends SparkListener { - val executorToTotalCores = HashMap[String, Int]() - val executorToTasksMax = HashMap[String, Int]() - val executorToTasksActive = HashMap[String, Int]() - val executorToTasksComplete = HashMap[String, Int]() - val executorToTasksFailed = HashMap[String, Int]() - val executorToDuration = HashMap[String, Long]() - val executorToJvmGCTime = HashMap[String, Long]() - val executorToInputBytes = HashMap[String, Long]() - val executorToInputRecords = HashMap[String, Long]() - val executorToOutputBytes = HashMap[String, Long]() - val executorToOutputRecords = HashMap[String, Long]() - val executorToShuffleRead = HashMap[String, Long]() - val executorToShuffleWrite = HashMap[String, Long]() - val executorToLogUrls = HashMap[String, Map[String, String]]() - val executorIdToData = HashMap[String, ExecutorUIData]() + var executorToTaskSummary = LinkedHashMap[String, ExecutorTaskSummary]() + var executorEvents = new 
ListBuffer[SparkListenerEvent]() + + private val maxTimelineExecutors = conf.getInt("spark.ui.timeline.executors.maximum", 1000) + private val retainedDeadExecutors = conf.getInt("spark.ui.retainedDeadExecutors", 100) + private var deadExecutorCount = 0 --- End diff -- Yes, thank you for your advice.
[GitHub] spark pull request #14969: [SPARK-17406][WEB UI] limit timeline executor eve...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/14969#discussion_r78358827 --- Diff: core/src/main/scala/org/apache/spark/ui/exec/ExecutorsTab.scala --- @@ -38,47 +37,68 @@ private[ui] class ExecutorsTab(parent: SparkUI) extends SparkUITab(parent, "exec } } +private[ui] case class ExecutorTaskSummary( +var executorId: String, +var totalCores: Int = 0, +var tasksMax: Int = 0, +var tasksActive: Int = 0, +var tasksFailed: Int = 0, +var tasksComplete: Int = 0, +var duration: Long = 0L, +var jvmGCTime: Long = 0L, +var inputBytes: Long = 0L, +var inputRecords: Long = 0L, +var outputBytes: Long = 0L, +var outputRecords: Long = 0L, +var shuffleRead: Long = 0L, +var shuffleWrite: Long = 0L, +var executorLogs: Map[String, String] = Map.empty, +var isAlive: Boolean = true +) + /** * :: DeveloperApi :: * A SparkListener that prepares information to be displayed on the ExecutorsTab */ @DeveloperApi class ExecutorsListener(storageStatusListener: StorageStatusListener, conf: SparkConf) extends SparkListener { - val executorToTotalCores = HashMap[String, Int]() - val executorToTasksMax = HashMap[String, Int]() - val executorToTasksActive = HashMap[String, Int]() - val executorToTasksComplete = HashMap[String, Int]() - val executorToTasksFailed = HashMap[String, Int]() - val executorToDuration = HashMap[String, Long]() - val executorToJvmGCTime = HashMap[String, Long]() - val executorToInputBytes = HashMap[String, Long]() - val executorToInputRecords = HashMap[String, Long]() - val executorToOutputBytes = HashMap[String, Long]() - val executorToOutputRecords = HashMap[String, Long]() - val executorToShuffleRead = HashMap[String, Long]() - val executorToShuffleWrite = HashMap[String, Long]() - val executorToLogUrls = HashMap[String, Map[String, String]]() - val executorIdToData = HashMap[String, ExecutorUIData]() + var executorToTaskSummary = LinkedHashMap[String, ExecutorTaskSummary]() + var executorEvents = new 
ListBuffer[SparkListenerEvent]() + + private val maxTimelineExecutors = conf.getInt("spark.ui.timeline.executors.maximum", 1000) + private val retainedDeadExecutors = conf.getInt("spark.ui.retainedDeadExecutors", 100) + private var deadExecutorCount = 0 --- End diff -- If I don't remove dead executor information, there may be a memory leak. If I remove it immediately, we can hardly see the dead executor's log from the executors page. Do you have any ideas? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
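The trade-off raised in this comment (unbounded growth versus losing dead executors' logs too early) can be sketched as a capped registry. This is only an illustration under assumptions, not the code in the pull request: `ExecutorSummary` and `ExecutorRegistry` are hypothetical names, and the summary is reduced to two fields.

```scala
import scala.collection.mutable.LinkedHashMap

// Hypothetical, simplified stand-in for the per-executor summary in the diff.
final case class ExecutorSummary(executorId: String, var isAlive: Boolean = true)

// Keeps every live executor but at most `retainedDead` dead ones.
// LinkedHashMap iterates in insertion order, so the first dead entry
// found is the oldest one and is evicted first.
final class ExecutorRegistry(retainedDead: Int) {
  require(retainedDead >= 0, "retainedDead must be non-negative")

  private val summaries = LinkedHashMap[String, ExecutorSummary]()
  private var deadCount = 0

  def add(id: String): Unit = synchronized {
    summaries.getOrElseUpdate(id, ExecutorSummary(id))
  }

  def markDead(id: String): Unit = synchronized {
    // Only count an executor once, even if it is reported dead twice.
    summaries.get(id).filter(_.isAlive).foreach { s =>
      s.isAlive = false
      deadCount += 1
    }
    // Evict the oldest dead executors until the cap is respected.
    while (deadCount > retainedDead) {
      summaries.collectFirst { case (k, s) if !s.isAlive => k }.foreach { oldest =>
        summaries.remove(oldest)
        deadCount -= 1
      }
    }
  }

  def ids: Seq[String] = synchronized(summaries.keys.toSeq)
}
```

With a cap of 1, marking a second executor dead evicts the first dead one, while live executors are never dropped.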
[GitHub] spark pull request #14969: [SPARK-17406][WEB UI] limit timeline executor eve...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/14969#discussion_r78357353 --- Diff: core/src/main/scala/org/apache/spark/ui/exec/ExecutorsTab.scala --- @@ -38,47 +37,68 @@ private[ui] class ExecutorsTab(parent: SparkUI) extends SparkUITab(parent, "exec } } +private[ui] case class ExecutorTaskSummary( +var executorId: String, +var totalCores: Int = 0, +var tasksMax: Int = 0, +var tasksActive: Int = 0, +var tasksFailed: Int = 0, +var tasksComplete: Int = 0, +var duration: Long = 0L, +var jvmGCTime: Long = 0L, +var inputBytes: Long = 0L, +var inputRecords: Long = 0L, +var outputBytes: Long = 0L, +var outputRecords: Long = 0L, +var shuffleRead: Long = 0L, +var shuffleWrite: Long = 0L, +var executorLogs: Map[String, String] = Map.empty, +var isAlive: Boolean = true +) + /** * :: DeveloperApi :: * A SparkListener that prepares information to be displayed on the ExecutorsTab */ @DeveloperApi class ExecutorsListener(storageStatusListener: StorageStatusListener, conf: SparkConf) extends SparkListener { - val executorToTotalCores = HashMap[String, Int]() - val executorToTasksMax = HashMap[String, Int]() - val executorToTasksActive = HashMap[String, Int]() - val executorToTasksComplete = HashMap[String, Int]() - val executorToTasksFailed = HashMap[String, Int]() - val executorToDuration = HashMap[String, Long]() - val executorToJvmGCTime = HashMap[String, Long]() - val executorToInputBytes = HashMap[String, Long]() - val executorToInputRecords = HashMap[String, Long]() - val executorToOutputBytes = HashMap[String, Long]() - val executorToOutputRecords = HashMap[String, Long]() - val executorToShuffleRead = HashMap[String, Long]() - val executorToShuffleWrite = HashMap[String, Long]() - val executorToLogUrls = HashMap[String, Map[String, String]]() - val executorIdToData = HashMap[String, ExecutorUIData]() + var executorToTaskSummary = LinkedHashMap[String, ExecutorTaskSummary]() + var executorEvents = new 
ListBuffer[SparkListenerEvent]() + + private val maxTimelineExecutors = conf.getInt("spark.ui.timeline.executors.maximum", 1000) + private val retainedDeadExecutors = conf.getInt("spark.ui.retainedDeadExecutors", 100) + private var deadExecutorCount = 0 --- End diff -- deadExecutorCount is used to count the dead executors. It stays compatible with the fact that dead executors are removed immediately after Spark 2.0.
[GitHub] spark pull request #14969: [SPARK-17406][WEB UI] limit timeline executor eve...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/14969#discussion_r78355162 --- Diff: core/src/main/scala/org/apache/spark/ui/exec/ExecutorsTab.scala --- @@ -38,47 +37,68 @@ private[ui] class ExecutorsTab(parent: SparkUI) extends SparkUITab(parent, "exec } } +private[ui] case class ExecutorTaskSummary( +var executorId: String, +var totalCores: Int = 0, +var tasksMax: Int = 0, +var tasksActive: Int = 0, +var tasksFailed: Int = 0, +var tasksComplete: Int = 0, +var duration: Long = 0L, +var jvmGCTime: Long = 0L, +var inputBytes: Long = 0L, +var inputRecords: Long = 0L, +var outputBytes: Long = 0L, +var outputRecords: Long = 0L, +var shuffleRead: Long = 0L, +var shuffleWrite: Long = 0L, +var executorLogs: Map[String, String] = Map.empty, +var isAlive: Boolean = true +) + /** * :: DeveloperApi :: * A SparkListener that prepares information to be displayed on the ExecutorsTab */ @DeveloperApi class ExecutorsListener(storageStatusListener: StorageStatusListener, conf: SparkConf) extends SparkListener { - val executorToTotalCores = HashMap[String, Int]() - val executorToTasksMax = HashMap[String, Int]() - val executorToTasksActive = HashMap[String, Int]() - val executorToTasksComplete = HashMap[String, Int]() - val executorToTasksFailed = HashMap[String, Int]() - val executorToDuration = HashMap[String, Long]() - val executorToJvmGCTime = HashMap[String, Long]() - val executorToInputBytes = HashMap[String, Long]() - val executorToInputRecords = HashMap[String, Long]() - val executorToOutputBytes = HashMap[String, Long]() - val executorToOutputRecords = HashMap[String, Long]() - val executorToShuffleRead = HashMap[String, Long]() - val executorToShuffleWrite = HashMap[String, Long]() - val executorToLogUrls = HashMap[String, Map[String, String]]() - val executorIdToData = HashMap[String, ExecutorUIData]() + var executorToTaskSummary = LinkedHashMap[String, ExecutorTaskSummary]() + var executorEvents = new 
ListBuffer[SparkListenerEvent]() + + private val maxTimelineExecutors = conf.getInt("spark.ui.timeline.executors.maximum", 1000) --- End diff -- executorToTaskSummary is used by ExecutorsPage, which still displays dead executors. So I can't remove an executor's information immediately after the executor is removed.
[GitHub] spark pull request #14969: [SPARK-17406][WEB UI] limit timeline executor eve...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/14969#discussion_r78352242 --- Diff: core/src/main/scala/org/apache/spark/ui/exec/ExecutorsTab.scala --- @@ -38,47 +37,68 @@ private[ui] class ExecutorsTab(parent: SparkUI) extends SparkUITab(parent, "exec } } +private[ui] case class ExecutorTaskSummary( +var executorId: String, +var totalCores: Int = 0, +var tasksMax: Int = 0, +var tasksActive: Int = 0, +var tasksFailed: Int = 0, +var tasksComplete: Int = 0, +var duration: Long = 0L, +var jvmGCTime: Long = 0L, +var inputBytes: Long = 0L, +var inputRecords: Long = 0L, +var outputBytes: Long = 0L, +var outputRecords: Long = 0L, +var shuffleRead: Long = 0L, +var shuffleWrite: Long = 0L, +var executorLogs: Map[String, String] = Map.empty, +var isAlive: Boolean = true +) + /** * :: DeveloperApi :: * A SparkListener that prepares information to be displayed on the ExecutorsTab */ @DeveloperApi class ExecutorsListener(storageStatusListener: StorageStatusListener, conf: SparkConf) extends SparkListener { - val executorToTotalCores = HashMap[String, Int]() - val executorToTasksMax = HashMap[String, Int]() - val executorToTasksActive = HashMap[String, Int]() - val executorToTasksComplete = HashMap[String, Int]() - val executorToTasksFailed = HashMap[String, Int]() - val executorToDuration = HashMap[String, Long]() - val executorToJvmGCTime = HashMap[String, Long]() - val executorToInputBytes = HashMap[String, Long]() - val executorToInputRecords = HashMap[String, Long]() - val executorToOutputBytes = HashMap[String, Long]() - val executorToOutputRecords = HashMap[String, Long]() - val executorToShuffleRead = HashMap[String, Long]() - val executorToShuffleWrite = HashMap[String, Long]() - val executorToLogUrls = HashMap[String, Map[String, String]]() - val executorIdToData = HashMap[String, ExecutorUIData]() + var executorToTaskSummary = LinkedHashMap[String, ExecutorTaskSummary]() + var executorEvents = new 
ListBuffer[SparkListenerEvent]() + + private val maxTimelineExecutors = conf.getInt("spark.ui.timeline.executors.maximum", 1000) --- End diff -- spark.ui.timeline.executors.maximum is similar to spark.ui.timeline.tasks.maximum. It limits the ExecutorAdded and ExecutorRemoved events, so the name spark.ui.timeline.retainedDeadExecutors is not suitable.
[GitHub] spark pull request #14969: [SPARK-17406][WEB UI] limit timeline executor eve...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/14969#discussion_r78351884 --- Diff: core/src/main/scala/org/apache/spark/ui/exec/ExecutorsTab.scala --- @@ -38,47 +37,68 @@ private[ui] class ExecutorsTab(parent: SparkUI) extends SparkUITab(parent, "exec } } +private[ui] case class ExecutorTaskSummary( +var executorId: String, +var totalCores: Int = 0, +var tasksMax: Int = 0, +var tasksActive: Int = 0, +var tasksFailed: Int = 0, +var tasksComplete: Int = 0, +var duration: Long = 0L, +var jvmGCTime: Long = 0L, +var inputBytes: Long = 0L, +var inputRecords: Long = 0L, +var outputBytes: Long = 0L, +var outputRecords: Long = 0L, +var shuffleRead: Long = 0L, +var shuffleWrite: Long = 0L, +var executorLogs: Map[String, String] = Map.empty, +var isAlive: Boolean = true +) + /** * :: DeveloperApi :: * A SparkListener that prepares information to be displayed on the ExecutorsTab */ @DeveloperApi class ExecutorsListener(storageStatusListener: StorageStatusListener, conf: SparkConf) extends SparkListener { - val executorToTotalCores = HashMap[String, Int]() - val executorToTasksMax = HashMap[String, Int]() - val executorToTasksActive = HashMap[String, Int]() - val executorToTasksComplete = HashMap[String, Int]() - val executorToTasksFailed = HashMap[String, Int]() - val executorToDuration = HashMap[String, Long]() - val executorToJvmGCTime = HashMap[String, Long]() - val executorToInputBytes = HashMap[String, Long]() - val executorToInputRecords = HashMap[String, Long]() - val executorToOutputBytes = HashMap[String, Long]() - val executorToOutputRecords = HashMap[String, Long]() - val executorToShuffleRead = HashMap[String, Long]() - val executorToShuffleWrite = HashMap[String, Long]() - val executorToLogUrls = HashMap[String, Map[String, String]]() - val executorIdToData = HashMap[String, ExecutorUIData]() + var executorToTaskSummary = LinkedHashMap[String, ExecutorTaskSummary]() + var executorEvents = new 
ListBuffer[SparkListenerEvent]() + + private val maxTimelineExecutors = conf.getInt("spark.ui.timeline.executors.maximum", 1000) + private val retainedDeadExecutors = conf.getInt("spark.ui.retainedDeadExecutors", 100) + private var deadExecutorCount = 0 def activeStorageStatusList: Seq[StorageStatus] = storageStatusListener.storageStatusList def deadStorageStatusList: Seq[StorageStatus] = storageStatusListener.deadStorageStatusList override def onExecutorAdded(executorAdded: SparkListenerExecutorAdded): Unit = synchronized { val eid = executorAdded.executorId -executorToLogUrls(eid) = executorAdded.executorInfo.logUrlMap -executorToTotalCores(eid) = executorAdded.executorInfo.totalCores -executorToTasksMax(eid) = executorToTotalCores(eid) / conf.getInt("spark.task.cpus", 1) -executorIdToData(eid) = new ExecutorUIData(executorAdded.time) +val taskSummary = executorToTaskSummary.getOrElseUpdate(eid, ExecutorTaskSummary(eid)) +taskSummary.executorLogs = executorAdded.executorInfo.logUrlMap +taskSummary.totalCores = executorAdded.executorInfo.totalCores +taskSummary.tasksMax = taskSummary.totalCores / conf.getInt("spark.task.cpus", 1) +executorEvents += executorAdded +if (executorEvents.size > maxTimelineExecutors) { + executorEvents.remove(0) +} +if (deadExecutorCount > retainedDeadExecutors) { + val head = executorToTaskSummary.filter(e => !e._2.isAlive).head + executorToTaskSummary.remove(head._1) + deadExecutorCount -= 1 +} } override def onExecutorRemoved( executorRemoved: SparkListenerExecutorRemoved): Unit = synchronized { -val eid = executorRemoved.executorId -val uiData = executorIdToData(eid) -uiData.finishTime = Some(executorRemoved.time) -uiData.finishReason = Some(executorRemoved.reason) +executorEvents += executorRemoved +if (executorEvents.size > maxTimelineExecutors) { + executorEvents.remove(0) +} +deadExecutorCount += 1 +executorToTaskSummary.get(executorRemoved.executorId).map(e => e.isAlive = false) --- End diff -- OK --- If your project is set up 
for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure a
[GitHub] spark issue #14737: [SPARK-17171][WEB UI] DAG will list all partitions in th...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/14737 thank you!
[GitHub] spark pull request #15041: [SPARK-17488][CORE] TakeAndOrder will OOM when th...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/15041#discussion_r78272386 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -493,8 +494,7 @@ abstract class RDD[T: ClassTag]( * * @param weights weights for splits, will be normalized if they don't sum to 1 * @param seed random seed - * - * @return split RDDs in an array +* @return split RDDs in an array --- End diff -- I am so sorry for this. I will revert it later.
[GitHub] spark pull request #15041: [SPARK-17488][CORE] TakeAndOrder will OOM when th...
GitHub user cenyuhai opened a pull request: https://github.com/apache/spark/pull/15041 [SPARK-17488][CORE] TakeAndOrder will OOM when the data is very large ## What changes were proposed in this pull request? The function Utils.takeOrdered sorts all data in memory, so it runs out of memory (OOM) when the data is very large. This PR adds an external sorter to the function takeOrdered. You can merge this pull request into a Git repository by running: $ git pull https://github.com/cenyuhai/spark SPARK-17488 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15041.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15041 commit 869eaaf23f79eefbc6a8ff7a7b9efbc4a9f8c6b7 Author: 岑玉海 <261810...@qq.com> Date: 2016-08-21T03:55:04Z Merge pull request #8 from apache/master merge latest code to my fork commit b6b0d0a41c1aa59bc97a0aa438619d903b78b108 Author: 岑玉海 <261810...@qq.com> Date: 2016-09-06T03:03:08Z Merge pull request #9 from apache/master Merge latest code to my fork commit abd7924eab25b6dfdfd78c23a78dadcb3b9fbe1e Author: 岑玉海 <261810...@qq.com> Date: 2016-09-08T17:10:12Z Merge pull request #10 from apache/master Merge latest code to my fork commit 07ad91b02ad2e644788a7e432472e8c5384a29c6 Author: cenyuhai <cenyu...@didichuxing.com> Date: 2016-09-10T05:17:49Z add exterlnal sorter for takeOrdered function
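To make the memory concern concrete, here is a minimal sketch of taking the k smallest elements without sorting the whole input. Note that this is a bounded heap, a different technique from the external sorter the pull request proposes, and it is not Spark's implementation; the object name `TopK` and its signature are assumptions made for illustration.

```scala
import scala.collection.mutable.PriorityQueue

object TopK {
  // Returns the k smallest elements in ascending order while holding
  // at most k elements in memory at any time.
  def takeOrdered[T](input: Iterator[T], k: Int)(implicit ord: Ordering[T]): Seq[T] = {
    if (k <= 0) {
      Seq.empty
    } else {
      // Max-heap under `ord`: the head is the largest of the kept elements.
      val heap = PriorityQueue.empty[T](ord)
      input.foreach { x =>
        if (heap.size < k) {
          heap.enqueue(x)
        } else if (ord.lt(x, heap.head)) {
          // x is smaller than the current largest kept element: swap it in.
          heap.dequeue()
          heap.enqueue(x)
        }
      }
      // Only k elements remain, so this final sort is cheap.
      heap.toList.sorted(ord)
    }
  }
}
```

Memory stays proportional to k rather than to the input size, which helps only when k itself is small; a very large k is the case where spilling to disk, as the PR suggests, would still be needed.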
[GitHub] spark issue #15014: [SPARK-17429][SQL] use ImplicitCastInputTypes with funct...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/15014 In this case, we store a business type as an int (to decrease record size). For example, xxx are machine error types, are application types.
[GitHub] spark issue #15014: [SPARK-17429][SQL] use ImplicitCastInputTypes with funct...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/15014 @hvanhovell Why is Hive so popular? Because Hive is compatible and stable. From the user's point of view, Hive is easy to use: users need not care about types all the time. I agree that Hive compatibility is not the goal, but making spark-sql easier to use is my goal. Is it yours?
[GitHub] spark issue #14737: [SPARK-17171][WEB UI] DAG will list all partitions in th...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/14737 @srowen If it is ok, can you merge this pr to master? thank you.
[GitHub] spark issue #14969: [SPARK-17406][WEB UI] limit timeline executor events
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/14969 @srowen I removed the parallel maps; please review the latest code. Thank you!
[GitHub] spark pull request #15014: [SPARK-17429][SQL] use ImplicitCastInputTypes wit...
GitHub user cenyuhai opened a pull request: https://github.com/apache/spark/pull/15014 [SPARK-17429][SQL] use ImplicitCastInputTypes with function Length ## What changes were proposed in this pull request? select length(11); select length(2.0); These SQL statements return errors in Spark, but work in Hive. This PR supports casting input types implicitly for the function length. The correct results are: select length(11) returns 2; select length(2.0) returns 3. You can merge this pull request into a Git repository by running: $ git pull https://github.com/cenyuhai/spark SPARK-17429 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15014.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15014 commit 869eaaf23f79eefbc6a8ff7a7b9efbc4a9f8c6b7 Author: 岑玉海 <261810...@qq.com> Date: 2016-08-21T03:55:04Z Merge pull request #8 from apache/master merge latest code to my fork commit b6b0d0a41c1aa59bc97a0aa438619d903b78b108 Author: 岑玉海 <261810...@qq.com> Date: 2016-09-06T03:03:08Z Merge pull request #9 from apache/master Merge latest code to my fork commit abd7924eab25b6dfdfd78c23a78dadcb3b9fbe1e Author: 岑玉海 <261810...@qq.com> Date: 2016-09-08T17:10:12Z Merge pull request #10 from apache/master Merge latest code to my fork commit 51fe8a1d141f700a2b417878c1c19af25d922198 Author: cenyuhai <cenyu...@didichuxing.com> Date: 2016-09-08T17:46:37Z use ImplicitCastInputTypes for Length
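The expected results quoted in this description follow from "cast the argument to a string, then measure it". The sketch below shows that arithmetic in plain Scala on the JVM; it does not use Spark's cast or ImplicitCastInputTypes, and `LengthCast.lengthOf` is a hypothetical helper named only for this illustration.

```scala
object LengthCast {
  // Hypothetical helper: converts the value to its string form first,
  // then returns the number of characters in that string.
  def lengthOf(value: Any): Int = value.toString.length
}
```

Here `LengthCast.lengthOf(11)` measures the string "11" and `LengthCast.lengthOf(2.0)` measures the string "2.0", which matches the 2 and 3 given in the description.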
[GitHub] spark pull request #14969: [SPARK-17406][WEB UI] limit timeline executor eve...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/14969#discussion_r77652830 --- Diff: core/src/main/scala/org/apache/spark/ui/exec/ExecutorsTab.scala --- @@ -70,15 +72,33 @@ class ExecutorsListener(storageStatusListener: StorageStatusListener, conf: Spar executorToLogUrls(eid) = executorAdded.executorInfo.logUrlMap executorToTotalCores(eid) = executorAdded.executorInfo.totalCores executorToTasksMax(eid) = executorToTotalCores(eid) / conf.getInt("spark.task.cpus", 1) -executorIdToData(eid) = new ExecutorUIData(executorAdded.time) +executorEvents += executorAdded +if (executorEvents.size > MAX_EXECUTOR_LIMIT) { + executorEvents = executorEvents.drop(1) +} } override def onExecutorRemoved( executorRemoved: SparkListenerExecutorRemoved): Unit = synchronized { +executorEvents += executorRemoved +if (executorEvents.size > MAX_EXECUTOR_LIMIT) { + executorEvents = executorEvents.drop(1) +} val eid = executorRemoved.executorId -val uiData = executorIdToData(eid) -uiData.finishTime = Some(executorRemoved.time) -uiData.finishReason = Some(executorRemoved.reason) +executorToTotalCores.remove(eid) --- End diff -- let me see.
[GitHub] spark pull request #14969: [SPARK-17406][WEB UI] limit timeline executor eve...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/14969#discussion_r77652206 --- Diff: core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala --- @@ -123,55 +123,55 @@ private[ui] class AllJobsPage(parent: JobsTab) extends WebUIPage("") { } } - private def makeExecutorEvent(executorUIDatas: HashMap[String, ExecutorUIData]): Seq[String] = { + private def makeExecutorEvent(executorUIDatas: Seq[SparkListenerEvent]): + Seq[String] = { val events = ListBuffer[String]() executorUIDatas.foreach { - case (executorId, event) => + case a: SparkListenerExecutorAdded => val addedEvent = s""" |{ | 'className': 'executor added', | 'group': 'executors', - | 'start': new Date(${event.startTime}), + | 'start': new Date(${a.time}), | 'content': 'Executor ${executorId} added' + |'data-title="Executor ${a.executorId}' + + |'Added at ${UIUtils.formatDate(new Date(a.time))}"' + + |'data-html="true">Executor ${a.executorId} added' |} """.stripMargin events += addedEvent + case e: SparkListenerExecutorRemoved => --- End diff -- yes
[GitHub] spark pull request #14969: [SPARK-17406][WEB UI] limit timeline executor eve...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/14969#discussion_r77651971 --- Diff: core/src/main/scala/org/apache/spark/ui/exec/ExecutorsTab.scala --- @@ -70,15 +72,33 @@ class ExecutorsListener(storageStatusListener: StorageStatusListener, conf: Spar executorToLogUrls(eid) = executorAdded.executorInfo.logUrlMap executorToTotalCores(eid) = executorAdded.executorInfo.totalCores executorToTasksMax(eid) = executorToTotalCores(eid) / conf.getInt("spark.task.cpus", 1) -executorIdToData(eid) = new ExecutorUIData(executorAdded.time) +executorEvents += executorAdded +if (executorEvents.size > MAX_EXECUTOR_LIMIT) { + executorEvents = executorEvents.drop(1) --- End diff -- Because the drop function doesn't actually remove the element; it just returns a new collection.
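The difference this comment describes can be shown with a small standalone sketch. It uses only the standard `scala.collection.mutable.ListBuffer`; the object name `DropVsRemove` and the integer events are assumptions made for illustration.

```scala
import scala.collection.mutable.ListBuffer

object DropVsRemove {
  // Returns (result of drop, buffer contents after drop, buffer contents after remove).
  def demo(): (List[Int], List[Int], List[Int]) = {
    val events = ListBuffer(1, 2, 3)
    // drop builds a new collection; `events` itself is left unchanged,
    // which is why the diff reassigns the result to the variable.
    val dropped = events.drop(1)
    val afterDrop = events.toList
    // remove(0) deletes the first element in place.
    events.remove(0)
    (dropped.toList, afterDrop, events.toList)
  }
}
```

Reassigning the result of `drop` requires a `var`, whereas `remove(0)` mutates the existing buffer and works with a `val`.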
[GitHub] spark pull request #14969: [SPARK-17406][WEB UI] limit timeline executor eve...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/14969#discussion_r77651759 --- Diff: core/src/main/scala/org/apache/spark/ui/exec/ExecutorsTab.scala --- @@ -59,7 +59,9 @@ class ExecutorsListener(storageStatusListener: StorageStatusListener, conf: Spar val executorToShuffleRead = HashMap[String, Long]() val executorToShuffleWrite = HashMap[String, Long]() val executorToLogUrls = HashMap[String, Map[String, String]]() - val executorIdToData = HashMap[String, ExecutorUIData]() + var executorEvents = new mutable.ListBuffer[SparkListenerEvent]() + + val MAX_EXECUTOR_LIMIT = conf.getInt("spark.ui.timeline.executors.maximum", 1000) --- End diff -- No, it is about executors (SparkListenerExecutorAdded and SparkListenerExecutorRemoved).
[GitHub] spark issue #14969: [SPARK-17406][WEB-UI] limit timeline executor events
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/14969 [error] * method executorIdToData()scala.collection.mutable.HashMap in class org.apache.spark.ui.exec.ExecutorsListener does not have a correspondent in current version [error]filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ui.exec.ExecutorsListener.executorIdToData") I have removed "executorIdToData"; why does it fail?
[GitHub] spark pull request #14969: [SPARK-17406][WEB-UI] limit timeline executor eve...
GitHub user cenyuhai opened a pull request: https://github.com/apache/spark/pull/14969 [SPARK-17406][WEB-UI] limit timeline executor events ## What changes were proposed in this pull request? The job page is too slow to open when there are thousands of executor events (added or removed). I found that in the ExecutorsTab file, executorIdToData never removes elements, so it grows all the time. Before this PR, it looks like [timeline1.png](https://issues.apache.org/jira/secure/attachment/12827112/timeline1.png). After this PR, it looks like [timeline2.png](https://issues.apache.org/jira/secure/attachment/12827113/timeline2.png) (we can set how many events will be displayed). You can merge this pull request into a Git repository by running: $ git pull https://github.com/cenyuhai/spark SPARK-17406 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14969.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14969 commit c368f885aa539da622f95093c51205af11c9d7a1 Author: cenyuhai <cenyu...@didichuxing.com> Date: 2016-09-06T05:25:53Z limit timeline executor events
[GitHub] spark pull request #14966: Merge pull request #8 from apache/master
Github user cenyuhai closed the pull request at: https://github.com/apache/spark/pull/14966
[GitHub] spark issue #14966: Merge pull request #8 from apache/master
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/14966 Sorry, I made a mistake... I wanted to merge the pull request into my fork.
[GitHub] spark pull request #14966: Merge pull request #8 from apache/master
GitHub user cenyuhai opened a pull request: https://github.com/apache/spark/pull/14966 Merge pull request #8 from apache/master ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) merge latest code to my fork You can merge this pull request into a Git repository by running: $ git pull https://github.com/cenyuhai/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14966.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14966 commit 869eaaf23f79eefbc6a8ff7a7b9efbc4a9f8c6b7 Author: 岑玉海 <261810...@qq.com> Date: 2016-08-21T03:55:04Z Merge pull request #8 from apache/master merge latest code to my fork
[GitHub] spark issue #14737: [SPARK-17171][WEB UI] DAG will list all partitions in th...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/14737 @srowen I set this value to 2 because a "JOIN" action needs 2 elements. Users usually don't care how many partitions the graph has; they just want to see the relations in the DAG graph. I have changed the code as @markhamstra suggested, so it will not remove elements by default. Users can set this value as they like, but I recommend 2.
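The effect of retaining only a limited number of root nodes can be sketched as below. This is a simplified illustration, not the code under review: `Node` is a hypothetical stand-in for Spark's `RDDInfo`, `DagPruning.pruneGraph` is an invented name, and the sketch assumes that every parent has a smaller id than its children.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for an RDD in the DAG.
final case class Node(id: Int, parentIds: Seq[Int])

object DagPruning {
  // Keep at most `retainedRoots` root nodes (nodes without parents) and keep
  // a non-root node only if at least one of its parents was kept.
  // Assumption: parents always have smaller ids than their children.
  def pruneGraph(nodes: Seq[Node], retainedRoots: Int): Seq[Node] = {
    val kept = mutable.HashSet.empty[Int]
    var roots = 0
    nodes.sortBy(_.id).filter { node =>
      val keep =
        if (node.parentIds.isEmpty) {
          roots += 1
          roots <= retainedRoots
        } else {
          node.parentIds.exists(kept.contains)
        }
      if (keep) kept += node.id
      keep
    }
  }
}
```

With `retainedRoots = 2`, a join of two inputs still shows both sides, which matches the reasoning given above for preferring 2.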
[GitHub] spark issue #14737: [SPARK-17171][WEB UI] DAG will list all partitions in th...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/14737 I am very sorry about that. The first picture is for the stage and the second picture is for the job, but it is the same job: "select count(1) from partitionedTables"
[GitHub] spark issue #14737: [SPARK-17171][WEB UI] DAG will list all partitions in th...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/14737 @srowen please review the latest code, thank you.
[GitHub] spark issue #14739: [SPARK-17176][WEB UI]set default task sort column to "St...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/14739 @srowen can we make it an option that defaults to "Index", so that users can choose "Status" or anything else?
[GitHub] spark issue #14737: [Spark-17171][WEB UI] DAG will list all partitions in th...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/14737 @srowen
[GitHub] spark pull request #14737: [Spark-17171][WEB UI] DAG will list all partition...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/14737#discussion_r75593904 --- Diff: core/src/main/scala/org/apache/spark/ui/scope/RDDOperationGraph.scala --- @@ -119,18 +119,47 @@ private[ui] object RDDOperationGraph extends Logging { { if (stage.attemptId == 0) "" else s" (attempt ${stage.attemptId})" } val rootCluster = new RDDOperationCluster(stageClusterId, stageClusterName) +var rootNodeCount = 0 +val addRDDIds = new mutable.HashSet[Int]() +val dropRDDIds = new mutable.HashSet[Int]() + +def isAllowed(ids: mutable.HashSet[Int], rdd: RDDInfo): Boolean = { + val parentIds = rdd.parentIds + if (parentIds.size == 0) { +rootNodeCount < retainedNodes + } else { +if (ids.size > 0) { --- End diff -- yes, you are right...
[GitHub] spark pull request #14737: [Spark-17171][WEB UI] DAG will list all partition...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/14737#discussion_r75593902 --- Diff: core/src/main/scala/org/apache/spark/ui/scope/RDDOperationGraph.scala --- @@ -119,18 +119,47 @@ private[ui] object RDDOperationGraph extends Logging { { if (stage.attemptId == 0) "" else s" (attempt ${stage.attemptId})" } val rootCluster = new RDDOperationCluster(stageClusterId, stageClusterName) +var rootNodeCount = 0 +val addRDDIds = new mutable.HashSet[Int]() +val dropRDDIds = new mutable.HashSet[Int]() + +def isAllowed(ids: mutable.HashSet[Int], rdd: RDDInfo): Boolean = { + val parentIds = rdd.parentIds + if (parentIds.size == 0) { +rootNodeCount < retainedNodes + } else { +if (ids.size > 0) { +parentIds.exists(id => ids.contains(id) || !dropRDDIds.contains(id)) +} else { +true +} + } +} + // Find nodes, edges, and operation scopes that belong to this stage -stage.rddInfos.foreach { rdd => - edges ++= rdd.parentIds.map { parentId => RDDOperationEdge(parentId, rdd.id) } +stage.rddInfos.sortBy(_.id).foreach { rdd => + val keepNode: Boolean = isAllowed(addRDDIds, rdd) --- End diff -- OK.
[GitHub] spark issue #14739: [SPARK-17176][WEB UI]set default task sort column to "St...
Github user cenyuhai commented on the issue: https://github.com/apache/spark/pull/14739 Yes, "FAILED" will come before "RUNNING". That is what I want, because we need to know why tasks fail more than we need to sort by ID. The ID is just a unique identifier for a task; in most cases, we don't care about it. But we do care about why tasks fail and why tasks run for such a long time. When there are too many tasks, it is not easy to sort by status because it is very slow.
[GitHub] spark pull request #14739: [SPARK-17176][WEB UI]set default task sort column...
GitHub user cenyuhai opened a pull request: https://github.com/apache/spark/pull/14739 [SPARK-17176][WEB UI]set default task sort column to "Status" ## What changes were proposed in this pull request? Tasks are sorted by "Index" on the Stage page, but users are usually concerned about tasks that have failed (to see the error messages) or are still running (maybe the data is skewed). When there are too many tasks, sorting is too slow. So it is better to set the default sort column to "Status". You can merge this pull request into a Git repository by running: $ git pull https://github.com/cenyuhai/spark SPARK-17176 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14739.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14739 commit c1f3c9e90d9c465eb356b01d136a963ff1e75fc3 Author: cenyuhai <cenyu...@didichuxing.com> Date: 2016-08-21T06:49:15Z set default task sort column to "Status"
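A status-first default ordering of the kind proposed here could look roughly like the following. It is a minimal sketch under assumptions: `TaskRow`, `TaskOrdering` and the explicit rank table are hypothetical and are not taken from Spark's Stage page code, which may simply compare the status strings.

```scala
// Hypothetical, simplified row of the task table.
final case class TaskRow(index: Int, status: String)

object TaskOrdering {
  // Assumed ranking for illustration: lower rank sorts first, and statuses
  // that are not listed fall to the end.
  private val rank: Map[String, Int] =
    Map("FAILED" -> 0, "RUNNING" -> 1, "SUCCESS" -> 2)

  // Sort by status rank first, then by task index within the same status.
  val byStatusThenIndex: Ordering[TaskRow] =
    Ordering.by((t: TaskRow) => (rank.getOrElse(t.status, Int.MaxValue), t.index))
}
```

Sorting with `rows.sorted(TaskOrdering.byStatusThenIndex)` puts failed tasks at the top and running tasks next, which is the view the description argues users want by default.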
[GitHub] spark pull request #14737: [Spark-17171][WEB UI] DAG will list all partition...
GitHub user cenyuhai opened a pull request: https://github.com/apache/spark/pull/14737 [Spark-17171][WEB UI] DAG will list all partitions in the graph ## What changes were proposed in this pull request? The DAG lists all partitions in the graph, which is too slow and makes the whole graph hard to read. Usually we don't want to see all partitions; we just want to see the relations in the DAG graph. So I show only 2 root nodes for RDDs. Before this PR, the DAG graph looks like [dag1.png](https://issues.apache.org/jira/secure/attachment/12824702/dag1.png); after this PR, the DAG graph looks like [dag2.png](https://issues.apache.org/jira/secure/attachment/12824703/dag2.png). You can merge this pull request into a Git repository by running: $ git pull https://github.com/cenyuhai/spark SPARK-17171 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14737.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14737 commit 7991d7622260bc8e65ee9b934d376df2597c9a11 Author: cenyuhai <cenyu...@didichuxing.com> Date: 2016-08-20T15:44:38Z Just show 2 root partitions for a stage commit 869eaaf23f79eefbc6a8ff7a7b9efbc4a9f8c6b7 Author: 岑玉海 <261810...@qq.com> Date: 2016-08-21T03:55:04Z Merge pull request #8 from apache/master merge latest code to my fork commit 595453fbb2ccdd4009821724adefb829a13890c7 Author: cenyuhai <cenyu...@didichuxing.com> Date: 2016-08-21T04:06:06Z Merge remote-tracking branch 'remotes/origin/master' into SPARK-17171
[GitHub] spark pull request: [SPARK-13566][CORE] Avoid deadlock between Blo...
Github user cenyuhai closed the pull request at: https://github.com/apache/spark/pull/11546
[GitHub] spark pull request: [SPARK-13566][CORE] Avoid deadlock between Blo...
Github user cenyuhai commented on the pull request: https://github.com/apache/spark/pull/11546#issuecomment-217467789 ok to test
[GitHub] spark pull request: [SPARK-13566][CORE] Avoid deadlock between Blo...
Github user cenyuhai commented on the pull request: https://github.com/apache/spark/pull/11546#issuecomment-217466944 @andrewor14 I changed the code as you said, but the test failed because of a timeout. It seems to be unrelated to my change...
[GitHub] spark pull request: [SPARK-13566][CORE] Avoid deadlock between Blo...
Github user cenyuhai commented on the pull request: https://github.com/apache/spark/pull/11546#issuecomment-212208592 @jamesecahill I don't know whether @JoshRosen will provide another patch for this issue, but I have fixed this bug in my production environment with this PR.
[GitHub] spark pull request: [Spark-13772][SQL] fix data type mismatch for ...
Github user cenyuhai closed the pull request at: https://github.com/apache/spark/pull/11605
[GitHub] spark pull request: [SPARK-13566] Avoid deadlock between BlockMana...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/11546#discussion_r55668498 --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala --- @@ -227,6 +228,17 @@ private[spark] class Executor( logError(errMsg) } } + + if (releasedLocks.nonEmpty) { +val errMsg = + s"${releasedLocks.size} block locks were not released by TID = $taskId:\n" + --- End diff -- In my production environment, when the storage memory is full, there is a high probability of deadlock. This is a temporary patch because JoshRosen added a read/write lock for blocks in https://github.com/apache/spark/pull/10705 for Spark 2.0. Two threads removing the same block result in a deadlock: BlockManager first locks MemoryManager and waits to lock BlockInfo in the function 'dropFromMemory', while the Executor task locks BlockInfo and waits to lock MemoryManager when calling 'memstore.remove(block)' in the function 'removeBlock' or 'removeOldBlocks'. So I just use a ConcurrentHashMap to record the locks held by tasks. In case of failure, all locks are released after the task completes.
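The bookkeeping described above, a ConcurrentHashMap that records which block locks each task holds so that leftovers can be released when the task finishes, might be sketched like this. It is an illustration under assumptions rather than the patch itself: `TaskLockRegistry`, its method names and the `release` callback are hypothetical, and block ids are modeled as plain strings.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._ // deprecated in Scala 2.13, still available

// Hypothetical registry: remembers which block ids each task has locked.
// `release` is an assumed callback that actually unlocks a block.
final class TaskLockRegistry(release: String => Unit) {
  // taskId -> block ids currently locked on behalf of that task
  private val held = new ConcurrentHashMap[Long, java.util.Set[String]]()

  def acquired(taskId: Long, blockId: String): Unit = {
    held.putIfAbsent(taskId, ConcurrentHashMap.newKeySet[String]())
    held.get(taskId).add(blockId)
  }

  def released(taskId: Long, blockId: String): Unit =
    Option(held.get(taskId)).foreach(_.remove(blockId))

  // Call when the task completes (for example in a finally block): releases
  // whatever the task still holds and returns those block ids for logging.
  def releaseAllFor(taskId: Long): List[String] = {
    val leaked = Option(held.remove(taskId)).map(_.asScala.toList).getOrElse(Nil)
    leaked.foreach(release)
    leaked
  }
}
```

The list returned by `releaseAllFor` is the kind of information the quoted diff logs as "block locks were not released by TID". Note that this only cleans up after a task; it does not change the order in which the two locks are taken.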
[GitHub] spark pull request: [Spark-13772] fix data type mismatch for decim...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/11605#discussion_r55665217 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercionSuite.scala --- @@ -299,6 +299,19 @@ class HiveTypeCoercionSuite extends PlanTest { ) } + test("test for SPARK-13772") { +val rule = HiveTypeCoercion.IfCoercion +ruleTest(rule, + If(Literal(true), Literal(1.0), Cast(Literal(1.0), DecimalType(19, 0))), --- End diff -- It is OK in Hive 1.2.1 and Spark 1.4.1. Test case: select if(1=1, cast(1 as double), cast(1 as decimal)) from test
[GitHub] spark pull request: [SPARK-13566] Avoid deadlock between BlockMana...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/11546#discussion_r55664945 --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala --- @@ -227,6 +228,17 @@ private[spark] class Executor( logError(errMsg) } } + + if (releasedLocks.nonEmpty) { +val errMsg = + s"${releasedLocks.size} block locks were not released by TID = $taskId:\n" + --- End diff -- This code is from https://github.com/apache/spark/pull/10705 by JoshRosen
[GitHub] spark pull request: [Spark-13772] fix data type mismatch for decim...
GitHub user cenyuhai opened a pull request: https://github.com/apache/spark/pull/11605 [Spark-13772] fix data type mismatch for decimal fix data type mismatch for decimal, patch for branch 1.6 You can merge this pull request into a Git repository by running: $ git pull https://github.com/cenyuhai/spark SPARK-13772 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11605.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11605 commit 5236dcbc4afc31293a550d2fd87419bcc4ba7e61 Author: cenyuhai <cenyu...@didichuxing.com> Date: 2016-03-06T11:30:43Z temp patch for SPARK-13566 commit 8d539df190a21e7d3d93ab078866267ece2f1df0 Author: cenyuhai <cenyu...@didichuxing.com> Date: 2016-03-09T12:52:19Z fix data type mismatch for decimal commit 42addd64bfb864ff59fecc5c4c11852d7cd49f60 Author: cenyuhai <cenyu...@didichuxing.com> Date: 2016-03-09T12:57:26Z rebase to branch 1.6
[GitHub] spark pull request: [SPARK-13566] Avoid deadlock between BlockMana...
GitHub user cenyuhai opened a pull request: https://github.com/apache/spark/pull/11546 [SPARK-13566] Avoid deadlock between BlockManager and Executor Thread Temp patch for branch 1.6, avoid deadlock between BlockManager and Executor Thread. You can merge this pull request into a Git repository by running: $ git pull https://github.com/cenyuhai/spark SPARK-13566 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11546.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11546 commit 3f2ac8d6d977da2577c56f3bfcd51e8b053d952d Author: cenyuhai <cenyu...@didichuxing.com> Date: 2016-03-06T11:30:43Z temp patch for SPARK-13566
[GitHub] spark pull request: Multi user
Github user cenyuhai closed the pull request at: https://github.com/apache/spark/pull/6812
[GitHub] spark pull request: Multi user
Github user cenyuhai commented on the pull request: https://github.com/apache/spark/pull/6812#issuecomment-111811208 I am so sorry, I just pushed my commits to my branch. I didn't know this would happen.
[GitHub] spark pull request: Multi user
GitHub user cenyuhai opened a pull request: https://github.com/apache/spark/pull/6812 Multi user You can merge this pull request into a Git repository by running: $ git pull https://github.com/cenyuhai/spark MultiUser Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/6812.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #6812 commit e18d623d93505bc5fddeec0281ee3baef3638c3e Author: Santiago M. Mola sa...@mola.io Date: 2015-05-22T22:10:27Z [SPARK-7724] [SQL] Support Intersect/Except in Catalyst DSL. Author: Santiago M. Mola sa...@mola.io Closes #6327 from smola/feature/catalyst-dsl-set-ops and squashes the following commits: 11db778 [Santiago M. Mola] [SPARK-7724] [SQL] Support Intersect/Except in Catalyst DSL. (cherry picked from commit e4aef91fe70d6c9765d530b913a9d79103fc27ce) Signed-off-by: Michael Armbrust mich...@databricks.com commit d6cb0446304c5cc438e2bcabd8b39ea4c408a2da Author: Liang-Chi Hsieh vii...@gmail.com Date: 2015-05-22T22:39:58Z [SPARK-7270] [SQL] Consider dynamic partition when inserting into hive table JIRA: https://issues.apache.org/jira/browse/SPARK-7270 Author: Liang-Chi Hsieh vii...@gmail.com Closes #5864 from viirya/dyn_partition_insert and squashes the following commits: b5627df [Liang-Chi Hsieh] For comments. 3b21e4b [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into dyn_partition_insert 8a4352d [Liang-Chi Hsieh] Consider dynamic partition when inserting into hive table. 
(cherry picked from commit 126d7235de649ea5619dee6ad3a70970ee90df93) Signed-off-by: Michael Armbrust mich...@databricks.com commit afde4019b81b6feb57f0374f2cb097844ea4c04b Author: Imran Rashid iras...@cloudera.com Date: 2015-05-22T23:05:07Z [SPARK-7760] add /json back into master worker pages; add test Author: Imran Rashid iras...@cloudera.com Closes #6284 from squito/SPARK-7760 and squashes the following commits: 5e02d8a [Imran Rashid] style; increase timeout 9987399 [Imran Rashid] comment 8c7ed63 [Imran Rashid] add /json back into master worker pages; add test (cherry picked from commit 821254fb945c3e19540eb57fff1f656737ef484b) Signed-off-by: Josh Rosen joshro...@databricks.com commit d7660dc2f5c53dd6b3ffc57b05c0daa67a16f5f3 Author: Michael Armbrust mich...@databricks.com Date: 2015-05-23T00:23:12Z [SPARK-7834] [SQL] Better window error messages Author: Michael Armbrust mich...@databricks.com Closes #6363 from marmbrus/windowErrors and squashes the following commits: 516b02d [Michael Armbrust] [SPARK-7834] [SQL] Better window error messages (cherry picked from commit 3c1305107a2d6d2de862e8b41dbad0e85585b1ef) Signed-off-by: Michael Armbrust mich...@databricks.com commit 0be6e3b3e60768012e2337d1cbf2967275007a11 Author: Andrew Or and...@databricks.com Date: 2015-05-23T00:37:38Z [SPARK-7771] [SPARK-7779] Dynamic allocation: lower default timeouts further The default add time of 5s is still too slow for small jobs. Also, the current default remove time of 10 minutes seem rather high. This patch lowers both and rephrases a few log messages. 
Author: Andrew Or and...@databricks.com Closes #6301 from andrewor14/da-minor and squashes the following commits: 6d614a6 [Andrew Or] Lower log level 2811492 [Andrew Or] Log information when requests are canceled 5fcd3eb [Andrew Or] Fix tests 3320710 [Andrew Or] Lower timeouts + rephrase a few log messages (cherry picked from commit 3d8760d76eae41dcaab8e9aeda19619f3d5f1596) Signed-off-by: Andrew Or and...@databricks.com commit 130ec219aa40cd8cebf4105053d4c92d840e127e Author: Tathagata Das tathagata.das1...@gmail.com Date: 2015-05-23T00:39:01Z [SPARK-7788] Made KinesisReceiver.onStart() non-blocking KinesisReceiver calls worker.run() which is a blocking call (while loop) as per source code of kinesis-client library - https://github.com/awslabs/amazon-kinesis-client/blob/v1.2.1/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/Worker.java. This results in infinite loop while calling sparkStreamingContext.stop(stopSparkContext = false, stopGracefully = true) perhaps because ReceiverTracker is never able to register the receiver (it's receiverInfo field is a empty map) causing it to be stuck in infinite loop while waiting for running flag to be set to false. Author: Tathagata Das tathagata.das1...@gmail.com Closes #6348 from tdas/SPARK-7788 and squashes the following commits: 2584683 [Tathagata Das] Added receiver id in thread name 6cf1cd4 [Tathagata Das] Made KinesisReceiver.onStart non-blocking (cherry