[GitHub] [spark] HyukjinKwon commented on a change in pull request #24589: [MINOR][SS]Remove duplicate 'add' in comment of `StructuredSessionization`.

2019-05-13 Thread GitBox
HyukjinKwon commented on a change in pull request #24589: [MINOR][SS]Remove 
duplicate 'add' in comment of `StructuredSessionization`.
URL: https://github.com/apache/spark/pull/24589#discussion_r283635769
 
 

 ##
 File path: 
examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala
 ##
 @@ -70,15 +70,15 @@ object StructuredSessionization {
 line.split(" ").map(word => Event(sessionId = word, timestamp))
   }
 
-// Sessionize the events. Track number of events, start and end timestamps of session, and
+// Sessionize the events. Track number of events, start and end timestamps of session,
 // and report session updates.
 val sessionUpdates = events
   .groupByKey(event => event.sessionId)
   .mapGroupsWithState[SessionInfo, SessionUpdate](GroupStateTimeout.ProcessingTimeTimeout) {
 
 case (sessionId: String, events: Iterator[Event], state: GroupState[SessionInfo]) =>
 
-  // If timed out, then remove session and send final update
+  // If timed out, remove session and send final update
 
 Review comment:
   Hm, the other two occurrences are fine.
   
   I'll just get this in - let's keep only the `and` typo fix in this PR
   
   ```diff
   -// Sessionize the events. Track number of events, start and end timestamps of session, and
   +// Sessionize the events. Track number of events, start and end timestamps of session,
   ```





[GitHub] [spark] cloud-fan commented on a change in pull request #24570: [SPARK-24923][SQL] Implement v2 CreateTableAsSelect

2019-05-13 Thread GitBox
cloud-fan commented on a change in pull request #24570: [SPARK-24923][SQL] 
Implement v2 CreateTableAsSelect
URL: https://github.com/apache/spark/pull/24570#discussion_r283635315
 
 

 ##
 File path: 
sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2SQLSuite.scala
 ##
 @@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2
+
+import scala.collection.JavaConverters._
+
+import org.scalatest.BeforeAndAfter
+
+import org.apache.spark.sql.{AnalysisException, QueryTest}
+import org.apache.spark.sql.catalog.v2.Identifier
+import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException
+import org.apache.spark.sql.execution.datasources.v2.orc.OrcDataSourceV2
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types.{LongType, StringType, StructType}
+
+class DataSourceV2SQLSuite extends QueryTest with SharedSQLContext with BeforeAndAfter {
+
+  import org.apache.spark.sql.catalog.v2.CatalogV2Implicits._
+
+  private val orc2 = classOf[OrcDataSourceV2].getName
+
+  before {
+    spark.conf.set("spark.sql.catalog.testcat", classOf[TestInMemoryTableCatalog].getName)
+    spark.conf.set("spark.sql.default.catalog", "testcat")
 
 Review comment:
   I think it has been moved to another PR?





[GitHub] [spark] cloud-fan commented on a change in pull request #24570: [SPARK-24923][SQL] Implement v2 CreateTableAsSelect

2019-05-13 Thread GitBox
cloud-fan commented on a change in pull request #24570: [SPARK-24923][SQL] 
Implement v2 CreateTableAsSelect
URL: https://github.com/apache/spark/pull/24570#discussion_r283635032
 
 

 ##
 File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala
 ##
 @@ -47,6 +51,60 @@ case class WriteToDataSourceV2(batchWrite: BatchWrite, query: LogicalPlan)
   override def output: Seq[Attribute] = Nil
 }
 
+/**
+ * Physical plan node for v2 create table as select.
+ *
+ * A new table will be created using the schema of the query, and rows from the query are
+ * appended. If either table creation or the append fails, the table will be deleted. This
+ * implementation does not provide an atomic CTAS.
+ */
+case class CreateTableAsSelectExec(
+    catalog: TableCatalog,
+    ident: Identifier,
+    partitioning: Seq[Transform],
+    query: SparkPlan,
+    properties: Map[String, String],
+    writeOptions: CaseInsensitiveStringMap,
+    ifNotExists: Boolean) extends V2TableWriteExec {
+
+  import org.apache.spark.sql.catalog.v2.CatalogV2Implicits.IdentifierHelper
+
+  override protected def doExecute(): RDD[InternalRow] = {
+    if (catalog.tableExists(ident)) {
+      if (ifNotExists) {
+        return sparkContext.parallelize(Seq.empty, 1)
+      }
+
+      throw new TableAlreadyExistsException(ident)
+    }
+
+    Utils.tryWithSafeFinallyAndFailureCallbacks({
+      catalog.createTable(ident, query.schema, partitioning.toArray, properties.asJava) match {
+        case table: SupportsWrite =>
+          val builder = table.newWriteBuilder(writeOptions)
+            .withInputDataSchema(query.schema)
+            .withQueryId(UUID.randomUUID().toString)
+          val batchWrite = builder match {
+            case supportsSaveMode: SupportsSaveMode =>
+              supportsSaveMode.mode(SaveMode.Append).buildForBatch()
 
 Review comment:
   I think `SupportsSaveMode` is a hack for `TableProvider` only. It seems we don't need to deal with it for `TableCatalog`. Anyway, it will be removed soon.
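   For reference, a minimal sketch (not the PR's actual code) of what this match could reduce to once `SupportsSaveMode` is gone, assuming the v2 `WriteBuilder` shown in the hunk above:
   
   ```scala
   // Sketch only: for CTAS the table was just created, so a plain batch
   // write appends into an empty table and no SaveMode handling is needed.
   val batchWrite = builder match {
     case supportsSaveMode: SupportsSaveMode =>
       supportsSaveMode.mode(SaveMode.Append).buildForBatch() // transitional hack
     case _ =>
       builder.buildForBatch()
   }
   ```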





[GitHub] [spark] cloud-fan commented on a change in pull request #24570: [SPARK-24923][SQL] Implement v2 CreateTableAsSelect

2019-05-13 Thread GitBox
cloud-fan commented on a change in pull request #24570: [SPARK-24923][SQL] 
Implement v2 CreateTableAsSelect
URL: https://github.com/apache/spark/pull/24570#discussion_r283634702
 
 

 ##
 File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceResolution.scala
 ##
 @@ -112,4 +129,41 @@ case class DataSourceResolution(conf: SQLConf) extends Rule[LogicalPlan] with Ca
       properties = properties,
       comment = comment)
   }
+
+  private def convertCTAS(
+      catalog: TableCatalog,
+      identifier: Identifier,
+      ctas: CreateTableAsSelectStatement): CreateTableAsSelect = {
+    if (ctas.options.contains("path") && ctas.location.isDefined) {
+      throw new AnalysisException(
+        "LOCATION and 'path' in OPTIONS are both used to indicate the custom table path, " +
 
 Review comment:
   And shall we do the same check for `provider` and `comment`?





[GitHub] [spark] cloud-fan commented on a change in pull request #24570: [SPARK-24923][SQL] Implement v2 CreateTableAsSelect

2019-05-13 Thread GitBox
cloud-fan commented on a change in pull request #24570: [SPARK-24923][SQL] 
Implement v2 CreateTableAsSelect
URL: https://github.com/apache/spark/pull/24570#discussion_r283634338
 
 

 ##
 File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceResolution.scala
 ##
 @@ -112,4 +129,41 @@ case class DataSourceResolution(conf: SQLConf) extends Rule[LogicalPlan] with Ca
       properties = properties,
       comment = comment)
   }
+
+  private def convertCTAS(
+      catalog: TableCatalog,
+      identifier: Identifier,
+      ctas: CreateTableAsSelectStatement): CreateTableAsSelect = {
+    if (ctas.options.contains("path") && ctas.location.isDefined) {
+      throw new AnalysisException(
+        "LOCATION and 'path' in OPTIONS are both used to indicate the custom table path, " +
+        "you can only specify one of them.")
+    }
+
+    val options = ctas.options.filterKeys(_ != "path")
+
+    // convert the bucket spec and add it as a transform
+    val partitioning = ctas.partitioning ++ ctas.bucketSpec.map(_.asTransform)
+
+    // create table properties from TBLPROPERTIES and OPTIONS clauses
+    val properties = new mutable.HashMap[String, String]()
+    properties ++= ctas.properties
+    properties ++= options
+
+    // convert USING, LOCATION, and COMMENT clauses to table properties
+    properties += ("provider" -> ctas.provider)
+    ctas.comment.map(text => properties += ("comment" -> text))
 
 Review comment:
   This is a good point. When saving Spark tables to the Hive metastore, we need to store some Spark-specific information in the table properties, and we always use `spark.sql.` as the prefix for the property keys.
   
   ~Shall we follow that convention and add a prefix here as well?~
   
   EDIT:
   Since data source implementations need to know these properties, shall we just document these special properties?








[GitHub] [spark] viirya commented on a change in pull request #24576: [SPARK-27671][SQL] Fix error when casting from a nested null in a struct

2019-05-13 Thread GitBox
viirya commented on a change in pull request #24576: [SPARK-27671][SQL] Fix 
error when casting from a nested null in a struct
URL: https://github.com/apache/spark/pull/24576#discussion_r283634266
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
 ##
 @@ -620,6 +620,12 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
     // We can return what the children return. Same thing should happen in the codegen path.
     if (DataType.equalsStructurally(from, to)) {
       identity
+    } else if (from == NullType) {
+      // According to `canCast`, NullType can be casted to any type.
+      // For primitive types, we don't reach here because the guard of `nullSafeEval`.
+      // But for nested types like struct, we might reach here for nested null type field.
+      // We won't call the returned function actually, but returns a placeholder.
+      _ => throw new SparkException(s"should not directly cast from NullType to $to.")
 
 Review comment:
   If we reach here, it means we called the cast function on a null unnecessarily.





[GitHub] [spark] AmplabJenkins removed a comment on issue #24596: [SPARK-27694][SQL] CTAS created data source table support collect statistics

2019-05-13 Thread GitBox
AmplabJenkins removed a comment on issue #24596: [SPARK-27694][SQL] CTAS 
created data source table support collect statistics
URL: https://github.com/apache/spark/pull/24596#issuecomment-492087546
 
 
   Merged build finished. Test PASSed.





[GitHub] [spark] AmplabJenkins removed a comment on issue #24596: [SPARK-27694][SQL] CTAS created data source table support collect statistics

2019-05-13 Thread GitBox
AmplabJenkins removed a comment on issue #24596: [SPARK-27694][SQL] CTAS 
created data source table support collect statistics
URL: https://github.com/apache/spark/pull/24596#issuecomment-492087551
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105373/
   Test PASSed.





[GitHub] [spark] AmplabJenkins commented on issue #24596: [SPARK-27694][SQL] CTAS created data source table support collect statistics

2019-05-13 Thread GitBox
AmplabJenkins commented on issue #24596: [SPARK-27694][SQL] CTAS created data 
source table support collect statistics
URL: https://github.com/apache/spark/pull/24596#issuecomment-492087546
 
 
   Merged build finished. Test PASSed.





[GitHub] [spark] AmplabJenkins commented on issue #24596: [SPARK-27694][SQL] CTAS created data source table support collect statistics

2019-05-13 Thread GitBox
AmplabJenkins commented on issue #24596: [SPARK-27694][SQL] CTAS created data 
source table support collect statistics
URL: https://github.com/apache/spark/pull/24596#issuecomment-492087551
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105373/
   Test PASSed.





[GitHub] [spark] cloud-fan commented on a change in pull request #24570: [SPARK-24923][SQL] Implement v2 CreateTableAsSelect

2019-05-13 Thread GitBox
cloud-fan commented on a change in pull request #24570: [SPARK-24923][SQL] 
Implement v2 CreateTableAsSelect
URL: https://github.com/apache/spark/pull/24570#discussion_r283633844
 
 

 ##
 File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceResolution.scala
 ##
 @@ -112,4 +129,41 @@ case class DataSourceResolution(conf: SQLConf) extends 
Rule[LogicalPlan] with Ca
   properties = properties,
   comment = comment)
   }
+
+  private def convertCTAS(
+  catalog: TableCatalog,
+  identifier: Identifier,
+  ctas: CreateTableAsSelectStatement): CreateTableAsSelect = {
+if (ctas.options.contains("path") && ctas.location.isDefined) {
+  throw new AnalysisException(
+"LOCATION and 'path' in OPTIONS are both used to indicate the custom 
table path, " +
 
 Review comment:
   Just out of curiosity, does v1 fail in this case too?
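   For illustration, a hypothetical CTAS statement that specifies the table path twice and should trip the check above (table and path names are made up):
   
   ```scala
   // Both LOCATION and the 'path' option point at a custom table path,
   // so DataSourceResolution should raise the AnalysisException shown above.
   spark.sql(
     """CREATE TABLE t USING parquet
       |OPTIONS (path '/tmp/path_a')
       |LOCATION '/tmp/path_b'
       |AS SELECT 1 AS col""".stripMargin)
   ```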





[GitHub] [spark] SparkQA removed a comment on issue #24596: [SPARK-27694][SQL] CTAS created data source table support collect statistics

2019-05-13 Thread GitBox
SparkQA removed a comment on issue #24596: [SPARK-27694][SQL] CTAS created data 
source table support collect statistics
URL: https://github.com/apache/spark/pull/24596#issuecomment-492059055
 
 
   **[Test build #105373 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105373/testReport)** for PR 24596 at commit [`914f85a`](https://github.com/apache/spark/commit/914f85a9ec406f011b1806b3debca12280280757).





[GitHub] [spark] SparkQA commented on issue #24596: [SPARK-27694][SQL] CTAS created data source table support collect statistics

2019-05-13 Thread GitBox
SparkQA commented on issue #24596: [SPARK-27694][SQL] CTAS created data source 
table support collect statistics
URL: https://github.com/apache/spark/pull/24596#issuecomment-492087218
 
 
   **[Test build #105373 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105373/testReport)** for PR 24596 at commit [`914f85a`](https://github.com/apache/spark/commit/914f85a9ec406f011b1806b3debca12280280757).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] cloud-fan commented on a change in pull request #24570: [SPARK-24923][SQL] Implement v2 CreateTableAsSelect

2019-05-13 Thread GitBox
cloud-fan commented on a change in pull request #24570: [SPARK-24923][SQL] 
Implement v2 CreateTableAsSelect
URL: https://github.com/apache/spark/pull/24570#discussion_r283633501
 
 

 ##
 File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceResolution.scala
 ##
 @@ -46,14 +55,22 @@ case class DataSourceResolution(conf: SQLConf) extends Rule[LogicalPlan] with Ca
   CreateTable(tableDesc, mode, None)
 
 case CreateTableAsSelectStatement(
-    table, query, partitionCols, bucketSpec, properties, V1WriteProvider(provider), options,
-    location, comment, ifNotExists) =>
+    AsTableIdentifier(table), query, partitionCols, bucketSpec, properties,
+    V1WriteProvider(provider), options, location, comment, ifNotExists) =>
 
   val tableDesc = buildCatalogTable(table, new StructType, partitionCols, bucketSpec,
     properties, provider, options, location, comment, ifNotExists)
   val mode = if (ifNotExists) SaveMode.Ignore else SaveMode.ErrorIfExists
 
   CreateTable(tableDesc, mode, Some(query))
+
+case create: CreateTableAsSelectStatement =>
+  // the provider was not a v1 source, convert to a v2 plan
+  val CatalogObjectIdentifier(maybeCatalog, identifier) = create.tableName
+  val catalog = maybeCatalog
+    .getOrElse(throw new AnalysisException("Default catalog is not set"))
 
 Review comment:
   nit: `catalog is not specified in the table identifier`.





[GitHub] [spark] AmplabJenkins removed a comment on issue #24580: [SPARK-27674][SQL] the hint should not be dropped after cache lookup

2019-05-13 Thread GitBox
AmplabJenkins removed a comment on issue #24580: [SPARK-27674][SQL] the hint 
should not be dropped after cache lookup
URL: https://github.com/apache/spark/pull/24580#issuecomment-492086100
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/10639/
   Test PASSed.





[GitHub] [spark] cloud-fan commented on a change in pull request #24570: [SPARK-24923][SQL] Implement v2 CreateTableAsSelect

2019-05-13 Thread GitBox
cloud-fan commented on a change in pull request #24570: [SPARK-24923][SQL] 
Implement v2 CreateTableAsSelect
URL: https://github.com/apache/spark/pull/24570#discussion_r283632754
 
 

 ##
 File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/CreateTablePartitioningValidationSuite.scala
 ##
 @@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalog.v2.{Identifier, TableCatalog, TestTableCatalog}
+import org.apache.spark.sql.catalog.v2.expressions.LogicalExpressions
+import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.catalyst.plans.logical.{CreateTableAsSelect, LeafNode}
+import org.apache.spark.sql.types.{DoubleType, LongType, StringType, StructType}
+import org.apache.spark.sql.util.CaseInsensitiveStringMap
+
+class CreateTablePartitioningValidationSuite extends AnalysisTest {
+  import CreateTablePartitioningValidationSuite._
+
+  test("CreateTableAsSelect: fail missing top-level column") {
+    val plan = CreateTableAsSelect(
+      catalog,
+      Identifier.of(Array(), "table_name"),
+      LogicalExpressions.bucket(4, "does_not_exist") :: Nil,
+      TestRelation2,
+      Map.empty,
+      Map.empty,
+      ignoreIfExists = false)
+
+    assert(!plan.resolved)
+    assertAnalysisError(plan, Seq(
+      "Invalid partitioning",
+      "does_not_exist is missing or is in a map or array"))
+  }
+
+  test("CreateTableAsSelect: fail missing top-level column nested reference") {
+    val plan = CreateTableAsSelect(
+      catalog,
+      Identifier.of(Array(), "table_name"),
+      LogicalExpressions.bucket(4, "does_not_exist.z") :: Nil,
+      TestRelation2,
+      Map.empty,
+      Map.empty,
+      ignoreIfExists = false)
+
+    assert(!plan.resolved)
+    assertAnalysisError(plan, Seq(
+      "Invalid partitioning",
+      "does_not_exist.z is missing or is in a map or array"))
 
 Review comment:
   Not a blocker, but we can improve the error message in the future. It's better to let users know which column/field is missing. For example, for `a.b` it's possible that column `a` exists but is not a struct, or that it doesn't have a `b` field (see the sketch below).
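   A sketch of the more specific wording this could produce (hypothetical helper, not Spark's actual code):
   
   ```scala
   import org.apache.spark.sql.types.StructType
   
   // Distinguish the failure cases for a nested reference like "a.b":
   // the parent column may be missing, or present but not a struct.
   def describeMissingField(schema: StructType, names: Seq[String]): String =
     schema.fields.find(_.name == names.head) match {
       case None =>
         s"Column ${names.head} does not exist"
       case Some(f) if names.length > 1 && !f.dataType.isInstanceOf[StructType] =>
         s"Column ${names.head} is not a struct, cannot resolve ${names.mkString(".")}"
       case Some(_) =>
         s"Field ${names.mkString(".")} could not be resolved"
     }
   ```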





[GitHub] [spark] AmplabJenkins removed a comment on issue #24580: [SPARK-27674][SQL] the hint should not be dropped after cache lookup

2019-05-13 Thread GitBox
AmplabJenkins removed a comment on issue #24580: [SPARK-27674][SQL] the hint 
should not be dropped after cache lookup
URL: https://github.com/apache/spark/pull/24580#issuecomment-492086096
 
 
   Merged build finished. Test PASSed.





[GitHub] [spark] AmplabJenkins commented on issue #24580: [SPARK-27674][SQL] the hint should not be dropped after cache lookup

2019-05-13 Thread GitBox
AmplabJenkins commented on issue #24580: [SPARK-27674][SQL] the hint should not 
be dropped after cache lookup
URL: https://github.com/apache/spark/pull/24580#issuecomment-492086100
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/10639/
   Test PASSed.





[GitHub] [spark] AmplabJenkins commented on issue #24580: [SPARK-27674][SQL] the hint should not be dropped after cache lookup

2019-05-13 Thread GitBox
AmplabJenkins commented on issue #24580: [SPARK-27674][SQL] the hint should not 
be dropped after cache lookup
URL: https://github.com/apache/spark/pull/24580#issuecomment-492086096
 
 
   Merged build finished. Test PASSed.





[GitHub] [spark] cloud-fan commented on a change in pull request #24570: [SPARK-24923][SQL] Implement v2 CreateTableAsSelect

2019-05-13 Thread GitBox
cloud-fan commented on a change in pull request #24570: [SPARK-24923][SQL] 
Implement v2 CreateTableAsSelect
URL: https://github.com/apache/spark/pull/24570#discussion_r283632288
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ##
 @@ -402,6 +404,35 @@ trait V2WriteCommand extends Command {
   }
 }
 
+/**
+ * Create a new table from a select query with a v2 catalog.
+ */
+case class CreateTableAsSelect(
+    catalog: TableCatalog,
+    tableName: Identifier,
+    partitioning: Seq[Transform],
+    query: LogicalPlan,
+    properties: Map[String, String],
+    writeOptions: Map[String, String],
+    ignoreIfExists: Boolean) extends Command {
+
+  override def children: Seq[LogicalPlan] = Seq(query)
+
+  override lazy val resolved: Boolean = {
+    // the table schema is created from the query schema, so the only resolution needed is to
+    // check that the columns referenced by the table's partitioning exist in the query schema
+    val references = partitioning.flatMap(_.references).toSet
+    references.map(_.fieldNames).forall { column =>
+      query.schema.findNestedField(column) match {
 
 Review comment:
   ah good catch, it should be `query.schema.findNestedField(column).isDefined`
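   A sketch of the corrected check with `.isDefined`, combining the hunk above with this fix:
   
   ```scala
   override lazy val resolved: Boolean = {
     // every column referenced by the partitioning must resolve to a field
     // in the query schema
     val references = partitioning.flatMap(_.references).toSet
     references.map(_.fieldNames).forall { column =>
       query.schema.findNestedField(column).isDefined
     }
   }
   ```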





[GitHub] [spark] SparkQA commented on issue #24580: [SPARK-27674][SQL] the hint should not be dropped after cache lookup

2019-05-13 Thread GitBox
SparkQA commented on issue #24580: [SPARK-27674][SQL] the hint should not be 
dropped after cache lookup
URL: https://github.com/apache/spark/pull/24580#issuecomment-492084979
 
 
   **[Test build #105374 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105374/testReport)** for PR 24580 at commit [`f47807e`](https://github.com/apache/spark/commit/f47807e0019013c0de7e491dcc95abd80219e12f).





[GitHub] [spark] cloud-fan commented on a change in pull request #24576: [SPARK-27671][SQL] Fix error when casting from a nested null in a struct

2019-05-13 Thread GitBox
cloud-fan commented on a change in pull request #24576: [SPARK-27671][SQL] Fix 
error when casting from a nested null in a struct
URL: https://github.com/apache/spark/pull/24576#discussion_r283631760
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
 ##
 @@ -620,6 +620,12 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
     // We can return what the children return. Same thing should happen in the codegen path.
     if (DataType.equalsStructurally(from, to)) {
       identity
+    } else if (from == NullType) {
+      // According to `canCast`, NullType can be casted to any type.
+      // For primitive types, we don't reach here because the guard of `nullSafeEval`.
+      // But for nested types like struct, we might reach here for nested null type field.
+      // We won't call the returned function actually, but returns a placeholder.
+      _ => throw new SparkException(s"should not directly cast from NullType to $to.")
 
 Review comment:
   Which means it's a bug if the exception is thrown?





[GitHub] [spark] viirya commented on a change in pull request #24576: [SPARK-27671][SQL] Fix error when casting from a nested null in a struct

2019-05-13 Thread GitBox
viirya commented on a change in pull request #24576: [SPARK-27671][SQL] Fix 
error when casting from a nested null in a struct
URL: https://github.com/apache/spark/pull/24576#discussion_r283630039
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
 ##
 @@ -620,6 +620,12 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
     // We can return what the children return. Same thing should happen in the codegen path.
     if (DataType.equalsStructurally(from, to)) {
       identity
+    } else if (from == NullType) {
+      // According to `canCast`, NullType can be casted to any type.
+      // For primitive types, we don't reach here because the guard of `nullSafeEval`.
+      // But for nested types like struct, we might reach here for nested null type field.
+      // We won't call the returned function actually, but returns a placeholder.
+      _ => throw new SparkException(s"should not directly cast from NullType to $to.")
 
 Review comment:
   We do it as follows:
   ```scala
   buildCast[InternalRow](_, row => {
     val newRow = new GenericInternalRow(from.fields.length)
     var i = 0
     while (i < row.numFields) {
       newRow.update(i,
         // We don't call cast function but directly set a null, if finding a null.
         if (row.isNullAt(i)) null else castFuncs(i)(row.get(i, from.apply(i).dataType)))
       i += 1
     }
     newRow
   })
   ```





[GitHub] [spark] viirya commented on a change in pull request #24576: [SPARK-27671][SQL] Fix error when casting from a nested null in a struct

2019-05-13 Thread GitBox
viirya commented on a change in pull request #24576: [SPARK-27671][SQL] Fix 
error when casting from a nested null in a struct
URL: https://github.com/apache/spark/pull/24576#discussion_r283629638
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
 ##
 @@ -620,6 +620,12 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
     // We can return what the children return. Same thing should happen in the codegen path.
     if (DataType.equalsStructurally(from, to)) {
       identity
+    } else if (from == NullType) {
+      // According to `canCast`, NullType can be casted to any type.
+      // For primitive types, we don't reach here because the guard of `nullSafeEval`.
+      // But for nested types like struct, we might reach here for nested null type field.
+      // We won't call the returned function actually, but returns a placeholder.
+      _ => throw new SparkException(s"should not directly cast from NullType to $to.")
 
 Review comment:
   When casting a nested field in a struct or array, if we find it is null, we don't call the cast function but set the destination to null directly. If we ever called this cast function, we would hit this exception.
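   As a concrete illustration (a hypothetical query, not from the PR), a cast that exercises this nested-null path:
   
   ```scala
   // struct(null) has a NullType field; the struct cast iterates the fields
   // and writes null directly instead of invoking the NullType -> int cast.
   spark.sql("SELECT CAST(struct(null) AS struct<n: int>)").show()
   ```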





[GitHub] [spark] cloud-fan commented on a change in pull request #24580: [SPARK-27674][SQL] the hint should not be dropped after cache lookup

2019-05-13 Thread GitBox
cloud-fan commented on a change in pull request #24580: [SPARK-27674][SQL] the 
hint should not be dropped after cache lookup
URL: https://github.com/apache/spark/pull/24580#discussion_r283628689
 
 

 ##
 File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala
 ##
 @@ -212,17 +213,18 @@ class CacheManager extends Logging {
   def useCachedData(plan: LogicalPlan): LogicalPlan = {
     val newPlan = plan transformDown {
       case command: IgnoreCachedData => command
-      // Do not lookup the cache by hint node. Hint node is special, we should ignore it when
-      // canonicalizing plans, so that plans which are same except hint can hit the same cache.
-      // However, we also want to keep the hint info after cache lookup. Here we skip the hint
-      // node, so that the returned caching plan won't replace the hint node and drop the hint info
-      // from the original plan.
-      case hint: ResolvedHint => hint
 
       case currentFragment =>
-        lookupCachedData(currentFragment)
-          .map(_.cachedRepresentation.withOutput(currentFragment.output))
-          .getOrElse(currentFragment)
+        lookupCachedData(currentFragment).map { cached =>
+          // After cache lookup, we should still keep the hints from the input plan.
+          val hints = EliminateResolvedHint.extractHintsFromPlan(currentFragment)._2
 
 Review comment:
   It's natural to return the hints in a top-down fashion, and the caller side is free to process the returned hints, including reversing them.





[GitHub] [spark] cloud-fan commented on a change in pull request #24580: [SPARK-27674][SQL] the hint should not be dropped after cache lookup

2019-05-13 Thread GitBox
cloud-fan commented on a change in pull request #24580: [SPARK-27674][SQL] the 
hint should not be dropped after cache lookup
URL: https://github.com/apache/spark/pull/24580#discussion_r283626354
 
 

 ##
 File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala
 ##
 @@ -212,17 +213,18 @@ class CacheManager extends Logging {
   def useCachedData(plan: LogicalPlan): LogicalPlan = {
     val newPlan = plan transformDown {
       case command: IgnoreCachedData => command
-      // Do not lookup the cache by hint node. Hint node is special, we should ignore it when
-      // canonicalizing plans, so that plans which are same except hint can hit the same cache.
-      // However, we also want to keep the hint info after cache lookup. Here we skip the hint
-      // node, so that the returned caching plan won't replace the hint node and drop the hint info
-      // from the original plan.
-      case hint: ResolvedHint => hint
 
       case currentFragment =>
-        lookupCachedData(currentFragment)
-          .map(_.cachedRepresentation.withOutput(currentFragment.output))
-          .getOrElse(currentFragment)
+        lookupCachedData(currentFragment).map { cached =>
+          // After cache lookup, we should still keep the hints from the input plan.
+          val hints = EliminateResolvedHint.extractHintsFromPlan(currentFragment)._2
+          val cachedPlan = cached.cachedRepresentation.withOutput(currentFragment.output)
+          // The returned hint list is in top-down order. We should reverse it so that the top hint
+          // is still in the top node.
+          hints.reverse.foldLeft[LogicalPlan](cachedPlan) { case (p, hint) =>
 
 Review comment:
   What if the plan is `Filter(ResolvedHint(...))`? This PR is trying to fix the problem that arises when the hint node is not the root node.
   
   BTW, `lookupCachedData` needs to return a `CachedData`, so we can't add a hint node there.
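   To make the failure mode concrete, a sketch of the plan shapes involved (illustrative, not actual Spark output):
   
   ```
   before lookup:   Filter(cond, ResolvedHint(broadcast, Scan(t)))
   old behavior:    Filter(cond, InMemoryRelation(...))                           // hint dropped
   with this PR:    Filter(cond, ResolvedHint(broadcast, InMemoryRelation(...)))
   ```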





[GitHub] [spark] cloud-fan commented on a change in pull request #24576: [SPARK-27671][SQL] Fix error when casting from a nested null in a struct

2019-05-13 Thread GitBox
cloud-fan commented on a change in pull request #24576: [SPARK-27671][SQL] Fix 
error when casting from a nested null in a struct
URL: https://github.com/apache/spark/pull/24576#discussion_r283625673
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
 ##
 @@ -620,6 +620,12 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
     // We can return what the children return. Same thing should happen in the codegen path.
     if (DataType.equalsStructurally(from, to)) {
       identity
+    } else if (from == NullType) {
+      // According to `canCast`, NullType can be casted to any type.
+      // For primitive types, we don't reach here because the guard of `nullSafeEval`.
+      // But for nested types like struct, we might reach here for nested null type field.
+      // We won't call the returned function actually, but returns a placeholder.
+      _ => throw new SparkException(s"should not directly cast from NullType to $to.")
 
 Review comment:
   when will we hit this exception?





[GitHub] [spark] AmplabJenkins removed a comment on issue #24406: [SPARK-27024] Executor interface for cluster managers to support GPU and other resources

2019-05-13 Thread GitBox
AmplabJenkins removed a comment on issue #24406: [SPARK-27024] Executor 
interface for cluster managers to support GPU and other resources
URL: https://github.com/apache/spark/pull/24406#issuecomment-492076751
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105372/
   Test PASSed.





[GitHub] [spark] AmplabJenkins removed a comment on issue #24406: [SPARK-27024] Executor interface for cluster managers to support GPU and other resources

2019-05-13 Thread GitBox
AmplabJenkins removed a comment on issue #24406: [SPARK-27024] Executor 
interface for cluster managers to support GPU and other resources
URL: https://github.com/apache/spark/pull/24406#issuecomment-492076746
 
 
   Merged build finished. Test PASSed.





[GitHub] [spark] AmplabJenkins commented on issue #24406: [SPARK-27024] Executor interface for cluster managers to support GPU and other resources

2019-05-13 Thread GitBox
AmplabJenkins commented on issue #24406: [SPARK-27024] Executor interface for 
cluster managers to support GPU and other resources
URL: https://github.com/apache/spark/pull/24406#issuecomment-492076746
 
 
   Merged build finished. Test PASSed.





[GitHub] [spark] AmplabJenkins commented on issue #24406: [SPARK-27024] Executor interface for cluster managers to support GPU and other resources

2019-05-13 Thread GitBox
AmplabJenkins commented on issue #24406: [SPARK-27024] Executor interface for 
cluster managers to support GPU and other resources
URL: https://github.com/apache/spark/pull/24406#issuecomment-492076751
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105372/
   Test PASSed.





[GitHub] [spark] SparkQA removed a comment on issue #24406: [SPARK-27024] Executor interface for cluster managers to support GPU and other resources

2019-05-13 Thread GitBox
SparkQA removed a comment on issue #24406: [SPARK-27024] Executor interface for 
cluster managers to support GPU and other resources
URL: https://github.com/apache/spark/pull/24406#issuecomment-492054811
 
 
   **[Test build #105372 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105372/testReport)** for PR 24406 at commit [`b9dacef`](https://github.com/apache/spark/commit/b9dacef1d7d47d19df300628d3841b3a13c03547).





[GitHub] [spark] SparkQA commented on issue #24406: [SPARK-27024] Executor interface for cluster managers to support GPU and other resources

2019-05-13 Thread GitBox
SparkQA commented on issue #24406: [SPARK-27024] Executor interface for cluster 
managers to support GPU and other resources
URL: https://github.com/apache/spark/pull/24406#issuecomment-492076455
 
 
   **[Test build #105372 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105372/testReport)** for PR 24406 at commit [`b9dacef`](https://github.com/apache/spark/commit/b9dacef1d7d47d19df300628d3841b3a13c03547).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] mengxr commented on a change in pull request #24406: [SPARK-27024] Executor interface for cluster managers to support GPU and other resources

2019-05-13 Thread GitBox
mengxr commented on a change in pull request #24406: [SPARK-27024] Executor 
interface for cluster managers to support GPU and other resources
URL: https://github.com/apache/spark/pull/24406#discussion_r283617243
 
 

 ##
 File path: 
core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
 ##
 @@ -72,6 +83,97 @@ private[spark] class CoarseGrainedExecutorBackend(
     }(ThreadUtils.sameThread)
   }
 
+  // Check that the actual resources discovered will satisfy the user specified
+  // requirements and that they match the configs specified by the user to catch
+  // mismatches between what the user requested and what resource manager gave or
+  // what the discovery script found.
+  private def checkResourcesMeetRequirements(
 
 Review comment:
   Had an offline discussion with @WeichenXu123. He suggested refactoring this check to make it easier to read. Currently the arguments are:
   
   * `reqResourcesAndCounts`: the request per task (not per executor)
   * `actualResources`: the resources allocated per executor
   
   It is not easy to tell this from the variable names, which makes the code harder to read. Basically we need the following:
   
   1. the number allocated per executor cannot be smaller than the requested count for each resource name
   2. the requested count for the executor cannot be smaller than the requested count per task for each resource name. Note that this doesn't require resource discovery.
   3. the set of requested resource names for executors should match the set of requested resource names for tasks.
   
   It would be nice to refactor the method into those three checks. We can also do it in a follow-up PR.
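   A hypothetical sketch of that three-way split (plain Scala; names and signatures are illustrative, not Spark's actual internals):
   
   ```scala
   // 1. each resource allocated to the executor must meet the executor-level request
   def checkAllocationsMeetExecutorRequests(
       execReqs: Map[String, Int], allocated: Map[String, Int]): Unit = {
     execReqs.foreach { case (name, count) =>
       require(allocated.getOrElse(name, 0) >= count,
         s"Executor got ${allocated.getOrElse(name, 0)} of $name but requires $count")
     }
   }
   
   // 2. executor-level requests must cover task-level requests (no discovery needed)
   def checkExecutorRequestsCoverTaskRequests(
       execReqs: Map[String, Int], taskReqs: Map[String, Int]): Unit = {
     taskReqs.foreach { case (name, count) =>
       require(execReqs.getOrElse(name, 0) >= count,
         s"Executor requests ${execReqs.getOrElse(name, 0)} of $name but each task needs $count")
     }
   }
   
   // 3. the requested resource names must match between executors and tasks
   def checkResourceNameSetsMatch(
       execReqs: Map[String, Int], taskReqs: Map[String, Int]): Unit = {
     require(execReqs.keySet == taskReqs.keySet,
       s"Executor resource names ${execReqs.keySet} != task resource names ${taskReqs.keySet}")
   }
   ```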








[GitHub] [spark] JoshRosen commented on issue #24593: [SPARK-27692][SQL] Add new optimizer rule to evaluate the deterministic scala udf only once if all inputs are literals

2019-05-13 Thread GitBox
JoshRosen commented on issue #24593: [SPARK-27692][SQL] Add new optimizer rule 
to evaluate the deterministic scala udf only once if all inputs are literals
URL: https://github.com/apache/spark/pull/24593#issuecomment-492065908
 
 
   This makes sense in theory, but could it be a problem in practice if users are (incorrectly but implicitly) relying on the existing behavior (e.g. to perform side effects)? I'm not saying that we shouldn't consider this; I just wanted to consider whether there are any user workloads that are likely to be broken by this (and whether this merits a mention in the migration guide / release notes).
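   For instance (hypothetical, bad-practice code), a workload that could observably change:
   
   ```scala
   import org.apache.spark.sql.functions.{lit, udf}
   
   // A UDF relied on for its side effect: today the println runs for every
   // row; if the call were folded to a literal, it would run once at
   // planning time instead.
   val logAndPass = udf { (s: String) => println(s"saw $s"); s }
   spark.range(3).select(logAndPass(lit("x"))).collect()
   ```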





[GitHub] [spark] JoshRosen commented on issue #24593: [SPARK-27692][SQL] Add new optimizer rule to evaluate the deterministic scala udf only once if all inputs are literals

2019-05-13 Thread GitBox
JoshRosen commented on issue #24593: [SPARK-27692][SQL] Add new optimizer rule 
to evaluate the deterministic scala udf only once if all inputs are literals
URL: https://github.com/apache/spark/pull/24593#issuecomment-492065223
 
 
   Instead of defining a new `DeterministicLiteralUDF` rule, could we instead add
   
   ```scala
   override def foldable: Boolean = deterministic && children.forall(_.foldable)
   ```
   
   on `ScalaUDF`? That would be a slight generalization of the optimization 
proposed here.
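
   For context, `foldable` is the signal Catalyst's `ConstantFolding` rule keys on: a foldable expression is evaluated once at optimization time and replaced with a literal. A minimal illustration of the intended effect (a sketch assuming a `SparkSession` named `spark` is in scope; the UDF here is made up):
   
   ```scala
   import org.apache.spark.sql.functions.{lit, udf}
   
   // A deterministic UDF applied only to literal inputs.
   val plusOne = udf((x: Int) => x + 1)
   val df = spark.range(1).select(plusOne(lit(41)).as("answer"))
   
   // With `foldable` defined as suggested above, the optimizer could fold
   // the call into the literal 42 at planning time rather than per row.
   df.explain(true)
   ```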





[GitHub] [spark] hehuiyuan commented on issue #24219: [SPARK-27258][K8S]Deal with the k8s resource names that don't match their own regular expression

2019-05-13 Thread GitBox
hehuiyuan commented on issue #24219: [SPARK-27258][K8S]Deal with the k8s 
resource names that don't match their own regular expression 
URL: https://github.com/apache/spark/pull/24219#issuecomment-492065168
 
 
   > There are some public utility methods offered by the fabric8io client that could be used for this purpose:
   > https://github.com/fabric8io/kubernetes-client/blob/v4.2.2/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/utils/KubernetesResourceUtil.java
   > For example the `io.fabric8.kubernetes.client.utils.KubernetesResourceUtil.{sanitizeName(),isValidName()}` methods etc.
   > This way the code is cleaner and we don't need to maintain changes as K8s evolves.
   
   Hi, that sounds good. This class provides many methods, but I couldn't find one in KubernetesResourceUtil that handles a resource name whose first character is `-`, or that validates the name of a Service.
   
   Service names are validated differently from the other resources:
   
   ```
   The regex of Service is ^[a-z]([-a-z0-9]*[a-z0-9])?.
   The regex of Pod is ^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*.
   The regex of Secret is ^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*.
   ```
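
   To make the difference concrete, a small sketch of the two patterns (illustration only, not the PR's code; the regexes are the ones quoted above, with `$` anchors added for exact matching):
   
   ```scala
   // Service names must start with a letter; Pod/Secret names may start with a
   // digit and may contain dot-separated segments.
   val serviceNameRe = "^[a-z]([-a-z0-9]*[a-z0-9])?$".r
   val podNameRe = "^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$".r
   
   def isValidServiceName(name: String): Boolean =
     serviceNameRe.pattern.matcher(name).matches()
   
   def isValidPodName(name: String): Boolean =
     podNameRe.pattern.matcher(name).matches()
   
   // "-abc" is rejected by both; "0abc" is a valid Pod name but not a Service name.
   assert(!isValidServiceName("-abc") && !isValidPodName("-abc"))
   assert(isValidPodName("0abc") && !isValidServiceName("0abc"))
   ```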





[GitHub] [spark] cxzl25 commented on a change in pull request #24497: [SPARK-27630][CORE]Stage retry causes totalRunningTasks calculation to be negative

2019-05-13 Thread GitBox
cxzl25 commented on a change in pull request #24497: [SPARK-27630][CORE]Stage 
retry causes totalRunningTasks calculation to be negative
URL: https://github.com/apache/spark/pull/24497#discussion_r283613509
 
 

 ##
 File path: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala
 ##
 @@ -757,8 +753,11 @@ private[spark] class ExecutorAllocationManager(
   val taskIndex = taskEnd.taskInfo.index
   val stageId = taskEnd.stageId
   allocationManager.synchronized {
-if (stageIdToNumRunningTask.contains(stageId)) {
-  stageIdToNumRunningTask(stageId) -= 1
+if (taskIdToNumLiveTasks.contains(taskId)) {
+  taskIdToNumLiveTasks(taskId) -= 1
+  if (taskIdToNumLiveTasks(taskId) <= 0) {
 
 Review comment:
   In theory, the number of running tasks should never go negative.
   The check is just defensive: it prevents negative counts and clears the entry promptly.
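
   A minimal sketch of the decrement-and-clean-up pattern under discussion (the map name is borrowed from the diff above; the surrounding class is omitted):
   
   ```scala
   import scala.collection.mutable
   
   // Simplified stand-in for the allocation manager's bookkeeping.
   val taskIdToNumLiveTasks = mutable.Map.empty[Long, Int]
   
   def onTaskEnd(taskId: Long): Unit = {
     taskIdToNumLiveTasks.get(taskId).foreach { n =>
       if (n - 1 <= 0) taskIdToNumLiveTasks.remove(taskId) // clear the entry in time
       else taskIdToNumLiveTasks(taskId) = n - 1           // normal decrement
     }
   }
   ```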





[GitHub] [spark] beliefer edited a comment on issue #24372: [SPARK-27462][SQL] Enhance insert into hive table that could choose some columns in target table flexibly.

2019-05-13 Thread GitBox
beliefer edited a comment on issue #24372: [SPARK-27462][SQL] Enhance insert 
into hive table that could choose some columns in target table flexibly.
URL: https://github.com/apache/spark/pull/24372#issuecomment-492047853
 
 
   > @beliefer this is way too many pings. Look at the history. I would infer 
that there isn't support for this change.
   
   Yes, I made too many pings. I will wait for a committer to review this PR. If committers really don't want to support this syntax, I would like to know the reason, such as the overall consideration and planning. Thank you very much, srowen. You pay real attention to ordinary contributors.





[GitHub] [spark] AmplabJenkins removed a comment on issue #24596: [SPARK-27694][SQL] CTAS created data source table should update statistics if spark.sql.statistics.size.autoUpdate.enabled is enabled

2019-05-13 Thread GitBox
AmplabJenkins removed a comment on issue #24596: [SPARK-27694][SQL] CTAS 
created data source table should update statistics if 
spark.sql.statistics.size.autoUpdate.enabled is enabled
URL: https://github.com/apache/spark/pull/24596#issuecomment-492058699
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/10638/
   Test PASSed.





[GitHub] [spark] AmplabJenkins removed a comment on issue #24596: [SPARK-27694][SQL] CTAS created data source table should update statistics if spark.sql.statistics.size.autoUpdate.enabled is enabled

2019-05-13 Thread GitBox
AmplabJenkins removed a comment on issue #24596: [SPARK-27694][SQL] CTAS 
created data source table should update statistics if 
spark.sql.statistics.size.autoUpdate.enabled is enabled
URL: https://github.com/apache/spark/pull/24596#issuecomment-492058695
 
 
   Merged build finished. Test PASSed.





[GitHub] [spark] SparkQA commented on issue #24596: [SPARK-27694][SQL] CTAS created data source table should update statistics if spark.sql.statistics.size.autoUpdate.enabled is enabled

2019-05-13 Thread GitBox
SparkQA commented on issue #24596: [SPARK-27694][SQL] CTAS created data source 
table should update statistics if spark.sql.statistics.size.autoUpdate.enabled 
is enabled
URL: https://github.com/apache/spark/pull/24596#issuecomment-492059055
 
 
   **[Test build #105373 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105373/testReport)**
 for PR 24596 at commit 
[`914f85a`](https://github.com/apache/spark/commit/914f85a9ec406f011b1806b3debca12280280757).





[GitHub] [spark] AmplabJenkins commented on issue #24596: [SPARK-27694][SQL] CTAS created data source table should update statistics if spark.sql.statistics.size.autoUpdate.enabled is enabled

2019-05-13 Thread GitBox
AmplabJenkins commented on issue #24596: [SPARK-27694][SQL] CTAS created data 
source table should update statistics if 
spark.sql.statistics.size.autoUpdate.enabled is enabled
URL: https://github.com/apache/spark/pull/24596#issuecomment-492058695
 
 
   Merged build finished. Test PASSed.





[GitHub] [spark] AmplabJenkins commented on issue #24596: [SPARK-27694][SQL] CTAS created data source table should update statistics if spark.sql.statistics.size.autoUpdate.enabled is enabled

2019-05-13 Thread GitBox
AmplabJenkins commented on issue #24596: [SPARK-27694][SQL] CTAS created data 
source table should update statistics if 
spark.sql.statistics.size.autoUpdate.enabled is enabled
URL: https://github.com/apache/spark/pull/24596#issuecomment-492058699
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/10638/
   Test PASSed.





[GitHub] [spark] wangyum commented on a change in pull request #20430: [SPARK-23263][TEST] CTAS should update stat if autoUpdate statistics is enabled

2019-05-13 Thread GitBox
wangyum commented on a change in pull request #20430: [SPARK-23263][TEST] CTAS 
should update stat if autoUpdate statistics is enabled
URL: https://github.com/apache/spark/pull/20430#discussion_r283607816
 
 

 ##
 File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala
 ##
 @@ -1416,4 +1416,21 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
   assert(catalogStats.rowCount.isEmpty)
 }
   }
+
+  test(s"CTAS should update statistics if ${SQLConf.AUTO_SIZE_UPDATE_ENABLED.key} is enabled") {
 
 Review comment:
   Please review this PR first: https://github.com/apache/spark/pull/24596





[GitHub] [spark] wangyum opened a new pull request #24596: [SPARK-27694][SQL] CTAS created data source table should update statistics if spark.sql.statistics.size.autoUpdate.enabled is enabled

2019-05-13 Thread GitBox
wangyum opened a new pull request #24596: [SPARK-27694][SQL] CTAS created data 
source table should update statistics if 
spark.sql.statistics.size.autoUpdate.enabled is enabled
URL: https://github.com/apache/spark/pull/24596
 
 
   ## What changes were proposed in this pull request?
   
   How to reproduce:
   ```sql
   bin/spark-sql --conf spark.sql.statistics.size.autoUpdate.enabled=true -S
   
   spark-sql> CREATE TABLE spark_27694 USING parquet AS SELECT 'a', 'b';
   spark-sql> desc formatted spark_27694;
   a        string  NULL
   b        string  NULL
   
   # Detailed Table Information
   Database             default
   Table                spark_27694
   Owner                yumwang
   Created Time         Tue May 14 10:38:25 CST 2019
   Last Access          Thu Jan 01 08:00:00 CST 1970
   Created By           Spark 2.4.0
   Type                 MANAGED
   Provider             parquet
   Table Properties     [transient_lastDdlTime=1557801505]
   Location             file:/user/hive/warehouse/spark_27694
   Serde Library        org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
   InputFormat          org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
   OutputFormat         org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
   Storage Properties   [serialization.format=1]
   ```
   This PR fixes this issue.
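
   For context, a minimal sketch of the kind of change involved, assuming the fix follows the pattern already used for Hive CTAS where table statistics are refreshed after the write (the exact call site is an assumption, not the actual patch):
   
   ```scala
   // Hypothetical sketch: after the CTAS command has written its data and
   // created the catalog entry, refresh the stored size statistics.
   if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
     CommandUtils.updateTableStats(sparkSession, table)
   }
   ```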
   
   ## How was this patch tested?
   
   unit tests and manual tests:
   ```
   bin/spark-sql --conf spark.sql.statistics.size.autoUpdate.enabled=true -S
   
   spark-sql> CREATE TABLE spark_27694 USING parquet AS SELECT 'a', 'b';
   spark-sql> DESC FORMATTED spark_27694;
   a        string  NULL
   b        string  NULL
   
   # Detailed Table Information
   Database             default
   Table                spark_27694
   Owner                root
   Created Time         Mon May 13 19:45:33 GMT-07:00 2019
   Last Access          Wed Dec 31 17:00:00 GMT-07:00 1969
   Created By           Spark 3.0.0-SNAPSHOT
   Type                 MANAGED
   Provider             parquet
   Statistics           561 bytes
   Location             file:/user/hive/warehouse/spark_27694
   Serde Library        org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
   InputFormat          org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
   OutputFormat         org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
   ```





[GitHub] [spark] viirya commented on issue #24525: [SPARK-27633][SQL] Remove redundant aliases in NestedColumnAliasing

2019-05-13 Thread GitBox
viirya commented on issue #24525: [SPARK-27633][SQL] Remove redundant aliases 
in NestedColumnAliasing
URL: https://github.com/apache/spark/pull/24525#issuecomment-492055997
 
 
   ping @dongjoon-hyun 





[GitHub] [spark] AmplabJenkins removed a comment on issue #24406: [SPARK-27024] Executor interface for cluster managers to support GPU and other resources

2019-05-13 Thread GitBox
AmplabJenkins removed a comment on issue #24406: [SPARK-27024] Executor 
interface for cluster managers to support GPU and other resources
URL: https://github.com/apache/spark/pull/24406#issuecomment-492055911
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/10637/
   Test PASSed.





[GitHub] [spark] AmplabJenkins removed a comment on issue #24406: [SPARK-27024] Executor interface for cluster managers to support GPU and other resources

2019-05-13 Thread GitBox
AmplabJenkins removed a comment on issue #24406: [SPARK-27024] Executor 
interface for cluster managers to support GPU and other resources
URL: https://github.com/apache/spark/pull/24406#issuecomment-492055905
 
 
   Merged build finished. Test PASSed.





[GitHub] [spark] AmplabJenkins commented on issue #24406: [SPARK-27024] Executor interface for cluster managers to support GPU and other resources

2019-05-13 Thread GitBox
AmplabJenkins commented on issue #24406: [SPARK-27024] Executor interface for 
cluster managers to support GPU and other resources
URL: https://github.com/apache/spark/pull/24406#issuecomment-492055911
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/10637/
   Test PASSed.





[GitHub] [spark] AmplabJenkins commented on issue #24406: [SPARK-27024] Executor interface for cluster managers to support GPU and other resources

2019-05-13 Thread GitBox
AmplabJenkins commented on issue #24406: [SPARK-27024] Executor interface for 
cluster managers to support GPU and other resources
URL: https://github.com/apache/spark/pull/24406#issuecomment-492055905
 
 
   Merged build finished. Test PASSed.





[GitHub] [spark] beliefer commented on a change in pull request #24372: [SPARK-27462][SQL] Enhance insert into hive table that could choose some columns in target table flexibly.

2019-05-13 Thread GitBox
beliefer commented on a change in pull request #24372: [SPARK-27462][SQL] 
Enhance insert into hive table that could choose some columns in target table 
flexibly.
URL: https://github.com/apache/spark/pull/24372#discussion_r283605014
 
 

 ##
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala
 ##
 @@ -332,24 +334,58 @@ case class PreprocessTableInsertion(conf: SQLConf) extends Rule[LogicalPlan] {
   private def preprocess(
   insert: InsertIntoTable,
   tblName: String,
+  insertedCols: Option[Seq[String]],
   partColNames: Seq[String]): InsertIntoTable = {
 
 val normalizedPartSpec = PartitioningUtils.normalizePartitionSpec(
   insert.partition, partColNames, tblName, conf.resolver)
 
 val staticPartCols = normalizedPartSpec.filter(_._2.isDefined).keySet
+val selectedCols = if (insertedCols == None) {
+  insert.table.output
+} else {
+  val tableCols = insert.table.output.map(_.name)
+  val noexistsCols = insertedCols.get.filterNot(col => tableCols.contains(col))
+  if (noexistsCols.size > 0) {
+throw new AnalysisException(s"Table $tblName does not exists these columns: $noexistsCols.")
+  }
+  insert.table.output.filter(a => insertedCols.get.contains(a.name))
+}
 val expectedColumns = insert.table.output.filterNot(a => staticPartCols.contains(a.name))
 
 if (expectedColumns.length != insert.query.schema.length) {
   throw new AnalysisException(
-s"$tblName requires that the data to be inserted have the same number of columns as the " +
-  s"target table: target table has ${insert.table.output.size} column(s) but the " +
-  s"inserted data has ${insert.query.output.length + staticPartCols.size} column(s), " +
+s"$tblName requires that the data to be inserted have the same number of columns as " +
+  s"the number of columns selected in the target table: the number of columns selected " +
+  s"has ${expectedColumns.length + staticPartCols.size} column(s) but the inserted data " +
+  s"has ${insert.query.output.length + staticPartCols.size} column(s), " +
   s"including ${staticPartCols.size} partition column(s) having constant value(s).")
 }
 
+val tableColumns = insert.table.output.filterNot(a => staticPartCols.contains(a.name))
+val tableCols = tableColumns.map(_.name)
+val filledQuery = if (insertedCols == None) {
+  insert.query
+} else {
+  // Because `HiveFileFormat` writes data according to the index of columns which belongs
+  // target table, in order to align the data, we need to fill in some empty expressions.
+  val cols = insertedCols.get
+  val project = insert.query.asInstanceOf[Project]
+  val filledProjectList = ArrayBuffer.empty[NamedExpression]
+  var i = 0
+  tableCols.foreach { tableCol =>
+if (cols.contains(tableCol)) {
+  filledProjectList += project.projectList(i)
+  i += 1
+} else {
+  filledProjectList += Alias(Literal(null, NullType), "NULL")()
 
 Review comment:
   @cloud-fan wenchen, help me to review this PR again, thanks!
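
   For readers following the thread, the syntax this PR proposes is roughly the standard column-list insert, with unselected columns filled with NULL (hypothetical table and column names):
   
   ```scala
   // col2 in `target` is not listed, so the rule sketched in the diff above
   // would fill it with a NULL literal to keep the columns aligned.
   spark.sql("INSERT INTO TABLE target (col1, col3) SELECT a, b FROM source")
   ```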





[GitHub] [spark] beliefer removed a comment on issue #24372: [SPARK-27462][SQL] Enhance insert into hive table that could choose some columns in target table flexibly.

2019-05-13 Thread GitBox
beliefer removed a comment on issue #24372: [SPARK-27462][SQL] Enhance insert 
into hive table that could choose some columns in target table flexibly.
URL: https://github.com/apache/spark/pull/24372#issuecomment-491124534
 
 
   @cloud-fan wenchen, help me to review this PR again, thanks!





[GitHub] [spark] SparkQA commented on issue #24406: [SPARK-27024] Executor interface for cluster managers to support GPU and other resources

2019-05-13 Thread GitBox
SparkQA commented on issue #24406: [SPARK-27024] Executor interface for cluster 
managers to support GPU and other resources
URL: https://github.com/apache/spark/pull/24406#issuecomment-492054811
 
 
   **[Test build #105372 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105372/testReport)**
 for PR 24406 at commit 
[`b9dacef`](https://github.com/apache/spark/commit/b9dacef1d7d47d19df300628d3841b3a13c03547).





[GitHub] [spark] AmplabJenkins removed a comment on issue #24589: [MINOR][SS]Remove duplicate 'add' in comment of `StructuredSessionization`.

2019-05-13 Thread GitBox
AmplabJenkins removed a comment on issue #24589: [MINOR][SS]Remove duplicate 
'add' in comment of `StructuredSessionization`.
URL: https://github.com/apache/spark/pull/24589#issuecomment-492054037
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105371/
   Test PASSed.





[GitHub] [spark] AmplabJenkins removed a comment on issue #24589: [MINOR][SS]Remove duplicate 'add' in comment of `StructuredSessionization`.

2019-05-13 Thread GitBox
AmplabJenkins removed a comment on issue #24589: [MINOR][SS]Remove duplicate 
'add' in comment of `StructuredSessionization`.
URL: https://github.com/apache/spark/pull/24589#issuecomment-492054029
 
 
   Merged build finished. Test PASSed.





[GitHub] [spark] AmplabJenkins commented on issue #24589: [MINOR][SS]Remove duplicate 'add' in comment of `StructuredSessionization`.

2019-05-13 Thread GitBox
AmplabJenkins commented on issue #24589: [MINOR][SS]Remove duplicate 'add' in 
comment of `StructuredSessionization`.
URL: https://github.com/apache/spark/pull/24589#issuecomment-492054029
 
 
   Merged build finished. Test PASSed.





[GitHub] [spark] AmplabJenkins commented on issue #24589: [MINOR][SS]Remove duplicate 'add' in comment of `StructuredSessionization`.

2019-05-13 Thread GitBox
AmplabJenkins commented on issue #24589: [MINOR][SS]Remove duplicate 'add' in 
comment of `StructuredSessionization`.
URL: https://github.com/apache/spark/pull/24589#issuecomment-492054037
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105371/
   Test PASSed.





[GitHub] [spark] SparkQA removed a comment on issue #24589: [MINOR][SS]Remove duplicate 'add' in comment of `StructuredSessionization`.

2019-05-13 Thread GitBox
SparkQA removed a comment on issue #24589: [MINOR][SS]Remove duplicate 'add' in 
comment of `StructuredSessionization`.
URL: https://github.com/apache/spark/pull/24589#issuecomment-492051869
 
 
   **[Test build #105371 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105371/testReport)**
 for PR 24589 at commit 
[`2d1c9f6`](https://github.com/apache/spark/commit/2d1c9f6457e07e23523e76f8d00c9af34f80e81f).





[GitHub] [spark] SparkQA commented on issue #24589: [MINOR][SS]Remove duplicate 'add' in comment of `StructuredSessionization`.

2019-05-13 Thread GitBox
SparkQA commented on issue #24589: [MINOR][SS]Remove duplicate 'add' in comment 
of `StructuredSessionization`.
URL: https://github.com/apache/spark/pull/24589#issuecomment-492053974
 
 
   **[Test build #105371 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105371/testReport)**
 for PR 24589 at commit 
[`2d1c9f6`](https://github.com/apache/spark/commit/2d1c9f6457e07e23523e76f8d00c9af34f80e81f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA commented on issue #24589: [MINOR][SS]Remove duplicate 'add' in comment of `StructuredSessionization`.

2019-05-13 Thread GitBox
SparkQA commented on issue #24589: [MINOR][SS]Remove duplicate 'add' in comment 
of `StructuredSessionization`.
URL: https://github.com/apache/spark/pull/24589#issuecomment-492051869
 
 
   **[Test build #105371 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105371/testReport)**
 for PR 24589 at commit 
[`2d1c9f6`](https://github.com/apache/spark/commit/2d1c9f6457e07e23523e76f8d00c9af34f80e81f).





[GitHub] [spark] AmplabJenkins removed a comment on issue #24589: [MINOR][SS]Remove duplicate 'add' in comment of `StructuredSessionization`.

2019-05-13 Thread GitBox
AmplabJenkins removed a comment on issue #24589: [MINOR][SS]Remove duplicate 
'add' in comment of `StructuredSessionization`.
URL: https://github.com/apache/spark/pull/24589#issuecomment-492051535
 
 
   Merged build finished. Test PASSed.





[GitHub] [spark] AmplabJenkins removed a comment on issue #24589: [MINOR][SS]Remove duplicate 'add' in comment of `StructuredSessionization`.

2019-05-13 Thread GitBox
AmplabJenkins removed a comment on issue #24589: [MINOR][SS]Remove duplicate 
'add' in comment of `StructuredSessionization`.
URL: https://github.com/apache/spark/pull/24589#issuecomment-492051540
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/10636/
   Test PASSed.





[GitHub] [spark] AmplabJenkins commented on issue #24589: [MINOR][SS]Remove duplicate 'add' in comment of `StructuredSessionization`.

2019-05-13 Thread GitBox
AmplabJenkins commented on issue #24589: [MINOR][SS]Remove duplicate 'add' in 
comment of `StructuredSessionization`.
URL: https://github.com/apache/spark/pull/24589#issuecomment-492051535
 
 
   Merged build finished. Test PASSed.





[GitHub] [spark] AmplabJenkins commented on issue #24589: [MINOR][SS]Remove duplicate 'add' in comment of `StructuredSessionization`.

2019-05-13 Thread GitBox
AmplabJenkins commented on issue #24589: [MINOR][SS]Remove duplicate 'add' in 
comment of `StructuredSessionization`.
URL: https://github.com/apache/spark/pull/24589#issuecomment-492051540
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/10636/
   Test PASSed.





[GitHub] [spark] beliefer commented on issue #24589: [MINOR][SS]Remove duplicate 'add' in comment of `StructuredSessionization`.

2019-05-13 Thread GitBox
beliefer commented on issue #24589: [MINOR][SS]Remove duplicate 'add' in 
comment of `StructuredSessionization`.
URL: https://github.com/apache/spark/pull/24589#issuecomment-492051377
 
 
   > It's fine. but mind taking another look and see if there are some more 
typos to fix? I am sure there are more.
   
   Thank you. I found two more typos, but I'm not sure about them.





[GitHub] [spark] beliefer removed a comment on issue #24372: [SPARK-27462][SQL] Enhance insert into hive table that could choose some columns in target table flexibly.

2019-05-13 Thread GitBox
beliefer removed a comment on issue #24372: [SPARK-27462][SQL] Enhance insert 
into hive table that could choose some columns in target table flexibly.
URL: https://github.com/apache/spark/pull/24372#issuecomment-487243349
 
 
   cc @vanzin 





[GitHub] [spark] beliefer removed a comment on issue #24372: [SPARK-27462][SQL] Enhance insert into hive table that could choose some columns in target table flexibly.

2019-05-13 Thread GitBox
beliefer removed a comment on issue #24372: [SPARK-27462][SQL] Enhance insert 
into hive table that could choose some columns in target table flexibly.
URL: https://github.com/apache/spark/pull/24372#issuecomment-489413223
 
 
   cc @dongjoon-hyun 





[GitHub] [spark] beliefer removed a comment on issue #24372: [SPARK-27462][SQL] Enhance insert into hive table that could choose some columns in target table flexibly.

2019-05-13 Thread GitBox
beliefer removed a comment on issue #24372: [SPARK-27462][SQL] Enhance insert 
into hive table that could choose some columns in target table flexibly.
URL: https://github.com/apache/spark/pull/24372#issuecomment-490030600
 
 
   cc @cloud-fan @gatorsmile 





[GitHub] [spark] beliefer removed a comment on issue #24372: [SPARK-27462][SQL] Enhance insert into hive table that could choose some columns in target table flexibly.

2019-05-13 Thread GitBox
beliefer removed a comment on issue #24372: [SPARK-27462][SQL] Enhance insert 
into hive table that could choose some columns in target table flexibly.
URL: https://github.com/apache/spark/pull/24372#issuecomment-487333959
 
 
   cc @HyukjinKwon 





[GitHub] [spark] beliefer commented on issue #24372: [SPARK-27462][SQL] Enhance insert into hive table that could choose some columns in target table flexibly.

2019-05-13 Thread GitBox
beliefer commented on issue #24372: [SPARK-27462][SQL] Enhance insert into hive 
table that could choose some columns in target table flexibly.
URL: https://github.com/apache/spark/pull/24372#issuecomment-492047853
 
 
   > @beliefer this is way too many pings. Look at the history. I would infer 
that there isn't support for this change.
   
   Yes, I made too many pings. I will wait for a committer to review this PR. If committers really don't want to support this syntax, I would like to know the reason, such as the overall consideration and planning. Thank you very much, srowen. You pay real attention to ordinary contributors.





[GitHub] [spark] melin edited a comment on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0

2019-05-13 Thread GitBox
melin edited a comment on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade 
Apache Arrow to version 0.12.0
URL: https://github.com/apache/spark/pull/23657#issuecomment-492044362
 
 
   The following error occurred in Spark 2.4:
   Expected schema message in stream, was null or length 0
   ```
   df = pd.DataFrame([1, 2, 3, 4, 5, 6, 8, 9], columns=['test']).astype(str)
   
   test_data = sparkSession.createDataFrame(df)
   
   def mapfunc(row):
       value = 1 + float(row['test'])
       return Row(test1=value)
   
   aa = test_data.rdd.map(mapfunc).toDF().toPandas()
   ```
   
![image](https://user-images.githubusercontent.com/1145830/57664558-3a949c80-762b-11e9-9a98-efb102129824.png)
   
![image](https://user-images.githubusercontent.com/1145830/57664569-42ecd780-762b-11e9-87f4-5a616e21be73.png)
   





[GitHub] [spark] melin edited a comment on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0

2019-05-13 Thread GitBox
melin edited a comment on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade 
Apache Arrow to version 0.12.0
URL: https://github.com/apache/spark/pull/23657#issuecomment-492044362
 
 
   The following error occurred in Spark 2.4:
   ```
   df = pd.DataFrame([1, 2, 3, 4, 5, 6, 8, 9], columns=['test']).astype(str)
   
   test_data = sparkSession.createDataFrame(df)
   
   def mapfunc(row):
       value = 1 + float(row['test'])
       return Row(test1=value)
   
   aa = test_data.rdd.map(mapfunc).toDF().toPandas()
   ```
   
![image](https://user-images.githubusercontent.com/1145830/57664558-3a949c80-762b-11e9-9a98-efb102129824.png)
   
![image](https://user-images.githubusercontent.com/1145830/57664569-42ecd780-762b-11e9-87f4-5a616e21be73.png)
   





[GitHub] [spark] melin edited a comment on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0

2019-05-13 Thread GitBox
melin edited a comment on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade 
Apache Arrow to version 0.12.0
URL: https://github.com/apache/spark/pull/23657#issuecomment-492044362
 
 
   The following error occurred in Spark 2.4:
   
   
![image](https://user-images.githubusercontent.com/1145830/57664558-3a949c80-762b-11e9-9a98-efb102129824.png)
   
![image](https://user-images.githubusercontent.com/1145830/57664569-42ecd780-762b-11e9-87f4-5a616e21be73.png)
   





[GitHub] [spark] melin edited a comment on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0

2019-05-13 Thread GitBox
melin edited a comment on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade 
Apache Arrow to version 0.12.0
URL: https://github.com/apache/spark/pull/23657#issuecomment-492044362
 
 
   The following error occurred in Spark 2.4:
   
![image](https://user-images.githubusercontent.com/1145830/57664558-3a949c80-762b-11e9-9a98-efb102129824.png)
   
![image](https://user-images.githubusercontent.com/1145830/57664569-42ecd780-762b-11e9-87f4-5a616e21be73.png)
   





[GitHub] [spark] melin commented on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0

2019-05-13 Thread GitBox
melin commented on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade Apache 
Arrow to version 0.12.0
URL: https://github.com/apache/spark/pull/23657#issuecomment-492044362
 
 
   The following error occurred in Spark 2.4:
   [image: image.png]
   [image: image.png]
   
   Hyukjin Kwon wrote on Thu, May 9, 2019 at 10:03 PM:
   
   > nope this change only goes into Spark 3.0 but I think Arrow 0.12.x is
   > still supported in Spark 2.4 too
   





[GitHub] [spark] turboFei edited a comment on issue #24533: [SPARK-27637][Shuffle] For nettyBlockTransferService, if IOException occurred while fetching data, check whether relative executor is alive

2019-05-13 Thread GitBox
turboFei edited a comment on issue #24533: [SPARK-27637][Shuffle] For 
nettyBlockTransferService, if IOException occurred while fetching data, check 
whether relative executor is alive  before retry
URL: https://github.com/apache/spark/pull/24533#issuecomment-492044193
 
 
   @cloud-fan I also added a unit test for the retrying block fetcher, which asserts that it does not retry when the corresponding executor is dead.





[GitHub] [spark] turboFei commented on issue #24533: [SPARK-27637][Shuffle] For nettyBlockTransferService, if IOException occurred while fetching data, check whether relative executor is alive before

2019-05-13 Thread GitBox
turboFei commented on issue #24533: [SPARK-27637][Shuffle] For 
nettyBlockTransferService, if IOException occurred while fetching data, check 
whether relative executor is alive  before retry
URL: https://github.com/apache/spark/pull/24533#issuecomment-492044193
 
 
   @cloud-fan I also added a unit test for the retrying block fetcher, which asserts that it does not retry when the corresponding executor is dead.





[GitHub] [spark] carsonwang commented on issue #20303: [SPARK-23128][SQL] A new approach to do adaptive execution in Spark SQL

2019-05-13 Thread GitBox
carsonwang commented on issue #20303: [SPARK-23128][SQL] A new approach to do 
adaptive execution in Spark SQL
URL: https://github.com/apache/spark/pull/20303#issuecomment-492043576
 
 
   @justinuang, we are still waiting for some updates from @cloud-fan





[GitHub] [spark] mccheah commented on a change in pull request #24570: [SPARK-24923][SQL] Implement v2 CreateTableAsSelect

2019-05-13 Thread GitBox
mccheah commented on a change in pull request #24570: [SPARK-24923][SQL] 
Implement v2 CreateTableAsSelect
URL: https://github.com/apache/spark/pull/24570#discussion_r283592816
 
 

 ##
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala
 ##
 @@ -47,6 +51,60 @@ case class WriteToDataSourceV2(batchWrite: BatchWrite, query: LogicalPlan)
   override def output: Seq[Attribute] = Nil
 }
 
+/**
+ * Physical plan node for v2 create table as select.
+ *
+ * A new table will be created using the schema of the query, and rows from the query are appended.
+ * If either table creation or the append fails, the table will be deleted. This implementation does
+ * not provide an atomic CTAS.
+ */
+case class CreateTableAsSelectExec(
+catalog: TableCatalog,
+ident: Identifier,
+partitioning: Seq[Transform],
+query: SparkPlan,
+properties: Map[String, String],
+writeOptions: CaseInsensitiveStringMap,
+ifNotExists: Boolean) extends V2TableWriteExec {
+
+  import org.apache.spark.sql.catalog.v2.CatalogV2Implicits.IdentifierHelper
+
+  override protected def doExecute(): RDD[InternalRow] = {
+if (catalog.tableExists(ident)) {
+  if (ifNotExists) {
+return sparkContext.parallelize(Seq.empty, 1)
+  }
+
+  throw new TableAlreadyExistsException(ident)
+}
+
+Utils.tryWithSafeFinallyAndFailureCallbacks({
+  catalog.createTable(ident, query.schema, partitioning.toArray, properties.asJava) match {
+case table: SupportsWrite =>
+  val builder = table.newWriteBuilder(writeOptions)
+  .withInputDataSchema(query.schema)
+  .withQueryId(UUID.randomUUID().toString)
+  val batchWrite = builder match {
+case supportsSaveMode: SupportsSaveMode =>
+  supportsSaveMode.mode(SaveMode.Append).buildForBatch()
 
 Review comment:
   I'm not sure why we have to specifically tell the writer to use append mode. 
Can you elaborate? I think I'm missing something. It would be simpler to remove 
this branch entirely if possible.





[GitHub] [spark] mccheah commented on a change in pull request #24570: [SPARK-24923][SQL] Implement v2 CreateTableAsSelect

2019-05-13 Thread GitBox
mccheah commented on a change in pull request #24570: [SPARK-24923][SQL] 
Implement v2 CreateTableAsSelect
URL: https://github.com/apache/spark/pull/24570#discussion_r283591843
 
 

 ##
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
 ##
 @@ -2031,7 +2031,8 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
 if (temporary && ifNotExists) {
   operationNotAllowed("CREATE TEMPORARY TABLE ... IF NOT EXISTS", ctx)
 }
-(visitTableIdentifier(ctx.tableIdentifier), temporary, ifNotExists, ctx.EXTERNAL != null)
+val multipartIdentifier: Seq[String] = ctx.multipartIdentifier.parts.asScala.map(_.getText)
 
 Review comment:
   Is there a real need to declare the type?





[GitHub] [spark] mccheah commented on a change in pull request #24570: [SPARK-24923][SQL] Implement v2 CreateTableAsSelect

2019-05-13 Thread GitBox
mccheah commented on a change in pull request #24570: [SPARK-24923][SQL] 
Implement v2 CreateTableAsSelect
URL: https://github.com/apache/spark/pull/24570#discussion_r283591739
 
 

 ##
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ##
 @@ -402,6 +404,35 @@ trait V2WriteCommand extends Command {
   }
 }
 
+/**
+ * Create a new table from a select query with a v2 catalog.
+ */
+case class CreateTableAsSelect(
+catalog: TableCatalog,
+tableName: Identifier,
+partitioning: Seq[Transform],
+query: LogicalPlan,
+properties: Map[String, String],
+writeOptions: Map[String, String],
+ignoreIfExists: Boolean) extends Command {
+
+  override def children: Seq[LogicalPlan] = Seq(query)
+
+  override lazy val resolved: Boolean = {
+// the table schema is created from the query schema, so the only resolution needed is to check
+// that the columns referenced by the table's partitioning exist in the query schema
+val references = partitioning.flatMap(_.references).toSet
+references.map(_.fieldNames).forall { column =>
+  query.schema.findNestedField(column) match {
 
 Review comment:
   Can we just check on `isDefined` on the returned option?





[GitHub] [spark] mccheah commented on a change in pull request #24570: [SPARK-24923][SQL] Implement v2 CreateTableAsSelect

2019-05-13 Thread GitBox
mccheah commented on a change in pull request #24570: [SPARK-24923][SQL] 
Implement v2 CreateTableAsSelect
URL: https://github.com/apache/spark/pull/24570#discussion_r283591739
 
 

 ##
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ##
 @@ -402,6 +404,35 @@ trait V2WriteCommand extends Command {
   }
 }
 
+/**
+ * Create a new table from a select query with a v2 catalog.
+ */
+case class CreateTableAsSelect(
+catalog: TableCatalog,
+tableName: Identifier,
+partitioning: Seq[Transform],
+query: LogicalPlan,
+properties: Map[String, String],
+writeOptions: Map[String, String],
+ignoreIfExists: Boolean) extends Command {
+
+  override def children: Seq[LogicalPlan] = Seq(query)
+
+  override lazy val resolved: Boolean = {
+// the table schema is created from the query schema, so the only resolution needed is to check
+// that the columns referenced by the table's partitioning exist in the query schema
+val references = partitioning.flatMap(_.references).toSet
+references.map(_.fieldNames).forall { column =>
+  query.schema.findNestedField(column) match {
 
 Review comment:
   Can we just return the result of `isDefined` on the returned option?
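
   Concretely, the suggestion would collapse the match into something like this (a sketch based on the diff above):
   
   ```scala
   references.map(_.fieldNames).forall { column =>
     query.schema.findNestedField(column).isDefined
   }
   ```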





[GitHub] [spark] AmplabJenkins commented on issue #24595: [SPARK-20774][CORE] Cancel the running broadcast execution on BroadcastTimeout

2019-05-13 Thread GitBox
AmplabJenkins commented on issue #24595: [SPARK-20774][CORE] Cancel the running 
broadcast execution on BroadcastTimeout
URL: https://github.com/apache/spark/pull/24595#issuecomment-492039735
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105370/
   Test PASSed.





[GitHub] [spark] AmplabJenkins removed a comment on issue #24595: [SPARK-20774][CORE] Cancel the running broadcast execution on BroadcastTimeout

2019-05-13 Thread GitBox
AmplabJenkins removed a comment on issue #24595: [SPARK-20774][CORE] Cancel the 
running broadcast execution on BroadcastTimeout
URL: https://github.com/apache/spark/pull/24595#issuecomment-492039734
 
 
   Merged build finished. Test PASSed.





[GitHub] [spark] AmplabJenkins commented on issue #24595: [SPARK-20774][CORE] Cancel the running broadcast execution on BroadcastTimeout

2019-05-13 Thread GitBox
AmplabJenkins commented on issue #24595: [SPARK-20774][CORE] Cancel the running 
broadcast execution on BroadcastTimeout
URL: https://github.com/apache/spark/pull/24595#issuecomment-492039734
 
 
   Merged build finished. Test PASSed.





[GitHub] [spark] AmplabJenkins removed a comment on issue #24595: [SPARK-20774][CORE] Cancel the running broadcast execution on BroadcastTimeout

2019-05-13 Thread GitBox
AmplabJenkins removed a comment on issue #24595: [SPARK-20774][CORE] Cancel the 
running broadcast execution on BroadcastTimeout
URL: https://github.com/apache/spark/pull/24595#issuecomment-492039735
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105370/
   Test PASSed.





[GitHub] [spark] SparkQA removed a comment on issue #24595: [SPARK-20774][CORE] Cancel the running broadcast execution on BroadcastTimeout

2019-05-13 Thread GitBox
SparkQA removed a comment on issue #24595: [SPARK-20774][CORE] Cancel the 
running broadcast execution on BroadcastTimeout
URL: https://github.com/apache/spark/pull/24595#issuecomment-492006398
 
 
   **[Test build #105370 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105370/testReport)** for PR 24595 at commit [`d2bbdbe`](https://github.com/apache/spark/commit/d2bbdbe157b040f2f5e873e4554b52cc5d48ce89).





[GitHub] [spark] SparkQA commented on issue #24595: [SPARK-20774][CORE] Cancel the running broadcast execution on BroadcastTimeout

2019-05-13 Thread GitBox
SparkQA commented on issue #24595: [SPARK-20774][CORE] Cancel the running 
broadcast execution on BroadcastTimeout
URL: https://github.com/apache/spark/pull/24595#issuecomment-492039408
 
 
   **[Test build #105370 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105370/testReport)** for PR 24595 at commit [`d2bbdbe`](https://github.com/apache/spark/commit/d2bbdbe157b040f2f5e873e4554b52cc5d48ce89).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] turboFei edited a comment on issue #24533: [SPARK-27637][Shuffle] For nettyBlockTransferService, if IOException occurred while fetching data, check whether relative executor is alive

2019-05-13 Thread GitBox
turboFei edited a comment on issue #24533: [SPARK-27637][Shuffle] For 
nettyBlockTransferService, if IOException occurred while fetching data, check 
whether relative executor is alive  before retry
URL: https://github.com/apache/spark/pull/24533#issuecomment-492038984
 
 
   I have added a unit test: I create a driverEndpointRef that always returns
   false when asked any message, and I make the mock client throw an IOException
   on sendRpc. The nettyBlockTransferService then checks whether the relative
   executor is alive, which must be false. Finally, I assert that the
   blockFetchListener fails the fetch with an ExecutorDeadException. @cloud-fan 
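   A rough, hypothetical sketch of the control flow this test exercises; the
   names and signatures below are illustrative, not the actual
   NettyBlockTransferService internals:
   
   ```scala
   import java.io.IOException
   
   // Illustrative stand-in; not the real Spark exception hierarchy.
   class ExecutorDeadException(msg: String) extends IOException(msg)
   
   // On IOException, ask the driver whether the executor is alive;
   // if it is dead, fail fast instead of retrying the fetch.
   def fetchWithLivenessCheck(
       doFetch: () => Array[Byte],
       isExecutorAlive: String => Boolean,
       executorId: String): Array[Byte] = {
     try {
       doFetch()
     } catch {
       case e: IOException if !isExecutorAlive(executorId) =>
         throw new ExecutorDeadException(s"executor $executorId is dead")
     }
   }
   ```
   
   In the test as described, `isExecutorAlive` always answers false and
   `doFetch` always throws, so an ExecutorDeadException must surface.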





[GitHub] [spark] turboFei commented on issue #24533: [SPARK-27637][Shuffle] For nettyBlockTransferService, if IOException occurred while fetching data, check whether relative executor is alive before

2019-05-13 Thread GitBox
turboFei commented on issue #24533: [SPARK-27637][Shuffle] For 
nettyBlockTransferService, if IOException occurred while fetching data, check 
whether relative executor is alive  before retry
URL: https://github.com/apache/spark/pull/24533#issuecomment-492038984
 
 
   I have added a unit test: I create a driverEndpointRef that always returns
   false when asked any message, and I make the mock client throw an IOException
   on sendRpc. The nettyBlockTransferService then checks whether the relative
   executor is alive, which must be false. Finally, I assert that the
   blockFetchListener fails the fetch with an ExecutorDeadException.





[GitHub] [spark] kiszk commented on a change in pull request #24406: [SPARK-27024] Executor interface for cluster managers to support GPU and other resources

2019-05-13 Thread GitBox
kiszk commented on a change in pull request #24406: [SPARK-27024] Executor 
interface for cluster managers to support GPU and other resources
URL: https://github.com/apache/spark/pull/24406#discussion_r283589732
 
 

 ##
 File path: 
core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
 ##
 @@ -72,6 +83,97 @@ private[spark] class CoarseGrainedExecutorBackend(
 }(ThreadUtils.sameThread)
   }
 
+  // Check that the actual resources discovered will satisfy the user specified
+  // requirements and that they match the configs specified by the user to catch
+  // mismatches between what the user requested and what resource manager gave or
+  // what the discovery script found.
+  private def checkResourcesMeetRequirements(
+      resourceConfigPrefix: String,
+      reqResourcesAndCounts: Array[(String, String)],
+      actualResources: Map[String, ResourceInformation]): Unit = {
+
+    reqResourcesAndCounts.foreach { case (rName, reqCount) =>
+      if (actualResources.contains(rName)) {
+        val resourceInfo = actualResources(rName)
+
+        if (resourceInfo.addresses.size < reqCount.toLong) {
+          throw new SparkException(s"Resource: $rName with addresses: " +
+            s"${resourceInfo.addresses.mkString(",")} doesn't meet the " +
+            s"requirements of needing $reqCount of them")
+        }
+        // also make sure the resource count on start matches the
+        // resource configs specified by user
+        val userCountConfigName =
+          resourceConfigPrefix + rName + SPARK_RESOURCE_COUNT_POSTFIX
+        val userConfigCount = env.conf.getOption(userCountConfigName).
+          getOrElse(throw new SparkException(s"Resource: $rName not specified " +
+            s"via config: $userCountConfigName, but required, " +
+            "please fix your configuration"))
+
+        if (userConfigCount.toLong > resourceInfo.addresses.size) {
+          throw new SparkException(s"Resource: $rName, with addresses: " +
+            s"${resourceInfo.addresses.mkString(",")} " +
+            s"is less than what the user requested for count: $userConfigCount, " +
+            s"via $userCountConfigName")
+        }
+      } else {
+        throw new SparkException(s"Executor resource config missing required task resource: $rName")
+      }
+    }
+  }
+
+  // visible for testing
+  def parseOrFindResources(resourcesFile: Option[String]): Map[String, ResourceInformation] = {
+    // only parse the resources if a task requires them
+    val taskResourceConfigs = env.conf.getAllWithPrefix(SPARK_TASK_RESOURCE_PREFIX)
+    val resourceInfo = if (taskResourceConfigs.nonEmpty) {
+      val execResources = resourcesFile.map { resourceFileStr => {
+        val source = new BufferedInputStream(new FileInputStream(resourceFileStr))
+        val resourceMap = try {
+          val parsedJson = parse(source).asInstanceOf[JArray].arr
+          parsedJson.map { json =>
+            val name = (json \ "name").extract[String]
+            val addresses = (json \ "addresses").extract[Array[String]]
+            new ResourceInformation(name, addresses)
+          }.map(x => (x.name -> x)).toMap
+        } catch {
+          case e @ (_: MappingException | _: MismatchedInputException) =>
+            throw new SparkException(
+              s"Exception parsing the resources in $resourceFileStr", e)
+        } finally {
+          source.close()
+        }
+        resourceMap
+      }}.getOrElse(ResourceDiscoverer.findResources(env.conf, isDriver = false))
+
+      if (execResources.isEmpty) {
+        throw new SparkException("User specified resources per task via: " +
+          s"$SPARK_TASK_RESOURCE_PREFIX, but can't find any resources available on the executor.")
+      }
+      // get just the map of resource name to count
+      val resourcesAndCounts = taskResourceConfigs.
+        withFilter { case (k, v) => k.endsWith(SPARK_RESOURCE_COUNT_POSTFIX)}.
+        map { case (k, v) => (k.dropRight(SPARK_RESOURCE_COUNT_POSTFIX.size), v)}
+
+      checkResourcesMeetRequirements(SPARK_EXECUTOR_RESOURCE_PREFIX, resourcesAndCounts,
+        execResources)
+
+      logInfo("===")
+  logInfo("Executor $executorId Resources:")
 
 Review comment:
   `s` is required before `"`, as in `s"Executor $executorId Resources:"`,
   because the string interpolates `$executorId`.
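   A two-line illustration of the point, runnable in any Scala REPL:
   
   ```scala
   val executorId = "7"
   println("Executor $executorId Resources:")  // literal: Executor $executorId Resources:
   println(s"Executor $executorId Resources:") // interpolated: Executor 7 Resources:
   ```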
   



[GitHub] [spark] mccheah commented on a change in pull request #24570: [SPARK-24923][SQL] Implement v2 CreateTableAsSelect

2019-05-13 Thread GitBox
mccheah commented on a change in pull request #24570: [SPARK-24923][SQL] 
Implement v2 CreateTableAsSelect
URL: https://github.com/apache/spark/pull/24570#discussion_r283588685
 
 

 ##
 File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceResolution.scala
 ##
 @@ -112,4 +129,41 @@ case class DataSourceResolution(conf: SQLConf) extends Rule[LogicalPlan] with Ca
   properties = properties,
   comment = comment)
   }
+
+  private def convertCTAS(
+      catalog: TableCatalog,
+      identifier: Identifier,
+      ctas: CreateTableAsSelectStatement): CreateTableAsSelect = {
+    if (ctas.options.contains("path") && ctas.location.isDefined) {
+      throw new AnalysisException(
+        "LOCATION and 'path' in OPTIONS are both used to indicate the custom table path, " +
+          "you can only specify one of them.")
+    }
+
+    val options = ctas.options.filterKeys(_ != "path")
+
+    // convert the bucket spec and add it as a transform
+    val partitioning = ctas.partitioning ++ ctas.bucketSpec.map(_.asTransform)
+
+    // create table properties from TBLPROPERTIES and OPTIONS clauses
+    val properties = new mutable.HashMap[String, String]()
+    properties ++= ctas.properties
+    properties ++= options
+
+    // convert USING, LOCATION, and COMMENT clauses to table properties
+    properties += ("provider" -> ctas.provider)
+    ctas.comment.map(text => properties += ("comment" -> text))
 
 Review comment:
   What happens if these overwrite table properties of the same type in the 
properties part of the SQL statement itself? Do these overwrite properties of 
the same name? Are these special property keys documented anywhere?
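   A small sketch of the overwrite semantics in question: `properties` in the
   quoted code is a `mutable.HashMap`, so a later `+=` silently replaces any
   user-supplied entry with the same key. The values below are made up; the
   behavior shown is plain Scala collections, not a claim about what Spark
   ultimately does:
   
   ```scala
   import scala.collection.mutable
   
   val properties = new mutable.HashMap[String, String]()
   // as if the user wrote TBLPROPERTIES ('provider' = 'fromTblProperties')
   properties ++= Map("provider" -> "fromTblProperties", "owner" -> "alice")
   // the USING clause value is applied afterwards and wins silently
   properties += ("provider" -> "parquet")
   assert(properties("provider") == "parquet") // the user's value is gone
   ```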





[GitHub] [spark] kiszk commented on a change in pull request #24585: [Spark-27664][CORE] Performance issue while listing large number of files on an object store.

2019-05-13 Thread GitBox
kiszk commented on a change in pull request #24585: [Spark-27664][CORE] 
Performance issue while listing large number of files on an object store.
URL: https://github.com/apache/spark/pull/24585#discussion_r283588413
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
 ##
 @@ -585,6 +585,17 @@ object SQLConf {
   .longConf
   .createWithDefault(250 * 1024 * 1024)
 
+  val HIVE_FILESOURCE_FILE_CACHE_CONCURRENCY_LEVEL =
+    buildConf("spark.sql.hive.filesourceFileCacheConcurrencyLevel")
+      .doc("Provide concurrency level for underlying guava cache. The default is 1." +
+        " Higher value may give better performance, when the number of " +
 
 Review comment:
   We would like to see evidence (i.e. multiple experimental results) before
   changing the default value.
   Why do we need to change the default value in addition to making this
   parameter changeable?
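   For context, a minimal sketch of the Guava knob under discussion, assuming
   Guava is on the classpath: `concurrencyLevel` sets how many internal
   segments the cache uses for concurrent writes (Guava's own default is 4):
   
   ```scala
   import com.google.common.cache.CacheBuilder
   
   // More segments can reduce write contention on a write-heavy cache,
   // at the cost of some extra bookkeeping overhead.
   val cache = CacheBuilder.newBuilder()
     .concurrencyLevel(8)
     .maximumSize(10000)
     .build[String, String]()
   
   cache.put("key", "value")
   assert(cache.getIfPresent("key") == "value")
   ```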





[GitHub] [spark] jiangxb1987 commented on a change in pull request #24497: [SPARK-27630][CORE]Stage retry causes totalRunningTasks calculation to be negative

2019-05-13 Thread GitBox
jiangxb1987 commented on a change in pull request #24497: 
[SPARK-27630][CORE]Stage retry causes totalRunningTasks calculation to be 
negative
URL: https://github.com/apache/spark/pull/24497#discussion_r283583278
 
 

 ##
 File path: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala
 ##
 @@ -757,8 +753,11 @@ private[spark] class ExecutorAllocationManager(
   val taskIndex = taskEnd.taskInfo.index
   val stageId = taskEnd.stageId
   allocationManager.synchronized {
-    if (stageIdToNumRunningTask.contains(stageId)) {
-      stageIdToNumRunningTask(stageId) -= 1
+    if (taskIdToNumLiveTasks.contains(taskId)) {
+      taskIdToNumLiveTasks(taskId) -= 1
+      if (taskIdToNumLiveTasks(taskId) <= 0) {
 
 Review comment:
   Shouldn't we still ensure it's not < 0?
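   A sketch of the guard being asked for, with an illustrative map standing in
   for the real bookkeeping: drop the entry once the count reaches zero so it
   can never go negative:
   
   ```scala
   import scala.collection.mutable
   
   val taskIdToNumLiveTasks = mutable.Map(42L -> 1)
   
   def onTaskEnd(taskId: Long): Unit = {
     taskIdToNumLiveTasks.get(taskId).foreach { n =>
       if (n - 1 <= 0) taskIdToNumLiveTasks.remove(taskId) // never stores < 0
       else taskIdToNumLiveTasks(taskId) = n - 1
     }
   }
   
   onTaskEnd(42L)
   assert(!taskIdToNumLiveTasks.contains(42L))
   ```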





[GitHub] [spark] William1104 commented on issue #24221: [SPARK-27248][SQL] `refreshTable` should recreate cache with same cache name and storage level

2019-05-13 Thread GitBox
William1104 commented on issue #24221: [SPARK-27248][SQL] `refreshTable` should 
recreate cache with same cache name and storage level
URL: https://github.com/apache/spark/pull/24221#issuecomment-492026196
 
 
   > I left a few comments. One of them should be fixed;
   (https://github.com/apache/spark/pull/24221/files#r280989808). I'll review
   this again after the fix. Thanks.
   
   
   Hi @dongjoon-hyun, I fixed the test again by changing the SQL for table
   creation. I'd appreciate it if you could take a look again. Many thanks.





[GitHub] [spark] AmplabJenkins removed a comment on issue #24592: [SPARK-27690][SQL] Remove materialized views first in `HiveClientImpl.reset`

2019-05-13 Thread GitBox
AmplabJenkins removed a comment on issue #24592: [SPARK-27690][SQL] Remove 
materialized views first in `HiveClientImpl.reset`
URL: https://github.com/apache/spark/pull/24592#issuecomment-492021235
 
 
   Merged build finished. Test PASSed.





[GitHub] [spark] AmplabJenkins removed a comment on issue #24592: [SPARK-27690][SQL] Remove materialized views first in `HiveClientImpl.reset`

2019-05-13 Thread GitBox
AmplabJenkins removed a comment on issue #24592: [SPARK-27690][SQL] Remove 
materialized views first in `HiveClientImpl.reset`
URL: https://github.com/apache/spark/pull/24592#issuecomment-492021245
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105368/
   Test PASSed.




