[GitHub] [spark] ulysses-you commented on pull request #30029: [SPARK-33131][SQL] Fix grouping sets with having clause can not resolve qualified col name

2020-10-13 Thread GitBox


ulysses-you commented on pull request #30029:
URL: https://github.com/apache/spark/pull/30029#issuecomment-708171740


   @maropu @cloud-fan do you have time to take a look?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #30011: [WIP][SPARK-32281][SQL] Spark keep SORTED spec in metastore

2020-10-13 Thread GitBox


cloud-fan commented on pull request #30011:
URL: https://github.com/apache/spark/pull/30011#issuecomment-708167152


   retest this please



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR commented on pull request #30033: [SPARK-33136][SQL] Fix mistakenly swapped parameter in V2WriteCommand.outputResolved

2020-10-13 Thread GitBox


HeartSaVioR commented on pull request #30033:
URL: https://github.com/apache/spark/pull/30033#issuecomment-708160066


   >> We can't get the information about the reason why the operator is 
considered as unresolved
   
   > This is a good point. CheckAnalysis provides better error messages for 
many cases, and the unresolved operator error is the last resort. I think we 
should provide better error message for this case as well.
   
   Yeah. It looks a bit tricky to generalize, as any operator can decide arbitrarily what counts as resolvable. Probably the operator needs to have some method that provides additional information about why `resolve` failed, so that the Analyzer can call it and print that out?
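   For illustration, a minimal sketch of that idea (hedged; `ExplainsResolution` and `unresolvedReason` are hypothetical names, not part of this PR):
   ```scala
   import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

   // Hypothetical hook: an operator can report why it considers itself unresolved.
   trait ExplainsResolution { self: LogicalPlan =>
     def unresolvedReason: Option[String]
   }

   // CheckAnalysis' last-resort branch could then append that reason (sketch only):
   //   case o if !o.resolved =>
   //     val reason = o match {
   //       case e: ExplainsResolution => e.unresolvedReason.map(r => s" ($r)").getOrElse("")
   //       case _ => ""
   //     }
   //     failAnalysis(s"unresolved operator ${o.simpleString(maxFields)}$reason")
   ```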



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR commented on a change in pull request #30033: [SPARK-33136][SQL] Fix mistakenly swapped parameter in V2WriteCommand.outputResolved

2020-10-13 Thread GitBox


HeartSaVioR commented on a change in pull request #30033:
URL: https://github.com/apache/spark/pull/30033#discussion_r504403294



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala
##
@@ -45,7 +45,7 @@ trait V2WriteCommand extends Command {
   case (inAttr, outAttr) =>
 // names and types must match, nullability must be compatible
 inAttr.name == outAttr.name &&
-  DataType.equalsIgnoreCompatibleNullability(outAttr.dataType, 
inAttr.dataType) &&
+  DataType.equalsIgnoreCompatibleNullability(inAttr.dataType, 
outAttr.dataType) &&

Review comment:
   The usage of AppendData was reverted in 
https://github.com/apache/spark/commit/b6e4aca0be7f3b863c326063a3c02aa8a1c266a3 
for branch-2.4 and shipped in Spark 2.4.0. So while the code in AppendData is broken, it's dead code.
   
   We seem to have three options: 1) revert the remaining part of AppendData in branch-2.4; 2) fix the code but leave it dead; 3) leave it as it is. What's our preference?
   
   cc. @cloud-fan @HyukjinKwon 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on pull request #30011: [WIP][SPARK-32281][SQL] Spark keep SORTED spec in metastore

2020-10-13 Thread GitBox


AngersZh commented on pull request #30011:
URL: https://github.com/apache/spark/pull/30011#issuecomment-708158773


   retest this please



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR commented on a change in pull request #30033: [SPARK-33136][SQL] Fix mistakenly swapped parameter in V2WriteCommand.outputResolved

2020-10-13 Thread GitBox


HeartSaVioR commented on a change in pull request #30033:
URL: https://github.com/apache/spark/pull/30033#discussion_r504403294



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala
##
@@ -45,7 +45,7 @@ trait V2WriteCommand extends Command {
   case (inAttr, outAttr) =>
 // names and types must match, nullability must be compatible
 inAttr.name == outAttr.name &&
-  DataType.equalsIgnoreCompatibleNullability(outAttr.dataType, 
inAttr.dataType) &&
+  DataType.equalsIgnoreCompatibleNullability(inAttr.dataType, 
outAttr.dataType) &&

Review comment:
   The usage of AppendData was reverted in 
https://github.com/apache/spark/commit/b6e4aca0be7f3b863c326063a3c02aa8a1c266a3 
for branch-2.4 and shipped in Spark 2.4.0. So while the code in AppendData for branch-2.4 is broken as well, it's dead code.
   
   We seem to have three options: 1) revert the remaining part of AppendData in branch-2.4; 2) fix the code but leave it dead; 3) leave it as it is. What's our preference?
   
   cc. @cloud-fan @HyukjinKwon 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #30027: [SPARK-32069][CORE][SQL] Improve error message on reading unexpected directory

2020-10-13 Thread GitBox


AngersZh commented on a change in pull request #30027:
URL: https://github.com/apache/spark/pull/30027#discussion_r504402399



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSourceSuite.scala
##
@@ -230,6 +230,13 @@ class HiveParquetSourceSuite extends 
ParquetPartitioningTest {
 withTempPath { path =>
   withTable("parq_tbl1", "parq_tbl2", "parq_tbl3",
 "tbl1", "tbl2", "tbl3", "tbl4", "tbl5", "tbl6") {
+
+def errorMsg(path: String): String = {

Review comment:
   Done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #30027: [SPARK-32069][CORE][SQL] Improve error message on reading unexpected directory

2020-10-13 Thread GitBox


AngersZh commented on a change in pull request #30027:
URL: https://github.com/apache/spark/pull/30027#discussion_r504402354



##
File path: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
##
@@ -232,6 +232,11 @@ class HadoopRDD[K, V](
 logWarning(s"${jobConf.get(FileInputFormat.INPUT_DIR)} doesn't exist 
and no" +
 s" partitions returned from this path.", e)
 Array.empty[Partition]
+  case e: IOException if e.getMessage.contains("Not a file") =>
+val path = e.getMessage.split(":").map(_.trim).apply(2)
+throw new IOException(s"Path: ${path} is a directory, it is not" +
+  s" allowed for `serde` reader when" +

Review comment:
   > How about:
   > 
   > ```
   > s"Path: ${path} is a directory, which is not supported by the record 
reader when `mapreduce.input.fileinputformat.input.dir.recursive` is false." 
   > ```
   
   updated





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #30027: [SPARK-32069][CORE][SQL] Improve error message on reading unexpected directory

2020-10-13 Thread GitBox


HyukjinKwon commented on pull request #30027:
URL: https://github.com/apache/spark/pull/30027#issuecomment-708156797







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #29818: [WIP][SPARK-32953][PYTHON] Add Arrow self_destruct support to toPandas

2020-10-13 Thread GitBox


HyukjinKwon commented on pull request #29818:
URL: https://github.com/apache/spark/pull/29818#issuecomment-708156469







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29818: [WIP][SPARK-32953][PYTHON] Add Arrow self_destruct support to toPandas

2020-10-13 Thread GitBox


AmplabJenkins removed a comment on pull request #29818:
URL: https://github.com/apache/spark/pull/29818#issuecomment-696117614


   Can one of the admins verify this patch?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #29818: [WIP][SPARK-32953][PYTHON] Add Arrow self_destruct support to toPandas

2020-10-13 Thread GitBox


HyukjinKwon commented on a change in pull request #29818:
URL: https://github.com/apache/spark/pull/29818#discussion_r504401223



##
File path: python/pyspark/sql/pandas/serializers.py
##
@@ -90,7 +90,10 @@ def load_stream(self, stream):
 import pyarrow as pa
 reader = pa.ipc.open_stream(stream)
 for batch in reader:
-yield batch
+split_batch = pa.RecordBatch.from_arrays([
+pa.concat_arrays([array]) for array in batch
+], schema=batch.schema)

Review comment:
   Sorry for asking a question without taking a close look, but would you mind elaborating on why we should do this?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #29818: [WIP][SPARK-32953][PYTHON] Add Arrow self_destruct support to toPandas

2020-10-13 Thread GitBox


HyukjinKwon commented on a change in pull request #29818:
URL: https://github.com/apache/spark/pull/29818#discussion_r504400977



##
File path: python/pyspark/sql/pandas/conversion.py
##
@@ -103,10 +103,18 @@ def toPandas(self):
 batches = self.toDF(*tmp_column_names)._collect_as_arrow()
 if len(batches) > 0:
 table = pyarrow.Table.from_batches(batches)
+del batches
 # Pandas DataFrame created from PyArrow uses 
datetime64[ns] for date type
 # values, but we should use datetime.date to match the 
behavior with when
 # Arrow optimization is disabled.
-pdf = table.to_pandas(date_as_object=True)
+pandas_options = {'date_as_object': True}
+if self_destruct:
+pandas_options.update({
+'self_destruct': True,

Review comment:
   Would you mind leaving a comment in the code about this set of parameter configurations?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #29818: [WIP][SPARK-32953][PYTHON] Add Arrow self_destruct support to toPandas

2020-10-13 Thread GitBox


HyukjinKwon commented on a change in pull request #29818:
URL: https://github.com/apache/spark/pull/29818#discussion_r504400867



##
File path: python/pyspark/sql/pandas/conversion.py
##
@@ -34,7 +34,7 @@ class PandasConversionMixin(object):
 """
 
 @since(1.3)
-def toPandas(self):
+def toPandas(self, self_destruct=False):

Review comment:
   Let's name it `selfDestruct` to keep the naming rule consistent.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on pull request #30027: [SPARK-32069][CORE][SQL] Improve error message on reading unexpected directory

2020-10-13 Thread GitBox


gengliangwang commented on pull request #30027:
URL: https://github.com/apache/spark/pull/30027#issuecomment-708154528


   LGTM otherwise. Thanks for doing this work.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a change in pull request #30027: [SPARK-32069][CORE][SQL] Improve error message on reading unexpected directory

2020-10-13 Thread GitBox


gengliangwang commented on a change in pull request #30027:
URL: https://github.com/apache/spark/pull/30027#discussion_r504399096



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSourceSuite.scala
##
@@ -230,6 +230,13 @@ class HiveParquetSourceSuite extends 
ParquetPartitioningTest {
 withTempPath { path =>
   withTable("parq_tbl1", "parq_tbl2", "parq_tbl3",
 "tbl1", "tbl2", "tbl3", "tbl4", "tbl5", "tbl6") {
+
+def errorMsg(path: String): String = {

Review comment:
   nit: `checkErrorMsg`





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a change in pull request #30027: [SPARK-32069][CORE][SQL] Improve error message on reading unexpected directory

2020-10-13 Thread GitBox


gengliangwang commented on a change in pull request #30027:
URL: https://github.com/apache/spark/pull/30027#discussion_r504398990



##
File path: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
##
@@ -232,6 +232,11 @@ class HadoopRDD[K, V](
 logWarning(s"${jobConf.get(FileInputFormat.INPUT_DIR)} doesn't exist 
and no" +
 s" partitions returned from this path.", e)
 Array.empty[Partition]
+  case e: IOException if e.getMessage.contains("Not a file") =>
+val path = e.getMessage.split(":").map(_.trim).apply(2)
+throw new IOException(s"Path: ${path} is a directory, it is not" +
+  s" allowed for `serde` reader when" +

Review comment:
   How about:
   ```
   s"Path: ${path} is a directory, which is not supported by the record reader 
when `mapreduce.input.fileinputformat.input.dir.recursive` is false." 
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29831: [SPARK-32351][SQL] Show partially pushed down partition filters in explain()

2020-10-13 Thread GitBox


cloud-fan commented on a change in pull request #29831:
URL: https://github.com/apache/spark/pull/29831#discussion_r504397253



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PrunePartitionSuiteBase.scala
##
@@ -46,30 +47,59 @@ abstract class PrunePartitionSuiteBase extends QueryTest 
with SQLTestUtils with
 }
 
 assertPrunedPartitions(
-  "SELECT * FROM t WHERE p = '1' OR (p = '2' AND i = 1)", 2)
+  "SELECT * FROM t WHERE p = '1' OR (p = '2' AND i = 1)", 2,
+  Set("`p`='1'", "`p`='2'"))
 assertPrunedPartitions(
-  "SELECT * FROM t WHERE (p = '1' AND i = 2) OR (i = 1 OR p = '2')", 4)
+  "SELECT * FROM t WHERE (p = '1' AND i = 2) OR (i = 1 OR p = '2')", 4,
+  Set())
 assertPrunedPartitions(
-  "SELECT * FROM t WHERE (p = '1' AND i = 2) OR (p = '3' AND i = 3 )", 
2)
+  "SELECT * FROM t WHERE (p = '1' AND i = 2) OR (p = '3' AND i = 3 )", 
2,
+  Set("`p`='1'", "`p`='3'"))
 assertPrunedPartitions(
-  "SELECT * FROM t WHERE (p = '1' AND i = 2) OR (p = '2' OR p = '3')", 
3)
+  "SELECT * FROM t WHERE (p = '1' AND i = 2) OR (p = '2' OR p = '3')", 
3,
+  Set("`p`='1'", "`p`='2'", "`p`='3'"))
 assertPrunedPartitions(
-  "SELECT * FROM t", 4)
+  "SELECT * FROM t", 4,
+  Set())
 assertPrunedPartitions(
-  "SELECT * FROM t WHERE p = '1' AND i = 2", 1)
+  "SELECT * FROM t WHERE p = '1' AND i = 2", 1,
+  Set("`p`='1'"))
 assertPrunedPartitions(
   """
 |SELECT i, COUNT(1) FROM (
 |SELECT * FROM t WHERE  p = '1' OR (p = '2' AND i = 1)
 |) tmp GROUP BY i
-  """.stripMargin, 2)
+  """.stripMargin, 2, Set("`p`='1'", "`p`='2'"))
   }
 }
   }
 
-  protected def assertPrunedPartitions(query: String, expected: Long): Unit = {
-val plan = sql(query).queryExecution.sparkPlan
-assert(getScanExecPartitionSize(plan) == expected)
+  private def collectEqualExp(
+  expression: Expression,
+  currentSet: Set[String]): Set[String] = expression match {
+case e: EqualTo =>
+  currentSet + (e.left.sql + e.symbol + e.right.sql)
+.replaceAll("spark_catalog.default.t.", "")
+case _ =>
+  expression.children.foldLeft(currentSet)((s, exp) => 
collectEqualExp(exp, s))

Review comment:
   can we put `and`, `or` in the result? I think the test should check 
`and/or` as well, not just the leaf predicates.
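   For example, a hedged sketch of a helper that keeps the full predicate (this is not the PR's code; it just relies on `Expression.sql` rendering the `AND`/`OR` nodes):
   ```scala
   import org.apache.spark.sql.catalyst.expressions.Expression

   // Sketch: render each pushed-down pruning predicate in full, so the expected
   // strings keep their AND/OR structure instead of only the leaf EqualTo parts.
   def renderPruningPredicates(predicates: Seq[Expression]): Set[String] =
     predicates.map(_.sql.replaceAll("spark_catalog.default.t.", "")).toSet
   ```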





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29831: [SPARK-32351][SQL] Show partially pushed down partition filters in explain()

2020-10-13 Thread GitBox


cloud-fan commented on a change in pull request #29831:
URL: https://github.com/apache/spark/pull/29831#discussion_r504397419



##
File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala
##
@@ -256,20 +256,21 @@ private[hive] trait HiveStrategies {
*/
   object HiveTableScans extends Strategy {
 def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
-  case ScanOperation(projectList, predicates, relation: HiveTableRelation) 
=>
+  case ScanOperation(projectList, filters, relation: HiveTableRelation) =>
 // Filter out all predicates that only deal with partition keys, these 
are given to the
 // hive table scan operator to be used for partition pruning.
 val partitionKeyIds = AttributeSet(relation.partitionCols)
-val (pruningPredicates, otherPredicates) = predicates.partition { 
predicate =>
-  !predicate.references.isEmpty &&
-  predicate.references.subsetOf(partitionKeyIds)
-}
+val normalizedFilters = DataSourceStrategy.normalizeExprs(
+  filters.filter(_.deterministic), relation.output)
+
+val partitionKeyFilters = 
DataSourceStrategy.getPushedDownFilters(relation.partitionCols,
+  normalizedFilters)
 
 pruneFilterProject(
   projectList,
-  otherPredicates,
+  filters.filter(f => f.references.isEmpty || 
!f.references.subsetOf(partitionKeyIds)),

Review comment:
   It's OK for now to be the same as before, but it's better to make it consistent with `FileSourceStrategy`. We can do it in a followup.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] maropu commented on a change in pull request #29933: [SPARK-26533][SQL] Support query auto timeout cancel on thriftserver

2020-10-13 Thread GitBox


maropu commented on a change in pull request #29933:
URL: https://github.com/apache/spark/pull/29933#discussion_r504391675



##
File path: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
##
@@ -45,11 +45,18 @@ private[hive] class SparkExecuteStatementOperation(
 parentSession: HiveSession,
 statement: String,
 confOverlay: JMap[String, String],
-runInBackground: Boolean = true)
+runInBackground: Boolean = true,
+queryTimeout: Long)
   extends ExecuteStatementOperation(parentSession, statement, confOverlay, 
runInBackground)
   with SparkOperation
   with Logging {
 
+  private val queryTimeoutValue = if (queryTimeout <= 0) {

Review comment:
   Ah, that's what I was looking for. Thanks! @AngersZh @HyukjinKwon  
Updated, could you check the latest fix?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #26194: [SPARK-29536][PYTHON] Upgrade cloudpickle to 1.1.1 to support Python 3.8

2020-10-13 Thread GitBox


HyukjinKwon commented on pull request #26194:
URL: https://github.com/apache/spark/pull/26194#issuecomment-708132184


   Thanks @dongjoon-hyun.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #26194: [SPARK-29536][PYTHON] Upgrade cloudpickle to 1.1.1 to support Python 3.8

2020-10-13 Thread GitBox


dongjoon-hyun commented on pull request #26194:
URL: https://github.com/apache/spark/pull/26194#issuecomment-708131463


   I updated SPARK-29536 by adding `2.4.7` into `Affected Versions`.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #30032: [SPARK-33134][SQL][3.0] Return partial results only for root JSON objects

2020-10-13 Thread GitBox


HyukjinKwon closed pull request #30032:
URL: https://github.com/apache/spark/pull/30032


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #30032: [SPARK-33134][SQL][3.0] Return partial results only for root JSON objects

2020-10-13 Thread GitBox


HyukjinKwon commented on pull request #30032:
URL: https://github.com/apache/spark/pull/30032#issuecomment-708129068


   Merged to branch-3.0.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #30031: [SPARK-33134][SQL] Return partial results only for root JSON objects

2020-10-13 Thread GitBox


HyukjinKwon closed pull request #30031:
URL: https://github.com/apache/spark/pull/30031


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29972: [SPARK-33081][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect)

2020-10-13 Thread GitBox


cloud-fan commented on a change in pull request #29972:
URL: https://github.com/apache/spark/pull/29972#discussion_r504374789



##
File path: 
external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala
##
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.jdbc.v2
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.test.SharedSparkSession
+import org.apache.spark.sql.types._
+import org.apache.spark.tags.DockerTest
+
+@DockerTest
+trait V2JDBCTest extends SharedSparkSession {
+  val catalogName: String
+  // dialect specific update column type test
+  def testUpdateColumnType(tbl: String): Unit
+
+  test("SPARK-33034: ALTER TABLE ... add new columns") {
+withTable(s"$catalogName.alt_table") {
+  sql(s"CREATE TABLE $catalogName.alt_table (ID STRING) USING _")
+  var t = spark.table(s"$catalogName.alt_table")
+  var expectedSchema = new StructType().add("ID", StringType)
+  assert(t.schema === expectedSchema)
+  sql(s"ALTER TABLE $catalogName.alt_table ADD COLUMNS (C1 STRING, C2 
STRING)")
+  t = spark.table(s"$catalogName.alt_table")
+  expectedSchema = expectedSchema.add("C1", StringType).add("C2", 
StringType)
+  assert(t.schema === expectedSchema)
+  sql(s"ALTER TABLE $catalogName.alt_table ADD COLUMNS (C3 STRING)")
+  t = spark.table(s"$catalogName.alt_table")
+  expectedSchema = expectedSchema.add("C3", StringType)
+  assert(t.schema === expectedSchema)
+  // Add already existing column
+  val msg = intercept[AnalysisException] {
+sql(s"ALTER TABLE $catalogName.alt_table ADD COLUMNS (C3 DOUBLE)")
+  }.getMessage
+  assert(msg.contains("Cannot add column, because C3 already exists"))
+}
+// Add a column to not existing table
+val msg = intercept[AnalysisException] {
+  sql(s"ALTER TABLE $catalogName.not_existing_table ADD COLUMNS (C4 
STRING)")
+}.getMessage
+assert(msg.contains("Table not found"))
+  }
+
+  test("SPARK-33034: ALTER TABLE ... update column type") {
+withTable(s"$catalogName.alt_table") {
+  testUpdateColumnType(s"$catalogName.alt_table")
+  // Update not existing column
+  val msg2 = intercept[AnalysisException] {
+sql(s"ALTER TABLE $catalogName.alt_table ALTER COLUMN bad_column TYPE 
DOUBLE")
+  }.getMessage
+  assert(msg2.contains("Cannot update missing field bad_column"))
+}
+// Update column type in not existing table
+val msg = intercept[AnalysisException] {
+  sql(s"ALTER TABLE $catalogName.not_existing_table ALTER COLUMN id TYPE 
DOUBLE")
+}.getMessage
+assert(msg.contains("Table not found"))
+  }
+
+  test("SPARK-33034: ALTER TABLE ... update column nullability") {
+withTable(s"$catalogName.alt_table") {
+  sql(s"CREATE TABLE $catalogName.alt_table (ID STRING NOT NULL) USING _")
+  var t = spark.table(s"$catalogName.alt_table")
+  // nullable is true in the expecteSchema because Spark always sets 
nullable to true
+  // regardless of the JDBC metadata 
https://github.com/apache/spark/pull/18445

Review comment:
   This does expose a problem in Spark: most databases allow writing nullable data to a non-nullable column and fail at runtime if they see null values. I think Spark shouldn't block it at compile time. After all, nullability is more like a constraint, not part of the data type itself.  cc @rdblue 
@dongjoon-hyun @viirya @maropu @MaxGekk 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #30031: [SPARK-33134][SQL] Return partial results only for root JSON objects

2020-10-13 Thread GitBox


HyukjinKwon commented on pull request #30031:
URL: https://github.com/apache/spark/pull/30031#issuecomment-708128835


   Merged to master.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #30031: [SPARK-33134][SQL] Return partial results only for root JSON objects

2020-10-13 Thread GitBox


HyukjinKwon commented on pull request #30031:
URL: https://github.com/apache/spark/pull/30031#issuecomment-708128011


   We probably need to redesign/refactor the JSON parsing logic here... it's now pretty convoluted.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR commented on a change in pull request #30033: [SPARK-33136][SQL] Fix mistakenly swapped parameter in V2WriteCommand.outputResolved

2020-10-13 Thread GitBox


HeartSaVioR commented on a change in pull request #30033:
URL: https://github.com/apache/spark/pull/30033#discussion_r504373710



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameWriterV2Suite.scala
##
@@ -101,6 +100,86 @@ class DataFrameWriterV2Suite extends QueryTest with 
SharedSparkSession with Befo
 assert(v2.catalog.exists(_ == catalogPlugin))
   }
 
+  case class FakeV2WriteCommand(table: NamedRelation, query: LogicalPlan) 
extends V2WriteCommand
+
+  test("SPARK-33136 output resolved on complex types for V2 write commands") {

Review comment:
   Yes, according to the implementation of 
`equalsIgnoreCompatibleNullability`.
   
   ```scala
   private[sql] def equalsIgnoreCompatibleNullability(from: DataType, to: DataType): Boolean = {
     (from, to) match {
       case (ArrayType(fromElement, fn), ArrayType(toElement, tn)) =>
         (tn || !fn) && equalsIgnoreCompatibleNullability(fromElement, toElement)

       case (MapType(fromKey, fromValue, fn), MapType(toKey, toValue, tn)) =>
         (tn || !fn) &&
           equalsIgnoreCompatibleNullability(fromKey, toKey) &&
           equalsIgnoreCompatibleNullability(fromValue, toValue)

       case (StructType(fromFields), StructType(toFields)) =>
         fromFields.length == toFields.length &&
           fromFields.zip(toFields).forall { case (fromField, toField) =>
             fromField.name == toField.name &&
               (toField.nullable || !fromField.nullable) &&
               equalsIgnoreCompatibleNullability(fromField.dataType, toField.dataType)
           }

       case (fromDataType, toDataType) => fromDataType == toDataType
     }
   }
   ```
   
   For primitive types the order doesn't affect the result. `outputResolved` itself does the right comparison, except for the swapped parameters.
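   As a concrete illustration of why the argument order matters for complex types (a sketch; note `equalsIgnoreCompatibleNullability` is `private[sql]`, so this only compiles inside Spark's own sql packages):
   ```scala
   import org.apache.spark.sql.types._

   // Writing a non-nullable array (query output) into a table column whose
   // element type is nullable should be accepted.
   val queryType = ArrayType(StringType, containsNull = false) // data being written
   val tableType = ArrayType(StringType, containsNull = true)  // table column type

   // Correct order (input type first, table column type second): returns true.
   DataType.equalsIgnoreCompatibleNullability(queryType, tableType)

   // Swapped order, as in the pre-fix outputResolved: returns false, so the
   // write command is reported as unresolved even though the write is valid.
   DataType.equalsIgnoreCompatibleNullability(tableType, queryType)
   ```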





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #30027: [SPARK-32069][CORE][SQL] Improve error message on reading unexpected directory

2020-10-13 Thread GitBox


HyukjinKwon commented on pull request #30027:
URL: https://github.com/apache/spark/pull/30027#issuecomment-708127310


   retest this please



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #30033: [SPARK-33136][SQL] Fix mistakenly swapped parameter in V2WriteCommand.outputResolved

2020-10-13 Thread GitBox


cloud-fan commented on pull request #30033:
URL: https://github.com/apache/spark/pull/30033#issuecomment-708126677


   > We can't get the information about the reason why the operator is 
considered as unresolved
   
   This is a good point. `CheckAnalysis` provides better error messages for many cases, and the `unresolved operator` error is the last resort. I think we should provide a better error message for this case as well.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR commented on a change in pull request #30033: [SPARK-33136][SQL] Fix mistakenly swapped parameter in V2WriteCommand.outputResolved

2020-10-13 Thread GitBox


HeartSaVioR commented on a change in pull request #30033:
URL: https://github.com/apache/spark/pull/30033#discussion_r504372431



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala
##
@@ -45,7 +45,7 @@ trait V2WriteCommand extends Command {
   case (inAttr, outAttr) =>
 // names and types must match, nullability must be compatible
 inAttr.name == outAttr.name &&
-  DataType.equalsIgnoreCompatibleNullability(outAttr.dataType, 
inAttr.dataType) &&
+  DataType.equalsIgnoreCompatibleNullability(inAttr.dataType, 
outAttr.dataType) &&

Review comment:
   Ah, thanks for noticing. Nice find. I found it in V2WriteCommand, so I thought it was added later. Will check the code path in branch-2.4 and test it.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #30033: [SPARK-33136][SQL] Fix mistakenly swapped parameter in V2WriteCommand.outputResolved

2020-10-13 Thread GitBox


cloud-fan commented on a change in pull request #30033:
URL: https://github.com/apache/spark/pull/30033#discussion_r504372457



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameWriterV2Suite.scala
##
@@ -101,6 +100,86 @@ class DataFrameWriterV2Suite extends QueryTest with 
SharedSparkSession with Befo
 assert(v2.catalog.exists(_ == catalogPlugin))
   }
 
+  case class FakeV2WriteCommand(table: NamedRelation, query: LogicalPlan) 
extends V2WriteCommand
+
+  test("SPARK-33136 output resolved on complex types for V2 write commands") {

Review comment:
   does this bug only affect complex types?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30033: [SPARK-33136][SQL] Fix mistakenly swapped parameter in V2WriteCommand.outputResolved

2020-10-13 Thread GitBox


dongjoon-hyun commented on a change in pull request #30033:
URL: https://github.com/apache/spark/pull/30033#discussion_r504365454



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala
##
@@ -45,7 +45,7 @@ trait V2WriteCommand extends Command {
   case (inAttr, outAttr) =>
 // names and types must match, nullability must be compatible
 inAttr.name == outAttr.name &&
-  DataType.equalsIgnoreCompatibleNullability(outAttr.dataType, 
inAttr.dataType) &&
+  DataType.equalsIgnoreCompatibleNullability(inAttr.dataType, 
outAttr.dataType) &&

Review comment:
   Hi, @HeartSaVioR. The original code looks the same as in `branch-2.4`, but the issue is reported against 3.0.0+. Could you confirm whether this is a 3.0.0-only issue or not?
   ```scala
   override lazy val resolved: Boolean = {
     table.resolved && query.resolved && query.output.size == table.output.size &&
       query.output.zip(table.output).forall {
         case (inAttr, outAttr) =>
           // names and types must match, nullability must be compatible
           inAttr.name == outAttr.name &&
             DataType.equalsIgnoreCompatibleNullability(outAttr.dataType, inAttr.dataType) &&
             (outAttr.nullable || !inAttr.nullable)
       }
   }
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30033: [SPARK-33136][SQL] Fix mistakenly swapped parameter in V2WriteCommand.outputResolved

2020-10-13 Thread GitBox


dongjoon-hyun commented on a change in pull request #30033:
URL: https://github.com/apache/spark/pull/30033#discussion_r504365454



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala
##
@@ -45,7 +45,7 @@ trait V2WriteCommand extends Command {
   case (inAttr, outAttr) =>
 // names and types must match, nullability must be compatible
 inAttr.name == outAttr.name &&
-  DataType.equalsIgnoreCompatibleNullability(outAttr.dataType, 
inAttr.dataType) &&
+  DataType.equalsIgnoreCompatibleNullability(inAttr.dataType, 
outAttr.dataType) &&

Review comment:
   Hi, @HeartSaVioR. This looks the same as in `branch-2.4`, but the issue is reported against 3.0.0+. Could you confirm whether this is a 3.0.0-only issue or not?
   ```scala
   override lazy val resolved: Boolean = {
     table.resolved && query.resolved && query.output.size == table.output.size &&
       query.output.zip(table.output).forall {
         case (inAttr, outAttr) =>
           // names and types must match, nullability must be compatible
           inAttr.name == outAttr.name &&
             DataType.equalsIgnoreCompatibleNullability(outAttr.dataType, inAttr.dataType) &&
             (outAttr.nullable || !inAttr.nullable)
       }
   }
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #29964: [SPARK-32047][SQL]Add JDBC connection provider disable possibility

2020-10-13 Thread GitBox


HyukjinKwon commented on pull request #29964:
URL: https://github.com/apache/spark/pull/29964#issuecomment-708116061


   @gaborgsomogyi, just a question. When do we disable some providers? Would you mind elaborating on the use case?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on pull request #30027: [SPARK-32069][CORE][SQL] Improve error message on reading unexpected directory

2020-10-13 Thread GitBox


AngersZh commented on pull request #30027:
URL: https://github.com/apache/spark/pull/30027#issuecomment-708114538


   @dongjoon-hyun Seems Jenkins is down because of this PR? Hmm.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29933: [SPARK-26533][SQL] Support query auto timeout cancel on thriftserver

2020-10-13 Thread GitBox


AngersZh commented on a change in pull request #29933:
URL: https://github.com/apache/spark/pull/29933#discussion_r504360835



##
File path: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
##
@@ -200,6 +207,12 @@ private[hive] class SparkExecuteStatementOperation(
   parentSession.getUsername)
 setHasResultSet(true) // avoid no resultset for async run
 
+if (queryTimeoutValue > 0) {
+  Executors.newSingleThreadScheduledExecutor.schedule(new Runnable {
+  override def run(): Unit = timeoutCancel()
+}, queryTimeoutValue, TimeUnit.SECONDS)
+}
+

Review comment:
   Maybe we need to call shutdown on this executor?
   
   
https://github.com/apache/hive/blob/940ee46d3a7a9300a47e5beae7c3ca6eb32fd759/service/src/java/org/apache/hive/service/cli/operation/SQLOperation.java#L170-L184
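   Something along these lines, perhaps (a sketch; `timeoutCancel` and `queryTimeoutValue` are the names from this PR's diff, the `shutdown()` call is the suggestion):
   ```scala
   import java.util.concurrent.{Executors, TimeUnit}

   if (queryTimeoutValue > 0) {
     val timeoutExecutor = Executors.newSingleThreadScheduledExecutor()
     timeoutExecutor.schedule(new Runnable {
       override def run(): Unit = timeoutCancel()
     }, queryTimeoutValue, TimeUnit.SECONDS)
     // Delayed tasks already submitted still run after shutdown() by default, and
     // the executor's thread is released once the scheduled task has completed.
     timeoutExecutor.shutdown()
   }
   ```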





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29933: [SPARK-26533][SQL] Support query auto timeout cancel on thriftserver

2020-10-13 Thread GitBox


AngersZh commented on a change in pull request #29933:
URL: https://github.com/apache/spark/pull/29933#discussion_r504358168



##
File path: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
##
@@ -45,11 +45,18 @@ private[hive] class SparkExecuteStatementOperation(
 parentSession: HiveSession,
 statement: String,
 confOverlay: JMap[String, String],
-runInBackground: Boolean = true)
+runInBackground: Boolean = true,
+queryTimeout: Long)
   extends ExecuteStatementOperation(parentSession, statement, confOverlay, 
runInBackground)
   with SparkOperation
   with Logging {
 
+  private val queryTimeoutValue = if (queryTimeout <= 0) {

Review comment:
   > @wangyum @AngersZh Is This behaviour the same with the current 
Hive one?
   
   When `THRIFTSERVER_QUERY_TIMEOUT` is less than the user-defined timeout, we respect `THRIFTSERVER_QUERY_TIMEOUT`:
   
https://github.com/apache/hive/blob/940ee46d3a7a9300a47e5beae7c3ca6eb32fd759/service/src/java/org/apache/hive/service/cli/operation/SQLOperation.java#L118-L122
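   In other words, something like the following (a hedged sketch of the Hive behaviour; `serverQueryTimeout` stands in for the server-side config value and is not a name from this PR):
   ```scala
   // Sketch: a configured server-side timeout caps whatever the client asked for.
   def effectiveTimeout(clientTimeout: Long, serverQueryTimeout: Long): Long =
     if (serverQueryTimeout > 0 &&
         (clientTimeout <= 0 || serverQueryTimeout < clientTimeout)) {
       serverQueryTimeout
     } else {
       clientTimeout
     }
   ```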





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #29992: [SPARK-32881][CORE] Catch some race condition errors and log them more clearly

2020-10-13 Thread GitBox


HyukjinKwon commented on pull request #29992:
URL: https://github.com/apache/spark/pull/29992#issuecomment-708110379


   How did we test this, @holdenk? I can make a quick followup for that.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29933: [SPARK-26533][SQL] Support query auto timeout cancel on thriftserver

2020-10-13 Thread GitBox


AngersZh commented on a change in pull request #29933:
URL: https://github.com/apache/spark/pull/29933#discussion_r504358168



##
File path: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
##
@@ -45,11 +45,18 @@ private[hive] class SparkExecuteStatementOperation(
 parentSession: HiveSession,
 statement: String,
 confOverlay: JMap[String, String],
-runInBackground: Boolean = true)
+runInBackground: Boolean = true,
+queryTimeout: Long)
   extends ExecuteStatementOperation(parentSession, statement, confOverlay, 
runInBackground)
   with SparkOperation
   with Logging {
 
+  private val queryTimeoutValue = if (queryTimeout <= 0) {

Review comment:
   > @wangyum @AngersZh Is This behaviour the same with the current 
Hive one?
   
   When `THRIFTSERVER_QUERY_TIMEOUT` is less than the user-defined timeout, we respect `THRIFTSERVER_QUERY_TIMEOUT`:
   
https://github.com/apache/hive/blob/940ee46d3a7a9300a47e5beae7c3ca6eb32fd759/service/src/java/org/apache/hive/service/cli/operation/SQLOperation.java#L118-L122
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #29992: [SPARK-32881][CORE] Catch some race condition errors and log them more clearly

2020-10-13 Thread GitBox


HyukjinKwon commented on a change in pull request #29992:
URL: https://github.com/apache/spark/pull/29992#discussion_r504358081



##
File path: core/src/main/scala/org/apache/spark/MapOutputTracker.scala
##
@@ -125,14 +125,19 @@ private class ShuffleStatus(numPartitions: Int) extends 
Logging {
* Update the map output location (e.g. during migration).
*/
   def updateMapOutput(mapId: Long, bmAddress: BlockManagerId): Unit = 
withWriteLock {
-val mapStatusOpt = mapStatuses.find(_.mapId == mapId)
-mapStatusOpt match {
-  case Some(mapStatus) =>
-logInfo(s"Updating map output for ${mapId} to ${bmAddress}")
-mapStatus.updateLocation(bmAddress)
-invalidateSerializedMapOutputStatusCache()
-  case None =>
-logError(s"Asked to update map output ${mapId} for untracked map 
status.")
+try {
+  val mapStatusOpt = mapStatuses.find(_.mapId == mapId)
+  mapStatusOpt match {
+case Some(mapStatus) =>
+  logInfo(s"Updating map output for ${mapId} to ${bmAddress}")
+  mapStatus.updateLocation(bmAddress)
+  invalidateSerializedMapOutputStatusCache()
+case None =>
+  logWarning(s"Asked to update map output ${mapId} for untracked map 
status.")
+  }
+} catch {
+  case e: java.lang.NullPointerException =>

Review comment:
   Quick question: can we avoid catching `NullPointerException`? It's a bit odd that we catch `NullPointerException`; we could just switch to an if-else, I guess.
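   For example, a sketch of the if-else shape (assuming the `NullPointerException` comes from null slots in `mapStatuses`; this is not the PR's code):
   ```scala
   // Skip null entries explicitly instead of catching NullPointerException.
   val mapStatusOpt = mapStatuses.filter(_ != null).find(_.mapId == mapId)
   mapStatusOpt match {
     case Some(mapStatus) =>
       logInfo(s"Updating map output for $mapId to $bmAddress")
       mapStatus.updateLocation(bmAddress)
       invalidateSerializedMapOutputStatusCache()
     case None =>
       logWarning(s"Asked to update map output $mapId for untracked map status.")
   }
   ```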





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on pull request #30027: [SPARK-32069][CORE][SQL] Improve error message on reading unexpected directory

2020-10-13 Thread GitBox


AngersZh commented on pull request #30027:
URL: https://github.com/apache/spark/pull/30027#issuecomment-708109066


   retest this please



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #29933: [SPARK-26533][SQL] Support query auto timeout cancel on thriftserver

2020-10-13 Thread GitBox


HyukjinKwon commented on a change in pull request #29933:
URL: https://github.com/apache/spark/pull/29933#discussion_r504356369



##
File path: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
##
@@ -45,11 +45,18 @@ private[hive] class SparkExecuteStatementOperation(
 parentSession: HiveSession,
 statement: String,
 confOverlay: JMap[String, String],
-runInBackground: Boolean = true)
+runInBackground: Boolean = true,
+queryTimeout: Long)
   extends ExecuteStatementOperation(parentSession, statement, confOverlay, 
runInBackground)
   with SparkOperation
   with Logging {
 
+  private val queryTimeoutValue = if (queryTimeout <= 0) {

Review comment:
   I am looking at 
https://github.com/apache/hive/commit/b6218275b00b64aed7efaf470784cc0441464f67#diff-b0085fe84239ae927dfe7464a2174a1c80169bbb60a295e78aeb8df3747e4443R149.
 0 seems to be the default, meaning no timeout, which matches the current default and behaviour here.

##
File path: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
##
@@ -45,11 +45,18 @@ private[hive] class SparkExecuteStatementOperation(
 parentSession: HiveSession,
 statement: String,
 confOverlay: JMap[String, String],
-runInBackground: Boolean = true)
+runInBackground: Boolean = true,
+queryTimeout: Long)
   extends ExecuteStatementOperation(parentSession, statement, confOverlay, 
runInBackground)
   with SparkOperation
   with Logging {
 
+  private val queryTimeoutValue = if (queryTimeout <= 0) {

Review comment:
   I just took a cursory look at 
https://github.com/apache/hive/commit/b6218275b00b64aed7efaf470784cc0441464f67#diff-b0085fe84239ae927dfe7464a2174a1c80169bbb60a295e78aeb8df3747e4443R149.
 0 seems to be the default, meaning no timeout, which matches the current default and behaviour here.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on pull request #30022: [SPARK-33090][BUILD][test-hadoop2.7] Upgrade Google Guava to 29.0-jre

2020-10-13 Thread GitBox


AngersZh commented on pull request #30022:
URL: https://github.com/apache/spark/pull/30022#issuecomment-708106738


   > After [SPARK-29250](https://issues.apache.org/jira/browse/SPARK-29250), I 
guess this PR will be a general Guava version upgrade PR without any relation 
to `Hadoop 3.2.1`.
   
   IMO, if Spark 3 with Hadoop 3.2 can work well in Hadoop clusters (2.6/2.7/2.8, etc.), it's OK to just use the Hadoop 3.2 client.
   In our Hadoop cluster, we use spark-2.4-hadoop-2.6 running on a hadoop-3.2.1 cluster, and it works well.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #29933: [SPARK-26533][SQL] Support query auto timeout cancel on thriftserver

2020-10-13 Thread GitBox


HyukjinKwon commented on pull request #29933:
URL: https://github.com/apache/spark/pull/29933#issuecomment-708104164


   cc @juliuszsompolski as well (from #28991)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #30027: [SPARK-32069][CORE][SQL] Improve error message on reading unexpected directory

2020-10-13 Thread GitBox


AngersZh commented on a change in pull request #30027:
URL: https://github.com/apache/spark/pull/30027#discussion_r504350520



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSourceSuite.scala
##
@@ -287,7 +287,11 @@ class HiveParquetSourceSuite extends 
ParquetPartitioningTest {
   val msg = intercept[IOException] {
 sql("SELECT * FROM tbl1").show()
   }.getMessage
-  assert(msg.contains("Not a file:"))
+  assert(msg.contains(

Review comment:
   > how about creating a method for checking the exception message?
   
   Done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #30027: [SPARK-32069][CORE][SQL] Improve error message on reading unexpected directory

2020-10-13 Thread GitBox


AngersZh commented on a change in pull request #30027:
URL: https://github.com/apache/spark/pull/30027#discussion_r504350436



##
File path: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
##
@@ -232,6 +232,11 @@ class HadoopRDD[K, V](
 logWarning(s"${jobConf.get(FileInputFormat.INPUT_DIR)} doesn't exist 
and no" +
 s" partitions returned from this path.", e)
 Array.empty[Partition]
+  case e: IOException if e.getMessage.contains("Not a file") =>

Review comment:
   > Nit: if e.getMessage.startsWith("Not a file:")
   
   Done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #30027: [SPARK-32069][CORE][SQL] Improve error message on reading unexpected directory

2020-10-13 Thread GitBox


AngersZh commented on a change in pull request #30027:
URL: https://github.com/apache/spark/pull/30027#discussion_r504350493



##
File path: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
##
@@ -232,6 +232,11 @@ class HadoopRDD[K, V](
 logWarning(s"${jobConf.get(FileInputFormat.INPUT_DIR)} doesn't exist 
and no" +
 s" partitions returned from this path.", e)
 Array.empty[Partition]
+  case e: IOException if e.getMessage.contains("Not a file") =>
+val path = e.getMessage.split(":").map(_.trim).apply(2)
+throw new IOException(s"Path: ${path} is a directory, it is not" +
+  s" allowed for `serde` reader when" +

Review comment:
   > The code path here is in core module. The reader may not be `serde` 
reader.
   
   How about current?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on pull request #30011: [WIP][SPARK-32281][SQL] Spark keep SORTED spec in metastore

2020-10-13 Thread GitBox


AngersZh commented on pull request #30011:
URL: https://github.com/apache/spark/pull/30011#issuecomment-708096869


   retest this please



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] closed pull request #28860: [SPARK-32002][SQL]Support ExtractValue from nested ArrayStruct

2020-10-13 Thread GitBox


github-actions[bot] closed pull request #28860:
URL: https://github.com/apache/spark/pull/28860


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] closed pull request #28821: [SPARK-31981][SQL] Keep TimestampType when taking an average of a Timestamp

2020-10-13 Thread GitBox


github-actions[bot] closed pull request #28821:
URL: https://github.com/apache/spark/pull/28821


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Victsm commented on a change in pull request #29855: [SPARK-32915][CORE] Network-layer and shuffle RPC layer changes to support push shuffle blocks

2020-10-13 Thread GitBox


Victsm commented on a change in pull request #29855:
URL: https://github.com/apache/spark/pull/29855#discussion_r504333803



##
File path: 
common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/ErrorHandlerSuite.java
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.network.shuffle;
+
+import java.net.ConnectException;
+
+import org.junit.Test;
+
+import static org.junit.Assert.*;
+
+/**
+ * Test suite for {@link ErrorHandler}
+ */
+public class ErrorHandlerSuite {
+
+  @Test
+  public void testPushErrorRetry() {
+ErrorHandler.BlockPushErrorHandler handler = new 
ErrorHandler.BlockPushErrorHandler();
+assertFalse(handler.shouldRetryError(new RuntimeException(new 
IllegalArgumentException(
+  ErrorHandler.BlockPushErrorHandler.TOO_LATE_MESSAGE_SUFFIX;
+assertFalse(handler.shouldRetryError(new RuntimeException(new 
ConnectException(;
+assertTrue(handler.shouldRetryError(new RuntimeException(new 
IllegalArgumentException(
+  
ErrorHandler.BlockPushErrorHandler.BLOCK_APPEND_COLLISION_DETECTED_MSG_PREFIX;

Review comment:
   We actually distinguish between a block duplication and a block 
collision on the server side.
   
   Block duplication is when the exact same shuffle partition block gets pushed 
by different executors, due to speculation or maybe task retry.
   The server side is able to tell when block duplication happens, whether it is a duplicate block sent after the first has been successfully merged or both blocks being received at the same time.
   With a duplicate block, the server will actually respond success to the client, so the client won't retry sending it.
   In the case of speculation, when 2 clients might be sending the same block 
at the same time, the server will respond success to 1 of the two and let the 
other write, and if that write fails the corresponding client will retry if 
it's retriable.
   
   On the other hand, a block collision is not about the exact same shuffle 
partition block, but 2 different blocks belonging to the same shuffle partition 
being sent to the same shuffle service at the same time.
   Since the shuffle service needs to append one block completely before appending the content of the next block belonging to the same shuffle partition, when these blocks arrive at one shuffle service at the same time, we would encounter a block collision.
   A block collision might not immediately lead to a collision failure being sent back to the client, since the server will buffer the blocks for a short period of time and make a few attempts before giving up.
   When the shuffle service gives up, the client will receive the collision 
failure.
   If it receives the collision failure, it's an indication that this block 
hasn't been merged yet, and thus it's OK to retry.
   Of course, it's totally possible that by the time the retry happens, a 
speculative task has already pushed the block and successfully merged it.
   If that's the case, the retry would be treated as a block duplication 
instead of a block collision, and the client will receive success response.
   
   I hope this serves as an overview of what's to come in the next PR.
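   To make the retry rule concrete, here is a minimal sketch (in Scala, whereas the real handler is Java, and not the actual implementation) of the decision the tests above exercise: "too late" responses and connection failures are not retried, while append collisions are:
   ```scala
   // Sketch: decide whether a failed block push should be retried.
   def shouldRetryError(t: Throwable): Boolean = {
     val cause = Option(t.getCause)
     val msg = cause.flatMap(c => Option(c.getMessage)).getOrElse("")
     // The merge was already finalized on the server side; retrying cannot succeed.
     val tooLate = msg.endsWith(
       ErrorHandler.BlockPushErrorHandler.TOO_LATE_MESSAGE_SUFFIX)
     // The remote shuffle service is unreachable; pushing is best-effort, so give up.
     val connectionFailed = cause.exists(_.isInstanceOf[java.net.ConnectException])
     // An append collision means the block was never appended, so retrying is safe.
     !(tooLate || connectionFailed)
   }
   ```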





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] kyprifog commented on a change in pull request #29180: [SPARK-17333][PYSPARK] Enable mypy on the repository

2020-10-13 Thread GitBox


kyprifog commented on a change in pull request #29180:
URL: https://github.com/apache/spark/pull/29180#discussion_r504330439



##
File path: dev/lint-python
##
@@ -122,6 +123,31 @@ function pycodestyle_test {
 fi
 }
 
+function mypy_test {
+local MYPY_REPORT=
+local MYPY_STATUS=
+
+if ! hash "$MYPY_BUILD" 2> /dev/null; then
+echo "The mypy command was not found."
+echo "mypy checks failed."
+exit 1
+fi
+
+echo "starting $MYPY_BUILD test..."
+MYPY_REPORT=$( ($MYPY_BUILD --ignore-missing-imports --config-file 
python/mypy.ini python/) 2>&1)

Review comment:
   Second this. I think it should be in the mypy.ini under specific sections, as suggested here: http://calpaterson.com/mypy-hints.html





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sunchao commented on pull request #29843: [SPARK-29250][BUILD][test-maven][test-hadoop2.7] Upgrade to Hadoop 3.2.1 and move to shaded client

2020-10-13 Thread GitBox


sunchao commented on pull request #29843:
URL: https://github.com/apache/spark/pull/29843#issuecomment-708071828


   > Among the following, `HivePartitionFilteringSuite.2.2: create client with 
sharesHadoopClasses = false` seems to fail at Github Action, too. Is this a new 
after the last dependency change commit?
   
   Yes, this is new after I rebased. I also couldn't reproduce the error locally. Still trying to find the root cause.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] maropu commented on a change in pull request #29933: [SPARK-26533][SQL] Support query auto timeout cancel on thriftserver

2020-10-13 Thread GitBox


maropu commented on a change in pull request #29933:
URL: https://github.com/apache/spark/pull/29933#discussion_r504312420



##
File path: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
##
@@ -45,11 +45,18 @@ private[hive] class SparkExecuteStatementOperation(
 parentSession: HiveSession,
 statement: String,
 confOverlay: JMap[String, String],
-runInBackground: Boolean = true)
+runInBackground: Boolean = true,
+queryTimeout: Long)
   extends ExecuteStatementOperation(parentSession, statement, confOverlay, 
runInBackground)
   with SparkOperation
   with Logging {
 
+  private val queryTimeoutValue = if (queryTimeout <= 0) {

Review comment:
   @wangyum @AngersZh Is this behaviour the same as the current Hive one?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] CodingCat commented on a change in pull request #29831: [SPARK-32351][SQL] Show partially pushed down partition filters in explain()

2020-10-13 Thread GitBox


CodingCat commented on a change in pull request #29831:
URL: https://github.com/apache/spark/pull/29831#discussion_r504307502



##
File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala
##
@@ -256,20 +256,21 @@ private[hive] trait HiveStrategies {
*/
   object HiveTableScans extends Strategy {
 def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
-  case ScanOperation(projectList, predicates, relation: HiveTableRelation) 
=>
+  case ScanOperation(projectList, filters, relation: HiveTableRelation) =>
 // Filter out all predicates that only deal with partition keys, these 
are given to the
 // hive table scan operator to be used for partition pruning.
 val partitionKeyIds = AttributeSet(relation.partitionCols)
-val (pruningPredicates, otherPredicates) = predicates.partition { 
predicate =>
-  !predicate.references.isEmpty &&
-  predicate.references.subsetOf(partitionKeyIds)
-}
+val normalizedFilters = DataSourceStrategy.normalizeExprs(
+  filters.filter(_.deterministic), relation.output)
+
+val partitionKeyFilters = 
DataSourceStrategy.getPushedDownFilters(relation.partitionCols,
+  normalizedFilters)
 
 pruneFilterProject(
   projectList,
-  otherPredicates,
+  filters.filter(f => f.references.isEmpty || 
!f.references.subsetOf(partitionKeyIds)),

Review comment:
   I am actually just following the original behavior of HiveStrategies.
   
   Regarding the scalar subquery, is it in the partition filter? Did I miss something in `FileSourceStrategy`?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR commented on pull request #30033: [SPARK-33136][SQL] Fix mistakenly swapped parameter in V2WriteCommand.outputResolved

2020-10-13 Thread GitBox


HeartSaVioR commented on pull request #30033:
URL: https://github.com/apache/spark/pull/30033#issuecomment-708046686


   Btw, this was hard to debug and required me to dig into Spark test code, as the error message tells us nothing in the case when all columns are matched (in other words, the plan is considered unresolved due to a type incompatibility on write between the same column). We can't get any information about why the operator is considered unresolved, even when we turn on TRACE logging.
   
   ```
   org.apache.spark.sql.AnalysisException: unresolved operator 'AppendData 
RelationV2[col_b#225, col_i#226, col_l#227L, col_f#228, col_d#229, col_da#230, 
col_ts_tz#231, col_s#232, col_fi#233, col_bi#234, col_de_1#235, col_de_2#236, 
col_de_3#237, col_st#238, col_li#239, col_ma#240] 
table_convert_read_all_types_5, Map(path -> table_convert_read_all_types_5), 
true;;
   'AppendData RelationV2[col_b#225, col_i#226, col_l#227L, col_f#228, 
col_d#229, col_da#230, col_ts_tz#231, col_s#232, col_fi#233, col_bi#234, 
col_de_1#235, col_de_2#236, col_de_3#237, col_st#238, col_li#239, col_ma#240] 
table_convert_read_all_types_5, Map(path -> table_convert_read_all_types_5), 
true
   +- Project [col_b#12, col_i#13, col_l#14L, col_f#15, col_d#16, col_da#49, 
col_ts_tz#63, col_s#17, col_fi#18, col_bi#19, col_de_1#78, col_de_2#94, 
col_de_3#111, col_st#21, col_li#22, col_ma#23]
  +- Project [col_b#12, col_i#13, col_l#14L, col_f#15, col_d#16, col_s#17, 
col_fi#18, col_bi#19, col_st#21, col_li#22, col_ma#23, col_da#49, col_ts_tz#63, 
col_de_1#78, col_de_2#94, col_de_3#111]
 +- Project [col_b#12, col_i#13, col_l#14L, col_f#15, col_d#16, 
col_s#17, col_fi#18, col_bi#19, col_de#20, col_st#21, col_li#22, col_ma#23, 
col_da#49, col_ts_tz#63, col_de_1#78, col_de_2#94, cast(col_de#20 as 
decimal(38,10)) AS col_de_3#111]
+- Project [col_b#12, col_i#13, col_l#14L, col_f#15, col_d#16, 
col_s#17, col_fi#18, col_bi#19, col_de#20, col_st#21, col_li#22, col_ma#23, 
col_da#49, col_ts_tz#63, col_de_1#78, cast(col_de#20 as decimal(11,2)) AS 
col_de_2#94]
   +- Project [col_b#12, col_i#13, col_l#14L, col_f#15, col_d#16, 
col_s#17, col_fi#18, col_bi#19, col_de#20, col_st#21, col_li#22, col_ma#23, 
col_da#49, col_ts_tz#63, cast(col_de#20 as decimal(9,0)) AS col_de_1#78]
  +- Project [col_b#12, col_i#13, col_l#14L, col_f#15, 
col_d#16, col_s#17, col_fi#18, col_bi#19, col_de#20, col_st#21, col_li#22, 
col_ma#23, col_da#49, now() AS col_ts_tz#63]
 +- Project [col_b#12, col_i#13, col_l#14L, col_f#15, 
col_d#16, col_s#17, col_fi#18, col_bi#19, col_de#20, col_st#21, col_li#22, 
col_ma#23, current_date(Some(Asia/Seoul)) AS col_da#49]
+- LocalRelation [col_b#12, col_i#13, col_l#14L, 
col_f#15, col_d#16, col_s#17, col_fi#18, col_bi#19, col_de#20, col_st#21, 
col_li#22, col_ma#23]
   
 at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:49)
 at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:48)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:130)
 at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$43(CheckAnalysis.scala:666)
 at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$43$adapted(CheckAnalysis.scala:664)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:177)
 at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:664)
 at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:89)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:130)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:156)
 at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:153)
 at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:68)
 at 
org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
 at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:133)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
 at 
org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:133)
 at 
org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:68)
 at 
org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:66)
 at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:58)
 at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$withCachedData$1(QueryExecution.scala:72)
 at 

[GitHub] [spark] HeartSaVioR opened a new pull request #30033: [SPARK-33136][SQL] Fix mistakenly swapped parameter in V2WriteCommand.outputResolved

2020-10-13 Thread GitBox


HeartSaVioR opened a new pull request #30033:
URL: https://github.com/apache/spark/pull/30033


   ### What changes were proposed in this pull request?
   
   This PR proposes to fix a bug in `V2WriteCommand.outputResolved` where `DataType.equalsIgnoreCompatibleNullability` is called with mistakenly swapped parameters. The parameters of `DataType.equalsIgnoreCompatibleNullability` are `from` and `to`, so the correct order of the matching variables is `inAttr` and then `outAttr`.
   
   ### Why are the changes needed?
   
   Spark throws an AnalysisException due to an unresolved operator in v2 writes; the operator is unresolved because of a bug where the parameters passed to `DataType.equalsIgnoreCompatibleNullability` in `outputResolved` have been swapped.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, end users no longer hit the unresolved operator error in v2 writes when they try to write a dataframe containing non-nullable complex types to a table whose matching complex types are nullable.
   
   ### How was this patch tested?
   
   New UT added.
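   A minimal illustration of the `from`/`to` semantics described above (note that `DataType.equalsIgnoreCompatibleNullability` is package-private in Spark, so this is a sketch rather than user-facing code):
   ```scala
   import org.apache.spark.sql.types._

   // Writing a non-nullable array (the query output, i.e. `from`) into a
   // nullable array column (the table attribute, i.e. `to`) is compatible...
   val queryType = ArrayType(StringType, containsNull = false)
   val tableType = ArrayType(StringType, containsNull = true)
   DataType.equalsIgnoreCompatibleNullability(queryType, tableType)  // true
   // ...but the reverse direction is not.
   DataType.equalsIgnoreCompatibleNullability(tableType, queryType)  // false
   ```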



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] turboFei edited a comment on pull request #29982: [SPARK-33100][SQL] Ignore a semicolon inside a bracketed comment in spark-sql

2020-10-13 Thread GitBox


turboFei edited a comment on pull request #29982:
URL: https://github.com/apache/spark/pull/29982#issuecomment-707801924


   Thanks for your comments @maropu @yaooqinn 
   I tried to think about how to remove the variable `bracketedCommentRightBound` but could not find a better way to recognize a bracketed comment.
   
   Without this variable:
   1. I might need to leave the bracketed comment when the current char equals '/' and the last char equals '*', but then it is difficult to judge whether the statement begins.
   2. If I do not leave the bracketed comment when the current char equals '/' and the last char equals '*', it is difficult to leave the comment later.
   For example, a line like: select */*comment\*/ from ta;
   
   So, I recommend keeping this variable.
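   For reference, a toy sketch of the underlying goal (splitting on ';' while ignoring semicolons inside /* ... */ comments). It uses lookahead rather than the lookback scan the CLI driver actually uses, so it side-steps the exact difficulty described above and is not the implementation in this PR:
   ```scala
   // Sketch only: does not handle quoted strings, '--' comments, or nested comments.
   def splitStatements(line: String): Seq[String] = {
     val statements = scala.collection.mutable.ArrayBuffer.empty[String]
     val current = new StringBuilder
     var insideBracketedComment = false
     var i = 0
     while (i < line.length) {
       val c = line.charAt(i)
       val next = if (i + 1 < line.length) line.charAt(i + 1) else '\u0000'
       if (!insideBracketedComment && c == '/' && next == '*') {
         insideBracketedComment = true; current.append("/*"); i += 2
       } else if (insideBracketedComment && c == '*' && next == '/') {
         insideBracketedComment = false; current.append("*/"); i += 2
       } else if (!insideBracketedComment && c == ';') {
         statements += current.toString; current.clear(); i += 1
       } else {
         current.append(c); i += 1
       }
     }
     if (current.nonEmpty) statements += current.toString
     statements.toSeq
   }

   // splitStatements("select */*comment*/ from ta; select 1")
   // -> Seq("select */*comment*/ from ta", " select 1")
   ```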



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29924: [SPARK-30821][K8S]Handle executor failure with multiple containers

2020-10-13 Thread GitBox


AmplabJenkins removed a comment on pull request #29924:
URL: https://github.com/apache/spark/pull/29924#issuecomment-702286766


   Can one of the admins verify this patch?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk commented on pull request #29924: [SPARK-30821][K8S]Handle executor failure with multiple containers

2020-10-13 Thread GitBox


holdenk commented on pull request #29924:
URL: https://github.com/apache/spark/pull/29924#issuecomment-707990107


   Jenkins OK to test



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huskysun commented on pull request #29924: [SPARK-30821][K8S]Handle executor failure with multiple containers

2020-10-13 Thread GitBox


huskysun commented on pull request #29924:
URL: https://github.com/apache/spark/pull/29924#issuecomment-707959738


   Also, is Jenkins triggered? It looks like the bot didn't make any comments on this PR.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huskysun commented on pull request #29924: [SPARK-30821][K8S]Handle executor failure with multiple containers

2020-10-13 Thread GitBox


huskysun commented on pull request #29924:
URL: https://github.com/apache/spark/pull/29924#issuecomment-707955643


   @holdenk Made the fix about the indentation. Please take a look again. 
Thanks!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huskysun commented on a change in pull request #29924: [SPARK-30821][K8S]Handle executor failure with multiple containers

2020-10-13 Thread GitBox


huskysun commented on a change in pull request #29924:
URL: https://github.com/apache/spark/pull/29924#discussion_r504198638



##
File path: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshot.scala
##
@@ -59,11 +65,19 @@ object ExecutorPodsSnapshot extends Logging {
 case "pending" =>
   PodPending(pod)
 case "running" =>
-  PodRunning(pod)
+  if (shouldCheckAllContainers &&
+"Never" == pod.getSpec.getRestartPolicy &&
+pod.getStatus.getContainerStatuses.stream
+  .map[ContainerStateTerminated](cs => cs.getState.getTerminated)
+  .anyMatch(t => t != null && t.getExitCode != 0)) {
+PodFailed(pod)
+  } else {
+PodRunning(pod)
+  }
 case "failed" =>
   PodFailed(pod)
 case "succeeded" =>
-  PodSucceeded(pod)
+PodSucceeded(pod)

Review comment:
   Sorry about that, just fixed it





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Victsm commented on pull request #29855: [SPARK-32915][CORE] Network-layer and shuffle RPC layer changes to support push shuffle blocks

2020-10-13 Thread GitBox


Victsm commented on pull request #29855:
URL: https://github.com/apache/spark/pull/29855#issuecomment-707952721


   The most recent test failure does not seem related to this patch.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] lidavidm commented on pull request #29818: [WIP][SPARK-32953][PYTHON] Add Arrow self_destruct support to toPandas

2020-10-13 Thread GitBox


lidavidm commented on pull request #29818:
URL: https://github.com/apache/spark/pull/29818#issuecomment-707938312


   Following up here - any other comments? Does this look like a desirable 
feature, and how do we want to configure it?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk commented on a change in pull request #29924: [SPARK-30821][K8S]Handle executor failure with multiple containers

2020-10-13 Thread GitBox


holdenk commented on a change in pull request #29924:
URL: https://github.com/apache/spark/pull/29924#discussion_r504163939



##
File path: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshot.scala
##
@@ -59,11 +65,19 @@ object ExecutorPodsSnapshot extends Logging {
 case "pending" =>
   PodPending(pod)
 case "running" =>
-  PodRunning(pod)
+  if (shouldCheckAllContainers &&
+"Never" == pod.getSpec.getRestartPolicy &&
+pod.getStatus.getContainerStatuses.stream
+  .map[ContainerStateTerminated](cs => cs.getState.getTerminated)
+  .anyMatch(t => t != null && t.getExitCode != 0)) {
+PodFailed(pod)
+  } else {
+PodRunning(pod)
+  }
 case "failed" =>
   PodFailed(pod)
 case "succeeded" =>
-  PodSucceeded(pod)
+PodSucceeded(pod)

Review comment:
   nit: indentation should match.

##
File path: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshot.scala
##
@@ -59,11 +65,19 @@ object ExecutorPodsSnapshot extends Logging {
 case "pending" =>
   PodPending(pod)
 case "running" =>
-  PodRunning(pod)
+  if (shouldCheckAllContainers &&
+"Never" == pod.getSpec.getRestartPolicy &&

Review comment:
   Good addition :+1: 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gongx commented on pull request #29924: [SPARK-30821][K8S]Handle executor failure with multiple containers

2020-10-13 Thread GitBox


gongx commented on pull request #29924:
URL: https://github.com/apache/spark/pull/29924#issuecomment-707919371


   @holdenk Could you review this PR, please?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huaxingao commented on a change in pull request #29972: [SPARK-33081][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect)

2020-10-13 Thread GitBox


huaxingao commented on a change in pull request #29972:
URL: https://github.com/apache/spark/pull/29972#discussion_r504154239



##
File path: 
external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala
##
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.jdbc.v2
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.test.SharedSparkSession
+import org.apache.spark.sql.types._
+import org.apache.spark.tags.DockerTest
+
+@DockerTest
+trait V2JDBCTest extends SharedSparkSession {
+  val catalogName: String
+  // dialect specific update column type test
+  def testUpdateColumnType(tbl: String): Unit
+
+  test("SPARK-33034: ALTER TABLE ... add new columns") {
+withTable(s"$catalogName.alt_table") {
+  sql(s"CREATE TABLE $catalogName.alt_table (ID STRING) USING _")
+  var t = spark.table(s"$catalogName.alt_table")
+  var expectedSchema = new StructType().add("ID", StringType)
+  assert(t.schema === expectedSchema)
+  sql(s"ALTER TABLE $catalogName.alt_table ADD COLUMNS (C1 STRING, C2 
STRING)")
+  t = spark.table(s"$catalogName.alt_table")
+  expectedSchema = expectedSchema.add("C1", StringType).add("C2", 
StringType)
+  assert(t.schema === expectedSchema)
+  sql(s"ALTER TABLE $catalogName.alt_table ADD COLUMNS (C3 STRING)")
+  t = spark.table(s"$catalogName.alt_table")
+  expectedSchema = expectedSchema.add("C3", StringType)
+  assert(t.schema === expectedSchema)
+  // Add already existing column
+  val msg = intercept[AnalysisException] {
+sql(s"ALTER TABLE $catalogName.alt_table ADD COLUMNS (C3 DOUBLE)")
+  }.getMessage
+  assert(msg.contains("Cannot add column, because C3 already exists"))
+}
+// Add a column to not existing table
+val msg = intercept[AnalysisException] {
+  sql(s"ALTER TABLE $catalogName.not_existing_table ADD COLUMNS (C4 
STRING)")
+}.getMessage
+assert(msg.contains("Table not found"))
+  }
+
+  test("SPARK-33034: ALTER TABLE ... update column type") {
+withTable(s"$catalogName.alt_table") {
+  testUpdateColumnType(s"$catalogName.alt_table")
+  // Update not existing column
+  val msg2 = intercept[AnalysisException] {
+sql(s"ALTER TABLE $catalogName.alt_table ALTER COLUMN bad_column TYPE 
DOUBLE")
+  }.getMessage
+  assert(msg2.contains("Cannot update missing field bad_column"))
+}
+// Update column type in not existing table
+val msg = intercept[AnalysisException] {
+  sql(s"ALTER TABLE $catalogName.not_existing_table ALTER COLUMN id TYPE 
DOUBLE")
+}.getMessage
+assert(msg.contains("Table not found"))
+  }
+
+  test("SPARK-33034: ALTER TABLE ... update column nullability") {
+withTable(s"$catalogName.alt_table") {
+  sql(s"CREATE TABLE $catalogName.alt_table (ID STRING NOT NULL) USING _")
+  var t = spark.table(s"$catalogName.alt_table")
+  // nullable is true in the expecteSchema because Spark always sets 
nullable to true
+  // regardless of the JDBC metadata 
https://github.com/apache/spark/pull/18445

Review comment:
   I did a couple of quick tests using V2 write API:
   ```
   sql("INSERT INTO h2.test.people SELECT 'bob', null")
   ```
   and
   ```
   sql("SELECT null AS ID, 'bob' AS NAME").writeTo("h2.test.people")
   ```
   I got Exception from h2 jdbc driver:
   ```
   Caused by: org.h2.jdbc.JdbcSQLException: NULL not allowed for column "ID"; 
SQL statement:
   INSERT INTO "test"."people" ("NAME","ID") VALUES (?,?) [23502-195]
at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)
   ```
   So we can pass the null value for the not-null column `ID` through to h2, and h2 blocks the insert.
   
   However, if I change the current code in `JDBCRDD.resolveTable` to make 
`alwaysNullable = false` to get the real nullable value, 
   ```
 def resolveTable(options: JDBCOptions): StructType = {
   
 ..
   
 JdbcUtils.getSchema(rs, dialect, alwaysNullable = false)
   ```
   For insert, I got Exception from Spark
   ```
   Cannot write incompatible data to table 'test.people':
   - 

[GitHub] [spark] huaxingao commented on pull request #29972: [SPARK-33081][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect)

2020-10-13 Thread GitBox


huaxingao commented on pull request #29972:
URL: https://github.com/apache/spark/pull/29972#issuecomment-707914377


   Thanks! @cloud-fan @MaxGekk 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on pull request #30031: [SPARK-33134][SQL] Return partial results only for root JSON objects

2020-10-13 Thread GitBox


MaxGekk commented on pull request #30031:
URL: https://github.com/apache/spark/pull/30031#issuecomment-707874151


   The changes conflict with `branch-3.0`. Here is the backport to 3.0: 
https://github.com/apache/spark/pull/30032



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk opened a new pull request #30032: [SPARK-33134][SQL][3.0] Return partial results only for root JSON objects

2020-10-13 Thread GitBox


MaxGekk opened a new pull request #30032:
URL: https://github.com/apache/spark/pull/30032


   ### What changes were proposed in this pull request?
   In the PR, I propose to restrict the partial result feature to root JSON objects only. The JSON datasource as well as `from_json()` will return `null` for malformed nested JSON objects.
   
   ### Why are the changes needed?
   1. To not raise an exception to users in the PERMISSIVE mode
   2. To fix a regression and have the same behavior as Spark 2.4.x
   3. The current implementation of partial results is supposed to work only for root (top-level) JSON objects, and is not tested for bad nested complex JSON fields.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. Before the changes, the code below:
   ```scala
   val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 
583651}]""").toDF("events")
   val event = new StructType().add("playerId", LongType).add("cards", 
ArrayType(new StructType().add("id", LongType).add("rank", StringType)))
   val pokerhand_events = pokerhand_raw.select(from_json($"events", 
ArrayType(event)).as("event"))
   pokerhand_events.show
   ```
   throws the exception even in the default **PERMISSIVE** mode:
   ```java
   java.lang.ClassCastException: java.lang.Long cannot be cast to 
org.apache.spark.sql.catalyst.util.ArrayData
 at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
 at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
 at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
   ```
   
   After the changes:
   ```
   +-----+
   |event|
   +-----+
   | null|
   +-----+
   ```
   
   ### How was this patch tested?
   Added a test to `JsonFunctionsSuite`.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xkrogen commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

2020-10-13 Thread GitBox


xkrogen commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-707866922


   +1 on this effort from me @sunchao ! Checking FS instance types is brittle 
and has caused us lots of headaches when we leverage wrapper-type FS instances. 
Allowing the `FileSystem` instance to understand its own capabilities and 
delegate appropriately is much better.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on a change in pull request #29855: [SPARK-32915][CORE] Network-layer and shuffle RPC layer changes to support push shuffle blocks

2020-10-13 Thread GitBox


Ngone51 commented on a change in pull request #29855:
URL: https://github.com/apache/spark/pull/29855#discussion_r504101431



##
File path: 
common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/ErrorHandlerSuite.java
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.network.shuffle;
+
+import java.net.ConnectException;
+
+import org.junit.Test;
+
+import static org.junit.Assert.*;
+
+/**
+ * Test suite for {@link ErrorHandler}
+ */
+public class ErrorHandlerSuite {
+
+  @Test
+  public void testPushErrorRetry() {
+ErrorHandler.BlockPushErrorHandler handler = new 
ErrorHandler.BlockPushErrorHandler();
+assertFalse(handler.shouldRetryError(new RuntimeException(new 
IllegalArgumentException(
+  ErrorHandler.BlockPushErrorHandler.TOO_LATE_MESSAGE_SUFFIX;
+assertFalse(handler.shouldRetryError(new RuntimeException(new 
ConnectException(;
+assertTrue(handler.shouldRetryError(new RuntimeException(new 
IllegalArgumentException(
+  
ErrorHandler.BlockPushErrorHandler.BLOCK_APPEND_COLLISION_DETECTED_MSG_PREFIX;

Review comment:
   > If the first attempt failed because of collision, then that block 
effectively hasn't been appended to the file yet, which makes it retriable.
   
   What makes the first attempt fail because of a collision? In my understanding, there are two possibilities:
   
   1. the same partition has already been merged by another task attempt
   2. the same partition is being merged by another task attempt
   
   
   For case 1, do we still need to retry? If we do retry in this case, doesn't 
it return `BLOCK_APPEND_COLLISION_DETECTED_MSG_PREFIX` again? 
   
   For case 2, I think it may make sense to retry, in case that attempt doesn't merge the partition successfully in the end.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a change in pull request #30030: [SPARK-33132][WEBUI] Make `formatBytes` return `0.0 B` for negative input instead of `NaN`

2020-10-13 Thread GitBox


gengliangwang commented on a change in pull request #30030:
URL: https://github.com/apache/spark/pull/30030#discussion_r504095721



##
File path: core/src/main/resources/org/apache/spark/ui/static/utils.js
##
@@ -39,7 +39,7 @@ function formatDuration(milliseconds) {
 
 function formatBytes(bytes, type) {
 if (type !== 'display') return bytes;
-if (bytes == 0) return '0.0 B';
+if (bytes <= 0) return '0.0 B';

Review comment:
   @akiyamaneko  Thanks for the fix!
   BTW, when will the `bytes` be negative here? 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #30030: [SPARK-33132][WEBUI] Make `formatBytes` return `0.0 B` for negative input instead of `NaN`

2020-10-13 Thread GitBox


dongjoon-hyun commented on pull request #30030:
URL: https://github.com/apache/spark/pull/30030#issuecomment-707862268


   +1, LGTM. Thank you for your first contribution, @akiyamaneko .
   - This is merged to `master` branch for Apache Spark 3.1.0.
   - I added you to the Apache Spark contributor group.
   - SPARK-33132 is assigned to you.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun closed pull request #30030: [SPARK-33132][WEBUI] Make `formatBytes` return `0.0 B` for negative input instead of `NaN`

2020-10-13 Thread GitBox


dongjoon-hyun closed pull request #30030:
URL: https://github.com/apache/spark/pull/30030


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30030: [SPARK-33132][WEBUI] Fix The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefined'

2020-10-13 Thread GitBox


dongjoon-hyun commented on a change in pull request #30030:
URL: https://github.com/apache/spark/pull/30030#discussion_r504091343



##
File path: core/src/main/resources/org/apache/spark/ui/static/utils.js
##
@@ -39,7 +39,7 @@ function formatDuration(milliseconds) {
 
 function formatBytes(bytes, type) {
 if (type !== 'display') return bytes;
-if (bytes == 0) return '0.0 B';
+if (bytes <= 0) return '0.0 B';

Review comment:
   Okay. We already prevent `-Infinity` at `Math.log(0)`. It looks safe if 
we prevent `NaN` at `Math.log(-1)`.
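   For reference, the boundary cases (shown in Scala for brevity; `Math.log` in the JavaScript UI code behaves the same way, assuming the usual log-based unit computation in `formatBytes`):
   ```scala
   // The displayed unit is derived from log(bytes) / log(1024), so:
   math.log(0)   // -Infinity (already handled by the bytes == 0 check)
   math.log(-1)  // NaN (this is what showed up as "NaN Undefined" before the fix)
   ```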





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sunchao commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

2020-10-13 Thread GitBox


sunchao commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-707856091


   @dongjoon-hyun created 
[SPARK-33135](https://issues.apache.org/jira/browse/SPARK-33135) and changed 
this PR's title.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun closed pull request #30028: [SPARK-33129][BUILD][DOCS] Updating the build/sbt references to test-only with testOnly for SBT 1.3.x

2020-10-13 Thread GitBox


dongjoon-hyun closed pull request #30028:
URL: https://github.com/apache/spark/pull/30028


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sunchao commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

2020-10-13 Thread GitBox


sunchao commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-707856299


   I still need to look into failing tests.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Victsm commented on a change in pull request #29855: [SPARK-32915][CORE] Network-layer and shuffle RPC layer changes to support push shuffle blocks

2020-10-13 Thread GitBox


Victsm commented on a change in pull request #29855:
URL: https://github.com/apache/spark/pull/29855#discussion_r504085933



##
File path: 
common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/ErrorHandlerSuite.java
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.network.shuffle;
+
+import java.net.ConnectException;
+
+import org.junit.Test;
+
+import static org.junit.Assert.*;
+
+/**
+ * Test suite for {@link ErrorHandler}
+ */
+public class ErrorHandlerSuite {
+
+  @Test
+  public void testPushErrorRetry() {
+ErrorHandler.BlockPushErrorHandler handler = new 
ErrorHandler.BlockPushErrorHandler();
+assertFalse(handler.shouldRetryError(new RuntimeException(new 
IllegalArgumentException(
+  ErrorHandler.BlockPushErrorHandler.TOO_LATE_MESSAGE_SUFFIX;
+assertFalse(handler.shouldRetryError(new RuntimeException(new 
ConnectException(;
+assertTrue(handler.shouldRetryError(new RuntimeException(new 
IllegalArgumentException(
+  
ErrorHandler.BlockPushErrorHandler.BLOCK_APPEND_COLLISION_DETECTED_MSG_PREFIX;

Review comment:
   When we append a block to the merged shuffle file, we either append it 
completely or we end up effectively writing nothing to the file.
   If the first attempt failed because of collision, then that block 
effectively hasn't been appended to the file yet, which makes it retriable.
   The 2nd PR, to be sent out soon, will include more details on this part.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

2020-10-13 Thread GitBox


viirya commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-70782


   cc @cloud-fan 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #30022: [SPARK-33090][BUILD][test-hadoop2.7] Upgrade Google Guava to 29.0-jre

2020-10-13 Thread GitBox


dongjoon-hyun commented on pull request #30022:
URL: https://github.com/apache/spark/pull/30022#issuecomment-707853981


   BTW, @sfcoy and @AngersZh .
   
   Regarding the following, the Apache Spark community wants to use the official Hadoop 3 client to cut down the dependencies dramatically.
   > Upgrade the Google Guava dependency for compatibility with Hadoop 3.2.1 
and Hadoop 3.3.0.
   
   Please see here.
   - https://github.com/apache/spark/pull/29843 (SPARK-29250 Upgrade to Hadoop 
3.2.1 and move to shaded client)
   
   After SPARK-29250, I guess this PR will be a general Guava version upgrade 
PR without any relation to `Hadoop 3.2.1`.
   
   cc @sunchao 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Victsm commented on a change in pull request #29855: [SPARK-32915][CORE] Network-layer and shuffle RPC layer changes to support push shuffle blocks

2020-10-13 Thread GitBox


Victsm commented on a change in pull request #29855:
URL: https://github.com/apache/spark/pull/29855#discussion_r504081782



##
File path: 
common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/ErrorHandlerSuite.java
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.network.shuffle;
+
+import java.net.ConnectException;
+
+import org.junit.Test;
+
+import static org.junit.Assert.*;
+
+/**
+ * Test suite for {@link ErrorHandler}
+ */
+public class ErrorHandlerSuite {
+
+  @Test
+  public void testPushErrorRetry() {
+ErrorHandler.BlockPushErrorHandler handler = new 
ErrorHandler.BlockPushErrorHandler();
+assertFalse(handler.shouldRetryError(new RuntimeException(new IllegalArgumentException(
+  ErrorHandler.BlockPushErrorHandler.TOO_LATE_MESSAGE_SUFFIX))));
+assertFalse(handler.shouldRetryError(new RuntimeException(new ConnectException())));
+assertTrue(handler.shouldRetryError(new RuntimeException(new IllegalArgumentException(
+  ErrorHandler.BlockPushErrorHandler.BLOCK_APPEND_COLLISION_DETECTED_MSG_PREFIX))));

Review comment:
   It's retriable because this block hasn't been appended to the merged 
shuffle file and the merge operation hasn't been finalized yet.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

2020-10-13 Thread GitBox


viirya commented on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-707851438


   retest this please



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29092: [SPARK-32295][SQL] Add not null and size > 0 filters before inner explode/inline to benefit from predicate pushdown

2020-10-13 Thread GitBox


cloud-fan commented on a change in pull request #29092:
URL: https://github.com/apache/spark/pull/29092#discussion_r504080387



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##
@@ -868,6 +866,41 @@ object TransposeWindow extends Rule[LogicalPlan] {
   }
 }
 
+/**
+ * Infers filters from [[Generate]], such that rows that would have been 
removed
+ * by this [[Generate]] can be removed earlier - before joins and in data 
sources.
+ */
+object InferFiltersFromGenerate extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+// This rule does not infer filters from foldable expressions to avoid 
constant filters
+// like 'size([1, 2, 3]) > 0'. These do not show up in child's constraints 
and
+// then the idempotence will break.
+case generate @ Generate(e, _, _, _, _, _)
+  if !e.deterministic || e.children.forall(_.foldable) => generate
+
+case generate @ Generate(g, _, false, _, _, _) if canInferFilters(g) =>
+  // Exclude child's constraints to guarantee idempotency
+  val inferredFilters = ExpressionSet(
+Seq(
+  GreaterThan(Size(g.children.head), Literal(0)),

Review comment:
   I don't think so...





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30022: [SPARK-33090][BUILD][test-hadoop2.7] Upgrade Google Guava to 29.0-jre

2020-10-13 Thread GitBox


AmplabJenkins removed a comment on pull request #30022:
URL: https://github.com/apache/spark/pull/30022#issuecomment-707484086


   Can one of the admins verify this patch?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on a change in pull request #29092: [SPARK-32295][SQL] Add not null and size > 0 filters before inner explode/inline to benefit from predicate pushdown

2020-10-13 Thread GitBox


viirya commented on a change in pull request #29092:
URL: https://github.com/apache/spark/pull/29092#discussion_r504079357



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##
@@ -868,6 +866,41 @@ object TransposeWindow extends Rule[LogicalPlan] {
   }
 }
 
+/**
+ * Infers filters from [[Generate]], such that rows that would have been 
removed
+ * by this [[Generate]] can be removed earlier - before joins and in data 
sources.
+ */
+object InferFiltersFromGenerate extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+// This rule does not infer filters from foldable expressions to avoid 
constant filters
+// like 'size([1, 2, 3]) > 0'. These do not show up in child's constraints 
and
+// then the idempotence will break.
+case generate @ Generate(e, _, _, _, _, _)
+  if !e.deterministic || e.children.forall(_.foldable) => generate
+
+case generate @ Generate(g, _, false, _, _, _) if canInferFilters(g) =>
+  // Exclude child's constraints to guarantee idempotency
+  val inferredFilters = ExpressionSet(
+Seq(
+  GreaterThan(Size(g.children.head), Literal(0)),

Review comment:
   This can be useful to filter rows before a join, but can we push down the 
`Size` expression to the data source?
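   To make the question concrete, a minimal Scala sketch of the rewrite this
rule performs (the `spark` session, table `t`, and array column `arr` are made
up; whether the inferred `size(arr) > 0` predicate can then be pushed into a
data source is exactly the open question):

   import org.apache.spark.sql.functions.{col, explode, size}

   // Query with an inner explode over an array column.
   val df = spark.table("t").select(explode(col("arr")).as("c"))

   // Conceptually, after InferFiltersFromGenerate the plan behaves as if we wrote:
   val equivalent = spark.table("t")
     .where(col("arr").isNotNull && size(col("arr")) > 0)  // inferred filters
     .select(explode(col("arr")).as("c"))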





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #30022: [SPARK-33090][BUILD][test-hadoop2.7] Upgrade Google Guava to 29.0-jre

2020-10-13 Thread GitBox


dongjoon-hyun commented on pull request #30022:
URL: https://github.com/apache/spark/pull/30022#issuecomment-707850088


   Thanks, @sfcoy . This is an interesting approach.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #30022: [SPARK-33090][BUILD][test-hadoop2.7] Upgrade Google Guava to 29.0-jre

2020-10-13 Thread GitBox


dongjoon-hyun commented on pull request #30022:
URL: https://github.com/apache/spark/pull/30022#issuecomment-707849045


   ok to test



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #29843: [SPARK-29250][BUILD][test-maven][test-hadoop2.7] Upgrade to Hadoop 3.2.1 and move to shaded client

2020-10-13 Thread GitBox


dongjoon-hyun commented on pull request #29843:
URL: https://github.com/apache/spark/pull/29843#issuecomment-707843544


   Among the following, `HivePartitionFilteringSuite.2.2: create client with 
sharesHadoopClasses = false` seems to fail on GitHub Actions, too. Is this a new 
failure after the last dependency change commit?
   ![Screen Shot 2020-10-13 at 9 00 44 
AM](https://user-images.githubusercontent.com/9700541/95885860-a6a09600-0d32-11eb-8884-a6cdea031ba7.png)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Victsm commented on pull request #29855: [SPARK-32915][CORE] Network-layer and shuffle RPC layer changes to support push shuffle blocks

2020-10-13 Thread GitBox


Victsm commented on pull request #29855:
URL: https://github.com/apache/spark/pull/29855#issuecomment-707841160


   Thanks for the additional review comments from @jiangxb1987 @Ngone51. I 
should have resolved all pending issues.
   
   Ping @attilapiros @tgravescs @mridulm to see if there're any additional 
concerns on the PR and if we can get a +1.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Victsm commented on a change in pull request #29855: [SPARK-32915][CORE] Network-layer and shuffle RPC layer changes to support push shuffle blocks

2020-10-13 Thread GitBox


Victsm commented on a change in pull request #29855:
URL: https://github.com/apache/spark/pull/29855#discussion_r504066706



##
File path: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/OneForOneBlockPusher.java
##
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.network.shuffle;
+
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.Map;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.spark.network.buffer.ManagedBuffer;
+import org.apache.spark.network.buffer.NioManagedBuffer;
+import org.apache.spark.network.client.RpcResponseCallback;
+import org.apache.spark.network.client.TransportClient;
+import org.apache.spark.network.shuffle.protocol.PushBlockStream;
+
+/**
+ * Similar to {@link OneForOneBlockFetcher}, but for pushing blocks to remote 
shuffle service to
+ * be merged instead of for fetching them from remote shuffle services. This 
is used by
+ * ShuffleWriter when the block push process is initiated. The supplied 
BlockFetchingListener
+ * is used to handle the success or failure in pushing each block.
+ */
+public class OneForOneBlockPusher {
+  private static final Logger logger = 
LoggerFactory.getLogger(OneForOneBlockPusher.class);
+  private static final ErrorHandler PUSH_ERROR_HANDLER = new 
ErrorHandler.BlockPushErrorHandler();
+
+  private final TransportClient client;
+  private final String appId;
+  private final String[] blockIds;
+  private final BlockFetchingListener listener;
+  private final Map<String, ManagedBuffer> buffers;
+
+  public OneForOneBlockPusher(
+  TransportClient client,
+  String appId,
+  String[] blockIds,
+  BlockFetchingListener listener,
+  Map<String, ManagedBuffer> buffers) {
+this.client = client;
+this.appId = appId;
+this.blockIds = blockIds;
+this.listener = listener;
+this.buffers = buffers;
+  }
+
+  private class BlockPushCallback implements RpcResponseCallback {
+
+private int index;
+private String blockId;
+
+BlockPushCallback(int index, String blockId) {
+  this.index = index;
+  this.blockId = blockId;
+}
+
+@Override
+public void onSuccess(ByteBuffer response) {
+  // On receipt of a successful block push
+  listener.onBlockFetchSuccess(blockId, new 
NioManagedBuffer(ByteBuffer.allocate(0)));
+}
+
+@Override
+public void onFailure(Throwable e) {
+  // Since block push is best effort, i.e., if we encountered a block push 
failure that's not
+  // retriable or exceeding the max retries, we should not fail all 
remaining block pushes.
+  // The best effort nature makes block push tolerant of a partial 
completion. Thus, we only
+  // fail the block that's actually failed. Note that, on the 
RetryingBlockFetcher side, once
+  // retry is initiated, it would still invalidate the previous active 
retry listener, and
+  // retry all outstanding blocks. We are preventing forwarding 
unnecessary block push failures
+  // to the parent listener of the retry listener. The only exceptions 
would be if the block
+  // push failure is due to block arriving on the server side after merge 
finalization, or the
+  // client fails to establish connection to the server side. In both 
cases, we would fail all
+  // remaining blocks.
+  if (PUSH_ERROR_HANDLER.shouldRetryError(e)) {
+String[] targetBlockId = Arrays.copyOfRange(blockIds, index, index + 
1);
+failRemainingBlocks(targetBlockId, e);
+  } else {
+String[] targetBlockId = Arrays.copyOfRange(blockIds, index, 
blockIds.length);

Review comment:
   If it's not retriable, the exception will then be surfaced to the parent 
listener, which will actually handle the failure.
   The log will only be printed for exceptions that are not retriable, and 
thus it should only be printed once per block by 
`RetryingBlockFetchListener`.
   It will, however, print the log for the entire batch of blocks in the 
invocation of `failRemainingBlocks` with a non-retriable exception.
   This is the current behavior of Spark for block fetch failure as well.
   
   In terms of parent listener, 

[GitHub] [spark] Ngone51 commented on a change in pull request #29855: [SPARK-32915][CORE] Network-layer and shuffle RPC layer changes to support push shuffle blocks

2020-10-13 Thread GitBox


Ngone51 commented on a change in pull request #29855:
URL: https://github.com/apache/spark/pull/29855#discussion_r504063893



##
File path: 
common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/ErrorHandlerSuite.java
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.network.shuffle;
+
+import java.net.ConnectException;
+
+import org.junit.Test;
+
+import static org.junit.Assert.*;
+
+/**
+ * Test suite for {@link ErrorHandler}
+ */
+public class ErrorHandlerSuite {
+
+  @Test
+  public void testPushErrorRetry() {
+ErrorHandler.BlockPushErrorHandler handler = new 
ErrorHandler.BlockPushErrorHandler();
+assertFalse(handler.shouldRetryError(new RuntimeException(new IllegalArgumentException(
+  ErrorHandler.BlockPushErrorHandler.TOO_LATE_MESSAGE_SUFFIX))));
+assertFalse(handler.shouldRetryError(new RuntimeException(new ConnectException())));
+assertTrue(handler.shouldRetryError(new RuntimeException(new IllegalArgumentException(
+  ErrorHandler.BlockPushErrorHandler.BLOCK_APPEND_COLLISION_DETECTED_MSG_PREFIX))));

Review comment:
   Ah..sorry. I do mean the same shuffle partition rather than the same 
block.
   
   WDYT?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

2020-10-13 Thread GitBox


cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r504061074



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##
@@ -30,12 +32,13 @@ import org.apache.spark.sql.execution.metric.{SQLMetric, 
SQLMetrics}
 import org.apache.spark.util.SerializableConfiguration
 
 
+
 /**
  * Simple metrics collected during an instance of [[FileFormatDataWriter]].
  * These were first introduced in https://github.com/apache/spark/pull/18159 
(SPARK-20703).
  */
 case class BasicWriteTaskStats(
-numPartitions: Int,
+partitions: Seq[InternalRow],

Review comment:
   End-to-end performance of an INSERT query, with a partitioned table that 
has 1, 2, or 3 partition columns.
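   For example, a rough benchmark sketch along those lines (table name, sizes,
and column layout are made up; the idea is just to time the same INSERT before
and after this change):

   spark.sql(
     """CREATE TABLE sink (v BIGINT, p1 BIGINT, p2 BIGINT, p3 BIGINT)
       |USING parquet PARTITIONED BY (p1, p2, p3)""".stripMargin)

   val start = System.nanoTime()
   spark.sql(
     """INSERT INTO sink
       |SELECT id AS v, id % 10 AS p1, id % 100 AS p2, id % 1000 AS p3
       |FROM range(100)""".stripMargin)
   println(s"INSERT with 3 partition columns took ${(System.nanoTime() - start) / 1e6} ms")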





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] tanelk commented on pull request #28224: [SPARK-31429][SQL][DOC] Automatically generates a SQL document for built-in functions

2020-10-13 Thread GitBox


tanelk commented on pull request #28224:
URL: https://github.com/apache/spark/pull/28224#issuecomment-707825409


   Yeah, I can take a look



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Victsm commented on a change in pull request #29855: [SPARK-32915][CORE] Network-layer and shuffle RPC layer changes to support push shuffle blocks

2020-10-13 Thread GitBox


Victsm commented on a change in pull request #29855:
URL: https://github.com/apache/spark/pull/29855#discussion_r504050374



##
File path: 
common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/ErrorHandlerSuite.java
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.network.shuffle;
+
+import java.net.ConnectException;
+
+import org.junit.Test;
+
+import static org.junit.Assert.*;
+
+/**
+ * Test suite for {@link ErrorHandler}
+ */
+public class ErrorHandlerSuite {
+
+  @Test
+  public void testPushErrorRetry() {
+ErrorHandler.BlockPushErrorHandler handler = new 
ErrorHandler.BlockPushErrorHandler();
+assertFalse(handler.shouldRetryError(new RuntimeException(new IllegalArgumentException(
+  ErrorHandler.BlockPushErrorHandler.TOO_LATE_MESSAGE_SUFFIX))));
+assertFalse(handler.shouldRetryError(new RuntimeException(new ConnectException())));
+assertTrue(handler.shouldRetryError(new RuntimeException(new IllegalArgumentException(
+  ErrorHandler.BlockPushErrorHandler.BLOCK_APPEND_COLLISION_DETECTED_MSG_PREFIX))));

Review comment:
   It's not the same block, but another block belonging to the same shuffle 
partition.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a change in pull request #30027: [SPARK-32069][CORE][SQL] Improve error message on reading unexpected directory

2020-10-13 Thread GitBox


gengliangwang commented on a change in pull request #30027:
URL: https://github.com/apache/spark/pull/30027#discussion_r504046931



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSourceSuite.scala
##
@@ -287,7 +287,11 @@ class HiveParquetSourceSuite extends 
ParquetPartitioningTest {
   val msg = intercept[IOException] {
 sql("SELECT * FROM tbl1").show()
   }.getMessage
-  assert(msg.contains("Not a file:"))
+  assert(msg.contains(

Review comment:
   how about creating a method for checking the exception message?
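   Something along these lines, perhaps (a hypothetical helper inside the test
suite; the expected message fragments below are placeholders, not the exact
strings this PR introduces):

   private def checkUnexpectedDirectoryError(f: => Unit): Unit = {
     // Run the query and make sure the improved "reading a directory" error surfaces.
     val msg = intercept[java.io.IOException](f).getMessage
     assert(msg.contains("is a directory") || msg.contains("Not a file:"),
       s"Unexpected error message: $msg")
   }

   // usage in a test
   checkUnexpectedDirectoryError {
     sql("SELECT * FROM tbl1").show()
   }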





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


