[GitHub] [spark] gaborgsomogyi commented on a change in pull request #30389: [SPARK-33143][PYTHON] Add configurable timeout to python server and client

2020-11-17 Thread GitBox


gaborgsomogyi commented on a change in pull request #30389:
URL: https://github.com/apache/spark/pull/30389#discussion_r525876645



##
File path: core/src/test/scala/org/apache/spark/api/python/PythonRDDSuite.scala
##
@@ -76,12 +79,22 @@ class PythonRDDSuite extends SparkFunSuite with LocalSparkContext {
   }
 
   test("python server error handling") {
-val authHelper = new SocketAuthHelper(new SparkConf())
-val errorServer = new ExceptionPythonServer(authHelper)
-val client = new Socket(InetAddress.getLoopbackAddress(), errorServer.port)
-authHelper.authToServer(client)
-val ex = intercept[Exception] { errorServer.getResult(Duration(1, "second")) }
-assert(ex.getCause().getMessage().contains("exception within handleConnection"))
+val savedSparkEnv = SparkEnv.get
+try {
+  val conf = new SparkConf()
+  val env = mock(classOf[SparkEnv])
+  doReturn(conf).when(env).conf
+  SparkEnv.set(env)

Review comment:
   It's not testing the newly added code in any way. The test blew up with an NPE because `SparkEnv.get` was introduced in the server-side code.
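   As a minimal sketch of the mock-and-restore pattern the updated test uses (assuming Mockito and the try/finally structure implied by the diff above; the elided middle is whatever the test actually exercises):

```scala
import org.mockito.Mockito.{doReturn, mock}
import org.apache.spark.{SparkConf, SparkEnv}

// Swap in a mocked SparkEnv so server-side SparkEnv.get.conf calls succeed,
// then restore the real one so the mock does not leak into other tests.
val savedSparkEnv = SparkEnv.get
try {
  val conf = new SparkConf()
  val env = mock(classOf[SparkEnv])
  doReturn(conf).when(env).conf  // the server reads its timeout via SparkEnv.get.conf
  SparkEnv.set(env)
  // ... start the server, connect, and assert on the failure as before ...
} finally {
  SparkEnv.set(savedSparkEnv)
}
```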








[GitHub] [spark] SparkQA commented on pull request #30406: [SPARK-33473][SQL] Extend interpreted subexpression elimination to other interpreted projections

2020-11-17 Thread GitBox


SparkQA commented on pull request #30406:
URL: https://github.com/apache/spark/pull/30406#issuecomment-729502912


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35861/
   






[GitHub] [spark] gaborgsomogyi commented on pull request #30336: [SPARK-33287][SS][UI]Expose state custom metrics information on SS UI

2020-11-17 Thread GitBox


gaborgsomogyi commented on pull request #30336:
URL: https://github.com/apache/spark/pull/30336#issuecomment-729502108


   > O.K, it seems to be better to document how to get the list of custom 
metrics.
   
   I'll add it to the doc of the new parameter.
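   As a hedged sketch of what that documentation could show users, one can list the available custom metric names from a query's last progress (a `StreamingQuery` named `query` is assumed to be in scope):

```scala
import scala.collection.JavaConverters._

// customMetrics is a java.util.Map on each StateOperatorProgress, hence .asScala.
val customMetricNames =
  query.lastProgress.stateOperators.flatMap(_.customMetrics.keySet.asScala).distinct
println(customMetricNames.mkString(", "))
```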






[GitHub] [spark] gaborgsomogyi commented on a change in pull request #30336: [SPARK-33287][SS][UI]Expose state custom metrics information on SS UI

2020-11-17 Thread GitBox


gaborgsomogyi commented on a change in pull request #30336:
URL: https://github.com/apache/spark/pull/30336#discussion_r525874550



##
File path: sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryStatisticsPage.scala
##
@@ -236,12 +239,61 @@ private[ui] class StreamingQueryStatisticsPage(parent: StreamingQueryTab)
   {graphUIDataForNumRowsDroppedByWatermark.generateTimelineHtml(jsCollector)}
   {graphUIDataForNumRowsDroppedByWatermark.generateHistogramHtml(jsCollector)}
 
-  // scalastyle:on
+// scalastyle:on
+
+  result ++= generateAggregatedCustomMetrics(query, minBatchTime, maxBatchTime, jsCollector)
+  result
 } else {
   new NodeBuffer()
 }
   }
 
+  def generateAggregatedCustomMetrics(
+  query: StreamingQueryUIData,
+  minBatchTime: Long,
+  maxBatchTime: Long,
+  jsCollector: JsCollector): NodeBuffer = {
+val result: NodeBuffer = new NodeBuffer
+
+// This is made sure on caller side but put it here to be defensive
+require(query.lastProgress.stateOperators.nonEmpty)
+val enabledCustomMetrics = parent.parent.conf.get(ENABLED_STREAMING_UI_CUSTOM_METRIC_LIST)
+logDebug(s"Enabled custom metrics: $enabledCustomMetrics")
+query.lastProgress.stateOperators.head.customMetrics.keySet().asScala
+  .filter(enabledCustomMetrics.contains(_)).map { metricName =>
+val data = query.recentProgress.map(p => (parseProgressTimestamp(p.timestamp),
+  p.stateOperators.map(_.customMetrics.get(metricName).toDouble).sum))
+val max = data.maxBy(_._2)._2
+
+val graphUIData =
+  new GraphUIData(
+s"aggregated-$metricName-timeline",
+s"aggregated-$metricName-histogram",
+data,
+minBatchTime,
+maxBatchTime,
+0,
+max,
+"")

Review comment:
   Since we're not calling `init`, it doesn't play out the way you've mentioned. The restriction is just not to put anything nasty in the constructor. I can live with it; I just wanted to highlight it. Since I don't sense any objection from @HeartSaVioR, I'm starting to add it.
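   As an aside for readers, here is a self-contained sketch of the aggregation `generateAggregatedCustomMetrics` performs, using simplified stand-in types (the real ones are `StreamingQueryProgress`/`StateOperatorProgress`); everything below is illustrative only:

```scala
// A metric is summed across all state operators of a batch, yielding one
// (timestamp, value) point per batch; the max of the series bounds the y-axis.
case class OpProgress(customMetrics: Map[String, Long])
case class BatchProgress(timestampMs: Long, stateOperators: Seq[OpProgress])

def aggregate(progress: Seq[BatchProgress], metric: String): Seq[(Long, Double)] =
  progress.map { p =>
    (p.timestampMs, p.stateOperators.map(_.customMetrics.getOrElse(metric, 0L).toDouble).sum)
  }

val batches = Seq(
  BatchProgress(1000L, Seq(OpProgress(Map("loadedMapCacheHitCount" -> 3L)))),
  BatchProgress(2000L, Seq(OpProgress(Map("loadedMapCacheHitCount" -> 5L)))))
assert(aggregate(batches, "loadedMapCacheHitCount").map(_._2).max == 5.0)
```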








[GitHub] [spark] gengliangwang commented on pull request #28841: [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source

2020-11-17 Thread GitBox


gengliangwang commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-729501113


   I'm sorry that I was focusing on other tasks and couldn't follow this thread.
   Thanks for the great work, @cchighman!






[GitHub] [spark] SparkQA commented on pull request #30373: [SPARK-33472][SQL] Adjust RemoveRedundantSorts rule order

2020-11-17 Thread GitBox


SparkQA commented on pull request #30373:
URL: https://github.com/apache/spark/pull/30373#issuecomment-729501057


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35859/
   








[GitHub] [spark] SparkQA commented on pull request #30164: [SPARK-32919][SHUFFLE][test-maven][test-hadoop2.7] Driver side changes for coordinating push based shuffle by selecting external shuffle serv

2020-11-17 Thread GitBox


SparkQA commented on pull request #30164:
URL: https://github.com/apache/spark/pull/30164#issuecomment-729500038


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35858/
   






[GitHub] [spark] SparkQA commented on pull request #30406: [SPARK-33473][SQL] Extend interpreted subexpression elimination to other interpreted projections

2020-11-17 Thread GitBox


SparkQA commented on pull request #30406:
URL: https://github.com/apache/spark/pull/30406#issuecomment-729499173


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35860/
   







[GitHub] [spark] SparkQA commented on pull request #30164: [SPARK-32919][SHUFFLE][test-maven][test-hadoop2.7] Driver side changes for coordinating push based shuffle by selecting external shuffle serv

2020-11-17 Thread GitBox


SparkQA commented on pull request #30164:
URL: https://github.com/apache/spark/pull/30164#issuecomment-729494347


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35857/
   






[GitHub] [spark] xuzikun2003 commented on a change in pull request #29725: [SPARK-32096][SQL] Improve sorting performance of Spark SQL window function by removing window partition key from sort order

2020-11-17 Thread GitBox


xuzikun2003 commented on a change in pull request #29725:
URL: https://github.com/apache/spark/pull/29725#discussion_r525862140



##
File path: sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowWindowSorter.java
##
@@ -0,0 +1,453 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution;
+
+import com.google.common.annotations.VisibleForTesting;
+
+import java.util.Comparator;
+import java.util.HashMap;
+import java.util.LinkedList;
+import java.util.Map.Entry;
+import java.util.Queue;
+import java.util.TreeMap;
+import java.io.IOException;
+import java.util.function.Supplier;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import scala.collection.Iterator;
+import scala.math.Ordering;
+
+import org.apache.spark.memory.SparkOutOfMemoryError;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.util.collection.unsafe.sort.PrefixComparator;
+import org.apache.spark.util.collection.unsafe.sort.RecordComparator;
+
+public final class UnsafeExternalRowWindowSorter extends AbstractUnsafeExternalRowSorter {
+
+  private static final Logger logger = LoggerFactory.getLogger(UnsafeExternalRowWindowSorter.class);
+
+  private final StructType schema;
+  private final UnsafeProjection partitionSpecProjection;
+  private final Ordering orderingOfPartitionKey;
+  private final Ordering orderingInWindow;
+  private final Ordering orderingAcrossWindows;
+  private final PrefixComparator prefixComparatorInWindow;
+  private final UnsafeExternalRowSorter.PrefixComputer prefixComputerInWindow;
+  private final boolean canUseRadixSortInWindow;
+  private final long pageSizeBytes;
+  private static final int windowSorterMapMaxSize = 1;

Review comment:
   @maropu, @opensky142857, here are the reasons why we set windowSorterMapMaxSize to 1 and why we reduce the page size of each sorter.
   
   Each UnsafeExternalRowSorter uses a different memory consumer. Whenever you insert the first row into an UnsafeExternalRowSorter, the memory consumer of this sorter allocates a whole page to the sorter. In our perf run of TPCDS100TB, the default page size is 64MB. If we insert only a few rows into a sorter corresponding to a window, a lot of memory is wasted, and the unnecessary memory allocation also brings significant performance overhead. That is why we do two things in this PR:
   1. Keep the number of window sorters small.
   2. Decrease the page size of each window sorter.
   
   To address this problem, there are actually two directions we could go.
   
   One direction is to let these window sorters share the same memory consumer, so we would not allocate many big pages into which very few rows are inserted. But this direction requires a lot of engineering effort to refactor the code of UnsafeExternalSorter.
   
   The second direction is to keep only one window sorter for each physical partition.
   
   Here is why we chose the second direction. When we run TPCDS100TB, we are not seeing that the Spark engine is slow at sorting many windows in a physical partition. We are seeing that it is slow at sorting a single window in a single physical partition (q67 is such a case), with the executor doing a lot of unnecessary comparisons on the window partition key. To address that slowness, we follow the second direction and keep only one window sorter per physical partition. This single window sorter does not need to compare the window partition key and thus runs almost 2x faster.
   
   Perhaps I can rename these parameters to avoid confusion. What do you think?
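   To make the saved-comparison argument concrete, here is a toy sketch (not the PR's code; `Row` and both orderings are hypothetical stand-ins): when every row in a sorter belongs to a single window partition, the partition-key prefix of the ordering can be dropped without changing the result:

```scala
case class Row(partKey: Int, orderKey: Int)

// Ordering needed when rows from many window partitions share one sorter:
val orderingAcrossWindows: Ordering[Row] = Ordering.by(r => (r.partKey, r.orderKey))
// Ordering sufficient when a sorter holds exactly one window partition:
val orderingInWindow: Ordering[Row] = Ordering.by(_.orderKey)

val oneWindow = Seq(Row(7, 3), Row(7, 1), Row(7, 2))
// Same output, but orderingInWindow skips the redundant partition-key compare.
assert(oneWindow.sorted(orderingAcrossWindows) == oneWindow.sorted(orderingInWindow))
```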






[GitHub] [spark] dongjoon-hyun commented on pull request #25870: [SPARK-27936][K8S] Support python deps

2020-11-17 Thread GitBox


dongjoon-hyun commented on pull request #25870:
URL: https://github.com/apache/spark/pull/25870#issuecomment-729492170


   Gentle ping, @skonto .






[GitHub] [spark] gengliangwang commented on a change in pull request #30407: [SPARK-32852][SQL][Doc] Revise the documentation of spark.sql.hive.metastore.jars

2020-11-17 Thread GitBox


gengliangwang commented on a change in pull request #30407:
URL: https://github.com/apache/spark/pull/30407#discussion_r525863735



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -96,17 +96,18 @@ private[spark] object HiveUtils extends Logging {
 .createWithDefault("builtin")
 
   val HIVE_METASTORE_JARS_PATH = 
buildStaticConf("spark.sql.hive.metastore.jars.path")
-.doc(s"Comma separated URL of Hive jars, support both local and remote 
paths," +
-  s"Such as: " +
-  s" 1. file://path/to/jar/xxx.jar\n" +
-  s" 2. hdfs://nameservice/path/to/jar/xxx.jar\n" +
-  s" 3. /path/to/jar/ (path without URI scheme follow conf 
`fs.defaultFS`'s URI schema)\n" +
-  s" 4. [http/https/ftp]://path/to/jar/xxx.jar\n" +
-  s"Notice: `http/https/ftp` doesn't support wildcard, but other URLs 
support" +
-  s"nested path wildcard, Such as: " +
-  s" 1. file://path/to/jar/*, file://path/to/jar/*/*\n" +
-  s" 2. hdfs://nameservice/path/to/jar/*, 
hdfs://nameservice/path/to/jar/*/*\n" +
-  s"When ${HIVE_METASTORE_JARS.key} is set to `path`, we will use Hive 
jars configured by this")
+.doc(s"""
+  | Comma-sperated paths of the jars that should be used to instantiate 
the HiveMetastoreClient.

Review comment:
   Thanks, updated








[GitHub] [spark] dongjoon-hyun commented on pull request #30405: [SPARK-33476][CORE] Generalize ExecutorSource to expose user-given file system schemes

2020-11-17 Thread GitBox


dongjoon-hyun commented on pull request #30405:
URL: https://github.com/apache/spark/pull/30405#issuecomment-729490935


   Thank you, @viirya !






[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30407: [SPARK-32852][SQL][Doc] Revise the documentation of spark.sql.hive.metastore.jars

2020-11-17 Thread GitBox


dongjoon-hyun commented on a change in pull request #30407:
URL: https://github.com/apache/spark/pull/30407#discussion_r525862572



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -96,17 +96,18 @@ private[spark] object HiveUtils extends Logging {
 .createWithDefault("builtin")
 
   val HIVE_METASTORE_JARS_PATH = 
buildStaticConf("spark.sql.hive.metastore.jars.path")
-.doc(s"Comma separated URL of Hive jars, support both local and remote 
paths," +
-  s"Such as: " +
-  s" 1. file://path/to/jar/xxx.jar\n" +
-  s" 2. hdfs://nameservice/path/to/jar/xxx.jar\n" +
-  s" 3. /path/to/jar/ (path without URI scheme follow conf 
`fs.defaultFS`'s URI schema)\n" +
-  s" 4. [http/https/ftp]://path/to/jar/xxx.jar\n" +
-  s"Notice: `http/https/ftp` doesn't support wildcard, but other URLs 
support" +
-  s"nested path wildcard, Such as: " +
-  s" 1. file://path/to/jar/*, file://path/to/jar/*/*\n" +
-  s" 2. hdfs://nameservice/path/to/jar/*, 
hdfs://nameservice/path/to/jar/*/*\n" +
-  s"When ${HIVE_METASTORE_JARS.key} is set to `path`, we will use Hive 
jars configured by this")
+.doc(s"""
+  | Comma-sperated paths of the jars that should be used to instantiate 
the HiveMetastoreClient.
+  | This configuration is effective only when `{$HIVE_METASTORE_JARS.key}` 
is set as `path`.
+  | The paths can be any of the following format:
+  | 1. file://path/to/jar/foo.jar
+  | 2. hdfs://nameservice/path/to/jar/foo.jar
+  | 3. /path/to/jar/ (path without URI scheme follow conf `fs.defaultFS`'s 
URI schema)
+  | 4. [http/https/ftp]://path/to/jar/foo.jar
+  | All the above formats support wildcard except for the `http/https/ftp` URL. For example:

Review comment:
   Maybe, the following is shorter?
   ```
   - All the above formats support wildcard except for the `http/https/ftp` URL. For example:
   + 1, 2, and 3 support wildcard. For example:
   ```








[GitHub] [spark] SparkQA commented on pull request #30373: [SPARK-33472][SQL] Adjust RemoveRedundantSorts rule order

2020-11-17 Thread GitBox


SparkQA commented on pull request #30373:
URL: https://github.com/apache/spark/pull/30373#issuecomment-729490026


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35859/
   






[GitHub] [spark] xuzikun2003 commented on a change in pull request #29725: [SPARK-32096][SQL] Improve sorting performance of Spark SQL window function by removing window partition key from sort order

2020-11-17 Thread GitBox


xuzikun2003 commented on a change in pull request #29725:
URL: https://github.com/apache/spark/pull/29725#discussion_r525862140



##
File path: sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowWindowSorter.java
##
@@ -0,0 +1,453 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution;
+
+import com.google.common.annotations.VisibleForTesting;
+
+import java.util.Comparator;
+import java.util.HashMap;
+import java.util.LinkedList;
+import java.util.Map.Entry;
+import java.util.Queue;
+import java.util.TreeMap;
+import java.io.IOException;
+import java.util.function.Supplier;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import scala.collection.Iterator;
+import scala.math.Ordering;
+
+import org.apache.spark.memory.SparkOutOfMemoryError;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.util.collection.unsafe.sort.PrefixComparator;
+import org.apache.spark.util.collection.unsafe.sort.RecordComparator;
+
+public final class UnsafeExternalRowWindowSorter extends AbstractUnsafeExternalRowSorter {
+
+  private static final Logger logger = LoggerFactory.getLogger(UnsafeExternalRowWindowSorter.class);
+
+  private final StructType schema;
+  private final UnsafeProjection partitionSpecProjection;
+  private final Ordering orderingOfPartitionKey;
+  private final Ordering orderingInWindow;
+  private final Ordering orderingAcrossWindows;
+  private final PrefixComparator prefixComparatorInWindow;
+  private final UnsafeExternalRowSorter.PrefixComputer prefixComputerInWindow;
+  private final boolean canUseRadixSortInWindow;
+  private final long pageSizeBytes;
+  private static final int windowSorterMapMaxSize = 1;

Review comment:
   @maropu, @opensky142857, here are the reasons why we set windowSorterMapMaxSize to 1 and why we reduce the page size of each sorter.
   
   Each UnsafeExternalRowSorter uses a different memory consumer. Whenever you insert the first row into an UnsafeExternalRowSorter, the memory consumer of this sorter allocates a whole page to the sorter. In our perf run of TPCDS100TB, the default page size is 64MB. If we insert only a few rows into a sorter corresponding to a window, a lot of memory is wasted, and the unnecessary memory allocation also brings significant overhead. That is why we do two things in this PR:
   1. Keep the number of window sorters small.
   2. Decrease the page size of each window sorter.
   
   To address this problem, there are actually two directions we could go.
   
   One direction is to let these window sorters share the same memory consumer, so we would not allocate many big pages into which very few rows are inserted.
   
   The second direction is to keep only one window sorter for each physical partition.
   
   Here is why we chose the second direction. When we run TPCDS100TB, we are not seeing that the Spark engine is slow at sorting many windows in a physical partition. We are seeing that it is slow at sorting a single window in a single physical partition (q67 is such a case), with the executor doing a lot of unnecessary comparisons on the window partition key. To address that slowness, we follow the second direction and keep only one window sorter per physical partition. This single window sorter does not need to compare the window partition key.
   
   Perhaps I can rename these parameters to avoid confusion. What do you think?

##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/SortExec.scala
##
@@ -42,6 +43,139 @@ case class SortExec(
 global: Boolean,
 child: SparkPlan,
 testSpillFrequency: Int = 0)
+  extends SortExecBase(
+sortOrder,
+global,
+child,
+testSpillFrequency) {
+
+  def createSorter(): UnsafeExternalRowSorter = {

Review comment:
   Sure, will add.
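   For readers following the diff, a minimal sketch of the `SortExec`/`SortExecBase` split it introduces: the base class drives execution and each subclass picks its sorter via `createSorter`. All types below are simplified stand-ins, not Spark's:

```scala
trait RowSorter { def sort(rows: Iterator[Long]): Iterator[Long] }

abstract class SortExecBase {
  def createSorter(): RowSorter  // hook overridden by each concrete node
  def executeSort(rows: Iterator[Long]): Iterator[Long] = createSorter().sort(rows)
}

// Plain sort, analogous to SortExec; a window-aware subclass would return an
// UnsafeExternalRowWindowSorter-like implementation instead.
class SimpleSortExec extends SortExecBase {
  override def createSorter(): RowSorter = new RowSorter {
    def sort(rows: Iterator[Long]): Iterator[Long] = rows.toSeq.sorted.iterator
  }
}
```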

##
File path: 

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30407: [SPARK-32852][SQL][Doc] Revise the documentation of spark.sql.hive.metastore.jars

2020-11-17 Thread GitBox


dongjoon-hyun commented on a change in pull request #30407:
URL: https://github.com/apache/spark/pull/30407#discussion_r525861673



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -96,17 +96,18 @@ private[spark] object HiveUtils extends Logging {
 .createWithDefault("builtin")
 
   val HIVE_METASTORE_JARS_PATH = 
buildStaticConf("spark.sql.hive.metastore.jars.path")
-.doc(s"Comma separated URL of Hive jars, support both local and remote 
paths," +
-  s"Such as: " +
-  s" 1. file://path/to/jar/xxx.jar\n" +
-  s" 2. hdfs://nameservice/path/to/jar/xxx.jar\n" +
-  s" 3. /path/to/jar/ (path without URI scheme follow conf 
`fs.defaultFS`'s URI schema)\n" +
-  s" 4. [http/https/ftp]://path/to/jar/xxx.jar\n" +
-  s"Notice: `http/https/ftp` doesn't support wildcard, but other URLs 
support" +
-  s"nested path wildcard, Such as: " +
-  s" 1. file://path/to/jar/*, file://path/to/jar/*/*\n" +
-  s" 2. hdfs://nameservice/path/to/jar/*, 
hdfs://nameservice/path/to/jar/*/*\n" +
-  s"When ${HIVE_METASTORE_JARS.key} is set to `path`, we will use Hive 
jars configured by this")
+.doc(s"""
+  | Comma-sperated paths of the jars that should be used to instantiate 
the HiveMetastoreClient.

Review comment:
   Could you revert from `sperated` (this) to `separated` (original)?








[GitHub] [spark] SparkQA commented on pull request #30164: [SPARK-32919][SHUFFLE][test-maven][test-hadoop2.7] Driver side changes for coordinating push based shuffle by selecting external shuffle serv

2020-11-17 Thread GitBox


SparkQA commented on pull request #30164:
URL: https://github.com/apache/spark/pull/30164#issuecomment-729489017


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35858/
   






[GitHub] [spark] LuciferYang edited a comment on pull request #30234: [SPARK-33285][CORE][SQL] Fix deprecated compilation warnings of "Auto-application to () is deprecated" in Scala 2.13

2020-11-17 Thread GitBox


LuciferYang edited a comment on pull request #30234:
URL: https://github.com/apache/spark/pull/30234#issuecomment-729487796


   It seems all cases have been fixed except for the `RDD.isEmpty`-related ones.
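   For reference, a tiny illustration of the Scala 2.13 deprecation this PR fixes (the example is illustrative, not from the PR):

```scala
class Counter {
  def reset(): Unit = println("reset")
}

val c = new Counter
// c.reset    // Scala 2.13 warns: "Auto-application to () is deprecated"
c.reset()     // the fix: call the empty-paren method with explicit parentheses
```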






[GitHub] [spark] LuciferYang commented on pull request #30234: [SPARK-33285][CORE][SQL] Fix deprecated compilation warnings of "Auto-application to () is deprecated" in Scala 2.13

2020-11-17 Thread GitBox


LuciferYang commented on pull request #30234:
URL: https://github.com/apache/spark/pull/30234#issuecomment-729487796


   It seems all cases are fixed except for the `RDD.isEmpty`-related ones.






[GitHub] [spark] gengliangwang commented on a change in pull request #30407: [SPARK-32852][SQL][Doc] Revise the documentation of spark.sql.hive.metastore.jars

2020-11-17 Thread GitBox


gengliangwang commented on a change in pull request #30407:
URL: https://github.com/apache/spark/pull/30407#discussion_r525860237



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -95,18 +95,19 @@ private[spark] object HiveUtils extends Logging {
 .stringConf
 .createWithDefault("builtin")
 
-  val HIVE_METASTORE_JARS_PATH = 
buildStaticConf("spark.sql.hive.metastore.jars.path")
-.doc(s"Comma separated URL of Hive jars, support both local and remote 
paths," +
-  s"Such as: " +
-  s" 1. file://path/to/jar/xxx.jar\n" +
-  s" 2. hdfs://nameservice/path/to/jar/xxx.jar\n" +
-  s" 3. /path/to/jar/ (path without URI scheme follow conf 
`fs.defaultFS`'s URI schema)\n" +
-  s" 4. [http/https/ftp]://path/to/jar/xxx.jar\n" +
-  s"Notice: `http/https/ftp` doesn't support wildcard, but other URLs 
support" +
-  s"nested path wildcard, Such as: " +
-  s" 1. file://path/to/jar/*, file://path/to/jar/*/*\n" +
-  s" 2. hdfs://nameservice/path/to/jar/*, 
hdfs://nameservice/path/to/jar/*/*\n" +
-  s"When ${HIVE_METASTORE_JARS.key} is set to `path`, we will use Hive 
jars configured by this")
+val HIVE_METASTORE_JARS_PATH = 
buildStaticConf("spark.sql.hive.metastore.jars.path")

Review comment:
   Thanks, updated








[GitHub] [spark] dongjoon-hyun commented on pull request #30234: [SPARK-33285][CORE][SQL] Fix deprecated compilation warnings of "Auto-application to () is deprecated" in Scala 2.13

2020-11-17 Thread GitBox


dongjoon-hyun commented on pull request #30234:
URL: https://github.com/apache/spark/pull/30234#issuecomment-729487327


   Sorry, but what do you mean?
   > It seems that except for RDD.isEmpty has been fixed.






[GitHub] [spark] LuciferYang commented on pull request #30234: [SPARK-33285][CORE][SQL] Fix deprecated compilation warnings of "Auto-application to () is deprecated" in Scala 2.13

2020-11-17 Thread GitBox


LuciferYang commented on pull request #30234:
URL: https://github.com/apache/spark/pull/30234#issuecomment-729486540


   @dongjoon-hyun It seems that except for `RDD.isEmpty` has been fixed.
   
   






[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30407: [SPARK-32852][SQL][Doc] Revise the documentation of spark.sql.hive.metastore.jars

2020-11-17 Thread GitBox


dongjoon-hyun commented on a change in pull request #30407:
URL: https://github.com/apache/spark/pull/30407#discussion_r525859433



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -95,18 +95,19 @@ private[spark] object HiveUtils extends Logging {
 .stringConf
 .createWithDefault("builtin")
 
-  val HIVE_METASTORE_JARS_PATH = 
buildStaticConf("spark.sql.hive.metastore.jars.path")
-.doc(s"Comma separated URL of Hive jars, support both local and remote 
paths," +
-  s"Such as: " +
-  s" 1. file://path/to/jar/xxx.jar\n" +
-  s" 2. hdfs://nameservice/path/to/jar/xxx.jar\n" +
-  s" 3. /path/to/jar/ (path without URI scheme follow conf 
`fs.defaultFS`'s URI schema)\n" +
-  s" 4. [http/https/ftp]://path/to/jar/xxx.jar\n" +
-  s"Notice: `http/https/ftp` doesn't support wildcard, but other URLs 
support" +
-  s"nested path wildcard, Such as: " +
-  s" 1. file://path/to/jar/*, file://path/to/jar/*/*\n" +
-  s" 2. hdfs://nameservice/path/to/jar/*, 
hdfs://nameservice/path/to/jar/*/*\n" +
-  s"When ${HIVE_METASTORE_JARS.key} is set to `path`, we will use Hive 
jars configured by this")
+val HIVE_METASTORE_JARS_PATH = 
buildStaticConf("spark.sql.hive.metastore.jars.path")

Review comment:
   indentation?








[GitHub] [spark] dongjoon-hyun commented on pull request #30391: [SPARK-33464][INFRA] Add/remove (un)necessary cache and restructure GitHub Actions yaml

2020-11-17 Thread GitBox


dongjoon-hyun commented on pull request #30391:
URL: https://github.com/apache/spark/pull/30391#issuecomment-729485211


   The position is up to the agreement between @HyukjinKwon and @mridulm .








[GitHub] [spark] SparkQA commented on pull request #30164: [SPARK-32919][SHUFFLE][test-maven][test-hadoop2.7] Driver side changes for coordinating push based shuffle by selecting external shuffle serv

2020-11-17 Thread GitBox


SparkQA commented on pull request #30164:
URL: https://github.com/apache/spark/pull/30164#issuecomment-729483939


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35856/
   






[GitHub] [spark] HeartSaVioR commented on pull request #28841: [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source

2020-11-17 Thread GitBox


HeartSaVioR commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-729483537


   Thanks @cchighman for the great efforts so far, and sorry for making you 
struggle with the review process. I'll take this over based on the current 
state of the PR and address my own comments.






[GitHub] [spark] gengliangwang commented on pull request #30407: [SPARK-32852][Doc] Revise the documentation of spark.sql.hive.metastore.jars

2020-11-17 Thread GitBox


gengliangwang commented on pull request #30407:
URL: https://github.com/apache/spark/pull/30407#issuecomment-729483340


   cc @AngersZh @cloud-fan 






[GitHub] [spark] gengliangwang opened a new pull request #30407: [SPARK-32852][Doc] Revise the documentation of spark.sql.hive.metastore.jars

2020-11-17 Thread GitBox


gengliangwang opened a new pull request #30407:
URL: https://github.com/apache/spark/pull/30407


   
   
   ### What changes were proposed in this pull request?
   
   This is a follow-up for https://github.com/apache/spark/pull/29881.
   It revises the documentation of the configuration 
`spark.sql.hive.metastore.jars`.
   
   ### Why are the changes needed?
   
   Fix grammatical errors in the doc. 
   Also, make it clearer that the configuration is effective only when 
`spark.sql.hive.metastore.jars` is set to `path`.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Just doc changes.
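   
   For context, a minimal PySpark sketch of how the two settings interact (a 
hedged illustration: the metastore version and jar location below are 
placeholders, not values from this PR):
   ```
   from pyspark.sql import SparkSession

   # `spark.sql.hive.metastore.jars.path` is only consulted when
   # `spark.sql.hive.metastore.jars` is set to `path`; with `builtin` or
   # `maven` it is ignored.
   spark = (
       SparkSession.builder
       .appName("hive-metastore-jars-example")  # hypothetical app name
       .enableHiveSupport()
       .config("spark.sql.hive.metastore.version", "2.3.7")
       .config("spark.sql.hive.metastore.jars", "path")
       .config("spark.sql.hive.metastore.jars.path",
               "file:///opt/hive/lib/*.jar")  # placeholder location
       .getOrCreate()
   )
   ```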






[GitHub] [spark] SparkQA commented on pull request #30164: [SPARK-32919][SHUFFLE][test-maven][test-hadoop2.7] Driver side changes for coordinating push based shuffle by selecting external shuffle serv

2020-11-17 Thread GitBox


SparkQA commented on pull request #30164:
URL: https://github.com/apache/spark/pull/30164#issuecomment-729480460


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35857/
   






[GitHub] [spark] AmplabJenkins removed a comment on pull request #30355: [SPARK-32907][ML][PYTHON] Adaptively blockify instances - AFT,LiR,LoR

2020-11-17 Thread GitBox


AmplabJenkins removed a comment on pull request #30355:
URL: https://github.com/apache/spark/pull/30355#issuecomment-729475604


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/35855/
   Test FAILed.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #30403: [SPARK-33448][SQL] Migrate CACHE/UNCACHE TABLE command to use UnresolvedTableOrView to resolve the identifier

2020-11-17 Thread GitBox


AmplabJenkins removed a comment on pull request #30403:
URL: https://github.com/apache/spark/pull/30403#issuecomment-729471388


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/131246/
   Test FAILed.






[GitHub] [spark] SparkQA removed a comment on pull request #30355: [SPARK-32907][ML][PYTHON] Adaptively blockify instances - AFT,LiR,LoR

2020-11-17 Thread GitBox


SparkQA removed a comment on pull request #30355:
URL: https://github.com/apache/spark/pull/30355#issuecomment-729434913


   **[Test build #131252 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131252/testReport)**
 for PR 30355 at commit 
[`48b2814`](https://github.com/apache/spark/commit/48b28146033b292f6b856d2443bf5b562854066a).






[GitHub] [spark] SparkQA removed a comment on pull request #30234: [SPARK-33285][CORE][SQL] Fix deprecated compilation warnings of "Auto-application to () is deprecated" in Scala 2.13

2020-11-17 Thread GitBox


SparkQA removed a comment on pull request #30234:
URL: https://github.com/apache/spark/pull/30234#issuecomment-729391035


   **[Test build #131244 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131244/testReport)**
 for PR 30234 at commit 
[`2082c3b`](https://github.com/apache/spark/commit/2082c3bd40a77c9083ec658045f9588a76c1ef60).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #30234: [SPARK-33285][CORE][SQL] Fix deprecated compilation warnings of "Auto-application to () is deprecated" in Scala 2.13

2020-11-17 Thread GitBox


AmplabJenkins removed a comment on pull request #30234:
URL: https://github.com/apache/spark/pull/30234#issuecomment-729477538










[GitHub] [spark] SparkQA removed a comment on pull request #30403: [SPARK-33448][SQL] Migrate CACHE/UNCACHE TABLE command to use UnresolvedTableOrView to resolve the identifier

2020-11-17 Thread GitBox


SparkQA removed a comment on pull request #30403:
URL: https://github.com/apache/spark/pull/30403#issuecomment-72944


   **[Test build #131246 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131246/testReport)**
 for PR 30403 at commit 
[`a0687b3`](https://github.com/apache/spark/commit/a0687b3fc8ba5bd9ac19c3a0374cf855e8493f78).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #30351: [SPARK-33441][BUILD] Add unused-imports compilation check and remove all unused-imports

2020-11-17 Thread GitBox


AmplabJenkins removed a comment on pull request #30351:
URL: https://github.com/apache/spark/pull/30351#issuecomment-729469598










[GitHub] [spark] AmplabJenkins removed a comment on pull request #30260: [SPARK-33354][SQL] New explicit cast syntax rules in ANSI mode

2020-11-17 Thread GitBox


AmplabJenkins removed a comment on pull request #30260:
URL: https://github.com/apache/spark/pull/30260#issuecomment-729466252










[GitHub] [spark] AmplabJenkins removed a comment on pull request #30403: [SPARK-33448][SQL] Migrate CACHE/UNCACHE TABLE command to use UnresolvedTableOrView to resolve the identifier

2020-11-17 Thread GitBox


AmplabJenkins removed a comment on pull request #30403:
URL: https://github.com/apache/spark/pull/30403#issuecomment-729471375


   Merged build finished. Test FAILed.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #30355: [SPARK-32907][ML][PYTHON] Adaptively blockify instances - AFT,LiR,LoR

2020-11-17 Thread GitBox


AmplabJenkins removed a comment on pull request #30355:
URL: https://github.com/apache/spark/pull/30355#issuecomment-729473406










[GitHub] [spark] AmplabJenkins removed a comment on pull request #30405: [SPARK-33476][CORE] Generalize ExecutorSource to expose user-given file system schemes

2020-11-17 Thread GitBox


AmplabJenkins removed a comment on pull request #30405:
URL: https://github.com/apache/spark/pull/30405#issuecomment-729460845










[GitHub] [spark] SparkQA commented on pull request #30393: [SPARK-24554][PYTHON][SQL] Add MapType support for PySpark with Arrow

2020-11-17 Thread GitBox


SparkQA commented on pull request #30393:
URL: https://github.com/apache/spark/pull/30393#issuecomment-729477702


   **[Test build #131260 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131260/testReport)**
 for PR 30393 at commit 
[`b257470`](https://github.com/apache/spark/commit/b257470bccb363ca1ae10a96126c3d49d10ad8c6).






[GitHub] [spark] AmplabJenkins commented on pull request #30234: [SPARK-33285][CORE][SQL] Fix deprecated compilation warnings of "Auto-application to () is deprecated" in Scala 2.13

2020-11-17 Thread GitBox


AmplabJenkins commented on pull request #30234:
URL: https://github.com/apache/spark/pull/30234#issuecomment-729477538










[GitHub] [spark] otterc commented on a change in pull request #30312: [WIP][SPARK-32917][SHUFFLE][CORE][test-maven][test-hadoop2.7] Adds support for executors to push shuffle blocks after successful m

2020-11-17 Thread GitBox


otterc commented on a change in pull request #30312:
URL: https://github.com/apache/spark/pull/30312#discussion_r525853164



##
File path: core/src/main/scala/org/apache/spark/shuffle/PushShuffleSupport.scala
##
@@ -0,0 +1,466 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.shuffle
+
+import java.io.File
+import java.net.ConnectException
+import java.nio.ByteBuffer
+import java.util.concurrent.ExecutorService
+
+import scala.collection.mutable.{ArrayBuffer, HashMap, HashSet, Queue}
+
+import com.google.common.base.Throwables
+
+import org.apache.spark.{ShuffleDependency, SparkConf, SparkEnv}
+import org.apache.spark.annotation.Since
+import org.apache.spark.internal.Logging
+import org.apache.spark.internal.config._
+import org.apache.spark.launcher.SparkLauncher
+import org.apache.spark.network.buffer.{FileSegmentManagedBuffer, 
ManagedBuffer, NioManagedBuffer}
+import org.apache.spark.network.netty.SparkTransportConf
+import org.apache.spark.network.shuffle.BlockFetchingListener
+import org.apache.spark.network.shuffle.ErrorHandler.BlockPushErrorHandler
+import org.apache.spark.network.util.TransportConf
+import org.apache.spark.shuffle.PushShuffleSupport._
+import org.apache.spark.storage.{BlockId, BlockManagerId, ShufflePushBlockId}
+import org.apache.spark.util.{ThreadUtils, Utils}
+
+/**
+ * Used for pushing shuffle blocks to remote shuffle services when push 
shuffle is enabled.
+ * When push shuffle is enabled, it is created after the shuffle writer 
finishes writing the shuffle
+ * file and initiates the block push process.
+ *
+ * @param dataFile mapper generated shuffle data file
+ * @param partitionLengths array of shuffle block size so we can tell shuffle 
block
+ * boundaries within the shuffle file
+ * @param dep  shuffle dependency to get shuffle ID and the 
location of remote shuffle
+ * services to push local shuffle blocks
+ * @param partitionId  map index of the shuffle map task
+ * @param conf spark configuration
+ */
+private[spark] class PushShuffleSupport(

Review comment:
   I have renamed `PushShuffleSupport` to `PushShuffleComponent`. I will wait 
for more feedback before making any more changes to this.








[GitHub] [spark] SparkQA commented on pull request #30234: [SPARK-33285][CORE][SQL] Fix deprecated compilation warnings of "Auto-application to () is deprecated" in Scala 2.13

2020-11-17 Thread GitBox


SparkQA commented on pull request #30234:
URL: https://github.com/apache/spark/pull/30234#issuecomment-729476281


   **[Test build #131244 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131244/testReport)**
 for PR 30234 at commit 
[`2082c3b`](https://github.com/apache/spark/commit/2082c3bd40a77c9083ec658045f9588a76c1ef60).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] AmplabJenkins commented on pull request #30355: [SPARK-32907][ML][PYTHON] Adaptively blockify instances - AFT,LiR,LoR

2020-11-17 Thread GitBox


AmplabJenkins commented on pull request #30355:
URL: https://github.com/apache/spark/pull/30355#issuecomment-729475594










[GitHub] [spark] SparkQA commented on pull request #30355: [SPARK-32907][ML][PYTHON] Adaptively blockify instances - AFT,LiR,LoR

2020-11-17 Thread GitBox


SparkQA commented on pull request #30355:
URL: https://github.com/apache/spark/pull/30355#issuecomment-729475571


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35855/
   






[GitHub] [spark] SparkQA commented on pull request #30405: [SPARK-33476][CORE] Generalize ExecutorSource to expose user-given file system schemes

2020-11-17 Thread GitBox


SparkQA commented on pull request #30405:
URL: https://github.com/apache/spark/pull/30405#issuecomment-729473677


   **[Test build #131259 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131259/testReport)**
 for PR 30405 at commit 
[`8ca8624`](https://github.com/apache/spark/commit/8ca8624d43c530cda5ebb7a1a26c52174f908496).






[GitHub] [spark] AmplabJenkins commented on pull request #30355: [SPARK-32907][ML][PYTHON] Adaptively blockify instances - AFT,LiR,LoR

2020-11-17 Thread GitBox


AmplabJenkins commented on pull request #30355:
URL: https://github.com/apache/spark/pull/30355#issuecomment-729473406










[GitHub] [spark] SparkQA commented on pull request #30355: [SPARK-32907][ML][PYTHON] Adaptively blockify instances - AFT,LiR,LoR

2020-11-17 Thread GitBox


SparkQA commented on pull request #30355:
URL: https://github.com/apache/spark/pull/30355#issuecomment-729472700


   **[Test build #131252 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131252/testReport)**
 for PR 30355 at commit 
[`48b2814`](https://github.com/apache/spark/commit/48b28146033b292f6b856d2443bf5b562854066a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] imback82 commented on a change in pull request #30403: [SPARK-33448][SQL] Migrate CACHE/UNCACHE TABLE command to use UnresolvedTableOrView to resolve the identifier

2020-11-17 Thread GitBox


imback82 commented on a change in pull request #30403:
URL: https://github.com/apache/spark/pull/30403#discussion_r525849567



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/CachedTableSuite.scala
##
@@ -113,7 +113,7 @@ class CachedTableSuite extends QueryTest with SQLTestUtils 
with TestHiveSingleto
 e = intercept[AnalysisException] {
   sql("UNCACHE TABLE nonexistentTable")
 }.getMessage
-assert(e.contains(s"$expectedErrorMsg default.nonexistentTable"))
+assert(e.contains(s"$expectedErrorMsg nonexistentTable"))

Review comment:
   This is now consistent with Line 108 above.








[GitHub] [spark] BryanCutler commented on a change in pull request #30393: [SPARK-24554][PYTHON][SQL] Add MapType support for PySpark with Arrow

2020-11-17 Thread GitBox


BryanCutler commented on a change in pull request #30393:
URL: https://github.com/apache/spark/pull/30393#discussion_r525849381



##
File path: python/docs/source/user_guide/arrow_pandas.rst
##
@@ -341,7 +341,7 @@ Supported SQL Types
 
 .. currentmodule:: pyspark.sql.types
 
-Currently, all Spark SQL data types are supported by Arrow-based conversion 
except :class:`MapType`,
+Currently, all Spark SQL data types are supported by Arrow-based conversion 
except

Review comment:
   done








[GitHub] [spark] SparkQA commented on pull request #30406: [SPARK-33473][SQL] Extend interpreted subexpression elimination to other interpreted projections

2020-11-17 Thread GitBox


SparkQA commented on pull request #30406:
URL: https://github.com/apache/spark/pull/30406#issuecomment-729471571


   **[Test build #131258 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131258/testReport)**
 for PR 30406 at commit 
[`ee4b070`](https://github.com/apache/spark/commit/ee4b070e64c24b174a37d4f730cc43867972e170).






[GitHub] [spark] AmplabJenkins commented on pull request #30403: [SPARK-33448][SQL] Migrate CACHE/UNCACHE TABLE command to use UnresolvedTableOrView to resolve the identifier

2020-11-17 Thread GitBox


AmplabJenkins commented on pull request #30403:
URL: https://github.com/apache/spark/pull/30403#issuecomment-729471375










[GitHub] [spark] SparkQA commented on pull request #30403: [SPARK-33448][SQL] Migrate CACHE/UNCACHE TABLE command to use UnresolvedTableOrView to resolve the identifier

2020-11-17 Thread GitBox


SparkQA commented on pull request #30403:
URL: https://github.com/apache/spark/pull/30403#issuecomment-729470662


   **[Test build #131246 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131246/testReport)**
 for PR 30403 at commit 
[`a0687b3`](https://github.com/apache/spark/commit/a0687b3fc8ba5bd9ac19c3a0374cf855e8493f78).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] AmplabJenkins commented on pull request #30351: [SPARK-33441][BUILD] Add unused-imports compilation check and remove all unused-imports

2020-11-17 Thread GitBox


AmplabJenkins commented on pull request #30351:
URL: https://github.com/apache/spark/pull/30351#issuecomment-729469598










[GitHub] [spark] SparkQA commented on pull request #30351: [SPARK-33441][BUILD] Add unused-imports compilation check and remove all unused-imports

2020-11-17 Thread GitBox


SparkQA commented on pull request #30351:
URL: https://github.com/apache/spark/pull/30351#issuecomment-729469567


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35854/
   






[GitHub] [spark] SparkQA commented on pull request #30164: [SPARK-32919][SHUFFLE][test-maven][test-hadoop2.7] Driver side changes for coordinating push based shuffle by selecting external shuffle serv

2020-11-17 Thread GitBox


SparkQA commented on pull request #30164:
URL: https://github.com/apache/spark/pull/30164#issuecomment-729469271


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35856/
   






[GitHub] [spark] BryanCutler commented on a change in pull request #30393: [SPARK-24554][PYTHON][SQL] Add MapType support for PySpark with Arrow

2020-11-17 Thread GitBox


BryanCutler commented on a change in pull request #30393:
URL: https://github.com/apache/spark/pull/30393#discussion_r525847003



##
File path: python/docs/source/user_guide/arrow_pandas.rst
##
@@ -341,7 +341,7 @@ Supported SQL Types
 
 .. currentmodule:: pyspark.sql.types
 
-Currently, all Spark SQL data types are supported by Arrow-based conversion 
except :class:`MapType`,
+Currently, all Spark SQL data types are supported by Arrow-based conversion 
except

Review comment:
   I should probably mention that MapType is only supported with pyarrow 2.0.0.
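   
   For example, a small, hedged usage sketch in Python (it assumes pyarrow >= 
2.0.0 is installed; the app name, schema, and values are made up):
   ```
   from pyspark.sql import SparkSession
   from pyspark.sql.types import (
       IntegerType, MapType, StringType, StructField, StructType)

   spark = SparkSession.builder.appName("maptype-arrow-demo").getOrCreate()
   spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

   schema = StructType(
       [StructField("attrs", MapType(StringType(), IntegerType()))])
   df = spark.createDataFrame([({"a": 1, "b": 2},)], schema=schema)

   # With Arrow enabled and pyarrow >= 2.0.0, the map column round-trips
   # to pandas as a dict.
   pdf = df.toPandas()
   print(pdf["attrs"][0])  # {'a': 1, 'b': 2}
   ```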








[GitHub] [spark] BryanCutler commented on a change in pull request #30393: [SPARK-24554][PYTHON][SQL] Add MapType support for PySpark with Arrow

2020-11-17 Thread GitBox


BryanCutler commented on a change in pull request #30393:
URL: https://github.com/apache/spark/pull/30393#discussion_r525846522



##
File path: python/pyspark/sql/pandas/types.py
##
@@ -306,3 +322,23 @@ def _check_series_convert_timestamps_tz_local(s, timezone):
 `pandas.Series` where if it is a timestamp, has been converted to 
tz-naive
 """
 return _check_series_convert_timestamps_localize(s, timezone, None)
+
+
+def _convert_map_items_to_dict(s):

Review comment:
   Note: these conversion functions exist because pyarrow expects map items 
as a list of (key, value) pairs, and uses the same format when converting to 
Pandas. The reason is that the Arrow spec could allow duplicate keys within a 
row, and doesn't say exactly how they should be handled. So by having these 
conversions, we match the non-Arrow behavior for maps, with a dictionary as 
input/output.
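   
   A minimal, self-contained sketch of that conversion idea (the function 
names mirror the PR's helpers, but the bodies here are illustrative rather 
than the actual implementation):
   ```
   import pandas as pd

   def _convert_map_items_to_dict(s):
       # pyarrow yields each map cell as a list of (key, value) tuples;
       # dict() keeps the last value when a key is duplicated.
       return s.apply(lambda items: None if items is None else dict(items))

   def _convert_dict_to_map_items(s):
       # Inverse direction: hand pyarrow a list of (key, value) tuples.
       return s.apply(lambda d: None if d is None else list(d.items()))

   s = pd.Series([[("a", 1), ("b", 2)], None])
   print(_convert_map_items_to_dict(s).tolist())  # [{'a': 1, 'b': 2}, None]
   ```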








[GitHub] [spark] AmplabJenkins commented on pull request #30260: [SPARK-33354][SQL] New explicit cast syntax rules in ANSI mode

2020-11-17 Thread GitBox


AmplabJenkins commented on pull request #30260:
URL: https://github.com/apache/spark/pull/30260#issuecomment-729466252










[GitHub] [spark] SparkQA commented on pull request #30260: [SPARK-33354][SQL] New explicit cast syntax rules in ANSI mode

2020-11-17 Thread GitBox


SparkQA commented on pull request #30260:
URL: https://github.com/apache/spark/pull/30260#issuecomment-729466242


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35853/
   






[GitHub] [spark] maropu commented on pull request #30406: [SPARK-33473][SQL] Extend interpreted subexpression elimination to other interpreted projections

2020-11-17 Thread GitBox


maropu commented on pull request #30406:
URL: https://github.com/apache/spark/pull/30406#issuecomment-729464155


   Looks fine if the tests pass.






[GitHub] [spark] SparkQA commented on pull request #30355: [SPARK-32907][ML][PYTHON] Adaptively blockify instances - AFT,LiR,LoR

2020-11-17 Thread GitBox


SparkQA commented on pull request #30355:
URL: https://github.com/apache/spark/pull/30355#issuecomment-729462035


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35855/
   






[GitHub] [spark] maropu commented on a change in pull request #29725: [SPARK-32096][SQL] Improve sorting performance of Spark SQL window function by removing window partition key from sort order

2020-11-17 Thread GitBox


maropu commented on a change in pull request #29725:
URL: https://github.com/apache/spark/pull/29725#discussion_r525841241



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/SortExec.scala
##
@@ -42,6 +43,139 @@ case class SortExec(
 global: Boolean,
 child: SparkPlan,
 testSpillFrequency: Int = 0)
+  extends SortExecBase(
+sortOrder,
+global,
+child,
+testSpillFrequency) {
+
+  def createSorter(): UnsafeExternalRowSorter = {
+rowSorter = UnsafeExternalRowSorter.create(
+  schema, ordering, prefixComparator, prefixComputer, pageSize, 
canUseRadixSort)
+
+if (testSpillFrequency > 0) {
+  rowSorter.setTestSpillFrequency(testSpillFrequency)
+}
+rowSorter.asInstanceOf[UnsafeExternalRowSorter]
+  }
+
+  override protected def doProduce(ctx: CodegenContext): String = {
+doProduce(ctx, classOf[UnsafeExternalRowSorter].getName)
+  }
+}
+
+/**
+ * Performs (external) sorting for multiple windows.
+ *
+ * @param partitionSpec a sequence of expressions that defines a partition key
+ * @param sortOrderInWindow a sequence of sort orders for sorting rows inside 
a window
+ * @param sortOrderAcrossWindows a sequence of sort orders for sorting rows 
across
+ *   different windows on a Spark physical 
partition.
+ *   This sequence of sort orders is obtained from 
a partition
+ *   key plus a sequence of sort orders inside a 
window
+ * @param global when true performs a global sort of all partitions by 
shuffling the data first
+ *   if necessary.
+ * @param testSpillFrequency Method for configuring periodic spilling in unit 
tests. If set, will
+ *   spill every `frequency` records.
+ */
+case class WindowSortExec(
+partitionSpec: Seq[Expression],
+sortOrderInWindow: Seq[SortOrder],
+sortOrderAcrossWindows: Seq[SortOrder],
+global: Boolean,
+child: SparkPlan,
+testSpillFrequency: Int = 0)
+  extends SortExecBase(
+sortOrderAcrossWindows,
+global,
+child,
+testSpillFrequency) {
+

Review comment:
   Could you add `assert(partitionSpec.nonEmpty && 
sortOrderInWindow.nonEmpty, "XXX")` here?








[GitHub] [spark] otterc commented on a change in pull request #30312: [WIP][SPARK-32917][SHUFFLE][CORE][test-maven][test-hadoop2.7] Adds support for executors to push shuffle blocks after successful m

2020-11-17 Thread GitBox


otterc commented on a change in pull request #30312:
URL: https://github.com/apache/spark/pull/30312#discussion_r525838001



##
File path: core/src/main/scala/org/apache/spark/shuffle/PushShuffleSupport.scala
##
@@ -0,0 +1,466 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.shuffle
+
+import java.io.File
+import java.net.ConnectException
+import java.nio.ByteBuffer
+import java.util.concurrent.ExecutorService
+
+import scala.collection.mutable.{ArrayBuffer, HashMap, HashSet, Queue}
+
+import com.google.common.base.Throwables
+
+import org.apache.spark.{ShuffleDependency, SparkConf, SparkEnv}
+import org.apache.spark.annotation.Since
+import org.apache.spark.internal.Logging
+import org.apache.spark.internal.config._
+import org.apache.spark.launcher.SparkLauncher
+import org.apache.spark.network.buffer.{FileSegmentManagedBuffer, 
ManagedBuffer, NioManagedBuffer}
+import org.apache.spark.network.netty.SparkTransportConf
+import org.apache.spark.network.shuffle.BlockFetchingListener
+import org.apache.spark.network.shuffle.ErrorHandler.BlockPushErrorHandler
+import org.apache.spark.network.util.TransportConf
+import org.apache.spark.shuffle.PushShuffleSupport._
+import org.apache.spark.storage.{BlockId, BlockManagerId, ShufflePushBlockId}
+import org.apache.spark.util.{ThreadUtils, Utils}
+
+/**
+ * Used for pushing shuffle blocks to remote shuffle services when push 
shuffle is enabled.
+ * When push shuffle is enabled, it is created after the shuffle writer 
finishes writing the shuffle
+ * file and initiates the block push process.
+ *
+ * @param dataFile mapper generated shuffle data file
+ * @param partitionLengths array of shuffle block size so we can tell shuffle 
block
+ * boundaries within the shuffle file
+ * @param dep  shuffle dependency to get shuffle ID and the 
location of remote shuffle
+ * services to push local shuffle blocks
+ * @param partitionId  map index of the shuffle map task
+ * @param conf spark configuration
+ */
+private[spark] class PushShuffleSupport(

Review comment:
   > XXXSupport is usually a trait or abstract class.
   
   I can rename it to `PushShuffleComponent`.
   
   > Would it be better to make PushShuffleSupport an abstract class which 
ShuffleWriter extends? The getPartitionLengths method can be the abstract 
method in this abstract class instead of inside ShuffleWriter.
   
   I can add another interface that, let's say, looks like this, and 
ShuffleWriter can extend it:
   ```
   interface PushEnabledShuffleWriter {
     getPartitionLengths()
     initiateBlockPush(...)
   }
   ```
   Not sure the interface is really needed, since ShuffleWriter is the only 
base writer to which we are adding this ability. Maybe we can just add the 
`initiateBlockPush` method back to ShuffleWriter.
   
   > All the other class fields in this class can be converted into method 
arguments for initiateBlockPush. This way, we would still be able to initiate 
block push via the ShuffleWriter objects instead of relying on instantiating 
an object of PushShuffleSupport.
   
   With `initiateBlockPush` in `ShuffleWriter`, block push is still triggered 
via the shuffle writer; it will, however, delegate the work to 
`PushShuffleComponent`.
   
   @Victsm @Ngone51 @mridulm Please let me know what you think.








[GitHub] [spark] dongjoon-hyun commented on pull request #30405: [SPARK-33476][CORE] Generalize ExecutorSource to expose user-given file system schemes

2020-11-17 Thread GitBox


dongjoon-hyun commented on pull request #30405:
URL: https://github.com/apache/spark/pull/30405#issuecomment-729461077


   All comments are addressed. (@HyukjinKwon, @viirya, @mridulm)






[GitHub] [spark] AmplabJenkins commented on pull request #30405: [SPARK-33476][CORE] Generalize ExecutorSource to expose user-given file system schemes

2020-11-17 Thread GitBox


AmplabJenkins commented on pull request #30405:
URL: https://github.com/apache/spark/pull/30405#issuecomment-729460845










[GitHub] [spark] SparkQA commented on pull request #30405: [SPARK-33476][CORE] Generalize ExecutorSource to expose user-given file system schemes

2020-11-17 Thread GitBox


SparkQA commented on pull request #30405:
URL: https://github.com/apache/spark/pull/30405#issuecomment-729460822


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35852/
   






[GitHub] [spark] maropu commented on a change in pull request #29725: [SPARK-32096][SQL] Improve sorting performance of Spark SQL window function by removing window partition key from sort order

2020-11-17 Thread GitBox


maropu commented on a change in pull request #29725:
URL: https://github.com/apache/spark/pull/29725#discussion_r525840579



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala
##
@@ -124,7 +125,18 @@ case class EnsureRequirements(conf: SQLConf) extends 
Rule[SparkPlan] {
   if (SortOrder.orderingSatisfies(child.outputOrdering, requiredOrdering)) 
{
 child
   } else {
-SortExec(requiredOrdering, global = false, child = child)
+operator match {
+  case WindowExec(_, partitionSpec, orderSpec, _)
+if (!partitionSpec.isEmpty && !orderSpec.isEmpty) =>

Review comment:
   nit: `isEmpty` -> `nonEmpty`








[GitHub] [spark] dongjoon-hyun edited a comment on pull request #30164: [SPARK-32919][SHUFFLE][test-maven][test-hadoop2.7] Driver side changes for coordinating push based shuffle by selecting external

2020-11-17 Thread GitBox


dongjoon-hyun edited a comment on pull request #30164:
URL: https://github.com/apache/spark/pull/30164#issuecomment-729460106


   +1, LGTM. (only editorial comments.)
   Thank you all!






[GitHub] [spark] dongjoon-hyun commented on pull request #30164: [SPARK-32919][SHUFFLE][test-maven][test-hadoop2.7] Driver side changes for coordinating push based shuffle by selecting external shuffl

2020-11-17 Thread GitBox


dongjoon-hyun commented on pull request #30164:
URL: https://github.com/apache/spark/pull/30164#issuecomment-729460106


   +1, LGTM. (only editorial comments.)






[GitHub] [spark] maropu commented on a change in pull request #29725: [SPARK-32096][SQL] Improve sorting performance of Spark SQL window function by removing window partition key from sort order

2020-11-17 Thread GitBox


maropu commented on a change in pull request #29725:
URL: https://github.com/apache/spark/pull/29725#discussion_r525834434



##
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowWindowSorter.java
##
@@ -0,0 +1,453 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution;
+
+import com.google.common.annotations.VisibleForTesting;
+
+import java.util.Comparator;
+import java.util.HashMap;
+import java.util.LinkedList;
+import java.util.Map.Entry;
+import java.util.Queue;
+import java.util.TreeMap;
+import java.io.IOException;
+import java.util.function.Supplier;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import scala.collection.Iterator;
+import scala.math.Ordering;
+
+import org.apache.spark.memory.SparkOutOfMemoryError;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.util.collection.unsafe.sort.PrefixComparator;
+import org.apache.spark.util.collection.unsafe.sort.RecordComparator;
+
+public final class UnsafeExternalRowWindowSorter extends 
AbstractUnsafeExternalRowSorter {
+
+  private static final Logger logger = 
LoggerFactory.getLogger(UnsafeExternalRowWindowSorter.class);
+
+  private final StructType schema;
+  private final UnsafeProjection partitionSpecProjection;
+  private final Ordering orderingOfPartitionKey;
+  private final Ordering orderingInWindow;
+  private final Ordering orderingAcrossWindows;
+  private final PrefixComparator prefixComparatorInWindow;
+  private final UnsafeExternalRowSorter.PrefixComputer prefixComputerInWindow;
+  private final boolean canUseRadixSortInWindow;
+  private final long pageSizeBytes;
+  private static final int windowSorterMapMaxSize = 1;
+  private static final int totalNumSorters = windowSorterMapMaxSize + 1;
+  private final HashMap 
windowSorterMap;

Review comment:
   nit: `HashMap`

##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/SortExec.scala
##
@@ -42,6 +43,139 @@ case class SortExec(
 global: Boolean,
 child: SparkPlan,
 testSpillFrequency: Int = 0)
+  extends SortExecBase(
+sortOrder,
+global,
+child,
+testSpillFrequency) {
+
+  def createSorter(): UnsafeExternalRowSorter = {
+rowSorter = UnsafeExternalRowSorter.create(
+  schema, ordering, prefixComparator, prefixComputer, pageSize, 
canUseRadixSort)
+
+if (testSpillFrequency > 0) {
+  rowSorter.setTestSpillFrequency(testSpillFrequency)
+}
+rowSorter.asInstanceOf[UnsafeExternalRowSorter]
+  }
+
+  override protected def doProduce(ctx: CodegenContext): String = {
+doProduce(ctx, classOf[UnsafeExternalRowSorter].getName)
+  }
+}
+
+/**
+ * Performs (external) sorting for multiple windows.

Review comment:
    Could you leave some comments about how this differs from `SortExec`?

##
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowWindowSorter.java
##
@@ -0,0 +1,453 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution;
+
+import com.google.common.annotations.VisibleForTesting;
+
+import java.util.Comparator;
+import java.util.HashMap;
+import java.util.LinkedList;
+import java.util.Map.Entry;

[GitHub] [spark] SparkQA commented on pull request #30351: [SPARK-33441][BUILD] Add unused-imports compilation check and remove all unused-imports

2020-11-17 Thread GitBox


SparkQA commented on pull request #30351:
URL: https://github.com/apache/spark/pull/30351#issuecomment-729458991


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35854/
   






[GitHub] [spark] AmplabJenkins removed a comment on pull request #30396: [MINOR][SQL][DOCS] Update schema_of_csv and schema_of_json doc

2020-11-17 Thread GitBox


AmplabJenkins removed a comment on pull request #30396:
URL: https://github.com/apache/spark/pull/30396#issuecomment-729457164










[GitHub] [spark] SparkQA commented on pull request #30406: [SPARK-33473][SQL] Extend interpreted subexpression elimination to other interpreted projections

2020-11-17 Thread GitBox


SparkQA commented on pull request #30406:
URL: https://github.com/apache/spark/pull/30406#issuecomment-729457313


   **[Test build #131257 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131257/testReport)**
 for PR 30406 at commit 
[`37e3dca`](https://github.com/apache/spark/commit/37e3dca4d1240e37729efd764eb4b676feac0e65).






[GitHub] [spark] AmplabJenkins commented on pull request #30396: [MINOR][SQL][DOCS] Update schema_of_csv and schema_of_json doc

2020-11-17 Thread GitBox


AmplabJenkins commented on pull request #30396:
URL: https://github.com/apache/spark/pull/30396#issuecomment-729457164










[GitHub] [spark] otterc commented on a change in pull request #30312: [WIP][SPARK-32917][SHUFFLE][CORE][test-maven][test-hadoop2.7] Adds support for executors to push shuffle blocks after successful m

2020-11-17 Thread GitBox


otterc commented on a change in pull request #30312:
URL: https://github.com/apache/spark/pull/30312#discussion_r525838001



##
File path: core/src/main/scala/org/apache/spark/shuffle/PushShuffleSupport.scala
##
@@ -0,0 +1,466 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.shuffle
+
+import java.io.File
+import java.net.ConnectException
+import java.nio.ByteBuffer
+import java.util.concurrent.ExecutorService
+
+import scala.collection.mutable.{ArrayBuffer, HashMap, HashSet, Queue}
+
+import com.google.common.base.Throwables
+
+import org.apache.spark.{ShuffleDependency, SparkConf, SparkEnv}
+import org.apache.spark.annotation.Since
+import org.apache.spark.internal.Logging
+import org.apache.spark.internal.config._
+import org.apache.spark.launcher.SparkLauncher
+import org.apache.spark.network.buffer.{FileSegmentManagedBuffer, ManagedBuffer, NioManagedBuffer}
+import org.apache.spark.network.netty.SparkTransportConf
+import org.apache.spark.network.shuffle.BlockFetchingListener
+import org.apache.spark.network.shuffle.ErrorHandler.BlockPushErrorHandler
+import org.apache.spark.network.util.TransportConf
+import org.apache.spark.shuffle.PushShuffleSupport._
+import org.apache.spark.storage.{BlockId, BlockManagerId, ShufflePushBlockId}
+import org.apache.spark.util.{ThreadUtils, Utils}
+
+/**
+ * Used for pushing shuffle blocks to remote shuffle services when push shuffle is enabled.
+ * When push shuffle is enabled, it is created after the shuffle writer finishes writing the shuffle
+ * file and initiates the block push process.
+ *
+ * @param dataFile mapper generated shuffle data file
+ * @param partitionLengths array of shuffle block size so we can tell shuffle block
+ * boundaries within the shuffle file
+ * @param dep  shuffle dependency to get shuffle ID and the location of remote shuffle
+ * services to push local shuffle blocks
+ * @param partitionId  map index of the shuffle map task
+ * @param conf spark configuration
+ */
+private[spark] class PushShuffleSupport(

Review comment:
   > XXXSupport is usually a trait or abstract class.
   
   I can rename it to be `PushShuffleComponent`
   
   > Would it be better to make PushShuffleSupport an abstract class which 
ShuffleWriter extends?
   > The getPartitionLengths method can be the abstract method in this abstract 
class instead of inside ShuffleWriter
   
   I can add another interface which, let's say, looks like this, and 
ShuffleWriter can extend it:
   ```
   interface PushEnabledShuffleWriter {
     getPartitionLengths()
     initiateBlockPush(...)
   }
   ```
   Not sure if the interface is really needed since ShuffleWriter is the only 
base writer to which we are adding this ability. Maybe we can just add back the 
`initiateBlockPush` method to ShuffleWriter.
   
   > All the other class fields in this class can be converted into method 
arguments for initiateBlockPush.
   > This way, we would still be able to initiate block push via the 
ShuffleWriter objects instead of relying on instantiating an object of 
PushShuffleSupport.
   
   With the above interface, `initiateBlockPush` of `ShuffleWriter` can 
delegate the push to `PushShuffleComponent`.
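   
   For illustration, a minimal Scala sketch of the delegation described above 
(class and method names are illustrative, not the PR's final API):
   ```scala
   // Stand-in for the component that owns the actual push logic.
   class PushShuffleComponent {
     def pushBlocks(partitionLengths: Array[Long]): Unit = {
       // push each shuffle block to the remote merger services here
     }
   }
   
   // Writer-facing trait: the default initiateBlockPush simply delegates
   // the push to the component, as described above.
   trait PushEnabledShuffleWriter {
     def getPartitionLengths(): Array[Long]
     def initiateBlockPush(pusher: PushShuffleComponent): Unit =
       pusher.pushBlocks(getPartitionLengths())
   }
   ```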





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #30396: [MINOR][SQL][DOCS] Update schema_of_csv and schema_of_json doc

2020-11-17 Thread GitBox


SparkQA removed a comment on pull request #30396:
URL: https://github.com/apache/spark/pull/30396#issuecomment-729321498


   **[Test build #131239 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131239/testReport)** for PR 30396 at commit [`1224ba9`](https://github.com/apache/spark/commit/1224ba9f6d27d3b3502e7ee64ff7ec53cb9bd7da).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #30396: [MINOR][SQL][DOCS] Update schema_of_csv and schema_of_json doc

2020-11-17 Thread GitBox


SparkQA commented on pull request #30396:
URL: https://github.com/apache/spark/pull/30396#issuecomment-729456105


   **[Test build #131239 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131239/testReport)** for PR 30396 at commit [`1224ba9`](https://github.com/apache/spark/commit/1224ba9f6d27d3b3502e7ee64ff7ec53cb9bd7da).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30405: [SPARK-33476][CORE] Generalize ExecutorSource to expose user-given file system schemes

2020-11-17 Thread GitBox


dongjoon-hyun commented on a change in pull request #30405:
URL: https://github.com/apache/spark/pull/30405#discussion_r525837460



##
File path: core/src/main/scala/org/apache/spark/executor/Executor.scala
##
@@ -110,7 +110,9 @@ private[spark] class Executor(
   .build()
     Executors.newCachedThreadPool(threadFactory).asInstanceOf[ThreadPoolExecutor]
   }
-  private val executorSource = new ExecutorSource(threadPool, executorId)
+  private val schemes = conf.get(EXECUTOR_METRICS_FILESYSTEM_SCHEMES)
+    .toLowerCase(Locale.ROOT).split(",").map(_.trim)

Review comment:
   Sure, thanks, @mridulm !





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a change in pull request #30405: [SPARK-33476][CORE] Generalize ExecutorSource to expose user-given file system schemes

2020-11-17 Thread GitBox


mridulm commented on a change in pull request #30405:
URL: https://github.com/apache/spark/pull/30405#discussion_r525836701



##
File path: core/src/main/scala/org/apache/spark/executor/Executor.scala
##
@@ -110,7 +110,9 @@ private[spark] class Executor(
   .build()
     Executors.newCachedThreadPool(threadFactory).asInstanceOf[ThreadPoolExecutor]
   }
-  private val executorSource = new ExecutorSource(threadPool, executorId)
+  private val schemes = conf.get(EXECUTOR_METRICS_FILESYSTEM_SCHEMES)
+    .toLowerCase(Locale.ROOT).split(",").map(_.trim)

Review comment:
   nit: filter empty after `map` ?
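   
   For illustration, a self-contained sketch of the suggested tweak (the input 
string is made up; the trailing `filter` is the proposed addition):
   ```scala
   import java.util.Locale
   
   // Trim each entry, then drop empties left by stray or trailing commas:
   // "file,, hdfs ," yields Array("file", "hdfs").
   val raw = "file,, hdfs ,"
   val schemes = raw.toLowerCase(Locale.ROOT).split(",").map(_.trim).filter(_.nonEmpty)
   ```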





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #30260: [SPARK-33354][SQL] New explicit cast syntax rules in ANSI mode

2020-11-17 Thread GitBox


SparkQA commented on pull request #30260:
URL: https://github.com/apache/spark/pull/30260#issuecomment-729454825


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35853/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on a change in pull request #29000: [SPARK-27194][SPARK-29302][SQL] Fix commit collision in dynamic partition overwrite mode

2020-11-17 Thread GitBox


c21 commented on a change in pull request #29000:
URL: https://github.com/apache/spark/pull/29000#discussion_r525836484



##
File path: 
core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala
##
@@ -169,4 +169,8 @@ object FileCommitProtocol extends Logging {
 ctor.newInstance(jobId, outputPath)
 }
   }
+
+  def getStagingDir(path: String, jobId: String): Path = {
+new Path(path, ".spark-staging-" + jobId)
+  }

Review comment:
   @cloud-fan - wondering, do you still think FileCommitProtocol V2 
(https://issues.apache.org/jira/browse/SPARK-33298) is a prerequisite for this 
PR? Or is this one good to go separately, so that we only need to worry about 
allowing a prefix to be added to the file name per 
https://github.com/apache/spark/pull/30003 ?
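   
   For reference, the helper in the diff above just nests a job-scoped hidden 
directory under the output path; a quick illustration with made-up values:
   ```scala
   import org.apache.hadoop.fs.Path
   
   def getStagingDir(path: String, jobId: String): Path =
     new Path(path, ".spark-staging-" + jobId)
   
   // getStagingDir("/warehouse/tbl", "20201117-0001")
   // => /warehouse/tbl/.spark-staging-20201117-0001
   ```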





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Victsm commented on a change in pull request #30164: [SPARK-32919][SHUFFLE][test-maven][test-hadoop2.7] Driver side changes for coordinating push based shuffle by selecting external s

2020-11-17 Thread GitBox


Victsm commented on a change in pull request #30164:
URL: https://github.com/apache/spark/pull/30164#discussion_r525835969



##
File path: core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala
##
@@ -92,4 +93,16 @@ private[spark] trait SchedulerBackend {
   */
   def maxNumConcurrentTasks(rp: ResourceProfile): Int
 
+  /**
+   * Get the list of host locations for push based shuffle
+   *
+   * Currently push based shuffle is disabled for both stage retry and stage reuse cases

Review comment:
   Both are true.
   `getShufflePushMergerLocations` will be invoked only once per 
`ShuffleDependency`.
   Thus retried stages will get the same merger locations.
   In #30062, the way we implemented the block push handling logic would ignore 
blocks received after shuffle finalization.
   
   
https://github.com/apache/spark/blob/dd32f45d2058d00293330c01d3d9f53ecdbc036c/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java#L132
   
   So, blocks pushed from the retry stage will be ignored, and this is what's 
stated in this comment about `push based shuffle is disabled for both stage 
retry and stage reuse cases`.
   Ignoring blocks pushed from the retry stage is reasonable, since the block 
data from these retried tasks has most likely already been merged.
   Making sure the retried stage uses the same merger locations is critical to 
ensuring we don't run into data duplication issues.
   
   The only exception is indeterminate stage retry, for which we have created 
SPARK-32923.
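   
   As a rough sketch of the "invoked only once per `ShuffleDependency`" 
behavior (illustrative names, not the actual scheduler code), the chosen 
locations can be memoized by shuffle ID so a retried stage reuses them:
   ```scala
   import scala.collection.concurrent.TrieMap
   
   // Cache merger locations per shuffle ID so that a retried stage reuses the
   // locations chosen for the first attempt instead of picking fresh ones.
   class MergerLocationCache[Loc](lookup: Int => Seq[Loc]) {
     private val byShuffleId = TrieMap.empty[Int, Seq[Loc]]
     def locationsFor(shuffleId: Int): Seq[Loc] =
       byShuffleId.getOrElseUpdate(shuffleId, lookup(shuffleId))
   }
   ```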





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #30405: [SPARK-33476][CORE] Generalize ExecutorSource to expose user-given file system schemes

2020-11-17 Thread GitBox


dongjoon-hyun commented on pull request #30405:
URL: https://github.com/apache/spark/pull/30405#issuecomment-729453805


   Thank you, @HyukjinKwon and @viirya. I'll mention this new conf in 
`monitoring.md` per your advice.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30164: [SPARK-32919][SHUFFLE][test-maven][test-hadoop2.7] Driver side changes for coordinating push based shuffle by selecting ext

2020-11-17 Thread GitBox


dongjoon-hyun commented on a change in pull request #30164:
URL: https://github.com/apache/spark/pull/30164#discussion_r525835108



##
File path: 
resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala
##
@@ -161,6 +173,35 @@ private[spark] abstract class YarnSchedulerBackend(
 totalRegisteredExecutors.get() >= totalExpectedExecutors * minRegisteredRatio
   }
 
+  override def getShufflePushMergerLocations(
+  numPartitions: Int,
+  resourceProfileId: Int): Seq[BlockManagerId] = {
+// Currently this is naive way of calculating numMergersDesired for a stage. In future,
+// we can use better heuristics to calculate numMergersDesired for a stage.
+val maxExecutors = if (Utils.isDynamicAllocationEnabled(sc.getConf)) {
+  maxNumExecutors
+} else {
+  numExecutors
+}
+val tasksPerExecutor = sc.resourceProfileManager
+  .resourceProfileFromId(resourceProfileId).maxTasksPerExecutor(sc.conf)
+val numMergersDesired = math.min(
+  math.max(1, math.ceil(numPartitions / tasksPerExecutor).toInt), maxExecutors)
+val minMergersNeeded = math.max(minMergersStaticThreshold,
+  math.floor(numMergersDesired * minMergersThresholdRatio).toInt)
+
+// Request for numMergersDesired shuffle mergers to BlockManagerMasterEndpoint
+// and if its less than minMergersNeeded, we disable push based shuffle.
+val mergerLocations = blockManagerMaster
+  .getShufflePushMergerLocations(numMergersDesired, scheduler.excludedNodes())
+if (mergerLocations.size < numMergersDesired && mergerLocations.size < minMergersNeeded) {
+  Seq.empty[BlockManagerId]
+} else {
+  logDebug(s"Num merger locations available ${mergerLocations.length}")

Review comment:
   Also, for better logging, it would be great if we could print 
`numMergersDesired` together.
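   
   As a side note, a worked example of the heuristic above with assumed inputs 
(the two thresholds are config-driven; the values below are illustrative):
   ```scala
   val numPartitions = 1000
   val tasksPerExecutor = 4
   val maxExecutors = 200
   val minMergersStaticThreshold = 5    // assumed value
   val minMergersThresholdRatio = 0.05  // assumed value
   
   val numMergersDesired = math.min(
     math.max(1, math.ceil(numPartitions / tasksPerExecutor).toInt), maxExecutors)  // 200
   val minMergersNeeded = math.max(minMergersStaticThreshold,
     math.floor(numMergersDesired * minMergersThresholdRatio).toInt)                // 10
   // Push based shuffle stays enabled if at least minMergersNeeded (10)
   // mergers are available, or if all numMergersDesired (200) are.
   ```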





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] BryanCutler commented on pull request #30393: [SPARK-24554][PYTHON][SQL] Add MapType support for PySpark with Arrow

2020-11-17 Thread GitBox


BryanCutler commented on pull request #30393:
URL: https://github.com/apache/spark/pull/30393#issuecomment-729452568


   > BTW I believe we should also update the docs :-)
   
   Thanks for reminding me! I'll do that now..



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30164: [SPARK-32919][SHUFFLE][test-maven][test-hadoop2.7] Driver side changes for coordinating push based shuffle by selecting ext

2020-11-17 Thread GitBox


dongjoon-hyun commented on a change in pull request #30164:
URL: https://github.com/apache/spark/pull/30164#discussion_r525834706



##
File path: 
resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala
##
@@ -161,6 +173,35 @@ private[spark] abstract class YarnSchedulerBackend(
 totalRegisteredExecutors.get() >= totalExpectedExecutors * minRegisteredRatio
   }
 
+  override def getShufflePushMergerLocations(
+  numPartitions: Int,
+  resourceProfileId: Int): Seq[BlockManagerId] = {
+// Currently this is naive way of calculating numMergersDesired for a stage. In future,
+// we can use better heuristics to calculate numMergersDesired for a stage.
+val maxExecutors = if (Utils.isDynamicAllocationEnabled(sc.getConf)) {
+  maxNumExecutors
+} else {
+  numExecutors
+}
+val tasksPerExecutor = sc.resourceProfileManager
+  .resourceProfileFromId(resourceProfileId).maxTasksPerExecutor(sc.conf)
+val numMergersDesired = math.min(
+  math.max(1, math.ceil(numPartitions / tasksPerExecutor).toInt), maxExecutors)
+val minMergersNeeded = math.max(minMergersStaticThreshold,
+  math.floor(numMergersDesired * minMergersThresholdRatio).toInt)
+
+// Request for numMergersDesired shuffle mergers to BlockManagerMasterEndpoint
+// and if its less than minMergersNeeded, we disable push based shuffle.
+val mergerLocations = blockManagerMaster
+  .getShufflePushMergerLocations(numMergersDesired, scheduler.excludedNodes())
+if (mergerLocations.size < numMergersDesired && mergerLocations.size < minMergersNeeded) {
+  Seq.empty[BlockManagerId]
+} else {
+  logDebug(s"Num merger locations available ${mergerLocations.length}")

Review comment:
   If you don't mind, shall we revise it like the following?
   ```scala
   - logDebug(s"Num merger locations available ${mergerLocations.length}")
   + logDebug(s"The number of available locations is ${mergerLocations.length}")
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30404: [SPARK-33475][BUILD] Bump ANTLR runtime version to 4.8-1

2020-11-17 Thread GitBox


AmplabJenkins removed a comment on pull request #30404:
URL: https://github.com/apache/spark/pull/30404#issuecomment-729450454


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/35851/
   Test FAILed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30400: [SPARK-33469][SQL] Add current_timezone function

2020-11-17 Thread GitBox


AmplabJenkins removed a comment on pull request #30400:
URL: https://github.com/apache/spark/pull/30400#issuecomment-729441891


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/35850/
   Test FAILed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


