[GitHub] [spark] cloud-fan closed pull request #35308: [SPARK-38011][SQL] Remove duplicated and useless configuration in ParquetFileFormat

2022-01-26 Thread GitBox


cloud-fan closed pull request #35308:
URL: https://github.com/apache/spark/pull/35308


   





[GitHub] [spark] cloud-fan commented on pull request #35308: [SPARK-38011][SQL] Remove duplicated and useless configuration in ParquetFileFormat

2022-01-26 Thread GitBox


cloud-fan commented on pull request #35308:
URL: https://github.com/apache/spark/pull/35308#issuecomment-1022939904


   thanks, merging to master!





[GitHub] [spark] AmplabJenkins commented on pull request #35340: [WINDOW]improve structured streaming window of calculated

2022-01-26 Thread GitBox


AmplabJenkins commented on pull request #35340:
URL: https://github.com/apache/spark/pull/35340#issuecomment-1022938396


   Can one of the admins verify this patch?





[GitHub] [spark] cloud-fan commented on pull request #32875: [SPARK-35703][SQL] Relax constraint for bucket join and remove HashClusteredDistribution

2022-01-26 Thread GitBox


cloud-fan commented on pull request #32875:
URL: https://github.com/apache/spark/pull/32875#issuecomment-1022926161


   @sunchao  can we add back `HashClusteredDistribution` and use it for 
streaming join/aggregate?





[GitHub] [spark] HeartSaVioR edited a comment on pull request #32875: [SPARK-35703][SQL] Relax constraint for bucket join and remove HashClusteredDistribution

2022-01-26 Thread GitBox


HeartSaVioR edited a comment on pull request #32875:
URL: https://github.com/apache/spark/pull/32875#issuecomment-1022894109









[GitHub] [spark] LuciferYang commented on pull request #35262: [SPARK-37974][SQL] Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support

2022-01-26 Thread GitBox


LuciferYang commented on pull request #35262:
URL: https://github.com/apache/spark/pull/35262#issuecomment-1022919561


   @parthchandra I think we should add some UTs similar to `String with Nulls
Scan`, because when I add

   ```
   sparkSession.conf.set(SQLConf.COLUMN_VECTOR_OFFHEAP_ENABLED.key, "true")
   ```

   to `DataSourceReadBenchmark` to make `ColumnVector` use off-heap memory,
the `String with Nulls Scan`-related cases fail as follows:
   
   ```
   14:33:29.271 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 5043.0 (TID 3936)
   org.apache.spark.sql.execution.QueryExecutionException: Encountered error while reading file file:///private/var/folders/0x/xj61_dbd0dldn793s6cyb7rrgp/T/spark-a6065795-c141-43cd-8ec6-359f3f3a0307/parquetV2/part-0-7c6de322-95b1-4283-9399-8306753c68ab-c000.snappy.parquet. Details:
      at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFilesError(QueryExecutionErrors.scala:659) ~[classes/:?]
      at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:283) ~[classes/:?]
      at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116) ~[classes/:?]
      at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:546) ~[classes/:?]
      at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) ~[?:?]
      at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source) ~[?:?]
      at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?]
      at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) ~[classes/:?]
      at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) ~[classes/:?]
      at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
      at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140) ~[classes/:?]
      at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) ~[classes/:?]
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) ~[classes/:?]
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) ~[classes/:?]
      at org.apache.spark.scheduler.Task.run(Task.scala:136) ~[classes/:?]
      at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507) ~[classes/:?]
      at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1475) ~[classes/:?]
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510) [classes/:?]
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_292]
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_292]
      at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
   Caused by: org.apache.parquet.io.ParquetDecodingException: Failed to read 268435456 bytes
      at org.apache.spark.sql.execution.datasources.parquet.VectorizedDeltaLengthByteArrayReader.readBinary(VectorizedDeltaLengthByteArrayReader.java:79) ~[classes/:?]
      at org.apache.spark.sql.execution.datasources.parquet.VectorizedDeltaByteArrayReader.initFromPage(VectorizedDeltaByteArrayReader.java:76) ~[classes/:?]
      at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.initDataReader(VectorizedColumnReader.java:293) ~[classes/:?]
      at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV2(VectorizedColumnReader.java:362) ~[classes/:?]
      at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.access$100(VectorizedColumnReader.java:52) ~[classes/:?]
      at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:260) ~[classes/:?]
      at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:247) ~[classes/:?]
      at org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:192) ~[parquet-column-1.12.2.jar:1.12.2]
      at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPage(VectorizedColumnReader.java:247) ~[classes/:?]
      at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:183) ~[classes/:?]
      at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:311) ~[classes/:?]

[GitHub] [spark] HyukjinKwon commented on a change in pull request #35278: [SPARK-37677][CORE] Use the shell command to decompress the ZIP file

2022-01-26 Thread GitBox


HyukjinKwon commented on a change in pull request #35278:
URL: https://github.com/apache/spark/pull/35278#discussion_r793302012



##
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##
@@ -605,6 +606,66 @@ private[spark] object Utils extends Logging {
 }
   }
 
+  def unZip(inFile: File, unzipDir: File): Unit = {

Review comment:
   Can we refer to other code references? e.g.,
https://cs.android.com/android/platform/superproject/+/master:tools/tradefederation/core/common_util/com/android/tradefed/util/ZipUtil2.java;drc=master;l=44

   It handles a bunch of corner cases, e.g., we will have to skip this
logic if the platform is non-Unix.
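
   For illustration, below is a minimal, self-contained Scala sketch of such a
platform guard (hypothetical code, not this PR's implementation of `Utils.unZip`;
it assumes commons-lang3 is on the classpath): shell out to `unzip` only on
Unix-like systems and fall back to a pure-JVM extraction elsewhere.

   ```scala
   import java.io.File
   import java.nio.file.{Files, StandardCopyOption}
   import java.util.zip.ZipFile

   import scala.sys.process._

   import org.apache.commons.lang3.SystemUtils

   object UnzipSketch {
     def unZip(inFile: File, unzipDir: File): Unit = {
       Files.createDirectories(unzipDir.toPath)
       if (SystemUtils.IS_OS_UNIX) {
         // Unix: delegate to the external `unzip` command, as this PR proposes.
         val exitCode =
           Seq("unzip", "-o", inFile.getAbsolutePath, "-d", unzipDir.getAbsolutePath).!
         require(exitCode == 0, s"unzip exited with code $exitCode")
       } else {
         // Non-Unix: pure-JVM fallback using java.util.zip.
         val zip = new ZipFile(inFile)
         try {
           val entries = zip.entries()
           while (entries.hasMoreElements) {
             val entry = entries.nextElement()
             val target = new File(unzipDir, entry.getName)
             // Guard against zip-slip: the target must stay inside unzipDir.
             require(target.getCanonicalPath.startsWith(
               unzipDir.getCanonicalPath + File.separator),
               s"Bad zip entry: ${entry.getName}")
             if (entry.isDirectory) {
               Files.createDirectories(target.toPath)
             } else {
               Files.createDirectories(target.getParentFile.toPath)
               val in = zip.getInputStream(entry)
               try Files.copy(in, target.toPath, StandardCopyOption.REPLACE_EXISTING)
               finally in.close()
             }
           }
         } finally {
           zip.close()
         }
       }
     }
   }
   ```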







[GitHub] [spark] LuciferYang edited a comment on pull request #35278: [SPARK-37677][CORE] Use the shell command to decompress the ZIP file

2022-01-26 Thread GitBox


LuciferYang edited a comment on pull request #35278:
URL: https://github.com/apache/spark/pull/35278#issuecomment-1022899627


   @zhongjingxiong Please check the GA (`Build and test`) failure.
   






[GitHub] [spark] LuciferYang commented on pull request #35278: [SPARK-37677][CORE] Use the shell command to decompress the ZIP file

2022-01-26 Thread GitBox


LuciferYang commented on pull request #35278:
URL: https://github.com/apache/spark/pull/35278#issuecomment-1022899627


   @zhongjingxiong Please pay attention to the GA (`Build and test`) failure.
   






[GitHub] [spark] zhongjingxiong commented on pull request #35278: [SPARK-37677][CORE] Use the shell command to decompress the ZIP file

2022-01-26 Thread GitBox


zhongjingxiong commented on pull request #35278:
URL: https://github.com/apache/spark/pull/35278#issuecomment-1022897967


   @HyukjinKwon @LuciferYang @Yikf I used the ZipFile under the commons package.
ZipArchiveEntry.getUnixMode() can expose a file's permissions, so I use a
collection to store the permissions of each file; once the files have been
created, each file is given back its original permissions. Could you review it?
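
   A minimal Scala sketch of this approach (illustrative only: the object and
method names are hypothetical, not the code in this PR, and it assumes
commons-compress is on the classpath) follows.

   ```scala
   import java.io.File
   import java.nio.file.{Files, StandardCopyOption}
   import java.nio.file.attribute.PosixFilePermission
   import java.nio.file.attribute.PosixFilePermission._

   import scala.collection.JavaConverters._
   import scala.collection.mutable

   import org.apache.commons.compress.archivers.zip.ZipFile

   object ZipPermissionSketch {
     // Permission bits of the Unix mode mapped to NIO permissions (0x100 == 0400, etc.).
     private val bits: Seq[(Int, PosixFilePermission)] = Seq(
       0x100 -> OWNER_READ, 0x80 -> OWNER_WRITE, 0x40 -> OWNER_EXECUTE,
       0x20 -> GROUP_READ, 0x10 -> GROUP_WRITE, 0x8 -> GROUP_EXECUTE,
       0x4 -> OTHERS_READ, 0x2 -> OTHERS_WRITE, 0x1 -> OTHERS_EXECUTE)

     private def toPosixPermissions(mode: Int): java.util.Set[PosixFilePermission] =
       bits.collect { case (bit, perm) if (mode & bit) != 0 => perm }.toSet.asJava

     def unzipWithPermissions(inFile: File, unzipDir: File): Unit = {
       val originalModes = mutable.Map.empty[File, Int] // the collection holding the modes
       val zip = new ZipFile(inFile)
       try {
         zip.getEntries.asScala.foreach { entry =>
           val target = new File(unzipDir, entry.getName)
           if (entry.isDirectory) {
             Files.createDirectories(target.toPath)
           } else {
             Files.createDirectories(target.getParentFile.toPath)
             val in = zip.getInputStream(entry)
             try Files.copy(in, target.toPath, StandardCopyOption.REPLACE_EXISTING)
             finally in.close()
           }
           // getUnixMode() is 0 when the archive carries no Unix metadata.
           if (entry.getUnixMode != 0) originalModes(target) = entry.getUnixMode
         }
       } finally {
         zip.close()
       }
       // Once all files exist, give each one back its original permissions.
       originalModes.foreach { case (file, mode) =>
         Files.setPosixFilePermissions(file.toPath, toPosixPermissions(mode))
       }
     }
   }
   ```

   One design note: the permissions are re-applied only after extraction, so a
read-only entry does not prevent its own file from being written first.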





[GitHub] [spark] HeartSaVioR commented on pull request #32875: [SPARK-35703][SQL] Relax constraint for bucket join and remove HashClusteredDistribution

2022-01-26 Thread GitBox


HeartSaVioR commented on pull request #32875:
URL: https://github.com/apache/spark/pull/32875#issuecomment-1022894109


   @sunchao 
   
   Sorry for the post-review. I didn't realize this PR might affect streaming
queries and only flagged it later.

   I discussed this change with @cloud-fan, and we are concerned about any
possibility of skipping the shuffle on grouping keys in stateful operators,
"including stream-stream join".

   In Structured Streaming, state is partitioned by the grouping keys using
Spark's internal hash function, and the number of partitions is static. That
is, if Spark does not respect the state's distribution for a stateful
operator, it leads to a correctness problem.

   So please make sure the same key is co-located across three aspects (left,
right, state) in a stream-stream join. The same applies to the non-join case,
e.g. aggregation against a bucketed table. Other stateful operators have two
aspects (key, state). In short, state must be considered.
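
   To make the concern above concrete, here is a toy, hedged illustration (an
assumption on my part, not Spark internals: it uses the public `hash`/`pmod` SQL
functions as a stand-in for the partitioning hash) of how each key maps to a
fixed state partition once the partition count is baked into the state.

   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.expr

   object StatePartitionSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
       import spark.implicits._
       // Assume the state was built with 200 shuffle partitions; that number is now fixed.
       val numStatePartitions = 200
       Seq("user-1", "user-2", "user-3").toDF("key")
         .withColumn("statePartition", expr(s"pmod(hash(key), $numStatePartitions)"))
         .show()
       // If a later plan routes "user-1" to any other partition (e.g. by reusing a
       // bucket layout instead of re-shuffling), the operator reads the wrong state.
       spark.stop()
     }
   }
   ```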
   





[GitHub] [spark] Yikun edited a comment on pull request #34983: [SPARK-37713][K8S] Assign namespace to executor configmap

2022-01-26 Thread GitBox


Yikun edited a comment on pull request #34983:
URL: https://github.com/apache/spark/pull/34983#issuecomment-1014248788


   FYI, this PR breaks the case of using a ConfigMap with a namespace specified
(driver side), see https://github.com/apache/spark/pull/35215 .

   You can use the command below to revert this commit manually until the above
PR is merged:
   ```bash
   git revert c4a9772f741836cdb399e35a45b3b5df48d9eea6
   ```





[GitHub] [spark] GabeChurch commented on pull request #34983: [SPARK-37713][K8S] Assign namespace to executor configmap

2022-01-26 Thread GitBox


GabeChurch commented on pull request #34983:
URL: https://github.com/apache/spark/pull/34983#issuecomment-1022874830


   I'm noticing the same behavior as Yikun: this pull request breaks ConfigMap
usage in non-"default" namespace(s).





[GitHub] [spark] itholic commented on a change in pull request #35252: [SPARK-37154][PYTHON] Inline hints for pyspark.rdd

2022-01-26 Thread GitBox


itholic commented on a change in pull request #35252:
URL: https://github.com/apache/spark/pull/35252#discussion_r793259415



##
File path: python/pyspark/_typing.pyi
##
@@ -17,17 +17,27 @@
 # under the License.
 
 from typing import Callable, Iterable, Sized, TypeVar, Union
-from typing_extensions import Protocol
+from typing_extensions import Literal, Protocol
+
+from numpy import int32, int64, float32, float64, ndarray
 
 F = TypeVar("F", bound=Callable)
 T_co = TypeVar("T_co", covariant=True)
 
 PrimitiveType = Union[bool, float, int, str]
 
+NonUDFType = Literal[0]
+
 class SupportsIAdd(Protocol):
 def __iadd__(self, other: SupportsIAdd) -> SupportsIAdd: ...
 
 class SupportsOrdering(Protocol):
-def __le__(self, other: SupportsOrdering) -> bool: ...
+def __lt__(self, other: SupportsOrdering) -> bool: ...
 
 class SizedIterable(Protocol, Sized, Iterable[T_co]): ...
+
+O = TypeVar("O", bound=SupportsOrdering)

Review comment:
   I don't have a really strong opinion about this, and it's not a big deal, but
it looks like it could easily be confused with the integer zero.

   Is there any reason we should use "O" as a variable name here?







[GitHub] [spark] itholic commented on a change in pull request #35252: [SPARK-37154][PYTHON] Inline hints for pyspark.rdd

2022-01-26 Thread GitBox


itholic commented on a change in pull request #35252:
URL: https://github.com/apache/spark/pull/35252#discussion_r793259415



##
File path: python/pyspark/_typing.pyi
##
@@ -17,17 +17,27 @@
 # under the License.
 
 from typing import Callable, Iterable, Sized, TypeVar, Union
-from typing_extensions import Protocol
+from typing_extensions import Literal, Protocol
+
+from numpy import int32, int64, float32, float64, ndarray
 
 F = TypeVar("F", bound=Callable)
 T_co = TypeVar("T_co", covariant=True)
 
 PrimitiveType = Union[bool, float, int, str]
 
+NonUDFType = Literal[0]
+
 class SupportsIAdd(Protocol):
 def __iadd__(self, other: SupportsIAdd) -> SupportsIAdd: ...
 
 class SupportsOrdering(Protocol):
-def __le__(self, other: SupportsOrdering) -> bool: ...
+def __lt__(self, other: SupportsOrdering) -> bool: ...
 
 class SizedIterable(Protocol, Sized, Iterable[T_co]): ...
+
+O = TypeVar("O", bound=SupportsOrdering)

Review comment:
   I don't have a really strong opinion about this, and it's not a big deal, but
it looks like it could easily be confused with the integer zero.

   Do we have to use "O" as a variable name here?







[GitHub] [spark] itholic commented on a change in pull request #35252: [SPARK-37154][PYTHON] Inline hints for pyspark.rdd

2022-01-26 Thread GitBox


itholic commented on a change in pull request #35252:
URL: https://github.com/apache/spark/pull/35252#discussion_r793259415



##
File path: python/pyspark/_typing.pyi
##
@@ -17,17 +17,27 @@
 # under the License.
 
 from typing import Callable, Iterable, Sized, TypeVar, Union
-from typing_extensions import Protocol
+from typing_extensions import Literal, Protocol
+
+from numpy import int32, int64, float32, float64, ndarray
 
 F = TypeVar("F", bound=Callable)
 T_co = TypeVar("T_co", covariant=True)
 
 PrimitiveType = Union[bool, float, int, str]
 
+NonUDFType = Literal[0]
+
 class SupportsIAdd(Protocol):
 def __iadd__(self, other: SupportsIAdd) -> SupportsIAdd: ...
 
 class SupportsOrdering(Protocol):
-def __le__(self, other: SupportsOrdering) -> bool: ...
+def __lt__(self, other: SupportsOrdering) -> bool: ...
 
 class SizedIterable(Protocol, Sized, Iterable[T_co]): ...
+
+O = TypeVar("O", bound=SupportsOrdering)

Review comment:
   I don't have a really strong feeling about this, but it looks like it could
easily be confused with the integer zero.

   Do we have to use "O" as a variable name here?







[GitHub] [spark] otterc commented on a change in pull request #35325: [SPARK-37675][SPARK-37793] Prevent overwriting of push shuffle merged files once the shuffle is finalized

2022-01-26 Thread GitBox


otterc commented on a change in pull request #35325:
URL: https://github.com/apache/spark/pull/35325#discussion_r793258742



##
File path: 
common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/RemoteBlockPushResolverSuite.java
##
@@ -1161,8 +1161,8 @@ public void testFinalizeOfDeterminateShuffle() throws 
IOException {
 
 RemoteBlockPushResolver.AppShuffleInfo appShuffleInfo =
   pushResolver.validateAndGetAppShuffleInfo(TEST_APP);
-assertTrue("Metadata of determinate shuffle should be removed after 
finalize shuffle"
-  + " merge", appShuffleInfo.getShuffles().get(0) == null);
+assertTrue("Determinate shuffle should be marked finalized"
+  + " merge", appShuffleInfo.getShuffles().get(0).isFinalized());

Review comment:
   Nit: remove ` merge`

##
File path: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##
@@ -1008,20 +996,28 @@ AppShufflePartitionInfo getPartitionInfo() {
* required for the shuffles of indeterminate stages.
*/
   public static class AppShuffleMergePartitionsInfo {
+// ConcurrentHashMap doesn't allow null for keys or values which is why 
this is required.
+// Marker to identify finalized indeterminate shuffle partitions in the 
case of indeterminate

Review comment:
   Nit: remove indeterminate

##
File path: 
common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/RemoteBlockPushResolverSuite.java
##
@@ -1287,6 +1287,79 @@ void closeAndDeletePartitionFiles(Map partitio
   + " up", appShuffleInfo.getMergedShuffleDataFile(0, 4, 0).exists());
   }
 
+  @Test
+  public void 
testFinalizationResultIsEmptyWhenTheServerDidNotReceiveAnyBlocks() {
+//shuffle 1 0 is finalized even though the server didn't receive any 
blocks for it.
+MergeStatuses statuses = pushResolver.finalizeShuffleMerge(
+new FinalizeShuffleMerge(TEST_APP, NO_ATTEMPT_ID, 1, 0));
+assertEquals("no partitions were merged", 0, statuses.reduceIds.length);
+RemoteBlockPushResolver.AppShuffleInfo appShuffleInfo =
+pushResolver.validateAndGetAppShuffleInfo(TEST_APP);
+assertTrue("Determinate shuffle should be marked finalized merge",

Review comment:
   Nit: fix the msg







[GitHub] [spark] huaxingao commented on a change in pull request #35239: [SPARK-37952][DOCS] Add missing statements to ALTER TABLE document

2022-01-26 Thread GitBox


huaxingao commented on a change in pull request #35239:
URL: https://github.com/apache/spark/pull/35239#discussion_r793256119



##
File path: docs/sql-ref-syntax-ddl-alter-table.md
##
@@ -309,6 +393,116 @@ DESC StudentInfo;
 |age|  int|   NULL|
 +---+-+---+
 
+-- Drop columns of a table
+DESC StudentInfo;
++---+-+---+
+|   col_name|data_type|comment|
++---+-+---+
+|   name|   string|   NULL|
+| rollno|  int|   NULL|
+|   LastName|   string|   NULL|
+|DOB|timestamp|   NULL|
+|age|  int|   NULL|
+|# Partition Information| |   |
+| # col_name|data_type|comment|
+|age|  int|   NULL|
++---+-+---+
+
+ALTER TABLE StudentInfo DROP columns (LastName, DOB);
+
+-- After dropping columns of the table
+DESC StudentInfo;
++---+-+---+
+|   col_name|data_type|comment|
++---+-+---+
+|   name|   string|   NULL|
+| rollno|  int|   NULL|
+|age|  int|   NULL|
+|# Partition Information| |   |
+| # col_name|data_type|comment|
+|age|  int|   NULL|
++---+-+---+
+
+-- Rename a column of a table
+DESC StudentInfo;
++---+-+---+
+|   col_name|data_type|comment|
++---+-+---+
+|   name|   string|   NULL|
+| rollno|  int|   NULL|
+|age|  int|   NULL|
+|# Partition Information| |   |
+| # col_name|data_type|comment|
+|age|  int|   NULL|
++---+-+---+
+
+ALTER TABLE StudentInfo RENAME COLUMN name TO FirstName;
+
+-- After renaming a column of the table
+DESC StudentInfo;
++---+-+---+
+|   col_name|data_type|comment|
++---+-+---+
+|  FirstName|   string|   NULL|
+| rollno|  int|   NULL|
+|age|  int|   NULL|
+|# Partition Information| |   |
+| # col_name|data_type|comment|
+|age|  int|   NULL|
++---+-+---+
+
+-- ALTER OR CHANGE COLUMNS
+DESC StudentInfo;
++---+-+---+
+|   col_name|data_type|comment|
++---+-+---+
+|  FirstName|   string|   NULL|
+| rollno|  int|   NULL|
+|age|  int|   NULL|
+|# Partition Information| |   |
+| # col_name|data_type|comment|
+|age|  int|   NULL|
++---+-+---+
+
+ALTER TABLE StudentInfo ALTER COLUMN FirstName COMMENT "new comment";
+
+--After ALTER or CHANGE COLUMNS
+DESC StudentInfo;
++---+-+---+
+|   col_name|data_type|comment|
++---+-+---+
+|  FirstName|   string|new comment|
+| rollno|  int|   NULL|
+|age|  int|   NULL|
+|# Partition Information| |   |
+| # col_name|data_type|comment|
+|age|  int|   NULL|
++---+-+---+
+
+-- REPLACE COLUMNS
+DESC StudentInfo;
++---+-+---+
+|   col_name|data_type|comment|
++---+-+---+
+|  FirstName|   string|new comment|
+| rollno|  int|   NULL|
+|age|  int|   NULL|
+|# Partition Information| |   |
+| # col_name|data_type|comment|
+|age|  int|   NULL|
++---+-+---+
+
+ALTER TABLE StudentInfo REPLACE COLUMNS (name string, ID int COMMENT 'new 
comment');
+
+--After replacing COLUMNS

Review comment:
   nit: space after `-`

##
File path: docs/sql-ref-syntax-ddl-alter-table.md
##
@@ -435,6 +597,9 @@ ALTER TABLE dbx.tab1 SET TBLPROPERTIES ('comment' = 'This 
is a new comment.');
 
 -- DROP TABLE PROPERTIES
 ALTER TABLE dbx.tab1 UNSET TBLPROPERTIES ('winner');
+
+-- RECOVER PARTITIONS
+ALTER TABLE dbx.tab1 RECOVER PARTITIONS

Review comment:
   nit: `;` in the end





[GitHub] [spark] huaxingao commented on a change in pull request #35239: [SPARK-37952][DOCS] Add missing statements to ALTER TABLE document

2022-01-26 Thread GitBox


huaxingao commented on a change in pull request #35239:
URL: https://github.com/apache/spark/pull/35239#discussion_r793255964



##
File path: docs/sql-ref-syntax-ddl-alter-table.md
##
@@ -309,6 +393,116 @@ DESC StudentInfo;
 |age|  int|   NULL|
 +---+-+---+
 
+-- Drop columns of a table
+DESC StudentInfo;
++---+-+---+
+|   col_name|data_type|comment|
++---+-+---+
+|   name|   string|   NULL|
+| rollno|  int|   NULL|
+|   LastName|   string|   NULL|
+|DOB|timestamp|   NULL|
+|age|  int|   NULL|
+|# Partition Information| |   |
+| # col_name|data_type|comment|
+|age|  int|   NULL|
++---+-+---+
+
+ALTER TABLE StudentInfo DROP columns (LastName, DOB);
+
+-- After dropping columns of the table
+DESC StudentInfo;
++---+-+---+
+|   col_name|data_type|comment|
++---+-+---+
+|   name|   string|   NULL|
+| rollno|  int|   NULL|
+|age|  int|   NULL|
+|# Partition Information| |   |
+| # col_name|data_type|comment|
+|age|  int|   NULL|
++---+-+---+
+
+-- Rename a column of a table
+DESC StudentInfo;
++---+-+---+
+|   col_name|data_type|comment|
++---+-+---+
+|   name|   string|   NULL|
+| rollno|  int|   NULL|
+|age|  int|   NULL|
+|# Partition Information| |   |
+| # col_name|data_type|comment|
+|age|  int|   NULL|
++---+-+---+
+
+ALTER TABLE StudentInfo RENAME COLUMN name TO FirstName;
+
+-- After renaming a column of the table
+DESC StudentInfo;
++---+-+---+
+|   col_name|data_type|comment|
++---+-+---+
+|  FirstName|   string|   NULL|
+| rollno|  int|   NULL|
+|age|  int|   NULL|
+|# Partition Information| |   |
+| # col_name|data_type|comment|
+|age|  int|   NULL|
++---+-+---+
+
+-- ALTER OR CHANGE COLUMNS
+DESC StudentInfo;
++---+-+---+
+|   col_name|data_type|comment|
++---+-+---+
+|  FirstName|   string|   NULL|
+| rollno|  int|   NULL|
+|age|  int|   NULL|
+|# Partition Information| |   |
+| # col_name|data_type|comment|
+|age|  int|   NULL|
++---+-+---+
+
+ALTER TABLE StudentInfo ALTER COLUMN FirstName COMMENT "new comment";
+
+--After ALTER or CHANGE COLUMNS

Review comment:
   nit: space after `-`







[GitHub] [spark] huaxingao commented on a change in pull request #35239: [SPARK-37952][DOCS] Add missing statements to ALTER TABLE document

2022-01-26 Thread GitBox


huaxingao commented on a change in pull request #35239:
URL: https://github.com/apache/spark/pull/35239#discussion_r793255420



##
File path: docs/sql-ref-syntax-ddl-alter-table.md
##
@@ -101,6 +145,27 @@ ALTER TABLE table_identifier { ALTER | CHANGE } [ COLUMN ] 
col_spec alterColumnA
 
 Change column's definition.
 
+### REPLACE COLUMNS
+
+`ALTER TABLE REPLACE COLUMNS` statement removes all existing columns and adds 
the new set of columns.
+Note that this statement is only supported with v2 tables.
+
+
+ Syntax
+
+```sql
+ALTER TABLE table_identifier REPLACE COLUMNS  
+  [ ( col_name1 col_type1 [ COMMENT col_comment1 ], ... ) ]
+```

Review comment:
   This one is a little complicated. Here is the syntax:
   ```
   | ALTER TABLE table=multipartIdentifier partitionSpec?
   REPLACE COLUMNS
   '(' columns=qualifiedColTypeWithPositionList ')'
   
   qualifiedColTypeWithPositionList
   : qualifiedColTypeWithPosition (',' qualifiedColTypeWithPosition)*
   ;
   
   qualifiedColTypeWithPosition
   : name=multipartIdentifier dataType (NOT NULL)? commentSpec? colPosition?
   ;
   ```
   Could you please make the syntax more accurate?







[GitHub] [spark] huaxingao commented on a change in pull request #35239: [SPARK-37952][DOCS] Add missing statements to ALTER TABLE document

2022-01-26 Thread GitBox


huaxingao commented on a change in pull request #35239:
URL: https://github.com/apache/spark/pull/35239#discussion_r793253463



##
File path: docs/sql-ref-syntax-ddl-alter-table.md
##
@@ -75,6 +75,50 @@ ALTER TABLE table_identifier ADD COLUMNS ( col_spec [ , ... 
] )
 
 Specifies the columns to be added.
 
+### DROP COLUMNS
+
+`ALTER TABLE DROP COLUMNS` statement drops mentioned columns from an existing 
table.
+Note that this statement is only supported with v2 tables.
+
+
+ Syntax
+
+```sql
+ALTER TABLE table_identifier DROP COLUMNS ( col_name1, ... )
+```
+
+ Parameters
+
+* **table_identifier**
+
+  Specifies a table name, which may be optionally qualified with a database 
name.
+
+  **Syntax:** `[ database_name. ] table_name`
+
+### RENAME COLUMN
+
+`ALTER TABLE RENAME COLUMN` statement changes the column name of an existing 
table.
+Note that this statement is only supported with v2 tables.
+
+
+ Syntax
+
+```sql
+ALTER TABLE table_identifier RENAME COLUMN col_spec to col_spec

Review comment:
   nit: capitalize `to`







[GitHub] [spark] huaxingao commented on a change in pull request #35239: [SPARK-37952][DOCS] Add missing statements to ALTER TABLE document

2022-01-26 Thread GitBox


huaxingao commented on a change in pull request #35239:
URL: https://github.com/apache/spark/pull/35239#discussion_r793252862



##
File path: docs/sql-ref-syntax-ddl-alter-table.md
##
@@ -75,6 +75,50 @@ ALTER TABLE table_identifier ADD COLUMNS ( col_spec [ , ... 
] )
 
 Specifies the columns to be added.
 
+### DROP COLUMNS
+
+`ALTER TABLE DROP COLUMNS` statement drops mentioned columns from an existing 
table.
+Note that this statement is only supported with v2 tables.
+
+
+ Syntax
+
+```sql
+ALTER TABLE table_identifier DROP COLUMNS ( col_name1, ... )
+```

Review comment:
   I just realized that previously we used `col_spec` for add column. I think
`col_name` is more accurate, because `col_spec` makes me think it is the column
name + data type. Could you please check whether other docs use `col_spec` too?
If yes, we probably want to use `col_spec` to be consistent.







[GitHub] [spark] huaxingao commented on a change in pull request #35239: [SPARK-37952][DOCS] Add missing statements to ALTER TABLE document

2022-01-26 Thread GitBox


huaxingao commented on a change in pull request #35239:
URL: https://github.com/apache/spark/pull/35239#discussion_r793249421



##
File path: docs/sql-ref-syntax-ddl-alter-table.md
##
@@ -75,6 +75,50 @@ ALTER TABLE table_identifier ADD COLUMNS ( col_spec [ , ... 
] )
 
 Specifies the columns to be added.
 
+### DROP COLUMNS
+
+`ALTER TABLE DROP COLUMNS` statement drops mentioned columns from an existing 
table.
+Note that this statement is only supported with v2 tables.
+
+

Review comment:
   nit: I think one blank line is enough. There are a couple more places where
you have an extra blank line.







[GitHub] [spark] huaxingao commented on a change in pull request #35239: [SPARK-37952][DOCS] Add missing statements to ALTER TABLE document

2022-01-26 Thread GitBox


huaxingao commented on a change in pull request #35239:
URL: https://github.com/apache/spark/pull/35239#discussion_r793248260



##
File path: docs/sql-ref-syntax-ddl-alter-table.md
##
@@ -75,6 +75,50 @@ ALTER TABLE table_identifier ADD COLUMNS ( col_spec [ , ... 
] )
 
 Specifies the columns to be added.
 
+### DROP COLUMNS
+
+`ALTER TABLE DROP COLUMNS` statement drops mentioned columns from an existing 
table.
+Note that this statement is only supported with v2 tables.
+
+
+ Syntax
+
+```sql
+ALTER TABLE table_identifier DROP COLUMNS ( col_name1, ... )
+```

Review comment:
   Here is the syntax:
   ```
   | ALTER TABLE multipartIdentifier
   DROP (COLUMN | COLUMNS)
   '(' columns=multipartIdentifierList ')'
#dropTableColumns
   | ALTER TABLE multipartIdentifier
   DROP (COLUMN | COLUMNS) columns=multipartIdentifierList
#dropTableColumns
   ```
   So I think it's something like this:
   ```
   ALTER TABLE table_identifier DROP { COLUMN | COLUMNS } [ ( ] col_name [ , 
... ]  [ ) ]
   ```
   I guess we should add a parameter for `col_name` in the section below.

   You might want to change the add column syntax too, to make it more accurate.
   







[GitHub] [spark] HyukjinKwon closed pull request #35339: [SPARK-38040][BUILD] Enable binary compatibility check for APIs in Catalyst, KVStore and Avro modules

2022-01-26 Thread GitBox


HyukjinKwon closed pull request #35339:
URL: https://github.com/apache/spark/pull/35339


   





[GitHub] [spark] HyukjinKwon commented on pull request #35339: [SPARK-38040][BUILD] Enable binary compatibility check for APIs in Catalyst, KVStore and Avro modules

2022-01-26 Thread GitBox


HyukjinKwon commented on pull request #35339:
URL: https://github.com/apache/spark/pull/35339#issuecomment-1022840884


   Thanks all! Mima passed.
   
   Merged to master





[GitHub] [spark] iRakson commented on pull request #35337: [SPARK-37840][SQL] Dynamic Update of UDF

2022-01-26 Thread GitBox


iRakson commented on pull request #35337:
URL: https://github.com/apache/spark/pull/35337#issuecomment-1022838418


   @HyukjinKwon @dongjoon-hyun
   Kindly take a look at this PR.
   The UT failure is not related to the changes made in this PR.
   More UTs covering all the other scenarios will be added.





[GitHub] [spark] nyingping opened a new pull request #35340: [WINDOW]improve structured streaming window of calculated

2022-01-26 Thread GitBox


nyingping opened a new pull request #35340:
URL: https://github.com/apache/spark/pull/35340


 **What changes were proposed in this pull request?**

   Remove the `CaseWhen` and modify how the resulting window is calculated.

   **Why are the changes needed?**

   Simpler and more efficient.

   **Does this PR introduce any user-facing change?**
   No.

   **How was this patch tested?**

   Existing tests, as this is just a refactoring.





[GitHub] [spark] melin commented on pull request #35337: [SPARK-37840][SQL] Dynamic Update of UDF

2022-01-26 Thread GitBox


melin commented on pull request #35337:
URL: https://github.com/apache/spark/pull/35337#issuecomment-1022829887


   Updating the jar with a CREATE OR REPLACE command is not very convenient; it
would be better to rely on the jar file's modification time so that an update
takes effect directly.





[GitHub] [spark] Yikun commented on a change in pull request #34212: [SPARK-36402][PYTHON] Implement Series.combine

2022-01-26 Thread GitBox


Yikun commented on a change in pull request #34212:
URL: https://github.com/apache/spark/pull/34212#discussion_r793225136



##
File path: python/pyspark/pandas/series.py
##
@@ -4483,6 +4487,181 @@ def replace(
 
 return self._with_new_scol(current)  # TODO: dtype?
 
+def combine(
+self,
+other: Union[Scalar, "Series"],
+func: Callable,
+fill_value: Optional[Any] = None,
+) -> "Series":
+"""
+Combine the Series with a Series or scalar according to `func`.
+
+Combine the Series and `other` using `func` to perform elementwise
+selection for combined Series.
+`fill_value` is assumed when value is missing at some index
+from one of the two objects being combined.
+
+.. versionadded:: 3.3.0
+
+.. note:: This API executes the function once to infer the type which 
is
+potentially expensive, for instance, when the dataset is created 
after
+aggregations or sorting.
+
+To avoid this, specify return type in ``func``, for instance, as 
below:
+
+>>> def foo(x, y) -> np.int32:

Review comment:
   Or maybe just give a `max` example; it would read fluently when users see the
doctest below.







[GitHub] [spark] Yikun commented on a change in pull request #34212: [SPARK-36402][PYTHON] Implement Series.combine

2022-01-26 Thread GitBox


Yikun commented on a change in pull request #34212:
URL: https://github.com/apache/spark/pull/34212#discussion_r793219057



##
File path: python/pyspark/pandas/series.py
##
@@ -4483,6 +4487,181 @@ def replace(
 
 return self._with_new_scol(current)  # TODO: dtype?
 
+def combine(
+self,
+other: Union[Scalar, "Series"],
+func: Callable,
+fill_value: Optional[Any] = None,
+) -> "Series":
+"""
+Combine the Series with a Series or scalar according to `func`.
+
+Combine the Series and `other` using `func` to perform elementwise
+selection for combined Series.
+`fill_value` is assumed when value is missing at some index
+from one of the two objects being combined.
+
+.. versionadded:: 3.3.0
+
+.. note:: This API executes the function once to infer the type which 
is
+potentially expensive, for instance, when the dataset is created 
after
+aggregations or sorting.
+
+To avoid this, specify return type in ``func``, for instance, as 
below:
+
+>>> def foo(x, y) -> np.int32:
+... return x * y
+
+pandas-on-Spark uses return type hint and does not try to infer 
the type.
+
+This API does not support self combine for now.
+
+>>> psser1 = ps.Series([1, 2, 3, 4])
+>>> psser1.combine(psser1, max)  # doctest: +SKIP
+...
+ValueError: Unsupported self combine
+
+Parameters
+--
+other : Series or scalar
+The value(s) to be combined with the `Series`.
+func : function
+Function that takes two scalars as inputs and returns an element.
+Note that type hint for return type is strongly recommended.
+fill_value : scalar, optional
+The value to assume when an index is missing from
+one Series or the other. The default specifies to use the
+appropriate NaN value for the underlying dtype of the Series.
+
+Returns
+---
+Series
+The result of combining the Series with the other object.
+
+See Also
+
+Series.combine_first : Combine Series values, choosing the calling
+Series' values first.
+
+Examples
+
+Consider 2 Datasets ``s1`` and ``s2`` containing
+highest clocked speeds of different birds.
+
+>>> from pyspark.pandas.config import set_option, reset_option
+>>> set_option("compute.ops_on_diff_frames", True)
+>>> s1 = ps.Series({'falcon': 330.0, 'eagle': 160.0})
+>>> s1
+falcon330.0
+eagle 160.0
+dtype: float64
+>>> s2 = ps.Series({'falcon': 345.0, 'eagle': 200.0, 'duck': 30.0})
+>>> s2
+falcon345.0
+eagle 200.0
+duck   30.0
+dtype: float64
+
+Now, to combine the two datasets and view the highest speeds
+of the birds across the two datasets
+
+>>> s1.combine(s2, max)

Review comment:
   But this is not a best practice given the note above; let's give the
right example, like:
   ```
   >>> def max_with_return_type(x, y) -> float:
   ... return max(x, y)
   >>> s1.combine(s2, max_with_return_type)
   ```

##
File path: python/pyspark/pandas/series.py
##
@@ -4483,6 +4487,181 @@ def replace(
 
 return self._with_new_scol(current)  # TODO: dtype?
 
+def combine(
+self,
+other: Union[Scalar, "Series"],
+func: Callable,
+fill_value: Optional[Any] = None,
+) -> "Series":
+"""
+Combine the Series with a Series or scalar according to `func`.
+
+Combine the Series and `other` using `func` to perform elementwise
+selection for combined Series.
+`fill_value` is assumed when value is missing at some index
+from one of the two objects being combined.
+
+.. versionadded:: 3.3.0
+
+.. note:: This API executes the function once to infer the type which 
is
+potentially expensive, for instance, when the dataset is created 
after
+aggregations or sorting.
+
+To avoid this, specify return type in ``func``, for instance, as 
below:
+
+>>> def foo(x, y) -> np.int32:

Review comment:
   nits:
   ```suggestion
   >>> def multiply(x, y) -> np.int32:
   ```

##
File path: python/pyspark/pandas/series.py
##
@@ -4483,6 +4487,181 @@ def replace(
 
 return self._with_new_scol(current)  # TODO: dtype?
 
+def combine(
+self,
+other: Union[Scalar, "Series"],
+func: Callable,
+fill_value: Optional[Any] = None,
+) -> "Series":
+"""
+Combine the Series with a Series or scalar according to `func`.
+
+Combine the Series and `other` using 

[GitHub] [spark] HyukjinKwon commented on pull request #35339: [SPARK-38040][BUILD] Enable binary compatibility check for APIs in Catalyst, KVStore and Avro modules

2022-01-26 Thread GitBox


HyukjinKwon commented on pull request #35339:
URL: https://github.com/apache/spark/pull/35339#issuecomment-1022808335


   cc @HeartSaVioR @cloud-fan @gengliangwang @dongjoon-hyun FYI





[GitHub] [spark] HyukjinKwon opened a new pull request #35339: [SPARK-38040][BUILD] Enable binary compatibility check for APIs in Catalyst, KVStore and Avro modules

2022-01-26 Thread GitBox


HyukjinKwon opened a new pull request #35339:
URL: https://github.com/apache/spark/pull/35339


   ### What changes were proposed in this pull request?
   
   We don't currently run binary compatibility check in below modules:
   
   ```
   [info] spark-parent: mimaPreviousArtifacts not set, not analyzing binary 
compatibility
   [info] spark-network-common: mimaPreviousArtifacts not set, not analyzing 
binary compatibility
   [info] spark-tags: mimaPreviousArtifacts not set, not analyzing binary 
compatibility
   [info] spark-unsafe: mimaPreviousArtifacts not set, not analyzing binary 
compatibility
   [info] spark-network-shuffle: mimaPreviousArtifacts not set, not analyzing 
binary compatibility
   [info] spark-kvstore: mimaPreviousArtifacts not set, not analyzing binary 
compatibility
   [info] spark-tools: mimaPreviousArtifacts not set, not analyzing binary 
compatibility
   [info] spark-token-provider-kafka-0-10: mimaPreviousArtifacts not set, not 
analyzing binary compatibility
   [info] spark-streaming-kafka-0-10-assembly: mimaPreviousArtifacts not set, 
not analyzing binary compatibility
   [info] spark-catalyst: mimaPreviousArtifacts not set, not analyzing binary 
compatibility
   [info] spark-repl: mimaPreviousArtifacts not set, not analyzing binary 
compatibility
   [info] spark-avro: mimaPreviousArtifacts not set, not analyzing binary 
compatibility
   [info] spark-sql-kafka-0-10: mimaPreviousArtifacts not set, not analyzing 
binary compatibility
   [info] spark-hive: mimaPreviousArtifacts not set, not analyzing binary 
compatibility
   [info] spark-assembly: mimaPreviousArtifacts not set, not analyzing binary 
compatibility
   [info] spark-examples: mimaPreviousArtifacts not set, not analyzing binary 
compatibility
   ```
   
   However, there are some APIs under these modules. For example,
https://github.com/apache/spark/blob/master/python/pyspark/sql/avro/functions.py
 for Avro,
https://github.com/apache/spark/tree/master/common/kvstore/src/main/java/org/apache/spark/util/kvstore
 for KVStore (to become an API), and
https://github.com/apache/spark/tree/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector
 for Catalyst.
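
   For context, enabling the check for one of these modules essentially amounts
to giving MiMa a previous release artifact to diff against. A minimal sbt sketch
(illustrative only; the module and version below are assumptions, not the actual
change to Spark's MimaBuild):

   ```scala
   // build.sbt fragment for sbt-mima-plugin: compare the module against a previous release.
   mimaPreviousArtifacts := Set("org.apache.spark" %% "spark-catalyst" % "3.2.0")
   ```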
   
   ### Why are the changes needed?
   
   To detect binary compatibility.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, dev-only.
   
   ### How was this patch tested?
   
   Manually tested via running `dev/mima`.





[GitHub] [spark] Yikf commented on pull request #35337: [SPARK-37840][SQL] Dynamic Update of UDF

2022-01-26 Thread GitBox


Yikf commented on pull request #35337:
URL: https://github.com/apache/spark/pull/35337#issuecomment-1022808022


   > Support Permanent Function?
   
   The information from the suite describes that permanent functions should be
supported.





[GitHub] [spark] LuciferYang commented on a change in pull request #35335: [SPARK-38036][SQL][TESTS] Refactor `VersionsSuite` to `HiveClientSuite` and make it a subclass of `HiveVersionSuite`

2022-01-26 Thread GitBox


LuciferYang commented on a change in pull request #35335:
URL: https://github.com/apache/spark/pull/35335#discussion_r793220735



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala
##
@@ -0,0 +1,1072 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.client
+
+import java.io.{ByteArrayOutputStream, File, PrintStream, PrintWriter}
+import java.net.URI
+
+import org.apache.commons.lang3.{JavaVersion, SystemUtils}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.hive.common.StatsSetupConst
+import org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
+import org.apache.hadoop.hive.serde2.`lazy`.LazySimpleSerDe
+import org.apache.hadoop.mapred.TextInputFormat
+import org.apache.hadoop.security.UserGroupInformation
+
+import org.apache.spark.sql.{AnalysisException, Row}
+import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier}
+import org.apache.spark.sql.catalyst.analysis.{DatabaseAlreadyExistsException, 
NoSuchDatabaseException, NoSuchPermanentFunctionException, 
PartitionsAlreadyExistException}
+import org.apache.spark.sql.catalyst.catalog._
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, 
Literal}
+import org.apache.spark.sql.hive.HiveExternalCatalog
+import org.apache.spark.sql.hive.test.TestHiveVersion
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.util.{MutableURLClassLoader, Utils}
+
+class HiveClientSuite(version: String, allVersions: Seq[String])
+  extends HiveVersionSuite(version) {
+
+  private var versionSpark: TestHiveVersion = null
+
+  private val emptyDir = Utils.createTempDir().getCanonicalPath
+
+  /**
+   * Drops table `tableName` after calling `f`.
+   */
+  protected def withTable(tableNames: String*)(f: => Unit): Unit = {
+try f finally {
+  tableNames.foreach { name =>
+versionSpark.sql(s"DROP TABLE IF EXISTS $name")
+  }
+}
+  }
+
+  test("create client") {

Review comment:
   The `$version:` prefix is removed because subclasses of `HiveVersionSuite`
print this prefix by default; the other cases in this file are similar.
   
   
   
   







[GitHub] [spark] LuciferYang commented on a change in pull request #35335: [SPARK-38036][SQL][TESTS] Refactor `VersionsSuite` to `HiveClientSuite` and make it a subclass of `HiveVersionSuite`

2022-01-26 Thread GitBox


LuciferYang commented on a change in pull request #35335:
URL: https://github.com/apache/spark/pull/35335#discussion_r793219679



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala
##
@@ -1,1159 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.sql.hive.client
-
-import java.io.{ByteArrayOutputStream, File, PrintStream, PrintWriter}
-import java.net.URI
-
-import org.apache.commons.lang3.{JavaVersion, SystemUtils}
-import org.apache.hadoop.conf.Configuration
-import org.apache.hadoop.fs.Path
-import org.apache.hadoop.hive.common.StatsSetupConst
-import org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
-import org.apache.hadoop.hive.serde2.`lazy`.LazySimpleSerDe
-import org.apache.hadoop.mapred.TextInputFormat
-import org.apache.hadoop.security.UserGroupInformation
-
-import org.apache.spark.SparkFunSuite
-import org.apache.spark.internal.Logging
-import org.apache.spark.sql.{AnalysisException, Row}
-import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier}
-import org.apache.spark.sql.catalyst.analysis.{DatabaseAlreadyExistsException, 
NoSuchDatabaseException, NoSuchPermanentFunctionException, 
PartitionsAlreadyExistException}
-import org.apache.spark.sql.catalyst.catalog._
-import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, 
Literal}
-import org.apache.spark.sql.catalyst.util.quietly
-import org.apache.spark.sql.hive.{HiveExternalCatalog, HiveUtils}
-import org.apache.spark.sql.hive.test.TestHiveVersion
-import org.apache.spark.sql.types.IntegerType
-import org.apache.spark.sql.types.StructType
-import org.apache.spark.tags.{ExtendedHiveTest, SlowHiveTest}
-import org.apache.spark.util.{MutableURLClassLoader, Utils}
-
-/**
- * A simple set of tests that call the methods of a [[HiveClient]], loading 
different version
- * of hive from maven central.  These tests are simple in that they are mostly 
just testing to make
- * sure that reflective calls are not throwing NoSuchMethod error, but the 
actually functionality
- * is not fully tested.
- */
-// TODO: Refactor this to `HiveClientSuite` and make it a subclass of 
`HiveVersionSuite`
-@SlowHiveTest
-@ExtendedHiveTest
-class VersionsSuite extends SparkFunSuite with Logging {
-
-  override protected val enableAutoThreadAudit = false
-
-  import HiveClientBuilder.buildClient
-
-  /**
-   * Drops table `tableName` after calling `f`.
-   */
-  protected def withTable(tableNames: String*)(f: => Unit): Unit = {
-try f finally {
-  tableNames.foreach { name =>
-versionSpark.sql(s"DROP TABLE IF EXISTS $name")
-  }
-}
-  }
-
-  test("success sanity check") {

Review comment:
   `success sanity check`, `hadoop configuration preserved`, `override useless and side-effect hive configurations` and `failure sanity check` are moved to `HiveClientSuites`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a change in pull request #35335: [SPARK-38036][SQL][TESTS] Refactor `VersionsSuite` to `HiveClientSuite` and make it a subclass of `HiveVersionSuite`

2022-01-26 Thread GitBox


LuciferYang commented on a change in pull request #35335:
URL: https://github.com/apache/spark/pull/35335#discussion_r793219289



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala
##
@@ -1,1159 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.sql.hive.client
-
-import java.io.{ByteArrayOutputStream, File, PrintStream, PrintWriter}
-import java.net.URI
-
-import org.apache.commons.lang3.{JavaVersion, SystemUtils}
-import org.apache.hadoop.conf.Configuration
-import org.apache.hadoop.fs.Path
-import org.apache.hadoop.hive.common.StatsSetupConst
-import org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
-import org.apache.hadoop.hive.serde2.`lazy`.LazySimpleSerDe
-import org.apache.hadoop.mapred.TextInputFormat
-import org.apache.hadoop.security.UserGroupInformation
-
-import org.apache.spark.SparkFunSuite
-import org.apache.spark.internal.Logging
-import org.apache.spark.sql.{AnalysisException, Row}
-import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier}
-import org.apache.spark.sql.catalyst.analysis.{DatabaseAlreadyExistsException, 
NoSuchDatabaseException, NoSuchPermanentFunctionException, 
PartitionsAlreadyExistException}
-import org.apache.spark.sql.catalyst.catalog._
-import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, 
Literal}
-import org.apache.spark.sql.catalyst.util.quietly
-import org.apache.spark.sql.hive.{HiveExternalCatalog, HiveUtils}
-import org.apache.spark.sql.hive.test.TestHiveVersion
-import org.apache.spark.sql.types.IntegerType
-import org.apache.spark.sql.types.StructType
-import org.apache.spark.tags.{ExtendedHiveTest, SlowHiveTest}
-import org.apache.spark.util.{MutableURLClassLoader, Utils}
-
-/**
- * A simple set of tests that call the methods of a [[HiveClient]], loading 
different version
- * of hive from maven central.  These tests are simple in that they are mostly 
just testing to make
- * sure that reflective calls are not throwing NoSuchMethod error, but the 
actually functionality
- * is not fully tested.
- */
-// TODO: Refactor this to `HiveClientSuite` and make it a subclass of 
`HiveVersionSuite`
-@SlowHiveTest
-@ExtendedHiveTest
-class VersionsSuite extends SparkFunSuite with Logging {
-
-  override protected val enableAutoThreadAudit = false
-
-  import HiveClientBuilder.buildClient
-
-  /**
-   * Drops table `tableName` after calling `f`.
-   */
-  protected def withTable(tableNames: String*)(f: => Unit): Unit = {
-try f finally {
-  tableNames.foreach { name =>
-versionSpark.sql(s"DROP TABLE IF EXISTS $name")
-  }
-}
-  }
-
-  test("success sanity check") {
-val badClient = buildClient(HiveUtils.builtinHiveVersion, new 
Configuration())
-val db = new CatalogDatabase("default", "desc", new URI("loc"), Map())
-badClient.createDatabase(db, ignoreIfExists = true)
-  }
-
-  test("hadoop configuration preserved") {
-val hadoopConf = new Configuration()
-hadoopConf.set("test", "success")
-val client = buildClient(HiveUtils.builtinHiveVersion, hadoopConf)
-assert("success" === client.getConf("test", null))
-  }
-
-  test("override useless and side-effect hive configurations ") {
-val hadoopConf = new Configuration()
-// These hive flags should be reset by spark
-hadoopConf.setBoolean("hive.cbo.enable", true)
-hadoopConf.setBoolean("hive.session.history.enabled", true)
-hadoopConf.set("hive.execution.engine", "tez")
-val client = buildClient(HiveUtils.builtinHiveVersion, hadoopConf)
-assert(!client.getConf("hive.cbo.enable", "true").toBoolean)
-assert(!client.getConf("hive.session.history.enabled", "true").toBoolean)
-assert(client.getConf("hive.execution.engine", "tez") === "mr")
-  }
-
-  private def getNestedMessages(e: Throwable): String = {
-var causes = ""
-var lastException = e
-while (lastException != null) {
-  causes += lastException.toString + "\n"
-  lastException = lastException.getCause
-}
-causes
-  }
-
-  private val emptyDir = Utils.createTempDir().getCanonicalPath
-
-  // Its actually pretty easy to mess things up and have all of your tests 

[GitHub] [spark] Kimahriman commented on pull request #35085: [SPARK-37618][CORE] Remove shuffle blocks using the shuffle service for released executors

2022-01-26 Thread GitBox


Kimahriman commented on pull request #35085:
URL: https://github.com/apache/spark/pull/35085#issuecomment-1022803122


   I went with the approach of making subdirs 770 and files world-readable, which seemed a lot less hacky than relying on ACLs. I will update the description with the approach.
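   As a rough sketch of that permission scheme (paths and names are placeholders, not the actual patch):

   ```scala
   import java.nio.file.{Files, Paths}
   import java.nio.file.attribute.PosixFilePermissions

   // Directories get 770 and the block files get world-readable bits, as described above.
   val subDir = Paths.get("/tmp/blockmgr-1234/0a")         // placeholder path
   val dataFile = subDir.resolve("shuffle_0_0_0.data")     // placeholder file
   Files.setPosixFilePermissions(subDir, PosixFilePermissions.fromString("rwxrwx---"))
   Files.setPosixFilePermissions(dataFile, PosixFilePermissions.fromString("rw-r--r--"))
   ```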


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a change in pull request #35335: [SPARK-38036][SQL][TESTS] Refactor `VersionsSuite` to `HiveClientSuite` and make it a subclass of `HiveVersionSuite`

2022-01-26 Thread GitBox


LuciferYang commented on a change in pull request #35335:
URL: https://github.com/apache/spark/pull/35335#discussion_r793218821



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala
##
@@ -1,1159 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.sql.hive.client
-
-import java.io.{ByteArrayOutputStream, File, PrintStream, PrintWriter}
-import java.net.URI
-
-import org.apache.commons.lang3.{JavaVersion, SystemUtils}
-import org.apache.hadoop.conf.Configuration
-import org.apache.hadoop.fs.Path
-import org.apache.hadoop.hive.common.StatsSetupConst
-import org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
-import org.apache.hadoop.hive.serde2.`lazy`.LazySimpleSerDe
-import org.apache.hadoop.mapred.TextInputFormat
-import org.apache.hadoop.security.UserGroupInformation
-
-import org.apache.spark.SparkFunSuite
-import org.apache.spark.internal.Logging
-import org.apache.spark.sql.{AnalysisException, Row}
-import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier}
-import org.apache.spark.sql.catalyst.analysis.{DatabaseAlreadyExistsException, 
NoSuchDatabaseException, NoSuchPermanentFunctionException, 
PartitionsAlreadyExistException}
-import org.apache.spark.sql.catalyst.catalog._
-import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, 
Literal}
-import org.apache.spark.sql.catalyst.util.quietly
-import org.apache.spark.sql.hive.{HiveExternalCatalog, HiveUtils}
-import org.apache.spark.sql.hive.test.TestHiveVersion
-import org.apache.spark.sql.types.IntegerType
-import org.apache.spark.sql.types.StructType
-import org.apache.spark.tags.{ExtendedHiveTest, SlowHiveTest}
-import org.apache.spark.util.{MutableURLClassLoader, Utils}
-
-/**
- * A simple set of tests that call the methods of a [[HiveClient]], loading 
different version
- * of hive from maven central.  These tests are simple in that they are mostly 
just testing to make
- * sure that reflective calls are not throwing NoSuchMethod error, but the 
actually functionality
- * is not fully tested.
- */
-// TODO: Refactor this to `HiveClientSuite` and make it a subclass of 
`HiveVersionSuite`
-@SlowHiveTest
-@ExtendedHiveTest
-class VersionsSuite extends SparkFunSuite with Logging {
-
-  override protected val enableAutoThreadAudit = false
-
-  import HiveClientBuilder.buildClient
-
-  /**
-   * Drops table `tableName` after calling `f`.
-   */
-  protected def withTable(tableNames: String*)(f: => Unit): Unit = {
-try f finally {
-  tableNames.foreach { name =>
-versionSpark.sql(s"DROP TABLE IF EXISTS $name")
-  }
-}
-  }
-
-  test("success sanity check") {
-val badClient = buildClient(HiveUtils.builtinHiveVersion, new 
Configuration())
-val db = new CatalogDatabase("default", "desc", new URI("loc"), Map())
-badClient.createDatabase(db, ignoreIfExists = true)
-  }
-
-  test("hadoop configuration preserved") {
-val hadoopConf = new Configuration()
-hadoopConf.set("test", "success")
-val client = buildClient(HiveUtils.builtinHiveVersion, hadoopConf)
-assert("success" === client.getConf("test", null))
-  }
-
-  test("override useless and side-effect hive configurations ") {
-val hadoopConf = new Configuration()
-// These hive flags should be reset by spark
-hadoopConf.setBoolean("hive.cbo.enable", true)
-hadoopConf.setBoolean("hive.session.history.enabled", true)
-hadoopConf.set("hive.execution.engine", "tez")
-val client = buildClient(HiveUtils.builtinHiveVersion, hadoopConf)
-assert(!client.getConf("hive.cbo.enable", "true").toBoolean)
-assert(!client.getConf("hive.session.history.enabled", "true").toBoolean)
-assert(client.getConf("hive.execution.engine", "tez") === "mr")
-  }
-
-  private def getNestedMessages(e: Throwable): String = {
-var causes = ""
-var lastException = e
-while (lastException != null) {
-  causes += lastException.toString + "\n"
-  lastException = lastException.getCause
-}
-causes
-  }
-
-  private val emptyDir = Utils.createTempDir().getCanonicalPath
-
-  // Its actually pretty easy to mess things up and have all of your tests 

[GitHub] [spark] Yikf commented on pull request #35308: [SPARK-38011][SQL] Remove duplicated and useless configuration in ParquetFileFormat

2022-01-26 Thread GitBox


Yikf commented on pull request #35308:
URL: https://github.com/apache/spark/pull/35308#issuecomment-1022795343


   @cloud-fan Could you please take a look when you have time? Thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Yikf commented on pull request #35308: [SPARK-38011][SQL] Remove duplicated and useless configuration in ParquetFileFormat

2022-01-26 Thread GitBox


Yikf commented on pull request #35308:
URL: https://github.com/apache/spark/pull/35308#issuecomment-1022795191


   > A similar situation exists in `ParquetScan#createReaderFactory`
   
   Thanks a lot for the reminder; updated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] melin commented on pull request #35337: [SPARK-37840][SQL] Dynamic Update of UDF

2022-01-26 Thread GitBox


melin commented on pull request #35337:
URL: https://github.com/apache/spark/pull/35337#issuecomment-1022791504


   Does this support permanent functions?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on pull request #35325: [WIP][SPARK-37675][SPARK-37793] Prevent overwriting of push shuffle merged files once the shuffle is finalized

2022-01-26 Thread GitBox


mridulm commented on pull request #35325:
URL: https://github.com/apache/spark/pull/35325#issuecomment-1022786290


   @otterc Can you remove the WIP tag, given the PR should be complete now? Thx


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a change in pull request #35325: [WIP][SPARK-37675][SPARK-37793] Prevent overwriting of push shuffle merged files once the shuffle is finalized

2022-01-26 Thread GitBox


mridulm commented on a change in pull request #35325:
URL: https://github.com/apache/spark/pull/35325#discussion_r793202348



##
File path: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##
@@ -534,35 +515,33 @@ public MergeStatuses 
finalizeShuffleMerge(FinalizeShuffleMerge msg) {
 }
 AtomicReference> 
shuffleMergePartitionsRef =
   new AtomicReference<>(null);
-// Metadata of the determinate stage shuffle can be safely removed as part 
of finalizing
-// shuffle merge. Currently once the shuffle is finalized for a 
determinate stages, retry
-// stages of the same shuffle will have shuffle push disabled.
-if (msg.shuffleMergeId == DETERMINATE_SHUFFLE_MERGE_ID) {
-  AppShuffleMergePartitionsInfo appShuffleMergePartitionsInfo =
-appShuffleInfo.shuffles.remove(msg.shuffleId);
-  if (appShuffleMergePartitionsInfo != null) {
-
shuffleMergePartitionsRef.set(appShuffleMergePartitionsInfo.shuffleMergePartitions);
+appShuffleInfo.shuffles.compute(msg.shuffleId, (shuffleId, 
mergePartitionsInfo) -> {
+  if (null == mergePartitionsInfo) {
+//If the mergePartitions was never created then it means that there 
weren't any push
+//blocks that were ever received for this shuffle. This could be the 
case when the driver
+//doesn't wait for enough time to start the stage which reads this 
shuffle data.

Review comment:
   nit: space between `//` and text.

##
File path: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##
@@ -534,35 +515,33 @@ public MergeStatuses 
finalizeShuffleMerge(FinalizeShuffleMerge msg) {
 }
 AtomicReference> 
shuffleMergePartitionsRef =
   new AtomicReference<>(null);
-// Metadata of the determinate stage shuffle can be safely removed as part 
of finalizing
-// shuffle merge. Currently once the shuffle is finalized for a 
determinate stages, retry
-// stages of the same shuffle will have shuffle push disabled.
-if (msg.shuffleMergeId == DETERMINATE_SHUFFLE_MERGE_ID) {
-  AppShuffleMergePartitionsInfo appShuffleMergePartitionsInfo =
-appShuffleInfo.shuffles.remove(msg.shuffleId);
-  if (appShuffleMergePartitionsInfo != null) {
-
shuffleMergePartitionsRef.set(appShuffleMergePartitionsInfo.shuffleMergePartitions);
+appShuffleInfo.shuffles.compute(msg.shuffleId, (shuffleId, 
mergePartitionsInfo) -> {
+  if (null == mergePartitionsInfo) {
+//If the mergePartitions was never created then it means that there 
weren't any push
+//blocks that were ever received for this shuffle. This could be the 
case when the driver
+//doesn't wait for enough time to start the stage which reads this 
shuffle data.
+return new AppShuffleMergePartitionsInfo(msg.shuffleMergeId, true);
+  } else if (msg.shuffleMergeId < mergePartitionsInfo.shuffleMergeId
+  || mergePartitionsInfo.isFinalized()) {
+throw new RuntimeException(
+String.format("Shuffle merge finalize request for shuffle %s with" 
+
+" shuffleMergeId %s is %s", msg.shuffleId, msg.shuffleMergeId,
+
ErrorHandler.BlockPushErrorHandler.STALE_SHUFFLE_FINALIZE_SUFFIX));
+  } else if (msg.shuffleMergeId > mergePartitionsInfo.shuffleMergeId) {
+// If no blocks pushed for the finalizeShuffleMerge shuffleMergeId 
then return
+// empty MergeStatuses but cleanup the older shuffleMergeId files.
+mergedShuffleCleaner.execute(() ->
+
closeAndDeletePartitionFiles(mergePartitionsInfo.shuffleMergePartitions));
+return new AppShuffleMergePartitionsInfo(msg.shuffleMergeId, true);
+  } else {
+// This block covers:
+//  1. finalization of determinate stage
+//  2. finalization of indeterminate stage if the shuffleMergeId 
related to it is the one
+//  for which the message is received.
+
shuffleMergePartitionsRef.set(mergePartitionsInfo.shuffleMergePartitions);
+return new AppShuffleMergePartitionsInfo(msg.shuffleMergeId, true);

Review comment:
   nit: All paths in this lambda, which are not throwing exceptions, are 
doing the same thing - `new AppShuffleMergePartitionsInfo(msg.shuffleMergeId, 
true);`
   Make that common?

##
File path: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##
@@ -576,14 +555,25 @@ public MergeStatuses 
finalizeShuffleMerge(FinalizeShuffleMerge msg) {
   for (AppShufflePartitionInfo partition: shuffleMergePartitions.values()) 
{
 synchronized (partition) {
   try {
+logger.debug("{} attempt {} shuffle {} shuffleMerge {}: finalizing 
shuffle partition "
++ "{} ", msg.appId, msg.appAttemptId, msg.shuffleId,
+msg.shuffleMergeId, partition.reduceId);
  

[GitHub] [spark] itholic edited a comment on pull request #35191: [SPARK-37491][PYTHON]Fix Series.asof for unsorted values

2022-01-26 Thread GitBox


itholic edited a comment on pull request #35191:
URL: https://github.com/apache/spark/pull/35191#issuecomment-1022785523


   @pralabhkumar Sorry for the delay. I've been busy for a couple of days. Will take a closer look soon.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on pull request #35191: [SPARK-37491][PYTHON]Fix Series.asof for unsorted values

2022-01-26 Thread GitBox


itholic commented on pull request #35191:
URL: https://github.com/apache/spark/pull/35191#issuecomment-1022785523


   @pralabhkumar Sorry for the delay. I've been busy for a couple of days. Will take a closer look soon.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dchvn closed pull request #34178: [SPARK-36296][SQL] Refactor seventh set of 20 in QueryExecutionErrors to use error classes

2022-01-26 Thread GitBox


dchvn closed pull request #34178:
URL: https://github.com/apache/spark/pull/34178


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #35289: [SPARK-37397][PYTHON] Inline annotations for pyspark.ml.base

2022-01-26 Thread GitBox


HyukjinKwon commented on pull request #35289:
URL: https://github.com/apache/spark/pull/35289#issuecomment-1022757181


   cc @WeichenXu123 FYI if you find some time to review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zero323 commented on pull request #35289: [SPARK-37397][PYTHON] Inline annotations for pyspark.ml.base

2022-01-26 Thread GitBox


zero323 commented on pull request #35289:
URL: https://github.com/apache/spark/pull/35289#issuecomment-1022755067


   cc @itholic @ueshin @xinrong-databricks FYI.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #35333: [SPARK-38035][SQL] Add docker tests for build-in JDBC dialect

2022-01-26 Thread GitBox


HyukjinKwon commented on pull request #35333:
URL: https://github.com/apache/spark/pull/35333#issuecomment-1022753153


   cc @sarutak fyi


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on pull request #35041: [SPARK-37691][SQL] Support ANSI Aggregation Function: `percentile_disc`

2022-01-26 Thread GitBox


beliefer commented on pull request #35041:
URL: https://github.com/apache/spark/pull/35041#issuecomment-1022751663


   ping @MaxGekk 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on pull request #35333: [SPARK-38035][SQL] Add docker tests for build-in JDBC dialect

2022-01-26 Thread GitBox


beliefer commented on pull request #35333:
URL: https://github.com/apache/spark/pull/35333#issuecomment-1022735880


   ping @huaxingao cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] commented on pull request #34178: [SPARK-36296][SQL] Refactor seventh set of 20 in QueryExecutionErrors to use error classes

2022-01-26 Thread GitBox


github-actions[bot] commented on pull request #34178:
URL: https://github.com/apache/spark/pull/34178#issuecomment-1022720525


   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xkrogen commented on a change in pull request #35332: [SPARK-38030][SQL] Canonicalization of cast should remove nullability of target dataType

2022-01-26 Thread GitBox


xkrogen commented on a change in pull request #35332:
URL: https://github.com/apache/spark/pull/35332#discussion_r793120223



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CanonicalizeSuite.scala
##
@@ -177,4 +177,18 @@ class CanonicalizeSuite extends SparkFunSuite {
 assert(expr.semanticEquals(attr))
 assert(attr.semanticEquals(expr))
   }
+
+  test("SPARK-38030: Canonicalize Cast should remove nullability of target 
dataType") {
+val structType = StructType(Seq(StructField("name", StringType, nullable = 
false)))
+val attr = AttributeReference("col", structType)()
+for (cast <- Seq(
+Cast(attr, structType),
+AnsiCast(attr, structType),
+TryCast(attr, structType))) {
+  assert(cast.resolved)
+  // canonicalization should not converted resolved cast to unresolved
+  assert(cast.canonicalized.resolved)

Review comment:
   Yah I think the current test is probably sufficient.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shardulm94 commented on a change in pull request #35332: [SPARK-38030][SQL] Canonicalization of cast should remove nullability of target dataType

2022-01-26 Thread GitBox


shardulm94 commented on a change in pull request #35332:
URL: https://github.com/apache/spark/pull/35332#discussion_r793102644



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
##
@@ -310,13 +310,16 @@ abstract class CastBase extends UnaryExpression with 
TimeZoneAwareExpression wit
 
   protected def ansiEnabled: Boolean
 
+  protected def withDataType(dataType: DataType): CastBase
+

Review comment:
   The `copy` methods are not generated for abstract classes. 
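   For example, a sketch (not the exact PR code, parameter lists abbreviated) of how each concrete case class can satisfy the hook through its generated `copy`:

   ```scala
   case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String] = None)
     extends CastBase {
     // ...
     override protected def withDataType(dataType: DataType): CastBase =
       copy(dataType = dataType)
   }
   ```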

##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CanonicalizeSuite.scala
##
@@ -177,4 +177,18 @@ class CanonicalizeSuite extends SparkFunSuite {
 assert(expr.semanticEquals(attr))
 assert(attr.semanticEquals(expr))
   }
+
+  test("SPARK-38030: Canonicalize Cast should remove nullability of target 
dataType") {
+val structType = StructType(Seq(StructField("name", StringType, nullable = 
false)))
+val attr = AttributeReference("col", structType)()
+for (cast <- Seq(
+Cast(attr, structType),
+AnsiCast(attr, structType),
+TryCast(attr, structType))) {
+  assert(cast.resolved)
+  // canonicalization should not converted resolved cast to unresolved
+  assert(cast.canonicalized.resolved)

Review comment:
   The original issue was detected because of the following plan `UnionExec -> ProjectExec -> Alias -> Cast -> AttributeReference` and a call to `.output` on `UnionExec`, which in turn calls `ProjectExec.output.nullable`. I can recreate this chain here, but I'm not sure if that's useful. The root issue was that a resolved node was being converted to unresolved, which is what we are testing here.
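   A minimal sketch of that root issue at the expression level, using the same shapes as the new test (the Union/Project chain above is just the production path that surfaced it):

   ```scala
   val structType = StructType(Seq(StructField("name", StringType, nullable = false)))
   val attr = AttributeReference("col", structType)()
   val cast = Cast(attr, structType)
   assert(cast.resolved)
   // Before the fix, canonicalization kept the non-nullable target type while the
   // canonicalized child became nullable, so cast.canonicalized was no longer
   // resolved, and callers reaching it through plan.output could fail.
   ```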

##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CanonicalizeSuite.scala
##
@@ -177,4 +177,18 @@ class CanonicalizeSuite extends SparkFunSuite {
 assert(expr.semanticEquals(attr))
 assert(attr.semanticEquals(expr))
   }
+
+  test("SPARK-38030: Canonicalize Cast should remove nullability of target 
dataType") {
+val structType = StructType(Seq(StructField("name", StringType, nullable = 
false)))
+val attr = AttributeReference("col", structType)()
+for (cast <- Seq(
+Cast(attr, structType),
+AnsiCast(attr, structType),
+TryCast(attr, structType))) {

Review comment:
   The existing approach is simpler to read IMO, but open to changing it if 
more folks think the second approach is better.

##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CanonicalizeSuite.scala
##
@@ -177,4 +177,18 @@ class CanonicalizeSuite extends SparkFunSuite {
 assert(expr.semanticEquals(attr))
 assert(attr.semanticEquals(expr))
   }
+
+  test("SPARK-38030: Canonicalize Cast should remove nullability of target 
dataType") {
+val structType = StructType(Seq(StructField("name", StringType, nullable = 
false)))
+val attr = AttributeReference("col", structType)()
+for (cast <- Seq(
+Cast(attr, structType),
+AnsiCast(attr, structType),
+TryCast(attr, structType))) {
+  assert(cast.resolved)
+  // canonicalization should not converted resolved cast to unresolved
+  assert(cast.canonicalized.resolved)
+  assert(cast.canonicalized.dataType == structType.asNullable)

Review comment:
   Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #35338: [SPARK-37934][BUILD][3.2] Upgrade Jetty version to 9.4.44

2022-01-26 Thread GitBox


dongjoon-hyun commented on pull request #35338:
URL: https://github.com/apache/spark/pull/35338#issuecomment-1022659030


   @JackBuggins. You can reopen this after discussing the requirement on this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun removed a comment on pull request #35338: [SPARK-37934][BUILD][3.2] Upgrade Jetty version to 9.4.44

2022-01-26 Thread GitBox


dongjoon-hyun removed a comment on pull request #35338:
URL: https://github.com/apache/spark/pull/35338#issuecomment-1022657954


   BTW, @JackBuggins . Could you enable GitHub Action in your forked Spark 
repository?
   - https://github.com/apache/spark/pull/35338/checks?check_run_id=4957367779


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun closed pull request #35338: [SPARK-37934][BUILD][3.2] Upgrade Jetty version to 9.4.44

2022-01-26 Thread GitBox


dongjoon-hyun closed pull request #35338:
URL: https://github.com/apache/spark/pull/35338


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #35338: [SPARK-37934][BUILD][3.2] Upgrade Jetty version to 9.4.44

2022-01-26 Thread GitBox


dongjoon-hyun commented on pull request #35338:
URL: https://github.com/apache/spark/pull/35338#issuecomment-1022657954


   BTW, @JackBuggins . Could you enable GitHub Action in your forked Spark 
repository?
   - https://github.com/apache/spark/pull/35338/checks?check_run_id=4957367779


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #35230: [SPARK-37934] [Build] Upgrade Jetty version to 9.4.44

2022-01-26 Thread GitBox


dongjoon-hyun commented on pull request #35230:
URL: https://github.com/apache/spark/pull/35230#issuecomment-1022656387


   In general, a dependency change on a release branch is -1 by default unless there is a significant need, @JackBuggins.
   
   > None to my knowledge @HyukjinKwon, but wrt 
[eclipse/jetty.project#6973](https://github.com/eclipse/jetty.project/issues/6973),
 my thinking is it would offer peace of mind
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on pull request #35273: [SPARK-37983][SQL] Back out agg build time metrics from sort aggregate

2022-01-26 Thread GitBox


c21 commented on pull request #35273:
URL: https://github.com/apache/spark/pull/35273#issuecomment-1022616989


   @cloud-fan - the PR is ready for review again, thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] venkata91 commented on a change in pull request #34122: [SPARK-34826][SHUFFLE] Adaptively fetch shuffle mergers for push based shuffle

2022-01-26 Thread GitBox


venkata91 commented on a change in pull request #34122:
URL: https://github.com/apache/spark/pull/34122#discussion_r793016468



##
File path: core/src/main/scala/org/apache/spark/Dependency.scala
##
@@ -114,15 +114,6 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: 
ClassTag](
 
   def shuffleMergeAllowed : Boolean = _shuffleMergeAllowed
 
-  // By default, shuffle merge is enabled for ShuffleDependency if 
shuffleMergeAllowed
-  private[this] var _shuffleMergeEnabled = shuffleMergeAllowed
-
-  private[spark] def setShuffleMergeEnabled(shuffleMergeEnabled: Boolean): 
Unit = {
-_shuffleMergeEnabled = shuffleMergeEnabled
-  }
-
-  def shuffleMergeEnabled : Boolean = _shuffleMergeEnabled

Review comment:
   In that case, should we also make `shuffleMergeAllowed` `private[spark]`? For now it is better to make everything related to push-based shuffle `private[spark]`, right? Thoughts?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] JackBuggins opened a new pull request #35338: [SPARK-37934][BUILD] Upgrade Jetty version to 9.4.44 (3.2 Stream)

2022-01-26 Thread GitBox


JackBuggins opened a new pull request #35338:
URL: https://github.com/apache/spark/pull/35338


   
   
   ### What changes were proposed in this pull request?
   
   
   Mirroring https://github.com/apache/spark/pull/35230 but applying this to 
the 3.2 stream; the only difference is that  `./dev/test-dependencies.sh 
--replace-manifest` does not appear to modify the test dep manifests in the 
same way as against master. Core was the only jar in the latest scala 
2.12/hadoop 2.7 bundle I could find containing this.
   
   
   ### Why are the changes needed?
   
   
   April is a little while away, and since 3.2.1 has completed, now seems like as sensible a time as any to provide this minor dependency update as a precaution on the 3.2 stream, mitigating the original issue reported in https://github.com/eclipse/jetty.project/issues/6973
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No
   
   ### How was this patch tested?
   
   
   Executing test suites for core.
   
   
   ```
   $ build/sbt
   > project core
   > test
   ```
   
   On first execution I received the following results 
   
   ```
   [info] Tests: succeeded 2890, failed 1, canceled 4, ignored 8, pending 0
   [info] *** 1 TEST FAILED ***
   [error] Failed: Total 3152, Failed 1, Errors 0, Passed 3151, Ignored 8, 
Canceled 4
   [error] Failed tests:
   [error] org.apache.spark.executor.ExecutorSuite
   ```
   
   After checking out a clean branch from 3.2, I discovered that this test intermittently fails for reasons that seem unrelated to this change; I have observed the test passing against both this changeset and the base branch. Is this a known issue?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] venkata91 commented on a change in pull request #34122: [SPARK-34826][SHUFFLE] Adaptively fetch shuffle mergers for push based shuffle

2022-01-26 Thread GitBox


venkata91 commented on a change in pull request #34122:
URL: https://github.com/apache/spark/pull/34122#discussion_r793016468



##
File path: core/src/main/scala/org/apache/spark/Dependency.scala
##
@@ -114,15 +114,6 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: 
ClassTag](
 
   def shuffleMergeAllowed : Boolean = _shuffleMergeAllowed
 
-  // By default, shuffle merge is enabled for ShuffleDependency if 
shuffleMergeAllowed
-  private[this] var _shuffleMergeEnabled = shuffleMergeAllowed
-
-  private[spark] def setShuffleMergeEnabled(shuffleMergeEnabled: Boolean): 
Unit = {
-_shuffleMergeEnabled = shuffleMergeEnabled
-  }
-
-  def shuffleMergeEnabled : Boolean = _shuffleMergeEnabled

Review comment:
   In that case, should we also make `shuffleMergeAllowed` `private[spark]`? For now it is better to make everything related to push-based shuffle `private`, right? Thoughts?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] venkata91 commented on a change in pull request #34122: [SPARK-34826][SHUFFLE] Adaptively fetch shuffle mergers for push based shuffle

2022-01-26 Thread GitBox


venkata91 commented on a change in pull request #34122:
URL: https://github.com/apache/spark/pull/34122#discussion_r793006586



##
File path: core/src/main/scala/org/apache/spark/Dependency.scala
##
@@ -114,15 +114,6 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: 
ClassTag](
 
   def shuffleMergeAllowed : Boolean = _shuffleMergeAllowed
 
-  // By default, shuffle merge is enabled for ShuffleDependency if 
shuffleMergeAllowed
-  private[this] var _shuffleMergeEnabled = shuffleMergeAllowed
-
-  private[spark] def setShuffleMergeEnabled(shuffleMergeEnabled: Boolean): 
Unit = {
-_shuffleMergeEnabled = shuffleMergeEnabled
-  }
-
-  def shuffleMergeEnabled : Boolean = _shuffleMergeEnabled

Review comment:
   Yeah, that's true. We are using `mergerLocs.isEmpty` in other places as well. If we use `shuffleMergeEnabled` (`mergerLocs.nonEmpty`) and `mergerLocs.isEmpty|nonEmpty` in a few other places, wouldn't it be confusing? Do you suggest replacing all the usages of `mergerLocs.isEmpty|nonEmpty` with `shuffleMergeEnabled | !shuffleMergeEnabled` as appropriate?
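   For reference, a tiny sketch of the accessor being discussed (simplified shape, names taken from the diff above):

   ```scala
   import org.apache.spark.storage.BlockManagerId

   // Sketch only: a single predicate over the merger locations instead of
   // repeating mergerLocs.nonEmpty / isEmpty checks at every call site.
   class ShuffleDependencySketch {
     private[this] var mergerLocs: Seq[BlockManagerId] = Nil

     def setMergerLocs(locs: Seq[BlockManagerId]): Unit = { mergerLocs = locs }

     def shuffleMergeEnabled: Boolean = mergerLocs.nonEmpty
   }
   ```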




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] otterc commented on a change in pull request #35325: [WIP][SPARK-37675][SPARK-37793] Prevent overwriting of push shuffle merged files once the shuffle is finalized

2022-01-26 Thread GitBox


otterc commented on a change in pull request #35325:
URL: https://github.com/apache/spark/pull/35325#discussion_r792983163



##
File path: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##
@@ -1008,19 +995,28 @@ AppShufflePartitionInfo getPartitionInfo() {
* required for the shuffles of indeterminate stages.
*/
   public static class AppShuffleMergePartitionsInfo {
+// ConcurrentHashMap doesn't allow null for keys or values which is why 
this is required.
+// Marker to identify finalized indeterminate shuffle partitions in the 
case of indeterminate
+// stage retries.
+@VisibleForTesting
+public static final Map 
SHUFFLE_FINALIZED_MARKER =
+Collections.emptyMap();
 private final int shuffleMergeId;
-private final Map shuffleMergePartitions;
+private final Map mergePartitions;

Review comment:
   reverted the rename




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] otterc commented on a change in pull request #35325: [WIP][SPARK-37675][SPARK-37793] Prevent overwriting of push shuffle merged files once the shuffle is finalized

2022-01-26 Thread GitBox


otterc commented on a change in pull request #35325:
URL: https://github.com/apache/spark/pull/35325#discussion_r792982698



##
File path: 
common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/RemoteBlockPushResolverSuite.java
##
@@ -1287,6 +1287,52 @@ void closeAndDeletePartitionFiles(Map partitio
   + " up", appShuffleInfo.getMergedShuffleDataFile(0, 4, 0).exists());
   }
 
+  // Test for SPARK-37675 and SPARK-37793
+  @Test
+  public void testAllBlocksAreRejectedWhenReceivedAfterFinalization() throws 
IOException {

Review comment:
   Added the test.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xkrogen commented on a change in pull request #35332: [SPARK-38030][SQL] Canonicalization of cast should remove nullability of target dataType

2022-01-26 Thread GitBox


xkrogen commented on a change in pull request #35332:
URL: https://github.com/apache/spark/pull/35332#discussion_r792961446



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CanonicalizeSuite.scala
##
@@ -177,4 +177,18 @@ class CanonicalizeSuite extends SparkFunSuite {
 assert(expr.semanticEquals(attr))
 assert(attr.semanticEquals(expr))
   }
+
+  test("SPARK-38030: Canonicalize Cast should remove nullability of target 
dataType") {
+val structType = StructType(Seq(StructField("name", StringType, nullable = 
false)))
+val attr = AttributeReference("col", structType)()
+for (cast <- Seq(
+Cast(attr, structType),
+AnsiCast(attr, structType),
+TryCast(attr, structType))) {
+  assert(cast.resolved)
+  // canonicalization should not converted resolved cast to unresolved
+  assert(cast.canonicalized.resolved)

Review comment:
   Should we be asserting on `cast.canonicalized.output` as well? This is how the original issue was detected, right?
##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CanonicalizeSuite.scala
##
@@ -177,4 +177,18 @@ class CanonicalizeSuite extends SparkFunSuite {
 assert(expr.semanticEquals(attr))
 assert(attr.semanticEquals(expr))
   }
+
+  test("SPARK-38030: Canonicalize Cast should remove nullability of target 
dataType") {
+val structType = StructType(Seq(StructField("name", StringType, nullable = 
false)))
+val attr = AttributeReference("col", structType)()
+for (cast <- Seq(
+Cast(attr, structType),
+AnsiCast(attr, structType),
+TryCast(attr, structType))) {
+  assert(cast.resolved)
+  // canonicalization should not converted resolved cast to unresolved
+  assert(cast.canonicalized.resolved)
+  assert(cast.canonicalized.dataType == structType.asNullable)

Review comment:
   Use triple-equals here

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
##
@@ -310,13 +310,16 @@ abstract class CastBase extends UnaryExpression with 
TimeZoneAwareExpression wit
 
   protected def ansiEnabled: Boolean
 
+  protected def withDataType(dataType: DataType): CastBase
+

Review comment:
   Looks like the implementation is the same across all subclasses; why can't we implement it here?

##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CanonicalizeSuite.scala
##
@@ -177,4 +177,18 @@ class CanonicalizeSuite extends SparkFunSuite {
 assert(expr.semanticEquals(attr))
 assert(attr.semanticEquals(expr))
   }
+
+  test("SPARK-38030: Canonicalize Cast should remove nullability of target 
dataType") {
+val structType = StructType(Seq(StructField("name", StringType, nullable = 
false)))
+val attr = AttributeReference("col", structType)()
+for (cast <- Seq(
+Cast(attr, structType),
+AnsiCast(attr, structType),
+TryCast(attr, structType))) {

Review comment:
   ```suggestion
   Seq(Cast.apply _, AnsiCast.apply _, TryCast.apply _)
   .map(_.apply(attr, structType, None))
   .foreach { cast =>
   ```
   Not sure if this is better, but it does deduplicate the args. A bit messy 
either way.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] iRakson opened a new pull request #35337: [SPARK-37840][SQL] Dynamic Update of UDF

2022-01-26 Thread GitBox


iRakson opened a new pull request #35337:
URL: https://github.com/apache/spark/pull/35337


   
   
   ### What changes were proposed in this pull request?
   Currently, if we need to change the definitions of our UDFs, we need to restart the Thrift server. In a production environment this is not always possible.
   In this PR, I propose updating UDFs dynamically.
   To achieve this, the user needs to run a CREATE OR REPLACE command when updating UDFs.
   For e.g:
   1. Create a new function
   `CREATE TEMPORARY FUNCTION func AS c1 USING JAR jar1;`
   
   2. Once associated jar files are updated
   `CREATE OR REPLACE TEMPORARY FUNCTION func AS c1 USING JAR jar1;`
   
   The above query will update the classpath, and users will be able to use the updated jars.
   A new configuration is added in this PR: spark.sql.function.updateUdfResources
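   A hypothetical end-to-end flow assuming the new flag named above and an active SparkSession `spark` (class, jar, and table names are placeholders):

   ```scala
   // Sketch only: the flag and the CREATE OR REPLACE path come from this PR's description.
   spark.sql("SET spark.sql.function.updateUdfResources=true")
   spark.sql("CREATE TEMPORARY FUNCTION func AS 'com.example.MyUdf' USING JAR '/tmp/udfs.jar'")
   spark.sql("SELECT func(col) FROM tbl").show()

   // ... rebuild /tmp/udfs.jar with the new UDF definition ...

   spark.sql("CREATE OR REPLACE TEMPORARY FUNCTION func AS 'com.example.MyUdf' USING JAR '/tmp/udfs.jar'")
   spark.sql("SELECT func(col) FROM tbl").show()  // now uses the updated class from the refreshed jar
   ```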
   
   
   
   ### Why are the changes needed?
   We need to restart the Thrift server whenever we want to update a UDF definition. This is not always possible in a production environment. These changes allow users to update UDFs dynamically.
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes.
   Users can now update their UDFs without restarting the Thrift server, using the CREATE OR REPLACE FUNCTION command, if dynamic update is turned on.
   
   
   
   ### How was this patch tested?
   Unit tests added.
   More tests will be added to cover other scenarios.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] parthchandra edited a comment on pull request #35262: [SPARK-37974][SQL] Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support

2022-01-26 Thread GitBox


parthchandra edited a comment on pull request #35262:
URL: https://github.com/apache/spark/pull/35262#issuecomment-1022462107


   > > Updated the JDK 8 benchmark results as well.
   > 
   > After comparing the new bench data, I find that the data corresponding to 
`Parquet Data Page V2` in the two test cases `String with Nulls Scan (50.0%) ` 
and `String with Nulls Scan (95.0%)` is relatively slower than the previous pr 
(although the CPU frequency of the testing machine is reduced):
   > 
   > | Test case | before | after |
   > | --- | --- | --- |
   > | String with Nulls Scan (50.0%) | 145.7 ns/per row | 228.7 ns/per row |
   > | String with Nulls Scan (95.0%) | 25.2 ns/per row | 77.9 ns/per row |
   
   It's hard to compare the numbers across runs reasonably (even though the 
difference is substantial) because of the difference in the environment. 
   Incidentally, with nulls the decoder doesn't even get called, so such a 
precipitous drop is somewhat suspicious. _And_ it appears that the vectorized 
decoder is being called one record at a time (this may not be a problem, because 
the decoding has mostly been done already, just not yet written into the output vector).
   I made a change to determine runs of null/non-null values and increase the 
number of values written out to the output vector in each call, but saw 
no significant change (running the benchmark on a laptop).
   See: 
https://github.com/apache/spark/blob/6e64e9252a821651a8984babfac79a9ea433/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java#L237
   
   Let me do a profile run to see if any obvious bottlenecks stand out. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] parthchandra commented on pull request #35262: [SPARK-37974][SQL] Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support

2022-01-26 Thread GitBox


parthchandra commented on pull request #35262:
URL: https://github.com/apache/spark/pull/35262#issuecomment-1022462107


   > > Updated the JDK 8 benchmark results as well.
   > 
   > After comparing the new bench data, I find that the data corresponding to 
`Parquet Data Page V2` in the two test cases `String with Nulls Scan (50.0%) ` 
and `String with Nulls Scan (95.0%)` is relatively slower than the previous pr 
(although the CPU frequency of the testing machine is reduced):
   > 
   > | Test case | before | after |
   > | --- | --- | --- |
   > | String with Nulls Scan (50.0%) | 145.7 ns/per row | 228.7 ns/per row |
   > | String with Nulls Scan (95.0%) | 25.2 ns/per row | 77.9 ns/per row |
   
   It's hard to reasonably compare the numbers across runs (even though the 
difference is substantial) because of the difference in the environment. Let me 
do a profile run to see if any obvious bottlenecks stand out. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] senthh opened a new pull request #35336: [SPARK-37936][SQL] Use error classes in the parsing errors of intervals

2022-01-26 Thread GitBox


senthh opened a new pull request #35336:
URL: https://github.com/apache/spark/pull/35336


   
   In the PR, we propose to throw ParseException from the methods below, with the 
following error classes:
   
   - MORE_THAN_ONE_FROM_TO_UNIT_IN_INTERVAL_LITERAL - 
moreThanOneFromToUnitInIntervalLiteralError
   - INVALID_INTERVAL_LITERAL - invalidIntervalLiteralError
   - INVALID_INTERVAL_FORM - invalidIntervalFormError
   - INVALID_FROM_TO_UNIT_VALUE - invalidFromToUnitValueError
   - UNSUPPORTED_FROM_TO_INTERVAL - fromToIntervalUnsupportedError
   - MIXED_INTERVAL_UNITS - mixedIntervalUnitsError
   
   New error classes are added to error-classes.json.
   
   A new test suite, QueryParsingErrorsSuite, has been created for checking these 
errors.
   ### What changes were proposed in this pull request?
   
   Please refer to SPARK-37935 (Migrate onto error classes).
   
   ### Why are the changes needed?
   
   These changes are part of migrating Spark's parsing errors onto error classes 
(see SPARK-37935).
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes
   
   ### How was this patch tested?
   
   By running the new test suite:
   build/sbt "test:testOnly *QueryParsingErrorsSuite"
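   
   A rough sketch of the style of check such a suite typically contains (inside a 
   suite with a SharedSparkSession-style `spark`); the SQL text, the thrown error 
   class, and the assertion below are illustrative assumptions, not taken from this PR:
   
   ```scala
   // Hypothetical example only: pairing this query with UNSUPPORTED_FROM_TO_INTERVAL
   // is an assumption for illustration.
   import org.apache.spark.sql.catalyst.parser.ParseException
   
   test("UNSUPPORTED_FROM_TO_INTERVAL: unsupported from-to interval unit pair") {
     val e = intercept[ParseException] {
       spark.sql("SELECT INTERVAL '1-2' MONTH TO YEAR")
     }
     assert(e.getErrorClass === "UNSUPPORTED_FROM_TO_INTERVAL")
   }
   ```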


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on pull request #35324: [SPARK-37896][SQL][FOLLOWUP] Fix NPE in ConstantColumnVector.close()

2022-01-26 Thread GitBox


c21 commented on pull request #35324:
URL: https://github.com/apache/spark/pull/35324#issuecomment-1022421010


   Thank you @cloud-fan and @Yaohua628 for review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on a change in pull request #35302: [SPARK-38001][SQL] Replace the error classes related to unsupported features by `UNSUPPORTED_FEATURE`

2022-01-26 Thread GitBox


MaxGekk commented on a change in pull request #35302:
URL: https://github.com/apache/spark/pull/35302#discussion_r792819398



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
##
@@ -93,8 +93,8 @@ object QueryCompilationErrors {
 
   def unsupportedIfNotExistsError(tableName: String): Throwable = {

Review comment:
   I added the base test suites `QueryCompilationErrorsSuiteBase` + 
`SessionCatalogTestBase` and new test suites for compilation errors:
   
   - QueryCompilationErrorsDSv2Suite
   - QueryCompilationErrorsDSv2SessionCatalogSuite
   - QueryCompilationErrorsV1WriteFallbackSuite




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a change in pull request #34122: [SPARK-34826][SHUFFLE] Adaptively fetch shuffle mergers for push based shuffle

2022-01-26 Thread GitBox


mridulm commented on a change in pull request #34122:
URL: https://github.com/apache/spark/pull/34122#discussion_r792766502



##
File path: core/src/main/scala/org/apache/spark/Dependency.scala
##
@@ -114,15 +114,6 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: 
ClassTag](
 
   def shuffleMergeAllowed : Boolean = _shuffleMergeAllowed
 
-  // By default, shuffle merge is enabled for ShuffleDependency if 
shuffleMergeAllowed
-  private[this] var _shuffleMergeEnabled = shuffleMergeAllowed
-
-  private[spark] def setShuffleMergeEnabled(shuffleMergeEnabled: Boolean): 
Unit = {
-_shuffleMergeEnabled = shuffleMergeEnabled
-  }
-
-  def shuffleMergeEnabled : Boolean = _shuffleMergeEnabled

Review comment:
   Let us keep this method - and make it
   `def shuffleMergeEnabled = mergerLocs.nonEmpty`
   What I meant was, let us remove the boolean `_shuffleMergeEnabled`,
   
   so that the rest of the code can depend on this as state and does not need to 
inspect mergerLocs directly.
   
   
   Also, this is part of the public API - so we can't remove it :-)
   Speaking of which, can you make `getFinalizeTask` `private[spark]` as well? 
(`ShuffleDependency` is a developer API and we can't leak impl details)
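   
   A rough Scala sketch of the shape being suggested here (illustrative only; 
   `MergeState` is a hypothetical stand-in, not the actual `ShuffleDependency` code):
   
   ```scala
   import org.apache.spark.storage.BlockManagerId
   
   // Illustrative sketch only: "merge enabled" is derived from the registered merger
   // locations instead of being tracked in a separate boolean flag.
   class MergeState(val shuffleMergeAllowed: Boolean) {
     private var mergerLocs: Seq[BlockManagerId] = Nil
   
     def setMergerLocs(locs: Seq[BlockManagerId]): Unit = {
       assert(shuffleMergeAllowed)
       mergerLocs = locs
     }
   
     // No _shuffleMergeEnabled flag to keep in sync: merge is enabled iff mergers exist.
     def shuffleMergeEnabled: Boolean = mergerLocs.nonEmpty
   }
   ```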
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a change in pull request #34122: [SPARK-34826][SHUFFLE] Adaptively fetch shuffle mergers for push based shuffle

2022-01-26 Thread GitBox


mridulm commented on a change in pull request #34122:
URL: https://github.com/apache/spark/pull/34122#discussion_r792766502



##
File path: core/src/main/scala/org/apache/spark/Dependency.scala
##
@@ -114,15 +114,6 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: 
ClassTag](
 
   def shuffleMergeAllowed : Boolean = _shuffleMergeAllowed
 
-  // By default, shuffle merge is enabled for ShuffleDependency if 
shuffleMergeAllowed
-  private[this] var _shuffleMergeEnabled = shuffleMergeAllowed
-
-  private[spark] def setShuffleMergeEnabled(shuffleMergeEnabled: Boolean): 
Unit = {
-_shuffleMergeEnabled = shuffleMergeEnabled
-  }
-
-  def shuffleMergeEnabled : Boolean = _shuffleMergeEnabled

Review comment:
   Let us keep this method - and make it
   `def shuffleMergeEnabled = mergerLocs.nonEmpty`
   What I meant was, let us remove the boolean `_shuffleMergeEnabled`,
   
   so that the rest of the code can depend on this as state and does not need to 
inspect mergerLocs directly.
   
   
   Also, this is part of the public API - so we can't remove it :-)
   Speaking of which, can you make `getFinalizeTask` `private[spark]` as well? 
(this or a different PR - `ShuffleDependency` is a developer API and we can't 
leak impl details)
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a change in pull request #34122: [SPARK-34826][SHUFFLE] Adaptively fetch shuffle mergers for push based shuffle

2022-01-26 Thread GitBox


mridulm commented on a change in pull request #34122:
URL: https://github.com/apache/spark/pull/34122#discussion_r792766502



##
File path: core/src/main/scala/org/apache/spark/Dependency.scala
##
@@ -114,15 +114,6 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: 
ClassTag](
 
   def shuffleMergeAllowed : Boolean = _shuffleMergeAllowed
 
-  // By default, shuffle merge is enabled for ShuffleDependency if 
shuffleMergeAllowed
-  private[this] var _shuffleMergeEnabled = shuffleMergeAllowed
-
-  private[spark] def setShuffleMergeEnabled(shuffleMergeEnabled: Boolean): 
Unit = {
-_shuffleMergeEnabled = shuffleMergeEnabled
-  }
-
-  def shuffleMergeEnabled : Boolean = _shuffleMergeEnabled

Review comment:
   Let us keep this method - and make it
   `def shuffleMergeEnabled = mergerLocs.nonEmpty`
   What I meant was, let us remove the boolean `_shuffleMergeEnabled`,
   
   so that the rest of the code can depend on this as state and does not need to 
inspect mergerLocs directly.
   
   
   Also, this is part of the public API - so we can't remove it :-)
   Speaking of which, can you make `getFinalizeTask` `private[spark]` as well? 
(this or a different PR - `ShuffleDependency` is a developer API and we can't leak 
impl details)
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a change in pull request #34122: [SPARK-34826][SHUFFLE] Adaptively fetch shuffle mergers for push based shuffle

2022-01-26 Thread GitBox


mridulm commented on a change in pull request #34122:
URL: https://github.com/apache/spark/pull/34122#discussion_r792766502



##
File path: core/src/main/scala/org/apache/spark/Dependency.scala
##
@@ -114,15 +114,6 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: 
ClassTag](
 
   def shuffleMergeAllowed : Boolean = _shuffleMergeAllowed
 
-  // By default, shuffle merge is enabled for ShuffleDependency if 
shuffleMergeAllowed
-  private[this] var _shuffleMergeEnabled = shuffleMergeAllowed
-
-  private[spark] def setShuffleMergeEnabled(shuffleMergeEnabled: Boolean): 
Unit = {
-_shuffleMergeEnabled = shuffleMergeEnabled
-  }
-
-  def shuffleMergeEnabled : Boolean = _shuffleMergeEnabled

Review comment:
   Let us keep this method - and make it
   `def shuffleMergeEnabled = mergerLocs.nonEmpty`
   What I meant was, let us remove the boolean `_shuffleMergeEnabled`,
   
   so that the rest of the code can depend on this as state and does not need to 
inspect mergerLocs directly.
   
   
   Also, this is part of the public API - so we can't remove it :-)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on pull request #34122: [SPARK-34826][SHUFFLE] Adaptively fetch shuffle mergers for push based shuffle

2022-01-26 Thread GitBox


mridulm commented on pull request #34122:
URL: https://github.com/apache/spark/pull/34122#issuecomment-1022330045


   Can you take a look at the build failures, @venkata91? Not sure if it 
needs to be updated to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a change in pull request #34122: [SPARK-34826][SHUFFLE] Adaptively fetch shuffle mergers for push based shuffle

2022-01-26 Thread GitBox


mridulm commented on a change in pull request #34122:
URL: https://github.com/apache/spark/pull/34122#discussion_r792766502



##
File path: core/src/main/scala/org/apache/spark/Dependency.scala
##
@@ -114,15 +114,6 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: 
ClassTag](
 
   def shuffleMergeAllowed : Boolean = _shuffleMergeAllowed
 
-  // By default, shuffle merge is enabled for ShuffleDependency if 
shuffleMergeAllowed
-  private[this] var _shuffleMergeEnabled = shuffleMergeAllowed
-
-  private[spark] def setShuffleMergeEnabled(shuffleMergeEnabled: Boolean): 
Unit = {
-_shuffleMergeEnabled = shuffleMergeEnabled
-  }
-
-  def shuffleMergeEnabled : Boolean = _shuffleMergeEnabled

Review comment:
   Let us keep this method - and make it
   `def shuffleMergeEnabled = mergerLocs.nonEmpty`
   What I meant was, let us remove the boolean `_shuffleMergeEnabled`
   
   So that rest of the code can depend on this as a state and not need to 
inspect the mergerLocs directly

##
File path: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##
@@ -3579,7 +3579,7 @@ class DAGSchedulerSuite extends SparkFunSuite with 
TempLocalSparkContext with Ti
 submit(reduceRdd, (0 until parts).toArray)
 completeShuffleMapStageSuccessfully(0, 0, parts)

Review comment:
   Once we reintroduce the `shuffleMergeEnabled` method, all changes to 
this file (DAGSchedulerSuite) related to mergerLocs.isEmpty == 
!shuffleMergeEnabled (and reverse) can be reverted (for the latest 
[diff](https://github.com/apache/spark/pull/34122/commits/3b52bed7ea4bbdb127a4711201d8c8268ab06650#))
   

##
File path: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##
@@ -4172,7 +4172,7 @@ class DAGSchedulerSuite extends SparkFunSuite with 
TempLocalSparkContext with Ti
 
 // host2 executor added event to trigger registering of shuffle merger 
locations
 // as shuffle mergers are tracked separately for test
-runEvent(ExecutorAdded("host2", "host2"))
+runEvent(ExecutorAdded("host2", "exec2"))

Review comment:
   Order is reverse - `ExecutorAdded(execId: String, host: String)`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a change in pull request #34122: [SPARK-34826][SHUFFLE] Adaptively fetch shuffle mergers for push based shuffle

2022-01-26 Thread GitBox


mridulm commented on a change in pull request #34122:
URL: https://github.com/apache/spark/pull/34122#discussion_r792761973



##
File path: core/src/main/scala/org/apache/spark/Dependency.scala
##
@@ -135,6 +144,7 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: 
ClassTag](
   def shuffleMergeId: Int = _shuffleMergeId
 
   def setMergerLocs(mergerLocs: Seq[BlockManagerId]): Unit = {
+assert(shuffleMergeEnabled || shuffleMergeAllowed)

Review comment:
   Yes, when we want it to be disabled (like with a retry of a DETERMINATE 
stage which was already finalized), we must set merge allowed to false. It is 
cleaner.
   Note - `def isShuffleMergeEnabled` or `def shuffleMergeEnabled` pointing to 
`mergerLocs.nonEmpty` is fine (and preferred to using mergerLocs directly) - we 
just don't need the state boolean.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a change in pull request #34122: [SPARK-34826][SHUFFLE] Adaptively fetch shuffle mergers for push based shuffle

2022-01-26 Thread GitBox


mridulm commented on a change in pull request #34122:
URL: https://github.com/apache/spark/pull/34122#discussion_r792761973



##
File path: core/src/main/scala/org/apache/spark/Dependency.scala
##
@@ -135,6 +144,7 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: 
ClassTag](
   def shuffleMergeId: Int = _shuffleMergeId
 
   def setMergerLocs(mergerLocs: Seq[BlockManagerId]): Unit = {
+assert(shuffleMergeEnabled || shuffleMergeAllowed)

Review comment:
   Yes, when we want it to be disabled (like with a retry of a DETERMINATE 
stage which was already finalized), we must set merge allowed to false. It is 
cleaner.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm edited a comment on pull request #35325: [WIP][SPARK-37675][SPARK-37793] Prevent overwriting of push shuffle merged files once the shuffle is finalized

2022-01-26 Thread GitBox


mridulm edited a comment on pull request #35325:
URL: https://github.com/apache/spark/pull/35325#issuecomment-1022312145


   @pan3793 Functionally, @otterc's patch looks good (except for testing 
enhancements, renames, etc.).
   Can you also confirm whether this works for you, given that your env has been 
great at reproducing this issue consistently? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on pull request #35325: [WIP][SPARK-37675][SPARK-37793] Prevent overwriting of push shuffle merged files once the shuffle is finalized

2022-01-26 Thread GitBox


mridulm commented on pull request #35325:
URL: https://github.com/apache/spark/pull/35325#issuecomment-1022312145


   @pan3793 Functionally, @otterc's patch looks good (except for testing 
enhancements, renames, etc.).
   Can you also confirm whether this works for you? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a change in pull request #35325: [WIP][SPARK-37675][SPARK-37793] Prevent overwriting of push shuffle merged files once the shuffle is finalized

2022-01-26 Thread GitBox


mridulm commented on a change in pull request #35325:
URL: https://github.com/apache/spark/pull/35325#discussion_r792754171



##
File path: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##
@@ -1008,19 +995,28 @@ AppShufflePartitionInfo getPartitionInfo() {
* required for the shuffles of indeterminate stages.
*/
   public static class AppShuffleMergePartitionsInfo {
+// ConcurrentHashMap doesn't allow null for keys or values which is why 
this is required.
+// Marker to identify finalized indeterminate shuffle partitions in the 
case of indeterminate
+// stage retries.
+@VisibleForTesting
+public static final Map 
SHUFFLE_FINALIZED_MARKER =
+Collections.emptyMap();
 private final int shuffleMergeId;
-private final Map shuffleMergePartitions;
+private final Map mergePartitions;

Review comment:
   nit: avoid rename here ?

##
File path: 
common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/RemoteBlockPushResolverSuite.java
##
@@ -1287,6 +1287,52 @@ void closeAndDeletePartitionFiles(Map partitio
   + " up", appShuffleInfo.getMergedShuffleDataFile(0, 4, 0).exists());
   }
 
+  // Test for SPARK-37675 and SPARK-37793
+  @Test
+  public void testAllBlocksAreRejectedWhenReceivedAfterFinalization() throws 
IOException {

Review comment:
   Also add a test for the cardinality change ?
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] imback82 commented on pull request #35328: [SPARK-37937][SQL] Use error classes in the parsing errors of lateral join

2022-01-26 Thread GitBox


imback82 commented on pull request #35328:
URL: https://github.com/apache/spark/pull/35328#issuecomment-1022307238


   > @imback82 Could you fix the JIRA id in PR's title. Seems SPARK-37858 is 
not correct.
   
   Updated, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] JackBuggins commented on pull request #35230: [SPARK-37934] [Build] Upgrade Jetty version to 9.4.44

2022-01-26 Thread GitBox


JackBuggins commented on pull request #35230:
URL: https://github.com/apache/spark/pull/35230#issuecomment-1022282155


   None to my knowledge @HyukjinKwon, but with regard to 
https://github.com/eclipse/jetty.project/issues/6973, my thinking is that it would 
offer peace of mind.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Yikun edited a comment on pull request #35299: [SPARK-37916][SPARK-37713][K8S] Revert "Assign namespace to executor configmap"

2022-01-26 Thread GitBox


Yikun edited a comment on pull request #35299:
URL: https://github.com/apache/spark/pull/35299#issuecomment-1019849743


   FYI @dcoliversun
   
   I contacted @dcoliversun, and in an offline discussion we confirmed that two 
problems were introduced in PR 
https://github.com/apache/spark/pull/34983 , so we'd better revert it 
first.
   
   I will mark this as `ready for review` after you confirm what happened 
in the original issue `SPARK-37713`.
   
   Personally, I think we don't need to set metadata.namespace explicitly, so we 
don't need this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a change in pull request #35278: [SPARK-37677][CORE] Use the shell command to decompress the ZIP file

2022-01-26 Thread GitBox


LuciferYang commented on a change in pull request #35278:
URL: https://github.com/apache/spark/pull/35278#discussion_r792679337



##
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##
@@ -595,7 +593,7 @@ private[spark] object Utils extends Logging {
 if (lowerSrc.endsWith(".jar")) {
   RunJar.unJar(source, dest, RunJar.MATCH_ANY)
 } else if (lowerSrc.endsWith(".zip")) {
-  FileUtil.unZip(source, dest)
+  unZip(source, dest)

Review comment:
   Is it possible to chmod again after unzip? I'm not sure
   
   
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #35246: [SPARK-37929][SQL] Support cascade mode for `dropNamespace` API

2022-01-26 Thread GitBox


HyukjinKwon commented on a change in pull request #35246:
URL: https://github.com/apache/spark/pull/35246#discussion_r792675110



##
File path: 
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsNamespaces.java
##
@@ -136,15 +137,20 @@ void alterNamespace(
   NamespaceChange... changes) throws NoSuchNamespaceException;
 
   /**
-   * Drop a namespace from the catalog, recursively dropping all objects 
within the namespace.
+   * Drop a namespace from the catalog with cascade mode, recursively dropping 
all objects
+   * within the namespace if cascade is true.
* 
* If the catalog implementation does not support this operation, it may 
throw
* {@link UnsupportedOperationException}.
*
* @param namespace a multi-part namespace
+   * @param cascade When true, deletes all objects under the namespace
* @return true if the namespace was dropped
* @throws NoSuchNamespaceException If the namespace does not exist 
(optional)
+   * @throws NonEmptyNamespaceException If the namespace is non-empty and 
cascade is false
* @throws UnsupportedOperationException If drop is not a supported operation
*/
-  boolean dropNamespace(String[] namespace) throws NoSuchNamespaceException;
+  boolean dropNamespace(
+  String[] namespace,
+  boolean cascade) throws NoSuchNamespaceException, 
NonEmptyNamespaceException;

Review comment:
   Can we at least keep the old signature, deprecate it, and throw an 
exception? Also, this PR changed the behaviour.
   
   I know that DSv2 has been less strict on this, but we should probably start 
taking care of this, given that the API has been around for almost 2 years, since 
Spark 3.0.0.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a change in pull request #35278: [SPARK-37677][CORE] Use the shell command to decompress the ZIP file

2022-01-26 Thread GitBox


LuciferYang commented on a change in pull request #35278:
URL: https://github.com/apache/spark/pull/35278#discussion_r792673852



##
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##
@@ -605,6 +603,22 @@ private[spark] object Utils extends Logging {
 }
   }
 
+  def unZip(inFile: File, untarDir: File): Unit = {
+val untarCommand = new StringBuffer

Review comment:
   There will be no concurrency issue here
   
   

##
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##
@@ -605,6 +603,22 @@ private[spark] object Utils extends Logging {
 }
   }
 
+  def unZip(inFile: File, untarDir: File): Unit = {
+val untarCommand = new StringBuffer

Review comment:
   There will be no concurrency issue
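   
   A minimal Scala sketch of the point being made here, assuming the command is only 
   ever built on a single thread (the helper below is illustrative, not the PR's 
   actual implementation):
   
   ```scala
   import java.io.File
   
   // Illustrative only: the command is assembled on a single thread before being
   // handed to the shell, so an unsynchronized StringBuilder is sufficient here;
   // a StringBuffer's synchronization buys nothing.
   def buildUnzipCommand(inFile: File, unzipDir: File): String = {
     val cmd = new StringBuilder
     cmd.append("cd '").append(unzipDir.getAbsolutePath).append("' && ")
     cmd.append("unzip -o '").append(inFile.getAbsolutePath).append("'")
     cmd.toString
   }
   ```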
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a change in pull request #35278: [SPARK-37677][CORE] Use the shell command to decompress the ZIP file

2022-01-26 Thread GitBox


LuciferYang commented on a change in pull request #35278:
URL: https://github.com/apache/spark/pull/35278#discussion_r792670450



##
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##
@@ -605,6 +603,29 @@ private[spark] object Utils extends Logging {
 }
   }
 
+  def unZip(inFile: File, unzipDir: File): Unit = {
+if (!unzipDir.mkdirs && !unzipDir.isDirectory) {
+  throw new IOException("Mkdirs failed to create " + unzipDir)
+} else {
+  if (Shell.WINDOWS) {
+FileUtil.unZip(inFile, unzipDir)

Review comment:
   +1 agree with @HyukjinKwon 
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a change in pull request #35278: [SPARK-37677][CORE] Use the shell command to decompress the ZIP file

2022-01-26 Thread GitBox


LuciferYang commented on a change in pull request #35278:
URL: https://github.com/apache/spark/pull/35278#discussion_r792668929



##
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##
@@ -3196,8 +3217,8 @@ private[spark] object Utils extends Logging {
 entry = in.getNextEntry()
   }
   in.close() // so that any error in closing does not get ignored
-  logInfo(s"Unzipped from $dfsZipFile\n\t${files.mkString("\n\t")}")
 } finally {
+  logInfo(s"Unzipped from $dfsZipFile\n\t${files.mkString("\n\t")}")

Review comment:
   This log should be printed only after successful decompression. It doesn't 
seem appropriate to move it into the finally block.
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a change in pull request #35278: [SPARK-37677][CORE] Use the shell command to decompress the ZIP file

2022-01-26 Thread GitBox


LuciferYang commented on a change in pull request #35278:
URL: https://github.com/apache/spark/pull/35278#discussion_r792668929



##
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##
@@ -3196,8 +3217,8 @@ private[spark] object Utils extends Logging {
 entry = in.getNextEntry()
   }
   in.close() // so that any error in closing does not get ignored
-  logInfo(s"Unzipped from $dfsZipFile\n\t${files.mkString("\n\t")}")
 } finally {
+  logInfo(s"Unzipped from $dfsZipFile\n\t${files.mkString("\n\t")}")

Review comment:
   This log should be printed only after successful decompression. It doesn't 
seem appropriate to put it in the finally block.
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a change in pull request #35278: [SPARK-37677][CORE] Use the shell command to decompress the ZIP file

2022-01-26 Thread GitBox


LuciferYang commented on a change in pull request #35278:
URL: https://github.com/apache/spark/pull/35278#discussion_r792667649



##
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##
@@ -55,14 +53,14 @@ import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
 import org.apache.hadoop.io.compress.{CompressionCodecFactory, 
SplittableCompressionCodec}
 import org.apache.hadoop.security.UserGroupInformation
-import org.apache.hadoop.util.{RunJar, StringUtils}
+import org.apache.hadoop.util.Shell.ShellCommandExecutor
+import org.apache.hadoop.util.{RunJar, Shell, StringUtils}
 import org.apache.hadoop.yarn.conf.YarnConfiguration
 import org.eclipse.jetty.util.MultiException
 import org.slf4j.Logger
-
 import org.apache.spark._
 import org.apache.spark.deploy.SparkHadoopUtil
-import org.apache.spark.internal.{config, Logging}

Review comment:
   Please restore to `import org.apache.spark.internal.{config, Logging}`
   
   

##
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##
@@ -55,14 +53,14 @@ import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
 import org.apache.hadoop.io.compress.{CompressionCodecFactory, 
SplittableCompressionCodec}
 import org.apache.hadoop.security.UserGroupInformation
-import org.apache.hadoop.util.{RunJar, StringUtils}
+import org.apache.hadoop.util.Shell.ShellCommandExecutor
+import org.apache.hadoop.util.{RunJar, Shell, StringUtils}
 import org.apache.hadoop.yarn.conf.YarnConfiguration
 import org.eclipse.jetty.util.MultiException
 import org.slf4j.Logger
-

Review comment:
   ditto




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a change in pull request #35278: [SPARK-37677][CORE] Use the shell command to decompress the ZIP file

2022-01-26 Thread GitBox


LuciferYang commented on a change in pull request #35278:
URL: https://github.com/apache/spark/pull/35278#discussion_r792666914



##
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##
@@ -32,7 +32,6 @@ import java.util.{Locale, Properties, Random, UUID}
 import java.util.concurrent._
 import java.util.concurrent.TimeUnit.NANOSECONDS
 import java.util.zip.{GZIPInputStream, ZipInputStream}
-

Review comment:
   This blank line should not be deleted

##
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##
@@ -42,7 +41,6 @@ import scala.reflect.ClassTag
 import scala.util.{Failure, Success, Try}
 import scala.util.control.{ControlThrowable, NonFatal}
 import scala.util.matching.Regex
-

Review comment:
   ditto




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #35246: [SPARK-37929][SQL] Support cascade mode for `dropNamespace` API

2022-01-26 Thread GitBox


cloud-fan commented on a change in pull request #35246:
URL: https://github.com/apache/spark/pull/35246#discussion_r792639178



##
File path: 
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsNamespaces.java
##
@@ -136,15 +137,20 @@ void alterNamespace(
   NamespaceChange... changes) throws NoSuchNamespaceException;
 
   /**
-   * Drop a namespace from the catalog, recursively dropping all objects 
within the namespace.
+   * Drop a namespace from the catalog with cascade mode, recursively dropping 
all objects
+   * within the namespace if cascade is true.
* 
* If the catalog implementation does not support this operation, it may 
throw
* {@link UnsupportedOperationException}.
*
* @param namespace a multi-part namespace
+   * @param cascade When true, deletes all objects under the namespace
* @return true if the namespace was dropped
* @throws NoSuchNamespaceException If the namespace does not exist 
(optional)
+   * @throws NonEmptyNamespaceException If the namespace is non-empty and 
cascade is false
* @throws UnsupportedOperationException If drop is not a supported operation
*/
-  boolean dropNamespace(String[] namespace) throws NoSuchNamespaceException;
+  boolean dropNamespace(
+  String[] namespace,
+  boolean cascade) throws NoSuchNamespaceException, 
NonEmptyNamespaceException;

Review comment:
   > Can we implement default method of dropNamespace(String[] namespace)
   
   This doesn't help, because the new `dropNamespace(String[] namespace, 
boolean cascade)` is not implemented and will break all the existing 
implementations.
   
   A common way to keep backward compatibility is to add a default implementation 
for the new API, e.g.
   ```
   boolean dropNamespace(String[] namespace, boolean cascade) {
 if (cascade) dropNamespace(namespace) else throw ...
   }
   ```
   However, `DROP DATABASE` is much more commonly used than `DROP DATABASE ... 
CASCADE`. So this doesn't help either.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen commented on pull request #35307: [SPARK-38008][CORE] Fix the method description of refill

2022-01-26 Thread GitBox


srowen commented on pull request #35307:
URL: https://github.com/apache/spark/pull/35307#issuecomment-1022198541


   Any other typos? You can run a spell check.
   No need for a JIRA for something like this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #35246: [SPARK-37929][SQL] Support cascade mode for `dropNamespace` API

2022-01-26 Thread GitBox


HyukjinKwon commented on a change in pull request #35246:
URL: https://github.com/apache/spark/pull/35246#discussion_r792623987



##
File path: 
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsNamespaces.java
##
@@ -136,15 +137,20 @@ void alterNamespace(
   NamespaceChange... changes) throws NoSuchNamespaceException;
 
   /**
-   * Drop a namespace from the catalog, recursively dropping all objects 
within the namespace.
+   * Drop a namespace from the catalog with cascade mode, recursively dropping 
all objects
+   * within the namespace if cascade is true.
* 
* If the catalog implementation does not support this operation, it may 
throw
* {@link UnsupportedOperationException}.
*
* @param namespace a multi-part namespace
+   * @param cascade When true, deletes all objects under the namespace
* @return true if the namespace was dropped
* @throws NoSuchNamespaceException If the namespace does not exist 
(optional)
+   * @throws NonEmptyNamespaceException If the namespace is non-empty and 
cascade is false
* @throws UnsupportedOperationException If drop is not a supported operation
*/
-  boolean dropNamespace(String[] namespace) throws NoSuchNamespaceException;
+  boolean dropNamespace(
+  String[] namespace,
+  boolean cascade) throws NoSuchNamespaceException, 
NonEmptyNamespaceException;

Review comment:
   If we think that this shouldn't be implemented, we should keep the signature 
and throw an exception, so we can show a nicer error message instead of a 
"method not found" error.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


