Re: [PR] [SPARK-47202][PYTHON][TESTS][FOLLOW-UP] Test timestamp with tzinfo in toPandas and createDataFrame with Arrow optimized [spark]

2024-02-27 Thread via GitHub
HyukjinKwon commented on PR #45308: URL: https://github.com/apache/spark/pull/45308#issuecomment-1968421360 Merged to master and branch-3.5.

Re: [PR] [SPARK-47202][PYTHON][TESTS][FOLLOW-UP] Test timestamp with tzinfo in toPandas and createDataFrame with Arrow optimized [spark]

2024-02-27 Thread via GitHub
HyukjinKwon closed pull request #45308: [SPARK-47202][PYTHON][TESTS][FOLLOW-UP] Test timestamp with tzinfo in toPandas and createDataFrame with Arrow optimized URL: https://github.com/apache/spark/pull/45308

Re: [PR] [SPARK-47197] Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell [spark]

2024-02-27 Thread via GitHub
eubnara commented on PR #45309: URL: https://github.com/apache/spark/pull/45309#issuecomment-1968419075 Thanks for the explanation. I think I need to review the Spark and Iceberg code more...

Re: [PR] [SPARK-42929][CONNECT][PYTHON][TEST] test barrier mode for mapInPandas/mapInArrow [spark]

2024-02-27 Thread via GitHub
wbo4958 commented on PR #45310: URL: https://github.com/apache/spark/pull/45310#issuecomment-1968413037 Hi @WeichenXu123, could you help review this PR? Thx

Re: [PR] [SPARK-47197] Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell [spark]

2024-02-27 Thread via GitHub
pan3793 commented on PR #45309: URL: https://github.com/apache/spark/pull/45309#issuecomment-1968395911 IMO it's an Iceberg-side issue, and in addition to the case you listed above, cases that access multiple Kerberized HMS instances should be considered, e.g. the Spark built-in HMS and the Iceberg HMS

Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

2024-02-27 Thread via GitHub
uros-db commented on code in PR #45216: URL: https://github.com/apache/spark/pull/45216#discussion_r1505471123 ## common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java: ## @@ -341,6 +342,21 @@ public boolean contains(final UTF8String substring) { return

Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

2024-02-27 Thread via GitHub
mihailom-db commented on code in PR #45285: URL: https://github.com/apache/spark/pull/45285#discussion_r1505476279 ## sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala: ## @@ -32,6 +33,11 @@ import org.apache.spark.sql.types.StringType class CollationSuite

[PR] [SPARK-42929][CONNECT][PYTHON][TEST] test barrier mode for mapInPandas/mapInArrow [spark]

2024-02-27 Thread via GitHub
wbo4958 opened a new pull request, #45310: URL: https://github.com/apache/spark/pull/45310 ### What changes were proposed in this pull request? Add barrier mode tests for mapInPandas and mapInArrow. ### Why are the changes needed? This is the follow-up of

Re: [PR] [SPARK-47197] Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell [spark]

2024-02-27 Thread via GitHub
eubnara commented on PR #45309: URL: https://github.com/apache/spark/pull/45309#issuecomment-1968379376 Thanks for the reply. With `spark-sql` or `spark-shell`, is it impossible to use Iceberg with HiveCatalog? Is only Iceberg with HadoopCatalog supported?

Re: [PR] [SPARK-47194][BUILD] Upgrade log4j to 2.23.0 [spark]

2024-02-27 Thread via GitHub
LuciferYang commented on PR #45292: URL: https://github.com/apache/spark/pull/45292#issuecomment-1968372075 OK, let me close this PR first. Thanks @dongjoon-hyun

Re: [PR] [SPARK-47194][BUILD] Upgrade log4j to 2.23.0 [spark]

2024-02-27 Thread via GitHub
LuciferYang closed pull request #45292: [SPARK-47194][BUILD] Upgrade log4j to 2.23.0 URL: https://github.com/apache/spark/pull/45292

Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

2024-02-27 Thread via GitHub
cloud-fan commented on code in PR #45216: URL: https://github.com/apache/spark/pull/45216#discussion_r1505462011 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ## @@ -105,6 +105,9 @@ public Collation( private static final Collation[]

Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

2024-02-27 Thread via GitHub
cloud-fan commented on code in PR #45216: URL: https://github.com/apache/spark/pull/45216#discussion_r1505461280 ## common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java: ## @@ -341,6 +342,21 @@ public boolean contains(final UTF8String substring) {

Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

2024-02-27 Thread via GitHub
uros-db commented on code in PR #45216: URL: https://github.com/apache/spark/pull/45216#discussion_r1505460392 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ## @@ -105,6 +105,9 @@ public Collation( private static final Collation[]

Re: [PR] [SPARK-47194][BUILD] Upgrade log4j to 2.23.0 [spark]

2024-02-27 Thread via GitHub
dongjoon-hyun commented on PR #45292: URL: https://github.com/apache/spark/pull/45292#issuecomment-1968366002 Thank you for the investigation. +1 for skipping.

Re: [PR] [SPARK-47199][PYTHON][TESTS] Add prefix into TemporaryDirectory to avoid flakiness [spark]

2024-02-27 Thread via GitHub
dongjoon-hyun commented on PR #45298: URL: https://github.com/apache/spark/pull/45298#issuecomment-1968364450 Merged to master.

Re: [PR] [SPARK-47199][PYTHON][TESTS] Add prefix into TemporaryDirectory to avoid flakiness [spark]

2024-02-27 Thread via GitHub
dongjoon-hyun closed pull request #45298: [SPARK-47199][PYTHON][TESTS] Add prefix into TemporaryDirectory to avoid flakiness URL: https://github.com/apache/spark/pull/45298

Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2024-02-27 Thread via GitHub
wbo4958 commented on PR #44690: URL: https://github.com/apache/spark/pull/44690#issuecomment-1968358962 "simply subtracting 1 ulp" may cause the precision error to accumulate each time the double calculation is done, and may finally result in unexpected behavior. Anyway, "simply
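The accumulation concern can be sketched in a few lines of Scala; this is a hypothetical illustration, not code from the PR, and the task amount is an assumed example value:

```scala
// Hypothetical sketch: nudging the per-task amount down by one ulp before
// every accumulation step compounds the error, so after N additions the
// running total sits roughly N ulps (plus rounding error) below the exact sum.
val taskAmount = 1.0 / 11.0
val nudged = taskAmount - Math.ulp(taskAmount) // the "simply subtracting 1 ulp" scheme
var total = 0.0
for (_ <- 1 to 11) total += nudged
println(total)       // falls short of 1.0 by roughly 11 ulps
println(1.0 - total) // the accumulated gap
```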

Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

2024-02-27 Thread via GitHub
uros-db commented on code in PR #45216: URL: https://github.com/apache/spark/pull/45216#discussion_r1505450164 ## common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java: ## @@ -341,6 +342,21 @@ public boolean contains(final UTF8String substring) { return

Re: [PR] [SPARK-47197] Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell [spark]

2024-02-27 Thread via GitHub
pan3793 commented on PR #45309: URL: https://github.com/apache/spark/pull/45309#issuecomment-1968353484 `HiveDelegationTokenProvider` takes care of token refresh for the Spark built-in HMS client; Iceberg uses its own HMS client implementation and should take care of itself. As an

Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

2024-02-27 Thread via GitHub
cloud-fan commented on code in PR #45216: URL: https://github.com/apache/spark/pull/45216#discussion_r1505449019 ## common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java: ## @@ -341,6 +342,21 @@ public boolean contains(final UTF8String substring) {

Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

2024-02-27 Thread via GitHub
cloud-fan commented on code in PR #45216: URL: https://github.com/apache/spark/pull/45216#discussion_r1505448239 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ## @@ -105,6 +105,9 @@ public Collation( private static final Collation[]

Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

2024-02-27 Thread via GitHub
cloud-fan commented on code in PR #45216: URL: https://github.com/apache/spark/pull/45216#discussion_r1505447421 ## common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java: ## @@ -327,11 +329,10 @@ public UTF8String substringSQL(int pos, int length) { /**

Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

2024-02-27 Thread via GitHub
mkaravel commented on code in PR #45216: URL: https://github.com/apache/spark/pull/45216#discussion_r1505438346 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala: ## @@ -559,6 +559,25 @@ case class BinaryPredicate(override val

Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

2024-02-27 Thread via GitHub
mihailom-db commented on code in PR #45285: URL: https://github.com/apache/spark/pull/45285#discussion_r1505436616 ## sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala: ## @@ -32,6 +33,11 @@ import org.apache.spark.sql.types.StringType class CollationSuite

Re: [PR] [SPARK-47203][DOCKER][TEST] Use gvenzl/oracle-free:23.3-slim to reduce disk usage for docker it [spark]

2024-02-27 Thread via GitHub
yaooqinn commented on PR #45304: URL: https://github.com/apache/spark/pull/45304#issuecomment-1968329845 Thank you @dongjoon-hyun. Merged to master

Re: [PR] [SPARK-47203][DOCKER][TEST] Use gvenzl/oracle-free:23.3-slim to reduce disk usage for docker it [spark]

2024-02-27 Thread via GitHub
yaooqinn closed pull request #45304: [SPARK-47203][DOCKER][TEST] Use gvenzl/oracle-free:23.3-slim to reduce disk usage for docker it URL: https://github.com/apache/spark/pull/45304

[PR] [SPARK-47197] Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell [spark]

2024-02-27 Thread via GitHub
eubnara opened a new pull request, #45309: URL: https://github.com/apache/spark/pull/45309 ### What changes were proposed in this pull request? Make `spark-sql` and `spark-shell` able to access Iceberg with HiveCatalog. If a user wants to access an Iceberg table

Re: [PR] [SPARK-47192] Convert some _LEGACY_ERROR_TEMP_0035 errors [spark]

2024-02-27 Thread via GitHub
MaxGekk closed pull request #45291: [SPARK-47192] Convert some _LEGACY_ERROR_TEMP_0035 errors URL: https://github.com/apache/spark/pull/45291

Re: [PR] [SPARK-47192] Convert some _LEGACY_ERROR_TEMP_0035 errors [spark]

2024-02-27 Thread via GitHub
MaxGekk commented on PR #45291: URL: https://github.com/apache/spark/pull/45291#issuecomment-1968321379 +1, LGTM. Merging to master. Thank you, @srielau.

Re: [PR] [SPARK-47202][PYTHON] Fix typo breaking datetimes with tzinfo [spark]

2024-02-27 Thread via GitHub
HyukjinKwon commented on PR #45301: URL: https://github.com/apache/spark/pull/45301#issuecomment-1968321173 PTAL: https://github.com/apache/spark/pull/45308

[PR] [SPARK-47202][PYTHON][TESTS][FOLLOW-UP] Test timestamp with tzinfo in toPandas and createDataFrame [spark]

2024-02-27 Thread via GitHub
HyukjinKwon opened a new pull request, #45308: URL: https://github.com/apache/spark/pull/45308 ### What changes were proposed in this pull request? This PR is a follow-up of https://github.com/apache/spark/pull/45301 that actually tests the change. ### Why are the changes

[PR] [SPARK-47206] Add official image Dockerfile for Apache Spark 3.5.1 [spark-docker]

2024-02-27 Thread via GitHub
Yikun opened a new pull request, #59: URL: https://github.com/apache/spark-docker/pull/59 ### What changes were proposed in this pull request? Add Apache Spark 3.5.1 Dockerfiles. - Add 3.5.1 GPG key - Add .github/workflows/build_3.5.1.yaml - `./add-dockerfiles.sh 3.5.1` to

Re: [PR] [SPARK-47194][BUILD] Upgrade log4j to 2.23.0 [spark]

2024-02-27 Thread via GitHub
LuciferYang commented on PR #45292: URL: https://github.com/apache/spark/pull/45292#issuecomment-1968311173 It seems that the `-Dlog4j2.debug` option may not be working in 2.23.0; perhaps we should skip this upgrade. I have tested the following scenarios: 1. run

[PR] [SPARK-47205][DOCKER][TESTS] Upgrade docker-java to 3.3.5 [spark]

2024-02-27 Thread via GitHub
yaooqinn opened a new pull request, #45307: URL: https://github.com/apache/spark/pull/45307 ### What changes were proposed in this pull request? Upgrades docker-java to 3.3.5 ### Why are the changes needed? A new API for setting start_interval might help in

[PR] [SPARK-47155] Fix Error Class Issue [spark]

2024-02-27 Thread via GitHub
sunan135 opened a new pull request, #45306: URL: https://github.com/apache/spark/pull/45306 ### What changes were proposed in this pull request? Make create_data_source.py use the correct error class. ### Why are the changes needed? This is part of the effort of SPARK-44076

Re: [PR] [WIP][SPARK-45880][SQL] Introduce a new TableCatalog.listTable overload th… [spark]

2024-02-27 Thread via GitHub
cloud-fan commented on PR #43751: URL: https://github.com/apache/spark/pull/43751#issuecomment-1968294284 This is a hard decision. Technically, the behavior of LIKE in many commands (`SHOW TABLES LIKE ...`) relies on the underlying catalog, which can be an HMS of any of several versions, or a
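For context, a sketch of the pattern dialect in question, runnable in `spark-shell` (the table names are made up for illustration):

```scala
// Spark's built-in catalog interprets SHOW TABLES LIKE patterns itself:
// '*' matches any sequence of characters and '|' separates alternatives.
// When listing is pushed down to an external catalog such as an HMS, that
// catalog's own pattern semantics apply, which is what makes this a hard call.
spark.sql("CREATE TABLE IF NOT EXISTS tmp_a (id INT) USING parquet")
spark.sql("CREATE TABLE IF NOT EXISTS temp_b (id INT) USING parquet")
spark.sql("SHOW TABLES LIKE 'tmp*|temp*'").show() // matches both tables above
```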

Re: [PR] [SPARK-47191][SQL] Avoid unnecessary relation lookup when uncaching table/view [spark]

2024-02-27 Thread via GitHub
cloud-fan closed pull request #45289: [SPARK-47191][SQL] Avoid unnecessary relation lookup when uncaching table/view URL: https://github.com/apache/spark/pull/45289

Re: [PR] [SPARK-47191][SQL] Avoid unnecessary relation lookup when uncaching table/view [spark]

2024-02-27 Thread via GitHub
cloud-fan commented on PR #45289: URL: https://github.com/apache/spark/pull/45289#issuecomment-1968279079 thanks for the review, merging to master!

Re: [PR] [SPARK-47153][CORE] Guard serialize/deserialize in JavaSerializer with try-with-resource block [spark]

2024-02-27 Thread via GitHub
jwang0306 commented on code in PR #45238: URL: https://github.com/apache/spark/pull/45238#discussion_r1505389956 ## core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala: ## @@ -118,22 +118,24 @@ private[spark] class JavaSerializerInstance( override def

Re: [PR] [SPARK-47187][SQL][3.4] Fix hive compress output config does not work [spark]

2024-02-27 Thread via GitHub
ulysses-you closed pull request #45286: [SPARK-47187][SQL][3.4] Fix hive compress output config does not work URL: https://github.com/apache/spark/pull/45286

Re: [PR] [SPARK-47187][SQL][3.4] Fix hive compress output config does not work [spark]

2024-02-27 Thread via GitHub
ulysses-you commented on PR #45286: URL: https://github.com/apache/spark/pull/45286#issuecomment-1968269430 thanks, merging to branch-3.4

Re: [PR] [SPARK-46992] Make dataset.cache() return new ds instance [spark]

2024-02-27 Thread via GitHub
cloud-fan commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1968268514 > df.count() and df.collect().size should always agree. How about this idea: when calling `df.collect()`, if the plan is cached but the physical plan is not a cache scan, then we
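The invariant being defended can be stated as a runnable Scala sketch (illustrative only; it does not reproduce the reported mismatch):

```scala
// Whether or not the cached data ends up being used for a given action,
// both of these paths must observe the same rows.
val df = spark.range(100).toDF("id")
df.cache()
df.count()                                // materializes the cache
assert(df.count() == df.collect().length) // expected to hold in all cases
```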

Re: [PR] [SPARK-46525][BUILD][TESTS][FOLLOWUP] Cleanup http client deps for spotify docker client [spark]

2024-02-27 Thread via GitHub
yaooqinn commented on PR #45303: URL: https://github.com/apache/spark/pull/45303#issuecomment-1968263903 Thank you @dongjoon-hyun @HyukjinKwon, merged to master

Re: [PR] [SPARK-39771][CORE] Add a warning msg in `Dependency` when a too large number of shuffle blocks is to be created. [spark]

2024-02-27 Thread via GitHub
mridulm commented on code in PR #45266: URL: https://github.com/apache/spark/pull/45266#discussion_r1505383159 ## core/src/main/scala/org/apache/spark/Dependency.scala: ## @@ -206,6 +206,21 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag]( finalizeTask =

Re: [PR] [SPARK-46525][BUILD][TESTS][FOLLOWUP] Cleanup http client deps for spotify docker client [spark]

2024-02-27 Thread via GitHub
yaooqinn closed pull request #45303: [SPARK-46525][BUILD][TESTS][FOLLOWUP] Cleanup http client deps for spotify docker client URL: https://github.com/apache/spark/pull/45303

Re: [PR] [SPARK-47192] Convert some _LEGACY_ERROR_TEMP_0035 errors [spark]

2024-02-27 Thread via GitHub
srielau commented on PR #45291: URL: https://github.com/apache/spark/pull/45291#issuecomment-1968260763 @MaxGekk Can you merge?

Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

2024-02-27 Thread via GitHub
cloud-fan commented on code in PR #45285: URL: https://github.com/apache/spark/pull/45285#discussion_r1505370756 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala: ## @@ -90,6 +100,9 @@ case class Collate(child: Expression,

Re: [PR] [SPARK-47202][PYTHON] Fix typo breaking datetimes with tzinfo [spark]

2024-02-27 Thread via GitHub
HyukjinKwon commented on PR #45301: URL: https://github.com/apache/spark/pull/45301#issuecomment-1968241780 Merged to master and branch-3.5.

Re: [PR] [SPARK-47202][PYTHON] Fix typo breaking datetimes with tzinfo [spark]

2024-02-27 Thread via GitHub
HyukjinKwon closed pull request #45301: [SPARK-47202][PYTHON] Fix typo breaking datetimes with tzinfo URL: https://github.com/apache/spark/pull/45301

Re: [PR] [SPARK-47202][PYTHON] Fix typo breaking datetimes with tzinfo [spark]

2024-02-27 Thread via GitHub
HyukjinKwon commented on PR #45301: URL: https://github.com/apache/spark/pull/45301#issuecomment-1968241304 Let me just merge this in and follow up with a test.

Re: [PR] [SPARK-47144][CONNECT][SQL][PYTHON] Fix Spark Connect collation error by adding collateId protobuf field [spark]

2024-02-27 Thread via GitHub
cloud-fan closed pull request #45233: [SPARK-47144][CONNECT][SQL][PYTHON] Fix Spark Connect collation error by adding collateId protobuf field URL: https://github.com/apache/spark/pull/45233

Re: [PR] [SPARK-47144][CONNECT][SQL][PYTHON] Fix Spark Connect collation error by adding collateId protobuf field [spark]

2024-02-27 Thread via GitHub
cloud-fan commented on PR #45233: URL: https://github.com/apache/spark/pull/45233#issuecomment-1968238333 thanks, merging to master!

Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2024-02-27 Thread via GitHub
srowen commented on PR #44690: URL: https://github.com/apache/spark/pull/44690#issuecomment-1968231292 Well, long * double is always carried out in double precision. Casting it to long doesn't make the math somehow exact. You will always have some truncation when rounding to an integer. I
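A quick Scala check of that statement, with illustrative values:

```scala
// The Long operand is promoted to Double, the multiplication is performed in
// double precision, and .toLong then truncates toward zero.
val n: Long = 1000000L
val fraction: Double = 1.0 / 3.0
val product: Double = n * fraction // a double-precision result, not an exact one
println(product)                   // ~333333.3333333333
println(product.toLong)            // 333333 -- the fractional part is discarded
```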

Re: [PR] [SPARK-39771][CORE] Add a warning msg in `Dependency` when a too large number of shuffle blocks is to be created. [spark]

2024-02-27 Thread via GitHub
sadikovi commented on code in PR #45266: URL: https://github.com/apache/spark/pull/45266#discussion_r1505362656 ## core/src/main/scala/org/apache/spark/Dependency.scala: ## @@ -206,6 +206,21 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag]( finalizeTask =

Re: [PR] [SPARK-47135][SS] Implement error classes for Kafka data loss exceptions [spark]

2024-02-27 Thread via GitHub
HeartSaVioR commented on code in PR #45221: URL: https://github.com/apache/spark/pull/45221#discussion_r1505349482 ## connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousStream.scala: ## @@ -92,13 +93,18 @@ class KafkaContinuousStream(

[PR] [WIP] implement Python streaming data sink [spark]

2024-02-27 Thread via GitHub
chaoqin-li1123 opened a new pull request, #45305: URL: https://github.com/apache/spark/pull/45305 ### What changes were proposed in this pull request? Implement Python streaming data sink ### Why are the changes needed? Implement Python streaming data sink

Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2024-02-27 Thread via GitHub
wbo4958 commented on PR #44690: URL: https://github.com/apache/spark/pull/44690#issuecomment-1968217911 I know that if we directly convert a float to an integer there will be some rounding down, but the fact is we have multiplied by a big number first, so I think there is no rounding down anymore,

Re: [PR] [SPARK-47202][PYTHON] Fix typo breaking datetimes with tzinfo [spark]

2024-02-27 Thread via GitHub
arzavj commented on code in PR #45301: URL: https://github.com/apache/spark/pull/45301#discussion_r1505353843 ## python/pyspark/sql/pandas/types.py: ## @@ -993,7 +993,7 @@ def convert_struct(value: Any) -> Any: def convert_timestamp(value: Any) -> Any:

Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2024-02-27 Thread via GitHub
srowen commented on PR #44690: URL: https://github.com/apache/spark/pull/44690#issuecomment-1968192652 float -> integer conversion in the JVM always truncates, so yes (for positive numbers) you are rounding down by doing this. I think my point is, the fix actually has nothing to do

Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2024-02-27 Thread via GitHub
wbo4958 commented on PR #44690: URL: https://github.com/apache/spark/pull/44690#issuecomment-1968190296 > But I think what you're doing is converting taskAmount * ONE_ENTIRE_RESOURCE to a long, thus rounding down and making the number a little smaller by throwing away a tiny bit.

Re: [PR] [SPARK-43157][SQL] Clone InMemoryRelation cached plan to prevent cloned plan from referencing same objects [spark]

2024-02-27 Thread via GitHub
liuzqt commented on PR #40812: URL: https://github.com/apache/spark/pull/40812#issuecomment-1968189774 Had a discussion with @maryannxue; a few thoughts: - it might not be a good idea to implicitly override innerChildren with clone behavior? It's a generated function from the base class, even

Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2024-02-27 Thread via GitHub
wbo4958 commented on PR #44690: URL: https://github.com/apache/spark/pull/44690#issuecomment-1968184018 Hi @srowen, sorry for my bad example in https://github.com/apache/spark/pull/44690#discussion_r1477679566; just like you said, the double has been converted to Long by multiplying

Re: [PR] [SPARK-46834][SQL][Collations] Support for aggregates [spark]

2024-02-27 Thread via GitHub
cloud-fan commented on code in PR #45290: URL: https://github.com/apache/spark/pull/45290#discussion_r1505328033 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/MergeScalarSubqueries.scala: ## @@ -353,9 +353,20 @@ object MergeScalarSubqueries extends

Re: [PR] [SPARK-46834][SQL][Collations] Support for aggregates [spark]

2024-02-27 Thread via GitHub
cloud-fan commented on code in PR #45290: URL: https://github.com/apache/spark/pull/45290#discussion_r1505326604 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala: ## @@ -405,11 +405,21 @@ abstract class HashExpression[E] extends Expression {

Re: [PR] [SPARK-47063][SQL] CAST long to timestamp has different behavior for codegen vs interpreted [spark]

2024-02-27 Thread via GitHub
yaooqinn commented on PR #45294: URL: https://github.com/apache/spark/pull/45294#issuecomment-1968162489 Thanks all, merged to master, 3.5.2 and 3.4.3

Re: [PR] [SPARK-47063][SQL] CAST long to timestamp has different behavior for codegen vs interpreted [spark]

2024-02-27 Thread via GitHub
yaooqinn closed pull request #45294: [SPARK-47063][SQL] CAST long to timestamp has different behavior for codegen vs interpreted URL: https://github.com/apache/spark/pull/45294

Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2024-02-27 Thread via GitHub
srowen commented on PR #44690: URL: https://github.com/apache/spark/pull/44690#issuecomment-1968149149 So, this _doesn't_ work: ``` scala> val ONE_ENTIRE_RESOURCE: Long = 1L | val taskAmount = 1.0/11.0 | var total: Double = ONE_ENTIRE_RESOURCE
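For contrast, a hedged sketch of the scaled-integer approach being weighed in this thread; the constant name and the scale of 10000 are assumptions for illustration, not necessarily the PR's actual values:

```scala
// Convert the fractional amount to integer units once, accepting a single
// truncation up front; from then on, accumulation in Long is exact.
val ONE_ENTIRE_RESOURCE = 10000L // assumed scale: 1 resource = 10000 units
val taskAmount = 1.0 / 11.0
val taskUnits = (taskAmount * ONE_ENTIRE_RESOURCE).toLong // 909 (909.09... truncated)
val total = (1 to 11).map(_ => taskUnits).sum             // 9999, never exceeds 10000
println(s"11 tasks consume $total of $ONE_ENTIRE_RESOURCE units")
```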

Re: [PR] [SPARK-47120][SQL] Null comparison push down data filter from subquery produces in NPE in Parquet filter [spark]

2024-02-27 Thread via GitHub
yaooqinn commented on code in PR #45202: URL: https://github.com/apache/spark/pull/45202#discussion_r1505309057 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala: ## @@ -609,8 +609,8 @@ class ParquetFilters( // Parquet's

[PR] [SPARK-47203][DOCKER][TEST] Use gvenzl/oracle-free:23.3-slim to reduce disk usage for docker it [spark]

2024-02-27 Thread via GitHub
yaooqinn opened a new pull request, #45304: URL: https://github.com/apache/spark/pull/45304 ### What changes were proposed in this pull request? Use `gvenzl/oracle-free:23.3-slim` to reduce disk usage for docker it ```docker docker image ls REPOSITORY

Re: [PR] [SPARK-47120][SQL] Null comparison push down data filter from subquery produces in NPE in Parquet filter [spark]

2024-02-27 Thread via GitHub
cloud-fan commented on code in PR #45202: URL: https://github.com/apache/spark/pull/45202#discussion_r1505307420 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala: ## @@ -700,17 +700,19 @@ class ParquetFilters(

Re: [PR] [SPARK-39771][CORE] Add a warning msg in `Dependency` when a too large number of shuffle blocks is to be created. [spark]

2024-02-27 Thread via GitHub
mridulm commented on code in PR #45266: URL: https://github.com/apache/spark/pull/45266#discussion_r1505305241 ## core/src/main/scala/org/apache/spark/Dependency.scala: ## @@ -206,6 +206,21 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag]( finalizeTask =

[PR] [SPARK-46525][FOLLOWUP] Cleanup http client deps for spotify docker client [spark]

2024-02-27 Thread via GitHub
yaooqinn opened a new pull request, #45303: URL: https://github.com/apache/spark/pull/45303 ### What changes were proposed in this pull request? Clean up the HTTP client deps used by the Spotify docker client, as they're unnecessary for docker-java ### Why are the

Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2024-02-27 Thread via GitHub
wbo4958 commented on PR #44690: URL: https://github.com/apache/spark/pull/44690#issuecomment-1968122629 Hi @srowen, for https://github.com/apache/spark/pull/44690#discussion_r1479110884 I deployed a Spark standalone cluster, launched a spark-shell, and ran the test code by

Re: [PR] [SPARK-47202][PYTHON] Fix typo breaking datetimes with tzinfo [spark]

2024-02-27 Thread via GitHub
dongjoon-hyun commented on code in PR #45301: URL: https://github.com/apache/spark/pull/45301#discussion_r1505276588 ## python/pyspark/sql/pandas/types.py: ## @@ -993,7 +993,7 @@ def convert_struct(value: Any) -> Any: def convert_timestamp(value: Any) -> Any:

Re: [PR] [SPARK-47202][PYTHON] Fix typo breaking datetimes with tzinfo [spark]

2024-02-27 Thread via GitHub
HyukjinKwon commented on code in PR #45301: URL: https://github.com/apache/spark/pull/45301#discussion_r1505269608 ## python/pyspark/sql/pandas/types.py: ## @@ -993,7 +993,7 @@ def convert_struct(value: Any) -> Any: def convert_timestamp(value: Any) -> Any:

Re: [PR] [WIP][SPARK-45880][SQL] Introduce a new TableCatalog.listTable overload th… [spark]

2024-02-27 Thread via GitHub
panbingkun commented on code in PR #43751: URL: https://github.com/apache/spark/pull/43751#discussion_r1505263771 ## sql/core/src/test/resources/sql-tests/analyzer-results/show-views.sql.out: ## @@ -77,31 +77,31 @@ ShowViewsCommand global_temp, [namespace#x, viewName#x,

Re: [PR] [WIP][SPARK-45880][SQL] Introduce a new TableCatalog.listTable overload th… [spark]

2024-02-27 Thread via GitHub
panbingkun commented on code in PR #43751: URL: https://github.com/apache/spark/pull/43751#discussion_r1505263446 ## sql/core/src/test/resources/sql-tests/analyzer-results/show-tables.sql.out: ## @@ -60,37 +60,37 @@ ShowTables [namespace#x, tableName#x, isTemporary#x] --

[PR] [SPARK-43255][SQL] Replace the error class _LEGACY_ERROR_TEMP_2020 by an internal error [spark]

2024-02-27 Thread via GitHub
JinHelin404 opened a new pull request, #45302: URL: https://github.com/apache/spark/pull/45302 ### What changes were proposed in this pull request? Change the error class _LEGACY_ERROR_TEMP_2022 to an internal error as it cannot be accessed by the public API. ###

Re: [PR] [SPARK-47202][PySpark] Fix typo breaking datetimes with tzinfo [spark]

2024-02-27 Thread via GitHub
yaooqinn commented on PR #45301: URL: https://github.com/apache/spark/pull/45301#issuecomment-1968087569 Try `git commit -am "ci" --allow-empty` and push once more

Re: [PR] [SPARK-47202][PySpark] Fix typo breaking datetimes with tzinfo [spark]

2024-02-27 Thread via GitHub
arzavj commented on PR #45301: URL: https://github.com/apache/spark/pull/45301#issuecomment-1968086734 I did enable it after it failed, but I don't know how to re-run the check

Re: [PR] [SPARK-47202][PySpark] Fix typo breaking datetimes with tzinfo [spark]

2024-02-27 Thread via GitHub
yaooqinn commented on PR #45301: URL: https://github.com/apache/spark/pull/45301#issuecomment-1968086497 Hi @arzavj, can you enable the GHA? https://github.com/apache/spark/pull/45301/checks?check_run_id=22059858407

Re: [PR] [SPARK-47202][PySpark] Fix typo breaking datetimes with tzinfo [spark]

2024-02-27 Thread via GitHub
arzavj commented on PR #45301: URL: https://github.com/apache/spark/pull/45301#issuecomment-1968085914 @zhengruifeng @ueshin could you please review this?

Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2024-02-27 Thread via GitHub
srowen commented on PR #44690: URL: https://github.com/apache/spark/pull/44690#issuecomment-1968083215 I'm confused too; didn't we have a long conversation about this? The essence of your fix is this: https://github.com/apache/spark/pull/44690#discussion_r1477679566 but it

Re: [PR] [SPARK-47201][PYTHON][CONNECT] `sameSemantics` checks input types [spark]

2024-02-27 Thread via GitHub
zhengruifeng commented on PR #45300: URL: https://github.com/apache/spark/pull/45300#issuecomment-1968079694 merged to master

Re: [PR] [SPARK-47201][PYTHON][CONNECT] `sameSemantics` checks input types [spark]

2024-02-27 Thread via GitHub
zhengruifeng closed pull request #45300: [SPARK-47201][PYTHON][CONNECT] `sameSemantics` checks input types URL: https://github.com/apache/spark/pull/45300

Re: [PR] [WIP][SPARK-45880][SQL] Introduce a new TableCatalog.listTable overload th… [spark]

2024-02-27 Thread via GitHub
panbingkun commented on code in PR #43751: URL: https://github.com/apache/spark/pull/43751#discussion_r1505253654 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala: ## @@ -107,27 +107,82 @@ object StringUtils extends Logging { def

Re: [PR] [SPARK-45599][CORE][3.5] Use object equality in OpenHashSet [spark]

2024-02-27 Thread via GitHub
dongjoon-hyun closed pull request #45296: [SPARK-45599][CORE][3.5] Use object equality in OpenHashSet URL: https://github.com/apache/spark/pull/45296

Re: [PR] [SPARK-45599][CORE][3.5] Use object equality in OpenHashSet [spark]

2024-02-27 Thread via GitHub
dongjoon-hyun commented on PR #45296: URL: https://github.com/apache/spark/pull/45296#issuecomment-1968077387 Merged to branch-3.5. Thank you, @nchammas.

Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2024-02-27 Thread via GitHub
wbo4958 commented on PR #44690: URL: https://github.com/apache/spark/pull/44690#issuecomment-1968075711 Hi @srowen, I appreciate your review, but I have some confusion regarding your statements. You mentioned that `"The core issue of floating-point inaccuracy doesn't go away"` and

Re: [PR] [SPARK-47201][PYTHON][CONNECT] `sameSemantics` checks input types [spark]

2024-02-27 Thread via GitHub
zhengruifeng commented on code in PR #45300: URL: https://github.com/apache/spark/pull/45300#discussion_r1505251493 ## python/pyspark/sql/tests/test_dataframe.py: ## @@ -1843,15 +1843,14 @@ def check_to_local_iterator_not_fully_consumed(self):

[PR] [SPARK-47201][PYTHON][CONNECT] `sameSemantics` checks input types [spark]

2024-02-27 Thread via GitHub
zhengruifeng opened a new pull request, #45300: URL: https://github.com/apache/spark/pull/45300 ### What changes were proposed in this pull request? `sameSemantics` checks input types in the same way as vanilla PySpark ### Why are the changes needed? For parity ###

Re: [PR] [SPARK-39771][CORE] Add a warning msg in `Dependency` when a too large number of shuffle blocks is to be created. [spark]

2024-02-27 Thread via GitHub
dongjoon-hyun commented on code in PR #45266: URL: https://github.com/apache/spark/pull/45266#discussion_r1505249964 ## core/src/main/scala/org/apache/spark/Dependency.scala: ## @@ -206,6 +206,21 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](

Re: [PR] [SPARK-47186][DOCKER][TESTS] Add some timeouts options and logs to improve the debuggability for docker integration test [spark]

2024-02-27 Thread via GitHub
yaooqinn commented on PR #45284: URL: https://github.com/apache/spark/pull/45284#issuecomment-1968073134 Thank you @dongjoon-hyun, merged to master

Re: [PR] [SPARK-47186][DOCKER][TESTS] Add some timeouts options and logs to improve the debuggability for docker integration test [spark]

2024-02-27 Thread via GitHub
yaooqinn closed pull request #45284: [SPARK-47186][DOCKER][TESTS] Add some timeouts options and logs to improve the debuggability for docker integration test URL: https://github.com/apache/spark/pull/45284

Re: [PR] [SPARK-46913][SS] Add support for processing/event time based timers with transformWithState operator [spark]

2024-02-27 Thread via GitHub
anishshri-db commented on code in PR #45051: URL: https://github.com/apache/spark/pull/45051#discussion_r1505248645 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulProcessorHandleImpl.scala: ## @@ -121,6 +123,42 @@ class StatefulProcessorHandleImpl(

Re: [PR] [SPARK-47186][DOCKER][TESTS] Add some timeouts options and logs to improve the debuggability for docker integration test [spark]

2024-02-27 Thread via GitHub
dongjoon-hyun commented on PR #45284: URL: https://github.com/apache/spark/pull/45284#issuecomment-1968071225 Thank you. Feel free to merge~
