Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-19 Thread via GitHub


GideonPotok commented on PR #46597:
URL: https://github.com/apache/spark/pull/46597#issuecomment-2119251289

   > How about we go with this:
   > 
   > 
   > 
   > - try to get the modified `Mode.update` approach working, and see how it 
does on the benchmark
   > 
   > - limit the support to collated StringType only, no complex types for now 
- that can be a separate change
   > 
   > 
   > 
   > hint: to properly limit the support for collated types, you can use 
`override def checkInputDataTypes` to throw something like 
`UNSUPPORTED_DATATYPE` if child.dataType is a complex type with **collated** 
strings
   
   @uros-db how about instead, I add support for complex types in this PR and 
in the next one, try to switch implementation to the mode.update approach. 
   
   For the pr, i will add a benchmark that calls update as well as eval because 
the current benchmark won't be relevant to the mode.update approach as it is a 
benchmark of calls to eval 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-19 Thread via GitHub


uros-db commented on PR #46597:
URL: https://github.com/apache/spark/pull/46597#issuecomment-2119216632

   How about we go with this:
   
   - try to get the modified `Mode.update` approach working, and see how it 
does on the benchmark
   - limit the support to collated StringType only, no complex types for now - 
that can be a separate change
   
   hint: to properly limit the support for collated types, you can use 
`override def checkInputDataTypes` to throw something like 
`UNSUPPORTED_DATATYPE` if child.dataType is a complex type with **collated** 
strings


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-19 Thread via GitHub


GideonPotok commented on PR #46597:
URL: https://github.com/apache/spark/pull/46597#issuecomment-2119215757

   > I wouldn't say there's a preference on whether to include both support for 
string type and complex types within the same PR - if you think that the 
changes might end up being too large, then it's fine to split it into separate 
PRs.
   > 
   > 
   > 
   > However I would say that we need to make sure there's no unexpected 
behaviour - for example, MODE shouldn't have correct support for collated 
StringType, but incorrect behaviour for ArrayType(StringType), 
StructType(...StringType...), etc.
   > 
   > 
   > 
   > With that in mind, it seems that we should adopt one of two approaches:
   > 
   > 
   > 
   > - implement the support for collated StringType in this PR, but fail 
(throw exception) for complex types that have collated strings
   > 
   > - implement full support at once
   
   @uros-db if you are fine with me splitting it into two PRs, that's what I 
will do! I will modify this PR to fail for complex types that have collated 
strings. And I will get the PR to implement full (recursive) support for said 
complex types ready to be reviewed right after this one is merged. I appreciate 
your flexibility!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-19 Thread via GitHub


uros-db commented on PR #46597:
URL: https://github.com/apache/spark/pull/46597#issuecomment-2119214355

   As for the modified open hash map, that does sound promising - I suppose 
changing `update` in order to inject collationKey should be enough? it seems 
that `merge` should then work by default
   
   but then of course there's the problem of preserving one of the actual 
values - you correctly noticed that we can't just return collationKey, as that 
might not be present in the original array
   
   I suppose a separate map might do the trick here (mapping collationKey to 
original string value), and since we don't have preference towards which value 
gets returned, simply returning the first one that appeared is considered 
correct behaviour


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-19 Thread via GitHub


uros-db commented on PR #46597:
URL: https://github.com/apache/spark/pull/46597#issuecomment-2119209981

   also note that covering StringTypes which are fields of StructType is not by 
itself enough - suppose there's a field of StructType that is another 
StructType that has a field of collated StringType, etc.
   
   same goes for arrays, handling ArrayType(StringType) is not enough by itself 
- we also need ArrayType(ArrayType(StringType))
   
   in short, I would say that we need a recursive approach to properly handle 
all possible collated string instances


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-19 Thread via GitHub


uros-db commented on PR #46597:
URL: https://github.com/apache/spark/pull/46597#issuecomment-2119208823

   I wouldn't say there's a preference on whether to include both support for 
string type and complex types within the same PR - if you think that the 
changes might end up being too large, then it's fine to split it into separate 
PRs.
   
   However I would say that we need to make sure there's no unexpected 
behaviour - for example, MODE shouldn't have correct support for collated 
StringType, but incorrect behaviour for ArrayType(StringType), 
StructType(...StringType...), etc.
   
   With that in mind, it seems that we should adopt one of two approaches:
   
   - implement the support for collated StringType in this PR, but fail (throw 
exception) for complex types that have collated strings
   - implement full support at once, which shouldn't be much more complicated 
than the above approach


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-18 Thread via GitHub


GideonPotok commented on PR #46597:
URL: https://github.com/apache/spark/pull/46597#issuecomment-2118994013

What I would really like to try is to move from this implementation to an 
approach that will have the collation-support logic moved to the 
PartialAggregation stage, by moving logic to `Mode.merge` and `Mode.update`. I 
would use a modified open hash map for that with hashing based on the collation 
key and with a separate map to map from collation key to one of the actual 
values observed that maps to that collation key (which experimentation has 
shown could work).
   
   But as it has already been a couple weeks of development on this, I believe 
we should, for this PR, confine all the collation logic in the stage that can't 
be serialized and deserialized -- the `eval` stage. And I should try what I 
have described above in a PR raised after we have merged the approach that has 
already been tested (i.e. this PR).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-18 Thread via GitHub


GideonPotok commented on PR #46597:
URL: https://github.com/apache/spark/pull/46597#issuecomment-2118991669

   > > since Mode expression works with any child expression, and you 
special-cased handling Strings, how do we handle Array(String) and 
Struct(String), etc.?
   > 
   > In my local tests, I found that Mode performs a byte-by-byte comparison 
for structs, which does not consider collation. So that is still outstanding. 
Good catch!
   > 
   > @uros-db There are several strategies we might adopt to handle structs 
with collation fields. I am looking into implementations. It is potentially 
straightforward though have some gotchas.
   > 
   > Do you feel I should solve for that in a separate PR or in this one? I 
assume you prefer that this get solve in this PR and not a follow-up PR, right?
   
   @uros-db 
   
   Added implementation for mode to support structs with fields with the 
various collations. Performance is not great, so far.
 
   ```
   [info] collation unit benchmarks - mode - 30105 elements:  Best Time(ms)   
Avg Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
   [info] 
-
   [info] UTF8_BINARY_LCASE - mode - 30105 elements 31  
   32   1  9.8 102.3   1.0X
   [info] UNICODE - mode - 30105 elements1  
1   0240.4   4.2  24.6X
   [info] UTF8_BINARY - mode - 30105 elements1  
1   0239.1   4.2  24.5X
   [info] UNICODE_CI - mode - 30105 elements57  
   59   2  5.3 189.9   0.5X
   ```
   
   I will add the benchmark results from GHA once I get your feedback.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-17 Thread via GitHub


GideonPotok commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1605523709


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -74,16 +80,28 @@ case class Mode(
 if (buffer.isEmpty) {
   return null
 }
-
+val collationAwareBuffer =
+  if 
(!CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+val modeMap = buffer.toSeq.groupMapReduce {
+  case (key: String, _) =>
+CollationFactory.getCollationKey(UTF8String.fromString(key), 
collationId)
+  case (key: UTF8String, _) =>
+CollationFactory.getCollationKey(key, collationId)
+  case (key, _) => key
+}(x => x)((x, y) => (x._1, x._2 + y._2)).values
+modeMap
+  } else {
+buffer
+  }

Review Comment:
   @uros-db done



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -41,6 +42,11 @@ case class Mode(
 this(child, 0, 0, Some(reverse))
   }
 
+  final lazy val collationId: Int = child.dataType match {
+case c: StringType => c.collationId
+case _ => CollationFactory.UTF8_BINARY_COLLATION_ID
+  }

Review Comment:
   @uros-db done
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-17 Thread via GitHub


GideonPotok commented on PR #46597:
URL: https://github.com/apache/spark/pull/46597#issuecomment-2118330617

   > since Mode expression works with any child expression, and you 
special-cased handling Strings, how do we handle Array(String) and 
Struct(String), etc.?
   
   In my local tests, I found that Mode performs a byte-by-byte comparison for 
structs, which does not consider collation. So that is still outstanding. Good 
catch! 
   
   @uros-db There are several strategies we might adopt to handle structs with 
collation fields. I am looking into implementations. It is potentially 
straightforward though have some gotchas.
   
   Do you feel I should solve for that in a separate PR or in this one? I 
assume you prefer that this get solve in this PR and not a follow-up PR, right? 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-17 Thread via GitHub


uros-db commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1605139359


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -74,16 +80,28 @@ case class Mode(
 if (buffer.isEmpty) {
   return null
 }
-
+val collationAwareBuffer =
+  if 
(!CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+val modeMap = buffer.toSeq.groupMapReduce {
+  case (key: String, _) =>
+CollationFactory.getCollationKey(UTF8String.fromString(key), 
collationId)
+  case (key: UTF8String, _) =>
+CollationFactory.getCollationKey(key, collationId)
+  case (key, _) => key
+}(x => x)((x, y) => (x._1, x._2 + y._2)).values
+modeMap
+  } else {
+buffer
+  }

Review Comment:
   also, you may be able to use just: 
`CollationFactory.getCollationKey(key.asInstanceOf[UTF8String], st.collationId)`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-17 Thread via GitHub


uros-db commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1605136854


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -74,16 +80,28 @@ case class Mode(
 if (buffer.isEmpty) {
   return null
 }
-
+val collationAwareBuffer =
+  if 
(!CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+val modeMap = buffer.toSeq.groupMapReduce {
+  case (key: String, _) =>
+CollationFactory.getCollationKey(UTF8String.fromString(key), 
collationId)
+  case (key: UTF8String, _) =>
+CollationFactory.getCollationKey(key, collationId)
+  case (key, _) => key
+}(x => x)((x, y) => (x._1, x._2 + y._2)).values
+modeMap
+  } else {
+buffer
+  }

Review Comment:
   ```suggestion
   val collationAwareBuffer = child.dataType match {
 case st: StringType =>
 val modeMap = buffer.toSeq.groupMapReduce {
 case (key: String, _) =>
   CollationFactory.getCollationKey(UTF8String.fromString(key), 
st.collationId)
 case (key: UTF8String, _) =>
   CollationFactory.getCollationKey(key, st.collationId)
 case (key, _) => key
 }(x => x)((x, y) => (x._1, x._2 + y._2)).values
 modeMap
 case _ => buffer
   }
   ```
   
   (something like this)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-17 Thread via GitHub


uros-db commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1605136854


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -74,16 +80,28 @@ case class Mode(
 if (buffer.isEmpty) {
   return null
 }
-
+val collationAwareBuffer =
+  if 
(!CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+val modeMap = buffer.toSeq.groupMapReduce {
+  case (key: String, _) =>
+CollationFactory.getCollationKey(UTF8String.fromString(key), 
collationId)
+  case (key: UTF8String, _) =>
+CollationFactory.getCollationKey(key, collationId)
+  case (key, _) => key
+}(x => x)((x, y) => (x._1, x._2 + y._2)).values
+modeMap
+  } else {
+buffer
+  }

Review Comment:
   ```suggestion
   val collationAwareBuffer = child.dataType match {
 case st: StringType =>
 val modeMap = buffer.toSeq.groupMapReduce {
 case (key: String, _) =>
   CollationFactory.getCollationKey(UTF8String.fromString(key), 
collationId)
 case (key: UTF8String, _) =>
   CollationFactory.getCollationKey(key, collationId)
 case (key, _) => key
 }(x => x)((x, y) => (x._1, x._2 + y._2)).values
 modeMap
 case _ => buffer
   }
   ```
   
   (something like this)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-17 Thread via GitHub


uros-db commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1605132882


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -41,6 +42,11 @@ case class Mode(
 this(child, 0, 0, Some(reverse))
   }
 
+  final lazy val collationId: Int = child.dataType match {
+case c: StringType => c.collationId
+case _ => CollationFactory.UTF8_BINARY_COLLATION_ID
+  }

Review Comment:
   I see, child is just Expression in Mode
   
   in this case, I would say remove final lazy val collationId (since it's only 
used in 1 place anyways), and just do pattern matching below



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-17 Thread via GitHub


GideonPotok commented on PR #46597:
URL: https://github.com/apache/spark/pull/46597#issuecomment-2117660366

   @uros-db This is all cleaned up. Let's get some of the other reviewers to 
look at it? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-17 Thread via GitHub


GideonPotok commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1604955280


##
sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala:
##
@@ -1404,6 +1408,81 @@ class CollationSQLExpressionsSuite
 })
   }
 
+  test("Support mode for string expression with collation - Basic Test") {
+Seq("utf8_binary", "utf8_binary_lcase", "unicode_ci", "unicode").foreach { 
collationId =>
+  val query = s"SELECT mode(collate('abc', '${collationId}'))"
+  checkAnswer(sql(query), Row("abc"))
+  
assert(sql(query).schema.fields.head.dataType.sameType(StringType(collationId)))
+}
+  }
+
+  test("Support mode for string expression with collation - Advanced Test") {
+case class ModeTestCase[R](collationId: String, bufferValues: Map[String, 
Long], result: R)
+val testCases = Seq(
+  ModeTestCase("utf8_binary", Map("a" -> 3L, "b" -> 2L, "B" -> 2L), "a"),
+  ModeTestCase("utf8_binary_lcase", Map("a" -> 3L, "b" -> 2L, "B" -> 2L), 
"b"),
+  ModeTestCase("unicode_ci", Map("a" -> 3L, "b" -> 2L, "B" -> 2L), "b"),
+  ModeTestCase("unicode", Map("a" -> 3L, "b" -> 2L, "B" -> 2L), "a")
+)
+testCases.foreach(t => {
+  val valuesToAdd = t.bufferValues.map { case (elt, numRepeats) =>
+(0L to numRepeats).map(_ => s"('$elt')").mkString(",")
+  }.mkString(",")
+
+  withTable("t") {
+sql("CREATE TABLE t(i STRING) USING parquet")
+sql("INSERT INTO t VALUES " + valuesToAdd)
+val query = s"SELECT mode(collate(i, '${t.collationId}')) FROM t"
+checkAnswer(sql(query), Row(t.result))
+
assert(sql(query).schema.fields.head.dataType.sameType(StringType(t.collationId)))
+
+  }
+})
+  }
+
+  test("Support Mode.eval(buffer)") {
+case class ModeTestCase[R](
+collationId: String,
+bufferValues: Map[String, Long],
+result: R)
+case class UTF8StringModeTestCase[R](
+collationId: String,
+bufferValues: Map[UTF8String, Long],
+result: R)
+
+val bufferValues = Map("a" -> 5L, "b" -> 4L, "B" -> 3L, "d" -> 2L, "e" -> 
1L)
+val testCasesStrings = Seq(ModeTestCase("utf8_binary", bufferValues, "a"),
+  ModeTestCase("utf8_binary_lcase", bufferValues, "b"),
+  ModeTestCase("unicode_ci", bufferValues, "b"),
+  ModeTestCase("unicode", bufferValues, "a"))
+
+val bufferValuesUTF8String = Map(
+  UTF8String.fromString("a") -> 5L,
+  UTF8String.fromString("b") -> 4L,
+  UTF8String.fromString("B") -> 3L,
+  UTF8String.fromString("d") -> 2L,
+  UTF8String.fromString("e") -> 1L)
+
+val testCasesUTF8String = Seq(
+  UTF8StringModeTestCase("utf8_binary", bufferValuesUTF8String, "a"),
+  UTF8StringModeTestCase("utf8_binary_lcase", bufferValuesUTF8String, "b"),
+  UTF8StringModeTestCase("unicode_ci", bufferValuesUTF8String, "b"),
+  UTF8StringModeTestCase("unicode", bufferValuesUTF8String, "a"))
+
+testCasesStrings.foreach(t => {
+  val buffer = new OpenHashMap[AnyRef, Long](5)
+  val myMode = Mode(child = Literal.create("some_column_name", 
StringType(t.collationId)))
+  t.bufferValues.foreach { case (k, v) => buffer.update(k, v) }
+  assert(myMode.eval(buffer).toString.toLowerCase() == 
t.result.toLowerCase())
+})
+
+testCasesUTF8String.foreach(t => {
+  val buffer = new OpenHashMap[AnyRef, Long](5)
+  val myMode = Mode(child = Literal.create("some_column_name", 
StringType(t.collationId)))
+  t.bufferValues.foreach { case (k, v) => buffer.update(k, v) }
+  assert(myMode.eval(buffer).toString.toLowerCase() == 
t.result.toLowerCase())
+})
+  }

Review Comment:
   I would love to add some E2E Benchmark tests. But it is okay if not. 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-17 Thread via GitHub


GideonPotok commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1604955632


##
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##
@@ -185,6 +189,32 @@ abstract class CollationBenchmarkBase extends 
BenchmarkBase {
 }
 benchmark.run()
   }
+
+  def benchmarkMode(

Review Comment:
   I would love to add some E2E Benchmark tests. But it is okay if not.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-17 Thread via GitHub


GideonPotok commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1604951157


##
sql/core/benchmarks/CollationBenchmark-jdk21-results.txt:
##
@@ -1,54 +1,63 @@
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - equalsFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-UTF8_BINARY_LCASE2948   2958   
   13  0.0   29483.6   1.0X
-UNICODE  2040   2042   
3  0.0   20396.6   1.4X
-UTF8_BINARY  2043   2043   
0  0.0   20426.3   1.4X
-UNICODE_CI  16318  16338   
   28  0.0  163178.4   0.2X
+UTF8_BINARY_LCASE2896   2898   
3  0.0   28958.7   1.0X
+UNICODE  2038   2040   
3  0.0   20377.5   1.4X
+UTF8_BINARY  2053   2054   
1  0.0   20534.9   1.4X
+UNICODE_CI  16779  16802   
   34  0.0  167785.2   0.2X
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - compareFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
---
-UTF8_BINARY_LCASE 3227   3228  
 1  0.0   32272.1   1.0X
-UNICODE  16637  16643  
 9  0.0  166367.7   0.2X
-UTF8_BINARY   3132   3137  
 7  0.0   31319.2   1.0X
-UNICODE_CI   17816  17829  
18  0.0  178162.4   0.2X
+UTF8_BINARY_LCASE 4705   4705  
 0  0.0   47048.0   1.0X
+UNICODE  18863  18867  
 6  0.0  188625.3   0.2X
+UTF8_BINARY   4894   4901  
11  0.0   48936.8   1.0X
+UNICODE_CI   19595  19598  
 4  0.0  195953.0   0.2X
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - hashFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-UTF8_BINARY_LCASE  4824   4824 
  0  0.0   48243.7   1.0X
-UNICODE   69416  69475 
 84  0.0  694158.3   0.1X
-UTF8_BINARY3806   3808 
  2  0.0   38062.8   1.3X
-UNICODE_CI60943  60975 
 45  0.0  609426.2   0.1X
+UTF8_BINARY_LCASE  5011   5013 
  2  0.0   50113.1   1.0X
+UNICODE   68309  68319 
 13  0.0  683094.7   0.1X
+UTF8_BINARY3887   3887 
  1  0.0   38869.8   1.3X
+UNICODE_CI56675  56686 
 15  0.0  566750.3   0.1X
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - contains: Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-UTF8_BINARY_LCASE 11979  11980 
  1  0.0  119790.4   1.0X
-UNICODE

Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-17 Thread via GitHub


GideonPotok commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1604935916


##
sql/core/benchmarks/CollationBenchmark-jdk21-results.txt:
##
@@ -1,54 +1,63 @@
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - equalsFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-UTF8_BINARY_LCASE2948   2958   
   13  0.0   29483.6   1.0X
-UNICODE  2040   2042   
3  0.0   20396.6   1.4X
-UTF8_BINARY  2043   2043   
0  0.0   20426.3   1.4X
-UNICODE_CI  16318  16338   
   28  0.0  163178.4   0.2X
+UTF8_BINARY_LCASE2896   2898   
3  0.0   28958.7   1.0X
+UNICODE  2038   2040   
3  0.0   20377.5   1.4X
+UTF8_BINARY  2053   2054   
1  0.0   20534.9   1.4X
+UNICODE_CI  16779  16802   
   34  0.0  167785.2   0.2X
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - compareFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
---
-UTF8_BINARY_LCASE 3227   3228  
 1  0.0   32272.1   1.0X
-UNICODE  16637  16643  
 9  0.0  166367.7   0.2X
-UTF8_BINARY   3132   3137  
 7  0.0   31319.2   1.0X
-UNICODE_CI   17816  17829  
18  0.0  178162.4   0.2X
+UTF8_BINARY_LCASE 4705   4705  
 0  0.0   47048.0   1.0X
+UNICODE  18863  18867  
 6  0.0  188625.3   0.2X
+UTF8_BINARY   4894   4901  
11  0.0   48936.8   1.0X
+UNICODE_CI   19595  19598  
 4  0.0  195953.0   0.2X
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - hashFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-UTF8_BINARY_LCASE  4824   4824 
  0  0.0   48243.7   1.0X
-UNICODE   69416  69475 
 84  0.0  694158.3   0.1X
-UTF8_BINARY3806   3808 
  2  0.0   38062.8   1.3X
-UNICODE_CI60943  60975 
 45  0.0  609426.2   0.1X
+UTF8_BINARY_LCASE  5011   5013 
  2  0.0   50113.1   1.0X
+UNICODE   68309  68319 
 13  0.0  683094.7   0.1X
+UTF8_BINARY3887   3887 
  1  0.0   38869.8   1.3X
+UNICODE_CI56675  56686 
 15  0.0  566750.3   0.1X
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - contains: Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-UTF8_BINARY_LCASE 11979  11980 
  1  0.0  119790.4   1.0X
-UNICODE

Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-17 Thread via GitHub


GideonPotok commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1604945722


##
sql/core/benchmarks/CollationBenchmark-jdk21-results.txt:
##
@@ -1,54 +1,63 @@
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - equalsFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-UTF8_BINARY_LCASE2948   2958   
   13  0.0   29483.6   1.0X
-UNICODE  2040   2042   
3  0.0   20396.6   1.4X
-UTF8_BINARY  2043   2043   
0  0.0   20426.3   1.4X
-UNICODE_CI  16318  16338   
   28  0.0  163178.4   0.2X
+UTF8_BINARY_LCASE2896   2898   
3  0.0   28958.7   1.0X
+UNICODE  2038   2040   
3  0.0   20377.5   1.4X
+UTF8_BINARY  2053   2054   
1  0.0   20534.9   1.4X
+UNICODE_CI  16779  16802   
   34  0.0  167785.2   0.2X
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - compareFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
---
-UTF8_BINARY_LCASE 3227   3228  
 1  0.0   32272.1   1.0X
-UNICODE  16637  16643  
 9  0.0  166367.7   0.2X
-UTF8_BINARY   3132   3137  
 7  0.0   31319.2   1.0X
-UNICODE_CI   17816  17829  
18  0.0  178162.4   0.2X
+UTF8_BINARY_LCASE 4705   4705  
 0  0.0   47048.0   1.0X
+UNICODE  18863  18867  
 6  0.0  188625.3   0.2X
+UTF8_BINARY   4894   4901  
11  0.0   48936.8   1.0X
+UNICODE_CI   19595  19598  
 4  0.0  195953.0   0.2X
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - hashFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-UTF8_BINARY_LCASE  4824   4824 
  0  0.0   48243.7   1.0X
-UNICODE   69416  69475 
 84  0.0  694158.3   0.1X
-UTF8_BINARY3806   3808 
  2  0.0   38062.8   1.3X
-UNICODE_CI60943  60975 
 45  0.0  609426.2   0.1X
+UTF8_BINARY_LCASE  5011   5013 
  2  0.0   50113.1   1.0X
+UNICODE   68309  68319 
 13  0.0  683094.7   0.1X
+UTF8_BINARY3887   3887 
  1  0.0   38869.8   1.3X
+UNICODE_CI56675  56686 
 15  0.0  566750.3   0.1X
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - contains: Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-UTF8_BINARY_LCASE 11979  11980 
  1  0.0  119790.4   1.0X
-UNICODE

Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-17 Thread via GitHub


GideonPotok commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1604944495


##
sql/core/benchmarks/CollationBenchmark-jdk21-results.txt:
##
@@ -1,54 +1,63 @@
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - equalsFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-UTF8_BINARY_LCASE2948   2958   
   13  0.0   29483.6   1.0X
-UNICODE  2040   2042   
3  0.0   20396.6   1.4X
-UTF8_BINARY  2043   2043   
0  0.0   20426.3   1.4X
-UNICODE_CI  16318  16338   
   28  0.0  163178.4   0.2X
+UTF8_BINARY_LCASE2896   2898   
3  0.0   28958.7   1.0X
+UNICODE  2038   2040   
3  0.0   20377.5   1.4X
+UTF8_BINARY  2053   2054   
1  0.0   20534.9   1.4X
+UNICODE_CI  16779  16802   
   34  0.0  167785.2   0.2X
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - compareFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
---
-UTF8_BINARY_LCASE 3227   3228  
 1  0.0   32272.1   1.0X
-UNICODE  16637  16643  
 9  0.0  166367.7   0.2X
-UTF8_BINARY   3132   3137  
 7  0.0   31319.2   1.0X
-UNICODE_CI   17816  17829  
18  0.0  178162.4   0.2X
+UTF8_BINARY_LCASE 4705   4705  
 0  0.0   47048.0   1.0X
+UNICODE  18863  18867  
 6  0.0  188625.3   0.2X
+UTF8_BINARY   4894   4901  
11  0.0   48936.8   1.0X
+UNICODE_CI   19595  19598  
 4  0.0  195953.0   0.2X
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - hashFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-UTF8_BINARY_LCASE  4824   4824 
  0  0.0   48243.7   1.0X
-UNICODE   69416  69475 
 84  0.0  694158.3   0.1X
-UTF8_BINARY3806   3808 
  2  0.0   38062.8   1.3X
-UNICODE_CI60943  60975 
 45  0.0  609426.2   0.1X
+UTF8_BINARY_LCASE  5011   5013 
  2  0.0   50113.1   1.0X
+UNICODE   68309  68319 
 13  0.0  683094.7   0.1X
+UTF8_BINARY3887   3887 
  1  0.0   38869.8   1.3X
+UNICODE_CI56675  56686 
 15  0.0  566750.3   0.1X
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - contains: Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-UTF8_BINARY_LCASE 11979  11980 
  1  0.0  119790.4   1.0X
-UNICODE

Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-17 Thread via GitHub


GideonPotok commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1604942334


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -41,6 +42,11 @@ case class Mode(
 this(child, 0, 0, Some(reverse))
   }
 
+  final lazy val collationId: Int = child.dataType match {

Review Comment:
   if we use `asInstanceOf[StringType]` we will get an exception whenever the 
child is a numerical type or any non-string type.
   
   It would be ideal in the case of a non-string type if collationId could be 
-1 or something, but then CollationFactory.fetchCollation(collationId) would 
throw an exception. But it really is not precise to be relying on 
`!CollationFactory.fetchCollation(CollationFactory.UTF8_BINARY_COLLATION_ID).supportsBinaryEquality`
 when it is a non-string-type to continue with the non-collation logic.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-17 Thread via GitHub


GideonPotok commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1604934461


##
sql/core/benchmarks/CollationBenchmark-results.txt:
##
@@ -1,54 +1,63 @@
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - equalsFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-UTF8_BINARY_LCASE3571   3576   
7  0.0   35708.8   1.0X
-UNICODE  2235   2240   
7  0.0   22349.2   1.6X
-UTF8_BINARY  2237   2242   
6  0.0   22371.7   1.6X
-UNICODE_CI  18733  18817   
  118  0.0  187333.8   0.2X
+UTF8_BINARY_LCASE3241   3252   
   16  0.0   32413.8   1.0X
+UNICODE  2080   2082   
3  0.0   20800.9   1.6X
+UTF8_BINARY  2081   2083   
2  0.0   20814.2   1.6X
+UNICODE_CI  17364  17384   
   27  0.0  173644.2   0.2X
 
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - compareFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
---
-UTF8_BINARY_LCASE 4260   4290  
41  0.0   42602.6   1.0X
-UNICODE  19536  19624  
   124  0.0  195360.2   0.2X
-UTF8_BINARY   3582   3612  
43  0.0   35818.5   1.2X
-UNICODE_CI   20381  20454  
   103  0.0  203814.1   0.2X
+UTF8_BINARY_LCASE 3614   3615  
 1  0.0   36142.6   1.0X
+UNICODE  18575  18585  
15  0.0  185747.7   0.2X
+UTF8_BINARY   3311   3326  
21  0.0   33111.6   1.1X
+UNICODE_CI   19241  19249  
11  0.0  192409.4   0.2X
 
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - hashFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-UTF8_BINARY_LCASE  7347   7349 
  3  0.0   73467.1   1.0X
-UNICODE   73462  73608 
206  0.0  734623.2   0.1X
-UTF8_BINARY5775   5815 
 57  0.0   57746.0   1.3X
-UNICODE_CI57543  57619 
108  0.0  575425.2   0.1X
+UTF8_BINARY_LCASE  6928   6929 
  1  0.0   69276.9   1.0X
+UNICODE   65674  65693 
 27  0.0  656737.6   0.1X
+UTF8_BINARY5440   5457 
 23  0.0   54403.2   1.3X
+UNICODE_CI60549  60605 
 79  0.0  605488.5   0.1X
 
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - contains: Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-UTF8_BINARY_LCASE 15415  15424 
 13  0.0  154147.1   1.0X
-UNICODE  

Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-17 Thread via GitHub


GideonPotok commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1604935916


##
sql/core/benchmarks/CollationBenchmark-jdk21-results.txt:
##
@@ -1,54 +1,63 @@
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - equalsFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-UTF8_BINARY_LCASE2948   2958   
   13  0.0   29483.6   1.0X
-UNICODE  2040   2042   
3  0.0   20396.6   1.4X
-UTF8_BINARY  2043   2043   
0  0.0   20426.3   1.4X
-UNICODE_CI  16318  16338   
   28  0.0  163178.4   0.2X
+UTF8_BINARY_LCASE2896   2898   
3  0.0   28958.7   1.0X
+UNICODE  2038   2040   
3  0.0   20377.5   1.4X
+UTF8_BINARY  2053   2054   
1  0.0   20534.9   1.4X
+UNICODE_CI  16779  16802   
   34  0.0  167785.2   0.2X
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - compareFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
---
-UTF8_BINARY_LCASE 3227   3228  
 1  0.0   32272.1   1.0X
-UNICODE  16637  16643  
 9  0.0  166367.7   0.2X
-UTF8_BINARY   3132   3137  
 7  0.0   31319.2   1.0X
-UNICODE_CI   17816  17829  
18  0.0  178162.4   0.2X
+UTF8_BINARY_LCASE 4705   4705  
 0  0.0   47048.0   1.0X
+UNICODE  18863  18867  
 6  0.0  188625.3   0.2X
+UTF8_BINARY   4894   4901  
11  0.0   48936.8   1.0X
+UNICODE_CI   19595  19598  
 4  0.0  195953.0   0.2X
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - hashFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-UTF8_BINARY_LCASE  4824   4824 
  0  0.0   48243.7   1.0X
-UNICODE   69416  69475 
 84  0.0  694158.3   0.1X
-UTF8_BINARY3806   3808 
  2  0.0   38062.8   1.3X
-UNICODE_CI60943  60975 
 45  0.0  609426.2   0.1X
+UTF8_BINARY_LCASE  5011   5013 
  2  0.0   50113.1   1.0X
+UNICODE   68309  68319 
 13  0.0  683094.7   0.1X
+UTF8_BINARY3887   3887 
  1  0.0   38869.8   1.3X
+UNICODE_CI56675  56686 
 15  0.0  566750.3   0.1X
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - contains: Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-UTF8_BINARY_LCASE 11979  11980 
  1  0.0  119790.4   1.0X
-UNICODE

Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-17 Thread via GitHub


GideonPotok commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1604934461


##
sql/core/benchmarks/CollationBenchmark-results.txt:
##
@@ -1,54 +1,63 @@
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - equalsFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-UTF8_BINARY_LCASE3571   3576   
7  0.0   35708.8   1.0X
-UNICODE  2235   2240   
7  0.0   22349.2   1.6X
-UTF8_BINARY  2237   2242   
6  0.0   22371.7   1.6X
-UNICODE_CI  18733  18817   
  118  0.0  187333.8   0.2X
+UTF8_BINARY_LCASE3241   3252   
   16  0.0   32413.8   1.0X
+UNICODE  2080   2082   
3  0.0   20800.9   1.6X
+UTF8_BINARY  2081   2083   
2  0.0   20814.2   1.6X
+UNICODE_CI  17364  17384   
   27  0.0  173644.2   0.2X
 
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - compareFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
---
-UTF8_BINARY_LCASE 4260   4290  
41  0.0   42602.6   1.0X
-UNICODE  19536  19624  
   124  0.0  195360.2   0.2X
-UTF8_BINARY   3582   3612  
43  0.0   35818.5   1.2X
-UNICODE_CI   20381  20454  
   103  0.0  203814.1   0.2X
+UTF8_BINARY_LCASE 3614   3615  
 1  0.0   36142.6   1.0X
+UNICODE  18575  18585  
15  0.0  185747.7   0.2X
+UTF8_BINARY   3311   3326  
21  0.0   33111.6   1.1X
+UNICODE_CI   19241  19249  
11  0.0  192409.4   0.2X
 
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - hashFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-UTF8_BINARY_LCASE  7347   7349 
  3  0.0   73467.1   1.0X
-UNICODE   73462  73608 
206  0.0  734623.2   0.1X
-UTF8_BINARY5775   5815 
 57  0.0   57746.0   1.3X
-UNICODE_CI57543  57619 
108  0.0  575425.2   0.1X
+UTF8_BINARY_LCASE  6928   6929 
  1  0.0   69276.9   1.0X
+UNICODE   65674  65693 
 27  0.0  656737.6   0.1X
+UTF8_BINARY5440   5457 
 23  0.0   54403.2   1.3X
+UNICODE_CI60549  60605 
 79  0.0  605488.5   0.1X
 
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - contains: Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-UTF8_BINARY_LCASE 15415  15424 
 13  0.0  154147.1   1.0X
-UNICODE