[GitHub] spark pull request #23217: [SPARK-25829][SQL][FOLLOWUP] Refactor MapConcat i...

2018-12-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/23217


---

[GitHub] spark pull request #23217: [SPARK-25829][SQL][FOLLOWUP] Refactor MapConcat i...

2018-12-04 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/23217#discussion_r238700354
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala ---
@@ -47,13 +48,17 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Serializable
   private lazy val keyGetter = InternalRow.getAccessor(keyType)
   private lazy val valueGetter = InternalRow.getAccessor(valueType)

-  def put(key: Any, value: Any): Unit = {
+  def put(key: Any, value: Any, withSizeCheck: Boolean = false): Unit = {
     if (key == null) {
       throw new RuntimeException("Cannot use null as map key.")
     }

     val index = keyToIndex.getOrDefault(key, -1)
     if (index == -1) {
+      if (withSizeCheck && size >= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
--- End diff --

ok, let me remove it then, thanks.


---

[GitHub] spark pull request #23217: [SPARK-25829][SQL][FOLLOWUP] Refactor MapConcat i...

2018-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23217#discussion_r238698103
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala ---
@@ -47,13 +48,17 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Serializable
   private lazy val keyGetter = InternalRow.getAccessor(keyType)
   private lazy val valueGetter = InternalRow.getAccessor(valueType)

-  def put(key: Any, value: Any): Unit = {
+  def put(key: Any, value: Any, withSizeCheck: Boolean = false): Unit = {
     if (key == null) {
       throw new RuntimeException("Cannot use null as map key.")
     }

     val index = keyToIndex.getOrDefault(key, -1)
     if (index == -1) {
+      if (withSizeCheck && size >= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
--- End diff --

Hmmm, I'd like to avoid premature optimization. Actually, how much perf can this save? This code block is already doing some heavy work.
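
For context, a rough, self-contained micro-benchmark sketch (not from the PR) that puts the cost in perspective: the guarded check is a single comparison against a constant, while `put()` already performs a hash-map lookup on the same path. `MaxLength` is a hypothetical stand-in for `ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH`, which Spark defines as `Integer.MAX_VALUE - 15`.

    import java.util.{HashMap => JHashMap}

    object SizeCheckCost {
      // Hypothetical stand-in for ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH.
      val MaxLength: Int = Int.MaxValue - 15

      def main(args: Array[String]): Unit = {
        val keyToIndex = new JHashMap[Any, Int]()
        var size = 0
        val start = System.nanoTime()
        var i = 0
        while (i < 10000000) {
          // The work put() already does for every key: a hash lookup.
          if (keyToIndex.getOrDefault(i, -1) == -1) {
            // The disputed guard: a single integer comparison.
            if (size >= MaxLength) throw new RuntimeException("map size limit exceeded")
            keyToIndex.put(i, size)
            size += 1
          }
          i += 1
        }
        println(s"$size inserts in ${(System.nanoTime() - start) / 1e6} ms")
      }
    }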


---

[GitHub] spark pull request #23217: [SPARK-25829][SQL][FOLLOWUP] Refactor MapConcat i...

2018-12-04 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/23217#discussion_r238690465
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala ---
@@ -47,13 +48,17 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Serializable
   private lazy val keyGetter = InternalRow.getAccessor(keyType)
   private lazy val valueGetter = InternalRow.getAccessor(valueType)

-  def put(key: Any, value: Any): Unit = {
+  def put(key: Any, value: Any, withSizeCheck: Boolean = false): Unit = {
     if (key == null) {
       throw new RuntimeException("Cannot use null as map key.")
     }

     val index = keyToIndex.getOrDefault(key, -1)
     if (index == -1) {
+      if (withSizeCheck && size >= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
--- End diff --

This flag is just for perf reasons: we can skip the check under some conditions, and I didn't want to introduce perf overhead where it is not needed. If we remove the flag, we would do the comparison for each item even when it is not needed.
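
For illustration, a minimal Scala sketch (hypothetical callers, not code from the PR) of the intent behind the flag: only paths that can actually exceed the limit opt into the check, while a single-map rebuild keeps the default and skips the per-item comparison. (Per the discussion above, the flag was ultimately removed in favor of an unconditional check.)

    import org.apache.spark.sql.catalyst.util.{ArrayBasedMapBuilder, MapData}
    import org.apache.spark.sql.types.{IntegerType, StringType}

    object WithSizeCheckUsage {
      // Merging several maps can push the total past the limit,
      // so every insertion opts into the check.
      def merge(maps: Seq[MapData]): MapData = {
        val builder = new ArrayBasedMapBuilder(StringType, IntegerType)
        maps.foreach { m =>
          m.foreach(StringType, IntegerType,
            (k, v) => builder.put(k, v, withSizeCheck = true))
        }
        builder.build()
      }

      // Rebuilding a single, already-valid map cannot overflow, so it keeps
      // the default (withSizeCheck = false) and skips the per-item comparison.
      def rebuild(map: MapData): MapData = {
        val builder = new ArrayBasedMapBuilder(StringType, IntegerType)
        map.foreach(StringType, IntegerType, (k, v) => builder.put(k, v))
        builder.build()
      }
    }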


---

[GitHub] spark pull request #23217: [SPARK-25829][SQL][FOLLOWUP] Refactor MapConcat i...

2018-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23217#discussion_r238651534
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala ---
@@ -47,13 +48,17 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Serializable
   private lazy val keyGetter = InternalRow.getAccessor(keyType)
   private lazy val valueGetter = InternalRow.getAccessor(valueType)

-  def put(key: Any, value: Any): Unit = {
+  def put(key: Any, value: Any, withSizeCheck: Boolean = false): Unit = {
     if (key == null) {
       throw new RuntimeException("Cannot use null as map key.")
     }

     val index = keyToIndex.getOrDefault(key, -1)
     if (index == -1) {
+      if (withSizeCheck && size >= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
+        throw new RuntimeException(s"Unsuccessful attempt to concat maps with $size elements " +
--- End diff --

nit: `concat` -> `build`


---

[GitHub] spark pull request #23217: [SPARK-25829][SQL][FOLLOWUP] Refactor MapConcat i...

2018-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23217#discussion_r238651421
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala ---
@@ -47,13 +48,17 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Serializable
   private lazy val keyGetter = InternalRow.getAccessor(keyType)
   private lazy val valueGetter = InternalRow.getAccessor(valueType)

-  def put(key: Any, value: Any): Unit = {
+  def put(key: Any, value: Any, withSizeCheck: Boolean = false): Unit = {
     if (key == null) {
       throw new RuntimeException("Cannot use null as map key.")
     }

     val index = keyToIndex.getOrDefault(key, -1)
     if (index == -1) {
+      if (withSizeCheck && size >= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
--- End diff --

I think we should always check the size. Such a big map is very likely to cause problems.
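
For reference, a sketch of what the unconditional check could look like (an assumed shape based on this diff, not necessarily the merged code; `keyToIndex`, `keys`, `values` and `size` are taken to be the builder's internal state):

    def put(key: Any, value: Any): Unit = {
      if (key == null) {
        throw new RuntimeException("Cannot use null as map key.")
      }
      val index = keyToIndex.getOrDefault(key, -1)
      if (index == -1) {
        // Unconditional guard: one comparison per distinct key, no caller flag.
        if (size >= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
          throw new RuntimeException(s"Unsuccessful attempt to build maps with $size elements " +
            s"due to exceeding the map size limit ${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}.")
        }
        keyToIndex.put(key, values.length)
        keys.append(key)
        values.append(value)
      } else {
        // Duplicated key: overwrite the previous value, the map does not grow.
        values(index) = value
      }
    }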


---

[GitHub] spark pull request #23217: [SPARK-25829][SQL][FOLLOWUP] Refactor MapConcat i...

2018-12-04 Thread mgaido91
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/23217

[SPARK-25829][SQL][FOLLOWUP] Refactor MapConcat in order to check properly the limit size

## What changes were proposed in this pull request?

The PR starts from this [comment](https://github.com/apache/spark/pull/23124#discussion_r236112390) in the main one and aims at:
 - simplifying the code for `MapConcat`;
 - being more precise in checking the size limit (see the sketch after this list).
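
For illustration, a rough Scala sketch (assumed and simplified, not the PR's actual code) of the direction: funneling every entry through `ArrayBasedMapBuilder`, which deduplicates keys, so the size check counts the actual result entries rather than the raw sum of the input sizes, which over-counts duplicated keys.

    import org.apache.spark.sql.catalyst.util.{ArrayBasedMapBuilder, MapData}
    import org.apache.spark.sql.types.DataType

    // Hypothetical helper, not the expression's actual eval() code.
    def concatMaps(maps: Seq[MapData], keyType: DataType, valueType: DataType): MapData = {
      if (maps.contains(null)) return null // map_concat yields null on any null input
      val builder = new ArrayBasedMapBuilder(keyType, valueType)
      maps.foreach { m =>
        // put() overwrites duplicated keys instead of growing the map, so the
        // size check inside the builder tracks the real number of entries.
        m.foreach(keyType, valueType, (k, v) => builder.put(k, v, withSizeCheck = true))
      }
      builder.build()
    }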

## How was this patch tested?

existing tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-25829_followup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23217.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23217


commit 54f0f31aaa14de7c44c336580c7ed18e8ffb4b54
Author: Marco Gaido 
Date:   2018-12-04T12:35:09Z

[SPARK-25829][SQL][FOLLOWUP] Refactor MapConcat in order to check properly the limit size




---
