[jira] [Assigned] (SPARK-39979) IndexOutOfBoundsException on groupby + apply pandas grouped map udf function

2023-05-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39979:


Assignee: Adam Binford

> IndexOutOfBoundsException on groupby + apply pandas grouped map udf function
> 
>
> Key: SPARK-39979
> URL: https://issues.apache.org/jira/browse/SPARK-39979
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: yaniv oren
>Assignee: Adam Binford
>Priority: Major
>
> I'm grouping on a relatively small number of groups, but the groups 
> themselves are large. I'm working with pyarrow version 2.0.0, and machine 
> memory is 64 GiB.
> I'm getting the following error:
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 387 
> in stage 162.0 failed 4 times, most recent failure: Lost task 387.3 in stage 
> 162.0 (TID 29957) (ip-172-21-129-187.eu-west-1.compute.internal executor 71): 
> java.lang.IndexOutOfBoundsException: index: 2147483628, length: 36 (expected: 
> range(0, 2147483648))
>   at org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:699)
>   at org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:890)
>   at 
> org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1087)
>   at 
> org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:251)
>   at 
> org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:130)
>   at 
> org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:95)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:92)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1474)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:103)
>   at 
> org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:435)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2031)
>   at 
> org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:270)
>  {code}
> Why do I hit this 2 GB limit? According to SPARK-34588 this should be 
> supported; perhaps it is related to SPARK-34020.
> Please assist.
> Note:
> Could this be related to the use of BaseVariableWidthVector rather than 
> BaseLargeVariableWidthVector?
>  
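For context on the numbers in the trace: Arrow's BaseVariableWidthVector addresses its data buffer with signed 32-bit offsets, which caps a single variable-width buffer at 2 GiB (2**31 bytes). The failing write of a 36-byte value at index 2147483628 would end at 2147483664, past that bound. The sketch below is only illustrative arithmetic mimicking the `ArrowBuf.checkIndex` failure; `fits` is a hypothetical helper, not an Arrow API.

```python
# Illustrative sketch (not Arrow code): the bounds check that fails in
# ArrowBuf.checkIndex, assuming the buffer capacity is 2 GiB because
# BaseVariableWidthVector uses signed 32-bit offsets.
CAPACITY = 2**31  # 2147483648 bytes, the "expected: range(0, 2147483648)" bound

def fits(index: int, length: int, capacity: int = CAPACITY) -> bool:
    """Return True if a write of `length` bytes at `index` stays in bounds."""
    return 0 <= index and index + length <= capacity

# The failing write from the error message: one more 36-byte string pushes the
# variable-width data buffer past the 2 GiB limit.
print(fits(2147483628, 36))  # False: 2147483628 + 36 = 2147483664 > 2**31
```

Since a grouped map UDF serializes each group as Arrow batches, a single very large group can plausibly push one string column's buffer past this limit even on a 64 GiB machine.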



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39979) IndexOutOfBoundsException on groupby + apply pandas grouped map udf function

2023-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39979:


Assignee: Apache Spark







[jira] [Assigned] (SPARK-39979) IndexOutOfBoundsException on groupby + apply pandas grouped map udf function

2023-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39979:


Assignee: (was: Apache Spark)



