[ https://issues.apache.org/jira/browse/SPARK-39550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xinrong Meng updated SPARK-39550:
---------------------------------
Description:
When Arrow Execution is enabled,
{code:java}
>>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
'true'
>>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
{'__index_level_0__': 1, '__index_level_1__': 'a'}    1
{'__index_level_0__': 2, '__index_level_1__': 'b'}    1
dtype: int64
{code}
When Arrow Execution is disabled,
{code:java}
>>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
'false'
>>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
(1, a)    1
(2, b)    1
dtype: int64
{code}
Notice that the indexes of the two results differ. In particular, `value_counts` returns an Index (rather than a MultiIndex), which under the hood is a single Spark column of StructType (rather than multiple Spark columns). So when Arrow Execution is enabled, Arrow converts the StructType column to a dictionary, where a tuple is expected.
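The dict-vs-tuple mismatch described above can be sketched outside Spark. This is a hypothetical illustration of the symptom (not the actual fix in PySpark): Arrow materializes each StructType row as a Python dict keyed by the internal index-column names, while the pandas-on-Spark result expects a tuple per row. Converting each dict's values to a tuple, relying on dict insertion order matching the struct field order, recovers the expected index entries.

```python
# Rows as Arrow's StructType-to-Python conversion produces them:
# one dict per row, keyed by the internal index column names.
rows = [
    {'__index_level_0__': 1, '__index_level_1__': 'a'},
    {'__index_level_0__': 2, '__index_level_1__': 'b'},
]

# Expected MultiIndex-style entries are tuples. Since Python dicts
# preserve insertion order, taking the values in order reproduces
# the struct field order.
tuples = [tuple(d.values()) for d in rows]
print(tuples)  # [(1, 'a'), (2, 'b')]
```

This also shows why the bug is purely a representation issue: both paths carry the same data, and only the Python-side shape of each index entry differs between the Arrow and non-Arrow execution paths.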
> Fix `MultiIndex.value_counts()` when Arrow Execution is enabled
> ---------------------------------------------------------------
>
>                 Key: SPARK-39550
>                 URL: https://issues.apache.org/jira/browse/SPARK-39550
>             Project: Spark
>          Issue Type: Bug
>          Components: Pandas API on Spark, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Xinrong Meng
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.20.7#820007)