[jira] [Commented] (SPARK-37930) Fix DataFrame select subset with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-37930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477702#comment-17477702 ]

Apache Spark commented on SPARK-37930:
--------------------------------------

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/35240

> Fix DataFrame select subset with duplicated columns
> ---------------------------------------------------
>
>                 Key: SPARK-37930
>                 URL: https://issues.apache.org/jira/browse/SPARK-37930
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: dch nguyen
>            Priority: Major
>
> pandas
> {code:java}
> >>> pdf
>    a
> 0  1
> 1  2
> 2  3
> 3  4
> >>> pdf[['a', 'a']]
>    a  a
> 0  1  1
> 1  2  2
> 2  3  3
> 3  4  4
> {code}
> pandas on spark
> {code:java}
> >>> psdf
>    a
> 0  1
> 1  2
> 2  3
> 3  4
> >>> psdf[['a', 'a']]
> Traceback (most recent call last):
>   File "", line 1, in
>   File "/u02/spark/python/pyspark/pandas/frame.py", line 12077, in __repr__
>     pdf = self._get_or_create_repr_pandas_cache(max_display_count)
>   File "/u02/spark/python/pyspark/pandas/frame.py", line 12068, in _get_or_create_repr_pandas_cache
>     self, "_repr_pandas_cache", {n: self.head(n + 1)._to_internal_pandas()}
>   File "/u02/spark/python/pyspark/pandas/frame.py", line 12063, in _to_internal_pandas
>     return self._internal.to_pandas_frame
>   File "/u02/spark/python/pyspark/pandas/utils.py", line 576, in wrapped_lazy_property
>     setattr(self, attr_name, fn(self))
>   File "/u02/spark/python/pyspark/pandas/internal.py", line 1055, in to_pandas_frame
>     return InternalFrame.restore_index(pdf, **self.arguments_for_restore_index)
>   File "/u02/spark/python/pyspark/pandas/internal.py", line 1156, in restore_index
>     pdf.columns = pd.Index(
>   File "/u02/venv3.9-2/lib/python3.9/site-packages/pandas/core/generic.py", line 5500, in __setattr__
>     return object.__setattr__(self, name, value)
>   File "pandas/_libs/properties.pyx", line 70, in pandas._libs.properties.AxisProperty.__set__
>   File "/u02/venv3.9-2/lib/python3.9/site-packages/pandas/core/generic.py", line 766, in _set_axis
>     self._mgr.set_axis(axis, labels)
>   File "/u02/venv3.9-2/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 216, in set_axis
>     self._validate_set_axis(axis, new_labels)
>   File "/u02/venv3.9-2/lib/python3.9/site-packages/pandas/core/internals/base.py", line 57, in _validate_set_axis
>     raise ValueError(
> ValueError: Length mismatch: Expected axis has 4 elements, new values have 2 elements
> {code}

--
This message was sent by Atlassian Jira (v8.20.1#820001)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37930) Fix DataFrame select subset with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-37930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477581#comment-17477581 ]

Yikun Jiang commented on SPARK-37930:
-------------------------------------

[~dchvn] Thanks for the investigation.

Also update the pandas issue:
https://github.com/pandas-dev/pandas/issues/45439
[jira] [Commented] (SPARK-37930) Fix DataFrame select subset with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-37930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477561#comment-17477561 ]

Yikun Jiang commented on SPARK-37930:
-------------------------------------

https://github.com/apache/spark/blob/df7447bc62052e3d7391ba23d7220fb8c9b923fd/python/pyspark/pandas/frame.py#L12268

FYI:
{code:java}
self.loc[:, ['s1', 's2']]
Out[8]:
      s1     s2
0  330.0  345.0
1  160.0    0.0
2    NaN   30.0
self.loc[:, ['s1', 's1']]
# raise the issue you mentioned
{code}
So, maybe we need to support loc index for ['s1', 's1'].
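For readers following the thread, here is a minimal sketch of how selecting a duplicated label could lead to the "Expected axis has 4 elements, new values have 2 elements" error reported in this issue. It is a toy model in plain Python, not pandas or PySpark code, and the helper names select_by_label and relabel are hypothetical. It assumes that label-based selection returns every column matching each requested label, which is consistent with the pandas output reported in this thread.

```python
# Toy model of label-based column selection; NOT pandas or PySpark code.
# The names select_by_label and relabel are hypothetical, for illustration only.

def select_by_label(columns, labels):
    """Return the position of every column whose name matches each requested label."""
    return [
        pos
        for label in labels
        for pos, name in enumerate(columns)
        if name == label
    ]


def relabel(selected_positions, new_labels):
    """Assign new labels, rejecting a length mismatch like the reported ValueError."""
    if len(selected_positions) != len(new_labels):
        raise ValueError(
            f"Length mismatch: Expected axis has {len(selected_positions)} elements, "
            f"new values have {len(new_labels)} elements"
        )
    return list(new_labels)


# One 'a' column: each requested label matches once, so two columns come back.
first = select_by_label(["a"], ["a", "a"])

# Two 'a' columns: each requested label now matches twice, so four come back.
second = select_by_label(["a", "a"], ["a", "a"])
```

Under this model, relabel(first, ["a", "a"]) succeeds, while relabel(second, ["a", "a"]) raises a ValueError because four selected columns cannot take two labels.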
[jira] [Commented] (SPARK-37930) Fix DataFrame select subset with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-37930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477543#comment-17477543 ]

Hyukjin Kwon commented on SPARK-37930:
--------------------------------------

Thanks [~dchvn]!
[jira] [Commented] (SPARK-37930) Fix DataFrame select subset with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-37930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477529#comment-17477529 ]

dch nguyen commented on SPARK-37930:
------------------------------------

I'm working on this. Thanks
[jira] [Commented] (SPARK-37930) Fix DataFrame select subset with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-37930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477528#comment-17477528 ]

dch nguyen commented on SPARK-37930:
------------------------------------

{code:java}
>>> import pandas as pd
>>> pdf = pd.DataFrame([1,2,3,4], columns=['a'])
>>> pdf
   a
0  1
1  2
2  3
3  4
>>> pdf = pdf[['a', 'a']]
>>> pdf
   a  a
0  1  1
1  2  2
2  3  3
3  4  4
>>> pdf[['a', 'a']]
   a  a  a  a
0  1  1  1  1
1  2  2  2  2
2  3  3  3  3
3  4  4  4  4
{code}
It seems this comes from pandas.

[https://github.com/apache/spark/blob/df7447bc62052e3d7391ba23d7220fb8c9b923fd/python/pyspark/pandas/internal.py#L1146]
[jira] [Commented] (SPARK-37930) Fix DataFrame select subset with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-37930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477520#comment-17477520 ]

Hyukjin Kwon commented on SPARK-37930:
--------------------------------------

Are you working on this, [~dchvn] [~yikunkero]?
[jira] [Commented] (SPARK-37930) Fix DataFrame select subset with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-37930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477519#comment-17477519 ]

Hyukjin Kwon commented on SPARK-37930:
--------------------------------------

Interesting. We should fix this.

cc [~XinrongM] [~ueshin] [~itholic] FYI