[jira] [Commented] (SPARK-37930) Fix DataFrame select subset with duplicated columns

2022-01-18 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477702#comment-17477702
 ] 

Apache Spark commented on SPARK-37930:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/35240

> Fix DataFrame select subset with duplicated columns
> ---
>
> Key: SPARK-37930
> URL: https://issues.apache.org/jira/browse/SPARK-37930
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: dch nguyen
> Priority: Major
>
> pandas
> {code:java}
> >>> pdf
>    a
> 0  1
> 1  2
> 2  3
> 3  4
> >>> pdf[['a', 'a']]
>    a  a
> 0  1  1
> 1  2  2
> 2  3  3
> 3  4  4 {code}
> pandas on spark
> {code:java}
> >>> psdf
>    a
> 0  1
> 1  2
> 2  3
> 3  4
> >>> psdf[['a', 'a']]
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/u02/spark/python/pyspark/pandas/frame.py", line 12077, in __repr__
>     pdf = self._get_or_create_repr_pandas_cache(max_display_count)
>   File "/u02/spark/python/pyspark/pandas/frame.py", line 12068, in _get_or_create_repr_pandas_cache
>     self, "_repr_pandas_cache", {n: self.head(n + 1)._to_internal_pandas()}
>   File "/u02/spark/python/pyspark/pandas/frame.py", line 12063, in _to_internal_pandas
>     return self._internal.to_pandas_frame
>   File "/u02/spark/python/pyspark/pandas/utils.py", line 576, in wrapped_lazy_property
>     setattr(self, attr_name, fn(self))
>   File "/u02/spark/python/pyspark/pandas/internal.py", line 1055, in to_pandas_frame
>     return InternalFrame.restore_index(pdf, **self.arguments_for_restore_index)
>   File "/u02/spark/python/pyspark/pandas/internal.py", line 1156, in restore_index
>     pdf.columns = pd.Index(
>   File "/u02/venv3.9-2/lib/python3.9/site-packages/pandas/core/generic.py", line 5500, in __setattr__
>     return object.__setattr__(self, name, value)
>   File "pandas/_libs/properties.pyx", line 70, in pandas._libs.properties.AxisProperty.__set__
>   File "/u02/venv3.9-2/lib/python3.9/site-packages/pandas/core/generic.py", line 766, in _set_axis
>     self._mgr.set_axis(axis, labels)
>   File "/u02/venv3.9-2/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 216, in set_axis
>     self._validate_set_axis(axis, new_labels)
>   File "/u02/venv3.9-2/lib/python3.9/site-packages/pandas/core/internals/base.py", line 57, in _validate_set_axis
>     raise ValueError(
> ValueError: Length mismatch: Expected axis has 4 elements, new values have 2 elements
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-37930) Fix DataFrame select subset with duplicated columns

2022-01-18 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477700#comment-17477700
 ] 

Apache Spark commented on SPARK-37930:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/35240


[jira] [Commented] (SPARK-37930) Fix DataFrame select subset with duplicated columns

2022-01-17 Thread Yikun Jiang (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477581#comment-17477581
 ] 

Yikun Jiang commented on SPARK-37930:
-

[~dchvn] Thanks for the investigation.
Also updated the pandas issue: https://github.com/pandas-dev/pandas/issues/45439


[jira] [Commented] (SPARK-37930) Fix DataFrame select subset with duplicated columns

2022-01-17 Thread Yikun Jiang (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477561#comment-17477561
 ] 

Yikun Jiang commented on SPARK-37930:
-

https://github.com/apache/spark/blob/df7447bc62052e3d7391ba23d7220fb8c9b923fd/python/pyspark/pandas/frame.py#L12268

FYI:

{code:java}
self.loc[:, ['s1', 's2']]
Out[8]: 
      s1     s2
0  330.0  345.0
1  160.0    0.0
2    NaN   30.0
self.loc[:, ['s1', 's1']]
# raises the issue you mentioned
{code}

So maybe we need to support loc indexing with duplicated labels such as ['s1', 's1'].
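
For reference, plain pandas accepts a repeated label in .loc and returns one column per requested label. The following is only a minimal sketch of that reference behaviour, assuming the s1/s2 values from the output above; the frame name pdf and the variable names are illustrative and not taken from the Spark code.

```python
import pandas as pd

# Values copied from the output above; names are illustrative only.
pdf = pd.DataFrame(
    {"s1": [330.0, 160.0, float("nan")], "s2": [345.0, 0.0, 30.0]}
)

# Distinct labels: one column per label.
distinct = pdf.loc[:, ["s1", "s2"]]

# Repeated label: plain pandas returns the same column twice
# instead of raising an error.
repeated = pdf.loc[:, ["s1", "s1"]]

print(list(distinct.columns))
print(list(repeated.columns))
print(repeated.shape)
```

If pandas-on-Spark is meant to match this, psdf.loc[:, ['s1', 's1']] would presumably need to yield two identically named columns as well.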


[jira] [Commented] (SPARK-37930) Fix DataFrame select subset with duplicated columns

2022-01-17 Thread Hyukjin Kwon (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477543#comment-17477543
 ] 

Hyukjin Kwon commented on SPARK-37930:
--

Thanks [~dchvn]!


[jira] [Commented] (SPARK-37930) Fix DataFrame select subset with duplicated columns

2022-01-17 Thread dch nguyen (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477529#comment-17477529
 ] 

dch nguyen commented on SPARK-37930:


I'm working on this. Thanks


[jira] [Commented] (SPARK-37930) Fix DataFrame select subset with duplicated columns

2022-01-17 Thread dch nguyen (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477528#comment-17477528
 ] 

dch nguyen commented on SPARK-37930:


{code:java}
>>> import pandas as pd
>>> pdf = pd.DataFrame([1,2,3,4], columns=['a'])
>>> pdf
   a
0  1
1  2
2  3
3  4
>>> pdf = pdf[['a', 'a']]
>>> pdf
   a  a
0  1  1
1  2  2
2  3  3
3  4  4
>>> pdf[['a', 'a']]
   a  a  a  a
0  1  1  1  1
1  2  2  2  2
2  3  3  3  3
3  4  4  4  4
 {code}
It seems this comes from pandas.

[https://github.com/apache/spark/blob/df7447bc62052e3d7391ba23d7220fb8c9b923fd/python/pyspark/pandas/internal.py#L1146]
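
The session above can be tied to the reported ValueError with a small pandas-only sketch. It assumes, as the traceback suggests, that the failure happens when a two-element column index is assigned to a frame that has already expanded to four columns. The helper try_relabel is hypothetical and exists only for this illustration; it is not part of pandas or Spark.

```python
import pandas as pd

pdf = pd.DataFrame([1, 2, 3, 4], columns=["a"])

# First selection: two columns, both named 'a'.
dup = pdf[["a", "a"]]

# Second selection: each requested 'a' matches both existing columns,
# so the result has four columns.
expanded = dup[["a", "a"]]


def try_relabel(frame, labels):
    """Hypothetical helper: return the error text if relabeling fails, else None."""
    frame = frame.copy()
    try:
        frame.columns = pd.Index(labels)
    except ValueError as exc:
        return str(exc)
    return None


# Assigning two labels to a four-column frame fails in pandas.
message = try_relabel(expanded, ["a", "a"])

print(dup.shape)
print(expanded.shape)
print(message)
```

Under this reading, the label-based lookup multiplies the columns before the column index is restored, which would explain the mismatch between 4 and 2 elements in the report.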


[jira] [Commented] (SPARK-37930) Fix DataFrame select subset with duplicated columns

2022-01-17 Thread Hyukjin Kwon (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477520#comment-17477520
 ] 

Hyukjin Kwon commented on SPARK-37930:
--

Are you working on this, [~dchvn] [~yikunkero]?


[jira] [Commented] (SPARK-37930) Fix DataFrame select subset with duplicated columns

2022-01-17 Thread Hyukjin Kwon (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477519#comment-17477519
 ] 

Hyukjin Kwon commented on SPARK-37930:
--

Interesting. We should fix this. cc [~XinrongM] [~ueshin] [~itholic] FYI

