[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2018-02-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350868#comment-16350868
 ] 

ASF GitHub Bot commented on ARROW-1754:
---

jorisvandenbossche commented on issue #1408: ARROW-1754: [Python] alternative 
fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-362688457
 
 
   Thanks for merging!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
>
>                 Key: ARROW-1754
>                 URL: https://issues.apache.org/jira/browse/ARROW-1754
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.7.1
>            Reporter: Wes McKinney
>            Assignee: Phillip Cloud
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.8.0
>
> See upstream report
> https://stackoverflow.com/questions/47013052/issue-with-pyarrow-when-loading-parquet-file-where-index-has-redundant-column



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2018-02-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350694#comment-16350694
 ] 

ASF GitHub Bot commented on ARROW-1754:
---

wesm closed pull request #1408: ARROW-1754: [Python] alternative fix for 
duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/python/pyarrow/pandas_compat.py b/python/pyarrow/pandas_compat.py
index 4a30fb3b4..240cccdaf 100644
--- a/python/pyarrow/pandas_compat.py
+++ b/python/pyarrow/pandas_compat.py
@@ -179,10 +179,8 @@ def get_column_metadata(column, name, arrow_type, field_name):
     }


-index_level_name = '__index_level_{:d}__'.format
-
-
-def construct_metadata(df, column_names, index_levels, preserve_index, types):
+def construct_metadata(df, column_names, index_levels, index_column_names,
+                       preserve_index, types):
     """Returns a dictionary containing enough metadata to reconstruct a pandas
     DataFrame as an Arrow Table, including index columns.

@@ -197,9 +195,8 @@ def construct_metadata(df, column_names, index_levels, preserve_index, types):
     -------
     dict
     """
-    ncolumns = len(column_names)
-    df_types = types[:ncolumns - len(index_levels)]
-    index_types = types[ncolumns - len(index_levels):]
+    df_types = types[:-len(index_levels)]
+    index_types = types[-len(index_levels):]

     column_metadata = [
         get_column_metadata(
@@ -213,9 +210,6 @@ def construct_metadata(df, column_names, index_levels, preserve_index, types):
     ]

     if preserve_index:
-        index_column_names = list(map(
-            index_level_name, range(len(index_levels))
-        ))
         index_column_metadata = [
             get_column_metadata(
                 level,
@@ -294,9 +288,29 @@ def _column_name_to_strings(name):
     return str(name)


+def _index_level_name(index, i, column_names):
+    """Return the name of an index level or a default name if `index.name` is
+    None or is already a column name.
+
+    Parameters
+    ----------
+    index : pandas.Index
+    i : int
+
+    Returns
+    -------
+    name : str
+    """
+    if index.name is not None and index.name not in column_names:
+        return index.name
+    else:
+        return '__index_level_{:d}__'.format(i)
+
+
 def dataframe_to_arrays(df, schema, preserve_index, nthreads=1):
-    names = []
+    column_names = []
     index_columns = []
+    index_column_names = []
     type = None

     if preserve_index:
@@ -324,12 +338,13 @@ def dataframe_to_arrays(df, schema, preserve_index, nthreads=1):

         columns_to_convert.append(col)
         convert_types.append(type)
-        names.append(name)
+        column_names.append(name)

     for i, column in enumerate(index_columns):
         columns_to_convert.append(column)
         convert_types.append(None)
-        names.append(index_level_name(i))
+        name = _index_level_name(column, i, column_names)
+        index_column_names.append(name)

     # NOTE(wesm): If nthreads=None, then we use a heuristic to decide whether
     # using a thread pool is worth it. Currently the heuristic is whether the
@@ -358,8 +373,10 @@ def convert_column(col, ty):
     types = [x.type for x in arrays]

     metadata = construct_metadata(
-        df, names, index_columns, preserve_index, types
+        df, column_names, index_columns, index_column_names, preserve_index,
+        types
     )
+    names = column_names + index_column_names
     return names, arrays, metadata


diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py
index ca2f1e361..f1f40a695 100644
--- a/python/pyarrow/tests/test_convert_pandas.py
+++ b/python/pyarrow/tests/test_convert_pandas.py
@@ -191,8 +191,9 @@ def test_index_metadata_field_name(self):
         assert idx0['field_name'] == idx0_name
         assert idx0['name'] is None

-        assert foo_name == '__index_level_1__'
-        assert foo['name'] == 'foo'
+        assert foo_name == 'foo'
+        assert foo['field_name'] == foo_name
+        assert foo['name'] == foo_name

     def test_categorical_column_index(self):
         df = pd.DataFrame(
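
A note on the slicing change in `construct_metadata`: the negative-index form behaves differently from the explicit-offset form when `len(index_levels)` is zero, because `-0 == 0` in Python. The sketch below uses plain lists to show the difference. The helper names `split_types_negative` and `split_types_explicit` are hypothetical and not part of pyarrow, and `len(types)` stands in for `ncolumns` on the assumption that the two are equal. Whether the empty case is reachable in practice depends on how the caller handles `preserve_index=False`.

```python
def split_types_negative(types, index_levels):
    """Split as in the '+' lines of the hunk: negative slicing."""
    n = len(index_levels)
    return types[:-n], types[-n:]


def split_types_explicit(types, index_levels):
    """Split as in the '-' lines of the hunk: explicit offsets.

    Assumption: len(types) plays the role of `ncolumns` here.
    """
    ncolumns = len(types)
    n = len(index_levels)
    return types[:ncolumns - n], types[ncolumns - n:]


types = ["int64", "string", "double"]

# With one index level, both forms agree.
assert split_types_negative(types, ["idx"]) == (["int64", "string"], ["double"])
assert split_types_explicit(types, ["idx"]) == (["int64", "string"], ["double"])

# With no index levels, types[:-0] is empty and types[-0:] is the whole list.
assert split_types_negative(types, []) == ([], ["int64", "string", "double"])
assert split_types_explicit(types, []) == (["int64", "string", "double"], [])
```

If the empty case matters, the explicit-offset form (or a guard on `n == 0`) avoids the surprise.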


 



[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2018-02-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350692#comment-16350692
 ] 

ASF GitHub Bot commented on ARROW-1754:
---

wesm commented on issue #1408: ARROW-1754: [Python] alternative fix for 
duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-362649378
 
 
   No problem, I'm merging this, thanks @jorisvandenbossche!




[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2018-02-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349945#comment-16349945
 ] 

ASF GitHub Bot commented on ARROW-1754:
---

jorisvandenbossche commented on issue #1408: ARROW-1754: [Python] alternative 
fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-362516006
 
 
   Hmm, still timing out on the first one (but the other failures seem 
resolved)




[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2018-02-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349778#comment-16349778
 ] 

ASF GitHub Bot commented on ARROW-1754:
---

wesm commented on issue #1408: ARROW-1754: [Python] alternative fix for 
duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-362485027
 
 
   It looked a lot like the failure that was happening before ARROW-2062; I 
triggered a new build to see if it's transient.




[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2018-02-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349388#comment-16349388
 ] 

ASF GitHub Bot commented on ARROW-1754:
---

jorisvandenbossche commented on issue #1408: ARROW-1754: [Python] alternative 
fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-362422663
 
 
   I see the ARROW-2062 commit in the history of this branch: 
https://github.com/jorisvandenbossche/arrow/commits/index-names (I fetched 
upstream master just before I merged / pushed)
   
   However, it is failing on Travis (among other failures, the first (gcc) 
build times out). Is that why you thought this was not up to date?




[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2018-02-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349378#comment-16349378
 ] 

ASF GitHub Bot commented on ARROW-1754:
---

wesm commented on issue #1408: ARROW-1754: [Python] alternative fix for 
duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-362421304
 
 
   Seems like this could be a stale merge -- doesn't look like it got the 
ARROW-2062 patch




[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2018-02-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349284#comment-16349284
 ] 

ASF GitHub Bot commented on ARROW-1754:
---

cpcloud commented on issue #1408: ARROW-1754: [Python] alternative fix for 
duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-362404429
 
 
   @jorisvandenbossche Yep that's a good idea, I can merge on green.




[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2018-02-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349282#comment-16349282
 ] 

ASF GitHub Bot commented on ARROW-1754:
---

jorisvandenbossche commented on issue #1408: ARROW-1754: [Python] alternative 
fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-362403901
 
 
   Regarding the PR backlog, given the comments above I think there was 
agreement to merge this. 
   There are no merge conflicts yet, but should I update with master to ensure 
tests are still passing?




[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2018-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338325#comment-16338325
 ] 

ASF GitHub Bot commented on ARROW-1754:
---

cpcloud commented on issue #1408: ARROW-1754: [Python] alternative fix for 
duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-360292671
 
 
   LGTM




[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2018-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338322#comment-16338322
 ] 

ASF GitHub Bot commented on ARROW-1754:
---

cpcloud commented on a change in pull request #1408: ARROW-1754: [Python] 
alternative fix for duplicate index/column name that preserves index name if 
available
URL: https://github.com/apache/arrow/pull/1408#discussion_r163696046
 
 

 ##
 File path: python/pyarrow/pandas_compat.py
 ##
 @@ -294,9 +288,29 @@ def _column_name_to_strings(name):
      return str(name)


 +def _index_level_name(index, i, column_names):
 +    """Return the name of an index level or a default name if `index.name` is
 +    None or is already a column name.
 +
 +    Parameters
 +    ----------
 +    index : pandas.Index
 +    i : int
 +
 +    Returns
 +    -------
 +    name : str
 +    """
 +    if index.name is not None and index.name not in column_names:
 
 Review comment:
   Fine by me.




[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2018-01-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329749#comment-16329749
 ] 

ASF GitHub Bot commented on ARROW-1754:
---

wesm commented on issue #1408: ARROW-1754: [Python] alternative fix for 
duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-358495585
 
 
   Rebased




[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2018-01-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329725#comment-16329725
 ] 

ASF GitHub Bot commented on ARROW-1754:
---

cpcloud commented on a change in pull request #1408: ARROW-1754: [Python] 
alternative fix for duplicate index/column name that preserves index name if 
available
URL: https://github.com/apache/arrow/pull/1408#discussion_r162216610
 
 

 ##
 File path: python/pyarrow/pandas_compat.py
 ##
 @@ -294,9 +288,29 @@ def _column_name_to_strings(name):
      return str(name)


 +def _index_level_name(index, i, column_names):
 +    """Return the name of an index level or a default name if `index.name` is
 +    None or is already a column name.
 +
 +    Parameters
 +    ----------
 +    index : pandas.Index
 +    i : int
 +
 +    Returns
 +    -------
 +    name : str
 +    """
 +    if index.name is not None and index.name not in column_names:
 
 Review comment:
   Should we be concerned about the linear search for `index.name not in 
column_names`? If so, let's create a set outside the loop below that we can 
check so that we don't need to do a full scan of the column names for every 
index column.
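
   A minimal sketch of what this suggestion could look like, not code from the PR: `FakeIndex` is a hypothetical stand-in for `pandas.Index` (only `.name` is used), and `index_level_names` is an illustrative helper that folds the naming loop and the lookup together.

```python
from collections import namedtuple

# Hypothetical stand-in for pandas.Index; only the `.name` attribute matters.
FakeIndex = namedtuple("FakeIndex", ["name"])


def index_level_names(index_columns, column_names):
    """Name each index level, falling back to '__index_level_<i>__'."""
    # Build the set once, outside the loop, so that each membership test is
    # an average O(1) hash lookup instead of a linear scan of the list.
    taken = set(column_names)
    names = []
    for i, index in enumerate(index_columns):
        if index.name is not None and index.name not in taken:
            names.append(index.name)
        else:
            names.append('__index_level_{:d}__'.format(i))
    return names


levels = [FakeIndex("date"), FakeIndex("a"), FakeIndex(None)]
print(index_level_names(levels, ["a", "b"]))
# ['date', '__index_level_1__', '__index_level_2__']
```

   For frames with few columns both approaches behave the same; the set only pays off when there are many columns and several index levels.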




[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2018-01-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329718#comment-16329718
 ] 

ASF GitHub Bot commented on ARROW-1754:
---

cpcloud commented on issue #1408: ARROW-1754: [Python] alternative fix for 
duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-358490397
 
 
   Looking now.




[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2018-01-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329328#comment-16329328
 ] 

ASF GitHub Bot commented on ARROW-1754:
---

wesm commented on issue #1408: ARROW-1754: [Python] alternative fix for 
duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-358423209
 
 
   OK, if @cpcloud could take a look at this and advise (since he worked on 
this code most recently) I'm fine with merging




[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2018-01-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329066#comment-16329066
 ] 

ASF GitHub Bot commented on ARROW-1754:
---

jorisvandenbossche commented on issue #1408: ARROW-1754: [Python] alternative 
fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-358380263
 
 
   I would prefer to merge this, but I had the feeling nobody else was 
strongly in favor of it. See the top-level post for my reasoning.




[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2018-01-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329059#comment-16329059
 ] 

ASF GitHub Bot commented on ARROW-1754:
---

wesm commented on issue #1408: ARROW-1754: [Python] alternative fix for 
duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-358378386
 
 
   Can this be closed?




[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2017-12-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285395#comment-16285395
 ] 

ASF GitHub Bot commented on ARROW-1754:
---

jorisvandenbossche opened a new pull request #1408: ARROW-1754: [Python] 
alternative fix for duplicate index/column name that preserves index name if 
available
URL: https://github.com/apache/arrow/pull/1408
 
 
   Related to the discussion about the pandas metadata specification in 
https://github.com/pandas-dev/pandas/pull/18201, and an alternative to 
https://github.com/apache/arrow/pull/1271.
   
   I am not opening this PR because it necessarily should be merged; I just 
want to show that it is not that difficult to both fix 
[ARROW-1754](https://issues.apache.org/jira/browse/ARROW-1754) and preserve 
index names as field names when possible (the fix for ARROW-1754 was mentioned 
in https://github.com/pandas-dev/pandas/pull/18201 as the reason for no longer 
preserving index names). 
   The diff is partly a revert of https://github.com/apache/arrow/pull/1271, 
but adapted to the current codebase.
   
   Main reasons I prefer to preserve index names: 1) usability in pyarrow 
itself (if you want to work with pyarrow Tables created from pandas) and 
2) interchange: when exchanging Parquet files with other people or other 
non-pandas systems, it would be much nicer not to have `__index_level_n__` 
column names if possible.
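
   To make the difference concrete, here is a small sketch comparing the two naming schemes on plain lists of names. Both helper names are hypothetical and exist only for this illustration. The first mirrors the generated-name scheme this PR moves away from; the second mirrors the rule proposed here.

```python
def generated_index_field_names(index_names, column_names):
    """Every index level gets a generated '__index_level_<i>__' name."""
    return ['__index_level_{:d}__'.format(i) for i in range(len(index_names))]


def preserved_index_field_names(index_names, column_names):
    """Keep the index name unless it is missing or collides with a column."""
    result = []
    for i, name in enumerate(index_names):
        if name is not None and name not in column_names:
            result.append(name)
        else:
            result.append('__index_level_{:d}__'.format(i))
    return result


columns = ['a', 'b']
index_names = ['date', 'a', None]  # named, colliding with a column, unnamed

print(generated_index_field_names(index_names, columns))
# ['__index_level_0__', '__index_level_1__', '__index_level_2__']
print(preserved_index_field_names(index_names, columns))
# ['date', '__index_level_1__', '__index_level_2__']
```

   Only the colliding and unnamed levels fall back to generated names, so the duplicate-name case from ARROW-1754 is still avoided.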
   




[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2017-11-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248173#comment-16248173
 ] 

ASF GitHub Bot commented on ARROW-1754:
---

pjdufour commented on issue #1271: ARROW-1754: [Python] Fix buggy Parquet 
roundtrip when an index name is the same as a column name
URL: https://github.com/apache/arrow/pull/1271#issuecomment-343614629
 
 
   Yes!




[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2017-10-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227097#comment-16227097
 ] 

ASF GitHub Bot commented on ARROW-1754:
---

wesm closed pull request #1271: ARROW-1754: [Python] Fix buggy Parquet 
roundtrip when an index name is the same as a column name
URL: https://github.com/apache/arrow/pull/1271
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc
index f04e9b05a..f0f0f6758 100644
--- a/cpp/src/arrow/ipc/metadata-internal.cc
+++ b/cpp/src/arrow/ipc/metadata-internal.cc
@@ -72,7 +72,7 @@ MetadataVersion GetMetadataVersion(flatbuf::MetadataVersion version) {
     case flatbuf::MetadataVersion_V4:
       // Arrow >= 0.8
       return MetadataVersion::V4;
-      // Add cases as other versions become available
+    // Add cases as other versions become available
     default:
       return MetadataVersion::V4;
   }
diff --git a/python/pyarrow/pandas_compat.py b/python/pyarrow/pandas_compat.py
index d6c844c84..1984598ff 100644
--- a/python/pyarrow/pandas_compat.py
+++ b/python/pyarrow/pandas_compat.py
@@ -18,7 +18,6 @@
 import ast
 import collections
 import json
-import re
 
 import numpy as np
 import pandas as pd
@@ -29,13 +28,6 @@
 from pyarrow.compat import PY2, zip_longest  # noqa
 
 
-INDEX_LEVEL_NAME_REGEX = re.compile(r'^__index_level_\d+__$')
-
-
-def is_unnamed_index_level(name):
-    return INDEX_LEVEL_NAME_REGEX.match(name) is not None
-
-
 def infer_dtype(column):
     try:
         return pd.api.types.infer_dtype(column)
@@ -143,7 +135,7 @@ def get_column_metadata(column, name, arrow_type):
 
     Parameters
     ----------
-    column : pandas.Series
+    column : pandas.Series or pandas.Index
     name : str
     arrow_type : pyarrow.DataType
 
@@ -161,7 +153,7 @@ def get_column_metadata(column, name, arrow_type):
         }
         string_dtype = 'object'
 
-    if not isinstance(name, six.string_types):
+    if name is not None and not isinstance(name, six.string_types):
         raise TypeError(
             'Column name must be a string. Got column {} of type {}'.format(
                 name, type(name).__name__
@@ -176,23 +168,7 @@ def get_column_metadata(column, name, arrow_type):
     }
 
 
-def index_level_name(index, i):
-    """Return the name of an index level or a default name if `index.name` is
-    None.
-
-    Parameters
-    ----------
-    index : pandas.Index
-    i : int
-
-    Returns
-    -------
-    name : str
-    """
-    if index.name is not None:
-        return index.name
-    else:
-        return '__index_level_{:d}__'.format(i)
+index_level_name = '__index_level_{:d}__'.format
 
 
 def construct_metadata(df, column_names, index_levels, preserve_index, types):
@@ -222,11 +198,11 @@ def construct_metadata(df, column_names, index_levels, preserve_index, types):
     ]
 
     if preserve_index:
-        index_column_names = [index_level_name(level, i)
-                              for i, level in enumerate(index_levels)]
+        index_column_names = list(map(
+            index_level_name, range(len(index_levels))
+        ))
         index_column_metadata = [
-            get_column_metadata(level, name=index_level_name(level, i),
-                                arrow_type=arrow_type)
+            get_column_metadata(level, name=level.name, arrow_type=arrow_type)
             for i, (level, arrow_type) in enumerate(
                 zip(index_levels, index_types)
             )
@@ -317,7 +293,7 @@ def dataframe_to_arrays(df, schema, preserve_index, nthreads=1):
     for i, column in enumerate(index_columns):
         columns_to_convert.append(column)
         convert_types.append(None)
-        names.append(index_level_name(column, i))
+        names.append(index_level_name(i))
 
     # NOTE(wesm): If nthreads=None, then we use a heuristic to decide whether
     # using a thread pool is worth it. Currently the heuristic is whether the
@@ -378,6 +354,7 @@ def table_to_blockmanager(options, table, memory_pool, nthreads=1):
     import pyarrow.lib as lib
 
     index_columns = []
+    columns = []
     column_indexes = []
     index_arrays = []
     index_names = []
@@ -390,6 +367,7 @@ def table_to_blockmanager(options, table, memory_pool, nthreads=1):
     if has_pandas_metadata:
         pandas_metadata = json.loads(metadata[b'pandas'].decode('utf8'))
         index_columns = pandas_metadata['index_columns']
+        columns = pandas_metadata['columns']
         column_indexes = pandas_metadata.get('column_indexes', [])
         table = _add_any_metadata(table, pandas_metadata)
 
@@ -397,11 +375,11 @@ def 

[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2017-10-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227098#comment-16227098
 ] 

ASF GitHub Bot commented on ARROW-1754:
---

wesm commented on issue #1271: ARROW-1754: [Python] Fix buggy Parquet roundtrip 
when an index name is the same as a column name
URL: https://github.com/apache/arrow/pull/1271#issuecomment-340827241
 
 
   thanks @cpcloud!




[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2017-10-30 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225799#comment-16225799
 ] 

Phillip Cloud commented on ARROW-1754:
--

I think we should solve this by always making the index column names follow the 
pattern for unnamed columns, namely {{__index_level_<n>__}}, along with 
changing {{index_columns}} to be a list of dictionaries mapping the raw arrow 
column name to either {{None}} or the actual column name.

I'll update the pandas metadata spec accordingly.
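
A minimal sketch of the proposal above, assuming the metadata layout exactly as worded in this comment (one single-entry dictionary per index level). The function names are hypothetical, and this is not necessarily the format that ended up in the pandas metadata spec.

```python
def build_index_columns(index_names):
    """Map each generated Arrow column name to the original index name.

    Every level gets a positional name, whether or not it was named;
    the original name (or None) is kept only in the metadata.
    """
    return [
        {'__index_level_{:d}__'.format(i): name}
        for i, name in enumerate(index_names)
    ]


def original_index_names(index_columns):
    """Recover the original index names from the metadata entries."""
    names = []
    for entry in index_columns:
        # Each entry holds exactly one raw-name -> original-name pair.
        (_raw_name, original), = entry.items()
        names.append(original)
    return names


metadata = build_index_columns(['a', None])
print(metadata)
print(original_index_names(metadata))
```

Because the Arrow column is always named positionally, an index called `a` can no longer collide with a data column called `a`; the cost is that the stored field names are less readable outside pandas.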



[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name

2017-10-30 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224988#comment-16224988
 ] 

Wes McKinney commented on ARROW-1754:
-

[~cpcloud] could you take a look at this?
