[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350868#comment-16350868 ]

ASF GitHub Bot commented on ARROW-1754:
---

jorisvandenbossche commented on issue #1408: ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-362688457

Thanks for merging!

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
>
> Key: ARROW-1754
> URL: https://issues.apache.org/jira/browse/ARROW-1754
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.7.1
> Reporter: Wes McKinney
> Assignee: Phillip Cloud
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.8.0
>
> See upstream report
> https://stackoverflow.com/questions/47013052/issue-with-pyarrow-when-loading-parquet-file-where-index-has-redundant-column

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350694#comment-16350694 ]

ASF GitHub Bot commented on ARROW-1754:
---

wesm closed pull request #1408: ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):

diff --git a/python/pyarrow/pandas_compat.py b/python/pyarrow/pandas_compat.py
index 4a30fb3b4..240cccdaf 100644
--- a/python/pyarrow/pandas_compat.py
+++ b/python/pyarrow/pandas_compat.py
@@ -179,10 +179,8 @@ def get_column_metadata(column, name, arrow_type, field_name):
     }


-index_level_name = '__index_level_{:d}__'.format
-
-
-def construct_metadata(df, column_names, index_levels, preserve_index, types):
+def construct_metadata(df, column_names, index_levels, index_column_names,
+                       preserve_index, types):
     """Returns a dictionary containing enough metadata to reconstruct a pandas
     DataFrame as an Arrow Table, including index columns.

@@ -197,9 +195,8 @@ def construct_metadata(df, column_names, index_levels, preserve_index, types):
     -------
     dict
     """
-    ncolumns = len(column_names)
-    df_types = types[:ncolumns - len(index_levels)]
-    index_types = types[ncolumns - len(index_levels):]
+    df_types = types[:-len(index_levels)]
+    index_types = types[-len(index_levels):]

     column_metadata = [
         get_column_metadata(
@@ -213,9 +210,6 @@ def construct_metadata(df, column_names, index_levels, preserve_index, types):
     ]

     if preserve_index:
-        index_column_names = list(map(
-            index_level_name, range(len(index_levels))
-        ))
         index_column_metadata = [
             get_column_metadata(
                 level,
@@ -294,9 +288,29 @@ def _column_name_to_strings(name):
     return str(name)


+def _index_level_name(index, i, column_names):
+    """Return the name of an index level or a default name if `index.name` is
+    None or is already a column name.
+
+    Parameters
+    ----------
+    index : pandas.Index
+    i : int
+
+    Returns
+    -------
+    name : str
+    """
+    if index.name is not None and index.name not in column_names:
+        return index.name
+    else:
+        return '__index_level_{:d}__'.format(i)
+
+
 def dataframe_to_arrays(df, schema, preserve_index, nthreads=1):
-    names = []
+    column_names = []
     index_columns = []
+    index_column_names = []
     type = None

     if preserve_index:
@@ -324,12 +338,13 @@ def dataframe_to_arrays(df, schema, preserve_index, nthreads=1):

         columns_to_convert.append(col)
         convert_types.append(type)
-        names.append(name)
+        column_names.append(name)

     for i, column in enumerate(index_columns):
         columns_to_convert.append(column)
         convert_types.append(None)
-        names.append(index_level_name(i))
+        name = _index_level_name(column, i, column_names)
+        index_column_names.append(name)

     # NOTE(wesm): If nthreads=None, then we use a heuristic to decide whether
     # using a thread pool is worth it. Currently the heuristic is whether the
@@ -358,8 +373,10 @@ def convert_column(col, ty):
     types = [x.type for x in arrays]

     metadata = construct_metadata(
-        df, names, index_columns, preserve_index, types
+        df, column_names, index_columns, index_column_names, preserve_index,
+        types
     )
+    names = column_names + index_column_names
     return names, arrays, metadata


diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py
index ca2f1e361..f1f40a695 100644
--- a/python/pyarrow/tests/test_convert_pandas.py
+++ b/python/pyarrow/tests/test_convert_pandas.py
@@ -191,8 +191,9 @@ def test_index_metadata_field_name(self):
         assert idx0['field_name'] == idx0_name
         assert idx0['name'] is None

-        assert foo_name == '__index_level_1__'
-        assert foo['name'] == 'foo'
+        assert foo_name == 'foo'
+        assert foo['field_name'] == foo_name
+        assert foo['name'] == foo_name

     def test_categorical_column_index(self):
         df = pd.DataFrame(
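The patch above splits `types` into data-column types and index types with negative slices. Below is a minimal sketch of that slicing in plain Python. The helper names `split_types` and `split_types_explicit` and the string type labels are hypothetical and not part of pyarrow. It only illustrates Python slice semantics, including the zero-index-level case where `-0` behaves like `0`; whether that case is reachable depends on callers not shown in this thread.

```python
def split_types(types, n_index):
    """Split with negative slices, in the style of the patched code."""
    return types[:-n_index], types[-n_index:]


def split_types_explicit(types, n_index):
    """Split at an explicit boundary computed from the total length."""
    n_data = len(types) - n_index
    return types[:n_data], types[n_data:]


# Hypothetical type labels: two data columns followed by one index level.
types = ["int64", "double", "string"]

with_index = split_types(types, 1)
no_index_negative = split_types(types, 0)           # types[:-0] is types[:0]
no_index_explicit = split_types_explicit(types, 0)

print(with_index)          # (['int64', 'double'], ['string'])
print(no_index_negative)   # ([], ['int64', 'double', 'string'])
print(no_index_explicit)   # (['int64', 'double', 'string'], [])
```

With at least one index level the two forms agree; they differ only when there are no index levels at all.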
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350692#comment-16350692 ]

ASF GitHub Bot commented on ARROW-1754:
---

wesm commented on issue #1408: ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-362649378

No problem, I'm merging this, thanks @jorisvandenbossche!
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349945#comment-16349945 ]

ASF GitHub Bot commented on ARROW-1754:
---

jorisvandenbossche commented on issue #1408: ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-362516006

Hmm, still timing out on the first one (but the other failures seem resolved).
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349778#comment-16349778 ]

ASF GitHub Bot commented on ARROW-1754:
---

wesm commented on issue #1408: ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-362485027

It looked a lot like the failure that was happening before ARROW-2062. I triggered a new build to see if it's transient.
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349388#comment-16349388 ]

ASF GitHub Bot commented on ARROW-1754:
---

jorisvandenbossche commented on issue #1408: ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-362422663

I see the ARROW-2062 commit in the history of this branch: https://github.com/jorisvandenbossche/arrow/commits/index-names (I fetched upstream master just before I merged / pushed).

But it is failing on Travis (among other things, a timeout for the first (gcc) build). Is that why you thought this was not up to date?
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349378#comment-16349378 ]

ASF GitHub Bot commented on ARROW-1754:
---

wesm commented on issue #1408: ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-362421304

Seems like this could be a stale merge -- doesn't look like it got the ARROW-2062 patch
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349284#comment-16349284 ]

ASF GitHub Bot commented on ARROW-1754:
---

cpcloud commented on issue #1408: ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-362404429

@jorisvandenbossche Yep that's a good idea, I can merge on green.
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349282#comment-16349282 ]

ASF GitHub Bot commented on ARROW-1754:
---

jorisvandenbossche commented on issue #1408: ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-362403901

Regarding the PR backlog, given the comments above I think there was agreement to merge this. There are no merge conflicts yet, but should I update with master to ensure tests are still passing?
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338325#comment-16338325 ]

ASF GitHub Bot commented on ARROW-1754:
---

cpcloud commented on issue #1408: ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-360292671

LGTM
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338322#comment-16338322 ]

ASF GitHub Bot commented on ARROW-1754:
---

cpcloud commented on a change in pull request #1408: ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#discussion_r163696046

## File path: python/pyarrow/pandas_compat.py ##
@@ -294,9 +288,29 @@ def _column_name_to_strings(name):
     return str(name)


+def _index_level_name(index, i, column_names):
+    """Return the name of an index level or a default name if `index.name` is
+    None or is already a column name.
+
+    Parameters
+    ----------
+    index : pandas.Index
+    i : int
+
+    Returns
+    -------
+    name : str
+    """
+    if index.name is not None and index.name not in column_names:

Review comment:
Fine by me.
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338295#comment-16338295 ]

ASF GitHub Bot commented on ARROW-1754:
---

jorisvandenbossche commented on a change in pull request #1408: ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#discussion_r163692020

## File path: python/pyarrow/pandas_compat.py ##
@@ -294,9 +288,29 @@ def _column_name_to_strings(name):
     return str(name)


+def _index_level_name(index, i, column_names):
+    """Return the name of an index level or a default name if `index.name` is
+    None or is already a column name.
+
+    Parameters
+    ----------
+    index : pandas.Index
+    i : int
+
+    Returns
+    -------
+    name : str
+    """
+    if index.name is not None and index.name not in column_names:

Review comment:
I did some timings, and conversion to a set typically takes twice the time of a single search in the list. So you already need to have 3 index levels to benefit from this, and I don't think this is the typical use case? So I would personally leave it as is, but can certainly also easily add the suggestion.
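The timing comparison described in the review comment can be approximated with `timeit`. This is a rough sketch under invented inputs: the 100-entry `column_names` list, the looked-up name and the repeat count are all assumptions, not the measurements made for the PR. Absolute numbers vary by machine and Python version, so no expected output is shown.

```python
import timeit

# Hypothetical inputs: 100 column names and a name that is not among them,
# which is the slowest case for a list membership test.
column_names = ["col_{}".format(i) for i in range(100)]
candidate = "idx"
repeats = 20000

# Cost of one linear search in the list.
list_search = timeit.timeit(lambda: candidate in column_names, number=repeats)

# Cost of building the set, which must be paid once before any set lookup.
set_build = timeit.timeit(lambda: set(column_names), number=repeats)

# Cost of one lookup in an already built set.
name_set = set(column_names)
set_lookup = timeit.timeit(lambda: candidate in name_set, number=repeats)

print("list search: {:.4f}s".format(list_search))
print("set build:   {:.4f}s".format(set_build))
print("set lookup:  {:.4f}s".format(set_lookup))
```

The set only pays off once the number of lookups (one per index level) is large enough to amortise the build cost, which is the trade-off the comment describes.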
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329749#comment-16329749 ]

ASF GitHub Bot commented on ARROW-1754:
---

wesm commented on issue #1408: ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-358495585

Rebased
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329727#comment-16329727 ]

ASF GitHub Bot commented on ARROW-1754:
---

cpcloud commented on issue #1408: ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-358491875

LGTM other than the comment. Should be rebased to run tests against current master.
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329725#comment-16329725 ]

ASF GitHub Bot commented on ARROW-1754:
---

cpcloud commented on a change in pull request #1408: ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#discussion_r162216610

## File path: python/pyarrow/pandas_compat.py ##
@@ -294,9 +288,29 @@ def _column_name_to_strings(name):
     return str(name)


+def _index_level_name(index, i, column_names):
+    """Return the name of an index level or a default name if `index.name` is
+    None or is already a column name.
+
+    Parameters
+    ----------
+    index : pandas.Index
+    i : int
+
+    Returns
+    -------
+    name : str
+    """
+    if index.name is not None and index.name not in column_names:

Review comment:
Should we be concerned about the linear search for `index.name not in column_names`? If so, let's create a set outside the loop below that we can check so that we don't need to do a full scan of the column names for every index column.
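A sketch of the suggestion in this review comment: build the set once, outside the loop over index levels, and pass it to the naming helper. `FakeIndex`, `index_level_name` and the sample names are hypothetical stand-ins so the example runs without pandas. The diff shown in this thread passes the `column_names` list to `_index_level_name`, so this is an illustration of the alternative rather than the merged code.

```python
from collections import namedtuple

# Stand-in for pandas.Index: only the `.name` attribute is needed here.
FakeIndex = namedtuple("FakeIndex", ["name"])


def index_level_name(index, i, taken_names):
    """Keep the index name unless it is None or clashes with a column name."""
    if index.name is not None and index.name not in taken_names:
        return index.name
    return "__index_level_{:d}__".format(i)


column_names = ["a", "b", "c"]
index_levels = [FakeIndex("a"), FakeIndex(None), FakeIndex("key")]

# Built once, so each index level costs a hash lookup instead of a list scan.
taken = set(column_names)

index_column_names = [
    index_level_name(level, i, taken) for i, level in enumerate(index_levels)
]
print(index_column_names)  # ['__index_level_0__', '__index_level_1__', 'key']
```

Passing a set requires the names to be hashable, which holds for the strings used here.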
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329718#comment-16329718 ]

ASF GitHub Bot commented on ARROW-1754:
---

cpcloud commented on issue #1408: ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-358490397

Looking now.
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329328#comment-16329328 ]

ASF GitHub Bot commented on ARROW-1754:
---

wesm commented on issue #1408: ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-358423209

OK, if @cpcloud could take a look at this and advise (since he worked on this code most recently) I'm fine with merging
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329066#comment-16329066 ]

ASF GitHub Bot commented on ARROW-1754:
---

jorisvandenbossche commented on issue #1408: ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-358380263

My opinion is to merge this, but I had the feeling nobody else was feeling strongly in favor of it. See the top-level post for my reasoning.
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329059#comment-16329059 ]

ASF GitHub Bot commented on ARROW-1754:
---

wesm commented on issue #1408: ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408#issuecomment-358378386

Can this be closed?
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16285395#comment-16285395 ]

ASF GitHub Bot commented on ARROW-1754:
---------------------------------------

jorisvandenbossche opened a new pull request #1408: ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408

Related to the discussion about the pandas metadata specification in https://github.com/pandas-dev/pandas/pull/18201, and an alternative to https://github.com/apache/arrow/pull/1271.

I am not opening this PR because it should necessarily be merged; I just want to show that it is not that difficult to both fix [ARROW-1754](https://issues.apache.org/jira/browse/ARROW-1754) and preserve index names as field names when possible (fixing that bug was mentioned in https://github.com/pandas-dev/pandas/pull/18201 as the reason for no longer preserving index names).

The diff is partly a revert of https://github.com/apache/arrow/pull/1271, adapted to the current codebase.

Main reasons I prefer to preserve index names:
1) usability in pyarrow itself (if you want to work with pyarrow Tables created from pandas), and
2) when exchanging Parquet files with other people or other non-pandas systems, it would be much nicer not to have `__index_level_n__` column names if possible.
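The idea described in this pull request (keep the index name as the field name when possible, fall back to a generated name otherwise) can be sketched in plain Python. This is only an illustration under stated assumptions: `choose_field_names` is a hypothetical helper, not a pyarrow function, and the rule "fall back when the name is missing or already taken" is an assumed reading of "when possible". It also ignores the corner case of a data column that is itself named like `__index_level_0__`.

```python
def choose_field_names(column_names, index_names):
    """Pick a field name for each index level (hypothetical helper).

    An index level keeps its own name unless the name is missing or
    already used by a data column or by an earlier index level; in that
    case the generated '__index_level_<n>__' name is used instead.
    """
    taken = set(column_names)
    field_names = []
    for i, name in enumerate(index_names):
        if isinstance(name, str) and name not in taken:
            field_name = name
        else:
            field_name = '__index_level_{:d}__'.format(i)
        taken.add(field_name)
        field_names.append(field_name)
    return field_names


# Index named like a data column: the generated name avoids the clash.
print(choose_field_names(['a', 'b'], ['a']))
# A named level is preserved; an unnamed level gets a generated name.
print(choose_field_names(['a', 'b'], ['idx', None]))
```

With this rule, a file written from a DataFrame whose index has a unique name would expose that name to non-pandas readers, while the clashing case reported in ARROW-1754 would still get a distinct field name.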
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248173#comment-16248173 ]

ASF GitHub Bot commented on ARROW-1754:
---------------------------------------

pjdufour commented on issue #1271: ARROW-1754: [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
URL: https://github.com/apache/arrow/pull/1271#issuecomment-343614629

Yes!
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16227097#comment-16227097 ]

ASF GitHub Bot commented on ARROW-1754:
---------------------------------------

wesm closed pull request #1271: ARROW-1754: [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
URL: https://github.com/apache/arrow/pull/1271

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc
index f04e9b05a..f0f0f6758 100644
--- a/cpp/src/arrow/ipc/metadata-internal.cc
+++ b/cpp/src/arrow/ipc/metadata-internal.cc
@@ -72,7 +72,7 @@ MetadataVersion GetMetadataVersion(flatbuf::MetadataVersion version) {
     case flatbuf::MetadataVersion_V4:
       // Arrow >= 0.8
       return MetadataVersion::V4;
-      // Add cases as other versions become available
+    // Add cases as other versions become available
     default:
       return MetadataVersion::V4;
   }
diff --git a/python/pyarrow/pandas_compat.py b/python/pyarrow/pandas_compat.py
index d6c844c84..1984598ff 100644
--- a/python/pyarrow/pandas_compat.py
+++ b/python/pyarrow/pandas_compat.py
@@ -18,7 +18,6 @@
 import ast
 import collections
 import json
-import re
 
 import numpy as np
 import pandas as pd
@@ -29,13 +28,6 @@
 from pyarrow.compat import PY2, zip_longest  # noqa
 
 
-INDEX_LEVEL_NAME_REGEX = re.compile(r'^__index_level_\d+__$')
-
-
-def is_unnamed_index_level(name):
-    return INDEX_LEVEL_NAME_REGEX.match(name) is not None
-
-
 def infer_dtype(column):
     try:
         return pd.api.types.infer_dtype(column)
@@ -143,7 +135,7 @@ def get_column_metadata(column, name, arrow_type):
 
     Parameters
     ----------
-    column : pandas.Series
+    column : pandas.Series or pandas.Index
     name : str
     arrow_type : pyarrow.DataType
 
@@ -161,7 +153,7 @@ def get_column_metadata(column, name, arrow_type):
         }
         string_dtype = 'object'
 
-    if not isinstance(name, six.string_types):
+    if name is not None and not isinstance(name, six.string_types):
         raise TypeError(
             'Column name must be a string. Got column {} of type {}'.format(
                 name, type(name).__name__
@@ -176,23 +168,7 @@ def get_column_metadata(column, name, arrow_type):
     }
 
 
-def index_level_name(index, i):
-    """Return the name of an index level or a default name if `index.name` is
-    None.
-
-    Parameters
-    ----------
-    index : pandas.Index
-    i : int
-
-    Returns
-    -------
-    name : str
-    """
-    if index.name is not None:
-        return index.name
-    else:
-        return '__index_level_{:d}__'.format(i)
+index_level_name = '__index_level_{:d}__'.format
 
 
 def construct_metadata(df, column_names, index_levels, preserve_index, types):
@@ -222,11 +198,11 @@ def construct_metadata(df, column_names, index_levels, preserve_index, types):
     ]
 
     if preserve_index:
-        index_column_names = [index_level_name(level, i)
-                              for i, level in enumerate(index_levels)]
+        index_column_names = list(map(
+            index_level_name, range(len(index_levels))
+        ))
         index_column_metadata = [
-            get_column_metadata(level, name=index_level_name(level, i),
-                                arrow_type=arrow_type)
+            get_column_metadata(level, name=level.name, arrow_type=arrow_type)
             for i, (level, arrow_type) in enumerate(
                 zip(index_levels, index_types)
             )
@@ -317,7 +293,7 @@ def dataframe_to_arrays(df, schema, preserve_index, nthreads=1):
     for i, column in enumerate(index_columns):
         columns_to_convert.append(column)
         convert_types.append(None)
-        names.append(index_level_name(column, i))
+        names.append(index_level_name(i))
 
     # NOTE(wesm): If nthreads=None, then we use a heuristic to decide whether
     # using a thread pool is worth it. Currently the heuristic is whether the
@@ -378,6 +354,7 @@ def table_to_blockmanager(options, table, memory_pool, nthreads=1):
     import pyarrow.lib as lib
 
     index_columns = []
+    columns = []
     column_indexes = []
     index_arrays = []
     index_names = []
@@ -390,6 +367,7 @@ def table_to_blockmanager(options, table, memory_pool, nthreads=1):
     if has_pandas_metadata:
         pandas_metadata = json.loads(metadata[b'pandas'].decode('utf8'))
         index_columns = pandas_metadata['index_columns']
+        columns = pandas_metadata['columns']
         column_indexes = pandas_metadata.get('column_indexes', [])
         table = _add_any_metadata(table, pandas_metadata)
 
@@ -397,11 +375,11 @@ def table_to
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16227098#comment-16227098 ]

ASF GitHub Bot commented on ARROW-1754:
---------------------------------------

wesm commented on issue #1271: ARROW-1754: [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
URL: https://github.com/apache/arrow/pull/1271#issuecomment-340827241

thanks @cpcloud!
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225799#comment-16225799 ]

Phillip Cloud commented on ARROW-1754:
--------------------------------------

I think we should solve this by always making the index column names follow the pattern for unnamed columns, namely {{__index_level_<n>__}}, along with changing {{index_columns}} to be a list of dictionaries mapping the raw arrow column name to either {{None}} or the actual column name. I'll update the pandas metadata spec accordingly.
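A minimal sketch of the proposal in this comment, in plain Python. Both function names are hypothetical and the layout follows only the wording above (one dictionary per index level, mapping the generated Arrow column name to the real name or `None`); it is not the pandas metadata spec itself, and the format that was eventually adopted may differ.

```python
def describe_index_columns(index_names):
    """Build an 'index_columns' entry per index level (hypothetical helper).

    Every level gets a generated Arrow column name, so an index name can
    never collide with a data column; the real name is kept as the value.
    """
    return [
        {'__index_level_{:d}__'.format(i): name}
        for i, name in enumerate(index_names)
    ]


def restore_index_names(index_columns):
    """Recover the original index names from the entries built above."""
    return [name for entry in index_columns for name in entry.values()]


entries = describe_index_columns(['a', None])
print(entries)
print(restore_index_names(entries))
```

Because the stored column name is always generated, a data column called `a` and an index called `a` can coexist in one table, and the reader still recovers the index name from the mapping.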
[jira] [Commented] (ARROW-1754) [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224988#comment-16224988 ]

Wes McKinney commented on ARROW-1754:
-------------------------------------

[~cpcloud] could you take a look at this?