[jira] [Commented] (ARROW-1938) [Python] Error writing to partitioned Parquet dataset
[ https://issues.apache.org/jira/browse/ARROW-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432773#comment-16432773 ]

Joshua Storck commented on ARROW-1938:
---------------------------------------

Bug fix in this PR: https://github.com/apache/parquet-cpp/pull/453

> [Python] Error writing to partitioned Parquet dataset
> ------------------------------------------------------
>
>              Key: ARROW-1938
>              URL: https://issues.apache.org/jira/browse/ARROW-1938
>          Project: Apache Arrow
>       Issue Type: Bug
>       Components: Python
> Affects Versions: 0.8.0
>      Environment: Linux (Ubuntu 16.04)
>         Reporter: Robert Dailey
>         Priority: Major
>           Labels: pull-request-available
>          Fix For: 0.10.0
>
>      Attachments: ARROW-1938-test-data.csv.gz, ARROW-1938.py, pyarrow_dataset_error.png
>
>
> I receive the following error after upgrading to pyarrow 0.8.0 when writing to a dataset:
> * ArrowIOError: Column 3 had 187374 while previous column had 1
>
> The command was:
>
> write_table_values = {'row_group_size': 1}
> pq.write_to_dataset(pa.Table.from_pandas(df, preserve_index=True), '/logs/parsed/test', partition_cols=['Product', 'year', 'month', 'day', 'hour'], **write_table_values)
>
> I've also tried write_table_values = {'chunk_size': 1} and received the same error.
> This same command works in version 0.7.1. I am trying to troubleshoot the problem but wanted to submit a ticket.
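For context, a minimal, self-contained sketch of the call quoted above is shown here. The toy DataFrame is a stand-in (the real data is in the attached ARROW-1938-test-data.csv.gz) and is far too small to actually reproduce the failure; it only illustrates the call shape, with a category column and the small row_group_size from the report.

# Sketch of the reported call; the DataFrame contents are made up, and
# '/logs/parsed/test' is the output path from the ticket.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    'Product': pd.Series(['a', 'b', 'a'], dtype='category'),  # dictionary column
    'year': [2018, 2018, 2018],
    'month': [1, 1, 2],
    'day': [1, 2, 3],
    'hour': [0, 1, 2],
    'value': [1.0, 2.0, 3.0],
})

write_table_values = {'row_group_size': 1}
pq.write_to_dataset(
    pa.Table.from_pandas(df, preserve_index=True),
    '/logs/parsed/test',
    partition_cols=['Product', 'year', 'month', 'day', 'hour'],
    **write_table_values,
)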
[jira] [Commented] (ARROW-1938) [Python] Error writing to partitioned Parquet dataset
[ https://issues.apache.org/jira/browse/ARROW-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432768#comment-16432768 ]

ASF GitHub Bot commented on ARROW-1938:
----------------------------------------

joshuastorck opened a new pull request #453: Bug fix for ARROW-1938
URL: https://github.com/apache/parquet-cpp/pull/453

The error was reported here: https://issues.apache.org/jira/browse/ARROW-1938. Because dictionary types are not yet supported for writing, the code converts a dictionary column to its actual values before writing. However, the existing code accidentally used zero as the offset and the length of the column as the size, which wrote all of the column's values for every chunk of the column that was supposed to be written. The fix is to pass the correct offset and size when recursively calling through to WriteColumnChunk with the "flattened" data.
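To make the offset/size point concrete at the Python level, here is an illustration only; it is not the parquet-cpp change itself, the array contents are made up, and a reasonably recent pyarrow is assumed for the dictionary-to-string cast. Decoding the dictionary column and slicing it with the chunk's own offset and length yields the per-row-group chunk, while slicing from zero with the full length reproduces the "whole column every time" behaviour described above.

# Pyarrow-level analogue of the offset/length handling fixed in WriteColumnChunk;
# the values are made up.
import pyarrow as pa

dict_arr = pa.array(['x', 'y', 'x', 'z'],
                    type=pa.dictionary(pa.int8(), pa.string()))
dense = dict_arr.cast(pa.string())   # dictionary column decoded to plain values

offset, length = 2, 1                # the slice a single row group should get
chunk = dense.slice(offset, length)  # correct: honours the chunk's offset/length
whole = dense.slice(0, len(dense))   # buggy behaviour: the whole column every time
print(len(chunk), len(whole))        # 1 vs 4 -- the size mismatch in the error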
[jira] [Commented] (ARROW-1938) [Python] Error writing to partitioned Parquet dataset
[ https://issues.apache.org/jira/browse/ARROW-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394789#comment-16394789 ]

Wes McKinney commented on ARROW-1938:
--------------------------------------

Moving to 0.10.0, as we have not had time to diagnose the issue yet.
[jira] [Commented] (ARROW-1938) [Python] Error writing to partitioned Parquet dataset
[ https://issues.apache.org/jira/browse/ARROW-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338081#comment-16338081 ]

Robert Dailey commented on ARROW-1938:
---------------------------------------

Added data to test with, as well as the exact commands I was using. I hit the error when testing this just now.
[jira] [Commented] (ARROW-1938) [Python] Error writing to partitioned Parquet dataset
[ https://issues.apache.org/jira/browse/ARROW-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338057#comment-16338057 ]

Robert Dailey commented on ARROW-1938:
---------------------------------------

Let me gather the data I was using for this. Here are the steps I took:
* Read dataset pieces
* Concat the resulting DataFrames together
* Convert all object columns to category
* Write the concatenated DataFrame to a parquet dataset

I tried converting the columns back to strings, but I still hit the error. To get around the issue, I could take the following steps:
* Write the concatenated DataFrame to csv
* Load the csv file into pandas
* Write the DataFrame to a parquet dataset
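A hedged sketch of those steps and of the CSV workaround follows. The glob pattern, file paths, and output locations are placeholders rather than values from the ticket; pandas and pyarrow versions recent enough to provide pandas.read_parquet are assumed.

import glob
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# 1. Read dataset pieces and concatenate the resulting DataFrames.
pieces = [pd.read_parquet(p) for p in glob.glob('/logs/parsed/pieces/*.parquet')]
df = pd.concat(pieces, ignore_index=True)

# 2. Convert all object columns to category (the step associated with the error).
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].astype('category')

# 3. Write the concatenated DataFrame to a partitioned Parquet dataset.
pq.write_to_dataset(
    pa.Table.from_pandas(df, preserve_index=True),
    '/logs/parsed/test',
    partition_cols=['Product', 'year', 'month', 'day', 'hour'],
)

# Workaround described above: round-trip through CSV so the category (dictionary)
# columns come back as plain object columns before writing.
df.to_csv('/tmp/concatenated.csv', index=False)
df2 = pd.read_csv('/tmp/concatenated.csv')
pq.write_to_dataset(
    pa.Table.from_pandas(df2, preserve_index=True),
    '/logs/parsed/test_workaround',
    partition_cols=['Product', 'year', 'month', 'day', 'hour'],
)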