[jira] [Updated] (ARROW-2369) Large (>~20 GB) files written to Parquet via PyArrow are corrupted

2018-03-31 Wes McKinney (JIRA)

 [ https://issues.apache.org/jira/browse/ARROW-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2369:

Fix Version/s: 0.10.0 (was: 0.9.0)

> Large (>~20 GB) files written to Parquet via PyArrow are corrupted
> --
>
> Key: ARROW-2369
> URL: https://issues.apache.org/jira/browse/ARROW-2369
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
> Environment: Reproduced on Ubuntu + Mac OSX
>Reporter: Justin Tan
>Priority: Major
>  Labels: Parquet, bug, pandas, parquetWriter, pyarrow
> Fix For: 0.10.0
>
> Attachments: Screen Shot 2018-03-30 at 11.54.01 pm.png
>
>
> When writing large Parquet files (above 10 GB or so) from Pandas to Parquet 
> via the command
> {{pq.write_table(my_df, 'table.parquet')}}
> the write succeeds, but when the Parquet file is loaded, the error message
> {{ArrowIOError: Invalid parquet file. Corrupt footer.}}
> appears. The same error occurs when the Parquet file is written chunkwise. 
> When the Parquet files are small, say < 5 GB or so (drawn randomly from the 
> same dataset), everything proceeds as normal. I've also tried this with 
> Pandas {{df.to_parquet()}}, with the same results.
> Update: it looks like any DataFrame above ~5 GB (on disk) triggers the same 
> error.
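
For reference, a minimal sketch of the reproduction path described above, assuming pyarrow 0.9.0 and a pandas DataFrame {{df}} large enough that its Parquet output crosses the ~5 GB threshold; the DataFrame contents and file name below are illustrative, not taken from the report:

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative stand-in; the report uses a real dataset several GB on disk.
df = pd.DataFrame({'x': range(10 ** 9)})

# pq.write_table() expects a pyarrow.Table, so convert the DataFrame first.
table = pa.Table.from_pandas(df)
pq.write_table(table, 'table.parquet')  # write completes without error

# On affected versions, reading the file back raises:
#   ArrowIOError: Invalid parquet file. Corrupt footer.
loaded = pq.read_table('table.parquet')
{code}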





[jira] [Updated] (ARROW-2369) Large (>~20 GB) files written to Parquet via PyArrow are corrupted

2018-03-30 Justin Tan (JIRA)

 [ https://issues.apache.org/jira/browse/ARROW-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Tan updated ARROW-2369:
--
Description: 
When writing large Parquet files (above 10 GB or so) from Pandas to Parquet via 
the command

{{pq.write_table(my_df, 'table.parquet')}}

the write succeeds, but when the Parquet file is loaded, the error message

{{ArrowIOError: Invalid parquet file. Corrupt footer.}}

appears. The same error occurs when the Parquet file is written chunkwise. When 
the Parquet files are small, say < 5 GB or so (drawn randomly from the same 
dataset), everything proceeds as normal. I've also tried this with Pandas 
{{df.to_parquet()}}, with the same results.

Update: it looks like any DataFrame above ~5 GB (on disk) triggers the same 
error.
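
The chunkwise write mentioned above can be sketched with the {{pyarrow.parquet.ParquetWriter}} API (present in 0.9.0); the chunk size and file name are assumptions for illustration, not taken from the report:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Assumes a large pandas DataFrame `df`, as in the report.
chunk_rows = 1000000  # illustrative chunk size
writer = None
for start in range(0, len(df), chunk_rows):
    chunk = pa.Table.from_pandas(df.iloc[start:start + chunk_rows],
                                 preserve_index=False)
    if writer is None:
        # Open the writer lazily so it reuses the first chunk's schema.
        writer = pq.ParquetWriter('table.parquet', chunk.schema)
    writer.write_table(chunk)
writer.close()

# The pandas shortcut reported to fail the same way:
# df.to_parquet('table.parquet', engine='pyarrow')
{code}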

  was:
When writing large Parquet files (above 20 GB or so) from Pandas to Parquet via 
the command

{{pq.write_table(my_df, 'table.parquet')}}

the write succeeds, but when the Parquet file is loaded, the error message

{{ArrowIOError: Invalid parquet file. Corrupt footer.}}

appears. The same error occurs when the Parquet file is written chunkwise. When 
the Parquet files are small, say < 10 GB or so (drawn randomly from the same 
dataset), everything proceeds as normal. I've also tried this with Pandas 
{{df.to_parquet()}}, with the same results.

Details:

Arrow v0.9.0

Reproduced on Ubuntu and macOS


> Large (>~20 GB) files written to Parquet via PyArrow are corrupted
> --
>
> Key: ARROW-2369
> URL: https://issues.apache.org/jira/browse/ARROW-2369
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
> Environment: Reproduced on Ubuntu + Mac OSX
>Reporter: Justin Tan
>Priority: Major
>  Labels: Parquet, bug, pandas, parquetWriter, pyarrow
> Fix For: 0.9.0
>
> Attachments: Screen Shot 2018-03-30 at 11.54.01 pm.png
>
>
> When writing large Parquet files (above 10 GB or so) from Pandas to Parquet 
> via the command
> {{pq.write_table(my_df, 'table.parquet')}}
> the write succeeds, but when the Parquet file is loaded, the error message
> {{ArrowIOError: Invalid parquet file. Corrupt footer.}}
> appears. The same error occurs when the Parquet file is written chunkwise. 
> When the Parquet files are small, say < 5 GB or so (drawn randomly from the 
> same dataset), everything proceeds as normal. I've also tried this with 
> Pandas {{df.to_parquet()}}, with the same results.
> Update: it looks like any DataFrame above ~5 GB (on disk) triggers the same 
> error.





[jira] [Updated] (ARROW-2369) Large (>~20 GB) files written to Parquet via PyArrow are corrupted

2018-03-30 Justin Tan (JIRA)

 [ https://issues.apache.org/jira/browse/ARROW-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Tan updated ARROW-2369:
--
Description: 
When writing large Parquet files (above 20 GB or so) from Pandas to Parquet via 
the command

{{pq.write_table(my_df, 'table.parquet')}}

the write succeeds, but when the Parquet file is loaded, the error message

{{ArrowIOError: Invalid parquet file. Corrupt footer.}}

appears. The same error occurs when the Parquet file is written chunkwise. When 
the Parquet files are small, say < 10 GB or so (drawn randomly from the same 
dataset), everything proceeds as normal. Could this be related to memory 
constraints?

Details:

Arrow v0.9.0

Reproduced on Ubuntu and macOS

  was:
When writing large Parquet files (above 20 GB or so) from Pandas to Parquet via 
the command

{{pq.write_table(my_df, 'table.parquet')}}

the write succeeds, but when the Parquet file is loaded, the error message

{{ArrowIOError: Invalid parquet file. Corrupt footer.}}

appears. The same error occurs when the Parquet file is written chunkwise. When 
the Parquet files are small, say < 10 GB or so (drawn randomly from the same 
dataset), everything proceeds as normal.

Details:

Arrow v0.9.0

Reproduced on Ubuntu and macOS


> Large (>~20 GB) files written to Parquet via PyArrow are corrupted
> --
>
> Key: ARROW-2369
> URL: https://issues.apache.org/jira/browse/ARROW-2369
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
> Environment: Reproduced on Ubuntu + Mac OSX
>Reporter: Justin Tan
>Priority: Major
>  Labels: Parquet, bug, pandas, parquetWriter, pyarrow
> Fix For: 0.9.0
>
> Attachments: Screen Shot 2018-03-30 at 11.54.01 pm.png
>
>
> When writing large Parquet files (above 20 GB or so) from Pandas to Parquet 
> via the command
> {{pq.write_table(my_df, 'table.parquet')}}
> the write succeeds, but when the Parquet file is loaded, the error message
> {{ArrowIOError: Invalid parquet file. Corrupt footer.}}
> appears. The same error occurs when the Parquet file is written chunkwise. 
> When the Parquet files are small, say < 10 GB or so (drawn randomly from 
> the same dataset), everything proceeds as normal. Could this be related to 
> memory constraints?
> Details:
> Arrow v0.9.0
> Reproduced on Ubuntu and macOS



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)