[jira] [Updated] (ARROW-2369) Large (>~20 GB) files written to Parquet via PyArrow are corrupted
[ https://issues.apache.org/jira/browse/ARROW-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2369:
-------------------------------
    Fix Version/s: (was: 0.9.0)
                   0.10.0

> Large (>~20 GB) files written to Parquet via PyArrow are corrupted
> ------------------------------------------------------------------
>
>                 Key: ARROW-2369
>                 URL: https://issues.apache.org/jira/browse/ARROW-2369
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.9.0
>         Environment: Reproduced on Ubuntu + Mac OSX
>            Reporter: Justin Tan
>            Priority: Major
>              Labels: Parquet, bug, pandas, parquetWriter, pyarrow
>             Fix For: 0.10.0
>
>         Attachments: Screen Shot 2018-03-30 at 11.54.01 pm.png
>
>
> When writing large Parquet files (above 10 GB or so) from Pandas to Parquet
> via the command
> {{pq.write_table(my_df, 'table.parquet')}}
> The write succeeds, but when the parquet file is loaded, the error message
> {{ArrowIOError: Invalid parquet file. Corrupt footer.}}
> appears. This same error occurs when the parquet file is written chunkwise as
> well. When the parquet files are small, say < 5 GB or so (drawn randomly from
> the same dataset), everything proceeds as normal. I've also tried this with
> Pandas df.to_parquet(), with the same results.
> Update: Looks like any DataFrame with size above ~5GB (on disk) returns the
> same error.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
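The {{Corrupt footer}} error in the report concerns the end of the file. In the Parquet file format, an unencrypted file begins and ends with the 4-byte magic {{PAR1}}, and the 4 bytes just before the trailing magic hold the little-endian length of the file metadata. The following is a minimal, stdlib-only sketch for inspecting those bytes on a suspect file. The function name {{inspect_parquet_footer}} and the keys of the returned dict are hypothetical, chosen for illustration. It does not reproduce the bug or identify its cause; it only shows whether the trailer of a written file looks structurally plausible.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"  # magic bytes of an unencrypted Parquet file


def inspect_parquet_footer(path):
    """Check the leading magic and the 8-byte trailer of a Parquet file.

    Hypothetical helper for illustration; it reads only 12 bytes, so it is
    cheap to run on multi-gigabyte files.
    """
    size = os.path.getsize(path)
    if size < 12:
        # Too small to hold leading magic + footer length + trailing magic.
        return {"size": size, "magic_ok": False,
                "footer_len": None, "plausible": False}

    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(size - 8)
        tail = f.read(8)

    # Trailer layout: 4-byte little-endian metadata length, then the magic.
    (footer_len,) = struct.unpack("<I", tail[:4])
    magic_ok = head == PARQUET_MAGIC and tail[4:] == PARQUET_MAGIC

    # The metadata must fit between the leading magic and the trailer.
    plausible = magic_ok and 0 < footer_len <= size - 12
    return {"size": size, "magic_ok": magic_ok,
            "footer_len": footer_len, "plausible": plausible}
```

Calling {{inspect_parquet_footer('table.parquet')}} on a file that fails to load would show whether the trailing magic is present and whether the recorded metadata length fits inside the file. A result with {{plausible}} set to true does not prove the file is valid, since the metadata itself is not parsed.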
[jira] [Updated] (ARROW-2369) Large (>~20 GB) files written to Parquet via PyArrow are corrupted
[ https://issues.apache.org/jira/browse/ARROW-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Tan updated ARROW-2369:
-----------------------------
    Description:
When writing large Parquet files (above 10 GB or so) from Pandas to Parquet via the command

{{pq.write_table(my_df, 'table.parquet')}}

The write succeeds, but when the parquet file is loaded, the error message

{{ArrowIOError: Invalid parquet file. Corrupt footer.}}

appears. This same error occurs when the parquet file is written chunkwise as well. When the parquet files are small, say < 5 GB or so (drawn randomly from the same dataset), everything proceeds as normal. I've also tried this with Pandas df.to_parquet(), with the same results.

Update: Looks like any DataFrame with size above ~5GB (on disk) returns the same error.

  was:
When writing large Parquet files (above 20 GB or so) from Pandas to Parquet via the command

{{pq.write_table(my_df, 'table.parquet')}}

The write succeeds, but when the parquet file is loaded, the error message

{{ArrowIOError: Invalid parquet file. Corrupt footer.}}

appears. This same error occurs when the parquet file is written chunkwise as well. When the parquet files are small, say < 10 GB or so (drawn randomly from the same dataset), everything proceeds as normal. I've also tried this with Pandas df.to_parquet(), with the same results.

Details:

Arrow v0.9.0

Reproduced on Ubuntu, Mac osx


> Large (>~20 GB) files written to Parquet via PyArrow are corrupted
> ------------------------------------------------------------------
>
>                 Key: ARROW-2369
>                 URL: https://issues.apache.org/jira/browse/ARROW-2369
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.9.0
>         Environment: Reproduced on Ubuntu + Mac OSX
>            Reporter: Justin Tan
>            Priority: Major
>              Labels: Parquet, bug, pandas, parquetWriter, pyarrow
>             Fix For: 0.9.0
>
>         Attachments: Screen Shot 2018-03-30 at 11.54.01 pm.png
>
>
> When writing large Parquet files (above 10 GB or so) from Pandas to Parquet
> via the command
> {{pq.write_table(my_df, 'table.parquet')}}
> The write succeeds, but when the parquet file is loaded, the error message
> {{ArrowIOError: Invalid parquet file. Corrupt footer.}}
> appears. This same error occurs when the parquet file is written chunkwise as
> well. When the parquet files are small, say < 5 GB or so (drawn randomly from
> the same dataset), everything proceeds as normal. I've also tried this with
> Pandas df.to_parquet(), with the same results.
> Update: Looks like any DataFrame with size above ~5GB (on disk) returns the
> same error.
[jira] [Updated] (ARROW-2369) Large (>~20 GB) files written to Parquet via PyArrow are corrupted
[ https://issues.apache.org/jira/browse/ARROW-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Tan updated ARROW-2369:
-----------------------------
    Description:
When writing large Parquet files (above 20 GB or so) from Pandas to Parquet via the command

{{pq.write_table(my_df, 'table.parquet')}}

The write succeeds, but when the parquet file is loaded, the error message

{{ArrowIOError: Invalid parquet file. Corrupt footer.}}

appears. This same error occurs when the parquet file is written chunkwise as well. When the parquet files are small, say < 10 GB or so (drawn randomly from the same dataset), everything proceeds as normal. Could this be related to memory constraints?

Details:

Arrow v0.9.0

Reproduced on Ubuntu, Mac osx

  was:
When writing large Parquet files (above 20 GB or so) from Pandas to Parquet via the command

{{pq.write_table(my_df, 'table.parquet')}}

The write succeeds, but when the parquet file is loaded, the error message

{{ArrowIOError: Invalid parquet file. Corrupt footer.}}

appears. This same error occurs when the parquet file is written chunkwise as well. When the parquet files are small, say < 10 GB or so (drawn randomly from the same dataset), everything proceeds as normal.

Details:

Arrow v0.9.0

Reproduced on Ubuntu, Mac osx


> Large (>~20 GB) files written to Parquet via PyArrow are corrupted
> ------------------------------------------------------------------
>
>                 Key: ARROW-2369
>                 URL: https://issues.apache.org/jira/browse/ARROW-2369
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.9.0
>         Environment: Reproduced on Ubuntu + Mac OSX
>            Reporter: Justin Tan
>            Priority: Major
>              Labels: Parquet, bug, pandas, parquetWriter, pyarrow
>             Fix For: 0.9.0
>
>         Attachments: Screen Shot 2018-03-30 at 11.54.01 pm.png
>
>
> When writing large Parquet files (above 20 GB or so) from Pandas to Parquet
> via the command
> {{pq.write_table(my_df, 'table.parquet')}}
> The write succeeds, but when the parquet file is loaded, the error message
> {{ArrowIOError: Invalid parquet file. Corrupt footer.}}
> appears. This same error occurs when the parquet file is written chunkwise as
> well. When the parquet files are small, say < 10 GB or so (drawn randomly
> from the same dataset), everything proceeds as normal. Could this be related
> to memory constraints?
> Details:
> Arrow v0.9.0
> Reproduced on Ubuntu, Mac osx