[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column
[ https://issues.apache.org/jira/browse/ARROW-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-3861:
----------------------------------
    Labels: dataset dataset-parquet-read parquet pull-request-available python  (was: dataset dataset-parquet-read parquet python)

> [Python] ParquetDataset().read columns argument always returns partition column
> --------------------------------------------------------------------------------
>
>                 Key: ARROW-3861
>                 URL: https://issues.apache.org/jira/browse/ARROW-3861
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Christian Thiel
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, dataset-parquet-read, parquet, pull-request-available, python
>             Fix For: 1.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I just noticed that no matter which columns are specified when loading a dataset, the partition column is always returned. This can lead to surprising behaviour, because the resulting dataframe has more columns than requested:
> {code}
> import os
> import shutil
>
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
>
> if os.path.exists(PATH_PYARROW_MANUAL):
>     shutil.rmtree(PATH_PYARROW_MANUAL)
> os.mkdir(PATH_PYARROW_MANUAL)
>
> arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
> strings = np.array([np.nan, np.nan, 'a', 'b'])
>
> df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
> df.index.name = 'DPRD_ID'
> df['arrays'] = pd.Series(arrays)
> df['strings'] = pd.Series(strings)
>
> my_schema = pa.schema([('DPRD_ID', pa.int64()),
>                        ('partition_column', pa.int32()),
>                        ('arrays', pa.list_(pa.int32())),
>                        ('strings', pa.string()),
>                        ('new_column', pa.string())])
>
> table = pa.Table.from_pandas(df, schema=my_schema)
> pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL,
>                     partition_cols=['partition_column'])
>
> df_pq = pq.ParquetDataset(PATH_PYARROW_MANUAL).read(columns=['DPRD_ID', 'strings']).to_pandas()
> # pd.read_parquet(PATH_PYARROW_MANUAL, columns=['DPRD_ID', 'strings'], engine='pyarrow')
> df_pq
> {code}
> df_pq has column `partition_column`

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
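A minimal client-side workaround sketch (not part of the original report): re-select only the requested columns after the read. The path and column names are taken from the reproduction script above; note that `to_pandas()` restores `DPRD_ID` as the index, so the selection is done on the resulting DataFrame.

{code}
import pyarrow.parquet as pq

# Columns actually wanted; 'partition_column' is intentionally not listed.
requested = ['DPRD_ID', 'strings']

table = pq.ParquetDataset('/tmp/pyarrow_manual.pa/').read(columns=requested)
df = table.to_pandas()

# Keep only what was requested, dropping the partition column the reader
# currently adds back. 'DPRD_ID' has become the index by this point, so only
# requested names still present as columns are kept.
df = df[[c for c in requested if c in df.columns]]
{code}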
[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column
Joris Van den Bossche updated ARROW-3861:
-----------------------------------------
    Fix Version/s: 1.0.0
[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column
Joris Van den Bossche updated ARROW-3861:
-----------------------------------------
    Labels: dataset dataset-parquet-read parquet python  (was: dataset parquet python)
[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column
Wes McKinney updated ARROW-3861:
--------------------------------
    Fix Version/s: (was: 0.16.0)
[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column
Wes McKinney updated ARROW-3861:
--------------------------------
    Labels: dataset parquet python  (was: parquet python)
[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column
Wes McKinney updated ARROW-3861:
--------------------------------
    Fix Version/s: (was: 0.14.0)
                   0.15.0
[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column
Joris Van den Bossche updated ARROW-3861:
-----------------------------------------
    Labels: parquet python  (was: parquet pyarrow python)
[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column
Joris Van den Bossche updated ARROW-3861:
-----------------------------------------
    Labels: parquet pyarrow python  (was: pyarrow python)
[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column
Antoine Pitrou updated ARROW-3861:
----------------------------------
    Component/s: Python
[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column
Wes McKinney updated ARROW-3861:
--------------------------------
    Fix Version/s: 0.14.0
[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column
Wes McKinney updated ARROW-3861:
--------------------------------
    Summary: [Python] ParquetDataset().read columns argument always returns partition column  (was: ParquetDataset().read columns argument always returns partition column)