[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column

2020-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3861:
--
Labels: dataset dataset-parquet-read parquet pull-request-available python  
(was: dataset dataset-parquet-read parquet python)

> [Python] ParquetDataset().read columns argument always returns partition 
> column
> ---
>
> Key: ARROW-3861
> URL: https://issues.apache.org/jira/browse/ARROW-3861
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Christian Thiel
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: dataset, dataset-parquet-read, parquet, 
> pull-request-available, python
> Fix For: 1.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I just noticed that no matter which columns are specified on load of a 
> dataset, the partition column is always returned. This might lead to strange 
> behaviour, as the resulting dataframe has more than the expected columns:
> {code}
> import dask as da
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> import os
> import numpy as np
> import shutil
>
> PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
> if os.path.exists(PATH_PYARROW_MANUAL):
>     shutil.rmtree(PATH_PYARROW_MANUAL)
> os.mkdir(PATH_PYARROW_MANUAL)
>
> arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
> strings = np.array([np.nan, np.nan, 'a', 'b'])
> df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
> df.index.name = 'DPRD_ID'
> df['arrays'] = pd.Series(arrays)
> df['strings'] = pd.Series(strings)
>
> my_schema = pa.schema([('DPRD_ID', pa.int64()),
>                        ('partition_column', pa.int32()),
>                        ('arrays', pa.list_(pa.int32())),
>                        ('strings', pa.string()),
>                        ('new_column', pa.string())])
> table = pa.Table.from_pandas(df, schema=my_schema)
> pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL,
>                     partition_cols=['partition_column'])
>
> df_pq = pq.ParquetDataset(PATH_PYARROW_MANUAL).read(
>     columns=['DPRD_ID', 'strings']).to_pandas()
> # pd.read_parquet(PATH_PYARROW_MANUAL, columns=['DPRD_ID', 'strings'], engine='pyarrow')
> df_pq
> {code}
> df_pq has the column `partition_column` even though it was not requested.
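A minimal workaround sketch (an editor's assumption, not part of the original report): since the partition column comes back regardless of the `columns` argument, re-select only the requested columns on the resulting DataFrame. The `df_pq` below is a hypothetical stand-in with the same shape as the report's output, so the snippet runs without the dataset on disk.

```python
import pandas as pd

# Stand-in for the DataFrame returned by ParquetDataset(...).read(...).to_pandas(),
# which carries the extra 'partition_column' even though only 'strings' was requested.
df_pq = pd.DataFrame({'strings': ['a', 'b'], 'partition_column': [1, 1]})

requested = ['strings']
# Keep only the columns that were actually requested, dropping any partition
# columns the reader appended.
df_fixed = df_pq[[c for c in df_pq.columns if c in requested]]
print(list(df_fixed.columns))
```

This sidesteps the bug at the cost of reading (and materializing) the partition column first; it does not reduce the data actually loaded.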



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column

2020-04-28 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-3861:
-
Fix Version/s: 1.0.0

> [Python] ParquetDataset().read columns argument always returns partition 
> column
> ---
>
> Key: ARROW-3861
> URL: https://issues.apache.org/jira/browse/ARROW-3861
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Christian Thiel
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: dataset, dataset-parquet-read, parquet, python
> Fix For: 1.0.0
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column

2020-03-12 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-3861:
-
Labels: dataset dataset-parquet-read parquet python  (was: dataset parquet 
python)

> [Python] ParquetDataset().read columns argument always returns partition 
> column
> ---
>
> Key: ARROW-3861
> URL: https://issues.apache.org/jira/browse/ARROW-3861
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Christian Thiel
>Priority: Major
>  Labels: dataset, dataset-parquet-read, parquet, python
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3861:

Fix Version/s: (was: 0.16.0)

> [Python] ParquetDataset().read columns argument always returns partition 
> column
> ---
>
> Key: ARROW-3861
> URL: https://issues.apache.org/jira/browse/ARROW-3861
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Christian Thiel
>Priority: Major
>  Labels: dataset, parquet, python
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column

2019-06-11 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3861:

Labels: dataset parquet python  (was: parquet python)

> [Python] ParquetDataset().read columns argument always returns partition 
> column
> ---
>
> Key: ARROW-3861
> URL: https://issues.apache.org/jira/browse/ARROW-3861
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Christian Thiel
>Priority: Major
>  Labels: dataset, parquet, python
> Fix For: 0.14.0
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column

2019-06-11 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3861:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [Python] ParquetDataset().read columns argument always returns partition 
> column
> ---
>
> Key: ARROW-3861
> URL: https://issues.apache.org/jira/browse/ARROW-3861
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Christian Thiel
>Priority: Major
>  Labels: dataset, parquet, python
> Fix For: 0.15.0
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column

2019-04-26 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-3861:
-
Labels: parquet python  (was: parquet pyarrow python)

> [Python] ParquetDataset().read columns argument always returns partition 
> column
> ---
>
> Key: ARROW-3861
> URL: https://issues.apache.org/jira/browse/ARROW-3861
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Christian Thiel
>Priority: Major
>  Labels: parquet, python
> Fix For: 0.14.0
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column

2019-04-26 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-3861:
-
Labels: parquet pyarrow python  (was: pyarrow python)

> [Python] ParquetDataset().read columns argument always returns partition 
> column
> ---
>
> Key: ARROW-3861
> URL: https://issues.apache.org/jira/browse/ARROW-3861
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Christian Thiel
>Priority: Major
>  Labels: parquet, pyarrow, python
> Fix For: 0.14.0
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column

2019-04-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-3861:
--
Component/s: Python

> [Python] ParquetDataset().read columns argument always returns partition 
> column
> ---
>
> Key: ARROW-3861
> URL: https://issues.apache.org/jira/browse/ARROW-3861
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Christian Thiel
>Priority: Major
>  Labels: pyarrow, python
> Fix For: 0.14.0
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column

2019-02-07 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3861:

Fix Version/s: 0.14.0

> [Python] ParquetDataset().read columns argument always returns partition 
> column
> ---
>
> Key: ARROW-3861
> URL: https://issues.apache.org/jira/browse/ARROW-3861
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Christian Thiel
>Priority: Major
>  Labels: pyarrow, python
> Fix For: 0.14.0
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column

2018-11-23 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3861:

Summary: [Python] ParquetDataset().read columns argument always returns 
partition column  (was: ParquetDataset().read columns argument always returns 
partition column)

> [Python] ParquetDataset().read columns argument always returns partition 
> column
> ---
>
> Key: ARROW-3861
> URL: https://issues.apache.org/jira/browse/ARROW-3861
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Christian Thiel
>Priority: Major
>  Labels: pyarrow, python
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)