[jira] [Assigned] (ARROW-8088) [C++][Dataset] Partition columns with specified dictionary type result in all nulls

2020-03-18 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-8088:


Assignee: Ben Kietzman  (was: Joris Van den Bossche)

> [C++][Dataset] Partition columns with specified dictionary type result in all 
> nulls
> ---
>
>                 Key: ARROW-8088
>                 URL: https://issues.apache.org/jira/browse/ARROW-8088
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++ - Dataset
>            Reporter: Joris Van den Bossche
>            Assignee: Ben Kietzman
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When an explicit schema with dictionary-typed fields is specified for the 
> Partitioning, the materialization of the partition keys goes wrong: no error 
> is raised, but the partition columns contain only nulls.
> Python example:
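> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
>
> # Write a small dataset partitioned on 'foo' and 'bar' (hive-style directories)
> foo_keys = [0, 1]
> bar_keys = ['a', 'b', 'c']
> N = 30
> df = pd.DataFrame({
>     'foo': np.array(foo_keys, dtype='i4').repeat(15),
>     'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
>     'values': np.random.randn(N)
> })
> pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])
> {code}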
> When reading with discovery, all is fine:
> {code:python}
> >>> ds.dataset("test_order", format="parquet",
> ...            partitioning="hive").to_table().schema
> values: double
> bar: string
> foo: int32
> >>> ds.dataset("test_order", format="parquet",
> ...            partitioning="hive").to_table().to_pandas().head(2)
>      values bar  foo
> 0  2.505903   a    0
> 1 -1.760135   a    0
> {code}
> But when specifying the partition columns to be dictionary type with explicit 
> {{HivePartitioning}}, you get no error but all null values:
> {code:python}
> >>> partitioning = ds.HivePartitioning(pa.schema([
> ...     ("foo", pa.dictionary(pa.int32(), pa.int64())),
> ...     ("bar", pa.dictionary(pa.int32(), pa.string()))
> ... ]))
> >>> ds.dataset("test_order", format="parquet",
> ...            partitioning=partitioning).to_table().schema
> values: double
> foo: dictionary<values=int64, indices=int32, ordered=0>
> bar: dictionary<values=string, indices=int32, ordered=0>
> >>> ds.dataset("test_order", format="parquet",
> ...            partitioning=partitioning).to_table().to_pandas().head(2)
>      values  foo  bar
> 0  2.505903  NaN  NaN
> 1 -1.760135  NaN  NaN
> {code}
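
For readers unfamiliar with what "materializing" dictionary-typed partition keys should produce, here is a minimal pure-Python sketch. It is not Arrow's implementation: the helpers `parse_hive_path` and `dictionary_encode` and the file name `part.parquet` are hypothetical, introduced only for illustration. It assumes hive-style `key=value` directory segments like those written by the example above, and shows that every row should end up with a valid index into a dictionary of the observed key values rather than a null.

```python
from typing import Dict, List, Tuple


def parse_hive_path(path: str) -> Dict[str, str]:
    """Collect key=value directory segments from a hive-style path (hypothetical helper)."""
    keys: Dict[str, str] = {}
    for segment in path.split("/"):
        if "=" in segment:
            name, _, value = segment.partition("=")
            keys[name] = value
    return keys


def dictionary_encode(values: List[str]) -> Tuple[List[int], List[str]]:
    """Return (indices, dictionary); the dictionary keeps first-seen order."""
    dictionary: List[str] = []
    positions: Dict[str, int] = {}
    indices: List[int] = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return indices, dictionary


# Hypothetical file paths mirroring the layout of the "test_order" dataset.
paths = [
    "test_order/foo=0/bar=a/part.parquet",
    "test_order/foo=0/bar=b/part.parquet",
    "test_order/foo=1/bar=a/part.parquet",
]

# One partition value per file; each maps to a non-null dictionary index.
bar_values = [parse_hive_path(p)["bar"] for p in paths]
bar_indices, bar_dictionary = dictionary_encode(bar_values)
```

In this sketch each file contributes one partition value, and the encoded column holds an index for every value, which is the behaviour one would expect in place of the all-null columns reported above.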



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8088) [C++][Dataset] Partition columns with specified dictionary type result in all nulls

2020-03-18 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-8088:


Assignee: Joris Van den Bossche



[jira] [Assigned] (ARROW-8088) [C++][Dataset] Partition columns with specified dictionary type result in all nulls

2020-03-18 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-8088:


Assignee: Joris Van den Bossche  (was: Ben Kietzman)



[jira] [Assigned] (ARROW-8088) [C++][Dataset] Partition columns with specified dictionary type result in all nulls

2020-03-18 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-8088:


Assignee: Ben Kietzman  (was: Joris Van den Bossche)

> [C++][Dataset] Partition columns with specified dictionary type result in all 
> nulls
> ---
>
>                 Key: ARROW-8088
>                 URL: https://issues.apache.org/jira/browse/ARROW-8088
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++ - Dataset
>            Reporter: Joris Van den Bossche
>            Assignee: Ben Kietzman
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.17.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h


