[ https://issues.apache.org/jira/browse/ARROW-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney reassigned ARROW-2515:
-----------------------------------

    Assignee: Brent Kerby

> Errors with DictionaryArray inside of ListArray or other DictionaryArray
> ------------------------------------------------------------------------
>
>                 Key: ARROW-2515
>                 URL: https://issues.apache.org/jira/browse/ARROW-2515
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.9.0
>            Reporter: Brent Kerby
>            Assignee: Brent Kerby
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.10.0
>
>   Original Estimate: 1h
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> An exception ("KeyError: 26") is raised when .as_py() is called on elements
> of a ListArray over a DictionaryArray, or of a DictionaryArray with values in
> a DictionaryArray. Here are a couple tests that currently fail:
>
> {code:java}
> import pyarrow as pa
>
> def test_dictionary_array_1():
>     dict_arr = pa.DictionaryArray.from_arrays([0, 1, 0], ['a', 'b'])
>     list_arr = pa.ListArray.from_arrays([0, 2, 3], dict_arr)
>     assert list_arr.to_pylist() == [['a', 'b'], ['a']]
>
> def test_dictionary_array_2():
>     dict_arr = pa.DictionaryArray.from_arrays([0, 1, 0], ['a', 'b'])
>     dict_arr2 = pa.DictionaryArray.from_arrays([0, 1, 2, 1, 0], dict_arr)
>     assert dict_arr2.to_pylist() == ['a', 'b', 'a', 'b', 'a']
> {code}
> It appears that the problem is caused by the fact that the function
> box_scalar in scalar.pxi does not handle the case of dictionary array, as we
> currently have no DictionaryValue type.
> DictionaryArray.__getitem__ currently works around the lack of
> DictionaryValue type by dereferencing the index and constructing a scalar
> based on the value in the underlying dictionary. In other words, if we have a
> dictionary with int8 indices and string values, then the result of
> __getitem__ will be a StringValue (rather than a DictionaryValue). This works
> in simple cases but not in the more complex scenarios illustrated above.
> I have a patch ready, which would add a DictionaryValue type similar to other
> scalar types, resolving these bugs and removing the need for a special-cased
> implementation of DictionaryArray.__getitem__. This DictionaryValue would
> contain a couple accessor properties, "indices_value" and "dictionary_value"
> to allow access to both the index in the dictionary as well as the looked-up
> value. Then DictionaryValue.as_py() would simply call .as_py() on the
> underlying dictionary_value.
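
The proposal above can be illustrated with a small pure-Python model. This is only a sketch of the idea, not the patch itself and not pyarrow code: PlainValue and box_dictionary are hypothetical stand-ins for an ordinary scalar type and for the boxing step, and only the names DictionaryValue, indices_value, dictionary_value and as_py() come from the issue description.

```python
# Toy model only: hypothetical stand-ins, not the pyarrow implementation.

class PlainValue:
    """Stand-in for an ordinary scalar such as StringValue."""

    def __init__(self, value):
        self._value = value

    def as_py(self):
        return self._value


class DictionaryValue:
    """Scalar for one element of a dictionary-encoded array."""

    def __init__(self, index, dictionary):
        self._index = index
        self._dictionary = dictionary  # sequence of scalar objects

    @property
    def indices_value(self):
        # The position in the dictionary, itself boxed as a scalar.
        return PlainValue(self._index)

    @property
    def dictionary_value(self):
        # The looked-up entry; it may be another DictionaryValue.
        return self._dictionary[self._index]

    def as_py(self):
        # Delegate to the looked-up scalar, so nesting resolves recursively.
        return self.dictionary_value.as_py()


def box_dictionary(indices, dictionary):
    """Box every index as a DictionaryValue (hypothetical helper)."""
    return [DictionaryValue(i, dictionary) for i in indices]


# Mirrors test_dictionary_array_2: a dictionary whose values are
# themselves dictionary-encoded.
inner = box_dictionary([0, 1, 0], [PlainValue('a'), PlainValue('b')])
outer = box_dictionary([0, 1, 2, 1, 0], inner)
result = [v.as_py() for v in outer]
```

Because as_py() delegates to whatever scalar the lookup returns, the nested case needs no special handling: an outer element whose dictionary entry is itself a DictionaryValue resolves through one more lookup.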