Re: [Question][Python] Columns with Limited Value Set

Sam Davis Wed, 05 Jan 2022 07:21:36 -0800

Hi Rok, David,

I think the problem is that the DictionaryType loses the semantic information 
about the categories.

Right now I define the schema for the tables and have logic to parse 
files/receive data and convert it into RecordBatches ready for writing. This is 
quite simple: for each row we generate a dictionary of {key: value, ...} as the 
data comes in, pass a set of these to `pd.DataFrame(...)`, and then convert 
using `pa.RecordBatch.from_pandas(df, schema=schema)` (I'm aware newer versions 
have a `pa.record_batch` that can now be used).

In this instance the schema specifies, to the code and to the user, which 
columns should be present and what the type and values of each should be.

The use of DictionaryArray breaks this, as there is no way of specifying the 
permitted set of values (`dictionary` in your example) in the schema itself. 
Pandas has CategoricalDtype, whereby you can specify `categories`, but this 
information needs to be stored somewhere other than the schema itself and 
special-cased for categorical columns.

Does this suggest that it may be a good idea to add the categorical type 
information to the schema?

Right now it looks like I'll have to define my own schema/field classes that 
return PyArrow and Pandas types when requested?

Sam

________________________________
From: David Li <[email protected]>
Sent: 05 January 2022 14:53
To: [email protected] <[email protected]>
Subject: Re: [Question][Python] Columns with Limited Value Set

Hi Sam,

For categoricals, you likely want an Arrow dictionary array. (See docs at [1].) 
For example:

>>> import pyarrow as pa
>>> ty = pa.dictionary(pa.int8(), pa.string())
>>> arr = pa.array(["a", "a", None, "d"], type=ty)
>>> arr
<pyarrow.lib.DictionaryArray object at 0x7fe2fff70890>

-- dictionary:
  [
    "a",
    "d"
  ]
-- indices:
  [
    0,
    0,
    null,
    1
  ]
>>> table = pa.table([arr], names=["col1"])
>>> table.to_pandas()
  col1
0    a
1    a
2  NaN
3    d
>>> table.to_pandas()["col1"]
0      a
1      a
2    NaN
3      d
Name: col1, dtype: category
Categories (2, object): ['a', 'd']

Is this sufficient?

[1]: https://arrow.apache.org/docs/python/data.html#dictionary-arrays

-David

On Wed, Jan 5, 2022, at 09:34, Sam Davis wrote:
Hi,

I'm looking at defining a schema for a table where one of the values is 
inherently categorical/enumerable and we're ultimately ending up loading it as 
a Pandas DataFrame. I cannot seem to find a decent way of achieving this.

For example, the column may always be known to contain the values ["a", "b", 
"c", "d"]. Stating this as a stringly-typed column in the schema is a bad idea, 
as it permits all strings and requires more storage than necessary for longer 
strings. Stating it as an integer column is a bad idea, as you lose context and 
force the user to cast after loading. And the dictionary type does not allow 
you to specify the values in the schema, so it similarly loses all meaning.

I have been playing with the API all morning and from what I can tell there is 
no easy way of achieving this. Am I missing something obvious?

---

One possible route I thought of is to define an extension type and then 
implement the `to_pandas_dtype` method. Yes this method permits all known 
values whilst in Arrow-land, but it at least documents the known type and, so I 
thought, any values not within the `to_pandas_dtype` return will be set to null 
on conversion anyway.
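
To illustrate the pandas side of that assumption only (this says nothing about what the Arrow conversion itself does), casting to a CategoricalDtype turns values outside the categories into NaN:

```python
import pandas as pd

# Only "a" and "b" are permitted categories.
dtype = pd.CategoricalDtype(categories=["a", "b"])

# "c" is not a known category, so it becomes NaN after the cast.
s = pd.Series(["a", "b", "c"]).astype(dtype)
print(s.isna().tolist())  # [False, False, True]
```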

However, this seems to require unnecessarily special-casing a whole bunch of 
code to handle extension types. e.g. just creating a scalar of this type 
requires using a different API. It seems like `pa.scalar` should be able to 
work this out? This example defines a wrapper for int32, and then tries to 
create a scalar of this type showing that the user has to call a special method 
rather than just the normal API:

```
import pyarrow as pa

class IntegerWrapper(pa.ExtensionType):

    def __init__(self):
        pa.ExtensionType.__init__(self, pa.int32(), "integer_wrapper")

    def __arrow_ext_serialize__(self):
        # since we don't have a parameterized type, we don't need extra
        # metadata to be deserialized
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # return an instance of this subclass given the serialized
        # metadata.
        return IntegerWrapper()

iw_type = IntegerWrapper()

pa.register_extension_type(iw_type)

# throws `ArrowNotImplementedError`
# pa.scalar(0, iw_type)

# user must do this, but code should be able to do this?
pa.ExtensionScalar.from_storage(iw_type, pa.scalar(0, iw_type.storage_type))
```

and I can't seem to get the `to_pandas_dtype` to actually work for a wrapped 
dictionary. e.g.

```
import pyarrow as pa

class DictWrapper(pa.ExtensionType):

    def __init__(self):
        pa.ExtensionType.__init__(
            self, pa.dictionary(pa.int8(), pa.string()), "dict_wrapper"
        )

    def __arrow_ext_serialize__(self):
        # since we don't have a parameterized type, we don't need extra
        # metadata to be deserialized
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # return an instance of this subclass given the serialized
        # metadata.
        return DictWrapper()

    def to_pandas_dtype(self):
        from pandas.api.types import CategoricalDtype
        return CategoricalDtype(categories=["a", "b"])

dw_type = DictWrapper()

pa.register_extension_type(dw_type)

arr = pa.ExtensionArray.from_storage(
    dw_type,
    pa.array(["a", "b", "c"], dw_type.storage_type)
)

arr

arr.to_pandas()

arr.to_pandas(categories=dw_type.to_pandas_dtype().categories.values)
```

Best,

Sam
IMPORTANT NOTICE: The information transmitted is intended only for the person 
or entity to which it is addressed and may contain confidential and/or 
privileged material. Any review, re-transmission, dissemination or other use 
of, or taking of any action in reliance upon, this information by persons or 
entities other than the intended recipient is prohibited. If you received this 
in error, please contact the sender and delete the material from any computer. 
Although we routinely screen for viruses, addressees should check this e-mail 
and any attachment for viruses. We make no warranty as to absence of viruses in 
this e-mail or any attachments.

