Re: [Question][Python] Columns with Limited Value Set

Paul Balança Fri, 07 Jan 2022 02:12:41 -0800

This approach should work: at work (FiveAI) I have been implementing a
layer on top of Arrow extension types to easily wrap common Python classes
(dataclasses, enums, ...). For Enums, we encode the mapping directly in the
type's metadata so that it can be reconstructed on deserialization.
I hope to be able to open-source part of this work in the near future :)


On Thu, Jan 6, 2022 at 4:23 PM David Li <[email protected]> wrote:

> Oh, also, maybe this could be shipped or specified as a "canonical"
> extension type (though we'd need a good way to encode the keys in the
> schema) - what do others think?
>
> -David
>
> On Thu, Jan 6, 2022, at 11:21, David Li wrote:
>
> The extension APIs could be improved, yes. I don't think there's a real
> reason other than perhaps there hasn't been too much usage yet. If there are
> any other issues you have, feel free to chime in here or file a JIRA [1] -
> I'll file JIRAs for the issues already raised in this thread when I get a
> chance.
>
> [1]: https://issues.apache.org/jira/secure/Dashboard.jspa
>
> -David
>
> On Thu, Jan 6, 2022, at 04:11, Sam Davis wrote:
>
> > We could use an extension type here: wrap the dictionary type in an
> > extension type whose metadata contains the expected keys. This way the keys
> > are stored in the schema.
>
> Yes, in theory this should work but I have found extension types very
> clumsy to work with. See original post for examples, but unless I'm using
> the wrong API it seems like you must special case most things you want to
> do with them (`pa.ExtensionScalar.from_storage` vs `pa.scalar`, etc) making
> them a less useful abstraction for this sort of task? Is there a reason for
> this?
>
> ------------------------------
>
> *From:* Jorge Cardoso Leitão <[email protected]>
> *Sent:* 06 January 2022 06:30
> *To:* [email protected] <[email protected]>
> *Subject:* Re: [Question][Python] Columns with Limited Value Set
>
> We could use an extension type here: wrap the dictionary type in an
> extension type whose metadata contains the expected keys. This way the keys
> are stored in the schema.
>
>
> On Wed, Jan 5, 2022 at 11:32 PM Neal Richardson <
> [email protected]> wrote:
>
> For what it's worth, I encountered a similar issue in working on the R
> bindings: if you're querying a dataset or filtering a dictionary array and
> you end up with a ChunkedArray with 0 chunks, you can't populate the factor
> levels when converting to R because the type doesn't have the dictionary
> values, only the corresponding arrays, of which there are none in this
> case. In practice it hasn't been a huge problem (AFAIK) but it is a
> difference in expectations.
>
> That said, there are good, practical reasons not to include the dictionary
> values in the type/schema (updating/deltas, as David mentioned, being one
> of them). It seems like an intentional design trade-off.
>
> Neal
>
> On Wed, Jan 5, 2022 at 4:22 PM David Li <[email protected]> wrote:
>
>
> Ah, thank you for the clarification. Indeed, Arrow dictionaries don't make
> the dictionary part of the schema itself (and the format even allows for
> dictionaries to be updated over time). I wonder if the dictionary type
> could be extended to handle this; alternatively, passing around explicit
> dictionaries alongside the schema might get you most of the way there. (It
> looks like we might need some way to pass a dictionary to from_pandas, or
> otherwise provide some way to dictionary-encode an Arrow array according to
> an existing dictionary.)
>
> -David
>
> On Wed, Jan 5, 2022, at 10:21, Sam Davis wrote:
>
> Hi Rok, David,
>
> I think the problem is that the DictionaryType loses the semantic
> information about the categories.
>
> Right now I define the schema for the tables and have logic to parse
> files/receive data and convert it into RecordBatches ready for writing. This
> is quite simple: for each row we generate a dictionary of {key: value, ...}
> as the data comes in, pass a set of these to `pd.DataFrame(...)`, and then
> convert using `pa.RecordBatch.from_pandas(df, schema=schema)` (I'm aware
> newer versions have a `pa.record_batch` that can now be used).
>
> In this instance the schema specifies to the code and to the user what
> columns should be present and what the type, and values, of these should be.
>
> The use of DictionaryArray breaks this as there is no way of specifying
> the permitted set of values (`dictionary` in your example) in the schema
> itself? Pandas has CategoricalDtype whereby you can specify `categories`
> but this information needs to be stored somewhere other than the schema
> itself and special cased for categorical columns.
>
> This suggests that it may be a good idea to add the categorical type
> information?
>
> Right now it looks like I'll have to define my own schema/field classes
> that return PyArrow and Pandas types when requested.
>
> Sam
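
For comparison, this is roughly how pandas lets the permitted values live on the dtype itself. It is a small sketch with example data, and the category list is the one from the original question.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# The full set of permitted values is part of the dtype, independent of
# which values actually occur in the data.
dtype = CategoricalDtype(categories=["a", "b", "c", "d"])
series = pd.Series(["a", "d"], dtype=dtype)

# Codes index into the declared categories, not into the observed values.
codes = series.cat.codes.tolist()
```

The Arrow dictionary type has no equivalent slot for `categories`, so that list has to be carried separately from the schema.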
>
>
>
>
> ------------------------------
>
> *From:* David Li <[email protected]>
> *Sent:* 05 January 2022 14:53
> *To:* [email protected] <[email protected]>
> *Subject:* Re: [Question][Python] Columns with Limited Value Set
>
> Hi Sam,
>
> For categoricals, you likely want an Arrow dictionary array. (See docs at
> [1].) For example:
>
> >>> import pyarrow as pa
> >>> ty = pa.dictionary(pa.int8(), pa.string())
> >>> arr = pa.array(["a", "a", None, "d"], type=ty)
> >>> arr
> <pyarrow.lib.DictionaryArray object at 0x7fe2fff70890>
>
> -- dictionary:
>   [
>     "a",
>     "d"
>   ]
> -- indices:
>   [
>     0,
>     0,
>     null,
>     1
>   ]
> >>> table = pa.table([arr], names=["col1"])
> >>> table.to_pandas()
>   col1
> 0    a
> 1    a
> 2  NaN
> 3    d
> >>> table.to_pandas()["col1"]
> 0      a
> 1      a
> 2    NaN
> 3      d
> Name: col1, dtype: category
> Categories (2, object): ['a', 'd']
>
> Is this sufficient?
>
> [1]: https://arrow.apache.org/docs/python/data.html#dictionary-arrays
>
> -David
>
>
> On Wed, Jan 5, 2022, at 09:34, Sam Davis wrote:
>
> Hi,
>
> I'm looking at defining a schema for a table where one of the values is
> inherently categorical/enumerable and we're ultimately ending up loading it
> as a Pandas DataFrame. I cannot seem to find a decent way of achieving this.
>
> For example, the column may always be known to contain the values ["a",
> "b", "c", "d"]. Stating this as a stringly-typed column in the schema is a
> bad idea as it permits all strings and requires more storage than necessary
> for longer strings, stating it as an integer column is a bad idea as you
> lose context and force the user to cast after loading, and the dictionary
> type does not allow you to specify the values in the schema so similarly
> loses all meaning.
>
> I have been playing with the API all morning and from what I can tell
> there is no easy way of achieving this. Am I missing something obvious?
>
> ---
>
> One possible route I thought of is to define an extension type and then
> implement the `to_pandas_dtype` method. Yes this method permits all known
> values whilst in Arrow-land, but it at least documents the known type and,
> so I thought, any values not within the `to_pandas_dtype` return will be
> set to null on conversion anyway.
>
> However, this seems to require unnecessarily special-casing a whole bunch
> of code to handle extension types. e.g. just creating a scalar of this type
> requires using a different API. It seems like `pa.scalar` should be able to
> work this out? This example defines a wrapper for int32, and then tries to
> create a scalar of this type showing that the user has to call a special
> method rather than just the normal API:
>
> ```
> import pyarrow as pa
>
>
> class IntegerWrapper(pa.ExtensionType):
>
>     def __init__(self):
>         pa.ExtensionType.__init__(self, pa.int32(), "integer_wrapper")
>
>     def __arrow_ext_serialize__(self):
>         # since we don't have a parameterized type, we don't need extra
>         # metadata to be deserialized
>         return b''
>
>     @classmethod
>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>         # return an instance of this subclass given the serialized
>         # metadata.
>         return IntegerWrapper()
>
>
> iw_type = IntegerWrapper()
>
> pa.register_extension_type(iw_type)
>
> # throws `ArrowNotImplementedError`
> # pa.scalar(0, iw_type)
>
> # user must do this, but code should be able to do this?
> pa.ExtensionScalar.from_storage(iw_type, pa.scalar(0,
> iw_type.storage_type))
> ```
>
> and I can't seem to get the `to_pandas_dtype` to actually work for a
> wrapped dictionary. e.g.
>
> ```
> import pyarrow as pa
>
>
> class DictWrapper(pa.ExtensionType):
>
>     def __init__(self):
>         pa.ExtensionType.__init__(self, pa.dictionary(pa.int8(),
> pa.string()), "dict_wrapper")
>
>     def __arrow_ext_serialize__(self):
>         # since we don't have a parameterized type, we don't need extra
>         # metadata to be deserialized
>         return b''
>
>     @classmethod
>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>         # return an instance of this subclass given the serialized
>         # metadata.
>         return DictWrapper()
>
>     def to_pandas_dtype(self):
>         from pandas.api.types import CategoricalDtype
>         return CategoricalDtype(categories=["a", "b"])
>
> dw_type = DictWrapper()
>
> pa.register_extension_type(dw_type)
>
> arr = pa.ExtensionArray.from_storage(
>     dw_type,
>     pa.array(["a", "b", "c"], dw_type.storage_type)
> )
>
> arr
>
> arr.to_pandas()
>
> arr.to_pandas(categories=dw_type.to_pandas_dtype().categories.values)
> ```
>
> Best,
>
> Sam
> IMPORTANT NOTICE: The information transmitted is intended only for the
> person or entity to which it is addressed and may contain confidential
> and/or privileged material. Any review, re-transmission, dissemination or
> other use of, or taking of any action in reliance upon, this information by
> persons or entities other than the intended recipient is prohibited. If you
> received this in error, please contact the sender and delete the material
> from any computer. Although we routinely screen for viruses, addressees
> should check this e-mail and any attachment for viruses. We make no
> warranty as to absence of viruses in this e-mail or any attachments.
>
