Oh, also, maybe this could be shipped or specified as a "canonical" extension type (though we'd need a good way to encode the keys in the schema) - what do others think?

-David

On Thu, Jan 6, 2022, at 11:21, David Li wrote:
> The extension APIs could be improved, yes. I don't think there's a real
> reason other than perhaps there hasn't been too much usage yet. If there's
> any other issues you have, feel free to chime in here or file a JIRA [1] -
> I'll file JIRAs for the issues already raised in this thread when I get a
> chance.
>
> [1]: https://issues.apache.org/jira/secure/Dashboard.jspa
>
> -David
>
> On Thu, Jan 6, 2022, at 04:11, Sam Davis wrote:
>> > We could use an extension type here: wrap the dictionary type on an
>> > extension type whose metadata contains the expected keys. This way the
>> > keys are stored in the schema.
>>
>> Yes, in theory this should work but I have found extension types very clumsy
>> to work with. See original post for examples, but unless I'm using the wrong
>> API it seems like you must special case most things you want to do with them
>> (`pa.ExtensionScalar.from_storage` vs `pa.scalar`, etc) making them a less
>> useful abstraction for this sort of task? Is there a reason for this?
>>
>>
>> *From:* Jorge Cardoso Leitão <[email protected]>
>> *Sent:* 06 January 2022 06:30
>> *To:* [email protected] <[email protected]>
>> *Subject:* Re: [Question][Python] Columns with Limited Value Set
>>
>> We could use an extension type here: wrap the dictionary type on an
>> extension type whose metadata contains the expected keys. This way the keys
>> are stored in the schema.
>>
>>
>> On Wed, Jan 5, 2022 at 11:32 PM Neal Richardson
>> <[email protected]> wrote:
>>> For what it's worth, I encountered a similar issue in working on the R
>>> bindings: if you're querying a dataset or filtering a dictionary array and
>>> you end up with a ChunkedArray with 0 chunks, you can't populate the factor
>>> levels when converting to R because the type doesn't have the dictionary
>>> values, only the corresponding arrays, of which there are none in this
>>> case. In practice it hasn't been a huge problem (AFAIK) but it is a
>>> difference in expectations.
>>>
>>> That said, there are good, practical reasons not to include the dictionary
>>> values in the type/schema (updating/deltas, as David mentioned, being one
>>> of them). It seems like an intentional design trade-off.
>>>
>>> Neal
>>>
>>> On Wed, Jan 5, 2022 at 4:22 PM David Li <[email protected]> wrote:
>>>> Ah, thank you for the clarification. Indeed, Arrow dictionaries don't make
>>>> the dictionary part of the schema itself (and the format even allows for
>>>> dictionaries to be updated over time). I wonder if the dictionary type
>>>> could be extended to handle this; alternatively, passing around explicit
>>>> dictionaries alongside the schema might get you most of the way there. (It
>>>> looks like we might need some way to pass a dictionary to from_pandas, or
>>>> otherwise provide some way to dictionary-encode an Arrow array according
>>>> to an existing dictionary.)
>>>>
>>>> -David
>>>>
>>>> On Wed, Jan 5, 2022, at 10:21, Sam Davis wrote:
>>>>> Hi Rok, David,
>>>>>
>>>>> I think the problem is that the DictionaryType loses the semantic
>>>>> information about the categories.
>>>>>
>>>>> Right now I define the schema for the tables and have logic to parse
>>>>> files/receive data and convert it into RecordBatches ready for writing.
>>>>> This is quite simple: for each row we generate a dictionary of {key:
>>>>> value, ...} as the data comes in, pass a set of these to
>>>>> `pd.DataFrame(...)`, and then convert using
>>>>> `pa.RecordBatch.from_pandas(df, schema=schema)` (I'm aware newer versions
>>>>> have a `pa.record_batch` that can now be used).
>>>>>
>>>>> In this instance the schema specifies to the code and to the user what
>>>>> columns should be present and what the type, and values, of these should
>>>>> be.
>>>>>
>>>>> The use of DictionaryArray breaks this as there is no way of specifying
>>>>> the permitted set of values (`dictionary` in your example) in the schema
>>>>> itself? Pandas has CategoricalDtype whereby you can specify `categories`
>>>>> but this information needs to be stored somewhere other than the schema
>>>>> itself and special cased for categorical columns.
>>>>>
>>>>> This suggests that it may be a good idea to add the categorical type
>>>>> information?
>>>>>
>>>>> Right now it looks like I'll have to define my own schema/field classes
>>>>> that return PyArrow and Pandas types when requested.
>>>>>
>>>>> Sam
>>>>>
>>>>>
>>>>> *From:* David Li <[email protected]>
>>>>> *Sent:* 05 January 2022 14:53
>>>>> *To:* [email protected] <[email protected]>
>>>>> *Subject:* Re: [Question][Python] Columns with Limited Value Set
>>>>>
>>>>> Hi Sam,
>>>>>
>>>>> For categoricals, you likely want an Arrow dictionary array. (See docs at
>>>>> [1].) For example:
>>>>>
>>>>> >>> import pyarrow as pa
>>>>> >>> ty = pa.dictionary(pa.int8(), pa.string())
>>>>> >>> arr = pa.array(["a", "a", None, "d"], type=ty)
>>>>> >>> arr
>>>>> <pyarrow.lib.DictionaryArray object at 0x7fe2fff70890>
>>>>>
>>>>> -- dictionary:
>>>>>   [
>>>>>     "a",
>>>>>     "d"
>>>>>   ]
>>>>> -- indices:
>>>>>   [
>>>>>     0,
>>>>>     0,
>>>>>     null,
>>>>>     1
>>>>>   ]
>>>>> >>> table = pa.table([arr], names=["col1"])
>>>>> >>> table.to_pandas()
>>>>>   col1
>>>>> 0    a
>>>>> 1    a
>>>>> 2  NaN
>>>>> 3    d
>>>>> >>> table.to_pandas()["col1"]
>>>>> 0      a
>>>>> 1      a
>>>>> 2    NaN
>>>>> 3      d
>>>>> Name: col1, dtype: category
>>>>> Categories (2, object): ['a', 'd']
>>>>>
>>>>> Is this sufficient?
>>>>>
>>>>> [1]: https://arrow.apache.org/docs/python/data.html#dictionary-arrays
>>>>>
>>>>> -David
>>>>>
>>>>>
>>>>> On Wed, Jan 5, 2022, at 09:34, Sam Davis wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm looking at defining a schema for a table where one of the values is
>>>>>> inherently categorical/enumerable and we're ultimately ending up loading
>>>>>> it as a Pandas DataFrame. I cannot seem to find a decent way of
>>>>>> achieving this.
>>>>>>
>>>>>> For example, the column may always be known to contain the values ["a",
>>>>>> "b", "c", "d"]. Stating this as a stringly-typed column in the schema is
>>>>>> a bad idea as it permits all strings and requires more storage than
>>>>>> necessary for longer strings, stating it as an integer column is a bad
>>>>>> idea as you lose context and force the user to cast after loading, and
>>>>>> the dictionary type does not allow you to specify the values in the
>>>>>> schema so similarly loses all meaning.
>>>>>>
>>>>>> I have been playing with the API all morning and from what I can tell
>>>>>> there is no easy way of achieving this. Am I missing something obvious?
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> One possible route I thought of is to define an extension type and then
>>>>>> implement the `to_pandas_dtype` method. Yes this method permits all
>>>>>> known values whilst in Arrow-land, but it at least documents the known
>>>>>> type and, so I thought, any values not within the `to_pandas_dtype`
>>>>>> return will be set to null on conversion anyway.
>>>>>>
>>>>>> However, this seems to require unnecessarily special-casing a whole
>>>>>> bunch of code to handle extension types. e.g. just creating a scalar of
>>>>>> this type requires using a different API. It seems like `pa.scalar`
>>>>>> should be able to work this out? This example defines a wrapper for
>>>>>> int32, and then tries to create a scalar of this type showing that the
>>>>>> user has to call a special method rather than just the normal API:
>>>>>>
>>>>>> ```
>>>>>> import pyarrow as pa
>>>>>>
>>>>>>
>>>>>> class IntegerWrapper(pa.ExtensionType):
>>>>>>
>>>>>>     def __init__(self):
>>>>>>         pa.ExtensionType.__init__(self, pa.int32(), "integer_wrapper")
>>>>>>
>>>>>>     def __arrow_ext_serialize__(self):
>>>>>>         # since we don't have a parameterized type, we don't need extra
>>>>>>         # metadata to be deserialized
>>>>>>         return b''
>>>>>>
>>>>>>     @classmethod
>>>>>>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>>>>>>         # return an instance of this subclass given the serialized
>>>>>>         # metadata.
>>>>>>         return IntegerWrapper()
>>>>>>
>>>>>>
>>>>>> iw_type = IntegerWrapper()
>>>>>>
>>>>>> pa.register_extension_type(iw_type)
>>>>>>
>>>>>> # throws `ArrowNotImplementedError`
>>>>>> # pa.scalar(0, iw_type)
>>>>>>
>>>>>> # user must do this, but code should be able to do this?
>>>>>> pa.ExtensionScalar.from_storage(
>>>>>>     iw_type, pa.scalar(0, iw_type.storage_type)
>>>>>> )
>>>>>> ```
>>>>>>
>>>>>> and I can't seem to get the `to_pandas_dtype` to actually work for a
>>>>>> wrapped dictionary. e.g.
>>>>>>
>>>>>> ```
>>>>>> import pyarrow as pa
>>>>>>
>>>>>>
>>>>>> class DictWrapper(pa.ExtensionType):
>>>>>>
>>>>>>     def __init__(self):
>>>>>>         pa.ExtensionType.__init__(
>>>>>>             self, pa.dictionary(pa.int8(), pa.string()), "dict_wrapper"
>>>>>>         )
>>>>>>
>>>>>>     def __arrow_ext_serialize__(self):
>>>>>>         # since we don't have a parameterized type, we don't need extra
>>>>>>         # metadata to be deserialized
>>>>>>         return b''
>>>>>>
>>>>>>     @classmethod
>>>>>>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>>>>>>         # return an instance of this subclass given the serialized
>>>>>>         # metadata.
>>>>>>         return DictWrapper()
>>>>>>
>>>>>>     def to_pandas_dtype(self):
>>>>>>         from pandas.api.types import CategoricalDtype
>>>>>>         return CategoricalDtype(categories=["a", "b"])
>>>>>>
>>>>>> dw_type = DictWrapper()
>>>>>>
>>>>>> pa.register_extension_type(dw_type)
>>>>>>
>>>>>> arr = pa.ExtensionArray.from_storage(
>>>>>>     dw_type,
>>>>>>     pa.array(["a", "b", "c"], dw_type.storage_type)
>>>>>> )
>>>>>>
>>>>>> arr
>>>>>>
>>>>>> arr.to_pandas()
>>>>>>
>>>>>> arr.to_pandas(categories=dw_type.to_pandas_dtype().categories.values)
>>>>>> ```
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Sam
>>>>>> IMPORTANT NOTICE: The information transmitted is intended only for the
>>>>>> person or entity to which it is addressed and may contain confidential
>>>>>> and/or privileged material. Any review, re-transmission, dissemination
>>>>>> or other use of, or taking of any action in reliance upon, this
>>>>>> information by persons or entities other than the intended recipient is
>>>>>> prohibited. If you received this in error, please contact the sender and
>>>>>> delete the material from any computer. Although we routinely screen for
>>>>>> viruses, addressees should check this e-mail and any attachment for
>>>>>> viruses. We make no warranty as to absence of viruses in this e-mail or
>>>>>> any attachments.
