Re: [Question][Python] Columns with Limited Value Set

David Li Thu, 06 Jan 2022 08:23:14 -0800

Oh, also, maybe this could be shipped or specified as a "canonical" extension 
type (though we'd need a good way to encode the keys in the schema) - what do 
others think?


-David

On Thu, Jan 6, 2022, at 11:21, David Li wrote:
> The extension APIs could be improved, yes. I don't think there's a real 
> reason other than perhaps there hasn't been too much usage yet. If there's 
> any other issues you have, feel free to chime in here or file a JIRA [1] - 
> I'll file JIRAs for the issues already raised in this thread when I get a 
> chance.
> 
> [1]: https://issues.apache.org/jira/secure/Dashboard.jspa
> 
> -David
> 
> On Thu, Jan 6, 2022, at 04:11, Sam Davis wrote:
>> > We could use an extension type here: wrap the dictionary type on an 
>> > extension type whose metadata contains the expected keys. This way the 
>> > keys are stored in the schema.
>> 
>> Yes, in theory this should work but I have found extension types very clumsy 
>> to work with. See original post for examples, but unless I'm using the wrong 
>> API it seems like you must special case most things you want to do with them 
>> (`pa.ExtensionScalar.from_storage` vs `pa.scalar`, etc) making them a less 
>> useful abstraction for this sort of task? Is there a reason for this?
>> 
>> 
>> *From:* Jorge Cardoso Leitão <[email protected]>
>> *Sent:* 06 January 2022 06:30
>> *To:* [email protected] <[email protected]>
>> *Subject:* Re: [Question][Python] Columns with Limited Value Set
>>  
>> We could use an extension type here: wrap the dictionary type on an 
>> extension type whose metadata contains the expected keys. This way the keys 
>> are stored in the schema.
>> 
>> 
>> On Wed, Jan 5, 2022 at 11:32 PM Neal Richardson 
>> <[email protected]> wrote:
>>> For what it's worth, I encountered a similar issue in working on the R 
>>> bindings: if you're querying a dataset or filtering a dictionary array and 
>>> you end up with a ChunkedArray with 0 chunks, you can't populate the factor 
>>> levels when converting to R because the type doesn't have the dictionary 
>>> values, only the corresponding arrays, of which there are none in this 
>>> case. In practice it hasn't been a huge problem (AFAIK) but it is a 
>>> difference in expectations.
>>> 
>>> That said, there are good, practical reasons not to include the dictionary 
>>> values in the type/schema (updating/deltas, as David mentioned, being one 
>>> of them). It seems like an intentional design trade-off. 
>>> 
>>> Neal
>>> 
>>> On Wed, Jan 5, 2022 at 4:22 PM David Li <[email protected]> wrote:
>>>> Ah, thank you for the clarification. Indeed, Arrow dictionaries don't make 
>>>> the dictionary part of the schema itself (and the format even allows for 
>>>> dictionaries to be updated over time). I wonder if the dictionary type 
>>>> could be extended to handle this; alternatively, passing around explicit 
>>>> dictionaries alongside the schema might get you most of the way there. (It 
>>>> looks like we might need some way to pass a dictionary to from_pandas, or 
>>>> otherwise provide some way to dictionary-encode an Arrow array according 
>>>> to an existing dictionary.)
>>>> 
>>>> -David
>>>> 
>>>> On Wed, Jan 5, 2022, at 10:21, Sam Davis wrote:
>>>>> Hi Rok, David,
>>>>> 
>>>>> I think the problem is that the DictionaryType loses the semantic 
>>>>> information about the categories. 
>>>>> 
>>>>> Right now I define the schema for the tables and have logic to parse 
>>>>> files/receive data and convert it into RecordBatches ready for writing. 
>>>>> This is quite simple: for each row we generate a dictionary of {key: 
>>>>> value, ...} as the data comes in, pass a set of these to 
>>>>> `pd.DataFrame(...)`, and then convert using 
>>>>> `pa.RecordBatch.from_pandas(df, schema=schema)` (I'm aware newer versions 
>>>>> have a `pa.record_batch` that can now be used). 
>>>>> 
>>>>> In this instance the schema specifies to the code and to the user what 
>>>>> columns should be present and what the type, and values, of these should 
>>>>> be.
>>>>> 
>>>>> The use of DictionaryArray breaks this as there is no way of specifying 
>>>>> the permitted set of values (`dictionary` in your example) in the schema 
>>>>> itself? Pandas has CategoricalDtype whereby you can specify `categories` 
>>>>> but this information needs to be stored somewhere other than the schema 
>>>>> itself and special cased for categorical columns.
>>>>> 
>>>>> This suggests that it may be a good idea to add the categorical type 
>>>>> information?
>>>>> 
>>>>> Right now it looks like I'll have to define my own schema/field classes 
>>>>> that return PyArrow and Pandas types when requested.
>>>>> 
>>>>> Sam
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> *From:* David Li <[email protected]>
>>>>> *Sent:* 05 January 2022 14:53
>>>>> *To:* [email protected] <[email protected]>
>>>>> *Subject:* Re: [Question][Python] Columns with Limited Value Set
>>>>>  
>>>>> Hi Sam,
>>>>> 
>>>>> For categoricals, you likely want an Arrow dictionary array. (See docs at 
>>>>> [1].) For example:
>>>>> 
>>>>> >>> import pyarrow as pa
>>>>> >>> ty = pa.dictionary(pa.int8(), pa.string())
>>>>> >>> arr = pa.array(["a", "a", None, "d"], type=ty)
>>>>> >>> arr
>>>>> <pyarrow.lib.DictionaryArray object at 0x7fe2fff70890>
>>>>> 
>>>>> -- dictionary:
>>>>>   [
>>>>>     "a",
>>>>>     "d"
>>>>>   ]
>>>>> -- indices:
>>>>>   [
>>>>>     0,
>>>>>     0,
>>>>>     null,
>>>>>     1
>>>>>   ]
>>>>> >>> table = pa.table([arr], names=["col1"])
>>>>> >>> table.to_pandas()
>>>>>   col1
>>>>> 0    a
>>>>> 1    a
>>>>> 2  NaN
>>>>> 3    d
>>>>> >>> table.to_pandas()["col1"]
>>>>> 0      a
>>>>> 1      a
>>>>> 2    NaN
>>>>> 3      d
>>>>> Name: col1, dtype: category
>>>>> Categories (2, object): ['a', 'd']
>>>>> 
>>>>> Is this sufficient?
>>>>> 
>>>>> [1]: https://arrow.apache.org/docs/python/data.html#dictionary-arrays
>>>>> 
>>>>> -David
>>>>> 
>>>>> 
>>>>> On Wed, Jan 5, 2022, at 09:34, Sam Davis wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I'm looking at defining a schema for a table where one of the values is 
>>>>>> inherently categorical/enumerable and we're ultimately ending up loading 
>>>>>> it as a Pandas DataFrame. I cannot seem to find a decent way of 
>>>>>> achieving this.
>>>>>> 
>>>>>> For example, the column may always be known to contain the values ["a", 
>>>>>> "b", "c", "d"]. Stating this as a stringly-typed column in the schema is 
>>>>>> a bad idea as it permits all strings and requires more storage than 
>>>>>> necessary for longer strings, stating it as an integer column is a bad 
>>>>>> idea as you lose context and force the user to cast after loading, and 
>>>>>> the dictionary type does not allow you to specify the values in the 
>>>>>> schema so similarly loses all meaning.
>>>>>> 
>>>>>> I have been playing with the API all morning and from what I can tell 
>>>>>> there is no easy way of achieving this. Am I missing something obvious? 
>>>>>> 
>>>>>> ---
>>>>>> 
>>>>>> One possible route I thought of is to define an extension type and then 
>>>>>> implement the `to_pandas_dtype` method. Yes this method permits all 
>>>>>> known values whilst in Arrow-land, but it at least documents the known 
>>>>>> type and, so I thought, any values not within the `to_pandas_dtype` 
>>>>>> return will be set to null on conversion anyway.
>>>>>> 
>>>>>> However, this seems to require unnecessarily special-casing a whole 
>>>>>> bunch of code to handle extension types. e.g. just creating a scalar of 
>>>>>> this type requires using a different API. It seems like `pa.scalar` 
>>>>>> should be able to work this out? This example defines a wrapper for 
>>>>>> int32, and then tries to create a scalar of this type showing that the 
>>>>>> user has to call a special method rather than just the normal API:
>>>>>> 
>>>>>> ```
>>>>>> import pyarrow as pa 
>>>>>> 
>>>>>> 
>>>>>> class IntegerWrapper(pa.ExtensionType):
>>>>>> 
>>>>>>     def __init__(self):
>>>>>>         pa.ExtensionType.__init__(self, pa.int32(), "integer_wrapper")
>>>>>> 
>>>>>>     def __arrow_ext_serialize__(self):
>>>>>>         # since we don't have a parameterized type, we don't need extra
>>>>>>         # metadata to be deserialized
>>>>>>         return b''
>>>>>> 
>>>>>>     @classmethod
>>>>>>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>>>>>>         # return an instance of this subclass given the serialized
>>>>>>         # metadata.
>>>>>>         return IntegerWrapper()
>>>>>>    
>>>>>> 
>>>>>> iw_type = IntegerWrapper()
>>>>>> 
>>>>>> pa.register_extension_type(iw_type)
>>>>>> 
>>>>>> # throws `ArrowNotImplementedError`
>>>>>> # pa.scalar(0, iw_type)
>>>>>> 
>>>>>> # user must do this, but code should be able to do this?
>>>>>> pa.ExtensionScalar.from_storage(
>>>>>>     iw_type, pa.scalar(0, iw_type.storage_type)
>>>>>> )
>>>>>> ```
>>>>>> 
>>>>>> and I can't seem to get the `to_pandas_dtype` to actually work for a 
>>>>>> wrapped dictionary. e.g. 
>>>>>> 
>>>>>> ```
>>>>>> import pyarrow as pa 
>>>>>> 
>>>>>> 
>>>>>> class DictWrapper(pa.ExtensionType):
>>>>>> 
>>>>>>     def __init__(self):
>>>>>>         pa.ExtensionType.__init__(
>>>>>>             self, pa.dictionary(pa.int8(), pa.string()), "dict_wrapper"
>>>>>>         )
>>>>>> 
>>>>>>     def __arrow_ext_serialize__(self):
>>>>>>         # since we don't have a parameterized type, we don't need extra
>>>>>>         # metadata to be deserialized
>>>>>>         return b''
>>>>>> 
>>>>>>     @classmethod
>>>>>>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>>>>>>         # return an instance of this subclass given the serialized
>>>>>>         # metadata.
>>>>>>         return DictWrapper()
>>>>>>    
>>>>>>     def to_pandas_dtype(self):
>>>>>>         from pandas.api.types import CategoricalDtype
>>>>>>         return CategoricalDtype(categories=["a", "b"])
>>>>>> 
>>>>>> dw_type = DictWrapper()
>>>>>> 
>>>>>> pa.register_extension_type(dw_type)
>>>>>> 
>>>>>> arr = pa.ExtensionArray.from_storage( 
>>>>>>     dw_type,
>>>>>>     pa.array(["a", "b", "c"], dw_type.storage_type)
>>>>>> )
>>>>>> 
>>>>>> arr
>>>>>> 
>>>>>> arr.to_pandas()
>>>>>> 
>>>>>> arr.to_pandas(categories=dw_type.to_pandas_dtype().categories.values)
>>>>>> ```
>>>>>> 
>>>>>> Best,
>>>>>> 
>>>>>> Sam
>>>>>> IMPORTANT NOTICE: The information transmitted is intended only for the 
>>>>>> person or entity to which it is addressed and may contain confidential 
>>>>>> and/or privileged material. Any review, re-transmission, dissemination 
>>>>>> or other use of, or taking of any action in reliance upon, this 
>>>>>> information by persons or entities other than the intended recipient is 
>>>>>> prohibited. If you received this in error, please contact the sender and 
>>>>>> delete the material from any computer. Although we routinely screen for 
>>>>>> viruses, addressees should check this e-mail and any attachment for 
>>>>>> viruses. We make no warranty as to absence of viruses in this e-mail or 
>>>>>> any attachments.
>>>>> 
>>>> 
>
