Ah, thank you for the clarification. Indeed, Arrow dictionaries don't make the 
dictionary part of the schema itself (and the format even allows for 
dictionaries to be updated over time). I wonder if the dictionary type could be 
extended to handle this; alternatively, passing around explicit dictionaries 
alongside the schema might get you most of the way there. (It looks like we 
might need some way to pass a dictionary to from_pandas, or otherwise provide 
some way to dictionary-encode an Arrow array according to an existing 
dictionary.)

-David

On Wed, Jan 5, 2022, at 10:21, Sam Davis wrote:
> Hi Rok, David,
> 
> I think the problem is that the DictionaryType loses the semantic information 
> about the categories. 
> 
> Right now I define the schema for the tables and have logic to parse 
> files/receive data and convert it into RecordBatches ready for writing. This 
> is quite simple: for each row we generate a dictionary of {key: value, ...} 
> as the data comes in, pass a set of these to `pd.DataFrame(...)`, and then 
> convert using `pa.RecordBatch.from_pandas(df, schema=schema)` (I'm aware 
> newer versions have a `pa.record_batch` that can now be used). 
> 
> In this instance the schema specifies to the code and to the user what columns 
> should be present and what the type, and values, of these should be.
> 
> The use of DictionaryArray breaks this as there is no way of specifying the 
> permitted set of values (`dictionary` in your example) in the schema itself? 
> Pandas has CategoricalDtype whereby you can specify `categories` but this 
> information needs to be stored somewhere other than the schema itself and 
> special cased for categorical columns.
> 
> This suggests that it may be a good idea to add the categorical type 
> information?
> 
> Right now it looks like I'll have to define my own schema/field classes that 
> return PyArrow and Pandas types when requested.
> 
> Sam
> 
> 
> 
> 
> 
> *From:* David Li <[email protected]>
> *Sent:* 05 January 2022 14:53
> *To:* [email protected] <[email protected]>
> *Subject:* Re: [Question][Python] Columns with Limited Value Set
>  
> Hi Sam,
> 
> For categoricals, you likely want an Arrow dictionary array. (See docs at 
> [1].) For example:
> 
> >>> import pyarrow as pa
> >>> ty = pa.dictionary(pa.int8(), pa.string())
> >>> arr = pa.array(["a", "a", None, "d"], type=ty)
> >>> arr
> <pyarrow.lib.DictionaryArray object at 0x7fe2fff70890>
> 
> -- dictionary:
>   [
>     "a",
>     "d"
>   ]
> -- indices:
>   [
>     0,
>     0,
>     null,
>     1
>   ]
> >>> table = pa.table([arr], names=["col1"])
> >>> table.to_pandas()
>   col1
> 0    a
> 1    a
> 2  NaN
> 3    d
> >>> table.to_pandas()["col1"]
> 0      a
> 1      a
> 2    NaN
> 3      d
> Name: col1, dtype: category
> Categories (2, object): ['a', 'd']
> 
> Is this sufficient?
> 
> [1]: https://arrow.apache.org/docs/python/data.html#dictionary-arrays
> 
> -David
> 
> 
> On Wed, Jan 5, 2022, at 09:34, Sam Davis wrote:
>> Hi,
>> 
>> I'm looking at defining a schema for a table where one of the values is 
>> inherently categorical/enumerable and we're ultimately ending up loading it 
>> as a Pandas DataFrame. I cannot seem to find a decent way of achieving this.
>> 
>> For example, the column may always be known to contain the values ["a", "b", 
>> "c", "d"]. Stating this as a stringly-typed column in the schema is a bad 
>> idea as it permits all strings and requires more storage than necessary for 
>> longer strings, stating it as an integer column is a bad idea as you lose 
>> context and force the user to cast after loading, and the dictionary type 
>> does not allow you to specify the values in the schema so similarly loses 
>> all meaning.
>> 
>> I have been playing with the API all morning and from what I can tell there 
>> is no easy way of achieving this. Am I missing something obvious? 
>> 
>> ---
>> 
>> One possible route I thought of is to define an extension type and then 
>> implement the `to_pandas_dtype` method. Yes this method permits all known 
>> values whilst in Arrow-land, but it at least documents the known type and, 
>> so I thought, any values not within the `to_pandas_dtype` return will be set 
>> to null on conversion anyway.
>> 
>> However, this seems to require unnecessarily special-casing a whole bunch of 
>> code to handle extension types. e.g. just creating a scalar of this type 
>> requires using a different API. It seems like `pa.scalar` should be able to 
>> work this out? This example defines a wrapper for int32, and then tries to 
>> create a scalar of this type showing that the user has to call a special 
>> method rather than just the normal API:
>> 
>> ```
>> import pyarrow as pa 
>> 
>> 
>> class IntegerWrapper(pa.ExtensionType):
>> 
>>     def __init__(self):
>>         pa.ExtensionType.__init__(self, pa.int32(), "integer_wrapper")
>> 
>>     def __arrow_ext_serialize__(self):
>>         # since we don't have a parameterized type, we don't need extra
>>         # metadata to be deserialized
>>         return b''
>> 
>>     @classmethod
>>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>>         # return an instance of this subclass given the serialized
>>         # metadata.
>>         return IntegerWrapper()
>>    
>> 
>> iw_type = IntegerWrapper()
>> 
>> pa.register_extension_type(iw_type)
>> 
>> # throws `ArrowNotImplementedError`
>> # pa.scalar(0, iw_type)
>> 
>> # user must do this, but code should be able to do this?
>> pa.ExtensionScalar.from_storage(iw_type, pa.scalar(0, iw_type.storage_type))
>> ```
>> 
>> and I can't seem to get the `to_pandas_dtype` to actually work for a wrapped 
>> dictionary. e.g. 
>> 
>> ```
>> import pyarrow as pa 
>> 
>> 
>> class DictWrapper(pa.ExtensionType):
>> 
>>     def __init__(self):
>>         pa.ExtensionType.__init__(
>>             self, pa.dictionary(pa.int8(), pa.string()), "dict_wrapper"
>>         )
>> 
>>     def __arrow_ext_serialize__(self):
>>         # since we don't have a parameterized type, we don't need extra
>>         # metadata to be deserialized
>>         return b''
>> 
>>     @classmethod
>>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>>         # return an instance of this subclass given the serialized
>>         # metadata.
>>         return DictWrapper()
>>    
>>     def to_pandas_dtype(self):
>>         from pandas.api.types import CategoricalDtype
>>         return CategoricalDtype(categories=["a", "b"])
>> 
>> dw_type = DictWrapper()
>> 
>> pa.register_extension_type(dw_type)
>> 
>> arr = pa.ExtensionArray.from_storage( 
>>     dw_type,
>>     pa.array(["a", "b", "c"], dw_type.storage_type)
>> )
>> 
>> arr
>> 
>> arr.to_pandas()
>> 
>> arr.to_pandas(categories=dw_type.to_pandas_dtype().categories.values)
>> ```
>> 
>> Best,
>> 
>> Sam
>> IMPORTANT NOTICE: The information transmitted is intended only for the 
>> person or entity to which it is addressed and may contain confidential 
>> and/or privileged material. Any review, re-transmission, dissemination or 
>> other use of, or taking of any action in reliance upon, this information by 
>> persons or entities other than the intended recipient is prohibited. If you 
>> received this in error, please contact the sender and delete the material 
>> from any computer. Although we routinely screen for viruses, addressees 
>> should check this e-mail and any attachment for viruses. We make no warranty 
>> as to absence of viruses in this e-mail or any attachments.
> 
