I'm not sure whether it's possible at the moment, but it SHOULD be made
possible. See ARROW-5340.

On Thu, Feb 25, 2021 at 10:36 AM Joris Peeters
<[email protected]> wrote:
>
> Hello,
>
> I have a pandas DataFrame with many string columns (>30,000), and they share 
> a low-cardinality set of values (e.g. size 100). I'd like to convert this to 
> an Arrow table of dictionary encoded columns (let's say int16 for the index 
> cols), but with just one shared dictionary of strings.
> This is to avoid ending up with >30,000 tiny dictionaries on the wire, which 
> doesn't even load in e.g. Java (due to a stack overflow error).
>
> Despite my efforts, I haven't really been able to achieve this with the 
> public APIs I could find. Does anyone have an idea? I'm using pyarrow 3.0.0.
>
> For a mickey mouse example, I'm looking at e.g.
>
> df = pd.DataFrame({'a': ['foo', None, 'bar'], 'b': [None, 'quux', 'foo']})
>
> and would like a Table with dictionary-encoded columns a and b, both 
> nullable, that both refer to the same dictionary with id=0 (or whatever id) 
> containing ['foo', 'bar', 'quux'].
>
> Thanks,
> -Joris.
