I'm not sure whether it's possible at the moment, but it SHOULD be made possible. See ARROW-5340.

On Thu, Feb 25, 2021 at 10:36 AM Joris Peeters <[email protected]> wrote:
>
> Hello,
>
> I have a pandas DataFrame with many string columns (>30,000), and they share
> a low-cardinality set of values (e.g. size 100). I'd like to convert this to
> an Arrow table of dictionary encoded columns (let's say int16 for the index
> cols), but with just one shared dictionary of strings.
> This is to avoid ending up with >30,000 tiny dictionaries on the wire, which
> doesn't even load in e.g. Java (due to a stackoverflow error).
>
> Despite my efforts, I haven't really been able to achieve this with the
> public API's I could find. Does anyone have an idea? I'm using pyarrow 3.0.0.
>
> For a mickey mouse example, I'm looking at e.g.
>
> df = pd.DataFrame({'a': ['foo', None, 'bar'], 'b': [None, 'quux', 'foo']})
>
> and would like a Table with dictionary-encoded columns a and b, both
> nullable, that both refer to the same dictionary with id=0 (or whatever id)
> containing ['foo', 'bar', 'quux'].
>
> Thanks,
> -Joris.
