Re: pyarrow: write table where columns share the same dictionary

Joris Peeters Fri, 26 Feb 2021 02:11:38 -0800

FWIW, in the Java client, the aforementioned stack overflow when reading
lots of dictionaries from a stream is caused by
https://github.com/apache/arrow/blob/apache-arrow-3.0.0/java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowStreamReader.java#L131
i.e. the recursive construct:


    public boolean loadNextBatch() throws IOException {
    ..
      if (..) return true;
      else {
        ..
        return loadNextBatch();
      }
    }

Not sure if that qualifies as a bug, since I think the recursion depth it
takes to overflow the stack is typically several thousand, but perhaps of
interest.
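
To make the failure mode concrete, here is a minimal Python analogue of the
pattern above. It is not the Arrow Java code: the message list and both
function names are hypothetical, and it assumes (as the snippet suggests)
that every dictionary message costs one more recursive call before a record
batch is reached. The second function shows the same logic written as a
loop, which keeps the call depth constant.

```python
def load_next_batch_recursive(messages, pos=0):
    """Return the index just past the next record batch, or None at end of stream.

    Mirrors the recursive shape: a non-batch message is consumed and the
    function calls itself again, so N dictionaries cost N stack frames.
    """
    if pos >= len(messages):
        return None
    if messages[pos] == "batch":
        return pos + 1
    # Dictionary message: consume it, then recurse to look for the batch.
    return load_next_batch_recursive(messages, pos + 1)


def load_next_batch_iterative(messages, pos=0):
    """Same contract as above, but skips dictionary messages in a loop."""
    while pos < len(messages):
        if messages[pos] == "batch":
            return pos + 1
        pos += 1  # dictionary message: consume it and keep going
    return None


# Hypothetical stream: 30,000 dictionary messages followed by one record batch.
messages = ["dictionary"] * 30_000 + ["batch"]

try:
    load_next_batch_recursive(messages)
    recursive_ok = True
except RecursionError:
    # Python's analogue of a Java StackOverflowError.
    recursive_ok = False

iterative_result = load_next_batch_iterative(messages)
```

With Python's default recursion limit the recursive version fails on this
input, while the loop handles it regardless of how many dictionaries come
before the batch.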


On Thu, Feb 25, 2021 at 8:11 PM Wes McKinney <[email protected]> wrote:

> I'm not sure if it's possible at the moment, but it SHOULD be made
> possible. See ARROW-5340
>
> On Thu, Feb 25, 2021 at 10:36 AM Joris Peeters
> <[email protected]> wrote:
> >
> > Hello,
> >
> > I have a pandas DataFrame with many string columns (>30,000), and they
> > share a low-cardinality set of values (e.g. size 100). I'd like to convert
> > this to an Arrow table of dictionary encoded columns (let's say int16 for
> > the index cols), but with just one shared dictionary of strings.
> > This is to avoid ending up with >30,000 tiny dictionaries on the wire,
> > which doesn't even load in e.g. Java (due to a stack overflow error).
> >
> > Despite my efforts, I haven't really been able to achieve this with the
> > public APIs I could find. Does anyone have an idea? I'm using pyarrow
> > 3.0.0.
> >
> > For a mickey mouse example, I'm looking at e.g.
> >
> >     df = pd.DataFrame({'a': ['foo', None, 'bar'], 'b': [None, 'quux', 'foo']})
> >
> > and would like a Table with dictionary-encoded columns a and b, both
> > nullable, that both refer to the same dictionary with id=0 (or whatever id)
> > containing ['foo', 'bar', 'quux'].
> >
> > Thanks,
> > -Joris.
> >
>
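
For reference, the encoding asked for in the original question can be
sketched in plain Python. This only computes the shared dictionary and the
per-column indices for the small example; it does not use pyarrow and does
not settle whether the IPC writer will emit a single dictionary (the open
point tracked in ARROW-5340). The function name is hypothetical.

```python
def encode_with_shared_dictionary(columns):
    """Encode every column against one dictionary, in first-seen order.

    `columns` maps column name -> list of strings (or None for nulls).
    Returns (dictionary, encoded) where `encoded` maps column name -> list
    of integer indices into `dictionary`, with None preserved for nulls.
    """
    dictionary = []   # shared list of distinct values
    positions = {}    # value -> index in `dictionary`
    encoded = {}
    for name, values in columns.items():
        indices = []
        for value in values:
            if value is None:
                indices.append(None)  # keep the column nullable
                continue
            if value not in positions:
                positions[value] = len(dictionary)
                dictionary.append(value)
            indices.append(positions[value])
        encoded[name] = indices
    return dictionary, encoded


# The example from the question.
columns = {"a": ["foo", None, "bar"], "b": [None, "quux", "foo"]}
dictionary, encoded = encode_with_shared_dictionary(columns)
```

Here `dictionary` comes out as ['foo', 'bar', 'quux'], with column a encoded
as [0, None, 1] and column b as [None, 2, 0]. In pyarrow, indices and a
dictionary of this shape can be combined with
`pa.DictionaryArray.from_arrays`, but that alone does not guarantee the
columns share one dictionary on the wire.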
