Hi,
My use case involves processing large datasets in batches of rows, where each
batch results in a DataFrame that I serialize to a single file on disk via
RecordBatchStreamWriter (so that the file can in turn be read back in
batches). The problem is that some columns are pandas categorical types, and
I can't know all the possible categories ahead of time. Since
RecordBatchStreamWriter accepts only a single schema, I can't find a way to
update the Arrow dictionary or to write a new schema for each RecordBatch,
which results in an invalid stream/file whose dictionary indices don't match
the schema. Is there currently a way to do this using the high-level APIs, or
would I have to construct the stream manually using each batch's schema etc.?

It seems that this may be related to the open issues ARROW-3144
<https://issues.apache.org/jira/browse/ARROW-3144> (ARROW-5279
<https://issues.apache.org/jira/browse/ARROW-5279>, ARROW-5336
<https://issues.apache.org/jira/browse/ARROW-5336>) and the discussion in
PR-3165 <https://github.com/apache/arrow/pull/3165>, from which I understand
that this may already be supported when writing to Parquet, but not yet for
IPC. Is there a workaround I could use right now?
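
One fallback I can think of (untested, just a sketch) is to give up on
dictionary encoding altogether and cast the categorical columns back to plain
values before writing, so that every batch shares the same non-dictionary
schema:

    import pandas as pd
    import pyarrow as pa

    def batch_without_dictionaries(df):
        # Cast categorical columns to plain (object) values so the schema
        # is identical for every batch; this avoids the dictionary mismatch
        # at the cost of losing dictionary encoding on disk.
        df = df.copy()
        for col in df.select_dtypes(include="category").columns:
            df[col] = df[col].astype(object)
        return pa.RecordBatch.from_pandas(df, preserve_index=False)

That obviously gives up the compactness of the dictionary representation,
though, so I'd prefer a way to keep the categoricals if one exists.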

Many thanks,
T
