Hi,

My use case involves processing large datasets in batches (of rows), each batch resulting in a DataFrame that I serialize to a single file on disk via RecordBatchStreamWriter (so that the resulting file can in turn be read back in batches).

My problem is that some columns are pandas categorical types, and I can't know all the possible categories ahead of time. Since RecordBatchStreamWriter accepts only a single schema, I can't find a way to update the Arrow dictionary or write a new schema for each RecordBatch. This results in an invalid stream/file whose dictionary indices don't match the schema.

Is there currently a way to do this using the high-level APIs? Or would I have to construct the stream manually, writing each batch's schema myself?
It seems this may be related to the open issues in ARROW-3144 <https://issues.apache.org/jira/browse/ARROW-3144> (ARROW-5279 <https://issues.apache.org/jira/browse/ARROW-5279>, ARROW-5336 <https://issues.apache.org/jira/browse/ARROW-5336>) and the discussion in PR-3165 <https://github.com/apache/arrow/pull/3165>, from which I understand that this may already be supported when writing to Parquet, but not in IPC? Is there any other workaround I could use right now?

Many thanks,
T
