Ok, thanks for letting me know! I assume the same holds for the file writer class and will keep an eye on the thread...
On Mon, 14 Oct 2019 at 22:56, Wes McKinney <[email protected]> wrote:

> hi Thomas,
>
> The stream writer class currently only supports a constant dictionary.
> The work in ARROW-3144 moved the dictionary out of the schema and into
> the DictionaryArray data structure, so this is necessary to allow
> changing dictionaries in a stream.
>
> To support your use case, we either need dictionary deltas or
> dictionary replacements to be implemented. These are provided for in
> the format, but have not been implemented yet in C++.
>
> Note there's a mailing list thread on dev@ going on right now about
> finalizing low level details of dictionary encoding in the columnar
> format specification
>
> https://lists.apache.org/thread.html/d0f137e9db0abfcfde2ef879ca517a710f620e5be4dd749923d22c37@%3Cdev.arrow.apache.org%3E
>
> I just opened https://issues.apache.org/jira/browse/ARROW-6883 since I
> didn't see another issue covering this
>
> - Wes
>
> On Mon, Oct 14, 2019 at 8:41 AM Thomas Buhrmann <[email protected]> wrote:
> >
> > Hi,
> > My use case involves processing large datasets in batches (of rows),
> > each batch resulting in a DataFrame that I'm serializing to a single
> > file on disk via RecordBatchStreamWriter (to end up with a file that
> > can in turn be read in batches). My problem is that some columns are
> > pandas categorical types, for which I can't know ahead of time all the
> > possible categories. And since the RecordBatchStreamWriter accepts only
> > a single schema, I can't seem to find a way to update the Arrow
> > dictionary, or write a new schema for each RecordBatch. This results in
> > an invalid stream/file with dictionary indices that don't match the
> > schema. Is there currently a way to do this using the high-level APIs?
> > Or would I have to manually construct the stream using each batch's
> > schema etc.?
> >
> > It seems that this may be related to the open issues in ARROW-3144
> > (ARROW-5279, ARROW-5336) and the discussion in PR-3165, from which I
> > understand that this may be supported already when writing to parquet,
> > but not in IPC? Is there any other workaround I could use right now?
> >
> > Many thanks,
> > T
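
To illustrate the "dictionary deltas" idea mentioned in the thread, here is a minimal pure-Python sketch. It is a conceptual model only: `DeltaDictionaryEncoder` and `decode_stream` are hypothetical names made up for this example, not part of pyarrow or the Arrow C++ API, and the real wire format is whatever the Arrow columnar specification defines. The point is just that each batch can carry the values it introduces, so indices stay valid even though the categories are not known up front.

```python
from typing import Dict, Iterable, List, Tuple


class DeltaDictionaryEncoder:
    """Toy model of a dictionary that grows across batches (not an Arrow API)."""

    def __init__(self) -> None:
        self.values: List[str] = []       # dictionary values, in index order
        self._index: Dict[str, int] = {}  # value -> position in self.values

    def encode_batch(self, column: Iterable[str]) -> Tuple[List[str], List[int]]:
        """Return (delta, indices) for one batch.

        delta   -- values seen for the first time in this batch
        indices -- one dictionary index per row, valid against the
                   dictionary accumulated so far (including this delta)
        """
        delta: List[str] = []
        indices: List[int] = []
        for value in column:
            if value not in self._index:
                self._index[value] = len(self.values)
                self.values.append(value)
                delta.append(value)
            indices.append(self._index[value])
        return delta, indices


def decode_stream(messages: Iterable[Tuple[List[str], List[int]]]) -> List[List[str]]:
    """Rebuild the original batches by applying each delta before its indices."""
    dictionary: List[str] = []
    batches: List[List[str]] = []
    for delta, indices in messages:
        dictionary.extend(delta)  # append the new values to the reader's dictionary
        batches.append([dictionary[i] for i in indices])
    return batches


# Three batches whose categories are not all known when the first is written.
encoder = DeltaDictionaryEncoder()
batches = [["a", "b", "a"], ["b", "c"], ["d", "a"]]
messages = [encoder.encode_batch(batch) for batch in batches]
decoded = decode_stream(messages)
```

With a constant dictionary fixed by the first batch (["a", "b"]), the later values "c" and "d" would have no valid index, which is the mismatch described in the original question.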
