Re: RecordBatchFileWriter with DictionaryType: Making sure the dictionary stays the same

2022-06-03 Thread Wes McKinney
There's a relevant Jira issue here (maybe some others), if someone wants to pick it up and write a kernel for it: https://issues.apache.org/jira/browse/ARROW-4097. I think having an improved experience around this dictionary conformance/normalization problem would be valuable.

Re: RecordBatchFileWriter with DictionaryType: Making sure the dictionary stays the same

2022-05-31 Thread Weston Pace
I don't think you are missing anything. The Parquet encoding is baked into the data on disk, so re-encoding at some stage is inevitable. Re-encoding in Python like you are doing is going to be inefficient. I think you will want to do the re-encoding in C++. Unfortunately, I don't think we

RecordBatchFileWriter with DictionaryType: Making sure the dictionary stays the same

2022-05-31 Thread Niklas Bivald
Hi, Background: I need to optimize read speed for few-column lookups in large datasets. Currently I have the data in Plasma for fast reads, but Plasma is cumbersome to manage when the data changes frequently (and “locks” the RAM). Instead I’m trying to figure out a fast-enough