Hi KB, can you be more precise in your question? I understand that you want to get all the different struct types in an Arrow dataframe(s) for analytical processing, but I need an idea of how you want to deal with the different types before I can attempt to give an answer that makes sense.
One dataframe that includes all struct types? One dataframe per type? Do you want to keep the struct as a struct data type, or do you want to flatten it? Some other kind of harmonization?

Or, more generally: what exactly are you trying to do, and where does it fail?

Marnix

On Tue, Feb 22, 2022 at 4:01 AM kekronbekron <[email protected]> wrote:

> Hello,
>
> Any comments and help please?
>
> - KB
>
> ------- Original Message -------
> On Saturday, February 19th, 2022 at 7:10 PM, kekronbekron <[email protected]> wrote:
>
> Hello,
>
> Say I have a record-based binary file. There are 100-200 different types of records.
> The layout is of the format:
>
> [len-of-record-including-this][record]...[len-of-record-including-this][record]
>
> When reading the file, I split out each record and then process them in parallel, i.e., parse them into the few hundred different types of structs.
> At the moment, from each thread that gives a struct view of this binary record, using simd-json-derive, I can get JSON output.
> I'm looking to also output to the Arrow format. RecordBatches, I think?
>
> In essence, I want to convert the record-based binary file (where there are, say, 100 different record types) into the Arrow format.
> I can get halfway, which is to get a struct view of the data.
> I could use whatever guidance and advice you have, please, to go from a bunch of structs (all processed on, and coming out of, different threads) to the Arrow format, to allow analytics to 'go mad' with this data.
>
> PS: I'm a forever noob at programming, please be gentle :)
>
> - KB
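For the length-prefixed layout described in the quoted message, the first step (splitting the file into individual records before dispatching them to worker threads) can be sketched in Rust like this. It is only a sketch under stated assumptions: the length field is taken to be a 4-byte little-endian integer that includes its own size, and the `split_records` name is made up for illustration; adjust both to the real format.

```rust
// Sketch: split a buffer of [len-including-this][record]... into payload slices.
// Assumes a 4-byte little-endian length prefix that counts itself.
fn split_records(buf: &[u8]) -> Vec<&[u8]> {
    let mut records = Vec::new();
    let mut pos = 0;
    while pos + 4 <= buf.len() {
        let len = u32::from_le_bytes([buf[pos], buf[pos + 1], buf[pos + 2], buf[pos + 3]])
            as usize;
        // A well-formed length is at least the prefix itself and stays in bounds.
        if len < 4 || pos + len > buf.len() {
            break; // malformed record; a real parser would report an error here
        }
        records.push(&buf[pos + 4..pos + len]); // payload without the length prefix
        pos += len;
    }
    records
}
```

Each returned slice can then be handed to a thread that parses it into the matching struct type.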
