Hi KB,

Could you be more precise in your question? I understand that you want to get
all the different struct types into Arrow dataframes for analytical
processing, but I need an idea of how you want to deal with the different
types before I can attempt an answer that makes sense.

One dataframe that includes all struct types? One dataframe per type? Do
you want to keep the struct as a struct data type, or do you want to
flatten it? Some other kind of harmonization?

Or, more generally: what exactly are you trying to do, but failing to
achieve?
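
To make one of those options concrete: if you flatten each record type into
its own RecordBatch, a minimal sketch with the arrow crate could look
something like the following. TradeRecord and its two fields are invented
for illustration, so they won't match your actual structs:

use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

// Hypothetical parsed record type; stands in for one of your record structs.
struct TradeRecord {
    id: i64,
    symbol: String,
}

// Flatten a slice of one record type into a RecordBatch, one column per field.
fn to_record_batch(records: &[TradeRecord]) -> Result<RecordBatch, arrow::error::ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("symbol", DataType::Utf8, false),
    ]));

    let ids: ArrayRef = Arc::new(Int64Array::from(
        records.iter().map(|r| r.id).collect::<Vec<i64>>(),
    ));
    let symbols: ArrayRef = Arc::new(StringArray::from(
        records.iter().map(|r| r.symbol.as_str()).collect::<Vec<&str>>(),
    ));

    RecordBatch::try_new(schema, vec![ids, symbols])
}

From there you could write each batch out with Arrow's IPC writer (or the
parquet crate), or first harmonize the schemas if you want everything in one
table, which is why the questions above matter.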

Marnix

On Tue, Feb 22, 2022 at 4:01 AM kekronbekron <[email protected]>
wrote:

> Hello,
>
> Any comments and help please?
>
> - KB
>
> ------- Original Message -------
> On Saturday, February 19th, 2022 at 7:10 PM, kekronbekron <
> [email protected]> wrote:
>
> Hello,
>
> Say I have a record-based binary file. There are 100-200 different types
> of records.
> The layout is of the format:
>
>
> [len-of-record-including-this][record]...[len-of-record-including-this][record]
>
> When reading the file, I split out each record and then process the
> records in parallel, i.e., parse them into the few hundred different types
> of structs.
> At the moment, each thread gives me a struct view of one binary record,
> and using simd-json-derive I can get JSON out of it.
> I'm looking to also output to the Arrow format. RecordBatches, I think?
>
> In essence, I want to convert the record-based binary file (where there
> are, say, 100 different record types) into the Arrow format.
> I can get halfway, which is to get a struct view of the data.
> I could use whatever guidance and advice you have on going from a bunch of
> structs (all processed on different threads) to the Arrow format, to allow
> analytics to 'go mad' with this data.
>
> PS: I'm a forever noob at programming, please be gentle :)
>
> - KB
>
>
>
>
