hi KB,
Thanks. I have only a superficial knowledge of how to do things in Rust,
but I'll attempt to contribute what I know from the PyArrow side of
things.
With regard to laying out the data (flat or nested): personally, I like
being able to include complex data types in a tabular data layout, but only
if they are relatively flat. In my experience, every additional layer of
complexity in a data type like a nested struct spills over into the
complexity of analytical queries, while flat(-ish) lists and structs
actually simplify them. However, before you get a grasp of the data, you
probably want to keep things together for a while, so I'd unpack only the
root level of the struct and leave it at that.
Having said that, I think your main approach could be:
- Convert or express your struct schemas to appropriate Arrow schemas.
As I just explained, I would start by expressing all direct attributes of
the struct as individual fields and take it from there (without flattening
any further), rather than working with a single field containing the entire
struct.
- Then, for each struct type you:
  - unpack the value of each field of the structs into Arrow arrays of
    values corresponding to the schema you just defined.
  - define an Arrow table (or record batches) using the arrays and
    schema you just created.
  - write the record batches to disk. On the PyArrow side of things the
    usual choice is to store data as Parquet or Feather files, but I can't
    find either in the Rust crate (though maybe I'm not looking in the
    right place). There is arrow::ipc::writer::FileWriter, which you could
    use to store the data in the IPC format. I can't digest it quickly from
    the docs, but I expect it to write out uncompressed IPC files that work
    like memory-mapped files: great for analytics, not so great for
    persistent storage. See the sketch after this list for these steps in
    code.
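Putting the steps above together, a minimal sketch with the arrow crate
could look like this (the struct, field names, and file path are invented
for illustration; error handling is kept minimal):

    use std::fs::File;
    use std::sync::Arc;

    use arrow::array::{Float64Array, Int64Array, StringArray};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::ipc::writer::FileWriter;
    use arrow::record_batch::RecordBatch;

    // Hypothetical flat struct after parsing:
    // struct Trade { id: i64, symbol: String, price: f64 }
    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // 1. Express the struct's root-level fields as an Arrow schema.
        let schema = Arc::new(Schema::new(vec![
            Field::new("id", DataType::Int64, false),
            Field::new("symbol", DataType::Utf8, false),
            Field::new("price", DataType::Float64, false),
        ]));

        // 2. Unpack each field across all parsed structs into one
        //    array per column.
        let ids = Int64Array::from(vec![1_i64, 2, 3]);
        let symbols = StringArray::from(vec!["AAA", "BBB", "CCC"]);
        let prices = Float64Array::from(vec![10.5, 20.25, 30.0]);

        // 3. Combine the arrays with the schema into a record batch.
        let batch = RecordBatch::try_new(
            schema.clone(),
            vec![Arc::new(ids), Arc::new(symbols), Arc::new(prices)],
        )?;

        // 4. Write the batch(es) to an Arrow IPC file.
        let file = File::create("trades.arrow")?;
        let mut writer = FileWriter::try_new(file, &schema)?;
        writer.write(&batch)?;
        writer.finish()?;
        Ok(())
    }

You would repeat steps 2-4 once per struct type, ending up with one IPC
file per type.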
Before working on the above, I'd look into Polars [1] if I were you: it's
an analytical dataframe library on top of Arrow written in Rust with Python
bindings. It features an API to read from and write to Parquet, which could
also be a good option for you.
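For example, once you have your unpacked values in Polars columns, writing
Parquet could be as simple as this sketch (assuming the crate's "parquet"
feature is enabled; the column names and path are invented):

    use std::fs::File;
    use polars::prelude::*;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Build a DataFrame from unpacked struct fields.
        let mut df = df![
            "id" => [1_i64, 2, 3],
            "symbol" => ["AAA", "BBB", "CCC"],
        ]?;

        // Write the DataFrame to a Parquet file.
        let file = File::create("records.parquet")?;
        ParquetWriter::new(file).finish(&mut df)?;
        Ok(())
    }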
- Marnix
[1] https://www.pola.rs/
On Tue, Feb 22, 2022 at 1:03 PM kekronbekron <[email protected]>
wrote:
> Hey,
>
> Thanks for writing back.
> Please see if this comment helps clarify what I'm after -
> https://github.com/sharksforarms/deku/issues/219#issuecomment-846548073
>
> Trying this in Rust, I'm using declarative parsing (a Rust crate called
> deku) to parse binary data 'into' different structs and enums.
> When reading the binary file, I'm looking to use Rayon's split feature to
> iterate over the identified binary 'lines'.
> Then, in that par_iter, I use deku to read the bytes 'into' the
> qualifying enum/struct.
> So in each thread, I'll have an output struct of a specific type.
>
> I either need to collect() it all, and then convert to arrow/feather... or
> do it one by one while in the thread, right after I get the struct/enum
> variant.
> Splitting out the different struct types if possible would be great.
> Flattening to a single level is fine, but I'm open to suggestions at this
> point on what would work out best.
>
> In super simple terms, it's to parse out a blob of binary data into
> different arrow-specced files, i.e., one for each struct type.
> There is a crate called serde_arrow, but it's still being worked on at the
> moment (as is deku, anyway).
> At the end of this, I can let loose analytics on those files.
>
> Thanks again for your time answering.
>
> - KB
>
> ------- Original Message -------
> On Tuesday, February 22nd, 2022 at 4:16 PM, Marnix van den Broek <
> [email protected]> wrote:
>
> hi KB,
>
> Can you be more precise in your question? I understand that you want to
> get all the different struct types into Arrow dataframe(s) for analytical
> processing, but I need an idea of how you want to deal with the different
> types before I can attempt to give an answer that makes sense.
>
> One dataframe that includes all struct types? One dataframe per type? Do
> you want to keep the struct as a struct data type, or do you want to
> flatten it? Some other kind of harmonization?
>
> Or, more generally: what exactly are you trying to do, but failing to
> achieve?
>
> Marnix
>
> On Tue, Feb 22, 2022 at 4:01 AM kekronbekron <[email protected]>
> wrote:
>
>> Hello,
>>
>> Any comments and help please?
>>
>> - KB
>>
>> ------- Original Message -------
>> On Saturday, February 19th, 2022 at 7:10 PM, kekronbekron <
>> [email protected]> wrote:
>>
>> Hello,
>>
>> Say I have a record-based binary file. There are 100-200 different types
>> of records.
>> The layout is of the format:
>>
>>
>> [len-of-record-including-this][record]...[len-of-record-including-this][record]
>>
>> When reading the file, I split out each record and then process them in
>> parallel, i.e., parse them into the few hundred different types of
>> structs.
>> At the moment, from each thread that yields a struct view of a binary
>> record, I can get JSON using simd-json-derive.
>> I'm looking to also output to the arrow format. RecordBatches, I think?
>>
>> In essence, I want to convert the record-based binary file (where there
>> are say 100 different record types) into the arrow format.
>> I can get halfway, which is to get a struct view of the data.
>> I could use whatever guidance and advice you have, please, to go from a
>> bunch of structs (all processed/coming out on different threads) to the
>> arrow format, to allow for analytics to 'go mad' with this data.
>>
>> PS: I'm a forever noob at programming, please be gentle :)
>>
>> - KB
>>