Makes sense.

Converting arbitrary C headers to Arrow types is not solvable in the
general case, but you could write a script in Python (or your language
of choice) that parses the specific .h files you care about (regex
hacks might be enough) and generates a C++ file that uses the DuckDB
appender [1] to convert these specific types. The generated file would
#include the C header, so the compiler handles the struct layout and
you wouldn't have to care about ABI details regarding the packing of
bits in C structs.

[1] https://duckdb.org/docs/data/appender.html
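
To make the idea concrete, here is a rough, untested-against-real-headers
sketch of such a generator. Everything named here is an assumption for
illustration: the sample_rec struct, the sample.h file name, the
TYPE_MAP table and the append_<struct> naming are all made up. The
regexes only handle flat structs with scalar fields (bit-fields
included); nested structs, unions, arrays and pointers would need real
parsing or hand-written code. The emitted C++ relies on the appender's
BeginRow/Append<T>/EndRow calls described in [1].

```python
import re

# Hypothetical mapping: C field type -> (DuckDB column type, C++ type for Append<T>).
TYPE_MAP = {
    "int32_t": ("INTEGER", "int32_t"),
    "int64_t": ("BIGINT", "int64_t"),
    "uint32_t": ("UINTEGER", "uint32_t"),
    "uint64_t": ("UBIGINT", "uint64_t"),
    "double": ("DOUBLE", "double"),
}

# Matches "struct name { ... };" as long as the body has no nested braces.
STRUCT_RE = re.compile(r"struct\s+(\w+)\s*\{(.*?)\}\s*;", re.S)
# Matches "type name;" and "type name : width;" (bit-field) declarations.
FIELD_RE = re.compile(r"^\s*(\w+)\s+(\w+)\s*(?::\s*\d+)?\s*;", re.M)


def parse_structs(header_text):
    """Return {struct_name: [(c_type, field_name), ...]}.

    Fields whose type is not in TYPE_MAP are skipped in this sketch.
    """
    structs = {}
    for struct_name, body in STRUCT_RE.findall(header_text):
        fields = FIELD_RE.findall(body)
        structs[struct_name] = [(t, n) for t, n in fields if t in TYPE_MAP]
    return structs


def generate_cpp(header_name, struct_name, fields):
    """Emit C++ source that appends one struct instance as one DuckDB row."""
    columns = ", ".join(f"{n} {TYPE_MAP[t][0]}" for t, n in fields)
    appends = "\n".join(
        f"    appender.Append<{TYPE_MAP[t][1]}>(rec.{n});" for t, n in fields
    )
    return f'''#include "duckdb.hpp"
#include "{header_name}"

// Generated code: the compiler reads each field (bit-fields included)
// using the layout from the included C header.
void append_{struct_name}(duckdb::Appender &appender,
                          const struct {struct_name} &rec) {{
    appender.BeginRow();
{appends}
    appender.EndRow();
}}

static const char *create_{struct_name}_sql =
    "CREATE TABLE {struct_name} ({columns})";
'''


# Made-up header contents, standing in for one of the real .h files.
EXAMPLE_HEADER = """
struct sample_rec {
    uint32_t rec_len;
    uint32_t flags : 4;
    double   elapsed;
};
"""

if __name__ == "__main__":
    for name, fields in parse_structs(EXAMPLE_HEADER).items():
        print(generate_cpp("sample.h", name, fields))
```

Since the generated function reads rec.flags directly, the bit-field
width never has to be known by the script; only the field name and a
wide-enough target type do.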

On Thu, Mar 7, 2024 at 12:20 PM kekronbekron
<[email protected]> wrote:
>
> Hi Felipe,
>
> The intent is to convert from a binary format to one that is common outside 
> my mainframe pond... Arrow spec.
> There are tens to hundreds of millions of records and hundreds of types & 
> subtypes of these records.
> So definitely more than one at a time.
> I want to make a mini DB (ex: a .duckdb file) of each record type+subtype, so 
> that exploring within a type is fast, and joining stuff is equally fast & 
> easy.
>
> Once converted, it's just a matter of accessing them via S3 or whatever.
>
>
> On Thursday, March 7th, 2024 at 20:04, Felipe Oliveira Carvalho 
> <[email protected]> wrote:
>
> > What are you trying to achieve in converting these structs to arrays
> > partitioned by columns?
> > Are you transferring batches of them from/to somewhere?
> > The Arrow format is not good if you intend to process one at a time.
> >
> > On Wed, Mar 6, 2024 at 12:33 PM kekronbekron
> > [email protected] wrote:
> >
> > > Also considering derive crates for Arrow, but it seems to be very early 
> > > days for it.
> > > If I can go from Rust structures to Arrow through derive macros, that 
> > > would be the least amount of work one has to do as a user.
> > > Code for such derive macros is certainly a lot of work...
> > > There's arrow2_convert, serde_arrow, and narrow. narrow seems to be more 
> > > promising.
> > >
> > > Although I conceptually like the example you've shown (python cffi + 
> > > header file to generate schema, then running the C program),
> > > I wonder if I'm better off with python/rust (than C/C++), despite needing 
> > > to type out the structures manually for python/rust.
> > >
> > > On Wednesday, March 6th, 2024 at 19:07, Dewey Dunnington via user 
> > > [email protected] wrote:
> > >
> > > > Hi KB,
> > > >
> > > > I imagine you will need a mix of generated and manually typed code to
> > > > generate the ArrowSchema from the definition and recipe to build the
> > > > ArrowArray from an instance, perhaps starting with well-tested
> > > > manually typed code that you replace with generated code as patterns
> > > > appear.
> > > >
> > > > I think nanoarrow is appropriate for what you are trying to do...it
> > > > provides a "straightforward" (in terms of packaging complexity) path
> > > > to wrapping your generator functions in Rust and Python. We haven't
> > > > done a great job of documenting how to do that with examples but feel
> > > > free to ask here or open an issue in apache/arrow-nanoarrow asking for
> > > > help until we do.
> > > >
> > > > Cheers!
> > > >
> > > > -dewey
> > > >
> > > > On Tue, Mar 5, 2024 at 11:14 PM kekronbekron
> > > > [email protected] wrote:
> > > >
> > > > > Hi Dewey,
> > > > >
> > > > > Thank you for taking the time.
> > > > > My goal is to convert from a variety of big C data structures like 
> > > > > this to equivalent Arrow spec/schema.
> > > > > Then, I would like to store them (RecordBatches) to parquet or any 
> > > > > other relevant type.
> > > > > The CSV or JSON output from the example C program (smf84fmt) doesn't 
> > > > > matter; just wanted to point to the sample data format as in the 
> > > > > header file.
> > > > >
> > > > > I had tried bindgen to create Rust definitions from the header files, 
> > > > > but it gets complicated real fast... more than I can comprehend at 
> > > > > least.
> > > > >
> > > > > The types get crazier too, with singly linked lists (not there in the 
> > > > > linked example, but in other types), etc.
> > > > >
> > > > > > Would really like to solve this in a systematic way, without needing 
> > > > > to hand code the Arrow schema...
> > > > > Because the C header files are maintained (by a provider), it would 
> > > > > work out best if it's possible to create a conversion script, and 
> > > > > then use the Arrow schema in Python/Rust/etc.
> > > > >
> > > > > -KB
> > > > >
> > > > > On Wednesday, March 6th, 2024 at 07:59, Dewey Dunnington via user 
> > > > > [email protected] wrote:
> > > > >
> > > > > > Hi KB,
> > > > > >
> > > > > > > There might be some other approaches I'm not aware of; however,
> > > > > > > I had some fun with Python's cffi package to generate some
> > > > > > > (untested) nanoarrow code based on the struct definitions [1].
> > > > > > > If all you need are the types in Python or some other
> > > > > > > higher-level language (e.g., to read one of the CSV or JSON
> > > > > > > files generated by the tool you linked), you could generate
> > > > > > > Python code instead.
> > > > > >
> > > > > > I hope that's helpful!
> > > > > >
> > > > > > -dewey
> > > > > >
> > > > > > [1] 
> > > > > > https://gist.github.com/paleolimbot/e1667a57f837e4db7e973b9677e33ddb
> > > > > >
> > > > > > On Sun, Mar 3, 2024 at 10:08 PM kekronbekron
> > > > > > [email protected] wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > Say I have a whole bunch of fully typed (with unions and all) 
> > > > > > > data structures like the one here - 
> > > > > > > https://github.com/IBM/IBM-Z-zOS/blob/main/SMF-Tools/SMF84Formatter/smf84fmt.h.
> > > > > > > Say I'm parsing bytes with such a header...is it possible to then 
> > > > > > > use Arrow's C data interface (or maybe nanoarrow) to painlessly 
> > > > > > > convert such a struct to Arrow type(s)?
> > > > > > >
> > > > > > > - KB
