Just wondering...
If there is a pretty basic C program that runs as a service, reads the binary
data from some memory location, and can use the type definitions from the
headers directly, is it an option (and is it less work) to extend this C
program a little so that the parsing is left entirely to the headers, and it
produces instances of structs for the different record types instead?
If so, it would be significantly more helpful (and easier for adoption) if
nanoarrow were able to convert an arbitrary, possibly nested, struct into an
ArrowSchema.

Example - 
https://github.com/IBM/IBM-Z-zOS/blob/main/SMF-Tools/SMFReal/smfreal.c and 
https://github.com/IBM/IBM-Z-zOS/blob/main/SMF-Tools/SMFReal/smfreal.h
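
To make the "nested struct into ArrowSchema" idea concrete, here is a rough
sketch in Python rather than C. It uses ctypes as a stand-in for the header
definitions and does not call nanoarrow at all; it only walks a struct and
emits the format strings that the Arrow C data interface defines ("+s" for a
struct, "+w:N" for a fixed-size list, "S"/"I"/etc. for fixed-width integers).
Header, Record and arrow_field are hypothetical names made up for illustration
and are not taken from smfreal.h. Unions, pointers and bit-fields are rejected
rather than guessed at.

```python
import ctypes

# Arrow C data interface format strings for fixed-width primitives.
_PRIMITIVE_FORMATS = {
    ctypes.c_int8: "c", ctypes.c_uint8: "C",
    ctypes.c_int16: "s", ctypes.c_uint16: "S",
    ctypes.c_int32: "i", ctypes.c_uint32: "I",
    ctypes.c_int64: "l", ctypes.c_uint64: "L",
    ctypes.c_float: "f", ctypes.c_double: "g",
}


def arrow_field(name, ctype):
    """Describe one C member as a (name, format, children) tuple."""
    if ctype in _PRIMITIVE_FORMATS:
        return (name, _PRIMITIVE_FORMATS[ctype], [])
    if isinstance(ctype, type) and issubclass(ctype, ctypes.Structure):
        children = []
        for member in ctype._fields_:
            if len(member) != 2:
                # A third element is a bit width; not handled in this sketch.
                raise TypeError(f"bit-field in {ctype.__name__} is not handled")
            children.append(arrow_field(member[0], member[1]))
        return (name, "+s", children)
    if isinstance(ctype, type) and issubclass(ctype, ctypes.Array):
        # Fixed-length C array -> Arrow fixed-size list with one child.
        item = arrow_field("item", ctype._type_)
        return (name, f"+w:{ctype._length_}", [item])
    # Unions, pointers, char buffers, etc. need a hand-written mapping.
    raise TypeError(f"no Arrow mapping for {ctype!r} (member {name!r})")


# Hypothetical record layout, loosely in the spirit of a header + body record.
class Header(ctypes.Structure):
    _fields_ = [("length", ctypes.c_uint16), ("record_type", ctypes.c_uint8)]


class Record(ctypes.Structure):
    _fields_ = [
        ("header", Header),
        ("timestamp", ctypes.c_uint32),
        ("counters", ctypes.c_uint32 * 4),
    ]


def dump(field, indent=0):
    """Print the nested description, one member per line."""
    name, fmt, children = field
    print(" " * indent + f"{name}: {fmt}")
    for child in children:
        dump(child, indent + 2)


dump(arrow_field("record", Record))
```

The same recursive walk is presumably what a C version would do when filling
in ArrowSchema children, with the struct layout coming from the real headers
instead of ctypes classes.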


On Thursday, March 7th, 2024 at 22:10, Felipe Oliveira Carvalho 
<[email protected]> wrote:

> Makes sense.
> 
> Converting arbitrary C headers to Arrow types is not solvable in the
> general case, but you could write a Python (or your language of choice)
> script that parses the specific .h files you care about (regex hacks
> might be enough) and generates a C++ file that uses the DuckDB
> appender [1] to convert these specific types. The generated file would
> include the C header, so you wouldn't have to care about ABI details
> regarding the packing of bits in C structs.
> 
> [1] https://duckdb.org/docs/data/appender.html
> 
> On Thu, Mar 7, 2024 at 12:20 PM kekronbekron
> [email protected] wrote:
> 
> > Hi Felipe,
> > 
> > The intent is to convert from a binary format to one that is common outside 
> > my mainframe pond... Arrow spec.
> > There are tens to hundreds of millions of records and hundreds of types & 
> > subtypes of these records.
> > So definitely more than one at a time.
> > I want to make a mini DB (ex: a .duckdb file) of each record type+subtype, 
> > so that exploring within a type is fast, and joining stuff is equally fast 
> > & easy.
> > 
> > Once converted, it's just a matter of accessing them via S3 or whatever.
> > 
> > On Thursday, March 7th, 2024 at 20:04, Felipe Oliveira Carvalho 
> > [email protected] wrote:
> > 
> > > What are you trying to achieve in converting these structs to arrays
> > > partitioned by columns?
> > > Are you transferring batches of them from/to somewhere?
> > > > The Arrow format is not a good fit if you intend to process records one at a time.
> > > 
> > > On Wed, Mar 6, 2024 at 12:33 PM kekronbekron
> > > [email protected] wrote:
> > > 
> > > > Also considering derive crates for Arrow, but it seems to be very early 
> > > > days for it.
> > > > If I can go from Rust structures to Arrow through derive macros, that 
> > > > would be the least amount of work one has to do as a user.
> > > > Code for such derive macros is certainly a lot of work...
> > > > There's arrow2_convert, serde_arrow, and narrow. narrow seems to be 
> > > > more promising.
> > > > 
> > > > Although I conceptually like the example you've shown (python cffi + 
> > > > header file to generate schema, then running the C program),
> > > > I wonder if I'm better off with python/rust (than C/C++), despite 
> > > > needing to type out the structures manually for python/rust.
> > > > 
> > > > On Wednesday, March 6th, 2024 at 19:07, Dewey Dunnington via user 
> > > > [email protected] wrote:
> > > > 
> > > > > Hi KB,
> > > > > 
> > > > > > I imagine you will need a mix of generated and manually typed code to
> > > > > > generate the ArrowSchema from the definition and a recipe to build the
> > > > > > ArrowArray from an instance, perhaps starting with well-tested
> > > > > > manually typed code that you replace with generated code as patterns
> > > > > > appear.
> > > > > 
> > > > > I think nanoarrow is appropriate for what you are trying to do...it
> > > > > provides a "straightforward" (in terms of packaging complexity) path
> > > > > to wrapping your generator functions in Rust and Python. We haven't
> > > > > done a great job of documenting how to do that with examples but feel
> > > > > free to ask here or open an issue in apache/arrow-nanoarrow asking for
> > > > > help until we do.
> > > > > 
> > > > > Cheers!
> > > > > 
> > > > > -dewey
> > > > > 
> > > > > On Tue, Mar 5, 2024 at 11:14 PM kekronbekron
> > > > > [email protected] wrote:
> > > > > 
> > > > > > Hi Dewey,
> > > > > > 
> > > > > > Thank you for taking the time.
> > > > > > My goal is to convert from a variety of big C data structures like 
> > > > > > this to equivalent Arrow spec/schema.
> > > > > > Then, I would like to store them (RecordBatches) to parquet or any 
> > > > > > other relevant type.
> > > > > > The CSV or JSON output from the example C program (smf84fmt) 
> > > > > > doesn't matter; just wanted to point to the sample data format as 
> > > > > > in the header file.
> > > > > > 
> > > > > > I had tried bindgen to create Rust definitions from the header 
> > > > > > files, but it gets complicated real fast... more than I can 
> > > > > > comprehend at least.
> > > > > > 
> > > > > > The types get crazier too, with singly linked lists (not there in 
> > > > > > the linked example, but in other types), etc.
> > > > > > 
> > > > > > > Would really like to solve this in a systematic way, without needing 
> > > > > > to hand code the Arrow schema...
> > > > > > Because the C header files are maintained (by a provider), it would 
> > > > > > work out best if it's possible to create a conversion script, and 
> > > > > > then use the Arrow schema in Python/Rust/etc.
> > > > > > 
> > > > > > -KB
> > > > > > 
> > > > > > On Wednesday, March 6th, 2024 at 07:59, Dewey Dunnington via user 
> > > > > > [email protected] wrote:
> > > > > > 
> > > > > > > Hi KB,
> > > > > > > 
> > > > > > > > There might be some other approaches I'm not aware of; however, I had
> > > > > > > > some fun with Python's cffi package to generate some (untested)
> > > > > > > > nanoarrow code based on the struct definitions [1]. If all you need
> > > > > > > > are the types in Python or some other higher-level language (e.g., to
> > > > > > > > read one of the CSV or JSON files generated by the tool you linked),
> > > > > > > > you could generate Python code instead.
> > > > > > > 
> > > > > > > I hope that's helpful!
> > > > > > > 
> > > > > > > -dewey
> > > > > > > 
> > > > > > > [1] 
> > > > > > > https://gist.github.com/paleolimbot/e1667a57f837e4db7e973b9677e33ddb
> > > > > > > 
> > > > > > > On Sun, Mar 3, 2024 at 10:08 PM kekronbekron
> > > > > > > [email protected] wrote:
> > > > > > > 
> > > > > > > > Hello,
> > > > > > > > 
> > > > > > > > Say I have a whole bunch of fully typed (with unions and all) 
> > > > > > > > data structures like the one here - 
> > > > > > > > https://github.com/IBM/IBM-Z-zOS/blob/main/SMF-Tools/SMF84Formatter/smf84fmt.h.
> > > > > > > > Say I'm parsing bytes with such a header...is it possible to 
> > > > > > > > then use Arrow's C data interface (or maybe nanoarrow) to 
> > > > > > > > painlessly convert such a struct to Arrow type(s)?
> > > > > > > > 
> > > > > > > > - KB
