Hi Simon,
> would there be a way to "reinterpret" this in-memory layout as an Arrow
> buffer/RecordBatch/whatever and thereby avoid copy operations?

I think you have two options.

Probably the most applicable and fastest: if you always have a fixed number of values per sample, you might want to try FixedSizeBinary. I haven't used it myself yet, but I think its values buffer should look exactly like your one-dimensional C array. You can then unwrap individual samples later on, downstream in your pipeline.

Another possibility: you could treat your one-dimensional C array as the "Arrow values buffer" of an "Arrow list<int8> array", where every list in the Arrow array has length 16. In fact, the format spec shows an example for list<char> which is almost the same:

https://arrow.apache.org/docs/format/Layout.html

The only drawback is that you'd also need to create an offsets buffer if you want to keep reading that data later on through Arrow's API. That will also be slower than FixedSizeBinary, as you have an added level of indirection (memory latency) whenever you access a value.

Hope this helps, and good luck,
Johan

________________________________
From: Simon Dumke <[email protected]>
Sent: Wednesday, July 17, 2019 11:27:58 AM
To: [email protected]
Subject: Beginner Question: HW Input into Arrow RecordBatch

Dear all,

I'm just starting into Apache Arrow (or more like thinking about it). I'm also thinking about using Arrow not only inside our processing pipeline, but in our data acquisition pipeline too. Regarding this, I have the following question:

There are primarily two kinds of DAQ APIs in use here:

* One [e.g. like int getData(unsigned char *data, size_t bufferSize)] takes a pointer to a preallocated buffer and fills it with data from DAQ hardware
* The other [e.g.
like int getData(unsigned char **data)] "returns" a pointer to a buffer created inside the hardware driver, filled with data from DAQ hardware

If I want to use Arrow to transport and handle the data coming out of those APIs, I would usually need to allocate an Arrow Buffer and (with a sweep of copy operations) parse the acquired data into it. If the hardware's output is an interlaced stream of samples (e.g. 16 8-bit values from a 16-channel ADC, followed by the 16 values of the next sample...), that would obviously be row-oriented, and I would therefore need to parse it manually into the Arrow buffer.

The question is now: if the data is only a one-dimensional array of samples (like from a single-channel ADC), or the hardware offers the option to fill the buffer in a non-interlaced / planar manner (meaning all samples from channel 0, followed by all samples of channel 1 and so on - essentially "columnar") - would there be a way to "reinterpret" this in-memory layout as an Arrow buffer/RecordBatch/whatever and thereby avoid copy operations? E.g. by adding a specific "header", or, when using an API of the first type, by providing a pointer into a buffer allocated by Arrow and already prepared for the specific content layout?

I hope my question and intention come through clearly enough. Any ideas would be greatly appreciated!

BTW - can anybody offer some links to Getting Started guides, examples etc. on how to start using Arrow (both C++ and Java)? I still find myself having difficulties finding the right starting point.

Many thanks and kind regards,
Simon

--
----
Simon Dumke
Developer - CoDaC Department Operation
Max Planck Institut for Plasmaphysics
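
To make the difference between the two options in the reply more concrete, here is a rough sketch in plain C++ of how each one would address the same raw byte buffer. It deliberately does not call any Arrow API; it only mimics the memory layouts (a fixed-width values buffer versus a values buffer plus an int32 offsets buffer with one more entry than there are lists). All names (kValuesPerSample, SampleFixedWidth, BuildListOffsets, SampleViaOffsets) are made up for illustration, and the 16 values per sample are taken from the 16-channel ADC example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: 16 one-byte values per sample, as in the 16-channel ADC example.
constexpr std::size_t kValuesPerSample = 16;

// Option 1 (fixed-width view): sample i starts at byte i * width of the raw
// buffer. No additional buffer is needed to locate a sample.
inline const std::uint8_t* SampleFixedWidth(const std::uint8_t* values,
                                            std::size_t i) {
  return values + i * kValuesPerSample;
}

// Option 2 (list view): build the offsets buffer a list layout needs.
// It holds num_samples + 1 entries; list i covers the half-open range
// [offsets[i], offsets[i + 1]) of the values buffer.
std::vector<std::int32_t> BuildListOffsets(std::size_t num_samples) {
  std::vector<std::int32_t> offsets(num_samples + 1);
  for (std::size_t i = 0; i <= num_samples; ++i) {
    offsets[i] = static_cast<std::int32_t>(i * kValuesPerSample);
  }
  return offsets;
}

// Lookup through the offsets buffer: one extra memory read per sample,
// which is the added indirection mentioned in the reply.
inline const std::uint8_t* SampleViaOffsets(
    const std::uint8_t* values, const std::vector<std::int32_t>& offsets,
    std::size_t i) {
  return values + offsets[i];
}
```

Both lookups end up at the same address in the untouched raw buffer; the list variant just needs the extra offsets allocation (4 bytes per sample, plus one entry) and an extra read per access.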
