Re: Beginner Question: HW Input into Arrow RecordBatch

Johan Peltenburg - EWI Wed, 17 Jul 2019 05:01:34 -0700

Hi Simon,


> would there be a way to "reinterpret" this in-memory layout as an Arrow 
> buffer/RecordBatch/whatever and thereby avoid copy operations?


I think you have two options.


Probably the most applicable/fastest: if you always have a fixed number of 
values per sample, you might want to try FixedSizeBinary.

I haven't used it myself yet but I think its value buffer should look exactly 
like your one-dimensional C-array.

You can then unwrap individual samples later on in the downstream of your 
pipeline.
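
As a rough illustration of why a fixed-width layout can sit directly on top of 
the DAQ buffer, here is a sketch in plain C++. It does not use the Arrow API; 
FixedWidthView and view_fixed_width are made-up names for this example, and 
the width of 16 bytes is only the 16-channel case from the question. The point 
is that value i always starts at byte i * byte_width, so nothing needs to be 
copied or indexed separately.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of Arrow): a read-only view that treats a
// flat byte buffer as a sequence of fixed-width values.
struct FixedWidthView {
    const std::uint8_t* data;  // start of the flat buffer (not owned, not copied)
    std::size_t byte_width;    // bytes per value, e.g. 16 for 16 channels x 8 bit
    std::size_t length;        // number of complete values in the buffer

    // Value i starts at a fixed stride, so no offsets buffer is required.
    const std::uint8_t* value(std::size_t i) const {
        assert(i < length);
        return data + i * byte_width;
    }
};

// Wraps an existing buffer without copying. Trailing bytes that do not fill
// a whole value are ignored.
inline FixedWidthView view_fixed_width(const std::uint8_t* data,
                                       std::size_t size_bytes,
                                       std::size_t byte_width) {
    assert(byte_width > 0);
    return FixedWidthView{data, byte_width, size_bytes / byte_width};
}
```

With a buffer filled by the driver, view_fixed_width(buf, size, 16).value(i) 
would point at the 16 bytes of sample i inside the original memory.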


Another possibility is that you could consider your one-dimensional C-array as 
the "Arrow values buffer" of an "Arrow list<int8> array", where every list in 
the Arrow array has length 16.

In fact, the format spec shows an example for list<char> here which is almost 
the same: https://arrow.apache.org/docs/format/Layout.html

The only drawback would be that you'd also need to create an offsets buffer if 
you want to continue to read that data later on through Arrow's API. That 
will also be slower than FixedSizeBinary, as you have an added level of 
indirection (memory latency) when you want to access a value.
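
To make the offsets buffer concrete, here is a sketch in plain C++ (again not 
the Arrow API; make_uniform_offsets and list_element are made-up names). It 
assumes the standard list layout with 32-bit signed offsets, where an array of 
n lists has n + 1 offsets and list i spans values[offsets[i]] up to, but not 
including, values[offsets[i + 1]]. For lists that all have length 16 the 
offsets are simply 0, 16, 32, and so on.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds the offsets for num_lists lists that all have the same length.
// The result has num_lists + 1 entries with offsets[i] = i * list_len.
// Assumption: the total number of values fits into a 32-bit signed integer.
inline std::vector<std::int32_t> make_uniform_offsets(std::size_t num_lists,
                                                      std::int32_t list_len) {
    assert(list_len >= 0);
    std::vector<std::int32_t> offsets(num_lists + 1);
    for (std::size_t i = 0; i <= num_lists; ++i) {
        offsets[i] = static_cast<std::int32_t>(i) * list_len;
    }
    return offsets;
}

// Reads element j of list i. Note the two memory accesses: first the offset,
// then the value. This is the extra indirection compared to a fixed stride.
inline std::int8_t list_element(const std::int8_t* values,
                                const std::vector<std::int32_t>& offsets,
                                std::size_t i, std::size_t j) {
    assert(i + 1 < offsets.size());
    assert(j < static_cast<std::size_t>(offsets[i + 1] - offsets[i]));
    return values[static_cast<std::size_t>(offsets[i]) + j];
}
```

The values pointer can still be the untouched DAQ buffer; only the offsets 
have to be allocated and filled in addition.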


Hope this helps, and good luck,


Johan

________________________________
From: Simon Dumke <[email protected]>
Sent: Wednesday, July 17, 2019 11:27:58 AM
To: [email protected]
Subject: Beginner Question: HW Input into Arrow RecordBatch


Dear all,

I'm just getting started with Apache Arrow (or rather, thinking about it). I'm 
also thinking about using Arrow not only inside our processing pipeline, but in 
our data acquisition pipeline too. Regarding this, I have the following question:

There are primarily two kinds of DAQ APIs in use here:

  *   One [e.g. like int getData(unsigned char *data, size_t bufferSize)] takes 
a pointer to a preallocated buffer and fills it with data from DAQ hardware
  *   The other [e.g. like int getData(unsigned char **data)] "returns" a 
pointer to a buffer created inside the hardware driver, filled with data from 
DAQ hardware

If I want to use Arrow to transport and handle the data coming out of those 
APIs, I would usually need to allocate an Arrow Buffer and (with a sweep of 
copy operations) parse the acquired data into it. If the hardware's output is 
an interlaced stream of samples (e.g. 16 8-bit values from a 16-channel ADC, 
followed by the 16 values of the next sample...), that would obviously be 
row-oriented and I would therefore need to parse it manually into the Arrow 
buffer.

The question is now: If the data is only a one-dimensional array of samples 
(like from a single channel ADC) or the hardware offers the option to fill the 
buffer in a non-interlaced / planar manner (meaning all samples from channel 0, 
followed by all samples of channel 1 and so on - essentially "columnar") - 
would there be a way to "reinterpret" this in-memory layout as an Arrow 
buffer/RecordBatch/whatever and thereby avoid copy operations? E.g. by adding a 
specific "header", or, when using an API of the first type, by providing a 
pointer into a buffer allocated by Arrow and already prepared for the specific 
content layout?

I hope my question and intention come through clearly enough. Any ideas would be 
greatly appreciated!

BTW - can anybody offer some links to Getting Started guides, examples, etc. 
on how to start using Arrow (both C++ and Java)? I still find myself having 
difficulties finding the right starting point.

Many Thanks and kind regards,

Simon

--
----
Simon Dumke

Developer - CoDaC
Department Operation

Max Planck Institut for Plasmaphysics
