On Fri, Nov 3, 2023 at 2:30 PM McDonald, Ben <[email protected]> wrote:
> Hello,
>
> I have been using the C++ Parquet low-level interface to read Parquet
> files into regular C arrays. This has not been a problem when reading types
> supported by C, say, `int64` columns, but with string columns, I am running
> into difficulty having to read into the Arrow `ByteArray` type.
>
> Rather than reading the results into a `ByteArray`, I would like to read
> the results directly into an already created `uint8` character array. As it
> stands, I am first reading into a `ByteArray` and then copying into the
> `uint8` array, which is causing some unfortunate overhead. Is there a way
> to read directly into a byte array using the low-level Parquet API? For
> reference, here is the portion of code for how I am currently reading Arrow
> strings into my `uint8` array:
> https://github.com/Bears-R-Us/arkouda/blob/a3419dd6774923d6ff6f75bdf62fb6e225d1a584/src/ArrowFunctions.cpp#L797-L814

It's kind of odd to want to store strings as contiguous null-terminated entities. I can't think of any way to read the data in directly, since each value carries a length field. parquet-cpp is open source, so you can modify it to your liking, but you're still going to read the length field *somewhere* (technically you can skip it with the right "read" calls, but there's little to gain). I'm guessing the compiler will optimize your hand-rolled byte copy, but you can try memcpy.

> Additionally, when attempting to optimize my string reading approach, I
> was looking into using the `ReadBatch` function into a vector of
> `ByteArray`s to read in multiple values, instead of one at a time, like I
> am currently doing. When attempting this, I have been hitting a segfault
> with any batch size greater than 16, but am still achieving a significant
> speedup that way as opposed to reading in single values.
> Is there any reason why a larger batch size than 16 would be causing a
> segfault with the `ReadBatch` function reading into a vector of
> `ByteArray`s on a `parquet::ByteArrayReader`?

I expect you have a bug. Are you allowing space in your buffer for the NULL byte you're adding? (I had never heard of Chapel before -- so many languages.)

I also wonder whether a double read (read once to get the total data size, then read again for the data) is really worth it. If you can *guess* the column size and then reallocate only if you exceed that size, you might be better off.

Good luck,

--
Andrew Bell
[email protected]
