Hi Ben,

The link you posted does not work for me (though a lot of GitHub links are not currently working for me, so this might be an issue on my end).
> As it stands, I am first reading into a `ByteArray` and then copying into
> the `uint8` array, which is causing some unfortunate overhead. Is there a
> way to read directly into a byte array using the low level Parquet API?

To my knowledge there is nothing exposed here. Using the Arrow abstractions is probably the closest thing; a rough sketch of that approach is below.

> I have been hitting a segfault with any batch size greater than 16, but am
> still achieving a significant speedup that way as opposed to reading in
> single values. Is there any reason why a larger batch size than 16 would be
> causing a segfault with the `ReadBatch` function reading into a vector of
> `ByteArray`s on a `parquet::ByteArrayReader`?

None that I can think of; this sounds like heap corruption. I'd make sure you are allocating enough space in the array (see the second sketch below).

> One additional question is that, since I need to create my array prior to
> storing the values, I am having to calculate the required number of bytes
> that my array will need to be in order to store the column in advance. From
> the metadata, I am able to get the number of strings in the column, but I
> am unable to get the number of characters in the column, so have been
> reading in the entire file once and summing the `len` of each `ByteArray`
> to get the total number of characters that will be needed to store all of
> the values. Is there a simpler way to do that, possibly through the
> metadata?

In general, no. Parquet can encode byte arrays in a variety of different formats (Plain, Dictionary, Delta byte array), and the API provides an abstraction on top of these. If all of your data is plain encoded, you can use the uncompressed page size from the metadata to do the calculation, though generally that isn't the case for byte array data; the third sketch below shows the idea.

https://github.com/apache/parquet-format/pull/197 aims to add metadata that might meet your use case (the link is currently not working for me, but maybe it will work for you).
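Coming back to the first question: here is a minimal, untested sketch of the Arrow-level approach. It reads a string column into an `arrow::ChunkedArray` and copies straight out of each chunk's contiguous character buffer, so the per-value `ByteArray` copies go away. The function name and the `path`/`column`/`out` parameters are placeholders for your setup, and `out` is assumed to be sized in advance.

```cpp
#include <cstring>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>

arrow::Status CopyStringColumn(const std::string& path, int column,
                               uint8_t* out /* pre-sized by the caller */) {
  ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
      infile, arrow::default_memory_pool(), &reader));

  std::shared_ptr<arrow::ChunkedArray> chunked;
  ARROW_RETURN_NOT_OK(reader->ReadColumn(column, &chunked));

  for (const auto& chunk : chunked->chunks()) {
    auto strings = std::static_pointer_cast<arrow::StringArray>(chunk);
    // All characters of the chunk live in one contiguous value buffer;
    // the offsets array tells us which slice of it this chunk covers,
    // so the whole chunk can be copied with a single memcpy.
    const int64_t begin = strings->value_offset(0);
    const int64_t end = strings->value_offset(strings->length());
    std::memcpy(out, strings->value_data()->data() + begin, end - begin);
    out += end - begin;
  }
  return arrow::Status::OK();
}
```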
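On the `ReadBatch` segfault, the pattern I would expect to be safe looks roughly like this (again untested; `DrainColumn` and its parameters are just for illustration). The two things to double-check are that the values and definition-level buffers each hold at least `batch_size` elements, and that you copy each `ByteArray`'s bytes out before the next `ReadBatch` call, since `ptr` points into the reader's internal buffers.

```cpp
#include <vector>

#include <parquet/column_reader.h>

void DrainColumn(parquet::ByteArrayReader* reader, int64_t batch_size) {
  // ReadBatch writes up to batch_size entries into each output buffer,
  // so every buffer must have room for batch_size elements.
  std::vector<parquet::ByteArray> values(batch_size);
  std::vector<int16_t> def_levels(batch_size);

  while (reader->HasNext()) {
    int64_t values_read = 0;
    reader->ReadBatch(batch_size, def_levels.data(), /*rep_levels=*/nullptr,
                      values.data(), &values_read);
    for (int64_t i = 0; i < values_read; ++i) {
      // values[i].ptr / values[i].len are only guaranteed valid until the
      // next ReadBatch call, so copy the bytes out here.
    }
  }
}
```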
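And for the sizing question, this is roughly how you would pull the uncompressed size out of the metadata (untested; the function name is illustrative). Note that even for plain-encoded data this is only an upper bound on the character count, since the figure also covers things like the 4-byte length prefix stored before each value.

```cpp
#include <parquet/file_reader.h>
#include <parquet/metadata.h>

int64_t UncompressedColumnBytes(const std::string& path, int column) {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(path);
  std::shared_ptr<parquet::FileMetaData> metadata = reader->metadata();

  int64_t total = 0;
  for (int rg = 0; rg < metadata->num_row_groups(); ++rg) {
    // Sum the uncompressed byte size of this column's chunk across every
    // row group; only a meaningful bound when the data is plain encoded.
    total += metadata->RowGroup(rg)->ColumnChunk(column)
                 ->total_uncompressed_size();
  }
  return total;
}
```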
Thanks,
Micah

On Fri, Nov 3, 2023 at 11:30 AM McDonald, Ben <[email protected]> wrote:

> Hello,
>
> I have been using the C++ Parquet low-level interface to read Parquet
> files into regular C arrays. This has not been a problem when reading
> types supported by C, say, `int64` columns, but with string columns, I am
> running into difficulty having to read into the Arrow `ByteArray` type.
>
> Rather than reading the results into a `ByteArray`, I would like to read
> the results directly into an already created `uint8` character array. As
> it stands, I am first reading into a `ByteArray` and then copying into the
> `uint8` array, which is causing some unfortunate overhead. Is there a way
> to read directly into a byte array using the low level Parquet API? For
> reference, here is the portion of code for how I am currently reading
> Arrow strings into my `uint8` array:
> https://github.com/Bears-R-Us/arkouda/blob/a3419dd6774923d6ff6f75bdf62fb6e225d1a584/src/ArrowFunctions.cpp#L797-L814
>
> Additionally, when attempting to optimize my string reading approach, I
> was looking into using the `ReadBatch` function into a vector of
> `ByteArray`s to read in multiple values, instead of one at a time, like I
> am currently doing. When attempting this, I have been hitting a segfault
> with any batch size greater than 16, but am still achieving a significant
> speedup that way as opposed to reading in single values. Is there any
> reason why a larger batch size than 16 would be causing a segfault with
> the `ReadBatch` function reading into a vector of `ByteArray`s on a
> `parquet::ByteArrayReader`?
>
> One additional question is that, since I need to create my array prior to
> storing the values, I am having to calculate the required number of bytes
> that my array will need to be in order to store the column in advance.
> From the metadata, I am able to get the number of strings in the column,
> but I am unable to get the number of characters in the column, so have
> been reading in the entire file once and summing the `len` of each
> `ByteArray` to get the total number of characters that will be needed to
> store all of the values. Is there a simpler way to do that, possibly
> through the metadata?
>
> Thank you!
>
> Best,
> Ben McDonald
