On Fri, Nov 3, 2023 at 2:30 PM McDonald, Ben <[email protected]> wrote:
> Hello,
>
> I have been using the C++ Parquet low-level interface to read Parquet
> files into regular C arrays. This has not been a problem when reading types
> supported by C, say, `int64` columns, but with string columns, I am running
> into difficulty having to read into the Arrow `ByteArray` type.
>
> Rather than reading the results into a `ByteArray`, I would like to read
> the results directly into an already created `uint8` character array. As it
> stands, I am first reading into a `ByteArray` and then copying into the
> `uint8` array, which is causing some unfortunate overhead. Is there a way
> to read directly into a byte array using the low-level Parquet API? For
> reference, here is the portion of code for how I am currently reading Arrow
> strings into my `uint8` array:
> https://github.com/Bears-R-Us/arkouda/blob/a3419dd6774923d6ff6f75bdf62fb6e225d1a584/src/ArrowFunctions.cpp#L797-L814

It's kind of odd to want to store strings as contiguous null-terminated entities. I can't think of any way to read the data in directly, since each value carries a length field. parquet-cpp is open source, so you can modify it to your liking, but you're still going to read the length field *somewhere* (technically you can skip it with the right "read" calls, but there's little to gain). I'm guessing the compiler will optimize your hand-rolled byte copy, but you can try memcpy.

> Additionally, when attempting to optimize my string reading approach, I
> was looking into using the `ReadBatch` function into a vector of
> `ByteArray`s to read in multiple values, instead of one at a time, like I
> am currently doing. When attempting this, I have been hitting a segfault
> with any batch size greater than 16, but am still achieving a significant
> speedup that way as opposed to reading in single values.
> Is there any reason why a larger batch size than 16 would be causing a
> segfault with the `ReadBatch` function reading into a vector of
> `ByteArray`s on a `parquet::ByteArrayReader`?

I expect you have a bug. Are you allowing space in your buffer for the NULL byte you're adding? (I had never heard of Chapel before -- so many languages.)

I also wonder whether a double read (read once to get the total data size, then read again for the data) is really worth it. If you can *guess* the column size and then reallocate only if you exceed that size, you might be better off.

Good luck,

--
Andrew Bell
[email protected]
