On Mon, 18 Apr 2022 13:09:52 -0700
Micah Kornfield <[email protected]> wrote:
> Note that uncompressed size is encoded size so can be substantially smaller
> than a simple concatenated string buffer

Indeed, the only reliable way to get the desired information is to
actually read and decode the Parquet data.

Regards

Antoine.

>
> On Monday, April 18, 2022, Weston Pace <[email protected]> wrote:
>
> > From a pure metadata-only perspective you should be able to get the
> > size of the column and possibly a null count (for parquet files where
> > statistics are stored). However, you will not be able to get the
> > indices of the nulls.
> >
> > The null count and column size are going to come from the parquet
> > metadata and you will need to use the parquet APIs to get this
> > information. In pyarrow this would be:
> >
> > ```
> > >>> pq.ParquetFile('/tmp/foo.parquet').metadata.row_group(0).column(0).statistics.null_count
> > 1
> > >>> pq.ParquetFile('/tmp/foo.parquet').metadata.row_group(0).column(0).total_compressed_size
> > 122
> > >>> pq.ParquetFile('/tmp/foo.parquet').metadata.row_group(0).column(0).total_uncompressed_size
> > 119
> > ```
> >
> > In the C++ API you will want to look at
> > `parquet::ParquetFileReader::metadata`
> >
> > On Mon, Apr 18, 2022 at 6:18 AM McDonald, Ben <[email protected]> wrote:
> > >
> > > It seems that these options require reading into `ArrayData`. I have
> > > been using `ReadBatch` to read directly into a malloced C buffer to avoid
> > > having to create the additional copy, which is why I was hoping there would
> > > be a way to get this from the file metadata or some operation on the file
> > > rather than from the data that has already been read into an Arrow data
> > > structure.
> > >
> > > So, the only way that I could do this today would be to read into an
> > > `ArrayData` and then call an `arrow::compute` function? There is no way to
> > > get the info from the file?
> > >
> > > Best,
> > >
> > > Ben McDonald
> > >
> > > From: Niranda Perera <[email protected]>
> > > Date: Friday, April 15, 2022 at 5:43 PM
> > > To: [email protected] <[email protected]>
> > > Subject: Re: [C++] Null indices and byte lengths of string columns
> > >
> > > Hi Ben,
> > >
> > > I believe you could use arrow::compute for this.
> > >
> > > On Fri, Apr 15, 2022 at 6:28 PM McDonald, Ben <[email protected]> wrote:
> > >
> > > Hello,
> > >
> > > I have been writing some code to read Parquet files and it would be
> > > useful if there was an easy way to get the number of bytes in a string
> > > column as well as the null indices of that column. I would have expected
> > > this to be available in metadata somewhere, but I have not seen any way to
> > > query that from the API and don’t see anything like this using
> > > `parquet-tools` to inspect the files.
> > >
> > > Is there any way to get the null indices of a Parquet string column
> > > besides reading the whole file and manually checking for nulls?
> > >
> > > There is an internal method for this [1]. But unfortunately I don't think
> > > this is exposed to the outside. One possible solution is calling
> > > compute::is_null and passing the result to compute::indices_nonzero.
> > >
> > > Is there any way to get the byte lengths of string columns without
> > > reading each string and summing the number of bytes of each string?
> > >
> > > Do you want the non-null byte length?
> > >
> > > If not, you can simply take the offsets int64 buffer from ArrayData and
> > > take the last value. That would be the full bytesize of the string array.
> > >
> > > If yes, I believe you can achieve this by using the
> > > VisitArrayDataInline/VisitNullBitmapInline methods [2].
> > >
> > > Thank you.
> > >
> > > Best,
> > >
> > > Ben McDonald
> > >
> > > [1] https://github.com/apache/arrow/blob/d36b2b3392ed78b294b565c3bd3f32eb6675b23a/cpp/src/arrow/compute/api_vector.h#L226
> > >
> > > [2] https://github.com/apache/arrow/blob/d36b2b3392ed78b294b565c3bd3f32eb6675b23a/cpp/src/arrow/visit_data_inline.h#L224
> > >
> > > --
> > >
> > > Niranda Perera
> > > https://niranda.dev/
> > >
> > > @n1r44
> > >
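
To make the offsets and null-bitmap suggestions quoted above concrete, here is a minimal sketch in plain Python of what the decoded data looks like once it is in Arrow's variable-size string layout. It does not use the Arrow or Parquet libraries; the helper names (`encode_string_column`, `null_indices`, `total_byte_length`) are hypothetical and exist only for illustration. It assumes 32-bit little-endian offsets, a validity bitmap with least-significant-bit numbering, and that null slots occupy zero bytes in the data buffer (the format permits non-empty null slots, in which case the last offset would overcount the non-null bytes).

```python
import struct

def encode_string_column(values):
    """Build Arrow-style buffers for a list of str-or-None values.

    Returns (validity bitmap, int32 offsets buffer, UTF-8 data buffer).
    """
    validity = bytearray((len(values) + 7) // 8)
    offsets = [0]
    data = bytearray()
    for i, v in enumerate(values):
        if v is not None:
            validity[i // 8] |= 1 << (i % 8)  # set bit i: slot is valid
            data += v.encode("utf-8")
        offsets.append(len(data))  # a null slot repeats the previous offset
    packed = struct.pack(f"<{len(offsets)}i", *offsets)
    return bytes(validity), packed, bytes(data)

def null_indices(validity, length):
    """Indices whose validity bit is cleared."""
    return [i for i in range(length)
            if not (validity[i // 8] >> (i % 8)) & 1]

def total_byte_length(offsets_buf):
    """Bytes spanned by the offsets: last offset minus first offset."""
    n = len(offsets_buf) // 4
    offsets = struct.unpack(f"<{n}i", offsets_buf)
    return offsets[-1] - offsets[0]

values = ["foo", None, "héllo", "", None]
validity, offsets, data = encode_string_column(values)
print(null_indices(validity, len(values)))  # [1, 4]
print(total_byte_length(offsets))           # 9
```

Note that the empty string at index 3 is a valid value, not a null, so it does not appear in the null indices. None of this is available from the Parquet footer; it only exists after the column has been read and decoded.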
