On Mon, 18 Apr 2022 13:09:52 -0700
Micah Kornfield <[email protected]> wrote:
> Note that uncompressed size is the encoded size, so it can be substantially
> smaller than a simple concatenated string buffer

Indeed, the only reliable way to get the desired information is to
actually read and decode the Parquet data.

Regards

Antoine.



> 
> On Monday, April 18, 2022, Weston Pace <[email protected]> wrote:
> 
> > From a pure metadata-only perspective you should be able to get the
> > size of the column and possibly a null count (for parquet files where
> > statistics are stored).  However, you will not be able to get the
> > indices of the nulls.
> >
> > The null count and column size are going to come from the parquet
> > metadata and you will need to use the parquet APIs to get this
> > information.  In pyarrow this would be:
> >
> > ```
> > >>> import pyarrow.parquet as pq
> > >>> pq.ParquetFile('/tmp/foo.parquet').metadata.row_group(0).column(0).statistics.null_count
> > 1
> > >>> pq.ParquetFile('/tmp/foo.parquet').metadata.row_group(0).column(0).total_compressed_size
> > 122
> > >>> pq.ParquetFile('/tmp/foo.parquet').metadata.row_group(0).column(0).total_uncompressed_size
> > 119
> > ```
> >
> > In the C++ API you will want to look at
> > `parquet::ParquetFileReader::metadata`.
> >
> > On Mon, Apr 18, 2022 at 6:18 AM McDonald, Ben <[email protected]> wrote:
> > >
> > > It seems that these options require reading into `ArrayData`. I have
> > > been using `ReadBatch` to read directly into a malloced C buffer to
> > > avoid having to create the additional copy, which is why I was hoping
> > > there would be a way to get this from the file metadata or some
> > > operation on the file rather than from the data that has already been
> > > read into an Arrow data structure.
> > >
> > >
> > >
> > > So, the only way that I could do this today would be to read into an
> > > `ArrayData` and then call an `arrow::compute` function? There is no
> > > way to get the info from the file?
> > >
> > >
> > >
> > > Best,
> > >
> > > Ben McDonald
> > >
> > >
> > >
> > > From: Niranda Perera <[email protected]>
> > > Date: Friday, April 15, 2022 at 5:43 PM
> > > To: [email protected] <[email protected]>
> > > Subject: Re: [C++] Null indices and byte lengths of string columns
> > >
> > > Hi Ben,
> > >
> > >
> > >
> > > I believe you could use arrow::compute for this.
> > >
> > >
> > >
> > > On Fri, Apr 15, 2022 at 6:28 PM McDonald, Ben <[email protected]> wrote:
> > >
> > > Hello,
> > >
> > >
> > >
> > > I have been writing some code to read Parquet files and it would be
> > > useful if there were an easy way to get the number of bytes in a
> > > string column as well as the null indices of that column. I would have
> > > expected this to be available in metadata somewhere, but I have not
> > > seen any way to query that from the API and don’t see anything like
> > > this using `parquet-tools` to inspect the files.
> > >
> > >
> > >
> > > Is there any way to get the null indices of a Parquet string column
> > > besides reading the whole file and manually checking for nulls?
> > >
> > > There is an internal method for this [1]. But unfortunately I don't
> > > think this is exposed to the outside. One possible solution is to call
> > > compute::is_null and pass the result to compute::indices_nonzero.
> > >
> > >
> > >
> > >
> > >
> > > Is there any way to get the byte lengths of string columns without
> > > reading each string and summing the number of bytes of each string?
> > >
> > > Do you want the non-null byte length?
> > >
> > > If not, you can simply take the offsets buffer from ArrayData (int32
> > > for a string array, int64 for a large string array) and take the last
> > > value. That would be the full byte size of the string array.
> > >
> > > If yes, I believe you can achieve this by using the
> > > VisitArrayDataInline/VisitNullBitmapInline methods [2].
> > >
> > >
> > >
> > >
> > >
> > > Thank you.
> > >
> > >
> > >
> > > Best,
> > >
> > > Ben McDonald
> > >
> > >
> > >
> > > [1] https://github.com/apache/arrow/blob/d36b2b3392ed78b294b565c3bd3f32eb6675b23a/cpp/src/arrow/compute/api_vector.h#L226
> > >
> > > [2] https://github.com/apache/arrow/blob/d36b2b3392ed78b294b565c3bd3f32eb6675b23a/cpp/src/arrow/visit_data_inline.h#L224
> > >
> > >
> > > --
> > >
> > > Niranda Perera
> > > https://niranda.dev/
> > >
> > > @n1r44
> > >
> > >  
> >  
> 


