Re: Arrow 1404: Adding index for Page-level Skipping

2020-06-22 Thread Micah Kornfield
It doesn't appear so, from a quick look over the code base.  It seems that
PageReader [1] or Page [2] would potentially need to track this
information.

[1]
https://github.com/apache/arrow/blob/fe142922f6a4b801ece8fd16a1bff9836a8aaf77/cpp/src/parquet/column_reader.h#L99
[2]
https://github.com/apache/arrow/blob/809d40ab9518bd254705f35af01162a9da588516/cpp/src/parquet/column_page.h
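
As a rough illustration of tracking this at the page-reading layer, here is a
minimal sketch. Everything in it (SketchPage, SketchPageSource,
VectorPageSource, TrackingPageSource) is a hypothetical stand-in invented for
this example, not the actual parquet::PageReader or parquet::Page API, and it
assumes pages are pulled one at a time:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; NOT the Arrow/Parquet classes.
struct SketchPage {
  int64_t num_values;  // number of values stored in this page
};

class SketchPageSource {
 public:
  virtual ~SketchPageSource() = default;
  // Returns the next page, or nullptr when the column chunk is exhausted.
  virtual std::shared_ptr<SketchPage> NextPage() = 0;
};

// Serves pages of the given sizes from memory.
class VectorPageSource : public SketchPageSource {
 public:
  explicit VectorPageSource(std::vector<int64_t> page_sizes)
      : page_sizes_(std::move(page_sizes)) {}

  std::shared_ptr<SketchPage> NextPage() override {
    if (next_ >= page_sizes_.size()) return nullptr;
    return std::make_shared<SketchPage>(SketchPage{page_sizes_[next_++]});
  }

 private:
  std::vector<int64_t> page_sizes_;
  std::size_t next_ = 0;
};

// Wraps a page source and remembers the ordinal of the last page handed out.
class TrackingPageSource : public SketchPageSource {
 public:
  explicit TrackingPageSource(std::unique_ptr<SketchPageSource> inner)
      : inner_(std::move(inner)) {}

  std::shared_ptr<SketchPage> NextPage() override {
    auto page = inner_->NextPage();
    if (page != nullptr) ++current_page_;  // only count pages actually returned
    return page;
  }

  // Zero-based ordinal of the most recently returned page; -1 before the first.
  int64_t current_page() const { return current_page_; }

 private:
  std::unique_ptr<SketchPageSource> inner_;
  int64_t current_page_ = -1;
};
```

A batch-reading loop sitting on top of such a wrapper could query
current_page() after each read to learn which page the values came from.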

On Fri, Jun 19, 2020 at 3:42 PM Lekshmi Narayanan, Arun Balajiee <
arl...@pitt.edu> wrote:

> Hi
>
> I think I should reframe my question. I am working on PARQUET-1404. I am
> not looking to extend or change this API. I want to understand ReadBatch,
> and my question was: when I make calls to read values from the Parquet
> file internally, row by row, is there a way to know which page I am at?
>
> Regards
> Arun Balajiee
>
>
>
> 
> From: Micah Kornfield 
> Sent: Thursday, June 18, 2020 11:21:42 PM
> To: dev@parquet.apache.org 
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> Is this internally in the class or adding a parameter in the API?  What is
> the use case?
>
> On Saturday, June 13, 2020, Lekshmi Narayanan, Arun Balajiee <
> arl...@pitt.edu> wrote:
>
> > Hi Dev
> >
> > Thanks Wes for these comments.
> >
> > As mentioned in other threads, I have completed most of it. I will try to
> > structure it according to the comments.
> >
> > I had one question regarding a (un)related matter. Whenever we make calls to
> >
> > ReadBatch(int64_t batch_size, int16_t* def_levels,
> >           int16_t* rep_levels, T* values,
> >           int64_t* values_read)
> >
> > Is there a possibility to keep track of which page we are at when
> > retrieving values?
> >
> > Regards
> > Arun Balajiee
> > ____
> > From: Wes McKinney 
> > Sent: 02 April 2020 13:16
> > To: Parquet Dev 
> > Cc: Deepak Majeti ; Anatoli Shein <
> > sh...@microfocus.com>
> > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> >
> > I just left comments on the PR. The new APIs (their semantics and what
> > should be passed as arguments) are still not adequately documented (in
> > other words, I wouldn't know how to use them just from reading the
> > header file), so I think we should focus on that for the moment. In
> > fairness, documentation for other functions in these headers is poor,
> > but they also have the semantics of "read all data in the file from
> > start to finish". These new APIs appear to do something different, so
> > we need to write that down in detail in Doxygen-style comments
> >
> > On Thu, Apr 2, 2020 at 2:23 AM Lekshmi Narayanan, Arun Balajiee
> >  wrote:
> > >
> > > Hi
> > > Would my pull request be useful for the discussion from here?
> > > https://github.com/apache/arrow/pull/6807
> > >
> > > Regards,
> > > Arun Balajiee
> > >
> > > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > > Sent: Tuesday, February 18, 2020 3:34 AM
> > > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli
> > Shein<mailto:sh...@microfocus.com>
> > > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> > >
> > > That's helpful, but I think it would be a good idea to have enough
> > > information in the header files to determine what the new APIs do
> > > without reading example code.
> > >
> > > On Mon, Feb 17, 2020 at 10:59 AM Lekshmi Narayanan, Arun Balajiee
> > >  wrote:
> > > >
> > > > I also made changes in the low-level-api folder, couldn’t capture in
> > that link I think
> > > > https://nam05.safelinks.protection.outlook.com/?url=
> > https%3A%2F%2Fgithub.com%2Fa2un%2Farrow%2Fblob%2FPARQUET-1404-Add-index-
> > pages-to-the-format-to-support-efficient-page-
> > skipping-to-parquet-cpp%2Fcpp%2Fexamples%2Fparquet%2Flow-
> > level-api%2Freader-writer-with-index.cc&data=02%
> > 7C01%7CARL122%40pitt.edu%7Cd36ddd6e18fb44808ef308d7d72

Re: Arrow 1404: Adding index for Page-level Skipping

2020-06-19 Thread Lekshmi Narayanan, Arun Balajiee
Hi

I think I should reframe my question. I am working on PARQUET-1404. I am not 
looking to extend or change this API. I want to understand ReadBatch, and my 
question was: when I make calls to read values from the Parquet file 
internally, row by row, is there a way to know which page I am at?

Regards
Arun Balajiee




From: Micah Kornfield 
Sent: Thursday, June 18, 2020 11:21:42 PM
To: dev@parquet.apache.org 
Subject: Re: Arrow 1404: Adding index for Page-level Skipping

Is this internally in the class or adding a parameter in the API?  What is
the use case?

On Saturday, June 13, 2020, Lekshmi Narayanan, Arun Balajiee <
arl...@pitt.edu> wrote:

> Hi Dev
>
> Thanks Wes for these comments.
>
> As mentioned in other threads, I have completed most of it. I will try to
> structure it according to the comments.
>
> I had one question regarding a (un)related matter. Whenever we make calls to
>
> ReadBatch(int64_t batch_size, int16_t* def_levels,
>           int16_t* rep_levels, T* values,
>           int64_t* values_read)
>
> Is there a possibility to keep track of which page we are at when retrieving
> values?
>
> Regards
> Arun Balajiee
> 
> From: Wes McKinney 
> Sent: 02 April 2020 13:16
> To: Parquet Dev 
> Cc: Deepak Majeti ; Anatoli Shein <
> sh...@microfocus.com>
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> I just left comments on the PR. The new APIs (their semantics and what
> should be passed as arguments) are still not adequately documented (in
> other words, I wouldn't know how to use them just from reading the
> header file), so I think we should focus on that for the moment. In
> fairness, documentation for other functions in these headers is poor,
> but they also have the semantics of "read all data in the file from
> start to finish". These new APIs appear to do something different, so
> we need to write that down in detail in Doxygen-style comments
>
> On Thu, Apr 2, 2020 at 2:23 AM Lekshmi Narayanan, Arun Balajiee
>  wrote:
> >
> > Hi
> > Would my pull request be useful for the discussion from here?
> > https://github.com/apache/arrow/pull/6807
> >
> > Regards,
> > Arun Balajiee
> >
> > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > Sent: Tuesday, February 18, 2020 3:34 AM
> > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli
> Shein<mailto:sh...@microfocus.com>
> > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> >
> > That's helpful, but I think it would be a good idea to have enough
> > information in the header files to determine what the new APIs do
> > without reading example code.
> >
> > On Mon, Feb 17, 2020 at 10:59 AM Lekshmi Narayanan, Arun Balajiee
> >  wrote:
> > >
> > > I also made changes in the low-level-api folder, couldn’t capture in
> > > that link I think
> > > https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
> > >
> > > Regards,
> > > Arun Balajiee
> > >
> > > 
> > > From: Wes McKinney 
> > > Sent: Monday, February 17, 2020 8:11:09 AM
> > > To: Parquet Dev 
> > > Cc: Deepak Majeti ; Anatoli Shein <
> sh...@microfocus.com>
> > > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> > >
> > > hi Arun,
> > >
> > > By "public APIs" I was referring to changes in the public header
> > > files. I see there are some changes to parquet/file_reader.h and
> > > metadata.h
> > >
> > > https://nam05.safelinks.protection.outlook.com/?url=
> https%3A%2F%2Fgithub.com%2Fapache%2Farrow%2Fcompare%
> 2Fmaster...a2un%3APA

Re: Arrow 1404: Adding index for Page-level Skipping

2020-06-18 Thread Micah Kornfield
Is this internally in the class or adding a parameter in the API?  What is
the use case?

On Saturday, June 13, 2020, Lekshmi Narayanan, Arun Balajiee <
arl...@pitt.edu> wrote:

> Hi Dev
>
> Thanks Wes for these comments.
>
> As mentioned in other threads, I have completed most of it. I will try to
> structure it according to the comments.
>
> I had one question regarding a (un)related matter. Whenever we make calls to
>
> ReadBatch(int64_t batch_size, int16_t* def_levels,
>           int16_t* rep_levels, T* values,
>           int64_t* values_read)
>
> Is there a possibility to keep track of which page we are at when retrieving
> values?
>
> Regards
> Arun Balajiee
> 
> From: Wes McKinney 
> Sent: 02 April 2020 13:16
> To: Parquet Dev 
> Cc: Deepak Majeti ; Anatoli Shein <
> sh...@microfocus.com>
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> I just left comments on the PR. The new APIs (their semantics and what
> should be passed as arguments) are still not adequately documented (in
> other words, I wouldn't know how to use them just from reading the
> header file), so I think we should focus on that for the moment. In
> fairness, documentation for other functions in these headers is poor,
> but they also have the semantics of "read all data in the file from
> start to finish". These new APIs appear to do something different, so
> we need to write that down in detail in Doxygen-style comments
>
> On Thu, Apr 2, 2020 at 2:23 AM Lekshmi Narayanan, Arun Balajiee
>  wrote:
> >
> > Hi
> > Would my pull request be useful for the discussion from here?
> > https://github.com/apache/arrow/pull/6807
> >
> > Regards,
> > Arun Balajiee
> >
> > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > Sent: Tuesday, February 18, 2020 3:34 AM
> > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli
> Shein<mailto:sh...@microfocus.com>
> > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> >
> > That's helpful, but I think it would be a good idea to have enough
> > information in the header files to determine what the new APIs do
> > without reading example code.
> >
> > On Mon, Feb 17, 2020 at 10:59 AM Lekshmi Narayanan, Arun Balajiee
> >  wrote:
> > >
> > > I also made changes in the low-level-api folder, couldn’t capture in
> > > that link I think
> > > https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
> > >
> > > Regards,
> > > Arun Balajiee
> > >
> > > 
> > > From: Wes McKinney 
> > > Sent: Monday, February 17, 2020 8:11:09 AM
> > > To: Parquet Dev 
> > > Cc: Deepak Majeti ; Anatoli Shein <
> sh...@microfocus.com>
> > > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> > >
> > > hi Arun,
> > >
> > > By "public APIs" I was referring to changes in the public header
> > > files. I see there are some changes to parquet/file_reader.h and
> > > metadata.h
> > >
> > > https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp
> > >
> > > Can you add some Doxygen comments to the new APIs that explain how
> > > these APIs are to be used (and what the parameters mean)? The hope
> > > would be that a us

Re: Arrow 1404: Adding index for Page-level Skipping

2020-06-13 Thread Lekshmi Narayanan, Arun Balajiee
Hi Dev

Thanks Wes for these comments.

As mentioned in other threads, I have completed most of it. I will try to 
structure it according to the comments.

I had one question regarding a (un)related matter. Whenever we make calls to

ReadBatch(int64_t batch_size, int16_t* def_levels,
          int16_t* rep_levels, T* values,
          int64_t* values_read)

Is there a possibility to keep track of which page we are at when retrieving 
values?

Regards
Arun Balajiee

From: Wes McKinney 
Sent: 02 April 2020 13:16
To: Parquet Dev 
Cc: Deepak Majeti ; Anatoli Shein 

Subject: Re: Arrow 1404: Adding index for Page-level Skipping

I just left comments on the PR. The new APIs (their semantics and what
should be passed as arguments) are still not adequately documented (in
other words, I wouldn't know how to use them just from reading the
header file), so I think we should focus on that for the moment. In
fairness, documentation for other functions in these headers is poor,
but they also have the semantics of "read all data in the file from
start to finish". These new APIs appear to do something different, so
we need to write that down in detail in Doxygen-style comments

On Thu, Apr 2, 2020 at 2:23 AM Lekshmi Narayanan, Arun Balajiee
 wrote:
>
> Hi
> Would my pull request be useful for the discussion from here?
> https://github.com/apache/arrow/pull/6807
>
> Regards,
> Arun Balajiee
>
> From: Wes McKinney<mailto:wesmck...@gmail.com>
> Sent: Tuesday, February 18, 2020 3:34 AM
> To: Parquet Dev<mailto:dev@parquet.apache.org>
> Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> Shein<mailto:sh...@microfocus.com>
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> That's helpful, but I think it would be a good idea to have enough
> information in the header files to determine what the new APIs do
> without reading example code.
>
> On Mon, Feb 17, 2020 at 10:59 AM Lekshmi Narayanan, Arun Balajiee
>  wrote:
> >
> > I also made changes in the low-level-api folder, couldn’t capture in that 
> > link I think
> > https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
> >
> > Regards,
> > Arun Balajiee
> >
> > ________________
> > From: Wes McKinney 
> > Sent: Monday, February 17, 2020 8:11:09 AM
> > To: Parquet Dev 
> > Cc: Deepak Majeti ; Anatoli Shein 
> > 
> > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> >
> > hi Arun,
> >
> > By "public APIs" I was referring to changes in the public header
> > files. I see there are some changes to parquet/file_reader.h and
> > metadata.h
> >
> > https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp
> >
> > Can you add some Doxygen comments to the new APIs that explain how
> > these APIs are to be used (and what the parameters mean)? The hope
> > would be that a user could make use of the column index functionality
> > by reading the .h files only.
> >
> > Thanks
> > Wes
> >
> > On Fri, Feb 14, 2020 at 2:57 PM Lekshmi Narayanan, Arun Balajiee
> >  wrote:
> > >
> > > Hi
> > > I have made my changes for api here, does it look good and is this what 
> > > you were seeking from me? The writer- api is still in the works and I 
> > > need to make the reader more generic to support all class data types.
> > >
> > > https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fa2un%2Farrow%2Fblob%2FPARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parq

Re: Arrow 1404: Adding index for Page-level Skipping

2020-04-02 Thread Wes McKinney
I just left comments on the PR. The new APIs (their semantics and what
should be passed as arguments) are still not adequately documented (in
other words, I wouldn't know how to use them just from reading the
header file), so I think we should focus on that for the moment. In
fairness, documentation for other functions in these headers is poor,
but they also have the semantics of "read all data in the file from
start to finish". These new APIs appear to do something different, so
we need to write that down in detail in Doxygen-style comments
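
For illustration only, the level of detail meant here might look like the
sketch below. FindPageForRow is a hypothetical helper invented for this
example, not one of the APIs in the PR; the point is that the parameters,
return value and semantics can all be read off the comment alone:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

/// \brief Locate the page that contains a given row (hypothetical helper).
///
/// Unlike functions that read a column from start to finish, the result is
/// meant to let a caller jump directly to a single page.
///
/// \param[in] first_row_indices index of the first row of each page within
///   the row group, sorted ascending (normally starting at 0)
/// \param[in] row zero-based row index within the row group
/// \return zero-based ordinal of the page containing `row`, or -1 if
///   `first_row_indices` is empty, `row` is negative, or `row` precedes the
///   first page. Rows past the end map to the last page, because the total
///   row count is not known to this function.
inline int64_t FindPageForRow(const std::vector<int64_t>& first_row_indices,
                              int64_t row) {
  if (first_row_indices.empty() || row < 0) return -1;
  // upper_bound finds the first page that starts after `row`; the page just
  // before it is the one that contains `row`.
  auto it = std::upper_bound(first_row_indices.begin(),
                             first_row_indices.end(), row);
  return static_cast<int64_t>(it - first_row_indices.begin()) - 1;
}
```

With page starts {0, 100, 250}, row 99 falls in page 0 and row 100 in page 1.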

On Thu, Apr 2, 2020 at 2:23 AM Lekshmi Narayanan, Arun Balajiee
 wrote:
>
> Hi
> Would my pull request be useful for the discussion from here?
> https://github.com/apache/arrow/pull/6807
>
> Regards,
> Arun Balajiee
>
> From: Wes McKinney<mailto:wesmck...@gmail.com>
> Sent: Tuesday, February 18, 2020 3:34 AM
> To: Parquet Dev<mailto:dev@parquet.apache.org>
> Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> Shein<mailto:sh...@microfocus.com>
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> That's helpful, but I think it would be a good idea to have enough
> information in the header files to determine what the new APIs do
> without reading example code.
>
> On Mon, Feb 17, 2020 at 10:59 AM Lekshmi Narayanan, Arun Balajiee
>  wrote:
> >
> > I also made changes in the low-level-api folder, couldn’t capture in that 
> > link I think
> > https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
> >
> > Regards,
> > Arun Balajiee
> >
> > ________
> > From: Wes McKinney 
> > Sent: Monday, February 17, 2020 8:11:09 AM
> > To: Parquet Dev 
> > Cc: Deepak Majeti ; Anatoli Shein 
> > 
> > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> >
> > hi Arun,
> >
> > By "public APIs" I was referring to changes in the public header
> > files. I see there are some changes to parquet/file_reader.h and
> > metadata.h
> >
> > https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp
> >
> > Can you add some Doxygen comments to the new APIs that explain how
> > these APIs are to be used (and what the parameters mean)? The hope
> > would be that a user could make use of the column index functionality
> > by reading the .h files only.
> >
> > Thanks
> > Wes
> >
> > On Fri, Feb 14, 2020 at 2:57 PM Lekshmi Narayanan, Arun Balajiee
> >  wrote:
> > >
> > > Hi
> > > I have made my changes for api here, does it look good and is this what 
> > > you were seeking from me? The writer- api is still in the works and I 
> > > need to make the reader more generic to support all class data types.
> > >
> > > https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
> > >
> > >
> > > Regards,
> > > Arun Balajiee
> > >
> > > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > > Sent: Tuesday, February 4, 2020 11:24 PM
> > > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> > > Shein<mailto:sh...@microfocus.com>
> > > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> > >
> > > hi Arun,
> > >
> > > We can keep the discussion going on here and on GitHub when you have a
> > > pull request to discuss. There are a number of diff

RE: Arrow 1404: Adding index for Page-level Skipping

2020-04-02 Thread Lekshmi Narayanan, Arun Balajiee
Hi
Would my pull request be useful for the discussion from here?
https://github.com/apache/arrow/pull/6807

Regards,
Arun Balajiee

From: Wes McKinney<mailto:wesmck...@gmail.com>
Sent: Tuesday, February 18, 2020 3:34 AM
To: Parquet Dev<mailto:dev@parquet.apache.org>
Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
Shein<mailto:sh...@microfocus.com>
Subject: Re: Arrow 1404: Adding index for Page-level Skipping

That's helpful, but I think it would be a good idea to have enough
information in the header files to determine what the new APIs do
without reading example code.

On Mon, Feb 17, 2020 at 10:59 AM Lekshmi Narayanan, Arun Balajiee
 wrote:
>
> I also made changes in the low-level-api folder, couldn’t capture in that 
> link I think
> https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
>
> Regards,
> Arun Balajiee
>
> 
> From: Wes McKinney 
> Sent: Monday, February 17, 2020 8:11:09 AM
> To: Parquet Dev 
> Cc: Deepak Majeti ; Anatoli Shein 
> 
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> hi Arun,
>
> By "public APIs" I was referring to changes in the public header
> files. I see there are some changes to parquet/file_reader.h and
> metadata.h
>
> https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp
>
> Can you add some Doxygen comments to the new APIs that explain how
> these APIs are to be used (and what the parameters mean)? The hope
> would be that a user could make use of the column index functionality
> by reading the .h files only.
>
> Thanks
> Wes
>
> On Fri, Feb 14, 2020 at 2:57 PM Lekshmi Narayanan, Arun Balajiee
>  wrote:
> >
> > Hi
> > I have made my changes for api here, does it look good and is this what you 
> > were seeking from me? The writer- api is still in the works and I need to 
> > make the reader more generic to support all class data types.
> >
> > https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
> >
> >
> > Regards,
> > Arun Balajiee
> >
> > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > Sent: Tuesday, February 4, 2020 11:24 PM
> > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> > Shein<mailto:sh...@microfocus.com>
> > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> >
> > hi Arun,
> >
> > We can keep the discussion going on here and on GitHub when you have a
> > pull request to discuss. There are a number of different people who
> > can give advice.
> >
> > Thanks
> >
> > On Tue, Feb 4, 2020 at 10:11 PM Lekshmi Narayanan, Arun Balajiee
> >  wrote:
> > >
> > > Actually I made some changes after the date on the pull request ( even in 
> > > this year), which are not getting reflected on this compare link
> > >
> > > Regards,
> > > Arun Balajiee
> > >
> > > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > > Sent: Tuesday, February 4, 2020 6:43 PM
> > > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> > > Shein<mailto:sh...@microfocus.com>
> > > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> > >
> > > Here's a compare link in case others want to have a look
> > >
> > > https://nam05.safelinks.protection.outlook.com/?url=h

Re: Arrow 1404: Adding index for Page-level Skipping

2020-02-26 Thread Wes McKinney
I don't think so.

On Mon, Feb 24, 2020 at 5:14 PM Lekshmi Narayanan, Arun Balajiee
 wrote:
>
> Will using a DataPage V2 or DataPage V1 cause any difference for this ticket?
>
> Regards,
> Arun Balajiee
>
> 
> From: Wes McKinney 
> Sent: Friday, February 21, 2020 3:06:58 AM
> To: Parquet Dev 
> Cc: Deepak Majeti ; Anatoli Shein 
> 
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> The data page statistics aren't currently being used during the "scan
> to Arrow" procedure. That's likely to change at some point since the
> Arrow Datasets project will provide a higher level API to indicate
> filter predicates
>
> On Thu, Feb 20, 2020 at 3:25 PM Lekshmi Narayanan, Arun Balajiee
>  wrote:
> >
> > Thanks Wes. I got it now. I am working on that. I have a general question, 
> > though: have page indices that store min/max values been implemented in 
> > Arrow Parquet (not referring to column indices or offset indices, just 
> > page indices)?
> >
> > Regards,
> > Arun Balajiee
> >
> > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > Sent: Tuesday, February 18, 2020 3:34 AM
> > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> > Shein<mailto:sh...@microfocus.com>
> > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> >
> > That's helpful, but I think it would be a good idea to have enough
> > information in the header files to determine what the new APIs do
> > without reading example code.
> >
> > On Mon, Feb 17, 2020 at 10:59 AM Lekshmi Narayanan, Arun Balajiee
> >  wrote:
> > >
> > > I also made changes in the low-level-api folder, couldn’t capture in that 
> > > link I think
> > > https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
> > >
> > > Regards,
> > > Arun Balajiee
> > >
> > > 
> > > From: Wes McKinney 
> > > Sent: Monday, February 17, 2020 8:11:09 AM
> > > To: Parquet Dev 
> > > Cc: Deepak Majeti ; Anatoli Shein 
> > > 
> > > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> > >
> > > hi Arun,
> > >
> > > By "public APIs" I was referring to changes in the public header
> > > files. I see there are some changes to parquet/file_reader.h and
> > > metadata.h
> > >
> > > https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp
> > >
> > > Can you add some Doxygen comments to the new APIs that explain how
> > > these APIs are to be used (and what the parameters mean)? The hope
> > > would be that a user could make use of the column index functionality
> > > by reading the .h files only.
> > >
> > > Thanks
> > > Wes
> > >
> > > On Fri, Feb 14, 2020 at 2:57 PM Lekshmi Narayanan, Arun Balajiee
> > >  wrote:
> > > >
> > > > Hi
> > > > I have made my changes for api here, does it look good and is this what 
> > > > you were seeking from me? The writer- api is still in the works and I 
> > > > need to make the reader more generic to support all class data types.
> > > >
> > > > https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
> > > >
> > > >
> > >

RE: Arrow 1404: Adding index for Page-level Skipping

2020-02-24 Thread Lekshmi Narayanan, Arun Balajiee
Will using a DataPage V2 or DataPage V1 cause any difference for this ticket?

Regards,
Arun Balajiee


From: Wes McKinney 
Sent: Friday, February 21, 2020 3:06:58 AM
To: Parquet Dev 
Cc: Deepak Majeti ; Anatoli Shein 

Subject: Re: Arrow 1404: Adding index for Page-level Skipping

The data page statistics aren't currently being used during the "scan
to Arrow" procedure. That's likely to change at some point since the
Arrow Datasets project will provide a higher level API to indicate
filter predicates
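
For context, the sketch below shows how per-page min/max statistics could be
used with a filter predicate to decide which pages to read. PageStats and
PagesToRead are hypothetical names invented for this example, not Arrow or
Parquet types, and the predicate is assumed to be a closed integer range
[lo, hi]:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-page statistics; NOT the Arrow/Parquet types.
struct PageStats {
  int64_t min;  // smallest value stored in the page
  int64_t max;  // largest value stored in the page
};

// Returns the ordinals of pages whose [min, max] range overlaps [lo, hi].
// Pages that cannot contain a matching value are skipped without decoding.
inline std::vector<std::size_t> PagesToRead(const std::vector<PageStats>& pages,
                                            int64_t lo, int64_t hi) {
  std::vector<std::size_t> selected;
  for (std::size_t i = 0; i < pages.size(); ++i) {
    // Two closed ranges overlap unless one ends before the other begins.
    if (pages[i].max >= lo && pages[i].min <= hi) selected.push_back(i);
  }
  return selected;
}
```

A selected page may still contain values outside the range, so the predicate
has to be re-applied to the decoded values; the statistics only rule pages out.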

On Thu, Feb 20, 2020 at 3:25 PM Lekshmi Narayanan, Arun Balajiee
 wrote:
>
> Thanks Wes. I got it now. I am working on that. I have a general question, 
> though: have page indices that store min/max values been implemented in 
> Arrow Parquet (not referring to column indices or offset indices, just page 
> indices)?
>
> Regards,
> Arun Balajiee
>
> From: Wes McKinney<mailto:wesmck...@gmail.com>
> Sent: Tuesday, February 18, 2020 3:34 AM
> To: Parquet Dev<mailto:dev@parquet.apache.org>
> Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> Shein<mailto:sh...@microfocus.com>
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> That's helpful, but I think it would be a good idea to have enough
> information in the header files to determine what the new APIs do
> without reading example code.
>
> On Mon, Feb 17, 2020 at 10:59 AM Lekshmi Narayanan, Arun Balajiee
>  wrote:
> >
> > I also made changes in the low-level-api folder, couldn’t capture in that 
> > link I think
> > https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
> >
> > Regards,
> > Arun Balajiee
> >
> > ____________
> > From: Wes McKinney 
> > Sent: Monday, February 17, 2020 8:11:09 AM
> > To: Parquet Dev 
> > Cc: Deepak Majeti ; Anatoli Shein 
> > 
> > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> >
> > hi Arun,
> >
> > By "public APIs" I was referring to changes in the public header
> > files. I see there are some changes to parquet/file_reader.h and
> > metadata.h
> >
> > https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp
> >
> > Can you add some Doxygen comments to the new APIs that explain how
> > these APIs are to be used (and what the parameters mean)? The hope
> > would be that a user could make use of the column index functionality
> > by reading the .h files only.
> >
> > Thanks
> > Wes
> >
> > On Fri, Feb 14, 2020 at 2:57 PM Lekshmi Narayanan, Arun Balajiee
> >  wrote:
> > >
> > > Hi
> > > I have made my changes for api here, does it look good and is this what 
> > > you were seeking from me? The writer- api is still in the works and I 
> > > need to make the reader more generic to support all class data types.
> > >
> > > https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
> > >
> > >
> > > Regards,
> > > Arun Balajiee
> > >
> > > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > > Sent: Tuesday, February 4, 2020 11:24 PM
> > > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> > > Shein<mailto:sh...@microfocus.com>
> > > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> > >
> > > hi Arun,
> > >
> > > We can

Re: Arrow 1404: Adding index for Page-level Skipping

2020-02-21 Thread Wes McKinney
The data page statistics aren't currently being used during the "scan
to Arrow" procedure. That's likely to change at some point since the
Arrow Datasets project will provide a higher level API to indicate
filter predicates.
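
To illustrate how per-page min/max statistics could serve a filter
predicate, here is a minimal sketch. PageStats and PagesToRead are
hypothetical names invented for this example, not part of the Arrow or
Parquet C++ API, and the statistics are assumed to be already decoded
into int64 values.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-page statistics; not a type from the Arrow/Parquet C++ API.
struct PageStats {
  int64_t min_value;  // smallest value stored in the page
  int64_t max_value;  // largest value stored in the page
};

// Return the indices of pages whose [min, max] range overlaps the closed
// interval [lo, hi]. A page outside that interval cannot contain a match,
// so a reader could skip it without decoding it.
std::vector<std::size_t> PagesToRead(const std::vector<PageStats>& pages,
                                     int64_t lo, int64_t hi) {
  std::vector<std::size_t> selected;
  for (std::size_t i = 0; i < pages.size(); ++i) {
    const bool overlaps = pages[i].max_value >= lo && pages[i].min_value <= hi;
    if (overlaps) selected.push_back(i);
  }
  return selected;
}
```

The overlap test is conservative: a selected page may still contain no
matching value, so the predicate has to be re-applied to the decoded rows.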

On Thu, Feb 20, 2020 at 3:25 PM Lekshmi Narayanan, Arun Balajiee
 wrote:
>
> Thanks Wes. I got it now. I am working on that. But I have a general question 
> though, were page indices  which store min/max values implemented in arrow 
> parquet ( not referring to column indices or offset indices, just page 
> indices)
>
> Regards,
> Arun Balajiee
>
> From: Wes McKinney<mailto:wesmck...@gmail.com>
> Sent: Tuesday, February 18, 2020 3:34 AM
> To: Parquet Dev<mailto:dev@parquet.apache.org>
> Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> Shein<mailto:sh...@microfocus.com>
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> That's helpful, but I think it would be a good idea to have enough
> information in the header files to determine what the new APIs do
> without reading example code.
>
> On Mon, Feb 17, 2020 at 10:59 AM Lekshmi Narayanan, Arun Balajiee
>  wrote:
> >
> > I also made changes in the low-level-api folder, couldn’t capture in that 
> > link I think
> > https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
> >
> > Regards,
> > Arun Balajiee
> >
> > ____________
> > From: Wes McKinney 
> > Sent: Monday, February 17, 2020 8:11:09 AM
> > To: Parquet Dev 
> > Cc: Deepak Majeti ; Anatoli Shein 
> > 
> > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> >
> > hi Arun,
> >
> > By "public APIs" I was referring to changes in the public header
> > files. I see there are some changes to parquet/file_reader.h and
> > metadata.h
> >
> > https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp
> >
> > Can you add some Doxygen comments to the new APIs that explain how
> > these APIs are to be used (and what the parameters mean)? The hope
> > would be that a user could make use of the column index functionality
> > by reading the .h files only.
> >
> > Thanks
> > Wes
> >
> > On Fri, Feb 14, 2020 at 2:57 PM Lekshmi Narayanan, Arun Balajiee
> >  wrote:
> > >
> > > Hi
> > > I have made my changes for api here, does it look good and is this what 
> > > you were seeking from me? The writer- api is still in the works and I 
> > > need to make the reader more generic to support all class data types.
> > >
> > > https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
> > >
> > >
> > > Regards,
> > > Arun Balajiee
> > >
> > > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > > Sent: Tuesday, February 4, 2020 11:24 PM
> > > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> > > Shein<mailto:sh...@microfocus.com>
> > > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> > >
> > > hi Arun,
> > >
> > > We can keep the discussion going on here and on GitHub when you have a
> > > pull request to discuss. There are a number of different people who
> > > can give advice.
> > >
> > > Thanks
> > >
> > > On Tue, Feb 4, 2020 at 10:11 PM Lekshmi Narayanan, Arun Balajiee
> >

RE: Arrow 1404: Adding index for Page-level Skipping

2020-02-20 Thread Lekshmi Narayanan, Arun Balajiee
Thanks Wes, I understand now and am working on that. I have a general question,
though: are page indices that store min/max values implemented in Arrow Parquet?
(I am not referring to column indices or offset indices, just page indices.)

Regards,
Arun Balajiee

From: Wes McKinney<mailto:wesmck...@gmail.com>
Sent: Tuesday, February 18, 2020 3:34 AM
To: Parquet Dev<mailto:dev@parquet.apache.org>
Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
Shein<mailto:sh...@microfocus.com>
Subject: Re: Arrow 1404: Adding index for Page-level Skipping

That's helpful, but I think it would be a good idea to have enough
information in the header files to determine what the new APIs do
without reading example code.

On Mon, Feb 17, 2020 at 10:59 AM Lekshmi Narayanan, Arun Balajiee
 wrote:
>
> I also made changes in the low-level-api folder, couldn’t capture in that 
> link I think
> https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
>
> Regards,
> Arun Balajiee
>
> 
> From: Wes McKinney 
> Sent: Monday, February 17, 2020 8:11:09 AM
> To: Parquet Dev 
> Cc: Deepak Majeti ; Anatoli Shein 
> 
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> hi Arun,
>
> By "public APIs" I was referring to changes in the public header
> files. I see there are some changes to parquet/file_reader.h and
> metadata.h
>
> https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp
>
> Can you add some Doxygen comments to the new APIs that explain how
> these APIs are to be used (and what the parameters mean)? The hope
> would be that a user could make use of the column index functionality
> by reading the .h files only.
>
> Thanks
> Wes
>
> On Fri, Feb 14, 2020 at 2:57 PM Lekshmi Narayanan, Arun Balajiee
>  wrote:
> >
> > Hi
> > I have made my changes for api here, does it look good and is this what you 
> > were seeking from me? The writer- api is still in the works and I need to 
> > make the reader more generic to support all class data types.
> >
> > https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
> >
> >
> > Regards,
> > Arun Balajiee
> >
> > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > Sent: Tuesday, February 4, 2020 11:24 PM
> > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> > Shein<mailto:sh...@microfocus.com>
> > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> >
> > hi Arun,
> >
> > We can keep the discussion going on here and on GitHub when you have a
> > pull request to discuss. There are a number of different people who
> > can give advice.
> >
> > Thanks
> >
> > On Tue, Feb 4, 2020 at 10:11 PM Lekshmi Narayanan, Arun Balajiee
> >  wrote:
> > >
> > > Actually I made some changes after the date on the pull request ( even in 
> > > this year), which are not getting reflected on this compare link
> > >
> > > Regards,
> > > Arun Balajiee
> > >
> > > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > > Sent: Tuesday, February 4, 2020 6:43 PM
> > > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> > > Shein<mailto:sh...@microfocus.com>
> > > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> > >
> > > Here'

Re: Arrow 1404: Adding index for Page-level Skipping

2020-02-18 Thread Wes McKinney
That's helpful, but I think it would be a good idea to have enough
information in the header files to determine what the new APIs do
without reading example code.

On Mon, Feb 17, 2020 at 10:59 AM Lekshmi Narayanan, Arun Balajiee
 wrote:
>
> I also made changes in the low-level-api folder, couldn’t capture in that 
> link I think
> https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
>
> Regards,
> Arun Balajiee
>
> 
> From: Wes McKinney 
> Sent: Monday, February 17, 2020 8:11:09 AM
> To: Parquet Dev 
> Cc: Deepak Majeti ; Anatoli Shein 
> 
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> hi Arun,
>
> By "public APIs" I was referring to changes in the public header
> files. I see there are some changes to parquet/file_reader.h and
> metadata.h
>
> https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp
>
> Can you add some Doxygen comments to the new APIs that explain how
> these APIs are to be used (and what the parameters mean)? The hope
> would be that a user could make use of the column index functionality
> by reading the .h files only.
>
> Thanks
> Wes
>
> On Fri, Feb 14, 2020 at 2:57 PM Lekshmi Narayanan, Arun Balajiee
>  wrote:
> >
> > Hi
> > I have made my changes for api here, does it look good and is this what you 
> > were seeking from me? The writer- api is still in the works and I need to 
> > make the reader more generic to support all class data types.
> >
> > https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
> >
> >
> > Regards,
> > Arun Balajiee
> >
> > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > Sent: Tuesday, February 4, 2020 11:24 PM
> > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> > Shein<mailto:sh...@microfocus.com>
> > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> >
> > hi Arun,
> >
> > We can keep the discussion going on here and on GitHub when you have a
> > pull request to discuss. There are a number of different people who
> > can give advice.
> >
> > Thanks
> >
> > On Tue, Feb 4, 2020 at 10:11 PM Lekshmi Narayanan, Arun Balajiee
> >  wrote:
> > >
> > > Actually I made some changes after the date on the pull request ( even in 
> > > this year), which are not getting reflected on this compare link
> > >
> > > Regards,
> > > Arun Balajiee
> > >
> > > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > > Sent: Tuesday, February 4, 2020 6:43 PM
> > > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> > > Shein<mailto:sh...@microfocus.com>
> > > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> > >
> > > Here's a compare link in case others want to have a look
> > >
> > > https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp
> > >
> > > On Tue, Feb 4, 2020 at 5:41 PM Wes McKinney  wrote:
> > > >
> > > > hi Arun,
> > > >
> > > > I took a brief look at your branch. One thing that is missing is the
> > > > proposed public APIs that use the index pages -- that would be very
> > > > helpful for this di

RE: Arrow 1404: Adding index for Page-level Skipping

2020-02-17 Thread Lekshmi Narayanan, Arun Balajiee
I also made changes in the low-level-api folder, which I think that link did 
not capture:
https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc

Regards,
Arun Balajiee


From: Wes McKinney 
Sent: Monday, February 17, 2020 8:11:09 AM
To: Parquet Dev 
Cc: Deepak Majeti ; Anatoli Shein 

Subject: Re: Arrow 1404: Adding index for Page-level Skipping

hi Arun,

By "public APIs" I was referring to changes in the public header
files. I see there are some changes to parquet/file_reader.h and
metadata.h

https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp

Can you add some Doxygen comments to the new APIs that explain how
these APIs are to be used (and what the parameters mean)? The hope
would be that a user could make use of the column index functionality
by reading the .h files only.

Thanks
Wes

On Fri, Feb 14, 2020 at 2:57 PM Lekshmi Narayanan, Arun Balajiee
 wrote:
>
> Hi
> I have made my changes for api here, does it look good and is this what you 
> were seeking from me? The writer- api is still in the works and I need to 
> make the reader more generic to support all class data types.
>
> https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
>
>
> Regards,
> Arun Balajiee
>
> From: Wes McKinney<mailto:wesmck...@gmail.com>
> Sent: Tuesday, February 4, 2020 11:24 PM
> To: Parquet Dev<mailto:dev@parquet.apache.org>
> Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> Shein<mailto:sh...@microfocus.com>
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> hi Arun,
>
> We can keep the discussion going on here and on GitHub when you have a
> pull request to discuss. There are a number of different people who
> can give advice.
>
> Thanks
>
> On Tue, Feb 4, 2020 at 10:11 PM Lekshmi Narayanan, Arun Balajiee
>  wrote:
> >
> > Actually I made some changes after the date on the pull request ( even in 
> > this year), which are not getting reflected on this compare link
> >
> > Regards,
> > Arun Balajiee
> >
> > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > Sent: Tuesday, February 4, 2020 6:43 PM
> > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> > Shein<mailto:sh...@microfocus.com>
> > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> >
> > Here's a compare link in case others want to have a look
> >
> > https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp
> >
> > On Tue, Feb 4, 2020 at 5:41 PM Wes McKinney  wrote:
> > >
> > > hi Arun,
> > >
> > > I took a brief look at your branch. One thing that is missing is the
> > > proposed public APIs that use the index pages -- that would be very
> > > helpful for this discussion.
> > >
> > > I don't think we have any code for doing random access of a particular
> > > data page in a column chunk, so having as an initial matter would also
> > > be helpful.
> > >
> > > - Wes
> > >
> > > On Tue, Feb 4, 2020 at 2:28 PM Lekshmi Narayanan, Arun Balajiee
> > >  wrote:
> > > >
> > > > Hi Parquet dev
> > > >
> > > > Deepak Majeti was my dev lead during my summer internship, from when I 
> > > > am trying to add a few changes in the Arrow Parquet Project for the 
> > > > ticket below

Re: Arrow 1404: Adding index for Page-level Skipping

2020-02-17 Thread Wes McKinney
hi Arun,

By "public APIs" I was referring to changes in the public header
files. I see there are some changes to parquet/file_reader.h and
metadata.h

https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp

Can you add some Doxygen comments to the new APIs that explain how
these APIs are to be used (and what the parameters mean)? The hope
would be that a user could make use of the column index functionality
by reading the .h files only.

Thanks
Wes
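
As an illustration of the kind of header documentation requested above,
here is a minimal sketch of Doxygen comments on a page-lookup helper. The
field names mirror the PageLocation struct of the Parquet format, but the
C++ struct and the FindPageForRow function are hypothetical and are not
taken from the branch under discussion.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

/// \brief Location of one data page within a column chunk.
///
/// Hypothetical stand-in whose fields mirror the Parquet format's
/// PageLocation entry in the offset index.
struct PageLocation {
  int64_t offset;                ///< Byte offset of the page in the file.
  int32_t compressed_page_size;  ///< Page size in bytes, including its header.
  int64_t first_row_index;       ///< First row of the page within the row group.
};

/// \brief Find the data page that contains a given row.
///
/// The caller is expected to check \p row_index against the row group's
/// row count; any row at or beyond the last page's first_row_index maps
/// to the last page.
///
/// \param[in] locations Page locations of one column chunk, sorted by
///   ascending first_row_index.
/// \param[in] row_index Zero-based row index within the row group.
/// \return Index into \p locations of the page holding \p row_index, or -1
///   if \p locations is empty or \p row_index precedes the first page.
inline int64_t FindPageForRow(const std::vector<PageLocation>& locations,
                              int64_t row_index) {
  // Find the first page whose first_row_index is strictly greater than row_index.
  auto it = std::upper_bound(
      locations.begin(), locations.end(), row_index,
      [](int64_t row, const PageLocation& loc) { return row < loc.first_row_index; });
  // The page just before it (if any) is the one containing row_index.
  return static_cast<int64_t>(it - locations.begin()) - 1;
}
```

With comments like these, a reader of the .h file can see what each
parameter means and what the return value signals without opening the
example program.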

On Fri, Feb 14, 2020 at 2:57 PM Lekshmi Narayanan, Arun Balajiee
 wrote:
>
> Hi
> I have made my changes for api here, does it look good and is this what you 
> were seeking from me? The writer- api is still in the works and I need to 
> make the reader more generic to support all class data types.
>
> https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
>
>
> Regards,
> Arun Balajiee
>
> From: Wes McKinney<mailto:wesmck...@gmail.com>
> Sent: Tuesday, February 4, 2020 11:24 PM
> To: Parquet Dev<mailto:dev@parquet.apache.org>
> Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> Shein<mailto:sh...@microfocus.com>
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> hi Arun,
>
> We can keep the discussion going on here and on GitHub when you have a
> pull request to discuss. There are a number of different people who
> can give advice.
>
> Thanks
>
> On Tue, Feb 4, 2020 at 10:11 PM Lekshmi Narayanan, Arun Balajiee
>  wrote:
> >
> > Actually I made some changes after the date on the pull request ( even in 
> > this year), which are not getting reflected on this compare link
> >
> > Regards,
> > Arun Balajiee
> >
> > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > Sent: Tuesday, February 4, 2020 6:43 PM
> > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> > Shein<mailto:sh...@microfocus.com>
> > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> >
> > Here's a compare link in case others want to have a look
> >
> > https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp
> >
> > On Tue, Feb 4, 2020 at 5:41 PM Wes McKinney  wrote:
> > >
> > > hi Arun,
> > >
> > > I took a brief look at your branch. One thing that is missing is the
> > > proposed public APIs that use the index pages -- that would be very
> > > helpful for this discussion.
> > >
> > > I don't think we have any code for doing random access of a particular
> > > data page in a column chunk, so having as an initial matter would also
> > > be helpful.
> > >
> > > - Wes
> > >
> > > On Tue, Feb 4, 2020 at 2:28 PM Lekshmi Narayanan, Arun Balajiee
> > >  wrote:
> > > >
> > > > Hi Parquet dev
> > > >
> > > > Deepak Majeti was my dev lead during my summer internship, from when I 
> > > > am trying to add a few changes in the Arrow Parquet Project for the 
> > > > ticket below
> > > >
> > > > https://issues.apache.org/jira/browse/PARQUET-1404
> > > >  (Assigned to Deepak)
> > > >
> > > > With this regard, I am making a few changes to 
> > > > src/parquet/file_reader.cc ( in a fork on my repository)
> > > >
> > > > https://github.com/a2un/arrow/tree/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp
> > > >
> > > > I am stuck at trying to read a particular row using the index that I 
> > > > get in the page_location array struct of offset index. Could you help 
> > > > me with this ? and if there have been discussions on the forums for 
> > > > this as well, could you direct me to that link?
> > > >
> > > > Regards,
> > > > Arun Balajiee
> > > >
> >
>


RE: Arrow 1404: Adding index for Page-level Skipping

2020-02-14 Thread Lekshmi Narayanan, Arun Balajiee
Hi
I have made my API changes here. Do they look good, and is this what you were 
looking for? The writer API is still in the works, and I need to make the 
reader more generic to support all data types.

https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc


Regards,
Arun Balajiee

From: Wes McKinney<mailto:wesmck...@gmail.com>
Sent: Tuesday, February 4, 2020 11:24 PM
To: Parquet Dev<mailto:dev@parquet.apache.org>
Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
Shein<mailto:sh...@microfocus.com>
Subject: Re: Arrow 1404: Adding index for Page-level Skipping

hi Arun,

We can keep the discussion going on here and on GitHub when you have a
pull request to discuss. There are a number of different people who
can give advice.

Thanks

On Tue, Feb 4, 2020 at 10:11 PM Lekshmi Narayanan, Arun Balajiee
 wrote:
>
> Actually I made some changes after the date on the pull request ( even in 
> this year), which are not getting reflected on this compare link
>
> Regards,
> Arun Balajiee
>
> From: Wes McKinney<mailto:wesmck...@gmail.com>
> Sent: Tuesday, February 4, 2020 6:43 PM
> To: Parquet Dev<mailto:dev@parquet.apache.org>
> Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> Shein<mailto:sh...@microfocus.com>
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> Here's a compare link in case others want to have a look
>
> https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp
>
> On Tue, Feb 4, 2020 at 5:41 PM Wes McKinney  wrote:
> >
> > hi Arun,
> >
> > I took a brief look at your branch. One thing that is missing is the
> > proposed public APIs that use the index pages -- that would be very
> > helpful for this discussion.
> >
> > I don't think we have any code for doing random access of a particular
> > data page in a column chunk, so having as an initial matter would also
> > be helpful.
> >
> > - Wes
> >
> > On Tue, Feb 4, 2020 at 2:28 PM Lekshmi Narayanan, Arun Balajiee
> >  wrote:
> > >
> > > Hi Parquet dev
> > >
> > > Deepak Majeti was my dev lead during my summer internship, from when I am 
> > > trying to add a few changes in the Arrow Parquet Project for the ticket 
> > > below
> > >
> > > https://issues.apache.org/jira/browse/PARQUET-1404
> > >  (Assigned to Deepak)
> > >
> > > With this regard, I am making a few changes to src/parquet/file_reader.cc 
> > > ( in a fork on my repository)
> > >
> > > https://github.com/a2un/arrow/tree/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp
> > >
> > > I am stuck at trying to read a particular row using the index that I get 
> > > in the page_location array struct of offset index. Could you help me with 
> > > this ? and if there have been discussions on the forums for this as well, 
> > > could you direct me to that link?
> > >
> > > Regards,
> > > Arun Balajiee
> > >
>



Re: Arrow 1404: Adding index for Page-level Skipping

2020-02-04 Thread Wes McKinney
hi Arun,

We can keep the discussion going on here and on GitHub when you have a
pull request to discuss. There are a number of different people who
can give advice.

Thanks

On Tue, Feb 4, 2020 at 10:11 PM Lekshmi Narayanan, Arun Balajiee
 wrote:
>
> Actually I made some changes after the date on the pull request ( even in 
> this year), which are not getting reflected on this compare link
>
> Regards,
> Arun Balajiee
>
> From: Wes McKinney<mailto:wesmck...@gmail.com>
> Sent: Tuesday, February 4, 2020 6:43 PM
> To: Parquet Dev<mailto:dev@parquet.apache.org>
> Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> Shein<mailto:sh...@microfocus.com>
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> Here's a compare link in case others want to have a look
>
> https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp
>
> On Tue, Feb 4, 2020 at 5:41 PM Wes McKinney  wrote:
> >
> > hi Arun,
> >
> > I took a brief look at your branch. One thing that is missing is the
> > proposed public APIs that use the index pages -- that would be very
> > helpful for this discussion.
> >
> > I don't think we have any code for doing random access of a particular
> > data page in a column chunk, so having as an initial matter would also
> > be helpful.
> >
> > - Wes
> >
> > On Tue, Feb 4, 2020 at 2:28 PM Lekshmi Narayanan, Arun Balajiee
> >  wrote:
> > >
> > > Hi Parquet dev
> > >
> > > Deepak Majeti was my dev lead during my summer internship, from when I am 
> > > trying to add a few changes in the Arrow Parquet Project for the ticket 
> > > below
> > >
> > > https://issues.apache.org/jira/browse/PARQUET-1404
> > >  (Assigned to Deepak)
> > >
> > > With this regard, I am making a few changes to src/parquet/file_reader.cc 
> > > ( in a fork on my repository)
> > >
> > > https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fa2un%2Farrow%2Ftree%2FPARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp%2Fcpp&data=02%7C01%7CARL122%40pitt.edu%7Cae7f0408b49c4ab408d7a9cbfbd5%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C637164565879592140&sdata=cNkK9cL7v6bqI6%2FM50SyLDs%2BPQ0IVmYvvc9MnYD9WgA%3D&reserved=0
> > >
> > > I am stuck at trying to read a particular row using the index that I get 
> > > in the page_location array struct of offset index. Could you help me with 
> > > this ? and if there have been discussions on the forums for this as well, 
> > > could you direct me to that link?
> > >
> > > Regards,
> > > Arun Balajiee
> > >
>


RE: Arrow 1404: Adding index for Page-level Skipping

2020-02-04 Thread Lekshmi Narayanan, Arun Balajiee
Actually, I made some changes after the date on the pull request (even this
year), and they are not reflected in this compare link

Regards,
Arun Balajiee

From: Wes McKinney <wesmck...@gmail.com>
Sent: Tuesday, February 4, 2020 6:43 PM
To: Parquet Dev <dev@parquet.apache.org>
Cc: Deepak Majeti <deepak.maj...@microfocus.com>; Anatoli Shein
<sh...@microfocus.com>
Subject: Re: Arrow 1404: Adding index for Page-level Skipping

Here's a compare link in case others want to have a look

https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp

On Tue, Feb 4, 2020 at 5:41 PM Wes McKinney  wrote:
>
> hi Arun,
>
> I took a brief look at your branch. One thing that is missing is the
> proposed public APIs that use the index pages -- that would be very
> helpful for this discussion.
>
> I don't think we have any code for doing random access of a particular
> data page in a column chunk, so having that as an initial matter would also
> be helpful.
>
> - Wes
>
> On Tue, Feb 4, 2020 at 2:28 PM Lekshmi Narayanan, Arun Balajiee
>  wrote:
> >
> > Hi Parquet dev
> >
> > Deepak Majeti was my dev lead during my summer internship, and since then I
> > have been trying to make a few changes to the Arrow Parquet project for the
> > ticket below
> >
> > https://issues.apache.org/jira/browse/PARQUET-1404 (Assigned to Deepak)
> >
> > In this regard, I am making a few changes to src/parquet/file_reader.cc (in
> > a fork in my repository)
> >
> > https://github.com/a2un/arrow/tree/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp
> >
> > I am stuck trying to read a particular row using the index that I get in
> > the page_location array struct of the offset index. Could you help me with
> > this? If there have been discussions on the forums about this as well,
> > could you direct me to them?
> >
> > Regards,
> > Arun Balajiee
> >



Re: Arrow 1404: Adding index for Page-level Skipping

2020-02-04 Thread Wes McKinney
hi Arun,

I took a brief look at your branch. One thing that is missing is the
proposed public APIs that use the index pages -- that would be very
helpful for this discussion.

I don't think we have any code for doing random access of a particular
data page in a column chunk, so having that as an initial matter would also
be helpful.

- Wes

On Tue, Feb 4, 2020 at 2:28 PM Lekshmi Narayanan, Arun Balajiee
 wrote:
>
> Hi Parquet dev
>
> Deepak Majeti was my dev lead during my summer internship, and since then I
> have been trying to make a few changes to the Arrow Parquet project for the
> ticket below
>
> https://issues.apache.org/jira/browse/PARQUET-1404 (Assigned to Deepak)
>
> In this regard, I am making a few changes to src/parquet/file_reader.cc (in
> a fork in my repository)
>
> https://github.com/a2un/arrow/tree/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp
>
> I am stuck trying to read a particular row using the index that I get in the
> page_location array struct of the offset index. Could you help me with this?
> If there have been discussions on the forums about this as well, could you
> direct me to them?
>
> Regards,
> Arun Balajiee
>


Re: Arrow 1404: Adding index for Page-level Skipping

2020-02-04 Thread Wes McKinney
Here's a compare link in case others want to have a look

https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp

On Tue, Feb 4, 2020 at 5:41 PM Wes McKinney  wrote:
>
> hi Arun,
>
> I took a brief look at your branch. One thing that is missing is the
> proposed public APIs that use the index pages -- that would be very
> helpful for this discussion.
>
> I don't think we have any code for doing random access of a particular
> data page in a column chunk, so having that as an initial matter would also
> be helpful.
>
> - Wes
>
> On Tue, Feb 4, 2020 at 2:28 PM Lekshmi Narayanan, Arun Balajiee
>  wrote:
> >
> > Hi Parquet dev
> >
> > Deepak Majeti was my dev lead during my summer internship, and since then I
> > have been trying to make a few changes to the Arrow Parquet project for the
> > ticket below
> >
> > https://issues.apache.org/jira/browse/PARQUET-1404 (Assigned to Deepak)
> >
> > In this regard, I am making a few changes to src/parquet/file_reader.cc (in
> > a fork in my repository)
> >
> > https://github.com/a2un/arrow/tree/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp
> >
> > I am stuck trying to read a particular row using the index that I get in
> > the page_location array struct of the offset index. Could you help me with
> > this? If there have been discussions on the forums about this as well,
> > could you direct me to them?
> >
> > Regards,
> > Arun Balajiee
> >


Arrow 1404: Adding index for Page-level Skipping

2020-02-04 Thread Lekshmi Narayanan, Arun Balajiee
Hi Parquet dev

Deepak Majeti was my dev lead during my summer internship, and since then I
have been trying to make a few changes to the Arrow Parquet project for the
ticket below

https://issues.apache.org/jira/browse/PARQUET-1404 (Assigned to Deepak)

In this regard, I am making a few changes to src/parquet/file_reader.cc (in a
fork in my repository)

https://github.com/a2un/arrow/tree/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp

I am stuck trying to read a particular row using the index that I get in the
page_location array struct of the offset index. Could you help me with this?
If there have been discussions on the forums about this as well, could you
direct me to them?

Regards,
Arun Balajiee
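
As a rough illustration of the question above, here is a minimal sketch of how
a row number could be mapped to a page using the page_location entries of the
offset index. It is not the approach taken in the branch under discussion.
FindPageForRow is a hypothetical helper, the struct is a hand-written mirror of
the format's PageLocation, and the sketch assumes the entries are sorted by
first_row_index. The caller is assumed to have already checked that the row is
below the row group's row count, since the offset index alone does not record
where the last page ends.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hand-written mirror of the offset index's PageLocation entry.
struct PageLocation {
  int64_t offset;                // file offset where the page header starts
  int32_t compressed_page_size;  // page size in bytes, header included
  int64_t first_row_index;       // first row of the page within the row group
};

// Hypothetical helper (not a parquet-cpp API): returns the ordinal of the page
// that contains `row`, or -1 if the list is empty or `row` precedes the first
// page. Rows past the end of the row group map to the last page, so the
// caller must bounds-check against the row group's row count.
int64_t FindPageForRow(const std::vector<PageLocation>& pages, int64_t row) {
  // Debug-only check of the sortedness assumption.
  assert(std::is_sorted(pages.begin(), pages.end(),
                        [](const PageLocation& a, const PageLocation& b) {
                          return a.first_row_index < b.first_row_index;
                        }));
  // Binary search for the first page that starts strictly after `row`.
  auto it = std::upper_bound(
      pages.begin(), pages.end(), row,
      [](int64_t r, const PageLocation& p) { return r < p.first_row_index; });
  if (it == pages.begin()) return -1;
  return static_cast<int64_t>(it - pages.begin()) - 1;
}
```

Once the page ordinal i is known, the number of rows to skip inside that page
would be row - pages[i].first_row_index.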