Re: Arrow 1404: Adding index for Page-level Skipping
It doesn't appear so, from a quick look over the code base. It seems that PageReader [1] or Page [2] would need to track this information.

[1] https://github.com/apache/arrow/blob/fe142922f6a4b801ece8fd16a1bff9836a8aaf77/cpp/src/parquet/column_reader.h#L99
[2] https://github.com/apache/arrow/blob/809d40ab9518bd254705f35af01162a9da588516/cpp/src/parquet/column_page.h

On Fri, Jun 19, 2020 at 3:42 PM Lekshmi Narayanan, Arun Balajiee <arl...@pitt.edu> wrote:

> Hi
>
> I think I should reframe my question. I am working on PARQUET-1404. I am
> not looking to extend or change this API. I want to understand ReadBatch,
> and my question is: when I make calls to read values from the Parquet file
> internally, row by row, is there a way to know which page I am at?
>
> Regards,
> Arun Balajiee
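The suggestion in this reply, that the page reader itself would have to track the position, can be pictured as a thin wrapper that counts pages as they are handed out. The sketch below is only an illustration under stated assumptions: `FakePage`, `FakePageSource`, and `PageTrackingSource` are hypothetical stand-ins that mimic the `NextPage()` shape of a page reader. They are not the parquet-cpp classes, and nothing here is taken from the pull request.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a data page; only the value count is modelled.
struct FakePage {
  int64_t num_values;
};

// Hypothetical stand-in for a page reader over one column chunk.
class FakePageSource {
 public:
  explicit FakePageSource(std::vector<FakePage> pages) : pages_(std::move(pages)) {
    // Page value counts are never negative.
    for (const FakePage& p : pages_) {
      assert(p.num_values >= 0);
      (void)p;  // silence unused-variable warnings when asserts are disabled
    }
  }

  // Returns the next page, or nullptr once the column chunk is exhausted.
  std::shared_ptr<FakePage> NextPage() {
    if (next_ >= pages_.size()) return nullptr;
    return std::make_shared<FakePage>(pages_[next_++]);
  }

 private:
  std::vector<FakePage> pages_;
  std::size_t next_ = 0;
};

// Wrapper that remembers the ordinal of the page most recently returned.
class PageTrackingSource {
 public:
  explicit PageTrackingSource(FakePageSource source) : source_(std::move(source)) {}

  std::shared_ptr<FakePage> NextPage() {
    std::shared_ptr<FakePage> page = source_.NextPage();
    if (page != nullptr) ++current_page_;  // only count pages that exist
    return page;
  }

  // Zero-based ordinal of the last page returned; -1 before the first read.
  int64_t current_page() const { return current_page_; }

 private:
  FakePageSource source_;
  int64_t current_page_ = -1;
};
```

A caller would ask `current_page()` after each `NextPage()` call. Whether such a counter belongs in the reader or in the page object is exactly the open design question raised above.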
Re: Arrow 1404: Adding index for Page-level Skipping
Hi

I think I should reframe my question. I am working on PARQUET-1404. I am not looking to extend or change this API. I want to understand ReadBatch, and my question is: when I make calls to read values from the Parquet file internally, row by row, is there a way to know which page I am at?

Regards,
Arun Balajiee

From: Micah Kornfield
Sent: Thursday, June 18, 2020 11:21:42 PM
To: dev@parquet.apache.org
Subject: Re: Arrow 1404: Adding index for Page-level Skipping

Is this internally in the class or adding a parameter in the API? What is the use case?
Re: Arrow 1404: Adding index for Page-level Skipping
Is this internally in the class or adding a parameter in the API? What is the use case?

On Saturday, June 13, 2020, Lekshmi Narayanan, Arun Balajiee <arl...@pitt.edu> wrote:

> Hi Dev
>
> Thanks Wes for these comments.
>
> As informed in other threads, I have completed most of it. I will try to
> structure it according to the comments.
>
> I had one question regarding a (un)related matter. Whenever we make calls to
>
>     ReadBatch(int64_t batch_size, int16_t* def_levels, int16_t* rep_levels,
>               T* values, int64_t* values_read)
>
> is there a possibility to keep track of which page we are at when retrieving
> values?
>
> Regards
> Arun Balajiee
Re: Arrow 1404: Adding index for Page-level Skipping
Hi Dev

Thanks Wes for these comments.

As informed in other threads, I have completed most of it. I will try to structure it according to the comments.

I had one question regarding a (un)related matter. Whenever we make calls to

    ReadBatch(int64_t batch_size, int16_t* def_levels, int16_t* rep_levels,
              T* values, int64_t* values_read)

is there a possibility to keep track of which page we are at when retrieving values?

Regards
Arun Balajiee

From: Wes McKinney
Sent: 02 April 2020 13:16
To: Parquet Dev
Cc: Deepak Majeti; Anatoli Shein
Subject: Re: Arrow 1404: Adding index for Page-level Skipping

I just left comments on the PR. The new APIs (their semantics and what should be passed as arguments) are still not adequately documented (in other words, I wouldn't know how to use them just from reading the header file), so I think we should focus on that for the moment. In fairness, documentation for other functions in these headers is poor, but they also have the semantics of "read all data in the file from start to finish". These new APIs appear to do something different, so we need to write that down in detail in Doxygen-style comments.
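For the ReadBatch question in this message, one possible approach is to count the rows consumed so far and look the running total up against the first row of each page. The sketch below assumes the caller already has those first-row indexes in ascending order (the Parquet offset index stores a `first_row_index` per page location) and assumes a flat, non-repeated column, where one value corresponds to one row. `PageOrdinalForRow` and `RowCursor` are hypothetical names, not parquet-cpp APIs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Returns the ordinal of the page that holds `row`, given the index of the
// first row of every page in ascending order (first_row_index[0] == 0).
// Rows past the last page start are attributed to the last page.
inline int64_t PageOrdinalForRow(const std::vector<int64_t>& first_row_index,
                                 int64_t row) {
  assert(!first_row_index.empty() && first_row_index.front() == 0);
  assert(row >= 0);
  // upper_bound finds the first page that starts after `row`;
  // the page just before it is the one containing `row`.
  auto it = std::upper_bound(first_row_index.begin(), first_row_index.end(), row);
  return static_cast<int64_t>(it - first_row_index.begin()) - 1;
}

// Keeps a running count of rows consumed across successive batch reads.
class RowCursor {
 public:
  explicit RowCursor(std::vector<int64_t> first_row_index)
      : first_row_index_(std::move(first_row_index)) {}

  // Call after each batch read with the number of rows it returned.
  void Advance(int64_t rows_read) {
    assert(rows_read >= 0);
    next_row_ += rows_read;
  }

  // Page in which the next row to be read lives.
  int64_t current_page() const {
    return PageOrdinalForRow(first_row_index_, next_row_);
  }

 private:
  std::vector<int64_t> first_row_index_;
  int64_t next_row_ = 0;
};
```

With page starts of 0, 100, and 250, a cursor that has advanced by 100 rows reports page 1. For repeated columns the number of levels read is not the number of rows, so the count passed to `Advance` would have to be derived from the repetition levels instead.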
Re: Arrow 1404: Adding index for Page-level Skipping
I just left comments on the PR. The new APIs (their semantics and what should be passed as arguments) are still not adequately documented (in other words, I wouldn't know how to use them just from reading the header file), so I think we should focus on that for the moment. In fairness, documentation for other functions in these headers is poor, but they also have the semantics of "read all data in the file from start to finish". These new APIs appear to do something different, so we need to write that down in detail in Doxygen-style comments.

On Thu, Apr 2, 2020 at 2:23 AM Lekshmi Narayanan, Arun Balajiee wrote:
>
> Hi
> Would my pull request be useful for the discussion from here?
> https://github.com/apache/arrow/pull/6807
>
> Regards,
> Arun Balajiee
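As an illustration of the kind of Doxygen-style comment being requested, here is a small sketch in which the semantics and every parameter can be understood from the declaration alone. `CountSkippablePages` is a hypothetical helper invented for this example; it is not one of the APIs from the pull request.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

/// \brief Count the whole pages that can be skipped before a target row.
///
/// Unlike a sequential read from start to finish, this looks only at page
/// boundaries. A page is skippable when the page after it starts at or
/// before the target row.
///
/// \param[in] first_row_index index of the first row of each page, ascending
/// \param[in] target_row zero-based row the caller wants to read next
/// \param[out] pages_skipped number of pages that end before target_row;
///             set to 0 when the function returns false
/// \return false if first_row_index is empty or target_row is negative,
///         true otherwise
inline bool CountSkippablePages(const std::vector<int64_t>& first_row_index,
                                int64_t target_row, int64_t* pages_skipped) {
  assert(pages_skipped != nullptr);
  *pages_skipped = 0;
  if (first_row_index.empty() || target_row < 0) return false;
  // Page i-1 can be skipped when page i starts at or before target_row.
  for (std::size_t i = 1; i < first_row_index.size(); ++i) {
    if (first_row_index[i] > target_row) break;
    ++(*pages_skipped);
  }
  return true;
}
```

The point is the comment block rather than the function body: the input, the output parameter, and the failure case are all stated in the header, without reference to example code.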
RE: Arrow 1404: Adding index for Page-level Skipping
Hi
Would my pull request be useful for the discussion from here?
https://github.com/apache/arrow/pull/6807

Regards,
Arun Balajiee

From: Wes McKinney<mailto:wesmck...@gmail.com>
Sent: Tuesday, February 18, 2020 3:34 AM
To: Parquet Dev<mailto:dev@parquet.apache.org>
Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli Shein<mailto:sh...@microfocus.com>
Subject: Re: Arrow 1404: Adding index for Page-level Skipping

That's helpful, but I think it would be a good idea to have enough
information in the header files to determine what the new APIs do
without reading example code.

On Mon, Feb 17, 2020 at 10:59 AM Lekshmi Narayanan, Arun Balajiee wrote:
>
> I also made changes in the low-level-api folder, couldn’t capture in that
> link I think
> https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
>
> Regards,
> Arun Balajiee
>
> From: Wes McKinney
> Sent: Monday, February 17, 2020 8:11:09 AM
> To: Parquet Dev
> Cc: Deepak Majeti; Anatoli Shein
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> hi Arun,
>
> By "public APIs" I was referring to changes in the public header
> files. I see there are some changes to parquet/file_reader.h and
> metadata.h
>
> https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp
>
> Can you add some Doxygen comments to the new APIs that explain how
> these APIs are to be used (and what the parameters mean)? The hope
> would be that a user could make use of the column index functionality
> by reading the .h files only.
>
> Thanks
> Wes
>
> On Fri, Feb 14, 2020 at 2:57 PM Lekshmi Narayanan, Arun Balajiee wrote:
> >
> > Hi
> > I have made my changes for api here, does it look good and is this what you
> > were seeking from me? The writer- api is still in the works and I need to
> > make the reader more generic to support all class data types.
> >
> > https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
> >
> > Regards,
> > Arun Balajiee
> >
> > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > Sent: Tuesday, February 4, 2020 11:24 PM
> > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli Shein<mailto:sh...@microfocus.com>
> > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> >
> > hi Arun,
> >
> > We can keep the discussion going on here and on GitHub when you have a
> > pull request to discuss. There are a number of different people who
> > can give advice.
> >
> > Thanks
> >
> > On Tue, Feb 4, 2020 at 10:11 PM Lekshmi Narayanan, Arun Balajiee wrote:
> > >
> > > Actually I made some changes after the date on the pull request ( even in
> > > this year), which are not getting reflected on this compare link
> > >
> > > Regards,
> > > Arun Balajiee
> > >
> > > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > > Sent: Tuesday, February 4, 2020 6:43 PM
> > > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli Shein<mailto:sh...@microfocus.com>
> > > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> > >
> > > Here's a compare link in case others want to have a look
> > >
> > > https://nam05.safelinks.protection.outlook.com/?url=h
Re: Arrow 1404: Adding index for Page-level Skipping
I don't think so.

On Mon, Feb 24, 2020 at 5:14 PM Lekshmi Narayanan, Arun Balajiee wrote:
>
> Will using a DataPage V2 or DataPage V1 cause any difference for this ticket?
>
> Regards,
> Arun Balajiee
RE: Arrow 1404: Adding index for Page-level Skipping
Will using a DataPage V2 or DataPage V1 cause any difference for this ticket?

Regards,
Arun Balajiee

From: Wes McKinney
Sent: Friday, February 21, 2020 3:06:58 AM
To: Parquet Dev
Cc: Deepak Majeti; Anatoli Shein
Subject: Re: Arrow 1404: Adding index for Page-level Skipping

The data page statistics aren't currently being used during the "scan to Arrow" procedure. That's likely to change at some point, since the Arrow Datasets project will provide a higher-level API to indicate filter predicates.
Re: Arrow 1404: Adding index for Page-level Skipping
The data page statistics aren't currently being used during the "scan to Arrow" procedure. That's likely to change at some point since the Arrow Datasets project will provide a higher level API to indicate filter predicates On Thu, Feb 20, 2020 at 3:25 PM Lekshmi Narayanan, Arun Balajiee wrote: > > Thanks Wes. I got it now. I am working on that. But I have a general question > though, were page indices which store min/max values implemented in arrow > parquet ( not referring to column indices or offset indices, just page > indices) > > Regards, > Arun Balajiee
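For readers following the point about data page statistics and filter predicates, here is a minimal sketch of how per-page min/max values could be used to decide which pages to decode. It is an illustration only: `PageStats` and `PagesToRead` are hypothetical names, not parquet-cpp types or functions, and the sketch assumes a single int64 column with a closed range predicate.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-page statistics for one int64 column.
// This is a stand-in for illustration, not a parquet-cpp type.
struct PageStats {
  int64_t min_value;  // smallest value stored in the page
  int64_t max_value;  // largest value stored in the page
};

// Returns the positions of the pages whose [min, max] range overlaps the
// closed predicate range [lo, hi]. A page that does not overlap cannot
// contain a matching value, so it can be skipped without being decoded.
std::vector<std::size_t> PagesToRead(const std::vector<PageStats>& pages,
                                     int64_t lo, int64_t hi) {
  assert(lo <= hi);
  std::vector<std::size_t> selected;
  for (std::size_t i = 0; i < pages.size(); ++i) {
    const bool overlaps = pages[i].max_value >= lo && pages[i].min_value <= hi;
    if (overlaps) selected.push_back(i);
  }
  return selected;
}
```

Note that an overlap only means the page may contain matches; the values in a selected page still have to be filtered after decoding.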
RE: Arrow 1404: Adding index for Page-level Skipping
Thanks Wes. I got it now. I am working on that. But I have a general question though, were page indices which store min/max values implemented in arrow parquet ( not referring to column indices or offset indices, just page indices) Regards, Arun Balajiee From: Wes McKinney<mailto:wesmck...@gmail.com> Sent: Tuesday, February 18, 2020 3:34 AM To: Parquet Dev<mailto:dev@parquet.apache.org> Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli Shein<mailto:sh...@microfocus.com> Subject: Re: Arrow 1404: Adding index for Page-level Skipping That's helpful, but I think it would be a good idea to have enough information in the header files to determine what the new APIs do without reading example code.
Re: Arrow 1404: Adding index for Page-level Skipping
That's helpful, but I think it would be a good idea to have enough information in the header files to determine what the new APIs do without reading example code. On Mon, Feb 17, 2020 at 10:59 AM Lekshmi Narayanan, Arun Balajiee wrote: > > I also made changes in the low-level-api folder, couldn’t capture in that > link I think > https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc > > Regards, > Arun Balajiee
RE: Arrow 1404: Adding index for Page-level Skipping
I also made changes in the low-level-api folder, couldn’t capture in that link I think https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc Regards, Arun Balajiee From: Wes McKinney Sent: Monday, February 17, 2020 8:11:09 AM To: Parquet Dev Cc: Deepak Majeti ; Anatoli Shein Subject: Re: Arrow 1404: Adding index for Page-level Skipping hi Arun, By "public APIs" I was referring to changes in the public header files. I see there are some changes to parquet/file_reader.h and metadata.h https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp Can you add some Doxygen comments to the new APIs that explain how these APIs are to be used (and what the parameters mean)? The hope would be that a user could make use of the column index functionality by reading the .h files only. Thanks Wes
Re: Arrow 1404: Adding index for Page-level Skipping
hi Arun, By "public APIs" I was referring to changes in the public header files. I see there are some changes to parquet/file_reader.h and metadata.h https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp Can you add some Doxygen comments to the new APIs that explain how these APIs are to be used (and what the parameters mean)? The hope would be that a user could make use of the column index functionality by reading the .h files only. Thanks Wes On Fri, Feb 14, 2020 at 2:57 PM Lekshmi Narayanan, Arun Balajiee wrote: > > Hi > I have made my changes for api here, does it look good and is this what you > were seeking from me? The writer- api is still in the works and I need to > make the reader more generic to support all class data types. > > https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc > > > Regards, > Arun Balajiee
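As an illustration of the kind of Doxygen comment requested above, the sketch below documents a small helper so that its purpose and parameters can be understood from the declaration alone. `RowsInPage` is a hypothetical function invented for this example; it is not one of the APIs in the branch, in parquet/file_reader.h, or in metadata.h.

```cpp
#include <cassert>
#include <cstdint>

/// \brief Count the rows stored in one data page.
///
/// Hypothetical helper for illustration only; it is not part of
/// parquet/file_reader.h or metadata.h.
///
/// \param[in] first_row_index index, within the row group, of the first row
///            stored in the page
/// \param[in] next_first_row_index first row index of the following page, or
///            the row group's total row count if this is the last page
/// \return the number of rows stored in the page
inline int64_t RowsInPage(int64_t first_row_index, int64_t next_first_row_index) {
  assert(next_first_row_index >= first_row_index);
  return next_first_row_index - first_row_index;
}
```

The comment states what each parameter means and what is returned, which is the information a caller would otherwise have to recover from example code.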
RE: Arrow 1404: Adding index for Page-level Skipping
Hi I have made my changes for api here, does it look good and is this what you were seeking from me? The writer- api is still in the works and I need to make the reader more generic to support all class data types. https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc Regards, Arun Balajiee From: Wes McKinney<mailto:wesmck...@gmail.com> Sent: Tuesday, February 4, 2020 11:24 PM To: Parquet Dev<mailto:dev@parquet.apache.org> Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli Shein<mailto:sh...@microfocus.com> Subject: Re: Arrow 1404: Adding index for Page-level Skipping hi Arun, We can keep the discussion going on here and on GitHub when you have a pull request to discuss. There are a number of different people who can give advice. Thanks
Re: Arrow 1404: Adding index for Page-level Skipping
hi Arun, We can keep the discussion going on here and on GitHub when you have a pull request to discuss. There are a number of different people who can give advice. Thanks On Tue, Feb 4, 2020 at 10:11 PM Lekshmi Narayanan, Arun Balajiee wrote: > > Actually I made some changes after the date on the pull request ( even in > this year), which are not getting reflected on this compare link > > Regards, > Arun Balajiee
RE: Arrow 1404: Adding index for Page-level Skipping
Actually I made some changes after the date on the pull request ( even in this year), which are not getting reflected on this compare link Regards, Arun Balajiee From: Wes McKinney<mailto:wesmck...@gmail.com> Sent: Tuesday, February 4, 2020 6:43 PM To: Parquet Dev<mailto:dev@parquet.apache.org> Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli Shein<mailto:sh...@microfocus.com> Subject: Re: Arrow 1404: Adding index for Page-level Skipping Here's a compare link in case others want to have a look https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp
Re: Arrow 1404: Adding index for Page-level Skipping
hi Arun, I took a brief look at your branch. One thing that is missing is the proposed public APIs that use the index pages -- that would be very helpful for this discussion. I don't think we have any code for doing random access of a particular data page in a column chunk, so having that as an initial matter would also be helpful. - Wes On Tue, Feb 4, 2020 at 2:28 PM Lekshmi Narayanan, Arun Balajiee wrote: > > Hi Parquet dev > > Deepak Majeti was my dev lead during my summer internship, from when I am > trying to add a few changes in the Arrow Parquet Project for the ticket below > > https://issues.apache.org/jira/browse/PARQUET-1404 (Assigned to Deepak) > > With this regard, I am making a few changes to src/parquet/file_reader.cc ( > in a fork on my repository) > > https://github.com/a2un/arrow/tree/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp > > I am stuck at trying to read a particular row using the index that I get in > the page_location array struct of offset index. Could you help me with this ? > and if there have been discussions on the forums for this as well, could you > direct me to that link? > > Regards, > Arun Balajiee >
Re: Arrow 1404: Adding index for Page-level Skipping
Here's a compare link in case others want to have a look https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp On Tue, Feb 4, 2020 at 5:41 PM Wes McKinney wrote: > > hi Arun, > > I took a brief look at your branch. One thing that is missing is the > proposed public APIs that use the index pages -- that would be very > helpful for this discussion. > > I don't think we have any code for doing random access of a particular > data page in a column chunk, so having as an initial matter would also > be helpful. > > - Wes > > On Tue, Feb 4, 2020 at 2:28 PM Lekshmi Narayanan, Arun Balajiee > wrote: > > > > Hi Parquet dev > > > > Deepak Majeti was my dev lead during my summer internship, from when I am > > trying to add a few changes in the Arrow Parquet Project for the ticket > > below > > > > https://issues.apache.org/jira/browse/PARQUET-1404 (Assigned to Deepak) > > > > With this regard, I am making a few changes to src/parquet/file_reader.cc ( > > in a fork on my repository) > > > > https://github.com/a2un/arrow/tree/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp > > > > I am stuck at trying to read a particular row using the index that I get in > > the page_location array struct of offset index. Could you help me with this > > ? and if there have been discussions on the forums for this as well, could > > you direct me to that link? > > > > Regards, > > Arun Balajiee > >
Arrow 1404: Adding index for Page-level Skipping
Hi Parquet dev, Deepak Majeti was my dev lead during my summer internship, and since then I have been trying to make a few changes in the Arrow Parquet project for the ticket below: https://issues.apache.org/jira/browse/PARQUET-1404 (Assigned to Deepak) In this regard, I am making a few changes to src/parquet/file_reader.cc (in a fork in my repository): https://github.com/a2un/arrow/tree/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp I am stuck trying to read a particular row using the index that I get in the page_location array struct of the offset index. Could you help me with this? If there have been discussions on the forums about this as well, could you direct me to that link? Regards, Arun Balajiee
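For readers following the question above, here is a minimal sketch of one way to map a row number to a page using the offset index. It is not code from the branch or from parquet-cpp. The PageLocation struct mirrors the three fields defined for the offset index in the Parquet format (offset, compressed_page_size, first_row_index), but the struct itself and the helpers FindPageForRow and RowsToSkipInPage are hypothetical names introduced only for illustration. The sketch assumes the page locations are ordered by first_row_index and that row numbers are relative to the start of the row group.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative stand-in for one entry of the offset index's page_locations.
struct PageLocation {
  int64_t offset;                // file offset of the page
  int32_t compressed_page_size;  // page size in bytes, including its header
  int64_t first_row_index;       // row-group-relative index of the page's first row
};

// Hypothetical helper: index of the page that contains `row`, or -1 if the
// list is empty or `row` precedes the first page. The offset index alone does
// not give the row count of the last page, so the caller is assumed to have
// checked `row` against the row group's total number of rows.
int64_t FindPageForRow(const std::vector<PageLocation>& pages, int64_t row) {
  if (pages.empty() || row < pages.front().first_row_index) return -1;
  // Debug-only check of the ordering assumption.
  assert(std::is_sorted(pages.begin(), pages.end(),
                        [](const PageLocation& a, const PageLocation& b) {
                          return a.first_row_index < b.first_row_index;
                        }));
  // First page whose first row lies strictly after `row`; the page just
  // before it is the one that holds `row`.
  auto it = std::upper_bound(
      pages.begin(), pages.end(), row,
      [](int64_t r, const PageLocation& p) { return r < p.first_row_index; });
  return static_cast<int64_t>(it - pages.begin()) - 1;
}

// Hypothetical helper: number of rows to skip inside the located page
// before reaching `row`.
int64_t RowsToSkipInPage(const PageLocation& page, int64_t row) {
  return row - page.first_row_index;
}
```

With the page index in hand, a reader could seek to pages[i].offset, decode that single page, and skip RowsToSkipInPage(pages[i], row) rows within it. How that seek would be wired into the existing page reader is left open here, since the thread notes that no random-access code for data pages exists yet.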