[Impala-ASF-CR] Populate OffsetIndex and ColumnIndex of a row group and Filter pages

2017-07-26 Thread Pooja Nilangekar (Code Review)
Pooja Nilangekar has posted comments on this change.

Change subject: Populate OffsetIndex and ColumnIndex of a row_group and Filter 
pages
..


Patch Set 3:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/7465/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 537:   
RETURN_IF_ERROR(scan_node_->runtime_state()->io_mgr()->AddScanRange(
> A comment might help to understand why you abort altogether here. Why is it
Done


http://gerrit.cloudera.org:8080/#/c/7465/2/be/src/exec/parquet-index-filter.cc
File be/src/exec/parquet-index-filter.cc:

PS2, Line 203: 
 : while (it != filtered_ranges.end()) {
> These aren't needed anymore now, are they?
Done


http://gerrit.cloudera.org:8080/#/c/7465/2/be/src/exec/parquet-index-filter.h
File be/src/exec/parquet-index-filter.h:

PS2, Line 50: FilteredPageI
> FilteredPageInfos may be more clear.
Done


http://gerrit.cloudera.org:8080/#/c/7465/2/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 747:   if valid:
> No need to change it now, but this line has now ~10 levels of indent. :)
Done. Changed the previous_valid check to continue if false.


-- 
To view, visit http://gerrit.cloudera.org:8080/7465
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Idace1e57067f95973cef3567eeb84f2ad87fd3f6
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Pooja Nilangekar 
Gerrit-HasComments: Yes


[Impala-ASF-CR] Populate OffsetIndex and ColumnIndex of a row group and Filter pages

2017-07-26 Thread Pooja Nilangekar (Code Review)
Pooja Nilangekar has uploaded a new patch set (#3).

Change subject: Populate OffsetIndex and ColumnIndex of a row_group and Filter 
pages
..

Populate OffsetIndex and ColumnIndex of a row_group and Filter pages

The statistics for each page in a ColumnChunk of a RowGroup are
added to the ColumnIndex structure. When a page is flushed to the
file, its location and offset of the first row is added to the
PageLocation structure of the Offset index. If a file is found to
have only one row_group when it is Finalized, ColumnIndex for each
Column is written to the file (just before footer) and its length
and offset is populated in the ColumnChunk. The OffsetIndexes of all
the columns  in the row_group are written to the RowGroupOffsetIndex
structure and written out to the file. The offset and length of the
index is written out to the RowGroup. This ensures that the rage
scans and point look ups can skip pages based on these statistics
while at the same time scans without selective predicates do not
incur any overhead. Space efficiency is ensured by not populating
parquet::Statistics in the ColumnMeta when the statistics are written
to the ColumnIndex. Additionally, for ordered columns, the
ColumnIndex only contains the min_values.

While scanning a RowGroup, the HdfsParquetScanner invokes the
ParquetIndexFilter for the RowGroups where the indexes are present.
The filter evaluates each conjunct against each page of the
corresponding column. It consolidates the RowRanges for the given
RowGroup and returns the final set of pages to be scanned for each
column.

Testing: The populated index structures were deserialized from the
parquet file and the validity of the offsets and statistics were
verified.

The filtered index ranges were verified manually by ensuring that
the filtered ranges would always evaluate the min/max conjuncts to
true.

Change-Id: Idace1e57067f95973cef3567eeb84f2ad87fd3f6
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
A be/src/exec/parquet-index-filter.cc
A be/src/exec/parquet-index-filter.h
M bin/impala-config.sh
M common/thrift/parquet.thrift
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_index.py
12 files changed, 573 insertions(+), 54 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/7465/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7465
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Idace1e57067f95973cef3567eeb84f2ad87fd3f6
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Pooja Nilangekar 


[Impala-ASF-CR] Populate OffsetIndex and ColumnIndex of a row group and Filter pages

2017-07-26 Thread Pooja Nilangekar (Code Review)
Pooja Nilangekar has uploaded a new patch set (#3).

Change subject: Populate OffsetIndex and ColumnIndex of a row_group and Filter 
pages
..

Populate OffsetIndex and ColumnIndex of a row_group and Filter pages

The statistics for each page in a ColumnChunk of a RowGroup are
added to the ColumnIndex structure. When a page is flushed to the
file, its location and offset of the first row is added to the
PageLocation structure of the Offset index. If a file is found to
have only one row_group when it is Finalized, ColumnIndex for each
Column is written to the file (just before footer) and its length
and offset is populated in the ColumnChunk. The OffsetIndexes of all
the columns  in the row_group are written to the RowGroupOffsetIndex
structure and written out to the file. The offset and length of the
index is written out to the RowGroup. This ensures that the rage
scans and point look ups can skip pages based on these statistics
while at the same time scans without selective predicates do not
incur any overhead. Space efficiency is ensured by not populating
parquet::Statistics in the ColumnMeta when the statistics are written
to the ColumnIndex. Additionally, for ordered columns, the
ColumnIndex only contains the min_values.

While scanning a RowGroup, the HdfsParquetScanner invokes the
ParquetIndexFilter for the RowGroups where the indexes are present.
The filter evaluates each conjunct against each page of the
corresponding column. It consolidates the RowRanges for the given
RowGroup and returns the final set of pages to be scanned for each
column.

Testing: The populated index structures were deserialized from the
parquet file and the validity of the offsets and statistics were
verified.

The filtered index ranges were verified manually by ensuring that
the filtered ranges would always evaluate the min/max conjuncts to
true.

Change-Id: Idace1e57067f95973cef3567eeb84f2ad87fd3f6
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
A be/src/exec/parquet-index-filter.cc
A be/src/exec/parquet-index-filter.h
M bin/impala-config.sh
M common/thrift/parquet.thrift
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_index.py
12 files changed, 573 insertions(+), 54 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/7465/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7465
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Idace1e57067f95973cef3567eeb84f2ad87fd3f6
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Pooja Nilangekar 


[Impala-ASF-CR] Populate OffsetIndex and ColumnIndex of a row group and Filter pages

2017-07-25 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: Populate OffsetIndex and ColumnIndex of a row_group and Filter 
pages
..


Patch Set 2:

(4 comments)

Thank you for addressing all the comments. :)

http://gerrit.cloudera.org:8080/#/c/7465/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 537: return Status::OK();
A comment might help to understand why you abort altogether here. Why is it not 
necessary to cancel the scan ranges?


http://gerrit.cloudera.org:8080/#/c/7465/2/be/src/exec/parquet-index-filter.cc
File be/src/exec/parquet-index-filter.cc:

PS2, Line 203:page.page_id = page_idx;
 : page.first_row_index = first_row_index;
These aren't needed anymore now, are they?


http://gerrit.cloudera.org:8080/#/c/7465/2/be/src/exec/parquet-index-filter.h
File be/src/exec/parquet-index-filter.h:

PS2, Line 50: FilteredPages
FilteredPageInfos may be more clear.


http://gerrit.cloudera.org:8080/#/c/7465/2/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 747:   assert decode_stats_value(schema, value) <= 
max_value
No need to change it now, but this line has now ~10 levels of indent. :)


-- 
To view, visit http://gerrit.cloudera.org:8080/7465
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Idace1e57067f95973cef3567eeb84f2ad87fd3f6
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Pooja Nilangekar 
Gerrit-HasComments: Yes


[Impala-ASF-CR] Populate OffsetIndex and ColumnIndex of a row group and Filter pages

2017-07-24 Thread Pooja Nilangekar (Code Review)
Pooja Nilangekar has posted comments on this change.

Change subject: Populate OffsetIndex and ColumnIndex of a row_group and Filter 
pages
..


Patch Set 2:

(20 comments)

http://gerrit.cloudera.org:8080/#/c/7465/1/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 617: }
> Have you considered moving this into its own method?
Done


http://gerrit.cloudera.org:8080/#/c/7465/1/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 1205:   !current_row_group_->sorting_columns.empty();
> Couldn't it happen that columns not in the sort_columns() list are sorted?
Done


Line 1218: std::vector 
offset_indexes(columns_.size());
> I think it may be easier to track the previous value and then decide in Pro
Done


PS1, Line 1217: rquet::RowGroup* row_group = &(file_metadata_.row_groups[0]);
  : std::vector 
offset_indexes(columns_.size());
  : uint8_t* buffer = nullptr;
  : uint32_t len = 0;
  : for (int i = 0; i < columns_.size(); i++) {
  : 
  :   parquet::OffsetIndex offset_index;
  :   offset_index.__set_page_loca
> It's unclear to me how this works. Does it go through the min_values and ma
Done


http://gerrit.cloudera.org:8080/#/c/7465/1/be/src/exec/parquet-index-filter.cc
File be/src/exec/parquet-index-filter.cc:

Line 39:   if (!min_max_tuple_desc_) return Status::OK();
> Can you add a comment for this line? It's unclear to me why it's necessary.
Done


PS1, Line 45: ges 
> This could have a better name, e.g. skip_ranges or ranges_to_skip
Done


PS1, Line 45: o
> nit: you can remove the (0) here and in all other vector ctors. It's the de
Done


Line 46:   vector skip_ranges;
> move this to where it's used, think of a better name
Done


Line 150: std::sort(skip_ranges.begin(), skip_ranges.end(), 
std::less());
> Describe what this variable stores. Maybe also explain in a high level that
Done


PS1, Line 152: is block computes the unio
> You could also use 'for (auto& row_range : skip) ...'
Done


PS1, Line 158:  If the skip begins before
> Is this check really needed?
Yes, it is necessary when the current skip range ends before the previous skip 
range ends. In that case, the end of the skip_range should not be updated to 
ensure correctness.


PS1, Line 178: 
> nit: use pre-increment
Done


Line 181: 
> Add a comment explaining why you need to do this. The vectors should be of 
Done


Line 196: } else
> resize() should not be needed. You could also use http://en.cppreference.co
Done


Line 199: FilteredPageInfo page{.page_id = page_idx,
> One variable per line makes it more readable.
Done


PS1, Line 215:   }
> add {}, here and below. This helps prevent mistakes when changing the code 
Done


Line 224: page.ranges.push_back(std::make_pair(start_offset, 
end_offset));
> Statements that fit into a single line are ok.
Done


http://gerrit.cloudera.org:8080/#/c/7465/1/be/src/exec/parquet-index-filter.h
File be/src/exec/parquet-index-filter.h:

Line 37: /// Pair of int64_t to represent the start and end of a range of rows. 
Each boundary is
> Mention whether each boundary is inclusive. In my experience it's most comm
Done


Line 45:   int64_t offset;
> Maybe "PageReadInfo" or something similar makes it more clear that this str
Done


Line 50: typedef vector FilteredPages;
> Here you could add a comment explaining the overall usage of this class: Wh
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7465
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Idace1e57067f95973cef3567eeb84f2ad87fd3f6
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Pooja Nilangekar 
Gerrit-HasComments: Yes


[Impala-ASF-CR] Populate OffsetIndex and ColumnIndex of a row group and Filter pages

2017-07-24 Thread Pooja Nilangekar (Code Review)
Pooja Nilangekar has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7465

Change subject: Populate OffsetIndex and ColumnIndex of a row_group and Filter 
pages
..

Populate OffsetIndex and ColumnIndex of a row_group and Filter pages

The statistics for each page in a ColumnChunk of a RowGroup are
added to the ColumnIndex structure. When a page is flushed to the
file, its location and offset of the first row is added to the
PageLocation structure of the Offset index. If a file is found to
have only one row_group when it is Finalized, ColumnIndex for each
Column is written to the file (just before footer) and its length
and offset is populated in the ColumnChunk. The OffsetIndexes of all
the columns  in the row_group are written to the RowGroupOffsetIndex
structure and written out to the file. The offset and length of the
index is written out to the RowGroup. This ensures that the rage
scans and point look ups can skip pages based on these statistics
while at the same time scans without selective predicates do not
incur any overhead. Space efficiency is ensured by not populating
parquet::Statistics in the ColumnMeta when the statistics are written
to the ColumnIndex. Additionally, for ordered columns, the
ColumnIndex only contains the min_values.

While scanning a RowGroup, the HdfsParquetScanner invokes the
ParquetIndexFilter for the RowGroups where the indexes are present.
The filter evaluates each conjunct against each page of the
corresponding column. It consolidates the RowRanges for the given
RowGroup and returns the final set of pages to be scanned for each
column.

Testing: The populated index structures were deserialized from the
parquet file and the validity of the offsets and statistics were
verified.

The filtered index ranges were verified manually by ensuring that
the filtered ranges would always evaluate the min/max conjuncts to
true.

Change-Id: Idace1e57067f95973cef3567eeb84f2ad87fd3f6
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
A be/src/exec/parquet-index-filter.cc
A be/src/exec/parquet-index-filter.h
M bin/impala-config.sh
M common/thrift/parquet.thrift
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_index.py
12 files changed, 589 insertions(+), 69 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/7465/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7465
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Idace1e57067f95973cef3567eeb84f2ad87fd3f6
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar 
Gerrit-Reviewer: Lars Volker