[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-31 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 18: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 18
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 01 Nov 2017 04:26:43 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-31 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..

IMPALA-5307: Part 2: copy out strings in uncompressed Avro

The approach is to re-materialize strings in those tuples that
survive conjunct evaluation and may reference disk I/O buffers
directly. This means that perf should not regress for the
following cases:
* Compressed Avro files.
* Non-string columns.
* Selective scans where the majority of tuples are filtered out.

This approach will also work for the Sequence and Text scanners.

Includes some improvements to Avro codegen to replace more constants to
help win back some performance (with limited success): replaced
InitTuple() with an optimised version and substituted
tuple_byte_size() with a constant.

Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed
length.

Perf:
Did microbenchmarks on uncompressed Avro files, one with all columns
from lineitem and one with only l_comment. Tests were run with:
  set num_scanner_threads=1;

I ran the query 5 times and extracted MaterializeTupleTime from the
profile to measure CPU cost of materialization. Overall string
materialization got significantly slower, mainly because of the
extra memcpy() calls required.

Selecting one string from a table with multiple columns:

  select min(l_comment) from biglineitem_avro
  1.814 -> 2.096

Selecting one string from a table with one column:

  select min(l_comment) from biglineitem_comment; profile;
  1.708 -> 3.7

Selecting one string from a table with one column with predicate:

  select min(l_comment) from biglineitem_comment where length(l_comment) > 
1;
  1.691 -> 1.449

Selecting all columns:

  select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber),
min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax),
min(l_returnflag), min(l_linestatus), min(l_shipdate),
min(l_commitdate), min(l_receiptdate), min(l_shipinstruct),
min(l_shipmode), min(l_comment) from biglineitem_avro; profile;
  2.335 -> 3.711

Selecting an int column (no strings):
  select min(l_linenumber) from biglineitem_avro
  1.806 -> 1.819

Testing:
Ran exhaustive tests.

Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Reviewed-on: http://gerrit.cloudera.org:8080/8146
Reviewed-by: Tim Armstrong 
Tested-by: Impala Public Jenkins
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/codegen/llvm-codegen.cc
M be/src/codegen/llvm-codegen.h
M be/src/common/status.cc
M be/src/common/status.h
M be/src/exec/hdfs-avro-scanner-ir.cc
M be/src/exec/hdfs-avro-scanner.cc
M be/src/exec/hdfs-avro-scanner.h
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/scan-node.h
M be/src/runtime/CMakeLists.txt
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
A be/src/runtime/tuple-ir.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M be/src/util/avro-util.h
22 files changed, 401 insertions(+), 114 deletions(-)

Approvals:
  Tim Armstrong: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 19
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-31 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 18: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 18
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 01 Nov 2017 00:37:34 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-31 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 17: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/1421/


--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 17
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 01 Nov 2017 00:37:22 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-31 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 18:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1422/


--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 18
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 01 Nov 2017 00:37:42 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-31 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 17:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1421/


--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 17
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 01 Nov 2017 00:37:05 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-30 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 16:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8146/16/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8146/16/be/src/exec/hdfs-scanner.h@405
PS16, Line 405: runtim
> runtime
Done



--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 16
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 31 Oct 2017 04:22:51 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-30 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 17: Code-Review+2

carry


--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 17
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 31 Oct 2017 04:22:59 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-30 Thread Tim Armstrong (Code Review)
Hello Michael Ho, Thomas Tauber-Marshall, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8146

to look at the new patch set (#17).

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..

IMPALA-5307: Part 2: copy out strings in uncompressed Avro

The approach is to re-materialize strings in those tuples that
survive conjunct evaluation and may reference disk I/O buffers
directly. This means that perf should not regress for the
following cases:
* Compressed Avro files.
* Non-string columns.
* Selective scans where the majority of tuples are filtered out.

This approach will also work for the Sequence and Text scanners.

Includes some improvements to Avro codegen to replace more constants to
help win back some performance (with limited success): replaced
InitTuple() with an optimised version and substituted
tuple_byte_size() with a constant.

Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed
length.

Perf:
Did microbenchmarks on uncompressed Avro files, one with all columns
from lineitem and one with only l_comment. Tests were run with:
  set num_scanner_threads=1;

I ran the query 5 times and extracted MaterializeTupleTime from the
profile to measure CPU cost of materialization. Overall string
materialization got significantly slower, mainly because of the
extra memcpy() calls required.

Selecting one string from a table with multiple columns:

  select min(l_comment) from biglineitem_avro
  1.814 -> 2.096

Selecting one string from a table with one column:

  select min(l_comment) from biglineitem_comment; profile;
  1.708 -> 3.7

Selecting one string from a table with one column with predicate:

  select min(l_comment) from biglineitem_comment where length(l_comment) > 
1;
  1.691 -> 1.449

Selecting all columns:

  select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber),
min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax),
min(l_returnflag), min(l_linestatus), min(l_shipdate),
min(l_commitdate), min(l_receiptdate), min(l_shipinstruct),
min(l_shipmode), min(l_comment) from biglineitem_avro; profile;
  2.335 -> 3.711

Selecting an int column (no strings):
  select min(l_linenumber) from biglineitem_avro
  1.806 -> 1.819

Testing:
Ran exhaustive tests.

Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/codegen/llvm-codegen.cc
M be/src/codegen/llvm-codegen.h
M be/src/common/status.cc
M be/src/common/status.h
M be/src/exec/hdfs-avro-scanner-ir.cc
M be/src/exec/hdfs-avro-scanner.cc
M be/src/exec/hdfs-avro-scanner.h
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/scan-node.h
M be/src/runtime/CMakeLists.txt
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
A be/src/runtime/tuple-ir.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M be/src/util/avro-util.h
22 files changed, 401 insertions(+), 114 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/8146/17
--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 17
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-30 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 16: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8146/16/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8146/16/be/src/exec/hdfs-scanner.h@405
PS16, Line 405: runtim
runtime



--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 16
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 31 Oct 2017 02:00:24 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-30 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 16: Code-Review+1

carry +1


--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 16
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 30 Oct 2017 22:03:34 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-30 Thread Tim Armstrong (Code Review)
Hello Michael Ho, Thomas Tauber-Marshall, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8146

to look at the new patch set (#16).

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..

IMPALA-5307: Part 2: copy out strings in uncompressed Avro

The approach is to re-materialize strings in those tuples that
survive conjunct evaluation and may reference disk I/O buffers
directly. This means that perf should not regress for the
following cases:
* Compressed Avro files.
* Non-string columns.
* Selective scans where the majority of tuples are filtered out.

This approach will also work for the Sequence and Text scanners.

Includes some improvements to Avro codegen to replace more constants to
help win back some performance (with limited success): replaced
InitTuple() with an optimised version and substituted
tuple_byte_size() with a constant.

Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed
length.

Perf:
Did microbenchmarks on uncompressed Avro files, one with all columns
from lineitem and one with only l_comment. Tests were run with:
  set num_scanner_threads=1;

I ran the query 5 times and extracted MaterializeTupleTime from the
profile to measure CPU cost of materialization. Overall string
materialization got significantly slower, mainly because of the
extra memcpy() calls required.

Selecting one string from a table with multiple columns:

  select min(l_comment) from biglineitem_avro
  1.814 -> 2.096

Selecting one string from a table with one column:

  select min(l_comment) from biglineitem_comment; profile;
  1.708 -> 3.7

Selecting one string from a table with one column with predicate:

  select min(l_comment) from biglineitem_comment where length(l_comment) > 
1;
  1.691 -> 1.449

Selecting all columns:

  select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber),
min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax),
min(l_returnflag), min(l_linestatus), min(l_shipdate),
min(l_commitdate), min(l_receiptdate), min(l_shipinstruct),
min(l_shipmode), min(l_comment) from biglineitem_avro; profile;
  2.335 -> 3.711

Selecting an int column (no strings):
  select min(l_linenumber) from biglineitem_avro
  1.806 -> 1.819

Testing:
Ran exhaustive tests.

Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/codegen/llvm-codegen.cc
M be/src/codegen/llvm-codegen.h
M be/src/common/status.cc
M be/src/common/status.h
M be/src/exec/hdfs-avro-scanner-ir.cc
M be/src/exec/hdfs-avro-scanner.cc
M be/src/exec/hdfs-avro-scanner.h
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/scan-node.h
M be/src/runtime/CMakeLists.txt
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
A be/src/runtime/tuple-ir.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M be/src/util/avro-util.h
22 files changed, 401 insertions(+), 114 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/8146/16
--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 16
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-30 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 15:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner-ir.cc
File be/src/exec/hdfs-avro-scanner-ir.cc:

http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner-ir.cc@a213
PS15, Line 213:
> If I understand it correctly, this if branch was dead code before this chan
Yeah I missed cleaning it up in an earlier commit. I mention this in my 
admittedly gigantic commit message.


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner-ir.cc@51
PS15, Line 51: !tuple->CopyStrings("HdfsAvroScanner::DecodeAvroData()",
 :   state_, string_slot_offsets_.data(), 
string_slot_offsets_.size(), pool,
 :   _status_))
> nit: tuple->CopyStrings(...) == nullptr
It returns a bool though, since it doesn't reallocate the tuple itself.


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner.h
File be/src/exec/hdfs-avro-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner.h@133
PS15, Line 133: //
> nit: /// to be consistent.
Done


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner.cc
File be/src/exec/hdfs-avro-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner.cc@1066
PS15, Line 1066: HdfsScanNodeBase* node
> Should this be const HdfsScanNodeBase* ?
Done. this required propagating the const qualifier a few more places.


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-scanner.h@406
PS15, Line 406:   /// Codegen function to replace InitTuple(). The codegen'd 
version of InitTuple() is
  :   /// stored in 'init_tuple_fn' if codegen was successful.
> May help to also state the codegen'd version of the function has some const
Done


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-scanner.cc@535
PS15, Line 535:
> nit: indent 4
Done


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/descriptors.h@93
PS15, Line 93: if
> is
Done


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/descriptors.h@109
PS15, Line 109: llvm::Constant* ToIR(LlvmCodeGen* codegen) const;
> Comment: This needs to be updated should the layout of this struct change.
Done


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple-ir.cc
File be/src/runtime/tuple-ir.cc:

http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple-ir.cc@28
PS15, Line 28: for (int i = 0; i < num_string_slots; ++i) {
> Not sure if it will help but have you tried #pragma unroll hint here to see
I tried a couple of queries but didn't see a noticeable difference in perf.


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.h
File be/src/runtime/tuple.h:

http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.h@47
PS15, Line 47: /// Generate an LLVM Constant containing the offset values of 
this SlotOffsets instance.
> Please also comment that this function needs to be updated if the layout of
Done


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.h@204
PS15, Line 204: //
> nit: ///
Done


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.cc
File be/src/runtime/tuple.cc:

http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.cc@404
PS15, Line 404: materialize_strings_fn
> nit: Using the name copy_strings_fn will be more consistent.
Thanks for catching this, I missed this one place.


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.cc@435
PS15, Line 435: Constant*
> Not your change but I feel it's generally less confusing to include llvm::
I removed the "using namespace llvm" in this file and added llvm:: to the 
appropriate places.


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.cc@435
PS15, Line 435: slot_offset_constants
> nit: 'slot_offset_ir_constants' may make it easier to follow.
Done


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.cc@436
PS15, Line 436:   for (SlotDescriptor* slot_desc : desc.string_slots()) {
  : SlotOffsets offsets = {slot_desc->null_indicator_offset(), 
slot_desc->tuple_offset()};
  : slot_offset_constants.push_back(offsets.ToIR(codegen));
  :   }
  :
  :   Constant* constant_slot_offsets = 
codegen->ConstantsToGVArrayPtr(
  :   slot_offsets_type, slot_offset_constants, "slot_offsets");
  :   

[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-30 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 15:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner-ir.cc
File be/src/exec/hdfs-avro-scanner-ir.cc:

http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner-ir.cc@51
PS15, Line 51: !tuple->CopyStrings("HdfsAvroScanner::DecodeAvroData()",
 :   state_, string_slot_offsets_.data(), 
string_slot_offsets_.size(), pool,
 :   _status_))
nit: tuple->CopyStrings(...) == nullptr


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner.cc
File be/src/exec/hdfs-avro-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner.cc@1066
PS15, Line 1066: HdfsScanNodeBase* node
Should this be const HdfsScanNodeBase* ?


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple-ir.cc
File be/src/runtime/tuple-ir.cc:

http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple-ir.cc@28
PS15, Line 28: for (int i = 0; i < num_string_slots; ++i) {
Not sure if it will help but have you tried #pragma unroll hint here to see if 
it makes a difference.

https://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll



--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 15
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 30 Oct 2017 20:44:13 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-30 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 15:

(12 comments)

Sorry for the delay. Some more comments. Close to completion.

http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner-ir.cc
File be/src/exec/hdfs-avro-scanner-ir.cc:

http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner-ir.cc@a213
PS15, Line 213:
If I understand it correctly, this if branch was dead code before this change, 
right ?


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner.h
File be/src/exec/hdfs-avro-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner.h@133
PS15, Line 133: //
nit: /// to be consistent.


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-scanner.h@406
PS15, Line 406:   /// Codegen function to replace InitTuple(). The codegen'd 
version of InitTuple() is
  :   /// stored in 'init_tuple_fn' if codegen was successful.
May help to also state the codegen'd version of the function has some constants 
replaced.


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-scanner.cc@535
PS15, Line 535:
nit: indent 4


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/descriptors.h@93
PS15, Line 93: if
is


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/descriptors.h@109
PS15, Line 109: llvm::Constant* ToIR(LlvmCodeGen* codegen) const;
Comment: This needs to be updated should the layout of this struct change.


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.h
File be/src/runtime/tuple.h:

http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.h@47
PS15, Line 47: /// Generate an LLVM Constant containing the offset values of 
this SlotOffsets instance.
Please also comment that this function needs to be updated if the layout of 
this struct is changed.


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.h@204
PS15, Line 204: //
nit: ///

May help to also add that this is replacing some of the arguments passed to 
CopyString() in an attempt to trigger loop unrolling and constant substitution 
by LLVM.


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.cc
File be/src/runtime/tuple.cc:

http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.cc@404
PS15, Line 404: materialize_strings_fn
nit: Using the name copy_strings_fn will be more consistent.


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.cc@435
PS15, Line 435: slot_offset_constants
nit: 'slot_offset_ir_constants' may make it easier to follow.


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.cc@435
PS15, Line 435: Constant*
Not your change but I feel it's generally less confusing to include llvm:: 
namespace for class not defined by codegen logic. Don't need to change it in 
this change. Someone can do it later in a clean-up patch.


http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.cc@436
PS15, Line 436:   for (SlotDescriptor* slot_desc : desc.string_slots()) {
  : SlotOffsets offsets = {slot_desc->null_indicator_offset(), 
slot_desc->tuple_offset()};
  : slot_offset_constants.push_back(offsets.ToIR(codegen));
  :   }
  :
  :   Constant* constant_slot_offsets = 
codegen->ConstantsToGVArrayPtr(
  :   slot_offsets_type, slot_offset_constants, "slot_offsets");
  :   Constant* num_string_slots =
  :   ConstantInt::get(codegen->int_type(), 
desc.string_slots().size());
I think it may be helpful to add a comment on what line 435 - 444 is trying to 
achieve. In essence, it's converting all string slots' offsets into an array of 
IR constants which is a global variable named "slot_offsets".



--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 15
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 30 Oct 2017 19:33:56 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-30 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 15: Code-Review+1

rebase, carry +1


--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 15
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 30 Oct 2017 17:00:53 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-30 Thread Tim Armstrong (Code Review)
Hello Michael Ho, Thomas Tauber-Marshall, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8146

to look at the new patch set (#15).

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..

IMPALA-5307: Part 2: copy out strings in uncompressed Avro

The approach is to re-materialize strings in those tuples that
survive conjunct evaluation and may reference disk I/O buffers
directly. This means that perf should not regress for the
following cases:
* Compressed Avro files.
* Non-string columns.
* Selective scans where the majority of tuples are filtered out.

This approach will also work for the Sequence and Text scanners.

Includes some improvements to Avro codegen to replace more constants to
help win back some performance (with limited success): replaced
InitTuple() with an optimised version and substituted
tuple_byte_size() with a constant.

Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed
length.

Perf:
Did microbenchmarks on uncompressed Avro files, one with all columns
from lineitem and one with only l_comment. Tests were run with:
  set num_scanner_threads=1;

I ran the query 5 times and extracted MaterializeTupleTime from the
profile to measure CPU cost of materialization. Overall string
materialization got significantly slower, mainly because of the
extra memcpy() calls required.

Selecting one string from a table with multiple columns:

  select min(l_comment) from biglineitem_avro
  1.814 -> 2.096

Selecting one string from a table with one column:

  select min(l_comment) from biglineitem_comment; profile;
  1.708 -> 3.7

Selecting one string from a table with one column with predicate:

  select min(l_comment) from biglineitem_comment where length(l_comment) > 
1;
  1.691 -> 1.449

Selecting all columns:

  select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber),
min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax),
min(l_returnflag), min(l_linestatus), min(l_shipdate),
min(l_commitdate), min(l_receiptdate), min(l_shipinstruct),
min(l_shipmode), min(l_comment) from biglineitem_avro; profile;
  2.335 -> 3.711

Selecting an int column (no strings):
  select min(l_linenumber) from biglineitem_avro
  1.806 -> 1.819

Testing:
Ran exhaustive tests.

Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/codegen/llvm-codegen.cc
M be/src/codegen/llvm-codegen.h
M be/src/common/status.cc
M be/src/common/status.h
M be/src/exec/hdfs-avro-scanner-ir.cc
M be/src/exec/hdfs-avro-scanner.cc
M be/src/exec/hdfs-avro-scanner.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/runtime/CMakeLists.txt
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
A be/src/runtime/tuple-ir.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
19 files changed, 337 insertions(+), 57 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/8146/15
--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 15
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-27 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 14: Code-Review+1

Thanks. This approach looks okay to me, please see if Michael can do the +2 
review.


--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 14
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 27 Oct 2017 20:33:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-27 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 14: Code-Review+1


--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 14
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 27 Oct 2017 17:33:56 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-26 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8146/13/be/src/runtime/tuple.h
File be/src/runtime/tuple.h:

http://gerrit.cloudera.org:8080/#/c/8146/13/be/src/runtime/tuple.h@175
PS13, Line 175:   /// Materialize the var-len string data in this tuple into 
the provided memory
  :   /// pool.
> the name and the comment didn't make it clear that this also rewrites this
Renamed to CopyStrings(). Agree it's more accurate.



--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 13
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 27 Oct 2017 05:22:44 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-26 Thread Tim Armstrong (Code Review)
Hello Michael Ho, Thomas Tauber-Marshall, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8146

to look at the new patch set (#14).

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..

IMPALA-5307: Part 2: copy out strings in uncompressed Avro

The approach is to re-materialize strings in those tuples that
survive conjunct evaluation and may reference disk I/O buffers
directly. This means that perf should not regress for the
following cases:
* Compressed Avro files.
* Non-string columns.
* Selective scans where the majority of tuples are filtered out.

This approach will also work for the Sequence and Text scanners.

Includes some improvements to Avro codegen to replace more constants to
help win back some performance (with limited success): replaced
InitTuple() with an optimised version and substituted
tuple_byte_size() with a constant.

Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed
length.

Perf:
Did microbenchmarks on uncompressed Avro files, one with all columns
from lineitem and one with only l_comment. Tests were run with:
  set num_scanner_threads=1;

I ran the query 5 times and extracted MaterializeTupleTime from the
profile to measure CPU cost of materialization. Overall string
materialization got significantly slower, mainly because of the
extra memcpy() calls required.

Selecting one string from a table with multiple columns:

  select min(l_comment) from biglineitem_avro
  1.814 -> 2.096

Selecting one string from a table with one column:

  select min(l_comment) from biglineitem_comment; profile;
  1.708 -> 3.7

Selecting one string from a table with one column with predicate:

  select min(l_comment) from biglineitem_comment where length(l_comment) > 
1;
  1.691 -> 1.449

Selecting all columns:

  select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber),
min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax),
min(l_returnflag), min(l_linestatus), min(l_shipdate),
min(l_commitdate), min(l_receiptdate), min(l_shipinstruct),
min(l_shipmode), min(l_comment) from biglineitem_avro; profile;
  2.335 -> 3.711

Selecting an int column (no strings):
  select min(l_linenumber) from biglineitem_avro
  1.806 -> 1.819

Testing:
Ran exhaustive tests.

Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/codegen/llvm-codegen.cc
M be/src/codegen/llvm-codegen.h
M be/src/common/status.cc
M be/src/common/status.h
M be/src/exec/hdfs-avro-scanner-ir.cc
M be/src/exec/hdfs-avro-scanner.cc
M be/src/exec/hdfs-avro-scanner.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/runtime/CMakeLists.txt
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
A be/src/runtime/tuple-ir.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
19 files changed, 337 insertions(+), 57 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/8146/14
--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 14
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-26 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8146/13/be/src/runtime/tuple.h
File be/src/runtime/tuple.h:

http://gerrit.cloudera.org:8080/#/c/8146/13/be/src/runtime/tuple.h@175
PS13, Line 175:   /// Materialize the var-len string data in this tuple into 
the provided memory
  :   /// pool.
the name and the comment didn't make it clear that this also rewrites this 
tuple. i.e. in that sense, this is different than MaterializeExprs(). In this 
case, the slots (and strings) are already materialized -- it's re-materializing 
(or copying) the strings, right?



--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 13
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 26 Oct 2017 23:46:22 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-24 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 11:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.h
File be/src/runtime/tuple.h:

http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.h@289
PS11, Line 289: );
> Are those calls to std C++ library functions ?
I realise now the comment didn't make sense without context. The invoke came 
from the call to this function and was fixed by adding a "noexcept" specifier - 
we've used that in various other places.


http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc
File be/src/runtime/tuple.cc:

http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc@449
PS11, Line 449:   Value* result_val = builder.CreateCall(cross_compiled_fn,
> Yes. That seems to be helpful. Can we also do something bigger (e.g. 16 col
I did a quick microbenchmark and there is a ~10% difference in scan time 
between the codegen'd and non-codegen'd when scanning a single string column.

Another case where it's useful is optimising out the loop and runtime checks 
entirely  when no string cols are materialised.



--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 24 Oct 2017 22:28:45 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-23 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 11:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.h
File be/src/runtime/tuple.h:

http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.h@289
PS11, Line 289: );
> Looking in the IR I noticed that there were "invoke" instruction emitted wh
Are those calls to std C++ library functions ?


http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc
File be/src/runtime/tuple.cc:

http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc@449
PS11, Line 449:   Value* result_val = builder.CreateCall(cross_compiled_fn,
> I think being conservative is probably a good thing actually. The code is n
Yes. That seems to be helpful. Can we also do something bigger (e.g. 16 
columns) ?

If I understand it correctly, the benefit for codegen'ing here (and all the new 
code for creating the constants) is for unrolling the loop, right ?

May be it's still helpful for the cases in which few number of string columns 
are materialized.



--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 24 Oct 2017 00:37:51 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-23 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc
File be/src/runtime/tuple.cc:

http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc@449
PS11, Line 449:   Value* result_val = builder.CreateCall(cross_compiled_fn,
> It's probably because of the inlining overhead. I wonder what the perf is l
I think being conservative is probably a good thing actually. The code is never 
executed in the common case where the Avro files are compressed so taking too 
much of a codegen time hit would be a bad thing.

I could do some further experiments to understand the impact - e.g. compare the 
overhead of the unrolled and non-unrolled versions on 8 string columns - would 
that be helpful?



--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 24 Oct 2017 00:23:47 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-23 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc
File be/src/runtime/tuple.cc:

http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc@449
PS11, Line 449:   Value* result_val = builder.CreateCall(cross_compiled_fn,
> I confirmed that it was unrolled for a handful of string columns. For 100s
It's probably because of the inlining overhead. I wonder what the perf is like 
if we force inlining and let the unroll to occur for your 100 columns example



--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 24 Oct 2017 00:10:55 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-23 Thread Tim Armstrong (Code Review)
Hello Michael Ho, Thomas Tauber-Marshall,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8146

to look at the new patch set (#13).

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..

IMPALA-5307: Part 2: copy out strings in uncompressed Avro

The approach is to re-materialize strings in those tuples that
survive conjunct evaluation and may reference disk I/O buffers
directly. This means that perf should not regress for the
following cases:
* Compressed Avro files.
* Non-string columns.
* Selective scans where the majority of tuples are filtered out.

This approach will also work for the Sequence and Text scanners.

Includes some improvements to Avro codegen to replace more constants to
help win back some performance (with limited success): replaced
InitTuple() with an optimised version and substituted
tuple_byte_size() with a constant.

Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed
length.

Perf:
Did microbenchmarks on uncompressed Avro files, one with all columns
from lineitem and one with only l_comment. Tests were run with:
  set num_scanner_threads=1;

I ran the query 5 times and extracted MaterializeTupleTime from the
profile to measure CPU cost of materialization. Overall string
materialization got significantly slower, mainly because of the
extra memcpy() calls required.

Selecting one string from a table with multiple columns:

  select min(l_comment) from biglineitem_avro
  1.814 -> 2.096

Selecting one string from a table with one column:

  select min(l_comment) from biglineitem_comment; profile;
  1.708 -> 3.7

Selecting one string from a table with one column with predicate:

  select min(l_comment) from biglineitem_comment where length(l_comment) > 
1;
  1.691 -> 1.449

Selecting all columns:

  select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber),
min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax),
min(l_returnflag), min(l_linestatus), min(l_shipdate),
min(l_commitdate), min(l_receiptdate), min(l_shipinstruct),
min(l_shipmode), min(l_comment) from biglineitem_avro; profile;
  2.335 -> 3.711

Selecting an int column (no strings):
  select min(l_linenumber) from biglineitem_avro
  1.806 -> 1.819

Testing:
Ran exhaustive tests.

Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/codegen/llvm-codegen.cc
M be/src/codegen/llvm-codegen.h
M be/src/common/status.cc
M be/src/common/status.h
M be/src/exec/hdfs-avro-scanner-ir.cc
M be/src/exec/hdfs-avro-scanner.cc
M be/src/exec/hdfs-avro-scanner.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/runtime/CMakeLists.txt
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
A be/src/runtime/tuple-ir.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
19 files changed, 337 insertions(+), 57 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/8146/13
--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 13
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-23 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 13: Code-Review+1

PS13 is a rebase


--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 13
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 23 Oct 2017 23:37:38 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-23 Thread Tim Armstrong (Code Review)
Hello Michael Ho, Thomas Tauber-Marshall,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8146

to look at the new patch set (#12).

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..

IMPALA-5307: Part 2: copy out strings in uncompressed Avro

The approach is to re-materialize strings in those tuples that
survive conjunct evaluation and may reference disk I/O buffers
directly. This means that perf should not regress for the
following cases:
* Compressed Avro files.
* Non-string columns.
* Selective scans where the majority of tuples are filtered out.

This approach will also work for the Sequence and Text scanners.

Includes some improvements to Avro codegen to replace more constants to
help win back some performance (with limited success): replaced
InitTuple() with an optimised version and substituted
tuple_byte_size() with a constant.

Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed
length.

Perf:
Did microbenchmarks on uncompressed Avro files, one with all columns
from lineitem and one with only l_comment. Tests were run with:
  set num_scanner_threads=1;

I ran the query 5 times and extracted MaterializeTupleTime from the
profile to measure CPU cost of materialization. Overall string
materialization got significantly slower, mainly because of the
extra memcpy() calls required.

Selecting one string from a table with multiple columns:

  select min(l_comment) from biglineitem_avro
  1.814 -> 2.096

Selecting one string from a table with one column:

  select min(l_comment) from biglineitem_comment; profile;
  1.708 -> 3.7

Selecting one string from a table with one column with predicate:

  select min(l_comment) from biglineitem_comment where length(l_comment) > 
1;
  1.691 -> 1.449

Selecting all columns:

  select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber),
min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax),
min(l_returnflag), min(l_linestatus), min(l_shipdate),
min(l_commitdate), min(l_receiptdate), min(l_shipinstruct),
min(l_shipmode), min(l_comment) from biglineitem_avro; profile;
  2.335 -> 3.711

Selecting an int column (no strings):
  select min(l_linenumber) from biglineitem_avro
  1.806 -> 1.819

Testing:
Ran exhaustive tests.

Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/codegen/llvm-codegen.cc
M be/src/codegen/llvm-codegen.h
M be/src/common/status.cc
M be/src/common/status.h
M be/src/exec/hdfs-avro-scanner-ir.cc
M be/src/exec/hdfs-avro-scanner.cc
M be/src/exec/hdfs-avro-scanner.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/runtime/CMakeLists.txt
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
A be/src/runtime/tuple-ir.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
19 files changed, 336 insertions(+), 56 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/8146/12
--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 12
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-23 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 11:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/descriptors.cc
File be/src/runtime/descriptors.cc:

http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/descriptors.cc@87
PS11, Line 87:   return ConstantStruct::get(null_indicator_offset_type,
 :   {ConstantInt::get(codegen->int_type(), byte_offset),
 :   ConstantInt::get(codegen->tinyint_type(), bit_mask),
 :   zeroes->getStructElement(2)});
> This needs to be updated when we change the definition of NullIndicatorOffs
Added a comment to the class definition.

ConstantStruct::get() asserts if the number of elements doesn't match the 
number of elements in the struct (that's why I had to add the padding 
explicitly). This behaviour seems better than silently failing.

The code in LLVM is:

  Constant *ConstantStruct::get(StructType *ST, ArrayRef V) {
assert((ST->isOpaque() || ST->getNumElements() == V.size()) &&
  "Incorrect # elements specified to ConstantStruct::get");


http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.h
File be/src/runtime/tuple.h:

http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.h@289
PS11, Line 289: );
Looking in the IR I noticed that there were "invoke" instruction emitted when 
calling this function.


http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc
File be/src/runtime/tuple.cc:

http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc@397
PS11, Line 397: Status(
> Status::Expected() ?
I think finalisation failing should be very unusual - either an Impala bug or a 
corrupt IR UDF. To be honest I don't know how robust our handling of this is, 
since we'll continue on and eventually try to optimise and compile the module.


http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc@449
PS11, Line 449:   Value* result_val = builder.CreateCall(cross_compiled_fn,
> Just curious if you inspected the output IR to see if it's actually unrolli
I confirmed that it was unrolled for a handful of string columns. For 100s of 
string columns, e.g,  select count(*) from (select distinct * from 
widetable_1000_cols) v;, it is not unrolled.

I checked some intermediate values too and it seems pretty conservative - it 
seemed like the switch-over point is around materialising >=8 string columns. 
E.g. select * from tpch_avro did not unroll the loop.



--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 23 Oct 2017 23:36:23 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-23 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 11:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/descriptors.cc
File be/src/runtime/descriptors.cc:

http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/descriptors.cc@87
PS11, Line 87:   return ConstantStruct::get(null_indicator_offset_type,
 :   {ConstantInt::get(codegen->int_type(), byte_offset),
 :   ConstantInt::get(codegen->tinyint_type(), bit_mask),
 :   zeroes->getStructElement(2)});
This needs to be updated when we change the definition of NullIndicatorOffset 
in the future. Please add a remark about it in the header declaration.

Alternately, one can consider copying the ConstantAggregateZero struct and 
selectively populating the fields which we deem useful. In this way, newly 
added field will have the zero values if we forget to initialize them but this 
approach may still be error-prone if we update the struct layouts in the future.

Not sure if there are easy ways to add DCHECK() to check for consistency of the 
constant generated.


http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc
File be/src/runtime/tuple.cc:

http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc@397
PS11, Line 397: Status(
Status::Expected() ?


http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc@449
PS11, Line 449:   Value* result_val = builder.CreateCall(cross_compiled_fn,
Just curious if you inspected the output IR to see if it's actually unrolling 
the loop ?



--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 23 Oct 2017 21:25:07 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-23 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 11:

OK. Will take a look.


--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 23 Oct 2017 20:25:14 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-23 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 11:

Michael, this uses some of the constant global value and .ToIR() concepts that 
you had used in a few places. Would be good to confirm that you think this is a 
good application of the pattern.


--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 23 Oct 2017 18:37:36 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-20 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 11: Code-Review+1

carry


--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 20 Oct 2017 23:36:26 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-20 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/exec/hdfs-avro-scanner.h
File be/src/exec/hdfs-avro-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/exec/hdfs-avro-scanner.h@136
PS11, Line 136:   // in a simple array of struct simplifies codegen and speeds 
up interpretation.
Addressed Alex's comment on the other CR

https://gerrit.cloudera.org/#/c/8172/4/be/src/exec/hdfs-scanner.h@276



--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 20 Oct 2017 23:36:02 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-20 Thread Tim Armstrong (Code Review)
Hello Thomas Tauber-Marshall,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8146

to look at the new patch set (#11).

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..

IMPALA-5307: Part 2: copy out strings in uncompressed Avro

The approach is to re-materialize strings in those tuples that
survive conjunct evaluation and may reference disk I/O buffers
directly. This means that perf should not regress for the
following cases:
* Compressed Avro files.
* Non-string columns.
* Selective scans where the majority of tuples are filtered out.

This approach will also work for the Sequence and Text scanners.

Includes some improvements to Avro codegen to replace more constants to
help win back some performance (with limited success): replaced
InitTuple() with an optimised version and substituted
tuple_byte_size() with a constant.

Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed
length.

Perf:
Did microbenchmarks on uncompressed Avro files, one with all columns
from lineitem and one with only l_comment. Tests were run with:
  set num_scanner_threads=1;

I ran the query 5 times and extracted MaterializeTupleTime from the
profile to measure CPU cost of materialization. Overall string
materialization got significantly slower, mainly because of the
extra memcpy() calls required.

Selecting one string from a table with multiple columns:

  select min(l_comment) from biglineitem_avro
  1.814 -> 2.096

Selecting one string from a table with one column:

  select min(l_comment) from biglineitem_comment; profile;
  1.708 -> 3.7

Selecting one string from a table with one column with predicate:

  select min(l_comment) from biglineitem_comment where length(l_comment) > 
1;
  1.691 -> 1.449

Selecting all columns:

  select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber),
min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax),
min(l_returnflag), min(l_linestatus), min(l_shipdate),
min(l_commitdate), min(l_receiptdate), min(l_shipinstruct),
min(l_shipmode), min(l_comment) from biglineitem_avro; profile;
  2.335 -> 3.711

Selecting an int column (no strings):
  select min(l_linenumber) from biglineitem_avro
  1.806 -> 1.819

Testing:
Ran exhaustive tests.

Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/codegen/llvm-codegen.cc
M be/src/codegen/llvm-codegen.h
M be/src/common/status.cc
M be/src/common/status.h
M be/src/exec/hdfs-avro-scanner-ir.cc
M be/src/exec/hdfs-avro-scanner.cc
M be/src/exec/hdfs-avro-scanner.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/runtime/CMakeLists.txt
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
A be/src/runtime/tuple-ir.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
19 files changed, 333 insertions(+), 56 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/8146/11
--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-20 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 10: Code-Review+1


--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 10
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 20 Oct 2017 17:13:31 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-19 Thread Tim Armstrong (Code Review)
Hello Thomas Tauber-Marshall,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8146

to look at the new patch set (#10).

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..

IMPALA-5307: Part 2: copy out strings in uncompressed Avro

The approach is to re-materialize strings in those tuples that
survive conjunct evaluation and may reference disk I/O buffers
directly. This means that perf should not regress for the
following cases:
* Compressed Avro files.
* Non-string columns.
* Selective scans where the majority of tuples are filtered out.

This approach will also work for the Sequence and Text scanners.

Includes some improvements to Avro codegen to replace more constants to
help win back some performance (with limited success): replaced
InitTuple() with an optimised version and substituted
tuple_byte_size() with a constant.

Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed
length.

Perf:
Did microbenchmarks on uncompressed Avro files, one with all columns
from lineitem and one with only l_comment. Tests were run with:
  set num_scanner_threads=1;

I ran the query 5 times and extracted MaterializeTupleTime from the
profile to measure CPU cost of materialization. Overall string
materialization got significantly slower, mainly because of the
extra memcpy() calls required.

Selecting one string from a table with multiple columns:

  select min(l_comment) from biglineitem_avro
  1.814 -> 2.096

Selecting one string from a table with one column:

  select min(l_comment) from biglineitem_comment; profile;
  1.708 -> 3.7

Selecting one string from a table with one column with predicate:

  select min(l_comment) from biglineitem_comment where length(l_comment) > 
1;
  1.691 -> 1.449

Selecting all columns:

  select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber),
min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax),
min(l_returnflag), min(l_linestatus), min(l_shipdate),
min(l_commitdate), min(l_receiptdate), min(l_shipinstruct),
min(l_shipmode), min(l_comment) from biglineitem_avro; profile;
  2.335 -> 3.711

Selecting an int column (no strings):
  select min(l_linenumber) from biglineitem_avro
  1.806 -> 1.819

Testing:
Ran exhaustive tests.

Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/codegen/llvm-codegen.cc
M be/src/codegen/llvm-codegen.h
M be/src/common/status.cc
M be/src/common/status.h
M be/src/exec/hdfs-avro-scanner-ir.cc
M be/src/exec/hdfs-avro-scanner.cc
M be/src/exec/hdfs-avro-scanner.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/runtime/CMakeLists.txt
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
A be/src/runtime/tuple-ir.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
19 files changed, 331 insertions(+), 56 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/8146/10
--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 10
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-19 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 9:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/exec/hdfs-scanner.cc@543
PS9, Line 543: *init_tuple_fn =
 :   codegen->GetFunction(IRFunction::HDFS_SCANNER_INIT_TUPLE, 
true);
> single line?
Done


http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/runtime/descriptors.cc
File be/src/runtime/descriptors.cc:

http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/runtime/descriptors.cc@86
PS9, Line 86: ConstantAggregateZero* zeroes =
:   ConstantAggregateZero::get(null_indicator_offset_type);
> single line?
Done


http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/runtime/tuple.h
File be/src/runtime/tuple.h:

http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/runtime/tuple.h@182
PS9, Line 182: Status* status
> Its unusual to have s status as an out parameter instead of returning it. C
That's a good point. I discovered for PhjBuilder::AppendRow() a while ago that 
the extra IR emitted noticeably hurt codegen time. Copied over the explanation 
from there.


http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/runtime/tuple.h@184
PS9, Line 184: Helper for MaterializeStrings()
> If this isn't intended to be called outside this class, make it private?
Good point - done.



--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 9
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 20 Oct 2017 00:48:18 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-19 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 9:

(4 comments)

Looks good

http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/exec/hdfs-scanner.cc@543
PS9, Line 543: *init_tuple_fn =
 :   codegen->GetFunction(IRFunction::HDFS_SCANNER_INIT_TUPLE, 
true);
single line?


http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/runtime/descriptors.cc
File be/src/runtime/descriptors.cc:

http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/runtime/descriptors.cc@86
PS9, Line 86: ConstantAggregateZero* zeroes =
:   ConstantAggregateZero::get(null_indicator_offset_type);
single line?


http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/runtime/tuple.h
File be/src/runtime/tuple.h:

http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/runtime/tuple.h@182
PS9, Line 182: Status* status
Its unusual to have s status as an out parameter instead of returning it. Could 
you document why you did this?


http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/runtime/tuple.h@184
PS9, Line 184: Helper for MaterializeStrings()
If this isn't intended to be called outside this class, make it private?



--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 9
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 19 Oct 2017 17:09:15 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-10-12 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..


Patch Set 8:

(1 comment)

There's a fair bit of codegen logic in this patch - maybe you could take a look 
Thomas since you've been looking at codegen recently?

http://gerrit.cloudera.org:8080/#/c/8146/6/be/src/runtime/tuple.cc
File be/src/runtime/tuple.cc:

http://gerrit.cloudera.org:8080/#/c/8146/6/be/src/runtime/tuple.cc@448
PS6, Line 448:   Value* constant_slot_offsets_first_element_ptr =
Weird ordering of statements.



--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 8
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 12 Oct 2017 17:26:12 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

2017-09-27 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded a new patch set (#7). ( 
http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
..

IMPALA-5307: Part 2: copy out strings in uncompressed Avro

The approach is to re-materialize strings in those tuples that
survive conjunct evaluation and may reference disk I/O buffers
directly. This means that perf should not regress for the
following cases:
* Compressed Avro files.
* Non-string columns.
* Selective scans where the majority of tuples are filtered out.

This approach will also work for the Sequence and Text scanners.

Includes some improvements to Avro codegen to replace more constants to
help win back some performance (with limited success): replaced
InitTuple() with an optimised version and substituted
tuple_byte_size() with a constant.

Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed
length.

Perf:
Did microbenchmarks on uncompressed Avro files, one with all columns
from lineitem and one with only l_comment. Tests were run with:
  set num_scanner_threads=1;

I ran the query 5 times and extracted MaterializeTupleTime from the
profile to measure CPU cost of materialization. Overall string
materialization got significantly slower, mainly because of the
extra memcpy() calls required.

Selecting one string from a table with multiple columns:

  select min(l_comment) from biglineitem_avro
  1.814 -> 2.096

Selecting one string from a table with one column:

  select min(l_comment) from biglineitem_comment; profile;
  1.708 -> 3.7

Selecting one string from a table with one column with predicate:

  select min(l_comment) from biglineitem_comment where length(l_comment) > 
1;
  1.691 -> 1.449

Selecting all columns:

  select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber),
min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax),
min(l_returnflag), min(l_linestatus), min(l_shipdate),
min(l_commitdate), min(l_receiptdate), min(l_shipinstruct),
min(l_shipmode), min(l_comment) from biglineitem_avro; profile;
  2.335 -> 3.711

Selecting an int column (no strings):
  select min(l_linenumber) from biglineitem_avro
  1.806 -> 1.819

Testing:
Ran exhaustive tests.

Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/codegen/llvm-codegen.cc
M be/src/codegen/llvm-codegen.h
M be/src/common/status.cc
M be/src/common/status.h
M be/src/exec/hdfs-avro-scanner-ir.cc
M be/src/exec/hdfs-avro-scanner.cc
M be/src/exec/hdfs-avro-scanner.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/runtime/CMakeLists.txt
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
A be/src/runtime/tuple-ir.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
19 files changed, 329 insertions(+), 56 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/8146/7
--
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 7
Gerrit-Owner: Tim Armstrong