[Impala-ASF-CR] IMPALA-1616: Improve the Memory Limit Exceeded error report

2016-09-12 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-1616: Improve the Memory Limit Exceeded error report
..


Patch Set 4: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/4335
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibb4e0c359d889938b4c351771ba539850bdb95ea
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4026: Implement double-buffering for BlockingQueue.

2016-09-12 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4026: Implement double-buffering for BlockingQueue.
..


Patch Set 2:

(5 comments)

Thanks for switching back to unique_lock, it makes it a bit easier to reason 
about for me. Change looks good, just a few comments.

http://gerrit.cloudera.org:8080/#/c/4350/2/be/src/util/blocking-queue.h
File be/src/util/blocking-queue.h:

Line 76:   // Sleep with read lock held to block off other readers 
which cannot
Won't this change the meaning of total_get_wait_time_ so that we're only 
counting the single thread that got the lock?

This is ok since I don't think the counter is contractual, and the new 
definition may be more useful, but should update the comment.


Line 111: write_lock.unlock();
Not your change but it'd be more consistent to use braced scoping to release 
the lock instead of this explicit unlock.


Line 155:   uint32_t GetSize() const {
It looks like this is sometimes called with lock held or not. Seems like we 
should document this (i.e. caller should hold lock for non-racy read).


Line 161:   uint64_t total_get_wait_time() const {
It doesn't look like total_get_wait_time and total_put_wait_time have any 
callers - could just get rid ofthe functions?


Line 189:   boost::mutex write_lock_;
We might be able to further reduce contention if we align the locks and other 
data structures so that they're guaranteed to not share 64-bit cache lines. 
Maybe group read and write members together so that the can share cache lines, 
then align the first member of both groups to 64 bytes?

They're only 40 bytes on my system, so with the current layout read_lock and 
write_lock may be on the same cache line:

#include 
#include 
#include 
#include 


struct test {
  boost::mutex mutex1;
  boost::mutex mutex2;
};

int main() {
  printf("sizeof(pthread_mutex_t)=%ld\n", sizeof(pthread_mutex_t));
  printf("sizeof(boost::mutex)=%ld\n", sizeof(boost::mutex));
  printf("sizeof(pthread_cond_t)=%ld\n", sizeof(pthread_cond_t));
  printf("offsetof(...)=%ld\n", offsetof(struct test, mutex2));
  return 0;
}


[~]$ g++ mutex.cc -lboost_system -o mutex&& ./mutex 
   
sizeof(pthread_mutex_t)=40
sizeof(boost::mutex)=40
sizeof(pthread_cond_t)=48
offsetof(...)=40
[~]$


-- 
To view, visit http://gerrit.cloudera.org:8080/4350
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib9f4cf351455efefb0f3bb791cf9bc82d1421d54
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Chen Huang 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4100,4112: Qgen: Replace EXTRACT UDF + IS [NOT] DISTINCT FROM in HiveSqlWriter

2016-09-12 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4100,4112: Qgen: Replace EXTRACT UDF + IS [NOT] DISTINCT 
FROM in HiveSqlWriter
..


Patch Set 2: Code-Review+2 Verified+1

Have we considered added some basic sanity tests for the query generator? Might 
help prevent it bitrotting in future.

-- 
To view, visit http://gerrit.cloudera.org:8080/4357
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3922ca61af59ecd2899c911b1a03e11ab5c26e11
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: stak...@cloudera.com
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: stak...@cloudera.com
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4100,4112: Qgen: Replace EXTRACT UDF + IS [NOT] DISTINCT FROM in HiveSqlWriter

2016-09-12 Thread Tim Armstrong (Code Review)
Tim Armstrong has submitted this change and it was merged.

Change subject: IMPALA-4100,4112: Qgen: Replace EXTRACT UDF + IS [NOT] DISTINCT 
FROM in HiveSqlWriter
..


IMPALA-4100,4112: Qgen: Replace EXTRACT UDF + IS [NOT] DISTINCT FROM in 
HiveSqlWriter

IMPALA-4100:

* Postgres and Impala support EXTRACT([field] from [date-type]), but Hive 
doesn't
* Hive has other UDFs that perform the same function, but they have different 
names
* This commit modifies the HiveSqlWriter to use the corresponding Hive functions

IMPALA-4112:

* Postgres and Impala support IS [NOT] DISTINCT FROM clauses as a null safe 
equals
* Hive doesn't support this clause, but has a null safe equals operator: <=>
* This commit modifies the HiveSqlWriter to use <=> instead of IS [NOT] 
DISTINCT FROM

Testing:

* This commit only modifies the HiveSqlWriter, so no testing against Impala was 
done
* Tested locally against Hive

Change-Id: I3922ca61af59ecd2899c911b1a03e11ab5c26e11
Reviewed-on: http://gerrit.cloudera.org:8080/4357
Reviewed-by: Michael Brown 
Reviewed-by: Tim Armstrong 
Tested-by: Tim Armstrong 
---
M tests/comparison/model_translator.py
1 file changed, 27 insertions(+), 0 deletions(-)

Approvals:
  Michael Brown: Looks good to me, but someone else must approve
  Tim Armstrong: Looks good to me, approved; Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/4357
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I3922ca61af59ecd2899c911b1a03e11ab5c26e11
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: stak...@cloudera.com
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: stak...@cloudera.com


[Impala-ASF-CR] Fix typo in buildall.sh introduced in IMPALA-4006

2016-09-12 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/4382

Change subject: Fix typo in buildall.sh introduced in IMPALA-4006
..

Fix typo in buildall.sh introduced in IMPALA-4006

The typo resulted in a silent failure: an error message was printed in
the middle of the buildall.sh output and the branch was never taken.

Change-Id: I7a0f74b93bb31bd0c56fc4c20f42f8ab1fc6de78
---
M buildall.sh
1 file changed, 1 insertion(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/4382/1
-- 
To view, visit http://gerrit.cloudera.org:8080/4382
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I7a0f74b93bb31bd0c56fc4c20f42f8ab1fc6de78
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 


[Impala-ASF-CR] IMPALA-4028: Trim sentry config file path spaces while impala start.

2016-09-12 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4028: Trim sentry config file path spaces while impala 
start.
..


Patch Set 3:

How did you come across this? Did you start Impala from the command line 
directly or use some other tool (e.g. cloudera manager)?

-- 
To view, visit http://gerrit.cloudera.org:8080/4309
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3a76b9e4236caa3f2088fba8a9cf0236fced2634
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: davy...@163.com
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: davy...@163.com
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4111: backend death tests should not produce minidumps

2016-09-12 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4111: backend death tests should not produce minidumps
..


Patch Set 4: Code-Review+1

Carry +1

-- 
To view, visit http://gerrit.cloudera.org:8080/4353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I33037f4f77c8188fc2ec46a77083a0a64f6ea404
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4111: backend death tests should not produce minidumps

2016-09-12 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4111: backend death tests should not produce minidumps
..


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/4353/2/be/src/testutil/death-test-util.h
File be/src/testutil/death-test-util.h:

PS2, Line 41: setrlimit(RLIMIT_CORE, &limit);
: 
: minidumps_enabled_before_ = EnableMinidumpsForTest(false);
> I assume it's necessary to disable core dumps when we've disabled minidumps
I'm not sure I understand the comment. The class is meant to disable both in 
tests, but I don't think there's a dependency between the two. 

The previous version of the class in promise-test.cc only disabled coredumps.


http://gerrit.cloudera.org:8080/#/c/4353/3/be/src/util/minidump.h
File be/src/util/minidump.h:

PS3, Line 29:  for death test
> nit: maybe remove or reword like "for example during death test" or "during
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I33037f4f77c8188fc2ec46a77083a0a64f6ea404
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4111: backend death tests should not produce minidumps

2016-09-12 Thread Tim Armstrong (Code Review)
Hello Lars Volker,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4353

to look at the new patch set (#4).

Change subject: IMPALA-4111: backend death tests should not produce minidumps
..

IMPALA-4111: backend death tests should not produce minidumps

Move the existing core dump disabler into a shared utility header and
also disable minidumps too. Add a macro to simplify use of the disabler.

Change-Id: I33037f4f77c8188fc2ec46a77083a0a64f6ea404
---
A be/src/testutil/death-test-util.h
M be/src/util/minidump.cc
M be/src/util/minidump.h
M be/src/util/promise-test.cc
4 files changed, 83 insertions(+), 24 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/53/4353/4
-- 
To view, visit http://gerrit.cloudera.org:8080/4353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I33037f4f77c8188fc2ec46a77083a0a64f6ea404
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 


[Impala-ASF-CR] IMPALA-4026: Implement double-buffering for BlockingQueue.

2016-09-12 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4026: Implement double-buffering for BlockingQueue.
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4350/2/be/src/util/blocking-queue.h
File be/src/util/blocking-queue.h:

PS2, Line 185: mutex
> worth changing these to SpinLock, which I believe are a bit faster to lock/
Unfortunately we don't have a condition variable implementation that works with 
SpinLock (that would be very nice to have).


-- 
To view, visit http://gerrit.cloudera.org:8080/4350
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib9f4cf351455efefb0f3bb791cf9bc82d1421d54
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Chen Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4026: Implement double-buffering for BlockingQueue.

2016-09-12 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4026: Implement double-buffering for BlockingQueue.
..


Patch Set 3:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/4350/2/be/src/util/blocking-queue.h
File be/src/util/blocking-queue.h:

Line 155: 
> It's always racy as we never hold read_lock when reading the size of read_l
Ah, good point.


http://gerrit.cloudera.org:8080/#/c/4350/3/be/src/util/blocking-queue.h
File be/src/util/blocking-queue.h:

PS3, Line 188: put_list
shouldn't these members have _ at the end?


Line 197:   boost::mutex get_lock;
Shouldn't we align get_lock?


http://gerrit.cloudera.org:8080/#/c/4350/3/be/src/util/condition-variable.h
File be/src/util/condition-variable.h:

Line 28: /// Simple wrapper around POSIX pthread condition variable.
Explain briefly how it differs from the boost cv?


PS3, Line 29: struct
class?


PS3, Line 29: __attribute__ ((aligned(64)))
What do you think about putting this in compiler-util.h as CACHE_ALIGN or 
something along those lines?


PS3, Line 61: cv
cv_


PS3, Line 62: ConditionVariable
This and the typedef shouldn't be necessary in C++, it automatically lets you 
use the short name of the class/struct.


-- 
To view, visit http://gerrit.cloudera.org:8080/4350
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib9f4cf351455efefb0f3bb791cf9bc82d1421d54
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Chen Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4028: Trim sentry config file path spaces while impala start.

2016-09-12 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4028: Trim sentry config file path spaces while impala 
start.
..


Patch Set 3:

I don't think we should go ahead with the trimming solution for the reasons I 
mentioned earlier (problems with trying to "fix" misconfigurations and 
inconsistency with other command line options). 

However part of the problem seems to be that it's very hard to see the spaces 
in the error message. Maybe you could change the patch just to quote the file 
name in the error message?

-- 
To view, visit http://gerrit.cloudera.org:8080/4309
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3a76b9e4236caa3f2088fba8a9cf0236fced2634
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: davy...@163.com
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: davy...@163.com
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4026: Implement double-buffering for BlockingQueue.

2016-09-13 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4026: Implement double-buffering for BlockingQueue.
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4350/3/be/src/util/blocking-queue.h
File be/src/util/blocking-queue.h:

Line 99:   bool BlockingPut(const T& val) {
I think there's a pretty bad bug with how the signalling works. 

Consider a queue with a limit of N. If the queue gets into a state where 
get_list has N elements and put_list has 0 elements with all of the producer 
threads waiting on it, then the caller to BlockingGet() will drain get_list 
before signalling put_cv,  even though it should wake up a producer thread once 
there is space in the queue. Worse, it will only wake up one producer thread 
each time BlockingGet() is called, which means that one only producer thread 
can run at a time.


-- 
To view, visit http://gerrit.cloudera.org:8080/4350
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib9f4cf351455efefb0f3bb791cf9bc82d1421d54
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Chen Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4111: backend death tests should not produce minidumps

2016-09-13 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4111: backend death tests should not produce minidumps
..


Patch Set 4:

Somehow Lars' earlier comments got deleted. Reproducing from my email:



Patch Set 3: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4353/3/be/src/util/minidump.h
File be/src/util/minidump.h:

Lars Volker has posted comments on this change.

Change subject: IMPALA-4111: backend death tests should not produce minidumps
..


Patch Set 3: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4353/3/be/src/util/minidump.h
File be/src/util/minidump.h:

PS3, Line 29:  for death test
nit: maybe remove or reword like "for example during death test" or "during 
tests"

-- 
To view, visit http://gerrit.cloudera.org:8080/4353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I33037f4f77c8188fc2ec46a77083a0a64f6ea404
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4008: don't bake in hash table and hash join pointers

2016-09-13 Thread Tim Armstrong (Code Review)
Hello Michael Ho,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4326

to look at the new patch set (#4).

Change subject: IMPALA-4008: don't bake in hash table and hash join pointers
..

IMPALA-4008: don't bake in hash table and hash join pointers

This fixes some of the cases where memory addresses are baked into
codegen'd code.

Testing:
Ran exhaustive build.

Perf:
Ran a local perf run. No significant changes. I was able to see some
small improvements on microbenchmarks.


+---+---+-++++
| Workload  | File Format   | Avg (s) | Delta(Avg) | GeoMean(s) | 
Delta(GeoMean) |

+---+---+-++++
| TPCH(_20) | parquet / none / none | 9.07| +0.46% | 5.88   | 
+0.34% |

+---+---+-++++


+---+--+---++-++++-+---+
| Workload  | Query| File Format   | Avg(s) | Base Avg(s) | 
Delta(Avg) | StdDev(%)  | Base StdDev(%) | Num Clients | Iters |

+---+--+---++-++++-+---+
| TPCH(_20) | TPCH-Q2  | parquet / none / none | 2.12   | 1.89|   
+12.29%  | * 10.85% * | * 20.30% * | 1   | 10|
| TPCH(_20) | TPCH-Q13 | parquet / none / none | 9.84   | 9.34|   
+5.39%   |   9.01%|   3.79%| 1   | 10|
| TPCH(_20) | TPCH-Q17 | parquet / none / none | 14.61  | 14.19   |   
+2.97%   |   2.15%|   1.52%| 1   | 10|
| TPCH(_20) | TPCH-Q18 | parquet / none / none | 14.76  | 14.35   |   
+2.82%   |   3.20%|   2.64%| 1   | 10|
| TPCH(_20) | TPCH-Q9  | parquet / none / none | 13.72  | 13.54   |   
+1.30%   |   1.75%|   0.70%| 1   | 10|
| TPCH(_20) | TPCH-Q8  | parquet / none / none | 5.71   | 5.64|   
+1.30%   |   1.21%|   1.23%| 1   | 10|
| TPCH(_20) | TPCH-Q19 | parquet / none / none | 47.35  | 46.75   |   
+1.28%   |   2.39%|   1.88%| 1   | 10|
| TPCH(_20) | TPCH-Q5  | parquet / none / none | 4.57   | 4.52|   
+1.20%   |   1.30%|   0.88%| 1   | 10|
| TPCH(_20) | TPCH-Q16 | parquet / none / none | 2.07   | 2.05|   
+1.12%   |   2.59%|   1.79%| 1   | 10|
| TPCH(_20) | TPCH-Q11 | parquet / none / none | 1.45   | 1.45|   
+0.15%   |   2.69%|   2.06%| 1   | 10|
| TPCH(_20) | TPCH-Q3  | parquet / none / none | 4.65   | 4.65|   
-0.09%   |   2.12%|   2.17%| 1   | 10|
| TPCH(_20) | TPCH-Q4  | parquet / none / none | 3.22   | 3.23|   
-0.26%   |   1.03%|   1.33%| 1   | 10|
| TPCH(_20) | TPCH-Q7  | parquet / none / none | 15.84  | 15.92   |   
-0.50%   |   0.91%|   1.15%| 1   | 10|
| TPCH(_20) | TPCH-Q14 | parquet / none / none | 3.29   | 3.31|   
-0.59%   |   3.31%|   1.58%| 1   | 10|
| TPCH(_20) | TPCH-Q22 | parquet / none / none | 2.65   | 2.67|   
-0.78%   |   3.03%|   1.46%| 1   | 10|
| TPCH(_20) | TPCH-Q15 | parquet / none / none | 4.50   | 4.55|   
-1.19%   |   2.87%|   2.45%| 1   | 10|
| TPCH(_20) | TPCH-Q20 | parquet / none / none | 3.84   | 3.91|   
-1.76%   |   2.20%|   1.94%| 1   | 10|
| TPCH(_20) | TPCH-Q10 | parquet / none / none | 5.58   | 5.70|   
-2.00%   |   1.01%|   1.79%| 1   | 10|
| TPCH(_20) | TPCH-Q21 | parquet / none / none | 22.84  | 23.42   |   
-2.47%   |   0.68%|   0.56%| 1   | 10|
| TPCH(_20) | TPCH-Q1  | parquet / none / none | 11.25  | 11.60   |   
-3.06%   |   0.48%|   1.74%| 1   | 10|
| TPCH(_20) | TPCH-Q12 | parquet / none / none | 3.81   | 3.98|   
-4.38%   |   1.62%|   1.14%| 1   | 10|
| TPCH(_20) | TPCH-Q6  | parquet / none / none | 1.94   | 2.04|   
-4.85%   |   2.40%|   1.58%| 1   | 10|

+---+--+---++-++++-+---+


++---+-++++
| Workload   | File Format   | Avg (s) | Delta(Avg) | 
GeoMean(s) | Delta(GeoMean) |

+---

[Impala-ASF-CR] IMPALA-4008: don't bake in hash table and hash join pointers

2016-09-13 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4008: don't bake in hash table and hash join pointers
..


Patch Set 3:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/4326/3/be/src/exec/hash-table.h
File be/src/exec/hash-table.h:

PS3, Line 300: ExprValuesHash
> CurExprValuesHash()  [see below for motiviation]
Done


PS3, Line 303: SetExprValuesHash
> Let's rename this to more closely match cur_expr_values() rather than match
Done


PS3, Line 309: cur_expr_values_null
> comment explaining that there is one byte per null value (i.e. these are us
Done.

I'm not actually sure why we need this for codegen, since bool and uint8_t 
should both be represented as int8 internally in LLVM, but I don't want to get 
distracted digging into that, so I left a TODO.

There's some weirdness with vector where it uses bitfields but otherwise 
C++ bools should be 8 bits - maybe that was the reason.


PS3, Line 403: in
> the first time I read this, I read it as this functions stores the values t
Makes sense - done.


PS3, Line 403: in
> same
Done


PS3, Line 413: in
> same
Done


PS3, Line 431: This will be replaced by
 :   /// codegen.
> nit: move this sentence to end of the comment so it's standardized with sim
Done


PS3, Line 438: with 'cur_expr_values_'
> "to compare against the current row."  (since it also uses cur_expr_values_
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4326
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie353666dbb5c958f0094d169306fe930ec3014c5
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4008: don't bake in hash table and hash join pointers

2016-09-13 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4008: don't bake in hash table and hash join pointers
..


Patch Set 4: Code-Review+1

Carry Michael's +1

-- 
To view, visit http://gerrit.cloudera.org:8080/4326
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie353666dbb5c958f0094d169306fe930ec3014c5
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4122: qgen: fix bitrotted cluster unit tests

2016-09-13 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4122: qgen: fix bitrotted cluster unit tests
..


Patch Set 5: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/4404
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3e855e265ae245ebe3691d077284ac5761909e00
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Brown 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4008: don't bake in hash table and hash join pointers

2016-09-13 Thread Tim Armstrong (Code Review)
Hello Michael Ho,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4326

to look at the new patch set (#5).

Change subject: IMPALA-4008: don't bake in hash table and hash join pointers
..

IMPALA-4008: don't bake in hash table and hash join pointers

This fixes some of the cases where memory addresses are baked into
codegen'd code.

Testing:
Ran exhaustive build.

Perf:
Ran a local perf run. No significant changes. I was able to see some
small improvements on microbenchmarks.


+---+---+-++++
| Workload  | File Format   | Avg (s) | Delta(Avg) | GeoMean(s) | 
Delta(GeoMean) |

+---+---+-++++
| TPCH(_20) | parquet / none / none | 9.07| +0.46% | 5.88   | 
+0.34% |

+---+---+-++++


+---+--+---++-++++-+---+
| Workload  | Query| File Format   | Avg(s) | Base Avg(s) | 
Delta(Avg) | StdDev(%)  | Base StdDev(%) | Num Clients | Iters |

+---+--+---++-++++-+---+
| TPCH(_20) | TPCH-Q2  | parquet / none / none | 2.12   | 1.89|   
+12.29%  | * 10.85% * | * 20.30% * | 1   | 10|
| TPCH(_20) | TPCH-Q13 | parquet / none / none | 9.84   | 9.34|   
+5.39%   |   9.01%|   3.79%| 1   | 10|
| TPCH(_20) | TPCH-Q17 | parquet / none / none | 14.61  | 14.19   |   
+2.97%   |   2.15%|   1.52%| 1   | 10|
| TPCH(_20) | TPCH-Q18 | parquet / none / none | 14.76  | 14.35   |   
+2.82%   |   3.20%|   2.64%| 1   | 10|
| TPCH(_20) | TPCH-Q9  | parquet / none / none | 13.72  | 13.54   |   
+1.30%   |   1.75%|   0.70%| 1   | 10|
| TPCH(_20) | TPCH-Q8  | parquet / none / none | 5.71   | 5.64|   
+1.30%   |   1.21%|   1.23%| 1   | 10|
| TPCH(_20) | TPCH-Q19 | parquet / none / none | 47.35  | 46.75   |   
+1.28%   |   2.39%|   1.88%| 1   | 10|
| TPCH(_20) | TPCH-Q5  | parquet / none / none | 4.57   | 4.52|   
+1.20%   |   1.30%|   0.88%| 1   | 10|
| TPCH(_20) | TPCH-Q16 | parquet / none / none | 2.07   | 2.05|   
+1.12%   |   2.59%|   1.79%| 1   | 10|
| TPCH(_20) | TPCH-Q11 | parquet / none / none | 1.45   | 1.45|   
+0.15%   |   2.69%|   2.06%| 1   | 10|
| TPCH(_20) | TPCH-Q3  | parquet / none / none | 4.65   | 4.65|   
-0.09%   |   2.12%|   2.17%| 1   | 10|
| TPCH(_20) | TPCH-Q4  | parquet / none / none | 3.22   | 3.23|   
-0.26%   |   1.03%|   1.33%| 1   | 10|
| TPCH(_20) | TPCH-Q7  | parquet / none / none | 15.84  | 15.92   |   
-0.50%   |   0.91%|   1.15%| 1   | 10|
| TPCH(_20) | TPCH-Q14 | parquet / none / none | 3.29   | 3.31|   
-0.59%   |   3.31%|   1.58%| 1   | 10|
| TPCH(_20) | TPCH-Q22 | parquet / none / none | 2.65   | 2.67|   
-0.78%   |   3.03%|   1.46%| 1   | 10|
| TPCH(_20) | TPCH-Q15 | parquet / none / none | 4.50   | 4.55|   
-1.19%   |   2.87%|   2.45%| 1   | 10|
| TPCH(_20) | TPCH-Q20 | parquet / none / none | 3.84   | 3.91|   
-1.76%   |   2.20%|   1.94%| 1   | 10|
| TPCH(_20) | TPCH-Q10 | parquet / none / none | 5.58   | 5.70|   
-2.00%   |   1.01%|   1.79%| 1   | 10|
| TPCH(_20) | TPCH-Q21 | parquet / none / none | 22.84  | 23.42   |   
-2.47%   |   0.68%|   0.56%| 1   | 10|
| TPCH(_20) | TPCH-Q1  | parquet / none / none | 11.25  | 11.60   |   
-3.06%   |   0.48%|   1.74%| 1   | 10|
| TPCH(_20) | TPCH-Q12 | parquet / none / none | 3.81   | 3.98|   
-4.38%   |   1.62%|   1.14%| 1   | 10|
| TPCH(_20) | TPCH-Q6  | parquet / none / none | 1.94   | 2.04|   
-4.85%   |   2.40%|   1.58%| 1   | 10|

+---+--+---++-++++-+---+


++---+-++++
| Workload   | File Format   | Avg (s) | Delta(Avg) | 
GeoMean(s) | Delta(GeoMean) |

+---

[Impala-ASF-CR] IMPALA-4008: don't bake in hash table and hash join pointers

2016-09-13 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4008: don't bake in hash table and hash join pointers
..


Patch Set 4:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/4326/4/be/src/exec/hash-table.h
File be/src/exec/hash-table.h:

Line 308: uint8_t* ALWAYS_INLINE cur_expr_values() const { return 
cur_expr_values_; }
> let's add a short comment here too:
Done


PS4, Line 310: .
> for the current row.
Done


PS4, Line 400: into
> This one should have stayed "in" (or "of"), right? i.e. it doesn't write *e
Done


Line 427:   uint32_t HashVariableLenRow(uint8_t* expr_values, uint8_t* 
expr_values_null) const;
> how about making these const as well.
Done


Line 441:   TupleRow* build_row, uint8_t* expr_values, uint8_t* 
expr_values_null) const;
> these too could be const to clarify they are "in" params.
Done. Also made the various TupleRow arguments const too for consistency. This 
all led to a few additional consts.


-- 
To view, visit http://gerrit.cloudera.org:8080/4326
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie353666dbb5c958f0094d169306fe930ec3014c5
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4008: don't bake in hash table and hash join pointers

2016-09-13 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4008: don't bake in hash table and hash join pointers
..


Patch Set 6: Code-Review+2

rebase

-- 
To view, visit http://gerrit.cloudera.org:8080/4326
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie353666dbb5c958f0094d169306fe930ec3014c5
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4118: extract encryption utils from BufferedBlockMgr

2016-09-13 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded a new patch set (#6).

Change subject: IMPALA-4118: extract encryption utils from BufferedBlockMgr
..

IMPALA-4118: extract encryption utils from BufferedBlockMgr

As groundwork for IMPALA-4118, extract encryption and integrity
verification routines into a separate module that can be used
by the new BufferPool.

Simplify the logic in BufferedBlockMgr by not distinguishing between the
integrity check and encryption options, which are controlled by the same
command line flag anyway.

This patch changes how the OpenSSL RNG is seeded. I consulted with the
original author of the code (Michael Yoder), who suggested that the new
approach would be appropriate: "I'd pick whatever implementation is
easiest for you. You could add entropy once per query, or once every X
keys (100 keys seems reasonable), or once every "convenient place". 4kb
is probably overkill, you could use 512b for example."

Testing:
Added some unit tests for the utilities. We already have coverage of the
BufferedBlockMgr with encryption in buffered-block-mgr-test. Also reduce
the number of iterations in the buffered-block-mgr-test to save some
testing time.

Change-Id: Ibebe431f69c3c569cbff68171beaa32ef2ab03b6
---
M be/src/common/init.cc
M be/src/runtime/buffered-block-mgr-test.cc
M be/src/runtime/buffered-block-mgr.cc
M be/src/runtime/buffered-block-mgr.h
M be/src/util/CMakeLists.txt
A be/src/util/openssl-util-test.cc
A be/src/util/openssl-util.cc
A be/src/util/openssl-util.h
8 files changed, 461 insertions(+), 209 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/89/4389/6
-- 
To view, visit http://gerrit.cloudera.org:8080/4389
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibebe431f69c3c569cbff68171beaa32ef2ab03b6
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-14 Thread Tim Armstrong (Code Review)
Hello Alex Behm,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3873

to look at the new patch set (#15).

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..

IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

The main outcome of this patch is to split out PhjBuilder from
PartitionedHashJoinNode, which manages the build partitions for the
join and implements the DataSink interface.

A lot of this work is fairly mechanical refactoring: dividing the state
between the two classes and duplicating it where appropriate. One major
change is that probe partitions need to be separated from build
partitions.

This required some significant algorithmic changes to memory management
for probe partition buffers: memory management for probe partitions was
entangled with the build partitions: small buffers were allocated for
the probe partitions even for non-spilling joins and the join could
spill additional partitions during probing if the probe partitions needed
to switch from small to I/O buffers. The changes made were:
- Probe partitions are only initialized after the build is partitioned, and
  only for spilled build partitions.
- Probe partitions never use small buffers: once the initial write
  buffer is allocated, appending to the probe partition never fails.
- All probe partition allocation is done after partitioning the build
  and before processing the probe input during the same phase as hash
  table building. (Aside from NAAJ partitions which are allocated
  upfront).

The probe partition changes necessitated a change in
BufferedTupleStream: allocation of write blocks is now explicit via the
PrepareForWrite() API.

Testing:
Ran exhaustive build and local stress test.

Memory Usage:
Ran stress test binary search locally for TPC-DS SF-1 and TPC-H SF-20.
No regressions on TPC-DS. TPC-H either stayed the same or improved in
min memory requirement without spilling, but the min memory requirement
with spilling regressed in some cases. I investigated each of the
significant regressions on TPC-H and determined that they were all due
to exec nodes racing for spillable or non-spillable memory. None of them
were cases where exec nodes got their minimum reservation and failed to
execute the spilling algorithm correctly.

Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
---
M .clang-format
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/CMakeLists.txt
M be/src/exec/analytic-eval-node.cc
M be/src/exec/blocking-join-node.cc
M be/src/exec/blocking-join-node.h
M be/src/exec/data-sink.cc
M be/src/exec/hash-join-node.cc
M be/src/exec/nested-loop-join-builder.cc
M be/src/exec/nested-loop-join-builder.h
M be/src/exec/nested-loop-join-node.cc
M be/src/exec/partitioned-aggregation-node.cc
A be/src/exec/partitioned-hash-join-builder-ir.cc
A be/src/exec/partitioned-hash-join-builder.cc
A be/src/exec/partitioned-hash-join-builder.h
M be/src/exec/partitioned-hash-join-node-ir.cc
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exec/partitioned-hash-join-node.h
M be/src/exec/partitioned-hash-join-node.inline.h
M be/src/runtime/buffered-block-mgr.cc
M be/src/runtime/buffered-tuple-stream-test.cc
M be/src/runtime/buffered-tuple-stream.cc
M be/src/runtime/buffered-tuple-stream.h
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M tests/stress/concurrent_select.py
27 files changed, 2,196 insertions(+), 1,478 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/73/3873/15
-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 15
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-14 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..


Patch Set 14:

(53 comments)

http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/analytic-eval-node.cc
File be/src/exec/analytic-eval-node.cc:

Line 207: state, Substitute(PREPARE_FAILED_ERROR_MSG, "write"));
> old code and agg uses state_->block_mgr()->MemLimitTooLowError().  Any reas
Historical reasons it looks like:  the PrepareForRead() version below diverged 
from the PAGG/PHJ versions.

Switched to the block_mgr version for consistency.


http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/blocking-join-node.cc
File be/src/exec/blocking-join-node.cc:

PS14, Line 297: SCOPED_TIMER(build_timer_);
> We used to have two different timers in PHJ for measuring the time to hash 
It's not clear to me that the original timers were well thought out. It didn't 
look like there was a timer for how long it took to build the hash table - that 
was lumped into 'build_timer_'.

I added two new timers with clearer meanings that track the total hash table 
build timer and time spent partitioning rows. Also moved the remaining 
build-related timers into the Builder.

Now 'built_timer_' is the initial partitioning and hash table build, and 
'repartition_timer_' is the time spent repartitioning'.


http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/partitioned-hash-join-builder.cc
File be/src/exec/partitioned-hash-join-builder.cc:

Line 179: largest_fraction = max(largest_fraction,
> DCHECK_GT(num_build_rows, 0);
I don't think that's valid, it's casting to double to avoid crashing on the 
divide-by-zero. I changed it to avoid the calculation if num_build_rows = 0 to 
avoid that problem.


PS14, Line 318: a hash table
> hash tables
Done. Both alternatives seem grammatically correct to me though.


Line 341:   // building hash tables to avoid building hash tables for 
partitions we'll spill anyway.
> That's an interesting comment. If I understand it correctly, it means that 
Done with some slight tweaks.

This is a lot simpler than the previous behaviour, where spilling can happen 
even during probing.

I didn't measure how often, but it's not at all improbable, since the probe 
buffers are 8MB each and we could allocate up to 16 of them.


Line 401:   RETURN_IF_ERROR(SpillPartition());
> May help to document that failure to find any partition to spill (e.g. all 
I added to the comment up the top of the loop to make the termination more 
explicit.


PS14, Line 529: PhjBuilder::HashTableStoresNulls()
> This seems to belong to PartitionedHashJoinNode conceptually.
I feel like it makes sense here though since the builder owns the hash tables.

Plumbing-wise it's also easier since PhjBuilder needs this info and starts 
executing before PartitionedHashJoinNode.


Line 646:   do {
> Please see comments in BlockingJoinNode,  it would be great to retain timer
Done


PS14, Line 782: process_build_batch_fn_level0
> 'process_build_batch_fn_level0_'
I just copied this verbatim - it seems to be referring to the local variable 
though.


http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/partitioned-hash-join-builder.h
File be/src/exec/partitioned-hash-join-builder.h:

Line 425: 
> nit: unnecessary blank line ?
Done


http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

PS14, Line 151: 
  : 
  : 
> This may be important information to retain.
Not sure why I removed it. May have had a good reason but I can't recall it.


PS14, Line 87: Expr::CreateExprTree(pool_, eq_join_conjunct.right
> Does this overlap with CreateExprTree() in PhjBuilder::Init() ? Same questi
They are redundant currently, but the idea is that PhjBuilder and PhjNode will 
exist more independently in different plan fragments, so I wanted to 
encapsulate the state where possible.

The expr contexts need to be independent to eventually support many probe sides 
: 1 build side for broadcast joins.

I think documenting this explicitly may be confusing because it is sort-of 
explaining the current state of things relative to the previous way of doing 
things. Whereas I think with the separate build side, the default assumption is 
that builder data structures are private and not shared with the exec node.

It may make sense to share Expr trees globally between fragment instances as 
part of the multithreading changes, but I don't think it's worth trying to 
share them here until we have the right infrastructure.


Line 213:   for (unique_ptr& partition : spilled_partitions_)
> missing { }
Ah, missed that clang-format wrapped it.


PS14, Line 494: to output
> outputting
Done


PS14, Line 585: hash_partitions_
> 'hash_partitions_'
Done


Line 594:   continue;
> Is there a reason why we cannot call OutputUnmatchedBuild() directly at thi
I don't

[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-14 Thread Tim Armstrong (Code Review)
Hello Alex Behm,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3873

to look at the new patch set (#16).

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..

IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

The main outcome of this patch is to split out PhjBuilder from
PartitionedHashJoinNode, which manages the build partitions for the
join and implements the DataSink interface.

A lot of this work is fairly mechanical refactoring: dividing the state
between the two classes and duplicating it where appropriate. One major
change is that probe partitions need to be separated from build
partitions.

This required some significant algorithmic changes to memory management
for probe partition buffers: memory management for probe partitions was
entangled with the build partitions: small buffers were allocated for
the probe partitions even for non-spilling joins and the join could
spill additional partitions during probing if the probe partitions needed
to switch from small to I/O buffers. The changes made were:
- Probe partitions are only initialized after the build is partitioned, and
  only for spilled build partitions.
- Probe partitions never use small buffers: once the initial write
  buffer is allocated, appending to the probe partition never fails.
- All probe partition allocation is done after partitioning the build
  and before processing the probe input during the same phase as hash
  table building. (Aside from NAAJ partitions which are allocated
  upfront).

The probe partition changes necessitated a change in
BufferedTupleStream: allocation of write blocks is now explicit via the
PrepareForWrite() API.

Testing:
Ran exhaustive build and local stress test.

Memory Usage:
Ran stress test binary search locally for TPC-DS SF-1 and TPC-H SF-20.
No regressions on TPC-DS. TPC-H either stayed the same or improved in
min memory requirement without spilling, but the min memory requirement
with spilling regressed in some cases. I investigated each of the
significant regressions on TPC-H and determined that they were all due
to exec nodes racing for spillable or non-spillable memory. None of them
were cases where exec nodes got their minimum reservation and failed to
execute the spilling algorithm correctly.

Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
---
M .clang-format
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/CMakeLists.txt
M be/src/exec/analytic-eval-node.cc
M be/src/exec/blocking-join-node.cc
M be/src/exec/blocking-join-node.h
M be/src/exec/data-sink.cc
M be/src/exec/hash-join-node.cc
M be/src/exec/nested-loop-join-builder.cc
M be/src/exec/nested-loop-join-builder.h
M be/src/exec/nested-loop-join-node.cc
M be/src/exec/partitioned-aggregation-node.cc
A be/src/exec/partitioned-hash-join-builder-ir.cc
A be/src/exec/partitioned-hash-join-builder.cc
A be/src/exec/partitioned-hash-join-builder.h
M be/src/exec/partitioned-hash-join-node-ir.cc
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exec/partitioned-hash-join-node.h
M be/src/exec/partitioned-hash-join-node.inline.h
M be/src/runtime/buffered-block-mgr.cc
M be/src/runtime/buffered-tuple-stream-test.cc
M be/src/runtime/buffered-tuple-stream.cc
M be/src/runtime/buffered-tuple-stream.h
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M tests/stress/concurrent_select.py
27 files changed, 2,196 insertions(+), 1,476 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/73/3873/16
-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 16
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-14 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..


Patch Set 16:

Rebased onto master

-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 16
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-15 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..


Patch Set 14:

(2 comments)

Before I responded to individual comments I wanted to point out that the 
coupling is now one way - the builder never refers to PhjNode aside from 
metadata provided via the constructor (you can verify there are no references 
to partitioned-hash-join-node.h or PartitionedHashJoin node in the builder 
code).

At this point it should only require minor changes to this hash join code to 
create the builder in a separate fragment *before* the join node, provided you 
have a 1:1 mapping from builder to node. 1:n broadcast joins with no spilling 
would require some more changes, but is a lot closer.

I chose this stopping point because there was a clear physical separation 
between the classes, even if the spilling algorithm has some logical coupling 
and assumes a 1:1 mapping from node to builder. 

Reducing the logical coupling would require significant changes to the spilling 
algorithm that is driven from PartitionedHashJoin::GetNext(), which I think 
should be deferred until there is some consensus about what spilling for 1:n 
broadcast joins with multithreading might look like.

http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/partitioned-hash-join-builder.h
File be/src/exec/partitioned-hash-join-builder.h:

Line 123:   bool HashTableStoresNulls() const;
> why is this here rather than the join node? especially given that the join 
Both classes already have their own hash table context.


Line 128:   inline const std::vector& is_not_distinct_from() const {
> and then should this be here? is it to be closer to being able to break the
There isn't a parent link - all the necessary metadata about the join is passed 
in when the builder is initialised.


-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 14
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3308: Get expr-test passing on PPC64LE

2016-09-15 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3308: Get expr-test passing on PPC64LE
..


Patch Set 2:

Any luck with this?

-- 
To view, visit http://gerrit.cloudera.org:8080/4186
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4504ee6a52a085f530aadfcfa009bacb83c64787
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: segelyang 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: segelyang 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4026: Implement double-buffering for BlockingQueue.

2016-09-15 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4026: Implement double-buffering for BlockingQueue.
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4350/3/be/src/util/blocking-queue.h
File be/src/util/blocking-queue.h:

Line 99:   bool BlockingPut(const T& val) {
> Yes, I have thought about it before. I have tested various locations for no
We discussed this a little offline and I feel like I understand why option (3) 
works in practice now - when N producers are outpacing consumers, this switches 
to a mode where the steady state is a nearly empty queue with k producers 
sitting blocked and N - k producers working. Essentially there are some 
additional elements in the queue because the blocked producers also have some 
row batches ready to add to the queue immediately.

I'm ok moving forward with it so long as we document the expected behaviour. 
Perhaps it's adequate to document the expect behaviour with a faster consumer, 
matched producer/consumer and fast producers.

Have you benchmarked this with concurrent queries or workloads with many 
concurrent scans? One concern with the empty queue is that if the queue is 
empty and a producer isn't scheduled immediately, the consumer may end up 
waiting on the condition variable for every batch and potentially getting 
swapped it. We could perhaps work around that if the producer added its item to 
the queue *before* blocking, so that the the consumer can get the item without 
requiring a context switch.


-- 
To view, visit http://gerrit.cloudera.org:8080/4350
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib9f4cf351455efefb0f3bb791cf9bc82d1421d54
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Chen Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4074: Configuration items duplicate in template of YARN

2016-09-15 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4074: Configuration items duplicate in template of YARN
..


Patch Set 1:

I'm going to run tests and merge this.

-- 
To view, visit http://gerrit.cloudera.org:8080/4311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I81d6d019d4982cb35932b1d45c376b215ec5bcc6
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Yuanhao Luo 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Yuanhao Luo 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4074: Configuration items duplicate in template of YARN

2016-09-15 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4074: Configuration items duplicate in template of YARN
..


Patch Set 2: Code-Review+2

Rebase

-- 
To view, visit http://gerrit.cloudera.org:8080/4311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I81d6d019d4982cb35932b1d45c376b215ec5bcc6
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Yuanhao Luo 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Yuanhao Luo 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4008: Don't bake ExprContext pointers into PAGG/AGG IR code

2016-09-15 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4008: Don't bake ExprContext pointers into PAGG/AGG IR 
code
..


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/4390/2/be/src/exec/aggregation-node.cc
File be/src/exec/aggregation-node.cc:

Line 84: // Use NULL if the aggregate evaluator is not codegen'ed or if 
there is no
It might be simpler just to always set to NULL if input_expr_ctxs is empty. We 
lose some cross-validation but I feel like the simple code would be nice.


http://gerrit.cloudera.org:8080/#/c/4390/2/be/src/exec/aggregation-node.h
File be/src/exec/aggregation-node.h:

Line 82:   std::vector agg_expr_ctxs_;
Mention that the ExprContexts are owned by aggregate_evaluators_.


http://gerrit.cloudera.org:8080/#/c/4390/2/be/src/exec/partitioned-aggregation-node.cc
File be/src/exec/partitioned-aggregation-node.cc:

Line 158: ExprContext* agg_expr_ctx = NULL;
Again, I think it would be simpler to just check if it's empty. Or add a method 
to AggFnEvaluator like HasInputExprContexts()


http://gerrit.cloudera.org:8080/#/c/4390/2/be/src/exec/partitioned-aggregation-node.h
File be/src/exec/partitioned-aggregation-node.h:

Line 202:   /// ExprContexts of 'aggregate_evaluators_'. Used in the codegen'ed 
version of
Also document ownership here.


Line 205:   std::vector agg_expr_ctxs_;
What do you think about making this vector>? 

Each pair is just a struct with two pointers so should be simple to access from 
the IR.

This would better document the correspondence between the two vectors and 
result in one fewer argument being threaded through everywhere.

I'm mostly ok with the current approach but the extra argument is a bit 
annoying.


http://gerrit.cloudera.org:8080/#/c/4390/2/be/src/exprs/agg-fn-evaluator.h
File be/src/exprs/agg-fn-evaluator.h:

Line 192:   std::vector input_expr_ctxs_;
Can you add your comment here about when this is empty/non-empty?


-- 
To view, visit http://gerrit.cloudera.org:8080/4390
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I42039eed803a39fa716b9ed647510b6440974ae5
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4138: Fix AcquireState() for batches that have MarkCapacity() called

2016-09-15 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4138: Fix AcquireState() for batches that have 
MarkCapacity() called
..


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/4428/1/be/src/runtime/row-batch-test.cc
File be/src/runtime/row-batch-test.cc:

Line 38:   vector nullable_tuples(1, false);
This is ok, but I find the c++11 initialisers for vectors a bit nicer. e.g. 
nullable_tuples({false})


PS1, Line 39: TTupleId
This should be a static_cast according to our coding standard.


Line 60: ASSERT_DEBUG_DEATH(bad_dest.AcquireState(&src), "");
> Will change to IMPALA_ASSERT_DEBUG_DEATH when related patch lands.
We should make sure to merge that first, otherwise we will have a lot of core 
dumps.


http://gerrit.cloudera.org:8080/#/c/4428/1/be/src/runtime/row-batch.cc
File be/src/runtime/row-batch.cc:

Line 319:   capacity_ = tuple_ptrs_size_ / (num_tuples_per_row_ * 
sizeof(Tuple*));
We already know the initial capacity using this calculation - we should compute 
it the same way in both places. Maybe just factor this out into an 
InitialCapacity() function?


-- 
To view, visit http://gerrit.cloudera.org:8080/4428
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6ceca53c406b05cd04b7d95a4f9f2eec7bc127f5
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-15 Thread Tim Armstrong (Code Review)
Hello Alex Behm,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3873

to look at the new patch set (#17).

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..

IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

The main outcome of this patch is to split out PhjBuilder from
PartitionedHashJoinNode, which manages the build partitions for the
join and implements the DataSink interface.

A lot of this work is fairly mechanical refactoring: dividing the state
between the two classes and duplicating it where appropriate. One major
change is that probe partitions need to be separated from build
partitions.

This required some significant algorithmic changes to memory management
for probe partition buffers: memory management for probe partitions was
entangled with the build partitions: small buffers were allocated for
the probe partitions even for non-spilling joins and the join could
spill additional partitions during probing if the probe partitions needed
to switch from small to I/O buffers. The changes made were:
- Probe partitions are only initialized after the build is partitioned, and
  only for spilled build partitions.
- Probe partitions never use small buffers: once the initial write
  buffer is allocated, appending to the probe partition never fails.
- All probe partition allocation is done after partitioning the build
  and before processing the probe input during the same phase as hash
  table building. (Aside from NAAJ partitions which are allocated
  upfront).

The probe partition changes necessitated a change in
BufferedTupleStream: allocation of write blocks is now explicit via the
PrepareForWrite() API.

Testing:
Ran exhaustive build and local stress test.

Memory Usage:
Ran stress test binary search locally for TPC-DS SF-1 and TPC-H SF-20.
No regressions on TPC-DS. TPC-H either stayed the same or improved in
min memory requirement without spilling, but the min memory requirement
with spilling regressed in some cases. I investigated each of the
significant regressions on TPC-H and determined that they were all due
to exec nodes racing for spillable or non-spillable memory. None of them
were cases where exec nodes got their minimum reservation and failed to
execute the spilling algorithm correctly.

Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
---
M .clang-format
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/CMakeLists.txt
M be/src/exec/analytic-eval-node.cc
M be/src/exec/blocking-join-node.cc
M be/src/exec/blocking-join-node.h
M be/src/exec/data-sink.cc
M be/src/exec/hash-join-node.cc
M be/src/exec/nested-loop-join-builder.cc
M be/src/exec/nested-loop-join-builder.h
M be/src/exec/nested-loop-join-node.cc
M be/src/exec/partitioned-aggregation-node.cc
A be/src/exec/partitioned-hash-join-builder-ir.cc
A be/src/exec/partitioned-hash-join-builder.cc
A be/src/exec/partitioned-hash-join-builder.h
M be/src/exec/partitioned-hash-join-node-ir.cc
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exec/partitioned-hash-join-node.h
M be/src/exec/partitioned-hash-join-node.inline.h
M be/src/runtime/buffered-block-mgr.cc
M be/src/runtime/buffered-tuple-stream-test.cc
M be/src/runtime/buffered-tuple-stream.cc
M be/src/runtime/buffered-tuple-stream.h
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M tests/stress/concurrent_select.py
27 files changed, 2,244 insertions(+), 1,491 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/73/3873/17
-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 17
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-15 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..


Patch Set 14:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/partitioned-hash-join-builder.h
File be/src/exec/partitioned-hash-join-builder.h:

PS14, Line 45: build row partitions
> is this the same as the "hash partitions" referred to in the next paragraph
really the streams, reworded.


Line 86: 
> is it expected that the renaming public functions would only be used by PHJ
Done


PS14, Line 90: Acquire ownership
> Doesn't the ownership pass from this class to the exec node?  In which case
It passes from the class to the caller. I guess I don't find this ambiguous 
given the return type and can't think of a non-awkward alternative.


PS14, Line 95: Called after probing of the partitions
 :   /// is done. The partitions are not closed or destroyed, since 
they may be spilled or
 :   /// may contain unmatched build rows for certain join modes 
(e.g. right outer join).
> The fact that we need some much explanation kinda makes this seem like it m
The hash join node uses this to determine whether it's already cleaned up the 
hash partitions, and to keep the build/probe hash partition vectors in sync.

Agree this isn't the interface we ultimately want but I wanted to avoid 
additional restructuring of GetNext()/CleanupHashPartitions().


Line 106:   }
> nit: missing blank line
Done


PS14, Line 109: When this call returns
> Is this stuff by this method or the caller?  i.e. if this method closes inp
Done


PS14, Line 112: so that the probe phase has enough buffers preallocated
  :   /// execute successfully.
> slightly garbled. seems like a word is missing or something.
Done


PS14, Line 117: /// Iterates over all the partitions in 'hash_partitions_' and 
returns the largest
  :   /// number of build rows.
> rather than say how and referring to private member, and leaving it to the 
Done


Line 123:   bool HashTableStoresNulls() const;
> Both classes already have their own hash table context.
Also, I think this makes sense to have here since the builder manages the hash 
tables.


PS14, Line 182: partition side of this partition
> not sure what this is saying. maybe just "for this build partition"?
Done


Line 188: /// false.
> maybe clarify that an error is not returned in that case.  or clarify what 
Done


PS14, Line 195: unpin_all_build
> why "build"?  shouldn't it be unpin_all_blocks?  or even better would be to
Overall these bool arguments are really confusing, and the correctness of the 
spilling depends a lot on using the right value in the right place. I ended up 
changing this significantly to use an enum all the way down to the underlying 
BufferedTupleStream method, and adding this to SpillPartition(). It turns out 
there was a bug in BuildHashTablesAndPrepareProbeStreams() where it should have 
been unpinning all of the blocks in the partition it chose to spill but wasn't. 

I'm rerunning the memory experiment just to confirm this doesn't change 
anything.


PS14, Line 234: build or probe
> does this care about the probe side partitioning?
Yes, since we share the reservation and hand off the probe buffers.


PS14, Line 244: all hash partitions for partitioning level
> this is kinda misleading. maybe qualify with ... for the current input, or 
Done


PS14, Line 252: in
> into
Done


PS14, Line 256: in memory
> what does this mean? is it trying to say it fits in the current (pinned) bl
Reworded it - I think it makes more sense to think of it as whether appending 
to the stream succeeds (which may advance the block) or whether we have to 
start spilling things.


-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 14
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4138: Fix AcquireState() for batches that have MarkCapacity() called

2016-09-15 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4138: Fix AcquireState() for batches that have 
MarkCapacity() called
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4428/1/be/src/runtime/row-batch.cc
File be/src/runtime/row-batch.cc:

Line 319:   capacity_ = tuple_ptrs_size_ / (num_tuples_per_row_ * 
sizeof(Tuple*));
> What about just setting capacity_ = initial_capacity_ here?
I don't feel strongly either way, but when this added I switched from the 
initial_capacity_ approach to the current approach to avoid redundant state:

https://gerrit.cloudera.org/#/c/1399/8/be/src/runtime/row-batch.h@301


-- 
To view, visit http://gerrit.cloudera.org:8080/4428
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6ceca53c406b05cd04b7d95a4f9f2eec7bc127f5
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-15 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..


Patch Set 17:

The memory usage experiment results look good (no major changes)

-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 17
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4138: Fix AcquireState() for batches that have MarkCapacity() called

2016-09-15 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4138: Fix AcquireState() for batches that have 
MarkCapacity() called
..


Patch Set 2: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4428/2/be/src/runtime/row-batch.h
File be/src/runtime/row-batch.h:

PS2, Line 282: MarkCapacity
MarkAtCapacity I think


-- 
To view, visit http://gerrit.cloudera.org:8080/4428
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6ceca53c406b05cd04b7d95a4f9f2eec7bc127f5
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4111: backend death tests should not produce minidumps

2016-09-15 Thread Tim Armstrong (Code Review)
Hello Matthew Jacobs, Dan Hecht,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4353

to look at the new patch set (#5).

Change subject: IMPALA-4111: backend death tests should not produce minidumps
..

IMPALA-4111: backend death tests should not produce minidumps

Move the existing core dump disabler into a shared utility header and
also disable minidumps too. Add a macro to simplify use of the disabler.

Change-Id: I33037f4f77c8188fc2ec46a77083a0a64f6ea404
---
A be/src/testutil/death-test-util.h
M be/src/util/minidump.cc
M be/src/util/minidump.h
M be/src/util/promise-test.cc
4 files changed, 83 insertions(+), 23 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/53/4353/5
-- 
To view, visit http://gerrit.cloudera.org:8080/4353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I33037f4f77c8188fc2ec46a77083a0a64f6ea404
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-4111: backend death tests should not produce minidumps

2016-09-15 Thread Tim Armstrong (Code Review)
Hello Matthew Jacobs, Dan Hecht,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4353

to look at the new patch set (#6).

Change subject: IMPALA-4111: backend death tests should not produce minidumps
..

IMPALA-4111: backend death tests should not produce minidumps

Move the existing core dump disabler into a shared utility header and
also disable minidumps too. Add a macro to simplify use of the disabler.

Change-Id: I33037f4f77c8188fc2ec46a77083a0a64f6ea404
---
A be/src/testutil/death-test-util.h
M be/src/util/minidump.cc
M be/src/util/minidump.h
M be/src/util/promise-test.cc
4 files changed, 83 insertions(+), 24 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/53/4353/6
-- 
To view, visit http://gerrit.cloudera.org:8080/4353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I33037f4f77c8188fc2ec46a77083a0a64f6ea404
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-4111: backend death tests should not produce minidumps

2016-09-15 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4111: backend death tests should not produce minidumps
..


Patch Set 6: Code-Review+2

rebase

-- 
To view, visit http://gerrit.cloudera.org:8080/4353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I33037f4f77c8188fc2ec46a77083a0a64f6ea404
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3308: Get expr-test passing on PPC64LE

2016-09-19 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3308: Get expr-test passing on PPC64LE
..


Patch Set 3:

(2 comments)

No problems, just wanted to make sure that this patch made it it.

http://gerrit.cloudera.org:8080/#/c/4186/3/be/src/exprs/expr-value.h
File be/src/exprs/expr-value.h:

Line 70:   /// Init for string values
Can you remove this comment? It's not helpful (I know you didn't add it :)).


Line 71:   void Init(const std::string& str){
Can you fix the missing whitespace in this line and the one below. Probably the 
easiest is to run clang-format on the patch (we recently added a .clang-format 
file). This command will run clang-format on the lines changed in the most 
recent commit on your branch:

  git diff -U0 HEAD^  | 
$IMPALA_TOOLCHAIN/llvm-3.8.0-p1/share/clang/clang-format-diff.py -i -p1


-- 
To view, visit http://gerrit.cloudera.org:8080/4186
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4504ee6a52a085f530aadfcfa009bacb83c64787
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: segelyang 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: segelyang 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4026: Implement double-buffering for BlockingQueue

2016-09-19 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4026: Implement double-buffering for BlockingQueue
..


Patch Set 4:

(8 comments)

I'm ok with this option if it generally provides better perf and avoids 
pathological behaviour. Generally looks good, just comment and style comments.

http://gerrit.cloudera.org:8080/#/c/4350/4/be/src/util/blocking-queue.h
File be/src/util/blocking-queue.h:

Line 36: /// if the queue is empty or full, respectively.
I think we should clarify when the unblocking happens. In BlockingGet() it is 
unblocked as soon as an item is added, but in BlockingPut() it is not 
guaranteed (since we're not calling put_cv_.NotifyOne() with the lock held).


Line 95: put_cv_.NotifyOne();
Can you comment that this notification is racy: producers may not be notified 
if the queue is partially empty.


Line 182:   /// Maximum number of elements in 'get_list_' + 'put_list_'.
clarify as "Maximum total number" - if a reader wasn't careful they might thing 
it meant the maximum per list.


PS4, Line 190: > >
nit: we're generally using >> now that it works in c++11


http://gerrit.cloudera.org:8080/#/c/4350/4/be/src/util/condition-variable.h
File be/src/util/condition-variable.h:

Line 29: /// This has lower overhead than boost's implementation as it
Nit: lines are wrapped very short here


PS4, Line 31: CACHE_ALIGNED
Nit: here and in the other places: does it work to put this on the line before 
'class'? That seems a little more readable to me.


Line 43:   bool inline TimedWait(boost::unique_lock& lock,
What does the return value mean?


Line 50:   void inline NotifyOne() { pthread_cond_signal(&cv_); }
Nit: here and above, we usually put 'inline' before the return type.


-- 
To view, visit http://gerrit.cloudera.org:8080/4350
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib9f4cf351455efefb0f3bb791cf9bc82d1421d54
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Chen Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3308: Get expr-test passing on PPC64LE

2016-09-19 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3308: Get expr-test passing on PPC64LE
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4186/3/be/src/exprs/expr-value.h
File be/src/exprs/expr-value.h:

PS3, Line 73: string_val.ptr = const_cast(string_data.data());
> Does this have the possibility of creating Undefined Behavior?
The c++11 standard guarantees that string storage is contiguous. Should be fine 
as long as the string isn't modified later.

Let's change this to string_val.ptr = &string_data[0] to avoid the const_cast.


-- 
To view, visit http://gerrit.cloudera.org:8080/4186
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4504ee6a52a085f530aadfcfa009bacb83c64787
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: segelyang 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: segelyang 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3308: Get expr-test passing on PPC64LE

2016-09-19 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3308: Get expr-test passing on PPC64LE
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4186/3/be/src/exprs/expr-value.h
File be/src/exprs/expr-value.h:

PS3, Line 73: string_val.ptr = const_cast(string_data.data());
> I think string_data[0] has type const char&, so I'm not sure that will work
There's a non-const char version 
http://www.cplusplus.com/reference/string/string/operator[]/. 

This function should be the only place that string_data is touched.


-- 
To view, visit http://gerrit.cloudera.org:8080/4186
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4504ee6a52a085f530aadfcfa009bacb83c64787
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: segelyang 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: segelyang 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-19 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..


Patch Set 17:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/partitioned-hash-join-builder.h
File be/src/exec/partitioned-hash-join-builder.h:

PS14, Line 90: 
> I guess one term used elsewhere is "release", right? (e.g. C++ unique_ptr).
I find release confusing, makes me think that the streams are being "released" 
from the join as a whole. I changed it to Transfer() since that seems more 
neutral about the directionality.


http://gerrit.cloudera.org:8080/#/c/3873/15/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

Line 372:   { RETURN_IF_ERROR(build_partition->BuildHashTable(&built)); }
> extraneous braces
Done


Line 395: }
> same
Done


http://gerrit.cloudera.org:8080/#/c/3873/15/be/src/exec/partitioned-hash-join-node.h
File be/src/exec/partitioned-hash-join-node.h:

PS15, Line 81: build
> probe
Done


PS15, Line 405:  
> nit: extra space
Done


http://gerrit.cloudera.org:8080/#/c/3873/15/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

Line 391:   }
> alternatively: initialize insert_pos to begin and then DCHECK_NE(insert_pos
I think I'll stay with this approach. The alternative seems a little subtle (it 
depends on us adding + 1 to it above)


-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 17
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-19 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..


Patch Set 18: Code-Review+1

Carry +_1

-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 18
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-19 Thread Tim Armstrong (Code Review)
Hello Alex Behm,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3873

to look at the new patch set (#18).

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..

IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

The main outcome of this patch is to split out PhjBuilder from
PartitionedHashJoinNode, which manages the build partitions for the
join and implements the DataSink interface.

A lot of this work is fairly mechanical refactoring: dividing the state
between the two classes and duplicating it where appropriate. One major
change is that probe partitions need to be separated from build
partitions.

This required some significant algorithmic changes to memory management
for probe partition buffers: memory management for probe partitions was
entangled with the build partitions: small buffers were allocated for
the probe partitions even for non-spilling joins and the join could
spill additional partitions during probing if the probe partitions needed
to switch from small to I/O buffers. The changes made were:
- Probe partitions are only initialized after the build is partitioned, and
  only for spilled build partitions.
- Probe partitions never use small buffers: once the initial write
  buffer is allocated, appending to the probe partition never fails.
- All probe partition allocation is done after partitioning the build
  and before processing the probe input during the same phase as hash
  table building. (Aside from NAAJ partitions which are allocated
  upfront).

The probe partition changes necessitated a change in
BufferedTupleStream: allocation of write blocks is now explicit via the
PrepareForWrite() API.

Testing:
Ran exhaustive build and local stress test.

Memory Usage:
Ran stress test binary search locally for TPC-DS SF-1 and TPC-H SF-20.
No regressions on TPC-DS. TPC-H either stayed the same or improved in
min memory requirement without spilling, but the min memory requirement
with spilling regressed in some cases. I investigated each of the
significant regressions on TPC-H and determined that they were all due
to exec nodes racing for spillable or non-spillable memory. None of them
were cases where exec nodes got their minimum reservation and failed to
execute the spilling algorithm correctly.

Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
---
M .clang-format
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/CMakeLists.txt
M be/src/exec/analytic-eval-node.cc
M be/src/exec/blocking-join-node.cc
M be/src/exec/blocking-join-node.h
M be/src/exec/data-sink.cc
M be/src/exec/hash-join-node.cc
M be/src/exec/nested-loop-join-builder.cc
M be/src/exec/nested-loop-join-builder.h
M be/src/exec/nested-loop-join-node.cc
M be/src/exec/partitioned-aggregation-node.cc
A be/src/exec/partitioned-hash-join-builder-ir.cc
A be/src/exec/partitioned-hash-join-builder.cc
A be/src/exec/partitioned-hash-join-builder.h
M be/src/exec/partitioned-hash-join-node-ir.cc
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exec/partitioned-hash-join-node.h
M be/src/exec/partitioned-hash-join-node.inline.h
M be/src/runtime/buffered-block-mgr.cc
M be/src/runtime/buffered-tuple-stream-test.cc
M be/src/runtime/buffered-tuple-stream.cc
M be/src/runtime/buffered-tuple-stream.h
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M tests/stress/concurrent_select.py
27 files changed, 2,242 insertions(+), 1,491 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/73/3873/18
-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 18
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-1430,IMPALA-4878,IMPALA-4879: codegen native UDAs

2017-02-08 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-1430,IMPALA-4878,IMPALA-4879: codegen native UDAs
..


Patch Set 19:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5161/18/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

Line 57:   private final ArrayList aggFnArgTypes_;
> Is this change a nice add-on or is it required for this patch?
The UDA codegen and FunctionContext::Get*Type() semantic fixes are somewhat 
independent but touch the same code so it was easiest to combine them. This is 
needed for GetArgType() to return the correct input expr time.

We discussed the approach offline. It's the most sane way to plumb it through 
(otherwise the merge aggregation has to go and search through the plan tree for 
the original AggregateInfo when it is generating the thrift descriptors, which 
would violate multiple abstractions).

Alex found an example where the types do indeed get out of sync:

  select typeof(sum(f + d)) from (select float_col f, 1.2 d from 
functional.alltypes) v;

However, in this case (and all cases I can think of) , the intermediate and 
output tuple descriptors are also inconsistent, which is detected by a 
precondition:

  ERROR: IllegalStateException: Agg expr sum(float_col + 1.2) returns type 
DOUBLE but its output tuple slot has type DECIMAL(38,9)

I added an additional precondition to check for the types getting out of sync. 
I verified that it's hit if I comment out the earlier preconditions:

  ERROR: IllegalStateException: Agg expr sum:merge(f + d) arg type 
DECIMAL(38,9) differs from input expr type DOUBLE in original expr 
sum(float_col + 1.2)

So we need to fix the underlying bug still (which actually stems from the 
decimal type resolution algorithm) but the behaviour won't be a regression.


Line 74:   private FunctionCallExpr(FunctionName fnName, FunctionParams params,
> don't need this anymore, right?
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5161
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id1708eaa96eb76fb9bec5eeabf209f81c88eec2f
Gerrit-PatchSet: 19
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-1430,IMPALA-4878,IMPALA-4879: codegen native UDAs

2017-02-08 Thread Tim Armstrong (Code Review)
Hello Impala Public Jenkins, Michael Ho, Dan Hecht,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/5161

to look at the new patch set (#19).

Change subject: IMPALA-1430,IMPALA-4878,IMPALA-4879: codegen native UDAs
..

IMPALA-1430,IMPALA-4878,IMPALA-4879: codegen native UDAs

This uses the existing infrastructure for codegening builtin UDAs and
for codegening calls to UDFs. GetUdf() is refactored to support both
UDFs and UDAs.

IR UDAs are still not allowed by the frontend. It's unclear if we want
to enable them going forward because of the difficulties in testing and
supporting IR UDFs/UDAs.

This also fixes some bugs with the Get*Type() methods of
FunctionContext. GetArgType() and related methods now always return the
logical input types of the aggregate function. Getting the tests to pass
required fixing IMPALA-4878 because they called GetIntermediateType().

Testing:
test_udfs.py tests UDAs with codegen enabled and disabled.

Added some asserts to test UDAs to check that the correct types are
passed in via the FunctionContext.

Change-Id: Id1708eaa96eb76fb9bec5eeabf209f81c88eec2f
---
M be/src/codegen/codegen-anyval.cc
M be/src/codegen/codegen-anyval.h
M be/src/codegen/llvm-codegen.cc
M be/src/codegen/llvm-codegen.h
M be/src/exec/partitioned-aggregation-node.cc
M be/src/exprs/agg-fn-evaluator.cc
M be/src/exprs/agg-fn-evaluator.h
M be/src/exprs/anyval-util.cc
M be/src/exprs/anyval-util.h
M be/src/exprs/expr.cc
M be/src/exprs/expr.h
M be/src/exprs/scalar-fn-call.cc
M be/src/exprs/scalar-fn-call.h
M be/src/exprs/timestamp-functions.cc
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/testutil/test-udas.cc
M be/src/testutil/test-udas.h
M be/src/udf/udf-internal.h
M be/src/udf/udf-ir.cc
M be/src/udf/udf.h
M common/thrift/Exprs.thrift
M fe/src/main/java/org/apache/impala/analysis/AggregateInfo.java
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M testdata/workloads/functional-query/queries/QueryTest/uda.test
M tests/query_test/test_udfs.py
26 files changed, 496 insertions(+), 237 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/61/5161/19
-- 
To view, visit http://gerrit.cloudera.org:8080/5161
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id1708eaa96eb76fb9bec5eeabf209f81c88eec2f
Gerrit-PatchSet: 19
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-3748: Part 1: Clean up resource estimation in planner

2017-02-08 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded a new patch set (#5).

Change subject: IMPALA-3748: Part 1: Clean up resource estimation in planner
..

IMPALA-3748: Part 1: Clean up resource estimation in planner

This is in preparation to use this code to compute the minimum
reservation.

The cleanup restructures the code slightly so that it's clearer whether
resource estimates are valid or invalid. It also removes the unused
VCore estimates.

Fixes various bugs and other issues:
* computeCosts() was not called for unpartitioned fragments, so
  the per-operator memory estimate was not visible.
* Nested loop join was not treated as a blocking join.
* The TODO comment about union was misleading
* The logic does not work for mt_dop > 1 because it conflates fragment
  instances with hosts. Fixing this requires identifying places where
  we want per-instance estimates instead of per-host.

I left one bug unfixed because it is subtle and will be easier to review
in isolation: IMPALA-4862.

There is some remaining questionable behaviour I left untouched:
* It's unclear why unpartitioned fragments are always excluded from
  total memory consumption.
* Many operators do not have estimates or have questionable heuristics.

Testing:
Re-enabled the explain_level tests, which appear to be the only
coverage for many of these estimates. Removed the complex and
brittle test cases and replaced with a couple of much simpler
end-to-end tests and a number of planner test cases for
memory estimates in both the MT and non-MT cases.

Change-Id: I1e358182bcf2bc5fe5c73883eb97878735b12d37
---
M be/src/service/query-exec-state.cc
M common/thrift/Frontend.thrift
M docs/shared/impala_common.xml
M docs/topics/impala_explain.xml
M docs/topics/impala_explain_level.xml
M docs/topics/impala_explain_plan.xml
M docs/topics/impala_hbase.xml
M docs/topics/impala_known_issues.xml
M docs/topics/impala_optimize_partition_key_scans.xml
M docs/topics/impala_perf_joins.xml
M docs/topics/impala_perf_stats.xml
M fe/src/main/java/org/apache/impala/common/PrintUtils.java
M fe/src/main/java/org/apache/impala/planner/AggregationNode.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DataSink.java
M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java
M fe/src/main/java/org/apache/impala/planner/DataStreamSink.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/EmptySetNode.java
M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java
M fe/src/main/java/org/apache/impala/planner/HBaseTableSink.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M fe/src/main/java/org/apache/impala/planner/JoinBuildSink.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/KuduTableSink.java
M fe/src/main/java/org/apache/impala/planner/NestedLoopJoinNode.java
M fe/src/main/java/org/apache/impala/planner/PipelinedPlanNodeSet.java
M fe/src/main/java/org/apache/impala/planner/PlanFragment.java
M fe/src/main/java/org/apache/impala/planner/PlanNode.java
M fe/src/main/java/org/apache/impala/planner/PlanRootSink.java
M fe/src/main/java/org/apache/impala/planner/Planner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M fe/src/main/java/org/apache/impala/planner/UnionNode.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/kudu-selectivity.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test
A 
testdata/workloads/functional-planner/queries/PlannerTest/resource-estimates.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level0.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level1.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level2.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level3.test
M tests/custom_cluster/test_admission_controller.py
M tests/metadata/test_explain.py
47 files changed, 1,544 insertions(+), 1,454 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/5847/5
-- 
To view, visit http://gerrit.cloudera.org:8080/5847
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1e358182bcf2bc5fe5c73883eb97878735b12d37
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-3748: Part 1: Clean up resource estimation in planner

2017-02-08 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3748: Part 1: Clean up resource estimation in planner
..


Patch Set 4:

(5 comments)

Some minor cleanup and tests fixes

http://gerrit.cloudera.org:8080/#/c/5847/4/fe/src/main/java/org/apache/impala/planner/AggregationNode.java
File fe/src/main/java/org/apache/impala/planner/AggregationNode.java:

Line 51:   // Default per-host memory requirement used if no valid stats are 
available.
> Need to fix comment.
Done


http://gerrit.cloudera.org:8080/#/c/5847/4/fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
File fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java:

Line 85: long perInstanceInputCardinality = Math.max(1L, 
inputNode.getCardinality() / numInstances);
> Need to fix long lines
Done


Line 138: output.append(PrintUtils.printInstances(" ", 
fragment_.getNumInstances(queryOptions)));
> Need to fix long line
Done


http://gerrit.cloudera.org:8080/#/c/5847/4/fe/src/main/java/org/apache/impala/planner/KuduTableSink.java
File fe/src/main/java/org/apache/impala/planner/KuduTableSink.java:

Line 62:   output.append(PrintUtils.printInstances(" ", 
fragment_.getNumInstances(queryOptions)));
> need to fix long line
Done


http://gerrit.cloudera.org:8080/#/c/5847/4/fe/src/main/java/org/apache/impala/planner/PlanFragment.java
File fe/src/main/java/org/apache/impala/planner/PlanFragment.java:

Line 213: return dop * planRoot_.getNumNodes();
> Needs to handle when getNumNodes() is invalid, i.e. -1
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5847
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1e358182bcf2bc5fe5c73883eb97878735b12d37
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4842: BufferedBlockMgrTest.WriteError is flaky

2017-02-08 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/5940

Change subject: IMPALA-4842: BufferedBlockMgrTest.WriteError is flaky
..

IMPALA-4842: BufferedBlockMgrTest.WriteError is flaky

The test should allow Unpin() to fail with a scratch allocation error to
handle the case where the first write fails and blacklists the scratch
disk around the same time that the second write starts. Usually either
the second write succeeds because it started before the first write
failed or it fails with CANCELLED because the
BufferedBlockMgr::is_cancelled_ flag is set. There is a small
window for a race after the disk is blacklisted in TmpFileMgr but
before BufferedBlockMgr::WriteComplete() is called.

Testing:
I was able to reproduce the problem locally by adding some delays
to the test. I added a variant of the WriteError test that more reliably
reproduces the bug. Ran both WriteError tests in a loop locally to try
to flush out flakiness.

Change-Id: I9878d7000b03a64ee06c2088a8c30e318fe1d2a3
---
M be/src/runtime/buffered-block-mgr-test.cc
M be/src/runtime/tmp-file-mgr.cc
M common/thrift/generate_error_codes.py
3 files changed, 43 insertions(+), 13 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/40/5940/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5940
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I9878d7000b03a64ee06c2088a8c30e318fe1d2a3
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 


[Impala-ASF-CR] IMPALA-4895: Memory limit exceeded in test outer joins

2017-02-08 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4895: Memory limit exceeded in test_outer_joins
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5941/1/testdata/workloads/tpch/queries/tpch-outer-joins.test
File testdata/workloads/tpch/queries/tpch-outer-joins.test:

Line 35: none
Can we make it something that is unlikely to be a substring of a real error 
message? Just to avoid confusion. E.g. __NO_ERROR__ or something like that.


-- 
To view, visit http://gerrit.cloudera.org:8080/5941
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4731a3e83dd2142a1d83be963f83cd1847472295
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-1430,IMPALA-4878,IMPALA-4879: codegen native UDAs

2017-02-08 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-1430,IMPALA-4878,IMPALA-4879: codegen native UDAs
..


Patch Set 19:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5161/19/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

Line 55:   // If this is merge aggregation function, the input argument types 
of the aggregate
> I think the behavior and limitations are going to be clearer if we keep a p
Doesn't that somehow entangle the lifecycles of the two different sets of exprs?

As a general design pattern storing pointers from one expr tree to a separate 
expr tree that can be mutated in-place seems like a bad idea - too hard to 
reason about. I could see just storing a clone of the source expression but I'm 
not sure that that's significantly better than the current approach.


-- 
To view, visit http://gerrit.cloudera.org:8080/5161
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id1708eaa96eb76fb9bec5eeabf209f81c88eec2f
Gerrit-PatchSet: 19
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-1430,IMPALA-4878,IMPALA-4879: codegen native UDAs

2017-02-08 Thread Tim Armstrong (Code Review)
Hello Impala Public Jenkins, Michael Ho, Dan Hecht,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/5161

to look at the new patch set (#20).

Change subject: IMPALA-1430,IMPALA-4878,IMPALA-4879: codegen native UDAs
..

IMPALA-1430,IMPALA-4878,IMPALA-4879: codegen native UDAs

This uses the existing infrastructure for codegening builtin UDAs and
for codegening calls to UDFs. GetUdf() is refactored to support both
UDFs and UDAs.

IR UDAs are still not allowed by the frontend. It's unclear if we want
to enable them going forward because of the difficulties in testing and
supporting IR UDFs/UDAs.

This also fixes some bugs with the Get*Type() methods of
FunctionContext. GetArgType() and related methods now always return the
logical input types of the aggregate function. Getting the tests to pass
required fixing IMPALA-4878 because they called GetIntermediateType().

Testing:
test_udfs.py tests UDAs with codegen enabled and disabled.

Added some asserts to test UDAs to check that the correct types are
passed in via the FunctionContext.

Change-Id: Id1708eaa96eb76fb9bec5eeabf209f81c88eec2f
---
M be/src/codegen/codegen-anyval.cc
M be/src/codegen/codegen-anyval.h
M be/src/codegen/llvm-codegen.cc
M be/src/codegen/llvm-codegen.h
M be/src/exec/partitioned-aggregation-node.cc
M be/src/exprs/agg-fn-evaluator.cc
M be/src/exprs/agg-fn-evaluator.h
M be/src/exprs/anyval-util.cc
M be/src/exprs/anyval-util.h
M be/src/exprs/expr.cc
M be/src/exprs/expr.h
M be/src/exprs/scalar-fn-call.cc
M be/src/exprs/scalar-fn-call.h
M be/src/exprs/timestamp-functions.cc
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/testutil/test-udas.cc
M be/src/testutil/test-udas.h
M be/src/udf/udf-internal.h
M be/src/udf/udf-ir.cc
M be/src/udf/udf.h
M common/thrift/Exprs.thrift
M fe/src/main/java/org/apache/impala/analysis/AggregateInfo.java
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M testdata/workloads/functional-query/queries/QueryTest/uda.test
M tests/query_test/test_udfs.py
26 files changed, 504 insertions(+), 248 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/61/5161/20
-- 
To view, visit http://gerrit.cloudera.org:8080/5161
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id1708eaa96eb76fb9bec5eeabf209f81c88eec2f
Gerrit-PatchSet: 20
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-1430,IMPALA-4878,IMPALA-4879: codegen native UDAs

2017-02-08 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-1430,IMPALA-4878,IMPALA-4879: codegen native UDAs
..


Patch Set 19:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5161/19/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

Line 55:   // If this is merge aggregation function, the input argument types 
of the aggregate
> I'm fine with either approach, but:
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5161
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id1708eaa96eb76fb9bec5eeabf209f81c88eec2f
Gerrit-PatchSet: 19
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-1430,IMPALA-4878,IMPALA-4879: codegen native UDAs

2017-02-08 Thread Tim Armstrong (Code Review)
Hello Impala Public Jenkins, Michael Ho, Dan Hecht,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/5161

to look at the new patch set (#21).

Change subject: IMPALA-1430,IMPALA-4878,IMPALA-4879: codegen native UDAs
..

IMPALA-1430,IMPALA-4878,IMPALA-4879: codegen native UDAs

This uses the existing infrastructure for codegening builtin UDAs and
for codegening calls to UDFs. GetUdf() is refactored to support both
UDFs and UDAs.

IR UDAs are still not allowed by the frontend. It's unclear if we want
to enable them going forward because of the difficulties in testing and
supporting IR UDFs/UDAs.

This also fixes some bugs with the Get*Type() methods of
FunctionContext. GetArgType() and related methods now always return the
logical input types of the aggregate function. Getting the tests to pass
required fixing IMPALA-4878 because they called GetIntermediateType().

Testing:
test_udfs.py tests UDAs with codegen enabled and disabled.

Added some asserts to test UDAs to check that the correct types are
passed in via the FunctionContext.

Change-Id: Id1708eaa96eb76fb9bec5eeabf209f81c88eec2f
---
M be/src/codegen/codegen-anyval.cc
M be/src/codegen/codegen-anyval.h
M be/src/codegen/llvm-codegen.cc
M be/src/codegen/llvm-codegen.h
M be/src/exec/partitioned-aggregation-node.cc
M be/src/exprs/agg-fn-evaluator.cc
M be/src/exprs/agg-fn-evaluator.h
M be/src/exprs/anyval-util.cc
M be/src/exprs/anyval-util.h
M be/src/exprs/expr.cc
M be/src/exprs/expr.h
M be/src/exprs/scalar-fn-call.cc
M be/src/exprs/scalar-fn-call.h
M be/src/exprs/timestamp-functions.cc
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/testutil/test-udas.cc
M be/src/testutil/test-udas.h
M be/src/udf/udf-internal.h
M be/src/udf/udf-ir.cc
M be/src/udf/udf.h
M common/thrift/Exprs.thrift
M fe/src/main/java/org/apache/impala/analysis/AggregateInfo.java
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M testdata/workloads/functional-query/queries/QueryTest/uda.test
M tests/query_test/test_udfs.py
26 files changed, 501 insertions(+), 248 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/61/5161/21
-- 
To view, visit http://gerrit.cloudera.org:8080/5161
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id1708eaa96eb76fb9bec5eeabf209f81c88eec2f
Gerrit-PatchSet: 21
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-1430,IMPALA-4878,IMPALA-4879: codegen native UDAs

2017-02-08 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-1430,IMPALA-4878,IMPALA-4879: codegen native UDAs
..


Patch Set 20:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5161/20/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

Line 106: ArrayList aggFnArgTypes = Lists.newArrayList();
> unused?
Done


Line 554: "in original expr %s", toSql(), 
copiedInputType.toString(),
> also use toSql() for types
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5161
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id1708eaa96eb76fb9bec5eeabf209f81c88eec2f
Gerrit-PatchSet: 20
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-1430,IMPALA-4878,IMPALA-4879: codegen native UDAs

2017-02-08 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-1430,IMPALA-4878,IMPALA-4879: codegen native UDAs
..


Patch Set 22: Code-Review+2

carry +2

-- 
To view, visit http://gerrit.cloudera.org:8080/5161
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id1708eaa96eb76fb9bec5eeabf209f81c88eec2f
Gerrit-PatchSet: 22
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2328: Address additional comments

2017-02-27 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-2328: Address additional comments
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6147/1/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 487:   DCHECK(min_max_tuple_desc->slots().size() == 
min_max_conjuncts_ctxs_.size());
DCHECK_EQ


-- 
To view, visit http://gerrit.cloudera.org:8080/6147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I54c205fad7afc4a0b0a7d0f654859de76db29a02
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4995: Fix integer overflow in TopNNode::PrepareForOutput

2017-02-28 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4995: Fix integer overflow in TopNNode::PrepareForOutput
..


Patch Set 1: Code-Review+2

This fix looks good but see my comment on the JIRA - it may make sense to 
switch to the sort operator when limit + offset is large since there are other 
potential problems with the TopN node in those cases

-- 
To view, visit http://gerrit.cloudera.org:8080/6171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5048ec67d8f086346220d56e027e6583fbb5ddad
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4624: Implement Parquet dictionary filtering

2017-02-28 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4624: Implement Parquet dictionary filtering
..


Patch Set 14:

(7 comments)

Flushing out some comments I made while in transit.

I don't have any concerns about correctness but there were a few things that 
may be confusing for people reading the code.

http://gerrit.cloudera.org:8080/#/c/5904/14//COMMIT_MSG
Commit Message:

Line 7: IMPALA-4624: Implement Parquet dictionary filtering
Can you mention the query option in the commit message?


PS14, Line 12: incides
indices


http://gerrit.cloudera.org:8080/#/c/5904/14/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 704: return Status(Substitute("Could not allocate buffer of $0 bytes 
for Parquet "
Can you use MemTracker::MemLimitExceeded() here to construct the status? It 
also does some logging that can be useful to diagnose the failure.


http://gerrit.cloudera.org:8080/#/c/5904/14/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 855
Thanks for fixing this. I was talking to Wes McKinney (who works on 
parquet-cpp) a month or so ago and he'd been confused about why Impala was 
writing out encodings it wasn't using.


http://gerrit.cloudera.org:8080/#/c/5904/14/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

PS14, Line 256: GetDictionary
GetDictionaryDecoder() for consistency with the other APIs?


Line 865: if (!stream_->ReadBytes(data_size, &data_, &status)) return 
status;
Not your change, but can we just SkipBytes here? That would make the intent 
clearer.


http://gerrit.cloudera.org:8080/#/c/5904/14/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

Line 243:   public boolean isDeterministic() {
I think we should be careful about the naming and comments here, since this 
will return true for many non-deterministic functions - the state of things 
before this patch is pretty confusing. The definition is Expr.isConstant() is 
subtle - the comment on that function tries to define the current rules.

E.g. UDFs can be non-deterministic but the fe treats them as deterministic (for 
now). Or now() isn't really deterministic, but we treat it as such because it 
shouldn't be re-evaluated within a query.

This list of functions is really the builtin functions that have some kind of 
randomisation. Can you rename it to something like isRandomizedBuiltin() and 
update the comment to reflect that?


-- 
To view, visit http://gerrit.cloudera.org:8080/5904
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7
Gerrit-PatchSet: 14
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4624: Implement Parquet dictionary filtering

2017-02-28 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4624: Implement Parquet dictionary filtering
..


Patch Set 16:

It looks like you'll have to do a follow-on patch to solve the bootstrapping 
problem with the Impala version so I'm ok if you want to address the comments 
in a follow-on.

-- 
To view, visit http://gerrit.cloudera.org:8080/5904
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7
Gerrit-PatchSet: 16
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] Fix parquet table writer dictionary leak

2017-02-28 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: Fix parquet table writer dictionary leak
..


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6181/1/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 334:   OutputPartition* deletable_partition = 
current_clustered_partition_->first;
Why not just delete it here and save the temp variable?


Line 336:   delete(deletable_partition);
delete deletable_partition; - delete is a keyword instead of a function


Line 579: OutputPartition* partition = new OutputPartition();
Could we use a unique_ptr here and in PartitionPair to make the ownership 
explicit and cleanup automatic? I think this would be a unique_ptr, and it 
would be move()d into the map .


Line 587:   delete(partition);
delete partition;


-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] Fix parquet table writer dictionary leak

2017-02-28 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: Fix parquet table writer dictionary leak
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6181/2//COMMIT_MSG
Commit Message:

Line 7: Fix parquet table writer dictionary leak
JIRA?


http://gerrit.cloudera.org:8080/#/c/6181/1/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 579: OutputPartition* partition = new OutputPartition();
> The issue is that the OutputPartition is passed down to the table writers, 
The OutputPartition owns the HdfsTableWriter via a scoped_ptr so we can safely 
use a raw OutputPartition* pointer in HdfsTableWriter. This pattern of an 
owning unique_ptr/scoped_ptr in one direction and a raw pointer for the inverse 
shows up elsewhere in the code (e.g. for "parent" pointers). 

I'd probably add a comment on HdfsTableWriter.output_ to note that ownership 
relationship but I think that would be clear and safe.


-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4899: Fix parquet table writer dictionary leak

2017-02-28 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4899: Fix parquet table writer dictionary leak
..


Patch Set 3: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6181/3/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 172:   const std::unique_ptr& output_partition,
Just passing the raw pointer seems ok to me - I think you can do everything 
with the const unique_ptr that you can do with a raw pointer, so the type 
doesn't convey much additional info.

I don't feel that strongly about it though


-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4899: Fix parquet table writer dictionary leak

2017-03-01 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4899: Fix parquet table writer dictionary leak
..


Patch Set 4: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4674: Part 1: port BufferedTupleStream to BufferPool

2017-03-01 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4674: Part 1: port BufferedTupleStream to BufferPool
..


Patch Set 5:

(39 comments)

http://gerrit.cloudera.org:8080/#/c/5811/5/be/src/runtime/buffered-tuple-stream-v2.cc
File be/src/runtime/buffered-tuple-stream-v2.cc:

PS5, Line 181: write_page_->is_pinned()
> when do we have an unpinned write page? what does that mean?
I documented that they have to be pinned and added some consistency checks. If 
I set write_page_ = null earlier that keeps it in a consistent state.


Line 189:   *got_page = 
buffer_pool_client_->reservation()->IncreaseReservationToFit(page_len);
> shouldn't this come with IMPALA-3208? or do we need this already?
We need it already for the first page or if got_page was false on an earlier 
call and we're retrying.


Line 198:   write_end_ptr_ = new_page.data() + page_len;
> would be slightly clearer to compute this based on write_page_
Done


Line 234: DCHECK(&*read_page_ != write_page_);
> here and above: DCHECK_NE/EQ
It doesn't work for the first one since the values of iterators aren't 
printable.


Line 246:   bytes_in_mem_ -= read_page_->len();
> UnpinPage()?
Done


Line 256:   if (!read_page_->is_pinned()) {
> is this check sufficient? couldn't the page by pinned due to it also being 
I just kept this logic from the old BufferedTupleStream.

Added a comment to 'write_page_' to explain.

I think it's easiest to leave as-is for now. Having a pin per iterator makes a 
lot of sense when we have a separate iterator abstraction that manages its own 
reservation, but it interacts strangely with the current UnpinStream() API - if 
you UnpinStream() a read-write stream with a single page then it could actually 
use more reservation. The only current user -AnalyticEvalNode - shouldn't do 
that but it adds a weird edge case to be wary of.

I'm open to changing it but it wasn't a strict improvement in my eyes.


Line 300: bytes_in_mem_ += pages_.front().len();
> maybe factor this into PinPage() to match UnpinPage()
Done


PS5, Line 324: is_pinned
> i guess this use is probably okay
I think we'd need to change it if we started pinned pages multiple times - we 
might need to unpin a page here if it was pinned multiple times.


http://gerrit.cloudera.org:8080/#/c/5811/5/be/src/runtime/buffered-tuple-stream-v2.h
File be/src/runtime/buffered-tuple-stream-v2.h:

Line 65: //
> ///
Done


PS5, Line 66: In the pinned mode all pages are
: /// pinned and the data is fully in memory.
> seems redundant with the next sentence
Done


PS5, Line 74: PrepareForWrite
> shouldn't that be PrepareForRead()?
Done


PS5, Line 74: the
> to
Done


Line 75: /// pass over the stream.
> this should go away soon after switching to BufferPool, right? i.e. buffers
I think the concept of a destructive read pass will continue to  exist, but the 
pages would be destroyed and buffers attached instead of the current dance with 
NeedsDeepCopy().

There will probably be other related changes though, e.g. around buffer 
management for non-destructive read passes. To support non-destructive unpinned 
read passes we might actually need a new BufferPool API to extract a buffer 
from a page without destroying the page.


Line 121: ///  <---12b---> <---12b---> <-5b>
> i guess these examples all assume non-nullable?
Added a nulllable example


Line 127: ///   caller does not need to copy if it is streaming.
> are these listed in order of precedence? i.e. if a stream is both delete on
Reworked this and the memory lifetime section to avoid ambiguity.


Line 139: ///   or free up other memory before retrying.
> nit:indentation of some of these paragraphs is inconsistent
Done


Line 152: /// layout described above.
> maybe a TODO/JIRA for what we had discussed with unifying AddRow() and Allo
Done


Line 199:   ///Returns false and sets 'status' to an error.
> nit: indentations of these bullets are inconsistent
Done


Line 262:   /// rows will remain valid until the stream is unpinned, destroyed, 
etc.
> let's also put TODO IMPALA-4179 to make it easier to keep straight of where
Done


Line 267:   /// stream remains pinned.
> does this interface require the stream to be pinned? if so, let's be explic
Done


PS5, Line 370: offset
> this isn't really the offset.
Done


Line 374:   uint8_t* read_end_ptr_;
> given we store number of rows in the page, what is this used for? oh, i see
Done


Line 380:   uint8_t* write_end_ptr_;
> same, and this one looks dead anyway.
This one is used in AllocateRow() where the perf matters - it likely saves a 
few cycles per row. I ended up removing write_page_bytes_remaining() and just 
inlining the calculations.


Line 382:   /// Number of rows returned to the caller from GetNext().
> ... since PrepareForRead() was called?
Done


Line 386:   int read_page_idx_;
> this looks dead
Done


Line 393:   int num_pinned_;
> this doesn't looked 

[Impala-ASF-CR] IMPALA-4674: Part 1: port BufferedTupleStream to BufferPool

2017-03-01 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded a new patch set (#6).

Change subject: IMPALA-4674: Part 1: port BufferedTupleStream to BufferPool
..

IMPALA-4674: Part 1: port BufferedTupleStream to BufferPool

Add a copy of BufferedTupleStream that allocates memory from BufferPool.
This will replace the original implementation when IMPALA-3200 is
completed.

The major changes are:
* Terminology is updated from "blocks" to "pages"
* No small buffer support is needed (hooray!).
* BufferedTupleStream needs to do its own tracking of # of rows per
  page, etc instead of relying on BufferedBlockMgr to do it. A
  wrapper around PageHandle is used.
* Profile counters (unpin, pin, get new block time) are removed -
  similar counters in the BufferPool client are more useful.
* Changed the tuple null indicators so that they are allocated
  before each tuple, rather than in a block at the start of the
  page. This slightly reduces the memory density, but greatly
  simplifies much logic. In particular, it avoids problems with
  RowIdx and larger pages with offsets that don't fit in 16 bits.

Testing this required some further changes. TestEnv was refactored
so it can set up either BufferPool or BufferedBlockMgr. Some
BufferPool-related state is added to ExecEnv, QueryState and
RuntimeState, but is only created for backend tests that explicitly
create a BufferPool.

The following is left for future work:
* IMPALA-3808 (large row support) is not added. I've added
  TODOs to the code to places that will require changes.
* IMPALA-4179 (remove MarkNeedsDeepCopy()) is not fixed, since
  it requires global changes to operators that accumulate memory.

Testing:
All of the BufferedTupleStream unit tests are ported to the new
implementation, except for ones specifically testing the small
buffer functionality.

Change-Id: I7bb47a818564ac19a6be4ca88db0578f4ea0b709
---
M be/src/exec/hash-table-test.cc
M be/src/runtime/CMakeLists.txt
M be/src/runtime/buffered-block-mgr-test.cc
M be/src/runtime/buffered-tuple-stream-test.cc
A be/src/runtime/buffered-tuple-stream-v2-test.cc
A be/src/runtime/buffered-tuple-stream-v2.cc
A be/src/runtime/buffered-tuple-stream-v2.h
A be/src/runtime/buffered-tuple-stream-v2.inline.h
M be/src/runtime/bufferpool/buffer-pool-test.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/exec-env.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/row-batch.cc
M be/src/runtime/row-batch.h
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/test-env.cc
M be/src/runtime/test-env.h
M be/src/runtime/tmp-file-mgr-test.cc
20 files changed, 2,822 insertions(+), 101 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/5811/6
-- 
To view, visit http://gerrit.cloudera.org:8080/5811
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7bb47a818564ac19a6be4ca88db0578f4ea0b709
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Tim Armstrong 


[native-toolchain-CR] IMPALA-5002: Associate toolchain build scripts/flags with built packages

2017-03-01 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5002: Associate toolchain build scripts/flags with built 
packages
..


Patch Set 2: Code-Review+2

(1 comment)

Seems like a good thing to track

http://gerrit.cloudera.org:8080/#/c/6210/2/functions.sh
File functions.sh:

Line 351:   git rev-parse HEAD > 
${BUILD_DIR}/${PACKAGE_STRING}${PATCH_VERSION}/toolchain-build-hash.txt
Long line


-- 
To view, visit http://gerrit.cloudera.org:8080/6210
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida6f1eee0c20de7fbfdbb716e5c7ac8686b558b6
Gerrit-PatchSet: 2
Gerrit-Project: native-toolchain
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[native-toolchain-CR] Add a script to build Kudu using existing toolchain artifacts

2017-03-01 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: Add a script to build Kudu using existing toolchain artifacts
..


Patch Set 2:

(5 comments)

Looks pretty good, just had some minor comments

http://gerrit.cloudera.org:8080/#/c/6167/2/build-kudu-single.sh
File build-kudu-single.sh:

Line 1: #!/usr/bin/env bash
Doesn't matter too much but build-kudu-only seems clearer to me


Line 15: 
Can you add a short comment up the top explaining how to use it and what 
environment variables influence it.


Line 35: function download_toolchain_dependency() {
Does this download all versions of the package?


Line 110:   pkg_name=${dir%*/}
Can we call this something instead of pkg_name? Like pkg_string? pkg_name 
usually means the name without the version in these scripts.


http://gerrit.cloudera.org:8080/#/c/6167/2/init-compiler.sh
File init-compiler.sh:

PS2, Line 20:  
trailing space


-- 
To view, visit http://gerrit.cloudera.org:8080/6167
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I237580e1545033467a92285ca8bb8db1cf8c804e
Gerrit-PatchSet: 2
Gerrit-Project: native-toolchain
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4624: Implement Parquet dictionary filtering

2017-03-01 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4624: Implement Parquet dictionary filtering
..


Patch Set 14:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5904/14/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

Line 243:   public boolean isDeterministic() {
> Switched this over to isBuiltinRandom/isBuiltinRandomFn. I agree that it is
We ran into a similar problem with non-deterministic UDFs when we added 
expression rewrites. We never provided any guarantees about how 
non-deterministic UDFs were evaluated, and it was already inconsistent. E.g. 
partitions have been statically pruned based on UDF output since many releases 
ago, so we don't guarantee that predicates with a non-deterministic UDF are 
evaluated against every row anyway.

The solution we went before was to change the behaviour (since it wasn't 
contractual) but provide the enable_expr_rewrites safety valve in case someone 
was relying on the behaviour.

I believe we have a JIRA somewhere to add metadata about whether a UDF is 
non-determistic.


-- 
To view, visit http://gerrit.cloudera.org:8080/5904
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7
Gerrit-PatchSet: 14
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4624: Implement Parquet dictionary filtering

2017-03-01 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4624: Implement Parquet dictionary filtering
..


Patch Set 18: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/5904
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7
Gerrit-PatchSet: 18
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4946: fix hang in BufferPool

2017-03-02 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/6224

Change subject: IMPALA-4946: fix hang in BufferPool
..

IMPALA-4946: fix hang in BufferPool

Once the write is removed from the "in flight" list,
both the Client and Page may be destroyed by a different
thread. The fix is to signal condition variables before
inside the critical section that removes the write from
the in flight list.

Also fix a potential pitfall with DiskIoMgr::CancelContext()
where concurrent calls to the method, which can be called
asynchronously with other methods, could result in a hang in
DiskIoMgr::CancelContext(). I do not believe any Impala code
calls it concurrently from multiple threads, so the bug was
only latent.

Testing:
I was able to reproduce reliably by inserting a 1s sleep before
the NotifyAll() calls. After the fix, the hang didn't reproduce
with sleeps inside or outside the critical section.

I could not come up with a unit test that had a higher reproduction
rate than the current tests - the window for the race is very small.
I considered adding a debug stress option to insert these delays,
but with all the code moved into the critical section it wouldn't
be useful.

Change-Id: I13fc95b5a664544dee789c4107fccf81d2077347
---
M be/src/runtime/bufferpool/buffer-pool.cc
M be/src/runtime/disk-io-mgr-internal.h
2 files changed, 5 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/6224/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6224
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I13fc95b5a664544dee789c4107fccf81d2077347
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 


[Impala-ASF-CR] IMPALA-4593,IMPALA-4635: fix some python build issues

2017-03-02 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded a new patch set (#2).

Change subject: IMPALA-4593,IMPALA-4635: fix some python build issues
..

IMPALA-4593,IMPALA-4635: fix some python build issues

Build C/C++ packages with toolchain GCC to avoid ABI compatibility
issues. This requires a multi-step bootstrapping process:
1. install basic non-C/C++ packages into the virtualenv
2. use Python 2.7 from the virtualenv to bootstrap the toolchain
3. use toolchain gcc to build C/C++ packages
4. build the kudu-python package with toolchain gcc and Cython

To avoid potentially pulling in cached versions of packages
built with a different compiler, this patch also disables pip's
caching. This should not have a significant effect on performance
since we've enabled ccache and cache downloaded packages in
infra/python/deps.

Improve bootstrapping time significantly by using ccache and by
parallelising the numpy build - the most expensive part of the
install process. On a system with a warmed-up ccache,
bootstrapping after deleting infra/python/env takes 1m16s. Previously
it could take over 5m.

Testing:
Tested manually on Ubuntu 16.04 to confirm that it fixes the ABI
problem mentioned in IMPALA-4593. Initially "import kudu" failed
in my dev environment. After deleting infra/python/env and
re-bootstrapping, "import kudu" succeeded.

Also ran the standard test suite on CentOS 6 and built Impala on
a range of platforms (CentOS 5,6,7; SLES 11,12; Debian 6,7;
Ubuntu12.04,14.04,16.04) to make sure nothing broke.

Change-Id: I9e807510eddeb354069e0478363f649a1c1b75cf
---
M infra/python/bootstrap_virtualenv.py
A infra/python/deps/compiled-requirements.txt
M infra/python/deps/download_requirements
A infra/python/deps/kudu-requirements.txt
M infra/python/deps/pip_download.py
M infra/python/deps/requirements.txt
6 files changed, 120 insertions(+), 78 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/18/6218/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6218
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9e807510eddeb354069e0478363f649a1c1b75cf
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 


[Impala-ASF-CR] IMPALA-4946: fix hang in BufferPool

2017-03-02 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4946: fix hang in BufferPool
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6224/1//COMMIT_MSG
Commit Message:

PS1, Line 11: before
remove


-- 
To view, visit http://gerrit.cloudera.org:8080/6224
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I13fc95b5a664544dee789c4107fccf81d2077347
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5008: Fix reading stats for TINYINT and SMALLINT

2017-03-02 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5008: Fix reading stats for TINYINT and SMALLINT
..


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/6226/2/be/src/exec/parquet-column-stats.cc
File be/src/exec/parquet-column-stats.cc:

Line 34: if (col_stats < std::numeric_limits::min() ||
col_stats isn't valid if ret is false, right?


Line 39:   DCHECK(false);
I don't think we should DCHECK here - users do sometimes run DEBUG builds on 
their clusters (often their dev environment).


Line 50: if (col_stats < std::numeric_limits::min() ||
See above comments.


-- 
To view, visit http://gerrit.cloudera.org:8080/6226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9b10508db53747e7b08c8bd9a69c763b82135a78
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[native-toolchain-CR] Add a script to build Kudu using existing toolchain artifacts

2017-03-02 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: Add a script to build Kudu using existing toolchain artifacts
..


Patch Set 3: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6167
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I237580e1545033467a92285ca8bb8db1cf8c804e
Gerrit-PatchSet: 3
Gerrit-Project: native-toolchain
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5008: Fix reading stats for TINYINT and SMALLINT

2017-03-02 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5008: Fix reading stats for TINYINT and SMALLINT
..


Patch Set 3: Code-Review+2

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6226/3/be/src/exec/parquet-column-stats.cc
File be/src/exec/parquet-column-stats.cc:

Line 39: return ret;
return true?


Line 51: return ret;
return true?


-- 
To view, visit http://gerrit.cloudera.org:8080/6226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9b10508db53747e7b08c8bd9a69c763b82135a78
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[native-toolchain-CR] IMPALA-5025: upgrade to binutils 2.26.1

2017-03-03 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/6247

Change subject: IMPALA-5025: upgrade to binutils 2.26.1
..

IMPALA-5025: upgrade to binutils 2.26.1

This has two fixes that we care about:
* The slow linking problem that the -p1 patch to 2.26 solved
* A way to disable emitting relocations R_X86_64_GOTPCRELX and
   R_X86_64_REX_GOTPCRELX that aren't supported by older linkers.
   We configure binutils to disable these by default to be as safe
   as possible.

Testing:
Built Impala locally with 2.26-p1 and 2.26.1 and inspected
libImpalaUdf.a with objdump -x to make sure that the relocation
types changed to ensure that the change had an effect.

Change-Id: I0de3b30f6ade4a3fab77572caac7be3a8d018105
---
M init.sh
M source/binutils/build.sh
2 files changed, 8 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/native-toolchain 
refs/changes/47/6247/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I0de3b30f6ade4a3fab77572caac7be3a8d018105
Gerrit-PatchSet: 1
Gerrit-Project: native-toolchain
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5027: make udf headers buildable externally

2017-03-03 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/6251

Change subject: IMPALA-5027: make udf headers buildable externally
..

IMPALA-5027: make udf headers buildable externally

This fixes a number of issues that prevented building it:
* Use of C++11 features
* Use of internal headers that are only available in the Impala source
  tree

Testing:
Copied the headers and libImpalaUdf.a to /usr and confirmed I could
build impala-udf-samples (with some modifications to udf-test-harness
to use some modified arguments to the test harness).

We really need a better solution for automatically testing this, e.g.
IMPALA-5026, but I wanted to fix the known bugs first.

Change-Id: Ibdcdcd96c953cbabc3de04ca66d7aa1774ada99e
---
M be/src/udf/uda-test-harness-impl.h
M be/src/udf/uda-test-harness.h
M be/src/udf/udf-debug.h
M be/src/udf/udf-test-harness.h
M be/src/udf/udf.h
5 files changed, 45 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/51/6251/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6251
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ibdcdcd96c953cbabc3de04ca66d7aa1774ada99e
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5025: Update binutils to 2.26.1

2017-03-03 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/6252

Change subject: IMPALA-5025: Update binutils to 2.26.1
..

IMPALA-5025: Update binutils to 2.26.1

This release includes the fix for IMPALA-3507 and the fix for
https://sourceware.org/bugzilla/show_bug.cgi?id=19520, which
made libImpalaUdf.a unusable by older linkers.

Change-Id: Ia7e82bfca288c8c5f20cdfd560680801fa43b90a
---
M bin/impala-config.sh
1 file changed, 1 insertion(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/52/6252/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6252
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ia7e82bfca288c8c5f20cdfd560680801fa43b90a
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5025: Update binutils to 2.26.1

2017-03-03 Thread Tim Armstrong (Code Review)
Tim Armstrong has abandoned this change.

Change subject: IMPALA-5025: Update binutils to 2.26.1
..


Abandoned

Didn't mean to push - need to wait until we push out a new toolchain version

-- 
To view, visit http://gerrit.cloudera.org:8080/6252
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: abandon
Gerrit-Change-Id: Ia7e82bfca288c8c5f20cdfd560680801fa43b90a
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 


[native-toolchain-CR] Fix setup / download order in build scripts

2017-03-03 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: Fix setup / download order in build scripts
..


Patch Set 3: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6249/3/source/avro/build.sh
File source/avro/build.sh:

Line 26: download_dependency $PACKAGE "avro-src-${PACKAGE_VERSION}.tar.gz" 
$THIS_DIR
Maybe move all of these download_dependency calls into if needs_build_package 
for consistency. It's done in some places but not others.


-- 
To view, visit http://gerrit.cloudera.org:8080/6249
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I287f091c8e772c08c8a61644d08117145b1ef16d
Gerrit-PatchSet: 3
Gerrit-Project: native-toolchain
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5027: make udf headers buildable externally

2017-03-03 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5027: make udf headers buildable externally
..


Patch Set 1:

A lot of widely-used Linux distributions have default compilers that don't 
support C++11 or C++14.

E.g. RHEL 6 doesn't support C++11, RHEL 7 ships gcc 4.8.x, which doesn't 
support C++14 https://access.redhat.com/solutions/19458

-- 
To view, visit http://gerrit.cloudera.org:8080/6251
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibdcdcd96c953cbabc3de04ca66d7aa1774ada99e
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4731/IMPALA-397/IMPALA-4728: Materialize sort exprs

2017-03-03 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4731/IMPALA-397/IMPALA-4728: Materialize sort exprs
..


Patch Set 4:

Can you make the design doc public and link to it? It's shared with me but 
other reviewers may want to look at it.

-- 
To view, visit http://gerrit.cloudera.org:8080/5914
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5dcda32fc7770d42fc500ce87fc54d58e5b5dc00
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4731/IMPALA-397/IMPALA-4728: Materialize sort exprs

2017-03-03 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4731/IMPALA-397/IMPALA-4728: Materialize sort exprs
..


Patch Set 4:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/5914/4/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

Line 257:   // non-deterministic expression.
I don't think we should trust users not to set this to false on 
non-deterministic expressions - they shouldn't be able to potentially crash an 
Impala daemon with a bad query option value.


http://gerrit.cloudera.org:8080/#/c/5914/4/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

Line 60:   private final static List NONDETERMINISTIC_BUILTINS =
https://gerrit.cloudera.org/#/c/5904/ also factored out this list, so when you 
rebase on top of master we should be able to reuse that.


http://gerrit.cloudera.org:8080/#/c/5914/4/fe/src/main/java/org/apache/impala/analysis/SortInfo.java
File fe/src/main/java/org/apache/impala/analysis/SortInfo.java:

Line 39:   private static final int SORT_MATERIALIZATION_COST_THRESHOLD = 5;
Comment on how it's derived? If it's somewhat arbitrary its good for readers to 
know.


Line 225: analyzer.addWarning("Expression '" + orderingExpr.toSql() 
+ " may be " +
I mentioned in an earlier comment - this seems to be unsafe - we shouldn't 
trust the query option too much.


Line 246: // LiteralExprs don't affect the sort order. TODO: consider 
removing the sort node
Do we have a test that covers the case when all ordering exprs are eliminated?


Line 263: orderingExprsIter.set(materializedRef);
Is this needed? Is this handled by putting it in the substitution map?


http://gerrit.cloudera.org:8080/#/c/5914/4/fe/src/main/java/org/apache/impala/planner/ExchangeNode.java
File fe/src/main/java/org/apache/impala/planner/ExchangeNode.java:

Line 152: if (mergeInfo_.getIsMaterialized().get(i)) {
Also mentioned on the design doc - I think we should only show this at a higher 
explain levels. Any maybe spell out "MATERIALIZED" so that it's less cryptic.


http://gerrit.cloudera.org:8080/#/c/5914/4/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

Line 187: if (info_.getIsMaterialized().get(i)) {
See previous comment


http://gerrit.cloudera.org:8080/#/c/5914/4/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
File testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test:

Line 785: |  order by: id ASC MAT
I guess it's useful for these planner tests to show it, so maybe it is good to 
have it at this explain level. Would be interested what others think.


http://gerrit.cloudera.org:8080/#/c/5914/4/testdata/workloads/functional-planner/queries/PlannerTest/analytic-fns.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/analytic-fns.test:

Line 46: |  order by: int_col ASC NULLS FIRST, tinyint_col ASC NULLS FIRST
It's weird that there's no "MAT" here but there are some in the sort below - is 
that a bug? It seems like we should be showing all plain slotrefs as 
materialised.


Line 98: |  order by: int_col ASC
Same here


Line 454: |  order by: tinyint_col + 1 ASC NULLS FIRST, double_col / 2 ASC 
NULLS FIRST, 4 - int_col ASC, 4 * smallint_col ASC
Looks like the cost-based analysis is working here


Line 789: |  order by: double_col ASC NULLS FIRST, tinyint_col ASC NULLS FIRST, 
int_col ASC
It seems like we're only materialising things in the bottom-most sort for some 
reason.


http://gerrit.cloudera.org:8080/#/c/5914/4/testdata/workloads/functional-planner/queries/PlannerTest/order.test
File testdata/workloads/functional-planner/queries/PlannerTest/order.test:

Line 1245: 
It would be good to also add some basic planner tests for the materialize_sort 
query option - just as as sanity check that it actually has the intended effect.


http://gerrit.cloudera.org:8080/#/c/5914/4/testdata/workloads/functional-query/queries/QueryTest/analytic-fns.test
File testdata/workloads/functional-query/queries/QueryTest/analytic-fns.test:

Line 1921: # Tests 'order by' on a non-deterministic function, return order 
isn't guaranteed.
It would also be good to test the materialize_sort option for analytics


http://gerrit.cloudera.org:8080/#/c/5914/4/testdata/workloads/functional-query/queries/QueryTest/sort.test
File testdata/workloads/functional-query/queries/QueryTest/sort.test:

Line 4146: select id from alltypestiny order by random()
It would be good to add a couple of tests for the materialize_sort query option 
to make sure that it produces the correct results.


Line 4150: select id from alltypestiny order by "AAA"
I think this answers my earlier question about test coverage.


http://gerrit.cloudera.org:8080/#/c/5914/4/tests/query_test/test_sort.py
File

[Impala-ASF-CR] IMPALA-4593,IMPALA-4635: fix some python build issues

2017-03-03 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4593,IMPALA-4635: fix some python build issues
..


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/6218/2/infra/python/bootstrap_virtualenv.py
File infra/python/bootstrap_virtualenv.py:

Line 24: # instead.
> It'd be helpful to have the nice description from the commit message here t
Done


PS2, Line 90: exec_cmd
> does this really throw if the command doesn't exist? I couldn't find docs f
It's a wrapper around popen defined on line 72


PS2, Line 104: "ccache " + cc
> better to use format()
Done


Line 115:   if cc is not None: env["CC"] = cc
> Else bail with a proper error message? What's the consequence of possibly u
We won't have toolchain gcc during the first phase of bootstrapping so we need 
to fail gracefully in that case.

I changed this so that it sets CC to a bogus value if toolchain gcc is not 
available. This revealed a bunch of other packages that used a C compiler.


Line 189: LOG.debug("Skipping compiled deps: cc not available")
> What's the use case for continuing here as opposed to throwing an error.
This code can execute before we've bootstrapped the toolchain, so it needs to 
fail gracefully. I added comments to explain the bootstrap process better.


Line 326:   if install_compiled_deps_if_possible():
> Are these only needed for Kudu? If so consider only doing this when KUDU_IS
It turns out that a bunch of other packages have some compiled files.

I don't see much advantage to skipping this in that special case and it seems 
simpler to explain a general compiled dependency phase instead of splitting out 
a compiled-kudu-dependencies phase.


-- 
To view, visit http://gerrit.cloudera.org:8080/6218
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9e807510eddeb354069e0478363f649a1c1b75cf
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4593,IMPALA-4635: fix some python build issues

2017-03-03 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded a new patch set (#3).

Change subject: IMPALA-4593,IMPALA-4635: fix some python build issues
..

IMPALA-4593,IMPALA-4635: fix some python build issues

Build C/C++ packages with toolchain GCC to avoid ABI compatibility
issues. This requires a multi-step bootstrapping process:
1. install basic non-C/C++ packages into the virtualenv
2. use Python 2.7 from the virtualenv to bootstrap the toolchain
3. use toolchain gcc to build C/C++ packages
4. build the kudu-python package with toolchain gcc and Cython

To avoid potentially pulling in cached versions of packages
built with a different compiler, this patch also disables pip's
caching. This should not have a significant effect on performance
since we've enabled ccache and cache downloaded packages in
infra/python/deps.

Improve bootstrapping time significantly by using ccache and by
parallelising the numpy build - the most expensive part of the
install process. On a system with a warmed-up ccache,
bootstrapping after deleting infra/python/env takes 1m16s. Previously
it could take over 5m.

Testing:
Tested manually on Ubuntu 16.04 to confirm that it fixes the ABI
problem mentioned in IMPALA-4593. Initially "import kudu" failed
in my dev environment. After deleting infra/python/env and
re-bootstrapping, "import kudu" succeeded.

Also ran the standard test suite on CentOS 6 and built Impala on
a range of platforms (CentOS 5,6,7; SLES 11,12; Debian 6,7;
Ubuntu12.04,14.04,16.04) to make sure nothing broke.

Change-Id: I9e807510eddeb354069e0478363f649a1c1b75cf
---
M infra/python/bootstrap_virtualenv.py
A infra/python/deps/compiled-requirements.txt
M infra/python/deps/download_requirements
A infra/python/deps/kudu-requirements.txt
M infra/python/deps/pip_download.py
M infra/python/deps/requirements.txt
6 files changed, 150 insertions(+), 95 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/18/6218/3
-- 
To view, visit http://gerrit.cloudera.org:8080/6218
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9e807510eddeb354069e0478363f649a1c1b75cf
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Matthew Jacobs 


[Impala-ASF-CR] IMPALA-4593,IMPALA-4635: fix some python build issues

2017-03-06 Thread Tim Armstrong (Code Review)
Hello Matthew Jacobs, Alex Behm,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6218

to look at the new patch set (#4).

Change subject: IMPALA-4593,IMPALA-4635: fix some python build issues
..

IMPALA-4593,IMPALA-4635: fix some python build issues

Build C/C++ packages with toolchain GCC to avoid ABI compatibility
issues. This requires a multi-step bootstrapping process:
1. install basic non-C/C++ packages into the virtualenv
2. use Python 2.7 from the virtualenv to bootstrap the toolchain
3. use toolchain gcc to build C/C++ packages
4. build the kudu-python package with toolchain gcc and Cython

To avoid potentially pulling in cached versions of packages
built with a different compiler, this patch also disables pip's
caching. This should not have a significant effect on performance
since we've enabled ccache and cache downloaded packages in
infra/python/deps.

Improve bootstrapping time significantly by using ccache and by
parallelising the numpy build - the most expensive part of the
install process. On a system with a warmed-up ccache,
bootstrapping after deleting infra/python/env takes 1m16s. Previously
it could take over 5m.

Testing:
Tested manually on Ubuntu 16.04 to confirm that it fixes the ABI
problem mentioned in IMPALA-4593. Initially "import kudu" failed
in my dev environment. After deleting infra/python/env and
re-bootstrapping, "import kudu" succeeded.

Also ran the standard test suite on CentOS 6 and built Impala on
a range of platforms (CentOS 5,6,7; SLES 11,12; Debian 6,7;
Ubuntu12.04,14.04,16.04) to make sure nothing broke.

Change-Id: I9e807510eddeb354069e0478363f649a1c1b75cf
---
M infra/python/bootstrap_virtualenv.py
A infra/python/deps/compiled-requirements.txt
M infra/python/deps/download_requirements
A infra/python/deps/kudu-requirements.txt
M infra/python/deps/pip_download.py
M infra/python/deps/requirements.txt
6 files changed, 161 insertions(+), 95 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/18/6218/4
-- 
To view, visit http://gerrit.cloudera.org:8080/6218
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9e807510eddeb354069e0478363f649a1c1b75cf
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-4593,IMPALA-4635: fix some python build issues

2017-03-06 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4593,IMPALA-4635: fix some python build issues
..


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/6218/3/infra/python/bootstrap_virtualenv.py
File infra/python/bootstrap_virtualenv.py:

PS3, Line 28: Every time this script is run
> nit: this makes it sound like this might be run multiple times
It is run multiple times - every time impala-python is run.


http://gerrit.cloudera.org:8080/#/c/6218/4/infra/python/bootstrap_virtualenv.py
File infra/python/bootstrap_virtualenv.py:

Line 113:   if use_ccache(): cc = "ccache {}".format(cc)
This format syntax isn't supported in older Python versions (this script needs 
to support them).


Line 206:   '''Installs the Kudu python module if possible, which depends on 
the toolchain and
I ran into this issues when using toolchain gcc on CentOS5. Tested that this 
workaround works.


-- 
To view, visit http://gerrit.cloudera.org:8080/6218
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9e807510eddeb354069e0478363f649a1c1b75cf
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4593,IMPALA-4635: fix some python build issues

2017-03-06 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4593,IMPALA-4635: fix some python build issues
..


Patch Set 4: Code-Review+2

Carry +2

-- 
To view, visit http://gerrit.cloudera.org:8080/6218
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9e807510eddeb354069e0478363f649a1c1b75cf
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4593,IMPALA-4635: fix some python build issues

2017-03-06 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4593,IMPALA-4635: fix some python build issues
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6218/3/infra/python/bootstrap_virtualenv.py
File infra/python/bootstrap_virtualenv.py:

PS3, Line 28: Every time this script is run
> I meant it sounds like it might be needed to run multiple times to complete
Yes, that's true.


-- 
To view, visit http://gerrit.cloudera.org:8080/6218
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9e807510eddeb354069e0478363f649a1c1b75cf
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4593,IMPALA-4635: fix some python build issues

2017-03-06 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4593,IMPALA-4635: fix some python build issues
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6218/3/infra/python/bootstrap_virtualenv.py
File infra/python/bootstrap_virtualenv.py:

PS3, Line 28: Every time this script is run
> Yes, that's true.
(we need the virtualenv to bootstrap the toolchain, so we don't have the 
toolchain on the first run on a clean system)


-- 
To view, visit http://gerrit.cloudera.org:8080/6218
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9e807510eddeb354069e0478363f649a1c1b75cf
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4593,IMPALA-4635: fix some python build issues

2017-03-06 Thread Tim Armstrong (Code Review)
Hello Impala Public Jenkins, Matthew Jacobs, Alex Behm,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6218

to look at the new patch set (#5).

Change subject: IMPALA-4593,IMPALA-4635: fix some python build issues
..

IMPALA-4593,IMPALA-4635: fix some python build issues

Build C/C++ packages with toolchain GCC to avoid ABI compatibility
issues. This requires a multi-step bootstrapping process:
1. install basic non-C/C++ packages into the virtualenv
2. use Python 2.7 from the virtualenv to bootstrap the toolchain
3. use toolchain gcc to build C/C++ packages
4. build the kudu-python package with toolchain gcc and Cython

To avoid potentially pulling in cached versions of packages
built with a different compiler, this patch also disables pip's
caching. This should not have a significant effect on performance
since we've enabled ccache and cache downloaded packages in
infra/python/deps.

Improve bootstrapping time significantly by using ccache and by
parallelising the numpy build - the most expensive part of the
install process. On a system with a warmed-up ccache,
bootstrapping after deleting infra/python/env takes 1m16s. Previously
it could take over 5m.

Testing:
Tested manually on Ubuntu 16.04 to confirm that it fixes the ABI
problem mentioned in IMPALA-4593. Initially "import kudu" failed
in my dev environment. After deleting infra/python/env and
re-bootstrapping, "import kudu" succeeded.

Also ran the standard test suite on CentOS 6 and built Impala on
a range of platforms (CentOS 5,6,7; SLES 11,12; Debian 6,7;
Ubuntu12.04,14.04,16.04) to make sure nothing broke.

Change-Id: I9e807510eddeb354069e0478363f649a1c1b75cf
---
M infra/python/README
M infra/python/bootstrap_virtualenv.py
A infra/python/deps/compiled-requirements.txt
M infra/python/deps/download_requirements
A infra/python/deps/kudu-requirements.txt
M infra/python/deps/pip_download.py
M infra/python/deps/requirements.txt
7 files changed, 197 insertions(+), 96 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/18/6218/5
-- 
To view, visit http://gerrit.cloudera.org:8080/6218
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9e807510eddeb354069e0478363f649a1c1b75cf
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-4593,IMPALA-4635: fix some python build issues

2017-03-06 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4593,IMPALA-4635: fix some python build issues
..


Patch Set 5: Code-Review+2

Add missing Apache headers

-- 
To view, visit http://gerrit.cloudera.org:8080/6218
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9e807510eddeb354069e0478363f649a1c1b75cf
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4674: Part 1: port BufferedTupleStream to BufferPool

2017-03-06 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded a new patch set (#7).

Change subject: IMPALA-4674: Part 1: port BufferedTupleStream to BufferPool
..

IMPALA-4674: Part 1: port BufferedTupleStream to BufferPool

Add a copy of BufferedTupleStream that allocates memory from BufferPool.
This will replace the original implementation when IMPALA-3200 is
completed.

The major changes are:
* Terminology is updated from "blocks" to "pages"
* No small buffer support is needed (hooray!).
* BufferedTupleStream needs to do its own tracking of # of rows per
  page, etc instead of relying on BufferedBlockMgr to do it. A
  wrapper around PageHandle is used.
* Profile counters (unpin, pin, get new block time) are removed -
  similar counters in the BufferPool client are more useful.
* Changed the tuple null indicators so that they are allocated
  before each tuple, rather than in a block at the start of the
  page. This slightly reduces the memory density, but greatly
  simplifies much logic. In particular, it avoids problems with
  RowIdx and larger pages with offsets that don't fit in 16 bits.
* Buffer management of read/write streams uses the new pin-counting
  support to separate pinning of the read and write pages.
  This means that the reservation usage of an unpinned read/write
  stream does not fluctuate and the read/write iterators can always
  advance without requiring additional reservation.

Testing this required some further changes. TestEnv was refactored
so it can set up either BufferPool or BufferedBlockMgr. Some
BufferPool-related state is added to ExecEnv, QueryState and
RuntimeState, but is only created for backend tests that explicitly
create a BufferPool.

The following is left for future work:
* IMPALA-3808 (large row support) is not added. I've added
  TODOs to the code to places that will require changes.
* IMPALA-4179 (remove MarkNeedsDeepCopy()) is not fixed, since
  it requires global changes to operators that accumulate memory.

Testing:
All of the BufferedTupleStream unit tests are ported to the new
implementation, except for ones specifically testing the small
buffer functionality.

Change-Id: I7bb47a818564ac19a6be4ca88db0578f4ea0b709
---
M be/src/exec/hash-table-test.cc
M be/src/runtime/CMakeLists.txt
M be/src/runtime/buffered-block-mgr-test.cc
M be/src/runtime/buffered-tuple-stream-test.cc
A be/src/runtime/buffered-tuple-stream-v2-test.cc
A be/src/runtime/buffered-tuple-stream-v2.cc
A be/src/runtime/buffered-tuple-stream-v2.h
A be/src/runtime/buffered-tuple-stream-v2.inline.h
M be/src/runtime/bufferpool/buffer-pool-test.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/exec-env.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/row-batch.cc
M be/src/runtime/row-batch.h
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/test-env.cc
M be/src/runtime/test-env.h
M be/src/runtime/tmp-file-mgr-test.cc
20 files changed, 2,935 insertions(+), 101 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/5811/7
-- 
To view, visit http://gerrit.cloudera.org:8080/5811
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7bb47a818564ac19a6be4ca88db0578f4ea0b709
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-4674: Part 1: port BufferedTupleStream to BufferPool

2017-03-06 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded a new patch set (#8).

Change subject: IMPALA-4674: Part 1: port BufferedTupleStream to BufferPool
..

IMPALA-4674: Part 1: port BufferedTupleStream to BufferPool

Add a copy of BufferedTupleStream that allocates memory from BufferPool.
This will replace the original implementation when IMPALA-3200 is
completed.

The major changes are:
* Terminology is updated from "blocks" to "pages"
* No small buffer support is needed (hooray!).
* BufferedTupleStream needs to do its own tracking of # of rows per
  page, etc instead of relying on BufferedBlockMgr to do it. A
  wrapper around PageHandle is used.
* Profile counters (unpin, pin, get new block time) are removed -
  similar counters in the BufferPool client are more useful.
* Changed the tuple null indicators so that they are allocated
  before each tuple, rather than in a block at the start of the
  page. This slightly reduces the memory density, but greatly
  simplifies much logic. In particular, it avoids problems with
  RowIdx and larger pages with offsets that don't fit in 16 bits.
* Buffer management of read/write streams uses the new pin-counting
  support to separate pinning of the read and write pages.
  This means that the reservation usage of an unpinned read/write
  stream does not fluctuate and the read/write iterators can always
  advance without requiring additional reservation.

Testing this required some further changes. TestEnv was refactored
so it can set up either BufferPool or BufferedBlockMgr. Some
BufferPool-related state is added to ExecEnv, QueryState and
RuntimeState, but is only created for backend tests that explicitly
create a BufferPool.

The following is left for future work:
* IMPALA-3808 (large row support) is not added. I've added
  TODOs to the code to places that will require changes.
* IMPALA-4179 (remove MarkNeedsDeepCopy()) is not fixed, since
  it requires global changes to operators that accumulate memory.

Testing:
All of the BufferedTupleStream unit tests are ported to the new
implementation, except for ones specifically testing the small
buffer functionality.

Change-Id: I7bb47a818564ac19a6be4ca88db0578f4ea0b709
---
M be/src/exec/hash-table-test.cc
M be/src/runtime/CMakeLists.txt
M be/src/runtime/buffered-block-mgr-test.cc
M be/src/runtime/buffered-tuple-stream-test.cc
A be/src/runtime/buffered-tuple-stream-v2-test.cc
A be/src/runtime/buffered-tuple-stream-v2.cc
A be/src/runtime/buffered-tuple-stream-v2.h
A be/src/runtime/buffered-tuple-stream-v2.inline.h
M be/src/runtime/bufferpool/buffer-pool-test.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/exec-env.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/row-batch.cc
M be/src/runtime/row-batch.h
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/test-env.cc
M be/src/runtime/test-env.h
M be/src/runtime/tmp-file-mgr-test.cc
20 files changed, 2,931 insertions(+), 101 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/5811/8
-- 
To view, visit http://gerrit.cloudera.org:8080/5811
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7bb47a818564ac19a6be4ca88db0578f4ea0b709
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-4674: Part 1: port BufferedTupleStream to BufferPool

2017-03-06 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4674: Part 1: port BufferedTupleStream to BufferPool
..


Patch Set 6:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/5811/6/be/src/runtime/buffered-tuple-stream-v2.cc
File be/src/runtime/buffered-tuple-stream-v2.cc:

PS6, Line 112: DCHECK_LE(read_ptr_, read_end_ptr);
> would be nice to write this and line 106 the same to make it a little faste
Done


Line 252:   if (!read_page_->is_pinned()) {
> why does this not have to check if read and write page are the same?
I ended up reworking the buffer management as discussed to take advantage of 
the new pin-counting support. There are still some tricky corner cases but I 
think the invariants are clearer now, and we can provide stronger guarantees 
about being able to iterate through unpinned streams.


Line 276: if (!pinned_ && write_page_ != &pages_.front()) {
> this could use a comment:
Done


Line 332: && (&page == write_page_ || (read_write_ && &page == 
&*read_page_))) {
> why is this right to unpin the read_page_ if !read_write_?
This got reworked with the other changes


http://gerrit.cloudera.org:8080/#/c/5811/6/be/src/runtime/buffered-tuple-stream-v2.h
File be/src/runtime/buffered-tuple-stream-v2.h:

PS6, Line 517: CalcNumPinned
> this looks dead
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5811
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7bb47a818564ac19a6be4ca88db0578f4ea0b709
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


  1   2   3   4   5   6   7   8   9   10   >