date:20170522

[Impala-ASF-CR] IMPALA-5309: Adds TABLESAMPLE clause for HDFS table refs.

2017-05-22 Thread Alex Behm (Code Review)

Alex Behm has uploaded a new patch set (#5).

Change subject: IMPALA-5309: Adds TABLESAMPLE clause for HDFS table refs.
..

IMPALA-5309: Adds TABLESAMPLE clause for HDFS table refs.

Syntax:
 TABLESAMPLE SYSTEM() [REPEATABLE()]
The first number specifies the percent of table bytes to sample.
The second number specifies the random seed to use.

The sampling is coarse-grained. Impala keeps randomly adding
files to the sample until at least the desired percentage of
file bytes have been reached.

Examples:
SELECT * FROM t TABLESAMPLE SYSTEM(10)
SELECT * FROM t TABLESAMPLE SYSTEM(50) REPEATABLE(1234)

Testing:
- Added parser, analyser, planner, and end-to-end tests
- Private core/hdfs run passed

Change-Id: Ief112cfb1e4983c5d94c08696dc83da9ccf43f70
---
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CollectionTableRef.java
M fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
M fe/src/main/java/org/apache/impala/analysis/InlineViewRef.java
M fe/src/main/java/org/apache/impala/analysis/SingularRowSrcTableRef.java
M fe/src/main/java/org/apache/impala/analysis/TableRef.java
A fe/src/main/java/org/apache/impala/analysis/TableSampleClause.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/jflex/sql-scanner.flex
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/ParserTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
A tests/query_test/test_tablesample.py
20 files changed, 689 insertions(+), 69 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/6868/5
-- 
To view, visit http://gerrit.cloudera.org:8080/6868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ief112cfb1e4983c5d94c08696dc83da9ccf43f70
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Marcel Kornacker

[Impala-ASF-CR] IMPALA-5309: Adds TABLESAMPLE clause for HDFS table refs.

2017-05-22 Thread Alex Behm (Code Review)

Alex Behm has posted comments on this change.

Change subject: IMPALA-5309: Adds TABLESAMPLE clause for HDFS table refs.
..


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/6868/4/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java:

Line 809:   public HdfsPartition cloneNewFds(List newFds) {
> i don't think this is used anymore.
Done


http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 1961: List> allFiles =
> i'd say it's cleaner than partition cloning, because the way the cloning wo
Returning as Map now.


http://gerrit.cloudera.org:8080/#/c/6868/4/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 1940:* The 'percentBytes' parameter must be between 0 and 100.
> i'd also point out that it allocates something like 12+ bytes per existing 
That's definitely a legitimate concern. Added comment.

I also did another minor speed improvement to avoid unnecessarily doing a hash 
lookup to find the partition of each selected file.

I'll run some experiments to see how this function performs.


-- 
To view, visit http://gerrit.cloudera.org:8080/6868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ief112cfb1e4983c5d94c08696dc83da9ccf43f70
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-HasComments: Yes

[native-toolchain-CR] Remove libevent from toolchain

2017-05-22 Thread Henry Robinson (Code Review)

Henry Robinson has submitted this change and it was merged.

Change subject: Remove libevent from toolchain
..


Remove libevent from toolchain

Only Thrift ever needed libevent, and it no longer needs it, so save the
time building this unused dependency.

Change-Id: Icceedef1a6e4ee3e0f16a5e867b93925739b3c82
---
M buildall.sh
D source/libevent/build.sh
2 files changed, 0 insertions(+), 42 deletions(-)

Approvals:
  Henry Robinson: Verified
  Matthew Jacobs: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/6952
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Icceedef1a6e4ee3e0f16a5e867b93925739b3c82
Gerrit-PatchSet: 1
Gerrit-Project: native-toolchain
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong

[native-toolchain-CR] Remove libevent from toolchain

2017-05-22 Thread Henry Robinson (Code Review)

Henry Robinson has posted comments on this change.

Change subject: Remove libevent from toolchain
..


Patch Set 1: Verified+1

Passed a full toolchain build.

-- 
To view, visit http://gerrit.cloudera.org:8080/6952
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icceedef1a6e4ee3e0f16a5e867b93925739b3c82
Gerrit-PatchSet: 1
Gerrit-Project: native-toolchain
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No

[native-toolchain-CR] Remove redundant libevent flag when building Thrift

2017-05-22 Thread Henry Robinson (Code Review)

Henry Robinson has posted comments on this change.

Change subject: Remove redundant libevent flag when building Thrift
..


Patch Set 2: Verified+1

Passed a full toolchain build.

-- 
To view, visit http://gerrit.cloudera.org:8080/6904
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibf6ef921f4389bd875578cd681f008b86f633af8
Gerrit-PatchSet: 2
Gerrit-Project: native-toolchain
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-HasComments: No

[native-toolchain-CR] Remove redundant libevent flag when building Thrift

2017-05-22 Thread Henry Robinson (Code Review)

Henry Robinson has submitted this change and it was merged.

Change subject: Remove redundant libevent flag when building Thrift
..


Remove redundant libevent flag when building Thrift

--with-libevent was specified twice, and overridden by the second
  option. Libevent is not required for Thrift if the non-blocking server
  is not used.

Change-Id: Ibf6ef921f4389bd875578cd681f008b86f633af8
---
M buildall.sh
M source/thrift/build.sh
2 files changed, 1 insertion(+), 6 deletions(-)

Approvals:
  Henry Robinson: Verified
  Matthew Jacobs: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/6904
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ibf6ef921f4389bd875578cd681f008b86f633af8
Gerrit-PatchSet: 2
Gerrit-Project: native-toolchain
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Matthew Jacobs

[Impala-ASF-CR] IMPALA-5259: Add REFRESH FUNCTIONS statement

2017-05-22 Thread Dimitris Tsirogiannis (Code Review)

Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-5259: Add REFRESH FUNCTIONS  statement
..


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6878/8/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java
File fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java:

PS8, Line 613: refreshFunctions
I don't think this function correctly protects the db against concurrent 
modifications. Let me give you an example. Client 1 is running a 
refreshFunctions while Client 2 is calling addFunction on the same database. 
Let's assume that addFunction is executed sometime between L628 and 637 and 
adds a new function F1 to the database and HMS. The operation will succeed and 
the new function will be returned to Client 2. Then the synchronized block 
(L637- 674) is executed which will assume that F1 has been removed and will 
delete it from the database. If Client 1 then tries to run a statement that 
uses F1 an error will be thrown between the function is no longer there.


-- 
To view, visit http://gerrit.cloudera.org:8080/6878
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3625c88bb51cca833f3293c224d3f0feb00e6e0b
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5349: flaky NoDirsAllocationError test

2017-05-22 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5349: flaky NoDirsAllocationError test
..


Patch Set 1: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/6953
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I023193b9ad6c6ac0ee114ad77ddf04d7d7185809
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5349: flaky NoDirsAllocationError test

2017-05-22 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-5349: flaky NoDirsAllocationError test
..


IMPALA-5349: flaky NoDirsAllocationError test

The assertion was incorrect and racy - it is ok if the write error wins
the race with the Unpin() calls, causing them to fail.

Change-Id: I023193b9ad6c6ac0ee114ad77ddf04d7d7185809
Reviewed-on: http://gerrit.cloudera.org:8080/6953
Reviewed-by: Henry Robinson 
Reviewed-by: Dan Hecht 
Tested-by: Impala Public Jenkins
---
M be/src/runtime/buffered-block-mgr-test.cc
1 file changed, 4 insertions(+), 4 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Henry Robinson: Looks good to me, but someone else must approve
  Dan Hecht: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/6953
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I023193b9ad6c6ac0ee114ad77ddf04d7d7185809
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Impala Public Jenkins

[Impala-ASF-CR] IMPALA-1972/IMPALA-3882: Fix client request state map lock contention

2017-05-22 Thread Bharath Vissapragada (Code Review)

Bharath Vissapragada has uploaded a new patch set (#10).

Change subject: IMPALA-1972/IMPALA-3882: Fix client_request_state_map_lock_ 
contention
..

IMPALA-1972/IMPALA-3882: Fix client_request_state_map_lock_ contention

Holding client_request_state_map_lock_ and CRS::lock_ together in certain
paths could potentially block the impalad from registering new queries.
The most common occurrence of this is while loading the webpage of a
query while the query planning is still in progress. Since we hold the
CRS::lock_ during planning, it blocks the web page from loading which
inturn blocks incoming queries by holding client_request_state_map_lock_.

This patch makes client_request_state_map_lock_ a terminal lock so that
we don't have interleaving locking with CRS::lock_.

Testing: Tested it locally by adding a long sleep in
JniFrontend.createExecRequest() and still was able to refresh the web UI
and run parallel queries. Also added a custom cluster test that does the
same sequence of actions by injecting a metadata loading pause.

Change-Id: Ie44daa93e3ae4d04d091261f3ec4891caffe8026
---
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
A tests/custom_cluster/test_query_concurrency.py
M www/query_plan.tmpl
7 files changed, 164 insertions(+), 54 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/07/6707/10
-- 
To view, visit http://gerrit.cloudera.org:8080/6707
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie44daa93e3ae4d04d091261f3ec4891caffe8026
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson

[Impala-ASF-CR] IMPALA-1972/IMPALA-3882: Fix client request state map lock contention

2017-05-22 Thread Bharath Vissapragada (Code Review)

Bharath Vissapragada has posted comments on this change.

Change subject: IMPALA-1972/IMPALA-3882: Fix client_request_state_map_lock_ 
contention
..


Patch Set 9:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/6707/9/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

Line 607:   if (request_state->query_state() == 
beeswax::QueryState::CREATED) {
> as we discussed in person, here we'll return something like the caller does
Done


Line 607:   if (request_state->query_state() == 
beeswax::QueryState::CREATED) {
> Additionally, let's exercise this case in the new test.
Done


http://gerrit.cloudera.org:8080/#/c/6707/9/tests/custom_cluster/test_query_concurrency.py
File tests/custom_cluster/test_query_concurrency.py:

PS9, Line 75: 3
> that doesn't seem large enough. check_registered_queries() is allowed to wa
I don't think it waits 15s everytime. It waits for 15s overall. So it still 
returns before pause_injection time is hit (30s). Not sure I follow you. Did it 
fail for you locally? In any case, I made the pause 100s, we don't wait on it 
anyway.


-- 
To view, visit http://gerrit.cloudera.org:8080/6707
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie44daa93e3ae4d04d091261f3ec4891caffe8026
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5259: Add REFRESH FUNCTIONS statement

2017-05-22 Thread Taras Bobrovytsky (Code Review)

Taras Bobrovytsky has uploaded a new patch set (#8).

Change subject: IMPALA-5259: Add REFRESH FUNCTIONS  statement
..

IMPALA-5259: Add REFRESH FUNCTIONS  statement

Before this patch, Impala relied on INVALIDATE METADATA to load
externally added UDFs from HMS. The problem with this approach is that
INVALIDATE METADATA affects all databases and tables in the entire
cluster.

In this patch, we add a REFRESH FUNCTIONS  statement that reloads
the functions of a database from HMS. We return a list of updated and
removed db functions to the issuing Impalad in order to update its
local catalog cache.

Testing:
- Ran a private build which passed.

Change-Id: I3625c88bb51cca833f3293c224d3f0feb00e6e0b
---
M common/thrift/CatalogService.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ResetMetadataStmt.java
M fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java
M fe/src/main/java/org/apache/impala/catalog/Db.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/ImpaladCatalog.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/test/java/org/apache/impala/analysis/ParserTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M tests/custom_cluster/test_permanent_udfs.py
12 files changed, 333 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/78/6878/8
-- 
To view, visit http://gerrit.cloudera.org:8080/6878
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3625c88bb51cca833f3293c224d3f0feb00e6e0b
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Taras Bobrovytsky

[Impala-ASF-CR] IMPALA-5259: Add REFRESH FUNCTIONS statement

2017-05-22 Thread Taras Bobrovytsky (Code Review)

Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-5259: Add REFRESH FUNCTIONS  statement
..


Patch Set 7:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/6878/4/fe/src/main/java/org/apache/impala/analysis/ResetMetadataStmt.java
File fe/src/main/java/org/apache/impala/analysis/ResetMetadataStmt.java:

PS4, Line 38: invalidating the entire catalog
> "or refreshing database functions."
Done


PS4, Line 110: toSql()
> Don't we need to update this function?
Done


http://gerrit.cloudera.org:8080/#/c/6878/7/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java
File fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java:

PS7, Line 628: loadFunctionsFromDbParams(tmpDb, msDb);
 :   // Load Java UDFs from HMS into the temporary db.
 :   loadJavaFunctions(tmpDb, javaFns);
> Both these methods not only extract and load the functions into the target 
Done


PS7, Line 635: catalogLock_.writeLock().lock();
> This lock doesn't protect against concurrent operations on databases, hence
Done


http://gerrit.cloudera.org:8080/#/c/6878/7/fe/src/main/java/org/apache/impala/catalog/ImpaladCatalog.java
File fe/src/main/java/org/apache/impala/catalog/ImpaladCatalog.java:

PS7, Line 256: // Remove the function first, in case there is an existing 
function with the same
 : // name and signature.
 : removeFunction(catalogObject.getFn(), 
catalogObject.getCatalog_version());
> This is weird and not expected. Is this because addFunction doesn't replace
It doesn't replace it if its "indistinguishable". We don't check that the 
binary is the same or different.

How should we modify the addFunction call? Always remove the function there?


-- 
To view, visit http://gerrit.cloudera.org:8080/6878
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3625c88bb51cca833f3293c224d3f0feb00e6e0b
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5259: Add REFRESH FUNCTIONS statement

2017-05-22 Thread Taras Bobrovytsky (Code Review)

Hello Alex Behm,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6878

to look at the new patch set (#8).

Change subject: IMPALA-5259: Add REFRESH FUNCTIONS  statement
..

IMPALA-5259: Add REFRESH FUNCTIONS  statement

Before this patch, Impala relied on INVALIDATE METADATA to load
externally added UDFs from HMS. The problem with this approach is that
INVALIDATE METADATA affects all databases and tables in the entire
cluster.

In this patch, we add a REFRESH FUNCTIONS  statement that reloads
the functions of a database from HMS. We return a list of updated and
removed db functions to the issuing Impalad in order to update its
local catalog cache.

Testing:
- Ran a private build which passed.

Change-Id: I3625c88bb51cca833f3293c224d3f0feb00e6e0b
---
M common/thrift/CatalogService.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ResetMetadataStmt.java
M fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java
M fe/src/main/java/org/apache/impala/catalog/Db.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/ImpaladCatalog.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/test/java/org/apache/impala/analysis/ParserTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M tests/custom_cluster/test_permanent_udfs.py
12 files changed, 333 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/78/6878/8
-- 
To view, visit http://gerrit.cloudera.org:8080/6878
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3625c88bb51cca833f3293c224d3f0feb00e6e0b
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Taras Bobrovytsky

[Impala-ASF-CR] Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet"

2017-05-22 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change.

Change subject: Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp 
data in Parquet"
..


Patch Set 2: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/6896
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-HasComments: No

[Impala-ASF-CR] Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet"

2017-05-22 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has submitted this change and it was merged.

Change subject: Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp 
data in Parquet"
..


Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet"

Reverting IMPALA-2716 as SparkSQL does not agree with the approach
taken.

More details can be found at:
https://issues.apache.org/jira/browse/SPARK-12297

Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd
Reviewed-on: http://gerrit.cloudera.org:8080/6896
Reviewed-by: Dan Hecht 
Tested-by: Impala Public Jenkins
---
M be/src/benchmarks/CMakeLists.txt
D be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
D tests/query_test/test_parquet_timestamp_compatibility.py
28 files changed, 75 insertions(+), 850 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Dan Hecht: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/6896
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins

[Impala-ASF-CR] IMPALA-5164: Fix flaky benchmarks

2017-05-22 Thread Zach Amsden (Code Review)

Zach Amsden has uploaded a new patch set (#3).

Change subject: IMPALA-5164: Fix flaky benchmarks
..

IMPALA-5164: Fix flaky benchmarks

Improve benchmarks by detecting involuntary context switches.
If a server is heavily loaded due to some other jobs running,
benchmarking will not be reliable.  We can detect this by using
getrusage() to measure context switches.  If 90% or more of
benchmark time is lost due to switching, we abort the benchmark,
but catch the failure and exit silently.  We subtract out the
time during which we might have been switched out from the total
result time, and only count runs which were uninterrupted.

If 10% of the result is valid, that seems barely legitimate,
but we will warn when anything less than 50% of the benchmark
ran without switching.

Testing: ran the free-lists-benchmark until the first test passed
and then started "make -j 60" on an Impala directory.  Got this
result:

E0519 23:35:41.482010  8766 benchmark.cc:161] Benchmark failed to
complete due to context switching.
W0519 23:35:41.548840  8766 benchmark.cc:162] Exiting silently from
FreeLists 0 iters on 128 kb to avoid spurious test failure.

Also, fixed two benchmarks that had crashes on start due to static
initialization order problems and missing pieces of initialization.
Some benchmarks are still crashing (expr-benchmark and
network-perf-be), but those will require a lot more work to fix.

Change-Id: I7e0ddfaa9bd1340990ae2b6ff946d72bbd26c274
---
M be/src/benchmarks/free-lists-benchmark.cc
M be/src/benchmarks/hash-benchmark.cc
M be/src/benchmarks/lock-benchmark.cc
M be/src/util/benchmark.cc
M be/src/util/benchmark.h
5 files changed, 98 insertions(+), 37 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/35/6935/3
-- 
To view, visit http://gerrit.cloudera.org:8080/6935
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7e0ddfaa9bd1340990ae2b6ff946d72bbd26c274
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden

[Impala-ASF-CR] IMPALA-5347: Parquet scanner microoptimizations

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5347: Parquet scanner microoptimizations
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6950/4/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 979: Status HdfsParquetScanner::ResetScratchBatch() {
> Good point. The change may be an improvement, but I don't think we need to 
I'm fine with deferring it


-- 
To view, visit http://gerrit.cloudera.org:8080/6950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: anujphadke 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5347: Parquet scanner microoptimizations

2017-05-22 Thread Alex Behm (Code Review)

Alex Behm has posted comments on this change.

Change subject: IMPALA-5347: Parquet scanner microoptimizations
..


Patch Set 5: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6950/4/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 979: Status HdfsParquetScanner::ResetScratchBatch() {
> ScratchTupleBatch then would have to call out to HdfsScanNode::InitTuple().
Good point. The change may be an improvement, but I don't think we need to do 
that now. In particular, rebasing my scanner change is not going to be fun (but 
I'll do it if you prefer to make the change now).


-- 
To view, visit http://gerrit.cloudera.org:8080/6950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: anujphadke 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5164: Fix flaky benchmarks

2017-05-22 Thread Zach Amsden (Code Review)

Zach Amsden has posted comments on this change.

Change subject: IMPALA-5164: Fix flaky benchmarks
..


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6935/2/be/src/util/benchmark.cc
File be/src/util/benchmark.cc:

Line 62:   throw std::runtime_error("Benchmark failed to complete due to 
context switching.");
> Sounds like a Status might be a better fit then.
Making this return a status means we have to return iters / ms as a reference 
parameter or return via pointer.  This spoils the convention that such return 
parameters should be the last parameters in the functions parameter list 
because in this case we have default parameters that are overridden almost 
nowhere.

Not a big fan of exceptions, especially across modules, but this self contained 
single source file use case might be the lesser of two evils.

Happy to do it either way.


Line 69:   }
> You could maybe create a new stopwatch in that case?
Doing so requires throwing away the results of the first set of computations or 
more complicated math on the return value.  Neither seems especially better 
than what the code does currently.


Line 90:   batch_size = max(1, batch_size / (1+(ru_stop.ru_nivcsw - 
ru_start.ru_nivcsw)));
> Maybe comment why we're scaling it down here?
Done


PS2, Line 94: 10
> Can you add a comment along those lines for future reference then?
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6935
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7e0ddfaa9bd1340990ae2b6ff946d72bbd26c274
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4923: reduce memory transfer for selective scans

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4923: reduce memory transfer for selective scans
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6949/3/be/src/exec/parquet-scratch-tuple-batch.h
File be/src/exec/parquet-scratch-tuple-batch.h:

Line 53:   MemPool aux_mem_pool;
> Thanks for trying it. I'm now convinced that dealing with the var-len data 
Well, I think it's a good idea except the complicated memory transfer is 
getting in the way.


-- 
To view, visit http://gerrit.cloudera.org:8080/6949
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4923: reduce memory transfer for selective scans

2017-05-22 Thread Alex Behm (Code Review)

Alex Behm has posted comments on this change.

Change subject: IMPALA-4923: reduce memory transfer for selective scans
..


Patch Set 4: Code-Review+1

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6949/3/be/src/exec/parquet-scratch-tuple-batch.h
File be/src/exec/parquet-scratch-tuple-batch.h:

Line 53:   // batch, but not by future batches. E.g. a decompression buffer can 
be safely attached
> I started to implement this but realised there were some serious caveats to
Thanks for trying it. I'm now convinced that dealing with the var-len data is a 
bad idea. Too many implications and almost impossible to address all issues.


http://gerrit.cloudera.org:8080/#/c/6949/4/be/src/exec/parquet-scratch-tuple-batch.h
File be/src/exec/parquet-scratch-tuple-batch.h:

Line 52:   // for nested arrays. This memory may be referenced by previous 
batches or the current
Thanks. This is accurate and clear now.


-- 
To view, visit http://gerrit.cloudera.org:8080/6949
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-1972/IMPALA-3882: Fix client request state map lock contention

2017-05-22 Thread Dan Hecht (Code Review)

Dan Hecht has posted comments on this change.

Change subject: IMPALA-1972/IMPALA-3882: Fix client_request_state_map_lock_ 
contention
..


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6707/9/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

Line 607:   if (request_state->query_state() == 
beeswax::QueryState::CREATED) {
> as we discussed in person, here we'll return something like the caller does
Additionally, let's exercise this case in the new test.


-- 
To view, visit http://gerrit.cloudera.org:8080/6707
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie44daa93e3ae4d04d091261f3ec4891caffe8026
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5331: Use new libHDFS API to address "Unknown Error 255"

2017-05-22 Thread Dan Hecht (Code Review)

Dan Hecht has posted comments on this change.

Change subject: IMPALA-5331: Use new libHDFS API to address "Unknown Error 255"
..


Patch Set 7: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6894
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I181e316ed63b70b94d4f7a7557d398a931bb171d
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5347: Parquet scanner microoptimizations

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5347: Parquet scanner microoptimizations
..


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6950/4/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 979: Status HdfsParquetScanner::ResetScratchBatch() {
> Why not move this into ScratchTupleBatch, i.e. pass in the template tuple t
ScratchTupleBatch then would have to call out to HdfsScanNode::InitTuple(). I 
can do a larger restructure, e.g. moving InitTuple() into Tuple or similar if 
you think that will make things clearer. I think it's probably an improvement - 
just checking that you think that makes sense before doing it.


Line 983:   if (template_tuple_ == nullptr && tuple_byte_size_ <= 
CACHE_LINE_SIZE) {
> Not sure I completely understand the CACHE_LINE_SIZE check. We are zeroing 
Augmented the comment.

There's some cut-over where the old code is faster. E.g. if the tuple has 1000 
slots, it's probably better to zero out 125 bytes of null indicators row-by-row 
instead of zeroing out all the 1024 multi-kb rows.

I think this optimisation doesn't matter too much for tuples with more than a 
handful of slots, since the cost of materialization is high compared to the 
cost of zeroing things.


-- 
To view, visit http://gerrit.cloudera.org:8080/6950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: anujphadke 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5347: Parquet scanner microoptimizations

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has uploaded a new patch set (#5).

Change subject: IMPALA-5347: Parquet scanner microoptimizations
..

IMPALA-5347: Parquet scanner microoptimizations

A mix of microoptimizations that profiling the parquet scanner revealed.
All lead to some measurable improvement and added up to significant
speedups for various scans.

* Add ALWAYS_INLINE to hot functions that GCC was mistakenly not inlining
  in all cases.
* Apply __restrict__ in a few places so the compiler knows that it is
  safe to cache values accessed via those pointers
* memset() the whole batch instead of the null indicators is cases where
  it is almost certainly cheaper.
* Avoid updating two correlated loop variables in MaterializeValueBatch().
* Avoid unnecessary initialization of often-unused 'val' in ReadSlot().
* Shave a few instructions off the (still very expensive) bit unpacking
  and dict decoding logic.

Performance:

Some local TPC-H and targeted-perf benchmarks showed average speedups of
~5%.

I did some benchmarks targeted at measuring column materialisation
performance using a version of lineitem with duplicated data to make
it bigger. These queries all got significantly faster.

Dict-encoded DECIMAL: 2.23 -> 1.23s

  SELECT count(*) FROM biglineitem WHERE l_quantity > 49

Plain-encoded BIGINT: 2.33s -> 1.62s

  SELECT count(*) FROM biglineitem WHERE l_orderkey != 10

Dict-encoded STRING: 2.73s -> 1.72s

  SELECT count(*) FROM biglineitem WHERE l_returnflag = 'A'

Multiple columns: 5.15s -> 3.74s

  SELECT count(*) FROM biglineitem
  WHERE l_quantity > 49 and l_partkey != 199 and l_tax < 100

Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/parquet-column-readers.cc
M be/src/util/bit-stream-utils.inline.h
M be/src/util/bit-util.h
M be/src/util/dict-encoding.h
M be/src/util/rle-encoding.h
7 files changed, 74 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/6950/5
-- 
To view, visit http://gerrit.cloudera.org:8080/6950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: anujphadke

[Impala-ASF-CR] IMPALA-1972/IMPALA-3882: Fix client request state map lock contention

2017-05-22 Thread Dan Hecht (Code Review)

Dan Hecht has posted comments on this change.

Change subject: IMPALA-1972/IMPALA-3882: Fix client_request_state_map_lock_ 
contention
..


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6707/9/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

Line 607:   if (request_state->query_state() == 
beeswax::QueryState::CREATED) {
as we discussed in person, here we'll return something like the caller does. 
Actually, maybe we should just return an error Status with the message, and 
then let the caller put that string in the profile?


-- 
To view, visit http://gerrit.cloudera.org:8080/6707
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie44daa93e3ae4d04d091261f3ec4891caffe8026
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4923: reduce memory transfer for selective scans

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has uploaded a new patch set (#4).

Change subject: IMPALA-4923: reduce memory transfer for selective scans
..

IMPALA-4923: reduce memory transfer for selective scans

Most of the code changes are to restructure things so that the
scratch batch's tuple buffer is stored in a separate MemPool
from auxiliary memory such as decompression buffers. This part
of the change does not change the behaviour of the scanner in
itself, but allows us to recycle the tuple buffer without holding
onto unused auxiliary memory.

The optimisation is implemented in TryCompact(): if enough rows
were filtered out during the copy from the scratch batch to the
output batch, the fixed-length portions of the surviving rows
(if any) are copied to a new, smaller, buffer, and the original,
larger, buffer is reused for the next scratch batch.

Previously the large buffer was always attached to the output batch,
so a large buffer was transferred between threads for every scratch
batch processed. In combination with the decompression buffer change
in IMPALA-5304, this means that in many cases selective scans don't
produce nearly as many empty or near-empty batches and do not attach
nearly as much memory to each batch.

Performance:
Even on an 8 core machine I see some speedup on selective scans.
Profiling with "perf top" also showed that time in TCMalloc
was reduced - it went from several % of CPU time to a minimal
amount.

Running TPC-H on the same machine showed a ~5% overall improvement
and no regressions. E.g. Q6 got 20-25% faster.

I hope to do some additional cluster benchmarking on systems
with more cores to verify that the severe performance problems
there are fixed, but in the meantime it seems like we have enough
evidence that it will at least improve things.

Testing:
Add a couple of selective scans that exercise the new code paths.

Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-scratch-tuple-batch.h
M be/src/runtime/row-batch.cc
M be/src/runtime/row-batch.h
M testdata/workloads/functional-query/queries/QueryTest/parquet.test
6 files changed, 208 insertions(+), 56 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/6949/4
-- 
To view, visit http://gerrit.cloudera.org:8080/6949
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5338: Fix Kudu timestamp column default values

2017-05-22 Thread Matthew Jacobs (Code Review)

Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-5338: Fix Kudu timestamp column default values
..


Patch Set 1:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/6936/1/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

Line 108:   static TimestampValue FromUnixTimeMicros(int64_t unix_time_micros);
> I'm confused as to why it no longer takes two parameters (maybe its just th
I'll update the comment, but it doesn't really matter too much. the time_t was 
in seconds, and then there were fractional sec in microseconds, but it's more 
convenient to take a single int64 which is unix time in micros and convert it 
internally to the time_t w/ fractional part because there are two callers that 
both have unix_time already in microseconds.


http://gerrit.cloudera.org:8080/#/c/6936/1/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

Line 32:   int64_t ts_seconds = unix_time_micros / MICROS_PER_SEC;
> I don't understand what this is supposed to be doing - it looks like you're
integer division results in truncation, so this separates out the number of 
seconds and then the fractional seconds in microseconds. We do this to use the 
posix_time helper functions, i.e. there is no ptime constructor that just takes 
the unix time already in microseconds.


http://gerrit.cloudera.org:8080/#/c/6936/1/fe/src/main/java/org/apache/impala/catalog/KuduColumn.java
File fe/src/main/java/org/apache/impala/catalog/KuduColumn.java:

PS1, Line 125:  value
> extraneous word?
Done


http://gerrit.cloudera.org:8080/#/c/6936/1/testdata/workloads/functional-query/queries/QueryTest/kudu_create.test
File testdata/workloads/functional-query/queries/QueryTest/kudu_create.test:

Line 253:  ts2 timestamp default cast('00:00:00' as 
timestamp))
> I guess the intention here is that this will fail? Is there an error we can
whoops, thanks. This shouldn't be here, it's just going to be a FE test.


http://gerrit.cloudera.org:8080/#/c/6936/1/tests/query_test/test_kudu.py
File tests/query_test/test_kudu.py:

Line 364: schema_builder = SchemaBuilder()
> comment
Done


Line 658:   e TIMESTAMP NULL ENCODING AUTO_ENCODING COMPRESSION 
DEFAULT_COMPRESSION DEFAULT timestamp_from_unix_micros(%s),
> long line
The problem is I can't break this line as it needs to match the output exactly. 
I don't think it's worth something fancy like factoring out another variable 
for parameters and substituting them... I'll just add a comment.


-- 
To view, visit http://gerrit.cloudera.org:8080/6936
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I655910fb4805bb204a999627fa9f68e43ea8aaf2
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4848: Add WIDTH BUCKET() function

2017-05-22 Thread Alex Behm (Code Review)

Alex Behm has posted comments on this change.

Change subject: IMPALA-4848: Add WIDTH_BUCKET() function
..


Patch Set 4:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/6023/4/be/src/exprs/math-functions-ir.cc
File be/src/exprs/math-functions-ir.cc:

Line 31: #include "exprs/math-functions.h"
order alphabetically


Line 430:   Decimal16Value width_size, range_size;
move declarations closer to where they are used


Line 432:   //FE casts all the arguments to same time.
// FE casts all the arguments to the same type.


Line 433:   int input_scale = 
ctx->impl()->GetConstFnAttr(FunctionContextImpl::ARG_TYPE_SCALE,1);
formatting, space after ','


Line 434:   int input_precision = ctx->impl()->GetConstFnAttr(
formatting, space after ','


Line 442:   if(expr < min_range) {
single line and formatting (please fix formatting everywhere)


Line 447: if (num_buckets.val < std::numeric_limits::max()) {
How about we make the return value a BIGINT and simplify this code (no need to 
check for overflow)


Line 448:   return num_buckets.val + 1;
Create an IntVal, that's clearer to read


Line 457:   range_size = max_range.template Subtract<__int128_t>(input_scale, 
min_range,
use int128_t consistently


Line 460: ctx->SetError("Overflow while bucket_width evaluation");
Please make the error more specific, so we can see where the overflow happened.


Line 472:   int256_t x = ConvertToInt256(width_size.value() *  buckets.value());
Don't you need to convert the arguments before the multiplication?


Line 475:   // to avoid avoid computing reulting scale and precision
typo: resulting


Line 476:   int256_t y = 
DecimalUtil::MultiplyByScale(ConvertToInt256(range_size.value()),
The current approach seems fine, but let's avoid going to int256_t if we can. 
Operations with that type are very expensive.


http://gerrit.cloudera.org:8080/#/c/6023/4/be/src/exprs/math-functions.h
File be/src/exprs/math-functions.h:

Line 111: const IntVal& num_buckets);
weird indentation


-- 
To view, visit http://gerrit.cloudera.org:8080/6023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I081bc916b1bef7b929ca161a9aade3b54c6b858f
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: anujphadke 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: anujphadke 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4164: Use inline hints instead of always-inline

2017-05-22 Thread Michael Ho (Code Review)

Michael Ho has uploaded a new patch set (#5).

Change subject: IMPALA-4164: Use inline hints instead of always-inline
..

IMPALA-4164: Use inline hints instead of always-inline

When generating IR functions during codegen, we used to
always tag the functions with the "AlwaysInline" attribute.
That potentially leads to excessive inlining, leading to
very long optimization / compilation time with marginal
performance benefit at runtime. One of the reasons for doing
it was that the "target-cpu" and "target-features" attributes
are missing in the generated IR functions so the LLVM inliner
considers them incompatible with the cross-compiled functions.
As a result, the inliner will not inline the generated IR
functions into cross-compiled functions unless we put the
"AlwaysInline" attributes with the generated IR functions.

This change fixes the problem above by setting the "target-cpu"
and "target-features" attributes of all IR functions to match
that of of the host's CPUs so both generated IR functions and
cross-compiled functions will have the same values for those
attributes. With these attributes set, we now switch to using
"InlineHint" in place of "AlwaysInline" and let LLVM make the
decision of whether it's sensible to inline a function. With
this change, the codegen time of a query with a long predicate
went from 15s to 4s and the overall runtime went from 19s to 8s.

Change-Id: I2d87ae8d222b415587e7320cb9072e4a8d6615ce
---
M be/src/codegen/llvm-codegen.cc
M be/src/codegen/llvm-codegen.h
A testdata/workloads/targeted-perf/queries/primitive_long_predicate.test
3 files changed, 298 insertions(+), 7 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/41/6941/5
-- 
To view, visit http://gerrit.cloudera.org:8080/6941
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2d87ae8d222b415587e7320cb9072e4a8d6615ce
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-4164: Use inline hints instead of always-inline

2017-05-22 Thread Michael Ho (Code Review)

Michael Ho has posted comments on this change.

Change subject: IMPALA-4164: Use inline hints instead of always-inline
..


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6941/2/be/src/codegen/llvm-codegen.cc
File be/src/codegen/llvm-codegen.cc:

Line 163:   return Status::OK();
> We have __attribute__((target("avx2"))) on some bloom filter functions alre
Good point. They aren't cross-compiled yet but I verified the problem by 
cross-compiling them.

I took your suggestion to set all functions' attributes to be the one derived 
from 'cpu_attrs_'. Alternately, we may consider stripping the "target-features" 
attributes of the cross-compiled functions loaded but I don't see an interface 
defined in llvm::Function. IMHO, it may be clearer to set "target-features" 
attribute explicitly so we know they are the same for both hand-crafted IR and 
cross-compiled function.


Line 166: LlvmCodeGen::LlvmCodeGen(RuntimeState* state, ObjectPool* pool,
> Ah ok, maybe make the log text a bit more human-readable then?
I removed them now in the new patch as they would be redundant with cpu_attrs_.


-- 
To view, visit http://gerrit.cloudera.org:8080/6941
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2d87ae8d222b415587e7320cb9072e4a8d6615ce
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4164: Use inline hints instead of always-inline

2017-05-22 Thread Michael Ho (Code Review)

Michael Ho has uploaded a new patch set (#4).

Change subject: IMPALA-4164: Use inline hints instead of always-inline
..

IMPALA-4164: Use inline hints instead of always-inline

When generating IR functions during codegen, we used to
always tag the functions with the "AlwaysInline" attribute.
That potentially leads to excessive inlining, leading to
very long optimization / compilation time with marginal
performance benefit at runtime. One of the reasons for doing
it was that the "target-cpu" and "target-features" attributes
are missing in the generated IR functions so the LLVM inliner
considers them incompatible with the cross-compiled functions.
As a result, the inliner will not inline the generated IR
functions into cross-compiled functions unless we put the
"AlwaysInline" attributes with the generated IR functions.

This change fixes the problem above by setting the "target-cpu"
and "target-features" attributes of all IR functions to match
that of of the host's CPUs so both generated IR functions and
cross-compiled functions will have the same values for those
attributes. With these attributes set, we now switch to using
"InlineHint" in place of "AlwaysInline" and let LLVM make the
decision of whether it's sensible to inline a function. With
this change, the codegen time of a query with a long predicate
went from 15s to 4s and the overall runtime went from 19s to 8s.

Change-Id: I2d87ae8d222b415587e7320cb9072e4a8d6615ce
---
M be/src/codegen/llvm-codegen.cc
M be/src/codegen/llvm-codegen.h
A testdata/workloads/targeted-perf/queries/primitive_long_predicate.test
3 files changed, 298 insertions(+), 7 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/41/6941/4
-- 
To view, visit http://gerrit.cloudera.org:8080/6941
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2d87ae8d222b415587e7320cb9072e4a8d6615ce
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-4164: Use inline hints instead of always-inline

2017-05-22 Thread Michael Ho (Code Review)

Michael Ho has uploaded a new patch set (#3).

Change subject: IMPALA-4164: Use inline hints instead of always-inline
..

IMPALA-4164: Use inline hints instead of always-inline

When generating IR functions during codegen, we used to
always tag the functions with the "AlwaysInline" attribute.
That potentially leads to excessive inlining, leading to
very long optimization / compilation time with marginal
performance benefit at runtime. One of the reasons for doing
it was that the "target-cpu" and "target-features" attributes
are missing in the generated IR functions so the LLVM inliner
considers them incompatible with the cross-compiled functions.
As a result, the inliner will not inline the generated IR
functions into cross-compiled functions unless we put the
"AlwaysInline" attributes with the generated IR functions.

This change fixes the problem above by setting the "target-cpu"
and "target-features" attributes of all IR functions to match
that of of the host's CPUs so both generated IR functions and
cross-compiled functions will have the same values for those
attributes. With these attributes set, we now switch to using
"InlineHint" in place of "AlwaysInline" and let LLVM make the
decision of whether it's sensible to inline a function. With
this change, the codegen time of a query with a long predicate
went from 15s to 4s and the overall runtime went from 19s to 8s.

Change-Id: I2d87ae8d222b415587e7320cb9072e4a8d6615ce
---
M be/src/codegen/llvm-codegen.cc
M be/src/codegen/llvm-codegen.h
A testdata/workloads/targeted-perf/queries/primitive_long_predicate.test
3 files changed, 304 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/41/6941/3
-- 
To view, visit http://gerrit.cloudera.org:8080/6941
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2d87ae8d222b415587e7320cb9072e4a8d6615ce
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-4192: Disentangle Expr and ExprContext

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4192: Disentangle Expr and ExprContext
..


Patch Set 14:

(14 comments)

Next batch - made it through everything except the expr/ subdirectory.

http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/partitioned-aggregation-node.cc
File be/src/exec/partitioned-aggregation-node.cc:

Line 164: pool_->Add(build_expr);
Maybe stick with the usual idiom of wrapping pool_->Add() around the allocation?


Line 1574:   // TODO: consider moving the following codegen logic to AggFn.
This would make sense (not in this patch though).


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/partitioned-hash-join-builder.cc
File be/src/exec/partitioned-hash-join-builder.cc:

Line 109: // TODO: Move to Prepare().
Is this to do in this patch or are you planning to do it as part of a future 
JIRA?


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/partitioned-hash-join-builder.h
File be/src/exec/partitioned-hash-join-builder.h:

Line 243:   /// Init() function inherited from DataSink. Overridden to be a 
no-op for now.
This is a bit of a red flag to me - I think we should structure things so that 
we always call Init() on the parent class. Otherwise it's hard to reason about 
the possible states of the member variables of DataSink.


Line 381:   /// List of filters to build.
Mention that there's a correspondence with 'filter_exprs_'?


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

PS14, Line 95: nd
and


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/runtime/data-stream-sender.h
File be/src/runtime/data-stream-sender.h:

Line 49: /// TODO: Move this to the equivalent of PlanNode class for DataSink.
Is there a JIRA for this?


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/runtime/descriptors.cc
File be/src/runtime/descriptors.cc:

Line 486:   DCHECK(partition_expr->IsLiteral());
Can you add a comment in CatalogObjects.thrift that these are always literal 
exprs? That fact is explicit in the frontend but it wasn't obvious that the 
exprs sent from the frontend are always literals just by looking at the thrift 
structures.


Line 491: // literals Partition exprs are not used in the codegen case. 
Don't codegen
Missing .


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/runtime/runtime-filter.h
File be/src/runtime/runtime-filter.h:

Line 93:   const TRuntimeFilterDesc& filter_desc_;
Comment what owns the filter?


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

Line 105:   Status Init(ObjectPool* pool, RuntimeState* state, MemPool* 
mem_pool);
Comment what the arguments are for? The names could maybe be more descriptive 
too. E.g. to disambiguate from 'obj_pool_'. E.g. 'expr_evaluator_pool' and 
'expr_mem_pool'.

I think passing in 'state' is redundant too, since it's passed into the 
constructor.


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/service/fe-support.cc
File be/src/service/fe-support.cc:

PS14, Line 194: mem_pool
expr_mem_pool?


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/udf/udf.h
File be/src/udf/udf.h:

Line 111: /// TODO: Move FRAGMENT_LOCAL states to query_state for 
multi-threading.
Do we have a JIRA trackign this?


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/util/tuple-row-compare.h
File be/src/util/tuple-row-compare.h:

Line 82:   Status Open(ObjectPool* pool, RuntimeState* state, MemPool* 
mem_pool);
Similar to my comment in Sorter - it's unclear what these different arguments 
do and what the requirements are. E.g. what is allocated from 'mem_pool'.


-- 
To view, visit http://gerrit.cloudera.org:8080/5483
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iefdc9aeeba033355cb9497e3a5d2363627dcf2f3
Gerrit-PatchSet: 14
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5347: Parquet scanner microoptimizations

2017-05-22 Thread Alex Behm (Code Review)

Alex Behm has posted comments on this change.

Change subject: IMPALA-5347: Parquet scanner microoptimizations
..


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6950/4/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 979: Status HdfsParquetScanner::ResetScratchBatch() {
Why not move this into ScratchTupleBatch, i.e. pass in the template tuple there.


Line 983:   if (template_tuple_ == nullptr && tuple_byte_size_ <= 
CACHE_LINE_SIZE) {
Not sure I completely understand the CACHE_LINE_SIZE check. We are zeroing 
scratch_capacity * tuple_byte_size_ bytes of contiguous memory. Why not remove 
the second condition?


-- 
To view, visit http://gerrit.cloudera.org:8080/6950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: anujphadke 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4923: reduce memory transfer for selective scans

2017-05-22 Thread Alex Behm (Code Review)

Alex Behm has posted comments on this change.

Change subject: IMPALA-4923: reduce memory transfer for selective scans
..


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/6949/2/be/src/exec/parquet-scratch-tuple-batch.h
File be/src/exec/parquet-scratch-tuple-batch.h:

Line 130: // because the batch we returned earlier may hold a reference 
into 'tuple_mem'.
> I don't think this is a significant problem for performance - if the scan i
Thanks for explaining. Makes sense to me.


http://gerrit.cloudera.org:8080/#/c/6949/3/be/src/exec/parquet-scratch-tuple-batch.h
File be/src/exec/parquet-scratch-tuple-batch.h:

Line 53:   MemPool aux_mem_pool;
Another thought: How about splitting this pool up into memory that is only 
referenced by tuples in tuple_mem and memory that could be refd by 
already-returned batches.

For example, the decompression buffers live across multiple scratch/transfer 
rounds, but as far as I can tell all var-len memory that is allocated from this 
pool is only referenced by tuples in tuple_mem. I'd even be ok with naming the 
pools 'var_len_pool' and 'decompression_pool'. That would make it less 
mysterious what is inside them.

This change is mostly for understandability, but it would also allow us to 
avoid unnecessarily transferring var-len data (I know it's only a nice-to-have 
from a perf perspective).

People reading this code may have the same question as me regarding the var-len 
data.


Line 100:   void FinalizeTransfer(RowBatch* dst_batch, int num_to_commit) {
FinalizeTupleTransfer() is a little clearer because we typically mean resources 
in the context of "transfer"


Line 102: ++num_output_batches;
Wondering if we should make this "non-empty output batches".


Line 129: // Compaction is unsafe if the scratch batch was split across 
multiple output batches
Mind moving these two checks into the caller? Checking 'num_output_batches' in 
the same fn where it is updated makes it more obvious what it is used for.

This will also clarify in which situations TryCompact() can "fail", i.e., only 
when there is not enough mem


Line 160:   int64_t total_allocated_bytes() {
const


-- 
To view, visit http://gerrit.cloudera.org:8080/6949
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4623: Enable file handle cache

2017-05-22 Thread Joe McDonnell (Code Review)

Joe McDonnell has uploaded a new patch set (#11).

Change subject: IMPALA-4623: Enable file handle cache
..

IMPALA-4623: Enable file handle cache

Currently, every scan range maintains a file handle, even
when multiple scan ranges are accessing the same file.
Opening the file handles causes load on the NameNode, which
can lead to scaling issues.

There are two parts to this transaction:
1. Enable file handle caching by default
2. Share the file handle between scan ranges from the same
file

The scan range no longer maintains its own Hdfs file
handle. On each read, the io thread will get the Hdfs file
handle from the cache (opening it if necessary) and use
that for the read. This allows multiple scan ranges on the
same file to use the same file handle. Since the file
offsets are no longer consistent for an individual scan
range, all Hdfs reads need to either use hdfsPread or do
a seek before reading. Additionally, since Hdfs read
statistics are maintained on the file handle, the read
statistics must be retrieved and cleared after each read.

To manage contention, the file handle cache is now
partitioned by a hash of the key into independent
caches with independent locks. The allowed capacity
of the file handle cache is split evenly among the
partitions. File handles are evicted independently
for each partition. The file handle cache maintains
ownership of the file handles at all times, but it
will not evict a file handle that is in use.

If max_cached_file_handles is set to 0 or the the
scan range is accessing data cached by Hdfs, the
scan range will get a file handle from the cache
and hold it until the scan range is closed. This
mimics the existing behavior, except the file
handle stays in the cache and is owned by the
cache. Since it is in use, it will not be evicted.

If a file handle in the cache becomes invalid,
it may result in Read() calls failing. Consequently,
if Read() encounters an error using a file handle
from the cache, it will destroy the handle and
retry once with a new file handle. Any subsequent
error is unrelated to the file handle cache and
will be returned.

Tests:
query_test/test_hdfs_fd_caching.py copies the files from
an existing table into a new directory and uses that to
create an external table. It queries the external table,
then uses the hdfs commandline to manipulate the hdfs file
(delete, move, etc). It queries again to make sure we
don't crash. Then, it runs "invalidate metadata". It
checks the row counts before the modification and after
"invalidate metadata", but it does not check the results
in between.
custom_cluster/test_hdfs_fd_caching.py starts up a cluster
with a small file handle cache size. It verifies that a
file handle can be reused (i.e. rerunning a query does
not result in more file handles cached). It also verifies
that the cache capacity is enforced.

Change-Id: Ibe5ff60971dd653c3b6a0e13928cfa9fc59d078d
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/runtime/buffered-block-mgr.h
A be/src/runtime/disk-io-mgr-handle-cache.h
A be/src/runtime/disk-io-mgr-handle-cache.inline.h
M be/src/runtime/disk-io-mgr-internal.h
M be/src/runtime/disk-io-mgr-reader-context.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/disk-io-mgr.h
M tests/custom_cluster/test_hdfs_fd_caching.py
M tests/metadata/test_refresh_partition.py
M tests/query_test/test_hdfs_fd_caching.py
13 files changed, 815 insertions(+), 410 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/78/6478/11
-- 
To view, visit http://gerrit.cloudera.org:8080/6478
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibe5ff60971dd653c3b6a0e13928cfa9fc59d078d
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-4622: Add ALTER COLUMN statement.

2017-05-22 Thread Matthew Jacobs (Code Review)

Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-4622: Add ALTER COLUMN statement.
..


Patch Set 2:

To be a bit more prescriptive than my last comment:

Can you see how much of this new functionality you can actually accomplish 
using the change syntax given it's already there?

I think we'll have a harder time justifying this new syntax to accomplish 
something you can technically already do, even if it is superior. It'll require 
a bit more discussion.

-- 
To view, visit http://gerrit.cloudera.org:8080/6955
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id2e8bd65342b79644a0fdcd925e6f17797e89ad6
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-1972/IMPALA-3882: Fix client request state map lock contention

2017-05-22 Thread Dan Hecht (Code Review)

Dan Hecht has posted comments on this change.

Change subject: IMPALA-1972/IMPALA-3882: Fix client_request_state_map_lock_ 
contention
..


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6707/9/tests/custom_cluster/test_query_concurrency.py
File tests/custom_cluster/test_query_concurrency.py:

PS9, Line 75: 3
that doesn't seem large enough. check_registered_queries() is allowed to wait 
15 seconds each time, so the test can still pass  even if the bug exists, right?


-- 
To view, visit http://gerrit.cloudera.org:8080/6707
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie44daa93e3ae4d04d091261f3ec4891caffe8026
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4622: Add ALTER COLUMN statement.

2017-05-22 Thread Matthew Jacobs (Code Review)

Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-4622: Add ALTER COLUMN statement.
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6955/1//COMMIT_MSG
Commit Message:

Line 8: 
Need to explain why this diverges from ALTER TABLE CHANGE, since that syntax 
exists and appears to overlap.

>From the doc, it says:
ALTER TABLE name CHANGE column_name new_name new_type

However, this change led me to look at the code, and according to the parser it 
looks the syntax is actually:

ALTER TABLE name CHANGE column_name new_name column_def

column_def ::=
  ident_or_default:col_name type_def:type column_options_map:options
  {: RESULT = new ColumnDef(col_name, type, options); :}
  | ident_or_default:col_name type_def:type
  {: RESULT = new ColumnDef(col_name, type); :}
  ;


So then the column def options would already be possible to set with that 
syntax.

While I much much prefer this syntax you've added here, we need to re-evaluate 
if it makes sense to add it given this new information.


http://gerrit.cloudera.org:8080/#/c/6955/1/common/thrift/JniCatalog.thrift
File common/thrift/JniCatalog.thrift:

Line 74:   ALTER_COLUMN,
internally can we re-use change_column?


-- 
To view, visit http://gerrit.cloudera.org:8080/6955
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id2e8bd65342b79644a0fdcd925e6f17797e89ad6
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5349: flaky NoDirsAllocationError test

2017-05-22 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5349: flaky NoDirsAllocationError test
..


Patch Set 1:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/604/

-- 
To view, visit http://gerrit.cloudera.org:8080/6953
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I023193b9ad6c6ac0ee114ad77ddf04d7d7185809
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4192: Disentangle Expr and ExprContext

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4192: Disentangle Expr and ExprContext
..


Patch Set 14:

(26 comments)

First batch of comments - made it through exec/kudu*

http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/aggregation-node.cc
File be/src/exec/aggregation-node.cc:

Line 93: RETURN_IF_ERROR(build_expr->Init(*(intermediate_row_desc_.get()), 
state));
build_expr is leaked if this returns.


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/analytic-eval-node.cc
File be/src/exec/analytic-eval-node.cc:

This all works out a lot cleaner - nice!


Line 124: RETURN_IF_ERROR(AggFn::Create(analytic_node.analytic_functions[i],
I think there's a potential resource leak if the expr Init() function fails in 
certain ways. E.g. if it successfully got a lib cache entry, but failed later. 
We don't add it to 'analytic_fns_' so I don't think it can be closed.


Line 166:   fn_pool_.reset(new MemPool(expr_mem_tracker()));;
extra semicolon


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/data-sink.cc
File be/src/exec/data-sink.cc:

Line 181:   // TODO: codegen table sink
I'm not sure what this means. Is it something to be fixed in this patch? 
Otherwise could be just remove the TODO and file a JIRA if it's something that 
needs to be tracked.


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/data-sink.h
File be/src/exec/data-sink.h:

Line 126:   /// Not used in DataStreamSender and PHJBuilder.
Or NljBuilder. Maybe easier to just say that they're not used in all subclasses.


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/exec-node.cc
File be/src/exec/exec-node.cc:

PS14, Line 451: inline
Is the inline needed here?


Line 465: void ExecNode::FreeLocalAllocations() {
This helper function doesn't seem to help that much - consider inlining it.


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/exec-node.h
File be/src/exec/exec-node.h:

Line 261:   /// Conjuncts and their evaluators in this node.
Maybe mention who owns the pointers?


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/filter-context.cc
File be/src/exec/filter-context.cc:

PS14, Line 86: expr_evaluator->root()
Is the plan to eventually pass around the Expr itself and avoid going via the 
evaluator, or will we continue using this pattern? I don't feel that strongly 
either way, just curious.


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/filter-context.h
File be/src/exec/filter-context.h:

Line 120:   FilterContext() { }
nit: can just remove this and rely on the default constructor


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/hash-table.h
File be/src/exec/hash-table.h:

Line 115:   /// computing the size of a scratch row.
Maybe mention what 'pool' is for


Line 415:   Status Init(ObjectPool* pool, RuntimeState* state, int 
num_build_tuples);
What is pool used for?


Line 513:   MemPool* mem_pool_;
Mention that it's not owned?


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

PS14, Line 294:   
tab?


Line 733: vector& output_expr_evaluators)
Can't this be a const& still? The vector isn't being modified is it?


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/hdfs-rcfile-scanner.cc
File be/src/exec/hdfs-rcfile-scanner.cc:

Line 26: #include "exprs/scalar-expr.h"
Maybe not needed?


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

Line 147: // TODO: Move this to Prepare()
Is this TODO in this patch, or does it depend on another change?


Line 165: new RowDescriptor(min_max_tuple_desc_, /* is_nullable */ 
false));
This is a little inconsistent - above on line 121 we create the RowDescriptor 
on the stack but here it's heap-allocated. I think the stack allocation pattern 
is probably simpler.


PS14, Line 186: conjunct_evaluators_map_[entry.first]
[] has a side-effect of creating the element - maybe use find() instead for the 
DCHECK.


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

Line 146:   const std::vector 
min_max_conjunct_evaluators() const {
Shouldn't this return a const&, rather than copying the vector?


Line 393:   /// the per-scanner ScannerContext..
Add something like "These correspond to the exprs in 'filter_exprs_'"


http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

Line 362:   MemPool filter_mem_pool(expr_mem_tracker());
This is a little subtle - could do with a comment.


Line 388: for (auto& ctx: filter_ctxs) {
Maybe use a goto and an exit block at the bottom of the function to enforce 
that the cleanup happens on both paths.

[Impala-ASF-CR] IMPALA-5347: Parquet scanner microoptimizations

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5347: Parquet scanner microoptimizations
..


Patch Set 2:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 937: if (template_tuple_ == nullptr && tuple_byte_size_ <= 
CACHE_LINE_SIZE) {
> Might be worth putting this into an InitBatch() function.
Factored this out along with the above Reset() call, since they really go 
together.


http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 486: uint8_t val_buf[sizeof(T)]; // Uninitialized stack allocation for 
temporary value.
> Comment just seems to describe what this line is doing, but not why. Mentio
Done


http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/util/bit-stream-utils.inline.h
File be/src/util/bit-stream-utils.inline.h:

Line 91: ALWAYS_INLINE inline bool BitReader::GetValue(int num_bits, T* v) {
> remove inline?
I tried initially but it turns out the the always_inline attribute doesn't 
imply one-definition-rule linkage in the same way that "inline" - there was a 
compiler warning.

However, defining the function inside a class body implies "inline", so 
"inline" can be safely omitted in those cases.


http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

Line 291: ALWAYS_INLINE inline bool DictDecoder::GetNextValue(T* value) {
> remove inline?
The definition is outside the class so it's not actually redundant (see earlier 
comment).


Line 304: ALWAYS_INLINE inline bool DictDecoder::GetNextValue(
> remove inline?
The definition is outside the class so it's not actually redundant (see earlier 
comment).


http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

Line 251: ALWAYS_INLINE inline bool RleDecoder::Get(T* val) {
> remove inline?
The definition is outside the class so it's not actually redundant (see earlier 
comment).


Line 269: if (UNLIKELY(!bit_reader_.GetValue(bit_width_, val))) return 
false;
> Really a critical change? It changes the behavior, so might be dangerous / 
It made a small but measurable difference, but probably not enough to worry 
about and there is an edge case where this is less correct so I reverted it.

I think it's generally safe - it doesn't change the external behaviour except 
where the input data is malformed and the client repeatedly calls Get() after 
it returns false. Then maybe this GetValue() call could return false, but the 
one on line 280 will later return true and return some RLE-encoded data.


-- 
To view, visit http://gerrit.cloudera.org:8080/6950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: anujphadke 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4622: Add ALTER COLUMN statement.

2017-05-22 Thread Thomas Tauber-Marshall (Code Review)

Thomas Tauber-Marshall has uploaded a new patch set (#2).

Change subject: IMPALA-4622: Add ALTER COLUMN statement.
..

IMPALA-4622: Add ALTER COLUMN statement.

Kudu recently added the ability to alter a column's default value
and storage attributes (KUDU-861). This patch adds the ability to
modify these from Impala using ALTER.

It also supports altering a column's comment for non-Kudu tables.

Syntax:
ALTER TABLE  ALTER [COLUMN] 
  SET   [  [ ...]]
where  is one of:
  - DEFAULT, BLOCK_SIZE, ENCODING, COMPRESSION (Kudu tables)
  - COMMENT (non-Kudu tables)
ALTER TABLE  ALTER [COLUMN]  DROP DEFAULT

It does not support setting a column to be a primary key or
chaning a column's nullability, because those are not supported on
the Kudu side yet.

Testing:
- Added FE tests to ParserTest and AnalyzeDDLTest
- Added EE tests to kudu_alter.test and kudu_describe.test

Change-Id: Id2e8bd65342b79644a0fdcd925e6f17797e89ad6
---
M common/thrift/JniCatalog.thrift
M fe/src/main/cup/sql-parser.cup
A fe/src/main/java/org/apache/impala/analysis/AlterColumnStmt.java
M fe/src/main/java/org/apache/impala/analysis/ColumnDef.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/KuduCatalogOpExecutor.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/ParserTest.java
M testdata/workloads/functional-query/queries/QueryTest/alter-table.test
M testdata/workloads/functional-query/queries/QueryTest/kudu_alter.test
M testdata/workloads/functional-query/queries/QueryTest/kudu_describe.test
11 files changed, 299 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/55/6955/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6955
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id2e8bd65342b79644a0fdcd925e6f17797e89ad6
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Matthew Jacobs

[Impala-ASF-CR] Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet"

2017-05-22 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change.

Change subject: Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp 
data in Parquet"
..


Patch Set 2:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/603/

-- 
To view, visit http://gerrit.cloudera.org:8080/6896
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-HasComments: No

[Impala-ASF-CR] Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet"

2017-05-22 Thread Dan Hecht (Code Review)

Dan Hecht has posted comments on this change.

Change subject: Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp 
data in Parquet"
..


Patch Set 2: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6896
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5338: Fix Kudu timestamp column default values

2017-05-22 Thread Thomas Tauber-Marshall (Code Review)

Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-5338: Fix Kudu timestamp column default values
..


Patch Set 1:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/6936/1/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

Line 108:   static TimestampValue FromUnixTimeMicros(int64_t unix_time_micros);
I'm confused as to why it no longer takes two parameters (maybe its just that 
the comment needs to be updated?)


http://gerrit.cloudera.org:8080/#/c/6936/1/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

Line 32:   int64_t ts_seconds = unix_time_micros / MICROS_PER_SEC;
I don't understand what this is supposed to be doing - it looks like you're 
just separating 'unix_time_micros' into two parts and then adding them back 
together again.


http://gerrit.cloudera.org:8080/#/c/6936/1/fe/src/main/java/org/apache/impala/catalog/KuduColumn.java
File fe/src/main/java/org/apache/impala/catalog/KuduColumn.java:

PS1, Line 125:  value
extraneous word?


http://gerrit.cloudera.org:8080/#/c/6936/1/testdata/workloads/functional-query/queries/QueryTest/kudu_create.test
File testdata/workloads/functional-query/queries/QueryTest/kudu_create.test:

Line 253:  ts2 timestamp default cast('00:00:00' as 
timestamp))
I guess the intention here is that this will fail? Is there an error we can 
catch?


http://gerrit.cloudera.org:8080/#/c/6936/1/tests/query_test/test_kudu.py
File tests/query_test/test_kudu.py:

Line 364: schema_builder = SchemaBuilder()
comment


Line 658:   e TIMESTAMP NULL ENCODING AUTO_ENCODING COMPRESSION 
DEFAULT_COMPRESSION DEFAULT timestamp_from_unix_micros(%s),
long line


-- 
To view, visit http://gerrit.cloudera.org:8080/6936
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I655910fb4805bb204a999627fa9f68e43ea8aaf2
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5347: Parquet scanner microoptimizations

2017-05-22 Thread Alex Behm (Code Review)

Alex Behm has posted comments on this change.

Change subject: IMPALA-5347: Parquet scanner microoptimizations
..


Patch Set 2:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 937: if (template_tuple_ == nullptr && tuple_byte_size_ <= 
CACHE_LINE_SIZE) {
Might be worth putting this into an InitBatch() function.


http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 486: uint8_t val_buf[sizeof(T)]; // Uninitialized stack allocation for 
temporary value.
Comment just seems to describe what this line is doing, but not why. Mention 
explicitly what we are trying to avoid for a perf gain.


http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/util/bit-stream-utils.inline.h
File be/src/util/bit-stream-utils.inline.h:

Line 91: ALWAYS_INLINE inline bool BitReader::GetValue(int num_bits, T* v) {
remove inline?


http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

Line 291: ALWAYS_INLINE inline bool DictDecoder::GetNextValue(T* value) {
remove inline?


Line 304: ALWAYS_INLINE inline bool DictDecoder::GetNextValue(
remove inline?


http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

Line 251: ALWAYS_INLINE inline bool RleDecoder::Get(T* val) {
remove inline?


Line 269: if (UNLIKELY(!bit_reader_.GetValue(bit_width_, val))) return 
false;
Really a critical change? It changes the behavior, so might be dangerous / 
introduce a bug.


-- 
To view, visit http://gerrit.cloudera.org:8080/6950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: anujphadke 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5347: Parquet scanner microoptimizations

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5347: Parquet scanner microoptimizations
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 426: *num_values = (curr_tuple - tuple_mem) / tuple_size;
> Should we add a DCHECK to make sure that tuple_size is non-zero?
Added it. It was already a precondition for the function (can't materialize a 
value into a 0-length tuple) so it's safe.


-- 
To view, visit http://gerrit.cloudera.org:8080/6950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: anujphadke 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5347: Parquet scanner microoptimizations

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has uploaded a new patch set (#3).

Change subject: IMPALA-5347: Parquet scanner microoptimizations
..

IMPALA-5347: Parquet scanner microoptimizations

A mix of microoptimizations that profiling the parquet scanner revealed.
All lead to some measurable improvement and added up to significant
speedups for various scans.

* Add ALWAYS_INLINE to hot functions that GCC was mistakenly not inlining
  in all cases.
* Apply __restrict__ in a few places so the compiler knows that it is
  safe to cache values accessed via those pointers
* memset() the whole batch instead of the null indicators is cases where
  it is almost certainly cheaper.
* Avoid updating two correlated loop variables in MaterializeValueBatch().
* Avoid unnecessary initialization of often-unused 'val' in ReadSlot().
* Shave a few instructions off the (still very expensive) bit unpacking
  and dict decoding logic.

Performance:

Some local TPC-H and targeted-perf benchmarks showed average speedups of
~5%.

I did some benchmarks targeted at measuring column materialisation
performance using a version of lineitem with duplicated data to make
it bigger. These queries all got significantly faster.

Dict-encoded DECIMAL: 2.23 -> 1.23s

  SELECT count(*) FROM biglineitem WHERE l_quantity > 49

Plain-encoded BIGINT: 2.33s -> 1.62s

  SELECT count(*) FROM biglineitem WHERE l_orderkey != 10

Dict-encoded STRING: 2.73s -> 1.72s

  SELECT count(*) FROM biglineitem WHERE l_returnflag = 'A'

Multiple columns: 5.15s -> 3.74s

  SELECT count(*) FROM biglineitem
  WHERE l_quantity > 49 and l_partkey != 199 and l_tax < 100

Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/util/bit-stream-utils.inline.h
M be/src/util/bit-util.h
M be/src/util/dict-encoding.h
M be/src/util/rle-encoding.h
6 files changed, 56 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/6950/3
-- 
To view, visit http://gerrit.cloudera.org:8080/6950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: anujphadke

[Impala-ASF-CR] IMPALA-4923: reduce memory transfer for selective scans

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4923: reduce memory transfer for selective scans
..


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/6949/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 1031: DCHECK_EQ(0, scratch_batch_->total_allocated_bytes());
> Where do the decompression buffers get freed?
IMPALA-5304 made that redundant (I missed doing this cleanup in that patch).


http://gerrit.cloudera.org:8080/#/c/6949/2/be/src/exec/parquet-scratch-tuple-batch.h
File be/src/exec/parquet-scratch-tuple-batch.h:

PS2, Line 48: MemPool
> It's not clear from the var names comments where the var-len data goes. Tha
I elaborated on the comments to make it a bit more explicit.


Line 50:   // Pool used to accumulate other memory such as decompression 
buffers that may be
> may be referenced
Done


Line 109: dst_batch->tuple_data_pool()->AcquireData(_mem_pool, false);
> I would have thought that the var-len data like strings or collections can 
We need to transfer the allocations for var-len data regardless since we have 
no way to know for sure that previously-returned batches don't reference some 
of that data.

So we could still run into a similar problem with excessive transfer of var-len 
data in some cases but I think that's impossible to solve without larger 
changes to the memory ownership model because it would require more precise 
tracking of which buffers may be referenced by already-returned rows. Reducing 
the amount of fixed-length data transferred will help a lot regardless by 
reducing the overall volume of lock contention.

It's also counter-productive if the data is dictionary-encoded.


Line 130: if (num_output_batches > 1) return false;
> This new compaction has non-obvious caveats like this one, and I find the f
I don't think this is a significant problem for performance - if the scan is 
selective then the scratch batch will have many fewer rows than the output 
batches and only a handful of batches will hit this edge case.

Agree it's subtle but I think there will be subtleties regardless with our 
current memory transfer model. I tried to keep this encapsulated in 
FinalizeTransfer()/TryCompact()/Reset() so that this isn't spread out 
throughout the code.

I think the alternative has some issues that I would rather avoid. I think 
doing two passes over the batch will be slower since the loop in 
ProcessScratchBatch() is already tight.

It would also complicate the logic around 'tuple_mem' because compacted memory 
lives temporarily in the scratch batch before being moved to the destination 
batch - I think you would need additional state in the scratch batch to track 
both the compacted tuples and the tuple buffer that you're going to reuse.


Line 139: for (int i = dst_batch->num_rows(); i < end_row; ++i) {
> Don't we have a CopyRows() for this in RowBatch?
That seems to copy the tuple pointers rather than the tuples themselves.


-- 
To view, visit http://gerrit.cloudera.org:8080/6949
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4623: Enable file handle cache

2017-05-22 Thread Joe McDonnell (Code Review)

Joe McDonnell has posted comments on this change.

Change subject: IMPALA-4623: Enable file handle cache
..


Patch Set 10:

(3 comments)

About to post another upload.

http://gerrit.cloudera.org:8080/#/c/6478/8/be/src/runtime/disk-io-mgr-handle-cache.h
File be/src/runtime/disk-io-mgr-handle-cache.h:

Line 88:   /// is set to true. Otherwise, the partition will try to construct a 
file handle and
> true, but you're describing a scenario with concurrent queries and updates.
That makes sense. We don't provide guarantees, and it seems like an uncommon 
case. I will include this on the JIRA I file for the time based eviction.


http://gerrit.cloudera.org:8080/#/c/6478/8/be/src/runtime/disk-io-mgr-scan-range.cc
File be/src/runtime/disk-io-mgr-scan-range.cc:

Line 417: } else {
> so it's just num_bytes_read then. when i see 'last' or 'prev' or something 
That makes sense. I think this came about trying to avoid a naming collision 
with the function's argument "bytes_read". Since we already have bytes_read and 
bytes_read_, I switched this to current_bytes_read rather than num_bytes_read.


http://gerrit.cloudera.org:8080/#/c/6478/10/be/src/runtime/disk-io-mgr-scan-range.cc
File be/src/runtime/disk-io-mgr-scan-range.cc:

Line 352:   int success = hdfsGetHedgedReadMetrics(fs_, _metrics);
> you're getting metrics for the whole fs_, but every scan range does that on
This came in due to a rebase. IMPALA-4764 added this. I had to move it around 
because the code changed in Close(). It is setting the metric to this value 
rather than adding it.


-- 
To view, visit http://gerrit.cloudera.org:8080/6478
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibe5ff60971dd653c3b6a0e13928cfa9fc59d078d
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4923: reduce memory transfer for selective scans

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has uploaded a new patch set (#3).

Change subject: IMPALA-4923: reduce memory transfer for selective scans
..

IMPALA-4923: reduce memory transfer for selective scans

Most of the code changes are to restructure things so that the
scratch batch's tuple buffer is stored in a separate MemPool
from auxiliary memory such as decompression buffers. This part
of the change does not change the behaviour of the scanner in
itself, but allows us to recycle the tuple buffer without holding
onto unused auxiliary memory.

The optimisation is implemented in TryCompact(): if enough rows
were filtered out during the copy from the scratch batch to the
output batch, the fixed-length portions of the surviving rows
(if any) are copied to a new, smaller, buffer, and the original,
larger, buffer is reused for the next scratch batch.

Previously the large buffer was always attached to the output batch,
so a large buffer was transferred between threads for every scratch
batch processed. In combination with the decompression buffer change
in IMPALA-5304, this means that in many cases selective scans don't
produce nearly as many empty or near-empty batches and do not attach
nearly as much memory to each batch.

Performance:
Even on an 8 core machine I see some speedup on selective scans.
Profiling with "perf top" also showed that time in TCMalloc
was reduced - it went from several % of CPU time to a minimal
amount.

Running TPC-H on the same machine showed a ~5% overall improvement
and no regressions. E.g. Q6 got 20-25% faster.

I hope to do some additional cluster benchmarking on systems
with more cores to verify that the severe performance problems
there are fixed, but in the meantime it seems like we have enough
evidence that it will at least improve things.

Testing:
Add a couple of selective scans that exercise the new code paths.

Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-scratch-tuple-batch.h
M be/src/runtime/row-batch.cc
M be/src/runtime/row-batch.h
M testdata/workloads/functional-query/queries/QueryTest/parquet.test
6 files changed, 206 insertions(+), 56 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/6949/3
-- 
To view, visit http://gerrit.cloudera.org:8080/6949
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm

[Impala-ASF-CR] IMPALA-4622: Add ALTER COLUMN statement.

2017-05-22 Thread Thomas Tauber-Marshall (Code Review)

Thomas Tauber-Marshall has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/6955

Change subject: IMPALA-4622: Add ALTER COLUMN statement.
..

IMPALA-4622: Add ALTER COLUMN statement.

Kudu recently added the ability to alter a column's default value
and storage attributes (KUDU-861). This patch adds the ability to
modify these from Impala using ALTER.

It also supports altering a column's comment for non-Kudu tables.

Syntax:
ALTER TABLE  ALTER [COLUMN] 
  SET   [  [ ...]]
where  is one of:
  - DEFAULT, BLOCK_SIZE, ENCODING, COMPRESSION (Kudu tables)
  - COMMENT (non-Kudu tables)
ALTER TABLE  ALTER [COLUMN]  DROP DEFAULT

It does not support setting a column to be a primary key or
chaning a column's nullability, because those are not supported on
the Kudu side yet.

Testing:
- Added FE tests to ParserTest and AnalyzeDDLTest
- Added EE tests to kudu_alter.test and kudu_describe.test

Change-Id: Id2e8bd65342b79644a0fdcd925e6f17797e89ad6
---
M common/thrift/JniCatalog.thrift
M fe/src/main/cup/sql-parser.cup
A fe/src/main/java/org/apache/impala/analysis/AlterColumnStmt.java
M fe/src/main/java/org/apache/impala/analysis/ColumnDef.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/KuduCatalogOpExecutor.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/ParserTest.java
M testdata/workloads/functional-query/queries/QueryTest/alter-table.test
M testdata/workloads/functional-query/queries/QueryTest/kudu_alter.test
M testdata/workloads/functional-query/queries/QueryTest/kudu_describe.test
11 files changed, 299 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/55/6955/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6955
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Id2e8bd65342b79644a0fdcd925e6f17797e89ad6
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall

[Impala-ASF-CR] IMPALA-5347: Parquet scanner microoptimizations

2017-05-22 Thread anujphadke (Code Review)

anujphadke has posted comments on this change.

Change subject: IMPALA-5347: Parquet scanner microoptimizations
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 426: *num_values = (curr_tuple - tuple_mem) / tuple_size;
Should we add a DCHECK to make sure that tuple_size is non-zero?


-- 
To view, visit http://gerrit.cloudera.org:8080/6950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: anujphadke 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries

2017-05-22 Thread Zach Amsden (Code Review)

Zach Amsden has posted comments on this change.

Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries
..


Patch Set 7:

(17 comments)

Thanks for looking at this!

http://gerrit.cloudera.org:8080/#/c/6726/7//COMMIT_MSG
Commit Message:

Line 7: IMPALA-4864 Speed up single slot predicates with dictionaries
> The JIRA seems to be talking about doing something slightly different: taki
There is a bit of a disconnect with the JIRA as specified and the intention 
behind the JIRA, which had me initially working down the wrong path.  I 
realized that (x OP k) type predicates can only be evaluated for OP = or <=>, 
and started working on a way to identify such predicates in sub-expressions of 
the conjuncts.  It all seemed quite speculative and limited in applicability.  
Once can imagine expanding this to ordering predicates with sorted 
dictionaries, but as we don't have that yet, it would be building without a 
foundation.

After clarification from Marcel, it became apparent that what was intended was 
to be able to evaluate arbitrary single slot predicates against the dictionary 
(as we do this anyway for row-group filtering) and save those results to 
eliminate the cost of re-evaluating the same conjuncts again later.

The central difficulty of this problem is that the set of conjuncts which can 
be dictionary evaluated is not known until row group time.

This is the best proposal that I have come up with so far at attempting to 
solve that problem.

I agree we should get the design right on this one.  Doing this wrong could be 
costly in terms of either performance or complexity.


http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner-ir.cc
File be/src/exec/hdfs-parquet-scanner-ir.cc:

Line 40
> I believe this code was deliberately written like this to avoid the indirec
I have no problem reverting this.  Originally I had the two functions merged 
into one, but I didn't like imposing the overhead of checking the bitmap on 
this version.  While the code evolved, I had to put an interface on 
ScratchBatch just for sanity, as too many pieces of code were playing with its 
internals.

I think the interface is actually broad enough now to go back to the original 
code (minus the interference with internals).  I'll simply do this at the end:

scratch_batch_->Advance((scratch_tuple - scratch_tuple_start) / tuple_size);


Line 77:tuple_index = scratch_batch_->NextValidTuple()) {
> It seems like this might be slow for the case when many or all of the tuple
Your intuition and mine are similar.  I did my best to make the "simple" 
version of NextSetBit() as fast as possible, but I believe intrinsics might 
actually be warranted here.

Wanted to get the core ideas sound before going in that direction though.

Oh, and for reference, I also tried various other bitmap implementations - 
boost::dynamic_bitset and std::vector.  I didn't like the boost 
dependency (or the code too much) and the std::vector (specialized 
bit-vector) implementation compiled to assembler sent me running for the hills.

Branches, branches everywhere...


http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 935
> I'm not sure, but this hoisting may have been deliberate because the compil
It certainly shouldn't hurt to re-hoist


Line 208:   dict_filters_active_.reset(new 
bool[scanner_conjunct_ctxs_->size()]);
> This probably works fine in practice but afaik scoped_ptr will call free in
Will check


PS7, Line 1051: _
> nit: extraneous _
Originally a class member, until I realized this function had no access.  Will 
fix.


http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 317:   bool ReadValueBatch(ScratchTupleBatch& scratch, int* num_values) {
> I think it would be best to avoid using the ScratchTupleBatch abstraction i
My guess is that inlining the functions from the header should get about as 
optimal as possible.

Maybe hoisting 'max = scratch.capacity()' to tell the compiler it is unchanged 
throughout the loop, but other than that, I don't see much room for improvement 
that has been lost.  Let's see what perf results look like when that is ready.

I don't want to pass the bitmap directly, as I only want the bitmap to be 
touched (or even known about) by those pieces of code which actually are 
running in a filtered column reader.


Line 408:   if (IS_FILTERED && IS_DICT_ENCODED &&
> This patch pushes the predicate evaluation down into ReadFilteredSlot() so 
Perhaps there is a way to do this, but I don't see a clear path to that.  We 
have some choices, but they are all problematic:

Materialize codewords and values.  Where do the codewords go?  There is no 
space in the tuple.

Materialize codewords only.  What do we do when there is no dictionary?

Materialize

[native-toolchain-CR] Remove libevent from toolchain

2017-05-22 Thread Matthew Jacobs (Code Review)

Matthew Jacobs has posted comments on this change.

Change subject: Remove libevent from toolchain
..


Patch Set 1: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6952
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icceedef1a6e4ee3e0f16a5e867b93925739b3c82
Gerrit-PatchSet: 1
Gerrit-Project: native-toolchain
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4803: [DOCS] 2.9 release notes

2017-05-22 Thread John Russell (Code Review)

John Russell has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/6954

Change subject: IMPALA-4803: [DOCS] 2.9 release notes
..

IMPALA-4803: [DOCS] 2.9 release notes

Start with placeholder for 2.9 new features topic.

Initially just point to the changelog file.

Change-Id: I1f6cabc2427daf1243bd69dbed295c6923c4091b
---
M docs/impala_keydefs.ditamap
M docs/topics/impala_new_features.xml
2 files changed, 28 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/6954/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6954
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I1f6cabc2427daf1243bd69dbed295c6923c4091b
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell

[Impala-ASF-CR] IMPALA-5164: Fix flaky benchmarks

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5164: Fix flaky benchmarks
..


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/6935/2/be/src/util/benchmark.cc
File be/src/util/benchmark.cc:

Line 62:   throw std::runtime_error("Benchmark failed to complete due to 
context switching.");
> That is true, but having an error message and call graph doesn't help here.
Sounds like a Status might be a better fit then.


Line 69:   }
> I thought so as well, but there is no reset method currently, so this actua
You could maybe create a new stopwatch in that case?


PS2, Line 94: 10
> Arbitrary flake factor.  If we lose 90% of measurements, we bail.  This par
Can you add a comment along those lines for future reference then?


-- 
To view, visit http://gerrit.cloudera.org:8080/6935
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7e0ddfaa9bd1340990ae2b6ff946d72bbd26c274
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4923: reduce memory transfer for selective scans

2017-05-22 Thread Alex Behm (Code Review)

Alex Behm has posted comments on this change.

Change subject: IMPALA-4923: reduce memory transfer for selective scans
..


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/6949/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 1031: DCHECK_EQ(0, scratch_batch_->total_allocated_bytes());
Where do the decompression buffers get freed?


http://gerrit.cloudera.org:8080/#/c/6949/2/be/src/exec/parquet-scratch-tuple-batch.h
File be/src/exec/parquet-scratch-tuple-batch.h:

PS2, Line 48: MemPool
It's not clear from the var names comments where the var-len data goes. That's 
important to point out explicitly.


Line 50:   // Pool used to accumulate other memory such as decompression 
buffers that may be
may be referenced


Line 109: dst_batch->tuple_data_pool()->AcquireData(_mem_pool, false);
I would have thought that the var-len data like strings or collections can make 
up the bulk of memory that needs to be transferred, so why not deep-copy those 
out as well and avoid this transfer? What's the rationale behind only avoiding 
transferring the mem for the fixed-len portion?


Line 130: if (num_output_batches > 1) return false;
This new compaction has non-obvious caveats like this one, and I find the flow 
of memory difficult to follow now. I wonder if this process could be simplified 
if we did something along these lines:
1. Evaluate conjuncts over all tuples in scratch batch. Keep a bitmap of 
survivors.
2. Decide whether to compact scratch batch or not.
3. Transfer rows to output batch. When AtEnd() of the scratch batch, have a 
function TransferResources() or similar to transfer whatever the output batch 
needs. This may be the original memory or memory from compaction.

Let's discuss before you make any changes obviously :)


Line 139: for (int i = dst_batch->num_rows(); i < end_row; ++i) {
Don't we have a CopyRows() for this in RowBatch?


-- 
To view, visit http://gerrit.cloudera.org:8080/6949
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5164: Fix flaky benchmarks

2017-05-22 Thread Zach Amsden (Code Review)

Zach Amsden has posted comments on this change.

Change subject: IMPALA-5164: Fix flaky benchmarks
..


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/6935/2/be/src/util/benchmark.cc
File be/src/util/benchmark.cc:

Line 35: double Benchmark::Measure(BenchmarkFunction function, void* args,
> Can we change the behaviour so that benchmarks can opt in or out from the c
Definitely.  I noticed a few tests with more complicated behavior that 
definitely don't want this (unfortunately they crash instead of running).


Line 62:   throw std::runtime_error("Benchmark failed to complete due to 
context switching.");
> We usually use LOG(FATAL) for fatal errors or Status for non-fatal errors. 
That is true, but having an error message and call graph doesn't help here.  
Instead, I want to throw an error which can be caught and logged at the level 
where we have context on which benchmark was being run (see the sample error 
message in the summary).


Line 69:   }
> Maybe we should reset 'sw' and 'lost_time' after doing the tuning step so w
I thought so as well, but there is no reset method currently, so this actually 
seems easiest.


PS2, Line 94: 10
> Where does the 10 come from?
Arbitrary flake factor.  If we lose 90% of measurements, we bail.  This 
parameter just made the arithmetic easy.


Line 95: throw std::runtime_error("Benchmark failed to complete due to 
noisy measurements.");
> Same comment here about using Status/LOG(FATAL).
Same reason - we want context on which benchmark was failing, which we don't 
have here.


-- 
To view, visit http://gerrit.cloudera.org:8080/6935
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7e0ddfaa9bd1340990ae2b6ff946d72bbd26c274
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 
Gerrit-HasComments: Yes

[native-toolchain-CR] Remove redundant libevent flag when building Thrift

2017-05-22 Thread Henry Robinson (Code Review)

Henry Robinson has posted comments on this change.

Change subject: Remove redundant libevent flag when building Thrift
..


Patch Set 1:

Yep - mentioned in the commit msg "Libevent is not required for Thrift if the 
non-blocking server is not used.". We stopped using the non-blocking server a 
few years ago.

-- 
To view, visit http://gerrit.cloudera.org:8080/6904
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibf6ef921f4389bd875578cd681f008b86f633af8
Gerrit-PatchSet: 1
Gerrit-Project: native-toolchain
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-HasComments: No

[native-toolchain-CR] Remove redundant libevent flag when building Thrift

2017-05-22 Thread Matthew Jacobs (Code Review)

Matthew Jacobs has posted comments on this change.

Change subject: Remove redundant libevent flag when building Thrift
..


Patch Set 2:

I should have asked earlier - do you know why we stopped needing libevent for 
thrift?

-- 
To view, visit http://gerrit.cloudera.org:8080/6904
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibf6ef921f4389bd875578cd681f008b86f633af8
Gerrit-PatchSet: 2
Gerrit-Project: native-toolchain
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5349: flaky NoDirsAllocationError test

2017-05-22 Thread Henry Robinson (Code Review)

Henry Robinson has posted comments on this change.

Change subject: IMPALA-5349: flaky NoDirsAllocationError test
..


Patch Set 1: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/6953
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I023193b9ad6c6ac0ee114ad77ddf04d7d7185809
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Henry Robinson 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5085: large rows in BufferedTupleStreamV2

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5085: large rows in BufferedTupleStreamV2
..


Patch Set 5:

Rebased

-- 
To view, visit http://gerrit.cloudera.org:8080/6638
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2861c58efa7bc1aeaa5b7e2f043c97cb3985c8f5
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5349: flaky NoDirsAllocationError test

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/6953

Change subject: IMPALA-5349: flaky NoDirsAllocationError test
..

IMPALA-5349: flaky NoDirsAllocationError test

The assertion was incorrect and racy - it is ok if the write error wins
the race with the Unpin() calls, causing them to fail.

Change-Id: I023193b9ad6c6ac0ee114ad77ddf04d7d7185809
---
M be/src/runtime/buffered-block-mgr-test.cc
1 file changed, 4 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/53/6953/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6953
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I023193b9ad6c6ac0ee114ad77ddf04d7d7185809
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong

[Impala-ASF-CR] IMPALA-5085: large rows in BufferedTupleStreamV2

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has uploaded a new patch set (#5).

Change subject: IMPALA-5085: large rows in BufferedTupleStreamV2
..

IMPALA-5085: large rows in BufferedTupleStreamV2

The stream defaults to pages of default_page_len_. If a row doesn't
fit in that page, it will allocate another page up to max_page_len_
bytes and append a single row to that page, then immediately advance
to the next page. This means that when writing a stream, the large
page only needs to be kept in memory temporarily, which helps with
memory requirements.  E.g. consider a hash join that is repartitioning
1 unpinned stream into 16 unpinned streams. We will need
default_page_len_ * 15 + max_page_len_ * 2 bytes of reservation because
when processing a large row we only need one large write buffer at a
time.

The large row cases are not as optimised for memory consumption or
performance - queries processing very large numbers of large rows
are an extreme edge case that is likely to hit other performance
bottlenecks first. Pages with large rows can have up to 50%
internal fragmentation. I tried to avoid adding empty pages to the
stream, but it was tricky to avoid in the case of a read/write
stream where the read iterator was already pointing at the write
page - in that case we can end up with an empty page.

To avoid duplicating more logic between AddRow() and AllocateRow()
I restructured things so that AddRowSlow() is implemented in terms
of AllocateRowSlow(). AllocateRow() now takes a function as an
argument to populate the row.

Testing:
* Added tests for the case where 0 rows are added to the stream
* Extend BigRow to exercise the new code.
* Also test large strings and read/write streams.

Change-Id: I2861c58efa7bc1aeaa5b7e2f043c97cb3985c8f5
---
M be/src/runtime/buffered-tuple-stream-v2-test.cc
M be/src/runtime/buffered-tuple-stream-v2.cc
M be/src/runtime/buffered-tuple-stream-v2.h
M be/src/runtime/buffered-tuple-stream-v2.inline.h
M common/thrift/generate_error_codes.py
5 files changed, 544 insertions(+), 236 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/38/6638/5
-- 
To view, visit http://gerrit.cloudera.org:8080/6638
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2861c58efa7bc1aeaa5b7e2f043c97cb3985c8f5
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: Tim Armstrong

[native-toolchain-CR] Remove redundant libevent flag when building Thrift

2017-05-22 Thread Matthew Jacobs (Code Review)

Matthew Jacobs has posted comments on this change.

Change subject: Remove redundant libevent flag when building Thrift
..


Patch Set 2: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6904
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibf6ef921f4389bd875578cd681f008b86f633af8
Gerrit-PatchSet: 2
Gerrit-Project: native-toolchain
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4164: Use inline hints instead of always-inline

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4164: Use inline hints instead of always-inline
..


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6941/2/be/src/codegen/llvm-codegen.cc
File be/src/codegen/llvm-codegen.cc:

Line 163:   target_cpu_attr_ = 
fn->getFnAttribute("target-cpu").getValueAsString();
> When we add AVX2 instruction to cross-compiled code, wouldn't we need to ad
We have __attribute__((target("avx2"))) on some bloom filter functions already 
that I think get cross-compiled - that overrides whatever compilation flags 
were present. We could also emit AVX2 instructions as part of codegen, or 
inline cross-compiled functions with AVX2 into codegen'd functions.


Line 166:   LOG(INFO) << "target_features_attr: " << target_features_attr_;
> This is logged once during start-up. Would be helpful for debugging.
Ah ok, maybe make the log text a bit more human-readable then?


-- 
To view, visit http://gerrit.cloudera.org:8080/6941
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2d87ae8d222b415587e7320cb9072e4a8d6615ce
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[native-toolchain-CR] Remove redundant libevent flag when building Thrift

2017-05-22 Thread Henry Robinson (Code Review)

Henry Robinson has posted comments on this change.

Change subject: Remove redundant libevent flag when building Thrift
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6904/1/source/thrift/build.sh
File source/thrift/build.sh:

Line 35
> I don't think we need libevent otherwise, right? If not, please remove from
Done - did it in a separate commit in case we ever want to revert it.


-- 
To view, visit http://gerrit.cloudera.org:8080/6904
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibf6ef921f4389bd875578cd681f008b86f633af8
Gerrit-PatchSet: 1
Gerrit-Project: native-toolchain
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-HasComments: Yes

[native-toolchain-CR] Remove libevent from toolchain

2017-05-22 Thread Henry Robinson (Code Review)

Henry Robinson has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/6952

Change subject: Remove libevent from toolchain
..

Remove libevent from toolchain

Only Thrift ever needed libevent, and it no longer needs it, so save the
time building this unused dependency.

Change-Id: Icceedef1a6e4ee3e0f16a5e867b93925739b3c82
---
M buildall.sh
D source/libevent/build.sh
2 files changed, 0 insertions(+), 42 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/native-toolchain 
refs/changes/52/6952/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6952
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Icceedef1a6e4ee3e0f16a5e867b93925739b3c82
Gerrit-PatchSet: 1
Gerrit-Project: native-toolchain
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson

[native-toolchain-CR] Remove redundant libevent flag when building Thrift

2017-05-22 Thread Henry Robinson (Code Review)

Henry Robinson has uploaded a new patch set (#2).

Change subject: Remove redundant libevent flag when building Thrift
..

Remove redundant libevent flag when building Thrift

--with-libevent was specified twice, and overridden by the second
  option. Libevent is not required for Thrift if the non-blocking server
  is not used.

Change-Id: Ibf6ef921f4389bd875578cd681f008b86f633af8
---
M buildall.sh
M source/thrift/build.sh
2 files changed, 1 insertion(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/native-toolchain 
refs/changes/04/6904/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6904
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibf6ef921f4389bd875578cd681f008b86f633af8
Gerrit-PatchSet: 2
Gerrit-Project: native-toolchain
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Matthew Jacobs

[Impala-ASF-CR] Loads all TPC-DS tables

2017-05-22 Thread Mostafa Mokhtar (Code Review)

Mostafa Mokhtar has posted comments on this change.

Change subject: Loads all TPC-DS tables
..


Patch Set 3: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/6877
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic5277245fd20827c9c09ce5c1a7a37266ca476b9
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries
..


Patch Set 7:

(19 comments)

Looks promising. We're likely to be building more things on top of this, and 
the potential impact is pretty big if we made the right design decisions, so I 
want to make sure that we're making the right choices here - I think there are 
a few decision points we need to think through careful.

http://gerrit.cloudera.org:8080/#/c/6726/7//COMMIT_MSG
Commit Message:

Line 7: IMPALA-4864 Speed up single slot predicates with dictionaries
The JIRA seems to be talking about doing something slightly different: taking 
predicates of form (x OP constant), translating the constant into the codeword, 
then comparing dict_index directly to the codeword. That avoids any kind of 
dictionary or bitmap lookup and can be vectorized so can be very effective. 
There are various papers about this kind of trick - I can find links if you 
haven't seen them. I think maybe we need to split out this work into a separate 
JIRA.

Your current approach applies to a much larger range of predicates so is also 
very useful. It would be good to think about whether the infrastructure will 
support that kind of optimisation in the future. I think doing a more 
column-oriented approach might make it more natural to do multiple passes over 
the column.


Line 41: function per file format, so we would have to simulate a file format
Yeah that codegen'd function map is a bit of a mess.


http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner-ir.cc
File be/src/exec/hdfs-parquet-scanner-ir.cc:

Line 40
I believe this code was deliberately written like this to avoid the indirection 
via this->scratch_batch_ in the tight loop below - I think we should leave this 
alone.


Line 77:tuple_index = scratch_batch_->NextValidTuple()) {
It seems like this might be slow for the case when many or all of the tuples 
are valid, since NextSetBit() is fairly expensive compared to incrementing a 
counter. It's probably worth waiting to see performance numbers, but my 
intuition is that we'll need to make sure that advancing to the next bit is 
cheaper.


http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 935
I'm not sure, but this hoisting may have been deliberate because the compiler 
couldn't hoist the load via scratch_batch_ out of the loop


Line 208:   dict_filters_active_.reset(new 
bool[scanner_conjunct_ctxs_->size()]);
This probably works fine in practice but afaik scoped_ptr will call free 
instead of free[] on the array. I think unique_ptr does the right thing.


Line 865: // Legacy impala files cannot be eliminated here, because the 
only way to
I think we're missing an opportunity to apply the optimisation to columns with 
a mix of plain and dictionary-encoded pages.

My understanding is that the pre-existing dictionary filtering optimisation 
only works when the whole column is dictionary-encoded, but your new 
optimisation can be applied on a page-by-page basis.

I'm not sure how hard it is to restructure things to get it to work in that 
case. I think it's probably fairly low-impact because most columns will either 
be predominantly plain-encoded or predominantly dict-encoded, but we should add 
a comment so that the limitation is explicit.


PS7, Line 1051: _
nit: extraneous _


http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 436:   /// XXX Why is this a scoped_ptr ?
IIRC so we don't need to include the full header.


http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 317:   bool ReadValueBatch(ScratchTupleBatch& scratch, int* num_values) {
I think it would be best to avoid using the ScratchTupleBatch abstraction in 
this code - these functions are perf critical and it's easier to reason about 
performance when the abstraction is stripped away and the code is directly 
operating over contiguous arrays as much as possible.

Maybe just pass in the bitmap directly?


Line 408:   if (IS_FILTERED && IS_DICT_ENCODED &&
This patch pushes the predicate evaluation down into ReadFilteredSlot() so that 
it's done value-by-value as they are materialised.

I'm not convinced this is the best approach - I think we should consider 
peeling predicate evaluation out into a separate loop and doing it columnwise. 
This would mean that ReadSlot() or some variant would materialize the values 
and the codewords into an array, then we'd do another pass over the codeword 
array to evaluate the conjunctions. Or maybe materialize the codewords in one 
pass, then doing another pass to evaluate conjuncts, then another one to lookup 
the surviving values in the dictionary.

[Impala-ASF-CR] [DOCS] Add several items to "known issues" page

2017-05-22 Thread John Russell (Code Review)

John Russell has posted comments on this change.

Change subject: [DOCS] Add several items to "known issues" page
..


Patch Set 3:

I moved IMPALA-4432 into the "Incompatible Changes" page. The better to discuss 
the behavior change due to the fix, since it was found and fixed within the 
same release cycle.

-- 
To view, visit http://gerrit.cloudera.org:8080/5809
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeed92e8f6603b858622b1d397d1f5cad810f4cf
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell 
Gerrit-Reviewer: John Russell 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4164: Use inline hints instead of always-inline

2017-05-22 Thread Michael Ho (Code Review)

Michael Ho has posted comments on this change.

Change subject: IMPALA-4164: Use inline hints instead of always-inline
..


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6941/2/be/src/codegen/llvm-codegen.cc
File be/src/codegen/llvm-codegen.cc:

Line 163:   target_cpu_attr_ = 
fn->getFnAttribute("target-cpu").getValueAsString();
> This solves the inlining problem but the outcome is a bit counterintuitive 
When we add AVX2 instruction to cross-compiled code, wouldn't we need to add 
"-mavx2" to the compilation flag when cross-compiling the code ? In that case, 
wouldn't the lowest-common-denominator also include "avx2" ? May be I 
misunderstood the comment ?


Line 166:   LOG(INFO) << "target_features_attr: " << target_features_attr_;
> Should this logging be left in?
This is logged once during start-up. Would be helpful for debugging.


-- 
To view, visit http://gerrit.cloudera.org:8080/6941
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2d87ae8d222b415587e7320cb9072e4a8d6615ce
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] [DOCS] Add several items to "known issues" page

2017-05-22 Thread John Russell (Code Review)

Hello Michael Brown, Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/5809

to look at the new patch set (#3).

Change subject: [DOCS] Add several items to "known issues" page
..

[DOCS] Add several items to "known issues" page

IMPALA-4432 is actually an incompatible change, so it gets
listed on the separate "Incompatible Changes" page.
Change-Id: Ibeed92e8f6603b858622b1d397d1f5cad810f4cf
---
M docs/topics/impala_incompatible_changes.xml
M docs/topics/impala_known_issues.xml
2 files changed, 93 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/09/5809/3
-- 
To view, visit http://gerrit.cloudera.org:8080/5809
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibeed92e8f6603b858622b1d397d1f5cad810f4cf
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell 
Gerrit-Reviewer: John Russell 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-4164: Use inline hints instead of always-inline

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4164: Use inline hints instead of always-inline
..


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6941/2/be/src/codegen/llvm-codegen.cc
File be/src/codegen/llvm-codegen.cc:

Line 163:   target_cpu_attr_ = 
fn->getFnAttribute("target-cpu").getValueAsString();
This solves the inlining problem but the outcome is a bit counterintuitive - 
it's taking the target attributes from the IR module (which is the 
lowest-common-denominator set of features that each function was originally 
compiled for) and copying them to all the functions. 

I think this makes an incorrect assumption that all functions in the module 
have the same target attributes, which wouldn't be true for, say, AVX2-specific 
functions. So if one of those functions was in the module the outcome would 
depend on the (semi-arbitrary) order of the functions in the IR module.

At a minimum it needs some explanation, but I think it would make more sense to 
ensure that cross-compiled functions target the current machine, rather than 
the lowest-common-denominator. If I remember correctly, if the function has no 
attributes set then it defaults to the execution engine's, which match the 
current machine.

This would also potentially causes problems if we ever wanted to create 
functions with instructions like AVX2.


Line 166:   LOG(INFO) << "target_features_attr: " << target_features_attr_;
Should this logging be left in?


-- 
To view, visit http://gerrit.cloudera.org:8080/6941
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2d87ae8d222b415587e7320cb9072e4a8d6615ce
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5031: Initialize only parsing header before reading it

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5031: Initialize only_parsing_header_ before reading it
..


Patch Set 2: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6937
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0ddc496019941048b3e0775606fa5e8e3f9c075a
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Jim Apple 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5347: Parquet scanner microoptimizations

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has uploaded a new patch set (#2).

Change subject: IMPALA-5347: Parquet scanner microoptimizations
..

IMPALA-5347: Parquet scanner microoptimizations

A mix of microoptimizations that profiling the parquet scanner revealed.
All lead to some measurable improvement and added up to significant
speedups for various scans.

* Add ALWAYS_INLINE to hot functions that GCC was mistakenly not inlining
  in all cases.
* Apply __restrict__ in a few places so the compiler knows that it is
  safe to cache values accessed via those pointers
* memset() the whole batch instead of the null indicators is cases where
  it is almost certainly cheaper.
* Avoid updating two correlated loop variables in MaterializeValueBatch().
* Avoid unnecessary initialization of often-unused 'val' in ReadSlot().
* Shave a few instructions off the (still very expensive) bit unpacking
  and dict decoding logic.

Performance:

Some local TPC-H and targeted-perf benchmarks showed average speedups of
~5%.

I did some benchmarks targeted at measuring column materialisation
performance using a version of lineitem with duplicated data to make
it bigger. These queries all got significantly faster.

Dict-encoded DECIMAL: 2.23 -> 1.23s

  SELECT count(*) FROM biglineitem WHERE l_quantity > 49

Plain-encoded BIGINT: 2.33s -> 1.62s

  SELECT count(*) FROM biglineitem WHERE l_orderkey != 10

Dict-encoded STRING: 2.73s -> 1.72s

  SELECT count(*) FROM biglineitem WHERE l_returnflag = 'A'

Multiple columns: 5.15s -> 3.74s

  SELECT count(*) FROM biglineitem
  WHERE l_quantity > 49 and l_partkey != 199 and l_tax < 100

Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/util/bit-stream-utils.inline.h
M be/src/util/bit-util.h
M be/src/util/dict-encoding.h
M be/src/util/rle-encoding.h
6 files changed, 55 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/6950/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong

[Impala-ASF-CR] IMPALA-2522: Add doc for sortby() and clustered hints

2017-05-22 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-2522: Add doc for sortby() and clustered hints
..


IMPALA-2522: Add doc for sortby() and clustered hints

Add CLUSTERED hint.

Update hint syntax in INSERT topic.

Also modernize the hint syntax as shown under INSERT
to include the -- and /* */ formats also. List
the [] style last since it is the least-preferred
option.

Switch to preferring /* */ syntax for hints
instead of using the [ ] notation by default.

Finally, take out references to the SORTBY hint because
it didn't actually make it in. Intent for future is to have a way
to get this behavior without using a hint.
Change-Id: Id3c1da9a87ace361b096fa73d8504b2f54e75bed
Reviewed-on: http://gerrit.cloudera.org:8080/5655
Reviewed-by: John Russell 
Tested-by: Impala Public Jenkins
---
M docs/shared/impala_common.xml
M docs/topics/impala_hints.xml
M docs/topics/impala_insert.xml
3 files changed, 71 insertions(+), 28 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  John Russell: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/5655
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Id3c1da9a87ace361b096fa73d8504b2f54e75bed
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: John Russell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Brown

[Impala-ASF-CR] IMPALA-2522: Add doc for sortby() and clustered hints

2017-05-22 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-2522: Add doc for sortby() and clustered hints
..


Patch Set 8: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/5655
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id3c1da9a87ace361b096fa73d8504b2f54e75bed
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: John Russell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Brown 
Gerrit-HasComments: No

[Impala-ASF-CR] [DOCS] Add several items to "known issues" page

2017-05-22 Thread John Russell (Code Review)

John Russell has posted comments on this change.

Change subject: [DOCS] Add several items to "known issues" page
..


Patch Set 2:

Resolved conflict - brought in a couple of other known issues that were added 
in the meantime.

-- 
To view, visit http://gerrit.cloudera.org:8080/5809
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeed92e8f6603b858622b1d397d1f5cad810f4cf
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell 
Gerrit-Reviewer: John Russell 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No

[Impala-ASF-CR] [DOCS] Add several items to "known issues" page

2017-05-22 Thread John Russell (Code Review)

Hello Michael Brown, Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/5809

to look at the new patch set (#2).

Change subject: [DOCS] Add several items to "known issues" page
..

[DOCS] Add several items to "known issues" page

Change-Id: Ibeed92e8f6603b858622b1d397d1f5cad810f4cf
---
M docs/topics/impala_known_issues.xml
1 file changed, 95 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/09/5809/2
-- 
To view, visit http://gerrit.cloudera.org:8080/5809
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibeed92e8f6603b858622b1d397d1f5cad810f4cf
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell 
Gerrit-Reviewer: John Russell 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5164: Fix flaky benchmarks

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5164: Fix flaky benchmarks
..


Patch Set 2:

(6 comments)

Thanks for doing this. The overall idea looks good, just some comments about a 
few details.

http://gerrit.cloudera.org:8080/#/c/6935/2/be/src/util/benchmark.cc
File be/src/util/benchmark.cc:

Line 35: double Benchmark::Measure(BenchmarkFunction function, void* args,
Can we change the behaviour so that benchmarks can opt in or out from the 
context switch test?

The new behaviour makes sense if the benchmark is measuring single-threaded 
efficiency but doesn't make sense in some other cases. E.g. for lock-benchmark 
we want to understand what the OS scheduler does when there are more threads 
than logical cores contending for a lcok.


Line 62:   throw std::runtime_error("Benchmark failed to complete due to 
context switching.");
We usually use LOG(FATAL) for fatal errors or Status for non-fatal errors. I 
think either could make sense here.


Line 69:   }
Maybe we should reset 'sw' and 'lost_time' after doing the tuning step so we 
don't have unnecessary lost time.


Line 90:   batch_size = max(1, batch_size / (1+(ru_stop.ru_nivcsw - 
ru_start.ru_nivcsw)));
Maybe comment why we're scaling it down here?


PS2, Line 94: 10
Where does the 10 come from?


Line 95: throw std::runtime_error("Benchmark failed to complete due to 
noisy measurements.");
Same comment here about using Status/LOG(FATAL).


-- 
To view, visit http://gerrit.cloudera.org:8080/6935
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7e0ddfaa9bd1340990ae2b6ff946d72bbd26c274
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2522: Add doc for sortby() and clustered hints

2017-05-22 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-2522: Add doc for sortby() and clustered hints
..


Patch Set 8:

Build started: http://jenkins.impala.io:8080/job/gerrit-docs-submit/115/

-- 
To view, visit http://gerrit.cloudera.org:8080/5655
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id3c1da9a87ace361b096fa73d8504b2f54e75bed
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: John Russell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Brown 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2522: Add doc for sortby() and clustered hints

2017-05-22 Thread John Russell (Code Review)

John Russell has posted comments on this change.

Change subject: IMPALA-2522: Add doc for sortby() and clustered hints
..


Patch Set 8: Code-Review+2

Carrying forward +2 after resolving merge conflict.

-- 
To view, visit http://gerrit.cloudera.org:8080/5655
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id3c1da9a87ace361b096fa73d8504b2f54e75bed
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell 
Gerrit-Reviewer: John Russell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Brown 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5331: Use new libHDFS API to address "Unknown Error 255"

2017-05-22 Thread David Knupp (Code Review)

David Knupp has posted comments on this change.

Change subject: IMPALA-5331: Use new libHDFS API to address "Unknown Error 255"
..


Patch Set 7: Code-Review+1

Python test code looks  OK to me.

-- 
To view, visit http://gerrit.cloudera.org:8080/6894
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I181e316ed63b70b94d4f7a7557d398a931bb171d
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2522: Add doc for sortby() and clustered hints

2017-05-22 Thread John Russell (Code Review)

Hello Lars Volker,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/5655

to look at the new patch set (#8).

Change subject: IMPALA-2522: Add doc for sortby() and clustered hints
..

IMPALA-2522: Add doc for sortby() and clustered hints

Add CLUSTERED hint.

Update hint syntax in INSERT topic.

Also modernize the hint syntax as shown under INSERT
to include the -- and /* */ formats also. List
the [] style last since it is the least-preferred
option.

Switch to preferring /* */ syntax for hints
instead of using the [ ] notation by default.

Finally, take out references to the SORTBY hint because
it didn't actually make it in. Intent for future is to have a way
to get this behavior without using a hint.
Change-Id: Id3c1da9a87ace361b096fa73d8504b2f54e75bed
---
M docs/shared/impala_common.xml
M docs/topics/impala_hints.xml
M docs/topics/impala_insert.xml
3 files changed, 71 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/55/5655/8
-- 
To view, visit http://gerrit.cloudera.org:8080/5655
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id3c1da9a87ace361b096fa73d8504b2f54e75bed
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell 
Gerrit-Reviewer: John Russell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Brown

[Impala-ASF-CR] Parquet scanner microoptimizations

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/6950

Change subject: Parquet scanner microoptimizations
..

Parquet scanner microoptimizations

A mix of microoptimizations that profiling the parquet scanner revealed.
All lead to some measurable improvement and added up to significant
speedups for various scans.

* Add ALWAYS_INLINE to hot functions that GCC was mistakenly not inlining
  in all cases.
* Apply __restrict__ in a few places so the compiler knows that it is
  safe to cache values accessed via those pointers
* memset() the whole batch instead of the null indicators is cases where
  it is almost certainly cheaper.
* Avoid updating two correlated loop variables in MaterializeValueBatch().
* Avoid unnecessary initialization of often-unused 'val' in ReadSlot().
* Shave a few instructions off the (still very expensive) bit unpacking
  and dict decoding logic.

Performance:

Some local TPC-H and targeted-perf benchmarks showed average speedups of
~5%.

I did some benchmarks targeted at measuring column materialisation
performance using a version of lineitem with duplicated data to make
it bigger. These queries all got significantly faster.

Dict-encoded DECIMAL: 2.23 -> 1.23s

  SELECT count(*) FROM biglineitem WHERE l_quantity > 49

Plain-encoded BIGINT: 2.33s -> 1.62s

  SELECT count(*) FROM biglineitem WHERE l_orderkey != 10

Dict-encoded STRING: 2.73s -> 1.72s

  SELECT count(*) FROM biglineitem WHERE l_returnflag = 'A'

Multiple columns: 5.15s -> 3.74s

  SELECT count(*) FROM biglineitem
  WHERE l_quantity > 49 and l_partkey != 199 and l_tax < 100

Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/util/bit-stream-utils.inline.h
M be/src/util/bit-util.h
M be/src/util/dict-encoding.h
M be/src/util/rle-encoding.h
6 files changed, 55 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/6950/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong

[Impala-ASF-CR] Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet"

2017-05-22 Thread Attila Jeges (Code Review)

Attila Jeges has uploaded a new patch set (#2).

Change subject: Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp 
data in Parquet"
..

Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet"

Reverting IMPALA-2716 as SparkSQL does not agree with the approach
taken.

More details can be found at:
https://issues.apache.org/jira/browse/SPARK-12297

Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd
---
M be/src/benchmarks/CMakeLists.txt
D be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
D tests/query_test/test_parquet_timestamp_compatibility.py
28 files changed, 75 insertions(+), 850 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/96/6896/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6896
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht

[Impala-ASF-CR] Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet"

2017-05-22 Thread Attila Jeges (Code Review)

Attila Jeges has posted comments on this change.

Change subject: Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp 
data in Parquet"
..


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6896/1/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

Line 25: #include 
> why do we need these headers?
Removed the new includes and kept boost/date_time/local_time/local_time.hpp (it 
is required for  boost::posix_time)


http://gerrit.cloudera.org:8080/#/c/6896/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 123:   // Avro schema of this table if this is an Avro table, otherwise 
null. Set in load().
> what did these have to do with your change? are they just dead variables (i
Removed them, they were left behind accidentaly when I resolved the conflicts.


-- 
To view, visit http://gerrit.cloudera.org:8080/6896
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-HasComments: Yes

[Impala-ASF-CR] Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet"

2017-05-22 Thread Attila Jeges (Code Review)

Hello Dan Hecht,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6896

to look at the new patch set (#2).

Change subject: Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp 
data in Parquet"
..

Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet"

Reverting IMPALA-2716 as SparkSQL does not agree with the approach
taken.

More details can be found at:
https://issues.apache.org/jira/browse/SPARK-12297

Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd
---
M be/src/benchmarks/CMakeLists.txt
D be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
D tests/query_test/test_parquet_timestamp_compatibility.py
28 files changed, 75 insertions(+), 850 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/96/6896/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6896
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht

[Impala-ASF-CR] Add a script to test performance on a developer machine

2017-05-22 Thread Michael Brown (Code Review)

Michael Brown has posted comments on this change.

Change subject: Add a script to test performance on a developer machine
..


Patch Set 4:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6818/4/bin/single_node_perf_run.py
File bin/single_node_perf_run.py:

PS4, Line 110:   git_hash = sh.git("rev-parse", "HEAD").strip()
get_git_hash_for_name("HEAD") ?


PS4, Line 254:   parser.add_option("--no_load", dest="load", 
action="store_false", default=True,
 : help="use already-present workload 
databases")
 :   parser.add_option("--no_start_minicluster", 
dest="start_minicluster",
 : action="store_false", default=True,
 : help="use already-running minicluster")
It's confusing to me to have --no options that default to True that are 
stored_false. I can't keep track of what's going on and what the defaults will 
actually be. Are there ways you could make it less confusing? Can you reduce 
some of the negativity?


PS4, Line 262:   parser.set_usage("single_node_perf_run.py [options] git_hash_A 
[git_hash_B]\n\n"
 :"Compares the performance of Impala at two 
git hashes on\n"
 :"some standard benchmarks. Output is in\n"
 :"$IMPALA_HOME/perf_results/latest.\n\n"
 :"WARNING: This script will run git checkout. 
You should not touch\n"
 :"the tree while the script is running. You 
should start the script\n"
 :"from a clean git tree.\n\n"
 :"WARNING: Unless --no_load is used, this 
script calls load_data.py\n"
 :"which can overwrite your TPC-H and TPC-DS 
data.")
Nit: Consider a triple-quoted string and using textwrap.dedent().

  parser.set_usage(textwrap.dedent("""\
single_node_perf_run.py [options] git_hash_A [git_hash_B]


Compares the performance of Impala [etc]
""")


PS4, Line 296: main()
2 spaces.


-- 
To view, visit http://gerrit.cloudera.org:8080/6818
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I70ba7f3c28f612a370915615600bf8dcebcedbc9
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Jim Apple 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: Michael Brown 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4923: reduce memory transfer for selective scans

2017-05-22 Thread Tim Armstrong (Code Review)

Tim Armstrong has uploaded a new patch set (#2).

Change subject: IMPALA-4923: reduce memory transfer for selective scans
..

IMPALA-4923: reduce memory transfer for selective scans

Most of the code changes are to restructure things so that the
scratch batch's tuple buffer is stored in a separate MemPool
from auxiliary memory such as decompression buffers. This part
of the change does not change the behaviour of the scanner in
itself, but allows us to recycle the tuple buffer without holding
onto unused auxiliary memory.

The optimisation is implemented in TryCompact(): if enough rows
were filtered out during the copy from the scratch batch to the
output batch, the fixed-length portions of the surviving rows
(if any) are copied to a new, smaller, buffer, and the original,
larger, buffer is reused for the next scratch batch.

Previously the large buffer was always attached to the output batch,
so a large buffer was transferred between threads for every scratch
batch processed. In combination with the decompression buffer change
in IMPALA-5304, this means that in many cases selective scans don't
produce nearly as many empty or near-empty batches and do not attach
nearly as much memory to each batch.

Performance:
Even on an 8 core machine I see some speedup on selective scans.
Profiling with "perf top" also showed that time in TCMalloc
was reduced - it went from several % of CPU time to a minimal
amount.

Running TPC-H on the same machine showed a ~5% overall improvement
and no regressions. E.g. Q6 got 20-25% faster.

I hope to do some additional cluster benchmarking on systems
with more cores to verify that the severe performance problems
there are fixed, but in the meantime it seems like we have enough
evidence that it will at least improve things.

Testing:
Add a couple of selective scans that exercise the new code paths.

Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-scratch-tuple-batch.h
M be/src/runtime/row-batch.cc
M be/src/runtime/row-batch.h
M testdata/workloads/functional-query/queries/QueryTest/parquet.test
6 files changed, 205 insertions(+), 56 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/6949/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6949
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm

[Impala-ASF-CR] Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet"

2017-05-22 Thread Attila Jeges (Code Review)

Hello Dan Hecht,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6896

to look at the new patch set (#2).

Change subject: Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp 
data in Parquet"
..

Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet"

Reverting IMPALA-2716 as SparkSQL does not agree with the approach
taken.

More details can be found at:
https://issues.apache.org/jira/browse/SPARK-12297

Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd
---
M be/src/benchmarks/CMakeLists.txt
D be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
D tests/query_test/test_parquet_timestamp_compatibility.py
28 files changed, 84 insertions(+), 850 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/96/6896/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6896
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht

[Impala-ASF-CR] IMPALA-5331: Use new libHDFS API to address "Unknown Error 255"

2017-05-22 Thread Sailesh Mukil (Code Review)

Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-5331: Use new libHDFS API to address "Unknown Error 255"
..


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6894/6/tests/data_errors/test_data_errors.py
File tests/data_errors/test_data_errors.py:

Line 78: try:
> I wonder about stacking all of these asserts in one block. If any of them f
Yes, that makes sense. If it fails, we'd want to know why.
I've added a line that prints the stderr if it's not empty.

This should have sufficient information to be able to debug if any of these 
commands fail.

We'd want the safe mode to be turned off regardless of success or failure, 
since we don't want the effect of this test to leak into other tests if 
anything gets run after this.


Line 97:   # Confirm that we were able to get the root cause.
> If this is the happy path, this is fine. If the assert fails though, you mi
I've added a message that says "Couldn't turn Safe mode OFF".


-- 
To view, visit http://gerrit.cloudera.org:8080/6894
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I181e316ed63b70b94d4f7a7557d398a931bb171d
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5331: Use new libHDFS API to address "Unknown Error 255"

2017-05-22 Thread Sailesh Mukil (Code Review)

Hello Matthew Jacobs, Dan Hecht,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6894

to look at the new patch set (#7).

Change subject: IMPALA-5331: Use new libHDFS API to address "Unknown Error 255"
..

IMPALA-5331: Use new libHDFS API to address "Unknown Error 255"

We use the new libHDFS API hdfsGetLastExceptionRootCause() to return
the last seen HDFS error on that thread.

This patch depends on the recent HDFS commit:
https://github.com/apache/hadoop/commit/fda86ef2a32026c02d9b5d4cca1ecb7b4decd872

Testing: A test has been added which puts HDFS in safe mode and then
verifies that we see a 255 error with the root cause.

Change-Id: I181e316ed63b70b94d4f7a7557d398a931bb171d
---
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-util.cc
M tests/data_errors/test_data_errors.py
3 files changed, 44 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/94/6894/7
-- 
To view, visit http://gerrit.cloudera.org:8080/6894
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I181e316ed63b70b94d4f7a7557d398a931bb171d
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Sailesh Mukil

[Impala-ASF-CR] IMPALA-5166: clean up BufferPool counters

2017-05-22 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5166: clean up BufferPool counters
..


Patch Set 9: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/6690
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I34b7f4d94c3d396ac89026c7559d6b2c6e02697c
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No

1 2 >

1 - 100 of 107 matches

Mail list logo