[Impala-ASF-CR] IMPALA-5309: Adds TABLESAMPLE clause for HDFS table refs.
Alex Behm has uploaded a new patch set (#5). Change subject: IMPALA-5309: Adds TABLESAMPLE clause for HDFS table refs. .. IMPALA-5309: Adds TABLESAMPLE clause for HDFS table refs. Syntax: TABLESAMPLE SYSTEM() [REPEATABLE()] The first number specifies the percent of table bytes to sample. The second number specifies the random seed to use. The sampling is coarse-grained. Impala keeps randomly adding files to the sample until at least the desired percentage of file bytes have been reached. Examples: SELECT * FROM t TABLESAMPLE SYSTEM(10) SELECT * FROM t TABLESAMPLE SYSTEM(50) REPEATABLE(1234) Testing: - Added parser, analyser, planner, and end-to-end tests - Private core/hdfs run passed Change-Id: Ief112cfb1e4983c5d94c08696dc83da9ccf43f70 --- M fe/src/main/cup/sql-parser.cup M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CollectionTableRef.java M fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java M fe/src/main/java/org/apache/impala/analysis/InlineViewRef.java M fe/src/main/java/org/apache/impala/analysis/SingularRowSrcTableRef.java M fe/src/main/java/org/apache/impala/analysis/TableRef.java A fe/src/main/java/org/apache/impala/analysis/TableSampleClause.java M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/jflex/sql-scanner.flex M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java M fe/src/test/java/org/apache/impala/analysis/ParserTest.java M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java A testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test A tests/query_test/test_tablesample.py 20 files changed, 689 insertions(+), 69 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/6868/5 -- To view, visit http://gerrit.cloudera.org:8080/6868 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ief112cfb1e4983c5d94c08696dc83da9ccf43f70 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Marcel Kornacker
[Impala-ASF-CR] IMPALA-5309: Adds TABLESAMPLE clause for HDFS table refs.
Alex Behm has posted comments on this change. Change subject: IMPALA-5309: Adds TABLESAMPLE clause for HDFS table refs. .. Patch Set 3: (3 comments) http://gerrit.cloudera.org:8080/#/c/6868/4/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java: Line 809: public HdfsPartition cloneNewFds(List newFds) { > i don't think this is used anymore. Done http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: Line 1961: List> allFiles = > i'd say it's cleaner than partition cloning, because the way the cloning wo Returning as Map now. http://gerrit.cloudera.org:8080/#/c/6868/4/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: Line 1940:* The 'percentBytes' parameter must be between 0 and 100. > i'd also point out that it allocates something like 12+ bytes per existing That's definitely a legitimate concern. Added comment. I also did another minor speed improvement to avoid unnecessarily doing a hash lookup to find the partition of each selected file. I'll run some experiments to see how this function performs. -- To view, visit http://gerrit.cloudera.org:8080/6868 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ief112cfb1e4983c5d94c08696dc83da9ccf43f70 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Alex Behm Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Marcel Kornacker Gerrit-HasComments: Yes
[native-toolchain-CR] Remove libevent from toolchain
Henry Robinson has submitted this change and it was merged. Change subject: Remove libevent from toolchain .. Remove libevent from toolchain Only Thrift ever needed libevent, and it no longer needs it, so save the time building this unused dependency. Change-Id: Icceedef1a6e4ee3e0f16a5e867b93925739b3c82 --- M buildall.sh D source/libevent/build.sh 2 files changed, 0 insertions(+), 42 deletions(-) Approvals: Henry Robinson: Verified Matthew Jacobs: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/6952 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: Icceedef1a6e4ee3e0f16a5e867b93925739b3c82 Gerrit-PatchSet: 1 Gerrit-Project: native-toolchain Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong
[native-toolchain-CR] Remove libevent from toolchain
Henry Robinson has posted comments on this change. Change subject: Remove libevent from toolchain .. Patch Set 1: Verified+1 Passed a full toolchain build. -- To view, visit http://gerrit.cloudera.org:8080/6952 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icceedef1a6e4ee3e0f16a5e867b93925739b3c82 Gerrit-PatchSet: 1 Gerrit-Project: native-toolchain Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[native-toolchain-CR] Remove redundant libevent flag when building Thrift
Henry Robinson has posted comments on this change. Change subject: Remove redundant libevent flag when building Thrift .. Patch Set 2: Verified+1 Passed a full toolchain build. -- To view, visit http://gerrit.cloudera.org:8080/6904 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibf6ef921f4389bd875578cd681f008b86f633af8 Gerrit-PatchSet: 2 Gerrit-Project: native-toolchain Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Matthew Jacobs Gerrit-HasComments: No
[native-toolchain-CR] Remove redundant libevent flag when building Thrift
Henry Robinson has submitted this change and it was merged. Change subject: Remove redundant libevent flag when building Thrift .. Remove redundant libevent flag when building Thrift --with-libevent was specified twice, and overridden by the second option. Libevent is not required for Thrift if the non-blocking server is not used. Change-Id: Ibf6ef921f4389bd875578cd681f008b86f633af8 --- M buildall.sh M source/thrift/build.sh 2 files changed, 1 insertion(+), 6 deletions(-) Approvals: Henry Robinson: Verified Matthew Jacobs: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/6904 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: Ibf6ef921f4389bd875578cd681f008b86f633af8 Gerrit-PatchSet: 2 Gerrit-Project: native-toolchain Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Matthew Jacobs
[Impala-ASF-CR] IMPALA-5259: Add REFRESH FUNCTIONS statement
Dimitris Tsirogiannis has posted comments on this change. Change subject: IMPALA-5259: Add REFRESH FUNCTIONS statement .. Patch Set 8: (1 comment) http://gerrit.cloudera.org:8080/#/c/6878/8/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java File fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java: PS8, Line 613: refreshFunctions I don't think this function correctly protects the db against concurrent modifications. Let me give you an example. Client 1 is running a refreshFunctions while Client 2 is calling addFunction on the same database. Let's assume that addFunction is executed sometime between L628 and 637 and adds a new function F1 to the database and HMS. The operation will succeed and the new function will be returned to Client 2. Then the synchronized block (L637- 674) is executed which will assume that F1 has been removed and will delete it from the database. If Client 1 then tries to run a statement that uses F1 an error will be thrown between the function is no longer there. -- To view, visit http://gerrit.cloudera.org:8080/6878 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3625c88bb51cca833f3293c224d3f0feb00e6e0b Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras BobrovytskyGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5349: flaky NoDirsAllocationError test
Impala Public Jenkins has posted comments on this change. Change subject: IMPALA-5349: flaky NoDirsAllocationError test .. Patch Set 1: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/6953 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I023193b9ad6c6ac0ee114ad77ddf04d7d7185809 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Impala Public Jenkins Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5349: flaky NoDirsAllocationError test
Impala Public Jenkins has submitted this change and it was merged. Change subject: IMPALA-5349: flaky NoDirsAllocationError test .. IMPALA-5349: flaky NoDirsAllocationError test The assertion was incorrect and racy - it is ok if the write error wins the race with the Unpin() calls, causing them to fail. Change-Id: I023193b9ad6c6ac0ee114ad77ddf04d7d7185809 Reviewed-on: http://gerrit.cloudera.org:8080/6953 Reviewed-by: Henry RobinsonReviewed-by: Dan Hecht Tested-by: Impala Public Jenkins --- M be/src/runtime/buffered-block-mgr-test.cc 1 file changed, 4 insertions(+), 4 deletions(-) Approvals: Impala Public Jenkins: Verified Henry Robinson: Looks good to me, but someone else must approve Dan Hecht: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/6953 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: I023193b9ad6c6ac0ee114ad77ddf04d7d7185809 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-1972/IMPALA-3882: Fix client request state map lock contention
Bharath Vissapragada has uploaded a new patch set (#10). Change subject: IMPALA-1972/IMPALA-3882: Fix client_request_state_map_lock_ contention .. IMPALA-1972/IMPALA-3882: Fix client_request_state_map_lock_ contention Holding client_request_state_map_lock_ and CRS::lock_ together in certain paths could potentially block the impalad from registering new queries. The most common occurrence of this is while loading the webpage of a query while the query planning is still in progress. Since we hold the CRS::lock_ during planning, it blocks the web page from loading which inturn blocks incoming queries by holding client_request_state_map_lock_. This patch makes client_request_state_map_lock_ a terminal lock so that we don't have interleaving locking with CRS::lock_. Testing: Tested it locally by adding a long sleep in JniFrontend.createExecRequest() and still was able to refresh the web UI and run parallel queries. Also added a custom cluster test that does the same sequence of actions by injecting a metadata loading pause. Change-Id: Ie44daa93e3ae4d04d091261f3ec4891caffe8026 --- M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h A tests/custom_cluster/test_query_concurrency.py M www/query_plan.tmpl 7 files changed, 164 insertions(+), 54 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/07/6707/10 -- To view, visit http://gerrit.cloudera.org:8080/6707 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ie44daa93e3ae4d04d091261f3ec4891caffe8026 Gerrit-PatchSet: 10 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Bharath VissapragadaGerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson
[Impala-ASF-CR] IMPALA-1972/IMPALA-3882: Fix client request state map lock contention
Bharath Vissapragada has posted comments on this change. Change subject: IMPALA-1972/IMPALA-3882: Fix client_request_state_map_lock_ contention .. Patch Set 9: (3 comments) http://gerrit.cloudera.org:8080/#/c/6707/9/be/src/service/impala-server.cc File be/src/service/impala-server.cc: Line 607: if (request_state->query_state() == beeswax::QueryState::CREATED) { > as we discussed in person, here we'll return something like the caller does Done Line 607: if (request_state->query_state() == beeswax::QueryState::CREATED) { > Additionally, let's exercise this case in the new test. Done http://gerrit.cloudera.org:8080/#/c/6707/9/tests/custom_cluster/test_query_concurrency.py File tests/custom_cluster/test_query_concurrency.py: PS9, Line 75: 3 > that doesn't seem large enough. check_registered_queries() is allowed to wa I don't think it waits 15s everytime. It waits for 15s overall. So it still returns before pause_injection time is hit (30s). Not sure I follow you. Did it fail for you locally? In any case, I made the pause 100s, we don't wait on it anyway. -- To view, visit http://gerrit.cloudera.org:8080/6707 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie44daa93e3ae4d04d091261f3ec4891caffe8026 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Bharath VissapragadaGerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5259: Add REFRESH FUNCTIONS statement
Taras Bobrovytsky has uploaded a new patch set (#8). Change subject: IMPALA-5259: Add REFRESH FUNCTIONS statement .. IMPALA-5259: Add REFRESH FUNCTIONS statement Before this patch, Impala relied on INVALIDATE METADATA to load externally added UDFs from HMS. The problem with this approach is that INVALIDATE METADATA affects all databases and tables in the entire cluster. In this patch, we add a REFRESH FUNCTIONS statement that reloads the functions of a database from HMS. We return a list of updated and removed db functions to the issuing Impalad in order to update its local catalog cache. Testing: - Ran a private build which passed. Change-Id: I3625c88bb51cca833f3293c224d3f0feb00e6e0b --- M common/thrift/CatalogService.thrift M fe/src/main/cup/sql-parser.cup M fe/src/main/java/org/apache/impala/analysis/ResetMetadataStmt.java M fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java M fe/src/main/java/org/apache/impala/catalog/Db.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/ImpaladCatalog.java M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java M fe/src/main/java/org/apache/impala/service/Frontend.java M fe/src/test/java/org/apache/impala/analysis/ParserTest.java M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java M tests/custom_cluster/test_permanent_udfs.py 12 files changed, 333 insertions(+), 63 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/78/6878/8 -- To view, visit http://gerrit.cloudera.org:8080/6878 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3625c88bb51cca833f3293c224d3f0feb00e6e0b Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras BobrovytskyGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Taras Bobrovytsky
[Impala-ASF-CR] IMPALA-5259: Add REFRESH FUNCTIONS statement
Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-5259: Add REFRESH FUNCTIONS statement .. Patch Set 7: (5 comments) http://gerrit.cloudera.org:8080/#/c/6878/4/fe/src/main/java/org/apache/impala/analysis/ResetMetadataStmt.java File fe/src/main/java/org/apache/impala/analysis/ResetMetadataStmt.java: PS4, Line 38: invalidating the entire catalog > "or refreshing database functions." Done PS4, Line 110: toSql() > Don't we need to update this function? Done http://gerrit.cloudera.org:8080/#/c/6878/7/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java File fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java: PS7, Line 628: loadFunctionsFromDbParams(tmpDb, msDb); : // Load Java UDFs from HMS into the temporary db. : loadJavaFunctions(tmpDb, javaFns); > Both these methods not only extract and load the functions into the target Done PS7, Line 635: catalogLock_.writeLock().lock(); > This lock doesn't protect against concurrent operations on databases, hence Done http://gerrit.cloudera.org:8080/#/c/6878/7/fe/src/main/java/org/apache/impala/catalog/ImpaladCatalog.java File fe/src/main/java/org/apache/impala/catalog/ImpaladCatalog.java: PS7, Line 256: // Remove the function first, in case there is an existing function with the same : // name and signature. : removeFunction(catalogObject.getFn(), catalogObject.getCatalog_version()); > This is weird and not expected. Is this because addFunction doesn't replace It doesn't replace it if its "indistinguishable". We don't check that the binary is the same or different. How should we modify the addFunction call? Always remove the function there? -- To view, visit http://gerrit.cloudera.org:8080/6878 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3625c88bb51cca833f3293c224d3f0feb00e6e0b Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras BobrovytskyGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5259: Add REFRESH FUNCTIONS statement
Hello Alex Behm, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6878 to look at the new patch set (#8). Change subject: IMPALA-5259: Add REFRESH FUNCTIONS statement .. IMPALA-5259: Add REFRESH FUNCTIONS statement Before this patch, Impala relied on INVALIDATE METADATA to load externally added UDFs from HMS. The problem with this approach is that INVALIDATE METADATA affects all databases and tables in the entire cluster. In this patch, we add a REFRESH FUNCTIONS statement that reloads the functions of a database from HMS. We return a list of updated and removed db functions to the issuing Impalad in order to update its local catalog cache. Testing: - Ran a private build which passed. Change-Id: I3625c88bb51cca833f3293c224d3f0feb00e6e0b --- M common/thrift/CatalogService.thrift M fe/src/main/cup/sql-parser.cup M fe/src/main/java/org/apache/impala/analysis/ResetMetadataStmt.java M fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java M fe/src/main/java/org/apache/impala/catalog/Db.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/ImpaladCatalog.java M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java M fe/src/main/java/org/apache/impala/service/Frontend.java M fe/src/test/java/org/apache/impala/analysis/ParserTest.java M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java M tests/custom_cluster/test_permanent_udfs.py 12 files changed, 333 insertions(+), 63 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/78/6878/8 -- To view, visit http://gerrit.cloudera.org:8080/6878 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3625c88bb51cca833f3293c224d3f0feb00e6e0b Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras BobrovytskyGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Taras Bobrovytsky
[Impala-ASF-CR] Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet"
Impala Public Jenkins has posted comments on this change. Change subject: Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet" .. Patch Set 2: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/6896 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-HasComments: No
[Impala-ASF-CR] Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet"
Impala Public Jenkins has submitted this change and it was merged. Change subject: Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet" .. Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet" Reverting IMPALA-2716 as SparkSQL does not agree with the approach taken. More details can be found at: https://issues.apache.org/jira/browse/SPARK-12297 Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd Reviewed-on: http://gerrit.cloudera.org:8080/6896 Reviewed-by: Dan HechtTested-by: Impala Public Jenkins --- M be/src/benchmarks/CMakeLists.txt D be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M tests/common/impala_test_suite.py M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py M tests/metadata/test_ddl.py M tests/metadata/test_ddl_base.py D tests/query_test/test_parquet_timestamp_compatibility.py 28 files changed, 75 insertions(+), 850 deletions(-) Approvals: Impala Public Jenkins: Verified Dan Hecht: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/6896 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-5164: Fix flaky benchmarks
Zach Amsden has uploaded a new patch set (#3). Change subject: IMPALA-5164: Fix flaky benchmarks .. IMPALA-5164: Fix flaky benchmarks Improve benchmarks by detecting involuntary context switches. If a server is heavily loaded due to some other jobs running, benchmarking will not be reliable. We can detect this by using getrusage() to measure context switches. If 90% or more of benchmark time is lost due to switching, we abort the benchmark, but catch the failure and exit silently. We subtract out the time during which we might have been switched out from the total result time, and only count runs which were uninterrupted. If 10% of the result is valid, that seems barely legitimate, but we will warn when anything less than 50% of the benchmark ran without switching. Testing: ran the free-lists-benchmark until the first test passed and then started "make -j 60" on an Impala directory. Got this result: E0519 23:35:41.482010 8766 benchmark.cc:161] Benchmark failed to complete due to context switching. W0519 23:35:41.548840 8766 benchmark.cc:162] Exiting silently from FreeLists 0 iters on 128 kb to avoid spurious test failure. Also, fixed two benchmarks that had crashes on start due to static initialization order problems and missing pieces of initialization. Some benchmarks are still crashing (expr-benchmark and network-perf-be), but those will require a lot more work to fix. Change-Id: I7e0ddfaa9bd1340990ae2b6ff946d72bbd26c274 --- M be/src/benchmarks/free-lists-benchmark.cc M be/src/benchmarks/hash-benchmark.cc M be/src/benchmarks/lock-benchmark.cc M be/src/util/benchmark.cc M be/src/util/benchmark.h 5 files changed, 98 insertions(+), 37 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/35/6935/3 -- To view, visit http://gerrit.cloudera.org:8080/6935 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I7e0ddfaa9bd1340990ae2b6ff946d72bbd26c274 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach AmsdenGerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden
[Impala-ASF-CR] IMPALA-5347: Parquet scanner microoptimizations
Tim Armstrong has posted comments on this change. Change subject: IMPALA-5347: Parquet scanner microoptimizations .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/6950/4/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 979: Status HdfsParquetScanner::ResetScratchBatch() { > Good point. The change may be an improvement, but I don't think we need to I'm fine with deferring it -- To view, visit http://gerrit.cloudera.org:8080/6950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5347: Parquet scanner microoptimizations
Alex Behm has posted comments on this change. Change subject: IMPALA-5347: Parquet scanner microoptimizations .. Patch Set 5: Code-Review+1 (1 comment) http://gerrit.cloudera.org:8080/#/c/6950/4/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 979: Status HdfsParquetScanner::ResetScratchBatch() { > ScratchTupleBatch then would have to call out to HdfsScanNode::InitTuple(). Good point. The change may be an improvement, but I don't think we need to do that now. In particular, rebasing my scanner change is not going to be fun (but I'll do it if you prefer to make the change now). -- To view, visit http://gerrit.cloudera.org:8080/6950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5164: Fix flaky benchmarks
Zach Amsden has posted comments on this change. Change subject: IMPALA-5164: Fix flaky benchmarks .. Patch Set 2: (4 comments) http://gerrit.cloudera.org:8080/#/c/6935/2/be/src/util/benchmark.cc File be/src/util/benchmark.cc: Line 62: throw std::runtime_error("Benchmark failed to complete due to context switching."); > Sounds like a Status might be a better fit then. Making this return a status means we have to return iters / ms as a reference parameter or return via pointer. This spoils the convention that such return parameters should be the last parameters in the functions parameter list because in this case we have default parameters that are overridden almost nowhere. Not a big fan of exceptions, especially across modules, but this self contained single source file use case might be the lesser of two evils. Happy to do it either way. Line 69: } > You could maybe create a new stopwatch in that case? Doing so requires throwing away the results of the first set of computations or more complicated math on the return value. Neither seems especially better than what the code does currently. Line 90: batch_size = max(1, batch_size / (1+(ru_stop.ru_nivcsw - ru_start.ru_nivcsw))); > Maybe comment why we're scaling it down here? Done PS2, Line 94: 10 > Can you add a comment along those lines for future reference then? Done -- To view, visit http://gerrit.cloudera.org:8080/6935 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I7e0ddfaa9bd1340990ae2b6ff946d72bbd26c274 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach AmsdenGerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4923: reduce memory transfer for selective scans
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4923: reduce memory transfer for selective scans .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/6949/3/be/src/exec/parquet-scratch-tuple-batch.h File be/src/exec/parquet-scratch-tuple-batch.h: Line 53: MemPool aux_mem_pool; > Thanks for trying it. I'm now convinced that dealing with the var-len data Well, I think it's a good idea except the complicated memory transfer is getting in the way. -- To view, visit http://gerrit.cloudera.org:8080/6949 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4923: reduce memory transfer for selective scans
Alex Behm has posted comments on this change. Change subject: IMPALA-4923: reduce memory transfer for selective scans .. Patch Set 4: Code-Review+1 (2 comments) http://gerrit.cloudera.org:8080/#/c/6949/3/be/src/exec/parquet-scratch-tuple-batch.h File be/src/exec/parquet-scratch-tuple-batch.h: Line 53: // batch, but not by future batches. E.g. a decompression buffer can be safely attached > I started to implement this but realised there were some serious caveats to Thanks for trying it. I'm now convinced that dealing with the var-len data is a bad idea. Too many implications and almost impossible to address all issues. http://gerrit.cloudera.org:8080/#/c/6949/4/be/src/exec/parquet-scratch-tuple-batch.h File be/src/exec/parquet-scratch-tuple-batch.h: Line 52: // for nested arrays. This memory may be referenced by previous batches or the current Thanks. This is accurate and clear now. -- To view, visit http://gerrit.cloudera.org:8080/6949 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-1972/IMPALA-3882: Fix client request state map lock contention
Dan Hecht has posted comments on this change. Change subject: IMPALA-1972/IMPALA-3882: Fix client_request_state_map_lock_ contention .. Patch Set 9: (1 comment) http://gerrit.cloudera.org:8080/#/c/6707/9/be/src/service/impala-server.cc File be/src/service/impala-server.cc: Line 607: if (request_state->query_state() == beeswax::QueryState::CREATED) { > as we discussed in person, here we'll return something like the caller does Additionally, let's exercise this case in the new test. -- To view, visit http://gerrit.cloudera.org:8080/6707 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie44daa93e3ae4d04d091261f3ec4891caffe8026 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Bharath VissapragadaGerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5331: Use new libHDFS API to address "Unknown Error 255"
Dan Hecht has posted comments on this change. Change subject: IMPALA-5331: Use new libHDFS API to address "Unknown Error 255" .. Patch Set 7: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/6894 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I181e316ed63b70b94d4f7a7557d398a931bb171d Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Sailesh MukilGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: David Knupp Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5347: Parquet scanner microoptimizations
Tim Armstrong has posted comments on this change. Change subject: IMPALA-5347: Parquet scanner microoptimizations .. Patch Set 4: (2 comments) http://gerrit.cloudera.org:8080/#/c/6950/4/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 979: Status HdfsParquetScanner::ResetScratchBatch() { > Why not move this into ScratchTupleBatch, i.e. pass in the template tuple t ScratchTupleBatch then would have to call out to HdfsScanNode::InitTuple(). I can do a larger restructure, e.g. moving InitTuple() into Tuple or similar if you think that will make things clearer. I think it's probably an improvement - just checking that you think that makes sense before doing it. Line 983: if (template_tuple_ == nullptr && tuple_byte_size_ <= CACHE_LINE_SIZE) { > Not sure I completely understand the CACHE_LINE_SIZE check. We are zeroing Augmented the comment. There's some cut-over where the old code is faster. E.g. if the tuple has 1000 slots, it's probably better to zero out 125 bytes of null indicators row-by-row instead of zeroing out all the 1024 multi-kb rows. I think this optimisation doesn't matter too much for tuples with more than a handful of slots, since the cost of materialization is high compared to the cost of zeroing things. -- To view, visit http://gerrit.cloudera.org:8080/6950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5347: Parquet scanner microoptimizations
Tim Armstrong has uploaded a new patch set (#5). Change subject: IMPALA-5347: Parquet scanner microoptimizations .. IMPALA-5347: Parquet scanner microoptimizations A mix of microoptimizations that profiling the parquet scanner revealed. All lead to some measurable improvement and added up to significant speedups for various scans. * Add ALWAYS_INLINE to hot functions that GCC was mistakenly not inlining in all cases. * Apply __restrict__ in a few places so the compiler knows that it is safe to cache values accessed via those pointers * memset() the whole batch instead of the null indicators is cases where it is almost certainly cheaper. * Avoid updating two correlated loop variables in MaterializeValueBatch(). * Avoid unnecessary initialization of often-unused 'val' in ReadSlot(). * Shave a few instructions off the (still very expensive) bit unpacking and dict decoding logic. Performance: Some local TPC-H and targeted-perf benchmarks showed average speedups of ~5%. I did some benchmarks targeted at measuring column materialisation performance using a version of lineitem with duplicated data to make it bigger. These queries all got significantly faster. Dict-encoded DECIMAL: 2.23 -> 1.23s SELECT count(*) FROM biglineitem WHERE l_quantity > 49 Plain-encoded BIGINT: 2.33s -> 1.62s SELECT count(*) FROM biglineitem WHERE l_orderkey != 10 Dict-encoded STRING: 2.73s -> 1.72s SELECT count(*) FROM biglineitem WHERE l_returnflag = 'A' Multiple columns: 5.15s -> 3.74s SELECT count(*) FROM biglineitem WHERE l_quantity > 49 and l_partkey != 199 and l_tax < 100 Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5 --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/parquet-column-readers.cc M be/src/util/bit-stream-utils.inline.h M be/src/util/bit-util.h M be/src/util/dict-encoding.h M be/src/util/rle-encoding.h 7 files changed, 74 insertions(+), 35 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/6950/5 -- To view, visit http://gerrit.cloudera.org:8080/6950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke
[Impala-ASF-CR] IMPALA-1972/IMPALA-3882: Fix client request state map lock contention
Dan Hecht has posted comments on this change. Change subject: IMPALA-1972/IMPALA-3882: Fix client_request_state_map_lock_ contention .. Patch Set 9: (1 comment) http://gerrit.cloudera.org:8080/#/c/6707/9/be/src/service/impala-server.cc File be/src/service/impala-server.cc: Line 607: if (request_state->query_state() == beeswax::QueryState::CREATED) { as we discussed in person, here we'll return something like the caller does. Actually, maybe we should just return an error Status with the message, and then let the caller put that string in the profile? -- To view, visit http://gerrit.cloudera.org:8080/6707 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie44daa93e3ae4d04d091261f3ec4891caffe8026 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Bharath VissapragadaGerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4923: reduce memory transfer for selective scans
Tim Armstrong has uploaded a new patch set (#4). Change subject: IMPALA-4923: reduce memory transfer for selective scans .. IMPALA-4923: reduce memory transfer for selective scans Most of the code changes are to restructure things so that the scratch batch's tuple buffer is stored in a separate MemPool from auxiliary memory such as decompression buffers. This part of the change does not change the behaviour of the scanner in itself, but allows us to recycle the tuple buffer without holding onto unused auxiliary memory. The optimisation is implemented in TryCompact(): if enough rows were filtered out during the copy from the scratch batch to the output batch, the fixed-length portions of the surviving rows (if any) are copied to a new, smaller, buffer, and the original, larger, buffer is reused for the next scratch batch. Previously the large buffer was always attached to the output batch, so a large buffer was transferred between threads for every scratch batch processed. In combination with the decompression buffer change in IMPALA-5304, this means that in many cases selective scans don't produce nearly as many empty or near-empty batches and do not attach nearly as much memory to each batch. Performance: Even on an 8 core machine I see some speedup on selective scans. Profiling with "perf top" also showed that time in TCMalloc was reduced - it went from several % of CPU time to a minimal amount. Running TPC-H on the same machine showed a ~5% overall improvement and no regressions. E.g. Q6 got 20-25% faster. I hope to do some additional cluster benchmarking on systems with more cores to verify that the severe performance problems there are fixed, but in the meantime it seems like we have enough evidence that it will at least improve things. Testing: Add a couple of selective scans that exercise the new code paths. Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-scratch-tuple-batch.h M be/src/runtime/row-batch.cc M be/src/runtime/row-batch.h M testdata/workloads/functional-query/queries/QueryTest/parquet.test 6 files changed, 208 insertions(+), 56 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/6949/4 -- To view, visit http://gerrit.cloudera.org:8080/6949 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5338: Fix Kudu timestamp column default values
Matthew Jacobs has posted comments on this change. Change subject: IMPALA-5338: Fix Kudu timestamp column default values .. Patch Set 1: (6 comments) http://gerrit.cloudera.org:8080/#/c/6936/1/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 108: static TimestampValue FromUnixTimeMicros(int64_t unix_time_micros); > I'm confused as to why it no longer takes two parameters (maybe its just th I'll update the comment, but it doesn't really matter too much. the time_t was in seconds, and then there were fractional sec in microseconds, but it's more convenient to take a single int64 which is unix time in micros and convert it internally to the time_t w/ fractional part because there are two callers that both have unix_time already in microseconds. http://gerrit.cloudera.org:8080/#/c/6936/1/be/src/runtime/timestamp-value.inline.h File be/src/runtime/timestamp-value.inline.h: Line 32: int64_t ts_seconds = unix_time_micros / MICROS_PER_SEC; > I don't understand what this is supposed to be doing - it looks like you're integer division results in truncation, so this separates out the number of seconds and then the fractional seconds in microseconds. We do this to use the posix_time helper functions, i.e. there is no ptime constructor that just takes the unix time already in microseconds. http://gerrit.cloudera.org:8080/#/c/6936/1/fe/src/main/java/org/apache/impala/catalog/KuduColumn.java File fe/src/main/java/org/apache/impala/catalog/KuduColumn.java: PS1, Line 125: value > extraneous word? Done http://gerrit.cloudera.org:8080/#/c/6936/1/testdata/workloads/functional-query/queries/QueryTest/kudu_create.test File testdata/workloads/functional-query/queries/QueryTest/kudu_create.test: Line 253: ts2 timestamp default cast('00:00:00' as timestamp)) > I guess the intention here is that this will fail? Is there an error we can whoops, thanks. This shouldn't be here, it's just going to be a FE test. http://gerrit.cloudera.org:8080/#/c/6936/1/tests/query_test/test_kudu.py File tests/query_test/test_kudu.py: Line 364: schema_builder = SchemaBuilder() > comment Done Line 658: e TIMESTAMP NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION DEFAULT timestamp_from_unix_micros(%s), > long line The problem is I can't break this line as it needs to match the output exactly. I don't think it's worth something fancy like factoring out another variable for parameters and substituting them... I'll just add a comment. -- To view, visit http://gerrit.cloudera.org:8080/6936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I655910fb4805bb204a999627fa9f68e43ea8aaf2 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew JacobsGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4848: Add WIDTH BUCKET() function
Alex Behm has posted comments on this change. Change subject: IMPALA-4848: Add WIDTH_BUCKET() function .. Patch Set 4: (14 comments) http://gerrit.cloudera.org:8080/#/c/6023/4/be/src/exprs/math-functions-ir.cc File be/src/exprs/math-functions-ir.cc: Line 31: #include "exprs/math-functions.h" order alphabetically Line 430: Decimal16Value width_size, range_size; move declarations closer to where they are used Line 432: //FE casts all the arguments to same time. // FE casts all the arguments to the same type. Line 433: int input_scale = ctx->impl()->GetConstFnAttr(FunctionContextImpl::ARG_TYPE_SCALE,1); formatting, space after ',' Line 434: int input_precision = ctx->impl()->GetConstFnAttr( formatting, space after ',' Line 442: if(expr < min_range) { single line and formatting (please fix formatting everywhere) Line 447: if (num_buckets.val < std::numeric_limits::max()) { How about we make the return value a BIGINT and simplify this code (no need to check for overflow) Line 448: return num_buckets.val + 1; Create an IntVal, that's clearer to read Line 457: range_size = max_range.template Subtract<__int128_t>(input_scale, min_range, use int128_t consistently Line 460: ctx->SetError("Overflow while bucket_width evaluation"); Please make the error more specific, so we can see where the overflow happened. Line 472: int256_t x = ConvertToInt256(width_size.value() * buckets.value()); Don't you need to convert the arguments before the multiplication? Line 475: // to avoid avoid computing reulting scale and precision typo: resulting Line 476: int256_t y = DecimalUtil::MultiplyByScale(ConvertToInt256(range_size.value()), The current approach seems fine, but let's avoid going to int256_t if we can. Operations with that type are very expensive. http://gerrit.cloudera.org:8080/#/c/6023/4/be/src/exprs/math-functions.h File be/src/exprs/math-functions.h: Line 111: const IntVal& num_buckets); weird indentation -- To view, visit http://gerrit.cloudera.org:8080/6023 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I081bc916b1bef7b929ca161a9aade3b54c6b858f Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: anujphadkeGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: anujphadke Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4164: Use inline hints instead of always-inline
Michael Ho has uploaded a new patch set (#5). Change subject: IMPALA-4164: Use inline hints instead of always-inline .. IMPALA-4164: Use inline hints instead of always-inline When generating IR functions during codegen, we used to always tag the functions with the "AlwaysInline" attribute. That potentially leads to excessive inlining, leading to very long optimization / compilation time with marginal performance benefit at runtime. One of the reasons for doing it was that the "target-cpu" and "target-features" attributes are missing in the generated IR functions so the LLVM inliner considers them incompatible with the cross-compiled functions. As a result, the inliner will not inline the generated IR functions into cross-compiled functions unless we put the "AlwaysInline" attributes with the generated IR functions. This change fixes the problem above by setting the "target-cpu" and "target-features" attributes of all IR functions to match that of of the host's CPUs so both generated IR functions and cross-compiled functions will have the same values for those attributes. With these attributes set, we now switch to using "InlineHint" in place of "AlwaysInline" and let LLVM make the decision of whether it's sensible to inline a function. With this change, the codegen time of a query with a long predicate went from 15s to 4s and the overall runtime went from 19s to 8s. Change-Id: I2d87ae8d222b415587e7320cb9072e4a8d6615ce --- M be/src/codegen/llvm-codegen.cc M be/src/codegen/llvm-codegen.h A testdata/workloads/targeted-perf/queries/primitive_long_predicate.test 3 files changed, 298 insertions(+), 7 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/41/6941/5 -- To view, visit http://gerrit.cloudera.org:8080/6941 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2d87ae8d222b415587e7320cb9072e4a8d6615ce Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Michael HoGerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-4164: Use inline hints instead of always-inline
Michael Ho has posted comments on this change. Change subject: IMPALA-4164: Use inline hints instead of always-inline .. Patch Set 4: (2 comments) http://gerrit.cloudera.org:8080/#/c/6941/2/be/src/codegen/llvm-codegen.cc File be/src/codegen/llvm-codegen.cc: Line 163: return Status::OK(); > We have __attribute__((target("avx2"))) on some bloom filter functions alre Good point. They aren't cross-compiled yet but I verified the problem by cross-compiling them. I took your suggestion to set all functions' attributes to be the one derived from 'cpu_attrs_'. Alternately, we may consider stripping the "target-features" attributes of the cross-compiled functions loaded but I don't see an interface defined in llvm::Function. IMHO, it may be clearer to set "target-features" attribute explicitly so we know they are the same for both hand-crafted IR and cross-compiled function. Line 166: LlvmCodeGen::LlvmCodeGen(RuntimeState* state, ObjectPool* pool, > Ah ok, maybe make the log text a bit more human-readable then? I removed them now in the new patch as they would be redundant with cpu_attrs_. -- To view, visit http://gerrit.cloudera.org:8080/6941 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2d87ae8d222b415587e7320cb9072e4a8d6615ce Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Michael HoGerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4164: Use inline hints instead of always-inline
Michael Ho has uploaded a new patch set (#4). Change subject: IMPALA-4164: Use inline hints instead of always-inline .. IMPALA-4164: Use inline hints instead of always-inline When generating IR functions during codegen, we used to always tag the functions with the "AlwaysInline" attribute. That potentially leads to excessive inlining, leading to very long optimization / compilation time with marginal performance benefit at runtime. One of the reasons for doing it was that the "target-cpu" and "target-features" attributes are missing in the generated IR functions so the LLVM inliner considers them incompatible with the cross-compiled functions. As a result, the inliner will not inline the generated IR functions into cross-compiled functions unless we put the "AlwaysInline" attributes with the generated IR functions. This change fixes the problem above by setting the "target-cpu" and "target-features" attributes of all IR functions to match that of of the host's CPUs so both generated IR functions and cross-compiled functions will have the same values for those attributes. With these attributes set, we now switch to using "InlineHint" in place of "AlwaysInline" and let LLVM make the decision of whether it's sensible to inline a function. With this change, the codegen time of a query with a long predicate went from 15s to 4s and the overall runtime went from 19s to 8s. Change-Id: I2d87ae8d222b415587e7320cb9072e4a8d6615ce --- M be/src/codegen/llvm-codegen.cc M be/src/codegen/llvm-codegen.h A testdata/workloads/targeted-perf/queries/primitive_long_predicate.test 3 files changed, 298 insertions(+), 7 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/41/6941/4 -- To view, visit http://gerrit.cloudera.org:8080/6941 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2d87ae8d222b415587e7320cb9072e4a8d6615ce Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Michael HoGerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-4164: Use inline hints instead of always-inline
Michael Ho has uploaded a new patch set (#3). Change subject: IMPALA-4164: Use inline hints instead of always-inline .. IMPALA-4164: Use inline hints instead of always-inline When generating IR functions during codegen, we used to always tag the functions with the "AlwaysInline" attribute. That potentially leads to excessive inlining, leading to very long optimization / compilation time with marginal performance benefit at runtime. One of the reasons for doing it was that the "target-cpu" and "target-features" attributes are missing in the generated IR functions so the LLVM inliner considers them incompatible with the cross-compiled functions. As a result, the inliner will not inline the generated IR functions into cross-compiled functions unless we put the "AlwaysInline" attributes with the generated IR functions. This change fixes the problem above by setting the "target-cpu" and "target-features" attributes of all IR functions to match that of of the host's CPUs so both generated IR functions and cross-compiled functions will have the same values for those attributes. With these attributes set, we now switch to using "InlineHint" in place of "AlwaysInline" and let LLVM make the decision of whether it's sensible to inline a function. With this change, the codegen time of a query with a long predicate went from 15s to 4s and the overall runtime went from 19s to 8s. Change-Id: I2d87ae8d222b415587e7320cb9072e4a8d6615ce --- M be/src/codegen/llvm-codegen.cc M be/src/codegen/llvm-codegen.h A testdata/workloads/targeted-perf/queries/primitive_long_predicate.test 3 files changed, 304 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/41/6941/3 -- To view, visit http://gerrit.cloudera.org:8080/6941 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2d87ae8d222b415587e7320cb9072e4a8d6615ce Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Michael HoGerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-4192: Disentangle Expr and ExprContext
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4192: Disentangle Expr and ExprContext .. Patch Set 14: (14 comments) Next batch - made it through everything except the expr/ subdirectory. http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/partitioned-aggregation-node.cc File be/src/exec/partitioned-aggregation-node.cc: Line 164: pool_->Add(build_expr); Maybe stick with the usual idiom of wrapping pool_->Add() around the allocation? Line 1574: // TODO: consider moving the following codegen logic to AggFn. This would make sense (not in this patch though). http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/partitioned-hash-join-builder.cc File be/src/exec/partitioned-hash-join-builder.cc: Line 109: // TODO: Move to Prepare(). Is this to do in this patch or are you planning to do it as part of a future JIRA? http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/partitioned-hash-join-builder.h File be/src/exec/partitioned-hash-join-builder.h: Line 243: /// Init() function inherited from DataSink. Overridden to be a no-op for now. This is a bit of a red flag to me - I think we should structure things so that we always call Init() on the parent class. Otherwise it's hard to reason about the possible states of the member variables of DataSink. Line 381: /// List of filters to build. Mention that there's a correspondence with 'filter_exprs_'? http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/partitioned-hash-join-node.cc File be/src/exec/partitioned-hash-join-node.cc: PS14, Line 95: nd and http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/runtime/data-stream-sender.h File be/src/runtime/data-stream-sender.h: Line 49: /// TODO: Move this to the equivalent of PlanNode class for DataSink. Is there a JIRA for this? http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/runtime/descriptors.cc File be/src/runtime/descriptors.cc: Line 486: DCHECK(partition_expr->IsLiteral()); Can you add a comment in CatalogObjects.thrift that these are always literal exprs? That fact is explicit in the frontend but it wasn't obvious that the exprs sent from the frontend are always literals just by looking at the thrift structures. Line 491: // literals Partition exprs are not used in the codegen case. Don't codegen Missing . http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/runtime/runtime-filter.h File be/src/runtime/runtime-filter.h: Line 93: const TRuntimeFilterDesc& filter_desc_; Comment what owns the filter? http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/runtime/sorter.h File be/src/runtime/sorter.h: Line 105: Status Init(ObjectPool* pool, RuntimeState* state, MemPool* mem_pool); Comment what the arguments are for? The names could maybe be more descriptive too. E.g. to disambiguate from 'obj_pool_'. E.g. 'expr_evaluator_pool' and 'expr_mem_pool'. I think passing in 'state' is redundant too, since it's passed into the constructor. http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/service/fe-support.cc File be/src/service/fe-support.cc: PS14, Line 194: mem_pool expr_mem_pool? http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/udf/udf.h File be/src/udf/udf.h: Line 111: /// TODO: Move FRAGMENT_LOCAL states to query_state for multi-threading. Do we have a JIRA trackign this? http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/util/tuple-row-compare.h File be/src/util/tuple-row-compare.h: Line 82: Status Open(ObjectPool* pool, RuntimeState* state, MemPool* mem_pool); Similar to my comment in Sorter - it's unclear what these different arguments do and what the requirements are. E.g. what is allocated from 'mem_pool'. -- To view, visit http://gerrit.cloudera.org:8080/5483 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iefdc9aeeba033355cb9497e3a5d2363627dcf2f3 Gerrit-PatchSet: 14 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Michael HoGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5347: Parquet scanner microoptimizations
Alex Behm has posted comments on this change. Change subject: IMPALA-5347: Parquet scanner microoptimizations .. Patch Set 4: (2 comments) http://gerrit.cloudera.org:8080/#/c/6950/4/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 979: Status HdfsParquetScanner::ResetScratchBatch() { Why not move this into ScratchTupleBatch, i.e. pass in the template tuple there. Line 983: if (template_tuple_ == nullptr && tuple_byte_size_ <= CACHE_LINE_SIZE) { Not sure I completely understand the CACHE_LINE_SIZE check. We are zeroing scratch_capacity * tuple_byte_size_ bytes of contiguous memory. Why not remove the second condition? -- To view, visit http://gerrit.cloudera.org:8080/6950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4923: reduce memory transfer for selective scans
Alex Behm has posted comments on this change. Change subject: IMPALA-4923: reduce memory transfer for selective scans .. Patch Set 3: (6 comments) http://gerrit.cloudera.org:8080/#/c/6949/2/be/src/exec/parquet-scratch-tuple-batch.h File be/src/exec/parquet-scratch-tuple-batch.h: Line 130: // because the batch we returned earlier may hold a reference into 'tuple_mem'. > I don't think this is a significant problem for performance - if the scan i Thanks for explaining. Makes sense to me. http://gerrit.cloudera.org:8080/#/c/6949/3/be/src/exec/parquet-scratch-tuple-batch.h File be/src/exec/parquet-scratch-tuple-batch.h: Line 53: MemPool aux_mem_pool; Another thought: How about splitting this pool up into memory that is only referenced by tuples in tuple_mem and memory that could be refd by already-returned batches. For example, the decompression buffers live across multiple scratch/transfer rounds, but as far as I can tell all var-len memory that is allocated from this pool is only referenced by tuples in tuple_mem. I'd even be ok with naming the pools 'var_len_pool' and 'decompression_pool'. That would make it less mysterious what is inside them. This change is mostly for understandability, but it would also allow us to avoid unnecessarily transferring var-len data (I know it's only a nice-to-have from a perf perspective). People reading this code may have the same question as me regarding the var-len data. Line 100: void FinalizeTransfer(RowBatch* dst_batch, int num_to_commit) { FinalizeTupleTransfer() is a little clearer because we typically mean resources in the context of "transfer" Line 102: ++num_output_batches; Wondering if we should make this "non-empty output batches". Line 129: // Compaction is unsafe if the scratch batch was split across multiple output batches Mind moving these two checks into the caller? Checking 'num_output_batches' in the same fn where it is updated makes it more obvious what it is used for. This will also clarify in which situations TryCompact() can "fail", i.e., only when there is not enough mem Line 160: int64_t total_allocated_bytes() { const -- To view, visit http://gerrit.cloudera.org:8080/6949 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4623: Enable file handle cache
Joe McDonnell has uploaded a new patch set (#11). Change subject: IMPALA-4623: Enable file handle cache .. IMPALA-4623: Enable file handle cache Currently, every scan range maintains a file handle, even when multiple scan ranges are accessing the same file. Opening the file handles causes load on the NameNode, which can lead to scaling issues. There are two parts to this transaction: 1. Enable file handle caching by default 2. Share the file handle between scan ranges from the same file The scan range no longer maintains its own Hdfs file handle. On each read, the io thread will get the Hdfs file handle from the cache (opening it if necessary) and use that for the read. This allows multiple scan ranges on the same file to use the same file handle. Since the file offsets are no longer consistent for an individual scan range, all Hdfs reads need to either use hdfsPread or do a seek before reading. Additionally, since Hdfs read statistics are maintained on the file handle, the read statistics must be retrieved and cleared after each read. To manage contention, the file handle cache is now partitioned by a hash of the key into independent caches with independent locks. The allowed capacity of the file handle cache is split evenly among the partitions. File handles are evicted independently for each partition. The file handle cache maintains ownership of the file handles at all times, but it will not evict a file handle that is in use. If max_cached_file_handles is set to 0 or the the scan range is accessing data cached by Hdfs, the scan range will get a file handle from the cache and hold it until the scan range is closed. This mimics the existing behavior, except the file handle stays in the cache and is owned by the cache. Since it is in use, it will not be evicted. If a file handle in the cache becomes invalid, it may result in Read() calls failing. Consequently, if Read() encounters an error using a file handle from the cache, it will destroy the handle and retry once with a new file handle. Any subsequent error is unrelated to the file handle cache and will be returned. Tests: query_test/test_hdfs_fd_caching.py copies the files from an existing table into a new directory and uses that to create an external table. It queries the external table, then uses the hdfs commandline to manipulate the hdfs file (delete, move, etc). It queries again to make sure we don't crash. Then, it runs "invalidate metadata". It checks the row counts before the modification and after "invalidate metadata", but it does not check the results in between. custom_cluster/test_hdfs_fd_caching.py starts up a cluster with a small file handle cache size. It verifies that a file handle can be reused (i.e. rerunning a query does not result in more file handles cached). It also verifies that the cache capacity is enforced. Change-Id: Ibe5ff60971dd653c3b6a0e13928cfa9fc59d078d --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/runtime/buffered-block-mgr.h A be/src/runtime/disk-io-mgr-handle-cache.h A be/src/runtime/disk-io-mgr-handle-cache.inline.h M be/src/runtime/disk-io-mgr-internal.h M be/src/runtime/disk-io-mgr-reader-context.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/disk-io-mgr.h M tests/custom_cluster/test_hdfs_fd_caching.py M tests/metadata/test_refresh_partition.py M tests/query_test/test_hdfs_fd_caching.py 13 files changed, 815 insertions(+), 410 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/78/6478/11 -- To view, visit http://gerrit.cloudera.org:8080/6478 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibe5ff60971dd653c3b6a0e13928cfa9fc59d078d Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Joe McDonnellGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-4622: Add ALTER COLUMN statement.
Matthew Jacobs has posted comments on this change. Change subject: IMPALA-4622: Add ALTER COLUMN statement. .. Patch Set 2: To be a bit more prescriptive than my last comment: Can you see how much of this new functionality you can actually accomplish using the change syntax given it's already there? I think we'll have a harder time justifying this new syntax to accomplish something you can technically already do, even if it is superior. It'll require a bit more discussion. -- To view, visit http://gerrit.cloudera.org:8080/6955 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id2e8bd65342b79644a0fdcd925e6f17797e89ad6 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Matthew Jacobs Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-1972/IMPALA-3882: Fix client request state map lock contention
Dan Hecht has posted comments on this change. Change subject: IMPALA-1972/IMPALA-3882: Fix client_request_state_map_lock_ contention .. Patch Set 9: (1 comment) http://gerrit.cloudera.org:8080/#/c/6707/9/tests/custom_cluster/test_query_concurrency.py File tests/custom_cluster/test_query_concurrency.py: PS9, Line 75: 3 that doesn't seem large enough. check_registered_queries() is allowed to wait 15 seconds each time, so the test can still pass even if the bug exists, right? -- To view, visit http://gerrit.cloudera.org:8080/6707 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie44daa93e3ae4d04d091261f3ec4891caffe8026 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Bharath VissapragadaGerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4622: Add ALTER COLUMN statement.
Matthew Jacobs has posted comments on this change. Change subject: IMPALA-4622: Add ALTER COLUMN statement. .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/6955/1//COMMIT_MSG Commit Message: Line 8: Need to explain why this diverges from ALTER TABLE CHANGE, since that syntax exists and appears to overlap. >From the doc, it says: ALTER TABLE name CHANGE column_name new_name new_type However, this change led me to look at the code, and according to the parser it looks the syntax is actually: ALTER TABLE name CHANGE column_name new_name column_def column_def ::= ident_or_default:col_name type_def:type column_options_map:options {: RESULT = new ColumnDef(col_name, type, options); :} | ident_or_default:col_name type_def:type {: RESULT = new ColumnDef(col_name, type); :} ; So then the column def options would already be possible to set with that syntax. While I much much prefer this syntax you've added here, we need to re-evaluate if it makes sense to add it given this new information. http://gerrit.cloudera.org:8080/#/c/6955/1/common/thrift/JniCatalog.thrift File common/thrift/JniCatalog.thrift: Line 74: ALTER_COLUMN, internally can we re-use change_column? -- To view, visit http://gerrit.cloudera.org:8080/6955 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id2e8bd65342b79644a0fdcd925e6f17797e89ad6 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Matthew Jacobs Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5349: flaky NoDirsAllocationError test
Impala Public Jenkins has posted comments on this change. Change subject: IMPALA-5349: flaky NoDirsAllocationError test .. Patch Set 1: Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/604/ -- To view, visit http://gerrit.cloudera.org:8080/6953 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I023193b9ad6c6ac0ee114ad77ddf04d7d7185809 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Impala Public Jenkins Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4192: Disentangle Expr and ExprContext
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4192: Disentangle Expr and ExprContext .. Patch Set 14: (26 comments) First batch of comments - made it through exec/kudu* http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/aggregation-node.cc File be/src/exec/aggregation-node.cc: Line 93: RETURN_IF_ERROR(build_expr->Init(*(intermediate_row_desc_.get()), state)); build_expr is leaked if this returns. http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/analytic-eval-node.cc File be/src/exec/analytic-eval-node.cc: This all works out a lot cleaner - nice! Line 124: RETURN_IF_ERROR(AggFn::Create(analytic_node.analytic_functions[i], I think there's a potential resource leak if the expr Init() function fails in certain ways. E.g. if it successfully got a lib cache entry, but failed later. We don't add it to 'analytic_fns_' so I don't think it can be closed. Line 166: fn_pool_.reset(new MemPool(expr_mem_tracker()));; extra semicolon http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/data-sink.cc File be/src/exec/data-sink.cc: Line 181: // TODO: codegen table sink I'm not sure what this means. Is it something to be fixed in this patch? Otherwise could be just remove the TODO and file a JIRA if it's something that needs to be tracked. http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/data-sink.h File be/src/exec/data-sink.h: Line 126: /// Not used in DataStreamSender and PHJBuilder. Or NljBuilder. Maybe easier to just say that they're not used in all subclasses. http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/exec-node.cc File be/src/exec/exec-node.cc: PS14, Line 451: inline Is the inline needed here? Line 465: void ExecNode::FreeLocalAllocations() { This helper function doesn't seem to help that much - consider inlining it. http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/exec-node.h File be/src/exec/exec-node.h: Line 261: /// Conjuncts and their evaluators in this node. Maybe mention who owns the pointers? http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/filter-context.cc File be/src/exec/filter-context.cc: PS14, Line 86: expr_evaluator->root() Is the plan to eventually pass around the Expr itself and avoid going via the evaluator, or will we continue using this pattern? I don't feel that strongly either way, just curious. http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/filter-context.h File be/src/exec/filter-context.h: Line 120: FilterContext() { } nit: can just remove this and rely on the default constructor http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/hash-table.h File be/src/exec/hash-table.h: Line 115: /// computing the size of a scratch row. Maybe mention what 'pool' is for Line 415: Status Init(ObjectPool* pool, RuntimeState* state, int num_build_tuples); What is pool used for? Line 513: MemPool* mem_pool_; Mention that it's not owned? http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: PS14, Line 294: tab? Line 733: vector& output_expr_evaluators) Can't this be a const& still? The vector isn't being modified is it? http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: Line 26: #include "exprs/scalar-expr.h" Maybe not needed? http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/hdfs-scan-node-base.cc File be/src/exec/hdfs-scan-node-base.cc: Line 147: // TODO: Move this to Prepare() Is this TODO in this patch, or does it depend on another change? Line 165: new RowDescriptor(min_max_tuple_desc_, /* is_nullable */ false)); This is a little inconsistent - above on line 121 we create the RowDescriptor on the stack but here it's heap-allocated. I think the stack allocation pattern is probably simpler. PS14, Line 186: conjunct_evaluators_map_[entry.first] [] has a side-effect of creating the element - maybe use find() instead for the DCHECK. http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/hdfs-scan-node-base.h File be/src/exec/hdfs-scan-node-base.h: Line 146: const std::vector min_max_conjunct_evaluators() const { Shouldn't this return a const&, rather than copying the vector? Line 393: /// the per-scanner ScannerContext.. Add something like "These correspond to the exprs in 'filter_exprs_'" http://gerrit.cloudera.org:8080/#/c/5483/14/be/src/exec/hdfs-scan-node.cc File be/src/exec/hdfs-scan-node.cc: Line 362: MemPool filter_mem_pool(expr_mem_tracker()); This is a little subtle - could do with a comment. Line 388: for (auto& ctx: filter_ctxs) { Maybe use a goto and an exit block at the bottom of the function to enforce that the cleanup happens on both paths.
[Impala-ASF-CR] IMPALA-5347: Parquet scanner microoptimizations
Tim Armstrong has posted comments on this change. Change subject: IMPALA-5347: Parquet scanner microoptimizations .. Patch Set 2: (7 comments) http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 937: if (template_tuple_ == nullptr && tuple_byte_size_ <= CACHE_LINE_SIZE) { > Might be worth putting this into an InitBatch() function. Factored this out along with the above Reset() call, since they really go together. http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 486: uint8_t val_buf[sizeof(T)]; // Uninitialized stack allocation for temporary value. > Comment just seems to describe what this line is doing, but not why. Mentio Done http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/util/bit-stream-utils.inline.h File be/src/util/bit-stream-utils.inline.h: Line 91: ALWAYS_INLINE inline bool BitReader::GetValue(int num_bits, T* v) { > remove inline? I tried initially but it turns out the the always_inline attribute doesn't imply one-definition-rule linkage in the same way that "inline" - there was a compiler warning. However, defining the function inside a class body implies "inline", so "inline" can be safely omitted in those cases. http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: Line 291: ALWAYS_INLINE inline bool DictDecoder::GetNextValue(T* value) { > remove inline? The definition is outside the class so it's not actually redundant (see earlier comment). Line 304: ALWAYS_INLINE inline bool DictDecoder::GetNextValue( > remove inline? The definition is outside the class so it's not actually redundant (see earlier comment). http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/util/rle-encoding.h File be/src/util/rle-encoding.h: Line 251: ALWAYS_INLINE inline bool RleDecoder::Get(T* val) { > remove inline? The definition is outside the class so it's not actually redundant (see earlier comment). Line 269: if (UNLIKELY(!bit_reader_.GetValue(bit_width_, val))) return false; > Really a critical change? It changes the behavior, so might be dangerous / It made a small but measurable difference, but probably not enough to worry about and there is an edge case where this is less correct so I reverted it. I think it's generally safe - it doesn't change the external behaviour except where the input data is malformed and the client repeatedly calls Get() after it returns false. Then maybe this GetValue() call could return false, but the one on line 280 will later return true and return some RLE-encoded data. -- To view, visit http://gerrit.cloudera.org:8080/6950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4622: Add ALTER COLUMN statement.
Thomas Tauber-Marshall has uploaded a new patch set (#2). Change subject: IMPALA-4622: Add ALTER COLUMN statement. .. IMPALA-4622: Add ALTER COLUMN statement. Kudu recently added the ability to alter a column's default value and storage attributes (KUDU-861). This patch adds the ability to modify these from Impala using ALTER. It also supports altering a column's comment for non-Kudu tables. Syntax: ALTER TABLE ALTER [COLUMN] SET [ [ ...]] where is one of: - DEFAULT, BLOCK_SIZE, ENCODING, COMPRESSION (Kudu tables) - COMMENT (non-Kudu tables) ALTER TABLE ALTER [COLUMN] DROP DEFAULT It does not support setting a column to be a primary key or chaning a column's nullability, because those are not supported on the Kudu side yet. Testing: - Added FE tests to ParserTest and AnalyzeDDLTest - Added EE tests to kudu_alter.test and kudu_describe.test Change-Id: Id2e8bd65342b79644a0fdcd925e6f17797e89ad6 --- M common/thrift/JniCatalog.thrift M fe/src/main/cup/sql-parser.cup A fe/src/main/java/org/apache/impala/analysis/AlterColumnStmt.java M fe/src/main/java/org/apache/impala/analysis/ColumnDef.java M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java M fe/src/main/java/org/apache/impala/service/KuduCatalogOpExecutor.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M fe/src/test/java/org/apache/impala/analysis/ParserTest.java M testdata/workloads/functional-query/queries/QueryTest/alter-table.test M testdata/workloads/functional-query/queries/QueryTest/kudu_alter.test M testdata/workloads/functional-query/queries/QueryTest/kudu_describe.test 11 files changed, 299 insertions(+), 5 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/55/6955/2 -- To view, visit http://gerrit.cloudera.org:8080/6955 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Id2e8bd65342b79644a0fdcd925e6f17797e89ad6 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Matthew Jacobs
[Impala-ASF-CR] Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet"
Impala Public Jenkins has posted comments on this change. Change subject: Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet" .. Patch Set 2: Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/603/ -- To view, visit http://gerrit.cloudera.org:8080/6896 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-HasComments: No
[Impala-ASF-CR] Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet"
Dan Hecht has posted comments on this change. Change subject: Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet" .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/6896 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5338: Fix Kudu timestamp column default values
Thomas Tauber-Marshall has posted comments on this change. Change subject: IMPALA-5338: Fix Kudu timestamp column default values .. Patch Set 1: (6 comments) http://gerrit.cloudera.org:8080/#/c/6936/1/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 108: static TimestampValue FromUnixTimeMicros(int64_t unix_time_micros); I'm confused as to why it no longer takes two parameters (maybe its just that the comment needs to be updated?) http://gerrit.cloudera.org:8080/#/c/6936/1/be/src/runtime/timestamp-value.inline.h File be/src/runtime/timestamp-value.inline.h: Line 32: int64_t ts_seconds = unix_time_micros / MICROS_PER_SEC; I don't understand what this is supposed to be doing - it looks like you're just separating 'unix_time_micros' into two parts and then adding them back together again. http://gerrit.cloudera.org:8080/#/c/6936/1/fe/src/main/java/org/apache/impala/catalog/KuduColumn.java File fe/src/main/java/org/apache/impala/catalog/KuduColumn.java: PS1, Line 125: value extraneous word? http://gerrit.cloudera.org:8080/#/c/6936/1/testdata/workloads/functional-query/queries/QueryTest/kudu_create.test File testdata/workloads/functional-query/queries/QueryTest/kudu_create.test: Line 253: ts2 timestamp default cast('00:00:00' as timestamp)) I guess the intention here is that this will fail? Is there an error we can catch? http://gerrit.cloudera.org:8080/#/c/6936/1/tests/query_test/test_kudu.py File tests/query_test/test_kudu.py: Line 364: schema_builder = SchemaBuilder() comment Line 658: e TIMESTAMP NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION DEFAULT timestamp_from_unix_micros(%s), long line -- To view, visit http://gerrit.cloudera.org:8080/6936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I655910fb4805bb204a999627fa9f68e43ea8aaf2 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew JacobsGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5347: Parquet scanner microoptimizations
Alex Behm has posted comments on this change. Change subject: IMPALA-5347: Parquet scanner microoptimizations .. Patch Set 2: (7 comments) http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 937: if (template_tuple_ == nullptr && tuple_byte_size_ <= CACHE_LINE_SIZE) { Might be worth putting this into an InitBatch() function. http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 486: uint8_t val_buf[sizeof(T)]; // Uninitialized stack allocation for temporary value. Comment just seems to describe what this line is doing, but not why. Mention explicitly what we are trying to avoid for a perf gain. http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/util/bit-stream-utils.inline.h File be/src/util/bit-stream-utils.inline.h: Line 91: ALWAYS_INLINE inline bool BitReader::GetValue(int num_bits, T* v) { remove inline? http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: Line 291: ALWAYS_INLINE inline bool DictDecoder::GetNextValue(T* value) { remove inline? Line 304: ALWAYS_INLINE inline bool DictDecoder::GetNextValue( remove inline? http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/util/rle-encoding.h File be/src/util/rle-encoding.h: Line 251: ALWAYS_INLINE inline bool RleDecoder::Get(T* val) { remove inline? Line 269: if (UNLIKELY(!bit_reader_.GetValue(bit_width_, val))) return false; Really a critical change? It changes the behavior, so might be dangerous / introduce a bug. -- To view, visit http://gerrit.cloudera.org:8080/6950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5347: Parquet scanner microoptimizations
Tim Armstrong has posted comments on this change. Change subject: IMPALA-5347: Parquet scanner microoptimizations .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 426: *num_values = (curr_tuple - tuple_mem) / tuple_size; > Should we add a DCHECK to make sure that tuple_size is non-zero? Added it. It was already a precondition for the function (can't materialize a value into a 0-length tuple) so it's safe. -- To view, visit http://gerrit.cloudera.org:8080/6950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5347: Parquet scanner microoptimizations
Tim Armstrong has uploaded a new patch set (#3). Change subject: IMPALA-5347: Parquet scanner microoptimizations .. IMPALA-5347: Parquet scanner microoptimizations A mix of microoptimizations that profiling the parquet scanner revealed. All lead to some measurable improvement and added up to significant speedups for various scans. * Add ALWAYS_INLINE to hot functions that GCC was mistakenly not inlining in all cases. * Apply __restrict__ in a few places so the compiler knows that it is safe to cache values accessed via those pointers * memset() the whole batch instead of the null indicators is cases where it is almost certainly cheaper. * Avoid updating two correlated loop variables in MaterializeValueBatch(). * Avoid unnecessary initialization of often-unused 'val' in ReadSlot(). * Shave a few instructions off the (still very expensive) bit unpacking and dict decoding logic. Performance: Some local TPC-H and targeted-perf benchmarks showed average speedups of ~5%. I did some benchmarks targeted at measuring column materialisation performance using a version of lineitem with duplicated data to make it bigger. These queries all got significantly faster. Dict-encoded DECIMAL: 2.23 -> 1.23s SELECT count(*) FROM biglineitem WHERE l_quantity > 49 Plain-encoded BIGINT: 2.33s -> 1.62s SELECT count(*) FROM biglineitem WHERE l_orderkey != 10 Dict-encoded STRING: 2.73s -> 1.72s SELECT count(*) FROM biglineitem WHERE l_returnflag = 'A' Multiple columns: 5.15s -> 3.74s SELECT count(*) FROM biglineitem WHERE l_quantity > 49 and l_partkey != 199 and l_tax < 100 Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5 --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/util/bit-stream-utils.inline.h M be/src/util/bit-util.h M be/src/util/dict-encoding.h M be/src/util/rle-encoding.h 6 files changed, 56 insertions(+), 27 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/6950/3 -- To view, visit http://gerrit.cloudera.org:8080/6950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: anujphadke
[Impala-ASF-CR] IMPALA-4923: reduce memory transfer for selective scans
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4923: reduce memory transfer for selective scans .. Patch Set 2: (6 comments) http://gerrit.cloudera.org:8080/#/c/6949/2/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 1031: DCHECK_EQ(0, scratch_batch_->total_allocated_bytes()); > Where do the decompression buffers get freed? IMPALA-5304 made that redundant (I missed doing this cleanup in that patch). http://gerrit.cloudera.org:8080/#/c/6949/2/be/src/exec/parquet-scratch-tuple-batch.h File be/src/exec/parquet-scratch-tuple-batch.h: PS2, Line 48: MemPool > It's not clear from the var names comments where the var-len data goes. Tha I elaborated on the comments to make it a bit more explicit. Line 50: // Pool used to accumulate other memory such as decompression buffers that may be > may be referenced Done Line 109: dst_batch->tuple_data_pool()->AcquireData(_mem_pool, false); > I would have thought that the var-len data like strings or collections can We need to transfer the allocations for var-len data regardless since we have no way to know for sure that previously-returned batches don't reference some of that data. So we could still run into a similar problem with excessive transfer of var-len data in some cases but I think that's impossible to solve without larger changes to the memory ownership model because it would require more precise tracking of which buffers may be referenced by already-returned rows. Reducing the amount of fixed-length data transferred will help a lot regardless by reducing the overall volume of lock contention. It's also counter-productive if the data is dictionary-encoded. Line 130: if (num_output_batches > 1) return false; > This new compaction has non-obvious caveats like this one, and I find the f I don't think this is a significant problem for performance - if the scan is selective then the scratch batch will have many fewer rows than the output batches and only a handful of batches will hit this edge case. Agree it's subtle but I think there will be subtleties regardless with our current memory transfer model. I tried to keep this encapsulated in FinalizeTransfer()/TryCompact()/Reset() so that this isn't spread out throughout the code. I think the alternative has some issues that I would rather avoid. I think doing two passes over the batch will be slower since the loop in ProcessScratchBatch() is already tight. It would also complicate the logic around 'tuple_mem' because compacted memory lives temporarily in the scratch batch before being moved to the destination batch - I think you would need additional state in the scratch batch to track both the compacted tuples and the tuple buffer that you're going to reuse. Line 139: for (int i = dst_batch->num_rows(); i < end_row; ++i) { > Don't we have a CopyRows() for this in RowBatch? That seems to copy the tuple pointers rather than the tuples themselves. -- To view, visit http://gerrit.cloudera.org:8080/6949 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4623: Enable file handle cache
Joe McDonnell has posted comments on this change. Change subject: IMPALA-4623: Enable file handle cache .. Patch Set 10: (3 comments) About to post another upload. http://gerrit.cloudera.org:8080/#/c/6478/8/be/src/runtime/disk-io-mgr-handle-cache.h File be/src/runtime/disk-io-mgr-handle-cache.h: Line 88: /// is set to true. Otherwise, the partition will try to construct a file handle and > true, but you're describing a scenario with concurrent queries and updates. That makes sense. We don't provide guarantees, and it seems like an uncommon case. I will include this on the JIRA I file for the time based eviction. http://gerrit.cloudera.org:8080/#/c/6478/8/be/src/runtime/disk-io-mgr-scan-range.cc File be/src/runtime/disk-io-mgr-scan-range.cc: Line 417: } else { > so it's just num_bytes_read then. when i see 'last' or 'prev' or something That makes sense. I think this came about trying to avoid a naming collision with the function's argument "bytes_read". Since we already have bytes_read and bytes_read_, I switched this to current_bytes_read rather than num_bytes_read. http://gerrit.cloudera.org:8080/#/c/6478/10/be/src/runtime/disk-io-mgr-scan-range.cc File be/src/runtime/disk-io-mgr-scan-range.cc: Line 352: int success = hdfsGetHedgedReadMetrics(fs_, _metrics); > you're getting metrics for the whole fs_, but every scan range does that on This came in due to a rebase. IMPALA-4764 added this. I had to move it around because the code changed in Close(). It is setting the metric to this value rather than adding it. -- To view, visit http://gerrit.cloudera.org:8080/6478 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibe5ff60971dd653c3b6a0e13928cfa9fc59d078d Gerrit-PatchSet: 10 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Joe McDonnellGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4923: reduce memory transfer for selective scans
Tim Armstrong has uploaded a new patch set (#3). Change subject: IMPALA-4923: reduce memory transfer for selective scans .. IMPALA-4923: reduce memory transfer for selective scans Most of the code changes are to restructure things so that the scratch batch's tuple buffer is stored in a separate MemPool from auxiliary memory such as decompression buffers. This part of the change does not change the behaviour of the scanner in itself, but allows us to recycle the tuple buffer without holding onto unused auxiliary memory. The optimisation is implemented in TryCompact(): if enough rows were filtered out during the copy from the scratch batch to the output batch, the fixed-length portions of the surviving rows (if any) are copied to a new, smaller, buffer, and the original, larger, buffer is reused for the next scratch batch. Previously the large buffer was always attached to the output batch, so a large buffer was transferred between threads for every scratch batch processed. In combination with the decompression buffer change in IMPALA-5304, this means that in many cases selective scans don't produce nearly as many empty or near-empty batches and do not attach nearly as much memory to each batch. Performance: Even on an 8 core machine I see some speedup on selective scans. Profiling with "perf top" also showed that time in TCMalloc was reduced - it went from several % of CPU time to a minimal amount. Running TPC-H on the same machine showed a ~5% overall improvement and no regressions. E.g. Q6 got 20-25% faster. I hope to do some additional cluster benchmarking on systems with more cores to verify that the severe performance problems there are fixed, but in the meantime it seems like we have enough evidence that it will at least improve things. Testing: Add a couple of selective scans that exercise the new code paths. Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-scratch-tuple-batch.h M be/src/runtime/row-batch.cc M be/src/runtime/row-batch.h M testdata/workloads/functional-query/queries/QueryTest/parquet.test 6 files changed, 206 insertions(+), 56 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/6949/3 -- To view, visit http://gerrit.cloudera.org:8080/6949 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Alex Behm
[Impala-ASF-CR] IMPALA-4622: Add ALTER COLUMN statement.
Thomas Tauber-Marshall has uploaded a new change for review. http://gerrit.cloudera.org:8080/6955 Change subject: IMPALA-4622: Add ALTER COLUMN statement. .. IMPALA-4622: Add ALTER COLUMN statement. Kudu recently added the ability to alter a column's default value and storage attributes (KUDU-861). This patch adds the ability to modify these from Impala using ALTER. It also supports altering a column's comment for non-Kudu tables. Syntax: ALTER TABLE ALTER [COLUMN] SET [ [ ...]] where is one of: - DEFAULT, BLOCK_SIZE, ENCODING, COMPRESSION (Kudu tables) - COMMENT (non-Kudu tables) ALTER TABLE ALTER [COLUMN] DROP DEFAULT It does not support setting a column to be a primary key or chaning a column's nullability, because those are not supported on the Kudu side yet. Testing: - Added FE tests to ParserTest and AnalyzeDDLTest - Added EE tests to kudu_alter.test and kudu_describe.test Change-Id: Id2e8bd65342b79644a0fdcd925e6f17797e89ad6 --- M common/thrift/JniCatalog.thrift M fe/src/main/cup/sql-parser.cup A fe/src/main/java/org/apache/impala/analysis/AlterColumnStmt.java M fe/src/main/java/org/apache/impala/analysis/ColumnDef.java M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java M fe/src/main/java/org/apache/impala/service/KuduCatalogOpExecutor.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M fe/src/test/java/org/apache/impala/analysis/ParserTest.java M testdata/workloads/functional-query/queries/QueryTest/alter-table.test M testdata/workloads/functional-query/queries/QueryTest/kudu_alter.test M testdata/workloads/functional-query/queries/QueryTest/kudu_describe.test 11 files changed, 299 insertions(+), 5 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/55/6955/1 -- To view, visit http://gerrit.cloudera.org:8080/6955 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: Id2e8bd65342b79644a0fdcd925e6f17797e89ad6 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall
[Impala-ASF-CR] IMPALA-5347: Parquet scanner microoptimizations
anujphadke has posted comments on this change. Change subject: IMPALA-5347: Parquet scanner microoptimizations .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/6950/2/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 426: *num_values = (curr_tuple - tuple_mem) / tuple_size; Should we add a DCHECK to make sure that tuple_size is non-zero? -- To view, visit http://gerrit.cloudera.org:8080/6950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: anujphadke Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has posted comments on this change. Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. Patch Set 7: (17 comments) Thanks for looking at this! http://gerrit.cloudera.org:8080/#/c/6726/7//COMMIT_MSG Commit Message: Line 7: IMPALA-4864 Speed up single slot predicates with dictionaries > The JIRA seems to be talking about doing something slightly different: taki There is a bit of a disconnect with the JIRA as specified and the intention behind the JIRA, which had me initially working down the wrong path. I realized that (x OP k) type predicates can only be evaluated for OP = or <=>, and started working on a way to identify such predicates in sub-expressions of the conjuncts. It all seemed quite speculative and limited in applicability. Once can imagine expanding this to ordering predicates with sorted dictionaries, but as we don't have that yet, it would be building without a foundation. After clarification from Marcel, it became apparent that what was intended was to be able to evaluate arbitrary single slot predicates against the dictionary (as we do this anyway for row-group filtering) and save those results to eliminate the cost of re-evaluating the same conjuncts again later. The central difficulty of this problem is that the set of conjuncts which can be dictionary evaluated is not known until row group time. This is the best proposal that I have come up with so far at attempting to solve that problem. I agree we should get the design right on this one. Doing this wrong could be costly in terms of either performance or complexity. http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner-ir.cc File be/src/exec/hdfs-parquet-scanner-ir.cc: Line 40 > I believe this code was deliberately written like this to avoid the indirec I have no problem reverting this. Originally I had the two functions merged into one, but I didn't like imposing the overhead of checking the bitmap on this version. While the code evolved, I had to put an interface on ScratchBatch just for sanity, as too many pieces of code were playing with its internals. I think the interface is actually broad enough now to go back to the original code (minus the interference with internals). I'll simply do this at the end: scratch_batch_->Advance((scratch_tuple - scratch_tuple_start) / tuple_size); Line 77:tuple_index = scratch_batch_->NextValidTuple()) { > It seems like this might be slow for the case when many or all of the tuple Your intuition and mine are similar. I did my best to make the "simple" version of NextSetBit() as fast as possible, but I believe intrinsics might actually be warranted here. Wanted to get the core ideas sound before going in that direction though. Oh, and for reference, I also tried various other bitmap implementations - boost::dynamic_bitset and std::vector. I didn't like the boost dependency (or the code too much) and the std::vector (specialized bit-vector) implementation compiled to assembler sent me running for the hills. Branches, branches everywhere... http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 935 > I'm not sure, but this hoisting may have been deliberate because the compil It certainly shouldn't hurt to re-hoist Line 208: dict_filters_active_.reset(new bool[scanner_conjunct_ctxs_->size()]); > This probably works fine in practice but afaik scoped_ptr will call free in Will check PS7, Line 1051: _ > nit: extraneous _ Originally a class member, until I realized this function had no access. Will fix. http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 317: bool ReadValueBatch(ScratchTupleBatch& scratch, int* num_values) { > I think it would be best to avoid using the ScratchTupleBatch abstraction i My guess is that inlining the functions from the header should get about as optimal as possible. Maybe hoisting 'max = scratch.capacity()' to tell the compiler it is unchanged throughout the loop, but other than that, I don't see much room for improvement that has been lost. Let's see what perf results look like when that is ready. I don't want to pass the bitmap directly, as I only want the bitmap to be touched (or even known about) by those pieces of code which actually are running in a filtered column reader. Line 408: if (IS_FILTERED && IS_DICT_ENCODED && > This patch pushes the predicate evaluation down into ReadFilteredSlot() so Perhaps there is a way to do this, but I don't see a clear path to that. We have some choices, but they are all problematic: Materialize codewords and values. Where do the codewords go? There is no space in the tuple. Materialize codewords only. What do we do when there is no dictionary? Materialize
[native-toolchain-CR] Remove libevent from toolchain
Matthew Jacobs has posted comments on this change. Change subject: Remove libevent from toolchain .. Patch Set 1: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/6952 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icceedef1a6e4ee3e0f16a5e867b93925739b3c82 Gerrit-PatchSet: 1 Gerrit-Project: native-toolchain Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4803: [DOCS] 2.9 release notes
John Russell has uploaded a new change for review. http://gerrit.cloudera.org:8080/6954 Change subject: IMPALA-4803: [DOCS] 2.9 release notes .. IMPALA-4803: [DOCS] 2.9 release notes Start with placeholder for 2.9 new features topic. Initially just point to the changelog file. Change-Id: I1f6cabc2427daf1243bd69dbed295c6923c4091b --- M docs/impala_keydefs.ditamap M docs/topics/impala_new_features.xml 2 files changed, 28 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/6954/1 -- To view, visit http://gerrit.cloudera.org:8080/6954 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I1f6cabc2427daf1243bd69dbed295c6923c4091b Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: John Russell
[Impala-ASF-CR] IMPALA-5164: Fix flaky benchmarks
Tim Armstrong has posted comments on this change. Change subject: IMPALA-5164: Fix flaky benchmarks .. Patch Set 2: (3 comments) http://gerrit.cloudera.org:8080/#/c/6935/2/be/src/util/benchmark.cc File be/src/util/benchmark.cc: Line 62: throw std::runtime_error("Benchmark failed to complete due to context switching."); > That is true, but having an error message and call graph doesn't help here. Sounds like a Status might be a better fit then. Line 69: } > I thought so as well, but there is no reset method currently, so this actua You could maybe create a new stopwatch in that case? PS2, Line 94: 10 > Arbitrary flake factor. If we lose 90% of measurements, we bail. This par Can you add a comment along those lines for future reference then? -- To view, visit http://gerrit.cloudera.org:8080/6935 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I7e0ddfaa9bd1340990ae2b6ff946d72bbd26c274 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach AmsdenGerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4923: reduce memory transfer for selective scans
Alex Behm has posted comments on this change. Change subject: IMPALA-4923: reduce memory transfer for selective scans .. Patch Set 2: (6 comments) http://gerrit.cloudera.org:8080/#/c/6949/2/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 1031: DCHECK_EQ(0, scratch_batch_->total_allocated_bytes()); Where do the decompression buffers get freed? http://gerrit.cloudera.org:8080/#/c/6949/2/be/src/exec/parquet-scratch-tuple-batch.h File be/src/exec/parquet-scratch-tuple-batch.h: PS2, Line 48: MemPool It's not clear from the var names comments where the var-len data goes. That's important to point out explicitly. Line 50: // Pool used to accumulate other memory such as decompression buffers that may be may be referenced Line 109: dst_batch->tuple_data_pool()->AcquireData(_mem_pool, false); I would have thought that the var-len data like strings or collections can make up the bulk of memory that needs to be transferred, so why not deep-copy those out as well and avoid this transfer? What's the rationale behind only avoiding transferring the mem for the fixed-len portion? Line 130: if (num_output_batches > 1) return false; This new compaction has non-obvious caveats like this one, and I find the flow of memory difficult to follow now. I wonder if this process could be simplified if we did something along these lines: 1. Evaluate conjuncts over all tuples in scratch batch. Keep a bitmap of survivors. 2. Decide whether to compact scratch batch or not. 3. Transfer rows to output batch. When AtEnd() of the scratch batch, have a function TransferResources() or similar to transfer whatever the output batch needs. This may be the original memory or memory from compaction. Let's discuss before you make any changes obviously :) Line 139: for (int i = dst_batch->num_rows(); i < end_row; ++i) { Don't we have a CopyRows() for this in RowBatch? -- To view, visit http://gerrit.cloudera.org:8080/6949 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Alex Behm Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5164: Fix flaky benchmarks
Zach Amsden has posted comments on this change. Change subject: IMPALA-5164: Fix flaky benchmarks .. Patch Set 2: (5 comments) http://gerrit.cloudera.org:8080/#/c/6935/2/be/src/util/benchmark.cc File be/src/util/benchmark.cc: Line 35: double Benchmark::Measure(BenchmarkFunction function, void* args, > Can we change the behaviour so that benchmarks can opt in or out from the c Definitely. I noticed a few tests with more complicated behavior that definitely don't want this (unfortunately they crash instead of running). Line 62: throw std::runtime_error("Benchmark failed to complete due to context switching."); > We usually use LOG(FATAL) for fatal errors or Status for non-fatal errors. That is true, but having an error message and call graph doesn't help here. Instead, I want to throw an error which can be caught and logged at the level where we have context on which benchmark was being run (see the sample error message in the summary). Line 69: } > Maybe we should reset 'sw' and 'lost_time' after doing the tuning step so w I thought so as well, but there is no reset method currently, so this actually seems easiest. PS2, Line 94: 10 > Where does the 10 come from? Arbitrary flake factor. If we lose 90% of measurements, we bail. This parameter just made the arithmetic easy. Line 95: throw std::runtime_error("Benchmark failed to complete due to noisy measurements."); > Same comment here about using Status/LOG(FATAL). Same reason - we want context on which benchmark was failing, which we don't have here. -- To view, visit http://gerrit.cloudera.org:8080/6935 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I7e0ddfaa9bd1340990ae2b6ff946d72bbd26c274 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach AmsdenGerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[native-toolchain-CR] Remove redundant libevent flag when building Thrift
Henry Robinson has posted comments on this change. Change subject: Remove redundant libevent flag when building Thrift .. Patch Set 1: Yep - mentioned in the commit msg "Libevent is not required for Thrift if the non-blocking server is not used.". We stopped using the non-blocking server a few years ago. -- To view, visit http://gerrit.cloudera.org:8080/6904 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibf6ef921f4389bd875578cd681f008b86f633af8 Gerrit-PatchSet: 1 Gerrit-Project: native-toolchain Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Matthew Jacobs Gerrit-HasComments: No
[native-toolchain-CR] Remove redundant libevent flag when building Thrift
Matthew Jacobs has posted comments on this change. Change subject: Remove redundant libevent flag when building Thrift .. Patch Set 2: I should have asked earlier - do you know why we stopped needing libevent for thrift? -- To view, visit http://gerrit.cloudera.org:8080/6904 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibf6ef921f4389bd875578cd681f008b86f633af8 Gerrit-PatchSet: 2 Gerrit-Project: native-toolchain Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Matthew Jacobs Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5349: flaky NoDirsAllocationError test
Henry Robinson has posted comments on this change. Change subject: IMPALA-5349: flaky NoDirsAllocationError test .. Patch Set 1: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/6953 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I023193b9ad6c6ac0ee114ad77ddf04d7d7185809 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Henry Robinson Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5085: large rows in BufferedTupleStreamV2
Tim Armstrong has posted comments on this change. Change subject: IMPALA-5085: large rows in BufferedTupleStreamV2 .. Patch Set 5: Rebased -- To view, visit http://gerrit.cloudera.org:8080/6638 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2861c58efa7bc1aeaa5b7e2f043c97cb3985c8f5 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5349: flaky NoDirsAllocationError test
Tim Armstrong has uploaded a new change for review. http://gerrit.cloudera.org:8080/6953 Change subject: IMPALA-5349: flaky NoDirsAllocationError test .. IMPALA-5349: flaky NoDirsAllocationError test The assertion was incorrect and racy - it is ok if the write error wins the race with the Unpin() calls, causing them to fail. Change-Id: I023193b9ad6c6ac0ee114ad77ddf04d7d7185809 --- M be/src/runtime/buffered-block-mgr-test.cc 1 file changed, 4 insertions(+), 4 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/53/6953/1 -- To view, visit http://gerrit.cloudera.org:8080/6953 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I023193b9ad6c6ac0ee114ad77ddf04d7d7185809 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong
[Impala-ASF-CR] IMPALA-5085: large rows in BufferedTupleStreamV2
Tim Armstrong has uploaded a new patch set (#5). Change subject: IMPALA-5085: large rows in BufferedTupleStreamV2 .. IMPALA-5085: large rows in BufferedTupleStreamV2 The stream defaults to pages of default_page_len_. If a row doesn't fit in that page, it will allocate another page up to max_page_len_ bytes and append a single row to that page, then immediately advance to the next page. This means that when writing a stream, the large page only needs to be kept in memory temporarily, which helps with memory requirements. E.g. consider a hash join that is repartitioning 1 unpinned stream into 16 unpinned streams. We will need default_page_len_ * 15 + max_page_len_ * 2 bytes of reservation because when processing a large row we only need one large write buffer at a time. The large row cases are not as optimised for memory consumption or performance - queries processing very large numbers of large rows are an extreme edge case that is likely to hit other performance bottlenecks first. Pages with large rows can have up to 50% internal fragmentation. I tried to avoid adding empty pages to the stream, but it was tricky to avoid in the case of a read/write stream where the read iterator was already pointing at the write page - in that case we can end up with an empty page. To avoid duplicating more logic between AddRow() and AllocateRow() I restructured things so that AddRowSlow() is implemented in terms of AllocateRowSlow(). AllocateRow() now takes a function as an argument to populate the row. Testing: * Added tests for the case where 0 rows are added to the stream * Extend BigRow to exercise the new code. * Also test large strings and read/write streams. Change-Id: I2861c58efa7bc1aeaa5b7e2f043c97cb3985c8f5 --- M be/src/runtime/buffered-tuple-stream-v2-test.cc M be/src/runtime/buffered-tuple-stream-v2.cc M be/src/runtime/buffered-tuple-stream-v2.h M be/src/runtime/buffered-tuple-stream-v2.inline.h M common/thrift/generate_error_codes.py 5 files changed, 544 insertions(+), 236 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/38/6638/5 -- To view, visit http://gerrit.cloudera.org:8080/6638 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2861c58efa7bc1aeaa5b7e2f043c97cb3985c8f5 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong
[native-toolchain-CR] Remove redundant libevent flag when building Thrift
Matthew Jacobs has posted comments on this change. Change subject: Remove redundant libevent flag when building Thrift .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/6904 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibf6ef921f4389bd875578cd681f008b86f633af8 Gerrit-PatchSet: 2 Gerrit-Project: native-toolchain Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Matthew Jacobs Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4164: Use inline hints instead of always-inline
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4164: Use inline hints instead of always-inline .. Patch Set 2: (2 comments) http://gerrit.cloudera.org:8080/#/c/6941/2/be/src/codegen/llvm-codegen.cc File be/src/codegen/llvm-codegen.cc: Line 163: target_cpu_attr_ = fn->getFnAttribute("target-cpu").getValueAsString(); > When we add AVX2 instruction to cross-compiled code, wouldn't we need to ad We have __attribute__((target("avx2"))) on some bloom filter functions already that I think get cross-compiled - that overrides whatever compilation flags were present. We could also emit AVX2 instructions as part of codegen, or inline cross-compiled functions with AVX2 into codegen'd functions. Line 166: LOG(INFO) << "target_features_attr: " << target_features_attr_; > This is logged once during start-up. Would be helpful for debugging. Ah ok, maybe make the log text a bit more human-readable then? -- To view, visit http://gerrit.cloudera.org:8080/6941 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2d87ae8d222b415587e7320cb9072e4a8d6615ce Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Michael HoGerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[native-toolchain-CR] Remove redundant libevent flag when building Thrift
Henry Robinson has posted comments on this change. Change subject: Remove redundant libevent flag when building Thrift .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/6904/1/source/thrift/build.sh File source/thrift/build.sh: Line 35 > I don't think we need libevent otherwise, right? If not, please remove from Done - did it in a separate commit in case we ever want to revert it. -- To view, visit http://gerrit.cloudera.org:8080/6904 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibf6ef921f4389bd875578cd681f008b86f633af8 Gerrit-PatchSet: 1 Gerrit-Project: native-toolchain Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Matthew Jacobs Gerrit-HasComments: Yes
[native-toolchain-CR] Remove libevent from toolchain
Henry Robinson has uploaded a new change for review. http://gerrit.cloudera.org:8080/6952 Change subject: Remove libevent from toolchain .. Remove libevent from toolchain Only Thrift ever needed libevent, and it no longer needs it, so save the time building this unused dependency. Change-Id: Icceedef1a6e4ee3e0f16a5e867b93925739b3c82 --- M buildall.sh D source/libevent/build.sh 2 files changed, 0 insertions(+), 42 deletions(-) git pull ssh://gerrit.cloudera.org:29418/native-toolchain refs/changes/52/6952/1 -- To view, visit http://gerrit.cloudera.org:8080/6952 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: Icceedef1a6e4ee3e0f16a5e867b93925739b3c82 Gerrit-PatchSet: 1 Gerrit-Project: native-toolchain Gerrit-Branch: master Gerrit-Owner: Henry Robinson
[native-toolchain-CR] Remove redundant libevent flag when building Thrift
Henry Robinson has uploaded a new patch set (#2). Change subject: Remove redundant libevent flag when building Thrift .. Remove redundant libevent flag when building Thrift --with-libevent was specified twice, and overridden by the second option. Libevent is not required for Thrift if the non-blocking server is not used. Change-Id: Ibf6ef921f4389bd875578cd681f008b86f633af8 --- M buildall.sh M source/thrift/build.sh 2 files changed, 1 insertion(+), 6 deletions(-) git pull ssh://gerrit.cloudera.org:29418/native-toolchain refs/changes/04/6904/2 -- To view, visit http://gerrit.cloudera.org:8080/6904 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibf6ef921f4389bd875578cd681f008b86f633af8 Gerrit-PatchSet: 2 Gerrit-Project: native-toolchain Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: Matthew Jacobs
[Impala-ASF-CR] Loads all TPC-DS tables
Mostafa Mokhtar has posted comments on this change. Change subject: Loads all TPC-DS tables .. Patch Set 3: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/6877 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ic5277245fd20827c9c09ce5c1a7a37266ca476b9 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Michael HoGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Mostafa Mokhtar Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. Patch Set 7: (19 comments) Looks promising. We're likely to be building more things on top of this, and the potential impact is pretty big if we made the right design decisions, so I want to make sure that we're making the right choices here - I think there are a few decision points we need to think through careful. http://gerrit.cloudera.org:8080/#/c/6726/7//COMMIT_MSG Commit Message: Line 7: IMPALA-4864 Speed up single slot predicates with dictionaries The JIRA seems to be talking about doing something slightly different: taking predicates of form (x OP constant), translating the constant into the codeword, then comparing dict_index directly to the codeword. That avoids any kind of dictionary or bitmap lookup and can be vectorized so can be very effective. There are various papers about this kind of trick - I can find links if you haven't seen them. I think maybe we need to split out this work into a separate JIRA. Your current approach applies to a much larger range of predicates so is also very useful. It would be good to think about whether the infrastructure will support that kind of optimisation in the future. I think doing a more column-oriented approach might make it more natural to do multiple passes over the column. Line 41: function per file format, so we would have to simulate a file format Yeah that codegen'd function map is a bit of a mess. http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner-ir.cc File be/src/exec/hdfs-parquet-scanner-ir.cc: Line 40 I believe this code was deliberately written like this to avoid the indirection via this->scratch_batch_ in the tight loop below - I think we should leave this alone. Line 77:tuple_index = scratch_batch_->NextValidTuple()) { It seems like this might be slow for the case when many or all of the tuples are valid, since NextSetBit() is fairly expensive compared to incrementing a counter. It's probably worth waiting to see performance numbers, but my intuition is that we'll need to make sure that advancing to the next bit is cheaper. http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 935 I'm not sure, but this hoisting may have been deliberate because the compiler couldn't hoist the load via scratch_batch_ out of the loop Line 208: dict_filters_active_.reset(new bool[scanner_conjunct_ctxs_->size()]); This probably works fine in practice but afaik scoped_ptr will call free instead of free[] on the array. I think unique_ptrdoes the right thing. Line 865: // Legacy impala files cannot be eliminated here, because the only way to I think we're missing an opportunity to apply the optimisation to columns with a mix of plain and dictionary-encoded pages. My understanding is that the pre-existing dictionary filtering optimisation only works when the whole column is dictionary-encoded, but your new optimisation can be applied on a page-by-page basis. I'm not sure how hard it is to restructure things to get it to work in that case. I think it's probably fairly low-impact because most columns will either be predominantly plain-encoded or predominantly dict-encoded, but we should add a comment so that the limitation is explicit. PS7, Line 1051: _ nit: extraneous _ http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: Line 436: /// XXX Why is this a scoped_ptr ? IIRC so we don't need to include the full header. http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 317: bool ReadValueBatch(ScratchTupleBatch& scratch, int* num_values) { I think it would be best to avoid using the ScratchTupleBatch abstraction in this code - these functions are perf critical and it's easier to reason about performance when the abstraction is stripped away and the code is directly operating over contiguous arrays as much as possible. Maybe just pass in the bitmap directly? Line 408: if (IS_FILTERED && IS_DICT_ENCODED && This patch pushes the predicate evaluation down into ReadFilteredSlot() so that it's done value-by-value as they are materialised. I'm not convinced this is the best approach - I think we should consider peeling predicate evaluation out into a separate loop and doing it columnwise. This would mean that ReadSlot() or some variant would materialize the values and the codewords into an array, then we'd do another pass over the codeword array to evaluate the conjunctions. Or maybe materialize the codewords in one pass, then doing another pass to evaluate conjuncts, then another one to lookup the surviving values in the dictionary.
[Impala-ASF-CR] [DOCS] Add several items to "known issues" page
John Russell has posted comments on this change. Change subject: [DOCS] Add several items to "known issues" page .. Patch Set 3: I moved IMPALA-4432 into the "Incompatible Changes" page. The better to discuss the behavior change due to the fix, since it was found and fixed within the same release cycle. -- To view, visit http://gerrit.cloudera.org:8080/5809 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeed92e8f6603b858622b1d397d1f5cad810f4cf Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: John RussellGerrit-Reviewer: John Russell Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4164: Use inline hints instead of always-inline
Michael Ho has posted comments on this change. Change subject: IMPALA-4164: Use inline hints instead of always-inline .. Patch Set 2: (2 comments) http://gerrit.cloudera.org:8080/#/c/6941/2/be/src/codegen/llvm-codegen.cc File be/src/codegen/llvm-codegen.cc: Line 163: target_cpu_attr_ = fn->getFnAttribute("target-cpu").getValueAsString(); > This solves the inlining problem but the outcome is a bit counterintuitive When we add AVX2 instruction to cross-compiled code, wouldn't we need to add "-mavx2" to the compilation flag when cross-compiling the code ? In that case, wouldn't the lowest-common-denominator also include "avx2" ? May be I misunderstood the comment ? Line 166: LOG(INFO) << "target_features_attr: " << target_features_attr_; > Should this logging be left in? This is logged once during start-up. Would be helpful for debugging. -- To view, visit http://gerrit.cloudera.org:8080/6941 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2d87ae8d222b415587e7320cb9072e4a8d6615ce Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Michael HoGerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] [DOCS] Add several items to "known issues" page
Hello Michael Brown, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/5809 to look at the new patch set (#3). Change subject: [DOCS] Add several items to "known issues" page .. [DOCS] Add several items to "known issues" page IMPALA-4432 is actually an incompatible change, so it gets listed on the separate "Incompatible Changes" page. Change-Id: Ibeed92e8f6603b858622b1d397d1f5cad810f4cf --- M docs/topics/impala_incompatible_changes.xml M docs/topics/impala_known_issues.xml 2 files changed, 93 insertions(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/09/5809/3 -- To view, visit http://gerrit.cloudera.org:8080/5809 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibeed92e8f6603b858622b1d397d1f5cad810f4cf Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: John RussellGerrit-Reviewer: John Russell Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-4164: Use inline hints instead of always-inline
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4164: Use inline hints instead of always-inline .. Patch Set 2: (2 comments) http://gerrit.cloudera.org:8080/#/c/6941/2/be/src/codegen/llvm-codegen.cc File be/src/codegen/llvm-codegen.cc: Line 163: target_cpu_attr_ = fn->getFnAttribute("target-cpu").getValueAsString(); This solves the inlining problem but the outcome is a bit counterintuitive - it's taking the target attributes from the IR module (which is the lowest-common-denominator set of features that each function was originally compiled for) and copying them to all the functions. I think this makes an incorrect assumption that all functions in the module have the same target attributes, which wouldn't be true for, say, AVX2-specific functions. So if one of those functions was in the module the outcome would depend on the (semi-arbitrary) order of the functions in the IR module. At a minimum it needs some explanation, but I think it would make more sense to ensure that cross-compiled functions target the current machine, rather than the lowest-common-denominator. If I remember correctly, if the function has no attributes set then it defaults to the execution engine's, which match the current machine. This would also potentially causes problems if we ever wanted to create functions with instructions like AVX2. Line 166: LOG(INFO) << "target_features_attr: " << target_features_attr_; Should this logging be left in? -- To view, visit http://gerrit.cloudera.org:8080/6941 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2d87ae8d222b415587e7320cb9072e4a8d6615ce Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Michael HoGerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5031: Initialize only parsing header before reading it
Tim Armstrong has posted comments on this change. Change subject: IMPALA-5031: Initialize only_parsing_header_ before reading it .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/6937 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I0ddc496019941048b3e0775606fa5e8e3f9c075a Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Jim AppleGerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5347: Parquet scanner microoptimizations
Tim Armstrong has uploaded a new patch set (#2). Change subject: IMPALA-5347: Parquet scanner microoptimizations .. IMPALA-5347: Parquet scanner microoptimizations A mix of microoptimizations that profiling the parquet scanner revealed. All lead to some measurable improvement and added up to significant speedups for various scans. * Add ALWAYS_INLINE to hot functions that GCC was mistakenly not inlining in all cases. * Apply __restrict__ in a few places so the compiler knows that it is safe to cache values accessed via those pointers * memset() the whole batch instead of the null indicators is cases where it is almost certainly cheaper. * Avoid updating two correlated loop variables in MaterializeValueBatch(). * Avoid unnecessary initialization of often-unused 'val' in ReadSlot(). * Shave a few instructions off the (still very expensive) bit unpacking and dict decoding logic. Performance: Some local TPC-H and targeted-perf benchmarks showed average speedups of ~5%. I did some benchmarks targeted at measuring column materialisation performance using a version of lineitem with duplicated data to make it bigger. These queries all got significantly faster. Dict-encoded DECIMAL: 2.23 -> 1.23s SELECT count(*) FROM biglineitem WHERE l_quantity > 49 Plain-encoded BIGINT: 2.33s -> 1.62s SELECT count(*) FROM biglineitem WHERE l_orderkey != 10 Dict-encoded STRING: 2.73s -> 1.72s SELECT count(*) FROM biglineitem WHERE l_returnflag = 'A' Multiple columns: 5.15s -> 3.74s SELECT count(*) FROM biglineitem WHERE l_quantity > 49 and l_partkey != 199 and l_tax < 100 Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5 --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/util/bit-stream-utils.inline.h M be/src/util/bit-util.h M be/src/util/dict-encoding.h M be/src/util/rle-encoding.h 6 files changed, 55 insertions(+), 27 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/6950/2 -- To view, visit http://gerrit.cloudera.org:8080/6950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong
[Impala-ASF-CR] IMPALA-2522: Add doc for sortby() and clustered hints
Impala Public Jenkins has submitted this change and it was merged. Change subject: IMPALA-2522: Add doc for sortby() and clustered hints .. IMPALA-2522: Add doc for sortby() and clustered hints Add CLUSTERED hint. Update hint syntax in INSERT topic. Also modernize the hint syntax as shown under INSERT to include the -- and /* */ formats also. List the [] style last since it is the least-preferred option. Switch to preferring /* */ syntax for hints instead of using the [ ] notation by default. Finally, take out references to the SORTBY hint because it didn't actually make it in. Intent for future is to have a way to get this behavior without using a hint. Change-Id: Id3c1da9a87ace361b096fa73d8504b2f54e75bed Reviewed-on: http://gerrit.cloudera.org:8080/5655 Reviewed-by: John RussellTested-by: Impala Public Jenkins --- M docs/shared/impala_common.xml M docs/topics/impala_hints.xml M docs/topics/impala_insert.xml 3 files changed, 71 insertions(+), 28 deletions(-) Approvals: Impala Public Jenkins: Verified John Russell: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/5655 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: Id3c1da9a87ace361b096fa73d8504b2f54e75bed Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: John Russell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: John Russell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Brown
[Impala-ASF-CR] IMPALA-2522: Add doc for sortby() and clustered hints
Impala Public Jenkins has posted comments on this change. Change subject: IMPALA-2522: Add doc for sortby() and clustered hints .. Patch Set 8: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/5655 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id3c1da9a87ace361b096fa73d8504b2f54e75bed Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: John RussellGerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: John Russell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Brown Gerrit-HasComments: No
[Impala-ASF-CR] [DOCS] Add several items to "known issues" page
John Russell has posted comments on this change. Change subject: [DOCS] Add several items to "known issues" page .. Patch Set 2: Resolved conflict - brought in a couple of other known issues that were added in the meantime. -- To view, visit http://gerrit.cloudera.org:8080/5809 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeed92e8f6603b858622b1d397d1f5cad810f4cf Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: John RussellGerrit-Reviewer: John Russell Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] [DOCS] Add several items to "known issues" page
Hello Michael Brown, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/5809 to look at the new patch set (#2). Change subject: [DOCS] Add several items to "known issues" page .. [DOCS] Add several items to "known issues" page Change-Id: Ibeed92e8f6603b858622b1d397d1f5cad810f4cf --- M docs/topics/impala_known_issues.xml 1 file changed, 95 insertions(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/09/5809/2 -- To view, visit http://gerrit.cloudera.org:8080/5809 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibeed92e8f6603b858622b1d397d1f5cad810f4cf Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: John RussellGerrit-Reviewer: John Russell Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5164: Fix flaky benchmarks
Tim Armstrong has posted comments on this change. Change subject: IMPALA-5164: Fix flaky benchmarks .. Patch Set 2: (6 comments) Thanks for doing this. The overall idea looks good, just some comments about a few details. http://gerrit.cloudera.org:8080/#/c/6935/2/be/src/util/benchmark.cc File be/src/util/benchmark.cc: Line 35: double Benchmark::Measure(BenchmarkFunction function, void* args, Can we change the behaviour so that benchmarks can opt in or out from the context switch test? The new behaviour makes sense if the benchmark is measuring single-threaded efficiency but doesn't make sense in some other cases. E.g. for lock-benchmark we want to understand what the OS scheduler does when there are more threads than logical cores contending for a lcok. Line 62: throw std::runtime_error("Benchmark failed to complete due to context switching."); We usually use LOG(FATAL) for fatal errors or Status for non-fatal errors. I think either could make sense here. Line 69: } Maybe we should reset 'sw' and 'lost_time' after doing the tuning step so we don't have unnecessary lost time. Line 90: batch_size = max(1, batch_size / (1+(ru_stop.ru_nivcsw - ru_start.ru_nivcsw))); Maybe comment why we're scaling it down here? PS2, Line 94: 10 Where does the 10 come from? Line 95: throw std::runtime_error("Benchmark failed to complete due to noisy measurements."); Same comment here about using Status/LOG(FATAL). -- To view, visit http://gerrit.cloudera.org:8080/6935 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I7e0ddfaa9bd1340990ae2b6ff946d72bbd26c274 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach AmsdenGerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2522: Add doc for sortby() and clustered hints
Impala Public Jenkins has posted comments on this change. Change subject: IMPALA-2522: Add doc for sortby() and clustered hints .. Patch Set 8: Build started: http://jenkins.impala.io:8080/job/gerrit-docs-submit/115/ -- To view, visit http://gerrit.cloudera.org:8080/5655 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id3c1da9a87ace361b096fa73d8504b2f54e75bed Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: John RussellGerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: John Russell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Brown Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2522: Add doc for sortby() and clustered hints
John Russell has posted comments on this change. Change subject: IMPALA-2522: Add doc for sortby() and clustered hints .. Patch Set 8: Code-Review+2 Carrying forward +2 after resolving merge conflict. -- To view, visit http://gerrit.cloudera.org:8080/5655 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id3c1da9a87ace361b096fa73d8504b2f54e75bed Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: John RussellGerrit-Reviewer: John Russell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Brown Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5331: Use new libHDFS API to address "Unknown Error 255"
David Knupp has posted comments on this change. Change subject: IMPALA-5331: Use new libHDFS API to address "Unknown Error 255" .. Patch Set 7: Code-Review+1 Python test code looks OK to me. -- To view, visit http://gerrit.cloudera.org:8080/6894 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I181e316ed63b70b94d4f7a7557d398a931bb171d Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Sailesh MukilGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: David Knupp Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2522: Add doc for sortby() and clustered hints
Hello Lars Volker, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/5655 to look at the new patch set (#8). Change subject: IMPALA-2522: Add doc for sortby() and clustered hints .. IMPALA-2522: Add doc for sortby() and clustered hints Add CLUSTERED hint. Update hint syntax in INSERT topic. Also modernize the hint syntax as shown under INSERT to include the -- and /* */ formats also. List the [] style last since it is the least-preferred option. Switch to preferring /* */ syntax for hints instead of using the [ ] notation by default. Finally, take out references to the SORTBY hint because it didn't actually make it in. Intent for future is to have a way to get this behavior without using a hint. Change-Id: Id3c1da9a87ace361b096fa73d8504b2f54e75bed --- M docs/shared/impala_common.xml M docs/topics/impala_hints.xml M docs/topics/impala_insert.xml 3 files changed, 71 insertions(+), 28 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/55/5655/8 -- To view, visit http://gerrit.cloudera.org:8080/5655 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Id3c1da9a87ace361b096fa73d8504b2f54e75bed Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: John RussellGerrit-Reviewer: John Russell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Brown
[Impala-ASF-CR] Parquet scanner microoptimizations
Tim Armstrong has uploaded a new change for review. http://gerrit.cloudera.org:8080/6950 Change subject: Parquet scanner microoptimizations .. Parquet scanner microoptimizations A mix of microoptimizations that profiling the parquet scanner revealed. All lead to some measurable improvement and added up to significant speedups for various scans. * Add ALWAYS_INLINE to hot functions that GCC was mistakenly not inlining in all cases. * Apply __restrict__ in a few places so the compiler knows that it is safe to cache values accessed via those pointers * memset() the whole batch instead of the null indicators is cases where it is almost certainly cheaper. * Avoid updating two correlated loop variables in MaterializeValueBatch(). * Avoid unnecessary initialization of often-unused 'val' in ReadSlot(). * Shave a few instructions off the (still very expensive) bit unpacking and dict decoding logic. Performance: Some local TPC-H and targeted-perf benchmarks showed average speedups of ~5%. I did some benchmarks targeted at measuring column materialisation performance using a version of lineitem with duplicated data to make it bigger. These queries all got significantly faster. Dict-encoded DECIMAL: 2.23 -> 1.23s SELECT count(*) FROM biglineitem WHERE l_quantity > 49 Plain-encoded BIGINT: 2.33s -> 1.62s SELECT count(*) FROM biglineitem WHERE l_orderkey != 10 Dict-encoded STRING: 2.73s -> 1.72s SELECT count(*) FROM biglineitem WHERE l_returnflag = 'A' Multiple columns: 5.15s -> 3.74s SELECT count(*) FROM biglineitem WHERE l_quantity > 49 and l_partkey != 199 and l_tax < 100 Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5 --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/util/bit-stream-utils.inline.h M be/src/util/bit-util.h M be/src/util/dict-encoding.h M be/src/util/rle-encoding.h 6 files changed, 55 insertions(+), 27 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/6950/1 -- To view, visit http://gerrit.cloudera.org:8080/6950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I49ec523a65542fdbabd53fbcc4a8901d769e5cd5 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong
[Impala-ASF-CR] Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet"
Attila Jeges has uploaded a new patch set (#2). Change subject: Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet" .. Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet" Reverting IMPALA-2716 as SparkSQL does not agree with the approach taken. More details can be found at: https://issues.apache.org/jira/browse/SPARK-12297 Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd --- M be/src/benchmarks/CMakeLists.txt D be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M tests/common/impala_test_suite.py M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py M tests/metadata/test_ddl.py M tests/metadata/test_ddl_base.py D tests/query_test/test_parquet_timestamp_compatibility.py 28 files changed, 75 insertions(+), 850 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/96/6896/2 -- To view, visit http://gerrit.cloudera.org:8080/6896 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht
[Impala-ASF-CR] Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet"
Attila Jeges has posted comments on this change. Change subject: Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet" .. Patch Set 2: (2 comments) http://gerrit.cloudera.org:8080/#/c/6896/1/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 25: #include > why do we need these headers? Removed the new includes and kept boost/date_time/local_time/local_time.hpp (it is required for boost::posix_time) http://gerrit.cloudera.org:8080/#/c/6896/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: Line 123: // Avro schema of this table if this is an Avro table, otherwise null. Set in load(). > what did these have to do with your change? are they just dead variables (i Removed them, they were left behind accidentaly when I resolved the conflicts. -- To view, visit http://gerrit.cloudera.org:8080/6896 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-HasComments: Yes
[Impala-ASF-CR] Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet"
Hello Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6896 to look at the new patch set (#2). Change subject: Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet" .. Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet" Reverting IMPALA-2716 as SparkSQL does not agree with the approach taken. More details can be found at: https://issues.apache.org/jira/browse/SPARK-12297 Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd --- M be/src/benchmarks/CMakeLists.txt D be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M tests/common/impala_test_suite.py M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py M tests/metadata/test_ddl.py M tests/metadata/test_ddl_base.py D tests/query_test/test_parquet_timestamp_compatibility.py 28 files changed, 75 insertions(+), 850 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/96/6896/2 -- To view, visit http://gerrit.cloudera.org:8080/6896 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht
[Impala-ASF-CR] Add a script to test performance on a developer machine
Michael Brown has posted comments on this change. Change subject: Add a script to test performance on a developer machine .. Patch Set 4: (4 comments) http://gerrit.cloudera.org:8080/#/c/6818/4/bin/single_node_perf_run.py File bin/single_node_perf_run.py: PS4, Line 110: git_hash = sh.git("rev-parse", "HEAD").strip() get_git_hash_for_name("HEAD") ? PS4, Line 254: parser.add_option("--no_load", dest="load", action="store_false", default=True, : help="use already-present workload databases") : parser.add_option("--no_start_minicluster", dest="start_minicluster", : action="store_false", default=True, : help="use already-running minicluster") It's confusing to me to have --no options that default to True that are stored_false. I can't keep track of what's going on and what the defaults will actually be. Are there ways you could make it less confusing? Can you reduce some of the negativity? PS4, Line 262: parser.set_usage("single_node_perf_run.py [options] git_hash_A [git_hash_B]\n\n" :"Compares the performance of Impala at two git hashes on\n" :"some standard benchmarks. Output is in\n" :"$IMPALA_HOME/perf_results/latest.\n\n" :"WARNING: This script will run git checkout. You should not touch\n" :"the tree while the script is running. You should start the script\n" :"from a clean git tree.\n\n" :"WARNING: Unless --no_load is used, this script calls load_data.py\n" :"which can overwrite your TPC-H and TPC-DS data.") Nit: Consider a triple-quoted string and using textwrap.dedent(). parser.set_usage(textwrap.dedent("""\ single_node_perf_run.py [options] git_hash_A [git_hash_B] Compares the performance of Impala [etc] """) PS4, Line 296: main() 2 spaces. -- To view, visit http://gerrit.cloudera.org:8080/6818 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I70ba7f3c28f612a370915615600bf8dcebcedbc9 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Jim AppleGerrit-Reviewer: David Knupp Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Michael Brown Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4923: reduce memory transfer for selective scans
Tim Armstrong has uploaded a new patch set (#2). Change subject: IMPALA-4923: reduce memory transfer for selective scans .. IMPALA-4923: reduce memory transfer for selective scans Most of the code changes are to restructure things so that the scratch batch's tuple buffer is stored in a separate MemPool from auxiliary memory such as decompression buffers. This part of the change does not change the behaviour of the scanner in itself, but allows us to recycle the tuple buffer without holding onto unused auxiliary memory. The optimisation is implemented in TryCompact(): if enough rows were filtered out during the copy from the scratch batch to the output batch, the fixed-length portions of the surviving rows (if any) are copied to a new, smaller, buffer, and the original, larger, buffer is reused for the next scratch batch. Previously the large buffer was always attached to the output batch, so a large buffer was transferred between threads for every scratch batch processed. In combination with the decompression buffer change in IMPALA-5304, this means that in many cases selective scans don't produce nearly as many empty or near-empty batches and do not attach nearly as much memory to each batch. Performance: Even on an 8 core machine I see some speedup on selective scans. Profiling with "perf top" also showed that time in TCMalloc was reduced - it went from several % of CPU time to a minimal amount. Running TPC-H on the same machine showed a ~5% overall improvement and no regressions. E.g. Q6 got 20-25% faster. I hope to do some additional cluster benchmarking on systems with more cores to verify that the severe performance problems there are fixed, but in the meantime it seems like we have enough evidence that it will at least improve things. Testing: Add a couple of selective scans that exercise the new code paths. Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-scratch-tuple-batch.h M be/src/runtime/row-batch.cc M be/src/runtime/row-batch.h M testdata/workloads/functional-query/queries/QueryTest/parquet.test 6 files changed, 205 insertions(+), 56 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/6949/2 -- To view, visit http://gerrit.cloudera.org:8080/6949 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Alex Behm
[Impala-ASF-CR] Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet"
Hello Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6896 to look at the new patch set (#2). Change subject: Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet" .. Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet" Reverting IMPALA-2716 as SparkSQL does not agree with the approach taken. More details can be found at: https://issues.apache.org/jira/browse/SPARK-12297 Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd --- M be/src/benchmarks/CMakeLists.txt D be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M tests/common/impala_test_suite.py M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py M tests/metadata/test_ddl.py M tests/metadata/test_ddl_base.py D tests/query_test/test_parquet_timestamp_compatibility.py 28 files changed, 84 insertions(+), 850 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/96/6896/2 -- To view, visit http://gerrit.cloudera.org:8080/6896 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht
[Impala-ASF-CR] IMPALA-5331: Use new libHDFS API to address "Unknown Error 255"
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-5331: Use new libHDFS API to address "Unknown Error 255" .. Patch Set 7: (2 comments) http://gerrit.cloudera.org:8080/#/c/6894/6/tests/data_errors/test_data_errors.py File tests/data_errors/test_data_errors.py: Line 78: try: > I wonder about stacking all of these asserts in one block. If any of them f Yes, that makes sense. If it fails, we'd want to know why. I've added a line that prints the stderr if it's not empty. This should have sufficient information to be able to debug if any of these commands fail. We'd want the safe mode to be turned off regardless of success or failure, since we don't want the effect of this test to leak into other tests if anything gets run after this. Line 97: # Confirm that we were able to get the root cause. > If this is the happy path, this is fine. If the assert fails though, you mi I've added a message that says "Couldn't turn Safe mode OFF". -- To view, visit http://gerrit.cloudera.org:8080/6894 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I181e316ed63b70b94d4f7a7557d398a931bb171d Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Sailesh MukilGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: David Knupp Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5331: Use new libHDFS API to address "Unknown Error 255"
Hello Matthew Jacobs, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6894 to look at the new patch set (#7). Change subject: IMPALA-5331: Use new libHDFS API to address "Unknown Error 255" .. IMPALA-5331: Use new libHDFS API to address "Unknown Error 255" We use the new libHDFS API hdfsGetLastExceptionRootCause() to return the last seen HDFS error on that thread. This patch depends on the recent HDFS commit: https://github.com/apache/hadoop/commit/fda86ef2a32026c02d9b5d4cca1ecb7b4decd872 Testing: A test has been added which puts HDFS in safe mode and then verifies that we see a 255 error with the root cause. Change-Id: I181e316ed63b70b94d4f7a7557d398a931bb171d --- M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-util.cc M tests/data_errors/test_data_errors.py 3 files changed, 44 insertions(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/94/6894/7 -- To view, visit http://gerrit.cloudera.org:8080/6894 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I181e316ed63b70b94d4f7a7557d398a931bb171d Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Sailesh MukilGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: David Knupp Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil
[Impala-ASF-CR] IMPALA-5166: clean up BufferPool counters
Impala Public Jenkins has posted comments on this change. Change subject: IMPALA-5166: clean up BufferPool counters .. Patch Set 9: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/6690 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I34b7f4d94c3d396ac89026c7559d6b2c6e02697c Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No