[Impala-ASF-CR] IMPALA-1654: DDL for multiple partitions
Amos Bird has posted comments on this change. Change subject: IMPALA-1654: DDL for multiple partitions .. Patch Set 3: (2 comments) http://gerrit.cloudera.org:8080/#/c/5726/3/docs/topics/impala_alter_table.xml File docs/topics/impala_alter_table.xml: Line 144: For example, you might drop a group of partitions corresponding to a particular date should we warn user about the scalability limit (multiple hms RPCs) of current bulk dropping? http://gerrit.cloudera.org:8080/#/c/5726/3/docs/topics/impala_compute_stats.xml File docs/topics/impala_compute_stats.xml: Line 130: The following COMPUTE INCREMENTAL STATS statements affect some but not all trailing space. -- To view, visit http://gerrit.cloudera.org:8080/5726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2060552d5081e5f93b1b1f398414c52fa03f215b Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: John RussellGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Amos Bird Gerrit-HasComments: Yes
[Impala-ASF-CR] Add doc for MT DOP query option.
Mostafa Mokhtar has posted comments on this change. Change subject: Add doc for MT_DOP query option. .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/5652/1/docs/topics/impala_mt_dop.xml File docs/topics/impala_mt_dop.xml: Line 39: Sets the degree of parallelism used for certain operations that Should we mention the operations where mt_dop applies? Compute stats and queries that have scan and aggregate only operators? Line 42: and increased memory and CPU usage during statement processing. I would reword to "ideal balance between response time, memory and CPU". As some operations like "compute stats" consume less overall memory due to running with less scanner threads. -- To view, visit http://gerrit.cloudera.org:8080/5652 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ife2786532b425af6d230074f1c0b5c7dcb2b8a92 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: John RussellGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3909: Populate min/max statistics in Parquet writer
Lars Volker has posted comments on this change. Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer .. Patch Set 2: (15 comments) Thanks for the review, and apologies for the long delay. I ran into several issues when comparing statistics with Hive, until I found PARQUET-251, which seems to cause invalid stats to be written by Hive. http://gerrit.cloudera.org:8080/#/c/5611/1//COMMIT_MSG Commit Message: Line 9: Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619 > Do we have end-to-end testing that the column stats are correct? We really There is a simple test in test_insert_parquet.py but it only checks rowgroup statistics, not page statistics. I think it will be easier to write more extensive tests once the read path has been implemented. http://gerrit.cloudera.org:8080/#/c/5611/1/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: Line 90: BaseColumnWriter(HdfsParquetTableWriter* parent, ExprContext* expr_ctx, > I feel like this file is getting pretty large. Can we pull these classes ou Done Line 97: num_values_(0), > I think we call this Merge() in other places like aggregate functions. Done Line 101: def_levels_(NULL), > I think the memory management here needs a bit more explanation. I.e. when Done Line 206: // levels data and the encoded values. If compression is enabled, this is the > Do we know what ordering the parquet specification defines for each type? T I added a comment to the class header. To me it looks safe to rely on the overloaded operators. Should we try and explicitly stamp out the template class only for the supported types to prevent new types from being added without manual interference? Are there any particular classes that you think may break in the future? Line 224: scoped_ptr compressor_; > I guess this is the explanation of Copy() I was looking for. I think we nee Done Line 288: new ColumnStats(parent_->per_file_mem_pool_.get(), encoded_value_size_)); > This can potentially use a lot of memory since the old strings are never fr Done PS1, Line 349: value to h > Maybe this shouldn't be CopyValues, rather CopyMemory? or CopyReferencedMem Done Line 485: > Can't we initialise most of these using the initialiser list? We should do I moved the initialization to Reset() to pass the correct memory pool, similar to dict_encoder_. Line 598: > Can't we initialise most of these using the initialiser list? We should do I made the *_stats_ inline members, so their default initialization is sufficient. The *_stats_base members are stored in the base class. I'm reluctant to move them into the c'tor, since for all other column writers they will not be initialized at construction time, but when calling Reset(). Line 628: // TODO: copy repetition data when we support nested types. > Do we need scoped_ptr? Can we just store these inline? Done http://gerrit.cloudera.org:8080/#/c/5611/1/tests/query_test/test_insert_parquet.py File tests/query_test/test_insert_parquet.py: Line 214: > Extra line Done Line 281: 'hive_skip_col_idx' are excluded from the verification of the expected values. > I think we need to find a way to test timestamp. Done Line 293: > We should also test the types that aren't in alltypes - char, varchar, deci Done Line 299: (qualified_table_name, source_table, num_lines) > Does this exercise the string copying code? If the table is too small it se I added a test using tpch_parquet.customer and manually verified that the resulting .parq file has multiple pages per column, thus exercising the string copy code. In hindsight it should already be sufficient to use a file with more multiple row batches, which functional.alltypes (7300 rows) should provide. I still think it makes sense to have a test with multiple pages per column so I'd opt to keep the new one, too. -- To view, visit http://gerrit.cloudera.org:8080/5611 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars VolkerGerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3909: Populate min/max statistics in Parquet writer
Lars Volker has uploaded a new patch set (#2). Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer .. IMPALA-3909: Populate min/max statistics in Parquet writer Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h A be/src/exec/parquet-column-stats.h M be/src/exec/parquet-common.h M be/src/runtime/string-value.h M be/src/util/dict-test.cc M tests/query_test/test_insert_parquet.py M tests/util/get_parquet_metadata.py 8 files changed, 697 insertions(+), 43 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/5611/2 -- To view, visit http://gerrit.cloudera.org:8080/5611 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars VolkerGerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-1654: DDL for multiple partitions
John Russell has uploaded a new patch set (#3). Change subject: IMPALA-1654: DDL for multiple partitions .. IMPALA-1654: DDL for multiple partitions Syntax and usage notes for ALTER TABLE, COMPUTE STATS, and SHOW FILES. Mixed in a little bit with new Kudu syntax for ALTER TABLE. Didn't include all new Kudu info in this CR, the better to minimize merge conflicts. Change-Id: I2060552d5081e5f93b1b1f398414c52fa03f215b --- M docs/topics/impala_alter_table.xml M docs/topics/impala_compute_stats.xml M docs/topics/impala_show.xml 3 files changed, 158 insertions(+), 3 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/5726/3 -- To view, visit http://gerrit.cloudera.org:8080/5726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2060552d5081e5f93b1b1f398414c52fa03f215b Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: John Russell
[Impala-ASF-CR] IMPALA-1654: DDL for multiple partitions
John Russell has uploaded a new patch set (#2). Change subject: IMPALA-1654: DDL for multiple partitions .. IMPALA-1654: DDL for multiple partitions Syntax and usage notes for ALTER TABLE, COMPUTE STATS, and SHOW FILES. Mixed in a little bit with new Kudu syntax for ALTER TABLE. Didn't include all new Kudu info in this CR, the better to minimize merge conflicts. Change-Id: I2060552d5081e5f93b1b1f398414c52fa03f215b --- M docs/topics/impala_alter_table.xml M docs/topics/impala_compute_stats.xml M docs/topics/impala_show.xml 3 files changed, 139 insertions(+), 3 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/5726/2 -- To view, visit http://gerrit.cloudera.org:8080/5726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2060552d5081e5f93b1b1f398414c52fa03f215b Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: John Russell
[Impala-ASF-CR] IMPALA-1654: DDL for multiple partitions
John Russell has uploaded a new change for review. http://gerrit.cloudera.org:8080/5726 Change subject: IMPALA-1654: DDL for multiple partitions .. IMPALA-1654: DDL for multiple partitions Syntax and usage notes for both ALTER TABLE and COMPUTE STATS. Mixed in a little bit with new Kudu syntax for ALTER TABLE. Didn't include all new Kudu info in this CR, the better to minimize merge conflicts. Change-Id: I2060552d5081e5f93b1b1f398414c52fa03f215b --- M docs/topics/impala_alter_table.xml M docs/topics/impala_compute_stats.xml 2 files changed, 125 insertions(+), 2 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/5726/1 -- To view, visit http://gerrit.cloudera.org:8080/5726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I2060552d5081e5f93b1b1f398414c52fa03f215b Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: John Russell
[Impala-ASF-CR] IMPALA-4449: Revisit table locking pattern in the catalog
Bharath Vissapragada has posted comments on this change. Change subject: IMPALA-4449: Revisit table locking pattern in the catalog .. Patch Set 1: (7 comments) http://gerrit.cloudera.org:8080/#/c/5710/1//COMMIT_MSG Commit Message: PS1, Line 29: getAllCatalogObjects getCatalogObjects()? http://gerrit.cloudera.org:8080/#/c/5710/1/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java File fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java: Line 926: while (true) { Regarding supportability, should we log a debug message/expose a average wait time metric via JMX if its stuck in these loops for longer than a configured value? IMO that makes it easier to debug queries hung at Catalog operations. (Same comment for other methods that follow this pattern, just wanted to hear you thoughts) Line 928: if (tbl.getLock().writeLock().tryLock()) { To avoid extra branch, may be we could refactor to something like catalogLock_.writeLock().lock(); if (!tbl.getLock().writeLock().tryLock()) { catalogLock_.writeLock().unlock(); continue; } . . . (Same comment for other places that uses this pattern) PS1, Line 950: continue; redundant? Line 1305: hdfsTable.setCatalogVersion(newCatalogVersion); - Just curious, is this an existing bug that we didn't set it if the partition exists? - May be we can move this line after L1285 before the try block since we are setting for both cases anyway and remove it at L1296? Line 1312: continue; redundant? http://gerrit.cloudera.org:8080/#/c/5710/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java: Line 190: * } I think we should document the exception of getCatalogObjects() here. -- To view, visit http://gerrit.cloudera.org:8080/5710 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id08e21da31deb1f003b3cada4517651f3b3b2bb2 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Dimitris TsirogiannisGerrit-Reviewer: Bharath Vissapragada Gerrit-HasComments: Yes
[Impala-ASF-CR](asf-site) Initial commit of the blog section of the Impala ASF website.
David Knupp has posted comments on this change. Change subject: Initial commit of the blog section of the Impala ASF website. .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/5667/3/blog/assets/js/html5.js File blog/assets/js/html5.js: Line 2: HTML5 Shiv v3.7.0 | @afarkas @jdalton @jon_neal @rem | MIT/GPL2 Licensed > https://issues.apache.org/jira/browse/LEGAL-285 FYI: "IMO yes see [1] you can just select the MIT license. Given that your probably not bundling this in a release it probably doesn't matter." https://issues.apache.org/jira/browse/LEGAL-285 -- To view, visit http://gerrit.cloudera.org:8080/5667 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If5810f04b56c587a5114953cae7e4f484a92be75 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: asf-site Gerrit-Owner: David KnuppGerrit-Reviewer: David Knupp Gerrit-Reviewer: Jim Apple Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4670 / IMPALA-4672: Add RpcMgr and port Statestore services to KRPC
Henry Robinson has posted comments on this change. Change subject: IMPALA-4670 / IMPALA-4672: Add RpcMgr and port Statestore services to KRPC .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/5720/2/tests/statestore/test_statestore.py File tests/statestore/test_statestore.py: Line 18: # TODO(KRPC): Re-enable. > Or theoretically (meaning I haven't tried this) you could add a pytest skip I expect this file to get removed completely as part of this commit. -- To view, visit http://gerrit.cloudera.org:8080/5720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8dbf12b9ecd71d26d239d31c19b487175194c766 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: David Knupp Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4670 / IMPALA-4672: Add RpcMgr and port Statestore services to KRPC
David Knupp has posted comments on this change. Change subject: IMPALA-4670 / IMPALA-4672: Add RpcMgr and port Statestore services to KRPC .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/5720/2/tests/statestore/test_statestore.py File tests/statestore/test_statestore.py: Line 18: # TODO(KRPC): Re-enable. > Instead of adding a # on every line, I would suggest to add ''' (three sing Or theoretically (meaning I haven't tried this) you could add a pytest skipif mark at the module (global) scope. pytestmark = pytest.mark.skipif(reason="All tests to be rewritten in C++") -- To view, visit http://gerrit.cloudera.org:8080/5720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8dbf12b9ecd71d26d239d31c19b487175194c766 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: David Knupp Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes