[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Internal Jenkins has submitted this change and it was merged. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Reviewed-on: http://gerrit.cloudera.org:8080/2574 Reviewed-by: Sailesh Mukil Tested-by: Internal Jenkins --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/CMakeLists.txt M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h A be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java M fe/src/test/java/com/cloudera/impala/analysis/AnalyzeStmtsTest.java M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_col_stats.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_delimited_text.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_insert_permutation.py M tests/query_test/test_join_queries.py M tests/query_test/test_multiple_filesystems.py M tests/query_test/test_partitioning.py M tests/query_test/t
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 33: Code-Review+2 Rebase, carry +2. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 33 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change.
Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between
filesystems
..
Patch Set 31:
(1 comment)
http://gerrit.cloudera.org:8080/#/c/2574/31/fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java:
Line 137: if (!(fs instanceof DistributedFileSystem) || !(fs instanceof
S3AFileSystem)) {
> how did the tests not catch this?
It seems to have gotten skipped in every exhaustive run I did. I'm not sure why
yet, I'm investigating now.
--
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 31
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil
Gerrit-Reviewer: Dan Hecht
Gerrit-Reviewer: Henry Robinson
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker
Gerrit-Reviewer: Michael Brown
Gerrit-Reviewer: Mostafa Mokhtar
Gerrit-Reviewer: Sailesh Mukil
Gerrit-Reviewer: Taras Bobrovytsky
Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Marcel Kornacker, Michael Brown, Taras Bobrovytsky, Internal Jenkins, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#33). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/CMakeLists.txt M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h A be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java M fe/src/test/java/com/cloudera/impala/analysis/AnalyzeStmtsTest.java M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_col_stats.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_delimited_text.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_insert_permutation.py M tests/query_test/test_join_queries.py M tests/query_test/test_multiple_filesystems.py M tests/que
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Marcel Kornacker has posted comments on this change.
Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between
filesystems
..
Patch Set 31:
(1 comment)
http://gerrit.cloudera.org:8080/#/c/2574/31/fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java:
Line 137: if (!(fs instanceof DistributedFileSystem) || !(fs instanceof
S3AFileSystem)) {
> Oh, yes. Sorry, I've changed it.
how did the tests not catch this?
--
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 31
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil
Gerrit-Reviewer: Dan Hecht
Gerrit-Reviewer: Henry Robinson
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker
Gerrit-Reviewer: Michael Brown
Gerrit-Reviewer: Mostafa Mokhtar
Gerrit-Reviewer: Sailesh Mukil
Gerrit-Reviewer: Taras Bobrovytsky
Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Marcel Kornacker has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 32: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 32 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Marcel Kornacker, Michael Brown, Taras Bobrovytsky, Internal Jenkins, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#32). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/CMakeLists.txt M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h A be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java M fe/src/test/java/com/cloudera/impala/analysis/AnalyzeStmtsTest.java M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_col_stats.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_delimited_text.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_insert_permutation.py M tests/query_test/test_join_queries.py M tests/query_test/test_multiple_filesystems.py M tests/que
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change.
Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between
filesystems
..
Patch Set 31:
(1 comment)
http://gerrit.cloudera.org:8080/#/c/2574/31/fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java:
Line 137: if (!(fs instanceof DistributedFileSystem) || !(fs instanceof
S3AFileSystem)) {
> if this must be either a dfs subclass or s3a, shouldn't this be
Oh, yes. Sorry, I've changed it.
--
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 31
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil
Gerrit-Reviewer: Dan Hecht
Gerrit-Reviewer: Henry Robinson
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker
Gerrit-Reviewer: Michael Brown
Gerrit-Reviewer: Mostafa Mokhtar
Gerrit-Reviewer: Sailesh Mukil
Gerrit-Reviewer: Taras Bobrovytsky
Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Marcel Kornacker has posted comments on this change.
Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between
filesystems
..
Patch Set 31:
(1 comment)
http://gerrit.cloudera.org:8080/#/c/2574/31/fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java:
Line 137: if (!(fs instanceof DistributedFileSystem) || !(fs instanceof
S3AFileSystem)) {
if this must be either a dfs subclass or s3a, shouldn't this be
if (!(fs instanceof DFS) && !(fs instanceof S3A)) throw ...
?
--
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 31
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil
Gerrit-Reviewer: Dan Hecht
Gerrit-Reviewer: Henry Robinson
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker
Gerrit-Reviewer: Michael Brown
Gerrit-Reviewer: Mostafa Mokhtar
Gerrit-Reviewer: Sailesh Mukil
Gerrit-Reviewer: Taras Bobrovytsky
Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Dan Hecht has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 31: (1 comment) http://gerrit.cloudera.org:8080/#/c/2574/31/fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java File fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java: Line 139: HDFS > If Isilon uses hdfs:// URIs, what you have now makes sense. Yes, Isilon implements HDFS protocol. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 31 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Henry Robinson has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 31: (1 comment) http://gerrit.cloudera.org:8080/#/c/2574/31/fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java File fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java: Line 139: HDFS > It seems like it was always assumed to be an HDFS filesystem because LOAD D If Isilon uses hdfs:// URIs, what you have now makes sense. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 31 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 31: (1 comment) http://gerrit.cloudera.org:8080/#/c/2574/31/fe/src/test/java/com/cloudera/impala/analysis/AnalyzeStmtsTest.java File fe/src/test/java/com/cloudera/impala/analysis/AnalyzeStmtsTest.java: Line 2876: // Source must be HDFS or S3A. Placeholder: FYI last patchset changes were made here too. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 31 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 31: (1 comment) http://gerrit.cloudera.org:8080/#/c/2574/31/fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java File fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java: Line 139: HDFS > What about (e.g.) Isilon? Are you calling that an HDFS filesystem? It seems like it was always assumed to be an HDFS filesystem because LOAD DATA is supposed to work for Isilon. Should I explicitly mention that too? -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 31 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Henry Robinson has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 31: (1 comment) http://gerrit.cloudera.org:8080/#/c/2574/31/fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java File fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java: Line 139: HDFS What about (e.g.) Isilon? Are you calling that an HDFS filesystem? -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 31 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 31: Ok just figured out that there was no file:/test-warehouse/test.out file and that my patch accidentally allowed LOAD DATA for local filesystem too, which we do not support. I fixed it in this patch. Will GVM after I get a +2. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 31 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Marcel Kornacker, Michael Brown, Taras Bobrovytsky, Internal Jenkins, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#31). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/CMakeLists.txt M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h A be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java M fe/src/test/java/com/cloudera/impala/analysis/AnalyzeStmtsTest.java M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_col_stats.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_delimited_text.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_insert_permutation.py M tests/query_test/test_join_queries.py M tests/query_test/test_multiple_filesystems.py M tests/que
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 30: > Build failed: > http://sandbox.jenkins.cloudera.com/job/impala-external-gerrit-verify-merge/2331/ Load data did not load a file and so a FE test failed. Resubmitting for GVM. Log: --- Test set: com.cloudera.impala.analysis.AnalyzeStmtsTest --- Tests run: 47, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.089 sec <<< FAILURE! - in com.cloudera.impala.analysis.AnalyzeStmtsTest TestLoadData(com.cloudera.impala.analysis.AnalyzeStmtsTest) Time elapsed: 0.322 sec <<< FAILURE! java.lang.AssertionError: got error: INPATH location 'file:/test-warehouse/test.out' does not exist. expected: INPATH location 'file:/test-warehouse/test.out' must point to an HDFS filesystem at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at com.cloudera.impala.analysis.AnalyzerTest.AnalysisError(AnalyzerTest.java:347) at com.cloudera.impala.analysis.AnalyzerTest.AnalysisError(AnalyzerTest.java:326) at com.cloudera.impala.analysis.AnalyzeStmtsTest.TestLoadData(AnalyzeStmtsTest.java:2877) -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 30 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 29: (1 comment) http://gerrit.cloudera.org:8080/#/c/2574/29/fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java File fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java: Line 180: if (!doRename) doRename = !destIsDfs && sameFileSystem; > out of curiosity: what category is s3 in? S3 falls under the non-distributed category. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 29 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 30: Code-Review+2 Rebase, carry +2. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 30 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Marcel Kornacker, Michael Brown, Taras Bobrovytsky, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#30). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/CMakeLists.txt M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h A be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_col_stats.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_delimited_text.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_insert_permutation.py M tests/query_test/test_join_queries.py M tests/query_test/test_multiple_filesystems.py M tests/query_test/test_partitioning.py M tests/query_test/test_queries.py M tests/query_test/test_s
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Marcel Kornacker has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 29: Code-Review+2 (1 comment) http://gerrit.cloudera.org:8080/#/c/2574/29/fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java File fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java: Line 180: if (!doRename) doRename = !destIsDfs && sameFileSystem; out of curiosity: what category is s3 in? -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 29 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change.
Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between
filesystems
..
Patch Set 28:
(3 comments)
http://gerrit.cloudera.org:8080/#/c/2574/28/fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
File fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java:
Line 144:* file. The file is moved (renamed) to the new location unless the
source and
> update comment, this isn't quite true anymore (you also don't rename if the
Done
Line 175: (destIsDfs && sameFileSystem) ? sameEncryptionZone :
sameFileSystem;
> i still find it a bit hard to follow when exactly a rename is possible.
Done
Line 194: LOG.info(String.format("Copying '%s' to '%s' between
filesystems.",
> preconditions.checkstate(!sameFileSystem)
Done
--
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 28
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil
Gerrit-Reviewer: Dan Hecht
Gerrit-Reviewer: Henry Robinson
Gerrit-Reviewer: Marcel Kornacker
Gerrit-Reviewer: Michael Brown
Gerrit-Reviewer: Mostafa Mokhtar
Gerrit-Reviewer: Sailesh Mukil
Gerrit-Reviewer: Taras Bobrovytsky
Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Michael Brown, Taras Bobrovytsky, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#29). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/CMakeLists.txt M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h A be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_col_stats.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_delimited_text.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_insert_permutation.py M tests/query_test/test_join_queries.py M tests/query_test/test_multiple_filesystems.py M tests/query_test/test_partitioning.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py M tests
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Marcel Kornacker has posted comments on this change.
Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between
filesystems
..
Patch Set 28:
(3 comments)
http://gerrit.cloudera.org:8080/#/c/2574/28/fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
File fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java:
Line 144:* file. The file is moved (renamed) to the new location unless the
source and
update comment, this isn't quite true anymore (you also don't rename if the
files are in different file systems).
Line 175: (destIsDfs && sameFileSystem) ? sameEncryptionZone :
sameFileSystem;
i still find it a bit hard to follow when exactly a rename is possible.
there seem to be two cases: 1) destIsDfs && sameFileSystem &&
sameEncryptionZone, 2) !destIsDfs && sameFileSystem
if you break this out into explicit if statements, you can add a comment to
each case (e.g., which one is the s3 case).
Line 194: LOG.info(String.format("Copying '%s' to '%s' between
filesystems.",
preconditions.checkstate(!sameFileSystem)
--
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 28
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil
Gerrit-Reviewer: Dan Hecht
Gerrit-Reviewer: Henry Robinson
Gerrit-Reviewer: Marcel Kornacker
Gerrit-Reviewer: Michael Brown
Gerrit-Reviewer: Mostafa Mokhtar
Gerrit-Reviewer: Sailesh Mukil
Gerrit-Reviewer: Taras Bobrovytsky
Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change.
Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between
filesystems
..
Patch Set 27:
(9 comments)
Thanks Marcel.
http://gerrit.cloudera.org:8080/#/c/2574/27/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:
Line 400: // Fully qualified URI with the scheme to the base directory for
this partition.
> what does "with the scheme to the base directory" mean?
it was actually supposed to be "URI with the scheme, to the base directory".
But I forgot to remove "with the scheme" when I added "Fully qualified".
Changed it now.
http://gerrit.cloudera.org:8080/#/c/2574/27/fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java:
Line 153: "INPATH location '%s' cannot contain non-hidden
subdirectories.", sourceDataPath_));
> long line (i know, not your fault)
Done
http://gerrit.cloudera.org:8080/#/c/2574/27/fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
File fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java:
Line 88: private static boolean arePathsInSameEncryptionZone(FileSystem fs,
Path p1,
> arePathsInSameHdfsEncryptionZone?
Done
Line 149:* preserving the the existing file extension. If
renameIfAlreadyExists is false, an
> the the
Done
Line 157: Path destFile = destFs.isDirectory(dest) ? new Path(dest,
sourceFile.getName()) :
> better:
Done
Line 167: boolean isDestDfs = isDistributedFileSystem(destFs);
> destIsDfs
Done
Line 189: LOG.info(String.format(
> precede w/ preconditions.checkstate(!doRename))
Done
Line 193: LOG.info(String.format("Copying '%s' to '%s' between
filesystems.",
> couldn't this be !isDestDfs && sameFileSystem?
No, that would always make doRename = true, and so this code wouldn't get
executed. Do you think I should make the else into an else if() so it's more
readable?
Line 286:* Returns true iff the filesystem is a S3AFileSystem.
> if the path is on an ...
Done
--
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 27
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil
Gerrit-Reviewer: Dan Hecht
Gerrit-Reviewer: Henry Robinson
Gerrit-Reviewer: Marcel Kornacker
Gerrit-Reviewer: Michael Brown
Gerrit-Reviewer: Mostafa Mokhtar
Gerrit-Reviewer: Sailesh Mukil
Gerrit-Reviewer: Taras Bobrovytsky
Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Michael Brown, Taras Bobrovytsky, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#28). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/CMakeLists.txt M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h A be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_col_stats.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_delimited_text.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_insert_permutation.py M tests/query_test/test_join_queries.py M tests/query_test/test_multiple_filesystems.py M tests/query_test/test_partitioning.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py M tests
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Marcel Kornacker has posted comments on this change.
Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between
filesystems
..
Patch Set 27:
(9 comments)
http://gerrit.cloudera.org:8080/#/c/2574/27/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:
Line 400: // Fully qualified URI with the scheme to the base directory for
this partition.
what does "with the scheme to the base directory" mean?
http://gerrit.cloudera.org:8080/#/c/2574/27/fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java:
Line 153: "INPATH location '%s' cannot contain non-hidden
subdirectories.", sourceDataPath_));
long line (i know, not your fault)
http://gerrit.cloudera.org:8080/#/c/2574/27/fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
File fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java:
Line 88: private static boolean arePathsInSameEncryptionZone(FileSystem fs,
Path p1,
arePathsInSameHdfsEncryptionZone?
not a fun name either way
Line 149:* preserving the the existing file extension. If
renameIfAlreadyExists is false, an
the the
Line 157: Path destFile = destFs.isDirectory(dest) ? new Path(dest,
sourceFile.getName()) :
better:
Path = ... \n<4 spaces>? ... : ...
Line 167: boolean isDestDfs = isDistributedFileSystem(destFs);
destIsDfs
Line 189: LOG.info(String.format(
precede w/ preconditions.checkstate(!doRename))
Line 193: LOG.info(String.format("Copying '%s' to '%s' between
filesystems.",
couldn't this be !isDestDfs && sameFileSystem?
Line 286:* Returns true iff the filesystem is a S3AFileSystem.
if the path is on an ...
--
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 27
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil
Gerrit-Reviewer: Dan Hecht
Gerrit-Reviewer: Henry Robinson
Gerrit-Reviewer: Marcel Kornacker
Gerrit-Reviewer: Michael Brown
Gerrit-Reviewer: Mostafa Mokhtar
Gerrit-Reviewer: Sailesh Mukil
Gerrit-Reviewer: Taras Bobrovytsky
Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Dan Hecht has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 27: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 27 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 26: (1 comment) http://gerrit.cloudera.org:8080/#/c/2574/26/be/src/util/hdfs-util.cc File be/src/util/hdfs-util.cc: Line 108: DCHECK > DCHECK_GT like you did below. Done -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 26 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change.
Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between
filesystems
..
Patch Set 24:
(2 comments)
http://gerrit.cloudera.org:8080/#/c/2574/24/be/src/util/hdfs-util-test.cc
File be/src/util/hdfs-util-test.cc:
Line 58: EXPECT_TRUE(FilesystemsMatch("hdfs://namenode/temp_dir/temp_path",
"hdfs://"));
> Should be two non-zero lengths, so still don't see why they compare true.
Oops, my bad. I think I ran the tests without building the test file. That test
case is wrong. I've removed it.
http://gerrit.cloudera.org:8080/#/c/2574/24/tests/metadata/test_ddl.py
File tests/metadata/test_ddl.py:
Line 70: @SkipIfS3.qualified_path
> okay. let's change this to SkipIfS3.jra(3459) or whatever the syntax is the
I looked up what PURGE is supposed to do and it ultimately just calls a "hadoop
fs -rm -skipTrash [filename]". The only extra thing being -skipTrash. That on
an S3 path has no effect, so it's the same as DROP TABLE.
I've changed it to SkipIfS3.hdfs_purge.
--
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 24
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil
Gerrit-Reviewer: Dan Hecht
Gerrit-Reviewer: Henry Robinson
Gerrit-Reviewer: Michael Brown
Gerrit-Reviewer: Mostafa Mokhtar
Gerrit-Reviewer: Sailesh Mukil
Gerrit-Reviewer: Taras Bobrovytsky
Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Michael Brown, Taras Bobrovytsky, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#27). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/CMakeLists.txt M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h A be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_col_stats.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_delimited_text.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_insert_permutation.py M tests/query_test/test_join_queries.py M tests/query_test/test_multiple_filesystems.py M tests/query_test/test_partitioning.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py M tests
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Dan Hecht has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 26: (1 comment) http://gerrit.cloudera.org:8080/#/c/2574/26/be/src/util/hdfs-util.cc File be/src/util/hdfs-util.cc: Line 108: DCHECK DCHECK_GT like you did below. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 26 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Dan Hecht has posted comments on this change.
Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between
filesystems
..
Patch Set 24:
(2 comments)
http://gerrit.cloudera.org:8080/#/c/2574/24/be/src/util/hdfs-util-test.cc
File be/src/util/hdfs-util-test.cc:
Line 58: EXPECT_TRUE(FilesystemsMatch("hdfs://namenode/temp_dir/temp_path",
"hdfs://"));
> This case is special, because for some reason if we get a path with only a
Should be two non-zero lengths, so still don't see why they compare true.
http://gerrit.cloudera.org:8080/#/c/2574/24/tests/metadata/test_ddl.py
File tests/metadata/test_ddl.py:
Line 70: @SkipIfS3.qualified_path
> I'll read up on the semantics of PURGE and try to fix it for S3 or write an
okay. let's change this to SkipIfS3.jra(3459) or whatever the syntax is then,
since it doesn't seem related to the qualified_path jira. Or if this test
doesn't make sense on s3, let's have hdfs_purge or something.
Also, might be worth seeing what DROP TABLE PURGE does on S3.
--
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 24
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil
Gerrit-Reviewer: Dan Hecht
Gerrit-Reviewer: Henry Robinson
Gerrit-Reviewer: Michael Brown
Gerrit-Reviewer: Mostafa Mokhtar
Gerrit-Reviewer: Sailesh Mukil
Gerrit-Reviewer: Taras Bobrovytsky
Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change.
Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between
filesystems
..
Patch Set 24:
(4 comments)
> (4 comments)
>
> +2 for be. Python test changes look fine, if MichaelB/Henry are
> happy with the rest of the python, then let's consider those +2.
>
> So, just need a +2 for fe changes from Alex or Marcel.
http://gerrit.cloudera.org:8080/#/c/2574/24/be/src/util/hdfs-util-test.cc
File be/src/util/hdfs-util-test.cc:
Line 58: EXPECT_TRUE(FilesystemsMatch("hdfs://namenode/temp_dir/temp_path",
"hdfs://"));
> why do these match? does it matter what scheme the default_fs has? i.e. if
This case is special, because for some reason if we get a path with only a
scheme, then we do not have enough information to parse it. We ideally
shouldn't get such a path, but we've used that as a fallback here if we don't
get a defaultFS from the core-site (this should never happen though):
https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/runtime/exec-env.cc#L419
For this test case, it wouldn't matter what scheme the default_fs has, because
both these paths have their own schemes. So only those schemes will be taken
into consideration for matching.
http://gerrit.cloudera.org:8080/#/c/2574/24/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:
Line 105: && fs_b_name_length > 0
> isn't this implied by the previous if-conditional being false? i.e. they sh
Yes, you're right. I've used your format and I've added DCHECKs too.
http://gerrit.cloudera.org:8080/#/c/2574/24/tests/metadata/test_ddl.py
File tests/metadata/test_ddl.py:
Line 70: @SkipIfS3.qualified_path
> you don't have to do it in this change, but can't this test be fixed?
I'll read up on the semantics of PURGE and try to fix it for S3 or write an S3
specific PURGE test as a part of another JIRA. Filed IMPALA-3459.
The only difference I know for now is that S3 doesn't have a ".Trash" folder or
something equivalent, so a DROP TABLE PURGE shouldn't be that different from a
regular DROP TABLE for S3.
http://gerrit.cloudera.org:8080/#/c/2574/24/tests/metadata/test_hdfs_encryption.py
File tests/metadata/test_hdfs_encryption.py:
Line 31: qualified_path
> or should this be hdfs_encryption? qualified_path is really specifically f
Yes hdfs_encryption is the actual reason. I've added a SkipIfS3.hdfs_encryption
and used that here.
--
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 24
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil
Gerrit-Reviewer: Dan Hecht
Gerrit-Reviewer: Henry Robinson
Gerrit-Reviewer: Michael Brown
Gerrit-Reviewer: Mostafa Mokhtar
Gerrit-Reviewer: Sailesh Mukil
Gerrit-Reviewer: Taras Bobrovytsky
Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Michael Brown, Taras Bobrovytsky, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#26). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/CMakeLists.txt M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h A be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_col_stats.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_delimited_text.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_insert_permutation.py M tests/query_test/test_join_queries.py M tests/query_test/test_multiple_filesystems.py M tests/query_test/test_partitioning.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py M tests
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Dan Hecht has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 25: One more thing I forgot to mention, and you should just do this as a separate change: can you remove the grant_revoke_no_insert.test file and use the normal one on S3? -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 25 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 25: > Uploaded patch set 25. Oops, please ignore patch set #25. Will post another patch set instead. Patch set #24 is the latest until then. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 25 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Michael Brown, Taras Bobrovytsky, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#25). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/CMakeLists.txt M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h A be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_col_stats.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_delimited_text.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_insert_permutation.py M tests/query_test/test_join_queries.py M tests/query_test/test_multiple_filesystems.py M tests/query_test/test_partitioning.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py M tests
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Dan Hecht has posted comments on this change.
Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between
filesystems
..
Patch Set 24: Code-Review+1
(4 comments)
+2 for be. Python test changes look fine, if MichaelB/Henry are happy with the
rest of the python, then let's consider those +2.
So, just need a +2 for fe changes from Alex or Marcel.
http://gerrit.cloudera.org:8080/#/c/2574/24/be/src/util/hdfs-util-test.cc
File be/src/util/hdfs-util-test.cc:
Line 58: EXPECT_TRUE(FilesystemsMatch("hdfs://namenode/temp_dir/temp_path",
"hdfs://"));
why do these match? does it matter what scheme the default_fs has? i.e. if
default_fs=s3a://blah, should this still return true? what should happen in
that case?
http://gerrit.cloudera.org:8080/#/c/2574/24/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:
Line 105: && fs_b_name_length > 0
isn't this implied by the previous if-conditional being false? i.e. they should
be DCHECKs. Also, a couple comments would make it easier to read through:
// Neither is fully qualified: both are on default_fs.
if (fs_a_name_length == 0 && fs_b_name_length == 0) return true;
// One is a relative path: check fully-qualified one against default_fs.
if (fs_a_name_length == 0) {
return strncmp(path_b, default_fs, default_fs_name_length) == 0;
}
if (fs_b_name_length == 0) {
return strncmp(path_a, default_fs, default_fs_name_length) == 0;
}
// Both fully qualified: check the filesystem prefix.
if (fs_a_name_length != fs_b_name_length) return false;
return strncmp(path_a, path_b, fs_a_name_length) == 0;
http://gerrit.cloudera.org:8080/#/c/2574/24/tests/metadata/test_ddl.py
File tests/metadata/test_ddl.py:
Line 70: @SkipIfS3.qualified_path
you don't have to do it in this change, but can't this test be fixed?
http://gerrit.cloudera.org:8080/#/c/2574/24/tests/metadata/test_hdfs_encryption.py
File tests/metadata/test_hdfs_encryption.py:
Line 31: qualified_path
or should this be hdfs_encryption? qualified_path is really specifically for
IMPALA-1872; didn't notice that being used here.
--
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 24
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil
Gerrit-Reviewer: Dan Hecht
Gerrit-Reviewer: Henry Robinson
Gerrit-Reviewer: Michael Brown
Gerrit-Reviewer: Mostafa Mokhtar
Gerrit-Reviewer: Sailesh Mukil
Gerrit-Reviewer: Taras Bobrovytsky
Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Dan Hecht has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 23: (1 comment) http://gerrit.cloudera.org:8080/#/c/2574/23/be/src/util/hdfs-util.cc File be/src/util/hdfs-util.cc: Line 94: after_authority - path > hdfs://n/ should return '8': Oops, right. Thanks for the test case, though. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 23 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 23: (5 comments) http://gerrit.cloudera.org:8080/#/c/2574/23/be/src/util/hdfs-util.cc File be/src/util/hdfs-util.cc: Line 81: This is a helper function and is not declared in the header file. It returns > Returns Done Line 88: file > Let's check for "file:" since there could conceivable be a scheme called fi Done Line 88: file > And please add a test case for this (e.g. file:blah vs file2:blah should be Done Line 90: after_scheme + 3 > this is missing the dereference. please add a test case to your unit test Done. Line 94: after_authority - path > is this off by one? i.e. hdfs://n/ should return 7, right? hdfs://n/ should return '8': 4 for 'hdfs', 3 for '://' and 1 for 'n'. (4 + 3 + 1) At this point after_authority would point to the '/' after 'n'. I've added a test case for hdfs://namenode, hdfs://namenode/ and it works as expected, i.e. they match. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 23 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Michael Brown, Taras Bobrovytsky, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#24). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/CMakeLists.txt M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h A be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_col_stats.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_delimited_text.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_insert_permutation.py M tests/query_test/test_join_queries.py M tests/query_test/test_multiple_filesystems.py M tests/query_test/test_partitioning.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py M tests/shell/test
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Dan Hecht has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 23: (1 comment) http://gerrit.cloudera.org:8080/#/c/2574/23/be/src/util/hdfs-util.cc File be/src/util/hdfs-util.cc: Line 88: file > Let's check for "file:" since there could conceivable be a scheme called fi And please add a test case for this (e.g. file:blah vs file2:blah should be false). -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 23 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Dan Hecht has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 23: (4 comments) http://gerrit.cloudera.org:8080/#/c/2574/23/be/src/util/hdfs-util.cc File be/src/util/hdfs-util.cc: Line 81: This is a helper function and is not declared in the header file. It returns Returns (the first sentence is obvious). Line 88: file Let's check for "file:" since there could conceivable be a scheme called fileotherstuff, and change both 4 to 5s. Also, this case should come first, so that "file:path" works okay. And then I think after_scheme should be at occurrence of "://" since we're assuming the second slash below (3 characters). Line 90: after_scheme + 3 this is missing the dereference. please add a test case to your unit test that catches this bug. Line 94: after_authority - path is this off by one? i.e. hdfs://n/ should return 7, right? It think this test case would catch the bug: hdfs://namenode hdfs://namenode/ should compare equal but i think FilesystemMatch() currently false in that case? -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 23 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 22: (7 comments) http://gerrit.cloudera.org:8080/#/c/2574/22/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: Line 301: // the file is not accessible until it's closed. > I guess we should clarify this comment further because of the fact that eve Done Line 312: > nit: remove blank line - this is all one logical block of code. Done http://gerrit.cloudera.org:8080/#/c/2574/22/be/src/exec/hdfs-table-sink.h File be/src/exec/hdfs-table-sink.h: Line 93: get_block_size > actually, since this is a struct, we can just access block_size. We're alre Done http://gerrit.cloudera.org:8080/#/c/2574/22/be/src/util/hdfs-util-test.cc File be/src/util/hdfs-util-test.cc: Line 60: "temp_dir/temp_path")); > let's add tests for the file scheme case. see other comments for why that's Done http://gerrit.cloudera.org:8080/#/c/2574/22/be/src/util/hdfs-util.cc File be/src/util/hdfs-util.cc: Line 81: getNamenodeLength > static int GetNamenodeLength ... and add a quick function header comment he You're right. I can see how namenode could sound confusing. I changed it to GetFilesystemNameLength() and added a comment above the function. Please let me know if you think the name should be changed. Line 86: skip_delimiter > would it make sense to call this delimiter_len? actually, since you use po That makes sense. I got rid of the delimiter_len and added a special case for "file:". Line 91: if (after_authority == NULL) return strlen(path); > I think for file scheme, we shouldn't treat the next thing as authority, ri Yes that's right. I missed that. I've added a case for that. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 22 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Michael Brown, Taras Bobrovytsky, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#23). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/CMakeLists.txt M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h A be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_col_stats.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_delimited_text.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_insert_permutation.py M tests/query_test/test_join_queries.py M tests/query_test/test_multiple_filesystems.py M tests/query_test/test_partitioning.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py M tests/shell/test
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Dan Hecht has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 22: (7 comments) http://gerrit.cloudera.org:8080/#/c/2574/22/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: Line 301: // the file is not accessible until it's closed. I guess we should clarify this comment further because of the fact that even if the file were closed, it would just return the default block size. How about: On S3A, the file cannot be stat'ed until after it's closed, and even so, the block size reported will be just the filesystem default. So, remember the requested block size. Line 312: nit: remove blank line - this is all one logical block of code. http://gerrit.cloudera.org:8080/#/c/2574/22/be/src/exec/hdfs-table-sink.h File be/src/exec/hdfs-table-sink.h: Line 93: get_block_size actually, since this is a struct, we can just access block_size. We're already doing that for all the other fields. http://gerrit.cloudera.org:8080/#/c/2574/22/be/src/util/hdfs-util-test.cc File be/src/util/hdfs-util-test.cc: Line 60: "temp_dir/temp_path")); let's add tests for the file scheme case. see other comments for why that's special. http://gerrit.cloudera.org:8080/#/c/2574/22/be/src/util/hdfs-util.cc File be/src/util/hdfs-util.cc: Line 81: getNamenodeLength static int GetNamenodeLength ... and add a quick function header comment here since this isn't declared in a header. Also, this returns the scheme://authority right? Is that considered the "namenode" or is there a better term for that (I thought namenode was effectively a HDFS way of saying authority, but I could be wrong)? Line 86: skip_delimiter would it make sense to call this delimiter_len? actually, since you use pointers for the other two markers, it might be clearer to do that here to: // Scheme after_delimiter = after_scheme + 2; if (after_delimiter[0] == '/') after_delimiter++; But really, I think if you see :/ rarther than ://, then the next thing isn't the authority. But this should happen only for file (where the / is really part of the path). and I think we need to special case 'file' scheme anyway (see below), so we can probably special case that and just make this code skip over :// Line 91: if (after_authority == NULL) return strlen(path); I think for file scheme, we shouldn't treat the next thing as authority, right? i.e. file:/dir1, file://dir2, file:///dir3 are all the same filesystem. I'm pretty sure the Hadoop java filesystem resolution code treats these as all the same. I think we should just return 4 (i.e. "file") when scheme is file. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 22 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 21: (9 comments) http://gerrit.cloudera.org:8080/#/c/2574/21/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: Line 652: block_size > On some HDFS implementations I think it's possible for HDFS to override the I went for the second option. Done. http://gerrit.cloudera.org:8080/#/c/2574/21/be/src/exec/hdfs-table-sink.h File be/src/exec/hdfs-table-sink.h: Line 88: The block size decided on for this file. > The block size requested when creating the file. (unless you go with the a I went for the alternate suggestion, so leaving this as it is. http://gerrit.cloudera.org:8080/#/c/2574/21/be/src/util/hdfs-util-test.cc File be/src/util/hdfs-util-test.cc: Line 44: "hdfs://namenode_2/temp_dir2/temp_path_2")); > what about the 'port' cases? Done http://gerrit.cloudera.org:8080/#/c/2574/21/be/src/util/hdfs-util.cc File be/src/util/hdfs-util.cc: Line 89: after_default_fs_namenode > why is this called "after". Isn't this going to point at the default FS nam I've added a helper function and all these comments are addressed in that. Line 90: after_default_fs_namenode += 3; > 3? i'm not sure that works all the time. i think file:/path/to/my/file is l Done Line 105: "://". > see above. Done Line 108: after_authority_a > isn't this analogous to after_default_fs_namenode but for path 'a'? If so, I've removed this. Line 112: ( > extraneous parenthesis I've removed this. Line 115: > it seems like this code is parsing three URIs in the same way. It might be Done -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 21 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Michael Brown, Taras Bobrovytsky, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#22). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/CMakeLists.txt M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h A be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_col_stats.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_delimited_text.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_insert_permutation.py M tests/query_test/test_join_queries.py M tests/query_test/test_multiple_filesystems.py M tests/query_test/test_partitioning.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py M tests/shell/test
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Dan Hecht has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 21: (9 comments) http://gerrit.cloudera.org:8080/#/c/2574/21/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: Line 652: block_size On some HDFS implementations I think it's possible for HDFS to override the requested block size, and so this block_size and the one in mBlockSize might not match. So, how about we call this field "requested_block_size" to indicate that this is the block size that was requested, not the true block size of the file? Alternatively, we could move this function's code to CreateNewTmpFile() and then save away the true block size for hdfs. And then just make a OutputPartition::block_size() accessor that the parquet writer could use. That might be simplest, but would add a stat() call on non-parquet paths, but that doesn't seem like it should matter. http://gerrit.cloudera.org:8080/#/c/2574/21/be/src/exec/hdfs-table-sink.h File be/src/exec/hdfs-table-sink.h: Line 88: The block size decided on for this file. The block size requested when creating the file. (unless you go with the alternate suggestion). http://gerrit.cloudera.org:8080/#/c/2574/21/be/src/util/hdfs-util-test.cc File be/src/util/hdfs-util-test.cc: Line 44: "hdfs://namenode_2/temp_dir2/temp_path_2")); what about the 'port' cases? http://gerrit.cloudera.org:8080/#/c/2574/21/be/src/util/hdfs-util.cc File be/src/util/hdfs-util.cc: Line 89: after_default_fs_namenode why is this called "after". Isn't this going to point at the default FS namenode? Line 90: after_default_fs_namenode += 3; 3? i'm not sure that works all the time. i think file:/path/to/my/file is legal... Line 105: "://". see above. Line 108: after_authority_a isn't this analogous to after_default_fs_namenode but for path 'a'? If so, they should be named consistently (i.e. namenode vs authority). Line 112: ( extraneous parenthesis Line 115: it seems like this code is parsing three URIs in the same way. It might be clearer to have a helper subroutine that breaks a URI down into it's components that you can call three times. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 21 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Michael Brown has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 21: Code-Review+1 +1 for the python -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 21 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 20: (4 comments) Thanks Michael. http://gerrit.cloudera.org:8080/#/c/2574/20/tests/query_test/test_insert_parquet.py File tests/query_test/test_insert_parquet.py: Line 138: assert size < BLOCK_SIZE > Not your change, but can you add a meaningful error message for this assert Done Line 139: print size > Not your change, but random prints should be removed. We should use the log Done http://gerrit.cloudera.org:8080/#/c/2574/20/tests/util/filesystem_base.py File tests/util/filesystem_base.py: Line 58: def get_all_file_sizes(self, path): > As we discussed please make sure both FS-specific implementations return li Both return lists of integers. I added a comment to state that. http://gerrit.cloudera.org:8080/#/c/2574/20/tests/util/s3_util.py File tests/util/s3_util.py: Line 78: if 'Contents' in response: : return [t['Size'] for t in response['Contents']] > It seems like we should handle the case where 'Contents' is not in response I think it's just best to have the same behavior as hdfs_util here. Which is to return an empty list if there are no files found. If list_objects() itself fails, it will throw an exception. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 20 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Taras Bobrovytsky, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#21). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/CMakeLists.txt M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h A be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_col_stats.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_delimited_text.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_insert_permutation.py M tests/query_test/test_join_queries.py M tests/query_test/test_multiple_filesystems.py M tests/query_test/test_partitioning.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py M tests/shell/test_shell_commandline.py A tests/util/filesystem_base.py M te
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Michael Brown has posted comments on this change.
Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between
filesystems
..
Patch Set 20:
(7 comments)
For the recent tagged Python changes you asked me to look at, I replied with
'Done' to your helpful markers and left comments in relevant places.
http://gerrit.cloudera.org:8080/#/c/2574/20/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:
Line 136: sizes = self.filesystem_client.get_all_file_sizes(DIR)
> Change made as part of fix for IMPALA-3245.
Done
Line 138: assert size < BLOCK_SIZE
Not your change, but can you add a meaningful error message for this assert if
it were to fail?
You can do it like:
assert size < BLOCK_SIZE, "wrong size {0} [etc]".format(size)
Line 139: print size
Not your change, but random prints should be removed. We should use the logging
facility if we need to print this.
http://gerrit.cloudera.org:8080/#/c/2574/20/tests/util/filesystem_base.py
File tests/util/filesystem_base.py:
Line 58: def get_all_file_sizes(self, path):
> Added for IMPALA-3245.
As we discussed please make sure both FS-specific implementations return lists
of numbers, not lists of strings, since I don't see you specifically changing
the type.
http://gerrit.cloudera.org:8080/#/c/2574/20/tests/util/hdfs_util.py
File tests/util/hdfs_util.py:
Line 79: def get_all_file_sizes(self, path):
> Added for IMPALA-3245.
Done
http://gerrit.cloudera.org:8080/#/c/2574/20/tests/util/s3_util.py
File tests/util/s3_util.py:
Line 72: def get_all_file_sizes(self, path):
> Added for IMPALA-3245.
Done
Line 78: if 'Contents' in response:
: return [t['Size'] for t in response['Contents']]
It seems like we should handle the case where 'Contents' is not in response. If
it's only missing due to some bad 'path' value, or some S3 error, it will still
help with test failures to know that. As it is the method will just return
None. If it returns None, something higher in the stack trying to use the value
will produce a more obscure exception.
My suggestion is to raise an exception here and include path and response (if
not large) in the exception string.
--
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 20
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil
Gerrit-Reviewer: Dan Hecht
Gerrit-Reviewer: Henry Robinson
Gerrit-Reviewer: Michael Brown
Gerrit-Reviewer: Mostafa Mokhtar
Gerrit-Reviewer: Sailesh Mukil
Gerrit-Reviewer: Taras Bobrovytsky
Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 20: > (7 comments) > > I've added placeholders for the changes I've made for the fix to > IMPALA-3245, so that it's easy to review them. Whoops, the last round of changes are for IMPALA-3425, not IMPALA-3245. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 20 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 20: (7 comments) I've added placeholders for the changes I've made for the fix to IMPALA-3245, so that it's easy to review them. http://gerrit.cloudera.org:8080/#/c/2574/20/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: Line 299: output_partition->block_size = block_size; Fix for IMPALA-3425. Previously PARQUET_FILE_SIZE was ignored because GetFileBlockSize() always returned 32MB (the default 'block' size in S3) for S3 files. We now store the block size so that the GetFileBlockSize() returns the correct size which is either the set by the user, or the default from core-site.xml. Line 652: *size = output_partition->block_size; Added as part of fix for IMPALA-3245. http://gerrit.cloudera.org:8080/#/c/2574/20/be/src/exec/hdfs-table-sink.h File be/src/exec/hdfs-table-sink.h: Line 89: uint64_t block_size; Added as part of fix for IMPALA-3245. http://gerrit.cloudera.org:8080/#/c/2574/20/tests/query_test/test_insert_parquet.py File tests/query_test/test_insert_parquet.py: Line 136: sizes = self.filesystem_client.get_all_file_sizes(DIR) Change made as part of fix for IMPALA-3245. http://gerrit.cloudera.org:8080/#/c/2574/20/tests/util/filesystem_base.py File tests/util/filesystem_base.py: Line 58: def get_all_file_sizes(self, path): Added for IMPALA-3245. http://gerrit.cloudera.org:8080/#/c/2574/20/tests/util/hdfs_util.py File tests/util/hdfs_util.py: Line 79: def get_all_file_sizes(self, path): Added for IMPALA-3245. http://gerrit.cloudera.org:8080/#/c/2574/20/tests/util/s3_util.py File tests/util/s3_util.py: Line 72: def get_all_file_sizes(self, path): Added for IMPALA-3245. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 20 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Taras Bobrovytsky, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#20). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/CMakeLists.txt M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h A be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_col_stats.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_delimited_text.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_insert_permutation.py M tests/query_test/test_join_queries.py M tests/query_test/test_multiple_filesystems.py M tests/query_test/test_partitioning.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py M tests/shell/test_shell_commandline.py A tests/util/filesystem_base.py M te
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Taras Bobrovytsky, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#19). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/CMakeLists.txt M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h A be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_col_stats.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_delimited_text.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_insert_permutation.py M tests/query_test/test_join_queries.py M tests/query_test/test_multiple_filesystems.py M tests/query_test/test_partitioning.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py M tests/shell/test_shell_commandline.py A tests/util/filesystem_base.py M te
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change.
Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between
filesystems
..
Patch Set 18:
(1 comment)
http://gerrit.cloudera.org:8080/#/c/2574/18/fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java
File fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java:
Line 52: String fsNameStr = System.getenv("DEFAULT_FS");
Just FYI, had to make this change after rebasing on Anuj's patch.
--
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 18
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil
Gerrit-Reviewer: Dan Hecht
Gerrit-Reviewer: Henry Robinson
Gerrit-Reviewer: Michael Brown
Gerrit-Reviewer: Mostafa Mokhtar
Gerrit-Reviewer: Sailesh Mukil
Gerrit-Reviewer: Taras Bobrovytsky
Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Taras Bobrovytsky, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#18). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/CMakeLists.txt M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h A be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_join_queries.py M tests/query_test/test_partitioning.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py A tests/util/filesystem_base.py M tests/util/filesystem_utils.py M tests/util/hdfs_util.py A tests/util/s3_util.py 60 files changed, 730 insertions(+), 407 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/18 --
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 17: Added backend tests "hdfs-util-test" to test FilesystemsMatch() -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 17 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Taras Bobrovytsky, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#17). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/CMakeLists.txt M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h A be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_join_queries.py M tests/query_test/test_partitioning.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py A tests/util/filesystem_base.py M tests/util/filesystem_utils.py M tests/util/hdfs_util.py A tests/util/s3_util.py 59 files changed, 729 insertions(+), 406 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/17 -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscri
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Taras Bobrovytsky, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#16). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_join_queries.py M tests/query_test/test_partitioning.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py A tests/util/filesystem_base.py M tests/util/filesystem_utils.py M tests/util/hdfs_util.py A tests/util/s3_util.py 57 files changed, 666 insertions(+), 406 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/16 -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Me
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Taras Bobrovytsky, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#15). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_join_queries.py M tests/query_test/test_partitioning.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py A tests/util/filesystem_base.py M tests/util/filesystem_utils.py M tests/util/hdfs_util.py A tests/util/s3_util.py 57 files changed, 666 insertions(+), 406 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/15 -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Me
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 14: (15 comments) http://gerrit.cloudera.org:8080/#/c/2574/14/be/src/util/hdfs-bulk-ops.h File be/src/util/hdfs-bulk-ops.h: Line 64: /// Records an error if it happens during an operation. : void AddError(const string& error_msg) const; > Does this need to be public? No, I've made it private. Line 95: HdfsOperationSet(HdfsFsCache::HdfsFsMap* connection_cache); > Re-instate some comment about ownership / usage of connection_cache Done http://gerrit.cloudera.org:8080/#/c/2574/14/be/src/util/hdfs-util.cc File be/src/util/hdfs-util.cc: Line 81: FilesystemsMatch > can you add some unit tests for this method? Trying to add tests for this is tricky. I will do it and upload a patch set, but meanwhile I don't want to hold up this patch set. Line 93: return true; > make one line Done Line 94: else > no need for 'else' if every block returns Done Line 95: strncmp > return strncmp() (and below) Done Line 101: // By this point we know that both the paths have schemes. > That may be true, but comment that the next step is to compare authorities. Done Line 102: += 3 > need to comment what this magic constant is doing. Done Line 107: namenode_a_len > for once I think a ternary would be clearer here: Done http://gerrit.cloudera.org:8080/#/c/2574/14/fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java File fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java: Line 90: if (!isDistributedFileSystem(p1) || !isDistributedFileSystem(p2)) return false; > Doesn't this method assume that p1 and p2 are on the same filesystem? What' No, p1 and p2 can be on different filesystems, we just check if they are on HDFS because arePathsInSameEncryptionZone() is a HDFS specific function. I've added another comment to clarify that. Line 168: Only distributed file systems have different encryption zones. > I think this comment (or something like it) belongs in arePathsInSameEncypt Done Line 168: If the source and : // destination are on different file systems, we need to use FileUtil.copy() to copy : // between filesystems. > "are on different file systems, or in different encryption zones, files can Done Line 173: sameEncryptionZone : :sameFileSystem; > for formatting this, prefer line breaking after the '=' and putting the who Done Line 277: Returns true iff the filesystem is a DistributedFileSystem. > Comment is wrong Done Line 284: Returns true iff the filesystem is a DistributedFileSystem. > Comment is wrong. Done -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 14 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Henry Robinson has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 14: (15 comments) http://gerrit.cloudera.org:8080/#/c/2574/14/be/src/util/hdfs-bulk-ops.h File be/src/util/hdfs-bulk-ops.h: Line 64: /// Records an error if it happens during an operation. : void AddError(const string& error_msg) const; Does this need to be public? Line 95: HdfsOperationSet(HdfsFsCache::HdfsFsMap* connection_cache); Re-instate some comment about ownership / usage of connection_cache http://gerrit.cloudera.org:8080/#/c/2574/14/be/src/util/hdfs-util.cc File be/src/util/hdfs-util.cc: Line 81: FilesystemsMatch can you add some unit tests for this method? Line 93: return true; make one line Line 94: else no need for 'else' if every block returns Line 95: strncmp return strncmp() (and below) Line 101: // By this point we know that both the paths have schemes. That may be true, but comment that the next step is to compare authorities. Line 102: += 3 need to comment what this magic constant is doing. Line 107: namenode_a_len for once I think a ternary would be clearer here: int namnode_a_len = (after_authority_a == NULL) ? strlen(path_a) : (after_authority_a - path_a) http://gerrit.cloudera.org:8080/#/c/2574/14/fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java File fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java: Line 90: if (!isDistributedFileSystem(p1) || !isDistributedFileSystem(p2)) return false; Doesn't this method assume that p1 and p2 are on the same filesystem? What's the purpose of checking both paths here? Line 168: Only distributed file systems have different encryption zones. I think this comment (or something like it) belongs in arePathsInSameEncyptionZone() Line 168: If the source and : // destination are on different file systems, we need to use FileUtil.copy() to copy : // between filesystems. "are on different file systems, or in different encryption zones, files can't be moved from one location to the other and must be copied instead." Line 173: sameEncryptionZone : :sameFileSystem; for formatting this, prefer line breaking after the '=' and putting the whole condition on one line where possible. Line 277: Returns true iff the filesystem is a DistributedFileSystem. Comment is wrong Line 284: Returns true iff the filesystem is a DistributedFileSystem. Comment is wrong. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 14 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 13: (4 comments) I've rebased on Anuj's patch and addressed the comments. http://gerrit.cloudera.org:8080/#/c/2574/13/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: Line 649: // that file, it still does not "exist" on S3 and hdfsGetPathInfo() will return an error > this comment is kind of confusing because it explains why running this code Done http://gerrit.cloudera.org:8080/#/c/2574/13/be/src/util/hdfs-bulk-ops.cc File be/src/util/hdfs-bulk-ops.cc: Line 154: connection_cache_ = connection_cache; > if we're going to stash this as a member variable, why not just pass it dur Done http://gerrit.cloudera.org:8080/#/c/2574/13/be/src/util/hdfs-util.cc File be/src/util/hdfs-util.cc: Line 91: return strncmp(path_a, path_b, scheme_a_len) == 0; > it needs to also check the namenode (or bucket, etc). i.e.: Yup, I've rebased on Anuj's patch and added checks to accommodate for the authority too. http://gerrit.cloudera.org:8080/#/c/2574/13/common/thrift/ImpalaInternalService.thrift File common/thrift/ImpalaInternalService.thrift: Line 400: URI with the scheme > I think a more standard way to say it is: "Fully qualified URI". This is the same as THdfsTable.hdfsBaseDir. That should be fully qualified. I'll confirm that by asking someone tomorrow. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 13 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Taras Bobrovytsky, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#14). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_join_queries.py M tests/query_test/test_partitioning.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py A tests/util/filesystem_base.py M tests/util/filesystem_utils.py M tests/util/hdfs_util.py A tests/util/s3_util.py 57 files changed, 668 insertions(+), 406 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/14 -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Me
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change.
Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between
filesystems
..
Patch Set 11:
(3 comments)
http://gerrit.cloudera.org:8080/#/c/2574/11/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:
Line 774: InsertStmt
> Nit: Don't refer to a Java-side class here. Just say "INSERT allow writes..
Done
Line 843: mimick
> 'mimic'
Done
http://gerrit.cloudera.org:8080/#/c/2574/11/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:
Line 76: bool FilesystemsMatch(const char* path_a, const char* path_b) {
: const char* after_scheme_a = strstr(path_a, ":/");
: const char* after_scheme_b = strstr(path_b, ":/");
:
: bool path_a_has_scheme = after_scheme_a != NULL;
: bool path_b_has_scheme = after_scheme_b != NULL;
: // Currently, our defaultFS is HDFS. If there is no scheme present,
then that path is a
: // HDFS path.
: if (!path_a_has_scheme) return IsHdfsPath(path_b);
: if (!path_b_has_scheme) return IsHdfsPath(path_a);
:
: // By this point we know that both the paths have schemes.
: bool scheme_a_len = after_scheme_a - path_a;
: bool scheme_b_len = after_scheme_b - path_b;
: if (scheme_a_len != scheme_b_len) return false;
: return strncmp(path_a, path_b, scheme_a_len) == 0;
: }
> Just realised: this won't work if there are two different HDFS filesystems
Done
--
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 11
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil
Gerrit-Reviewer: Dan Hecht
Gerrit-Reviewer: Henry Robinson
Gerrit-Reviewer: Michael Brown
Gerrit-Reviewer: Mostafa Mokhtar
Gerrit-Reviewer: Sailesh Mukil
Gerrit-Reviewer: Taras Bobrovytsky
Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Henry Robinson has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 11: Sailesh: ping - let's get this in asap. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 11 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Henry Robinson has posted comments on this change.
Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between
filesystems
..
Patch Set 11:
(3 comments)
http://gerrit.cloudera.org:8080/#/c/2574/11/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:
Line 774: InsertStmt
Nit: Don't refer to a Java-side class here. Just say "INSERT allow writes..."
Line 843: mimick
'mimic'
http://gerrit.cloudera.org:8080/#/c/2574/11/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:
Line 76: bool FilesystemsMatch(const char* path_a, const char* path_b) {
: const char* after_scheme_a = strstr(path_a, ":/");
: const char* after_scheme_b = strstr(path_b, ":/");
:
: bool path_a_has_scheme = after_scheme_a != NULL;
: bool path_b_has_scheme = after_scheme_b != NULL;
: // Currently, our defaultFS is HDFS. If there is no scheme present,
then that path is a
: // HDFS path.
: if (!path_a_has_scheme) return IsHdfsPath(path_b);
: if (!path_b_has_scheme) return IsHdfsPath(path_a);
:
: // By this point we know that both the paths have schemes.
: bool scheme_a_len = after_scheme_a - path_a;
: bool scheme_b_len = after_scheme_b - path_b;
: if (scheme_a_len != scheme_b_len) return false;
: return strncmp(path_a, path_b, scheme_a_len) == 0;
: }
Just realised: this won't work if there are two different HDFS filesystems
(because you don't check the authority).
That's ok, because multiple HDFS filesystems aren't in scope to support - but
if it's easy to do the authority check, might as well do so.
--
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 11
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil
Gerrit-Reviewer: Dan Hecht
Gerrit-Reviewer: Henry Robinson
Gerrit-Reviewer: Michael Brown
Gerrit-Reviewer: Mostafa Mokhtar
Gerrit-Reviewer: Sailesh Mukil
Gerrit-Reviewer: Taras Bobrovytsky
Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Dan Hecht has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 13: (4 comments) http://gerrit.cloudera.org:8080/#/c/2574/13/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: Line 649: // that file, it still does not "exist" on S3 and hdfsGetPathInfo() will return an error this comment is kind of confusing because it explains why running this code is okay first, and not until later does it explain why the special case is needed at all. How about rephrasing as below (note however, that it's possible that one day S3 will have per-file block sizes, and even variable block sizes): For S3A, when creating files, hdfsGetPathInfo() cannot be used until the file is closed. Since S3A is really an object store, all files have the same "block" size, which is the same as the default block size. http://gerrit.cloudera.org:8080/#/c/2574/13/be/src/util/hdfs-bulk-ops.cc File be/src/util/hdfs-bulk-ops.cc: Line 154: connection_cache_ = connection_cache; if we're going to stash this as a member variable, why not just pass it during construction? does it change between Execute() calls? http://gerrit.cloudera.org:8080/#/c/2574/13/be/src/util/hdfs-util.cc File be/src/util/hdfs-util.cc: Line 91: return strncmp(path_a, path_b, scheme_a_len) == 0; it needs to also check the namenode (or bucket, etc). i.e.: hdfs://namenode0:8020/my/file hdfs://namenode1:8020/other/file hdfs://namenode0:8021/another/file are all different filesystems. And you won't be able to do that without knowing the defaultfs (i.e. Anuj's patch). http://gerrit.cloudera.org:8080/#/c/2574/13/common/thrift/ImpalaInternalService.thrift File common/thrift/ImpalaInternalService.thrift: Line 400: URI with the scheme I think a more standard way to say it is: "Fully qualified URI". But how do we guarantee that this is actually fully qualified? -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 13 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Henry Robinson has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 11: Dan - could you review this patch please? -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 11 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Taras Bobrovytsky, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#13). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_join_queries.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py A tests/util/filesystem_base.py M tests/util/filesystem_utils.py M tests/util/hdfs_util.py A tests/util/s3_util.py 56 files changed, 645 insertions(+), 405 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/13 -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id:
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 11: (7 comments) That sounds good to me. But just FYI, Anuj is on PTO today and tomorrow. Which means we will need to wait till Monday till we can get this in. http://gerrit.cloudera.org:8080/#/c/2574/11/testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test File testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test: Line 4: '$FILESYSTEM_PREFIX/test-warehouse/insert_premutation_test' > typo in the location. Good catch. Yes, we'll no longer need to explicitly specify FILESYSTEM_PREFIX after the defaultFS patch. http://gerrit.cloudera.org:8080/#/c/2574/11/testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test File testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test: Line 33 > Are we losing coverage by not having a test that tries to INSERT into a non We are losing out on testing the path that returns an error if we use an unsupported filesystem during an INSERT. However, we don't have a filesystem now that we can CREATE a table in but not INSERT. http://gerrit.cloudera.org:8080/#/c/2574/11/tests/common/impala_test_suite.py File tests/common/impala_test_suite.py: Line 120: # Filesystem client acts as a generic client to supported filesystems. > Comment is a bit redundant. Done http://gerrit.cloudera.org:8080/#/c/2574/11/tests/metadata/test_ddl.py File tests/metadata/test_ddl.py: Line 70: @SkipIfS3.qualified_path > FWIW, I just filed https://issues.cloudera.org/browse/IMPALA-3313 to deal w Yes, that sounds like a much better opt-in/out process than what we have now. http://gerrit.cloudera.org:8080/#/c/2574/11/tests/query_test/test_cancellation.py File tests/query_test/test_cancellation.py: Line 160: @SkipIfS3.jira > What's the JIRA? It hits IMPALA-551 sometimes. But I just realized that I don't need to skip it for that. It gets xfailed automatically. http://gerrit.cloudera.org:8080/#/c/2574/11/tests/util/filesystem_base.py File tests/util/filesystem_base.py: Line 1: > Remove blank line Done http://gerrit.cloudera.org:8080/#/c/2574/11/tests/util/filesystem_utils.py File tests/util/filesystem_utils.py: Line 32: speicfic > specific Done -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 11 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 10: (1 comment) http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/util/hdfs-bulk-ops.h File be/src/util/hdfs-bulk-ops.h: Line 130: Promise promise_; > I was thinking that the caller of Wait() would just check the error status Yea that makes sense. I've made a change to use the CountingBarrier and checked the error status after Wait(). -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 10 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Taras Bobrovytsky, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#12). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_join_queries.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py A tests/util/filesystem_base.py M tests/util/filesystem_utils.py M tests/util/hdfs_util.py A tests/util/s3_util.py 56 files changed, 645 insertions(+), 405 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/12 -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id:
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Henry Robinson has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 11: (7 comments) Looks pretty good to me! I think we should let the default FS patch get in first, and then do any follow-on work required for that here. Does that sound ok with you? http://gerrit.cloudera.org:8080/#/c/2574/11/testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test File testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test: Line 4: '$FILESYSTEM_PREFIX/test-warehouse/insert_premutation_test' typo in the location. Presumably this requirement to specify the location goes away if S3 is the default FS? http://gerrit.cloudera.org:8080/#/c/2574/11/testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test File testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test: Line 33 Are we losing coverage by not having a test that tries to INSERT into a non-insertable filesystem? http://gerrit.cloudera.org:8080/#/c/2574/11/tests/common/impala_test_suite.py File tests/common/impala_test_suite.py: Line 120: # Filesystem client acts as a generic client to supported filesystems. Comment is a bit redundant. http://gerrit.cloudera.org:8080/#/c/2574/11/tests/metadata/test_ddl.py File tests/metadata/test_ddl.py: Line 70: @SkipIfS3.qualified_path FWIW, I just filed https://issues.cloudera.org/browse/IMPALA-3313 to deal with having to add multiple decorators to each test to manually select them in or out for different filesystems. http://gerrit.cloudera.org:8080/#/c/2574/11/tests/query_test/test_cancellation.py File tests/query_test/test_cancellation.py: Line 160: @SkipIfS3.jira What's the JIRA? http://gerrit.cloudera.org:8080/#/c/2574/11/tests/util/filesystem_base.py File tests/util/filesystem_base.py: Line 1: Remove blank line http://gerrit.cloudera.org:8080/#/c/2574/11/tests/util/filesystem_utils.py File tests/util/filesystem_utils.py: Line 32: speicfic specific -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 11 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Henry Robinson has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 10: (1 comment) http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/util/hdfs-bulk-ops.h File be/src/util/hdfs-bulk-ops.h: Line 130: Promise promise_; > I think the reason a promise was used here instead of a counting barrier wa I was thinking that the caller of Wait() would just check the error status after Wait() returned. Using Promise<> for this is fine if there's a need to create an actual channel (maybe producer and consumer don't share any other state than the promise), but in this case I don't think that's right. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 10 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 10: (12 comments) http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: Line 839: There is no directory structure in S3, so "inheriting" permissions is not : // possible. > S3 has per-object permissions (in the form of ACLs see http://docs.aws.amaz That is possible, but since there is no directory structure, we don't have anywhere to inherit permissions from. An alternative way of doing it would be to save the partition's default permissions in a hidden file or something similar (just basically someplace to hold the state), and every time we insert, we "inherit" permissions from this file. Or maybe just stat an existing object and copy over those permissions (this could have other complications though). I'll leave this as a TODO for now and address it in a following patch. Line 875: stringstream ss; : ss << "Error(s) deleting partition directories. First error (of " : << partition_create_ops.errors().size() << ") was: " << err.second; > While you are here, could you clean up he use of stringstream in favour of Done http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/util/hdfs-bulk-ops.cc File be/src/util/hdfs-bulk-ops.cc: Line 111: .c_str() > not really your change, but I think you can get rid of .c_str() in all the Done Line 128: string error_msg = connection_status.ok() ? GetStrErrMsg() : > put the line break here, and put the whole ternary statement on the next li Done Line 156: if (connection_cache != NULL) connection_cache_ = connection_cache; > You should do this even if connection_cache is NULL - the caller might be t Done http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/util/hdfs-bulk-ops.h File be/src/util/hdfs-bulk-ops.h: Line 94: /// Constructs a new operation set. > comment is now redundant, can be removed. Done Line 109: /// If abort_on_error is true, execution will finish after the first error seen. > update comment to explain connection_cache Done Line 121: return connection_cache_; > one line Done Line 130: Promise promise_; > Can you replace this with a CountingBarrier? I think that will be more natu I think the reason a promise was used here instead of a counting barrier was because a CountingBarrier cannot return an error. And we need the promise to notify us of an error if it does happen. I can modify CountingBarrier to maintain its current behavior and optionally return an error if necessary. Should I do that instead? http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/util/hdfs-util.cc File be/src/util/hdfs-util.cc: Line 77: // TODO: 's3a' is the smallest scheme that our filesystem supports, so we compare the : // first 6 characters. Change if we support a filesystem with a smaller scheme. > Comment is out-of-date Done Line 85: // HDFS path. > this isn't a valid assumption after Anuj's change. Once Anuj pushes his change, I'll rebase and submit the final version. (Or he can rebase if I get this patch in first) http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/util/hdfs-util.h File be/src/util/hdfs-util.h: Line 45: IsDfsPath > I think this should be IsHdfsPath(...), because other filesystems technical Done -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 10 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Taras Bobrovytsky, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#11). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl A infra/python/deps/docutils-0.12.tar.gz A infra/python/deps/futures-3.0.5-py2-none-any.whl A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl M infra/python/deps/requirements.txt A infra/python/deps/simplejson-3.3.0.tar.gz M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_join_queries.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py A tests/util/filesystem_base.py M tests/util/filesystem_utils.py M tests/util/hdfs_util.py A tests/util/s3_util.py 56 files changed, 640 insertions(+), 395 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/11 -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id:
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Henry Robinson has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 10: (11 comments) http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: Line 839: There is no directory structure in S3, so "inheriting" permissions is not : // possible. S3 has per-object permissions (in the form of ACLs see http://docs.aws.amazon.com/AmazonS3/latest/dev/S3_ACLs_UsingACLs.html). So what's to stop us mimicking inherited permissions? Even though there's no directory structure, we could insure that permissions flow down the partition hierarchy. I'm happy to leave this as a TODO, but check my reasoning first. Line 875: stringstream ss; : ss << "Error(s) deleting partition directories. First error (of " : << partition_create_ops.errors().size() << ") was: " << err.second; While you are here, could you clean up he use of stringstream in favour of Substitute()? http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/util/hdfs-bulk-ops.cc File be/src/util/hdfs-bulk-ops.cc: Line 111: .c_str() not really your change, but I think you can get rid of .c_str() in all the logging messages Line 128: string error_msg = connection_status.ok() ? GetStrErrMsg() : put the line break here, and put the whole ternary statement on the next line if it fits Line 156: if (connection_cache != NULL) connection_cache_ = connection_cache; You should do this even if connection_cache is NULL - the caller might be trying to explicitly say "don't use a connection cache". http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/util/hdfs-bulk-ops.h File be/src/util/hdfs-bulk-ops.h: Line 94: /// Constructs a new operation set. comment is now redundant, can be removed. Line 109: /// If abort_on_error is true, execution will finish after the first error seen. update comment to explain connection_cache Line 121: return connection_cache_; one line Line 130: Promise promise_; Can you replace this with a CountingBarrier? I think that will be more natural and will simplify some of the logic. http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/util/hdfs-util.cc File be/src/util/hdfs-util.cc: Line 77: // TODO: 's3a' is the smallest scheme that our filesystem supports, so we compare the : // first 6 characters. Change if we support a filesystem with a smaller scheme. Comment is out-of-date http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/util/hdfs-util.h File be/src/util/hdfs-util.h: Line 45: IsDfsPath I think this should be IsHdfsPath(...), because other filesystems technically implement the DFS abstraction. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 10 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Dan Hecht has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 10: (1 comment) http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/util/hdfs-util.cc File be/src/util/hdfs-util.cc: Line 85: // HDFS path. this isn't a valid assumption after Anuj's change. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 10 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Taras Bobrovytsky, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#10). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 15x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java M infra/python/deps/requirements.txt M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_join_queries.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py A tests/util/filesystem_base.py M tests/util/filesystem_utils.py M tests/util/hdfs_util.py A tests/util/s3_util.py 47 files changed, 621 insertions(+), 377 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/10 -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 10 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change.
Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between
filesystems
..
Patch Set 5:
(29 comments)
http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:
Line 47: (
> remove
Done
http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:
Line 754: HdfsFsCache::HdfsFsMap partition_connection_cache;
> is this per partition or per filesystem? Not clear here what role this play
It's a connection cache for all the filesystems that the partitions connect to.
So I guess it makes more sense to call it filesystem_connection_cache. Changed
it.
Line 772: over
> in or on, not over (you partition over a partition function).
Done
Line 772: allows to
> "allow us to" or better "allows writes to tables ..."
Done
Line 773: So we need to open connections to different filesystems as necessary
> this sentence should be where you explain the "one cnxn per fs" requirement
I removed mention of the requirement. See my comment below for explanation.
Line 774: the connections
> "one connection per filesystem"
Done
Line 776: we don't want to open multiple
: // connections to the same filesystem. Doing so could cause
inconsistencies in the
: // filesystem and return with errors
> this is vague - what could go wrong?
I spent all afternoon trying to recreate the errors that I saw earlier, so that
I could write a better description of what could go wrong if we use multiple
connections to a FS. However, I wasn't able to reproduce the errors, which
leads me to believe that the errors I saw before were probably because of some
other bugs which I fixed later.
So I'm removing mention of the "inconsistencies", however, it's still better to
cache connections (so that we're not unnecessarily creating too many
connections at once).
Line 772: InsertStmt allows to insert in tables that have partitions over
multiple filesystems.
: // So we need to open connections to different filesystems as
necessary.
: // We use a local connection cache and populate it with the
connections to the necessary
: // filesystems before we do the actual operations (copy, move,
delete, etc.). This is
: // because the operations are done in parallel and we don't want
to open multiple
: // connections to the same filesystem. Doing so could cause
inconsistencies in the
: // filesystem and return with errors.
> This comment as a whole would be better before line 754.
I've removed this comment. See CR comment below.
Line 779: partition_connection
> partition_fs_connection (or just fs_connection)
Done
Line 780: RETURN_IF_ERROR
> does this leave behind any cleanup that needs to be done?
No the context of everything that happens in this loop is local. And none of
the operations actually start until after this loop execution is completed. So
returning on an error here would not leave any uncleaned state.
Line 798:
> this blank line can go
Done
Line 799: bool is_s3_path = false;
: if (IsS3APath(part_path.c_str())) is_s3_path = true;
> write this as:
Done
Line 823:
> indentation (not your change)
isn't it already 2 tabs?
Line 846: !is_s3_path
> move this into the if on line 843, save a level of nesting
Done
Line 864: if (!is_s3_path
> again, combine with above line
Done
http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-bulk-ops.cc
File be/src/util/hdfs-bulk-ops.cc:
Line 55: Status::OK();
> combine declaration with line 57
Done
Line 60: if (connection_status.ok()) {
> instead of this indentation, make a method call "AddError(const string& err
Done
http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-bulk-ops.h
File be/src/util/hdfs-bulk-ops.h:
Line 120: void set_local_connection_cache(HdfsFsCache::HdfsFsMap*
local_connection_cache) {
: DCHECK(local_connection_cache != NULL);
: local_connection_cache_ = local_connection_cache;
: }
> how about just making this a parameter to Execute() - is it needed anywhere
No, I've removed this and moved it to Execute().
Line 137: local
> what does it mean to be 'local'? Who owns this object?
I see the confusion. I've removed mentions of 'local' and renamed this field.
It's local with respect to the process wide global connection cache. But this
class doesn't care about that. Also, it's not owned. The caller should supply
the connection cache.
http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:
Line 76: const char* src, const char* to_check
> does this ever get called with char*s that aren't derived from string objec
Not currently, no.
Line 77: scheme of our filesystem support
> grammar
Done
Line 79: strncmp
> this needs fixing. You nee
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Hello Taras Bobrovytsky, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2574 to look at the new patch set (#9). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 15x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java M infra/python/deps/requirements.txt M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_join_queries.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py A tests/util/filesystem_base.py M tests/util/filesystem_utils.py M tests/util/hdfs_util.py A tests/util/s3_util.py 47 files changed, 621 insertions(+), 377 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/9 -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 9 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Henry Robinson has posted comments on this change.
Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between
filesystems
..
Patch Set 5:
(30 comments)
http://gerrit.cloudera.org:8080/#/c/2574/5//COMMIT_MSG
Commit Message:
Line 9: Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
: functionally enables LOAD DATA and INSERT on S3 without making major
: changes for the sake of improving performance over S3. This patch also
: enables both INSERT and LOAD DATA between file systems.
Say something about the approach here. What had to change?
http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:
Line 47: (
remove
http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:
Line 754: HdfsFsCache::HdfsFsMap partition_connection_cache;
is this per partition or per filesystem? Not clear here what role this plays in
the method.
Line 772: over
in or on, not over (you partition over a partition function).
Line 772: allows to
"allow us to" or better "allows writes to tables ..."
Line 773: So we need to open connections to different filesystems as necessary
this sentence should be where you explain the "one cnxn per fs" requirement.
Line 774: the connections
"one connection per filesystem"
Line 772: InsertStmt allows to insert in tables that have partitions over
multiple filesystems.
: // So we need to open connections to different filesystems as
necessary.
: // We use a local connection cache and populate it with the
connections to the necessary
: // filesystems before we do the actual operations (copy, move,
delete, etc.). This is
: // because the operations are done in parallel and we don't want
to open multiple
: // connections to the same filesystem. Doing so could cause
inconsistencies in the
: // filesystem and return with errors.
This comment as a whole would be better before line 754.
Line 776: we don't want to open multiple
: // connections to the same filesystem. Doing so could cause
inconsistencies in the
: // filesystem and return with errors
this is vague - what could go wrong?
Line 779: partition_connection
partition_fs_connection (or just fs_connection)
Line 780: RETURN_IF_ERROR
does this leave behind any cleanup that needs to be done?
Line 798:
this blank line can go
Line 799: bool is_s3_path = false;
: if (IsS3APath(part_path.c_str())) is_s3_path = true;
write this as:
bool is_s3_path = IsS3APath(part_path.c_str());
Line 823:
indentation (not your change)
Line 846: !is_s3_path
move this into the if on line 843, save a level of nesting
Line 864: if (!is_s3_path
again, combine with above line
http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-bulk-ops.cc
File be/src/util/hdfs-bulk-ops.cc:
Line 55: Status::OK();
combine declaration with line 57
Line 60: if (connection_status.ok()) {
instead of this indentation, make a method call "AddError(const string&
error_msg)" which does lines 102 -> 126 and call it here and return (after
calling MarkOneOpDone())
http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-bulk-ops.h
File be/src/util/hdfs-bulk-ops.h:
Line 120: void set_local_connection_cache(HdfsFsCache::HdfsFsMap*
local_connection_cache) {
: DCHECK(local_connection_cache != NULL);
: local_connection_cache_ = local_connection_cache;
: }
how about just making this a parameter to Execute() - is it needed anywhere
else?
Line 137: local
what does it mean to be 'local'? Who owns this object?
http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:
Line 76: const char* src, const char* to_check
does this ever get called with char*s that aren't derived from string objects?
Line 77: scheme of our filesystem support
grammar
Line 79: strncmp
this needs fixing. You need to extract the scheme (by searching for "://" I
think) and do this properly.
Line 82: bool SpansMultipleFilesystems(const char* pathA, const char* pathB) {
: return !IsPathOnFilesystem(pathA, pathB);
: }
remove this method. Have the caller just call IsPathOnFilesystem().
http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-util.h
File be/src/util/hdfs-util.h:
Line 51: IsPathOnFilesystem
All paths are on some filesystem. Call this "FilesystemsMatch" or something.
What should this return if either path doesn't have a scheme component?
Line 51: to_check
'to_check' doesn't really tell me what this is (and the type doesn't help).
Obviously by context it's a path, but still better to show this in the variable
name.
Also, since this method is symmetric (i.e. doesn't matter what order the
arguments are given), better just to name them symm
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 8: Code-Review+1 I reviewed the Python side, it looks good to me now. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 8 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has uploaded a new patch set (#8). Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 15x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/runtime/coordinator.cc M be/src/runtime/hdfs-fs-cache.h M be/src/util/hdfs-bulk-ops.cc M be/src/util/hdfs-bulk-ops.h M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java M fe/src/main/java/com/cloudera/impala/service/Frontend.java M infra/python/deps/requirements.txt M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test M testdata/workloads/tpch/queries/insert_parquet.test M tests/common/impala_test_suite.py M tests/common/skip.py M tests/custom_cluster/test_insert_behaviour.py M tests/custom_cluster/test_parquet_max_page_header.py M tests/data_errors/test_data_errors.py M tests/metadata/test_compute_stats.py M tests/metadata/test_ddl.py M tests/metadata/test_explain.py M tests/metadata/test_hdfs_encryption.py M tests/metadata/test_hdfs_permissions.py M tests/metadata/test_last_ddl_time_update.py M tests/metadata/test_load.py M tests/metadata/test_partition_metadata.py M tests/metadata/test_recover_partitions.py M tests/metadata/test_show_create_table.py M tests/query_test/test_aggregation.py M tests/query_test/test_cancellation.py M tests/query_test/test_chars.py M tests/query_test/test_compressed_formats.py M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/query_test/test_insert_parquet.py M tests/query_test/test_join_queries.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py A tests/util/filesystem_base.py M tests/util/filesystem_utils.py M tests/util/hdfs_util.py A tests/util/s3_util.py 47 files changed, 599 insertions(+), 359 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/8 -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 8 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky
[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems .. Patch Set 7: (1 comment) http://gerrit.cloudera.org:8080/#/c/2574/7/tests/util/hdfs_util.py File tests/util/hdfs_util.py: Line 108: exists > I tested this on my machine and it does not work (returns None instead of T As you mentioned in person, it points to _pyweb_hdfs_client_exists(). I've changed it to just have this function exists() which stats the file/dir to check for it's existence. This should sort out the confusion. -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 7 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
