[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-05-02 Thread Internal Jenkins (Code Review)
Internal Jenkins has submitted this change and it was merged.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Reviewed-on: http://gerrit.cloudera.org:8080/2574
Reviewed-by: Sailesh Mukil 
Tested-by: Internal Jenkins
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/CMakeLists.txt
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
A be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
M fe/src/test/java/com/cloudera/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_col_stats.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_delimited_text.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_insert_permutation.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_multiple_filesystems.py
M tests/query_test/test_partitioning.py
M tests/query_test/t

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-05-02 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 33: Code-Review+2

Rebase, carry +2.

-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 33
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-05-02 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 31:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2574/31/fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java:

Line 137:   if (!(fs instanceof DistributedFileSystem) || !(fs instanceof 
S3AFileSystem)) {
> how did the tests not catch this?
It seems to have gotten skipped in every exhaustive run I did. I'm not sure why 
yet, I'm investigating now.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 31
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-05-02 Thread Sailesh Mukil (Code Review)
Hello Marcel Kornacker, Michael Brown, Taras Bobrovytsky, Internal Jenkins, Dan 
Hecht,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#33).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/CMakeLists.txt
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
A be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
M fe/src/test/java/com/cloudera/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_col_stats.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_delimited_text.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_insert_permutation.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_multiple_filesystems.py
M tests/que

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-05-02 Thread Marcel Kornacker (Code Review)
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 31:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2574/31/fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java:

Line 137:   if (!(fs instanceof DistributedFileSystem) || !(fs instanceof 
S3AFileSystem)) {
> Oh, yes. Sorry, I've changed it.
how did the tests not catch this?


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 31
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-05-02 Thread Marcel Kornacker (Code Review)
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 32: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 32
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-05-02 Thread Sailesh Mukil (Code Review)
Hello Marcel Kornacker, Michael Brown, Taras Bobrovytsky, Internal Jenkins, Dan 
Hecht,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#32).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/CMakeLists.txt
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
A be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
M fe/src/test/java/com/cloudera/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_col_stats.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_delimited_text.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_insert_permutation.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_multiple_filesystems.py
M tests/que

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-05-02 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 31:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2574/31/fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java:

Line 137:   if (!(fs instanceof DistributedFileSystem) || !(fs instanceof 
S3AFileSystem)) {
> if this must be either a dfs subclass or s3a, shouldn't this be
Oh, yes. Sorry, I've changed it.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 31
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-05-02 Thread Marcel Kornacker (Code Review)
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 31:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2574/31/fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java:

Line 137:   if (!(fs instanceof DistributedFileSystem) || !(fs instanceof 
S3AFileSystem)) {
if this must be either a dfs subclass or s3a, shouldn't this be

if (!(fs instanceof DFS) && !(fs instanceof S3A)) throw ...

?


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 31
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-05-02 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 31:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2574/31/fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java:

Line 139: HDFS
> If Isilon uses hdfs:// URIs, what you have now makes sense.
Yes, Isilon implements HDFS protocol.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 31
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-05-02 Thread Henry Robinson (Code Review)
Henry Robinson has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 31:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2574/31/fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java:

Line 139: HDFS
> It seems like it was always assumed to be an HDFS filesystem because LOAD D
If Isilon uses hdfs:// URIs, what you have now makes sense.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 31
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-05-01 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 31:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2574/31/fe/src/test/java/com/cloudera/impala/analysis/AnalyzeStmtsTest.java
File fe/src/test/java/com/cloudera/impala/analysis/AnalyzeStmtsTest.java:

Line 2876:   // Source must be HDFS or S3A.
Placeholder: FYI last patchset changes were made here too.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 31
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-05-01 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 31:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2574/31/fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java:

Line 139: HDFS
> What about (e.g.) Isilon? Are you calling that an HDFS filesystem?
It seems like it was always assumed to be an HDFS filesystem because LOAD DATA 
is supposed to work for Isilon. Should I explicitly mention that too?


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 31
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-05-01 Thread Henry Robinson (Code Review)
Henry Robinson has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 31:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2574/31/fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java:

Line 139: HDFS
What about (e.g.) Isilon? Are you calling that an HDFS filesystem?


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 31
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-05-01 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 31:

Ok just figured out that there was no file:/test-warehouse/test.out file and 
that my patch accidentally allowed LOAD DATA for local filesystem too, which we 
do not support. I fixed it in this patch. Will GVM after I get a +2.

-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 31
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-05-01 Thread Sailesh Mukil (Code Review)
Hello Marcel Kornacker, Michael Brown, Taras Bobrovytsky, Internal Jenkins, Dan 
Hecht,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#31).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/CMakeLists.txt
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
A be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
M fe/src/test/java/com/cloudera/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_col_stats.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_delimited_text.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_insert_permutation.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_multiple_filesystems.py
M tests/que

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-05-01 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 30:

> Build failed: 
> http://sandbox.jenkins.cloudera.com/job/impala-external-gerrit-verify-merge/2331/

Load data did not load a file and so a FE test failed. Resubmitting for GVM.


Log:
---
Test set: com.cloudera.impala.analysis.AnalyzeStmtsTest
---
Tests run: 47, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.089 sec <<< 
FAILURE! - in com.cloudera.impala.analysis.AnalyzeStmtsTest
TestLoadData(com.cloudera.impala.analysis.AnalyzeStmtsTest)  Time elapsed: 
0.322 sec  <<< FAILURE!
java.lang.AssertionError: got error:
INPATH location 'file:/test-warehouse/test.out' does not exist.
expected:
INPATH location 'file:/test-warehouse/test.out' must point to an HDFS filesystem
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at 
com.cloudera.impala.analysis.AnalyzerTest.AnalysisError(AnalyzerTest.java:347)
at 
com.cloudera.impala.analysis.AnalyzerTest.AnalysisError(AnalyzerTest.java:326)
at 
com.cloudera.impala.analysis.AnalyzeStmtsTest.TestLoadData(AnalyzeStmtsTest.java:2877)

-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 30
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-05-01 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 29:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2574/29/fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
File fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java:

Line 180: if (!doRename) doRename = !destIsDfs && sameFileSystem;
> out of curiosity: what category is s3 in?
S3 falls under the non-distributed category.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 29
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-05-01 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 30: Code-Review+2

Rebase, carry +2.

-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 30
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-05-01 Thread Sailesh Mukil (Code Review)
Hello Marcel Kornacker, Michael Brown, Taras Bobrovytsky, Dan Hecht,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#30).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/CMakeLists.txt
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
A be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_col_stats.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_delimited_text.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_insert_permutation.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_multiple_filesystems.py
M tests/query_test/test_partitioning.py
M tests/query_test/test_queries.py
M tests/query_test/test_s

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-05-01 Thread Marcel Kornacker (Code Review)
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 29: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2574/29/fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
File fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java:

Line 180: if (!doRename) doRename = !destIsDfs && sameFileSystem;
out of curiosity: what category is s3 in?


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 29
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-30 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 28:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/2574/28/fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
File fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java:

Line 144:* file. The file is moved (renamed) to the new location unless the 
source and
> update comment, this isn't quite true anymore (you also don't rename if the
Done


Line 175: (destIsDfs && sameFileSystem) ? sameEncryptionZone : 
sameFileSystem;
> i still find it a bit hard to follow when exactly a rename is possible.
Done


Line 194:   LOG.info(String.format("Copying '%s' to '%s' between 
filesystems.",
> preconditions.checkstate(!sameFileSystem)
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 28
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-30 Thread Sailesh Mukil (Code Review)
Hello Michael Brown, Taras Bobrovytsky, Dan Hecht,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#29).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/CMakeLists.txt
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
A be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_col_stats.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_delimited_text.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_insert_permutation.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_multiple_filesystems.py
M tests/query_test/test_partitioning.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-30 Thread Marcel Kornacker (Code Review)
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 28:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/2574/28/fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
File fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java:

Line 144:* file. The file is moved (renamed) to the new location unless the 
source and
update comment, this isn't quite true anymore (you also don't rename if the 
files are in different file systems).


Line 175: (destIsDfs && sameFileSystem) ? sameEncryptionZone : 
sameFileSystem;
i still find it a bit hard to follow when exactly a rename is possible.

there seem to be two cases: 1) destIsDfs && sameFileSystem && 
sameEncryptionZone, 2) !destIsDfs && sameFileSystem

if you break this out into explicit if statements, you can add a comment to 
each case (e.g., which one is the s3 case).


Line 194:   LOG.info(String.format("Copying '%s' to '%s' between 
filesystems.",
preconditions.checkstate(!sameFileSystem)


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 28
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-29 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 27:

(9 comments)

Thanks Marcel.

http://gerrit.cloudera.org:8080/#/c/2574/27/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

Line 400:   // Fully qualified URI with the scheme to the base directory for 
this partition.
> what does "with the scheme to the base directory" mean?
it was actually supposed to be "URI with the scheme, to the base directory". 
But I forgot to remove "with the scheme" when I added "Fully qualified". 
Changed it now.


http://gerrit.cloudera.org:8080/#/c/2574/27/fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java:

Line 153:   "INPATH location '%s' cannot contain non-hidden 
subdirectories.", sourceDataPath_));
> long line (i know, not your fault)
Done


http://gerrit.cloudera.org:8080/#/c/2574/27/fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
File fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java:

Line 88:   private static boolean arePathsInSameEncryptionZone(FileSystem fs, 
Path p1,
> arePathsInSameHdfsEncryptionZone?
Done


Line 149:* preserving the the existing file extension. If 
renameIfAlreadyExists is false, an
> the the
Done


Line 157: Path destFile = destFs.isDirectory(dest) ? new Path(dest, 
sourceFile.getName()) :
> better:
Done


Line 167: boolean isDestDfs = isDistributedFileSystem(destFs);
> destIsDfs
Done


Line 189:   LOG.info(String.format(
> precede w/ preconditions.checkstate(!doRename))
Done


Line 193:   LOG.info(String.format("Copying '%s' to '%s' between 
filesystems.",
> couldn't this be !isDestDfs && sameFileSystem?
No, that would always make doRename = true, and so this code wouldn't get 
executed. Do you think I should make the else into an else if() so it's more 
readable?


Line 286:* Returns true iff the filesystem is a S3AFileSystem.
> if the path is on an ...
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 27
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-29 Thread Sailesh Mukil (Code Review)
Hello Michael Brown, Taras Bobrovytsky, Dan Hecht,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#28).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/CMakeLists.txt
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
A be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_col_stats.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_delimited_text.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_insert_permutation.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_multiple_filesystems.py
M tests/query_test/test_partitioning.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-29 Thread Marcel Kornacker (Code Review)
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 27:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/2574/27/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

Line 400:   // Fully qualified URI with the scheme to the base directory for 
this partition.
what does "with the scheme to the base directory" mean?


http://gerrit.cloudera.org:8080/#/c/2574/27/fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java:

Line 153:   "INPATH location '%s' cannot contain non-hidden 
subdirectories.", sourceDataPath_));
long line (i know, not your fault)


http://gerrit.cloudera.org:8080/#/c/2574/27/fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
File fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java:

Line 88:   private static boolean arePathsInSameEncryptionZone(FileSystem fs, 
Path p1,
arePathsInSameHdfsEncryptionZone?

not a fun name either way


Line 149:* preserving the the existing file extension. If 
renameIfAlreadyExists is false, an
the the


Line 157: Path destFile = destFs.isDirectory(dest) ? new Path(dest, 
sourceFile.getName()) :
better:

Path = ... \n<4 spaces>? ... : ...


Line 167: boolean isDestDfs = isDistributedFileSystem(destFs);
destIsDfs


Line 189:   LOG.info(String.format(
precede w/ preconditions.checkstate(!doRename))


Line 193:   LOG.info(String.format("Copying '%s' to '%s' between 
filesystems.",
couldn't this be !isDestDfs && sameFileSystem?


Line 286:* Returns true iff the filesystem is a S3AFileSystem.
if the path is on an ...


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 27
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-29 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 27: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 27
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-29 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 26:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2574/26/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 108: DCHECK
> DCHECK_GT like you did below.
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 26
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-29 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 24:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/2574/24/be/src/util/hdfs-util-test.cc
File be/src/util/hdfs-util-test.cc:

Line 58:   EXPECT_TRUE(FilesystemsMatch("hdfs://namenode/temp_dir/temp_path", 
"hdfs://"));
> Should be two non-zero lengths, so still don't see why they compare true.
Oops, my bad. I think I ran the tests without building the test file. That test 
case is wrong. I've removed it.


http://gerrit.cloudera.org:8080/#/c/2574/24/tests/metadata/test_ddl.py
File tests/metadata/test_ddl.py:

Line 70:   @SkipIfS3.qualified_path
> okay. let's change this to SkipIfS3.jra(3459) or whatever the syntax is the
I looked up what PURGE is supposed to do and it ultimately just calls a "hadoop 
fs -rm -skipTrash [filename]". The only extra thing being -skipTrash. That on 
an S3 path has no effect, so it's the same as DROP TABLE.
I've changed it to SkipIfS3.hdfs_purge.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 24
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-29 Thread Sailesh Mukil (Code Review)
Hello Michael Brown, Taras Bobrovytsky, Dan Hecht,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#27).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/CMakeLists.txt
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
A be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_col_stats.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_delimited_text.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_insert_permutation.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_multiple_filesystems.py
M tests/query_test/test_partitioning.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-29 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 26:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2574/26/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 108: DCHECK
DCHECK_GT like you did below.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 26
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-29 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 24:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/2574/24/be/src/util/hdfs-util-test.cc
File be/src/util/hdfs-util-test.cc:

Line 58:   EXPECT_TRUE(FilesystemsMatch("hdfs://namenode/temp_dir/temp_path", 
"hdfs://"));
> This case is special, because for some reason if we get a path with only a 
Should be two non-zero lengths, so still don't see why they compare true.


http://gerrit.cloudera.org:8080/#/c/2574/24/tests/metadata/test_ddl.py
File tests/metadata/test_ddl.py:

Line 70:   @SkipIfS3.qualified_path
> I'll read up on the semantics of PURGE and try to fix it for S3 or write an
okay. let's change this to SkipIfS3.jra(3459) or whatever the syntax is then, 
since it doesn't seem related to the qualified_path jira.  Or if this test 
doesn't make sense on s3, let's have hdfs_purge or something.

Also, might be worth seeing what DROP TABLE PURGE does on S3.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 24
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-29 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 24:

(4 comments)

> (4 comments)
 > 
 > +2 for be. Python test changes look fine, if MichaelB/Henry are
 > happy with the rest of the python, then let's consider those +2.
 > 
 > So, just need a +2 for fe changes from Alex or Marcel.

http://gerrit.cloudera.org:8080/#/c/2574/24/be/src/util/hdfs-util-test.cc
File be/src/util/hdfs-util-test.cc:

Line 58:   EXPECT_TRUE(FilesystemsMatch("hdfs://namenode/temp_dir/temp_path", 
"hdfs://"));
> why do these match?  does it matter what scheme the default_fs has? i.e. if
This case is special, because for some reason if we get a path with only a 
scheme, then we do not have enough information to parse it. We ideally 
shouldn't get such a path, but we've used that as a fallback here if we don't 
get a defaultFS from the core-site (this should never happen though):
https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/runtime/exec-env.cc#L419

For this test case, it wouldn't matter what scheme the default_fs has, because 
both these paths have their own schemes. So only those schemes will be taken 
into consideration for matching.


http://gerrit.cloudera.org:8080/#/c/2574/24/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 105:  && fs_b_name_length > 0
> isn't this implied by the previous if-conditional being false? i.e. they sh
Yes, you're right. I've used your format and I've added DCHECKs too.


http://gerrit.cloudera.org:8080/#/c/2574/24/tests/metadata/test_ddl.py
File tests/metadata/test_ddl.py:

Line 70:   @SkipIfS3.qualified_path
> you don't have to do it in this change, but can't this test be fixed?
I'll read up on the semantics of PURGE and try to fix it for S3 or write an S3 
specific PURGE test as a part of another JIRA. Filed IMPALA-3459.
The only difference I know for now is that S3 doesn't have a ".Trash" folder or 
something equivalent, so a DROP TABLE PURGE shouldn't be that different from a 
regular DROP TABLE for S3.


http://gerrit.cloudera.org:8080/#/c/2574/24/tests/metadata/test_hdfs_encryption.py
File tests/metadata/test_hdfs_encryption.py:

Line 31: qualified_path
> or should this be hdfs_encryption?  qualified_path is really specifically f
Yes hdfs_encryption is the actual reason. I've added a SkipIfS3.hdfs_encryption 
and used that here.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 24
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-29 Thread Sailesh Mukil (Code Review)
Hello Michael Brown, Taras Bobrovytsky, Dan Hecht,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#26).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/CMakeLists.txt
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
A be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_col_stats.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_delimited_text.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_insert_permutation.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_multiple_filesystems.py
M tests/query_test/test_partitioning.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-29 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 25:

One more thing I forgot to mention, and you should just do this as a separate 
change: can you remove the grant_revoke_no_insert.test file and use the normal 
one on S3?

-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 25
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-29 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 25:

> Uploaded patch set 25.

Oops, please ignore patch set #25. Will post another patch set instead.
Patch set #24 is the latest until then.

-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 25
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-29 Thread Sailesh Mukil (Code Review)
Hello Michael Brown, Taras Bobrovytsky, Dan Hecht,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#25).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/CMakeLists.txt
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
A be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_col_stats.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_delimited_text.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_insert_permutation.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_multiple_filesystems.py
M tests/query_test/test_partitioning.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-29 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 24: Code-Review+1

(4 comments)

+2 for be. Python test changes look fine, if MichaelB/Henry are happy with the 
rest of the python, then let's consider those +2.

So, just need a +2 for fe changes from Alex or Marcel.

http://gerrit.cloudera.org:8080/#/c/2574/24/be/src/util/hdfs-util-test.cc
File be/src/util/hdfs-util-test.cc:

Line 58:   EXPECT_TRUE(FilesystemsMatch("hdfs://namenode/temp_dir/temp_path", 
"hdfs://"));
why do these match?  does it matter what scheme the default_fs has? i.e. if 
default_fs=s3a://blah, should this still return true? what should happen in 
that case?


http://gerrit.cloudera.org:8080/#/c/2574/24/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 105:  && fs_b_name_length > 0
isn't this implied by the previous if-conditional being false? i.e. they should 
be DCHECKs.  Also, a couple comments would make it easier to read through:

// Neither is fully qualified: both are on default_fs.
if (fs_a_name_length == 0 && fs_b_name_length == 0) return true;
// One is a relative path: check fully-qualified one against default_fs.
if (fs_a_name_length == 0) {
  return strncmp(path_b, default_fs, default_fs_name_length) == 0;
}
if (fs_b_name_length == 0) {
  return strncmp(path_a, default_fs, default_fs_name_length) == 0;
}
// Both fully qualified: check the filesystem prefix.
if (fs_a_name_length != fs_b_name_length) return false;
return strncmp(path_a, path_b, fs_a_name_length) == 0;


http://gerrit.cloudera.org:8080/#/c/2574/24/tests/metadata/test_ddl.py
File tests/metadata/test_ddl.py:

Line 70:   @SkipIfS3.qualified_path
you don't have to do it in this change, but can't this test be fixed?


http://gerrit.cloudera.org:8080/#/c/2574/24/tests/metadata/test_hdfs_encryption.py
File tests/metadata/test_hdfs_encryption.py:

Line 31: qualified_path
or should this be hdfs_encryption?  qualified_path is really specifically for 
IMPALA-1872; didn't notice that being used here.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 24
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-29 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 23:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2574/23/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 94: after_authority - path
> hdfs://n/ should return '8':
Oops, right.  Thanks for the test case, though.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 23
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-28 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 23:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/2574/23/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 81: This is a helper function and is not declared in the header file. It 
returns
> Returns
Done


Line 88: file
> Let's check for "file:" since there could conceivable be a scheme called fi
Done


Line 88: file
> And please add a test case for this (e.g. file:blah vs file2:blah should be
Done


Line 90: after_scheme + 3
> this is missing the dereference.  please add a test case to your unit test 
Done.


Line 94: after_authority - path
> is this off by one? i.e. hdfs://n/ should return 7, right?
hdfs://n/ should return '8':
4 for 'hdfs', 3 for '://' and 1 for 'n'. (4 + 3 + 1)

At this point after_authority would point to the '/' after 'n'.

I've added a test case for hdfs://namenode, hdfs://namenode/ and it works as 
expected, i.e. they match.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 23
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-28 Thread Sailesh Mukil (Code Review)
Hello Michael Brown, Taras Bobrovytsky,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#24).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/CMakeLists.txt
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
A be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_col_stats.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_delimited_text.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_insert_permutation.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_multiple_filesystems.py
M tests/query_test/test_partitioning.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests/shell/test

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-28 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 23:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2574/23/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 88: file
> Let's check for "file:" since there could conceivable be a scheme called fi
And please add a test case for this (e.g. file:blah vs file2:blah should be 
false).


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 23
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-28 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 23:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/2574/23/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 81: This is a helper function and is not declared in the header file. It 
returns
Returns
(the first sentence is obvious).


Line 88: file
Let's check for "file:" since there could conceivable be a scheme called 
fileotherstuff, and change both 4 to 5s.  Also, this case should come first, so 
that "file:path" works okay.

And then I think after_scheme should be at occurrence of "://" since we're 
assuming the second slash below (3 characters).


Line 90: after_scheme + 3
this is missing the dereference.  please add a test case to your unit test that 
catches this bug.


Line 94: after_authority - path
is this off by one? i.e. hdfs://n/ should return 7, right?

It think this test case would catch the bug:

hdfs://namenode
hdfs://namenode/

should compare equal but i think FilesystemMatch() currently false in that case?


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 23
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-27 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 22:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/2574/22/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 301: // the file is not accessible until it's closed.
> I guess we should clarify this comment further because of the fact that eve
Done


Line 312: 
> nit: remove blank line - this is all one logical block of code.
Done


http://gerrit.cloudera.org:8080/#/c/2574/22/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 93: get_block_size
> actually, since this is a struct, we can just access block_size. We're alre
Done


http://gerrit.cloudera.org:8080/#/c/2574/22/be/src/util/hdfs-util-test.cc
File be/src/util/hdfs-util-test.cc:

Line 60: "temp_dir/temp_path"));
> let's add tests for the file scheme case. see other comments for why that's
Done


http://gerrit.cloudera.org:8080/#/c/2574/22/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 81: getNamenodeLength
> static int GetNamenodeLength ... and add a quick function header comment he
You're right. I can see how namenode could sound confusing. I changed it to 
GetFilesystemNameLength() and added a comment above the function.
Please let me know if you think the name should be changed.


Line 86: skip_delimiter
> would it make sense to call this delimiter_len?  actually, since you use po
That makes sense. I got rid of the delimiter_len and added a special case for 
"file:".


Line 91:   if (after_authority == NULL) return strlen(path);
> I think for file scheme, we shouldn't treat the next thing as authority, ri
Yes that's right. I missed that. I've added a case for that.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 22
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-27 Thread Sailesh Mukil (Code Review)
Hello Michael Brown, Taras Bobrovytsky,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#23).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/CMakeLists.txt
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
A be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_col_stats.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_delimited_text.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_insert_permutation.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_multiple_filesystems.py
M tests/query_test/test_partitioning.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests/shell/test

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-27 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 22:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/2574/22/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 301: // the file is not accessible until it's closed.
I guess we should clarify this comment further because of the fact that even if 
the file were closed, it would just return the default block size.  How about:

On S3A, the file cannot be stat'ed until after it's closed, and even so, the 
block size reported will be just the filesystem default.  So, remember the 
requested block size.


Line 312: 
nit: remove blank line - this is all one logical block of code.


http://gerrit.cloudera.org:8080/#/c/2574/22/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 93: get_block_size
actually, since this is a struct, we can just access block_size. We're already 
doing that for all the other fields.


http://gerrit.cloudera.org:8080/#/c/2574/22/be/src/util/hdfs-util-test.cc
File be/src/util/hdfs-util-test.cc:

Line 60: "temp_dir/temp_path"));
let's add tests for the file scheme case. see other comments for why that's 
special.


http://gerrit.cloudera.org:8080/#/c/2574/22/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 81: getNamenodeLength
static int GetNamenodeLength ... and add a quick function header comment here 
since this isn't declared in a header.  Also, this returns the 
scheme://authority right?  Is that considered the "namenode" or is there a 
better term for that (I thought namenode was effectively a HDFS way of saying 
authority, but I could be wrong)?


Line 86: skip_delimiter
would it make sense to call this delimiter_len?  actually, since you use 
pointers for the other two markers, it might be clearer to do that here to:

// Scheme
after_delimiter = after_scheme + 2;
if (after_delimiter[0] == '/') after_delimiter++;

But really, I think if you see :/ rarther than ://, then the next thing isn't 
the authority. But this should happen only for file (where the / is really part 
of the path).  and I think we need to special case 'file' scheme anyway (see 
below), so we can probably special case that and just make this code skip over 
://


Line 91:   if (after_authority == NULL) return strlen(path);
I think for file scheme, we shouldn't treat the next thing as authority, right? 
i.e. file:/dir1, file://dir2, file:///dir3 are all the same filesystem. I'm 
pretty sure the Hadoop java filesystem resolution code treats these as all the 
same. I think we should just return 4 (i.e. "file") when scheme is file.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 22
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-27 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 21:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/2574/21/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 652: block_size
> On some HDFS implementations I think it's possible for HDFS to override the
I went for the second option. Done.


http://gerrit.cloudera.org:8080/#/c/2574/21/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 88: The block size decided on for this file.
> The block size requested when creating the file.  (unless you go with the a
I went for the alternate suggestion, so leaving this as it is.


http://gerrit.cloudera.org:8080/#/c/2574/21/be/src/util/hdfs-util-test.cc
File be/src/util/hdfs-util-test.cc:

Line 44: 
"hdfs://namenode_2/temp_dir2/temp_path_2"));
> what about the 'port' cases?
Done


http://gerrit.cloudera.org:8080/#/c/2574/21/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 89: after_default_fs_namenode
> why is this called "after". Isn't this going to point at the default FS nam
I've added a helper function and all these comments are addressed in that.


Line 90:   after_default_fs_namenode += 3;
> 3? i'm not sure that works all the time. i think file:/path/to/my/file is l
Done


Line 105: "://".
> see above.
Done


Line 108: after_authority_a
> isn't this analogous to after_default_fs_namenode but for path 'a'?  If so,
I've removed this.


Line 112: (
> extraneous parenthesis
I've removed this.


Line 115: 
> it seems like this code is parsing three URIs in the same way.  It might be
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 21
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-27 Thread Sailesh Mukil (Code Review)
Hello Michael Brown, Taras Bobrovytsky,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#22).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/CMakeLists.txt
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
A be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_col_stats.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_delimited_text.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_insert_permutation.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_multiple_filesystems.py
M tests/query_test/test_partitioning.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests/shell/test

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-26 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 21:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/2574/21/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 652: block_size
On some HDFS implementations I think it's possible for HDFS to override the 
requested block size, and so this block_size and the one in mBlockSize might 
not match. So, how about we call this field "requested_block_size" to indicate 
that this is the block size that was requested, not the true block size of the 
file?

Alternatively, we could move this function's code to CreateNewTmpFile() and 
then save away the true block size for hdfs.  And then just make a 
OutputPartition::block_size() accessor that the parquet writer could use.  That 
might be simplest, but would add a stat() call on non-parquet paths, but that 
doesn't seem like it should matter.


http://gerrit.cloudera.org:8080/#/c/2574/21/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 88: The block size decided on for this file.
The block size requested when creating the file.  (unless you go with the 
alternate suggestion).


http://gerrit.cloudera.org:8080/#/c/2574/21/be/src/util/hdfs-util-test.cc
File be/src/util/hdfs-util-test.cc:

Line 44: 
"hdfs://namenode_2/temp_dir2/temp_path_2"));
what about the 'port' cases?


http://gerrit.cloudera.org:8080/#/c/2574/21/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 89: after_default_fs_namenode
why is this called "after". Isn't this going to point at the default FS 
namenode?


Line 90:   after_default_fs_namenode += 3;
3? i'm not sure that works all the time. i think file:/path/to/my/file is 
legal...


Line 105: "://".
see above.


Line 108: after_authority_a
isn't this analogous to after_default_fs_namenode but for path 'a'?  If so, 
they should be named consistently (i.e. namenode vs authority).


Line 112: (
extraneous parenthesis


Line 115: 
it seems like this code is parsing three URIs in the same way.  It might be 
clearer to have a helper subroutine that breaks a URI down into it's components 
that you can call three times.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 21
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-26 Thread Michael Brown (Code Review)
Michael Brown has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 21: Code-Review+1

+1 for the python

-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 21
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-26 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 20:

(4 comments)

Thanks Michael.

http://gerrit.cloudera.org:8080/#/c/2574/20/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 138:   assert size < BLOCK_SIZE
> Not your change, but can you add a meaningful error message for this assert
Done


Line 139:   print size
> Not your change, but random prints should be removed. We should use the log
Done


http://gerrit.cloudera.org:8080/#/c/2574/20/tests/util/filesystem_base.py
File tests/util/filesystem_base.py:

Line 58:   def get_all_file_sizes(self, path):
> As we discussed please make sure both FS-specific implementations return li
Both return lists of integers. I added a comment to state that.


http://gerrit.cloudera.org:8080/#/c/2574/20/tests/util/s3_util.py
File tests/util/s3_util.py:

Line 78: if 'Contents' in response:
   :   return [t['Size'] for t in response['Contents']]
> It seems like we should handle the case where 'Contents' is not in response
I think it's just best to have the same behavior as hdfs_util here. Which is to 
return an empty list if there are no files found. If list_objects() itself 
fails, it will throw an exception.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 20
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-26 Thread Sailesh Mukil (Code Review)
Hello Taras Bobrovytsky,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#21).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/CMakeLists.txt
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
A be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_col_stats.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_delimited_text.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_insert_permutation.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_multiple_filesystems.py
M tests/query_test/test_partitioning.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests/shell/test_shell_commandline.py
A tests/util/filesystem_base.py
M te

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-26 Thread Michael Brown (Code Review)
Michael Brown has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 20:

(7 comments)

For the recent tagged Python changes you asked me to look at, I replied with 
'Done' to your helpful markers and left comments in relevant places.

http://gerrit.cloudera.org:8080/#/c/2574/20/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 136: sizes = self.filesystem_client.get_all_file_sizes(DIR)
> Change made as part of fix for IMPALA-3245.
Done


Line 138:   assert size < BLOCK_SIZE
Not your change, but can you add a meaningful error message for this assert if 
it were to fail?

You can do it like:
  assert size < BLOCK_SIZE, "wrong size {0} [etc]".format(size)


Line 139:   print size
Not your change, but random prints should be removed. We should use the logging 
facility if we need to print this.


http://gerrit.cloudera.org:8080/#/c/2574/20/tests/util/filesystem_base.py
File tests/util/filesystem_base.py:

Line 58:   def get_all_file_sizes(self, path):
> Added for IMPALA-3245.
As we discussed please make sure both FS-specific implementations return lists 
of numbers, not lists of strings, since I don't see you specifically changing 
the type.


http://gerrit.cloudera.org:8080/#/c/2574/20/tests/util/hdfs_util.py
File tests/util/hdfs_util.py:

Line 79:   def get_all_file_sizes(self, path):
> Added for IMPALA-3245.
Done


http://gerrit.cloudera.org:8080/#/c/2574/20/tests/util/s3_util.py
File tests/util/s3_util.py:

Line 72:   def get_all_file_sizes(self, path):
> Added for IMPALA-3245.
Done


Line 78: if 'Contents' in response:
   :   return [t['Size'] for t in response['Contents']]
It seems like we should handle the case where 'Contents' is not in response. If 
it's only missing due to some bad 'path' value, or some S3 error, it will still 
help with test failures to know that. As it is the method will just return 
None. If it returns None, something higher in the stack trying to use the value 
will produce a more obscure exception.

My suggestion is to raise an exception here and include path and response (if 
not large) in the exception string.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 20
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-26 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 20:

> (7 comments)
 > 
 > I've added placeholders for the changes I've made for the fix to
 > IMPALA-3245, so that it's easy to review them.

Whoops, the last round of changes are for IMPALA-3425, not IMPALA-3245.

-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 20
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-26 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 20:

(7 comments)

I've added placeholders for the changes I've made for the fix to IMPALA-3245, 
so that it's easy to review them.

http://gerrit.cloudera.org:8080/#/c/2574/20/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 299:   output_partition->block_size = block_size;
Fix for IMPALA-3425.
Previously PARQUET_FILE_SIZE was ignored because GetFileBlockSize() always 
returned 32MB (the default 'block' size in S3) for S3 files.

We now store the block size so that the GetFileBlockSize() returns the correct 
size which is either the set by the user, or the default from core-site.xml.


Line 652: *size = output_partition->block_size;
Added as part of fix for IMPALA-3245.


http://gerrit.cloudera.org:8080/#/c/2574/20/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 89:   uint64_t block_size;
Added as part of fix for IMPALA-3245.


http://gerrit.cloudera.org:8080/#/c/2574/20/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 136: sizes = self.filesystem_client.get_all_file_sizes(DIR)
Change made as part of fix for IMPALA-3245.


http://gerrit.cloudera.org:8080/#/c/2574/20/tests/util/filesystem_base.py
File tests/util/filesystem_base.py:

Line 58:   def get_all_file_sizes(self, path):
Added for IMPALA-3245.


http://gerrit.cloudera.org:8080/#/c/2574/20/tests/util/hdfs_util.py
File tests/util/hdfs_util.py:

Line 79:   def get_all_file_sizes(self, path):
Added for IMPALA-3245.


http://gerrit.cloudera.org:8080/#/c/2574/20/tests/util/s3_util.py
File tests/util/s3_util.py:

Line 72:   def get_all_file_sizes(self, path):
Added for IMPALA-3245.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 20
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-26 Thread Sailesh Mukil (Code Review)
Hello Taras Bobrovytsky,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#20).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/CMakeLists.txt
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
A be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_col_stats.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_delimited_text.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_insert_permutation.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_multiple_filesystems.py
M tests/query_test/test_partitioning.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests/shell/test_shell_commandline.py
A tests/util/filesystem_base.py
M te

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-26 Thread Sailesh Mukil (Code Review)
Hello Taras Bobrovytsky,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#19).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/CMakeLists.txt
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
A be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_col_stats.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_delimited_text.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_insert_permutation.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_multiple_filesystems.py
M tests/query_test/test_partitioning.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests/shell/test_shell_commandline.py
A tests/util/filesystem_base.py
M te

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-25 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 18:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2574/18/fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java
File fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java:

Line 52: String fsNameStr = System.getenv("DEFAULT_FS");
Just FYI, had to make this change after rebasing on Anuj's patch.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 18
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-25 Thread Sailesh Mukil (Code Review)
Hello Taras Bobrovytsky,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#18).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/CMakeLists.txt
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
A be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
M fe/src/test/java/com/cloudera/impala/planner/S3PlannerTest.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_partitioning.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
A tests/util/filesystem_base.py
M tests/util/filesystem_utils.py
M tests/util/hdfs_util.py
A tests/util/s3_util.py
60 files changed, 730 insertions(+), 407 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/18
--

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-25 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 17:

Added backend tests "hdfs-util-test" to test FilesystemsMatch()

-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 17
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-25 Thread Sailesh Mukil (Code Review)
Hello Taras Bobrovytsky,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#17).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/CMakeLists.txt
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
A be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_partitioning.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
A tests/util/filesystem_base.py
M tests/util/filesystem_utils.py
M tests/util/hdfs_util.py
A tests/util/s3_util.py
59 files changed, 729 insertions(+), 406 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/17
-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscri

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-25 Thread Sailesh Mukil (Code Review)
Hello Taras Bobrovytsky,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#16).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_partitioning.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
A tests/util/filesystem_base.py
M tests/util/filesystem_utils.py
M tests/util/hdfs_util.py
A tests/util/s3_util.py
57 files changed, 666 insertions(+), 406 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/16
-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Me

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-25 Thread Sailesh Mukil (Code Review)
Hello Taras Bobrovytsky,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#15).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_partitioning.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
A tests/util/filesystem_base.py
M tests/util/filesystem_utils.py
M tests/util/hdfs_util.py
A tests/util/s3_util.py
57 files changed, 666 insertions(+), 406 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/15
-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Me

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-25 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 14:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/2574/14/be/src/util/hdfs-bulk-ops.h
File be/src/util/hdfs-bulk-ops.h:

Line 64: /// Records an error if it happens during an operation.
   :   void AddError(const string& error_msg) const;
> Does this need to be public?
No, I've made it private.


Line 95: HdfsOperationSet(HdfsFsCache::HdfsFsMap* connection_cache);
> Re-instate some comment about ownership / usage of connection_cache
Done


http://gerrit.cloudera.org:8080/#/c/2574/14/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 81: FilesystemsMatch
> can you add some unit tests for this method?
Trying to add tests for this is tricky. I will do it and upload a patch set, 
but meanwhile I don't want to hold up this patch set.


Line 93: return true;
> make one line
Done


Line 94: else
> no need for 'else' if every block returns
Done


Line 95: strncmp
> return strncmp() (and below)
Done


Line 101:   // By this point we know that both the paths have schemes.
> That may be true, but comment that the next step is to compare authorities.
Done


Line 102: += 3
> need to comment what this magic constant is doing.
Done


Line 107: namenode_a_len
> for once I think a ternary would be clearer here:
Done


http://gerrit.cloudera.org:8080/#/c/2574/14/fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
File fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java:

Line 90: if (!isDistributedFileSystem(p1) || !isDistributedFileSystem(p2)) 
return false;
> Doesn't this method assume that p1 and p2 are on the same filesystem? What'
No, p1 and p2 can be on different filesystems, we just check if they are on 
HDFS because arePathsInSameEncryptionZone() is a HDFS specific function. I've 
added another comment to clarify that.


Line 168: Only distributed file systems have different encryption zones.
> I think this comment (or something like it) belongs in arePathsInSameEncypt
Done


Line 168: If the source and
: // destination  are on different file systems, we need to use 
FileUtil.copy() to copy
: // between filesystems.
> "are on different file systems, or in different encryption zones, files can
Done


Line 173: sameEncryptionZone :
:sameFileSystem;
> for formatting this, prefer line breaking after the '=' and putting the who
Done


Line 277: Returns true iff the filesystem is a DistributedFileSystem.
> Comment is wrong
Done


Line 284: Returns true iff the filesystem is a DistributedFileSystem.
> Comment is wrong.
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 14
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-25 Thread Henry Robinson (Code Review)
Henry Robinson has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 14:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/2574/14/be/src/util/hdfs-bulk-ops.h
File be/src/util/hdfs-bulk-ops.h:

Line 64: /// Records an error if it happens during an operation.
   :   void AddError(const string& error_msg) const;
Does this need to be public?


Line 95: HdfsOperationSet(HdfsFsCache::HdfsFsMap* connection_cache);
Re-instate some comment about ownership / usage of connection_cache


http://gerrit.cloudera.org:8080/#/c/2574/14/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 81: FilesystemsMatch
can you add some unit tests for this method?


Line 93: return true;
make one line


Line 94: else
no need for 'else' if every block returns


Line 95: strncmp
return strncmp() (and below)


Line 101:   // By this point we know that both the paths have schemes.
That may be true, but comment that the next step is to compare authorities.


Line 102: += 3
need to comment what this magic constant is doing.


Line 107: namenode_a_len
for once I think a ternary would be clearer here:

int namnode_a_len = (after_authority_a == NULL) ? strlen(path_a) : 
(after_authority_a - path_a)


http://gerrit.cloudera.org:8080/#/c/2574/14/fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
File fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java:

Line 90: if (!isDistributedFileSystem(p1) || !isDistributedFileSystem(p2)) 
return false;
Doesn't this method assume that p1 and p2 are on the same filesystem? What's 
the purpose of checking both paths here?


Line 168: Only distributed file systems have different encryption zones.
I think this comment (or something like it) belongs in 
arePathsInSameEncyptionZone()


Line 168: If the source and
: // destination  are on different file systems, we need to use 
FileUtil.copy() to copy
: // between filesystems.
"are on different file systems, or in different encryption zones, files can't 
be moved from one location to the other and must be copied instead."


Line 173: sameEncryptionZone :
:sameFileSystem;
for formatting this, prefer line breaking after the '=' and putting the whole 
condition on one line where possible.


Line 277: Returns true iff the filesystem is a DistributedFileSystem.
Comment is wrong


Line 284: Returns true iff the filesystem is a DistributedFileSystem.
Comment is wrong.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 14
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-24 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 13:

(4 comments)

I've rebased on Anuj's patch and addressed the comments.

http://gerrit.cloudera.org:8080/#/c/2574/13/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 649:   // that file, it still does not "exist" on S3 and hdfsGetPathInfo() 
will return an error
> this comment is kind of confusing because it explains why running this code
Done


http://gerrit.cloudera.org:8080/#/c/2574/13/be/src/util/hdfs-bulk-ops.cc
File be/src/util/hdfs-bulk-ops.cc:

Line 154:   connection_cache_ = connection_cache;
> if we're going to stash this as a member variable, why not just pass it dur
Done


http://gerrit.cloudera.org:8080/#/c/2574/13/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 91:   return strncmp(path_a, path_b, scheme_a_len) == 0;
> it needs to also check the namenode (or bucket, etc). i.e.:
Yup, I've rebased on Anuj's patch and added checks to accommodate for the 
authority too.


http://gerrit.cloudera.org:8080/#/c/2574/13/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

Line 400:  URI with the scheme 
> I think a more standard way to say it is: "Fully qualified URI".
This is the same as THdfsTable.hdfsBaseDir. That should be fully qualified. 
I'll confirm that by asking someone tomorrow.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 13
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-24 Thread Sailesh Mukil (Code Review)
Hello Taras Bobrovytsky,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#14).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_partitioning.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
A tests/util/filesystem_base.py
M tests/util/filesystem_utils.py
M tests/util/hdfs_util.py
A tests/util/s3_util.py
57 files changed, 668 insertions(+), 406 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/14
-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Me

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-24 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 11:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/2574/11/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

Line 774: InsertStmt
> Nit: Don't refer to a Java-side class here. Just say "INSERT allow writes..
Done


Line 843: mimick
> 'mimic'
Done


http://gerrit.cloudera.org:8080/#/c/2574/11/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 76: bool FilesystemsMatch(const char* path_a, const char* path_b) {
   :   const char* after_scheme_a = strstr(path_a, ":/");
   :   const char* after_scheme_b = strstr(path_b, ":/");
   : 
   :   bool path_a_has_scheme = after_scheme_a != NULL;
   :   bool path_b_has_scheme = after_scheme_b != NULL;
   :   // Currently, our defaultFS is HDFS. If there is no scheme present, 
then that path is a
   :   // HDFS path.
   :   if (!path_a_has_scheme) return IsHdfsPath(path_b);
   :   if (!path_b_has_scheme) return IsHdfsPath(path_a);
   : 
   :   // By this point we know that both the paths have schemes.
   :   bool scheme_a_len = after_scheme_a - path_a;
   :   bool scheme_b_len = after_scheme_b - path_b;
   :   if (scheme_a_len != scheme_b_len) return false;
   :   return strncmp(path_a, path_b, scheme_a_len) == 0;
   : }
> Just realised: this won't work if there are two different HDFS filesystems 
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 11
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-22 Thread Henry Robinson (Code Review)
Henry Robinson has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 11:

Sailesh: ping - let's get this in asap.

-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 11
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-20 Thread Henry Robinson (Code Review)
Henry Robinson has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 11:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/2574/11/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

Line 774: InsertStmt
Nit: Don't refer to a Java-side class here. Just say "INSERT allow writes..."


Line 843: mimick
'mimic'


http://gerrit.cloudera.org:8080/#/c/2574/11/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 76: bool FilesystemsMatch(const char* path_a, const char* path_b) {
   :   const char* after_scheme_a = strstr(path_a, ":/");
   :   const char* after_scheme_b = strstr(path_b, ":/");
   : 
   :   bool path_a_has_scheme = after_scheme_a != NULL;
   :   bool path_b_has_scheme = after_scheme_b != NULL;
   :   // Currently, our defaultFS is HDFS. If there is no scheme present, 
then that path is a
   :   // HDFS path.
   :   if (!path_a_has_scheme) return IsHdfsPath(path_b);
   :   if (!path_b_has_scheme) return IsHdfsPath(path_a);
   : 
   :   // By this point we know that both the paths have schemes.
   :   bool scheme_a_len = after_scheme_a - path_a;
   :   bool scheme_b_len = after_scheme_b - path_b;
   :   if (scheme_a_len != scheme_b_len) return false;
   :   return strncmp(path_a, path_b, scheme_a_len) == 0;
   : }
Just realised: this won't work if there are two different HDFS filesystems 
(because you don't check the authority).

That's ok, because multiple HDFS filesystems aren't in scope to support - but 
if it's easy to do the authority check, might as well do so.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 11
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-20 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 13:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/2574/13/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 649:   // that file, it still does not "exist" on S3 and hdfsGetPathInfo() 
will return an error
this comment is kind of confusing because it explains why running this code is 
okay first, and not until later does it explain why the special case is needed 
at all.   How about rephrasing as below (note however, that it's possible that 
one day S3 will have per-file block sizes, and even variable block sizes):

For S3A, when creating files, hdfsGetPathInfo() cannot be used until the file 
is closed. Since S3A is really an object store, all files have the same "block" 
size, which is the same as the default block size.


http://gerrit.cloudera.org:8080/#/c/2574/13/be/src/util/hdfs-bulk-ops.cc
File be/src/util/hdfs-bulk-ops.cc:

Line 154:   connection_cache_ = connection_cache;
if we're going to stash this as a member variable, why not just pass it during 
construction? does it change between Execute() calls?


http://gerrit.cloudera.org:8080/#/c/2574/13/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 91:   return strncmp(path_a, path_b, scheme_a_len) == 0;
it needs to also check the namenode (or bucket, etc). i.e.:

hdfs://namenode0:8020/my/file
hdfs://namenode1:8020/other/file
hdfs://namenode0:8021/another/file

are all different filesystems. And you won't be able to do that without knowing 
the defaultfs (i.e. Anuj's patch).


http://gerrit.cloudera.org:8080/#/c/2574/13/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

Line 400:  URI with the scheme 
I think a more standard way to say it is: "Fully qualified URI".

But how do we guarantee that this is actually fully qualified?


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 13
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-19 Thread Henry Robinson (Code Review)
Henry Robinson has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 11:

Dan - could you review this patch please?

-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 11
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-08 Thread Sailesh Mukil (Code Review)
Hello Taras Bobrovytsky,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#13).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
A tests/util/filesystem_base.py
M tests/util/filesystem_utils.py
M tests/util/hdfs_util.py
A tests/util/s3_util.py
56 files changed, 645 insertions(+), 405 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/13
-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id:

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-07 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 11:

(7 comments)

That sounds good to me. But just FYI, Anuj is on PTO today and tomorrow. Which 
means we will need to wait till Monday till we can get this in.

http://gerrit.cloudera.org:8080/#/c/2574/11/testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
File 
testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test:

Line 4: '$FILESYSTEM_PREFIX/test-warehouse/insert_premutation_test'
> typo in the location.
Good catch. Yes, we'll no longer need to explicitly specify FILESYSTEM_PREFIX 
after the defaultFS patch.


http://gerrit.cloudera.org:8080/#/c/2574/11/testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
File 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test:

Line 33
> Are we losing coverage by not having a test that tries to INSERT into a non
We are losing out on testing the path that returns an error if we use an 
unsupported filesystem during an INSERT.
However, we don't have a filesystem now that we can CREATE a table in but not 
INSERT.


http://gerrit.cloudera.org:8080/#/c/2574/11/tests/common/impala_test_suite.py
File tests/common/impala_test_suite.py:

Line 120: # Filesystem client acts as a generic client to supported 
filesystems.
> Comment is a bit redundant.
Done


http://gerrit.cloudera.org:8080/#/c/2574/11/tests/metadata/test_ddl.py
File tests/metadata/test_ddl.py:

Line 70:   @SkipIfS3.qualified_path
> FWIW, I just filed https://issues.cloudera.org/browse/IMPALA-3313 to deal w
Yes, that sounds like a much better opt-in/out process than what we have now.


http://gerrit.cloudera.org:8080/#/c/2574/11/tests/query_test/test_cancellation.py
File tests/query_test/test_cancellation.py:

Line 160:   @SkipIfS3.jira
> What's the JIRA?
It hits IMPALA-551 sometimes. But I just realized that I don't need to skip it 
for that. It gets xfailed automatically.


http://gerrit.cloudera.org:8080/#/c/2574/11/tests/util/filesystem_base.py
File tests/util/filesystem_base.py:

Line 1: 
> Remove blank line
Done


http://gerrit.cloudera.org:8080/#/c/2574/11/tests/util/filesystem_utils.py
File tests/util/filesystem_utils.py:

Line 32: speicfic
> specific
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 11
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-07 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/util/hdfs-bulk-ops.h
File be/src/util/hdfs-bulk-ops.h:

Line 130:   Promise promise_;
> I was thinking that the caller of Wait() would just check the error status 
Yea that makes sense. I've made a change to use the CountingBarrier and checked 
the error status after Wait().


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 10
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-07 Thread Sailesh Mukil (Code Review)
Hello Taras Bobrovytsky,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#12).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
A tests/util/filesystem_base.py
M tests/util/filesystem_utils.py
M tests/util/hdfs_util.py
A tests/util/s3_util.py
56 files changed, 645 insertions(+), 405 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/12
-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id:

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-06 Thread Henry Robinson (Code Review)
Henry Robinson has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 11:

(7 comments)

Looks pretty good to me! I think we should let the default FS patch get in 
first, and then do any follow-on work required for that here. Does that sound 
ok with you?

http://gerrit.cloudera.org:8080/#/c/2574/11/testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
File 
testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test:

Line 4: '$FILESYSTEM_PREFIX/test-warehouse/insert_premutation_test'
typo in the location.

Presumably this requirement to specify the location goes away if S3 is the 
default FS?


http://gerrit.cloudera.org:8080/#/c/2574/11/testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
File 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test:

Line 33
Are we losing coverage by not having a test that tries to INSERT into a 
non-insertable filesystem?


http://gerrit.cloudera.org:8080/#/c/2574/11/tests/common/impala_test_suite.py
File tests/common/impala_test_suite.py:

Line 120: # Filesystem client acts as a generic client to supported 
filesystems.
Comment is a bit redundant.


http://gerrit.cloudera.org:8080/#/c/2574/11/tests/metadata/test_ddl.py
File tests/metadata/test_ddl.py:

Line 70:   @SkipIfS3.qualified_path
FWIW, I just filed https://issues.cloudera.org/browse/IMPALA-3313 to deal with 
having to add multiple decorators to each test to manually select them in or 
out for different filesystems.


http://gerrit.cloudera.org:8080/#/c/2574/11/tests/query_test/test_cancellation.py
File tests/query_test/test_cancellation.py:

Line 160:   @SkipIfS3.jira
What's the JIRA?


http://gerrit.cloudera.org:8080/#/c/2574/11/tests/util/filesystem_base.py
File tests/util/filesystem_base.py:

Line 1: 
Remove blank line


http://gerrit.cloudera.org:8080/#/c/2574/11/tests/util/filesystem_utils.py
File tests/util/filesystem_utils.py:

Line 32: speicfic
specific


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 11
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-06 Thread Henry Robinson (Code Review)
Henry Robinson has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/util/hdfs-bulk-ops.h
File be/src/util/hdfs-bulk-ops.h:

Line 130:   Promise promise_;
> I think the reason a promise was used here instead of a counting barrier wa
I was thinking that the caller of Wait() would just check the error status 
after Wait() returned. Using Promise<> for this is fine if there's a need to 
create an actual channel (maybe producer and consumer don't share any other 
state than the promise), but in this case I don't think that's right.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 10
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-05 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 10:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

Line 839: There is no directory structure in S3, so "inheriting" permissions is 
not
:   // possible.
> S3 has per-object permissions (in the form of ACLs see http://docs.aws.amaz
That is possible, but since there is no directory structure, we don't have 
anywhere to inherit permissions from.

An alternative way of doing it would be to save the partition's default 
permissions in a hidden file or something similar (just basically someplace to 
hold the state), and every time we insert, we "inherit" permissions from this 
file.
Or maybe just stat an existing object and copy over those permissions (this 
could have other complications though).

I'll leave this as a TODO for now and address it in a following patch.


Line 875: stringstream ss;
:   ss << "Error(s) deleting partition directories. First error 
(of "
:  << partition_create_ops.errors().size() << ") was: " << 
err.second;
> While you are here, could you clean up he use of stringstream in favour of 
Done


http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/util/hdfs-bulk-ops.cc
File be/src/util/hdfs-bulk-ops.cc:

Line 111: .c_str()
> not really your change, but I think you can get rid of .c_str() in all the 
Done


Line 128: string error_msg = connection_status.ok() ? GetStrErrMsg() :
> put the line break here, and put the whole ternary statement on the next li
Done


Line 156:   if (connection_cache != NULL) connection_cache_ = connection_cache;
> You should do this even if connection_cache is NULL - the caller might be t
Done


http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/util/hdfs-bulk-ops.h
File be/src/util/hdfs-bulk-ops.h:

Line 94:   /// Constructs a new operation set.
> comment is now redundant, can be removed.
Done


Line 109:   /// If abort_on_error is true, execution will finish after the 
first error seen.
> update comment to explain connection_cache
Done


Line 121: return connection_cache_;
> one line
Done


Line 130:   Promise promise_;
> Can you replace this with a CountingBarrier? I think that will be more natu
I think the reason a promise was used here instead of a counting barrier was 
because a CountingBarrier cannot return an error. And we need the promise to 
notify us of an error if it does happen.

I can modify CountingBarrier to maintain its current behavior and optionally 
return an error if necessary. Should I do that instead?


http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 77: // TODO: 's3a' is the smallest scheme that our filesystem supports, so 
we compare the
   :   // first 6 characters. Change if we support a filesystem with a 
smaller scheme.
> Comment is out-of-date
Done


Line 85:   // HDFS path.
> this isn't a valid assumption after Anuj's change.
Once Anuj pushes his change, I'll rebase and submit the final version. (Or he 
can rebase if I get this patch in first)


http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/util/hdfs-util.h
File be/src/util/hdfs-util.h:

Line 45: IsDfsPath
> I think this should be IsHdfsPath(...), because other filesystems technical
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 10
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-05 Thread Sailesh Mukil (Code Review)
Hello Taras Bobrovytsky,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#11).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS permformance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
A infra/python/deps/boto3-1.2.3-py2.py3-none-any.whl
A infra/python/deps/botocore-1.3.30-py2.py3-none-any.whl
A infra/python/deps/docutils-0.12.tar.gz
A infra/python/deps/futures-3.0.5-py2-none-any.whl
A infra/python/deps/jmespath-0.9.0-py2.py3-none-any.whl
A infra/python/deps/python_dateutil-2.5.2-py2.py3-none-any.whl
M infra/python/deps/requirements.txt
A infra/python/deps/simplejson-3.3.0.tar.gz
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
A tests/util/filesystem_base.py
M tests/util/filesystem_utils.py
M tests/util/hdfs_util.py
A tests/util/s3_util.py
56 files changed, 640 insertions(+), 395 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/11
-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id:

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-04-05 Thread Henry Robinson (Code Review)
Henry Robinson has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 10:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

Line 839: There is no directory structure in S3, so "inheriting" permissions is 
not
:   // possible.
S3 has per-object permissions (in the form of ACLs see 
http://docs.aws.amazon.com/AmazonS3/latest/dev/S3_ACLs_UsingACLs.html). So 
what's to stop us mimicking inherited permissions? Even though there's no 
directory structure, we could insure that permissions flow down the partition 
hierarchy.

I'm happy to leave this as a TODO, but check my reasoning first.


Line 875: stringstream ss;
:   ss << "Error(s) deleting partition directories. First error 
(of "
:  << partition_create_ops.errors().size() << ") was: " << 
err.second;
While you are here, could you clean up he use of stringstream in favour of 
Substitute()?


http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/util/hdfs-bulk-ops.cc
File be/src/util/hdfs-bulk-ops.cc:

Line 111: .c_str()
not really your change, but I think you can get rid of .c_str() in all the 
logging messages


Line 128: string error_msg = connection_status.ok() ? GetStrErrMsg() :
put the line break here, and put the whole ternary statement on the next line 
if it fits


Line 156:   if (connection_cache != NULL) connection_cache_ = connection_cache;
You should do this even if connection_cache is NULL - the caller might be 
trying to explicitly say "don't use a connection cache".


http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/util/hdfs-bulk-ops.h
File be/src/util/hdfs-bulk-ops.h:

Line 94:   /// Constructs a new operation set.
comment is now redundant, can be removed.


Line 109:   /// If abort_on_error is true, execution will finish after the 
first error seen.
update comment to explain connection_cache


Line 121: return connection_cache_;
one line


Line 130:   Promise promise_;
Can you replace this with a CountingBarrier? I think that will be more natural 
and will simplify some of the logic.


http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 77: // TODO: 's3a' is the smallest scheme that our filesystem supports, so 
we compare the
   :   // first 6 characters. Change if we support a filesystem with a 
smaller scheme.
Comment is out-of-date


http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/util/hdfs-util.h
File be/src/util/hdfs-util.h:

Line 45: IsDfsPath
I think this should be IsHdfsPath(...), because other filesystems technically 
implement the DFS abstraction.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 10
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-03-30 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2574/10/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 85:   // HDFS path.
this isn't a valid assumption after Anuj's change.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 10
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-03-28 Thread Sailesh Mukil (Code Review)
Hello Taras Bobrovytsky,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#10).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 15x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. However, larger INSERTs
come closer to HDFS permformance than smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
M infra/python/deps/requirements.txt
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
A tests/util/filesystem_base.py
M tests/util/filesystem_utils.py
M tests/util/hdfs_util.py
A tests/util/s3_util.py
47 files changed, 621 insertions(+), 377 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/10
-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 10
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-03-28 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 5:

(29 comments)

http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 47: (
> remove
Done


http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

Line 754: HdfsFsCache::HdfsFsMap partition_connection_cache;
> is this per partition or per filesystem? Not clear here what role this play
It's a connection cache for all the filesystems that the partitions connect to. 
So I guess it makes more sense to call it filesystem_connection_cache. Changed 
it.


Line 772: over
> in or on, not over (you partition over a partition function).
Done


Line 772: allows to
> "allow us to" or better "allows writes to tables ..."
Done


Line 773: So we need to open connections to different filesystems as necessary
> this sentence should be where you explain the "one cnxn per fs" requirement
I removed mention of the requirement. See my comment below for explanation.


Line 774: the connections
> "one connection per filesystem"
Done


Line 776: we don't want to open multiple
: // connections to the same filesystem. Doing so could cause 
inconsistencies in the
: // filesystem and return with errors
> this is vague - what could go wrong?
I spent all afternoon trying to recreate the errors that I saw earlier, so that 
I could write a better description of what could go wrong if we use multiple 
connections to a FS. However, I wasn't able to reproduce the errors, which 
leads me to believe that the errors I saw before were probably because of some 
other bugs which I fixed later.
So I'm removing mention of the "inconsistencies", however, it's still better to 
cache connections (so that we're not unnecessarily creating too many 
connections at once).


Line 772: InsertStmt allows to insert in tables that have partitions over 
multiple filesystems.
: // So we need to open connections to different filesystems as 
necessary.
: // We use a local connection cache and populate it with the 
connections to the necessary
: // filesystems before we do the actual operations (copy, move, 
delete, etc.). This is
: // because the operations are done in parallel and we don't want 
to open multiple
: // connections to the same filesystem. Doing so could cause 
inconsistencies in the
: // filesystem and return with errors.
> This comment as a whole would be better before line 754.
I've removed this comment. See CR comment below.


Line 779: partition_connection
> partition_fs_connection (or just fs_connection)
Done


Line 780: RETURN_IF_ERROR
> does this leave behind any cleanup that needs to be done?
No the context of everything that happens in this loop is local. And none of 
the operations actually start until after this loop execution is completed. So 
returning on an error here would not leave any uncleaned state.


Line 798: 
> this blank line can go
Done


Line 799: bool is_s3_path = false;
: if (IsS3APath(part_path.c_str())) is_s3_path = true;
> write this as:
Done


Line 823:  
> indentation (not your change)
isn't it already 2 tabs?


Line 846: !is_s3_path
> move this into the if on line 843, save a level of nesting
Done


Line 864: if (!is_s3_path
> again, combine with above line
Done


http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-bulk-ops.cc
File be/src/util/hdfs-bulk-ops.cc:

Line 55: Status::OK();
> combine declaration with line 57
Done


Line 60: if (connection_status.ok()) {
> instead of this indentation, make a method call "AddError(const string& err
Done


http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-bulk-ops.h
File be/src/util/hdfs-bulk-ops.h:

Line 120: void set_local_connection_cache(HdfsFsCache::HdfsFsMap* 
local_connection_cache) {
: DCHECK(local_connection_cache != NULL);
: local_connection_cache_ = local_connection_cache;
:   }
> how about just making this a parameter to Execute() - is it needed anywhere
No, I've removed this and moved it to Execute().


Line 137: local
> what does it mean to be 'local'? Who owns this object?
I see the confusion. I've removed mentions of 'local' and renamed this field. 
It's local with respect to the process wide global connection cache. But this 
class doesn't care about that. Also, it's not owned. The caller should supply 
the connection cache.


http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 76: const char* src, const char* to_check
> does this ever get called with char*s that aren't derived from string objec
Not currently, no.


Line 77: scheme of our filesystem support
> grammar
Done


Line 79: strncmp
> this needs fixing. You nee

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-03-28 Thread Sailesh Mukil (Code Review)
Hello Taras Bobrovytsky,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2574

to look at the new patch set (#9).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 15x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. However, larger INSERTs
come closer to HDFS permformance than smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
M infra/python/deps/requirements.txt
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
A tests/util/filesystem_base.py
M tests/util/filesystem_utils.py
M tests/util/hdfs_util.py
A tests/util/s3_util.py
47 files changed, 621 insertions(+), 377 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/9
-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 9
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-03-24 Thread Henry Robinson (Code Review)
Henry Robinson has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 5:

(30 comments)

http://gerrit.cloudera.org:8080/#/c/2574/5//COMMIT_MSG
Commit Message:

Line 9: Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
  : functionally enables LOAD DATA and INSERT on S3 without making major
  : changes for the sake of improving performance over S3. This patch also
  : enables both INSERT and LOAD DATA between file systems.
Say something about the approach here. What had to change?


http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 47: (
remove


http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

Line 754: HdfsFsCache::HdfsFsMap partition_connection_cache;
is this per partition or per filesystem? Not clear here what role this plays in 
the method.


Line 772: over
in or on, not over (you partition over a partition function).


Line 772: allows to
"allow us to" or better "allows writes to tables ..."


Line 773: So we need to open connections to different filesystems as necessary
this sentence should be where you explain the "one cnxn per fs" requirement.


Line 774: the connections
"one connection per filesystem"


Line 772: InsertStmt allows to insert in tables that have partitions over 
multiple filesystems.
: // So we need to open connections to different filesystems as 
necessary.
: // We use a local connection cache and populate it with the 
connections to the necessary
: // filesystems before we do the actual operations (copy, move, 
delete, etc.). This is
: // because the operations are done in parallel and we don't want 
to open multiple
: // connections to the same filesystem. Doing so could cause 
inconsistencies in the
: // filesystem and return with errors.
This comment as a whole would be better before line 754.


Line 776: we don't want to open multiple
: // connections to the same filesystem. Doing so could cause 
inconsistencies in the
: // filesystem and return with errors
this is vague - what could go wrong?


Line 779: partition_connection
partition_fs_connection (or just fs_connection)


Line 780: RETURN_IF_ERROR
does this leave behind any cleanup that needs to be done?


Line 798: 
this blank line can go


Line 799: bool is_s3_path = false;
: if (IsS3APath(part_path.c_str())) is_s3_path = true;
write this as:

  bool is_s3_path = IsS3APath(part_path.c_str());


Line 823:  
indentation (not your change)


Line 846: !is_s3_path
move this into the if on line 843, save a level of nesting


Line 864: if (!is_s3_path
again, combine with above line


http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-bulk-ops.cc
File be/src/util/hdfs-bulk-ops.cc:

Line 55: Status::OK();
combine declaration with line 57


Line 60: if (connection_status.ok()) {
instead of this indentation, make a method call "AddError(const string& 
error_msg)" which does lines 102 -> 126 and call it here and return (after 
calling MarkOneOpDone())


http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-bulk-ops.h
File be/src/util/hdfs-bulk-ops.h:

Line 120: void set_local_connection_cache(HdfsFsCache::HdfsFsMap* 
local_connection_cache) {
: DCHECK(local_connection_cache != NULL);
: local_connection_cache_ = local_connection_cache;
:   }
how about just making this a parameter to Execute() - is it needed anywhere 
else?


Line 137: local
what does it mean to be 'local'? Who owns this object?


http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 76: const char* src, const char* to_check
does this ever get called with char*s that aren't derived from string objects?


Line 77: scheme of our filesystem support
grammar


Line 79: strncmp
this needs fixing. You need to extract the scheme (by searching for "://" I 
think) and do this properly.


Line 82: bool SpansMultipleFilesystems(const char* pathA, const char* pathB) {
   :   return !IsPathOnFilesystem(pathA, pathB);
   : }
remove this method. Have the caller just call IsPathOnFilesystem().


http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-util.h
File be/src/util/hdfs-util.h:

Line 51: IsPathOnFilesystem
All paths are on some filesystem. Call this "FilesystemsMatch" or something.

What should this return if either path doesn't have a scheme component?


Line 51: to_check
'to_check' doesn't really tell me what this is (and the type doesn't help). 
Obviously by context it's a path, but still better to show this in the variable 
name.

Also, since this method is symmetric (i.e. doesn't matter what order the 
arguments are given), better just to name them symm

[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-03-22 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 8: Code-Review+1

I reviewed the Python side, it looks good to me now.

-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 8
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-03-22 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has uploaded a new patch set (#8).

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 15x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. However, larger INSERTs
come closer to HDFS permformance than smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/ImpalaInternalService.thrift
M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
M fe/src/main/java/com/cloudera/impala/service/Frontend.java
M infra/python/deps/requirements.txt
M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
M 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
M testdata/workloads/tpch/queries/insert_parquet.test
M tests/common/impala_test_suite.py
M tests/common/skip.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_compute_stats.py
M tests/metadata/test_ddl.py
M tests/metadata/test_explain.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_last_ddl_time_update.py
M tests/metadata/test_load.py
M tests/metadata/test_partition_metadata.py
M tests/metadata/test_recover_partitions.py
M tests/metadata/test_show_create_table.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_join_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
A tests/util/filesystem_base.py
M tests/util/filesystem_utils.py
M tests/util/hdfs_util.py
A tests/util/s3_util.py
47 files changed, 599 insertions(+), 359 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/8
-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 8
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 


[Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-03-22 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
..


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2574/7/tests/util/hdfs_util.py
File tests/util/hdfs_util.py:

Line 108: exists
> I tested this on my machine and it does not work (returns None instead of T
As you mentioned in person, it points to _pyweb_hdfs_client_exists(). I've 
changed it to just have this function exists() which stats the file/dir to 
check for it's existence. This should sort out the confusion.


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes