[Impala-ASF-CR] IMPALA-4365: Enabling end-to-end tests on a remote cluster

2016-11-07 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-4365: Enabling end-to-end tests on a remote cluster
..


Patch Set 14: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/4769
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Gerrit-PatchSet: 14
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: David Knupp 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Harrison Sheinblatt 
Gerrit-Reviewer: Martin Grund 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4365: Enabling end-to-end tests on a remote cluster

2016-11-07 Thread David Knupp (Code Review)
David Knupp has posted comments on this change.

Change subject: IMPALA-4365: Enabling end-to-end tests on a remote cluster
..


Patch Set 13:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/4769/13//COMMIT_MSG
Commit Message:

Line 51: still problems to work out with the remote data load script itself.
> Did you try loading data on a remote cluster and running tests on in with t
Yes, many times. I should update this sentence to be more clear.

This is mainly a references to the several "clean up" changes that Harrison 
suggested earlier, for which JIRA's have been opened. We can address those when 
there's time. More pressing than cleaning up all those things is fact that we 
need to have this checked in order to validate Impala running against a remote 
CDH 5.10 cluster, and time is getting short. We have less than two weeks now.

There were some other actual problems that were mysterious to me initially. 
E.g., Kudu related failures started appearing once recent Kudu changes were 
submitted -- until I realized that this issue was breaking things: 

https://jira.cloudera.com/browse/OPSAPS-37322

But after tweaking the cluster, data loading works, and tests run -- though 
many tests may need to be tweaked to work remotely.


http://gerrit.cloudera.org:8080/#/c/4769/13/bin/remote_data_load.py
File bin/remote_data_load.py:

Line 534: sys.exit(1)
> In general, I think it's a bad practice to call sys.exit inside functions. 
OK, I'll move this. I'd seen this pattern used here in other scripts here 
(e.g., load-data.py that we use for local data loading), so didn't know it was 
a frowned upon practice.


http://gerrit.cloudera.org:8080/#/c/4769/13/testdata/datasets/functional/schema_constraints.csv
File testdata/datasets/functional/schema_constraints.csv:

PS13, Line 120: Wide tables fail due to the SERDEPROPERTIES limits
> is this a new issue? Is it specific to remote data loading?
For our mini-cluster, we work around this problem here:

https://github.com/apache/incubator-impala/blob/master/bin/create-test-configuration.sh#L99

However, create-test-configuration.sh is part of our local build process. It 
doesn't get called when CDH is deployed to a remote cluster. Besides, that 
script assumes that the metastore database will always be postgres, which is 
not the case when testing against a remote cluster.

Before the change to this file, I had been using another hand-rolled script to 
configure the property separately after deployment. With this, I can drop that 
step.

I've also tested the local data load after this change, and it's unaffected.


-- 
To view, visit http://gerrit.cloudera.org:8080/4769
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Gerrit-PatchSet: 13
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: David Knupp 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Harrison Sheinblatt 
Gerrit-Reviewer: Martin Grund 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4365: Enabling end-to-end tests on a remote cluster

2016-11-07 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-4365: Enabling end-to-end tests on a remote cluster
..


Patch Set 13:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/4769/13//COMMIT_MSG
Commit Message:

Line 51: still problems to work out with the remote data load script itself.
Did you try loading data on a remote cluster and running tests on in with the 
scripts in this patch?


http://gerrit.cloudera.org:8080/#/c/4769/13/bin/remote_data_load.py
File bin/remote_data_load.py:

Line 534: sys.exit(1)
In general, I think it's a bad practice to call sys.exit inside functions. It's 
better to raise an exception and if it does not get caught on the outside, then 
exit will be triggered.


http://gerrit.cloudera.org:8080/#/c/4769/13/testdata/datasets/functional/schema_constraints.csv
File testdata/datasets/functional/schema_constraints.csv:

PS13, Line 120: Wide tables fail due to the SERDEPROPERTIES limits
is this a new issue? Is it specific to remote data loading?

I looked at HIVE-1364 it says it got fixed a few years ago.


-- 
To view, visit http://gerrit.cloudera.org:8080/4769
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Gerrit-PatchSet: 13
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: David Knupp 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Harrison Sheinblatt 
Gerrit-Reviewer: Martin Grund 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4365: Enabling end-to-end tests on a remote cluster

2016-11-05 Thread David Knupp (Code Review)
David Knupp has uploaded a new patch set (#12).

Change subject: IMPALA-4365: Enabling end-to-end tests on a remote cluster
..

IMPALA-4365: Enabling end-to-end tests on a remote cluster

This patch lays the groundwork for loading data and running end-to-end
tests on a remote CDH cluster. The requirements for the cluster to run
the tests are:

  - Managed by Cloudera Manager (CM)
  - GPL Extras need to be installed
  - KMS and KeyTrustee installed and available as a service
  - SERDEPROPERTIES in the Hive DB modified to accept wide tables
  - Hive warehouse dir points to /test-warehouse

The actual data loading is done via a new script, remote_data_load.py,
which takes the CM host as an argument. It can be run from a client
machine that is not a node of the cluster, but it needs to have the
Impala repo checked out and Impala built. This insures that all of the
necessary data load scripts are available, as well as setting up the
environment properly (client binaries like beeline and the hbase shell
are available, python libraries like cm_api are installed, necessary
environment variables are defined, etc.)

It should be noted that running remote_data_load.py will overwrite
any local XML config files with the configurations downloaded from
the remote cluster.

Usage: remote_data_load.py [options] 

Options:
  -h, --helpshow this help message and exit
  --snapshot-file=SNAPSHOT_FILE
Path to the test-warehouse archive
  --cm-user=CM_USER Cloudera Manager admin user
  --cm-pass=CM_PASS Cloudera Manager admin user password
  --gateway=GATEWAY Gateway host to upload the data from. If not
set, uses the CM host as gateway.
  --ssh-user=SSH_USER   System user on the remote machine with
passwordless SSH configured.
  --no-load Do not try to load the snapshot
  --exploration-strategy=EXPLORATION_STRATEGY
  --testRun end-to-end tests against cluster

Testing:

This patch is being submitted with the understanding that there are
still problems to work out with the remote data load script itself.

However, since many of the existing build scripts also had to be
modified, it is more important to make sure that no regressions were
inadvertently introduced into the existing data load process. Loading
data to a local mini-cluster was checked repeatedly while this patch
was being developed, as well as running it against the Jenkins job
that provides the test-warehouse snapshot used by the many other
Impala CI builds that run daily.

Remote data loading is working for the most part, although recent
Kudu-related changes have introduced unforeseen problems:

https://github.com/apache/incubator-impala/commit/041fa6d

In the meantime, setting KUDU_IS_SUPPORTED to false provides a
temporary workaround.

Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
---
M bin/load-data.py
A bin/remote_data_load.py
M testdata/bin/compute-table-stats.sh
M testdata/bin/create-load-data.sh
M testdata/bin/create-table-many-blocks.sh
M testdata/bin/generate-schema-statements.py
M testdata/bin/load-test-warehouse-snapshot.sh
M testdata/bin/load_nested.py
M testdata/bin/run-step.sh
M testdata/bin/setup-hdfs-env.sh
M testdata/datasets/functional/schema_constraints.csv
11 files changed, 796 insertions(+), 64 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/4769/12
-- 
To view, visit http://gerrit.cloudera.org:8080/4769
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Gerrit-PatchSet: 12
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: David Knupp 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Harrison Sheinblatt 
Gerrit-Reviewer: Martin Grund 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Taras Bobrovytsky 


[Impala-ASF-CR] IMPALA-4365: Enabling end-to-end tests on a remote cluster

2016-11-03 Thread David Knupp (Code Review)
David Knupp has uploaded a new patch set (#11).

Change subject: IMPALA-4365: Enabling end-to-end tests on a remote cluster
..

IMPALA-4365: Enabling end-to-end tests on a remote cluster

This patch lays the groundwork for loading data and running end-to-end
tests on a remote CDH cluster. The requirements for the cluster to run
the tests are:

  - Managed by Cloudera Manager (CM)
  - GPL Extras need to be installed
  - KMS and KeyTrustee installed and available as a service
  - SERDEPROPERTIES in the Hive DB modified to accept wide tables
  - Hive warehouse dir points to /test-warehouse

The actual data loading is done via a new script, remote_data_load.py,
which takes the CM host as an argument. It can be run from a client
machine that is not a node of the cluster, but it needs to have the
Impala repo checked out and Impala built. This insures that all of the
necessary data load scripts are available, as well as setting up the
environment properly (client binaries like beeline and the hbase shell
are available, python libraries like cm_api are installed, necessary
environment variables are defined, etc.)

It should be noted that running remote_data_load.py will overwrite
any local XML config files with the configurations downloaded from
the remote cluster.

Usage: remote_data_load.py [options] 

Options:
  -h, --helpshow this help message and exit
  --snapshot-file=SNAPSHOT_FILE
Path to the test-warehouse archive
  --cm-user=CM_USER Cloudera Manager admin user
  --cm-pass=CM_PASS Cloudera Manager admin user password
  --gateway=GATEWAY Gateway host to upload the data from. If not
set, uses the CM host as gateway.
  --ssh-user=SSH_USER   System user on the remote machine with
passwordless SSH configured.
  --no-load Do not try to load the snapshot
  --exploration-strategy=EXPLORATION_STRATEGY
  --testRun end-to-end tests against cluster

Testing:

This patch is being submitted with the understanding that there are
still problems to work out with the remote data load script itself.

However, since many of the existing build scripts also had to be
modified, it is more important to make sure that no regressions were
inadvertently introduced into the existing data load process. Loading
data to a local mini-cluster was checked repeatedly while this patch
was being developed, as well as running it against the Jenkins job
that provides the test-warehouse snapshot used by the many other
Impala CI builds that run daily.

Remote data loading is working for the most part, although recent
Kudu-related changes have introduced unforeseen problems:

https://github.com/apache/incubator-impala/commit/041fa6d

In the meantime, setting KUDU_IS_SUPPORTED to false provides a
temporary workaround.

Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
---
M bin/load-data.py
A bin/remote_data_load.py
M testdata/bin/compute-table-stats.sh
M testdata/bin/create-load-data.sh
M testdata/bin/create-table-many-blocks.sh
M testdata/bin/generate-schema-statements.py
M testdata/bin/load-test-warehouse-snapshot.sh
M testdata/bin/load_nested.py
M testdata/bin/run-step.sh
M testdata/bin/setup-hdfs-env.sh
10 files changed, 791 insertions(+), 64 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/4769/11
-- 
To view, visit http://gerrit.cloudera.org:8080/4769
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: David Knupp 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Harrison Sheinblatt 
Gerrit-Reviewer: Martin Grund 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Taras Bobrovytsky 


[Impala-ASF-CR] IMPALA-4365: Enabling end-to-end tests on a remote cluster

2016-11-03 Thread David Knupp (Code Review)
David Knupp has posted comments on this change.

Change subject: IMPALA-4365: Enabling end-to-end tests on a remote cluster
..


Patch Set 9:

(8 comments)

Addressed comments. Also removed another place in remote_data_load.py where 
sys.exit() was called. Finally, fixed a bug that I noticed in the arg parsing 
code.

http://gerrit.cloudera.org:8080/#/c/4769/9/bin/remote_data_load.py
File bin/remote_data_load.py:

PS9, Line 63: 'HDFS',
:  'YARN',
:  'HIVE',
:  'IMPALA',
:  'MAPREDUCE',
:  'KUDU',
:  'HBASE',
:  'ZOOKEEPER
> I think it would be nice to put these in sorted order. (Vim has a nice abil
Done


Line 174: try:
> It's kind of weird to do this through try and except. How about:
Done


PS9, Line 177: missing_svcs
> I suggest not shortening: missing_services
Done


Line 179: sys.exit("Cluster not ready.")
> It's better to raise an exception that exiting right away. It might be unex
Done


Line 181: try:
> it's better to do this with if then instead of assert. (same as above)
Done


PS9, Line 253: svc_type
> I prefer service_type here and elsewhere. It's clearer and easier on the br
Done


Line 379: # 'database_name' is a header from the beeline output.
> Maybe we can skip the header some other way? for example db.list.split()[1:
Done


http://gerrit.cloudera.org:8080/#/c/4769/9/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

Line 482:   if [[ -n "$CM_HOST" ]]; then
> this should be moved right above where it's used (line 491)
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4769
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: David Knupp 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Harrison Sheinblatt 
Gerrit-Reviewer: Martin Grund 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4365: Enabling end-to-end tests on a remote cluster

2016-11-03 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-4365: Enabling end-to-end tests on a remote cluster
..


Patch Set 9:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/4769/9/bin/remote_data_load.py
File bin/remote_data_load.py:

PS9, Line 63: 'HDFS',
:  'YARN',
:  'HIVE',
:  'IMPALA',
:  'MAPREDUCE',
:  'KUDU',
:  'HBASE',
:  'ZOOKEEPER
I think it would be nice to put these in sorted order. (Vim has a nice ability 
to sort lines alphabetically).


Line 174: try:
It's kind of weird to do this through try and except. How about:

if set(REQUIRED_SERVICES) != set(setvices.keys()):


PS9, Line 177: missing_svcs
I suggest not shortening: missing_services


Line 179: sys.exit("Cluster not ready.")
It's better to raise an exception that exiting right away. It might be 
unexpected that exit is called from a method like get_services.


Line 181: try:
it's better to do this with if then instead of assert. (same as above)


PS9, Line 253: svc_type
I prefer service_type here and elsewhere. It's clearer and easier on the brain 
:)


Line 379: # 'database_name' is a header from the beeline output.
Maybe we can skip the header some other way? for example db.list.split()[1:]


http://gerrit.cloudera.org:8080/#/c/4769/9/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

Line 482:   if [[ -n "$CM_HOST" ]]; then
this should be moved right above where it's used (line 491)


-- 
To view, visit http://gerrit.cloudera.org:8080/4769
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: David Knupp 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Harrison Sheinblatt 
Gerrit-Reviewer: Martin Grund 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4365: Enabling end-to-end tests on a remote cluster

2016-11-01 Thread David Knupp (Code Review)
David Knupp has posted comments on this change.

Change subject: IMPALA-4365: Enabling end-to-end tests on a remote cluster
..


Patch Set 9:

Ignore remote_load_data.py in patch set 8. I accidentally pushed an interim 
in-progress version of the file in that patch set. Patch set 9 has the right 
code.

Sorry about the confusion.

-- 
To view, visit http://gerrit.cloudera.org:8080/4769
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: David Knupp 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Harrison Sheinblatt 
Gerrit-Reviewer: Martin Grund 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4365: Enabling end-to-end tests on a remote cluster

2016-11-01 Thread David Knupp (Code Review)
David Knupp has posted comments on this change.

Change subject: IMPALA-4365: Enabling end-to-end tests on a remote cluster
..


Patch Set 7:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/4769/7/bin/remote_data_load.py
File bin/remote_data_load.py:

PS7, Line 17: This script is a setup script
> How about simply "this is a setup script"
Done


PS7, Line 21: The script expects a setup Impala development environment.
> How about:
Done


PS7, Line 22: appropriately
> It's not clear what this means. Maybe expand on this a little?
Done


PS7, Line 47: import logging
> import statements should be above from x import y
Done


PS7, Line 111: ,
> remove
Done


PS7, Line 116: that,
> remove
Done


PS7, Line 127: not
> having a not here makes it harder to reason about, how about:
Done


PS7, Line 130: password=options.cm_pass)
> Is this indent correct according to the style guide? I thought it was suppo
This passes PEP-8 linter inspection.


PS7, Line 204: Only pick the first cluster configured
> This comment does not really add anything, it's obvious what the code is do
Done


PS7, Line 209: for service in cluster.get_all_services():
> Just an observation: it looks like we don't check that all services we need
I think that confirming the readiness of the cluster is a good idea. I'm adding 
a method to do this as part of the init procedure. Right now, all it does is 
check that the required services are installed and running. It can be extended 
later to do other checks as well.


Line 281: # See https://issues.cloudera.org/browse/IMPALA-4365
> It looks like the issue is resolved, is this still necessary?
Removed.


PS7, Line 338: "database_name"
> Why would it ever be called "database_name"? Where is this constant coming 
I figured out it's the header from the beeline output. (Much of this code, I've 
inherited, and I'll confess I didn't catch every line.) I'll add a comment.


PS7, Line 432: wmay
> may
Done


PS7, Line 451: # "--skip_local_tests",
> why is this commented out?
Bascially, this is another remainder from Martin was last working on the 
script. In fact, this entire method may go away (it's not currently working to 
use runtest to run remote cluster tests, but you can do it calilng 
impala-py.test directly.)


http://gerrit.cloudera.org:8080/#/c/4769/7/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

Line 492: run-step "Loading nested data" load-nested.log \
> The command is very similar in both cases. The only difference is --cm-host
Done


http://gerrit.cloudera.org:8080/#/c/4769/7/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

Line 411: if os.getenv("REMOTE_LOAD"):
> base_load_dir = os.getenv("REMOTE_LOAD") or os.getenv("IMPALA_HOME")
Done


http://gerrit.cloudera.org:8080/#/c/4769/7/testdata/bin/load_nested.py
File testdata/bin/load_nested.py:

Line 301:   parser.add_argument("-s", "--source-db", default="tpch_parquet")
> don't we also need cm-host parameter here?
Yup, but we pick that up from L300:


  cli_options.add_cluster_options(parser)


Good eye though. I'll add a comment.


http://gerrit.cloudera.org:8080/#/c/4769/7/testdata/bin/setup-hdfs-env.sh
File testdata/bin/setup-hdfs-env.sh:

Line 20: set -e
> we are already setting -euo pipefail below
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4769
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: David Knupp 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Harrison Sheinblatt 
Gerrit-Reviewer: Martin Grund 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4365: Enabling end-to-end tests on a remote cluster

2016-11-01 Thread David Knupp (Code Review)
David Knupp has uploaded a new patch set (#8).

Change subject: IMPALA-4365: Enabling end-to-end tests on a remote cluster
..

IMPALA-4365: Enabling end-to-end tests on a remote cluster

This patch lays the groundwork for loading data and running end-to-end
tests on a remote CDH cluster. The requirements for the cluster to run
the tests are:

  - Managed by Cloudera Manager (CM)
  - GPL Extras need to be installed
  - KMS and KeyTrustee installed and available as a service
  - SERDEPROPERTIES in the Hive DB modified to accept wide tables
  - Hive warehouse dir points to /test-warehouse

The actual data loading is done via a new script, remote_data_load.py,
which takes the CM host as an argument. It can be run from a client
machine that is not a node of the cluster, but it needs to have the
Impala repo checked out and Impala built. This insures that all of the
necessary data load scripts are available, as well as setting up the
environment properly (client binaries like beeline and the hbase shell
are available, python libraries like cm_api are installed, necessary
environment variables are defined, etc.)

It should be noted that running remote_data_load.py will overwrite
any local XML config files with the configurations downloaded from
the remote cluster.

Usage: remote_data_load.py [options] 

Options:
  -h, --helpshow this help message and exit
  --snapshot-file=SNAPSHOT_FILE
Path to the test-warehouse archive
  --cm-user=CM_USER Cloudera Manager admin user
  --cm-pass=CM_PASS Cloudera Manager admin user password
  --gateway=GATEWAY Gateway host to upload the data from. If not
set, uses the CM host as gateway.
  --ssh-user=SSH_USER   System user on the remote machine with
passwordless SSH configured.
  --no-load Do not try to load the snapshot
  --exploration-strategy=EXPLORATION_STRATEGY
  --testRun end-to-end tests against cluster

Testing:

This patch is being submitted with the understanding that there are
still problems to work out with the remote data load script itself.

However, since many of the existing build scripts also had to be
modified, it is more important to make sure that no regressions were
inadvertently introduced into the existing data load process. Loading
data to a local mini-cluster was checked repeatedly while this patch
was being developed, as well as running it against the Jenkins job
that provides the test-warehouse snapshot used by the many other
Impala CI builds that run daily.

Remote data loading is working for the most part, although recent
Kudu-related changes have introduced unforeseen problems:

https://github.com/apache/incubator-impala/commit/041fa6d

In the meantime, setting KUDU_IS_SUPPORTED to false provides a
temporary workaround.

Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
---
M bin/load-data.py
A bin/remote_data_load.py
M testdata/bin/compute-table-stats.sh
M testdata/bin/create-load-data.sh
M testdata/bin/create-table-many-blocks.sh
M testdata/bin/generate-schema-statements.py
M testdata/bin/load-test-warehouse-snapshot.sh
M testdata/bin/load_nested.py
M testdata/bin/run-step.sh
M testdata/bin/setup-hdfs-env.sh
10 files changed, 794 insertions(+), 64 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/4769/8
-- 
To view, visit http://gerrit.cloudera.org:8080/4769
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: David Knupp 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Harrison Sheinblatt 
Gerrit-Reviewer: Martin Grund 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Taras Bobrovytsky 


[Impala-ASF-CR] IMPALA-4365: Enabling end-to-end tests on a remote cluster

2016-10-31 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-4365: Enabling end-to-end tests on a remote cluster
..


Patch Set 7:

(20 comments)

http://gerrit.cloudera.org:8080/#/c/4769/7/bin/remote_data_load.py
File bin/remote_data_load.py:

PS7, Line 17: This script is a setup script
How about simply "this is a setup script"


PS7, Line 21: The script expects a setup Impala development environment.
How about:
This script should be executed from a machine that has the Impala development 
environment set up.


PS7, Line 22: appropriately
It's not clear what this means. Maybe expand on this a little?


PS7, Line 47: import logging
import statements should be above from x import y


PS7, Line 111: ,
remove


PS7, Line 116: that,
remove


PS7, Line 127: not
having a not here makes it harder to reason about, how about:
options.gateway if options.gateway else cm_host


PS7, Line 130: password=options.cm_pass)
Is this indent correct according to the style guide? I thought it was supposed 
to be double the regular indent (so 8 spaces)?


PS7, Line 204: Only pick the first cluster configured
This comment does not really add anything, it's obvious what the code is doing. 
Maybe explain WHY we are getting only the first?


PS7, Line 209: for service in cluster.get_all_services():
Just an observation: it looks like we don't check that all services we need are 
present. For example if IMPALA is not in the list, this will not get caught in 
this method. If Impala is not present, there will be an exception on line 275.

Maybe instead of iterating way, we can first construct a dictionary with 
service.type as key and service as value. Then pull out one service after 
another to populate result. If a service is missing it will be caught earlier 
this way.

What do you think?


Line 281: # See https://issues.cloudera.org/browse/IMPALA-4365
It looks like the issue is resolved, is this still necessary?


PS7, Line 284: self.options.snapshot_file is not None:
How about removing the not and switching the if and else clause?


PS7, Line 338: "database_name"
Why would it ever be called "database_name"? Where is this constant coming from?


PS7, Line 432: wmay
may


PS7, Line 451: # "--skip_local_tests",
why is this commented out?


http://gerrit.cloudera.org:8080/#/c/4769/7/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

Line 163:   set -e
Don't we already set -eou at the top of the script? Either remove or add a 
comment why we need to set -e again.


Line 492: run-step "Loading nested data" load-nested.log \
The command is very similar in both cases. The only difference is --cm-host. 
How about factoring that out?


http://gerrit.cloudera.org:8080/#/c/4769/7/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

Line 411: if os.getenv("REMOTE_LOAD"):
base_load_dir = os.getenv("REMOTE_LOAD") or os.getenv("IMPALA_HOME")


http://gerrit.cloudera.org:8080/#/c/4769/7/testdata/bin/load_nested.py
File testdata/bin/load_nested.py:

Line 301:   parser.add_argument("-s", "--source-db", default="tpch_parquet")
don't we also need cm-host parameter here?


http://gerrit.cloudera.org:8080/#/c/4769/7/testdata/bin/setup-hdfs-env.sh
File testdata/bin/setup-hdfs-env.sh:

Line 20: set -e
we are already setting -euo pipefail below


-- 
To view, visit http://gerrit.cloudera.org:8080/4769
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: David Knupp 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Harrison Sheinblatt 
Gerrit-Reviewer: Martin Grund 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: Yes