[Impala-ASF-CR] Enabling end-to-end tests on a remote cluster

2016-10-26 Thread Harrison Sheinblatt (Code Review)
Harrison Sheinblatt has posted comments on this change.

Change subject: Enabling end-to-end tests on a remote cluster
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4769/1/bin/remote_data_load.py
File bin/remote_data_load.py:

PS1, Line 365: main
> I'm having a bit of trouble parsing this sentence. Can you clarify?
With the parser options defined directly in main(), it is difficult to invoke 
the main() logic from another Python script without shelling out to execute the 
script as a subprocess. If you instead define the parser options in a separate 
method, and create a method that does all the logic in main() but takes the 
args as a parameter, then another Python program could build an arg dictionary 
and invoke the main logic directly, with no need to shell out.
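A minimal sketch of that split, to make the suggestion concrete. This is not the code in remote_data_load.py: the use of argparse, the function bodies, and the returned list of steps are assumptions for illustration; only the cm_host argument and the --cm-user and --no-load flags come from the script's usage text.

```python
import argparse


def parse_options(argv=None):
    """Parse the command line into a plain dict (argparse is an assumption)."""
    parser = argparse.ArgumentParser(description="Remote data load (sketch)")
    parser.add_argument("cm_host", help="Cloudera Manager host")
    parser.add_argument("--cm-user", help="Cloudera Manager admin user")
    parser.add_argument("--no-load", action="store_true",
                        help="Do not try to load the snapshot")
    return vars(parser.parse_args(argv))


def run(options):
    """Everything main() used to do, driven only by the options dict."""
    # Placeholder logic: the real script would talk to the cluster here.
    steps = []
    if not options.get("no_load"):
        steps.append("load snapshot via %s" % options["cm_host"])
    return steps


def main():
    # Command-line entry point: parse, then delegate.
    return run(parse_options())
```

Another Python program can then skip the parser entirely, for example run({"cm_host": "cm.example.com", "no_load": False}), where the host name is a made-up value.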


-- 
To view, visit http://gerrit.cloudera.org:8080/4769
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: David Knupp 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Harrison Sheinblatt 
Gerrit-Reviewer: Martin Grund 
Gerrit-Reviewer: Michael Brown 
Gerrit-HasComments: Yes


[Impala-ASF-CR] Enabling end-to-end tests on a remote cluster

2016-10-26 Thread David Knupp (Code Review)
David Knupp has posted comments on this change.

Change subject: Enabling end-to-end tests on a remote cluster
..


Patch Set 1:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/4769/1/bin/remote_data_load.py
File bin/remote_data_load.py:

PS1, Line 142: fe
> Does this mean we're still using the 3rd party client libraries with these 
Nope, this is the path to where we keep our config files. We're just literally 
overwriting some of these files on the client with the same files downloaded 
from the cluster.

  $ ls fe/src/test/resources/*.xml -l
  lrwxrwxrwx 1 dknupp dknupp    78 Oct 25 18:22 fe/src/test/resources/core-site.xml -> /home/dknupp/Impala/testdata/cluster/cdh5/node-1/etc/hadoop/conf/core-site.xml
  -rw-rw-r-- 1 dknupp dknupp  1985 Oct 25 18:22 fe/src/test/resources/hbase-site.xml
  lrwxrwxrwx 1 dknupp dknupp    78 Oct 25 18:22 fe/src/test/resources/hdfs-site.xml -> /home/dknupp/Impala/testdata/cluster/cdh5/node-1/etc/hadoop/conf/hdfs-site.xml
  -rw-rw-r-- 1 dknupp dknupp 67730 Oct 18 18:18 fe/src/test/resources/hive-default.xml
  -rw-rw-r-- 1 dknupp dknupp  4728 Oct 25 18:22 fe/src/test/resources/hive-site.xml
  -rw-rw-r-- 1 dknupp dknupp  1976 Oct 25 18:22 fe/src/test/resources/sentry-site.xml


PS1, Line 149: service
> I believe the Cluster object in comparisons/cluster.py has helper methods f
Going to leave this for a later investigation.


PS1, Line 160: settings required for data loading
> It would be good to document here exactly what is returned, and an explanat
Done


PS1, Line 224: environment
> Is there a reason to update the current environment rather than create an e
My presumption is that we set environment variables here because "that's how 
it's done" under our current model.

That said, I don't think the current environment really gets updated, right? 
Python gets forked as a child process for the shell, and the environment gets 
set for the life span of the script. I agree that it seems a bit hacky, but it 
shouldn't have a persistent effect on one's environment.
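A small sketch of that behaviour, under the assumption that the script sets variables through os.environ. The variable name is hypothetical; the point is only that a change made inside a child Python process is not visible to the process that launched it.

```python
import os
import subprocess
import sys

VAR = "DEMO_DATA_LOAD_SETTING"  # hypothetical variable name

# Code run in the child: set the variable, then print it back.
SET_AND_PRINT = (
    "import os; os.environ['%s'] = 'from-child'; print(os.environ['%s'])"
    % (VAR, VAR)
)


def child_sets_variable():
    """Run a child Python process that modifies its own environment."""
    out = subprocess.check_output([sys.executable, "-c", SET_AND_PRINT])
    return out.decode().strip()


os.environ.pop(VAR, None)            # make sure the parent starts clean
seen_in_child = child_sets_variable()
seen_in_parent = os.environ.get(VAR)  # still unset after the child exits
```

The child sees its own change, while the parent's environment is untouched once the child exits.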


PS1, Line 266: load
> Might be good to time this at least overall.  Even if we just log the total
I added a decorator that we can use on various functions. It might be handy 
when/if this script gets refactored, to time various parts or stages of it.

For right now, it just logs the time as you requested, but we can change the 
decorator to do something more intelligent at any time, e.g., record time in a 
DB for eventual trending, etc.
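For reference, a timing decorator of this kind might look roughly like the sketch below. It is not the decorator from the patch: the names timed and load, the logger name, and the message format are all assumptions.

```python
import functools
import logging
import time

LOG = logging.getLogger("remote_data_load_sketch")  # hypothetical logger name


def timed(func):
    """Log how long the wrapped function takes, even if it raises."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            LOG.info("%s took %.2f seconds", func.__name__, time.time() - start)
    return wrapper


@timed
def load(n):
    # Stand-in for a real data-loading step.
    return sum(range(n))
```

Calling logging.basicConfig(level=logging.INFO) before load(10) would make the timing line visible. Recording the time somewhere else later only requires changing the body of wrapper.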


PS1, Line 278: INFO A
> What does this mean?
You know, I'm not sure. I think Martin may have just been marking when certain 
phases completed, or testing the logger setup. I'll remove it.


PS1, Line 281: logger
> Two blank lines before this line, probably remove at least one.
Done


PS1, Line 296: INFO B
> This must relate to INFO A above, but what does it mean?
Removed.


PS1, Line 297: chmod
> Are we re-setting these permissions at the end, or do we know that tests do
I'm not sure, but as elsewhere, I've filed a JIRA to investigate at a later 
time.


PS1, Line 315: Re-load
> Does this mean it was already loaded and now it's being loaded again?  Why?
I'm not sure, but I can't actually get this far into the script now, owing to 
the breakages introduced by the latest Kudu changes. I'll have to make a note 
to look into this once we fix IMPALA-4365.


PS1, Line 335: test
> This seems to not belong in this class; it doesn't do any data load.
This may be here due to the fact that, running as part of the forked child 
python process, it can make use of the environment changes from before. I'm 
going to leave this in place for now, with the idea that we can refactor it out 
at a later time. JIRA has been filed.


PS1, Line 365: main
> If we have a parse_options() method a run(parsed_options) method, then you 
I'm having a bit of trouble parsing this sentence. Can you clarify?


PS1, Line 393: test
> This seems to belong elsewhere.  Why does it go here?
See the reply from above.


http://gerrit.cloudera.org:8080/#/c/4769/1/testdata/bin/compute-table-stats.sh
File testdata/bin/compute-table-stats.sh:

PS1, Line 27: IMPALAD
> Can you reference the Jira in a comment?
Yup, a comment was added. I think you may have been looking at an older patch.


http://gerrit.cloudera.org:8080/#/c/4769/1/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

PS1, Line 38: HS2_HOST_PORT
> Is it reasonable to add a comment referencing the Jira here?
Possible you were looking at an older patch. A comment has been added to the 
code.


http://gerrit.cloudera.org:8080/#/c/4769/1/testdata/bin/setup-hdfs-env.sh
File testdata/bin/setup-hdfs-env.sh:

PS1, Line 53: CACHEADMIN_ARGS
> If the is_kerberized block is executed above, then the CACHEADMIN_ARGS would
I feel like some of these comments might be outside of the scope of this 
review, esp. with regard to factoring out the existing is_kerberized block. 
Since I'm not an 

[Impala-ASF-CR] Enabling end-to-end tests on a remote cluster

2016-10-25 Thread Harrison Sheinblatt (Code Review)
Harrison Sheinblatt has posted comments on this change.

Change subject: Enabling end-to-end tests on a remote cluster
..


Patch Set 1:

(3 comments)

Responded to comments.

http://gerrit.cloudera.org:8080/#/c/4769/1/testdata/bin/compute-table-stats.sh
File testdata/bin/compute-table-stats.sh:

PS1, Line 27: IMPALAD
> IMPALA-4346 has been filed.
Can you reference the Jira in a comment?


http://gerrit.cloudera.org:8080/#/c/4769/1/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

PS1, Line 38: HS2_HOST_PORT
> I think the latter is preferable -- corral all the required configs in one 
Is it reasonable to add a comment referencing the Jira here?


http://gerrit.cloudera.org:8080/#/c/4769/1/testdata/bin/setup-hdfs-env.sh
File testdata/bin/setup-hdfs-env.sh:

PS1, Line 53: CACHEADMIN_ARGS
> Clarification: can you be more explicit about the check you want? Something
If the is_kerberized block is executed above, then CACHEADMIN_ARGS would 
include '-owner ${PREVIOUS_USER}'. If HADOOP_USER_NAME is also set, then we 
add another '-owner ${USER}' to this, which probably breaks it. I think there 
are probably 4 bugs:

1) The is_kerberized block above probably isn't supported and should be 
   removed.
2) The CACHEADMIN_ARGS definition logic needs a clear conditional, ideally in 
   a single location, that sets the user/group/owner information so that you 
   can easily tell it's always well-defined. Here the logic appears to be: if 
   it's kerberized, the owner is set one way; if it's not kerberized and the 
   hadoop user is defined, it's set another way; and if it's not kerberized 
   and the hadoop user is not defined, it stays undefined. If we want to keep 
   the is_kerberized logic in one place, then we can have it set another 
   parameter about owner fields and here only update it if it's set already.
3) CACHEADMIN_ARGS is probably the wrong name, as it is for the -addPool 
   command and sets only a subset of the args.
4) We should explicitly set all arguments to cacheadmin -addPool if possible 
   (e.g. mode, maxTtl).
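To show the three-way conditional in one place, here is a sketch in Python rather than in the shell of setup-hdfs-env.sh. The function name and parameters are hypothetical, and giving the kerberized case priority is an assumption based on the reading of the existing logic above.

```python
def add_pool_owner_args(is_kerberized, previous_user=None,
                        hadoop_user_name=None, user=None):
    """Return the owner arguments for -addPool, decided in a single place."""
    if is_kerberized:
        # Kerberized cluster: owner is the previous user.
        return ["-owner", previous_user]
    if hadoop_user_name:
        # Not kerberized, but a hadoop user is defined: owner is the current user.
        return ["-owner", user]
    # Neither case applies: leave the owner unset.
    return []
```

Because exactly one branch runs, the result can never contain two -owner arguments.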



Gerrit-MessageType: comment
Gerrit-Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: David Knupp 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Harrison Sheinblatt 
Gerrit-Reviewer: Martin Grund 
Gerrit-Reviewer: Michael Brown 
Gerrit-HasComments: Yes


[Impala-ASF-CR] Enabling end-to-end tests on a remote cluster

2016-10-25 Thread David Knupp (Code Review)
David Knupp has posted comments on this change.

Change subject: Enabling end-to-end tests on a remote cluster
..


Patch Set 4:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/4769/1/bin/remote_data_load.py
File bin/remote_data_load.py:

PS1, Line 88: RemoteDataLoad
> I'd separate out the common functionality needed for dealing with remote cl
IMPALA-4367 has been filed.


PS1, Line 132: v10
> Hardcoding v10.  Is this necessary?  I think URL may be missing with later 
IMPALA-4367 has been filed.


PS1, Line 155: get_service_client_configurations
> A lot of this seems like it could be in comparisons/cluster.py, or at least
IMPALA-4367 has been filed.


PS1, Line 212: find_snapshot_file
> It would be good to start converting the snapshot file management into pyth
IMPALA-4367 has been filed.



Gerrit-MessageType: comment
Gerrit-Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: David Knupp 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Harrison Sheinblatt 
Gerrit-Reviewer: Martin Grund 
Gerrit-Reviewer: Michael Brown 
Gerrit-HasComments: Yes


[Impala-ASF-CR] Enabling end-to-end tests on a remote cluster

2016-10-24 Thread David Knupp (Code Review)
David Knupp has uploaded a new patch set (#4).

Change subject: Enabling end-to-end tests on a remote cluster
..

Enabling end-to-end tests on a remote cluster

This patch enables data loading and running end-to-end tests on a remote
cluster. The requirements to run the tests on a remote cluster are

  - CDH cluster that is CM managed
  - KMS and KeyTrustee installed and available as service
  - Hive warehouse dir points to /test-warehouse

The new remote_data_load.py script takes a CM host as argument and will
load the test warehouse snapshot on the first cluster managed by this
instance of CM. It will automatically pick the necessary configuration
needed to perform the data load process.

Usage: remote_data_load.py [options] cm_host

Options:
  -h, --help            show this help message and exit
  --cm-user=CM_USER     Cloudera Manager admin user
  --cm-pass=CM_PASS     Cloudera Manager admin user password
  --gateway=GATEWAY     Gateway host to upload the data from. If not set, uses
                        the CM host as gateway.
  --ssh-user=SSH_USER   System user on the remote machine with passwordless
                        SSH configured.
  --no-load             Do not try to load the snapshot
  --exploration-strategy=EXPLORATION_STRATEGY
  --test                Run end-to-end tests against cluster

Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
---
M bin/load-data.py
A bin/remote_data_load.py
M testdata/bin/compute-table-stats.sh
M testdata/bin/create-load-data.sh
M testdata/bin/create-table-many-blocks.sh
M testdata/bin/generate-schema-statements.py
M testdata/bin/load-test-warehouse-snapshot.sh
M testdata/bin/load_nested.py
M testdata/bin/run-step.sh
M testdata/bin/setup-hdfs-env.sh
10 files changed, 575 insertions(+), 60 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/4769/4

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: David Knupp 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Harrison Sheinblatt 
Gerrit-Reviewer: Martin Grund 
Gerrit-Reviewer: Michael Brown 


[Impala-ASF-CR] Enabling end-to-end tests on a remote cluster

2016-10-21 Thread David Knupp (Code Review)
David Knupp has uploaded a new patch set (#3).

Change subject: Enabling end-to-end tests on a remote cluster
..

Enabling end-to-end tests on a remote cluster

This patch enables data loading and running end-to-end tests on a remote
cluster. The requirements to run the tests on a remote cluster are

  - CDH cluster that is CM managed
  - KMS and KeyTrustee installed and available as service
  - Hive warehouse dir points to /test-warehouse

The new remote_data_load.py script takes a CM host as argument and will
load the test warehouse snapshot on the first cluster managed by this
instance of CM. It will automatically pick the necessary configuration
needed to perform the data load process.

Usage: remote_data_load.py [options] cm_host

Options:
  -h, --help            show this help message and exit
  --cm-user=CM_USER     Cloudera Manager admin user
  --cm-pass=CM_PASS     Cloudera Manager admin user password
  --gateway=GATEWAY     Gateway host to upload the data from. If not set, uses
                        the CM host as gateway.
  --ssh-user=SSH_USER   System user on the remote machine with passwordless
                        SSH configured.
  --no-load             Do not try to load the snapshot
  --exploration-strategy=EXPLORATION_STRATEGY
  --test                Run end-to-end tests against cluster

Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
---
M bin/load-data.py
A bin/remote_data_load.py
M testdata/bin/compute-table-stats.sh
M testdata/bin/create-load-data.sh
M testdata/bin/create-table-many-blocks.sh
M testdata/bin/generate-schema-statements.py
M testdata/bin/load-test-warehouse-snapshot.sh
M testdata/bin/load_nested.py
M testdata/bin/run-step.sh
M testdata/bin/setup-hdfs-env.sh
10 files changed, 574 insertions(+), 58 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/4769/3

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: David Knupp 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Harrison Sheinblatt 
Gerrit-Reviewer: Martin Grund 
Gerrit-Reviewer: Michael Brown