IMPALA-4188: Leopard: support external Docker volumes

To be able to run the Random Query Generator with Impala and Kudu, we
need to mount an external Docker volume as a workaround to KUDU-1419.
This patch introduces a series of environment variables a user may tweak
in order to help with that purpose. The patch assumes a viable,
reasonable Docker container based on a standard Linux distribution like
Ubuntu 14.
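
As a rough illustration of how these variables feed into container
start-up, the sketch below (Python 3, not the patch's own code) reads the
two path variables with the defaults the patch uses and renders them as
"docker run -v" options. The helper name volume_options is hypothetical.

```python
import os
from os.path import normpath


def volume_options(volume_map):
    """Render a {host_path: container_path} dict as 'docker run' -v options."""
    if not volume_map:
        return ''
    return ' '.join(
        '-v {0}:{1}'.format(host_path, container_path)
        for host_path, container_path in sorted(volume_map.items()))


# Defaults mirror those in the patch; both can be overridden in the environment.
host_path = normpath(os.environ.get(
    'HOST_TESTDATA_EXTERNAL_VOLUME_PATH', '/data/1/dockervols/cluster'))
container_path = normpath(os.environ.get(
    'DOCKER_TESTDATA_VOLUME_PATH', '/home/dev/Impala/testdata/cluster'))

print(volume_options({host_path: container_path}))
```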

To assist users, I've updated the Leopard README with instructions on
the environment variables' meanings.

The gist here is that the container is the source of truth, which means
to create an external volume, we need to copy the testdata off the
container onto the host running Docker Engine. To do that we suggest a
strategy using rsync via passwordless SSH key.
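
The copy step can be sketched roughly as follows (Python 3). The rsync and
ssh options are the ones the patch passes; the function name
build_rsync_command and the example port and key path are hypothetical,
and the real code runs the command on TARGET_HOST through Fabric rather
than printing it.

```python
def build_rsync_command(host_path, container_path, ssh_port, key_path,
                        user, uid, gid):
    """Return a shell command that copies container testdata to the host."""
    ssh_options = (
        'ssh -i {key} -o StrictHostKeyChecking=no '
        '-o UserKnownHostsFile=/dev/null -p {port}').format(
            key=key_path, port=ssh_port)
    # A trailing slash on the rsync source copies the directory's contents
    # directly into the destination rather than creating a subdirectory.
    source = container_path.rstrip('/') + '/'
    return (
        'mkdir -p {dest} && '
        'rsync -e "{ssh}" --delete --archive --chown={uid}:{gid} '
        '{user}@127.0.0.1:{source} {dest}').format(
            dest=host_path, ssh=ssh_options, uid=uid, gid=gid,
            user=user, source=source)


# Example values only; the port and key path are placeholders.
print(build_rsync_command(
    host_path='/data/1/dockervols/cluster',
    container_path='/home/dev/Impala/testdata/cluster',
    ssh_port=2222,
    key_path='/home/me/.ssh/ro-rsync_rsa',
    user='dev', uid=1234, gid=1000))
```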

Testing:
I used a Cloudera Docker container that has Impala in /home/dev/Impala.
Before, Kudu would fail to start due to KUDU-1419. Now, we load testdata
into an external volume, build Impala, run the minicluster including
Kudu, and can access the tpch_kudu data.

I made flake8 fixes as well. flake8 on this file is now clean.

Change-Id: Ia7d9d9253fcd7e3905e389ddeb1438cee3e24480
Reviewed-on: http://gerrit.cloudera.org:8080/4678
Reviewed-by: Michael Brown <mi...@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovyt...@cloudera.com>
Tested-by: Internal Jenkins


Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/db5de41a
Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/db5de41a
Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/db5de41a

Branch: refs/heads/hadoop-next
Commit: db5de41a808d0e177ac0089ead2e420ab6043d1d
Parents: 784716f
Author: Michael Brown <mi...@cloudera.com>
Authored: Thu Sep 22 15:04:41 2016 -0700
Committer: Internal Jenkins <cloudera-hud...@gerrit.cloudera.org>
Committed: Fri Oct 14 07:44:23 2016 +0000

----------------------------------------------------------------------
 tests/comparison/leopard/README               |  70 +++++++-
 tests/comparison/leopard/impala_docker_env.py | 199 +++++++++++++++------
 2 files changed, 209 insertions(+), 60 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/db5de41a/tests/comparison/leopard/README
----------------------------------------------------------------------
diff --git a/tests/comparison/leopard/README b/tests/comparison/leopard/README
index 45f5ad8..a8144ac 100644
--- a/tests/comparison/leopard/README
+++ b/tests/comparison/leopard/README
@@ -1,13 +1,77 @@
+Summary
+-------
+
 This package runs the query generator continuously. Compares Impala and Postgres results
 for a randomly generated query and produces several reports per day. Reports are
 displayed on a web page which allows the user to conveniently examine the discovered
 issues. The user can also start a custom run against a private Impala branch using the
 web interface.
 
-Requirements:
+Requirements
+------------
 
-Docker -- A docker image with Impala and Postgres installed and functional database
-    loaded into Postgres.
+Docker -- A docker image with Impala and PostgresQL installed and at
+    least one reference database loaded into PostgresQL. data_generator.py is a useful
+    tool to migrate data from Impala into PostgresQL.
 
 To get started, run ./controller.py and ./front_end.py. You should be able to view the
 web page at http://localhost:5000. Results and logs are saved to /tmp/query_gen
+
+
+Basic Configuration
+-------------------
+
+The following are useful environment variables for running the
+controller and Docker images within it.
+
+DOCKER_USER - user *within* the Impala Docker container who owns the
+Impala source tree and test data.
+
+DOCKER_PASSWORD - password for the user *within* the Impala Docker
+container.
+
+TARGET_HOST - host system on which Docker Engine is running. This is the
+host that the controller will use to issue Docker commands like "docker
+run".
+
+TARGET_HOST_USERNAME - username for controller process to use to SSH
+into TARGET_HOST. Via Fabric, one can either type a password or use SSH
+keys.
+
+DOCKER_IMAGE_NAME - image to pull via "docker pull"
+
+
+External Volume Configuration
+-----------------------------
+
+To run Leopard against Impala with Kudu, we need to work around
+KUDU-1419. KUDU-1419 is likely to occur if your Docker Storage Engine is
+AUFS, or maybe others.  The easiest way to overcome this is to mount an
+external Docker volume that contain the necessary test data.  To try to
+handle this automatically, you can export any or all of the environment
+variables, depending on your host and container setups:
+
+DOCKER_IMPALA_USER_UID, DOCKER_IMPALA_USER_GID - numeric UID and GID for
+the owner of the Impala test data (testdata/cluster from an Impala
+source checkout) within your Docker container. Numeric IDs are needed,
+because there is no guarantee the symbolic owner and group on the
+container match the IDs on the target host.
+
+HOST_TESTDATA_EXTERNAL_VOLUME_PATH - path on TARGET_HOST where the
+external volume will reside. This is the destination for rsync to warm
+the volume and the left-hand side of "docker run -v".
+
+DOCKER_TESTDATA_VOLUME_PATH - path on your Docker container to the
+testdata/cluster Impala directory. This is source for rsync to warm the
+volume and the right-hand side of "docker run -v".
+
+HOST_TO_DOCKER_SSH_KEY - name of private key on TARGET_HOST for use with
+rsync so as to "warm" the external volume automatically.
+
+You are encouraged to configure your container in such a way that rsync
+with passwordless SSH is possible so as to create the external volume
+using the environment variables above.
+
+To do that, this is a handy guide on how to use rsync with SSH keys:
+
+https://www.guyrutenberg.com/2014/01/14/restricting-ssh-access-to-rsync/

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/db5de41a/tests/comparison/leopard/impala_docker_env.py
----------------------------------------------------------------------
diff --git a/tests/comparison/leopard/impala_docker_env.py b/tests/comparison/leopard/impala_docker_env.py
index 715ad65..4ce12d8 100755
--- a/tests/comparison/leopard/impala_docker_env.py
+++ b/tests/comparison/leopard/impala_docker_env.py
@@ -20,7 +20,9 @@
 from __future__ import division
 from fabric.api import sudo, run, settings
 from logging import getLogger
-from os.path import join as join_path
+from os.path import (
+    join as join_path,
+    normpath)
 from time import sleep
 from tests.comparison.leopard.controller import (
     SHOULD_BUILD_IMPALA,
@@ -31,13 +33,43 @@ import os
 
 IMPALA_HOME = '/home/dev/Impala'
 CORE_PATH = '/tmp/core_files'
-DEFAULT_BRANCH_NAME = 'origin/cdh5-trunk'
+DEFAULT_BRANCH_NAME = os.environ.get('DEFAULT_BRANCH_NAME', 'origin/master')
 DEFAULT_DOCKER_IMAGE_NAME = 'cloudera/impala-dev'
-DOCKER_USER_NAME = 'dev'
+DOCKER_USER_NAME = os.environ.get('DOCKER_USER_NAME', 'dev')
+
+# Needed for ensuring the testdata volume is properly owned. The UID/GID from the
+# container must be used, not symbolic name.
+DOCKER_IMPALA_USER_UID = int(os.environ.get(
+    'DOCKER_IMPALA_USER_UID', 1234))
+DOCKER_IMPALA_USER_GID = int(os.environ.get(
+    'DOCKER_IMPALA_USER_GID', 1000))
+
+HOST_TESTDATA_EXTERNAL_VOLUME_PATH = normpath(os.environ.get(
+    'HOST_TESTDATA_EXTERNAL_VOLUME_PATH',
+    os.path.sep + join_path('data', '1', 'dockervols', 'cluster')))
+
+DEFAULT_DOCKER_TESTDATA_VOLUME_PATH = os.path.sep + join_path(
+    'home', DOCKER_USER_NAME, 'Impala', 'testdata', 'cluster')
+
+# This needs to have a trailing os.path.sep for rsync so that the contents of the rsync
+# source will be put directly into this directory. man rsync to understand the
+# more idiosyncracies of trailling / (or not) in paths.
+DOCKER_TESTDATA_VOLUME_PATH = normpath(
+    os.environ.get(
+        'DOCKER_TESTDATA_VOLUME_PATH',
+        DEFAULT_DOCKER_TESTDATA_VOLUME_PATH)
+) + os.path.sep
+
+HOST_TO_DOCKER_SSH_KEY = os.environ.get(
+    'HOST_TO_DOCKER_SSH_KEY',
+    join_path(os.environ['HOME'], '.ssh', 'ro-rsync_rsa'))
+
 NUM_START_ATTEMPTS = 50
 NUM_FABRIC_ATTEMPTS = 50
+
 LOG = getLogger('ImpalaDockerEnv')
 
+
 def retry(func):
   '''Retry decorator.'''
 
@@ -57,6 +89,7 @@ def retry(func):
 
   return wrapper
 
+
 class ImpalaDockerEnv(object):
   '''Represents an Impala environemnt inside a Docker container. Used for starting
   Impala, getting stack traces after a crash and keeping track of the ports on which SSH,
@@ -75,14 +108,19 @@ class ImpalaDockerEnv(object):
         'DOCKER_IMAGE_NAME', DEFAULT_DOCKER_IMAGE_NAME)
 
   def stop_docker(self):
-    with settings(warn_only = True, host_string = self.host, user = self.host_username):
+    with settings(warn_only=True, host_string=self.host, user=self.host_username):
       retry(sudo)('docker stop {0}'.format(self.container_id), pty=True)
       retry(sudo)('docker rm {0}'.format(self.container_id), pty=True)
 
-  def start_new_container(self):
-    '''Starts a container with port forwarding for ssh, impala and postgres. '''
+  def start_new_container(self, volume_map=None):
+    """
+    Starts a container with port forwarding for ssh, impala and postgres.
+
+    The optional volume_map is a dictionary for making use of Docker external volumes.
+    The keys are paths on the host, and the values are paths on the container.
+    """
     for _ in range(NUM_START_ATTEMPTS):
-      with settings(warn_only = True, host_string = self.host, user = self.host_username):
+      with settings(warn_only=True, host_string=self.host, user=self.host_username):
         set_core_dump_location_command = \
             "echo '/tmp/core_files/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern"
         sudo(set_core_dump_location_command, pty=True)
@@ -94,46 +132,59 @@ class ImpalaDockerEnv(object):
         start_command = ''
         if SHOULD_PULL_DOCKER_IMAGE:
           start_command = 'docker pull {docker_image_name} && '.format(
-              docker_image_name = self.docker_image_name)
+              docker_image_name=self.docker_image_name)
+        volume_ops = ''
+        if volume_map is not None:
+          volume_ops = ' '.join(
+              ['-v {host_path}:{container_path}'.format(host_path=host_path,
+                                                        container_path=container_path)
+               for host_path, container_path in volume_map.iteritems()])
         start_command += (
-            'docker run -d -t -p {postgres_port}:5432 -p {ssh_port}:22 '
+            'docker run -d -t {volume_ops} -p {postgres_port}:5432 -p {ssh_port}:22 '
             '-p {impala_port}:21050 {docker_image_name} /bin/docker-boot-daemon').format(
-                ssh_port = self.ssh_port,
-                impala_port = self.impala_port,
-                postgres_port = self.postgres_port,
-                docker_image_name = self.docker_image_name)
+                volume_ops=volume_ops,
+                ssh_port=self.ssh_port,
+                impala_port=self.impala_port,
+                postgres_port=self.postgres_port,
+                docker_image_name=self.docker_image_name)
 
         try:
           self.container_id = sudo(start_command, pty=True)
-        except:
-          LOG.exception('start_new_container')
+        except Exception as e:
+          LOG.exception('start_new_container:' + str(e))
       if self.container_id is not None:
         break
     else:
       LOG.error('Container failed to start after {0} attempts'.format(NUM_START_ATTEMPTS))
+    # Wait for the SSH service to start inside the docker instance.  Usually takes 1
+    # second. This is simple and reliable. An alternative implementation is to poll with
+    # timeout if SSH was started.
+    sleep(10)
 
   def get_git_hash(self):
     '''Returns Git hash if the current commit. '''
     with settings(
-        warn_only = True,
-        host_string = '{0}@{1}:{2}'.format(DOCKER_USER_NAME, self.host, self.ssh_port),
-        password = os.environ['DOCKER_PASSWORD']):
+        warn_only=True,
+        host_string='{0}@{1}:{2}'.format(DOCKER_USER_NAME, self.host, self.ssh_port),
+        password=os.environ['DOCKER_PASSWORD']
+    ):
       git_hash = retry(run)('cd {IMPALA_HOME} && git rev-parse --short HEAD'.format(
-        IMPALA_HOME = IMPALA_HOME))
+          IMPALA_HOME=IMPALA_HOME))
       return git_hash
 
   def run_all(self):
     with settings(
-        warn_only = True,
-        host_string = '{0}@{1}:{2}'.format(DOCKER_USER_NAME, self.host, self.ssh_port),
-        password = os.environ['DOCKER_PASSWORD']):
+        warn_only=True,
+        host_string='{0}@{1}:{2}'.format(DOCKER_USER_NAME, self.host, self.ssh_port),
+        password=os.environ['DOCKER_PASSWORD']
+    ):
       run_all_command = (
           'mkdir -p {CORE_PATH} && chmod 777 {CORE_PATH} && cd {IMPALA_HOME} '
           '&& source {IMPALA_HOME}/bin/impala-config.sh '
           '&& {IMPALA_HOME}/bin/create-test-configuration.sh '
           '&& {IMPALA_HOME}/testdata/bin/run-all.sh').format(
-              IMPALA_HOME = IMPALA_HOME,
-              CORE_PATH=CORE_PATH)
+              CORE_PATH=CORE_PATH,
+              IMPALA_HOME=IMPALA_HOME)
       retry(run)(run_all_command, pty=False)
 
   def build_impala(self):
@@ -146,33 +197,33 @@ class ImpalaDockerEnv(object):
           'docker-boot && cd {IMPALA_HOME} && {git_command} '
           '&& source {IMPALA_HOME}/bin/impala-config.sh '
           '&& {IMPALA_HOME}/buildall.sh -notests').format(
-              git_command = self.git_command,
-              IMPALA_HOME = IMPALA_HOME,
-              CORE_PATH = CORE_PATH)
+              git_command=self.git_command,
+              IMPALA_HOME=IMPALA_HOME)
     elif SHOULD_BUILD_IMPALA:
       build_command = (
           'docker-boot && cd {IMPALA_HOME} '
-          '&& git fetch --all && git checkout DEFAULT_BRANCH_NAME '
+          '&& git fetch --all && git checkout {DEFAULT_BRANCH_NAME} '
           '&& source {IMPALA_HOME}/bin/impala-config.sh '
           '&& {IMPALA_HOME}/buildall.sh -notests').format(
-              IMPALA_HOME = IMPALA_HOME,
-              DEFAULT_BRANCH_NAME = DEFAULT_BRANCH_NAME,
-              CORE_PATH = CORE_PATH)
+              IMPALA_HOME=IMPALA_HOME,
+              DEFAULT_BRANCH_NAME=DEFAULT_BRANCH_NAME)
 
     if build_command:
       with settings(
-          warn_only = True,
-          host_string = '{0}@{1}:{2}'.format(DOCKER_USER_NAME, self.host, self.ssh_port),
-          password = os.environ['DOCKER_PASSWORD']):
+          warn_only=True,
+          host_string='{0}@{1}:{2}'.format(DOCKER_USER_NAME, self.host, self.ssh_port),
+          password=os.environ['DOCKER_PASSWORD']
+      ):
         result = retry(run)(build_command, pty=False)
         LOG.info('Build Complete, Result: {0}'.format(result))
 
   def load_data(self):
     if SHOULD_LOAD_DATA:
       with settings(
-          warn_only = True,
-          host_string = '{0}@{1}:{2}'.format(DOCKER_USER_NAME, self.host, self.ssh_port),
-          password = os.environ['DOCKER_PASSWORD']):
+          warn_only=True,
+          host_string='{0}@{1}:{2}'.format(DOCKER_USER_NAME, self.host, self.ssh_port),
+          password=os.environ['DOCKER_PASSWORD']
+      ):
         self.start_impala()
         load_command = '''cd {IMPALA_HOME} \
             && source bin/impala-config.sh \
@@ -180,14 +231,16 @@ class ImpalaDockerEnv(object):
                 --use-postgresql --db-name=functional \
                 --migrate-table-names=alltypes,alltypestiny,alltypesagg migrate \
             && ./tests/comparison/data_generator.py --use-postgresql'''.format(
-                IMPALA_HOME=IMPALA_HOME)
+            IMPALA_HOME=IMPALA_HOME)
         result = retry(run)(load_command, pty=False)
+        return result
 
   def start_impala(self):
     with settings(
-        warn_only = True,
-        host_string = '{0}@{1}:{2}'.format(DOCKER_USER_NAME, self.host, self.ssh_port),
-        password = os.environ['DOCKER_PASSWORD']):
+        warn_only=True,
+        host_string='{0}@{1}:{2}'.format(DOCKER_USER_NAME, self.host, self.ssh_port),
+        password=os.environ['DOCKER_PASSWORD']
+    ):
       impalad_args = [
           '-convert_legacy_hive_parquet_utc_timestamps=true',
       ]
@@ -202,46 +255,78 @@ class ImpalaDockerEnv(object):
   def is_impala_running(self):
     '''Check that exactly 3 impalads are running inside the docker instance.'''
     with settings(
-        warn_only = True,
-        host_string = '{0}@{1}:{2}'.format(DOCKER_USER_NAME, self.host, self.ssh_port),
-        password = os.environ['DOCKER_PASSWORD']):
+        warn_only=True,
+        host_string='{0}@{1}:{2}'.format(DOCKER_USER_NAME, self.host, self.ssh_port),
+        password=os.environ['DOCKER_PASSWORD']
+    ):
       return retry(run)('ps aux | grep impalad').count('/service/impalad') == 3
 
   def get_stack(self):
     '''Finds the newest core file and extracts the stack trace from it using gdb. '''
     IMPALAD_PATH = '{IMPALA_HOME}/be/build/debug/service/impalad'.format(
-        IMPALA_HOME = IMPALA_HOME)
+        IMPALA_HOME=IMPALA_HOME)
     with settings(
-        warn_only = True,
-        host_string = '{0}@{1}:{2}'.format(DOCKER_USER_NAME, self.host, self.ssh_port),
-        password = os.environ['DOCKER_PASSWORD']):
+        warn_only=True,
+        host_string='{0}@{1}:{2}'.format(DOCKER_USER_NAME, self.host, self.ssh_port),
+        password=os.environ['DOCKER_PASSWORD']
+    ):
       core_file_name = retry(run)('ls {0} -t1 | head -1'.format(CORE_PATH))
       LOG.info('Core File Name: {0}'.format(core_file_name))
       if 'core' not in core_file_name:
         return None
       core_full_path = join_path(CORE_PATH, core_file_name)
       stack_trace = retry(run)('gdb {0} {1} --batch --quiet --eval-command=bt'.format(
-        IMPALAD_PATH, core_full_path))
+          IMPALAD_PATH, core_full_path))
       self.delete_core_files()
       return stack_trace
 
   def delete_core_files(self):
     '''Delete all core files. This is usually done after the stack was extracted.'''
     with settings(
-        warn_only = True,
-        host_string = '{0}@{1}:{2}'.format(DOCKER_USER_NAME, self.host, self.ssh_port),
-        password = os.environ['DOCKER_PASSWORD']):
+        warn_only=True,
+        host_string='{0}@{1}:{2}'.format(DOCKER_USER_NAME, self.host, self.ssh_port),
+        password=os.environ['DOCKER_PASSWORD']
+    ):
       retry(run)('rm -f {0}/core.*'.format(CORE_PATH))
 
   def prepare(self):
     '''Create a new Impala Environment. Starts a docker container and builds Impala in it.
     '''
-    self.start_new_container()
+    # See KUDU-1419: If we expect to be running Kudu in the minicluster inside the
+    # Docker container, we have to protect against storage engines like AUFS and their
+    # incompatibility with Kudu. First we have to get test data off the container, store
+    # it somewhere, and then start another container using docker -v and mount the test
+    # data as a volume to bypass AUFS. See also the README for Leopard.
+    if os.environ.get('KUDU_IS_SUPPORTED') == 'true':
+      LOG.info('Warming testdata cluster external volume')
+      self.start_new_container()
+      with settings(
+          warn_only=True,
+          host_string=self.host,
+          user=self.host_username,
+      ):
+        sudo(
+            'mkdir -p {host_testdata_path} && '
+            'rsync -e "ssh -i {priv_key} -o StrictHostKeyChecking=no '
+            ''         '-o UserKnownHostsFile=/dev/null -p {ssh_port}" '
+            '--delete --archive --verbose --progress --chown={uid}:{gid} '
+            '{user}@127.0.0.1:{container_testdata_path} {host_testdata_path}'.format(
+                host_testdata_path=HOST_TESTDATA_EXTERNAL_VOLUME_PATH,
+                priv_key=HOST_TO_DOCKER_SSH_KEY,
+                ssh_port=self.ssh_port,
+                uid=DOCKER_IMPALA_USER_UID,
+                gid=DOCKER_IMPALA_USER_GID,
+                user=DOCKER_USER_NAME,
+                container_testdata_path=DOCKER_TESTDATA_VOLUME_PATH))
+      self.stop_docker()
+      volume_map = {
+          HOST_TESTDATA_EXTERNAL_VOLUME_PATH: DOCKER_TESTDATA_VOLUME_PATH,
+      }
+    else:
+      volume_map = None
+
+    self.start_new_container(volume_map=volume_map)
     LOG.info('Container Started')
-    # Wait for the SSH service to start inside the docker instance.  Usually takes 1
-    # second. This is simple and reliable. An alternative implementation is to poll with
-    # timeout if SSH was started.
-    sleep(10)
     self.build_impala()
     try:
       result = self.run_all()
