IMPALA-6827: [DOCS] Updated the download link for the tutorial data

Updated the link to download the Parquet airline files for the tutorial.
Change-Id: I6823d1688169e0a6f09d5b552026bc18a3770828 Reviewed-on: http://gerrit.cloudera.org:8080/10393 Reviewed-by: Michael Brown <mi...@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Project: http://git-wip-us.apache.org/repos/asf/impala/repo Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/04add98a Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/04add98a Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/04add98a Branch: refs/heads/2.x Commit: 04add98a341f3ae8e1e4e0613c82188eec5fc0d9 Parents: 99e379d Author: Alex Rodoni <arod...@cloudera.com> Authored: Mon May 14 13:30:42 2018 -0700 Committer: Impala Public Jenkins <impala-public-jenk...@gerrit.cloudera.org> Committed: Thu May 17 22:03:02 2018 +0000 ---------------------------------------------------------------------- docs/topics/impala_tutorial.xml | 1110 ++++++++++++++-------------------- 1 file changed, 466 insertions(+), 644 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/impala/blob/04add98a/docs/topics/impala_tutorial.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_tutorial.xml b/docs/topics/impala_tutorial.xml index 01e53e2..72665ef 100644 --- a/docs/topics/impala_tutorial.xml +++ b/docs/topics/impala_tutorial.xml @@ -1484,81 +1484,6 @@ Returned 20 row(s) in 0.38s Returned 10 row(s) in 0.39s</codeblock> </conbody> </concept> - - <concept audience="hidden" id="tut_connect"> - - <title>Connecting to Impala through impala-shell</title> - - <conbody> - - <p> - This tutorial provides some tips for connecting to Impala through the <cmdname>impala-shell</cmdname> - interpreter, and gives usage tips for particular scenarios. - </p> - - <ul> - <li> - <p> - By default, <cmdname>impala-shell</cmdname> with no arguments connects to the current host, port - 21000. - </p> - </li> - - <li> - <p> - <codeph>-i localhost</codeph> option gives you a shorter command prompt. - </p> - </li> - - <li> - <p> - The <codeph>--quiet</codeph> option suppresses some informational messages, such as the - <codeph>Query:</codeph> line that echoes each command entered. - </p> - </li> - - <li> - <codeph>CONNECT <varname>hostname</varname></codeph> lets you switch to another host (keeping the same - default port 21000). - </li> - - <li> - <p> - You could run <cmdname>impala-shell</cmdname> from a completely different host to keep login accounts - off the actual Impala nodes. - </p> - </li> - - <li> - <p> - For load balancing, you would connect to different coordinator nodes for concurrent queries. - </p> - </li> - - <li> - <p> - For benchmarking, you would skip pretty printing with the <codeph>-B</codeph> option, or even skip - all screen output with the <codeph>-o</codeph> option or shell redirection. - </p> - </li> - </ul> - </conbody> - </concept> - </concept> - - <concept audience="hidden" id="tut_mem_limit"> - - <title>Effects of Memory Limits on Impala Queries</title> - <prolog> - <metadata> - <data name="Category" value="Memory"/> - </metadata> - </prolog> - - <conbody> - - <p></p> - </conbody> </concept> <concept id="tut_parquet_schemaless"> @@ -1591,319 +1516,311 @@ for purposes of this exercise, wait until after following the tutorial before ex a real-life situation where you cannot rely on assumptions and assertions about the ranges and representations of data values. 
</p>
+ </conbody>
+ <concept id="download_hdfs">
+ <title>Download the Data Files into HDFS</title>

-<p>
-We will download Parquet files containing this data from the Ibis blog.
-First, we download and unpack the data files.
-There are 8 files totalling 1.4 GB.
-Each file is less than 256 MB.
-</p>
-
-<codeblock>$ wget -O airlines_parquet.tar.gz https://www.dropbox.com/s/ol9x51tqp6cv4yc/airlines_parquet.tar.gz?dl=0
-...
-Length: 1245204740 (1.2G) [application/octet-stream]
-Saving to: ‘airlines_parquet.tar.gz’
+<conbody>
+<p> First, we download and unpack the data files. There are 8 files totalling
+ 1.4 GB.</p>

-2015-08-12 17:14:24 (23.6 MB/s) - ‘airlines_parquet.tar.gz’ saved [1245204740/1245204740]
+<codeblock>$ wget -O airlines_parquet.tar.gz https://home.apache.org/~arodoni/airlines_parquet.tar.gz
+$ wget https://home.apache.org/~arodoni/airlines_parquet.tar.gz.sha512
+$ shasum -a 512 -c airlines_parquet.tar.gz.sha512
+airlines_parquet.tar.gz: OK

$ tar xvzf airlines_parquet.tar.gz
-airlines_parquet/
-airlines_parquet/93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq
-airlines_parquet/93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq
-airlines_parquet/93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq
-airlines_parquet/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq
-airlines_parquet/93459d994898a9ba-77674173b331fa97_574780876_data.0.parq
-airlines_parquet/93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq
-airlines_parquet/93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq
-airlines_parquet/93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq
-$ cd airlines_parquet/
-$ du -kch *.parq
-253M 93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq
-65M 93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq
-156M 93459d994898a9ba-77674173b331fa97_574780876_data.0.parq
-240M 93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq
-253M 93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq
-16M 93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq
-177M 93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq
-213M 93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq
-1.4G total
-</codeblock>
-<p>
-Next, we put the Parquet data files in HDFS, all together in a single directory,
-with permissions on the directory and the files so that the <codeph>impala</codeph>
-user will be able to read them.
-</p>
-
-<note>
-After unpacking, we saw the largest Parquet file was 253 MB.
-When copying Parquet files into HDFS for Impala to use,
-for maximum query performance, make sure that each file resides in a single HDFS data block.
-Therefore, we pick a size larger than any single file and specify that as the block size, using the argument
-<codeph>-Ddfs.block.size=256m</codeph> on the <codeph>hdfs dfs -put</codeph> command.
-</note> +$ cd airlines_parquet/ -<codeblock>$ hdfs dfs -mkdir -p hdfs://demo_host.example.com:8020/user/impala/staging/airlines -$ hdfs dfs -Ddfs.block.size=256m -put *.parq /user/impala/staging/airlines -$ hdfs dfs -ls /user/impala/staging +$ du -kch *.parq +253M 4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq +14M 4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.1.parq +253M 4345e5eef217aa1b-c8f16177f35fd984_501176748_data.0.parq +64M 4345e5eef217aa1b-c8f16177f35fd984_501176748_data.1.parq +184M 4345e5eef217aa1b-c8f16177f35fd985_1199995767_data.0.parq +241M 4345e5eef217aa1b-c8f16177f35fd986_2086627597_data.0.parq +212M 4345e5eef217aa1b-c8f16177f35fd987_1048668565_data.0.parq +152M 4345e5eef217aa1b-c8f16177f35fd988_1432111844_data.0.parq +1.4G total</codeblock> + +<p> Next, we put the Parquet data files in HDFS, all together in a single + directory, with permissions on the directory and the files so that the + <codeph>impala</codeph> user will be able to read them.</p> + <p>After unpacking, we saw the largest Parquet file was 253 MB. When + copying Parquet files into HDFS for Impala to use, for maximum query + performance, make sure that each file resides in a single HDFS data + block. Therefore, we pick a size larger than any single file and + specify that as the block size, using the argument + <codeph>-Ddfs.block.size=253m</codeph> on the <codeph>hdfs dfs + -put</codeph> command. </p> + +<codeblock>$ sudo -u hdfs hdfs dfs -mkdir -p /user/impala/staging/airlines +$ sudo -u hdfs hdfs dfs -Ddfs.block.size=253m -put *.parq /user/impala/staging/airlines +$ sudo -u hdfs hdfs dfs -ls /user/impala/staging Found 1 items -drwxrwxrwx - hdfs supergroup 0 2015-08-12 13:52 /user/impala/staging/airlines -$ hdfs dfs -ls hdfs://demo_host.example.com:8020/user/impala/staging/airlines + +$ sudo -u hdfs hdfs dfs -ls /user/impala/staging/airlines Found 8 items --rw-r--r-- 3 jrussell supergroup 265107489 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq --rw-r--r-- 3 jrussell supergroup 67544715 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq --rw-r--r-- 3 jrussell supergroup 162556490 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa97_574780876_data.0.parq --rw-r--r-- 3 jrussell supergroup 251603518 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq --rw-r--r-- 3 jrussell supergroup 265186603 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq --rw-r--r-- 3 jrussell supergroup 16663754 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq --rw-r--r-- 3 jrussell supergroup 185511677 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq --rw-r--r-- 3 jrussell supergroup 222794621 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq </codeblock> +</conbody> + </concept> -<p> -With the files in an accessible location in HDFS, we create a database table that uses the data in those files. -The <codeph>CREATE EXTERNAL</codeph> syntax and the <codeph>LOCATION</codeph> attribute point Impala at the appropriate HDFS directory. 
-The <codeph>LIKE PARQUET '<varname>path_to_any_parquet_file</varname>'</codeph> clause means we skip the list of column names and types; -Impala automatically gets the column names and data types straight from the data files. -(Currently, this technique only works for Parquet files.) -We ignore the warning about lack of <codeph>READ_WRITE</codeph> access to the files in HDFS; -the <codeph>impala</codeph> user can read the files, which will be sufficient for us to experiment with -queries and perform some copy and transform operations into other tables. -</p> + <concept id="create_tables"> + <title>Create Database and Tables</title> + <conbody> -<codeblock>$ impala-shell -i localhost -Starting Impala Shell without Kerberos authentication +<p> With the files in an accessible location in HDFS, you create a database + table that uses the data in those files:<ul> + <li>The <codeph>CREATE EXTERNAL</codeph> syntax and the + <codeph>LOCATION</codeph> attribute point Impala at the + appropriate HDFS directory.</li> + <li>The <codeph>LIKE PARQUET + '<varname>path_to_any_parquet_file</varname>'</codeph> clause + means we skip the list of column names and types; Impala + automatically gets the column names and data types straight from + the data files. (Currently, this technique only works for Parquet + files.) </li> + <li>Ignore the warning about lack of <codeph>READ_WRITE</codeph> + access to the files in HDFS; the <codeph>impala</codeph> user can + read the files, which will be sufficient for us to experiment with + queries and perform some copy and transform operations into other + tables. </li> + </ul></p> -Connected to localhost:21000 -<ph conref="../shared/ImpalaVariables.xml#impala_vars/ImpaladBanner"/> -Welcome to the Impala shell. Press TAB twice to see a list of available commands. -... -<ph conref="../shared/ImpalaVariables.xml#impala_vars/ShellBanner"/> -[localhost:21000] > create database airline_data; -[localhost:21000] > use airline_data; -[localhost:21000] > create external table airlines_external - > like parquet 'hdfs://demo_host.example.com:8020/user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq' - > stored as parquet location 'hdfs://demo_host.example.com:8020/user/impala/staging/airlines'; -WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://demo_host.example.com:8020/user/impala/staging' +<codeblock>$ impala-shell +> CREATE DATABASE airlines_data; + USE airlines_data; + CREATE EXTERNAL TABLE airlines_external + LIKE PARQUET 'hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq' + STORED AS PARQUET LOCATION 'hdfs:staging/airlines'; +WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://myhost.com:8020/user/impala/staging' </codeblock> + </conbody> + </concept> + <concept id="examine_schema "> + <title>Examine Physical and Logical Schema</title> + <conbody> -<p> -With the table created, we examine its physical and logical characteristics to confirm that the data is really -there and in a format and shape that we can work with. -The <codeph>SHOW TABLE STATS</codeph> statement gives a very high-level summary of the table, -showing how many files and how much total data it contains. -Also, it confirms that the table is expecting all the associated data files to be in Parquet format. 
-(The ability to work with all kinds of HDFS data files in different formats means that it is -possible to have a mismatch between the format of the data files, and the format -that the table expects the data files to be in.) -The <codeph>SHOW FILES</codeph> statement confirms that the data in the table has the expected number, -names, and sizes of the original Parquet files. -The <codeph>DESCRIBE</codeph> statement (or its abbreviation <codeph>DESC</codeph>) confirms the names and types -of the columns that Impala automatically created after reading that metadata from the Parquet file. -The <codeph>DESCRIBE FORMATTED</codeph> statement prints out some extra detail along with the column definitions; -the pieces we care about for this exercise are the containing database for the table, -the location of the associated data files in HDFS, the fact that it's an external table so Impala will not -delete the HDFS files when we finish the experiments and drop the table, and the fact that the -table is set up to work exclusively with files in the Parquet format. -</p> - -<codeblock>[localhost:21000] > show table stats airlines_external; +<p> With the table created, we examine its physical and logical characteristics + to confirm that the data is really there and in a format and shape + that we can work with. <ul> + <li>The <codeph>SHOW TABLE STATS</codeph> statement gives a very + high-level summary of the table, showing how many files and how + much total data it contains. Also, it confirms that the table is + expecting all the associated data files to be in Parquet format. + (The ability to work with all kinds of HDFS data files in + different formats means that it is possible to have a mismatch + between the format of the data files, and the format that the + table expects the data files to be in.) </li> + <li>The <codeph>SHOW FILES</codeph> statement confirms that the data + in the table has the expected number, names, and sizes of the + original Parquet files.</li> + <li>The <codeph>DESCRIBE</codeph> statement (or its abbreviation + <codeph>DESC</codeph>) confirms the names and types of the + columns that Impala automatically created after reading that + metadata from the Parquet file. </li> + <li>The <codeph>DESCRIBE FORMATTED</codeph> statement prints out + some extra detail along with the column definitions. 
The pieces we + care about for this exercise are: <ul> + <li>The containing database for the table.</li> + <li>The location of the associated data files in HDFS.</li> + <li>The table is an external table so Impala will not delete the + HDFS files when we finish the experiments and drop the + table.</li> + <li>The table is set up to work exclusively with files in the + Parquet format.</li> + </ul></li> + </ul></p> + +<codeblock>> SHOW TABLE STATS airlines_external; +-------+--------+--------+--------------+-------------------+---------+-------------------+ | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | +-------+--------+--------+--------------+-------------------+---------+-------------------+ | -1 | 8 | 1.34GB | NOT CACHED | NOT CACHED | PARQUET | false | +-------+--------+--------+--------------+-------------------+---------+-------------------+ -[localhost:21000] > show files in airlines_external; + +> SHOW FILES IN airlines_external; +----------------------------------------------------------------------------------------+----------+-----------+ | path | size | partition | +----------------------------------------------------------------------------------------+----------+-----------+ -| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq | 252.83MB | | -| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq | 64.42MB | | -| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa97_574780876_data.0.parq | 155.03MB | | -| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq | 239.95MB | | -| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq | 252.90MB | | -| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq | 15.89MB | | -| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq | 176.92MB | | -| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq | 212.47MB | | +| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq | 252.99MB | | +| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.1.parq | 13.43MB | | +| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd984_501176748_data.0.parq | 252.84MB | | +| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd984_501176748_data.1.parq | 63.92MB | | +| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd985_1199995767_data.0.parq | 183.64MB | | +| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd986_2086627597_data.0.parq | 240.04MB | | +| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd987_1048668565_data.0.parq | 211.35MB | | +| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd988_1432111844_data.0.parq | 151.46MB | | +----------------------------------------------------------------------------------------+----------+-----------+ -[localhost:21000] > describe airlines_external; + +> DESCRIBE airlines_external; +---------------------+--------+---------------------------------------------------+ | name | type | comment | +---------------------+--------+---------------------------------------------------+ -| year | int | inferred from: optional int32 year | -| month | int | inferred from: optional int32 month | -| day | int | inferred from: optional int32 day | -| dayofweek | int | inferred from: optional int32 dayofweek | -| dep_time | int 
| inferred from: optional int32 dep_time | -| crs_dep_time | int | inferred from: optional int32 crs_dep_time | -| arr_time | int | inferred from: optional int32 arr_time | -| crs_arr_time | int | inferred from: optional int32 crs_arr_time | -| carrier | string | inferred from: optional binary carrier | -| flight_num | int | inferred from: optional int32 flight_num | -| tail_num | int | inferred from: optional int32 tail_num | -| actual_elapsed_time | int | inferred from: optional int32 actual_elapsed_time | -| crs_elapsed_time | int | inferred from: optional int32 crs_elapsed_time | -| airtime | int | inferred from: optional int32 airtime | -| arrdelay | int | inferred from: optional int32 arrdelay | -| depdelay | int | inferred from: optional int32 depdelay | -| origin | string | inferred from: optional binary origin | -| dest | string | inferred from: optional binary dest | -| distance | int | inferred from: optional int32 distance | -| taxi_in | int | inferred from: optional int32 taxi_in | -| taxi_out | int | inferred from: optional int32 taxi_out | -| cancelled | int | inferred from: optional int32 cancelled | -| cancellation_code | string | inferred from: optional binary cancellation_code | -| diverted | int | inferred from: optional int32 diverted | -| carrier_delay | int | inferred from: optional int32 carrier_delay | -| weather_delay | int | inferred from: optional int32 weather_delay | -| nas_delay | int | inferred from: optional int32 nas_delay | -| security_delay | int | inferred from: optional int32 security_delay | -| late_aircraft_delay | int | inferred from: optional int32 late_aircraft_delay | +| year | int | Inferred from Parquet file. | +| month | int | Inferred from Parquet file. | +| day | int | Inferred from Parquet file. | +| dayofweek | int | Inferred from Parquet file. | +| dep_time | int | Inferred from Parquet file. | +| crs_dep_time | int | Inferred from Parquet file. | +| arr_time | int | Inferred from Parquet file. | +| crs_arr_time | int | Inferred from Parquet file. | +| carrier | string | Inferred from Parquet file. | +| flight_num | int | Inferred from Parquet file. | +| tail_num | int | Inferred from Parquet file. | +| actual_elapsed_time | int | Inferred from Parquet file. | +| crs_elapsed_time | int | Inferred from Parquet file. | +| airtime | int | Inferred from Parquet file. | +| arrdelay | int | Inferred from Parquet file. | +| depdelay | int | Inferred from Parquet file. | +| origin | string | Inferred from Parquet file. | +| dest | string | Inferred from Parquet file. | +| distance | int | Inferred from Parquet file. | +| taxi_in | int | Inferred from Parquet file. | +| taxi_out | int | Inferred from Parquet file. | +| cancelled | int | Inferred from Parquet file. | +| cancellation_code | string | Inferred from Parquet file. | +| diverted | int | Inferred from Parquet file. | +| carrier_delay | int | Inferred from Parquet file. | +| weather_delay | int | Inferred from Parquet file. | +| nas_delay | int | Inferred from Parquet file. | +| security_delay | int | Inferred from Parquet file. | +| late_aircraft_delay | int | Inferred from Parquet file. | +---------------------+--------+---------------------------------------------------+ -[localhost:21000] > desc formatted airlines_external; + +> DESCRIBE FORMATTED airlines_external; +------------------------------+------------------------------- | name | type +------------------------------+------------------------------- ... 
| # Detailed Table Information | NULL -| Database: | airline_data -| Owner: | jrussell +| Database: | airlines_data +| Owner: | impala ... | Location: | /user/impala/staging/airlines | Table Type: | EXTERNAL_TABLE ... | # Storage Information | NULL -| SerDe Library: | parquet.hive.serde.ParquetHiveSerDe -| InputFormat: | parquet.hive.DeprecatedParquetInputFormat -| OutputFormat: | parquet.hive.DeprecatedParquetOutputFormat +| SerDe Library: | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe +| InputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputForma +| OutputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat ... </codeblock> + </conbody></concept> + <concept id="examine_data"> + <title>Analyze Data</title> + <conbody> -<p> -Now that we are confident that the connections are solid between the Impala table and the -underlying Parquet files, we run some initial queries to understand the characteristics -of the data: the overall number of rows, and the ranges and how many -different values are in certain columns. -For convenience in understanding the magnitude of the <codeph>COUNT(*)</codeph> -result, we run another query dividing the number of rows by 1 million, demonstrating that there are 123 million rows in the table. -</p> - -<!-- I think this is the very longest line that could possibly fit without wrapping in the PDF output; 87 characters wide. -[localhost:21000] > select count(*) / 1e6 as 'millions of rows' from airlines_external; ---> +<p> Now that we are confident that the connections are solid between the Impala + table and the underlying Parquet files, we run some initial queries to + understand the characteristics of the data: the overall number of + rows, and the ranges and how many different values are in certain + columns. </p> -<codeblock>[localhost:21000] > select count(*) from airlines_external; +<codeblock>> SELECT COUNT(*) FROM airlines_external; +-----------+ | count(*) | +-----------+ | 123534969 | +-----------+ -Fetched 1 row(s) in 1.32s -[localhost:21000] > select count(*) / 1e6 as 'millions of rows' from airlines_external; -+------------------+ -| millions of rows | -+------------------+ -| 123.534969 | -+------------------+ -Fetched 1 row(s) in 1.24s </codeblock> -<p> The <codeph>NDV()</codeph> function stands for <q>number of distinct - values</q>, which for performance reasons is an estimate when there - are lots of different values in the column, but is precise when the - cardinality is less than 16 K. Use <codeph>NDV()</codeph> calls for this - kind of exploration rather than <codeph>COUNT(DISTINCT - <varname>colname</varname>)</codeph>, because Impala can evaluate - multiple <codeph>NDV()</codeph> functions in a single query, but only a - single instance of <codeph>COUNT DISTINCT</codeph>. Here we see that - there are modest numbers of different airlines, flight numbers, and - origin and destination airports. Two things jump out from this query: - the number of <codeph>tail_num</codeph> values is much smaller than we - might have expected, and there are more destination airports than origin - airports. Let's dig further. What we find is that most - <codeph>tail_num</codeph> values are <codeph>NULL</codeph>. It looks - like this was an experimental column that wasn't filled in accurately. - We make a mental note that if we use this data as a starting point, - we'll ignore this column. 
We also find that certain airports are
- represented in the <codeph>ORIGIN</codeph> column but not the
- <codeph>DEST</codeph> column; now we know that we cannot rely on the
- assumption that those sets of airport codes are identical. </p>

-<note>
-A slight digression for some performance tuning. Notice how the first
-<codeph>SELECT DISTINCT DEST</codeph> query takes almost 40 seconds.
-We expect all queries on such a small data set, less than 2 GB, to
-take a few seconds at most. The reason is because the expression
-<codeph>NOT IN (SELECT origin FROM airlines_external)</codeph>
-produces an intermediate result set of 123 million rows, then
-runs 123 million comparisons on each data node against the tiny set of destination airports.
-The way the <codeph>NOT IN</codeph> operator works internally means that
-this intermediate result set with 123 million rows might be transmitted
-across the network to each data node in the cluster.
-Applying another <codeph>DISTINCT</codeph> inside the <codeph>NOT IN</codeph>
-subquery means that the intermediate result set is only 340 items,
-resulting in much less network traffic and fewer comparison operations.
-The more efficient query with the added <codeph>DISTINCT</codeph> is approximately 7 times as fast.
-</note>

-<codeblock>[localhost:21000] > select ndv(carrier), ndv(flight_num), ndv(tail_num),
- > ndv(origin), ndv(dest) from airlines_external;
+<p> The <codeph>NDV()</codeph> function returns the number of distinct values,
+ which, for performance reasons, is an estimate when there are lots of
+ different values in the column, but is precise when the cardinality is
+ less than 16 K. Use the <codeph>NDV()</codeph> function for this kind of
+ exploration rather than <codeph>COUNT(DISTINCT
+ <varname>colname</varname>)</codeph>, because Impala can evaluate
+ multiple <codeph>NDV()</codeph> functions in a single query, but only
+ a single instance of <codeph>COUNT DISTINCT</codeph>.
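</p>

<p> As an editorial aside (not part of the original tutorial), the contrast is easy to
see in a sketch. A single query may contain many <codeph>NDV()</codeph> calls, while
the Impala releases this tutorial targets reject multiple
<codeph>COUNT(DISTINCT)</codeph> expressions on different columns in one query. The
<codeph>APPX_COUNT_DISTINCT</codeph> query option, where available, is assumed here to
rewrite <codeph>COUNT(DISTINCT)</codeph> into <codeph>NDV()</codeph> so that the second
form also runs, at the cost of returning estimates: </p>

<codeblock>-- Works: any number of NDV() estimates in one pass over the table.
> SELECT NDV(origin), NDV(dest), NDV(carrier) FROM airlines_external;

-- Normally rejected in this Impala release: multiple COUNT(DISTINCT)
-- expressions on different columns in a single query.
-- > SELECT COUNT(DISTINCT origin), COUNT(DISTINCT dest) FROM airlines_external;

-- Assumption: with this option set, the COUNT(DISTINCT) form above is
-- rewritten to NDV() and becomes legal, returning estimates.
> SET APPX_COUNT_DISTINCT=true;</codeblock>

<p> With that in mind, we use <codeph>NDV()</codeph> to profile several columns of the
airline data in a single statement.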
</p>
+
+<codeblock>> SELECT NDV(carrier), NDV(flight_num), NDV(tail_num),
+ NDV(origin), NDV(dest) FROM airlines_external;
+--------------+-----------------+---------------+-------------+-----------+
| ndv(carrier) | ndv(flight_num) | ndv(tail_num) | ndv(origin) | ndv(dest) |
+--------------+-----------------+---------------+-------------+-----------+
-| 29 | 9086 | 3 | 340 | 347 |
+| 29 | 8463 | 3 | 342 | 349 |
+--------------+-----------------+---------------+-------------+-----------+
-[localhost:21000] > select tail_num, count(*) as howmany from airlines_external
- > group by tail_num;
+
+> SELECT tail_num, COUNT(*) AS howmany FROM airlines_external
+ GROUP BY tail_num;
+----------+-----------+
| tail_num | howmany |
+----------+-----------+
+| NULL | 123122001 |
| 715 | 1 |
| 0 | 406405 |
| 112 | 6562 |
-| NULL | 123122001 |
+----------+-----------+
-Fetched 1 row(s) in 5.18s
-[localhost:21000] > select distinct dest from airlines_external
- > where dest not in (select origin from airlines_external);
+
+> SELECT DISTINCT dest FROM airlines_external
+ WHERE dest NOT IN (SELECT origin FROM airlines_external);
+------+
| dest |
+------+
-| LBF |
| CBM |
-| RCA |
| SKA |
| LAR |
+| RCA |
+| LBF |
+------+
-Fetched 5 row(s) in 39.64s
-[localhost:21000] > select distinct dest from airlines_external
- > where dest not in (select distinct origin from airlines_external);
+
+> SELECT DISTINCT dest FROM airlines_external
+ WHERE dest NOT IN (SELECT DISTINCT origin FROM airlines_external);
+------+
| dest |
+------+
-| LBF |
-| RCA |
| CBM |
| SKA |
| LAR |
+| RCA |
+| LBF |
+------+
-Fetched 5 row(s) in 5.59s
-[localhost:21000] > select distinct origin from airlines_external
- > where origin not in (select distinct dest from airlines_external);
-Fetched 0 row(s) in 5.37s
-</codeblock>
+
+> SELECT DISTINCT origin FROM airlines_external
+ WHERE origin NOT IN (SELECT DISTINCT dest FROM airlines_external);
+Fetched 0 row(s) in 2.63s</codeblock>
+ <p>With the above queries, we see that there are modest numbers of
+ different airlines, flight numbers, and origin and destination
+ airports. Two things jump out from this query: the number of
+ <codeph>tail_num</codeph> values is much smaller than we might have
+ expected, and there are more destination airports than origin
+ airports. Let's dig further. What we find is that most
+ <codeph>tail_num</codeph> values are <codeph>NULL</codeph>. It looks
+ like this was an experimental column that wasn't filled in accurately.
+ We make a mental note that if we use this data as a starting point,
+ we'll ignore this column. We also find that certain airports are
+ represented in the <codeph>ORIGIN</codeph> column but not the
+ <codeph>DEST</codeph> column; now we know that we cannot rely on the
+ assumption that those sets of airport codes are identical. </p>
+ <note> The first <codeph>SELECT DISTINCT DEST</codeph> query takes
+ almost 40 seconds. We expect all queries on such a small data set,
+ less than 2 GB, to take a few seconds at most. The reason is that
+ the expression <codeph>NOT IN (SELECT origin FROM
+ airlines_external)</codeph> produces an intermediate result set of
+ 123 million rows, then runs 123 million comparisons on each data node
+ against the tiny set of destination airports. The way the <codeph>NOT
+ IN</codeph> operator works internally means that this intermediate
+ result set with 123 million rows might be transmitted across the
+ network to each data node in the cluster.
Applying another + <codeph>DISTINCT</codeph> inside the <codeph>NOT IN</codeph> + subquery means that the intermediate result set is only 340 items, + resulting in much less network traffic and fewer comparison + operations. The more efficient query with the added + <codeph>DISTINCT</codeph> is approximately 7 times as fast. </note> <p> Next, we try doing a simple calculation, with results broken down by year. - This reveals that some years have no data in the - <codeph>AIRTIME</codeph> column. That means we might be able to use - that column in queries involving certain date ranges, but we cannot - count on it to always be reliable. The question of whether a column - contains any <codeph>NULL</codeph> values, and if so what is their - number, proportion, and distribution, comes up again and again when - doing initial exploration of a data set. </p> - -<codeblock>[localhost:21000] > select year, sum(airtime) from airlines_external - > group by year order by year desc; + This reveals that some years have no data in the + <codeph>airtime</codeph> column. That means we might be able to use + that column in queries involving certain date ranges, but we cannot + count on it to always be reliable. The question of whether a column + contains any <codeph>NULL</codeph> values, and if so what is their + number, proportion, and distribution, comes up again and again when + doing initial exploration of a data set. </p> + +<codeblock>> SELECT year, SUM(airtime) FROM airlines_external + GROUP BY year ORDER BY year DESC; +------+--------------+ | year | sum(airtime) | +------+--------------+ @@ -1932,34 +1849,37 @@ Fetched 0 row(s) in 5.37s +------+--------------+ </codeblock> -<p> -With the notion of <codeph>NULL</codeph> values in mind, let's come back to the <codeph>TAILNUM</codeph> -column that we discovered had a lot of <codeph>NULL</codeph>s. -Let's quantify the <codeph>NULL</codeph> and non-<codeph>NULL</codeph> values in that column for better understanding. -First, we just count the overall number of rows versus the non-<codeph>NULL</codeph> values in that column. -That initial result gives the appearance of relatively few non-<codeph>NULL</codeph> values, but we can break -it down more clearly in a single query. -Once we have the <codeph>COUNT(*)</codeph> and the <codeph>COUNT(<varname>colname</varname>)</codeph> numbers, -we can encode that initial query in a <codeph>WITH</codeph> clause, then run a followon query that performs -multiple arithmetic operations on those values. -Seeing that only one-third of one percent of all rows have non-<codeph>NULL</codeph> values for the -<codeph>TAILNUM</codeph> column clearly illustrates that that column is not of much use. -</p> - -<codeblock>[localhost:21000] > select count(*) as 'rows', count(tail_num) as 'non-null tail numbers' - > from airlines_external; +<p> With the notion of <codeph>NULL</codeph> values in mind, let's come back to + the <codeph>tail_num</codeph> column that we discovered had a lot of + <codeph>NULL</codeph>s. Let's quantify the <codeph>NULL</codeph> and + non-<codeph>NULL</codeph> values in that column for better + understanding. First, we just count the overall number of rows versus + the non-<codeph>NULL</codeph> values in that column. That initial + result gives the appearance of relatively few + non-<codeph>NULL</codeph> values, but we can break it down more + clearly in a single query. 
Once we have the <codeph>COUNT(*)</codeph> + and the <codeph>COUNT(<varname>colname</varname>)</codeph> numbers, we + can encode that initial query in a <codeph>WITH</codeph> clause, then + run a follow-on query that performs multiple arithmetic operations on + those values. Seeing that only one-third of one percent of all rows + have non-<codeph>NULL</codeph> values for the + <codeph>tail_num</codeph> column clearly illustrates that column is + not of much use. </p> + +<codeblock>> SELECT COUNT(*) AS 'rows', COUNT(tail_num) AS 'non-null tail numbers' + FROM airlines_external; +-----------+-----------------------+ | rows | non-null tail numbers | +-----------+-----------------------+ | 123534969 | 412968 | +-----------+-----------------------+ -Fetched 1 row(s) in 1.51s -[localhost:21000] > with t1 as - > (select count(*) as 'rows', count(tail_num) as 'nonnull' - > from airlines_external) - > select `rows`, `nonnull`, `rows` - `nonnull` as 'nulls', - > (`nonnull` / `rows`) * 100 as 'percentage non-null' - > from t1; + +> WITH t1 AS + (SELECT COUNT(*) AS 'rows', COUNT(tail_num) AS 'nonnull' + FROM airlines_external) +SELECT `rows`, `nonnull`, `rows` - `nonnull` AS 'nulls', + (`nonnull` / `rows`) * 100 AS 'percentage non-null' +FROM t1; +-----------+---------+-----------+---------------------+ | rows | nonnull | nulls | percentage non-null | +-----------+---------+-----------+---------------------+ @@ -1967,26 +1887,28 @@ Fetched 1 row(s) in 1.51s +-----------+---------+-----------+---------------------+ </codeblock> -<p> -By examining other columns using these techniques, we can form a mental picture of the way data is distributed -throughout the table, and which columns are most significant for query purposes. For this tutorial, we focus mostly on -the fields likely to hold discrete values, rather than columns such as <codeph>ACTUAL_ELAPSED_TIME</codeph> -whose names suggest they hold measurements. We would dig deeper into those columns once we had a clear picture -of which questions were worthwhile to ask, and what kinds of trends we might look for. -For the final piece of initial exploration, let's look at the <codeph>YEAR</codeph> column. -A simple <codeph>GROUP BY</codeph> query shows that it has a well-defined range, a manageable number of -distinct values, and relatively even distribution of rows across the different years. -</p> - -<codeblock>[localhost:21000] > select min(year), max(year), ndv(year) from airlines_external; +<p> By examining other columns using these techniques, we can form a mental + picture of the way data is distributed throughout the table, and which + columns are most significant for query purposes. For this tutorial, we + focus mostly on the fields likely to hold discrete values, rather than + columns such as <codeph>actual_elapsed_time</codeph> whose names + suggest they hold measurements. We would dig deeper into those columns + once we had a clear picture of which questions were worthwhile to ask, + and what kinds of trends we might look for. For the final piece of + initial exploration, let's look at the <codeph>year</codeph> column. A + simple <codeph>GROUP BY</codeph> query shows that it has a + well-defined range, a manageable number of distinct values, and + relatively even distribution of rows across the different years. 
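</p>

<p> Before turning to <codeph>year</codeph>, here is one more sketch in the same
spirit. This example is an editorial addition rather than part of the original
tutorial; it applies the same NULL-count and range profiling to the
<codeph>distance</codeph> column (output not shown): </p>

<codeblock>> SELECT COUNT(*) AS 'rows',
    COUNT(distance) AS 'non-null distances',
    MIN(distance) AS 'shortest',
    MAX(distance) AS 'longest'
  FROM airlines_external;</codeblock>

<p> The <codeph>year</codeph> exploration itself looks like this.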
</p> + +<codeblock>> SELECT MIN(year), MAX(year), NDV(year) FROM airlines_external; +-----------+-----------+-----------+ | min(year) | max(year) | ndv(year) | +-----------+-----------+-----------+ | 1987 | 2008 | 22 | +-----------+-----------+-----------+ -Fetched 1 row(s) in 2.03s -[localhost:21000] > select year, count(*) howmany from airlines_external - > group by year order by year desc; + +> SELECT year, COUNT(*) howmany FROM airlines_external + GROUP BY year ORDER BY year DESC; +------+---------+ | year | howmany | +------+---------+ @@ -2013,77 +1935,35 @@ Fetched 1 row(s) in 2.03s | 1988 | 5202096 | | 1987 | 1311826 | +------+---------+ -Fetched 22 row(s) in 2.13s </codeblock> -<p> -We could go quite far with the data in this initial raw format, just as we downloaded it from the web. -If the data set proved to be useful and worth persisting in Impala for extensive queries, -we might want to copy it to an internal table, letting Impala manage the data files and perhaps -reorganizing a little for higher efficiency. -In this next stage of the tutorial, we copy the original data into a partitioned table, still in Parquet format. -Partitioning based on the <codeph>YEAR</codeph> column lets us run queries with clauses such as <codeph>WHERE year = 2001</codeph> -or <codeph>WHERE year BETWEEN 1989 AND 1999</codeph>, which can dramatically cut down on I/O by -ignoring all the data from years outside the desired range. -Rather than reading all the data and then deciding which rows are in the matching years, Impala can -zero in on only the data files from specific <codeph>YEAR</codeph> partitions. -To do this, Impala physically reorganizes the data files, putting the rows from each year into -data files in a separate HDFS directory for each <codeph>YEAR</codeph> value. -Along the way, we'll also get rid of the <codeph>TAIL_NUM</codeph> column that proved to be almost entirely <codeph>NULL</codeph>. -</p> - -<p> -The first step is to create a new table with a layout very similar to the original <codeph>AIRLINES_EXTERNAL</codeph> table. -We'll do that by reverse-engineering a <codeph>CREATE TABLE</codeph> statement for the first table, -then tweaking it slightly to include a <codeph>PARTITION BY</codeph> clause for <codeph>YEAR</codeph>, -and excluding the <codeph>TAIL_NUM</codeph> column. -The <codeph>SHOW CREATE TABLE</codeph> statement gives us the starting point. 
-</p> - -<codeblock>[localhost:21000] > show create table airlines_external; -+------------------------------------------------------------------------------------- -| result -+------------------------------------------------------------------------------------- -| CREATE EXTERNAL TABLE airline_data.airlines_external ( -| year INT COMMENT 'inferred from: optional int32 year', -| month INT COMMENT 'inferred from: optional int32 month', -| day INT COMMENT 'inferred from: optional int32 day', -| dayofweek INT COMMENT 'inferred from: optional int32 dayofweek', -| dep_time INT COMMENT 'inferred from: optional int32 dep_time', -| crs_dep_time INT COMMENT 'inferred from: optional int32 crs_dep_time', -| arr_time INT COMMENT 'inferred from: optional int32 arr_time', -| crs_arr_time INT COMMENT 'inferred from: optional int32 crs_arr_time', -| carrier STRING COMMENT 'inferred from: optional binary carrier', -| flight_num INT COMMENT 'inferred from: optional int32 flight_num', -| tail_num INT COMMENT 'inferred from: optional int32 tail_num', -| actual_elapsed_time INT COMMENT 'inferred from: optional int32 actual_elapsed_time', -| crs_elapsed_time INT COMMENT 'inferred from: optional int32 crs_elapsed_time', -| airtime INT COMMENT 'inferred from: optional int32 airtime', -| arrdelay INT COMMENT 'inferred from: optional int32 arrdelay', -| depdelay INT COMMENT 'inferred from: optional int32 depdelay', -| origin STRING COMMENT 'inferred from: optional binary origin', -| dest STRING COMMENT 'inferred from: optional binary dest', -| distance INT COMMENT 'inferred from: optional int32 distance', -| taxi_in INT COMMENT 'inferred from: optional int32 taxi_in', -| taxi_out INT COMMENT 'inferred from: optional int32 taxi_out', -| cancelled INT COMMENT 'inferred from: optional int32 cancelled', -| cancellation_code STRING COMMENT 'inferred from: optional binary cancellation_code', -| diverted INT COMMENT 'inferred from: optional int32 diverted', -| carrier_delay INT COMMENT 'inferred from: optional int32 carrier_delay', -| weather_delay INT COMMENT 'inferred from: optional int32 weather_delay', -| nas_delay INT COMMENT 'inferred from: optional int32 nas_delay', -| security_delay INT COMMENT 'inferred from: optional int32 security_delay', -| late_aircraft_delay INT COMMENT 'inferred from: optional int32 late_aircraft_delay' -| ) -| STORED AS PARQUET -| LOCATION 'hdfs://a1730.example.com:8020/user/impala/staging/airlines' -| TBLPROPERTIES ('numFiles'='0', 'COLUMN_STATS_ACCURATE'='false', -| 'transient_lastDdlTime'='1439425228', 'numRows'='-1', 'totalSize'='0', -| 'rawDataSize'='-1') -+------------------------------------------------------------------------------------- -Fetched 1 row(s) in 0.03s -[localhost:21000] > quit; -</codeblock> +<p> We could go quite far with the data in this initial raw format, just as we + downloaded it from the web. If the data set proved to be useful and + worth persisting in Impala for extensive queries, we might want to + copy it to an internal table, letting Impala manage the data files and + perhaps reorganizing a little for higher efficiency. In this next + stage of the tutorial, we copy the original data into a partitioned + table, still in Parquet format. Partitioning based on the + <codeph>year</codeph> column lets us run queries with clauses such + as <codeph>WHERE year = 2001</codeph> or <codeph>WHERE year BETWEEN + 1989 AND 1999</codeph>, which can dramatically cut down on I/O by + ignoring all the data from years outside the desired range. 
Rather + than reading all the data and then deciding which rows are in the + matching years, Impala can zero in on only the data files from + specific <codeph>year</codeph> partitions. To do this, Impala + physically reorganizes the data files, putting the rows from each year + into data files in a separate HDFS directory for each + <codeph>year</codeph> value. Along the way, we'll also get rid of + the <codeph>tail_num</codeph> column that proved to be almost entirely + <codeph>NULL</codeph>. </p> + +<p> The first step is to create a new table with a layout very similar to the + original <codeph>airlines_external</codeph> table. We'll do that by + reverse-engineering a <codeph>CREATE TABLE</codeph> statement for the + first table, then tweaking it slightly to include a <codeph>PARTITION + BY</codeph> clause for <codeph>year</codeph>, and excluding the + <codeph>tail_num</codeph> column. The <codeph>SHOW CREATE + TABLE</codeph> statement gives us the starting point. </p> <p> Although we could edit that output into a new SQL statement, all the ASCII box characters @@ -2092,17 +1972,10 @@ to start with, we restart the <cmdname>impala-shell</cmdname> command with the <codeph>-B</codeph> option, which turns off the box-drawing behavior. </p> -<codeblock>[localhost:21000] > quit; -Goodbye jrussell -$ impala-shell -i localhost -B -d airline_data; -Starting Impala Shell without Kerberos authentication -Connected to localhost:21000 -<ph conref="../shared/ImpalaVariables.xml#impala_vars/ImpaladBanner"/> -Welcome to the Impala shell. Press TAB twice to see a list of available commands. -... -<ph conref="../shared/ImpalaVariables.xml#impala_vars/ShellBanner"/> -[localhost:21000] > show create table airlines_external; -"CREATE EXTERNAL TABLE airline_data.airlines_external ( +<codeblock>$ impala-shell -i localhost -B -d airlines_data; + +> SHOW CREATE TABLE airlines_external; +"CREATE EXTERNAL TABLE airlines_data.airlines_external ( year INT COMMENT 'inferred from: optional int32 year', month INT COMMENT 'inferred from: optional int32 month', day INT COMMENT 'inferred from: optional int32 day', @@ -2138,7 +2011,6 @@ LOCATION 'hdfs://a1730.example.com:8020/user/impala/staging/airlines' TBLPROPERTIES ('numFiles'='0', 'COLUMN_STATS_ACCURATE'='false', 'transient_lastDdlTime'='1439425228', 'numRows'='-1', 'totalSize'='0', 'rawDataSize'='-1')" -Fetched 1 row(s) in 0.01s </codeblock> <p> @@ -2146,102 +2018,97 @@ After copying and pasting the <codeph>CREATE TABLE</codeph> statement into a tex without the <codeph>-B</codeph> option, to switch back to regular output. </p> -<p> -Next we run the <codeph>CREATE TABLE</codeph> statement that we adapted from the <codeph>SHOW CREATE TABLE</codeph> output. -We kept the <codeph>STORED AS PARQUET</codeph> clause because -we want to rearrange the data somewhat but still keep it in the high-performance -Parquet format. The <codeph>LOCATION</codeph> and <codeph>TBLPROPERTIES</codeph> -clauses are not relevant for this new table, so we edit those out. -Because we are going to partition the new table based on the <codeph>YEAR</codeph> -column, we move that column name (and its type) into a new <codeph>PARTITIONED BY</codeph> clause. 
-</p> - -<codeblock>[localhost:21000] > CREATE TABLE airline_data.airlines - > ( - > month INT, - > day INT, - > dayofweek INT, - > dep_time INT, - > crs_dep_time INT, - > arr_time INT, - > crs_arr_time INT, - > carrier STRING, - > flight_num INT, - > actual_elapsed_time INT, - > crs_elapsed_time INT, - > airtime INT, - > arrdelay INT, - > depdelay INT, - > origin STRING, - > dest STRING, - > distance INT, - > taxi_in INT, - > taxi_out INT, - > cancelled INT, - > cancellation_code STRING, - > diverted INT, - > carrier_delay INT, - > weather_delay INT, - > nas_delay INT, - > security_delay INT, - > late_aircraft_delay INT - > ) - > STORED AS PARQUET - > PARTITIONED BY (year INT); -Fetched 0 row(s) in 0.10s +<p> Next we run the <codeph>CREATE TABLE</codeph> statement that we adapted from + the <codeph>SHOW CREATE TABLE</codeph> output. We kept the + <codeph>STORED AS PARQUET</codeph> clause because we want to + rearrange the data somewhat but still keep it in the high-performance + Parquet format. The <codeph>LOCATION</codeph> and + <codeph>TBLPROPERTIES</codeph> clauses are not relevant for this new + table, so we edit those out. Because we are going to partition the new + table based on the <codeph>year</codeph> column, we move that column + name (and its type) into a new <codeph>PARTITIONED BY</codeph> clause. </p> + +<codeblock>> CREATE TABLE airlines_data.airlines + (month INT, + day INT, + dayofweek INT, + dep_time INT, + crs_dep_time INT, + arr_time INT, + crs_arr_time INT, + carrier STRING, + flight_num INT, + actual_elapsed_time INT, + crs_elapsed_time INT, + airtime INT, + arrdelay INT, + depdelay INT, + origin STRING, + dest STRING, + distance INT, + taxi_in INT, + taxi_out INT, + cancelled INT, + cancellation_code STRING, + diverted INT, + carrier_delay INT, + weather_delay INT, + nas_delay INT, + security_delay INT, + late_aircraft_delay INT) +PARTITIONED BY (year INT) +STORED AS PARQUET +; </codeblock> -<p> -Next, we copy all the rows from the original table into this new one with -an <codeph>INSERT</codeph> statement. (We edited the <codeph>CREATE TABLE</codeph> -statement to make an <codeph>INSERT</codeph> statement with the column names -in the same order.) The only change is to add a <codeph>PARTITION(year)</codeph> -clause, and move the <codeph>YEAR</codeph> column to the very end of the -<codeph>SELECT</codeph> list of the <codeph>INSERT</codeph> statement. -Specifying <codeph>PARTITION(year)</codeph>, rather than a fixed value -such as <codeph>PARTITION(year=2000)</codeph>, means that Impala figures -out the partition value for each row based on the value of the very last column -in the <codeph>SELECT</codeph> list. -This is the first SQL statement that legitimately takes any substantial time, -because the rows from different years are shuffled around the cluster; -the rows that go into each partition are collected on one node, before being -written to one or more new data files. 
-</p> - -<codeblock>[localhost:21000] > INSERT INTO airline_data.airlines - > PARTITION (year) - > SELECT - > month, - > day, - > dayofweek, - > dep_time, - > crs_dep_time, - > arr_time, - > crs_arr_time, - > carrier, - > flight_num, - > actual_elapsed_time, - > crs_elapsed_time, - > airtime, - > arrdelay, - > depdelay, - > origin, - > dest, - > distance, - > taxi_in, - > taxi_out, - > cancelled, - > cancellation_code, - > diverted, - > carrier_delay, - > weather_delay, - > nas_delay, - > security_delay, - > late_aircraft_delay, - > year - > FROM airline_data.airlines_external; -Inserted 123534969 row(s) in 202.70s -</codeblock> +<p> Next, we copy all the rows from the original table into this new one with an + <codeph>INSERT</codeph> statement. (We edited the <codeph>CREATE + TABLE</codeph> statement to make an <codeph>INSERT</codeph> + statement with the column names in the same order.) The only change is + to add a <codeph>PARTITION(year)</codeph> clause, and move the + <codeph>year</codeph> column to the very end of the + <codeph>SELECT</codeph> list of the <codeph>INSERT</codeph> + statement. Specifying <codeph>PARTITION(year)</codeph>, rather than a + fixed value such as <codeph>PARTITION(year=2000)</codeph>, means that + Impala figures out the partition value for each row based on the value + of the very last column in the <codeph>SELECT</codeph> list. This is + the first SQL statement that legitimately takes any substantial time, + because the rows from different years are shuffled around the cluster; + the rows that go into each partition are collected on one node, before + being written to one or more new data files. </p> + +<codeblock>> INSERT INTO airlines_data.airlines + PARTITION (year) + SELECT + month, + day, + dayofweek, + dep_time, + crs_dep_time, + arr_time, + crs_arr_time, + carrier, + flight_num, + actual_elapsed_time, + crs_elapsed_time, + airtime, + arrdelay, + depdelay, + origin, + dest, + distance, + taxi_in, + taxi_out, + cancelled, + cancellation_code, + diverted, + carrier_delay, + weather_delay, + nas_delay, + security_delay, + late_aircraft_delay, + year + FROM airlines_data.airlines_external;</codeblock> <p> Once partitioning or join queries come into play, it's important to have statistics @@ -2253,54 +2120,57 @@ are in place for each partition, and also illustrates how many files and how muc is in each partition. </p> -<codeblock>[localhost:21000] > compute incremental stats airlines; +<codeblock>> COMPUTE INCREMENTAL STATS airlines; +-------------------------------------------+ | summary | +-------------------------------------------+ | Updated 22 partition(s) and 27 column(s). 
| +-------------------------------------------+ -[localhost:21000] > show table stats airlines; -+-------+-----------+--------+----------+--------------+------------+---------+-------------------+ -| year | #Rows | #Files | Size | Bytes Cached | Cache Repl | Format | Incremental stats | -+-------+-----------+--------+----------+--------------+------------+---------+----- -| 1987 | 1311826 | 1 | 9.32MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1988 | 5202096 | 1 | 37.04MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1989 | 5041200 | 1 | 36.25MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1990 | 5270893 | 1 | 38.39MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1991 | 5076925 | 1 | 37.23MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1992 | 5092157 | 1 | 36.85MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1993 | 5070501 | 1 | 37.16MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1994 | 5180048 | 1 | 38.31MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1995 | 5327435 | 1 | 53.14MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1996 | 5351983 | 1 | 53.64MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1997 | 5411843 | 1 | 54.41MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1998 | 5384721 | 1 | 54.01MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1999 | 5527884 | 1 | 56.32MB | NOT CACHED | NOT CACHED | PARQUET | true -| 2000 | 5683047 | 1 | 58.15MB | NOT CACHED | NOT CACHED | PARQUET | true -| 2001 | 5967780 | 1 | 60.65MB | NOT CACHED | NOT CACHED | PARQUET | true -| 2002 | 5271359 | 1 | 57.99MB | NOT CACHED | NOT CACHED | PARQUET | true -| 2003 | 6488540 | 1 | 81.33MB | NOT CACHED | NOT CACHED | PARQUET | true -| 2004 | 7129270 | 1 | 103.19MB | NOT CACHED | NOT CACHED | PARQUET | true -| 2005 | 7140596 | 1 | 102.61MB | NOT CACHED | NOT CACHED | PARQUET | true -| 2006 | 7141922 | 1 | 106.03MB | NOT CACHED | NOT CACHED | PARQUET | true -| 2007 | 7453215 | 1 | 112.15MB | NOT CACHED | NOT CACHED | PARQUET | true -| 2008 | 7009728 | 1 | 105.76MB | NOT CACHED | NOT CACHED | PARQUET | true -| Total | 123534969 | 22 | 1.30GB | 0B | | | -+-------+-----------+--------+----------+--------------+------------+---------+----- + +> SHOW TABLE STATS airlines; ++-------+-----------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------------------------+ +| year | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location | ++-------+-----------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------------------------+ +| 1987 | 1311826 | 1 | 11.75MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1987 | +| 1988 | 5202096 | 1 | 44.04MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1988 | +| 1989 | 5041200 | 1 | 46.07MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1989 | +| 1990 | 5270893 | 1 | 46.25MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1990 | +| 1991 | 5076925 | 1 | 46.77MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1991 | +| 1992 | 5092157 | 1 | 48.21MB | NOT 
CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1992 | +| 1993 | 5070501 | 1 | 47.46MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1993 | +| 1994 | 5180048 | 1 | 47.47MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1994 | +| 1995 | 5327435 | 1 | 62.40MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1995 | +| 1996 | 5351983 | 1 | 62.93MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1996 | +| 1997 | 5411843 | 1 | 65.05MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1997 | +| 1998 | 5384721 | 1 | 62.21MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1998 | +| 1999 | 5527884 | 1 | 65.10MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1999 | +| 2000 | 5683047 | 1 | 67.68MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2000 | +| 2001 | 5967780 | 1 | 74.03MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2001 | +| 2002 | 5271359 | 1 | 74.00MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2002 | +| 2003 | 6488540 | 1 | 99.35MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2003 | +| 2004 | 7129270 | 1 | 123.29MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2004 | +| 2005 | 7140596 | 1 | 120.72MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2005 | +| 2006 | 7141922 | 1 | 121.88MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2006 | +| 2007 | 7453215 | 1 | 130.87MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2007 | +| 2008 | 7009728 | 1 | 123.14MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2008 | +| Total | 123534969 | 22 | 1.55GB | 0B | | | | | ++-------+-----------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------------------------+ </codeblock> -<p> -At this point, we go through a quick thought process to sanity check the partitioning we did. -All the partitions have exactly one file, which is on the low side. A query that includes a -clause <codeph>WHERE year=2004</codeph> will only read a single data block; that data block -will be read and processed by a single data node; therefore, for a query targeting a single year, -all the other nodes in the cluster will sit idle while all the work happens on a single machine. 
 <p>
 However, the more data files, the less data goes in each one. The overhead of dividing the work in a

@@ -2312,15 +2182,19 @@ The way to see how well it works in practice is to run the same queries against
 flat table and the new partitioned table, and compare times.
 </p>

-<p>
-Spoiler: in this case, with my particular 4-node cluster with its specific distribution of data blocks and my
-particular exploratory queries, queries against the partitioned table do consistently run faster than the same queries
-against the unpartitioned table. But I could not be sure that would be the case without some real measurements.
-Here are some queries I ran to draw that conclusion, first against <codeph>AIRLINES_EXTERNAL</codeph> (no partitioning),
-then against <codeph>AIRLINES</codeph> (partitioned by year). The <codeph>AIRLINES</codeph> queries are consistently faster.
-Changing the volume of data, changing the size of the cluster, running queries that did or didn't refer to the partition key
-columns, or other factors could change the results to favor one table layout or the other.
-</p>
+<p> Spoiler: in this case, with my particular 4-node cluster with its specific
+  distribution of data blocks and my particular exploratory queries,
+  queries against the partitioned table do consistently run faster than
+  the same queries against the unpartitioned table. But I could not be
+  sure that would be the case without some real measurements. Here are
+  some queries I ran to draw that conclusion, first against
+  <codeph>airlines_external</codeph> (no partitioning), then against
+  <codeph>airlines</codeph> (partitioned by year). The
+  <codeph>airlines</codeph> queries are consistently faster. Changing
+  the volume of data, changing the size of the cluster, running queries
+  that did or didn't refer to the partition key columns, or other
+  factors could change the results to favor one table layout or the
+  other. </p>
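+
+<p> When you repeat this kind of comparison yourself, the timing line that
+  <cmdname>impala-shell</cmdname> prints after each query is usually
+  enough. To see where the time actually goes, you can also type the
+  shell's <codeph>SUMMARY</codeph> command (or <codeph>PROFILE</codeph>
+  for full detail) right after a query finishes; the exact numbers
+  depend entirely on your cluster. A sketch of the idea: </p>
+
+<codeblock>> SELECT SUM(airtime) FROM airlines WHERE year = 2005;
+> SUMMARY;
+</codeblock>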
 <note>
 If you find the volume of each partition is only in the low tens of megabytes, consider lowering the granularity

@@ -2330,64 +2204,36 @@ even hundreds of megabytes per Parquet file, and the number of Parquet files in
 higher than the number of data nodes.
 </note>

-<codeblock>[localhost:21000] > select sum(airtime) from airlines_external;
-+--------------+
-| sum(airtime) |
+<codeblock>> SELECT SUM(airtime) FROM airlines_external;
 +--------------+
 | 8662859484   |
 +--------------+
-Fetched 1 row(s) in 2.02s
-[localhost:21000] > select sum(airtime) from airlines;
-+--------------+
-| sum(airtime) |
+
+> SELECT SUM(airtime) FROM airlines;
 +--------------+
 | 8662859484   |
 +--------------+
-Fetched 1 row(s) in 1.21s
-[localhost:21000] > select sum(airtime) from airlines_external where year = 2005;
-+--------------+
-| sum(airtime) |
-+--------------+
-| 708204026    |
-+--------------+
-Fetched 1 row(s) in 2.61s
-[localhost:21000] > select sum(airtime) from airlines where year = 2005;
-+--------------+
-| sum(airtime) |
+> SELECT SUM(airtime) FROM airlines_external WHERE year = 2005;
 +--------------+
 | 708204026    |
 +--------------+
-Fetched 1 row(s) in 1.19s
-[localhost:21000] > select sum(airtime) from airlines_external where dayofweek = 1;
+> SELECT SUM(airtime) FROM airlines WHERE year = 2005;
 +--------------+
-| sum(airtime) |
-+--------------+
-| 1264945051   |
-+--------------+
-Fetched 1 row(s) in 2.82s
-[localhost:21000] > select sum(airtime) from airlines where dayofweek = 1;
-+--------------+
-| sum(airtime) |
-+--------------+
-| 1264945051   |
+| 708204026    |
 +--------------+
-Fetched 1 row(s) in 1.61s
 </codeblock>

-<p>
-Now we can finally do some serious analysis with this data set that, remember,
-a few minutes ago all we had were some raw data files and we didn't even know
-what columns they contained.
-Let's see whether the <q>air time</q> of a flight tends to be different depending on the
-day of the week. We can see that the average is a little higher on day number 6;
-perhaps Saturday is a busy flying day and planes have to circle for longer at the
-destination airport before landing.
-</p>
+<p> Now we can finally analyze this data set. Remember, a few minutes ago all
+  we had were some raw data files, and we did not even know what columns
+  they contained. Let's see whether the <codeph>airtime</codeph> of a
+  flight tends to be different depending on the day of the week. We can
+  see that the average is a little higher on day number 6; perhaps
+  Saturday is a busy flying day and planes have to circle for longer at
+  the destination airport before landing. </p>

-<codeblock>[localhost:21000] > select dayofweek, avg(airtime) from airlines
-                  > group by dayofweek order by dayofweek;
+<codeblock>> SELECT dayofweek, AVG(airtime) FROM airlines
+  GROUP BY dayofweek ORDER BY dayofweek;
 +-----------+-------------------+
 | dayofweek | avg(airtime)      |
 +-----------+-------------------+

@@ -2399,7 +2245,6 @@ destination airport before landing.
 | 6         | 105.3627448363705 |
 | 7         | 103.4144351202054 |
 +-----------+-------------------+
-Fetched 7 row(s) in 2.25s
 </codeblock>

 <p>
@@ -2410,8 +2255,8 @@ of <codeph>NULL</codeph> for this column in years 1987 to 1994 shows that queries
 need to be restricted to a date range of 1995 and higher.
 </p>
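+
+<p> You can verify that boundary yourself before trusting a date-restricted
+  query. The sketch below assumes <codeph>airtime</codeph> is the column
+  holding the <codeph>NULL</codeph> values in the early years; the
+  counts you see will depend on your copy of the data set: </p>
+
+<codeblock>> SELECT year, COUNT(*) AS missing_airtime  -- assumes airtime holds the NULLs
+  FROM airlines
+  WHERE airtime IS NULL
+  GROUP BY year ORDER BY year;
+</codeblock>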
-<codeblock>[localhost:21000] > select year, dayofweek, avg(airtime) from airlines
-                  > group by year, dayofweek order by year desc, dayofweek;
+<codeblock>> SELECT year, dayofweek, AVG(airtime) FROM airlines
+  GROUP BY year, dayofweek ORDER BY year DESC, dayofweek;
 +------+-----------+-------------------+
 | year | dayofweek | avg(airtime)      |
 +------+-----------+-------------------+

@@ -2456,28 +2301,5 @@ need to be restricted to a date range of 1995 and higher.
 </conbody>
 </concept>
-
-  <concept audience="hidden" id="tut_synthesizing">
-
-    <title>Synthesizing Data</title>
-
-    <conbody>
-
-<!-- Scenarios:
-  INSERT VALUES 1 row or a few rows, then duplicate the values with slight alterations.
-  - Use row_number() and rank() to assign IDs to values from narrow tables.
-  - Use sum() in analytic context, with sliding window, to derive other arbitrary numbers.
-  Grab database, table, column, etc. names from metastore and play with those.
-  Maybe take results of a 'find' and play with those.
-  - Invent fields like 'username' based on /home or whatever subdirectories.
-  - Perhaps also capture permissions, timestamps, etc. via 'ls -l'.
-  - Illustrate narrow table with 'extension' field, wide table with 'jpg', 'Office', 'picture', 'XML', etc. fields
-    based on case() of extension.
-  Decompose some string values using regexp_extract.
-  Set up views to get the logical layout right before making new physical copies of data.
--->
-
-    <p></p>
-  </conbody>
 </concept>

 </concept>