IMPALA-6827: [DOCS] Updated the download link for the tutorial data

Updated the link to download the Parquet airline files for the tutorial.
Change-Id: I6823d1688169e0a6f09d5b552026bc18a3770828 Reviewed-on: http://gerrit.cloudera.org:8080/10393 Reviewed-by: Michael Brown <mi...@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Project: http://git-wip-us.apache.org/repos/asf/impala/repo Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/04add98a Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/04add98a Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/04add98a Branch: refs/heads/2.x Commit: 04add98a341f3ae8e1e4e0613c82188eec5fc0d9 Parents: 99e379d Author: Alex Rodoni <arod...@cloudera.com> Authored: Mon May 14 13:30:42 2018 -0700 Committer: Impala Public Jenkins <impala-public-jenk...@gerrit.cloudera.org> Committed: Thu May 17 22:03:02 2018 +0000 ---------------------------------------------------------------------- docs/topics/impala_tutorial.xml | 1110 ++++++++++++++-------------------- 1 file changed, 466 insertions(+), 644 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/impala/blob/04add98a/docs/topics/impala_tutorial.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_tutorial.xml b/docs/topics/impala_tutorial.xml index 01e53e2..72665ef 100644 --- a/docs/topics/impala_tutorial.xml +++ b/docs/topics/impala_tutorial.xml @@ -1484,81 +1484,6 @@ Returned 20 row(s) in 0.38s Returned 10 row(s) in 0.39s</codeblock> </conbody> </concept> - - <concept audience="hidden" id="tut_connect"> - - <title>Connecting to Impala through impala-shell</title> - - <conbody> - - <p> - This tutorial provides some tips for connecting to Impala through the <cmdname>impala-shell</cmdname> - interpreter, and gives usage tips for particular scenarios. - </p> - - <ul> - <li> - <p> - By default, <cmdname>impala-shell</cmdname> with no arguments connects to the current host, port - 21000. - </p> - </li> - - <li> - <p> - <codeph>-i localhost</codeph> option gives you a shorter command prompt. - </p> - </li> - - <li> - <p> - The <codeph>--quiet</codeph> option suppresses some informational messages, such as the - <codeph>Query:</codeph> line that echoes each command entered. - </p> - </li> - - <li> - <codeph>CONNECT <varname>hostname</varname></codeph> lets you switch to another host (keeping the same - default port 21000). - </li> - - <li> - <p> - You could run <cmdname>impala-shell</cmdname> from a completely different host to keep login accounts - off the actual Impala nodes. - </p> - </li> - - <li> - <p> - For load balancing, you would connect to different coordinator nodes for concurrent queries. - </p> - </li> - - <li> - <p> - For benchmarking, you would skip pretty printing with the <codeph>-B</codeph> option, or even skip - all screen output with the <codeph>-o</codeph> option or shell redirection. - </p> - </li> - </ul> - </conbody> - </concept> - </concept> - - <concept audience="hidden" id="tut_mem_limit"> - - <title>Effects of Memory Limits on Impala Queries</title> - <prolog> - <metadata> - <data name="Category" value="Memory"/> - </metadata> - </prolog> - - <conbody> - - <p></p> - </conbody> </concept> <concept id="tut_parquet_schemaless"> @@ -1591,319 +1516,311 @@ for purposes of this exercise, wait until after following the tutorial before ex a real-life situation where you cannot rely on assumptions and assertions about the ranges and representations of data values. 
</p>
+ </conbody>
+ <concept id="download_hdfs">
+ <title>Download the Data Files into HDFS</title>

-<p>
-We will download Parquet files containing this data from the Ibis blog.
-First, we download and unpack the data files.
-There are 8 files totalling 1.4 GB.
-Each file is less than 256 MB.
-</p>
-
-<codeblock>$ wget -O airlines_parquet.tar.gz https://www.dropbox.com/s/ol9x51tqp6cv4yc/airlines_parquet.tar.gz?dl=0
-...
-Length: 1245204740 (1.2G) [application/octet-stream]
-Saving to: ‘airlines_parquet.tar.gz’
+<conbody>
+<p> First, we download and unpack the data files. There are 8 files totalling
+ 1.4 GB.</p>

-2015-08-12 17:14:24 (23.6 MB/s) - ‘airlines_parquet.tar.gz’ saved [1245204740/1245204740]
+<codeblock>$ wget -O airlines_parquet.tar.gz https://home.apache.org/~arodoni/airlines_parquet.tar.gz
+$ wget https://home.apache.org/~arodoni/airlines_parquet.tar.gz.sha512
+$ shasum -a 512 -c airlines_parquet.tar.gz.sha512
+airlines_parquet.tar.gz: OK

$ tar xvzf airlines_parquet.tar.gz
-airlines_parquet/
-airlines_parquet/93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq
-airlines_parquet/93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq
-airlines_parquet/93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq
-airlines_parquet/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq
-airlines_parquet/93459d994898a9ba-77674173b331fa97_574780876_data.0.parq
-airlines_parquet/93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq
-airlines_parquet/93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq
-airlines_parquet/93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq
-$ cd airlines_parquet/
-$ du -kch *.parq
-253M 93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq
-65M 93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq
-156M 93459d994898a9ba-77674173b331fa97_574780876_data.0.parq
-240M 93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq
-253M 93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq
-16M 93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq
-177M 93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq
-213M 93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq
-1.4G total
-</codeblock>
-<p>
-Next, we put the Parquet data files in HDFS, all together in a single directory,
-with permissions on the directory and the files so that the <codeph>impala</codeph>
-user will be able to read them.
-</p>
-
-<note>
-After unpacking, we saw the largest Parquet file was 253 MB.
-When copying Parquet files into HDFS for Impala to use,
-for maximum query performance, make sure that each file resides in a single HDFS data block.
-Therefore, we pick a size larger than any single file and specify that as the block size, using the argument
-<codeph>-Ddfs.block.size=256m</codeph> on the <codeph>hdfs dfs -put</codeph> command.
-</note> +$ cd airlines_parquet/ -<codeblock>$ hdfs dfs -mkdir -p hdfs://demo_host.example.com:8020/user/impala/staging/airlines -$ hdfs dfs -Ddfs.block.size=256m -put *.parq /user/impala/staging/airlines -$ hdfs dfs -ls /user/impala/staging +$ du -kch *.parq +253M 4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq +14M 4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.1.parq +253M 4345e5eef217aa1b-c8f16177f35fd984_501176748_data.0.parq +64M 4345e5eef217aa1b-c8f16177f35fd984_501176748_data.1.parq +184M 4345e5eef217aa1b-c8f16177f35fd985_1199995767_data.0.parq +241M 4345e5eef217aa1b-c8f16177f35fd986_2086627597_data.0.parq +212M 4345e5eef217aa1b-c8f16177f35fd987_1048668565_data.0.parq +152M 4345e5eef217aa1b-c8f16177f35fd988_1432111844_data.0.parq +1.4G total</codeblock> + +<p> Next, we put the Parquet data files in HDFS, all together in a single + directory, with permissions on the directory and the files so that the + <codeph>impala</codeph> user will be able to read them.</p> + <p>After unpacking, we saw the largest Parquet file was 253 MB. When + copying Parquet files into HDFS for Impala to use, for maximum query + performance, make sure that each file resides in a single HDFS data + block. Therefore, we pick a size larger than any single file and + specify that as the block size, using the argument + <codeph>-Ddfs.block.size=253m</codeph> on the <codeph>hdfs dfs + -put</codeph> command. </p> + +<codeblock>$ sudo -u hdfs hdfs dfs -mkdir -p /user/impala/staging/airlines +$ sudo -u hdfs hdfs dfs -Ddfs.block.size=253m -put *.parq /user/impala/staging/airlines +$ sudo -u hdfs hdfs dfs -ls /user/impala/staging Found 1 items -drwxrwxrwx - hdfs supergroup 0 2015-08-12 13:52 /user/impala/staging/airlines -$ hdfs dfs -ls hdfs://demo_host.example.com:8020/user/impala/staging/airlines + +$ sudo -u hdfs hdfs dfs -ls /user/impala/staging/airlines Found 8 items --rw-r--r-- 3 jrussell supergroup 265107489 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq --rw-r--r-- 3 jrussell supergroup 67544715 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq --rw-r--r-- 3 jrussell supergroup 162556490 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa97_574780876_data.0.parq --rw-r--r-- 3 jrussell supergroup 251603518 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq --rw-r--r-- 3 jrussell supergroup 265186603 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq --rw-r--r-- 3 jrussell supergroup 16663754 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq --rw-r--r-- 3 jrussell supergroup 185511677 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq --rw-r--r-- 3 jrussell supergroup 222794621 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq </codeblock> +</conbody> + </concept> -<p> -With the files in an accessible location in HDFS, we create a database table that uses the data in those files. -The <codeph>CREATE EXTERNAL</codeph> syntax and the <codeph>LOCATION</codeph> attribute point Impala at the appropriate HDFS directory. 
-The <codeph>LIKE PARQUET '<varname>path_to_any_parquet_file</varname>'</codeph> clause means we skip the list of column names and types; -Impala automatically gets the column names and data types straight from the data files. -(Currently, this technique only works for Parquet files.) -We ignore the warning about lack of <codeph>READ_WRITE</codeph> access to the files in HDFS; -the <codeph>impala</codeph> user can read the files, which will be sufficient for us to experiment with -queries and perform some copy and transform operations into other tables. -</p> + <concept id="create_tables"> + <title>Create Database and Tables</title> + <conbody> -<codeblock>$ impala-shell -i localhost -Starting Impala Shell without Kerberos authentication +<p> With the files in an accessible location in HDFS, you create a database + table that uses the data in those files:<ul> + <li>The <codeph>CREATE EXTERNAL</codeph> syntax and the + <codeph>LOCATION</codeph> attribute point Impala at the + appropriate HDFS directory.</li> + <li>The <codeph>LIKE PARQUET + '<varname>path_to_any_parquet_file</varname>'</codeph> clause + means we skip the list of column names and types; Impala + automatically gets the column names and data types straight from + the data files. (Currently, this technique only works for Parquet + files.) </li> + <li>Ignore the warning about lack of <codeph>READ_WRITE</codeph> + access to the files in HDFS; the <codeph>impala</codeph> user can + read the files, which will be sufficient for us to experiment with + queries and perform some copy and transform operations into other + tables. </li> + </ul></p> -Connected to localhost:21000 -<ph conref="../shared/ImpalaVariables.xml#impala_vars/ImpaladBanner"/> -Welcome to the Impala shell. Press TAB twice to see a list of available commands. -... -<ph conref="../shared/ImpalaVariables.xml#impala_vars/ShellBanner"/> -[localhost:21000] > create database airline_data; -[localhost:21000] > use airline_data; -[localhost:21000] > create external table airlines_external - > like parquet 'hdfs://demo_host.example.com:8020/user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq' - > stored as parquet location 'hdfs://demo_host.example.com:8020/user/impala/staging/airlines'; -WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://demo_host.example.com:8020/user/impala/staging' +<codeblock>$ impala-shell +> CREATE DATABASE airlines_data; + USE airlines_data; + CREATE EXTERNAL TABLE airlines_external + LIKE PARQUET 'hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq' + STORED AS PARQUET LOCATION 'hdfs:staging/airlines'; +WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://myhost.com:8020/user/impala/staging' </codeblock> + </conbody> + </concept> + <concept id="examine_schema "> + <title>Examine Physical and Logical Schema</title> + <conbody> -<p> -With the table created, we examine its physical and logical characteristics to confirm that the data is really -there and in a format and shape that we can work with. -The <codeph>SHOW TABLE STATS</codeph> statement gives a very high-level summary of the table, -showing how many files and how much total data it contains. -Also, it confirms that the table is expecting all the associated data files to be in Parquet format. 
-(The ability to work with all kinds of HDFS data files in different formats means that it is -possible to have a mismatch between the format of the data files, and the format -that the table expects the data files to be in.) -The <codeph>SHOW FILES</codeph> statement confirms that the data in the table has the expected number, -names, and sizes of the original Parquet files. -The <codeph>DESCRIBE</codeph> statement (or its abbreviation <codeph>DESC</codeph>) confirms the names and types -of the columns that Impala automatically created after reading that metadata from the Parquet file. -The <codeph>DESCRIBE FORMATTED</codeph> statement prints out some extra detail along with the column definitions; -the pieces we care about for this exercise are the containing database for the table, -the location of the associated data files in HDFS, the fact that it's an external table so Impala will not -delete the HDFS files when we finish the experiments and drop the table, and the fact that the -table is set up to work exclusively with files in the Parquet format. -</p> - -<codeblock>[localhost:21000] > show table stats airlines_external; +<p> With the table created, we examine its physical and logical characteristics + to confirm that the data is really there and in a format and shape + that we can work with. <ul> + <li>The <codeph>SHOW TABLE STATS</codeph> statement gives a very + high-level summary of the table, showing how many files and how + much total data it contains. Also, it confirms that the table is + expecting all the associated data files to be in Parquet format. + (The ability to work with all kinds of HDFS data files in + different formats means that it is possible to have a mismatch + between the format of the data files, and the format that the + table expects the data files to be in.) </li> + <li>The <codeph>SHOW FILES</codeph> statement confirms that the data + in the table has the expected number, names, and sizes of the + original Parquet files.</li> + <li>The <codeph>DESCRIBE</codeph> statement (or its abbreviation + <codeph>DESC</codeph>) confirms the names and types of the + columns that Impala automatically created after reading that + metadata from the Parquet file. </li> + <li>The <codeph>DESCRIBE FORMATTED</codeph> statement prints out + some extra detail along with the column definitions. 
The pieces we + care about for this exercise are: <ul> + <li>The containing database for the table.</li> + <li>The location of the associated data files in HDFS.</li> + <li>The table is an external table so Impala will not delete the + HDFS files when we finish the experiments and drop the + table.</li> + <li>The table is set up to work exclusively with files in the + Parquet format.</li> + </ul></li> + </ul></p> + +<codeblock>> SHOW TABLE STATS airlines_external; +-------+--------+--------+--------------+-------------------+---------+-------------------+ | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | +-------+--------+--------+--------------+-------------------+---------+-------------------+ | -1 | 8 | 1.34GB | NOT CACHED | NOT CACHED | PARQUET | false | +-------+--------+--------+--------------+-------------------+---------+-------------------+ -[localhost:21000] > show files in airlines_external; + +> SHOW FILES IN airlines_external; +----------------------------------------------------------------------------------------+----------+-----------+ | path | size | partition | +----------------------------------------------------------------------------------------+----------+-----------+ -| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq | 252.83MB | | -| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq | 64.42MB | | -| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa97_574780876_data.0.parq | 155.03MB | | -| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq | 239.95MB | | -| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq | 252.90MB | | -| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq | 15.89MB | | -| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq | 176.92MB | | -| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq | 212.47MB | | +| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq | 252.99MB | | +| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.1.parq | 13.43MB | | +| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd984_501176748_data.0.parq | 252.84MB | | +| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd984_501176748_data.1.parq | 63.92MB | | +| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd985_1199995767_data.0.parq | 183.64MB | | +| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd986_2086627597_data.0.parq | 240.04MB | | +| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd987_1048668565_data.0.parq | 211.35MB | | +| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd988_1432111844_data.0.parq | 151.46MB | | +----------------------------------------------------------------------------------------+----------+-----------+ -[localhost:21000] > describe airlines_external; + +> DESCRIBE airlines_external; +---------------------+--------+---------------------------------------------------+ | name | type | comment | +---------------------+--------+---------------------------------------------------+ -| year | int | inferred from: optional int32 year | -| month | int | inferred from: optional int32 month | -| day | int | inferred from: optional int32 day | -| dayofweek | int | inferred from: optional int32 dayofweek | -| dep_time | int 
| inferred from: optional int32 dep_time | -| crs_dep_time | int | inferred from: optional int32 crs_dep_time | -| arr_time | int | inferred from: optional int32 arr_time | -| crs_arr_time | int | inferred from: optional int32 crs_arr_time | -| carrier | string | inferred from: optional binary carrier | -| flight_num | int | inferred from: optional int32 flight_num | -| tail_num | int | inferred from: optional int32 tail_num | -| actual_elapsed_time | int | inferred from: optional int32 actual_elapsed_time | -| crs_elapsed_time | int | inferred from: optional int32 crs_elapsed_time | -| airtime | int | inferred from: optional int32 airtime | -| arrdelay | int | inferred from: optional int32 arrdelay | -| depdelay | int | inferred from: optional int32 depdelay | -| origin | string | inferred from: optional binary origin | -| dest | string | inferred from: optional binary dest | -| distance | int | inferred from: optional int32 distance | -| taxi_in | int | inferred from: optional int32 taxi_in | -| taxi_out | int | inferred from: optional int32 taxi_out | -| cancelled | int | inferred from: optional int32 cancelled | -| cancellation_code | string | inferred from: optional binary cancellation_code | -| diverted | int | inferred from: optional int32 diverted | -| carrier_delay | int | inferred from: optional int32 carrier_delay | -| weather_delay | int | inferred from: optional int32 weather_delay | -| nas_delay | int | inferred from: optional int32 nas_delay | -| security_delay | int | inferred from: optional int32 security_delay | -| late_aircraft_delay | int | inferred from: optional int32 late_aircraft_delay | +| year | int | Inferred from Parquet file. | +| month | int | Inferred from Parquet file. | +| day | int | Inferred from Parquet file. | +| dayofweek | int | Inferred from Parquet file. | +| dep_time | int | Inferred from Parquet file. | +| crs_dep_time | int | Inferred from Parquet file. | +| arr_time | int | Inferred from Parquet file. | +| crs_arr_time | int | Inferred from Parquet file. | +| carrier | string | Inferred from Parquet file. | +| flight_num | int | Inferred from Parquet file. | +| tail_num | int | Inferred from Parquet file. | +| actual_elapsed_time | int | Inferred from Parquet file. | +| crs_elapsed_time | int | Inferred from Parquet file. | +| airtime | int | Inferred from Parquet file. | +| arrdelay | int | Inferred from Parquet file. | +| depdelay | int | Inferred from Parquet file. | +| origin | string | Inferred from Parquet file. | +| dest | string | Inferred from Parquet file. | +| distance | int | Inferred from Parquet file. | +| taxi_in | int | Inferred from Parquet file. | +| taxi_out | int | Inferred from Parquet file. | +| cancelled | int | Inferred from Parquet file. | +| cancellation_code | string | Inferred from Parquet file. | +| diverted | int | Inferred from Parquet file. | +| carrier_delay | int | Inferred from Parquet file. | +| weather_delay | int | Inferred from Parquet file. | +| nas_delay | int | Inferred from Parquet file. | +| security_delay | int | Inferred from Parquet file. | +| late_aircraft_delay | int | Inferred from Parquet file. | +---------------------+--------+---------------------------------------------------+ -[localhost:21000] > desc formatted airlines_external; + +> DESCRIBE FORMATTED airlines_external; +------------------------------+------------------------------- | name | type +------------------------------+------------------------------- ... 
| # Detailed Table Information | NULL -| Database: | airline_data -| Owner: | jrussell +| Database: | airlines_data +| Owner: | impala ... | Location: | /user/impala/staging/airlines | Table Type: | EXTERNAL_TABLE ... | # Storage Information | NULL -| SerDe Library: | parquet.hive.serde.ParquetHiveSerDe -| InputFormat: | parquet.hive.DeprecatedParquetInputFormat -| OutputFormat: | parquet.hive.DeprecatedParquetOutputFormat +| SerDe Library: | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe +| InputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputForma +| OutputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat ... </codeblock> + </conbody></concept> + <concept id="examine_data"> + <title>Analyze Data</title> + <conbody> -<p> -Now that we are confident that the connections are solid between the Impala table and the -underlying Parquet files, we run some initial queries to understand the characteristics -of the data: the overall number of rows, and the ranges and how many -different values are in certain columns. -For convenience in understanding the magnitude of the <codeph>COUNT(*)</codeph> -result, we run another query dividing the number of rows by 1 million, demonstrating that there are 123 million rows in the table. -</p> - -<!-- I think this is the very longest line that could possibly fit without wrapping in the PDF output; 87 characters wide. -[localhost:21000] > select count(*) / 1e6 as 'millions of rows' from airlines_external; ---> +<p> Now that we are confident that the connections are solid between the Impala + table and the underlying Parquet files, we run some initial queries to + understand the characteristics of the data: the overall number of + rows, and the ranges and how many different values are in certain + columns. </p> -<codeblock>[localhost:21000] > select count(*) from airlines_external; +<codeblock>> SELECT COUNT(*) FROM airlines_external; +-----------+ | count(*) | +-----------+ | 123534969 | +-----------+ -Fetched 1 row(s) in 1.32s -[localhost:21000] > select count(*) / 1e6 as 'millions of rows' from airlines_external; -+------------------+ -| millions of rows | -+------------------+ -| 123.534969 | -+------------------+ -Fetched 1 row(s) in 1.24s </codeblock> -<p> The <codeph>NDV()</codeph> function stands for <q>number of distinct - values</q>, which for performance reasons is an estimate when there - are lots of different values in the column, but is precise when the - cardinality is less than 16 K. Use <codeph>NDV()</codeph> calls for this - kind of exploration rather than <codeph>COUNT(DISTINCT - <varname>colname</varname>)</codeph>, because Impala can evaluate - multiple <codeph>NDV()</codeph> functions in a single query, but only a - single instance of <codeph>COUNT DISTINCT</codeph>. Here we see that - there are modest numbers of different airlines, flight numbers, and - origin and destination airports. Two things jump out from this query: - the number of <codeph>tail_num</codeph> values is much smaller than we - might have expected, and there are more destination airports than origin - airports. Let's dig further. What we find is that most - <codeph>tail_num</codeph> values are <codeph>NULL</codeph>. It looks - like this was an experimental column that wasn't filled in accurately. - We make a mental note that if we use this data as a starting point, - we'll ignore this column. 
We also find that certain airports are
- represented in the <codeph>ORIGIN</codeph> column but not the
- <codeph>DEST</codeph> column; now we know that we cannot rely on the
- assumption that those sets of airport codes are identical. </p>

-<note>
-A slight digression for some performance tuning. Notice how the first
-<codeph>SELECT DISTINCT DEST</codeph> query takes almost 40 seconds.
-We expect all queries on such a small data set, less than 2 GB, to
-take a few seconds at most. The reason is because the expression
-<codeph>NOT IN (SELECT origin FROM airlines_external)</codeph>
-produces an intermediate result set of 123 million rows, then
-runs 123 million comparisons on each data node against the tiny set of destination airports.
-The way the <codeph>NOT IN</codeph> operator works internally means that
-this intermediate result set with 123 million rows might be transmitted
-across the network to each data node in the cluster.
-Applying another <codeph>DISTINCT</codeph> inside the <codeph>NOT IN</codeph>
-subquery means that the intermediate result set is only 340 items,
-resulting in much less network traffic and fewer comparison operations.
-The more efficient query with the added <codeph>DISTINCT</codeph> is approximately 7 times as fast.
-</note>

-<codeblock>[localhost:21000] > select ndv(carrier), ndv(flight_num), ndv(tail_num),
- > ndv(origin), ndv(dest) from airlines_external;
+<p> The <codeph>NDV()</codeph> function returns the number of distinct values,
+ which, for performance reasons, is an estimate when there are lots of
+ different values in the column, but is precise when the cardinality is
+ less than 16 K. Use the <codeph>NDV()</codeph> function for this kind of
+ exploration rather than <codeph>COUNT(DISTINCT
+ <varname>colname</varname>)</codeph>, because Impala can evaluate
+ multiple <codeph>NDV()</codeph> functions in a single query, but only
+ a single instance of <codeph>COUNT DISTINCT</codeph>.
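</p>

<p> As an editorial aside (not part of the original tutorial), the contrast is easy to
see in a sketch. A single query may contain many <codeph>NDV()</codeph> calls, while
the Impala releases this tutorial targets reject multiple
<codeph>COUNT(DISTINCT)</codeph> expressions on different columns in one query. The
<codeph>APPX_COUNT_DISTINCT</codeph> query option, where available, is assumed here to
rewrite <codeph>COUNT(DISTINCT)</codeph> into <codeph>NDV()</codeph> so that the second
form also runs, at the cost of returning estimates: </p>

<codeblock>-- Works: any number of NDV() estimates in one pass over the table.
> SELECT NDV(origin), NDV(dest), NDV(carrier) FROM airlines_external;

-- Normally rejected in this Impala release: multiple COUNT(DISTINCT)
-- expressions on different columns in a single query.
-- > SELECT COUNT(DISTINCT origin), COUNT(DISTINCT dest) FROM airlines_external;

-- Assumption: with this option set, the COUNT(DISTINCT) form above is
-- rewritten to NDV() and becomes legal, returning estimates.
> SET APPX_COUNT_DISTINCT=true;</codeblock>

<p> With that in mind, we use <codeph>NDV()</codeph> to profile several columns of the
airline data in a single statement.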
</p>
+
+<codeblock>> SELECT NDV(carrier), NDV(flight_num), NDV(tail_num),
+ NDV(origin), NDV(dest) FROM airlines_external;
+--------------+-----------------+---------------+-------------+-----------+
| ndv(carrier) | ndv(flight_num) | ndv(tail_num) | ndv(origin) | ndv(dest) |
+--------------+-----------------+---------------+-------------+-----------+
-| 29 | 9086 | 3 | 340 | 347 |
+| 29 | 8463 | 3 | 342 | 349 |
+--------------+-----------------+---------------+-------------+-----------+
-[localhost:21000] > select tail_num, count(*) as howmany from airlines_external
- > group by tail_num;
+
+> SELECT tail_num, COUNT(*) AS howmany FROM airlines_external
+ GROUP BY tail_num;
+----------+-----------+
| tail_num | howmany |
+----------+-----------+
+| NULL | 123122001 |
| 715 | 1 |
| 0 | 406405 |
| 112 | 6562 |
-| NULL | 123122001 |
+----------+-----------+
-Fetched 1 row(s) in 5.18s
-[localhost:21000] > select distinct dest from airlines_external
- > where dest not in (select origin from airlines_external);
+
+> SELECT DISTINCT dest FROM airlines_external
+ WHERE dest NOT IN (SELECT origin FROM airlines_external);
+------+
| dest |
+------+
-| LBF |
| CBM |
-| RCA |
| SKA |
| LAR |
+| RCA |
+| LBF |
+------+
-Fetched 5 row(s) in 39.64s
-[localhost:21000] > select distinct dest from airlines_external
- > where dest not in (select distinct origin from airlines_external);
+
+> SELECT DISTINCT dest FROM airlines_external
+ WHERE dest NOT IN (SELECT DISTINCT origin FROM airlines_external);
+------+
| dest |
+------+
-| LBF |
-| RCA |
| CBM |
| SKA |
| LAR |
+| RCA |
+| LBF |
+------+
-Fetched 5 row(s) in 5.59s
-[localhost:21000] > select distinct origin from airlines_external
- > where origin not in (select distinct dest from airlines_external);
-Fetched 0 row(s) in 5.37s
-</codeblock>
+
+> SELECT DISTINCT origin FROM airlines_external
+ WHERE origin NOT IN (SELECT DISTINCT dest FROM airlines_external);
+Fetched 0 row(s) in 2.63s</codeblock>
+ <p>With the above queries, we see that there are modest numbers of
+ different airlines, flight numbers, and origin and destination
+ airports. Two things jump out from this query: the number of
+ <codeph>tail_num</codeph> values is much smaller than we might have
+ expected, and there are more destination airports than origin
+ airports. Let's dig further. What we find is that most
+ <codeph>tail_num</codeph> values are <codeph>NULL</codeph>. It looks
+ like this was an experimental column that wasn't filled in accurately.
+ We make a mental note that if we use this data as a starting point,
+ we'll ignore this column. We also find that certain airports are
+ represented in the <codeph>ORIGIN</codeph> column but not the
+ <codeph>DEST</codeph> column; now we know that we cannot rely on the
+ assumption that those sets of airport codes are identical. </p>
+ <note> The first <codeph>SELECT DISTINCT DEST</codeph> query takes
+ almost 40 seconds. We expect all queries on such a small data set,
+ less than 2 GB, to take a few seconds at most. The reason is that
+ the expression <codeph>NOT IN (SELECT origin FROM
+ airlines_external)</codeph> produces an intermediate result set of
+ 123 million rows, then runs 123 million comparisons on each data node
+ against the tiny set of destination airports. The way the <codeph>NOT
+ IN</codeph> operator works internally means that this intermediate
+ result set with 123 million rows might be transmitted across the
+ network to each data node in the cluster.
Applying another + <codeph>DISTINCT</codeph> inside the <codeph>NOT IN</codeph> + subquery means that the intermediate result set is only 340 items, + resulting in much less network traffic and fewer comparison + operations. The more efficient query with the added + <codeph>DISTINCT</codeph> is approximately 7 times as fast. </note> <p> Next, we try doing a simple calculation, with results broken down by year. - This reveals that some years have no data in the - <codeph>AIRTIME</codeph> column. That means we might be able to use - that column in queries involving certain date ranges, but we cannot - count on it to always be reliable. The question of whether a column - contains any <codeph>NULL</codeph> values, and if so what is their - number, proportion, and distribution, comes up again and again when - doing initial exploration of a data set. </p> - -<codeblock>[localhost:21000] > select year, sum(airtime) from airlines_external - > group by year order by year desc; + This reveals that some years have no data in the + <codeph>airtime</codeph> column. That means we might be able to use + that column in queries involving certain date ranges, but we cannot + count on it to always be reliable. The question of whether a column + contains any <codeph>NULL</codeph> values, and if so what is their + number, proportion, and distribution, comes up again and again when + doing initial exploration of a data set. </p> + +<codeblock>> SELECT year, SUM(airtime) FROM airlines_external + GROUP BY year ORDER BY year DESC; +------+--------------+ | year | sum(airtime) | +------+--------------+ @@ -1932,34 +1849,37 @@ Fetched 0 row(s) in 5.37s +------+--------------+ </codeblock> -<p> -With the notion of <codeph>NULL</codeph> values in mind, let's come back to the <codeph>TAILNUM</codeph> -column that we discovered had a lot of <codeph>NULL</codeph>s. -Let's quantify the <codeph>NULL</codeph> and non-<codeph>NULL</codeph> values in that column for better understanding. -First, we just count the overall number of rows versus the non-<codeph>NULL</codeph> values in that column. -That initial result gives the appearance of relatively few non-<codeph>NULL</codeph> values, but we can break -it down more clearly in a single query. -Once we have the <codeph>COUNT(*)</codeph> and the <codeph>COUNT(<varname>colname</varname>)</codeph> numbers, -we can encode that initial query in a <codeph>WITH</codeph> clause, then run a followon query that performs -multiple arithmetic operations on those values. -Seeing that only one-third of one percent of all rows have non-<codeph>NULL</codeph> values for the -<codeph>TAILNUM</codeph> column clearly illustrates that that column is not of much use. -</p> - -<codeblock>[localhost:21000] > select count(*) as 'rows', count(tail_num) as 'non-null tail numbers' - > from airlines_external; +<p> With the notion of <codeph>NULL</codeph> values in mind, let's come back to + the <codeph>tail_num</codeph> column that we discovered had a lot of + <codeph>NULL</codeph>s. Let's quantify the <codeph>NULL</codeph> and + non-<codeph>NULL</codeph> values in that column for better + understanding. First, we just count the overall number of rows versus + the non-<codeph>NULL</codeph> values in that column. That initial + result gives the appearance of relatively few + non-<codeph>NULL</codeph> values, but we can break it down more + clearly in a single query. 
Once we have the <codeph>COUNT(*)</codeph> + and the <codeph>COUNT(<varname>colname</varname>)</codeph> numbers, we + can encode that initial query in a <codeph>WITH</codeph> clause, then + run a follow-on query that performs multiple arithmetic operations on + those values. Seeing that only one-third of one percent of all rows + have non-<codeph>NULL</codeph> values for the + <codeph>tail_num</codeph> column clearly illustrates that column is + not of much use. </p> + +<codeblock>> SELECT COUNT(*) AS 'rows', COUNT(tail_num) AS 'non-null tail numbers' + FROM airlines_external; +-----------+-----------------------+ | rows | non-null tail numbers | +-----------+-----------------------+ | 123534969 | 412968 | +-----------+-----------------------+ -Fetched 1 row(s) in 1.51s -[localhost:21000] > with t1 as - > (select count(*) as 'rows', count(tail_num) as 'nonnull' - > from airlines_external) - > select `rows`, `nonnull`, `rows` - `nonnull` as 'nulls', - > (`nonnull` / `rows`) * 100 as 'percentage non-null' - > from t1; + +> WITH t1 AS + (SELECT COUNT(*) AS 'rows', COUNT(tail_num) AS 'nonnull' + FROM airlines_external) +SELECT `rows`, `nonnull`, `rows` - `nonnull` AS 'nulls', + (`nonnull` / `rows`) * 100 AS 'percentage non-null' +FROM t1; +-----------+---------+-----------+---------------------+ | rows | nonnull | nulls | percentage non-null | +-----------+---------+-----------+---------------------+ @@ -1967,26 +1887,28 @@ Fetched 1 row(s) in 1.51s +-----------+---------+-----------+---------------------+ </codeblock> -<p> -By examining other columns using these techniques, we can form a mental picture of the way data is distributed -throughout the table, and which columns are most significant for query purposes. For this tutorial, we focus mostly on -the fields likely to hold discrete values, rather than columns such as <codeph>ACTUAL_ELAPSED_TIME</codeph> -whose names suggest they hold measurements. We would dig deeper into those columns once we had a clear picture -of which questions were worthwhile to ask, and what kinds of trends we might look for. -For the final piece of initial exploration, let's look at the <codeph>YEAR</codeph> column. -A simple <codeph>GROUP BY</codeph> query shows that it has a well-defined range, a manageable number of -distinct values, and relatively even distribution of rows across the different years. -</p> - -<codeblock>[localhost:21000] > select min(year), max(year), ndv(year) from airlines_external; +<p> By examining other columns using these techniques, we can form a mental + picture of the way data is distributed throughout the table, and which + columns are most significant for query purposes. For this tutorial, we + focus mostly on the fields likely to hold discrete values, rather than + columns such as <codeph>actual_elapsed_time</codeph> whose names + suggest they hold measurements. We would dig deeper into those columns + once we had a clear picture of which questions were worthwhile to ask, + and what kinds of trends we might look for. For the final piece of + initial exploration, let's look at the <codeph>year</codeph> column. A + simple <codeph>GROUP BY</codeph> query shows that it has a + well-defined range, a manageable number of distinct values, and + relatively even distribution of rows across the different years. 
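</p>

<p> Before turning to <codeph>year</codeph>, here is one more sketch in the same
spirit. This example is an editorial addition rather than part of the original
tutorial; it applies the same NULL-count and range profiling to the
<codeph>distance</codeph> column (output not shown): </p>

<codeblock>> SELECT COUNT(*) AS 'rows',
    COUNT(distance) AS 'non-null distances',
    MIN(distance) AS 'shortest',
    MAX(distance) AS 'longest'
  FROM airlines_external;</codeblock>

<p> The <codeph>year</codeph> exploration itself looks like this.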
</p> + +<codeblock>> SELECT MIN(year), MAX(year), NDV(year) FROM airlines_external; +-----------+-----------+-----------+ | min(year) | max(year) | ndv(year) | +-----------+-----------+-----------+ | 1987 | 2008 | 22 | +-----------+-----------+-----------+ -Fetched 1 row(s) in 2.03s -[localhost:21000] > select year, count(*) howmany from airlines_external - > group by year order by year desc; + +> SELECT year, COUNT(*) howmany FROM airlines_external + GROUP BY year ORDER BY year DESC; +------+---------+ | year | howmany | +------+---------+ @@ -2013,77 +1935,35 @@ Fetched 1 row(s) in 2.03s | 1988 | 5202096 | | 1987 | 1311826 | +------+---------+ -Fetched 22 row(s) in 2.13s </codeblock> -<p> -We could go quite far with the data in this initial raw format, just as we downloaded it from the web. -If the data set proved to be useful and worth persisting in Impala for extensive queries, -we might want to copy it to an internal table, letting Impala manage the data files and perhaps -reorganizing a little for higher efficiency. -In this next stage of the tutorial, we copy the original data into a partitioned table, still in Parquet format. -Partitioning based on the <codeph>YEAR</codeph> column lets us run queries with clauses such as <codeph>WHERE year = 2001</codeph> -or <codeph>WHERE year BETWEEN 1989 AND 1999</codeph>, which can dramatically cut down on I/O by -ignoring all the data from years outside the desired range. -Rather than reading all the data and then deciding which rows are in the matching years, Impala can -zero in on only the data files from specific <codeph>YEAR</codeph> partitions. -To do this, Impala physically reorganizes the data files, putting the rows from each year into -data files in a separate HDFS directory for each <codeph>YEAR</codeph> value. -Along the way, we'll also get rid of the <codeph>TAIL_NUM</codeph> column that proved to be almost entirely <codeph>NULL</codeph>. -</p> - -<p> -The first step is to create a new table with a layout very similar to the original <codeph>AIRLINES_EXTERNAL</codeph> table. -We'll do that by reverse-engineering a <codeph>CREATE TABLE</codeph> statement for the first table, -then tweaking it slightly to include a <codeph>PARTITION BY</codeph> clause for <codeph>YEAR</codeph>, -and excluding the <codeph>TAIL_NUM</codeph> column. -The <codeph>SHOW CREATE TABLE</codeph> statement gives us the starting point. 
-</p> - -<codeblock>[localhost:21000] > show create table airlines_external; -+------------------------------------------------------------------------------------- -| result -+------------------------------------------------------------------------------------- -| CREATE EXTERNAL TABLE airline_data.airlines_external ( -| year INT COMMENT 'inferred from: optional int32 year', -| month INT COMMENT 'inferred from: optional int32 month', -| day INT COMMENT 'inferred from: optional int32 day', -| dayofweek INT COMMENT 'inferred from: optional int32 dayofweek', -| dep_time INT COMMENT 'inferred from: optional int32 dep_time', -| crs_dep_time INT COMMENT 'inferred from: optional int32 crs_dep_time', -| arr_time INT COMMENT 'inferred from: optional int32 arr_time', -| crs_arr_time INT COMMENT 'inferred from: optional int32 crs_arr_time', -| carrier STRING COMMENT 'inferred from: optional binary carrier', -| flight_num INT COMMENT 'inferred from: optional int32 flight_num', -| tail_num INT COMMENT 'inferred from: optional int32 tail_num', -| actual_elapsed_time INT COMMENT 'inferred from: optional int32 actual_elapsed_time', -| crs_elapsed_time INT COMMENT 'inferred from: optional int32 crs_elapsed_time', -| airtime INT COMMENT 'inferred from: optional int32 airtime', -| arrdelay INT COMMENT 'inferred from: optional int32 arrdelay', -| depdelay INT COMMENT 'inferred from: optional int32 depdelay', -| origin STRING COMMENT 'inferred from: optional binary origin', -| dest STRING COMMENT 'inferred from: optional binary dest', -| distance INT COMMENT 'inferred from: optional int32 distance', -| taxi_in INT COMMENT 'inferred from: optional int32 taxi_in', -| taxi_out INT COMMENT 'inferred from: optional int32 taxi_out', -| cancelled INT COMMENT 'inferred from: optional int32 cancelled', -| cancellation_code STRING COMMENT 'inferred from: optional binary cancellation_code', -| diverted INT COMMENT 'inferred from: optional int32 diverted', -| carrier_delay INT COMMENT 'inferred from: optional int32 carrier_delay', -| weather_delay INT COMMENT 'inferred from: optional int32 weather_delay', -| nas_delay INT COMMENT 'inferred from: optional int32 nas_delay', -| security_delay INT COMMENT 'inferred from: optional int32 security_delay', -| late_aircraft_delay INT COMMENT 'inferred from: optional int32 late_aircraft_delay' -| ) -| STORED AS PARQUET -| LOCATION 'hdfs://a1730.example.com:8020/user/impala/staging/airlines' -| TBLPROPERTIES ('numFiles'='0', 'COLUMN_STATS_ACCURATE'='false', -| 'transient_lastDdlTime'='1439425228', 'numRows'='-1', 'totalSize'='0', -| 'rawDataSize'='-1') -+------------------------------------------------------------------------------------- -Fetched 1 row(s) in 0.03s -[localhost:21000] > quit; -</codeblock> +<p> We could go quite far with the data in this initial raw format, just as we + downloaded it from the web. If the data set proved to be useful and + worth persisting in Impala for extensive queries, we might want to + copy it to an internal table, letting Impala manage the data files and + perhaps reorganizing a little for higher efficiency. In this next + stage of the tutorial, we copy the original data into a partitioned + table, still in Parquet format. Partitioning based on the + <codeph>year</codeph> column lets us run queries with clauses such + as <codeph>WHERE year = 2001</codeph> or <codeph>WHERE year BETWEEN + 1989 AND 1999</codeph>, which can dramatically cut down on I/O by + ignoring all the data from years outside the desired range. 
Rather + than reading all the data and then deciding which rows are in the + matching years, Impala can zero in on only the data files from + specific <codeph>year</codeph> partitions. To do this, Impala + physically reorganizes the data files, putting the rows from each year + into data files in a separate HDFS directory for each + <codeph>year</codeph> value. Along the way, we'll also get rid of + the <codeph>tail_num</codeph> column that proved to be almost entirely + <codeph>NULL</codeph>. </p> + +<p> The first step is to create a new table with a layout very similar to the + original <codeph>airlines_external</codeph> table. We'll do that by + reverse-engineering a <codeph>CREATE TABLE</codeph> statement for the + first table, then tweaking it slightly to include a <codeph>PARTITION + BY</codeph> clause for <codeph>year</codeph>, and excluding the + <codeph>tail_num</codeph> column. The <codeph>SHOW CREATE + TABLE</codeph> statement gives us the starting point. </p> <p> Although we could edit that output into a new SQL statement, all the ASCII box characters @@ -2092,17 +1972,10 @@ to start with, we restart the <cmdname>impala-shell</cmdname> command with the <codeph>-B</codeph> option, which turns off the box-drawing behavior. </p> -<codeblock>[localhost:21000] > quit; -Goodbye jrussell -$ impala-shell -i localhost -B -d airline_data; -Starting Impala Shell without Kerberos authentication -Connected to localhost:21000 -<ph conref="../shared/ImpalaVariables.xml#impala_vars/ImpaladBanner"/> -Welcome to the Impala shell. Press TAB twice to see a list of available commands. -... -<ph conref="../shared/ImpalaVariables.xml#impala_vars/ShellBanner"/> -[localhost:21000] > show create table airlines_external; -"CREATE EXTERNAL TABLE airline_data.airlines_external ( +<codeblock>$ impala-shell -i localhost -B -d airlines_data; + +> SHOW CREATE TABLE airlines_external; +"CREATE EXTERNAL TABLE airlines_data.airlines_external ( year INT COMMENT 'inferred from: optional int32 year', month INT COMMENT 'inferred from: optional int32 month', day INT COMMENT 'inferred from: optional int32 day', @@ -2138,7 +2011,6 @@ LOCATION 'hdfs://a1730.example.com:8020/user/impala/staging/airlines' TBLPROPERTIES ('numFiles'='0', 'COLUMN_STATS_ACCURATE'='false', 'transient_lastDdlTime'='1439425228', 'numRows'='-1', 'totalSize'='0', 'rawDataSize'='-1')" -Fetched 1 row(s) in 0.01s </codeblock> <p> @@ -2146,102 +2018,97 @@ After copying and pasting the <codeph>CREATE TABLE</codeph> statement into a tex without the <codeph>-B</codeph> option, to switch back to regular output. </p> -<p> -Next we run the <codeph>CREATE TABLE</codeph> statement that we adapted from the <codeph>SHOW CREATE TABLE</codeph> output. -We kept the <codeph>STORED AS PARQUET</codeph> clause because -we want to rearrange the data somewhat but still keep it in the high-performance -Parquet format. The <codeph>LOCATION</codeph> and <codeph>TBLPROPERTIES</codeph> -clauses are not relevant for this new table, so we edit those out. -Because we are going to partition the new table based on the <codeph>YEAR</codeph> -column, we move that column name (and its type) into a new <codeph>PARTITIONED BY</codeph> clause. 
-</p> - -<codeblock>[localhost:21000] > CREATE TABLE airline_data.airlines - > ( - > month INT, - > day INT, - > dayofweek INT, - > dep_time INT, - > crs_dep_time INT, - > arr_time INT, - > crs_arr_time INT, - > carrier STRING, - > flight_num INT, - > actual_elapsed_time INT, - > crs_elapsed_time INT, - > airtime INT, - > arrdelay INT, - > depdelay INT, - > origin STRING, - > dest STRING, - > distance INT, - > taxi_in INT, - > taxi_out INT, - > cancelled INT, - > cancellation_code STRING, - > diverted INT, - > carrier_delay INT, - > weather_delay INT, - > nas_delay INT, - > security_delay INT, - > late_aircraft_delay INT - > ) - > STORED AS PARQUET - > PARTITIONED BY (year INT); -Fetched 0 row(s) in 0.10s +<p> Next we run the <codeph>CREATE TABLE</codeph> statement that we adapted from + the <codeph>SHOW CREATE TABLE</codeph> output. We kept the + <codeph>STORED AS PARQUET</codeph> clause because we want to + rearrange the data somewhat but still keep it in the high-performance + Parquet format. The <codeph>LOCATION</codeph> and + <codeph>TBLPROPERTIES</codeph> clauses are not relevant for this new + table, so we edit those out. Because we are going to partition the new + table based on the <codeph>year</codeph> column, we move that column + name (and its type) into a new <codeph>PARTITIONED BY</codeph> clause. </p> + +<codeblock>> CREATE TABLE airlines_data.airlines + (month INT, + day INT, + dayofweek INT, + dep_time INT, + crs_dep_time INT, + arr_time INT, + crs_arr_time INT, + carrier STRING, + flight_num INT, + actual_elapsed_time INT, + crs_elapsed_time INT, + airtime INT, + arrdelay INT, + depdelay INT, + origin STRING, + dest STRING, + distance INT, + taxi_in INT, + taxi_out INT, + cancelled INT, + cancellation_code STRING, + diverted INT, + carrier_delay INT, + weather_delay INT, + nas_delay INT, + security_delay INT, + late_aircraft_delay INT) +PARTITIONED BY (year INT) +STORED AS PARQUET +; </codeblock> -<p> -Next, we copy all the rows from the original table into this new one with -an <codeph>INSERT</codeph> statement. (We edited the <codeph>CREATE TABLE</codeph> -statement to make an <codeph>INSERT</codeph> statement with the column names -in the same order.) The only change is to add a <codeph>PARTITION(year)</codeph> -clause, and move the <codeph>YEAR</codeph> column to the very end of the -<codeph>SELECT</codeph> list of the <codeph>INSERT</codeph> statement. -Specifying <codeph>PARTITION(year)</codeph>, rather than a fixed value -such as <codeph>PARTITION(year=2000)</codeph>, means that Impala figures -out the partition value for each row based on the value of the very last column -in the <codeph>SELECT</codeph> list. -This is the first SQL statement that legitimately takes any substantial time, -because the rows from different years are shuffled around the cluster; -the rows that go into each partition are collected on one node, before being -written to one or more new data files. 
-</p> - -<codeblock>[localhost:21000] > INSERT INTO airline_data.airlines - > PARTITION (year) - > SELECT - > month, - > day, - > dayofweek, - > dep_time, - > crs_dep_time, - > arr_time, - > crs_arr_time, - > carrier, - > flight_num, - > actual_elapsed_time, - > crs_elapsed_time, - > airtime, - > arrdelay, - > depdelay, - > origin, - > dest, - > distance, - > taxi_in, - > taxi_out, - > cancelled, - > cancellation_code, - > diverted, - > carrier_delay, - > weather_delay, - > nas_delay, - > security_delay, - > late_aircraft_delay, - > year - > FROM airline_data.airlines_external; -Inserted 123534969 row(s) in 202.70s -</codeblock> +<p> Next, we copy all the rows from the original table into this new one with an + <codeph>INSERT</codeph> statement. (We edited the <codeph>CREATE + TABLE</codeph> statement to make an <codeph>INSERT</codeph> + statement with the column names in the same order.) The only change is + to add a <codeph>PARTITION(year)</codeph> clause, and move the + <codeph>year</codeph> column to the very end of the + <codeph>SELECT</codeph> list of the <codeph>INSERT</codeph> + statement. Specifying <codeph>PARTITION(year)</codeph>, rather than a + fixed value such as <codeph>PARTITION(year=2000)</codeph>, means that + Impala figures out the partition value for each row based on the value + of the very last column in the <codeph>SELECT</codeph> list. This is + the first SQL statement that legitimately takes any substantial time, + because the rows from different years are shuffled around the cluster; + the rows that go into each partition are collected on one node, before + being written to one or more new data files. </p> + +<codeblock>> INSERT INTO airlines_data.airlines + PARTITION (year) + SELECT + month, + day, + dayofweek, + dep_time, + crs_dep_time, + arr_time, + crs_arr_time, + carrier, + flight_num, + actual_elapsed_time, + crs_elapsed_time, + airtime, + arrdelay, + depdelay, + origin, + dest, + distance, + taxi_in, + taxi_out, + cancelled, + cancellation_code, + diverted, + carrier_delay, + weather_delay, + nas_delay, + security_delay, + late_aircraft_delay, + year + FROM airlines_data.airlines_external;</codeblock> <p> Once partitioning or join queries come into play, it's important to have statistics @@ -2253,54 +2120,57 @@ are in place for each partition, and also illustrates how many files and how muc is in each partition. </p> -<codeblock>[localhost:21000] > compute incremental stats airlines; +<codeblock>> COMPUTE INCREMENTAL STATS airlines; +-------------------------------------------+ | summary | +-------------------------------------------+ | Updated 22 partition(s) and 27 column(s). 
| +-------------------------------------------+ -[localhost:21000] > show table stats airlines; -+-------+-----------+--------+----------+--------------+------------+---------+-------------------+ -| year | #Rows | #Files | Size | Bytes Cached | Cache Repl | Format | Incremental stats | -+-------+-----------+--------+----------+--------------+------------+---------+----- -| 1987 | 1311826 | 1 | 9.32MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1988 | 5202096 | 1 | 37.04MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1989 | 5041200 | 1 | 36.25MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1990 | 5270893 | 1 | 38.39MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1991 | 5076925 | 1 | 37.23MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1992 | 5092157 | 1 | 36.85MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1993 | 5070501 | 1 | 37.16MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1994 | 5180048 | 1 | 38.31MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1995 | 5327435 | 1 | 53.14MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1996 | 5351983 | 1 | 53.64MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1997 | 5411843 | 1 | 54.41MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1998 | 5384721 | 1 | 54.01MB | NOT CACHED | NOT CACHED | PARQUET | true -| 1999 | 5527884 | 1 | 56.32MB | NOT CACHED | NOT CACHED | PARQUET | true -| 2000 | 5683047 | 1 | 58.15MB | NOT CACHED | NOT CACHED | PARQUET | true -| 2001 | 5967780 | 1 | 60.65MB | NOT CACHED | NOT CACHED | PARQUET | true -| 2002 | 5271359 | 1 | 57.99MB | NOT CACHED | NOT CACHED | PARQUET | true -| 2003 | 6488540 | 1 | 81.33MB | NOT CACHED | NOT CACHED | PARQUET | true -| 2004 | 7129270 | 1 | 103.19MB | NOT CACHED | NOT CACHED | PARQUET | true -| 2005 | 7140596 | 1 | 102.61MB | NOT CACHED | NOT CACHED | PARQUET | true -| 2006 | 7141922 | 1 | 106.03MB | NOT CACHED | NOT CACHED | PARQUET | true -| 2007 | 7453215 | 1 | 112.15MB | NOT CACHED | NOT CACHED | PARQUET | true -| 2008 | 7009728 | 1 | 105.76MB | NOT CACHED | NOT CACHED | PARQUET | true -| Total | 123534969 | 22 | 1.30GB | 0B | | | -+-------+-----------+--------+----------+--------------+------------+---------+----- + +> SHOW TABLE STATS airlines; ++-------+-----------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------------------------+ +| year | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location | ++-------+-----------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------------------------+ +| 1987 | 1311826 | 1 | 11.75MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1987 | +| 1988 | 5202096 | 1 | 44.04MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1988 | +| 1989 | 5041200 | 1 | 46.07MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1989 | +| 1990 | 5270893 | 1 | 46.25MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1990 | +| 1991 | 5076925 | 1 | 46.77MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1991 | +| 1992 | 5092157 | 1 | 48.21MB | NOT 
CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1992 | +| 1993 | 5070501 | 1 | 47.46MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1993 | +| 1994 | 5180048 | 1 | 47.47MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1994 | +| 1995 | 5327435 | 1 | 62.40MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1995 | +| 1996 | 5351983 | 1 | 62.93MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1996 | +| 1997 | 5411843 | 1 | 65.05MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1997 | +| 1998 | 5384721 | 1 | 62.21MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1998 | +| 1999 | 5527884 | 1 | 65.10MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1999 | +| 2000 | 5683047 | 1 | 67.68MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2000 | +| 2001 | 5967780 | 1 | 74.03MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2001 | +| 2002 | 5271359 | 1 | 74.00MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2002 | +| 2003 | 6488540 | 1 | 99.35MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2003 | +| 2004 | 7129270 | 1 | 123.29MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2004 | +| 2005 | 7140596 | 1 | 120.72MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2005 | +| 2006 | 7141922 | 1 | 121.88MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2006 | +| 2007 | 7453215 | 1 | 130.87MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2007 | +| 2008 | 7009728 | 1 | 123.14MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2008 | +| Total | 123534969 | 22 | 1.55GB | 0B | | | | | ++-------+-----------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------------------------+ </codeblock> -<p> -At this point, we go through a quick thought process to sanity check the partitioning we did. -All the partitions have exactly one file, which is on the low side. A query that includes a -clause <codeph>WHERE year=2004</codeph> will only read a single data block; that data block -will be read and processed by a single data node; therefore, for a query targeting a single year, -all the other nodes in the cluster will sit idle while all the work happens on a single machine. 
 <p>
 However, the more data files, the less data goes in each one. The overhead of dividing the work in a

@@ -2312,15 +2182,19 @@ The way to see how well it works in practice is to run the same queries against
 flat table and the new partitioned table, and compare times.
 </p>

-<p>
-Spoiler: in this case, with my particular 4-node cluster with its specific distribution of data blocks and my
-particular exploratory queries, queries against the partitioned table do consistently run faster than the same queries
-against the unpartitioned table. But I could not be sure that would be the case without some real measurements.
-Here are some queries I ran to draw that conclusion, first against <codeph>AIRLINES_EXTERNAL</codeph> (no partitioning),
-then against <codeph>AIRLINES</codeph> (partitioned by year). The <codeph>AIRLINES</codeph> queries are consistently faster.
-Changing the volume of data, changing the size of the cluster, running queries that did or didn't refer to the partition key
-columns, or other factors could change the results to favor one table layout or the other.
-</p>
+<p> Spoiler: in this case, with my particular 4-node cluster with its specific
+  distribution of data blocks and my particular exploratory queries,
+  queries against the partitioned table do consistently run faster than
+  the same queries against the unpartitioned table. But I could not be
+  sure that would be the case without some real measurements. Here are
+  some queries I ran to draw that conclusion, first against
+  <codeph>airlines_external</codeph> (no partitioning), then against
+  <codeph>airlines</codeph> (partitioned by year). The
+  <codeph>airlines</codeph> queries are consistently faster. Changing
+  the volume of data, changing the size of the cluster, running queries
+  that did or didn't refer to the partition key columns, or other
+  factors could change the results to favor one table layout or the
+  other. </p>
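+
+<p> When you repeat this kind of comparison yourself, the timing line that
+  <cmdname>impala-shell</cmdname> prints after each query is usually
+  enough. To see where the time actually goes, you can also type the
+  shell's <codeph>SUMMARY</codeph> command (or <codeph>PROFILE</codeph>
+  for full detail) right after a query finishes; the exact numbers
+  depend entirely on your cluster. A sketch of the idea: </p>
+
+<codeblock>> SELECT SUM(airtime) FROM airlines WHERE year = 2005;
+> SUMMARY;
+</codeblock>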
 <note>
 If you find the volume of each partition is only in the low tens of megabytes, consider lowering the granularity

@@ -2330,64 +2204,36 @@ even hundreds of megabytes per Parquet file, and the number of Parquet files in
 higher than the number of data nodes.
 </note>

-<codeblock>[localhost:21000] > select sum(airtime) from airlines_external;
-+--------------+
-| sum(airtime) |
+<codeblock>> SELECT SUM(airtime) FROM airlines_external;
 +--------------+
 | 8662859484   |
 +--------------+
-Fetched 1 row(s) in 2.02s
-[localhost:21000] > select sum(airtime) from airlines;
-+--------------+
-| sum(airtime) |
+
+> SELECT SUM(airtime) FROM airlines;
 +--------------+
 | 8662859484   |
 +--------------+
-Fetched 1 row(s) in 1.21s
-[localhost:21000] > select sum(airtime) from airlines_external where year = 2005;
-+--------------+
-| sum(airtime) |
-+--------------+
-| 708204026    |
-+--------------+
-Fetched 1 row(s) in 2.61s
-[localhost:21000] > select sum(airtime) from airlines where year = 2005;
-+--------------+
-| sum(airtime) |
+> SELECT SUM(airtime) FROM airlines_external WHERE year = 2005;
 +--------------+
 | 708204026    |
 +--------------+
-Fetched 1 row(s) in 1.19s
-[localhost:21000] > select sum(airtime) from airlines_external where dayofweek = 1;
+> SELECT SUM(airtime) FROM airlines WHERE year = 2005;
 +--------------+
-| sum(airtime) |
-+--------------+
-| 1264945051   |
-+--------------+
-Fetched 1 row(s) in 2.82s
-[localhost:21000] > select sum(airtime) from airlines where dayofweek = 1;
-+--------------+
-| sum(airtime) |
-+--------------+
-| 1264945051   |
+| 708204026    |
 +--------------+
-Fetched 1 row(s) in 1.61s
 </codeblock>

-<p>
-Now we can finally do some serious analysis with this data set that, remember,
-a few minutes ago all we had were some raw data files and we didn't even know
-what columns they contained.
-Let's see whether the <q>air time</q> of a flight tends to be different depending on the
-day of the week. We can see that the average is a little higher on day number 6;
-perhaps Saturday is a busy flying day and planes have to circle for longer at the
-destination airport before landing.
-</p>
+<p> Now we can finally analyze this data set. Remember, a few minutes ago all
+  we had were some raw data files, and we did not even know what columns
+  they contained. Let's see whether the <codeph>airtime</codeph> of a
+  flight tends to be different depending on the day of the week. We can
+  see that the average is a little higher on day number 6; perhaps
+  Saturday is a busy flying day and planes have to circle for longer at
+  the destination airport before landing. </p>

-<codeblock>[localhost:21000] > select dayofweek, avg(airtime) from airlines
-                  > group by dayofweek order by dayofweek;
+<codeblock>> SELECT dayofweek, AVG(airtime) FROM airlines
+  GROUP BY dayofweek ORDER BY dayofweek;
 +-----------+-------------------+
 | dayofweek | avg(airtime)      |
 +-----------+-------------------+

@@ -2399,7 +2245,6 @@ destination airport before landing.
 | 6         | 105.3627448363705 |
 | 7         | 103.4144351202054 |
 +-----------+-------------------+
-Fetched 7 row(s) in 2.25s
 </codeblock>

 <p>
@@ -2410,8 +2255,8 @@ of <codeph>NULL</codeph> for this column in years 1987 to 1994 shows that queries
 need to be restricted to a date range of 1995 and higher.
 </p>
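+
+<p> You can verify that boundary yourself before trusting a date-restricted
+  query. The sketch below assumes <codeph>airtime</codeph> is the column
+  holding the <codeph>NULL</codeph> values in the early years; the
+  counts you see will depend on your copy of the data set: </p>
+
+<codeblock>> SELECT year, COUNT(*) AS missing_airtime  -- assumes airtime holds the NULLs
+  FROM airlines
+  WHERE airtime IS NULL
+  GROUP BY year ORDER BY year;
+</codeblock>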
-<codeblock>[localhost:21000] > select year, dayofweek, avg(airtime) from airlines
-                  > group by year, dayofweek order by year desc, dayofweek;
+<codeblock>> SELECT year, dayofweek, AVG(airtime) FROM airlines
+  GROUP BY year, dayofweek ORDER BY year DESC, dayofweek;
 +------+-----------+-------------------+
 | year | dayofweek | avg(airtime)      |
 +------+-----------+-------------------+

@@ -2456,28 +2301,5 @@ need to be restricted to a date range of 1995 and higher.
 </conbody>
 </concept>
-
-  <concept audience="hidden" id="tut_synthesizing">
-
-    <title>Synthesizing Data</title>
-
-    <conbody>
-
-<!-- Scenarios:
-  INSERT VALUES 1 row or a few rows, then duplicate the values with slight alterations.
-  - Use row_number() and rank() to assign IDs to values from narrow tables.
-  - Use sum() in analytic context, with sliding window, to derive other arbitrary numbers.
-  Grab database, table, column, etc. names from metastore and play with those.
-  Maybe take results of a 'find' and play with those.
-  - Invent fields like 'username' based on /home or whatever subdirectories.
-  - Perhaps also capture permissions, timestamps, etc. via 'ls -l'.
-  - Illustrate narrow table with 'extension' field, wide table with 'jpg', 'Office', 'picture', 'XML', etc. fields
-    based on case() of extension.
-  Decompose some string values using regexp_extract.
-  Set up views to get the logical layout right before making new physical copies of data.
--->
-
-    <p></p>
-  </conbody>
 </concept>

 </concept>