http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/sql-ref/cmd-summary/003-select.md ---------------------------------------------------------------------- diff --git a/_docs/sql-ref/cmd-summary/003-select.md b/_docs/sql-ref/cmd-summary/003-select.md new file mode 100644 index 0000000..00004a4 --- /dev/null +++ b/_docs/sql-ref/cmd-summary/003-select.md @@ -0,0 +1,85 @@
+---
+title: "SELECT Statements"
+parent: "SQL Commands Summary"
+---
+Drill supports the following ANSI standard clauses in the SELECT statement:
+
+ * WITH clause
+ * SELECT list
+ * FROM clause
+ * WHERE clause
+ * GROUP BY clause
+ * HAVING clause
+ * ORDER BY clause (with an optional LIMIT clause)
+
+You can use the same SELECT syntax in the following commands:
+
+ * CREATE TABLE AS (CTAS)
+ * CREATE VIEW
+
+INSERT INTO SELECT is not yet supported.
+
+## Column Aliases
+
+You can use named column aliases in the SELECT list to provide meaningful
+names for regular columns and computed columns, such as the results of
+aggregate functions. See the section on running queries for examples.
+
+You cannot reference column aliases in the following clauses:
+
+ * WHERE
+ * GROUP BY
+ * HAVING
+
+Because Drill works with schema-less data sources, you cannot use positional
+aliases (1, 2, etc.) to refer to SELECT list columns, except in the ORDER BY
+clause.
+
+## UNION ALL Set Operator
+
+Drill supports the UNION ALL set operator to combine two result sets. The
+distinct UNION operator is not yet supported.
+
+The EXCEPT, EXCEPT ALL, INTERSECT, and INTERSECT ALL operators are not yet
+supported.
+
+## Joins
+
+Drill supports ANSI standard joins in the FROM and WHERE clauses:
+
+ * Inner joins
+ * Left, full, and right outer joins
+
+The following types of join syntax are supported:
+
+Join type| Syntax
+---|---
+Join condition in WHERE clause|FROM table1, table2 WHERE table1.col1=table2.col1
+USING join in FROM clause|FROM table1 JOIN table2 USING(col1, ...)
+ON join in FROM clause|FROM table1 JOIN table2 ON table1.col1=table2.col1
+NATURAL JOIN in FROM clause|FROM table1 NATURAL JOIN table2
+
+Cross-joins are not yet supported. You must specify a join condition when more
+than one table is listed in the FROM clause.
+
+Non-equijoins are supported if the join also contains an equality condition on
+the same two tables as part of a conjunction:
+
+    table1.col1 = table2.col1 AND table1.c2 < table2.c2
+
+This restriction applies to both inner and outer joins.
+
+## Subqueries
+
+You can use the following subquery operators in Drill queries. These operators
+all return Boolean results.
+
+ * ALL
+ * ANY
+ * EXISTS
+ * IN
+ * SOME
+
+In general, correlated subqueries are supported. EXISTS and NOT EXISTS
+subqueries that do not contain a correlation join are not yet supported.
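The equality-plus-inequality join rule described above can be sketched with a runnable example. The snippet below uses SQLite via Python's `sqlite3` rather than Drill, and the table contents are made up for illustration; the shape of the ON clause is the point:

```python
import sqlite3

# Illustration of a non-equijoin condition combined, in a conjunction,
# with an equality condition on the same two tables (SQLite, not Drill).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE table1 (col1 INTEGER, c2 INTEGER)")
cur.execute("CREATE TABLE table2 (col1 INTEGER, c2 INTEGER)")
cur.executemany("INSERT INTO table1 VALUES (?, ?)", [(1, 5), (2, 9)])
cur.executemany("INSERT INTO table2 VALUES (?, ?)", [(1, 7), (2, 3)])

rows = cur.execute(
    "SELECT table1.col1, table1.c2, table2.c2 "
    "FROM table1 JOIN table2 "
    "ON table1.col1 = table2.col1 AND table1.c2 < table2.c2"
).fetchall()
print(rows)  # only the col1=1 pair also satisfies the c2 inequality
```

Only the row pair that passes both the equality and the inequality survives the join.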
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/sql-ref/cmd-summary/004-show-files.md ---------------------------------------------------------------------- diff --git a/_docs/sql-ref/cmd-summary/004-show-files.md b/_docs/sql-ref/cmd-summary/004-show-files.md new file mode 100644 index 0000000..1fcf395 --- /dev/null +++ b/_docs/sql-ref/cmd-summary/004-show-files.md @@ -0,0 +1,65 @@
+---
+title: "SHOW FILES Command"
+parent: "SQL Commands Summary"
+---
+The SHOW FILES command provides a quick report of the file systems that are
+visible to Drill for query purposes. This command is unique to Apache Drill.
+
+## Syntax
+
+The SHOW FILES command supports the following syntax.
+
+    SHOW FILES [ FROM filesystem.directory_name | IN filesystem.directory_name ];
+
+The FROM or IN clause is required if you do not specify a default file system
+first. You can do this with the USE command. FROM and IN are synonyms.
+
+The directory name is optional. (If the directory name is a Drill reserved
+word, you must use backticks around the name.)
+
+The command returns standard Linux `stat` information for each file or
+directory, such as permissions, owner, and group values. This information is
+not specific to Drill.
+
+## Examples
+
+The following example returns information about directories and files in the
+local (`dfs`) file system.
+ + 0: jdbc:drill:> use dfs; + + +------------+------------+ + | ok | summary | + +------------+------------+ + | true | Default schema changed to 'dfs' | + +------------+------------+ + 1 row selected (0.318 seconds) + + 0: jdbc:drill:> show files; + +------------+-------------+------------+------------+------------+------------+-------------+------------+------------------+ + | name | isDirectory | isFile | length | owner | group | permissions | accessTime | modificationTime | + +------------+-------------+------------+------------+------------+------------+-------------+------------+------------------+ + | user | true | false | 1 | mapr | mapr | rwxr-xr-x | 2014-07-30 21:37:06.0 | 2014-07-31 22:15:53.193 | + | backup.tgz | false | true | 36272 | root | root | rw-r--r-- | 2014-07-31 22:09:13.0 | 2014-07-31 22:09:13.211 | + | JSON | true | false | 1 | root | root | rwxr-xr-x | 2014-07-31 15:22:42.0 | 2014-08-04 15:43:07.083 | + | scripts | true | false | 3 | root | root | rwxr-xr-x | 2014-07-31 22:10:51.0 | 2014-08-04 18:23:09.236 | + | temp | true | false | 2 | root | root | rwxr-xr-x | 2014-08-01 20:07:37.0 | 2014-08-01 20:09:42.595 | + | hbase | true | false | 10 | mapr | mapr | rwxr-xr-x | 2014-07-30 21:36:08.0 | 2014-08-04 18:31:13.778 | + | tables | true | false | 0 | root | root | rwxrwxrwx | 2014-07-31 22:14:35.0 | 2014-08-04 15:42:43.415 | + | CSV | true | false | 4 | root | root | rwxrwxrwx | 2014-07-31 17:34:53.0 | 2014-08-04 + ... 
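The columns in this report are ordinary file metadata. As a rough illustration of what SHOW FILES gathers per entry, the sketch below lists comparable fields for a local directory in plain Python; it is not Drill code, and the field names simply mirror the column headings above:

```python
import os
import stat
import datetime

def show_files(path="."):
    """Return SHOW FILES-like metadata rows for a local directory."""
    rows = []
    for entry in os.scandir(path):
        st = entry.stat()
        rows.append({
            "name": entry.name,
            "isDirectory": entry.is_dir(),
            "isFile": entry.is_file(),
            "length": st.st_size,
            "permissions": stat.filemode(st.st_mode),
            "modificationTime": datetime.datetime.fromtimestamp(
                st.st_mtime).isoformat(sep=" "),
        })
    return rows

for row in show_files("."):
    print(row)
```

Each dictionary corresponds to one row of the SHOW FILES report.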
+
+The following example shows the files in a specific directory in the `dfs`
+file system:
+
+    0: jdbc:drill:> show files in dfs.CSV;
+
+    +------------+-------------+------------+------------+------------+------------+-------------+------------+------------------+
+    | name | isDirectory | isFile | length | owner | group | permissions | accessTime | modificationTime |
+    +------------+-------------+------------+------------+------------+------------+-------------+------------+------------------+
+    | customers.csv | false | true | 62011 | root | root | rw-r--r-- | 2014-08-04 18:30:39.0 | 2014-08-04 18:30:39.314 |
+    | products.csv.small | false | true | 34972 | root | root | rw-r--r-- | 2014-07-31 23:58:42.0 | 2014-07-31 23:59:16.849 |
+    | products.csv | false | true | 34972 | root | root | rw-r--r-- | 2014-08-01 06:39:34.0 | 2014-08-04 15:58:09.325 |
+    | products.csv.bad | false | true | 62307 | root | root | rw-r--r-- | 2014-08-04 15:58:02.0 | 2014-08-04 15:58:02.612 |
+    +------------+-------------+------------+------------+------------+------------+-------------+------------+------------------+
+    4 rows selected (0.165 seconds)
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/sql-ref/data-types/001-date.md ---------------------------------------------------------------------- diff --git a/_docs/sql-ref/data-types/001-date.md b/_docs/sql-ref/data-types/001-date.md new file mode 100644 index 0000000..6340e35 --- /dev/null +++ b/_docs/sql-ref/data-types/001-date.md @@ -0,0 +1,148 @@
+---
+title: "Supported Date/Time Data Type Formats"
+parent: "Data Types"
+---
+You must use supported `date` and `time` formats when you `SELECT` date and
+time literals or when you `CAST()` from `VARCHAR` to `date` and `time` data
+types.
+Apache Drill currently supports specific formats for the following
+`date` and `time` data types:
+
+ * Date
+ * Timestamp
+ * Time
+ * Interval
+ * Interval Year
+ * Interval Day
+ * Literal
+
+The following query provides an example of how to `SELECT` a few of the
+supported date and time formats as literals:
+
+    select date '2008-2-23', timestamp '2008-1-23 14:24:23', time '10:20:30' from dfs.`/tmp/input.json`;
+
+The following query provides an example where `VARCHAR` data in a file is
+`CAST()` to supported `date` and `time` formats:
+
+    select cast(col_A as date), cast(col_B as timestamp), cast(col_C as time) from dfs.`/tmp/dates.json`;
+
+`Date`, `timestamp`, and `time` data types store values in UTC. Currently,
+Apache Drill does not support `timestamp` with time zone.
+
+## Date
+
+Drill supports the `date` data type in the following format:
+
+    YYYY-MM-DD (year-month-date)
+
+The following table provides some examples for the `date` data type:
+
+ | Use | Example |
+ | --- | ------- |
+ | Literal | `` select date '2008-2-23' from dfs.`/tmp/input.json`; `` |
+ | `JSON` input | `{"date_col" : "2008-2-23"}` |
+ | `CAST` from `VARCHAR` | `` select CAST(date_col as date) as CAST_DATE from dfs.`/tmp/input.json`; `` |
+
+## Timestamp
+
+Drill supports the `timestamp` data type in the following format:
+
+    yyyy-MM-dd HH:mm:ss.SSS (year-month-date hour:minute:sec.milliseconds)
+
+The following table provides some examples for the `timestamp` data type:
+
+<table>
+ <tbody>
+ <tr>
+ <th>Use</th>
+ <th>Example</th>
+ </tr>
+ <tr>
+ <td valign="top">Literal</td>
+ <td valign="top"><code><span style="color: rgb(0,0,0);">select timestamp '2008-2-23 10:20:30.345', timestamp '2008-2-23 10:20:30' from dfs.`/tmp/input.json`;</span></code>
+ </td></tr>
+ <tr>
+ <td colspan="1" valign="top"><code>JSON</code> Input</td>
+ <td colspan="1" valign="top"><code><span style="color: rgb(0,0,0);">{"timestamp_col": "2008-2-23 15:20:30.345"}<br /></span><span style="color: 
rgb(0,0,0);">{"timestamp_col": "2008-2-23 10:20:30"}</span></code> <span style="color: rgb(0,0,0);">The fractional millisecond component is optional.</span></td>
+ </tr>
+ <tr>
+ <td colspan="1" valign="top"><code>CAST</code> from <code>VARCHAR</code></td>
+ <td colspan="1" valign="top"><code><span style="color: rgb(0,0,0);">select cast(timestamp_col as timestamp) from dfs.`/tmp/input.json`; </span></code></td>
+ </tr>
+ </tbody>
+ </table>
+
+## Time
+
+Drill supports the `time` data type in the following format:
+
+    HH:mm:ss.SSS (hour:minute:sec.milliseconds)
+
+The following table provides some examples for the `time` data type:
+
+<table><tbody><tr>
+ <th>Use</th>
+ <th>Example</th>
+ </tr>
+ <tr>
+ <td valign="top">Literal</td>
+ <td valign="top"><code><span style="color: rgb(0,0,0);">select time '15:20:30', time '10:20:30.123' from dfs.`/tmp/input.json`;</span></code></td>
+ </tr>
+ <tr>
+ <td colspan="1" valign="top"><code>JSON</code> Input</td>
+ <td colspan="1" valign="top"><code><span style="color: rgb(0,0,0);">{"time_col" : "10:20:30.999"}<br /></span><span style="color: rgb(0,0,0);">{"time_col": "10:20:30"}</span></code></td>
+ </tr>
+ <tr>
+ <td colspan="1" valign="top"><code>CAST</code> from <code>VARCHAR</code></td>
+ <td colspan="1" valign="top"><code><span style="color: rgb(0,0,0);">select cast(time_col as time) from dfs.`/tmp/input.json`;</span></code></td>
+</tr></tbody>
+</table>
+
+## Interval
+
+Drill supports the `interval year` and `interval day` data types.
+
+### Interval Year
+
+The `interval year` data type stores time duration in years and months.
Drill
+supports the `interval year` data type in the following format:
+
+    P [qty] Y [qty] M
+
+The following table provides examples for the `interval year` data type:
+
+<table><tbody><tr>
+<th>Use</th>
+<th>Example</th></tr>
+ <tr>
+ <td valign="top">Literals</td>
+ <td valign="top"><code><span style="color: rgb(0,0,0);">select interval '1-2' year to month from dfs.`/tmp/input.json`;<br /></span><span style="color: rgb(0,0,0);">select interval '1' year from dfs.`/tmp/input.json`;<br /></span><span style="color: rgb(0,0,0);">select interval '13' month from dfs.`/tmp/input.json`;</span></code></td></tr><tr>
+ <td colspan="1" valign="top"><code>JSON</code> Input</td>
+ <td colspan="1" valign="top"><code><span style="color: rgb(0,0,0);">{"col" : "P1Y2M"}<br /></span><span style="color: rgb(0,0,0);">{"col" : "P-1Y2M"}<br /></span><span style="color: rgb(0,0,0);">{"col" : "P-1Y-2M"}<br /></span><span style="color: rgb(0,0,0);">{"col": "P10M"}<br /></span><span style="color: rgb(0,0,0);">{"col": "P5Y"}</span></code></td>
+ </tr>
+ <tr>
+ <td colspan="1" valign="top"><code>CAST</code> from <code>VARCHAR</code></td>
+ <td colspan="1" valign="top"><code><span style="color: rgb(0,0,0);">select cast(col as interval year) from dfs.`/tmp/input.json`;</span></code></td>
+ </tr>
+ </tbody></table>
+
+### Interval Day
+
+The `interval day` data type stores time duration in days, hours, minutes, and
+seconds. You do not need to specify all fields in a given interval.
Drill
+supports the `interval day` data type in the following format:
+
+    P [qty] D T [qty] H [qty] M [qty] S
+
+The following table provides examples for the `interval day` data type:
+
+<table><tbody><tr><th>Use</th><th>Example</th></tr><tr><td valign="top">Literal</td><td valign="top"><code><span style="color: rgb(0,0,0);">select interval '1 10:20:30.123' day to second from dfs.`/tmp/input.json`;<br /></span><span style="color: rgb(0,0,0);">select interval '1 10' day to hour from dfs.`/tmp/input.json`;<br /></span><span style="color: rgb(0,0,0);">select interval '10' day from dfs.`/tmp/input.json`;<br /></span><span style="color: rgb(0,0,0);">select interval '10' hour from dfs.`/tmp/input.json`;<br /></span><span style="color: rgb(0,0,0);">select interval '10.999' second from dfs.`/tmp/input.json`;</span></code></td></tr><tr><td colspan="1" valign="top"><code>JSON</code> Input</td><td colspan="1" valign="top"><code><span style="color: rgb(0,0,0);">{"col" : "P1DT10H20M30S"}<br /></span><span style="color: rgb(0,0,0);">{"col" : "P1DT10H20M30.123S"}<br /></span><span style="color: rgb(0,0,0);">{"col" : "P1D"}<br /></span><span style="color: rgb(0,0,0);">{"col" : "PT10H"}<br /></span><span style="color: rgb(0,0,0);">{"col" : "PT10.10S"}<br /></span><span style="color: rgb(0,0,0);">{"col" : "PT20S"}<br /></span><span style="color: rgb(0,0,0);">{"col" : "PT10H10S"}</span></code></td></tr><tr><td colspan="1" valign="top"><code>CAST</code> from <code>VARCHAR</code></td><td colspan="1" valign="top"><code><span style="color: rgb(0,0,0);">select cast(col as interval day) from dfs.`/tmp/input.json`;</span></code></td></tr></tbody></table>
+
+## Literal
+
+The following table provides a list of `date/time` literals that Drill
+supports with examples of each:
+
+<table><tbody><tr><th>Format</th><th colspan="1">Interpretation</th><th>Example</th></tr><tr><td colspan="1" valign="top"><code><span style="color: rgb(0,0,0);">interval '1 10:20:30.123' day to 
second</span></code></td><td colspan="1" valign="top"><code>1 day, 10 hours, 20 minutes, 30 seconds, and 123 thousandths of a second</code></td><td colspan="1" valign="top"><code><span style="color: rgb(0,0,0);">select interval '1 10:20:30.123' day to second from dfs.`/tmp/input.json`;</span></code></td></tr><tr><td colspan="1" valign="top"><code><span style="color: rgb(0,0,0);">interval '1 10' day to hour</span></code></td><td colspan="1" valign="top"><code>1 day 10 hours</code></td><td colspan="1" valign="top"><code><span style="color: rgb(0,0,0);">select interval '1 10' day to hour from dfs.`/tmp/input.json`;</span></code></td></tr><tr><td colspan="1" valign="top"><code><span style="color: rgb(0,0,0);">interval '10' day</span></code></td><td colspan="1" valign="top"><code>10 days</code></td><td colspan="1" valign="top"><code><span style="color: rgb(0,0,0);">select interval '10' day from dfs.`/tmp/input.json`;</span></code></td></tr><tr><td colspan="1" valign="top"><code><span style="color: rgb(0,0,0);">interval '10' hour</span></code></td><td colspan="1" valign="top"><code>10 hours</code></td><td colspan="1" valign="top"><code><span style="color: rgb(0,0,0);">select interval '10' hour from dfs.`/tmp/input.json`;</span></code></td></tr><tr><td colspan="1" valign="top"><code><span style="color: rgb(0,0,0);">interval '10.999' second</span></code></td><td colspan="1" valign="top"><code>10.999 seconds</code></td><td colspan="1" valign="top"><code><span style="color: rgb(0,0,0);">select interval '10.999' second from dfs.`/tmp/input.json`; </span></code></td></tr></tbody></table>
+
+
+
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/sql-ref/nested/001-flatten.md ---------------------------------------------------------------------- diff --git a/_docs/sql-ref/nested/001-flatten.md b/_docs/sql-ref/nested/001-flatten.md new file mode 100644 index 0000000..2769000 --- /dev/null +++ b/_docs/sql-ref/nested/001-flatten.md @@ -0,0 +1,89 @@
+---
+title: "FLATTEN Function"
+parent: "Nested Data Functions"
+---
+The FLATTEN function is useful for flexible exploration of repeated data.
+FLATTEN separates the elements in a repeated field into individual records. To
+maintain the association between each flattened value and the other fields in
+the record, all of the other columns are copied into each new record. A very
+simple example would turn this data (one record):
+
+    {
+      "x" : 5,
+      "y" : "a string",
+      "z" : [ 1,2,3]
+    }
+
+into three distinct records:
+
+    select x, y, flatten(z) from table;
+    | x | y | z |
+    +-------------+----------------+-----------+
+    | 5 | "a string" | 1 |
+    | 5 | "a string" | 2 |
+    | 5 | "a string" | 3 |
+
+The function takes a single argument, which must be an array (the `z` column
+in this example).
+
+
+
+For a more interesting example, consider the JSON data in the publicly
+available [Yelp](https://www.yelp.com/dataset_challenge/dataset) data set. The
+first query below returns three columns from the
+`yelp_academic_dataset_business.json` file: `name`, `hours`, and `categories`.
+The query is restricted to distinct rows where the name is `zpizza`.
The +query returns only one row that meets those criteria; however, note that this +row contains an array of four categories: + + 0: jdbc:drill:zk=local> select distinct name, hours, categories + from dfs.yelp.`yelp_academic_dataset_business.json` + where name ='zpizza'; + +------------+------------+------------+ + | name | hours | categories | + +------------+------------+------------+ + | zpizza | {"Tuesday":{"close":"22:00","open":"10:00"},"Friday":{"close":"23:00","open":"10:00"},"Monday":{"close":"22:00","open":"10:00"},"Wednesday":{"close":"22:00","open":"10:00"},"Thursday":{"close":"22:00","open":"10:00"},"Sunday":{"close":"22:00","open":"10:00"},"Saturday":{"close":"23:00","open":"10:00"}} | ["Gluten-Free","Pizza","Vegan","Restaurants"] | + +The FLATTEN function can operate on this single row and return multiple rows, +one for each category: + + 0: jdbc:drill:zk=local> select distinct name, flatten(categories) as categories + from dfs.yelp.`yelp_academic_dataset_business.json` + where name ='zpizza' order by 2; + +------------+-------------+ + | name | categories | + +------------+-------------+ + | zpizza | Gluten-Free | + | zpizza | Pizza | + | zpizza | Restaurants | + | zpizza | Vegan | + +------------+-------------+ + 4 rows selected (2.797 seconds) + +Having used the FLATTEN function to break down arrays into distinct rows, you +can run queries that do deeper analysis on the flattened result set. For +example, you can use FLATTEN in a subquery, then apply WHERE clause +constraints or aggregate functions to the results in the outer query. 
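The row-multiplying behavior described above can be sketched in a few lines of plain Python; this illustrates the semantics only and is not Drill code:

```python
def flatten_records(records, array_col):
    """One output record per array element; all other columns are copied."""
    out = []
    for rec in records:
        for value in rec[array_col]:
            new_rec = dict(rec)       # copy the non-array columns
            new_rec[array_col] = value  # replace the array with one element
            out.append(new_rec)
    return out

data = [{"x": 5, "y": "a string", "z": [1, 2, 3]}]
for row in flatten_records(data, "z"):
    print(row)
# {'x': 5, 'y': 'a string', 'z': 1}
# {'x': 5, 'y': 'a string', 'z': 2}
# {'x': 5, 'y': 'a string', 'z': 3}
```

A record with an n-element array becomes n records, which is exactly what makes the subsequent WHERE and aggregate queries possible.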
+
+The following query uses the same data file as the previous query to flatten
+the categories array, then runs a COUNT function on the flattened result:
+
+    select celltbl.catl, count(celltbl.catl) catcount
+    from (select flatten(categories) catl
+    from dfs.yelp.`yelp_academic_dataset_business.json`) celltbl
+    group by celltbl.catl
+    order by count(celltbl.catl) desc limit 5;
+
+    +---------------+------------+
+    | catl | catcount |
+    +---------------+------------+
+    | Restaurants | 14303 |
+    | Shopping | 6428 |
+    | Food | 5209 |
+    | Beauty & Spas | 3421 |
+    | Nightlife | 2870 |
+    +---------------+------------+
+
+A common use case for FLATTEN is its use in conjunction with the
+[KVGEN](/drill/docs/kvgen-function) function.
+
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/sql-ref/nested/002-kvgen.md ---------------------------------------------------------------------- diff --git a/_docs/sql-ref/nested/002-kvgen.md b/_docs/sql-ref/nested/002-kvgen.md new file mode 100644 index 0000000..f619864 --- /dev/null +++ b/_docs/sql-ref/nested/002-kvgen.md @@ -0,0 +1,150 @@
+---
+title: "KVGEN Function"
+parent: "Nested Data Functions"
+---
+KVGEN stands for _key-value generation_. This function is useful when complex
+data files contain arbitrary maps that consist of relatively "unknown" column
+names. Instead of having to specify columns in the map to access the data, you
+can use KVGEN to return a list of the keys that exist in the map. KVGEN turns
+a map with a wide set of columns into an array of key-value pairs.
+
+In turn, you can write analytic queries that return a subset of the generated
+keys or constrain the keys in some way. For example, you can use the
+[FLATTEN](/drill/docs/flatten-function) function to break the
+array down into multiple distinct rows and further query those rows.
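The map-to-array transformation that KVGEN performs can be sketched in plain Python (an illustration of the semantics, not Drill code):

```python
def kvgen(mapping):
    """Turn a map into an array of {"key": ..., "value": ...} pairs."""
    return [{"key": k, "value": v} for k, v in mapping.items()]

print(kvgen({"a": "valA", "b": "valB"}))
# [{'key': 'a', 'value': 'valA'}, {'key': 'b', 'value': 'valB'}]
```

Because every record produces pairs with the same two fields, `key` and `value`, the output has a uniform shape regardless of which column names appeared in the original map.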
+ + + +For example, assume that a JSON file contains this data: + + {"a": "valA", "b": "valB"} + {"c": "valC", "d": "valD"} + + +KVGEN would operate on this data to generate: + + [{"key": "a", "value": "valA"}, {"key": "b", "value": "valB"}] + [{"key": "c", "value": "valC"}, {"key": "d", "value": "valD"}] + +Applying the [FLATTEN](/drill/docs/flatten-function) function to +this data would return: + + {"key": "a", "value": "valA"} + {"key": "b", "value": "valB"} + {"key": "c", "value": "valC"} + {"key": "d", "value": "valD"} + +Assume that a JSON file called `kvgendata.json` includes multiple records that +look like this one: + + { + "rownum": 1, + "bigintegercol": { + "int_1": 1, + "int_2": 2, + "int_3": 3 + }, + "varcharcol": { + "varchar_1": "abc", + "varchar_2": "def", + "varchar_3": "xyz" + }, + "boolcol": { + "boolean_1": true, + "boolean_2": false, + "boolean_3": true + }, + "float8col": { + "f8_1": 1.1, + "f8_2": 2.2 + }, + "complex": [ + { + "col1": 3 + }, + { + "col2": 2, + "col3": 1 + }, + { + "col1": 7 + } + ] + } + + { + "rownum": 3, + "bigintegercol": { + "int_1": 1, + "int_3": 3 + }, + "varcharcol": { + "varchar_1": "abcde", + "varchar_2": null, + "varchar_3": "xyz", + "varchar_4": "xyz2" + }, + "boolcol": { + "boolean_1": true, + "boolean_2": false + }, + "float8col": { + "f8_1": 1.1, + "f8_3": 6.6 + }, + "complex": [ + { + "col1": 2, + "col3": 1 + } + ] + } + ... 
+ + +A SELECT * query against this specific record returns the following row: + + 0: jdbc:drill:zk=local> select * from dfs.yelp.`kvgendata.json` where rownum=1; + + +------------+---------------+------------+------------+------------+------------+ + | rownum | bigintegercol | varcharcol | boolcol | float8col | complex | + +------------+---------------+------------+------------+------------+------------+ + | 1 | {"int_1":1,"int_2":2,"int_3":3} | {"varchar_1":"abc","varchar_2":"def","varchar_3":"xyz"} | {"boolean_1":true,"boolean_2":false,"boolean_3":true} | {"f8_1":1.1,"f8_2":2.2} | [{"col1":3},{"col2":2,"col3":1},{"col1":7}] | + +------------+---------------+------------+------------+------------+------------+ + 1 row selected (0.122 seconds) + +You can use the KVGEN function to turn the maps in this data into key-value +pairs. For example: + + 0: jdbc:drill:zk=local> select kvgen(varcharcol) from dfs.yelp.`kvgendata.json`; + +------------+ + | EXPR$0 | + +------------+ + | [{"key":"varchar_1","value":"abc"},{"key":"varchar_2","value":"def"},{"key":"varchar_3","value":"xyz"}] | + | [{"key":"varchar_1","value":"abcd"}] | + | [{"key":"varchar_1","value":"abcde"},{"key":"varchar_3","value":"xyz"},{"key":"varchar_4","value":"xyz2"}] | + | [{"key":"varchar_1","value":"abc"},{"key":"varchar_2","value":"def"}] | + +------------+ + 4 rows selected (0.091 seconds) + +Now you can apply the FLATTEN function to break out the key-value pairs into +distinct rows: + + 0: jdbc:drill:zk=local> select flatten(kvgen(varcharcol)) from dfs.yelp.`kvgendata.json`; + +------------+ + | EXPR$0 | + +------------+ + | {"key":"varchar_1","value":"abc"} | + | {"key":"varchar_2","value":"def"} | + | {"key":"varchar_3","value":"xyz"} | + | {"key":"varchar_1","value":"abcd"} | + | {"key":"varchar_1","value":"abcde"} | + | {"key":"varchar_3","value":"xyz"} | + | {"key":"varchar_4","value":"xyz2"} | + | {"key":"varchar_1","value":"abc"} | + | {"key":"varchar_2","value":"def"} | + +------------+ + 
9 rows selected (0.151 seconds) + +See the description of [FLATTEN](/drill/docs/flatten-function) +for an example of a query against the flattened data. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/sql-ref/nested/003-repeated-cnt.md ---------------------------------------------------------------------- diff --git a/_docs/sql-ref/nested/003-repeated-cnt.md b/_docs/sql-ref/nested/003-repeated-cnt.md new file mode 100644 index 0000000..2b332b3 --- /dev/null +++ b/_docs/sql-ref/nested/003-repeated-cnt.md @@ -0,0 +1,33 @@ +--- +title: "REPEATED_COUNT Function" +parent: "Nested Data Functions" +--- +This function counts the values in an array. The following example returns the +counts for the `categories` array in the `yelp_academic_dataset_business.json` +file. The counts are restricted to rows that contain the string `pizza`. + + SELECT name, REPEATED_COUNT(categories) + FROM dfs.yelp.`yelp_academic_dataset_business.json` + WHERE name LIKE '%pizza%'; + + +---------------+------------+ + | name | EXPR$1 | + +---------------+------------+ + | Villapizza | 2 | + | zpizza | 4 | + | zpizza | 4 | + | Luckys pizza | 2 | + | Zpizza | 2 | + | S2pizzabar | 4 | + | Dominos pizza | 5 | + +---------------+------------+ + + 7 rows selected (2.03 seconds) + +The function requires a single argument, which must be an array. Note that +this function is not a standard SQL aggregate function and does not require +the count to be grouped by other columns in the select list (such as `name` in +this example). + +For another example of this function, see the following lesson in the Apache +Drill Tutorial for Hadoop: [Lesson 3: Run Queries on Complex Data Types](/drill/docs/lession-3-run-queries-on-complex-data-types/). 
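Because REPEATED_COUNT is evaluated per record rather than as a grouped aggregate, its behavior can be sketched in plain Python (illustration only, not Drill code; the sample rows echo the Yelp data above):

```python
def repeated_count(record, array_col):
    """Number of elements in the record's array column."""
    return len(record[array_col])

rows = [
    {"name": "zpizza",
     "categories": ["Gluten-Free", "Pizza", "Vegan", "Restaurants"]},
    {"name": "Villapizza", "categories": ["Pizza", "Restaurants"]},
]
for rec in rows:
    print(rec["name"], repeated_count(rec, "categories"))
# zpizza 4
# Villapizza 2
```

Each input row yields exactly one output row, with the count computed from that row's own array.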
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/tutorial/001-install-sandbox.md ---------------------------------------------------------------------- diff --git a/_docs/tutorial/001-install-sandbox.md b/_docs/tutorial/001-install-sandbox.md new file mode 100644 index 0000000..26360ff --- /dev/null +++ b/_docs/tutorial/001-install-sandbox.md @@ -0,0 +1,33 @@
+---
+title: "Installing the Apache Drill Sandbox"
+parent: "Apache Drill Tutorial"
+---
+## Prerequisites
+
+The MapR Sandbox with Apache Drill runs on VMware Player and VirtualBox, free
+desktop applications that you can use to run a virtual machine on a Windows,
+Mac, or Linux PC. Before you install the MapR Sandbox with Apache Drill,
+verify that the host system meets the following prerequisites:
+
+ * VMware Player or VirtualBox is installed.
+ * At least 20 GB free hard disk space, at least 4 physical cores, and 8 GB of RAM are available. Performance increases with more RAM and free hard disk space.
+ * The host uses one of the following 64-bit x86 architectures:
+   * A 1.3 GHz or faster AMD CPU with segment-limit support in long mode
+   * A 1.3 GHz or faster Intel CPU with VT-x support
+ * If you have an Intel CPU with VT-x support, verify that VT-x support is enabled in the host system BIOS. The BIOS settings that must be enabled for VT-x support vary depending on the system vendor. See the VMware knowledge base article at <http://kb.vmware.com/kb/1003944> for information about how to determine if VT-x support is enabled.
+
+### VM Player Downloads
+
+For Linux, Mac, or Windows, download the free [VMware Player](https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/6_0) or
+[VirtualBox](https://www.virtualbox.org/wiki/Downloads). Optionally, you can
+purchase [VMware Fusion](http://www.vmware.com/products/fusion/) for Mac.
+
+### VM Player Installation
+
+The following list provides links to the virtual machine installation
+instructions:
+
+ * To install the VMware Player, see the [VMware documentation](http://www.vmware.com/support/pubs/player_pubs.html). Use of VMware Player is subject to the VMware Player end user license terms. VMware does not provide support for VMware Player. For self-help resources, see the [VMware Player FAQ](http://www.vmware.com/products/player/faqs.html).
+ * To install VirtualBox, see the [Oracle VM VirtualBox User Manual](http://dlc.sun.com.edgesuite.net/virtualbox/4.3.4/UserManual.pdf). By downloading VirtualBox, you agree to the terms and conditions of the respective license.
+
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/tutorial/002-get2kno-sb.md ---------------------------------------------------------------------- diff --git a/_docs/tutorial/002-get2kno-sb.md b/_docs/tutorial/002-get2kno-sb.md new file mode 100644 index 0000000..9b11b9d --- /dev/null +++ b/_docs/tutorial/002-get2kno-sb.md @@ -0,0 +1,232 @@
+---
+title: "Getting to Know the Drill Sandbox"
+parent: "Apache Drill Tutorial"
+---
+This section describes the configuration of the Apache Drill system that you
+have installed and introduces the overall use case for the tutorial.
+
+# Storage Plugins Overview
+
+The Hadoop cluster within the sandbox is set up with MapR-FS, MapR-DB, and
+Hive, which all serve as data sources for Drill in this tutorial. Before you
+can run queries against these data sources, Drill requires each one to be
+configured as a storage plugin. A storage plugin defines how Drill connects to
+a data source and provides interfaces for reading and writing data and for
+retrieving metadata. Each storage plugin also exposes
+optimization rules for Drill to leverage for efficient query execution.
+
+Take a look at the pre-configured storage plugins by opening the Drill Web UI.
+
+Feel free to skip this section and jump directly to the queries: [Lesson 1:
+Learn About the Data
+Set](/drill/docs/lession-1-learn-about-the-data-set)
+
+ * Launch a web browser and go to: `http://<IP address of the sandbox>:8047`
+ * Go to the Storage tab
+ * Open the configured storage plugins one at a time by clicking Update
+ * You will see the following plugins configured.
+
+## maprdb
+
+A storage plugin configuration for MapR-DB in the sandbox. Drill uses a single
+storage plugin for connecting to HBase as well as MapR-DB, which is an
+enterprise-grade in-Hadoop NoSQL database. In addition to the following brief
+example, see [Registering HBase](/drill/docs/registering-hbase) for more
+information on how to configure Drill to query HBase.
+
+    {
+      "type" : "hbase",
+      "enabled" : true,
+      "config" : {
+        "hbase.table.namespace.mappings" : "*:/tables"
+      }
+    }
+
+## dfs
+
+This is a storage plugin configuration for the MapR file system (MapR-FS) in
+the sandbox. The connection attribute indicates the type of distributed file
+system: in this case, MapR-FS. Drill can work with any distributed file
+system, including HDFS, S3, and so on.
+
+The configuration also includes a set of workspaces; each one represents a
+location in MapR-FS:
+
+ * root: access to the root file system location
+ * clicks: access to nested JSON log data
+ * logs: access to flat (non-nested) JSON log data in the logs directory and its subdirectories
+ * views: a workspace for creating views
+
+A workspace in Drill is a location where users can easily access a specific
+set of data and collaborate with each other by sharing artifacts. Users can
+create as many workspaces as they need within Drill.
+
+Each workspace can also be configured as "writable" or not, which indicates
+whether users can write data to this location and defines the storage format
+in which the data will be written (parquet, csv, json).
These attributes
+become relevant when you explore SQL commands, especially CREATE TABLE
+AS (CTAS) and CREATE VIEW.
+
+Drill can query files and directories directly and can detect the file formats
+based on the file extension or the first few bits of data within the file.
+However, additional information around formats is required for Drill, such as
+delimiters for text files, which are specified in the "formats" section below.
+
+    {
+      "type": "file",
+      "enabled": true,
+      "connection": "maprfs:///",
+      "workspaces": {
+        "root": {
+          "location": "/mapr/demo.mapr.com/data",
+          "writable": false,
+          "storageformat": null
+        },
+        "clicks": {
+          "location": "/mapr/demo.mapr.com/data/nested",
+          "writable": true,
+          "storageformat": "parquet"
+        },
+        "logs": {
+          "location": "/mapr/demo.mapr.com/data/flat",
+          "writable": true,
+          "storageformat": "parquet"
+        },
+        "views": {
+          "location": "/mapr/demo.mapr.com/data/views",
+          "writable": true,
+          "storageformat": "parquet"
+        }
+      },
+      "formats": {
+        "psv": {
+          "type": "text",
+          "extensions": [
+            "tbl"
+          ],
+          "delimiter": "|"
+        },
+        "csv": {
+          "type": "text",
+          "extensions": [
+            "csv"
+          ],
+          "delimiter": ","
+        },
+        "tsv": {
+          "type": "text",
+          "extensions": [
+            "tsv"
+          ],
+          "delimiter": "\t"
+        },
+        "parquet": {
+          "type": "parquet"
+        },
+        "json": {
+          "type": "json"
+        }
+      }
+    }
+
+## hive
+
+A storage plugin configuration for a Hive data warehouse within the sandbox.
+Drill connects to the Hive metastore by using the configured metastore thrift
+URI. Metadata for Hive tables is automatically available for users to query.
+
+    {
+      "type": "hive",
+      "enabled": true,
+      "configProps": {
+        "hive.metastore.uris": "thrift://localhost:9083",
+        "hive.metastore.sasl.enabled": "false"
+      }
+    }
+
+# Client Application Interfaces
+
+Drill also provides application interfaces for client tools to connect to
+Drill and access data. The interfaces include the following.
+
+### ODBC/JDBC drivers
+
+Drill provides ODBC/JDBC drivers to connect from BI tools such as Tableau,
+MicroStrategy, SQuirreL, and Jaspersoft; refer to [Using ODBC to Access Apache
+Drill from BI Tools](/drill/docs/odbc-jdbc-interfaces/using-odbc-to-access-apache-drill-from-bi-tools)
+and [Using JDBC to Access Apache Drill](/drill/docs/odbc-jdbc-interfaces#using-jdbc-to-access-apache-drill-from-squirrel)
+to learn more.
+
+### SQLLine
+
+SQLLine is a JDBC application that comes packaged with Drill. To start working
+with it, use the command line on the demo cluster to log in as root, then
+enter `sqlline`. Use `mapr` as the login password. For example:
+
+    $ ssh root@localhost -p 2222
+    Password:
+    Last login: Mon Sep 15 13:46:08 2014 from 10.250.0.28
+    Welcome to your Mapr Demo virtual machine.
+    [root@maprdemo ~]# sqlline
+    sqlline version 1.1.6
+    0: jdbc:drill:>
+
+### Drill Web UI
+
+The Drill Web UI is a simple user interface for configuring and managing
+Apache Drill. This UI can be launched from any of the nodes in the Drill
+cluster. The configuration for Drill includes setting up storage plugins that
+represent the data sources on which Drill performs queries. The sandbox comes
+with storage plugins configured for Hive, HBase, the MapR file system, and the
+local file system.
+
+Users and developers can get the necessary information for tuning and
+performing diagnostics on queries, such as the list of queries executed in a
+session and detailed query plan profiles for each.
+
+Detailed configuration and management of Drill is out of scope for this
+tutorial.
+
+The Web interface for Apache Drill also provides a query UI where users can
+submit queries to Drill and observe results. Here is a screen shot of the Web
+UI for Apache Drill:
+
+![drill query flow]({{ site.baseurl }}/docs/img/DrillWebUI.png)
+
+### REST API
+
+Drill provides a simple REST API for users to query data as well as manage
+the system.
The Web UI leverages the REST API to talk to Drill.
+
+This tutorial introduces sample queries that you can run by using SQLLine.
+Note that you can run the queries just as easily by launching the Drill Web
+UI. No additional installation or configuration is required.
+
+# Use Case Overview
+
+As you run through the queries in this tutorial, put yourself in the shoes of
+an analyst with basic SQL skills. Let us imagine that the analyst works for an
+emerging online retail business that accepts purchases from its customers
+through both an established web-based interface and a new mobile application.
+
+The analyst is data-driven and operates mostly on the business side with
+little or no interaction with the IT department. Recently the central IT team
+has implemented a Hadoop-based infrastructure to reduce the cost of the legacy
+database system, and most of the DWH/ETL workload is now handled by
+Hadoop/Hive. The master customer profile information and product catalog are
+managed in MapR-DB, which is a NoSQL database. The IT team has also started
+acquiring clickstream data that comes from web and mobile applications. This
+data is stored in Hadoop as JSON files.
+
+The analyst has a number of data sources to explore, but exploring them in
+isolation is not the way to go. There are some potentially very interesting
+analytical connections between these data sources. For example, it would be
+good to be able to analyze customer records in the clickstream data and tie
+them to the master customer data in MapR-DB.
+
+The analyst decides to explore these various data sources by using Apache
+Drill. Think about the flexibility and analytic capability of Apache Drill as
+you work through the tutorial.
+
+# What's Next
+
+Start running queries by going to [Lesson 1: Learn About the Data
+Set](/drill/docs/lession-1-learn-about-the-data-set).
+
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/tutorial/003-lesson1.md
----------------------------------------------------------------------
diff --git a/_docs/tutorial/003-lesson1.md b/_docs/tutorial/003-lesson1.md
new file mode 100644
index 0000000..119d67f
--- /dev/null
+++ b/_docs/tutorial/003-lesson1.md
@@ -0,0 +1,396 @@
+---
+title: "Lesson 1: Learn about the Data Set"
+parent: "Apache Drill Tutorial"
+---
+## Goal
+
+This lesson is simply about discovering what data is available, in what
+format, using simple SQL SELECT statements. Drill is capable of analyzing data
+without prior knowledge or definition of its schema. This means that you can
+start querying data immediately (and even as it changes), regardless of its
+format.
+
+The data set for the tutorial consists of:
+
+ * Transactional data: stored as a Hive table
+ * Product catalog and master customer data: stored as MapR-DB tables
+ * Clickstream and logs data: stored in the MapR file system as JSON files
+
+## Queries in This Lesson
+
+This lesson consists of select * queries on each data source.
+
+## Before You Begin
+
+### Start sqlline
+
+If sqlline is not already started, use a Terminal or Command window to log
+into the demo VM as root, then enter `sqlline`:
+
+    $ ssh root@10.250.0.6
+    Password:
+    Last login: Mon Sep 15 13:46:08 2014 from 10.250.0.28
+    Welcome to your Mapr Demo virtual machine.
+    [root@maprdemo ~]# sqlline
+    sqlline version 1.1.6
+    0: jdbc:drill:>
+
+You can run queries from this prompt to complete the tutorial. To exit from
+`sqlline`, type:
+
+    0: jdbc:drill:> !quit
+
+Although this tutorial demonstrates the queries using SQLLine, you can also
+execute them using the Drill Web UI.
+ +### List the available workspaces and databases: + + 0: jdbc:drill:> show databases; + +-------------+ + | SCHEMA_NAME | + +-------------+ + | hive.default | + | dfs.default | + | dfs.logs | + | dfs.root | + | dfs.views | + | dfs.clicks | + | dfs.data | + | dfs.tmp | + | sys | + | maprdb | + | cp.default | + | INFORMATION_SCHEMA | + +-------------+ + 12 rows selected + +Note that this command exposes all the metadata available from the storage +plugins configured with Drill as a set of schemas. This includes the Hive and +MapR-DB databases as well as the workspaces configured in the file system. As +you run queries in the tutorial, you will switch among these schemas by +submitting the USE command. This behavior resembles the ability to use +different database schemas (namespaces) in a relational database system. + +## Query Hive Tables + +The orders table is a six-column Hive table defined in the Hive metastore. +This is a Hive external table pointing to the data stored in flat files on the +MapR file system. The orders table contains 122,000 rows. + +### Set the schema to hive: + + 0: jdbc:drill:> use hive; + +------------+------------+ + | ok | summary | + +------------+------------+ + | true | Default schema changed to 'hive' | + +------------+------------+ + +You will run the USE command throughout this tutorial. The USE command sets +the schema for the current session. 
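+
+As an alternative to USE, you can fully qualify a table with its schema name
+directly in the FROM clause. For example, the following query (a sketch; the
+rows returned depend on the sandbox data) reads the Hive orders table without
+changing the session schema:
+
+    0: jdbc:drill:> select * from hive.orders limit 2;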
+ +### Describe the table: + +You can use the DESCRIBE command to show the columns and data types for a Hive +table: + + 0: jdbc:drill:> describe orders; + +-------------+------------+-------------+ + | COLUMN_NAME | DATA_TYPE | IS_NULLABLE | + +-------------+------------+-------------+ + | order_id | BIGINT | YES | + | month | VARCHAR | YES | + | cust_id | BIGINT | YES | + | state | VARCHAR | YES | + | prod_id | BIGINT | YES | + | order_total | INTEGER | YES | + +-------------+------------+-------------+ + +The DESCRIBE command returns complete schema information for Hive tables based +on the metadata available in the Hive metastore. + +### Select 5 rows from the orders table: + + 0: jdbc:drill:> select * from orders limit 5; + +------------+------------+------------+------------+------------+-------------+ + | order_id | month | cust_id | state | prod_id | order_total | + +------------+------------+------------+------------+------------+-------------+ + | 67212 | June | 10001 | ca | 909 | 13 | + | 70302 | June | 10004 | ga | 420 | 11 | + | 69090 | June | 10011 | fl | 44 | 76 | + | 68834 | June | 10012 | ar | 0 | 81 | + | 71220 | June | 10018 | az | 411 | 24 | + +------------+------------+------------+------------+------------+-------------+ + +Because orders is a Hive table, you can query the data in the same way that +you would query the columns in a relational database table. Note the use of +the standard LIMIT clause, which limits the result set to the specified number +of rows. You can use LIMIT with or without an ORDER BY clause. + +Drill provides seamless integration with Hive by allowing queries on Hive +tables defined in the metastore with no extra configuration. Note that Hive is +not a prerequisite for Drill, but simply serves as a storage plugin or data +source for Drill. Drill also lets users query all Hive file formats (including +custom serdes). Additionally, any UDFs defined in Hive can be leveraged as +part of Drill queries. 
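+
+The LIMIT clause mentioned above also combines with ORDER BY to return the top
+rows rather than an arbitrary sample. A sketch against the same orders table
+(the exact rows returned depend on the sandbox data):
+
+    0: jdbc:drill:> select * from orders order by order_total desc limit 5;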
+
+Because Drill has its own low-latency SQL query execution engine, you can
+query Hive tables with high performance and support for interactive and ad-hoc
+data exploration.
+
+## Query MapR-DB and HBase Tables
+
+The customers and products tables are MapR-DB tables. MapR-DB is an enterprise
+in-Hadoop NoSQL database. It exposes the HBase API to support application
+development. Every MapR-DB table has a row_key, in addition to one or more
+column families. Each column family contains one or more specific columns. The
+row_key value is a primary key that uniquely identifies each row.
+
+Drill allows direct queries on MapR-DB and HBase tables. Unlike other SQL on
+Hadoop options, Drill requires no overlay schema definitions in Hive to work
+with this data. Think about a MapR-DB or HBase table with thousands of
+columns, such as a time-series database, and the pain of having to manage
+duplicate schemas for it in Hive!
+
+### Products Table
+
+The products table has two column families.
+
+Column Family | Columns
+---|---
+details | name, category
+pricing | price
+
+The products table contains 965 rows.
+
+### Customers Table
+
+The customers table has three column families.
+
+Column Family | Columns
+---|---
+address | state
+loyalty | agg_rev, membership
+personal | age, gender
+
+The customers table contains 993 rows.
+
+### Set the workspace to maprdb:
+
+    0: jdbc:drill:> use maprdb;
+    +------------+------------+
+    | ok | summary |
+    +------------+------------+
+    | true | Default schema changed to 'maprdb' |
+    +------------+------------+
+
+### Describe the tables:
+
+    0: jdbc:drill:> describe customers;
+    +-------------+------------+-------------+
+    | COLUMN_NAME | DATA_TYPE | IS_NULLABLE |
+    +-------------+------------+-------------+
+    | row_key | ANY | NO |
+    | address | (VARCHAR(1), ANY) MAP | NO |
+    | loyalty | (VARCHAR(1), ANY) MAP | NO |
+    | personal | (VARCHAR(1), ANY) MAP | NO |
+    +-------------+------------+-------------+
+
+    0: jdbc:drill:> describe products;
+    +-------------+------------+-------------+
+    | COLUMN_NAME | DATA_TYPE | IS_NULLABLE |
+    +-------------+------------+-------------+
+    | row_key | ANY | NO |
+    | details | (VARCHAR(1), ANY) MAP | NO |
+    | pricing | (VARCHAR(1), ANY) MAP | NO |
+    +-------------+------------+-------------+
+
+Unlike the Hive example, the DESCRIBE command does not return the full schema
+up to the column level.
Wide-column NoSQL databases such as MapR-DB and HBase
+can be schema-less by design; every row has its own set of column name-value
+pairs in a given column family, and the column value can be of any data type,
+as determined by the application inserting the data.
+
+A "MAP" complex type in Drill represents this variable column name-value
+structure, and "ANY" represents the fact that the column value can be of any
+data type. Observe the row_key, which is also simply bytes and has the type
+ANY.
+
+### Select 5 rows from the products table:
+
+    0: jdbc:drill:> select * from products limit 5;
+    +------------+------------+------------+
+    | row_key | details | pricing |
+    +------------+------------+------------+
+    | [B@a1a3e25 | {"category":"bGFwdG9w","name":"IlNvbnkgbm90ZWJvb2si"} | {"price":"OTU5"} |
+    | [B@103a43af | {"category":"RW52ZWxvcGVz","name":"IzEwLTQgMS84IHggOSAxLzIgUHJlbWl1bSBEaWFnb25hbCBTZWFtIEVudmVsb3Blcw=="} | {"price":"MT |
+    | [B@61319e7b | {"category":"U3RvcmFnZSAmIE9yZ2FuaXphdGlvbg==","name":"MjQgQ2FwYWNpdHkgTWF4aSBEYXRhIEJpbmRlciBSYWNrc1BlYXJs"} | {"price" |
+    | [B@9bcf17 | {"category":"TGFiZWxz","name":"QXZlcnkgNDk4"} | {"price":"Mw=="} |
+    | [B@7538ef50 | {"category":"TGFiZWxz","name":"QXZlcnkgNDk="} | {"price":"Mw=="} |
+
+Given that Drill requires no up-front schema definitions indicating data
+types, the query returns the raw byte arrays for column values, just as they
+are stored in MapR-DB (or HBase). Observe that the column families (details
+and pricing) have the map data type and appear as JSON strings.
+
+In Lesson 2, you will use CAST functions to return typed data for each column.
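+
+Even though the column values are raw bytes at this stage, aggregate queries
+work as expected. As a quick check, you can confirm the row count stated for
+the products table earlier (a sketch; assumes the maprdb workspace is still
+set):
+
+    0: jdbc:drill:> select count(*) from products;
+
+The result should be 965, matching the table description above.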
+
+### Select 5 rows from the customers table:
+
+    0: jdbc:drill:> select * from customers limit 5;
+    +------------+------------+------------+------------+
+    | row_key | address | loyalty | personal |
+    +------------+------------+------------+------------+
+    | [B@284bae62 | {"state":"Imt5Ig=="} | {"agg_rev":"IjEwMDEtMzAwMCI=","membership":"ImJhc2ljIg=="} | {"age":"IjI2LTM1Ig==","gender":"Ik1B |
+    | [B@7ffa4523 | {"state":"ImNhIg=="} | {"agg_rev":"IjAtMTAwIg==","membership":"ImdvbGQi"} | {"age":"IjI2LTM1Ig==","gender":"IkZFTUFMRSI= |
+    | [B@7d13e79 | {"state":"Im9rIg=="} | {"agg_rev":"IjUwMS0xMDAwIg==","membership":"InNpbHZlciI="} | {"age":"IjI2LTM1Ig==","gender":"IkZFT |
+    | [B@3a5c7df1 | {"state":"ImtzIg=="} | {"agg_rev":"IjMwMDEtMTAwMDAwIg==","membership":"ImdvbGQi"} | {"age":"IjUxLTEwMCI=","gender":"IkZF |
+    | [B@e507726 | {"state":"Im5qIg=="} | {"agg_rev":"IjAtMTAwIg==","membership":"ImJhc2ljIg=="} | {"age":"IjIxLTI1Ig==","gender":"Ik1BTEUi" |
+    +------------+------------+------------+------------+
+
+Again the table returns byte data that needs to be cast to readable data
+types.
+
+## Query the File System
+
+Along with querying a data source with full schemas (such as Hive) and partial
+schemas (such as MapR-DB and HBase), Drill offers the unique capability to
+perform SQL queries directly on the file system. The file system could be a
+local file system, or a distributed file system such as MapR-FS, HDFS, or S3.
+
+In the context of Drill, a file or a directory is synonymous with a relational
+database "table." Therefore, you can perform SQL operations directly on files
+and directories without the need for up-front schema definitions or schema
+management for any model changes. The schema is discovered on the fly based on
+the query. Drill supports queries on a variety of file formats including text,
+CSV, Parquet, and JSON in the 0.5 release.
+
+In this example, the clickstream data coming from the mobile/web applications
+is in JSON format.
The JSON files have the following structure:
+
+    {"trans_id":31920,"date":"2014-04-26","time":"12:17:12","user_info":{"cust_id":22526,"device":"IOS5","state":"il"},"trans_info":{"prod_id":[174,2],"purch_flag":"false"}}
+    {"trans_id":31026,"date":"2014-04-20","time":"13:50:29","user_info":{"cust_id":16368,"device":"AOS4.2","state":"nc"},"trans_info":{"prod_id":[],"purch_flag":"false"}}
+    {"trans_id":33848,"date":"2014-04-10","time":"04:44:42","user_info":{"cust_id":21449,"device":"IOS6","state":"oh"},"trans_info":{"prod_id":[582],"purch_flag":"false"}}
+
+The clicks.json and clicks.campaign.json files contain metadata as part of the
+data itself (referred to as "self-describing" data). Also note that the data
+elements are complex, or nested. The initial queries below do not show how to
+unpack the nested data, but they show that easy access to the data requires no
+setup beyond the definition of a workspace.
+
+### Query nested clickstream data
+
+#### Set the workspace to dfs.clicks:
+
+    0: jdbc:drill:> use dfs.clicks;
+    +------------+------------+
+    | ok | summary |
+    +------------+------------+
+    | true | Default schema changed to 'dfs.clicks' |
+    +------------+------------+
+
+In this case, setting the workspace is a mechanism for making queries easier
+to write. When you specify a file system workspace, you can shorten references
+to files in the FROM clause of your queries. Instead of having to provide the
+complete path to a file, you can provide the path relative to a directory
+location specified in the workspace. For example:
+
+    "location": "/mapr/demo.mapr.com/data/nested"
+
+Any file or directory that you want to query in this path can be referenced
+relative to this path. The clicks directory referred to in the following query
+is directly below the nested directory.
+
+#### Select 2 rows from the clicks.json file:
+
+    0: jdbc:drill:> select * from `clicks/clicks.json` limit 2;
+    +------------+------------+------------+------------+------------+
+    | trans_id | date | time | user_info | trans_info |
+    +------------+------------+------------+------------+------------+
+    | 31920 | 2014-04-26 | 12:17:12 | {"cust_id":22526,"device":"IOS5","state":"il"} | {"prod_id":[174,2],"purch_flag":"false"} |
+    | 31026 | 2014-04-20 | 13:50:29 | {"cust_id":16368,"device":"AOS4.2","state":"nc"} | {"prod_id":[],"purch_flag":"false"} |
+    +------------+------------+------------+------------+------------+
+    2 rows selected
+
+Note that the FROM clause reference points to a specific file. Drill expands
+the traditional concept of a "table reference" in a standard SQL FROM clause
+to refer to a file in a local or distributed file system.
+
+The only special requirement is the use of backticks to enclose the file
+path. This is necessary whenever the file path contains Drill reserved words
+or characters.
+
+#### Select 2 rows from the campaign.json file:
+
+    0: jdbc:drill:> select * from `clicks/clicks.campaign.json` limit 2;
+    +------------+------------+------------+------------+------------+------------+
+    | trans_id | date | time | user_info | ad_info | trans_info |
+    +------------+------------+------------+------------+------------+------------+
+    | 35232 | 2014-05-10 | 00:13:03 | {"cust_id":18520,"device":"AOS4.3","state":"tx"} | {"camp_id":"null"} | {"prod_id":[7,7],"purch_flag":"true"} |
+    | 31995 | 2014-05-22 | 16:06:38 | {"cust_id":17182,"device":"IOS6","state":"fl"} | {"camp_id":"null"} | {"prod_id":[],"purch_flag":"false"} |
+    +------------+------------+------------+------------+------------+------------+
+    2 rows selected
+
+Notice that with a select * query, any complex data types such as maps and
+arrays return as JSON strings. You will see how to unpack this data using
+various SQL functions and operators in the next lesson.
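+
+A workspace-relative path is only a convenience. The same file can also be
+queried through the dfs schema by its complete path, which is useful when no
+workspace points at the data. A sketch using the sandbox path shown earlier:
+
+    0: jdbc:drill:> select * from dfs.`/mapr/demo.mapr.com/data/nested/clicks/clicks.json` limit 2;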
+
+## Query Logs Data
+
+Unlike the previous example where we performed queries against clicks data in
+one file, logs data is stored as partitioned directories on the file system.
+The logs directory has three subdirectories:
+
+ * 2012
+ * 2013
+ * 2014
+
+Each of these year directories fans out to a set of numbered month
+directories, and each month directory contains a JSON file with log records
+for that month. The total number of records in all log files is 48000.
+
+The files in the logs directory and its subdirectories are JSON files. There
+are many of these files, but you can use Drill to query them all as a single
+data source, or to query a subset of the files.
+
+#### Set the workspace to dfs.logs:
+
+    0: jdbc:drill:> use dfs.logs;
+    +------------+------------+
+    | ok | summary |
+    +------------+------------+
+    | true | Default schema changed to 'dfs.logs' |
+    +------------+------------+
+
+#### Select 2 rows from the logs directory:
+
+    0: jdbc:drill:> select * from logs limit 2;
+    +------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+----------+
+    | dir0 | dir1 | trans_id | date | time | cust_id | device | state | camp_id | keywords | prod_id | purch_fl |
+    +------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+----------+
+    | 2014 | 8 | 24181 | 08/02/2014 | 09:23:52 | 0 | IOS5 | il | 2 | wait | 128 | false |
+    | 2014 | 8 | 24195 | 08/02/2014 | 07:58:19 | 243 | IOS5 | mo | 6 | hmm | 107 | false |
+    +------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+----------+
+
+Note that this is flat JSON data. The dfs.logs workspace location property
+points to a directory that contains the logs directory, making the FROM clause
+reference for this query very simple.
You do not have to refer to the complete
+directory path on the file system.
+
+The column names dir0 and dir1 are special Drill variables that identify
+subdirectories below the logs directory. In Lesson 3, you will do more complex
+queries that leverage these dynamic variables.
+
+#### Find the total number of rows in the logs directory (all files):
+
+    0: jdbc:drill:> select count(*) from logs;
+    +------------+
+    | EXPR$0 |
+    +------------+
+    | 48000 |
+    +------------+
+
+This query traverses all of the files in the logs directory and its
+subdirectories to return the total number of rows in those files.
+
+# What's Next
+
+Go to [Lesson 2: Run Queries with ANSI
+SQL](/drill/docs/lession-2-run-queries-with-ansi-sql).
+
+
+
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/tutorial/004-lesson2.md
----------------------------------------------------------------------
diff --git a/_docs/tutorial/004-lesson2.md b/_docs/tutorial/004-lesson2.md
new file mode 100644
index 0000000..73c4329
--- /dev/null
+++ b/_docs/tutorial/004-lesson2.md
@@ -0,0 +1,388 @@
+---
+title: "Lesson 2: Run Queries with ANSI SQL"
+parent: "Apache Drill Tutorial"
+---
+## Goal
+
+This lesson shows how to do some standard SQL analysis in Apache Drill: for
+example, summarizing data by using simple aggregate functions and connecting
+data sources by using joins. Note that Apache Drill provides ANSI SQL support,
+not a "SQL-like" interface.
+
+## Queries in This Lesson
+
+Now that you know what the data sources look like in their raw form, using
+select * queries, try running some simple but more useful queries on each data
+source. These queries demonstrate how Drill supports ANSI SQL constructs and
+also how you can combine data from different data sources in a single SELECT
+statement.
+
+ * Show an aggregate query on a single file or table. Use GROUP BY, WHERE, HAVING, and ORDER BY clauses.
+ * Perform joins between Hive, MapR-DB, and file system data sources.
+ * Use table and column aliases.
+ * Create a Drill view.
+
+## Aggregation
+
+### Set the schema to hive:
+
+    0: jdbc:drill:> use hive;
+    +------------+------------+
+    | ok | summary |
+    +------------+------------+
+    | true | Default schema changed to 'hive' |
+    +------------+------------+
+    1 row selected
+
+### Return sales totals by month:
+
+    0: jdbc:drill:> select `month`, sum(order_total)
+    from orders group by `month` order by 2 desc;
+    +------------+------------+
+    | month | EXPR$1 |
+    +------------+------------+
+    | June | 950481 |
+    | May | 947796 |
+    | March | 836809 |
+    | April | 807291 |
+    | July | 757395 |
+    | October | 676236 |
+    | August | 572269 |
+    | February | 532901 |
+    | September | 373100 |
+    | January | 346536 |
+    +------------+------------+
+
+Drill supports SQL aggregate functions such as SUM, MAX, AVG, and MIN.
+Standard SQL clauses work in the same way in Drill queries as in relational
+database queries.
+
+Note that backticks are required for the `month` column only because "month"
+is a reserved word in SQL.
+
+### Return the top 20 sales totals by month and state:
+
+    0: jdbc:drill:> select `month`, state, sum(order_total) as sales from orders group by `month`, state
+    order by 3 desc limit 20;
+    +------------+------------+------------+
+    | month | state | sales |
+    +------------+------------+------------+
+    | May | ca | 119586 |
+    | June | ca | 116322 |
+    | April | ca | 101363 |
+    | March | ca | 99540 |
+    | July | ca | 90285 |
+    | October | ca | 80090 |
+    | June | tx | 78363 |
+    | May | tx | 77247 |
+    | March | tx | 73815 |
+    | August | ca | 71255 |
+    | April | tx | 68385 |
+    | July | tx | 63858 |
+    | February | ca | 63527 |
+    | June | fl | 62199 |
+    | June | ny | 62052 |
+    | May | fl | 61651 |
+    | May | ny | 59369 |
+    | October | tx | 55076 |
+    | March | fl | 54867 |
+    | March | ny | 52101 |
+    +------------+------------+------------+
+    20 rows selected
+
+Note the alias for the result of the SUM function.
Drill supports column +aliases and table aliases. + +## HAVING Clause + +This query uses the HAVING clause to constrain an aggregate result. + +### Set the workspace to dfs.clicks + + 0: jdbc:drill:> use dfs.clicks; + +------------+------------+ + | ok | summary | + +------------+------------+ + | true | Default schema changed to 'dfs.clicks' | + +------------+------------+ + 1 row selected + +### Return total number of clicks for devices that indicate high click-throughs: + + 0: jdbc:drill:> select t.user_info.device, count(*) from `clicks/clicks.json` t + group by t.user_info.device + having count(*) > 1000; + +------------+------------+ + | EXPR$0 | EXPR$1 | + +------------+------------+ + | IOS5 | 11814 | + | AOS4.2 | 5986 | + | IOS6 | 4464 | + | IOS7 | 3135 | + | AOS4.4 | 1562 | + | AOS4.3 | 3039 | + +------------+------------+ + +The aggregate is a count of the records for each different mobile device in +the clickstream data. Only the activity for the devices that registered more +than 1000 transactions qualify for the result set. + +## UNION Operator + +Use the same workspace as before (dfs.clicks). + +### Combine clicks activity from before and after the marketing campaign + + 0: jdbc:drill:> select t.trans_id transaction, t.user_info.cust_id customer from `clicks/clicks.campaign.json` t + union all + select u.trans_id, u.user_info.cust_id from `clicks/clicks.json` u limit 5; + +-------------+------------+ + | transaction | customer | + +-------------+------------+ + | 35232 | 18520 | + | 31995 | 17182 | + | 35760 | 18228 | + | 37090 | 17015 | + | 37838 | 18737 | + +-------------+------------+ + +This UNION ALL query returns rows that exist in two files (and includes any +duplicate rows from those files): `clicks.campaign.json` and `clicks.json`. 
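+
+Because the distinct UNION operator is not yet supported, one way to drop
+duplicate rows is to wrap the UNION ALL in an aliased subquery and apply
+SELECT DISTINCT in the outer query. A sketch, not verified against the
+sandbox:
+
+    0: jdbc:drill:> select distinct tid, cid from
+    (select t.trans_id tid, t.user_info.cust_id cid from `clicks/clicks.campaign.json` t
+    union all
+    select u.trans_id, u.user_info.cust_id from `clicks/clicks.json` u) combined;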
+ +## Subqueries + +### Set the workspace to hive: + + 0: jdbc:drill:> use hive; + +------------+------------+ + | ok | summary | + +------------+------------+ + | true | Default schema changed to 'hive' | + +------------+------------+ + +### Compare order totals across states: + + 0: jdbc:drill:> select o1.cust_id, sum(o1.order_total) as ny_sales, + (select sum(o2.order_total) from hive.orders o2 + where o1.cust_id=o2.cust_id and state='ca') as ca_sales + from hive.orders o1 where o1.state='ny' group by o1.cust_id + order by cust_id limit 20; + +------------+------------+------------+ + | cust_id | ny_sales | ca_sales | + +------------+------------+------------+ + | 1001 | 72 | 47 | + | 1002 | 108 | 198 | + | 1003 | 83 | null | + | 1004 | 86 | 210 | + | 1005 | 168 | 153 | + | 1006 | 29 | 326 | + | 1008 | 105 | 168 | + | 1009 | 443 | 127 | + | 1010 | 75 | 18 | + | 1012 | 110 | null | + | 1013 | 19 | null | + | 1014 | 106 | 162 | + | 1015 | 220 | 153 | + | 1016 | 85 | 159 | + | 1017 | 82 | 56 | + | 1019 | 37 | 196 | + | 1020 | 193 | 165 | + | 1022 | 124 | null | + | 1023 | 166 | 149 | + | 1024 | 233 | null | + +------------+------------+------------+ + +This example demonstrates Drill support for correlated subqueries. This query +uses a subquery in the select list and correlates the result of the subquery +with the outer query, using the cust_id column reference. The subquery returns +the sum of order totals for California, and the outer query returns the +equivalent sum, for the same cust_id, for New York. + +The result set is sorted by the cust_id and presents the sales totals side by +side for easy comparison. Null values indicate customer IDs that did not +register any sales in that state. 
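+
+Drill also supports the IN subquery operator, which offers a simpler way to
+ask a related question: which New York customers also placed orders in
+California? A sketch, assuming the hive schema set in the previous step:
+
+    0: jdbc:drill:> select distinct cust_id from orders
+    where state='ny' and cust_id in
+    (select cust_id from orders where state='ca');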
+
+## CAST Function
+
+### Use the maprdb workspace:
+
+    0: jdbc:drill:> use maprdb;
+    +------------+------------+
+    | ok | summary |
+    +------------+------------+
+    | true | Default schema changed to 'maprdb' |
+    +------------+------------+
+    1 row selected
+
+### Return customer data with appropriate data types:
+
+    0: jdbc:drill:> select cast(row_key as int) as cust_id, cast(t.personal.name as varchar(20)) as name,
+    cast(t.personal.gender as varchar(10)) as gender, cast(t.personal.age as varchar(10)) as age,
+    cast(t.address.state as varchar(4)) as state, cast(t.loyalty.agg_rev as dec(7,2)) as agg_rev,
+    cast(t.loyalty.membership as varchar(20)) as membership
+    from customers t limit 5;
+    +------------+------------+------------+------------+------------+------------+------------+
+    | cust_id | name | gender | age | state | agg_rev | membership |
+    +------------+------------+------------+------------+------------+------------+------------+
+    | 10001 | "Corrine Mecham" | "FEMALE" | "15-20" | "va" | 197.00 | "silver" |
+    | 10005 | "Brittany Park" | "MALE" | "26-35" | "in" | 230.00 | "silver" |
+    | 10006 | "Rose Lokey" | "MALE" | "26-35" | "ca" | 250.00 | "silver" |
+    | 10007 | "James Fowler" | "FEMALE" | "51-100" | "me" | 263.00 | "silver" |
+    | 10010 | "Guillermo Koehler" | "OTHER" | "51-100" | "mn" | 202.00 | "silver" |
+    +------------+------------+------------+------------+------------+------------+------------+
+    5 rows selected
+
+Note the following features of this query:
+
+ * The CAST function is required for every column in the table. This function returns the MapR-DB/HBase binary data as readable integers and strings. Alternatively, you can use CONVERT_TO/CONVERT_FROM functions to decode the columns. CONVERT_TO and CONVERT_FROM are more efficient than CAST in most cases.
+ * The row_key column functions as the primary key of the table (a customer ID in this case).
+ * The table alias t is required; otherwise the column family names would be parsed as table names and the query would return an error.
+
+### Remove the quotes from the strings:
+
+You can use the regexp_replace function to remove the quotes around the
+strings in the query results. For example, to return a state name va instead
+of "va":
+
+    0: jdbc:drill:> select cast(row_key as int), regexp_replace(cast(t.address.state as varchar(10)),'"','')
+    from customers t limit 1;
+    +------------+------------+
+    | EXPR$0 | EXPR$1 |
+    +------------+------------+
+    | 10001 | va |
+    +------------+------------+
+    1 row selected
+
+## CREATE VIEW Command
+
+    0: jdbc:drill:> use dfs.views;
+    +------------+------------+
+    | ok | summary |
+    +------------+------------+
+    | true | Default schema changed to 'dfs.views' |
+    +------------+------------+
+
+### Use a mutable workspace:
+
+A mutable (or writable) workspace is a workspace that is enabled for "write"
+operations. This attribute is part of the storage plugin configuration. You
+can create Drill views and tables in mutable workspaces.
+
+### Create a view on a MapR-DB table:
+
+    0: jdbc:drill:> create or replace view custview as select cast(row_key as int) as cust_id,
+    cast(t.personal.name as varchar(20)) as name,
+    cast(t.personal.gender as varchar(10)) as gender,
+    cast(t.personal.age as varchar(10)) as age,
+    cast(t.address.state as varchar(4)) as state,
+    cast(t.loyalty.agg_rev as dec(7,2)) as agg_rev,
+    cast(t.loyalty.membership as varchar(20)) as membership
+    from maprdb.customers t;
+    +------------+------------+
+    | ok | summary |
+    +------------+------------+
+    | true | View 'custview' replaced successfully in 'dfs.views' schema |
+    +------------+------------+
+    1 row selected
+
+Drill provides CREATE OR REPLACE VIEW syntax similar to relational databases
+to create views. Use the OR REPLACE option to make it easier to update the
+view later without having to remove it first.
Note that the FROM clause in this example must refer to maprdb.customers;
the MapR-DB tables are not directly visible to the dfs.views workspace.

Unlike views in a traditional database, which are typically DBA- or
developer-driven operations, file system-based views in Drill are very
lightweight. A view is simply a special file with a specific extension
(.drill). You can store views in your local file system or point to a
specific workspace, and you can specify any query against any Drill data
source in the body of the CREATE VIEW statement.

Drill provides a decentralized metadata model: Drill can query metadata
defined in data sources such as Hive, HBase, and the file system, and it
also supports the creation of metadata in the file system.

### Query data from the view:

    0: jdbc:drill:> select * from custview limit 1;
    +------------+------------+------------+------------+------------+------------+------------+
    |  cust_id   |    name    |   gender   |    age     |   state    |  agg_rev   | membership |
    +------------+------------+------------+------------+------------+------------+------------+
    | 10001      | "Corrine Mecham" | "FEMALE" | "15-20" | "va" | 197.00 | "silver" |
    +------------+------------+------------+------------+------------+------------+------------+

Once users have explored the data directly from the file system, views
provide a way to feed the data into downstream tools such as Tableau and
MicroStrategy for analysis and visualization. To these tools, a view appears
simply as a "table" with selectable "columns" in it.

## Query Across Data Sources

Continue using `dfs.views` for this query.
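If your session has switched to another workspace in the meantime, you can either switch back with USE (as with maprdb earlier) or reference the view by a fully qualified name. A sketch, assuming custview exists in dfs.views as created above:

```sql
-- Option 1: switch back to the views workspace so custview
-- resolves without a prefix.
use dfs.views;

-- Option 2: reference the view with its fully qualified name,
-- which works from any workspace.
select * from dfs.views.custview limit 1;
```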
### Join the customers view and the orders table:

    0: jdbc:drill:> select membership, sum(order_total) as sales from hive.orders, custview
    where orders.cust_id=custview.cust_id
    group by membership order by 2;
    +------------+------------+
    | membership |   sales    |
    +------------+------------+
    | "basic"    | 380665     |
    | "silver"   | 708438     |
    | "gold"     | 2787682    |
    +------------+------------+
    3 rows selected

In this query, we read data from a MapR-DB table (represented by custview)
and combine it with the order information in Hive. When running
cross-data-source queries such as this one, you must use fully qualified
table and view names. For example, the orders table is prefixed by "hive,"
which is the storage plugin name registered with Drill. No prefix is needed
for "custview" because we explicitly switched to the dfs.views workspace
where custview is stored.

Note: If the results of any of your queries appear to be truncated because
the rows are wide, set the maximum width of the display to 10000:

    0: jdbc:drill:> !set maxwidth 10000

Do not use a semicolon for this SET command.

### Join the customers, orders, and clickstream data:

    0: jdbc:drill:> select custview.membership, sum(orders.order_total) as sales from hive.orders, custview,
    dfs.`/mapr/demo.mapr.com/data/nested/clicks/clicks.json` c
    where orders.cust_id=custview.cust_id and orders.cust_id=c.user_info.cust_id
    group by custview.membership order by 2;
    +------------+------------+
    | membership |   sales    |
    +------------+------------+
    | "basic"    | 372866     |
    | "silver"   | 728424     |
    | "gold"     | 7050198    |
    +------------+------------+
    3 rows selected

This three-way join selects from three different data sources in one query:

 * the hive.orders table
 * custview (a view of the HBase customers table)
 * the clicks.json file

The join column for both sets of join conditions is the cust_id column. The
views workspace is used for this query so that custview can be accessed. The
The +hive.orders table is also visible to the query. + +However, note that the JSON file is not directly visible from the views +workspace, so the query specifies the full path to the file: + + dfs.`/mapr/demo.mapr.com/data/nested/clicks/clicks.json` + + +## What's Next + +Go to [Lesson 3: Run Queries on Complex Data Types](/drill/docs/lession-3-run-queries-on-complex-data-types). + + +