This is an automated email from the ASF dual-hosted git repository.

jackylk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/carbondata.git
The following commit(s) were added to refs/heads/master by this push:
     new c15d55c  [DOC] CarbonExtensions doc
c15d55c is described below

commit c15d55c0aa0630de1a9a4d399a21d61e1a05647f
Author: QiangCai <qiang...@qq.com>
AuthorDate: Mon Jan 20 16:25:05 2020 +0800

    [DOC] CarbonExtensions doc

    Why is this PR needed?
    Explain how to use CarbonExtensions in Spark.

    What changes were proposed in this PR?
    The document is updated to introduce CarbonExtensions.

    Does this PR introduce any user interface change?
    No

    Is any new testcase added?
    No

    This closes #3585
---
 docs/quick-start-guide.md | 85 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

diff --git a/docs/quick-start-guide.md b/docs/quick-start-guide.md
index dedba36..f9f467c 100644
--- a/docs/quick-start-guide.md
+++ b/docs/quick-start-guide.md
@@ -39,6 +39,7 @@ This tutorial provides a quick introduction to using CarbonData. To follow along
 CarbonData can be integrated with Spark,Presto and Hive execution engines. The below documentation guides on Installing and Configuring with these execution engines.
 
 #### Spark
+[Installing and Configuring CarbonData to run locally with Spark SQL CLI (version: 2.3+)](#installing-and-configuring-carbondata-to-run-locally-with-spark-sql)
 
 [Installing and Configuring CarbonData to run locally with Spark Shell](#installing-and-configuring-carbondata-to-run-locally-with-spark-shell)
 
@@ -65,12 +66,64 @@ CarbonData can be integrated with Spark,Presto and Hive execution engines. The b
 #### Alluxio
 [CarbonData supports read and write with Alluxio](./alluxio-guide.md)
 
+## Installing and Configuring CarbonData to run locally with Spark SQL CLI (version: 2.3+)
+
+The Spark SQL CLI uses CarbonExtensions to customize the SparkSession with CarbonData's parser, analyzer, optimizer and physical planning strategy rules.
+To enable CarbonExtensions, add the following configuration.
+
+|Key|Value|
+|---|---|
+|spark.sql.extensions|org.apache.spark.sql.CarbonExtensions|
+
+Start Spark SQL CLI by running the following command in the Spark directory:
+
+```
+./bin/spark-sql --conf spark.sql.extensions=org.apache.spark.sql.CarbonExtensions --jars <carbondata assembly jar path>
+```
+###### Creating a Table
+
+```
+CREATE TABLE IF NOT EXISTS test_table (
+  id string,
+  name string,
+  city string,
+  age int)
+STORED AS carbondata;
+```
+**NOTE**: CarbonExtensions only supports "STORED AS carbondata" and "USING carbondata".
+
+###### Loading Data to a Table
+
+```
+LOAD DATA INPATH '/path/to/sample.csv' INTO TABLE test_table;
+```
+
+```
+INSERT INTO TABLE test_table SELECT '1', 'name1', 'city1', 1;
+```
+
+**NOTE**: Please provide the real file path of `sample.csv` in the above script.
+If you hit a "tablestatus.lock" issue, please refer to the [FAQ](faq.md).
+
+###### Query Data from a Table
+
+```
+SELECT * FROM test_table;
+```
+
+```
+SELECT city, avg(age), sum(age)
+FROM test_table
+GROUP BY city;
+```
+
 ## Installing and Configuring CarbonData to run locally with Spark Shell
 
 Apache Spark Shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. Please visit [Apache Spark Documentation](http://spark.apache.org/docs/latest/) for more details on Spark shell.
 
 #### Basics
+###### Option 1: Using CarbonSession
 
 Start Spark shell by running the following command in the Spark directory:
 
 ```
@@ -99,6 +152,27 @@ val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession(
 `SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("<carbon_store_path>", "<local metastore path>")`.
 - Data storage location can be specified by `<carbon_store_path>`, like `/carbon/data/store`, `hdfs://localhost:9000/carbon/data/store` or `s3a://carbon/data/store`.
+###### Option 2: Using SparkSession with CarbonExtensions
+
+Start Spark shell by running the following command in the Spark directory:
+
+```
+./bin/spark-shell --conf spark.sql.extensions=org.apache.spark.sql.CarbonExtensions --jars <carbondata assembly jar path>
+```
+**NOTE**
+ - In this flow, we can use the built-in SparkSession `spark` instead of `carbon`.
+   We can also create a new SparkSession instead of the built-in `spark` if needed.
+   This requires adding "org.apache.spark.sql.CarbonExtensions" to the Spark configuration "spark.sql.extensions".
+   ```
+   val newSpark = SparkSession
+     .builder()
+     .config(sc.getConf)
+     .enableHiveSupport()
+     .config("spark.sql.extensions", "org.apache.spark.sql.CarbonExtensions")
+     .getOrCreate()
+   ```
+ - Data storage location can be specified by "spark.sql.warehouse.dir".
+
 #### Executing Queries
 
 ###### Creating a Table
 
@@ -114,6 +188,17 @@ carbon.sql(
            | STORED AS carbondata
            """.stripMargin)
 ```
+**NOTE**:
+The following table lists all supported syntax:
+
+|CREATE TABLE syntax|SparkSession with CarbonExtensions|CarbonSession|
+|---|---|---|
+|STORED AS carbondata|yes|yes|
+|USING carbondata|yes|yes|
+|STORED BY 'carbondata'|no|yes|
+|STORED BY 'org.apache.carbondata.format'|no|yes|
+
+We suggest using CarbonExtensions instead of CarbonSession.
 
 ###### Loading Data to a Table
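As a closing illustration (editor's sketch, not part of the commit above), the "Option 2" material added by this diff can be condensed into one self-contained Scala snippet. The app name, `local[*]` master, and warehouse directory are placeholder choices, and actually running it requires the CarbonData assembly jar on the classpath:

```scala
// Sketch of the "Option 2" flow: a plain SparkSession with CarbonExtensions enabled.
// Paths and names below are placeholders, not CarbonData defaults.
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("carbon-quick-start")            // placeholder app name
  .master("local[*]")                        // run locally, as in the quick start
  .config("spark.sql.extensions", "org.apache.spark.sql.CarbonExtensions")
  .config("spark.sql.warehouse.dir", "/tmp/carbon-warehouse") // data storage location
  .enableHiveSupport()
  .getOrCreate()

// CarbonExtensions accepts only "STORED AS carbondata" / "USING carbondata".
spark.sql(
  """CREATE TABLE IF NOT EXISTS test_table (
    |  id string, name string, city string, age int)
    |STORED AS carbondata""".stripMargin)

spark.sql("INSERT INTO TABLE test_table SELECT '1', 'name1', 'city1', 1")
spark.sql("SELECT city, avg(age), sum(age) FROM test_table GROUP BY city").show()
```

The CREATE TABLE statement sticks to "STORED AS carbondata" because, as the compatibility table in the diff notes, the STORED BY forms are only accepted by CarbonSession.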