[kylin] 01/01: KYLIN-3383 add document for Spark JDBC

2018-05-17 Thread shaofengshi

This is an automated email from the ASF dual-hosted git repository.

shaofengshi pushed a commit to branch document
in repository https://gitbox.apache.org/repos/asf/kylin.git

commit c4b31e3f418d700fa5bce9de1b37c9a9297617e3
Author: shaofengshi 
AuthorDate: Thu May 17 18:55:15 2018 +0800

KYLIN-3383 add document for Spark JDBC
---
 website/_data/docs23.yml  |  1 +
 website/_docs23/index.md  |  5 ++-
 website/_docs23/tutorial/spark.md | 90 +++
 3 files changed, 94 insertions(+), 2 deletions(-)

diff --git a/website/_data/docs23.yml b/website/_data/docs23.yml
index 66b3121..97a0a6e 100644
--- a/website/_data/docs23.yml
+++ b/website/_data/docs23.yml
@@ -56,6 +56,7 @@
   - tutorial/microstrategy
   - tutorial/squirrel
   - tutorial/flink
+  - tutorial/spark
   - tutorial/hue
   - tutorial/Qlik
 
diff --git a/website/_docs23/index.md b/website/_docs23/index.md
index 0ad0af7..6d99ee1 100644
--- a/website/_docs23/index.md
+++ b/website/_docs23/index.md
@@ -56,8 +56,9 @@ Connectivity and APIs
 8. [Connect from MicroStrategy](tutorial/microstrategy.html)
 9. [Connect from SQuirreL](tutorial/squirrel.html)
 10. [Connect from Apache Flink](tutorial/flink.html)
-11. [Connect from Hue](tutorial/hue.html)
-12. [Connect from Qlik Sense](tutorial/Qlik.html)
+11. [Connect from Apache Spark](tutorial/spark.html)
+12. [Connect from Hue](tutorial/hue.html)
+13. [Connect from Qlik Sense](tutorial/Qlik.html)
 
 
 Operations
diff --git a/website/_docs23/tutorial/spark.md b/website/_docs23/tutorial/spark.md
new file mode 100644
index 000..53ff765
--- /dev/null
+++ b/website/_docs23/tutorial/spark.md
@@ -0,0 +1,90 @@
+---
+layout: docs23
+title:  Apache Spark
+categories: tutorial
+permalink: /docs23/tutorial/spark.html
+---
+
+
+### Introduction
+
+Apache Kylin provides a JDBC driver to query Cube data, and Apache Spark supports JDBC data sources. Together, they let you connect to Kylin from your Spark application and analyze a very large data set interactively.
+
+Keep in mind that Kylin is an OLAP system that has already aggregated the raw data by the given dimensions. If you simply load the source table as you would from a normal database, you may not gain the benefit of Cubes, and it may crash your application.
+
+The right way is to start from a summarized view (e.g., a query with "group by"), load it as a data frame, and then apply transformations and other actions.
+
+This document describes how to use Kylin as a data source in Apache Spark. You need to install Kylin and build a Cube before running the examples. Also remember to put Kylin's JDBC driver (in the 'lib' folder of the Kylin binary package) on Spark's classpath.
+
+### The wrong way
+
+The Python application below tries to load Kylin's table directly as a data frame and then get the total row count with "df.count()", but the result is incorrect.
+
+{% highlight Groff markup %}
+from pyspark import SparkConf, SparkContext
+from pyspark.sql import SQLContext
+conf = SparkConf()
+conf.setMaster('yarn')
+conf.setAppName('Kylin jdbc example')
+
+sc = SparkContext(conf=conf)
+sqlContext = SQLContext(sc)
+
+url = 'jdbc:kylin://sandbox:7070/default'
+df = sqlContext.read.format('jdbc').options(
+    url=url, user='ADMIN', password='KYLIN',
+    driver='org.apache.kylin.jdbc.Driver',
+    dbtable='kylin_sales').load()
+print(df.count())
+
+
+{% endhighlight %}
+
+The output is:
+{% highlight Groff markup %}
+132
+
+{% endhighlight %}
+
+
+The result "132" is not the total row count of the original table. Spark did not send a "select count(*)" query to Kylin as you might expect; instead it sent "select * " and then tried to count the rows within Spark. This is inefficient and, because Kylin doesn't have the raw data, the "select * " query is answered from the base Cuboid (summarized by all dimensions). So "132" is the row count of the base Cuboid, not of the original data.
+
+
+### The right way
+
+The right approach is to push down as much aggregation as possible to Kylin, so that the Cube can be leveraged and performance is much better. Below is the correct code:
+
+{% highlight Groff markup %}
+from pyspark import SparkConf, SparkContext
+from pyspark.sql import SQLContext
+conf = SparkConf()
+conf.setMaster('yarn')
+conf.setAppName('Kylin jdbc example')
+sc = SparkContext(conf=conf)
+sqlContext = SQLContext(sc)
+
+url = 'jdbc:kylin://sandbox:7070/default'
+# Pass the aggregate query as an aliased subquery so that Kylin runs it
+tab_name = '(select count(*) as total from kylin_sales) the_alias'
+
+df = sqlContext.read.format('jdbc').options(
+    url=url, user='ADMIN', password='KYLIN',
+    driver='org.apache.kylin.jdbc.Driver',
+    dbtable=tab_name).load()
+df.show()
+
+{% endhighlight %}
+
+Here is the output. The result is correct because Spark pushes the aggregation down to Kylin:
+
+{% highlight Groff markup %}
++-----+
+|TOTAL|
++-----+
+| 2000|
++-----+
+
+{% endhighlight %}
+
+Thanks to Shuxin Yang (shuxinyang@gmail.com) for the input and sample code.
+

-- 
To stop receiving notification emails like this one, please contact
shaofeng...@apache.org.

[kylin] 01/01: KYLIN-3383 add document for Spark JDBC

2018-05-17 Thread shaofengshi

This is an automated email from the ASF dual-hosted git repository.

shaofengshi pushed a commit to branch document
in repository https://gitbox.apache.org/repos/asf/kylin.git

commit 45b2132d536cd6febcd9b6165655b86c98bf7336
Author: shaofengshi 
AuthorDate: Thu May 17 18:55:15 2018 +0800

KYLIN-3383 add document for Spark JDBC
---
 website/_data/docs23.yml  |  1 +
 website/_docs23/index.md  |  5 ++-
 website/_docs23/tutorial/spark.md | 89 +++
 3 files changed, 93 insertions(+), 2 deletions(-)

diff --git a/website/_data/docs23.yml b/website/_data/docs23.yml
index 66b3121..97a0a6e 100644
--- a/website/_data/docs23.yml
+++ b/website/_data/docs23.yml
@@ -56,6 +56,7 @@
   - tutorial/microstrategy
   - tutorial/squirrel
   - tutorial/flink
+  - tutorial/spark
   - tutorial/hue
   - tutorial/Qlik
 
diff --git a/website/_docs23/index.md b/website/_docs23/index.md
index 0ad0af7..6d99ee1 100644
--- a/website/_docs23/index.md
+++ b/website/_docs23/index.md
@@ -56,8 +56,9 @@ Connectivity and APIs
 8. [Connect from MicroStrategy](tutorial/microstrategy.html)
 9. [Connect from SQuirreL](tutorial/squirrel.html)
 10. [Connect from Apache Flink](tutorial/flink.html)
-11. [Connect from Hue](tutorial/hue.html)
-12. [Connect from Qlik Sense](tutorial/Qlik.html)
+11. [Connect from Apache Spark](tutorial/spark.html)
+12. [Connect from Hue](tutorial/hue.html)
+13. [Connect from Qlik Sense](tutorial/Qlik.html)
 
 
 Operations
diff --git a/website/_docs23/tutorial/spark.md b/website/_docs23/tutorial/spark.md
new file mode 100644
index 000..2246843
--- /dev/null
+++ b/website/_docs23/tutorial/spark.md
@@ -0,0 +1,89 @@
+---
+layout: docs23
+title:  Apache Spark
+categories: tutorial
+permalink: /docs23/tutorial/spark.html
+---
+
+
+### Introduction
+
+Apache Kylin provides a JDBC driver to query Cube data, and Apache Spark supports JDBC data sources. Together, they let you connect to Kylin from your Spark application and analyze a very large data set interactively.
+
+Keep in mind that Kylin is an OLAP system that has already aggregated the raw data by the given dimensions. If you simply load the source table as you would from a normal database, you may not gain the benefit of Cubes, and it may crash your application.
+
+The right way is to start from a summarized result (e.g., a query with "group by"), load it as a data frame, and then apply transformations and other actions.
+
+This document describes how to use Kylin as a data source in Apache Spark. You need to install Kylin and build a Cube before running the examples. Also remember to put Kylin's JDBC driver (in the 'lib' folder of the Kylin binary package) on Spark's classpath.
+
+### The wrong way
+
+The Python application below tries to load Kylin's table directly as a data frame and then get the total row count with "df.count()", but the result is incorrect.
+
+{% highlight Groff markup %}
+from pyspark import SparkConf, SparkContext
+from pyspark.sql import SQLContext
+conf = SparkConf()
+conf.setMaster('yarn')
+conf.setAppName('Kylin jdbc example')
+
+sc = SparkContext(conf=conf)
+sqlContext = SQLContext(sc)
+
+df = sqlContext.read.format('jdbc').options(
+    url='jdbc:kylin://sandbox:7070/default',
+    user='ADMIN', password='KYLIN',
+    dbtable='kylin_sales', driver='org.apache.kylin.jdbc.Driver').load()
+print(df.count())
+
+
+{% endhighlight %}
+
+The output is:
+{% highlight Groff markup %}
+132
+
+{% endhighlight %}
+
+
+The result "132" is not the total row count of the original table. Spark did not send a "select count(*)" query to Kylin as you might expect; instead it sent "select * " and then tried to count the rows within Spark. This is inefficient and, because Kylin doesn't have the raw data, the "select * " query is answered from the base Cuboid (summarized by all dimensions). So "132" is the row count of the base Cuboid, not of the original data.
+
+
+### The right way
+
+The right approach is to push down as much aggregation as possible to Kylin, so that the Cube can be leveraged and performance is much better. Below is the correct code:
+
+{% highlight Groff markup %}
+from pyspark import SparkConf, SparkContext
+from pyspark.sql import SQLContext
+conf = SparkConf()
+conf.setMaster('yarn')
+conf.setAppName('Kylin jdbc example')
+sc = SparkContext(conf=conf)
+sql_ctx = SQLContext(sc)
+
+url = 'jdbc:kylin://sandbox:7070/default'
+# Pass the aggregate query as an aliased subquery so that Kylin runs it
+tab_name = '(select count(*) as total from kylin_sales) the_alias'
+
+df = sql_ctx.read.format('jdbc').options(
+    url=url, user='ADMIN', password='KYLIN',
+    driver='org.apache.kylin.jdbc.Driver',
+    dbtable=tab_name).load()
+df.show()
+
+{% endhighlight %}
+
+Here is the output. The result is correct because Spark pushes the aggregation down to Kylin:
+
+{% highlight Groff markup %}
++-----+
+|TOTAL|
++-----+
+| 2000|
++-----+
+
+{% endhighlight %}
+
+Thanks to Shuxin Yang (shuxinyang@gmail.com) for the input and sample code.
+
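
Note on the tutorial added in this commit: the "right way" relies on passing an aliased subquery as the JDBC `dbtable` option, so that Kylin, not Spark, evaluates the aggregation. The sketch below shows one way that wrapping might be factored out. It is a minimal illustration only: the helper names `as_dbtable` and `kylin_jdbc_options` are hypothetical (they are not part of Kylin or Spark), and the host, port, project and credentials are simply the sample values from the tutorial.

```python
def as_dbtable(query, alias="the_alias"):
    """Wrap a SQL query as an aliased subquery for the JDBC 'dbtable' option.

    Hypothetical helper for illustration; not part of Kylin or Spark.
    """
    cleaned = query.strip().rstrip(";")
    return "({}) {}".format(cleaned, alias)


def kylin_jdbc_options(query, host="sandbox", port=7070, project="default",
                       user="ADMIN", password="KYLIN"):
    """Build the option values used in the tutorial (sample defaults assumed)."""
    return {
        "url": "jdbc:kylin://{}:{}/{}".format(host, port, project),
        "user": user,
        "password": password,
        "driver": "org.apache.kylin.jdbc.Driver",
        "dbtable": as_dbtable(query),
    }


opts = kylin_jdbc_options("select count(*) as total from kylin_sales")
print(opts["dbtable"])

# With an existing SQLContext (as in the tutorial), the options could then be
# passed along as keyword arguments:
#   df = sqlContext.read.format('jdbc').options(**opts).load()
```

Keeping the query text separate from the connection settings makes it easier to check that every load goes through an aggregate query rather than a bare table name.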

-- 
To stop receiving notification emails like this one, please contact
shaofeng...@apache.org.
