This is an automated email from the ASF dual-hosted git repository.

shaofengshi pushed a commit to branch document
in repository https://gitbox.apache.org/repos/asf/kylin.git


The following commit(s) were added to refs/heads/document by this push:
     new 1553900  KYLIN-3383 add document for Spark JDBC
1553900 is described below

commit 155390017f298e2201f8c94812b0b9bca676bffe
Author: shaofengshi <shaofeng...@apache.org>
AuthorDate: Thu May 17 18:55:15 2018 +0800

    KYLIN-3383 add document for Spark JDBC
---
 website/_data/docs23.yml          |   1 +
 website/_docs23/index.md          |   5 +-
 website/_docs23/tutorial/spark.md | 191 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 195 insertions(+), 2 deletions(-)

diff --git a/website/_data/docs23.yml b/website/_data/docs23.yml
index 66b3121..97a0a6e 100644
--- a/website/_data/docs23.yml
+++ b/website/_data/docs23.yml
@@ -56,6 +56,7 @@
   - tutorial/microstrategy
   - tutorial/squirrel
   - tutorial/flink
+  - tutorial/spark
   - tutorial/hue
   - tutorial/Qlik
 
diff --git a/website/_docs23/index.md b/website/_docs23/index.md
index 0ad0af7..6d99ee1 100644
--- a/website/_docs23/index.md
+++ b/website/_docs23/index.md
@@ -56,8 +56,9 @@ Connectivity and APIs
 8. [Connect from MicroStrategy](tutorial/microstrategy.html)
 9. [Connect from SQuirreL](tutorial/squirrel.html)
 10. [Connect from Apache Flink](tutorial/flink.html)
-11. [Connect from Hue](tutorial/hue.html)
-12. [Connect from Qlik Sense](tutorial/Qlik.html)
+11. [Connect from Apache Spark](tutorial/spark.html)
+12. [Connect from Hue](tutorial/hue.html)
+13. [Connect from Qlik Sense](tutorial/Qlik.html)
 
 
 Operations
diff --git a/website/_docs23/tutorial/spark.md b/website/_docs23/tutorial/spark.md
new file mode 100644
index 0000000..6d599ca
--- /dev/null
+++ b/website/_docs23/tutorial/spark.md
@@ -0,0 +1,191 @@
+---
+layout: docs23
+title:  Apache Spark
+categories: tutorial
+permalink: /docs23/tutorial/spark.html
+---
+
+
+### Introduction
+
+Kylin provides a JDBC driver for querying Cube data, and Spark can read from SQL databases through JDBC. Combining the two, you can query Kylin's Cubes from Spark and then do your analysis there.
+
+Keep in mind that Kylin is an OLAP system, not a general-purpose database: it only stores aggregated data, not the raw records. If you simply load the source table into Spark as a data frame, operations like "count" will not return what you expect, because they do not count the raw data.
+
+Besides, the Cube data can be very large, which is different from a normal database.
+
+This document describes how to use Kylin as a data source in Apache Spark. As a prerequisite, you need to install Kylin and build a Cube. A minimal connection sketch follows; the rest of the page walks through a wrong example and the corrected one.
+
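+Here is a minimal connection sketch (an illustration, not part of the original scripts; it assumes a SQLContext has already been created as in the full examples below). The Kylin JDBC URL takes the form jdbc:kylin://&lt;host&gt;:&lt;port&gt;/&lt;project&gt;, and the driver class is org.apache.kylin.jdbc.Driver; the sandbox host, "default" project and ADMIN/KYLIN credentials used throughout this page are the sandbox defaults.
+
+{% highlight Groff markup %}
+# Minimal sketch: read Kylin data through JDBC into a Spark data frame.
+# Assumes the Kylin JDBC driver jar is on the driver classpath and
+# sqlContext is an existing SQLContext.
+df = sqlContext.read.format('jdbc').options(
+    url='jdbc:kylin://sandbox:7070/default',   # jdbc:kylin://<host>:<port>/<project>
+    driver='org.apache.kylin.jdbc.Driver',
+    user='ADMIN', password='KYLIN',
+    dbtable='test_kylin_fact'                  # a table name, or a "(subquery) alias" pushed down to Kylin
+    ).load()
+{% endhighlight %}
+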
+### The wrong application
+
+The Python application below tries to load Kylin's table as a data frame and then expects to get the total row count with "df.count()", but the result is incorrect.
+
+{% highlight Groff markup %}
+#!/usr/bin/env python
+
+import os
+import sys
+import traceback
+import time
+import subprocess
+import json
+import re
+
+os.environ["SPARK_HOME"] = "/usr/local/spark/"
+sys.path.append(os.environ["SPARK_HOME"]+"/python")
+
+from pyspark import SparkConf, SparkContext
+from pyspark.sql import SQLContext
+
+from pyspark.sql.functions import *
+from pyspark.sql.types import *
+
+jars = ["kylin-jdbc-2.3.1.jar", "jersey-client-1.9.jar", "jersey-core-1.9.jar"]
+
+class Kap(object):
+    def __init__(self):
+        print 'initializing Spark context ...'
+        sys.stdout.flush()
+
+        conf = SparkConf() 
+        conf.setMaster('yarn')
+        conf.setAppName('kap test')
+
+        wdir = os.path.dirname(os.path.realpath(__file__))
+        jars_with_path = ','.join([wdir + '/' + x for x in jars])
+
+        conf.set("spark.jars", jars_with_path)
+        conf.set("spark.yarn.archive", 
"hdfs://sandbox.hortonworks.com:8020/kylin/spark/spark-libs.jar")
+        conf.set("spark.driver.extraClassPath", 
jars_with_path.replace(",",":"))
+
+        self.sc = SparkContext(conf=conf)
+        self.sqlContext = SQLContext(self.sc)
+        print 'Spark context is initialized'
+
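+        # Load the whole Kylin table as a data frame. Spark's JDBC source wraps this
+        # in a "select * from ..." query, which Kylin answers from the base Cuboid.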
+        self.df = self.sqlContext.read.format('jdbc').options(
+            url='jdbc:kylin://sandbox:7070/default',
+            user='ADMIN', password='KYLIN',
+            dbtable='test_kylin_fact', driver='org.apache.kylin.jdbc.Driver').load()
+
+        self.df.registerTempTable("loltab")
+        print self.df.count()
+
+    def sql(self, cmd, result_tab_name='tmptable'):
+        df = self.sqlContext.sql(cmd) 
+        if df is not None:
+            df.registerTempTable(result_tab_name)
+        return df
+
+    def stop(self):
+        self.sc.stop()
+
+kap = Kap()
+try:
+    df = kap.sql(r"select count(*) from loltab")
+    df.show(truncate=False)
+except:
+    pass
+finally:
+    kap.stop()
+
+
+{% endhighlight %}
+
+The output is:
+{% highlight Groff markup %}
+Spark context is initialized
+132
++--------+
+|count(1)|
++--------+
+|132     |
++--------+
+
+{% endhighlight %}
+
+
+The result "132" here is not the total count of the origin table. The reason 
is that, Spark sends "select * from " query to Kylin, Kylin doesn't have the 
raw data, but will answer the query with aggregated data in the base Cuboid. 
The "132" is the row number of the base Cuboid, not source data. 
+
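+The fix is to let Kylin do the aggregation. The sketch below (an illustration, not taken from the original scripts) contrasts the two "dbtable" forms; the complete working script is in the next section.
+
+{% highlight Groff markup %}
+# Wrong: Spark issues "select * from test_kylin_fact"; Kylin answers from the base Cuboid.
+dbtable = 'test_kylin_fact'
+
+# Right: the aggregation runs inside Kylin, and Spark receives only the single result row.
+dbtable = '(select count(*) as total from test_kylin_fact) the_alias'
+{% endhighlight %}
+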
+
+### The right code
+
+The right approach is to push the aggregation down to Kylin, so that the Cube can be leveraged. Below is the correct code:
+
+{% highlight Groff markup %}
+#!/usr/bin/env python
+
+import os
+import sys
+import json
+
+os.environ["SPARK_HOME"] = "/usr/local/spark/"
+sys.path.append(os.environ["SPARK_HOME"]+"/python")
+
+from pyspark import SparkConf, SparkContext
+from pyspark.sql import SQLContext
+
+from pyspark.sql.functions import *
+from pyspark.sql.types import *
+
+jars = ["kylin-jdbc-2.3.1.jar", "jersey-client-1.9.jar", "jersey-core-1.9.jar"]
+
+
+def demo():
+    # step 1: init
+    print 'initializing ...',
+    conf = SparkConf() 
+    conf.setMaster('yarn')
+    conf.setAppName('jdbc example')
+
+    wdir = os.path.dirname(os.path.realpath(__file__))
+    jars_with_path = ','.join([wdir + '/' + x for x in jars])
+
+    conf.set("spark.jars", jars_with_path)
+    conf.set("spark.yarn.archive", 
"hdfs://sandbox.hortonworks.com:8020/kylin/spark/spark-libs.jar")
+        
+    conf.set("spark.driver.extraClassPath", jars_with_path.replace(",",":"))
+
+    sc = SparkContext(conf=conf)
+    sql_ctx = SQLContext(sc)
+    print 'done'
+
+    url='jdbc:kylin://sandbox:7070/default'
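+    # Wrap the aggregation in a subquery so that Kylin, not Spark, computes the count;
+    # a derived table used as "dbtable" needs an alias.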
+    tab_name = '(select count(*) as total from test_kylin_fact) the_alias'
+
+    # step 2: initiate the sql
+    df = sql_ctx.read.format('jdbc').options(
+            url=url, user='ADMIN', password='KYLIN',
+            driver='org.apache.kylin.jdbc.Driver',
+            dbtable=tab_name).load()
+
+    # many ways to obtain the results
+    df.show()
+
+    print "df.count()", df.count()  # must be 1, as there is only one row
+
+    for record in df.toJSON().collect():
+        # this loop has only one iteration
+        # each record is a string; it needs to be decoded as JSON
+        print 'the total column: ', json.loads(record)['TOTAL']
+
+    sc.stop()
+
+demo()
+
+{% endhighlight %}
+
+Here is the output, which is as expected:
+
+{% highlight Groff markup %}
+initializing ... done
++-----+
+|TOTAL|
++-----+
+| 2000|
++-----+
+
+df.count() 1
+the total column:  2000
+{% endhighlight %}
+
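+The same pattern extends to any aggregation the Cube can answer. As a further, hypothetical illustration (not part of the original commit; it reuses the sql_ctx and url from the script above, and assumes test_kylin_fact has LSTG_FORMAT_NAME and PRICE columns), a group-by query can be pushed down the same way:
+
+{% highlight Groff markup %}
+# Hypothetical: Kylin computes the GROUP BY; Spark receives one row per group.
+tab_name = ('(select lstg_format_name, sum(price) as total_price, count(*) as cnt '
+            'from test_kylin_fact group by lstg_format_name) the_alias')
+
+df = sql_ctx.read.format('jdbc').options(
+        url=url, user='ADMIN', password='KYLIN',
+        driver='org.apache.kylin.jdbc.Driver',
+        dbtable=tab_name).load()
+df.show()
+{% endhighlight %}
+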
+Thanks to Shuxin Yang (shuxinyang....@gmail.com) for the input and sample code.
