Repository: incubator-toree-website
Updated Branches:
  refs/heads/OverhaulSite 9b329ef1f -> e7ff553dd
Added 'How it works' content

Project: http://git-wip-us.apache.org/repos/asf/incubator-toree-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-toree-website/commit/e7ff553d
Tree: http://git-wip-us.apache.org/repos/asf/incubator-toree-website/tree/e7ff553d
Diff: http://git-wip-us.apache.org/repos/asf/incubator-toree-website/diff/e7ff553d

Branch: refs/heads/OverhaulSite
Commit: e7ff553ddc9cabf263db3d18eb4ce241e9937233
Parents: 9b329ef
Author: Gino Bustelo <lbust...@us.ibm.com>
Authored: Mon Jun 13 15:21:11 2016 -0500
Committer: Gino Bustelo <lbust...@us.ibm.com>
Committed: Mon Jun 13 15:23:31 2016 -0500

----------------------------------------------------------------------
 assets/images/batch_mode.png          | Bin 0 -> 61060 bytes
 assets/images/interactive_mode.png    | Bin 0 -> 65268 bytes
 assets/images/toree_spark_gateway.png | Bin 0 -> 74504 bytes
 assets/images/toree_with_notebook.png | Bin 0 -> 52906 bytes
 documentation/user/how-it-works.md    | 60 +++++++++++++++++++++++++++--
 5 files changed, 57 insertions(+), 3 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-toree-website/blob/e7ff553d/assets/images/batch_mode.png
----------------------------------------------------------------------
diff --git a/assets/images/batch_mode.png b/assets/images/batch_mode.png
new file mode 100644
index 0000000..18082f3
Binary files /dev/null and b/assets/images/batch_mode.png differ

http://git-wip-us.apache.org/repos/asf/incubator-toree-website/blob/e7ff553d/assets/images/interactive_mode.png
----------------------------------------------------------------------
diff --git a/assets/images/interactive_mode.png b/assets/images/interactive_mode.png
new file mode 100644
index 0000000..55abbc2
Binary files /dev/null and b/assets/images/interactive_mode.png differ

http://git-wip-us.apache.org/repos/asf/incubator-toree-website/blob/e7ff553d/assets/images/toree_spark_gateway.png
----------------------------------------------------------------------
diff --git a/assets/images/toree_spark_gateway.png b/assets/images/toree_spark_gateway.png
new file mode 100644
index 0000000..a18daa0
Binary files /dev/null and b/assets/images/toree_spark_gateway.png differ

http://git-wip-us.apache.org/repos/asf/incubator-toree-website/blob/e7ff553d/assets/images/toree_with_notebook.png
----------------------------------------------------------------------
diff --git a/assets/images/toree_with_notebook.png b/assets/images/toree_with_notebook.png
new file mode 100644
index 0000000..873142c
Binary files /dev/null and b/assets/images/toree_with_notebook.png differ

http://git-wip-us.apache.org/repos/asf/incubator-toree-website/blob/e7ff553d/documentation/user/how-it-works.md
----------------------------------------------------------------------
diff --git a/documentation/user/how-it-works.md b/documentation/user/how-it-works.md
index 86c6472..3aa999a 100644
--- a/documentation/user/how-it-works.md
+++ b/documentation/user/how-it-works.md
@@ -9,7 +9,61 @@ tagline: Apache Project !
 {% include JB/setup %}
-- Architecture in relation to Jupyter and Spark
-- Links to Jupyter kernel spec
-- Links to keynotes and presentations
+# How it works
+
+Toree provides an interactive programming interface to a Spark Cluster. Its API takes in `code` in a variety of
+languages and executes it. The `code` can perform Spark tasks using the provided Spark Context.
+
+To further understand how Toree works, it is worth exploring the role that it plays in several usage scenarios.
+
+### As a Kernel to Jupyter Notebooks
+
+Toree's primary role is as a [Jupyter](http://jupyter.org/) Kernel. It was originally created to add full Spark API
+support to a Jupyter Notebook using the Scala language. It has since grown to also support Python and R. The diagram
+below shows Toree in relation to a running Jupyter Notebook.
+
+![Toree with Jupyter Notebook](/assets/images/toree_with_notebook.png)
+
+When the user creates a new Notebook and selects Toree, the Notebook server launches a new Toree process that is
+configured to connect to a Spark cluster. Once in the Notebook, the user can interact with Spark by writing code that
+uses the managed Spark Context instance.
+
+The Notebook server and Toree communicate using the [Jupyter Kernel Protocol](https://ipython.org/ipython-doc/3/development/messaging.html).
+This is a [0MQ](http://zeromq.org/) based protocol that is language agnostic and allows for bidirectional communication
+between the client and the kernel (i.e. Toree). This protocol is the __ONLY__ network interface for communicating with a
+Toree process.
+
+When using Toree within a Jupyter Notebook, these technical details can be ignored, but they are very relevant when
+building custom clients. Several options are discussed in the next section.
+
+### As an Interactive Gateway to Spark
+
+One way of using Spark is what is commonly referred to as 'Batch' mode. Very similar to other Big Data systems, such as
+Hadoop, this mode has the user create a program that is submitted to the cluster. This program runs tasks in the
+cluster and ultimately writes data to some persistent store (e.g. HDFS or a NoSQL store). Spark provides `Batch` mode
+support through [Spark Submit](http://spark.apache.org/docs/latest/submitting-applications.html).
+
+![Toree Gateway to Spark](/assets/images/batch_mode.png)
+
+This mode of using Spark, although valid, suffers from considerable friction. For example, packaging and submitting jobs, as
+well as reading from and writing to storage, tend to introduce unwanted latencies. Spark alleviates some of this
+friction by relying on memory to hold data along with the concept of a SparkContext as a way to tie jobs together. What
+is missing from Spark is a way for applications to interact with a long-lived SparkContext.
+
+![Toree Gateway to Spark](/assets/images/interactive_mode.png)
+
+Toree provides this through a communication channel between an application and a SparkContext that allows access to the
+entire Spark API. Through this channel, the application interacts with Spark by exchanging code and data.
+
+The Jupyter Notebook is a good example of an application that relies on the presence of these interactive channels and
+uses Toree to access Spark. Other Spark-enabled applications can be built that directly connect to Toree through the
+`0MQ` protocol, but there are also other ways.
+
+![Toree Gateway to Spark](/assets/images/toree_spark_gateway.png)
+
+As shown above, the [Jupyter Kernel Gateway](https://github.com/jupyter/kernel_gateway) can be used to expose a
+WebSocket-based protocol to Toree. This makes Toree easier to integrate. In combination with the
+[jupyter-js-services](https://github.com/jupyter/jupyter-js-services) library, other web applications can access Spark
+interactively. The [Jupyter Dashboard Server](https://github.com/jupyter-incubator/dashboards_server) is an example of
+a web application that uses Toree as the backend for dynamic dashboards.
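
For readers building a custom client as the page describes, the sketch below (not part of this commit) shows roughly what a single code-execution request on that channel could look like. It builds a Jupyter `execute_request` message as a plain Python dictionary, following the message layout in the Jupyter Kernel Protocol documentation linked in the page. The function name `build_execute_request`, the `example-user` username, the sample Scala snippet, and the variable name `sc` for the provided Spark Context are illustrative assumptions. The top-level `channel` key is the routing field used when messages travel as JSON over a WebSocket; it is not needed on a raw 0MQ socket.

```python
import json
import uuid


def build_execute_request(code, session_id, username="example-user"):
    """Build an `execute_request` message as a plain dict.

    Hypothetical helper for illustration only; it constructs the message
    but does not open a socket or talk to a running kernel.
    """
    header = {
        "msg_id": uuid.uuid4().hex,      # unique id for this message
        "session": session_id,           # identifies the client session
        "username": username,            # assumed placeholder value
        "msg_type": "execute_request",
        "version": "5.0",                # protocol version, assumed here
    }
    return {
        "header": header,
        "parent_header": {},             # empty: this message starts a new exchange
        "metadata": {},
        "content": {
            "code": code,                # source code the kernel should run
            "silent": False,
            "store_history": True,
            "user_expressions": {},
            "allow_stdin": False,
        },
        "channel": "shell",              # WebSocket routing field (see lead-in)
    }


# Example: a Scala snippet that uses the provided Spark Context,
# assumed here to be bound to the name `sc`.
message = build_execute_request(
    "sc.parallelize(1 to 100).sum()",
    session_id=uuid.uuid4().hex,
)
print(json.dumps(message, indent=2))
```

A client would serialize this dictionary to JSON and send it over its connection to the kernel, then match replies by comparing each reply's `parent_header` `msg_id` against the `msg_id` it sent.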