Hi Folks,

Recently there have been several tickets [1][2][3] about sharing data in
Zeppelin. Zeppelin's goal is to be a unified data analytics platform that
integrates most of the big data tools and helps users switch between tools
and share data between them easily. So sharing data is a very critical and
killer feature of Zeppelin IMHO.

I am raising this thread to discuss the scenarios for sharing data and how
to support them. Although Zeppelin already provides tools and APIs to share
data, I don't think they are mature and stable enough. After seeing these
tickets, I think it is a good time to discuss this in the community and
gather more feedback, so that we can provide a more stable and mature
approach.

Currently, there are 3 approaches to sharing data between interpreters and
interpreter processes.
1. Sharing data across interpreters in the same interpreter process, e.g.
sharing data via the same SparkContext in %spark, %spark.pyspark and
%spark.r.
2. Sharing data between frontend and backend via angularObject.
3. Sharing data across interpreter processes via Zeppelin's ResourcePool.

For this thread, I would like to talk about approach 3 (sharing data via
Zeppelin's ResourcePool).

Here's my current thinking on sharing data.
1. What kind of data would be shared?
   IMHO, users would share 2 kinds of data: primitive data (string, number)
and table data.

2. How to write shared data?
    Users may want to share data via 2 approaches:
    a. Use ZeppelinContext (e.g. z.put).
    b. Share the paragraph result via paragraph properties. E.g. a user may
want to read data from an Oracle database via the jdbc interpreter and then
do plotting in the python interpreter. In such a scenario, they can save the
jdbc result in ResourcePool via a paragraph property and then read it via
z.get. Here's one simple example (not implemented yet):

        %jdbc(saveAsTable=people)
        select * from oracle_table

        %python
        z.getTable("people").toPandas()

3. How to read shared data?
    Users also have 2 approaches to read the shared data:
    a. Via ZeppelinContext (e.g. z.get, z.getTable).
    b. Via variable substitution [1].
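
To make items 2 and 3 a bit more concrete, here is a minimal Python sketch of
the write and read paths. It is only a toy in-memory stand-in, not Zeppelin's
ResourcePool: the class ToyResourcePool, its get_table and interpolate
methods, the dict layout used for table data, and the {name} placeholder
syntax are all assumptions made for illustration.

```python
import re


class ToyResourcePool:
    """Hypothetical in-memory pool; NOT Zeppelin's actual ResourcePool."""

    def __init__(self):
        self._resources = {}

    def put(self, name, value):
        # Write path: store a primitive or a table under a shared name.
        self._resources[name] = value

    def get(self, name, default=None):
        # Read path for any value; returns `default` when the name is unknown.
        return self._resources.get(name, default)

    def get_table(self, name):
        # Read path for table data; the dict layout is an assumed convention.
        value = self._resources[name]
        if not isinstance(value, dict) or not {"columns", "rows"} <= value.keys():
            raise TypeError(f"resource {name!r} is not table data")
        return value

    def interpolate(self, text):
        # Variable substitution: replace {name} with the pooled value.
        # Unknown placeholders are left untouched.
        def substitute(match):
            name = match.group(1)
            if name not in self._resources:
                return match.group(0)
            return str(self._resources[name])

        return re.sub(r"\{(\w+)\}", substitute, text)


pool = ToyResourcePool()

# One paragraph shares a primitive and a table.
pool.put("min_age", 30)
pool.put("people", {"columns": ["name", "age"],
                    "rows": [("Ann", 34), ("Bob", 27)]})

# Another paragraph reads them back, directly or via substitution.
query = pool.interpolate("select * from people where age > {min_age}")
table = pool.get_table("people")
adults = [row for row in table["rows"] if row[1] > pool.get("min_age")]

print(query)
print(adults)
```

In real Zeppelin the pool is shared across interpreter processes, which this
single-process sketch does not model; it only shows the two kinds of data
(primitive and table) and the two read paths side by side.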

Here's one sample note which illustrates the scenario of sharing data.
https://www.zepl.com/viewer/notebooks/bm90ZTovL3pqZmZkdS8zMzkxZjg3YmFhMjg0MDY3OGM1ZmYzODAwODAxMGJhNy9ub3RlLmpzb24

This is just my current thinking on sharing data in Zeppelin. It definitely
doesn't cover all the scenarios, so I am raising this thread to discuss it in
the community. Any feedback and comments are welcome.


[1]. https://issues.apache.org/jira/browse/ZEPPELIN-3377
[2]. https://issues.apache.org/jira/browse/ZEPPELIN-3596
[3]. https://issues.apache.org/jira/browse/ZEPPELIN-3617
