Re: [DISCUSS] Share Data in Zeppelin

2018-07-17 Thread Belousov Maksim Eduardovich

The ability to work with many data sources is one of the reasons we chose Apache 
Zeppelin.

For branch-0.7, our ops team wrote a lot of Python functions to import and 
export data from different sources (Greenplum, Hive, Oracle), using a Python DataFrame 
as middleware.
Our users can upload flat files to Zeppelin via Samba, then load them into the DBs and 
run queries.
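
The flat-file workflow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the standard-library sqlite3 module stands in for the real databases (Greenplum, Hive, Oracle), a list of row dicts stands in for the DataFrame, and the names load_flat_file, export_rows and the people table are hypothetical, not the ops team's actual functions.

```python
import csv
import io
import sqlite3


def load_flat_file(text):
    """Parse CSV text into a list of row dicts (the in-memory middleware)."""
    return list(csv.DictReader(io.StringIO(text)))


def export_rows(conn, table, rows):
    """Create `table` from the row dicts and insert all rows.

    Assumes `table` and the column names are trusted identifiers
    and that `rows` is not empty.
    """
    columns = list(rows[0].keys())
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [tuple(row[c] for c in columns) for row in rows],
    )
    conn.commit()


# Hypothetical flat file a user might have uploaded.
flat_file = "name,age\nalice,30\nbob,25\n"

conn = sqlite3.connect(":memory:")  # stand-in for a real target database
rows = load_flat_file(flat_file)
export_rows(conn, "people", rows)

# Once the data is in the database, it can be queried as usual.
names = [r[0] for r in conn.execute("SELECT name FROM people ORDER BY name")]
print(names)
```

Keeping the parsed rows in one neutral in-memory form is what lets the same export function serve any target database that accepts parameterized inserts.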

The availability of ResourcePool in 0.8 is a big milestone.
I hope ResourcePool will make it possible to smoothly integrate all the sources in our company.
It would be great if interpreters other than spark and python could also get data from 
ResourcePool.

The 2.b case is nice.
Now I see that transferring table data is sufficient.



Regards,
Maxim Belousov


From: Jeff Zhang 
Sent: July 13, 2018, 6:00
To: us...@zeppelin.apache.org
Cc: dev
Subject: Re: [DISCUSS] Share Data in Zeppelin

Thanks Sanjay, I have fixed the example note.

*Folks, please note:* the example note is just a mock note, so it won't work
for now.



On Fri, Jul 13, 2018 at 10:54 AM, Jongyoul Lee wrote:

> BTW, we need to consider the case where the result is large in a design
> time. In my experience, If we implement this feature, users could use it
> with large data.
>
> On Fri, Jul 13, 2018 at 11:51 AM, Sanjay Dasgupta <
> sanjay.dasgu...@gmail.com> wrote:
>
>> I prefer 2.b also. Could we use (save*Result*AsTable=people) instead?
>>
>> There are a few typos in the example note shared:
>>
>> 1) The line val peopleDF = spark.read.format("zeppelin").load() should
>> mention the table name (possibly as argument to load?)
>> 2) The python line val peopleDF = z.getTable("people").toPandas() should
>> not have the val
>>
>>
>> The z.getTable() method could be a very good tool to judge
>> which use-cases are important in the community. It is easy to implement for
>> the in-memory data case, and could be very useful for many situations where
>> a small amount of data is being transferred across interpreters (like the
>> jdbc -> matplotlib case mentioned).
>>
>> Thanks,
>> Sanjay
>>
>> On Fri, Jul 13, 2018 at 8:07 AM, Jongyoul Lee  wrote:
>>
>>> Yes, it's similar to 2.b.
>>>
>>> Basically, my concern is to handle all kinds of data. But in your case,
>>> it looks like focusing on table data. It's also useful but it would be
>>> better to handle all of the data including table or plain text as well.
>>> WDYT?
>>>
>>> About storage, we could discuss it later.
>>>
>>> On Fri, Jul 13, 2018 at 11:25 AM, Jeff Zhang  wrote:
>>>
>>>>
>>>> I think your use case is the same of 2.b.  Personally I don't recommend
>>>> to use z.get(noteId, paragraphId) to get the shared data for 2 reasons
>>>> 1.  noteId, paragraphId is meaningless, which is not readable
>>>> 2. The note will break if we clone it as the noteId is changed.
>>>> That's why I suggest to use paragraph property to save paragraph's
>>>> result
>>>>
>>>> Regarding the intermediate storage, I also though about it and agree
>>>> that in the long term we should provide such layer to support large data,
>>>> currently we put the shared data in memory which is not a scalable
>>>> solution.  One candidate in my mind is alluxio [1], and regarding the data
>>>> format I think apache arrow [2] is another good option for zeppelin to
>>>> share table data across interpreter processes and different languages. But
>>>> these are all implementation details, I think we can talk about them in
>>>> another thread. In this thread, I think we should focus on the user facing
>>>> api.
>>>>
>>>>
>>>> [1] http://www.alluxio.org/
>>>> [2] https://arrow.apache.org/
>>>>
>>>>
>>>>
>>>> On Fri, Jul 13, 2018 at 10:11 AM, Jongyoul Lee wrote:
>>>>
>>>>> I have a bit different idea to share data.
>>>>>
>>>>> In my case,
>>>>>
>>>>> It would be very useful to get a paragraph's result as an input of
>>>>> other paragraphs.
>>>>>
>>>>> e.g.
>>>>>
>>>>> -- Paragrph 1
>>>>> %jdbc
>>>>> select * from some_table;
>>>>>
>>>>> -- Paragraph 2
>>>>> %spark
>>>>> val rdd = z.get("noteId", "paragraphId").parse.makeRddByMyself
>>>>> spark.read(table).select
>>>>>
>>>>> If paragraph 1's result is too big to show on FE, it would be save

Re: [DISCUSS] Share Data in Zeppelin

2018-07-12 Thread Jeff Zhang

Thanks Sanjay, I have fixed the example note.

*Folks, please note:* the example note is just a mock note, so it won't work
for now.



On Fri, Jul 13, 2018 at 10:54 AM, Jongyoul Lee wrote:

> BTW, we need to consider the case where the result is large in a design
> time. In my experience, If we implement this feature, users could use it
> with large data.
>
> On Fri, Jul 13, 2018 at 11:51 AM, Sanjay Dasgupta <
> sanjay.dasgu...@gmail.com> wrote:
>
>> I prefer 2.b also. Could we use (save*Result*AsTable=people) instead?
>>
>> There are a few typos in the example note shared:
>>
>> 1) The line val peopleDF = spark.read.format("zeppelin").load() should
>> mention the table name (possibly as argument to load?)
>> 2) The python line val peopleDF = z.getTable("people").toPandas() should
>> not have the val
>>
>>
>> The z.getTable() method could be a very good tool to judge
>> which use-cases are important in the community. It is easy to implement for
>> the in-memory data case, and could be very useful for many situations where
>> a small amount of data is being transferred across interpreters (like the
>> jdbc -> matplotlib case mentioned).
>>
>> Thanks,
>> Sanjay
>>
>> On Fri, Jul 13, 2018 at 8:07 AM, Jongyoul Lee  wrote:
>>
>>> Yes, it's similar to 2.b.
>>>
>>> Basically, my concern is to handle all kinds of data. But in your case,
>>> it looks like focusing on table data. It's also useful but it would be
>>> better to handle all of the data including table or plain text as well.
>>> WDYT?
>>>
>>> About storage, we could discuss it later.
>>>
>>> On Fri, Jul 13, 2018 at 11:25 AM, Jeff Zhang  wrote:
>>>

 I think your use case is the same of 2.b.  Personally I don't recommend
 to use z.get(noteId, paragraphId) to get the shared data for 2 reasons
 1.  noteId, paragraphId is meaningless, which is not readable
 2. The note will break if we clone it as the noteId is changed.
 That's why I suggest to use paragraph property to save paragraph's
 result

 Regarding the intermediate storage, I also though about it and agree
 that in the long term we should provide such layer to support large data,
 currently we put the shared data in memory which is not a scalable
 solution.  One candidate in my mind is alluxio [1], and regarding the data
 format I think apache arrow [2] is another good option for zeppelin to
 share table data across interpreter processes and different languages. But
 these are all implementation details, I think we can talk about them in
 another thread. In this thread, I think we should focus on the user facing
 api.


 [1] http://www.alluxio.org/
 [2] https://arrow.apache.org/



 On Fri, Jul 13, 2018 at 10:11 AM, Jongyoul Lee wrote:

> I have a bit different idea to share data.
>
> In my case,
>
> It would be very useful to get a paragraph's result as an input of
> other paragraphs.
>
> e.g.
>
> -- Paragrph 1
> %jdbc
> select * from some_table;
>
> -- Paragraph 2
> %spark
> val rdd = z.get("noteId", "paragraphId").parse.makeRddByMyself
> spark.read(table).select
>
> If paragraph 1's result is too big to show on FE, it would be saved in
> Zeppelin Server with proper way and pass to SparkInterpreter when 
> Paragraph
> 2 is executed.
>
> Basically, I think we need to intermediate storage to store
> paragraph's results to share them. We can introduce another layer or 
> extend
> NotebootRepo. In some cases, we might change notebook repos as well.
>
> JL
>
>
>
> On Fri, Jul 13, 2018 at 10:39 AM, Jeff Zhang  wrote:
>
>> Hi Folks,
>>
>> Recently, there's several tickets [1][2][3] about sharing data in
>> zeppelin.
>> Zeppelin's goal is to be an unified data analyst platform which could
>> integrate most of the big data tools and help user to switch between
>> tools
>> and share data between tools easily. So sharing data is a very
>> critical and
>> killer feature of Zeppelin IMHO.
>>
>> I raise this ticket to discuss about the scenario of sharing data and
>> how
>> to do that. Although zeppelin already provides tools and api to share
>> data,
>> I don't think it is mature and stable enough. After seeing these
>> tickets, I
>> think it might be a good time to talk about it in community and
>> gather more
>> feedback, so that we could provide a more stable and mature approach
>> for
>> it.
>>
>> Currently, there're 3 approaches to share data between interpreters
>> and
>> interpreter processes.
>> 1. Sharing data across interpreter in the same interpreter process.
>> Like
>> sharing data via the same SparkContext in %spark, %spark.pyspark and
>> %spark.r.
>> 2. Sharing data between frontend and backend via angularObject
>> 3. Sharing data across interpreter processes via

Re: [DISCUSS] Share Data in Zeppelin

2018-07-12 Thread Jongyoul Lee

BTW, we need to consider the case where the result is large at design
time. In my experience, if we implement this feature, users will use it
with large data.
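
One way to picture the concern is a store that keeps small results in memory and writes large ones to disk. The sketch below is purely hypothetical: SpillingResultStore, its methods, and the row-count threshold are invented for illustration and are not part of Zeppelin.

```python
import json
import os
import tempfile


class SpillingResultStore:
    """Hypothetical store: small results stay in memory, large ones go to disk.

    Assumes each name is written only once.
    """

    def __init__(self, max_rows_in_memory=1000):
        self.max_rows_in_memory = max_rows_in_memory
        self._memory = {}
        self._files = {}

    def put(self, name, rows):
        rows = list(rows)
        if len(rows) <= self.max_rows_in_memory:
            self._memory[name] = rows
            return
        # Too large: write one JSON document per line to a temp file.
        fd, path = tempfile.mkstemp(suffix=".jsonl")
        with os.fdopen(fd, "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
        self._files[name] = path

    def is_spilled(self, name):
        return name in self._files

    def get(self, name):
        """Yield rows one at a time, so a reader never needs the whole result."""
        if name in self._memory:
            yield from self._memory[name]
        else:
            with open(self._files[name]) as f:
                for line in f:
                    yield json.loads(line)

    def close(self):
        """Delete any temp files created for spilled results."""
        for path in self._files.values():
            os.remove(path)
        self._files.clear()


store = SpillingResultStore(max_rows_in_memory=3)
store.put("small", [[1, "a"], [2, "b"]])
store.put("large", [[i, str(i)] for i in range(10)])

small_rows = list(store.get("small"))
large_rows = list(store.get("large"))
spilled = (store.is_spilled("small"), store.is_spilled("large"))
store.close()
print(len(small_rows), len(large_rows), spilled)
```

Readers see the same iterator interface in both cases, so the decision about where a result lives can change later without touching the code that consumes it.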

On Fri, Jul 13, 2018 at 11:51 AM, Sanjay Dasgupta  wrote:

> I prefer 2.b also. Could we use (save*Result*AsTable=people) instead?
>
> There are a few typos in the example note shared:
>
> 1) The line val peopleDF = spark.read.format("zeppelin").load() should
> mention the table name (possibly as argument to load?)
> 2) The python line val peopleDF = z.getTable("people").toPandas() should
> not have the val
>
>
> The z.getTable() method could be a very good tool to judge
> which use-cases are important in the community. It is easy to implement for
> the in-memory data case, and could be very useful for many situations where
> a small amount of data is being transferred across interpreters (like the
> jdbc -> matplotlib case mentioned).
>
> Thanks,
> Sanjay
>
> On Fri, Jul 13, 2018 at 8:07 AM, Jongyoul Lee  wrote:
>
>> Yes, it's similar to 2.b.
>>
>> Basically, my concern is to handle all kinds of data. But in your case,
>> it looks like focusing on table data. It's also useful but it would be
>> better to handle all of the data including table or plain text as well.
>> WDYT?
>>
>> About storage, we could discuss it later.
>>
>> On Fri, Jul 13, 2018 at 11:25 AM, Jeff Zhang  wrote:
>>
>>>
>>> I think your use case is the same of 2.b.  Personally I don't recommend
>>> to use z.get(noteId, paragraphId) to get the shared data for 2 reasons
>>> 1.  noteId, paragraphId is meaningless, which is not readable
>>> 2. The note will break if we clone it as the noteId is changed.
>>> That's why I suggest to use paragraph property to save paragraph's result
>>>
>>> Regarding the intermediate storage, I also though about it and agree
>>> that in the long term we should provide such layer to support large data,
>>> currently we put the shared data in memory which is not a scalable
>>> solution.  One candidate in my mind is alluxio [1], and regarding the data
>>> format I think apache arrow [2] is another good option for zeppelin to
>>> share table data across interpreter processes and different languages. But
>>> these are all implementation details, I think we can talk about them in
>>> another thread. In this thread, I think we should focus on the user facing
>>> api.
>>>
>>>
>>> [1] http://www.alluxio.org/
>>> [2] https://arrow.apache.org/
>>>
>>>
>>>
>>> On Fri, Jul 13, 2018 at 10:11 AM, Jongyoul Lee wrote:
>>>
 I have a bit different idea to share data.

 In my case,

 It would be very useful to get a paragraph's result as an input of
 other paragraphs.

 e.g.

 -- Paragrph 1
 %jdbc
 select * from some_table;

 -- Paragraph 2
 %spark
 val rdd = z.get("noteId", "paragraphId").parse.makeRddByMyself
 spark.read(table).select

 If paragraph 1's result is too big to show on FE, it would be saved in
 Zeppelin Server with proper way and pass to SparkInterpreter when Paragraph
 2 is executed.

 Basically, I think we need to intermediate storage to store paragraph's
 results to share them. We can introduce another layer or extend
 NotebootRepo. In some cases, we might change notebook repos as well.

 JL



 On Fri, Jul 13, 2018 at 10:39 AM, Jeff Zhang  wrote:

> Hi Folks,
>
> Recently, there's several tickets [1][2][3] about sharing data in
> zeppelin.
> Zeppelin's goal is to be an unified data analyst platform which could
> integrate most of the big data tools and help user to switch between
> tools
> and share data between tools easily. So sharing data is a very
> critical and
> killer feature of Zeppelin IMHO.
>
> I raise this ticket to discuss about the scenario of sharing data and
> how
> to do that. Although zeppelin already provides tools and api to share
> data,
> I don't think it is mature and stable enough. After seeing these
> tickets, I
> think it might be a good time to talk about it in community and gather
> more
> feedback, so that we could provide a more stable and mature approach
> for
> it.
>
> Currently, there're 3 approaches to share data between interpreters and
> interpreter processes.
> 1. Sharing data across interpreter in the same interpreter process.
> Like
> sharing data via the same SparkContext in %spark, %spark.pyspark and
> %spark.r.
> 2. Sharing data between frontend and backend via angularObject
> 3. Sharing data across interpreter processes via Zeppelin's
> ResourcePool
>
> For this thread, I would like to talk about the approach 3 (Sharing
> data
> via Zeppelin's ResourcePool)
>
> Here's my current thinking of sharing data.
> 1. What kind of data would be shared ?
>IMHO, users would share 2 kinds of data: primitive data (string,
> number)
> and

Re: [DISCUSS] Share Data in Zeppelin

2018-07-12 Thread Jongyoul Lee

That would be great.

BTW, does ZEPL's example work for now?

On Fri, Jul 13, 2018 at 11:43 AM, Jeff Zhang  wrote:

>
> Sure, we can support plain text as well.
>
> On Fri, Jul 13, 2018 at 10:37 AM, Jongyoul Lee wrote:
>
>> Yes, it's similar to 2.b.
>>
>> Basically, my concern is to handle all kinds of data. But in your case,
>> it looks like focusing on table data. It's also useful but it would be
>> better to handle all of the data including table or plain text as well.
>> WDYT?
>>
>> About storage, we could discuss it later.
>>
>> On Fri, Jul 13, 2018 at 11:25 AM, Jeff Zhang  wrote:
>>
>>>
>>> I think your use case is the same of 2.b.  Personally I don't recommend
>>> to use z.get(noteId, paragraphId) to get the shared data for 2 reasons
>>> 1.  noteId, paragraphId is meaningless, which is not readable
>>> 2. The note will break if we clone it as the noteId is changed.
>>> That's why I suggest to use paragraph property to save paragraph's result
>>>
>>> Regarding the intermediate storage, I also though about it and agree
>>> that in the long term we should provide such layer to support large data,
>>> currently we put the shared data in memory which is not a scalable
>>> solution.  One candidate in my mind is alluxio [1], and regarding the data
>>> format I think apache arrow [2] is another good option for zeppelin to
>>> share table data across interpreter processes and different languages. But
>>> these are all implementation details, I think we can talk about them in
>>> another thread. In this thread, I think we should focus on the user facing
>>> api.
>>>
>>>
>>> [1] http://www.alluxio.org/
>>> [2] https://arrow.apache.org/
>>>
>>>
>>>
>>> On Fri, Jul 13, 2018 at 10:11 AM, Jongyoul Lee wrote:
>>>
 I have a bit different idea to share data.

 In my case,

 It would be very useful to get a paragraph's result as an input of
 other paragraphs.

 e.g.

 -- Paragrph 1
 %jdbc
 select * from some_table;

 -- Paragraph 2
 %spark
 val rdd = z.get("noteId", "paragraphId").parse.makeRddByMyself
 spark.read(table).select

 If paragraph 1's result is too big to show on FE, it would be saved in
 Zeppelin Server with proper way and pass to SparkInterpreter when Paragraph
 2 is executed.

 Basically, I think we need to intermediate storage to store paragraph's
 results to share them. We can introduce another layer or extend
 NotebootRepo. In some cases, we might change notebook repos as well.

 JL



 On Fri, Jul 13, 2018 at 10:39 AM, Jeff Zhang  wrote:

> Hi Folks,
>
> Recently, there's several tickets [1][2][3] about sharing data in
> zeppelin.
> Zeppelin's goal is to be an unified data analyst platform which could
> integrate most of the big data tools and help user to switch between
> tools
> and share data between tools easily. So sharing data is a very
> critical and
> killer feature of Zeppelin IMHO.
>
> I raise this ticket to discuss about the scenario of sharing data and
> how
> to do that. Although zeppelin already provides tools and api to share
> data,
> I don't think it is mature and stable enough. After seeing these
> tickets, I
> think it might be a good time to talk about it in community and gather
> more
> feedback, so that we could provide a more stable and mature approach
> for
> it.
>
> Currently, there're 3 approaches to share data between interpreters and
> interpreter processes.
> 1. Sharing data across interpreter in the same interpreter process.
> Like
> sharing data via the same SparkContext in %spark, %spark.pyspark and
> %spark.r.
> 2. Sharing data between frontend and backend via angularObject
> 3. Sharing data across interpreter processes via Zeppelin's
> ResourcePool
>
> For this thread, I would like to talk about the approach 3 (Sharing
> data
> via Zeppelin's ResourcePool)
>
> Here's my current thinking of sharing data.
> 1. What kind of data would be shared ?
>IMHO, users would share 2 kinds of data: primitive data (string,
> number)
> and table data.
>
> 2. How to write shared data ?
> User may want to share data via 2 approches
> a. Use ZeppelinContext (e.g. z.put).
> b. Share the paragraph result via paragraph properties. e.g. user
> may
> want to read data from oracle database via jdbc interpreter and then do
> plotting in python interpreter. In such scenario. he can save the jdbc
> result in ResourcePool via paragraph property and then read it it via
> z.get. Here's one simple example (Not implemented yet)
>
> %jdbc(saveAsTable=people)
>  select * from oracle_table
>
>  %python
>  z.getTable("people").toPandas()
>
> 3. How to read shared data ?
> User can also have 2 approaches to read

Re: [DISCUSS] Share Data in Zeppelin

2018-07-12 Thread Sanjay Dasgupta

I prefer 2.b also. Could we use (save*Result*AsTable=people) instead?

There are a few typos in the example note shared:

1) The line val peopleDF = spark.read.format("zeppelin").load() should
mention the table name (possibly as argument to load?)
2) The python line val peopleDF = z.getTable("people").toPandas() should
not have the val

The z.getTable() method could be a very good tool to judge
which use-cases are important in the community. It is easy to implement for
the in-memory data case, and could be very useful for many situations where
a small amount of data is being transferred across interpreters (like the
jdbc -> matplotlib case mentioned).
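
For the in-memory case, the idea can be sketched roughly as below. z.getTable is not implemented in Zeppelin at the time of this thread, so everything here is hypothetical: InMemoryResourcePool, put_table and get_table are stand-in names, and plain Python lists replace real interpreter results.

```python
class InMemoryResourcePool:
    """Hypothetical stand-in for a pool shared by all interpreters."""

    def __init__(self):
        self._tables = {}

    def put_table(self, name, columns, rows):
        # Store a copy so later changes by the producer do not leak in.
        self._tables[name] = (list(columns), [tuple(r) for r in rows])

    def get_table(self, name):
        if name not in self._tables:
            raise KeyError(f"no shared table named {name!r}")
        columns, rows = self._tables[name]
        # Return row dicts, which are easy to plot or convert further.
        return [dict(zip(columns, row)) for row in rows]


pool = InMemoryResourcePool()

# Roughly what a %jdbc paragraph with saveResultAsTable=people might do.
pool.put_table("people", ["name", "age"], [("alice", 30), ("bob", 25)])

# Roughly what a %python paragraph calling z.getTable("people") might receive.
people = pool.get_table("people")
ages = [p["age"] for p in people]
print(ages)
```

Because the consumer only needs the table name, the producing interpreter can be swapped without changing the reading paragraph.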

Thanks,
Sanjay

On Fri, Jul 13, 2018 at 8:07 AM, Jongyoul Lee  wrote:

> Yes, it's similar to 2.b.
>
> Basically, my concern is to handle all kinds of data. But in your case, it
> looks like focusing on table data. It's also useful but it would be better
> to handle all of the data including table or plain text as well. WDYT?
>
> About storage, we could discuss it later.
>
> On Fri, Jul 13, 2018 at 11:25 AM, Jeff Zhang  wrote:
>
>>
>> I think your use case is the same of 2.b.  Personally I don't recommend
>> to use z.get(noteId, paragraphId) to get the shared data for 2 reasons
>> 1.  noteId, paragraphId is meaningless, which is not readable
>> 2. The note will break if we clone it as the noteId is changed.
>> That's why I suggest to use paragraph property to save paragraph's result
>>
>> Regarding the intermediate storage, I also though about it and agree that
>> in the long term we should provide such layer to support large data,
>> currently we put the shared data in memory which is not a scalable
>> solution.  One candidate in my mind is alluxio [1], and regarding the data
>> format I think apache arrow [2] is another good option for zeppelin to
>> share table data across interpreter processes and different languages. But
>> these are all implementation details, I think we can talk about them in
>> another thread. In this thread, I think we should focus on the user facing
>> api.
>>
>>
>> [1] http://www.alluxio.org/
>> [2] https://arrow.apache.org/
>>
>>
>>
>> On Fri, Jul 13, 2018 at 10:11 AM, Jongyoul Lee wrote:
>>
>>> I have a bit different idea to share data.
>>>
>>> In my case,
>>>
>>> It would be very useful to get a paragraph's result as an input of other
>>> paragraphs.
>>>
>>> e.g.
>>>
>>> -- Paragrph 1
>>> %jdbc
>>> select * from some_table;
>>>
>>> -- Paragraph 2
>>> %spark
>>> val rdd = z.get("noteId", "paragraphId").parse.makeRddByMyself
>>> spark.read(table).select
>>>
>>> If paragraph 1's result is too big to show on FE, it would be saved in
>>> Zeppelin Server with proper way and pass to SparkInterpreter when Paragraph
>>> 2 is executed.
>>>
>>> Basically, I think we need to intermediate storage to store paragraph's
>>> results to share them. We can introduce another layer or extend
>>> NotebootRepo. In some cases, we might change notebook repos as well.
>>>
>>> JL
>>>
>>>
>>>
>>> On Fri, Jul 13, 2018 at 10:39 AM, Jeff Zhang  wrote:
>>>
 Hi Folks,

 Recently, there's several tickets [1][2][3] about sharing data in
 zeppelin.
 Zeppelin's goal is to be an unified data analyst platform which could
 integrate most of the big data tools and help user to switch between
 tools
 and share data between tools easily. So sharing data is a very critical
 and
 killer feature of Zeppelin IMHO.

 I raise this ticket to discuss about the scenario of sharing data and
 how
 to do that. Although zeppelin already provides tools and api to share
 data,
 I don't think it is mature and stable enough. After seeing these
 tickets, I
 think it might be a good time to talk about it in community and gather
 more
 feedback, so that we could provide a more stable and mature approach for
 it.

 Currently, there're 3 approaches to share data between interpreters and
 interpreter processes.
 1. Sharing data across interpreter in the same interpreter process. Like
 sharing data via the same SparkContext in %spark, %spark.pyspark and
 %spark.r.
 2. Sharing data between frontend and backend via angularObject
 3. Sharing data across interpreter processes via Zeppelin's ResourcePool

 For this thread, I would like to talk about the approach 3 (Sharing data
 via Zeppelin's ResourcePool)

 Here's my current thinking of sharing data.
 1. What kind of data would be shared ?
IMHO, users would share 2 kinds of data: primitive data (string,
 number)
 and table data.

 2. How to write shared data ?
 User may want to share data via 2 approches
 a. Use ZeppelinContext (e.g. z.put).
 b. Share the paragraph result via paragraph properties. e.g. user
 may
 want to read data from oracle database via jdbc interpreter and then do
 plotting in python interpreter. In such scenario. he can save the jdbc

Re: [DISCUSS] Share Data in Zeppelin

2018-07-12 Thread Jeff Zhang

Sure, we can support plain text as well.

On Fri, Jul 13, 2018 at 10:37 AM, Jongyoul Lee wrote:

> Yes, it's similar to 2.b.
>
> Basically, my concern is to handle all kinds of data. But in your case, it
> looks like focusing on table data. It's also useful but it would be better
> to handle all of the data including table or plain text as well. WDYT?
>
> About storage, we could discuss it later.
>
> On Fri, Jul 13, 2018 at 11:25 AM, Jeff Zhang  wrote:
>
>>
>> I think your use case is the same of 2.b.  Personally I don't recommend
>> to use z.get(noteId, paragraphId) to get the shared data for 2 reasons
>> 1.  noteId, paragraphId is meaningless, which is not readable
>> 2. The note will break if we clone it as the noteId is changed.
>> That's why I suggest to use paragraph property to save paragraph's result
>>
>> Regarding the intermediate storage, I also though about it and agree that
>> in the long term we should provide such layer to support large data,
>> currently we put the shared data in memory which is not a scalable
>> solution.  One candidate in my mind is alluxio [1], and regarding the data
>> format I think apache arrow [2] is another good option for zeppelin to
>> share table data across interpreter processes and different languages. But
>> these are all implementation details, I think we can talk about them in
>> another thread. In this thread, I think we should focus on the user facing
>> api.
>>
>>
>> [1] http://www.alluxio.org/
>> [2] https://arrow.apache.org/
>>
>>
>>
>> On Fri, Jul 13, 2018 at 10:11 AM, Jongyoul Lee wrote:
>>
>>> I have a bit different idea to share data.
>>>
>>> In my case,
>>>
>>> It would be very useful to get a paragraph's result as an input of other
>>> paragraphs.
>>>
>>> e.g.
>>>
>>> -- Paragrph 1
>>> %jdbc
>>> select * from some_table;
>>>
>>> -- Paragraph 2
>>> %spark
>>> val rdd = z.get("noteId", "paragraphId").parse.makeRddByMyself
>>> spark.read(table).select
>>>
>>> If paragraph 1's result is too big to show on FE, it would be saved in
>>> Zeppelin Server with proper way and pass to SparkInterpreter when Paragraph
>>> 2 is executed.
>>>
>>> Basically, I think we need to intermediate storage to store paragraph's
>>> results to share them. We can introduce another layer or extend
>>> NotebootRepo. In some cases, we might change notebook repos as well.
>>>
>>> JL
>>>
>>>
>>>
>>> On Fri, Jul 13, 2018 at 10:39 AM, Jeff Zhang  wrote:
>>>
 Hi Folks,

 Recently, there's several tickets [1][2][3] about sharing data in
 zeppelin.
 Zeppelin's goal is to be an unified data analyst platform which could
 integrate most of the big data tools and help user to switch between
 tools
 and share data between tools easily. So sharing data is a very critical
 and
 killer feature of Zeppelin IMHO.

 I raise this ticket to discuss about the scenario of sharing data and
 how
 to do that. Although zeppelin already provides tools and api to share
 data,
 I don't think it is mature and stable enough. After seeing these
 tickets, I
 think it might be a good time to talk about it in community and gather
 more
 feedback, so that we could provide a more stable and mature approach for
 it.

 Currently, there're 3 approaches to share data between interpreters and
 interpreter processes.
 1. Sharing data across interpreter in the same interpreter process. Like
 sharing data via the same SparkContext in %spark, %spark.pyspark and
 %spark.r.
 2. Sharing data between frontend and backend via angularObject
 3. Sharing data across interpreter processes via Zeppelin's ResourcePool

 For this thread, I would like to talk about the approach 3 (Sharing data
 via Zeppelin's ResourcePool)

 Here's my current thinking of sharing data.
 1. What kind of data would be shared ?
IMHO, users would share 2 kinds of data: primitive data (string,
 number)
 and table data.

 2. How to write shared data ?
 User may want to share data via 2 approches
 a. Use ZeppelinContext (e.g. z.put).
 b. Share the paragraph result via paragraph properties. e.g. user
 may
 want to read data from oracle database via jdbc interpreter and then do
 plotting in python interpreter. In such scenario. he can save the jdbc
 result in ResourcePool via paragraph property and then read it it via
 z.get. Here's one simple example (Not implemented yet)

 %jdbc(saveAsTable=people)
  select * from oracle_table

  %python
  z.getTable("people").toPandas()

 3. How to read shared data ?
 User can also have 2 approaches to read the shared data.
 a. Via ZeppelinContext. (e.g.  z.get, z.getTable)
 b. Via variable substitution [1]

 Here's one sample note which illustrate the scenario of sharing data.

Re: [DISCUSS] Share Data in Zeppelin

2018-07-12 Thread Jongyoul Lee

Yes, it's similar to 2.b.

Basically, my concern is handling all kinds of data. But your case looks like
it focuses on table data. That's also useful, but it would be better
to handle all kinds of data, including tables and plain text. WDYT?

About storage, we could discuss it later.

On Fri, Jul 13, 2018 at 11:25 AM, Jeff Zhang  wrote:

>
> I think your use case is the same of 2.b.  Personally I don't recommend to
> use z.get(noteId, paragraphId) to get the shared data for 2 reasons
> 1.  noteId, paragraphId is meaningless, which is not readable
> 2. The note will break if we clone it as the noteId is changed.
> That's why I suggest to use paragraph property to save paragraph's result
>
> Regarding the intermediate storage, I also though about it and agree that
> in the long term we should provide such layer to support large data,
> currently we put the shared data in memory which is not a scalable
> solution.  One candidate in my mind is alluxio [1], and regarding the data
> format I think apache arrow [2] is another good option for zeppelin to
> share table data across interpreter processes and different languages. But
> these are all implementation details, I think we can talk about them in
> another thread. In this thread, I think we should focus on the user facing
> api.
>
>
> [1] http://www.alluxio.org/
> [2] https://arrow.apache.org/
>
>
>
> On Fri, Jul 13, 2018 at 10:11 AM, Jongyoul Lee wrote:
>
>> I have a slightly different idea for sharing data.
>>
>> In my case,
>>
>> It would be very useful to get a paragraph's result as an input of other
>> paragraphs.
>>
>> e.g.
>>
>> -- Paragraph 1
>> %jdbc
>> select * from some_table;
>>
>> -- Paragraph 2
>> %spark
>> val rdd = z.get("noteId", "paragraphId").parse.makeRddByMyself
>> spark.read(table).select
>>
>> If paragraph 1's result is too big to show on the FE, it would be saved in
>> Zeppelin Server in a proper way and passed to SparkInterpreter when Paragraph
>> 2 is executed.
>>
>> Basically, I think we need intermediate storage to store paragraphs'
>> results in order to share them. We can introduce another layer or extend
>> NotebookRepo. In some cases, we might change notebook repos as well.
>>
>> JL
>>
>>
>>
>> On Fri, Jul 13, 2018 at 10:39 AM, Jeff Zhang  wrote:
>>
>>> Hi Folks,
>>>
>>> Recently, there have been several tickets [1][2][3] about sharing data in
>>> Zeppelin. Zeppelin's goal is to be a unified data analysis platform that can
>>> integrate most of the big data tools and help users switch between tools
>>> and share data between them easily. So sharing data is a very critical
>>> and killer feature of Zeppelin IMHO.
>>>
>>> I am raising this thread to discuss the scenarios for sharing data and how
>>> to do that. Although Zeppelin already provides tools and APIs to share
>>> data, I don't think they are mature and stable enough. After seeing these
>>> tickets, I think it might be a good time to talk about it in the community
>>> and gather more feedback, so that we can provide a more stable and mature
>>> approach for it.
>>>
>>> Currently, there are 3 approaches to sharing data between interpreters and
>>> interpreter processes.
>>> 1. Sharing data across interpreters in the same interpreter process, like
>>> sharing data via the same SparkContext in %spark, %spark.pyspark and
>>> %spark.r.
>>> 2. Sharing data between frontend and backend via angularObject
>>> 3. Sharing data across interpreter processes via Zeppelin's ResourcePool
>>>
>>> For this thread, I would like to talk about the approach 3 (Sharing data
>>> via Zeppelin's ResourcePool)
>>>
>>> Here's my current thinking of sharing data.
>>> 1. What kind of data would be shared ?
>>>    IMHO, users would share 2 kinds of data: primitive data (string,
>>> number) and table data.
>>>
>>> 2. How to write shared data ?
>>> Users may want to share data via 2 approaches:
>>> a. Use ZeppelinContext (e.g. z.put).
>>> b. Share the paragraph result via paragraph properties. E.g. a user may
>>> want to read data from an Oracle database via the jdbc interpreter and then do
>>> plotting in the python interpreter. In such a scenario, they can save the jdbc
>>> result in ResourcePool via a paragraph property and then read it via
>>> z.get. Here's one simple example (not implemented yet):
>>>
>>> %jdbc(saveAsTable=people)
>>>  select * from oracle_table
>>>
>>>  %python
>>>  z.getTable("people").toPandas()
>>>
>>> 3. How to read shared data ?
>>> User can also have 2 approaches to read the shared data.
>>> a. Via ZeppelinContext. (e.g.  z.get, z.getTable)
>>> b. Via variable substitution [1]
>>>
>>> Here's one sample note which illustrate the scenario of sharing data.
>>> https://www.zepl.com/viewer/notebooks/bm90ZTovL3pqZmZkdS8zMzkxZjg3Ym
>>> FhMjg0MDY3OGM1ZmYzODAwODAxMGJhNy9ub3RlLmpzb24
>>>
>>> This is just my current thinking of sharing data in zeppelin, it
>>> definitely
>>> doesn't cover all the scenarios, so I raise this thread to discuss about
>>> in
>>>

Re: [DISCUSS] Share Data in Zeppelin

2018-07-12 Thread Jeff Zhang

I think your use case is the same as 2.b. Personally, I don't recommend
using z.get(noteId, paragraphId) to get the shared data, for 2 reasons:
1. noteId and paragraphId are opaque identifiers, so the code is not readable.
2. The note will break if we clone it, because the noteId changes.
That's why I suggest using a paragraph property to save the paragraph's result.
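
To make the two reasons concrete, here is a minimal Python sketch. It is not
Zeppelin code: ToyPool, clone_note, run_note and the ids "note-A" and "p1" are
hypothetical stand-ins invented for illustration. It only shows why a lookup
keyed by a user-chosen name keeps working in a cloned note, while a lookup
keyed by a hard-coded (noteId, paragraphId) pair does not.

```python
import uuid


class ToyPool:
    """Hypothetical in-memory stand-in for a resource pool (not the Zeppelin API)."""

    def __init__(self):
        self._by_id = {}    # (note_id, paragraph_id) -> result
        self._by_name = {}  # user-chosen name -> result

    def save_result(self, note_id, paragraph_id, result, name=None):
        self._by_id[(note_id, paragraph_id)] = result
        if name is not None:
            self._by_name[name] = result

    def get_by_id(self, note_id, paragraph_id):
        return self._by_id.get((note_id, paragraph_id))

    def get_by_name(self, name):
        return self._by_name.get(name)


def clone_note(note_id):
    """A cloned note gets a fresh id (assumed format, for illustration only)."""
    return "clone-" + uuid.uuid4().hex[:8]


def run_note(pool, note_id):
    """Run the 'writer' paragraph, then read the result back both ways."""
    pool.save_result(note_id, "p1", [("alice", 30)], name="people")
    # The reader paragraph was written against the original note, so the
    # note id inside it is hard-coded and does not follow the clone.
    by_id = pool.get_by_id("note-A", "p1")
    by_name = pool.get_by_name("people")
    return by_id, by_name


if __name__ == "__main__":
    print(run_note(ToyPool(), "note-A"))              # both lookups succeed
    print(run_note(ToyPool(), clone_note("note-A")))  # id lookup returns None
```

In the cloned note only the name-based lookup still finds the data, which is
the behaviour a paragraph property such as saveAsTable=people is meant to give.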

Regarding the intermediate storage, I have also thought about it and agree that
in the long term we should provide such a layer to support large data.
Currently we keep the shared data in memory, which is not a scalable
solution. One candidate in my mind is Alluxio [1], and regarding the data
format I think Apache Arrow [2] is another good option for Zeppelin to
share table data across interpreter processes and different languages. But
these are all implementation details, and I think we can talk about them in
another thread. In this thread, I think we should focus on the user-facing
API.


[1] http://www.alluxio.org/
[2] https://arrow.apache.org/



Jongyoul Lee wrote on Fri, Jul 13, 2018 at 10:11 AM:

> I have a slightly different idea for sharing data.
>
> In my case,
>
> It would be very useful to get a paragraph's result as an input to other
> paragraphs.
>
> e.g.
>
> -- Paragraph 1
> %jdbc
> select * from some_table;
>
> -- Paragraph 2
> %spark
> val rdd = z.get("noteId", "paragraphId").parse.makeRddByMyself
> spark.read(table).select
>
> If paragraph 1's result is too big to show on the FE, it would be saved on
> the Zeppelin Server in a proper way and passed to SparkInterpreter when
> Paragraph 2 is executed.
>
> Basically, I think we need intermediate storage to store paragraphs'
> results in order to share them. We can introduce another layer or extend
> NotebookRepo. In some cases, we might change notebook repos as well.
>
> JL
>
>
>
>
>
>
> --
> 이종열, Jongyoul Lee, 李宗烈
> http://madeng.net
>

Re: [DISCUSS] Share Data in Zeppelin

2018-07-12 Thread Jongyoul Lee

I have a slightly different idea for sharing data.

In my case,

It would be very useful to get a paragraph's result as an input to other
paragraphs.

e.g.

-- Paragraph 1
%jdbc
select * from some_table;

-- Paragraph 2
%spark
val rdd = z.get("noteId", "paragraphId").parse.makeRddByMyself
spark.read(table).select

If paragraph 1's result is too big to show on the FE, it would be saved on
the Zeppelin Server in a proper way and passed to SparkInterpreter when
Paragraph 2 is executed.
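
A rough sketch of the "parse" step from Paragraph 2 above, written in Python
for brevity. It assumes the paragraph result arrives as text in Zeppelin's
table display format (a header row and data rows, tab-separated, one row per
line, optionally prefixed with %table). parse_table_result is a hypothetical
helper, not an existing Zeppelin function.

```python
def parse_table_result(text):
    """Parse a tab-separated paragraph result into a list of row dicts.

    Assumption: the first line is the header, optionally prefixed with
    "%table". All values are kept as strings; type inference is left to
    the caller.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []

    first = lines[0]
    if first.startswith("%table"):
        first = first[len("%table"):].lstrip(" ")
        if first:
            lines[0] = first   # header followed "%table" on the same line
        else:
            lines = lines[1:]  # "%table" stood alone on its own line
    if not lines:
        return []

    header = lines[0].split("\t")
    rows = []
    for ln in lines[1:]:
        values = ln.split("\t")
        # Pad short rows so every dict has all columns.
        values += [""] * (len(header) - len(values))
        rows.append(dict(zip(header, values)))
    return rows


if __name__ == "__main__":
    sample = "%table name\tage\nalice\t30\nbob\t25\n"
    for row in parse_table_result(sample):
        print(row)
```

The resulting list of dicts can be passed to pandas.DataFrame(rows). This only
works while the whole result fits in memory, which is why the large-result
case needs some form of intermediate storage.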

Basically, I think we need intermediate storage to store paragraphs'
results in order to share them. We can introduce another layer or extend
NotebookRepo. In some cases, we might change notebook repos as well.

JL



On Fri, Jul 13, 2018 at 10:39 AM, Jeff Zhang  wrote:

> Hi Folks,
>
> Recently, there have been several tickets [1][2][3] about sharing data in
> Zeppelin. Zeppelin's goal is to be a unified data analytics platform which
> can integrate most of the big data tools and help users switch between tools
> and share data between them easily. So sharing data is a critical, killer
> feature of Zeppelin IMHO.
>
> I am raising this thread to discuss the scenarios for sharing data and how
> to support them. Although Zeppelin already provides tools and APIs to share
> data, I don't think they are mature and stable enough. After seeing these
> tickets, I think it is a good time to talk about it in the community and
> gather more feedback, so that we can provide a more stable and mature
> approach.
>
> Currently, there are 3 approaches to sharing data between interpreters and
> interpreter processes:
> 1. Sharing data across interpreters in the same interpreter process, like
> sharing data via the same SparkContext in %spark, %spark.pyspark and
> %spark.r.
> 2. Sharing data between frontend and backend via angularObject.
> 3. Sharing data across interpreter processes via Zeppelin's ResourcePool.
>
> For this thread, I would like to talk about approach 3 (sharing data
> via Zeppelin's ResourcePool).
>
> Here's my current thinking on sharing data.
> 1. What kind of data would be shared?
> IMHO, users would share 2 kinds of data: primitive data (string, number)
> and table data.
>
> 2. How to write shared data?
> Users may want to share data via 2 approaches:
> a. Use ZeppelinContext (e.g. z.put).
> b. Share the paragraph result via paragraph properties. E.g. a user may
> want to read data from an Oracle database via the jdbc interpreter and then
> do plotting in the python interpreter. In such a scenario, they can save the
> jdbc result in the ResourcePool via a paragraph property and then read it via
> z.get. Here's one simple example (not implemented yet):
>
> %jdbc(saveAsTable=people)
> select * from oracle_table
>
> %python
> z.getTable("people").toPandas()
>
> 3. How to read shared data?
> Users also have 2 approaches to read the shared data:
> a. Via ZeppelinContext (e.g. z.get, z.getTable).
> b. Via variable substitution [1].
>
> Here's one sample note which illustrates the scenario of sharing data:
> https://www.zepl.com/viewer/notebooks/bm90ZTovL3pqZmZkdS8zMzkxZjg3YmFhMjg0MDY3OGM1ZmYzODAwODAxMGJhNy9ub3RlLmpzb24
>
> This is just my current thinking on sharing data in Zeppelin. It definitely
> doesn't cover all the scenarios, so I am raising this thread to discuss it
> in the community. Any feedback and comments are welcome.
>
>
> [1]. https://issues.apache.org/jira/browse/ZEPPELIN-3377
> [2]. https://issues.apache.org/jira/browse/ZEPPELIN-3596
> [3]. https://issues.apache.org/jira/browse/ZEPPELIN-3617
>



-- 
이종열, Jongyoul Lee, 李宗烈
http://madeng.net
