Re: Zeppelin distributed architecture design

2018-07-17 Thread liuxun
Hi Ruslan Dautkhanov,

Thank you very much for your question. Following your advice, I added three
schematics to illustrate:
1. Distributed Zeppelin deployment architecture diagram.
2. Distributed Zeppelin Server fault-tolerance diagram.
3. Distributed Zeppelin Server & interpreter process fault-tolerance diagram.


The email attachment exceeded the size limit, so I reorganized the document and
moved it to Google Docs:
https://docs.google.com/document/d/1a8QLSyR3M5AhlG1GIYuDTj6bwazeuVDKCRRBm-Qa3Bw/edit?usp=sharing
 



> On Jul 18, 2018, at 1:03 PM, liuxun wrote:
> 
> Hi Ruslan Dautkhanov,
> 
> Thank you very much for your question. Following your advice, I added three
> schematics to illustrate:
> 1. Zeppelin cluster architecture diagram.
> 2. Distributed Zeppelin Server fault-tolerance diagram.
> 3. Distributed Zeppelin Server & interpreter process fault-tolerance diagram.
> 
> Later, I will merge the schematics into the system design document.
> 
>> On Jul 18, 2018, at 1:16 AM, Ruslan Dautkhanov wrote:
>> 
>> Nice.
>> 
>> Thanks for sharing.
>> 
>> Can you explain how users are routed to a particular Zeppelin server
>> instance? I've seen nginx on top of them, but I don't think the document
>> covers the details. If one Zeppelin server goes down or becomes unhealthy,
>> is nginx supposed to detect that (if so, how?) and reroute users to a
>> surviving instance?
>> 
>> Thanks,
>> Ruslan Dautkhanov
>> 
>> 
>> On Tue, Jul 17, 2018 at 2:46 AM liuxun wrote:
>> 
>>> hi:
>>> 
>>> Our company has installed and deployed Zeppelin widely for data analysis.
>>> The single-server version of Zeppelin could not meet our application
>>> scenarios, so we transformed Zeppelin into a clustered service that
>>> supports distributed deployment, with a unified entry point, high
>>> availability, and high server-resource utilization. The design document
>>> covers the entire design, and I am very happy to contribute our modified
>>> code back to the community.
>>> 
>>> 
>>> This is the JIRA I submitted to the community:
>>> 
>>> https://issues.apache.org/jira/browse/ZEPPELIN-3471 
>>> 
>>> 
>>> 
>>> Since the design document exceeds the mail attachment size limit, I am
>>> sending links instead:
>>> 
>>> https://issues.apache.org/jira/secure/attachment/12931896/Zeppelin%20distributed%20architecture%20design.pdf
>>> 
>>> https://issues.apache.org/jira/secure/attachment/12931895/zepplin%20Cluster%20Sequence%20Diagram.png
>>> 
>>> 
>>> liuxun
>>> 
> 



Re: [DISCUSS] Disable supports on Spark 1.6.3

2018-07-17 Thread Jeff Zhang
I created ZEPPELIN-3635 for dropping support for Spark versions before 1.6. If
you have any concerns, please comment on that JIRA.



Clemens Valiente wrote on Tue, Jul 17, 2018 at 4:05 PM:

> As far as I know, the Cloudera distribution of hadoop still comes with
> Spark 1.6 out of the box, so I believe there are still quite a few people
> stuck on it.
>
> On Tue, 2018-07-17 at 10:40 +0900, Jongyoul Lee wrote:
>
> I think the current release is good enough for Spark 1.6.x users. For
> future releases, it would be better to focus on 2.x only.
>
> As for versions older than 1.6, I fully agree; personally, I think we
> should drop them.
>
> On Tue, Jul 17, 2018 at 10:16 AM, Jeff Zhang  wrote:
>
>
> This might be a little risky, as it depends on how many people still use
> Spark 1.6. But at the least, I would suggest disabling support for any
> Spark before 1.6. There is a lot of legacy code in Zeppelin to support very
> old versions of Spark (e.g. 1.5, 1.4). We don't actually have a Travis job
> for any Spark before 1.6, so we don't know whether this legacy code still
> works. Maintaining it is an extra effort for the community, so I would
> suggest disabling support for Spark before 1.6 at least.
>
>
Jongyoul Lee wrote on Tue, Jul 17, 2018 at 9:10 AM:
>
> Hi,
>
> Today, I found that the Apache Spark 1.6.3 distribution was removed from
> the Apache CDN. We can still get the 1.6.3 link from Apache Spark, but it
> is only available for download from the Apache archive. I'm not sure how
> many people still use Spark 1.6.3 with Apache Zeppelin, but in my opinion,
> this means Spark 1.6.3 is no longer active.
>
> Until now, AFAIK, we have followed Spark's own policy on which versions of
> Apache Spark to support.
>
> I suggest that we also remove Spark 1.6.3 from the officially supported
> versions for the next major Apache Zeppelin release - 0.9.0 or 1.0.0. If we
> could focus support on Spark 2.x only, we could make the SparkInterpreter
> more solid.
>
> WDYT?
>
> Best regards,
> JL
>
> --
> 이종열, Jongyoul Lee, 李宗烈
> http://madeng.net
>
>
>
>
>


Re: Zeppelin distributed architecture design

2018-07-17 Thread Ruslan Dautkhanov
Nice.

Thanks for sharing.

Can you explain how users are routed to a particular Zeppelin server
instance? I've seen nginx on top of them, but I don't think the document
covers the details. If one Zeppelin server goes down or becomes unhealthy,
is nginx supposed to detect that (if so, how?) and reroute users to a
surviving instance?

Thanks,
Ruslan Dautkhanov
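
For context on the question above: with open-source nginx, this kind of failover is usually done with passive health checks on an upstream block. A minimal sketch, assuming two Zeppelin servers behind nginx (hostnames, ports, and thresholds here are hypothetical, not taken from the design document):

```nginx
# Sketch only: hostnames, ports, and thresholds are hypothetical.
upstream zeppelin {
    # Passive health checks: after 3 failed attempts within 30s,
    # nginx considers the server down for the next 30s and routes
    # traffic to the remaining server(s).
    server zeppelin-1.example.com:8080 max_fails=3 fail_timeout=30s;
    server zeppelin-2.example.com:8080 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;

    location / {
        proxy_pass http://zeppelin;
        # Zeppelin's UI relies on websockets, so forward upgrade headers.
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```

Note that active health checks (probing a health endpoint on a timer) are a paid NGINX Plus feature; open-source nginx only marks a server down after real client requests to it fail.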


On Tue, Jul 17, 2018 at 2:46 AM liuxun  wrote:

> hi:
>
> Our company has installed and deployed Zeppelin widely for data analysis.
> The single-server version of Zeppelin could not meet our application
> scenarios, so we transformed Zeppelin into a clustered service that
> supports distributed deployment, with a unified entry point, high
> availability, and high server-resource utilization. The design document
> covers the entire design, and I am very happy to contribute our modified
> code back to the community.
>
>
> This is the JIRA I submitted to the community:
>
> https://issues.apache.org/jira/browse/ZEPPELIN-3471
>
>
> Since the design document exceeds the mail attachment size limit, I am
> sending links instead:
>
> https://issues.apache.org/jira/secure/attachment/12931896/Zeppelin%20distributed%20architecture%20design.pdf
>
> https://issues.apache.org/jira/secure/attachment/12931895/zepplin%20Cluster%20Sequence%20Diagram.png
>
>
> liuxun
>


Re: ZeppelinContext Not Found in yarn-cluster Mode

2018-07-17 Thread Jongyoul Lee
I have the same issue. We might need to investigate it in depth. Could you
please file a JIRA issue for it?

Regards,
JL

On Tue, Jul 17, 2018 at 7:27 PM, Chris Penny  wrote:

> Hi all,
>
> Thanks for the 0.8.0 release!
>
> We’re keen to take advantage of the yarn-cluster support to take the
> pressure off our Zeppelin host. However, I am having some trouble with it.
> The first problem was in following the documentation here:
> https://zeppelin.apache.org/docs/0.8.0/interpreter/spark.html
>
> This suggests that we need to modify the master configuration from
> “yarn-client” to “yarn-cluster”. However, doing so results in the following
> error:
>
> Warning: Master yarn-cluster is deprecated since 2.0. Please use master
> “yarn” with specified deploy mode instead.
> Error: Client deploy mode is not compatible with master “yarn-cluster”
> Run with --help for usage help or --verbose for debug output
> 
>
> I got past this error with the following settings:
> master = yarn
> spark.submit.deployMode = cluster
>
> I’m somewhat unclear whether I’m straying from the correct (documented)
> configuration or whether the documentation needs an update. Anyway:
>
> These settings appear to work for everything except the ZeppelinContext,
> which is missing.
> Code:
> %spark
> z
>
> Output:
> :24: error: not found: value z
>
> Using yarn-client mode I can identify that z is meant to be an instance of
> org.apache.zeppelin.spark.SparkZeppelinContext
> Code:
> %spark
> z
>
> Output:
> res4: org.apache.zeppelin.spark.SparkZeppelinContext =
> org.apache.zeppelin.spark.SparkZeppelinContext@5b9282e1
>
> However, this class is absent in cluster-mode:
> Code:
> %spark
> org.apache.zeppelin.spark.SparkZeppelinContext
>
> Output:
> :24: error: object zeppelin is not a member of package org.apache
>org.apache.zeppelin.spark.SparkZeppelinContext
>   ^
>
> Snooping around the Zeppelin installation I was able to locate this class
> in ${ZEPPELIN_INSTALL_DIR}/interpreter/spark/spark-interpreter-0.8.0.jar.
> I then uploaded this jar to HDFS and added it to spark.jars &
> spark.driver.extraClassPath. Relevant entries in driver log:
>
> …
> Added JAR hdfs:/spark-interpreter-0.8.0.jar at
> hdfs:/tmp/zeppelin/spark-interpreter-0.8.0.jar with timestamp
> 1531732774379
> …
> CLASSPATH -> …:hdfs:/tmp/zeppelin/spark-interpreter-0.8.0.jar …
> …
> command:
> …
> file:$PWD/spark-interpreter-0.8.0.jar \
> etc.
>
> However, I still can’t use the ZeppelinContext or
> org.apache.zeppelin.spark.SparkZeppelinContext class. At this point I’ve
> run out of ideas and am willing to ask for help.
>
> Does anyone have thoughts on how I could use the ZeppelinContext in yarn
> cluster mode?
>
> Regards, Chris.
>
>


-- 
이종열, Jongyoul Lee, 李宗烈
http://madeng.net


Re: [DISCUSS] Share Data in Zeppelin

2018-07-17 Thread Belousov Maksim Eduardovich
The ability to work with many data sources is one of the reasons we chose
Apache Zeppelin.

For branch-0.7, our ops team wrote many Python functions for importing and
exporting data from different sources (Greenplum, Hive, Oracle), using Python
DataFrames as middleware.
Our users can upload flat files to Zeppelin via Samba, then load them into
databases and run queries.

The availability of the ResourcePool in 0.8 is a big milestone.
I hope the ResourcePool will let us smoothly integrate all the sources in our
company. It would be great if interpreters other than Spark and Python could
also get data from the ResourcePool.
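
For readers new to the idea, a resource pool is essentially a shared key-value store that interpreters write to and read from by name. The toy Python sketch below is not Zeppelin's actual API (the class and calls are invented for illustration); it only shows the concept, and why user-chosen names are friendlier to note cloning than positional (noteId, paragraphId) keys:

```python
# Toy sketch of a cross-paragraph resource pool. This is NOT Zeppelin's
# actual API; the class and method names here are invented for illustration.

class ResourcePool:
    """Minimal in-memory, name-keyed store shared between paragraphs."""

    def __init__(self):
        self._store = {}

    def put(self, name, value):
        self._store[name] = value

    def get(self, name):
        # Returns None when nothing was stored under that name.
        return self._store.get(name)


pool = ResourcePool()

# Paragraph 1 (imagine %jdbc saving its result under a user-chosen name):
pool.put("people", [("alice", 30), ("bob", 25)])

# Paragraph 2 (imagine %python reading it back by the same stable name):
assert pool.get("people") == [("alice", 30), ("bob", 25)]

# By contrast, keying on (noteId, paragraphId) breaks when a note is
# cloned: the clone gets a new noteId, so its lookups silently miss.
pool.put(("note-1", "para-1"), "result")
assert pool.get(("note-2", "para-1")) is None
```

Zeppelin's real pool must also work across interpreter processes, which is where the storage-layer discussion quoted below comes in.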

The 2b case is nice.
For now, I think transmitting table data is sufficient.



Regards,
Maxim Belousov


From: Jeff Zhang 
Sent: July 13, 2018, 6:00 AM
To: users@zeppelin.apache.org
Cc: dev
Subject: Re: [DISCUSS] Share Data in Zeppelin

Thanks Sanjay, I have fixed the example note.

*Folks, please note:* the example note is just a mock-up; it won't actually
run for now.



Jongyoul Lee wrote on Fri, Jul 13, 2018 at 10:54 AM:

> BTW, we need to consider the case where the result is large at design
> time. In my experience, if we implement this feature, users will use it
> with large data.
>
> On Fri, Jul 13, 2018 at 11:51 AM, Sanjay Dasgupta <
> sanjay.dasgu...@gmail.com> wrote:
>
>> I prefer 2.b also. Could we use (save*Result*AsTable=people) instead?
>>
>> There are a few typos in the example note shared:
>>
>> 1) The line val peopleDF = spark.read.format("zeppelin").load() should
>> mention the table name (possibly as an argument to load?)
>> 2) The Python line val peopleDF = z.getTable("people").toPandas() should
>> not have the val
>>
>>
>> The z.getTable() method could be a very good tool to judge
>> which use-cases are important in the community. It is easy to implement for
>> the in-memory data case, and could be very useful for many situations where
>> a small amount of data is being transferred across interpreters (like the
>> jdbc -> matplotlib case mentioned).
>>
>> Thanks,
>> Sanjay
>>
>> On Fri, Jul 13, 2018 at 8:07 AM, Jongyoul Lee  wrote:
>>
>>> Yes, it's similar to 2.b.
>>>
>>> Basically, my concern is handling all kinds of data. Your case seems
>>> focused on table data. That is also useful, but it would be better to
>>> handle all kinds of data, including tables and plain text as well.
>>> WDYT?
>>>
>>> About storage, we could discuss it later.
>>>
>>> On Fri, Jul 13, 2018 at 11:25 AM, Jeff Zhang  wrote:
>>>

 I think your use case is the same as 2.b. Personally, I don't recommend
 using z.get(noteId, paragraphId) to get the shared data, for two reasons:
 1. noteId and paragraphId are meaningless identifiers, so the code is not
 readable.
 2. The note will break if we clone it, as the noteId changes.
 That's why I suggest using a paragraph property to save the paragraph's
 result.

 Regarding the intermediate storage, I also thought about it, and I agree
 that in the long term we should provide such a layer to support large data;
 currently we put the shared data in memory, which is not a scalable
 solution. One candidate in my mind is Alluxio [1], and regarding the data
 format, I think Apache Arrow [2] is a good option for Zeppelin to share
 table data across interpreter processes and different languages. But these
 are all implementation details; I think we can talk about them in another
 thread. In this thread, I think we should focus on the user-facing API.


 [1] http://www.alluxio.org/
 [2] https://arrow.apache.org/



Jongyoul Lee wrote on Fri, Jul 13, 2018 at 10:11 AM:

> I have a bit different idea to share data.
>
> In my case,
>
> It would be very useful to use one paragraph's result as the input of
> other paragraphs.
>
> e.g.
>
> -- Paragraph 1
> %jdbc
> select * from some_table;
>
> -- Paragraph 2
> %spark
> val rdd = z.get("noteId", "paragraphId").parse.makeRddByMyself
> spark.read(table).select
>
> If Paragraph 1's result is too big to show on the frontend, it would be
> saved in the Zeppelin server in a proper way and passed to the
> SparkInterpreter when Paragraph 2 is executed.
>
> Basically, I think we need intermediate storage to hold paragraphs'
> results so they can be shared. We can introduce another layer or extend
> NotebookRepo. In some cases, we might change notebook repos as well.
>
> JL
>
>
>
> On Fri, Jul 13, 2018 at 10:39 AM, Jeff Zhang  wrote:
>
>> Hi Folks,
>>
>> Recently, there have been several tickets [1][2][3] about sharing data in
>> Zeppelin.
>> Zeppelin's goal is to be a unified data-analysis platform that integrates
>> most of the big data tools and helps users switch between tools and share
>> data between them easily. So sharing data is a very critical and killer
>> feature of Zeppelin, IMHO.
>>
>> I raise this ticket to 

ZeppelinContext Not Found in yarn-cluster Mode

2018-07-17 Thread Chris Penny
Hi all,

Thanks for the 0.8.0 release!

We’re keen to take advantage of the yarn-cluster support to take the
pressure off our Zeppelin host. However, I am having some trouble with it.
The first problem was in following the documentation here:
https://zeppelin.apache.org/docs/0.8.0/interpreter/spark.html

This suggests that we need to modify the master configuration from
“yarn-client” to “yarn-cluster”. However, doing so results in the following
error:

Warning: Master yarn-cluster is deprecated since 2.0. Please use master
“yarn” with specified deploy mode instead.
Error: Client deploy mode is not compatible with master “yarn-cluster”
Run with --help for usage help or --verbose for debug output


I got past this error with the following settings:
master = yarn
spark.submit.deployMode = cluster

I’m somewhat unclear whether I’m straying from the correct (documented)
configuration or whether the documentation needs an update. Anyway:

These settings appear to work for everything except the ZeppelinContext,
which is missing.
Code:
%spark
z

Output:
:24: error: not found: value z

Using yarn-client mode I can identify that z is meant to be an instance of
org.apache.zeppelin.spark.SparkZeppelinContext
Code:
%spark
z

Output:
res4: org.apache.zeppelin.spark.SparkZeppelinContext =
org.apache.zeppelin.spark.SparkZeppelinContext@5b9282e1

However, this class is absent in cluster-mode:
Code:
%spark
org.apache.zeppelin.spark.SparkZeppelinContext

Output:
:24: error: object zeppelin is not a member of package org.apache
   org.apache.zeppelin.spark.SparkZeppelinContext
  ^

Snooping around the Zeppelin installation I was able to locate this class
in ${ZEPPELIN_INSTALL_DIR}/interpreter/spark/spark-interpreter-0.8.0.jar. I
then uploaded this jar to HDFS and added it to spark.jars &
spark.driver.extraClassPath. Relevant entries in driver log:

…
Added JAR hdfs:/spark-interpreter-0.8.0.jar at
hdfs:/tmp/zeppelin/spark-interpreter-0.8.0.jar
with timestamp 1531732774379
…
CLASSPATH -> …:hdfs:/tmp/zeppelin/spark-interpreter-0.8.0.jar …
…
command:
…
file:$PWD/spark-interpreter-0.8.0.jar \
etc.

However, I still can’t use the ZeppelinContext or
org.apache.zeppelin.spark.SparkZeppelinContext
class. At this point I’ve run out of ideas and am willing to ask for help.

Does anyone have thoughts on how I could use the ZeppelinContext in yarn
cluster mode?

Regards, Chris.


Zeppelin distributed architecture design

2018-07-17 Thread liuxun
hi:

Our company has installed and deployed Zeppelin widely for data analysis. The
single-server version of Zeppelin could not meet our application scenarios, so
we transformed Zeppelin into a clustered service that supports distributed
deployment, with a unified entry point, high availability, and high
server-resource utilization. The design document covers the entire design, and
I am very happy to contribute our modified code back to the community.


This is the JIRA I submitted to the community:

https://issues.apache.org/jira/browse/ZEPPELIN-3471 



Since the design document exceeds the mail attachment size limit, I am sending
links instead:
https://issues.apache.org/jira/secure/attachment/12931896/Zeppelin%20distributed%20architecture%20design.pdf
 

https://issues.apache.org/jira/secure/attachment/12931895/zepplin%20Cluster%20Sequence%20Diagram.png
 



liuxun