I would agree with John Omernik's point about the portability of JSON notes,
given their strong dependency on the configured interpreters.

Which gives me an idea: what about "exporting" the interpreter config into the
note.json file?

Let's say your note has 10 paragraphs but they only use 3 different
interpreters. Upon export, we would just fetch the current config for those
3 interpreters and save it in the note.json.
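
To make that concrete, here is a minimal sketch of the export step in Python.
Everything in it is a placeholder I made up for illustration (the
"interpreterSettings" field name, the shape of the note and config dicts); it
is not the actual Zeppelin code or note.json schema:

import json

def export_note(note, interpreter_settings):
    """Sketch: embed the config of every interpreter the note uses.

    `note` is a dict in a note.json-like layout; `interpreter_settings`
    maps interpreter name -> its current config (both hypothetical).
    """
    used = {p["interpreter"] for p in note["paragraphs"]}
    # e.g. 10 paragraphs may reference only {"spark", "jdbc", "md"}
    note["interpreterSettings"] = {name: interpreter_settings[name]
                                   for name in used}
    return json.dumps(note, indent=2)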

On import of the note.json, there is more work to do:

- if there are already 3 interpreters matching the ones saved in the note.json,
check their current config
   - if the config matches, import the note
   - else, ask the user with a dialog whether they want to 1) keep the current
interpreter config, 2) override the current interpreter config with the one in
the note.json, or 3) try to merge the configurations

- for each of the 3 interpreters in the note.json that has no matching
interpreter instance, propose to create one for the user from the config saved
in the note

And for backward compatibility with the old note.json format: on import, if we
don't find any interpreter-related info, we just skip the whole config-checking
step above. A rough sketch of this import flow is below.
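
Sketched out under the same made-up layout as the export sketch above
(`confirm`, `choose` and `create_interpreter` stand in for whatever dialogs
and actions the UI would actually provide):

def import_note(note, interpreter_settings,
                confirm, choose, create_interpreter):
    """Sketch of the import flow; all callbacks are hypothetical."""
    saved = note.pop("interpreterSettings", None)
    if saved is None:
        return note  # old note.json format: skip the config checking

    for name, saved_conf in saved.items():
        current = interpreter_settings.get(name)
        if current is None:
            # no matching interpreter instance: offer to create one
            if confirm(f"Create interpreter '{name}' from the note's config?"):
                create_interpreter(name, saved_conf)
        elif current != saved_conf:
            choice = choose(f"Config mismatch for '{name}'",
                            ["keep current", "override", "merge"])
            if choice == "override":
                interpreter_settings[name] = saved_conf
            elif choice == "merge":
                # naive merge: the note's values win on conflicts
                interpreter_settings[name] = {**current, **saved_conf}
        # else: configs match, nothing to do, just import the note
    return note

The merge in option 3 is deliberately naive here; a real implementation would
probably need per-property conflict resolution.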

What do you think? It's a little bit complex, but I guess it would greatly
help portability. I'm not saying it's easy, and indeed it would require a lot
of code changes, but I'm just throwing out some ideas to feed the discussion.



On Fri, Apr 29, 2016 at 5:41 PM, John Omernik <j...@omernik.com> wrote:

> Moon -
>
> I would be curious on your thoughts on my email from April 12th.
>
> John
>
>
>
> On Tue, Apr 12, 2016 at 7:11 AM, John Omernik <j...@omernik.com> wrote:
>
>> I would actually argue that if the user doesn't have access to the same
>> or a similar interpreter.json file, then notebook file portability is a
>> moot point. For example, if I set up %spark or %jdbc in my environment and
>> create a notebook, that notebook is not any more or less portable than if I
>> had %myspark or %drill (a jdbc interpreter). Mainly because if someone
>> tries to open that notebook and they don't have my setup of %spark or of
>> %jdbc, they can't run the notebook. If we could allow the user to create
>> an alias for an instance of an interpreter, and that alias information was
>> stored in interpreter.json, then the portability of the notebook would be
>> essentially the same.
>>
>> Said another way:
>>
>> Static interpreter invocation (%jdbc, %pyspark, %psql):
>> - This notebook is 100% dependent on the interpreter.json in order to
>> run. %jdbc may point to Drill, %pyspark may point to an authenticated YARN
>> instance (specific to the user/org), %psql may point to an authenticated
>> Postgres instance unique to the org/user. Without interpreter.json, this
>> notebook is not portable.
>>
>> Aliases for interpreter invocation stored in interpreter.json (%drill ->
>> jdbc with settings, %datasciencespark -> pyspark for the data science
>> group, %entdw -> postgres server, enterprise data warehouse):
>> - This notebook is still 100% dependent on the interpreter.json file in
>> order to run. There is no more or less dependence on the interpreter.json
>> (if these aliases are stored there) than there is if Zeppelin is using
>> static interpreter invocation. Thus portability is not a benefit of the
>> static method, and the aliased method can provide a good deal of analyst
>> agility/definition in a multi-data-set/source environment.
>>
>>
>> My thought is we should allow people to create new interpreters of known
>> types, and on creation of these interpreters allow the invocation to be
>> stored in the interpreter.json. Also, if a new interpreter is registered,
>> it would follow the same interpreter group methodology. Thus if I set up a
>> new %spark as %entspark, then the sub-interpreters (pyspark, sparksql,
>> etc.) would still be there, with access to the parent entspark, and could
>> also be renamed. A sub-interpreter's access to its interpreter group would
>> be based on the parent-child relationship, not just the name...
>>
>> Thoughts?
>>
>> On Fri, Feb 5, 2016 at 2:15 PM, Zhong Wang <wangzhong....@gmail.com>
>> wrote:
>>
>>> Thanks moon - it is good to know the ideas behind the design. It makes a
>>> lot more sense to use system-defined identifiers in order to make the
>>> notebook portable.
>>>
>>> Currently, I can name the interpreter in the WebUI, but the name doesn't
>>> actually help distinguish between my spark interpreters, which is quite
>>> confusing to me. I am not sure whether this would be a better way:
>>> --
>>> 1. the UI generates the default identifier for the first spark
>>> interpreter, which is %spark
>>> 2. when the user creates another spark interpreter, the UI asks the
>>> user to provide a user-defined identifier
>>>
>>> Zhong
>>>
>>> On Fri, Feb 5, 2016 at 12:02 AM, moon soo Lee <m...@apache.org> wrote:
>>>
>>>> In the initial stage of development, there was discussion about
>>>> %xxx: whether xxx should be a user-defined interpreter identifier or
>>>> a static interpreter identifier.
>>>>
>>>> We decided to go with the latter one, because we wanted to keep the
>>>> notebook file portable, i.e. let an imported note.json file from another
>>>> Zeppelin instance run without (or with minimal) modification.
>>>>
>>>> If we used a user-defined identifier, running an imported notebook would
>>>> not be very simple. This is why %xxx is not using a user-defined
>>>> interpreter identifier at the moment.
>>>>
>>>> If you have any other thoughts, ideas, please feel free to share.
>>>>
>>>> Thanks,
>>>> moon
>>>>
>>>> On Fri, Feb 5, 2016 at 3:58 PM Zhong Wang <wangzhong....@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks, Moon! I got it working. The reason why it didn't work is that I
>>>>> tried to use both of the spark interpreters inside one notebook. I think
>>>>> I can create different notebooks for each interpreter, but it would be
>>>>> great if we could use "%xxx", where xxx is the user-defined interpreter
>>>>> identifier, to identify different interpreters for different paragraphs.
>>>>>
>>>>> Besides, because currently both of the interpreters use "spark" as the
>>>>> identifier, they share the same log file. I am not sure whether there
>>>>> are other cases where they interfere with each other.
>>>>>
>>>>> Thanks,
>>>>> Zhong
>>>>>
>>>>> On Thu, Feb 4, 2016 at 9:04 PM, moon soo Lee <m...@apache.org> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Once you create another spark interpreter in the Interpreter menu of
>>>>>> the GUI, each notebook should be able to select and use it (settings
>>>>>> icon in the top right corner of each notebook).
>>>>>>
>>>>>> If it does not work, could you look for an error message in the log file?
>>>>>>
>>>>>> Thanks,
>>>>>> moon
>>>>>>
>>>>>> On Fri, Feb 5, 2016 at 11:54 AM Zhong Wang <wangzhong....@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi zeppelin pilots,
>>>>>>>
>>>>>>> I am trying to run multiple spark interpreters in the same Zeppelin
>>>>>>> instance. This is very helpful if the data comes from multiple spark
>>>>>>> clusters.
>>>>>>>
>>>>>>> Another useful use case is to run one instance in cluster mode and
>>>>>>> another in local mode. This would significantly boost the performance
>>>>>>> of small-data analysis.
>>>>>>>
>>>>>>> Is there any way to run multiple spark interpreters? I tried to
>>>>>>> create another spark interpreter with a different identifier, which is
>>>>>>> allowed in the UI, but it doesn't work (shall I file a ticket?)
>>>>>>>
>>>>>>> I am now trying to run multiple SparkContexts in the same spark
>>>>>>> interpreter.
>>>>>>>
>>>>>>> Zhong
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>
>>
>
