Hey Pranav,

Did you make any progress on this?

--
Rohit

On Sunday, August 16, 2015, moon soo Lee <m...@apache.org> wrote:

> Pranav, proposal looks awesome!
>
> I have a question and feedback,
>
> You said you tested 1, 2 and 3. To create a SparkIMain per notebook, you
> need the notebook id. Did you get it from InterpreterContext?
> Then how did you handle destroying the SparkIMain (when a notebook is
> deleted)?
> As far as I know, the interpreter is not able to get notified of notebook
> deletion.
>
> >> 4. Build a queue inside interpreter to allow only one paragraph
> execution
> >> at a time per notebook.
>
> One downside of this approach is that the GUI will display RUNNING instead
> of PENDING for jobs queued inside the interpreter.
>
> Best,
> moon
>
> On Sun, Aug 16, 2015 at 12:55 AM IT CTO <goi....@gmail.com> wrote:
>
>> +1 for "to re-factor the Zeppelin architecture so that it can handle
>> multi-tenancy easily"
>>
>> On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <doanduy...@gmail.com> wrote:
>>
>>> Agree with Joel, we may want to re-factor the Zeppelin architecture so
>>> that it can handle multi-tenancy easily. The technical solution proposed
>>> by Pranav is great, but it only applies to Spark. Right now, each
>>> interpreter has to manage multi-tenancy its own way. Ultimately Zeppelin
>>> could propose a multi-tenancy contract/info (like a UserContext, similar
>>> to InterpreterContext) that each interpreter can choose to use or not.
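For illustration, such a contract could look roughly like this. This is only a sketch of the idea: UserContext and MultiTenantAware are hypothetical names, not existing Zeppelin APIs.

```scala
// Hypothetical shape of a multi-tenancy contract. None of these names
// exist in Zeppelin today; they only illustrate the proposal.
case class UserContext(
    userId: String,                               // who runs the paragraph
    noteId: String,                               // which note it belongs to
    credentials: Map[String, String] = Map.empty) // optional per-user secrets

trait MultiTenantAware {
  // An interpreter that opts in can inspect the UserContext before running
  // a paragraph; interpreters that don't care never implement this trait.
  def setUserContext(ctx: UserContext): Unit
}
```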
>>>
>>>
>>> On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <djo...@gmail.com> wrote:
>>>
>>>> I think that while the idea of running multiple notes simultaneously is
>>>> great, it is really dancing around the lack of true multi-user support
>>>> in Zeppelin. The proposed solution would work if the application's
>>>> resources are those of the whole cluster, but if the app is limited
>>>> (say it has 8 cores of 16, with some distribution of memory), then
>>>> potentially your note can hog all the resources and the scheduler will
>>>> have to throttle all other executions, leaving you exactly where you
>>>> are now.
>>>> While I think the solution is a good one, maybe this question makes us
>>>> think about adding true multi-user support, where we isolate resources
>>>> (cluster and the notebooks themselves), have separate login/identity
>>>> and (I don't know if it's possible) share the same context.
>>>>
>>>> Thanks,
>>>> Joel
>>>>
>>>> > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mindpri...@gmail.com> wrote:
>>>> >
>>>> > If the problem is that multiple users have to wait for each other
>>>> > while using Zeppelin, the solution already exists: they can create a
>>>> > new interpreter by going to the interpreter page and attach it to
>>>> > their notebook - then they don't have to wait for others to submit
>>>> > their jobs.
>>>> >
>>>> > But I agree, having paragraphs from one note wait for paragraphs from
>>>> > other notes is a confusing default. We can get around that in two
>>>> > ways:
>>>> >
>>>> >   1. Create a new interpreter for each note and attach that
>>>> >   interpreter to that note. This approach would require the least
>>>> >   amount of code changes, but is resource heavy and doesn't let you
>>>> >   share the Spark Context between different notes.
>>>> >   2. If we want to share the Spark Context between different notes,
>>>> >   we can submit jobs from different notes into different
>>>> >   fairscheduler pools (
>>>> >   https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application
>>>> >   ). This can be done by submitting jobs from different notes in
>>>> >   different threads. This will make sure that jobs from one note are
>>>> >   run sequentially, but jobs from different notes will be able to run
>>>> >   in parallel.
>>>> >
>>>> > Neither of these options requires any change in the Spark code.
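For reference, option 2 can be sketched roughly like this. It assumes a shared SparkContext and a fairscheduler.xml that defines the pools; runNoteJob is an illustrative helper, not Zeppelin code.

```scala
import org.apache.spark.SparkContext

// Sketch only: spark.scheduler.pool is a thread-local property, so running
// each note's jobs on a dedicated thread binds that note to its own
// fair-scheduler pool. Jobs on the same thread still run sequentially.
def runNoteJob(sc: SparkContext, noteId: String)(job: => Unit): Thread = {
  val t = new Thread(new Runnable {
    override def run(): Unit = {
      sc.setLocalProperty("spark.scheduler.pool", "pool_" + noteId)
      job // Spark jobs triggered here land in this note's pool
    }
  })
  t.start()
  t
}
```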
>>>> >
>>>> > --
>>>> > Thanks & Regards
>>>> > Rohit Agarwal
>>>> > https://www.linkedin.com/in/rohitagarwal003
>>>> >
>>>> > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal
>>>> > <praag...@gmail.com> wrote:
>>>> >
>>>> >>> If someone can share about the idea of sharing single SparkContext
>>>> >>> through multiple SparkILoop safely, it'll be really helpful.
>>>> >> Here is a proposal:
>>>> >> 1. In Spark code, change SparkIMain.scala to allow setting the
>>>> >> virtual directory. While creating new instances of SparkIMain per
>>>> >> notebook from the Zeppelin Spark interpreter, set all the instances
>>>> >> of SparkIMain to the same virtual directory.
>>>> >> 2. Start an HTTP server on that virtual directory and set this HTTP
>>>> >> server in the Spark Context using the classServerUri method.
>>>> >> 3. Scala generated code has a notion of packages. The default
>>>> >> package name is "line$<linenumber>". The package name can be
>>>> >> controlled using the system property scala.repl.name.line. Setting
>>>> >> this property to the notebook id ensures that code generated by
>>>> >> individual instances of SparkIMain is isolated from the other
>>>> >> instances of SparkIMain.
>>>> >> 4. Build a queue inside the interpreter to allow only one paragraph
>>>> >> execution at a time per notebook.
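As a rough illustration of step 4 (a hypothetical helper, not the actual interpreter code): one single-threaded executor per notebook gives exactly this behavior - paragraphs within one note run one at a time, while different notes proceed in parallel.

```scala
import java.util.concurrent.{ConcurrentHashMap, ExecutorService, Executors}

// Sketch of the per-notebook queue: a single-threaded executor per note
// serializes that note's paragraphs; separate notes get separate threads.
class PerNoteQueue {
  private val queues = new ConcurrentHashMap[String, ExecutorService]()

  def submit(noteId: String)(paragraph: () => Unit): Unit = {
    val q = queues.computeIfAbsent(noteId,
      _ => Executors.newSingleThreadExecutor())
    q.submit(new Runnable { override def run(): Unit = paragraph() })
    () // discard the Future; fire-and-forget is enough for the sketch
  }

  def shutdown(): Unit = queues.values().forEach(_.shutdown())
}
```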
>>>> >>
>>>> >> I have tested 1, 2, and 3, and it seems to provide isolation across
>>>> >> classnames. I'll work towards submitting a formal patch soon - is
>>>> >> there already a Jira for this that I can take up? Also I need to
>>>> >> understand:
>>>> >> 1. How does Zeppelin uptake Spark fixes? Or do I need to first work
>>>> >> towards getting the Spark changes merged into Apache Spark's GitHub?
>>>> >>
>>>> >> Any suggestions or comments on the proposal are highly welcome.
>>>> >>
>>>> >> Regards,
>>>> >> -Pranav.
>>>> >>
>>>> >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>>> >>>
>>>> >>> Hi piyush,
>>>> >>>
>>>> >>> Separate instances of SparkILoop and SparkIMain for each notebook
>>>> >>> while sharing the SparkContext sounds great.
>>>> >>>
>>>> >>> Actually, I tried to do it and found a problem: multiple SparkILoops
>>>> >>> could generate the same class name, and the Spark executor confuses
>>>> >>> classnames since they're reading classes from a single SparkContext.
>>>> >>>
>>>> >>> If someone can share about the idea of sharing single SparkContext
>>>> >>> through multiple SparkILoop safely, it'll be really helpful.
>>>> >>>
>>>> >>> Thanks,
>>>> >>> moon
>>>> >>>
>>>> >>>
>>>> >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform)
>>>> >>> <piyush.muk...@flipkart.com> wrote:
>>>> >>>
>>>> >>>    Hi Moon,
>>>> >>>    Any suggestion on it? We have to wait a lot when multiple people
>>>> >>> are working with Spark.
>>>> >>>    Can we create separate instances of SparkILoop, SparkIMain and
>>>> >>> print streams for each notebook, while sharing the SparkContext,
>>>> >>> ZeppelinContext, SQLContext and DependencyResolver, and then use the
>>>> >>> parallel scheduler?
>>>> >>>    thanks
>>>> >>>
>>>> >>>    -piyush
>>>> >>>
>>>> >>>    Hi Moon,
>>>> >>>
>>>> >>>    How about tracking a dedicated SparkContext for a notebook in
>>>> >>>    Spark's remote interpreter - this will allow multiple users to
>>>> >>>    run their Spark paragraphs in parallel. Also, within a notebook
>>>> >>>    only one paragraph is executed at a time.
>>>> >>>
>>>> >>>    Regards,
>>>> >>>    -Pranav.
>>>> >>>
>>>> >>>
>>>> >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
>>>> >>>> Hi,
>>>> >>>>
>>>> >>>> Thanks for asking question.
>>>> >>>>
>>>> >>>> The reason is simply because of it is running code statements. The
>>>> >>>> statements can have order and dependency. Imagine i have two
>>>> >>> paragraphs
>>>> >>>>
>>>> >>>> %spark
>>>> >>>> val a = 1
>>>> >>>>
>>>> >>>> %spark
>>>> >>>> print(a)
>>>> >>>>
>>>> >>>> If they're not run one by one, that means they possibly run in
>>>> >>>> random order and the output will always be different: either '1'
>>>> >>>> or 'val a cannot be found'.
>>>> >>>>
>>>> >>>> This is the reason why. But if there is a nice idea to handle this
>>>> >>>> problem, I agree using the parallel scheduler would help a lot.
>>>> >>>>
>>>> >>>> Thanks,
>>>> >>>> moon
>>>> >>>> On Tue, Jul 14, 2015 at 7:59 PM linxi zeng
>>>> >>>> <linxizeng0...@gmail.com> wrote:
>>>> >>>>
>>>> >>>>    anyone who has the same question as me? or is this not a
>>>> >>>>    question?
>>>> >>>>
>>>> >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0...@gmail.com>:
>>>> >>>>
>>>> >>>>        hi, Moon:
>>>> >>>>           I notice that the getScheduler function in
>>>> >>>>        SparkInterpreter.java returns a FIFOScheduler, which makes
>>>> >>>>        the Spark interpreter run Spark jobs one by one. It's not a
>>>> >>>>        good experience when a couple of users do some work on
>>>> >>>>        Zeppelin at the same time, because they have to wait for
>>>> >>>>        each other. And at the same time, SparkSqlInterpreter can
>>>> >>>>        choose which scheduler to use via
>>>> >>>>        "zeppelin.spark.concurrentSQL".
>>>> >>>>        My question is: what kind of considerations did you base
>>>> >>>>        such a decision on?
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> ------------------------------------------------------------------------------------------------------------------------------------------
>>>> >>>
>>>> >>>    This email and any files transmitted with it are confidential and
>>>> >>>    intended solely for the use of the individual or entity to whom
>>>> >>>    they are addressed. If you have received this email in error
>>>> >>>    please notify the system manager. This message contains
>>>> >>>    confidential information and is intended only for the individual
>>>> >>>    named. If you are not the named addressee you should not
>>>> >>>    disseminate, distribute or copy this e-mail. Please notify the
>>>> >>>    sender immediately by e-mail if you have received this e-mail by
>>>> >>>    mistake and delete this e-mail from your system. If you are not
>>>> >>>    the intended recipient you are notified that disclosing, copying,
>>>> >>>    distributing or taking any action in reliance on the contents of
>>>> >>>    this information is strictly prohibited. Although Flipkart has
>>>> >>>    taken reasonable precautions to ensure no viruses are present in
>>>> >>>    this email, the company cannot accept responsibility for any loss
>>>> >>>    or damage arising from the use of this email or attachments
>>>> >>
>>>>
>>>
>>>

-- 
Sent from a mobile device. Excuse my thumbs.
