Hi Pranav,

Thanks for sharing the plan.
I think passing InterpreterContext to completion() makes sense.
Although it changes the interpreter API, changing it now looks better
than later.

Thanks.
moon

On Tue, Aug 25, 2015 at 11:22 PM Pranav Kumar Agarwal <praag...@gmail.com>
wrote:

> Hi Moon,
>
> > I think releasing SparkIMain and related objects
> By packaging I meant to ask: what is the process to "release
> SparkIMain and related objects" for Zeppelin's code uptake?
>
> I have one more question:
> Most of the changes to allow SparkInterpreter to support the
> ParallelScheduler are implemented, but I'm struggling with the
> completion feature. Since I have a SparkIMain interpreter for each
> notebook, completion is not working as expected because the
> completion method doesn't receive an InterpreterContext. I need to be
> able to pull the notebook-specific SparkIMain interpreter to return
> correct completion results, and for that I need to know the notebook
> id at the time of the completion call.
>
> I'm planning to change the Interpreter.java abstract method
> completion() to pass the InterpreterContext along with the buffer and
> cursor location. This will require refactoring all the Interpreters.
> It's a change in the contract, so I thought I'd run it by you before
> embarking on it...
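>
> A minimal sketch of what the revised contract might look like (the
> exact parameter name and return type are my assumption):
>
>     import java.util.List;
>
>     public abstract class Interpreter {
>       // Proposed: pass the context along with buffer and cursor so
>       // the interpreter can look up per-notebook state (e.g. the
>       // notebook's SparkIMain) before computing completions.
>       public abstract List<String> completion(
>           String buf, int cursor, InterpreterContext context);
>     }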
>
> Please let me know your thoughts.
>
> Regards,
> -Pranav.
>
> On 18/08/15 8:04 am, moon soo Lee wrote:
> > Could you explain a little bit more about the package changes you
> > mean?
> >
> > Thanks,
> > moon
> >
> > On Mon, Aug 17, 2015 at 10:27 AM Pranav Agarwal
> > <praag...@gmail.com> wrote:
> >
> >     Any thoughts on how to package changes related to Spark?
> >
> >         On 17-Aug-2015 7:58 pm, "moon soo Lee" <m...@apache.org> wrote:
> >
> >         I think releasing SparkIMain and related objects after
> >         configurable inactivity would be good for now.
> >
> >         About the scheduler, I can help implement such a scheduler.
> >
> >         Thanks,
> >         moon
> >
> >         On Sun, Aug 16, 2015 at 11:54 PM Pranav Kumar Agarwal
> >         <praag...@gmail.com> wrote:
> >
> >             Hi Moon,
> >
> >             Yes, the notebook id comes from InterpreterContext. At
> >             the moment, destroying a SparkIMain on deletion of its
> >             notebook is not handled. I think SparkIMain is a
> >             lightweight object; do you see a concern with keeping
> >             these objects in a map? One possible option could be to
> >             destroy notebook-related objects when the inactivity on
> >             a notebook is greater than, say, 8 hours.
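> >
> >             Roughly what I have in mind (a sketch; the names and
> >             the close() call are my assumptions):
> >
> >                 import java.util.Map;
> >                 import java.util.concurrent.ConcurrentHashMap;
> >                 import org.apache.spark.repl.SparkIMain;
> >
> >                 class NoteReplPool {
> >                   // One SparkIMain per notebook, plus a last-used
> >                   // timestamp so idle instances can be reaped.
> >                   private final Map<String, SparkIMain> repls =
> >                       new ConcurrentHashMap<>();
> >                   private final Map<String, Long> lastUsed =
> >                       new ConcurrentHashMap<>();
> >
> >                   void reapIdle(long maxIdleMillis) {
> >                     long now = System.currentTimeMillis();
> >                     lastUsed.forEach((noteId, t) -> {
> >                       if (now - t > maxIdleMillis) {
> >                         lastUsed.remove(noteId);
> >                         SparkIMain repl = repls.remove(noteId);
> >                         if (repl != null) repl.close();
> >                       }
> >                     });
> >                   }
> >                 }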
> >
> >
> >>             >> 4. Build a queue inside interpreter to allow only one
> >>             paragraph execution
> >>             >> at a time per notebook.
> >>
> >>             One downside of this approach is that the GUI will
> >>             display RUNNING instead of PENDING for jobs queued
> >>             inside the interpreter.
> >             Yes, that's a good point. Having a scheduler at the
> >             Zeppelin server that is parallel across notebooks and
> >             FIFO across paragraphs within a notebook would be nice.
> >             Is there any plan for having such a scheduler?
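> >
> >             For illustration, the behaviour I mean (a sketch, not
> >             Zeppelin's actual scheduler API):
> >
> >                 import java.util.Map;
> >                 import java.util.concurrent.*;
> >
> >                 class NoteScheduler {
> >                   // One single-threaded executor per notebook:
> >                   // paragraphs within a note run FIFO, while
> >                   // different notes run in parallel.
> >                   private final Map<String, ExecutorService> queues =
> >                       new ConcurrentHashMap<>();
> >
> >                   void submit(String noteId, Runnable paragraph) {
> >                     queues.computeIfAbsent(noteId,
> >                         id -> Executors.newSingleThreadExecutor())
> >                         .submit(paragraph);
> >                   }
> >                 }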
> >
> >             Regards,
> >             -Pranav.
> >
> >
> >             On 17/08/15 5:38 am, moon soo Lee wrote:
> >>             Pranav, proposal looks awesome!
> >>
> >>             I have a question and feedback,
> >>
> >>             You said you tested 1, 2 and 3. To create a SparkIMain
> >>             per notebook, you need the notebook id. Did you get it
> >>             from InterpreterContext?
> >>             Then how did you handle destroying the SparkIMain (when
> >>             a notebook is deleted)?
> >>             As far as I know, the interpreter is not able to get
> >>             information about notebook deletion.
> >>
> >>             >> 4. Build a queue inside interpreter to allow only one
> >>             paragraph execution
> >>             >> at a time per notebook.
> >>
> >>             One downside of this approach is that the GUI will
> >>             display RUNNING instead of PENDING for jobs queued
> >>             inside the interpreter.
> >>
> >>             Best,
> >>             moon
> >>
> >>             On Sun, Aug 16, 2015 at 12:55 AM IT CTO
> >>             <goi....@gmail.com> wrote:
> >>
> >>                 +1 for "to re-factor the Zeppelin architecture so
> >>                 that it can handle multi-tenancy easily"
> >>
> >>                 On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan
> >>                 <doanduy...@gmail.com> wrote:
> >>
> >>                     Agree with Joel, we may think about
> >>                     re-factoring the Zeppelin architecture so that
> >>                     it can handle multi-tenancy easily. The
> >>                     technical solution proposed by Pranav is great,
> >>                     but it only applies to Spark. Right now, each
> >>                     interpreter has to manage multi-tenancy its own
> >>                     way. Ultimately Zeppelin could propose a
> >>                     multi-tenancy contract (like a UserContext,
> >>                     similar to InterpreterContext) that each
> >>                     interpreter can choose to use or not.
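> >>
> >>                     Something like this, purely hypothetical, just
> >>                     to illustrate the kind of contract I mean:
> >>
> >>                         // Passed to interpreters alongside
> >>                         // InterpreterContext; each interpreter
> >>                         // may use it for multi-tenancy or simply
> >>                         // ignore it.
> >>                         public interface UserContext {
> >>                           String userId();
> >>                           String noteId();
> >>                         }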
> >>
> >>
> >>                     On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano
> >>                     <djo...@gmail.com> wrote:
> >>
> >>                         While the idea of running multiple notes
> >>                         simultaneously is great, it is really
> >>                         dancing around the lack of true multi-user
> >>                         support in Zeppelin. The proposed solution
> >>                         works if the application's resources are
> >>                         those of the whole cluster, but if the app
> >>                         is limited (say it has 8 cores of 16, with
> >>                         some distribution in memory) then
> >>                         potentially your note can hog all the
> >>                         resources and the scheduler will have to
> >>                         throttle all other executions, leaving you
> >>                         exactly where you are now.
> >>                         While I think the solution is a good one,
> >>                         maybe this question makes us think about
> >>                         adding true multi-user support, where we
> >>                         isolate resources (cluster and the
> >>                         notebooks themselves), have separate
> >>                         login/identity and (I don't know if it's
> >>                         possible) share the same context.
> >>
> >>                         Thanks,
> >>                         Joel
> >>
> >>                         > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal
> >>                         > <mindpri...@gmail.com> wrote:
> >>                         >
> >>                         > If the problem is that multiple users have
> >>                         to wait for each other while
> >>                         > using Zeppelin, the solution already
> >>                         exists: they can create a new
> >>                         > interpreter by going to the interpreter
> >>                         page and attaching it to their
> >>                         > notebook - then they don't have to wait for
> >>                         others to submit their job.
> >>                         >
> >>                         > But I agree, having paragraphs from one
> >>                         note wait for paragraphs from other
> >>                         > notes is a confusing default. We can get
> >>                         around that in two ways:
> >>                         >
> >>                         >   1. Create a new interpreter for each note
> >>                         and attach that interpreter to
> >>                         >   that note. This approach would require
> >>                         the least amount of code changes but
> >>                         >   is resource heavy and doesn't let you
> >>                         share Spark Context between different
> >>                         >   notes.
> >>                         >   2. If we want to share the Spark Context
> >>                         between different notes, we can
> >>                         >   submit jobs from different notes into
> >>                         different fair scheduler pools (see the
> >>                         sketch below, and
> >>                         https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application).
> >>                         >   This can be done by submitting jobs from
> >>                         different notes in different
> >>                         >   threads. This will make sure that jobs
> >>                         from one note are run sequentially
> >>                         >   but jobs from different notes will be
> >>                         able to run in parallel.
> >>                         >
> >>                         > Neither of these options requires any change
> >>                         in the Spark code.
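> >>                         >
> >>                         > A sketch of option 2 ("noteId" is a
> >>                         > placeholder; one NoteRunner per note):
> >>                         >
> >>                         >     import java.util.concurrent.*;
> >>                         >     import org.apache.spark.api.java.JavaSparkContext;
> >>                         >
> >>                         >     class NoteRunner {
> >>                         >       // One dedicated thread per note, so
> >>                         >       // paragraphs of a note stay FIFO.
> >>                         >       private final ExecutorService noteThread =
> >>                         >           Executors.newSingleThreadExecutor();
> >>                         >
> >>                         >       void run(JavaSparkContext sc,
> >>                         >           String noteId, Runnable job) {
> >>                         >         noteThread.submit(() -> {
> >>                         >           // Local properties are per-thread,
> >>                         >           // so this job lands in the note's
> >>                         >           // fair scheduler pool.
> >>                         >           sc.setLocalProperty(
> >>                         >               "spark.scheduler.pool", noteId);
> >>                         >           job.run();
> >>                         >         });
> >>                         >       }
> >>                         >     }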
> >>                         >
> >>                         > --
> >>                         > Thanks & Regards
> >>                         > Rohit Agarwal
> >>                         > https://www.linkedin.com/in/rohitagarwal003
> >>                         >
> >>                         > On Sat, Aug 15, 2015 at 12:01 PM, Pranav
> >>                         > Kumar Agarwal <praag...@gmail.com> wrote:
> >>                         >
> >>                         >>> If someone can share an idea for
> >>                         >>> sharing a single SparkContext across
> >>                         >>> multiple SparkILoops safely, it'll be
> >>                         >>> really helpful.
> >>                         >> Here is a proposal:
> >>                         >> 1. In the Spark code, change
> >>                         >> SparkIMain.scala to allow setting the
> >>                         >> virtual directory. While creating new
> >>                         >> instances of SparkIMain per notebook from
> >>                         >> the Zeppelin Spark interpreter, set all
> >>                         >> the instances of SparkIMain to the same
> >>                         >> virtual directory.
> >>                         >> 2. Start an HTTP server on that virtual
> >>                         >> directory and set this HTTP server in the
> >>                         >> SparkContext using the classServerUri
> >>                         >> method.
> >>                         >> 3. Scala-generated code has a notion of
> >>                         >> packages. The default package name is
> >>                         >> "line$<linenumber>". The package name can
> >>                         >> be controlled using the system property
> >>                         >> scala.repl.name.line. Setting this
> >>                         >> property to the notebook id ensures that
> >>                         >> code generated by individual instances of
> >>                         >> SparkIMain is isolated from other
> >>                         >> instances of SparkIMain (see the sketch
> >>                         >> after this list).
> >>                         >> 4. Build a queue inside interpreter to
> >>                         allow only one paragraph execution
> >>                         >> at a time per notebook.
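> >>                         >>
> >>                         >> A sketch of 3 (the property is read when
> >>                         >> the interpreter is created; constructor
> >>                         >> details are simplified, and "noteId",
> >>                         >> "settings" and "out" are placeholders):
> >>                         >>
> >>                         >>     import java.io.PrintWriter;
> >>                         >>     import org.apache.spark.repl.SparkIMain;
> >>                         >>     import scala.tools.nsc.Settings;
> >>                         >>
> >>                         >>     // Called once per notebook. Prefixing
> >>                         >>     // generated class names with the
> >>                         >>     // notebook id keeps classes from
> >>                         >>     // different SparkIMain instances from
> >>                         >>     // colliding in the shared virtual
> >>                         >>     // directory.
> >>                         >>     SparkIMain createRepl(String noteId,
> >>                         >>         Settings settings, PrintWriter out) {
> >>                         >>       System.setProperty(
> >>                         >>           "scala.repl.name.line", noteId);
> >>                         >>       return new SparkIMain(settings, out);
> >>                         >>     }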
> >>                         >>
> >>                         >> I have tested 1, 2, and 3, and it seems
> >>                         >> to provide isolation across class names.
> >>                         >> I'll work towards submitting a formal
> >>                         >> patch soon - is there already a JIRA for
> >>                         >> this that I can take up? Also, I need to
> >>                         >> understand:
> >>                         >> 1. How does Zeppelin take up Spark fixes?
> >>                         >> Or do I need to first work towards
> >>                         >> getting the Spark changes merged into
> >>                         >> Apache Spark's GitHub repository?
> >>                         >>
> >>                         >> Any suggestions or comments on the
> >>                         >> proposal are highly welcome.
> >>                         >>
> >>                         >> Regards,
> >>                         >> -Pranav.
> >>                         >>
> >>                         >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
> >>                         >>>
> >>                         >>> Hi piyush,
> >>                         >>>
> >>                         >>> A separate instance of SparkILoop and
> >>                         >>> SparkIMain for each notebook while
> >>                         >>> sharing the SparkContext sounds great.
> >>                         >>>
> >>                         >>> Actually, I tried to do it and found a
> >>                         >>> problem: multiple SparkILoops could
> >>                         >>> generate the same class name, and the
> >>                         >>> Spark executor confuses class names
> >>                         >>> since they're reading classes from a
> >>                         >>> single SparkContext.
> >>                         >>>
> >>                         >>> If someone can share an idea for
> >>                         >>> sharing a single SparkContext across
> >>                         >>> multiple SparkILoops safely, it'll be
> >>                         >>> really helpful.
> >>                         >>>
> >>                         >>> Thanks,
> >>                         >>> moon
> >>                         >>>
> >>                         >>>
> >>                         >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush
> >>                         >>> Mukati (Data Platform)
> >>                         >>> <piyush.muk...@flipkart.com> wrote:
> >>                         >>>
> >>                         >>>    Hi Moon,
> >>                         >>>    Any suggestion on it? We have to
> >>                         >>> wait a lot when multiple people are
> >>                         >>> working with Spark.
> >>                         >>>    Can we create separate instances of
> >>                         >>> SparkILoop, SparkIMain and print
> >>                         >>> streams for each notebook while sharing
> >>                         >>> the SparkContext, ZeppelinContext,
> >>                         >>> SQLContext and DependencyResolver, and
> >>                         >>> then use the parallel scheduler?
> >>                         >>>
> >>                         >>> thanks
> >>                         >>>
> >>                         >>> -piyush
> >>                         >>>
> >>                         >>>    Hi Moon,
> >>                         >>>
> >>                         >>>    How about tracking a dedicated
> >>                         >>> SparkContext for a notebook in Spark's
> >>                         >>> remote interpreter - this will allow
> >>                         >>> multiple users to run their Spark
> >>                         >>> paragraphs in parallel. Also, within a
> >>                         >>> notebook, only one paragraph is
> >>                         >>> executed at a time.
> >>                         >>>
> >>                         >>> Regards,
> >>                         >>> -Pranav.
> >>                         >>>
> >>                         >>>
> >>                         >>>> On 15/07/15 7:15 pm, moon soo Lee wrote:
> >>                         >>>> Hi,
> >>                         >>>>
> >>                         >>>> Thanks for asking the question.
> >>                         >>>>
> >>                         >>>> The reason is simply that it is
> >>                         >>>> running code statements. The
> >>                         >>>> statements can have order and
> >>                         >>>> dependency. Imagine I have two
> >>                         >>>> paragraphs:
> >>                         >>>>
> >>                         >>>> %spark
> >>                         >>>> val a = 1
> >>                         >>>>
> >>                         >>>> %spark
> >>                         >>>> print(a)
> >>                         >>>>
> >>                         >>>> If they're not run one by one, they
> >>                         >>>> may run in random order and the output
> >>                         >>>> will differ from run to run: either
> >>                         >>>> '1' or 'val a cannot be found'.
> >>                         >>>>
> >>                         >>>> This is the reason why. But if there
> >>                         >>>> is a nice idea to handle this problem,
> >>                         >>>> I agree using a parallel scheduler
> >>                         >>>> would help a lot.
> >>                         >>>>
> >>                         >>>> Thanks,
> >>                         >>>> moon
> >>                         >>>> On Tue, Jul 14, 2015 at 7:59 PM linxi
> >>                         >>>> zeng <linxizeng0...@gmail.com> wrote:
> >>                         >>>>
> >>                         >>>> Anyone who has the same question as
> >>                         >>>> me? Or is this not a question?
> >>                         >>>>
> >>                         >>>> 2015-07-14 11:47 GMT+08:00 linxi zeng
> >>                         >>>> <linxizeng0...@gmail.com>:
> >>                         >>>>     hi, Moon:
> >>                         >>>>        I notice that the getScheduler
> >>                         >>>>     function in SparkInterpreter.java
> >>                         >>>>     returns a FIFOScheduler, which
> >>                         >>>>     makes the Spark interpreter run
> >>                         >>>>     Spark jobs one by one. It's not a
> >>                         >>>>     good experience when a couple of
> >>                         >>>>     users do some work on Zeppelin at
> >>                         >>>>     the same time, because they have
> >>                         >>>>     to wait for each other. At the
> >>                         >>>>     same time, SparkSqlInterpreter can
> >>                         >>>>     choose which scheduler to use via
> >>                         >>>>     "zeppelin.spark.concurrentSQL".
> >>                         >>>>     My question is, what kind of
> >>                         >>>>     consideration did you base such a
> >>                         >>>>     decision on?
> >>                         >>>
> >>                         >>>
> >>                         >>>
> >>                         >>>
> >>                         >>>
> >>
> >>                         >>
> >>
> >>
> >
>
>
