Thanks Piyush. Do we have any ETA for this to be sent for review?

Dimple

On Wed, Jan 13, 2016 at 6:23 PM, Piyush Mukati (Data Platform) <
piyush.muk...@flipkart.com> wrote:

> Hi,
>  The code is available here
>
> https://github.com/piyush-mukati/incubator-zeppelin/tree/parallel_scheduler_support_spark
>
>
> Some testing is still left.
>
> On Wed, Jan 13, 2016 at 11:47 PM, Dimp Bhat <dimp201...@gmail.com> wrote:
>
> > Hi Pranav,
> > When do you plan to send out the code for running notebooks in parallel?
> >
> > Dimple
> >
> > On Tue, Nov 17, 2015 at 3:27 AM, Pranav Kumar Agarwal <
> praag...@gmail.com>
> > wrote:
> >
> >> Hi Rohit,
> >>
> >> We implemented the proposal and are able to run Zeppelin as a hosted
> >> service inside our organization. Our internal forked version has
> >> pluggable authentication and type-ahead.
> >>
> >> I need to port the work to the latest code and chop out the
> >> auth-changes portion. We'll be submitting it soon.
> >>
> >> We'll target to get this out for review by 11/26.
> >>
> >> Regards,
> >> -Pranav.
> >>
> >>
> >>
> >> On 17/11/15 4:34 am, Rohit Agarwal wrote:
> >>
> >> Hey Pranav,
> >>
> >> Did you make any progress on this?
> >>
> >> --
> >> Rohit
> >>
> >> On Sunday, August 16, 2015, moon soo Lee <m...@apache.org> wrote:
> >>
> >>> Pranav, proposal looks awesome!
> >>>
> >>> I have a question and some feedback:
> >>>
> >>> You said you tested 1, 2 and 3. To create a SparkIMain per notebook,
> >>> you need the notebook id. Did you get it from InterpreterContext?
> >>> Then how did you handle destroying the SparkIMain (when a notebook is
> >>> deleted)?
> >>> As far as I know, the interpreter is not able to get information about
> >>> notebook deletion.
> >>>
> >>> >> 4. Build a queue inside interpreter to allow only one paragraph
> >>> execution
> >>> >> at a time per notebook.
> >>>
> >>> One downside of this approach is that the GUI will display RUNNING
> >>> instead of PENDING for jobs queued inside the interpreter.
> >>>
> >>> Best,
> >>> moon
> >>>
> >>> On Sun, Aug 16, 2015 at 12:55 AM IT CTO <goi....@gmail.com> wrote:
> >>>
> >>>> +1 for "to re-factor the Zeppelin architecture so that it can handle
> >>>> multi-tenancy easily"
> >>>>
> >>>> On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <doanduy...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Agree with Joel, we may want to re-factor the Zeppelin architecture
> >>>>> so that it can handle multi-tenancy easily. The technical solution
> >>>>> proposed by Pranav is great, but it only applies to Spark. Right
> >>>>> now, each interpreter has to manage multi-tenancy its own way.
> >>>>> Ultimately Zeppelin could propose a multi-tenancy contract (like a
> >>>>> UserContext, similar to InterpreterContext) that each interpreter
> >>>>> can choose whether to use.
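The UserContext idea above could be sketched roughly as below. This is purely hypothetical: none of these type names exist in Zeppelin, and the real InterpreterContext API is not shown; it only illustrates the "contract each interpreter can choose to honor" shape.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical per-user identity Zeppelin could pass to interpreters
// alongside InterpreterContext. Name and fields are invented here.
class UserContext {
    final String userId;
    final Set<String> roles;

    UserContext(String userId, Set<String> roles) {
        this.userId = userId;
        this.roles = roles;
    }
}

// Hypothetical opt-in contract: an interpreter that implements this can
// key per-user state (for example a per-user SparkIMain) off userId;
// interpreters that don't care about multi-tenancy simply ignore it.
interface MultiTenantInterpreter {
    String interpret(String code, UserContext user);
}

// Toy implementation that just tags its output with the calling user.
class EchoInterpreter implements MultiTenantInterpreter {
    public String interpret(String code, UserContext user) {
        return "[" + user.userId + "] " + code;
    }
}
```

A per-user contract like this would let each interpreter decide its own isolation granularity instead of Zeppelin imposing one model on all of them.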
> >>>>>
> >>>>>
> >>>>> On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <djo...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> While the idea of running multiple notes simultaneously is great,
> >>>>>> it is really dancing around the lack of true multi-user support in
> >>>>>> Zeppelin. The proposed solution would work if the application's
> >>>>>> resources are those of the whole cluster, but if the app is limited
> >>>>>> (say it has 8 cores of 16, with some share of the memory) then your
> >>>>>> note can potentially hog all the resources and the scheduler will
> >>>>>> have to throttle all other executions, leaving you exactly where
> >>>>>> you are now.
> >>>>>> While I think the solution is a good one, maybe this question makes
> >>>>>> us think about adding true multi-user support, where we isolate
> >>>>>> resources (the cluster and the notebooks themselves), have separate
> >>>>>> login/identity and (I don't know if it's possible) share the same
> >>>>>> context.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Joel
> >>>>>>
> >>>>>> > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mindpri...@gmail.com>
> >>>>>> wrote:
> >>>>>> >
> >>>>>> > If the problem is that multiple users have to wait for each other
> >>>>>> > while using Zeppelin, the solution already exists: they can
> >>>>>> > create a new interpreter on the interpreter page and attach it to
> >>>>>> > their notebook - then they don't have to wait for others to
> >>>>>> > submit their jobs.
> >>>>>> >
> >>>>>> > But I agree, having paragraphs from one note wait for paragraphs
> >>>>>> > from other notes is a confusing default. We can get around that
> >>>>>> > in two ways:
> >>>>>> >
> >>>>>> >   1. Create a new interpreter for each note and attach that
> >>>>>> >   interpreter to that note. This approach would require the least
> >>>>>> >   amount of code changes, but it is resource-heavy and doesn't
> >>>>>> >   let you share the SparkContext between different notes.
> >>>>>> >   2. If we want to share the SparkContext between different
> >>>>>> >   notes, we can submit jobs from different notes into different
> >>>>>> >   fair scheduler pools (
> >>>>>> > https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application
> >>>>>> >   ). This can be done by submitting jobs from different notes in
> >>>>>> >   different threads. This will make sure that jobs from one note
> >>>>>> >   are run sequentially, but jobs from different notes will be
> >>>>>> >   able to run in parallel.
> >>>>>> >
> >>>>>> > Neither of these options requires any change in the Spark code.
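The threading model behind option 2 can be sketched without Spark. This is a hypothetical illustration (the class and method names below are invented, not Zeppelin code): one single-thread executor per note gives sequential execution within a note and parallelism across notes. In a real Spark setup, each note's thread would additionally call sc.setLocalProperty("spark.scheduler.pool", noteId) before submitting jobs so that each note's jobs land in a separate fair scheduler pool.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch only: one single-thread executor per note. Paragraphs submitted
// to the same note run in order on that note's thread; paragraphs from
// different notes run concurrently on different threads.
class PerNoteScheduler {
    private final Map<String, ExecutorService> executors = new ConcurrentHashMap<>();

    void submit(String noteId, Runnable paragraph) {
        executors
            .computeIfAbsent(noteId, id -> Executors.newSingleThreadExecutor())
            .execute(paragraph);
    }

    // Drain every note's queue and stop the threads.
    void shutdown() {
        for (ExecutorService e : executors.values()) {
            e.shutdown();
        }
        for (ExecutorService e : executors.values()) {
            try {
                e.awaitTermination(10, TimeUnit.SECONDS);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

The design choice here mirrors the email: ordering is a per-note property, so a per-note serial queue plus cross-note parallelism preserves paragraph dependencies without a global FIFO.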
> >>>>>> >
> >>>>>> > --
> >>>>>> > Thanks & Regards
> >>>>>> > Rohit Agarwal
> >>>>>> > https://www.linkedin.com/in/rohitagarwal003
> >>>>>> >
> >>>>>> > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal <
> >>>>>> praag...@gmail.com>
> >>>>>> > wrote:
> >>>>>> >
> >>>>>> >>> If someone can share about the idea of sharing a single
> >>>>>> >>> SparkContext through multiple SparkILoops safely, it'll be
> >>>>>> >>> really helpful.
> >>>>>> >> Here is a proposal:
> >>>>>> >> 1. In the Spark code, change SparkIMain.scala to allow setting
> >>>>>> >> the virtual directory. While creating new instances of SparkIMain
> >>>>>> >> per notebook from the Zeppelin Spark interpreter, set all the
> >>>>>> >> instances of SparkIMain to the same virtual directory.
> >>>>>> >> 2. Start an HTTP server on that virtual directory and set this
> >>>>>> >> HTTP server in the SparkContext using the classServerUri method.
> >>>>>> >> 3. Scala-generated code has a notion of packages. The default
> >>>>>> >> package name is "line$<linenumber>". The package name can be
> >>>>>> >> controlled using the system property scala.repl.name.line.
> >>>>>> >> Setting this property to the notebook id ensures that code
> >>>>>> >> generated by individual instances of SparkIMain is isolated from
> >>>>>> >> other instances of SparkIMain.
> >>>>>> >> 4. Build a queue inside the interpreter to allow only one
> >>>>>> >> paragraph execution at a time per notebook.
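Step 3 of the proposal can be sketched as below. The helper class and the notebook id "note_2AB4XYZ" are invented for illustration; the proposal relies on each SparkIMain instance picking up the scala.repl.name.line property when it is created, so that its generated wrapper classes carry a per-notebook prefix instead of the default "line" prefix and cannot collide with another notebook's classes on the shared class server.

```java
// Hypothetical helper, not Zeppelin code: called once per notebook,
// before that notebook's SparkIMain is constructed.
class ReplNameIsolation {
    static void isolateGeneratedClasses(String noteId) {
        // The Scala REPL consults this property when naming the classes
        // it generates for each compiled statement (default "line", as
        // in line$<n>); a distinct per-notebook value keeps the names
        // generated by different SparkIMain instances disjoint.
        System.setProperty("scala.repl.name.line", noteId);
    }
}
```

Since the property is a JVM-wide system property, this only isolates SparkIMain instances created after it is set; per the proposal, it is set just before each notebook's instance is created.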
> >>>>>> >>
> >>>>>> >> I have tested 1, 2 and 3, and this seems to provide isolation
> >>>>>> >> across class names. I'll work towards submitting a formal patch
> >>>>>> >> soon - is there already a JIRA for this that I can pick up? Also
> >>>>>> >> I need to understand:
> >>>>>> >> 1. How does Zeppelin pick up Spark fixes? Or do I need to first
> >>>>>> >> work towards getting the Spark changes merged into Apache Spark
> >>>>>> >> on GitHub?
> >>>>>> >>
> >>>>>> >> Any suggestions or comments on the proposal are highly welcome.
> >>>>>> >>
> >>>>>> >> Regards,
> >>>>>> >> -Pranav.
> >>>>>> >>
> >>>>>> >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
> >>>>>> >>>
> >>>>>> >>> Hi piyush,
> >>>>>> >>>
> >>>>>> >>> A separate instance of SparkILoop and SparkIMain for each
> >>>>>> >>> notebook while sharing the SparkContext sounds great.
> >>>>>> >>>
> >>>>>> >>> Actually, I tried to do it and found a problem: multiple
> >>>>>> >>> SparkILoops could generate the same class name, and the Spark
> >>>>>> >>> executors confuse class names since they're reading classes
> >>>>>> >>> from a single SparkContext.
> >>>>>> >>>
> >>>>>> >>> If someone can share about the idea of sharing a single
> >>>>>> >>> SparkContext through multiple SparkILoops safely, it'll be
> >>>>>> >>> really helpful.
> >>>>>> >>>
> >>>>>> >>> Thanks,
> >>>>>> >>> moon
> >>>>>> >>>
> >>>>>> >>>
> >>>>>> >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <
> >>>>>> >>> piyush.muk...@flipkart.com <mailto:piyush.muk...@flipkart.com>>
> >>>>>> wrote:
> >>>>>> >>>
> >>>>>> >>>    Hi Moon,
> >>>>>> >>>    Any suggestion on this? We have to wait a lot when multiple
> >>>>>> >>>    people are working with Spark.
> >>>>>> >>>    Can we create a separate instance of SparkILoop, SparkIMain
> >>>>>> >>>    and print streams for each notebook, while sharing the
> >>>>>> >>>    SparkContext, ZeppelinContext, SQLContext and
> >>>>>> >>>    DependencyResolver, and then use the parallel scheduler?
> >>>>>> >>>    thanks
> >>>>>> >>>
> >>>>>> >>>    -piyush
> >>>>>> >>>
> >>>>>> >>>    Hi Moon,
> >>>>>> >>>
> >>>>>> >>>    How about tracking a dedicated SparkContext per notebook in
> >>>>>> >>>    Spark's remote interpreter? This would allow multiple users
> >>>>>> >>>    to run their Spark paragraphs in parallel, while within a
> >>>>>> >>>    notebook only one paragraph is executed at a time.
> >>>>>> >>>
> >>>>>> >>>    Regards,
> >>>>>> >>>    -Pranav.
> >>>>>> >>>
> >>>>>> >>>
> >>>>>> >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
> >>>>>> >>>> Hi,
> >>>>>> >>>>
> >>>>>> >>>> Thanks for asking question.
> >>>>>> >>>>
> >>>>>> >>>> The reason is simply that it is running code statements. The
> >>>>>> >>>> statements can have order and dependencies. Imagine I have two
> >>>>>> >>>> paragraphs:
> >>>>>> >>>>
> >>>>>> >>>> %spark
> >>>>>> >>>> val a = 1
> >>>>>> >>>>
> >>>>>> >>>> %spark
> >>>>>> >>>> print(a)
> >>>>>> >>>>
> >>>>>> >>>> If they're not run one by one, they can run in random order
> >>>>>> >>>> and the output will differ from run to run: either '1' or
> >>>>>> >>>> 'error: not found: value a'.
> >>>>>> >>>>
> >>>>>> >>>> This is the reason why. But if there is a nice idea to handle
> >>>>>> >>>> this problem, I agree that using the parallel scheduler would
> >>>>>> >>>> help a lot.
> >>>>>> >>>>
> >>>>>> >>>> Thanks,
> >>>>>> >>>> moon
> >>>>>> >>>> On Tue, Jul 14, 2015 at 7:59 PM linxi zeng
> >>>>>> >>>> <linxizeng0...@gmail.com> wrote:
> >>>>>> >>>>
> >>>>>> >>>>    Anyone who has the same question as me? Or is this not a
> >>>>>> >>>>    question?
> >>>>>> >>>>
> >>>>>> >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng
> >>>>>> >>>>    <linxizeng0...@gmail.com>:
> >>>>>> >>>>
> >>>>>> >>>>        hi, Moon:
> >>>>>> >>>>           I notice that the getScheduler function in
> >>>>>> >>>>        SparkInterpreter.java returns a FIFOScheduler, which
> >>>>>> >>>>        makes the Spark interpreter run Spark jobs one by one.
> >>>>>> >>>>        It's not a good experience when a couple of users do
> >>>>>> >>>>        some work on Zeppelin at the same time, because they
> >>>>>> >>>>        have to wait for each other. And at the same time,
> >>>>>> >>>>        SparkSqlInterpreter can choose which scheduler to use
> >>>>>> >>>>        via "zeppelin.spark.concurrentSQL".
> >>>>>> >>>>        My question is: what considerations did you base this
> >>>>>> >>>>        decision on?
> >>>>>> >>>
> >>>>>> >>>
> >>>>>> >>>
> >>>>>> >>>
> >>>>>> >>>
> >>>>>>
> >>>>>> >>
> >>>>>>
> >>>>>
> >>>>>
> >>
> >> --
> >> Sent from a mobile device. Excuse my thumbs.
> >>
> >>
> >>
> >
>
