Hi Rohit,

We implemented the proposal and are able to run Zeppelin as a hosted service inside my organization. Our internal forked version has pluggable authentication and type ahead.

I need to port the work to the latest codebase and strip out the authentication changes. We'll be submitting it soon.

We're targeting 11/26 to get this out for review.

Regards,
-Pranav.


On 17/11/15 4:34 am, Rohit Agarwal wrote:
Hey Pranav,

Did you make any progress on this?

--
Rohit

On Sunday, August 16, 2015, moon soo Lee <m...@apache.org> wrote:

    Pranav, proposal looks awesome!

    I have a question and some feedback.

    You said you tested 1, 2, and 3. To create a SparkIMain per
    notebook, you need the notebook id. Did you get it from
    InterpreterContext? And how did you handle destroying the
    SparkIMain (when a notebook is deleted)? As far as I know, the
    interpreter is not able to get notified of notebook deletion.

    >> 4. Build a queue inside interpreter to allow only one paragraph
    execution
    >> at a time per notebook.

    One downside of this approach is that the GUI will display
    RUNNING instead of PENDING for jobs queued inside the interpreter.

    Best,
    moon

    On Sun, Aug 16, 2015 at 12:55 AM IT CTO <goi....@gmail.com> wrote:

        +1 for "to re-factor the Zeppelin architecture so that it can
        handle multi-tenancy easily"

        On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan
        <doanduy...@gmail.com> wrote:

            Agree with Joel, we may want to re-factor the Zeppelin
            architecture so that it can handle multi-tenancy easily.
            The technical solution proposed by Pranav is great, but it
            only applies to Spark. Right now, each interpreter has to
            manage multi-tenancy its own way. Ultimately Zeppelin
            could propose a multi-tenancy contract (like a
            UserContext, similar to InterpreterContext) that each
            interpreter can choose whether to use.


            On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano
            <djo...@gmail.com> wrote:

                While the idea of running multiple notes
                simultaneously is great, it is really dancing around
                the lack of true multi-user support in Zeppelin. The
                proposed solution would work if the application's
                resources were those of the whole cluster, but if the
                app is limited (say it has 8 cores of 16, with a
                similar split of memory), then one note can
                potentially hog all the resources and the scheduler
                will have to throttle all other executions, leaving
                you exactly where you are now.
                While I think the solution is a good one, maybe this
                question should make us think about adding true
                multi-user support, where we isolate resources (the
                cluster and the notebooks themselves), have separate
                login/identity, and (I don't know if it's possible)
                share the same context.

                Thanks,
                Joel

                > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal
                > <mindpri...@gmail.com> wrote:
                >
                > If the problem is that multiple users have to wait
                > for each other while using Zeppelin, the solution
                > already exists: they can create a new interpreter
                > by going to the interpreter page and attach it to
                > their notebook - then they don't have to wait for
                > others to submit their jobs.
                >
                > But I agree, having paragraphs from one note wait
                > for paragraphs from other notes is a confusing
                > default. We can get around that in two ways:
                >
                >   1. Create a new interpreter for each note and
                >   attach that interpreter to that note. This
                >   approach would require the least amount of code
                >   changes, but it is resource heavy and doesn't let
                >   you share the SparkContext between different
                >   notes.
                >   2. If we want to share the SparkContext between
                >   different notes, we can submit jobs from
                >   different notes into different fair scheduler
                >   pools
                >   (https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application).
                >   This can be done by submitting jobs from
                >   different notes in different threads. This will
                >   make sure that jobs from one note run
                >   sequentially, but jobs from different notes will
                >   be able to run in parallel.
                >
                > Neither of these options requires any change in the
                > Spark code.
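Option 2 above can be sketched roughly as follows. This is an illustrative sketch, not Zeppelin code: it assumes a shared SparkContext `sc`, a fairscheduler.xml that defines the pools, and `spark.scheduler.mode=FAIR`; the `runInPool` helper name is made up here. The key point is that `spark.scheduler.pool` is a thread-local property, so tagging a per-note thread routes that note's jobs into its own pool.

```scala
import org.apache.spark.SparkContext

object NotePools {
  // Each note submits its jobs from its own dedicated thread. Jobs
  // submitted from one thread run sequentially within that note, while
  // Spark's FAIR scheduler interleaves jobs across notes' pools.
  def runInPool(sc: SparkContext, noteId: String)(job: => Unit): Thread = {
    val t = new Thread(new Runnable {
      override def run(): Unit = {
        // spark.scheduler.pool is read per-thread by the DAG scheduler
        sc.setLocalProperty("spark.scheduler.pool", noteId)
        job
      }
    })
    t.start()
    t
  }
}
```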
                >
                > --
                > Thanks & Regards
                > Rohit Agarwal
                > https://www.linkedin.com/in/rohitagarwal003
                >
                > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar
                > Agarwal <praag...@gmail.com> wrote:
                >
                >>> If someone can share about the idea of sharing a
                >>> single SparkContext through multiple SparkILoops
                >>> safely, it'll be really helpful.
                >> Here is a proposal:
                >> 1. In Spark code, change SparkIMain.scala to allow
                >> setting the virtual directory. While creating new
                >> instances of SparkIMain per notebook from the
                >> Zeppelin Spark interpreter, set all the instances
                >> of SparkIMain to the same virtual directory.
                >> 2. Start an HTTP server on that virtual directory
                >> and set this HTTP server in the SparkContext using
                >> the classServerUri method.
                >> 3. Scala-generated code has a notion of packages.
                >> The default package name is "line$<linenumber>".
                >> The package name can be controlled using the
                >> system property scala.repl.name.line. Setting this
                >> property to the notebook id ensures that code
                >> generated by individual instances of SparkIMain is
                >> isolated from other instances of SparkIMain.
                >> 4. Build a queue inside the interpreter to allow
                >> only one paragraph execution at a time per
                >> notebook.
                >>
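A minimal sketch of how steps 3 and 4 might be wired up inside the interpreter, assuming the notebook id is available from InterpreterContext. `NotebookIsolation` and its members are illustrative names, not Zeppelin APIs:

```scala
import java.util.concurrent.{ExecutorService, Executors}
import scala.collection.mutable

object NotebookIsolation {
  // Step 3: override the REPL's generated package prefix before creating
  // the SparkIMain for a notebook, so its generated class names cannot
  // collide with those of other notebooks sharing the same SparkContext.
  def isolateClassNames(notebookId: String): Unit =
    System.setProperty("scala.repl.name.line", notebookId)

  // Step 4: one single-threaded executor per notebook acts as the queue --
  // paragraphs of the same note run one at a time, while paragraphs from
  // different notes run in parallel.
  private val queues = mutable.Map.empty[String, ExecutorService]

  def submit(notebookId: String, paragraph: Runnable): Unit = {
    val queue = queues.synchronized {
      queues.getOrElseUpdate(notebookId, Executors.newSingleThreadExecutor())
    }
    queue.submit(paragraph)
  }
}
```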
                >> I have tested 1, 2, and 3, and it seems to provide
                >> isolation across class names. I'll work towards
                >> submitting a formal patch soon - is there already a
                >> Jira for this that I can pick up? Also, I need to
                >> understand:
                >> 1. How does Zeppelin pick up Spark fixes? Or do I
                >> need to first get the Spark changes merged into
                >> Apache Spark on GitHub?
                >>
                >> Any suggestions or comments on the proposal are
                >> highly welcome.
                >>
                >> Regards,
                >> -Pranav.
                >>
                >>
                >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
                >>>
                >>> Hi piyush,
                >>>
                >>> Separate instances of SparkILoop and SparkIMain
                >>> for each notebook while sharing the SparkContext
                >>> sounds great.
                >>>
                >>> Actually, I tried to do it and found a problem:
                >>> multiple SparkILoops can generate the same class
                >>> name, and the Spark executor confuses class names
                >>> since they're reading classes from a single
                >>> SparkContext.
                >>>
                >>> If someone can share an idea for sharing a single
                >>> SparkContext through multiple SparkILoops safely,
                >>> it'll be really helpful.
                >>>
                >>> Thanks,
                >>> moon
                >>>
                >>>
                >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati
                >>> (Data Platform) <piyush.muk...@flipkart.com>
                >>> wrote:
                >>>
                >>>    Hi Moon,
                >>>    Any suggestion on this? We have to wait a lot
                >>>    when multiple people are working with Spark.
                >>>    Can we create a separate instance of
                >>>    SparkILoop, SparkIMain and print streams for
                >>>    each notebook, while sharing the SparkContext,
                >>>    ZeppelinContext, SQLContext and
                >>>    DependencyResolver, and then use the parallel
                >>>    scheduler?
                >>>    Thanks,
                >>>    -piyush
                >>>
                >>>    Hi Moon,
                >>>
                >>>    How about tracking a dedicated SparkContext
                >>>    per notebook in Spark's remote interpreter -
                >>>    this will allow multiple users to run their
                >>>    Spark paragraphs in parallel. Also, within a
                >>>    notebook, only one paragraph is executed at a
                >>>    time.
                >>>
                >>>    Regards,
                >>>    -Pranav.
                >>>
                >>>
                >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
                >>>> Hi,
                >>>>
                >>>> Thanks for asking the question.
                >>>>
                >>>> The reason is simply that it is running code
                >>>> statements. The statements can have order and
                >>>> dependencies. Imagine I have two paragraphs:
                >>>>
                >>>> %spark
                >>>> val a = 1
                >>>>
                >>>> %spark
                >>>> print(a)
                >>>>
                >>>> If they're not run one by one, they could run in
                >>>> random order and the output would not always be
                >>>> the same: either '1' or 'val a cannot be found'.
                >>>>
                >>>> This is the reason why. But if there is a nice
                >>>> idea to handle this problem, I agree that using
                >>>> a parallel scheduler would help a lot.
                >>>>
                >>>> Thanks,
                >>>> moon
                >>>> On Tue, Jul 14, 2015 at 7:59 PM linxi zeng
                >>>> <linxizeng0...@gmail.com> wrote:
                >>>>
                >>>>    Anyone with the same question as me? Or is
                >>>>    this not a question?
                >>>>
                >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng
                <linxizeng0...@gmail.com
                <javascript:_e(%7B%7D,'cvml','linxizeng0...@gmail.com');>
                >>> <mailto:linxizeng0...@gmail.com
                <javascript:_e(%7B%7D,'cvml','linxizeng0...@gmail.com');>>
                >>>>    <mailto:linxizeng0...@gmail.com
                <javascript:_e(%7B%7D,'cvml','linxizeng0...@gmail.com');>
                <mailto:
                >>> linxizeng0...@gmail.com
                <javascript:_e(%7B%7D,'cvml','linxizeng0...@gmail.com');>>>>:
                >>>>
                >>>>        hi, Moon:
                >>>>           I notice that the getScheduler
                >>>>        function in SparkInterpreter.java returns
                >>>>        a FIFOScheduler, which makes the Spark
                >>>>        interpreter run Spark jobs one by one.
                >>>>        It's not a good experience when a couple
                >>>>        of users do some work on Zeppelin at the
                >>>>        same time, because they have to wait for
                >>>>        each other. At the same time,
                >>>>        SparkSqlInterpreter can choose which
                >>>>        scheduler to use via
                >>>>        "zeppelin.spark.concurrentSQL".
                >>>>        My question is: what considerations did
                >>>>        you base this decision on?
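The distinction linxi describes can be sketched as follows. This is an illustrative reconstruction, not the actual Zeppelin source: `SchedulerChoice` and the returned strings are made up here; the real code returns scheduler objects from the interpreter's scheduler factory.

```scala
import java.util.Properties

object SchedulerChoice {
  // SparkInterpreter always uses a FIFO scheduler, while
  // SparkSqlInterpreter can opt into a parallel scheduler when the
  // zeppelin.spark.concurrentSQL property is set to "true".
  def getScheduler(props: Properties): String = {
    val concurrentSQL = java.lang.Boolean.parseBoolean(
      props.getProperty("zeppelin.spark.concurrentSQL", "false"))
    if (concurrentSQL) "ParallelScheduler" else "FIFOScheduler"
  }
}
```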
                >>>
                >>>
                >>>
                >>>
                >>>
                




--
Sent from a mobile device. Excuse my thumbs.
