Could you explain a little bit more about the package changes you mean?

Thanks,
moon
On Mon, Aug 17, 2015 at 10:27 AM Pranav Agarwal <praag...@gmail.com> wrote:

> Any thoughts on how to package the changes related to Spark?
>
> On 17-Aug-2015 7:58 pm, "moon soo Lee" <m...@apache.org> wrote:
>
>> I think releasing SparkIMain and related objects after a configurable
>> period of inactivity would be good for now.
>>
>> About the scheduler, I can help implement such a scheduler.
>>
>> Thanks,
>> moon
>>
>> On Sun, Aug 16, 2015 at 11:54 PM Pranav Kumar Agarwal
>> <praag...@gmail.com> wrote:
>>
>>> Hi Moon,
>>>
>>> Yes, the notebook id comes from InterpreterContext. At the moment,
>>> destroying SparkIMain on deletion of a notebook is not handled. I think
>>> SparkIMain is a lightweight object; do you see a concern with keeping
>>> these objects in a map? One possible option could be to destroy
>>> notebook-related objects when the inactivity on a notebook is greater
>>> than, say, 8 hours.
>>>
>>> >> 4. Build a queue inside the interpreter to allow only one paragraph
>>> >> execution at a time per notebook.
>>>
>>> One downside of this approach is that the GUI will display RUNNING
>>> instead of PENDING for jobs queued inside the interpreter.
>>>
>>> Yes, that's a good point. Having a scheduler at the Zeppelin server that
>>> is parallel across notebooks and FIFO across paragraphs would be nice.
>>> Is there any plan for having such a scheduler?
>>>
>>> Regards,
>>> -Pranav.
>>>
>>> On 17/08/15 5:38 am, moon soo Lee wrote:
>>>
>>> Pranav, the proposal looks awesome!
>>>
>>> I have a question and some feedback.
>>>
>>> You said you tested 1, 2 and 3. To create a SparkIMain per notebook, you
>>> need the notebook id. Did you get it from InterpreterContext? Then how
>>> did you handle destroying the SparkIMain (when a notebook is deleted)?
>>> As far as I know, the interpreter is not able to get information about
>>> notebook deletion.
>>>
>>> >> 4. Build a queue inside the interpreter to allow only one paragraph
>>> >> execution at a time per notebook.
>>>
>>> One downside of this approach is that the GUI will display RUNNING
>>> instead of PENDING for jobs queued inside the interpreter.
>>>
>>> Best,
>>> moon
>>>
>>> On Sun, Aug 16, 2015 at 12:55 AM IT CTO <goi....@gmail.com> wrote:
>>>
>>>> +1 for "re-factoring the Zeppelin architecture so that it can handle
>>>> multi-tenancy easily"
>>>>
>>>> On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <doanduy...@gmail.com>
>>>> wrote:
>>>>
>>>>> Agree with Joel, we may think about re-factoring the Zeppelin
>>>>> architecture so that it can handle multi-tenancy easily. The technical
>>>>> solution proposed by Pranav is great, but it only applies to Spark.
>>>>> Right now, each interpreter has to manage multi-tenancy its own way.
>>>>> Ultimately, Zeppelin could propose a multi-tenancy contract (like a
>>>>> UserContext, similar to InterpreterContext) that each interpreter can
>>>>> choose to use or not.
>>>>>
>>>>> On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <djo...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I think that while the idea of running multiple notes simultaneously
>>>>>> is great, it is really dancing around the lack of true multi-user
>>>>>> support in Zeppelin. The proposed solution would work if the
>>>>>> application's resources were those of the whole cluster, but if the
>>>>>> app is limited (say it has 8 cores of 16, with some corresponding
>>>>>> share of memory) then potentially your note can hog all the resources
>>>>>> and the scheduler will have to throttle all other executions, leaving
>>>>>> you exactly where you are now.
>>>>>>
>>>>>> While I think the solution is a good one, maybe this question should
>>>>>> make us think about adding true multi-user support, where we isolate
>>>>>> resources (the cluster and the notebooks themselves), have separate
>>>>>> login/identity and (I don't know if it's possible) share the same
>>>>>> context.
>>>>>>
>>>>>> Thanks,
>>>>>> Joel
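
For concreteness, the "release SparkIMain and related objects after a
configurable period of inactivity" idea discussed at the top of this thread
could be bookkept roughly as below. This is a minimal sketch only:
NotebookSession, SessionRegistry and their methods are illustrative names,
not existing Zeppelin or Spark classes, and the real interpreter would hold
its SparkIMain and print streams where the placeholder comment is.

import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

// Hypothetical holder for per-notebook REPL state (SparkIMain, print
// streams, ...). close() is where those objects would be released.
class NotebookSession(val noteId: String) {
  @volatile var lastUsedMs: Long = System.currentTimeMillis()
  def close(): Unit = { /* release SparkIMain and related objects */ }
}

// Notebook id -> session, with eviction after a configurable idle period
// (e.g. 8 hours = 8 * 60 * 60 * 1000L milliseconds).
class SessionRegistry(maxIdleMs: Long) {
  private val sessions = new ConcurrentHashMap[String, NotebookSession]()

  // Look up (or lazily create) the session for a notebook and refresh its
  // last-used timestamp; called at the start of every paragraph run.
  def acquire(noteId: String): NotebookSession = synchronized {
    val s = Option(sessions.get(noteId)).getOrElse {
      val created = new NotebookSession(noteId)
      sessions.put(noteId, created)
      created
    }
    s.lastUsedMs = System.currentTimeMillis()
    s
  }

  // Called periodically (or on notebook deletion, once the interpreter can
  // learn about it) to drop sessions idle for longer than maxIdleMs.
  def evictIdle(): Unit = {
    val now = System.currentTimeMillis()
    for ((id, s) <- sessions.asScala if now - s.lastUsedMs > maxIdleMs) {
      sessions.remove(id, s)
      s.close()
    }
  }
}
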
>>>>>>
>>>>>> > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mindpri...@gmail.com>
>>>>>> > wrote:
>>>>>> >
>>>>>> > If the problem is that multiple users have to wait for each other
>>>>>> > while using Zeppelin, the solution already exists: they can create a
>>>>>> > new interpreter by going to the interpreter page and attach it to
>>>>>> > their notebook - then they don't have to wait for others to submit
>>>>>> > their jobs.
>>>>>> >
>>>>>> > But I agree, having paragraphs from one note wait for paragraphs
>>>>>> > from other notes is a confusing default. We can get around that in
>>>>>> > two ways:
>>>>>> >
>>>>>> > 1. Create a new interpreter for each note and attach that
>>>>>> > interpreter to that note. This approach requires the least amount of
>>>>>> > code changes, but it is resource-heavy and doesn't let you share the
>>>>>> > SparkContext between different notes.
>>>>>> > 2. If we want to share the SparkContext between different notes, we
>>>>>> > can submit jobs from different notes into different fair-scheduler
>>>>>> > pools (
>>>>>> > https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application
>>>>>> > ). This can be done by submitting jobs from different notes in
>>>>>> > different threads. This will make sure that jobs from one note run
>>>>>> > sequentially, while jobs from different notes can run in parallel.
>>>>>> >
>>>>>> > Neither of these options requires any change to the Spark code.
>>>>>> >
>>>>>> > --
>>>>>> > Thanks & Regards
>>>>>> > Rohit Agarwal
>>>>>> > https://www.linkedin.com/in/rohitagarwal003
>>>>>> >
>>>>>> > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal
>>>>>> > <praag...@gmail.com> wrote:
>>>>>> >
>>>>>> >>> If someone can share the idea of sharing a single SparkContext
>>>>>> >>> through multiple SparkILoops safely, it'll be really helpful.
>>>>>> >>
>>>>>> >> Here is a proposal:
>>>>>> >> 1. In the Spark code, change SparkIMain.scala to allow setting the
>>>>>> >> virtual directory. While creating new instances of SparkIMain per
>>>>>> >> notebook from the Zeppelin Spark interpreter, point all the
>>>>>> >> instances of SparkIMain to the same virtual directory.
>>>>>> >> 2. Start an HTTP server on that virtual directory and set this HTTP
>>>>>> >> server in the SparkContext using the classServerUri method.
>>>>>> >> 3. Scala-generated code has a notion of packages. The default
>>>>>> >> package name is "line$<linenumber>". The package name can be
>>>>>> >> controlled using the system property scala.repl.name.line. Setting
>>>>>> >> this property to the notebook id ensures that code generated by one
>>>>>> >> instance of SparkIMain is isolated from the other instances of
>>>>>> >> SparkIMain.
>>>>>> >> 4. Build a queue inside the interpreter to allow only one paragraph
>>>>>> >> execution at a time per notebook.
>>>>>> >>
>>>>>> >> I have tested 1, 2 and 3, and this seems to provide isolation
>>>>>> >> across class names. I'll work towards submitting a formal patch
>>>>>> >> soon - is there already a JIRA for this that I can pick up? Also, I
>>>>>> >> need to understand:
>>>>>> >> 1. How does Zeppelin pick up Spark fixes? Or do I need to first
>>>>>> >> work towards getting the Spark changes merged into Apache Spark on
>>>>>> >> GitHub?
>>>>>> >>
>>>>>> >> Any suggestions or comments on the proposal are highly welcome.
>>>>>> >>
>>>>>> >> Regards,
>>>>>> >> -Pranav.
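
Rohit's option 2 uses only documented Spark behaviour: enable the fair
scheduler and set the thread-local spark.scheduler.pool property before
submitting a note's jobs. The snippet below is a self-contained sketch of
that idea outside Zeppelin; the note ids, pool names and app name are
illustrative.

import org.apache.spark.{SparkConf, SparkContext}

object PoolPerNoteSketch {
  def main(args: Array[String]): Unit = {
    // FAIR mode lets jobs in different pools share the application's
    // resources instead of queueing strictly behind one another.
    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("pool-per-note-sketch")
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // One thread per note: spark.scheduler.pool is a thread-local
    // property, so jobs submitted from a thread land in that note's pool.
    val threads = Seq("note-1", "note-2").map { noteId =>
      new Thread(new Runnable {
        override def run(): Unit = {
          sc.setLocalProperty("spark.scheduler.pool", noteId)
          // Jobs submitted from this thread still run one after another,
          // but the two notes' jobs can now run concurrently.
          val evens = sc.parallelize(1 to 1000000, 8).filter(_ % 2 == 0).count()
          println(s"$noteId -> $evens even numbers")
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    sc.stop()
  }
}

Per-pool weights and minimum shares can then be declared in a
fairscheduler.xml allocation file, as described on the job-scheduling page
linked above.
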
>>>>>> >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>>>>> >>>
>>>>>> >>> Hi Piyush,
>>>>>> >>>
>>>>>> >>> A separate instance of SparkILoop/SparkIMain for each notebook,
>>>>>> >>> while sharing the SparkContext, sounds great.
>>>>>> >>>
>>>>>> >>> Actually, I tried to do it and found a problem: multiple
>>>>>> >>> SparkILoops can generate the same class names, and the Spark
>>>>>> >>> executors then confuse the class names, since they read classes
>>>>>> >>> from a single SparkContext.
>>>>>> >>>
>>>>>> >>> If someone can share the idea of sharing a single SparkContext
>>>>>> >>> through multiple SparkILoops safely, it'll be really helpful.
>>>>>> >>>
>>>>>> >>> Thanks,
>>>>>> >>> moon
>>>>>> >>>
>>>>>> >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform)
>>>>>> >>> <piyush.muk...@flipkart.com> wrote:
>>>>>> >>>
>>>>>> >>> Hi Moon,
>>>>>> >>> Any suggestion on it? We have to wait a lot when multiple people
>>>>>> >>> are working with Spark.
>>>>>> >>> Can we create separate instances of SparkILoop, SparkIMain and the
>>>>>> >>> print streams for each notebook, while sharing the SparkContext,
>>>>>> >>> ZeppelinContext, SQLContext and DependencyResolver, and then use
>>>>>> >>> the parallel scheduler?
>>>>>> >>> thanks
>>>>>> >>>
>>>>>> >>> -piyush
>>>>>> >>>
>>>>>> >>> Hi Moon,
>>>>>> >>>
>>>>>> >>> How about tracking a dedicated SparkContext for each notebook in
>>>>>> >>> Spark's remote interpreter - this would allow multiple users to
>>>>>> >>> run their Spark paragraphs in parallel. Also, within a notebook,
>>>>>> >>> only one paragraph is executed at a time.
>>>>>> >>>
>>>>>> >>> Regards,
>>>>>> >>> -Pranav.
>>>>>> >>>
>>>>>> >>>> On 15/07/15 7:15 pm, moon soo Lee wrote:
>>>>>> >>>> Hi,
>>>>>> >>>>
>>>>>> >>>> Thanks for asking the question.
>>>>>> >>>>
>>>>>> >>>> The reason is simply that it is running code statements. The
>>>>>> >>>> statements can have order and dependencies. Imagine I have two
>>>>>> >>>> paragraphs:
>>>>>> >>>>
>>>>>> >>>> %spark
>>>>>> >>>> val a = 1
>>>>>> >>>>
>>>>>> >>>> %spark
>>>>>> >>>> print(a)
>>>>>> >>>>
>>>>>> >>>> If they're not run one by one, they could run in any order, and
>>>>>> >>>> the output would not always be the same: either '1' or 'val a can
>>>>>> >>>> not be found'.
>>>>>> >>>>
>>>>>> >>>> This is the reason why. But if there is a nice idea for handling
>>>>>> >>>> this problem, I agree that using a parallel scheduler would help
>>>>>> >>>> a lot.
>>>>>> >>>>
>>>>>> >>>> Thanks,
>>>>>> >>>> moon
>>>>>> >>>>
>>>>>> >>>> On Tue, Jul 14, 2015 at 7:59 PM linxi zeng
>>>>>> >>>> <linxizeng0...@gmail.com> wrote:
>>>>>> >>>>
>>>>>> >>>> Anyone who has the same question as me? Or is this not a
>>>>>> >>>> question?
>>>>>> >>>>
>>>>>> >>>> 2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0...@gmail.com>:
>>>>>> >>>>
>>>>>> >>>> Hi, Moon:
>>>>>> >>>> I notice that the getScheduler function in SparkInterpreter.java
>>>>>> >>>> returns a FIFOScheduler, which makes the Spark interpreter run
>>>>>> >>>> Spark jobs one by one. It's not a good experience when a couple
>>>>>> >>>> of users work on Zeppelin at the same time, because they have to
>>>>>> >>>> wait for each other.
>>>>>> >>>> At the same time, SparkSqlInterpreter can choose which scheduler
>>>>>> >>>> to use via "zeppelin.spark.concurrentSQL".
>>>>>> >>>> My question is: what considerations is this decision based on?
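
The direction the thread converges on - keep FIFO ordering inside a
notebook, because paragraphs depend on each other, but run different
notebooks in parallel - can be sketched with one single-threaded executor
per note. This only illustrates the queueing idea from Pranav's step 4;
PerNoteFifoScheduler and its methods are made-up names, not Zeppelin's
actual Scheduler API.

import java.util.concurrent.{ConcurrentHashMap, ExecutorService, Executors}
import java.util.function.{Function => JFunction}

// One single-threaded executor per notebook: paragraphs of the same note
// run strictly in submission order (FIFO), while paragraphs of different
// notes run concurrently.
class PerNoteFifoScheduler {
  private val executors = new ConcurrentHashMap[String, ExecutorService]()

  def submit(noteId: String)(paragraph: => Unit): Unit = {
    val exec = executors.computeIfAbsent(noteId,
      new JFunction[String, ExecutorService] {
        override def apply(id: String): ExecutorService =
          Executors.newSingleThreadExecutor()
      })
    exec.submit(new Runnable { override def run(): Unit = paragraph })
  }

  // Shut down all per-note queues, e.g. when the interpreter closes.
  def shutdownAll(): Unit = {
    val it = executors.values().iterator()
    while (it.hasNext) it.next().shutdown()
  }
}

Combined with the fair-scheduler pools sketched earlier, this gives the
"parallel across notebooks, FIFO across paragraphs" behaviour discussed
above without changing how a single notebook behaves today.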