Hi,
The code is available here:
https://github.com/piyush-mukati/incubator-zeppelin/tree/parallel_scheduler_support_spark


Some testing work is still left.

On Wed, Jan 13, 2016 at 11:47 PM, Dimp Bhat <dimp201...@gmail.com> wrote:

> Hi Pranav,
> When do you plan to send out the code for running notebooks in parallel ?
>
> Dimple
>
> On Tue, Nov 17, 2015 at 3:27 AM, Pranav Kumar Agarwal <praag...@gmail.com>
> wrote:
>
>> Hi Rohit,
>>
>> We implemented the proposal and are able to run Zeppelin as a hosted
>> service inside our organization. Our internal forked version has pluggable
>> authentication and type-ahead.
>>
>> I need to port the work to the latest codebase and strip out the
>> authentication changes. We'll be submitting it soon.
>>
>> We'll target to get this out for review by 11/26.
>>
>> Regards,
>> -Pranav.
>>
>>
>>
>> On 17/11/15 4:34 am, Rohit Agarwal wrote:
>>
>> Hey Pranav,
>>
>> Did you make any progress on this?
>>
>> --
>> Rohit
>>
>> On Sunday, August 16, 2015, moon soo Lee <m...@apache.org> wrote:
>>
>>> Pranav, proposal looks awesome!
>>>
>>> I have a question and feedback,
>>>
>>> You said you tested 1, 2, and 3. To create a SparkIMain per notebook, you
>>> need the notebook id. Did you get it from the InterpreterContext?
>>> And how did you handle destroying the SparkIMain (when a notebook is
>>> deleted)?
>>> As far as I know, the interpreter has no way to get notified of notebook
>>> deletion.
>>>
>>> >> 4. Build a queue inside interpreter to allow only one paragraph
>>> execution
>>> >> at a time per notebook.
>>>
>>> One downside of this approach is that the GUI will display RUNNING instead
>>> of PENDING for jobs queued inside the interpreter.
>>>
>>> Best,
>>> moon
>>>
>>> On Sun, Aug 16, 2015 at 12:55 AM IT CTO <goi....@gmail.com> wrote:
>>>
>>>> +1 for "to re-factor the Zeppelin architecture so that it can handle
>>>> multi-tenancy easily"
>>>>
>>>> On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <doanduy...@gmail.com>
>>>> wrote:
>>>>
>>>>> Agree with Joel, we may want to re-factor the Zeppelin architecture
>>>>> so that it can handle multi-tenancy easily. The technical solution
>>>>> proposed by Pranav is great, but it only applies to Spark. Right now,
>>>>> each interpreter has to manage multi-tenancy its own way. Ultimately,
>>>>> Zeppelin could offer a multi-tenancy contract (like a UserContext,
>>>>> similar to InterpreterContext) that each interpreter can choose whether
>>>>> to use.
>>>>>
>>>>>
>>>>> On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <djo...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> While the idea of running multiple notes simultaneously is great, it
>>>>>> is really dancing around the lack of true multi-user support in
>>>>>> Zeppelin. The proposed solution works if the application's resources
>>>>>> are those of the whole cluster, but if the app is limited (say it has
>>>>>> 8 cores of 16, with some distribution in memory), then your note can
>>>>>> potentially hog all the resources and the scheduler will have to
>>>>>> throttle all other executions, leaving you exactly where you are now.
>>>>>> While I think the solution is a good one, maybe this question should
>>>>>> make us think about adding true multi-user support,
>>>>>> where we isolate resources (cluster and the notebooks themselves),
>>>>>> have separate login/identity and (I don't know if it's possible) share
>>>>>> the same context.
>>>>>>
>>>>>> Thanks,
>>>>>> Joel
>>>>>>
>>>>>> > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mindpri...@gmail.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > If the problem is that multiple users have to wait for each other
>>>>>> > while using Zeppelin, the solution already exists: they can create a
>>>>>> > new interpreter on the interpreter page and attach it to their
>>>>>> > notebook - then they don't have to wait for others to submit their
>>>>>> > jobs.
>>>>>> >
>>>>>> > But I agree, having paragraphs from one note wait for paragraphs from
>>>>>> > other notes is a confusing default. We can get around that in two
>>>>>> > ways:
>>>>>> >
>>>>>> >   1. Create a new interpreter for each note and attach that
>>>>>> >   interpreter to that note. This approach requires the least amount
>>>>>> >   of code changes, but it is resource-heavy and doesn't let you share
>>>>>> >   the SparkContext between different notes.
>>>>>> >   2. If we want to share the SparkContext between different notes, we
>>>>>> >   can submit jobs from different notes into different fair-scheduler
>>>>>> >   pools (
>>>>>> https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application
>>>>>> >   ). This can be done by submitting jobs from different notes in
>>>>>> >   different threads. This makes sure that jobs from one note run
>>>>>> >   sequentially, but jobs from different notes are able to run in
>>>>>> >   parallel.
>>>>>> >
>>>>>> > Neither of these options requires any change to the Spark code.
>>>>>> >
>>>>>> > --
>>>>>> > Thanks & Regards
>>>>>> > Rohit Agarwal
>>>>>> > https://www.linkedin.com/in/rohitagarwal003
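Option 2 above can be sketched roughly as follows. This is a minimal, illustrative sketch, not tested code: it assumes the fair scheduler is enabled (`spark.scheduler.mode=FAIR`), and `runInPool` and the `"note-" + noteId` pool-naming convention are made-up names for illustration. The key Spark facts it relies on are that `setLocalProperty` is per-thread and that `spark.scheduler.pool` selects the fair-scheduler pool for jobs submitted from that thread.

```java
import org.apache.spark.api.java.JavaSparkContext;

public class PoolPerNote {
    // Sketch: route each note's jobs into its own fair-scheduler pool so
    // notes run in parallel while sharing one SparkContext.
    public static Thread runInPool(JavaSparkContext sc, String noteId, Runnable job) {
        // setLocalProperty is thread-local, so each note's jobs must be
        // submitted from a dedicated thread for the pool to take effect.
        Thread t = new Thread(() -> {
            sc.setLocalProperty("spark.scheduler.pool", "note-" + noteId);
            job.run();  // submit Spark actions for this note here
        });
        t.start();
        return t;
    }
}
```

Jobs submitted inside `job.run()` inherit the pool set on that thread, so two notes calling `runInPool` concurrently get scheduled in separate pools.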
>>>>>> >
>>>>>> > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal <
>>>>>> praag...@gmail.com>
>>>>>> > wrote:
>>>>>> >
>>>>>> >>> If someone can share ideas on safely sharing a single SparkContext
>>>>>> >>> across multiple SparkILoop instances, it'd be really helpful.
>>>>>> >> Here is a proposal:
>>>>>> >> 1. In the Spark code, change SparkIMain.scala to allow setting the
>>>>>> >> virtual directory. When creating new instances of SparkIMain per
>>>>>> >> notebook from the Zeppelin Spark interpreter, point all the
>>>>>> >> instances of SparkIMain at the same virtual directory.
>>>>>> >> 2. Start an HTTP server on that virtual directory and register this
>>>>>> >> HTTP server in the SparkContext using the classServerUri method.
>>>>>> >> 3. Scala-generated code has a notion of packages. The default
>>>>>> >> package name is "line$<linenumber>". The package name can be
>>>>>> >> controlled using the system property scala.repl.name.line. Setting
>>>>>> >> this property to the notebook id ensures that code generated by
>>>>>> >> individual instances of SparkIMain is isolated from other instances
>>>>>> >> of SparkIMain.
>>>>>> >> 4. Build a queue inside the interpreter to allow only one paragraph
>>>>>> >> execution at a time per notebook.
>>>>>> >>
>>>>>> >> I have tested 1, 2, and 3, and it seems to provide isolation across
>>>>>> >> class names. I'll work towards submitting a formal patch soon - is
>>>>>> >> there already a Jira for this that I can pick up? Also, I need to
>>>>>> >> understand:
>>>>>> >> 1. How does Zeppelin take up Spark fixes? Or do I need to first get
>>>>>> >> the Spark changes merged into Apache Spark on GitHub?
>>>>>> >>
>>>>>> >> Any suggestions or comments on the proposal are highly welcome.
>>>>>> >>
>>>>>> >> Regards,
>>>>>> >> -Pranav.
>>>>>> >>
>>>>>> >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>>>>> >>>
>>>>>> >>> Hi piyush,
>>>>>> >>>
>>>>>> >>> A separate instance of SparkILoop/SparkIMain for each notebook
>>>>>> >>> while sharing the SparkContext sounds great.
>>>>>> >>>
>>>>>> >>> Actually, I tried to do it and found a problem: multiple SparkILoop
>>>>>> >>> instances can generate the same class name, and the Spark executor
>>>>>> >>> confuses class names since they're reading classes from a single
>>>>>> >>> SparkContext.
>>>>>> >>>
>>>>>> >>> If someone can share ideas on safely sharing a single SparkContext
>>>>>> >>> across multiple SparkILoop instances, it'd be really helpful.
>>>>>> >>>
>>>>>> >>> Thanks,
>>>>>> >>> moon
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <
>>>>>> >>> piyush.muk...@flipkart.com> wrote:
>>>>>> >>>
>>>>>> >>>    Hi Moon,
>>>>>> >>>    Any suggestions on this? We have to wait a lot when multiple
>>>>>> >>> people are working with Spark.
>>>>>> >>>    Can we create a separate instance of SparkILoop, SparkIMain and
>>>>>> >>> print streams for each notebook, while sharing the SparkContext,
>>>>>> >>> ZeppelinContext, SQLContext and DependencyResolver, and then use
>>>>>> >>> the parallel scheduler?
>>>>>> >>>    thanks
>>>>>> >>>
>>>>>> >>>    -piyush
>>>>>> >>>
>>>>>> >>>    Hi Moon,
>>>>>> >>>
>>>>>> >>>    How about tracking a dedicated SparkContext per notebook in
>>>>>> >>>    Spark's remote interpreter - this will allow multiple users to
>>>>>> >>>    run their Spark paragraphs in parallel. Also, within a notebook
>>>>>> >>>    only one paragraph is executed at a time.
>>>>>> >>>
>>>>>> >>>    Regards,
>>>>>> >>>    -Pranav.
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
>>>>>> >>>> Hi,
>>>>>> >>>>
>>>>>> >>>> Thanks for the question.
>>>>>> >>>>
>>>>>> >>>> The reason is simply that it is running code statements. The
>>>>>> >>>> statements can have order and dependencies. Imagine I have two
>>>>>> >>> paragraphs:
>>>>>> >>>>
>>>>>> >>>> %spark
>>>>>> >>>> val a = 1
>>>>>> >>>>
>>>>>> >>>> %spark
>>>>>> >>>> print(a)
>>>>>> >>>>
>>>>>> >>>> If they're not run one by one, they may run in random order and
>>>>>> >>>> the output will differ from run to run: either '1' or
>>>>>> >>>> 'not found: value a'.
>>>>>> >>>>
>>>>>> >>>> This is the reason why. But if there is a nice idea to handle this
>>>>>> >>>> problem, I agree that using a parallel scheduler would help a lot.
>>>>>> >>>>
>>>>>> >>>> Thanks,
>>>>>> >>>> moon
>>>>>> >>>> On Tue, Jul 14, 2015 at 7:59 PM linxi zeng
>>>>>> >>>> <linxizeng0...@gmail.com> wrote:
>>>>>> >>>>
>>>>>> >>>>    Is there anyone who has the same question as me? Or is this
>>>>>> >>> not a question?
>>>>>> >>>>
>>>>>> >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng <
>>>>>> linxizeng0...@gmail.com
>>>>>> >>> <mailto:linxizeng0...@gmail.com>
>>>>>> >>>>    <mailto:linxizeng0...@gmail.com  <mailto:
>>>>>> >>> linxizeng0...@gmail.com>>>:
>>>>>> >>>>
>>>>>> >>>>        hi, Moon:
>>>>>> >>>>           I notice that the getScheduler function in
>>>>>> >>>>        SparkInterpreter.java returns a FIFOScheduler, which makes
>>>>>> >>>>        the Spark interpreter run Spark jobs one by one. It's not a
>>>>>> >>>>        good experience when a couple of users do some work on
>>>>>> >>>>        Zeppelin at the same time, because they have to wait for
>>>>>> >>>>        each other. At the same time, SparkSqlInterpreter can
>>>>>> >>>>        choose which scheduler to use via
>>>>>> >>>>        "zeppelin.spark.concurrentSQL".
>>>>>> >>>>        My question is: what considerations was this decision
>>>>>> >>>>        based on?
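The trade-off running through this whole thread - FIFO within a notebook, parallelism across notebooks - can be modeled with one single-threaded executor per note. This is an illustrative sketch, not Zeppelin's actual FIFOScheduler/ParallelScheduler code; `PerNoteScheduler` and its method names are invented here.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerNoteScheduler {
    // One single-threaded executor per note: paragraphs of the same note run
    // in submission order, while different notes proceed in parallel.
    private final Map<String, ExecutorService> pools = new ConcurrentHashMap<>();

    public Future<?> submit(String noteId, Runnable paragraph) {
        return pools
            .computeIfAbsent(noteId, id -> Executors.newSingleThreadExecutor())
            .submit(paragraph);
    }

    public void shutdown() {
        pools.values().forEach(ExecutorService::shutdown);
    }

    public static void main(String[] args) throws Exception {
        PerNoteScheduler sched = new PerNoteScheduler();
        List<String> log = new CopyOnWriteArrayList<>();
        // Two paragraphs of note A must keep their order; note B is independent.
        Future<?> a1 = sched.submit("A", () -> log.add("A1"));
        Future<?> a2 = sched.submit("A", () -> log.add("A2"));
        Future<?> b1 = sched.submit("B", () -> log.add("B1"));
        a1.get(); a2.get(); b1.get();
        sched.shutdown();
        // A1 always precedes A2, regardless of where B1 lands.
        System.out.println(log.indexOf("A1") < log.indexOf("A2"));
    }
}
```

Running the demo prints `true`: the single-threaded pool serializes note A's paragraphs, while note B's paragraph may interleave anywhere.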
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> ------------------------------------------------------------------------------------------------------------------------------------------
>>>>>> >>>
>>>>>> >>>    This email and any files transmitted with it are confidential
>>>>>> and
>>>>>> >>>    intended solely for the use of the individual or entity to whom
>>>>>> >>>    they are addressed. If you have received this email in error
>>>>>> >>>    please notify the system manager. This message contains
>>>>>> >>>    confidential information and is intended only for the
>>>>>> individual
>>>>>> >>>    named. If you are not the named addressee you should not
>>>>>> >>>    disseminate, distribute or copy this e-mail. Please notify the
>>>>>> >>>    sender immediately by e-mail if you have received this e-mail
>>>>>> by
>>>>>> >>>    mistake and delete this e-mail from your system. If you are not
>>>>>> >>>    the intended recipient you are notified that disclosing,
>>>>>> copying,
>>>>>> >>>    distributing or taking any action in reliance on the contents
>>>>>> of
>>>>>> >>>    this information is strictly prohibited. Although Flipkart has
>>>>>> >>>    taken reasonable precautions to ensure no viruses are present
>>>>>> in
>>>>>> >>>    this email, the company cannot accept responsibility for any
>>>>>> loss
>>>>>> >>>    or damage arising from the use of this email or attachments
>>>>>> >>
>>>>>>
>>>>>
>>>>>
>>
>> --
>> Sent from a mobile device. Excuse my thumbs.
>>
>>
>>
>
