Could you explain a little bit more about the package changes you mean?

Thanks,
moon
On Mon, Aug 17, 2015 at 10:27 AM Pranav Agarwal <praag...@gmail.com> wrote:

> Any thoughts on how to package the changes related to Spark?
>
> On 17-Aug-2015 7:58 pm, "moon soo Lee" <m...@apache.org> wrote:
>
>> I think releasing SparkIMain and related objects after a configurable
>> period of inactivity would be good for now.
>>
>> About the scheduler, I can help implement such a scheduler.
>>
>> Thanks,
>> moon
>>
>> On Sun, Aug 16, 2015 at 11:54 PM Pranav Kumar Agarwal
>> <praag...@gmail.com> wrote:
>>
>>> Hi Moon,
>>>
>>> Yes, the notebook id comes from InterpreterContext. At the moment,
>>> destroying SparkIMain on deletion of a notebook is not handled. I think
>>> SparkIMain is a lightweight object; do you see a concern with keeping
>>> these objects in a map? One possible option could be to destroy
>>> notebook-related objects when the inactivity on a notebook is greater
>>> than, say, 8 hours.
>>>
>>> >> 4. Build a queue inside the interpreter to allow only one paragraph
>>> >> execution at a time per notebook.
>>>
>>> One downside of this approach is that the GUI will display RUNNING
>>> instead of PENDING for jobs queued inside the interpreter.
>>>
>>> Yes, that's a good point. Having a scheduler at the Zeppelin server that
>>> is parallel across notebooks and FIFO across paragraphs would be nice.
>>> Is there any plan for having such a scheduler?
>>>
>>> Regards,
>>> -Pranav.
>>>
>>> On 17/08/15 5:38 am, moon soo Lee wrote:
>>>
>>> Pranav, the proposal looks awesome!
>>>
>>> I have a question and some feedback.
>>>
>>> You said you tested 1, 2 and 3. To create a SparkIMain per notebook, you
>>> need the notebook id. Did you get it from InterpreterContext? Then how
>>> did you handle destroying the SparkIMain (when a notebook is deleted)?
>>> As far as I know, the interpreter is not able to get information about
>>> notebook deletion.
>>>
>>> >> 4. Build a queue inside the interpreter to allow only one paragraph
>>> >> execution at a time per notebook.
>>>
>>> One downside of this approach is that the GUI will display RUNNING
>>> instead of PENDING for jobs queued inside the interpreter.
>>>
>>> Best,
>>> moon
>>>
>>> On Sun, Aug 16, 2015 at 12:55 AM IT CTO <goi....@gmail.com> wrote:
>>>
>>>> +1 for "re-factoring the Zeppelin architecture so that it can handle
>>>> multi-tenancy easily"
>>>>
>>>> On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <doanduy...@gmail.com>
>>>> wrote:
>>>>
>>>>> Agree with Joel, we may think about re-factoring the Zeppelin
>>>>> architecture so that it can handle multi-tenancy easily. The technical
>>>>> solution proposed by Pranav is great, but it only applies to Spark.
>>>>> Right now, each interpreter has to manage multi-tenancy its own way.
>>>>> Ultimately, Zeppelin could propose a multi-tenancy contract (like a
>>>>> UserContext, similar to InterpreterContext) that each interpreter can
>>>>> choose to use or not.
>>>>>
>>>>> On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <djo...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I think that while the idea of running multiple notes simultaneously
>>>>>> is great, it is really dancing around the lack of true multi-user
>>>>>> support in Zeppelin. The proposed solution would work if the
>>>>>> application's resources were those of the whole cluster, but if the
>>>>>> app is limited (say it has 8 cores of 16, with some corresponding
>>>>>> share of memory) then potentially your note can hog all the resources
>>>>>> and the scheduler will have to throttle all other executions, leaving
>>>>>> you exactly where you are now.
>>>>>>
>>>>>> While I think the solution is a good one, maybe this question should
>>>>>> make us think about adding true multi-user support, where we isolate
>>>>>> resources (the cluster and the notebooks themselves), have separate
>>>>>> login/identity and (I don't know if it's possible) share the same
>>>>>> context.
>>>>>>
>>>>>> Thanks,
>>>>>> Joel
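
For concreteness, the "release SparkIMain and related objects after a
configurable period of inactivity" idea discussed at the top of this thread
could be bookkept roughly as below. This is a minimal sketch only:
NotebookSession, SessionRegistry and their methods are illustrative names,
not existing Zeppelin or Spark classes, and the real interpreter would hold
its SparkIMain and print streams where the placeholder comment is.

import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

// Hypothetical holder for per-notebook REPL state (SparkIMain, print
// streams, ...). close() is where those objects would be released.
class NotebookSession(val noteId: String) {
  @volatile var lastUsedMs: Long = System.currentTimeMillis()
  def close(): Unit = { /* release SparkIMain and related objects */ }
}

// Notebook id -> session, with eviction after a configurable idle period
// (e.g. 8 hours = 8 * 60 * 60 * 1000L milliseconds).
class SessionRegistry(maxIdleMs: Long) {
  private val sessions = new ConcurrentHashMap[String, NotebookSession]()

  // Look up (or lazily create) the session for a notebook and refresh its
  // last-used timestamp; called at the start of every paragraph run.
  def acquire(noteId: String): NotebookSession = synchronized {
    val s = Option(sessions.get(noteId)).getOrElse {
      val created = new NotebookSession(noteId)
      sessions.put(noteId, created)
      created
    }
    s.lastUsedMs = System.currentTimeMillis()
    s
  }

  // Called periodically (or on notebook deletion, once the interpreter can
  // learn about it) to drop sessions idle for longer than maxIdleMs.
  def evictIdle(): Unit = {
    val now = System.currentTimeMillis()
    for ((id, s) <- sessions.asScala if now - s.lastUsedMs > maxIdleMs) {
      sessions.remove(id, s)
      s.close()
    }
  }
}
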
>>>>>>
>>>>>> > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mindpri...@gmail.com>
>>>>>> > wrote:
>>>>>> >
>>>>>> > If the problem is that multiple users have to wait for each other
>>>>>> > while using Zeppelin, the solution already exists: they can create a
>>>>>> > new interpreter by going to the interpreter page and attach it to
>>>>>> > their notebook - then they don't have to wait for others to submit
>>>>>> > their jobs.
>>>>>> >
>>>>>> > But I agree, having paragraphs from one note wait for paragraphs
>>>>>> > from other notes is a confusing default. We can get around that in
>>>>>> > two ways:
>>>>>> >
>>>>>> > 1. Create a new interpreter for each note and attach that
>>>>>> > interpreter to that note. This approach requires the least amount of
>>>>>> > code changes, but it is resource-heavy and doesn't let you share the
>>>>>> > SparkContext between different notes.
>>>>>> > 2. If we want to share the SparkContext between different notes, we
>>>>>> > can submit jobs from different notes into different fair-scheduler
>>>>>> > pools (
>>>>>> > https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application
>>>>>> > ). This can be done by submitting jobs from different notes in
>>>>>> > different threads. This will make sure that jobs from one note run
>>>>>> > sequentially, while jobs from different notes can run in parallel.
>>>>>> >
>>>>>> > Neither of these options requires any change to the Spark code.
>>>>>> >
>>>>>> > --
>>>>>> > Thanks & Regards
>>>>>> > Rohit Agarwal
>>>>>> > https://www.linkedin.com/in/rohitagarwal003
>>>>>> >
>>>>>> > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal
>>>>>> > <praag...@gmail.com> wrote:
>>>>>> >
>>>>>> >>> If someone can share the idea of sharing a single SparkContext
>>>>>> >>> through multiple SparkILoops safely, it'll be really helpful.
>>>>>> >>
>>>>>> >> Here is a proposal:
>>>>>> >> 1. In the Spark code, change SparkIMain.scala to allow setting the
>>>>>> >> virtual directory. While creating new instances of SparkIMain per
>>>>>> >> notebook from the Zeppelin Spark interpreter, point all the
>>>>>> >> instances of SparkIMain to the same virtual directory.
>>>>>> >> 2. Start an HTTP server on that virtual directory and set this HTTP
>>>>>> >> server in the SparkContext using the classServerUri method.
>>>>>> >> 3. Scala-generated code has a notion of packages. The default
>>>>>> >> package name is "line$<linenumber>". The package name can be
>>>>>> >> controlled using the system property scala.repl.name.line. Setting
>>>>>> >> this property to the notebook id ensures that code generated by one
>>>>>> >> instance of SparkIMain is isolated from the other instances of
>>>>>> >> SparkIMain.
>>>>>> >> 4. Build a queue inside the interpreter to allow only one paragraph
>>>>>> >> execution at a time per notebook.
>>>>>> >>
>>>>>> >> I have tested 1, 2 and 3, and this seems to provide isolation
>>>>>> >> across class names. I'll work towards submitting a formal patch
>>>>>> >> soon - is there already a JIRA for this that I can pick up? Also, I
>>>>>> >> need to understand:
>>>>>> >> 1. How does Zeppelin pick up Spark fixes? Or do I need to first
>>>>>> >> work towards getting the Spark changes merged into Apache Spark on
>>>>>> >> GitHub?
>>>>>> >>
>>>>>> >> Any suggestions or comments on the proposal are highly welcome.
>>>>>> >>
>>>>>> >> Regards,
>>>>>> >> -Pranav.
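
Rohit's option 2 uses only documented Spark behaviour: enable the fair
scheduler and set the thread-local spark.scheduler.pool property before
submitting a note's jobs. The snippet below is a self-contained sketch of
that idea outside Zeppelin; the note ids, pool names and app name are
illustrative.

import org.apache.spark.{SparkConf, SparkContext}

object PoolPerNoteSketch {
  def main(args: Array[String]): Unit = {
    // FAIR mode lets jobs in different pools share the application's
    // resources instead of queueing strictly behind one another.
    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("pool-per-note-sketch")
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // One thread per note: spark.scheduler.pool is a thread-local
    // property, so jobs submitted from a thread land in that note's pool.
    val threads = Seq("note-1", "note-2").map { noteId =>
      new Thread(new Runnable {
        override def run(): Unit = {
          sc.setLocalProperty("spark.scheduler.pool", noteId)
          // Jobs submitted from this thread still run one after another,
          // but the two notes' jobs can now run concurrently.
          val evens = sc.parallelize(1 to 1000000, 8).filter(_ % 2 == 0).count()
          println(s"$noteId -> $evens even numbers")
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    sc.stop()
  }
}

Per-pool weights and minimum shares can then be declared in a
fairscheduler.xml allocation file, as described on the job-scheduling page
linked above.
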
>>>>>> >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>>>>> >>>
>>>>>> >>> Hi Piyush,
>>>>>> >>>
>>>>>> >>> A separate instance of SparkILoop/SparkIMain for each notebook,
>>>>>> >>> while sharing the SparkContext, sounds great.
>>>>>> >>>
>>>>>> >>> Actually, I tried to do it and found a problem: multiple
>>>>>> >>> SparkILoops can generate the same class names, and the Spark
>>>>>> >>> executors then confuse the class names, since they read classes
>>>>>> >>> from a single SparkContext.
>>>>>> >>>
>>>>>> >>> If someone can share the idea of sharing a single SparkContext
>>>>>> >>> through multiple SparkILoops safely, it'll be really helpful.
>>>>>> >>>
>>>>>> >>> Thanks,
>>>>>> >>> moon
>>>>>> >>>
>>>>>> >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform)
>>>>>> >>> <piyush.muk...@flipkart.com> wrote:
>>>>>> >>>
>>>>>> >>> Hi Moon,
>>>>>> >>> Any suggestion on it? We have to wait a lot when multiple people
>>>>>> >>> are working with Spark.
>>>>>> >>> Can we create separate instances of SparkILoop, SparkIMain and the
>>>>>> >>> print streams for each notebook, while sharing the SparkContext,
>>>>>> >>> ZeppelinContext, SQLContext and DependencyResolver, and then use
>>>>>> >>> the parallel scheduler?
>>>>>> >>> thanks
>>>>>> >>>
>>>>>> >>> -piyush
>>>>>> >>>
>>>>>> >>> Hi Moon,
>>>>>> >>>
>>>>>> >>> How about tracking a dedicated SparkContext for each notebook in
>>>>>> >>> Spark's remote interpreter - this would allow multiple users to
>>>>>> >>> run their Spark paragraphs in parallel. Also, within a notebook,
>>>>>> >>> only one paragraph is executed at a time.
>>>>>> >>>
>>>>>> >>> Regards,
>>>>>> >>> -Pranav.
>>>>>> >>>
>>>>>> >>>> On 15/07/15 7:15 pm, moon soo Lee wrote:
>>>>>> >>>> Hi,
>>>>>> >>>>
>>>>>> >>>> Thanks for asking the question.
>>>>>> >>>>
>>>>>> >>>> The reason is simply that it is running code statements. The
>>>>>> >>>> statements can have order and dependencies. Imagine I have two
>>>>>> >>>> paragraphs:
>>>>>> >>>>
>>>>>> >>>> %spark
>>>>>> >>>> val a = 1
>>>>>> >>>>
>>>>>> >>>> %spark
>>>>>> >>>> print(a)
>>>>>> >>>>
>>>>>> >>>> If they're not run one by one, they could run in any order, and
>>>>>> >>>> the output would not always be the same: either '1' or 'val a can
>>>>>> >>>> not be found'.
>>>>>> >>>>
>>>>>> >>>> This is the reason why. But if there is a nice idea for handling
>>>>>> >>>> this problem, I agree that using a parallel scheduler would help
>>>>>> >>>> a lot.
>>>>>> >>>>
>>>>>> >>>> Thanks,
>>>>>> >>>> moon
>>>>>> >>>>
>>>>>> >>>> On Tue, Jul 14, 2015 at 7:59 PM linxi zeng
>>>>>> >>>> <linxizeng0...@gmail.com> wrote:
>>>>>> >>>>
>>>>>> >>>> Anyone who has the same question as me? Or is this not a
>>>>>> >>>> question?
>>>>>> >>>>
>>>>>> >>>> 2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0...@gmail.com>:
>>>>>> >>>>
>>>>>> >>>> Hi, Moon:
>>>>>> >>>> I notice that the getScheduler function in SparkInterpreter.java
>>>>>> >>>> returns a FIFOScheduler, which makes the Spark interpreter run
>>>>>> >>>> Spark jobs one by one. It's not a good experience when a couple
>>>>>> >>>> of users work on Zeppelin at the same time, because they have to
>>>>>> >>>> wait for each other.
>>>>>> >>>> At the same time, SparkSqlInterpreter can choose which scheduler
>>>>>> >>>> to use via "zeppelin.spark.concurrentSQL".
>>>>>> >>>> My question is: what considerations is this decision based on?
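
The direction the thread converges on - keep FIFO ordering inside a
notebook, because paragraphs depend on each other, but run different
notebooks in parallel - can be sketched with one single-threaded executor
per note. This only illustrates the queueing idea from Pranav's step 4;
PerNoteFifoScheduler and its methods are made-up names, not Zeppelin's
actual Scheduler API.

import java.util.concurrent.{ConcurrentHashMap, ExecutorService, Executors}
import java.util.function.{Function => JFunction}

// One single-threaded executor per notebook: paragraphs of the same note
// run strictly in submission order (FIFO), while paragraphs of different
// notes run concurrently.
class PerNoteFifoScheduler {
  private val executors = new ConcurrentHashMap[String, ExecutorService]()

  def submit(noteId: String)(paragraph: => Unit): Unit = {
    val exec = executors.computeIfAbsent(noteId,
      new JFunction[String, ExecutorService] {
        override def apply(id: String): ExecutorService =
          Executors.newSingleThreadExecutor()
      })
    exec.submit(new Runnable { override def run(): Unit = paragraph })
  }

  // Shut down all per-note queues, e.g. when the interpreter closes.
  def shutdownAll(): Unit = {
    val it = executors.values().iterator()
    while (it.hasNext) it.next().shutdown()
  }
}

Combined with the fair-scheduler pools sketched earlier, this gives the
"parallel across notebooks, FIFO across paragraphs" behaviour discussed
above without changing how a single notebook behaves today.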