Re: [DISCUSS] Update Roadmap

Eran Witkon Tue, 01 Mar 2016 00:13:28 -0800

One point to clarify: I'm not suggesting Oozie specifically. I want us to
think about which features we develop ourselves and which ones we get by
integrating external, preferably Apache, technology. We don't think about
building our own storage services, so why build our own scheduler?
Eran
On Tue, 1 Mar 2016 at 09:49 moon soo Lee <m...@apache.org> wrote:


> @Vinayak, @Eran, @Benjamin, @Guilherme, @Sourav, @Rick
> Now I can see a lot of demand for enterprise-level job scheduling.
> Whether it is external or built-in, I completely agree with having
> enterprise-level job scheduling support on the roadmap.
> ZEPPELIN-137 <https://issues.apache.org/jira/browse/ZEPPELIN-137> and
> ZEPPELIN-531 <https://issues.apache.org/jira/browse/ZEPPELIN-531> are the
> related issues I can find in our JIRA.
>
> @Vinayak
> Regarding importing notebooks from GitHub, Zeppelin has a pluggable notebook
> storage layer (see the related package
> <https://github.com/apache/incubator-zeppelin/tree/master/zeppelin-zengine/src/main/java/org/apache/zeppelin/notebook/repo>).
> So GitHub notebook sync can be implemented easily.
>
> @Shabeel
> Right, we need better memory management to prevent such OOMs.
> And I think a table is one of the most frequently used ways of displaying
> data, so we'll definitely need more features like filter, sort, etc.
> After this roadmap discussion, a discussion of the next release will
> follow. Then we'll get an idea of when those features will be available.
>
> @Prasad
> Thanks for mentioning HA and DR. They're really important subjects for
> enterprise use. Zeppelin will definitely need to address them.
> And displaying notebook meta information on the top-level page is a good
> idea.
>
> It's really great to hear so many opinions and ideas.
> And thanks @Rick for sharing a valuable view of the Zeppelin project.
>
> Thanks,
> moon
>
>
> On Mon, Feb 29, 2016 at 11:14 PM Rick Moritz <rah...@gmail.com> wrote:
>
>> Hi,
>>
>> For one, I know that there is rudimentary scheduling built into Zeppelin
>> already (at least I fixed a bug in the test for a scheduling feature a few
>> months ago).
>> But another point is that Zeppelin should also focus on quality,
>> reproducibility and portability.
>> Although this doesn't offer exciting new features, it would make
>> development much easier.
>>
>> Cross-platform testability, tests that pass when run sequentially,
>> compatibility with Firefox, and many more open issues that make it so much
>> harder to enhance Zeppelin and add features should be addressed soon,
>> preferably before more features are added. Already Zeppelin is suffering -
>> in my opinion - from quite a lot of feature creep, and we should avoid
>> putting in the kitchen sink, at the cost of quality and maintainability.
>> Instead modularity (ZEPPELIN-533 in particular) should be targeted.
>>
>> Oozie, in my opinion, is a dead end - it may de-facto still be in use on
>> many clusters, but it's not getting the love it needs, and I wouldn't bet
>> on it, when it comes to integrating scheduling. Instead, any external tool
>> should be able to use the REST-API to trigger executions, if you want
>> external scheduling.
>>
>> So, in conclusion, if we take Moon's list as a list of descending
>> priorities, I fully agree, under the condition that code quality is
>> included as a subset of enterprise-readiness. Auth* is paramount (Kerberos
>> SPNEGO SSO support is what we really want) with user and group rights
>> assignment on the notebook level. We probably also need Knox-integration
>> (ODP-Members looking at integrating Zeppelin should consider contributing
>> this), and integration of something like Spree (
>> https://github.com/hammerlab/spree) to be able to profile jobs.
>>
>> I'm hopeful that soon I can resume contributing some quality-oriented
>> code, to drive this "necessary evil" forward ;)
>>
>> On Mon, Feb 29, 2016 at 8:27 PM, Sourav Mazumder <
>> sourav.mazumde...@gmail.com> wrote:
>>
>>> I do agree with Vinayak. It need not be coupled with Oozie.
>>>
>>> Rather, one should be able to call it from any scheduler typically used
>>> at the enterprise level. Maybe support for BPML.
>>>
>>> I believe the existing ability to call/execute a Zeppelin Notebook or a
>>> specific paragraph within a notebook using REST API should take care of
>>> this requirement to some extent.
>>>
>>> Regards,
>>> Sourav
>>>
>>> On Mon, Feb 29, 2016 at 11:23 AM, Vinayak Agrawal <
>>> vinayakagrawa...@gmail.com> wrote:
>>>
>>>> @Eran Witkon,
>>>> Thanks for the suggestion, Eran. I concur with your thought.
>>>> If Zeppelin can be integrated with Oozie, that would be wonderful. Users
>>>> will also be able to leverage their Oozie skills.
>>>> This would be promising for now.
>>>> However, in the future Hadoop might not necessarily be installed in a
>>>> Spark cluster, and Oozie (since it installs with a Hadoop distribution)
>>>> might not be available.
>>>> So perhaps we should give some thought to this feature for the future.
>>>> Should it depend on Oozie, or should Zeppelin have its own scheduling?
>>>>
>>>> As Benjamin has reiterated, the Databricks notebook has this as a core
>>>> notebook feature.
>>>>
>>>>
>>>> Also, would anybody have any suggestions regarding a "sync with GitHub"
>>>> feature?
>>>> - Exporting notebooks to GitHub
>>>> - Importing notebooks from GitHub
>>>>
>>>> Thanks
>>>> Vinayak
>>>>
>>>>
>>>> On Mon, Feb 29, 2016 at 4:17 AM, Eran Witkon <eranwit...@gmail.com>
>>>> wrote:
>>>>
>>>>> @Vinayak Agrawal I would suggest adding the ability to connect
>>>>> Zeppelin to existing scheduling/workflow tools such as
>>>>> https://oozie.apache.org/. This requires better hooks and status
>>>>> reporting but doesn't make Zeppelin an ETL/scheduler tool by itself.
>>>>>
>>>>>
>>>>> On Mon, Feb 29, 2016 at 10:21 AM Vinayak Agrawal <
>>>>> vinayakagrawa...@gmail.com> wrote:
>>>>>
>>>>>> Moon,
>>>>>> The new roadmap looks very promising. I am very happy to see security
>>>>>> in the list.
>>>>>> I have some suggestions regarding Enterprise Ready features:
>>>>>>
>>>>>> 1. Job Scheduler - Can this be improved?
>>>>>> Currently the scheduler can be used with a cron expression or a pre-set
>>>>>> time. But in an enterprise solution, a notebook might be one piece of the
>>>>>> workflow. Can we look towards scheduling notebooks based on other
>>>>>> notebooks finishing their jobs successfully?
>>>>>> This requirement would arise in any ETL workflow, where all the
>>>>>> downstream users wait for the ETL notebook to finish successfully. Only
>>>>>> after that can other business-oriented notebooks be executed.
>>>>>>
>>>>>> 2. Importing a notebook - Is there a current requirement or future
>>>>>> plan to implement a feature that allows import-notebook-from-github? This
>>>>>> would allow users to share notebooks seamlessly.
>>>>>>
>>>>>> Thanks
>>>>>> Vinayak
>>>>>>
>>>>>> On Sun, Feb 28, 2016 at 11:22 PM, moon soo Lee <m...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Zhong Wang,
>>>>>>> Right, Folder support would be quite useful. Thanks for the opinion.
>>>>>>>
>>>>>>> Hope I can finish the work on pr-190
>>>>>>> <https://github.com/apache/incubator-zeppelin/pull/190>.
>>>>>>>
>>>>>>>
>>>>>>> Sourav,
>>>>>>> Regarding concurrent running, Zeppelin places no limitation on
>>>>>>> running paragraphs/queries concurrently. An interpreter can implement
>>>>>>> its own scheduling policy. For example, the SparkSQL interpreter and
>>>>>>> ShellInterpreter can already run paragraphs/queries concurrently.
>>>>>>>
>>>>>>> SparkInterpreter is implemented with a FIFO scheduler, considering the
>>>>>>> nature of the Scala compiler. That's why users cannot run multiple
>>>>>>> paragraphs concurrently when they work with SparkInterpreter.
>>>>>>> But as Zhong Wang mentioned, pr-703 gives each notebook a separate
>>>>>>> Scala compiler, so paragraphs run concurrently as long as they're in
>>>>>>> different notebooks.
>>>>>>> Thanks for the feedback!
>>>>>>>
>>>>>>> Best,
>>>>>>> moon
>>>>>>>
>>>>>>> On Sat, Feb 27, 2016 at 8:59 PM Zhong Wang <wangzhong....@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Sourav: I think this newly merged PR can help you
>>>>>>>> https://github.com/apache/incubator-zeppelin/pull/703#issuecomment-185582537
>>>>>>>>
>>>>>>>> On Sat, Feb 27, 2016 at 1:46 PM, Sourav Mazumder <
>>>>>>>> sourav.mazumde...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Moon,
>>>>>>>>>
>>>>>>>>> This looks great.
>>>>>>>>>
>>>>>>>>> My only suggestion would be to include a PR/feature - support for
>>>>>>>>> running concurrent paragraphs/queries in Zeppelin.
>>>>>>>>>
>>>>>>>>> Right now, if more than one user tries to run paragraphs in
>>>>>>>>> multiple notebooks concurrently through a single Zeppelin instance
>>>>>>>>> (and a single interpreter instance), performance is very slow. It is
>>>>>>>>> obvious that a queue builds up within the Zeppelin process and the
>>>>>>>>> interpreter process in that scenario, as the time taken to move the
>>>>>>>>> status from start to pending and from pending to running is very high
>>>>>>>>> compared to the actual running time of a paragraph.
>>>>>>>>>
>>>>>>>>> Without this, multi-tenancy support would be meaningless, as no one
>>>>>>>>> can practically use it in a situation where multiple users are trying
>>>>>>>>> to connect to the same instance of Zeppelin (and the related
>>>>>>>>> interpreter). A possible solution would be to spawn a separate
>>>>>>>>> instance of the same interpreter at every notebook/user level.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Sourav
>>>>>>>>>
>>>>>>>>> On Sat, Feb 27, 2016 at 12:48 PM, moon soo Lee <m...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Zeppelin users and developers,
>>>>>>>>>>
>>>>>>>>>> The roadmap we have published at
>>>>>>>>>> https://cwiki.apache.org/confluence/display/ZEPPELIN/Zeppelin+Roadmap
>>>>>>>>>> is almost 9 months old, and it no longer reflects where the
>>>>>>>>>> community is going. It's time to update.
>>>>>>>>>>
>>>>>>>>>> Based on the mailing list, JIRA issues, pull requests, feedback from
>>>>>>>>>> users, conferences and meetings, I could summarize the major
>>>>>>>>>> interests of users and developers in 7 categories: Enterprise ready,
>>>>>>>>>> Usability improvement, Pluggability, Documentation, Backend
>>>>>>>>>> integration, Notebook storage, and Visualization.
>>>>>>>>>>
>>>>>>>>>> And I could list related subjects under each category.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - Enterprise ready
>>>>>>>>>>       - Authentication
>>>>>>>>>>          - Shiro authentication ZEPPELIN-548
>>>>>>>>>>          <https://issues.apache.org/jira/browse/ZEPPELIN-548>
>>>>>>>>>>       - Authorization
>>>>>>>>>>          - Notebook authorization PR-681
>>>>>>>>>>          <https://github.com/apache/incubator-zeppelin/pull/681>
>>>>>>>>>>       - Security
>>>>>>>>>>       - Multi-tenancy
>>>>>>>>>>       - Stability
>>>>>>>>>>    - Usability Improvement
>>>>>>>>>>       - UX improvement
>>>>>>>>>>       - Better Table data support
>>>>>>>>>>          - Download data as csv, etc PR-725
>>>>>>>>>>          <https://github.com/apache/incubator-zeppelin/pull/725>,
>>>>>>>>>>          PR-714
>>>>>>>>>>          <https://github.com/apache/incubator-zeppelin/pull/714>,
>>>>>>>>>>          PR-6
>>>>>>>>>>          <https://github.com/apache/incubator-zeppelin/pull/6>,
>>>>>>>>>>          PR-89
>>>>>>>>>>          <https://github.com/apache/incubator-zeppelin/pull/89>
>>>>>>>>>>          - Featureful table data display (pagination, etc)
>>>>>>>>>>    - Pluggability ZEPPELIN-533
>>>>>>>>>>    <https://issues.apache.org/jira/browse/ZEPPELIN-533>
>>>>>>>>>>       - Pluggable visualization
>>>>>>>>>>       - Dynamic Interpreter, notebook, visualization loading
>>>>>>>>>>       - Repository and registry for pluggable components
>>>>>>>>>>    - Improve documentation
>>>>>>>>>>       - Improve contents and readability
>>>>>>>>>>       - more tutorials, examples
>>>>>>>>>>    - Interpreter
>>>>>>>>>>       - Generic JDBC Interpreter
>>>>>>>>>>       - (spark)R Interpreter
>>>>>>>>>>       - Cluster manager for interpreter (Proposal
>>>>>>>>>>       <https://cwiki.apache.org/confluence/display/ZEPPELIN/Cluster+Manager+Proposal>)
>>>>>>>>>>       - more interpreters
>>>>>>>>>>    - Notebook storage
>>>>>>>>>>       - Versioning ZEPPELIN-540
>>>>>>>>>>       <http://issues.apache.org/jira/browse/ZEPPELIN-540>
>>>>>>>>>>       - more notebook storages
>>>>>>>>>>    - Visualization
>>>>>>>>>>       - More visualizations PR-152
>>>>>>>>>>       <https://github.com/apache/incubator-zeppelin/pull/152>,
>>>>>>>>>>       PR-728
>>>>>>>>>>       <https://github.com/apache/incubator-zeppelin/pull/728>,
>>>>>>>>>>       PR-336
>>>>>>>>>>       <https://github.com/apache/incubator-zeppelin/pull/336>,
>>>>>>>>>>       PR-321
>>>>>>>>>>       <https://github.com/apache/incubator-zeppelin/pull/321>
>>>>>>>>>>       - Customize graph (show/hide label, color, etc)
>>>>>>>>>>
>>>>>>>>>> It will help anyone quickly get a sense of the overall interests
>>>>>>>>>> and direction of the project. And based on this roadmap, we can
>>>>>>>>>> discuss and re-define the scope of the next release, 0.6.0, and its
>>>>>>>>>> schedule.
>>>>>>>>>>
>>>>>>>>>> What do you think? Any feedback would be appreciated.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> moon
>>>>>>>>>>
>>>>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Vinayak Agrawal
>>>>>>
>>>>>>
>>>>>> "To Strive, To Seek, To Find and Not to Yield!"
>>>>>> ~Lord Alfred Tennyson
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Vinayak Agrawal
>>>> Big Data Analytics
>>>> IBM
>>>>
>>>> "To Strive, To Seek, To Find and Not to Yield!"
>>>> ~Lord Alfred Tennyson
>>>>
>>>
>>>
>>
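
Vinayak's request in the quoted thread, running a notebook only after the
notebooks it depends on have finished successfully, could be sketched roughly
as below. This is only an illustration under assumptions, not how Zeppelin or
Oozie implements scheduling: the notebook ids are made up, and run_notebook is
a hypothetical callback standing in for whatever actually executes a notebook.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set


def run_workflow(
    dependencies: Dict[str, Set[str]],
    run_notebook: Callable[[str], bool],
) -> Dict[str, str]:
    """Run notebooks in dependency order, skipping any with a failed upstream.

    dependencies maps a notebook id to the set of notebook ids it waits for.
    run_notebook is a hypothetical callback that runs one notebook and
    returns True on success.
    """
    status: Dict[str, str] = {}
    # static_order() yields each notebook only after all of its upstreams.
    for note in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(note, set())
        if any(status[dep] != "success" for dep in upstream):
            # An upstream failed or was itself skipped, so do not run this one.
            status[note] = "skipped"
        else:
            status[note] = "success" if run_notebook(note) else "failed"
    return status


if __name__ == "__main__":
    # Hypothetical notebook ids: two reports wait for one ETL notebook.
    deps = {"etl": set(), "sales_report": {"etl"}, "kpi_report": {"etl"}}
    print(run_workflow(deps, run_notebook=lambda note: True))
```

If the ETL notebook fails, both downstream notebooks are marked as skipped
rather than run, which is the behaviour asked for in the ETL example.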

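Rick and Sourav both point out in the quoted thread that an external
scheduler can trigger a notebook, or a single paragraph, through the REST
API. A minimal sketch of what such a call might look like follows. The
/api/notebook/job/... path, the host and the note id are assumptions for
illustration only, so check them against the REST API documentation of your
Zeppelin version. The sketch only builds the request and sends nothing.

```python
import urllib.request


def build_run_request(
    base_url: str, note_id: str, paragraph_id: str = ""
) -> urllib.request.Request:
    """Build (but do not send) a POST request asking Zeppelin to run a note.

    The /api/notebook/job/... path is an assumption about the Zeppelin REST
    API; verify it against the documentation for your Zeppelin version.
    """
    path = f"/api/notebook/job/{note_id}"
    if paragraph_id:
        # Assumed form for running one paragraph instead of the whole note.
        path += f"/{paragraph_id}"
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, data=b"", method="POST")


if __name__ == "__main__":
    # Hypothetical host and note id; nothing is sent over the network here.
    req = build_run_request("http://localhost:8080", "2A94M5J1Z")
    print(req.get_method(), req.full_url)
    # To actually trigger the run, a scheduler would call:
    # urllib.request.urlopen(req)
```

Any scheduler that can issue an HTTP request (cron, Oozie, or something else)
could use a call like this, which is why the thread treats the REST API as the
integration point rather than any one scheduler.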