Re: [DISCUSS] Update Roadmap

Tamas Szuromi Tue, 01 Mar 2016 00:47:29 -0800

Hey,

Really promising roadmap.


I'd only push for more visualization options. I agree that built-in
visualization with limited charting options is needed, but I think we also
need some way to 'inject' external JS visualizations.


For scheduling Zeppelin notebooks we use Airflow
<https://github.com/airbnb/airflow> through the job REST API. It's an
enterprise-ready and very robust solution right now.
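
A minimal sketch of how an external scheduler task might trigger a note
through the job REST API. It assumes the job endpoint path
/api/notebook/job/<noteId> (with /<paragraphId> appended for a single
paragraph), so please verify it against your Zeppelin version. The base URL,
the note ID and the helper names job_url and run_note are hypothetical, and
authentication is not handled.

```python
import json
import urllib.request
from typing import Optional


def job_url(base_url: str, note_id: str, paragraph_id: Optional[str] = None) -> str:
    """Build the job endpoint URL for a whole note or for one paragraph."""
    url = f"{base_url.rstrip('/')}/api/notebook/job/{note_id}"
    if paragraph_id is not None:
        url += f"/{paragraph_id}"
    return url


def run_note(base_url: str, note_id: str, timeout: float = 30.0) -> dict:
    """POST to the job endpoint to start the note and return the JSON reply."""
    request = urllib.request.Request(
        job_url(base_url, note_id), data=b"", method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A scheduler task would call something like run_note("http://localhost:8080",
"<noteId>") and then inspect the reply, so dependencies and retries stay in
the scheduler rather than in Zeppelin.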


*Tamas*

On 1 March 2016 at 09:12, Eran Witkon <eranwit...@gmail.com> wrote:

> One point to clarify: I don't want to suggest Oozie specifically. I want us
> to think about which features we develop ourselves and which ones we
> integrate from external, preferably Apache, technology. We don't think about
> building our own storage services, so why build our own scheduler?
> Eran
> On Tue, 1 Mar 2016 at 09:49 moon soo Lee <m...@apache.org> wrote:
>
>> @Vinayak, @Eran, @Benjamin, @Guilherme, @Sourav, @Rick
>> Now I can see a lot of demand for enterprise-level job scheduling.
>> Whether external or built-in, I completely agree with having
>> enterprise-level job scheduling support on the roadmap.
>> ZEPPELIN-137 <https://issues.apache.org/jira/browse/ZEPPELIN-137> and
>> ZEPPELIN-531 <https://issues.apache.org/jira/browse/ZEPPELIN-531> are the
>> related issues I can find in our JIRA.
>>
>> @Vinayak
>> Regarding importing notebooks from GitHub, Zeppelin has a pluggable notebook
>> storage layer (see related package
>> <https://github.com/apache/incubator-zeppelin/tree/master/zeppelin-zengine/src/main/java/org/apache/zeppelin/notebook/repo>).
>> So GitHub notebook sync can be implemented easily.
>>
>> @Shabeel
>> Right, we need better memory management to prevent such OOMs.
>> And I think tables are one of the most frequently used ways of displaying
>> data. So we'll definitely need more features like filter, sort, etc.
>> After this roadmap discussion, discussion of the next release will
>> follow. Then we'll get an idea of when those features will be available.
>>
>> @Prasad
>> Thanks for mentioning HA and DR. They're really important subjects for
>> enterprise use. Zeppelin will definitely need to address them.
>> And displaying meta information about notebooks on the top-level page is a
>> good idea.
>>
>> It's really great to hear so many opinions and ideas.
>> And thanks @Rick for sharing a valuable view of the Zeppelin project.
>>
>> Thanks,
>> moon
>>
>>
>> On Mon, Feb 29, 2016 at 11:14 PM Rick Moritz <rah...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> For one, I know that there is rudimentary scheduling built into Zeppelin
>>> already (at least I fixed a bug in the test for a scheduling feature a few
>>> months ago).
>>> But another point is that Zeppelin should also focus on quality,
>>> reproducibility and portability.
>>> Although this doesn't offer exciting new features, it would make
>>> development much easier.
>>>
>>> Cross-platform testability, Tests that pass when run sequentially,
>>> compatibility with Firefox, and many more open issues that make it so much
>>> harder to enhance Zeppelin and add features should be addressed soon,
>>> preferably before more features are added. Already Zeppelin is suffering -
>>> in my opinion - from quite a lot of feature creep, and we should avoid
>>> putting in the kitchen sink, at the cost of quality and maintainability.
>>> Instead modularity (ZEPPELIN-533 in particular) should be targeted.
>>>
>>> Oozie, in my opinion, is a dead end - it may de-facto still be in use on
>>> many clusters, but it's not getting the love it needs, and I wouldn't bet
>>> on it, when it comes to integrating scheduling. Instead, any external tool
>>> should be able to use the REST-API to trigger executions, if you want
>>> external scheduling.
>>>
>>> So, in conclusion, if we take Moon's list as a list of descending
>>> priorities, I fully agree, under the condition that code quality is
>>> included as a subset of enterprise-readiness. Auth* is paramount (Kerberos
>>> SPNEGO SSO support is what we really want) with user and group rights
>>> assignment on the notebook level. We probably also need Knox-integration
>>> (ODP-Members looking at integrating Zeppelin should consider contributing
>>> this), and integration of something like Spree (
>>> https://github.com/hammerlab/spree) to be able to profile jobs.
>>>
>>> I'm hopeful that soon I can resume contributing some quality-oriented
>>> code, to drive this "necessary evil" forward ;)
>>>
>>> On Mon, Feb 29, 2016 at 8:27 PM, Sourav Mazumder <
>>> sourav.mazumde...@gmail.com> wrote:
>>>
>>>> I do agree with Vinayak. It need not be coupled with Oozie.
>>>>
>>>> Rather, one should be able to call it from any scheduler typically used
>>>> at the enterprise level. Maybe support for BPML.
>>>>
>>>> I believe the existing ability to call/execute a Zeppelin Notebook or a
>>>> specific paragraph within a notebook using REST API should take care of
>>>> this requirement to some extent.
>>>>
>>>> Regards,
>>>> Sourav
>>>>
>>>> On Mon, Feb 29, 2016 at 11:23 AM, Vinayak Agrawal <
>>>> vinayakagrawa...@gmail.com> wrote:
>>>>
>>>>> @Eran Witkon,
>>>>> Thanks for the suggestion, Eran. I concur with your thought.
>>>>> If Zeppelin can be integrated with Oozie, that would be wonderful.
>>>>> Users will also be able to leverage their Oozie skills.
>>>>> This would be promising for now.
>>>>> However, in the future Hadoop might not necessarily be installed in a
>>>>> Spark cluster, and Oozie (since it installs with a Hadoop distribution)
>>>>> might not be available.
>>>>> So perhaps we should give some thought to this feature for the future.
>>>>> Should it depend on Oozie or should Zeppelin have its own scheduling?
>>>>>
>>>>> As Benjamin has pointed out, the Databricks notebook has this as a core
>>>>> notebook feature.
>>>>>
>>>>>
>>>>> Also, would anybody give any suggestions regarding "sync with github"
>>>>> feature?
>>>>> -Exporting notebook to Github
>>>>> -Importing notebook from Github
>>>>>
>>>>> Thanks
>>>>> Vinayak
>>>>>
>>>>>
>>>>> On Mon, Feb 29, 2016 at 4:17 AM, Eran Witkon <eranwit...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> @Vinayak Agrawal I would suggest adding the ability to connect
>>>>>> Zeppelin to existing scheduling/workflow tools such as
>>>>>> https://oozie.apache.org/. This requires better hooks and status
>>>>>> reporting but doesn't make Zeppelin an ETL/scheduler tool by itself.
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 29, 2016 at 10:21 AM Vinayak Agrawal <
>>>>>> vinayakagrawa...@gmail.com> wrote:
>>>>>>
>>>>>>> Moon,
>>>>>>> The new roadmap looks very promising. I am very happy to see
>>>>>>> security in the list.
>>>>>>> I have some suggestions regarding Enterprise Ready features:
>>>>>>>
>>>>>>> 1. Job Scheduler - Can this be improved?
>>>>>>> Currently the scheduler can be used with a cron expression or a
>>>>>>> pre-set time. But in an enterprise solution, a notebook might be one
>>>>>>> piece of the workflow. Can we look towards the functionality of
>>>>>>> scheduling notebooks based on other notebooks finishing their jobs
>>>>>>> successfully? This requirement would arise in any ETL workflow, where
>>>>>>> all the downstream users wait for the ETL notebook to finish
>>>>>>> successfully. Only after that can other business-oriented notebooks be
>>>>>>> executed.
>>>>>>>
>>>>>>> 2. Importing a notebook - Is there a current requirement or future
>>>>>>> plan to implement a feature that allows import-notebook-from-github?
>>>>>>> This would allow users to share notebooks seamlessly.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Vinayak
>>>>>>>
>>>>>>> On Sun, Feb 28, 2016 at 11:22 PM, moon soo Lee <m...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Zhong Wang,
>>>>>>>> Right, folder support would be quite useful. Thanks for the
>>>>>>>> opinion.
>>>>>>>>
>>>>>>>> Hope I can finish the work in pr-190
>>>>>>>> <https://github.com/apache/incubator-zeppelin/pull/190>.
>>>>>>>>
>>>>>>>> Sourav,
>>>>>>>> Regarding concurrent running, Zeppelin doesn't have a limitation on
>>>>>>>> running paragraphs/queries concurrently. An interpreter can implement
>>>>>>>> its own scheduling policy. For example, the SparkSQL interpreter and
>>>>>>>> ShellInterpreter can already run paragraphs/queries concurrently.
>>>>>>>>
>>>>>>>> SparkInterpreter is implemented with a FIFO scheduler, considering the
>>>>>>>> nature of the Scala compiler. That's why users cannot run multiple
>>>>>>>> paragraphs concurrently when they work with SparkInterpreter.
>>>>>>>> But as Zhong Wang mentioned, pr-703 gives each notebook a separate
>>>>>>>> Scala compiler, so paragraphs run concurrently as long as they're in
>>>>>>>> different notebooks.
>>>>>>>> Thanks for the feedback!
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> moon
>>>>>>>>
>>>>>>>> On Sat, Feb 27, 2016 at 8:59 PM Zhong Wang <wangzhong....@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Sourav: I think this newly merged PR can help you
>>>>>>>>> https://github.com/apache/incubator-zeppelin/pull/703#issuecomment-185582537
>>>>>>>>>
>>>>>>>>> On Sat, Feb 27, 2016 at 1:46 PM, Sourav Mazumder <
>>>>>>>>> sourav.mazumde...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Moon,
>>>>>>>>>>
>>>>>>>>>> This looks great.
>>>>>>>>>>
>>>>>>>>>> My only suggestion would be to include a PR/feature: support for
>>>>>>>>>> running concurrent paragraphs/queries in Zeppelin.
>>>>>>>>>>
>>>>>>>>>> Right now, if more than one user tries to run paragraphs in
>>>>>>>>>> multiple notebooks concurrently through a single Zeppelin instance
>>>>>>>>>> (and a single interpreter instance), performance is very slow. It
>>>>>>>>>> is obvious that a queue builds up within the Zeppelin process and
>>>>>>>>>> the interpreter process in that scenario, as the time taken to move
>>>>>>>>>> the status from start to pending and from pending to running is very
>>>>>>>>>> high compared to the actual running time of a paragraph.
>>>>>>>>>>
>>>>>>>>>> Without this, multi-tenancy support would be meaningless, as no
>>>>>>>>>> one can practically use it in a situation where multiple users are
>>>>>>>>>> trying to connect to the same instance of Zeppelin (and the related
>>>>>>>>>> interpreter). A possible solution would be to spawn a separate
>>>>>>>>>> instance of the same interpreter at every notebook/user level.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Sourav
>>>>>>>>>>
>>>>>>>>>> On Sat, Feb 27, 2016 at 12:48 PM, moon soo Lee <m...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Zeppelin users and developers,
>>>>>>>>>>>
>>>>>>>>>>> The roadmap we have published at
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/ZEPPELIN/Zeppelin+Roadmap
>>>>>>>>>>> is almost 9 months old, and it no longer reflects where the
>>>>>>>>>>> community is going. It's time to update.
>>>>>>>>>>>
>>>>>>>>>>> Based on the mailing list, JIRA issues, pull requests, feedback from
>>>>>>>>>>> users, conferences and meetings, I could summarize the major
>>>>>>>>>>> interests of users and developers in 7 categories: Enterprise ready,
>>>>>>>>>>> Usability improvement, Pluggability, Documentation, Backend
>>>>>>>>>>> integration, Notebook storage, and Visualization.
>>>>>>>>>>>
>>>>>>>>>>> And I could list related subjects under each category.
>>>>>>>>>>>
>>>>>>>>>>>    - Enterprise ready
>>>>>>>>>>>       - Authentication
>>>>>>>>>>>          - Shiro authentication ZEPPELIN-548
>>>>>>>>>>>          <https://issues.apache.org/jira/browse/ZEPPELIN-548>
>>>>>>>>>>>       - Authorization
>>>>>>>>>>>          - Notebook authorization PR-681
>>>>>>>>>>>          <https://github.com/apache/incubator-zeppelin/pull/681>
>>>>>>>>>>>       - Security
>>>>>>>>>>>       - Multi-tenancy
>>>>>>>>>>>       - Stability
>>>>>>>>>>>    - Usability Improvement
>>>>>>>>>>>       - UX improvement
>>>>>>>>>>>       - Better Table data support
>>>>>>>>>>>          - Download data as csv, etc PR-725
>>>>>>>>>>>          <https://github.com/apache/incubator-zeppelin/pull/725>,
>>>>>>>>>>>          PR-714
>>>>>>>>>>>          <https://github.com/apache/incubator-zeppelin/pull/714>,
>>>>>>>>>>>          PR-6
>>>>>>>>>>>          <https://github.com/apache/incubator-zeppelin/pull/6>,
>>>>>>>>>>>          PR-89
>>>>>>>>>>>          <https://github.com/apache/incubator-zeppelin/pull/89>
>>>>>>>>>>>          - Featureful table data display (pagination, etc)
>>>>>>>>>>>    - Pluggability ZEPPELIN-533
>>>>>>>>>>>    <https://issues.apache.org/jira/browse/ZEPPELIN-533>
>>>>>>>>>>>       - Pluggable visualization
>>>>>>>>>>>       - Dynamic Interpreter, notebook, visualization loading
>>>>>>>>>>>       - Repository and registry for pluggable components
>>>>>>>>>>>    - Improve documentation
>>>>>>>>>>>       - Improve contents and readability
>>>>>>>>>>>       - more tutorials, examples
>>>>>>>>>>>    - Interpreter
>>>>>>>>>>>       - Generic JDBC Interpreter
>>>>>>>>>>>       - (spark)R Interpreter
>>>>>>>>>>>       - Cluster manager for interpreter (Proposal
>>>>>>>>>>>       <https://cwiki.apache.org/confluence/display/ZEPPELIN/Cluster+Manager+Proposal>)
>>>>>>>>>>>       - more interpreters
>>>>>>>>>>>    - Notebook storage
>>>>>>>>>>>       - Versioning ZEPPELIN-540
>>>>>>>>>>>       <http://issues.apache.org/jira/browse/ZEPPELIN-540>
>>>>>>>>>>>       - more notebook storages
>>>>>>>>>>>    - Visualization
>>>>>>>>>>>       - More visualizations PR-152
>>>>>>>>>>>       <https://github.com/apache/incubator-zeppelin/pull/152>,
>>>>>>>>>>>       PR-728
>>>>>>>>>>>       <https://github.com/apache/incubator-zeppelin/pull/728>,
>>>>>>>>>>>       PR-336
>>>>>>>>>>>       <https://github.com/apache/incubator-zeppelin/pull/336>,
>>>>>>>>>>>       PR-321
>>>>>>>>>>>       <https://github.com/apache/incubator-zeppelin/pull/321>
>>>>>>>>>>>       - Customize graph (show/hide label, color, etc)
>>>>>>>>>>>
>>>>>>>>>>> It will help anyone quickly get the overall interests of the project
>>>>>>>>>>> and its direction. And based on this roadmap, we can discuss and
>>>>>>>>>>> re-define the scope of the next release, 0.6.0, and its schedule.
>>>>>>>>>>>
>>>>>>>>>>> What do you think? Any feedback would be appreciated.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> moon
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Vinayak Agrawal
>>>>>>>
>>>>>>>
>>>>>>> "To Strive, To Seek, To Find and Not to Yield!"
>>>>>>> ~Lord Alfred Tennyson
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Vinayak Agrawal
>>>>> Big Data Analytics
>>>>> IBM
>>>>>
>>>>> "To Strive, To Seek, To Find and Not to Yield!"
>>>>> ~Lord Alfred Tennyson
>>>>>
>>>>
>>>>
>>>
