Re: [DISCUSS] Pulling "Contrib" Modules into Apache

David Miller Wed, 12 Mar 2014 22:26:25 -0700

What about both?
"connectors" for spout/bolt/states that connect to other tech (storm-kafka, 
storm-cassandra, etc.)
"extras" for other things like storm-starter, storm-deploy, storm-puppet




On 13 Mar 2014, at 3:57 pm, Nathan Marz <[email protected]> wrote:

> I don't like either name tbh. Storm itself is already broken into modules 
> (storm-core, storm-netty, etc) and things like storm-starter and storm-kafka 
> are something different. I don't like "connectors" because something like 
> storm-starter is not a connector. Maybe we call them "extras"?
> 
> I would say just to support 0.8.x of Kafka.
> 
> 
> On Wed, Mar 12, 2014 at 11:33 PM, P. Taylor Goetz <[email protected]> wrote:
> Incorporation of storm starter is underway.
> 
> I'd like to turn the attention to kafka, with the goal being to pull in kafka 
> support that is maintained and will be known to be compatible with the 
> current version of storm and specific version(s) of kafka.
> 
> I have the following questions for the community:
> 
> 1. What do we want to call additions like this? I'm leaning toward "modules" 
> or "connectors".
> 
> 2. Do we want to support both 0.7.x and 0.8.x versions of kafka, or just 
> 0.8.x? From a release management perspective, the latter is preferable 
> because the 0.7.x line artifacts are not in maven central. This makes 
> building a real pain, and maintaining support for two versions won't be fun. 
> Also, most of the people I have worked with are looking at 0.8.x for a 
> variety of reasons, but I'm open to either way.
> 
> - Taylor
> 
> 
> > On Mar 1, 2014, at 5:11 AM, "Michael G. Noll" 
> > <[email protected]> wrote:
> >
> > Thanks for starting this discussion, Taylor.
> >
> > As a user of Storm (and a small-scale contributor to storm-starter) as
> > well as a user of Kafka, here are my $.02.
> >
> > [Storm and Kafka]
> > First, I agree with Nathan that storm-kafka should be considered to be
> > brought in.  While various "integrate Storm with X" options exist,
> > basically everyone I have been talking to is using Kafka in
> > combination with Storm.  I'm sure this is not a representative sample
> > of Storm users, and of course one may or may not agree that Kafka is
> > important enough of a technology in Storm's ecosystem.  Still, I do
> > see the need to make sure Storm and Kafka do work together without
> > having to go through forks of forks on GitHub and spending days to
> > figure out how to get data from Kafka (0.8) into Storm.
> >    Speaking of Kafka spout implementations, please don't forget
> > https://github.com/HolmesNL/kafka-spout in addition to Wurstmeister's.
> > We've been quite happy with the former, so I'd suggest at least
> > considering both options here (maybe the two projects can even join forces?).
> >
> > [Storm examples, storm-starter]
> > Second, IMHO every open source project should have a "1-click starting
> > experience" for new users.  That's very much related to the project
> > principles of tools like LogStash [1] who say: "Community: If a newbie
> > has a bad time, it's a bug."  For this reason I personally would like
> > to see the equivalent of storm-starter being brought into the "core"
> > Storm project -- think of an examples/ sub-module.  If the level of
> > effort is deemed too high to e.g. maintain what's already in
> > storm-starter, then (say) reduce the scope and remove some of the
> > examples.  In any case I'd personally like to see bundled
> > examples that are known to work with the latest version of Storm.
> > storm-starter is often used to show new users how to get started with
> > Storm (I used that approach in my Storm blog posts, for instance, and
> > others like Mesosphere.io are even using storm-starter for their
> > commercial offerings [2]).
> >
> > [Have Storm up and running faster than you can brew an espresso]
> > Third, for the same reason (get people up and running in a few
> > minutes), I do like that other people in this thread have been
> > bringing up projects like storm-deploy.  For the same reason I have
> > open sourced puppet-storm [3] (and puppet-kafka, for that matter) a
> > few days ago, and I'll soon open source another Vagrant/Puppet based
> > tool that provides you with 1-click local and remote deployments of
> > Storm and Kafka clusters.  That's way better IMHO than having to
> > follow long articles or blog posts to deploy your first cluster.  And
> > there are a number of other people that have been rolling their own
> > variants.  Now don't get me wrong -- I don't mention this to pitch any
> > of those tools.  My intention is to say that it would be greatly
> > helpful to have /something/ like this for Storm, for the same reason
> > that it's nice to have LocalCluster for unit testing.  I have been
> > demo'ing both Storm and Kafka by launching clusters with a simple
> > command line, which always gets people excited.  If they can then rely
> > on existing examples (see above) to also /run/ an analysis on "their"
> > cluster then they have a beautiful start.
> >    Oh, and btw:  Apache Aurora (with Mesos) has such a Vagrant-based
> > VM cluster setup, too [4] so that people can run the Aurora tutorial
> > on their machines in a few minutes.
> >
> > [Storm and YARN]
> > Fourth, and for similar reasons as #2 and #3, bringing in storm-yarn
> > would be nice.  It ties into being able to run LocalCluster as well as
> > to run Storm in local or remote VMs -- but now alongside your existing
> > Hadoop/YARN infrastructure.  For those preferring Mesos, Storm-on-Mesos
> > will surely be similarly attractive.
> >
> >
> > On a related note bringing the Storm docs up to speed with the quality
> > of the Storm code would also be great.  I have seen that since Storm
> > moved to Incubator several new sections have been added such as the
> > FAQ [5] (btw: nice!).
> >
> > Similarly, there should be better examples and docs for users on how to
> > write unit tests for Storm.  Right now people seem to be cobbling
> > together their test code by figuring out how the 1-year old code in
> > [6] actually works, and copy-pasting other people's test code from GitHub.
> >
> > --
> >
> > As I said above, these are my personal $.02.  I admit that my comments
> > go a bit beyond the original question of bringing in contrib modules
> > -- I think implicitly the discussion about the contrib modules also
> > means "what do you need to provide a better and more well-rounded
> > experience", i.e. the question whether to have batteries included or
> > not. (As you may suspect I'm leaning towards including at least the
> > most important batteries, though what's really "important" on the
> > project level is of course up for debate.)
> >
> > On my side I'd be happy to help with those areas where I am able to
> > contribute, whether that's code and examples (like storm-starter) or
> > tutorials/docs (I already wrote e.g. [7] and [8]).
> >
> > Again, thanks Taylor for starting this discussion.  No matter the
> > actual outcome I'm sure the state of the project will be improved.
> >
> > Best,
> > Michael
> >
> >
> >
> > [1] https://github.com/elasticsearch/logstash
> > [2] http://mesosphere.io/learn/run-storm-on-mesos/#step-7
> > [3] https://github.com/miguno/puppet-storm
> > [4] https://github.com/apache/incubator-aurora/blob/master/docs/vagrant.md
> > [5] http://storm.incubator.apache.org/documentation/FAQ.html
> > [6]
> > https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java
> > [7]
> > https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology
> > [8]
> > http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
> >
> >
> >
> >> On 02/26/2014 08:21 PM, P. Taylor Goetz wrote:
> >> Thanks for the feedback Bobby.
> >>
> >> To clarify, I’m mainly talking about spout/bolt/trident state
> >> implementations that integrate storm with *Technology X*, where
> >> *Technology X* is not a fundamental part of storm.
> >>
> >> Examples would be technologies that are part of or related to the
> >> Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.:
> >> Kafka, HDFS, HBase, Cassandra, etc.
> >>
> >> The idea behind having one or more Storm committers act as a
> >> “sponsor” is to make sure new additions are done carefully and with
> >> good reason. To add a new module, it would require committer/PPMC
> >> consensus, and assignment of one or more sponsors. Part of a
> >> sponsor’s job would be to ensure that a module is maintained, which
> >> would require enough familiarity with the code to support it long
> >> term. If a new module was proposed, but no committers were willing
> >> to act as a sponsor, it would not be added.
> >>
> >> It would be the Committers’/PPMC’s responsibility to make sure things
> >> didn’t get out of hand, and to do something about it if they did.
> >>
> >> Here’s an old Hadoop JIRA thread [1] discussing the addition of
> >> Hive as a contrib module, similar to what happened with HBase as
> >> Bobby pointed out. Some interesting points are brought up. The
> >> difference here is that both HBase and Hive were pretty big
> >> codebases relative to Hadoop. With spout/bolt/state implementations
> >> I doubt we’d see anything along that scale.
> >>
> >> - Taylor
> >>
> >> [1] https://issues.apache.org/jira/browse/HADOOP-3601
> >>
> >>
> >> On Feb 26, 2014, at 12:35 PM, Bobby Evans <[email protected]> wrote:
> >>
> >>> I can see a lot of value in having a distribution of storm that
> >>> comes with batteries included, everything is tested together and
> >>> you know it works.  But I don’t see much long term developer
> >>> benefit in building them all together.  If there is strong
> >>> coupling between storm and these external projects so that they
> >>> break when storm changes then we need to understand the coupling
> >>> and decide if we want to reduce that coupling by stabilizing
> >>> APIs, improving version numbering and release process, etc.; or
> >>> if the functionality is something that should be offered as a
> >>> base service in storm.
> >>>
> >>> I can see politically the value of giving these other projects a
> >>> home in Apache, and making them sub-projects is the simplest
> >>> route to that. I’d love to have storm on yarn inside Apache.  I
> >>> just don’t want to go overboard with it.  There was a time when
> >>> HBase was a “contrib” module under Hadoop along with a lot of
> >>> other things, and the Apache board came and told Hadoop to break
> >>> it up.
> >>>
> >>> Bringing storm-kafka into storm does not sound like it will solve
> >>> much from a developer’s perspective, because there is at least as
> >>> much coupling with kafka as there is with storm.  I can see how
> >>> it is a huge amount of overhead and pain to set up a new project
> >>> just for a few hundred lines of code, as such I am in favor of
> >>> pulling in closely related projects, especially those that are
> >>> spouts and state implementations. I just want to be sure that we
> >>> do it carefully, with a good reason, and with enough people who
> >>> are familiar with the code to support it long term.
> >>>
> >>> If it starts to look like we are pulling in too many projects
> >>> perhaps we should look at something more like the bigtop project
> >>> https://bigtop.apache.org/ which produces a tested distribution
> >>> of Hadoop with many different sub-projects included in it.
> >>>
> >>> I am also a bit concerned about these sub-projects becoming
> >>> second class citizens, where we break something, but because the
> >>> build is off by default we don’t know it.  I would prefer that
> >>> they are built and tested by default.  If the build and test time
> >>> starts to take too long, to me that means we need to start
> >>> wondering if we have too many contrib modules.
> >>>
> >>> —Bobby
> >>>
> >>> From: Brian Enochson <[email protected]>
> >>> Reply-To: "[email protected]" <[email protected]>
> >>> Date: Tuesday, February 25, 2014 at 9:50 PM
> >>> To: "[email protected]" <[email protected]>
> >>> Cc: "[email protected]" <[email protected]>
> >>> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
> >>>
> >>> hi, I am in agreement with Taylor and believe I understand his
> >>> intent. An incredible tool/framework/application like Storm is
> >>> only enhanced and gains value from the number of well maintained
> >>> and vetted modules that can be used for integration and adding
> >>> further functionality. I am relatively new to the Storm community
> >>> but have spent quite some time reviewing contributing modules out
> >>> there, reviewing various duplicates and running into some version
> >>> incompatibilities. I understand the need to keep Storm itself
> >>> pure, but do think there needs to be some structure and
> >>> governance added to the contributing modules. Look at the benefit
> >>> a tool like npm brings to the node community. I like the idea of
> >>> sponsorship, vetting and a community vote.  I, as I'm sure many would
> >>> be, am willing to offer support and time to working through how
> >>> to set this up and helping with the implementation if it is
> >>> decided to pursue some solution. I hope these views are taken in
> >>> the spirit they are made, to make this incredible system even
> >>> better along with the surrounding eco-system.
> >>>
> >>> Thanks, Brian
> >>>
> >>>
> >>> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz
> >>> <[email protected]> wrote:
> >>>
> >>> Just to be clear (and play a little Devil’s advocate :) ), I’m not
> >>> suggesting that whatever a “contrib” project/module/subproject
> >>> might become, be a clearinghouse for anything Storm-related.
> >>>
> >>> I see it as something that is well-vetted by the Storm
> >>> community, subject to PPMC review, vote, etc. Entry would require
> >>> community review, PPMC review, and in some cases ASF IP
> >>> clearance/legal review. Anything added would require some level
> >>> of commitment from the PPMC/committers to provide some level of
> >>> support.
> >>>
> >>> In other words, nothing “willy-nilly”.
> >>>
> >>> One option could be that any module added require (X > 0)  number
> >>> of committers to volunteer as “sponsor”s for the module, and
> >>> commit to maintaining it.
> >>>
> >>> That being said, I don’t see storm-kafka being any different
> >>> from anything else that provides integration points for Storm.
> >>>
> >>> -Taylor
> >>>
> >>>
> >>> On Feb 25, 2014, at 7:53 PM, Nathan Marz <[email protected]>
> >>> wrote:
> >>>
> >>> I'm only +1 for pulling in storm-kafka and updating it. Other
> >>> projects put these contrib modules in a "contrib" folder and keep
> >>> them managed as completely separate codebases. As it's not
> >>> actually a "module" necessary for Storm, there's an argument
> >>> there for doing it that way rather than via the multi-module
> >>> route.
> >>>
> >>>
> >>> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage
> >>> <[email protected]> wrote:
> >>>
> >>> Hi Taylor,
> >>>
> >>> I'm +1 for pulling these external libraries into the Apache codebase.
> >>> This will certainly benefit the Storm community. I would also like to
> >>> contribute to this process.
> >>>
> >>> Thanks Milinda
> >>>
> >>> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz
> >>> <[email protected]> wrote:
> >>>> A while back I opened STORM-206 [1] to capture ideas for
> >>>> pulling in "contrib" modules to the Apache codebase.
> >>>>
> >>>> In the past, we had the storm-contrib github project [2] which
> >>>> subsequently got broken up into individual projects hosted on
> >>>> the stormprocessor github group [3] and elsewhere.
> >>>>
> >>>> The problem with this approach is that in certain cases it led
> >>>> to code rot (modules not being updated in step with Storm's
> >>>> API), fragmentation (multiple similar modules with the same
> >>>> name), and confusion.
> >>>>
> >>>> A good example of this is the storm-kafka module [4], since it
> >>>> is a widely used component. Because storm-contrib wasn't being
> >>>> tagged in github, a lot of users had trouble working out which
> >>>> versions of storm it was compatible with. Some users built off
> >>>> specific commit hashes, some forked, and a few even pushed
> >>>> custom builds to repositories such as clojars. With kafka 0.8
> >>>> now available, there are two main storm-kafka projects, the
> >>>> original (compatible with kafka 0.7) and an updated fork [5]
> >>>> (compatible with kafka 0.8).
> >>>>
> >>>> My intention is not to find fault in any way, but rather to
> >>>> point out the resulting pain, and work toward a better
> >>>> solution.
> >>>>
> >>>> I think it would be beneficial to the Storm user community to
> >>>> have certain commonly used modules like storm-kafka brought
> >>>> into the Apache Storm project. Another benefit worth
> >>>> considering is the licensing/legal oversight that the ASF
> >>>> provides, which is important to many users.
> >>>>
> >>>> If this is something we want to do, then the big question
> >>>> becomes what sort of governance process needs to be established to
> >>>> ensure that such things are properly maintained.
> >>>>
> >>>> Some random thoughts, questions, etc. that jump to mind
> >>>> include:
> >>>>
> >>>> What to call these things: "contrib modules", "connectors",
> >>>> "integration modules", etc.?
> >>>>
> >>>> Build integration: I imagine they would be a multi-module
> >>>> submodule of the main maven build. Probably turned off by
> >>>> default and enabled by a maven profile.
> >>>>
> >>>> Governance: Have one or more committer volunteers responsible
> >>>> for maintenance, merging patches, etc.?
> >>>>
> >>>> Proposal process for pulling new modules?
> >>>>
> >>>>
> >>>> I look forward to hearing others' opinions.
> >>>>
> >>>> - Taylor
> >>>>
> >>>>
> >>>> [1] https://issues.apache.org/jira/browse/STORM-206
> >>>> [2] https://github.com/nathanmarz/storm-contrib
> >>>> [3] https://github.com/stormprocessor
> >>>> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
> >>>> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
> >
> 
> 
> 
> -- 
> Twitter: @nathanmarz
> http://nathanmarz.com
