Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Nathan Marz Thu, 13 Mar 2014 12:09:06 -0700

We also don't want to create the impression that anything and everything
belongs in the Storm project itself. storm-kafka is special because Kafka
works so well with Storm and is so widely used. But if there was a folder
called "connectors" or "adapters", people may think we're willing to pull
in anything and everything.


+1 for putting storm-starter in an examples/ directory.


On Thu, Mar 13, 2014 at 5:28 PM, P. Taylor Goetz <[email protected]> wrote:

> To clarify somewhat, the pull request for pulling in storm-starter [1]
> puts it in an "examples" directory. And there are suggestions to pull
> James' scheduler and testing examples in there as well. So there is a
> distinction between examples and other things like storm-kafka.
>
> What I'm proposing is a different yet-to-be-named directory that would be
> home to things that integrate storm with other technologies.
>
> In the storm-contrib README [2] the term used is "modules". On the Storm
> website, we also use the term "adapter" [3].
>
> - Taylor
>
> [1] https://github.com/apache/incubator-storm/pull/44
> [2] https://github.com/nathanmarz/storm-contrib/blob/master/README.md#about
> [3] http://storm.incubator.apache.org/documentation/Spout-implementations.html
>
>
> On Mar 13, 2014, at 1:24 AM, David Miller <[email protected]> wrote:
>
>
> What about both?
> "connectors" for spouts/bolts/states that connect to other tech: storm-kafka,
> storm-cassandra, etc.
> "extras" for other things like storm-starter, storm-deploy, storm-puppet.
>
>
>
> On 13 Mar 2014, at 3:57 pm, Nathan Marz <[email protected]> wrote:
>
> I don't like either name tbh. Storm itself is already broken into modules
> (storm-core, storm-netty, etc) and things like storm-starter and
> storm-kafka are something different. I don't like "connectors" because
> something like storm-starter is not a connector. Maybe we call them
> "extras"?
>
> I would say just to support 0.8.x of Kafka.
>
>
> On Wed, Mar 12, 2014 at 11:33 PM, P. Taylor Goetz <[email protected]> wrote:
>
>> Incorporation of storm starter is underway.
>>
>> I'd like to turn the attention to kafka, with the goal being to pull in
>> kafka support that is maintained and will be known to be compatible with
>> the current version of storm and specific version(s) of kafka.
>>
>> I have the following questions for the community:
>>
>> 1. What do we want to call additions like this? I'm leaning toward
>> "modules" or "connectors".
>>
>> 2. Do we want to support both 0.7.x and 0.8.x versions of kafka, or just
>> 0.8.x? From a release management perspective, the latter is preferable
>> because the 0.7.x line artifacts are not in maven central. This makes
>> building a real pain, and maintaining support for two versions won't be
>> fun. Also, most of the people I have worked with are looking at 0.8.x for a
>> variety of reasons, but I'm open to either way.
>>
>> - Taylor
>>
>>
>> > On Mar 1, 2014, at 5:11 AM, "Michael G. Noll" <[email protected]> wrote:
>> >
>> > Thanks for starting this discussion, Taylor.
>> >
>> > As a user of Storm (and a small-scale contributor to storm-starter) as
>> > well as a user of Kafka, here are my $.02.
>> >
>> > [Storm and Kafka]
>> > First, I agree with Nathan that storm-kafka should be considered to be
>> > brought in.  While various "integrate Storm with X" options exist,
>> > basically everyone I have been talking to is using Kafka in
>> > combination with Storm.  I'm sure this is not a representative sample
>> > of Storm users, and of course one may or may not agree that Kafka is
>> > important enough of a technology in Storm's ecosystem.  Still, I do
>> > see the need to make sure Storm and Kafka do work together without
>> > having to go through forks of forks on GitHub and spending days to
>> > figure out how to get data from Kafka (0.8) into Storm.
>> >    Speaking of Kafka spout implementations, please don't forget
>> > https://github.com/HolmesNL/kafka-spout in addition to Wurstmeister's.
>> > We've been quite happy with the former, so I'd suggest at least
>> > considering both options here (maybe the two projects can even join
>> > forces?).
>> >
>> > [Storm examples, storm-starter]
>> > Second, IMHO every open source project should have a "1-click starting
>> > experience" for new users.  That's very much related to the project
>> > principles of tools like LogStash [1] who say: "Community: If a newbie
>> > has a bad time, it's a bug."  For this reason I personally would like
>> > to see the equivalent of storm-starter being brought into the "core"
>> > Storm project -- think of an examples/ sub-module.  If the level of
>> > effort is deemed too high to e.g. maintain what's already in
>> > storm-starter, then (say) reduce the scope and remove some of the
>> > examples.  In any case I personally would like to see bundled
>> > examples that are known to work with the latest version of Storm.
>> > storm-starter is often used to show new users how to get started with
>> > Storm (I used that approach in my Storm blog posts, for instance, and
>> > others like Mesosphere.io are even using storm-starter for their
>> > commercial offerings [2]).
>> >
>> > [Have Storm up and running faster than you can brew an espresso]
>> > Third, for the same reason (get people up and running in a few
>> > minutes), I do like that other people in this thread have been
>> > bringing up projects like storm-deploy.  For the same reason I have
>> > open sourced puppet-storm [3] (and puppet-kafka, for that matter) a
>> > few days ago, and I'll soon open source another Vagrant/Puppet based
>> > tool that provides you with 1-click local and remote deployments of
>> > Storm and Kafka clusters.  That's way better IMHO than having to
>> > follow long articles or blog posts to deploy your first cluster.  And
>> > there are a number of other people that have been rolling their own
>> > variants.  Now don't get me wrong -- I don't mention this to pitch any
>> > of those tools.  My intention is to say that it would be greatly
>> > helpful to have /something/ like this for Storm, for the same reason
>> > that it's nice to have LocalCluster for unit testing.  I have been
>> > demo'ing both Storm and Kafka by launching clusters with a simple
>> > command line, which always gets people excited.  If they can then rely
>> > on existing examples (see above) to also /run/ an analysis on "their"
>> > cluster then they have a beautiful start.
>> >    Oh, and btw:  Apache Aurora (with Mesos) has such a Vagrant-based
>> > VM cluster setup, too [4] so that people can run the Aurora tutorial
>> > on their machines in a few minutes.
>> >
>> > [Storm and YARN]
>> > Fourth, and for similar reasons as #2 and #3, bringing in storm-yarn
>> > would be nice.  It ties into being able to run LocalCluster as well as
>> > to run Storm in local or remote VMs -- but now alongside your existing
>> > Hadoop/YARN infrastructure.  For those preferring Mesos, Storm-on-Mesos
>> > will surely be similarly attractive.
>> >
>> >
>> > On a related note bringing the Storm docs up to speed with the quality
>> > of the Storm code would also be great.  I have seen that since Storm
>> > moved to Incubator several new sections have been added such as the
>> > FAQ [5] (btw: nice!).
>> >
>> > Similarly, there should be better examples and docs showing users how to
>> > write unit tests for Storm.  Right now people seem to be cobbling
>> > together their test code by figuring out how the 1-year-old code in
>> > [6] actually works, and copy-pasting other people's test code from
>> > GitHub.
>> >
>> > --
>> >
>> > As I said above, these are my personal $.02.  I admit that my comments
>> > go a bit beyond the original question of bringing in contrib modules
>> > -- I think implicitly the discussion about the contrib modules also
>> > means "what do you need to provide a better and more well-rounded
>> > experience", i.e. the question whether to have batteries included or
>> > not. (As you may suspect I'm leaning towards including at least the
>> > most important batteries, though what's really "important" at the
>> > project level is of course up for debate.)
>> >
>> > On my side I'd be happy to help with those areas where I am able to
>> > contribute, whether that's code and examples (like storm-starter) or
>> > tutorials/docs (I already wrote e.g. [7] and [8]).
>> >
>> > Again, thanks Taylor for starting this discussion.  No matter the
>> > actual outcome I'm sure the state of the project will be improved.
>> >
>> > Best,
>> > Michael
>> >
>> >
>> >
>> > [1] https://github.com/elasticsearch/logstash
>> > [2] http://mesosphere.io/learn/run-storm-on-mesos/#step-7
>> > [3] https://github.com/miguno/puppet-storm
>> > [4] https://github.com/apache/incubator-aurora/blob/master/docs/vagrant.md
>> > [5] http://storm.incubator.apache.org/documentation/FAQ.html
>> > [6] https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java
>> > [7] https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology
>> > [8] http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
>> >
>> >
>> >
>> >> On 02/26/2014 08:21 PM, P. Taylor Goetz wrote:
>> >> Thanks for the feedback Bobby.
>> >>
>> >> To clarify, I'm mainly talking about spout/bolt/trident state
>> >> implementations that integrate storm with *Technology X*, where
>> >> *Technology X* is not a fundamental part of storm.
>> >>
>> >> Examples would be technologies that are part of or related to the
>> >> Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.:
>> >> Kafka, HDFS, HBase, Cassandra, etc.
>> >>
>> >> The idea behind having one or more Storm committers act as a
>> >> "sponsor" is to make sure new additions are done carefully and with
>> >> good reason. To add a new module, it would require committer/PPMC
>> >> consensus, and assignment of one or more sponsors. Part of a
>> >> sponsor's job would be to ensure that a module is maintained, which
>> >> would require enough familiarity with the code to support it long
>> >> term. If a new module was proposed, but no committers were willing
>> >> to act as a sponsor, it would not be added.
>> >>
>> >> It would be the Committers'/PPMC's responsibility to make sure things
>> >> didn't get out of hand, and to do something about it if they did.
>> >>
>> >> Here's an old Hadoop JIRA thread [1] discussing the addition of
>> >> Hive as a contrib module, similar to what happened with HBase as
>> >> Bobby pointed out. Some interesting points are brought up. The
>> >> difference here is that both HBase and Hive were pretty big
>> >> codebases relative to Hadoop. With spout/bolt/state implementations
>> >> I doubt we'd see anything along that scale.
>> >>
>> >> - Taylor
>> >>
>> >> [1] https://issues.apache.org/jira/browse/HADOOP-3601
>> >>
>> >>
>> >> On Feb 26, 2014, at 12:35 PM, Bobby Evans <[email protected]> wrote:
>> >>
>> >>> I can see a lot of value in having a distribution of storm that
>> >>> comes with batteries included, everything is tested together and
>> >>> you know it works.  But I don't see much long term developer
>> >>> benefit in building them all together.  If there is strong
>> >>> coupling between storm and these external projects so that they
>> >>> break when storm changes then we need to understand the coupling
>> >>> and decide if we want to reduce that coupling by stabilizing
>> >>> APIs, improving version numbering and release process, etc.; or
>> >>> if the functionality is something that should be offered as a
>> >>> base service in storm.
>> >>>
>> >>> I can see politically the value of giving these other projects a
>> >>> home in Apache, and making them sub-projects is the simplest
>> >>> route to that. I'd love to have storm on yarn inside Apache.  I
>> >>> just don't want to go overboard with it.  There was a time when
>> >>> HBase was a "contrib" module under Hadoop along with a lot of
>> >>> other things, and the Apache board came and told Hadoop to break
>> >>> it up.
>> >>>
>> >>> Bringing storm-kafka into storm does not sound like it will solve
>> >>> much from a developer's perspective, because there is at least as
>> >>> much coupling with kafka as there is with storm.  I can see how
>> >>> it is a huge amount of overhead and pain to set up a new project
>> >>> just for a few hundred lines of code, as such I am in favor of
>> >>> pulling in closely related projects, especially those that are
>> >>> spouts and state implementations. I just want to be sure that we
>> >>> do it carefully, with a good reason, and with enough people who
>> >>> are familiar with the code to support it long term.
>> >>>
>> >>> If it starts to look like we are pulling in too many projects
>> >>> perhaps we should look at something more like the bigtop project
>> >>> https://bigtop.apache.org/ which produces a tested distribution
>> >>> of Hadoop with many different sub-projects included in it.
>> >>>
>> >>> I am also a bit concerned about these sub-projects becoming
>> >>> second class citizens, where we break something, but because the
>> >>> build is off by default we don't know it.  I would prefer that
>> >>> they are built and tested by default.  If the build and test time
>> >>> starts to take too long, to me that means we need to start
>> >>> wondering if we have too many contrib modules.
>> >>>
>> >>> --Bobby
>> >>>
>> >>> From: Brian Enochson <[email protected]>
>> >>> Reply-To: "[email protected]" <[email protected]>
>> >>> Date: Tuesday, February 25, 2014 at 9:50 PM
>> >>> To: "[email protected]" <[email protected]>
>> >>> Cc: "[email protected]" <[email protected]>
>> >>> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>> >>>
>> >>> hi, I am in agreement with Taylor and believe I understand his
>> >>> intent. An incredible tool/framework/application like Storm is
>> >>> only enhanced and gains value from the number of well maintained
>> >>> and vetted modules that can be used for integration and adding
>> >>> further functionality. I am relatively new to the Storm community
>> >>> but have spent quite some time reviewing contributing modules out
>> >>> there, reviewing various duplicates and running into some version
>> >>> incompatibilities. I understand the need to keep Storm itself
>> >>> pure, but do think there needs to be some structure and
>> >>> governance added to the contributing modules. Look at the benefit
>> >>> a tool like npm brings to the node community. I like the idea of
>> >>> sponsorship, vetting and a community vote.  I, as I'm sure many would
>> >>> be, am willing to offer support and time to working through how
>> >>> to set this up and helping with the implementation if it is
>> >>> decided to pursue some solution. I hope these views are taken in
>> >>> the spirit they are made, to make this incredible system even
>> >>> better along with the surrounding eco-system.
>> >>>
>> >>> Thanks, Brian
>> >>>
>> >>>
>> >>> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz <[email protected]> wrote:
>> >>>
>> >>> Just to be clear (and play a little Devil's advocate :) ), I'm not
>> >>> suggesting that whatever a "contrib" project/module/subproject
>> >>> might become, be a clearinghouse for anything Storm-related.
>> >>>
>> >>> I see it as something that is well-vetted by the Storm
>> >>> community, subject to PPMC review, vote, etc. Entry would require
>> >>> community review, PPMC review, and in some cases ASF IP
>> >>> clearance/legal review. Anything added would require some level
>> >>> of commitment from the PPMC/committers to provide some level of
>> >>> support.
>> >>>
>> >>> In other words, nothing "willy-nilly".
>> >>>
>> >>> One option could be that any module added requires some number
>> >>> (X > 0) of committers to volunteer as "sponsors" for the module, and
>> >>> commit to maintaining it.
>> >>>
>> >>> That being said, I don't see storm-kafka being any different
>> >>> from anything else that provides integration points for Storm.
>> >>>
>> >>> -Taylor
>> >>>
>> >>>
>> >>> On Feb 25, 2014, at 7:53 PM, Nathan Marz <[email protected]> wrote:
>> >>>
>> >>> I'm only +1 for pulling in storm-kafka and updating it. Other
>> >>> projects put these contrib modules in a "contrib" folder and keep
>> >>> them managed as completely separate codebases. As it's not
>> >>> actually a "module" necessary for Storm, there's an argument
>> >>> there for doing it that way rather than via the multi-module
>> >>> route.
>> >>>
>> >>>
>> >>> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage <[email protected]> wrote:
>> >>>
>> >>> Hi Taylor,
>> >>>
>> >>> I'm +1 for pulling these external libraries into the Apache codebase.
>> >>> This will certainly benefit the Storm community. I would also like to
>> >>> contribute to this process.
>> >>>
>> >>> Thanks Milinda
>> >>>
>> >>> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz <[email protected]> wrote:
>> >>>> A while back I opened STORM-206 [1] to capture ideas for
>> >>>> pulling in "contrib" modules to the Apache codebase.
>> >>>>
>> >>>> In the past, we had the storm-contrib github project [2] which
>> >>>> subsequently got broken up into individual projects hosted on
>> >>>> the stormprocessor github group [3] and elsewhere.
>> >>>>
>> >>>> The problem with this approach is that in certain cases it led
>> >>>> to code rot (modules not being updated in step with Storm's
>> >>>> API), fragmentation (multiple similar modules with the same
>> >>>> name), and confusion.
>> >>>>
>> >>>> A good example of this is the storm-kafka module [4], since it
>> >>>> is a widely used component. Because storm-contrib wasn't being
>> >>>> tagged in github, a lot of users had trouble determining which
>> >>>> versions of storm it was compatible with. Some users built off
>> >>>> specific commit hashes, some forked, and a few even pushed
>> >>>> custom builds to repositories such as clojars. With kafka 0.8
>> >>>> now available, there are two main storm-kafka projects, the
>> >>>> original (compatible with kafka 0.7) and an updated fork [5]
>> >>>> (compatible with kafka 0.8).
>> >>>>
>> >>>> My intention is not to find fault in any way, but rather to
>> >>>> point out the resulting pain, and work toward a better
>> >>>> solution.
>> >>>>
>> >>>> I think it would be beneficial to the Storm user community to
>> >>>> have certain commonly used modules like storm-kafka brought
>> >>>> into the Apache Storm project. Another benefit worth
>> >>>> considering is the licensing/legal oversight that the ASF
>> >>>> provides, which is important to many users.
>> >>>>
>> >>>> If this is something we want to do, then the big question
>> >>>> becomes what sort of governance process needs to be established to
>> >>>> ensure that such things are properly maintained.
>> >>>>
>> >>>> Some random thoughts, questions, etc. that jump to mind
>> >>>> include:
>> >>>>
>> >>>> - What to call these things: "contrib modules", "connectors",
>> >>>>   "integration modules", etc.?
>> >>>> - Build integration: I imagine they would be a multi-module
>> >>>>   submodule of the main maven build. Probably turned off by
>> >>>>   default and enabled by a maven profile.
>> >>>> - Governance: Have one or more committer volunteers responsible
>> >>>>   for maintenance, merging patches, etc.?
>> >>>> - Proposal process for pulling new modules?
>> >>>>
>> >>>>
>> >>>> I look forward to hearing others' opinions.
>> >>>>
>> >>>> - Taylor
>> >>>>
>> >>>>
>> >>>> [1] https://issues.apache.org/jira/browse/STORM-206
>> >>>> [2] https://github.com/nathanmarz/storm-contrib
>> >>>> [3] https://github.com/stormprocessor
>> >>>> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
>> >>>> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
>> >
>>
>
>
>
> --
> Twitter: @nathanmarz
> http://nathanmarz.com
>
>
>
>


-- 
Twitter: @nathanmarz
http://nathanmarz.com
