We also don't want to create the impression that anything and everything belongs in the Storm project itself. storm-kafka is special because Kafka works so well with Storm and is so widely used. But if there were a folder called "connectors" or "adapters", people might think we're willing to pull in anything and everything.
+1 for putting storm-starter in an examples/ directory. On Thu, Mar 13, 2014 at 5:28 PM, P. Taylor Goetz <[email protected]> wrote: > To clarify somewhat, the pull request for pulling in storm-starter [1] > puts it in an "examples" directory. And there are suggestions to pull > James' scheduler and testing examples in there as well. So there is a > distinction between examples and other things like storm-kafka. > > What I'm proposing is a different yet-to-be-named directory that would be > home to things that integrate storm with other technologies. > > In the storm-contrib README [2] the term used is "modules". On the Storm > website, we also use the term "adapter" [3]. > > - Taylor > > [1] https://github.com/apache/incubator-storm/pull/44 > [2] > https://github.com/nathanmarz/storm-contrib/blob/master/README.md#about > [3] > http://storm.incubator.apache.org/documentation/Spout-implementations.html > > > On Mar 13, 2014, at 1:24 AM, David Miller <[email protected]> > wrote: > > > what about both ? > connectors for spout/bolt/states that connect to other tech, storm-kafka, > storm-cassandra, etc > extras for other things like storm-starter, storm-deploy, storm-puppet > > > > On 13 Mar 2014, at 3:57 pm, Nathan Marz <[email protected]> wrote: > > I don't like either name tbh. Storm itself is already broken into modules > (storm-core, storm-netty, etc) and things like storm-starter and > storm-kafka are something different. I don't like "connectors" because > something like storm-starter is not a connector. Maybe we call them > "extras"? > > I would say just to support 0.8.x of Kafka. > > > On Wed, Mar 12, 2014 at 11:33 PM, P. Taylor Goetz <[email protected]>wrote: > >> Incorporation of storm starter is underway. >> >> I'd like to turn the attention to kafka, with the goal being to pull in >> kafka support that is maintained and will be known to be compatible with >> the current version of storm and specific version(s) of kafka. 
>> >> I have the following questions for the community: >> >> 1. What do we want to call additions like this? I'm leaning toward >> "modules" or "connectors". >> >> 2. Do we want to support both 0.7.x and 0.8.x versions of kafka, or just >> 0.8.x? From a release management perspective, the latter is preferable >> because the 0.7.x line artifacts are not in maven central. This makes >> building a real pain, and maintaining support for two versions won't be >> fun. Also, most of the people I have worked with are looking at 0.8.x for a >> variety of reasons, but I'm open to either way. >> >> - Taylor >> >> >> > On Mar 1, 2014, at 5:11 AM, "Michael G. Noll" < >> [email protected]> wrote: >> > >> > Thanks for starting this discussion, Taylor. >> > >> > As a user of Storm (and a small-scale contributor to storm-starter) as >> > well as a user of Kafka, here are my $.02. >> > >> > [Storm and Kafka] >> > First, I agree with Nathan that storm-kafka should be considered to be >> > brought in. While various "integrate Storm with X" options exist, >> > basically everyone I have been talking to is using Kafka in >> > combination with Storm. I'm sure this is not a representative sample >> > of Storm users, and of course one may or may not agree that Kafka is >> > important enough of a technology in Storm's ecosystem. Still, I do >> > see the need to make sure Storm and Kafka do work together without >> > having to go through forks of forks on GitHub and spending days to >> > figure out how to get data from Kafka (0.8) into Storm. >> > Speaking of Kafka spout implementations, please don't forget >> > https://github.com/HolmesNL/kafka-spout in addition to Wurstmeister's. >> > We've been quite happy with the former, so I'd suggest to at least >> > consider both options here (maybe the two projects can even join >> forces?). >> > >> > [Storm examples, storm-starter] >> > Second, IMHO every open source project should have a "1-click starting >> > experience" for new users. 
That's very much related to the project >> > principles of tools like LogStash [1] who say: "Community: If a newbie >> > has a bad time, it's a bug." For this reason I personally would like >> > to see the equivalent of storm-starter being brought into the "core" >> > Storm project -- think of an examples/ sub-module. If the level of >> > effort is deemed too high to e.g. maintain what's already in >> > storm-starter, then (say) reduce the scope and remove some of the >> > examples. In any case I personally would like to see bundled >> > examples that are known to work with the latest version of Storm. >> > storm-starter is often used to show new users how to get started with >> > Storm (I used that approach in my Storm blog posts, for instance, and >> > others like Mesosphere.io are even using storm-starter for their >> > commercial offerings [2]). >> > >> > [Have Storm up and running faster than you can brew an espresso] >> > Third, for the same reason (get people up and running in a few >> > minutes), I do like that other people in this thread have been >> > bringing up projects like storm-deploy. For the same reason I >> > open sourced puppet-storm [3] (and puppet-kafka, for that matter) a >> > few days ago, and I'll soon open source another Vagrant/Puppet based >> > tool that provides you with 1-click local and remote deployments of >> > Storm and Kafka clusters. That's way better IMHO than having to >> > follow long articles or blog posts to deploy your first cluster. And >> > there are a number of other people that have been rolling their own >> > variants. Now don't get me wrong -- I don't mention this to pitch any >> > of those tools. My intention is to say that it would be greatly >> > helpful to have /something/ like this for Storm, for the same reason >> > that it's nice to have LocalCluster for unit testing. I have been >> > demo'ing both Storm and Kafka by launching clusters with a simple >> > command line, which always gets people excited. 
If they can then rely >> > on existing examples (see above) to also /run/ an analysis on "their" >> > cluster then they have a beautiful start. >> > Oh, and btw: Apache Aurora (with Mesos) has such a Vagrant-based >> > VM cluster setup, too [4] so that people can run the Aurora tutorial >> > on their machines in a few minutes. >> > >> > [Storm and YARN] >> > Fourth, and for similar reasons as #2 and #3, bringing in storm-yarn >> > would be nice. It ties into being able to run LocalCluster as well as >> > to run Storm in local or remote VMs -- but now alongside your existing >> > Hadoop/YARN infrastructure. For those preferring Mesos, Storm-on-Mesos >> > will surely be similarly attractive. >> > >> > >> > On a related note bringing the Storm docs up to speed with the quality >> > of the Storm code would also be great. I have seen that since Storm >> > moved to Incubator several new sections have been added such as the >> > FAQ [5] (btw: nice!). >> > >> > Similarly, there should be better examples and docs for users on how to >> > write unit tests for Storm. Right now people seem to be cobbling >> > together their test code by figuring out how the 1-year old code in >> > [6] actually works, and copy-pasting other people's test code from >> GitHub. >> > >> > -- >> > >> > As I said above, these are my personal $.02. I admit that my comments >> > go a bit beyond the original question of bringing in contrib modules >> > -- I think implicitly the discussion about the contrib modules also >> > means "what do you need to provide a better and more well-rounded >> > experience", i.e. the question whether to have batteries included or >> > not. (As you may suspect I'm leaning towards including at least the >> > most important batteries, though what's really "important" on the >> > project level is of course up for debate.) 
>> > >> > On my side I'd be happy to help with those areas where I am able to >> > contribute, whether that's code and examples (like storm-starter) or >> > tutorials/docs (I already wrote e.g. [7] and [8]). >> > >> > Again, thanks Taylor for starting this discussion. No matter the >> > actual outcome I'm sure the state of the project will be improved. >> > >> > Best, >> > Michael >> > >> > >> > >> > [1] https://github.com/elasticsearch/logstash >> > [2] http://mesosphere.io/learn/run-storm-on-mesos/#step-7 >> > [3] https://github.com/miguno/puppet-storm >> > [4] >> https://github.com/apache/incubator-aurora/blob/master/docs/vagrant.md >> > [5] http://storm.incubator.apache.org/documentation/FAQ.html >> > [6] >> > >> https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java >> > [7] >> > >> https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology >> > [8] >> > >> http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/ >> > >> > >> > >> >> On 02/26/2014 08:21 PM, P. Taylor Goetz wrote: >> >> Thanks for the feedback Bobby. >> >> >> >> To clarify, I'm mainly talking about spout/bolt/trident state >> >> implementations that integrate storm with *Technology X*, where >> >> *Technology X* is not a fundamental part of storm. >> >> >> >> Examples would be technologies that are part of or related to the >> >> Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.: >> >> Kafka, HDFS, HBase, Cassandra, etc. >> >> >> >> The idea behind having one or more Storm committers act as a >> >> "sponsor" is to make sure new additions are done carefully and with >> >> good reason. To add a new module, it would require committer/PPMC >> >> consensus, and assignment of one or more sponsors. Part of a >> >> sponsor's job would be to ensure that a module is maintained, which >> >> would require enough familiarity with the code to support it long >> >> term. 
If a new module was proposed, but no committers were willing >> >> to act as a sponsor, it would not be added. >> >> >> >> It would be the Committers'/PPMC's responsibility to make sure things >> >> didn't get out of hand, and to do something about it if they did. >> >> >> >> Here's an old Hadoop JIRA thread [1] discussing the addition of >> >> Hive as a contrib module, similar to what happened with HBase as >> >> Bobby pointed out. Some interesting points are brought up. The >> >> difference here is that both HBase and Hive were pretty big >> >> codebases relative to Hadoop. With spout/bolt/state implementations >> >> I doubt we'd see anything along that scale. >> >> >> >> - Taylor >> >> >> >> [1] https://issues.apache.org/jira/browse/HADOOP-3601 >> >> >> >> >> >> On Feb 26, 2014, at 12:35 PM, Bobby Evans <[email protected] >> >> <mailto:[email protected]>> wrote: >> >> >> >>> I can see a lot of value in having a distribution of storm that >> >>> comes with batteries included, everything is tested together and >> >>> you know it works. But I don't see much long term developer >> >>> benefit in building them all together. If there is strong >> >>> coupling between storm and these external projects so that they >> >>> break when storm changes then we need to understand the coupling >> >>> and decide if we want to reduce that coupling by stabilizing >> >>> APIs, improving version numbering and release process, etc.; or >> >>> if the functionality is something that should be offered as a >> >>> base service in storm. >> >>> >> >>> I can see politically the value of giving these other projects a >> >>> home in Apache, and making them sub-projects is the simplest >> >>> route to that. I'd love to have storm on yarn inside Apache. I >> >>> just don't want to go overboard with it. There was a time when >> >>> HBase was a "contrib" module under Hadoop along with a lot of >> >>> other things, and the Apache board came and told Hadoop to break >> >>> it up. 
>> >>> >> >>> Bringing storm-kafka into storm does not sound like it will solve >> >>> much from a developer's perspective, because there is at least as >> >>> much coupling with kafka as there is with storm. I can see how >> >>> it is a huge amount of overhead and pain to set up a new project >> >>> just for a few hundred lines of code, as such I am in favor of >> >>> pulling in closely related projects, especially those that are >> >>> spouts and state implementations. I just want to be sure that we >> >>> do it carefully, with a good reason, and with enough people who >> >>> are familiar with the code to support it long term. >> >>> >> >>> If it starts to look like we are pulling in too many projects >> >>> perhaps we should look at something more like the bigtop project >> >>> https://bigtop.apache.org/ which produces a tested distribution >> >>> of Hadoop with many different sub-projects included in it. >> >>> >> >>> I am also a bit concerned about these sub-projects becoming >> >>> second class citizens, where we break something, but because the >> >>> build is off by default we don't know it. I would prefer that >> >>> they are built and tested by default. If the build and test time >> >>> starts to take too long, to me that means we need to start >> >>> wondering if we have too many contrib modules. 
>> >>> >> >>> --Bobby >> >>> >> >>> From: Brian Enochson <[email protected] >> >>> <mailto:[email protected]><mailto:[email protected]>> >> > Reply-To: "[email protected] >> >>> <mailto:[email protected]><mailto: >> [email protected]>" >> > <[email protected] >> >>> <mailto:[email protected]><mailto: >> [email protected]>> >> > Date: Tuesday, February 25, 2014 at 9:50 PM >> >>> To: "[email protected] >> >>> <mailto:[email protected]><mailto: >> [email protected]>" >> > <[email protected] >> >>> <mailto:[email protected]><mailto: >> [email protected]>> >> > Cc: "[email protected] >> >>> <mailto:[email protected]><mailto: >> [email protected]>" >> > <[email protected] >> >>> <mailto:[email protected]><mailto: >> [email protected]>> >> > Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache >> >>> >> >>> hi, I am in agreement with Taylor and believe I understand his >> >>> intent. An incredible tool/framework/application like Storm is >> >>> only enhanced and gains value from the number of well maintained >> >>> and vetted modules that can be used for integration and adding >> >>> further functionality. I am relatively new to the Storm community >> >>> but have spent quite some time reviewing contributing modules out >> >>> there, reviewing various duplicates and running into some version >> >>> incompatibilities. I understand the need to keep Storm itself >> >>> pure, but do think there needs to be some structure and >> >>> governance added to the contributing modules. Look at the benefit >> >>> a tool like npm brings to the node community. I like the idea of >> >>> sponsorship, vetting and a community vote. I, as I'm sure many would >> >>> be, am willing to offer support and time to working through how >> >>> to set this up and helping with the implementation if it is >> >>> decided to pursue some solution. I hope these views are taken in >> >>> the spirit they are made, to make this incredible system even >> >>> better along with the surrounding eco-system. 
>> >>> >> >>> Thanks, Brian >> >>> >> >>> >> >>> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz >> >>> <[email protected] >> >>> <mailto:[email protected]><mailto:[email protected]>> wrote: Just >> >>> to be clear (and play a little Devil's advocate :) ), I'm not >> >>> suggesting that whatever a "contrib" project/module/subproject >> >>> might become, be a clearinghouse for anything Storm-related. >> >>> >> >>> I see it as something that is well-vetted by the Storm >> >>> community, subject to PPMC review, vote, etc. Entry would require >> >>> community review, PPMC review, and in some cases ASF IP >> >>> clearance/legal review. Anything added would require some level >> >>> of commitment from the PPMC/committers to provide some level of >> >>> support. >> >>> >> >>> In other words, nothing "willy-nilly". >> >>> >> >>> One option could be that any module added require (X > 0) number >> >>> of committers to volunteer as "sponsor"s for the module, and >> >>> commit to maintaining it. >> >>> >> >>> That being said, I don't see storm-kafka being any different >> >>> from anything else that provides integration points for Storm. >> >>> >> >>> -Taylor >> >>> >> >>> >> >>> On Feb 25, 2014, at 7:53 PM, Nathan Marz <[email protected] >> >>> <mailto:[email protected]><mailto:[email protected]>> >> >>> wrote: >> >>> >> >>> I'm only +1 for pulling in storm-kafka and updating it. Other >> >>> projects put these contrib modules in a "contrib" folder and keep >> >>> them managed as completely separate codebases. As it's not >> >>> actually a "module" necessary for Storm, there's an argument >> >>> there for doing it that way rather than via the multi-module >> >>> route. >> >>> >> >>> >> >>> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage >> >>> <[email protected] >> >>> <mailto:[email protected]><mailto:[email protected]>> >> >>> wrote: Hi Taylor, >> >>> >> >>> I'm +1 for pulling these external libraries into Apache codebase. 
>> >>> This will certainly benefit the Storm community. I would also like to >> >>> contribute to this process. >> >>> >> >>> Thanks Milinda >> >>> >> >>> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz >> >>> <[email protected] >> >>> <mailto:[email protected]><mailto:[email protected]>> wrote: >> >>>> A while back I opened STORM-206 [1] to capture ideas for >> >>>> pulling in "contrib" modules to the Apache codebase. >> >>>> >> >>>> In the past, we had the storm-contrib github project [2] which >> >>>> subsequently got broken up into individual projects hosted on >> >>>> the stormprocessor github group [3] and elsewhere. >> >>>> >> >>>> The problem with this approach is that in certain cases it led >> >>>> to code rot (modules not being updated in step with Storm's >> >>>> API), fragmentation (multiple similar modules with the same >> >>>> name), and confusion. >> >>>> >> >>>> A good example of this is the storm-kafka module [4], since it >> >>>> is a widely used component. Because storm-contrib wasn't being >> >>>> tagged in github, a lot of users had trouble reconciling with >> >>>> which versions of storm it was compatible. Some users built off >> >>>> specific commit hashes, some forked, and a few even pushed >> >>>> custom builds to repositories such as clojars. With kafka 0.8 >> >>>> now available, there are two main storm-kafka projects, the >> >>>> original (compatible with kafka 0.7) and an updated fork [5] >> >>>> (compatible with kafka 0.8). >> >>>> >> >>>> My intention is not to find fault in any way, but rather to >> >>>> point out the resulting pain, and work toward a better >> >>>> solution. >> >>>> >> >>>> I think it would be beneficial to the Storm user community to >> >>>> have certain commonly used modules like storm-kafka brought >> >>>> into the Apache Storm project. Another benefit worth >> >>>> considering is the licensing/legal oversight that the ASF >> >>>> provides, which is important to many users. 
>> >>>> >> >>>> If this is something we want to do, then the big question >> >>>> becomes what sort of governance process needs to be established to >> >>>> ensure that such things are properly maintained. >> >>>> >> >>>> Some random thoughts, questions, etc. that jump to mind >> >>>> include: >> >>>> >> >>>> What to call these things: "contrib modules", "connectors", >> >>>> "integration modules", etc.? Build integration: I imagine they >> >>>> would be a multi-module submodule of the main maven build. >> >>>> Probably turned off by default and enabled by a maven profile. >> >>>> Governance: Have one or more committer volunteers responsible >> >>>> for maintenance, merging patches, etc.? Proposal process for >> >>>> pulling new modules? >> >>>> >> >>>> >> >>>> I look forward to hearing others' opinions. >> >>>> >> >>>> - Taylor >> >>>> >> >>>> >> >>>> [1] https://issues.apache.org/jira/browse/STORM-206 [2] >> >>>> https://github.com/nathanmarz/storm-contrib [3] >> >>>> https://github.com/stormprocessor [4] >> >>>> https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka >> > [5] https://github.com/wurstmeister/storm-kafka-0.8-plus >> > >> > > > > -- > Twitter: @nathanmarz > http://nathanmarz.com > > > > -- Twitter: @nathanmarz http://nathanmarz.com
