What about both?

"Connectors" for spout/bolt/state implementations that connect to other tech (storm-kafka, storm-cassandra, etc.).
"Extras" for other things like storm-starter, storm-deploy, storm-puppet.

On 13 Mar 2014, at 3:57 pm, Nathan Marz <[email protected]> wrote:
> I don't like either name tbh. Storm itself is already broken into modules
> (storm-core, storm-netty, etc) and things like storm-starter and storm-kafka
> are something different. I don't like "connectors" because something like
> storm-starter is not a connector. Maybe we call them "extras"?
>
> I would say just to support 0.8.x of Kafka.
>
>
> On Wed, Mar 12, 2014 at 11:33 PM, P. Taylor Goetz <[email protected]> wrote:
> Incorporation of storm starter is underway.
>
> I'd like to turn the attention to kafka, with the goal being to pull in kafka
> support that is maintained and will be known to be compatible with the
> current version of storm and specific version(s) of kafka.
>
> I have the following questions for the community:
>
> 1. What do we want to call additions like this? I'm leaning toward "modules"
> or "connectors".
>
> 2. Do we want to support both 0.7.x and 0.8.x versions of kafka, or just
> 0.8.x? From a release management perspective, the latter is preferable
> because the 0.7.x line artifacts are not in maven central. This makes
> building a real pain, and maintaining support for two versions won't be fun.
> Also, most of the people I have worked with are looking at 0.8.x for a
> variety of reasons, but I'm open to either way.
>
> - Taylor
>
>
> > On Mar 1, 2014, at 5:11 AM, "Michael G. Noll"
> > <[email protected]> wrote:
> >
> > Thanks for starting this discussion, Taylor.
> >
> > As a user of Storm (and a small-scale contributor to storm-starter) as
> > well as a user of Kafka, here are my $.02.
> >
> > [Storm and Kafka]
> > First, I agree with Nathan that storm-kafka should be considered to be
> > brought in. While various "integrate Storm with X" options exist,
> > basically everyone I have been talking to is using Kafka in
> > combination with Storm.
> > I'm sure this is not a representative sample
> > of Storm users, and of course one may or may not agree that Kafka is
> > important enough of a technology in Storm's ecosystem. Still, I do
> > see the need to make sure Storm and Kafka do work together without
> > having to go through forks of forks on GitHub and spending days to
> > figure out how to get data from Kafka (0.8) into Storm.
> > Speaking of Kafka spout implementations, please don't forget
> > https://github.com/HolmesNL/kafka-spout in addition to Wurstmeister's.
> > We've been quite happy with the former, so I'd suggest to at least
> > consider both options here (maybe the two projects can even join forces?).
> >
> > [Storm examples, storm-starter]
> > Second, IMHO every open source project should have a "1-click starting
> > experience" for new users. That's very much related to the project
> > principles of tools like LogStash [1] who say: "Community: If a newbie
> > has a bad time, it's a bug." For this reason I personally would like
> > to see the equivalent of storm-starter being brought into the "core"
> > Storm project -- think of an examples/ sub-module. If the level of
> > effort is deemed too high to e.g. maintain what's already in
> > storm-starter, then (say) reduce the scope and remove some of the
> > examples. In any case I personally would like to see bundled
> > examples that are known to work with the latest version of Storm.
> > storm-starter is often used to show new users how to get started with
> > Storm (I used that approach in my Storm blog posts, for instance, and
> > others like Mesosphere.io are even using storm-starter for their
> > commercial offerings [2]).
> >
> > [Have Storm up and running faster than you can brew an espresso]
> > Third, for the same reason (get people up and running in a few
> > minutes), I do like that other people in this thread have been
> > bringing up projects like storm-deploy.
> > For the same reason I
> > open sourced puppet-storm [3] (and puppet-kafka, for that matter) a
> > few days ago, and I'll soon open source another Vagrant/Puppet based
> > tool that provides you with 1-click local and remote deployments of
> > Storm and Kafka clusters. That's way better IMHO than having to
> > follow long articles or blog posts to deploy your first cluster. And
> > there are a number of other people that have been rolling their own
> > variants. Now don't get me wrong -- I don't mention this to pitch any
> > of those tools. My intention is to say that it would be greatly
> > helpful to have /something/ like this for Storm, for the same reason
> > that it's nice to have LocalCluster for unit testing. I have been
> > demo'ing both Storm and Kafka by launching clusters with a simple
> > command line, which always gets people excited. If they can then rely
> > on existing examples (see above) to also /run/ an analysis on "their"
> > cluster then they have a beautiful start.
> > Oh, and btw: Apache Aurora (with Mesos) has such a Vagrant-based
> > VM cluster setup, too [4] so that people can run the Aurora tutorial
> > on their machines in a few minutes.
> >
> > [Storm and YARN]
> > Fourth, and for similar reasons as #2 and #3, bringing in storm-yarn
> > would be nice. It ties into being able to run LocalCluster as well as
> > to run Storm in local or remote VMs -- but now alongside your existing
> > Hadoop/YARN infrastructure. For those preferring Mesos, Storm-on-Mesos
> > will surely be similarly attractive.
> >
> >
> > On a related note, bringing the Storm docs up to speed with the quality
> > of the Storm code would also be great. I have seen that since Storm
> > moved to Incubator several new sections have been added such as the
> > FAQ [5] (btw: nice!).
> >
> > Similarly, there should be better examples and docs for users on how to
> > write unit tests for Storm.
> > Right now people seem to be cobbling
> > together their test code by figuring out how the 1-year old code in
> > [6] actually works, and copy-pasting other people's test code from GitHub.
> >
> > --
> >
> > As I said above, these are my personal $.02. I admit that my comments
> > go a bit beyond the original question of bringing in contrib modules
> > -- I think implicitly the discussion about the contrib modules also
> > means "what do you need to provide a better and more well-rounded
> > experience", i.e. the question whether to have batteries included or
> > not. (As you may suspect I'm leaning towards including at least the
> > most important batteries, though what's really "important" on the
> > project level is of course up for debate.)
> >
> > On my side I'd be happy to help with those areas where I am able to
> > contribute, whether that's code and examples (like storm-starter) or
> > tutorials/docs (I already wrote e.g. [7] and [8]).
> >
> > Again, thanks Taylor for starting this discussion. No matter the
> > actual outcome I'm sure the state of the project will be improved.
> >
> > Best,
> > Michael
> >
> >
> > [1] https://github.com/elasticsearch/logstash
> > [2] http://mesosphere.io/learn/run-storm-on-mesos/#step-7
> > [3] https://github.com/miguno/puppet-storm
> > [4] https://github.com/apache/incubator-aurora/blob/master/docs/vagrant.md
> > [5] http://storm.incubator.apache.org/documentation/FAQ.html
> > [6] https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java
> > [7] https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology
> > [8] http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
> >
> >
> >> On 02/26/2014 08:21 PM, P. Taylor Goetz wrote:
> >> Thanks for the feedback Bobby.
> >>
> >> To clarify, I’m mainly talking about spout/bolt/trident state
> >> implementations that integrate storm with *Technology X*, where
> >> *Technology X* is not a fundamental part of storm.
> >>
> >> Examples would be technologies that are part of or related to the
> >> Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.:
> >> Kafka, HDFS, HBase, Cassandra, etc.
> >>
> >> The idea behind having one or more Storm committers act as a
> >> “sponsor” is to make sure new additions are done carefully and with
> >> good reason. To add a new module, it would require committer/PPMC
> >> consensus, and assignment of one or more sponsors. Part of a
> >> sponsor’s job would be to ensure that a module is maintained, which
> >> would require enough familiarity with the code to support it long
> >> term. If a new module was proposed, but no committers were willing
> >> to act as a sponsor, it would not be added.
> >>
> >> It would be the Committers’/PPMC’s responsibility to make sure things
> >> didn’t get out of hand, and to do something about it if they do.
> >>
> >> Here’s an old Hadoop JIRA thread [1] discussing the addition of
> >> Hive as a contrib module, similar to what happened with HBase as
> >> Bobby pointed out. Some interesting points are brought up. The
> >> difference here is that both HBase and Hive were pretty big
> >> codebases relative to Hadoop. With spout/bolt/state implementations
> >> I doubt we’d see anything along that scale.
> >>
> >> - Taylor
> >>
> >> [1] https://issues.apache.org/jira/browse/HADOOP-3601
> >>
> >>
> >> On Feb 26, 2014, at 12:35 PM, Bobby Evans <[email protected]> wrote:
> >>
> >>> I can see a lot of value in having a distribution of storm that
> >>> comes with batteries included, everything is tested together and
> >>> you know it works. But I don’t see much long term developer
> >>> benefit in building them all together.
> >>> If there is strong
> >>> coupling between storm and these external projects so that they
> >>> break when storm changes then we need to understand the coupling
> >>> and decide if we want to reduce that coupling by stabilizing
> >>> APIs, improving version numbering and release process, etc.; or
> >>> if the functionality is something that should be offered as a
> >>> base service in storm.
> >>>
> >>> I can see politically the value of giving these other projects a
> >>> home in Apache, and making them sub-projects is the simplest
> >>> route to that. I’d love to have storm on yarn inside Apache. I
> >>> just don’t want to go overboard with it. There was a time when
> >>> HBase was a “contrib” module under Hadoop along with a lot of
> >>> other things, and the Apache board came and told Hadoop to break
> >>> it up.
> >>>
> >>> Bringing storm-kafka into storm does not sound like it will solve
> >>> much from a developer’s perspective, because there is at least as
> >>> much coupling with kafka as there is with storm. I can see how
> >>> it is a huge amount of overhead and pain to set up a new project
> >>> just for a few hundred lines of code; as such I am in favor of
> >>> pulling in closely related projects, especially those that are
> >>> spouts and state implementations. I just want to be sure that we
> >>> do it carefully, with a good reason, and with enough people who
> >>> are familiar with the code to support it long term.
> >>>
> >>> If it starts to look like we are pulling in too many projects
> >>> perhaps we should look at something more like the bigtop project
> >>> https://bigtop.apache.org/ which produces a tested distribution
> >>> of Hadoop with many different sub-projects included in it.
> >>>
> >>> I am also a bit concerned about these sub-projects becoming
> >>> second class citizens, where we break something, but because the
> >>> build is off by default we don’t know it. I would prefer that
> >>> they are built and tested by default.
> >>> If the build and test time
> >>> starts to take too long, to me that means we need to start
> >>> wondering if we have too many contrib modules.
> >>>
> >>> - Bobby
> >>>
> >>> From: Brian Enochson <[email protected]>
> >>> Reply-To: "[email protected]" <[email protected]>
> >>> Date: Tuesday, February 25, 2014 at 9:50 PM
> >>> To: "[email protected]" <[email protected]>
> >>> Cc: "[email protected]" <[email protected]>
> >>> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
> >>>
> >>> Hi, I am in agreement with Taylor and believe I understand his
> >>> intent. An incredible tool/framework/application like Storm is
> >>> only enhanced and gains value from the number of well maintained
> >>> and vetted modules that can be used for integration and adding
> >>> further functionality. I am relatively new to the Storm community
> >>> but have spent quite some time reviewing contributing modules out
> >>> there, reviewing various duplicates and running into some version
> >>> incompatibilities. I understand the need to keep Storm itself
> >>> pure, but do think there needs to be some structure and
> >>> governance added to the contributing modules. Look at the benefit
> >>> a tool like npm brings to the node community. I like the idea of
> >>> sponsorship, vetting and a community vote. I, as I'm sure many would
> >>> be, am willing to offer support and time to working through how
> >>> to set this up and helping with the implementation if it is
> >>> decided to pursue some solution.
> >>> I hope these views are taken in
> >>> the spirit they are made, to make this incredible system even
> >>> better along with the surrounding eco-system.
> >>>
> >>> Thanks, Brian
> >>>
> >>>
> >>> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz
> >>> <[email protected]> wrote:
> >>> Just to be clear (and play a little Devil’s advocate :) ), I’m not
> >>> suggesting that whatever a “contrib” project/module/subproject
> >>> might become, be a clearinghouse for anything Storm-related.
> >>>
> >>> I see it as something that is well-vetted by the Storm
> >>> community, subject to PPMC review, vote, etc. Entry would require
> >>> community review, PPMC review, and in some cases ASF IP
> >>> clearance/legal review. Anything added would require some level
> >>> of commitment from the PPMC/committers to provide some level of
> >>> support.
> >>>
> >>> In other words, nothing “willy-nilly”.
> >>>
> >>> One option could be that any module added require (X > 0) number
> >>> of committers to volunteer as “sponsor”s for the module, and
> >>> commit to maintaining it.
> >>>
> >>> That being said, I don’t see storm-kafka being any different
> >>> from anything else that provides integration points for Storm.
> >>>
> >>> -Taylor
> >>>
> >>>
> >>> On Feb 25, 2014, at 7:53 PM, Nathan Marz <[email protected]>
> >>> wrote:
> >>>
> >>> I'm only +1 for pulling in storm-kafka and updating it. Other
> >>> projects put these contrib modules in a "contrib" folder and keep
> >>> them managed as completely separate codebases. As it's not
> >>> actually a "module" necessary for Storm, there's an argument
> >>> there for doing it that way rather than via the multi-module
> >>> route.
> >>>
> >>>
> >>> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage
> >>> <[email protected]> wrote:
> >>> Hi Taylor,
> >>>
> >>> I'm +1 for pulling these external libraries into the Apache codebase.
> >>> This will certainly benefit the Storm community. I would also like to
> >>> contribute to this process.
> >>>
> >>> Thanks
> >>> Milinda
> >>>
> >>> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz
> >>> <[email protected]> wrote:
> >>>> A while back I opened STORM-206 [1] to capture ideas for
> >>>> pulling in "contrib" modules to the Apache codebase.
> >>>>
> >>>> In the past, we had the storm-contrib github project [2] which
> >>>> subsequently got broken up into individual projects hosted on
> >>>> the stormprocessor github group [3] and elsewhere.
> >>>>
> >>>> The problem with this approach is that in certain cases it led
> >>>> to code rot (modules not being updated in step with Storm's
> >>>> API), fragmentation (multiple similar modules with the same
> >>>> name), and confusion.
> >>>>
> >>>> A good example of this is the storm-kafka module [4], since it
> >>>> is a widely used component. Because storm-contrib wasn't being
> >>>> tagged in github, a lot of users had trouble reconciling with
> >>>> which versions of storm it was compatible. Some users built off
> >>>> specific commit hashes, some forked, and a few even pushed
> >>>> custom builds to repositories such as clojars. With kafka 0.8
> >>>> now available, there are two main storm-kafka projects, the
> >>>> original (compatible with kafka 0.7) and an updated fork [5]
> >>>> (compatible with kafka 0.8).
> >>>>
> >>>> My intention is not to find fault in any way, but rather to
> >>>> point out the resulting pain, and work toward a better
> >>>> solution.
> >>>>
> >>>> I think it would be beneficial to the Storm user community to
> >>>> have certain commonly used modules like storm-kafka brought
> >>>> into the Apache Storm project. Another benefit worth
> >>>> considering is the licensing/legal oversight that the ASF
> >>>> provides, which is important to many users.
> >>>>
> >>>> If this is something we want to do, then the big question
> >>>> becomes what sort of governance process needs to be established to
> >>>> ensure that such things are properly maintained.
> >>>>
> >>>> Some random thoughts, questions, etc. that jump to mind
> >>>> include:
> >>>>
> >>>> What to call these things: "contrib modules", "connectors",
> >>>> "integration modules", etc.?
> >>>> Build integration: I imagine they
> >>>> would be a multi-module submodule of the main maven build.
> >>>> Probably turned off by default and enabled by a maven profile.
> >>>> Governance: Have one or more committer volunteers responsible
> >>>> for maintenance, merging patches, etc.?
> >>>> Proposal process for pulling new modules?
> >>>>
> >>>>
> >>>> I look forward to hearing others' opinions.
> >>>>
> >>>> - Taylor
> >>>>
> >>>>
> >>>> [1] https://issues.apache.org/jira/browse/STORM-206
> >>>> [2] https://github.com/nathanmarz/storm-contrib
> >>>> [3] https://github.com/stormprocessor
> >>>> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
> >>>> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
>
>
> --
> Twitter: @nathanmarz
> http://nathanmarz.com
