First off, I want to say this is awesome! It has been great to see all the YARN offerings being released lately. I noticed Hadoop 2.x was recently voted beta, which is very exciting!
Like many, we use Storm for near-real-time processing of our Kafka-based streams. In addition, we send this data to Hadoop for offline analysis. Consolidating these three environments into one is a win by itself. I also really like the fault tolerance and security features. Are you guys using Samza in production yet at LinkedIn, or is it still in development?

The local state approach is very interesting. Are you guys using Databus for the feed of changes from the external stores? Is something like Voldemort integrated locally for the key/value store? Can you maintain multiple tables locally for stream processing?

Since we are using Storm, do any latency comparisons exist? Since Samza makes the fault tolerance/durability tradeoff of persisting to disk on every hop between StreamTasks, it would seem to take a hit here. That said, we use Trident a good bit, so many of our topologies are already slowed by remote calls to Cassandra.

I know it is fairly new, but were any comparisons against Spark Streaming considered? They take a similar tack of maintaining state locally as opposed to in external stores, but I believe they are limited to what can fit in memory.

Finally, where did the catchy name Samza come from?

Thanks!

Jonathan

On Fri, Aug 23, 2013 at 9:39 AM, Jay Kreps <jay.kr...@gmail.com> wrote:

> Hey guys,
>
> This may be relevant to people on this list. A few of us at LinkedIn have
> been working on Samza, a stream processing framework built on YARN. We just
> added this as an Apache Incubator project. We would love to get people's
> feedback (and help!). Here are the docs:
>
> http://samza.incubator.apache.org
>
> If anyone has any questions I'm happy to discuss what we are up to. Our
> mailing list is here:
>
> http://samza.incubator.apache.org/community/mailing-lists.html
>
> -Jay
>