Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-09-10 Thread Guozhang Wang
Folks, I would like to revive this thread on KIP-28: I have just updated the patch rebased on latest trunk incorporating the feedbacks collected so far: https://github.com/apache/kafka/pull/130 And the wiki page for this KIP has also been updated with the API and architectural designs:

[DISCUSS] KIP-28 - Add a transform client for data processing

2015-08-19 Thread Yan Fang
Hi Guozhang, Thank you for writing the KIP-28 up. (Hope this is the right thread for me to post some comments. :) I still have some confusing about the implementation of the Processor: 1. why do we maintain a separate consumer and producer for each worker thread? — from my understanding,

[DISCUSS] KIP-28 - Add a transform client for data processing

2015-08-19 Thread Yan Fang
Hi Guozhang, Thank you for writing the KIP-28 up. (Hope this is the right thread for me to post some comments. :) I still have some confusing about the implementation of the Processor: 1. why do we maintain a separate consumer and producer for each worker thread? — from my understanding,

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-08-11 Thread Guozhang Wang
Jiangjie, Thanks for the explanation, now I understands the scenario. It is one of the CEP in stream processing, in which I think the local state should be used for some sort of pattern matching. More concretely, let's say in this case we have a local state storing what have been observed. Then

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-08-11 Thread Jiangjie Qin
Guozhang, By interleaved groups of message, I meant something like this: Say we have message 0,1,2,3, message 0 and 2 together completes a business logic, message 1 and 3 together completes a business logic. In that case, after user processed message 2, they cannot commit offsets because if they

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-08-10 Thread Guozhang Wang
Hello folks, I have updated the KIP page with some detailed API / architecture / packaging proposals, along with the long promised first patch in PR: https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+processor+client https://github.com/apache/kafka/pull/130 Any feedbacks /

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-08-10 Thread Guozhang Wang
Hi Jiangjie, Not sure I understand the What If user have interleaved groups of messages, each group makes a complete logic? Could you elaborate a bit? About the committing functionality, it currently will only commit up to the processed message's offset; the commit() call it self actually does

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-08-10 Thread Guozhang Wang
Hi Jun, 1. I have removed the streamTime in punctuate() since it is not only triggered by clock time, detailed explanation can be found here: https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+processor+client#KIP-28-Addaprocessorclient-StreamTime 2. Yes, if users do not schedule

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-08-04 Thread Jun Rao
A few questions/comments. 1. What's streamTime passed to punctuate()? Is that just the current time? 2. Is punctuate() only called if schedule() is called? 3. The way the KeyValueStore is created seems a bit weird. Since this is part of the internal state managed by KafkaProcessorContext, it

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-31 Thread Jiangjie Qin
I think the abstraction of processor would be useful. It is not quite clear to me yet though which grid in the following API analysis chart this processor is trying to satisfy. https://cwiki.apache.org/confluence/display/KAFKA/New+consumer+API+change+proposal For example, in current proposal. It

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-31 Thread Gwen Shapira
Just a quick ping, that regardless of the name of the thing, I'm still interested in answers to my questions :) On Tue, Jul 28, 2015 at 3:07 PM, Gwen Shapira gshap...@cloudera.com wrote: Thanks Guazhang! Much clearer now, at least for me. Few comments / questions: 1. Perhaps punctuate(int

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-30 Thread James Cheng
I agree with Sriram and Martin. Kafka is already about providing streams of data, and so Kafka Streams or anything like that is confusing to me. This new library is about making it easier to process the data. -James On Jul 30, 2015, at 9:38 AM, Aditya Auradkar aaurad...@linkedin.com.INVALID

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-30 Thread Martin Kleppmann
I'm with Sriram -- Kafka is all about streams already (or topics, to be precise, but we're calling it stream processing not topic processing), so I find Kafka Streams, KStream and Kafka Streaming all confusing, since they seem to imply that other bits of Kafka are not about streams. I would

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-30 Thread Aditya Auradkar
Personally, I prefer KafkaStreams just because it sounds nicer. For the reasons identified above, KafkaProcessor or KProcessor is more apt but sounds less catchy (IMO). I also think we should prefix with Kafka (rather than K) because we will then have 3 clients: KafkaProducer, KafkaConsumer and

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-30 Thread Guozhang Wang
I would vote for KStream as it sounds sexier (is it only me??), second to that would be Kafka Streaming. On Wed, Jul 29, 2015 at 6:08 PM, Jay Kreps j...@confluent.io wrote: Also, the most important part of any prototype, we should have a name for this producing-consumer-thingamgigy: Various

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-30 Thread Gwen Shapira
I think its also a matter of intent. If we see it as yet another client library, than Processor (to match Producer and Consumer) will work great. If we see it is a stream processing framework, the name has to start with S to follow existing convention. Speaking of naming conventions: You know how

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-30 Thread Sriram Subramanian
I had the same thought. Kafka processor, KProcessor or even Kafka stream processor is more relevant. On Jul 30, 2015, at 2:09 PM, Martin Kleppmann mar...@kleppmann.com wrote: I'm with Sriram -- Kafka is all about streams already (or topics, to be precise, but we're calling it stream

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-29 Thread Neha Narkhede
Prefer something that evokes stream processing on top of Kafka. And since I've heard many people conflate streaming with streaming video (I know, duh!), I'd vote for Kafka Streams or a maybe KStream. Thanks, Neha On Wed, Jul 29, 2015 at 6:08 PM, Jay Kreps j...@confluent.io wrote: Also, the

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-29 Thread Gwen Shapira
Since it sounds like it is not a separate framework (like CopyCat) but rather a new client, it will be nice to follow existing convention. Producer, Consumer and Processor (or Transformer) make sense to me. Note that the way the API is currently described, people may want to use it inside Spark

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-29 Thread Jay Kreps
Also, the most important part of any prototype, we should have a name for this producing-consumer-thingamgigy: Various ideas: - Kafka Streams - KStream - Kafka Streaming - The Processor API - Metamorphosis - Transformer API - Verwandlung For my part I think what people are trying to do is stream

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-28 Thread Guozhang Wang
I have updated the wiki page incorporating people's comments, please feel free to take another look before today's meeting. On Mon, Jul 27, 2015 at 11:19 PM, Yi Pan nickpa...@gmail.com wrote: Hi, Jay, {quote} 1. Yeah we are going to try to generalize the partition management stuff. We'll

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-28 Thread Jay Kreps
The second question that came up on the KIP was how do joins and aggregations work. A lot of implicit thinking went into Kafka's data model to support stream processing so there is an idea of how this should work but it isn't exactly obvious. Let me go through the idea of how a processor is meant

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-28 Thread Jay Kreps
Here is the link to the original prototype we started with. I wouldn't focus to heavily on the details of this code or the api, but I think it gives the an idea of the lowest level api, amount of code, etc. It was basically a clone of Samza built on Kafka using the new consumer protocol just to

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-28 Thread Yi Pan
Hi, Aditya, {quote} - The KIP states that cmd line tools will be provided to deploy as a separate service. Is the proposed scope limited to providing a library with which makes it possible build stream-processing-as- a-service or provide such a service within Kafka itself? {quote} There has

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-28 Thread Neha Narkhede
Adi, How far away are we from having something a prototype patch to play with? We are working to share a prototype next week. Though the code will evolve to match the APIs and design as it shapes up, but it will be great if people can take a look and provide feedback. Couple of observations:

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-28 Thread Yi Pan
Hi, Neha, {quote} We do hope to include a DSL since that is the most natural way of expressing stream processing operations on top of the processor client. The DSL layer should be equivalent to that provided by Spark streaming or Flink in terms of expressiveness though there will be differences

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-28 Thread Guozhang Wang
Hi Adi, Just to clarify, the cmdline tool would be used, as stated in the wiki page, to run the client library as a process, which is still far away from a service. It is just like what we have for kafka-console-producer, kafka-console-consumer, kafka-mirror-maker, etc today. Guozhang On Mon,

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-28 Thread Yi Pan
Hi, Jay, {quote} 1. Yeah we are going to try to generalize the partition management stuff. We'll get a wiki/JIRA up for that. I think that gives what you want in terms of moving partitioning to the client side. {quote} Great! I am looking forward to that. {quote} I think the key observation is

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-28 Thread Gwen Shapira
Thanks Guazhang! Much clearer now, at least for me. Few comments / questions: 1. Perhaps punctuate(int numRecords) will be a nice API addition, some use-cases have record-count based windows, rather than time-based.. 2. The diagram for Flexible partition distribution shows two joins. Is the idea

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-27 Thread Aditya Auradkar
+1 on comparison with existing solutions. On a high level, it seems nice to have a transform library inside Kafka.. a lot of the building blocks are already there to build a stream processing framework. However the details are tricky to get right I think this discussion will get a lot more

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-27 Thread Neha Narkhede
Gwen, We have a compilation of notes from comparison with other systems. They might be missing details that folks who worked on that system might be able to point out. We can share that and discuss further on the KIP call. We do hope to include a DSL since that is the most natural way of

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-27 Thread Jay Kreps
Hey Yi, Great points. I think for some of this the most useful thing would be to get a wip prototype out that we could discuss concretely. I think Yasuhiro and Guozhang took that prototype I had done, and had some improvements. Give us a bit to get that into understandable shape so we can

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-27 Thread Gwen Shapira
Hi, Since we will be discussing KIP-28 in the call tomorrow, can you update the KIP with the feature-comparison with existing solutions? I admit that I do not see a need for single-event-producer-consumer pair (AKA Flume Interceptor). I've seen tons of people implement such apps in the last

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-27 Thread Yi Pan
Hi, Jay and all, Thanks for all your quick responses. I tried to summarize my thoughts here: - ConsumerRecord as stream processor API: * This KafkaProcessor API is targeted to receive the message from Kafka. So, to Yasuhiro's join/transformation example, any join/transformation results that

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Jiangjie Qin
Hey Guozhang, I just took a quick look at the KIP, is it very similar to mirror maker with message handler? Thanks, Jiangjie (Becket) Qin On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava e...@confluent.io wrote: Just some notes on the KIP doc itself: * It'd be useful to clarify at

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Neha Narkhede
Ewen: * I think trivial filtering and aggregation on a single stream usually work fine with this model. The way I see this, the process() API is an abstraction for message-at-a-time computations. In the future, you could imagine providing a simple DSL layer on top of the process() API that

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Yi Pan
Hi, Guozhang, Thanks for starting this. I took a quick look and had the following thoughts to share: - In the proposed KafkaProcessor API, there is no interface like Collector that allows users to send messages to. Why is that? Is the idea to initialize the producer once and re-use it in the

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Yasuhiro Matsuda
Jay, I understand that. Context can provide more information without breaking the compatibility if needed. Also I am not sure ConsumerRecord is the right abstraction of data for stream processing. After transformation or join, what is the topic and the offset? It is odd to use ConsumerRecord. We

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Guozhang Wang
Hi Jiangjie, Inlined. On Thu, Jul 23, 2015 at 11:32 PM, Jiangjie Qin j...@linkedin.com.invalid wrote: Hey Guozhang, I just took a quick look at the KIP, is it very similar to mirror maker with message handler? I think the processor client would supporting a superset of functionalities

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Yasuhiro Matsuda
The goal of this KIP is to provide a lightweight/embeddable streaming framework, and allows Kafka users to start using stream processing easily. DSL is not covered in this KIP. But, DSL is a very attractive option to have. In the proposed KafkaProcessor API, there is no interface like Collector

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Jay Kreps
Hey Yi, For your other two points: - This definitely doesn't cover any kind of SQL or anything like this. - The prototype we started with just had process() as a method but Yasuhiro had some ideas of adding additional filter/aggregate convenience methods. We should discuss how this would fit

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Guozhang Wang
Hi Yi, Inlined. On Fri, Jul 24, 2015 at 12:57 AM, Yi Pan nickpa...@gmail.com wrote: Hi, Guozhang, Thanks for starting this. I took a quick look and had the following thoughts to share: - In the proposed KafkaProcessor API, there is no interface like Collector that allows users to send

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Ewen Cheslack-Postava
@guozhang re: library vs framework - Nothing critical missing. But I think KIPs will serve as a sort of changelog since ones like this https://archive.apache.org/dist/kafka/0.8.2.0/RELEASE_NOTES.html are not all that helpful to end users. A KIP like this is a major new addition, so laying out all

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Guozhang Wang
Hi Ewen, Replies inlined. On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava e...@confluent.io wrote: Just some notes on the KIP doc itself: * It'd be useful to clarify at what point the plain consumer + custom code + producer breaks down. I think trivial filtering and aggregation on a

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Neha Narkhede
Agree that the normal KIP process is awkward for larger changes like this. I'm a +1 on trying out this new process for the processor client, see how it works out and then make that a process for future large changes of this nature. On Fri, Jul 24, 2015 at 10:03 AM, Jay Kreps j...@confluent.io

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Jay Kreps
To follow on to one of Yi's points about taking ConsumerRecord vs topic/key/value. One thing we have found is that for user-facing APIs considering future API evolution is really important. If you do topic/key/value and then realize you need offset added you end up having to break everyones code.

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Ismael Juma
On 24 Jul 2015 18:03, Jay Kreps j...@confluent.io wrote: Does this make sense to people? If so let's try it and if we like it better we can formally make that the process for this kind of big thing. Yes, sounds good to me. Best, Ismael

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-23 Thread Ewen Cheslack-Postava
Just some notes on the KIP doc itself: * It'd be useful to clarify at what point the plain consumer + custom code + producer breaks down. I think trivial filtering and aggregation on a single stream usually work fine with this model. Anything where you need more complex joins, windowing, etc. are

[DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-23 Thread Guozhang Wang
Hi all, I just posted KIP-28: Add a transform client for data processing https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing . The wiki page does not yet have the full design / implementation details, and this email is to kick-off the