Folks,
I would like to revive this thread on KIP-28: I have just updated the patch,
rebased on the latest trunk, incorporating the feedback collected so far:
https://github.com/apache/kafka/pull/130
And the wiki page for this KIP has also been updated with the API and
architectural designs:
Hi Guozhang,
Thank you for writing the KIP-28 up. (Hope this is the right thread for me to
post some comments. :)
I still have some confusion about the implementation of the Processor:
1. why do we maintain a separate consumer and producer for each worker thread?
— from my understanding,
Jiangjie,
Thanks for the explanation, now I understand the scenario. It is a case of
CEP (complex event processing) in stream processing, in which I think the
local state should be used for some sort of pattern matching. More concretely,
let's say in this case we have a local state storing what has been observed. Then
Guozhang,
By interleaved groups of message, I meant something like this: Say we have
message 0,1,2,3, message 0 and 2 together completes a business logic,
message 1 and 3 together completes a business logic. In that case, after
a user has processed message 2, they cannot commit offsets because if they
Hello folks,
I have updated the KIP page with some detailed API / architecture /
packaging proposals, along with the long-promised first patch in PR:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+processor+client
https://github.com/apache/kafka/pull/130
Any feedback /
Hi Jiangjie,
Not sure I understand the question: what if users have interleaved groups of
messages, where each group completes a piece of logic? Could you elaborate a bit?
About the committing functionality, it currently will only commit up to the
processed message's offset; the commit() call itself actually does
Hi Jun,
1. I have removed the streamTime in punctuate() since it is not only
triggered by clock time, detailed explanation can be found here:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+processor+client#KIP-28-Addaprocessorclient-StreamTime
2. Yes, if users do not schedule
A few questions/comments.
1. What's streamTime passed to punctuate()? Is that just the current time?
2. Is punctuate() only called if schedule() is called?
3. The way the KeyValueStore is created seems a bit weird. Since this is
part of the internal state managed by KafkaProcessorContext, it
I think the abstraction of processor would be useful. It is not quite clear
to me yet though which grid in the following API analysis chart this
processor is trying to satisfy.
https://cwiki.apache.org/confluence/display/KAFKA/New+consumer+API+change+proposal
For example, in the current proposal, it
Just a quick ping, that regardless of the name of the thing, I'm still
interested in answers to my questions :)
On Tue, Jul 28, 2015 at 3:07 PM, Gwen Shapira gshap...@cloudera.com wrote:
Thanks Guozhang! Much clearer now, at least for me.
Few comments / questions:
1. Perhaps punctuate(int
I agree with Sriram and Martin. Kafka is already about providing streams of
data, and so Kafka Streams or anything like that is confusing to me.
This new library is about making it easier to process the data.
-James
On Jul 30, 2015, at 9:38 AM, Aditya Auradkar aaurad...@linkedin.com.INVALID
I'm with Sriram -- Kafka is all about streams already (or topics, to be
precise, but we're calling it stream processing not topic processing), so I
find Kafka Streams, KStream and Kafka Streaming all confusing, since they
seem to imply that other bits of Kafka are not about streams.
I would
Personally, I prefer KafkaStreams just because it sounds nicer. For the
reasons identified above, KafkaProcessor or KProcessor is more apt but
sounds less catchy (IMO). I also think we should prefix with Kafka (rather
than K) because we will then have 3 clients: KafkaProducer, KafkaConsumer
and
I would vote for KStream as it sounds sexier (is it only me??), second to
that would be Kafka Streaming.
On Wed, Jul 29, 2015 at 6:08 PM, Jay Kreps j...@confluent.io wrote:
Also, the most important part of any prototype, we should have a name for
this producing-consumer-thingamajiggy:
Various
I think it's also a matter of intent. If we see it as yet another
client library, then Processor (to match Producer and Consumer) will
work great.
If we see it is a stream processing framework, the name has to start
with S to follow existing convention.
Speaking of naming conventions:
You know how
I had the same thought. Kafka processor, KProcessor or even Kafka
stream processor is more relevant.
On Jul 30, 2015, at 2:09 PM, Martin Kleppmann mar...@kleppmann.com wrote:
I'm with Sriram -- Kafka is all about streams already (or topics, to be
precise, but we're calling it stream
Prefer something that evokes stream processing on top of Kafka. And since
I've heard many people conflate streaming with streaming video (I know,
duh!), I'd vote for Kafka Streams or a maybe KStream.
Thanks,
Neha
On Wed, Jul 29, 2015 at 6:08 PM, Jay Kreps j...@confluent.io wrote:
Also, the
Since it sounds like it is not a separate framework (like CopyCat) but
rather a new client, it will be nice to follow existing convention.
Producer, Consumer and Processor (or Transformer) make sense to me.
Note that the way the API is currently described, people may want to
use it inside Spark
Also, the most important part of any prototype, we should have a name for
this producing-consumer-thingamajiggy:
Various ideas:
- Kafka Streams
- KStream
- Kafka Streaming
- The Processor API
- Metamorphosis
- Transformer API
- Verwandlung
For my part I think what people are trying to do is stream
I have updated the wiki page incorporating people's comments, please feel
free to take another look before today's meeting.
On Mon, Jul 27, 2015 at 11:19 PM, Yi Pan nickpa...@gmail.com wrote:
Hi, Jay,
{quote}
1. Yeah we are going to try to generalize the partition management stuff.
We'll
The second question that came up on the KIP was how do joins and
aggregations work. A lot of implicit thinking went into Kafka's data model
to support stream processing so there is an idea of how this should work
but it isn't exactly obvious. Let me go through the idea of how a processor
is meant
Here is the link to the original prototype we started with. I wouldn't
focus too heavily on the details of this code or the API, but I think it
gives an idea of the lowest-level API, amount of code, etc. It was
basically a clone of Samza built on Kafka using the new consumer protocol
just to
Hi, Aditya,
{quote}
- The KIP states that cmd line tools will be provided to deploy as a
separate service. Is the proposed scope limited to providing a library
which makes it possible to build stream-processing-as-a-service, or to
providing such a service within Kafka itself?
{quote}
There has
Adi,
How far away are we from having a prototype patch to play with?
We are working to share a prototype next week. Though the code will evolve
to match the APIs and design as it shapes up, it will be great if
people can take a look and provide feedback.
Couple of observations:
Hi, Neha,
{quote}
We do hope to include a DSL since that is the most natural way of
expressing stream processing operations on top of the processor client. The
DSL layer should be equivalent to that provided by Spark streaming or Flink
in terms of expressiveness though there will be differences
Hi Adi,
Just to clarify, the cmdline tool would be used, as stated in the wiki
page, to run the client library as a process, which is still far away
from a service. It is just like what we have for kafka-console-producer,
kafka-console-consumer, kafka-mirror-maker, etc today.
Guozhang
On Mon,
Hi, Jay,
{quote}
1. Yeah we are going to try to generalize the partition management stuff.
We'll get a wiki/JIRA up for that. I think that gives what you want in
terms of moving partitioning to the client side.
{quote}
Great! I am looking forward to that.
{quote}
I think the key observation is
Thanks Guozhang! Much clearer now, at least for me.
Few comments / questions:
1. Perhaps punctuate(int numRecords) will be a nice API addition, some
use-cases have record-count based windows, rather than time-based..
2. The diagram for Flexible partition distribution shows two joins.
Is the idea
+1 on comparison with existing solutions. On a high level, it seems nice to
have a transform library inside Kafka.. a lot of the building blocks are
already there to build a stream processing framework. However, the details
are tricky to get right. I think this discussion will get a lot more
Gwen,
We have a compilation of notes from comparison with other systems. They
might be missing details that folks who worked on that system might be able
to point out. We can share that and discuss further on the KIP call.
We do hope to include a DSL since that is the most natural way of
Hey Yi,
Great points. I think for some of this the most useful thing would be to
get a wip prototype out that we could discuss concretely. I think Yasuhiro
and Guozhang took that prototype I had done, and had some improvements.
Give us a bit to get that into understandable shape so we can
Hi,
Since we will be discussing KIP-28 in the call tomorrow, can you
update the KIP with the feature-comparison with existing solutions?
I admit that I do not see a need for a single-event producer-consumer
pair (AKA Flume Interceptor). I've seen tons of people implement such
apps in the last
Hi, Jay and all,
Thanks for all your quick responses. I tried to summarize my thoughts here:
- ConsumerRecord as stream processor API:
* This KafkaProcessor API is targeted to receive the message from Kafka.
So, to Yasuhiro's join/transformation example, any join/transformation
results that
Hey Guozhang,
I just took a quick look at the KIP; is it very similar to mirror maker
with a message handler?
Thanks,
Jiangjie (Becket) Qin
On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava e...@confluent.io
wrote:
Just some notes on the KIP doc itself:
* It'd be useful to clarify at
Ewen:
* I think trivial filtering and aggregation on a single stream usually work
fine with this model.
The way I see this, the process() API is an abstraction for
message-at-a-time computations. In the future, you could imagine providing
a simple DSL layer on top of the process() API that
Hi, Guozhang,
Thanks for starting this. I took a quick look and had the following
thoughts to share:
- In the proposed KafkaProcessor API, there is no interface like Collector
that users can send messages to. Why is that? Is the idea to
initialize the producer once and re-use it in the
Jay, I understand that. Context can provide more information without
breaking the compatibility if needed. Also I am not sure ConsumerRecord is
the right abstraction of data for stream processing. After transformation
or join, what is the topic and the offset? It is odd to use ConsumerRecord.
We
Hi Jiangjie,
Inlined.
On Thu, Jul 23, 2015 at 11:32 PM, Jiangjie Qin j...@linkedin.com.invalid
wrote:
Hey Guozhang,
I just took a quick look at the KIP; is it very similar to mirror maker
with a message handler?
I think the processor client would support a superset of functionality
The goal of this KIP is to provide a lightweight/embeddable streaming
framework and allow Kafka users to start using stream processing easily. A DSL
is not covered in this KIP, but it is a very attractive option to have.
In the proposed KafkaProcessor API, there is no interface like Collector
Hey Yi,
For your other two points:
- This definitely doesn't cover any kind of SQL or anything like this.
- The prototype we started with just had process() as a method but Yasuhiro
had some ideas of adding additional filter/aggregate convenience methods.
We should discuss how this would fit
Hi Yi,
Inlined.
On Fri, Jul 24, 2015 at 12:57 AM, Yi Pan nickpa...@gmail.com wrote:
Hi, Guozhang,
Thanks for starting this. I took a quick look and had the following
thoughts to share:
- In the proposed KafkaProcessor API, there is no interface like Collector
that allows users to send
@guozhang re: library vs framework - Nothing critical missing. But I think
KIPs will serve as a sort of changelog since ones like this
https://archive.apache.org/dist/kafka/0.8.2.0/RELEASE_NOTES.html are not
all that helpful to end users. A KIP like this is a major new addition, so
laying out all
Hi Ewen,
Replies inlined.
On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava e...@confluent.io
wrote:
Just some notes on the KIP doc itself:
* It'd be useful to clarify at what point the plain consumer + custom code
+ producer breaks down. I think trivial filtering and aggregation on a
Agree that the normal KIP process is awkward for larger changes like this.
I'm a +1 on trying out this new process for the processor client, see how
it works out and then make that a process for future large changes of this
nature.
On Fri, Jul 24, 2015 at 10:03 AM, Jay Kreps j...@confluent.io
To follow on to one of Yi's points about taking ConsumerRecord vs
topic/key/value. One thing we have found is that for user-facing APIs
considering future API evolution is really important. If you do
topic/key/value and then realize you need offset added, you end up having to
break everyone's code.
On 24 Jul 2015 18:03, Jay Kreps j...@confluent.io wrote:
Does this make sense to people? If so let's try it and if we like it
better
we can formally make that the process for this kind of big thing.
Yes, sounds good to me.
Best,
Ismael
Just some notes on the KIP doc itself:
* It'd be useful to clarify at what point the plain consumer + custom code
+ producer breaks down. I think trivial filtering and aggregation on a
single stream usually work fine with this model. Anything where you need
more complex joins, windowing, etc. are
Hi all,
I just posted KIP-28: Add a transform client for data processing
https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
.
The wiki page does not yet have the full design / implementation details,
and this email is to kick-off the