Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-09-10 Thread Guozhang Wang
Folks,

I would like to revive this thread on KIP-28: I have just updated the patch,
rebased on the latest trunk, incorporating the feedback collected so far:

https://github.com/apache/kafka/pull/130

And the wiki page for this KIP has also been updated with the API and
architectural designs:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+processor+client

Would love to hear your thoughts or questions.

Guozhang


On Tue, Aug 11, 2015 at 10:50 AM, Guozhang Wang  wrote:

> Jiangjie,
>
> Thanks for the explanation, now I understand the scenario. It is one of
> the CEP (complex event processing) cases in stream processing, in which I
> think the local state should be used for some sort of pattern matching.
> More concretely, let's say in this case we have a local state storing what
> has been observed. Then the sequence would be:
>
> T0: local state {}
> T1: message 0,  local state {0}
> T2: message 1,  local state {0, 1}
> T3: message 2,  local state {1}, matching 0 and 2, output some result
> and remove 0/2 from local state.
> T4: message 3,  local state {}, matching 1 and 3, output some result
> and remove 1/3 from local state.
>
> Let's say the user calls commit at T2: it will commit the offset at message
> 2 (the next message to consume) as well as the local state {0, 1}; then
> upon failure recovery, it can restore the state along with the committed
> offsets and continue.
>
> More generally, the current design of the processor lets users specify
> their subscribed topics before starting the process; users will not change
> topic subscriptions on the fly, and they will not be committing at
> arbitrary offsets. The rationale behind this is to abstract the producer /
> consumer details away from processor developers as much as possible, i.e.
> if users do not want to, they should not be exposed to message offsets /
> partition ids / topic names etc. For most cases, the subscribed topics can
> be specified before starting the processing job, so we let users specify
> them once and then focus on the computational logic when implementing the
> process function.
>
> Guozhang
>
>
> On Tue, Aug 11, 2015 at 10:26 AM, Jiangjie Qin 
> wrote:
>
>> Guozhang,
>>
>> By interleaved groups of messages, I meant something like this: Say we have
>> messages 0, 1, 2, 3, where messages 0 and 2 together complete a piece of
>> business logic, and messages 1 and 3 together complete another. In that
>> case, after users have processed message 2, they cannot commit offsets,
>> because if they crash before processing message 3, message 1 will not be
>> reconsumed. That means it is possible that users are not able to find a
>> point where the current state is safe to commit.
>>
>> This is one example from the use-case space table. It is still not clear to
>> me which use cases in that table KIP-28 wants to cover. Are we only
>> covering the case of a static topic stream with semi-auto commit, i.e.
>> users cannot change topic subscriptions on the fly and can only commit the
>> current offset?
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>> On Mon, Aug 10, 2015 at 6:57 PM, Guozhang Wang 
>> wrote:
>>
>> > Hello folks,
>> >
>> > I have updated the KIP page with some detailed API / architecture /
>> > packaging proposals, along with the long promised first patch in PR:
>> >
>> >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+processor+client
>> >
>> > https://github.com/apache/kafka/pull/130
>> >
>> >
>> > Any feedback / comments are more than welcome.
>> >
>> > Guozhang
>> >
>> >
>> > On Mon, Aug 10, 2015 at 6:55 PM, Guozhang Wang 
>> wrote:
>> >
>> > > Hi Jun,
>> > >
>> > > 1. I have removed the streamTime in punctuate() since it is not only
>> > > triggered by clock time, detailed explanation can be found here:
>> > >
>> > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+processor+client#KIP-28-Addaprocessorclient-StreamTime
>> > >
>> > > 2. Yes, if users do not schedule a task, then punctuate will never
>> fire.
>> > >
>> > > 3. Yes, I agree. The reason it was implemented in this way is that the
>> > > state store registration call is triggered by the users. However I
>> think
>> > it
>> > > is doable to change that API so that it will be more natural to have
>> sth.
>> > > like:
>> > >
>> > > context.createStore(store-name, store-type).
>> > >
>> > > Guozhang
>> > >
>> > > On Tue, Aug 4, 2015 at 9:17 AM, Jun Rao  wrote:
>> > >
>> > >> A few questions/comments.
>> > >>
>> > >> 1. What's streamTime passed to punctuate()? Is that just the current
>> > time?
>> > >> 2. Is punctuate() only called if schedule() is called?
>> > >> 3. The way the KeyValueStore is created seems a bit weird. Since
>> this is
>> > >> part of the internal state managed by KafkaProcessorContext, it seems
>> > >> there
>> > >> should be an api to create the KeyValueStore from
>> KafkaProcessorContext,

[DISCUSS] KIP-28 - Add a transform client for data processing

2015-08-19 Thread Yan Fang
Hi Guozhang,

Thank you for writing up KIP-28. Hope this is the right thread for me to
post some comments. :)

I still have some confusion about the implementation of the Processor:

1. Why do we maintain a separate consumer and producer for each worker thread?
— From my understanding, the new consumer API will be able to fetch specific
topic-partitions. Is one consumer, shared among the worker threads, enough for
one Kafka process? The same question for the producer: is one producer enough
for sending messages out to the brokers? Would that give better performance?

2. How is the “Stream Synchronization” achieved?
— You talked about “pause” and “notify” on the consumer, but this is still not
very clear to me. If a worker thread has group_1 {topicA-0, topicB-0} and
group_2 {topicA-1, topicB-1}, and topicB is much slower, how can we pause the
consumer to sync topicA and topicB if there is only one consumer? (A rough
sketch of one possibility is included after these questions.)

3. How does the partition timestamp monotonically increase?
— “When the lowest timestamp corresponding record gets processed by the
thread, the partition time possibly gets advanced.” How does the “gets
advanced” part work? Do we pick up another “lowest message timestamp value”?
Doing that may not actually yield an “advanced” timestamp.

4. Thoughts about the local state management.
— From the description, I think there is one kv store per partition group.
That means if one worker thread is assigned more than one partition group, it
will have more than one kv-store connection. How can we avoid mis-operation?
One partition group could easily write to another partition group’s kv store,
since they live in the same thread.

5. Do we plan to implement throttling?
— Since we are “forwarding” messages, it is very possible that an upstream
processor is much faster than the downstream processor. How do we plan to
deal with this?

6. How does the parallelism work?
— Do we achieve this simply by adding more threads, or do we plan to have a
mechanism that can deploy different threads to different machines? It is easy
to imagine deploying different processors to different machines, but what
about the worker threads? And how does fault tolerance work then? Maybe this
is out of scope for the KIP?
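
A rough sketch of one way the single-consumer synchronization asked about in
question 2 could work, using the new consumer's pause()/resume() calls. The
client calls are real (signatures as in later releases of the new consumer);
the loop structure and time tracking are assumptions of mine, not the KIP's
actual design:

    import java.util.Collections;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class SyncSketch {
        // Pause the faster partition once it runs too far ahead of the
        // slower one, so one consumer can keep topicA and topicB in sync.
        static void pollOnce(KafkaConsumer<byte[], byte[]> consumer,
                             TopicPartition fast,
                             long fastTime, long slowTime, long maxSkewMs) {
            if (fastTime - slowTime > maxSkewMs) {
                consumer.pause(Collections.singleton(fast));
            } else {
                consumer.resume(Collections.singleton(fast));
            }
            ConsumerRecords<byte[], byte[]> records = consumer.poll(100);
            // ... dispatch 'records' to the worker threads and update
            // fastTime / slowTime accordingly ...
        }
    }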

Two nits in the KIP-28 doc:

1. The “close” method is missing from interface Processor<K1,V1,K2,V2>, while
we have the “override close()” in KafkaProcessor.

2. “punctuate” does not accept a parameter, while StatefulProcessJob has a
punctuate method that does.

Thanks,
Yan


Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-08-11 Thread Guozhang Wang
Jiangjie,

Thanks for the explanation, now I understand the scenario. It is one of
the CEP (complex event processing) cases in stream processing, in which I
think the local state should be used for some sort of pattern matching. More
concretely, let's say in this case we have a local state storing what has
been observed. Then the sequence would be:

T0: local state {}
T1: message 0,  local state {0}
T2: message 1,  local state {0, 1}
T3: message 2,  local state {1}, matching 0 and 2, output some result
and remove 0/2 from local state.
T4: message 3,  local state {}, matching 1 and 3, output some result
and remove 1/3 from local state.

Let's say the user calls commit at T2: it will commit the offset at message 2
(the next message to consume) as well as the local state {0, 1}; then upon
failure recovery, it can restore the state along with the committed offsets
and continue.
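
To make the sequence concrete, here is a minimal, hypothetical sketch of such
a pattern-matching processor. The interface is a stand-in (the KIP-28 API is
still under discussion), and the matching rule (0 pairs with 2, 1 with 3) is
supplied by the caller:

    import java.util.HashSet;
    import java.util.Set;

    public class PairMatchingSketch {
        // Local state: ids observed but not yet matched -- the {0}, {0, 1},
        // {1}, {} sets in the timeline above.
        private final Set<Integer> pending = new HashSet<>();

        // Called once per consumed message; 'partnerId' encodes the rule
        // that, e.g., message 2 completes the logic started by message 0.
        public void process(int id, int partnerId) {
            if (pending.remove(partnerId)) {
                System.out.printf("matched %d and %d, emitting result%n",
                                  partnerId, id);
            } else {
                pending.add(id);
            }
        }

        // commit() would persist 'pending' together with the consumed
        // offsets (e.g. via a changelog topic), so that recovery restores
        // both consistently -- the T2 commit described above.
        public void commit() { }
    }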

More generally, the current design of the processor lets users specify
their subscribed topics before starting the process; users will not change
topic subscriptions on the fly, and they will not be committing at arbitrary
offsets. The rationale behind this is to abstract the producer / consumer
details away from processor developers as much as possible, i.e. if users do
not want to, they should not be exposed to message offsets / partition ids /
topic names etc. For most cases, the subscribed topics can be specified
before starting the processing job, so we let users specify them once and
then focus on the computational logic when implementing the process function.

Guozhang


On Tue, Aug 11, 2015 at 10:26 AM, Jiangjie Qin j...@linkedin.com.invalid
wrote:

 Guozhang,

 By interleaved groups of messages, I meant something like this: Say we have
 messages 0, 1, 2, 3, where messages 0 and 2 together complete a piece of
 business logic, and messages 1 and 3 together complete another. In that
 case, after users have processed message 2, they cannot commit offsets,
 because if they crash before processing message 3, message 1 will not be
 reconsumed. That means it is possible that users are not able to find a
 point where the current state is safe to commit.

 This is one example from the use-case space table. It is still not clear to
 me which use cases in that table KIP-28 wants to cover. Are we only covering
 the case of a static topic stream with semi-auto commit, i.e. users cannot
 change topic subscriptions on the fly and can only commit the current
 offset?

 Thanks,

 Jiangjie (Becket) Qin

 On Mon, Aug 10, 2015 at 6:57 PM, Guozhang Wang wangg...@gmail.com wrote:

  Hello folks,
 
  I have updated the KIP page with some detailed API / architecture /
  packaging proposals, along with the long promised first patch in PR:
 
 
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+processor+client
 
  https://github.com/apache/kafka/pull/130
 
 
   Any feedback / comments are more than welcome.
 
  Guozhang
 
 
  On Mon, Aug 10, 2015 at 6:55 PM, Guozhang Wang wangg...@gmail.com
 wrote:
 
   Hi Jun,
  
   1. I have removed the streamTime in punctuate() since it is not only
   triggered by clock time, detailed explanation can be found here:
  
  
  
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+processor+client#KIP-28-Addaprocessorclient-StreamTime
  
   2. Yes, if users do not schedule a task, then punctuate will never
 fire.
  
   3. Yes, I agree. The reason it was implemented in this way is that the
   state store registration call is triggered by the users. However I
 think
  it
   is doable to change that API so that it will be more natural to have
 sth.
   like:
  
   context.createStore(store-name, store-type).
  
   Guozhang
  
   On Tue, Aug 4, 2015 at 9:17 AM, Jun Rao j...@confluent.io wrote:
  
   A few questions/comments.
  
   1. What's streamTime passed to punctuate()? Is that just the current
  time?
   2. Is punctuate() only called if schedule() is called?
   3. The way the KeyValueStore is created seems a bit weird. Since this
 is
   part of the internal state managed by KafkaProcessorContext, it seems
   there
   should be an api to create the KeyValueStore from
 KafkaProcessorContext,
   instead of passing context to the constructor of KeyValueStore?
  
   Thanks,
  
   Jun
  
   On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com
   wrote:
  
Hi all,
   
I just posted KIP-28: Add a transform client for data processing

   
  
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing

.
   
The wiki page does not yet have the full design / implementation
   details,
and this email is to kick-off the conversation on whether we should
  add
this new client with the described motivations, and if yes what
   features /
functionalities should be included.
   
Looking forward to your feedback!
   
-- Guozhang
   
  
  
  
  
   --
   -- Guozhang
  
 
 
 
  --
  -- Guozhang
 




-- 
-- Guozhang


Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-08-11 Thread Jiangjie Qin
Guozhang,

By interleaved groups of messages, I meant something like this: Say we have
messages 0, 1, 2, 3, where messages 0 and 2 together complete a piece of
business logic, and messages 1 and 3 together complete another. In that case,
after users have processed message 2, they cannot commit offsets, because if
they crash before processing message 3, message 1 will not be reconsumed.
That means it is possible that users are not able to find a point where the
current state is safe to commit.

This is one example from the use-case space table. It is still not clear to
me which use cases in that table KIP-28 wants to cover. Are we only covering
the case of a static topic stream with semi-auto commit, i.e. users cannot
change topic subscriptions on the fly and can only commit the current offset?

Thanks,

Jiangjie (Becket) Qin

On Mon, Aug 10, 2015 at 6:57 PM, Guozhang Wang wangg...@gmail.com wrote:

 Hello folks,

 I have updated the KIP page with some detailed API / architecture /
 packaging proposals, along with the long promised first patch in PR:


 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+processor+client

 https://github.com/apache/kafka/pull/130


 Any feedback / comments are more than welcome.

 Guozhang


 On Mon, Aug 10, 2015 at 6:55 PM, Guozhang Wang wangg...@gmail.com wrote:

  Hi Jun,
 
  1. I have removed the streamTime in punctuate() since it is not only
  triggered by clock time, detailed explanation can be found here:
 
 
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+processor+client#KIP-28-Addaprocessorclient-StreamTime
 
  2. Yes, if users do not schedule a task, then punctuate will never fire.
 
  3. Yes, I agree. The reason it was implemented in this way is that the
  state store registration call is triggered by the users. However I think
 it
  is doable to change that API so that it will be more natural to have sth.
  like:
 
  context.createStore(store-name, store-type).
 
  Guozhang
 
  On Tue, Aug 4, 2015 at 9:17 AM, Jun Rao j...@confluent.io wrote:
 
  A few questions/comments.
 
  1. What's streamTime passed to punctuate()? Is that just the current
 time?
  2. Is punctuate() only called if schedule() is called?
  3. The way the KeyValueStore is created seems a bit weird. Since this is
  part of the internal state managed by KafkaProcessorContext, it seems
  there
  should be an api to create the KeyValueStore from KafkaProcessorContext,
  instead of passing context to the constructor of KeyValueStore?
 
  Thanks,
 
  Jun
 
  On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com
  wrote:
 
   Hi all,
  
   I just posted KIP-28: Add a transform client for data processing
   
  
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
   
   .
  
   The wiki page does not yet have the full design / implementation
  details,
   and this email is to kick-off the conversation on whether we should
 add
   this new client with the described motivations, and if yes what
  features /
   functionalities should be included.
  
   Looking forward to your feedback!
  
   -- Guozhang
  
 
 
 
 
  --
  -- Guozhang
 



 --
 -- Guozhang



Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-08-10 Thread Guozhang Wang
Hello folks,

I have updated the KIP page with some detailed API / architecture /
packaging proposals, along with the long promised first patch in PR:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+processor+client

https://github.com/apache/kafka/pull/130


Any feedback / comments are more than welcome.

Guozhang


On Mon, Aug 10, 2015 at 6:55 PM, Guozhang Wang wangg...@gmail.com wrote:

 Hi Jun,

 1. I have removed the streamTime in punctuate() since it is not only
 triggered by clock time, detailed explanation can be found here:


 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+processor+client#KIP-28-Addaprocessorclient-StreamTime

 2. Yes, if users do not schedule a task, then punctuate will never fire.

 3. Yes, I agree. The reason it was implemented in this way is that the
 state store registration call is triggered by the users. However I think it
 is doable to change that API so that it will be more natural to have sth.
 like:

 context.createStore(store-name, store-type).

 Guozhang

 On Tue, Aug 4, 2015 at 9:17 AM, Jun Rao j...@confluent.io wrote:

 A few questions/comments.

 1. What's streamTime passed to punctuate()? Is that just the current time?
 2. Is punctuate() only called if schedule() is called?
 3. The way the KeyValueStore is created seems a bit weird. Since this is
 part of the internal state managed by KafkaProcessorContext, it seems
 there
 should be an api to create the KeyValueStore from KafkaProcessorContext,
 instead of passing context to the constructor of KeyValueStore?

 Thanks,

 Jun

 On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com
 wrote:

  Hi all,
 
  I just posted KIP-28: Add a transform client for data processing
  
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
  
  .
 
  The wiki page does not yet have the full design / implementation
 details,
  and this email is to kick-off the conversation on whether we should add
  this new client with the described motivations, and if yes what
 features /
  functionalities should be included.
 
  Looking forward to your feedback!
 
  -- Guozhang
 




 --
 -- Guozhang




-- 
-- Guozhang


Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-08-10 Thread Guozhang Wang
Hi Jiangjie,

I am not sure I understand the “what if users have interleaved groups of
messages, where each group completes a piece of logic” scenario. Could you
elaborate a bit?

About the committing functionality: it currently will only commit up to the
processed message's offset. The commit() call itself actually does more than
committing the consumer's offsets; it also flushes the local state and the
producer.
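
A sketch of that bundling, with 'stateStore' as a hypothetical stand-in for
the local state store. producer.flush() and consumer.commitSync() are the
standard client calls; the ordering here is my reading of the description
above, not a confirmed design:

    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;

    class CommitSketch {
        interface LocalStore { void flush(); }  // hypothetical store handle

        KafkaProducer<byte[], byte[]> producer;
        KafkaConsumer<byte[], byte[]> consumer;
        LocalStore stateStore;

        void commit() {
            stateStore.flush();    // 1. flush local state (e.g. changelog)
            producer.flush();      // 2. flush produced output downstream
            consumer.commitSync(); // 3. only then commit consumed offsets
        }
    }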

Guozhang

On Fri, Jul 31, 2015 at 9:20 PM, Jiangjie Qin j...@linkedin.com.invalid
wrote:

 I think the abstraction of a processor would be useful. It is not quite
 clear to me yet, though, which cell of the grid in the following API
 analysis chart this processor is trying to satisfy.


 https://cwiki.apache.org/confluence/display/KAFKA/New+consumer+API+change+proposal

 For example, in the current proposal it looks like users will only be able
 to commit offsets for the last seen message. What if users have interleaved
 groups of messages, where each group completes a piece of logic? In that
 case, users will not have a safe boundary at which to commit offsets.


 Is the processor client only intended to address the static topic data
 stream with semi-auto offset commit (meaning users can only commit the last
 seen message)?

 Jiangjie (Becket) Qin

 On Thu, Jul 30, 2015 at 2:32 PM, James Cheng jch...@tivo.com wrote:

  I agree with Sriram and Martin. Kafka is already about providing streams
  of data, and so Kafka Streams or anything like that is confusing to me.
 
  This new library is about making it easier to process the data.
 
  -James
 
  On Jul 30, 2015, at 9:38 AM, Aditya Auradkar
  aaurad...@linkedin.com.INVALID wrote:
 
   Personally, I prefer KafkaStreams just because it sounds nicer. For the
   reasons identified above, KafkaProcessor or KProcessor is more apt but
   sounds less catchy (IMO). I also think we should prefix with Kafka
  (rather
   than K) because we will then have 3 clients: KafkaProducer,
 KafkaConsumer
   and KafkaProcessor which is very nice and consistent.
  
   Aditya
  
   On Thu, Jul 30, 2015 at 9:17 AM, Gwen Shapira gshap...@cloudera.com
  wrote:
  
    I think it's also a matter of intent. If we see it as yet another
    client library, then Processor (to match Producer and Consumer) will
    work great.
    If we see it as a stream processing framework, the name has to start
    with S to follow existing convention.
  
   Speaking of naming conventions:
   You know how people have stack names for technologies that are usually
   used in tandem? ELK, LAMP, etc.
    The pattern of Kafka -> Stream Processor -> NoSQL Store is super
   common. KSN stack doesn't sound right, though. Maybe while we are
   bikeshedding, someone has ideas in that direction :)
  
   On Thu, Jul 30, 2015 at 2:01 AM, Sriram Subramanian
   srsubraman...@linkedin.com.invalid wrote:
   I had the same thought. Kafka processor, KProcessor or even Kafka
   stream processor is more relevant.
  
  
  
   On Jul 30, 2015, at 2:09 PM, Martin Kleppmann mar...@kleppmann.com
 
   wrote:
  
   I'm with Sriram -- Kafka is all about streams already (or topics, to
  be
   precise, but we're calling it stream processing not topic
  processing),
   so I find Kafka Streams, KStream and Kafka Streaming all
  confusing,
   since they seem to imply that other bits of Kafka are not about
 streams.
  
   I would prefer The Processor API or Kafka Processors or Kafka
   Processing Client or KProcessor, or something along those lines.
  
   On 30 Jul 2015, at 15:07, Guozhang Wang wangg...@gmail.com
 wrote:
  
   I would vote for KStream as it sounds sexier (is it only me??),
  second
   to
   that would be Kafka Streaming.
  
   On Wed, Jul 29, 2015 at 6:08 PM, Jay Kreps j...@confluent.io
  wrote:
  
   Also, the most important part of any prototype, we should have a
  name
   for
   this producing-consumer-thingamgigy:
  
   Various ideas:
   - Kafka Streams
   - KStream
   - Kafka Streaming
   - The Processor API
   - Metamorphosis
   - Transformer API
   - Verwandlung
  
   For my part I think what people are trying to do is stream
  processing
   with
   Kafka so I think something that evokes Kafka and stream processing
  is
   preferable. I like Kafka Streams or Kafka Streaming followed by
   KStream.
  
   Transformer kind of makes me think of the shape-shifting cars.
  
   Metamorphosis is cool and hilarious but since we are kind of
   envisioning
   this as more limited scope thing rather than a massive framework
 in
   its own
   right I actually think it should have a descriptive name rather
  than a
    personality of its own.
  
   Anyhow let the bikeshedding commence.
  
   -Jay
  
  
   On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang 
 wangg...@gmail.com
  
   wrote:
  
   Hi all,
  
   I just posted KIP-28: Add a transform client for data processing
   
  
  
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
   .
  
   The wiki page does not yet have the full design / implementation
   details,
   and this email is 

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-08-10 Thread Guozhang Wang
Hi Jun,

1. I have removed the streamTime in punctuate() since it is not only
triggered by clock time, detailed explanation can be found here:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+processor+client#KIP-28-Addaprocessorclient-StreamTime

2. Yes, if users do not schedule a task, then punctuate will never fire.

3. Yes, I agree. The reason it was implemented in this way is that the
state store registration call is triggered by the users. However I think it
is doable to change that API so that it will be more natural to have sth.
like:

context.createStore(store-name, store-type).
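
A hypothetical sketch of the API shape being discussed. These stand-in
interfaces mimic the proposal (context-created stores in point 3,
schedule-driven punctuate in point 2); they are illustrative, not the actual
KIP-28 API:

    // Stand-in for the store handle the context would hand back.
    interface KeyValueStore<K, V> { V get(K key); void put(K key, V value); }

    // Stand-in for KafkaProcessorContext with the suggested factory method,
    // replacing 'new KeyValueStore(name, context)'.
    interface ProcessorContext {
        <K, V> KeyValueStore<K, V> createStore(String name, String type);
        void schedule(long intervalMs); // without this, punctuate() never fires
    }

    class WordCountSketch {
        private KeyValueStore<String, Long> counts;

        void init(ProcessorContext context) {
            counts = context.createStore("counts", "key-value");
            context.schedule(1000); // request punctuate() roughly every second
        }

        void process(String key, String value) {
            Long old = counts.get(key);
            counts.put(key, old == null ? 1L : old + 1L);
        }

        void punctuate(long time) {
            // periodic work, e.g. forwarding aggregates downstream
        }
    }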

Guozhang

On Tue, Aug 4, 2015 at 9:17 AM, Jun Rao j...@confluent.io wrote:

 A few questions/comments.

 1. What's streamTime passed to punctuate()? Is that just the current time?
 2. Is punctuate() only called if schedule() is called?
 3. The way the KeyValueStore is created seems a bit weird. Since this is
 part of the internal state managed by KafkaProcessorContext, it seems there
 should be an api to create the KeyValueStore from KafkaProcessorContext,
 instead of passing context to the constructor of KeyValueStore?

 Thanks,

 Jun

 On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com wrote:

  Hi all,
 
  I just posted KIP-28: Add a transform client for data processing
  
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
  
  .
 
  The wiki page does not yet have the full design / implementation details,
  and this email is to kick-off the conversation on whether we should add
  this new client with the described motivations, and if yes what features
 /
  functionalities should be included.
 
  Looking forward to your feedback!
 
  -- Guozhang
 




-- 
-- Guozhang


Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-08-04 Thread Jun Rao
A few questions/comments.

1. What's streamTime passed to punctuate()? Is that just the current time?
2. Is punctuate() only called if schedule() is called?
3. The way the KeyValueStore is created seems a bit weird. Since this is
part of the internal state managed by KafkaProcessorContext, it seems there
should be an api to create the KeyValueStore from KafkaProcessorContext,
instead of passing context to the constructor of KeyValueStore?

Thanks,

Jun

On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com wrote:

 Hi all,

 I just posted KIP-28: Add a transform client for data processing
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
 
 .

 The wiki page does not yet have the full design / implementation details,
 and this email is to kick-off the conversation on whether we should add
 this new client with the described motivations, and if yes what features /
 functionalities should be included.

 Looking forward to your feedback!

 -- Guozhang



Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-31 Thread Jiangjie Qin
I think the abstraction of a processor would be useful. It is not quite clear
to me yet, though, which cell of the grid in the following API analysis chart
this processor is trying to satisfy.

https://cwiki.apache.org/confluence/display/KAFKA/New+consumer+API+change+proposal

For example, in the current proposal it looks like users will only be able to
commit offsets for the last seen message. What if users have interleaved
groups of messages, where each group completes a piece of logic? In that case,
users will not have a safe boundary at which to commit offsets.


Is the processor client only intended to address the static topic data stream
with semi-auto offset commit (meaning users can only commit the last seen
message)?

Jiangjie (Becket) Qin

On Thu, Jul 30, 2015 at 2:32 PM, James Cheng jch...@tivo.com wrote:

 I agree with Sriram and Martin. Kafka is already about providing streams
 of data, and so Kafka Streams or anything like that is confusing to me.

 This new library is about making it easier to process the data.

 -James

 On Jul 30, 2015, at 9:38 AM, Aditya Auradkar
 aaurad...@linkedin.com.INVALID wrote:

  Personally, I prefer KafkaStreams just because it sounds nicer. For the
  reasons identified above, KafkaProcessor or KProcessor is more apt but
  sounds less catchy (IMO). I also think we should prefix with Kafka
 (rather
  than K) because we will then have 3 clients: KafkaProducer, KafkaConsumer
  and KafkaProcessor which is very nice and consistent.
 
  Aditya
 
  On Thu, Jul 30, 2015 at 9:17 AM, Gwen Shapira gshap...@cloudera.com
 wrote:
 
  I think it's also a matter of intent. If we see it as yet another
  client library, then Processor (to match Producer and Consumer) will
  work great.
  If we see it as a stream processing framework, the name has to start
  with S to follow existing convention.
 
  Speaking of naming conventions:
  You know how people have stack names for technologies that are usually
  used in tandem? ELK, LAMP, etc.
  The pattern of Kafka -> Stream Processor -> NoSQL Store is super
  common. KSN stack doesn't sound right, though. Maybe while we are
  bikeshedding, someone has ideas in that direction :)
 
  On Thu, Jul 30, 2015 at 2:01 AM, Sriram Subramanian
  srsubraman...@linkedin.com.invalid wrote:
  I had the same thought. Kafka processor, KProcessor or even Kafka
  stream processor is more relevant.
 
 
 
  On Jul 30, 2015, at 2:09 PM, Martin Kleppmann mar...@kleppmann.com
  wrote:
 
  I'm with Sriram -- Kafka is all about streams already (or topics, to
 be
  precise, but we're calling it stream processing not topic
 processing),
  so I find Kafka Streams, KStream and Kafka Streaming all
 confusing,
  since they seem to imply that other bits of Kafka are not about streams.
 
  I would prefer The Processor API or Kafka Processors or Kafka
  Processing Client or KProcessor, or something along those lines.
 
  On 30 Jul 2015, at 15:07, Guozhang Wang wangg...@gmail.com wrote:
 
  I would vote for KStream as it sounds sexier (is it only me??),
 second
  to
  that would be Kafka Streaming.
 
  On Wed, Jul 29, 2015 at 6:08 PM, Jay Kreps j...@confluent.io
 wrote:
 
  Also, the most important part of any prototype, we should have a
 name
  for
  this producing-consumer-thingamgigy:
 
  Various ideas:
  - Kafka Streams
  - KStream
  - Kafka Streaming
  - The Processor API
  - Metamorphosis
  - Transformer API
  - Verwandlung
 
  For my part I think what people are trying to do is stream
 processing
  with
  Kafka so I think something that evokes Kafka and stream processing
 is
  preferable. I like Kafka Streams or Kafka Streaming followed by
  KStream.
 
  Transformer kind of makes me think of the shape-shifting cars.
 
  Metamorphosis is cool and hilarious but since we are kind of
  envisioning
  this as more limited scope thing rather than a massive framework in
  its own
  right I actually think it should have a descriptive name rather
 than a
   personality of its own.
 
  Anyhow let the bikeshedding commence.
 
  -Jay
 
 
  On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com
 
  wrote:
 
  Hi all,
 
  I just posted KIP-28: Add a transform client for data processing
  
 
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
  .
 
  The wiki page does not yet have the full design / implementation
  details,
  and this email is to kick-off the conversation on whether we should
  add
  this new client with the described motivations, and if yes what
  features
  /
  functionalities should be included.
 
  Looking forward to your feedback!
 
  -- Guozhang
 
 
 
  --
  -- Guozhang
 
 




Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-31 Thread Gwen Shapira
Just a quick ping, that regardless of the name of the thing, I'm still
interested in answers to my questions :)



On Tue, Jul 28, 2015 at 3:07 PM, Gwen Shapira gshap...@cloudera.com wrote:
 Thanks Guozhang! Much clearer now, at least for me.

 Few comments / questions:

 1. Perhaps punctuate(int numRecords) would be a nice API addition; some
 use cases have record-count-based windows rather than time-based ones. (A
 rough sketch of a count-based workaround follows after these questions.)
 2. The diagram for Flexible partition distribution shows two joins.
 Is the idea to implement two Processors and string them together?
 3.  Is the local state persistent? Can you talk a bit about how local
 state works with high availability?
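
A rough sketch of the record-count idea from question 1. This is hypothetical:
the current draft only schedules punctuate by time, so the sketch simply
counts records inside process() as a workaround:

    class CountPunctuateSketch {
        private final int everyN;
        private int seen = 0;

        CountPunctuateSketch(int everyN) { this.everyN = everyN; }

        void process(String key, String value) {
            // ... normal per-record work ...
            if (++seen >= everyN) {  // close a record-count-based window
                punctuate(seen);
                seen = 0;
            }
        }

        void punctuate(int numRecords) {
            // window-closing work, e.g. emit an aggregate downstream
        }
    }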

 Gwen

 On Tue, Jul 28, 2015 at 12:57 AM, Guozhang Wang wangg...@gmail.com wrote:
 I have updated the wiki page incorporating people's comments, please feel
 free to take another look before today's meeting.

 On Mon, Jul 27, 2015 at 11:19 PM, Yi Pan nickpa...@gmail.com wrote:

 Hi, Jay,

 {quote}
 1. Yeah we are going to try to generalize the partition management stuff.
 We'll get a wiki/JIRA up for that. I think that gives what you want in
 terms of moving partitioning to the client side.
 {quote}
 Great! I am looking forward to that.

 {quote}
 I think the key observation is that the whole reason
 LinkedIn split data over clusters to begin with was because of the lack of
 quotas, which are in any case getting implemented.
 {quote}
 I am not sure that I followed this point. Is your point that with quota, it
 is possible to host all data in a single cluster?

 -Yi

 On Mon, Jul 27, 2015 at 8:53 AM, Jay Kreps j...@confluent.io wrote:

  Hey Yi,
 
  Great points. I think for some of this the most useful thing would be to
  get a wip prototype out that we could discuss concretely. I think
 Yasuhiro
  and Guozhang took that prototype I had done, and had some improvements.
  Give us a bit to get that into understandable shape so we can discuss.
 
  To address a few of your other points:
  1. Yeah we are going to try to generalize the partition management stuff.
  We'll get a wiki/JIRA up for that. I think that gives what you want in
  terms of moving partitioning to the client side.
  2. I think consuming from a different cluster than the one you produce to
 will be easy.
  More than that is more complex, though I agree the pluggable partitioning
  makes it theoretically possible. Let's try to get something that works
 for
  the first case, it sounds like that solves the use case you describe of
  wanting to directly transform from a given cluster but produce back to a
  different cluster. I think the key observation is that the whole reason
  LinkedIn split data over clusters to begin with was because of the lack
 of
  quotas, which are in any case getting implemented.
 
  -Jay
 
  On Sun, Jul 26, 2015 at 11:31 PM, Yi Pan nickpa...@gmail.com wrote:
 
   Hi, Jay and all,
  
   Thanks for all your quick responses. I tried to summarize my thoughts
  here:
  
   - ConsumerRecord as stream processor API:
  
  * This KafkaProcessor API is targeted to receive the message from
  Kafka.
   So, to Yasuhiro's join/transformation example, any join/transformation
   results that are materialized in Kafka should have ConsumerRecord
 format
   (i.e. w/ topic and offsets). Any non-materialized join/transformation
   results should not be processed by this KafkaProcessor API. One example
  is
   the in-memory operators API in Samza, which is designed to handle the
   non-materialzied join/transformation results. And yes, in this case, a
  more
   abstract data model is needed.
  
  * Just to support Jay's point of a general
   ConsumerRecord/ProducerRecord, a general stream processing on more than
  one
   data sources would need at least the following info: data source
   description (i.e. which topic/table), and actual data (i.e. key-value
   pairs). It would make sense to have the data source name as part of the
   general metadata in stream processing (think about it as the table name
  for
   records in standard SQL).
  
   - SQL/DSL
  
  * I think that this topic itself is worthy of another KIP
 discussion.
  I
   would prefer to leave it out of scope in KIP-28.
  
   - Client-side pluggable partition manager
  
  * Given the use cases we have seen with large-scale deployment of
   Samza/Kafka in LinkedIn, I would argue that we should make it as the
   first-class citizen in this KIP. The use cases include:
  
 * multi-cluster Kafka
  
 * host-affinity (i.e. local-state associated w/ certain
 partitions
  on
   client)
  
   - Multi-cluster scenario
  
  * Although I originally just brought it up as a use case that
 requires
   client-side partition manager, reading Jay’s comments, I realized that
 I
   have one fundamental issue w/ the current copycat + transformation
 model.
   If I interpret Jay’s comment correctly, the proposed
  copycat+transformation
   plays out in the following way: i) copycat takes all data from sources
  (no
   matter it is 

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-30 Thread James Cheng
I agree with Sriram and Martin. Kafka is already about providing streams of 
data, and so Kafka Streams or anything like that is confusing to me.

This new library is about making it easier to process the data.

-James

On Jul 30, 2015, at 9:38 AM, Aditya Auradkar aaurad...@linkedin.com.INVALID 
wrote:

 Personally, I prefer KafkaStreams just because it sounds nicer. For the
 reasons identified above, KafkaProcessor or KProcessor is more apt but
 sounds less catchy (IMO). I also think we should prefix with Kafka (rather
 than K) because we will then have 3 clients: KafkaProducer, KafkaConsumer
 and KafkaProcessor which is very nice and consistent.
 
 Aditya
 
 On Thu, Jul 30, 2015 at 9:17 AM, Gwen Shapira gshap...@cloudera.com wrote:
 
 I think it's also a matter of intent. If we see it as yet another
 client library, then Processor (to match Producer and Consumer) will
 work great.
 If we see it as a stream processing framework, the name has to start
 with S to follow existing convention.
 
 Speaking of naming conventions:
 You know how people have stack names for technologies that are usually
 used in tandem? ELK, LAMP, etc.
 The pattern of Kafka -> Stream Processor -> NoSQL Store is super
 common. KSN stack doesn't sound right, though. Maybe while we are
 bikeshedding, someone has ideas in that direction :)
 
 On Thu, Jul 30, 2015 at 2:01 AM, Sriram Subramanian
 srsubraman...@linkedin.com.invalid wrote:
 I had the same thought. Kafka processor, KProcessor or even Kafka
 stream processor is more relevant.
 
 
 
 On Jul 30, 2015, at 2:09 PM, Martin Kleppmann mar...@kleppmann.com
 wrote:
 
 I'm with Sriram -- Kafka is all about streams already (or topics, to be
 precise, but we're calling it stream processing not topic processing),
 so I find Kafka Streams, KStream and Kafka Streaming all confusing,
 since they seem to imply that other bits of Kafka are not about streams.
 
 I would prefer The Processor API or Kafka Processors or Kafka
 Processing Client or KProcessor, or something along those lines.
 
 On 30 Jul 2015, at 15:07, Guozhang Wang wangg...@gmail.com wrote:
 
 I would vote for KStream as it sounds sexier (is it only me??), second
 to
 that would be Kafka Streaming.
 
 On Wed, Jul 29, 2015 at 6:08 PM, Jay Kreps j...@confluent.io wrote:
 
 Also, the most important part of any prototype, we should have a name
 for
 this producing-consumer-thingamgigy:
 
 Various ideas:
 - Kafka Streams
 - KStream
 - Kafka Streaming
 - The Processor API
 - Metamorphosis
 - Transformer API
 - Verwandlung
 
 For my part I think what people are trying to do is stream processing
 with
 Kafka so I think something that evokes Kafka and stream processing is
 preferable. I like Kafka Streams or Kafka Streaming followed by
 KStream.
 
 Transformer kind of makes me think of the shape-shifting cars.
 
 Metamorphosis is cool and hilarious but since we are kind of
 envisioning
 this as more limited scope thing rather than a massive framework in
 its own
 right I actually think it should have a descriptive name rather than a
 personality of its own.
 
 Anyhow let the bikeshedding commence.
 
 -Jay
 
 
 On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com
 wrote:
 
 Hi all,
 
 I just posted KIP-28: Add a transform client for data processing
 
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
 .
 
 The wiki page does not yet have the full design / implementation
 details,
 and this email is to kick-off the conversation on whether we should
 add
 this new client with the described motivations, and if yes what
 features
 /
 functionalities should be included.
 
 Looking forward to your feedback!
 
 -- Guozhang
 
 
 
 --
 -- Guozhang
 
 



Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-30 Thread Martin Kleppmann
I'm with Sriram -- Kafka is all about streams already (or topics, to be 
precise, but we're calling it stream processing not topic processing), so I 
find Kafka Streams, KStream and Kafka Streaming all confusing, since they 
seem to imply that other bits of Kafka are not about streams.

I would prefer The Processor API or Kafka Processors or Kafka Processing 
Client or KProcessor, or something along those lines.

On 30 Jul 2015, at 15:07, Guozhang Wang wangg...@gmail.com wrote:

 I would vote for KStream as it sounds sexier (is it only me??), second to
 that would be Kafka Streaming.
 
 On Wed, Jul 29, 2015 at 6:08 PM, Jay Kreps j...@confluent.io wrote:
 
 Also, the most important part of any prototype, we should have a name for
 this producing-consumer-thingamgigy:
 
 Various ideas:
 - Kafka Streams
 - KStream
 - Kafka Streaming
 - The Processor API
 - Metamorphosis
 - Transformer API
 - Verwandlung
 
 For my part I think what people are trying to do is stream processing with
 Kafka so I think something that evokes Kafka and stream processing is
 preferable. I like Kafka Streams or Kafka Streaming followed by KStream.
 
 Transformer kind of makes me think of the shape-shifting cars.
 
 Metamorphosis is cool and hilarious but since we are kind of envisioning
 this as more limited scope thing rather than a massive framework in its own
 right I actually think it should have a descriptive name rather than a
 personality of its own.
 
 Anyhow let the bikeshedding commence.
 
 -Jay
 
 
 On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com wrote:
 
 Hi all,
 
 I just posted KIP-28: Add a transform client for data processing
 
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
 
 .
 
 The wiki page does not yet have the full design / implementation details,
 and this email is to kick-off the conversation on whether we should add
 this new client with the described motivations, and if yes what features
 /
 functionalities should be included.
 
 Looking forward to your feedback!
 
 -- Guozhang
 
 
 
 
 
 -- 
 -- Guozhang



Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-30 Thread Aditya Auradkar
Personally, I prefer KafkaStreams just because it sounds nicer. For the
reasons identified above, KafkaProcessor or KProcessor is more apt but
sounds less catchy (IMO). I also think we should prefix with Kafka (rather
than K) because we will then have 3 clients: KafkaProducer, KafkaConsumer
and KafkaProcessor which is very nice and consistent.

Aditya

On Thu, Jul 30, 2015 at 9:17 AM, Gwen Shapira gshap...@cloudera.com wrote:

 I think it's also a matter of intent. If we see it as yet another
 client library, then Processor (to match Producer and Consumer) will
 work great.
 If we see it as a stream processing framework, the name has to start
 with S to follow existing convention.

 Speaking of naming conventions:
 You know how people have stack names for technologies that are usually
 used in tandem? ELK, LAMP, etc.
 The pattern of Kafka -> Stream Processor -> NoSQL Store is super
 common. KSN stack doesn't sound right, though. Maybe while we are
 bikeshedding, someone has ideas in that direction :)

 On Thu, Jul 30, 2015 at 2:01 AM, Sriram Subramanian
 srsubraman...@linkedin.com.invalid wrote:
  I had the same thought. Kafka processor, KProcessor or even Kafka
  stream processor is more relevant.
 
 
 
  On Jul 30, 2015, at 2:09 PM, Martin Kleppmann mar...@kleppmann.com
 wrote:
 
  I'm with Sriram -- Kafka is all about streams already (or topics, to be
 precise, but we're calling it stream processing not topic processing),
 so I find Kafka Streams, KStream and Kafka Streaming all confusing,
 since they seem to imply that other bits of Kafka are not about streams.
 
  I would prefer The Processor API or Kafka Processors or Kafka
 Processing Client or KProcessor, or something along those lines.
 
  On 30 Jul 2015, at 15:07, Guozhang Wang wangg...@gmail.com wrote:
 
  I would vote for KStream as it sounds sexier (is it only me??), second
 to
  that would be Kafka Streaming.
 
  On Wed, Jul 29, 2015 at 6:08 PM, Jay Kreps j...@confluent.io wrote:
 
  Also, the most important part of any prototype, we should have a name
 for
  this producing-consumer-thingamgigy:
 
  Various ideas:
  - Kafka Streams
  - KStream
  - Kafka Streaming
  - The Processor API
  - Metamorphosis
  - Transformer API
  - Verwandlung
 
  For my part I think what people are trying to do is stream processing
 with
  Kafka so I think something that evokes Kafka and stream processing is
  preferable. I like Kafka Streams or Kafka Streaming followed by
 KStream.
 
  Transformer kind of makes me think of the shape-shifting cars.
 
  Metamorphosis is cool and hilarious but since we are kind of
 envisioning
  this as more limited scope thing rather than a massive framework in
 its own
  right I actually think it should have a descriptive name rather than a
  personality of its own.
 
  Anyhow let the bikeshedding commence.
 
  -Jay
 
 
  On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com
 wrote:
 
  Hi all,
 
  I just posted KIP-28: Add a transform client for data processing
  
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
  .
 
  The wiki page does not yet have the full design / implementation
 details,
  and this email is to kick-off the conversation on whether we should
 add
  this new client with the described motivations, and if yes what
 features
  /
  functionalities should be included.
 
  Looking forward to your feedback!
 
  -- Guozhang
 
 
 
  --
  -- Guozhang
 



Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-30 Thread Guozhang Wang
I would vote for KStream as it sounds sexier (is it only me??), second to
that would be Kafka Streaming.

On Wed, Jul 29, 2015 at 6:08 PM, Jay Kreps j...@confluent.io wrote:

 Also, the most important part of any prototype, we should have a name for
 this producing-consumer-thingamgigy:

 Various ideas:
 - Kafka Streams
 - KStream
 - Kafka Streaming
 - The Processor API
 - Metamorphosis
 - Transformer API
 - Verwandlung

 For my part I think what people are trying to do is stream processing with
 Kafka so I think something that evokes Kafka and stream processing is
 preferable. I like Kafka Streams or Kafka Streaming followed by KStream.

 Transformer kind of makes me think of the shape-shifting cars.

 Metamorphosis is cool and hilarious but since we are kind of envisioning
 this as more limited scope thing rather than a massive framework in its own
 right I actually think it should have a descriptive name rather than a
 personality of its own.

 Anyhow let the bikeshedding commence.

 -Jay


 On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com wrote:

  Hi all,
 
  I just posted KIP-28: Add a transform client for data processing
  
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
  
  .
 
  The wiki page does not yet have the full design / implementation details,
  and this email is to kick-off the conversation on whether we should add
  this new client with the described motivations, and if yes what features
 /
  functionalities should be included.
 
  Looking forward to your feedback!
 
  -- Guozhang
 




-- 
-- Guozhang


Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-30 Thread Gwen Shapira
I think it's also a matter of intent. If we see it as yet another
client library, then Processor (to match Producer and Consumer) will
work great.
If we see it as a stream processing framework, the name has to start
with S to follow existing convention.

Speaking of naming conventions:
You know how people have stack names for technologies that are usually
used in tandem? ELK, LAMP, etc.
The pattern of Kafka -> Stream Processor -> NoSQL Store is super
common. KSN stack doesn't sound right, though. Maybe while we are
bikeshedding, someone has ideas in that direction :)

On Thu, Jul 30, 2015 at 2:01 AM, Sriram Subramanian
srsubraman...@linkedin.com.invalid wrote:
 I had the same thought. Kafka processor, KProcessor or even Kafka
 stream processor is more relevant.



 On Jul 30, 2015, at 2:09 PM, Martin Kleppmann mar...@kleppmann.com wrote:

 I'm with Sriram -- Kafka is all about streams already (or topics, to be 
 precise, but we're calling it stream processing not topic processing), 
 so I find Kafka Streams, KStream and Kafka Streaming all confusing, 
 since they seem to imply that other bits of Kafka are not about streams.

 I would prefer The Processor API or Kafka Processors or Kafka 
 Processing Client or KProcessor, or something along those lines.

 On 30 Jul 2015, at 15:07, Guozhang Wang wangg...@gmail.com wrote:

 I would vote for KStream as it sounds sexier (is it only me??), second to
 that would be Kafka Streaming.

 On Wed, Jul 29, 2015 at 6:08 PM, Jay Kreps j...@confluent.io wrote:

 Also, the most important part of any prototype, we should have a name for
 this producing-consumer-thingamgigy:

 Various ideas:
 - Kafka Streams
 - KStream
 - Kafka Streaming
 - The Processor API
 - Metamorphosis
 - Transformer API
 - Verwandlung

 For my part I think what people are trying to do is stream processing with
 Kafka so I think something that evokes Kafka and stream processing is
 preferable. I like Kafka Streams or Kafka Streaming followed by KStream.

 Transformer kind of makes me think of the shape-shifting cars.

 Metamorphosis is cool and hilarious but since we are kind of envisioning
 this as more limited scope thing rather than a massive framework in its own
 right I actually think it should have a descriptive name rather than a
 personality of its own.

 Anyhow let the bikeshedding commence.

 -Jay


 On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com wrote:

 Hi all,

 I just posted KIP-28: Add a transform client for data processing
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
 .

 The wiki page does not yet have the full design / implementation details,
 and this email is to kick-off the conversation on whether we should add
 this new client with the described motivations, and if yes what features
 /
 functionalities should be included.

 Looking forward to your feedback!

 -- Guozhang



 --
 -- Guozhang



Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-30 Thread Sriram Subramanian
I had the same thought. Kafka processor, KProcessor or even Kafka
stream processor is more relevant.



 On Jul 30, 2015, at 2:09 PM, Martin Kleppmann mar...@kleppmann.com wrote:

 I'm with Sriram -- Kafka is all about streams already (or topics, to be 
 precise, but we're calling it stream processing not topic processing), so 
 I find Kafka Streams, KStream and Kafka Streaming all confusing, since 
 they seem to imply that other bits of Kafka are not about streams.

 I would prefer The Processor API or Kafka Processors or Kafka Processing 
 Client or KProcessor, or something along those lines.

 On 30 Jul 2015, at 15:07, Guozhang Wang wangg...@gmail.com wrote:

 I would vote for KStream as it sounds sexier (is it only me??), second to
 that would be Kafka Streaming.

 On Wed, Jul 29, 2015 at 6:08 PM, Jay Kreps j...@confluent.io wrote:

 Also, the most important part of any prototype, we should have a name for
 this producing-consumer-thingamgigy:

 Various ideas:
 - Kafka Streams
 - KStream
 - Kafka Streaming
 - The Processor API
 - Metamorphosis
 - Transformer API
 - Verwandlung

 For my part I think what people are trying to do is stream processing with
 Kafka so I think something that evokes Kafka and stream processing is
 preferable. I like Kafka Streams or Kafka Streaming followed by KStream.

 Transformer kind of makes me think of the shape-shifting cars.

 Metamorphosis is cool and hilarious but since we are kind of envisioning
 this as more limited scope thing rather than a massive framework in its own
 right I actually think it should have a descriptive name rather than a
 personality of its own.

 Anyhow let the bikeshedding commence.

 -Jay


 On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com wrote:

 Hi all,

 I just posted KIP-28: Add a transform client for data processing
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
 .

 The wiki page does not yet have the full design / implementation details,
 and this email is to kick-off the conversation on whether we should add
 this new client with the described motivations, and if yes what features
 /
 functionalities should be included.

 Looking forward to your feedback!

 -- Guozhang



 --
 -- Guozhang



Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-29 Thread Neha Narkhede
Prefer something that evokes stream processing on top of Kafka. And since
I've heard many people conflate streaming with streaming video (I know,
duh!), I'd vote for Kafka Streams or maybe KStream.

Thanks,
Neha

On Wed, Jul 29, 2015 at 6:08 PM, Jay Kreps j...@confluent.io wrote:

 Also, the most important part of any prototype, we should have a name for
 this producing-consumer-thingamgigy:

 Various ideas:
 - Kafka Streams
 - KStream
 - Kafka Streaming
 - The Processor API
 - Metamorphosis
 - Transformer API
 - Verwandlung

 For my part I think what people are trying to do is stream processing with
 Kafka so I think something that evokes Kafka and stream processing is
 preferable. I like Kafka Streams or Kafka Streaming followed by KStream.

 Transformer kind of makes me think of the shape-shifting cars.

 Metamorphosis is cool and hilarious but since we are kind of envisioning
 this as more limited scope thing rather than a massive framework in its own
 right I actually think it should have a descriptive name rather than a
  personality of its own.

 Anyhow let the bikeshedding commence.

 -Jay


 On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com wrote:

  Hi all,
 
  I just posted KIP-28: Add a transform client for data processing
  
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
  
  .
 
  The wiki page does not yet have the full design / implementation details,
  and this email is to kick off the conversation on whether we should add
  this new client with the described motivations, and if yes what features
 /
  functionalities should be included.
 
  Looking forward to your feedback!
 
  -- Guozhang
 




-- 
Thanks,
Neha


Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-29 Thread Gwen Shapira
Since it sounds like it is not a separate framework (like CopyCat) but
rather a new client, it would be nice to follow the existing convention:
Producer, Consumer and Processor (or Transformer) make sense to me.

Note that the way the API is currently described, people may want to
use it inside Spark applications (because everyone wants to use
Spark), so putting Streaming in the name will cause some confusion.

Metamorphosis and Verwandlung are too perfect to argue against, but
I'd rather keep my slide decks insect-free :)
Since I like fluffy animals on my slides, how about Bunny? This way
chaining multiple processors into a pipeline can be described as Bunny
hops. Also, together with CopyCat, we have pipelines that look like
this: 
http://www.catster.com/wp-content/uploads/2015/06/bd58b829434657a44533a33c32772c36.jpg

Gwen


On Wed, Jul 29, 2015 at 6:46 PM, Sriram Subramanian
srsubraman...@linkedin.com.invalid wrote:
 I think Kafka and streaming are synonymous. Kafka streams or Kafka
 streaming really does not indicate stream processing.

 On Wed, Jul 29, 2015 at 6:20 PM, Neha Narkhede n...@confluent.io wrote:

 Prefer something that evokes stream processing on top of Kafka. And since
 I've heard many people conflate streaming with streaming video (I know,
 duh!), I'd vote for Kafka Streams or maybe KStream.

 Thanks,
 Neha

 On Wed, Jul 29, 2015 at 6:08 PM, Jay Kreps j...@confluent.io wrote:

  Also, the most important part of any prototype: we should have a name for
  this producing-consumer-thingamajig:
 
  Various ideas:
  - Kafka Streams
  - KStream
  - Kafka Streaming
  - The Processor API
  - Metamorphosis
  - Transformer API
  - Verwandlung
 
  For my part I think what people are trying to do is stream processing
 with
  Kafka so I think something that evokes Kafka and stream processing is
  preferable. I like Kafka Streams or Kafka Streaming followed by KStream.
 
  Transformer kind of makes me think of the shape-shifting cars.
 
  Metamorphosis is cool and hilarious, but since we are kind of envisioning
  this as a more limited-scope thing rather than a massive framework in its
  own right, I actually think it should have a descriptive name rather than
  a personality of its own.
 
  Anyhow let the bikeshedding commence.
 
  -Jay
 
 
  On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com
 wrote:
 
   Hi all,
  
   I just posted KIP-28: Add a transform client for data processing
   
  
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
   
   .
  
   The wiki page does not yet have the full design / implementation
 details,
   and this email is to kick off the conversation on whether we should add
   this new client with the described motivations, and if yes what
 features
  /
   functionalities should be included.
  
   Looking forward to your feedback!
  
   -- Guozhang
  
 



 --
 Thanks,
 Neha



Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-29 Thread Jay Kreps
Also, the most important part of any prototype: we should have a name for
this producing-consumer-thingamajig:

Various ideas:
- Kafka Streams
- KStream
- Kafka Streaming
- The Processor API
- Metamorphosis
- Transformer API
- Verwandlung

For my part I think what people are trying to do is stream processing with
Kafka so I think something that evokes Kafka and stream processing is
preferable. I like Kafka Streams or Kafka Streaming followed by KStream.

Transformer kind of makes me think of the shape-shifting cars.

Metamorphosis is cool and hilarious, but since we are kind of envisioning
this as a more limited-scope thing rather than a massive framework in its own
right, I actually think it should have a descriptive name rather than a
personality of its own.

Anyhow let the bikeshedding commence.

-Jay


On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com wrote:

 Hi all,

 I just posted KIP-28: Add a transform client for data processing
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
 
 .

 The wiki page does not yet have the full design / implementation details,
 and this email is to kick off the conversation on whether we should add
 this new client with the described motivations, and if yes what features /
 functionalities should be included.

 Looking forward to your feedback!

 -- Guozhang



Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-28 Thread Guozhang Wang
I have updated the wiki page incorporating people's comments, please feel
free to take another look before today's meeting.

On Mon, Jul 27, 2015 at 11:19 PM, Yi Pan nickpa...@gmail.com wrote:

 Hi, Jay,

 {quote}
 1. Yeah we are going to try to generalize the partition management stuff.
 We'll get a wiki/JIRA up for that. I think that gives what you want in
 terms of moving partitioning to the client side.
 {quote}
 Great! I am looking forward to that.

 {quote}
 I think the key observation is that the whole reason
 LinkedIn split data over clusters to begin with was because of the lack of
 quotas, which are in any case getting implemented.
 {quote}
 I am not sure that I followed this point. Is your point that with quotas, it
 is possible to host all data in a single cluster?

 -Yi

 On Mon, Jul 27, 2015 at 8:53 AM, Jay Kreps j...@confluent.io wrote:

  Hey Yi,
 
  Great points. I think for some of this the most useful thing would be to
 get a WIP prototype out that we could discuss concretely. I think
 Yasuhiro
  and Guozhang took that prototype I had done, and had some improvements.
  Give us a bit to get that into understandable shape so we can discuss.
 
  To address a few of your other points:
  1. Yeah we are going to try to generalize the partition management stuff.
  We'll get a wiki/JIRA up for that. I think that gives what you want in
  terms of moving partitioning to the client side.
  2. I think consuming from a different cluster than the one you produce to
  will be easy.
  More than that is more complex, though I agree the pluggable partitioning
  makes it theoretically possible. Let's try to get something that works
 for
  the first case, it sounds like that solves the use case you describe of
  wanting to directly transform from a given cluster but produce back to a
  different cluster. I think the key observation is that the whole reason
  LinkedIn split data over clusters to begin with was because of the lack
 of
  quotas, which are in any case getting implemented.
 
  -Jay
 
  On Sun, Jul 26, 2015 at 11:31 PM, Yi Pan nickpa...@gmail.com wrote:
 
   Hi, Jay and all,
  
   Thanks for all your quick responses. I tried to summarize my thoughts
  here:
  
   - ConsumerRecord as stream processor API:
  
   * This KafkaProcessor API is targeted to receive messages from
  Kafka.
   So, to Yasuhiro's join/transformation example, any join/transformation
   results that are materialized in Kafka should have ConsumerRecord
 format
   (i.e. w/ topic and offsets). Any non-materialized join/transformation
   results should not be processed by this KafkaProcessor API. One example
  is
   the in-memory operators API in Samza, which is designed to handle the
   non-materialized join/transformation results. And yes, in this case, a
  more
   abstract data model is needed.
  
  * Just to support Jay's point of a general
   ConsumerRecord/ProducerRecord, a general stream processing on more than
  one
   data sources would need at least the following info: data source
   description (i.e. which topic/table), and actual data (i.e. key-value
   pairs). It would make sense to have the data source name as part of the
   general metadata in stream processing (think about it as the table name
  for
   records in standard SQL).
  
   - SQL/DSL
  
  * I think that this topic itself is worthy of another KIP
 discussion.
  I
   would prefer to leave it out of scope in KIP-28.
  
   - Client-side pluggable partition manager
  
  * Given the use cases we have seen with large-scale deployment of
   Samza/Kafka in LinkedIn, I would argue that we should make it as the
   first-class citizen in this KIP. The use cases include:
  
 * multi-cluster Kafka
  
 * host-affinity (i.e. local-state associated w/ certain
 partitions
  on
   client)
  
   - Multi-cluster scenario
  
  * Although I originally just brought it up as a use case that
 requires
   client-side partition manager, reading Jay’s comments, I realized that
 I
   have one fundamental issue w/ the current copycat + transformation
 model.
   If I interpret Jay’s comment correctly, the proposed
  copycat+transformation
   plays out in the following way: i) copycat takes all data from sources
  (no
   matter it is Kafka or non-Kafka) into *one single Kafka cluster*; ii)
   transformation is only restricted to take data sources in *this single
   Kafka cluster* to perform aggregate/join etc. This is different from my
   original understanding of the copycat. The main issue I have with this
   model is: huge data-copy between Kafka clusters. In LinkedIn, we used
 to
   follow this model that uses MirrorMaker to map topics from tracking
   clusters to Samza-specific Kafka cluster and only do stream processing
 in
   the Samza-specific Kafka cluster. We moved away from this model and
  started
   allowing users to directly consume from tracking Kafka clusters due to
  the
   overhead of copying huge amount of traffic between Kafka clusters. I
  

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-28 Thread Jay Kreps
The second question that came up on the KIP was how joins and
aggregations work. A lot of implicit thinking went into Kafka's data model
to support stream processing so there is an idea of how this should work
but it isn't exactly obvious. Let me go through the idea of how a processor
is meant to do these kinds of operations using Kafka. This isn't the only
way to do these things but it's the one we thought about when making Kafka
originally.

The first primitive you need is co-partitioning. Both joins and
aggregations require getting data into the same place to do the join or do
the aggregating. This makes use of the partitioning model in Kafka. So
imagine you have two topics CLICKS and IMPRESSIONS, both partitioned by
user_id with the same number of partitions. Then a co-partitioning
partition assignment strategy will deliver messages for the same partitions
to the same consumer instance. The new consumer supports this explicitly now
with pluggable partition assignment strategies; the old high-level consumer
did this implicitly. One of the reasons we worked so hard on replication in
Kafka was so that partitions don't disappear when a server goes down, which
makes this kind of co-partitioning possible.
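
To make that concrete, here is a minimal sketch of one worker pinning the
same partition number of both topics with the new consumer's manual
assignment; the topic names, partition choice, and configuration are
illustrative assumptions, not part of the KIP:

    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class CoPartitionedInput {
        public static void main(String[] args) {
            Properties config = new Properties();
            config.put("bootstrap.servers", "localhost:9092");
            config.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            config.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            KafkaConsumer<String, String> consumer =
                new KafkaConsumer<String, String>(config);
            // this instance owns partition 3 of both topics, so clicks and
            // impressions for the same user_ids all arrive at this worker
            int myPartition = 3;
            consumer.assign(Arrays.asList(
                new TopicPartition("CLICKS", myPartition),
                new TopicPartition("IMPRESSIONS", myPartition)));
        }
    }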

If data isn't already co-partitioned on the same key, you can repartition
it by publishing out a new topic keyed and partitioned by the new partition
key (a kind of continuous reshuffle).
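
As a sketch, such a reshuffle is just a consume-then-produce loop. The
topic names, and the assumption that the user_id is the first
comma-separated field of the value, are hypothetical:

    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class Reshuffle {
        public static void main(String[] args) {
            // shared config; each client ignores the keys it does not use
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "reshuffler");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            KafkaConsumer<String, String> consumer =
                new KafkaConsumer<String, String>(props);
            KafkaProducer<String, String> producer =
                new KafkaProducer<String, String>(props);
            consumer.subscribe(Arrays.asList("CLICKS"));
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(100)) {
                    // assume the first CSV field of the value is the user_id
                    String userId = rec.value().split(",")[0];
                    // the default partitioner hashes the key, so all events
                    // for one user land in one partition of the new topic
                    producer.send(new ProducerRecord<String, String>(
                        "CLICKS_BY_USER", userId, rec.value()));
                }
            }
        }
    }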

Okay, great, so co-partitioning gets stuff to the same place, how do you
operate on it once it is there?

Joins can be either stream-to-stream joins where the streams are almost
aligned and you look for a matching click (say) for each impression. Or
they could be stream-to-table joins where you join user details to a click.
You can think of a stream join as a join done over a limited window (maybe
5 mins) and a table join as a join done over all time so far (an infinite
window).
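
For illustration, the stream-to-stream case over a 5-minute window might be
sketched as below, with pure in-memory buffering and no fault tolerance
yet; the record shape (user_id plus string payloads) is an assumption:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;

    public class ClickImpressionJoin {
        private static final long WINDOW_MS = 5 * 60 * 1000L;

        private static class Buffered {
            final long ts;
            final String impression;
            Buffered(long ts, String impression) {
                this.ts = ts;
                this.impression = impression;
            }
        }

        // user_id -> impressions seen within the current window
        private final Map<String, Deque<Buffered>> byUser =
            new HashMap<String, Deque<Buffered>>();

        public void onImpression(String userId, String impression, long now) {
            Deque<Buffered> buf = byUser.get(userId);
            if (buf == null) {
                buf = new ArrayDeque<Buffered>();
                byUser.put(userId, buf);
            }
            buf.addLast(new Buffered(now, impression));
        }

        public void onClick(String userId, String click, long now) {
            Deque<Buffered> buf = byUser.get(userId);
            if (buf == null) return;
            // evict impressions that fell out of the 5-minute window
            while (!buf.isEmpty() && now - buf.peekFirst().ts > WINDOW_MS) {
                buf.removeFirst();
            }
            for (Buffered b : buf) {
                System.out.println("matched " + b.impression + " -> " + click);
            }
        }
    }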

Aggregates like counts or other summary stats are going to be done over a
window too.

Both aggregates and intermediate join results can be thought of as state
that is accumulated within a window. Either the state of the aggregate so
far (say counts by key), or the state of the joined and unjoined records so
far in the window.

This state is local to the processing so it has to be made fault-tolerant
if the processor fails. There are two ways to do this: (a) recreate it, (b)
have a backup. To recreate it you just recompute the state by reprocessing
the input from the beginning of the window upon failure and recovery. In
Kafka you accomplish this by controlling the offset commit until the window
is processed. For a small window this works well. For a large window
(especially an infinite window) you need to be able to save out your
aggregate or partial join result so that you don't have to reprocess too
much input. To make the local state fault-tolerant, Kafka supports
log-compacted topics to allow journaling these local changes.
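
As a sketch of both recovery strategies for a running count by key: the
consumer and producer are assumed to be configured as in the reshuffle
sketch above, and counts-changelog is a hypothetical topic created with
cleanup.policy=compact:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class CountingWorker {
        private final KafkaConsumer<String, String> consumer;
        private final KafkaProducer<String, String> producer;
        private final Map<String, Long> counts = new HashMap<String, Long>();

        CountingWorker(KafkaConsumer<String, String> consumer,
                       KafkaProducer<String, String> producer) {
            this.consumer = consumer;
            this.producer = producer;
        }

        void run() {
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(100)) {
                    Long prev = counts.get(rec.key());
                    long next = (prev == null) ? 1L : prev + 1L;
                    counts.put(rec.key(), next);
                    // (b) journal each change: a restarted instance rebuilds
                    // its counts by replaying the compacted changelog rather
                    // than reprocessing all of the input
                    producer.send(new ProducerRecord<String, String>(
                        "counts-changelog", rec.key(), Long.toString(next)));
                }
                // (a) commit input offsets only once the state they produced
                // is safely journaled; after a crash, uncommitted input is
                // simply re-read and the state recomputed
                producer.flush();
                consumer.commitSync();
            }
        }
    }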

So clearly a user can use the producer and consumer to manage all this
directly, but it is kind of low level. The goal of the
processor/transformer/streaming client is to provide a user-friendly facade
over these capabilities.

-Jay




On Tue, Jul 28, 2015 at 11:59 AM, Jay Kreps j...@confluent.io wrote:

 Here is the link to the original prototype we started with. I wouldn't
 focus too heavily on the details of this code or the API, but I think it
 gives an idea of the lowest-level API, amount of code, etc. It was
 basically a clone of Samza built on Kafka using the new consumer protocol
 just to explore the idea.

 https://github.com/jkreps/kafka/tree/streams/clients/src/main/java/org/apache/kafka/clients/streaming

 I don't think we should talk about it too much because I think Guozhang
 and Yasu rewrote that, improved the apis, etc. But I think this does give a
 kind of proof that it is possible to make a relatively complete stream
 processing system that is only a few thousand lines of code that heavily
 embraces Kafka.

 -Jay

 On Tue, Jul 28, 2015 at 12:57 AM, Guozhang Wang wangg...@gmail.com
 wrote:

 I have updated the wiki page incorporating people's comments, please feel
 free to take another look before today's meeting.

 On Mon, Jul 27, 2015 at 11:19 PM, Yi Pan nickpa...@gmail.com wrote:

  Hi, Jay,
 
  {quote}
  1. Yeah we are going to try to generalize the partition management
 stuff.
  We'll get a wiki/JIRA up for that. I think that gives what you want in
  terms of moving partitioning to the client side.
  {quote}
  Great! I am looking forward to that.
 
  {quote}
  I think the key observation is that the whole reason
  LinkedIn split data over clusters to begin with was because of the lack
 of
  quotas, which are in any case getting implemented.
  {quote}
  I am not sure that I followed this point. Is your point that with
 quotas, it
  is possible to host all data in a single cluster?
 
  -Yi
 
  On Mon, Jul 27, 2015 at 8:53 AM, Jay Kreps j...@confluent.io wrote:
 
   Hey Yi,
  
   Great points. I 

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-28 Thread Jay Kreps
Here is the link to the original prototype we started with. I wouldn't
focus too heavily on the details of this code or the API, but I think it
gives an idea of the lowest-level API, amount of code, etc. It was
basically a clone of Samza built on Kafka using the new consumer protocol
just to explore the idea.
https://github.com/jkreps/kafka/tree/streams/clients/src/main/java/org/apache/kafka/clients/streaming

I don't think we should talk about it too much because I think Guozhang and
Yasu rewrote that, improved the apis, etc. But I think this does give a
kind of proof that it is possible to make a relatively complete stream
processing system that is only a few thousand lines of code that heavily
embraces Kafka.

-Jay

On Tue, Jul 28, 2015 at 12:57 AM, Guozhang Wang wangg...@gmail.com wrote:

 I have updated the wiki page incorporating people's comments, please feel
 free to take another look before today's meeting.

 On Mon, Jul 27, 2015 at 11:19 PM, Yi Pan nickpa...@gmail.com wrote:

  Hi, Jay,
 
  {quote}
  1. Yeah we are going to try to generalize the partition management stuff.
  We'll get a wiki/JIRA up for that. I think that gives what you want in
  terms of moving partitioning to the client side.
  {quote}
  Great! I am looking forward to that.
 
  {quote}
  I think the key observation is that the whole reason
  LinkedIn split data over clusters to begin with was because of the lack
 of
  quotas, which are in any case getting implemented.
  {quote}
  I am not sure that I followed this point. Is your point that with quotas,
 it
  is possible to host all data in a single cluster?
 
  -Yi
 
  On Mon, Jul 27, 2015 at 8:53 AM, Jay Kreps j...@confluent.io wrote:
 
   Hey Yi,
  
   Great points. I think for some of this the most useful thing would be
 to
   get a WIP prototype out that we could discuss concretely. I think
  Yasuhiro
   and Guozhang took that prototype I had done, and had some improvements.
   Give us a bit to get that into understandable shape so we can discuss.
  
   To address a few of your other points:
   1. Yeah we are going to try to generalize the partition management
 stuff.
   We'll get a wiki/JIRA up for that. I think that gives what you want in
   terms of moving partitioning to the client side.
   2. I think consuming from a different cluster than the one you produce to
   will be easy.
   More than that is more complex, though I agree the pluggable
 partitioning
   makes it theoretically possible. Let's try to get something that works
  for
   the first case, it sounds like that solves the use case you describe of
   wanting to directly transform from a given cluster but produce back to
 a
   different cluster. I think the key observation is that the whole reason
   LinkedIn split data over clusters to begin with was because of the lack
  of
   quotas, which are in any case getting implemented.
  
   -Jay
  
   On Sun, Jul 26, 2015 at 11:31 PM, Yi Pan nickpa...@gmail.com wrote:
  
Hi, Jay and all,
   
Thanks for all your quick responses. I tried to summarize my thoughts
   here:
   
- ConsumerRecord as stream processor API:
   
    * This KafkaProcessor API is targeted to receive messages from
   Kafka.
So, to Yasuhiro's join/transformation example, any
 join/transformation
results that are materialized in Kafka should have ConsumerRecord
  format
(i.e. w/ topic and offsets). Any non-materialized join/transformation
results should not be processed by this KafkaProcessor API. One
 example
   is
the in-memory operators API in Samza, which is designed to handle the
 non-materialized join/transformation results. And yes, in this case,
 a
   more
abstract data model is needed.
   
   * Just to support Jay's point of a general
ConsumerRecord/ProducerRecord, a general stream processing on more
 than
   one
data sources would need at least the following info: data source
description (i.e. which topic/table), and actual data (i.e. key-value
pairs). It would make sense to have the data source name as part of
 the
general metadata in stream processing (think about it as the table
 name
   for
records in standard SQL).
   
- SQL/DSL
   
   * I think that this topic itself is worthy of another KIP
  discussion.
   I
would prefer to leave it out of scope in KIP-28.
   
- Client-side pluggable partition manager
   
   * Given the use cases we have seen with large-scale deployment of
Samza/Kafka in LinkedIn, I would argue that we should make it as the
first-class citizen in this KIP. The use cases include:
   
  * multi-cluster Kafka
   
  * host-affinity (i.e. local-state associated w/ certain
  partitions
   on
client)
   
- Multi-cluster scenario
   
   * Although I originally just brought it up as a use case that
  requires
client-side partition manager, reading Jay’s comments, I realized
 that
  I
have one fundamental issue w/ the current copycat + transformation

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-28 Thread Yi Pan
Hi, Aditya,

{quote}
- The KIP states that cmd line tools will be provided to deploy as a
separate service. Is the proposed scope limited to providing a library
which makes it possible to build stream-processing-as-a-service, or to
provide such a service within Kafka itself?
{quote}

There has already been a long discussion in the Samza mailing list
which partly resulted in this KIP proposal. The basic conclusion was that
this KIP is to build a stream processor library that could be used as a
library or as a standalone process. The standalone process may be used as a
deployment method for stream processing in a cluster environment, but that
would be outside the scope of this KIP.

-Yi

On Mon, Jul 27, 2015 at 10:46 PM, Aditya Auradkar 
aaurad...@linkedin.com.invalid wrote:

 +1 on comparison with existing solutions. On a high level, it seems nice to
 have a transform library inside Kafka; a lot of the building blocks are
 already there to build a stream processing framework. However, the details
 are tricky to get right. I think this discussion will get a lot more
 interesting when we have something concrete to look at. I'm +1 for the
 general idea.
 How far away are we from having a prototype patch to play with?

 Couple of observations:
 - Since the input source for each processor is always Kafka, you get basic
 client-side partition management out of the box if it uses the high-level
 consumer.
 - The KIP states that cmd line tools will be provided to deploy as a
 separate service. Is the proposed scope limited to providing a library
 which makes it possible to build stream-processing-as-a-service, or to
 provide such a service within Kafka itself?

 Aditya

 On Mon, Jul 27, 2015 at 8:20 PM, Gwen Shapira gshap...@cloudera.com
 wrote:

  Hi,
 
  Since we will be discussing KIP-28 in the call tomorrow, can you
   update the KIP with the feature comparison with existing solutions?
   I admit that I do not see a need for a single-event producer-consumer
   pair (AKA Flume Interceptor). I've seen tons of people implement such
  apps in the last year, and it seemed easy. Now, perhaps we were doing
  it all wrong... but I'd like to know how :)
 
  If we are talking about a bigger story (i.e. DSL, real
  stream-processing, etc), thats a different discussion. I've seen a
  bunch of misconceptions about SparkStreaming in this discussion, and I
  have some thoughts in that regard, but I'd rather not go into that if
   that's outside the scope of this KIP.
 
  Gwen
 
 
  On Fri, Jul 24, 2015 at 9:48 AM, Guozhang Wang wangg...@gmail.com
 wrote:
   Hi Ewen,
  
   Replies inlined.
  
   On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava 
  e...@confluent.io
   wrote:
  
   Just some notes on the KIP doc itself:
  
   * It'd be useful to clarify at what point the plain consumer + custom
  code
   + producer breaks down. I think trivial filtering and aggregation on a
   single stream usually work fine with this model. Anything where you
 need
   more complex joins, windowing, etc. are where it breaks down. I think
  most
   interesting applications require that functionality, but it's helpful
 to
   make this really clear in the motivation -- right now, Kafka only
  provides
   the lowest level plumbing for stream processing applications, so most
   interesting apps require very heavyweight frameworks.
  
  
    I think for users to efficiently express complex logic like joins,
    windowing, etc., a higher-level programming interface beyond the process()
    interface would definitely be better, but that does not necessarily require
    a heavyweight framework, which usually includes more than just the
    high-level functional programming model. I would argue that an alternative
    solution would be better for users who want a high-level programming
    interface but not a heavyweight stream processing framework: the processor
    library plus another DSL layer on top of it.
  
  
  
   * I think the feature comparison of plain producer/consumer, stream
   processing frameworks, and this new library is a good start, but we
  might
   want something more thorough and structured, like a feature matrix.
  Right
   now it's hard to figure out exactly how they relate to each other.
  
  
   Cool, I can do that.
  
  
   * I'd personally push the library vs. framework story very strongly --
  the
   total buy-in and weak integration story of stream processing
 frameworks
  is
   a big downside and makes a library a really compelling (and currently
   unavailable, as far as I am aware) alternative.
  
  
    Are you suggesting there is still some content missing about the
   motivations of adding the proposed library in the wiki page?
  
  
   * Comment about in-memory storage of other frameworks is interesting
 --
  it
   is specific to the framework, but is supposed to also give performance
   benefits. The high-level functional processing interface would allow
 for
   combining multiple operations when 

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-28 Thread Neha Narkhede
Adi,

How far away are we from having a prototype patch to play with?


We are working to share a prototype next week. Though the code will evolve
to match the APIs and design as they shape up, it will be great if
people can take a look and provide feedback.

Couple of observations:
 - Since the input source for each processor is always Kafka, you get basic
 client-side partition management out of the box if it uses the high-level
 consumer.


That's right. The plan is to propose moving the partition assignment in the
consumer to the client-side (proposal coming up soon) and then use that in
Kafka streams and copycat.


 - The KIP states that cmd line tools will be provided to deploy as a
 separate service. Is the proposed scope limited to providing a library
 which makes it possible to build stream-processing-as-a-service, or to
 provide such a service within Kafka itself?


I think the KIP might've been a little misleading on this point. The scope
is to provide a library and have any integrations with other resource
management frameworks (Slider for YARN and Marathon for Mesos) live outside
Kafka. Having said that, in order to just get started with a simple stream
processing example, you still need basic scripts to get going. Those are
not anywhere similar in scope to what you'd expect in order to run this as
a service.

Thanks,
Neha

On Mon, Jul 27, 2015 at 10:57 PM, Neha Narkhede n...@confluent.io wrote:

 Gwen,

 We have a compilation of notes from comparison with other systems. They
 might be missing details that folks who worked on that system might be able
 to point out. We can share that and discuss further on the KIP call.

 We do hope to include a DSL since that is the most natural way of
 expressing stream processing operations on top of the processor client. The
 DSL layer should be equivalent to that provided by Spark streaming or Flink
 in terms of expressiveness though there will be differences in
 implementation. Our client is intended to be simpler, with minimum external
 dependencies since it integrates closely with Kafka. This is really what
 most application development is hoping to get - a lightweight library on
 top of Kafka that allows them to process streams of data.

 Thanks
 Neha

 On Mon, Jul 27, 2015 at 8:20 PM, Gwen Shapira gshap...@cloudera.com
 wrote:

 Hi,

 Since we will be discussing KIP-28 in the call tomorrow, can you
  update the KIP with the feature comparison with existing solutions?
  I admit that I do not see a need for a single-event producer-consumer
  pair (AKA Flume Interceptor). I've seen tons of people implement such
 apps in the last year, and it seemed easy. Now, perhaps we were doing
 it all wrong... but I'd like to know how :)

 If we are talking about a bigger story (i.e. DSL, real
  stream-processing, etc), that's a different discussion. I've seen a
 bunch of misconceptions about SparkStreaming in this discussion, and I
 have some thoughts in that regard, but I'd rather not go into that if
  that's outside the scope of this KIP.

 Gwen


 On Fri, Jul 24, 2015 at 9:48 AM, Guozhang Wang wangg...@gmail.com
 wrote:
  Hi Ewen,
 
  Replies inlined.
 
  On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava 
 e...@confluent.io

  wrote:
 
  Just some notes on the KIP doc itself:
 
  * It'd be useful to clarify at what point the plain consumer + custom
 code
  + producer breaks down. I think trivial filtering and aggregation on a
  single stream usually work fine with this model. Anything where you
 need
  more complex joins, windowing, etc. are where it breaks down. I think
 most
  interesting applications require that functionality, but it's helpful
 to
  make this really clear in the motivation -- right now, Kafka only
 provides
  the lowest level plumbing for stream processing applications, so most
  interesting apps require very heavyweight frameworks.
 
 
  I think for users to efficiently express complex logic like joins,
  windowing, etc., a higher-level programming interface beyond the process()
  interface would definitely be better, but that does not necessarily require
  a heavyweight framework, which usually includes more than just the
  high-level functional programming model. I would argue that an alternative
  solution would be better for users who want a high-level programming
  interface but not a heavyweight stream processing framework: the processor
  library plus another DSL layer on top of it.
 
 
 
  * I think the feature comparison of plain producer/consumer, stream
  processing frameworks, and this new library is a good start, but we
 might
  want something more thorough and structured, like a feature matrix.
 Right
  now it's hard to figure out exactly how they relate to each other.
 
 
  Cool, I can do that.
 
 
  * I'd personally push the library vs. framework story very strongly --
 the
  total buy-in and weak integration story of stream processing
 frameworks is
  a big downside and makes a library a 

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-28 Thread Yi Pan
Hi, Neha,

{quote}
We do hope to include a DSL since that is the most natural way of
expressing stream processing operations on top of the processor client. The
DSL layer should be equivalent to that provided by Spark streaming or Flink
in terms of expressiveness though there will be differences in
implementation. Our client is intended to be simpler, with minimum external
dependencies since it integrates closely with Kafka. This is really what
most application development is hoping to get - a lightweight library on
top of Kafka that allows them to process streams of data.
{quote}

I believe that the above itself is worth another KIP. I feel that there
are already a lot of system-level APIs (i.e. process callbacks,
KV-stores, producer/consumer integration, partition manager, multi-cluster
use cases, etc.) that need to be handled in this KIP. Adding a DSL/SQL
library here would bring in a whole set of problems/issues in very
different aspects and de-focus the scope of this KIP.

Just my one quick point.

On Mon, Jul 27, 2015 at 10:57 PM, Neha Narkhede n...@confluent.io wrote:

 Gwen,

 We have a compilation of notes from comparison with other systems. They
 might be missing details that folks who worked on that system might be able
 to point out. We can share that and discuss further on the KIP call.

 We do hope to include a DSL since that is the most natural way of
 expressing stream processing operations on top of the processor client. The
 DSL layer should be equivalent to that provided by Spark streaming or Flink
 in terms of expressiveness though there will be differences in
 implementation. Our client is intended to be simpler, with minimum external
 dependencies since it integrates closely with Kafka. This is really what
 most application development is hoping to get - a lightweight library on
 top of Kafka that allows them to process streams of data.

 Thanks
 Neha

 On Mon, Jul 27, 2015 at 8:20 PM, Gwen Shapira gshap...@cloudera.com
 wrote:

  Hi,
 
  Since we will be discussing KIP-28 in the call tomorrow, can you
   update the KIP with the feature comparison with existing solutions?
   I admit that I do not see a need for a single-event producer-consumer
   pair (AKA Flume Interceptor). I've seen tons of people implement such
  apps in the last year, and it seemed easy. Now, perhaps we were doing
  it all wrong... but I'd like to know how :)
 
  If we are talking about a bigger story (i.e. DSL, real
   stream-processing, etc), that's a different discussion. I've seen a
  bunch of misconceptions about SparkStreaming in this discussion, and I
  have some thoughts in that regard, but I'd rather not go into that if
   that's outside the scope of this KIP.
 
  Gwen
 
 
  On Fri, Jul 24, 2015 at 9:48 AM, Guozhang Wang wangg...@gmail.com
 wrote:
   Hi Ewen,
  
   Replies inlined.
  
   On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava 
  e...@confluent.io
   wrote:
  
   Just some notes on the KIP doc itself:
  
   * It'd be useful to clarify at what point the plain consumer + custom
  code
   + producer breaks down. I think trivial filtering and aggregation on a
   single stream usually work fine with this model. Anything where you
 need
   more complex joins, windowing, etc. are where it breaks down. I think
  most
   interesting applications require that functionality, but it's helpful
 to
   make this really clear in the motivation -- right now, Kafka only
  provides
   the lowest level plumbing for stream processing applications, so most
   interesting apps require very heavyweight frameworks.
  
  
    I think for users to efficiently express complex logic like joins,
    windowing, etc., a higher-level programming interface beyond the process()
    interface would definitely be better, but that does not necessarily require
    a heavyweight framework, which usually includes more than just the
    high-level functional programming model. I would argue that an alternative
    solution would be better for users who want a high-level programming
    interface but not a heavyweight stream processing framework: the processor
    library plus another DSL layer on top of it.
  
  
  
   * I think the feature comparison of plain producer/consumer, stream
   processing frameworks, and this new library is a good start, but we
  might
   want something more thorough and structured, like a feature matrix.
  Right
   now it's hard to figure out exactly how they relate to each other.
  
  
   Cool, I can do that.
  
  
   * I'd personally push the library vs. framework story very strongly --
  the
   total buy-in and weak integration story of stream processing
 frameworks
  is
   a big downside and makes a library a really compelling (and currently
   unavailable, as far as I am aware) alternative.
  
  
    Are you suggesting there is still some content missing about the
   motivations of adding the proposed library in the wiki page?
  
  
   * Comment about in-memory storage of other 

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-28 Thread Guozhang Wang
Hi Adi,

Just to clarify, the cmdline tool would be used, as stated in the wiki
page, to run the client library as a process, which is still far away
from a service. It is just like what we have for kafka-console-producer,
kafka-console-consumer, kafka-mirror-maker, etc today.

Guozhang

On Mon, Jul 27, 2015 at 10:46 PM, Aditya Auradkar 
aaurad...@linkedin.com.invalid wrote:

 +1 on comparison with existing solutions. On a high level, it seems nice to
 have a transform library inside Kafka; a lot of the building blocks are
 already there to build a stream processing framework. However, the details
 are tricky to get right. I think this discussion will get a lot more
 interesting when we have something concrete to look at. I'm +1 for the
 general idea.
 How far away are we from having a prototype patch to play with?

 Couple of observations:
 - Since the input source for each processor is always Kafka, you get basic
 client-side partition management out of the box if it uses the high-level
 consumer.
 - The KIP states that cmd line tools will be provided to deploy as a
 separate service. Is the proposed scope limited to providing a library
 which makes it possible to build stream-processing-as-a-service, or to
 provide such a service within Kafka itself?

 Aditya

 On Mon, Jul 27, 2015 at 8:20 PM, Gwen Shapira gshap...@cloudera.com
 wrote:

  Hi,
 
  Since we will be discussing KIP-28 in the call tomorrow, can you
   update the KIP with the feature comparison with existing solutions?
   I admit that I do not see a need for a single-event producer-consumer
   pair (AKA Flume Interceptor). I've seen tons of people implement such
  apps in the last year, and it seemed easy. Now, perhaps we were doing
  it all wrong... but I'd like to know how :)
 
  If we are talking about a bigger story (i.e. DSL, real
   stream-processing, etc), that's a different discussion. I've seen a
  bunch of misconceptions about SparkStreaming in this discussion, and I
  have some thoughts in that regard, but I'd rather not go into that if
   that's outside the scope of this KIP.
 
  Gwen
 
 
  On Fri, Jul 24, 2015 at 9:48 AM, Guozhang Wang wangg...@gmail.com
 wrote:
   Hi Ewen,
  
   Replies inlined.
  
   On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava 
  e...@confluent.io
   wrote:
  
   Just some notes on the KIP doc itself:
  
   * It'd be useful to clarify at what point the plain consumer + custom
  code
   + producer breaks down. I think trivial filtering and aggregation on a
   single stream usually work fine with this model. Anything where you
 need
   more complex joins, windowing, etc. are where it breaks down. I think
  most
   interesting applications require that functionality, but it's helpful
 to
   make this really clear in the motivation -- right now, Kafka only
  provides
   the lowest level plumbing for stream processing applications, so most
   interesting apps require very heavyweight frameworks.
  
  
    I think for users to efficiently express complex logic like joins,
    windowing, etc., a higher-level programming interface beyond the process()
    interface would definitely be better, but that does not necessarily require
    a heavyweight framework, which usually includes more than just the
    high-level functional programming model. I would argue that an alternative
    solution would be better for users who want a high-level programming
    interface but not a heavyweight stream processing framework: the processor
    library plus another DSL layer on top of it.
  
  
  
   * I think the feature comparison of plain producer/consumer, stream
   processing frameworks, and this new library is a good start, but we
  might
   want something more thorough and structured, like a feature matrix.
  Right
   now it's hard to figure out exactly how they relate to each other.
  
  
   Cool, I can do that.
  
  
   * I'd personally push the library vs. framework story very strongly --
  the
   total buy-in and weak integration story of stream processing
 frameworks
  is
   a big downside and makes a library a really compelling (and currently
   unavailable, as far as I am aware) alternative.
  
  
    Are you suggesting there is still some content missing about the
   motivations of adding the proposed library in the wiki page?
  
  
   * Comment about in-memory storage of other frameworks is interesting
 --
  it
   is specific to the framework, but is supposed to also give performance
   benefits. The high-level functional processing interface would allow
 for
   combining multiple operations when there's no shuffle, but when there
  is a
   shuffle, we'll always be writing to Kafka, right? Spark (and
 presumably
   spark streaming) is supposed to get a big win by handling shuffles
 such
   that the data just stays in cache and never actually hits disk, or at
  least
   hits disk in the background. Will we take a hit because we always
 write
  to
   Kafka?
  
  
   I agree with Neha's 

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-28 Thread Yi Pan
Hi, Jay,

{quote}
1. Yeah we are going to try to generalize the partition management stuff.
We'll get a wiki/JIRA up for that. I think that gives what you want in
terms of moving partitioning to the client side.
{quote}
Great! I am looking forward to that.

{quote}
I think the key observation is that the whole reason
LinkedIn split data over clusters to begin with was because of the lack of
quotas, which are in any case getting implemented.
{quote}
I am not sure that I followed this point. Is your point that with quotas, it
is possible to host all data in a single cluster?

-Yi

On Mon, Jul 27, 2015 at 8:53 AM, Jay Kreps j...@confluent.io wrote:

 Hey Yi,

 Great points. I think for some of this the most useful thing would be to
 get a WIP prototype out that we could discuss concretely. I think Yasuhiro
 and Guozhang took that prototype I had done, and had some improvements.
 Give us a bit to get that into understandable shape so we can discuss.

 To address a few of your other points:
 1. Yeah we are going to try to generalize the partition management stuff.
 We'll get a wiki/JIRA up for that. I think that gives what you want in
 terms of moving partitioning to the client side.
 2. I think consuming from a different cluster than the one you produce to will be easy.
 More than that is more complex, though I agree the pluggable partitioning
 makes it theoretically possible. Let's try to get something that works for
 the first case, it sounds like that solves the use case you describe of
 wanting to directly transform from a given cluster but produce back to a
 different cluster. I think the key observation is that the whole reason
 LinkedIn split data over clusters to begin with was because of the lack of
 quotas, which are in any case getting implemented.

 -Jay

 On Sun, Jul 26, 2015 at 11:31 PM, Yi Pan nickpa...@gmail.com wrote:

  Hi, Jay and all,
 
  Thanks for all your quick responses. I tried to summarize my thoughts
 here:
 
  - ConsumerRecord as stream processor API:
 
  * This KafkaProcessor API is targeted to receive messages from
 Kafka.
  So, to Yasuhiro's join/transformation example, any join/transformation
  results that are materialized in Kafka should have ConsumerRecord format
  (i.e. w/ topic and offsets). Any non-materialized join/transformation
  results should not be processed by this KafkaProcessor API. One example
 is
  the in-memory operators API in Samza, which is designed to handle the
  non-materialized join/transformation results. And yes, in this case, a
 more
  abstract data model is needed.
 
 * Just to support Jay's point of a general
  ConsumerRecord/ProducerRecord, a general stream processing on more than
 one
  data sources would need at least the following info: data source
  description (i.e. which topic/table), and actual data (i.e. key-value
  pairs). It would make sense to have the data source name as part of the
  general metadata in stream processing (think about it as the table name
 for
  records in standard SQL).
 
  - SQL/DSL
 
 * I think that this topic itself is worthy of another KIP discussion.
 I
  would prefer to leave it out of scope in KIP-28.
 
  - Client-side pluggable partition manager
 
 * Given the use cases we have seen with large-scale deployment of
  Samza/Kafka in LinkedIn, I would argue that we should make it as the
  first-class citizen in this KIP. The use cases include:
 
* multi-cluster Kafka
 
* host-affinity (i.e. local-state associated w/ certain partitions
 on
  client)
 
  - Multi-cluster scenario
 
 * Although I originally just brought it up as a use case that requires
  client-side partition manager, reading Jay’s comments, I realized that I
  have one fundamental issue w/ the current copycat + transformation model.
  If I interpret Jay’s comment correctly, the proposed
 copycat+transformation
  plays out in the following way: i) copycat takes all data from sources
 (no
  matter it is Kafka or non-Kafka) into *one single Kafka cluster*; ii)
  transformation is only restricted to take data sources in *this single
  Kafka cluster* to perform aggregate/join etc. This is different from my
  original understanding of the copycat. The main issue I have with this
  model is: huge data-copy between Kafka clusters. In LinkedIn, we used to
  follow this model that uses MirrorMaker to map topics from tracking
  clusters to Samza-specific Kafka cluster and only do stream processing in
  the Samza-specific Kafka cluster. We moved away from this model and
 started
  allowing users to directly consume from tracking Kafka clusters due to
 the
  overhead of copying huge amount of traffic between Kafka clusters. I
 agree
  that the initial design of KIP-28 would probably need a smaller scope of
  problem to solve, hence, limiting to solving partition management in a
  single cluster. However, I would really hope the design won’t prevent the
  use case of processing data directly from multiple clusters. In my
 opinion,
  making 

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-28 Thread Gwen Shapira
Thanks Guozhang! Much clearer now, at least for me.

Few comments / questions:

1. Perhaps punctuate(int numRecords) will be a nice API addition; some
use cases have record-count-based windows rather than time-based ones
(see the sketch below).
2. The diagram for Flexible partition distribution shows two joins.
Is the idea to implement two Processors and string them together?
3. Is the local state persistent? Can you talk a bit about how local
state works with high availability?

Gwen
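
A hypothetical sketch of the record-count punctuate from question 1,
written against a Processor-like shape from the KIP discussion; none of
these names are the KIP's actual API, and punctuate(int) in particular is
only the suggested addition:

    // Hypothetical: fire a count-based "punctuate" every N records
    // instead of every T milliseconds.
    public abstract class CountPunctuatingProcessor {
        private final int interval; // punctuate every `interval` records
        private int seen = 0;

        protected CountPunctuatingProcessor(int interval) {
            this.interval = interval;
        }

        public final void process(String key, String value) {
            handle(key, value);       // user-defined per-record logic
            if (++seen >= interval) {
                seen = 0;
                punctuate(interval);  // record-count window boundary
            }
        }

        protected abstract void handle(String key, String value);

        protected abstract void punctuate(int numRecords);
    }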

On Tue, Jul 28, 2015 at 12:57 AM, Guozhang Wang wangg...@gmail.com wrote:
 I have updated the wiki page incorporating people's comments, please feel
 free to take another look before today's meeting.

 On Mon, Jul 27, 2015 at 11:19 PM, Yi Pan nickpa...@gmail.com wrote:

 Hi, Jay,

 {quote}
 1. Yeah we are going to try to generalize the partition management stuff.
 We'll get a wiki/JIRA up for that. I think that gives what you want in
 terms of moving partitioning to the client side.
 {quote}
 Great! I am looking forward to that.

 {quote}
 I think the key observation is that the whole reason
 LinkedIn split data over clusters to begin with was because of the lack of
 quotas, which are in any case getting implemented.
 {quote}
 I am not sure that I followed this point. Is your point that with quotas, it
 is possible to host all data in a single cluster?

 -Yi

 On Mon, Jul 27, 2015 at 8:53 AM, Jay Kreps j...@confluent.io wrote:

  Hey Yi,
 
  Great points. I think for some of this the most useful thing would be to
  get a WIP prototype out that we could discuss concretely. I think
 Yasuhiro
  and Guozhang took that prototype I had done, and had some improvements.
  Give us a bit to get that into understandable shape so we can discuss.
 
  To address a few of your other points:
  1. Yeah we are going to try to generalize the partition management stuff.
  We'll get a wiki/JIRA up for that. I think that gives what you want in
  terms of moving partitioning to the client side.
  2. I think consuming from a different cluster than the one you produce to
  will be easy.
  More than that is more complex, though I agree the pluggable partitioning
  makes it theoretically possible. Let's try to get something that works
 for
  the first case, it sounds like that solves the use case you describe of
  wanting to directly transform from a given cluster but produce back to a
  different cluster. I think the key observation is that the whole reason
  LinkedIn split data over clusters to begin with was because of the lack
 of
  quotas, which are in any case getting implemented.
 
  -Jay
 
  On Sun, Jul 26, 2015 at 11:31 PM, Yi Pan nickpa...@gmail.com wrote:
 
   Hi, Jay and all,
  
   Thanks for all your quick responses. I tried to summarize my thoughts
  here:
  
   - ConsumerRecord as stream processor API:
  
   * This KafkaProcessor API is targeted to receive messages from
  Kafka.
   So, to Yasuhiro's join/transformation example, any join/transformation
   results that are materialized in Kafka should have ConsumerRecord
 format
   (i.e. w/ topic and offsets). Any non-materialized join/transformation
   results should not be processed by this KafkaProcessor API. One example
  is
   the in-memory operators API in Samza, which is designed to handle the
   non-materialized join/transformation results. And yes, in this case, a
  more
   abstract data model is needed.
  
  * Just to support Jay's point of a general
   ConsumerRecord/ProducerRecord, a general stream processing on more than
  one
   data sources would need at least the following info: data source
   description (i.e. which topic/table), and actual data (i.e. key-value
   pairs). It would make sense to have the data source name as part of the
   general metadata in stream processing (think about it as the table name
  for
   records in standard SQL).
  
   - SQL/DSL
  
  * I think that this topic itself is worthy of another KIP
 discussion.
  I
   would prefer to leave it out of scope in KIP-28.
  
   - Client-side pluggable partition manager
  
  * Given the use cases we have seen with large-scale deployment of
   Samza/Kafka in LinkedIn, I would argue that we should make it as the
   first-class citizen in this KIP. The use cases include:
  
 * multi-cluster Kafka
  
 * host-affinity (i.e. local-state associated w/ certain
 partitions
  on
   client)
  
   - Multi-cluster scenario
  
  * Although I originally just brought it up as a use case that
 requires
   client-side partition manager, reading Jay’s comments, I realized that
 I
   have one fundamental issue w/ the current copycat + transformation
 model.
   If I interpret Jay’s comment correctly, the proposed
  copycat+transformation
   plays out in the following way: i) copycat takes all data from sources
  (no
   matter it is Kafka or non-Kafka) into *one single Kafka cluster*; ii)
   transformation is only restricted to take data sources in *this single
   Kafka cluster* to perform aggregate/join etc. This is different 

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-27 Thread Aditya Auradkar
+1 on comparison with existing solutions. On a high level, it seems nice to
have a transform library inside Kafka; a lot of the building blocks are
already there to build a stream processing framework. However, the details
are tricky to get right. I think this discussion will get a lot more
interesting when we have something concrete to look at. I'm +1 for the
general idea.
How far away are we from having a prototype patch to play with?

Couple of observations:
- Since the input source for each processor is always Kafka, you get basic
client-side partition management out of the box if it uses the high-level
consumer.
- The KIP states that cmd line tools will be provided to deploy as a
separate service. Is the proposed scope limited to providing a library
which makes it possible to build stream-processing-as-a-service, or to
provide such a service within Kafka itself?

Aditya

On Mon, Jul 27, 2015 at 8:20 PM, Gwen Shapira gshap...@cloudera.com wrote:

 Hi,

 Since we will be discussing KIP-28 in the call tomorrow, can you
  update the KIP with the feature comparison with existing solutions?
  I admit that I do not see a need for a single-event producer-consumer
  pair (AKA Flume Interceptor). I've seen tons of people implement such
 apps in the last year, and it seemed easy. Now, perhaps we were doing
 it all wrong... but I'd like to know how :)

 If we are talking about a bigger story (i.e. DSL, real
  stream-processing, etc), that's a different discussion. I've seen a
 bunch of misconceptions about SparkStreaming in this discussion, and I
 have some thoughts in that regard, but I'd rather not go into that if
  that's outside the scope of this KIP.

 Gwen


 On Fri, Jul 24, 2015 at 9:48 AM, Guozhang Wang wangg...@gmail.com wrote:
  Hi Ewen,
 
  Replies inlined.
 
  On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava 
 e...@confluent.io
  wrote:
 
  Just some notes on the KIP doc itself:
 
  * It'd be useful to clarify at what point the plain consumer + custom
 code
  + producer breaks down. I think trivial filtering and aggregation on a
  single stream usually work fine with this model. Anything where you need
  more complex joins, windowing, etc. are where it breaks down. I think
 most
  interesting applications require that functionality, but it's helpful to
  make this really clear in the motivation -- right now, Kafka only
 provides
  the lowest level plumbing for stream processing applications, so most
  interesting apps require very heavyweight frameworks.
 
 
   I think for users to efficiently express complex logic like joins,
   windowing, etc., a higher-level programming interface beyond the process()
   interface would definitely be better, but that does not necessarily require
   a heavyweight framework, which usually includes more than just the
   high-level functional programming model. I would argue that an alternative
   solution would be better for users who want a high-level programming
   interface but not a heavyweight stream processing framework: the processor
   library plus another DSL layer on top of it.
 
 
 
  * I think the feature comparison of plain producer/consumer, stream
  processing frameworks, and this new library is a good start, but we
 might
  want something more thorough and structured, like a feature matrix.
 Right
  now it's hard to figure out exactly how they relate to each other.
 
 
  Cool, I can do that.
 
 
  * I'd personally push the library vs. framework story very strongly --
 the
  total buy-in and weak integration story of stream processing frameworks
 is
  a big downside and makes a library a really compelling (and currently
  unavailable, as far as I am aware) alternative.
 
 
   Are you suggesting there is still some content missing about the
  motivations of adding the proposed library in the wiki page?
 
 
  * Comment about in-memory storage of other frameworks is interesting --
 it
  is specific to the framework, but is supposed to also give performance
  benefits. The high-level functional processing interface would allow for
  combining multiple operations when there's no shuffle, but when there
 is a
  shuffle, we'll always be writing to Kafka, right? Spark (and presumably
  spark streaming) is supposed to get a big win by handling shuffles such
  that the data just stays in cache and never actually hits disk, or at
 least
  hits disk in the background. Will we take a hit because we always write
 to
  Kafka?
 
 
  I agree with Neha's comments here. One more point I want to make is
  materializing to Kafka is not necessarily much worse than keeping data in
  memory if the downstream consumption is caught up such that most of the
  reads will be hitting file cache. I remember Samza has illustrated that
  under such scenarios its throughput is actually quite comparable to Spark
  Streaming / Storm.
 
 
  * I really struggled with the structure of the KIP template with Copycat
  because the flow doesn't work well for proposals like 

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-27 Thread Neha Narkhede
Gwen,

We have a compilation of notes from comparison with other systems. They
might be missing details that folks who worked on that system might be able
to point out. We can share that and discuss further on the KIP call.

We do hope to include a DSL since that is the most natural way of
expressing stream processing operations on top of the processor client. The
DSL layer should be equivalent to that provided by Spark streaming or Flink
in terms of expressiveness though there will be differences in
implementation. Our client is intended to be simpler, with minimum external
dependencies since it integrates closely with Kafka. This is really what
most application development is hoping to get - a lightweight library on
top of Kafka that allows them to process streams of data.

Thanks
Neha

On Mon, Jul 27, 2015 at 8:20 PM, Gwen Shapira gshap...@cloudera.com wrote:

 Hi,

 Since we will be discussing KIP-28 in the call tomorrow, can you
  update the KIP with the feature comparison with existing solutions?
  I admit that I do not see a need for a single-event producer-consumer
  pair (AKA Flume Interceptor). I've seen tons of people implement such
 apps in the last year, and it seemed easy. Now, perhaps we were doing
 it all wrong... but I'd like to know how :)

 If we are talking about a bigger story (i.e. DSL, real
 stream-processing, etc.), that's a different discussion. I've seen a
 bunch of misconceptions about SparkStreaming in this discussion, and I
 have some thoughts in that regard, but I'd rather not go into that if
 that's outside the scope of this KIP.

 Gwen


 On Fri, Jul 24, 2015 at 9:48 AM, Guozhang Wang wangg...@gmail.com wrote:
  Hi Ewen,
 
  Replies inlined.
 
  On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava 
 e...@confluent.io
  wrote:
 
  Just some notes on the KIP doc itself:
 
  * It'd be useful to clarify at what point the plain consumer + custom
 code
  + producer breaks down. I think trivial filtering and aggregation on a
  single stream usually work fine with this model. Anything where you need
  more complex joins, windowing, etc. are where it breaks down. I think
 most
  interesting applications require that functionality, but it's helpful to
  make this really clear in the motivation -- right now, Kafka only
 provides
  the lowest level plumbing for stream processing applications, so most
  interesting apps require very heavyweight frameworks.
 
 
  I think for users to efficiently express complex logic like joins,
  windowing, etc., a higher-level programming interface beyond the process()
  interface would definitely be better, but that does not necessarily require
  a heavyweight framework, which usually includes more than just the
  high-level functional programming model. I would argue that an alternative
  solution would be better for users who want some high-level
  programming interface but not a heavyweight stream processing framework:
  the processor library plus another DSL layer on top of it.
 
 
 
  * I think the feature comparison of plain producer/consumer, stream
  processing frameworks, and this new library is a good start, but we
 might
  want something more thorough and structured, like a feature matrix.
 Right
  now it's hard to figure out exactly how they relate to each other.
 
 
  Cool, I can do that.
 
 
  * I'd personally push the library vs. framework story very strongly --
 the
  total buy-in and weak integration story of stream processing frameworks
 is
  a big downside and makes a library a really compelling (and currently
  unavailable, as far as I am aware) alternative.
 
 
  Are you suggesting there is still some content missing about the
  motivations for adding the proposed library in the wiki page?
 
 
  * Comment about in-memory storage of other frameworks is interesting --
 it
  is specific to the framework, but is supposed to also give performance
  benefits. The high-level functional processing interface would allow for
  combining multiple operations when there's no shuffle, but when there
 is a
  shuffle, we'll always be writing to Kafka, right? Spark (and presumably
  spark streaming) is supposed to get a big win by handling shuffles such
  that the data just stays in cache and never actually hits disk, or at
 least
  hits disk in the background. Will we take a hit because we always write
 to
  Kafka?
 
 
  I agree with Neha's comments here. One more point I want to make is that
  materializing to Kafka is not necessarily much worse than keeping data in
  memory if the downstream consumption is caught up such that most of the
  reads will be hitting file cache. I remember Samza has illustrated that
  under such scenarios its throughput is actually quite comparable to Spark
  Streaming / Storm.
 
 
  * I really struggled with the structure of the KIP template with Copycat
  because the flow doesn't work well for proposals like this. They aren't
 as
  concrete changes as the KIP template was designed for. I'd completely
  ignore that template in favor of optimizing for clarity if I were you.

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-27 Thread Jay Kreps
Hey Yi,

Great points. I think for some of this the most useful thing would be to
get a wip prototype out that we could discuss concretely. I think Yasuhiro
and Guozhang took that prototype I had done and made some improvements.
Give us a bit to get that into understandable shape so we can discuss.

To address a few of your other points:
1. Yeah we are going to try to generalize the partition management stuff.
We'll get a wiki/JIRA up for that. I think that gives you what you want in
terms of moving partitioning to the client side.
2. I think consuming from a cluster different from the one you produce to will be easy.
More than that is more complex, though I agree the pluggable partitioning
makes it theoretically possible. Let's try to get something that works for
the first case, it sounds like that solves the use case you describe of
wanting to directly transform from a given cluster but produce back to a
different cluster. I think the key observation is that the whole reason
LinkedIn split data over clusters to begin with was because of the lack of
quotas, which are in any case getting implemented.

-Jay

On Sun, Jul 26, 2015 at 11:31 PM, Yi Pan nickpa...@gmail.com wrote:

 Hi, Jay and all,

 Thanks for all your quick responses. I tried to summarize my thoughts here:

 - ConsumerRecord as stream processor API:

 * This KafkaProcessor API is targeted at receiving messages from Kafka.
 So, to Yasuhiro's join/transformation example, any join/transformation
 results that are materialized in Kafka should have ConsumerRecord format
 (i.e. w/ topic and offsets). Any non-materialized join/transformation
 results should not be processed by this KafkaProcessor API. One example is
 the in-memory operators API in Samza, which is designed to handle the
  non-materialized join/transformation results. And yes, in this case, a more
 abstract data model is needed.

* Just to support Jay's point of a general
  ConsumerRecord/ProducerRecord, general stream processing on more than one
  data source would need at least the following info: data source
 description (i.e. which topic/table), and actual data (i.e. key-value
 pairs). It would make sense to have the data source name as part of the
 general metadata in stream processing (think about it as the table name for
 records in standard SQL).

 - SQL/DSL

* I think that this topic itself is worthy of another KIP discussion. I
 would prefer to leave it out of scope in KIP-28.

 - Client-side pluggable partition manager

* Given the use cases we have seen with large-scale deployment of
  Samza/Kafka in LinkedIn, I would argue that we should make it a
  first-class citizen in this KIP. The use cases include:

   * multi-cluster Kafka

   * host-affinity (i.e. local-state associated w/ certain partitions on
 client)

 - Multi-cluster scenario

* Although I originally just brought it up as a use case that requires
 client-side partition manager, reading Jay’s comments, I realized that I
 have one fundamental issue w/ the current copycat + transformation model.
 If I interpret Jay’s comment correctly, the proposed copycat+transformation
  plays out in the following way: i) copycat takes all data from sources
  (whether Kafka or non-Kafka) into *one single Kafka cluster*; ii)
 transformation is only restricted to take data sources in *this single
 Kafka cluster* to perform aggregate/join etc. This is different from my
 original understanding of the copycat. The main issue I have with this
 model is: huge data-copy between Kafka clusters. In LinkedIn, we used to
 follow this model that uses MirrorMaker to map topics from tracking
 clusters to Samza-specific Kafka cluster and only do stream processing in
 the Samza-specific Kafka cluster. We moved away from this model and started
 allowing users to directly consume from tracking Kafka clusters due to the
  overhead of copying huge amounts of traffic between Kafka clusters. I agree
  that the initial design of KIP-28 would probably need a smaller scope of
  problem to solve, hence limiting it to solving partition management in a
 single cluster. However, I would really hope the design won’t prevent the
 use case of processing data directly from multiple clusters. In my opinion,
 making the partition manager as a client-side pluggable logic would allow
 us to achieve these goals.

 Thanks a lot in advance!

 -Yi

 On Fri, Jul 24, 2015 at 11:13 AM, Jay Kreps j...@confluent.io wrote:

  Hey Yi,
 
  For your other two points:
 
  - This definitely doesn't cover any kind of SQL or anything like this.
 
  - The prototype we started with just had process() as a method but
 Yasuhiro
  had some ideas of adding additional filter/aggregate convenience methods.
  We should discuss how this would fit with the operator work you were
 doing
  in Samza. Probably the best way is just get the code out there in current
  state and start talking about it?
 
  - Your point about multiple clusters. We actually have a proposed
 extension
  for the Kafka group management protocol that would allow it to cover
  multiple clusters but actually I think that use case is not the focus. I
  think in scope would be consuming from one cluster and producing to
  another.

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-27 Thread Gwen Shapira
Hi,

Since we will be discussing KIP-28 in the call tomorrow, can you
update the KIP with the feature-comparison with existing solutions?
I admit that I do not see a need for a single-event-producer-consumer
pair (AKA Flume Interceptor). I've seen tons of people implement such
apps in the last year, and it seemed easy. Now, perhaps we were doing
it all wrong... but I'd like to know how :)

If we are talking about a bigger story (i.e. DSL, real
stream-processing, etc.), that's a different discussion. I've seen a
bunch of misconceptions about SparkStreaming in this discussion, and I
have some thoughts in that regard, but I'd rather not go into that if
that's outside the scope of this KIP.

Gwen


On Fri, Jul 24, 2015 at 9:48 AM, Guozhang Wang wangg...@gmail.com wrote:
 Hi Ewen,

 Replies inlined.

 On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava e...@confluent.io
 wrote:

 Just some notes on the KIP doc itself:

 * It'd be useful to clarify at what point the plain consumer + custom code
 + producer breaks down. I think trivial filtering and aggregation on a
 single stream usually work fine with this model. Anything where you need
 more complex joins, windowing, etc. are where it breaks down. I think most
 interesting applications require that functionality, but it's helpful to
 make this really clear in the motivation -- right now, Kafka only provides
 the lowest level plumbing for stream processing applications, so most
 interesting apps require very heavyweight frameworks.


 I think for users to efficiently express complex logic like joins,
 windowing, etc., a higher-level programming interface beyond the process()
 interface would definitely be better, but that does not necessarily require
 a heavyweight framework, which usually includes more than just the
 high-level functional programming model. I would argue that an alternative
 solution would be better for users who want some high-level
 programming interface but not a heavyweight stream processing framework:
 the processor library plus another DSL layer on top of it.



 * I think the feature comparison of plain producer/consumer, stream
 processing frameworks, and this new library is a good start, but we might
 want something more thorough and structured, like a feature matrix. Right
 now it's hard to figure out exactly how they relate to each other.


 Cool, I can do that.


 * I'd personally push the library vs. framework story very strongly -- the
 total buy-in and weak integration story of stream processing frameworks is
 a big downside and makes a library a really compelling (and currently
 unavailable, as far as I am aware) alternative.


 Are you suggesting there is still some content missing about the
 motivations for adding the proposed library in the wiki page?


 * Comment about in-memory storage of other frameworks is interesting -- it
 is specific to the framework, but is supposed to also give performance
 benefits. The high-level functional processing interface would allow for
 combining multiple operations when there's no shuffle, but when there is a
 shuffle, we'll always be writing to Kafka, right? Spark (and presumably
 spark streaming) is supposed to get a big win by handling shuffles such
 that the data just stays in cache and never actually hits disk, or at least
 hits disk in the background. Will we take a hit because we always write to
 Kafka?


 I agree with Neha's comments here. One more point I want to make is that
 materializing to Kafka is not necessarily much worse than keeping data in
 memory if the downstream consumption is caught up such that most of the
 reads will be hitting file cache. I remember Samza has illustrated that
 under such scenarios its throughput is actually quite comparable to Spark
 Streaming / Storm.


 * I really struggled with the structure of the KIP template with Copycat
 because the flow doesn't work well for proposals like this. They aren't as
 concrete changes as the KIP template was designed for. I'd completely
 ignore that template in favor of optimizing for clarity if I were you.

 -Ewen

 On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com wrote:

  Hi all,
 
  I just posted KIP-28: Add a transform client for data processing
  
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
  
  .
 
  The wiki page does not yet have the full design / implementation details,
  and this email is to kick-off the conversation on whether we should add
  this new client with the described motivations, and if yes what features
 /
  functionalities should be included.
 
  Looking forward to your feedback!
 
  -- Guozhang
 



 --
 Thanks,
 Ewen




 --
 -- Guozhang


Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-27 Thread Yi Pan
Hi, Jay and all,

Thanks for all your quick responses. I tried to summarize my thoughts here:

- ConsumerRecord as stream processor API:

   * This KafkaProcessor API is targeted at receiving messages from Kafka.
So, to Yasuhiro's join/transformation example, any join/transformation
results that are materialized in Kafka should have ConsumerRecord format
(i.e. w/ topic and offsets). Any non-materialized join/transformation
results should not be processed by this KafkaProcessor API. One example is
the in-memory operators API in Samza, which is designed to handle the
non-materialized join/transformation results. And yes, in this case, a more
abstract data model is needed.

   * Just to support Jay's point of a general
ConsumerRecord/ProducerRecord, general stream processing on more than one
data source would need at least the following info: data source
description (i.e. which topic/table), and actual data (i.e. key-value
pairs). It would make sense to have the data source name as part of the
general metadata in stream processing (think about it as the table name for
records in standard SQL).

- SQL/DSL

   * I think that this topic itself is worthy of another KIP discussion. I
would prefer to leave it out of scope in KIP-28.

- Client-side pluggable partition manager

   * Given the use cases we have seen with large-scale deployment of
Samza/Kafka in LinkedIn, I would argue that we should make it a
first-class citizen in this KIP. The use cases include:

  * multi-cluster Kafka

  * host-affinity (i.e. local-state associated w/ certain partitions on
client)

- Multi-cluster scenario

   * Although I originally just brought it up as a use case that requires
client-side partition manager, reading Jay’s comments, I realized that I
have one fundamental issue w/ the current copycat + transformation model.
If I interpret Jay’s comment correctly, the proposed copycat+transformation
plays out in the following way: i) copycat takes all data from sources
(whether Kafka or non-Kafka) into *one single Kafka cluster*; ii)
transformation is only restricted to take data sources in *this single
Kafka cluster* to perform aggregate/join etc. This is different from my
original understanding of the copycat. The main issue I have with this
model is: huge data-copy between Kafka clusters. In LinkedIn, we used to
follow this model that uses MirrorMaker to map topics from tracking
clusters to Samza-specific Kafka cluster and only do stream processing in
the Samza-specific Kafka cluster. We moved away from this model and started
allowing users to directly consume from tracking Kafka clusters due to the
overhead of copying huge amounts of traffic between Kafka clusters. I agree
that the initial design of KIP-28 would probably need a smaller scope of
problem to solve, hence limiting it to solving partition management in a
single cluster. However, I would really hope the design won’t prevent the
use case of processing data directly from multiple clusters. In my opinion,
making the partition manager as a client-side pluggable logic would allow
us to achieve these goals.

Thanks a lot in advance!

-Yi

On Fri, Jul 24, 2015 at 11:13 AM, Jay Kreps j...@confluent.io wrote:

 Hey Yi,

 For your other two points:

 - This definitely doesn't cover any kind of SQL or anything like this.

 - The prototype we started with just had process() as a method but Yasuhiro
 had some ideas of adding additional filter/aggregate convenience methods.
 We should discuss how this would fit with the operator work you were doing
 in Samza. Probably the best way is just get the code out there in current
 state and start talking about it?

 - Your point about multiple clusters. We actually have a proposed extension
 for the Kafka group management protocol that would allow it to cover
 multiple clusters but actually I think that use case is not the focus. I
 think in scope would be consuming from one cluster and producing to
 another.

 One of the assumptions we are making is that we will split into two
 categories:
 a. Ingress/egress which is handled by copycat
 b. Transformation which would be handled by this api

 I think there are a number of motivations for this
 - It is really hard to provide hard guarantees if you allow non-trivial
 aggregation coupled with the ingress/egress. So if you want to be able to
 do something that provides a kind of end-to-end exactly once guarantee
 (that's not really the right term but what people use) I think it will be
 really hard to do this across multiple systems (hello two-phase commit)
 - The APIs for ingest/egress end up needing to be really different for a
 first-class ingestion framework

 So the case where you have data coming from many systems including many
 Kafka clusters is just about how easy/hard it is to use copycat with the
 transformer api in the same program. I think this is something we should
 work out as part of the prototyping.

 -Jay

 On Fri, Jul 24, 2015 at 12:57 AM, Yi Pan 

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Jiangjie Qin
Hey Guozhang,

I just took a quick look at the KIP, is it very similar to mirror maker
with message handler?

Thanks,

Jiangjie (Becket) Qin

On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava e...@confluent.io
wrote:

 Just some notes on the KIP doc itself:

 * It'd be useful to clarify at what point the plain consumer + custom code
 + producer breaks down. I think trivial filtering and aggregation on a
 single stream usually work fine with this model. Anything where you need
 more complex joins, windowing, etc. are where it breaks down. I think most
 interesting applications require that functionality, but it's helpful to
 make this really clear in the motivation -- right now, Kafka only provides
 the lowest level plumbing for stream processing applications, so most
 interesting apps require very heavyweight frameworks.
 * I think the feature comparison of plain producer/consumer, stream
 processing frameworks, and this new library is a good start, but we might
 want something more thorough and structured, like a feature matrix. Right
 now it's hard to figure out exactly how they relate to each other.
 * I'd personally push the library vs. framework story very strongly -- the
 total buy-in and weak integration story of stream processing frameworks is
 a big downside and makes a library a really compelling (and currently
 unavailable, as far as I am aware) alternative.
 * Comment about in-memory storage of other frameworks is interesting -- it
 is specific to the framework, but is supposed to also give performance
 benefits. The high-level functional processing interface would allow for
 combining multiple operations when there's no shuffle, but when there is a
 shuffle, we'll always be writing to Kafka, right? Spark (and presumably
 spark streaming) is supposed to get a big win by handling shuffles such
 that the data just stays in cache and never actually hits disk, or at least
 hits disk in the background. Will we take a hit because we always write to
 Kafka?
 * I really struggled with the structure of the KIP template with Copycat
 because the flow doesn't work well for proposals like this. They aren't as
 concrete changes as the KIP template was designed for. I'd completely
 ignore that template in favor of optimizing for clarity if I were you.

 -Ewen

 On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com wrote:

  Hi all,
 
  I just posted KIP-28: Add a transform client for data processing
  
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
  
  .
 
  The wiki page does not yet have the full design / implementation details,
  and this email is to kick-off the conversation on whether we should add
  this new client with the described motivations, and if yes what features
 /
  functionalities should be included.
 
  Looking forward to your feedback!
 
  -- Guozhang
 



 --
 Thanks,
 Ewen



Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Neha Narkhede
Ewen:

* I think trivial filtering and aggregation on a single stream usually work
 fine with this model.


The way I see this, the process() API is an abstraction for
message-at-a-time computations. In the future, you could imagine providing
a simple DSL layer on top of the process() API that provides a set of APIs
for stream processing operations on sets of messages like joins, windows
and various aggregations.
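
For concreteness, here is a minimal sketch of what such a DSL layer might
look like; every name below (KStream, Predicate, Aggregator, etc.) is made
up for illustration and is not part of the KIP:

    interface Predicate<K, V> { boolean test(K key, V value); }
    interface Mapper<V, R> { R apply(V value); }
    interface Aggregator<K, V, A> { A add(A aggregate, K key, V value); }

    interface KStream<K, V> {
        // message-at-a-time operations map directly onto process()
        KStream<K, V> filter(Predicate<K, V> predicate);
        <R> KStream<K, R> mapValues(Mapper<V, R> mapper);
        // operations over sets of messages need a window plus local state
        <A> KStream<K, A> aggregateByKey(Aggregator<K, V, A> aggregator, long windowMs);
        // materialize the derived stream back into a Kafka topic
        void sendTo(String topic);
    }

A user could then write something like
clicks.filter((k, v) -> v != null).aggregateByKey(counter, 60000L).sendTo("click-counts"),
with the library translating each operator into process()-level logic.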

* Spark (and presumably
 spark streaming) is supposed to get a big win by handling shuffles such
 that the data just stays in cache and never actually hits disk, or at least
 hits disk in the background. Will we take a hit because we always write to
 Kafka?


The goal isn't so much about forcing materialization of intermediate
results into Kafka but designing the API to integrate with Kafka to allow
such materialization, wherever that might be required. The downside with
other stream processing frameworks is that they have weak integration with
Kafka where interaction with Kafka is only at the endpoints of processing
(first input, final output). Any intermediate operations that might benefit
from persisting intermediate results into Kafka are forced to be broken up
into 2 separate topologies/plans/stages of processing that lead to more
jobs. The implication is that the set of stream processing operations
that should really have lived in one job per application is now split up
across several piecemeal jobs that need to be monitored, managed and
operated separately. The APIs should still allow in-memory storage of
intermediate results where they make sense.

Jiangjie,

I just took a quick look at the KIP, is it very similar to mirror maker
 with message handler?


Not really. I wouldn't say it is similar, but mirror maker is a special
instance of using copycat with Kafka source, sink + optionally the
process() API. I can imagine replacing MirrorMaker, in due course,
with copycat + process().
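
To make that concrete, mirroring reduces to an identity process() step. In
this sketch the Context interface is a stand-in for whatever producer-side
handle the processor API ends up exposing; nothing here is real API:

    import org.apache.kafka.clients.consumer.ConsumerRecord;

    interface Context {
        void send(String topic, byte[] key, byte[] value);  // producer facade
    }

    class MirrorProcessor {
        // a MirrorMaker message handler would filter or rewrite here;
        // plain mirroring just forwards each record unchanged
        void process(ConsumerRecord<byte[], byte[]> record, Context context) {
            context.send(record.topic(), record.key(), record.value());
        }
    }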

Thanks,
Neha

On Thu, Jul 23, 2015 at 11:32 PM, Jiangjie Qin j...@linkedin.com.invalid
wrote:

 Hey Guozhang,

 I just took a quick look at the KIP, is it very similar to mirror maker
 with message handler?

 Thanks,

 Jiangjie (Becket) Qin

 On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava e...@confluent.io
 
 wrote:

  Just some notes on the KIP doc itself:
 
  * It'd be useful to clarify at what point the plain consumer + custom
 code
  + producer breaks down. I think trivial filtering and aggregation on a
  single stream usually work fine with this model. Anything where you need
  more complex joins, windowing, etc. are where it breaks down. I think
 most
  interesting applications require that functionality, but it's helpful to
  make this really clear in the motivation -- right now, Kafka only
 provides
  the lowest level plumbing for stream processing applications, so most
  interesting apps require very heavyweight frameworks.
  * I think the feature comparison of plain producer/consumer, stream
  processing frameworks, and this new library is a good start, but we might
  want something more thorough and structured, like a feature matrix. Right
  now it's hard to figure out exactly how they relate to each other.
  * I'd personally push the library vs. framework story very strongly --
 the
  total buy-in and weak integration story of stream processing frameworks
 is
  a big downside and makes a library a really compelling (and currently
  unavailable, as far as I am aware) alternative.
  * Comment about in-memory storage of other frameworks is interesting --
 it
  is specific to the framework, but is supposed to also give performance
  benefits. The high-level functional processing interface would allow for
  combining multiple operations when there's no shuffle, but when there is
 a
  shuffle, we'll always be writing to Kafka, right? Spark (and presumably
  spark streaming) is supposed to get a big win by handling shuffles such
  that the data just stays in cache and never actually hits disk, or at
 least
  hits disk in the background. Will we take a hit because we always write
 to
  Kafka?
  * I really struggled with the structure of the KIP template with Copycat
  because the flow doesn't work well for proposals like this. They aren't
 as
  concrete changes as the KIP template was designed for. I'd completely
  ignore that template in favor of optimizing for clarity if I were you.
 
  -Ewen
 
  On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com
 wrote:
 
   Hi all,
  
   I just posted KIP-28: Add a transform client for data processing
   
  
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
   
   .
  
   The wiki page does not yet have the full design / implementation
 details,
   and this email is to kick-off the conversation on whether we should add
   this new client with the described motivations, and if yes what
 features
  /
   functionalities should be included.
  
   Looking forward to your feedback!

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Yi Pan
Hi, Guozhang,

Thanks for starting this. I took a quick look and had the following
thoughts to share:

- In the proposed KafkaProcessor API, there is no interface like Collector
that allows users to send messages to. Why is that? Is the idea to
initialize the producer once and re-use it in the processor? And if there
are many KStreamThreads in the process, are there going to be many
instances of KafkaProducer although all outputs are sending to the same
Kafka cluster?

- Won’t it be simpler if the process() API just takes in the ConsumerRecord
as the input instead of a tuple of (topic, key, value)?

- Also, the input only indicates the topic of a message. What if the stream
task needs to consume and produce messages from/to multiple Kafka clusters?
To support that case, there should be a system/cluster name in both input
and output as well.

- How are the output messages handled? There does not seem to be an
interface that allows users to send output messages to multiple output
Kafka clusters.

- It seems the proposed model also assumes one thread per processor. What
becomes thread-local and what are shared among processors? Is the proposed
model targeting to have the consumers/producers become thread-local
instances within each KafkaProcessor? What’s the cost associated with this
model?

- One more important issue: how do we plug in client-side partition
management logic? Considering the use case where the stream task
needs to consume from multiple Kafka clusters, I am not even sure that we
can rely on Kafka broker to maintain the consumer group membership? Maybe
we still can get the per cluster consumer group membership and partitions.
However, in this case, we truly need a client-side plugin partition
management logic to determine how to assign partitions in different Kafka
clusters to consumers (i.e. consumers for cluster1.topic1.p1 and
cluster2.topic2.p1 have to be assigned together to one KafkaProcessor for
processing). Based on the full information about (group members, all topic
partitions) in all Kafka clusters with input topics, there should be two
levels of partition management policies: a) how to group all topic
partitions in all Kafka clusters to processor groups (i.e. the same concept
as Task group in Samza); b) how to assign the processor groups to group
members. Note if a processor group includes topic partitions from more than
one Kafka cluster, it has to be assigned to the common group members in
all relevant Kafka clusters. This cannot be done just by the brokers in a
single Kafka cluster (see the sketch after this list).

- It seems that the intention of this KIP is also trying to put SQL/DSL
libraries into Kafka. Why is that? Shouldn't Kafka be more focused on hiding
system-level integration details and leave it open for any additional
modules outside the Kafka core to enrich the functionality that is
user-facing?
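
Returning to the client-side partition management point above, a rough
sketch of such a pluggable interface, with all names hypothetical, could
capture the two levels like this:

    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // (cluster, topic, partition) triple, since a plain TopicPartition
    // cannot distinguish between Kafka clusters
    final class ClusterPartition {
        final String cluster;
        final String topic;
        final int partition;
        ClusterPartition(String cluster, String topic, int partition) {
            this.cluster = cluster;
            this.topic = topic;
            this.partition = partition;
        }
    }

    interface PartitionManager {
        // level a: group partitions (possibly spanning clusters) into
        // processor groups that must be processed together, e.g.
        // cluster1.topic1.p1 with cluster2.topic2.p1 for a cross-cluster join
        List<Set<ClusterPartition>> groupPartitions(Set<ClusterPartition> allPartitions);

        // level b: assign each processor group to a consumer-group member
        Map<String, List<Set<ClusterPartition>>> assignToMembers(
                List<String> memberIds, List<Set<ClusterPartition>> processorGroups);
    }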

Just a few quick cents. Thanks a lot!

-Yi

On Fri, Jul 24, 2015 at 12:12 AM, Neha Narkhede n...@confluent.io wrote:

 Ewen:

 * I think trivial filtering and aggregation on a single stream usually work
  fine with this model.


 The way I see this, the process() API is an abstraction for
 message-at-a-time computations. In the future, you could imagine providing
 a simple DSL layer on top of the process() API that provides a set of APIs
 for stream processing operations on sets of messages like joins, windows
 and various aggregations.

 * Spark (and presumably
  spark streaming) is supposed to get a big win by handling shuffles such
  that the data just stays in cache and never actually hits disk, or at
 least
  hits disk in the background. Will we take a hit because we always write
 to
  Kafka?


 The goal isn't so much about forcing materialization of intermediate
 results into Kafka but designing the API to integrate with Kafka to allow
 such materialization, wherever that might be required. The downside with
 other stream processing frameworks is that they have weak integration with
 Kafka where interaction with Kafka is only at the endpoints of processing
 (first input, final output). Any intermediate operations that might benefit
 from persisting intermediate results into Kafka are forced to be broken up
 into 2 separate topologies/plans/stages of processing that lead to more
 jobs. The implication is that the set of stream processing operations
 that should really have lived in one job per application is now split up
 across several piecemeal jobs that need to be monitored, managed and
 operated separately. The APIs should still allow in-memory storage of
 intermediate results where they make sense.

 Jiangjie,

 I just took a quick look at the KIP, is it very similar to mirror maker
  with message handler?


 Not really. I wouldn't say it is similar, but mirror maker is a special
 instance of using copycat with Kafka source, sink + optionally the
 process() API. I can imagine replacing MirrorMaker, in due course,
 with copycat + process().

 Thanks,
 Neha

 On Thu, 

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Yasuhiro Matsuda
Jay, I understand that. Context can provide more information without
breaking the compatibility if needed. Also I am not sure ConsumerRecord is
the right abstraction of data for stream processing. After transformation
or join, what is the topic and the offset? It is odd to use ConsumerRecord.
We can define a new record class, say StreamRecord. Isn't it an unnecessary
overhead if it is created for every transformation and join?
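
One possible shape for such a record, purely as a sketch: keep only key and
value on the record itself, and leave provenance (topic, offset) to the
processing context, so that derived records never carry stale metadata:

    public final class StreamRecord<K, V> {
        private final K key;
        private final V value;

        public StreamRecord(K key, V value) {
            this.key = key;
            this.value = value;
        }

        public K key()   { return key; }
        public V value() { return value; }
    }

Whether allocating one of these per transformation and join is acceptable
is exactly the overhead question raised above.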

On Fri, Jul 24, 2015 at 10:06 AM, Jay Kreps j...@confluent.io wrote:

 To follow on to one of Yi's points about taking ConsumerRecord vs
 topic/key/value. One thing we have found is that for user-facing APIs
 considering future API evolution is really important. If you do
 topic/key/value and then realize you need offset added, you end up having to
 break everyone's code. This is the idea behind things like ProducerRecord
 and ConsumerRecord: you can add additional fields without breaking
 existing code. Thought I'd point that out since we've made this mistake a
 few times now.
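
The contrast Jay describes, in sketch form (the interface names here are
hypothetical, only ConsumerRecord is a real Kafka class):

    import org.apache.kafka.clients.consumer.ConsumerRecord;

    interface PositionalProcessor {
        // adding offset later means changing this signature and breaking
        // every existing implementation
        void process(String topic, byte[] key, byte[] value);
    }

    interface RecordProcessor {
        // new fields (offset, partition, timestamp, ...) can be added to
        // ConsumerRecord without touching this signature
        void process(ConsumerRecord<byte[], byte[]> record);
    }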

 -Jay

 On Fri, Jul 24, 2015 at 12:57 AM, Yi Pan nickpa...@gmail.com wrote:

  Hi, Guozhang,
 
  Thanks for starting this. I took a quick look and had the following
  thoughts to share:
 
  - In the proposed KafkaProcessor API, there is no interface like
 Collector
  that allows users to send messages to. Why is that? Is the idea to
  initialize the producer once and re-use it in the processor? And if there
  are many KStreamThreads in the process, are there going to be many
  instances of KafkaProducer although all outputs are sending to the same
  Kafka cluster?
 
  - Won’t it be simpler if the process() API just takes in the
 ConsumerRecord
  as the input instead of a tuple of (topic, key, value)?
 
  - Also, the input only indicates the topic of a message. What if the
 stream
  task needs to consume and produce messages from/to multiple Kafka
 clusters?
  To support that case, there should be a system/cluster name in both input
  and output as well.
 
  - How are the output messages handled? There does not seem to be an
  interface that allows users to send output messages to multiple output
  Kafka clusters.
 
  - It seems the proposed model also assumes one thread per processor. What
  becomes thread-local and what are shared among processors? Is the
 proposed
  model targeting to have the consumers/producers become thread-local
  instances within each KafkaProcessor? What’s the cost associated with
 this
  model?
 
  - One more important issue: how do we plug in client-side partition
  management logic? Considering the use case where the stream task
  needs to consume from multiple Kafka clusters, I am not even sure that we
  can rely on Kafka broker to maintain the consumer group membership? Maybe
  we still can get the per cluster consumer group membership and
 partitions.
  However, in this case, we truly need a client-side plugin partition
  management logic to determine how to assign partitions in different Kafka
  clusters to consumers (i.e. consumers for cluster1.topic1.p1 and
  cluster2.topic2.p1 have to be assigned together to one KafkaProcessor for
  processing). Based on the full information about (group members, all
 topic
  partitions) in all Kafka clusters with input topics, there should be two
  levels of partition management policies: a) how to group all topic
  partitions in all Kafka clusters to processor groups (i.e. the same
 concept
  as Task group in Samza); b) how to assign the processor groups to group
  members. Note if a processor group includes topic partitions from more
  than one Kafka cluster, it has to be assigned to the common group members in
  all relevant Kafka clusters. This cannot be done just by the brokers in
  a single Kafka cluster.
 
  - It seems that the intention of this KIP is also trying to put SQL/DSL
  libraries into Kafka. Why is that? Shouldn't Kafka be more focused on
  hiding system-level integration details and leave it open for any additional
  modules outside the Kafka core to enrich the functionality that is
  user-facing?
 
  Just a few quick cents. Thanks a lot!
 
  -Yi
 
  On Fri, Jul 24, 2015 at 12:12 AM, Neha Narkhede n...@confluent.io
 wrote:
 
   Ewen:
  
   * I think trivial filtering and aggregation on a single stream usually
  work
fine with this model.
  
  
   The way I see this, the process() API is an abstraction for
   message-at-a-time computations. In the future, you could imagine
  providing
   a simple DSL layer on top of the process() API that provides a set of
  APIs
   for stream processing operations on sets of messages like joins,
 windows
   and various aggregations.
  
   * Spark (and presumably
spark streaming) is supposed to get a big win by handling shuffles
 such
that the data just stays in cache and never actually hits disk, or at
   least
hits disk in the background. Will we take a hit because we always
 write
   to
Kafka?
  
  
    The goal isn't so much about forcing materialization of intermediate
    results into Kafka but designing the API to integrate with Kafka to allow
    such materialization, wherever that might be required.

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Guozhang Wang
Hi Jiangjie,

Inlined.

On Thu, Jul 23, 2015 at 11:32 PM, Jiangjie Qin j...@linkedin.com.invalid
wrote:

 Hey Guozhang,

 I just took a quick look at the KIP, is it very similar to mirror maker
 with message handler?


I think the processor client would support a superset of the functionality
of MM with message handlers in that:

1. API-wise, it would provide more than per-message processing like the
message handler's, including local storage (with a committing mechanism),
time-triggered processing, etc.
2. Feature-wise, it would support user-customizable partition assignment
such as co-partitioning, sticky-partitioning (for local state maintenance,
for example), etc.
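
As a rough sketch of the shape such a processor could take (KafkaProcessor,
ProcessorContext and every member below are placeholders -- the actual API
was still being prototyped at this point):

    interface ProcessorContext {
        void send(String topic, Object key, Object value);  // producer facade
        void schedule(long intervalMs);   // timer driving punctuate()
        void commit();                    // commit state and offsets together
        long offset();                    // metadata of the current record
    }

    abstract class KafkaProcessor<K, V> {
        protected ProcessorContext context;

        public void init(ProcessorContext context) {
            this.context = context;
            context.schedule(60 * 1000L); // e.g. request time-triggered callbacks
        }

        // called once per incoming record
        public abstract void process(K key, V value);

        // time-triggered processing, driven by the schedule set up in init()
        public void punctuate(long timestamp) {}

        public void close() {}
    }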

Thanks,

 Jiangjie (Becket) Qin

 On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava e...@confluent.io
 
 wrote:

  Just some notes on the KIP doc itself:
 
  * It'd be useful to clarify at what point the plain consumer + custom
 code
  + producer breaks down. I think trivial filtering and aggregation on a
  single stream usually work fine with this model. Anything where you need
  more complex joins, windowing, etc. are where it breaks down. I think
 most
  interesting applications require that functionality, but it's helpful to
  make this really clear in the motivation -- right now, Kafka only
 provides
  the lowest level plumbing for stream processing applications, so most
  interesting apps require very heavyweight frameworks.
  * I think the feature comparison of plain producer/consumer, stream
  processing frameworks, and this new library is a good start, but we might
  want something more thorough and structured, like a feature matrix. Right
  now it's hard to figure out exactly how they relate to each other.
  * I'd personally push the library vs. framework story very strongly --
 the
  total buy-in and weak integration story of stream processing frameworks
 is
  a big downside and makes a library a really compelling (and currently
  unavailable, as far as I am aware) alternative.
  * Comment about in-memory storage of other frameworks is interesting --
 it
  is specific to the framework, but is supposed to also give performance
  benefits. The high-level functional processing interface would allow for
  combining multiple operations when there's no shuffle, but when there is
 a
  shuffle, we'll always be writing to Kafka, right? Spark (and presumably
  spark streaming) is supposed to get a big win by handling shuffles such
  that the data just stays in cache and never actually hits disk, or at
 least
  hits disk in the background. Will we take a hit because we always write
 to
  Kafka?
  * I really struggled with the structure of the KIP template with Copycat
  because the flow doesn't work well for proposals like this. They aren't
 as
  concrete changes as the KIP template was designed for. I'd completely
  ignore that template in favor of optimizing for clarity if I were you.
 
  -Ewen
 
  On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com
 wrote:
 
   Hi all,
  
   I just posted KIP-28: Add a transform client for data processing
   
  
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
   
   .
  
   The wiki page does not yet have the full design / implementation
 details,
   and this email is to kick-off the conversation on whether we should add
   this new client with the described motivations, and if yes what
 features
  /
   functionalities should be included.
  
   Looking forward to your feedback!
  
   -- Guozhang
  
 
 
 
  --
  Thanks,
  Ewen
 




-- 
-- Guozhang


Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Yasuhiro Matsuda
The goal of this KIP is to provide a lightweight/embeddable streaming
framework that allows Kafka users to start using stream processing easily. A DSL
is not covered in this KIP, but a DSL is a very attractive option to have.

 In the proposed KafkaProcessor API, there is no interface like Collector
that allows users to send messages to. Why is that?

It is not stated in the KIP, but Context provides a simple interface to a
producer.

Won’t it be simpler if the process() API just takes in the ConsumerRecord
as the input instead of a tuple of (topic, key, value)?

If Kafka implements a simple DSL, something like the one in Spark, I think
ConsumerRecord may not be the most convenient thing for the framework or
the most intuitive thing for users. I don't think we need topic in the
arguments. Think about the simplest application: all it needs is a key and
a value. That makes the API simpler. If the application needs to access
more info (topic, offset), Context should provide it.
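
In that style, and reusing the hypothetical KafkaProcessor /
ProcessorContext shapes sketched earlier in this digest, a processor might
look like:

    class FilterProcessor extends KafkaProcessor<String, String> {
        @Override
        public void process(String key, String value) {
            if (value != null && value.contains("error")) {
                // metadata pulled from the context only when needed
                long offset = context.offset();
                context.send("error-events", key, value + "@" + offset);
            }
        }
    }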


On Fri, Jul 24, 2015 at 12:57 AM, Yi Pan nickpa...@gmail.com wrote:

 Hi, Guozhang,

 Thanks for starting this. I took a quick look and had the following
 thoughts to share:

 - In the proposed KafkaProcessor API, there is no interface like Collector
 that allows users to send messages to. Why is that? Is the idea to
 initialize the producer once and re-use it in the processor? And if there
 are many KStreamThreads in the process, are there going to be many
 instances of KafkaProducer although all outputs are sending to the same
 Kafka cluster?

 - Won’t it be simpler if the process() API just takes in the ConsumerRecord
 as the input instead of a tuple of (topic, key, value)?

 - Also, the input only indicates the topic of a message. What if the stream
 task needs to consume and produce messages from/to multiple Kafka clusters?
 To support that case, there should be a system/cluster name in both input
 and output as well.

 - How are the output messages handled? There does not seem to be an
 interface that allows users to send output messages to multiple output
 Kafka clusters.

 - It seems the proposed model also assumes one thread per processor. What
 becomes thread-local and what are shared among processors? Is the proposed
 model targeting to have the consumers/producers become thread-local
 instances within each KafkaProcessor? What’s the cost associated with this
 model?

 - One more important issue: how do we plug in client-side partition
 management logic? Considering the use case where the stream task
 needs to consume from multiple Kafka clusters, I am not even sure that we
 can rely on Kafka broker to maintain the consumer group membership? Maybe
 we still can get the per cluster consumer group membership and partitions.
 However, in this case, we truly need a client-side plugin partition
 management logic to determine how to assign partitions in different Kafka
 clusters to consumers (i.e. consumers for cluster1.topic1.p1 and
 cluster2.topic2.p1 have to be assigned together to one KafkaProcessor for
 processing). Based on the full information about (group members, all topic
 partitions) in all Kafka clusters with input topics, there should be two
 levels of partition management policies: a) how to group all topic
 partitions in all Kafka clusters to processor groups (i.e. the same concept
 as Task group in Samza); b) how to assign the processor groups to group
 members. Note if a processor group includes topic partitions from more than
 one Kafka cluster, it has to be assigned to the common group members in
 all relevant Kafka clusters. This cannot be done just by the brokers in a
 single Kafka cluster.

 - It seems that the intention of this KIP is also trying to put SQL/DSL
 libraries into Kafka. Why is that? Shouldn't Kafka be more focused on hiding
 system-level integration details and leave it open for any additional
 modules outside the Kafka core to enrich the functionality that is
 user-facing?

 Just a few quick cents. Thanks a lot!

 -Yi

 On Fri, Jul 24, 2015 at 12:12 AM, Neha Narkhede n...@confluent.io wrote:

  Ewen:
 
  * I think trivial filtering and aggregation on a single stream usually
 work
   fine with this model.
 
 
  The way I see this, the process() API is an abstraction for
  message-at-a-time computations. In the future, you could imagine
 providing
  a simple DSL layer on top of the process() API that provides a set of
 APIs
  for stream processing operations on sets of messages like joins, windows
  and various aggregations.
 
  * Spark (and presumably
   spark streaming) is supposed to get a big win by handling shuffles such
   that the data just stays in cache and never actually hits disk, or at
  least
   hits disk in the background. Will we take a hit because we always write
  to
   Kafka?
 
 
  The goal isn't so much about forcing materialization of intermediate
  results into Kafka but designing the API to integrate with Kafka to allow
  such materialization, wherever that might be required.

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Jay Kreps
Hey Yi,

For your other two points:

- This definitely doesn't cover any kind of SQL or anything like this.

- The prototype we started with just had process() as a method but Yasuhiro
had some ideas of adding additional filter/aggregate convenience methods.
We should discuss how this would fit with the operator work you were doing
in Samza. Probably the best way is just get the code out there in current
state and start talking about it?

- Your point about multiple clusters. We actually have a proposed extension
for the Kafka group management protocol that would allow it to cover
multiple clusters but actually I think that use case is not the focus. I
think in scope would be consuming from one cluster and producing to another.

One of the assumptions we are making is that we will split into two
categories:
a. Ingress/egress which is handled by copycat
b. Transformation which would be handled by this api

I think there are a number of motivations for this
- It is really hard to provide hard guarantees if you allow non-trivial
aggregation coupled with the ingress/egress. So if you want to be able to
do something that provides a kind of end-to-end exactly once guarantee
(that's not really the right term but what people use) I think it will be
really hard to do this across multiple systems (hello two-phase commit)
- The APIs for ingest/egress end up needing to be really different for a
first-class ingestion framework

So the case where you have data coming from many systems including many
Kafka clusters is just about how easy/hard it is to use copycat with the
transformer api in the same program. I think this is something we should
work out as part of the prototyping.

-Jay

On Fri, Jul 24, 2015 at 12:57 AM, Yi Pan nickpa...@gmail.com wrote:

 Hi, Guozhang,

 Thanks for starting this. I took a quick look and had the following
 thoughts to share:

 - In the proposed KafkaProcessor API, there is no interface like Collector
 that allows users to send messages to. Why is that? Is the idea to
 initialize the producer once and re-use it in the processor? And if there
 are many KStreamThreads in the process, are there going to be many
 instances of KafkaProducer although all outputs are sending to the same
 Kafka cluster?

 - Won’t it be simpler if the process() API just takes in the ConsumerRecord
 as the input instead of a tuple of (topic, key, value)?

 - Also, the input only indicates the topic of a message. What if the stream
 task needs to consume and produce messages from/to multiple Kafka clusters?
 To support that case, there should be a system/cluster name in both input
 and output as well.

 - How are the output messages handled? There does not seem to be an
 interface that allows users to send output messages to multiple output
 Kafka clusters.

 - It seems the proposed model also assumes one thread per processor. What
 becomes thread-local and what are shared among processors? Is the proposed
 model targeting to have the consumers/producers become thread-local
 instances within each KafkaProcessor? What’s the cost associated with this
 model?

 - One more important issue: how do we plug in client-side partition
 management logic? Considering the use case where the stream task
 needs to consume from multiple Kafka clusters, I am not even sure that we
 can rely on Kafka broker to maintain the consumer group membership? Maybe
 we still can get the per cluster consumer group membership and partitions.
 However, in this case, we truly need a client-side plugin partition
 management logic to determine how to assign partitions in different Kafka
 clusters to consumers (i.e. consumers for cluster1.topic1.p1 and
 cluster2.topic2.p1 have to be assigned together to one KafkaProcessor for
 processing). Based on the full information about (group members, all topic
 partitions) in all Kafka clusters with input topics, there should be two
 levels of partition management policies: a) how to group all topic
 partitions in all Kafka clusters to processor groups (i.e. the same concept
 as Task group in Samza); b) how to assign the processor groups to group
 members. Note if a processor group includes topic partitions from more than
 one Kafka cluster, it has to be assigned to the common group members in
 all relevant Kafka clusters. This cannot be done just by the brokers in a
 single Kafka cluster.

 - It seems that the intention of this KIP is also trying to put SQL/DSL
 libraries into Kafka. Why is that? Shouldn't Kafka be more focused on hiding
 system-level integration details and leave it open for any additional
 modules outside the Kafka core to enrich the functionality that is
 user-facing?

 Just a few quick cents. Thanks a lot!

 -Yi

 On Fri, Jul 24, 2015 at 12:12 AM, Neha Narkhede n...@confluent.io wrote:

  Ewen:
 
  * I think trivial filtering and aggregation on a single stream usually
 work
   fine with this model.
 
 
  The way I see this, the process() API is an abstraction for
  message-at-a-time computations.

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Guozhang Wang
Hi Yi,

Inlined.

On Fri, Jul 24, 2015 at 12:57 AM, Yi Pan nickpa...@gmail.com wrote:

 Hi, Guozhang,

 Thanks for starting this. I took a quick look and had the following
 thoughts to share:

 - In the proposed KafkaProcessor API, there is no interface like Collector
 that allows users to send messages to. Why is that? Is the idea to
 initialize the producer once and re-use it in the processor? And if there
 are many KStreamThreads in the process, are there going to be many
 instances of KafkaProducer although all outputs are sending to the same
 Kafka cluster?


The API mentioned in the wiki page is neither final nor comprehensive; it
is just for illustrating how it would replace the producer + consumer
APIs.
later, together with a prototype implementation of such APIs. And yes, the
send() function should definitely be supported.

Regarding whether we should have multiple producer / consumer instances or
not within a single processor instance, as mentioned in the wiki it is not
fully decided yet and again, I would like to add such content to the KIP
proposal once we have a prototype illustrating the design so that people
can have a better idea and discuss over it.


 - Won’t it be simpler if the process() API just takes in the ConsumerRecord
 as the input instead of a tuple of (topic, key, value)?


I think I agree.


 - Also, the input only indicates the topic of a message. What if the stream
 task needs to consume and produce messages from/to multiple Kafka clusters?
 To support that case, there should be a system/cluster name in both input
 and output as well.


Yeah that is a good question, I agree that in the final state the processor
context should include such record metadata as well as partition-id,
offset, etc.


 - How are the output messages handled? There does not seem to be an
 interface that allows users to send output messages to multiple output
 Kafka clusters.


See above.


 - It seems the proposed model also assumes one thread per processor. What
 becomes thread-local and what are shared among processors? Is the proposed
 model targeting to have the consumers/producers become thread-local
 instances within each KafkaProcessor? What’s the cost associated with this
 model?


We do not assume one thread per processor; I think the name of
KStreamThread would be a bit misleading here: we can definitely spawn more
threads beyond this main thread of the process if we decide to do so in
the system design.


 - One more important issue: how do we plug in client-side partition
 management logic? Considering the use case where the stream task
 needs to consume from multiple Kafka clusters, I am not even sure that we
 can rely on Kafka broker to maintain the consumer group membership? Maybe
 we still can get the per cluster consumer group membership and partitions.
 However, in this case, we truly need a client-side plugin partition
 management logic to determine how to assign partitions in different Kafka
 clusters to consumers (i.e. consumers for cluster1.topic1.p1 and
 cluster2.topic2.p1 have to be assigned together to one KafkaProcessor for
 processing). Based on the full information about (group members, all topic
 partitions) in all Kafka clusters with input topics, there should be two
 levels of partition management policies: a) how to group all topic
 partitions in all Kafka clusters to processor groups (i.e. the same concept
 as Task group in Samza); b) how to assign the processor groups to group
 members. Note if a processor group includes topic partitions from more than
 one Kafka cluster, it has to be assigned to the common group members in
 all relevant Kafka clusters. This cannot be done just by the brokers in a
 single Kafka cluster.


Yes, the current broker-side partition assignment is not flexible enough
and we are considering changing the protocol to allow clients to
assign partitions themselves; and depending on the system design we may
handle the two-level assignment differently. For example, if we use one
consumer for each incoming Kafka cluster within the process instance, and
have multiple processor threads that read the data from the shared
consumer, then besides consumer-group partition assignment we also need
to determine the allocation of partitions to the processor threads.


 - It seems that the intention of this KIP is also trying to put SQL/DSL
 libraries into Kafka. Why is that? Shouldn't Kafka be more focused on hiding
 system-level integration details and leave it open for any additional
 modules outside the Kafka core to enrich the functionality that is
 user-facing?


I was not trying to push SQL / DSL into Kafka, but I do want to bring this
up for discussion as to what features should be included in this client
library.

Just a few quick cents. Thanks a lot!

 -Yi

 On Fri, Jul 24, 2015 at 12:12 AM, Neha Narkhede n...@confluent.io wrote:

  Ewen:
 
  * I think trivial filtering and aggregation on a single stream usually
  work fine with this model.

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Ewen Cheslack-Postava
@guozhang re: library vs framework - Nothing critical missing. But I think
KIPs will serve as a sort of changelog since ones like this
https://archive.apache.org/dist/kafka/0.8.2.0/RELEASE_NOTES.html are not
all that helpful to end users. A KIP like this is a major new addition, so
laying out all the benefits and tradeoffs wrt frameworks is important for
users. Docs will cover that too, but Kafka tends to lag on documentation
updates and at worst, the docs are just bootstrapped from very clear KIP
text.

In terms of concrete changes, I'd love to see a few examples of how you
would take advantage of it being in library form. There's the obvious
benefit that you can easily integrate with whatever cluster/process
management you want instead of requiring YARN/Mesos/etc, but having a
library rather than framework also allows you to integrate with other code
in the same process. The Copycat KIP showed how you might integrate Copycat
and a stream processing library very generically, but maybe one or two real
examples would be useful? What else is running in the process besides
stream processing?
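
(As one toy example of co-located code: a service that answers HTTP lookups
out of the same JVM that processes the stream. Everything below is
hypothetical and uses only the JDK; the stream consumption loop is elided:)

    import com.sun.net.httpserver.HttpServer;
    import java.net.InetSocketAddress;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class EmbeddedProcessorService {
        public static void main(String[] args) throws Exception {
            // In-process state: written by the stream thread, read by HTTP.
            Map<String, Long> counts = new ConcurrentHashMap<>();

            // The stream processor is just a thread inside this service...
            Thread processor = new Thread(() -> {
                // consume loop elided; per record it would do something like:
                // counts.merge(record.key(), 1L, Long::sum);
            });
            processor.start();

            // ...while the same process serves lookups against live counts.
            HttpServer http = HttpServer.create(new InetSocketAddress(8080), 0);
            http.createContext("/count", exchange -> {
                String key = String.valueOf(exchange.getRequestURI().getQuery());
                byte[] body = String.valueOf(counts.getOrDefault(key, 0L)).getBytes();
                exchange.sendResponseHeaders(200, body.length);
                exchange.getResponseBody().write(body);
                exchange.close();
            });
            http.start();
        }
    }

With a framework, the state would live in a separate cluster and the lookup
would need a remote call; in library form it is a local map read.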

All my comments were only meant as suggestions for how the KIP document
might be improved. I'm already convinced that stream processing frameworks
today are far too heavyweight and there's a gap to be filled between those
frameworks and using producer/consumer directly.

-Ewen

On Fri, Jul 24, 2015 at 9:48 AM, Guozhang Wang wangg...@gmail.com wrote:

 Hi Ewen,

 Replies inlined.

 On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava e...@confluent.io
 
 wrote:

  Just some notes on the KIP doc itself:
 
  * It'd be useful to clarify at what point the plain consumer + custom code
  + producer breaks down. I think trivial filtering and aggregation on a
  single stream usually work fine with this model. Anything where you need
  more complex joins, windowing, etc. is where it breaks down. I think most
  interesting applications require that functionality, but it's helpful to
  make this really clear in the motivation -- right now, Kafka only provides
  the lowest-level plumbing for stream processing applications, so most
  interesting apps require very heavyweight frameworks.
 

 I think that for users to efficiently express complex logic like joins,
 windowing, etc., a higher-level programming interface beyond the process()
 interface would definitely be better, but that does not necessarily require
 a heavyweight framework, which usually includes more than just the
 high-level functional programming model. I would argue that for users who
 want a high-level programming interface but not a heavyweight stream
 processing framework, a better alternative would be the processor library
 plus a separate DSL layer on top of it.



  * I think the feature comparison of plain producer/consumer, stream
  processing frameworks, and this new library is a good start, but we might
  want something more thorough and structured, like a feature matrix. Right
  now it's hard to figure out exactly how they relate to each other.
 

 Cool, I can do that.


  * I'd personally push the library vs. framework story very strongly -- the
  total buy-in and weak integration story of stream processing frameworks is
  a big downside and makes a library a really compelling (and currently
  unavailable, as far as I am aware) alternative.
 

 Are you suggesting that some content is still missing from the wiki page's
 motivation for adding the proposed library?


  * Comment about in-memory storage of other frameworks is interesting -- it
  is specific to the framework, but is supposed to also give performance
  benefits. The high-level functional processing interface would allow for
  combining multiple operations when there's no shuffle, but when there is a
  shuffle, we'll always be writing to Kafka, right? Spark (and presumably
  Spark Streaming) is supposed to get a big win by handling shuffles such
  that the data just stays in cache and never actually hits disk, or at
  least hits disk in the background. Will we take a hit because we always
  write to Kafka?
 

 I agree with Neha's comments here. One more point I want to make is that
 materializing to Kafka is not necessarily much worse than keeping data in
 memory, if the downstream consumption is caught up such that most of the
 reads hit the OS file cache. I remember Samza has demonstrated that under
 such scenarios its throughput is quite comparable to Spark Streaming /
 Storm.


  * I really struggled with the structure of the KIP template with Copycat
  because the flow doesn't work well for proposals like this. They aren't
  the kind of concrete changes the KIP template was designed for. I'd
  completely ignore that template in favor of optimizing for clarity if I
  were you.
 
  -Ewen
 
  On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com
 wrote:
 
   Hi all,
  
   I just posted KIP-28: Add a transform client for data processing
 

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Guozhang Wang
Hi Ewen,

Replies inlined.

On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava e...@confluent.io
wrote:

 Just some notes on the KIP doc itself:

 * It'd be useful to clarify at what point the plain consumer + custom code
 + producer breaks down. I think trivial filtering and aggregation on a
 single stream usually work fine with this model. Anything where you need
 more complex joins, windowing, etc. is where it breaks down. I think most
 interesting applications require that functionality, but it's helpful to
 make this really clear in the motivation -- right now, Kafka only provides
 the lowest-level plumbing for stream processing applications, so most
 interesting apps require very heavyweight frameworks.


I think that for users to efficiently express complex logic like joins,
windowing, etc., a higher-level programming interface beyond the process()
interface would definitely be better, but that does not necessarily require
a heavyweight framework, which usually includes more than just the
high-level functional programming model. I would argue that for users who
want a high-level programming interface but not a heavyweight stream
processing framework, a better alternative would be the processor library
plus a separate DSL layer on top of it.
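
(To make that concrete, one shape such a layer could take: a toy sketch,
not a proposed API, in which a couple of functional operators compile down
to nested process() calls:)

    import java.util.function.Function;
    import java.util.function.Predicate;

    // The process()-style primitive: one message in, side effects out.
    interface Processor<K, V> {
        void process(K key, V value);
    }

    final class Ops {
        // filter: forward the message downstream only if the predicate holds.
        static <K, V> Processor<K, V> filter(Predicate<V> pred, Processor<K, V> next) {
            return (k, v) -> { if (pred.test(v)) next.process(k, v); };
        }

        // map: transform the value, then forward downstream.
        static <K, V, V2> Processor<K, V> map(Function<V, V2> fn, Processor<K, V2> next) {
            return (k, v) -> next.process(k, fn.apply(v));
        }

        public static void main(String[] args) {
            // filter(length > 3) -> map(uppercase) -> print, per message.
            Processor<String, String> pipeline =
                    filter((String v) -> v.length() > 3,
                            map(String::toUpperCase,
                                    (k, v) -> System.out.println(k + " -> " + v)));
            pipeline.process("key1", "kafka"); // prints: key1 -> KAFKA
            pipeline.process("key2", "ok");    // filtered out, prints nothing
        }
    }

A real DSL would add joins, windows, and state on top, but the compilation
target would stay this same message-at-a-time primitive.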



 * I think the feature comparison of plain producer/consumer, stream
 processing frameworks, and this new library is a good start, but we might
 want something more thorough and structured, like a feature matrix. Right
 now it's hard to figure out exactly how they relate to each other.


Cool, I can do that.


 * I'd personally push the library vs. framework story very strongly -- the
 total buy-in and weak integration story of stream processing frameworks is
 a big downside and makes a library a really compelling (and currently
 unavailable, as far as I am aware) alternative.


Are you suggesting that some content is still missing from the wiki page's
motivation for adding the proposed library?


 * Comment about in-memory storage of other frameworks is interesting -- it
 is specific to the framework, but is supposed to also give performance
 benefits. The high-level functional processing interface would allow for
 combining multiple operations when there's no shuffle, but when there is a
 shuffle, we'll always be writing to Kafka, right? Spark (and presumably
 Spark Streaming) is supposed to get a big win by handling shuffles such
 that the data just stays in cache and never actually hits disk, or at least
 hits disk in the background. Will we take a hit because we always write to
 Kafka?


I agree with Neha's comments here. One more point I want to make is that
materializing to Kafka is not necessarily much worse than keeping data in
memory, if the downstream consumption is caught up such that most of the
reads hit the OS file cache. I remember Samza has demonstrated that under
such scenarios its throughput is quite comparable to Spark Streaming /
Storm.


 * I really struggled with the structure of the KIP template with Copycat
 because the flow doesn't work well for proposals like this. They aren't
 the kind of concrete changes the KIP template was designed for. I'd
 completely ignore that template in favor of optimizing for clarity if I
 were you.

 -Ewen

 On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com wrote:

  Hi all,
 
  I just posted KIP-28: Add a transform client for data processing
  
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
  
  .
 
  The wiki page does not yet have the full design / implementation details,
  and this email is to kick-off the conversation on whether we should add
  this new client with the described motivations, and if yes what features
 /
  functionalities should be included.
 
  Looking forward to your feedback!
 
  -- Guozhang
 



 --
 Thanks,
 Ewen




-- 
-- Guozhang


Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Neha Narkhede
Agree that the normal KIP process is awkward for larger changes like this.
I'm a +1 on trying out this new process for the processor client, seeing
how it works out, and then making that the process for future large changes
of this nature.

On Fri, Jul 24, 2015 at 10:03 AM, Jay Kreps j...@confluent.io wrote:

 I agree that the KIP process doesn't fit well for big areas of development
 like the new consumer, copycat, or this.

 I think the approach for copycat, where we do a "should this exist" KIP
 vote followed by a review on code check-in, isn't ideal because of course
 the question of "should we do it" is directly tied to the question of
 "what will it look like". I'm sure any of us could be either in favor of
 or opposed to copycat depending on the details of what it looks like. And
 for these
 big things you really need to have a fairly complete prototype to get into
 details of how it will work. But we definitely want to do these kinds of
 things collaboratively, so we don't want to wait until we have a finished
 prototype and then dump out the code and KIP in final form. My experience
 is that it is pretty hard to influence things that are this far along
 because by then all the ideas have kind of solidified in the authors'
 minds.

 So I think the proposal for this one is to try the following:
 1. Throw out a stub KIP with essentially no concrete design other than a
 problem statement and niche we are trying to address. Start discussion on
 this but no vote (because what are you really voting on?).
 2. Get a WIP prototype patch out there quickly and discuss that as it is
 being developed and refined.
 3. Solidify the prototype patch and KIP together and do a vote on the KIP
 as the final design solidifies.
 4. Do the normal review process for the patch more or less decoupled from
 the KIP discussion covering implementation rather than design and user APIs
 (which the KIP discussion would cover).

 Does this make sense to people? If so let's try it and if we like it better
 we can formally make that the process for this kind of big thing.

 -Jay



 On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava e...@confluent.io
 
 wrote:

  Just some notes on the KIP doc itself:

  * It'd be useful to clarify at what point the plain consumer + custom code
  + producer breaks down. I think trivial filtering and aggregation on a
  single stream usually work fine with this model. Anything where you need
  more complex joins, windowing, etc. is where it breaks down. I think most
  interesting applications require that functionality, but it's helpful to
  make this really clear in the motivation -- right now, Kafka only provides
  the lowest-level plumbing for stream processing applications, so most
  interesting apps require very heavyweight frameworks.
  * I think the feature comparison of plain producer/consumer, stream
  processing frameworks, and this new library is a good start, but we might
  want something more thorough and structured, like a feature matrix. Right
  now it's hard to figure out exactly how they relate to each other.
  * I'd personally push the library vs. framework story very strongly -- the
  total buy-in and weak integration story of stream processing frameworks is
  a big downside and makes a library a really compelling (and currently
  unavailable, as far as I am aware) alternative.
  * Comment about in-memory storage of other frameworks is interesting -- it
  is specific to the framework, but is supposed to also give performance
  benefits. The high-level functional processing interface would allow for
  combining multiple operations when there's no shuffle, but when there is a
  shuffle, we'll always be writing to Kafka, right? Spark (and presumably
  Spark Streaming) is supposed to get a big win by handling shuffles such
  that the data just stays in cache and never actually hits disk, or at
  least hits disk in the background. Will we take a hit because we always
  write to Kafka?
  * I really struggled with the structure of the KIP template with Copycat
  because the flow doesn't work well for proposals like this. They aren't
  the kind of concrete changes the KIP template was designed for. I'd
  completely ignore that template in favor of optimizing for clarity if I
  were you.
 
  -Ewen
 
  On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com
 wrote:
 
   Hi all,
  
   I just posted KIP-28: Add a transform client for data processing
   
  
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
   
   .
  
   The wiki page does not yet have the full design / implementation
 details,
   and this email is to kick-off the conversation on whether we should add
   this new client with the described motivations, and if yes what
 features
  /
   functionalities should be included.
  
   Looking forward to your feedback!
  
   -- Guozhang
  
 
 
 
  --
  Thanks,
  Ewen
 




-- 
Thanks,
Neha


Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Jay Kreps
To follow on to one of Yi's points about taking ConsumerRecord vs
topic/key/value: one thing we have found is that for user-facing APIs,
considering future API evolution is really important. If you do
topic/key/value and then realize you need the offset added, you end up
having to break everyone's code. The idea behind things like ProducerRecord
and ConsumerRecord is that you can add additional fields without breaking
existing code. Thought I'd point that out since we've made this mistake a
few times now.
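
(Sketched with two hypothetical processor interfaces; only ConsumerRecord
itself is a real class:)

    import org.apache.kafka.clients.consumer.ConsumerRecord;

    interface FlatProcessor<K, V> {
        // If the offset (or partition, or timestamp) is needed later, this
        // signature has to change and every implementation breaks.
        void process(String topic, K key, V value);
    }

    interface RecordProcessor<K, V> {
        // ConsumerRecord can grow new accessors over time without touching
        // this signature or any existing implementation of it.
        void process(ConsumerRecord<K, V> record);
    }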

-Jay

On Fri, Jul 24, 2015 at 12:57 AM, Yi Pan nickpa...@gmail.com wrote:

 Hi, Guozhang,

 Thanks for starting this. I took a quick look and had the following
 thoughts to share:

 - In the proposed KafkaProcessor API, there is no Collector-like interface
 that users can send messages to; why is that? (See the sketch after this
 list.) Is the idea to initialize the producer once and re-use it in the
 processor? And if there are many KStreamThreads in the process, are there
 going to be many instances of KafkaProducer even though all outputs are
 sent to the same Kafka cluster?

 - Won’t it be simpler if the process() API just takes in the ConsumerRecord
 as the input instead of a tuple of (topic, key, value)?

 - Also, the input only indicates the topic of a message. What if the stream
 task needs to consume and produce messages from/to multiple Kafka clusters?
 To support that case, there should be a system/cluster name in both input
 and output as well.

 - How are the output messages handled? There does not seem to be an
 interface that allows users to send output messages to multiple output
 Kafka clusters.

 - It seems the proposed model also assumes one thread per processor. What
 becomes thread-local and what is shared among processors? Is the proposed
 model targeting to have the consumers/producers become thread-local
 instances within each KafkaProcessor? What’s the cost associated with this
 model?

 - One more important issue: how do we plug in client-side partition
 management logic? Considering the use case where the stream task needs to
 consume from multiple Kafka clusters, I am not even sure that we can rely
 on the Kafka brokers to maintain consumer group membership. Maybe we can
 still get the per-cluster consumer group membership and partitions.
 However, in this case, we truly need pluggable client-side partition
 management logic to determine how to assign partitions in different Kafka
 clusters to consumers (i.e. consumers for cluster1.topic1.p1 and
 cluster2.topic2.p1 have to be assigned together to one KafkaProcessor for
 processing). Based on the full information about (group members, all topic
 partitions) in all Kafka clusters with input topics, there should be two
 levels of partition management policies: a) how to group all topic
 partitions in all Kafka clusters into processor groups (i.e. the same
 concept as a Task group in Samza); b) how to assign the processor groups to
 group members. Note that if a processor group includes topic partitions
 from more than one Kafka cluster, it has to be assigned to the common group
 members in all relevant Kafka clusters. This cannot be done just by the
 brokers in a single Kafka cluster.

 - It seems that the intention of this KIP is also to put SQL/DSL libraries
 into Kafka. Why is that? Shouldn't Kafka be more focused on hiding
 system-level integration details, leaving it open for additional modules
 outside the Kafka core to enrich the functionality that is user-facing?
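
 (For the Collector question above, a minimal sketch of what such an
 interface could look like; the Collector interface and the class below are
 hypothetical, only the producer client is real. Note that KafkaProducer,
 unlike KafkaConsumer, is thread-safe, so one instance per target cluster
 can be shared by all processor threads:)

    import java.util.Map;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    interface Collector<K, V> {
        void send(String cluster, String topic, K key, V value);
    }

    class SharedProducerCollector<K, V> implements Collector<K, V> {
        // One producer per output cluster, shared by all processor threads.
        private final Map<String, KafkaProducer<K, V>> producers;

        SharedProducerCollector(Map<String, KafkaProducer<K, V>> producers) {
            this.producers = producers;
        }

        @Override
        public void send(String cluster, String topic, K key, V value) {
            producers.get(cluster).send(new ProducerRecord<>(topic, key, value));
        }
    }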

 Just a few quick cents. Thanks a lot!

 -Yi

 On Fri, Jul 24, 2015 at 12:12 AM, Neha Narkhede n...@confluent.io wrote:

  Ewen:
 
  * I think trivial filtering and aggregation on a single stream usually
   work fine with this model.


  The way I see this, the process() API is an abstraction for
  message-at-a-time computations. In the future, you could imagine providing
  a simple DSL layer on top of the process() API that provides a set of APIs
  for stream processing operations on sets of messages like joins, windows,
  and various aggregations.

  * Spark (and presumably Spark Streaming) is supposed to get a big win by
   handling shuffles such that the data just stays in cache and never
   actually hits disk, or at least hits disk in the background. Will we take
   a hit because we always write to Kafka?


  The goal isn't so much about forcing materialization of intermediate
  results into Kafka but designing the API to integrate with Kafka to allow
  such materialization, wherever that might be required. The downside with
  other stream processing frameworks is that they have weak integration with
  Kafka, where interaction with Kafka is only at the endpoints of processing
  (first input, final output). Any intermediate operations that might benefit
  from persisting intermediate results into Kafka are forced to be broken up
  into 2 separate topologies/plans/stages of processing that lead to more
  jobs. The implication is that now the set of stream 

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-24 Thread Ismael Juma
On 24 Jul 2015 18:03, Jay Kreps j...@confluent.io wrote:
 Does this make sense to people? If so let's try it and if we like it
 better we can formally make that the process for this kind of big thing.

Yes, sounds good to me.

Best,
Ismael


Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-23 Thread Ewen Cheslack-Postava
Just some notes on the KIP doc itself:

* It'd be useful to clarify at what point the plain consumer + custom code
+ producer breaks down. I think trivial filtering and aggregation on a
single stream usually work fine with this model. Anything where you need
more complex joins, windowing, etc. is where it breaks down. I think most
interesting applications require that functionality, but it's helpful to
make this really clear in the motivation -- right now, Kafka only provides
the lowest-level plumbing for stream processing applications, so most
interesting apps require very heavyweight frameworks.
* I think the feature comparison of plain producer/consumer, stream
processing frameworks, and this new library is a good start, but we might
want something more thorough and structured, like a feature matrix. Right
now it's hard to figure out exactly how they relate to each other.
* I'd personally push the library vs. framework story very strongly -- the
total buy-in and weak integration story of stream processing frameworks is
a big downside and makes a library a really compelling (and currently
unavailable, as far as I am aware) alternative.
* Comment about in-memory storage of other frameworks is interesting -- it
is specific to the framework, but is supposed to also give performance
benefits. The high-level functional processing interface would allow for
combining multiple operations when there's no shuffle, but when there is a
shuffle, we'll always be writing to Kafka, right? Spark (and presumably
Spark Streaming) is supposed to get a big win by handling shuffles such
that the data just stays in cache and never actually hits disk, or at least
hits disk in the background. Will we take a hit because we always write to
Kafka?
* I really struggled with the structure of the KIP template with Copycat
because the flow doesn't work well for proposals like this. They aren't
the kind of concrete changes the KIP template was designed for. I'd
completely ignore that template in favor of optimizing for clarity if I
were you.

-Ewen

On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang wangg...@gmail.com wrote:

 Hi all,

 I just posted KIP-28: Add a transform client for data processing
 
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
 
 .

 The wiki page does not yet have the full design / implementation details,
 and this email is to kick-off the conversation on whether we should add
 this new client with the described motivations, and if yes what features /
 functionalities should be included.

 Looking forward to your feedback!

 -- Guozhang




-- 
Thanks,
Ewen


[DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-23 Thread Guozhang Wang
Hi all,

I just posted KIP-28: Add a transform client for data processing
https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
.

The wiki page does not yet have the full design / implementation details,
and this email is to kick-off the conversation on whether we should add
this new client with the described motivations, and if yes what features /
functionalities should be included.

Looking forward to your feedback!

-- Guozhang