Re: [DISCUSS] KIP-372: Naming Joins and Grouping

2018-09-16 Thread Matthias J. Sax
Thant for the update Bill. LGTM.


On 9/14/18 5:33 PM, Guozhang Wang wrote:
> Hello Bill,
> 
> I've made another pass over the wiki page and it lgtm now. Thanks!
> 
> 
> Guozhang
> 
> 
> On Fri, Sep 14, 2018 at 3:05 PM, John Roesler  wrote:
> 
>> Hey Bill (and Guozhang),
>>
>> Apologies. I misunderstood what Guozhang was getting at in his earlier
>> remark.
>>
>> I'm a +1 on the current proposal.
>>
>> Thanks,
>> -John
>>
>> On Fri, Sep 14, 2018 at 3:10 PM Bill Bejeck  wrote:
>>
>>> All,
>>>
>>> Thanks for the comments, I'll respond to all of your comments in the
>>> recieved order.
>>>
>>>
>>> Guozhang,
>>>
  And if [join-name] not specified, stay the same, which is:
 * [previous-processor-name]-repartition for both Stream-Stream (S-S)
>> join
>>> and S-T join
>>>
>>>   I believe the current approach is
>>> [appId]-[previous-processor-name]-repartition, but the main point is if
>> no
>>> name is provided, the current naming scheme will remain in place
>>>
  I think it is more natural to rename it to
  * [app-id]-[join-name]-left-repartition
  * [app-id]-[join-name]-right-repartition
>>>
>>> I agree and I'll update the KIP accordingly
>>>
 2 I'd suggest to use the name to also define the corresponding
>> processor
>>>
>>> I'll address this in conjunction with comments from Matthias and your
>>> second round of comments.
>>>
>>>
 3. Could you also specify how this would affect the optimization for
>>>
>>>  merging multiple repartition topics?
>>>
>>>   With an optimization where we reduce the number of repartition
>>> topics, if the user has not named the repartition topics, we'll generate
>> a
>>> name for
>>>   the optimized repartition topic. In the case where users have
>>> provided names for the repartition topics,
>>>   we'll reuse the name from the first of the named repartition
>>> topics.
>>>   However, optimizing a topology by reducing the number of
>>> repartition topics may require an application reset, but that is beyond
>> the
>>> scope of this KIP.
>>>
>>>   I'll update the KIP with this case
>>>
>>>   I've also addressed 4.a and 4.b in the "Compatibility,
>>> Deprecation and Migration plan" section
>>>
>>>
>>> Matthias
>>>
 1) We should either use `[app-id]-this|other-[join-name]-repartition`
>> or
 `app-id]-[join-name]-left|right-repartition` but we should not change
 the pattern depending if the user specifies a name of not.
>>>
>>> I've updated the KIP to say in the case of Joins and a name is provided
>>> we'll go with `app-id-name-left|right-repartition` in all cases if the
>> name
>>> is not provided
>>> we will continue to use the current naming process.
>>>
 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
 to be orthogonal, and thus KIP-372 should not change any processor
 names, but KIP-307 should define a holistic strategy for all processor.
 Otherwise, we might up with different strategies or revert what we
 decide in this KIP if it's not compatible with KIP-307.
>>>
>>> I agree and have updated the name of this KIP to be more precise.
>>>
>>> Eno
>>>
 I know we don't normally have a "Related work" section in KIPs, but
 sometimes I find it useful to see what others have done in similar
>> cases.
 Since this will be important for rolling re-deployments, I wonder what
 other frameworks like Flink (or Samza) have done in these cases.
>> Perhaps
 they have done nothing, in which case it's fine to do this from first
 principles, but IMO it would be good to know just to make sure we're
 heading in the right direction.
>>>
 Also I don't get a good feel for how much work this will be for an end
>>> user
 who is doing the rolling deployment, perhaps an end-to-end example
>> would
 help.
>>>
>>> I can't speak for the other projects, so I'll have to take a look and
>> see.
>>>
>>> As for getting a feel for how much work this will be for an end user,
>> could
>>> you look at the
>>> updated KIP and see if it clears things up?
>>>
>>>
>>> Matthias
>>>
 (1) For `Grouped` should we add `with(String name, Serde key,
 Serde value)` to allow specifying all parameters at once?
>>>
 Produced/Consumed/Serialized etc follow a similar pattern. There is one
 static method for each config parameter, plus one static method that
 accepts all parameters. Might be more consistent if we follow this
>>> pattern.
>>>
 (2) It seems that `Serialized` is only used in `groupBy` and
 `groupByKey` -- because both methods accepting `Serialized` parameter
 are deprecated and replaced with methods accepting `Grouped`, it seems
 that we also want to deprecate `Serialized`?
>>>
 (3) About naming repartition topics: thinking about this once more, I
 actually prefer to use `left|right` instead of `this|other` :)
>>>
>>> For 1, I agree and I'll update the KIP.
>>>
>>> For 2, Yes that seems to make 

Re: [DISCUSS] KIP-372: Naming Joins and Grouping

2018-09-14 Thread Guozhang Wang
Hello Bill,

I've made another pass over the wiki page and it lgtm now. Thanks!


Guozhang


On Fri, Sep 14, 2018 at 3:05 PM, John Roesler  wrote:

> Hey Bill (and Guozhang),
>
> Apologies. I misunderstood what Guozhang was getting at in his earlier
> remark.
>
> I'm a +1 on the current proposal.
>
> Thanks,
> -John
>
> On Fri, Sep 14, 2018 at 3:10 PM Bill Bejeck  wrote:
>
> > All,
> >
> > Thanks for the comments, I'll respond to all of your comments in the
> > recieved order.
> >
> >
> > Guozhang,
> >
> > >  And if [join-name] not specified, stay the same, which is:
> > > * [previous-processor-name]-repartition for both Stream-Stream (S-S)
> join
> > and S-T join
> >
> >   I believe the current approach is
> > [appId]-[previous-processor-name]-repartition, but the main point is if
> no
> > name is provided, the current naming scheme will remain in place
> >
> > >  I think it is more natural to rename it to
> > >  * [app-id]-[join-name]-left-repartition
> > >  * [app-id]-[join-name]-right-repartition
> >
> > I agree and I'll update the KIP accordingly
> >
> > > 2 I'd suggest to use the name to also define the corresponding
> processor
> >
> > I'll address this in conjunction with comments from Matthias and your
> > second round of comments.
> >
> >
> > >3. Could you also specify how this would affect the optimization for
> >
> >  merging multiple repartition topics?
> >
> >   With an optimization where we reduce the number of repartition
> > topics, if the user has not named the repartition topics, we'll generate
> a
> > name for
> >   the optimized repartition topic. In the case where users have
> > provided names for the repartition topics,
> >   we'll reuse the name from the first of the named repartition
> > topics.
> >   However, optimizing a topology by reducing the number of
> > repartition topics may require an application reset, but that is beyond
> the
> > scope of this KIP.
> >
> >   I'll update the KIP with this case
> >
> >   I've also addressed 4.a and 4.b in the "Compatibility,
> > Deprecation and Migration plan" section
> >
> >
> > Matthias
> >
> > > 1) We should either use `[app-id]-this|other-[join-name]-repartition`
> or
> > > `app-id]-[join-name]-left|right-repartition` but we should not change
> > > the pattern depending if the user specifies a name of not.
> >
> > I've updated the KIP to say in the case of Joins and a name is provided
> > we'll go with `app-id-name-left|right-repartition` in all cases if the
> name
> > is not provided
> > we will continue to use the current naming process.
> >
> > > 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
> > > to be orthogonal, and thus KIP-372 should not change any processor
> > > names, but KIP-307 should define a holistic strategy for all processor.
> > > Otherwise, we might up with different strategies or revert what we
> > > decide in this KIP if it's not compatible with KIP-307.
> >
> > I agree and have updated the name of this KIP to be more precise.
> >
> > Eno
> >
> > > I know we don't normally have a "Related work" section in KIPs, but
> > > sometimes I find it useful to see what others have done in similar
> cases.
> > > Since this will be important for rolling re-deployments, I wonder what
> > > other frameworks like Flink (or Samza) have done in these cases.
> Perhaps
> > > they have done nothing, in which case it's fine to do this from first
> > > principles, but IMO it would be good to know just to make sure we're
> > > heading in the right direction.
> >
> > > Also I don't get a good feel for how much work this will be for an end
> > user
> > > who is doing the rolling deployment, perhaps an end-to-end example
> would
> > > help.
> >
> > I can't speak for the other projects, so I'll have to take a look and
> see.
> >
> > As for getting a feel for how much work this will be for an end user,
> could
> > you look at the
> > updated KIP and see if it clears things up?
> >
> >
> > Matthias
> >
> > > (1) For `Grouped` should we add `with(String name, Serde key,
> > > Serde value)` to allow specifying all parameters at once?
> >
> > > Produced/Consumed/Serialized etc follow a similar pattern. There is one
> > > static method for each config parameter, plus one static method that
> > > accepts all parameters. Might be more consistent if we follow this
> > pattern.
> >
> > > (2) It seems that `Serialized` is only used in `groupBy` and
> > > `groupByKey` -- because both methods accepting `Serialized` parameter
> > > are deprecated and replaced with methods accepting `Grouped`, it seems
> > > that we also want to deprecate `Serialized`?
> >
> > > (3) About naming repartition topics: thinking about this once more, I
> > > actually prefer to use `left|right` instead of `this|other` :)
> >
> > For 1, I agree and I'll update the KIP.
> >
> > For 2, Yes that seems to make sense and I'll update the KIP
> >
> > Part 3 addressed in earlier comments and the 

Re: [DISCUSS] KIP-372: Naming Joins and Grouping

2018-09-14 Thread John Roesler
Hey Bill (and Guozhang),

Apologies. I misunderstood what Guozhang was getting at in his earlier
remark.

I'm a +1 on the current proposal.

Thanks,
-John

On Fri, Sep 14, 2018 at 3:10 PM Bill Bejeck  wrote:

> All,
>
> Thanks for the comments, I'll respond to all of your comments in the
> recieved order.
>
>
> Guozhang,
>
> >  And if [join-name] not specified, stay the same, which is:
> > * [previous-processor-name]-repartition for both Stream-Stream (S-S) join
> and S-T join
>
>   I believe the current approach is
> [appId]-[previous-processor-name]-repartition, but the main point is if no
> name is provided, the current naming scheme will remain in place
>
> >  I think it is more natural to rename it to
> >  * [app-id]-[join-name]-left-repartition
> >  * [app-id]-[join-name]-right-repartition
>
> I agree and I'll update the KIP accordingly
>
> > 2 I'd suggest to use the name to also define the corresponding processor
>
> I'll address this in conjunction with comments from Matthias and your
> second round of comments.
>
>
> >3. Could you also specify how this would affect the optimization for
>
>  merging multiple repartition topics?
>
>   With an optimization where we reduce the number of repartition
> topics, if the user has not named the repartition topics, we'll generate a
> name for
>   the optimized repartition topic. In the case where users have
> provided names for the repartition topics,
>   we'll reuse the name from the first of the named repartition
> topics.
>   However, optimizing a topology by reducing the number of
> repartition topics may require an application reset, but that is beyond the
> scope of this KIP.
>
>   I'll update the KIP with this case
>
>   I've also addressed 4.a and 4.b in the "Compatibility,
> Deprecation and Migration plan" section
>
>
> Matthias
>
> > 1) We should either use `[app-id]-this|other-[join-name]-repartition` or
> > `app-id]-[join-name]-left|right-repartition` but we should not change
> > the pattern depending if the user specifies a name of not.
>
> I've updated the KIP to say in the case of Joins and a name is provided
> we'll go with `app-id-name-left|right-repartition` in all cases if the name
> is not provided
> we will continue to use the current naming process.
>
> > 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
> > to be orthogonal, and thus KIP-372 should not change any processor
> > names, but KIP-307 should define a holistic strategy for all processor.
> > Otherwise, we might up with different strategies or revert what we
> > decide in this KIP if it's not compatible with KIP-307.
>
> I agree and have updated the name of this KIP to be more precise.
>
> Eno
>
> > I know we don't normally have a "Related work" section in KIPs, but
> > sometimes I find it useful to see what others have done in similar cases.
> > Since this will be important for rolling re-deployments, I wonder what
> > other frameworks like Flink (or Samza) have done in these cases. Perhaps
> > they have done nothing, in which case it's fine to do this from first
> > principles, but IMO it would be good to know just to make sure we're
> > heading in the right direction.
>
> > Also I don't get a good feel for how much work this will be for an end
> user
> > who is doing the rolling deployment, perhaps an end-to-end example would
> > help.
>
> I can't speak for the other projects, so I'll have to take a look and see.
>
> As for getting a feel for how much work this will be for an end user, could
> you look at the
> updated KIP and see if it clears things up?
>
>
> Matthias
>
> > (1) For `Grouped` should we add `with(String name, Serde key,
> > Serde value)` to allow specifying all parameters at once?
>
> > Produced/Consumed/Serialized etc follow a similar pattern. There is one
> > static method for each config parameter, plus one static method that
> > accepts all parameters. Might be more consistent if we follow this
> pattern.
>
> > (2) It seems that `Serialized` is only used in `groupBy` and
> > `groupByKey` -- because both methods accepting `Serialized` parameter
> > are deprecated and replaced with methods accepting `Grouped`, it seems
> > that we also want to deprecate `Serialized`?
>
> > (3) About naming repartition topics: thinking about this once more, I
> > actually prefer to use `left|right` instead of `this|other` :)
>
> For 1, I agree and I'll update the KIP.
>
> For 2, Yes that seems to make sense and I'll update the KIP
>
> Part 3 addressed in earlier comments and the KIP is clarified.
>
> Guozhang
>
> > Just to clarify on 2): currently KIP-307 do not have proposed APIs for
> > `groupBy/groupByKey` naming schemes, and for joins its current proposal
> is
> > to extend ValueJoiner with Named and hence this part is what I meant to
> > have "overlaps".
>
> > Thinking about it a bit more, since Joined is only used for S-S and S-T
> > joins but not T-T joins, having the naming schemes 

Re: [DISCUSS] KIP-372: Naming Joins and Grouping

2018-09-14 Thread Bill Bejeck
All,

Thanks for the comments, I'll respond to all of your comments in the
recieved order.


Guozhang,

>  And if [join-name] not specified, stay the same, which is:
> * [previous-processor-name]-repartition for both Stream-Stream (S-S) join
and S-T join

  I believe the current approach is
[appId]-[previous-processor-name]-repartition, but the main point is if no
name is provided, the current naming scheme will remain in place

>  I think it is more natural to rename it to
>  * [app-id]-[join-name]-left-repartition
>  * [app-id]-[join-name]-right-repartition

I agree and I'll update the KIP accordingly

> 2 I'd suggest to use the name to also define the corresponding processor

I'll address this in conjunction with comments from Matthias and your
second round of comments.


>3. Could you also specify how this would affect the optimization for

 merging multiple repartition topics?

  With an optimization where we reduce the number of repartition
topics, if the user has not named the repartition topics, we'll generate a
name for
  the optimized repartition topic. In the case where users have
provided names for the repartition topics,
  we'll reuse the name from the first of the named repartition
topics.
  However, optimizing a topology by reducing the number of
repartition topics may require an application reset, but that is beyond the
scope of this KIP.

  I'll update the KIP with this case

  I've also addressed 4.a and 4.b in the "Compatibility,
Deprecation and Migration plan" section


Matthias

> 1) We should either use `[app-id]-this|other-[join-name]-repartition` or
> `app-id]-[join-name]-left|right-repartition` but we should not change
> the pattern depending if the user specifies a name of not.

I've updated the KIP to say in the case of Joins and a name is provided
we'll go with `app-id-name-left|right-repartition` in all cases if the name
is not provided
we will continue to use the current naming process.

> 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
> to be orthogonal, and thus KIP-372 should not change any processor
> names, but KIP-307 should define a holistic strategy for all processor.
> Otherwise, we might up with different strategies or revert what we
> decide in this KIP if it's not compatible with KIP-307.

I agree and have updated the name of this KIP to be more precise.

Eno

> I know we don't normally have a "Related work" section in KIPs, but
> sometimes I find it useful to see what others have done in similar cases.
> Since this will be important for rolling re-deployments, I wonder what
> other frameworks like Flink (or Samza) have done in these cases. Perhaps
> they have done nothing, in which case it's fine to do this from first
> principles, but IMO it would be good to know just to make sure we're
> heading in the right direction.

> Also I don't get a good feel for how much work this will be for an end
user
> who is doing the rolling deployment, perhaps an end-to-end example would
> help.

I can't speak for the other projects, so I'll have to take a look and see.

As for getting a feel for how much work this will be for an end user, could
you look at the
updated KIP and see if it clears things up?


Matthias

> (1) For `Grouped` should we add `with(String name, Serde key,
> Serde value)` to allow specifying all parameters at once?

> Produced/Consumed/Serialized etc follow a similar pattern. There is one
> static method for each config parameter, plus one static method that
> accepts all parameters. Might be more consistent if we follow this
pattern.

> (2) It seems that `Serialized` is only used in `groupBy` and
> `groupByKey` -- because both methods accepting `Serialized` parameter
> are deprecated and replaced with methods accepting `Grouped`, it seems
> that we also want to deprecate `Serialized`?

> (3) About naming repartition topics: thinking about this once more, I
> actually prefer to use `left|right` instead of `this|other` :)

For 1, I agree and I'll update the KIP.

For 2, Yes that seems to make sense and I'll update the KIP

Part 3 addressed in earlier comments and the KIP is clarified.

Guozhang

> Just to clarify on 2): currently KIP-307 do not have proposed APIs for
> `groupBy/groupByKey` naming schemes, and for joins its current proposal is
> to extend ValueJoiner with Named and hence this part is what I meant to
> have "overlaps".

> Thinking about it a bit more, since Joined is only used for S-S and S-T
> joins but not T-T joins, having the naming schemes on Joined would not be
> sufficient, and extending ValueJoiner would indeed be a good choice.

> As for groupBy, since it is using KeyValueMapper which is supposed to be
> extended with Named in KIP-307, it does not require extending to processor
> nodes as well.

> Given this, I'm fine with limiting the scope to only repartition topics.

I agree with limiting the scope of this KIP to only repartition topics and

Re: [DISCUSS] KIP-372: Naming Joins and Grouping

2018-09-14 Thread Guozhang Wang
Hello John,

Not sure if I completely understand your email above. Are you suggesting to
still use the proposed Joined / Grouped object to indicate the underlying
processor names *in addition to* the repartition topic names?

My reasoning is that, if we do not want to use the proposed names to
indicate underlying processor names, only the repartition topic names, then
the current proposal looks good since KTable#join would not introduce
repartition topics at all. AND regarding underlying processor names, it is
suggested that we should only consider it in KIP-307 separately. Are you
suggesting that we should consider them together?


Guozhang

On Thu, Sep 13, 2018 at 3:49 PM, John Roesler  wrote:

> Hey all,
>
> I think it's slightly out of scope for this KIP, but I'm not sure it's
> right to add a name to ValueJoiner or KeyValueMapper.
> Both of those are "functional interfaces", that is, they are basically
> named functions.
>
> It seems like we should preserve this property both to provide a clean
> separation between the operation *logic* and the operator *config*.
>
> It's a good point that `Joined` doesn't appear in the KTable join. However,
> KTable join does have the variant that accepts "Materialized".
> This makes it similar to the grouping operators that currently accept
> "Serialized", namely the operator is already configurable, but only in
> a narrow way (we can only configure the materialization of the table or the
> serialization of the grouping).
>
> Consider as an alternative the KStream.join operation that takes a config
> object called "Joined". This implies no more than that we are
> configuring the join operation. If we are deciding to allow adding a name
> to the operation, it's natural to add it to the operation's config.
>
> Conversely, we also want to name the KTable.join operation, not the
> materialization itself (which already has a name with separate semantics).
>
> To me, this suggests that, just like we propose to subsume the Serialized
> config for grouping into a more general config called Grouped, we should
> also want to do something similar and replace `KTable#join(KTable<>,
> ValueJoiner<>, Materialized<>)` with one taking a more general
> configuration, maybe:
> `KTable#join(KTable<>, ValueJoiner<>, TableJoined<>)`?
>
> Does this make sense?
>
> Thanks,
> -John
>
> On Thu, Sep 13, 2018 at 3:45 PM Guozhang Wang  wrote:
>
> > Just to clarify on 2): currently KIP-307 do not have proposed APIs for
> > `groupBy/groupByKey` naming schemes, and for joins its current proposal
> is
> > to extend ValueJoiner with Named and hence this part is what I meant to
> > have "overlaps".
> >
> > Thinking about it a bit more, since Joined is only used for S-S and S-T
> > joins but not T-T joins, having the naming schemes on Joined would not be
> > sufficient, and extending ValueJoiner would indeed be a good choice.
> >
> > As for groupBy, since it is using KeyValueMapper which is supposed to be
> > extended with Named in KIP-307, it does not require extending to
> processor
> > nodes as well.
> >
> >
> > Given this, I'm fine with limiting the scope to only repartition topics.
> >
> >
> > Guozhang
> >
> > On Wed, Sep 12, 2018 at 10:22 PM, Matthias J. Sax  >
> > wrote:
> >
> > > Follow up comments:
> > >
> > > 1) We should either use `[app-id]-this|other-[join-name]-repartition`
> or
> > > `app-id]-[join-name]-left|right-repartition` but we should not change
> > > the pattern depending if the user specifies a name of not. I am fine
> > > with both patterns---just want to make sure with stick with one.
> > >
> > > 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
> > > to be orthogonal, and thus KIP-372 should not change any processor
> > > names, but KIP-307 should define a holistic strategy for all processor.
> > > Otherwise, we might up with different strategies or revert what we
> > > decide in this KIP if it's not compatible with KIP-307.
> > >
> > >
> > > -Matthias
> > >
> > >
> > > On 9/12/18 6:28 PM, Guozhang Wang wrote:
> > > > Hello Bill,
> > > >
> > > > I made a pass over your proposal and here are some questions:
> > > >
> > > > 1. For Joined names, the current proposal is to define the
> repartition
> > > > topic names as
> > > >
> > > > * [app-id]-this-[join-name]-repartition
> > > >
> > > > * [app-id]-other-[join-name]-repartition
> > > >
> > > >
> > > > And if [join-name] not specified, stay the same, which is:
> > > >
> > > > * [previous-processor-name]-repartition for both Stream-Stream (S-S)
> > > join
> > > > and S-T join
> > > >
> > > > I think it is more natural to rename it to
> > > >
> > > > * [app-id]-[join-name]-left-repartition
> > > >
> > > > * [app-id]-[join-name]-right-repartition
> > > >
> > > >
> > > > 2. I'd suggest to use the name to also define the corresponding
> > processor
> > > > names accordingly, in addition to the repartition topic names. Note
> > that
> > > > for joins, this may be overlapping with KIP-307
> > > > 

Re: [DISCUSS] KIP-372: Naming Joins and Grouping

2018-09-13 Thread John Roesler
Hey all,

I think it's slightly out of scope for this KIP, but I'm not sure it's
right to add a name to ValueJoiner or KeyValueMapper.
Both of those are "functional interfaces", that is, they are basically
named functions.

It seems like we should preserve this property both to provide a clean
separation between the operation *logic* and the operator *config*.

It's a good point that `Joined` doesn't appear in the KTable join. However,
KTable join does have the variant that accepts "Materialized".
This makes it similar to the grouping operators that currently accept
"Serialized", namely the operator is already configurable, but only in
a narrow way (we can only configure the materialization of the table or the
serialization of the grouping).

Consider as an alternative the KStream.join operation that takes a config
object called "Joined". This implies no more than that we are
configuring the join operation. If we are deciding to allow adding a name
to the operation, it's natural to add it to the operation's config.

Conversely, we also want to name the KTable.join operation, not the
materialization itself (which already has a name with separate semantics).

To me, this suggests that, just like we propose to subsume the Serialized
config for grouping into a more general config called Grouped, we should
also want to do something similar and replace `KTable#join(KTable<>,
ValueJoiner<>, Materialized<>)` with one taking a more general
configuration, maybe:
`KTable#join(KTable<>, ValueJoiner<>, TableJoined<>)`?

Does this make sense?

Thanks,
-John

On Thu, Sep 13, 2018 at 3:45 PM Guozhang Wang  wrote:

> Just to clarify on 2): currently KIP-307 do not have proposed APIs for
> `groupBy/groupByKey` naming schemes, and for joins its current proposal is
> to extend ValueJoiner with Named and hence this part is what I meant to
> have "overlaps".
>
> Thinking about it a bit more, since Joined is only used for S-S and S-T
> joins but not T-T joins, having the naming schemes on Joined would not be
> sufficient, and extending ValueJoiner would indeed be a good choice.
>
> As for groupBy, since it is using KeyValueMapper which is supposed to be
> extended with Named in KIP-307, it does not require extending to processor
> nodes as well.
>
>
> Given this, I'm fine with limiting the scope to only repartition topics.
>
>
> Guozhang
>
> On Wed, Sep 12, 2018 at 10:22 PM, Matthias J. Sax 
> wrote:
>
> > Follow up comments:
> >
> > 1) We should either use `[app-id]-this|other-[join-name]-repartition` or
> > `app-id]-[join-name]-left|right-repartition` but we should not change
> > the pattern depending if the user specifies a name of not. I am fine
> > with both patterns---just want to make sure with stick with one.
> >
> > 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
> > to be orthogonal, and thus KIP-372 should not change any processor
> > names, but KIP-307 should define a holistic strategy for all processor.
> > Otherwise, we might up with different strategies or revert what we
> > decide in this KIP if it's not compatible with KIP-307.
> >
> >
> > -Matthias
> >
> >
> > On 9/12/18 6:28 PM, Guozhang Wang wrote:
> > > Hello Bill,
> > >
> > > I made a pass over your proposal and here are some questions:
> > >
> > > 1. For Joined names, the current proposal is to define the repartition
> > > topic names as
> > >
> > > * [app-id]-this-[join-name]-repartition
> > >
> > > * [app-id]-other-[join-name]-repartition
> > >
> > >
> > > And if [join-name] not specified, stay the same, which is:
> > >
> > > * [previous-processor-name]-repartition for both Stream-Stream (S-S)
> > join
> > > and S-T join
> > >
> > > I think it is more natural to rename it to
> > >
> > > * [app-id]-[join-name]-left-repartition
> > >
> > > * [app-id]-[join-name]-right-repartition
> > >
> > >
> > > 2. I'd suggest to use the name to also define the corresponding
> processor
> > > names accordingly, in addition to the repartition topic names. Note
> that
> > > for joins, this may be overlapping with KIP-307
> > >  > 307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL>
> > > as
> > > it also have proposals for defining processor names for join operators
> as
> > > well.
> > >
> > > 3. Could you also specify how this would affect the optimization for
> > > merging multiple repartition topics?
> > >
> > > 4. In the "Compatibility, Deprecation, and Migration Plan" section,
> could
> > > you also mention the following scenarios, if any of the upgrade path
> > would
> > > be changed:
> > >
> > >  a) changing user DSL code: under which scenarios users can now do a
> > > rolling bounce instead of resetting applications.
> > >
> > >  b) upgrading from older version to new version, with all the names
> > > specified, and with optimization turned on. E.g. say we have the code
> > > written in 2.1 with all names specified, and now upgrading to 2.2 with
> > new
> > > 

Re: [DISCUSS] KIP-372: Naming Joins and Grouping

2018-09-13 Thread Guozhang Wang
Just to clarify on 2): currently KIP-307 do not have proposed APIs for
`groupBy/groupByKey` naming schemes, and for joins its current proposal is
to extend ValueJoiner with Named and hence this part is what I meant to
have "overlaps".

Thinking about it a bit more, since Joined is only used for S-S and S-T
joins but not T-T joins, having the naming schemes on Joined would not be
sufficient, and extending ValueJoiner would indeed be a good choice.

As for groupBy, since it is using KeyValueMapper which is supposed to be
extended with Named in KIP-307, it does not require extending to processor
nodes as well.


Given this, I'm fine with limiting the scope to only repartition topics.


Guozhang

On Wed, Sep 12, 2018 at 10:22 PM, Matthias J. Sax 
wrote:

> Follow up comments:
>
> 1) We should either use `[app-id]-this|other-[join-name]-repartition` or
> `app-id]-[join-name]-left|right-repartition` but we should not change
> the pattern depending if the user specifies a name of not. I am fine
> with both patterns---just want to make sure with stick with one.
>
> 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
> to be orthogonal, and thus KIP-372 should not change any processor
> names, but KIP-307 should define a holistic strategy for all processor.
> Otherwise, we might up with different strategies or revert what we
> decide in this KIP if it's not compatible with KIP-307.
>
>
> -Matthias
>
>
> On 9/12/18 6:28 PM, Guozhang Wang wrote:
> > Hello Bill,
> >
> > I made a pass over your proposal and here are some questions:
> >
> > 1. For Joined names, the current proposal is to define the repartition
> > topic names as
> >
> > * [app-id]-this-[join-name]-repartition
> >
> > * [app-id]-other-[join-name]-repartition
> >
> >
> > And if [join-name] not specified, stay the same, which is:
> >
> > * [previous-processor-name]-repartition for both Stream-Stream (S-S)
> join
> > and S-T join
> >
> > I think it is more natural to rename it to
> >
> > * [app-id]-[join-name]-left-repartition
> >
> > * [app-id]-[join-name]-right-repartition
> >
> >
> > 2. I'd suggest to use the name to also define the corresponding processor
> > names accordingly, in addition to the repartition topic names. Note that
> > for joins, this may be overlapping with KIP-307
> >  307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL>
> > as
> > it also have proposals for defining processor names for join operators as
> > well.
> >
> > 3. Could you also specify how this would affect the optimization for
> > merging multiple repartition topics?
> >
> > 4. In the "Compatibility, Deprecation, and Migration Plan" section, could
> > you also mention the following scenarios, if any of the upgrade path
> would
> > be changed:
> >
> >  a) changing user DSL code: under which scenarios users can now do a
> > rolling bounce instead of resetting applications.
> >
> >  b) upgrading from older version to new version, with all the names
> > specified, and with optimization turned on. E.g. say we have the code
> > written in 2.1 with all names specified, and now upgrading to 2.2 with
> new
> > optimizations that may potentially change the repartition topics. Is that
> > always safe to do?
> >
> >
> >
> > Guozhang
> >
> >
> > On Wed, Sep 12, 2018 at 4:52 PM, Bill Bejeck  wrote:
> >
> >> All I'd like to start a discussion on KIP-372 for the naming of joins
> and
> >> grouping operations in Kafka Streams.
> >>
> >> The KIP page can be found here:
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> >> 372%3A+Naming+Joins+and+Grouping
> >>
> >> I look forward to feedback and comments.
> >>
> >> Thanks,
> >> Bill
> >>
> >
> >
> >
>
>


-- 
-- Guozhang


Re: [DISCUSS] KIP-372: Naming Joins and Grouping

2018-09-13 Thread Matthias J. Sax
Three more comments:

(1) For `Grouped` should we add `with(String name, Serde key,
Serde value)` to allow specifying all parameters at once?

Produced/Consumed/Serialized etc follow a similar pattern. There is one
static method for each config parameter, plus one static method that
accepts all parameters. Might be more consistent if we follow this pattern.

(2) It seems that `Serialized` is only used in `groupBy` and
`groupByKey` -- because both methods accepting `Serialized` parameter
are deprecated and replaced with methods accepting `Grouped`, it seems
that we also want to deprecate `Serialized`?

(3) About naming repartition topics: thinking about this once more, I
actually prefer to use `left|right` instead of `this|other` :)


-Matthias


On 9/13/18 6:45 AM, Matthias J. Sax wrote:
> I don't know what Samza does, however, Flink requires users to specify
> names similar to this proposal to be able to re-identify state in case
> the topology gets altered between deployments.
> 
> Flink only has state they need to worry about. For Kafka Streams, it's
> state plus repartition topics.
> 
> 
> -Matthias
> 
> On 9/13/18 1:48 AM, Eno Thereska wrote:
>> Hi folks,
>>
>> I know we don't normally have a "Related work" section in KIPs, but
>> sometimes I find it useful to see what others have done in similar cases.
>> Since this will be important for rolling re-deployments, I wonder what
>> other frameworks like Flink (or Samza) have done in these cases. Perhaps
>> they have done nothing, in which case it's fine to do this from first
>> principles, but IMO it would be good to know just to make sure we're
>> heading in the right direction.
>>
>> Also I don't get a good feel for how much work this will be for an end user
>> who is doing the rolling deployment, perhaps an end-to-end example would
>> help.
>>
>> Thanks
>> Eno
>>
>> On Thu, Sep 13, 2018 at 6:22 AM, Matthias J. Sax 
>> wrote:
>>
>>> Follow up comments:
>>>
>>> 1) We should either use `[app-id]-this|other-[join-name]-repartition` or
>>> `app-id]-[join-name]-left|right-repartition` but we should not change
>>> the pattern depending if the user specifies a name of not. I am fine
>>> with both patterns---just want to make sure with stick with one.
>>>
>>> 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
>>> to be orthogonal, and thus KIP-372 should not change any processor
>>> names, but KIP-307 should define a holistic strategy for all processor.
>>> Otherwise, we might up with different strategies or revert what we
>>> decide in this KIP if it's not compatible with KIP-307.
>>>
>>>
>>> -Matthias
>>>
>>>
>>> On 9/12/18 6:28 PM, Guozhang Wang wrote:
 Hello Bill,

 I made a pass over your proposal and here are some questions:

 1. For Joined names, the current proposal is to define the repartition
 topic names as

 * [app-id]-this-[join-name]-repartition

 * [app-id]-other-[join-name]-repartition


 And if [join-name] not specified, stay the same, which is:

 * [previous-processor-name]-repartition for both Stream-Stream (S-S)
>>> join
 and S-T join

 I think it is more natural to rename it to

 * [app-id]-[join-name]-left-repartition

 * [app-id]-[join-name]-right-repartition


 2. I'd suggest to use the name to also define the corresponding processor
 names accordingly, in addition to the repartition topic names. Note that
 for joins, this may be overlapping with KIP-307
 >> 307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL>
 as
 it also have proposals for defining processor names for join operators as
 well.

 3. Could you also specify how this would affect the optimization for
 merging multiple repartition topics?

 4. In the "Compatibility, Deprecation, and Migration Plan" section, could
 you also mention the following scenarios, if any of the upgrade path
>>> would
 be changed:

  a) changing user DSL code: under which scenarios users can now do a
 rolling bounce instead of resetting applications.

  b) upgrading from older version to new version, with all the names
 specified, and with optimization turned on. E.g. say we have the code
 written in 2.1 with all names specified, and now upgrading to 2.2 with
>>> new
 optimizations that may potentially change the repartition topics. Is that
 always safe to do?



 Guozhang


 On Wed, Sep 12, 2018 at 4:52 PM, Bill Bejeck  wrote:

> All I'd like to start a discussion on KIP-372 for the naming of joins
>>> and
> grouping operations in Kafka Streams.
>
> The KIP page can be found here:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 372%3A+Naming+Joins+and+Grouping
>
> I look forward to feedback and comments.
>
> Thanks,
> Bill
>



Re: [DISCUSS] KIP-372: Naming Joins and Grouping

2018-09-13 Thread Matthias J. Sax
I don't know what Samza does, however, Flink requires users to specify
names similar to this proposal to be able to re-identify state in case
the topology gets altered between deployments.

Flink only has state they need to worry about. For Kafka Streams, it's
state plus repartition topics.


-Matthias

On 9/13/18 1:48 AM, Eno Thereska wrote:
> Hi folks,
> 
> I know we don't normally have a "Related work" section in KIPs, but
> sometimes I find it useful to see what others have done in similar cases.
> Since this will be important for rolling re-deployments, I wonder what
> other frameworks like Flink (or Samza) have done in these cases. Perhaps
> they have done nothing, in which case it's fine to do this from first
> principles, but IMO it would be good to know just to make sure we're
> heading in the right direction.
> 
> Also I don't get a good feel for how much work this will be for an end user
> who is doing the rolling deployment, perhaps an end-to-end example would
> help.
> 
> Thanks
> Eno
> 
> On Thu, Sep 13, 2018 at 6:22 AM, Matthias J. Sax 
> wrote:
> 
>> Follow up comments:
>>
>> 1) We should either use `[app-id]-this|other-[join-name]-repartition` or
>> `app-id]-[join-name]-left|right-repartition` but we should not change
>> the pattern depending if the user specifies a name of not. I am fine
>> with both patterns---just want to make sure with stick with one.
>>
>> 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
>> to be orthogonal, and thus KIP-372 should not change any processor
>> names, but KIP-307 should define a holistic strategy for all processor.
>> Otherwise, we might up with different strategies or revert what we
>> decide in this KIP if it's not compatible with KIP-307.
>>
>>
>> -Matthias
>>
>>
>> On 9/12/18 6:28 PM, Guozhang Wang wrote:
>>> Hello Bill,
>>>
>>> I made a pass over your proposal and here are some questions:
>>>
>>> 1. For Joined names, the current proposal is to define the repartition
>>> topic names as
>>>
>>> * [app-id]-this-[join-name]-repartition
>>>
>>> * [app-id]-other-[join-name]-repartition
>>>
>>>
>>> And if [join-name] not specified, stay the same, which is:
>>>
>>> * [previous-processor-name]-repartition for both Stream-Stream (S-S)
>> join
>>> and S-T join
>>>
>>> I think it is more natural to rename it to
>>>
>>> * [app-id]-[join-name]-left-repartition
>>>
>>> * [app-id]-[join-name]-right-repartition
>>>
>>>
>>> 2. I'd suggest to use the name to also define the corresponding processor
>>> names accordingly, in addition to the repartition topic names. Note that
>>> for joins, this may be overlapping with KIP-307
>>> > 307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL>
>>> as
>>> it also have proposals for defining processor names for join operators as
>>> well.
>>>
>>> 3. Could you also specify how this would affect the optimization for
>>> merging multiple repartition topics?
>>>
>>> 4. In the "Compatibility, Deprecation, and Migration Plan" section, could
>>> you also mention the following scenarios, if any of the upgrade path
>> would
>>> be changed:
>>>
>>>  a) changing user DSL code: under which scenarios users can now do a
>>> rolling bounce instead of resetting applications.
>>>
>>>  b) upgrading from older version to new version, with all the names
>>> specified, and with optimization turned on. E.g. say we have the code
>>> written in 2.1 with all names specified, and now upgrading to 2.2 with
>> new
>>> optimizations that may potentially change the repartition topics. Is that
>>> always safe to do?
>>>
>>>
>>>
>>> Guozhang
>>>
>>>
>>> On Wed, Sep 12, 2018 at 4:52 PM, Bill Bejeck  wrote:
>>>
 All I'd like to start a discussion on KIP-372 for the naming of joins
>> and
 grouping operations in Kafka Streams.

 The KIP page can be found here:
 https://cwiki.apache.org/confluence/display/KAFKA/KIP-
 372%3A+Naming+Joins+and+Grouping

 I look forward to feedback and comments.

 Thanks,
 Bill

>>>
>>>
>>>
>>
>>
> 



signature.asc
Description: OpenPGP digital signature


Re: [DISCUSS] KIP-372: Naming Joins and Grouping

2018-09-13 Thread Eno Thereska
Hi folks,

I know we don't normally have a "Related work" section in KIPs, but
sometimes I find it useful to see what others have done in similar cases.
Since this will be important for rolling re-deployments, I wonder what
other frameworks like Flink (or Samza) have done in these cases. Perhaps
they have done nothing, in which case it's fine to do this from first
principles, but IMO it would be good to know just to make sure we're
heading in the right direction.

Also I don't get a good feel for how much work this will be for an end user
who is doing the rolling deployment, perhaps an end-to-end example would
help.

Thanks
Eno

On Thu, Sep 13, 2018 at 6:22 AM, Matthias J. Sax 
wrote:

> Follow up comments:
>
> 1) We should either use `[app-id]-this|other-[join-name]-repartition` or
> `app-id]-[join-name]-left|right-repartition` but we should not change
> the pattern depending if the user specifies a name of not. I am fine
> with both patterns---just want to make sure with stick with one.
>
> 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
> to be orthogonal, and thus KIP-372 should not change any processor
> names, but KIP-307 should define a holistic strategy for all processor.
> Otherwise, we might up with different strategies or revert what we
> decide in this KIP if it's not compatible with KIP-307.
>
>
> -Matthias
>
>
> On 9/12/18 6:28 PM, Guozhang Wang wrote:
> > Hello Bill,
> >
> > I made a pass over your proposal and here are some questions:
> >
> > 1. For Joined names, the current proposal is to define the repartition
> > topic names as
> >
> > * [app-id]-this-[join-name]-repartition
> >
> > * [app-id]-other-[join-name]-repartition
> >
> >
> > And if [join-name] not specified, stay the same, which is:
> >
> > * [previous-processor-name]-repartition for both Stream-Stream (S-S)
> join
> > and S-T join
> >
> > I think it is more natural to rename it to
> >
> > * [app-id]-[join-name]-left-repartition
> >
> > * [app-id]-[join-name]-right-repartition
> >
> >
> > 2. I'd suggest to use the name to also define the corresponding processor
> > names accordingly, in addition to the repartition topic names. Note that
> > for joins, this may be overlapping with KIP-307
> >  307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL>
> > as
> > it also have proposals for defining processor names for join operators as
> > well.
> >
> > 3. Could you also specify how this would affect the optimization for
> > merging multiple repartition topics?
> >
> > 4. In the "Compatibility, Deprecation, and Migration Plan" section, could
> > you also mention the following scenarios, if any of the upgrade path
> would
> > be changed:
> >
> >  a) changing user DSL code: under which scenarios users can now do a
> > rolling bounce instead of resetting applications.
> >
> >  b) upgrading from older version to new version, with all the names
> > specified, and with optimization turned on. E.g. say we have the code
> > written in 2.1 with all names specified, and now upgrading to 2.2 with
> new
> > optimizations that may potentially change the repartition topics. Is that
> > always safe to do?
> >
> >
> >
> > Guozhang
> >
> >
> > On Wed, Sep 12, 2018 at 4:52 PM, Bill Bejeck  wrote:
> >
> >> All I'd like to start a discussion on KIP-372 for the naming of joins
> and
> >> grouping operations in Kafka Streams.
> >>
> >> The KIP page can be found here:
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> >> 372%3A+Naming+Joins+and+Grouping
> >>
> >> I look forward to feedback and comments.
> >>
> >> Thanks,
> >> Bill
> >>
> >
> >
> >
>
>


Re: [DISCUSS] KIP-372: Naming Joins and Grouping

2018-09-12 Thread Matthias J. Sax
Follow up comments:

1) We should either use `[app-id]-this|other-[join-name]-repartition` or
`app-id]-[join-name]-left|right-repartition` but we should not change
the pattern depending if the user specifies a name of not. I am fine
with both patterns---just want to make sure with stick with one.

2) I didn't see why we would need to do this in this KIP. KIP-307 seems
to be orthogonal, and thus KIP-372 should not change any processor
names, but KIP-307 should define a holistic strategy for all processor.
Otherwise, we might up with different strategies or revert what we
decide in this KIP if it's not compatible with KIP-307.


-Matthias


On 9/12/18 6:28 PM, Guozhang Wang wrote:
> Hello Bill,
> 
> I made a pass over your proposal and here are some questions:
> 
> 1. For Joined names, the current proposal is to define the repartition
> topic names as
> 
> * [app-id]-this-[join-name]-repartition
> 
> * [app-id]-other-[join-name]-repartition
> 
> 
> And if [join-name] not specified, stay the same, which is:
> 
> * [previous-processor-name]-repartition for both Stream-Stream (S-S) join
> and S-T join
> 
> I think it is more natural to rename it to
> 
> * [app-id]-[join-name]-left-repartition
> 
> * [app-id]-[join-name]-right-repartition
> 
> 
> 2. I'd suggest to use the name to also define the corresponding processor
> names accordingly, in addition to the repartition topic names. Note that
> for joins, this may be overlapping with KIP-307
> 
> as
> it also have proposals for defining processor names for join operators as
> well.
> 
> 3. Could you also specify how this would affect the optimization for
> merging multiple repartition topics?
> 
> 4. In the "Compatibility, Deprecation, and Migration Plan" section, could
> you also mention the following scenarios, if any of the upgrade path would
> be changed:
> 
>  a) changing user DSL code: under which scenarios users can now do a
> rolling bounce instead of resetting applications.
> 
>  b) upgrading from older version to new version, with all the names
> specified, and with optimization turned on. E.g. say we have the code
> written in 2.1 with all names specified, and now upgrading to 2.2 with new
> optimizations that may potentially change the repartition topics. Is that
> always safe to do?
> 
> 
> 
> Guozhang
> 
> 
> On Wed, Sep 12, 2018 at 4:52 PM, Bill Bejeck  wrote:
> 
>> All I'd like to start a discussion on KIP-372 for the naming of joins and
>> grouping operations in Kafka Streams.
>>
>> The KIP page can be found here:
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>> 372%3A+Naming+Joins+and+Grouping
>>
>> I look forward to feedback and comments.
>>
>> Thanks,
>> Bill
>>
> 
> 
> 



signature.asc
Description: OpenPGP digital signature


Re: [DISCUSS] KIP-372: Naming Joins and Grouping

2018-09-12 Thread Guozhang Wang
Hello Bill,

I made a pass over your proposal and here are some questions:

1. For Joined names, the current proposal is to define the repartition
topic names as

* [app-id]-this-[join-name]-repartition

* [app-id]-other-[join-name]-repartition


And if [join-name] not specified, stay the same, which is:

* [previous-processor-name]-repartition for both Stream-Stream (S-S) join
and S-T join

I think it is more natural to rename it to

* [app-id]-[join-name]-left-repartition

* [app-id]-[join-name]-right-repartition


2. I'd suggest to use the name to also define the corresponding processor
names accordingly, in addition to the repartition topic names. Note that
for joins, this may be overlapping with KIP-307

as
it also have proposals for defining processor names for join operators as
well.

3. Could you also specify how this would affect the optimization for
merging multiple repartition topics?

4. In the "Compatibility, Deprecation, and Migration Plan" section, could
you also mention the following scenarios, if any of the upgrade path would
be changed:

 a) changing user DSL code: under which scenarios users can now do a
rolling bounce instead of resetting applications.

 b) upgrading from older version to new version, with all the names
specified, and with optimization turned on. E.g. say we have the code
written in 2.1 with all names specified, and now upgrading to 2.2 with new
optimizations that may potentially change the repartition topics. Is that
always safe to do?



Guozhang


On Wed, Sep 12, 2018 at 4:52 PM, Bill Bejeck  wrote:

> All I'd like to start a discussion on KIP-372 for the naming of joins and
> grouping operations in Kafka Streams.
>
> The KIP page can be found here:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 372%3A+Naming+Joins+and+Grouping
>
> I look forward to feedback and comments.
>
> Thanks,
> Bill
>



-- 
-- Guozhang