For the record, this KIP is closed, as its design has been merged into
KIP-320. See
https://cwiki.apache.org/confluence/display/KAFKA/KIP-320%3A+Allow+fetchers+to+detect+and+handle+log+truncation
On Wed, Jan 31, 2018 at 12:16 AM Dong Lin wrote:
Hey Jun, Jason,
Thanks for all the comments. Could you see if you can give a +1 for the KIP?
I am open to making further improvements to the KIP.
Thanks,
Dong
On Tue, Jan 23, 2018 at 3:44 PM, Dong Lin wrote:
Hey Jun, Jason,
Thanks much for all the review! I will open the voting thread.
Regards,
Dong
On Tue, Jan 23, 2018 at 3:37 PM, Jun Rao wrote:
Hi, Dong,
The current KIP looks good to me.
Thanks,
Jun
On Tue, Jan 23, 2018 at 12:29 PM, Dong Lin wrote:
Hey Jun,
Do you think the current KIP looks OK? I am wondering if we can open the
voting thread.
Thanks!
Dong
On Fri, Jan 19, 2018 at 3:08 PM, Dong Lin wrote:
Hey Jun,
I think we can probably have a static method in the Util class to decode the
byte[]. Both the KafkaConsumer implementation and the user application will be
able to decode the byte array and log its content for debugging purposes. So it
seems that we can still print the information we want. It is just
Hi, Dong,
The issue with using just byte[] for OffsetEpoch is that it won't be
printable, which makes debugging harder.
Also, KIP-222 proposes a listGroupOffset() method in AdminClient. If that
gets adopted before this KIP, we probably want to include OffsetEpoch in
the AdminClient too.
Thanks,
Hey Jun,
I agree. I have updated the KIP to remove the class OffsetEpoch and replace
OffsetEpoch with byte[] in the APIs that use it. Can you see if it looks good?
Thanks!
Dong
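(For illustration only: a minimal sketch, in the spirit of the static decode
helper discussed above, of packing (partition_epoch, leader_epoch) into an
opaque byte[] and printing it for debugging. The class name and the two-int
layout are assumptions, not part of the KIP.)

    import java.nio.ByteBuffer;

    // Hypothetical helper: the KIP only says the epoch is exposed as an opaque
    // byte[]; the exact layout below (two 4-byte ints) is an assumption.
    public final class OffsetEpochs {

        public static byte[] encode(int partitionEpoch, int leaderEpoch) {
            ByteBuffer buf = ByteBuffer.allocate(8);
            buf.putInt(partitionEpoch);
            buf.putInt(leaderEpoch);
            return buf.array();
        }

        // Static decode method so that both the KafkaConsumer implementation and
        // the user application can log the content of the byte[] for debugging.
        public static String toLogString(byte[] offsetEpoch) {
            ByteBuffer buf = ByteBuffer.wrap(offsetEpoch);
            return "OffsetEpoch(partitionEpoch=" + buf.getInt()
                    + ", leaderEpoch=" + buf.getInt() + ")";
        }

        private OffsetEpochs() {
        }
    }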
On Thu, Jan 18, 2018 at 6:07 PM, Jun Rao wrote:
Hi, Dong,
Thanks for the updated KIP. It looks good to me now. The only remaining thing
is OffsetEpoch.
If we expose the individual fields in the class, we probably don't need the
encode/decode methods. If we want to hide the details of OffsetEpoch, we
probably don't want to expose the individual fields.
Thinking about point 61 more, I realize that the async zookeeper read may
make it less of an issue for the controller to read more zookeeper nodes.
Writing partition_epoch in the per-partition znode makes it simpler to
handle the broker failure between zookeeper writes for a topic creation. I
have
Hey Jun,
Thanks much for the comments. Please see my comments inline.
On Tue, Jan 16, 2018 at 4:38 PM, Jun Rao wrote:
Hi, Dong,
Thanks for the updated KIP. Looks good to me overall. Just a few minor
comments.
60. OffsetAndMetadata positionAndOffsetEpoch(TopicPartition partition): It
seems that there is no need to return metadata. We probably want to return
something like OffsetAndEpoch.
61. Should we store
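(Purely as an illustration of the shape Jun seems to suggest here; the fields
are an assumption, not the final KIP design.)

    // Hypothetical return type for positionAndOffsetEpoch(): offset plus epoch,
    // with no metadata string attached.
    final class OffsetAndEpoch {
        private final long offset;
        private final byte[] offsetEpoch;

        OffsetAndEpoch(long offset, byte[] offsetEpoch) {
            this.offset = offset;
            this.offsetEpoch = offsetEpoch;
        }

        long offset() { return offset; }
        byte[] offsetEpoch() { return offsetEpoch; }
    }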
Hey Jun,
Thanks much. I agree that we cannot rely on committed offsets to always be
deleted when we delete a topic. So it is necessary to use a per-partition
epoch that does not change unless this partition is deleted. I also agree
that it is very nice to be able to uniquely identify a message
Hi, Dong,
My replies are the following.
60. What you described could also work. The drawback is that we will be
unnecessarily changing the partition epoch when a partition hasn't really
changed. I was imagining that the partition epoch will be stored in
Regarding the use of the global epoch in 65), it is very similar to the
proposal of the metadata_epoch we discussed earlier. The main difference is
that this epoch is incremented when we create/expand/delete a topic and does
not change when the controller re-sends metadata.
I looked at our previous
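(A toy sketch of the semantics described above, with made-up names: the global
epoch moves only on topic create/expand/delete and stays put when the
controller merely re-sends metadata.)

    // Illustrative only; not controller code.
    final class GlobalTopicEpoch {
        private long epoch = 0;

        void onTopicCreated()  { epoch++; }
        void onTopicExpanded() { epoch++; }
        void onTopicDeleted()  { epoch++; }
        void onMetadataResent() { /* intentionally unchanged */ }

        long current() { return epoch; }
    }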
Hey Jun,
Thanks so much. These comments are very useful. Please see my comments below.
On Mon, Jan 8, 2018 at 5:52 PM, Jun Rao wrote:
Hi, Dong,
Thanks for the updated KIP. A few more comments.
60. Perhaps having a partition epoch is more flexible since in the future,
we may support deleting a partition as well.
61. It seems that the leader epoch returned in the position() call should
be the leader epoch returned in the fetch
Hey Jason,
Certainly. This sounds good. I have updated the KIP to clarify that the
global epoch will be incremented by 1 each time a topic is deleted.
Thanks,
Dong
On Mon, Jan 8, 2018 at 4:09 PM, Jason Gustafson wrote:
Hi Dong,
> I think your approach will allow the user to distinguish between the metadata
> before and after the topic deletion. I also agree that this can
> potentially be useful to the user. I am just not very sure whether we already
> have a good use-case to make the additional complexity worthwhile.
Hey Ismael,
I guess we actually need the user to see this field so that the user can store
this value in the external store together with the offset. We just prefer the
value to be opaque to discourage most users from interpreting this value.
One more advantage of using such an opaque field is to be able
Hey Jason,
Thanks for the comments.
On Mon, Jan 8, 2018 at 2:27 PM, Jason Gustafson wrote:
On Fri, Jan 5, 2018 at 7:15 PM, Jason Gustafson wrote:
>
> class OffsetAndMetadata {
>     long offset;
>     byte[] offsetMetadata;
>     String metadata;
> }
> Admittedly, the naming is a bit annoying, but we can probably come up with
> something better. Internally the byte
>
> I am not sure I understand the benefit of incrementing this epoch after
> topic deletion. At a high level, the client cannot differentiate between topic
> deletion and topic creation when the global epoch is incremented. Can you
> provide a more specific use-case?
Say you send two metadata
Hey Jason,
Thanks a lot for the comments. I will comment inline. And I have updated
the KIP accordingly. Could you take another look?
On Fri, Jan 5, 2018 at 11:15 AM, Jason Gustafson wrote:
Hi Dong,
Sorry for the late reply. I think the latest revision is looking good. I
have a few minor suggestions:
1. The name "partition_epoch" makes me think it changes independently at
the partition level, but all partitions for a topic should have the same
epoch. Maybe "topic_epoch" is nearer
Hey Jun, Jason,
Thanks much for all the feedback. I have updated the KIP based on the
latest discussion. Can you help check whether it looks good?
Thanks,
Dong
On Thu, Jan 4, 2018 at 5:36 PM, Dong Lin wrote:
Hey Jun,
Hmm... thinking about this more, I am not sure that the proposed API is
sufficient. For users that store offsets externally, we probably need an extra
API to return the leader_epoch and partition_epoch for all partitions that
consumers are consuming. I suppose these users currently use
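(A rough sketch of the flow Dong is describing for users who store offsets
externally. The epoch-returning position and epoch-aware seek methods are
assumptions based on this discussion, not an existing KafkaConsumer API.)

    import org.apache.kafka.common.TopicPartition;

    // Hypothetical shapes, only to illustrate the external-offset-storage flow.
    interface EpochAwareConsumer {
        long position(TopicPartition tp);
        byte[] positionOffsetEpoch(TopicPartition tp);                   // assumed new API
        void seek(TopicPartition tp, long offset, byte[] offsetEpoch);   // assumed new API
    }

    interface ExternalOffsetStore {
        void save(TopicPartition tp, long offset, byte[] offsetEpoch);
        long loadOffset(TopicPartition tp);
        byte[] loadOffsetEpoch(TopicPartition tp);
    }

    final class ExternalOffsetFlow {
        // After processing a batch, persist the offset and its epoch together.
        static void checkpoint(EpochAwareConsumer consumer, ExternalOffsetStore store,
                               TopicPartition tp) {
            store.save(tp, consumer.position(tp), consumer.positionOffsetEpoch(tp));
        }

        // On restart, seek with the saved epoch so the consumer can detect log
        // truncation or topic re-creation instead of silently resuming.
        static void resume(EpochAwareConsumer consumer, ExternalOffsetStore store,
                           TopicPartition tp) {
            consumer.seek(tp, store.loadOffset(tp), store.loadOffsetEpoch(tp));
        }
    }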
Hi, Dong,
Yes, that's what I am thinking. OffsetEpoch will be composed of
(partition_epoch, leader_epoch).
Thanks,
Jun
On Thu, Jan 4, 2018 at 4:22 PM, Dong Lin wrote:
Hey Jun,
Thanks much. I like the new API that you proposed. I am not sure what
you exactly mean by offset_epoch. I suppose that we can use the pair of
(partition_epoch, leader_epoch) as the offset_epoch, right?
Thanks,
Dong
On Thu, Jan 4, 2018 at 4:02 PM, Jun Rao wrote:
Hi, Dong,
Got it. The api that you proposed works. The question is whether that's the
api that we want to have in the long term. My concern is that while the api
change is simple, the new api seems harder to explain and use. For example,
a consumer storing offsets externally now needs to call
Hey Jun,
Yeah I agree that ideally we don't want an ever growing global metadata
version. I just think it may be more desirable to keep the consumer API
simple.
In my current proposal, metadata version returned in the fetch response
will be stored with the offset together. More specifically, the
Hi, Dong,
Not sure that I fully understand your latest suggestion. Returning an ever
growing global metadata version itself is not ideal, but is fine. My
question is whether the metadata version returned in the fetch response
needs to be stored with the offset together if offsets are stored
Hey Jun,
Thanks much for the explanation.
I understand the advantage of partition_epoch over metadata_epoch. My
current concern is that the use of leader_epoch and the partition_epoch
requires considerable changes to the consumer's public API to take care of
the case where the user stores offsets
Hi, Dong,
Thanks for the reply.
To solve the topic recreation issue, we could use either a global metadata
version or a partition level epoch. But either one will be a new concept,
right? To me, the latter seems more natural. It also makes it easier to
detect if a consumer's offset is still
Hey Jun,
This is a very good example. After thinking through this in detail, I agree
that we need to commit the offset with the leader epoch in order to address this
example.
I think the remaining question is how to address the scenario where the
topic is deleted and re-created. One possible solution is
Hi, Dong,
My previous reply has a couple of typos. So, sending it again with the fix.
Consider the following scenario. In metadata v1, the leader for a partition
is at broker 1. In metadata v2, leader is at broker 2. In metadata v3,
leader is at broker 1 again. The last committed offset in v1,
Hi, Dong,
Thanks for the updated KIP. Still have some questions on the latest
approach in the KIP.
Consider the following scenario. In metadata v1, the leader for a partition
is at broker 1. In metadata v2, leader is at broker 2. In metadata v3,
leader is at broker 1 again. The last committed
Hey Jun,
Thanks much for your comments. Yeah I have not considered the case where
the offset is stored externally.
Based on Jason's question, I think we probably have to use a global
metadata_epoch. And since we have a global metadata_epoch, this KIP
probably no longer needs the per-partition
Hey Jason,
Thanks much. Great question. I have considered topic deletion but I have
not considered the scenario where a user creates a topic very soon after topic
deletion.
After thinking through this scenario, I think the only option is to have a
global metadata_epoch that keeps increasing every
Hi, Dong,
Thanks for the reply.
10. I was actually just thinking about the case when the consumer consumes old
data. If the current leader epoch is 3 and the consumer is consuming
records generated in leader epoch 1, the epoch associated with the offset
should be 1. However, as you pointed out, the
Hey Dong,
One more thought came to mind. Have you considered edge cases around topic
deletion? I think currently if a topic is deleted and then re-created, the
leader epoch will start back at the beginning. It seems like that could
cause trouble for this solution. One thing that helps is that we
I think you're saying that depending on the bug, in the worst case, you may
have to downgrade the client. I think that's fair. Note that one advantage
of making this a fatal error is that we'll be more likely to hit unexpected
edge cases in system tests.
-Jason
On Tue, Dec 19, 2017 at 11:26 AM,
Hey Jason,
Yeah this may sound a bit confusing. Let me explain my thoughts.
If there is no bug in the client library, after a consumer rebalance or
consumer restart, the consumer will fetch the previously committed offset and
keep fetching metadata until the leader epoch in the metadata >= the
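(A minimal sketch of the wait-on-metadata behaviour described above; the
interface and method names are made up for illustration.)

    // Illustrative only: keep refreshing metadata until the cached leader epoch
    // is at least the leader epoch committed with the offset.
    final class MetadataCatchUp {
        interface MetadataCache {
            int leaderEpoch(String topic, int partition);
            void requestUpdate();
        }

        static void awaitLeaderEpoch(MetadataCache cache, String topic, int partition,
                                     int committedLeaderEpoch) throws InterruptedException {
            while (cache.leaderEpoch(topic, partition) < committedLeaderEpoch) {
                cache.requestUpdate();  // trigger a metadata refresh
                Thread.sleep(100);      // back off between refreshes
            }
        }
    }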
Hey Dong,
Thanks for the updates. Just one question:
> When the application receives
> this exception, the only choice will be to revert the Kafka client library to
> an earlier version.
Not sure I follow this. Wouldn't we just restart the consumer? That would
cause it to fetch the previously committed
Hey Jason,
Thanks for the comments. These make sense. I have updated the KIP to
include a new error INVALID_LEADER_EPOCH. This will be a non-retriable
error which may be thrown from the consumer's API. When the application receives
this exception, the only choice will be to revert the Kafka client library to
Hey Dong,
> I think it is a good idea to let the coordinator do the additional sanity check
> to ensure the leader epoch from OffsetCommitRequest never decreases. This
> can help us detect bugs. The next question will be what we should do if
> OffsetCommitRequest provides a smaller leader epoch. One
Hey Jason,
Thanks much for reviewing the KIP.
I think it is a good idea to let the coordinator do the additional sanity check
to ensure the leader epoch from OffsetCommitRequest never decreases. This
can help us detect bugs. The next question will be what we should do if
OffsetCommitRequest provides
Hi Dong,
Thanks for the KIP. Good job identifying the problem. One minor question I
had is whether the coordinator should enforce that the leader epoch
associated with an offset commit can only go forward for each partition?
Currently it looks like we just depend on the client for this, but since
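(A sketch of the coordinator-side sanity check being proposed here, assuming
the commit carries a leader epoch; the names and the in-memory map are
illustrative, not broker code.)

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Reject an offset commit whose leader epoch is smaller than the last one
    // committed for the same partition, e.g. a commit from a client whose
    // metadata is outdated.
    final class CommittedEpochCheck {
        private final Map<String, Integer> lastCommittedLeaderEpoch = new ConcurrentHashMap<>();

        boolean isValidCommit(String topicPartition, int leaderEpoch) {
            Integer last = lastCommittedLeaderEpoch.get(topicPartition);
            if (last != null && leaderEpoch < last) {
                return false; // stale commit; the broker would answer with an error
            }
            lastCommittedLeaderEpoch.put(topicPartition, leaderEpoch);
            return true;
        }
    }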
Hey Jun,
Thanks much for your comments. These are very thoughtful ideas. Please see
my comments below.
On Thu, Dec 14, 2017 at 6:38 PM, Jun Rao wrote:
Hi, Dong,
Thanks for the update. A few more comments below.
10. It seems that we need to return the leader epoch in the fetch response
as well. When fetching data, we could be fetching data from a leader epoch
older than what's returned in the metadata response. So, we want to use the
leader
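(Illustrating Jun's point 10 with a small sketch: remember the leader epoch
carried in the fetch response for the records actually consumed, and use that
epoch when committing, since it may be older than the epoch in the latest
metadata. The per-partition epoch in the fetch response is what the KIP
proposes; the class itself is made up.)

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.common.TopicPartition;

    final class FetchedEpochTracker {
        private final Map<TopicPartition, Integer> lastFetchedEpoch = new HashMap<>();

        // Called when a fetch response is processed.
        void onRecordsFetched(TopicPartition tp, int leaderEpochFromFetch) {
            lastFetchedEpoch.put(tp, leaderEpochFromFetch);
        }

        // Epoch to attach to the next offset commit for this partition; note that
        // this is the epoch of the consumed data, not of the latest metadata.
        int epochForCommit(TopicPartition tp) {
            return lastFetchedEpoch.getOrDefault(tp, -1);
        }
    }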
Hey Jun,
I see. Sounds good. Yeah it is probably simpler to leave this to another
KIP in the future.
Thanks for all the comments. Since there is no further comment in the
community, I will open the voting thread.
Thanks,
Dong
On Mon, Dec 11, 2017 at 5:37 PM, Jun Rao wrote:
Hey Jun,
I have updated the KIP based on our discussion. Thanks!
Dong
On Sat, Dec 9, 2017 at 10:12 PM, Dong Lin wrote:
> Hey Jun,
>
> Thanks much for your comments. Given that the client needs to de-serialize the
> metadata anyway, the extra overhead of checking the
Hi, Dong,
The case that I am thinking of is network partitioning. Suppose one deploys a
stretched cluster across multiple AZs in the same region. If the machines
in one AZ can't communicate with brokers in other AZs due to a network issue,
the brokers in that AZ won't get any new metadata.
We can
Hey Jun,
Thanks for the comment. I am open to improving this KIP to address more
problems. I probably need more help in understanding what the current
problem is with a consumer using outdated metadata and whether it is easier to
address it together with this KIP.
I agree that a consumer can
Hi, Dong,
Thanks for the reply.
My suggestion of forcing the metadata refresh from the controller may not
work in general since the cached controller could be outdated too. The
general problem is that if a consumer's metadata is outdated, it may get
stuck with the old leader for a long time. We
Hey Jun,
Thanks much for your comments. Given that the client needs to de-serialize the
metadata anyway, the extra overhead of checking the per-partition version
for every partition should not be a big concern. Thus it makes sense to use
leader epoch as the per-partition version instead of creating a
Hi, Dong,
Thanks for the reply. A few more points below.
For dealing with how to prevent a consumer switching from a new leader to
an old leader, your suggestion of refreshing metadata on consumer restart
until it sees a metadata version >= the one associated with the offset
works too, as long
Hey Jun,
Thanks much for the comments. Great point, particularly regarding (3). I
haven't thought about this before.
It seems that there are two possible ways in which the version number can be
used. One solution is for the client to check the version number at the time it
receives a MetadataResponse. And
Hi, Dong,
Great finding on the issue. It's a real problem. A few comments about the
KIP. (1) I am not sure about updating controller_metadata_epoch on every
UpdateMetadataRequest. Currently, the controller can send
UpdateMetadataRequest when there is no actual metadata change. Doing this
may
Bump up the thread.
It will be great to have more comments on whether we should do it or
whether there is a better way to address the motivation of this KIP.
On Mon, Dec 4, 2017 at 3:09 PM, Dong Lin wrote:
I don't have an interesting rejected alternative solution to put in the
KIP. If there is a good alternative solution from anyone in this thread, I am
happy to discuss it and update the KIP accordingly.
Thanks,
Dong
On Mon, Dec 4, 2017 at 1:12 PM, Ted Yu wrote:
It is clearer now.
I noticed that the Rejected Alternatives section is empty.
Have you considered any alternatives?
Cheers
On Mon, Dec 4, 2017 at 1:07 PM, Dong Lin wrote:
Ted, thanks for catching this. I have updated the sentence to make it
readable.
Thanks,
Dong
On Sat, Dec 2, 2017 at 3:05 PM, Ted Yu wrote:
bq. It the controller_epoch of the incoming MetadataResponse, or if the
controller_epoch is the same but the controller_metadata_epoch
Can you update the above sentence so that the intention is clearer?
Thanks
On Fri, Dec 1, 2017 at 6:33 PM, Dong Lin wrote:
> Hi all,