Ah, you may be hitting a GC pause caused by an I/O issue. You can confirm
whether this is really the case by looking at the gc.log on the broker and
checking whether you see a GC entry with small user and sys times but a
high real time. We saw a similar I/O-induced GC pause problem when
compressing our request log4j files, which happens every hour or so. Since
these files are large and the gzip process hogs the I/O bandwidth, the
Linux box hits the dirty_ratio threshold and the kernel blocks all threads
doing I/O until the dirty pages are flushed to disk. We have seen GC pauses
of 15-20 seconds when this happens. A workaround is to increase your
ZooKeeper session timeout to prevent the session expiration and the leader
re-elections that follow.
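
For example, such an entry in a HotSpot gc.log looks roughly like the
following (the numbers here are made up; the pattern to look for is real
time much larger than user + sys):

    [Times: user=0.15 sys=0.03, real=12.40 secs]

You can check the writeback thresholds involved with sysctl; vm.dirty_ratio
and vm.dirty_background_ratio are the knobs, and the right values depend on
the box's RAM and disk setup:

    sysctl vm.dirty_ratio vm.dirty_background_ratio

The session timeout itself is the broker's zookeeper.session.timeout.ms
setting in server.properties, e.g. (value only illustrative):

    zookeeper.session.timeout.ms=30000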

As for your file deletion issue, we have seen that if you configure a Kafka
broker with time-based expiration, it can end up deleting possibly hundreds
of large segment files all at the same time. This puts pressure on the file
system journal (we are using ext4 in data=ordered mode) and slows down
writes on the Kafka side. Kafka should throttle time-based rolling as well
as time-based expiration to prevent this situation. That said, we have
never really seen this cause a GC pause like the one you described.
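
The settings that control that behavior are the retention and segment
options in server.properties; roughly the following (values are only
illustrative, the defaults differ):

    log.retention.hours=168
    log.segment.bytes=536870912
    log.roll.hours=24

Since the broker deletes every expired segment in one cleanup pass, a long
retention period combined with large segments means a big burst of
deletions hitting the journal at once.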

So it would be good to investigate the root cause of your GC pause anyway.
Could you check your gc.log and send back the relevant part of the log that
shows the pause?
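
If GC logging isn't enabled yet, the standard HotSpot flags below will
produce a gc.log with the user/sys/real breakdown; add them wherever you
pass JVM options to the broker (the log path is just a placeholder):

    -verbose:gc -Xloggc:/path/to/gc.log -XX:+PrintGCDetails
    -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime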

Thanks,
Neha


On Wed, Aug 28, 2013 at 1:09 PM, Yu, Libo <libo...@citi.com> wrote:

> Hi team,
>
> We notice that when the incoming throughput is very high, the broker has
> to delete old log files to free up disk space. That causes some blocking
> (latency), and frequently the broker's zookeeper session times out.
> Currently our zookeeper timeout threshold is 4s. We can increase it, but
> if this threshold is too large, what is the consequence? Thanks.
>
>
> Libo
>
>