identified. Perhaps the problem could be simplified, though, by considering the context and purpose of Kafka. I would use a persistent message queue because I want to guarantee that data/messages don't get lost. But since Kafka is not meant to be a long-term storage solution (other products can be used for that), I would narrow that guarantee to apply only to the most recent messages, up to a configured threshold (e.g., max 24 hrs, max 500GB). Once a threshold is reached, the oldest messages are deleted first.
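In Kafka 0.7 terms, those thresholds map roughly onto the retention settings discussed later in this thread (values here are illustrative, and per Jun's note below, log.retention.size applies per partition log, not per broker):

# delete segments older than 24 hours
log.retention.hours=24
# cap each partition's log at ~500GB; oldest segments are deleted first
log.retention.size=536870912000
# roll a new segment file every 512MB
log.file.size=536870912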
To ensure no message loss (up to a limit), I must ensure Kafka is highly available. There's a small chance that the deletion rate ends up matching the receive rate, for example when the incoming volume is so high that the size threshold is reached before the time threshold. But I may be OK with that, because if Kafka goes down, it can cause upstream applications to fail. That can result in higher losses overall, and particularly of the most *recent* messages.
In other words, in a persistent but ephemeral message queue, I would give higher precedence to recent messages over older ones. On the flip side, by allowing Kafka to go down when a disk is full, applications are forced to deal with the issue. This adds complexity to apps, but perhaps that's not a bad thing. After all, at scale, all apps should be designed to handle failure.
Having said that, the next step is to decide which messages to delete first. I believe that's a separate issue, and it has its own complexities, too.
The main idea, though, is that a global knob would provide flexibility, even if it goes unused. From an operational perspective, if we can't ensure HA for all applications/components, it would be good if we could for at least some of the core ones, like Kafka. This is much easier said than done, though.
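To make the proposal concrete, a hypothetical broker-wide knob might look like the following. To be clear, no such property exists in Kafka 0.7; the name and semantics here are invented purely for illustration:

# HYPOTHETICAL, not a real Kafka property: cap the aggregate size of all
# partition logs on this broker; when exceeded, delete the oldest
# segments across all topics first
log.retention.total.size=1073741824000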
On May 5, 2014, at 9:16 AM, Jun Rao <jun...@gmail.com> wrote:
Yes, your understanding is correct. A global knob that controls aggregate log size may make sense. What would be the expected behavior when that limit is reached? Would you reduce the retention uniformly across all topics? Then it just means that some of the logs may not be retained as long as you want. Also, we need to think through what happens when every log has only 1 segment left and yet the total size still exceeds the limit. Do we roll log segments early?
Thanks,
Jun
On Sun, May 4, 2014 at 4:31 AM, vinh <v...@loggly.com> wrote:
Thanks Jun. So if I understand this correctly, there really is no master property to control the total aggregate size of all Kafka data files on a broker.
log.retention.size and log.file.size are great for managing data at the application level. In our case, application needs change frequently, and performance itself is an ever-evolving feature. This means various configs are constantly changing, like topics, # of partitions, etc.

What rarely changes, though, is provisioned hardware resources. So a setting to control the total aggregate size of Kafka logs (or persisted data, for better clarity) would definitely simplify things at an operational level, regardless of what happens at the application level.
On May 2, 2014, at 7:49 AM, Jun Rao <jun...@gmail.com> wrote:
log.retention.size controls the total size in a log dir (per partition). log.file.size controls the size of each log segment in the log dir.
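(For illustration, using the numbers from this thread: if log.retention.size=107374182400 (~100GB) applies to each partition's log dir, then a broker hosting two partition logs can legitimately hold ~200GB on disk, which would explain the observation further down. The two-partition count is an assumption, not something stated in the thread.)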
Thanks,
Jun
On Thu, May 1, 2014 at 9:31 PM, vinh <v...@loggly.com> wrote:
In the 0.7 docs, the descriptions for log.retention.size and log.file.size sound very much the same. In particular, both seem to apply to a single log file (or log segment file).

http://kafka.apache.org/07/configuration.html
I'm beginning to think there is no setting to control the max aggregate size of all logs. If this is correct, what would be a good approach to enforce this requirement? In my particular scenario, I have a lot of data being written to Kafka at a very high rate, so a 1TB disk can easily be filled up in 24 hrs or so. One option is to add more Kafka brokers to add more disk space to the pool, but I'd like to avoid that and see if I can simply configure Kafka to not write more than 1TB aggregate. Otherwise, Kafka will OOM and kill itself, and possibly crash the node itself because the disk is full.
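(For a sense of scale: filling a 1TB disk in 24 hrs implies a sustained ingest of roughly 1,000,000MB / 86,400s, i.e. about 12MB/s at the broker. These are illustrative round numbers, not measurements.)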
On May 1, 2014, at 9:21 PM, vinh <v...@loggly.com> wrote:
Using Kafka 0.7.2, I have the following in server.properties:
log.retention.hours=48
log.retention.size=107374182400
log.file.size=536870912
My interpretation of this is:
a) a single log segment file over 48 hrs old will be deleted
b) the total combined size of *all* logs is capped at 100GB
c) a single log segment file is limited to 512MB in size before a new segment file is spawned
d) a "log file" can be composed of many "log segment files"
But, even after setting the above, I find that the total combined size of all Kafka logs on disk is 200GB right now. Isn't log.retention.size supposed to limit it to 100GB? Am I missing something? The docs are not really clear, especially when it comes to distinguishing between a "log file" and a "log segment file".
I have disk monitoring. But like anything else in software, even monitoring can fail. Via configuration, I'd like to make sure that Kafka does not write more than the available disk space. Or something like log4j, where I can set a max number of log files and the max size per file, which essentially allows me to set a max aggregate size limit across all logs.
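For reference, the log4j pattern being described looks roughly like this (the appender name, path, and values are illustrative). With a RollingFileAppender, aggregate disk usage is bounded at about (MaxBackupIndex + 1) x MaxFileSize, here ~5.5GB:

log4j.appender.R=org.apache.log4j.RollingFileAppender
log4j.appender.R.File=/var/log/myapp.log
log4j.appender.R.MaxFileSize=500MB
log4j.appender.R.MaxBackupIndex=10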
Thanks,
-Vinh