Re: Optimal Message Size for Kafka

2018-09-07 Thread Manoj Khangaonkar
Kafka performs best with message sizes on the order of a few KB. Larger messages
put a heavy load on the brokers and are handled very inefficiently; they are
inefficient for producers and consumers as well.
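
If you do need to push larger messages through, note that several size limits
have to line up end to end. Roughly (broker, producer and consumer properties;
the values below are only examples, not recommendations):

# broker (or per topic via max.message.bytes)
message.max.bytes=1048576
replica.fetch.max.bytes=1048576

# producer
max.request.size=1048576
compression.type=lz4

# consumer
max.partition.fetch.bytes=1048576
fetch.max.bytes=52428800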

regards

On Thu, Sep 6, 2018 at 11:16 PM SenthilKumar K wrote:

> Hello Experts, We are planning to use Kafka for a large message set (size
> varies from 2 MB to 4 MB per event). By setting message.max.bytes to a 64
> MB value, Kafka will accept these large messages, but how does that impact
> performance (for both producer and consumer)?
>
> We would like to understand the performance impact of different message
> sizes, for example producing 50 KB vs 1 MB.
>
> What is the optimal message size to use with the Kafka producer?
>
> --Senthil
>
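
You can measure the producer side yourself with the perf tool that ships with
Kafka, for example (topic name, record count and sizes are only illustrative):

bin/kafka-producer-perf-test.sh --topic perf-test --num-records 1000 \
  --record-size 1048576 --throughput -1 \
  --producer-props bootstrap.servers=localhost:9092 max.request.size=67108864

Run it with --record-size 51200 and again with --record-size 1048576 to compare
50 KB against 1 MB on your own hardware.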


-- 
http://khangaonkar.blogspot.com/


Re: Official Kafka Disaster Recovery is insufficient - Suggestions needed

2018-09-07 Thread Manjunath N
Henning,

> It is my understanding that you produce messages to Kafka partitions using 
> the normal producer API and then subsequently ETL them to some cold storage 
> using one or more consumers, i.e. the cold storage is eventually consistent 
> with Kafka!? 

In some of the deployments I have worked on, we keep unprocessed raw data for 3
days for DR purposes, even before it goes through ETL.

I am not sure how your data is structured, but here is one way you could work
around this if you have a timestamp for each record in the raw data.

Each record produced to a Kafka broker carries a CREATE_TIME timestamp. Consumers
read, process, and write to a sink. Irrespective of which offset we managed to
commit in Kafka before a failure, we can always find the last record that was
successfully processed and written to the sink. Before restarting a consumer, we
can run an identity consumer whose only job is to find the offset of the last
successfully updated record in the sink: starting from the last committed offset,
it retrieves records and checks for the matching timestamp. Once that offset is
identified, we seek to it in ConsumerRebalanceListener’s onPartitionsAssigned
method and resume processing from where we crashed.
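
A rough sketch of the two steps (untested; how you look up the CREATE_TIME of the
last record your sink applied, the last committed offset, and the consumer
properties are all up to your own code, so treat every concrete value here as an
assumption):

import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SinkRecoveryExample {

    // Step 1: before restarting the real consumer, scan forward from the last
    // committed offset with a throwaway "identity" consumer and return the offset
    // just past the record whose CREATE_TIME matches what the sink last applied.
    // props must contain bootstrap.servers and ByteArrayDeserializer settings.
    static long findResumeOffset(Properties props, TopicPartition tp,
                                 long lastCommittedOffset, long sinkTimestamp) {
        try (KafkaConsumer<byte[], byte[]> scanner = new KafkaConsumer<>(props)) {
            scanner.assign(Collections.singletonList(tp));
            scanner.seek(tp, lastCommittedOffset);
            while (true) {
                ConsumerRecords<byte[], byte[]> records = scanner.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) {
                    // simplified: one empty poll is taken to mean we reached the end
                    return lastCommittedOffset;
                }
                for (ConsumerRecord<byte[], byte[]> r : records) {
                    if (r.timestamp() >= sinkTimestamp) {
                        return r.offset() + 1; // first record the sink has NOT applied
                    }
                }
            }
        }
    }

    // Step 2: hand the resolved offsets to the real consumer through a rebalance
    // listener, so it seeks past the already-applied records on assignment.
    static class SeekToSinkPosition implements ConsumerRebalanceListener {
        private final KafkaConsumer<byte[], byte[]> consumer;
        private final Map<TopicPartition, Long> resumeOffsets;

        SeekToSinkPosition(KafkaConsumer<byte[], byte[]> consumer,
                           Map<TopicPartition, Long> resumeOffsets) {
            this.consumer = consumer;
            this.resumeOffsets = resumeOffsets;
        }

        @Override
        public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }

        @Override
        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
            for (TopicPartition tp : partitions) {
                Long offset = resumeOffsets.get(tp);
                if (offset != null) {
                    consumer.seek(tp, offset); // resume exactly where the sink left off
                }
            }
        }
    }
}

You would then subscribe the real consumer with this listener, e.g.
consumer.subscribe(topics, new SeekToSinkPosition(consumer, resumeOffsets)).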

I haven’t tried this; I am just sharing what came to mind.

For bad messages too, you should be able to restore the correct state by
re-reading the raw data, if Kafka is used as the event store.

Manjunath

> On Sep 4, 2018, at 7:40 PM, Henning Røigaard-Petersen  wrote:
> 
> Thank you for your answer Ryanne. It’s always a pleasure to be presented with 
> such unambiguous advice. It really gives you something to work with :). 
> 
> To any other readers, I am very interested in hearing of other approaches to 
> DR in Kafka. 
> 
> Ryanne, I agree with your statement as to the probability difference of the 
> different DR scenarios, and I get in principle how your approach would allow 
> us to recover from “bad” messages, but we must of course ensure that we have 
> counter measures for all the scenarios. 
> 
> To that end, I have a couple of questions to your approach to DR.
> 
> Q1) 
> It is my understanding that you produce messages to Kafka partitions using 
> the normal producer API and then subsequently ETL them to some cold storage 
> using one or more consumers, i.e. the cold storage is eventually consistent 
> with Kafka!? 
> 
> If this is true, isn’t your approach prone to the same loss-of-tail issues as 
> regular multi cluster replication in case of total ISR loss? That is, we may 
> end up with an inconsistent cold storage, because downstream messages may be 
> backed up before the corresponding upstream messages are backed up?
> 
> I guess some ways around this would be to have only one partition (not 
> feasible) or to store state changes directly to other storage and ETL those 
> changes back to Kafka for downstream consumption. However, I believe that is 
> not what you are suggesting.
> 
> Q2) 
> I am unsure how your approach should work in practice, concerning the likely 
> disaster scenario of bad messages. 
> 
> Assume a bad message is produced and ETL’ed to the cold storage. 
> 
> As an isolated message, we could simply wipe the Kafka partition and 
> reproduce all relevant messages or compact the bad message with a newer 
> version. This all makes sense. 
> However, more likely, it will not be an isolated bad message, but rather a 
> plethora of downstream consumers will process it and in turn produce derived 
> bad messages, which are further processed downstream. This could result in an 
> enormous amount of bad messages and bad state in cold storage. 
> 
> How would you recover in this case? 
> 
> It might be possible to iterate through the entirety of the state to detect 
> bad messages, but updating with the correct data seems impossible.
> 
> I guess one very crude fallback solution may be to identify the root bad 
> message, and somehow restore to a previous consistent state for the entire 
> system. This however, requires some global message property across the entire 
> system. You mention Timestamps, but traditionally these are intrinsically 
> unreliable, especially in a distributed environment, and will most likely 
> lead to loss of messages with timestamps close to the root bad message.
> 
> Q3) 
> Does the statement “Don't rely on unlimited retention in Kafka” imply some 
> flaw in the implementation, or is it simply a reference to the advice of not 
> using Kafka as Source of Truth due to the DR issues?
> 
> Thank you for your time
> 
> Henning Røigaard-Petersen
> 
> -Original Message-
> From: Ryanne Dolan  
> Sent: 3 September 2018 20:27
> To: users@kafka.apache.org
> Subject: Re: Official Kafka Disaster Recovery is insufficient - Suggestions 
> needed
> 
> Sorry to have misspelled your name Henning.
> 
> On Mon, Sep 3, 2018, 1:26 PM Ryanne Dolan  wrote:
> 
>> Hanning,
>> 
>> In missi