It's by design in Kafka (and, by extension, Flume). The producers are designed to
be many-to-one (producers to partitions), and as such picking a random partition
every 10 minutes prevents separate producer instances from all randomly picking
the same partition.
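
To put it concretely, here's a rough sketch (Python pseudocode, not the actual
Kafka code) of how the old 0.8-style producer used by the Flume sink picks a
partition for keyless messages: it chooses one at random and reuses it until the
topic metadata refresh interval (topic.metadata.refresh.interval.ms, default 10
minutes) expires.

    import random, time

    REFRESH_INTERVAL_S = 600  # topic.metadata.refresh.interval.ms default (10 minutes)

    class StickyRandomPartitioner(object):
        def __init__(self, num_partitions):
            self.num_partitions = num_partitions
            self.cached = None
            self.last_refresh = 0.0

        def partition(self, key):
            if key is not None:
                # keyed messages hash to a fixed partition
                return hash(key) % self.num_partitions
            now = time.time()
            if self.cached is None or now - self.last_refresh > REFRESH_INTERVAL_S:
                # keyless messages: pick a random partition and stick with it
                self.cached = random.randrange(self.num_partitions)
                self.last_refresh = now
            return self.cached

So each independent Flume agent ends up camped on its own randomly chosen
partition for roughly 10 minutes at a time.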

-- 
Chris Horrocks

From: Jason Williams <[email protected]>
Reply: [email protected] <[email protected]>
Date: 7 June 2016 at 09:43:34
To: [email protected] <[email protected]>
Subject:  Re: Kafka Sink random partition assignment  

Hey Chris,

Thanks for the help!

Is that a limitation of the Flume Kafka sink or Kafka itself? Because when I 
use another Kafka producer and publish without a key, it randomly moves among 
the partitions on every publish.
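
(For reference, I mean roughly this sort of thing — a sketch with kafka-python,
not my exact code, and the broker address is just illustrative:

    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers='localhost:9092')
    for i in range(10):
        # no key supplied, so the client spreads these sends across partitions
        producer.send('activity.history', ('message %d' % i).encode('utf-8'))
    producer.flush()

and the messages end up spread across the partitions rather than all landing on
one.)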

-J

Sent via iPhone

On Jun 7, 2016, at 00:08, Chris Horrocks <[email protected]> wrote:

The producers bind to random partitions and move every 10 minutes. If you leave
it long enough you should see it in the producer Flume agent logs, although
there's nothing to stop it from "randomly" choosing the same partition twice.
Annoyingly there's no concept of producer groups (yet) to ensure that producers
apportion the available partitions between them, as this would create a
synchronisation issue between what should be entirely independent processes.
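
If you want per-event spreading rather than the sticky behaviour, the usual
trick is to give every event its own key so the sink's hash partitioner
scatters them. A rough sketch (untested; the agent/source names are placeholders
for whatever is in your config, and you should check the 1.6 docs for the UUID
interceptor class name and for the 'key' header the Kafka sink reads):

    a1.sources.r1.interceptors = i1
    a1.sources.r1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
    a1.sources.r1.interceptors.i1.headerName = key

With a unique key on each event, the messages should hash out across all the
partitions instead of camping on one.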

-- 
Chris Horrocks
On 7 June 2016 at 00:32:29, Jason J. W. Williams ([email protected]) 
wrote:

Hi,

I'm new to Flume and I'm trying to relay log messages received over a netcat
source to a Kafka sink.

Everything seems to be fine, except that Flume is acting like it IS assigning a
partition key to the produced messages even though none is set. I'd like the
messages to be assigned to random partitions so that the consumers are
load-balanced.
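
(For context, both consumer instances join the same consumer group, so the 20
partitions get divided between them — something along these lines, sketched
with kafka-python; the group name is made up and my actual consumer is in the
gist linked below:

    from kafka import KafkaConsumer

    consumer = KafkaConsumer('activity.history',
                             group_id='history-readers',  # illustrative group name
                             bootstrap_servers='localhost:9092')
    for msg in consumer:
        print(msg.partition, msg.value)

so whichever partitions actually receive data determine which consumer sees the
traffic.)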

* Flume 1.6.0
* Kafka 0.9.0.1

Flume config: 
https://gist.github.com/williamsjj/8ae025906955fbc4b5f990e162b75d7c

Kafka topic config: kafka-topics --zookeeper localhost/kafka --create --topic 
activity.history --partitions 20 --replication-factor 1

Python consumer program: 
https://gist.github.com/williamsjj/9e67287f0154816c3a733a39ad008437

Test program (publishes to Flume): 
https://gist.github.com/williamsjj/1eb097a187a3edb17ec1a3913e47e58b

The Flume agent listens on TCP port 3132 for connections and publishes received
messages to the Kafka activity.history topic. I'm running two instances of the
Python consumer.

What happens, however, is that all log messages get sent to a single Kafka
consumer... if I restart Flume (leaving the consumers running) and re-run the
test, all messages get published to the other consumer. So it feels like Flume
is assigning a permanent partition key even though one is not defined (and
partition assignment should therefore be random).

Any advice is greatly appreciated.

-J
