Re: Issue when enabling SSL on broker

2015-08-26 Thread Xiang Zhou (Samuel)
Hi, Harsha,

Thank you very much for your response. The bash script you provided to
generate the keystores works for me and solves the problem. I had suspected
it was caused by cipher suite differences between OpenJDK and Oracle JDK,
but that turned out not to be the case. It now works under both OpenJDK and
Oracle JDK.
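
For anyone hitting the same error later: with the regenerated keystores in
place, the client side comes down to a handful of SSL properties on the trunk
producer. A minimal sketch, with placeholder paths, passwords and broker
address (the property names are worth double-checking against the wiki page
linked below):

    import java.util.Properties
    import org.apache.kafka.clients.producer.KafkaProducer

    // Sketch only: placeholder paths, passwords and broker address.
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9093")
    props.put("security.protocol", "SSL")
    props.put("ssl.truststore.location", "/var/private/ssl/client.truststore.jks")
    props.put("ssl.truststore.password", "test1234")
    // Only needed if the broker requires client authentication:
    props.put("ssl.keystore.location", "/var/private/ssl/client.keystore.jks")
    props.put("ssl.keystore.password", "test1234")
    props.put("ssl.key.password", "test1234")
    props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")

    val producer = new KafkaProducer[Array[Byte], Array[Byte]](props)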

Thanks,
Samuel

On Tue, Aug 25, 2015 at 9:55 PM, Sriharsha Chintalapani 
harsh...@fastmail.fm wrote:

 Hi,
   Turns out to be a bug in the instructions in the wiki. I fixed it;
 can you please retry generating the truststore and keystore:

 https://cwiki.apache.org/confluence/display/KAFKA/Deploying+SSL+for+Kafka
  .

 Check out the section "All of the above steps in a bash script" to
 generate the keystores.
 Thanks,
 Harsha


 On August 25, 2015 at 8:56:24 PM, Sriharsha Chintalapani (ka...@harsha.io)
 wrote:

 Hi Xiang,
  Did you try following the instructions here
 https://cwiki.apache.org/confluence/display/KAFKA/Deploying+SSL+for+Kafka
  .
 What's the output of openssl s_client, and which version of Java and OS are
 you using?

 Thanks,
 Harsha


 On August 25, 2015 at 8:42:18 PM, Xiang Zhou (Samuel) (zhou...@gmail.com)
 wrote:

 no cipher suites in common




Re: Mirror a partition of a topic

2015-08-26 Thread Kishore Senji
Flume could be an option, with an Interceptor, although the throughput could
be lower than Mirror Maker with compression and the shallow iterator
enabled.
On Tue, Aug 25, 2015 at 10:28 PM tao xiao xiaotao...@gmail.com wrote:

 In the trunk code, mirror maker provides the ability to filter out messages
 on demand by supplying a message handler.


 https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/tools/MirrorMaker.scala#L443
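
A rough sketch of such a handler for the partition-filtering case Binh asks
about below, assuming the trunk MirrorMakerMessageHandler interface (a
handle() method that takes the consumed MessageAndMetadata and returns a list
of ProducerRecords; the exact signature should be checked against the
MirrorMaker.scala source above). The partition set is made up for the example:

    import java.util
    import java.util.Collections
    import kafka.message.MessageAndMetadata
    import kafka.tools.MirrorMaker
    import org.apache.kafka.clients.producer.ProducerRecord

    // Sketch only: mirror records from a fixed set of source partitions, drop the rest.
    class PartitionFilteringHandler extends MirrorMaker.MirrorMakerMessageHandler {
      private val partitionsToMirror = Set(0, 1) // hypothetical choice

      override def handle(record: MessageAndMetadata[Array[Byte], Array[Byte]]): util.List[ProducerRecord[Array[Byte], Array[Byte]]] = {
        if (partitionsToMirror.contains(record.partition))
          Collections.singletonList(
            new ProducerRecord[Array[Byte], Array[Byte]](record.topic, record.key(), record.message()))
        else
          Collections.emptyList[ProducerRecord[Array[Byte], Array[Byte]]]() // filtered out
      }
    }

If I read the tool correctly, the compiled class then goes on mirror maker's
classpath and is supplied via its message handler command-line option.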

 On Wed, 26 Aug 2015 at 11:35 Binh Nguyen Van binhn...@gmail.com wrote:

  Hi all,
 
  Is there any tool that allows mirroring a single partition of a topic?
  I looked at mirror maker, but it mirrors the whole topic
  and I don't see any option that allows me to configure partitions.
 
  I want to mirror our live data to the staging environment, but staging
  can't handle the whole topic, so I am looking for a tool that I can use to
  mirror only specific partitions of a topic.
 
  Thanks
  -Binh
 



Reg Leader re-election when leader crashes

2015-08-26 Thread Amar
Hi,

I have configured 3 brokers and 3 ZooKeeper nodes on different Unix boxes. All
are receiving messages successfully.

1. The leader broker crashed, and when I try to publish 10 messages, some
of them fail with
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata
after 3 ms.
Can you please advise?

2. Two brokers, including the leader, crashed. I ran the command below and
found some offline partitions. Please advise why they are not getting
re-assigned to the third broker?

bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic
RTL_RESERVATION_UPDATES --unavailable-partitions
Regards
Amar


Http Kafka producer

2015-08-26 Thread Hemanth Abbina
Hi,

Our application receives events through an HAProxy server over HTTPS; these should
be forwarded to and stored in a Kafka cluster.

What would be the best option for this?
This layer should receive events from HAProxy & produce them to the Kafka cluster
in a reliable and efficient way (and should scale horizontally).

Please suggest.

--regards
Hemanth


Re: 0.8.x

2015-08-26 Thread Ben Stopford
Hi Damian

Just clarifying - you’re saying you currently have Kafka 0.7.x running with
dedicated broker addresses (bypassing ZK), hitting a VIP which you use for
load-balancing writes. Is that correct?

Are you worried about something specific in the 0.8.x way of doing things (ZK 
under that level of load etc) or are you just looking for experiences running 
0.8.x with that many producers?

B


 On 25 Aug 2015, at 10:29, Damian Guy damian@gmail.com wrote:
 
 Hi,
 
 We currently run 0.7.x on our clusters and are now finally getting around
 to upgrading to kafka latest.  One thing that has been holding us back is
 that we can no longer use a VIP to front the clusters. I understand we
 could use a VIP for metadata lookups, but we have 100,000 + producers to at
 least one of our clusters.
 So, my question is: How is Kafka 0.8.x going to handle 100,000+ producers?
 Any recommendations on setup etc?
 
 Thanks for the help,
 Damian



Re: 0.8.x

2015-08-26 Thread Damian Guy
Hi Ben,

Yes, we have a VIP fronting our Kafka 0.7 clusters. The producers connect
with a broker list that just points to the VIP.

In 0.8, with the new producer at least, the producer doesn't go through
Zookeeper - which is good. However, all producers will connect to all
brokers to send messages. My concern is that the brokers won't be able to
handle the load of 100k+ connections.
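
For context, the new producer only needs something reachable for bootstrapping,
so the VIP can still serve that role; a minimal sketch (the VIP hostname and
topic are placeholders):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    // bootstrap.servers only has to reach *some* broker; a VIP works for that.
    // After the first metadata response the producer opens direct connections
    // to the leaders of the partitions it writes to, bypassing the VIP.
    val props = new Properties()
    props.put("bootstrap.servers", "kafka-vip.example.com:9092") // hypothetical VIP
    props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")

    val producer = new KafkaProducer[Array[Byte], Array[Byte]](props)
    producer.send(new ProducerRecord[Array[Byte], Array[Byte]]("my-topic", "payload".getBytes("UTF-8")))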

So I want to know: does anyone have experience with this sort of load? Will it
work?
Do I need to take a multi-layered approach to Kafka, i.e. have many
front-end Kafka clusters that subsets of producers connect to, and then
mirror data from there to one or more back-end clusters? I'd rather not
have to do this as it becomes painful to track data through the system.

Any other recommendations on how to go about it?

Thanks,
Damian


On 26 August 2015 at 10:57, Ben Stopford b...@confluent.io wrote:

 Hi Damian

 Just clarifying - you’re saying you currently have Kafka 0.7.x running
 with a dedicated broker addresses (bypassing ZK) and hitting a VIP which
 you use for load balancing writes. Is that correct?

 Are you worried about something specific in the 0.8.x way of doing things
 (ZK under that level of load etc) or are you just looking for experiences
 running 0.8.x with that many producers?

 B


  On 25 Aug 2015, at 10:29, Damian Guy damian@gmail.com wrote:
 
  Hi,
 
  We currently run 0.7.x on our clusters and are now finally getting around
  to upgrading to kafka latest.  One thing that has been holding us back is
  that we can no longer use a VIP to front the clusters. I understand we
  could use a VIP for metadata lookups, but we have 100,000 + producers to
 at
  least one of our clusters.
  So, my question is: How is Kafka 0.8.x going to handle 100,000+
 producers?
  Any recommendations on setup etc?
 
  Thanks for the help,
  Damian




Re: Painfully slow kafka recovery / cluster breaking

2015-08-26 Thread Jörg Wagner

Just a little feedback on our issue(s) as FYI to whoever is interested.

It basically all boiled down to the configuration of topics. We noticed 
while performance testing (or trying to ;) ) that the partitioning was 
most critical to us.


We originally followed the LinkedIn recommendation and used 600 
partitions for our main topic. Testing that, the replicas always went 
out of sync within a short timeframe, leaders could not be determined 
and the cluster failed horribly (even writing several hundred lines of 
logs within a hundredth of a second).


So for our 27 log.dirs (= disks) we went with 27 partitions. And voilà: 
we could use Kafka with around 35k requests per second (via an 
application accessing it). Kafka stayed stable.


Currently we are testing with 81 partitions (27*3) and it's running 
well. No issues so far, replicas in sync and up to 50k requests per second.


Cheers

On 25.08.2015 15:18, Jörg Wagner wrote:
So okay, this is a little embarrassing, but the core of the issue was 
that max open files was not set correctly for Kafka. It was not an 
oversight, but a few things together meant that the system 
configuration was not changed correctly, resulting in the default value.


No wonder that Kafka behaved strangely every time we had enough data in 
log.dirs and connections.


Anyhow, that doesn't seem to be the last problem. The brokers get in 
sync with each other (within an expected time frame), everything seems 
fine.


After a little stress testing, the Kafka cluster falls apart (around 
40k requests/s). Using "topics describe" we can see leaders missing 
(e.g. from brokers 1, 2, 3, only 1 and 3 are leading partitions, although 
ZooKeeper lists all three under /brokers/ids). This ultimately results in 
partitions being unavailable and massive "leader not local" spam in 
the logs.


What are we missing?

Cheers
Jörg

On 24.08.2015 10:31, Jörg Wagner wrote:

Thank you for your answers.

@Raja
No, it also seems to happen if we stop kafka completely clean.

@Gwen
I was testing the situation with num.replica.fetchers set higher. If 
you say that was the right direction, I will try it again. What would 
be a good setting? I went with 50 which seemed reasonable (having 27 
single disks).

How long should it take to get complete ISR?

Regarding "no data flowing into Kafka": I just wanted to point out that 
the setup is not yet live. So we can completely stop the usage of 
Kafka, and it should possibly get into sync faster without a steady 
stream of new messages.
Kafka itself is working fine during this, on the other hand - just 
missing ISRs, and hence redundancy. If I stop another broker (cleanly!) 
though, it tends to happen that the expected number of partitions 
have Leader: -1, which I assume should not happen.


Cheers
Jörg

On 21.08.2015 19:18, Rajasekar Elango wrote:

We are seeing the same behavior in a 5-broker cluster when losing one broker.

In our case, we are losing the broker as well as the Kafka data dir.

Jörg Wagner,

Are you losing just the broker, or the Kafka data dir as well?

Gwen,

We have also observed that the latency of messages arriving at consumers 
goes up by 10x when we lose a broker. Is it because the broker is busy with 
handling failed fetch requests and loaded with more data that's slowing down 
the writes? Also, if we had simply lost the broker and not the data dir, 
would the impact have been minimal?

Thanks,
Raja.



On Fri, Aug 21, 2015 at 12:31 PM, Gwen Shapira g...@confluent.io 
wrote:


By default, num.replica.fetchers = 1. This means only one thread per broker 
is fetching data from leaders. This means it may take a while for the 
recovering machine to catch up and rejoin the ISR.

If you have bandwidth to spare, try increasing this value.

Regarding "no data flowing into Kafka" - if you have 3 replicas and only 
one is down, I'd expect writes to continue to the new leader even if one 
replica is not in the ISR yet. Can you see that a new leader is elected?


Gwen

On Fri, Aug 21, 2015 at 6:50 AM, Jörg Wagner joerg.wagn...@1und1.de
wrote:


Hey everyone,

here's my crosspost from irc.

Our setup:
 3 Kafka 0.8.2 brokers with ZooKeeper, powerful hardware (20 cores, 27
 log disks each). We use a handful of topics, but only one topic is utilized
 heavily. It features a replication factor of 2 and 600 partitions.

 Our issue:
 If one Kafka broker was down, it takes very long (from 1 to 10 hours) to show
 that all partitions have all ISRs again. This seems to heavily depend on the
 amount of data in the log.dirs (I have configured 27 threads - one for each
 dir, each featuring its own drive).
 All of this takes this long while there is NO data flowing into Kafka.

 We seem to be missing something critical here. It might be some option set
 wrong, or are we thinking wrong and it's not critical to have the replicas
 in sync.

 Any pointers would be great.

 Cheers
 Jörg










--
With kind regards

Jörg Wagner

 
Mobile & Services


1&1 Internet AG | Sapporobogen 6-8 | 80637 München | Germany
Phone: +49 89 14339 324
E-Mail: 

Re: 0.8.x

2015-08-26 Thread Ben Stopford
That’s a fair few connections indeed! 

You may be able to take the route of pinning producers to specific partitions. 
KIP-22 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-22+-+Expose+a+Partitioner+interface+in+the+new+producer
 made this easier by exposing a partitioner interface to producers. That should 
let you balance load. 
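
A sketch of what such pinning could look like, assuming the Partitioner
interface described in KIP-22 (configure/partition/close); the
"pinned.partition" property name is invented for this example:

    import java.util
    import org.apache.kafka.clients.producer.Partitioner
    import org.apache.kafka.common.Cluster

    // Sketch only: every record from this producer instance lands on one fixed partition.
    class PinnedPartitioner extends Partitioner {
      private var pinned = 0

      override def configure(configs: util.Map[String, _]): Unit = {
        // "pinned.partition" is a made-up config key for this sketch
        pinned = Option(configs.get("pinned.partition")).map(_.toString.toInt).getOrElse(0)
      }

      override def partition(topic: String, key: AnyRef, keyBytes: Array[Byte],
                             value: AnyRef, valueBytes: Array[Byte], cluster: Cluster): Int = {
        // stay within the number of partitions the topic actually has
        pinned % cluster.partitionsForTopic(topic).size()
      }

      override def close(): Unit = {}
    }

Each producer (or group of producers) would then get the class via the
producer's partitioner.class setting, with different groups pinned to
different partitions.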

Also it may be worth adding that the code all uses non-blocking IO. I don’t 
have hard numbers here though.  

Has anyone else worked with 0.8.x at this level of load?

B 


 On 26 Aug 2015, at 10:57, Ben Stopford b...@confluent.io wrote:
 
 Hi Damian
 
 Just clarifying - you’re saying you currently have Kafka 0.7.x running with a 
 dedicated broker addresses (bypassing ZK) and hitting a VIP which you use for 
 load balancing writes. Is that correct?
 
 Are you worried about something specific in the 0.8.x way of doing things (ZK 
 under that level of load etc) or are you just looking for experiences running 
 0.8.x with that many producers?
 
 B
 
 
 On 25 Aug 2015, at 10:29, Damian Guy damian@gmail.com wrote:
 
 Hi,
 
 We currently run 0.7.x on our clusters and are now finally getting around
 to upgrading to kafka latest.  One thing that has been holding us back is
 that we can no longer use a VIP to front the clusters. I understand we
 could use a VIP for metadata lookups, but we have 100,000 + producers to at
 least one of our clusters.
 So, my question is: How is Kafka 0.8.x going to handle 100,000+ producers?
 Any recommendations on setup etc?
 
 Thanks for the help,
 Damian
 



Re: Painfully slow kafka recovery / cluster breaking

2015-08-26 Thread Rajasekar Elango
Thanks for the updates, Jörg. It's very useful.

Thanks,
Raja.

On Wed, Aug 26, 2015 at 8:58 AM, Jörg Wagner joerg.wagn...@1und1.de wrote:

 Just a little feedback on our issue(s) as FYI to whoever is interested.

 It basically all boiled down to the configuration of topics. We noticed
 while performance testing (or trying to ;) ) that the partitioning was most
 critical to us.

 We originally followed the linkedin recommendation and used 600 partitions
 for our main topic. Testing that, the replicas always went out of sync
 within a short timeframe, leaders could not be determined and the cluster
 failed horribly (even writing several hundred lines of logs within a
 1/100th second).

 So for our 27 log.dirs (= disks) we went with 27 partitions. And voilá: we
 could use kafka with around 35k requests per second (via an application
 accessing it). Kafka stayed stable.

 Currently we are testing with 81 partitions (27*3) and it's running well.
 No issues so far, replicas in sync and up to 50k requests per second.

 Cheers

 On 25.08.2015 15:18, Jörg Wagner wrote:

 So okay, this is a little embarassing but the core of the issue was that
 max open files was not set correctly for kafka. It was not an oversight,
 but a few things together caused that the system configuration was not
 changed correctly, resulting in the default value.

 No wonder that kafka behaved strangely everytime we had enough data in
 log.dirs and connections.

 Anyhow, that doesn't seem to be the last problem. The brokers get in sync
 with each other (within an expected time frame), everything seems fine.

 After a little stress testing, the kafka cluster falls apart (around 40k
 requests/s). Using topics describe we can see leaders missing (e.g. from
 1,2,3 only 1 and 3 are leading partitions, although zookeeper lists all
 under /brokers/ids). This ultimately results in partitions being
 unavailable and massive leader not local spam in the logs.

 What are we missing?

 Cheers
 Jörg

 On 24.08.2015 10:31, Jörg Wagner wrote:

 Thank you for your answers.

 @Raja
 No, it also seems to happen if we stop kafka completely clean.

 @Gwen
 I was testing the situation with num.replica.fetchers set higher. If you
 say that was the right direction, I will try it again. What would be a good
 setting? I went with 50 which seemed reasonable (having 27 single disks).
 How long should it take to get complete ISR?

 Regarding no Data flowing into kafka: I just wanted to point out that
 the setup is not yet live. So we can completely stop the usage of kafka,
 and it should possibly get into sync faster without a steady stream of new
 messages.
 Kafka itself is working fine during this on the other hand, just
 missing ISR, hence redundancy. If I stop another broker (clean!) though, it
 tends to happen that the expected number of partitions have Leader -1;
 which should not happen as I assume.

 Cheers
 Jörg

 On 21.08.2015 19:18, Rajasekar Elango wrote:

 We are seeing same behavior in 5 broker cluster when losing one broker.

 In our case, we are losing broker as well as kafka data dir.

 Jörg Wagner,

 Are you losing just broker or kafka data dir as well?

 Gwen,

 We have also observed that latency of messages arriving at consumers
 goes
 up by 10x when we lose a broker. Is it because the broker is busy with
 handling failed fetch requests and loaded with more data thats slowing
 down
 the writes ? Also, if we had simply lost the broker not the data dir,
 impact would have been minimal?

 Thanks,
 Raja.



 On Fri, Aug 21, 2015 at 12:31 PM, Gwen Shapira g...@confluent.io
 wrote:

 By default, num.replica.fetchers = 1. This means only one thread per
 broker
 is fetching data from leaders. This means it make take a while for the
 recovering machine to catch up and rejoin the ISR.

 If you have bandwidth to spare, try increasing this value.

 Regarding no data flowing into kafka - If you have 3 replicas and
 only
 one is down, I'd expect writes to continue to the new leader even if
 one
 replica is not in the ISR yet. Can you see that a new leader is
 elected?

 Gwen

 On Fri, Aug 21, 2015 at 6:50 AM, Jörg Wagner joerg.wagn...@1und1.de
 wrote:

 Hey everyone,

 here's my crosspost from irc.

 Our setup:
 3 kafka 0.8.2 brokers with zookeeper, powerful hardware (20 cores, 27
 logdisks each). We use a handful of topics, but only one topic is

 utilized

 heavily. It features a replication of 2 and 600 partitions.

 Our issue:
 If one kafka was down, it takes very long ( from 1 to 10 hours) to
 show
 that all partitions have all isr again. This seems to heavily depend
 on

 the

 amount of data which is in the log.dirs (I have configured 27 threads
 -

 one

 for each dir featuring a own drive).
 This all takes this long while there is NO data flowing into kafka.

 We seem to be missing something critical here. It might be some option

 set

 wrong, or are we thinking wrong and it's not critical to have the

 replicas

 in sync.

 Any pointers would be great.

 

Re: Http Kafka producer

2015-08-26 Thread Marc Bollinger
I'm actually also really interested in this...I had a chat about this on
the distributed systems slack's http://dist-sys.slack.com Kafka channel a
few days ago, but we're not much further than griping about the problem.
We're basically migrating an existing event system, one which packed
messages into files, waited for a time-or-space threshold to be crossed,
then dealt with distribution in terms of files. Basically, we'd like to
keep a lot of those semantics: we can acknowledge success on the app server
as soon as we've flushed to disk, and rely on the filesystem for
durability, and total order across the system doesn't matter, as the HTTP
PUTs sending the messages are load balanced across many app servers. We
also can tolerate [very] long downstream event system outages,
because...we're ultimately just writing sequentially to disk, per process
(I should mention that this part is in Rails, which means we're dealing
largely in terms of processes, not threads).

RocksDB was mentioned in the discussion, but spending exactly 5 minutes
researching that solution, it seems like the dead simplest solution on an
app server in terms of moving parts (multiple processes writing, one
process reading/forwarding to Kafka) wouldn't work well with RocksDB.
Although now that I'm looking at it more, it looks like they're working on
a MySQL storage engine?

Anyway yeah, I'd love some discussion on this, or war stories of migration
to Kafka from other event systems (F/OSS or...bespoke).

On Wed, Aug 26, 2015 at 3:45 PM, Hemanth Abbina heman...@eiqnetworks.com
wrote:

 Hi,

 Our application receives events through a HAProxy server on HTTPs, which
 should be forwarded and stored to Kafka cluster.

 What should be the best option for this ?
 This layer should receive events from HAProxy  produce them to Kafka
 cluster, in a reliable and efficient way (and should scale horizontally).

 Please suggest.

 --regards
 Hemanth



RE: Http Kafka producer

2015-08-26 Thread Hemanth Abbina
Marc,

Thanks for your response. Let me give some more details on the problem.

As I already mentioned in the previous post, here is our expected data flow:
logs -> HAProxy -> {new layer} -> Kafka cluster

The 'new layer' should receive logs as HTTP requests from HAProxy and produce 
the same logs to Kafka without loss.

The options that seem to be available are:
1. Flume: it has an HTTP source & Kafka sink, but the documentation says the HTTP 
source is not for production use.
2. Kafka REST Proxy: though this seems fine, it adds another dependency on 
Schema Registry servers to validate the schema, which would then also have to be 
used by the consumers.
3. Custom plugin to handle this functionality: though the functionality seems 
simple, the scalability and reliability aspects and the maintenance effort would be greater.

Thanks
Hemanth

-Original Message-
From: Marc Bollinger [mailto:m...@lumoslabs.com] 
Sent: Thursday, August 27, 2015 4:39 AM
To: users@kafka.apache.org
Cc: dev-subscr...@kafka.apache.org
Subject: Re: Http Kafka producer

I'm actually also really interested in this...I had a chat about this on the 
distributed systems slack's http://dist-sys.slack.com Kafka channel a few 
days ago, but we're not much further than griping about the problem.
We're basically migrating an existing event system, one which packed messages 
into files, waited for a time-or-space threshold to be crossed, then dealt with 
distribution in terms of files. Basically, we'd like to keep a lot of those 
semantics: we can acknowledge success on the app server as soon as we've 
flushed to disk, and rely on the filesystem for durability, and total order 
across the system doesn't matter, as the HTTP PUTs sending the messages are 
load balanced across many app servers. We also can tolerate [very] long 
downstream event system outages, because...we're ultimately just writing 
sequentially to disk, per process (I should mention that this part is in Rails, 
which means we're dealing largely in terms of processes, not threads).

RocksDB was mentioned in the discussion, but spending exactly 5 minutes 
researching that solution, it seems like the dead simplest solution on an app 
server in terms of moving parts (multiple processes writing, one process 
reading/forwarding to Kafka) wouldn't work well with RocksDB.
Although now that I'm looking at it more, it looks like they're working on a 
MySQL storage engine?

Anyway yeah, I'd love some discussion on this, or war stories of migration to 
Kafka from other event systems (F/OSS or...bespoke).

On Wed, Aug 26, 2015 at 3:45 PM, Hemanth Abbina heman...@eiqnetworks.com
wrote:

 Hi,

 Our application receives events through a HAProxy server on HTTPs, 
 which should be forwarded and stored to Kafka cluster.

 What should be the best option for this ?
 This layer should receive events from HAProxy  produce them to Kafka 
 cluster, in a reliable and efficient way (and should scale horizontally).

 Please suggest.

 --regards
 Hemanth



Re: Http Kafka producer

2015-08-26 Thread Marc Bollinger
Apologies if this is somewhat redundant; I'm quite new to both Kafka and the 
Confluent Platform. Ewen, when you say "Under the hood, the new producer will 
automatically batch requests":

Do you mean that this is current or planned behavior of the REST proxy? Are 
there any durability guarantees, or are batches just held in memory before 
being sent to Kafka (or some other option)?

Thanks!

 On Aug 26, 2015, at 9:50 PM, Ewen Cheslack-Postava e...@confluent.io wrote:
 
 Hemanth,
 
 The Confluent Platform 1.0 version doesn't have JSON embedded format support
 (i.e. direct embedding of JSON messages), but you can serialize, base64
 encode, and use the binary mode, paying a bit of overhead. However, since
 then we merged a patch to add JSON support:
 https://github.com/confluentinc/kafka-rest/pull/89 The JSON support does
 not interact with the schema registry at all. If you're ok building your
 own version from trunk you could use that, or this will be released with
 our next platform version.
 
 In the REST proxy, each HTTP requests will result in one call to
 producer.send(). Under the hood, the new producer will automatically batch
 requests. The default settings will only batch when it's necessary (because
 there are already too many outstanding requests, so messages pile up in the
 local buffer), so you get the advantages of batching, but with a lower
 request rate the messages will still be sent to the broker immediately.
 
 -Ewen
 
 On Wed, Aug 26, 2015 at 9:31 PM, Hemanth Abbina heman...@eiqnetworks.com
 wrote:
 
 Ewen,
 
 Thanks for the explanation.
 
 We have control over the logs format coming to HAProxy. Right now, these
 are plain JSON logs (just like syslog messages with few additional meta
 information) sent to HAProxy from remote clients using HTTPs. No
 serialization is used.
 
 Currently, we have one log each of the HTTP request. I understood that
 every request is produced individually without batching.
 
 Will this work with REST proxy, without using schema registry ?
 
 --regards
 Hemanth
 
 -Original Message-
 From: Ewen Cheslack-Postava [mailto:e...@confluent.io]
 Sent: Thursday, August 27, 2015 9:14 AM
 To: users@kafka.apache.org
 Subject: Re: Http Kafka producer
 
 Hemanth,
 
 Can you be a bit more specific about your setup? Do you have control over
 the format of the request bodies that reach HAProxy or not? If you do,
 Confluent's REST proxy should work fine and does not require the Schema
 Registry. It supports both binary (encoded as base64 so it can be passed
 via the JSON request body) and Avro. With Avro it uses the schema registry,
 but the binary mode doesn't require it.
 
 If you don't have control over the format, then the REST proxy is not
 currently designed to support that use case. I don't think HAProxy can
 rewrite request bodies (beyond per-line regexes, which would be hard to
 make work), so that's not an option either. It would certainly be possible
 to make a small addition to the REST proxy to allow binary request bodies
 to be produced directly to a topic specified in the URL, though you'd be
 paying pretty high overhead per message -- without the ability to batch,
 you're doing one HTTP request per messages. This might not be bad if your
 messages are large enough? (Then again, the same issue applies regardless
 of what solution you end up with if each of the requests to HAProxy only
 contains one message).
 
 -Ewen
 
 
 
 On Wed, Aug 26, 2015 at 5:05 PM, Hemanth Abbina heman...@eiqnetworks.com
 wrote:
 
 Marc,
 
 Thanks for your response.  Let's have more details on the problem.
 
 As I already mentioned in the previous post, here is our expected data
 flow:  logs - HAProxy - {new layer } - Kafka Cluster
 
 The 'new layer' should receive logs as HTTP requests from HAproxy and
 produce the same logs to Kafka without loss.
 
 Options that seems to be available, are 1. Flume: It has a HTTP source
  Kafka sink, but the documentation says HTTP source is not for
 production use.
 2. Kafka Rest Proxy: Though this seems to be fine, adding another
 dependency of Schema Registry servers to validate the schema, which
 should be again used by the consumers.
 3. Custom plugin to handle this functionality: Though the
 functionality seems to be simple - scalability, reliability aspects
 and maintenance would be more.
 
 Thanks
 Hemanth
 
 -Original Message-
 From: Marc Bollinger [mailto:m...@lumoslabs.com]
 Sent: Thursday, August 27, 2015 4:39 AM
 To: users@kafka.apache.org
 Cc: dev-subscr...@kafka.apache.org
 Subject: Re: Http Kafka producer
 
 I'm actually also really interested in this...I had a chat about this
 on the distributed systems slack's http://dist-sys.slack.com Kafka
 channel a few days ago, but we're not much further than griping about
 the problem.
 We're basically migrating an existing event system, one which packed
 messages into files, waited for a time-or-space threshold to be
 crossed, then dealt with distribution in terms of files. 

RE: Http Kafka producer

2015-08-26 Thread Hemanth Abbina
Ewen,

Thanks for the valuable information.

I will surely try this and come up with my comments.

Thanks again
Hemanth

-Original Message-
From: Ewen Cheslack-Postava [mailto:e...@confluent.io] 
Sent: Thursday, August 27, 2015 10:21 AM
To: users@kafka.apache.org
Subject: Re: Http Kafka producer

Hemanth,

The Confluent Platform 1.0 version doesn't have JSON embedded format support (i.e. 
direct embedding of JSON messages), but you can serialize, base64 encode, and 
use the binary mode, paying a bit of overhead. However, since then we merged a 
patch to add JSON support:
https://github.com/confluentinc/kafka-rest/pull/89 The JSON support does not 
interact with the schema registry at all. If you're ok building your own 
version from trunk you could use that, or this will be released with our next 
platform version.

In the REST proxy, each HTTP requests will result in one call to 
producer.send(). Under the hood, the new producer will automatically batch 
requests. The default settings will only batch when it's necessary (because 
there are already too many outstanding requests, so messages pile up in the 
local buffer), so you get the advantages of batching, but with a lower request 
rate the messages will still be sent to the broker immediately.

-Ewen

On Wed, Aug 26, 2015 at 9:31 PM, Hemanth Abbina heman...@eiqnetworks.com
wrote:

 Ewen,

 Thanks for the explanation.

 We have control over the logs format coming to HAProxy. Right now, 
 these are plain JSON logs (just like syslog messages with few 
 additional meta
 information) sent to HAProxy from remote clients using HTTPs. No 
 serialization is used.

 Currently, we have one log each of the HTTP request. I understood that 
 every request is produced individually without batching.

 Will this work with REST proxy, without using schema registry ?

 --regards
 Hemanth

 -Original Message-
 From: Ewen Cheslack-Postava [mailto:e...@confluent.io]
 Sent: Thursday, August 27, 2015 9:14 AM
 To: users@kafka.apache.org
 Subject: Re: Http Kafka producer

 Hemanth,

 Can you be a bit more specific about your setup? Do you have control 
 over the format of the request bodies that reach HAProxy or not? If 
 you do, Confluent's REST proxy should work fine and does not require 
 the Schema Registry. It supports both binary (encoded as base64 so it 
 can be passed via the JSON request body) and Avro. With Avro it uses 
 the schema registry, but the binary mode doesn't require it.

 If you don't have control over the format, then the REST proxy is not 
 currently designed to support that use case. I don't think HAProxy can 
 rewrite request bodies (beyond per-line regexes, which would be hard 
 to make work), so that's not an option either. It would certainly be 
 possible to make a small addition to the REST proxy to allow binary 
 request bodies to be produced directly to a topic specified in the 
 URL, though you'd be paying pretty high overhead per message -- 
 without the ability to batch, you're doing one HTTP request per 
 messages. This might not be bad if your messages are large enough? 
 (Then again, the same issue applies regardless of what solution you 
 end up with if each of the requests to HAProxy only contains one message).

 -Ewen



 On Wed, Aug 26, 2015 at 5:05 PM, Hemanth Abbina 
 heman...@eiqnetworks.com
 wrote:

  Marc,
 
  Thanks for your response.  Let's have more details on the problem.
 
  As I already mentioned in the previous post, here is our expected 
  data
  flow:  logs - HAProxy - {new layer } - Kafka Cluster
 
  The 'new layer' should receive logs as HTTP requests from HAproxy 
  and produce the same logs to Kafka without loss.
 
  Options that seems to be available, are 1. Flume: It has a HTTP 
  source  Kafka sink, but the documentation says HTTP source is not 
  for production use.
  2. Kafka Rest Proxy: Though this seems to be fine, adding another 
  dependency of Schema Registry servers to validate the schema, which 
  should be again used by the consumers.
  3. Custom plugin to handle this functionality: Though the 
  functionality seems to be simple - scalability, reliability aspects 
  and maintenance would be more.
 
  Thanks
  Hemanth
 
  -Original Message-
  From: Marc Bollinger [mailto:m...@lumoslabs.com]
  Sent: Thursday, August 27, 2015 4:39 AM
  To: users@kafka.apache.org
  Cc: dev-subscr...@kafka.apache.org
  Subject: Re: Http Kafka producer
 
  I'm actually also really interested in this...I had a chat about 
  this on the distributed systems slack's http://dist-sys.slack.com 
  Kafka channel a few days ago, but we're not much further than 
  griping about
 the problem.
  We're basically migrating an existing event system, one which packed 
  messages into files, waited for a time-or-space threshold to be 
  crossed, then dealt with distribution in terms of files. Basically, 
  we'd like to keep a lot of those semantics: we can acknowledge 
  success on the app server as soon as we've flushed to 

Re: Http Kafka producer

2015-08-26 Thread Ewen Cheslack-Postava
Hemanth,

Can you be a bit more specific about your setup? Do you have control over
the format of the request bodies that reach HAProxy or not? If you do,
Confluent's REST proxy should work fine and does not require the Schema
Registry. It supports both binary (encoded as base64 so it can be passed
via the JSON request body) and Avro. With Avro it uses the schema registry,
but the binary mode doesn't require it.
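
As a concrete illustration of the binary mode, a rough sketch of one produce
request (the proxy host, port and topic are placeholders; the v1 content type
shown should be verified against the kafka-rest docs for your version):

    import java.net.{HttpURLConnection, URL}
    import java.nio.charset.StandardCharsets
    import java.util.Base64

    // Binary embedded format: raw message bytes are base64-encoded inside a JSON envelope.
    val encoded = Base64.getEncoder.encodeToString("raw event bytes".getBytes(StandardCharsets.UTF_8))
    val body = s"""{"records":[{"value":"$encoded"}]}"""

    val conn = new URL("http://rest-proxy.example.com:8082/topics/logs")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/vnd.kafka.binary.v1+json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(body.getBytes(StandardCharsets.UTF_8))
    println(s"HTTP ${conn.getResponseCode}")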

If you don't have control over the format, then the REST proxy is not
currently designed to support that use case. I don't think HAProxy can
rewrite request bodies (beyond per-line regexes, which would be hard to
make work), so that's not an option either. It would certainly be possible
to make a small addition to the REST proxy to allow binary request bodies
to be produced directly to a topic specified in the URL, though you'd be
paying pretty high overhead per message -- without the ability to batch,
you're doing one HTTP request per message. This might not be bad if your
messages are large enough? (Then again, the same issue applies regardless
of what solution you end up with if each of the requests to HAProxy only
contains one message).

-Ewen



On Wed, Aug 26, 2015 at 5:05 PM, Hemanth Abbina heman...@eiqnetworks.com
wrote:

 Marc,

 Thanks for your response.  Let's have more details on the problem.

 As I already mentioned in the previous post, here is our expected data
 flow:  logs - HAProxy - {new layer } - Kafka Cluster

 The 'new layer' should receive logs as HTTP requests from HAproxy and
 produce the same logs to Kafka without loss.

 Options that seems to be available, are
 1. Flume: It has a HTTP source  Kafka sink, but the documentation says
 HTTP source is not for production use.
 2. Kafka Rest Proxy: Though this seems to be fine, adding another
 dependency of Schema Registry servers to validate the schema, which should
 be again used by the consumers.
 3. Custom plugin to handle this functionality: Though the functionality
 seems to be simple - scalability, reliability aspects and maintenance would
 be more.

 Thanks
 Hemanth

 -Original Message-
 From: Marc Bollinger [mailto:m...@lumoslabs.com]
 Sent: Thursday, August 27, 2015 4:39 AM
 To: users@kafka.apache.org
 Cc: dev-subscr...@kafka.apache.org
 Subject: Re: Http Kafka producer

 I'm actually also really interested in this...I had a chat about this on
 the distributed systems slack's http://dist-sys.slack.com Kafka channel
 a few days ago, but we're not much further than griping about the problem.
 We're basically migrating an existing event system, one which packed
 messages into files, waited for a time-or-space threshold to be crossed,
 then dealt with distribution in terms of files. Basically, we'd like to
 keep a lot of those semantics: we can acknowledge success on the app server
 as soon as we've flushed to disk, and rely on the filesystem for
 durability, and total order across the system doesn't matter, as the HTTP
 PUTs sending the messages are load balanced across many app servers. We
 also can tolerate [very] long downstream event system outages,
 because...we're ultimately just writing sequentially to disk, per process
 (I should mention that this part is in Rails, which means we're dealing
 largely in terms of processes, not threads).

 RocksDB was mentioned in the discussion, but spending exactly 5 minutes
 researching that solution, it seems like the dead simplest solution on an
 app server in terms of moving parts (multiple processes writing, one
 process reading/forwarding to Kafka) wouldn't work well with RocksDB.
 Although now that I'm looking at it more, it looks like they're working on
 a MySQL storage engine?

 Anyway yeah, I'd love some discussion on this, or war stories of migration
 to Kafka from other event systems (F/OSS or...bespoke).

 On Wed, Aug 26, 2015 at 3:45 PM, Hemanth Abbina heman...@eiqnetworks.com
 wrote:

  Hi,
 
  Our application receives events through a HAProxy server on HTTPs,
  which should be forwarded and stored to Kafka cluster.
 
  What should be the best option for this ?
  This layer should receive events from HAProxy  produce them to Kafka
  cluster, in a reliable and efficient way (and should scale horizontally).
 
  Please suggest.
 
  --regards
  Hemanth
 




-- 
Thanks,
Ewen


RE: Http Kafka producer

2015-08-26 Thread Hemanth Abbina
Ewen,

Thanks for the explanation.

We have control over the log format coming to HAProxy. Right now, these are 
plain JSON logs (just like syslog messages, with a few additional meta 
fields) sent to HAProxy from remote clients using HTTPS. No serialization 
is used.

Currently, we have one log per HTTP request. I understand that every 
request is produced individually, without batching.

Will this work with the REST proxy, without using the schema registry?

--regards
Hemanth

-Original Message-
From: Ewen Cheslack-Postava [mailto:e...@confluent.io] 
Sent: Thursday, August 27, 2015 9:14 AM
To: users@kafka.apache.org
Subject: Re: Http Kafka producer

Hemanth,

Can you be a bit more specific about your setup? Do you have control over the 
format of the request bodies that reach HAProxy or not? If you do, Confluent's 
REST proxy should work fine and does not require the Schema Registry. It 
supports both binary (encoded as base64 so it can be passed via the JSON 
request body) and Avro. With Avro it uses the schema registry, but the binary 
mode doesn't require it.

If you don't have control over the format, then the REST proxy is not currently 
designed to support that use case. I don't think HAProxy can rewrite request 
bodies (beyond per-line regexes, which would be hard to make work), so that's 
not an option either. It would certainly be possible to make a small addition 
to the REST proxy to allow binary request bodies to be produced directly to a 
topic specified in the URL, though you'd be paying pretty high overhead per 
message -- without the ability to batch, you're doing one HTTP request per 
messages. This might not be bad if your messages are large enough? (Then again, 
the same issue applies regardless of what solution you end up with if each of 
the requests to HAProxy only contains one message).

-Ewen



On Wed, Aug 26, 2015 at 5:05 PM, Hemanth Abbina heman...@eiqnetworks.com
wrote:

 Marc,

 Thanks for your response.  Let's have more details on the problem.

 As I already mentioned in the previous post, here is our expected data
 flow:  logs - HAProxy - {new layer } - Kafka Cluster

 The 'new layer' should receive logs as HTTP requests from HAproxy and 
 produce the same logs to Kafka without loss.

 Options that seems to be available, are 1. Flume: It has a HTTP source 
  Kafka sink, but the documentation says HTTP source is not for 
 production use.
 2. Kafka Rest Proxy: Though this seems to be fine, adding another 
 dependency of Schema Registry servers to validate the schema, which 
 should be again used by the consumers.
 3. Custom plugin to handle this functionality: Though the 
 functionality seems to be simple - scalability, reliability aspects 
 and maintenance would be more.

 Thanks
 Hemanth

 -Original Message-
 From: Marc Bollinger [mailto:m...@lumoslabs.com]
 Sent: Thursday, August 27, 2015 4:39 AM
 To: users@kafka.apache.org
 Cc: dev-subscr...@kafka.apache.org
 Subject: Re: Http Kafka producer

 I'm actually also really interested in this...I had a chat about this 
 on the distributed systems slack's http://dist-sys.slack.com Kafka 
 channel a few days ago, but we're not much further than griping about the 
 problem.
 We're basically migrating an existing event system, one which packed 
 messages into files, waited for a time-or-space threshold to be 
 crossed, then dealt with distribution in terms of files. Basically, 
 we'd like to keep a lot of those semantics: we can acknowledge success 
 on the app server as soon as we've flushed to disk, and rely on the 
 filesystem for durability, and total order across the system doesn't 
 matter, as the HTTP PUTs sending the messages are load balanced across 
 many app servers. We also can tolerate [very] long downstream event 
 system outages, because...we're ultimately just writing sequentially 
 to disk, per process (I should mention that this part is in Rails, 
 which means we're dealing largely in terms of processes, not threads).

 RocksDB was mentioned in the discussion, but spending exactly 5 
 minutes researching that solution, it seems like the dead simplest 
 solution on an app server in terms of moving parts (multiple processes 
 writing, one process reading/forwarding to Kafka) wouldn't work well with 
 RocksDB.
 Although now that I'm looking at it more, it looks like they're 
 working on a MySQL storage engine?

 Anyway yeah, I'd love some discussion on this, or war stories of 
 migration to Kafka from other event systems (F/OSS or...bespoke).

 On Wed, Aug 26, 2015 at 3:45 PM, Hemanth Abbina 
 heman...@eiqnetworks.com
 wrote:

  Hi,
 
  Our application receives events through a HAProxy server on HTTPs, 
  which should be forwarded and stored to Kafka cluster.
 
  What should be the best option for this ?
  This layer should receive events from HAProxy  produce them to 
  Kafka cluster, in a reliable and efficient way (and should scale 
  horizontally).
 
  Please suggest.
 
  --regards
  Hemanth
 

Re: Http Kafka producer

2015-08-26 Thread Ewen Cheslack-Postava
Hemanth,

The Confluent Platform 1.0 version doesn't have JSON embedded format support
(i.e. direct embedding of JSON messages), but you can serialize, base64
encode, and use the binary mode, paying a bit of overhead. However, since
then we merged a patch to add JSON support:
https://github.com/confluentinc/kafka-rest/pull/89 The JSON support does
not interact with the schema registry at all. If you're ok building your
own version from trunk you could use that, or this will be released with
our next platform version.

In the REST proxy, each HTTP request will result in one call to
producer.send(). Under the hood, the new producer will automatically batch
requests. The default settings will only batch when it's necessary (because
there are already too many outstanding requests, so messages pile up in the
local buffer), so you get the advantages of batching, but with a lower
request rate the messages will still be sent to the broker immediately.
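
For reference, the relevant knobs here are batch.size and linger.ms on the
producer; a minimal sketch (broker address and topic are placeholders):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // placeholder
    props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
    // With the defaults linger.ms=0, so batches only form when sends back up
    // behind in-flight requests; raising linger.ms trades a little latency
    // for larger batches.
    props.put("batch.size", "16384")
    props.put("linger.ms", "5")

    val producer = new KafkaProducer[Array[Byte], Array[Byte]](props)
    producer.send(new ProducerRecord[Array[Byte], Array[Byte]]("logs", "one message".getBytes("UTF-8")))
    producer.close()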

-Ewen

On Wed, Aug 26, 2015 at 9:31 PM, Hemanth Abbina heman...@eiqnetworks.com
wrote:

 Ewen,

 Thanks for the explanation.

 We have control over the logs format coming to HAProxy. Right now, these
 are plain JSON logs (just like syslog messages with few additional meta
 information) sent to HAProxy from remote clients using HTTPs. No
 serialization is used.

 Currently, we have one log each of the HTTP request. I understood that
 every request is produced individually without batching.

 Will this work with REST proxy, without using schema registry ?

 --regards
 Hemanth

 -Original Message-
 From: Ewen Cheslack-Postava [mailto:e...@confluent.io]
 Sent: Thursday, August 27, 2015 9:14 AM
 To: users@kafka.apache.org
 Subject: Re: Http Kafka producer

 Hemanth,

 Can you be a bit more specific about your setup? Do you have control over
 the format of the request bodies that reach HAProxy or not? If you do,
 Confluent's REST proxy should work fine and does not require the Schema
 Registry. It supports both binary (encoded as base64 so it can be passed
 via the JSON request body) and Avro. With Avro it uses the schema registry,
 but the binary mode doesn't require it.

 If you don't have control over the format, then the REST proxy is not
 currently designed to support that use case. I don't think HAProxy can
 rewrite request bodies (beyond per-line regexes, which would be hard to
 make work), so that's not an option either. It would certainly be possible
 to make a small addition to the REST proxy to allow binary request bodies
 to be produced directly to a topic specified in the URL, though you'd be
 paying pretty high overhead per message -- without the ability to batch,
 you're doing one HTTP request per messages. This might not be bad if your
 messages are large enough? (Then again, the same issue applies regardless
 of what solution you end up with if each of the requests to HAProxy only
 contains one message).

 -Ewen



 On Wed, Aug 26, 2015 at 5:05 PM, Hemanth Abbina heman...@eiqnetworks.com
 wrote:

  Marc,
 
  Thanks for your response.  Let's have more details on the problem.
 
  As I already mentioned in the previous post, here is our expected data
  flow:  logs - HAProxy - {new layer } - Kafka Cluster
 
  The 'new layer' should receive logs as HTTP requests from HAproxy and
  produce the same logs to Kafka without loss.
 
  Options that seems to be available, are 1. Flume: It has a HTTP source
   Kafka sink, but the documentation says HTTP source is not for
  production use.
  2. Kafka Rest Proxy: Though this seems to be fine, adding another
  dependency of Schema Registry servers to validate the schema, which
  should be again used by the consumers.
  3. Custom plugin to handle this functionality: Though the
  functionality seems to be simple - scalability, reliability aspects
  and maintenance would be more.
 
  Thanks
  Hemanth
 
  -Original Message-
  From: Marc Bollinger [mailto:m...@lumoslabs.com]
  Sent: Thursday, August 27, 2015 4:39 AM
  To: users@kafka.apache.org
  Cc: dev-subscr...@kafka.apache.org
  Subject: Re: Http Kafka producer
 
  I'm actually also really interested in this...I had a chat about this
  on the distributed systems slack's http://dist-sys.slack.com Kafka
  channel a few days ago, but we're not much further than griping about
 the problem.
  We're basically migrating an existing event system, one which packed
  messages into files, waited for a time-or-space threshold to be
  crossed, then dealt with distribution in terms of files. Basically,
  we'd like to keep a lot of those semantics: we can acknowledge success
  on the app server as soon as we've flushed to disk, and rely on the
  filesystem for durability, and total order across the system doesn't
  matter, as the HTTP PUTs sending the messages are load balanced across
  many app servers. We also can tolerate [very] long downstream event
  system outages, because...we're ultimately just writing sequentially
  to disk, per process (I should mention that this