Hi Marc,

That describes the behavior of the Kafka producer library, which batches writes to Kafka. The KafkaProducer javadoc explains it pretty well:
http://kafka.apache.org/082/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html
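(For illustration, a minimal sketch of the batching and durability settings that javadoc covers, using the standard Java producer; the broker address, topic name, and values here are assumptions, not recommendations:)

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DurableProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Batching: records for the same topic-partition are grouped into a single request.
        props.put("batch.size", "16384"); // maximum bytes per batch
        props.put("linger.ms", "5");      // wait up to 5 ms to fill a batch; trades latency for throughput

        // Durability: stronger settings cost latency and throughput.
        props.put("acks", "all");         // wait for all in-sync replicas to acknowledge each write
        props.put("retries", "3");        // retry transient send failures

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("logs", "an example message")); // assumed topic name
        } // close() flushes anything still sitting in the local batch buffer
    }
}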
But the general idea is that the producer will group together a bunch of writes to Kafka for a specific topic and partition, and then send them as a single request. Durability guarantees in Kafka depend on your configuration and can be very weak or very strong. Reading the Kafka documentation page's sections about producers should make it clear which settings improve durability at the cost of latency and throughput. But there would be a risk of losing the messages that are inside the proxy application during a failure, unless there is the ability to replay them from the source.

-Erik

On 8/27/15, 12:34 AM, "Marc Bollinger" <m...@lumoslabs.com> wrote:

>Apologies if this is somewhat redundant, I'm quite new to both Kafka and the Confluent Platform. Ewen, when you say "Under the hood, the new producer will automatically batch requests."
>
>Do you mean that this is a current or planned behavior of the REST proxy? Are there any durability guarantees, or are batches just held in memory before being sent to Kafka (or some other option)?
>
>Thanks!
>
>> On Aug 26, 2015, at 9:50 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
>>
>> Hemanth,
>>
>> The Confluent Platform 1.0 version does not have JSON embedded format support (i.e. direct embedding of JSON messages), but you can serialize, base64 encode, and use the binary mode, paying a bit of overhead. However, since then we merged a patch to add JSON support: https://github.com/confluentinc/kafka-rest/pull/89 The JSON support does not interact with the schema registry at all. If you're ok building your own version from trunk you could use that, or this will be released with our next platform version.
>>
>> In the REST proxy, each HTTP request will result in one call to producer.send(). Under the hood, the new producer will automatically batch requests. The default settings will only batch when it's necessary (because there are already too many outstanding requests, so messages pile up in the local buffer), so you get the advantages of batching, but with a lower request rate the messages will still be sent to the broker immediately.
>>
>> -Ewen
>>
>> On Wed, Aug 26, 2015 at 9:31 PM, Hemanth Abbina <heman...@eiqnetworks.com> wrote:
>>
>>> Ewen,
>>>
>>> Thanks for the explanation.
>>>
>>> We have control over the log format coming to HAProxy. Right now, these are plain JSON logs (just like syslog messages with some additional meta information) sent to HAProxy from remote clients over HTTPS. No serialization is used.
>>>
>>> Currently, we have one log per HTTP request. I understand that every request is produced individually, without batching.
>>>
>>> Will this work with the REST proxy, without using the schema registry?
>>>
>>> --regards
>>> Hemanth
>>>
>>> -----Original Message-----
>>> From: Ewen Cheslack-Postava [mailto:e...@confluent.io]
>>> Sent: Thursday, August 27, 2015 9:14 AM
>>> To: users@kafka.apache.org
>>> Subject: Re: Http Kafka producer
>>>
>>> Hemanth,
>>>
>>> Can you be a bit more specific about your setup? Do you have control over the format of the request bodies that reach HAProxy or not? If you do, Confluent's REST proxy should work fine and does not require the Schema Registry. It supports both binary (encoded as base64 so it can be passed via the JSON request body) and Avro. With Avro it uses the schema registry, but the binary mode doesn't require it.
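(For concreteness, a minimal sketch of what a binary-mode produce request could look like against the REST proxy's v1 API; the proxy address, topic name, and payload are assumptions:)

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class RestProxyBinaryProduceSketch {
    public static void main(String[] args) throws Exception {
        // Base64-encode the raw message so it can travel inside the JSON request body.
        String encoded = Base64.getEncoder()
                .encodeToString("an example log line".getBytes(StandardCharsets.UTF_8));
        String body = "{\"records\":[{\"value\":\"" + encoded + "\"}]}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8082/topics/logs")) // assumed proxy address and topic
                .header("Content-Type", "application/vnd.kafka.binary.v1+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}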
>>> If you don't have control over the format, then the REST proxy is not currently designed to support that use case. I don't think HAProxy can rewrite request bodies (beyond per-line regexes, which would be hard to make work), so that's not an option either. It would certainly be possible to make a small addition to the REST proxy to allow binary request bodies to be produced directly to a topic specified in the URL, though you'd be paying pretty high overhead per message -- without the ability to batch, you're doing one HTTP request per message. This might not be bad if your messages are large enough? (Then again, the same issue applies regardless of what solution you end up with if each of the requests to HAProxy only contains one message.)
>>>
>>> -Ewen
>>>
>>> On Wed, Aug 26, 2015 at 5:05 PM, Hemanth Abbina <heman...@eiqnetworks.com> wrote:
>>>
>>>> Marc,
>>>>
>>>> Thanks for your response. Here are more details on the problem.
>>>>
>>>> As I already mentioned in the previous post, here is our expected data flow: logs -> HAProxy -> {new layer} -> Kafka cluster
>>>>
>>>> The 'new layer' should receive logs as HTTP requests from HAProxy and produce the same logs to Kafka without loss.
>>>>
>>>> The options that seem to be available are:
>>>> 1. Flume: It has an HTTP source & Kafka sink, but the documentation says the HTTP source is not for production use.
>>>> 2. Kafka REST Proxy: Though this seems to be fine, it adds another dependency on Schema Registry servers to validate the schema, which would then also have to be used by the consumers.
>>>> 3. Custom plugin to handle this functionality: Though the functionality seems simple, the scalability, reliability, and maintenance effort would be greater.
>>>>
>>>> Thanks
>>>> Hemanth
>>>>
>>>> -----Original Message-----
>>>> From: Marc Bollinger [mailto:m...@lumoslabs.com]
>>>> Sent: Thursday, August 27, 2015 4:39 AM
>>>> To: users@kafka.apache.org
>>>> Cc: dev-subscr...@kafka.apache.org
>>>> Subject: Re: Http Kafka producer
>>>>
>>>> I'm actually also really interested in this... I had a chat about this on the distributed systems Slack's <http://dist-sys.slack.com> Kafka channel a few days ago, but we're not much further than griping about the problem.
>>>>
>>>> We're basically migrating an existing event system, one which packed messages into files, waited for a time-or-space threshold to be crossed, then dealt with distribution in terms of files. Basically, we'd like to keep a lot of those semantics: we can acknowledge success on the app server as soon as we've flushed to disk and rely on the filesystem for durability, and total order across the system doesn't matter, since the HTTP PUTs sending the messages are load balanced across many app servers. We can also tolerate [very] long downstream event-system outages, because we're ultimately just writing sequentially to disk, per process (I should mention that this part is in Rails, which means we're dealing largely in terms of processes, not threads).
>>>>
>>>> RocksDB was mentioned in the discussion, but after spending exactly 5 minutes researching that option, it seems like the dead simplest setup on an app server in terms of moving parts (multiple processes writing, one process reading/forwarding to Kafka) wouldn't work well with RocksDB. Although now that I'm looking at it more, it looks like they're working on a MySQL storage engine?
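(As an aside, a minimal sketch of the forwarder half of the spool-and-forward pattern Marc describes -- app processes append messages to a local file and ack once the write is flushed, and a single process reads the file and produces to Kafka; the file path, topic, broker address, and offset handling are all assumptions:)

import java.io.RandomAccessFile;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SpoolForwarderSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");

        long position = 0; // in practice this offset would be persisted so a restart can resume
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             RandomAccessFile spool = new RandomAccessFile("/var/spool/events.log", "r")) {
            while (true) {
                spool.seek(position);
                String line;
                while ((line = spool.readLine()) != null) {
                    producer.send(new ProducerRecord<>("logs", line)); // assumed topic name
                    position = spool.getFilePointer();
                }
                Thread.sleep(500); // poll for new data appended by the writer processes
            }
        }
    }
}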
>>>>
>>>> Anyway yeah, I'd love some discussion on this, or war stories of migration to Kafka from other event systems (F/OSS or...bespoke).
>>>>
>>>> On Wed, Aug 26, 2015 at 3:45 PM, Hemanth Abbina <heman...@eiqnetworks.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Our application receives events through an HAProxy server over HTTPS, and these should be forwarded and stored to a Kafka cluster.
>>>>>
>>>>> What would be the best option for this? This layer should receive events from HAProxy & produce them to the Kafka cluster, in a reliable and efficient way (and it should scale horizontally).
>>>>>
>>>>> Please suggest.
>>>>>
>>>>> --regards
>>>>> Hemanth
>>>
>>> --
>>> Thanks,
>>> Ewen
>>
>> --
>> Thanks,
>> Ewen