Re: Http Kafka producer
Hi Marc,

That describes the behavior of the Kafka producer library, which batches writes to Kafka. The KafkaProducer javadoc explains it pretty well: http://kafka.apache.org/082/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html

The general idea is that the producer groups together a bunch of writes to Kafka for a specific topic and partition, and then sends them as a single request. Durability guarantees in Kafka depend on your configuration and can be very weak or very strong. Reading the sections of the Kafka documentation about producers should make it clear which settings improve durability at the cost of latency and throughput. But there would be a risk of losing the messages that are inside the proxy application during a failure, unless the source has the ability to replay them.

-Erik

On 8/27/15, 12:34 AM, Marc Bollinger m...@lumoslabs.com wrote:

Apologies if this is somewhat redundant; I'm quite new to both Kafka and the Confluent Platform. Ewen, when you say "Under the hood, the new producer will automatically batch requests," do you mean that this is a current or planned behavior of the REST proxy? Are there any durability guarantees, or are batches just held in memory before being sent to Kafka (or some other option)? Thanks!
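The durability-versus-latency trade-off Erik mentions comes down to a handful of producer settings. A minimal sketch, shown as Python dicts keyed by the standard Java-client config names; the values are illustrative assumptions, not tuned recommendations:

```python
# Producer settings that trade latency/throughput for durability.
# Config names are the standard Kafka producer configs; the values
# are illustrative, not recommendations.
durable_producer_config = {
    "acks": "all",              # wait for all in-sync replicas to ack (strongest)
    "retries": 5,               # retry transient send failures
    "linger.ms": 0,             # don't wait to fill a batch
    "batch.size": 16384,        # max bytes batched per partition per request
    "buffer.memory": 33554432,  # total memory for buffered, unsent records
}

low_latency_weaker_config = {
    "acks": "1",                # leader-only ack: faster, but weaker durability
    "retries": 0,               # fail fast instead of retrying
    "linger.ms": 0,
}
```

With `acks=all` an acknowledged write has reached every in-sync replica; with `acks=1` or `acks=0` the producer acks sooner but can lose messages on broker failure, which is the trade-off the documentation sections Erik points to describe.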
Http Kafka producer
Hi,

Our application receives events through an HAProxy server over HTTPS, and these events should be forwarded to and stored in a Kafka cluster. What would be the best option for this? The layer should receive events from HAProxy and produce them to the Kafka cluster in a reliable and efficient way (and should scale horizontally). Please suggest.

--regards
Hemanth
Re: Http Kafka producer
I'm actually also really interested in this... I had a chat about it on the Kafka channel of the distributed-systems Slack (http://dist-sys.slack.com) a few days ago, but we didn't get much further than griping about the problem.

We're basically migrating an existing event system, one which packed messages into files, waited for a time-or-space threshold to be crossed, then dealt with distribution in terms of files. We'd like to keep a lot of those semantics: we can acknowledge success on the app server as soon as we've flushed to disk and rely on the filesystem for durability, and total order across the system doesn't matter, since the HTTP PUTs carrying the messages are load balanced across many app servers. We can also tolerate [very] long downstream event-system outages, because we're ultimately just writing sequentially to disk, per process (I should mention that this part is in Rails, which means we're dealing largely in terms of processes, not threads).

RocksDB was mentioned in the discussion, but after spending exactly five minutes researching it, it seems the dead-simplest arrangement on an app server in terms of moving parts (multiple processes writing, one process reading and forwarding to Kafka) wouldn't work well with RocksDB. Although now that I'm looking at it more, it looks like they're working on a MySQL storage engine?

Anyway, I'd love some discussion on this, or war stories of migration to Kafka from other event systems (F/OSS or... bespoke).
RE: Http Kafka producer
Marc,

Thanks for your response. Let me give more details on the problem. As I mentioned in the previous post, here is our expected data flow:

logs -> HAProxy -> {new layer} -> Kafka cluster

The 'new layer' should receive logs as HTTP requests from HAProxy and produce the same logs to Kafka without loss. The options that seem to be available are:

1. Flume: it has an HTTP source and a Kafka sink, but the documentation says the HTTP source is not for production use.
2. Kafka REST Proxy: though this seems fine, it adds another dependency on Schema Registry servers to validate the schema, which would again have to be used by the consumers.
3. A custom plugin to handle this functionality: though the functionality seems simple, the scalability, reliability, and maintenance burden would be significant.

Thanks
Hemanth
Re: Http Kafka producer
Apologies if this is somewhat redundant; I'm quite new to both Kafka and the Confluent Platform. Ewen, when you say "Under the hood, the new producer will automatically batch requests," do you mean that this is a current or planned behavior of the REST proxy? Are there any durability guarantees, or are batches just held in memory before being sent to Kafka (or some other option)? Thanks!
RE: Http Kafka producer
Ewen,

Thanks for the valuable information. I will surely try this and come back with my comments.

Thanks again
Hemanth
Re: Http Kafka producer
Hemanth,

Can you be a bit more specific about your setup? Do you have control over the format of the request bodies that reach HAProxy or not?

If you do, Confluent's REST proxy should work fine and does not require the Schema Registry. It supports both binary (encoded as base64 so it can be passed via the JSON request body) and Avro. With Avro it uses the Schema Registry, but the binary mode doesn't require it.

If you don't have control over the format, then the REST proxy is not currently designed to support that use case. I don't think HAProxy can rewrite request bodies (beyond per-line regexes, which would be hard to make work), so that's not an option either. It would certainly be possible to make a small addition to the REST proxy to allow binary request bodies to be produced directly to a topic specified in the URL, though you'd be paying pretty high overhead per message: without the ability to batch, you're doing one HTTP request per message. This might not be bad if your messages are large enough. (Then again, the same issue applies regardless of what solution you end up with if each request to HAProxy contains only one message.)

-Ewen
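For the binary embedded format Ewen describes, the proxy expects base64-encoded values inside a JSON envelope. A stdlib-only sketch of building such a request body; the `v1` content type and body shape shown in the comment match the 2015-era REST proxy API, but treat the details as assumptions to verify against your proxy version:

```python
import base64
import json

def build_binary_body(messages):
    """Build a REST-proxy 'binary' produce body: raw bytes -> base64 records."""
    records = [
        {"value": base64.b64encode(m).decode("ascii")} for m in messages
    ]
    return json.dumps({"records": records})

# The request itself would be roughly:
#   POST /topics/<topic>
#   Content-Type: application/vnd.kafka.binary.v1+json
# sent with urllib.request, curl, etc. No Schema Registry is involved,
# since the proxy passes the decoded bytes through opaquely.
```

The base64 step is the "bit of overhead" mentioned in the thread: roughly a 4/3 size inflation on the wire, in exchange for carrying arbitrary bytes inside a JSON request body.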
RE: Http Kafka producer
Ewen, Thanks for the explanation. We have control over the logs format coming to HAProxy. Right now, these are plain JSON logs (just like syslog messages with few additional meta information) sent to HAProxy from remote clients using HTTPs. No serialization is used. Currently, we have one log each of the HTTP request. I understood that every request is produced individually without batching. Will this work with REST proxy, without using schema registry ? --regards Hemanth -Original Message- From: Ewen Cheslack-Postava [mailto:e...@confluent.io] Sent: Thursday, August 27, 2015 9:14 AM To: users@kafka.apache.org Subject: Re: Http Kafka producer Hemanth, Can you be a bit more specific about your setup? Do you have control over the format of the request bodies that reach HAProxy or not? If you do, Confluent's REST proxy should work fine and does not require the Schema Registry. It supports both binary (encoded as base64 so it can be passed via the JSON request body) and Avro. With Avro it uses the schema registry, but the binary mode doesn't require it. If you don't have control over the format, then the REST proxy is not currently designed to support that use case. I don't think HAProxy can rewrite request bodies (beyond per-line regexes, which would be hard to make work), so that's not an option either. It would certainly be possible to make a small addition to the REST proxy to allow binary request bodies to be produced directly to a topic specified in the URL, though you'd be paying pretty high overhead per message -- without the ability to batch, you're doing one HTTP request per messages. This might not be bad if your messages are large enough? (Then again, the same issue applies regardless of what solution you end up with if each of the requests to HAProxy only contains one message). -Ewen On Wed, Aug 26, 2015 at 5:05 PM, Hemanth Abbina heman...@eiqnetworks.com wrote: Marc, Thanks for your response. Let's have more details on the problem. 
As I already mentioned in the previous post, here is our expected data flow: logs - HAProxy - {new layer } - Kafka Cluster The 'new layer' should receive logs as HTTP requests from HAproxy and produce the same logs to Kafka without loss. Options that seems to be available, are 1. Flume: It has a HTTP source Kafka sink, but the documentation says HTTP source is not for production use. 2. Kafka Rest Proxy: Though this seems to be fine, adding another dependency of Schema Registry servers to validate the schema, which should be again used by the consumers. 3. Custom plugin to handle this functionality: Though the functionality seems to be simple - scalability, reliability aspects and maintenance would be more. Thanks Hemanth -Original Message- From: Marc Bollinger [mailto:m...@lumoslabs.com] Sent: Thursday, August 27, 2015 4:39 AM To: users@kafka.apache.org Cc: dev-subscr...@kafka.apache.org Subject: Re: Http Kafka producer I'm actually also really interested in this...I had a chat about this on the distributed systems slack's http://dist-sys.slack.com Kafka channel a few days ago, but we're not much further than griping about the problem. We're basically migrating an existing event system, one which packed messages into files, waited for a time-or-space threshold to be crossed, then dealt with distribution in terms of files. Basically, we'd like to keep a lot of those semantics: we can acknowledge success on the app server as soon as we've flushed to disk, and rely on the filesystem for durability, and total order across the system doesn't matter, as the HTTP PUTs sending the messages are load balanced across many app servers. We also can tolerate [very] long downstream event system outages, because...we're ultimately just writing sequentially to disk, per process (I should mention that this part is in Rails, which means we're dealing largely in terms of processes, not threads). 
RocksDB was mentioned in the discussion, but after spending exactly 5 minutes researching that solution, it seems like the dead-simplest setup on an app server in terms of moving parts (multiple processes writing, one process reading/forwarding to Kafka) wouldn't work well with RocksDB. Although now that I'm looking at it more, it looks like they're working on a MySQL storage engine? Anyway, yeah, I'd love some discussion on this, or war stories of migration to Kafka from other event systems (F/OSS or... bespoke).

On Wed, Aug 26, 2015 at 3:45 PM, Hemanth Abbina heman...@eiqnetworks.com wrote:

Hi, Our application receives events through an HAProxy server on HTTPS, which should be forwarded and stored to a Kafka cluster. What would be the best option for this? This layer should receive events from HAProxy and produce them to the Kafka cluster, in a reliable and efficient way (and it should scale horizontally). Please suggest.

--regards
Hemanth
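[Editor's note] For context on the binary mode Ewen describes: the REST proxy's binary embedded format just wraps base64-encoded payloads in a small JSON envelope, which is why it needs no Schema Registry. A minimal sketch of building such a request body in Python -- the topic name is hypothetical, and the content type shown matches the v1 REST proxy API of this era:

```python
import base64
import json

def build_binary_produce_body(messages):
    """Wrap raw message bytes in the JSON envelope the REST proxy's
    binary embedded format expects: each value is base64-encoded."""
    return json.dumps({
        "records": [
            {"value": base64.b64encode(m).decode("ascii")} for m in messages
        ]
    })

# This body would be POSTed to /topics/<topic> on the REST proxy with
# Content-Type: application/vnd.kafka.binary.v1+json
body = build_binary_produce_body([b'{"level":"info"}', b'{"level":"warn"}'])

# Round-trip check: the proxy decodes each value back to the raw bytes.
print(base64.b64decode(json.loads(body)["records"][0]["value"]))
# prints b'{"level":"info"}'
```

Note the per-request overhead Ewen mentions: each HTTP POST can carry many records in `records`, so batching on the client side amortizes the JSON/base64 cost.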
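[Editor's note] The ack-after-fsync pattern Marc describes -- app servers append each message to a local file, acknowledge the HTTP request once the write is durable, and a separate reader forwards to Kafka later -- can be sketched with just the standard library. The class and file layout here are illustrative, not from the thread; the `forward` callback stands in for a real Kafka producer's send:

```python
import os
import tempfile

class DurableSpool:
    """Append-only local spool: one record per line, fsync'd before the
    caller acknowledges. A separate reader drains it toward Kafka later."""

    def __init__(self, path):
        self.path = path
        self._f = open(path, "ab")

    def append(self, record: bytes):
        self._f.write(record + b"\n")
        self._f.flush()
        os.fsync(self._f.fileno())  # durable on disk before we ack the HTTP request

    def drain(self, forward):
        """Replay every spooled record through `forward` (e.g. a Kafka
        producer's send). Safe to rerun after a downstream outage, since
        the spool is the source of truth until forwarding succeeds."""
        self._f.flush()
        with open(self.path, "rb") as f:
            for line in f:
                forward(line.rstrip(b"\n"))

# Usage: ack as soon as append() returns; drain() when Kafka is reachable.
spool = DurableSpool(os.path.join(tempfile.mkdtemp(), "events.spool"))
spool.append(b'{"event":"signup"}')
sent = []
spool.drain(sent.append)
```

This keeps the "filesystem for durability" semantics and tolerates long downstream outages, at the cost of the per-process ordering Marc notes (no total order across app servers), and a production version would also need spool rotation and offset tracking so drained records are not re-sent.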