Re: Kafka security
Don't hard code it. Martin's suggestion allows it to be read from a configuration file or injected from another source such as an environment variable at runtime. If neither of these is acceptable under corporate policy I suggest asking how it has been handled before at your company. Christian On Apr 11, 2017 11:10, "IT Consultant" <0binarybudd...@gmail.com> wrote: Thanks for your response . We aren't allowed to hard code passwords in any of our programs On Apr 11, 2017 23:39, "Mar Ian" wrote: > Since it is a java property you could set the property (keystore password) > programmatically, > > before you connect to kafka (ie, before creating a consumer or producer) > > System.setProperty("zookeeper.ssl.keyStore.password", password); > > martin > > > From: IT Consultant <0binarybudd...@gmail.com> > Sent: April 11, 2017 2:01 PM > To: users@kafka.apache.org > Subject: Kafka security > > Hi All > > How can I avoid using a password for keystore creation ? > > Our corporate policies don't allow us to hardcode passwords. We are > currently passing the keystore password while accessing a TLS enabled Kafka > instance . > > I would like to use either a passwordless keystore or avoid a password for the > client accessing Kafka . > > > Please help >
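A minimal sketch of Martin's suggestion, reading the password from an environment variable at runtime (the variable name KAFKA_KEYSTORE_PASSWORD is just an example):

    // Read the keystore password from the environment instead of hard coding it.
    String password = System.getenv("KAFKA_KEYSTORE_PASSWORD");
    if (password == null) {
        throw new IllegalStateException("KAFKA_KEYSTORE_PASSWORD is not set");
    }
    // Must run before creating the consumer or producer.
    System.setProperty("zookeeper.ssl.keyStore.password", password);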
Re: Encryption at Rest
"We need to be capable of changing encryption keys on regular intervals and in case of expected key compromise." is achievable with full disk encryption particularly if you are willing to add and remove Kafka servers so that you replicate the data to new machines/disks with new keys and take the machines with old keys out of use and wipe them. For the second part of it I would suggest reevaluating your threat model since you are looking at a machine that is compromised but not compromised enough to be able to read the key from Kafka or to use Kafka to read the data. While you could add support to encrypt data on the way in and out of compression I believe you would need either substantial work in Kafka to support rewriting/reencrypting the logfiles (with performance penalties) or rotate machines in and out as with full disk encryption. Though I'll let someone with more knowledge of the implementation comment further on what would be required. Christian On Mon, May 2, 2016 at 9:41 PM, Bruno Rassaertswrote: > We did try indeed the last scenario you describe as encrypted disks do not > fulfil our requirements. > We need to be capable of changing encryption keys on regular intervals and in > case of expected key compromise. > Also, when a running machine is hacked, disk based or file system based > encryption doesn’t offer any protection. > > Our goal is indeed to have the content in the broker files encrypted. The > problem is that only way to achieve this is through custom serialisers. > This works, but the overhead is quite dramatic as the messages are no longer > efficiently compressed (in batch). > Compression in the serialiser, before the encryption, doesn’t really solve > the performance problem. > > The best thing for us would be able to encrypt after the batch compression > offered by kafka. > The hook to do this is missing in the current implementation. > > Bruno > >> On 02 May 2016, at 22:46, Tom Brown wrote: >> >> I'm trying to understand your use-case for encrypted data. >> >> Does it need to be encrypted only over the wire? This can be accomplished >> using TLS encryption (v0.9.0.0+). See >> https://issues.apache.org/jira/browse/KAFKA-1690 >> >> Does it need to be encrypted only when at rest? This can be accomplished >> using full disk encryption as others have mentioned. >> >> Does it need to be encrypted during both? Use both TLS and full disk >> encryption. >> >> Does it need to be encrypted fully from end-to-end so even Kafka can't read >> it? Since Kafka shouldn't be able to know the contents, the key should not >> be known to Kafka. What remains is manually encrypting each message before >> giving it to the producer (or by implementing an encrypting serializer). >> Either way, each message is still encrypted individually. >> >> Have I left out a scenario? >> >> --Tom >> >> >> On Mon, May 2, 2016 at 2:01 PM, Bruno Rassaerts >> wrote: >> >>> Hello, >>> >>> We tried encrypting the data before sending it to kafka, however this >>> makes the compression done by kafka almost impossible. >>> Also the performance overhead of encrypting the individual messages was >>> quite significant. >>> >>> Ideally, a pluggable “compression” algorithm would be best. Where message >>> can first be compressed, then encrypted in batch. >>> However, the current kafka implementation does not allow this. >>> >>> Bruno >>> On 26 Apr 2016, at 19:02, Jim Hoagland >>> wrote: Another option is to encrypt the data before you hand it to Kafka and >>> have the downstream decrypt it. 
This takes care of on-disk and on-wire encryption. We did a proof of concept of this: >>> http://www.symantec.com/connect/blogs/end-end-encryption-though-kafka-our-proof-concept ( http://symc.ly/1pC2CEG ) -- Jim On 4/25/16, 11:39 AM, "David Buschman" wrote: > Kafka handles messages which are composed of an array of bytes. Kafka >>> does > not care what is in those byte arrays. > > You could use a custom Serializer and Deserializer to encrypt and >>> decrypt > the data from within your application(s) easily enough. > > This gives the benefit of having encryption at rest and over the wire. >>> Two > birds, one stone. > > DaVe. > > >> On Apr 25, 2016, at 2:14 AM, Jens Rantil wrote: >> >> IMHO, I think that responsibility should lie on the file system, not >> Kafka. >> Feels like a waste of time and double work to implement that unless >> there's >> a really good reason for it. Let's try to keep Kafka a focused product >> that >> does one thing well. >> >> Cheers, >> Jens >> >> On Fri, Apr 22, 2016 at 3:31 AM Tauzell, Dave >> >> wrote: >> >>> I meant encryption
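For reference, the serializer-level approach discussed in this thread might be shaped like the sketch below, assuming Kafka's org.apache.kafka.common.serialization.Serializer interface; MessageCipher is a hypothetical stand-in for real key management, and the per-record encryption it performs is exactly the overhead Bruno describes, since the broker then sees high-entropy bytes that no longer compress well in batch:

    import org.apache.kafka.common.serialization.Serializer;
    import java.util.Map;

    // Wraps an existing serializer and encrypts each record individually.
    public class EncryptingSerializer<T> implements Serializer<T> {
        private final Serializer<T> inner;
        private final MessageCipher cipher; // hypothetical AES wrapper from your key management

        public EncryptingSerializer(Serializer<T> inner, MessageCipher cipher) {
            this.inner = inner;
            this.cipher = cipher;
        }

        @Override
        public void configure(Map<String, ?> configs, boolean isKey) {
            inner.configure(configs, isKey);
        }

        @Override
        public byte[] serialize(String topic, T data) {
            // Each record is encrypted on its own, defeating batch compression.
            return cipher.encrypt(inner.serialize(topic, data));
        }

        @Override
        public void close() { inner.close(); }

        // Hypothetical cipher abstraction.
        public interface MessageCipher { byte[] encrypt(byte[] plaintext); }
    }

Because it takes constructor arguments, an instance like this would be passed directly to the KafkaProducer constructor rather than configured by class name.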
Re: Encryption at Rest
From what I know of previous discussions, encryption at rest can be handled with transparent disk encryption. When that's sufficient it's nice and easy. Christian On Thu, Apr 21, 2016 at 2:31 PM, Tauzell, Dave wrote: > Has there been any discussion or work on at rest encryption for Kafka? > > Thanks, > Dave
Re: Kafka over Satellite links
I would not do that. I admit I may be a bit biased due to working for Buddy Platform (IoT backend stuff including telemetry collection), but you want to send the data via some protocol (HTTP? MQTT? CoAP?) to the central hub and then have those servers put the data into Kafka. If you want to use Kafka there are various HTTP front ends that will basically put the data into Kafka for you without the client needing to deal with the partition management part. But putting data into Kafka directly over the satellite link really seems like a bad idea, even if it's a large number of messages per second per node and even if the security parts work out for you. Christian On Wed, Mar 2, 2016 at 9:52 PM, Jan wrote: > Hi folks; > does anyone know of Kafka's ability to work over Satellite links. We have an > IoT telemetry application that uses Satellite communication to send data from > remote sites to a Central hub. > Any help/ input/ links/ gotchas would be much appreciated. > Regards, Jan
Re: Kafka behind AWS ELB
Dillian, On Mon, May 4, 2015 at 1:52 PM, Dillian Murphey crackshotm...@gmail.com wrote: I'm interested in this topic as well. If you put Kafka brokers inside an autoscaling group, then AWS will automatically add brokers if demand increases, and the ELB will automatically round-robin across all of your Kafka instances. So in your config files and code, you only need to provide a single DNS name (the load balancer). You don't need to specify all your Kafka brokers inside your config file. If a broker dies, the ELB will only route to healthy nodes. So you get a lot of robustness, scalability, and fault-tolerance by using the AWS services. Kafka brokers will automatically load balance, but the question is whether it is ok to put all your brokers behind an ELB and expect the system to work properly. You should not expect it to work properly. Broker nodes are data bearing, which means that any operation to scale down would need to be aware of the data distribution. The clients connect to specific nodes to send them data, so even layer 4 load balancing wouldn't work. What alternatives are there to dynamic/scalable broker clusters? I don't want to have to modify my config files or code if I add more brokers, and I want to be able to handle a broker going down. So these are the reasons AWS questions like this come up. The clients already give you options for specifying only a subset of brokers https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+Producer+Example one of which must be alive to discover the rest of the cluster; the example after this thread sketches the pattern. The main clients handle node failures (you'll still have some operational work). Kafka and other data storage systems do not work the same as an HTTP driven web application. While it can certainly be scaled, and automation could be done to do so in response to load, it's going to be more complicated. AWS's off-the-shelf, low-operations offering for some (definitely not all) of Kafka's use cases is Kinesis; Azure's is Event Hubs. Before using Kafka or any system in production you'll want to be sure you understand the operational aspects of it. Christian Thanks for any comments too. :) On Mon, May 4, 2015 at 9:03 AM, Mayuresh Gharat gharatmayures...@gmail.com wrote: Ok. You can deploy Kafka in AWS. You can have brokers on AWS servers. Kafka is not a push system. So you will need someone writing to Kafka and consuming from Kafka. It will work. My suggestion would be to try it out on a smaller instance in AWS and see the effects. As I do not know the actual use case about why you want to use Kafka, I cannot comment on whether it will work for your personalized use case. Thanks, Mayuresh On Mon, May 4, 2015 at 8:55 AM, Chandrashekhar Kotekar shekhar.kote...@gmail.com wrote: I am sorry but I cannot reveal those details due to confidentiality issues. I hope you understand. Regards, Chandrash3khar Kotekar Mobile - +91 8600011455 On Mon, May 4, 2015 at 9:18 PM, Mayuresh Gharat gharatmayures...@gmail.com wrote: Hi Chandrashekar, Can you please elaborate the use case for Kafka here, like how you are planning to use it. Thanks, Mayuresh On Sat, May 2, 2015 at 9:08 PM, Chandrashekhar Kotekar shekhar.kote...@gmail.com wrote: Hi, I am new to Apache Kafka. I have played with it on my laptop. I want to use Kafka in AWS. Currently we have a tomcat web server based REST API. We want to replace the REST API with Apache Kafka; the web servers are behind an ELB. I would like to know if we can keep Kafka brokers behind an ELB? Will it work?
Regards, Chandrash3khar Kotekar Mobile - +91 8600011455 -- -Regards, Mayuresh R. Gharat (862) 250-7125
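A condensed sketch of that bootstrap pattern, following the linked 0.8 producer example; the broker host names and topic are placeholders:

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class BootstrapExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Two or three brokers are enough to bootstrap; the client uses
            // whichever is alive to discover the rest of the cluster.
            props.put("metadata.broker.list", "broker1:9092,broker2:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");

            Producer<String, String> producer =
                    new Producer<String, String>(new ProducerConfig(props));
            producer.send(new KeyedMessage<String, String>("mytopic", "hello"));
            producer.close();
        }
    }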
Re: Kafka Poll: Version You Use?
Do you have anything on the number of voters, or an audience breakdown? Christian On Wed, Mar 4, 2015 at 8:08 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hello hello, Results of the poll are here! Any guesses before looking? What % of Kafka users are on 0.8.2.x already? What % of people are still on 0.7.x? http://blog.sematext.com/2015/03/04/poll-results-kafka-version-distribution/ Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On Thu, Feb 26, 2015 at 3:32 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, With 0.8.2 out I thought it might be useful for everyone to see which version(s) of Kafka people are using. Here's a quick poll: http://blog.sematext.com/2015/02/23/kafka-poll-version-you-use/ We'll publish the results next week. Thanks, Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/
Re: kafka producer does not distribute messages to partitions evenly?
I believe you are seeing the behavior where the random partitioner is sticky. http://mail-archives.apache.org/mod_mbox/kafka-users/201309.mbox/%3ccahwhrrxax5ynimqnacsk7jcggnhjc340y4qbqoqcismm43u...@mail.gmail.com%3E has details. So with the default 10 minute refresh, if your test is only an hour or two with a single producer you would not expect to see all partitions be hit. Christian On Mon, Mar 2, 2015 at 4:23 PM, Yang tedd...@gmail.com wrote: thanks. just checked the code below. The line that calls Random.nextInt() seems to be called only *a few times*, and in all the rest of the cases where getPartition() is called, the cached sendPartitionPerTopicCache.get(topic) is used, so apparently you won't get an even partition distribution? The code I got is from commit 7847e9c703f3a0b70519666cdb8a6e4c8e37c3a7, ./core/src/main/scala/kafka/producer/async/DefaultEventHandler.scala:

    private def getPartition(topic: String, key: Any, topicPartitionList: Seq[PartitionAndLeader]): Int = {
      val numPartitions = topicPartitionList.size
      if(numPartitions <= 0)
        throw new UnknownTopicOrPartitionException("Topic " + topic + " doesn't exist")
      val partition =
        if(key == null) {
          // If the key is null, we don't really need a partitioner
          // So we look up in the send partition cache for the topic to decide the target partition
          val id = sendPartitionPerTopicCache.get(topic)
          id match {
            case Some(partitionId) =>
              // directly return the partitionId without checking availability of the leader,
              // since we want to postpone the failure until the send operation anyways
              partitionId
            case None =>
              val availablePartitions = topicPartitionList.filter(_.leaderBrokerIdOpt.isDefined)
              if (availablePartitions.isEmpty)
                throw new LeaderNotAvailableException("No leader for any partition in topic " + topic)
              val index = Utils.abs(Random.nextInt) % availablePartitions.size
              val partitionId = availablePartitions(index).partitionId
              sendPartitionPerTopicCache.put(topic, partitionId)
              partitionId
          }
        } else
          partitioner.partition(key, numPartitions)
      if(partition < 0 || partition >= numPartitions)
        throw new UnknownTopicOrPartitionException("Invalid partition id: " + partition + " for topic " + topic +
          "; Valid values are in the inclusive range of [0, " + (numPartitions - 1) + "]")
      trace("Assigning message of topic %s and key %s to a selected partition %d".format(topic, if (key == null) "[none]" else key.toString, partition))
      partition
    }

On Mon, Mar 2, 2015 at 3:58 PM, Mayuresh Gharat gharatmayures...@gmail.com wrote: Probably your keys are getting hashed to only those partitions. I don't think anything is wrong here. You can check how the default hashPartitioner is used in the code and try to do the same for your keys before you send them, and check which partitions those are going to. The default hashPartitioner does something like this: hash(key) % numPartitions. Thanks, Mayuresh On Mon, Mar 2, 2015 at 3:52 PM, Yang tedd...@gmail.com wrote: we have 10 partitions for a topic, and omit the explicit partition param in the message creation: KeyedMessage<String, String> data = new KeyedMessage<String, String>("mytopic", myMessageContent); // partition key needs to be polished producer.send(data); but on average 3 to 5 of the partitions are empty. what went wrong? thanks Yang -- -Regards, Mayuresh R. Gharat (862) 250-7125
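Since the cached path above is only taken when the key is null, one workaround is to send keyed messages together with a partitioner that ignores the key and picks a random partition per message. A minimal sketch against the 0.8 kafka.producer.Partitioner interface (class and package names are placeholders):

    import kafka.producer.Partitioner;
    import kafka.utils.VerifiableProperties;
    import java.util.Random;

    // Picks a fresh random partition for every message instead of letting
    // the producer cache one partition per metadata refresh interval.
    public class RandomPartitioner implements Partitioner {
        private final Random random = new Random();

        // The 0.8 producer constructs partitioners with a VerifiableProperties argument.
        public RandomPartitioner(VerifiableProperties props) {}

        @Override
        public int partition(Object key, int numPartitions) {
            return random.nextInt(numPartitions);
        }
    }

It would be enabled with props.put("partitioner.class", "com.example.RandomPartitioner"); note the sticky behavior exists to cut down connection churn, so expect more open sockets with this approach.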
Re: Tips for working with Kafka and data streams
I wouldn't say no to some discussion of encryption. We're running on Azure EventHubs (with preparations for Kinesis for EC2, and Kafka for deployments in customer datacenters when needed) so we can't just use disk level encryption (which would have its own overhead). We're putting all of our messages inside of encrypted envelopes before sending them to the stream, which limits our opportunities for schema verification of the underlying messages to the declared type of the message. Encryption at rest mostly works out to a selling point for customers who want assurances, and in a Kafka-focused discussion might be dealt with by covering disk encryption and how the conversations between Kafka instances are protected. Christian On Wed, Feb 25, 2015 at 11:51 AM, Jay Kreps j...@confluent.io wrote: Hey guys, One thing we tried to do along with the product release was start to put together a practical guide for using Kafka. I wrote this up here: http://blog.confluent.io/2015/02/25/stream-data-platform-1/ I'd like to keep expanding on this as good practices emerge and we learn more stuff. So two questions: 1. Anything you think other people should know about working with data streams? What did you wish you knew when you got started? 2. Anything you don't know about but would like to hear more about? -Jay
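As context for the envelope approach described above, a stripped-down sketch; the envelope layout, key-id scheme, and AES-GCM choice here are illustrative assumptions, not a description of any particular production system:

    import javax.crypto.Cipher;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.GCMParameterSpec;
    import java.nio.ByteBuffer;
    import java.security.SecureRandom;

    // Envelope layout: [4-byte key id][12-byte IV][ciphertext]. The key id
    // lets consumers look up the right key before decrypting; the stream
    // itself only ever sees the opaque envelope bytes.
    public class MessageEnvelope {
        private static final SecureRandom RANDOM = new SecureRandom();

        public static byte[] seal(int keyId, SecretKey key, byte[] plaintext) throws Exception {
            byte[] iv = new byte[12];
            RANDOM.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
            byte[] ciphertext = cipher.doFinal(plaintext);
            return ByteBuffer.allocate(4 + iv.length + ciphertext.length)
                    .putInt(keyId).put(iv).put(ciphertext).array();
        }
    }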
Re: Tips for working with Kafka and data streams
Yeah, we do have scenarios where we use customer-specific keys, so our envelopes end up containing key identification information for accessing our key repository. I'll certainly follow any changes you propose in this area with interest, but I'd expect that sort of centralized key thing to be fairly separate from Kafka even if there's a handy optional layer that integrates with it. Christian On Wed, Feb 25, 2015 at 5:34 PM, Julio Castillo jcasti...@financialengines.com wrote: Although full disk encryption appears to be an easy solution, in our case that may not be sufficient. For cases where the actual payload needs to be encrypted, the cost of encryption is paid by the consumers and producers. Further complicating the matter would be the handling of encryption keys, etc. I think this is the area where enhancements to Kafka may facilitate that key exchange between consumers and producers, still leaving it up to the clients, but facilitating the key handling. Julio On 2/25/15, 4:24 PM, Christian Csar christ...@csar.us wrote: The questions we get from customers typically end up being general, so we break out our answer into network level and on disk scenarios. The on disk/at rest scenario may just be "use full disk encryption at the OS level" and Kafka doesn't need to worry about it. But documenting any issues around it would be good, for example what sort of Kafka specific performance impacts it has, i.e. budgeting for better processors. The security story right now is to run on a private network, but I believe some of our customers like to be told that within-datacenter transmissions are encrypted on the wire. Based on https://cwiki.apache.org/confluence/display/KAFKA/Security that might mean waiting for TLS support, or using a VPN/ssh tunnel for the network connections. Since we're in hosted stream land we can't do either of the above, so we encrypt the messages themselves. For those enterprises that are like our customers but would run Kafka or use Confluent, having a story like the above so they don't give up the benefits of your schema management layers would be good. Since I didn't mention it before, I did find your blog posts handy (though I'm already moving us towards stream centric land). Christian On Wed, Feb 25, 2015 at 3:57 PM, Jay Kreps jay.kr...@gmail.com wrote: Hey Christian, That makes sense. I agree that would be a good area to dive into. Are you primarily interested in network level security or encryption on disk? -Jay On Wed, Feb 25, 2015 at 1:38 PM, Christian Csar christ...@csar.us wrote: I wouldn't say no to some discussion of encryption. We're running on Azure EventHubs (with preparations for Kinesis for EC2, and Kafka for deployments in customer datacenters when needed) so we can't just use disk level encryption (which would have its own overhead). We're putting all of our messages inside of encrypted envelopes before sending them to the stream, which limits our opportunities for schema verification of the underlying messages to the declared type of the message. Encryption at rest mostly works out to a selling point for customers who want assurances, and in a Kafka-focused discussion might be dealt with by covering disk encryption and how the conversations between Kafka instances are protected.
Christian On Wed, Feb 25, 2015 at 11:51 AM, Jay Kreps j...@confluent.io wrote: Hey guys, One thing we tried to do along with the product release was start to put together a practical guide for using Kafka. I wrote this up here: http://blog.confluent.io/2015/02/25/stream-data-platform-1/ I'd like to keep expanding on this as good practices emerge and we learn more stuff. So two questions: 1. Anything you think other people should know about working with data streams? What did you wish you knew when you got started? 2. Anything you don't know about but would like to hear more about? -Jay
Re: Tips for working with Kafka and data streams
The questions we get from customers typically end up being general, so we break out our answer into network level and on disk scenarios. The on disk/at rest scenario may just be "use full disk encryption at the OS level" and Kafka doesn't need to worry about it. But documenting any issues around it would be good, for example what sort of Kafka specific performance impacts it has, i.e. budgeting for better processors. The security story right now is to run on a private network, but I believe some of our customers like to be told that within-datacenter transmissions are encrypted on the wire. Based on https://cwiki.apache.org/confluence/display/KAFKA/Security that might mean waiting for TLS support, or using a VPN/ssh tunnel for the network connections. Since we're in hosted stream land we can't do either of the above, so we encrypt the messages themselves. For those enterprises that are like our customers but would run Kafka or use Confluent, having a story like the above so they don't give up the benefits of your schema management layers would be good. Since I didn't mention it before, I did find your blog posts handy (though I'm already moving us towards stream centric land). Christian On Wed, Feb 25, 2015 at 3:57 PM, Jay Kreps jay.kr...@gmail.com wrote: Hey Christian, That makes sense. I agree that would be a good area to dive into. Are you primarily interested in network level security or encryption on disk? -Jay On Wed, Feb 25, 2015 at 1:38 PM, Christian Csar christ...@csar.us wrote: I wouldn't say no to some discussion of encryption. We're running on Azure EventHubs (with preparations for Kinesis for EC2, and Kafka for deployments in customer datacenters when needed) so we can't just use disk level encryption (which would have its own overhead). We're putting all of our messages inside of encrypted envelopes before sending them to the stream, which limits our opportunities for schema verification of the underlying messages to the declared type of the message. Encryption at rest mostly works out to a selling point for customers who want assurances, and in a Kafka-focused discussion might be dealt with by covering disk encryption and how the conversations between Kafka instances are protected. Christian On Wed, Feb 25, 2015 at 11:51 AM, Jay Kreps j...@confluent.io wrote: Hey guys, One thing we tried to do along with the product release was start to put together a practical guide for using Kafka. I wrote this up here: http://blog.confluent.io/2015/02/25/stream-data-platform-1/ I'd like to keep expanding on this as good practices emerge and we learn more stuff. So two questions: 1. Anything you think other people should know about working with data streams? What did you wish you knew when you got started? 2. Anything you don't know about but would like to hear more about? -Jay
Re: Proper Relationship Between Partition and Threads
Ricardo, The parallelism of each logical consumer (consumer group) is bounded by the number of partitions. So with four partitions it could make sense to have one logical consumer (application) run two processes on different machines with two threads each, or one process with four threads. With two logical consumers (two different applications) you would want each application to have 4 threads (4*2 = 8 threads total). There are also considerations depending on which consumer code you are using (though I'm decidedly not someone with good information on that). Christian On Wed, Jan 28, 2015 at 1:28 PM, Ricardo Ferreira jricardoferre...@gmail.com wrote: Hi experts, I'm a newbie in the Kafka world, so excuse me for such a basic question. I'm in the process of designing a client for Kafka, and after a few hours of study, I was told that to achieve a proper level of parallelism, it is a best practice to have one thread for each partition of a topic. My question is whether this rule-of-thumb also applies for multiple consumer applications. For instance: Considering a topic with 4 partitions, it is OK to have one consumer application with 4 threads, just like it would be OK to have two consumer applications with 2 threads each. But what about having two consumer applications with 4 threads each? Would it break any load-balancing made by the Kafka brokers? Anyway, I'd like to understand whether the proper number of threads that should match the number of partitions is per application or if there is some other best practice. Thanks in advance, Ricardo Ferreira
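As a rough illustration with the 0.8 high-level consumer (the group id, ZooKeeper address, and topic name below are placeholders), one process asking for four streams and running one thread per stream might look like:

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;
    import kafka.message.MessageAndMetadata;

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class FourThreadConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "zk1:2181");
            props.put("group.id", "my-application"); // one group id per logical consumer
            ConsumerConnector connector =
                    Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

            // Ask for four streams: with a four-partition topic each stream
            // maps to one partition, so four threads is full parallelism.
            Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
            topicCountMap.put("mytopic", 4);
            List<KafkaStream<byte[], byte[]>> streams =
                    connector.createMessageStreams(topicCountMap).get("mytopic");

            ExecutorService executor = Executors.newFixedThreadPool(streams.size());
            for (final KafkaStream<byte[], byte[]> stream : streams) {
                executor.submit(new Runnable() {
                    public void run() {
                        ConsumerIterator<byte[], byte[]> it = stream.iterator();
                        while (it.hasNext()) {
                            MessageAndMetadata<byte[], byte[]> msg = it.next();
                            // process msg.message() here
                        }
                    }
                });
            }
        }
    }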
Re: How to mark a message as needing to retry in Kafka?
noodles, Without an external mechanism you won't be able to mark individual messages/offsets as needing to be retried at a later time. Guozhang is describing a way to get the offset of a message that's been received so that you can find it later. You would need to save that into a 'failed messages' store somewhere else and have code that looks in there to make retries happen (assuming you want the failure/retry to persist beyond the lifetime of the process). Christian On Wed, Jan 28, 2015 at 7:00 PM, Guozhang Wang wangg...@gmail.com wrote: I see. If you are using the high-level consumer, once the message is returned to the application it is considered consumed, and currently it is not supported to rewind to a previously consumed message. With the new consumer coming in the 0.8.3 release, we have an api for you to get the offset of each message and do the rewinding based on offsets. For example, you can do something like

    message = ... // get one message from the consumer
    try {
      // process message
    } catch {
      consumer.seek(message.offset)
    }

Guozhang On Wed, Jan 28, 2015 at 6:26 PM, noodles rungumpth...@gmail.com wrote: I did not describe my problem clearly. In my case, I got the message from Kafka, but I could not handle this message because of some reason, for example the external server is down. So I want to mark the message as not being consumed directly. 2015-01-28 23:26 GMT+08:00 Guozhang Wang wangg...@gmail.com: Hi, Which consumer are you using? If you are using a high level consumer then retry would be automatic upon network exceptions. Guozhang On Wed, Jan 28, 2015 at 1:32 AM, noodles rungumpth...@gmail.com wrote: Hi group: I'm working on building a webhook notification service based on Kafka. I produce all of the payloads into Kafka, and consumers consume these payloads by offset. Sometimes some payloads cannot be consumed because of a network exception or an http server exception. So I want to mark the failed payloads and retry them with timers. But I have no idea how to do this without using a storage system (like MySQL) besides kafka and zookeeper. -- *noodles!* -- -- Guozhang -- *Yeah, I'm noodles!* -- -- Guozhang
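A minimal sketch of that external retry mechanism, assuming a dedicated retry topic (the topic name "webhook-retries" and the 0.8.2-style producer are illustrative choices, and the timer-driven consumer that replays the topic is omitted):

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // On a processing failure, park the payload in a retry topic rather than
    // blocking the main consumer; a separate consumer replays it later.
    public class FailedMessageStore {
        private final KafkaProducer<byte[], byte[]> producer;

        public FailedMessageStore(KafkaProducer<byte[], byte[]> producer) {
            this.producer = producer;
        }

        public void markForRetry(byte[] key, byte[] payload) {
            producer.send(new ProducerRecord<byte[], byte[]>("webhook-retries", key, payload));
        }
    }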
Re: Is kafka suitable for our architecture?
in a production environment I can't speak to that. I don't think there is any reason you *need* to use Storm rather than just Kafka to achieve your needs though. Christian On Thu, Oct 9, 2014 at 11:57 PM, Albert Vila albert.v...@augure.com wrote: Christian, thanks. Hi We process data in real time, and we are taking a look at Storm and Spark streaming too, however our actions are atomic, done at a document level, so I don't know if it fits on something like Storm/Spark. Regarding what you said, Christian, isn't Kafka used for scenarios like the one I described? I mean, we do have work queues right now with Gearman, but with a bunch of workers on each step. I thought we could change that to a producer and a bunch of consumers (where each message should reach one and exactly one consumer). And what I said about the data locality, it was only an optimization we did some time ago because we were moving more data back then. Maybe now it's not necessary and we could move messages around the system using Kafka, which would allow us to simplify the architecture a little bit. I've seen people saying they move TB of data every day using Kafka. Just to be clear on the size of each document/message, we are talking about tweets, blog posts, ... (in 90% of cases the size is less than 50Kb) Regards On 9 October 2014 20:02, Christian Csar cac...@gmail.com wrote: Apart from your data locality problem it sounds like what you want is a workqueue. Kafka's consumer structure doesn't lend itself too well to that use case, as a single partition of a topic should only have one consumer instance per logical subscriber of the topic, and that consumer would not be able to mark jobs as completed except in a strict order (while maintaining a processed-successfully-at-least-once guarantee). This is not to say it cannot be done, but I believe your workqueue would end up working a bit strangely if built with Kafka. Christian On 10/09/2014 06:13 AM, William Briggs wrote: Manually managing data locality will become difficult to scale. Kafka is one potential tool you can use to help scale, but by itself, it will not solve your problem. If you need the data in near-real time, you could use a technology like Spark or Storm to stream data from Kafka and perform your processing. If you can batch the data, you might be better off pulling it into a distributed filesystem like HDFS, and using MapReduce, Spark or another scalable processing framework to handle your transformations. Once you've paid the initial price for moving the document into HDFS, your network traffic should be fairly manageable; most clusters built for this purpose will schedule work to be run local to the data, and typically have separate, high-speed network interfaces and a dedicated switch in order to optimize intra-cluster communications when moving data is unavoidable. -Will On Thu, Oct 9, 2014 at 7:57 AM, Albert Vila albert.v...@augure.com wrote: Hi I just came across Kafka when I was trying to find solutions to scale our current architecture. We are currently downloading and processing 6M documents per day from online and social media. We have a different workflow for each type of document, but some of the steps are keyword extraction, language detection, clustering, classification, indexation, ... We are using Gearman to dispatch the jobs to workers and we have some queues on a database. I'm wondering if we could integrate Kafka into the current workflow and if it's feasible.
One of our main discussions is whether we have to go to a fully distributed architecture or to a semi-distributed one. I mean, distribute everything or process some steps on the same machine (crawling, keyword extraction, language detection, indexation). We don't know which one scales better; each one has pros and cons. Now we have a semi-distributed one, as we had network problems given the amount of data we were moving around. So now, all documents crawled on server X are later dispatched through Gearman to the same server. What we dispatch on Gearman is only the document id, and the document data remains on the crawling server in a Memcached, so the network traffic is kept to a minimum. What do you think? Is it feasible to remove all database queues and Gearman and move to Kafka? As Kafka is mainly based on messages I think we should move the messages around, but should we take the network into account? Will we face the same problems? If so, is there a way to isolate some steps to be processed on the same machine, to avoid network traffic? Any help or comment will be appreciated. And if someone has had a similar problem and has knowledge about the architecture approach, any advice will be more than welcome. Thanks -- *Albert Vila* R&D Manager Software Developer Tél. : +34 972 982 968 *www.augure.com* http://www.augure.com/ | *Blog.* Reputation in action
Re: Is kafka suitable for our architecture?
Apart from your data locality problem it sounds like what you want is a workqueue. Kafka's consumer structure doesn't lend itself too well to that use case, as a single partition of a topic should only have one consumer instance per logical subscriber of the topic, and that consumer would not be able to mark jobs as completed except in a strict order (while maintaining a processed-successfully-at-least-once guarantee). This is not to say it cannot be done, but I believe your workqueue would end up working a bit strangely if built with Kafka. Christian On 10/09/2014 06:13 AM, William Briggs wrote: Manually managing data locality will become difficult to scale. Kafka is one potential tool you can use to help scale, but by itself, it will not solve your problem. If you need the data in near-real time, you could use a technology like Spark or Storm to stream data from Kafka and perform your processing. If you can batch the data, you might be better off pulling it into a distributed filesystem like HDFS, and using MapReduce, Spark or another scalable processing framework to handle your transformations. Once you've paid the initial price for moving the document into HDFS, your network traffic should be fairly manageable; most clusters built for this purpose will schedule work to be run local to the data, and typically have separate, high-speed network interfaces and a dedicated switch in order to optimize intra-cluster communications when moving data is unavoidable. -Will On Thu, Oct 9, 2014 at 7:57 AM, Albert Vila albert.v...@augure.com wrote: Hi I just came across Kafka when I was trying to find solutions to scale our current architecture. We are currently downloading and processing 6M documents per day from online and social media. We have a different workflow for each type of document, but some of the steps are keyword extraction, language detection, clustering, classification, indexation, ... We are using Gearman to dispatch the jobs to workers and we have some queues on a database. I'm wondering if we could integrate Kafka into the current workflow and if it's feasible. One of our main discussions is whether we have to go to a fully distributed architecture or to a semi-distributed one. I mean, distribute everything or process some steps on the same machine (crawling, keyword extraction, language detection, indexation). We don't know which one scales better; each one has pros and cons. Now we have a semi-distributed one, as we had network problems given the amount of data we were moving around. So now, all documents crawled on server X are later dispatched through Gearman to the same server. What we dispatch on Gearman is only the document id, and the document data remains on the crawling server in a Memcached, so the network traffic is kept to a minimum. What do you think? Is it feasible to remove all database queues and Gearman and move to Kafka? As Kafka is mainly based on messages I think we should move the messages around, but should we take the network into account? Will we face the same problems? If so, is there a way to isolate some steps to be processed on the same machine, to avoid network traffic? Any help or comment will be appreciated. And if someone has had a similar problem and has knowledge about the architecture approach, any advice will be more than welcome. Thanks
Re: Use case
The thought experiment I did ended up having a set of front end servers corresponding to a given chunk of the user id space, each of which was a separate subscriber to the same set of partitions. Then you have one or more partitions corresponding to that same chunk of users. You want the chunk/set of partitions to be of a size where each of those front end servers can process all the messages in it and send out the chats/notifications/status change notifications/read receipts to those users who happen to be connected to the particular front end node. You would need to handle some deduplication on the consumers/FE servers and would need to decide where to produce. Producing from every front end server to potentially every broker could be expensive in terms of connections, and you might want to first relay the messages to the corresponding front end cluster, but since we don't use large numbers of producers it's hard for me to say. For persistence and offline delivery you can probably accept a delay in user receipt, so you can use another set of consumers that persist the messages to the longer latency datastore on the backend and then fetch the last 50 or so messages with a bit of lag when the user first looks at history (see HipChat and Hangouts lag). This gives you a smaller number of partitions and avoids the issue of having to keep too much history on the Kafka brokers. There are obviously a significant number of complexities to deal with. For example, if you are using default consumer code that commits offsets into ZooKeeper, it may be inadvisable at large scales, which you probably don't need to worry about reaching. And remember I did this only as a thought experiment, not a proper technical evaluation. I expect Kafka, used correctly, can make aspects of building such a chat system much, much easier (you can avoid writing your own message replication system) but it is definitely not plug and play using topics for users. Christian On 09/05/2014 09:46 AM, Jonathan Weeks wrote: +1 Topic deletion with 0.8.1.1 is extremely problematic, and coupled with the fact that rebalance/broker membership changes pay a cost per partition today, whereby excessive partitions extend downtime in the case of a failure, this means fewer topics (e.g. hundreds or thousands) is a best practice in the published version of Kafka. There are also secondary impacts of topic count, e.g. useful operational tools such as http://quantifind.com/KafkaOffsetMonitor/ start to become problematic in terms of UX with a massive number of topics. Once topic deletion is a supported feature, the use-case outlined might be more tenable. Best Regards, -Jonathan On Sep 5, 2014, at 4:20 AM, Sharninder sharnin...@gmail.com wrote: I'm not really sure about your exact use-case but I don't think having a topic per user is very efficient. Deleting topics in kafka, at the moment, isn't really straightforward. You should rethink your data pipeline a bit. Also, just because kafka has the ability to store messages for a certain time, don't think of it as a data store. Kafka is a streaming system; think of it as a fast queue that gives you the ability to move your pointer back. -- Sharninder On Fri, Sep 5, 2014 at 4:27 PM, Aris Alexis aris.alexis@gmail.com wrote: Thanks for the reply. If I use it only for activity streams like twitter: I would want a topic for each #tag and a topic for each user and maybe for each city. Would that be too many topics, or does it not matter since most of them will be deleted in a specified interval?
Best Regards, Aris Giachnis On Fri, Sep 5, 2014 at 6:57 AM, Sharninder sharnin...@gmail.com wrote: Since you want all chat and mail history persisted all the time, I personally wouldn't recommend kafka for your requirement. Kafka is more suitable as a streaming system where events expire after a certain time. Look at something more general purpose like HBase for persisting data indefinitely. So, for example, all activity streams can go into kafka, from where consumers will pick up messages to parse and put them into HBase or other clients. -- Sharninder On Fri, Sep 5, 2014 at 12:05 AM, Aris Alexis snowboard...@gmail.com wrote: Hello, I am building a big web application that I want to be massively scalable (I am using cassandra and titan as a general db). I want to implement the following: real time web chat that is persisted so that user A can in the future recall his chats with users B, C, and D, much like Facebook; mail-like messages in the web application (not sure about this as it is somewhat covered by the first one); user activity streams; users subscribing to topics, for example florida/musicevents. Could I use kafka for this? Can you recommend another technology maybe?
Re: Handling send failures with async producer
TL;DR: I use one Callback per job I send to Kafka and include that sort of information by reference in the Callback instance. Our system is currently moving data from beanstalkd to Kafka due to historical reasons, so we use the callback to either delete or release the beanstalkd message depending on success. The org.apache.kafka.clients.producer.Callback I give to the send method is an instance of a class that stores all the additional information I need to process the callback. Remember that the async callbacks run on the Kafka producer thread, so they must be fast to avoid constraining throughput. My callback ends up putting information about the call to beanstalk into another executor service for later processing. Christian On 08/26/2014 12:35 PM, Ryan Persaud wrote: Hello, I'm looking to insert log lines from log files into kafka, but I'm concerned with handling asynchronous send() failures. Specifically, if some of the log lines fail to send, I want to be notified of the failure so that I can attempt to resend them. Based on previous threads on the mailing list (http://comments.gmane.org/gmane.comp.apache.kafka.user/1322), I know that the trunk version of kafka supports callbacks for dealing with failures. However, the callback function is not passed any metadata that can be used by the producer end to reference the original message. Including the key of the message in the RecordMetadata seems like it would be really useful for recovery purposes. Is anyone using the callback functionality to trigger resends of failed messages? If so, how are they tying the callbacks to messages? Is anyone using other methods for handling async errors/resending today? I can't imagine that I am the only one trying to do this. I asked this question on the IRC channel today, and it sparked some discussion, but I wanted to hear from a wider audience. Thanks for the information, -Ryan
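A rough sketch of that pattern; BeanstalkClient here is a hypothetical stand-in for a real beanstalkd client, and the job id is the per-send context carried by the Callback instance:

    import org.apache.kafka.clients.producer.Callback;
    import org.apache.kafka.clients.producer.RecordMetadata;
    import java.util.concurrent.ExecutorService;

    // One instance per send; holds everything needed to finish the job later.
    public class JobCallback implements Callback {
        private final long jobId;                 // hypothetical beanstalkd job id
        private final BeanstalkClient beanstalk;  // hypothetical client interface below
        private final ExecutorService followUp;

        public JobCallback(long jobId, BeanstalkClient beanstalk, ExecutorService followUp) {
            this.jobId = jobId;
            this.beanstalk = beanstalk;
            this.followUp = followUp;
        }

        @Override
        public void onCompletion(final RecordMetadata metadata, final Exception exception) {
            // Runs on the producer's I/O thread: stay fast, just enqueue the work.
            followUp.submit(new Runnable() {
                public void run() {
                    if (exception == null) {
                        beanstalk.delete(jobId);   // success: remove the job
                    } else {
                        beanstalk.release(jobId);  // failure: put it back for retry
                    }
                }
            });
        }

        // Hypothetical stand-in for a beanstalkd client.
        public interface BeanstalkClient {
            void delete(long jobId);
            void release(long jobId);
        }
    }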
Re: EBCDIC support
Having been spared any EBCDIC experience whatsoever (i.e. from a position of thorough ignorance), if you are transmitting text or things with a designated textual form, I would recommend that your conversion be to Unicode rather than ASCII if you don't already have consumers expecting a given conversion. That way you will avoid losing information, particularly if you expect any of your conversion tools to be of more general use. Christian On 08/25/2014 05:36 PM, Gwen Shapira wrote: Personally, I like converting data before writing to Kafka, so I can easily support many consumers who don't know about EBCDIC. A third option is to have a consumer that reads EBCDIC data from one Kafka topic and writes ASCII to another Kafka topic. This has the benefits of preserving the raw data in Kafka, in case you need it for troubleshooting, and also supporting non-EBCDIC consumers. The cost is a more complex architecture, but if you already have a stream processing system around (Storm, Samza, Spark), it can be an easy addition. On Mon, Aug 25, 2014 at 5:28 PM, sonali.parthasara...@accenture.com wrote: Thanks Gwen! Makes sense. So I'll have to weigh the pros and cons of doing an EBCDIC to ASCII conversion before sending to Kafka vs. using an EBCDIC library afterwards in the consumer Thanks! S -Original Message- From: Gwen Shapira [mailto:gshap...@cloudera.com] Sent: Monday, August 25, 2014 5:22 PM To: users@kafka.apache.org Subject: Re: EBCDIC support Hi Sonali, Kafka doesn't really care about EBCDIC or any other format - for Kafka bits are just bits. So they are all supported. Kafka does not read data from a socket though. Well, it does, but the data has to be sent by a Kafka producer. Most likely you'll need to implement a producer that will get the data from the socket and send it as a message to Kafka. The content of the message can be anything, including EBCDIC. Then you'll need a consumer to read the data from Kafka and do something with it - the consumer will need to know what to do with a message that contains EBCDIC data. Perhaps you have EBCDIC libraries you can reuse there. Hope this helps. Gwen On Mon, Aug 25, 2014 at 5:14 PM, sonali.parthasara...@accenture.com wrote: Hey all, This might seem like a silly question, but does kafka have support for EBCDIC? Say I had to read data from an IBM mainframe via a TCP/IP socket where the data resides in EBCDIC format, can Kafka read that directly? Thanks, Sonali
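For the conversion step, Java's built-in charsets cover common EBCDIC code pages; which code page applies depends on the mainframe's configuration, and Cp1047 below is just one common choice:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    // Decode EBCDIC bytes read off the socket into a String, then re-encode
    // as UTF-8 (rather than ASCII) so no characters are lost downstream.
    public class EbcdicConverter {
        private static final Charset EBCDIC = Charset.forName("Cp1047");

        public static byte[] toUtf8(byte[] ebcdicBytes) {
            String text = new String(ebcdicBytes, EBCDIC);
            return text.getBytes(StandardCharsets.UTF_8);
        }
    }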
Re: Which producer to use?
I ended up coding against the new one, org.apache.kafka.clients.producer.Producer, though it is not yet in production here. It might be slightly more painful to select a partition since there isn't a place to plug in a partitioner class, but overall it was quite easy and had the key feature of an async callback. Christian On 06/23/2014 04:54 PM, Guozhang Wang wrote: Hi Kyle, We have not fully completed the testing in production yet for the new producer; currently some improvement jiras like KAFKA-1498 are still open. Once we have it stabilized in production at LinkedIn we plan to update the wiki in favor of the new producer. Guozhang On Mon, Jun 23, 2014 at 3:39 PM, Kyle Banker kyleban...@gmail.com wrote: As of today, the latest Kafka docs show kafka.javaapi.producer.Producer in their example of the producer API ( https://kafka.apache.org/documentation.html#producerapi). Is there a reason why the latest producer client (org.apache.kafka.clients.producer.Producer) isn't mentioned? Is this client not preferred or production-ready?
0.8.1 Java Producer API Callbacks
I'm looking at using the java producer api for 0.8.1 and I'm slightly confused by this passage from section 4.4 of https://kafka.apache.org/documentation.html#theproducer Note that as of Kafka 0.8.1 the async producer does not have a callback, which could be used to register handlers to catch send errors. Adding such callback functionality is proposed for Kafka 0.9, see [Proposed Producer API](https://cwiki.apache.org/confluence/display/KAFKA/Client+Rewrite#ClientRewrite-ProposedProducerAPI). org.apache.kafka.clients.producer.KafkaProducer in 0.8.1 appears to have public Future<RecordMetadata> send(ProducerRecord record, Callback callback) which looks like the mentioned callback. How do the callbacks work with the async producer? Is it as described in the comment on the send method (see https://github.com/apache/kafka/blob/0.8.1/clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java#L151 for reference)? Looking around it seems plausible the language in the documentation might refer to a separate sort of callback that existed in 0.7 but not 0.8. In our use case we have something useful to do if we can detect messages failing to be sent. Christian
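For context, using the callback on that send method would look roughly like the sketch below, per the 0.8.1 org.apache.kafka.clients.producer API where ProducerRecord is not generic and carries raw byte arrays; the topic name and the failure handling are placeholders, and the producer is assumed to be configured elsewhere:

    import org.apache.kafka.clients.producer.Callback;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class SendWithCallback {
        // Sends one payload and reports failures via the per-send callback.
        static void sendLine(KafkaProducer producer, byte[] payload) {
            producer.send(new ProducerRecord("mytopic", payload), new Callback() {
                public void onCompletion(RecordMetadata metadata, Exception exception) {
                    // Invoked on the producer's I/O thread once the send completes.
                    if (exception != null) {
                        // detect the failed message and do something useful here
                    }
                }
            });
        }
    }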
Re: 0.8.1 Java Producer API Callbacks
On 05/01/2014 07:22 PM, Christian Csar wrote: I'm looking at using the java producer api for 0.8.1 and I'm slightly confused by this passage from section 4.4 of https://kafka.apache.org/documentation.html#theproducer Note that as of Kafka 0.8.1 the async producer does not have a callback, which could be used to register handlers to catch send errors. Adding such callback functionality is proposed for Kafka 0.9, see [Proposed Producer API](https://cwiki.apache.org/confluence/display/KAFKA/Client+Rewrite#ClientRewrite-ProposedProducerAPI). org.apache.kafka.clients.producer.KafkaProducer in 0.8.1 appears to have public Future<RecordMetadata> send(ProducerRecord record, Callback callback) which looks like the mentioned callback. How do the callbacks work with the async producer? Is it as described in the comment on the send method (see https://github.com/apache/kafka/blob/0.8.1/clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java#L151 for reference)? Looking around it seems plausible the language in the documentation might refer to a separate sort of callback that existed in 0.7 but not 0.8. In our use case we have something useful to do if we can detect messages failing to be sent. Christian It appears that I was looking at the Java client rather than the Scala java api referenced by the documentation https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/javaapi/producer/Producer.scala Are both of these currently suited for use from Java and still supported? Given the support for callbacks in the event of failure I am inclined to use the Java one, despite the currently limited support for specifying partitioners (though it supports specifying the partition) or encoders. Any guidance on this would be appreciated. Christian