Re: kafka user group in los angeles
Hey Alex,

It looks like this group might be an appropriate place for a Kafka talk: http://www.meetup.com/Los-Angeles-Big-Data-Users-Group/ It might be worth showing up at one of their events and asking around.

-Jon

On Thu, Apr 23, 2015 at 11:40 AM, Alex Toth a...@purificator.net wrote:

Hi, Sorry this isn't directly a Kafka question, but I was wondering if there are any Kafka user groups in (or within driving range of) Los Angeles. Looking through meetup.com and the usual web search engines hasn't brought me much beyond the LA Hadoop user group, and I was hoping for something more specific. If I should have asked this somewhere else, again, sorry, and let me know.

alex
Re: Post on running Kafka at LinkedIn
Keep in mind that these brokers aren't really stressed too much at any given time -- we need to stay ahead of the capacity curve. Your message throughput will really just depend on what hardware you're using. However, in the past, we've benchmarked at 400,000 to more than 800,000 messages / broker / sec, depending on configuration (https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines).

-Jon

On Mar 20, 2015, at 3:03 PM, Emmanuel ele...@msn.com wrote:

800B messages / day = 9.26M messages / sec over 1100 brokers = ~8400 messages / broker / sec. Do I get this right? Trying to benchmark my own test cluster, and that's what I see with 2 brokers... Just wondering if my numbers are good or bad...

Subject: Re: Post on running Kafka at LinkedIn
From: cl...@kafka.guru
Date: Fri, 20 Mar 2015 14:27:58 -0700
To: users@kafka.apache.org

Yep! We are growing :)

-Clark

Sent from my iPhone

On Mar 20, 2015, at 2:14 PM, James Cheng jch...@tivo.com wrote:

Amazing growth numbers. At the meetup on 1/27, Clark Haskins presented their Kafka usage at the time. It was:

Bytes in: 120 TB
Messages in: 585 billion
Bytes out: 540 TB
Total brokers: 704

In Todd's post, the current numbers:

Bytes in: 175 TB (45% growth)
Messages in: 800 billion (36% growth)
Bytes out: 650 TB (20% growth)
Total brokers: 1100 (56% growth)

That much growth in just 2 months? Wowzers.
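Emmanuel's back-of-the-envelope arithmetic above checks out; a quick sketch using the numbers from the thread:

```python
# Sanity-check the per-broker rate discussed above.
messages_per_day = 800e9          # 800 billion messages/day, cluster-wide
brokers = 1100
seconds_per_day = 24 * 60 * 60

cluster_rate = messages_per_day / seconds_per_day   # cluster msg/sec
per_broker_rate = cluster_rate / brokers            # msg/broker/sec

print(f"{cluster_rate / 1e6:.2f}M msg/sec cluster-wide")  # ~9.26M
print(f"{per_broker_rate:.0f} msg/broker/sec")            # ~8400
```

At ~8,400 msg/broker/sec against benchmarked capacities of 400,000+, each broker is running far below its ceiling, which is the capacity-headroom point Jon makes above.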
-James

On Mar 20, 2015, at 11:30 AM, James Cheng jch...@tivo.com wrote:

For those who missed it: The Kafka Audit tool was also presented at the 1/27 Kafka meetup: http://www.meetup.com/http-kafka-apache-org/events/219626780/ Recorded video is here, starting around the 40 minute mark: http://www.ustream.tv/recorded/58109076 Slides are here: http://www.ustream.tv/recorded/58109076

-James

On Mar 20, 2015, at 9:47 AM, Todd Palino tpal...@gmail.com wrote:

For those who are interested in detail on how we've got Kafka set up at LinkedIn, I have just published a new post to our Engineering blog titled Running Kafka at Scale: https://engineering.linkedin.com/kafka/running-kafka-scale It's a general overview of our current Kafka install, tiered architecture, audit, and the libraries we use for producers and consumers. You'll also be seeing more posts from the SRE team here in the coming weeks with deeper looks into both Kafka and Samza. Additionally, I'll be giving a talk at ApacheCon next month on running tiered Kafka architectures. If you're in Austin for that, please come by and check it out.

-Todd
Re: Anyone interested in speaking at Bay Area Kafka meetup @ LinkedIn on March 24?
The meetups are recorded. For example, here's a link to the January meetup: http://www.ustream.tv/recorded/58109076 The links to the recordings are usually posted to the comments for each meetup on http://www.meetup.com/http-kafka-apache-org/

-Jon

On Feb 23, 2015, at 3:24 PM, Ruslan Khafizov ruslan.khafi...@gmail.com wrote:

+1 for recording sessions.

On 24 Feb 2015 07:22, Jiangjie Qin j...@linkedin.com.invalid wrote:

+1, I'm very interested.

On 2/23/15, 3:05 PM, Jay Kreps jay.kr...@gmail.com wrote:

+1 I think something like Kafka on AWS at Netflix would be hugely interesting to a lot of people.

-Jay

On Mon, Feb 23, 2015 at 3:02 PM, Allen Wang aw...@netflix.com.invalid wrote:

We (Steven Wu and Allen Wang) can talk about Kafka use cases and operations at Netflix. Specifically, we can talk about how we scale and operate Kafka clusters in AWS and how we migrated our data pipeline to Kafka.

Thanks, Allen

On Mon, Feb 23, 2015 at 12:15 PM, Ed Yakabosky eyakabo...@linkedin.com.invalid wrote:

Hi Kafka Open Source - LinkedIn will host another Bay Area Kafka meetup in Mountain View on March 24. We are planning to present on Offset Management but are looking for additional speakers. If you're interested in presenting a use case, an operational plan, or your experience with a particular feature (REST interface, WebConsole), please reply-all to let us know. [BCC: Open Source lists]

Thanks, Ed
Re: question about new consumer offset management in 0.8.2
There should probably be a wiki page started for this so we have the details in one place. The same question was asked on Freenode IRC a few minutes ago. :) A summary of the migration procedure is:

1) Upgrade your brokers and set dual.commit.enabled=false and offsets.storage=zookeeper (commit offsets to ZooKeeper only).
2) Set dual.commit.enabled=true and offsets.storage=kafka and restart (commit offsets to ZooKeeper and Kafka).
3) Set dual.commit.enabled=false and offsets.storage=kafka and restart (commit offsets to Kafka only).

-Jon

On Feb 5, 2015, at 9:03 AM, Jason Rosenberg j...@squareup.com wrote:

Hi, For 0.8.2, one of the features listed is: - Kafka-based offset storage. Is there documentation on this (I've heard discussion of it of course)? Also, is it something that will be used by existing consumers when they migrate up to 0.8.2? What is the migration process?

Thanks, Jason
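As a sketch, the consumer configuration for the dual-commit phase (step 2) would look like the fragment below; the property names are as documented for the 0.8.2 high-level consumer, but verify them against your client version:

```properties
# Phase 2 of the offset-storage migration: commit to both
# ZooKeeper and Kafka while consumers are being upgraded.
offsets.storage=kafka
dual.commit.enabled=true

# Phase 3, once all consumers are on 0.8.2, drops the ZooKeeper commit:
#   offsets.storage=kafka
#   dual.commit.enabled=false
```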
Re: One or multiple instances of MM to aggregate kafka data to one hadoop
Hey Mingjie,

Here's how we have our mirror makers configured. For some context, let me try to describe this using the example datacenter layout described in: https://engineering.linkedin.com/samza/operating-apache-samza-scale

In that example, there are four datacenters (A, B, C, and D); however, we only need datacenters A and B here. Datacenter A mirrors data from local(A) to aggregate(A) as well as local(B) to aggregate(A). Datacenter B mirrors data from local(B) to aggregate(B) as well as local(A) to aggregate(B). The diagram in the article should make this easy to visualize. Note that the mirror makers run in the destination datacenter and pull the traffic in.

Let's say we have two physical machines in each datacenter dedicated to running mirror makers (call them servers 1 and 2 in datacenter A; servers 3 and 4 in datacenter B). The layout of mirror maker processes would look like this:

* Datacenter A MirrorMaker Cluster
  * Server 1
    * local(A) to aggregate(A) MirrorMaker instance
    * local(B) to aggregate(A) MirrorMaker instance
  * Server 2
    * local(A) to aggregate(A) MirrorMaker instance
    * local(B) to aggregate(A) MirrorMaker instance
* Datacenter B MirrorMaker Cluster
  * Server 3
    * local(B) to aggregate(B) MirrorMaker instance
    * local(A) to aggregate(B) MirrorMaker instance
  * Server 4
    * local(B) to aggregate(B) MirrorMaker instance
    * local(A) to aggregate(B) MirrorMaker instance

The benefit of this layout is that if the load becomes too high, we can add another server to each cluster that looks exactly like the others in the cluster (easy to provision). If you get really huge, you can start creating multiple mirror maker clusters that each handle a specific flow (but still have homogeneous processes within each cluster). Of course, YMMV, but this is what works well for us.
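A minimal sketch of one such instance (the local(B) to aggregate(A) flow on Server 1), using the 0.8-era MirrorMaker CLI; the property file names are hypothetical, and the exact options should be checked against your Kafka version:

```shell
# local(B) -> aggregate(A) MirrorMaker instance, running in datacenter A.
# consumer-local-b.properties  points zookeeper.connect at the local(B) cluster.
# producer-aggregate-a.properties points at the aggregate(A) cluster's brokers.
bin/kafka-run-class.sh kafka.tools.MirrorMaker \
  --consumer.config consumer-local-b.properties \
  --producer.config producer-aggregate-a.properties \
  --num.streams 4 \
  --whitelist '.*'
```

Each of the four processes in the layout above would be one such invocation, differing only in which consumer/producer property files it is given.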
:)

-Jon

On Jan 28, 2015, at 3:54 PM, Daniel Compton daniel.compton.li...@gmail.com wrote:

Hi Mingjie

I would recommend the first option of running one MirrorMaker instance pulling from multiple DCs. A single MM instance will be able to make more efficient use of the machine resources in two ways:

1. You will only have to run one process, which can be allocated the full amount of resources.
2. Within the process, if you run enough consumer threads, I think they should be able to rebalance and pick up the load if they don't have anything to do. I'm not 100% sure on this, but 1 still holds.

A single MM instance should handle connectivity issues with one DC without affecting the rest of the consumer threads for other DCs. You would gain process isolation running a MM per DC, but this would raise the operational burden and resource requirements. I'm not sure what benefit you'd actually get from process isolation, so I'd recommend against it. However, I'd be interested to hear if others do things differently.

Daniel.

On Thu Jan 29 2015 at 11:14:29 AM Mingjie Lai m...@apache.org wrote:

Hi. We have a pretty typical data ingestion use case: we use MirrorMaker at one Hadoop datacenter to mirror Kafka data from multiple remote application datacenters. I know MirrorMaker can consume Kafka data from multiple Kafka sources with one instance on one physical node. With this, we can give one MM instance multiple consumer config files, so it can consume data from multiple places. Another option is to have multiple MirrorMaker instances on one node, each dedicated to grabbing data from one single source datacenter. Certainly there will be multiple MM nodes to balance the load. The second option looks better since it provides some isolation between the different datacenters. Any recommendation for this kind of data aggregation case? Still new to Kafka and MirrorMaker. Welcome any information.
Thanks, Mingjie
LinkedIn Engineering Blog Post - Current and Future
Here's an overview of what LinkedIn plans to concentrate on in the upcoming year: https://engineering.linkedin.com/kafka/kafka-linkedin-%E2%80%93-current-and-future

-Jon
Re: Production settings for JDK 7 + G1 GC
We're currently using JDK 8 update 5 with the following settings:

-server -Xms4g -Xmx4g -XX:PermSize=96m -XX:MaxPermSize=96m -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:logs/gc.log -XX:ErrorFile=logs/err.log

This works well for us, but you should customize it to your workload. :)

-Jon

On Jan 14, 2015, at 5:51 PM, Albert Strasheim full...@gmail.com wrote:

Greetings all

We're expanding our Kafka cluster, and I thought this would be a good time to try the suggestions in http://www.slideshare.net/ToddPalino/enterprise-kafka-kafka-as-a-service (slide #37) about running on JDK 7 with the G1 GC. Anybody (Todd?) who could shed some light on a complete set of good GC flags to start with, and what the best JDK version is to run with these days?

Thanks! Regards, Albert
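If you use the stock startup scripts, flags like these can be supplied through the environment rather than by editing the scripts. A sketch, using the environment variable names honored by Kafka's bin/kafka-run-class.sh (verify against your version; the flag values here are just the ones quoted above):

```shell
# Heap sizing is read from KAFKA_HEAP_OPTS; GC choice and tuning from
# KAFKA_JVM_PERFORMANCE_OPTS; GC logging from KAFKA_GC_LOG_OPTS.
export KAFKA_HEAP_OPTS="-Xms4g -Xmx4g"
export KAFKA_JVM_PERFORMANCE_OPTS="-server -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35"
export KAFKA_GC_LOG_OPTS="-Xloggc:logs/gc.log -verbose:gc \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution"

bin/kafka-server-start.sh config/server.properties
```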
Re: Kafka 0.8.1.1 Leadership changes are happening very often
Several features in ZooKeeper depend on server time. I would highly recommend that you properly set up ntpd (or whatever), then try to reproduce.

-Jon

On Jan 2, 2015, at 2:35 PM, Birla, Lokesh lokesh.bi...@verizon.com wrote:

We don't see ZooKeeper expiration. However, I noticed that our servers' system time is NOT synced; server1 and server2 had a 30+ sec delay. Do you think that could cause the leadership changes or any other issue?

On 12/31/14, 4:03 PM, Jun Rao j...@confluent.io wrote:

A typical cause of frequent leadership changes is GC-induced soft failure. Do you see ZK session expiration on the broker? If so, you may want to enable the GC log to see the GC time.

Thanks, Jun

On Tue, Dec 23, 2014 at 2:06 PM, Birla, Lokesh lokesh.bi...@verizon.com wrote:

I was already using a 4GB heap. I even changed to an 8GB heap and could still see leadership changing very often. In my 5-minute run, I saw leadership change from 1,2,3 to 3,3,3 to 1,1,1. Also, my message rate is just 7K, and the total message count is only 2,169,001. Does anyone have a clue about the leadership changes?

-Lokesh

From: Thunder Stumpges tstump...@ntent.com
Date: Monday, December 22, 2014 at 6:31 PM
To: users@kafka.apache.org
Cc: Birla, Lokesh lokesh.bi...@one.verizon.com
Subject: RE: Kafka 0.8.1.1 Leadership changes are happening very often

Did you check the GC logs on the server? We ran into this, and the default setting of a 1G max heap on the broker process was nowhere near enough. We currently have it set to 4G.
-T

-----Original Message-----
From: Birla, Lokesh [lokesh.bi...@verizon.com]
Received: Monday, 22 Dec 2014, 5:27PM
To: users@kafka.apache.org
CC: Birla, Lokesh [lokesh.bi...@verizon.com]
Subject: Kafka 0.8.1.1 Leadership changes are happening very often

Hello, I am running 3 brokers, one ZooKeeper, and a producer, all on separate machines. I am also sending a very low load, around 6K msg/sec. Each msg is around 150 bytes only. I ran the load for only 5 minutes, and during this time I saw the leadership change very often. I created 3 partitions, and the leadership for each partition changed. All 3 brokers are running perfectly fine. No broker is down. Could someone let me know why the Kafka leadership changed so often?

Initially:

Topic: mmetopic1  PartitionCount: 3  ReplicationFactor: 3  Configs:
  Topic: mmetopic1  Partition: 0  Leader: 2  Replicas: 2,3,1  Isr: 2,3,1
  Topic: mmetopic1  Partition: 1  Leader: 3  Replicas: 3,1,2  Isr: 3,1,2
  Topic: mmetopic1  Partition: 2  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3

Changed to:

Topic: mmetopic1  PartitionCount: 3  ReplicationFactor: 3  Configs:
  Topic: mmetopic1  Partition: 0  Leader: 3  Replicas: 2,3,1  Isr: 3,1,2
  Topic: mmetopic1  Partition: 1  Leader: 3  Replicas: 3,1,2  Isr: 3,1,2
  Topic: mmetopic1  Partition: 2  Leader: 1  Replicas: 1,2,3  Isr: 1,3,2

Changed to:

Topic: mmetopic1  PartitionCount: 3  ReplicationFactor: 3  Configs:
  Topic: mmetopic1  Partition: 0  Leader: 1  Replicas: 2,3,1  Isr: 1,2,3
  Topic: mmetopic1  Partition: 1  Leader: 1  Replicas: 3,1,2  Isr: 1,2,3
  Topic: mmetopic1  Partition: 2  Leader: 2  Replicas: 1,2,3  Isr: 2,1,3

Changed to:

Topic: mmetopic1  PartitionCount: 3  ReplicationFactor: 3  Configs:
  Topic: mmetopic1  Partition: 0  Leader: 3  Replicas: 2,3,1  Isr: 3,1,2
  Topic: mmetopic1  Partition: 1  Leader: 3  Replicas: 3,1,2  Isr: 3,1,2
  Topic: mmetopic1  Partition: 2  Leader: 1  Replicas: 1,2,3  Isr: 1,3,2

Thanks, Lokesh
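A quick way to confirm the clock skew Lokesh mentions is with the standard NTP tooling; a sketch (the NTP server name is just an example):

```shell
# Show each configured peer's offset (in milliseconds); a healthy,
# synced host is typically within a few ms of its peers.
ntpq -p

# One-shot query against a public pool server without stepping the clock:
ntpdate -q pool.ntp.org
```

Running this on each broker and comparing the offsets would surface the 30+ second skew described above before it can destabilize ZooKeeper sessions.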
Re: keyed-messages de-duplication
It looks like the log.cleanup.policy config option was changed from dedupe to compact: https://github.com/apache/kafka/blob/0.8.1.1/core/src/main/scala/kafka/log/LogConfig.scala#L68

-Jon

On May 13, 2014, at 1:08 PM, Jay Kreps jay.kr...@gmail.com wrote:

Hi, The compaction is done to clean up space. It isn't done immediately, only periodically. I suspect the reason you see no compaction is that we never compact the active segment of the log (the most recent file), as that is still being written to. The compaction would not happen until a new segment file was rolled. If you want to see this happen, I recommend changing the file segment size configuration to something small (5 MB) and producing enough messages to roll a new segment file. You should then see logging about compaction in logs/log-cleaner.log.

-Jay

On Tue, May 13, 2014 at 11:52 AM, C 4.5 cfourf...@gmail.com wrote:

I understand Kafka supports keyed messages (I am using 0.8.1.1) and that it is possible to de-duplicate messages based on the message key. (The log compaction section of the online documentation describes how that works.) I am using a code example that comes with Kafka (namely KafkaConsumerProducerDemo) and running it in Kafka local mode. I write a set of messages with the same String key and then have a consumer that consumes the data. The consumer consumes messages *only* after the producer has produced all its messages. I would expect the consumer to retrieve only the latest message (as all messages have the same key), but it retrieves all messages the producer has emitted. I have also turned on these properties on the Kafka server:

log.cleaner.enable=true
log.cleanup.policy=dedupe

- Is de-duplication of messages guaranteed to take effect only after compaction?
- I have tried to force compaction by setting log.cleaner.backoff.ms and log.cleaner.min.cleanable.ratio to very low values, but I still observe the same behavior.

Any ideas or pointers? Thanks.
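Putting Jay's suggestion together, a sketch that makes compaction observable (topic name is hypothetical; config key names per the 0.8.1-era topic-level configs, so verify against your broker version):

```shell
# Create a compacted topic with a small segment size so segments roll
# quickly and the log cleaner has closed segments to work on. The active
# (most recent) segment is never compacted.
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --topic compacted-demo --partitions 1 --replication-factor 1 \
  --config cleanup.policy=compact \
  --config segment.bytes=5242880

# Then produce more than 5 MB of keyed messages to roll a segment, and
# watch logs/log-cleaner.log for compaction activity.
```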