Re: Using Kafka as a persistent store
Would it be possible to document how to configure Kafka to never delete messages in a topic? It took a good while to figure this out, and I see it as an important use case for Kafka.

On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck daniel.schierb...@gmail.com wrote:

On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote: If I recall correctly, setting log.retention.ms and log.retention.bytes to -1 disables both.

Thanks!

On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck daniel.schierb...@gmail.com wrote:

On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote: There are two ways you can configure your topics: log compaction, or no cleaning at all. The choice depends on your use case. Are the records uniquely identifiable, and will they receive updates? Then log compaction is the way to go. If they are truly read-only, you can go without log compaction.

I'd rather be free to use the key for partitioning, and the records are immutable — they're event records — so disabling compaction altogether would be preferable. How is that accomplished?

We have small processes which consume a topic and perform upserts to our various database engines. It's easy to change how it all works and simply consume the single source of truth again. I've written a bit about log compaction here: http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/

On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck daniel.schierb...@gmail.com wrote:

I'd like to use Kafka as a persistent store – sort of as an alternative to HDFS. The idea is that I'd load the data into various other systems in order to solve specific needs such as full-text search, analytics, indexing by various attributes, etc. I'd like to keep a single source of truth, however. I'm struggling a bit to understand how I can configure a topic to retain messages indefinitely. I want to make sure that my data isn't deleted. Is there a guide to configuring Kafka like this?
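For reference, a sketch of what that never-delete configuration might look like at topic-creation time, assuming 0.8.2-era tooling (the topic name "events" and the ZooKeeper address are placeholders):

    # create a topic whose log is neither compacted nor deleted
    bin/kafka-topics.sh --zookeeper localhost:2181 --create \
      --topic events --partitions 8 --replication-factor 3 \
      --config retention.ms=-1 \
      --config retention.bytes=-1

The broker-wide equivalents are the log.retention.ms and log.retention.bytes settings Jay mentions above; note Shayne's report later in this digest that the topic-level -1 settings appeared to delete segments immediately for him, so verify the behavior on your version before relying on it.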
Re: performance benchmarking of kafka
Hi, appreciate your response. It works now! It was just a typo in the class name :( It really has nothing to do with whether you are using the binaries or the source version of Kafka. Thanks everyone!

On Mon, Jul 13, 2015 at 11:18 PM, tao xiao xiaotao...@gmail.com wrote:

org.apache.kafka.clients.tools.ProducerPerformance resides in kafka-clients-0.8.2.1.jar. You need to make sure the jar exists in $KAFKA_HOME/libs/. I use kafka_2.10-0.8.2.1 too, and here is the output:

% bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance
USAGE: java org.apache.kafka.clients.tools.ProducerPerformance topic_name num_records record_size target_records_sec [prop_name=prop_value]*

On Tue, 14 Jul 2015 at 05:08 Yuheng Du yuheng.du.h...@gmail.com wrote:

I am using the binaries of kafka_2.10-0.8.2.1. Could that be the problem? Should I use the source of kafka-0.8.2.1-src.tgz on each of my machines, build them, and run the test? Thanks.

On Mon, Jul 13, 2015 at 4:37 PM, JIEFU GONG jg...@berkeley.edu wrote:

You may need to open up your run-class.sh in a text editor and modify the classpath -- I believe I had a similar error before.

On Mon, Jul 13, 2015 at 1:16 PM, Yuheng Du yuheng.du.h...@gmail.com wrote:

Hi guys, I am trying to replicate the benchmarking test of Kafka at http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines. When I run

bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test7 5000 100 -1 acks=1 bootstrap.servers=192.168.1.1:9092 buffer.memory=67108864 batch.size=8196

I get the following error:

Error: Could not find or load main class org.apache.kafka.client.tools.ProducerPerformance

What should I fix? Thank you!

--
Jiefu Gong
University of California, Berkeley | Class of 2017
B.A. Computer Science | College of Letters and Sciences
jg...@berkeley.edu elise...@berkeley.edu | (925) 400-3427
Re: performance benchmarking of kafka
org.apache.kafka.clients.tools.ProducerPerformance resides in kafka-clients-0.8.2.1.jar. You need to make sure the jar exists in $KAFKA_HOME/libs/. I use kafka_2.10-0.8.2.1 too, and here is the output:

% bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance
USAGE: java org.apache.kafka.clients.tools.ProducerPerformance topic_name num_records record_size target_records_sec [prop_name=prop_value]*

On Tue, 14 Jul 2015 at 05:08 Yuheng Du yuheng.du.h...@gmail.com wrote: I am using the binaries of kafka_2.10-0.8.2.1. Could that be the problem?
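If you want to verify that the class is actually present in one of the bundled jars, a quick check (paths assume the standard 0.8.2.1 binary layout; adjust as needed):

    # find which bundled jar, if any, contains ProducerPerformance
    for f in "$KAFKA_HOME"/libs/kafka-clients*.jar; do
      echo "$f"; unzip -l "$f" | grep ProducerPerformance
    done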
Re: Data Structure abstractions over kafka
Tim,

Kafka can be used as a key-value store if you turn on log compaction: http://kafka.apache.org/documentation.html#compaction

You need to be careful with that, since it's purely last-writer-wins and doesn't have anything like CAS that might help you manage concurrent writers, but the basic functionality is there. This is how the brokers store consumer offsets in Kafka (keys are (consumer-group, topic, partition), values are the offset, and there is already a mechanism to ensure only a single writer at a time).

You could possibly use this to implement the linked-list functionality you're talking about, although there are probably a number of challenges (e.g., performing atomic updates if you need a doubly-linked list, ensuring garbage is collected after removals even if you only need a singly-linked list, etc.). Also, I'm not sure it would be particularly efficient; you'd still need to ensure a single writer (or at least a single writer per linked-list node). You're almost definitely better off using a specialized store for something like that, simply because Kafka isn't designed around that use case, but it'd be interesting to see how far you could get with Kafka's current functionality, and what would be required to make it practical!

-Ewen

On Mon, Jul 13, 2015 at 11:36 AM, Tim Smith secs...@gmail.com wrote: Hi, in the big data ecosystem I have started to use Kafka, essentially, as an unordered list/array and a cluster-wide pipe. I am wondering if there are any abstractions on top of Kafka that will let me use it to store/organize other simple data structures, like a linked list?
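For anyone wanting to try this, a compacted topic can be created like so (a sketch assuming 0.8.2-era tooling; the topic name and ZooKeeper address are placeholders):

    # only the latest record per key is retained after compaction
    bin/kafka-topics.sh --zookeeper localhost:2181 --create \
      --topic kv-store --partitions 8 --replication-factor 3 \
      --config cleanup.policy=compact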
Re: Using Kafka as a persistent store
We've tried to use Kafka not as a persistent store, but as a long-term archival store. An outstanding issue we've had with that is that the broker holds on to an open file handle on every file in the log! The other issue we've had is that when you create a long-term archival log on shared storage, you can't simply access that data from another cluster, because metadata is stored in ZooKeeper rather than in the log.

--Scott Thibault

On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck daniel.schierb...@gmail.com wrote: Would it be possible to document how to configure Kafka to never delete messages in a topic?
Re: Using Kafka as a persistent store
Scott,

This is what I was trying to target in one of my previous responses to Daniel, the one in which I suggest another compaction setting for Kafka.

Kind regards,
Radek Gruchalski
ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

On Monday, 13 July 2015 at 15:41, Scott Thibault wrote: We've tried to use Kafka not as a persistent store, but as a long-term archival store.
Re: Kafka as an event store for Event Sourcing
Ah, just saw this. I actually just submitted a patch this evening -- just for the partition-wide version at the moment, since it turns out to be pretty simple to implement. Still very interested in moving forward with this stuff, though not always as much time as I would like...

On Thu, Jul 9, 2015 at 9:39 AM, Daniel Schierbeck daniel.schierb...@gmail.com wrote:

Ben, are you still interested in working on this?

On Mon, Jun 15, 2015 at 9:49 AM Daniel Schierbeck daniel.schierb...@gmail.com wrote:

I like to refer to it as a conditional write or conditional request, semantically similar to HTTP's If-Match header. Ben: I'm adding a comment about per-key checking to your JIRA.

On Mon, Jun 15, 2015 at 4:06 AM Ben Kirwin b...@kirw.in wrote:

Yeah, it's definitely not a standard CAS, but it feels like the right fit for the commit-log abstraction -- CAS on a 'current value' does seem a bit too key-value-store-ish for Kafka to support natively. I tried to avoid referring to the check-offset-before-publish functionality as a CAS in the ticket because, while they're both types of 'optimistic concurrency control', they are a bit different -- and the offset check is both easier to implement and handier for the stuff I tend to work on. (Though that ticket's about checking the latest offset on a whole partition, not the key -- there's a different set of tradeoffs for the latter, and I haven't thought it through properly yet.)

On Sat, Jun 13, 2015 at 3:35 PM, Ewen Cheslack-Postava e...@confluent.io wrote:

If you do CAS where you compare the offset of the current record for the key, then yes. This might work fine for applications that track key, value, and offset. It is not quite the same as doing a normal CAS.

On Sat, Jun 13, 2015 at 12:07 PM, Daniel Schierbeck daniel.schierb...@gmail.com wrote:

But wouldn't the key-offset table be enough to accept or reject a write? I'm not familiar with the exact implementation of Kafka, so I may be wrong.

On lør. 13. jun. 2015 at 21.05 Ewen Cheslack-Postava e...@confluent.io wrote:

Daniel: By random read, I meant not reading the data sequentially as is the norm in Kafka, not necessarily a random disk seek. That in-memory data structure is what enables the random read. You're either going to need the disk seek if the data isn't in the fs cache, or you're trading memory to avoid it. If it's a full index containing keys and values, then you're potentially committing to a much larger JVM memory footprint (and all the GC issues that come with it), since you'd be storing that data in the JVM heap. If you're only storing the keys + offset info, then you potentially introduce random disk seeks on any CAS operation (and make page caching harder for the OS, etc.).

On Sat, Jun 13, 2015 at 11:33 AM, Daniel Schierbeck daniel.schierb...@gmail.com wrote:

Ewen: would single-key CAS necessitate random reads? My idea was to have the broker maintain an in-memory table that could be rebuilt from the log or a snapshot.

On lør. 13. jun. 2015 at 20.26 Ewen Cheslack-Postava e...@confluent.io wrote:

Jay - I think you need broker support if you want CAS to work with compacted topics. With the approach you described you can't turn on compaction, since that would make it last-writer-wins, and using any non-infinite retention policy would require some external process to monitor keys that might expire and refresh them by rewriting the data.
That said, I think any addition like this warrants a lot of discussion about potential use cases, since there are a lot of ways you could go about adding support for something like this. I think this is an obvious next incremental step, but someone is bound to have a use case that would require multi-key CAS and would be costly to build atop single-key CAS. Or, since the compare requires a random read anyway, why not throw in read-by-key rather than sequential log reads, which would allow for minitransactions a la Sinfonia?

I'm not convinced trying to make Kafka support traditional key-value-store functionality is a good idea. Compacted topics made it possible to use it a bit more in that way, but didn't change the public interface, only the way storage was implemented, and importantly all the potential additional performance costs and data structures are isolated to background threads.

-Ewen

On Sat, Jun 13, 2015 at 9:59 AM, Daniel Schierbeck daniel.schierb...@gmail.com wrote:

@Jay: Regarding your first proposal: wouldn't that mean that a producer wouldn't know whether a write succeeded? In the case of event sourcing, a failed CAS may require re-validating the input with the new state. Simply discarding the write would
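To make the semantics under discussion concrete, here is a toy Java illustration of the partition-wide check-offset-before-publish idea. This is emphatically not a real Kafka API; it only sketches the check a broker would perform, and the class name is made up:

    import java.util.ArrayList;
    import java.util.List;

    // Toy model of one partition's log with a conditional append.
    final class ConditionalPartitionLog {
        private final List<byte[]> records = new ArrayList<>();

        // Append only if the caller's expected end offset is still current;
        // otherwise reject, so the caller can re-read and re-validate.
        synchronized long appendIfEndOffsetMatches(long expectedEndOffset, byte[] record) {
            if (expectedEndOffset != records.size()) {
                return -1L; // stale view: the conditional write fails
            }
            records.add(record);
            return records.size() - 1; // offset assigned to the new record
        }
    }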
Re: kafka benchmark tests
I implemented (nearly) the same basic set of tests in the system test framework we started at Confluent and that is going to move into Kafka -- see the WIP patch for KIP-25 here: https://github.com/apache/kafka/pull/70

In particular, that test is implemented in benchmark_test.py: https://github.com/apache/kafka/pull/70/files#diff-ca984778cf9943407645eb6784f19dc8

Hopefully once that's merged, people can reuse that benchmark (and add to it!) so they can easily run the same benchmarks across different hardware. Here are some results from an older version of that test on m3.2xlarge instances on EC2 using local ephemeral storage (I think... it's been a while since I ran these numbers and I didn't document the methodology that carefully):

INFO:_.KafkaBenchmark:=================================
INFO:_.KafkaBenchmark:BENCHMARK RESULTS
INFO:_.KafkaBenchmark:=================================
INFO:_.KafkaBenchmark:Single producer, no replication: 684097.470208 rec/sec (65.24 MB/s)
INFO:_.KafkaBenchmark:Single producer, async 3x replication: 667494.359673 rec/sec (63.66 MB/s)
INFO:_.KafkaBenchmark:Single producer, sync 3x replication: 116485.764275 rec/sec (11.11 MB/s)
INFO:_.KafkaBenchmark:Three producers, async 3x replication: 1696519.022182 rec/sec (161.79 MB/s)
INFO:_.KafkaBenchmark:Message size:
INFO:_.KafkaBenchmark:  10: 1637825.195625 rec/sec (15.62 MB/s)
INFO:_.KafkaBenchmark:  100: 605504.877911 rec/sec (57.75 MB/s)
INFO:_.KafkaBenchmark:  1000: 90351.817570 rec/sec (86.17 MB/s)
INFO:_.KafkaBenchmark:  10000: 8306.180862 rec/sec (79.21 MB/s)
INFO:_.KafkaBenchmark:  100000: 978.403499 rec/sec (93.31 MB/s)
INFO:_.KafkaBenchmark:Throughput over long run, data > memory:
INFO:_.KafkaBenchmark:  Time block 0: 684725.151324 rec/sec (65.30 MB/s)
INFO:_.KafkaBenchmark:Single consumer: 701031.14 rec/sec (56.830500 MB/s)
INFO:_.KafkaBenchmark:Three consumers: 3304011.014900 rec/sec (267.830800 MB/s)
INFO:_.KafkaBenchmark:Producer + consumer:
INFO:_.KafkaBenchmark:  Producer: 624984.375391 rec/sec (59.60 MB/s)
INFO:_.KafkaBenchmark:  Consumer: 624984.375391 rec/sec (59.60 MB/s)
INFO:_.KafkaBenchmark:End-to-end latency: median 2.00 ms, 99% 4.00 ms, 99.9% 19.00 ms

Don't trust these numbers for anything; they were a quick one-off test. I'm just pasting the output so you get some idea of what the results might look like. Once we merge the KIP-25 patch, Confluent will be running the tests regularly and results will be available publicly, so we'll be able to keep better tabs on performance, albeit for only a specific class of hardware.

For the batch.size question -- I'm not sure the results in the blog post actually have different settings; it could be accidental divergence between the script and the blog post. The post specifically notes that tuning the batch size in the synchronous case might help, but that he didn't do that. If you're trying to benchmark the *optimal* throughput, tuning the batch size would make sense. Since synchronous replication will have higher latency and there's a limit to how many requests can be in flight at once, you'll want a larger batch size to compensate for the additional latency. However, in practice the increase you see may be negligible. Somebody who has spent more time fiddling with producer performance tuning may have more insight.

-Ewen

On Mon, Jul 13, 2015 at 10:08 AM, JIEFU GONG jg...@berkeley.edu wrote: Hi all, I was wondering if any of you guys have done benchmarks on Kafka performance before, and if they or their details (# nodes in cluster, # records / size(s) of messages, etc.) could be shared.
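For the sync-replication case specifically, the tuned run being discussed would look something like this (illustrative only; the broker address and topic are placeholders, and the batch.size value is an assumption based on Jay Kreps' published benchmark configs):

    bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance \
      test 50000000 100 -1 acks=-1 \
      bootstrap.servers=broker1:9092 \
      buffer.memory=67108864 batch.size=64000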
Re: Using Kafka as a persistent store
Hi,

1. What you described sounds like a reasonable architecture, but may I ask why JSON? Avro seems better supported in the ecosystem (Confluent's tools, Hadoop integration, schema evolution, etc.).

1.5. If all you do is convert data into JSON, Spark Streaming sounds like difficult-to-manage overkill compared to Flume or a slightly modified MirrorMaker (or CopyCat, if it exists yet). Any specific reasons for Spark Streaming?

2. Different compute engines prefer different storage formats because in most cases that's where optimizations come from. Parquet improves scan performance for Impala and MR, but would be pretty horrible for NoSQL. So I wouldn't hold my breath waiting for compute engines to start sharing data storage.

Gwen

On Mon, Jul 13, 2015 at 11:45 AM, Tim Smith secs...@gmail.com wrote: I have had a similar issue where I wanted a single source of truth between Search and HDFS.
kafka benchmark tests
Hi all, I was wondering if any of you have done benchmarks on Kafka performance before, and whether they or their details (# nodes in cluster, # records / size(s) of messages, etc.) could be shared. For comparison purposes, I am trying to benchmark Kafka against some similar services such as Kinesis or Scribe.

Additionally, I was wondering if anyone could shed some insight on Jay Kreps' benchmarks that he has openly published here: https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

Specifically, I am unsure why between his tests of 3x synchronous replication and 3x async replication he changed the batch.size, and why he is seemingly publishing to incorrect topics. Configs: https://gist.github.com/jkreps/c7ddb4041ef62a900e6c

Any help is greatly appreciated!

--
Jiefu Gong
University of California, Berkeley | Class of 2017
B.A. Computer Science | College of Letters and Sciences
jg...@berkeley.edu elise...@berkeley.edu | (925) 400-3427
Re: Using Kafka as a persistent store
For what it's worth, I did something similar to Rad's suggestion of cold storage to add long-term archiving when using Amazon Kinesis. Kinesis is also a message bus, but it only has a 24-hour retention window. I wrote a Kinesis consumer that would take all messages from Kinesis and save them into S3. I stored them in S3 in such a way that the structure mirrors the original Kinesis stream, and all message metadata is preserved (message offsets and primary keys, for example). This means that I can write a consumer that consumes from the S3 files in the same way that it would consume from the Kinesis stream itself. And the data is structured such that when you are done reading from S3, you can connect to the Kinesis stream at the point where the S3 archive left off. This effectively allowed me to add a configurable retention period when consuming from Kinesis.

-James

On Jul 13, 2015, at 11:45 AM, Tim Smith secs...@gmail.com wrote: I have had a similar issue where I wanted a single source of truth between Search and HDFS.
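One possible S3 key layout for such an archive (purely illustrative; the bucket, stream, and file names are made up) keeps objects sortable by shard and starting sequence number, so a consumer can replay them in order and then resume from the live stream:

    s3://archive-bucket/my-stream/shard-00000/seq-00000000000000000000.json
    s3://archive-bucket/my-stream/shard-00000/seq-00000000000010485760.json
    s3://archive-bucket/my-stream/shard-00001/seq-00000000000000000000.json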
Re: Using Kafka as a persistent store
I have had a similar issue where I wanted a single source of truth between Search and HDFS.

First, if you zoom out a little: eventually you are going to have some compute engine(s) process the data. If you store it in a compute-neutral tier like Kafka, then you will need to suck the data out at runtime and stage it for the compute engine to use. So pick your poison: process at ingest and store multiple copies of the data, one per compute engine, OR store it in a neutral store and process at runtime. I am not saying one is better than the other; that's just how I see the trade-off, so depending on your use cases, YMMV.

What I do is:
- store raw data into Kafka
- use Spark Streaming to transform data to JSON and post it back to Kafka
- hang multiple data stores off Kafka that ingest the JSON
- not do any other transformations in the consumer stores, and store the copy as an immutable event

So I do have multiple copies (one per compute tier), but they all look the same. Unless different compute engines natively start to use a common data storage format, I don't see how one could get away from storing multiple copies. Primarily, I see Lucene-based products having their own format, the Hadoop ecosystem congregating around Parquet, and the NoSQL players each having their own format.

My 2 cents worth :)

On Mon, Jul 13, 2015 at 10:35 AM, Daniel Schierbeck daniel.schierb...@gmail.com wrote: Am I correct in assuming that Kafka will only retain a file handle for the last segment of the log?
Data Structure abstractions over kafka
Hi,

In the big data ecosystem, I have started to use Kafka, essentially, as:
- an unordered list/array, and
- a cluster-wide pipe

I guess you could argue that any message bus product is a simple array/pipe, but Kafka's scale and model make things so easy :)

I am wondering if there are any abstractions on top of Kafka that will let me use it to store/organize other simple data structures, like a linked list. I have a use case for a massive linked list that can easily grow to tens of gigabytes and could easily use (1) redundancy and (2) multiple producers/consumers working on processing the list (implemented over Spark, Storm, etc.). Any ideas? Maybe maintain a linked list of offsets in another store like ZooKeeper or a NoSQL DB, while storing the messages on Kafka?

Thanks,

- Tim
Re: Using Kafka as a persistent store
Sounds like the same idea. The nice thing about having such an option is that, with a correct application of containers and a backup-and-restore strategy, one can create an infinite ordered backup of the raw input stream in the native Kafka storage format.

I understand the point of having the data in other formats in other systems; impossible to get away from that. The concept I presented a few days ago is about addressing having "multiple same-looking copies of the truth". At the end of the day, if something happens to the operational data, it will have to be recreated from the truth. But if the data was once ingested over Kafka, and there is already a pipeline for building operational state from Kafka, why would someone write another processing logic to get the truth from, say, Hadoop? And if fast, parallel processing of the native Kafka format is required, it can still be done with Samza or Hadoop / what have you.

Kind regards,
Radek Gruchalski
ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

On Monday, 13 July 2015 at 21:17, James Cheng wrote: For what it's worth, I did something similar to Rad's suggestion of cold storage to add long-term archiving when using Amazon Kinesis.
Re: Using Kafka as a persistent store
Indeed, the files would have to be moved to some separate, dedicated storage. There are basically three options, as Kafka does not allow adding logs at runtime:

1. make the consumer able to read from an arbitrary file
2. add the ability to drop files in (I believe this adds a lot of complexity)
3. read the files with another program, as suggested in my first email

I'd love to get some input from someone who knows the code and the options a bit better!

Kind regards,
Radek Gruchalski
ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

On Monday, 13 July 2015 at 18:02, Scott Thibault wrote: Yes, consider my e-mail an up vote!
Re: Using Kafka as a persistent store
Am I correct in assuming that Kafka will only retain a file handle for the last segment of the log? If the number of handles grows unbounded, then it would be an issue. But I plan on writing to this topic continuously anyway, so not separating data into cold and hot storage is the entire point.

Daniel Schierbeck

On 13. jul. 2015, at 15.41, Scott Thibault scott.thiba...@multiscalehn.com wrote: We've tried to use Kafka not as a persistent store, but as a long-term archival store.
Re: Fetching details from Kafka Server
2) You need to implement MetricReporter and provide that implementation's class name in the producer-side configuration metric.reporters.

On Mon, Jul 13, 2015 at 9:08 PM, Swati Suman swatisuman1...@gmail.com wrote: Hi Team, we are using Kafka 0.8.2 and I have two questions.
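A minimal sketch of such a reporter, assuming the new producer client's org.apache.kafka.common.metrics.MetricsReporter interface (the exact method set varies slightly across client versions, and the class name LoggingMetricsReporter is made up):

    import java.util.List;
    import java.util.Map;
    import org.apache.kafka.common.metrics.KafkaMetric;
    import org.apache.kafka.common.metrics.MetricsReporter;

    // Toy reporter that prints metric registrations and updates to stdout.
    public class LoggingMetricsReporter implements MetricsReporter {

        @Override
        public void configure(Map<String, ?> configs) {
            // receives the full client config; nothing to do here
        }

        @Override
        public void init(List<KafkaMetric> metrics) {
            // metrics that already exist when the client starts
            for (KafkaMetric metric : metrics) {
                System.out.println("registered: " + metric.metricName());
            }
        }

        @Override
        public void metricChange(KafkaMetric metric) {
            // called when a metric is updated or added
            System.out.println(metric.metricName().name() + " = " + metric.value());
        }

        @Override
        public void close() {
            // release any resources held by the reporter
        }
    }

The reporter is then wired in through the producer config, e.g. metric.reporters=com.example.LoggingMetricsReporter.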
Fetching details from Kafka Server
Hi Team,

We are using Kafka 0.8.2 and I have two questions:

1) Is there any Java API in Kafka that gives me the list of all the consumer groups along with the topic/partition from which they are consuming? Also, is there any way that I can fetch the ZooKeeper list from the Kafka server side? Note: I am able to fetch the above information from ZooKeeper, but I want to fetch it from the Kafka server.

2) I have implemented a custom metrics reporter which implements KafkaMetricsReporter and KafkaMetricsMBeanReporter. It extracts all the server metrics as seen on the page http://docs.confluent.io/1.0/kafka/monitoring.html, but not the producer and consumer metrics. Is there any way I can fetch them from the Kafka server side, or do the producer/consumer need to implement something to be able to fetch/emit them?

I would be very thankful if you could share your thoughts on this. Thanks in advance!!

Best Regards,
Swati Suman
Re: Using Kafka as a persistent store
Yes, consider my e-mail an up vote! I guess the files would automatically moved somewhere else to separate the active from cold segments? Ideally, one could run an unmodified consumer application on the cold segments. --Scott On Mon, Jul 13, 2015 at 6:57 AM, Rad Gruchalski ra...@gruchalski.com wrote: Scott, This is what I was trying to target in one of my previous responses to Daniel. The one in which I suggest another compaction setting for kafka. Kind regards, Radek Gruchalski ra...@gruchalski.com (mailto:ra...@gruchalski.com) (mailto: ra...@gruchalski.com) de.linkedin.com/in/radgruchalski/ ( http://de.linkedin.com/in/radgruchalski/) Confidentiality: This communication is intended for the above-named person and may be confidential and/or legally privileged. If it has come to you in error you must take no action based on it, nor must you copy or show it to anyone; please delete/destroy and inform the sender immediately. On Monday, 13 July 2015 at 15:41, Scott Thibault wrote: We've tried to use Kafka not as a persistent store, but as a long-term archival store. An outstanding issue we've had with that is that the broker holds on to an open file handle on every file in the log! The other issue we've had is when you create a long-term archival log on shared storage, you can't simply access that data from another cluster b/c of meta data being stored in zookeeper rather than in the log. --Scott Thibault On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck daniel.schierb...@gmail.com (mailto:daniel.schierb...@gmail.com) wrote: Would it be possible to document how to configure Kafka to never delete messages in a topic? It took a good while to figure this out, and I see it as an important use case for Kafka. On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck daniel.schierb...@gmail.com (mailto:daniel.schierb...@gmail.com) wrote: On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io (mailto: j...@confluent.io) wrote: If I recall correctly, setting log.retention.ms ( http://log.retention.ms) and log.retention.bytes to -1 disables both. Thanks! On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck daniel.schierb...@gmail.com (mailto:daniel.schierb...@gmail.com) wrote: On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com (mailto:shaynest...@gmail.com) wrote: There are two ways you can configure your topics, log compaction and with no cleaning. The choice depends on your use case. Are the records uniquely identifiable and will they receive updates? Then log compaction is the way to go. If they are truly read only, you can go without log compaction. I'd rather be free to use the key for partitioning, and the records are immutable — they're event records — so disabling compaction altogether would be preferable. How is that accomplished? We have a small processes which consume a topic and perform upserts to our various database engines. It's easy to change how it all works and simply consume the single source of truth again. I've written a bit about log compaction here: http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/ On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck daniel.schierb...@gmail.com (mailto: daniel.schierb...@gmail.com) wrote: I'd like to use Kafka as a persistent store – sort of as an alternative to HDFS. The idea is that I'd load the data into various other systems in order to solve specific needs such as full-text search, analytics, indexing by various attributes, etc. I'd like to keep a single source of truth, however. 
I'm struggling a bit to understand how I can configure a topic to retain messages indefinitely. I want to make sure that my data isn't deleted. Is there a guide to configuring Kafka like this?
Re: Using Kafka as a persistent store
Did this work for you? I set the topic settings to retention.ms=-1 and retention.bytes=-1, and it looks like it is deleting segments immediately. On Sun, Jul 12, 2015 at 8:02 AM, Daniel Schierbeck daniel.schierb...@gmail.com wrote: On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote: If I recall correctly, setting log.retention.ms and log.retention.bytes to -1 disables both. Thanks! On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck daniel.schierb...@gmail.com wrote: On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote: There are two ways you can configure your topics: log compaction, or no cleaning at all. The choice depends on your use case. Are the records uniquely identifiable and will they receive updates? Then log compaction is the way to go. If they are truly read-only, you can go without log compaction. I'd rather be free to use the key for partitioning, and the records are immutable — they're event records — so disabling compaction altogether would be preferable. How is that accomplished? We have small processes which consume a topic and perform upserts to our various database engines. It's easy to change how it all works and simply consume the single source of truth again. I've written a bit about log compaction here: http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/ On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck daniel.schierb...@gmail.com wrote: I'd like to use Kafka as a persistent store – sort of as an alternative to HDFS. The idea is that I'd load the data into various other systems in order to solve specific needs such as full-text search, analytics, indexing by various attributes, etc. I'd like to keep a single source of truth, however. I'm struggling a bit to understand how I can configure a topic to retain messages indefinitely. I want to make sure that my data isn't deleted. Is there a guide to configuring Kafka like this?
performance benchmarking of kafka
Hi guys, I am trying to replicate the Kafka benchmarking test at http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines . When I run bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test7 5000 100 -1 acks=1 bootstrap.servers=192.168.1.1:9092 buffer.memory=67108864 batch.size=8196, I get the following error: Error: Could not find or load main class org.apache.kafka.client.tools.ProducerPerformance What should I fix? Thank you!
Re: performance benchmarking of kafka
Thank you. I see that run-class.sh has the following lines: for file in $base_dir/clients/build/libs/kafka-clients*.jar; do CLASSPATH=$CLASSPATH:$file; done So I believe all the jars in the libs/ directory have already been included in the classpath? In which directory does the ProducerPerformance class reside? Thanks. On Mon, Jul 13, 2015 at 4:37 PM, JIEFU GONG jg...@berkeley.edu wrote: You may need to open up your run-class.sh in a text editor and modify the classpath -- I believe I had a similar error before. On Mon, Jul 13, 2015 at 1:16 PM, Yuheng Du yuheng.du.h...@gmail.com wrote: Hi guys, I am trying to replicate the Kafka benchmarking test at http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines . When I run bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test7 5000 100 -1 acks=1 bootstrap.servers=192.168.1.1:9092 buffer.memory=67108864 batch.size=8196, I get the following error: Error: Could not find or load main class org.apache.kafka.client.tools.ProducerPerformance What should I fix? Thank you! -- Jiefu Gong University of California, Berkeley | Class of 2017 B.A Computer Science | College of Letters and Sciences jg...@berkeley.edu elise...@berkeley.edu | (925) 400-3427
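One quick way to check where the class actually lives is a generic jar listing (not something suggested in the thread; the jar name assumes the 0.8.2.1 binary distribution):

    unzip -l libs/kafka-clients-0.8.2.1.jar | grep ProducerPerformance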
Re: performance benchmarking of kafka
I am using the binaries of kafka_2.10-0.8.2.1. Could that be the problem? Should I copy the source of kafka-0.8.2.1-src.tgz to each of my machines, build it, and run the test? Thanks. On Mon, Jul 13, 2015 at 4:37 PM, JIEFU GONG jg...@berkeley.edu wrote: You may need to open up your run-class.sh in a text editor and modify the classpath -- I believe I had a similar error before. On Mon, Jul 13, 2015 at 1:16 PM, Yuheng Du yuheng.du.h...@gmail.com wrote: Hi guys, I am trying to replicate the Kafka benchmarking test at http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines . When I run bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test7 5000 100 -1 acks=1 bootstrap.servers=192.168.1.1:9092 buffer.memory=67108864 batch.size=8196, I get the following error: Error: Could not find or load main class org.apache.kafka.client.tools.ProducerPerformance What should I fix? Thank you! -- Jiefu Gong University of California, Berkeley | Class of 2017 B.A Computer Science | College of Letters and Sciences jg...@berkeley.edu elise...@berkeley.edu | (925) 400-3427
Re: performance benchmarking of kafka
You may need to open up your run-class.sh in a text editor and modify the classpath -- I believe I had a similar error before. On Mon, Jul 13, 2015 at 1:16 PM, Yuheng Du yuheng.du.h...@gmail.com wrote: Hi guys, I am trying to replicate the Kafka benchmarking test at http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines . When I run bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test7 5000 100 -1 acks=1 bootstrap.servers=192.168.1.1:9092 buffer.memory=67108864 batch.size=8196, I get the following error: Error: Could not find or load main class org.apache.kafka.client.tools.ProducerPerformance What should I fix? Thank you! -- Jiefu Gong University of California, Berkeley | Class of 2017 B.A Computer Science | College of Letters and Sciences jg...@berkeley.edu elise...@berkeley.edu | (925) 400-3427
Offset not committed
I am trying to replace ActiveMQ with Kafka in our environment; however, I have encountered a strange problem that basically prevents us from using Kafka in production. The problem is that sometimes the offsets are not committed. I am using Kafka 0.8.2.1, offset storage = kafka, the high-level consumer, and auto-commit = off. Every N messages I issue commitOffsets(). Now here is the problem: if N is below a certain number (180,000 for me), it works and the offset is moving. If N is 180,000 or more, the offset is not updated after commitOffsets(). I am looking at offsets using kafka-run-class.sh kafka.tools.ConsumerOffsetChecker Any help?
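For reference, a minimal sketch of this kind of setup (0.8.2 high-level consumer, offsets stored in Kafka, auto-commit off, manual commitOffsets() every N messages); the ZooKeeper address, group id, topic, and commit interval are placeholders, not the reporter's actual values:

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;
    import kafka.message.MessageAndMetadata;

    public class ManualCommitConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "localhost:2181"); // placeholder
            props.put("group.id", "my-group");                // placeholder
            props.put("offsets.storage", "kafka");            // offset storage = kafka
            props.put("auto.commit.enable", "false");         // commit manually
            ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

            Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap("my-topic", 1));
            KafkaStream<byte[], byte[]> stream = streams.get("my-topic").get(0);

            final int commitInterval = 10000; // the "N" under test
            int n = 0;
            for (MessageAndMetadata<byte[], byte[]> msg : stream) {
                // ... process msg.message() ...
                if (++n % commitInterval == 0) {
                    connector.commitOffsets(); // should advance the group's offsets
                }
            }
        }
    }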
Re: Using Kafka as a persistent store
Using -1 for log.retention.ms will only work as of 0.8.3 (https://issues.apache.org/jira/browse/KAFKA-1990). 2015-07-13 17:08 GMT-03:00 Shayne S shaynest...@gmail.com: Did this work for you? I set the topic settings to retention.ms=-1 and retention.bytes=-1, and it looks like it is deleting segments immediately. On Sun, Jul 12, 2015 at 8:02 AM, Daniel Schierbeck daniel.schierb...@gmail.com wrote: On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote: If I recall correctly, setting log.retention.ms and log.retention.bytes to -1 disables both. Thanks! On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck daniel.schierb...@gmail.com wrote: On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote: There are two ways you can configure your topics: log compaction, or no cleaning at all. The choice depends on your use case. Are the records uniquely identifiable and will they receive updates? Then log compaction is the way to go. If they are truly read-only, you can go without log compaction. I'd rather be free to use the key for partitioning, and the records are immutable — they're event records — so disabling compaction altogether would be preferable. How is that accomplished? We have small processes which consume a topic and perform upserts to our various database engines. It's easy to change how it all works and simply consume the single source of truth again. I've written a bit about log compaction here: http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/ On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck daniel.schierb...@gmail.com wrote: I'd like to use Kafka as a persistent store – sort of as an alternative to HDFS. The idea is that I'd load the data into various other systems in order to solve specific needs such as full-text search, analytics, indexing by various attributes, etc. I'd like to keep a single source of truth, however. I'm struggling a bit to understand how I can configure a topic to retain messages indefinitely. I want to make sure that my data isn't deleted. Is there a guide to configuring Kafka like this?
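Until 0.8.3, one workaround on 0.8.2 (a suggestion based on the per-topic retention.ms override, not something confirmed in this thread) is to set the topic-level retention to a very large value rather than -1; the topic name and ZooKeeper address below are placeholders:

    # Effectively disables time-based deletion by pushing retention far into the future.
    bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic events --config retention.ms=9223372036854775807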
New producer and ordering of Callbacks when sending to multiple partitions
Hi, I'm trying to understand the new producer and the order in which the callbacks will be called. From my understanding, records are batched up per partition, so all records destined for a specific partition will be sent in order, which means that their callbacks will be called in order. What about message batches that cover multiple partitions? E.g., if I send three messages each to three partitions A, B, and C, in the following order: A1 A2 A3 B1 B2 B3 C1 C2 C3 Then is it possible that messages B1 B2 B3 will be sent prior to A1 A2 A3? That would mean the callbacks for B1 B2 B3 are also called prior to the ones for A1 A2 A3? Thanks, -James
Re: New producer and ordering of Callbacks when sending to multiple partitions
James, There are separate queues for each partition, so there are no guarantees on the order of the sends (or callbacks) between partitions. (Actually, IIRC, the code intentionally randomizes the partition order a bit, possibly to avoid starvation.) Gwen On Mon, Jul 13, 2015 at 5:41 PM, James Cheng jch...@tivo.com wrote: Hi, I'm trying to understand the new producer and the order in which the callbacks will be called. From my understanding, records are batched up per partition, so all records destined for a specific partition will be sent in order, which means that their callbacks will be called in order. What about message batches that cover multiple partitions? E.g., if I send three messages each to three partitions A, B, and C, in the following order: A1 A2 A3 B1 B2 B3 C1 C2 C3 Then is it possible that messages B1 B2 B3 will be sent prior to A1 A2 A3? That would mean the callbacks for B1 B2 B3 are also called prior to the ones for A1 A2 A3? Thanks, -James
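A small sketch illustrating the behavior with the 0.8.2 Java producer: callbacks arrive in send order within a partition, but the interleaving across partitions is unspecified. The topic name, broker address, and explicit partition ids are placeholders:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.Callback;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class CallbackOrdering {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            KafkaProducer<String, String> producer = new KafkaProducer<>(props);

            // Assumes a topic with at least 3 partitions; A/B/C map to partitions 0/1/2.
            for (int p = 0; p < 3; p++) {
                for (int i = 1; i <= 3; i++) {
                    final String label = (char) ('A' + p) + Integer.toString(i);
                    producer.send(
                        new ProducerRecord<String, String>("test-topic", p, null, label),
                        new Callback() {
                            public void onCompletion(RecordMetadata md, Exception e) {
                                // Within one partition these print in send order;
                                // across partitions the interleaving can vary.
                                System.out.println("acked " + label);
                            }
                        });
                }
            }
            producer.close();
        }
    }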