Re: Using Kafka as a persistent store
Thanks, I'm on 0.8.2, so that explains it. Should retention.ms affect segment rolling? In my experiment it did (retention.ms=-1), which was unexpected, since I thought only segment.bytes and segment.ms would control that.

On Mon, Jul 13, 2015 at 7:57 PM, Daniel Tamai daniel.ta...@gmail.com wrote: Using -1 for log.retention.ms should only work as of 0.8.3 (https://issues.apache.org/jira/browse/KAFKA-1990).
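For reference, segment rolling and retention are separate topic-level settings; a minimal sketch of setting them with the stock kafka-topics.sh tool (the topic name and ZooKeeper address are placeholders):

    # Rolling is governed by segment.bytes / segment.ms; retention.* only decides
    # when already-rolled segments become eligible for deletion.
    bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic events \
        --config segment.bytes=1073741824 \
        --config segment.ms=604800000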
Re: Using Kafka as a persistent store
Would it be possible to document how to configure Kafka to never delete messages in a topic? It took a good while to figure this out, and I see it as an important use case for Kafka.
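Until such documentation exists, a minimal sketch of the configuration this thread converges on, assuming a broker with the KAFKA-1990 fix (the topic name and ZooKeeper address are placeholders):

    # Never delete: disable both time- and size-based retention, and leave the
    # default cleanup.policy=delete, i.e. no compaction.
    bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic events \
        --config retention.ms=-1 \
        --config retention.bytes=-1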
Re: Using Kafka as a persistent store
We've tried to use Kafka not as a persistent store, but as a long-term archival store. An outstanding issue we've had with that is that the broker holds on to an open file handle for every file in the log! The other issue we've had is that when you create a long-term archival log on shared storage, you can't simply access that data from another cluster, because the metadata is stored in ZooKeeper rather than in the log.

--Scott Thibault
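A quick way to observe the open-handle behavior on a running broker; the log path below is a placeholder, and lsof output varies by platform:

    # Count the handles the broker process holds on one partition's directory.
    BROKER_PID=$(pgrep -f kafka.Kafka)
    lsof -p "$BROKER_PID" | grep '/var/kafka-logs/events-0' | wc -l
    # If this grows with the number of segments, raise the broker's file
    # descriptor limit (e.g. ulimit -n 100000) before starting it.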
Re: Using Kafka as a persistent store
Scott,

This is what I was trying to target in one of my previous responses to Daniel: the one in which I suggest another compaction setting for Kafka.

Kind regards,
Radek Gruchalski
ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

On Monday, 13 July 2015 at 15:41, Scott Thibault wrote:
Re: Using Kafka as a persistent store
Hi,

1. What you described sounds like a reasonable architecture, but may I ask why JSON? Avro seems better supported in the ecosystem (Confluent's tools, Hadoop integration, schema evolution, etc.).

1.5. If all you do is convert data to JSON, Spark Streaming sounds like difficult-to-manage overkill compared to Flume or a slightly modified MirrorMaker (or Copycat, if it exists yet). Any specific reason for Spark Streaming?

2. Different compute engines prefer different storage formats, because in most cases that's where their optimizations come from. Parquet improves scan performance for Impala and MapReduce, but would be pretty horrible for NoSQL. So I wouldn't hold my breath waiting for compute engines to suddenly start sharing data storage.

Gwen

On Mon, Jul 13, 2015 at 11:45 AM, Tim Smith secs...@gmail.com wrote:
Re: Using Kafka as a persistent store
For what it's worth, I did something similar to Rad's suggestion of cold storage to add long-term archiving when using Amazon Kinesis. Kinesis is also a message bus, but it only has a 24-hour retention window. I wrote a Kinesis consumer that would take all messages from Kinesis and save them into S3. I stored them in S3 in such a way that the structure mirrors the original Kinesis stream, and all message metadata is preserved (message offsets and primary keys, for example). This means that I can write a consumer that consumes the S3 files in the same way that it would consume the Kinesis stream itself. And the data is structured such that when you are done reading from S3, you can connect to the Kinesis stream at the point where the S3 archive left off. This effectively allowed me to add a configurable retention period when consuming from Kinesis.

-James

On Jul 13, 2015, at 11:45 AM, Tim Smith secs...@gmail.com wrote:
Re: Using Kafka as a persistent store
I have had a similar issue, where I wanted a single source of truth between Search and HDFS.

First, if you zoom out a little: eventually you are going to have some compute engine(s) process the data. If you store it in a compute-neutral tier like Kafka, then you will need to pull the data out at runtime and stage it for the compute engine to use. So pick your poison: process at ingest and store multiple copies of the data, one per compute engine, OR store it in a neutral store and process at runtime. I am not saying one is better than the other; that's just how I see the trade-off, so depending on your use cases, YMMV.

What I do is:
- store raw data into Kafka
- use Spark Streaming to transform the data to JSON and post it back to Kafka
- hang multiple data stores off Kafka that ingest the JSON
- do no other transformations in the consumer stores, and store each copy as immutable events

So I do have multiple copies (one per compute tier), but they all look the same. Unless different compute engines natively start to use a common data storage format, I don't see how one could get away from storing multiple copies. Primarily, I see the Lucene-based products having their own format, the Hadoop ecosystem congregating around Parquet, and the NoSQL players having their own formats (one per product).

My 2 cents worth :)

On Mon, Jul 13, 2015 at 10:35 AM, Daniel Schierbeck daniel.schierb...@gmail.com wrote:
Re: Using Kafka as a persistent store
Sounds like the same idea. The nice thing about having such an option is that, with a correct application of containers and a backup-and-restore strategy, one can create an infinite, ordered backup of the raw input stream using Kafka's native storage format.

I understand the point of having the data in other formats in other systems; impossible to get away from that. The concept I presented a few days ago is meant to address having "multiple same-looking copies of the truth". At the end of the day, if something happens to the operational data, it will have to be recreated from the truth. But if the data was once ingested over Kafka and there is already a pipeline for building operational state from Kafka, why would someone write separate processing logic to get the truth from, say, Hadoop? And if fast, parallel processing of the native Kafka format is required, it can still be done with Samza or Hadoop or whathaveyou.

Kind regards,
Radek Gruchalski
ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

On Monday, 13 July 2015 at 21:17, James Cheng wrote:
Re: Using Kafka as a persistent store
Indeed, the files would have to be moved to some separate, dedicated storage. There are basically three options, as Kafka does not allow adding logs at runtime:

1. make the consumer able to read from an arbitrary file
2. add the ability to drop files in (I believe this adds a lot of complexity)
3. read the files with another program, as suggested in my first email (see the sketch below)

I'd love to get some input from someone who knows the code and the options a bit better!

Kind regards,
Radek Gruchalski
ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

On Monday, 13 July 2015 at 18:02, Scott Thibault wrote:
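As a sketch of option 3: Kafka ships with a DumpLogSegments tool that can print the contents of a segment file without a running broker. The archived path below is hypothetical, and flag availability may differ between 0.8.x releases:

    bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
        --files /archive/events-0/00000000000000000000.log --print-data-log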
Re: Using Kafka as a persistent store
Am I correct in assuming that Kafka will only retain a file handle for the last segment of the log? If the number of handles grows unbounded, then it would be an issue. But I plan on writing to this topic continuously anyway, so not separating the data into cold and hot storage is the entire point.

Daniel Schierbeck

On 13. jul. 2015, at 15.41, Scott Thibault scott.thiba...@multiscalehn.com wrote:
Re: Using Kafka as a persistent store
Yes, consider my e-mail an upvote! I guess the files would automatically be moved somewhere else, to separate the active segments from the cold ones? Ideally, one could run an unmodified consumer application on the cold segments.

--Scott

On Mon, Jul 13, 2015 at 6:57 AM, Rad Gruchalski ra...@gruchalski.com wrote:
Re: Using Kafka as a persistent store
Did this work for you? I set the topic settings to retention.ms=-1 and retention.bytes=-1, and it looks like it is deleting segments immediately.

On Sun, Jul 12, 2015 at 8:02 AM, Daniel Schierbeck daniel.schierb...@gmail.com wrote:
Re: Using Kafka as a persistent store
Using -1 for log.retention.ms should only work as of 0.8.3 (https://issues.apache.org/jira/browse/KAFKA-1990).

2015-07-13 17:08 GMT-03:00 Shayne S shaynest...@gmail.com:
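One workaround on brokers that predate that fix (not from this thread, but consistent with how retention works) is to use a very large finite value instead of -1; the numbers below are arbitrary placeholders:

    # Roughly 100 years of time-based retention: effectively "never delete" on 0.8.2.
    bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic events \
        --config retention.ms=3153600000000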
Re: Using Kafka as a persistent store
On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote: If I recall correctly, setting log.retention.ms and log.retention.bytes to -1 disables both.

Thanks!
Re: Using Kafka as a persistent store
Radek: I don't see how data could be stored more efficiently than in Kafka itself. It's optimized for cheap storage and offers high-performance bulk export, exactly what you want from long-term archival.

On Fri, 10. jul. 2015 at 23.16, Rad Gruchalski ra...@gruchalski.com wrote:

Hello all,

This is a very interesting discussion. I've been thinking of a similar use case for Kafka over the last few days. The usual data workflow with Kafka is most likely something like this:
- ingest with Kafka
- process with Storm / Samza / whathaveyou
- put some processed data back on Kafka
- at the same time, store the raw data somewhere in case everything has to be reprocessed in the future (HDFS, similar?)

Currently Kafka offers a couple of types of topics: the regular stream (non-compacted topic) and the compacted topic (key/value). In the case of a stream topic, when the cleanup kicks in, the "old" data is truncated. It is lost from Kafka. What if there was an additional compaction setting: cold-store? Instead of trimming old data, Kafka would compile the old data into a separate log with its own index. The user would be free to decide what to do with such files: put them on NFS / S3 / Swift / HDFS...

Actually, the index file is not needed. The only three things needed are:
- the folder name / partition index
- the log itself
- the topic metadata at the time of taking the data out of the segment

With all this info, reading the data back is fairly easy, even without starting Kafka. A sample program goes like this (Scala-ish, against Kafka's internal kafka.log.Log API; cfg is a LogConfig built from the properties):

    val props = new Properties()
    props.put("log.segment.bytes", "1073741824")
    props.put("segment.index.bytes", "10485760") // should be 10MB
    val log = new Log(new File("/somestorage/kafka-test-0"), cfg, 0L, null)
    val fdi = log.activeSegment.read(log.logStartOffset, Some(log.logEndOffset), 100)
    var msgs = 1
    fdi.messageSet.iterator.foreach { msgoffset =>
      println(s"${msgoffset.message.hasKey} ::: $msgs ${msgoffset.offset} :: ${msgoffset.nextOffset}")
      msgs = msgs + 1
      val key = new String(msgoffset.message.key.array(), "UTF-8")
      val msg = new String(msgoffset.message.payload.array(), "UTF-8")
      println(s"=== ${key}")
      println(s"=== ${msg}")
    }

This reads from the active segment (the last known segment), but it's easy to make it read from all segments. The interesting thing is that as long as the backed-up files are well formed, they can be read without having to put them back into Kafka itself. The advantage is that what was once the raw data (as it came in) is the raw data forever, without having to introduce another format for storing it. Another advantage: in case of reprocessing, there is no need to write a producer to ingest the data back, and so on (it's possible, but not necessary). Such raw Kafka files can easily be processed by Storm / Samza (it would need another stream definition) / Hadoop.

This sounds like a very useful addition to Kafka. But I could be overthinking this...

Kind regards,
Radek Gruchalski
ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/
Re: Using Kafka as a persistent store
Daniel,

I understand your point. From what I understand, the mode that suits you is what Jay suggested: log.retention.ms and log.retention.bytes both set to -1.

A few questions before I continue on something that may already be possible:

1. Is it possible to attach additional storage without having to restart Kafka? (see the sketch below)
2. If the answer to 1 is yes: will Kafka continue the topic on the new storage if all attached disks are full? Or is the assumption that one data dir = one topic/partition (the code suggests so)?
3. If the answer to 1 is no: is it possible to take segments out without having to restart Kafka?

Kind regards,
Radek Gruchalski
ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

On Saturday, 11 July 2015 at 22:22, Daniel Schierbeck wrote:
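On question 1: the broker's data directories come from the static log.dirs entry in server.properties, and in the 0.8.x line changing that list requires a broker restart. A sketch, with placeholder paths:

    # server.properties: multiple data directories; each partition's log lives
    # entirely within one of them.
    log.dirs=/disk1/kafka-logs,/disk2/kafka-logs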
Using Kafka as a persistent store
I'd like to use Kafka as a persistent store – sort of as an alternative to HDFS. The idea is that I'd load the data into various other systems in order to solve specific needs such as full-text search, analytics, indexing by various attributes, etc. I'd like to keep a single source of truth, however. I'm struggling a bit to understand how I can configure a topic to retain messages indefinitely. I want to make sure that my data isn't deleted. Is there a guide to configuring Kafka like this?
Re: Using Kafka as a persistent store
There are two ways you can configure your topics: with log compaction, or with no cleaning at all. The choice depends on your use case. Are the records uniquely identifiable, and will they receive updates? Then log compaction is the way to go. If they are truly read-only, you can go without log compaction. We have small processes which consume a topic and perform upserts to our various database engines. It's easy to change how it all works and simply consume the single source of truth again. I've written a bit about log compaction here: http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
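For reference, a minimal sketch of the two configurations using the 0.8.x Scala admin API (the ZooKeeper address, topic names, and partition/replica counts are hypothetical, and whether -1 is honored as "keep forever" depends on the broker version):

    import java.util.Properties
    import org.I0Itec.zkclient.ZkClient
    import kafka.admin.AdminUtils
    import kafka.utils.ZKStringSerializer

    val zkClient = new ZkClient("localhost:2181", 30000, 30000, ZKStringSerializer)

    // Option 1: log compaction; the latest record per key is kept forever
    val compacted = new Properties()
    compacted.put("cleanup.policy", "compact")
    AdminUtils.createTopic(zkClient, "entities", 8, 2, compacted)

    // Option 2: no cleaning; records are never deleted
    val keepAll = new Properties()
    keepAll.put("retention.ms", "-1")
    keepAll.put("retention.bytes", "-1")
    AdminUtils.createTopic(zkClient, "events-raw", 8, 2, keepAll)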
Re: Using Kafka as a persistent store
I don't want to endorse this use of Kafka, but assuming you can give your messages unique identifiers, I believe using log compaction will keep all unique messages forever. You can read about how consumer offsets stored in Kafka are managed using a compacted topic here: http://kafka.apache.org/documentation.html#distributionimpl In that case, the consumer group id + topic + partition forms a unique message id, and the brokers read that topic into the offsets cache on startup (and take updates to the offsets cache via the same topic). If you have a finite, smallish data set that you want indexed in multiple systems, that might be a good approach. If your data can grow without bound, Kafka doesn't seem like a good choice to me: even with compaction, you will still have to read it all sequentially, message by message, to get it into a different system. As far as I know, there is no lookup by id, and even seeking to a specific date is a manual O(log n) process. (Warning: I'm just another user, so I may have a few things wrong.)
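To make the "unique identifiers" idea concrete, here is a minimal sketch using the new producer API, with a random UUID as the record key (the topic name and value format are hypothetical). On a topic with cleanup.policy=compact, compaction retains the newest value for each key, so uniquely keyed records are effectively kept forever:

    import java.util.{Properties, UUID}
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // The key is the record's identity; compaction keeps the latest value per key.
    val id = UUID.randomUUID().toString
    producer.send(new ProducerRecord[String, String]("events", id, """{"type":"signup"}"""))
    producer.close()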
Re: Using Kafka as a persistent store
On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote: There are two ways you can configure your topics: with log compaction, or with no cleaning at all. The choice depends on your use case. Are the records uniquely identifiable, and will they receive updates? Then log compaction is the way to go. If they are truly read-only, you can go without log compaction.
I'd rather be free to use the key for partitioning, and the records are immutable — they're event records — so disabling compaction altogether would be preferable. How is that accomplished?
Re: Using Kafka as a persistent store
If I recall correctly, setting log.retention.ms and log.retention.bytes to -1 disables both time-based and size-based retention.
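As a broker-level server.properties sketch (support for -1 varies between broker versions, so verify against your release):

    # server.properties: disable time- and size-based retention
    log.retention.ms=-1
    log.retention.bytes=-1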
Re: Using Kafka as a persistent store
Hello all,
This is a very interesting discussion. I've been thinking of a similar use case for Kafka over the last few days. The usual data workflow with Kafka is most likely something like this:
- ingest with Kafka
- process with Storm / Samza / whathaveyou
- put some processed data back on Kafka
- at the same time, store the raw data somewhere in case everything has to be reprocessed in the future (HDFS or similar?)
Currently Kafka offers two types of topics: a regular stream (non-compacted topic) and a compacted topic (key/value). In the case of a stream topic, when the retention cleanup kicks in, the "old" data is truncated: it is lost from Kafka. What if there were an additional cleanup policy: cold-store? Instead of trimming old data, Kafka would compile it into a separate log with its own index. The user would be free to decide what to do with such files: put them on NFS / S3 / Swift / HDFS…
Actually, the index file is not needed. The only three things needed are:
- the folder name / partition index
- the log itself
- the topic metadata at the time of taking the data out of the segment
With all this info, reading the data back is fairly easy, even without starting Kafka. A sample program goes like this (scala-ish):

    val props = new Properties()
    props.put("log.segment.bytes", "1073741824")
    props.put("segment.index.bytes", "10485760") // should be 10 MB
    // cfg: a kafka.log.LogConfig built from the properties above
    val log = new Log(new File("/somestorage/kafka-test-0"), cfg, 0L, null)
    val fdi = log.activeSegment.read(log.logStartOffset, Some(log.logEndOffset), 100)
    var msgs = 1
    fdi.messageSet.iterator.foreach { msgoffset =>
      println(s"${msgoffset.message.hasKey} ::: $msgs ${msgoffset.offset} :: ${msgoffset.nextOffset}")
      msgs += 1
      val key = new String(msgoffset.message.key.array(), "UTF-8")
      val msg = new String(msgoffset.message.payload.array(), "UTF-8")
      println(s" === ${key}")
      println(s" === ${msg}")
    }

This reads from the active segment (the last known segment), but it's easy to make it read from all segments. The interesting thing is that as long as the backup files are well formed, they can be read without having to put them back into Kafka itself. The advantage: what was once the raw data (as it came in) is the raw data forever, without having to introduce another format for storing it. Another advantage: in case of reprocessing, there is no need to write a producer to ingest the data back (it's possible, but not necessary). Such raw Kafka files can easily be processed by Storm / Samza (it would need another stream definition) / Hadoop.
This sounds like a very useful addition to Kafka. But I could be overthinking this...
Kind regards, Radek Gruchalski ra...@gruchalski.com de.linkedin.com/in/radgruchalski/
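To illustrate reading a backed-up segment file without starting Kafka at all, here is a minimal sketch against the 0.8.x internals. Note the assumptions: kafka.log.FileMessageSet is an internal, version-specific class, the path is hypothetical, and the shallow iterator assumes uncompressed messages:

    import java.io.File
    import kafka.log.FileMessageSet

    // A raw segment file copied out of a partition directory.
    val segmentFile = new File("/somestorage/kafka-test-0/00000000000000000000.log")
    val messages = new FileMessageSet(segmentFile)
    try {
      messages.iterator.foreach { entry =>
        // payload is a ByteBuffer view; copy out only the remaining bytes.
        val buf = entry.message.payload
        val bytes = new Array[Byte](buf.remaining())
        buf.get(bytes)
        println(s"offset=${entry.offset} value=${new String(bytes, "UTF-8")}")
      }
    } finally {
      messages.close()
    }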