Hello all,

This is a very interesting discussion. I’ve been thinking about a similar use case for Kafka over the last few days. The usual data workflow with Kafka is most likely something like this:
- ingest with Kafka
- process with Storm / Samza / whathaveyou
- put some processed data back on Kafka
- at the same time, store the raw data somewhere (HDFS or similar?) in case everything has to be reprocessed in the future

Currently Kafka offers a couple of types of topics: a regular stream (non-compacted topic) and a compacted topic (key/value). In the case of a stream topic, when retention cleanup kicks in, the "old" data is truncated. It is lost from Kafka.

What if there was an additional cleanup setting: cold-store? Instead of trimming the old data, Kafka would compile it into a separate log with its own index. The user would be free to decide what to do with such files: put them on NFS / S3 / Swift / HDFS...

Actually, the index file is not even needed. The only three things required are:

- the folder name / partition index
- the log itself
- the topic metadata at the time of taking the data out of the segment

With all this info, reading the data back is fairly easy, even without starting Kafka. A sample program goes like this (scala-ish, using Kafka's internal kafka.log classes):

    import java.io.File
    import java.util.Properties
    import kafka.log.{Log, LogConfig}

    val props = new Properties()
    props.put("log.segment.bytes", "1073741824")
    props.put("segment.index.bytes", "10485760") // 10MB
    // Build a LogConfig from the props; the exact factory differs between
    // Kafka versions, so treat this line as a sketch.
    val cfg = LogConfig(props)
    val log = new Log(new File("/somestorage/kafka-test-0"), cfg, 0L, null)

    val fdi = log.activeSegment.read(log.logStartOffset, Some(log.logEndOffset), 1000000)
    var msgs = 1
    fdi.messageSet.iterator.foreach { msgoffset =>
      println(s"${msgoffset.message.hasKey} ::: > $msgs ::::> ${msgoffset.offset} :::::: ${msgoffset.nextOffset}")
      msgs = msgs + 1
      val key = new String(msgoffset.message.key.array(), "UTF-8")
      val msg = new String(msgoffset.message.payload.array(), "UTF-8")
      println(s" === ${key}")
      println(s" === ${msg}")
    }

This reads from the active segment (the last known segment), but it’s easy to make it read from all segments. The interesting thing is: as long as the backed-up files are well formed, they can be read without having to put them back into Kafka itself.
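The all-segments variant mentioned above is a small change. A sketch, with the same caveat: kafka.log.Log / LogSegment are internal classes, the method names follow the 0.8.x-era API, and they may differ in other versions:

```scala
// Iterate every segment of the log (oldest first, reusing `log` from the
// snippet above) and dump key/value pairs from each one.
log.logSegments.foreach { segment =>
  // Read from the segment's base offset with no upper bound, up to ~1MB.
  val fdi = segment.read(segment.baseOffset, None, 1000000)
  fdi.messageSet.iterator.foreach { msgoffset =>
    val key = new String(msgoffset.message.key.array(), "UTF-8")
    val msg = new String(msgoffset.message.payload.array(), "UTF-8")
    println(s"${msgoffset.offset} -> $key : $msg")
  }
}
```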
The advantage is: what was once the raw data (as it came in) remains the raw data forever, without having to introduce another format for storing it. Another advantage: in case of reprocessing, there is no need to write a producer to ingest the data back, and so on (it’s possible, just not necessary). Such raw Kafka files can easily be processed by Storm / Samza (would need another stream definition) / Hadoop.

This sounds like a very useful addition to Kafka. But I could be overthinking this...

Kind regards,
Radek Gruchalski
ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

Confidentiality: This communication is intended for the above-named person and may be confidential and/or legally privileged. If it has come to you in error you must take no action based on it, nor must you copy or show it to anyone; please delete/destroy and inform the sender immediately.

On Friday, 10 July 2015 at 22:55, Daniel Schierbeck wrote:
>
> > On 10. jul. 2015, at 15.16, Shayne S <shaynest...@gmail.com> wrote:
> >
> > There are two ways you can configure your topics: log compaction and with
> > no cleaning. The choice depends on your use case. Are the records uniquely
> > identifiable and will they receive updates? Then log compaction is the way
> > to go. If they are truly read only, you can go without log compaction.
>
> I'd rather be free to use the key for partitioning, and the records are
> immutable — they're event records — so disabling compaction altogether would
> be preferable. How is that accomplished?
>
> > We have a small process which consumes a topic and performs upserts to our
> > various database engines. It's easy to change how it all works and simply
> > consume the single source of truth again.
> >
> > I've written a bit about log compaction here:
> > http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
> >
> > On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck <
> > daniel.schierb...@gmail.com> wrote:
> >
> > > I'd like to use Kafka as a persistent store – sort of as an alternative to
> > > HDFS. The idea is that I'd load the data into various other systems in
> > > order to solve specific needs such as full-text search, analytics,
> > > indexing by various attributes, etc. I'd like to keep a single source of
> > > truth, however.
> > >
> > > I'm struggling a bit to understand how I can configure a topic to retain
> > > messages indefinitely. I want to make sure that my data isn't deleted. Is
> > > there a guide to configuring Kafka like this?
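For reference on the question quoted above: indefinite retention is a per-topic setting. A sketch of creating such a topic with the tooling of that era (topic name and sizing here are made up; flag and config names should be checked against the Kafka documentation for your version):

```shell
# Create a topic whose messages are never deleted:
#   retention.ms=-1       disables time-based deletion
#   retention.bytes=-1    (the default) disables size-based deletion
#   cleanup.policy=delete (the default) keeps log compaction off
kafka-topics.sh --zookeeper localhost:2181 --create \
  --topic events --partitions 4 --replication-factor 2 \
  --config retention.ms=-1 --config retention.bytes=-1 \
  --config cleanup.policy=delete
```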