Hello all,

This is a very interesting discussion. I've been thinking about a similar use case 
for Kafka over the last few days.
The usual data workflow with Kafka most likely looks something like this:

- ingest with Kafka
- process with Storm / Samza / whathaveyou
  - put some processed data back on Kafka
  - at the same time, store the raw data somewhere (HDFS or similar?) in case 
everything has to be reprocessed in the future

Currently Kafka offers two types of topics: a regular stream (non-compacted topic) 
and a compacted topic (key/value). In the case of a stream topic, when retention 
kicks in, the "old" data is truncated; it is lost from Kafka. What if there were an 
additional cleanup setting: cold-store? Instead of trimming old data, Kafka would 
compile the old data into a separate log with its own index. The user would be free 
to decide what to do with such files: put them on NFS / S3 / Swift / HDFS… Actually, 
the index file is not even needed. Only three things are required (a rough sketch of 
such a backup step follows the list below):

 - the folder name / partition index
 - the log itself
 - topic metadata at the time of taking the data out of the segment
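
To make the idea a bit more concrete, here is a minimal sketch (not existing Kafka 
functionality) of what such a backup step could do: copy the rolled, i.e. no longer 
active, .log segments of a partition directory to cold storage, together with a 
small metadata file. The paths, the topic name and the metadata format are all 
assumptions made up for illustration.

    import java.io.File
    import java.nio.file.{Files, StandardCopyOption}

    object ColdStoreSketch extends App {
      // assumed locations: a local partition directory and a cold-storage mount
      val partitionDir = new File("/var/kafka-logs/kafka-test-0") // topic "kafka-test", partition 0
      val coldDir      = new File("/somestorage/kafka-test-0")    // NFS / S3 mount / HDFS fuse, ...
      coldDir.mkdirs()

      // archive every .log segment except the newest one (the active segment)
      val logFiles = Option(partitionDir.listFiles()).getOrElse(Array.empty[File])
        .filter(_.getName.endsWith(".log"))
        .sortBy(_.getName)
      logFiles.dropRight(1).foreach { seg =>
        Files.copy(seg.toPath, new File(coldDir, seg.getName).toPath, StandardCopyOption.REPLACE_EXISTING)
      }

      // minimal topic metadata captured at backup time (the format is an assumption)
      val meta = s"topic=kafka-test\npartition=0\nbackedUpAt=${System.currentTimeMillis()}\n"
      Files.write(new File(coldDir, "topic-metadata.properties").toPath, meta.getBytes("UTF-8"))
    }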

With these three pieces of information, reading the data back is fairly easy, even 
without starting Kafka. A sample program goes something like this (Scala-ish, using 
Kafka's internal log classes):

    import java.io.File
    import java.util.Properties
    import kafka.log.{Log, LogConfig}

    // uses Kafka's internal log classes (written against ~0.8.x; signatures differ between versions)
    val props = new Properties()
    props.put("segment.bytes", "1073741824")     // topic-level name for the broker's log.segment.bytes (1 GB)
    props.put("segment.index.bytes", "10485760") // 10 MB
    val cfg = LogConfig.fromProps(props)

    val log = new Log(
      new File("/somestorage/kafka-test-0"), // the backed-up partition directory
      cfg,
      0L,    // recovery point
      null ) // scheduler: not needed for this read-only use

    // read up to ~1 MB starting at the beginning of the active segment
    val fdi = log.activeSegment.read(log.logStartOffset, Some(log.logEndOffset), 1000000)
    var msgs = 1
    fdi.messageSet.iterator.foreach { msgoffset =>
      println(s" ${msgoffset.message.hasKey} ::: > $msgs ::::> ${msgoffset.offset} :::::: ${msgoffset.nextOffset}")
      msgs = msgs + 1
      val key = new String(msgoffset.message.key.array(), "UTF-8")
      val msg = new String(msgoffset.message.payload.array(), "UTF-8")
      println(s" === ${key} ")
      println(s" === ${msg} ")
    }


This reads from the active segment (the last known segment), but it is easy to make 
it read from all segments; see the sketch below. The interesting thing is that, as 
long as the backup files are well formed, they can be read without having to load 
them back into Kafka itself.
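
For example, a minimal sketch of reading every segment instead of just the active 
one, continuing from the log value constructed above (same internal kafka.log API, 
so it is version-dependent):

    log.logSegments.foreach { segment =>
      // 10 MB max fetch size is an arbitrary choice for this sketch
      val info = segment.read(segment.baseOffset, Some(segment.nextOffset()), 10 * 1024 * 1024)
      Option(info).foreach { fdi => // read can return null for an empty segment
        fdi.messageSet.iterator.foreach { mo =>
          val key = if (mo.message.hasKey) new String(mo.message.key.array(), "UTF-8") else "<no key>"
          val msg = new String(mo.message.payload.array(), "UTF-8")
          println(s"${mo.offset} -> $key : $msg")
        }
      }
    }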

The advantage is that what was once the raw data (as it came in) remains the raw 
data forever, without having to introduce another format for storing it. Another 
advantage is that, in case of reprocessing, there is no need to write a producer to 
ingest the data back in, and so on (it is possible, but not necessary). Such raw 
Kafka files can easily be processed by Storm / Samza (which would need another 
stream definition) / Hadoop, as sketched below.
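
As a rough illustration of that last point, a single backed-up segment file can be 
opened directly with Kafka's internal FileMessageSet class, with no broker and no 
Log instance involved. Again, this is internal, version-dependent API, and the path 
and file name below are made up:

    import java.io.File
    import kafka.log.FileMessageSet

    // open one archived segment file directly and walk its messages
    val segmentFile = new FileMessageSet(new File("/somestorage/kafka-test-0/00000000000000000000.log"))
    segmentFile.iterator.foreach { mo =>
      println(s"offset=${mo.offset} bytes=${mo.message.size}")
    }
    segmentFile.close()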

This sounds like a very useful addition to Kafka. But I could be overthinking 
this...  

Kind regards,

Radek Gruchalski

ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/


On Friday, 10 July 2015 at 22:55, Daniel Schierbeck wrote:

>  
> > On 10. jul. 2015, at 15.16, Shayne S <shaynest...@gmail.com> wrote:
> >  
> > There are two ways you can configure your topics, log compaction and with
> > no cleaning. The choice depends on your use case. Are the records uniquely
> > identifiable and will they receive updates? Then log compaction is the way
> > to go. If they are truly read only, you can go without log compaction.
> >  
>  
>  
> I'd rather be free to use the key for partitioning, and the records are 
> immutable — they're event records — so disabling compaction altogether would 
> be preferable. How is that accomplished?
> >  
> > We have a small processes which consume a topic and perform upserts to our
> > various database engines. It's easy to change how it all works and simply
> > consume the single source of truth again.
> >  
> > I've written a bit about log compaction here:
> > http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
> >  
> > On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck <daniel.schierb...@gmail.com> wrote:
> >  
> > > I'd like to use Kafka as a persistent store – sort of as an alternative to
> > > HDFS. The idea is that I'd load the data into various other systems in
> > > order to solve specific needs such as full-text search, analytics, 
> > > indexing
> > > by various attributes, etc. I'd like to keep a single source of truth,
> > > however.
> > >  
> > > I'm struggling a bit to understand how I can configure a topic to retain
> > > messages indefinitely. I want to make sure that my data isn't deleted. Is
> > > there a guide to configuring Kafka like this?