Re: Using Kafka as a persistent store

2015-07-14 Thread Shayne S
Thanks, I'm on 0.8.2 so that explains it.

Should retention.ms affect segment rolling? In my experiment it did (
retention.ms = -1), which was unexpected since I thought only segment.bytes
and segment.ms would control that.
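
For reference, a minimal sketch of the per-topic segment settings in question,
assuming a 0.8.x broker and a hypothetical topic named "events" (segment.bytes
and segment.ms control when a new segment is rolled; retention.ms and
retention.bytes are, as far as I understand, only supposed to control when
rolled segments become eligible for cleanup):

    # roll a new segment after 1 GiB or 7 days, whichever comes first
    bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic events \
      --config segment.bytes=1073741824 \
      --config segment.ms=604800000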

On Mon, Jul 13, 2015 at 7:57 PM, Daniel Tamai daniel.ta...@gmail.com
wrote:

 Using -1 for log.retention.ms should work only for 0.8.3 (
 https://issues.apache.org/jira/browse/KAFKA-1990).

 2015-07-13 17:08 GMT-03:00 Shayne S shaynest...@gmail.com:

  Did this work for you? I set the topic settings to retention.ms=-1 and
  retention.bytes=-1 and it looks like it is deleting segments immediately.
 
  On Sun, Jul 12, 2015 at 8:02 AM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
  
   On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
  
   If I recall correctly, setting log.retention.ms and log.retention.bytes to
   -1 disables both.
  
  Thanks!
  
   On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
   daniel.schierb...@gmail.com wrote:
  
   On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
  
   There are two ways you can configure your topics, log compaction and with
   no cleaning. The choice depends on your use case. Are the records uniquely
   identifiable and will they receive updates? Then log compaction is the way
   to go. If they are truly read only, you can go without log compaction.
  
   I'd rather be free to use the key for partitioning, and the records are
   immutable — they're event records — so disabling compaction altogether
   would be preferable. How is that accomplished?
  
   We have a small processes which consume a topic and perform upserts to our
   various database engines. It's easy to change how it all works and simply
   consume the single source of truth again.
  
   I've written a bit about log compaction here:
   http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
  
   On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
   daniel.schierb...@gmail.com wrote:
  
   I'd like to use Kafka as a persistent store – sort of as an alternative to
   HDFS. The idea is that I'd load the data into various other systems in
   order to solve specific needs such as full-text search, analytics, indexing
   by various attributes, etc. I'd like to keep a single source of truth,
   however.
  
   I'm struggling a bit to understand how I can configure a topic to retain
   messages indefinitely. I want to make sure that my data isn't deleted. Is
   there a guide to configuring Kafka like this?
 



Re: Using Kafka as a persistent store

2015-07-13 Thread Daniel Schierbeck
Would it be possible to document how to configure Kafka to never delete
messages in a topic? It took a good while to figure this out, and I see it
as an important use case for Kafka.
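
Something like the following sketch would already help, assuming a broker
version where -1 is honoured (see KAFKA-1990; the topic name "events" is only
an example):

    # broker-wide defaults (server.properties): keep every segment forever
    log.retention.ms=-1
    log.retention.bytes=-1
    log.cleanup.policy=delete

    # or as per-topic overrides, leaving the broker defaults alone
    bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic events \
      --config retention.ms=-1 --config retention.bytes=-1

With cleanup.policy=delete (the default) plus unlimited retention, every record
is kept as-is; cleanup.policy=compact would instead retain only the latest
record per key.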

On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck 
daniel.schierb...@gmail.com wrote:


  On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
 
  If I recall correctly, setting log.retention.ms and log.retention.bytes
 to
  -1 disables both.

 Thanks!

 
  On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
 
  On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
 
  There are two ways you can configure your topics, log compaction and
 with
  no cleaning. The choice depends on your use case. Are the records
  uniquely
  identifiable and will they receive updates? Then log compaction is the
  way
  to go. If they are truly read only, you can go without log compaction.
 
  I'd rather be free to use the key for partitioning, and the records are
  immutable — they're event records — so disabling compaction altogether
  would be preferable. How is that accomplished?
 
  We have a small processes which consume a topic and perform upserts to
  our
  various database engines. It's easy to change how it all works and
 simply
  consume the single source of truth again.
 
  I've written a bit about log compaction here:
 
 http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
 
  On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
  I'd like to use Kafka as a persistent store – sort of as an
 alternative
  to
  HDFS. The idea is that I'd load the data into various other systems in
  order to solve specific needs such as full-text search, analytics,
  indexing
  by various attributes, etc. I'd like to keep a single source of truth,
  however.
 
  I'm struggling a bit to understand how I can configure a topic to
 retain
  messages indefinitely. I want to make sure that my data isn't deleted.
  Is
  there a guide to configuring Kafka like this?
 



Re: Using Kafka as a persistent store

2015-07-13 Thread Scott Thibault
We've tried to use Kafka not as a persistent store, but as a long-term
archival store.  An outstanding issue we've had with that is that the
broker holds on to an open file handle on every file in the log!  The other
issue we've had is when you create a long-term archival log on shared
storage, you can't simply access that data from another cluster b/c of meta
data being stored in zookeeper rather than in the log.
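
In case it helps anyone running into the same limit, a rough way to watch the
handle count and raise the per-process limit (the numbers are illustrative, not
a recommendation):

    # count file handles held by the broker process
    lsof -p $(pgrep -f kafka.Kafka) | wc -l

    # /etc/security/limits.conf, for the user running the broker
    kafka  soft  nofile  100000
    kafka  hard  nofile  100000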

--Scott Thibault


On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck 
daniel.schierb...@gmail.com wrote:

 Would it be possible to document how to configure Kafka to never delete
 messages in a topic? It took a good while to figure this out, and I see it
 as an important use case for Kafka.

 On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:

 
   On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
  
   If I recall correctly, setting log.retention.ms and
 log.retention.bytes
  to
   -1 disables both.
 
  Thanks!
 
  
   On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
   daniel.schierb...@gmail.com wrote:
  
  
   On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
  
   There are two ways you can configure your topics, log compaction and
  with
   no cleaning. The choice depends on your use case. Are the records
   uniquely
   identifiable and will they receive updates? Then log compaction is
 the
   way
   to go. If they are truly read only, you can go without log
 compaction.
  
   I'd rather be free to use the key for partitioning, and the records
 are
   immutable — they're event records — so disabling compaction altogether
   would be preferable. How is that accomplished?
  
   We have a small processes which consume a topic and perform upserts
 to
   our
   various database engines. It's easy to change how it all works and
  simply
   consume the single source of truth again.
  
   I've written a bit about log compaction here:
  
  http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
  
   On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
   daniel.schierb...@gmail.com wrote:
  
   I'd like to use Kafka as a persistent store – sort of as an
  alternative
   to
   HDFS. The idea is that I'd load the data into various other systems
 in
   order to solve specific needs such as full-text search, analytics,
   indexing
   by various attributes, etc. I'd like to keep a single source of
 truth,
   however.
  
   I'm struggling a bit to understand how I can configure a topic to
  retain
   messages indefinitely. I want to make sure that my data isn't
 deleted.
   Is
   there a guide to configuring Kafka like this?
  
 






Re: Using Kafka as a persistent store

2015-07-13 Thread Rad Gruchalski
Scott,  

This is what I was trying to target in one of my previous responses to Daniel. 
The one in which I suggest another compaction setting for kafka.










Kind regards,

Radek Gruchalski
ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/



On Monday, 13 July 2015 at 15:41, Scott Thibault wrote:

 We've tried to use Kafka not as a persistent store, but as a long-term
 archival store. An outstanding issue we've had with that is that the
 broker holds on to an open file handle on every file in the log! The other
 issue we've had is when you create a long-term archival log on shared
 storage, you can't simply access that data from another cluster b/c of meta
 data being stored in zookeeper rather than in the log.
  
 --Scott Thibault
  
  
 On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck 
 daniel.schierb...@gmail.com (mailto:daniel.schierb...@gmail.com) wrote:
  
  Would it be possible to document how to configure Kafka to never delete
  messages in a topic? It took a good while to figure this out, and I see it
  as an important use case for Kafka.
   
  On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck 
  daniel.schierb...@gmail.com (mailto:daniel.schierb...@gmail.com) wrote:
   

    On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
   
    If I recall correctly, setting log.retention.ms and log.retention.bytes to
    -1 disables both.
   
   Thanks!
   
    On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
    daniel.schierb...@gmail.com wrote:
   
    On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
   
    There are two ways you can configure your topics, log compaction and with
    no cleaning. The choice depends on your use case. Are the records uniquely
    identifiable and will they receive updates? Then log compaction is the way
    to go. If they are truly read only, you can go without log compaction.
   
    I'd rather be free to use the key for partitioning, and the records are
    immutable — they're event records — so disabling compaction altogether
    would be preferable. How is that accomplished?
   
    We have a small processes which consume a topic and perform upserts to our
    various database engines. It's easy to change how it all works and simply
    consume the single source of truth again.
   
    I've written a bit about log compaction here:
    http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
   
    On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
    daniel.schierb...@gmail.com wrote:
   
    I'd like to use Kafka as a persistent store – sort of as an alternative to
    HDFS. The idea is that I'd load the data into various other systems in
    order to solve specific needs such as full-text search, analytics, indexing
    by various attributes, etc. I'd like to keep a single source of truth,
    however.
   
    I'm struggling a bit to understand how I can configure a topic to retain
    messages indefinitely. I want to make sure that my data isn't deleted. Is
    there a guide to configuring Kafka like this?
  
  




Re: Using Kafka as a persistent store

2015-07-13 Thread Gwen Shapira
Hi,

1. What you described sounds like a reasonable architecture, but may I
ask why JSON? Avro seems better supported in the ecosystem
(Confluent's tools, Hadoop integration, schema evolution, etc).

1.5 If all you do is convert data into JSON, SparkStreaming sounds
like difficult-to-manage overkill compared to Flume or a slightly
modified MirrorMaker (or CopyCat, if it exists yet). Any specific
reasons for SparkStreaming?

2. Different compute engines prefer different storage formats because
in most cases that's where optimizations come from. Parquet improves
scan performance for Impala and MR, but will be pretty horrible for
NoSQL. So, I wouldn't hold my breath for compute engines to start
sharing data storage suddenly.

Gwen

On Mon, Jul 13, 2015 at 11:45 AM, Tim Smith secs...@gmail.com wrote:
 I have had a similar issue where I wanted a single source of truth between
 Search and HDFS. First, if you zoom out a little, eventually you are going
 to have some compute engine(s) process the data. If you store it in a
 compute neutral tier like kafka then you will need to suck the data out at
 runtime and stage it for the compute engine to use. So pick your poison,
 process at ingest and store multiple copies of data, one per compute
 engine, OR store in a neutral store and process at runtime. I am not saying
 one is better than the other but that's how I see the trade-off so
 depending on your use cases, YMMV.

 What I do is:
 - store raw data into kafka
 - use spark streaming to transform data to JSON and post it back to kafka
 - Hang multiple data stores off kafka that ingest the JSON
 - Not do any other transformations in the consumer stores and store the
 copy as immutable event

 So I do have multiple copies (one per compute tier) but they all look the
 same.

 Unless different compute engines, natively start to use a common data
 storage format, I don't see how one could get away from storing multiple
 copies. Primarily, I see Lucene based products have their format, the
 Hadoop ecosystem seems congregating around Parquet and then the NoSQL
 players have their formats (one per each product).

 My 2 cents worth :)



 On Mon, Jul 13, 2015 at 10:35 AM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:

 Am I correct in assuming that Kafka will only retain a file handle for the
 last segment of the log? If the number of handles grows unbounded, then it
 would be an issue. But I plan on writing to this topic continuously anyway,
 so not separating data into cold and hot storage is the entire point.

 Daniel Schierbeck

  On 13. jul. 2015, at 15.41, Scott Thibault 
 scott.thiba...@multiscalehn.com wrote:
 
  We've tried to use Kafka not as a persistent store, but as a long-term
  archival store.  An outstanding issue we've had with that is that the
  broker holds on to an open file handle on every file in the log!  The
 other
  issue we've had is when you create a long-term archival log on shared
  storage, you can't simply access that data from another cluster b/c of
 meta
  data being stored in zookeeper rather than in the log.
 
  --Scott Thibault
 
 
  On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
  Would it be possible to document how to configure Kafka to never delete
  messages in a topic? It took a good while to figure this out, and I see
 it
  as an important use case for Kafka.
 
  On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
 
  On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
 
  If I recall correctly, setting log.retention.ms and
  log.retention.bytes
  to
  -1 disables both.
 
  Thanks!
 
 
  On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
 
  On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
 
  There are two ways you can configure your topics, log compaction and
  with
  no cleaning. The choice depends on your use case. Are the records
  uniquely
  identifiable and will they receive updates? Then log compaction is
  the
  way
  to go. If they are truly read only, you can go without log
  compaction.
 
  I'd rather be free to use the key for partitioning, and the records
  are
  immutable — they're event records — so disabling compaction
 altogether
  would be preferable. How is that accomplished?
 
  We have a small processes which consume a topic and perform upserts
  to
  our
  various database engines. It's easy to change how it all works and
  simply
  consume the single source of truth again.
 
  I've written a bit about log compaction here:
 
 http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
 
  On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
  I'd like to use Kafka as a persistent store – sort of as an
  alternative
  to
  HDFS. The idea is that I'd load the data into various other systems
  in
  order to solve specific needs such as full-text 

Re: Using Kafka as a persistent store

2015-07-13 Thread James Cheng
For what it's worth, I did something similar to Rad's suggestion of 
cold-storage to add long-term archiving when using Amazon Kinesis. Kinesis is 
also a message bus, but only has a 24 hour retention window.

I wrote a Kinesis consumer that would take all messages from Kinesis and save 
them into S3. I stored them in S3 in such a way that the structure mirrors the 
original Kinesis stream, and all message metadata is preserved (message offsets 
and primary keys, for example).

This means that I can write a consumer that would consume from S3 files in 
the same way that it would consume from the Kinesis stream itself. And the data 
is structured such that when you are done reading from S3, you can connect to 
the Kinesis stream at the point where the S3 archive left off.

This effectively allowed me to add a configurable retention period when 
consuming from Kinesis.

-James

On Jul 13, 2015, at 11:45 AM, Tim Smith secs...@gmail.com wrote:

 I have had a similar issue where I wanted a single source of truth between
 Search and HDFS. First, if you zoom out a little, eventually you are going
 to have some compute engine(s) process the data. If you store it in a
 compute neutral tier like kafka then you will need to suck the data out at
 runtime and stage it for the compute engine to use. So pick your poison,
 process at ingest and store multiple copies of data, one per compute
 engine, OR store in a neutral store and process at runtime. I am not saying
 one is better than the other but that's how I see the trade-off so
 depending on your use cases, YMMV.
 
 What I do is:
 - store raw data into kafka
 - use spark streaming to transform data to JSON and post it back to kafka
 - Hang multiple data stores off kafka that ingest the JSON
 - Not do any other transformations in the consumer stores and store the
 copy as immutable event
 
 So I do have multiple copies (one per compute tier) but they all look the
 same.
 
 Unless different compute engines, natively start to use a common data
 storage format, I don't see how one could get away from storing multiple
 copies. Primarily, I see Lucene based products have their format, the
 Hadoop ecosystem seems congregating around Parquet and then the NoSQL
 players have their formats (one per each product).
 
 My 2 cents worth :)
 
 
 
 On Mon, Jul 13, 2015 at 10:35 AM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:
 
 Am I correct in assuming that Kafka will only retain a file handle for the
 last segment of the log? If the number of handles grows unbounded, then it
 would be an issue. But I plan on writing to this topic continuously anyway,
 so not separating data into cold and hot storage is the entire point.
 
 Daniel Schierbeck
 
 On 13. jul. 2015, at 15.41, Scott Thibault 
 scott.thiba...@multiscalehn.com wrote:
 
 We've tried to use Kafka not as a persistent store, but as a long-term
 archival store.  An outstanding issue we've had with that is that the
 broker holds on to an open file handle on every file in the log!  The
 other
 issue we've had is when you create a long-term archival log on shared
 storage, you can't simply access that data from another cluster b/c of
 meta
 data being stored in zookeeper rather than in the log.
 
 --Scott Thibault
 
 
 On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:
 
 Would it be possible to document how to configure Kafka to never delete
 messages in a topic? It took a good while to figure this out, and I see
 it
 as an important use case for Kafka.
 
 On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:
 
 
 On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
 
 If I recall correctly, setting log.retention.ms and
 log.retention.bytes
 to
 -1 disables both.
 
 Thanks!
 
 
 On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:
 
 
 On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
 
 There are two ways you can configure your topics, log compaction and
 with
 no cleaning. The choice depends on your use case. Are the records
 uniquely
 identifiable and will they receive updates? Then log compaction is
 the
 way
 to go. If they are truly read only, you can go without log
 compaction.
 
 I'd rather be free to use the key for partitioning, and the records
 are
 immutable — they're event records — so disabling compaction
 altogether
 would be preferable. How is that accomplished?
 
 We have a small processes which consume a topic and perform upserts
 to
 our
 various database engines. It's easy to change how it all works and
 simply
 consume the single source of truth again.
 
 I've written a bit about log compaction here:
 
 http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
 
 On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:
 
 I'd like to use Kafka as a persistent store – sort of as an
 alternative
 to
 HDFS. The idea is that 

Re: Using Kafka as a persistent store

2015-07-13 Thread Tim Smith
I have had a similar issue where I wanted a single source of truth between
Search and HDFS. First, if you zoom out a little, eventually you are going
to have some compute engine(s) process the data. If you store it in a
compute neutral tier like kafka then you will need to suck the data out at
runtime and stage it for the compute engine to use. So pick your poison,
process at ingest and store multiple copies of data, one per compute
engine, OR store in a neutral store and process at runtime. I am not saying
one is better than the other but that's how I see the trade-off so
depending on your use cases, YMMV.

What I do is:
- store raw data into kafka
- use spark streaming to transform data to JSON and post it back to kafka
- Hang multiple data stores off kafka that ingest the JSON
- Not do any other transformations in the consumer stores and store the
copy as immutable event

So I do have multiple copies (one per compute tier) but they all look the
same.

 Unless different compute engines natively start to use a common data
 storage format, I don't see how one could get away from storing multiple
 copies. Primarily, I see Lucene-based products have their format, the
 Hadoop ecosystem seems to be congregating around Parquet, and the NoSQL
 players have their formats (one per product).

My 2 cents worth :)



On Mon, Jul 13, 2015 at 10:35 AM, Daniel Schierbeck 
daniel.schierb...@gmail.com wrote:

 Am I correct in assuming that Kafka will only retain a file handle for the
 last segment of the log? If the number of handles grows unbounded, then it
 would be an issue. But I plan on writing to this topic continuously anyway,
 so not separating data into cold and hot storage is the entire point.

 Daniel Schierbeck

  On 13. jul. 2015, at 15.41, Scott Thibault 
 scott.thiba...@multiscalehn.com wrote:
 
  We've tried to use Kafka not as a persistent store, but as a long-term
  archival store.  An outstanding issue we've had with that is that the
  broker holds on to an open file handle on every file in the log!  The
 other
  issue we've had is when you create a long-term archival log on shared
  storage, you can't simply access that data from another cluster b/c of
 meta
  data being stored in zookeeper rather than in the log.
 
  --Scott Thibault
 
 
  On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
  Would it be possible to document how to configure Kafka to never delete
  messages in a topic? It took a good while to figure this out, and I see
 it
  as an important use case for Kafka.
 
  On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
 
  On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
 
  If I recall correctly, setting log.retention.ms and
  log.retention.bytes
  to
  -1 disables both.
 
  Thanks!
 
 
  On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
 
  On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
 
  There are two ways you can configure your topics, log compaction and
  with
  no cleaning. The choice depends on your use case. Are the records
  uniquely
  identifiable and will they receive updates? Then log compaction is
  the
  way
  to go. If they are truly read only, you can go without log
  compaction.
 
  I'd rather be free to use the key for partitioning, and the records
  are
  immutable — they're event records — so disabling compaction
 altogether
  would be preferable. How is that accomplished?
 
  We have a small processes which consume a topic and perform upserts
  to
  our
  various database engines. It's easy to change how it all works and
  simply
  consume the single source of truth again.
 
  I've written a bit about log compaction here:
 
 http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
 
  On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
  I'd like to use Kafka as a persistent store – sort of as an
  alternative
  to
  HDFS. The idea is that I'd load the data into various other systems
  in
  order to solve specific needs such as full-text search, analytics,
  indexing
  by various attributes, etc. I'd like to keep a single source of
  truth,
  however.
 
  I'm struggling a bit to understand how I can configure a topic to
  retain
  messages indefinitely. I want to make sure that my data isn't
  deleted.
  Is
  there a guide to configuring Kafka like this?
 
 
 



Re: Using Kafka as a persistent store

2015-07-13 Thread Rad Gruchalski
Sounds like the same idea. The nice thing about having such an option is that, 
with a correct application of containers and a backup and restore strategy, one 
can create an infinite ordered backup of the raw input stream using the native 
Kafka storage format.
I understand the point of having the data in other formats in other systems. 
Impossible to get away from that.
My concept presented a few days ago is to address having “multiple same-looking 
copies of the truth”.

At the end of the day, if something happens with operational data, it will have 
to be recreated from “the truth”. But, if the data was once ingested over Kafka 
and there is already a pipeline for building operational state from Kafka, why 
would someone write another processing logic to get the truth, say, from 
Hadoop? And if fast, parallel processing of native Kafka format is required, it 
can still be done with Samza or Hadoop / whathaveyou.










Kind regards,

Radek Gruchalski
ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/



On Monday, 13 July 2015 at 21:17, James Cheng wrote:

 For what it's worth, I did something similar to Rad's suggestion of 
 cold-storage to add long-term archiving when using Amazon Kinesis. Kinesis 
 is also a message bus, but only has a 24 hour retention window.
  
 I wrote a Kinesis consumer that would take all messages from Kinesis and save 
 them into S3. I stored them in S3 in such a way that the structure mirrors 
 the original Kinesis stream, and all message metadata is preserved (message 
 offsets and primary keys, for example).
  
 This means that I can write a consumer that would consume from S3 files in 
 the same way that it would consume from the Kinesis stream itself. And the 
 data is structured such that when you are done reading from S3, you can 
 connect to the Kinesis stream at the point where the S3 archive left off.
  
 This effectively allowed me to add a configurable retention period when 
 consuming from Kinesis.
  
 -James
  
 On Jul 13, 2015, at 11:45 AM, Tim Smith secs...@gmail.com 
 (mailto:secs...@gmail.com) wrote:
  
  I have had a similar issue where I wanted a single source of truth between
  Search and HDFS. First, if you zoom out a little, eventually you are going
  to have some compute engine(s) process the data. If you store it in a
  compute neutral tier like kafka then you will need to suck the data out at
  runtime and stage it for the compute engine to use. So pick your poison,
  process at ingest and store multiple copies of data, one per compute
  engine, OR store in a neutral store and process at runtime. I am not saying
  one is better than the other but that's how I see the trade-off so
  depending on your use cases, YMMV.
   
  What I do is:
  - store raw data into kafka
  - use spark streaming to transform data to JSON and post it back to kafka
  - Hang multiple data stores off kafka that ingest the JSON
  - Not do any other transformations in the consumer stores and store the
  copy as immutable event
   
  So I do have multiple copies (one per compute tier) but they all look the
  same.
   
  Unless different compute engines, natively start to use a common data
  storage format, I don't see how one could get away from storing multiple
  copies. Primarily, I see Lucene based products have their format, the
  Hadoop ecosystem seems congregating around Parquet and then the NoSQL
  players have their formats (one per each product).
   
  My 2 cents worth :)
   
   
   
  On Mon, Jul 13, 2015 at 10:35 AM, Daniel Schierbeck 
  daniel.schierb...@gmail.com (mailto:daniel.schierb...@gmail.com) wrote:
   
   Am I correct in assuming that Kafka will only retain a file handle for the
   last segment of the log? If the number of handles grows unbounded, then it
   would be an issue. But I plan on writing to this topic continuously 
   anyway,
   so not separating data into cold and hot storage is the entire point.

   Daniel Schierbeck

On 13. jul. 2015, at 15.41, Scott Thibault 
   scott.thiba...@multiscalehn.com (mailto:scott.thiba...@multiscalehn.com) 
   wrote:
 
We've tried to use Kafka not as a persistent store, but as a long-term
archival store. An outstanding issue we've had with that is that the
broker holds on to an open file handle on every file in the log! The
 

   other
issue we've had is when you create a long-term archival log on shared
storage, you can't simply access that data from another cluster b/c of
 

   meta
data being stored in zookeeper rather than in the log.
 
--Scott Thibault
 
 
  

Re: Using Kafka as a persistent store

2015-07-13 Thread Rad Gruchalski
Indeed, the files would have to be moved to some separate, dedicated storage.  
There are basically 3 options, as kafka does not allow adding logs at runtime:

1. make the consumer able to read from an arbitrary file
2. add ability to drop files in (I believe this adds a lot of complexity)
3. read files with another program, as suggested in my first email

I’d love to get some input from someone who knows the code and options a bit 
better!  










Kind regards,

Radek Gruchalski
ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/



On Monday, 13 July 2015 at 18:02, Scott Thibault wrote:

 Yes, consider my e-mail an up vote!
  
 I guess the files would automatically moved somewhere else to separate the
 active from cold segments? Ideally, one could run an unmodified consumer
 application on the cold segments.
  
  
 --Scott
  
  
 On Mon, Jul 13, 2015 at 6:57 AM, Rad Gruchalski ra...@gruchalski.com 
 (mailto:ra...@gruchalski.com)
 wrote:
  
  Scott,
   
  This is what I was trying to target in one of my previous responses to
  Daniel. The one in which I suggest another compaction setting for kafka.
   
   
   
   
   
   
   
   
   
   
  Kind regards,
  Radek Gruchalski
  ra...@gruchalski.com
  de.linkedin.com/in/radgruchalski/
   
   
   
  On Monday, 13 July 2015 at 15:41, Scott Thibault wrote:
   
   We've tried to use Kafka not as a persistent store, but as a long-term
   archival store. An outstanding issue we've had with that is that the
   broker holds on to an open file handle on every file in the log! The

   
  other
   issue we've had is when you create a long-term archival log on shared
   storage, you can't simply access that data from another cluster b/c of

   
  meta
   data being stored in zookeeper rather than in the log.

   --Scott Thibault


   On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck 
   daniel.schierb...@gmail.com (mailto:daniel.schierb...@gmail.com) wrote:

Would it be possible to document how to configure Kafka to never delete
messages in a topic? It took a good while to figure this out, and I
 


   
  see it
as an important use case for Kafka.
 
On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck 
daniel.schierb...@gmail.com (mailto:daniel.schierb...@gmail.com)
 

   
  wrote:
 
  
   On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
  
   If I recall correctly, setting log.retention.ms and log.retention.bytes to
   -1 disables both.
  
  Thanks!
  
   On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
   daniel.schierb...@gmail.com wrote:
  
   On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
  
   There are two ways you can configure your topics, log compaction and with
   no cleaning. The choice depends on your use case. Are the records uniquely
   identifiable and will they receive updates? Then log compaction is the way
   to go. If they are truly read only, you can go without log compaction.
  
   I'd rather be free to use the key for partitioning, and the records are
   immutable — they're event records — so disabling compaction altogether
   would be preferable. How is that accomplished?
  
   We have a small processes which consume a topic and perform upserts to our
   various database engines. It's easy to change how it all works

Re: Using Kafka as a persistent store

2015-07-13 Thread Daniel Schierbeck
Am I correct in assuming that Kafka will only retain a file handle for the last 
segment of the log? If the number of handles grows unbounded, then it would be 
an issue. But I plan on writing to this topic continuously anyway, so not 
separating data into cold and hot storage is the entire point. 

Daniel Schierbeck

 On 13. jul. 2015, at 15.41, Scott Thibault scott.thiba...@multiscalehn.com 
 wrote:
 
 We've tried to use Kafka not as a persistent store, but as a long-term
 archival store.  An outstanding issue we've had with that is that the
 broker holds on to an open file handle on every file in the log!  The other
 issue we've had is when you create a long-term archival log on shared
 storage, you can't simply access that data from another cluster b/c of meta
 data being stored in zookeeper rather than in the log.
 
 --Scott Thibault
 
 
 On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:
 
 Would it be possible to document how to configure Kafka to never delete
 messages in a topic? It took a good while to figure this out, and I see it
 as an important use case for Kafka.
 
 On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:
 
 
 On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
 
 If I recall correctly, setting log.retention.ms and
 log.retention.bytes
 to
 -1 disables both.
 
 Thanks!
 
 
 On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:
 
 
 On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
 
 There are two ways you can configure your topics, log compaction and
 with
 no cleaning. The choice depends on your use case. Are the records
 uniquely
 identifiable and will they receive updates? Then log compaction is
 the
 way
 to go. If they are truly read only, you can go without log
 compaction.
 
 I'd rather be free to use the key for partitioning, and the records
 are
 immutable — they're event records — so disabling compaction altogether
 would be preferable. How is that accomplished?
 
 We have a small processes which consume a topic and perform upserts
 to
 our
 various database engines. It's easy to change how it all works and
 simply
 consume the single source of truth again.
 
 I've written a bit about log compaction here:
 http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
 
 On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:
 
 I'd like to use Kafka as a persistent store – sort of as an
 alternative
 to
 HDFS. The idea is that I'd load the data into various other systems
 in
 order to solve specific needs such as full-text search, analytics,
 indexing
 by various attributes, etc. I'd like to keep a single source of
 truth,
 however.
 
 I'm struggling a bit to understand how I can configure a topic to
 retain
 messages indefinitely. I want to make sure that my data isn't
 deleted.
 Is
 there a guide to configuring Kafka like this?
 
 
 


Re: Using Kafka as a persistent store

2015-07-13 Thread Scott Thibault
Yes, consider my e-mail an up vote!

I guess the files would automatically be moved somewhere else to separate the
active from the cold segments? Ideally, one could run an unmodified consumer
application on the cold segments.


--Scott


On Mon, Jul 13, 2015 at 6:57 AM, Rad Gruchalski ra...@gruchalski.com
wrote:

 Scott,

 This is what I was trying to target in one of my previous responses to
 Daniel. The one in which I suggest another compaction setting for kafka.










 Kind regards,
 Radek Gruchalski
 ra...@gruchalski.com
 de.linkedin.com/in/radgruchalski/



 On Monday, 13 July 2015 at 15:41, Scott Thibault wrote:

  We've tried to use Kafka not as a persistent store, but as a long-term
  archival store. An outstanding issue we've had with that is that the
  broker holds on to an open file handle on every file in the log! The
 other
  issue we've had is when you create a long-term archival log on shared
  storage, you can't simply access that data from another cluster b/c of
 meta
  data being stored in zookeeper rather than in the log.
 
  --Scott Thibault
 
 
  On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck 
  daniel.schierb...@gmail.com (mailto:daniel.schierb...@gmail.com) wrote:
 
   Would it be possible to document how to configure Kafka to never delete
   messages in a topic? It took a good while to figure this out, and I
 see it
   as an important use case for Kafka.
  
   On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck 
   daniel.schierb...@gmail.com (mailto:daniel.schierb...@gmail.com)
 wrote:
  
   
  On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
 
  If I recall correctly, setting log.retention.ms and log.retention.bytes to
  -1 disables both.
 
 Thanks!
 
  On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
  On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
 
  There are two ways you can configure your topics, log compaction and with
  no cleaning. The choice depends on your use case. Are the records uniquely
  identifiable and will they receive updates? Then log compaction is the way
  to go. If they are truly read only, you can go without log compaction.
 
  I'd rather be free to use the key for partitioning, and the records are
  immutable — they're event records — so disabling compaction altogether
  would be preferable. How is that accomplished?
 
  We have a small processes which consume a topic and perform upserts to our
  various database engines. It's easy to change how it all works and simply
  consume the single source of truth again.
 
  I've written a bit about log compaction here:
  http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
 
  On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
  I'd like to use Kafka as a persistent store – sort of as an alternative to
  HDFS. The idea is that I'd load the data into various other systems in
  order to solve specific needs such as full-text search, analytics, indexing
  by various attributes, etc. I'd like to keep a single source of truth,
  however.
 
  I'm struggling a bit to understand how I can configure a topic to retain
  messages indefinitely. I want to make sure that my data isn't deleted. Is
  there a guide to configuring Kafka like this?
 
 
 






Re: Using Kafka as a persistent store

2015-07-13 Thread Shayne S
Did this work for you? I set the topic settings to retention.ms=-1 and
retention.bytes=-1 and it looks like it is deleting segments immediately.

On Sun, Jul 12, 2015 at 8:02 AM, Daniel Schierbeck 
daniel.schierb...@gmail.com wrote:


  On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
 
  If I recall correctly, setting log.retention.ms and log.retention.bytes
 to
  -1 disables both.

 Thanks!

 
  On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
 
  On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
 
  There are two ways you can configure your topics, log compaction and
 with
  no cleaning. The choice depends on your use case. Are the records
  uniquely
  identifiable and will they receive updates? Then log compaction is the
  way
  to go. If they are truly read only, you can go without log compaction.
 
  I'd rather be free to use the key for partitioning, and the records are
  immutable — they're event records — so disabling compaction altogether
  would be preferable. How is that accomplished?
 
  We have a small processes which consume a topic and perform upserts to
  our
  various database engines. It's easy to change how it all works and
 simply
  consume the single source of truth again.
 
  I've written a bit about log compaction here:
 
 http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
 
  On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
  I'd like to use Kafka as a persistent store – sort of as an
 alternative
  to
  HDFS. The idea is that I'd load the data into various other systems in
  order to solve specific needs such as full-text search, analytics,
  indexing
  by various attributes, etc. I'd like to keep a single source of truth,
  however.
 
  I'm struggling a bit to understand how I can configure a topic to
 retain
  messages indefinitely. I want to make sure that my data isn't deleted.
  Is
  there a guide to configuring Kafka like this?
 



Re: Using Kafka as a persistent store

2015-07-13 Thread Daniel Tamai
Using -1 for log.retention.ms should work only for 0.8.3 (
https://issues.apache.org/jira/browse/KAFKA-1990).
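
So on 0.8.2 a workaround is a very large value rather than -1 for the per-topic
override; a hedged sketch, with "events" as a stand-in topic name:

    # 0.8.2: effectively infinite time-based retention (Long.MAX_VALUE ms),
    # since retention.ms=-1 is not honoured until KAFKA-1990 lands
    bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic events \
      --config retention.ms=9223372036854775807 --config retention.bytes=-1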

2015-07-13 17:08 GMT-03:00 Shayne S shaynest...@gmail.com:

 Did this work for you? I set the topic settings to retention.ms=-1 and
 retention.bytes=-1 and it looks like it is deleting segments immediately.

 On Sun, Jul 12, 2015 at 8:02 AM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:

 
   On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
  
   If I recall correctly, setting log.retention.ms and
 log.retention.bytes
  to
   -1 disables both.
 
  Thanks!
 
  
   On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
   daniel.schierb...@gmail.com wrote:
  
  
   On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
  
   There are two ways you can configure your topics, log compaction and
  with
   no cleaning. The choice depends on your use case. Are the records
   uniquely
   identifiable and will they receive updates? Then log compaction is
 the
   way
   to go. If they are truly read only, you can go without log
 compaction.
  
   I'd rather be free to use the key for partitioning, and the records
 are
   immutable — they're event records — so disabling compaction altogether
   would be preferable. How is that accomplished?
  
   We have a small processes which consume a topic and perform upserts
 to
   our
   various database engines. It's easy to change how it all works and
  simply
   consume the single source of truth again.
  
   I've written a bit about log compaction here:
  
  http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
  
   On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
   daniel.schierb...@gmail.com wrote:
  
   I'd like to use Kafka as a persistent store – sort of as an
  alternative
   to
   HDFS. The idea is that I'd load the data into various other systems
 in
   order to solve specific needs such as full-text search, analytics,
   indexing
   by various attributes, etc. I'd like to keep a single source of
 truth,
   however.
  
   I'm struggling a bit to understand how I can configure a topic to
  retain
   messages indefinitely. I want to make sure that my data isn't
 deleted.
   Is
   there a guide to configuring Kafka like this?
  
 



Re: Using Kafka as a persistent store

2015-07-12 Thread Daniel Schierbeck

 On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
 
 If I recall correctly, setting log.retention.ms and log.retention.bytes to
 -1 disables both.

Thanks! 

 
 On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:
 
 
 On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
 
 There are two ways you can configure your topics, log compaction and with
 no cleaning. The choice depends on your use case. Are the records
 uniquely
 identifiable and will they receive updates? Then log compaction is the
 way
 to go. If they are truly read only, you can go without log compaction.
 
 I'd rather be free to use the key for partitioning, and the records are
 immutable — they're event records — so disabling compaction altogether
 would be preferable. How is that accomplished?
 
 We have a small processes which consume a topic and perform upserts to
 our
 various database engines. It's easy to change how it all works and simply
 consume the single source of truth again.
 
 I've written a bit about log compaction here:
 http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
 
 On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:
 
 I'd like to use Kafka as a persistent store – sort of as an alternative
 to
 HDFS. The idea is that I'd load the data into various other systems in
 order to solve specific needs such as full-text search, analytics,
 indexing
 by various attributes, etc. I'd like to keep a single source of truth,
 however.
 
 I'm struggling a bit to understand how I can configure a topic to retain
 messages indefinitely. I want to make sure that my data isn't deleted.
 Is
 there a guide to configuring Kafka like this?
 


Re: Using Kafka as a persistent store

2015-07-11 Thread Daniel Schierbeck
Radek: I don't see how data could be stored more efficiently than in Kafka
itself. It's optimized for cheap storage and offers high-performance bulk
export, exactly what you want from long-term archival.
On fre. 10. jul. 2015 at 23.16 Rad Gruchalski ra...@gruchalski.com wrote:

 Hello all,

 This is a very interesting discussion. I’ve been thinking of a similar use
 case for Kafka over the last few days.
 The usual data workflow with Kafka is most likely something this:

 - ingest with Kafka
 - process with Storm / Samza / whathaveyou
   - put some processed data back on Kafka
   - at the same time store the raw data somewhere in case if everything
 has to be reprocessed in the future (hdfs, similar?)

 Currently Kafka offers a couple of types of topics: regular stream
 (non-compacted topic) and a compacted topic (key/value). In case of a
 stream topic, when the compaction kicks in, the “old” data is truncated. It
 is lost from Kafka. What if there was an additional compaction setting:
 cold-store.
 Instead of trimming old data, Kafka would compile old data into a separate
 log with its own index. The user would be free to decide what to do with
 such files: put them on NFS / S3 / Swift / HDFS… Actually, the index file
 is not needed. The only 3 things are:

  - the folder name / partition index
  - the log itself
  - topic metadata at the time of taking the data out of the segment

 With all this info, reading data back is fairly easy, even without
 starting Kafka, sample program goes like this (scala-ish):

 val props = new Properties()
 props.put("log.segment.bytes", "1073741824")
 props.put("segment.index.bytes", "10485760") // should be 10MB

 val log = new Log(
   new File("/somestorage/kafka-test-0"),
   cfg,   // cfg: presumably a kafka.log.LogConfig built from the props above
   0L,
   null )

 val fdi = log.activeSegment.read( log.logStartOffset,
   Some(log.logEndOffset), 100 )
 var msgs = 1
 fdi.messageSet.iterator.foreach { msgoffset =>
   println( s"${msgoffset.message.hasKey} ::: $msgs " +
     s"${msgoffset.offset} :: ${msgoffset.nextOffset}" )
   msgs = msgs + 1
   val key = new String( msgoffset.message.key.array(), "UTF-8")
   val msg = new String( msgoffset.message.payload.array(), "UTF-8")
   println( s" === ${key} " )
   println( s" === ${msg} " )
 }


 This reads from active segment (the last known segment) but it’s easy to
 make it read from all segments. The interesting thing is - as long as the
 back up files are well formed, they can be read without having to put them
 in Kafka itself.

 The advantage is: what was once the raw data (as it came in), is the raw
 data forever, without having to introduce another format for storing this.
 Another advantage is: in case of reprocessing, no need to write a producer
 to ingest the data back and so on, so on (it’s possible but not necessary).
 Such raw Kafka files can be easily processed by Storm / Samza (would need
 another stream definition) / Hadoop.

 This sounds like a very useful addition to Kafka. But I could be
 overthinking this...










 Kind regards,
 Radek Gruchalski
 ra...@gruchalski.com
 de.linkedin.com/in/radgruchalski/



 On Friday, 10 July 2015 at 22:55, Daniel Schierbeck wrote:

 
   On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com (mailto:
 shaynest...@gmail.com) wrote:
  
   There are two ways you can configure your topics, log compaction and
 with
   no cleaning. The choice depends on your use case. Are the records
 uniquely
   identifiable and will they receive updates? Then log compaction is the
 way
   to go. If they are truly read only, you can go without log compaction.
  
 
 
  I'd rather be free to use the key for partitioning, and the records are
 immutable — they're event records — so disabling compaction altogether
 would be preferable. How is that accomplished?
  
   We have a small processes which consume a topic and perform upserts to
 our
   various database engines. It's easy to change how it all works and
 simply
   consume the single source of truth again.
  
   I've written a bit about log compaction here:
  
 http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
  
   On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
   daniel.schierb...@gmail.com (mailto:daniel.schierb...@gmail.com)
 wrote:
  
I'd like to use Kafka as a persistent store – sort of as an
 alternative to
HDFS. The idea is that I'd load the data into various other systems
 in
order to solve specific needs such as full-text search, analytics,
 indexing
by various attributes, etc. I'd like to keep a 

Re: Using Kafka as a persistent store

2015-07-11 Thread Rad Gruchalski
Daniel,  

I understand your point. From what I understand, the mode that suits you is what 
Jay suggested: log.retention.ms and log.retention.bytes both set to -1.

A few questions before I continue on something that may already be possible:

1. Is it possible to attach additional storage without having to restart Kafka?
2. If the answer to 1 is yes: will Kafka continue the topic on the new storage once 
all attached disks are full? Or is the assumption that one data_dir = one 
topic/partition (the code suggests so)?
3. If the answer to 1 is no: is it possible to take segments out without having to 
restart Kafka?
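
(For context, the broker setting these questions revolve around is log.dirs; a 
sketch of a multi-disk server.properties, with made-up paths:)

  # server.properties (paths are only an example)
  log.dirs=/disk1/kafka-logs,/disk2/kafka-logs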










Kind regards,
Radek Gruchalski
ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/




On Saturday, 11 July 2015 at 22:22, Daniel Schierbeck wrote:

 Radek: I don't see how data could be stored more efficiently than in Kafka
 itself. It's optimized for cheap storage and offers high-performance bulk
 export, exactly what you want from long-term archival.
 On fre. 10. jul. 2015 at 23.16 Rad Gruchalski ra...@gruchalski.com 
 (mailto:ra...@gruchalski.com) wrote:
  
  Hello all,
   
  This is a very interesting discussion. I’ve been thinking of a similar use
  case for Kafka over the last few days.
  The usual data workflow with Kafka is most likely something like this:
   
  - ingest with Kafka
  - process with Storm / Samza / whathaveyou
  - put some processed data back on Kafka
  - at the same time store the raw data somewhere in case everything
  has to be reprocessed in the future (HDFS, similar?)
   
  Currently Kafka offers a couple of types of topics: regular stream
  (non-compacted topic) and a compacted topic (key/value). In case of a
  stream topic, when the cleanup (retention) kicks in, the “old” data is truncated. It
  is lost from Kafka. What if there were an additional cleanup setting:
  cold-store?
  Instead of trimming old data, Kafka would compile old data into a separate
  log with its own index. The user would be free to decide what to do with
  such files: put them on NFS / S3 / Swift / HDFS… Actually, the index file
  is not needed. The only 3 things are:
   
  - the folder name / partition index
  - the log itself
  - topic metadata at the time of taking the data out of the segment
   
  With all this info, reading data back is fairly easy, even without
  starting Kafka, sample program goes like this (scala-ish):
   
  val props = new Properties()
  props.put("log.segment.bytes", "1073741824")
  props.put("segment.index.bytes", "10485760") // should be 10MB
   
  val log = new Log(
    new File("/somestorage/kafka-test-0"),
    cfg,
    0L,
    null )
   
  val fdi = log.activeSegment.read( log.logStartOffset,
  Some(log.logEndOffset), 100 )
  var msgs = 1
  fdi.messageSet.iterator.foreach { msgoffset =>
    println( s"${msgoffset.message.hasKey} ::: $msgs ${msgoffset.offset} :: ${msgoffset.nextOffset}" )
    msgs = msgs + 1
    val key = new String( msgoffset.message.key.array(), "UTF-8")
    val msg = new String( msgoffset.message.payload.array(), "UTF-8")
    println( s"=== ${key}" )
    println( s"=== ${msg}" )
  }
   
   
  This reads from the active segment (the last known segment) but it’s easy to
  make it read from all segments. The interesting thing is: as long as the
  backup files are well formed, they can be read without having to put them
  back into Kafka itself.
   
  The advantage is: what was once the raw data (as it came in) is the raw
  data forever, without having to introduce another format for storing it.
  Another advantage is: in case of reprocessing, there is no need to write a
  producer to ingest the data back, and so on (it’s possible but not necessary).
  Such raw Kafka files can be easily processed by Storm / Samza (would need
  another stream definition) / Hadoop.
   
  This sounds like a very useful addition to Kafka. But I could be
  overthinking this...
   
   
   
   
   
   
   
   
   
   
  Kind regards,
  Radek Gruchalski
  ra...@gruchalski.com
  de.linkedin.com/in/radgruchalski/
   
   
   
   
  On Friday, 10 July 2015 at 22:55, Daniel Schierbeck wrote:
   


Using Kafka as a persistent store

2015-07-10 Thread Daniel Schierbeck
I'd like to use Kafka as a persistent store – sort of as an alternative to
HDFS. The idea is that I'd load the data into various other systems in
order to solve specific needs such as full-text search, analytics, indexing
by various attributes, etc. I'd like to keep a single source of truth,
however.

I'm struggling a bit to understand how I can configure a topic to retain
messages indefinitely. I want to make sure that my data isn't deleted. Is
there a guide to configuring Kafka like this?


Re: Using Kafka as a persistent store

2015-07-10 Thread Shayne S
There are two ways you can configure your topics: with log compaction, or with
no cleaning at all. The choice depends on your use case. Are the records uniquely
identifiable and will they receive updates? Then log compaction is the way
to go. If they are truly read only, you can go without log compaction.
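
(As a sketch, the compaction route is a topic-level setting; the topic name and 
ZooKeeper address below are made up, and the no-cleaning route is instead handled 
through the retention settings discussed further down the thread:)

  # default is cleanup.policy=delete; "compact" keeps the latest value per key
  bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic events \
    --config cleanup.policy=compact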

We have small processes which consume a topic and perform upserts to our
various database engines. It's easy to change how it all works and simply
consume the single source of truth again.

I've written a bit about log compaction here:
http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/

On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
daniel.schierb...@gmail.com wrote:

 I'd like to use Kafka as a persistent store – sort of as an alternative to
 HDFS. The idea is that I'd load the data into various other systems in
 order to solve specific needs such as full-text search, analytics, indexing
 by various attributes, etc. I'd like to keep a single source of truth,
 however.

 I'm struggling a bit to understand how I can configure a topic to retain
 messages indefinitely. I want to make sure that my data isn't deleted. Is
 there a guide to configuring Kafka like this?



Re: Using Kafka as a persistent store

2015-07-10 Thread noah
I don't want to endorse this use of Kafka, but assuming you can give your
messages unique identifiers, I believe using log compaction will keep all
unique messages forever. You can read about how consumer offsets stored in
Kafka are managed using a compacted topic here:
http://kafka.apache.org/documentation.html#distributionimpl  In that case,
the consumer group id+topic+partition forms a unique message id and the
brokers read that topic on startup into the offsets cache (and take updates
to the offsets cache via the same topic.) If you have a finite, smallish
data set that you want indexed in multiple systems, that might be a good
approach.
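
(To make the "unique identifier" part concrete, a rough sketch with the old Scala 
producer, assuming a topic created with cleanup.policy=compact; the topic name, 
key, and broker address are made up for the example:)

  import java.util.Properties
  import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

  val props = new Properties()
  props.put("metadata.broker.list", "localhost:9092")
  props.put("serializer.class", "kafka.serializer.StringEncoder")
  val producer = new Producer[String, String](new ProducerConfig(props))
  // Two sends with the same key: after compaction only the latest value survives.
  producer.send(new KeyedMessage("documents", "doc-42", """{"rev":1}"""))
  producer.send(new KeyedMessage("documents", "doc-42", """{"rev":2}"""))
  producer.close()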

If your data can grow without bound, it doesn't seem to me like Kafka is a
good choice? Even with compaction, you will still have to sequentially read
it all, message by message, to get it into a different system. As far as I
know, there is no lookup by id, and even going to a specific date is a
manual O(log n) process.
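
(That sequential re-read would look roughly like this with the 0.8.x high-level 
consumer; topic name, ZooKeeper address, and group id are made up, and this is 
only a sketch, not a hardened loader:)

  import java.util.Properties
  import kafka.consumer.{Consumer, ConsumerConfig}

  val props = new Properties()
  props.put("zookeeper.connect", "localhost:2181")
  props.put("group.id", "full-reload")        // fresh group => no stored offsets
  props.put("auto.offset.reset", "smallest")  // so we start at the earliest offset
  val connector = Consumer.create(new ConsumerConfig(props))
  val stream = connector.createMessageStreams(Map("documents" -> 1))("documents").head
  // Iterating the stream blocks waiting for more data; it never "finishes" on its own.
  stream.foreach { record =>
    // record.key() / record.message() are Array[Byte] to hand to the target system
  }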

(warning: I'm just another user, so I may have a few things wrong.)


On Fri, Jul 10, 2015 at 3:47 AM Daniel Schierbeck 
daniel.schierb...@gmail.com wrote:

 I'd like to use Kafka as a persistent store – sort of as an alternative to
 HDFS. The idea is that I'd load the data into various other systems in
 order to solve specific needs such as full-text search, analytics, indexing
 by various attributes, etc. I'd like to keep a single source of truth,
 however.

 I'm struggling a bit to understand how I can configure a topic to retain
 messages indefinitely. I want to make sure that my data isn't deleted. Is
 there a guide to configuring Kafka like this?



Re: Using Kafka as a persistent store

2015-07-10 Thread Daniel Schierbeck

 On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
 
 There are two ways you can configure your topics, log compaction and with
 no cleaning. The choice depends on your use case. Are the records uniquely
 identifiable and will they receive updates? Then log compaction is the way
 to go. If they are truly read only, you can go without log compaction.

I'd rather be free to use the key for partitioning, and the records are 
immutable — they're event records — so disabling compaction altogether would be 
preferable. How is that accomplished?
 
 We have a small processes which consume a topic and perform upserts to our
 various database engines. It's easy to change how it all works and simply
 consume the single source of truth again.
 
 I've written a bit about log compaction here:
 http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
 
 On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:
 
 I'd like to use Kafka as a persistent store – sort of as an alternative to
 HDFS. The idea is that I'd load the data into various other systems in
 order to solve specific needs such as full-text search, analytics, indexing
 by various attributes, etc. I'd like to keep a single source of truth,
 however.
 
 I'm struggling a bit to understand how I can configure a topic to retain
 messages indefinitely. I want to make sure that my data isn't deleted. Is
 there a guide to configuring Kafka like this?
 


Re: Using Kafka as a persistent store

2015-07-10 Thread Jay Kreps
If I recall correctly, setting log.retention.ms and log.retention.bytes to
-1 disables both.
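
(As a sketch, that would be these two broker-side lines in server.properties; 
whether -1 is actually honoured may depend on the broker version, so treat it as 
something to verify:)

  # server.properties
  log.retention.ms=-1
  log.retention.bytes=-1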

On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
daniel.schierb...@gmail.com wrote:


  On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
 
  There are two ways you can configure your topics, log compaction and with
  no cleaning. The choice depends on your use case. Are the records
 uniquely
  identifiable and will they receive updates? Then log compaction is the
 way
  to go. If they are truly read only, you can go without log compaction.

 I'd rather be free to use the key for partitioning, and the records are
 immutable — they're event records — so disabling compaction altogether
 would be preferable. How is that accomplished?
 
  We have a small processes which consume a topic and perform upserts to
 our
  various database engines. It's easy to change how it all works and simply
  consume the single source of truth again.
 
  I've written a bit about log compaction here:
  http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
 
  On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
  I'd like to use Kafka as a persistent store – sort of as an alternative
 to
  HDFS. The idea is that I'd load the data into various other systems in
  order to solve specific needs such as full-text search, analytics,
 indexing
  by various attributes, etc. I'd like to keep a single source of truth,
  however.
 
  I'm struggling a bit to understand how I can configure a topic to retain
  messages indefinitely. I want to make sure that my data isn't deleted.
 Is
  there a guide to configuring Kafka like this?
 



Re: Using Kafka as a persistent store

2015-07-10 Thread Rad Gruchalski
Hello all,

This is a very interesting discussion. I’ve been thinking of a similar use case 
for Kafka over the last few days.  
The usual data workflow with Kafka is most likely something like this:

- ingest with Kafka
- process with Storm / Samza / whathaveyou
- put some processed data back on Kafka
- at the same time store the raw data somewhere in case everything has to 
be reprocessed in the future (HDFS, similar?)

Currently Kafka offers a couple of types of topics: a regular stream 
(non-compacted topic) and a compacted topic (key/value). In the case of a stream 
topic, when the cleanup (retention) kicks in, the “old” data is truncated. It is lost 
from Kafka. What if there were an additional cleanup setting: cold-store?
Instead of trimming old data, Kafka would compile old data into a separate log 
with its own index. The user would be free to decide what to do with such 
files: put them on NFS / S3 / Swift / HDFS… Actually, the index file is not 
needed. The only 3 things required are:

 - the folder name / partition index
 - the log itself
 - topic metadata at the time of taking the data out of the segment

With all this info, reading data back is fairly easy, even without starting 
Kafka. A sample program goes like this (Scala-ish):

import java.io.File
import java.util.Properties
import kafka.log.Log

val props = new Properties()
props.put("log.segment.bytes", "1073741824")
props.put("segment.index.bytes", "10485760") // should be 10MB

val log = new Log(
  new File("/somestorage/kafka-test-0"),
  cfg,    // presumably a kafka.log.LogConfig built from props
  0L,
  null )  // scheduler, left as null in this sketch

val fdi = log.activeSegment.read( log.logStartOffset, 
  Some(log.logEndOffset), 100 )  // 100 = maximum number of bytes to read
var msgs = 1
fdi.messageSet.iterator.foreach { msgoffset =>
  println( s"${msgoffset.message.hasKey} ::: $msgs ${msgoffset.offset} :: ${msgoffset.nextOffset}" )
  msgs = msgs + 1
  val key = new String( msgoffset.message.key.array(), "UTF-8")
  val msg = new String( msgoffset.message.payload.array(), "UTF-8")
  println( s"=== ${key}" )
  println( s"=== ${msg}" )
}


This reads from the active segment (the last known segment), but it’s easy to make 
it read from all segments (a rough sketch of that follows below). The interesting 
thing is: as long as the backup files are well formed, they can be read without 
having to put them back into Kafka itself.
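
(A rough, untested sketch of the all-segments variant, reusing the same log 
instance and the same 0.8.x internals as above:)

  log.logSegments.foreach { segment =>
    // Read each segment from its first offset; None = no upper offset bound.
    val info = segment.read(segment.baseOffset, None, Int.MaxValue)
    info.messageSet.iterator.foreach { msgoffset =>
      // handle msgoffset.message exactly as in the loop above
    }
  }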

The advantage is: what was once the raw data (as it came in) is the raw data 
forever, without having to introduce another format for storing it. Another 
advantage is: in case of reprocessing, there is no need to write a producer to 
ingest the data back, and so on (it’s possible but not necessary). Such raw 
Kafka files can be easily processed by Storm / Samza (would need another stream 
definition) / Hadoop.

This sounds like a very useful addition to Kafka. But I could be overthinking 
this...  










Kind regards,
Radek Gruchalski
ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/




On Friday, 10 July 2015 at 22:55, Daniel Schierbeck wrote:

  
  On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com 
  (mailto:shaynest...@gmail.com) wrote:
   
  There are two ways you can configure your topics, log compaction and with
  no cleaning. The choice depends on your use case. Are the records uniquely
  identifiable and will they receive updates? Then log compaction is the way
  to go. If they are truly read only, you can go without log compaction.
   
  
  
 I'd rather be free to use the key for partitioning, and the records are 
 immutable — they're event records — so disabling compaction altogether would 
 be preferable. How is that accomplished?
   
  We have a small processes which consume a topic and perform upserts to our
  various database engines. It's easy to change how it all works and simply
  consume the single source of truth again.
   
  I've written a bit about log compaction here:
  http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
   
  On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
  daniel.schierb...@gmail.com (mailto:daniel.schierb...@gmail.com) wrote:
   
   I'd like to use Kafka as a persistent store – sort of as an alternative to
   HDFS. The idea is that I'd load the data into various other systems in
   order to solve specific needs such as full-text search, analytics, 
   indexing
   by various attributes, etc. I'd like to keep a single source of truth,
   however.

   I'm struggling a bit to understand how I can configure a topic to retain
   messages indefinitely. I want to make sure that my data isn't deleted. Is
   there a guide to configuring Kafka like this?