Re: Using Kafka as a persistent store

2015-07-13 Thread Daniel Schierbeck
Would it be possible to document how to configure Kafka to never delete
messages in a topic? It took a good while to figure this out, and I see it
as an important use case for Kafka.

On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck 
daniel.schierb...@gmail.com wrote:


  On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
 
  If I recall correctly, setting log.retention.ms and log.retention.bytes
 to
  -1 disables both.

 Thanks!
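
 A minimal sketch of the configuration being discussed, assuming the broker-level
 property names mentioned above and a topic called "events" (the topic name and
 the topic-level override are only examples, not something from this thread):

   # server.properties: never expire log segments by time or size
   log.retention.ms=-1
   log.retention.bytes=-1

   # or per topic, using the topic-level equivalents retention.ms / retention.bytes
   bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic events \
     --config retention.ms=-1 --config retention.bytes=-1
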

 
  On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
 
  On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
 
  There are two ways you can configure your topics: with log compaction, or with
  no cleaning. The choice depends on your use case. Are the records uniquely
  identifiable and will they receive updates? Then log compaction is the way
  to go. If they are truly read only, you can go without log compaction.
 
  I'd rather be free to use the key for partitioning, and the records are
  immutable — they're event records — so disabling compaction altogether
  would be preferable. How is that accomplished?
 
  We have small processes which consume a topic and perform upserts to our
  various database engines. It's easy to change how it all works and simply
  consume the single source of truth again.
 
  I've written a bit about log compaction here:
 
 http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
 
  On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
  I'd like to use Kafka as a persistent store – sort of as an alternative to
  HDFS. The idea is that I'd load the data into various other systems in
  order to solve specific needs such as full-text search, analytics, indexing
  by various attributes, etc. I'd like to keep a single source of truth,
  however.
 
  I'm struggling a bit to understand how I can configure a topic to retain
  messages indefinitely. I want to make sure that my data isn't deleted. Is
  there a guide to configuring Kafka like this?
 



Re: performance benchmarking of kafka

2015-07-13 Thread Yuheng Du
Hi,

Appreciate your response. It works now! It was just a typo in the class
name :(.

It really has nothing to do with whether you are using the binaries or the
source version of kafka.

Thanks everyone!
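
For anyone hitting the same error: the class name in the failing command was
missing the "s" in "clients". With the jar in place, the invocation from the
original post (same topic, arguments, and broker address) looks like:

  bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance \
    test7 5000 100 -1 acks=1 bootstrap.servers=192.168.1.1:9092 \
    buffer.memory=67108864 batch.size=8196
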

On Mon, Jul 13, 2015 at 11:18 PM, tao xiao xiaotao...@gmail.com wrote:

 org.apache.kafka.clients.tools.ProducerPerformance resides in
 kafka-clients-0.8.2.1.jar.
 You need to make sure the jar exists in $KAFKA_HOME/libs/. I use
 kafka_2.10-0.8.2.1
 too and here is the output

 % bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance

 USAGE: java org.apache.kafka.clients.tools.ProducerPerformance topic_name
 num_records record_size target_records_sec [prop_name=prop_value]*



 On Tue, 14 Jul 2015 at 05:08 Yuheng Du yuheng.du.h...@gmail.com wrote:

  I am using the binaries of kafka_2.10-0.8.2.1. Could that be the problem?
  Should I copy the source of kafka-0.8.2.1-src.tgz to each of my machines,
  build it, and run the test?
  Thanks.
 
  On Mon, Jul 13, 2015 at 4:37 PM, JIEFU GONG jg...@berkeley.edu wrote:
 
   You may need to open up your run-class.sh in a text editor and modify the
   classpath -- I believe I had a similar error before.
  
   On Mon, Jul 13, 2015 at 1:16 PM, Yuheng Du yuheng.du.h...@gmail.com
   wrote:
  
Hi guys,
   
 I am trying to replicate the test of benchmarking kafka at
 http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

 When I run

 bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance
 test7 5000 100 -1 acks=1 bootstrap.servers=192.168.1.1:9092
 buffer.memory=67108864 batch.size=8196
   
and I got the following error:
Error: Could not find or load main class
org.apache.kafka.client.tools.ProducerPerformance
   
What should I fix? Thank you!
   
  
  
  
   --
  
   Jiefu Gong
   University of California, Berkeley | Class of 2017
   B.A Computer Science | College of Letters and Sciences
  
   jg...@berkeley.edu elise...@berkeley.edu | (925) 400-3427
  
 



Re: performance benchmarking of kafka

2015-07-13 Thread tao xiao
org.apache.kafka.clients.tools.ProducerPerformance resides in
kafka-clients-0.8.2.1.jar.
You need to make sure the jar exists in $KAFKA_HOME/libs/. I use
kafka_2.10-0.8.2.1
too and here is the output

% bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance

USAGE: java org.apache.kafka.clients.tools.ProducerPerformance topic_name
num_records record_size target_records_sec [prop_name=prop_value]*



On Tue, 14 Jul 2015 at 05:08 Yuheng Du yuheng.du.h...@gmail.com wrote:

 I am using the binaries of kafka_2.10-0.8.2.1. Could that be the problem?
 Should I copy the source of kafka-0.8.2.1-src.tgz to each of my machines,
 build it, and run the test?
 Thanks.

 On Mon, Jul 13, 2015 at 4:37 PM, JIEFU GONG jg...@berkeley.edu wrote:

  You may need to open up your run-class.sh in a text editor and modify the
  classpath -- I believe I had a similar error before.
 
  On Mon, Jul 13, 2015 at 1:16 PM, Yuheng Du yuheng.du.h...@gmail.com
  wrote:
 
   Hi guys,
  
   I am trying to replicate the test of benchmarking kafka at
   http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

   When I run

   bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance
   test7 5000 100 -1 acks=1 bootstrap.servers=192.168.1.1:9092
   buffer.memory=67108864 batch.size=8196
  
   and I got the following error:
   Error: Could not find or load main class
   org.apache.kafka.client.tools.ProducerPerformance
  
   What should I fix? Thank you!
  
 
 
 
  --
 
  Jiefu Gong
  University of California, Berkeley | Class of 2017
  B.A Computer Science | College of Letters and Sciences
 
  jg...@berkeley.edu elise...@berkeley.edu | (925) 400-3427
 



Re: Data Structure abstractions over kafka

2015-07-13 Thread Ewen Cheslack-Postava
Tim,

Kafka can be used as a key-value store if you turn on log compaction:
http://kafka.apache.org/documentation.html#compaction You need to be
careful with that since it's purely last-writer-wins and doesn't have
anything like CAS that might help you manage concurrent writers, but the
basic functionality is there. This is used by the brokers to store offsets
in Kafka (where keys are (consumer-group, topic, partition), values are the
offset, and they already have a mechanism to ensure only a single writer at
a time).
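
To make that concrete, a small sketch of using a compacted topic as a
last-writer-wins key-value store with the 0.8.2-era Java producer. The topic
name, broker address, and record values are made up, and the topic is assumed
to have been created with the topic-level config cleanup.policy=compact:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyValueWriter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // The record key acts as the "key" of the store; after compaction only
        // the most recent value per key is retained (last-writer-wins, no CAS).
        producer.send(new ProducerRecord<>("kv-store", "user-42", "{\"name\":\"Alice\"}"));
        producer.send(new ProducerRecord<>("kv-store", "user-42", "{\"name\":\"Alice B.\"}"));
        producer.close();
    }
}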

You could possibly use this to implement the linked list functionality
you're talking about, although there are probably a number of challenges
(e.g., performing atomic updates if you need a doubly-linked list, ensuring
garbage is collected after removals even if you only need a singly-linked
list, etc.). Also, I'm not sure it would be particularly efficient: you'd
still need to ensure a single writer (or at least a single writer per
linked-list node), etc.

You're almost definitely better off using a specialized store for something
like that simply because Kafka isn't designed around that use case, but
it'd be interesting to see how far you could get with Kafka's current
functionality, and what would be required to make it practical!

-Ewen

On Mon, Jul 13, 2015 at 11:36 AM, Tim Smith secs...@gmail.com wrote:

 Hi,

 In the big data ecosystem, I have started to use kafka, essentially, as a:
 -  unordered list/array, and
 - a cluster-wide pipe

 I guess you could argue that any message bus product is a simple array/pipe
 but kafka's scale and model make things so easy :)

 I am wondering if there are any abstractions on top of kafka that will let
 me use kafka to store/organize other simple data structures like a
 linked-list? I have a use case for a massive linked list that can easily grow
 to tens of gigabytes and could easily use (1) redundancy and (2) multiple
 producers/consumers working on processing the list (implemented over spark,
 storm etc).

 Any ideas? Maybe maintain a linked-list of offsets in another store like
 ZooKeeper or a NoSQL DB while storing the messages on kafka?

 Thanks,

 - Tim




-- 
Thanks,
Ewen


Re: Using Kafka as a persistent store

2015-07-13 Thread Scott Thibault
We've tried to use Kafka not as a persistent store, but as a long-term
archival store.  An outstanding issue we've had with that is that the
broker holds on to an open file handle on every file in the log!  The other
issue we've had is that when you create a long-term archival log on shared
storage, you can't simply access that data from another cluster, because the
metadata is stored in zookeeper rather than in the log.

--Scott Thibault
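
A practical corollary of the open-handle behaviour: on a long-retention topic
the broker's file descriptor limit has to grow with the number of segment
files. A rough way to keep an eye on it, assuming a single broker process on
the host (the broker's JVM runs the kafka.Kafka main class):

  # count files the broker currently holds open
  lsof -p $(pgrep -f kafka.Kafka) | wc -l

  # the limit the broker is actually running under
  grep 'open files' /proc/$(pgrep -f kafka.Kafka)/limits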


On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck 
daniel.schierb...@gmail.com wrote:

 Would it be possible to document how to configure Kafka to never delete
 messages in a topic? It took a good while to figure this out, and I see it
 as an important use case for Kafka.

 On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:

 
   On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
  
   If I recall correctly, setting log.retention.ms and log.retention.bytes to
   -1 disables both.
 
  Thanks!
 
   On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
   daniel.schierb...@gmail.com wrote:
 
    On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
 
    There are two ways you can configure your topics, log compaction and with
    no cleaning. The choice depends on your use case. Are the records uniquely
    identifiable and will they receive updates? Then log compaction is the way
    to go. If they are truly read only, you can go without log compaction.
 
   I'd rather be free to use the key for partitioning, and the records are
   immutable — they're event records — so disabling compaction altogether
   would be preferable. How is that accomplished?
 
    We have a small processes which consume a topic and perform upserts to our
    various database engines. It's easy to change how it all works and simply
    consume the single source of truth again.
 
    I've written a bit about log compaction here:
    http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
 
    On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
    daniel.schierb...@gmail.com wrote:
 
     I'd like to use Kafka as a persistent store – sort of as an alternative to
     HDFS. The idea is that I'd load the data into various other systems in
     order to solve specific needs such as full-text search, analytics, indexing
     by various attributes, etc. I'd like to keep a single source of truth,
     however.
 
     I'm struggling a bit to understand how I can configure a topic to retain
     messages indefinitely. I want to make sure that my data isn't deleted. Is
     there a guide to configuring Kafka like this?
 




-- 
*This e-mail is not encrypted.  Due to the unsecured nature of unencrypted
e-mail, there may be some level of risk that the information in this e-mail
could be read by a third party.  Accordingly, the recipient(s) named above
are hereby advised to not communicate protected health information using
this e-mail address.  If you desire to send protected health information
electronically, please contact MultiScale Health Networks at (206)538-6090*


Re: Using Kafka as a persistent store

2015-07-13 Thread Rad Gruchalski
Scott,  

This is what I was trying to target in one of my previous responses to
Daniel, the one in which I suggest another compaction setting for kafka.










Kind regards,
Radek Gruchalski
ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

Confidentiality:
This communication is intended for the above-named person and may be
confidential and/or legally privileged. If it has come to you in error you
must take no action based on it, nor must you copy or show it to anyone;
please delete/destroy and inform the sender immediately.



On Monday, 13 July 2015 at 15:41, Scott Thibault wrote:

 We've tried to use Kafka not as a persistent store, but as a long-term
 archival store. An outstanding issue we've had with that is that the
 broker holds on to an open file handle on every file in the log! The other
 issue we've had is when you create a long-term archival log on shared
 storage, you can't simply access that data from another cluster b/c of meta
 data being stored in zookeeper rather than in the log.
  
 --Scott Thibault
  
  
 On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:
  
  Would it be possible to document how to configure Kafka to never delete
  messages in a topic? It took a good while to figure this out, and I see it
  as an important use case for Kafka.
   
  On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
    
    On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
     
    If I recall correctly, setting log.retention.ms and log.retention.bytes to
    -1 disables both.
   
   Thanks!
  
    On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
    daniel.schierb...@gmail.com wrote:
     
     On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
      
     There are two ways you can configure your topics, log compaction and with
     no cleaning. The choice depends on your use case. Are the records uniquely
     identifiable and will they receive updates? Then log compaction is the way
     to go. If they are truly read only, you can go without log compaction.
      
    I'd rather be free to use the key for partitioning, and the records are
    immutable — they're event records — so disabling compaction altogether
    would be preferable. How is that accomplished?
      
     We have a small processes which consume a topic and perform upserts to our
     various database engines. It's easy to change how it all works and simply
     consume the single source of truth again.
      
     I've written a bit about log compaction here:
     http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
      
     On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
     daniel.schierb...@gmail.com wrote:
      
      I'd like to use Kafka as a persistent store – sort of as an alternative to
      HDFS. The idea is that I'd load the data into various other systems in
      order to solve specific needs such as full-text search, analytics,
      indexing by various attributes, etc. I'd like to keep a single source of
      truth, however.

      I'm struggling a bit to understand how I can configure a topic to retain
      messages indefinitely. I want to make sure that my data isn't deleted.
      Is there a guide to configuring Kafka like this?
  
  




Re: Kafka as an event store for Event Sourcing

2015-07-13 Thread Ben Kirwin
Ah, just saw this. I actually just submitted a patch this evening --
just for the partitionwide version at the moment, since it turns out
to be pretty simple to implement. Still very interested in moving
forward with this stuff, though not always as much time as I would
like...
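
To make the idea concrete, a rough sketch of the partition-wide check being
discussed. Nothing like this exists in the current producer API; the method
and result type below are made up purely to illustrate the semantics:

  // The producer states which offset it expects to be the next one in the
  // partition; the broker appends the record only if that is still true.
  long expectedOffset = lastObservedOffset + 1;
  ConditionalResult r = producer.sendConditional(            // hypothetical API
          new ProducerRecord<>("events", 0, key, value), expectedOffset);
  if (!r.accepted()) {
      // Someone else published first: re-read the new records, re-validate
      // the command against the updated state, and retry if still applicable.
  }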

On Thu, Jul 9, 2015 at 9:39 AM, Daniel Schierbeck
daniel.schierb...@gmail.com wrote:
 Ben, are you still interested in working on this?

 On Mon, Jun 15, 2015 at 9:49 AM Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:

 I like to refer to it as conditional write or conditional request,
 semantically similar to HTTP's If-Match header.

 Ben: I'm adding a comment about per-key checking to your JIRA.

 On Mon, Jun 15, 2015 at 4:06 AM Ben Kirwin b...@kirw.in wrote:

 Yeah, it's definitely not a standard CAS, but it feels like the right
 fit for the commit log abstraction -- CAS on a 'current value' does
 seem a bit too key-value-store-ish for Kafka to support natively.

 I tried to avoid referring to the check-offset-before-publish
 functionality as a CAS in the ticket because, while they're both types
 of 'optimistic concurrency control', they are a bit different -- and
 the offset check is both easier to implement and handier for the stuff
 I tend to work on. (Though that ticket's about checking the latest
 offset on a whole partition, not the key -- there's a different set of
 tradeoffs for the latter, and I haven't thought it through properly
 yet.)

 On Sat, Jun 13, 2015 at 3:35 PM, Ewen Cheslack-Postava
 e...@confluent.io wrote:
  If you do CAS where you compare the offset of the current record for the
  key, then yes. This might work fine for applications that track key,
 value,
  and offset. It is not quite the same as doing a normal CAS.
 
  On Sat, Jun 13, 2015 at 12:07 PM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
  But wouldn't the key-offset table be enough to accept or reject a
 write?
  I'm not familiar with the exact implementation of Kafka, so I may be
 wrong.
 
  On lør. 13. jun. 2015 at 21.05 Ewen Cheslack-Postava 
 e...@confluent.io
  wrote:
 
   Daniel: By random read, I meant not reading the data sequentially as is the
   norm in Kafka, not necessarily a random disk seek. That in-memory data
   structure is what enables the random read. You're either going to need the
   disk seek if the data isn't in the fs cache or you're trading memory to
   avoid it. If it's a full index containing keys and values then you're
   potentially committing to a much larger JVM memory footprint (and all the
   GC issues that come with it) since you'd be storing that data in the JVM
   heap. If you're only storing the keys + offset info, then you potentially
   introduce random disk seeks on any CAS operation (and making page caching
   harder for the OS, etc.).
  
  
   On Sat, Jun 13, 2015 at 11:33 AM, Daniel Schierbeck 
   daniel.schierb...@gmail.com wrote:
  
 Ewen: would single-key CAS necessitate random reads? My idea was to have
 the broker maintain an in-memory table that could be rebuilt from the log
 or a snapshot.
 On lør. 13. jun. 2015 at 20.26 Ewen Cheslack-Postava e...@confluent.io
 wrote:
   
  Jay - I think you need broker support if you want CAS to work with
  compacted topics. With the approach you described you can't turn on
  compaction since that would make it last-writer-wins, and using any
  non-infinite retention policy would require some external process to
  monitor keys that might expire and refresh them by rewriting the data.
 
  That said, I think any addition like this warrants a lot of discussion
  about potential use cases since there are a lot of ways you could go adding
  support for something like this. I think this is an obvious next
  incremental step, but someone is bound to have a use case that would
  require multi-key CAS and would be costly to build atop single key CAS. Or,
  since the compare requires a random read anyway, why not throw in
  read-by-key rather than sequential log reads, which would allow for
  minitransactions a la Sinfonia?

  I'm not convinced trying to make Kafka support traditional key-value store
  functionality is a good idea. Compacted topics made it possible to use it a
  bit more in that way, but didn't change the public interface, only the way
  storage was implemented, and importantly all the potential additional
  performance costs & data structures are isolated to background threads.

 -Ewen

 On Sat, Jun 13, 2015 at 9:59 AM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:

  @Jay:
 
   Regarding your first proposal: wouldn't that mean that a producer wouldn't
   know whether a write succeeded? In the case of event sourcing, a failed CAS
   may require re-validating the input with the new state. Simply discarding
   the write would 

Re: kafka benchmark tests

2015-07-13 Thread Ewen Cheslack-Postava
I implemented (nearly) the same basic set of tests in the system test
framework we started at Confluent and that is going to move into Kafka --
see the wip patch for KIP-25 here: https://github.com/apache/kafka/pull/70
In particular, that test is implemented in benchmark_test.py:
https://github.com/apache/kafka/pull/70/files#diff-ca984778cf9943407645eb6784f19dc8

Hopefully once that's merged people can reuse that benchmark (and add to
it!) so they can easily run the same benchmarks across different hardware.
Here are some results from an older version of that test on m3.2xlarge
instances on EC2 using local ephemeral storage (I think... it's been awhile
since I ran these numbers and I didn't document methodology that carefully):

INFO:_.KafkaBenchmark:=
INFO:_.KafkaBenchmark:BENCHMARK RESULTS
INFO:_.KafkaBenchmark:=
INFO:_.KafkaBenchmark:Single producer, no replication: 684097.470208
rec/sec (65.24 MB/s)
INFO:_.KafkaBenchmark:Single producer, async 3x replication:
667494.359673 rec/sec (63.66 MB/s)
INFO:_.KafkaBenchmark:Single producer, sync 3x replication:
116485.764275 rec/sec (11.11 MB/s)
INFO:_.KafkaBenchmark:Three producers, async 3x replication:
1696519.022182 rec/sec (161.79 MB/s)
INFO:_.KafkaBenchmark:Message size:
INFO:_.KafkaBenchmark: 10: 1637825.195625 rec/sec (15.62 MB/s)
INFO:_.KafkaBenchmark: 100: 605504.877911 rec/sec (57.75 MB/s)
INFO:_.KafkaBenchmark: 1000: 90351.817570 rec/sec (86.17 MB/s)
INFO:_.KafkaBenchmark: 10000: 8306.180862 rec/sec (79.21 MB/s)
INFO:_.KafkaBenchmark: 100000: 978.403499 rec/sec (93.31 MB/s)
INFO:_.KafkaBenchmark:Throughput over long run, data > memory:
INFO:_.KafkaBenchmark: Time block 0: 684725.151324 rec/sec (65.30 MB/s)
INFO:_.KafkaBenchmark:Single consumer: 701031.14 rec/sec (56.830500 MB/s)
INFO:_.KafkaBenchmark:Three consumers: 3304011.014900 rec/sec (267.830800 MB/s)
INFO:_.KafkaBenchmark:Producer + consumer:
INFO:_.KafkaBenchmark: Producer: 624984.375391 rec/sec (59.60 MB/s)
INFO:_.KafkaBenchmark: Consumer: 624984.375391 rec/sec (59.60 MB/s)
INFO:_.KafkaBenchmark:End-to-end latency: median 2.00 ms, 99%
4.00 ms, 99.9% 19.00 ms

Don't trust these numbers for anything; they were a quick one-off test. I'm
just pasting the output so you get some idea of what the results might look
like. Once we merge the KIP-25 patch, Confluent will be running the tests
regularly and results will be available publicly so we'll be able to keep
better tabs on performance, albeit for only a specific class of hardware.
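
For reading the throughput lines: rec/sec times the record size gives the byte
rate, and the MB/s column appears to treat a megabyte as 2^20 bytes. For
example, assuming the 100-byte records used in the producer commands earlier
in this thread:

  684097.470208 rec/sec x 100 bytes ~= 68,409,747 bytes/sec
  68,409,747 / 1,048,576 ~= 65.24 MB/s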

For the batch.size question -- I'm not sure the results in the blog post
actually have different settings, it could be accidental divergence between
the script and the blog post. The post specifically notes that tuning the
batch size in the synchronous case might help, but that he didn't do that.
If you're trying to benchmark the *optimal* throughput, tuning the batch
size would make sense. Since synchronous replication will have higher
latency and there's a limit to how many requests can be in flight at once,
you'll want a larger batch size to compensate for the additional latency.
However, in practice the increase you see may be negligible. Somebody who
has spent more time fiddling with tweaking producer performance may have
more insight.
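
As a rough illustration of that kind of tuning (values are illustrative, not
recommendations), the synchronous-replication run could be given a larger
batch so that each higher-latency request carries more data:

  acks=-1                 # wait for all in-sync replicas (synchronous replication)
  batch.size=65536        # bigger batches amortize the per-request latency
  linger.ms=5             # allow a short wait so batches can actually fill
  buffer.memory=67108864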

-Ewen

On Mon, Jul 13, 2015 at 10:08 AM, JIEFU GONG jg...@berkeley.edu wrote:

 Hi all,

 I was wondering if any of you guys have done benchmarks on Kafka
 performance before, and if they or their details (# nodes in cluster, #
 records / size(s) of messages, etc.) could be shared.

 For comparison purposes, I am trying to benchmark Kafka against some
 similar services such as Kinesis or Scribe. Additionally, I was wondering
 if anyone could shed some insight on Jay Kreps' benchmarks that he has
 openly published here:

 https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

 Specifically, I am unsure of why between his tests of 3x synchronous
 replication and 3x async replication he changed the batch.size, as well as
 why he is seemingly publishing to incorrect topics:

 Configs:
 https://gist.github.com/jkreps/c7ddb4041ef62a900e6c

 Any help is greatly appreciated!



 --

 Jiefu Gong
 University of California, Berkeley | Class of 2017
 B.A Computer Science | College of Letters and Sciences

 jg...@berkeley.edu elise...@berkeley.edu | (925) 400-3427




-- 
Thanks,
Ewen


Re: Using Kafka as a persistent store

2015-07-13 Thread Gwen Shapira
Hi,

1. What you described sounds like a reasonable architecture, but may I
ask why JSON? Avro seems better supported in the ecosystem
(Confluent's tools, Hadoop integration, schema evolution, tools, etc).

1.5 If all you do is convert data into JSON, SparkStreaming sounds like
difficult-to-manage overkill compared to Flume or a slightly modified
MirrorMaker (or CopyCat, if it exists yet). Any specific reasons for
SparkStreaming?

2. Different compute engines prefer different storage formats because
in most cases that's where optimizations come from. Parquet improves
scan performance for Impala and MR, but will be pretty horrible for
NoSQL. So, I wouldn't hold my breath for compute engines to start
sharing data storage suddenly.

Gwen

On Mon, Jul 13, 2015 at 11:45 AM, Tim Smith secs...@gmail.com wrote:
 I have had a similar issue where I wanted a single source of truth between
 Search and HDFS. First, if you zoom out a little, eventually you are going
 to have some compute engine(s) process the data. If you store it in a
 compute neutral tier like kafka then you will need to suck the data out at
 runtime and stage it for the compute engine to use. So pick your poison,
 process at ingest and store multiple copies of data, one per compute
 engine, OR store in a neutral store and process at runtime. I am not saying
 one is better than the other but that's how I see the trade-off so
 depending on your use cases, YMMV.

 What I do is:
 - store raw data into kafka
 - use spark streaming to transform data to JSON and post it back to kafka
 - Hang multiple data stores off kafka that ingest the JSON
 - Not do any other transformations in the consumer stores and store the
 copy as immutable event

 So I do have multiple copies (one per compute tier) but they all look the
 same.

 Unless different compute engines, natively start to use a common data
 storage format, I don't see how one could get away from storing multiple
 copies. Primarily, I see Lucene based products have their format, the
 Hadoop ecosystem seems congregating around Parquet and then the NoSQL
 players have their formats (one per each product).

 My 2 cents worth :)



 On Mon, Jul 13, 2015 at 10:35 AM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:

 Am I correct in assuming that Kafka will only retain a file handle for the
 last segment of the log? If the number of handles grows unbounded, then it
 would be an issue. But I plan on writing to this topic continuously anyway,
 so not separating data into cold and hot storage is the entire point.

 Daniel Schierbeck

  On 13. jul. 2015, at 15.41, Scott Thibault 
 scott.thiba...@multiscalehn.com wrote:
 
  We've tried to use Kafka not as a persistent store, but as a long-term
  archival store.  An outstanding issue we've had with that is that the
  broker holds on to an open file handle on every file in the log!  The
 other
  issue we've had is when you create a long-term archival log on shared
  storage, you can't simply access that data from another cluster b/c of
 meta
  data being stored in zookeeper rather than in the log.
 
  --Scott Thibault
 
 
  On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
  Would it be possible to document how to configure Kafka to never delete
  messages in a topic? It took a good while to figure this out, and I see
 it
  as an important use case for Kafka.
 
  On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
 
  On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
 
  If I recall correctly, setting log.retention.ms and
  log.retention.bytes
  to
  -1 disables both.
 
  Thanks!
 
 
  On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
 
  On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
 
  There are two ways you can configure your topics, log compaction and
  with
  no cleaning. The choice depends on your use case. Are the records
  uniquely
  identifiable and will they receive updates? Then log compaction is
  the
  way
  to go. If they are truly read only, you can go without log
  compaction.
 
  I'd rather be free to use the key for partitioning, and the records
  are
  immutable — they're event records — so disabling compaction
 altogether
  would be preferable. How is that accomplished?
 
  We have a small processes which consume a topic and perform upserts
  to
  our
  various database engines. It's easy to change how it all works and
  simply
  consume the single source of truth again.
 
  I've written a bit about log compaction here:
 
 http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
 
  On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
  I'd like to use Kafka as a persistent store – sort of as an
  alternative
  to
  HDFS. The idea is that I'd load the data into various other systems
  in
  order to solve specific needs such as full-text 

kafka benchmark tests

2015-07-13 Thread JIEFU GONG
Hi all,

I was wondering if any of you guys have done benchmarks on Kafka
performance before, and if they or their details (# nodes in cluster, #
records / size(s) of messages, etc.) could be shared.

For comparison purposes, I am trying to benchmark Kafka against some
similar services such as Kinesis or Scribe. Additionally, I was wondering
if anyone could shed some insight on Jay Kreps' benchmarks that he has
openly published here:
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

Specifically, I am unsure of why between his tests of 3x synchronous
replication and 3x async replication he changed the batch.size, as well as
why he is seemingly publishing to incorrect topics:

Configs:
https://gist.github.com/jkreps/c7ddb4041ef62a900e6c

Any help is greatly appreciated!



-- 

Jiefu Gong
University of California, Berkeley | Class of 2017
B.A Computer Science | College of Letters and Sciences

jg...@berkeley.edu elise...@berkeley.edu | (925) 400-3427


Re: Using Kafka as a persistent store

2015-07-13 Thread James Cheng
For what it's worth, I did something similar to Rad's suggestion of 
cold-storage to add long-term archiving when using Amazon Kinesis. Kinesis is 
also a message bus, but only has a 24 hour retention window.

I wrote a Kinesis consumer that would take all messages from Kinesis and save 
them into S3. I stored them in S3 in such a way that the structure mirrors the 
original Kinesis stream, and all message metadata is preserved (message offsets 
and primary keys, for example).

This means that I can write a consumer that would consume from S3 files in 
the same way that it would consume from the Kinesis stream itself. And the data 
is structured such that when you are done reading from S3, you can connect to 
the Kinesis stream at the point where the S3 archive left off.

This effectively allowed me to add a configurable retention period when 
consuming from Kinesis.

-James

On Jul 13, 2015, at 11:45 AM, Tim Smith secs...@gmail.com wrote:

 I have had a similar issue where I wanted a single source of truth between
 Search and HDFS. First, if you zoom out a little, eventually you are going
 to have some compute engine(s) process the data. If you store it in a
 compute neutral tier like kafka then you will need to suck the data out at
 runtime and stage it for the compute engine to use. So pick your poison,
 process at ingest and store multiple copies of data, one per compute
 engine, OR store in a neutral store and process at runtime. I am not saying
 one is better than the other but that's how I see the trade-off so
 depending on your use cases, YMMV.
 
 What I do is:
 - store raw data into kafka
 - use spark streaming to transform data to JSON and post it back to kafka
 - Hang multiple data stores off kafka that ingest the JSON
 - Not do any other transformations in the consumer stores and store the
 copy as immutable event
 
 So I do have multiple copies (one per compute tier) but they all look the
 same.
 
 Unless different compute engines, natively start to use a common data
 storage format, I don't see how one could get away from storing multiple
 copies. Primarily, I see Lucene based products have their format, the
 Hadoop ecosystem seems congregating around Parquet and then the NoSQL
 players have their formats (one per each product).
 
 My 2 cents worth :)
 
 
 
 On Mon, Jul 13, 2015 at 10:35 AM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:
 
 Am I correct in assuming that Kafka will only retain a file handle for the
 last segment of the log? If the number of handles grows unbounded, then it
 would be an issue. But I plan on writing to this topic continuously anyway,
 so not separating data into cold and hot storage is the entire point.
 
 Daniel Schierbeck
 
 On 13. jul. 2015, at 15.41, Scott Thibault 
 scott.thiba...@multiscalehn.com wrote:
 
 We've tried to use Kafka not as a persistent store, but as a long-term
 archival store.  An outstanding issue we've had with that is that the
 broker holds on to an open file handle on every file in the log!  The
 other
 issue we've had is when you create a long-term archival log on shared
 storage, you can't simply access that data from another cluster b/c of
 meta
 data being stored in zookeeper rather than in the log.
 
 --Scott Thibault
 
 
 On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:
 
 Would it be possible to document how to configure Kafka to never delete
 messages in a topic? It took a good while to figure this out, and I see
 it
 as an important use case for Kafka.
 
 On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:
 
 
 On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
 
 If I recall correctly, setting log.retention.ms and
 log.retention.bytes
 to
 -1 disables both.
 
 Thanks!
 
 
 On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:
 
 
 On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
 
 There are two ways you can configure your topics, log compaction and
 with
 no cleaning. The choice depends on your use case. Are the records
 uniquely
 identifiable and will they receive updates? Then log compaction is
 the
 way
 to go. If they are truly read only, you can go without log
 compaction.
 
 I'd rather be free to use the key for partitioning, and the records
 are
 immutable — they're event records — so disabling compaction
 altogether
 would be preferable. How is that accomplished?
 
 We have a small processes which consume a topic and perform upserts
 to
 our
 various database engines. It's easy to change how it all works and
 simply
 consume the single source of truth again.
 
 I've written a bit about log compaction here:
 
 http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
 
 On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:
 
 I'd like to use Kafka as a persistent store – sort of as an
 alternative
 to
 HDFS. The idea is that 

Re: Using Kafka as a persistent store

2015-07-13 Thread Tim Smith
I have had a similar issue where I wanted a single source of truth between
Search and HDFS. First, if you zoom out a little, eventually you are going
to have some compute engine(s) process the data. If you store it in a
compute neutral tier like kafka then you will need to suck the data out at
runtime and stage it for the compute engine to use. So pick your poison:
process at ingest and store multiple copies of data (one per compute
engine), OR store in a neutral store and process at runtime. I am not saying
one is better than the other, but that's how I see the trade-off, so
depending on your use cases, YMMV.

What I do is:
- store raw data into kafka
- use spark streaming to transform data to JSON and post it back to kafka
- Hang multiple data stores off kafka that ingest the JSON
- Not do any other transformations in the consumer stores and store the
copy as immutable event

So I do have multiple copies (one per compute tier) but they all look the
same.
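
Stripped of Spark, the middle step of that pipeline is essentially a
consume-transform-produce loop. A rough sketch against the 0.8.2 APIs (topic
names, group id, and the JSON wrapping are made up; Tim's version runs this
step inside Spark Streaming rather than a plain loop):

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.MessageAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RawToJson {
    public static void main(String[] args) {
        // old high-level consumer for the raw topic
        Properties cprops = new Properties();
        cprops.put("zookeeper.connect", "localhost:2181");
        cprops.put("group.id", "raw-to-json");
        ConsumerConnector consumer =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(cprops));

        // new producer for the JSON topic
        Properties pprops = new Properties();
        pprops.put("bootstrap.servers", "localhost:9092");
        pprops.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        pprops.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(pprops);

        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                consumer.createMessageStreams(Collections.singletonMap("raw-events", 1));

        // read raw records, wrap them as JSON (placeholder), and post them back
        for (MessageAndMetadata<byte[], byte[]> msg : streams.get("raw-events").get(0)) {
            String json = "{\"raw\":\"" + new String(msg.message()) + "\"}";
            producer.send(new ProducerRecord<>("json-events", json));
        }
    }
}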

Unless different compute engines natively start to use a common data
storage format, I don't see how one could get away from storing multiple
copies. Primarily, I see Lucene-based products having their own format, the
Hadoop ecosystem congregating around Parquet, and the NoSQL
players having their own formats (one per product).

My 2 cents worth :)



On Mon, Jul 13, 2015 at 10:35 AM, Daniel Schierbeck 
daniel.schierb...@gmail.com wrote:

 Am I correct in assuming that Kafka will only retain a file handle for the
 last segment of the log? If the number of handles grows unbounded, then it
 would be an issue. But I plan on writing to this topic continuously anyway,
 so not separating data into cold and hot storage is the entire point.

 Daniel Schierbeck

  On 13. jul. 2015, at 15.41, Scott Thibault 
 scott.thiba...@multiscalehn.com wrote:
 
  We've tried to use Kafka not as a persistent store, but as a long-term
  archival store.  An outstanding issue we've had with that is that the
  broker holds on to an open file handle on every file in the log!  The
 other
  issue we've had is when you create a long-term archival log on shared
  storage, you can't simply access that data from another cluster b/c of
 meta
  data being stored in zookeeper rather than in the log.
 
  --Scott Thibault
 
 
  On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
  Would it be possible to document how to configure Kafka to never delete
  messages in a topic? It took a good while to figure this out, and I see
 it
  as an important use case for Kafka.
 
  On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
 
  On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
 
  If I recall correctly, setting log.retention.ms and
  log.retention.bytes
  to
  -1 disables both.
 
  Thanks!
 
 
  On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
 
  On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
 
  There are two ways you can configure your topics, log compaction and
  with
  no cleaning. The choice depends on your use case. Are the records
  uniquely
  identifiable and will they receive updates? Then log compaction is
  the
  way
  to go. If they are truly read only, you can go without log
  compaction.
 
  I'd rather be free to use the key for partitioning, and the records
  are
  immutable — they're event records — so disabling compaction
 altogether
  would be preferable. How is that accomplished?
 
  We have a small processes which consume a topic and perform upserts
  to
  our
  various database engines. It's easy to change how it all works and
  simply
  consume the single source of truth again.
 
  I've written a bit about log compaction here:
 
 http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
 
  On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
  I'd like to use Kafka as a persistent store – sort of as an
  alternative
  to
  HDFS. The idea is that I'd load the data into various other systems
  in
  order to solve specific needs such as full-text search, analytics,
  indexing
  by various attributes, etc. I'd like to keep a single source of
  truth,
  however.
 
  I'm struggling a bit to understand how I can configure a topic to
  retain
  messages indefinitely. I want to make sure that my data isn't
  deleted.
  Is
  there a guide to configuring Kafka like this?
 
 
 



Data Structure abstractions over kafka

2015-07-13 Thread Tim Smith
Hi,

In the big data ecosystem, I have started to use kafka, essentially, as a:
-  unordered list/array, and
- a cluster-wide pipe

I guess you could argue that any message bus product is a simple array/pipe
but kafka's scale and model make things so easy :)

I am wondering if there are any abstractions on top of kafka that will let
me use kafka to store/organize other simple data structures like a
linked-list? I have a use case for a massive linked list that can easily grow
to tens of gigabytes and could easily use (1) redundancy and (2) multiple
producers/consumers working on processing the list (implemented over spark,
storm etc).

Any ideas? Maybe maintain a linked-list of offsets in another store like
ZooKeeper or a NoSQL DB while storing the messages on kafka?
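
To sketch what that might look like (entirely hypothetical; topic name, node
payloads, and the writeLink helper are made up), each node's payload would
live in the Kafka log and only its coordinates plus link pointers would live
in the external store:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class LinkedListOverKafka {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // 1. Append the node payload to the log and learn where it landed.
        RecordMetadata meta =
                producer.send(new ProducerRecord<>("list-nodes", "node payload")).get();

        // 2. Record (partition, offset) and the link pointer in the external
        //    store (ZooKeeper, a NoSQL DB, ...); writeLink is hypothetical.
        // writeLink(nodeId, meta.partition(), meta.offset(), previousNodeId);

        producer.close();
    }
}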

Thanks,

- Tim


Re: Using Kafka as a persistent store

2015-07-13 Thread Rad Gruchalski
Sounds like the same idea. The nice thing about having such an option is that,
with a correct application of containers and a backup and restore strategy, one
can create an infinite ordered backup of the raw input stream using the native
Kafka storage format.
I understand the point of having the data in other formats in other systems.
Impossible to get away from that.
My concept presented a few days ago is to address having “multiple same-looking
copies of the truth”.

At the end of the day, if something happens with operational data, it will have
to be recreated from “the truth”. But, if the data was once ingested over Kafka
and there is already a pipeline for building operational state from Kafka, why
would someone write another processing logic to get the truth, say, from
Hadoop? And if fast, parallel processing of the native Kafka format is required,
it can still be done with Samza or Hadoop / whathaveyou.










Kind regards,
Radek Gruchalski
ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

Confidentiality:
This communication is intended for the above-named person and may be
confidential and/or legally privileged. If it has come to you in error you
must take no action based on it, nor must you copy or show it to anyone;
please delete/destroy and inform the sender immediately.



On Monday, 13 July 2015 at 21:17, James Cheng wrote:

 For what it's worth, I did something similar to Rad's suggestion of 
 cold-storage to add long-term archiving when using Amazon Kinesis. Kinesis 
 is also a message bus, but only has a 24 hour retention window.
  
 I wrote a Kinesis consumer that would take all messages from Kinesis and save 
 them into S3. I stored them in S3 in such a way that the structure mirrors 
 the original Kinesis stream, and all message metadata is preserved (message 
 offsets and primary keys, for example).
  
 This means that I can write a consumer that would consume from S3 files in 
 the same way that it would consume from the Kinesis stream itself. And the 
 data is structured such that when you are done reading from S3, you can 
 connect to the Kinesis stream at the point where the S3 archive left off.
  
 This effectively allowed me to add a configurable retention period when 
 consuming from Kinesis.
  
 -James
  
 On Jul 13, 2015, at 11:45 AM, Tim Smith secs...@gmail.com 
 (mailto:secs...@gmail.com) wrote:
  
  I have had a similar issue where I wanted a single source of truth between
  Search and HDFS. First, if you zoom out a little, eventually you are going
  to have some compute engine(s) process the data. If you store it in a
  compute neutral tier like kafka then you will need to suck the data out at
  runtime and stage it for the compute engine to use. So pick your poison,
  process at ingest and store multiple copies of data, one per compute
  engine, OR store in a neutral store and process at runtime. I am not saying
  one is better than the other but that's how I see the trade-off so
  depending on your use cases, YMMV.
   
  What I do is:
  - store raw data into kafka
  - use spark streaming to transform data to JSON and post it back to kafka
  - Hang multiple data stores off kafka that ingest the JSON
  - Not do any other transformations in the consumer stores and store the
  copy as immutable event
   
  So I do have multiple copies (one per compute tier) but they all look the
  same.
   
  Unless different compute engines, natively start to use a common data
  storage format, I don't see how one could get away from storing multiple
  copies. Primarily, I see Lucene based products have their format, the
  Hadoop ecosystem seems congregating around Parquet and then the NoSQL
  players have their formats (one per each product).
   
  My 2 cents worth :)
   
   
   
  On Mon, Jul 13, 2015 at 10:35 AM, Daniel Schierbeck 
  daniel.schierb...@gmail.com (mailto:daniel.schierb...@gmail.com) wrote:
   
   Am I correct in assuming that Kafka will only retain a file handle for the
   last segment of the log? If the number of handles grows unbounded, then it
   would be an issue. But I plan on writing to this topic continuously 
   anyway,
   so not separating data into cold and hot storage is the entire point.

   Daniel Schierbeck

     On 13. jul. 2015, at 15.41, Scott Thibault 
     scott.thiba...@multiscalehn.com wrote:
   
      We've tried to use Kafka not as a persistent store, but as a long-term
      archival store. An outstanding issue we've had with that is that the
      broker holds on to an open file handle on every file in the log! The other
      issue we've had is when you create a long-term archival log on shared
      storage, you can't simply access that data from another cluster b/c of meta
      data being stored in zookeeper rather than in the log.
   
      --Scott Thibault
 
 
  

Re: Using Kafka as a persistent store

2015-07-13 Thread Rad Gruchalski
Indeed, the files would have to be moved to some separate, dedicated storage.  
There are basically 3 options, as kafka does not allow adding logs at runtime:

1. make the consumer able to read from an arbitrary file
2. add ability to drop files in (I believe this adds a lot of complexity)
3. read files with another program, as suggested in my first email

I’d love to get some input from someone who knows the code and options a bit 
better!  










Kind regards,
Radek Gruchalski
ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

Confidentiality:
This communication is intended for the above-named person and may be
confidential and/or legally privileged. If it has come to you in error you
must take no action based on it, nor must you copy or show it to anyone;
please delete/destroy and inform the sender immediately.



On Monday, 13 July 2015 at 18:02, Scott Thibault wrote:

 Yes, consider my e-mail an up vote!
  
 I guess the files would automatically moved somewhere else to separate the
 active from cold segments? Ideally, one could run an unmodified consumer
 application on the cold segments.
  
  
 --Scott
  
  
 On Mon, Jul 13, 2015 at 6:57 AM, Rad Gruchalski ra...@gruchalski.com 
 (mailto:ra...@gruchalski.com)
 wrote:
  
  Scott,
   
  This is what I was trying to target in one of my previous responses to
  Daniel. The one in which I suggest another compaction setting for kafka.
   
   
   
   
   
   
   
   
   
   
  Kind regards,
  Radek Gruchalski
  ra...@gruchalski.com (mailto:ra...@gruchalski.com) (mailto:
  ra...@gruchalski.com (mailto:ra...@gruchalski.com))
  de.linkedin.com/in/radgruchalski/ 
  (http://de.linkedin.com/in/radgruchalski/) (
  http://de.linkedin.com/in/radgruchalski/)
   
  Confidentiality:
  This communication is intended for the above-named person and may be
  confidential and/or legally privileged.
  If it has come to you in error you must take no action based on it, nor
  must you copy or show it to anyone; please delete/destroy and inform the
  sender immediately.
   
   
   
  On Monday, 13 July 2015 at 15:41, Scott Thibault wrote:
   
    We've tried to use Kafka not as a persistent store, but as a long-term
    archival store. An outstanding issue we've had with that is that the
    broker holds on to an open file handle on every file in the log! The other
    issue we've had is when you create a long-term archival log on shared
    storage, you can't simply access that data from another cluster b/c of meta
    data being stored in zookeeper rather than in the log.
 
    --Scott Thibault
 
 
    On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck 
    daniel.schierb...@gmail.com wrote:
 
     Would it be possible to document how to configure Kafka to never delete
     messages in a topic? It took a good while to figure this out, and I see it
     as an important use case for Kafka.
  
     On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck 
     daniel.schierb...@gmail.com wrote:
  
       On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
        
       If I recall correctly, setting log.retention.ms and log.retention.bytes
       to -1 disables both.
      
      Thanks!
       
       On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
       daniel.schierb...@gmail.com wrote:
        
         On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
      
         There are two ways you can configure your topics, log compaction and
         with no cleaning. The choice depends on your use case. Are the records
         uniquely identifiable and will they receive updates? Then log
         compaction is the way to go. If they are truly read only, you can go
         without log compaction.
      
        I'd rather be free to use the key for partitioning, and the records are
        immutable — they're event records — so disabling compaction altogether
        would be preferable. How is that accomplished?
      
         We have a small processes which consume a topic and perform upserts to
         our various database engines. It's easy to change how it all works 

Re: Using Kafka as a persistent store

2015-07-13 Thread Daniel Schierbeck
Am I correct in assuming that Kafka will only retain a file handle for the last 
segment of the log? If the number of handles grows unbounded, then it would be 
an issue. But I plan on writing to this topic continuously anyway, so not 
separating data into cold and hot storage is the entire point. 

Daniel Schierbeck

 On 13. jul. 2015, at 15.41, Scott Thibault scott.thiba...@multiscalehn.com 
 wrote:
 
 We've tried to use Kafka not as a persistent store, but as a long-term
 archival store.  An outstanding issue we've had with that is that the
 broker holds on to an open file handle on every file in the log!  The other
 issue we've had is when you create a long-term archival log on shared
 storage, you can't simply access that data from another cluster b/c of meta
 data being stored in zookeeper rather than in the log.
 
 --Scott Thibault
 
 
 On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:
 
 Would it be possible to document how to configure Kafka to never delete
 messages in a topic? It took a good while to figure this out, and I see it
 as an important use case for Kafka.
 
 On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:
 
 
 On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
 
 If I recall correctly, setting log.retention.ms and
 log.retention.bytes
 to
 -1 disables both.
 
 Thanks!
 
 
 On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:
 
 
 On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
 
 There are two ways you can configure your topics, log compaction and
 with
 no cleaning. The choice depends on your use case. Are the records
 uniquely
 identifiable and will they receive updates? Then log compaction is
 the
 way
 to go. If they are truly read only, you can go without log
 compaction.
 
 I'd rather be free to use the key for partitioning, and the records
 are
 immutable — they're event records — so disabling compaction altogether
 would be preferable. How is that accomplished?
 
 We have a small processes which consume a topic and perform upserts
 to
 our
 various database engines. It's easy to change how it all works and
 simply
 consume the single source of truth again.
 
 I've written a bit about log compaction here:
 http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
 
 On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:
 
 I'd like to use Kafka as a persistent store – sort of as an
 alternative
 to
 HDFS. The idea is that I'd load the data into various other systems
 in
 order to solve specific needs such as full-text search, analytics,
 indexing
 by various attributes, etc. I'd like to keep a single source of
 truth,
 however.
 
 I'm struggling a bit to understand how I can configure a topic to
 retain
 messages indefinitely. I want to make sure that my data isn't
 deleted.
 Is
 there a guide to configuring Kafka like this?
 
 
 


Re: Fetching details from Kafka Server

2015-07-13 Thread pushkar priyadarshi
2) You need to implement MetricReporter and provide that implementation
class name via the producer-side configuration metric.reporters.
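
A bare-bones sketch of what that looks like. The package and class names here
are made up, and the interface I believe is meant is
org.apache.kafka.common.metrics.MetricsReporter (note the "s"), registered on
the producer with metric.reporters=com.example.metrics.LoggingReporter:

package com.example.metrics;

import java.util.List;
import java.util.Map;
import org.apache.kafka.common.metrics.KafkaMetric;
import org.apache.kafka.common.metrics.MetricsReporter;

public class LoggingReporter implements MetricsReporter {

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public void init(List<KafkaMetric> metrics) {
        // called once with the metrics that already exist when the client starts
        for (KafkaMetric m : metrics) {
            System.out.println("registered " + m.metricName());
        }
    }

    @Override
    public void metricChange(KafkaMetric metric) {
        // called whenever a metric is added or updated; push to your own system here
        System.out.println(metric.metricName() + " = " + metric.value());
    }

    @Override
    public void close() { }
}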

On Mon, Jul 13, 2015 at 9:08 PM, Swati Suman swatisuman1...@gmail.com
wrote:

 Hi Team,
 We are using Kafka 0.8.2

 I have two questions:

 1)Is there any Java Api in Kafka that gives me the list of all the consumer
 groups along with the topic/partition from which they are consuming
 Also, is there any way that I can fetch the zookeeper list from the kafka
 server side .
 Note: I am able to fetch the above information from the Zookeeper. But I
 want to fetch it from Kafka Server.

 2). I have implemented a Custom Metrics Reporter which is implementing
 KafkaMetricsReporter and KafkaMetricsMBeanReporter. So it is extracting all
 the Server Metrics as seen in page
 http://docs.confluent.io/1.0/kafka/monitoring.html and not the Producer
 and
 Consumer Metrics. Is there any way I can fetch them from the kafka server
 side or do the Producer/Consumer need to implement something to be able to
 fetch/emit them.

 I will be very thankful if you could share your thoughts on this.

 Thanks In Advance!!

 Best Regards,
 Swati Suman



Fetching details from Kafka Server

2015-07-13 Thread Swati Suman
Hi Team,
We are using Kafka 0.8.2

I have two questions:

1) Is there any Java API in Kafka that gives me the list of all the consumer
groups along with the topic/partition from which they are consuming?
Also, is there any way that I can fetch the zookeeper list from the kafka
server side?
Note: I am able to fetch the above information from the Zookeeper. But I
want to fetch it from Kafka Server.

2) I have implemented a custom metrics reporter which implements
KafkaMetricsReporter and KafkaMetricsMBeanReporter. It extracts all
the server metrics listed at
http://docs.confluent.io/1.0/kafka/monitoring.html, but not the producer and
consumer metrics. Is there any way I can fetch them from the Kafka server
side, or do the producer/consumer need to implement something to be able to
fetch/emit them?

I will be very thankful if you could share your thoughts on this.

Thanks In Advance!!

Best Regards,
Swati Suman


Re: Using Kafka as a persistent store

2015-07-13 Thread Scott Thibault
Yes, consider my e-mail an up vote!

I guess the files would automatically be moved somewhere else to separate the
active from the cold segments? Ideally, one could run an unmodified consumer
application on the cold segments.


--Scott


On Mon, Jul 13, 2015 at 6:57 AM, Rad Gruchalski ra...@gruchalski.com
wrote:

 Scott,

 This is what I was trying to target in one of my previous responses to
 Daniel. The one in which I suggest another compaction setting for kafka.

 Kind regards,
 Radek Gruchalski
 ra...@gruchalski.com
 de.linkedin.com/in/radgruchalski/


 On Monday, 13 July 2015 at 15:41, Scott Thibault wrote:

  We've tried to use Kafka not as a persistent store, but as a long-term
  archival store. An outstanding issue we've had with that is that the
  broker holds on to an open file handle on every file in the log! The
  other issue we've had is when you create a long-term archival log on
  shared storage, you can't simply access that data from another cluster
  b/c of meta data being stored in zookeeper rather than in the log.

  --Scott Thibault

  On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck
  daniel.schierb...@gmail.com wrote:

   Would it be possible to document how to configure Kafka to never delete
   messages in a topic? It took a good while to figure this out, and I see
   it as an important use case for Kafka.

   On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck
   daniel.schierb...@gmail.com wrote:

     On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:

     If I recall correctly, setting log.retention.ms and log.retention.bytes
     to -1 disables both.

    Thanks!

     On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck
     daniel.schierb...@gmail.com wrote:

      On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:

      There are two ways you can configure your topics, log compaction and
      with no cleaning. The choice depends on your use case. Are the records
      uniquely identifiable and will they receive updates? Then log
      compaction is the way to go. If they are truly read only, you can go
      without log compaction.

     I'd rather be free to use the key for partitioning, and the records are
     immutable — they're event records — so disabling compaction altogether
     would be preferable. How is that accomplished?

      We have a small processes which consume a topic and perform upserts to
      our various database engines. It's easy to change how it all works and
      simply consume the single source of truth again.

      I've written a bit about log compaction here:
      http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/

      On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck
      daniel.schierb...@gmail.com wrote:

       I'd like to use Kafka as a persistent store – sort of as an
       alternative to HDFS. The idea is that I'd load the data into various
       other systems in order to solve specific needs such as full-text
       search, analytics, indexing by various attributes, etc. I'd like to
       keep a single source of truth, however.

       I'm struggling a bit to understand how I can configure a topic to
       retain messages indefinitely. I want to make sure that my data isn't
       deleted. Is there a guide to configuring Kafka like this?


Re: Using Kafka as a persistent store

2015-07-13 Thread Shayne S
Did this work for you? I set the topic settings to retention.ms=-1 and
retention.bytes=-1 and it looks like it is deleting segments immediately.
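
For reference, per-topic overrides like these are usually applied with
bin/kafka-topics.sh --alter --config, or programmatically; a rough sketch
against the 0.8.2 admin API follows (the ZooKeeper address and topic name
are placeholders, not necessarily how the settings above were applied):

import java.util.Properties;

import org.I0Itec.zkclient.ZkClient;

import kafka.admin.AdminUtils;
import kafka.utils.ZKStringSerializer$;

public class RetentionOverride {
    public static void main(String[] args) {
        ZkClient zkClient =
            new ZkClient("zk1:2181", 30000, 30000, ZKStringSerializer$.MODULE$);

        Properties overrides = new Properties();
        // Note: per KAFKA-1990 (mentioned later in this thread), -1 is only
        // honoured from 0.8.3 on, which matches the behaviour observed above.
        overrides.put("retention.ms", "-1");
        overrides.put("retention.bytes", "-1");

        AdminUtils.changeTopicConfig(zkClient, "my-topic", overrides);
        zkClient.close();
    }
}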

On Sun, Jul 12, 2015 at 8:02 AM, Daniel Schierbeck 
daniel.schierb...@gmail.com wrote:


  On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
 
  If I recall correctly, setting log.retention.ms and log.retention.bytes
 to
  -1 disables both.

 Thanks!

 
  On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
 
  On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
 
  There are two ways you can configure your topics, log compaction and
 with
  no cleaning. The choice depends on your use case. Are the records
  uniquely
  identifiable and will they receive updates? Then log compaction is the
  way
  to go. If they are truly read only, you can go without log compaction.
 
  I'd rather be free to use the key for partitioning, and the records are
  immutable — they're event records — so disabling compaction altogether
  would be preferable. How is that accomplished?
 
  We have a small processes which consume a topic and perform upserts to
  our
  various database engines. It's easy to change how it all works and
 simply
  consume the single source of truth again.
 
  I've written a bit about log compaction here:
 
 http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
 
  On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
  daniel.schierb...@gmail.com wrote:
 
  I'd like to use Kafka as a persistent store – sort of as an
 alternative
  to
  HDFS. The idea is that I'd load the data into various other systems in
  order to solve specific needs such as full-text search, analytics,
  indexing
  by various attributes, etc. I'd like to keep a single source of truth,
  however.
 
  I'm struggling a bit to understand how I can configure a topic to
 retain
  messages indefinitely. I want to make sure that my data isn't deleted.
  Is
  there a guide to configuring Kafka like this?
 



performance benchmarking of kafka

2015-07-13 Thread Yuheng Du
Hi guys,

I am trying to replicate the test of benchmarking kafka at
http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
.

When I run

bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance
test7 5000 100 -1 acks=1 bootstrap.servers=192.168.1.1:9092
buffer.memory=67108864 batch.size=8196

and I got the following error:
Error: Could not find or load main class
org.apache.kafka.client.tools.ProducerPerformance

What should I fix? Thank you!


Re: performance benchmarking of kafka

2015-07-13 Thread Yuheng Du
Thank you. I see that in run-class.sh, they have the following lines:

  for file in $base_dir/clients/build/libs/kafka-clients*.jar;
  do
    CLASSPATH=$CLASSPATH:$file
  done

So I believe all the jars in the libs/ directory have already been included
in the classpath?

Which directory does the ProducerPerformance class reside in?

Thanks.

On Mon, Jul 13, 2015 at 4:37 PM, JIEFU GONG jg...@berkeley.edu wrote:

 You may need to open up your run-class.sh in a text editor and modify the
 classpath -- I believe I had a similar error before.

 On Mon, Jul 13, 2015 at 1:16 PM, Yuheng Du yuheng.du.h...@gmail.com
 wrote:

  Hi guys,
 
  I am trying to replicate the test of benchmarking kafka at
 
 
 http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
  .
 
  When I run
 
  bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance
  test7 5000 100 -1 acks=1 bootstrap.servers=192.168.1.1:9092
  buffer.memory=67108864 batch.size=8196
 
  and I got the following error:
  Error: Could not find or load main class
  org.apache.kafka.client.tools.ProducerPerformance
 
  What should I fix? Thank you!
 



 --

 Jiefu Gong
 University of California, Berkeley | Class of 2017
 B.A Computer Science | College of Letters and Sciences

 jg...@berkeley.edu elise...@berkeley.edu | (925) 400-3427



Re: performance benchmarking of kafka

2015-07-13 Thread Yuheng Du
I am using the binaries of kafka_2.10-0.8.2.1. Could that be the problem?
Should I copy the kafka-0.8.2.1-src.tgz source to each of my machines,
build it, and run the test?
Thanks.

On Mon, Jul 13, 2015 at 4:37 PM, JIEFU GONG jg...@berkeley.edu wrote:

 You may need to open up your run-class.sh in a text editor and modify the
 classpath -- I believe I had a similar error before.

 On Mon, Jul 13, 2015 at 1:16 PM, Yuheng Du yuheng.du.h...@gmail.com
 wrote:

  Hi guys,
 
  I am trying to replicate the test of benchmarking kafka at
 
 
 http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
  .
 
  When I run
 
  bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance
  test7 5000 100 -1 acks=1 bootstrap.servers=192.168.1.1:9092
  buffer.memory=67108864 batch.size=8196
 
  and I got the following error:
  Error: Could not find or load main class
  org.apache.kafka.client.tools.ProducerPerformance
 
  What should I fix? Thank you!
 



 --

 Jiefu Gong
 University of California, Berkeley | Class of 2017
 B.A Computer Science | College of Letters and Sciences

 jg...@berkeley.edu elise...@berkeley.edu | (925) 400-3427



Re: performance benchmarking of kafka

2015-07-13 Thread JIEFU GONG
You may need to open up your run-class.sh in a text editor and modify the
classpath -- I believe I had a similar error before.

On Mon, Jul 13, 2015 at 1:16 PM, Yuheng Du yuheng.du.h...@gmail.com wrote:

 Hi guys,

 I am trying to replicate the test of benchmarking kafka at

 http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
 .

 When I run

 bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance
 test7 5000 100 -1 acks=1 bootstrap.servers=192.168.1.1:9092
 buffer.memory=67108864 batch.size=8196

 and I got the following error:
 Error: Could not find or load main class
 org.apache.kafka.client.tools.ProducerPerformance

 What should I fix? Thank you!




-- 

Jiefu Gong
University of California, Berkeley | Class of 2017
B.A Computer Science | College of Letters and Sciences

jg...@berkeley.edu elise...@berkeley.edu | (925) 400-3427


Offset not committed

2015-07-13 Thread Vadim Bobrov
I am trying to replace ActiveMQ with Kafka in our environment; however, I
have encountered a strange problem that basically prevents us from using
Kafka in production. The problem is that sometimes the offsets are not
committed.

I am using Kafka 0.8.2.1, offset storage = kafka, high-level consumer,
auto-commit = off. Every N messages I issue commitOffsets(). Now here is
the problem: if N is below a certain number (180 000 for me) it works and
the offset is moving. If N is 180 000 or more, the offset is not updated
after commitOffsets().
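
For concreteness, a minimal sketch of the setup described above (0.8.2.1
high-level consumer, offsets stored in Kafka, auto-commit off, manual
commitOffsets() every N messages); the topic, group, N, and ZooKeeper address
are placeholders:

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181");
        props.put("group.id", "my-group");
        props.put("offsets.storage", "kafka");     // store offsets in Kafka
        props.put("auto.commit.enable", "false");  // commit manually

        ConsumerConnector connector =
            Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
            connector.createMessageStreams(Collections.singletonMap("my-topic", 1));
        ConsumerIterator<byte[], byte[]> it =
            streams.get("my-topic").get(0).iterator();

        long count = 0;
        while (it.hasNext()) {
            it.next();  // ... process the message here ...
            if (++count % 10000 == 0) {   // N = 10000, a placeholder
                connector.commitOffsets();
            }
        }
    }
}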

I am looking at offsets using kafka-run-class.sh
kafka.tools.ConsumerOffsetChecker.

Any help?


Re: Using Kafka as a persistent store

2015-07-13 Thread Daniel Tamai
Using -1 for log.retention.ms will only work starting with 0.8.3
(https://issues.apache.org/jira/browse/KAFKA-1990).

2015-07-13 17:08 GMT-03:00 Shayne S shaynest...@gmail.com:

 Did this work for you? I set the topic settings to retention.ms=-1 and
 retention.bytes=-1 and it looks like it is deleting segments immediately.

 On Sun, Jul 12, 2015 at 8:02 AM, Daniel Schierbeck 
 daniel.schierb...@gmail.com wrote:

 
   On 10. jul. 2015, at 23.03, Jay Kreps j...@confluent.io wrote:
  
   If I recall correctly, setting log.retention.ms and
 log.retention.bytes
  to
   -1 disables both.
 
  Thanks!
 
  
   On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck 
   daniel.schierb...@gmail.com wrote:
  
  
   On 10. jul. 2015, at 15.16, Shayne S shaynest...@gmail.com wrote:
  
   There are two ways you can configure your topics, log compaction and
  with
   no cleaning. The choice depends on your use case. Are the records
   uniquely
   identifiable and will they receive updates? Then log compaction is
 the
   way
   to go. If they are truly read only, you can go without log
 compaction.
  
   I'd rather be free to use the key for partitioning, and the records
 are
   immutable — they're event records — so disabling compaction altogether
   would be preferable. How is that accomplished?
  
   We have a small processes which consume a topic and perform upserts
 to
   our
   various database engines. It's easy to change how it all works and
  simply
   consume the single source of truth again.
  
   I've written a bit about log compaction here:
  
  http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
  
   On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck 
   daniel.schierb...@gmail.com wrote:
  
   I'd like to use Kafka as a persistent store – sort of as an
  alternative
   to
   HDFS. The idea is that I'd load the data into various other systems
 in
   order to solve specific needs such as full-text search, analytics,
   indexing
   by various attributes, etc. I'd like to keep a single source of
 truth,
   however.
  
   I'm struggling a bit to understand how I can configure a topic to
  retain
   messages indefinitely. I want to make sure that my data isn't
 deleted.
   Is
   there a guide to configuring Kafka like this?
  
 



New producer and ordering of Callbacks when sending to multiple partitions

2015-07-13 Thread James Cheng
Hi,

I'm trying to understand the new producer, and the order in which the Callbacks 
will be called.

From my understanding, records are batched up per partition. So all records 
destined for a specific partition will be sent in order, and that means that 
their callbacks will be called in order.

What about message batches that cover multiple partitions? E.g., if I send
three messages to each of three partitions A, B, and C, in the following order:

A1 A2 A3 B1 B2 B3 C1 C2 C3

Then is it possible that messages B1 B2 B3 will be sent prior to A1 A2 A3,
which means the callbacks for B1 B2 B3 will also be called prior to the ones
from A1 A2 A3?

Thanks,
-James



Re: New producer and ordering of Callbacks when sending to multiple partitions

2015-07-13 Thread Gwen Shapira
James,

There are separate queues for each partition, so there are no
guarantees on the order of the sends (or callbacks) between
partitions.
(Actually, IIRC, the code intentionally randomizes the partition order
a bit, possibly to avoid starvation)
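
To make that concrete, a small sketch against the 0.8.2 producer API (topic
name and broker address are placeholders, and the topic is assumed to have at
least two partitions): callbacks for records sent to the same partition fire
in send order, but callbacks for different partitions may interleave in any
order.

import java.util.Properties;

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class CallbackOrdering {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.ByteArraySerializer");

        KafkaProducer<byte[], byte[]> producer =
            new KafkaProducer<byte[], byte[]>(props);

        for (int partition = 0; partition < 2; partition++) {
            for (int i = 1; i <= 3; i++) {
                final String label = (partition == 0 ? "A" : "B") + i;
                producer.send(
                    new ProducerRecord<byte[], byte[]>("test", partition, null,
                        label.getBytes()),
                    new Callback() {
                        public void onCompletion(RecordMetadata metadata,
                                                 Exception exception) {
                            // A1, A2, A3 complete in that order relative to
                            // each other, and likewise B1, B2, B3 -- but A and
                            // B completions may interleave arbitrarily.
                            System.out.println("completed " + label);
                        }
                    });
            }
        }
        producer.close();
    }
}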

Gwen

On Mon, Jul 13, 2015 at 5:41 PM, James Cheng jch...@tivo.com wrote:
 Hi,

 I'm trying to understand the new producer, and the order in which the 
 Callbacks will be called.

 From my understanding, records are batched up per partition. So all records 
 destined for a specific partition will be sent in order, and that means that 
 their callbacks will be called in order.

 What about message batches that cover multiple partitions? E.g. If I send 
 three messages to three partitions A, B, and C, in the following order:

 A1 A2 A3 B1 B2 B3 C1 C2 D3

 Then is it possible that messages B1 B2 B3 will be sent prior to A1 A2 A3? 
 Which means the callbacks for B1 B2 B3 will also be called prior to the ones 
 from A1 A2 A3?

 Thanks,
 -James