Re: Kafka security

2017-04-11 Thread Christian Csar
Don't hard code it. Martin's suggestion allows it to be read from a
configuration file or injected from another source such as an environment
variable at runtime.
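
For example, a minimal sketch of that approach in Java (the environment variable
name and the SSL property names here are assumptions; use whichever configuration
keys your client version and component actually expect):

    import java.util.Properties;

    // Read the secret from the environment at runtime instead of hard coding it.
    String keystorePassword = System.getenv("KAFKA_KEYSTORE_PASSWORD");

    Properties props = new Properties();
    props.put("security.protocol", "SSL");
    props.put("ssl.keystore.location", "/etc/kafka/client.keystore.jks");
    props.put("ssl.keystore.password", keystorePassword);
    // ...plus bootstrap.servers, serializers, etc. before building the producer/consumer.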

If neither of these is acceptable under corporate policy, I suggest
asking how it has been handled before at your company.

Christian


On Apr 11, 2017 11:10, "IT Consultant" <0binarybudd...@gmail.com> wrote:

Thanks for your response .

We aren't allowed to hard code passwords in any of our programs

On Apr 11, 2017 23:39, "Mar Ian"  wrote:

> Since it is a Java property you could set the property (keystore password)
> programmatically,
>
> before you connect to kafka (ie, before creating a consumer or producer)
>
> System.setProperty("zookeeper.ssl.keyStore.password", password);
>
> martin
>
> 
> From: IT Consultant <0binarybudd...@gmail.com>
> Sent: April 11, 2017 2:01 PM
> To: users@kafka.apache.org
> Subject: Kafka security
>
> Hi All
>
> How can I avoid using password for keystore creation ?
>
> Our corporate policies don't allow us to hardcode passwords. We are
> currently passing the keystore password while accessing a TLS-enabled Kafka
> instance.
>
> I would like to use either a passwordless keystore or avoid a password for
> clients accessing Kafka.
>
>
> Please help
>


Re: Encryption at Rest

2016-05-02 Thread Christian Csar
"We need to be capable of changing encryption keys on regular
intervals and in case of expected key compromise." is achievable with
full disk encryption particularly if you are willing to add and remove
Kafka servers so that you replicate the data to new machines/disks
with new keys and take the machines with old keys out of use and wipe
them.

For the second part of it I would suggest reevaluating your threat
model since you are looking at a machine that is compromised but not
compromised enough to be able to read the key from Kafka or to use
Kafka to read the data.

While you could add support to encrypt data on the way in and out of
compression I believe you would need either substantial work in Kafka
to support rewriting/reencrypting the logfiles (with performance
penalties) or rotate machines in and out as with full disk encryption.
Though I'll let someone with more knowledge of the implementation
comment further on what would be required.

Christian
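
For reference, the custom encrypting serialiser approach discussed in the quoted
thread below might look roughly like this. A minimal sketch only: the key handling
is deliberately naive, and per-message encryption still defeats batch compression,
as Bruno notes.

    import java.security.SecureRandom;
    import javax.crypto.Cipher;
    import javax.crypto.spec.GCMParameterSpec;
    import javax.crypto.spec.SecretKeySpec;
    import org.apache.kafka.common.serialization.Serializer;

    // Sketch: encrypts each message individually with AES-GCM and prepends the IV.
    public class EncryptingSerializer implements Serializer<byte[]> {
        private final SecretKeySpec key;              // obtain from key management, not config files
        private final SecureRandom random = new SecureRandom();

        public EncryptingSerializer(byte[] rawKey) { this.key = new SecretKeySpec(rawKey, "AES"); }

        @Override public void configure(java.util.Map<String, ?> configs, boolean isKey) { }
        @Override public void close() { }

        @Override
        public byte[] serialize(String topic, byte[] plaintext) {
            try {
                byte[] iv = new byte[12];
                random.nextBytes(iv);
                Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
                cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
                byte[] ciphertext = cipher.doFinal(plaintext);
                byte[] out = new byte[iv.length + ciphertext.length];
                System.arraycopy(iv, 0, out, 0, iv.length);
                System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
                return out;
            } catch (Exception e) {
                throw new RuntimeException("encryption failed", e);
            }
        }
    }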

On Mon, May 2, 2016 at 9:41 PM, Bruno Rassaerts
 wrote:
> We did indeed try the last scenario you describe, as encrypted disks do not 
> fulfil our requirements.
> We need to be capable of changing encryption keys on regular intervals and in 
> case of expected key compromise.
> Also, when a running machine is hacked, disk based or file system based 
> encryption doesn’t offer any protection.
>
> Our goal is indeed to have the content in the broker files encrypted. The 
> problem is that the only way to achieve this is through custom serialisers.
> This works, but the overhead is quite dramatic as the messages are no longer 
> efficiently compressed (in batch).
> Compression in the serialiser, before the encryption, doesn’t really solve 
> the performance problem.
>
> The best thing for us would be able to encrypt after the batch compression 
> offered by kafka.
> The hook to do this is missing in the current implementation.
>
> Bruno
>
>> On 02 May 2016, at 22:46, Tom Brown  wrote:
>>
>> I'm trying to understand your use-case for encrypted data.
>>
>> Does it need to be encrypted only over the wire? This can be accomplished
>> using TLS encryption (v0.9.0.0+). See
>> https://issues.apache.org/jira/browse/KAFKA-1690
>>
>> Does it need to be encrypted only when at rest? This can be accomplished
>> using full disk encryption as others have mentioned.
>>
>> Does it need to be encrypted during both? Use both TLS and full disk
>> encryption.
>>
>> Does it need to be encrypted fully from end-to-end so even Kafka can't read
>> it? Since Kafka shouldn't be able to know the contents, the key should not
>> be known to Kafka. What remains is manually encrypting each message before
>> giving it to the producer (or by implementing an encrypting serializer).
>> Either way, each message is still encrypted individually.
>>
>> Have I left out a scenario?
>>
>> --Tom
>>
>>
>> On Mon, May 2, 2016 at 2:01 PM, Bruno Rassaerts wrote:
>>
>>> Hello,
>>>
>>> We tried encrypting the data before sending it to kafka, however this
>>> makes the compression done by kafka almost impossible.
>>> Also the performance overhead of encrypting the individual messages was
>>> quite significant.
>>>
>>> Ideally, a pluggable “compression” algorithm would be best. Where message
>>> can first be compressed, then encrypted in batch.
>>> However, the current kafka implementation does not allow this.
>>>
>>> Bruno
>>>
 On 26 Apr 2016, at 19:02, Jim Hoagland 
>>> wrote:

 Another option is to encrypt the data before you hand it to Kafka and
>>> have
 the downstream decrypt it.  This takes care of on-disk and on-wire
 encryption.  We did a proof of concept of this:


>>> http://www.symantec.com/connect/blogs/end-end-encryption-though-kafka-our-p
 roof-concept

 ( http://symc.ly/1pC2CEG )

 -- Jim

 On 4/25/16, 11:39 AM, "David Buschman"  wrote:

> Kafka handles messages which are composed of an array of bytes. Kafka
>>> does
> not care what is in those byte arrays.
>
> You could use a custom Serializer and Deserializer to encrypt and
>>> decrypt
> the data from within your application(s) easily enough.
>
> This gives the benefit of having encryption at rest and over the wire.
>>> Two
> birds, one stone.
>
> DaVe.
>
>
>> On Apr 25, 2016, at 2:14 AM, Jens Rantil  wrote:
>>
>> IMHO, I think that responsibility should lie on the file system, not
>> Kafka.
>> Feels like a waste of time and double work to implement that unless
>> there's
>> a really good reason for it. Let's try to keep Kafka a focused product
>> that
>> does one thing well.
>>
>> Cheers,
>> Jens
>>
>> On Fri, Apr 22, 2016 at 3:31 AM Tauzell, Dave
>> 
>> wrote:
>>
>>> I meant encryption 

Re: Encryption at Rest

2016-04-21 Thread Christian Csar
From what I know of previous discussions, encryption at rest can be
handled with transparent disk encryption. When that's sufficient it's
nice and easy.

Christian

On Thu, Apr 21, 2016 at 2:31 PM, Tauzell, Dave
 wrote:
> Has there been any discussion or work on at rest encryption for Kafka?
>
> Thanks,
>   Dave
>
> This e-mail and any files transmitted with it are confidential, may contain 
> sensitive information, and are intended solely for the use of the individual 
> or entity to whom they are addressed. If you have received this e-mail in 
> error, please notify the sender by reply e-mail immediately and destroy all 
> copies of the e-mail and any attachments.


Re: Kafka over Satellite links

2016-03-02 Thread Christian Csar
I would not do that. I admit I may be a bit biased due to working for
Buddy Platform (IoT backend stuff including telemetry collection), but
you want to send the data via some protocol (HTTP? MQTT? COAP?) to the
central hub and then have those servers put the data into Kafka. Now
if you want to use Kafka there are the various HTTP front ends that
will basically put the data into Kafka for you without the client
needing to deal with the partition management part. But putting data
into Kafka directly really seems like a bad idea even if it's a large
number of messages per second per node, even if the security parts
work out for you.

Christian

On Wed, Mar 2, 2016 at 9:52 PM, Jan  wrote:
> Hi folks;
> does anyone know of Kafka's ability to work over satellite links? We have an
> IoT telemetry application that uses satellite communication to send data from
> remote sites to a central hub.
> Any help/ input/ links/ gotchas would be much appreciated.
> Regards,
> Jan


Re: Kafka behind AWS ELB

2015-05-04 Thread Christian Csar
Dillian,

On Mon, May 4, 2015 at 1:52 PM, Dillian Murphey crackshotm...@gmail.com
wrote:

 I'm interested in this topic as well.  If you put kafka brokers inside an
 autoscaling group, then AWS will automatically add brokers if demand
 increases, and the ELB will automatically round-robin across all of your
 kafka instances.  So in your config files and code, you only need to
 provide a single DNS name (the load balancer). You don't need to specify
 all your kafka brokers inside your config file.  If a broker dies, the ELB
 will only route to healthy nodes.

 So you get a lot of robustness, scalability, and fault-tolerance by using
 the AWS services. Kafka Brokers will automatically load balance, but the
 question is whether it is ok to put all your brokers behind an ELB and
 expect the system to work properly.

You should not expect it to work properly. Broker nodes are data bearing
which means that any operation to scale down would need to be aware of the
data distribution. The client connects to specific nodes to send them data
so even the Level 4 load balancing wouldn't work.

 What alternatives are there to dynamic/scalable broker clusters?  I don't
 want to have to modify my config files or code if I add more brokers, and
I
 want to be able to handle a broker going down. So these are the reasons
AWS
 questions like this come up.


The clients already give you options for specifying only a subset of
brokers
https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+Producer+Example
one of which must be alive to discover the rest of the cluster. The main
clients handle node failures (you'll still have some operational work).
Kafka and other data storage systems do not work the same as an HTTP driven
web application. While it can certainly be scaled, and automation could be
done to do so in response to load it's going to be more complicated. AWS's
off the shelf solution/low operations offering for some (definitely not
all) of Kafka's use cases is Kinesis, Azure's is EventHubs. Before using
Kafka or any system in production you'll want to be sure you understand the
operational aspects of it.
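
To make the "subset of brokers" point concrete, a minimal sketch with the Java
producer (host names are made up; with the older Scala producer the equivalent
setting is metadata.broker.list):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;

    Properties props = new Properties();
    // Any reachable subset of brokers works; the client discovers the rest of the
    // cluster from these. Don't put an ELB here - clients must reach each broker directly.
    props.put("bootstrap.servers", "broker1.internal:9092,broker2.internal:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    KafkaProducer<String, String> producer = new KafkaProducer<>(props);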

Christian


 Thanks for any comments too. :)




 On Mon, May 4, 2015 at 9:03 AM, Mayuresh Gharat 
gharatmayures...@gmail.com
 wrote:

  Ok. You can deploy kafka in AWS. You can have brokers on AWS servers.
  Kafka is not a push system. So you will need someone writing to kafka
and
  consuming from kafka. It will work. My suggestion will be to try it out
on
  a smaller instance in AWS and see the effects.
 
  As I do not know the actual use case about why you want to use kafka
for, I
  cannot comment on whether it will work for you personalized use case.
 
  Thanks,
 
  Mayuresh
 
  On Mon, May 4, 2015 at 8:55 AM, Chandrashekhar Kotekar 
  shekhar.kote...@gmail.com wrote:
 
   I am sorry but I cannot reveal those details due to confidentiality
  issues.
   I hope you understand.
  
  
   Regards,
   Chandrash3khar Kotekar
   Mobile - +91 8600011455
  
   On Mon, May 4, 2015 at 9:18 PM, Mayuresh Gharat 
   gharatmayures...@gmail.com
   wrote:
  
Hi Chandrashekar,
   
Can you please elaborate the use case for Kafka here, like how you
are
planning to use it.
   
   
Thanks,
   
Mayuresh
   
On Sat, May 2, 2015 at 9:08 PM, Chandrashekhar Kotekar 
shekhar.kote...@gmail.com wrote:
   
 Hi,

 I am new to Apache Kafka. I have played with it on my laptop.

 I want to use Kafka in AWS. Currently we have tomcat web servers
  based
REST
 API. We want to replace REST API with Apache Kafka, web servers
are
behind
 ELB.

 I would like to know if we can keep Kafka brokers behind ELB?
Will it
work?

 Regards,
 Chandrash3khar Kotekar
 Mobile - +91 8600011455

   
   
   
--
-Regards,
Mayuresh R. Gharat
(862) 250-7125
   
  
 
 
 
  --
  -Regards,
  Mayuresh R. Gharat
  (862) 250-7125
 


Re: Kafka Poll: Version You Use?

2015-03-04 Thread Christian Csar
Do you have a anything on the number of voters, or audience breakdown?

Christian

On Wed, Mar 4, 2015 at 8:08 PM, Otis Gospodnetic otis.gospodne...@gmail.com
 wrote:

 Hello hello,

 Results of the poll are here!
 Any guesses before looking?
 What % of Kafka users are on 0.8.2.x already?
 What % of people are still on 0.7.x?


 http://blog.sematext.com/2015/03/04/poll-results-kafka-version-distribution/

 Otis
 --
 Monitoring * Alerting * Anomaly Detection * Centralized Log Management
 Solr & Elasticsearch Support * http://sematext.com/


 On Thu, Feb 26, 2015 at 3:32 PM, Otis Gospodnetic 
 otis.gospodne...@gmail.com wrote:

  Hi,
 
  With 0.8.2 out I thought it might be useful for everyone to see which
  version(s) of Kafka people are using.
 
  Here's a quick poll:
  http://blog.sematext.com/2015/02/23/kafka-poll-version-you-use/
 
  We'll publish the results next week.
 
  Thanks,
  Otis
  --
  Monitoring * Alerting * Anomaly Detection * Centralized Log Management
   Solr & Elasticsearch Support * http://sematext.com/
 
 



Re: kafka producer does not distribute messages to partitions evenly?

2015-03-02 Thread Christian Csar
I believe you are seeing the behavior where the random partitioner is
sticky.
http://mail-archives.apache.org/mod_mbox/kafka-users/201309.mbox/%3ccahwhrrxax5ynimqnacsk7jcggnhjc340y4qbqoqcismm43u...@mail.gmail.com%3E
has details. So with the default 10 minute refresh if your test is only an
hour or two with a single producer you would not expect to see all
partitions be hit.

Christian
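
For anyone hitting this, a rough sketch of the two usual mitigations with the 0.8
Scala producer (property names per the 0.8 producer configuration; the broker list,
topic name and refresh value are just examples):

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    Properties props = new Properties();
    props.put("metadata.broker.list", "broker1:9092,broker2:9092");
    props.put("serializer.class", "kafka.serializer.StringEncoder");
    // 1) Re-pick the "sticky" random partition more often than the 10-minute default.
    props.put("topic.metadata.refresh.interval.ms", "60000");
    Producer<String, String> producer = new Producer<>(new ProducerConfig(props));

    // 2) Or give each message a key so the hash partitioner decides the partition:
    producer.send(new KeyedMessage<String, String>("mytopic", messageKey, messageContent));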

On Mon, Mar 2, 2015 at 4:23 PM, Yang tedd...@gmail.com wrote:

 thanks. just checked the code below. in the code below, the line that calls
 Random.nextInt() seems to be called only *a few times*, and in all the rest
 of the cases where getPartition() is called, the
 cached sendPartitionPerTopicCache.get(topic) value seems to be used, so
 apparently you won't get an even partition distribution?

 the code I got is from commit 7847e9c703f3a0b70519666cdb8a6e4c8e37c3a7


 ./core/src/main/scala/kafka/producer/async/DefaultEventHandler.scala


  private def getPartition(topic: String, key: Any, topicPartitionList: Seq[PartitionAndLeader]): Int = {
    val numPartitions = topicPartitionList.size
    if(numPartitions <= 0)
      throw new UnknownTopicOrPartitionException("Topic " + topic + " doesn't exist")
    val partition =
      if(key == null) {
        // If the key is null, we don't really need a partitioner
        // So we look up in the send partition cache for the topic to decide the target partition
        val id = sendPartitionPerTopicCache.get(topic)
        id match {
          case Some(partitionId) =>
            // directly return the partitionId without checking availability of the leader,
            // since we want to postpone the failure until the send operation anyways
            partitionId
          case None =>
            val availablePartitions = topicPartitionList.filter(_.leaderBrokerIdOpt.isDefined)
            if (availablePartitions.isEmpty)
              throw new LeaderNotAvailableException("No leader for any partition in topic " + topic)
            val index = Utils.abs(Random.nextInt) % availablePartitions.size
            val partitionId = availablePartitions(index).partitionId
            sendPartitionPerTopicCache.put(topic, partitionId)
            partitionId
        }
      } else
        partitioner.partition(key, numPartitions)
    if(partition < 0 || partition >= numPartitions)
      throw new UnknownTopicOrPartitionException("Invalid partition id: " + partition + " for topic " + topic +
        "; Valid values are in the inclusive range of [0, " + (numPartitions-1) + "]")
    trace("Assigning message of topic %s and key %s to a selected partition %d"
      .format(topic, if (key == null) "[none]" else key.toString, partition))
    partition
  }


 On Mon, Mar 2, 2015 at 3:58 PM, Mayuresh Gharat 
 gharatmayures...@gmail.com
 wrote:

  Probably your keys are getting hashed to only those partitions. I don't
  think anything is wrong here.
  You can check how the default hashPartitioner is used in the code and try
  to do the same for your keys before you send them and check which
  partitions are those going to.
 
  The default hashpartitioner does something like this :
 
  hash(key) % numPartitions.
 
  Thanks,
 
  Mayuresh
 
  On Mon, Mar 2, 2015 at 3:52 PM, Yang tedd...@gmail.com wrote:
 
   we have 10 partitions for a topic, and omit the explicit partition
 param
  in
   the message creation:
  
    KeyedMessage<String, String> data = new KeyedMessage<String, String>("mytopic", myMessageContent);   // partition key need to be polished
   producer.send(data);
  
  
  
   but on average 3--5 of the partitions are empty.
  
  
  
   what went wrong?
  
   thanks
   Yang
  
 
 
 
  --
  -Regards,
  Mayuresh R. Gharat
  (862) 250-7125
 



Re: Tips for working with Kafka and data streams

2015-02-25 Thread Christian Csar
I wouldn't say no to some discussion of encryption. We're running on Azure
EventHubs (with preparations for Kinesis for EC2, and Kafka for deployments
in customer datacenters when needed) so can't just use disk level
encryption (which would have its own overhead). We're putting all of our
messages inside of encrypted envelopes before sending them to the stream
which limits our opportunities for schema verification of the underlying
messages to the declared type of the message.

Encryption at rest mostly works out to a sales point for customers who want
assurances, and in a Kafka focused discussion might be dealt with by
covering disk encryption and how the conversations between Kafka instances
are protected.

Christian


On Wed, Feb 25, 2015 at 11:51 AM, Jay Kreps j...@confluent.io wrote:

 Hey guys,

 One thing we tried to do along with the product release was start to put
 together a practical guide for using Kafka. I wrote this up here:
 http://blog.confluent.io/2015/02/25/stream-data-platform-1/

 I'd like to keep expanding on this as good practices emerge and we learn
 more stuff. So two questions:
 1. Anything you think other people should know about working with data
 streams? What did you wish you knew when you got started?
 2. Anything you don't know about but would like to hear more about?

 -Jay



Re: Tips for working with Kafka and data streams

2015-02-25 Thread Christian Csar
Yeah, we do have scenarios where we use customer specific keys so our
envelopes end up containing key identification information for accessing
our key repository. I'll certainly follow any changes you propose in this
area with interest, but I'd expect that sort of centralized key thing to be
fairly separate from Kafka even if there's a handy optional layer that
integrates with it.

Christian
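
As a rough illustration, an envelope like the one described above amounts to
something like this (the field layout and names are simplified and hypothetical,
not our actual format):

    // Plaintext metadata travels outside the ciphertext so consumers can pick the
    // right key from the key repository; the payload itself stays opaque to the stream.
    class EncryptedEnvelope {
        final String keyId;          // identifies which key encrypted this message
        final String declaredType;   // the only schema information left visible
        final byte[] iv;             // per-message nonce
        final byte[] ciphertext;     // the encrypted, serialized message

        EncryptedEnvelope(String keyId, String declaredType, byte[] iv, byte[] ciphertext) {
            this.keyId = keyId;
            this.declaredType = declaredType;
            this.iv = iv;
            this.ciphertext = ciphertext;
        }
    }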

On Wed, Feb 25, 2015 at 5:34 PM, Julio Castillo 
jcasti...@financialengines.com wrote:

 Although full disk encryption appears to be an easy solution, in our case
 that may not be sufficient. For cases where the actual payload needs to be
 encrypted, the cost of encryption is paid by the consumer and producers.
 Further complicating the matter would be the handling of encryption keys,
 etc. I think this is the area where enhancements to Kafka may facilitate
 that key exchange between consumers and producers, still leaving it up to
 the clients, but facilitating the key handling.

 Julio

 On 2/25/15, 4:24 PM, Christian Csar christ...@csar.us wrote:

 The questions we get from customers typically end up being general so we
 break out our answer into network level and on disk scenarios.
 
 On disk/at rest scenario may just be use full disk encryption at the OS
 level and Kafka doesn't need to worry about it. But documenting any issues
 around it would be good. For example what sort of Kafka specific
 performance impacts does it have, ie budgeting for better processors.
 
 The security story right now is to run on a private network, but I believe
 some of our customers like to be told that within datacenter transmissions
 are encrypted on the wire. Based on
 
 https://cwiki.apache.org/confluence/display/KAFKA/Security that might mean
 waiting for TLS support, or using a VPN/ssh tunnel for the network
 connections.
 
 Since we're in hosted stream land we can't do either of the above and
 encrypt the messages themselves. For those enterprises that are like our
 customers but would run Kafka or use Confluent, having a story like the
 above so they don't give up the benefits of your schema management layers
 would be good.
 
 Since I didn't mention it before I did find your blog posts handy (though
 I'm already moving us towards stream centric land).
 
 Christian
 
 On Wed, Feb 25, 2015 at 3:57 PM, Jay Kreps jay.kr...@gmail.com wrote:
 
  Hey Christian,
 
  That makes sense. I agree that would be a good area to dive into. Are
 you
  primarily interested in network level security or encryption on disk?
 
  -Jay
 
  On Wed, Feb 25, 2015 at 1:38 PM, Christian Csar christ...@csar.us
 wrote:
 
   I wouldn't say no to some discussion of encryption. We're running on
  Azure
   EventHubs (with preparations for Kinesis for EC2, and Kafka for
  deployments
   in customer datacenters when needed) so can't just use disk level
   encryption (which would have its own overhead). We're putting all of
 our
   messages inside of encrypted envelopes before sending them to the
 stream
   which limits our opportunities for schema verification of the
 underlying
   messages to the declared type of the message.
  
   Encryption at rest mostly works out to a sales point for customers who
  want
   assurances, and in a Kafka focused discussion might be dealt with by
   covering disk encryption and how the conversations between Kafka
  instances
   are protected.
  
   Christian
  
  
   On Wed, Feb 25, 2015 at 11:51 AM, Jay Kreps j...@confluent.io wrote:
  
Hey guys,
   
One thing we tried to do along with the product release was start to
  put
together a practical guide for using Kafka. I wrote this up here:
   
 
 http://blog.confluent.io/2015/02/25/stream-data-platform-1/
   
I'd like to keep expanding on this as good practices emerge and we
  learn
more stuff. So two questions:
1. Anything you think other people should know about working with
 data
streams? What did you wish you knew when you got started?
2. Anything you don't know about but would like to hear more about?
   
-Jay
   
  
 

 NOTICE: This e-mail and any attachments to it may be privileged,
 confidential or contain trade secret information and is intended only for
 the use of the individual or entity to which it is addressed. If this
 e-mail was sent to you in error, please notify me immediately by either
 reply e-mail or by phone at 408.498.6000, and do not use, disseminate,
 retain, print or copy the e-mail or any attachment. All messages sent to
 and from

Re: Tips for working with Kafka and data streams

2015-02-25 Thread Christian Csar
The questions we get from customers typically end up being general so we
break out our answer into network level and on disk scenarios.

On disk/at rest scenario may just be use full disk encryption at the OS
level and Kafka doesn't need to worry about it. But documenting any issues
around it would be good. For example what sort of Kafka specific
performance impacts does it have, ie budgeting for better processors.

The security story right now is to run on a private network, but I believe
some of our customers like to be told that within datacenter transmissions
are encrypted on the wire. Based on
https://cwiki.apache.org/confluence/display/KAFKA/Security that might mean
waiting for TLS support, or using a VPN/ssh tunnel for the network
connections.

Since we're in hosted stream land we can't do either of the above and
encrypt the messages themselves. For those enterprises that are like our
customers but would run Kafka or use Confluent, having a story like the
above so they don't give up the benefits of your schema management layers
would be good.

Since I didn't mention it before I did find your blog posts handy (though
I'm already moving us towards stream centric land).

Christian

On Wed, Feb 25, 2015 at 3:57 PM, Jay Kreps jay.kr...@gmail.com wrote:

 Hey Christian,

 That makes sense. I agree that would be a good area to dive into. Are you
 primarily interested in network level security or encryption on disk?

 -Jay

 On Wed, Feb 25, 2015 at 1:38 PM, Christian Csar christ...@csar.us wrote:

  I wouldn't say no to some discussion of encryption. We're running on
 Azure
  EventHubs (with preparations for Kinesis for EC2, and Kafka for
 deployments
  in customer datacenters when needed) so can't just use disk level
  encryption (which would have its own overhead). We're putting all of our
  messages inside of encrypted envelopes before sending them to the stream
  which limits our opportunities for schema verification of the underlying
  messages to the declared type of the message.
 
  Encryption at rest mostly works out to a sales point for customers who
 want
  assurances, and in a Kafka focused discussion might be dealt with by
  covering disk encryption and how the conversations between Kafka
 instances
  are protected.
 
  Christian
 
 
  On Wed, Feb 25, 2015 at 11:51 AM, Jay Kreps j...@confluent.io wrote:
 
   Hey guys,
  
   One thing we tried to do along with the product release was start to
 put
   together a practical guide for using Kafka. I wrote this up here:
   http://blog.confluent.io/2015/02/25/stream-data-platform-1/
  
   I'd like to keep expanding on this as good practices emerge and we
 learn
   more stuff. So two questions:
   1. Anything you think other people should know about working with data
   streams? What did you wish you knew when you got started?
   2. Anything you don't know about but would like to hear more about?
  
   -Jay
  
 



Re: Proper Relationship Between Partition and Threads

2015-01-28 Thread Christian Csar
Ricardo,
   The maximum parallelism of each logical consumer (consumer group) is the number
of partitions. So with four partitions it could make sense to have one
logical consumer (application) have two processes on different machines
each with two threads, or one process with four. While with two logical
consumers (two different applications) you would want each to have 4
threads (4*2 = 8 threads total).

There are also considerations depending on which consumer code you are
using (which I'm decidedly not someone with good information on)

Christian
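
As a rough illustration with the 0.8.x high-level consumer (the topic name and
connection settings are assumptions; with 4 partitions, more than 4 threads in one
group just leaves the extras idle):

    import java.util.*;
    import java.util.concurrent.*;
    import kafka.consumer.*;
    import kafka.javaapi.consumer.ConsumerConnector;
    import kafka.message.MessageAndMetadata;

    Properties props = new Properties();
    props.put("zookeeper.connect", "zk1:2181");
    props.put("group.id", "my-application");        // one group = one logical consumer
    ConsumerConnector connector =
        Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

    // Ask for 4 streams of a 4-partition topic and give each stream its own thread.
    Map<String, Integer> topicCountMap = Collections.singletonMap("my-topic", 4);
    List<KafkaStream<byte[], byte[]>> streams =
        connector.createMessageStreams(topicCountMap).get("my-topic");

    ExecutorService executor = Executors.newFixedThreadPool(4);
    for (final KafkaStream<byte[], byte[]> stream : streams) {
        executor.submit(new Runnable() {
            public void run() {
                for (MessageAndMetadata<byte[], byte[]> msg : stream) {
                    // process msg.message()
                }
            }
        });
    }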

On Wed, Jan 28, 2015 at 1:28 PM, Ricardo Ferreira 
jricardoferre...@gmail.com wrote:

 Hi experts,

 I'm newbie in the Kafka world, so excuse me for such basic question.

 I'm in the process of designing a client for Kafka, and after a few hours of
  study, I was told that to achieve a proper level of parallelism, it is a
  best practice to have one thread for each partition of a topic.

 My question is whether this rule of thumb also applies for multiple consumer
 applications. For instance:

 Considering a topic with 4 partitions, it is OK to have one consumer
 application with 4 threads, just like would be OK to have two consumer
 applications with 2 threads each. But what about having two consumer
  applications with 4 threads each? Would it break any load balancing done by
  the Kafka brokers?

 Anyway, I'd like to understand if the proper number of threads that should
 match the number of partitions is per application or if there is some other
 best practice.

 Thanks in advance,

 Ricardo Ferreira



Re: How to mark a message as needing to retry in Kafka?

2015-01-28 Thread Christian Csar
noodles,
   Without an external mechanism you won't be able to mark individual
messages/offsets as needing to be retried at a later time. Guozhang is
describing a way to get the offset of a message that's been received so
that you can find it later. You would need to save that into a 'failed
messages' store somewhere else and have code that looks in there to make
retries happen (assuming you want the failure/retry to persist beyond the
lifetime of the process).

Christian
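
A bare-bones sketch of that pattern, i.e. recording failed work somewhere outside
Kafka so a timer can retry it later (FailedMessage, failedMessageStore and
callWebhook are hypothetical names, not part of any Kafka API):

    // Coordinates are enough to re-fetch the payload later with a consumer,
    // or you can store the payload itself if re-fetching is awkward.
    class FailedMessage {
        final String topic;
        final int partition;
        final long offset;
        FailedMessage(String topic, int partition, long offset) {
            this.topic = topic; this.partition = partition; this.offset = offset;
        }
    }

    void handle(String topic, int partition, long offset, byte[] payload) {
        try {
            callWebhook(payload);                       // hypothetical downstream call
        } catch (Exception e) {
            // Don't block the consumer; park the message and let a timer retry it.
            failedMessageStore.save(new FailedMessage(topic, partition, offset));
        }
    }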


On Wed, Jan 28, 2015 at 7:00 PM, Guozhang Wang wangg...@gmail.com wrote:

 I see. If you are using the high-level consumer, once the message is
 returned to the application it is considered consumed, and currently it is
 not supported to rewind to a previously consumed message.

 With the new consumer coming in 0.8.3 release, we have an api for you to
 get the offset of each message and do the rewinding based on offsets. For
 example, you can do sth. like

 

   message = // get one message from consumer

   try {
 // process message
   } catch {
 consumer.seek(message.offset)
   }

 

 Guozhang

 On Wed, Jan 28, 2015 at 6:26 PM, noodles rungumpth...@gmail.com wrote:

  I did not describe my problem clearly. In my case, I got the message from
   Kafka, but I could not handle this message for some reason, for
  example the external server is down. So I want to mark the message as not
  being consumed directly.
 
  2015-01-28 23:26 GMT+08:00 Guozhang Wang wangg...@gmail.com:
 
   Hi,
  
   Which consumer are you using? If you are using a high level consumer
 then
   retry would be automatic upon network exceptions.
  
   Guozhang
  
   On Wed, Jan 28, 2015 at 1:32 AM, noodles rungumpth...@gmail.com
 wrote:
  
Hi group:
   
I'm working for building a webhook notification service based on
  Kafka. I
produce all of the payloads into Kafka, and consumers consume these
payloads by offset.
   
Sometimes some payloads cannot be consumed because of network
 exception
   or
http server exception. So I want to mark the failed payloads and
 retry
   them
by timers. But I have no idea if I don't use a storage (like MySQL)
   except
kafka and zookeeper.
   
   
--
*noodles!*
   
  
  
  
   --
   -- Guozhang
  
 
 
 
  --
  *Yeah, I'm noodles!*
 



 --
 -- Guozhang



Re: Is kafka suitable for our architecture?

2014-10-10 Thread Christian Csar
 in a production
environment I can't speak to that. I don't think there is any reason you
*need* to use Storm rather than just Kafka to achieve your needs though.

Christian

 
 Christian

 
 Thanks
 
 
 

 On Thu, Oct 9, 2014 at 11:57 PM, Albert Vila albert.v...@augure.com
 wrote:

 Hi

 We process data in real time, and we are taking a look at Storm and Spark
 streaming too, however our actions are atomic, done at a document level
 so
 I don't know if it fits on something like Storm/Spark.

 Regarding what you said, Christian, isn't Kafka used for scenarios like
 the
 one I described? I mean, we do have work queues right now with Gearman,
 but
 with a bunch of workers on each step. I thought we could change that to a
 producer and a bunch of consumers (where the message should only reach
 one
 and exactly one consumer).

 And what I said about the data locally, it was only an optimization we
 did
 some time ago because we were moving more data back then. Maybe now it's
 not
 necessary and we could move messages around the system using Kafka, so it
 will allow us to simplify the architecture a little bit. I've seen people
 saying they move Tb of data every day using Kafka.

 Just to be clear on the size of each document/message, we are talking
 about
 tweets, blog posts, ... (in 90% of cases the size is less than 50Kb)

 Regards

 On 9 October 2014 20:02, Christian Csar cac...@gmail.com wrote:

 Apart from your data locality problem it sounds like what you want is a
 workqueue. Kafka's consumer structure doesn't lend itself too well to
 that use case as a single partition of a topic should only have one
 consumer instance per logical subscriber of the topic, and that
 consumer
 would not be able to mark jobs as completed except in a strict order
 (while maintaining a processed successfully at least once guarantee).
 This is not to say it cannot be done, but I believe your workqueue
 would
 end up working a bit strangely if built with Kafka.

 Christian

 On 10/09/2014 06:13 AM, William Briggs wrote:
 Manually managing data locality will become difficult to scale. Kafka
 is
 one potential tool you can use to help scale, but by itself, it will
 not
 solve your problem. If you need the data in near-real time, you could
 use a
 technology like Spark or Storm to stream data from Kafka and perform
 your
 processing. If you can batch the data, you might be better off
 pulling
 it
 into a distributed filesystem like HDFS, and using MapReduce, Spark
 or
 another scalable processing framework to handle your transformations.
 Once
 you've paid the initial price for moving the document into HDFS, your
 network traffic should be fairly manageable; most clusters built for
 this
 purpose will schedule work to be run local to the data, and typically
 have
 separate, high-speed network interfaces and a dedicated switch in
 order
 to
 optimize intra-cluster communications when moving data is
 unavoidable.

 -Will

 On Thu, Oct 9, 2014 at 7:57 AM, Albert Vila albert.v...@augure.com
 wrote:

 Hi

 I just came across Kafta when I was trying to find solutions to
 scale
 our
 current architecture.

 We are currently downloading and processing 6M documents per day
 from
 online and social media. We have a different workflow for each type
 of
 document, but some of the steps are keyword extraction, language
 detection,
 clustering, classification, indexation,  We are using Gearman to
 dispatch the job to workers and we have some queues on a database.

 I'm wondering if we could integrate Kafka on the current workflow
 and
 if
 it's feasible. One of our main discussions are if we have to go to a
 fully
 distributed architecture or to a semi-distributed one. I mean,
 distribute
 everything or process some steps on the same machine (crawling,
 keyword
 extraction, language detection, indexation). We don't know which one
 scales
 more, each one has pros and cont.

 Now we have a semi-distributed one as we had network problems taking
 into
 account the amount of data we were moving around. So now, all
 documents
 crawled on server X, later on are dispatched through Gearman to the
 same
 server. What we dispatch on Gearman is only the document id, and the
 document data remains on the crawling server on a Memcached, so the
 network
 traffic is keep at minimum.

 What do you think?
 It's feasible to remove all database queues and Gearman and move to
 Kafka?
 As Kafka is mainly based on messages I think we should move the
 messages
 around, should we take into account the network? We may face the
 same
 problems?
 If so, there is a way to isolate some steps to be processed on the
 same
 machine, to avoid network traffic?

 Any help or comment will be appreciate. And If someone has had a
 similar
 problem and has knowledge about the architecture approach will be
 more
 than
 welcomed.

 Thanks






 --
 *Albert Vila*
 R&D Manager & Software Developer


 Tél. : +34 972 982 968

 *www.augure.com* http://www.augure.com/ | *Blog.* Reputation in action

Re: Is kafka suitable for our architecture?

2014-10-09 Thread Christian Csar
Apart from your data locality problem it sounds like what you want is a
workqueue. Kafka's consumer structure doesn't lend itself too well to
that use case as a single partition of a topic should only have one
consumer instance per logical subscriber of the topic, and that consumer
would not be able to mark jobs as completed except in a strict order
(while maintaining a processed successfully at least once guarantee).
This is not to say it cannot be done, but I believe your workqueue would
end up working a bit strangely if built with Kafka.

Christian

On 10/09/2014 06:13 AM, William Briggs wrote:
 Manually managing data locality will become difficult to scale. Kafka is
 one potential tool you can use to help scale, but by itself, it will not
 solve your problem. If you need the data in near-real time, you could use a
 technology like Spark or Storm to stream data from Kafka and perform your
 processing. If you can batch the data, you might be better off pulling it
 into a distributed filesystem like HDFS, and using MapReduce, Spark or
 another scalable processing framework to handle your transformations. Once
 you've paid the initial price for moving the document into HDFS, your
 network traffic should be fairly manageable; most clusters built for this
 purpose will schedule work to be run local to the data, and typically have
 separate, high-speed network interfaces and a dedicated switch in order to
 optimize intra-cluster communications when moving data is unavoidable.
 
 -Will
 
 On Thu, Oct 9, 2014 at 7:57 AM, Albert Vila albert.v...@augure.com wrote:
 
 Hi

 I just came across Kafka when I was trying to find solutions to scale our
 current architecture.

 We are currently downloading and processing 6M documents per day from
 online and social media. We have a different workflow for each type of
 document, but some of the steps are keyword extraction, language detection,
 clustering, classification, indexation,  We are using Gearman to
 dispatch the job to workers and we have some queues on a database.

 I'm wondering if we could integrate Kafka on the current workflow and if
 it's feasible. One of our main discussions are if we have to go to a fully
 distributed architecture or to a semi-distributed one. I mean, distribute
 everything or process some steps on the same machine (crawling, keyword
 extraction, language detection, indexation). We don't know which one scales
 more, each one has pros and cons.

 Now we have a semi-distributed one as we had network problems taking into
 account the amount of data we were moving around. So now, all documents
 crawled on server X, later on are dispatched through Gearman to the same
 server. What we dispatch on Gearman is only the document id, and the
 document data remains on the crawling server on a Memcached, so the network
 traffic is kept at a minimum.

 What do you think?
 Is it feasible to remove all database queues and Gearman and move to Kafka?
 As Kafka is mainly based on messages I think we should move the messages
 around, should we take into account the network? We may face the same
 problems?
 If so, there is a way to isolate some steps to be processed on the same
 machine, to avoid network traffic?

 Any help or comment will be appreciated. And if someone has had a similar
 problem and has knowledge about the architectural approach, it will be more than
 welcome.

 Thanks

 



Re: Use case

2014-09-05 Thread Christian Csar
The thought experiment I did ended up having a set of front end servers
corresponding to a given chunk of the user id space, each of which was a
separate subscriber to the same set of partitions. The you have one or
more partitions corresponding to that same chunk of users. You want the
chunk/set of partitions to be of a size where each of those front end
servers can process all the messages in it and send out the
chats/notifications/status change notifications perhaps/read receipts,
to those users who happen to be connected to the particular front end node.

You would need to handle some deduplication on the consumers/FE servers
and would need to decide where to produce. Producing from every front
end server to potentially every broker could be expensive in terms of
connections and you might want to first relay the messages to the
corresponding front end cluster, but since we don't use large numbers of
producers it's hard for me to say.

For persistence and offline delivery you can probably accept a delay in
user receipt so you can use another set of consumers that persist the
messages to the longer latency datastore on the backend and then get the
last 50 or so messages with a bit of lag when the user first looks at
history (see hipchat and hangouts lag).

This gives you a smaller number of partitions and avoids the issue of
having to keep too much history on the Kafka brokers. There are
obviously a significant number of complexities to deal with. For example
if you are using default consumer code that commits offsets into
zookeeper it may be inadvisable at large scales, though you probably don't need
to worry about reaching those. And remember I had done this only as a thought
experiment not a proper technical evaluation. I expect Kafka, used
correctly, can make aspects of building such a chat system much much
easier (you can avoid writing your own message replication system) but
it is definitely not plug and play using topics for users.

Christian


On 09/05/2014 09:46 AM, Jonathan Weeks wrote:
 +1
 
 Topic Deletion with 0.8.1.1 is extremely problematic, and coupled with the 
 fact that rebalance/broker membership changes pay a cost per partition today, 
 whereby excessive partitions extend downtime in the case of a failure; this 
 means fewer topics (e.g. hundreds or thousands) is a best practice in the 
 published version of kafka. 
 
 There are also secondary impacts on topic count — e.g. useful operational 
 tools such as: http://quantifind.com/KafkaOffsetMonitor/ start to become 
 problematic in terms of UX with a massive number of topics.
 
 Once topic deletion is a supported feature, the use-case outlined might be 
 more tenable.
 
 Best Regards,
 
 -Jonathan
 
 On Sep 5, 2014, at 4:20 AM, Sharninder sharnin...@gmail.com wrote:
 
 I'm not really sure about your exact use-case but I don't think having a
 topic per user is very efficient. Deleting topics in kafka, at the moment,
 isn't really straightforward. You should rethink your data pipeline a bit.

 Also, just because kafka has the ability to store messages for a certain
 time, don't think of it as a data store. Kafka is a streaming system, think
 of it as a fast queue that gives you the ability to move your pointer back.

 --
 Sharninder



 On Fri, Sep 5, 2014 at 4:27 PM, Aris Alexis aris.alexis@gmail.com
 wrote:

 Thanks for the reply. If I use it only for activity streams like twitter:

 I would want a topic for each #tag and a topic for each user and maybe
 foreach city. Would that be too many topics or it doesn't matter since most
 of them will be deleted in a specified interval.



 Best Regards,
 Aris Giachnis


 On Fri, Sep 5, 2014 at 6:57 AM, Sharninder sharnin...@gmail.com wrote:

 Since you want all chats and mail history persisted all the time, I
 personally wouldn't recommend kafka for your requirement. Kafka is more
 suitable as a streaming system where events expire after a certain time.
 Look at something more general purpose like hbase for persisting data
 indefinitely.

 So, for example all activity streams can go into kafka from where
 consumers
 will pick up messages to parse and put them to hbase or other clients.

 --
 Sharninder





 On Fri, Sep 5, 2014 at 12:05 AM, Aris Alexis snowboard...@gmail.com
 wrote:

 Hello,

 I am building a big web application that I want to be massively
 scalable
 (I
 am using cassandra and titan as a general db).

 I want to implement the following:

 real time web chat that is persisted so that user a in the future can
 recall his chat with user b,c,d much like facebook.
 mail like messages in the web application (not sure about this as it is
 somewhat covered by the first one)
 user activity streams
 users subscribing to topics for example florida/musicevents

 Could i use kafka for this? can you recommend another technology maybe?



 
 






Re: Handling send failures with async producer

2014-08-26 Thread Christian Csar
TLDR: I use one Callback per job I send to Kafka and include that sort
of information by reference in the Callback instance.

Our system is currently moving data from beanstalkd to Kafka due to
historical reasons so we use the callback to either delete or release
the message depending on success. The
org.apache.kafka.clients.producer.Callback I give to the send method is
an instance of a class that stores all the additional information I need
to process the callback. Remember that the async call operates in the
Kafka producer thread so they must be fast to avoid constraining the
throughput. My call back ends up putting information about the call to
beanstalk into another executor service for later processing.

Christian
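
Concretely, the pattern looks something like this (BeanstalkJob, followUpExecutor
and the topic name are placeholders for whatever your job source actually is):

    import java.util.concurrent.ExecutorService;
    import org.apache.kafka.clients.producer.Callback;
    import org.apache.kafka.clients.producer.RecordMetadata;

    class JobCallback implements Callback {
        private final BeanstalkJob job;                 // hypothetical type carrying what's needed to ack/retry
        private final ExecutorService followUpExecutor;

        JobCallback(BeanstalkJob job, ExecutorService followUpExecutor) {
            this.job = job;
            this.followUpExecutor = followUpExecutor;
        }

        @Override
        public void onCompletion(final RecordMetadata metadata, final Exception exception) {
            // Runs on the producer's I/O thread, so hand the real work off immediately.
            followUpExecutor.submit(new Runnable() {
                public void run() {
                    if (exception == null) {
                        job.delete();                   // send succeeded
                    } else {
                        job.release();                  // put it back for a later retry
                    }
                }
            });
        }
    }

    // usage (ProducerRecord/producer generics vary by client version):
    // producer.send(new ProducerRecord("events", payloadBytes), new JobCallback(job, followUpExecutor));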

On 08/26/2014 12:35 PM, Ryan Persaud wrote:
 Hello,
 
 I'm looking to insert log lines from log files into kafka, but I'm concerned 
 with handling asynchronous send() failures.  Specifically, if some of the log 
 lines fail to send, I want to be notified of the failure so that I can 
 attempt to resend them.
 
 Based on previous threads on the mailing list 
 (http://comments.gmane.org/gmane.comp.apache.kafka.user/1322), I know that 
 the trunk version of kafka supports callbacks for dealing with failures.  
 However, the callback function is not passed any metadata that can be used by 
 the producer end to reference the original message.  Including the key of the 
 message in the RecordMetadata seems like it would be really useful for 
 recovery purposes.  Is anyone using the callback functionality to trigger 
 resends of failed messages?  If so, how are they tying the callbacks to 
 messages?  Is anyone using other methods for handling async errors/resending 
 today?  I can’t imagine that I am the only one trying to do this.  I asked 
 this question on the IRC channel today, and it sparked some discussion, but I 
 wanted to hear from a wider audience.
 
 Thanks for the information,
 -Ryan
 
 






Re: EBCDIC support

2014-08-25 Thread Christian Csar
Having been spared any EBCDIC experience whatsoever (i.e. from a position
of thorough ignorance), if you are transmitting text or things with a
designated textual form (presumably) I would recommend that your
conversion be to unicode rather than ascii if you don't already have
consumers expecting a given conversion. That way you will avoid losing
information, particularly if you expect any of your conversion tools to
be of more general use.

Christian
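
For what it's worth, a minimal sketch of such a conversion in Java before producing
(the Cp1047 code page is an assumption - the right EBCDIC code page depends on the
mainframe, and the IBM code pages need a JRE with the extended charsets installed):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    byte[] ebcdicBytes = readFromSocket();   // hypothetical: raw bytes off the TCP/IP socket
    String text = new String(ebcdicBytes, Charset.forName("Cp1047"));   // EBCDIC -> Unicode
    byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
    // then hand utf8Bytes to your Kafka producer as the message payload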

On 08/25/2014 05:36 PM, Gwen Shapira wrote:
 Personally, I like converting data before writing to Kafka, so I can
 easily support many consumers who don't know about EBCDIC.
 
 A third option is to have a consumer that reads EBCDIC data from one
 Kafka topic and writes ASCII to another Kafka topic. This has the
 benefits of preserving the raw data in Kafka, in case you need it for
 troubleshooting, and also supporting non-EBCDIC consumers.
 
 The cost is a more complex architecture, but if you already have a
 stream processing system around (Storm, Samza, Spark), it can be an
 easy addition.
 
 
 On Mon, Aug 25, 2014 at 5:28 PM,  sonali.parthasara...@accenture.com wrote:
 Thanks Gwen! makes sense. So I'll have to weigh the pros and cons of doing 
 an EBCDIC to ASCII conversion before sending to Kafka Vs. using an ebcdic 
 library after in the consumer

 Thanks!
 S

 -Original Message-
 From: Gwen Shapira [mailto:gshap...@cloudera.com]
 Sent: Monday, August 25, 2014 5:22 PM
 To: users@kafka.apache.org
 Subject: Re: EBCDIC support

 Hi Sonali,

 Kafka doesn't really care about EBCDIC or any other format -  for Kafka bits 
 are just bits. So they are all supported.

 Kafka does not read data from a socket though. Well, it does, but the data 
 has to be sent by a Kafka producer. Most likely you'll need to implement a 
 producer that will get the data from the socket and send it as a message to 
  Kafka. The content of the message can be anything, including EBCDIC.

 Then  you'll need a consumer to read the data from Kafka and do something 
 with this - the consumer will need to know what to do with a message that 
 contains EBCDIC data. Perhaps you have EBCDIC libraries you can reuse there.

 Hope this helps.

 Gwen

 On Mon, Aug 25, 2014 at 5:14 PM,  sonali.parthasara...@accenture.com wrote:
 Hey all,

 This might seem like a silly question, but does kafka have support for 
 EBCDIC? Say I had to read data from an IBM mainframe via a TCP/IP socket 
 where the data resides in EBCDIC format, can Kafka read that directly?

 Thanks,
 Sonali

 

 This message is for the designated recipient only and may contain 
 privileged, proprietary, or otherwise confidential information. If you have 
 received it in error, please notify the sender immediately and delete the 
 original. Any other use of the e-mail by you is prohibited. Where allowed 
 by local law, electronic communications with Accenture and its affiliates, 
 including e-mail and instant messaging (including content), may be scanned 
 by our systems for the purposes of information security and assessment of 
 internal compliance with Accenture policy.
 __
 

 www.accenture.com






Re: Which producer to use?

2014-06-23 Thread Christian Csar
I ended up coding against the new one,
org.apache.kafka.clients.producer.Producer, though it is not yet in
production here. It might be slightly more painful to select a partition
since there isn't a place to plug in a partitioner class, but overall it
was quite easy and had the key feature of an Async callback.

Christian
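
For example, selecting the partition per record rather than via a partitioner class
ends up looking roughly like this (variable names are placeholders, and the exact
ProducerRecord constructor generics vary between client versions):

    // Hash a key of your choosing onto the known partition count.
    int partition = (messageKey.hashCode() & 0x7fffffff) % numPartitions;

    producer.send(new ProducerRecord(topic, partition, keyBytes, valueBytes),
                  callback);   // callback invoked asynchronously with metadata or an exception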

On 06/23/2014 04:54 PM, Guozhang Wang wrote:
 Hi Kyle,
 
 We have not fully completed the test in production yet for the new
 producer, currently some improvement jiras like KAFKA-1498 are still open.
 Once we have it stabilized in production at LinkedIn we plan to update the
 wiki in favor of the new producer.
 
 Guozhang
 
 
 On Mon, Jun 23, 2014 at 3:39 PM, Kyle Banker kyleban...@gmail.com wrote:
 
 As of today, the latest Kafka docs show kafka.javaapi.producer.Producer in
 their example of the producer API (
 https://kafka.apache.org/documentation.html#producerapi).

 Is there a reason why the latest producer client
 (org.apache.kafka.clients.producer.Producer)
 isn't mentioned? Is this client not preferred or production-ready?

 
 
 






0.8.1 Java Producer API Callbacks

2014-05-01 Thread Christian Csar
I'm looking at using the java producer api for 0.8.1 and I'm slightly
confused by this passage from section 4.4 of
https://kafka.apache.org/documentation.html#theproducer
Note that as of Kafka 0.8.1 the async producer does not have a
callback, which could be used to register handlers to catch send errors.
Adding such callback functionality is proposed for Kafka 0.9, see
[Proposed Producer
API](https://cwiki.apache.org/confluence/display/KAFKA/Client+Rewrite#ClientRewrite-ProposedProducerAPI).

org.apache.kafka.clients.producer.KafkaProducer in 0.8.1 appears to have
public Future<RecordMetadata> send(ProducerRecord record, Callback
callback) which looks like the mentioned callback.

How do the callbacks work with the async producer? Is it as described in the
comment on the send method (see
https://github.com/apache/kafka/blob/0.8.1/clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java#L151
for reference)?

Looking around it seems plausible the language in the documentation
might refer to a separate sort of callback that existed in 0.7 but not
0.8. In our use case we have something useful to do if we can detect
messages failing to be sent.

Christian





Re: 0.8.1 Java Producer API Callbacks

2014-05-01 Thread Christian Csar
On 05/01/2014 07:22 PM, Christian Csar wrote:
 I'm looking at using the java producer api for 0.8.1 and I'm slightly
 confused by this passage from section 4.4 of
 https://kafka.apache.org/documentation.html#theproducer
 Note that as of Kafka 0.8.1 the async producer does not have a
 callback, which could be used to register handlers to catch send errors.
 Adding such callback functionality is proposed for Kafka 0.9, see
 [Proposed Producer
 API](https://cwiki.apache.org/confluence/display/KAFKA/Client+Rewrite#ClientRewrite-ProposedProducerAPI).
 
 org.apache.kafka.clients.producer.KafkaProducer in 0.8.1 appears to have
 public Future<RecordMetadata> send(ProducerRecord record, Callback
 callback) which looks like the mentioned callback.
 
 How do the callbacks work with the async producer? Is it as described in the
 comment on the send method (see
 https://github.com/apache/kafka/blob/0.8.1/clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java#L151
 for reference)?
 
 Looking around it seems plausible the language in the documentation
 might refer to a separate sort of callback that existed in 0.7 but not
 0.8. In our use case we have something useful to do if we can detect
 messages failing to be sent.
 
 Christian
 

It appears that I was looking at the Java client rather than the Scala
java api referenced by the documentation
https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/javaapi/producer/Producer.scala

Are both of these currently suited for use from java and still
supported? Given the support for callbacks in the event of failure I am
inclined to use the Java one despite the currently limited support for
specifying partitioners (though it supports specifying the partition) or
encoders.

Any guidance on this would be appreciated.

Christian


