Re: Poor performance running performance test

2015-01-28 Thread Ewen Cheslack-Postava
That error indicates the broker closed the connection for some reason. Any
useful logs from the broker? It looks like you're using ELB, which could
also be the culprit. A connection timeout seems doubtful, but ELB can also
close connections for other reasons, like failed health checks.

-Ewen

On Tue, Jan 27, 2015 at 11:21 PM, Dillian Murphey crackshotm...@gmail.com
wrote:

 I was running the performance command from a VirtualBox server, so that
 seems to have been part of the problem. I'm getting better results running
 this on a server on AWS, but that's kind of expected. Can you look at
 these results, and comment on the occasional warning I see? I appreciate
 it!

 1220375 records sent, 243928.6 records/sec (23.26 MB/sec), 2111.5 ms avg
 latency, 4435.0 max latency.
 1195090 records sent, 239018.0 records/sec (22.79 MB/sec), 2203.1 ms avg
 latency, 4595.0 max latency.
 1257165 records sent, 251433.0 records/sec (23.98 MB/sec), 2172.6 ms avg
 latency, 4525.0 max latency.
 1230981 records sent, 246196.2 records/sec (23.48 MB/sec), 2173.5 ms avg
 latency, 4465.0 max latency.
 [2015-01-28 07:19:07,274] WARN Error in I/O with
 myawsloadbalancer(org.apache.kafka.common.network.Selector)
 java.io.EOFException
 at

 org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:62)
 at org.apache.kafka.common.network.Selector.poll(Selector.java:248)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:192)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:191)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:122)
 at java.lang.Thread.run(Thread.java:745)
 1090689 records sent, 218137.8 records/sec (20.80 MB/sec), 2413.6 ms avg
 latency, 4829.0 max latency.

 On Tue, Jan 27, 2015 at 7:37 PM, Ewen Cheslack-Postava e...@confluent.io
 wrote:

  Where are you running ProducerPerformance in relation to ZK and the Kafka
  brokers? You should definitely see much higher performance than this.
 
  A couple of other things I can think of that might be going wrong: Are all
  your VMs in the same AZ? Are you storing Kafka data in EBS or local
  ephemeral storage? If EBS, have you provisioned enough IOPS?
 
 
  On Tue, Jan 27, 2015 at 4:29 PM, Dillian Murphey 
 crackshotm...@gmail.com
  wrote:
 
    I'm a new user/admin of Kafka. I'm running a 3-node ZK ensemble and 6
    brokers on AWS.
  
   The performance I'm seeing is shockingly bad. I need some advice!
  
   bin/kafka-run-class.sh
 org.apache.kafka.clients.tools.ProducerPerformance
   test2 5000 100 -1 acks=1 bootstrap.servers=5myloadbalancer:9092
   buffer.memory=67108864 batch.size=8196
  
  
  
  
   6097 records sent, 13198.3 records/sec (1.26 MB/sec), 2098.0 ms avg
   latency, 4306.0 max latency.
   71695 records sent, 14339.0 records/sec (1.37 MB/sec), 6658.1 ms avg
   latency, 9053.0 max latency.
   65195 records sent, 13028.6 records/sec (1.24 MB/sec), 11504.0 ms avg
   latency, 13809.0 max latency.
   71955 records sent, 14391.0 records/sec (1.37 MB/sec), 16137.4 ms avg
   latency, 18541.0 max latency.
  
   Thanks for any help!
  
 
 
 
  --
  Thanks,
  Ewen
 




-- 
Thanks,
Ewen


How to mark a message as needing to retry in Kafka?

2015-01-28 Thread noodles
Hi group:

I'm working on building a webhook notification service based on Kafka. I
produce all of the payloads into Kafka, and consumers consume these
payloads by offset.

Sometimes a payload cannot be consumed because of a network exception or an
HTTP server exception. I want to mark the failed payloads and retry them
with timers, but I have no idea how to do that without a storage system
(like MySQL) other than Kafka and ZooKeeper.


-- 
*noodles!*


Re: How to mark a message as needing to retry in Kafka?

2015-01-28 Thread Guozhang Wang
Hi,

Which consumer are you using? If you are using a high level consumer then
retry would be automatic upon network exceptions.

Guozhang

On Wed, Jan 28, 2015 at 1:32 AM, noodles rungumpth...@gmail.com wrote:

 Hi group:

 I'm working for building a webhook notification service based on Kafka. I
 produce all of the payloads into Kafka, and consumers consume these
 payloads by offset.

 Sometimes some payloads cannot be consumed because of network exception or
 http server exception. So I want to mark the failed payloads and retry them
 by timers. But I have no idea if I don't use a storage (like MySQL) except
 kafka and zookeeper.


 --
 *noodles!*




-- 
-- Guozhang


Re: Poor performance running performance test

2015-01-28 Thread Dillian Murphey
You could be right, Ewen. I was starting to wonder about the load balancer
too. Is using a load balancer a bad idea? How else do users know which
Kafka broker to connect to?

Using one of the broker IPs directly, I don't see that error, but I am
seeing an occasional connection refused. What the heck. Maybe this is
another AWS-specific thing.

Also, I am running the Kafka brokers in a Docker container. I will remove
the Docker component and see if that makes a difference.

Thanks for the reply. bow


question on the mailing list

2015-01-28 Thread Dillian Murphey
Hi all,

Sorry for asking, but is there some easier way to use the mailing list?
Maybe a tool which makes reading and replying to messages more like google
groups?  I like the hadoop searcher, but the UI on that is really bad.

tnx


Re: Poll: Producer/Consumer impl/language you use?

2015-01-28 Thread David McNelis
I agree with Stephen, it would be really unfortunate to see the Scala api
go away.

On Wed, Jan 28, 2015 at 11:57 AM, Stephen Boesch java...@gmail.com wrote:

 The scala API going away would be a minus. As Koert mentioned we could use
 the java api but it is less ..  well .. functional.

 Kafka is included in the Spark examples and external modules and is popular
 as a component of ecosystems on Spark (for which scala is the primary
 language).

 2015-01-28 8:51 GMT-08:00 Otis Gospodnetic otis.gospodne...@gmail.com:

  Hi,
 
  I don't have a good excuse here. :(
  I thought about including Scala, but for some reason didn't do it.  I see
  12-13% of people chose Other.  Do you think that is because I didn't
  include Scala?
 
  Also, is the Scala API reeally going away?
 
  Otis
  --
  Monitoring * Alerting * Anomaly Detection * Centralized Log Management
  Solr  Elasticsearch Support * http://sematext.com/
 
 
  On Tue, Jan 20, 2015 at 4:43 PM, Koert Kuipers ko...@tresata.com
 wrote:
 
   no scala? although scala can indeed use the java api, its ugly we
   prefer to use the scala api (which i believe will go away
 unfortunately)
  
   On Tue, Jan 20, 2015 at 2:52 PM, Otis Gospodnetic 
   otis.gospodne...@gmail.com wrote:
  
Hi,
   
I was wondering which implementations/languages people use for their
   Kafka
Producer/Consumers not everyone is using the Java APIs.  So
 here's
  a
1-question poll:
   
   
  http://blog.sematext.com/2015/01/20/kafka-poll-producer-consumer-client/
   
Will share the results in about a week when we have enough votes.
   
Thanks!
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log
 Management
Solr  Elasticsearch Support * http://sematext.com/
   
  
 



Proper Relationship Between Partition and Threads

2015-01-28 Thread Ricardo Ferreira
Hi experts,

I'm a newbie in the Kafka world, so excuse me for such a basic question.

I'm in the process of designing a client for Kafka, and after a few hours
of study, I was told that to achieve a proper level of parallelism, it is a
best practice to have one thread for each partition of a topic.

My question is whether this rule of thumb also applies across multiple
consumer applications. For instance:

Considering a topic with 4 partitions, it is OK to have one consumer
application with 4 threads, just as it would be OK to have two consumer
applications with 2 threads each. But what about having two consumer
applications with 4 threads each? Would that break any load balancing done
by the Kafka brokers?

Anyway, I'd like to understand whether the number of threads that should
match the number of partitions is per application, or whether there is some
other best practice.

Thanks in advance,

Ricardo Ferreira


Re: Proper Relationship Between Partition and Threads

2015-01-28 Thread Ricardo Ferreira
Thank you very much Christian.

That's what I concluded too, I wanted just to double check.

Best regards,

Ricardo Ferreira

On Wed, Jan 28, 2015 at 4:44 PM, Christian Csar christ...@csar.us wrote:

 Ricardo,
The parallelism of each logical consumer (consumer group) is the number
 of partitions. So with four partitions it could make sense to have one
 logical consumer (application) have two processes on different machines
 each with two threads, or one process with four. While with two logical
 consumers (two different applications) you would want each to have 4
 threads (4*2 = 8 threads total).

 There are also considerations depending on which consumer code you are
 using (which I'm decidedly not someone with good information on)

 Christian

 On Wed, Jan 28, 2015 at 1:28 PM, Ricardo Ferreira 
 jricardoferre...@gmail.com wrote:

  Hi experts,
 
  I'm newbie in the Kafka world, so excuse me for such basic question.
 
  I'm in the process of designing a client for Kafka, and after few hours
 of
  study, I was told that to achieve a proper level of parallelism, it is a
  best practice having one thread for each partition of an topic.
 
  My question is that this rule-of-thumb also applies for multiple consumer
  applications. For instance:
 
  Considering a topic with 4 partitions, it is OK to have one consumer
  application with 4 threads, just like would be OK to have two consumer
  applications with 2 threads each. But what about having two consumer
  applications with 4 threads each? It would break any load-balancing made
 by
  Kafka brokers?
 
  Anyway, I'd like to understand if the proper number of threads that
 should
  match the number of partitions is per application or if there is some
 other
  best practice.
 
  Thanks in advance,
 
  Ricardo Ferreira
 



Re: Proper Relationship Between Partition and Threads

2015-01-28 Thread Christian Csar
Ricardo,
   The parallelism of each logical consumer (consumer group) is the number
of partitions. So with four partitions it could make sense to have one
logical consumer (application) have two processes on different machines
each with two threads, or one process with four. While with two logical
consumers (two different applications) you would want each to have 4
threads (4*2 = 8 threads total).

There are also considerations depending on which consumer code you are
using (which I'm decidedly not someone with good information on).

Christian
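
A minimal sketch of the sizing Christian describes, using the 0.8
high-level consumer: one logical consumer (one group.id) asking for four
streams on a four-partition topic, one thread per stream. The topic name,
ZooKeeper address, and group id are illustrative assumptions.

  import java.util.Collections;
  import java.util.List;
  import java.util.Map;
  import java.util.Properties;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import kafka.consumer.Consumer;
  import kafka.consumer.ConsumerConfig;
  import kafka.consumer.KafkaStream;
  import kafka.javaapi.consumer.ConsumerConnector;
  import kafka.message.MessageAndMetadata;

  public class PartitionThreads {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("zookeeper.connect", "localhost:2181"); // illustrative
          props.put("group.id", "my-app"); // one logical consumer
          ConsumerConnector connector =
              Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

          // Ask for 4 streams on a 4-partition topic: one thread per partition.
          Map<String, List<KafkaStream<byte[], byte[]>>> streams =
              connector.createMessageStreams(
                  Collections.singletonMap("my-topic", 4));

          ExecutorService pool = Executors.newFixedThreadPool(4);
          for (final KafkaStream<byte[], byte[]> stream : streams.get("my-topic")) {
              pool.submit(new Runnable() {
                  public void run() {
                      // blocks waiting for messages from this stream's partitions
                      for (MessageAndMetadata<byte[], byte[]> msg : stream)
                          System.out.println(new String(msg.message()));
                  }
              });
          }
      }
  }

A second application with its own group.id gets its own full copy of the
data, so it would size its thread pool the same way.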

On Wed, Jan 28, 2015 at 1:28 PM, Ricardo Ferreira 
jricardoferre...@gmail.com wrote:

 Hi experts,

 I'm newbie in the Kafka world, so excuse me for such basic question.

 I'm in the process of designing a client for Kafka, and after few hours of
 study, I was told that to achieve a proper level of parallelism, it is a
 best practice having one thread for each partition of an topic.

 My question is that this rule-of-thumb also applies for multiple consumer
 applications. For instance:

 Considering a topic with 4 partitions, it is OK to have one consumer
 application with 4 threads, just like would be OK to have two consumer
 applications with 2 threads each. But what about having two consumer
 applications with 4 threads each? It would break any load-balancing made by
 Kafka brokers?

 Anyway, I'd like to understand if the proper number of threads that should
 match the number of partitions is per application or if there is some other
 best practice.

 Thanks in advance,

 Ricardo Ferreira



Re: Consuming Kafka Messages Inside of EC2 Instances

2015-01-28 Thread Dillian Murphey
Am I understanding your question correctly... You're asking how do you
establish connectivity to an instance in a private subnet from the outside
world?  Are you thinking in terms of zookeeper or just general aws network
connectivity?

On Wed, Jan 28, 2015 at 11:03 AM, Su She suhsheka...@gmail.com wrote:

 Hello All,

 I have set up a cluster of EC2 instances using this method:


 http://blogs.aws.amazon.com/bigdata/post/Tx2D0J7QOVRJBRX/Deploying-Cloudera-s-Enterprise-Data-Hub-on-AWS

 As you can see the instances are w/in a private subnet. I was wondering if
 anyone has any advice on how I can set up a Kafka zookeeper/server on an
 instance that receives messages from a Kafka Producer outside of the
 private subnet. I have tried using the cluster launcher, but I feel like it
 is not a best practice and only a temporary situation.

 Thank you for the help!

 Best,

 Su



Re: Resilient Producer

2015-01-28 Thread Lakshmanan Muthuraman
We have been using Flume to solve a very similar use case. Our servers
write the log files to a local file system, and then we have a Flume agent
which ships the data to Kafka.

You can use Flume with an exec source running tail. Though the exec source
runs well with tail, there are issues if the agent goes down or the file
channel starts backing up. If the agent goes down, you can ask the exec
tail source to go back n lines or to read from the beginning of the file.
The challenge is that we roll our log files on a daily basis: what if the
agent goes down in the evening? We would need to go back over the entire
day's worth of data for reprocessing, which slows down the data flow. We
can also go back an arbitrary number of lines, but then we don't know the
right number of lines to go back. This is the kind of challenge we face. We
have tried the spooling directory source, which works, but it requires a
different log file rotation policy. We even considered rotating files every
minute, but that would still delay the real-time data flow in our
Kafka -> Storm -> Elasticsearch pipeline by a minute.

We are going to do a PoC on Logstash to see whether it solves the problems
we have with Flume.

On Wed, Jan 28, 2015 at 10:39 AM, Fernando O. fot...@gmail.com wrote:

 Hi all,
 I'm evaluating using Kafka.

 I liked this thing of Facebook scribe that you log to your own machine and
 then there's a separate process that forwards messages to the central
 logger.

 With Kafka it seems that I have to embed the publisher in my app, and deal
 with any communication problem managing that on the producer side.

 I googled quite a bit trying to find a project that would basically use
 daemon that parses a log file and send the lines to the Kafka cluster
 (something like a tail file.log but instead of redirecting the output to
 the console: send it to kafka)

 Does anyone knows about something like that?


 Thanks!
 Fernando.



Re: One or multiple instances of MM to aggregate kafka data to one hadoop

2015-01-28 Thread Daniel Compton
Hi Mingjie

I would recommend the first option of running one mirrormaker instance
pulling from multiple DC's.

A single MM instance will be able to make more efficient use of the machine
resources in two ways:
1. You will only have to run one process, which can be allocated the full
amount of the machine's resources.
2. Within the process, if you run enough consumer threads, I think they
should be able to rebalance and pick up the load when they don't have
anything to do. I'm not 100% sure on this, but 1 still holds.

A single MM instance should handle connectivity issues with one DC without
affecting the rest of the consumer threads for other DC's.

You would gain process isolation by running an MM instance per DC, but this
would raise the operational burden and resource requirements. I'm not sure
what benefit you'd actually get from process isolation, so I'd recommend
against it. However, I'd be interested to hear if others do things
differently.

Daniel.
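
For reference, a sketch of what the single-instance option can look like on
the command line, assuming the 0.8.x MirrorMaker options (one
--consumer.config per source DC; the file names and stream count are
illustrative):

  bin/kafka-run-class.sh kafka.tools.MirrorMaker \
    --consumer.config dc1-consumer.properties \
    --consumer.config dc2-consumer.properties \
    --producer.config aggregate-producer.properties \
    --whitelist ".*" \
    --num.streams 8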

On Thu Jan 29 2015 at 11:14:29 AM Mingjie Lai m...@apache.org wrote:

 Hi.

 We have a pretty typical data ingestion use case that we use mirrormaker at
 one hadoop data center, to mirror kafka data from multiple remote
 application data centers. I know mirrormaker can support to consume kafka
 data from multiple kafka source, by one instance at one physical node. By
 this, we can give one instance of mm multiple consumer config files, so it
 can consume data from muti places.

 Another option is to have multiple mirrormaker instances at one node, each
 mm instance is dedicated to grab data from one single source data center.
 Certainly there will be multiple mm nodes to balance the load.

 The second option looks better since it kind of has an isolation for
 different data centers.

 Any recommendation for this kind of data aggregation cases?

 Still new to kafka and mirrormaker. Welcome any information.

 Thanks,
 Mingjie



Re: Poll: Producer/Consumer impl/language you use?

2015-01-28 Thread Jay Kreps
Yeah Joe is exactly right.

Let's not confuse Scala apis with the existing Scala clients. There are a
ton of downsides to those clients. They aren't going away any time in the
foreseeable future, so don't stress, but I think we can kind of deprecate
them and try to shame people into upgrading.

For the Java producer I actually think that API is no worse in Scala than
the existing Scala api, so I'm not sure there is much we can do to improve
it for Scala, but if there is, there would be no harm in adding a Scala
wrapper.

For the new Java consumer there are some Java-isms in the client (e.g. a
couple methods take java maps as arguments). I actually think the existing
scala implicit conversions for java collections might be totally sufficient
but someone would have to try. If not we can add a scala wrapper.

I actually think there is room for lots of wrappers that experiment with
different styles of data access, especially for the consumer. The reactive
people have a bunch of stream related things, Joe mentioned scalaz streams,
etc. Actually the iterator api in the existing consumer was an attempt to
make stream processing easy (since you can apply some of the scala
collections things to the resulting iterator) but I think it wasn't thought
all the way through. I think having this simpler, more flexible base api
will let people experiment with this stuff in any way they want.

-Jay



On Wed, Jan 28, 2015 at 10:45 AM, Joe Stein joe.st...@stealth.ly wrote:

 I kind of look at the Storm, Spark, Samza, etc integrations as
 producers/consumers too.

 Not sure if that maybe was getting lumped also into other.

 I think Jason's 90/10 80/20 70/30 would be found to be typical.

 As far as the Scala API goes, I think we should have a wrapper around the
 shiny new Java Consumer. Folks I know use things like scalaz streams
 https://github.com/scalaz/scalaz-stream which the new consumer can work
 nicely with I think. It would be great if we could come up with a new Scala
 layer on top of the Java consumer that we release in the project. One of my
 engineers is taking a look at that now unless someone is already working on
 that? We are using the partition static assignment in the new consumer and
 just using Mesos for handling re-balance for us. When he gets further along
 and if it makes sense we will shoot a KIP around people can chat about it
 on dev.

 - Joe Stein

 On Wed, Jan 28, 2015 at 10:33 AM, Otis Gospodnetic 
 otis.gospodne...@gmail.com wrote:

  Good point, Jason.  Not sure how we could account for that easily.  But
  maybe that is at least a partial explanation of the Java % being under
 50%
  when Java in general is more popular than that...
 
  Otis
  --
  Monitoring * Alerting * Anomaly Detection * Centralized Log Management
  Solr  Elasticsearch Support * http://sematext.com/
 
 
  On Wed, Jan 28, 2015 at 1:22 PM, Jason Rosenberg j...@squareup.com
 wrote:
 
   I think the results could be a bit skewed, in cases where an
 organization
   uses multiple languages, but not equally.  In our case, we
 overwhelmingly
   use java clients (90%).  But we also have ruby and Go clients too.
 But
  in
   the poll, these come out as equally used client languages.
  
   Jason
  
   On Wed, Jan 28, 2015 at 12:05 PM, David McNelis 
   dmcne...@emergingthreats.net wrote:
  
I agree with Stephen, it would be really unfortunate to see the Scala
  api
go away.
   
On Wed, Jan 28, 2015 at 11:57 AM, Stephen Boesch java...@gmail.com
wrote:
   
 The scala API going away would be a minus. As Koert mentioned we
  could
use
 the java api but it is less ..  well .. functional.

 Kafka is included in the Spark examples and external modules and is
popular
 as a component of ecosystems on Spark (for which scala is the
 primary
 language).

 2015-01-28 8:51 GMT-08:00 Otis Gospodnetic 
  otis.gospodne...@gmail.com
   :

  Hi,
 
  I don't have a good excuse here. :(
  I thought about including Scala, but for some reason didn't do
  it.  I
see
  12-13% of people chose Other.  Do you think that is because I
   didn't
  include Scala?
 
  Also, is the Scala API reeally going away?
 
  Otis
  --
  Monitoring * Alerting * Anomaly Detection * Centralized Log
   Management
  Solr  Elasticsearch Support * http://sematext.com/
 
 
  On Tue, Jan 20, 2015 at 4:43 PM, Koert Kuipers 
 ko...@tresata.com
 wrote:
 
   no scala? although scala can indeed use the java api, its
  ugly
   we
   prefer to use the scala api (which i believe will go away
 unfortunately)
  
   On Tue, Jan 20, 2015 at 2:52 PM, Otis Gospodnetic 
   otis.gospodne...@gmail.com wrote:
  
Hi,
   
I was wondering which implementations/languages people use
 for
their
   Kafka
Producer/Consumers not everyone is using the Java APIs.
 So
 here's
  a
    1-question poll:

    http://blog.sematext.com/2015/01/20/kafka-poll-producer-consumer-client/

Re: Routing modifications at runtime

2015-01-28 Thread Lakshmanan Muthuraman
Hi Toni,

Couple of thoughts.

1. Kafka's behaviour need not be changed at run time. The producers that
push your MAC data into Kafka decide which topic each record is written to.
Your producer can be Flume, Logstash, or your own custom-written Java
producer.

As long as your producers know which topic to write to, they can keep
creating new topics as new MAC data comes through your pipeline, as in the
sketch below.
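
A minimal sketch of that producer-side routing, assuming an in-memory
dictionary that another thread refreshes at runtime; the broker address,
topic names, and MAC value are illustrative:

  import java.util.Properties;
  import java.util.concurrent.ConcurrentHashMap;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.Producer;
  import org.apache.kafka.clients.producer.ProducerRecord;

  public class MacRouter {
      // MAC -> topic dictionary; can be updated at runtime by another thread.
      static final ConcurrentHashMap<String, String> macToTopic =
          new ConcurrentHashMap<String, String>();

      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("bootstrap.servers", "localhost:9092"); // illustrative
          props.put("key.serializer",
              "org.apache.kafka.common.serialization.StringSerializer");
          props.put("value.serializer",
              "org.apache.kafka.common.serialization.StringSerializer");
          Producer<String, String> producer =
              new KafkaProducer<String, String>(props);

          String mac = "00:1a:2b:3c:4d:5e"; // illustrative input
          String topic = macToTopic.get(mac);
          if (topic == null)
              topic = "unrouted-macs"; // fallback for MACs not yet in the dictionary
          producer.send(new ProducerRecord<String, String>(topic, mac, "payload"));
          producer.close();
      }
  }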

On Wed, Jan 28, 2015 at 12:10 PM, Toni Cebrián toni.cebr...@gmail.com
wrote:

 Hi,

 I'm starting to weight different alternatives for data ingestion and
 I'd like to know whether Kafka meets the problem I have.
 Say we have a set of devices each with its own MAC and then we receive
 data in Kafka. There is a dictionary defined elsewhere that says each MAC
 to which topic must publish. So I have basically 2 questions:
 New MACs keep comming and the dictionary must be updated accordingly. How
 could I change this Kafka behaviour during runtime?
 A problem for the future. Say that dictionaries are so big that they don't
 fit in memory. Are there any patterns for bookkeeping internal data
 structures and how route to them?

 T.



One or multiple instances of MM to aggregate kafka data to one hadoop

2015-01-28 Thread Mingjie Lai
Hi.

We have a pretty typical data ingestion use case: we run MirrorMaker at one
Hadoop data center to mirror Kafka data from multiple remote application
data centers. I know one MirrorMaker instance on one physical node can
consume Kafka data from multiple Kafka sources: we can give one instance of
MM multiple consumer config files, so it can consume data from multiple
places.

Another option is to have multiple MirrorMaker instances on one node, each
MM instance dedicated to grabbing data from one single source data center.
Certainly there will be multiple MM nodes to balance the load.

The second option looks better since it provides some isolation between
different data centers.

Any recommendation for this kind of data aggregation case?

Still new to Kafka and MirrorMaker. Any information is welcome.

Thanks,
Mingjie


Error writing to highwatermark file

2015-01-28 Thread wan...@act.buaa.edu.cn
Hi,
I use 2 brokers as a cluster, and write the data to NFS.

I encountered this problem several days after starting the brokers:
FATAL [Replica Manager on Broker 1]: Error writing to highwatermark file:
(kafka.server.ReplicaManager)
java.io.IOException: File rename from
/nfs/trust-machine/machine11/kafka-logs/replication-offset-checkpoint.tmp to
/nfs/trust-machine/machine11/kafka-logs/replication-offset-checkpoint failed.

Why does this happen?
My Kafka version is 0.8.1.

Thanks!

Re: How to mark a message as needing to retry in Kafka?

2015-01-28 Thread noodles
I did not describe my problem clearly. In my case, I got the message from
Kafka, but I could not handle it for some reason, for example because the
external server was down. So I want to mark the message as not yet
consumed, rather than having it count as consumed directly.

2015-01-28 23:26 GMT+08:00 Guozhang Wang wangg...@gmail.com:

 Hi,

 Which consumer are you using? If you are using a high level consumer then
 retry would be automatic upon network exceptions.

 Guozhang

 On Wed, Jan 28, 2015 at 1:32 AM, noodles rungumpth...@gmail.com wrote:

  Hi group:
 
  I'm working for building a webhook notification service based on Kafka. I
  produce all of the payloads into Kafka, and consumers consume these
  payloads by offset.
 
  Sometimes some payloads cannot be consumed because of network exception
 or
  http server exception. So I want to mark the failed payloads and retry
 them
  by timers. But I have no idea if I don't use a storage (like MySQL)
 except
  kafka and zookeeper.
 
 
  --
  *noodles!*
 



 --
 -- Guozhang




-- 
*Yeah, I'm noodles!*


Re: How to mark a message as needing to retry in Kafka?

2015-01-28 Thread Guozhang Wang
I see. If you are using the high-level consumer, once the message is
returned to the application it is considered consumed, and rewinding to a
previously consumed message is currently not supported.

With the new consumer coming in the 0.8.3 release, we have an api for you
to get the offset of each message and do the rewinding based on offsets.
For example, you can do something like



  message = // get one message from consumer

  try {
    // process message
  } catch {
    consumer.seek(message.offset)
  }



Guozhang
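
Fleshed out, that sketch can look like the following against the released
form of the new consumer (the 0.9+ KafkaConsumer, which exposes seek());
the broker address, topic, and deliver() call are illustrative:

  import java.util.Collections;
  import java.util.Properties;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import org.apache.kafka.common.TopicPartition;

  public class SeekOnFailure {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("bootstrap.servers", "localhost:9092"); // illustrative
          props.put("group.id", "webhook-sender");
          props.put("enable.auto.commit", "false"); // don't commit past failures
          props.put("key.deserializer",
              "org.apache.kafka.common.serialization.StringDeserializer");
          props.put("value.deserializer",
              "org.apache.kafka.common.serialization.StringDeserializer");
          KafkaConsumer<String, String> consumer =
              new KafkaConsumer<String, String>(props);
          consumer.subscribe(Collections.singletonList("webhooks"));

          while (true) {
              ConsumerRecords<String, String> records = consumer.poll(1000);
              for (ConsumerRecord<String, String> record : records) {
                  try {
                      deliver(record.value()); // the webhook HTTP call; may throw
                  } catch (Exception e) {
                      // rewind this partition to the failed offset so the
                      // next poll() re-fetches from there
                      consumer.seek(
                          new TopicPartition(record.topic(), record.partition()),
                          record.offset());
                      break;
                  }
              }
          }
      }

      static void deliver(String payload) { /* hypothetical HTTP POST */ }
  }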

On Wed, Jan 28, 2015 at 6:26 PM, noodles rungumpth...@gmail.com wrote:

 I did not describe my problem clearly. In my case, I got the message from
 Kakfa, but I could not handle this message because of some reason, for
 example the external server is down. So I want to mark the message as not
 being consumed directly.

 2015-01-28 23:26 GMT+08:00 Guozhang Wang wangg...@gmail.com:

  Hi,
 
  Which consumer are you using? If you are using a high level consumer then
  retry would be automatic upon network exceptions.
 
  Guozhang
 
  On Wed, Jan 28, 2015 at 1:32 AM, noodles rungumpth...@gmail.com wrote:
 
   Hi group:
  
   I'm working for building a webhook notification service based on
 Kafka. I
   produce all of the payloads into Kafka, and consumers consume these
   payloads by offset.
  
   Sometimes some payloads cannot be consumed because of network exception
  or
   http server exception. So I want to mark the failed payloads and retry
  them
   by timers. But I have no idea if I don't use a storage (like MySQL)
  except
   kafka and zookeeper.
  
  
   --
   *noodles!*
  
 
 
 
  --
  -- Guozhang
 



 --
 *Yeah, I'm noodles!*




-- 
-- Guozhang


Re: How to mark a message as needing to retry in Kafka?

2015-01-28 Thread Christian Csar
noodles,
   Without an external mechanism you won't be able to mark individual
messages/offsets as needing to be retried at a later time. Guozhang is
describing a way to get the offset of a message that's been received so
that you can find it later. You would need to save that into a 'failed
messages' store somewhere else and have code that looks in there to make
retries happen (assuming you want the failure/retry to persist beyond the
lifetime of the process).

Christian
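
A minimal sketch of that pattern, with the failed-message store stubbed out
as an in-memory queue; in practice it would be a durable store (such as the
MySQL table noodles mentions), and the interval and deliver() call are
illustrative:

  import java.util.concurrent.BlockingQueue;
  import java.util.concurrent.Executors;
  import java.util.concurrent.LinkedBlockingQueue;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.TimeUnit;

  public class FailedMessageRetrier {
      // Stand-in for a durable failed-message store, e.g. a table of
      // topic/partition/offset entries or of raw payloads.
      static final BlockingQueue<String> failedStore =
          new LinkedBlockingQueue<String>();

      public static void main(String[] args) {
          ScheduledExecutorService timer =
              Executors.newSingleThreadScheduledExecutor();
          // The timer: periodically drain the store and retry each payload.
          timer.scheduleAtFixedRate(new Runnable() {
              public void run() {
                  String payload;
                  while ((payload = failedStore.poll()) != null) {
                      try {
                          deliver(payload); // retry the webhook call
                      } catch (Exception e) {
                          failedStore.add(payload); // still failing: keep it
                          break; // wait for the next round
                      }
                  }
              }
          }, 30, 30, TimeUnit.SECONDS);
      }

      static void deliver(String payload) { /* hypothetical HTTP POST */ }
  }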


On Wed, Jan 28, 2015 at 7:00 PM, Guozhang Wang wangg...@gmail.com wrote:

 I see. If you are using the high-level consumer, once the message is
 returned to the application it is considered consumed, and current it is
 not supported to re-wind to a previously consumed message.

 With the new consumer coming in 0.8.3 release, we have an api for you to
 get the offset of each message and do the rewinding based on offsets. For
 example, you can do sth. like

 

   message = // get one message from consumer

   try {
 // process message
   } catch {
 consumer.seek(message.offset)
   }

 

 Guozhang

 On Wed, Jan 28, 2015 at 6:26 PM, noodles rungumpth...@gmail.com wrote:

  I did not describe my problem clearly. In my case, I got the message from
  Kakfa, but I could not handle this message because of some reason, for
  example the external server is down. So I want to mark the message as not
  being consumed directly.
 
  2015-01-28 23:26 GMT+08:00 Guozhang Wang wangg...@gmail.com:
 
   Hi,
  
   Which consumer are you using? If you are using a high level consumer
 then
   retry would be automatic upon network exceptions.
  
   Guozhang
  
   On Wed, Jan 28, 2015 at 1:32 AM, noodles rungumpth...@gmail.com
 wrote:
  
Hi group:
   
I'm working for building a webhook notification service based on
  Kafka. I
produce all of the payloads into Kafka, and consumers consume these
payloads by offset.
   
Sometimes some payloads cannot be consumed because of network
 exception
   or
http server exception. So I want to mark the failed payloads and
 retry
   them
by timers. But I have no idea if I don't use a storage (like MySQL)
   except
kafka and zookeeper.
   
   
--
*noodles!*
   
  
  
  
   --
   -- Guozhang
  
 
 
 
  --
  *Yeah, I'm noodles!*
 



 --
 -- Guozhang



Re: Consuming Kafka Messages Inside of EC2 Instances

2015-01-28 Thread Guozhang Wang
Su,

Does this help for your case?

https://cwiki.apache.org/confluence/display/KAFKA/FAQ

Guozhang
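
The FAQ entry most often relevant to this situation concerns the address a
broker advertises to clients; a sketch of the server.properties settings
involved, assuming that is the issue here (the hostname is illustrative):

  # server.properties: the address the broker tells clients to connect to
  advertised.host.name=ec2-54-0-0-1.compute-1.amazonaws.com
  advertised.port=9092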

On Wed, Jan 28, 2015 at 3:36 PM, Dillian Murphey crackshotm...@gmail.com
wrote:

 Am I understanding your question correctly... You're asking how do you
 establish connectivity to an instance in a private subnet from the outside
 world?  Are you thinking in terms of zookeeper or just general aws network
 connectivity?

 On Wed, Jan 28, 2015 at 11:03 AM, Su She suhsheka...@gmail.com wrote:

  Hello All,
 
  I have set up a cluster of EC2 instances using this method:
 
 
 
 http://blogs.aws.amazon.com/bigdata/post/Tx2D0J7QOVRJBRX/Deploying-Cloudera-s-Enterprise-Data-Hub-on-AWS
 
  As you can see the instances are w/in a private subnet. I was wondering
 if
  anyone has any advice on how I can set up a Kafka zookeeper/server on an
  instance that receives messages from a Kafka Producer outside of the
  private subnet. I have tried using the cluster launcher, but I feel like
 it
  is not a best practice and only a temporary situation.
 
  Thank you for the help!
 
  Best,
 
  Su
 




-- 
-- Guozhang


[VOTE] 0.8.2.0 Candidate 3

2015-01-28 Thread Jun Rao
This is the third candidate for release of Apache Kafka 0.8.2.0.

Release Notes for the 0.8.2.0 release
https://people.apache.org/~junrao/kafka-0.8.2.0-candidate3/RELEASE_NOTES.html

*** Please download, test and vote by Saturday, Jan 31, 11:30pm PT

Kafka's KEYS file containing PGP keys we use to sign the release:
http://kafka.apache.org/KEYS in addition to the md5, sha1 and sha2 (SHA256)
checksum.

* Release artifacts to be voted upon (source and binary):
https://people.apache.org/~junrao/kafka-0.8.2.0-candidate3/

* Maven artifacts to be voted upon prior to release:
https://repository.apache.org/content/groups/staging/

* scala-doc
https://people.apache.org/~junrao/kafka-0.8.2.0-candidate3/scaladoc/

* java-doc
https://people.apache.org/~junrao/kafka-0.8.2.0-candidate3/javadoc/

* The tag to be voted upon (off the 0.8.2 branch) is the 0.8.2.0 tag
https://git-wip-us.apache.org/repos/asf?p=kafka.git;a=tag;h=223ac42a7a2a0dab378cc411f4938a9cea1eb7ea
(commit 7130da90a9ee9e6fb4beb2a2a6ab05c06c9bfac4)

/***

Thanks,

Jun


Re: Error writing to highwatermark file

2015-01-28 Thread Guozhang Wang
Hi,

Could you check if the specified file:

/nfs/trust-machine/machine11/kafka-logs/replication-offset-checkpoint.tmp

is not deleted beforehand?

Guozhang


On Tue, Jan 27, 2015 at 10:54 PM, wan...@act.buaa.edu.cn 
wan...@act.buaa.edu.cn wrote:

 Hi,
 I use 2 brokers as a cluster, and write data into nfs.

 I encountered this problem several days after starting the brokers :
 FATAL [Replica Manager on Broker 1]: Error writing to highwatermark file:
 (kafka.server.ReplicaManager)
  java.io.IOException: File rename from
 /nfs/trust-machine/machine11/kafka-logs/replication-offset-checkpoint.tmp
 to /nfs/trust-machine/machine11/kafka-logs/replication-offset-checkpoint
 failed.

 why?
 My kafka version is 0.8.1.

 Thanks!




-- 
-- Guozhang


Re: Consuming Kafka Messages Inside of EC2 Instances

2015-01-28 Thread Su She
Thank you Dillian and Guozhang for the responses.

Yes, Dillian, you are understanding my issue correctly. I am not sure what
the best approach to this is... I'm not sure whether there's a way to
whitelist certain IPs, create a VPC, or use the cluster launcher as the
Kafka zookeeper/broker. I guess this is more of an AWS question, but I
thought it is a problem some Kafka users must have solved already.

Edit: I just tried using the cluster launcher as an intermediary. I started
ZooKeeper and the Kafka server on my cluster launcher and then created a
topic/produced messages. I set up a Kafka consumer on one of my private EC2
instances, but I got a No Route to host error. Pinging from the cluster
launcher to the private instance works fine. I was hoping I could use this
as a temporary solution... any suggestions on this issue would also be
greatly appreciated. Thanks!

Best,

Su


On Wed, Jan 28, 2015 at 9:11 PM, Guozhang Wang wangg...@gmail.com wrote:

 Su,

 Does this help for your case?

 https://cwiki.apache.org/confluence/display/KAFKA/FAQ

 Guozhang

 On Wed, Jan 28, 2015 at 3:36 PM, Dillian Murphey crackshotm...@gmail.com
 wrote:

  Am I understanding your question correctly... You're asking how do you
  establish connectivity to an instance in a private subnet from the
 outside
  world?  Are you thinking in terms of zookeeper or just general aws
 network
  connectivity?
 
  On Wed, Jan 28, 2015 at 11:03 AM, Su She suhsheka...@gmail.com wrote:
 
   Hello All,
  
   I have set up a cluster of EC2 instances using this method:
  
  
  
 
 http://blogs.aws.amazon.com/bigdata/post/Tx2D0J7QOVRJBRX/Deploying-Cloudera-s-Enterprise-Data-Hub-on-AWS
  
   As you can see the instances are w/in a private subnet. I was wondering
  if
   anyone has any advice on how I can set up a Kafka zookeeper/server on
 an
   instance that receives messages from a Kafka Producer outside of the
   private subnet. I have tried using the cluster launcher, but I feel
 like
  it
   is not a best practice and only a temporary situation.
  
   Thank you for the help!
  
   Best,
  
   Su
  
 



 --
 -- Guozhang



Re: Poll: Producer/Consumer impl/language you use?

2015-01-28 Thread Otis Gospodnetic
Hi,

I don't have a good excuse here. :(
I thought about including Scala, but for some reason didn't do it.  I see
12-13% of people chose Other.  Do you think that is because I didn't
include Scala?

Also, is the Scala API reeally going away?

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jan 20, 2015 at 4:43 PM, Koert Kuipers ko...@tresata.com wrote:

 no scala? although scala can indeed use the java api, its ugly we
 prefer to use the scala api (which i believe will go away unfortunately)

 On Tue, Jan 20, 2015 at 2:52 PM, Otis Gospodnetic 
 otis.gospodne...@gmail.com wrote:

  Hi,
 
  I was wondering which implementations/languages people use for their
 Kafka
  Producer/Consumers not everyone is using the Java APIs.  So here's a
  1-question poll:
 
  http://blog.sematext.com/2015/01/20/kafka-poll-producer-consumer-client/
 
  Will share the results in about a week when we have enough votes.
 
  Thanks!
  Otis
  --
  Monitoring * Alerting * Anomaly Detection * Centralized Log Management
  Solr  Elasticsearch Support * http://sematext.com/
 



WARN Error in I/O with NetworkReceive.readFrom(NetworkReceive.java

2015-01-28 Thread Dillian Murphey
Running the performance test. What is the nature of this error?? I'm
running a very high-end cluster on AWS. I tried this even within the same
subnet on AWS.

bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance
topic9 5000 100 -1 acks=1 bootstrap.servers=$IP:9092
buffer.memory=67108864 batch.size=8196



[2015-01-28 16:32:22,178] WARN Error in I/O with /IP ADDR
(org.apache.kafka.common.network.Selector)
java.io.EOFException
at
org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:62)
at org.apache.kafka.common.network.Selector.poll(Selector.java:248)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:192)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:191)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:122)
at java.lang.Thread.run(Thread.java:745)
65910 records sent, 13158.3 records/sec (1.25 MB/sec), 38525.9 ms avg
latency, 39478.0 max latency.

Thanks for any ideas


Resilient Producer

2015-01-28 Thread Fernando O.
Hi all,
I'm evaluating using Kafka.

I liked this thing about Facebook's Scribe: you log to your own machine,
and then there's a separate process that forwards the messages to the
central logger.

With Kafka it seems that I have to embed the publisher in my app and deal
with any communication problems on the producer side.

I googled quite a bit trying to find a project that basically runs a daemon
that tails a log file and sends the lines to the Kafka cluster (something
like a tail file.log, but instead of redirecting the output to the console:
send it to Kafka).

Does anyone know of something like that?


Thanks!
Fernando.
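
A minimal sketch of the kind of daemon Fernando describes, assuming the
0.8.2 Java producer; the file path, topic, and broker address are
illustrative, and this naive tail does not handle log rotation:

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.util.Properties;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.Producer;
  import org.apache.kafka.clients.producer.ProducerRecord;

  public class TailToKafka {
      public static void main(String[] args) throws Exception {
          Properties props = new Properties();
          props.put("bootstrap.servers", "localhost:9092"); // illustrative
          props.put("key.serializer",
              "org.apache.kafka.common.serialization.StringSerializer");
          props.put("value.serializer",
              "org.apache.kafka.common.serialization.StringSerializer");
          Producer<String, String> producer =
              new KafkaProducer<String, String>(props);

          BufferedReader in = new BufferedReader(new FileReader("file.log"));
          while (true) {
              String line = in.readLine();
              if (line == null) { Thread.sleep(200); continue; } // wait for appends
              producer.send(new ProducerRecord<String, String>("logs", line));
          }
      }
  }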


Re: Poll: Producer/Consumer impl/language you use?

2015-01-28 Thread Joe Stein
I kind of look at the Storm, Spark, Samza, etc integrations as
producers/consumers too.

Not sure if that maybe was getting lumped also into other.

I think Jason's 90/10 80/20 70/30 would be found to be typical.

As far as the Scala API goes, I think we should have a wrapper around the
shiny new Java Consumer. Folks I know use things like scalaz streams
https://github.com/scalaz/scalaz-stream which the new consumer can work
nicely with I think. It would be great if we could come up with a new Scala
layer on top of the Java consumer that we release in the project. One of my
engineers is taking a look at that now unless someone is already working on
that? We are using the partition static assignment in the new consumer and
just using Mesos for handling re-balance for us. When he gets further along
and if it makes sense we will shoot a KIP around people can chat about it
on dev.

- Joe Stein

On Wed, Jan 28, 2015 at 10:33 AM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Good point, Jason.  Not sure how we could account for that easily.  But
 maybe that is at least a partial explanation of the Java % being under 50%
 when Java in general is more popular than that...

 Otis
 --
 Monitoring * Alerting * Anomaly Detection * Centralized Log Management
 Solr  Elasticsearch Support * http://sematext.com/


 On Wed, Jan 28, 2015 at 1:22 PM, Jason Rosenberg j...@squareup.com wrote:

  I think the results could be a bit skewed, in cases where an organization
  uses multiple languages, but not equally.  In our case, we overwhelmingly
  use java clients (90%).  But we also have ruby and Go clients too.  But
 in
  the poll, these come out as equally used client languages.
 
  Jason
 
  On Wed, Jan 28, 2015 at 12:05 PM, David McNelis 
  dmcne...@emergingthreats.net wrote:
 
   I agree with Stephen, it would be really unfortunate to see the Scala
 api
   go away.
  
   On Wed, Jan 28, 2015 at 11:57 AM, Stephen Boesch java...@gmail.com
   wrote:
  
The scala API going away would be a minus. As Koert mentioned we
 could
   use
the java api but it is less ..  well .. functional.
   
Kafka is included in the Spark examples and external modules and is
   popular
as a component of ecosystems on Spark (for which scala is the primary
language).
   
2015-01-28 8:51 GMT-08:00 Otis Gospodnetic 
 otis.gospodne...@gmail.com
  :
   
 Hi,

 I don't have a good excuse here. :(
 I thought about including Scala, but for some reason didn't do
 it.  I
   see
 12-13% of people chose Other.  Do you think that is because I
  didn't
 include Scala?

 Also, is the Scala API reeally going away?

 Otis
 --
 Monitoring * Alerting * Anomaly Detection * Centralized Log
  Management
 Solr  Elasticsearch Support * http://sematext.com/


 On Tue, Jan 20, 2015 at 4:43 PM, Koert Kuipers ko...@tresata.com
wrote:

  no scala? although scala can indeed use the java api, its
 ugly
  we
  prefer to use the scala api (which i believe will go away
unfortunately)
 
  On Tue, Jan 20, 2015 at 2:52 PM, Otis Gospodnetic 
  otis.gospodne...@gmail.com wrote:
 
   Hi,
  
   I was wondering which implementations/languages people use for
   their
  Kafka
   Producer/Consumers not everyone is using the Java APIs.  So
here's
 a
   1-question poll:
  
  

  
 http://blog.sematext.com/2015/01/20/kafka-poll-producer-consumer-client/
  
   Will share the results in about a week when we have enough
 votes.
  
   Thanks!
   Otis
   --
   Monitoring * Alerting * Anomaly Detection * Centralized Log
Management
   Solr  Elasticsearch Support * http://sematext.com/
  
 

   
  
 



Re: Resilient Producer

2015-01-28 Thread Gwen Shapira
It sounds like you are describing Flume, with SpoolingDirectory source
(or exec source running tail) and Kafka channel.

On Wed, Jan 28, 2015 at 10:39 AM, Fernando O. fot...@gmail.com wrote:
 Hi all,
 I'm evaluating using Kafka.

 I liked this thing of Facebook scribe that you log to your own machine and
 then there's a separate process that forwards messages to the central
 logger.

 With Kafka it seems that I have to embed the publisher in my app, and deal
 with any communication problem managing that on the producer side.

 I googled quite a bit trying to find a project that would basically use
 daemon that parses a log file and send the lines to the Kafka cluster
 (something like a tail file.log but instead of redirecting the output to
 the console: send it to kafka)

 Does anyone knows about something like that?


 Thanks!
 Fernando.


Re: Poll: Producer/Consumer impl/language you use?

2015-01-28 Thread Otis Gospodnetic
Good point, Jason.  Not sure how we could account for that easily.  But
maybe that is at least a partial explanation of the Java % being under 50%
when Java in general is more popular than that...

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Jan 28, 2015 at 1:22 PM, Jason Rosenberg j...@squareup.com wrote:

 I think the results could be a bit skewed, in cases where an organization
 uses multiple languages, but not equally.  In our case, we overwhelmingly
 use java clients (90%).  But we also have ruby and Go clients too.  But in
 the poll, these come out as equally used client languages.

 Jason

 On Wed, Jan 28, 2015 at 12:05 PM, David McNelis 
 dmcne...@emergingthreats.net wrote:

  I agree with Stephen, it would be really unfortunate to see the Scala api
  go away.
 
  On Wed, Jan 28, 2015 at 11:57 AM, Stephen Boesch java...@gmail.com
  wrote:
 
   The scala API going away would be a minus. As Koert mentioned we could
  use
   the java api but it is less ..  well .. functional.
  
   Kafka is included in the Spark examples and external modules and is
  popular
   as a component of ecosystems on Spark (for which scala is the primary
   language).
  
   2015-01-28 8:51 GMT-08:00 Otis Gospodnetic otis.gospodne...@gmail.com
 :
  
Hi,
   
I don't have a good excuse here. :(
I thought about including Scala, but for some reason didn't do it.  I
  see
12-13% of people chose Other.  Do you think that is because I
 didn't
include Scala?
   
Also, is the Scala API reeally going away?
   
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log
 Management
Solr  Elasticsearch Support * http://sematext.com/
   
   
On Tue, Jan 20, 2015 at 4:43 PM, Koert Kuipers ko...@tresata.com
   wrote:
   
 no scala? although scala can indeed use the java api, its ugly
 we
 prefer to use the scala api (which i believe will go away
   unfortunately)

 On Tue, Jan 20, 2015 at 2:52 PM, Otis Gospodnetic 
 otis.gospodne...@gmail.com wrote:

  Hi,
 
  I was wondering which implementations/languages people use for
  their
 Kafka
  Producer/Consumers not everyone is using the Java APIs.  So
   here's
a
  1-question poll:
 
 
   
  http://blog.sematext.com/2015/01/20/kafka-poll-producer-consumer-client/
 
  Will share the results in about a week when we have enough votes.
 
  Thanks!
  Otis
  --
  Monitoring * Alerting * Anomaly Detection * Centralized Log
   Management
  Solr  Elasticsearch Support * http://sematext.com/
 

   
  
 



Re: Poll: Producer/Consumer impl/language you use?

2015-01-28 Thread Jason Rosenberg
I think the results could be a bit skewed, in cases where an organization
uses multiple languages, but not equally.  In our case, we overwhelmingly
use java clients (90%).  But we also have ruby and Go clients too.  But in
the poll, these come out as equally used client languages.

Jason

On Wed, Jan 28, 2015 at 12:05 PM, David McNelis 
dmcne...@emergingthreats.net wrote:

 I agree with Stephen, it would be really unfortunate to see the Scala api
 go away.

 On Wed, Jan 28, 2015 at 11:57 AM, Stephen Boesch java...@gmail.com
 wrote:

  The scala API going away would be a minus. As Koert mentioned we could
 use
  the java api but it is less ..  well .. functional.
 
  Kafka is included in the Spark examples and external modules and is
 popular
  as a component of ecosystems on Spark (for which scala is the primary
  language).
 
  2015-01-28 8:51 GMT-08:00 Otis Gospodnetic otis.gospodne...@gmail.com:
 
   Hi,
  
   I don't have a good excuse here. :(
   I thought about including Scala, but for some reason didn't do it.  I
 see
   12-13% of people chose Other.  Do you think that is because I didn't
   include Scala?
  
   Also, is the Scala API reeally going away?
  
   Otis
   --
   Monitoring * Alerting * Anomaly Detection * Centralized Log Management
   Solr  Elasticsearch Support * http://sematext.com/
  
  
   On Tue, Jan 20, 2015 at 4:43 PM, Koert Kuipers ko...@tresata.com
  wrote:
  
no scala? although scala can indeed use the java api, its ugly we
prefer to use the scala api (which i believe will go away
  unfortunately)
   
On Tue, Jan 20, 2015 at 2:52 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:
   
 Hi,

 I was wondering which implementations/languages people use for
 their
Kafka
 Producer/Consumers not everyone is using the Java APIs.  So
  here's
   a
 1-question poll:


  
 http://blog.sematext.com/2015/01/20/kafka-poll-producer-consumer-client/

 Will share the results in about a week when we have enough votes.

 Thanks!
 Otis
 --
 Monitoring * Alerting * Anomaly Detection * Centralized Log
  Management
 Solr  Elasticsearch Support * http://sematext.com/

   
  
 



Poll RESULTS: Producer/Consumer languages

2015-01-28 Thread Otis Gospodnetic
Hi,

I promised to share the results of this poll, and here they are:

http://blog.sematext.com/2015/01/28/kafka-poll-results-producer-consumer/

List of surprises is there.  I wonder if anyone else is surprised by any
aspect of the breakdown, or is the breakdown just as you expected?

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


Re: Poll: Producer/Consumer impl/language you use?

2015-01-28 Thread Stephen Boesch
The scala API going away would be a minus. As Koert mentioned we could use
the java api but it is less ..  well .. functional.

Kafka is included in the Spark examples and external modules and is popular
as a component of ecosystems on Spark (for which scala is the primary
language).

2015-01-28 8:51 GMT-08:00 Otis Gospodnetic otis.gospodne...@gmail.com:

 Hi,

 I don't have a good excuse here. :(
 I thought about including Scala, but for some reason didn't do it.  I see
 12-13% of people chose Other.  Do you think that is because I didn't
 include Scala?

 Also, is the Scala API reeally going away?

 Otis
 --
 Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


 On Tue, Jan 20, 2015 at 4:43 PM, Koert Kuipers ko...@tresata.com wrote:

  no scala? although scala can indeed use the java api, its ugly we
  prefer to use the scala api (which i believe will go away unfortunately)
 
  On Tue, Jan 20, 2015 at 2:52 PM, Otis Gospodnetic 
  otis.gospodne...@gmail.com wrote:
 
   Hi,
  
   I was wondering which implementations/languages people use for their
  Kafka
   Producer/Consumers not everyone is using the Java APIs.  So here's
 a
   1-question poll:
  
  
 http://blog.sematext.com/2015/01/20/kafka-poll-producer-consumer-client/
  
   Will share the results in about a week when we have enough votes.
  
   Thanks!
   Otis
   --
   Monitoring * Alerting * Anomaly Detection * Centralized Log Management
   Solr  Elasticsearch Support * http://sematext.com/
  
 



Consuming Kafka Messages Inside of EC2 Instances

2015-01-28 Thread Su She
Hello All,

I have set up a cluster of EC2 instances using this method:

http://blogs.aws.amazon.com/bigdata/post/Tx2D0J7QOVRJBRX/Deploying-Cloudera-s-Enterprise-Data-Hub-on-AWS

As you can see, the instances are within a private subnet. I was wondering
if anyone has any advice on how I can set up a Kafka zookeeper/server on an
instance so that it receives messages from a Kafka producer outside of the
private subnet. I have tried using the cluster launcher, but I feel like it
is not a best practice and only a temporary solution.

Thank you for the help!

Best,

Su


Re: Resilient Producer

2015-01-28 Thread Colin
Logstash

--
Colin Clark 
+1 612 859 6129
Skype colin.p.clark

 On Jan 28, 2015, at 10:47 AM, Gwen Shapira gshap...@cloudera.com wrote:
 
 It sounds like you are describing Flume, with SpoolingDirectory source
 (or exec source running tail) and Kafka channel.
 
 On Wed, Jan 28, 2015 at 10:39 AM, Fernando O. fot...@gmail.com wrote:
 Hi all,
I'm evaluating using Kafka.
 
 I liked this thing of Facebook scribe that you log to your own machine and
 then there's a separate process that forwards messages to the central
 logger.
 
 With Kafka it seems that I have to embed the publisher in my app, and deal
 with any communication problem managing that on the producer side.
 
 I googled quite a bit trying to find a project that would basically use
 daemon that parses a log file and send the lines to the Kafka cluster
 (something like a tail file.log but instead of redirecting the output to
 the console: send it to kafka)
 
 Does anyone knows about something like that?
 
 
 Thanks!
 Fernando.


Re: Resilient Producer

2015-01-28 Thread Magnus Edenhill
The big syslog daemons have supported Kafka for a while now.

rsyslog:
http://www.rsyslog.com/doc/master/configuration/modules/omkafka.html

syslog-ng:
https://czanik.blogs.balabit.com/2015/01/syslog-ng-kafka-destination-support/#more-1013

And Bruce might be of interest as well:
https://github.com/tagged/bruce


On the less daemony and more tooly side of things are:

https://github.com/fsaintjacques/tail-kafka
https://github.com/mguindin/tail-kafka
https://github.com/edenhill/kafkacat


2015-01-28 19:47 GMT+01:00 Gwen Shapira gshap...@cloudera.com:

 It sounds like you are describing Flume, with SpoolingDirectory source
 (or exec source running tail) and Kafka channel.

 On Wed, Jan 28, 2015 at 10:39 AM, Fernando O. fot...@gmail.com wrote:
  Hi all,
  I'm evaluating using Kafka.
 
  I liked this thing of Facebook scribe that you log to your own machine
 and
  then there's a separate process that forwards messages to the central
  logger.
 
  With Kafka it seems that I have to embed the publisher in my app, and
 deal
  with any communication problem managing that on the producer side.
 
  I googled quite a bit trying to find a project that would basically use
  daemon that parses a log file and send the lines to the Kafka cluster
  (something like a tail file.log but instead of redirecting the output to
  the console: send it to kafka)
 
  Does anyone knows about something like that?
 
 
  Thanks!
  Fernando.



Re: Can't create a topic; can't delete it either

2015-01-28 Thread Sumit Rangwala
On Tue, Jan 27, 2015 at 10:54 PM, Joel Koshy jjkosh...@gmail.com wrote:

 Do you still have the controller and state change logs from the time
 you originally tried to delete the topic?


If you can tell me where to find the logs, I can check. I haven't restarted
my brokers since the issue.

Sumit
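
For reference, the broker settings Gwen asks about below live in
server.properties; a sketch with illustrative values:

  # must be set on every broker for topic deletion to take effect
  delete.topic.enable=true
  # auto-creation fails if this exceeds the number of available brokers
  default.replication.factor=3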



 On Tue, Jan 27, 2015 at 03:11:48PM -0800, Sumit Rangwala wrote:
  I am using 0.8.2-beta on the brokers and 0.8.1.1 for the clients (producer
  and consumers). delete.topic.enable=true on all brokers. The replication
  factor is the number of brokers. I see this issue with just one single
  topic; all other topics are fine (creation and deletion). Even after a day
  it is still in the marked for deletion stage. Let me know what other
  information from the brokers or the zookeepers can help me debug this
  issue.
 
  On Tue, Jan 27, 2015 at 9:47 AM, Gwen Shapira gshap...@cloudera.com
 wrote:
 
   Also, do you have delete.topic.enable=true on all brokers?
  
   The automatic topic creation can fail if the default number of
   replicas is greater than number of available brokers. Check the
   default.replication.factor parameter.
  
   Gwen
  
   On Tue, Jan 27, 2015 at 12:29 AM, Joel Koshy jjkosh...@gmail.com
 wrote:
Which version of the broker are you using?
   
On Mon, Jan 26, 2015 at 10:27:14PM -0800, Sumit Rangwala wrote:
While running kafka in production I found an issue where a topic
 wasn't
getting created even with auto topic enabled. I then went ahead and
   created
the topic manually (from the command line). I then delete the topic,
   again
manually. Now my broker won't allow me to either create *the* topic
 or
delete *the* topic. (other topic creation and deletion is working
 fine).
   
The topic is in marked for deletion stage for more than 3 hours.
   
$ bin/kafka-topics.sh --zookeeper zookeeper1:2181/replication/kafka
   --list
--topic GRIFFIN-TldAdFormat.csv-1422321736886
GRIFFIN-TldAdFormat.csv-1422321736886 - marked for deletion
   
If this is a known issue, is there a workaround?
   
Sumit
   
  




Routing modifications at runtime

2015-01-28 Thread Toni Cebrián
Hi,

I'm starting to weigh different alternatives for data ingestion, and
I'd like to know whether Kafka meets the problem I have.
Say we have a set of devices, each with its own MAC, and we receive their
data in Kafka. There is a dictionary defined elsewhere that says to which
topic each MAC must publish. So I basically have 2 questions:
1. New MACs keep coming and the dictionary must be updated accordingly. How
could I change this Kafka behaviour at runtime?
2. A problem for the future: say the dictionaries are so big that they
don't fit in memory. Are there any patterns for bookkeeping internal data
structures and how to route to them?

T.