Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Alonso Isidoro Roman
"Using Spark to query the data in the backend of the web UI?" Don't do that. I would recommend that the Spark Streaming process store data into some NoSQL or SQL database, and that the web UI query data from that database. Alonso Isidoro Roman about.me/alonso.isidoro.roman

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
Ok… so what’s the tricky part? Spark Streaming isn’t real time, so if you don’t mind a slight delay in processing… it would work. The drawback is that you now have a long-running Spark job (assuming under YARN), and that could become a problem in terms of security and resources. (How well does

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
Spark standalone is not Yarn… or secure for that matter… ;-) > On Sep 29, 2016, at 11:18 AM, Cody Koeninger wrote: > > Spark streaming helps with aggregation because > > A. raw kafka consumers have no built in framework for shuffling > amongst nodes, short of writing into

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
OP mentioned HBase or HDFS as persisted storage. Therefore they have to be running YARN if they are considering spark. (Assuming that you’re not trying to do a storage / compute model and use standalone spark outside your cluster. You can, but you have more moving parts…) I never said

Re: How to move a broker out of rotation?

2016-09-29 Thread Praveen
Nice. That has a nice set of functionality. Thanks, I'll take a look. Praveen On Thu, Sep 29, 2016 at 4:07 PM, Todd Palino wrote: > There’s not a good answer for this with just the Kafka tools. We open > sourced the tool that we use for removing brokers and rebalancing

Re: How to move a broker out of rotation?

2016-09-29 Thread Todd Palino
There’s not a good answer for this with just the Kafka tools. We open sourced the tool that we use for removing brokers and rebalancing partitions in a cluster: https://github.com/linkedin/kafka-tools So when we want to remove a broker (with an ID of 1 in this example) from a cluster, we run:
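A minimal sketch of the invocation described above, based on the kafka-tools README: the `remove` module of `kafka-assigner` reassigns all partitions off the given broker onto the remaining brokers. The ZooKeeper address and broker ID here are placeholders.

```shell
# Placeholder cluster details; substitute your own ZooKeeper ensemble
# and the ID of the broker you want drained.
ZOOKEEPER="zookeeper.example.com:2181"
BROKER_ID=1

# kafka-assigner's "remove" module moves every partition off the given
# broker and spreads the replicas over the remaining brokers.
CMD="kafka-assigner -z ${ZOOKEEPER} remove -b ${BROKER_ID}"
echo "${CMD}"
```

Once the reassignment completes, the broker no longer appears in any partition's replica list and can be shut down or sent for repair.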

How to move a broker out of rotation?

2016-09-29 Thread Praveen
I have 16 brokers. One of the brokers (B-16) got completely messed up and was sent for repair. But I can still see some partitions that include B-16 in their replicas, and they have become under-replicated. Is there a proper way to take a broker out of rotation? Praveen

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
Avi, Why did you choose Druid over Postgres / Cassandra / Elasticsearch? On Fri, Sep 30, 2016 at 1:09 AM, Avi Flax wrote: > > > On Sep 29, 2016, at 09:54, Ali Akhtar wrote: > > > > I'd appreciate some thoughts / suggestions on which of these >

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Avi Flax
> On Sep 29, 2016, at 09:54, Ali Akhtar wrote: > > I'd appreciate some thoughts / suggestions on which of these alternatives I > should go with (e.g, using raw Kafka consumers vs Spark for ETL, which > persistent data store to use, and how to query that data store in the >

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
The OP didn't say anything about Yarn, and why are you contemplating putting Kafka or Spark on public networks to begin with? Gwen's right, absent any actual requirements this is kind of pointless. On Thu, Sep 29, 2016 at 1:27 PM, Michael Segel wrote: > Spark

Re: librdkafka

2016-09-29 Thread Magnus Edenhill
Hey Dave, yes that's the general plan. Regards, Magnus 2016-09-29 19:33 GMT+02:00 Tauzell, Dave : > Does anybody know if the librdkafka releases are kept in step with kafka > releases? > > -Dave

librdkafka

2016-09-29 Thread Tauzell, Dave
Does anybody know if the librdkafka releases are kept in step with kafka releases? -Dave

rack aware consumer

2016-09-29 Thread Ezra Stuetzel
Hi, In Kafka 0.10, is there a way to configure the consumer such that it is rack aware? We replicate data across all our 'racks' and want consumers to choose brokers that are rack-local whenever possible. Our configured racks are actually in different datacenters, so there is much higher network

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Gwen Shapira
The original post made no mention of throughput, latency, or correctness requirements, so pretty much any data store will fit the bill... discussions of "what is better" degrade fast when there are no concrete standards to choose between. Who cares about anything when we don't know what we need?

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
> I still don't understand why writing to a transactional database with locking > and concurrency (read and writes) through JDBC will be fast for this sort of > data ingestion. Who cares about fast if your data is wrong? And it's still plenty fast enough

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
Spark streaming helps with aggregation because A. raw kafka consumers have no built in framework for shuffling amongst nodes, short of writing into an intermediate topic (I'm not touching Kafka Streams here, I don't have experience), and B. it deals with batches, so you can transactionally

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
The way I see this, there are three things involved. 1. Data ingestion from source into Kafka 2. Data conversion and storage (ETL/ELT) 3. Presentation Item 2 is the one that needs to be designed correctly. I presume raw data has to conform to some form of MDM that requires schema mapping

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
The business use case is to read a user's data from a variety of different services through their API, and then allowing the user to query that data, on a per service basis, as well as an aggregate across all services. The way I'm considering doing it, is to do some basic ETL (drop all the

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
No, direct stream in and of itself won't ensure an end-to-end guarantee, because it doesn't know anything about your output actions. You still need to do some work. The point is having easy access to offsets for batches on a per-partition basis makes it easier to do that work, especially in
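The "work" Cody describes is typically committing results and Kafka offsets in a single transaction, so a replayed batch after a failure can be detected and skipped. A minimal sketch of that pattern, with sqlite3 standing in for Postgres; the table and column names are made up for illustration.

```python
import sqlite3

# One transaction covers both the data and the offset bookkeeping, so a
# batch is either fully applied (and its offset range recorded) or not
# applied at all. sqlite3 stands in for Postgres here.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (k TEXT PRIMARY KEY, v TEXT)")
db.execute("CREATE TABLE offsets (topic TEXT, part INTEGER, "
           "next_offset INTEGER, PRIMARY KEY (topic, part))")

def write_batch(records, topic, part, from_offset, until_offset):
    """Apply a micro-batch atomically together with its offset range."""
    row = db.execute(
        "SELECT next_offset FROM offsets WHERE topic=? AND part=?",
        (topic, part)).fetchone()
    if row and row[0] != from_offset:
        return False  # batch already processed (replay after failure)
    with db:  # single transaction: results + offsets
        db.executemany("INSERT OR REPLACE INTO events VALUES (?, ?)", records)
        db.execute("INSERT OR REPLACE INTO offsets VALUES (?, ?, ?)",
                   (topic, part, until_offset))
    return True

# Replaying the same offset range takes effect only once.
assert write_batch([("a", "1")], "t", 0, 0, 1) is True
assert write_batch([("a", "1")], "t", 0, 0, 1) is False
```

The per-partition offset ranges the direct stream exposes are exactly what `from_offset` / `until_offset` would come from in a real job.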

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
Hi Ali, What is the business use case for this? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
If you use Spark direct streams, it ensures an end-to-end guarantee for messages. On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar wrote: > My concern with Postgres / Cassandra is only scalability. I will look > further into Postgres horizontal scaling, thanks. > > Writes could

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
If you're doing any kind of pre-aggregation during ETL, spark direct stream will let you more easily get the delivery semantics you need, especially if you're using a transactional data store. If you're literally just copying individual uniquely keyed items from kafka to a key-value store, use

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
Yes, but these writes from Spark still have to go through JDBC? Correct. Having said that, I don't see how doing this through Spark Streaming to Postgres is going to be faster than source -> Kafka -> Flume via ZooKeeper -> HDFS. I believe there is direct streaming from Kafka to Hive as well, and

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
My concern with Postgres / Cassandra is only scalability. I will look further into Postgres horizontal scaling, thanks. Writes could be idempotent if done as upserts, otherwise updates will be idempotent but not inserts. Data should not be lost. The system should be as fault tolerant as

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
I wouldn't give up the flexibility and maturity of a relational database, unless you have a very specific use case. I'm not trashing cassandra, I've used cassandra, but if all I know is that you're doing analytics, I wouldn't want to give up the ability to easily do ad-hoc aggregations without a

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
Hi Cody, Spark direct stream is just fine for this use case. But why Postgres and not Cassandra? Is there anything specific here that I may not be aware of? Thanks Deepak On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger wrote: > How are you going to handle etl failures? Do you

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
Is there an advantage to that vs directly consuming from Kafka? Nothing is being done to the data except some light ETL and then storing it in Cassandra On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma wrote: > Its better you use spark's direct stream to ingest from kafka.

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
How are you going to handle etl failures? Do you care about lost / duplicated data? Are your writes idempotent? Absent any other information about the problem, I'd stay away from cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream feeding postgres. On Thu, Sep 29, 2016 at 10:04

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
It's better to use Spark's direct stream to ingest from Kafka. On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar wrote: > I don't think I need a different speed storage and batch storage. Just > taking in raw data from Kafka, standardizing, and storing it somewhere > where the

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
Since the inflow is huge, Flume would also need to be run with multiple channels in distributed fashion, and in that case the resource utilization will be high as well. Thanks Deepak On Thu, Sep 29, 2016 at 8:11 PM, Mich Talebzadeh wrote: > - Spark

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
I don't think I need a different speed storage and batch storage. Just taking in raw data from Kafka, standardizing, and storing it somewhere where the web UI can query it, seems like it will be enough. I'm thinking about: - Reading data from Kafka via Spark Streaming - Standardizing, then

Re: To find the Lag of consumer offset using kafka client library

2016-09-29 Thread Gourab Chowdhury
Thanks for your suggestion; I had previously read about the Yahoo Kafka monitor, as suggested somewhere. What I actually need is a function/class in the Kafka Java library (if any) that helps to find the lag and other details. Can kafka.admin help in this matter? A code snippet equivalent to:-
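Per partition, lag is just the log-end offset minus the consumer's committed offset; with the Java client, `KafkaConsumer.endOffsets()` and `committed()` can supply those two numbers. A sketch of the arithmetic itself, with made-up offset values:

```python
# Lag per partition = log-end offset - committed consumer offset.
# The offset values below are invented for illustration; in a real
# client they come from the broker.
def consumer_lag(end_offsets, committed):
    # Partitions with no committed offset are treated as lagging from 0.
    return {tp: end_offsets[tp] - committed.get(tp, 0)
            for tp in end_offsets}

end_offsets = {("demo", 0): 1500, ("demo", 1): 980}
committed   = {("demo", 0): 1480, ("demo", 1): 980}

lags = consumer_lag(end_offsets, committed)
assert lags == {("demo", 0): 20, ("demo", 1): 0}
```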

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
- Spark Streaming to read data from Kafka - Storing the data on HDFS using Flume You don't need Spark streaming to read data from Kafka and store on HDFS. It is a waste of resources. Couple Flume to use Kafka as source and HDFS as sink directly KafkaAgent.sources = kafka-sources
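A sketch of the Kafka-source-to-HDFS-sink wiring that config line begins, in the style of a Flume 1.6-era agent file. All names, addresses, and paths here are placeholders, not part of the original message.

```properties
# Hypothetical Flume agent piping Kafka straight into HDFS
# (Flume 1.6-style Kafka source; names and paths are examples only).
KafkaAgent.sources  = kafka-sources
KafkaAgent.channels = mem-channel
KafkaAgent.sinks    = hdfs-sink

KafkaAgent.sources.kafka-sources.type = org.apache.flume.source.kafka.KafkaSource
KafkaAgent.sources.kafka-sources.zookeeperConnect = localhost:2181
KafkaAgent.sources.kafka-sources.topic = raw-events
KafkaAgent.sources.kafka-sources.channels = mem-channel

KafkaAgent.channels.mem-channel.type = memory

KafkaAgent.sinks.hdfs-sink.type = hdfs
KafkaAgent.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/kafka/%Y-%m-%d
KafkaAgent.sinks.hdfs-sink.channel = mem-channel
```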

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
For the UI, you need a DB such as Cassandra that is designed to serve queries. Ingest the data into Spark Streaming (speed layer) and write to HDFS (for the batch layer). Now you have data at rest as well as in motion (real time). From Spark Streaming itself, do further processing and write the final

RE: Architecture recommendations for a tricky use case

2016-09-29 Thread Tauzell, Dave
Spark Streaming needs to store the output somewhere. Cassandra is a possible target for that. -Dave -Original Message- From: Ali Akhtar [mailto:ali.rac...@gmail.com] Sent: Thursday, September 29, 2016 9:16 AM Cc: users@kafka.apache.org; spark users Subject: Re: Architecture

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
The web UI is actually the speed layer, it needs to be able to query the data online, and show the results in real-time. It also needs a custom front-end, so a system like Tableau can't be used, it must have a custom backend + front-end. Thanks for the recommendation of Flume. Do you think this

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
What is the message inflow? If it's really high, definitely Spark will be of great use. Thanks Deepak On Sep 29, 2016 19:24, "Ali Akhtar" wrote: > I have a somewhat tricky use case, and I'm looking for ideas. > > I have 5-6 Kafka producers, reading various APIs, and

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
You need a batch layer and a speed layer. Data from Kafka can be stored on HDFS using Flume. - Query this data to generate reports / analytics (There will be a web UI which will be the front-end to the data, and will show the reports) This is basically the batch layer, and you need something like

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
It needs to be able to scale to a very large amount of data, yes. On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma wrote: > What is the message inflow ? > If it's really high , definitely spark will be of great use . > > Thanks > Deepak > > On Sep 29, 2016 19:24, "Ali

Re: To find the Lag of consumer offset using kafka client library

2016-09-29 Thread Jan Omar
Hi Gourab, Check this out: https://github.com/linkedin/Burrow Regards Jan > On 29 Sep 2016, at 15:47, Gourab Chowdhury wrote: > > I can get the *Lag* of offsets with the following command:- > > bin/kafka-run-class.sh

Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
I have a somewhat tricky use case, and I'm looking for ideas. I have 5-6 Kafka producers, reading various APIs, and writing their raw data into Kafka. I need to: - Do ETL on the data, and standardize it. - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / ElasticSearch /

To find the Lag of consumer offset using kafka client library

2016-09-29 Thread Gourab Chowdhury
I can get the *Lag* of offsets with the following command:- bin/kafka-run-class.sh kafka.admin.ConsumerGroupCommand --zookeeper localhost:2182 --describe --group DemoConsumer I am trying to find code that uses the Kafka library to find the *Lag* of consumer offsets. Also, is there any other

500ms FetchConsumer RemoteTimeMs

2016-09-29 Thread Peter Sinoros Szabo
Hi, I am setting up metrics monitoring on a new Kafka 0.10.0.1 cluster and observed the following: - the 99th percentile of kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce is 18.71ms (remote = 15.71ms; response send, local, and queue times are each about 1ms), so far this seems to be

RE: producer can't push msg sometimes with 1 broker recoved

2016-09-29 Thread FEI Aggie
Kamal, Thanks very much for your testing. I also tried the script (kafka-console-producer.sh) provided by Kafka and found it does work in this situation. The original testing I did was with a test program we wrote ourselves. I'll try to find the difference. Thanks for your help! Regards, Aggie