Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Alonso Isidoro Roman
"Using Spark to query the data in the backend of the web UI?" Don't do that. I would recommend that the Spark Streaming process store data into some NoSQL or SQL database, and that the web UI query data from that database. Alonso Isidoro Roman about.me/alonso.isidoro.roman

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
Ok… so what’s the tricky part? Spark Streaming isn’t real time, so if you don’t mind a slight delay in processing… it would work. The drawback is that you now have a long-running Spark job (assuming under YARN), and that could become a problem in terms of security and resources. (How well does

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
Spark standalone is not Yarn… or secure for that matter… ;-) > On Sep 29, 2016, at 11:18 AM, Cody Koeninger wrote: > > Spark streaming helps with aggregation because > > A. raw kafka consumers have no built in framework for shuffling > amongst nodes, short of writing into

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
OP mentioned HBase or HDFS as persisted storage. Therefore they have to be running YARN if they are considering spark. (Assuming that you’re not trying to do a storage / compute model and use standalone spark outside your cluster. You can, but you have more moving parts…) I never said

Re: How to move a broker out of rotation?

2016-09-29 Thread Praveen
Nice. That has a nice set of functionality. Thanks, I'll take a look. Praveen On Thu, Sep 29, 2016 at 4:07 PM, Todd Palino wrote: > There’s not a good answer for this with just the Kafka tools. We open > sourced the tool that we use for removing brokers and rebalancing

Re: How to move a broker out of rotation?

2016-09-29 Thread Todd Palino
There’s not a good answer for this with just the Kafka tools. We open sourced the tool that we use for removing brokers and rebalancing partitions in a cluster: https://github.com/linkedin/kafka-tools So when we want to remove a broker (with an ID of 1 in this example) from a cluster, we run:
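A minimal sketch of the invocation described above, based on the kafka-tools README: the `remove` module of `kafka-assigner` reassigns all partitions off the given broker onto the remaining brokers. The ZooKeeper address and broker ID here are placeholders.

```shell
# Placeholder cluster details; substitute your own ZooKeeper ensemble
# and the ID of the broker you want drained.
ZOOKEEPER="zookeeper.example.com:2181"
BROKER_ID=1

# kafka-assigner's "remove" module moves every partition off the given
# broker and spreads the replicas over the remaining brokers.
CMD="kafka-assigner -z ${ZOOKEEPER} remove -b ${BROKER_ID}"
echo "${CMD}"
```

Once the reassignment completes, the broker no longer appears in any partition's replica list and can be shut down or sent for repair.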

How to move a broker out of rotation?

2016-09-29 Thread Praveen
I have 16 brokers. One of the brokers (B-16) got completely messed up and was sent for repair. But I can still see some partitions that include B-16 in their replicas, and they have become under-replicated. Is there a proper way to take a broker out of rotation? Praveen

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
Avi, Why did you choose Druid over Postgres / Cassandra / Elasticsearch? On Fri, Sep 30, 2016 at 1:09 AM, Avi Flax wrote: > > > On Sep 29, 2016, at 09:54, Ali Akhtar wrote: > > > > I'd appreciate some thoughts / suggestions on which of these >

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Avi Flax
> On Sep 29, 2016, at 09:54, Ali Akhtar wrote: > > I'd appreciate some thoughts / suggestions on which of these alternatives I > should go with (e.g, using raw Kafka consumers vs Spark for ETL, which > persistent data store to use, and how to query that data store in the >

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
The OP didn't say anything about Yarn, and why are you contemplating putting Kafka or Spark on public networks to begin with? Gwen's right, absent any actual requirements this is kind of pointless. On Thu, Sep 29, 2016 at 1:27 PM, Michael Segel wrote: > Spark

Re: librdkafka

2016-09-29 Thread Magnus Edenhill
Hey Dave, yes that's the general plan. Regards, Magnus 2016-09-29 19:33 GMT+02:00 Tauzell, Dave : > Does anybody know if the librdkafka releases are kept in step with kafka > releases? > > -Dave

librdkafka

2016-09-29 Thread Tauzell, Dave
Does anybody know if the librdkafka releases are kept in step with kafka releases? -Dave

rack aware consumer

2016-09-29 Thread Ezra Stuetzel
Hi, In Kafka 0.10, is there a way to configure the consumer such that it is rack aware? We replicate data across all our 'racks' and want consumers to choose brokers that are rack-local whenever possible. Our configured racks are actually in different datacenters, so there is much higher network

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Gwen Shapira
The original post made no mention of throughput, latency, or correctness requirements, so pretty much any data store will fit the bill... discussions of "what is better" degrade fast when there are no concrete standards to choose between. Who cares about anything when we don't know what we need?

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
> I still don't understand why writing to a transactional database with locking > and concurrency (read and writes) through JDBC will be fast for this sort of > data ingestion. Who cares about fast if your data is wrong? And it's still plenty fast enough

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
Spark streaming helps with aggregation because A. raw kafka consumers have no built in framework for shuffling amongst nodes, short of writing into an intermediate topic (I'm not touching Kafka Streams here, I don't have experience), and B. it deals with batches, so you can transactionally

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
The way I see this, there are three things involved. 1. Data ingestion from source into Kafka 2. Data conversion and storage (ETL/ELT) 3. Presentation Item 2 is the one that needs to be designed correctly. I presume raw data has to conform to some form of MDM that requires schema mapping

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
The business use case is to read a user's data from a variety of different services through their API, and then allowing the user to query that data, on a per service basis, as well as an aggregate across all services. The way I'm considering doing it, is to do some basic ETL (drop all the

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
No, direct stream in and of itself won't ensure an end-to-end guarantee, because it doesn't know anything about your output actions. You still need to do some work. The point is having easy access to offsets for batches on a per-partition basis makes it easier to do that work, especially in
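The "work" Cody describes is typically committing results and Kafka offsets in a single transaction, so a replayed batch after a failure can be detected and skipped. A minimal sketch of that pattern, with sqlite3 standing in for Postgres; the table and column names are made up for illustration.

```python
import sqlite3

# One transaction covers both the data and the offset bookkeeping, so a
# batch is either fully applied (and its offset range recorded) or not
# applied at all. sqlite3 stands in for Postgres here.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (k TEXT PRIMARY KEY, v TEXT)")
db.execute("CREATE TABLE offsets (topic TEXT, part INTEGER, "
           "next_offset INTEGER, PRIMARY KEY (topic, part))")

def write_batch(records, topic, part, from_offset, until_offset):
    """Apply a micro-batch atomically together with its offset range."""
    row = db.execute(
        "SELECT next_offset FROM offsets WHERE topic=? AND part=?",
        (topic, part)).fetchone()
    if row and row[0] != from_offset:
        return False  # batch already processed (replay after failure)
    with db:  # single transaction: results + offsets
        db.executemany("INSERT OR REPLACE INTO events VALUES (?, ?)", records)
        db.execute("INSERT OR REPLACE INTO offsets VALUES (?, ?, ?)",
                   (topic, part, until_offset))
    return True

# Replaying the same offset range takes effect only once.
assert write_batch([("a", "1")], "t", 0, 0, 1) is True
assert write_batch([("a", "1")], "t", 0, 0, 1) is False
```

The per-partition offset ranges the direct stream exposes are exactly what `from_offset` / `until_offset` would come from in a real job.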

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
Hi Ali, What is the business use case for this? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
If you use Spark direct streams, it ensures an end-to-end guarantee for messages. On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar wrote: > My concern with Postgres / Cassandra is only scalability. I will look > further into Postgres horizontal scaling, thanks. > > Writes could

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
If you're doing any kind of pre-aggregation during ETL, spark direct stream will let you more easily get the delivery semantics you need, especially if you're using a transactional data store. If you're literally just copying individual uniquely keyed items from kafka to a key-value store, use

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
Yes, but these writes from Spark still have to go through JDBC? Correct. Having said that, I don't see how doing this through Spark Streaming to Postgres is going to be faster than source -> Kafka -> Flume via ZooKeeper -> HDFS. I believe there is direct streaming from Kafka to Hive as well, and

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
My concern with Postgres / Cassandra is only scalability. I will look further into Postgres horizontal scaling, thanks. Writes could be idempotent if done as upserts, otherwise updates will be idempotent but not inserts. Data should not be lost. The system should be as fault tolerant as

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
I wouldn't give up the flexibility and maturity of a relational database, unless you have a very specific use case. I'm not trashing cassandra, I've used cassandra, but if all I know is that you're doing analytics, I wouldn't want to give up the ability to easily do ad-hoc aggregations without a

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
Hi Cody, Spark direct stream is just fine for this use case. But why Postgres and not Cassandra? Is there anything specific here that I may not be aware of? Thanks Deepak On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger wrote: > How are you going to handle etl failures? Do you

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
Is there an advantage to that vs directly consuming from Kafka? Nothing is being done to the data except some light ETL and then storing it in Cassandra On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma wrote: > Its better you use spark's direct stream to ingest from kafka.

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
How are you going to handle etl failures? Do you care about lost / duplicated data? Are your writes idempotent? Absent any other information about the problem, I'd stay away from cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream feeding postgres. On Thu, Sep 29, 2016 at 10:04

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
It's better to use Spark's direct stream to ingest from Kafka. On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar wrote: > I don't think I need a different speed storage and batch storage. Just > taking in raw data from Kafka, standardizing, and storing it somewhere > where the

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
Since the inflow is huge, Flume would also need to be run with multiple channels in distributed fashion, and in that case the resource utilization will be high as well. Thanks Deepak On Thu, Sep 29, 2016 at 8:11 PM, Mich Talebzadeh wrote: > - Spark

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
I don't think I need a different speed storage and batch storage. Just taking in raw data from Kafka, standardizing, and storing it somewhere where the web UI can query it, seems like it will be enough. I'm thinking about: - Reading data from Kafka via Spark Streaming - Standardizing, then

Re: To find the Lag of consumer offset using kafka client library

2016-09-29 Thread Gourab Chowdhury
Thanks for your suggestion; I had previously read about the Yahoo Kafka monitor, as suggested somewhere. What I actually need is a function/class in the Kafka Java library (if any) that helps to find the lag and other details. Can kafka.admin help in this matter? A code snippet equivalent to:-
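Per partition, lag is just the log-end offset minus the consumer's committed offset; with the Java client, `KafkaConsumer.endOffsets()` and `committed()` can supply those two numbers. A sketch of the arithmetic itself, with made-up offset values:

```python
# Lag per partition = log-end offset - committed consumer offset.
# The offset values below are invented for illustration; in a real
# client they come from the broker.
def consumer_lag(end_offsets, committed):
    # Partitions with no committed offset are treated as lagging from 0.
    return {tp: end_offsets[tp] - committed.get(tp, 0)
            for tp in end_offsets}

end_offsets = {("demo", 0): 1500, ("demo", 1): 980}
committed   = {("demo", 0): 1480, ("demo", 1): 980}

lags = consumer_lag(end_offsets, committed)
assert lags == {("demo", 0): 20, ("demo", 1): 0}
```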

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
- Spark Streaming to read data from Kafka - Storing the data on HDFS using Flume You don't need Spark streaming to read data from Kafka and store on HDFS. It is a waste of resources. Couple Flume to use Kafka as source and HDFS as sink directly KafkaAgent.sources = kafka-sources
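A sketch of the Kafka-source-to-HDFS-sink wiring that config line begins, in the style of a Flume 1.6-era agent file. All names, addresses, and paths here are placeholders, not part of the original message.

```properties
# Hypothetical Flume agent piping Kafka straight into HDFS
# (Flume 1.6-style Kafka source; names and paths are examples only).
KafkaAgent.sources  = kafka-sources
KafkaAgent.channels = mem-channel
KafkaAgent.sinks    = hdfs-sink

KafkaAgent.sources.kafka-sources.type = org.apache.flume.source.kafka.KafkaSource
KafkaAgent.sources.kafka-sources.zookeeperConnect = localhost:2181
KafkaAgent.sources.kafka-sources.topic = raw-events
KafkaAgent.sources.kafka-sources.channels = mem-channel

KafkaAgent.channels.mem-channel.type = memory

KafkaAgent.sinks.hdfs-sink.type = hdfs
KafkaAgent.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/kafka/%Y-%m-%d
KafkaAgent.sinks.hdfs-sink.channel = mem-channel
```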

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
For the UI, you need a DB such as Cassandra that is designed to serve queries. Ingest the data into Spark Streaming (speed layer) and write to HDFS (for the batch layer). Now you have data at rest as well as in motion (real time). From Spark Streaming itself, do further processing and write the final

RE: Architecture recommendations for a tricky use case

2016-09-29 Thread Tauzell, Dave
Spark Streaming needs to store the output somewhere. Cassandra is a possible target for that. -Dave -Original Message- From: Ali Akhtar [mailto:ali.rac...@gmail.com] Sent: Thursday, September 29, 2016 9:16 AM Cc: users@kafka.apache.org; spark users Subject: Re: Architecture

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
The web UI is actually the speed layer, it needs to be able to query the data online, and show the results in real-time. It also needs a custom front-end, so a system like Tableau can't be used, it must have a custom backend + front-end. Thanks for the recommendation of Flume. Do you think this

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
What is the message inflow? If it's really high, definitely Spark will be of great use. Thanks Deepak On Sep 29, 2016 19:24, "Ali Akhtar" wrote: > I have a somewhat tricky use case, and I'm looking for ideas. > > I have 5-6 Kafka producers, reading various APIs, and

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
You need a batch layer and a speed layer. Data from Kafka can be stored on HDFS using Flume. - Query this data to generate reports / analytics (There will be a web UI which will be the front-end to the data, and will show the reports) This is basically the batch layer, and you need something like

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
It needs to be able to scale to a very large amount of data, yes. On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma wrote: > What is the message inflow ? > If it's really high , definitely spark will be of great use . > > Thanks > Deepak > > On Sep 29, 2016 19:24, "Ali

Re: To find the Lag of consumer offset using kafka client library

2016-09-29 Thread Jan Omar
Hi Gourab, Check this out: https://github.com/linkedin/Burrow Regards Jan > On 29 Sep 2016, at 15:47, Gourab Chowdhury wrote: > > I can get the *Lag* of offsets with the following command:- > > bin/kafka-run-class.sh

Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
I have a somewhat tricky use case, and I'm looking for ideas. I have 5-6 Kafka producers, reading various APIs, and writing their raw data into Kafka. I need to: - Do ETL on the data, and standardize it. - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / ElasticSearch /

To find the Lag of consumer offset using kafka client library

2016-09-29 Thread Gourab Chowdhury
I can get the *Lag* of offsets with the following command:- bin/kafka-run-class.sh kafka.admin.ConsumerGroupCommand --zookeeper localhost:2182 --describe --group DemoConsumer I am trying to find code that uses the Kafka library to find the *Lag* of consumer offsets. Also, is there any other

500ms FetchConsumer RemoteTimeMs

2016-09-29 Thread Peter Sinoros Szabo
Hi, I am setting up metrics monitoring on a new Kafka 0.10.0.1 cluster and observed the following: - the 99th percentile of kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce is 18.71ms (remote = 15.71ms; response send, local, and queue times are each about 1ms), so far this seems to be

RE: producer can't push msg sometimes with 1 broker recoved

2016-09-29 Thread FEI Aggie
Kamal, Thanks very much for your testing. I also tried the script (kafka-console-producer.sh) provided by Kafka and found it does work in this situation. The original testing I did was with a test program we wrote ourselves. I'll try to find the difference. Thanks for your help! Regards, Aggie