Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
The OP mentioned HBase or HDFS as persisted storage. Therefore they have to be 
running YARN if they are considering Spark. 
(Assuming that you’re not trying to do a separate storage/compute model and run 
standalone Spark outside your cluster. You can, but you have more moving 
parts…) 

I never said anything about putting something on a public network. I mentioned 
running a secured cluster.
You don’t deal with PII or other regulated data, do you? 


If you read my original post, you are correct that we don’t have a lot of, if 
any, real information. 
Based on what the OP said, there are design considerations, since every tool he 
mentioned has pluses and minuses. The problem isn’t really that challenging 
unless you have something extraordinary, like high velocity or some other 
constraint. 

BTW, depending on scale and velocity… your relational engines may become 
problematic. 
HTH

-Mike


> On Sep 29, 2016, at 1:51 PM, Cody Koeninger  wrote:
> 
> The OP didn't say anything about Yarn, and why are you contemplating
> putting Kafka or Spark on public networks to begin with?
> 
> Gwen's right, absent any actual requirements this is kind of pointless.
> 
> On Thu, Sep 29, 2016 at 1:27 PM, Michael Segel
>  wrote:
>> Spark standalone is not Yarn… or secure for that matter… ;-)
>> 
>>> On Sep 29, 2016, at 11:18 AM, Cody Koeninger  wrote:
>>> 
>>> Spark streaming helps with aggregation because
>>> 
>>> A. raw kafka consumers have no built in framework for shuffling
>>> amongst nodes, short of writing into an intermediate topic (I'm not
>>> touching Kafka Streams here, I don't have experience), and
>>> 
>>> B. it deals with batches, so you can transactionally decide to commit
>>> or rollback your aggregate data and your offsets.  Otherwise your
>>> offsets and data store can get out of sync, leading to lost /
>>> duplicate data.
>>> 
>>> Regarding long running spark jobs, I have streaming jobs in the
>>> standalone manager that have been running for 6 months or more.
>>> 
>>> On Thu, Sep 29, 2016 at 11:01 AM, Michael Segel
>>>  wrote:
 Ok… so what’s the tricky part?
 Spark Streaming isn’t real time so if you don’t mind a slight delay in 
 processing… it would work.
 
 The drawback is that you now have a long running Spark Job (assuming under 
 YARN) and that could become a problem in terms of security and resources.
 (How well does Yarn handle long running jobs these days in a secured 
 Cluster? Steve L. may have some insight… )
 
 Raw HDFS would become a problem because Apache HDFS is still WORM (write 
 once, read many). (Do you want to write your own compaction code? Or use 
 Hive 1.x+?)
 
 HBase? Depending on your admin… stability could be a problem.
 Cassandra? That would be a separate cluster and that in itself could be a 
 problem…
 
 YMMV so you need to address the pros/cons of each tool specific to your 
 environment and skill level.
 
 HTH
 
 -Mike
 
> On Sep 29, 2016, at 8:54 AM, Ali Akhtar  wrote:
> 
> I have a somewhat tricky use case, and I'm looking for ideas.
> 
> I have 5-6 Kafka producers, reading various APIs, and writing their raw 
> data into Kafka.
> 
> I need to:
> 
> - Do ETL on the data, and standardize it.
> 
> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / 
> ElasticSearch / Postgres)
> 
> - Query this data to generate reports / analytics (There will be a web UI 
> which will be the front-end to the data, and will show the reports)
> 
> Java is being used as the backend language for everything (backend of the 
> web UI, as well as the ETL layer)
> 
> I'm considering:
> 
> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer 
> (receive raw data from Kafka, standardize & store it)
> 
> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data, 
> and to allow queries
> 
> - In the backend of the web UI, I could either use Spark to run queries 
> across the data (mostly filters), or directly run queries against 
> Cassandra / HBase
> 
> I'd appreciate some thoughts / suggestions on which of these alternatives 
> I should go with (e.g, using raw Kafka consumers vs Spark for ETL, which 
> persistent data store to use, and how to query that data store in the 
> backend of the web UI, for displaying the reports).
> 
> 
> Thanks.
 
>> 





Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
The OP didn't say anything about Yarn, and why are you contemplating
putting Kafka or Spark on public networks to begin with?

Gwen's right, absent any actual requirements this is kind of pointless.

On Thu, Sep 29, 2016 at 1:27 PM, Michael Segel
 wrote:
> Spark standalone is not Yarn… or secure for that matter… ;-)
>
>> On Sep 29, 2016, at 11:18 AM, Cody Koeninger  wrote:
>>
>> Spark streaming helps with aggregation because
>>
>> A. raw kafka consumers have no built in framework for shuffling
>> amongst nodes, short of writing into an intermediate topic (I'm not
>> touching Kafka Streams here, I don't have experience), and
>>
>> B. it deals with batches, so you can transactionally decide to commit
>> or rollback your aggregate data and your offsets.  Otherwise your
>> offsets and data store can get out of sync, leading to lost /
>> duplicate data.
>>
>> Regarding long running spark jobs, I have streaming jobs in the
>> standalone manager that have been running for 6 months or more.
>>
>> On Thu, Sep 29, 2016 at 11:01 AM, Michael Segel
>>  wrote:
>>> Ok… so what’s the tricky part?
>>> Spark Streaming isn’t real time so if you don’t mind a slight delay in 
>>> processing… it would work.
>>>
>>> The drawback is that you now have a long running Spark Job (assuming under 
>>> YARN) and that could become a problem in terms of security and resources.
>>> (How well does Yarn handle long running jobs these days in a secured 
>>> Cluster? Steve L. may have some insight… )
>>>
>>> Raw HDFS would become a problem because Apache HDFS is still WORM (write 
>>> once, read many). (Do you want to write your own compaction code? Or use 
>>> Hive 1.x+?)
>>>
>>> HBase? Depending on your admin… stability could be a problem.
>>> Cassandra? That would be a separate cluster and that in itself could be a 
>>> problem…
>>>
>>> YMMV so you need to address the pros/cons of each tool specific to your 
>>> environment and skill level.
>>>
>>> HTH
>>>
>>> -Mike
>>>
 On Sep 29, 2016, at 8:54 AM, Ali Akhtar  wrote:

 I have a somewhat tricky use case, and I'm looking for ideas.

 I have 5-6 Kafka producers, reading various APIs, and writing their raw 
 data into Kafka.

 I need to:

 - Do ETL on the data, and standardize it.

 - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / 
 ElasticSearch / Postgres)

 - Query this data to generate reports / analytics (There will be a web UI 
 which will be the front-end to the data, and will show the reports)

 Java is being used as the backend language for everything (backend of the 
 web UI, as well as the ETL layer)

 I'm considering:

 - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive 
 raw data from Kafka, standardize & store it)

 - Using Cassandra, HBase, or raw HDFS, for storing the standardized data, 
 and to allow queries

 - In the backend of the web UI, I could either use Spark to run queries 
 across the data (mostly filters), or directly run queries against 
 Cassandra / HBase

 I'd appreciate some thoughts / suggestions on which of these alternatives 
 I should go with (e.g, using raw Kafka consumers vs Spark for ETL, which 
 persistent data store to use, and how to query that data store in the 
 backend of the web UI, for displaying the reports).


 Thanks.
>>>
>




Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
Spark standalone is not Yarn… or secure for that matter… ;-)

> On Sep 29, 2016, at 11:18 AM, Cody Koeninger  wrote:
> 
> Spark streaming helps with aggregation because
> 
> A. raw kafka consumers have no built in framework for shuffling
> amongst nodes, short of writing into an intermediate topic (I'm not
> touching Kafka Streams here, I don't have experience), and
> 
> B. it deals with batches, so you can transactionally decide to commit
> or rollback your aggregate data and your offsets.  Otherwise your
> offsets and data store can get out of sync, leading to lost /
> duplicate data.
> 
> Regarding long running spark jobs, I have streaming jobs in the
> standalone manager that have been running for 6 months or more.
> 
> On Thu, Sep 29, 2016 at 11:01 AM, Michael Segel
>  wrote:
>> Ok… so what’s the tricky part?
>> Spark Streaming isn’t real time so if you don’t mind a slight delay in 
>> processing… it would work.
>> 
>> The drawback is that you now have a long running Spark Job (assuming under 
>> YARN) and that could become a problem in terms of security and resources.
>> (How well does Yarn handle long running jobs these days in a secured 
>> Cluster? Steve L. may have some insight… )
>> 
>> Raw HDFS would become a problem because Apache HDFS is still WORM (write 
>> once, read many). (Do you want to write your own compaction code? Or use 
>> Hive 1.x+?)
>> 
>> HBase? Depending on your admin… stability could be a problem.
>> Cassandra? That would be a separate cluster and that in itself could be a 
>> problem…
>> 
>> YMMV so you need to address the pros/cons of each tool specific to your 
>> environment and skill level.
>> 
>> HTH
>> 
>> -Mike
>> 
>>> On Sep 29, 2016, at 8:54 AM, Ali Akhtar  wrote:
>>> 
>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>> 
>>> I have 5-6 Kafka producers, reading various APIs, and writing their raw 
>>> data into Kafka.
>>> 
>>> I need to:
>>> 
>>> - Do ETL on the data, and standardize it.
>>> 
>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / 
>>> ElasticSearch / Postgres)
>>> 
>>> - Query this data to generate reports / analytics (There will be a web UI 
>>> which will be the front-end to the data, and will show the reports)
>>> 
>>> Java is being used as the backend language for everything (backend of the 
>>> web UI, as well as the ETL layer)
>>> 
>>> I'm considering:
>>> 
>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive 
>>> raw data from Kafka, standardize & store it)
>>> 
>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data, 
>>> and to allow queries
>>> 
>>> - In the backend of the web UI, I could either use Spark to run queries 
>>> across the data (mostly filters), or directly run queries against Cassandra 
>>> / HBase
>>> 
>>> I'd appreciate some thoughts / suggestions on which of these alternatives I 
>>> should go with (e.g, using raw Kafka consumers vs Spark for ETL, which 
>>> persistent data store to use, and how to query that data store in the 
>>> backend of the web UI, for displaying the reports).
>>> 
>>> 
>>> Thanks.
>> 



Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
Hi Michael,

How about Druid here?

Hive ORC tables are another option; they support streaming data ingest
(e.g. from Flume and Storm).

However, Spark cannot read ORC transactional tables because of delta files,
unless the compaction is done (a nightmare).

HTH


Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 September 2016 at 17:01, Michael Segel 
wrote:

> Ok… so what’s the tricky part?
> Spark Streaming isn’t real time so if you don’t mind a slight delay in
> processing… it would work.
>
> The drawback is that you now have a long running Spark Job (assuming under
> YARN) and that could become a problem in terms of security and resources.
> (How well does Yarn handle long running jobs these days in a secured
> Cluster? Steve L. may have some insight… )
>
> Raw HDFS would become a problem because Apache HDFS is still WORM (write
> once, read many). (Do you want to write your own compaction code? Or use
> Hive 1.x+?)
>
> HBase? Depending on your admin… stability could be a problem.
> Cassandra? That would be a separate cluster and that in itself could be a
> problem…
>
> YMMV so you need to address the pros/cons of each tool specific to your
> environment and skill level.
>
> HTH
>
> -Mike
>
> > On Sep 29, 2016, at 8:54 AM, Ali Akhtar  wrote:
> >
> > I have a somewhat tricky use case, and I'm looking for ideas.
> >
> > I have 5-6 Kafka producers, reading various APIs, and writing their raw
> data into Kafka.
> >
> > I need to:
> >
> > - Do ETL on the data, and standardize it.
> >
> > - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS /
> ElasticSearch / Postgres)
> >
> > - Query this data to generate reports / analytics (There will be a web
> UI which will be the front-end to the data, and will show the reports)
> >
> > Java is being used as the backend language for everything (backend of
> the web UI, as well as the ETL layer)
> >
> > I'm considering:
> >
> > - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
> (receive raw data from Kafka, standardize & store it)
> >
> > - Using Cassandra, HBase, or raw HDFS, for storing the standardized
> data, and to allow queries
> >
> > - In the backend of the web UI, I could either use Spark to run queries
> across the data (mostly filters), or directly run queries against Cassandra
> / HBase
> >
> > I'd appreciate some thoughts / suggestions on which of these
> alternatives I should go with (e.g, using raw Kafka consumers vs Spark for
> ETL, which persistent data store to use, and how to query that data store
> in the backend of the web UI, for displaying the reports).
> >
> >
> > Thanks.
>
>


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
> I still don't understand why writing to a transactional database with locking 
> and concurrency (reads and writes) through JDBC will be fast for this sort of 
> data ingestion.

Who cares about fast if your data is wrong?  And it's still plenty fast enough

https://youtu.be/NVl9_6J1G60?list=WL=1819

https://www.citusdata.com/blog/2016/09/22/announcing-citus-mx/



On Thu, Sep 29, 2016 at 11:16 AM, Mich Talebzadeh
 wrote:
> The way I see this, there are three things involved.
>
> 1. Data ingestion through source to Kafka
> 2. Data conversion and storage (ETL/ELT)
> 3. Presentation
>
> Item 2 is the one that needs to be designed correctly. I presume raw data
> has to conform to some form of MDM that requires schema mapping etc. before
> being put into persistent storage (DB, HDFS etc.). Which one to choose depends
> on your volume of ingestion, your cluster size, and the complexity of data
> conversion. Then your users will use some form of UI (Tableau, QlikView,
> Zeppelin, direct SQL) to query the data one way or another. Your users can
> directly use a UI like Tableau that offers built-in analytics on SQL (Spark
> SQL offers the same). Your mileage varies according to your needs.
>
> I still don't understand why writing to a transactional database with
> locking and concurrency (reads and writes) through JDBC will be fast for this
> sort of data ingestion. If you asked me to choose an RDBMS to
> write to as my sink, I would use Oracle, which offers the best locking and
> concurrency among RDBMSs and also handles key-value pairs (assuming
> that is what you want). In addition, it can be used as a data warehouse as
> well.
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>
> On 29 September 2016 at 16:49, Ali Akhtar  wrote:
>>
>> The business use case is to read a user's data from a variety of different
>> services through their API, and then allow the user to query that data,
>> on a per service basis, as well as an aggregate across all services.
>>
>> The way I'm considering doing it, is to do some basic ETL (drop all the
>> unnecessary fields, rename some fields into something more manageable, etc)
>> and then store the data in Cassandra / Postgres.
>>
>> Then, when the user wants to view a particular report, query the
>> respective table in Cassandra / Postgres. (select .. from data where user =
>> ? and date between ? and ? and some_field = ?)
>>
>> How will Spark Streaming help w/ aggregation? Couldn't the data be queried
>> from Cassandra / Postgres via the Kafka consumer and aggregated that way?
>>
>> On Thu, Sep 29, 2016 at 8:43 PM, Cody Koeninger 
>> wrote:
>>>
>>> No, direct stream in and of itself won't ensure an end-to-end
>>> guarantee, because it doesn't know anything about your output actions.
>>>
>>> You still need to do some work.  The point is having easy access to
>>> offsets for batches on a per-partition basis makes it easier to do
>>> that work, especially in conjunction with aggregation.
>>>
>>> On Thu, Sep 29, 2016 at 10:40 AM, Deepak Sharma 
>>> wrote:
>>> > If you use Spark direct streams, it ensures an end-to-end guarantee for
>>> > messages.
>>> >
>>> >
>>> > On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar 
>>> > wrote:
>>> >>
>>> >> My concern with Postgres / Cassandra is only scalability. I will look
>>> >> further into Postgres horizontal scaling, thanks.
>>> >>
>>> >> Writes could be idempotent if done as upserts, otherwise updates will
>>> >> be
>>> >> idempotent but not inserts.
>>> >>
>>> >> Data should not be lost. The system should be as fault tolerant as
>>> >> possible.
>>> >>
>>> >> What's the advantage of using Spark for reading Kafka instead of
>>> >> direct
>>> >> Kafka consumers?
>>> >>
>>> >> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger 
>>> >> wrote:
>>> >>>
>>> >>> I wouldn't give up the flexibility and maturity of a relational
>>> >>> database, unless you have a very specific use case.  I'm not trashing
>>> >>> cassandra, I've used cassandra, but if all I know is that you're
>>> >>> doing
>>> >>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>>> >>> aggregations without a lot of forethought.  If you're worried about
>>> >>> scaling, there are several options for horizontally scaling Postgres
>>> >>> in particular.  One of the current best from what I've worked with is
>>> >>> Citus.
>>> >>>
>>> >>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma
>>> 

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
Spark streaming helps with aggregation because

A. raw kafka consumers have no built in framework for shuffling
amongst nodes, short of writing into an intermediate topic (I'm not
touching Kafka Streams here, I don't have experience), and

B. it deals with batches, so you can transactionally decide to commit
or rollback your aggregate data and your offsets.  Otherwise your
offsets and data store can get out of sync, leading to lost /
duplicate data.
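
To make point B concrete, here is a minimal sketch of that commit-or-rollback
pattern with the 0.8-era direct stream API and a JDBC store. The broker
address, topic, table names, and the per-key count are illustrative
assumptions, not anything specified in this thread:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;
import scala.Tuple2;

public class TransactionalEtl {
  public static void main(String[] args) throws Exception {
    JavaStreamingContext jssc = new JavaStreamingContext(
        new SparkConf().setAppName("etl"), Durations.seconds(30));

    Map<String, String> kafkaParams = new HashMap<>();
    kafkaParams.put("metadata.broker.list", "broker1:9092"); // assumed broker

    JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParams, new HashSet<>(Arrays.asList("raw-events"))); // assumed topic

    stream.foreachRDD(rdd -> {
      // The direct stream exposes exactly which offsets make up this batch.
      OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

      // Illustrative aggregation: per-key counts, small enough to collect.
      Map<String, Long> counts = rdd
          .mapToPair(t -> new Tuple2<>(t._1(), 1L))
          .reduceByKey(Long::sum)
          .collectAsMap();

      // One transaction covers both the aggregates and the offsets: commit
      // both or roll back both, so the store and the offsets cannot drift.
      try (Connection c = DriverManager.getConnection("jdbc:postgresql://db/reports")) {
        c.setAutoCommit(false);
        try (PreparedStatement agg = c.prepareStatement(
                 "INSERT INTO report_aggregates (k, cnt) VALUES (?, ?) "
                     + "ON CONFLICT (k) DO UPDATE SET cnt = report_aggregates.cnt + EXCLUDED.cnt");
             PreparedStatement off = c.prepareStatement(
                 "UPDATE kafka_offsets SET until_offset = ? WHERE topic = ? AND partition_id = ?")) {
          for (Map.Entry<String, Long> e : counts.entrySet()) {
            agg.setString(1, e.getKey());
            agg.setLong(2, e.getValue());
            agg.executeUpdate();
          }
          for (OffsetRange r : ranges) {
            off.setLong(1, r.untilOffset());
            off.setString(2, r.topic());
            off.setInt(3, r.partition());
            off.executeUpdate();
          }
          c.commit();
        } catch (Exception e) {
          c.rollback(); // neither the aggregates nor the offsets are persisted
          throw e;
        }
      }
    });

    jssc.start();
    jssc.awaitTermination();
  }
}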

Regarding long running spark jobs, I have streaming jobs in the
standalone manager that have been running for 6 months or more.

On Thu, Sep 29, 2016 at 11:01 AM, Michael Segel
 wrote:
> Ok… so what’s the tricky part?
> Spark Streaming isn’t real time so if you don’t mind a slight delay in 
> processing… it would work.
>
> The drawback is that you now have a long running Spark Job (assuming under 
> YARN) and that could become a problem in terms of security and resources.
> (How well does Yarn handle long running jobs these days in a secured Cluster? 
> Steve L. may have some insight… )
>
> Raw HDFS would become a problem because Apache HDFS is still WORM (write 
> once, read many). (Do you want to write your own compaction code? Or use 
> Hive 1.x+?)
>
> HBase? Depending on your admin… stability could be a problem.
> Cassandra? That would be a separate cluster and that in itself could be a 
> problem…
>
> YMMV so you need to address the pros/cons of each tool specific to your 
> environment and skill level.
>
> HTH
>
> -Mike
>
>> On Sep 29, 2016, at 8:54 AM, Ali Akhtar  wrote:
>>
>> I have a somewhat tricky use case, and I'm looking for ideas.
>>
>> I have 5-6 Kafka producers, reading various APIs, and writing their raw data 
>> into Kafka.
>>
>> I need to:
>>
>> - Do ETL on the data, and standardize it.
>>
>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / 
>> ElasticSearch / Postgres)
>>
>> - Query this data to generate reports / analytics (There will be a web UI 
>> which will be the front-end to the data, and will show the reports)
>>
>> Java is being used as the backend language for everything (backend of the 
>> web UI, as well as the ETL layer)
>>
>> I'm considering:
>>
>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive 
>> raw data from Kafka, standardize & store it)
>>
>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data, 
>> and to allow queries
>>
>> - In the backend of the web UI, I could either use Spark to run queries 
>> across the data (mostly filters), or directly run queries against Cassandra 
>> / HBase
>>
>> I'd appreciate some thoughts / suggestions on which of these alternatives I 
>> should go with (e.g, using raw Kafka consumers vs Spark for ETL, which 
>> persistent data store to use, and how to query that data store in the 
>> backend of the web UI, for displaying the reports).
>>
>>
>> Thanks.
>




Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
The way I see this, there are three things involved.

   1. Data ingestion through source to Kafka
   2. Data conversion and storage (ETL/ELT)
   3. Presentation

Item 2 is the one that needs to be designed correctly. I presume raw data
has to conform to some form of MDM that requires schema mapping etc. before
being put into persistent storage (DB, HDFS etc.). Which one to choose depends
on your volume of ingestion, your cluster size, and the complexity of data
conversion. Then your users will use some form of UI (Tableau, QlikView,
Zeppelin, direct SQL) to query the data one way or another. Your users can
directly use a UI like Tableau that offers built-in analytics on SQL (Spark
SQL offers the same). Your mileage varies according to your needs.

I still don't understand why writing to a transactional database with
locking and concurrency (reads and writes) through JDBC will be fast for
this sort of data ingestion. If you asked me to choose an RDBMS
to write to as my sink, I would use Oracle, which offers the best locking and
concurrency among RDBMSs and also handles key-value pairs (assuming
that is what you want). In addition, it can be used as a data warehouse as
well.

HTH



Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 September 2016 at 16:49, Ali Akhtar  wrote:

> The business use case is to read a user's data from a variety of different
> services through their API, and then allow the user to query that data,
> on a per service basis, as well as an aggregate across all services.
>
> The way I'm considering doing it, is to do some basic ETL (drop all the
> unnecessary fields, rename some fields into something more manageable, etc)
> and then store the data in Cassandra / Postgres.
>
> Then, when the user wants to view a particular report, query the
> respective table in Cassandra / Postgres. (select .. from data where user =
> ? and date between ? and ? and some_field = ?)
>
> How will Spark Streaming help w/ aggregation? Couldn't the data be queried
> from Cassandra / Postgres via the Kafka consumer and aggregated that way?
>
> On Thu, Sep 29, 2016 at 8:43 PM, Cody Koeninger 
> wrote:
>
>> No, direct stream in and of itself won't ensure an end-to-end
>> guarantee, because it doesn't know anything about your output actions.
>>
>> You still need to do some work.  The point is having easy access to
>> offsets for batches on a per-partition basis makes it easier to do
>> that work, especially in conjunction with aggregation.
>>
>> On Thu, Sep 29, 2016 at 10:40 AM, Deepak Sharma 
>> wrote:
>> > If you use Spark direct streams, it ensures an end-to-end guarantee for
>> > messages.
>> >
>> >
>> > On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar 
>> wrote:
>> >>
>> >> My concern with Postgres / Cassandra is only scalability. I will look
>> >> further into Postgres horizontal scaling, thanks.
>> >>
>> >> Writes could be idempotent if done as upserts, otherwise updates will
>> be
>> >> idempotent but not inserts.
>> >>
>> >> Data should not be lost. The system should be as fault tolerant as
>> >> possible.
>> >>
>> >> What's the advantage of using Spark for reading Kafka instead of direct
>> >> Kafka consumers?
>> >>
>> >> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger 
>> >> wrote:
>> >>>
>> >>> I wouldn't give up the flexibility and maturity of a relational
>> >>> database, unless you have a very specific use case.  I'm not trashing
>> >>> cassandra, I've used cassandra, but if all I know is that you're doing
>> >>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>> >>> aggregations without a lot of forethought.  If you're worried about
>> >>> scaling, there are several options for horizontally scaling Postgres
>> >>> in particular.  One of the current best from what I've worked with is
>> >>> Citus.
>> >>>
>> >>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <
>> deepakmc...@gmail.com>
>> >>> wrote:
>> >>> > Hi Cody
>> >>> > Spark direct stream is just fine for this use case.
>> >>> > But why postgres and not cassandra?
>> >>> > Is there anything specific here that i may not be aware?
>> >>> >
>> >>> > Thanks
>> >>> > Deepak
>> >>> >
>> >>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger > >
>> >>> > wrote:
>> >>> >>
>> >>> >> How are you going to handle etl failures?  Do you care about lost /
>> >>> >> duplicated data?  Are your writes idempotent?

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
Ok… so what’s the tricky part? 
Spark Streaming isn’t real time so if you don’t mind a slight delay in 
processing… it would work.

The drawback is that you now have a long running Spark Job (assuming under 
YARN) and that could become a problem in terms of security and resources. 
(How well does Yarn handle long running jobs these days in a secured Cluster? 
Steve L. may have some insight… ) 

Raw HDFS would become a problem because Apache HDFS is still WORM (write once, 
read many). (Do you want to write your own compaction code? Or use Hive 1.x+?)

HBase? Depending on your admin… stability could be a problem. 
Cassandra? That would be a separate cluster and that in itself could be a 
problem… 

YMMV so you need to address the pros/cons of each tool specific to your 
environment and skill level. 

HTH

-Mike

> On Sep 29, 2016, at 8:54 AM, Ali Akhtar  wrote:
> 
> I have a somewhat tricky use case, and I'm looking for ideas.
> 
> I have 5-6 Kafka producers, reading various APIs, and writing their raw data 
> into Kafka.
> 
> I need to:
> 
> - Do ETL on the data, and standardize it.
> 
> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / 
> ElasticSearch / Postgres)
> 
> - Query this data to generate reports / analytics (There will be a web UI 
> which will be the front-end to the data, and will show the reports)
> 
> Java is being used as the backend language for everything (backend of the web 
> UI, as well as the ETL layer)
> 
> I'm considering:
> 
> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive 
> raw data from Kafka, standardize & store it)
> 
> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data, and 
> to allow queries
> 
> - In the backend of the web UI, I could either use Spark to run queries 
> across the data (mostly filters), or directly run queries against Cassandra / 
> HBase
> 
> I'd appreciate some thoughts / suggestions on which of these alternatives I 
> should go with (e.g, using raw Kafka consumers vs Spark for ETL, which 
> persistent data store to use, and how to query that data store in the backend 
> of the web UI, for displaying the reports).
> 
> 
> Thanks.



Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
The business use case is to read a user's data from a variety of different
services through their API, and then allow the user to query that data,
on a per service basis, as well as an aggregate across all services.

The way I'm considering doing it, is to do some basic ETL (drop all the
unnecessary fields, rename some fields into something more manageable, etc)
and then store the data in Cassandra / Postgres.

Then, when the user wants to view a particular report, query the respective
table in Cassandra / Postgres. (select .. from data where user = ? and date
between ? and ? and some_field = ?)
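
As a minimal sketch, that lookup from the Java backend could be a plain JDBC
query against Postgres; the connection string, table, and column names below
are assumptions based on the query shape above:

import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class ReportDao {
    private static final String URL = "jdbc:postgresql://db-host/reports"; // assumed

    /** Fetches one service's rows for a user in a date range (shape from the query above). */
    public List<String> report(String userId, String service,
                               LocalDate from, LocalDate to) throws SQLException {
        String sql = "SELECT payload FROM data "
                   + "WHERE user_id = ? AND service = ? AND event_date BETWEEN ? AND ?";
        List<String> rows = new ArrayList<>();
        try (Connection c = DriverManager.getConnection(URL);
             PreparedStatement ps = c.prepareStatement(sql)) {
            ps.setString(1, userId);
            ps.setString(2, service);
            ps.setDate(3, Date.valueOf(from));
            ps.setDate(4, Date.valueOf(to));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows.add(rs.getString("payload")); // raw payload; shape it for the UI here
                }
            }
        }
        return rows;
    }
}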

How will Spark Streaming help w/ aggregation? Couldn't the data be queried
from Cassandra / Postgres via the Kafka consumer and aggregated that way?

On Thu, Sep 29, 2016 at 8:43 PM, Cody Koeninger  wrote:

> No, direct stream in and of itself won't ensure an end-to-end
> guarantee, because it doesn't know anything about your output actions.
>
> You still need to do some work.  The point is having easy access to
> offsets for batches on a per-partition basis makes it easier to do
> that work, especially in conjunction with aggregation.
>
> On Thu, Sep 29, 2016 at 10:40 AM, Deepak Sharma 
> wrote:
> > If you use Spark direct streams, it ensures an end-to-end guarantee for
> > messages.
> >
> >
> > On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar 
> wrote:
> >>
> >> My concern with Postgres / Cassandra is only scalability. I will look
> >> further into Postgres horizontal scaling, thanks.
> >>
> >> Writes could be idempotent if done as upserts, otherwise updates will be
> >> idempotent but not inserts.
> >>
> >> Data should not be lost. The system should be as fault tolerant as
> >> possible.
> >>
> >> What's the advantage of using Spark for reading Kafka instead of direct
> >> Kafka consumers?
> >>
> >> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger 
> >> wrote:
> >>>
> >>> I wouldn't give up the flexibility and maturity of a relational
> >>> database, unless you have a very specific use case.  I'm not trashing
> >>> cassandra, I've used cassandra, but if all I know is that you're doing
> >>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
> >>> aggregations without a lot of forethought.  If you're worried about
> >>> scaling, there are several options for horizontally scaling Postgres
> >>> in particular.  One of the current best from what I've worked with is
> >>> Citus.
> >>>
> >>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma  >
> >>> wrote:
> >>> > Hi Cody
> >>> > Spark direct stream is just fine for this use case.
> >>> > But why postgres and not cassandra?
> >>> > Is there anything specific here that i may not be aware?
> >>> >
> >>> > Thanks
> >>> > Deepak
> >>> >
> >>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger 
> >>> > wrote:
> >>> >>
> >>> >> How are you going to handle etl failures?  Do you care about lost /
> >>> >> duplicated data?  Are your writes idempotent?
> >>> >>
> >>> >> Absent any other information about the problem, I'd stay away from
> >>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
> >>> >> feeding postgres.
> >>> >>
> >>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar 
> >>> >> wrote:
> >>> >> > Is there an advantage to that vs directly consuming from Kafka?
> >>> >> > Nothing
> >>> >> > is
> >>> >> > being done to the data except some light ETL and then storing it
> in
> >>> >> > Cassandra
> >>> >> >
> >>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma
> >>> >> > 
> >>> >> > wrote:
> >>> >> >>
> >>> >> >> It's better you use Spark's direct stream to ingest from Kafka.
> >>> >> >>
> >>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <
> ali.rac...@gmail.com>
> >>> >> >> wrote:
> >>> >> >>>
> >>> >> >>> I don't think I need a different speed storage and batch
> storage.
> >>> >> >>> Just
> >>> >> >>> taking in raw data from Kafka, standardizing, and storing it
> >>> >> >>> somewhere
> >>> >> >>> where
> >>> >> >>> the web UI can query it, seems like it will be enough.
> >>> >> >>>
> >>> >> >>> I'm thinking about:
> >>> >> >>>
> >>> >> >>> - Reading data from Kafka via Spark Streaming
> >>> >> >>> - Standardizing, then storing it in Cassandra
> >>> >> >>> - Querying Cassandra from the web ui
> >>> >> >>>
> >>> >> >>> That seems like it will work. My question now is whether to use
> >>> >> >>> Spark
> >>> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
> >>> >> >>>
> >>> >> >>>
> >>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
> >>> >> >>>  wrote:
> >>> >> 
> >>> >>  - Spark Streaming to read data from Kafka
> >>> >>  - Storing the data on HDFS using Flume
> >>> >> 
> >>> >>  You don't need Spark streaming to read data from Kafka and store on
> >>> >>  HDFS. It is a waste of resources.

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
No, direct stream in and of itself won't ensure an end-to-end
guarantee, because it doesn't know anything about your output actions.

You still need to do some work.  The point is having easy access to
offsets for batches on a per-partition basis makes it easier to do
that work, especially in conjunction with aggregation.
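
That per-partition access is short with the direct stream; here is a sketch
following the pattern in the Spark/Kafka integration guide, with the stream
setup omitted (see the earlier sketch) and the partition handling left
illustrative:

import java.util.Iterator;

import org.apache.spark.TaskContext;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.OffsetRange;
import scala.Tuple2;

public class PerPartitionOffsets {
  /** Wires per-partition offset awareness onto an existing direct stream. */
  static void wire(JavaPairInputDStream<String, String> stream) {
    stream.foreachRDD(rdd -> {
      OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
      rdd.foreachPartition((Iterator<Tuple2<String, String>> it) -> {
        // For the direct stream, RDD partition i corresponds to offset range i,
        // so each task knows precisely which slice of Kafka it holds.
        OffsetRange r = ranges[TaskContext.get().partitionId()];
        while (it.hasNext()) {
          Tuple2<String, String> record = it.next();
          // ... write record, then persist r.untilOffset() in the same
          // transaction (or skip the work if this range was already committed).
        }
      });
    });
  }
}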

On Thu, Sep 29, 2016 at 10:40 AM, Deepak Sharma  wrote:
> If you use Spark direct streams, it ensures an end-to-end guarantee for
> messages.
>
>
> On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar  wrote:
>>
>> My concern with Postgres / Cassandra is only scalability. I will look
>> further into Postgres horizontal scaling, thanks.
>>
>> Writes could be idempotent if done as upserts, otherwise updates will be
>> idempotent but not inserts.
>>
>> Data should not be lost. The system should be as fault tolerant as
>> possible.
>>
>> What's the advantage of using Spark for reading Kafka instead of direct
>> Kafka consumers?
>>
>> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger 
>> wrote:
>>>
>>> I wouldn't give up the flexibility and maturity of a relational
>>> database, unless you have a very specific use case.  I'm not trashing
>>> cassandra, I've used cassandra, but if all I know is that you're doing
>>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>>> aggregations without a lot of forethought.  If you're worried about
>>> scaling, there are several options for horizontally scaling Postgres
>>> in particular.  One of the current best from what I've worked with is
>>> Citus.
>>>
>>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma 
>>> wrote:
>>> > Hi Cody
>>> > Spark direct stream is just fine for this use case.
>>> > But why postgres and not cassandra?
>>> > Is there anything specific here that i may not be aware?
>>> >
>>> > Thanks
>>> > Deepak
>>> >
>>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger 
>>> > wrote:
>>> >>
>>> >> How are you going to handle etl failures?  Do you care about lost /
>>> >> duplicated data?  Are your writes idempotent?
>>> >>
>>> >> Absent any other information about the problem, I'd stay away from
>>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>>> >> feeding postgres.
>>> >>
>>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar 
>>> >> wrote:
>>> >> > Is there an advantage to that vs directly consuming from Kafka?
>>> >> > Nothing
>>> >> > is
>>> >> > being done to the data except some light ETL and then storing it in
>>> >> > Cassandra
>>> >> >
>>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma
>>> >> > 
>>> >> > wrote:
>>> >> >>
>>> >> >> It's better you use Spark's direct stream to ingest from Kafka.
>>> >> >>
>>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar 
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> I don't think I need a different speed storage and batch storage.
>>> >> >>> Just
>>> >> >>> taking in raw data from Kafka, standardizing, and storing it
>>> >> >>> somewhere
>>> >> >>> where
>>> >> >>> the web UI can query it, seems like it will be enough.
>>> >> >>>
>>> >> >>> I'm thinking about:
>>> >> >>>
>>> >> >>> - Reading data from Kafka via Spark Streaming
>>> >> >>> - Standardizing, then storing it in Cassandra
>>> >> >>> - Querying Cassandra from the web ui
>>> >> >>>
>>> >> >>> That seems like it will work. My question now is whether to use
>>> >> >>> Spark
>>> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
>>> >> >>>
>>> >> >>>
>>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>>> >> >>>  wrote:
>>> >> 
>>> >>  - Spark Streaming to read data from Kafka
>>> >>  - Storing the data on HDFS using Flume
>>> >> 
>>> >>  You don't need Spark streaming to read data from Kafka and store
>>> >>  on
>>> >>  HDFS. It is a waste of resources.
>>> >> 
>>> >>  Couple Flume to use Kafka as source and HDFS as sink directly
>>> >> 
>>> >>  KafkaAgent.sources = kafka-sources
>>> >>  KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>> >> 
>>> >>  That will be for your batch layer. To analyse you can directly
>>> >>  read
>>> >>  from
>>> >>  hdfs files with Spark or simply store data in a database of your
>>> >>  choice via
>>> >>  cron or something. Do not mix your batch layer with speed layer.
>>> >> 
>>> >>  Your speed layer will ingest the same data directly from Kafka
>>> >>  into
>>> >>  spark streaming and that will be  online or near real time
>>> >>  (defined
>>> >>  by your
>>> >>  window).
>>> >> 
>>> >>  Then you have a serving layer to present data from both speed
>>> >>  (the
>>> >>  one from SS) and batch layer.
>>> >> 
>>> >>  HTH
>>> >> 
>>> >> 
>>> >> 
>>> >> 
>>> >>  Dr Mich Talebzadeh
>>> >> 
>>> >> 
>>> >> 
>>> >> 

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
Hi Ali,

What is the business use case for this?

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 September 2016 at 16:40, Deepak Sharma  wrote:

> If you use Spark direct streams, it ensures an end-to-end guarantee for
> messages.
>
>
> On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar  wrote:
>
>> My concern with Postgres / Cassandra is only scalability. I will look
>> further into Postgres horizontal scaling, thanks.
>>
>> Writes could be idempotent if done as upserts, otherwise updates will be
>> idempotent but not inserts.
>>
>> Data should not be lost. The system should be as fault tolerant as
>> possible.
>>
>> What's the advantage of using Spark for reading Kafka instead of direct
>> Kafka consumers?
>>
>> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger 
>> wrote:
>>
>>> I wouldn't give up the flexibility and maturity of a relational
>>> database, unless you have a very specific use case.  I'm not trashing
>>> cassandra, I've used cassandra, but if all I know is that you're doing
>>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>>> aggregations without a lot of forethought.  If you're worried about
>>> scaling, there are several options for horizontally scaling Postgres
>>> in particular.  One of the current best from what I've worked with is
>>> Citus.
>>>
>>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma 
>>> wrote:
>>> > Hi Cody
>>> > Spark direct stream is just fine for this use case.
>>> > But why postgres and not cassandra?
>>> > Is there anything specific here that i may not be aware?
>>> >
>>> > Thanks
>>> > Deepak
>>> >
>>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger 
>>> wrote:
>>> >>
>>> >> How are you going to handle etl failures?  Do you care about lost /
>>> >> duplicated data?  Are your writes idempotent?
>>> >>
>>> >> Absent any other information about the problem, I'd stay away from
>>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>>> >> feeding postgres.
>>> >>
>>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar 
>>> wrote:
>>> >> > Is there an advantage to that vs directly consuming from Kafka?
>>> Nothing
>>> >> > is
>>> >> > being done to the data except some light ETL and then storing it in
>>> >> > Cassandra
>>> >> >
>>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <
>>> deepakmc...@gmail.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> It's better you use Spark's direct stream to ingest from Kafka.
>>> >> >>
>>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar 
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> I don't think I need a different speed storage and batch storage.
>>> Just
>>> >> >>> taking in raw data from Kafka, standardizing, and storing it
>>> somewhere
>>> >> >>> where
>>> >> >>> the web UI can query it, seems like it will be enough.
>>> >> >>>
>>> >> >>> I'm thinking about:
>>> >> >>>
>>> >> >>> - Reading data from Kafka via Spark Streaming
>>> >> >>> - Standardizing, then storing it in Cassandra
>>> >> >>> - Querying Cassandra from the web ui
>>> >> >>>
>>> >> >>> That seems like it will work. My question now is whether to use
>>> Spark
>>> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
>>> >> >>>
>>> >> >>>
>>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>>> >> >>>  wrote:
>>> >> 
>>> >>  - Spark Streaming to read data from Kafka
>>> >>  - Storing the data on HDFS using Flume
>>> >> 
>>> >>  You don't need Spark streaming to read data from Kafka and store
>>> on
>>> >>  HDFS. It is a waste of resources.
>>> >> 
>>> >>  Couple Flume to use Kafka as source and HDFS as sink directly
>>> >> 
>>> >>  KafkaAgent.sources = kafka-sources
>>> >>  KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>> >> 
>>> >>  That will be for your batch layer. To analyse you can directly
>>> read
>>> >>  from
>>> >>  hdfs files with Spark or simply store data in a database of your
>>> >>  choice via
>>> >>  cron or something. Do not mix your batch layer with speed layer.
>>> >> 
>>> >>  Your speed layer will ingest the same data directly from Kafka
>>> into
>>> >>  spark streaming and that will be  online or near real time
>>> (defined
>>> >>  by your
>>> >>  window).
>>> >> 
>>> >>  Then you have a serving layer to present data from both speed (the
>>> >>  one from SS) and batch layer.

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
If you use Spark direct streams, it ensures an end-to-end guarantee for
messages.


On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar  wrote:

> My concern with Postgres / Cassandra is only scalability. I will look
> further into Postgres horizontal scaling, thanks.
>
> Writes could be idempotent if done as upserts, otherwise updates will be
> idempotent but not inserts.
>
> Data should not be lost. The system should be as fault tolerant as
> possible.
>
> What's the advantage of using Spark for reading Kafka instead of direct
> Kafka consumers?
>
> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger 
> wrote:
>
>> I wouldn't give up the flexibility and maturity of a relational
>> database, unless you have a very specific use case.  I'm not trashing
>> cassandra, I've used cassandra, but if all I know is that you're doing
>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>> aggregations without a lot of forethought.  If you're worried about
>> scaling, there are several options for horizontally scaling Postgres
>> in particular.  One of the current best from what I've worked with is
>> Citus.
>>
>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma 
>> wrote:
>> > Hi Cody
>> > Spark direct stream is just fine for this use case.
>> > But why postgres and not cassandra?
>> > Is there anything specific here that i may not be aware?
>> >
>> > Thanks
>> > Deepak
>> >
>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger 
>> wrote:
>> >>
>> >> How are you going to handle etl failures?  Do you care about lost /
>> >> duplicated data?  Are your writes idempotent?
>> >>
>> >> Absent any other information about the problem, I'd stay away from
>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>> >> feeding postgres.
>> >>
>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar 
>> wrote:
>> >> > Is there an advantage to that vs directly consuming from Kafka?
>> Nothing
>> >> > is
>> >> > being done to the data except some light ETL and then storing it in
>> >> > Cassandra
>> >> >
>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <
>> deepakmc...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> It's better you use Spark's direct stream to ingest from Kafka.
>> >> >>
>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar 
>> >> >> wrote:
>> >> >>>
>> >> >>> I don't think I need a different speed storage and batch storage.
>> Just
>> >> >>> taking in raw data from Kafka, standardizing, and storing it
>> somewhere
>> >> >>> where
>> >> >>> the web UI can query it, seems like it will be enough.
>> >> >>>
>> >> >>> I'm thinking about:
>> >> >>>
>> >> >>> - Reading data from Kafka via Spark Streaming
>> >> >>> - Standardizing, then storing it in Cassandra
>> >> >>> - Querying Cassandra from the web ui
>> >> >>>
>> >> >>> That seems like it will work. My question now is whether to use
>> Spark
>> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
>> >> >>>
>> >> >>>
>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>> >> >>>  wrote:
>> >> 
>> >>  - Spark Streaming to read data from Kafka
>> >>  - Storing the data on HDFS using Flume
>> >> 
>> >>  You don't need Spark streaming to read data from Kafka and store
>> on
>> >>  HDFS. It is a waste of resources.
>> >> 
>> >>  Couple Flume to use Kafka as source and HDFS as sink directly
>> >> 
>> >>  KafkaAgent.sources = kafka-sources
>> >>  KafkaAgent.sinks.hdfs-sinks.type = hdfs
>> >> 
>> >>  That will be for your batch layer. To analyse you can directly
>> read
>> >>  from
>> >>  hdfs files with Spark or simply store data in a database of your
>> >>  choice via
>> >>  cron or something. Do not mix your batch layer with speed layer.
>> >> 
>> >>  Your speed layer will ingest the same data directly from Kafka
>> into
>> >>  spark streaming and that will be  online or near real time
>> (defined
>> >>  by your
>> >>  window).
>> >> 
>> >>  Then you have a serving layer to present data from both speed
>> (the
>> >>  one from SS) and batch layer.
>> >> 
>> >>  HTH
>> >> 
>> >> 
>> >> 
>> >> 
>> >>  Dr Mich Talebzadeh
>> >> 
>> >> 
>> >> 
>> >>  LinkedIn
>> >> 
>> >>  https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >> 
>> >> 
>> >> 
>> >>  http://talebzadehmich.wordpress.com
>> >> 
>> >> 
>> >>  Disclaimer: Use it at your own risk. Any and all responsibility
>> for
>> >>  any
>> >>  loss, damage or destruction of data or any other property which
>> may
>> >>  arise
>> >>  from relying on this email's technical content is explicitly
>> >>  disclaimed. The
>> >>  author will in no case be liable for any monetary damages arising
>> >>  from such

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
Yes, but these writes from Spark still have to go through JDBC, correct?

Having said that, I don't see how doing this through Spark Streaming to
Postgres is going to be faster than source -> Kafka -> Flume (via ZooKeeper)
-> HDFS.
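
For reference, a minimal sketch of such a Flume agent, extending the
KafkaAgent config fragment quoted earlier in the thread; this is untested, and
the ZooKeeper host, topic, and HDFS path are placeholders:

# A sketch, not a tested config (Flume 1.6-era Kafka source).
KafkaAgent.sources  = kafka-sources
KafkaAgent.channels = mem-channel
KafkaAgent.sinks    = hdfs-sinks

KafkaAgent.sources.kafka-sources.type = org.apache.flume.source.kafka.KafkaSource
KafkaAgent.sources.kafka-sources.zookeeperConnect = zk1:2181
KafkaAgent.sources.kafka-sources.topic = raw-events
KafkaAgent.sources.kafka-sources.channels = mem-channel

KafkaAgent.channels.mem-channel.type = memory
KafkaAgent.channels.mem-channel.capacity = 10000

KafkaAgent.sinks.hdfs-sinks.type = hdfs
KafkaAgent.sinks.hdfs-sinks.channel = mem-channel
KafkaAgent.sinks.hdfs-sinks.hdfs.path = hdfs://namenode/raw/%Y-%m-%d
KafkaAgent.sinks.hdfs-sinks.hdfs.fileType = DataStream
KafkaAgent.sinks.hdfs-sinks.hdfs.useLocalTimeStamp = true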

I believe there is direct streaming from Kafka to Hive as well, and from
Flume to HBase.

I would have thought that if one wanted to do real time analytics with SS,
then that would be a good fit with a real time dashboard.

What is not so clear is the business use case for this.

HTH




Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 September 2016 at 16:28, Cody Koeninger  wrote:

> I wouldn't give up the flexibility and maturity of a relational
> database, unless you have a very specific use case.  I'm not trashing
> cassandra, I've used cassandra, but if all I know is that you're doing
> analytics, I wouldn't want to give up the ability to easily do ad-hoc
> aggregations without a lot of forethought.  If you're worried about
> scaling, there are several options for horizontally scaling Postgres
> in particular.  One of the current best from what I've worked with is
> Citus.
>
> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma 
> wrote:
> > Hi Cody
> > Spark direct stream is just fine for this use case.
> > But why postgres and not cassandra?
> > Is there anything specific here that i may not be aware?
> >
> > Thanks
> > Deepak
> >
> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger 
> wrote:
> >>
> >> How are you going to handle etl failures?  Do you care about lost /
> >> duplicated data?  Are your writes idempotent?
> >>
> >> Absent any other information about the problem, I'd stay away from
> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
> >> feeding postgres.
> >>
> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar 
> wrote:
> >> > Is there an advantage to that vs directly consuming from Kafka?
> Nothing
> >> > is
> >> > being done to the data except some light ETL and then storing it in
> >> > Cassandra
> >> >
> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma  >
> >> > wrote:
> >> >>
> >> >> It's better you use Spark's direct stream to ingest from Kafka.
> >> >>
> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar 
> >> >> wrote:
> >> >>>
> >> >>> I don't think I need a different speed storage and batch storage.
> Just
> >> >>> taking in raw data from Kafka, standardizing, and storing it
> somewhere
> >> >>> where
> >> >>> the web UI can query it, seems like it will be enough.
> >> >>>
> >> >>> I'm thinking about:
> >> >>>
> >> >>> - Reading data from Kafka via Spark Streaming
> >> >>> - Standardizing, then storing it in Cassandra
> >> >>> - Querying Cassandra from the web ui
> >> >>>
> >> >>> That seems like it will work. My question now is whether to use
> Spark
> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
> >> >>>
> >> >>>
> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
> >> >>>  wrote:
> >> 
> >>  - Spark Streaming to read data from Kafka
> >>  - Storing the data on HDFS using Flume
> >> 
> >>  You don't need Spark streaming to read data from Kafka and store on
> >>  HDFS. It is a waste of resources.
> >> 
> >>  Couple Flume to use Kafka as source and HDFS as sink directly
> >> 
> >>  KafkaAgent.sources = kafka-sources
> >>  KafkaAgent.sinks.hdfs-sinks.type = hdfs
> >> 
> >>  That will be for your batch layer. To analyse you can directly read
> >>  from
> >>  hdfs files with Spark or simply store data in a database of your
> >>  choice via
> >>  cron or something. Do not mix your batch layer with speed layer.
> >> 
> >>  Your speed layer will ingest the same data directly from Kafka into
> >>  spark streaming and that will be  online or near real time (defined
> >>  by your
> >>  window).
> >> 
> >>  Then you have a serving layer to present data from both speed
> (the
> >>  one from SS) and batch layer.
> >> 
> >>  HTH
> >> 
> >> 
> >> 
> >> 
> >>  Dr Mich Talebzadeh
> >> 
> >> 
> >> 
> >>  LinkedIn
> >> 
> >>  https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >> 
> >> 
> >> 
> >>  http://talebzadehmich.wordpress.com
> >> 
> >> 

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
If you're doing any kind of pre-aggregation during ETL, spark direct
stream will let you more easily get the delivery semantics you need,
especially if you're using a transactional data store.

If you're literally just copying individual uniquely keyed items from
kafka to a key-value store, use kafka consumers, sure.
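
A minimal sketch of that copy path with a plain (non-Spark) Kafka consumer,
assuming Postgres 9.5+ for the upsert; the broker, topic, and table names are
illustrative:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class UpsertCopier {
  public static void main(String[] args) throws Exception {
    Properties p = new Properties();
    p.put("bootstrap.servers", "broker1:9092"); // assumed broker
    p.put("group.id", "etl-copier");            // assumed group
    p.put("enable.auto.commit", "false");       // commit only after a successful write
    p.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    p.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(p);
         Connection c = DriverManager.getConnection("jdbc:postgresql://db/reports")) {
      consumer.subscribe(Collections.singletonList("raw-events")); // assumed topic
      PreparedStatement upsert = c.prepareStatement(
          "INSERT INTO items (id, payload) VALUES (?, ?) "
              + "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload");
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(1000);
        for (ConsumerRecord<String, String> r : records) {
          upsert.setString(1, r.key());   // unique key makes the write idempotent
          upsert.setString(2, r.value());
          upsert.executeUpdate();
        }
        consumer.commitSync(); // at-least-once; replays are absorbed by the upsert
      }
    }
  }
}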

On Thu, Sep 29, 2016 at 10:35 AM, Ali Akhtar  wrote:
> My concern with Postgres / Cassandra is only scalability. I will look
> further into Postgres horizontal scaling, thanks.
>
> Writes could be idempotent if done as upserts, otherwise updates will be
> idempotent but not inserts.
>
> Data should not be lost. The system should be as fault tolerant as possible.
>
> What's the advantage of using Spark for reading Kafka instead of direct
> Kafka consumers?
>
> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger  wrote:
>>
>> I wouldn't give up the flexibility and maturity of a relational
>> database, unless you have a very specific use case.  I'm not trashing
>> cassandra, I've used cassandra, but if all I know is that you're doing
>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>> aggregations without a lot of forethought.  If you're worried about
>> scaling, there are several options for horizontally scaling Postgres
>> in particular.  One of the current best from what I've worked with is
>> Citus.
>>
>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma 
>> wrote:
>> > Hi Cody
>> > Spark direct stream is just fine for this use case.
>> > But why postgres and not cassandra?
>> > Is there anything specific here that i may not be aware?
>> >
>> > Thanks
>> > Deepak
>> >
>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger 
>> > wrote:
>> >>
>> >> How are you going to handle etl failures?  Do you care about lost /
>> >> duplicated data?  Are your writes idempotent?
>> >>
>> >> Absent any other information about the problem, I'd stay away from
>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>> >> feeding postgres.
>> >>
>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar 
>> >> wrote:
>> >> > Is there an advantage to that vs directly consuming from Kafka?
>> >> > Nothing
>> >> > is
>> >> > being done to the data except some light ETL and then storing it in
>> >> > Cassandra
>> >> >
>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma
>> >> > 
>> >> > wrote:
>> >> >>
>> >> >> It's better you use Spark's direct stream to ingest from Kafka.
>> >> >>
>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar 
>> >> >> wrote:
>> >> >>>
>> >> >>> I don't think I need a different speed storage and batch storage.
>> >> >>> Just
>> >> >>> taking in raw data from Kafka, standardizing, and storing it
>> >> >>> somewhere
>> >> >>> where
>> >> >>> the web UI can query it, seems like it will be enough.
>> >> >>>
>> >> >>> I'm thinking about:
>> >> >>>
>> >> >>> - Reading data from Kafka via Spark Streaming
>> >> >>> - Standardizing, then storing it in Cassandra
>> >> >>> - Querying Cassandra from the web ui
>> >> >>>
>> >> >>> That seems like it will work. My question now is whether to use
>> >> >>> Spark
>> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
>> >> >>>
>> >> >>>
>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>> >> >>>  wrote:
>> >> 
>> >>  - Spark Streaming to read data from Kafka
>> >>  - Storing the data on HDFS using Flume
>> >> 
>> >>  You don't need Spark streaming to read data from Kafka and store
>> >>  on
>> >>  HDFS. It is a waste of resources.
>> >> 
>> >>  Couple Flume to use Kafka as source and HDFS as sink directly
>> >> 
>> >>  KafkaAgent.sources = kafka-sources
>> >>  KafkaAgent.sinks.hdfs-sinks.type = hdfs
>> >> 
>> >>  That will be for your batch layer. To analyse you can directly
>> >>  read
>> >>  from
>> >>  hdfs files with Spark or simply store data in a database of your
>> >>  choice via
>> >>  cron or something. Do not mix your batch layer with speed layer.
>> >> 
>> >>  Your speed layer will ingest the same data directly from Kafka
>> >>  into
>> >>  spark streaming and that will be  online or near real time
>> >>  (defined
>> >>  by your
>> >>  window).
>> >> 
>> >>  Then you have a a serving layer to present data from both speed
>> >>  (the
>> >>  one from SS) and batch layer.
>> >> 
>> >>  HTH
>> >> 
>> >> 
>> >> 
>> >> 
>> >>  Dr Mich Talebzadeh
>> >> 
>> >> 
>> >> 
>> >>  LinkedIn
>> >> 
>> >> 
>> >>  https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >> 
>> >> 
>> >> 
>> >>  http://talebzadehmich.wordpress.com
>> >> 
>> >> 
>> >>  Disclaimer: Use it at your own risk. Any and all responsibility

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
My concern with Postgres / Cassandra is only scalability. I will look
further into Postgres horizontal scaling, thanks.

Writes could be made idempotent by doing them as upserts; otherwise,
updates would be idempotent but plain inserts would not.
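
For example, a minimal sketch of such an idempotent upsert against Postgres
9.5+ over JDBC (the events table, its columns, and the connection details
below are placeholders, not our actual schema):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class UpsertExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/analytics", "user", "pass")) {
            // ON CONFLICT makes the write idempotent: replaying the same
            // event_id overwrites the row instead of inserting a duplicate.
            String sql = "INSERT INTO events (event_id, source, payload) "
                       + "VALUES (?, ?, ?) "
                       + "ON CONFLICT (event_id) DO UPDATE SET "
                       + "source = EXCLUDED.source, payload = EXCLUDED.payload";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "evt-123");
                ps.setString(2, "api-a");
                ps.setString(3, "{}");
                ps.executeUpdate();
            }
        }
    }
}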

Data should not be lost. The system should be as fault tolerant as possible.

What's the advantage of using Spark for reading Kafka instead of direct
Kafka consumers?


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
I wouldn't give up the flexibility and maturity of a relational
database unless you have a very specific use case.  I'm not trashing
Cassandra; I've used it.  But if all I know is that you're doing
analytics, I wouldn't want to give up the ability to easily do ad-hoc
aggregations without a lot of forethought.  If you're worried about
scaling, there are several options for horizontally scaling Postgres
in particular; one of the best I've worked with is Citus.
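
As a rough sketch of what that setup looks like (assuming the Citus
extension is already installed on the coordinator node; the events table
and its source_id distribution column are made up for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CitusSetup {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://coordinator:5432/analytics", "user", "pass");
             Statement st = conn.createStatement()) {
            // Shards the events table across the Citus worker nodes by
            // source_id, so both writes and per-source queries scale out.
            st.execute("SELECT create_distributed_table('events', 'source_id')");
        }
    }
}

After that, ordinary INSERTs and ad-hoc aggregations against events get
routed to the shards transparently.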


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
Hi Cody
Spark direct stream is just fine for this use case.
But why Postgres and not Cassandra?
Is there anything specific here that I may not be aware of?

Thanks
Deepak


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
How are you going to handle ETL failures?  Do you care about lost /
duplicated data?  Are your writes idempotent?

Absent any other information about the problem, I'd stay away from
cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
feeding postgres.
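
As a sketch of one way to handle it: write a batch's output and its Kafka
offset range in the same database transaction, so a failed batch rolls back
both and can be replayed without loss or duplication (the report_data and
kafka_offsets tables below are hypothetical):

import java.sql.Connection;
import java.sql.PreparedStatement;

public class TransactionalSink {
    public static void storeBatch(Connection conn, String topic, int partition,
                                  long fromOffset, long untilOffset,
                                  String batchResult) throws Exception {
        conn.setAutoCommit(false);
        try (PreparedStatement data = conn.prepareStatement(
                 "INSERT INTO report_data (payload) VALUES (?)");
             PreparedStatement offsets = conn.prepareStatement(
                 "UPDATE kafka_offsets SET until_offset = ? "
               + "WHERE topic = ? AND kafka_partition = ? AND until_offset = ?")) {
            data.setString(1, batchResult);
            data.executeUpdate();
            offsets.setLong(1, untilOffset);
            offsets.setString(2, topic);
            offsets.setInt(3, partition);
            offsets.setLong(4, fromOffset);
            // If no row matched, another attempt already advanced the
            // offsets past fromOffset: roll back to avoid double-writing.
            if (offsets.executeUpdate() != 1) {
                conn.rollback();
                return;
            }
            conn.commit();
        } catch (Exception e) {
            conn.rollback();
            throw e;
        }
    }
}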


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
Is there an advantage to that vs directly consuming from Kafka? Nothing is
being done to the data except some light ETL and then storing it in
Cassandra.


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
It's better to use Spark's direct stream to ingest from Kafka.
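
A minimal Java sketch of such a direct stream, roughly following the
spark-streaming-kafka-0-10 integration (the broker address, topic name, and
group id are placeholders):

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class DirectStreamEtl {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("kafka-etl");
        JavaStreamingContext jssc =
            new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "broker1:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "etl-group");
        kafkaParams.put("auto.offset.reset", "earliest");
        kafkaParams.put("enable.auto.commit", false);

        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Arrays.asList("raw-events"), kafkaParams));

        // Standardize and store each micro-batch here.
        stream.foreachRDD(rdd -> rdd.foreach(record ->
            System.out.println(record.value())));

        jssc.start();
        jssc.awaitTermination();
    }
}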


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
I don't think I need separate speed storage and batch storage. Just
taking in raw data from Kafka, standardizing it, and storing it somewhere
the web UI can query seems like it will be enough.

I'm thinking about:

- Reading data from Kafka via Spark Streaming
- Standardizing, then storing it in Cassandra
- Querying Cassandra from the web UI

That seems like it will work. My question now is whether to use Spark
Streaming to read Kafka, or use Kafka consumers directly.
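
For comparison, the raw-consumer version of that ETL loop would look
roughly like this (new-consumer API from Kafka 0.9/0.10; broker, group, and
topic names below are placeholders):

import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class RawEtlConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "etl-group");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Arrays.asList("raw-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    // standardize record.value() and write it to the store here
                }
                // Commit only after the batch is safely stored.
                consumer.commitSync();
            }
        }
    }
}

Scaling out then means running more consumer instances in the same group,
at most one per partition.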



Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
Since the inflow is huge, Flume would also need to run multiple channels in
a distributed fashion, in which case its resource utilization will be high
as well.

Thanks
Deepak


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
- Spark Streaming to read data from Kafka
- Storing the data on HDFS using Flume

You don't need Spark streaming to read data from Kafka and store on HDFS.
It is a waste of resources.

Couple Flume to use Kafka as source and HDFS as sink directly:

KafkaAgent.sources = kafka-sources
KafkaAgent.sinks.hdfs-sinks.type = hdfs

That will be for your batch layer. To analyse, you can directly read the
HDFS files with Spark, or simply store the data in a database of your
choice via cron or something. Do not mix your batch layer with your speed
layer.

Your speed layer will ingest the same data directly from Kafka into Spark
Streaming, and that will be online or near real time (defined by your
window).

Then you have a serving layer to present data from both the speed layer
(the one from Spark Streaming) and the batch layer.
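
For the batch side, reading what Flume landed on HDFS is a small Spark job,
as a sketch (the HDFS path and field names below are made up):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BatchReport {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("batch-report").getOrCreate();
        // Read the raw JSON events Flume wrote to HDFS, then aggregate.
        Dataset<Row> events = spark.read().json("hdfs:///data/raw-events/*");
        events.filter(events.col("source").equalTo("api-a"))
              .groupBy("event_type")
              .count()
              .show();
        spark.stop();
    }
}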

HTH


Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com




Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
For the UI, you need a DB such as Cassandra that is designed to serve
queries.
Ingest the data into Spark Streaming (speed layer) and write it to HDFS
(for the batch layer).
Now you have data at rest as well as in motion (real time).
From Spark Streaming itself, do further processing and write the final
result to Cassandra or another NoSQL DB.
The UI can pick the data up from the DB now.
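
A small sketch of that final write step with the DataStax
spark-cassandra-connector Java API (assuming a made-up analytics.reports
table whose columns match this bean's fields):

import java.io.Serializable;

import org.apache.spark.api.java.JavaRDD;

import com.datastax.spark.connector.japi.CassandraJavaUtil;

public class Report implements Serializable {
    private String id;
    private String body;
    public Report() {}
    public Report(String id, String body) { this.id = id; this.body = body; }
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getBody() { return body; }
    public void setBody(String body) { this.body = body; }

    // Writes the final per-batch results into Cassandra; the UI then
    // reads from the analytics.reports table.
    public static void save(JavaRDD<Report> results) {
        CassandraJavaUtil.javaFunctions(results)
            .writerBuilder("analytics", "reports",
                           CassandraJavaUtil.mapToRow(Report.class))
            .saveToCassandra();
    }
}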

Thanks
Deepak



Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Alonso Isidoro Roman
"Using Spark to query the data in the backend of the web UI?"

Don't do that. I would recommend that the Spark Streaming process store the
data into some NoSQL or SQL database, and that the web UI query the data
from that database.

Alonso Isidoro Roman
about.me/alonso.isidoro.roman




Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
The web UI is actually the speed layer; it needs to be able to query the
data online and show the results in real time.

It also needs a custom front-end, so a system like Tableau can't be used;
it must have a custom backend + front-end.

Thanks for the recommendation of Flume. Do you think this will work:

- Spark Streaming to read data from Kafka
- Storing the data on HDFS using Flume
- Using Spark to query the data in the backend of the web UI?





Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
You need a batch layer and a speed layer. Data from Kafka can be stored on
HDFS using Flume.

-  Query this data to generate reports / analytics (There will be a web UI
which will be the front-end to the data, and will show the reports)

This is basically the batch layer, and you need something like Tableau or
Zeppelin to query the data.

You will also need Spark Streaming to query data online for the speed
layer. That data could be stored in some transient fabric like Ignite or
even Druid.

HTH


Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com





Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
It needs to be able to scale to a very large amount of data, yes.



Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
What is the message inflow?
If it's really high, Spark will definitely be of great use.

Thanks
Deepak
