Re: Design patterns involving Spark

2017-04-12 Thread Harish Butani
BTW, we now support OLAP functionality natively in Spark, without the need
for Druid, through our Spark-native BI platform (SNAP):
https://www.linkedin.com/pulse/integrated-business-intelligence-big-data-stacks-harish-butani

- We provide SQL commands to create a star schema, create an OLAP index,
and insert into an OLAP index, so you can be up and running very quickly in
a Spark environment.
- Query acceleration is provided through an OLAP-index FileFormat and Query
Optimizer extensions (just like spark-druid-olap).
- We have also posted details on a BI benchmark to quantify query
acceleration and cost.
- We haven't looked at integration with Spark Streaming yet, but since we
have a FileFormat it should be possible to integrate. Please ping me if
this is of interest.
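For readers new to the approach: an OLAP index is, at heart, a pre-aggregated rollup over a star schema. SNAP's actual SQL syntax is not shown in this thread, so the sketch below uses plain SQLite with hypothetical table and column names, illustrating only the star-schema-plus-rollup idea:

```python
import sqlite3

# Sketch of the star-schema + rollup idea behind an OLAP index.
# Table and column names are illustrative, not SNAP's actual commands.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Fact table plus one dimension table: a minimal star schema.
cur.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, amount REAL);

INSERT INTO dim_product VALUES (1, 'books'), (2, 'music');
INSERT INTO fact_sales  VALUES (1, 10.0), (1, 5.0), (2, 7.5);

-- The 'OLAP index': a materialized rollup that queries can hit
-- instead of scanning the raw fact table.
CREATE TABLE olap_rollup AS
SELECT d.category, SUM(f.amount) AS total, COUNT(*) AS n
FROM fact_sales f JOIN dim_product d USING (product_id)
GROUP BY d.category;
""")

rollup = dict((cat, total) for cat, total, _ in
              cur.execute("SELECT * FROM olap_rollup ORDER BY category"))
print(rollup)  # {'books': 15.0, 'music': 7.5}
```

Aggregate queries then read the small rollup table rather than the fact table, which is the essence of the query acceleration being described.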

regards,
Harish.



Re: Design patterns involving Spark

2016-08-30 Thread Todd Nist
Have not tried this, but it looks quite useful if one is using Druid:

https://github.com/implydata/pivot  - An interactive data exploration UI
for Druid


Re: Design patterns involving Spark

2016-08-30 Thread Alonso Isidoro Roman
Thanks Mich, I will check it.

Cheers


Alonso Isidoro Roman
about.me/alonso.isidoro.roman



Re: Design patterns involving Spark

2016-08-30 Thread Mich Talebzadeh
You can use HBase for building real-time dashboards.

Check this link


HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: Design patterns involving Spark

2016-08-30 Thread Alonso Isidoro Roman
HBase for real-time queries? HBase was designed with batch in mind.
Impala would be a better choice, but I do not know what Druid can do.


Cheers

Alonso Isidoro Roman
about.me/alonso.isidoro.roman



Re: Design patterns involving Spark

2016-08-29 Thread Mich Talebzadeh
Hi Chanh,

Druid sounds like a good choice.

But again, the point is what else Druid brings on top of HBase.

Unless one decides to use Druid for both historical data and real-time data
in place of HBase!

Is it easier to write an API against Druid than against HBase? Do you still
want a UI dashboard?

Cheers

Dr Mich Talebzadeh







Re: Design patterns involving Spark

2016-08-29 Thread Chanh Le
Hi everyone,

It seems a lot of people are using Druid for real-time dashboards.
I'm just wondering about using Druid as the main storage engine, because
Druid can store the raw data and can integrate with Spark as well (in theory).
In that case, do we need two separate storage layers: Druid (storing
segments in HDFS) and HDFS itself?
BTW, did anyone try https://github.com/SparklineData/spark-druid-olap?


Regards,
Chanh


> On Aug 30, 2016, at 3:23 AM, Mich Talebzadeh  
> wrote:
> 
> Thanks Bhaarat and everyone.
> 
> This is an updated version of the same diagram.
> 
> The frequency of recent data is defined by the window length in Spark 
> Streaming. It can vary between 0.5 seconds and an hour (I don't think we can 
> push Spark's granularity below 0.5 seconds in anger). For some applications, 
> like credit card transactions and fraud detection, data is stored in real 
> time by Spark in HBase tables. HBase tables will be on HDFS as well. The same 
> Spark Streaming job will write asynchronously to HDFS Hive tables.
> One school of thought is to never write to Hive from Spark: write straight 
> to HBase and then read the HBase tables into Hive periodically.
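[The window-length point in the quoted paragraph can be illustrated without Spark. A plain-Python sketch of tumbling windows; the 0.5-second value mirrors the practical floor mentioned above, and the bucketing logic is generic:]

```python
from collections import defaultdict

def window_counts(events, window_secs):
    """Bucket (timestamp, value) events into fixed-length tumbling windows."""
    buckets = defaultdict(list)
    for ts, value in events:
        start = (ts // window_secs) * window_secs  # window start time
        buckets[start].append(value)
    return dict(buckets)

# Events at 0.1 s spacing, windowed at 0.5 s (the practical lower bound above).
events = [(i / 10, i) for i in range(10)]  # timestamps 0.0 .. 0.9
print(window_counts(events, 0.5))
# {0.0: [0, 1, 2, 3, 4], 0.5: [5, 6, 7, 8, 9]}
```

[In Spark Streaming the window length plays the same role: it sets how much recent data each emitted result covers.]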
> 
> Now the third component in this layer is the serving layer, which can combine 
> data from the current store (HBase) and the historical store (Hive tables) to 
> give the user visual analytics. That visual analytics can be a real-time 
> dashboard on top of the serving layer. The serving layer could be an 
> in-memory NoSQL offering, or data from HBase (the red box) combined with 
> Hive tables.
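[The serving-layer combination described in the quoted paragraph, recent data from HBase overlaid on historical aggregates from Hive, reduces to a merge in which recent values win. A minimal sketch, with plain dicts standing in for the two stores and hypothetical keys:]

```python
def serve(historical, recent):
    """Combine historical aggregates (e.g. Hive) with recent data (e.g. HBase).
    Recent values take precedence for keys present in both stores."""
    merged = dict(historical)
    merged.update(recent)  # recent layer overrides stale historical values
    return merged

# Hypothetical per-hour transaction counts.
hive_counts  = {"09:00": 120, "10:00": 80}   # historical batch layer
hbase_counts = {"10:00": 95, "11:00": 12}    # recent, still being updated
print(serve(hive_counts, hbase_counts))
# {'09:00': 120, '10:00': 95, '11:00': 12}
```

[A real serving layer adds querying and caching on top, but the precedence rule is the core of combining the two stores.]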
> 
> I am not aware of any industrial-strength real-time dashboard. The idea is 
> that one uses such a dashboard in real time. Dashboard in this sense means a 
> general-purpose API onto a data store of some type, like the serving layer, 
> providing visual analytics in real time on demand and combining real-time 
> data with aggregate views. As usual, the devil is in the detail.
> 
> 
> 
> Let me know your thoughts. Anyway, this is a first-cut pattern.
> 
> 
> 
> Dr Mich Talebzadeh
>  
>  
> 
> On 29 August 2016 at 18:53, Bhaarat Sharma wrote:
> Hi Mich
> 
> This is really helpful. I'm trying to wrap my head around the last diagram 
> you shared (the one with Kafka). In this diagram Spark Streaming is pushing 
> data to HDFS and NoSQL. However, I'm confused by the "Real Time Queries, 
> Dashboards" annotation. Based on this diagram, will real-time queries be 
> running on Spark or HBase?
> 
> PS: My intention was not to steer the conversation away from what Ashok asked 
> but I found the diagrams shared by Mich very insightful. 
> 
> On Sun, Aug 28, 2016 at 7:18 PM, Mich Talebzadeh wrote:
> Hi,
> 
> In terms of positioning, Spark is really the first big-data platform to 
> integrate batch, streaming, and interactive computation in a unified 
> framework. What this boils down to is that, whichever way one looks at it, 
> there is somewhere Spark can make a contribution. In general, there are a 
> few design patterns common to big data:
>  
> ETL & Batch
> The first is the most common, with established tools like Sqoop and Talend 
> for ETL and HDFS for storage of some kind. Spark can be used as the 
> execution engine for Hive at the storage level, which actually makes it a 
> truly vendor-independent processing engine (BTW, Impala, Tez, and LLAP are 
> offered by vendors). Personally, I use Spark at the ETL layer, extracting 
> data from sources through plug-ins (JDBC and others) and storing it on HDFS 
> in some kind of format.
>  
> Batch, real time, plus analytics
> In this pattern you have data coming in real time and you want to query it 
> in real time through a real-time dashboard. HDFS is ideal neither for 
> updating data in real time nor for random access. The source could be all 
> sorts of web servers, which need a Flume agent. At the storage layer we are 
> probably looking at something like HBase. The crucial point is that saved 
> data needs to be ready for queries immediately. The dashboards require 
> HBase APIs. The analytics can be done through Hive, again running on the 
> Spark engine. Note again that we should ideally process batch and real time 
> separately.
>  
> Real time / Streaming
> This is most relevant to Spark as we are moving to near real time. Where 

Re: Design patterns involving Spark

2016-08-28 Thread Sivakumaran S
Spark fits best for processing. But depending on the use case, you could 
expand its scope to moving data using the native connectors. The one thing 
that Spark is not is storage; connectors are available for most storage 
options, though.

Regards,

Sivakumaran S





Design patterns involving Spark

2016-08-28 Thread Ashok Kumar
Hi,

There are design patterns that use Spark extensively. I am new to this area, 
so I would appreciate it if someone could explain where Spark fits in, 
especially within fast or streaming use cases.

What are the best practices involving Spark? Is it always best to deploy it 
as the processing engine?

For example, when we have a pattern

Input Data -> Data in Motion -> Processing -> Storage

where does Spark best fit in?

Thanking you