Re: Best way to present data collected by Flume through Spark

2016-09-16 Thread Mich Talebzadeh
Hi Sean,

At the moment I am using Zeppelin with Spark SQL to get data from Hive, so
any visualisation layer here has to connect through that sort of API.

I know Tableau only uses SQL. Zeppelin can use Spark SQL directly or go
through the Spark Thrift Server.
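
For reference, here is roughly what that path looks like today. This is only
a minimal sketch: the table name (marketdata) and the columns are
placeholders rather than the real schema, and in Zeppelin the session is
normally pre-created for you.

import org.apache.spark.sql.SparkSession

// Standalone equivalent of what a Zeppelin paragraph does.
// enableHiveSupport() lets Spark SQL see the Hive metastore tables.
val spark = SparkSession.builder()
  .appName("ZeppelinStyleQuery")
  .enableHiveSupport()
  .getOrCreate()

// Placeholder table and column names.
val latest = spark.sql(
  """SELECT ticker, price, created_ts
    |FROM marketdata
    |ORDER BY created_ts DESC
    |LIMIT 20""".stripMargin)

latest.show()

The same SQL can also be pushed through the Spark Thrift Server over JDBC,
which is how a tool like Tableau would reach it.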

The question is that a user may want to create a join involving many tables,
and for that the preference would be to have some sort of database behind
the dashboard.

In this case Hive is running on the Spark engine, so we are not talking
about MapReduce and its associated latency.

That Hive element can easily be swapped out. So our requirement is to
present multiple tables to the dashboard and let the user slice and dice.
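
To illustrate the sort of thing a user might build, here is a hypothetical
join across a price table and two reference tables. All names and columns
below are made up, purely to show the shape, and it assumes the SparkSession
from the earlier sketch (or the one Zeppelin provides).

val dashboard = spark.sql(
  """SELECT i.sector,
    |       e.exchange_name,
    |       avg(p.price)      AS avg_price,
    |       max(p.created_ts) AS last_tick
    |FROM marketdata p
    |JOIN instruments i ON p.ticker = i.ticker
    |JOIN exchanges   e ON i.exchange_id = e.exchange_id
    |GROUP BY i.sector, e.exchange_name""".stripMargin)

// Registering the result lets a notebook paragraph or BI tool treat it as a table.
dashboard.createOrReplaceTempView("dashboard_summary")

Whether this runs against Hive tables or some other store, the user-facing
shape of the query stays the same, which is really the point.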

The factors are not just speed but also functionality. At the moment
Zeppelin uses Spark SQL. I can get rid of Hive and replace it with another
store, but I think I still need a tabular interface to the Flume-delivered
data.
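
Concretely, the tabular interface I have in mind is the one described
further down this thread: an external table over the directory Flume lands
files in, plus a target table refreshed on a schedule. Below is a rough
sketch of it through Spark SQL. Paths, names, the schema and the extra audit
columns are placeholders, and the file-format clauses would need to match
whatever Flume actually writes (the source table in the real setup is ORC).

// Source: external table over the Flume landing directory. Flume's in-flight
// files are dot-prefixed (hidden), so the external table does not see them.
spark.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS marketdata_ext (
    |  ticker     STRING,
    |  price      DOUBLE,
    |  created_ts TIMESTAMP
    |)
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    |STORED AS TEXTFILE
    |LOCATION 'hdfs:///data/prices/flume'""".stripMargin)

// Target: managed ORC table with two extra housekeeping columns (placeholders).
spark.sql(
  """CREATE TABLE IF NOT EXISTS marketdata (
    |  ticker     STRING,
    |  price      DOUBLE,
    |  created_ts TIMESTAMP,
    |  op_type    INT,
    |  op_time    TIMESTAMP
    |) STORED AS ORC""".stripMargin)

// Incremental refresh: only rows newer than anything already in the target.
// COALESCE handles the very first run, when the target table is empty.
spark.sql(
  """INSERT INTO TABLE marketdata
    |SELECT s.ticker, s.price, s.created_ts, 1, current_timestamp()
    |FROM marketdata_ext s
    |WHERE s.created_ts > COALESCE(
    |  (SELECT max(created_ts) FROM marketdata),
    |  CAST('1970-01-01 00:00:00' AS TIMESTAMP))""".stripMargin)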

I will be happy to consider all options.

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 16 September 2016 at 08:46, Sean Owen  wrote:

> Why Hive, and why precompute data at 15-minute latency? There are
> several ways to query the source data directly with no extra step or
> latency. Even Spark SQL is real-time-ish for queries on the source
> data, and Impala (or, heck, Drill etc.) certainly are.
>
> On Thu, Sep 15, 2016 at 10:56 PM, Mich Talebzadeh
>  wrote:
> > OK this seems to be working for the "Batch layer". I will try to create a
> > functional diagram for it
> >
> > Publisher sends prices every two seconds
> > Kafka receives the data
> > Flume delivers the data from Kafka to HDFS as time-stamped text files
> > A Hive ORC external table (source table) is created on the directory where
> > Flume writes continuously
> > All temporary Flume files are prefixed by "." (hidden files), so the Hive
> > external table does not see them
> > Every price row includes a timestamp
> > A conventional Hive table (target table) is created with all columns from
> > the external table plus two additional columns, one being a timestamp from
> > Hive
> > A cron job is set up that runs every 15 minutes, as below:
> > 0,15,30,45 00-23 * * 1-5 (/home/hduser/dba/bin/populate_marketData.ksh -D
> > test > /var/tmp/populate_marketData_test.err 2>&1)
> >
> > This cron, as can be seen, runs every 15 minutes and refreshes the Hive
> > target table with the new data, new data meaning rows whose price created
> > time > MAX(price created time) in the target table.
> >
> > Target table statistics are updated at each run. It takes an average of 2
> > minutes to run the job:
> > Thu Sep 15 22:45:01 BST 2016  === Started
> > /home/hduser/dba/bin/populate_marketData.ksh  ===
> > 15/09/2016 22:45:09.09
> > 15/09/2016 22:46:57.57
> > 2016-09-15T22:46:10
> > 2016-09-15T22:46:57
> > Thu Sep 15 22:47:21 BST 2016  === Completed
> > /home/hduser/dba/bin/populate_marketData.ksh  ===
> >
> >
> > So the target table is at most 15 minutes behind the Flume data, which is
> > not bad.
> >
> > Assuming that I replace the ORC tables with Parquet, Druid or whatever,
> > that can be done pretty easily. However, although I am using Zeppelin
> > here, people may decide to use Tableau, QlikView etc, so we need to think
> > about the connectivity between these notebooks and the underlying
> > database. I know Tableau: it is very SQL-centric and works with ODBC and
> > JDBC drivers or native drivers. For example, I know that Tableau comes
> > supplied with ODBC drivers for Hive. I am not sure whether such drivers
> > exist for Druid etc?
> >
> > Let me know your thoughts.
> >
> > Cheers
> >
> > Dr Mich Talebzadeh
> >
>


Re: Best way to present data collected by Flume through Spark

2016-09-16 Thread Sean Owen
Why Hive, and why precompute data at 15-minute latency? There are
several ways to query the source data directly with no extra step or
latency. Even Spark SQL is real-time-ish for queries on the source
data, and Impala (or, heck, Drill etc.) certainly are.
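
For instance, something along these lines queries the Flume-landed files
directly, with no staging table and no cron in between. Path, delimiter and
schema are illustrative only:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("DirectQuery").getOrCreate()

// Illustrative schema for the time-stamped text files Flume writes.
val schema = StructType(Seq(
  StructField("ticker", StringType),
  StructField("price", DoubleType),
  StructField("created_ts", TimestampType)))

val prices = spark.read
  .schema(schema)
  .option("sep", ",")          // adjust to the real delimiter
  .csv("hdfs:///data/prices/flume")

prices.createOrReplaceTempView("prices")

// Ad-hoc query straight over the source files.
spark.sql("SELECT ticker, max(created_ts) AS last_tick FROM prices GROUP BY ticker").show()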

On Thu, Sep 15, 2016 at 10:56 PM, Mich Talebzadeh
 wrote:
> OK this seems to be working for the "Batch layer". I will try to create a
> functional diagram for it
>
> Publisher sends prices every two seconds
> Kafka receives the data
> Flume delivers the data from Kafka to HDFS as time-stamped text files
> A Hive ORC external table (source table) is created on the directory where
> Flume writes continuously
> All temporary Flume files are prefixed by "." (hidden files), so the Hive
> external table does not see them
> Every price row includes a timestamp
> A conventional Hive table (target table) is created with all columns from
> the external table plus two additional columns, one being a timestamp from
> Hive
> A cron job is set up that runs every 15 minutes, as below:
> 0,15,30,45 00-23 * * 1-5 (/home/hduser/dba/bin/populate_marketData.ksh -D
> test > /var/tmp/populate_marketData_test.err 2>&1)
>
> This cron, as can be seen, runs every 15 minutes and refreshes the Hive
> target table with the new data, new data meaning rows whose price created
> time > MAX(price created time) in the target table.
>
> Target table statistics are updated at each run. It takes an average of 2
> minutes to run the job:
> Thu Sep 15 22:45:01 BST 2016  === Started
> /home/hduser/dba/bin/populate_marketData.ksh  ===
> 15/09/2016 22:45:09.09
> 15/09/2016 22:46:57.57
> 2016-09-15T22:46:10
> 2016-09-15T22:46:57
> Thu Sep 15 22:47:21 BST 2016  === Completed
> /home/hduser/dba/bin/populate_marketData.ksh  ===
>
>
> So the target table is at most 15 minutes behind the Flume data, which is
> not bad.
>
> Assuming that I replace the ORC tables with Parquet, Druid or whatever,
> that can be done pretty easily. However, although I am using Zeppelin here,
> people may decide to use Tableau, QlikView etc, so we need to think about
> the connectivity between these notebooks and the underlying database. I
> know Tableau: it is very SQL-centric and works with ODBC and JDBC drivers
> or native drivers. For example, I know that Tableau comes supplied with
> ODBC drivers for Hive. I am not sure whether such drivers exist for Druid
> etc?
>
> Let me know your thoughts.
>
> Cheers
>
> Dr Mich Talebzadeh
>




Re: Best way to present data collected by Flume through Spark

2016-09-15 Thread Jeff Nadler
Yes we do something very similar and it's working well:

Kafka ->
Spark Streaming (write temp files, serialized RDDs) ->
Spark Batch Application (build partitioned Parquet files on HDFS; this is
needed because building Parquet files of a reasonable size is too slow for
streaming) ->
query with SparkSQL
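
A minimal sketch of that batch step, just to make the shape concrete. Paths,
the partition column and the staging format are placeholders, and the real
pipeline's temp files are serialized RDDs rather than JSON:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ParquetCompaction").getOrCreate()

// Placeholder locations: streaming drops small temp files into staging,
// and the batch job folds them into reasonably sized, partitioned Parquet.
val stagingDir   = "hdfs:///data/prices/staging"
val warehouseDir = "hdfs:///data/prices/parquet"

// JSON is used here only to keep the sketch self-contained.
val staged = spark.read.json(stagingDir)

staged
  .repartition(8)                // controls output file count/size
  .write
  .mode("append")
  .partitionBy("trade_date")     // placeholder partition column
  .parquet(warehouseDir)

// Queries then run over the compacted Parquet with Spark SQL.
spark.read.parquet(warehouseDir).createOrReplaceTempView("prices")
spark.sql("SELECT count(*) FROM prices WHERE trade_date = '2016-09-15'").show()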


On Thu, Sep 15, 2016 at 7:33 AM, Sean Owen  wrote:

> If your core requirement is ad-hoc real-time queries over the data,
> then the standard Hadoop-centric answer would be:
>
> Ingest via Kafka,
> maybe using Flume, or possibly Spark Streaming, to read and land the data,
> in...
> Parquet on HDFS or possibly Kudu, and
> Impala to query
>
> >> On 15 September 2016 at 09:35, Mich Talebzadeh <
> mich.talebza...@gmail.com>
> >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> This is for fishing for some ideas.
> >>>
> >>> In the design we get prices directly through Kafka into Flume and store
> >>> it on HDFS as text files
> >>> We can then use Spark with Zeppelin to present data to the users.
> >>>
> >>> This works. However, I am aware that once the volume of flat files
> rises
> >>> one needs to do housekeeping. You don't want to read all files every
> time.
> >>>
> >>> A more viable alternative would be to read data into some form of tables
> >>> (Hive etc) periodically through an hourly cron set up, so the batch
> >>> process will have up-to-date and accurate data up to the last hour.
> >>>
> >>> That would certainly be an easier option for the users as well.
> >>>
> >>> I was wondering what would be the best strategy here: Druid, Hive,
> >>> others?
> >>>
> >>> The business case here is that users may want to access older data, so a
> >>> database of some sort would be a better solution? In all likelihood they
> >>> want a week's data.
> >>>
> >>> Thanks
> >>>
> >>> Dr Mich Talebzadeh
> >>>
> >>>
> >>>
> >>> LinkedIn
> >>> https://www.linkedin.com/profile/view?id=
> AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>>
> >>>
> >>>
> >>> http://talebzadehmich.wordpress.com
> >>>
> >>>
> >>> Disclaimer: Use it at your own risk. Any and all responsibility for any
> >>> loss, damage or destruction of data or any other property which may
> arise
> >>> from relying on this email's technical content is explicitly
> disclaimed. The
> >>> author will in no case be liable for any monetary damages arising from
> such
> >>> loss, damage or destruction.
> >>>
> >>>
> >>
> >>
> >
>
>
>


Re: Best way to present data collected by Flume through Spark

2016-09-15 Thread Sean Owen
If your core requirement is ad-hoc real-time queries over the data,
then the standard Hadoop-centric answer would be:

Ingest via Kafka,
maybe using Flume, or possibly Spark Streaming, to read and land the data, in...
Parquet on HDFS or possibly Kudu, and
Impala to query
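
A rough sketch of the Spark Streaming variant of that landing step. The
broker, topic, batch interval and output path are placeholders, and it
assumes the spark-streaming-kafka-0-8 connector on the classpath:

import kafka.serializer.StringDecoder
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("KafkaToParquet").getOrCreate()
    import spark.implicits._

    val ssc = new StreamingContext(spark.sparkContext, Seconds(60))

    // Placeholder broker list and topic.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("prices"))

    // Land each micro-batch as an append to Parquet on HDFS; Impala (or
    // Spark SQL) then queries the Parquet directly.
    stream.map(_._2).foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        rdd.toDF("raw_line").write.mode("append").parquet("hdfs:///data/prices/parquet")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Each micro-batch produces small files, so some compaction step (or Kudu,
which avoids the small-file problem) is usually still needed.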

>> On 15 September 2016 at 09:35, Mich Talebzadeh 
>> wrote:
>>>
>>> Hi,
>>>
>>> This is for fishing for some ideas.
>>>
>>> In the design we get prices directly through Kafka into Flume and store
>>> it on HDFS as text files
>>> We can then use Spark with Zeppelin to present data to the users.
>>>
>>> This works. However, I am aware that once the volume of flat files rises
>>> one needs to do housekeeping. You don't want to read all files every time.
>>>
>>> A more viable alternative would be to read data into some form of tables
>>> (Hive etc) periodically through an hourly cron set up, so the batch process
>>> will have up-to-date and accurate data up to the last hour.
>>>
>>> That would certainly be an easier option for the users as well.
>>>
>>> I was wondering what would be the best strategy here: Druid, Hive, others?
>>>
>>> The business case here is that users may want to access older data, so a
>>> database of some sort would be a better solution? In all likelihood they
>>> want a week's data.
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>>> loss, damage or destruction of data or any other property which may arise
>>> from relying on this email's technical content is explicitly disclaimed. The
>>> author will in no case be liable for any monetary damages arising from such
>>> loss, damage or destruction.
>>>
>>>
>>
>>
>




Re: Best way to present data collected by Flume through Spark

2016-09-15 Thread Sachin Janani
Hi Mich,

I agree that the technology stack you describe is more difficult to manage
because of the different components involved (HDFS, Flume, Kafka etc). The
solution to this problem could be to have a database capable of supporting
mixed workloads (OLTP, OLAP, streaming etc), and I think SnappyData fits
your problem well. It is an open-source distributed in-memory data store
with Spark as the computational engine, and it supports real-time
operational analytics, delivering stream analytics, OLTP (online transaction
processing) and OLAP (online analytical processing) in a single integrated
cluster. As it is developed on top of Spark, your existing Spark code will
work as is. Please have a look:
http://www.snappydata.io/
http://snappydatainc.github.io/snappydata/


Thanks and Regards,
Sachin Janani

On Thu, Sep 15, 2016 at 7:16 PM, Mich Talebzadeh 
wrote:

> any ideas on this?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 15 September 2016 at 09:35, Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> This is for fishing for some ideas.
>>
>> In the design we get prices directly through Kafka into Flume and store
>> it on HDFS as text files
>> We can then use Spark with Zeppelin to present data to the users.
>>
>> This works. However, I am aware that once the volume of flat files rises
>> one needs to do housekeeping. You don't want to read all files every time.
>>
>> A more viable alternative would be to read data into some form of tables
>> (Hive etc) periodically through an hourly cron set up, so the batch process
>> will have up-to-date and accurate data up to the last hour.
>>
>> That would certainly be an easier option for the users as well.
>>
>> I was wondering what would be the best strategy here: Druid, Hive, others?
>>
>> The business case here is that users may want to access older data, so a
>> database of some sort would be a better solution? In all likelihood they
>> want a week's data.
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>


Re: Best way to present data collected by Flume through Spark

2016-09-15 Thread Mich Talebzadeh
any ideas on this?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 15 September 2016 at 09:35, Mich Talebzadeh 
wrote:

> Hi,
>
> This is for fishing for some ideas.
>
> In the design we get prices directly through Kafka into Flume and store
> it on HDFS as text files
> We can then use Spark with Zeppelin to present data to the users.
>
> This works. However, I am aware that once the volume of flat files rises
> one needs to do housekeeping. You don't want to read all files every time.
>
> A more viable alternative would be to read data into some form of tables
> (Hive etc) periodically through an hourly cron set up, so the batch process
> will have up-to-date and accurate data up to the last hour.
>
> That would certainly be an easier option for the users as well.
>
> I was wondering what would be the best strategy here: Druid, Hive, others?
>
> The business case here is that users may want to access older data, so a
> database of some sort would be a better solution? In all likelihood they
> want a week's data.
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>