Re: Best way to present data collected by Flume through Spark

2016-09-16 Thread Mich Talebzadeh
Hi Sean, At the moment I am using Zeppelin with Spark SQL to get data from Hive. So any connection here for visualization has to be through this sort of API. I know Tableau only uses SQL. Zeppelin can use Spark SQL directly or through the Spark Thrift Server. The question is a user may want to
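The "Zeppelin issuing Spark SQL against Hive" setup above boils down to paragraphs that build and run SQL strings. A minimal sketch of such a per-user query; the table name `prices` and its columns are hypothetical, not taken from the thread (and real code should use bound parameters rather than string interpolation):

```python
def build_price_query(ticker: str, start: str, end: str) -> str:
    """Build the kind of Spark SQL string a Zeppelin paragraph might run
    via spark.sql(...) or the Spark Thrift Server. Illustration only:
    interpolating user input into SQL is unsafe in production."""
    return (
        "SELECT ticker, price_ts, price "
        "FROM prices "
        f"WHERE ticker = '{ticker}' "
        f"AND price_ts BETWEEN '{start}' AND '{end}' "
        "ORDER BY price_ts"
    )

print(build_price_query("IBM", "2016-09-15", "2016-09-16"))
```

In Zeppelin the same statement could be typed directly into a `%sql` paragraph; the point is that every visualization path here reduces to SQL over the Hive-registered tables.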

Re: Best way to present data collected by Flume through Spark

2016-09-16 Thread Sean Owen
Why Hive, and why precompute data at 15-minute latency? There are several ways here to query the source data directly, with no extra step or latency. Even Spark SQL is real-time-ish for queries on the source data, and Impala (or heck, Drill etc.) are. On Thu, Sep 15, 2016 at 10:56 PM, Mich
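Sean's latency point can be made concrete with a little arithmetic: with a periodic precompute at interval T, an event that arrives just after a batch starts is not queryable until a full interval plus the next job's runtime has passed. A sketch (the 2-minute job runtime is an illustrative assumption):

```python
def worst_case_staleness_s(interval_s: float, job_runtime_s: float) -> float:
    """An event arriving just after a batch begins waits one full interval,
    then the next job's runtime, before it is visible to queries."""
    return interval_s + job_runtime_s

def average_staleness_s(interval_s: float, job_runtime_s: float) -> float:
    """On average an event waits half an interval plus the job runtime."""
    return interval_s / 2 + job_runtime_s

# 15-minute precompute with a hypothetical 2-minute job:
print(worst_case_staleness_s(900, 120))  # 1020.0 seconds, i.e. 17 minutes
print(average_staleness_s(900, 120))     # 570.0 seconds
```

Querying the landed source data directly (Spark SQL, Impala, Drill) removes the interval term entirely, which is the substance of the objection.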

Re: Best way to present data collected by Flume through Spark

2016-09-15 Thread Jeff Nadler
Yes, we do something very similar and it's working well: Kafka -> Spark Streaming (writes temp files of serialized RDDs) -> Spark batch application (builds partitioned Parquet files on HDFS; this is needed because building Parquet files of a reasonable size is too slow for streaming) -> query with
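The reason Jeff's pipeline needs a separate batch step is file sizing: each streaming micro-batch is too small to make a well-sized Parquet file on its own, so a periodic job compacts many temp files into one. A minimal sketch of that grouping decision, assuming a hypothetical ~128 MB target (a common HDFS block size, not a figure from the thread):

```python
def plan_parquet_batches(file_sizes_bytes, target_bytes=128 * 1024 * 1024):
    """Greedily group small streaming temp files into batches whose total
    size reaches at least `target_bytes`, so each batch compacts into one
    reasonably sized Parquet file. Returns a list of lists of sizes."""
    batches, current, current_total = [], [], 0
    for size in file_sizes_bytes:
        current.append(size)
        current_total += size
        if current_total >= target_bytes:
            batches.append(current)
            current, current_total = [], 0
    if current:  # undersized leftovers wait for the next compaction run
        batches.append(current)
    return batches

mb = 1024 * 1024
sizes = [40 * mb] * 7  # seven 40 MB micro-batch temp files
print([len(b) for b in plan_parquet_batches(sizes)])  # [4, 3]
```

The streaming job stays fast because it only serializes raw micro-batches; the expensive columnar encoding happens once per compacted batch.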

Re: Best way to present data collected by Flume through Spark

2016-09-15 Thread Sean Owen
If your core requirement is ad-hoc real-time queries over the data, then the standard Hadoop-centric answer would be: ingest via Kafka (maybe using Flume, or possibly Spark Streaming) to read and land the data in Parquet on HDFS, or possibly Kudu, and Impala to query >> On 15 September 2016
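The "land in Parquet on HDFS, query with Impala" pattern usually implies a Hive-style partitioned directory layout, so that query engines can prune whole partitions instead of scanning everything. A sketch of such a layout; the base path, topic name, and `dt=` partition column are all illustrative, not from the thread:

```python
from datetime import date

def partition_path(base: str, topic: str, day: date) -> str:
    """Hive-style daily partition directory for landed data,
    e.g. /data/prices/dt=2016-09-15. Names are hypothetical."""
    return f"{base}/{topic}/dt={day.isoformat()}"

print(partition_path("/data", "prices", date(2016, 9, 15)))
# /data/prices/dt=2016-09-15
```

With this layout, a query filtered on the partition column touches only the matching directories, which is what keeps ad-hoc queries over the raw landed data fast enough without a precompute step.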

Re: Best way to present data collected by Flume through Spark

2016-09-15 Thread Sachin Janani
Hi Mich, I agree that the technology stack you describe is more difficult to manage due to the different components involved (HDFS, Flume, Kafka, etc.). The solution to this problem could be to have some DB which has the capability to support mixed workloads (OLTP, OLAP, streaming, etc.), and I think

Re: Best way to present data collected by Flume through Spark

2016-09-15 Thread Mich Talebzadeh
Any ideas on this? -- Dr Mich Talebzadeh | LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw | http://talebzadehmich.wordpress.com | Disclaimer: Use it at your own risk.

Best way to present data collected by Flume through Spark

2016-09-15 Thread Mich Talebzadeh
Hi, This is fishing for some ideas. In the design we get prices directly through Kafka into Flume and store them on HDFS as text files. We can then use Spark with Zeppelin to present data to the users. This works. However, I am aware that once the volume of flat files rises one needs to do
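The flat-file volume concern raised above can be quantified: landing one text file per Flume roll interval per topic grows the HDFS file count linearly, and every file costs NameNode heap. A back-of-envelope sketch; the topic count, roll interval, and the ~150-bytes-per-object rule of thumb are all illustrative assumptions, not figures from the thread:

```python
def files_per_day(topics: int, roll_interval_s: int) -> int:
    """Files created per day if Flume rolls one file per topic per interval."""
    return topics * (24 * 3600 // roll_interval_s)

def namenode_overhead_mb(total_files: int, bytes_per_object: int = 150) -> float:
    """Rough NameNode heap cost; ~150 bytes per file/block object is a
    commonly cited rule of thumb, used here purely as an illustration."""
    return total_files * bytes_per_object / (1024 * 1024)

daily = files_per_day(topics=10, roll_interval_s=60)
print(daily)                                        # 14400 files/day
print(round(namenode_overhead_mb(daily * 365), 1))  # heap cost after a year
```

Numbers like these are why the replies in this thread converge on periodic compaction into larger partitioned files (Parquet) rather than keeping the raw per-interval text files forever.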