I have not tried this, but it looks quite useful if one is using Druid: https://github.com/implydata/pivot (an interactive data exploration UI for Druid).
On Tue, Aug 30, 2016 at 4:10 AM, Alonso Isidoro Roman <alons...@gmail.com> wrote:

Thanks Mich, I will check it.

Cheers

Alonso Isidoro Roman
about.me/alonso.isidoro.roman

2016-08-30 9:52 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:

You can use HBase for building real-time dashboards.

Check this link:
https://www.sigmoid.com/integrating-spark-kafka-hbase-to-power-a-real-time-dashboard/

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On 30 August 2016 at 08:33, Alonso Isidoro Roman <alons...@gmail.com> wrote:

HBase for real-time queries? HBase was designed with batch in mind. Impala might be a better choice, but I do not know what Druid can do.

Cheers

Alonso Isidoro Roman

2016-08-30 8:56 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:

Hi Chanh,

Druid sounds like a good choice.

But again, the point is: what else does Druid bring on top of HBase?
Unless one decides to use Druid for both historical data and real-time data, in place of HBase!

Is it easier to write an API against Druid than against HBase? You would still want a UI dashboard?

Cheers

Dr Mich Talebzadeh

On 30 August 2016 at 03:19, Chanh Le <giaosu...@gmail.com> wrote:

Hi everyone,

It seems a lot of people use Druid for real-time dashboards. I am just wondering about using Druid as the main storage engine, because Druid can store the raw data and can integrate with Spark as well (theoretically). In that case, do we need two separate stores, Druid (which stores its segments in HDFS) and HDFS? BTW, did anyone try this one: https://github.com/SparklineData/spark-druid-olap?

Regards,
Chanh

On Aug 30, 2016, at 3:23 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Thanks Bhaarat and everyone.

This is an updated version of the same diagram:

<LambdaArchitecture.png>

The frequency of recent data is defined by the window length in Spark Streaming. It can vary between 0.5 seconds and an hour. (I don't think we can push Spark's granularity below 0.5 seconds in anger.)
For some applications, like credit card transactions and fraud detection, data is stored in real time by Spark in HBase tables. The HBase tables will be on HDFS as well. The same Spark Streaming job will write asynchronously to HDFS Hive tables. One school of thought is never to write to Hive from Spark: write straight to HBase and then read the HBase tables into Hive periodically.

The third component in this layer is the serving layer, which can combine data from the current store (HBase) and the historical store (Hive tables) to give the user visual analytics. That visual analytics can be a real-time dashboard on top of the serving layer. The serving layer could be an in-memory NoSQL offering, or data from HBase (red box) combined with Hive tables.

I am not aware of any industrial-strength real-time dashboard. The idea is that one uses such a dashboard in real time. "Dashboard" in this sense means a general-purpose API to a data store of some type, like the serving layer, providing visual analytics in real time on demand, combining real-time data and aggregate views. As usual, the devil is in the detail.

Let me know your thoughts. Anyway, this is a first-cut pattern.

Dr Mich Talebzadeh
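As an illustrative sketch only (not from the thread, and ignoring the actual HBase/Hive APIs), the serving-layer merge described above can be modelled in plain Scala: the historical view from Hive and the recent deltas from HBase are hypothetical in-memory maps here, keyed by a made-up account id.

```scala
// Toy model of the serving layer: merge a historical view (batch layer,
// e.g. Hive tables) with a recent view (e.g. HBase, fed by Spark Streaming).
// Both maps are hypothetical stand-ins for real table scans.
object ServingLayer {
  def merge(historical: Map[String, Long], recent: Map[String, Long]): Map[String, Long] =
    (historical.keySet ++ recent.keySet).map { k =>
      // Recent deltas are added on top of the historical aggregate.
      k -> (historical.getOrElse(k, 0L) + recent.getOrElse(k, 0L))
    }.toMap

  def main(args: Array[String]): Unit = {
    val hist  = Map("acct1" -> 100L, "acct2" -> 40L) // batch-layer aggregates
    val fresh = Map("acct2" -> 5L, "acct3" -> 1L)    // streamed since last batch run
    // merge yields acct1 -> 100, acct2 -> 45, acct3 -> 1
    println(ServingLayer.merge(hist, fresh))
  }
}
```

A real implementation would replace the two maps with an HBase scan and a Hive query, but the dashboard-facing merge logic stays the same shape.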
On 29 August 2016 at 18:53, Bhaarat Sharma <bhaara...@gmail.com> wrote:

Hi Mich,

This is really helpful. I'm trying to wrap my head around the last diagram you shared (the one with Kafka). In that diagram Spark Streaming is pushing data to HDFS and NoSQL. However, I'm confused by the "Real Time Queries, Dashboards" annotation. Based on this diagram, will real-time queries be running on Spark or HBase?

PS: My intention was not to steer the conversation away from what Ashok asked, but I found the diagrams shared by Mich very insightful.

On Sun, Aug 28, 2016 at 7:18 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Hi,

In terms of positioning, Spark is really the first Big Data platform to integrate batch, streaming and interactive computations in a unified framework. What this boils down to is that whichever way one looks at it, there is somewhere Spark can make a contribution. In general, there are a few design patterns common to Big Data:

- *ETL & Batch*

The first one is the most common, with established tools like Sqoop and Talend for ETL and HDFS for storage of some kind. Spark can be used as the execution engine for Hive at the storage level, which actually makes it a truly vendor-independent processing engine (BTW, Impala, Tez and LLAP are offered by vendors). Personally, I use Spark at the ETL layer, extracting data from sources through plug-ins (JDBC and others) and storing it on HDFS in some format.

- *Batch, real time plus Analytics*

In this pattern you have data coming in in real time and you want to query it in real time through a real-time dashboard.
HDFS is not ideal for updating data in real time, nor for random access of data. The source could be all sorts of web servers, requiring a Flume agent. At the storage layer we are probably looking at something like HBase. The crucial point is that saved data needs to be ready for queries immediately. The dashboards require HBase APIs. The analytics can be done through Hive, again running on the Spark engine. Note again that we should ideally process batch and real time separately.

- *Real time / Streaming*

This is most relevant to Spark as we move to near real time, which is where Spark excels. We need to capture the incoming events (logs, sensor data, pricing, emails) through interfaces like Kafka, message queues, etc., and process these events with minimum latency. Again, Spark is a very good candidate here with its Spark Streaming and micro-batching capabilities. There are others, like Storm and Flink, that are event-based, but you don't hear much about them. For a streaming architecture you also need to sink data in real time to something like HBase, Cassandra (?) and others as the real-time store, or to forever storage such as HDFS or Hive.

In general there is also the *Lambda Architecture*, which is designed for streaming analytics. The streaming data ends up in both the batch layer and the speed layer. The batch layer is used to answer batch queries; the speed layer is used to handle fast/real-time queries. This model is really cool, as Spark Streaming can feed both the batch layer and the speed layer.
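The dual-write idea above can be sketched without Spark at all. In this plain-Scala toy (hypothetical string events; no real Kafka or HDFS), every event lands in both an append-only master log (the batch layer) and an incrementally maintained view (the speed layer), and a batch recomputation from the log must agree with the speed view:

```scala
// Toy lambda architecture: each event is written to the batch layer's
// master log AND folded into the speed layer's running counts.
object Lambda {
  type View = Map[String, Long]

  final case class Layers(masterLog: Vector[String], speedView: View)

  def ingest(l: Layers, event: String): Layers =
    Layers(
      masterLog = l.masterLog :+ event, // batch layer: append-only, never mutated
      speedView = l.speedView.updated(event, l.speedView.getOrElse(event, 0L) + 1L) // speed layer
    )

  // Batch recomputation from the master log (what a Hive/Spark batch job would do).
  def recompute(log: Vector[String]): View =
    log.groupBy(identity).map { case (k, v) => k -> v.size.toLong }

  def main(args: Array[String]): Unit = {
    val events = Seq("click", "buy", "click")
    val end = events.foldLeft(Layers(Vector.empty, Map.empty))(ingest)
    // The incrementally maintained speed view agrees with the batch recomputation.
    assert(recompute(end.masterLog) == end.speedView)
  }
}
```

In the real architecture, Spark Streaming plays the role of `ingest` (writing to HDFS and to HBase in the same micro-batch), and the periodic batch job plays the role of `recompute`.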
At a high level this looks like the diagram from http://lambda-architecture.net/:

<image.png>

My favourite would be something like the diagram below, with Spark playing a major role:

<LambdaArchitecture.png>

HTH

Dr Mich Talebzadeh

On 28 August 2016 at 19:43, Sivakumaran S <siva.kuma...@me.com> wrote:

Spark fits best for processing. Depending on the use case, though, you could expand the scope of Spark to moving data, using the native connectors. The only thing Spark is not is storage. Connectors are available for most storage options, though.

Regards,

Sivakumaran S

On 28-Aug-2016, at 6:04 PM, Ashok Kumar <ashok34...@yahoo.com.INVALID> wrote:

Hi,

There are design patterns that use Spark extensively. I am new to this area, so I would appreciate it if someone could explain where Spark fits in, especially within faster or streaming use cases.

What are the best practices involving Spark?
Is it always best to deploy it as the processing engine?

For example, when we have the pattern

Input Data -> Data in Motion -> Processing -> Storage

where does Spark best fit in?

Thanking you
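One way to see where Spark sits in the pattern above is as the Processing stage only. The sketch below (plain Scala; the stage functions are hypothetical stand-ins, not real APIs) composes the four stages, with a word count standing in for Spark's job:

```scala
// Hypothetical stand-ins for the four stages of the pattern.
// In practice: ingest = a producer, motion = Kafka/Flume,
// process = Spark, store = HBase/HDFS.
object Pipeline {
  def ingest(raw: Seq[String]): Seq[String] = raw     // Input Data
  def motion(in: Seq[String]): Seq[String] = in       // Data in Motion (e.g. a Kafka topic)

  // Processing: Spark's natural role, shown here as a toy word count.
  def process(in: Seq[String]): Seq[(String, Int)] =
    in.flatMap(_.split("\\s+")).groupBy(identity).map { case (w, ws) => w -> ws.size }.toSeq

  def store(rows: Seq[(String, Int)]): Map[String, Int] = rows.toMap // Storage (e.g. HBase)

  def run(raw: Seq[String]): Map[String, Int] =
    store(process(motion(ingest(raw))))
}
```

The point of the composition is that Spark replaces only the `process` function; swapping Kafka for Flume, or HBase for Cassandra, changes the neighbouring stages without touching the Spark code.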