We process millions of records using Kafka, Elasticsearch, Accumulo,
Mesos, Spark & Vertica.

There's a pattern for this type of pipeline today called SMACK; more about
it here --
http://www.slideshare.net/akirillov/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka


On Fri, Sep 30, 2016 at 4:55 PM, Ashok Kumar <ashok34...@yahoo.com.invalid>
wrote:

> Can one design a fast pipeline with Kafka, Spark Streaming and HBase or
> something similar?
>
>
>
>
>
> On Friday, 30 September 2016, 17:17, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> I have designed this prototype for a risk business. Here I would like to
> discuss issues with the batch layer. *Apologies for being long-winded.*
>
> *Business objective*
>
> Reduce risk in the credit business while making better credit and trading
> decisions. Specifically, to identify risk trends within certain years of
> trading data. For example, measure the risk exposure in a given portfolio by
> industry, region, credit rating and other parameters. At the macroscopic
> level, analyze data across market sectors over a given time horizon to
> assess risk changes.
>
> *Deliverable*
> Enable real time and batch analysis of risk data
>
> *Batch technology stack used*
> Kafka -> ZooKeeper, Flume, HDFS (raw data), Hive, cron, Spark as the query
> tool, Zeppelin
>
> *Test volumes for POC*
> 1 message queue (CSV format), 100 stock prices streaming in every 2
> seconds, 180K prices per hour, 4 million+ per day
>
>
>    1. Prices to Kafka -> ZooKeeper -> Flume -> HDFS
>    2. HDFS daily partition for that day's data
>    3. Hive external table pointing at the partitioned HDFS location
>    4. Hive managed table populated every 15 minutes via cron from the Hive
>    external table (table type ORC, partitioned by date). This is purely a
>    Hive job (see the sketch after this list). The Hive table is populated
>    using insert/overwrite for that day to avoid boundary value/missing data
>    etc.
>    5. Typical batch ingestion time (Hive table populated from HDFS files)
>    ~ 2 minutes
>    6. Data in the Hive table has 15 minutes' latency
>    7. Zeppelin to be used as the UI with Spark
>
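> A minimal sketch of steps 3 and 4, expressed here through the Spark SQL API
> for readability (the real step 4 is a plain HiveQL job driven by cron). Table
> and column names (prices_ext, prices_orc, ticker, price, ts, trade_date) are
> illustrative assumptions, not the actual schema:
>
>     import org.apache.spark.sql.SparkSession
>
>     val spark = SparkSession.builder()
>       .appName("price-batch-ingest")
>       .enableHiveSupport()
>       .getOrCreate()
>
>     // Step 3: external table over the raw CSV files landed by Flume,
>     // partitioned by day (new partitions registered with ALTER TABLE ... ADD PARTITION)
>     spark.sql("""
>       CREATE EXTERNAL TABLE IF NOT EXISTS prices_ext (
>         ticker STRING, price DOUBLE, ts TIMESTAMP)
>       PARTITIONED BY (trade_date STRING)
>       ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
>       LOCATION '/data/prices'""")
>
>     // Step 4: managed ORC table; every 15 minutes the cron job rewrites
>     // today's partition in full, so it never holds a partial slice of the day
>     spark.sql("""
>       CREATE TABLE IF NOT EXISTS prices_orc (
>         ticker STRING, price DOUBLE, ts TIMESTAMP)
>       PARTITIONED BY (trade_date STRING)
>       STORED AS ORC""")
>
>     val today = java.time.LocalDate.now.toString   // e.g. 2016-09-30
>     spark.sql(s"""
>       INSERT OVERWRITE TABLE prices_orc PARTITION (trade_date = '$today')
>       SELECT ticker, price, ts FROM prices_ext
>       WHERE trade_date = '$today'""")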
>
> Zeppelin will use Spark SQL (on the Spark Thrift Server) and the Spark shell.
> Within the Spark shell, users can access the batch tables in Hive, *or* they
> can access the raw data on HDFS files directly, which gives them *real time
> access* (not to be confused with the speed layer). Using a typical query with
> Spark, seeing the last 15 minutes of real time data (T-15 to now) takes 1
> min. Running the same query (my typical query, not a user query) on the Hive
> tables, this time using Spark, takes 6 seconds.
>
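> Roughly, the two access paths from the Spark shell look like this (schema,
> paths and table names are the same assumed ones as in the sketch above, and
> `spark` is the session predefined in the shell):
>
>     import org.apache.spark.sql.functions._
>     import org.apache.spark.sql.types._
>
>     val priceSchema = StructType(Seq(
>       StructField("ticker", StringType),
>       StructField("price",  DoubleType),
>       StructField("ts",     TimestampType)))
>
>     val today = java.time.LocalDate.now.toString
>
>     // Real time path: today's raw CSV files straight off HDFS (~1 min here)
>     val raw = spark.read.schema(priceSchema).csv(s"/data/prices/trade_date=$today")
>     raw.filter(unix_timestamp(col("ts")) >= unix_timestamp() - 15 * 60)
>       .groupBy("ticker").agg(max("price"), min("price")).show()
>
>     // Batch path: same aggregation against the ORC-backed Hive table
>     // (~6 s, but up to 15 minutes stale)
>     spark.table("prices_orc").filter(col("trade_date") === today)
>       .groupBy("ticker").agg(max("price"), min("price")).show()
>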
> However, there are some design concerns:
>
>
>    1. Zeppelin starts slowing down by the end of the day. Sometimes it
>    throws a broken pipe message. I resolve this by restarting the Zeppelin
>    daemon. Potential show stopper.
>    2. As the volume of data increases throughout the day, performance
>    becomes an issue.
>    3. Every 15 minutes when the cron starts, Hive insert/overwrites can
>    potentially conflict with users running queries from Zeppelin/Spark. I am
>    sure that with exclusive writes, Hive will block all users from accessing
>    these tables (at partition level) until the insert overwrite is done. This
>    could be improved by finer partitioning of the Hive tables (see the sketch
>    after this list) or by relaxing the ingestion interval to half an hour or
>    one hour at the cost of more lag. I tried Parquet tables in Hive but saw
>    no real difference in performance. I have thought of replacing Hive with
>    HBase etc., but that brings new complications of its own without
>    necessarily solving the issue.
>    4. I am not convinced this design can scale up easily to 5 times the
>    volume of data.
>    5. We will also get real time data from RDBMS tables (Oracle, Sybase,
>    MSSQL) using replication technologies such as SAP Replication Server.
>    These currently deliver change log data to Hive tables, so there is a
>    compatibility issue here.
>
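> One possible mitigation for concern 3, again only a sketch with the same
> assumed names as above: partition the ORC table by day *and* 15-minute slot,
> so each cron run only overwrites the slot it has just closed and readers of
> earlier slots are left alone.
>
>     spark.sql("""
>       CREATE TABLE IF NOT EXISTS prices_orc_slotted (
>         ticker STRING, price DOUBLE, ts TIMESTAMP)
>       PARTITIONED BY (trade_date STRING, slot INT)
>       STORED AS ORC""")
>
>     val today = java.time.LocalDate.now.toString
>     // 15-minute slot of the day (0..95); each run processes the slot that
>     // just closed (midnight edge ignored in this sketch)
>     val slot = java.time.LocalTime.now.toSecondOfDay / 900 - 1
>
>     spark.sql(s"""
>       INSERT OVERWRITE TABLE prices_orc_slotted
>       PARTITION (trade_date = '$today', slot = $slot)
>       SELECT ticker, price, ts FROM prices_ext
>       WHERE trade_date = '$today'
>         AND hour(ts) * 4 + floor(minute(ts) / 15) = $slot""")
>
> The trade-off is more, smaller partitions (96 per day), but the exclusive
> write then covers only the slot being rewritten rather than the whole day.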
>
> So I am sure some members can add useful ideas :)
>
> Thanks
>
> Mich
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>


-- 

Rodrick Brown / DevOps

9174456839 / rodr...@orchardplatform.com

Orchard Platform <http://www.orchardplatform.com/>
101 5th Avenue, 4th Floor, New York, NY
