Can one design a fast pipeline with Kafka, Spark Streaming and HBase, or 
something similar?
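
For example, roughly along these lines (just a sketch of what I mean, not a 
tested design; the broker, topic, table and column-family names are 
placeholders, and it assumes the Kafka 0.8 direct-stream API and the HBase 
1.x client):

import kafka.serializer.StringDecoder
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object PricesToHBase {
  def main(args: Array[String]): Unit = {
    // 2-second micro-batches to match the 2-second price feed
    val ssc = new StreamingContext(new SparkConf().setAppName("PricesToHBase"), Seconds(2))

    // Read the csv price messages straight off Kafka (placeholder broker/topic)
    val kafkaParams = Map("metadata.broker.list" -> "kafkabroker:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("prices"))

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One HBase connection per partition, not per record
        val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("prices"))
        records.foreach { case (_, csv) =>
          // Assumed message layout: ticker,timeissued,price
          val Array(ticker, timeissued, price) = csv.split(",")
          val put = new Put(Bytes.toBytes(s"$ticker-$timeissued"))
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("price"), Bytes.toBytes(price))
          table.put(put)
        }
        table.close()
        conn.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}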


 

    On Friday, 30 September 2016, 17:17, Mich Talebzadeh 
<mich.talebza...@gmail.com> wrote:
 

 I have designed this prototype for a risk business. Here I would like to 
discuss issues with the batch layer. Apologies for being long-winded.

Business objective
Reduce risk in the credit business while making better credit and trading 
decisions. Specifically, to identify risk trends within certain years of 
trading data. For example, measure the risk exposure in a given portfolio by 
industry, region, credit rating and other parameters. At the macroscopic 
level, analyze data across market sectors, over a given time horizon, to 
assess risk changes.

Deliverable
Enable real-time and batch analysis of risk data.

Batch technology stack used
Kafka -> ZooKeeper, Flume, HDFS (raw data), Hive, cron, Spark as the query 
tool, Zeppelin.

Test volumes for POC
1 message queue (csv format), 100 stock prices streaming in every 2 seconds, 
180K prices per hour, 4 million+ per day.
   - prices to Kafka -> ZooKeeper -> Flume -> HDFS
   - HDFS daily partition for that day's data
   - Hive external table pointing at the partitioned HDFS location
   - Hive managed table (ORC, partitioned by date) populated every 15 minutes 
via cron from the Hive external table. This is purely a Hive job. The table is 
populated using insert/overwrite for that day to avoid boundary-value/missing 
data etc. (a sketch of this step follows the list)
   - Typical batch ingestion time (Hive table populated from HDFS files) ~ 2 
minutes
   - Data in the Hive table has 15 minutes of latency
   - Zeppelin to be used as the UI with Spark 
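
For concreteness, a minimal sketch of the external/managed table pair and the 
15-minute refresh. The table names, columns and HDFS path are illustrative 
only, not the actual schema; in the prototype this runs as a plain Hive script 
from cron, but it is shown here via spark.sql (Spark 2.x) to keep everything 
in one place:

// Sketch only: table/column names and the HDFS path are made up.
// External table over the raw csv files that Flume lands in HDFS.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS prices_ext (
    ticker STRING, timeissued TIMESTAMP, price DOUBLE)
  PARTITIONED BY (trade_date STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION 'hdfs://namenode:9000/data/prices'""")

// Managed ORC table that the batch queries actually hit.
spark.sql("""
  CREATE TABLE IF NOT EXISTS prices_orc (
    ticker STRING, timeissued TIMESTAMP, price DOUBLE)
  PARTITIONED BY (trade_date STRING)
  STORED AS ORC""")

// Every 15 minutes (cron): register today's HDFS partition, then
// insert/overwrite only today's partition of the ORC table.
val today = java.time.LocalDate.now.toString
spark.sql(s"ALTER TABLE prices_ext ADD IF NOT EXISTS PARTITION (trade_date='$today')")
spark.sql(s"""
  INSERT OVERWRITE TABLE prices_orc PARTITION (trade_date='$today')
  SELECT ticker, timeissued, price
  FROM prices_ext
  WHERE trade_date='$today'""")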

Zeppelin will use Spark SQL (on the Spark Thrift Server) and the Spark shell. 
Within the Spark shell, users can access the batch tables in Hive, or they 
have the choice of accessing the raw data on the HDFS files, which gives them 
real-time access (not to be confused with the speed layer). Using my typical 
query with Spark, looking at the last 15 minutes of real-time data (T-15 to 
now) on the raw files takes about 1 minute. Running the same query (my typical 
query, not a user query) on the Hive tables, again with Spark, takes 6 
seconds.
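
As an illustration only (paths, table and column names are again made up, 
assuming Spark 2.x and the illustrative prices_orc table from the sketch 
above), the two access paths from the Spark shell look roughly like this:

// Real-time path: raw csv files for today's HDFS partition (illustrative
// schema; timeissued assumed to be 'yyyy-MM-dd HH:mm:ss' strings).
val raw = spark.read
  .csv("hdfs://namenode:9000/data/prices/trade_date=2016-09-30")
  .toDF("ticker", "timeissued", "price")
raw.createOrReplaceTempView("prices_raw")

// Last 15 minutes (T-15 to now) off the raw files.
spark.sql("""
  SELECT ticker, COUNT(*) AS ticks, AVG(price) AS avg_price
  FROM prices_raw
  WHERE unix_timestamp(timeissued) >= unix_timestamp() - 15 * 60
  GROUP BY ticker""").show()

// Batch path: the same aggregation against the ORC-backed Hive table.
spark.sql("""
  SELECT ticker, COUNT(*) AS ticks, AVG(price) AS avg_price
  FROM prices_orc
  WHERE trade_date = current_date()
  GROUP BY ticker""").show()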
However, there are some design concerns:
   - Zeppelin starts slowing down by the end of the day. Sometimes it throws a 
broken-pipe message, which I resolve by restarting the Zeppelin daemon. 
Potential show-stopper
   - As the volume of data increases throughout the day, performance becomes an 
issue
   - Every 15 minutes when the cron job starts, the Hive insert/overwrites can 
potentially conflict with users throwing queries from Zeppelin/Spark. I am 
sure that with exclusive write locks, Hive will block all users from accessing 
these tables (at the partition level) until the insert/overwrite is done. This 
can be improved by better partitioning of the Hive tables (see the sketch 
after this list) or by relaxing the ingestion interval to half an hour or one 
hour, at the cost of more lag. I tried Parquet tables in Hive but saw no real 
difference in performance. I have thought of replacing Hive with HBase etc., 
but that brings in new complications without necessarily solving the issue.
   - I am not convinced this design can scale up easily to 5 times the current 
volume of data. 
   - We will also get real-time data from RDBMS tables (Oracle, Sybase, MSSQL) 
using replication technologies such as SAP Replication Server. These currently 
deliver change-log data to Hive tables, so there is some compatibility issue 
here. 
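
On the partitioning point above, a minimal sketch of what finer-grained 
partitioning could look like (names illustrative, not a tested design, reusing 
the made-up prices_ext table from earlier): adding an hour column so each 
15-minute insert/overwrite only touches the current hour's partition rather 
than the whole day:

// Hypothetical finer-grained layout: partition by date AND hour, so the
// periodic insert/overwrite locks at most the current hour's partition.
spark.sql("""
  CREATE TABLE IF NOT EXISTS prices_orc_hourly (
    ticker STRING, timeissued TIMESTAMP, price DOUBLE)
  PARTITIONED BY (trade_date STRING, trade_hour INT)
  STORED AS ORC""")

val today = java.time.LocalDate.now.toString
val hour  = java.time.LocalTime.now.getHour
spark.sql(s"""
  INSERT OVERWRITE TABLE prices_orc_hourly
  PARTITION (trade_date='$today', trade_hour=$hour)
  SELECT ticker, timeissued, price
  FROM prices_ext
  WHERE trade_date='$today' AND hour(timeissued) = $hour""")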

So I am sure some members can add useful ideas :)
Thanks
Mich

 LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.

   
