Hi, I have been working on a Lambda Architecture for trading systems.
I think I have completed the dry runs for testing the modules. For the batch layer the criterion is a day's lag (one-day-old data). This is acceptable for users who come from a BI background using Tableau, but I think we can do better.

*For batch layer* the design is as follows:

1. Trade prices are collected by Kafka every 2 seconds, in a batch of 20 prices per security (ticker). Each row of the CSV has UUID, ticker, timestamp and price, comma separated.
2. So in a minute we have 600 prices, in an hour 36K and in a day 864K prices.
3. Flume is used to put these "raw" prices on HDFS (source -> Kafka -> Flume -> HDFS).
4. Every 15 minutes (24x7) a cron job picks up these files, creates one large file from the prices in the small files and inserts it into an HBase table. This is done via ImportTsv and is pretty fast (under 2 minutes). Reading the individual files is terribly slow, as ImportTsv uses map-reduce.
5. As an additional test we also have a cron job that picks up these individual raw files every 15 minutes and puts them into a Hive table (this will be decommissioned eventually).
6. We have views built in Hive and Phoenix on top of the HBase tables. Phoenix works OK in Zeppelin (albeit with less rich SQL than Hive). We can also use Spark on the Hive tables in Zeppelin, and Tableau with Hive and Spark plus Tableau Server's data-caching facilities.
7. It is my understanding that the new release of Hive will have LLAP as its in-memory database.

So in summary the batch layer as of now offers all data with a 15-minute lag.

In designing this layer I had to take into account ease of use and our users' familiarity with Tableau and (eventually) Zeppelin. In addition, certain power users can use Spark FP in Zeppelin as well. The Tableau users tend to do analytics at a macro level; in other words, what they want is speedy queries, not necessarily the latest data.

*For speed layer*

1. Kafka feeds data into Spark Streaming (SS).
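As a sanity check on the batch-layer figures above, here is a minimal Python sketch of the raw row format and the arrival-rate arithmetic. The field names and the sample row values are illustrative, not taken from the live feed:

```python
import uuid

# One raw CSV row per price tick: UUID, ticker, timestamp, price
def parse_row(line):
    """Split a comma-separated raw price row into its four fields."""
    row_id, ticker, timestamp, price = line.split(",")
    return {"uuid": row_id, "ticker": ticker,
            "timestamp": timestamp, "price": float(price)}

row = parse_row(f"{uuid.uuid4()},S16,Mon Oct 10 22:37:52 BST 2016,99.25724")
assert row["ticker"] == "S16" and row["price"] == 99.25724

# Arrival rate: a batch of 20 prices every 2 seconds per the design
PRICES_PER_BATCH = 20
BATCH_INTERVAL_S = 2
per_minute = PRICES_PER_BATCH * (60 // BATCH_INTERVAL_S)   # 600
per_hour = per_minute * 60                                 # 36,000
per_day = per_hour * 24                                    # 864,000
```

The numbers reproduce the 600/36K/864K figures quoted for a minute, an hour and a day.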
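The 15-minute compaction step (many small raw files concatenated into one large file so ImportTsv runs a single map-reduce job) could be sketched as below. The directory layout and paths are my own assumptions, not the actual cron job:

```python
from pathlib import Path

def compact(raw_dir, out_file):
    """Concatenate every small raw CSV file in raw_dir into one large
    file, so ImportTsv loads it in a single pass instead of one slow
    map-reduce job per small file. Returns the number of files merged."""
    small_files = sorted(Path(raw_dir).glob("*.csv"))
    with open(out_file, "w") as out:
        for f in small_files:
            out.write(f.read_text())
    return len(small_files)

# Usage (hypothetical paths):
# compact("/staging/prices/2016-10-10-2230", "/staging/compacted/batch_2230.txt")
```

In the real pipeline the merged file would then be handed to ImportTsv for the HBase load.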
2. For each RDD, prices are analysed row by row.
3. Calculations are done on the prices, and if they satisfy the criteria a message is sent to an alert (currently a simple beep). The row is also displayed for these particular prices, to alert the trader.
4. The details of these trades are flushed to another HBase table in real time, impressively fast:

Mon Oct 10 22:37:52 BST 2016, Price on S16 hit 99.25724
Mon Oct 10 22:37:56 BST 2016, Price on S05 hit 99.5992

So this is the basics of the speed layer.

The serving layer will have combined data from both the batch and speed layers. We are in the process of building a real-time dashboard as well. I thought about it a bit and decided that extracting HBase data through the Hive or Phoenix skins, together with Spark, provides what we need. I have not yet managed to use Phoenix on Spark, and using HBase directly from Spark 2 is not available; even if it were, an SQL skin is better for the visualisation tools.

Sorry about this long monologue. Appreciate any feedback.

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
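As a footnote, the per-row check in step 3 of the speed layer could look roughly like this in Python. The threshold criterion is invented for illustration (the real calculations are not shown in this post), but the message format matches the sample output above:

```python
def check_price(ticker, timestamp, price, threshold=99.0):
    """Return an alert message when a price satisfies the (illustrative)
    criterion of crossing a threshold, else None. In the real pipeline
    this runs for every row of each RDD, with alerted rows also flushed
    to an HBase table."""
    if price > threshold:
        return f"{timestamp}, Price on {ticker} hit {price}"
    return None

msg = check_price("S16", "Mon Oct 10 22:37:52 BST 2016", 99.25724)
```

Here `msg` reproduces the first sample line; a price below the threshold returns None and raises no alert.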
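One way to picture the serving layer's job of combining the two HBase tables: merge batch and speed rows, deduplicating on UUID, since the 15-minute batch eventually re-covers what the speed layer already wrote. A minimal sketch under my own simplified dict-of-rows shape (in practice this would be a Hive/Phoenix view or a Spark join, not in-memory Python):

```python
def serve(batch_rows, speed_rows):
    """Combine batch-layer and speed-layer rows into a single view,
    keyed on UUID so a price that reached both layers appears once.
    Speed-layer rows are applied last, keeping the freshest copy."""
    merged = {}
    for row in batch_rows + speed_rows:
        merged[row["uuid"]] = row
    return sorted(merged.values(), key=lambda r: r["uuid"])

batch = [{"uuid": "a", "ticker": "S16", "price": 99.25724},
         {"uuid": "b", "ticker": "S05", "price": 99.5992}]
speed = [{"uuid": "b", "ticker": "S05", "price": 99.5992},  # already batched
         {"uuid": "c", "ticker": "S01", "price": 98.1}]     # speed layer only
view = serve(batch, speed)
```

The combined view holds three distinct rows: the overlap on UUID "b" is counted once.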