1. Reading from Kafka has exactly-once guarantees - we are using it in production today (with the direct receiver)
* you will probably have 2 topics; loading both into Spark and joining / unioning as needed is not an issue (see the rough two-topic sketch below, before my signature)
* tons of optimizations you can do there, assuming everything else works
2. For ad-hoc queries I would say you absolutely need to look at external storage
* querying the DStream or Spark's RDDs directly should be done mostly for aggregates/metrics, not by users
* if you look at HBase or Cassandra for storage then 50k writes/sec are not a problem at all, especially combined with a smart client that does batch puts (like asynchbase<https://github.com/OpenTSDB/asynchbase>) - see the write-path sketch below
* you could also consider writing the updates to another Kafka topic and have a separate component that updates the DB, if you see further optimizations you can make there
3. By stats I assume you mean metrics (operational or business)
* there are multiple ways to do this, however I would not encourage you to query Spark directly, especially if you need an archive/history of your datapoints
* we are using OpenTSDB (we already have an HBase cluster) + Grafana for dashboarding
* collecting the metrics is a bit hairy in a streaming app - we have experimented with both accumulators and RDDs dedicated to metrics, and chose the RDDs that write to OpenTSDB using foreachRDD
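
For point 1, a minimal sketch of the two-topic direct stream setup (Scala, against the Spark 1.4/1.5-era spark-streaming-kafka API; broker addresses and topic names are placeholders you would replace):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("portfolio-streaming")
val ssc = new StreamingContext(conf, Seconds(1))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

// One direct stream can subscribe to both topics at once...
val both = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("trades", "ticks"))

// ...or keep them as separate streams and union/join the DStreams as needed.
val trades = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("trades"))
val ticks = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("ticks"))
val combined = trades.union(ticks)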
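
And the general shape of the write path for points 2 and 3 - this is only a sketch, and PortfolioStore is a made-up placeholder for whatever batching client you end up with (asynchbase, a Cassandra session, a Kafka producer, or an OpenTSDB client):

// combined is the DStream from the sketch above (or any DStream of updates)
combined.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // hypothetical client, created once per partition so it lives on the executor
    val store = PortfolioStore.getOrCreate()
    // group the partition's records and issue batch puts (hypothetical API)
    records.grouped(500).foreach(batch => store.batchPut(batch))
    store.flush()
  }
}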
-adrian
________________________________
From: Thúy Hằng Lê <thuyhang...@gmail.com>
Sent: Sunday, September 20, 2015 7:26 AM
To: Jörn Franke
Cc: user@spark.apache.org
Subject: Re: Using Spark for portfolio manager app

Thanks Adrian and Jorn for the answers.
Yes, you're right, there are a lot of things I need to consider if I want to use Spark for my app. I still have a few concerns/questions about your information:

1/ I need to combine the trading stream with the tick stream, and I am planning to use Kafka for that. If I am using approach #2 (Direct Approach) in this tutorial https://spark.apache.org/docs/latest/streaming-kafka-integration.html will I receive exactly-once semantics? Or do I have to add some logic in my code to achieve that? As per your suggestion of using delta updates, exactly-once semantics are required for this application.

2/ For ad-hoc queries, do I have to write the output of Spark to external storage and query on that? Is there any way to do ad-hoc queries on Spark itself? My application could have 50k updates per second at peak time, and persisting to external storage would lead to high latency in my app.

3/ How can I get real-time statistics from Spark? In most of the Spark Streaming examples, the statistics are echoed to stdout. However, I want to display those statistics in a GUI - is there any way to retrieve data from Spark directly without using external storage?

2015-09-19 16:23 GMT+07:00 Jörn Franke <jornfra...@gmail.com>:

If you want to let your users query their portfolios then you may want to think about storing the current state of the portfolios in HBase/Phoenix; alternatively, a cluster of relational databases can make sense. For the rest you may use Spark.

On Sat., Sep. 19, 2015 at 4:43, Thúy Hằng Lê <thuyhang...@gmail.com> wrote:

Hi all,

I am going to build a financial application for a portfolio manager, where each portfolio contains a list of stocks, the number of shares purchased, and the purchase price. Another source of information is stock prices from market data. The application needs to calculate the real-time gain or loss of each stock in each portfolio (compared to the purchase price).

I am new to Spark; I know that using Spark Streaming I can aggregate portfolio positions in real-time, for example user A contains:
- 100 IBM shares with transactionValue=$15000
- 500 AAPL shares with transactionValue=$11400

Now, given that the stock prices change in real-time too, e.g. if the IBM price is at 151, I want to update its gain or loss:
gainOrLost(IBM) = 151*100 - 15000 = $100
(a rough sketch of this computation is at the bottom of this mail)

My questions are:
* What is the best method to combine 2 real-time streams (transactions made by users and market pricing data) in Spark?
* How can I run real-time ad-hoc SQL against the portfolio's positions? Is there any way I can do SQL on the output of Spark Streaming? For example: select sum(gainOrLost) from portfolio where user='A';
* What are the preferred external storage options for Spark in this use case?
* Is Spark the right choice for my use case?
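
For reference, a rough sketch of the computation I have in mind (Scala; the Position class and field names are made up, and I assume the two streams are already pair DStreams keyed by symbol):

case class Position(user: String, symbol: String, shares: Long, costBasis: Double)

// positions: DStream[(String, Position)] keyed by symbol (from the trading stream)
// prices:    DStream[(String, Double)]   keyed by symbol (from the market data stream)
val gainOrLost = positions.join(prices).map { case (symbol, (pos, price)) =>
  ((pos.user, symbol), price * pos.shares - pos.costBasis)
}
// e.g. 100 IBM shares bought for $15000, IBM now at 151: 151 * 100 - 15000 = $100

(I understand a per-batch join only matches records arriving in the same batch; presumably the positions would need to be kept as state, e.g. with updateStateByKey, in a real version.)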