  1.  reading from Kafka has exactly-once guarantees - we are using it in 
production today (with the direct receiver)
     *   you will probably have 2 topics; loading both into Spark and joining 
/ unioning them as needed is not an issue
     *   there are tons of optimizations you can do there, assuming everything 
else works
  2.  for ad-hoc queries I would say you absolutely need to look at external 
storage
     *   querying the DStream or Spark's RDDs directly should be done mostly 
for aggregates/metrics, not by users
     *   if you look at HBase or Cassandra for storage, then 50k writes/sec is 
not a problem at all, especially combined with a smart client that does batch 
puts (like asynchbase <https://github.com/OpenTSDB/asynchbase>)
     *   you could also consider writing the updates to another Kafka topic and 
having a different component update the DB, if you think of other 
optimisations there
  3.  by stats I assume you mean metrics (operational or business)
     *   there are multiple ways to do this, however I would not encourage you 
to query Spark directly, especially if you need an archive/history of your 
     *   we are using OpenTSDB (we already have an HBase cluster) + Grafana for 
     *   collecting the metrics is a bit hairy in a streaming app - we have 
experimented with both accumulators and RDDs specific for metrics - we chose 
the RDDs that write to OpenTSDB using foreachRDD
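
To illustrate the "smart client that does batch puts" idea from point 2, here is a minimal sketch in plain Python. It is not asynchbase or any real HBase/Cassandra client: `BatchingWriter`, `sink` and `batch_size` are hypothetical names, and the sink is just a list so the behaviour can be checked locally. The point is only that many small updates reach the store as a few larger writes.

```python
class BatchingWriter:
    """Buffers individual puts and hands them to the sink in batches.

    `sink` is a hypothetical callable standing in for a real bulk-write
    call on an external store; it receives a list of rows.
    """

    def __init__(self, sink, batch_size):
        self.sink = sink
        self.batch_size = batch_size
        self.buffer = []

    def put(self, row):
        # Collect the row and flush once the buffer is full.
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Send whatever is buffered as one batch, then start a new buffer.
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []


# Stand-in for the external store: remember every batch that was sent.
sent_batches = []
writer = BatchingWriter(sink=sent_batches.append, batch_size=2)

for i in range(5):
    writer.put(("A", "IBM", i))
writer.flush()  # flush the remainder at the end of the micro-batch

batch_sizes = [len(batch) for batch in sent_batches]
print(batch_sizes)
```

In a streaming job the final `flush()` would typically be called once per partition at the end of each micro-batch, so no update waits for the next batch to arrive.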


From: Thúy Hằng Lê <thuyhang...@gmail.com>
Sent: Sunday, September 20, 2015 7:26 AM
To: Jörn Franke
Cc: user@spark.apache.org
Subject: Re: Using Spark for portfolio manager app

Thanks Adrian and Jorn for the answers.

Yes, you're right, there are a lot of things I need to consider if I want to 
use Spark for my app.

I still have a few concerns/questions about your information:

1/ I need to combine the trading stream with the tick stream; I am planning to 
use Kafka for that.
If I am using approach #2 (Direct Approach) in this tutorial:

Spark Streaming + Kafka Integration Guide - Spark 1.4.1 ...

Will I receive exactly-once semantics? Or do I have to add some logic in my 
code to achieve that?
As per your suggestion of using delta updates, exactly-once semantics are 
required for this application.
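
One common way to make delta updates safe against replays is to record, next to the state, the highest Kafka offset already applied per topic and partition, and to skip anything at or below it. The sketch below is plain Python, not Spark or Kafka code; `PositionStore` and its fields are hypothetical, and a real store would have to write the delta and the offset atomically for this to hold.

```python
class PositionStore:
    """In-memory stand-in for an external store holding share counts."""

    def __init__(self):
        self.shares = {}       # (user, symbol) -> number of shares
        self.last_offset = {}  # (topic, partition) -> highest applied offset

    def apply(self, topic, partition, offset, user, symbol, delta):
        """Apply a delta once; return False if this offset was already seen."""
        key = (topic, partition)
        if offset <= self.last_offset.get(key, -1):
            return False  # replayed record, ignore it
        position = (user, symbol)
        self.shares[position] = self.shares.get(position, 0) + delta
        self.last_offset[key] = offset
        return True


store = PositionStore()
batch = [
    ("trades", 0, 0, "A", "IBM", 100),
    ("trades", 0, 1, "A", "AAPL", 500),
]

first_pass = [store.apply(*record) for record in batch]
# Simulate the same batch being delivered again after a failure.
second_pass = [store.apply(*record) for record in batch]

print(store.shares)
```

The replayed batch leaves the positions unchanged, which is the property a delta-based design needs when a micro-batch is reprocessed.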

2/ For ad-hoc queries, I must write the output of Spark to external storage and 
query that, right?
Is there any way to do ad-hoc queries on Spark? My application could have 50k 
updates per second at peak time.
Persisting to external storage would lead to high latency in my app.

3/ How do I get real-time statistics from Spark?
In most of the Spark Streaming examples, the statistics are echoed to stdout.
However, I want to display those statistics in a GUI; is there any way to 
retrieve data from Spark directly without using external storage?

2015-09-19 16:23 GMT+07:00 Jörn Franke 

If you want to let your users query their portfolio then you may want to think 
about storing the current state of the portfolios in HBase/Phoenix; 
alternatively, a cluster of relational databases can make sense. 
For the rest you may use Spark.

On Sat, Sep 19, 2015 at 4:43, Thúy Hằng Lê <thuyhang...@gmail.com> wrote:
Hi all,

I am going to build a financial Portfolio Manager application, where each 
portfolio contains a list of stocks, the number of shares purchased, and the 
purchase price.
Another source of information is stock prices from market data. The application 
needs to calculate the real-time gain or loss of each stock in each portfolio 
(compared to the purchase price).

I am new to Spark. I know that using Spark Streaming I can aggregate portfolio 
positions in real time, for example:
            user A contains:
                      - 100 IBM stock with transactionValue=$15000
                      - 500 AAPL stock with transactionValue=$11400

Now, given that the stock prices change in real time too, e.g. if the IBM price 
is at 151, I want to update its gain or loss: 
gainOrLost(IBM) = 151*100 - 15000 = $100
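
The calculation above can be sketched in plain Python as follows. This is only an illustration of the arithmetic, not Spark code; `gain_or_loss`, `positions` and `prices` are hypothetical names, and only the IBM price is taken from the example, so AAPL is left out until a price for it arrives.

```python
from decimal import Decimal


def gain_or_loss(shares, transaction_value, price):
    """Current market value of the position minus what was paid for it."""
    return Decimal(price) * shares - Decimal(transaction_value)


# Positions of user A from the example: symbol -> (shares, transactionValue).
positions = {
    "IBM": (100, "15000"),
    "AAPL": (500, "11400"),
}

# Latest known prices; only IBM has a price in the example.
prices = {"IBM": "151"}

gains = {
    symbol: gain_or_loss(shares, value, prices[symbol])
    for symbol, (shares, value) in positions.items()
    if symbol in prices
}
print(gains)
```

Using `Decimal` rather than floats is a design choice for monetary values, so that amounts like 15100 and 15000 subtract without rounding noise.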

My questions are:

         * What is the best method to combine 2 real-time streams (transactions 
made by users and market pricing data) in Spark?
         * How can I run real-time ad-hoc SQL against the portfolio's positions? 
Is there any way I can do SQL on the output of Spark Streaming?
         For example,
              select sum(gainOrLost) from portfolio where user='A';
         * What are the preferred external storages for Spark in this use case?
         * Is Spark the right choice for my use case?
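
As a rough illustration of the ad-hoc query above, here is a minimal sketch that runs it against an external store once the streaming job has written the per-stock results there. SQLite from the Python standard library is only a stand-in for whatever storage is chosen; the table layout and all rows except user A's IBM gain of 100 are made-up sample data.

```python
import sqlite3

# In-memory database standing in for the external storage.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE portfolio ("
    " user TEXT, symbol TEXT, gainOrLost INTEGER,"
    " PRIMARY KEY (user, symbol))"
)

# Each streaming update overwrites the latest value for (user, symbol).
updates = [
    ("A", "IBM", 100),    # from the example above
    ("A", "AAPL", -400),  # hypothetical value
    ("B", "IBM", 50),     # hypothetical value
]
conn.executemany(
    "INSERT OR REPLACE INTO portfolio (user, symbol, gainOrLost)"
    " VALUES (?, ?, ?)",
    updates,
)

# The ad-hoc query from the question, with the user passed as a parameter.
total = conn.execute(
    "SELECT sum(gainOrLost) FROM portfolio WHERE user = ?", ("A",)
).fetchone()[0]
print(total)
conn.close()
```

Keeping one row per (user, symbol) means the query only ever scans the current state, regardless of how many price updates have flowed through the stream.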
