  1.  Reading from Kafka has exactly-once guarantees - we are using it in production today (with the direct receiver).
     *   You will probably have 2 topics; loading both into Spark and joining/unioning them as needed is not an issue (see the first sketch after this list).
     *   There are tons of optimizations you can do there, assuming everything else works.
  2.  For ad-hoc queries I would say you absolutely need to look at external storage.
     *   Querying the DStream or Spark's RDDs directly should be done mostly for aggregates/metrics, not by users.
     *   If you look at HBase or Cassandra for storage, then 50k writes/sec are not a problem at all, especially combined with a smart client that does batch puts (like asynchbase<https://github.com/OpenTSDB/asynchbase>).
     *   You could also consider writing the updates to another Kafka topic and having a different component that updates the DB, if you think of other optimisations there.
  3.  By stats I assume you mean metrics (operational or business).
     *   There are multiple ways to do this; however, I would not encourage you to query Spark directly, especially if you need an archive/history of your datapoints.
     *   We are using OpenTSDB (we already have an HBase cluster) + Grafana for dashboarding.
     *   Collecting the metrics is a bit hairy in a streaming app - we have experimented with both accumulators and RDDs dedicated to metrics, and chose the RDDs that write to OpenTSDB using foreachRDD (see the second sketch below).
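
To make point 1 concrete, here is a minimal Scala sketch (assuming Spark 1.4's direct Kafka API; the broker list and the topic names "trades" and "ticks" are placeholders for your setup):

// Minimal sketch: two direct Kafka streams, unioned into one DStream.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("portfolio-streaming")
val ssc = new StreamingContext(conf, Seconds(1))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

// Each direct stream tracks Kafka offsets itself (no receiver, no WAL).
val trades = KafkaUtils.createDirectStream[String, String, StringDecoder,
  StringDecoder](ssc, kafkaParams, Set("trades"))
val ticks = KafkaUtils.createDirectStream[String, String, StringDecoder,
  StringDecoder](ssc, kafkaParams, Set("ticks"))

// union works because both are DStream[(String, String)]; key the records
// and join them instead if you need to correlate the two streams.
val combined = trades.union(ticks)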
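
For point 3, a rough sketch of the foreachRDD write-out (not our exact code - metricsStream, a DStream[(String, Double)] of metric name/value pairs, the host, and the tags are all stand-ins; OpenTSDB 2.x exposes an HTTP /api/put endpoint that accepts a JSON array, so you can batch one request per partition):

// Sketch only: push one batched HTTP request per partition to OpenTSDB.
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

metricsStream.foreachRDD { rdd =>
  rdd.foreachPartition { part =>
    val ts = System.currentTimeMillis / 1000
    val points = part.map { case (metric, value) =>
      s"""{"metric":"$metric","timestamp":$ts,"value":$value,"tags":{"app":"portfolio"}}"""
    }.toSeq
    if (points.nonEmpty) {
      val conn = new URL("http://tsdb-host:4242/api/put")
        .openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestMethod("POST")
      conn.setRequestProperty("Content-Type", "application/json")
      conn.setDoOutput(true)
      conn.getOutputStream.write(
        points.mkString("[", ",", "]").getBytes(StandardCharsets.UTF_8))
      conn.getResponseCode // forces the request; check it in real code
      conn.disconnect()
    }
  }
}

The same foreachRDD/foreachPartition shape applies to the HBase/Cassandra writes in point 2: open the connection once per partition, never once per record.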

-adrian

________________________________
From: Thúy Hằng Lê <thuyhang...@gmail.com>
Sent: Sunday, September 20, 2015 7:26 AM
To: Jörn Franke
Cc: user@spark.apache.org
Subject: Re: Using Spark for portfolio manager app

Thanks Adrian and Jörn for the answers.

Yes, you're right, there are a lot of things I need to consider if I want to use Spark for my app.

I still have a few concerns/questions based on your information:

1/ I need to combine the trading stream with the tick stream, and I am planning to use Kafka for that.
If I use approach #2 (Direct Approach) from this tutorial:
https://spark.apache.org/docs/latest/streaming-kafka-integration.html

Will I get exactly-once semantics, or do I have to add some logic to my code to achieve that?
Given your suggestion of using delta updates, exactly-once semantics are required for this application.

2/ For ad-hoc queries, I must write the output of Spark to external storage and query on that, right?
Is there any way to do ad-hoc queries on Spark directly? My application could have 50k updates per second at peak time.
Persisting to external storage would add high latency to my app.

3/ How do I get real-time statistics out of Spark?
In most of the Spark Streaming examples, the statistics are echoed to stdout.
However, I want to display those statistics in a GUI. Is there any way to retrieve data from Spark directly without using external storage?


2015-09-19 16:23 GMT+07:00 Jörn Franke <jornfra...@gmail.com>:

If you want to be able to let your users query their portfolios, then you may want to think about storing the current state of the portfolios in HBase/Phoenix; alternatively, a cluster of relational databases can make sense. For the rest you may use Spark.

On Sat, Sep 19, 2015 at 4:43, Thúy Hằng Lê <thuyhang...@gmail.com> wrote:
Hi all,

I am going to build a financial application for portfolio managers, where each portfolio contains a list of stocks, the number of shares purchased, and the purchase price.
Another source of information is stock prices from market data. The application needs to calculate the real-time gain or loss of each stock in each portfolio (compared to the purchase price).

I am new to Spark. I know that using Spark Streaming I can aggregate portfolio positions in real time, for example:
            user A holds:
                      - 100 IBM shares with transactionValue=$15000
                      - 500 AAPL shares with transactionValue=$11400

Now, given that stock prices change in real time too, e.g. if IBM is at 151, I want to update its gain or loss: gainOrLoss(IBM) = 151*100 - 15000 = $100
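
A rough sketch of what I have in mind (all the names here are made up, and I assume updateStateByKey, which requires checkpointing to be enabled):

// Rough sketch - stream and field names are made up.
// trades: DStream[((String, String), Position)] keyed by (user, symbol)
// ticks:  DStream[(String, Double)] of (symbol, lastPrice)
case class Position(shares: Long, cost: Double)

ssc.checkpoint("/tmp/portfolio-checkpoint") // required by updateStateByKey

val positions = trades.updateStateByKey[Position] { (updates, state) =>
  val current = state.getOrElse(Position(0L, 0.0))
  Some(updates.foldLeft(current) { (p, u) =>
    Position(p.shares + u.shares, p.cost + u.cost)
  })
}

val gainOrLoss = positions
  .map { case ((user, symbol), p) => (symbol, (user, p)) }
  .join(ticks) // per-batch join: only symbols that ticked this batch emit
  .map { case (symbol, ((user, p), price)) =>
    ((user, symbol), price * p.shares - p.cost) // e.g. 151*100 - 15000 = 100
  }

I am not sure this is the right shape, which leads to my questions below.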

My questions are:

         * What is the best method to combine 2 real-time streams (transactions made by users and market pricing data) in Spark?
         * How can I run real-time ad-hoc SQL against portfolio positions? Is there any way I can do SQL on the output of Spark Streaming?
         For example:
              select sum(gainOrLoss) from portfolio where user='A';
         * What are the preferred external storages for Spark in this use case?
         * Is Spark the right choice for my use case?

