The reason I did not use subscription conflation was performance. With AsyncEventListeners, I could have multiple dispatchers that could keep up with the volume of data being pushed.

From memory, subscription queues are primary/secondary in nature, which means conflation would happen in a single queue location, whereas with AsyncEventListeners it would happen on all primary bucket nodes.

But, event conflation is an option as well....

Many ways to skin a cat....

On 1/03/2016 6:41 am, Anilkumar Gingade wrote:
Another option is to use event conflation:

http://gemfire.docs.pivotal.io/docs-gemfire/developing/events/conflate_server_subscription_queue.html
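
For example (just a sketch - the region name is illustrative, and the com.gemstone.gemfire packages are the GemFire 8.x ones; newer Geode builds use org.apache.geode instead), the conflation flag can be set when the server-side region is created:

import com.gemstone.gemfire.cache.Cache;
import com.gemstone.gemfire.cache.CacheFactory;
import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.RegionShortcut;

public class ConflatedRegionServer {
    public static void main(String[] args) {
        Cache cache = new CacheFactory().create();

        // With conflation enabled, a slow subscriber's queue keeps only the
        // latest queued value per key instead of every intermediate update.
        Region<String, Object> prices = cache
            .<String, Object>createRegionFactory(RegionShortcut.PARTITION)
            .setEnableSubscriptionConflation(true)
            .create("prices");
    }
}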

Udo, You may have thought about this...Any reason you did not use this in your solution...

-Anil.



On Sun, Feb 28, 2016 at 4:44 PM, Udo Kohlmeyer <[email protected]> wrote:

    Hi there Andrew.

    As you have discovered, pushing 100s of events per second is just
    not feasible, for a couple of reasons: the events are mostly
    redundant (as in, not all of them actually carry changed data
    entries), and downstream systems often cannot deal with the volume
    of data events being pushed and fall behind processing the data
    entries very quickly.

    Coalescing data on the input side is a great idea, as you can
    control the amount of data you're pushing into the servers to
    process, which benefits everything downstream from that point. But
    this is something you'd have to write yourself on the ingest client.

    I previously implemented something similar to what you are trying
    to achieve, using an AsyncEventListener/AsyncEventQueue, which is a
    batch (size- or time-triggered) processor. AsyncEventListeners are
    a little undervalued, as everyone always wants single-event
    processing (like CacheListeners).

    Although the main use case for AsyncEventListeners is write-behind
    for DB's, they also allow for the processing of events (in batch)
    on the server side. By turning on batch conflation for the
    AsyncEventQueue you can coalesce data entries per batch. More on
    AsyncEventListeners can be found in the docco:
    http://gemfire.docs.pivotal.io/docs-gemfire/developing/events/implementing_write_behind_event_handler.html
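
    Roughly, the queue setup might look something like this (a sketch
    only - the queue id, sizes and region name are illustrative, and
    the com.gemstone.gemfire packages are the GemFire 8.x ones; newer
    Geode builds use org.apache.geode instead). The CoalescingListener
    it refers to is sketched a little further down.

    import com.gemstone.gemfire.cache.Cache;
    import com.gemstone.gemfire.cache.CacheFactory;
    import com.gemstone.gemfire.cache.RegionShortcut;
    import com.gemstone.gemfire.cache.asyncqueue.AsyncEventQueueFactory;

    public class InputRegionSetup {
        public static void main(String[] args) {
            Cache cache = new CacheFactory().create();

            // A batch is dispatched when either the size or the time limit is
            // hit; batch conflation keeps only the latest event per key per batch.
            AsyncEventQueueFactory queueFactory = cache.createAsyncEventQueueFactory()
                .setBatchSize(1000)
                .setBatchTimeInterval(200)        // milliseconds
                .setBatchConflationEnabled(true)
                .setParallel(true);               // process on every primary bucket node
            queueFactory.create("coalescingQueue", new CoalescingListener());

            // Attach the queue to the ingest ("input") region.
            cache.createRegionFactory(RegionShortcut.PARTITION)
                .addAsyncEventQueueId("coalescingQueue")
                .create("inputRegion");
        }
    }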

    The implementation I worked on would ingest data on an
    "input-region", which had an AsyncEventListener attached to it. The
    AsyncEventListener would then fire processing when either the time
    (200ms) or batch size (1000) limit was hit. Each batch would contain
    only unique, coalesced keys, which would then be inserted into a
    "client-facing" region, which could then have either CQ's or
    subscriptions enabled. Subscription docco:
    http://gemfire.docs.pivotal.io/docs-gemfire/latest/developing/events/configure_client_server_event_messaging.html
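
    A rough sketch of the coalescing listener described above (not the
    actual implementation; the region name is illustrative):

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    import com.gemstone.gemfire.cache.CacheFactory;
    import com.gemstone.gemfire.cache.Region;
    import com.gemstone.gemfire.cache.asyncqueue.AsyncEvent;
    import com.gemstone.gemfire.cache.asyncqueue.AsyncEventListener;

    public class CoalescingListener implements AsyncEventListener {

        @Override
        public boolean processEvents(List<AsyncEvent> events) {
            // Keep only the latest value per key within this batch.
            Map<Object, Object> latest = new LinkedHashMap<>();
            for (AsyncEvent event : events) {
                latest.put(event.getKey(), event.getDeserializedValue());
            }

            // Push the coalesced entries onto the client-facing region.
            Region<Object, Object> clientFacing =
                CacheFactory.getAnyInstance().getRegion("clientFacing");
            clientFacing.putAll(latest);

            // true tells the queue the batch was processed successfully.
            return true;
        }

        @Override
        public void close() {
        }
    }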

    In this case we used subscriptions, as all the data events from the
    server were required to be forwarded to the subscribed client. The
    client would then decide whether it needed to process the data.
    CQ's can be used as well, but I prefer not to add the extra burden
    on the server of filtering every data entry event to see whether it
    should be sent or not. But you could test both and see which one
    works better for you.
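
    For what it's worth, registering a CQ from the client is only a few
    lines; a hedged sketch (the query, names and locator address are
    assumptions, not from the original implementation):

    import com.gemstone.gemfire.cache.client.ClientCache;
    import com.gemstone.gemfire.cache.client.ClientCacheFactory;
    import com.gemstone.gemfire.cache.query.CqAttributesFactory;
    import com.gemstone.gemfire.cache.query.CqEvent;
    import com.gemstone.gemfire.cache.query.CqListener;
    import com.gemstone.gemfire.cache.query.CqQuery;

    public class CqClient {
        public static void main(String[] args) throws Exception {
            ClientCache cache = new ClientCacheFactory()
                .addPoolLocator("localhost", 10334)   // assumed locator address
                .setPoolSubscriptionEnabled(true)
                .create();

            CqAttributesFactory cqAttributes = new CqAttributesFactory();
            cqAttributes.addCqListener(new CqListener() {
                @Override
                public void onEvent(CqEvent event) {
                    // The server only pushes rows that match the query.
                    System.out.println(event.getKey() + " -> " + event.getNewValue());
                }
                @Override
                public void onError(CqEvent event) {
                }
                @Override
                public void close() {
                }
            });

            CqQuery cq = cache.getQueryService()
                .newCq("pricesCq", "SELECT * FROM /clientFacing", cqAttributes.create());
            cq.execute();
        }
    }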

    The subscription client would then provide a CacheListener
    implementation on the "client-facing" region, and it can decide
    how it wants to process any of the data events.
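
    And the subscription-side client might look roughly like this
    (again just a sketch; the locator address, region name and listener
    body are illustrative):

    import com.gemstone.gemfire.cache.EntryEvent;
    import com.gemstone.gemfire.cache.Region;
    import com.gemstone.gemfire.cache.client.ClientCache;
    import com.gemstone.gemfire.cache.client.ClientCacheFactory;
    import com.gemstone.gemfire.cache.client.ClientRegionShortcut;
    import com.gemstone.gemfire.cache.util.CacheListenerAdapter;

    public class SubscribingClient {
        public static void main(String[] args) {
            ClientCache cache = new ClientCacheFactory()
                .addPoolLocator("localhost", 10334)   // assumed locator address
                .setPoolSubscriptionEnabled(true)
                .create();

            Region<String, Object> clientFacing = cache
                .<String, Object>createClientRegionFactory(ClientRegionShortcut.PROXY)
                .addCacheListener(new CacheListenerAdapter<String, Object>() {
                    @Override
                    public void afterCreate(EntryEvent<String, Object> event) {
                        process(event);
                    }
                    @Override
                    public void afterUpdate(EntryEvent<String, Object> event) {
                        process(event);
                    }
                    private void process(EntryEvent<String, Object> event) {
                        // The client decides here what to do with each pushed entry.
                        System.out.println(event.getKey() + " -> " + event.getNewValue());
                    }
                })
                .create("clientFacing");

            // Ask the server to push all events for this region to us.
            clientFacing.registerInterest("ALL_KEYS");
        }
    }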

    Hope this helps.

    --Udo

    On 28/02/2016 3:45 pm, Andrew Munn wrote:
    What if you have one market data feed handler that receives the incoming
    price changes rapidly from an exchange and you want to push those updated
    values out to listening objects using (I think) Continuous Queries?  I
    wrote a Coherence app years ago which does this and I would like to port
    it over to Geode or GF.  Any tips on cache configuration, etc?  These
    updates could come at rates of 100s per second.  In the past I would
    coalesce the updates in the feed handler and push them into the Coherence
    cache only once per second.

    Thanks,
    Andrew



    On Tue, 23 Feb 2016, Michael Stolz wrote:

    Something like that. You might choose a smaller granularity than
    minute if you're really getting that many ticks per minute. But you
    probably want a consistent granularity to make it relatively easy to
    find what you are looking for. You'll probably also want the date in
    the key.


    --Mike Stolz
    Principal Engineer, GemFire Product Manager
    Mobile: 631-835-4771

    On Tue, Feb 23, 2016 at 11:07 AM, Andrew Munn <[email protected]> wrote:
           How does that work when you're appending incoming data in realtime?  Say
           you're getting 1,000,000 data points per day on each of 1,000 incoming
           stock symbols.  That is 1bln data points.  Are you using keys like this
           that bucket the data into one array per minute of the day

                   MSFT-08:00
                   MSFT-08:01
                   ...
                   MSFT-08:59
                   etc?

           each array might have several thousand elements in that case.

           Thanks
           Andrew

           On Mon, 22 Feb 2016, Michael Stolz wrote:

           > You will definitely want to use arrays rather than storing each individual data point because the overhead of each entry in Geode is nearly 300 bytes.
           > You could choose to partition by day/week/month but it shouldn't be necessary because the default partitioning scheme should be random enough to get reasonable distribution if you are using the metadata and starting timestamp of the array as the key.
           >
           >
           > --Mike Stolz
           > Principal Engineer, GemFire Product Manager
           > Mobile: 631-835-4771
           >
           > On Fri, Feb 19, 2016 at 1:43 PM, Alan Kash <[email protected]> wrote:
           >       Hi,
           > I am also building a dashboard prototype for time-series data,
           >
           > For time-series data, usually we target a single metric change (stock price, temperature, pressure, etc.) for an entity, but the associated metadata with the event - {StockName/Place, DeviceID, ApplicationID, EventType} - remains constant.
           >
           > For a backend like Cassandra, we denormalize everything and put everything in a flat key-map with [Metric, Timestamp, DeviceID, Type] as the key. This results in data duplication of the associated "Metadata".
           >
           > Do you recommend a similar approach for Geode ?
           >
           > Alternatively,
           >
           > We can have an array for Metrics associated with a given Metadata key and store it in a Map ?
           >
           > Key = [Metadata, Timestamp]
           >
           > TSMAP<Key, Array<Metric>> series = [1,2,3,4,5,6,7,8,9]
           >
           > We can partition this at application level by day / week / month.
           >
           > Is this approach better ?
           >
           > There is a metrics spec for TS data modeling for those who are interested - http://metrics20.org
           > Thanks
           >
           >
           >
           > On Fri, Feb 19, 2016 at 1:11 PM, Michael Stolz <[email protected]> wrote:
           >       You will likely get best results in terms of speed of access if you put some structure around the way you store the data in-memory.
           > First off, you would probably want to parse the data into the individual fields and create a Java object that represents that structure.
           >
           > Then you would probably want to bundle those Java structures into arrays in such a way that it is easy to get to the array for a particular date and time, by the combination of a ticker and a date and time as the key.
           >
           > Those arrays of Java objects are what you would store as entries in Geode.
           > I think this would give you the fastest access to the data.
           >
           > By the way, it is probably better to use an integer Julian date and a long integer for the time rather than a Java Date. Java Dates in Geode PDX are way bigger than you want when you have millions of them.
           >
           > Looking at the sample dataset you provided, it appears there is a lot of redundant data in there. Repeating 1926.75 for instance.
           > In fact, every field but 2 is the same. Are the repetitious fields necessary? If they are, then you might consider using a columnar approach instead of the Java structures I mentioned. Make an array for each column and compact the repetitions with a count. It would be slower but more compact.
           > The timestamps are all the same too. Strange.
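
           (Purely as an illustration of the compact value object Mike
           describes - the field names are guesses based on the sample
           rows, not a known schema:)

           import java.io.Serializable;

           public class Tick implements Serializable {
               int julianDate;   // integer Julian date instead of a java.util.Date
               long timeMicros;  // microseconds since midnight, stored as a long
               double price;
               int size;
               int cumulativeSize;

               public Tick(int julianDate, long timeMicros,
                           double price, int size, int cumulativeSize) {
                   this.julianDate = julianDate;
                   this.timeMicros = timeMicros;
                   this.price = price;
                   this.size = size;
                   this.cumulativeSize = cumulativeSize;
               }
           }

           // Region entries would then hold Tick[] (or List<Tick>) keyed by
           // ticker plus date/time bucket, rather than one entry per tick.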
           >
           >
           >
           > --Mike Stolz
           > Principal Engineer, GemFire Product Manager
           > Mobile: 631-835-4771
           >
           > On Fri, Feb 19, 2016 at 12:15 AM, Gregory Chase <[email protected]> wrote:
           >       Hi Andrew, I'll let one of the committers answer your specific data file question. However, you might find some inspiration in this open source demo that some of the Geode team presented at OSCON earlier this year: http://pivotal-open-source-hub.github.io/StockInference-Spark/
           >
           > This was based on a pre-release version of Geode, so you'll want to sub the M1 release in and see if any other tweaks are required at that point.
           >
           > I believe this video and presentation go with the Github project: http://www.infoq.com/presentations/r-gemfire-spring-xd
           >
           > On Thu, Feb 18, 2016 at 8:58 PM, Andrew Munn <[email protected]> wrote:
           >       What would be the best way to use Geode (or GF) to store and utilize
           >       financial time series data like a stream of stock trades?  I have ASCII
           >       files with timestamps that include microseconds:
           >
           >       2016-02-17 18:00:00.000660,1926.75,5,5,1926.75,1926.75,14644971,C,43,01,
           >       2016-02-17 18:00:00.000660,1926.75,80,85,1926.75,1926.75,14644971,C,43,01,
           >       2016-02-17 18:00:00.000660,1926.75,1,86,1926.75,1926.75,14644971,C,43,01,
           >       2016-02-17 18:00:00.000660,1926.75,6,92,1926.75,1926.75,14644971,C,43,01,
           >       2016-02-17 18:00:00.000660,1926.75,27,119,1926.75,1926.75,14644971,C,43,01,
           >       2016-02-17 18:00:00.000660,1926.75,3,122,1926.75,1926.75,14644971,C,43,01,
           >       2016-02-17 18:00:00.000660,1926.75,5,127,1926.75,1926.75,14644971,C,43,01,
           >       2016-02-17 18:00:00.000660,1926.75,4,131,1926.75,1926.75,14644971,C,43,01,
           >       2016-02-17 18:00:00.000660,1926.75,2,133,1926.75,1926.75,14644971,C,43,01,
           >
           >       I have one file per day and each file can have over 1,000,000 rows.  My
           >       thought is to fault in the files and parse the ASCII as needed.  I know I
           >       could store the data as binary primitives in a file on disk instead of
           >       ASCII for a bit more speed.
           >
           >       I don't have a cluster of machines to create an HDFS cluster with.  My
           >       machine does have 128GB of RAM though.
           >
           >       Thanks!
           >
           >
           >
           >
           > --
           > Greg Chase
           > Global Head, Big Data Communities
           > http://www.pivotal.io/big-data
           >
           > Pivotal Software
           > http://www.pivotal.io/
           >
           > 650-215-0477
           > @GregChase
           > Blog: http://geekmarketing.biz/
           >
           >
           >
           >
           >
           >





