Another option is to use event conflation: http://gemfire.docs.pivotal.io/docs-gemfire/developing/events/conflate_server_subscription_queue.html
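For reference, conflation is a per-region switch. A minimal server-side sketch with the Java API (the region name is illustrative; packages are org.apache.geode.* for Geode, com.gemstone.gemfire.* for GemFire):

    import org.apache.geode.cache.Cache;
    import org.apache.geode.cache.CacheFactory;
    import org.apache.geode.cache.RegionShortcut;

    public class ConflationSetup {
        public static void main(String[] args) {
            Cache cache = new CacheFactory().create();
            // With subscription conflation on, a slow subscriber's queue keeps
            // only the latest pending event per key instead of every update.
            cache.createRegionFactory(RegionShortcut.REPLICATE)
                    .setEnableSubscriptionConflation(true)
                    .create("Prices");
        }
    }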
Udo, you may have thought about this... any reason you did not use it in your solution? -Anil.

On Sun, Feb 28, 2016 at 4:44 PM, Udo Kohlmeyer <[email protected]> wrote:

Hi there Andrew.

As you have discovered, pushing 100s of updates per second is just not feasible, for multiple reasons. Mostly the events are redundant (not all of them actually carry changed data), or the downstream systems cannot deal with the volume of events being pushed and fall behind processing them very quickly.

Coalescing data on the input is a great idea, as you can control the amount of data you're pushing into the servers, which benefits everything downstream of that point. But this is something you'd have to write yourself on the ingest client.

I previously implemented something similar to what you are trying to achieve, using an AsyncEventListener/AsyncEventQueue, which is a batch processor (triggered by batch size or time interval). AsyncEventListeners are a little undervalued, as everyone always wants single-event processing (like CacheListeners).

Although the main use case for AsyncEventListeners is write-behind for DBs, they allow for processing events in batch on the server side, and by turning on batch conflation for the queue you can coalesce data entries per batch. More on AsyncEventListeners can be found in the docco: http://gemfire.docs.pivotal.io/docs-gemfire/developing/events/implementing_write_behind_event_handler.html

The implementation I worked on would ingest data into an "input-region" which had an AsyncEventListener attached to it. The listener would fire when either the time (200ms) or batch size (1000) threshold was hit. Each batch would contain only unique, coalesced keys, which would then be written to a "client-facing" region with either CQs or subscriptions enabled. Subscription docco: http://gemfire.docs.pivotal.io/docs-gemfire/latest/developing/events/configure_client_server_event_messaging.html

In this case we used subscriptions, as all the data events from the server had to be forwarded to the subscribed client; the client would then decide whether it needed to process the data. CQs can be used as well, but I prefer not to add the extra burden on the server of filtering every data-entry event to decide whether it should be sent. You could test both and see which works better for you.

The subscription client would then register a CacheListener on the "client-facing" region and decide how it wants to process the data events.

Hope this helps.

--Udo
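A rough sketch of the pipeline Udo describes, using the AsyncEventQueue API with his 1000-event/200ms thresholds (the class, queue, and region names are illustrative; Geode packages are shown, GemFire's are com.gemstone.gemfire.*):

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.geode.cache.Cache;
    import org.apache.geode.cache.CacheFactory;
    import org.apache.geode.cache.Region;
    import org.apache.geode.cache.RegionShortcut;
    import org.apache.geode.cache.asyncqueue.AsyncEvent;
    import org.apache.geode.cache.asyncqueue.AsyncEventListener;
    import org.apache.geode.cache.asyncqueue.AsyncEventQueue;

    public class CoalescingListener implements AsyncEventListener {

        @Override
        public boolean processEvents(List<AsyncEvent> events) {
            if (events.isEmpty()) {
                return true;
            }
            // Keep only the last value seen per key in this batch; with
            // batch conflation enabled the queue already drops most duplicates.
            Map<Object, Object> latest = new LinkedHashMap<>();
            for (AsyncEvent event : events) {
                latest.put(event.getKey(), event.getDeserializedValue());
            }
            Region<Object, Object> clientFacing =
                    events.get(0).getRegion().getRegionService().getRegion("client-facing");
            clientFacing.putAll(latest); // one update per key reaches subscribers/CQs
            return true; // batch consumed; returning false would redeliver it
        }

        @Override
        public void close() {}

        // Server-side wiring: the queue fires at 1000 events or 200 ms,
        // whichever comes first.
        public static void main(String[] args) {
            Cache cache = new CacheFactory().create();
            cache.createRegionFactory(RegionShortcut.PARTITION)
                    .setEnableSubscriptionConflation(true)
                    .create("client-facing");
            AsyncEventQueue queue = cache.createAsyncEventQueueFactory()
                    .setBatchSize(1000)
                    .setBatchTimeInterval(200)        // milliseconds
                    .setBatchConflationEnabled(true)  // coalesce duplicate keys per batch
                    .create("coalesce-queue", new CoalescingListener());
            cache.createRegionFactory(RegionShortcut.PARTITION)
                    .addAsyncEventQueueId(queue.getId())
                    .create("input-region");
        }
    }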
On 28/02/2016 3:45 pm, Andrew Munn wrote:

What if you have one market data feed handler that receives incoming price changes rapidly from an exchange, and you want to push those updated values out to listening objects using (I think) Continuous Queries? I wrote a Coherence app years ago which does this and I would like to port it over to Geode or GF. Any tips on cache configuration, etc.? These updates could come at rates of 100s per second. In the past I would coalesce the updates in the feed handler and push them into the Coherence cache only once per second.

Thanks,
Andrew

On Tue, 23 Feb 2016, Michael Stolz wrote:

Something like that. You might choose a smaller granularity than a minute if you're really getting that many ticks per minute, but you probably want a consistent granularity to make it relatively easy to find what you are looking for. You'll probably also want the date in the key.

--Mike Stolz
Principal Engineer, GemFire Product Manager
Mobile: 631-835-4771

On Tue, Feb 23, 2016 at 11:07 AM, Andrew Munn <[email protected]> wrote:

How does that work when you're appending incoming data in realtime? Say you're getting 1,000,000 data points per day on each of 1,000 incoming stock symbols. That is 1 billion data points. Are you using keys like this, bucketing the data into one array per minute of the day?

MSFT-08:00
MSFT-08:01
...
MSFT-08:59

Each array might have several thousand elements in that case.

Thanks,
Andrew

On Mon, 22 Feb 2016, Michael Stolz wrote:

You will definitely want to use arrays rather than storing each individual data point, because the overhead of each entry in Geode is nearly 300 bytes. You could choose to partition by day/week/month, but it shouldn't be necessary: the default partitioning scheme should be random enough to get reasonable distribution if you are using the metadata and starting timestamp of the array as the key.

--Mike Stolz
Principal Engineer, GemFire Product Manager
Mobile: 631-835-4771
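The append path for that scheme might look like the sketch below (assumptions: a single feed-handler writer, since the read-modify-write is not atomic; the key layout is illustrative; the tick object itself is sketched further down the thread):

    import java.util.ArrayList;
    import org.apache.geode.cache.Region;

    public final class TickBuckets {

        // One consistent granularity, date included in the key, e.g.
        // "MSFT|2016-02-17|08:00" (per Mike's advice above).
        public static String bucketKey(String symbol, String date, String hhmm) {
            return symbol + "|" + date + "|" + hhmm;
        }

        // One region entry per symbol-minute bucket, so the ~300-byte
        // per-entry overhead is paid once per minute, not once per tick.
        // Concurrent writers to the same key would need Region.replace()
        // in a retry loop, or a server-side function, instead.
        public static <T> void append(Region<String, ArrayList<T>> region,
                                      String key, T tick) {
            ArrayList<T> bucket = region.get(key);
            if (bucket == null) {
                bucket = new ArrayList<>();
            }
            bucket.add(tick);
            region.put(key, bucket); // re-put so the change is distributed
        }
    }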
On Fri, Feb 19, 2016 at 1:43 PM, Alan Kash <[email protected]> wrote:

Hi,

I am also building a dashboard prototype for time-series data.

For time-series data we usually target a single metric change (stock price, temperature, pressure, etc.) for an entity, but the metadata associated with the event - {StockName/Place, DeviceID, ApplicationID, EventType} - remains constant.

For a backend like Cassandra, we denormalize everything and put it in a flat key-map with [Metric, Timestamp, DeviceID, Type] as the key. This results in duplication of the associated metadata.

Do you recommend a similar approach for Geode?

Alternatively, we can have an array of metrics associated with a given metadata key and store it in a map:

Key = [Metadata, Timestamp]
TSMAP<Key, Array<Metric>> series = [1,2,3,4,5,6,7,8,9]

We can partition this at the application level by day/week/month. Is this approach better?

There is a metrics spec for TS data modeling for those who are interested: http://metrics20.org

Thanks

On Fri, Feb 19, 2016 at 1:11 PM, Michael Stolz <[email protected]> wrote:

You will likely get the best results in terms of speed of access if you put some structure around the way you store the data in memory.

First off, you would probably want to parse the data into the individual fields and create a Java object that represents that structure. Then you would probably want to bundle those Java objects into arrays in such a way that it is easy to get to the array for a particular date and time, using the combination of a ticker and a date/time as the key. Those arrays of Java objects are what you would store as entries in Geode. I think this would give you the fastest access to the data.

By the way, it is probably better to use an integer Julian date and a long integer for the time rather than a Java Date. Java Dates in Geode PDX are way bigger than you want when you have millions of them.

Looking at the sample dataset you provided, it appears there is a lot of redundant data in there - repeating 1926.75, for instance. In fact, every field but two is the same across rows. Are the repetitious fields necessary? If they are, then you might consider a columnar approach instead of the Java structures I mentioned: make an array for each column and compact the repetitions with a count. It would be slower but more compact. The timestamps are all the same too. Strange.

--Mike Stolz
Principal Engineer, GemFire Product Manager
Mobile: 631-835-4771

On Fri, Feb 19, 2016 at 12:15 AM, Gregory Chase <[email protected]> wrote:

Hi Andrew,

I'll let one of the committers answer your specific data file question. However, you might find some inspiration in this open source demo that some of the Geode team presented at OSCON earlier this year: http://pivotal-open-source-hub.github.io/StockInference-Spark/

This was based on a pre-release version of Geode, so you'll want to sub in the M1 release and see if any other tweaks are required at that point.

I believe this video and presentation go with the GitHub project: http://www.infoq.com/presentations/r-gemfire-spring-xd

On Thu, Feb 18, 2016 at 8:58 PM, Andrew Munn <[email protected]> wrote:

What would be the best way to use Geode (or GF) to store and utilize financial time series data like a stream of stock trades? I have ASCII files with timestamps that include microseconds:

2016-02-17 18:00:00.000660,1926.75,5,5,1926.75,1926.75,14644971,C,43,01,
2016-02-17 18:00:00.000660,1926.75,80,85,1926.75,1926.75,14644971,C,43,01,
2016-02-17 18:00:00.000660,1926.75,1,86,1926.75,1926.75,14644971,C,43,01,
2016-02-17 18:00:00.000660,1926.75,6,92,1926.75,1926.75,14644971,C,43,01,
2016-02-17 18:00:00.000660,1926.75,27,119,1926.75,1926.75,14644971,C,43,01,
2016-02-17 18:00:00.000660,1926.75,3,122,1926.75,1926.75,14644971,C,43,01,
2016-02-17 18:00:00.000660,1926.75,5,127,1926.75,1926.75,14644971,C,43,01,
2016-02-17 18:00:00.000660,1926.75,4,131,1926.75,1926.75,14644971,C,43,01,
2016-02-17 18:00:00.000660,1926.75,2,133,1926.75,1926.75,14644971,C,43,01,

I have one file per day, and each file can have over 1,000,000 rows. My thought is to fault in the files and parse the ASCII as needed. I know I could store the data as binary primitives in a file on disk instead of ASCII for a bit more speed.

I don't have a cluster of machines to create an HDFS cluster with. My machine does have 128GB of RAM, though.

Thanks!
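The parse step Mike suggests might look like this sketch (assumptions: the symbol comes from the file or feed rather than the row; the column meanings beyond timestamp, price, and size are guesses from the sample; a compact yyyymmdd int stands in for a Julian date):

    import java.time.LocalDate;
    import java.time.LocalTime;

    public final class Tick {

        public final String symbol;     // assumed to be supplied by the feed/file
        public final int yyyymmdd;      // e.g. 20160217; avoids java.util.Date in PDX
        public final long microsOfDay;  // 18:00:00.000660 -> 64_800_000_660L
        public final double price;
        public final int size;          // guessed meaning of the third column

        public Tick(String symbol, int yyyymmdd, long microsOfDay,
                    double price, int size) {
            this.symbol = symbol;
            this.yyyymmdd = yyyymmdd;
            this.microsOfDay = microsOfDay;
            this.price = price;
            this.size = size;
        }

        // Parses a row like:
        // 2016-02-17 18:00:00.000660,1926.75,5,5,1926.75,1926.75,14644971,C,43,01,
        public static Tick parse(String symbol, String line) {
            String[] f = line.split(",");
            String[] ts = f[0].split(" ");           // "2016-02-17", "18:00:00.000660"
            LocalDate d = LocalDate.parse(ts[0]);
            int yyyymmdd = d.getYear() * 10_000 + d.getMonthValue() * 100
                    + d.getDayOfMonth();
            long micros = LocalTime.parse(ts[1]).toNanoOfDay() / 1_000; // keeps microseconds
            return new Tick(symbol, yyyymmdd, micros,
                    Double.parseDouble(f[1]), Integer.parseInt(f[2]));
        }
    }

For the columnar variant Mike mentions, each bucket would instead hold one array per column, with the repetitive columns compacted into (value, count) run-length pairs.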
--
Greg Chase
Global Head, Big Data Communities
http://www.pivotal.io/big-data

Pivotal Software
http://www.pivotal.io/

650-215-0477
@GregChase
Blog: http://geekmarketing.biz/
