Another option is to use event conflation: http://gemfire.docs.pivotal.io/docs-gemfire/developing/events/conflate_server_subscription_queue.html
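For reference, conflation is a per-region switch. A minimal server-side sketch with the Java API (the region name is illustrative; packages are org.apache.geode.* for Geode, com.gemstone.gemfire.* for GemFire):

    import org.apache.geode.cache.Cache;
    import org.apache.geode.cache.CacheFactory;
    import org.apache.geode.cache.RegionShortcut;

    public class ConflationSetup {
        public static void main(String[] args) {
            Cache cache = new CacheFactory().create();
            // With subscription conflation on, a slow subscriber's queue keeps
            // only the latest pending event per key instead of every update.
            cache.createRegionFactory(RegionShortcut.REPLICATE)
                    .setEnableSubscriptionConflation(true)
                    .create("Prices");
        }
    }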
Udo, you may have thought about this... any reason you did not use it in your solution? -Anil.

On Sun, Feb 28, 2016 at 4:44 PM, Udo Kohlmeyer <[email protected]> wrote:

Hi there Andrew.

As you have discovered, pushing 100s of updates per second is just not feasible, for multiple reasons. Mostly the events are redundant (not all of them actually carry changed data), or the downstream systems cannot deal with the volume of events being pushed and fall behind processing them very quickly.

Coalescing data on the input is a great idea, as you can control the amount of data you're pushing into the servers, which benefits everything downstream of that point. But this is something you'd have to write yourself on the ingest client.

I previously implemented something similar to what you are trying to achieve, using an AsyncEventListener/AsyncEventQueue, which is a batch processor (triggered by batch size or time interval). AsyncEventListeners are a little undervalued, as everyone always wants single-event processing (like CacheListeners).

Although the main use case for AsyncEventListeners is write-behind for DBs, they allow for processing events in batch on the server side, and by turning on batch conflation for the queue you can coalesce data entries per batch. More on AsyncEventListeners can be found in the docco: http://gemfire.docs.pivotal.io/docs-gemfire/developing/events/implementing_write_behind_event_handler.html

The implementation I worked on would ingest data into an "input-region" which had an AsyncEventListener attached to it. The listener would fire when either the time (200ms) or batch size (1000) threshold was hit. Each batch would contain only unique, coalesced keys, which would then be written to a "client-facing" region with either CQs or subscriptions enabled. Subscription docco: http://gemfire.docs.pivotal.io/docs-gemfire/latest/developing/events/configure_client_server_event_messaging.html

In this case we used subscriptions, as all the data events from the server had to be forwarded to the subscribed client; the client would then decide whether it needed to process the data. CQs can be used as well, but I prefer not to add the extra burden on the server of filtering every data-entry event to decide whether it should be sent. You could test both and see which works better for you.

The subscription client would then register a CacheListener on the "client-facing" region and decide how it wants to process the data events.

Hope this helps.

--Udo
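A rough sketch of the pipeline Udo describes, using the AsyncEventQueue API with his 1000-event/200ms thresholds (the class, queue, and region names are illustrative; Geode packages are shown, GemFire's are com.gemstone.gemfire.*):

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.geode.cache.Cache;
    import org.apache.geode.cache.CacheFactory;
    import org.apache.geode.cache.Region;
    import org.apache.geode.cache.RegionShortcut;
    import org.apache.geode.cache.asyncqueue.AsyncEvent;
    import org.apache.geode.cache.asyncqueue.AsyncEventListener;
    import org.apache.geode.cache.asyncqueue.AsyncEventQueue;

    public class CoalescingListener implements AsyncEventListener {

        @Override
        public boolean processEvents(List<AsyncEvent> events) {
            if (events.isEmpty()) {
                return true;
            }
            // Keep only the last value seen per key in this batch; with
            // batch conflation enabled the queue already drops most duplicates.
            Map<Object, Object> latest = new LinkedHashMap<>();
            for (AsyncEvent event : events) {
                latest.put(event.getKey(), event.getDeserializedValue());
            }
            Region<Object, Object> clientFacing =
                    events.get(0).getRegion().getRegionService().getRegion("client-facing");
            clientFacing.putAll(latest); // one update per key reaches subscribers/CQs
            return true; // batch consumed; returning false would redeliver it
        }

        @Override
        public void close() {}

        // Server-side wiring: the queue fires at 1000 events or 200 ms,
        // whichever comes first.
        public static void main(String[] args) {
            Cache cache = new CacheFactory().create();
            cache.createRegionFactory(RegionShortcut.PARTITION)
                    .setEnableSubscriptionConflation(true)
                    .create("client-facing");
            AsyncEventQueue queue = cache.createAsyncEventQueueFactory()
                    .setBatchSize(1000)
                    .setBatchTimeInterval(200)        // milliseconds
                    .setBatchConflationEnabled(true)  // coalesce duplicate keys per batch
                    .create("coalesce-queue", new CoalescingListener());
            cache.createRegionFactory(RegionShortcut.PARTITION)
                    .addAsyncEventQueueId(queue.getId())
                    .create("input-region");
        }
    }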
On 28/02/2016 3:45 pm, Andrew Munn wrote:

What if you have one market data feed handler that receives incoming price changes rapidly from an exchange, and you want to push those updated values out to listening objects using (I think) Continuous Queries? I wrote a Coherence app years ago which does this and I would like to port it over to Geode or GF. Any tips on cache configuration, etc.? These updates could come at rates of 100s per second. In the past I would coalesce the updates in the feed handler and push them into the Coherence cache only once per second.

Thanks,
Andrew

On Tue, 23 Feb 2016, Michael Stolz wrote:

Something like that. You might choose a smaller granularity than a minute if you're really getting that many ticks per minute, but you probably want a consistent granularity to make it relatively easy to find what you are looking for. You'll probably also want the date in the key.

--Mike Stolz
Principal Engineer, GemFire Product Manager
Mobile: 631-835-4771

On Tue, Feb 23, 2016 at 11:07 AM, Andrew Munn <[email protected]> wrote:

How does that work when you're appending incoming data in realtime? Say you're getting 1,000,000 data points per day on each of 1,000 incoming stock symbols. That is 1 billion data points. Are you using keys like this, bucketing the data into one array per minute of the day?

MSFT-08:00
MSFT-08:01
...
MSFT-08:59

Each array might have several thousand elements in that case.

Thanks,
Andrew

On Mon, 22 Feb 2016, Michael Stolz wrote:

You will definitely want to use arrays rather than storing each individual data point, because the overhead of each entry in Geode is nearly 300 bytes. You could choose to partition by day/week/month, but it shouldn't be necessary: the default partitioning scheme should be random enough to get reasonable distribution if you are using the metadata and starting timestamp of the array as the key.

--Mike Stolz
Principal Engineer, GemFire Product Manager
Mobile: 631-835-4771
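The append path for that scheme might look like the sketch below (assumptions: a single feed-handler writer, since the read-modify-write is not atomic; the key layout is illustrative; the tick object itself is sketched further down the thread):

    import java.util.ArrayList;
    import org.apache.geode.cache.Region;

    public final class TickBuckets {

        // One consistent granularity, date included in the key, e.g.
        // "MSFT|2016-02-17|08:00" (per Mike's advice above).
        public static String bucketKey(String symbol, String date, String hhmm) {
            return symbol + "|" + date + "|" + hhmm;
        }

        // One region entry per symbol-minute bucket, so the ~300-byte
        // per-entry overhead is paid once per minute, not once per tick.
        // Concurrent writers to the same key would need Region.replace()
        // in a retry loop, or a server-side function, instead.
        public static <T> void append(Region<String, ArrayList<T>> region,
                                      String key, T tick) {
            ArrayList<T> bucket = region.get(key);
            if (bucket == null) {
                bucket = new ArrayList<>();
            }
            bucket.add(tick);
            region.put(key, bucket); // re-put so the change is distributed
        }
    }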
On Fri, Feb 19, 2016 at 1:43 PM, Alan Kash <[email protected]> wrote:

Hi,

I am also building a dashboard prototype for time-series data.

For time-series data we usually target a single metric change (stock price, temperature, pressure, etc.) for an entity, but the metadata associated with the event - {StockName/Place, DeviceID, ApplicationID, EventType} - remains constant.

For a backend like Cassandra, we denormalize everything and put it in a flat key-map with [Metric, Timestamp, DeviceID, Type] as the key. This results in duplication of the associated metadata.

Do you recommend a similar approach for Geode?

Alternatively, we can have an array of metrics associated with a given metadata key and store it in a map:

Key = [Metadata, Timestamp]
TSMAP<Key, Array<Metric>> series = [1,2,3,4,5,6,7,8,9]

We can partition this at the application level by day/week/month. Is this approach better?

There is a metrics spec for TS data modeling for those who are interested: http://metrics20.org

Thanks

On Fri, Feb 19, 2016 at 1:11 PM, Michael Stolz <[email protected]> wrote:

You will likely get the best results in terms of speed of access if you put some structure around the way you store the data in memory.

First off, you would probably want to parse the data into the individual fields and create a Java object that represents that structure. Then you would probably want to bundle those Java objects into arrays in such a way that it is easy to get to the array for a particular date and time, using the combination of a ticker and a date/time as the key. Those arrays of Java objects are what you would store as entries in Geode. I think this would give you the fastest access to the data.

By the way, it is probably better to use an integer Julian date and a long integer for the time rather than a Java Date. Java Dates in Geode PDX are way bigger than you want when you have millions of them.

Looking at the sample dataset you provided, it appears there is a lot of redundant data in there - repeating 1926.75, for instance. In fact, every field but two is the same across rows. Are the repetitious fields necessary? If they are, then you might consider a columnar approach instead of the Java structures I mentioned: make an array for each column and compact the repetitions with a count. It would be slower but more compact. The timestamps are all the same too. Strange.

--Mike Stolz
Principal Engineer, GemFire Product Manager
Mobile: 631-835-4771

On Fri, Feb 19, 2016 at 12:15 AM, Gregory Chase <[email protected]> wrote:

Hi Andrew,

I'll let one of the committers answer your specific data file question. However, you might find some inspiration in this open source demo that some of the Geode team presented at OSCON earlier this year: http://pivotal-open-source-hub.github.io/StockInference-Spark/

This was based on a pre-release version of Geode, so you'll want to sub in the M1 release and see if any other tweaks are required at that point.

I believe this video and presentation go with the GitHub project: http://www.infoq.com/presentations/r-gemfire-spring-xd

On Thu, Feb 18, 2016 at 8:58 PM, Andrew Munn <[email protected]> wrote:

What would be the best way to use Geode (or GF) to store and utilize financial time series data like a stream of stock trades? I have ASCII files with timestamps that include microseconds:

2016-02-17 18:00:00.000660,1926.75,5,5,1926.75,1926.75,14644971,C,43,01,
2016-02-17 18:00:00.000660,1926.75,80,85,1926.75,1926.75,14644971,C,43,01,
2016-02-17 18:00:00.000660,1926.75,1,86,1926.75,1926.75,14644971,C,43,01,
2016-02-17 18:00:00.000660,1926.75,6,92,1926.75,1926.75,14644971,C,43,01,
2016-02-17 18:00:00.000660,1926.75,27,119,1926.75,1926.75,14644971,C,43,01,
2016-02-17 18:00:00.000660,1926.75,3,122,1926.75,1926.75,14644971,C,43,01,
2016-02-17 18:00:00.000660,1926.75,5,127,1926.75,1926.75,14644971,C,43,01,
2016-02-17 18:00:00.000660,1926.75,4,131,1926.75,1926.75,14644971,C,43,01,
2016-02-17 18:00:00.000660,1926.75,2,133,1926.75,1926.75,14644971,C,43,01,

I have one file per day, and each file can have over 1,000,000 rows. My thought is to fault in the files and parse the ASCII as needed. I know I could store the data as binary primitives in a file on disk instead of ASCII for a bit more speed.

I don't have a cluster of machines to create an HDFS cluster with. My machine does have 128GB of RAM, though.

Thanks!
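The parse step Mike suggests might look like this sketch (assumptions: the symbol comes from the file or feed rather than the row; the column meanings beyond timestamp, price, and size are guesses from the sample; a compact yyyymmdd int stands in for a Julian date):

    import java.time.LocalDate;
    import java.time.LocalTime;

    public final class Tick {

        public final String symbol;     // assumed to be supplied by the feed/file
        public final int yyyymmdd;      // e.g. 20160217; avoids java.util.Date in PDX
        public final long microsOfDay;  // 18:00:00.000660 -> 64_800_000_660L
        public final double price;
        public final int size;          // guessed meaning of the third column

        public Tick(String symbol, int yyyymmdd, long microsOfDay,
                    double price, int size) {
            this.symbol = symbol;
            this.yyyymmdd = yyyymmdd;
            this.microsOfDay = microsOfDay;
            this.price = price;
            this.size = size;
        }

        // Parses a row like:
        // 2016-02-17 18:00:00.000660,1926.75,5,5,1926.75,1926.75,14644971,C,43,01,
        public static Tick parse(String symbol, String line) {
            String[] f = line.split(",");
            String[] ts = f[0].split(" ");           // "2016-02-17", "18:00:00.000660"
            LocalDate d = LocalDate.parse(ts[0]);
            int yyyymmdd = d.getYear() * 10_000 + d.getMonthValue() * 100
                    + d.getDayOfMonth();
            long micros = LocalTime.parse(ts[1]).toNanoOfDay() / 1_000; // keeps microseconds
            return new Tick(symbol, yyyymmdd, micros,
                    Double.parseDouble(f[1]), Integer.parseInt(f[2]));
        }
    }

For the columnar variant Mike mentions, each bucket would instead hold one array per column, with the repetitive columns compacted into (value, count) run-length pairs.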
--
Greg Chase
Global Head, Big Data Communities
http://www.pivotal.io/big-data

Pivotal Software
http://www.pivotal.io/

650-215-0477
@GregChase
Blog: http://geekmarketing.biz/
