Hi,
I am also building a dashboard prototype for time-series data.
For time-series data we usually track a single changing metric (stock
price, temperature, pressure, etc.) per entity, while the metadata
associated with each event - {StockName/Place, DeviceID, ApplicationID,
EventType} - remains constant.
For a backend like Cassandra, we denormalize and store everything in a flat
key-map with [Metric, Timestamp, DeviceID, Type] as the key. This
results in duplication of the associated metadata.
Do you recommend a similar approach for Geode?
Alternatively, could we keep an array of the metric values associated with a
given metadata key and store it in a map?
Key = [Metadata, Timestamp]
TSMAP<Key, Array<Metric>>, e.g. series = [1,2,3,4,5,6,7,8,9]
We can partition this at the application level by day/week/month.
Is this approach better?
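To make that concrete, here is a rough sketch of the key I have in mind
(Java; the names are purely illustrative):

import java.io.Serializable;

// Composite key: the constant metadata plus a coarse time bucket
// (day/week/month), so the metadata is stored once per bucket rather
// than once per sample.
public class SeriesKey implements Serializable {

    private final String metadata;  // e.g. "GOOG|device-42|trade"
    private final long bucketStart; // epoch millis at the start of the bucket

    public SeriesKey(String metadata, long bucketStart) {
        this.metadata = metadata;
        this.bucketStart = bucketStart;
    }

    // Keys in a distributed map need proper equals/hashCode.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SeriesKey)) return false;
        SeriesKey k = (SeriesKey) o;
        return bucketStart == k.bucketStart && metadata.equals(k.metadata);
    }

    @Override
    public int hashCode() {
        return 31 * metadata.hashCode() + Long.hashCode(bucketStart);
    }
}

The entries would then be double[] (or a small wrapper around the array)
keyed by SeriesKey.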
There is a metrics spec for time-series data modeling for those who are
interested - http://metrics20.org
Thanks
On Fri, Feb 19, 2016 at 1:11 PM, Michael Stolz <[email protected]> wrote:
> You will likely get the best results in terms of access speed if you put
> some structure around the way you store the data in memory.
>
> First off, you would probably want to parse the data into the individual
> fields and create a Java object that represents that structure.
>
> Then you would probably want to bundle those Java objects into arrays in
> such a way that it is easy to get to the array for a particular ticker,
> date, and time, using that combination as the key.
>
> Those arrays of Java objects are what you would store as entries in Geode.
> I think this would give you the fastest access to the data.
>
> By the way, it is probably better to use an integer Julian date and a long
> integer for the time rather than a Java Date. Java Dates in Geode PDX are
> much bigger than you want when you have millions of them.
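>
> A minimal sketch of what that could look like (class and field names are
> just illustrative):
>
> import java.io.Serializable;
>
> // One parsed row from the file; the timestamp is split into an int Julian
> // date and a long time-of-day in microseconds instead of a Java Date.
> public class Trade implements Serializable {
>     public int julianDate;
>     public long timeMicros;
>     public double price;
>     public int size;
>     // ... remaining fields from the file
> }
>
> // Key for one bundle of trades (equals/hashCode omitted for brevity,
> // but a key class needs them).
> class TradeBatchKey implements Serializable {
>     public String ticker;
>     public int julianDate;
>     public long timeMicros;
> }
>
> The region entry would then be a Trade[] stored under a TradeBatchKey.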
>
> Looking at the sample dataset you provided, it appears there is a lot of
> redundant data in there - 1926.75 repeating, for instance.
> In fact, every field but two is the same. Are the repetitious fields
> necessary? If they are, then you might consider using a columnar approach
> instead of the Java structures I mentioned: make an array for each column
> and compact the repetitions with a count. It would be slower but more
> compact.
> The timestamps are all the same too. Strange.
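>
> A rough sketch of that columnar idea, with repeats compacted as
> (value, count) runs - illustrative only:
>
> import java.io.Serializable;
>
> // One column of a bucket, run-length encoded: values[i] repeats
> // counts[i] times.
> public class RleColumn implements Serializable {
>     public double[] values;
>     public int[] counts;
>
>     // Expand back to the flat column when needed - slower to read,
>     // but much more compact to store.
>     public double[] expand() {
>         int total = 0;
>         for (int c : counts) total += c;
>         double[] out = new double[total];
>         int pos = 0;
>         for (int i = 0; i < values.length; i++) {
>             for (int j = 0; j < counts[i]; j++) out[pos++] = values[i];
>         }
>         return out;
>     }
> }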
>
>
>
> --
> Mike Stolz
> Principal Engineer, GemFire Product Manager
> Mobile: 631-835-4771
>
> On Fri, Feb 19, 2016 at 12:15 AM, Gregory Chase <[email protected]> wrote:
>
>> Hi Andrew,
>> I'll let one of the committers answer your specific data file
>> question. However, you might find some inspiration in this open source demo
>> that some of the Geode team presented at OSCON earlier this year:
>> http://pivotal-open-source-hub.github.io/StockInference-Spark/
>>
>> This was based on a pre-release version of Geode, so you'll want to swap
>> in the M1 release and see if any other tweaks are required at that point.
>>
>> I believe this video and presentation go with the Github project:
>> http://www.infoq.com/presentations/r-gemfire-spring-xd
>>
>> On Thu, Feb 18, 2016 at 8:58 PM, Andrew Munn <[email protected]> wrote:
>>
>>> What would be the best way to use Geode (or GemFire) to store and utilize
>>> financial time series data like a stream of stock trades? I have ASCII
>>> files with timestamps that include microseconds:
>>>
>>> 2016-02-17 18:00:00.000660,1926.75,5,5,1926.75,1926.75,14644971,C,43,01,
>>> 2016-02-17 18:00:00.000660,1926.75,80,85,1926.75,1926.75,14644971,C,43,01,
>>> 2016-02-17 18:00:00.000660,1926.75,1,86,1926.75,1926.75,14644971,C,43,01,
>>> 2016-02-17 18:00:00.000660,1926.75,6,92,1926.75,1926.75,14644971,C,43,01,
>>> 2016-02-17 18:00:00.000660,1926.75,27,119,1926.75,1926.75,14644971,C,43,01,
>>> 2016-02-17 18:00:00.000660,1926.75,3,122,1926.75,1926.75,14644971,C,43,01,
>>> 2016-02-17 18:00:00.000660,1926.75,5,127,1926.75,1926.75,14644971,C,43,01,
>>> 2016-02-17 18:00:00.000660,1926.75,4,131,1926.75,1926.75,14644971,C,43,01,
>>> 2016-02-17 18:00:00.000660,1926.75,2,133,1926.75,1926.75,14644971,C,43,01,
>>>
>>> I have one file per day and each file can have over 1,000,000 rows. My
>>> thought is to fault in the files and parse the ASCII as needed. I know I
>>> could store the data as binary primitives in a file on disk instead of
>>> ASCII for a bit more speed.
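>>>
>>> For reference, a minimal sketch of parsing one row into primitives (the
>>> column meanings beyond the timestamp are my guesses):
>>>
>>> import java.time.LocalDateTime;
>>> import java.time.format.DateTimeFormatter;
>>>
>>> public class RowParser {
>>>     // Pattern matches the microsecond timestamps in the files.
>>>     private static final DateTimeFormatter TS =
>>>             DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS");
>>>
>>>     public static void main(String[] args) {
>>>         String line =
>>>             "2016-02-17 18:00:00.000660,1926.75,5,5,1926.75,1926.75,14644971,C,43,01,";
>>>         String[] f = line.split(",");
>>>         LocalDateTime ts = LocalDateTime.parse(f[0], TS);
>>>         double price = Double.parseDouble(f[1]);  // assumed: trade price
>>>         int size = Integer.parseInt(f[2]);        // assumed: trade size
>>>         System.out.println(ts + " " + price + " " + size);
>>>     }
>>> }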
>>>
>>> I don't have a cluster of machines to create an HDFS cluster with. My
>>> machine does have 128GB of RAM though.
>>>
>>> Thanks!
>>>
>>
>>
>>
>> --
>> Greg Chase
>>
>> Global Head, Big Data Communities
>> http://www.pivotal.io/big-data
>>
>> Pivotal Software
>> http://www.pivotal.io/
>>
>> 650-215-0477
>> @GregChase
>> Blog: http://geekmarketing.biz/
>>
>>
>