Hi there,

I did try to reverse the order of the composite key and it worked! I changed BIGINT to UNSIGNED_LONG because I was facing some issues with the time zone of the DATE field. There is one small detail whose behaviour I cannot understand:
When I give the query:

SELECT * FROM SENSOR_DATA WHERE DT>=1410677800000 AND DT <= 1500677800000 AND SID='ID1';

the time that it takes is:

Rows read: 100, Elapsed time (seconds) - *Total: 15.462*, SQL query: 0.02, *Reading results: 15.442*

When I give a more difficult query (in my head):

SELECT * FROM SENSOR_DATA WHERE DT>=1410677800000 AND DT <= 1500677800000 AND SID='ID1' ORDER BY DT;

the time that it takes is:

Rows read: 100, Elapsed time (seconds) - *Total: 1.435*, SQL query: 0.014, *Reading results: 1.421*

In either case my application is interested in the results of the second query; I was just curious about the difference in performance.

Thanks a lot!

On 9 June 2015 at 10:26, Yiannis Gkoufas <[email protected]> wrote:
> Thanks a lot for your replies!
> Will try the DATE field and change the order of the composite key.
>
> On 8 June 2015 at 17:37, James Taylor <[email protected]> wrote:
>
>> Both DATE and TIME have millisecond granularity (both are stored as
>> 8-byte longs), so I'd recommend using either of those. Phoenix also
>> supports date arithmetic, so you can do queries like this to get the
>> last week's worth of data:
>>
>> SELECT * FROM SENSOR_DATA
>> WHERE sid = 'ID1' AND dt > CURRENT_TIME() - 7;
>>
>> Also, built-in functions such as TRUNC(dt, 'HOUR') will let you group
>> by in time "buckets", and TO_CHAR(dt) will convert to a string
>> representation.
>>
>> I wouldn't recommend using a string-based representation for your
>> date/time column.
>>
>> Will most of your queries include a sensor ID (or a small subset of
>> sensor IDs)? If so, then it makes sense to lead your PK with sid. If
>> most of your queries will be across all or a large percentage of your
>> sensors, then it'd be better to lead with your date/time column (and
>> salt the table to prevent write hot spotting).
>>
>> Thanks,
>> James
>>
>> On Mon, Jun 8, 2015 at 9:17 AM, Vladimir Rodionov <[email protected]> wrote:
>> > There are several time data types natively supported by Phoenix: TIME is
>> > probably the most suitable for your case (it should have millisecond
>> > accuracy, but you'd better check it yourself.)
>> >
>> > -Vlad
>> >
>> > On Mon, Jun 8, 2015 at 9:02 AM, Yiannis Gkoufas <[email protected]> wrote:
>> >>
>> >> Hi Vladimir,
>> >>
>> >> thanks a lot for your input, just some follow-up questions:
>> >> 1) When you say "try to fit it in long", you mean UNSIGNED_LONG from
>> >> https://phoenix.apache.org/language/datatypes.html, right?
>> >> 2) Would a string format also be efficient? Like YYYYMMDDHHmm?
>> >>
>> >> Thanks a lot!
>> >>
>> >> On 8 June 2015 at 16:48, Vladimir Rodionov <[email protected]> wrote:
>> >>>
>> >>> PRIMARY KEY(dt, sid) won't work well for your query.
>> >>> PRIMARY KEY(sid, dt) is much better for time-range queries for a
>> >>> particular sensor. In the latter case this query will be translated
>> >>> into an efficient range scan.
>> >>> Do not use bigint for the timestamp; try to fit it into a long, or use
>> >>> a stringified version in a format suitable for byte-by-byte comparison:
>> >>>
>> >>> "2015 06/08 05:23:25.345"
>> >>>
>> >>> -Vlad
>> >>>
>> >>> On Mon, Jun 8, 2015 at 2:48 AM, Yiannis Gkoufas <[email protected]> wrote:
>> >>>>
>> >>>> Hi there,
>> >>>>
>> >>>> I am investigating Phoenix as a potential data store for time-series
>> >>>> data from sensors.
>> >>>> What I am really interested in, as a first milestone, is having
>> >>>> efficient time-range queries for a particular sensor. For those
>> >>>> queries the results would consist of 1 or 2 columns (so small rows).
>> >>>> I was looking for some advice about the schema design and indexes.
>> >>>> What I have done so far:
>> >>>>
>> >>>> CREATE TABLE VALUES (dt bigint not null, sid varchar not null,
>> >>>> us double, gen double CONSTRAINT PK PRIMARY KEY(dt,sid))
>> >>>> COMPRESSION='GZ', SALT_BUCKETS=120
>> >>>>
>> >>>> loaded the data and performed queries like:
>> >>>>
>> >>>> SELECT DT,US,GEN FROM VALUES WHERE DT>=1369676800000 AND
>> >>>> DT <= 1370000800000 AND SID='ID1'
>> >>>>
>> >>>> Is this the optimal way to go?
>> >>>>
>> >>>> Thanks a lot
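Putting the thread's advice together, the revised schema could look roughly like the following. This is a sketch, not the exact DDL the poster ended up with: the table name SENSOR_DATA and the switch away from BIGINT follow the later messages, and DATE is used here as one of the time types James recommends.

```sql
-- Lead the primary key with the sensor ID so that a per-sensor
-- time-range query becomes a single efficient range scan.
-- DATE has millisecond granularity (stored as an 8-byte long).
CREATE TABLE SENSOR_DATA (
    sid VARCHAR NOT NULL,
    dt  DATE    NOT NULL,
    us  DOUBLE,
    gen DOUBLE
    CONSTRAINT pk PRIMARY KEY (sid, dt)
) COMPRESSION='GZ';
```

Note that salting mainly helps when a monotonically increasing value (such as a timestamp) leads the row key; with sid leading the key, SALT_BUCKETS may be unnecessary, per James's comment about when to salt.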
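The TRUNC-based bucketing James mentions could be combined with an aggregate along these lines (a sketch; AVG is just one illustrative aggregate over the us/gen columns from the schema in the thread, and it assumes dt is a DATE/TIME column so the date arithmetic applies):

```sql
-- Average readings per hourly bucket for one sensor over the
-- last week, using TRUNC to truncate the time column to the hour.
SELECT TRUNC(dt, 'HOUR') AS hour_bucket,
       AVG(us)  AS avg_us,
       AVG(gen) AS avg_gen
FROM SENSOR_DATA
WHERE sid = 'ID1'
  AND dt > CURRENT_TIME() - 7
GROUP BY TRUNC(dt, 'HOUR');
```

TO_CHAR(dt) could likewise be applied in the SELECT list to render the bucket as a string for display.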
