How did you input the data?
On Mon, Apr 6, 2015 at 1:27 PM, Marek Wiewiorka <[email protected]> wrote:

  Oh, I'm sorry - I forgot to attach them.

  DDL of my table:

    create table se_dwh.homes_usage_hour (
        hu_ts        time    not null,
        hu_ho_id     integer not null,
        hu_stream_id integer not null,
        hu_usage     double
        constraint pk PRIMARY KEY (hu_ts, hu_ho_id, hu_stream_id)
    );

  Phoenix: 4.3.0

  Thanks,
  Marek

2015-04-06 22:25 GMT+02:00 James Taylor <[email protected]>:

  Hi Marek,
  How did you input the data and what does your CREATE TABLE/VIEW
  statement look like? What version of Phoenix and HBase are you using?
  Thanks,
  James

On Monday, April 6, 2015, Marek Wiewiorka <[email protected]> wrote:

  Hi James - sure, here is the result of your query (limited to 5 rows):

    +-------------------------+----------+--------------+--------------------+
    | TO_CHAR(HU_TS)          | HU_HO_ID | HU_STREAM_ID | HU_USAGE           |
    +-------------------------+----------+--------------+--------------------+
    | 2015-03-19 00:59:59.000 | 4720     | 0            | 0.287              |
    | 2015-03-19 00:59:59.000 | 6854     | 0            | 3.6189999999999998 |
    | 2015-03-19 01:59:59.000 | 4720     | 0            | 0.323              |
    | 2015-03-19 01:59:59.000 | 6854     | 0            | 2.556              |
    | 2015-03-19 02:59:59.000 | 4720     | 0            | 0.37               |
    +-------------------------+----------+--------------+--------------------+

  Ok - I will try to prepare a sample csv with DDL/DML.

  Thanks again,
  Marek

2015-04-06 22:06 GMT+02:00 James Taylor <[email protected]>:

  Hi Marek,
  Thanks for the additional information. If you could answer my earlier
  questions, that would be helpful.

  If you can't repro with a simple test case, then how about attaching a
  csv dump of some of your data (the smallest amount of data that repros
  the issue) to a JIRA, along with the CREATE TABLE statement and the
  query?

  Thanks,
  James

On Mon, Apr 6, 2015 at 12:49 PM, Marek Wiewiorka <[email protected]> wrote:

  Hi Guys - thanks for your messages.

  I did another round of testing and found that if I use the to_char
  function instead of trunc, I get the expected result:

    select to_char(dt,'yyyy-MM-dd'), id, stream, sum(usage)
    from usage
    where id in (4720,6854)
    group by to_char(dt,'yyyy-MM-dd'), stream, id;

    +-------------+------+--------+--------------------+
    | TO_CHAR(DT) | ID   | STREAM | SUM(USAGE)         |
    +-------------+------+--------+--------------------+
    | 2015-03-19  | 4720 | 0      | 8.405999999999999  |
    | 2015-03-19  | 6854 | 0      | 28.339000000000006 |
    +-------------+------+--------+--------------------+

  The same query with trunc returns 48 rows (24 rows per id):

    select trunc(dt,'DAY') as day, id, stream, sum(usage)
    from usage
    where id in (4720,6854)
    group by trunc(dt,'DAY'), stream, id;

  e.g.:

    ....
    | 2015-03-19 00:00:00.000 | 4720 | 0 | 0.406               |
    | 2015-03-19 00:00:00.000 | 6854 | 0 | 0.9490000000000001  |
    | 2015-03-19 00:00:00.000 | 4720 | 0 | 0.332               |
    | 2015-03-19 00:00:00.000 | 6854 | 0 | 2.738               |
    | 2015-03-19 00:00:00.000 | 4720 | 0 | 0.33499999999999996 |
    | 2015-03-19 00:00:00.000 | 6854 | 0 | 2.421               |
    | 2015-03-19 00:00:00.000 | 4720 | 0 | 0.45599999999999996 |
    | 2015-03-19 00:00:00.000 | 6854 | 0 | 2.167               |

  So for some reason grouping by trunc is not working, but grouping by
  to_char is. This applies to a table loaded from a csv file using psql.

  When I tried to create a sample table and populate it with upserts,
  everything worked as expected in both cases:

    create table usage (
        dt     time    not null,
        id     integer not null,
        stream integer not null,
        usage  double
        constraint pk PRIMARY KEY (dt, id, stream)
    );

    UPSERT INTO usage VALUES ('2015-04-01 10:00:00', 100, 0, 2.0);
    UPSERT INTO usage VALUES ('2015-04-01 11:00:00', 100, 0, 3.0);
    UPSERT INTO usage VALUES ('2015-04-01 12:00:00', 200, 0, 4.0);
    UPSERT INTO usage VALUES ('2015-04-01 13:00:00', 200, 0, 6.0);

    0: jdbc:phoenix:dwh.sayene.com:2181:/hbase-un> select trunc(dt,'DAY') as day, id, stream, sum(usage)
       from usage where id in (100,200) group by trunc(dt,'DAY'), stream, id;

    +-------------------------+-----+--------+------------+
    | DAY                     | ID  | STREAM | SUM(USAGE) |
    +-------------------------+-----+--------+------------+
    | 2015-04-01 00:00:00.000 | 100 | 0      | 5.0        |
    | 2015-04-01 00:00:00.000 | 200 | 0      | 10.0       |
    +-------------------------+-----+--------+------------+

    0: jdbc:phoenix:dwh.sayene.com:2181:/hbase-un> select to_char(dt,'yyyy-MM-dd'), id, stream, sum(usage)
       from usage where id in (100,200) group by to_char(dt,'yyyy-MM-dd'), stream, id;

    +-------------+-----+--------+------------+
    | TO_CHAR(DT) | ID  | STREAM | SUM(USAGE) |
    +-------------+-----+--------+------------+
    | 2015-04-01  | 100 | 0      | 5.0        |
    | 2015-04-01  | 200 | 0      | 10.0       |
    +-------------+-----+--------+------------+
    2 rows selected (1.49 seconds)

  Shall I open a jira for that?

  Regards,
  Marek

2015-04-06 20:16 GMT+02:00 James Taylor <[email protected]>:

  Hi Marek,
  How did you input the data and what does your CREATE TABLE/VIEW
  statement look like? What version of Phoenix and HBase are you using?

  Also, would you mind running the following query and letting us know
  the output?

    select to_char(hu_ts,'yyyy-MM-dd HH:mm:ss.SSS'), hu_ho_id, hu_stream_id, hu_usage
    from se_dwh.homes_usage_hour
    where hu_ho_id in (4720,6854);

  Thanks,
  James

On Mon, Apr 6, 2015 at 10:34 AM, Gabriel Reid <[email protected]> wrote:

  That certainly looks like a bug. Would it be possible to make a small
  reproducible test case and, if possible, log this in the Phoenix JIRA
  (https://issues.apache.org/jira/browse/PHOENIX)?

  - Gabriel

On Mon, Apr 6, 2015 at 6:10 PM, Marek Wiewiorka <[email protected]> wrote:

  Hi All,
  I came across a weird situation while running a query with group by.
  I executed 2 queries:

  1)

    select trunc(hu_ts,'DAY',1), hu_ho_id, hu_stream_id, sum(hu_usage)
    from se_dwh.homes_usage_hour
    where hu_ho_id in (4720,6854)
    group by trunc(hu_ts,'DAY',1), hu_ho_id, hu_stream_id;

    | 2015-03-19 00:00:00.000 | 4720 | 0 | 0.45599999999999996 |
    | 2015-03-19 00:00:00.000 | 6854 | 0 | 2.167               |

  2)

    select trunc(hu_ts,'DAY',1), hu_ho_id, hu_stream_id, sum(hu_usage)
    from se_dwh.homes_usage_hour
    where hu_ho_id in (4720)
    group by trunc(hu_ts,'DAY',1), hu_ho_id, hu_stream_id;

    | 2015-03-19 00:00:00.000 | 4720 | 0 | 8.405999999999999 |

  The only difference is that in the first case I included 2 ids
  (4720, 6854) and in the other one only 1 (4720).
  The result for hu_ho_id=4720 should be the same in both cases, but it
  isn't (0.45 vs 8.4). The second result (8.4) is correct.

  Am I doing something wrong?

  Regards,
  Marek
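Stepping back from the thread: the semantics Marek expects from the trunc-based GROUP BY - every hourly reading folding into one row per (day, id, stream) - can be sketched in plain Python. The sample rows below are made-up stand-ins for the se_dwh.homes_usage_hour data, and trunc_day is a hypothetical helper mimicking what Phoenix's TRUNC(ts, 'DAY') is documented to do; this illustrates only the intended aggregation, not Phoenix internals or the bug itself.

```python
from collections import defaultdict
from datetime import datetime

# Hourly usage rows: (timestamp, home_id, stream_id, usage).
# Invented sample values standing in for the real table's data.
rows = [
    (datetime(2015, 3, 19, 0, 59, 59), 4720, 0, 0.287),
    (datetime(2015, 3, 19, 0, 59, 59), 6854, 0, 3.619),
    (datetime(2015, 3, 19, 1, 59, 59), 4720, 0, 0.323),
    (datetime(2015, 3, 19, 1, 59, 59), 6854, 0, 2.556),
    (datetime(2015, 3, 19, 2, 59, 59), 4720, 0, 0.370),
]

def trunc_day(ts):
    """Truncate a timestamp to midnight, like TRUNC(ts, 'DAY')."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)

# GROUP BY trunc_day(ts), home_id, stream_id with SUM(usage):
# all hourly readings for one home on one day fold into a single row.
totals = defaultdict(float)
for ts, home_id, stream_id, usage in rows:
    totals[(trunc_day(ts), home_id, stream_id)] += usage

for key in sorted(totals):
    print(key, round(totals[key], 3))
```

With these sample rows the grouping yields exactly two result rows, one per home for 2015-03-19, which is the shape Marek's to_char query produced and the trunc query (on the csv-loaded table) did not.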
