Yes, please open a JIRA and attach that CSV (or ideally the smallest subset that exhibits the problem).

Thanks,
James
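For reference, a minimal sketch of what such a JIRA attachment could contain, assuming the DDL quoted downthread; "repro.csv" is an illustrative name for the trimmed CSV subset, which would be loaded with the same psql.py invocation shown below:

-- DDL, as posted downthread:
CREATE TABLE se_dwh.homes_usage_hour (
    hu_ts        TIME NOT NULL,
    hu_ho_id     INTEGER NOT NULL,
    hu_stream_id INTEGER NOT NULL,
    hu_usage     DOUBLE
    CONSTRAINT pk PRIMARY KEY (hu_ts, hu_ho_id, hu_stream_id));

-- After loading repro.csv, these two queries should return an identical
-- row for hu_ho_id = 4720; per the report below, they do not.
SELECT TRUNC(hu_ts,'DAY',1), hu_ho_id, hu_stream_id, SUM(hu_usage)
FROM se_dwh.homes_usage_hour
WHERE hu_ho_id IN (4720, 6854)
GROUP BY TRUNC(hu_ts,'DAY',1), hu_ho_id, hu_stream_id;

SELECT TRUNC(hu_ts,'DAY',1), hu_ho_id, hu_stream_id, SUM(hu_usage)
FROM se_dwh.homes_usage_hour
WHERE hu_ho_id IN (4720)
GROUP BY TRUNC(hu_ts,'DAY',1), hu_ho_id, hu_stream_id;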
On Tuesday, April 7, 2015, Marek Wiewiorka <[email protected]> wrote:

> Hi James - shall I still open a JIRA for that?
> Thanks!
> Marek
>
> 2015-04-06 22:48 GMT+02:00 Marek Wiewiorka <[email protected]>:
>
>> psql from a csv file:
>> ./psql.py dwh:2181:/hbase-unsecure -t SE_DWH.HOMES_USAGE_HOUR
>> /mnt/spark/export/usage_convert.txt/usage_merged.csv
>>
>> Here is a sample:
>> 2015-03-19 23:59:59,6854,0,2.167
>> 2015-03-19 22:59:59,6854,0,2.421
>> 2015-03-19 21:59:59,6854,0,2.738
>> 2015-03-19 20:59:59,6854,0,0.9490000000000001
>> 2015-03-19 19:59:59,6854,0,0.748
>> 2015-03-19 18:59:59,6854,0,2.76
>> 2015-03-19 17:59:59,6854,0,1.801
>> 2015-03-19 16:59:59,6854,0,0.661
>> 2015-03-19 15:59:59,6854,0,1.082
>> 2015-03-19 14:59:59,6854,0,1.303
>>
>> M.
>>
>> 2015-04-06 22:38 GMT+02:00 James Taylor <[email protected]>:
>>
>>> How did you input the data?
>>>
>>> On Mon, Apr 6, 2015 at 1:27 PM, Marek Wiewiorka <[email protected]> wrote:
>>>
>>>> Oh, I'm sorry - I forgot to attach them.
>>>> DDL of my table:
>>>> create table se_dwh.homes_usage_hour (hu_ts time not null, hu_ho_id integer
>>>> not null, hu_stream_id integer not null, hu_usage double
>>>> constraint pk PRIMARY KEY (hu_ts, hu_ho_id, hu_stream_id));
>>>>
>>>> Phoenix: 4.3.0
>>>>
>>>> Thanks,
>>>> Marek
>>>>
>>>> 2015-04-06 22:25 GMT+02:00 James Taylor <[email protected]>:
>>>>
>>>>> Hi Marek,
>>>>> How did you input the data and what does your CREATE TABLE/VIEW
>>>>> statement look like? What version of Phoenix and HBase are you using?
>>>>> Thanks,
>>>>> James
>>>>>
>>>>> On Monday, April 6, 2015, Marek Wiewiorka <[email protected]> wrote:
>>>>>
>>>>>> Hi James - sure, here is the result of your query (limited to 5 rows):
>>>>>>
>>>>>> +--------------------------+-----------+---------------+---------------------+
>>>>>> | TO_CHAR(HU_TS)           | HU_HO_ID  | HU_STREAM_ID  | HU_USAGE            |
>>>>>> +--------------------------+-----------+---------------+---------------------+
>>>>>> | 2015-03-19 00:59:59.000  | 4720      | 0             | 0.287               |
>>>>>> | 2015-03-19 00:59:59.000  | 6854      | 0             | 3.6189999999999998  |
>>>>>> | 2015-03-19 01:59:59.000  | 4720      | 0             | 0.323               |
>>>>>> | 2015-03-19 01:59:59.000  | 6854      | 0             | 2.556               |
>>>>>> | 2015-03-19 02:59:59.000  | 4720      | 0             | 0.37                |
>>>>>> +--------------------------+-----------+---------------+---------------------+
>>>>>>
>>>>>> Ok - I will try to prepare a sample csv with DDL/DML.
>>>>>>
>>>>>> Thanks again,
>>>>>> Marek
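A DML equivalent of the CSV sample quoted above might look like the following sketch (values copied from the quoted rows). One caveat: per Marek's mail further down the thread, UPSERT-loaded rows did not reproduce the problem, so the CSV load path itself appears to matter and the actual file may still be needed.

-- Illustrative UPSERTs mirroring three rows of the CSV sample; the table's
-- (hu_ts, hu_ho_id, hu_stream_id) primary key applies.
UPSERT INTO se_dwh.homes_usage_hour VALUES ('2015-03-19 23:59:59', 6854, 0, 2.167);
UPSERT INTO se_dwh.homes_usage_hour VALUES ('2015-03-19 22:59:59', 6854, 0, 2.421);
UPSERT INTO se_dwh.homes_usage_hour VALUES ('2015-03-19 21:59:59', 6854, 0, 2.738);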
>>>>>>
>>>>>> 2015-04-06 22:06 GMT+02:00 James Taylor <[email protected]>:
>>>>>>
>>>>>>> Hi Marek,
>>>>>>> Thanks for the additional information. If you could answer my earlier
>>>>>>> questions, that would be helpful.
>>>>>>>
>>>>>>> If you can't repro with a simple test case, then how about attaching a
>>>>>>> csv dump of some of your data (the smallest amount of data that repros
>>>>>>> the issue) to a JIRA along with the CREATE TABLE statement and the
>>>>>>> query?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> James
>>>>>>>
>>>>>>> On Mon, Apr 6, 2015 at 12:49 PM, Marek Wiewiorka <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Guys - thanks for your messages.
>>>>>>>>
>>>>>>>> I did another round of testing and found that if I use the to_char
>>>>>>>> function instead of trunc, I get the expected result:
>>>>>>>>
>>>>>>>> select to_char(dt,'yyyy-MM-dd'),id,stream,sum(usage) from usage
>>>>>>>> where id in (4720,6854) group by to_char(dt,'yyyy-MM-dd'),stream,id;
>>>>>>>>
>>>>>>>> +--------------+-------+---------+---------------------+
>>>>>>>> | TO_CHAR(DT)  | ID    | STREAM  | SUM(USAGE)          |
>>>>>>>> +--------------+-------+---------+---------------------+
>>>>>>>> | 2015-03-19   | 4720  | 0       | 8.405999999999999   |
>>>>>>>> | 2015-03-19   | 6854  | 0       | 28.339000000000006  |
>>>>>>>> +--------------+-------+---------+---------------------+
>>>>>>>>
>>>>>>>> The same query with trunc returns 48 rows (24 rows per id):
>>>>>>>>
>>>>>>>> select trunc(dt,'DAY') as day,id,stream,sum(usage) from usage
>>>>>>>> where id in (4720,6854) group by trunc(dt,'DAY'),stream,id;
>>>>>>>>
>>>>>>>> e.g.:
>>>>>>>> ....
>>>>>>>> | 2015-03-19 00:00:00.000  | 4720  | 0  | 0.406                |
>>>>>>>> | 2015-03-19 00:00:00.000  | 6854  | 0  | 0.9490000000000001   |
>>>>>>>> | 2015-03-19 00:00:00.000  | 4720  | 0  | 0.332                |
>>>>>>>> | 2015-03-19 00:00:00.000  | 6854  | 0  | 2.738                |
>>>>>>>> | 2015-03-19 00:00:00.000  | 4720  | 0  | 0.33499999999999996  |
>>>>>>>> | 2015-03-19 00:00:00.000  | 6854  | 0  | 2.421                |
>>>>>>>> | 2015-03-19 00:00:00.000  | 4720  | 0  | 0.45599999999999996  |
>>>>>>>> | 2015-03-19 00:00:00.000  | 6854  | 0  | 2.167                |
>>>>>>>> +--------------------------+-------+----+----------------------+
>>>>>>>>
>>>>>>>> So for some reason grouping by trunc is not working... but using
>>>>>>>> to_char is. This applies to the table loaded from the csv file using
>>>>>>>> psql.
>>>>>>>>
>>>>>>>> When I tried to create a sample table and populate it with upserts,
>>>>>>>> everything worked as expected in both cases:
>>>>>>>>
>>>>>>>> create table usage (dt time not null, id integer not null, stream integer
>>>>>>>> not null, usage double constraint pk PRIMARY KEY (dt, id, stream));
>>>>>>>> UPSERT INTO usage VALUES('2015-04-01 10:00:00',100,0,2.0);
>>>>>>>> UPSERT INTO usage VALUES('2015-04-01 11:00:00',100,0,3.0);
>>>>>>>> UPSERT INTO usage VALUES('2015-04-01 12:00:00',200,0,4.0);
>>>>>>>> UPSERT INTO usage VALUES('2015-04-01 13:00:00',200,0,6.0);
>>>>>>>>
>>>>>>>> 0: jdbc:phoenix:dwh.sayene.com:2181:/hbase-un> select trunc(dt,'DAY') as
>>>>>>>> day,id,stream,sum(usage) from usage where id in (100,200) group by
>>>>>>>> trunc(dt,'DAY'),stream,id;
>>>>>>>>
>>>>>>>> +--------------------------+------+---------+-------------+
>>>>>>>> | DAY                      | ID   | STREAM  | SUM(USAGE)  |
>>>>>>>> +--------------------------+------+---------+-------------+
>>>>>>>> | 2015-04-01 00:00:00.000  | 100  | 0       | 5.0         |
>>>>>>>> | 2015-04-01 00:00:00.000  | 200  | 0       | 10.0        |
>>>>>>>> +--------------------------+------+---------+-------------+
>>>>>>>>
>>>>>>>> 0: jdbc:phoenix:dwh.sayene.com:2181:/hbase-un> select
>>>>>>>> to_char(dt,'yyyy-MM-dd'),id,stream,sum(usage) from usage where id in
>>>>>>>> (100,200) group by to_char(dt,'yyyy-MM-dd'),stream,id;
>>>>>>>>
>>>>>>>> +--------------+------+---------+-------------+
>>>>>>>> | TO_CHAR(DT)  | ID   | STREAM  | SUM(USAGE)  |
>>>>>>>> +--------------+------+---------+-------------+
>>>>>>>> | 2015-04-01   | 100  | 0       | 5.0         |
>>>>>>>> | 2015-04-01   | 200  | 0       | 10.0        |
>>>>>>>> +--------------+------+---------+-------------+
>>>>>>>> 2 rows selected (1.49 seconds)
>>>>>>>>
>>>>>>>> Shall I open a JIRA for that?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Marek
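One way to confirm that the TRUNC groups are not merging at all would be to count the rows per group, a sketch against the same usage table as above (hourly data, so a working GROUP BY should produce cnt = 24 per id per day, while the broken case would produce many rows with cnt = 1):

-- If GROUP BY TRUNC(dt,'DAY') merges correctly, this returns one row per id
-- per day with cnt = 24; many rows with cnt = 1 would show the truncated
-- keys are not comparing as equal during aggregation.
SELECT TRUNC(dt,'DAY') AS day, id, COUNT(*) AS cnt, SUM(usage) AS total
FROM usage
WHERE id IN (4720, 6854)
GROUP BY TRUNC(dt,'DAY'), id;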
>>>>>>>>
>>>>>>>> 2015-04-06 20:16 GMT+02:00 James Taylor <[email protected]>:
>>>>>>>>
>>>>>>>>> Hi Marek,
>>>>>>>>> How did you input the data and what does your CREATE TABLE/VIEW
>>>>>>>>> statement look like? What version of Phoenix and HBase are you using?
>>>>>>>>>
>>>>>>>>> Also, would you mind running the following query and letting us know
>>>>>>>>> the output?
>>>>>>>>>
>>>>>>>>> select to_char(hu_ts,'yyyy-MM-dd HH:mm:ss.SSS'),hu_ho_id,hu_stream_id,hu_usage
>>>>>>>>> from se_dwh.homes_usage_hour
>>>>>>>>> where hu_ho_id in (4720,6854);
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> James
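If the formatted timestamps in that output all look identical, a follow-up along these lines could show whether several distinct raw hu_ts values hide behind one printed value (a sketch using the same table and format pattern as James's query; it assumes one stream per home, as in the sample data):

-- More than one row per (formatted timestamp, id) pair would indicate raw
-- TIME values that differ below the printed millisecond precision.
SELECT TO_CHAR(hu_ts, 'yyyy-MM-dd HH:mm:ss.SSS') AS ts_fmt,
       hu_ho_id, COUNT(*) AS cnt
FROM se_dwh.homes_usage_hour
WHERE hu_ho_id IN (4720, 6854)
GROUP BY TO_CHAR(hu_ts, 'yyyy-MM-dd HH:mm:ss.SSS'), hu_ho_id;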
>>>>>>>>>
>>>>>>>>> On Mon, Apr 6, 2015 at 10:34 AM, Gabriel Reid <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> That certainly looks like a bug. Would it be possible to make a small
>>>>>>>>>> reproducible test case and, if possible, log this in the Phoenix JIRA
>>>>>>>>>> (https://issues.apache.org/jira/browse/PHOENIX)?
>>>>>>>>>>
>>>>>>>>>> - Gabriel
>>>>>>>>>>
>>>>>>>>>> On Mon, Apr 6, 2015 at 6:10 PM Marek Wiewiorka <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>> I came across a weird situation while running a query with group by.
>>>>>>>>>>> I executed 2 queries:
>>>>>>>>>>>
>>>>>>>>>>> 1)
>>>>>>>>>>> select trunc(hu_ts,'DAY',1),hu_ho_id,hu_stream_id,sum(hu_usage)
>>>>>>>>>>> from se_dwh.homes_usage_hour where hu_ho_id in (4720,6854) group by
>>>>>>>>>>> trunc(hu_ts,'DAY',1),hu_ho_id,hu_stream_id;
>>>>>>>>>>>
>>>>>>>>>>> | 2015-03-19 00:00:00.000  | 4720  | 0  | 0.45599999999999996  |
>>>>>>>>>>> | 2015-03-19 00:00:00.000  | 6854  | 0  | 2.167                |
>>>>>>>>>>>
>>>>>>>>>>> 2)
>>>>>>>>>>> select trunc(hu_ts,'DAY',1),hu_ho_id,hu_stream_id,sum(hu_usage)
>>>>>>>>>>> from se_dwh.homes_usage_hour where hu_ho_id in (4720) group by
>>>>>>>>>>> trunc(hu_ts,'DAY',1),hu_ho_id,hu_stream_id;
>>>>>>>>>>>
>>>>>>>>>>> | 2015-03-19 00:00:00.000  | 4720  | 0  | 8.405999999999999  |
>>>>>>>>>>>
>>>>>>>>>>> The only difference is that in the first case I included two ids
>>>>>>>>>>> (4720, 6854) and in the other one only one (4720).
>>>>>>>>>>> The result for hu_ho_id=4720 should be the same in both cases, but it
>>>>>>>>>>> isn't (0.45 vs 8.4). The second result (8.4) is correct.
>>>>>>>>>>>
>>>>>>>>>>> Am I doing something wrong?
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Marek
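As a reference for anyone hitting the same issue: the TO_CHAR-based grouping Marek reports as working upthread can serve as a stopgap for daily rollups until the TRUNC behaviour is resolved.

-- Workaround reported upthread: group by the formatted day instead of
-- TRUNC; this returned the expected single row per id per day.
SELECT TO_CHAR(hu_ts, 'yyyy-MM-dd') AS day, hu_ho_id, hu_stream_id,
       SUM(hu_usage) AS daily_usage
FROM se_dwh.homes_usage_hour
WHERE hu_ho_id IN (4720, 6854)
GROUP BY TO_CHAR(hu_ts, 'yyyy-MM-dd'), hu_ho_id, hu_stream_id;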
