Hi James - I'm sorry for the delay; I had to troubleshoot another, more urgent issue. I will try to open a JIRA with a sample CSV today.
Thanks!
Marek

2015-04-08 23:08 GMT+02:00 James Taylor <[email protected]>:

  Any JIRA yet, Marek? It'd be good to get to the bottom of this. Your
  workaround is not going to perform nearly as well as using TRUNC on the
  date.
  Thanks,
  James

On Tue, Apr 7, 2015 at 8:53 AM, James Taylor <[email protected]> wrote:

  Yes, please open a JIRA and attach that CSV (or ideally the smallest
  subset that exhibits the problem).
  Thanks,
  James

On Tuesday, April 7, 2015, Marek Wiewiorka <[email protected]> wrote:

  Hi James - shall I still open a JIRA for that?
  Thanks!
  Marek

2015-04-06 22:48 GMT+02:00 Marek Wiewiorka <[email protected]>:

  psql from a CSV file:

    ./psql.py dwh:2181:/hbase-unsecure -t SE_DWH.HOMES_USAGE_HOUR /mnt/spark/export/usage_convert.txt/usage_merged.csv

  Here is a sample:

    2015-03-19 23:59:59,6854,0,2.167
    2015-03-19 22:59:59,6854,0,2.421
    2015-03-19 21:59:59,6854,0,2.738
    2015-03-19 20:59:59,6854,0,0.9490000000000001
    2015-03-19 19:59:59,6854,0,0.748
    2015-03-19 18:59:59,6854,0,2.76
    2015-03-19 17:59:59,6854,0,1.801
    2015-03-19 16:59:59,6854,0,0.661
    2015-03-19 15:59:59,6854,0,1.082
    2015-03-19 14:59:59,6854,0,1.303

  M.

2015-04-06 22:38 GMT+02:00 James Taylor <[email protected]>:

  How did you input the data?

On Mon, Apr 6, 2015 at 1:27 PM, Marek Wiewiorka <[email protected]> wrote:

  Oh, I'm sorry - I forgot to attach them.
  DDL of my table:

    create table se_dwh.homes_usage_hour (
        hu_ts        time    not null,
        hu_ho_id     integer not null,
        hu_stream_id integer not null,
        hu_usage     double
        constraint pk PRIMARY KEY (hu_ts, hu_ho_id, hu_stream_id));

  Phoenix: 4.3.0

  Thanks,
  Marek

2015-04-06 22:25 GMT+02:00 James Taylor <[email protected]>:

  Hi Marek,
  How did you input the data and what does your CREATE TABLE/VIEW
  statement look like? What version of Phoenix and HBase are you using?
  Thanks,
  James

On Monday, April 6, 2015, Marek Wiewiorka <[email protected]> wrote:

  Hi James - sure, here is the result of your query (limited to 5 rows):

    +--------------------------+-----------+---------------+---------------------+
    | TO_CHAR(HU_TS)           | HU_HO_ID  | HU_STREAM_ID  | HU_USAGE            |
    +--------------------------+-----------+---------------+---------------------+
    | 2015-03-19 00:59:59.000  | 4720      | 0             | 0.287               |
    | 2015-03-19 00:59:59.000  | 6854      | 0             | 3.6189999999999998  |
    | 2015-03-19 01:59:59.000  | 4720      | 0             | 0.323               |
    | 2015-03-19 01:59:59.000  | 6854      | 0             | 2.556               |
    | 2015-03-19 02:59:59.000  | 4720      | 0             | 0.37                |
    +--------------------------+-----------+---------------+---------------------+

  OK - I will try to prepare a sample CSV with DDL/DML.

  Thanks again,
  Marek

2015-04-06 22:06 GMT+02:00 James Taylor <[email protected]>:

  Hi Marek,
  Thanks for the additional information. If you could answer my earlier
  questions, that would be helpful.

  If you can't repro with a simple test case, then how about attaching a
  CSV dump of some of your data (the smallest amount of data that repros
  the issue) to a JIRA, along with the CREATE TABLE statement and the
  query?

  Thanks,
  James

On Mon, Apr 6, 2015 at 12:49 PM, Marek Wiewiorka <[email protected]> wrote:

  Hi guys - thanks for your messages.
  I did another round of testing and found that if I use the to_char
  function instead of trunc, I get the expected result:

    select to_char(dt,'yyyy-MM-dd'),id,stream,sum(usage) from usage
    where id in (4720,6854) group by to_char(dt,'yyyy-MM-dd'),stream,id;

    +--------------+-------+---------+---------------------+
    | TO_CHAR(DT)  | ID    | STREAM  | SUM(USAGE)          |
    +--------------+-------+---------+---------------------+
    | 2015-03-19   | 4720  | 0       | 8.405999999999999   |
    | 2015-03-19   | 6854  | 0       | 28.339000000000006  |
    +--------------+-------+---------+---------------------+

  The same query with trunc returns 48 rows (24 rows per id), seemingly
  one row per source hour rather than one per day:

    select trunc(dt,'DAY') as day,id,stream,sum(usage) from usage
    where id in (4720,6854) group by trunc(dt,'DAY'),stream,id;

  e.g.:

    ....
    | 2015-03-19 00:00:00.000  | 4720  | 0  | 0.406                |
    | 2015-03-19 00:00:00.000  | 6854  | 0  | 0.9490000000000001   |
    | 2015-03-19 00:00:00.000  | 4720  | 0  | 0.332                |
    | 2015-03-19 00:00:00.000  | 6854  | 0  | 2.738                |
    | 2015-03-19 00:00:00.000  | 4720  | 0  | 0.33499999999999996  |
    | 2015-03-19 00:00:00.000  | 6854  | 0  | 2.421                |
    | 2015-03-19 00:00:00.000  | 4720  | 0  | 0.45599999999999996  |
    | 2015-03-19 00:00:00.000  | 6854  | 0  | 2.167                |
    +--------------------------+-------+----+----------------------+

  So for some reason grouping by trunc is not working, but grouping by
  to_char is. This applies to a table loaded from a CSV file using psql.
  When I tried to create a sample table and populate it with upserts,
  everything worked as expected in both cases:

    create table usage (
        dt     time    not null,
        id     integer not null,
        stream integer not null,
        usage  double
        constraint pk PRIMARY KEY (dt, id, stream));

    UPSERT INTO usage VALUES('2015-04-01 10:00:00',100,0,2.0);
    UPSERT INTO usage VALUES('2015-04-01 11:00:00',100,0,3.0);
    UPSERT INTO usage VALUES('2015-04-01 12:00:00',200,0,4.0);
    UPSERT INTO usage VALUES('2015-04-01 13:00:00',200,0,6.0);

    0: jdbc:phoenix:dwh.sayene.com:2181:/hbase-un> select trunc(dt,'DAY') as day,
       id,stream,sum(usage) from usage where id in (100,200)
       group by trunc(dt,'DAY'),stream,id;

    +--------------------------+------+---------+-------------+
    | DAY                      | ID   | STREAM  | SUM(USAGE)  |
    +--------------------------+------+---------+-------------+
    | 2015-04-01 00:00:00.000  | 100  | 0       | 5.0         |
    | 2015-04-01 00:00:00.000  | 200  | 0       | 10.0        |
    +--------------------------+------+---------+-------------+

    0: jdbc:phoenix:dwh.sayene.com:2181:/hbase-un> select to_char(dt,'yyyy-MM-dd'),
       id,stream,sum(usage) from usage where id in (100,200)
       group by to_char(dt,'yyyy-MM-dd'),stream,id;

    +--------------+------+---------+-------------+
    | TO_CHAR(DT)  | ID   | STREAM  | SUM(USAGE)  |
    +--------------+------+---------+-------------+
    | 2015-04-01   | 100  | 0       | 5.0         |
    | 2015-04-01   | 200  | 0       | 10.0        |
    +--------------+------+---------+-------------+
    2 rows selected (1.49 seconds)

  Shall I open a JIRA for that?
  Regards,
  Marek

2015-04-06 20:16 GMT+02:00 James Taylor <[email protected]>:

  Hi Marek,
  How did you input the data and what does your CREATE TABLE/VIEW
  statement look like? What version of Phoenix and HBase are you using?

  Also, would you mind running the following query and letting us know
  the output?

    select to_char(hu_ts,'yyyy-MM-dd HH:mm:ss.SSS'),hu_ho_id,hu_stream_id,hu_usage
    from se_dwh.homes_usage_hour
    where hu_ho_id in (4720,6854);

  Thanks,
  James

On Mon, Apr 6, 2015 at 10:34 AM, Gabriel Reid <[email protected]> wrote:

  That certainly looks like a bug. Would it be possible to make a small
  reproducible test case and, if possible, log this in the Phoenix JIRA
  (https://issues.apache.org/jira/browse/PHOENIX)?

  - Gabriel

On Mon, Apr 6, 2015 at 6:10 PM, Marek Wiewiorka <[email protected]> wrote:

  Hi all,
  I came across a weird situation while running a query with GROUP BY.
  I executed two queries:

  1)

    select trunc(hu_ts,'DAY',1),hu_ho_id,hu_stream_id,sum(hu_usage)
    from se_dwh.homes_usage_hour where hu_ho_id in (4720,6854)
    group by trunc(hu_ts,'DAY',1),hu_ho_id,hu_stream_id;

    | 2015-03-19 00:00:00.000  | 4720  | 0  | 0.45599999999999996  |
    | 2015-03-19 00:00:00.000  | 6854  | 0  | 2.167                |

  2)

    select trunc(hu_ts,'DAY',1),hu_ho_id,hu_stream_id,sum(hu_usage)
    from se_dwh.homes_usage_hour where hu_ho_id in (4720)
    group by trunc(hu_ts,'DAY',1),hu_ho_id,hu_stream_id;

    | 2015-03-19 00:00:00.000  | 4720  | 0  | 8.405999999999999  |

  The only difference is that in the first case I included two ids
  (4720, 6854) and in the second one only one (4720).
  The result for hu_ho_id = 4720 should be the same in both cases, but it
  isn't (0.45 vs. 8.4). The second result (8.4) is the correct one.

  Am I doing something wrong?

  Regards,
  Marek
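To consolidate the thread for the JIRA: a minimal, self-contained repro, stitched together from the DDL, CSV sample, and queries quoted above, might look like the sketch below. It is untested; "sample.csv" is a hypothetical file name, the ZooKeeper quorum string is copied from Marek's psql invocation, and the CSV column order is assumed to match the PRIMARY KEY declaration (dt, id, stream, then usage).

    -- DDL (same shape as Marek's sample table above)
    create table usage (
        dt     time    not null,
        id     integer not null,
        stream integer not null,
        usage  double
        constraint pk PRIMARY KEY (dt, id, stream));

    -- sample.csv (hypothetical name; rows taken from the sample above):
    --   2015-03-19 23:59:59,6854,0,2.167
    --   2015-03-19 22:59:59,6854,0,2.421
    --   2015-03-19 21:59:59,6854,0,2.738

    -- Load via psql, as in the thread:
    --   ./psql.py dwh:2181:/hbase-unsecure -t USAGE sample.csv

    -- Expected: a single row for 2015-03-19; observed on CSV-loaded data,
    -- per this thread: one row per distinct timestamp.
    select trunc(dt,'DAY') as day, id, stream, sum(usage)
    from usage where id = 6854
    group by trunc(dt,'DAY'), id, stream;

    -- Workaround that groups correctly per the thread (though, as James
    -- notes, it will not perform as well as TRUNC on the date):
    select to_char(dt,'yyyy-MM-dd'), id, stream, sum(usage)
    from usage where id = 6854
    group by to_char(dt,'yyyy-MM-dd'), id, stream;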
