Hi,
I'm seeing some strange results when trying to group by
date_trunc('QUARTER', <timestamp column>). I can work around it by doing
more or less the same thing that Drill's DateTruncFunctions does. Am I
missing something, or is this a bug?
0: jdbc:drill:> SELECT date_trunc('QUARTER',
`taxi_trips`.`dropoff_datetime`), COUNT(*) FROM `hive.default`.`taxi_trips`
GROUP BY date_trunc('QUARTER', `taxi_trips`.`dropoff_datetime`) ORDER BY
date_trunc('QUARTER', `taxi_trips`.`dropoff_datetime`);
+------------------------+------------+
| EXPR$0 | EXPR$1 |
+------------------------+------------+
| 2012-01-01 00:00:00.0 | 21817 |
| 2013-01-01 00:00:00.0 | 173157926 |
| 2013-04-01 00:00:00.0 | 3 |
| 2013-07-01 00:00:00.0 | 2 |
| 2013-10-01 00:00:00.0 | 3 |
| 2014-01-01 00:00:00.0 | 8 |
| 2020-01-01 00:00:00.0 | 4 |
+------------------------+------------+
7 rows selected (12.734 seconds)
The data is actually spread out over all months of 2013, so it should not
all collapse into the first quarter:
0: jdbc:drill:> SELECT date_trunc('MONTH',
`taxi_trips`.`dropoff_datetime`), COUNT(*) FROM `hive.default`.`taxi_trips`
GROUP BY date_trunc('MONTH', `taxi_trips`.`dropoff_datetime`) ORDER BY
date_trunc('MONTH', `taxi_trips`.`dropoff_datetime`);
+------------------------+-----------+
| EXPR$0 | EXPR$1 |
+------------------------+-----------+
| 2012-12-01 00:00:00.0 | 21817 |
| 2013-01-01 00:00:00.0 | 14772657 |
| 2013-02-01 00:00:00.0 | 13990803 |
| 2013-03-01 00:00:00.0 | 15744402 |
| 2013-04-01 00:00:00.0 | 15108210 |
| 2013-05-01 00:00:00.0 | 15313848 |
| 2013-06-01 00:00:00.0 | 14355098 |
| 2013-07-01 00:00:00.0 | 13830436 |
| 2013-08-01 00:00:00.0 | 12613596 |
| 2013-09-01 00:00:00.0 | 14080300 |
| 2013-10-01 00:00:00.0 | 15009363 |
| 2013-11-01 00:00:00.0 | 14388420 |
| 2013-12-01 00:00:00.0 | 13950801 |
| 2014-01-01 00:00:00.0 | 8 |
| 2020-05-01 00:00:00.0 | 4 |
+------------------------+-----------+
15 rows selected (12.25 seconds)
This workaround gives the correct results:
0: jdbc:drill:> SELECT date_trunc('YEAR', `taxi_trips`.`dropoff_datetime`)
+ ((extract(month from `taxi_trips`.`dropoff_datetime`)-1)/3) * interval
'3' MONTH, COUNT(*) FROM `hive.default`.`taxi_trips` GROUP BY
date_trunc('YEAR', `taxi_trips`.`dropoff_datetime`) + ((extract(month from
`taxi_trips`.`dropoff_datetime`)-1)/3) * interval '3' MONTH ORDER BY
date_trunc('YEAR', `taxi_trips`.`dropoff_datetime`) + ((extract(month from
`taxi_trips`.`dropoff_datetime`)-1)/3) * interval '3' MONTH;
+------------------------+-----------+
| EXPR$0 | EXPR$1 |
+------------------------+-----------+
| 2012-10-01 00:00:00.0 | 21817 |
| 2013-01-01 00:00:00.0 | 44507862 |
| 2013-04-01 00:00:00.0 | 44777156 |
| 2013-07-01 00:00:00.0 | 40524332 |
| 2013-10-01 00:00:00.0 | 43348584 |
| 2014-01-01 00:00:00.0 | 8 |
| 2020-04-01 00:00:00.0 | 4 |
+------------------------+-----------+
7 rows selected (13.261 seconds)
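For reference, the quarter arithmetic in the workaround boils down to
quarter_start_month = ((month - 1) / 3) * 3 + 1 with integer division. A
minimal Python sketch of the same truncation (hypothetical helper name, not
part of Drill):

```python
from datetime import datetime

def trunc_to_quarter(ts: datetime) -> datetime:
    # Integer-divide the zero-based month index by 3 to find the quarter,
    # then map back to that quarter's first month (1, 4, 7, or 10).
    first_month = ((ts.month - 1) // 3) * 3 + 1
    return ts.replace(month=first_month, day=1,
                      hour=0, minute=0, second=0, microsecond=0)

# December 2012 falls in the quarter starting 2012-10-01, matching the
# first row of the workaround's output above.
print(trunc_to_quarter(datetime(2012, 12, 15, 8, 30)))
```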
The data is read from an external Parquet table:
0: jdbc:drill:> describe `hive.default`.`taxi_trips`;
+---------------------+--------------------+--------------+
| COLUMN_NAME | DATA_TYPE | IS_NULLABLE |
+---------------------+--------------------+--------------+
| dropoff_datetime | TIMESTAMP | YES |
| dropoff_latitude | DOUBLE | YES |
| dropoff_longitude | DOUBLE | YES |
| hack_license | CHARACTER VARYING | YES |
| medallion | CHARACTER VARYING | YES |
| passenger_count | BIGINT | YES |
| pickup_datetime | TIMESTAMP | YES |
| pickup_latitude | DOUBLE | YES |
| pickup_longitude | DOUBLE | YES |
| rate_code | BIGINT | YES |
| store_and_fwd_flag | CHARACTER VARYING | YES |
| trip_distance | DOUBLE | YES |
| trip_time_in_secs | BIGINT | YES |
| vendor_id | CHARACTER VARYING | YES |
+---------------------+--------------------+--------------+
14 rows selected (0.184 seconds)
0: jdbc:drill:>
Best regards,
Joel