I am comparing query speed in Drill between gzip (.gz) and bzip2 (.bz2) compressed JSON files.

Below are the two files and their sizes (both files contain the same data):
kafka_3_25-Jul-2016-12a.json.gz  = 1.8G
kafka_3_25-Jul-2016-12a.json.bz2 = 1.1G



Results:

0: jdbc:drill:> select channelid, count(serverTime) from
dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by channelid ;
+------------+----------+
| channelid  |  EXPR$1  |
+------------+----------+
| 3          | 977134   |
| 0          | 836850   |
| 2          | 3202854  |
+------------+----------+
3 rows selected (86.034 seconds)



0: jdbc:drill:> select channelid, count(serverTime) from
dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by channelid ;
+------------+----------+
| channelid  |  EXPR$1  |
+------------+----------+
| 3          | 977134   |
| 0          | 836850   |
| 2          | 3202854  |
+------------+----------+
3 rows selected (459.079 seconds)



Questions:
1. In the test above, the gz query is about 5.3x faster than the bz2 query (86 s vs. 459 s). Why is that?
2. How can we speed up queries on bz2 files? Is there any configuration we can tune?
3. Since bz2 is a splittable format, how does Drill make use of that?
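
To help separate raw decompression cost from Drill's own processing (question 1), here is a minimal sketch that times gzip and bzip2 decompression of the same data outside Drill. It uses only the Python standard library. The helper names (make_sample, time_decompress) and the synthetic rows are assumptions made for illustration; they are not my real data, and the timings will differ from what Drill's Java codecs see.

```python
import bz2
import gzip
import json
import time


def make_sample(n_rows=20000):
    """Build synthetic JSON-lines data (hypothetical values, not the real file)."""
    rows = ({"channelid": i % 4, "serverTime": 1469404800000 + i} for i in range(n_rows))
    return "\n".join(json.dumps(r) for r in rows).encode("utf-8")


def time_decompress(decompress, blob, repeats=3):
    """Return (best wall-clock seconds, decompressed bytes) over a few repeats."""
    best = float("inf")
    out = b""
    for _ in range(repeats):
        start = time.perf_counter()
        out = decompress(blob)
        best = min(best, time.perf_counter() - start)
    return best, out


raw = make_sample()
gz_blob = gzip.compress(raw)    # same input compressed both ways
bz2_blob = bz2.compress(raw)

gz_secs, gz_out = time_decompress(gzip.decompress, gz_blob)
bz2_secs, bz2_out = time_decompress(bz2.decompress, bz2_blob)

print(f"gzip : {len(gz_blob)} bytes compressed, {gz_secs:.4f} s to decompress")
print(f"bzip2: {len(bz2_blob)} bytes compressed, {bz2_secs:.4f} s to decompress")
```

If the bzip2 timing is several times the gzip timing here as well, the gap in the Drill queries is probably dominated by single-threaded decompression rather than by anything specific to the query plan.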


regards,
shankar
