I am comparing query speed between GZ and BZ2. Below are the two files and their sizes (both files contain the same data):

kafka_3_25-Jul-2016-12a.json.gz  = 1.8G
kafka_3_25-Jul-2016-12a.json.bz2 = 1.1G
Results:

0: jdbc:drill:> select channelid, count(serverTime) from dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by channelid ;
+------------+----------+
| channelid  | EXPR$1   |
+------------+----------+
| 3          | 977134   |
| 0          | 836850   |
| 2          | 3202854  |
+------------+----------+
3 rows selected (86.034 seconds)

0: jdbc:drill:> select channelid, count(serverTime) from dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by channelid ;
+------------+----------+
| channelid  | EXPR$1   |
+------------+----------+
| 3          | 977134   |
| 0          | 836850   |
| 2          | 3202854  |
+------------+----------+
3 rows selected (459.079 seconds)

Questions:

1. In the test above, the gz query is about 5.3x faster than the bz2 query (86 s vs. 459 s). Why is that?
2. How can we speed up queries on bz2 files? Is there any configuration to tune?
3. Since bz2 is a splittable format, how does Drill make use of that?

Regards,
Shankar
