Greetings fellow Drill users! While running Drill on a directory of parquet files, we are noticing a slowdown of roughly an order of magnitude when a query selects multiple columns instead of a single column. Is there something we could do to bring the query time for multiple columns closer to the single-column query time?
*Details of the setup*: 2 directories of parquet files converted by Drill from JSON, about 21 million rows in total. The EXOMISER_VARIANT_SCORE column is a double between 0 and 1.

yuki07@driller:~$ find /data/parquet/
/data/parquet/
/data/parquet/HG00731_200_37_ASM_parquet
/data/parquet/HG00731_200_37_ASM_parquet/.0_0_0.parquet.crc
/data/parquet/HG00731_200_37_ASM_parquet/0_0_1.parquet
/data/parquet/HG00731_200_37_ASM_parquet/0_0_0.parquet
/data/parquet/HG00731_200_37_ASM_parquet/.0_0_1.parquet.crc
/data/parquet/HG00732_200_37_ASM_parquet
/data/parquet/HG00732_200_37_ASM_parquet/.0_0_0.parquet.crc
/data/parquet/HG00732_200_37_ASM_parquet/0_0_1.parquet
/data/parquet/HG00732_200_37_ASM_parquet/0_0_0.parquet
/data/parquet/HG00732_200_37_ASM_parquet/.0_0_1.parquet.crc

*Queries:*

0: jdbc:drill:zk=local> select EXOMISER_VARIANT_SCORE from dfs.`/data/parquet/` where EXOMISER_VARIANT_SCORE > 0.9999999 ;
+------------------------+
| EXOMISER_VARIANT_SCORE |
+------------------------+
| 0.99999993757262       |
| 0.9999999743714132     |
| 0.9999999893098129     |
+------------------------+
3 rows selected (*1.424* seconds)

0: jdbc:drill:zk=local> select CHROM,POS,EXOMISER_VARIANT_SCORE from dfs.`/data/parquet/` where EXOMISER_VARIANT_SCORE > 0.9999999 ;
+------------+------------+------------------------+
| CHROM      | POS        | EXOMISER_VARIANT_SCORE |
+------------+------------+------------------------+
| 20         | 19078001   | 0.99999993757262       |
| 4          | 156042001  | 0.9999999743714132     |
| 9          | 113179308  | 0.9999999893098129     |
+------------+------------+------------------------+
3 rows selected (*10.898* seconds)

Any help is appreciated,

-- Denys Pavlov
