Hi, 1. What version of Drill are you running your query on ? 2. How were your parquet files created ? (were they created on Drill or another system, like Hive etc.) 3. Can you please share the parquet schema of the parquet file being queried ?
Thanks, Khurram On Wed, Aug 1, 2018 at 3:24 AM, Peter Edike < peter.ed...@interswitchgroup.com> wrote: > Hello everyone, > > I am currently using apache drill to query a directory of parquet files > (total size 64gb with each file averaging 6.4Gb per size). I am running > the following query > > select * from dfs.rj.vw_reports_201805 where unique_key in > ( > select unique_key from dfs.rj.vw_reports_201805 group by > unique_key having count(unique_key) > 1 > ) > limit 40 > > Now this query takes about 42 minutes after which it fails because one of > the drillbits in the drillbit cluster crashes. I check the logs and I see > several errors summarized as follows > > 20:35:18.006 [Curator-Framework-0] ERROR org.apache.curator.ConnectionState > - Connection timed out for connection string (172.35.15.129:5181, > 172.35.15.151:5181,172.35.15.152:5181) and timeout (5000) / elapsed > (13347) > org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = > ConnectionLoss > at > org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) > [curator-client-2.7.1.jar:na] > at > org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87) > [curator-client-2.7.1.jar:na] > at org.apache.curator.CuratorZookeeperClient.getZooKeeper( > CuratorZookeeperClient.java:115) [curator-client-2.7.1.jar:na] > at org.apache.curator.framework.imps.CuratorFrameworkImpl. > performBackgroundOperation(CuratorFrameworkImpl.java:806) > [curator-framework-2.7.1.jar:na] > at org.apache.curator.framework.imps.CuratorFrameworkImpl. > backgroundOperationsLoop(CuratorFrameworkImpl.java:792) > [curator-framework-2.7.1.jar:na] > at org.apache.curator.framework.imps.CuratorFrameworkImpl. > access$300(CuratorFrameworkImpl.java:62) [curator-framework-2.7.1.jar:na] > at org.apache.curator.framework.imps.CuratorFrameworkImpl$4. > call(CuratorFrameworkImpl.java:257) [curator-framework-2.7.1.jar:na] > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > [na:1.8.0_171] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [na:1.8.0_171] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [na:1.8.0_171] > at java.lang.Thread.run(Thread.java:748) [na:1.8.0_171] > AND > > > Aug 1, 2018 7:54:18 AM WARNING: org.apache.parquet.CorruptStatistics: > Ignoring statistics because created_by could not be parsed (see > PARQUET-251): parquet-mr (build 6aa21f8776625b5fa6b18059cfebe7549f2e00cb) > > org.apache.parquet.VersionParser$VersionParseException: Could not parse > created_by: parquet-mr (build 6aa21f8776625b5fa6b18059cfebe7549f2e00cb) > using format: (.+) version ((.*) )?\(build ?(.*)\) > > at org.apache.parquet.VersionParser.parse(VersionParser.java:112) > > at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics( > CorruptStatistics.java:66) > > at org.apache.parquet.format.converter.ParquetMetadataConverter. > fromParquetStatistics(ParquetMetadataConverter.java:264) > > at org.apache.parquet.format.converter.ParquetMetadataConverter. > fromParquetMetadata(ParquetMetadataConverter.java:568) > > at org.apache.parquet.format.converter.ParquetMetadataConverter. > readParquetMetadata(ParquetMetadataConverter.java:545) > > at org.apache.parquet.hadoop.ParquetFileReader.readFooter( > ParquetFileReader.java:455) > > at org.apache.parquet.hadoop.ParquetFileReader.readFooter( > ParquetFileReader.java:404) > > at org.apache.parquet.hadoop.ParquetFileReader.readFooter( > ParquetFileReader.java:390) > > at org.apache.drill.exec.store.parquet.ParquetScanBatchCreator. > getBatch(ParquetScanBatchCreator.java:117) > > at org.apache.drill.exec.store.parquet.ParquetScanBatchCreator. > getBatch(ParquetScanBatchCreator.java:53) > > at org.apache.drill.exec.physical.impl.ImplCreator$2. > run(ImplCreator.java:146) > > at org.apache.drill.exec.physical.impl.ImplCreator$2. > run(ImplCreator.java:142) > > at java.security.AccessController.doPrivileged(Native Method) > > at javax.security.auth.Subject.doAs(Subject.java:422) > > at org.apache.hadoop.security.UserGroupInformation.doAs( > UserGroupInformation.java:1633) > > at org.apache.drill.exec.physical.impl.ImplCreator. > getRecordBatch(ImplCreator.java:142) > > at org.apache.drill.exec.physical.impl.ImplCreator. > getChildren(ImplCreator.java:182) > > at org.apache.drill.exec.physical.impl.ImplCreator. > getRootExec(ImplCreator.java:110) > > at org.apache.drill.exec.physical.impl.ImplCreator. > getExec(ImplCreator.java:87) > > at org.apache.drill.exec.work.fragment.FragmentExecutor.run( > FragmentExecutor.java:206) > > at org.apache.drill.common.SelfCleaningRunnable.run( > SelfCleaningRunnable.java:38) > > at java.util.concurrent.ThreadPoolExecutor.runWorker( > ThreadPoolExecutor.java:1149) > > at java.util.concurrent.ThreadPoolExecutor$Worker.run( > ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > > The Schema for the data I am querying in question is as follows > > index_no > > ANY > > YES > > bank_code > > ANY > > YES > > trxn_category > > ANY > > YES > > Debit_Account_Type > > ANY > > YES > > Credit_Account_Type > > ANY > > YES > > trxn_amount > > ANY > > YES > > trxn_fee > > ANY > > YES > > trxn_date > > ANY > > YES > > Currency > > ANY > > YES > > Late_reversal > > ANY > > YES > > card_type > > ANY > > YES > > terminal_type > > ANY > > YES > > Acquirer > > ANY > > YES > > Issuer > > ANY > > YES > > Value_SettleAmount > > ANY > > YES > > Value_RequestedAmount > > ANY > > YES > > Volume > > ANY > > YES > > unique_key > > ANY > > YES > > source_node_name > > ANY > > YES > > business_date > > ANY > > YES > > PT_datetime_tran_local > > ANY > > YES > > PTC_pan > > ANY > > YES > > PTC_terminal_id > > ANY > > YES > > PTC_card_acceptor_id_code > > ANY > > YES > > PTC_card_acceptor_name_loc > > ANY > > YES > > PT_system_trace_audit_nr > > ANY > > YES > > PT_message_type > > ANY > > YES > > PT_from_account_id > > ANY > > YES > > PT_to_account_id > > ANY > > YES > > PT_tran_type > > ANY > > YES > > PT_rsp_code_req > > ANY > > YES > > PT_rsp_code_rsp > > ANY > > YES > > PT_settle_amount_req > > ANY > > YES > > PT_settle_amount_rsp > > ANY > > YES > > PTC_merchant_type > > ANY > > YES > > PT_settle_amount_impact > > ANY > > YES > > PT_settle_tran_fee_req > > ANY > > YES > > PT_settle_tran_fee_rsp > > ANY > > YES > > PT_auth_id_rsp > > ANY > > YES > > PT_retrieval_reference_nr > > ANY > > YES > > PTC_totals_group > > ANY > > YES > > PTC_card_product > > ANY > > YES > > PT_tran_currency_code > > ANY > > YES > > PT_payee > > ANY > > YES > > PT_tran_amount_req > > ANY > > YES > > PT_tran_amount_rsp > > ANY > > YES > > PT_extended_tran_type > > ANY > > YES > > PT_tran_postilion_originated > > ANY > > YES > > PT_tran_completed > > ANY > > YES > > PT_tran_nr > > ANY > > YES > > PT_retention_data > > ANY > > YES > > PT_acquiring_inst_id_code > > ANY > > YES > > PT_message_reason_code > > ANY > > YES > > PT_sponsor_bank > > ANY > > YES > > PT_datetime_tran_gmt > > ANY > > YES > > PT_datetime_req > > ANY > > YES > > PT_datetime_rsp > > ANY > > YES > > PT_realtime_business_date > > ANY > > YES > > PT_recon_business_date > > ANY > > YES > > PT_from_account_type > > ANY > > YES > > PT_to_account_type > > ANY > > YES > > PT_tran_cash_req > > ANY > > YES > > PT_tran_cash_rsp > > ANY > > YES > > PT_tran_tran_fee_req > > ANY > > YES > > PT_tran_tran_fee_rsp > > ANY > > YES > > PT_tran_tran_fee_currency_code > > ANY > > YES > > PT_tran_proc_fee_req > > ANY > > YES > > PT_tran_proc_fee_rsp > > ANY > > YES > > PT_tran_proc_fee_currency_code > > ANY > > YES > > PT_settle_currency_code > > ANY > > YES > > PT_pos_entry_mode > > ANY > > YES > > PT_pos_condition_code > > ANY > > YES > > PT_tran_reversed > > ANY > > YES > > PT_card_verification_result > > ANY > > YES > > PT_online_system_id > > ANY > > YES > > PT_participant_id > > ANY > > YES > > PT_receiving_inst_id_code > > ANY > > YES > > PT_routing_type > > ANY > > YES > > PT_pt_pos_card_input_mode > > ANY > > YES > > PT_source_node_key > > ANY > > YES > > PT_proc_online_system_id > > ANY > > YES > > PTC_post_tran_cust_id > > ANY > > YES > > PTC_source_node_name > > ANY > > YES > > PTC_card_seq_nr > > ANY > > YES > > PTC_expiry_date > > ANY > > YES > > PTC_service_restriction_code > > ANY > > YES > > PTC_terminal_owner > > ANY > > YES > > PTC_mapped_card_acceptor_id_code > > ANY > > YES > > PTC_pos_terminal_type > > ANY > > YES > > PTC_pan_encrypted > > ANY > > YES > > PTSP_Account_Nr > > ANY > > YES > > PTSP_Code > > ANY > > YES > > account_PTSP_Code > > ANY > > YES > > PTSP_Name > > ANY > > YES > > Sort_Code > > ANY > > YES > > rdm_amt > > ANY > > YES > > Reward_Code > > ANY > > YES > > Reward_discount > > ANY > > YES > > ptsp_terminal_id > > ANY > > YES > > reward_terminal_id > > ANY > > YES > > > txn_id > > ANY > > YES > > web_category_code > > ANY > > YES > > web_category_name > > ANY > > YES > > web_fee_type > > ANY > > YES > > web_merchant_disc > > ANY > > YES > > web_fee_cap > > ANY > > YES > > Account_Name > > ANY > > YES > > account_nr > > ANY > > YES > > Acquiring_bank > > ANY > > YES > > journal_amount > > ANY > > YES > > xls_amount > > ANY > > YES > > merch_cat_amount_cap > > ANY > > YES > > merch_cat_visa_amount_cap > > ANY > > YES > > reward_amount_cap > > ANY > > YES > > Amount_config_state > > ANY > > YES > > Amount_description > > ANY > > YES > > Authorized_Person > > ANY > > YES > > ACC_BANK_CODE > > ANY > > YES > > merch_cat_bearer > > ANY > > YES > > merch_cat_visa_bearer > > ANY > > YES > > merch_cat_visa_category_code > > ANY > > YES > > merch_cat_category_code > > ANY > > YES > > merch_cat_visa_category_name > > ANY > > YES > > merch_cat_category_name > > ANY > > YES > > credit_acc_id > > ANY > > YES > > credit_acc_nr_id > > ANY > > YES > > credit_cardholder_acc_id > > ANY > > YES > > credit_cardholder_acc_type > > ANY > > YES > > CreditAccNr_acc_id > > ANY > > YES > > CreditAccNr_acc_nr > > ANY > > YES > > CreditAccNr_acc_nr_id > > ANY > > YES > > CreditAccNr_se_id > > ANY > > YES > > CreditAccNr_state > > ANY > > YES > > Date_Modified > > ANY > > YES > > debit_acc_id > > ANY > > YES > > debit_acc_nr_id > > ANY > > YES > > debit_cardholder_acc_id > > ANY > > YES > > debit_cardholder_acc_type > > ANY > > YES > > DebitAccNr_acc_id > > ANY > > YES > > DebitAccNr_acc_nr > > ANY > > YES > > DebitAccNr_acc_nr_id > > ANY > > YES > > extended_trans_type > > ANY > > YES > > fee > > ANY > > YES > > Fee_amount_id > > ANY > > YES > > merch_cat_fee_cap > > ANY > > YES > > merch_cat_visa_fee_cap > > ANY > > YES > > reward_fee_cap > > ANY > > YES > > Fee_description > > ANY > > YES > > Fee_Discount > > ANY > > YES > > Fee_fee_id > > ANY > > YES > > fee_id > > ANY > > YES > > Fee_name > > ANY > > YES > > Fee_se_id > > ANY > > YES > > merch_cat_category_fee_type > > ANY > > YES > > merch_cat_category_visa_fee_type > > ANY > > YES > > merch_cat_category_merch_discount > > ANY > > YES > > merch_cat_category_visa_merch_discount > > ANY > > YES > > PT_sink_node_name > > ANY > > YES > > post_tran_id > > ANY > > YES > > Miscellaneous > > ANY > > YES > > terminal_owner > > ANY > > YES > > terminal_owner_name > > ANY > > YES > > REF_rpt_account_type > > ANY > > YES > > REF_IsPoolAccount > > ANY > > YES > > From_Account_Type > > ANY > > YES > > To_Account_Type > > ANY > > YES > > Tran_type_description > > ANY > > YES > > Region > > ANY > > YES > > Transaction_Status > > ANY > > YES > > Transaction_type_Impact > > ANY > > YES > > Message_Type_Desc > > ANY > > YES > > Response_Code_description > > ANY > > YES > > REF_BatchId > > ANY > > YES > > Beneficiary_Account > > ANY > > YES > > Rate > > ANY > > YES > > final_fee > > ANY > > YES > > final_fee_type > > ANY > > YES > > settlement_date > > ANY > > YES > > > Please what can I do to improve performance of this query > >