Hello everyone,
I am currently using apache drill to query a directory of parquet files (total
size 64gb with each file averaging 6.4Gb per size). I am running the following
query
select * from dfs.rj.vw_reports_201805 where unique_key in
(
select unique_key from dfs.rj.vw_reports_201805 group by unique_key
having count(unique_key) > 1
)
limit 40
Now this query takes about 42 minutes after which it fails because one of the
drillbits in the drillbit cluster crashes. I check the logs and I see several
errors summarized as follows
20:35:18.006 [Curator-Framework-0] ERROR org.apache.curator.ConnectionState -
Connection timed out for connection string
(172.35.15.129:5181,172.35.15.151:5181,172.35.15.152:5181) and timeout (5000) /
elapsed (13347)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode =
ConnectionLoss
at
org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
[curator-client-2.7.1.jar:na]
at
org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
[curator-client-2.7.1.jar:na]
at
org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
[curator-client-2.7.1.jar:na]
at
org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:806)
[curator-framework-2.7.1.jar:na]
at
org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792)
[curator-framework-2.7.1.jar:na]
at
org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62)
[curator-framework-2.7.1.jar:na]
at
org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257)
[curator-framework-2.7.1.jar:na]
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[na:1.8.0_171]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[na:1.8.0_171]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[na:1.8.0_171]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_171]
AND
Aug 1, 2018 7:54:18 AM WARNING: org.apache.parquet.CorruptStatistics: Ignoring
statistics because created_by could not be parsed (see PARQUET-251): parquet-mr
(build 6aa21f8776625b5fa6b18059cfebe7549f2e00cb)
org.apache.parquet.VersionParser$VersionParseException: Could not parse
created_by: parquet-mr (build 6aa21f8776625b5fa6b18059cfebe7549f2e00cb) using
format: (.+) version ((.*) )?\(build ?(.*)\)
at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
at
org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:66)
at
org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:264)
at
org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:568)
at
org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:545)
at
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:455)
at
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:404)
at
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:390)
at
org.apache.drill.exec.store.parquet.ParquetScanBatchCreator.getBatch(ParquetScanBatchCreator.java:117)
at
org.apache.drill.exec.store.parquet.ParquetScanBatchCreator.getBatch(ParquetScanBatchCreator.java:53)
at
org.apache.drill.exec.physical.impl.ImplCreator$2.run(ImplCreator.java:146)
at
org.apache.drill.exec.physical.impl.ImplCreator$2.run(ImplCreator.java:142)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1633)
at
org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:142)
at
org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:182)
at
org.apache.drill.exec.physical.impl.ImplCreator.getRootExec(ImplCreator.java:110)
at
org.apache.drill.exec.physical.impl.ImplCreator.getExec(ImplCreator.java:87)
at
org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:206)
at
org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The Schema for the data I am querying in question is as follows
index_no
ANY
YES
bank_code
ANY
YES
trxn_category
ANY
YES
Debit_Account_Type
ANY
YES
Credit_Account_Type
ANY
YES
trxn_amount
ANY
YES
trxn_fee
ANY
YES
trxn_date
ANY
YES
Currency
ANY
YES
Late_reversal
ANY
YES
card_type
ANY
YES
terminal_type
ANY
YES
Acquirer
ANY
YES
Issuer
ANY
YES
Value_SettleAmount
ANY
YES
Value_RequestedAmount
ANY
YES
Volume
ANY
YES
unique_key
ANY
YES
source_node_name
ANY
YES
business_date
ANY
YES
PT_datetime_tran_local
ANY
YES
PTC_pan
ANY
YES
PTC_terminal_id
ANY
YES
PTC_card_acceptor_id_code
ANY
YES
PTC_card_acceptor_name_loc
ANY
YES
PT_system_trace_audit_nr
ANY
YES
PT_message_type
ANY
YES
PT_from_account_id
ANY
YES
PT_to_account_id
ANY
YES
PT_tran_type
ANY
YES
PT_rsp_code_req
ANY
YES
PT_rsp_code_rsp
ANY
YES
PT_settle_amount_req
ANY
YES
PT_settle_amount_rsp
ANY
YES
PTC_merchant_type
ANY
YES
PT_settle_amount_impact
ANY
YES
PT_settle_tran_fee_req
ANY
YES
PT_settle_tran_fee_rsp
ANY
YES
PT_auth_id_rsp
ANY
YES
PT_retrieval_reference_nr
ANY
YES
PTC_totals_group
ANY
YES
PTC_card_product
ANY
YES
PT_tran_currency_code
ANY
YES
PT_payee
ANY
YES
PT_tran_amount_req
ANY
YES
PT_tran_amount_rsp
ANY
YES
PT_extended_tran_type
ANY
YES
PT_tran_postilion_originated
ANY
YES
PT_tran_completed
ANY
YES
PT_tran_nr
ANY
YES
PT_retention_data
ANY
YES
PT_acquiring_inst_id_code
ANY
YES
PT_message_reason_code
ANY
YES
PT_sponsor_bank
ANY
YES
PT_datetime_tran_gmt
ANY
YES
PT_datetime_req
ANY
YES
PT_datetime_rsp
ANY
YES
PT_realtime_business_date
ANY
YES
PT_recon_business_date
ANY
YES
PT_from_account_type
ANY
YES
PT_to_account_type
ANY
YES
PT_tran_cash_req
ANY
YES
PT_tran_cash_rsp
ANY
YES
PT_tran_tran_fee_req
ANY
YES
PT_tran_tran_fee_rsp
ANY
YES
PT_tran_tran_fee_currency_code
ANY
YES
PT_tran_proc_fee_req
ANY
YES
PT_tran_proc_fee_rsp
ANY
YES
PT_tran_proc_fee_currency_code
ANY
YES
PT_settle_currency_code
ANY
YES
PT_pos_entry_mode
ANY
YES
PT_pos_condition_code
ANY
YES
PT_tran_reversed
ANY
YES
PT_card_verification_result
ANY
YES
PT_online_system_id
ANY
YES
PT_participant_id
ANY
YES
PT_receiving_inst_id_code
ANY
YES
PT_routing_type
ANY
YES
PT_pt_pos_card_input_mode
ANY
YES
PT_source_node_key
ANY
YES
PT_proc_online_system_id
ANY
YES
PTC_post_tran_cust_id
ANY
YES
PTC_source_node_name
ANY
YES
PTC_card_seq_nr
ANY
YES
PTC_expiry_date
ANY
YES
PTC_service_restriction_code
ANY
YES
PTC_terminal_owner
ANY
YES
PTC_mapped_card_acceptor_id_code
ANY
YES
PTC_pos_terminal_type
ANY
YES
PTC_pan_encrypted
ANY
YES
PTSP_Account_Nr
ANY
YES
PTSP_Code
ANY
YES
account_PTSP_Code
ANY
YES
PTSP_Name
ANY
YES
Sort_Code
ANY
YES
rdm_amt
ANY
YES
Reward_Code
ANY
YES
Reward_discount
ANY
YES
ptsp_terminal_id
ANY
YES
reward_terminal_id
ANY
YES
txn_id
ANY
YES
web_category_code
ANY
YES
web_category_name
ANY
YES
web_fee_type
ANY
YES
web_merchant_disc
ANY
YES
web_fee_cap
ANY
YES
Account_Name
ANY
YES
account_nr
ANY
YES
Acquiring_bank
ANY
YES
journal_amount
ANY
YES
xls_amount
ANY
YES
merch_cat_amount_cap
ANY
YES
merch_cat_visa_amount_cap
ANY
YES
reward_amount_cap
ANY
YES
Amount_config_state
ANY
YES
Amount_description
ANY
YES
Authorized_Person
ANY
YES
ACC_BANK_CODE
ANY
YES
merch_cat_bearer
ANY
YES
merch_cat_visa_bearer
ANY
YES
merch_cat_visa_category_code
ANY
YES
merch_cat_category_code
ANY
YES
merch_cat_visa_category_name
ANY
YES
merch_cat_category_name
ANY
YES
credit_acc_id
ANY
YES
credit_acc_nr_id
ANY
YES
credit_cardholder_acc_id
ANY
YES
credit_cardholder_acc_type
ANY
YES
CreditAccNr_acc_id
ANY
YES
CreditAccNr_acc_nr
ANY
YES
CreditAccNr_acc_nr_id
ANY
YES
CreditAccNr_se_id
ANY
YES
CreditAccNr_state
ANY
YES
Date_Modified
ANY
YES
debit_acc_id
ANY
YES
debit_acc_nr_id
ANY
YES
debit_cardholder_acc_id
ANY
YES
debit_cardholder_acc_type
ANY
YES
DebitAccNr_acc_id
ANY
YES
DebitAccNr_acc_nr
ANY
YES
DebitAccNr_acc_nr_id
ANY
YES
extended_trans_type
ANY
YES
fee
ANY
YES
Fee_amount_id
ANY
YES
merch_cat_fee_cap
ANY
YES
merch_cat_visa_fee_cap
ANY
YES
reward_fee_cap
ANY
YES
Fee_description
ANY
YES
Fee_Discount
ANY
YES
Fee_fee_id
ANY
YES
fee_id
ANY
YES
Fee_name
ANY
YES
Fee_se_id
ANY
YES
merch_cat_category_fee_type
ANY
YES
merch_cat_category_visa_fee_type
ANY
YES
merch_cat_category_merch_discount
ANY
YES
merch_cat_category_visa_merch_discount
ANY
YES
PT_sink_node_name
ANY
YES
post_tran_id
ANY
YES
Miscellaneous
ANY
YES
terminal_owner
ANY
YES
terminal_owner_name
ANY
YES
REF_rpt_account_type
ANY
YES
REF_IsPoolAccount
ANY
YES
From_Account_Type
ANY
YES
To_Account_Type
ANY
YES
Tran_type_description
ANY
YES
Region
ANY
YES
Transaction_Status
ANY
YES
Transaction_type_Impact
ANY
YES
Message_Type_Desc
ANY
YES
Response_Code_description
ANY
YES
REF_BatchId
ANY
YES
Beneficiary_Account
ANY
YES
Rate
ANY
YES
final_fee
ANY
YES
final_fee_type
ANY
YES
settlement_date
ANY
YES
Please what can I do to improve performance of this query