[ https://issues.apache.org/jira/browse/IMPALA-6564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Armstrong resolved IMPALA-6564. ----------------------------------- Resolution: Not A Bug It turns out that the issue was introduced in one of my patches: https://gerrit.cloudera.org/#/c/8707/32/be/src/runtime/io/disk-io-mgr.cc This check was saving us: {noformat} Status DiskIoMgr::AddScanRanges(RequestContext* reader, const vector<ScanRange*>& ranges, bool schedule_immediately) { if (ranges.empty()) return Status::OK(); {noformat} > Queries randomly fail with "CANCELLED" due to a race with IssueInitialRanges() > ------------------------------------------------------------------------------ > > Key: IMPALA-6564 > URL: https://issues.apache.org/jira/browse/IMPALA-6564 > Project: IMPALA > Issue Type: Bug > Components: Backend > Affects Versions: Impala 2.12.0 > Reporter: Tim Armstrong > Assignee: Tim Armstrong > Priority: Blocker > Labels: flaky > > I've been chasing a flaky test that I saw in test_basic_runtime_filters when > running against https://gerrit.cloudera.org/#/c/8966/ (the scanner buffer > pool changes). > I think it is a latent bug that has started reproducing more frequently. What > I've found is: > * Different queries fail with CANCELLED. I can repro it on my branch ~3/4 > times by running: impala-py.test tests/query_test/test_runtime_filters.py -n8 > --verbose --maxfail 1 -k basic . It happens with a variety of queries and > file formats. > * It seems to happen when all files are pruned out by runtime filters > * Logging reveals IssueInitialRanges() fails with a CANCELLED status, which > propagates up to the query status: > {code} > if (!initial_ranges_issued_) { > // We do this in GetNext() to maximise the amount of work we can do while > waiting for > // runtime filters to show up. The scanner threads have already started > (in Open()), > // so we need to tell them there is work to do. > // TODO: This is probably not worth splitting the organisational cost of > splitting > // initialisation across two places. Move to before the scanner threads > start. > Status status = IssueInitialScanRanges(state); > if (!status.ok()) LOG(INFO) << runtime_state_->fragment_instance_id() << > " IssueInitialRanges() failed with status: " << status.GetDetail() << " " << > (void*) this; > {code} > * It appears that the CANCELLED comes from DiskIoMgr::AddScanRanges(). > * That function returned cancelled because a scanner thread noticed that the > scan was complete here and cancelled the RequestContext: > {code} > // Done with range and it completed successfully > if (progress_.done()) { > // All ranges are finished. Indicate we are done. > LOG(INFO) << runtime_state_->fragment_instance_id() << " All ranges > done " << (void*) this; > SetDone(); > break; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)