[ 
https://issues.apache.org/jira/browse/IMPALA-6564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-6564.
-----------------------------------
    Resolution: Not A Bug

It turns out that the issue was introduced in one of my patches:
https://gerrit.cloudera.org/#/c/8707/32/be/src/runtime/io/disk-io-mgr.cc

This early return at the top of DiskIoMgr::AddScanRanges() was saving us when there were no ranges to add:
{noformat}
Status DiskIoMgr::AddScanRanges(RequestContext* reader,
    const vector<ScanRange*>& ranges, bool schedule_immediately) {
  if (ranges.empty()) return Status::OK();
{noformat}
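
A minimal sketch of why that early return matters, assuming (as the quoted description below suggests) that AddScanRanges() reports CANCELLED once the RequestContext has been cancelled. The types here are simplified stand-ins rather than Impala's real classes, and the early_return_on_empty parameter and CheckSketch() helper are hypothetical, added only to compare the two behaviours:

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-ins for illustration only; these are not
// Impala's real classes.
enum class Code { OK, CANCELLED };

struct Status {
  Code code;
  explicit Status(Code c) : code(c) {}
  bool ok() const { return code == Code::OK; }
};

struct ScanRange {};

struct RequestContext {
  // Set once a scanner thread has decided the scan is complete.
  bool cancelled = false;
};

// Models AddScanRanges(). 'early_return_on_empty' (hypothetical) toggles the
// check quoted above so both behaviours can be compared.
Status AddScanRanges(const RequestContext& reader,
    const std::vector<ScanRange*>& ranges, bool early_return_on_empty) {
  if (early_return_on_empty && ranges.empty()) return Status(Code::OK);
  // Assumption for this sketch: a cancelled context rejects new ranges.
  if (reader.cancelled) return Status(Code::CANCELLED);
  return Status(Code::OK);
}

// Scenario from the report: every file was pruned, so there are no ranges,
// and a scanner thread has already marked the scan done.
void CheckSketch() {
  RequestContext reader;
  reader.cancelled = true;
  std::vector<ScanRange*> no_ranges;
  assert(AddScanRanges(reader, no_ranges, true).ok());    // check present
  assert(!AddScanRanges(reader, no_ranges, false).ok());  // check absent
}
```

With the check in place, the empty-ranges call never looks at the cancelled context, so the spurious CANCELLED status cannot reach the query.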

> Queries randomly fail with "CANCELLED" due to a race with IssueInitialRanges()
> ------------------------------------------------------------------------------
>
>                 Key: IMPALA-6564
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6564
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.12.0
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Blocker
>              Labels: flaky
>
> I've been chasing a flaky test that I saw in test_basic_runtime_filters when 
> running against https://gerrit.cloudera.org/#/c/8966/ (the scanner buffer 
> pool changes).
> I think it is a latent bug that has started reproducing more frequently. What 
> I've found is:
> * Different queries fail with CANCELLED. I can repro it on my branch ~3/4 
> times by running: impala-py.test tests/query_test/test_runtime_filters.py -n8 --verbose --maxfail 1 -k basic
> It happens with a variety of queries and file formats.
> * It seems to happen when all files are pruned out by runtime filters
> * Logging reveals IssueInitialRanges() fails with a CANCELLED status, which 
> propagates up to the query status:
> {code}
>   if (!initial_ranges_issued_) {
>     // We do this in GetNext() to maximise the amount of work we can do while waiting for
>     // runtime filters to show up. The scanner threads have already started (in Open()),
>     // so we need to tell them there is work to do.
>     // TODO: This is probably not worth splitting the organisational cost of splitting
>     // initialisation across two places. Move to before the scanner threads start.
>     Status status = IssueInitialScanRanges(state);
>     if (!status.ok()) {
>       LOG(INFO) << runtime_state_->fragment_instance_id()
>                 << " IssueInitialRanges() failed with status: " << status.GetDetail()
>                 << " " << (void*) this;
>     }
> {code}
> * It appears that the CANCELLED status comes from DiskIoMgr::AddScanRanges().
> * That function returned CANCELLED because a scanner thread noticed that the 
> scan was complete here and cancelled the RequestContext:
> {code}
>     // Done with range and it completed successfully
>     if (progress_.done()) {
>       // All ranges are finished.  Indicate we are done.
>       LOG(INFO) << runtime_state_->fragment_instance_id() << " All ranges done "
>                 << (void*) this;
>       SetDone();
>       break;
>     }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
