Paul Rogers created DRILL-7789:
----------------------------------

             Summary: Exchanges are slow on large systems & queries
                 Key: DRILL-7789
                 URL: https://issues.apache.org/jira/browse/DRILL-7789
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.16.0
            Reporter: Paul Rogers


A user with a moderate-sized cluster and query has experienced extreme slowness 
in exchanges: up to 11/12 of the time is spent waiting in one query, and 3/4 of 
the time in another. We suspect that exchanges are somehow serializing across 
the cluster.

Cluster:
 * Drill 1.16 (MapR version)
 * MapR-FS
 * Data stored in an 8 GB Parquet file, which unpacks to about 80 GB (20B records)
 * 4 Drillbits
 * Each node has 56 cores, 400 GB of memory
 * Drill queries run with 40 fragments (70% of CPU) and 80 GB of memory

The query is, essentially:

{noformat}
Parquet writer
- Hash Join
  - Scan
  - Window, Sort
  - Window, Sort
  - Hash Join
    - Scan
    - Scan
{noformat}

In the above, each line represents a fragment boundary. The plan includes mux 
exchanges between the two "lower" scans and the hash join.

The total query time is 6 hours. Of that, 30 minutes is spent working; the 
other 5.5 hours are spent waiting. (The 30 minutes is obtained by summing the 
"Avg Runtime" column in the profile.)
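As a sanity check, the numbers above work out to almost exactly the 11/12 wait fraction mentioned earlier (simple arithmetic on the reported figures, not data from the profile itself):

```python
# Rough arithmetic on the timings reported above.
total_min = 6 * 60          # total query time: 6 hours
work_min = 30               # summed "Avg Runtime" across fragments
wait_min = total_min - work_min

wait_fraction = wait_min / total_min
print(wait_min / 60, wait_fraction)   # 5.5 hours of waiting, ~11/12 of the time
```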

When checking resource usage with "top", we found that only a small amount of 
CPU was used. We should have seen about 4000% (40 cores), but we actually saw 
only around 300-400%. This again indicates that the query spent most of its 
time doing nothing: idle rather than using CPU.
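The two observations are consistent with each other: if the query is idle about 11/12 of the time, CPU usage should be roughly 1/12 of the 4000% maximum. A quick cross-check using the figures from this report:

```python
# Cross-check of the observed CPU usage against the wait-time figures above.
expected_pct = 40 * 100            # 40 fully busy fragments -> 4000% in "top"
idle_fraction = 11 / 12            # fraction of time spent waiting (see above)
predicted_pct = expected_pct * (1 - idle_fraction)
print(round(predicted_pct))        # ~333%, consistent with the observed 300-400%
```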

In particular, the sender spends about 5 hours waiting for the receiver, which 
in turn spends about 5 hours waiting for the sender. This pattern occurs in 
every exchange in the "main" data path (the 20B records).
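This mutual-wait pattern is what one would expect if the exchange degenerates into a lock-step handoff: with effectively no buffering, the sender blocks until the receiver drains a batch, and the receiver blocks until the sender produces one. A minimal sketch of that effect (illustrative only; this is not Drill's exchange code, and the queue size, batch count, and timings are invented):

```python
import queue
import threading
import time

# Hypothetical sender/receiver pair sharing a near-empty buffer. With a
# capacity of 1, the two sides proceed in lock step, and each accumulates
# blocking time waiting on the other even though both are doing real work.

N_BATCHES = 50
q = queue.Queue(maxsize=1)          # tiny buffer forces a strict handoff
wait = {"sender": 0.0, "receiver": 0.0}

def sender():
    for i in range(N_BATCHES):
        t0 = time.perf_counter()
        q.put(i)                    # blocks while the buffer is full
        wait["sender"] += time.perf_counter() - t0
        time.sleep(0.001)           # simulate producing the next batch

def receiver():
    for _ in range(N_BATCHES):
        t0 = time.perf_counter()
        q.get()                     # blocks while the buffer is empty
        wait["receiver"] += time.perf_counter() - t0
        time.sleep(0.001)           # simulate consuming the batch

threads = [threading.Thread(target=f) for f in (sender, receiver)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(wait)  # both sides accumulate blocking time despite doing real work
```

Larger buffers (or batches arriving faster than they are consumed) would decouple the two sides; the pathology appears when each side can only make progress in strict alternation with the other.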

As an experiment, the user disabled mux exchanges. The system became overloaded 
at 40 fragments per node, so parallelism was reduced to 20. Now the partition 
sender waited for the unordered receiver, and vice versa.

The original query incurred spilling. We hypothesized that the spilling caused 
delays which somehow rippled through the DAG. However, the user revised the 
query to eliminate spilling and to reduce the query to just the "bottom" hash 
join. The query ran for an hour, about 3/4 of which was again spent with 
senders and receivers waiting for each other.

We have eliminated a number of potential causes:

* System has sufficient memory
* MapR-FS file system has plenty of spindles and plenty of I/O capability
* Network is fast
* No other load on the nodes
* Query was simplified to the simplest possible form: a single join (with 
exchanges)
* If the query is simplified further (scan and write to Parquet, no join), it 
completes in just a few minutes: about as fast as the disk I/O rate.

The query profile does not provide sufficient information to dig further. The 
profile provides aggregate wait times, but does not, say, tell us which 
fragments wait for which other fragments for how long.

We believe that, if the exchange delays are fixed, the six-hour query should 
complete in less than half an hour, even with shuffles, spilling, reading from 
Parquet, and writing to Parquet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)