Hi guys, I'm trying to understand how this could be possible. I have a Hadoop cluster set up with one name node and two data nodes. All three have identical specs in terms of CPU, RAM, etc.
The two data nodes have a replicated HDFS setup where I'm storing some Parquet files. A Drill cluster (with Zookeeper) is running with Drillbits on all three servers. When I submit a query to *any* of the Drillbits, regardless of which one acts as the foreman, one particular data node gets picked to do the vast majority of the work. We've even added three more task nodes to the cluster, and every query still puts a huge load on that one server. There is nothing unique about this data node, and HDFS is fully replicated to the other data node (no under-replicated blocks). I know that Drill tries to achieve data locality, so I'm wondering if that is the cause, but the effect is that this data node is swamped at 100% CPU usage while the others are barely doing any work. As soon as we shut down the Drillbit on this data node, query performance improves significantly. Any thoughts on how I can troubleshoot why Drill is picking that particular node?
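One check that might help rule out skewed block placement as the cause is to count how many block replicas each data node holds for the Parquet files. The following is only a rough sketch: it assumes the output of `hdfs fsck <path> -files -blocks -locations` has been saved to a text file, and that replica locations show up as `ip:port` on the lines containing a block ID (the exact layout varies between Hadoop versions). The function name `replicas_per_node` is made up for illustration.

```python
import re
import sys
from collections import Counter

# Assumption: each replica location is printed as "ip:port" on a block line.
REPLICA = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}):\d+")


def replicas_per_node(fsck_text):
    """Count block replicas per data node IP in saved fsck output."""
    counts = Counter()
    for line in fsck_text.splitlines():
        # Only lines describing a block (they contain "blk_") list replicas.
        if "blk_" not in line:
            continue
        counts.update(REPLICA.findall(line))
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python replicas.py saved_fsck_output.txt
    with open(sys.argv[1]) as f:
        for host, n in replicas_per_node(f.read()).most_common():
            print(host, n)
```

If the counts come out roughly even across both data nodes, block placement is probably not what steers the work to one server, and the per-fragment breakdown in the Drill query profile would be the next thing to look at.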
