Adam,

Could you give more info regarding the dataset, including:

number and size of parquet files
block locations of the parquet files
drillbit hosts

If you could send the profile json files for a couple of queries, that could
be helpful too.

On Wed, Mar 25, 2015 at 11:23 AM, Jacques Nadeau <[email protected]> wrote:

> Adam,
>
> There is actually an option to control how much Drill uses locality versus
> distribution. Not sure if that is influencing you but it could be. If so,
> you can decrease the value to increase the importance of distribution. The
> option is `planner.affinity_factor`.
>
> On Wed, Mar 25, 2015 at 12:00 AM, Adam Gilmore <[email protected]> wrote:
>
> > Hi guys,
> >
> > I'm trying to understand how this could be possible. I have a Hadoop
> > cluster of a name node and two data nodes set up. All have identical
> > specs in terms of CPU/RAM etc.
> >
> > The two data nodes have a replicated HDFS setup where I'm storing some
> > Parquet files.
> >
> > A Drill cluster (with Zookeeper) is running with Drillbits on all three
> > servers.
> >
> > When I submit a query to *any* of the Drillbits, no matter who the
> > foreman is, one particular data node gets picked to do the vast majority
> > of the work.
> >
> > We've even added three more task nodes to the cluster and everything
> > still puts a huge load on one particular server.
> >
> > There is nothing unique about this data node. HDFS is fully replicated
> > (no unreplicated blocks) to the other data node.
> >
> > I know that Drill tries to get data locality, so I'm wondering if this is
> > the cause, but this is essentially swamping this data node with 100% CPU
> > usage while leaving the others barely doing any work.
> >
> > As soon as we shut down the Drillbit on this data node, query performance
> > increases significantly.
> >
> > Any thoughts on how I can troubleshoot why Drill is picking that
> > particular node?

--
Steven Phillips
Software Engineer
mapr.com
