Adam,

Could you give more info regarding the dataset, including:

- number and size of parquet files
- block locations of the parquet files
- drillbit hosts

If you could send the profile JSON files for a couple of queries, that
would be helpful too.
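
For the block locations, `hdfs fsck <path> -files -blocks -locations` lists the datanodes holding each block. Below is a rough sketch for tallying replicas per datanode from that output. It assumes block lines contain `blk_` and that datanode addresses appear as `ip:port`; the function name and the sample text are made up for illustration and are not real cluster output.

```python
import re
from collections import Counter

# Matches a datanode address such as 10.0.0.11:50010 (IPv4 address and port).
# Assumption: fsck prints datanodes in this ip:port form.
DATANODE_ADDR = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3}):\d+\b")


def replicas_per_host(fsck_output: str) -> Counter:
    """Count block replicas per datanode IP in saved fsck output."""
    counts = Counter()
    for line in fsck_output.splitlines():
        # Only block lines carry a replica list; assume they contain "blk_".
        if "blk_" not in line:
            continue
        counts.update(DATANODE_ADDR.findall(line))
    return counts


if __name__ == "__main__":
    # Hypothetical sample text, not taken from a real cluster.
    sample = """\
/data/part-0.parquet 268435456 bytes, 2 block(s):  OK
0. blk_1001_1 len=134217728 repl=2 [10.0.0.11:50010, 10.0.0.12:50010]
1. blk_1002_1 len=134217728 repl=2 [10.0.0.11:50010, 10.0.0.12:50010]
"""
    for host, n in replicas_per_host(sample).most_common():
        print(host, n)
```

Save the fsck output for the table directory to a file and pass its contents to `replicas_per_host`; the per-host counts can then be compared against the list of drillbit hosts.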

On Wed, Mar 25, 2015 at 11:23 AM, Jacques Nadeau <[email protected]> wrote:

> Adam,
>
> There is actually an option to control how much Drill weighs locality versus
> distribution.  I'm not sure that is what's affecting you, but it could be.
> If so, you can decrease the value to increase the importance of
> distribution.  The option is `planner.affinity_factor`.
>
>
>
> On Wed, Mar 25, 2015 at 12:00 AM, Adam Gilmore <[email protected]>
> wrote:
>
> > Hi guys,
> >
> > I'm trying to understand how this could be possible.  I have a Hadoop
> > cluster set up with a name node and two data nodes.  All have identical
> > specs in terms of CPU/RAM etc.
> >
> > The two data nodes have a replicated HDFS setup where I'm storing some
> > Parquet files.
> >
> > A Drill cluster (with Zookeeper) is running with Drillbits on all three
> > servers.
> >
> > When I submit a query to *any* of the Drillbits, no matter who the
> > foreman is, one particular data node gets picked to do the vast majority
> > of the work.
> >
> > We've even added three more task nodes to the cluster and everything
> > still puts a huge load on one particular server.
> >
> > There is nothing unique about this data node.  HDFS is fully replicated
> > (no unreplicated blocks) to the other data node.
> >
> > I know that Drill tries to get data locality, so I'm wondering if this is
> > the cause, but this is essentially swamping this data node with 100% CPU
> > usage while leaving the others barely doing any work.
> >
> > As soon as we shut down the Drillbit on this data node, query performance
> > increases significantly.
> >
> > Any thoughts on how I can troubleshoot why Drill is picking that
> > particular node?
> >
>



-- 
 Steven Phillips
 Software Engineer

 mapr.com
