Anyone have any more thoughts on this? Any pointers on where I can start troubleshooting?
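In case it helps, this is roughly how the profiles can be pulled over the REST API to see which drillbit ended up as foreman and how the minor fragments were spread across nodes. It's only a rough sketch - the host name is a placeholder and the profile field names may differ slightly between Drill versions, hence the defensive .get() calls.

```python
# Rough sketch (untested): inspect query profiles via the Drill web UI's REST
# endpoints. Host name is a placeholder; field names may vary by Drill version.
import requests

DRILLBIT = "http://drillbit-host:8047"  # placeholder - any drillbit's web UI

# List recently completed queries.
profiles = requests.get(f"{DRILLBIT}/profiles.json").json()
finished = profiles.get("finishedQueries", [])
for q in finished[:5]:
    print(q.get("queryId"), q.get("foreman"), q.get("state"))

# Pull the full profile for the most recent query and count minor fragments
# per endpoint, to see whether one node is doing most of the work.
query_id = finished[0]["queryId"]
profile = requests.get(f"{DRILLBIT}/profiles/{query_id}.json").json()
print("foreman:", profile.get("foreman", {}).get("address"))

per_node = {}
for major in profile.get("fragmentProfile", []):
    for minor in major.get("minorFragmentProfile", []):
        node = minor.get("endpoint", {}).get("address", "unknown")
        per_node[node] = per_node.get(node, 0) + 1
print("minor fragments per node:", per_node)
```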
On Thu, Mar 26, 2015 at 4:13 PM, Adam Gilmore <[email protected]> wrote:

> So there are 5 Parquet files, each ~125 MB - not sure what I can provide re
> the block locations? I believe they're under the HDFS block size, so they
> should each be stored contiguously.
>
> I've tried setting the affinity factor to various values (1, 0, etc.) but
> nothing seems to change that. It always prefers certain nodes.
>
> Moreover, we added a stack more nodes and it started picking very specific
> nodes as foremen (perhaps 2-3 nodes out of 20 were always picked as
> foremen). Therefore, the foremen were being swamped with CPU while the
> other nodes were doing very little work.
>
> On Thu, Mar 26, 2015 at 12:12 PM, Steven Phillips <[email protected]> wrote:
>
>> Actually, I believe a query submitted through the REST interface will
>> instantiate a DrillClient, which uses the same ZKClusterCoordinator that
>> sqlline uses, so the foreman for the query is not necessarily on the same
>> drillbit the query was submitted to. But I'm still not sure it's related
>> to DRILL-2512.
>>
>> I'll wait for your additional info before speculating further.
>>
>> On Wed, Mar 25, 2015 at 6:54 PM, Adam Gilmore <[email protected]> wrote:
>>
>>> We actually set up a separate load balancer for port 8047 (we're
>>> submitting these queries via the REST API at the moment), so Zookeeper
>>> etc. is out of the equation, and I doubt we're hitting DRILL-2512.
>>>
>>> When shutting down the "troublesome" drillbit, it starts parallelizing
>>> much more nicely again. We even added 10+ nodes to the cluster, and as
>>> long as that particular drillbit is shut down, it distributes very
>>> nicely. The minute we start the drillbit on that node again, it starts
>>> swamping it with work.
>>>
>>> I'll shoot through the JSON profiles and some more information on the
>>> dataset etc. later today (Australian time!).
>>>
>>> On Thu, Mar 26, 2015 at 5:31 AM, Steven Phillips <[email protected]> wrote:
>>>
>>>> I didn't notice at first that Adam said "no matter who the foreman is".
>>>>
>>>> Another suspicion I have is that our current logic for assigning work
>>>> will assign to the exact same nodes every time we query a particular
>>>> table. Changing the affinity factor may change the assignment, but it
>>>> will still be the same every time. That is my suspicion, but I am not
>>>> sure why shutting down the drillbit would improve performance. I would
>>>> expect that shutting down the drillbit would result in a different
>>>> drillbit becoming the hotspot.
>>>>
>>>> On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <[email protected]> wrote:
>>>>
>>>>> On Steven's point, the node that the client connects to is not
>>>>> currently randomized. Given your description of the behavior, I'm not
>>>>> sure whether you're hitting 2512 or just general undesirable
>>>>> distribution.
>>>>>
>>>>> On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <[email protected]> wrote:
>>>>>
>>>>>> This is a known issue:
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/DRILL-2512
>>>>>>
>>>>>> On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <[email protected]> wrote:
>>>>>>
>>>>>>> What version of Drill are you running?
>>>>>>>
>>>>>>> Any hints when looking at the query profiles? Is the node that is
>>>>>>> being hammered the foreman for the queries, and are most of the major
>>>>>>> fragments tied to the foreman?
>>>>>>>
>>>>>>> —Andries
>>>>>>>
>>>>>>> On Mar 25, 2015, at 12:00 AM, Adam Gilmore <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> I'm trying to understand how this could be possible. I have a Hadoop
>>>>>>>> cluster set up with a name node and two data nodes. All have
>>>>>>>> identical specs in terms of CPU/RAM etc.
>>>>>>>>
>>>>>>>> The two data nodes have a replicated HDFS setup where I'm storing
>>>>>>>> some Parquet files.
>>>>>>>>
>>>>>>>> A Drill cluster (with Zookeeper) is running with Drillbits on all
>>>>>>>> three servers.
>>>>>>>>
>>>>>>>> When I submit a query to *any* of the Drillbits, no matter who the
>>>>>>>> foreman is, one particular data node gets picked to do the vast
>>>>>>>> majority of the work.
>>>>>>>>
>>>>>>>> We've even added three more task nodes to the cluster, and everything
>>>>>>>> still puts a huge load on that one particular server.
>>>>>>>>
>>>>>>>> There is nothing unique about this data node. HDFS is fully
>>>>>>>> replicated (no unreplicated blocks) to the other data node.
>>>>>>>>
>>>>>>>> I know that Drill tries to get data locality, so I'm wondering if
>>>>>>>> this is the cause, but it is essentially swamping this data node
>>>>>>>> with 100% CPU usage while leaving the others barely doing any work.
>>>>>>>>
>>>>>>>> As soon as we shut down the Drillbit on this data node, query
>>>>>>>> performance increases significantly.
>>>>>>>>
>>>>>>>> Any thoughts on how I can troubleshoot why Drill is picking that
>>>>>>>> particular node?
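P.S. Re the block locations question in the thread above: a quick way to confirm where HDFS actually placed the Parquet blocks is `hdfs fsck` with `-files -blocks -locations`. With ~125 MB files and the default 128 MB block size, each file should be a single block, so any placement skew would show up immediately. A rough sketch below - the path is a placeholder and the exact datanode line format in the fsck output varies by Hadoop version.

```python
# Rough sketch: count replica locations reported by `hdfs fsck` to rule out
# skewed block placement. Path is a placeholder.
import re
import subprocess
from collections import Counter

out = subprocess.run(
    ["hdfs", "fsck", "/data/parquet", "-files", "-blocks", "-locations"],
    stdout=subprocess.PIPE, universal_newlines=True,
).stdout

# Count how often each datanode ip:port appears as a replica holder.
replica_hosts = Counter(re.findall(r"(\d{1,3}(?:\.\d{1,3}){3}):\d+", out))
print(replica_hosts)
```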

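And for completeness, adjusting the affinity factor through the same REST endpoint the queries already go through can look something like the sketch below. This assumes the option is named `planner.affinity_factor` (worth confirming against sys.options first); the host is a placeholder.

```python
# Rough sketch: check and change the affinity factor via the Drill REST query
# endpoint. Option name and host are assumptions to verify locally.
import requests

DRILLBIT = "http://drillbit-host:8047"  # placeholder

def run_sql(sql):
    resp = requests.post(f"{DRILLBIT}/query.json",
                         json={"queryType": "SQL", "query": sql})
    resp.raise_for_status()
    return resp.json()

# Check the current value, then raise it so data locality is weighted more
# heavily in fragment assignment.
print(run_sql("SELECT * FROM sys.options WHERE name LIKE '%affinity%'"))
print(run_sql("ALTER SYSTEM SET `planner.affinity_factor` = 4.0"))
```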