Anyone have any more thoughts on this? Any pointers on where I can start troubleshooting?
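In case it helps, this is roughly how the profiles can be pulled over the REST API to see which drillbit ended up as foreman and how the minor fragments were spread across nodes. It's only a rough sketch - the host name is a placeholder and the profile field names may differ slightly between Drill versions, hence the defensive .get() calls.

```python
# Rough sketch (untested): inspect query profiles via the Drill web UI's REST
# endpoints. Host name is a placeholder; field names may vary by Drill version.
import requests

DRILLBIT = "http://drillbit-host:8047"  # placeholder - any drillbit's web UI

# List recently completed queries.
profiles = requests.get(f"{DRILLBIT}/profiles.json").json()
finished = profiles.get("finishedQueries", [])
for q in finished[:5]:
    print(q.get("queryId"), q.get("foreman"), q.get("state"))

# Pull the full profile for the most recent query and count minor fragments
# per endpoint, to see whether one node is doing most of the work.
query_id = finished[0]["queryId"]
profile = requests.get(f"{DRILLBIT}/profiles/{query_id}.json").json()
print("foreman:", profile.get("foreman", {}).get("address"))

per_node = {}
for major in profile.get("fragmentProfile", []):
    for minor in major.get("minorFragmentProfile", []):
        node = minor.get("endpoint", {}).get("address", "unknown")
        per_node[node] = per_node.get(node, 0) + 1
print("minor fragments per node:", per_node)
```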
On Thu, Mar 26, 2015 at 4:13 PM, Adam Gilmore <[email protected]> wrote:

> So there are 5 Parquet files, each ~125 MB - not sure what I can provide re
> the block locations? I believe they're under the HDFS block size, so they
> should each be stored contiguously.
>
> I've tried setting the affinity factor to various values (1, 0, etc.) but
> nothing seems to change that. It always prefers certain nodes.
>
> Moreover, we added a stack more nodes and it started picking very specific
> nodes as foremen (perhaps 2-3 nodes out of 20 were always picked as
> foremen). Therefore, the foremen were being swamped with CPU while the
> other nodes were doing very little work.
>
> On Thu, Mar 26, 2015 at 12:12 PM, Steven Phillips <[email protected]> wrote:
>
>> Actually, I believe a query submitted through the REST interface will
>> instantiate a DrillClient, which uses the same ZKClusterCoordinator that
>> sqlline uses, so the foreman for the query is not necessarily on the same
>> drillbit the query was submitted to. But I'm still not sure it's related
>> to DRILL-2512.
>>
>> I'll wait for your additional info before speculating further.
>>
>> On Wed, Mar 25, 2015 at 6:54 PM, Adam Gilmore <[email protected]> wrote:
>>
>>> We actually set up a separate load balancer for port 8047 (we're
>>> submitting these queries via the REST API at the moment), so Zookeeper
>>> etc. is out of the equation, and I doubt we're hitting DRILL-2512.
>>>
>>> When shutting down the "troublesome" drillbit, it starts parallelizing
>>> much more nicely again. We even added 10+ nodes to the cluster, and as
>>> long as that particular drillbit is shut down, it distributes very
>>> nicely. The minute we start the drillbit on that node again, it starts
>>> swamping it with work.
>>>
>>> I'll shoot through the JSON profiles and some more information on the
>>> dataset etc. later today (Australian time!).
>>>
>>> On Thu, Mar 26, 2015 at 5:31 AM, Steven Phillips <[email protected]> wrote:
>>>
>>>> I didn't notice at first that Adam said "no matter who the foreman is".
>>>>
>>>> Another suspicion I have is that our current logic for assigning work
>>>> will assign to the exact same nodes every time we query a particular
>>>> table. Changing the affinity factor may change the assignment, but it
>>>> will still be the same every time. That is my suspicion, but I am not
>>>> sure why shutting down the drillbit would improve performance. I would
>>>> expect that shutting down the drillbit would result in a different
>>>> drillbit becoming the hotspot.
>>>>
>>>> On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <[email protected]> wrote:
>>>>
>>>>> On Steven's point, the node that the client connects to is not
>>>>> currently randomized. Given your description of the behavior, I'm not
>>>>> sure whether you're hitting 2512 or just general undesirable
>>>>> distribution.
>>>>>
>>>>> On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <[email protected]> wrote:
>>>>>
>>>>>> This is a known issue:
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/DRILL-2512
>>>>>>
>>>>>> On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <[email protected]> wrote:
>>>>>>
>>>>>>> What version of Drill are you running?
>>>>>>>
>>>>>>> Any hints when looking at the query profiles? Is the node that is
>>>>>>> being hammered the foreman for the queries, and are most of the major
>>>>>>> fragments tied to the foreman?
>>>>>>>
>>>>>>> —Andries
>>>>>>>
>>>>>>> On Mar 25, 2015, at 12:00 AM, Adam Gilmore <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> I'm trying to understand how this could be possible. I have a Hadoop
>>>>>>>> cluster set up with a name node and two data nodes. All have
>>>>>>>> identical specs in terms of CPU/RAM etc.
>>>>>>>>
>>>>>>>> The two data nodes have a replicated HDFS setup where I'm storing
>>>>>>>> some Parquet files.
>>>>>>>>
>>>>>>>> A Drill cluster (with Zookeeper) is running with Drillbits on all
>>>>>>>> three servers.
>>>>>>>>
>>>>>>>> When I submit a query to *any* of the Drillbits, no matter who the
>>>>>>>> foreman is, one particular data node gets picked to do the vast
>>>>>>>> majority of the work.
>>>>>>>>
>>>>>>>> We've even added three more task nodes to the cluster, and everything
>>>>>>>> still puts a huge load on that one particular server.
>>>>>>>>
>>>>>>>> There is nothing unique about this data node. HDFS is fully
>>>>>>>> replicated (no unreplicated blocks) to the other data node.
>>>>>>>>
>>>>>>>> I know that Drill tries to get data locality, so I'm wondering if
>>>>>>>> this is the cause, but it is essentially swamping this data node
>>>>>>>> with 100% CPU usage while leaving the others barely doing any work.
>>>>>>>>
>>>>>>>> As soon as we shut down the Drillbit on this data node, query
>>>>>>>> performance increases significantly.
>>>>>>>>
>>>>>>>> Any thoughts on how I can troubleshoot why Drill is picking that
>>>>>>>> particular node?
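P.S. Re the block locations question in the thread above: a quick way to confirm where HDFS actually placed the Parquet blocks is `hdfs fsck` with `-files -blocks -locations`. With ~125 MB files and the default 128 MB block size, each file should be a single block, so any placement skew would show up immediately. A rough sketch below - the path is a placeholder and the exact datanode line format in the fsck output varies by Hadoop version.

```python
# Rough sketch: count replica locations reported by `hdfs fsck` to rule out
# skewed block placement. Path is a placeholder.
import re
import subprocess
from collections import Counter

out = subprocess.run(
    ["hdfs", "fsck", "/data/parquet", "-files", "-blocks", "-locations"],
    stdout=subprocess.PIPE, universal_newlines=True,
).stdout

# Count how often each datanode ip:port appears as a replica holder.
replica_hosts = Counter(re.findall(r"(\d{1,3}(?:\.\d{1,3}){3}):\d+", out))
print(replica_hosts)
```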

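And for completeness, adjusting the affinity factor through the same REST endpoint the queries already go through can look something like the sketch below. This assumes the option is named `planner.affinity_factor` (worth confirming against sys.options first); the host is a placeholder.

```python
# Rough sketch: check and change the affinity factor via the Drill REST query
# endpoint. Option name and host are assumptions to verify locally.
import requests

DRILLBIT = "http://drillbit-host:8047"  # placeholder

def run_sql(sql):
    resp = requests.post(f"{DRILLBIT}/query.json",
                         json={"queryType": "SQL", "query": sql})
    resp.raise_for_status()
    return resp.json()

# Check the current value, then raise it so data locality is weighted more
# heavily in fragment assignment.
print(run_sql("SELECT * FROM sys.options WHERE name LIKE '%affinity%'"))
print(run_sql("ALTER SYSTEM SET `planner.affinity_factor` = 4.0"))
```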