Hi Francois,
Thank you for the picture and the info you provided.
We will keep you updated and let you know when we make changes in a future
release.
Thanks,
Padma
> On Oct 26, 2016, at 6:06 PM, François Méthot wrote:
>
> Hi,
>
> Sorry it took so long, lost the origin
The mailing list does not seem to allow for images. Can you put the image
elsewhere (Github or Dropbox), and reply with a link to it?
- Sudheesh
> On Oct 19, 2016, at 5:37 PM, François Méthot wrote:
>
> We had problem on the 220 nodes cluster. No problem on the 12 nodes
We had problems on the 220-node cluster; no problems on the 12-node cluster.
I agree that the data may not be distributed evenly. It would be a long and
tedious process for me to produce a report.
Here is a drawing of the fragments overview before and after the change
of the affinity factor.
Hi Francois,
It would be good to understand how increasing affinity_factor helped in your
case, so we can document it better and use that knowledge to improve things in
future releases.
If you have two clusters, it is not clear whether you had the problem on the
12-node cluster
or the 220-node one.
We have a 12-node cluster and a 220-node cluster, but they do not talk
to each other, so Padma's analysis does not apply; thanks for your
comments anyway. Our goal had been to run Drill on the 220-node cluster after
it proved worthy of it on the small cluster.
planner.width.max_per_node was
Seems like you have 215 nodes, but the data for your query is present on only
12 nodes.
Drill tries to distribute the scan fragments across the cluster more uniformly
(trying to utilize all CPU resources).
That is why you have a lot of remote reads going on, and increasing the
affinity factor
I am surprised that it's not the default.
On Fri, Oct 14, 2016 at 11:18 AM, Sudheesh Katkam
wrote:
> Hi Francois,
>
> Thank you for posting your findings! Glad to see a 10X improvement.
>
> By increasing affinity factor, looks like Drill’s parallelizer is forced
> to
Hi Francois,
Thank you for posting your findings! Glad to see a 10X improvement.
By increasing the affinity factor, it looks like Drill’s parallelizer is forced
to assign fragments to nodes holding the data, i.e., with a strong preference
for data locality.
Regarding the random disconnection, I agree with your
Hi,
We finally got rid of this error. We had tried many, many things (like
modifying Drill to ignore the error!), but it ultimately came down to this
change:
from default
planner.affinity_factor=1.2
to
planner.affinity_factor=100
Basically, this encourages each fragment to care only about locally
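For anyone wanting to try the same change, Drill options like this one are set with ALTER SYSTEM (or ALTER SESSION for a single connection) from sqlline; a sketch, with the value taken from the message above:

```sql
-- Raise the affinity factor so the planner strongly prefers scheduling
-- scan fragments on the nodes that hold the data locally.
ALTER SYSTEM SET `planner.affinity_factor` = 100.0;

-- Inspect the current value (default is 1.2):
SELECT * FROM sys.options WHERE name = 'planner.affinity_factor';
```

ALTER SESSION is the safer first step, since it only affects the current connection and reverts when it closes.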
After the 30-second gap, all the Drill nodes receive the following:
2016-09-26 20:07:38,629 [Curator-ServiceCache-0] DEBUG Active drillbit set
changed. Now includes 220 total bits. New active drill bits:
...faulty node is not on the list...
2016-09-26 20:07:38,897 [Curator-ServiceCache-0]
Hi,
We have switched to 1.8 and we are still getting node disconnections.
We did many tests; we initially thought our standalone parquet converter
was generating parquet files with problematic data (like 10K-character
strings), but we were able to reproduce it with employee data from the
Hi Sudheesh,
If I add a selection filter so that no rows are returned, the same problem
occurs. I also simplified the query to include only a few integer columns.
That particular data repo is ~200+ billion records spread over ~50,000
parquet files.
We have other CSV data repos that are 100x smaller
One more interesting thing and another guess to resolve the problem,
> P.S.:
> We do see this also:
> 2016-09-19 14:48:23,444 [drill-executor-9] WARN
> o.a.d.exec.rpc.control.WorkEventBus - Fragment ..:1:2 not found in the
> work bus.
> 2016-09-19 14:48:23,444 [drill-executor-11] WARN
>
Hi Francois,
A simple query with only projections is not an “ideal” use case, since Drill is
bound by how fast the client can consume records. There are 1000 scanners
sending data to 1 client (vs. far fewer scanners sending data in the 12-node
case).
This might increase the load on the
Hi Sudheesh,
+ Does the query involve any aggregations or filters? Or is this a select
query with only projections?
Simple query with only projections
+ Any suspicious timings in the query profile?
Nothing especially different from our working query on our small cluster.
+ Any suspicious
Hi Francois,
More questions..
> + Can you share the query profile?
> I will sum it up:
> It is a select on 18 columns: 9 strings, 9 integers.
> Scan is done on 13862 parquet files spread over 1000 fragments.
> Fragments are spread across 215 nodes.
So ~5 leaf fragments (or scanners) per
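The "~5 leaf fragments per node" figure follows directly from the profile numbers quoted above; a quick back-of-envelope check (constants taken from the message, not measured here):

```python
# Figures from the query profile quoted in the thread.
parquet_files = 13862  # parquet files scanned by the query
fragments = 1000       # leaf (scan) fragments
nodes = 215            # nodes the fragments were spread across

fragments_per_node = fragments / nodes          # ~4.65, i.e. roughly 5 scanners per node
files_per_fragment = parquet_files / fragments  # ~13.9 parquet files per scanner

print(round(fragments_per_node), round(files_per_fragment))
```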
Hi Francois,
Few questions:
+ How many zookeeper servers in the quorum?
+ What is the load on atsqa4-133.qa.lab when this happens? Any other
applications running on that node? How many threads is the Drill process using?
+ When running the same query on 12 nodes, is the data size the same?
+ Can you