You can find documentation on the --default_query_options flag here: https://impala.apache.org/docs/build/html/topics/impala_config_options.html
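As a rough illustration of the global case, here is a small Python sketch that renders a set of query options as the startup flag string. It assumes the flag takes comma-separated key=value pairs, which is my understanding of the format and worth checking against the page above. The helper name build_default_query_options_flag is made up for this example and is not part of Impala.

```python
def build_default_query_options_flag(options):
    """Render a dict of query options as an impalad startup flag string.

    Assumption: the flag value is a comma-separated list of key=value
    pairs. This helper is hypothetical and only formats a string; it
    does not talk to Impala.
    """
    for name, value in options.items():
        text = f"{name}{value}"
        # Reject characters that would make the rendered flag ambiguous.
        if any(ch in text for ch in ",= "):
            raise ValueError(f"unsupported character in option {name!r}")
    pairs = ",".join(f"{name}={value}" for name, value in options.items())
    return f"--default_query_options={pairs}"


if __name__ == "__main__":
    # Disable locality-based scheduling for every query by default.
    print(build_default_query_options_flag({"replica_preference": "remote"}))
    # prints: --default_query_options=replica_preference=remote
```

The resulting string would go into the impalad startup arguments on every daemon. Per-pool defaults are configured through the resource pool settings instead, so this sketch only covers the global case.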
Keep in mind that setting replica_preference to REMOTE will make Impala ignore any locality when deciding where to schedule a read. Even within the group of impalads that have local storage attached, Impala will pick a randomized assignment, optimizing for the number of bytes read by each node. There is currently no logic to schedule a fraction of the reads locally and assign the rest to remote impalads (such a scenario wasn't part of the considerations when working on the scheduler).

On Thu, Apr 19, 2018 at 9:47 AM, Fawze Abujaber <fawz...@gmail.com> wrote:

> Thanks Tim for your quick response as usual,
>
> Can you send me a documentation how to do that or send me detail example
> how to do that globally and per pool ...
>
> Again much appreciate your readiness to help
>
> On Thu, 19 Apr 2018 at 19:43 Tim Armstrong <tarmstr...@cloudera.com>
> wrote:
>
>> We have a way to set global and per-pool defaults for query options. You
>> can set default query options via the --default_query_options startup flag
>> or if you have resource pools set up, you can set default query option
>> values for queries submitted to each resource pool (including the default
>> pool)
>>
>> On Tue, Apr 17, 2018 at 3:27 AM, Fawze Abujaber <fawz...@gmail.com>
>> wrote:
>>
>>> Thanks Tim,
>>>
>>> That means that i cannot disable this across the impala cluster and i
>>> need to manage this at the query level, right?
>>>
>>> Is it any configuration at the cluster level to disable this?
>>>
>>> On Wed, Apr 4, 2018 at 3:44 AM, Tim Armstrong <tarmstr...@cloudera.com>
>>> wrote:
>>>
>>>> I agree with Jim's answers.
>>>>
>>>> You may run into challenges if you have some Impala daemons that have
>>>> local DataNodes and some that do not have local DataNodes. By default
>>>> Impala always chooses a daemon with a local copy of the data, which would
>>>> mean that daemons without a co-located DataNode might never get fragments
>>>> scheduled on them. We do have a knob that lets you disable locality-based
>>>> scheduling
>>>> https://impala.apache.org/docs/build/html/topics/impala_replica_preference.html
>>>> but that may be too blunt an instrument.
>>>>
>>>> On Tue, Apr 3, 2018 at 11:34 AM, Jim Apple <jbap...@cloudera.com>
>>>> wrote:
>>>>
>>>>> I think the answers are:
>>>>>
>>>>> 1. It depends on your workload and your network. I know some users run
>>>>> with ONLY remote reads and still get performance they are happy with. Your
>>>>> existing nodes will continue to be able to short-circuit read.
>>>>>
>>>>> 2. This is highly workload-dependent. You want to try and avoid
>>>>> spilling, obviously, but if your spinning disk can write 200MB/s it would
>>>>> take 3000 seconds, which is 50 minutes, to fill up.
>>>>>
>>>>> 3. I think the impalads are smart enough to not try and do a
>>>>> short-circuit read on data that isn't local.
>>>>>
>>>>> On Tue, Apr 3, 2018 at 10:22 AM, Fawze Abujaber <fawz...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I have reached a point in my cluster that i don't need more storage
>>>>>> for the HDFS and i need to add processing power, i'm using Yarn,Spark and
>>>>>> Impala on the normal nodes for processing.
>>>>>>
>>>>>> My questions:
>>>>>>
>>>>>> 1- How much the data locality will impact impala performance as i
>>>>>> know impala rely on data locality on its processing?
>>>>>>
>>>>>> 2- I have OS disk with 600GB, will this be enough to be used to spill
>>>>>> to disk when needed? is it dependent on other factors, the impala daemon
>>>>>> memory limit is 35GB.
>>>>>>
>>>>>> 3- Should i disable the *HDFS Short Circuit Read* on these nodes?
>>>>>>
>>>>>> Will be happy to get more recommendation on this ....
>>>>>>
>>>>>> --
>>>>>> Take Care
>>>>>> Fawze Abujaber
>>>
>>> --
>>> Take Care
>>> Fawze Abujaber
>
> --
> Take Care
> Fawze Abujaber