I think the answers are:

1. It depends on your workload and your network. I know some users run with ONLY remote reads and still get performance they are happy with. Your existing nodes will continue to be able to short-circuit read.

2. This is highly workload-dependent. You want to try to avoid spilling, obviously, but if your spinning disk can write 200 MB/s it would take 3000 seconds, which is 50 minutes, to fill up 600 GB.

3. I think the impalads are smart enough not to attempt a short-circuit read on data that isn't local.

On Tue, Apr 3, 2018 at 10:22 AM, Fawze Abujaber <fawz...@gmail.com> wrote:

> Hi All,
>
> I have reached a point in my cluster where I don't need more storage for
> HDFS and I need to add processing power. I'm using YARN, Spark, and
> Impala on the normal nodes for processing.
>
> My questions:
>
> 1- How much will data locality impact Impala performance? As I understand
> it, Impala relies on data locality in its processing.
>
> 2- I have an OS disk with 600 GB; will this be enough for spilling to
> disk when needed? Does it depend on other factors? The Impala daemon
> memory limit is 35 GB.
>
> 3- Should I disable *HDFS Short Circuit Read* on these nodes?
>
> I'd be happy to get more recommendations on this.
>
> --
> Take Care
> Fawze Abujaber
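P.S. The fill-time estimate in answer 2 can be sketched in a few lines of Python. This assumes a single disk whose sequential write throughput is the bottleneck; real spill rates vary with query concurrency and I/O patterns, and the 200 MB/s figure is just the illustrative number from the answer above:

```python
# Rough estimate of how long Impala spill-to-disk would take to fill a disk,
# assuming sequential write throughput is the only limiting factor.

def fill_time_seconds(disk_gb: float, write_mb_per_s: float) -> float:
    """Seconds to fill disk_gb gigabytes at write_mb_per_s MB/s (1 GB = 1000 MB)."""
    return disk_gb * 1000 / write_mb_per_s

# The 600 GB OS disk from the question, at an assumed 200 MB/s:
secs = fill_time_seconds(600, 200)
print(f"{secs:.0f} s = {secs / 60:.0f} min")  # 3000 s = 50 min
```

In practice you would also subtract the space the OS and logs already occupy before treating the remainder as scratch space.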