I think the answers are:

1. It depends on your workload and your network. I know of users who run with
ONLY remote reads and still get performance they are happy with. Your
existing nodes will continue to be able to do short-circuit reads.

2. This is highly workload-dependent. You want to avoid spilling, obviously,
but a 600GB disk gives you quite a bit of headroom: if your spinning disk
can write 200MB/s, it would take 3000 seconds, which is 50 minutes, to fill
up.
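The back-of-the-envelope math above can be sketched as follows (a minimal sketch; the 600GB capacity and 200MB/s write speed are the figures from this thread, and your actual disk throughput will vary):

```python
# Estimate how long sustained spill writes would take to fill a disk.
# Numbers are the example figures from the thread, not measurements.
disk_capacity_gb = 600     # OS disk size mentioned in the question
write_speed_mb_s = 200     # assumed sustained write speed of a spinning disk

seconds_to_fill = disk_capacity_gb * 1000 / write_speed_mb_s
minutes_to_fill = seconds_to_fill / 60
print(f"{seconds_to_fill:.0f} s (~{minutes_to_fill:.0f} min) to fill the disk")
```

In practice spilling is bursty rather than a sustained stream, so this is a worst-case lower bound on how quickly the disk could fill.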

3. I think the impalads are smart enough not to attempt a short-circuit
read on data that isn't local, so you shouldn't need to disable it.

On Tue, Apr 3, 2018 at 10:22 AM, Fawze Abujaber <fawz...@gmail.com> wrote:

> Hi All,
>
> I have reached a point in my cluster where I don't need more storage for
> HDFS but I do need to add processing power. I'm using YARN, Spark and
> Impala on the existing nodes for processing.
>
> My questions:
>
> 1- How much will data locality impact Impala performance? As I understand
> it, Impala relies on data locality in its processing.
>
> 2- I have an OS disk with 600GB; will this be enough to use for spilling
> to disk when needed? Does it depend on other factors? The Impala daemon
> memory limit is 35GB.
>
> 3- Should I disable *HDFS Short Circuit Read* on these nodes?
>
> I will be happy to get more recommendations on this.
>
> --
> Take Care
> Fawze Abujaber
>
