Oops, sorry for the confusion. I meant allocating "20% of the size of your
input data set" to Alluxio as its memory resource, as the starting point.
After that, you can check the cache hit ratio of the Alluxio storage space
using the metrics collected in the Alluxio web UI
<http://www.alluxio.org/docs/1.8/en/basic/Web-Interface.html#master-metrics>.
If you see a low hit ratio, increase the Alluxio storage size, and vice versa.
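
As a rough illustration of that sizing loop, here is a minimal Python sketch. Only the 20% starting point and the 10-node cluster come from this thread; the 2000 GB data set size, the hit-ratio thresholds (0.5 and 0.9), the 1.25x step, and the 64 GB per-worker cap are made-up numbers for the example, and both function names are hypothetical helpers, not part of Alluxio.

```python
def initial_alluxio_memory_gb(dataset_gb, num_workers, fraction=0.20):
    """Per-worker Alluxio memory when `fraction` of the data set is cached cluster-wide."""
    return dataset_gb * fraction / num_workers


def next_alluxio_memory_gb(current_gb, hit_ratio, low=0.5, high=0.9,
                           step=1.25, max_gb=None):
    """Suggest the next per-worker size from the observed cache hit ratio.

    The thresholds and step are assumptions for illustration, not Alluxio defaults.
    """
    if hit_ratio < low:
        # Low hit ratio: grow the Alluxio space, up to an optional cap.
        grown = current_gb * step
        return grown if max_gb is None else min(grown, max_gb)
    if hit_ratio > high:
        # High hit ratio: some memory can probably go back to the Spark executors.
        return current_gb / step
    # In between: keep the current size.
    return current_gb


# Assumed example: a 2000 GB input data set on the 10-node cluster.
start_gb = initial_alluxio_memory_gb(dataset_gb=2000, num_workers=10)
print("starting Alluxio memory per worker (GB):", start_gb)
print("suggested size after observing a 30% hit ratio (GB):",
      next_alluxio_memory_gb(start_gb, hit_ratio=0.30, max_gb=64))
```

Whatever memory is not given to the Alluxio worker on a machine remains available for the Spark executor and the operating system, so the cap should leave room for both.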

Hope this helps,

- Bin

On Thu, Apr 4, 2019 at 9:29 PM Bin Fan <fanbin...@gmail.com> wrote:

> Hi Andy,
>
> It really depends on your workloads. I would suggest allocating 20% of
> the size of your input data set
> as the starting point and seeing how it works.
>
> Also, depending on the data source used as the under store of Alluxio, if it
> is remote (e.g., cloud storage like S3 or GCS),
> you can perhaps use Alluxio to manage local disk or SSD storage
> rather than memory.
> In this case, the "local Alluxio storage" is still much faster than
> reading from remote storage.
> Check out the documentation on tiered storage configuration here (
> http://www.alluxio.org/docs/1.8/en/advanced/Alluxio-Storage-Management.html#configuring-alluxio-storage
> )
>
> - Bin
>
> On Thu, Mar 21, 2019 at 8:26 AM u9g <lwx371...@163.com> wrote:
>
>> Hey,
>>
>> We have a cluster of 10 nodes, each of which has 128GB of memory. We are
>> about to run Spark and Alluxio on the cluster. We wonder how we should
>> allocate the memory between the Spark executor and the Alluxio worker on a
>> machine. Are there any recommendations? Thanks!
>>
>> Best,
>> Andy Li
>>
>>
>>
>>
>
