Re: Performance of Spark when the compute and storage are separated

Mark Hamstra Sun, 15 Apr 2018 14:02:42 -0700

Keep forgetting to reply to user list...

On Sun, Apr 15, 2018 at 1:58 PM, Mark Hamstra <m...@clearstorydata.com>
wrote:


> Sure, data locality all the way down at the basic storage layer is the
> easy way to avoid paying the costs of remote I/O. My point, though, is
> that this kind of storage locality isn't necessarily the only way to get
> acceptable performance -- it really does depend heavily on your use case
> and on your performance expectations/requirements. In some cases, it can
> even be acceptable to do query federation between data centers, where some
> of the storage is really remote and the costs to access it are quite high.
> If you're not trying to bring over all of the remote data, and if you
> reuse many times the bit of data that you did bring in with the very
> expensive I/O and then cached, overall performance can be quite
> acceptable.
>
> On Sun, Apr 15, 2018 at 1:46 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Thanks Mark,
>>
>> I guess this may be broadened to the concept of separating compute from
>> storage. Your point that the data locality issue " ... can kind of
>> disappear after the data is first read from the storage layer." reminds
>> me of performing logical I/Os as opposed to physical I/Os. But again, as
>> you correctly pointed out, this depends on the amount of available cache
>> and on concurrency, which can saturate the storage with requests. I
>> personally believe that data locality helps by avoiding these remote I/O
>> calls.
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 15 April 2018 at 21:22, Mark Hamstra <m...@clearstorydata.com> wrote:
>>
>>>> This is a sort of your mileage varies type question.
>>>
>>> Yes, it really does. Not only does it depend heavily on the
>>> configuration of your compute and storage, but it also depends a lot on any
>>> caching that you are doing between compute and storage and on the nature of
>>> your Spark queries/jobs. If you are mostly doing cold full scans, then
>>> you're going to see a big performance hit. If you are reusing a lot of
>>> prior or intermediate results, then you are frequently not going all the
>>> way back to a slow storage layer, but rather to a Spark CachedTable, some
>>> other cache, or even the OS buffer cache for shuffle files -- or to local
>>> disk spillage. All of that is typically going to be local to your compute
>>> nodes, so the data locality issue can kind of disappear after the data is
>>> first read from the storage layer.
>>>
>>>
>>> On Sat, Apr 14, 2018 at 12:17 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> This is a sort of your mileage varies type question.
>>>>
>>>> In a classic Hadoop cluster, one has data locality when each node
>>>> includes the Spark libraries and HDFS data. This helps certain queries
>>>> like interactive BI.
>>>>
>>>> However, running Spark over remote storage, say Isilon scale-out NAS
>>>> instead of local HDFS, becomes problematic. The full scans that Spark
>>>> needs to do will take much longer when done over the network (accessing
>>>> the remote Isilon storage) instead of as local I/O requests to HDFS.
>>>>
>>>> Has anyone done any comparative studies on this?
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>
>>>
>>
>
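
To put rough numbers on the cold-scan versus cached-reuse point made in the
thread, here is a minimal back-of-the-envelope sketch in Python. It is not a
benchmark and none of it comes from measurements: the function name
total_read_time, the 100 GB data size, the 1 GB/s remote and 10 GB/s local
throughput figures, and the 10 passes are all assumptions chosen purely for
illustration. It also ignores CPU time, shuffle, cache eviction and
concurrency, all of which matter in practice.

```python
def total_read_time(data_gb, remote_gbps, local_gbps, reads, cached_fraction):
    """Estimate seconds of I/O for `reads` passes over `data_gb` of data.

    Hypothetical model: the first pass always reads everything from remote
    storage. On each later pass, `cached_fraction` of the data is served at
    local speed and the remainder is read from remote storage again.
    """
    if reads < 1:
        raise ValueError("reads must be at least 1")
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")

    first_pass = data_gb / remote_gbps
    later_pass = (cached_fraction * data_gb / local_gbps
                  + (1.0 - cached_fraction) * data_gb / remote_gbps)
    return first_pass + (reads - 1) * later_pass


# All figures below are made-up assumptions, not measurements.
DATA_GB = 100.0      # size of the working set
REMOTE_GBPS = 1.0    # throughput from remote storage (e.g. NAS over network)
LOCAL_GBPS = 10.0    # throughput from a cache local to the compute nodes
READS = 10           # number of times the same data is used

cold = total_read_time(DATA_GB, REMOTE_GBPS, LOCAL_GBPS, READS, 0.0)
half = total_read_time(DATA_GB, REMOTE_GBPS, LOCAL_GBPS, READS, 0.5)
warm = total_read_time(DATA_GB, REMOTE_GBPS, LOCAL_GBPS, READS, 1.0)

print(f"every pass is a cold full scan: {cold:.0f} s")  # 1000 s
print(f"half the data stays cached:     {half:.0f} s")  # 595 s
print(f"everything cached after pass 1: {warm:.0f} s")  # 190 s
```

Under these assumed figures, the remote read is paid once and then amortized
over the later passes, which is the effect described above. With a single
pass (reads=1) the cache makes no difference at all, which matches the cold
full-scan case.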
