Keep forgetting to reply to user list...

On Sun, Apr 15, 2018 at 1:58 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> Sure, data locality all the way at the basic storage layer is the easy way
> to avoid paying the costs of remote I/O. My point, though, is that that
> kind of storage locality isn't necessarily the only way to get acceptable
> performance -- it really does depend heavily on your use case and on your
> performance expectations/requirements. In some cases, it can even be
> acceptable to do query federation between data centers, where some of the
> storage is really remote and the costs to access it are quite high; but if
> you're not doing something like trying to bring over all of the remote
> data, and if you are reusing many times the bit of data that you did bring
> in with the very expensive I/O and then cached, overall performance can be
> quite acceptable.
>
> On Sun, Apr 15, 2018 at 1:46 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Thanks Mark,
>>
>> I guess this may be broadened to the concept of separating compute from
>> storage. Your point on " ... can kind of disappear after the data is
>> first read from the storage layer." reminds me of performing Logical IOs
>> as opposed to Physical IOs. But again, as you correctly pointed out, that
>> depends on the amount of available cache and on the concurrency that can
>> saturate the hits on the storage. I personally believe that data locality
>> helps by avoiding these remote IO calls.
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>> On 15 April 2018 at 21:22, Mark Hamstra <m...@clearstorydata.com> wrote:
>>
>>>> This is a sort of your mileage varies type question.
>>>
>>> Yes, it really does. Not only does it depend heavily on the
>>> configuration of your compute and storage, but it also depends a lot on any
>>> caching that you are doing between compute and storage and on the nature of
>>> your Spark queries/Jobs. If you are mostly doing cold full scans, then
>>> you're going to see a big performance hit. If you are reusing a lot of
>>> prior or intermediate results, then you are frequently not going all the
>>> way back to a slow storage layer, but rather to a Spark CachedTable, some
>>> other cache, or even the OS buffer cache for shuffle files -- or to local
>>> disk spillage. All of that is typically going to be local to your compute
>>> nodes, so the data locality issue can kind of disappear after the data is
>>> first read from the storage layer.
>>>
>>> On Sat, Apr 14, 2018 at 12:17 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> This is a sort of your mileage varies type question.
>>>>
>>>> In a classic Hadoop cluster, one has data locality when each node
>>>> includes the Spark libraries and HDFS data. This helps certain queries
>>>> like interactive BI.
>>>>
>>>> However, running Spark over remote storage, say Isilon scale-out NAS,
>>>> instead of local HDFS becomes problematic. The full scan that Spark needs
>>>> to do will take much longer when it is done over the network (accessing
>>>> the remote Isilon storage) instead of through local I/O requests to HDFS.
>>>>
>>>> Has anyone done any comparative studies on this?
>>>>
>>>> Thanks
>>>>
>>>> Dr Mich Talebzadeh
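
For what it's worth, the caching argument in this thread (pay the expensive remote I/O once, then reuse the cached data many times) comes down to simple amortization. Here is a minimal Python sketch of that arithmetic. The function name and all of the timings are hypothetical, made up purely for illustration; they are not measurements from Isilon, HDFS, or any real cluster.

```python
def amortized_cost_per_query(remote_read_s, cached_read_s, reuse_count):
    """Average seconds per query when the first query pays for the remote
    read and every later query is served from a cache local to compute."""
    if reuse_count < 1:
        raise ValueError("reuse_count must be at least 1")
    total_s = remote_read_s + (reuse_count - 1) * cached_read_s
    return total_s / reuse_count


# Hypothetical timings (assumptions, not benchmarks).
REMOTE_READ_S = 120.0  # one cold read over the network
CACHED_READ_S = 2.0    # one read of the same data from a local cache

if __name__ == "__main__":
    for n in (1, 10, 100):
        avg = amortized_cost_per_query(REMOTE_READ_S, CACHED_READ_S, n)
        print(f"{n:>3} queries over the same data: {avg:.2f} s per query on average")
```

With these made-up numbers, the average drops from 120 s for a single cold scan to a little over 3 s per query after 100 reuses. A workload that is mostly cold full scans has a reuse count close to 1 and stays near the remote cost, which is the case where local storage clearly wins.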