Hi, Ashish.

You are correct in saying that not *all* functionality of Spark is
spill-to-disk but I am not sure how this pertains to a "concurrent user
scenario". Each executor will run in its own JVM and is therefore isolated
from others. That is, if the JVM of one user dies, this should not affect
another user who is running their own jobs in their own JVMs. The amount of
resources used by a user can be controlled by the resource manager.
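
For example, the footprint of a single job can be capped at submit time.
A rough sketch (the queue name is invented and the numbers are purely
illustrative, not recommendations):

    # assumes a YARN cluster and a pre-existing queue called "analytics"
    spark-submit \
      --master yarn \
      --queue analytics \
      --num-executors 10 \
      --executor-cores 2 \
      --executor-memory 4g \
      my_job.jar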

AFAIK, you configure something like YARN to limit the number of cores and
the amount of memory in the cluster a certain user or group is allowed to
use for their job. This is obviously quite a coarse-grained approach as (to
my knowledge) IO is not throttled. I believe people generally use something
like Apache Ambari to keep an eye on network and disk usage to mitigate
problems in a shared cluster.
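
To make that concrete, the coarse-grained limits I have in mind live in
YARN's capacity-scheduler.xml and look roughly like this (queue names and
percentages are invented for illustration):

    <property>
      <name>yarn.scheduler.capacity.root.queues</name>
      <value>analytics,etl</value>
    </property>
    <property>
      <!-- guaranteed share of the cluster for this queue -->
      <name>yarn.scheduler.capacity.root.analytics.capacity</name>
      <value>40</value>
    </property>
    <property>
      <!-- hard ceiling, even when the rest of the cluster is idle -->
      <name>yarn.scheduler.capacity.root.analytics.maximum-capacity</name>
      <value>60</value>
    </property>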

If a user has designed their query badly, it may very well fail with
OOMEs, but this can happen irrespective of whether one user or many are
using the cluster at a given moment in time.
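
A classic example is collecting an unbounded result set back to the
driver. A Scala sketch (the table name and output path are invented;
`spark` is a SparkSession, as in spark-shell):

    // Materialises every row in the driver JVM: an OOME waiting to
    // happen once the result no longer fits in driver memory.
    val everything = spark.sql("SELECT * FROM clickstream").collect()

    // Safer: keep the result distributed and write it out instead.
    spark.sql("SELECT * FROM clickstream").write.parquet("/tmp/out")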

Does this help?

Regards,

Phillip


On Sun, Nov 12, 2017 at 5:50 PM, ashish rawat <dceash...@gmail.com> wrote:

> Thanks Jorn and Phillip. My question was specifically for anyone who has
> tried building a data warehouse on Spark SQL. I was trying to check
> whether someone has tried it and can help with the kinds of workloads that
> worked and the ones that had problems.
>
> Regarding spill to disk, I might be wrong, but not all functionality of
> Spark is spill-to-disk, so it still doesn't provide DB-like reliability in
> execution. In the case of DBs, queries get slow but they don't fail or go
> out of memory, specifically in concurrent user scenarios.
>
> Regards,
> Ashish
>
> On Nov 12, 2017 3:02 PM, "Phillip Henry" <londonjava...@gmail.com> wrote:
>
> Agree with Jorn. The answer is: it depends.
>
> In the past, I've worked with data scientists who are happy to use the
> Spark CLI. Again, the answer is "it depends" (in this case, on the skills
> of your customers).
>
> Regarding sharing resources, different teams were limited to their own
> queue so they could not hog all the resources. However, people within a
> team had to do some horse trading if they had a particularly intensive job
> to run. I did feel that this was an area that could be improved. It may
> have been by now; I've just not looked into it for a while.
>
> BTW I'm not sure what you mean by "Spark still does not provide spill to
> disk" as the FAQ says "Spark's operators spill data to disk if it does not
> fit in memory" (http://spark.apache.org/faq.html). So, your data will not
> normally cause OutOfMemoryErrors (certain terms and conditions may apply).
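>
> The settings that govern spilling are worth knowing; a sketch (the values
> shown are the Spark 2.x defaults, not tuning advice):
>
>     # fraction of heap shared by execution and storage; shuffles, joins
>     # and aggregations spill to disk once their share is exhausted
>     spark.memory.fraction   0.6
>
>     # where the spill files land on each worker
>     spark.local.dir         /tmp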
>
> My 2 cents.
>
> Phillip
>
>
>
> On Sun, Nov 12, 2017 at 9:14 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> What do you mean by all possible workloads?
>> You cannot prepare any system to handle all possible processing.
>>
>> We do not know the requirements of your data scientists now or in the
>> future, so it is difficult to say. How do they work currently, without the
>> new solution? Do they all work on the same data? I bet you will receive a
>> lot of private emails from people trying to sell you a solution that
>> solves everything; with the information you provided, this is impossible
>> to say.
>>
>> Then, as with every system: have incremental releases, but have them in
>> short time frames; do not engineer a big system that you will deliver in 2
>> years. In the cloud you have the perfect possibility to scale feature-wise
>> but also infrastructure-wise.
>>
>> The challenge with concurrent queries is the right configuration of the
>> scheduler (e.g. the fair scheduler), so that no single query takes all the
>> resources and long-running queries do not starve.
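>>
>> A sketch of what that looks like on the Spark side (the pool name is
>> invented and the values are illustrative):
>>
>>     # spark-defaults.conf
>>     spark.scheduler.mode             FAIR
>>     spark.scheduler.allocation.file  /etc/spark/fairscheduler.xml
>>
>>     <!-- fairscheduler.xml -->
>>     <allocations>
>>       <pool name="adhoc">
>>         <schedulingMode>FAIR</schedulingMode>
>>         <weight>1</weight>
>>         <minShare>2</minShare>
>>       </pool>
>>     </allocations>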
>>
>> User interfaces: what could help are notebooks (Jupyter etc.), but you may
>> need to train your data scientists. Some may know or prefer other tools.
>>
>> On 12. Nov 2017, at 08:32, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>
>> I am looking for a similar solution, more aligned to a data scientist
>> group. The concern I have is about supporting complex aggregations at
>> runtime.
>>
>> Thanks
>> Deepak
>>
>> On Nov 12, 2017 12:51, "ashish rawat" <dceash...@gmail.com> wrote:
>>
>>> Hello Everyone,
>>>
>>> I was trying to understand if anyone here has tried a data warehouse
>>> solution using S3 and Spark SQL. Out of multiple possible options
>>> (Redshift, Presto, Hive, etc.), we were planning to go with Spark SQL for
>>> our aggregates and processing requirements.
>>>
>>> If anyone has tried it out, would like to understand the following:
>>>
>>>    1. Are Spark SQL and UDFs able to handle all the workloads?
>>>    2. What user interface did you provide for data scientists, data
>>>    engineers, and analysts?
>>>    3. What are the challenges in running concurrent queries, by many
>>>    users, over Spark SQL? Considering Spark still does not provide spill to
>>>    disk in many scenarios, are there frequent query failures when executing
>>>    concurrent queries?
>>>    4. Are there any open-source implementations that provide something
>>>    similar?
>>>
>>>
>>> Regards,
>>> Ashish
>>>
>>
>
>
