We are using Spark on Kubernetes on AWS (it's a long story), but it does work. It's still on the raw side, but we've been pretty successful.
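For anyone curious what that setup looks like: at the time of this thread, Kubernetes support lived in the spark-on-k8s fork, but as of Spark 2.3+ a native submission looks roughly like the sketch below. The API server address, namespace, service account, and image name are all placeholders, and exact flag names depend on your Spark version.

```shell
# Sketch of submitting a Spark app to a Kubernetes cluster (Spark 2.3+ flags).
# The cluster URL, namespace, service account, and image are placeholders.
spark-submit \
  --master k8s://https://my-cluster.example.com:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=4 \
  --conf spark.kubernetes.namespace=spark-jobs \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=my-registry/spark:latest \
  local:///opt/spark/examples/jars/spark-examples.jar
```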
We configured our cluster primarily with kube-aws and auto-scaling groups. There are gotchas there, but so far we've been quite successful.

Gary Lucas

On 17 November 2017 at 22:20, ashish rawat <dceash...@gmail.com> wrote:
> Thanks everyone for their suggestions. Do any of you take care of auto
> scale-up and scale-down of your underlying Spark clusters on AWS?
>
> On Nov 14, 2017 10:46 AM, "lucas.g...@gmail.com" <lucas.g...@gmail.com> wrote:
>
> Hi Ashish, bear in mind that EMR has some additional tooling available
> that smooths out some S3 problems that you may (almost certainly will)
> encounter.
>
> We are using Spark with S3, not on EMR, and have encountered issues with
> file consistency. You can deal with it, but be aware it's additional
> technical debt that you'll need to own. We didn't want to own an HDFS
> cluster, so we consider it worthwhile.
>
> Here are some additional resources (the video is Steve Loughran talking
> about S3):
> https://medium.com/@subhojit20_27731/apache-spark-and-amazon-s3-gotchas-and-best-practices-a767242f3d98
> https://www.youtube.com/watch?v=ND4L_zSDqF0
>
> For the record, we use S3 heavily but tend to drop our processed data into
> databases so it can be more easily consumed by visualization tools.
>
> Good luck!
>
> Gary Lucas
>
> On 13 November 2017 at 20:04, Affan Syed <as...@an10.io> wrote:
>
>> Another option that we are trying internally is to use Mesos for
>> isolating different jobs or groups. Within a single group, using Livy to
>> create different Spark contexts also works.
>>
>> - Affan
>>
>> On Tue, Nov 14, 2017 at 8:43 AM, ashish rawat <dceash...@gmail.com> wrote:
>>
>>> Thanks Sky Yin. This really helps.
>>>
>>> On Nov 14, 2017 12:11 AM, "Sky Yin" <sky....@gmail.com> wrote:
>>>
>>> We are running Spark in AWS EMR as a data warehouse. All data are in S3
>>> and metadata in the Hive metastore.
>>>
>>> We have internal tools to create Jupyter notebooks on the dev cluster. I
>>> guess you could use Zeppelin instead, or Livy?
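The Livy approach mentioned above works over a plain REST API: each POST to /sessions spins up a separate Spark context, which is what gives the per-user/per-group isolation. A rough sketch (the host, port, and resource sizes are made up for illustration):

```shell
# Create an interactive PySpark session via Livy's REST API (hypothetical host).
curl -s -X POST http://livy-host:8998/sessions \
  -H 'Content-Type: application/json' \
  -d '{"kind": "pyspark", "executorMemory": "4g", "executorCores": 2}'

# Once the session is "idle", submit code against it (session id 0 here):
curl -s -X POST http://livy-host:8998/sessions/0/statements \
  -H 'Content-Type: application/json' \
  -d '{"code": "spark.range(100).count()"}'
```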
>>> We run Genie as a job server for the prod cluster, so users have to
>>> submit their queries through Genie. For better resource utilization, we
>>> rely on YARN dynamic allocation to balance the load of multiple
>>> jobs/queries in Spark.
>>>
>>> Hope this helps.
>>>
>>> On Sat, Nov 11, 2017 at 11:21 PM ashish rawat <dceash...@gmail.com> wrote:
>>>
>>>> Hello Everyone,
>>>>
>>>> I was trying to understand if anyone here has tried a data warehouse
>>>> solution using S3 and Spark SQL. Out of multiple possible options
>>>> (Redshift, Presto, Hive, etc.), we were planning to go with Spark SQL
>>>> for our aggregation and processing requirements.
>>>>
>>>> If anyone has tried it out, I would like to understand the following:
>>>>
>>>> 1. Are Spark SQL and UDFs able to handle all the workloads?
>>>> 2. What user interface did you provide for data scientists, data
>>>> engineers and analysts?
>>>> 3. What are the challenges in running concurrent queries, by many
>>>> users, over Spark SQL? Considering Spark still does not provide spill
>>>> to disk in many scenarios, are there frequent query failures when
>>>> executing concurrent queries?
>>>> 4. Are there any open source implementations which provide
>>>> something similar?
>>>>
>>>> Regards,
>>>> Ashish
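The dynamic-allocation setup Sky Yin describes is driven by a handful of Spark settings; in spark-defaults.conf it would look roughly like the fragment below. The min/max executor counts and idle timeout are illustrative only, and on YARN this also requires the external shuffle service to be running on each NodeManager.

```
# Illustrative spark-defaults.conf entries for YARN dynamic allocation.
# Values here are examples, not recommendations.
spark.dynamicAllocation.enabled              true
spark.shuffle.service.enabled                true
spark.dynamicAllocation.minExecutors         1
spark.dynamicAllocation.maxExecutors         50
spark.dynamicAllocation.executorIdleTimeout  60s
```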