We are using Spark on Kubernetes on AWS (it's a long story), but it does work. It's still on the raw side, but we've been pretty successful.
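For anyone curious what that setup looks like: at the time of this thread, Kubernetes support lived in the spark-on-k8s fork, but as of Spark 2.3+ a native submission looks roughly like the sketch below. The API server address, namespace, service account, and image name are all placeholders, and exact flag names depend on your Spark version.

```shell
# Sketch of submitting a Spark app to a Kubernetes cluster (Spark 2.3+ flags).
# The cluster URL, namespace, service account, and image are placeholders.
spark-submit \
  --master k8s://https://my-cluster.example.com:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=4 \
  --conf spark.kubernetes.namespace=spark-jobs \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=my-registry/spark:latest \
  local:///opt/spark/examples/jars/spark-examples.jar
```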
We configured our cluster primarily with kube-aws and auto-scaling groups. There are gotchas there, but so far we've been quite successful.

Gary Lucas

On 17 November 2017 at 22:20, ashish rawat <dceash...@gmail.com> wrote:
> Thanks everyone for their suggestions. Do any of you take care of auto
> scale-up and scale-down of your underlying Spark clusters on AWS?
>
> On Nov 14, 2017 10:46 AM, "lucas.g...@gmail.com" <lucas.g...@gmail.com> wrote:
>
> Hi Ashish, bear in mind that EMR has some additional tooling available
> that smooths out some S3 problems that you may (almost certainly will)
> encounter.
>
> We are using Spark with S3, not on EMR, and have encountered issues with
> file consistency. You can deal with it, but be aware it's additional
> technical debt that you'll need to own. We didn't want to own an HDFS
> cluster, so we consider it worthwhile.
>
> Here are some additional resources (the video is Steve Loughran talking
> about S3):
> https://medium.com/@subhojit20_27731/apache-spark-and-amazon-s3-gotchas-and-best-practices-a767242f3d98
> https://www.youtube.com/watch?v=ND4L_zSDqF0
>
> For the record, we use S3 heavily but tend to drop our processed data into
> databases so it can be more easily consumed by visualization tools.
>
> Good luck!
>
> Gary Lucas
>
> On 13 November 2017 at 20:04, Affan Syed <as...@an10.io> wrote:
>
>> Another option that we are trying internally is to use Mesos for
>> isolating different jobs or groups. Within a single group, using Livy to
>> create different Spark contexts also works.
>>
>> - Affan
>>
>> On Tue, Nov 14, 2017 at 8:43 AM, ashish rawat <dceash...@gmail.com> wrote:
>>
>>> Thanks Sky Yin. This really helps.
>>>
>>> On Nov 14, 2017 12:11 AM, "Sky Yin" <sky....@gmail.com> wrote:
>>>
>>> We are running Spark in AWS EMR as a data warehouse. All data are in S3
>>> and metadata in the Hive metastore.
>>>
>>> We have internal tools to create Jupyter notebooks on the dev cluster. I
>>> guess you could use Zeppelin instead, or Livy?
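The Livy approach mentioned above works over a plain REST API: each POST to /sessions spins up a separate Spark context, which is what gives the per-user/per-group isolation. A rough sketch (the host, port, and resource sizes are made up for illustration):

```shell
# Create an interactive PySpark session via Livy's REST API (hypothetical host).
curl -s -X POST http://livy-host:8998/sessions \
  -H 'Content-Type: application/json' \
  -d '{"kind": "pyspark", "executorMemory": "4g", "executorCores": 2}'

# Once the session is "idle", submit code against it (session id 0 here):
curl -s -X POST http://livy-host:8998/sessions/0/statements \
  -H 'Content-Type: application/json' \
  -d '{"code": "spark.range(100).count()"}'
```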
>>> We run Genie as a job server for the prod cluster, so users have to
>>> submit their queries through Genie. For better resource utilization, we
>>> rely on YARN dynamic allocation to balance the load of multiple
>>> jobs/queries in Spark.
>>>
>>> Hope this helps.
>>>
>>> On Sat, Nov 11, 2017 at 11:21 PM ashish rawat <dceash...@gmail.com> wrote:
>>>
>>>> Hello Everyone,
>>>>
>>>> I was trying to understand if anyone here has tried a data warehouse
>>>> solution using S3 and Spark SQL. Out of multiple possible options
>>>> (Redshift, Presto, Hive, etc.), we were planning to go with Spark SQL
>>>> for our aggregation and processing requirements.
>>>>
>>>> If anyone has tried it out, I would like to understand the following:
>>>>
>>>> 1. Are Spark SQL and UDFs able to handle all the workloads?
>>>> 2. What user interface did you provide for data scientists, data
>>>> engineers and analysts?
>>>> 3. What are the challenges in running concurrent queries, by many
>>>> users, over Spark SQL? Considering Spark still does not provide spill
>>>> to disk in many scenarios, are there frequent query failures when
>>>> executing concurrent queries?
>>>> 4. Are there any open source implementations which provide
>>>> something similar?
>>>>
>>>> Regards,
>>>> Ashish
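The dynamic-allocation setup Sky Yin describes is driven by a handful of Spark settings; in spark-defaults.conf it would look roughly like the fragment below. The min/max executor counts and idle timeout are illustrative only, and on YARN this also requires the external shuffle service to be running on each NodeManager.

```
# Illustrative spark-defaults.conf entries for YARN dynamic allocation.
# Values here are examples, not recommendations.
spark.dynamicAllocation.enabled              true
spark.shuffle.service.enabled                true
spark.dynamicAllocation.minExecutors         1
spark.dynamicAllocation.maxExecutors         50
spark.dynamicAllocation.executorIdleTimeout  60s
```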