Re: Spark based Data Warehouse

2017-11-17 Thread lucas.g...@gmail.com
We are using Spark on Kubernetes on AWS (it's a long story), but it does work. It's still on the raw side, but we've been pretty successful. We configured our cluster primarily with Kube-AWS and auto scaling groups. There are gotchas there, but so far we've been quite successful. Gary Lucas On

Re: Spark based Data Warehouse

2017-11-17 Thread ashish rawat
Thanks everyone for your suggestions. Do any of you handle automatic scaling up and down of your underlying Spark clusters on AWS? On Nov 14, 2017 10:46 AM, "lucas.g...@gmail.com" wrote: Hi Ashish, bear in mind that EMR has some additional tooling available that smooths

Re: Spark based Data Warehouse

2017-11-13 Thread lucas.g...@gmail.com
Hi Ashish, bear in mind that EMR has some additional tooling available that smooths out some S3 problems that you may / almost certainly will encounter. We are using Spark / S3 without EMR and have encountered issues with file consistency. You can deal with it, but be aware it's additional

Re: Spark based Data Warehouse

2017-11-13 Thread Affan Syed
Another option that we are trying internally is to use Mesos for isolating different jobs or groups. Within a single group, using Livy to create different Spark contexts also works. - Affan On Tue, Nov 14, 2017 at 8:43 AM, ashish rawat wrote: > Thanks Sky Yin. This really
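As a rough illustration of the Livy suggestion above (not Affan's actual setup), the sketch below builds the JSON bodies that would be sent to Livy's `POST /sessions` endpoint to give two users separate Spark contexts. The field names (`kind`, `name`, `executorMemory`, `conf`) come from Livy's REST API; the session names, memory sizes, and the `spark.cores.max` cap are assumptions made up for the example, and nothing is actually sent to a server.

```python
import json


def livy_session_request(name, executor_memory="2g", max_cores=4):
    """Build the JSON body for creating one interactive Livy session.

    Each session created this way gets its own Spark context, so two users
    in the same group do not share driver state.
    All values here are illustrative assumptions, not recommended settings.
    """
    return {
        "kind": "pyspark",                  # interactive PySpark session
        "name": name,                       # hypothetical session name
        "executorMemory": executor_memory,  # per-executor memory
        # spark.cores.max caps total cores for the application on
        # standalone and coarse-grained Mesos deployments.
        "conf": {"spark.cores.max": str(max_cores)},
    }


# Two hypothetical users within one group, each with a separate context.
analyst_session = livy_session_request("analyst-adhoc")
etl_session = livy_session_request("nightly-etl", executor_memory="4g", max_cores=16)

# These bodies would be POSTed to <livy-host>/sessions with
# Content-Type: application/json; here they are only printed.
print(json.dumps(analyst_session, indent=2))
print(json.dumps(etl_session, indent=2))
```

Keeping the request construction separate from the HTTP call makes it easy to review per-user resource limits before any session is created.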

Re: Spark based Data Warehouse

2017-11-13 Thread ashish rawat
Thanks Sky Yin. This really helps. On Nov 14, 2017 12:11 AM, "Sky Yin" wrote: We are running Spark in AWS EMR as a data warehouse. All data are in S3 and metadata in the Hive metastore. We have internal tools to create Jupyter notebooks on the dev cluster. I guess you can use

Re: Spark based Data Warehouse

2017-11-13 Thread Sky Yin
We are running Spark in AWS EMR as a data warehouse. All data are in S3 and metadata in the Hive metastore. We have internal tools to create Jupyter notebooks on the dev cluster. I guess you can use Zeppelin instead, or Livy? We run Genie as a job server for the prod cluster, so users have to submit

Re: Spark based Data Warehouse

2017-11-13 Thread Deepak Sharma
os for your end users; but it sounds like you’ll be using it for exploratory analysis. Spark is great for this ☺ -Pat From: Vadim Semenov <vadim.seme...@datadoghq.com> Date: Su

Re: Spark based Data Warehouse

2017-11-13 Thread ashish rawat
dim.seme...@datadoghq.com> Date: Sunday, November 12, 2017 at 1:06 PM To: Gourav Sengupta <gourav.sengu...@gmail.com> Cc: Phillip Henry <londonjava...@gmail.com>, ashish rawat <dceash...@gmail.com>, Jörn Franke <jornfra...@gmail.com>, Deepak S

Re: Spark based Data Warehouse

2017-11-12 Thread Patrick Alwell
mail.com> Cc: Phillip Henry <londonjava...@gmail.com>, ashish rawat <dceash...@gmail.com>, Jörn Franke <jornfra...@gmail.com>, Deepak Sharma <deepakmc...@gmail.com>, spark users <user@spark.apache.org> Subject: Re: Spark based Data Warehouse It's actually quite simp

Re: Spark based Data Warehouse

2017-11-12 Thread Vadim Semenov
It's actually quite simple to answer
> 1. Is Spark SQL and UDF, able to handle all the workloads?
Yes
> 2. What user interface did you provide for data scientist, data engineers and analysts
Home-grown platform, EMR, Zeppelin
> What are the challenges in running concurrent queries, by many

Re: Spark based Data Warehouse

2017-11-12 Thread Gourav Sengupta
Dear Ashish, what you are asking for involves at least a few weeks of dedicated study of your use case, and then it takes at least 3 to 4 months to even propose a solution. You can even build a fantastic data warehouse just using C++. The matter depends on lots of conditions. I just think

Re: Spark based Data Warehouse

2017-11-12 Thread Phillip Henry
Hi, Ashish. You are correct in saying that not *all* functionality of Spark is spill-to-disk but I am not sure how this pertains to a "concurrent user scenario". Each executor will run in its own JVM and is therefore isolated from others. That is, if the JVM of one user dies, this should not

Re: Spark based Data Warehouse

2017-11-12 Thread ashish rawat
Thanks Jorn and Phillip. My question was specifically for anyone who has tried building a data warehouse with Spark SQL. I was trying to check whether someone has tried it and can help with the kinds of workloads that worked and the ones that had problems. Regarding spill to disk,

Re: Spark based Data Warehouse

2017-11-12 Thread Phillip Henry
Agree with Jorn. The answer is: it depends. In the past, I've worked with data scientists who are happy to use the Spark CLI. Again, the answer is "it depends" (in this case, on the skills of your customers). Regarding sharing resources, different teams were limited to their own queue so they
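A minimal sketch of the per-team queue idea above, assuming the cluster runs on YARN (the message does not say which resource manager was used). It only builds the `spark-submit` argument list; `--master` and `--queue` are real `spark-submit` options, while the team names, queue names, and script path are hypothetical.

```python
# Hypothetical mapping from team to the scheduler queue it is limited to.
TEAM_QUEUES = {
    "data-science": "ds_queue",
    "analytics": "analytics_queue",
}


def spark_submit_command(team, app_path):
    """Return the spark-submit argument list for a team's job.

    Raises KeyError for a team with no assigned queue, so a job cannot
    silently land in another team's share of the cluster.
    """
    queue = TEAM_QUEUES[team]
    return [
        "spark-submit",
        "--master", "yarn",  # assumption: YARN as the resource manager
        "--queue", queue,    # queue that bounds this team's resources
        app_path,
    ]


cmd = spark_submit_command("data-science", "jobs/daily_aggregates.py")
print(" ".join(cmd))
```

The list form can be passed directly to `subprocess.run` on a machine where `spark-submit` is installed.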

Re: Spark based Data Warehouse

2017-11-12 Thread Jörn Franke
What do you mean by all possible workloads? You cannot prepare any system to do all possible processing. We do not know the requirements of your data scientists now or in the future, so it is difficult to say. How do they work currently without the new solution? Do they all work on the same data? I

Re: Spark based Data Warehouse

2017-11-11 Thread Deepak Sharma
I am looking for a similar solution, more aligned to a data scientist group. The concern I have is about supporting complex aggregations at runtime. Thanks Deepak On Nov 12, 2017 12:51, "ashish rawat" wrote: > Hello Everyone, > > I was trying to understand if anyone here has

Spark based Data Warehouse

2017-11-11 Thread ashish rawat
Hello Everyone, I was trying to understand if anyone here has tried a data warehouse solution using S3 and Spark SQL. Out of multiple possible options (Redshift, Presto, Hive, etc.), we were planning to go with Spark SQL for our aggregates and processing requirements. If anyone has tried it out,