Dear Ashish,

What you are asking for involves at least a few weeks of dedicated work to
understand your use case, and then at least 3 to 4 months more to even
propose a solution. You could even build a fantastic data warehouse using
nothing but C++. It all depends on many conditions. I think your approach
and question need a lot of refinement.
Regards,
Gourav

On Sun, Nov 12, 2017 at 6:19 PM, Phillip Henry <londonjava...@gmail.com> wrote:
> Hi, Ashish.
>
> You are correct in saying that not *all* functionality of Spark is
> spill-to-disk, but I am not sure how this pertains to a "concurrent user
> scenario". Each executor runs in its own JVM and is therefore isolated
> from the others. That is, if the JVM of one user dies, this should not
> affect another user who is running their own jobs in their own JVMs. The
> amount of resources used by a user can be controlled by the resource
> manager.
>
> AFAIK, you configure something like YARN to limit the number of cores and
> the amount of memory in the cluster that a certain user or group is
> allowed to use for their jobs. This is obviously quite a coarse-grained
> approach as (to my knowledge) IO is not throttled. I believe people
> generally use something like Apache Ambari to keep an eye on network and
> disk usage to mitigate problems in a shared cluster.
>
> If a user has badly designed their query, it may very well fail with
> OOMEs, but this can happen irrespective of whether one user or many are
> using the cluster at a given moment in time.
>
> Does this help?
>
> Regards,
>
> Phillip
>
>
> On Sun, Nov 12, 2017 at 5:50 PM, ashish rawat <dceash...@gmail.com> wrote:
>
>> Thanks, Jorn and Phillip. My question was specifically for anyone who
>> has tried building a data warehouse on Spark SQL. I was trying to find
>> out whether someone has tried it and can describe the kinds of
>> workloads that worked and the ones that were problematic.
>>
>> Regarding spill to disk: I might be wrong, but not all functionality of
>> Spark spills to disk, so it still doesn't provide DB-like reliability
>> in execution. In the case of DBs, queries get slow but they don't fail
>> or go out of memory, specifically in concurrent user scenarios.
>>
>> Regards,
>> Ashish
>>
>> On Nov 12, 2017 3:02 PM, "Phillip Henry" <londonjava...@gmail.com> wrote:
>>
>> Agree with Jorn. The answer is: it depends.
>>
>> In the past, I've worked with data scientists who were happy to use the
>> Spark CLI. Again, the answer is "it depends" (in this case, on the
>> skills of your customers).
>>
>> Regarding sharing resources, different teams were limited to their own
>> queue so they could not hog all the resources. However, people within a
>> team had to do some horse-trading if they had a particularly intensive
>> job to run. I did feel that this was an area that could be improved. It
>> may be by now; I've just not looked into it for a while.
>>
>> BTW, I'm not sure what you mean by "Spark still does not provide spill
>> to disk", as the FAQ says "Spark's operators spill data to disk if it
>> does not fit in memory" (http://spark.apache.org/faq.html). So, your
>> data will not normally cause OutOfMemoryErrors (certain terms and
>> conditions may apply).
>>
>> My 2 cents.
>>
>> Phillip
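To make the resource-capping point above concrete, here is a minimal Scala
sketch of how a Spark application can be boxed in so that one user's job
cannot hog a shared cluster. The queue name, memory sizes, and core counts
are illustrative assumptions, not values from this thread:

    import org.apache.spark.sql.SparkSession

    // Cap this application's share of a shared cluster. Real limits are
    // normally enforced by the resource manager (e.g. a YARN queue with
    // its own capacity limits); the values below are examples only.
    val spark = SparkSession.builder()
      .appName("warehouse-queries")
      .config("spark.executor.memory", "4g")    // heap per executor
      .config("spark.executor.cores", "2")      // cores per executor
      .config("spark.executor.instances", "8")  // number of executors on YARN
      .config("spark.yarn.queue", "analytics")  // hypothetical YARN queue name
      .config("spark.memory.fraction", "0.6")   // unified execution/storage region
      .getOrCreate()

Note that spark.memory.fraction only bounds the unified execution/storage
region, whose operators spill to disk when it fills (as the FAQ quoted above
says); a badly designed query can still exhaust the rest of the heap.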
>>
>> On Sun, Nov 12, 2017 at 9:14 AM, Jörn Franke <jornfra...@gmail.com>
>> wrote:
>>
>>> What do you mean by "all possible workloads"?
>>> You cannot prepare any system to do all possible processing.
>>>
>>> We do not know the requirements of your data scientists now or in the
>>> future, so it is difficult to say. How do they work currently, without
>>> the new solution? Do they all work on the same data? I bet you will
>>> receive a lot of private messages from people trying to sell a solution
>>> that solves everything - with the information you provided, this is
>>> impossible to say.
>>>
>>> Then, as with every system: have incremental releases, but have them in
>>> short time frames - do not engineer a big system that you will deliver
>>> in 2 years. In the cloud you have the perfect opportunity to scale both
>>> feature-wise and infrastructure-wise.
>>>
>>> The challenge with concurrent queries is the right configuration of the
>>> scheduler (e.g. the fair scheduler), so that no single query takes all
>>> the resources and long-running queries do not starve.
>>>
>>> User interfaces: notebooks (Jupyter etc.) could help, but you may need
>>> to train your data scientists. Some may know or prefer other tools.
>>>
>>> On 12. Nov 2017, at 08:32, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>
>>> I am looking for a similar solution, more aligned to a data-scientist
>>> group. The concern I have is about supporting complex aggregations at
>>> runtime.
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Nov 12, 2017 12:51, "ashish rawat" <dceash...@gmail.com> wrote:
>>>
>>>> Hello Everyone,
>>>>
>>>> I was trying to understand whether anyone here has tried a data
>>>> warehouse solution using S3 and Spark SQL. Out of the multiple
>>>> possible options (Redshift, Presto, Hive, etc.), we were planning to
>>>> go with Spark SQL for our aggregation and processing requirements.
>>>>
>>>> If anyone has tried it out, I would like to understand the following:
>>>>
>>>> 1. Are Spark SQL and UDFs able to handle all the workloads?
>>>> 2. What user interface did you provide for data scientists, data
>>>> engineers and analysts?
>>>> 3. What are the challenges in running concurrent queries, by many
>>>> users, over Spark SQL? Considering Spark still does not provide spill
>>>> to disk in many scenarios, are there frequent query failures when
>>>> executing concurrent queries?
>>>> 4. Are there any open-source implementations which provide something
>>>> similar?
>>>>
>>>> Regards,
>>>> Ashish
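For question 3 in particular, a minimal Scala sketch of the kind of setup
Jörn describes - fair scheduling plus Spark SQL over S3 - might look like
the following. The pool name, allocation-file path, bucket, column names,
and the bucketize UDF are all hypothetical, and the s3a connector
(hadoop-aws) must be on the classpath:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("s3-warehouse")
      .config("spark.scheduler.mode", "FAIR")  // share the app fairly among concurrent queries
      .config("spark.scheduler.allocation.file",
              "/etc/spark/fairscheduler.xml")  // pool definitions (example path)
      .getOrCreate()

    // Queries submitted from this thread go to their own pool, so one
    // heavy query cannot starve everyone else (pool name is hypothetical).
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "analysts")

    // Read warehouse data straight from S3 (bucket and layout are placeholders).
    spark.read.parquet("s3a://my-warehouse/events/").createOrReplaceTempView("events")

    // A simple UDF registered for SQL, then an aggregate of the kind asked about.
    spark.udf.register("bucketize",
      (amount: Double) => if (amount > 100.0) "large" else "small")

    val daily = spark.sql(
      """SELECT dt, bucketize(amount) AS size, COUNT(*) AS n
        |FROM events
        |GROUP BY dt, bucketize(amount)""".stripMargin)
    daily.show()

Fair-scheduler pools only arbitrate between jobs inside one application;
isolating different users' applications from each other is still the
resource manager's job, as discussed earlier in the thread.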