You are correct, Mark. I misspoke; apologies for the confusion. So the problem is even worse, given that a typical job requires multiple tasks/cores.
I have yet to see this particular architecture work in production. I would love for someone to prove otherwise.

On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra <[email protected]> wrote:

>> For example, if you're looking to scale out to 1000 concurrent requests,
>> this is 1000 concurrent Spark jobs. This would require a cluster with
>> 1000 cores.
>
> This doesn't make sense. A Spark Job is a driver/DAGScheduler concept
> without any 1:1 correspondence between Worker cores and Jobs. Cores are
> used to run Tasks, not Jobs. So, yes, a 1000-core cluster can run at most
> 1000 simultaneous Tasks, but that doesn't really tell you anything about
> how many Jobs are or can be concurrently tracked by the DAGScheduler,
> which will be apportioning the Tasks from those concurrent Jobs across
> the available Executor cores.
>
> On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly <[email protected]> wrote:
>
>> Good stuff, Evan. Looks like this is utilizing the in-memory
>> capabilities of FiloDB, which is pretty cool. Looking forward to the
>> webcast, as I don't know much about FiloDB.
>>
>> My personal thoughts here are to remove Spark from the user
>> request/response hot path.
>>
>> I can't tell you how many times I've had to unroll that architecture at
>> clients - and replace it with a real database like Cassandra,
>> ElasticSearch, HBase, or MySQL.
>>
>> Unfortunately, Spark - and Spark Streaming, especially - leads you to
>> believe that Spark could be used as an application server. This is not a
>> good use case for Spark.
>>
>> Remember that every job launched by Spark requires at least 1 CPU core,
>> some memory, and an available Executor JVM to provide that CPU and
>> memory.
>>
>> Yes, you can scale this horizontally because of the distributed nature
>> of Spark; however, it is not an efficient scaling strategy.
>>
>> For example, if you're looking to scale out to 1000 concurrent requests,
>> this is 1000 concurrent Spark jobs. This would require a cluster with
>> 1000 cores. This is just not cost effective.
>>
>> Use Spark for what it's good for - ad-hoc, interactive, and iterative
>> (machine learning, graph) analytics. Use an application server for what
>> it's good for - managing a large number of concurrent requests. And use
>> a database for what it's good for - storing/retrieving data.
>>
>> And any serious production deployment will need failover, throttling,
>> back pressure, auto-scaling, and service discovery.
>>
>> While Spark supports these to varying levels of production-readiness,
>> Spark is a batch-oriented system and not meant to be put on the user
>> request/response hot path.
>>
>> For the failover, throttling, back pressure, and auto-scaling that I
>> mentioned above, it's worth checking out the suite of Netflix OSS -
>> particularly Hystrix, Eureka, Zuul, Karyon, etc.:
>> http://netflix.github.io/
>>
>> Here's my GitHub project that incorporates a lot of these:
>> https://github.com/cfregly/fluxcapacitor
>>
>> Here's a Netflix Skunkworks GitHub project that packages these up in
>> Docker images: https://github.com/Netflix-Skunkworks/zerotodocker
>>
>> On Thu, Mar 10, 2016 at 1:40 PM, velvia.github <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I just wrote a blog post which might be really useful to you -- I have
>>> just benchmarked being able to achieve 700 queries per second in Spark.
>>> So, yes, web-speed SQL queries are definitely possible. Read my new
>>> blog post:
>>>
>>> http://velvia.github.io/Spark-Concurrent-Fast-Queries/
>>>
>>> and feel free to email me (at [email protected]) if you would like to
>>> follow up.
>>>
>>> -Evan
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-use-spark-inside-a-web-service-tp26426p26451.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>
>> --
>> *Chris Fregly*
>> Principal Data Solutions Engineer
>> IBM Spark Technology Center, San Francisco, CA
>> http://spark.tc | http://advancedspark.com

--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
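Mark's jobs-vs-tasks distinction can be put in back-of-envelope terms: cores bound how many Tasks run at any instant (and therefore latency), not how many Jobs the DAGScheduler can track concurrently. A toy calculation (illustrative numbers only; this is not Spark scheduler code, and the 4-tasks-per-job figure is an assumption for the example):

```python
# Toy model: a scheduler tracking many concurrent jobs apportions their
# tasks across a fixed pool of cores in "waves". The number of cores
# caps simultaneous *tasks*, not the number of *jobs* in flight.

def makespan_secs(num_jobs, tasks_per_job, num_cores, task_secs=1.0):
    """Idealized time to drain all tasks, assuming every core is kept
    busy: total tasks divided into ceiling(total / num_cores) waves."""
    total_tasks = num_jobs * tasks_per_job
    waves = -(-total_tasks // num_cores)  # ceiling division
    return waves * task_secs

# 1000 concurrent jobs of 4 short tasks each on a 100-core cluster:
# all 1000 jobs are tracked at once, but only 100 tasks run at any
# instant, so the backlog drains in ~40 task-lengths. You don't need
# 1000 cores to *accept* 1000 concurrent jobs - you need enough cores
# to drain their tasks within your latency budget.
print(makespan_secs(num_jobs=1000, tasks_per_job=4, num_cores=100))  # 40.0
```

Under this idealization, going from 100 to 1000 cores only shortens the drain time (40 to 4 task-lengths); the job count the driver can track is independent of core count, which is Mark's point.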
