Nice discussion. I have a question about web services with Spark: what could be the problem with using Akka HTTP as the web service layer (like Play does), with a single SparkContext created up front and all of the HTTP queries served through that one shared SparkContext instance?
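To make that concrete, here is a minimal sketch of the kind of service I have in mind (the endpoint, port, and `local[4]` master are just illustrative; it assumes Spark 1.x and Akka HTTP on the classpath):

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer
import org.apache.spark.{SparkConf, SparkContext}

import scala.concurrent.Future

object SparkHttpService extends App {
  implicit val system = ActorSystem("spark-http")
  implicit val materializer = ActorMaterializer()
  implicit val ec = system.dispatcher

  // One long-lived SparkContext shared by all requests.
  // spark.scheduler.mode=FAIR lets concurrent jobs share executor
  // cores instead of queueing FIFO behind one another.
  val conf = new SparkConf()
    .setAppName("query-service")
    .setMaster("local[4]") // illustrative; point at a real cluster
    .set("spark.scheduler.mode", "FAIR")
  val sc = new SparkContext(conf)

  val route =
    path("sum" / IntNumber) { n =>
      get {
        // Each HTTP request submits a lightweight job on the shared
        // context; run it off the request thread so the route stays
        // non-blocking.
        val result: Future[Long] =
          Future(sc.parallelize(1 to n).map(_.toLong).sum().toLong)
        complete(result.map(_.toString))
      }
    }

  Http().bindAndHandle(route, "0.0.0.0", 8080)
}
```

With the FAIR scheduler (and, per request, `sc.setLocalProperty("spark.scheduler.pool", ...)`) many small concurrent jobs should be apportioned across the executors' cores rather than each request waiting in a FIFO queue, though the single DAGScheduler could still become the bottleneck for low-latency queries.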
Also, about analytics: we are working on real-time analytics, and as Hemant said, Spark is not a solution for low-latency queries. What about using Ignite for that?

On Fri, Mar 11, 2016 at 6:52 AM, Hemant Bhanawat <[email protected]> wrote:

> Spark-jobserver is an elegant product that builds concurrency on top of Spark. But the current design of the DAGScheduler prevents Spark from becoming a truly concurrent solution for low-latency queries: the DAGScheduler will turn out to be a bottleneck. The Sparrow project was an effort to make Spark more suitable for such scenarios, but it never made it into the Spark codebase. If Spark is to become a highly concurrent solution, scheduling has to be distributed.
>
> Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
> www.snappydata.io
>
> On Fri, Mar 11, 2016 at 7:02 AM, Chris Fregly <[email protected]> wrote:
>
>> Great discussion, indeed.
>>
>> Mark Hamstra and I spoke offline just now. Below is a quick recap of our discussion on how they've achieved acceptable performance from Spark on the user request/response path (@mark - feel free to correct/comment).
>>
>> 1) There is a big difference in request/response latency between submitting a full Spark application (heavyweight) and having a long-running Spark application (like Spark Job Server) that submits lighter-weight jobs through a shared SparkContext. Mark is obviously using the latter - a long-running Spark app.
>>
>> 2) There are some enhancements to Spark that are required to achieve acceptable user request/response times.
>> Some links that Mark provided are as follows:
>>
>> - https://issues.apache.org/jira/browse/SPARK-11838
>> - https://github.com/apache/spark/pull/11036
>> - https://github.com/apache/spark/pull/11403
>> - https://issues.apache.org/jira/browse/SPARK-13523
>> - https://issues.apache.org/jira/browse/SPARK-13756
>>
>> Essentially, these add a deeper level of caching at the shuffle-file layer to reduce compute and memory use between queries.
>>
>> Note that Mark is running a slightly modified version of stock Spark. (He's mentioned this in prior posts, as well.)
>>
>> And I have to say that I'm personally seeing more and more slightly modified versions of Spark being deployed to production to work around outstanding PRs and JIRAs. This may not be what people want to hear, but it's a trend I'm seeing lately as more and more teams customize Spark to their specific use cases.
>>
>> Anyway, thanks for the good discussion, everyone! This is why we have these lists, right? :)
>>
>> On Thu, Mar 10, 2016 at 7:51 PM, Evan Chan <[email protected]> wrote:
>>
>>> One of the premises here is that if you can restrict your workload to fewer cores - which is easier with FiloDB and careful data modeling - you can make this work for much higher concurrency and lower latency than most typical Spark use cases.
>>>
>>> The reason why it typically does not work in production is that most people are using HDFS and files. These data sources are designed for running queries and workloads on all your cores across many workers, not for filtering your workload down to only one or two cores.
>>>
>>> There is actually nothing inherent in Spark that prevents people from using it as an app server. However, the insistence on using it with HDFS is what kills concurrency. This is why FiloDB is important.
>>> I agree there are more optimized stacks for running app servers, but consider the choices that you mentioned: ES is targeted at text search; Cass and HBase by themselves are not fast enough for the analytical queries that the OP wants; and MySQL is great but not scalable. Something like VectorWise, HANA, or Vertica would probably work well, but those are mostly not free solutions. Druid could work too if the use case is right.
>>>
>>> Anyways, great discussion!
>>>
>>> On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly <[email protected]> wrote:
>>> > You are correct, Mark. I misspoke. Apologies for the confusion.
>>> >
>>> > So the problem is even worse, given that a typical job requires multiple tasks/cores.
>>> >
>>> > I have yet to see this particular architecture work in production. I would love for someone to prove otherwise.
>>> >
>>> > On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra <[email protected]> wrote:
>>> >
>>> >>> For example, if you're looking to scale out to 1000 concurrent requests, this is 1000 concurrent Spark jobs. This would require a cluster with 1000 cores.
>>> >>
>>> >> This doesn't make sense. A Spark Job is a driver/DAGScheduler concept without any 1:1 correspondence between Worker cores and Jobs. Cores are used to run Tasks, not Jobs. So, yes, a 1000-core cluster can run at most 1000 simultaneous Tasks, but that doesn't really tell you anything about how many Jobs are or can be concurrently tracked by the DAGScheduler, which will be apportioning the Tasks from those concurrent Jobs across the available Executor cores.
>>> >>
>>> >> On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly <[email protected]> wrote:
>>> >>
>>> >>> Good stuff, Evan. Looks like this is utilizing the in-memory capabilities of FiloDB, which is pretty cool. Looking forward to the webcast, as I don't know much about FiloDB.
>>> >>>
>>> >>> My personal recommendation here is to remove Spark from the user request/response hot path. I can't tell you how many times I've had to unroll that architecture at clients and replace it with a real database like Cassandra, Elasticsearch, HBase, or MySQL.
>>> >>>
>>> >>> Unfortunately, Spark - and Spark Streaming, especially - leads you to believe that Spark could be used as an application server. This is not a good use case for Spark.
>>> >>>
>>> >>> Remember that every job launched by Spark requires 1 CPU core, some memory, and an available Executor JVM to provide the CPU and memory. Yes, you can horizontally scale this because of the distributed nature of Spark; however, it is not an efficient scaling strategy. For example, if you're looking to scale out to 1000 concurrent requests, this is 1000 concurrent Spark jobs. This would require a cluster with 1000 cores, which is just not cost-effective.
>>> >>>
>>> >>> Use Spark for what it's good for: ad-hoc, interactive, and iterative (machine learning, graph) analytics. Use an application server for what it's good for: managing a large number of concurrent requests. And use a database for what it's good for: storing and retrieving data.
>>> >>>
>>> >>> Any serious production deployment will also need failover, throttling, back pressure, auto-scaling, and service discovery. While Spark supports these to varying levels of production-readiness, Spark is a batch-oriented system and not meant to be put on the user request/response hot path.
>>> >>> For the failover, throttling, back pressure, and auto-scaling that I mentioned above, it's worth checking out the Netflix OSS suite - particularly Hystrix, Eureka, Zuul, Karyon, etc.: http://netflix.github.io/
>>> >>>
>>> >>> Here's my GitHub project that incorporates a lot of these: https://github.com/cfregly/fluxcapacitor
>>> >>>
>>> >>> And here's a Netflix Skunkworks GitHub project that packages them up in Docker images: https://github.com/Netflix-Skunkworks/zerotodocker
>>> >>>
>>> >>> On Thu, Mar 10, 2016 at 1:40 PM, velvia.github <[email protected]> wrote:
>>> >>>>
>>> >>>> Hi,
>>> >>>>
>>> >>>> I just wrote a blog post which might be really useful to you -- I have just benchmarked being able to achieve 700 queries per second in Spark. So, yes, web-speed SQL queries are definitely possible. Read my new blog post:
>>> >>>>
>>> >>>> http://velvia.github.io/Spark-Concurrent-Fast-Queries/
>>> >>>>
>>> >>>> and feel free to email me (at [email protected]) if you would like to follow up.
>>> >>>>
>>> >>>> -Evan
>>> >>>>
>>> >>>> --
>>> >>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-use-spark-inside-a-web-service-tp26426p26451.html
>>> >>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>> >>>
>>> >>> --
>>> >>> Chris Fregly
>>> >>> Principal Data Solutions Engineer
>>> >>> IBM Spark Technology Center, San Francisco, CA
>>> >>> http://spark.tc | http://advancedspark.com

--
Ing. Ivaldi Andres
