Thanks, Evan, for the points. I had assumed as much, but since I don't have much experience I thought I might be missing something. Thanks for the answer!
On Mon, Mar 14, 2016 at 7:22 PM, Evan Chan <velvia.git...@gmail.com> wrote:
> Andres,
>
> A couple of points:
>
> 1) If you look at my post, you can see that you could use Spark for
> low latency - many queries can be executed in under a second with the
> right technology. It really depends on your definition of "real
> time", but I believe low latency is definitely possible.
> 2) Akka-http over SparkContext - this is essentially what Spark Job
> Server does. (It uses Spray, which is the predecessor to akka-http;
> we will upgrade once Spark 2.0 is incorporated.)
> 3) Someone else can probably talk about Ignite, but it is based on a
> distributed object cache. You define your objects in Java as POJOs,
> annotate which ones you want indexed, upload your jars, and then you
> can execute queries. It's a different use case than typical OLAP.
> There is some Spark integration, but then you would have the same
> bottlenecks going through Spark.
>
> On Fri, Mar 11, 2016 at 5:02 AM, Andrés Ivaldi <iaiva...@gmail.com> wrote:
> > Nice discussion. I have a question about web services with Spark.
> >
> > What could be the problem with using Akka-http as a web service
> > (like Play does), with one SparkContext created, and the queries
> > over akka-http using only that SparkContext instance?
> >
> > Also, about analytics: we are working on real-time analytics, and
> > as Hemant said, Spark is not a solution for low-latency queries.
> > What about using Ignite for that?
> >
> > On Fri, Mar 11, 2016 at 6:52 AM, Hemant Bhanawat <hemant9...@gmail.com>
> > wrote:
> >> Spark-jobserver is an elegant product that builds concurrency on
> >> top of Spark. But the current design of the DAGScheduler prevents
> >> Spark from becoming a truly concurrent solution for low-latency
> >> queries. The DAGScheduler will turn out to be a bottleneck for
> >> low-latency queries.
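Evan's point (2) and Andrés's question describe the same pattern: create one long-lived SparkContext when the web service boots and route every query through it, instead of launching a new Spark application per request. A minimal sketch of that shape in plain Python - `QueryContext`, `run_query`, and `handle_request` are illustrative stand-ins, not real Spark or akka-http APIs:

```python
import threading

class QueryContext:
    """Stand-in for an expensive, long-lived SparkContext."""
    instances = 0

    def __init__(self):
        # The heavy startup cost is paid here, once, at service boot.
        QueryContext.instances += 1

    def run_query(self, sql: str) -> str:
        # A lightweight per-request job against already-warm state.
        return f"result of {sql!r}"

# Created once at startup, shared by every request handler.
_context = QueryContext()

def handle_request(sql: str) -> str:
    """Each HTTP request submits a lightweight job to the shared context."""
    return _context.run_query(sql)

# Many concurrent requests reuse the same context rather than paying
# startup cost per request.
threads = [threading.Thread(target=handle_request, args=("select 1",))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert QueryContext.instances == 1  # one context, many requests
```

The contrast with the heavyweight path is that submitting a full application per request would construct a fresh `QueryContext` every time.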
> >> The Sparrow project was an effort to make Spark more suitable for
> >> such scenarios, but it never made it into the Spark codebase. If
> >> Spark is to become a highly concurrent solution, scheduling has to
> >> be distributed.
> >>
> >> Hemant Bhanawat
> >> www.snappydata.io
> >>
> >> On Fri, Mar 11, 2016 at 7:02 AM, Chris Fregly <ch...@fregly.com> wrote:
> >>> Great discussion, indeed.
> >>>
> >>> Mark Hamstra and I spoke offline just now.
> >>>
> >>> Below is a quick recap of our discussion on how they've achieved
> >>> acceptable performance from Spark on the user request/response
> >>> path (@mark - feel free to correct/comment).
> >>>
> >>> 1) There is a big difference in request/response latency between
> >>> submitting a full Spark Application (heavyweight) versus having a
> >>> long-running Spark Application (like Spark Job Server) that
> >>> submits lighter-weight Jobs using a shared SparkContext. Mark is
> >>> obviously using the latter - a long-running Spark App.
> >>>
> >>> 2) There are some enhancements to Spark that are required to
> >>> achieve acceptable user request/response times. Some links that
> >>> Mark provided:
> >>>
> >>> https://issues.apache.org/jira/browse/SPARK-11838
> >>> https://github.com/apache/spark/pull/11036
> >>> https://github.com/apache/spark/pull/11403
> >>> https://issues.apache.org/jira/browse/SPARK-13523
> >>> https://issues.apache.org/jira/browse/SPARK-13756
> >>>
> >>> Essentially, a deeper level of caching at the shuffle file layer
> >>> to reduce compute and memory between queries.
> >>>
> >>> Note that Mark is running a slightly modified version of stock
> >>> Spark. (He's mentioned this in prior posts as well.)
> >>>
> >>> And I have to say that I'm, personally, seeing more and more
> >>> slightly modified versions of Spark being deployed to production
> >>> to work around outstanding PRs and JIRAs.
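The shuffle-layer caching tracked in the JIRAs above has the same shape as ordinary memoization: repeated queries reuse an intermediate result instead of recomputing it. A toy Python illustration of the effect only - this is not Spark's actual mechanism, and `shuffle_stage` is a hypothetical stand-in:

```python
from functools import lru_cache

compute_calls = 0

@lru_cache(maxsize=None)
def shuffle_stage(partition_key: int) -> int:
    """Stand-in for an expensive shuffle computation."""
    global compute_calls
    compute_calls += 1  # the recompute work we want to avoid repeating
    return sum(range(partition_key))

# Three "queries" touch the same intermediate data...
for _ in range(3):
    shuffle_stage(1000)

assert compute_calls == 1  # ...but the stage is computed only once
```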
> >>> This may not be what people want to hear, but it's a trend I'm
> >>> seeing lately as more and more teams customize Spark for their
> >>> specific use cases.
> >>>
> >>> Anyway, thanks for the good discussion, everyone! This is why we
> >>> have these lists, right? :)
> >>>
> >>> On Thu, Mar 10, 2016 at 7:51 PM, Evan Chan <velvia.git...@gmail.com>
> >>> wrote:
> >>>> One of the premises here is that if you can restrict your
> >>>> workload to fewer cores - which is easier with FiloDB and
> >>>> careful data modeling - you can make this work at much higher
> >>>> concurrency and lower latency than most typical Spark use cases.
> >>>>
> >>>> The reason it typically does not work in production is that most
> >>>> people are using HDFS and files. These data sources are designed
> >>>> for running queries and workloads on all your cores across many
> >>>> workers, not for filtering your workload down to only one or two
> >>>> cores.
> >>>>
> >>>> There is actually nothing inherent in Spark that prevents people
> >>>> from using it as an app server. However, the insistence on using
> >>>> it with HDFS is what kills concurrency. This is why FiloDB is
> >>>> important.
> >>>>
> >>>> I agree there are more optimized stacks for running app servers,
> >>>> but of the choices you mentioned: ES is targeted at text search;
> >>>> Cassandra and HBase by themselves are not fast enough for the
> >>>> analytical queries the OP wants; and MySQL is great but not
> >>>> scalable. Something like Vectorwise, HANA, or Vertica would
> >>>> probably work well, but those are mostly not free solutions.
> >>>> Druid could work too if the use case is right.
> >>>>
> >>>> Anyway, great discussion!
> >>>>
> >>>> On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly <ch...@fregly.com>
> >>>> wrote:
> >>>> > You are correct, Mark. I misspoke. Apologies for the
> >>>> > confusion.
> >>>> >
> >>>> > So the problem is even worse, given that a typical job
> >>>> > requires multiple tasks/cores.
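Evan's "restrict your workload to fewer cores" approach maps onto standard Spark configuration. A spark-defaults.conf sketch for a long-running, low-latency app - the setting names are real, but the values are illustrative, not recommendations:

```
# Cap the total cores this long-running app takes from the cluster
# (applies in standalone/Mesos modes)
spark.cores.max        4
# Cores per executor
spark.executor.cores   2
# FAIR scheduling so concurrent jobs within the app share those cores
spark.scheduler.mode   FAIR
```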
> >>>> > I have yet to see this particular architecture work in
> >>>> > production. I would love for someone to prove otherwise.
> >>>> >
> >>>> > On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra
> >>>> > <m...@clearstorydata.com> wrote:
> >>>> >>> For example, if you're looking to scale out to 1000
> >>>> >>> concurrent requests, this is 1000 concurrent Spark jobs.
> >>>> >>> This would require a cluster with 1000 cores.
> >>>> >>
> >>>> >> This doesn't make sense. A Spark Job is a driver/DAGScheduler
> >>>> >> concept without any 1:1 correspondence between Worker cores
> >>>> >> and Jobs. Cores are used to run Tasks, not Jobs. So, yes, a
> >>>> >> 1000-core cluster can run at most 1000 simultaneous Tasks,
> >>>> >> but that doesn't really tell you anything about how many Jobs
> >>>> >> are or can be concurrently tracked by the DAGScheduler, which
> >>>> >> will be apportioning the Tasks from those concurrent Jobs
> >>>> >> across the available Executor cores.
> >>>> >>
> >>>> >> On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly <ch...@fregly.com>
> >>>> >> wrote:
> >>>> >>> Good stuff, Evan. Looks like this is utilizing the in-memory
> >>>> >>> capabilities of FiloDB, which is pretty cool. Looking
> >>>> >>> forward to the webcast, as I don't know much about FiloDB.
> >>>> >>>
> >>>> >>> My personal preference here is to remove Spark from the user
> >>>> >>> request/response hot path.
> >>>> >>>
> >>>> >>> I can't tell you how many times I've had to unroll that
> >>>> >>> architecture at clients - and replace it with a real
> >>>> >>> database like Cassandra, Elasticsearch, HBase, or MySQL.
> >>>> >>>
> >>>> >>> Unfortunately, Spark - and Spark Streaming, especially -
> >>>> >>> leads you to believe that Spark could be used as an
> >>>> >>> application server.
> >>>> >>> This is not a good use case for Spark.
> >>>> >>>
> >>>> >>> Remember that every job launched by Spark requires 1 CPU
> >>>> >>> core, some memory, and an available Executor JVM to provide
> >>>> >>> the CPU and memory.
> >>>> >>>
> >>>> >>> Yes, you can scale this horizontally because of the
> >>>> >>> distributed nature of Spark; however, it is not an efficient
> >>>> >>> scaling strategy.
> >>>> >>>
> >>>> >>> For example, if you're looking to scale out to 1000
> >>>> >>> concurrent requests, this is 1000 concurrent Spark jobs.
> >>>> >>> This would require a cluster with 1000 cores. That is just
> >>>> >>> not cost effective.
> >>>> >>>
> >>>> >>> Use Spark for what it's good for - ad hoc, interactive, and
> >>>> >>> iterative (machine learning, graph) analytics. Use an
> >>>> >>> application server for what it's good for - managing a large
> >>>> >>> number of concurrent requests. And use a database for what
> >>>> >>> it's good for - storing and retrieving data.
> >>>> >>>
> >>>> >>> Any serious production deployment will also need failover,
> >>>> >>> throttling, back pressure, auto-scaling, and service
> >>>> >>> discovery.
> >>>> >>>
> >>>> >>> While Spark supports these at varying levels of production
> >>>> >>> readiness, Spark is a batch-oriented system and not meant to
> >>>> >>> be put on the user request/response hot path.
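Mark's correction above - cores run Tasks, not Jobs, so 1000 concurrent Jobs do not require 1000 cores - can be illustrated with a plain thread-pool analogy. No Spark is involved here; the pool stands in for a fixed number of executor cores, and the numbers are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

EXECUTOR_CORES = 8  # the entire "cluster"

def job(i: int) -> int:
    """Each 'job' here is a single lightweight task."""
    return i * i

# 1000 concurrent jobs are apportioned across 8 workers: the number of
# jobs that can be in flight is not bounded by the number of cores,
# only the number of tasks running at any instant is.
with ThreadPoolExecutor(max_workers=EXECUTOR_CORES) as pool:
    results = list(pool.map(job, range(1000)))

assert len(results) == 1000  # all 1000 jobs completed on 8 workers
```

The analogy also shows where latency goes: with more concurrent jobs than cores, jobs queue, which is the DAGScheduler bottleneck Hemant describes.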
> >>>> >>> For the failover, throttling, back pressure, and
> >>>> >>> autoscaling that I mentioned above, it's worth checking out
> >>>> >>> the Netflix OSS suite - particularly Hystrix, Eureka, Zuul,
> >>>> >>> Karyon, etc.: http://netflix.github.io/
> >>>> >>>
> >>>> >>> Here's my GitHub project that incorporates a lot of these:
> >>>> >>> https://github.com/cfregly/fluxcapacitor
> >>>> >>>
> >>>> >>> Here's a Netflix Skunkworks GitHub project that packages
> >>>> >>> these up in Docker images:
> >>>> >>> https://github.com/Netflix-Skunkworks/zerotodocker
> >>>> >>>
> >>>> >>> On Thu, Mar 10, 2016 at 1:40 PM, velvia.github
> >>>> >>> <velvia.git...@gmail.com> wrote:
> >>>> >>>> Hi,
> >>>> >>>>
> >>>> >>>> I just wrote a blog post which might be really useful to
> >>>> >>>> you - I have just benchmarked achieving 700 queries per
> >>>> >>>> second in Spark. So yes, web-speed SQL queries are
> >>>> >>>> definitely possible. Read my new blog post:
> >>>> >>>>
> >>>> >>>> http://velvia.github.io/Spark-Concurrent-Fast-Queries/
> >>>> >>>>
> >>>> >>>> and feel free to email me (at vel...@gmail.com) if you
> >>>> >>>> would like to follow up.
> >>>> >>>>
> >>>> >>>> -Evan
> >>>> >>>>
> >>>> >>>> --
> >>>> >>>> View this message in context:
> >>>> >>>> http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-use-spark-inside-a-web-service-tp26426p26451.html
> >>>> >>>> Sent from the Apache Spark User List mailing list archive
> >>>> >>>> at Nabble.com.
> >>>> >>>> ---------------------------------------------------------------------
> >>>> >>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >>>> >>>> For additional commands, e-mail: user-h...@spark.apache.org
> >>>> >>>
> >>>> >>> --
> >>>> >>> Chris Fregly
> >>>> >>> Principal Data Solutions Engineer
> >>>> >>> IBM Spark Technology Center, San Francisco, CA
> >>>> >>> http://spark.tc | http://advancedspark.com

--
Ing. Ivaldi Andres