Thanks, Evan, for the points. I had assumed as much, but since I don't have much experience I thought I might be missing something. Thanks for the answer!
On Mon, Mar 14, 2016 at 7:22 PM, Evan Chan <velvia.git...@gmail.com> wrote:
> Andres,
>
> A couple of points:
>
> 1) If you look at my post, you can see that you could use Spark for
> low latency - many queries can be executed in under a second with the
> right technology. It really depends on your definition of "real
> time", but I believe low latency is definitely possible.
> 2) Akka-http over SparkContext - this is essentially what Spark Job
> Server does. (It uses Spray, which is the predecessor to akka-http;
> we will upgrade once Spark 2.0 is incorporated.)
> 3) Someone else can probably talk about Ignite, but it is based on a
> distributed object cache. You define your objects in Java as POJOs,
> annotate which ones you want indexed, upload your jars, and then you
> can execute queries. It's a different use case than typical OLAP.
> There is some Spark integration, but then you would have the same
> bottlenecks going through Spark.
>
> On Fri, Mar 11, 2016 at 5:02 AM, Andrés Ivaldi <iaiva...@gmail.com> wrote:
> > Nice discussion. I have a question about web services with Spark.
> >
> > What could be the problem with using Akka-http as a web service
> > (like Play does), with one SparkContext created, and the queries
> > over akka-http using only that SparkContext instance?
> >
> > Also, about analytics: we are working on real-time analytics, and
> > as Hemant said, Spark is not a solution for low-latency queries.
> > What about using Ignite for that?
> >
> > On Fri, Mar 11, 2016 at 6:52 AM, Hemant Bhanawat <hemant9...@gmail.com>
> > wrote:
> >> Spark-jobserver is an elegant product that builds concurrency on
> >> top of Spark. But the current design of the DAGScheduler prevents
> >> Spark from becoming a truly concurrent solution for low-latency
> >> queries. The DAGScheduler will turn out to be a bottleneck for
> >> low-latency queries.
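Evan's point (2) and Andrés's question describe the same pattern: create one long-lived SparkContext when the web service boots and route every query through it, instead of launching a new Spark application per request. A minimal sketch of that shape in plain Python - `QueryContext`, `run_query`, and `handle_request` are illustrative stand-ins, not real Spark or akka-http APIs:

```python
import threading

class QueryContext:
    """Stand-in for an expensive, long-lived SparkContext."""
    instances = 0

    def __init__(self):
        # The heavy startup cost is paid here, once, at service boot.
        QueryContext.instances += 1

    def run_query(self, sql: str) -> str:
        # A lightweight per-request job against already-warm state.
        return f"result of {sql!r}"

# Created once at startup, shared by every request handler.
_context = QueryContext()

def handle_request(sql: str) -> str:
    """Each HTTP request submits a lightweight job to the shared context."""
    return _context.run_query(sql)

# Many concurrent requests reuse the same context rather than paying
# startup cost per request.
threads = [threading.Thread(target=handle_request, args=("select 1",))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert QueryContext.instances == 1  # one context, many requests
```

The contrast with the heavyweight path is that submitting a full application per request would construct a fresh `QueryContext` every time.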
> >> The Sparrow project was an effort to make Spark more suitable for
> >> such scenarios, but it never made it into the Spark codebase. If
> >> Spark is to become a highly concurrent solution, scheduling has to
> >> be distributed.
> >>
> >> Hemant Bhanawat
> >> www.snappydata.io
> >>
> >> On Fri, Mar 11, 2016 at 7:02 AM, Chris Fregly <ch...@fregly.com> wrote:
> >>> Great discussion, indeed.
> >>>
> >>> Mark Hamstra and I spoke offline just now.
> >>>
> >>> Below is a quick recap of our discussion on how they've achieved
> >>> acceptable performance from Spark on the user request/response
> >>> path (@mark - feel free to correct/comment).
> >>>
> >>> 1) There is a big difference in request/response latency between
> >>> submitting a full Spark Application (heavyweight) versus having a
> >>> long-running Spark Application (like Spark Job Server) that
> >>> submits lighter-weight Jobs using a shared SparkContext. Mark is
> >>> obviously using the latter - a long-running Spark App.
> >>>
> >>> 2) There are some enhancements to Spark that are required to
> >>> achieve acceptable user request/response times. Some links that
> >>> Mark provided:
> >>>
> >>> https://issues.apache.org/jira/browse/SPARK-11838
> >>> https://github.com/apache/spark/pull/11036
> >>> https://github.com/apache/spark/pull/11403
> >>> https://issues.apache.org/jira/browse/SPARK-13523
> >>> https://issues.apache.org/jira/browse/SPARK-13756
> >>>
> >>> Essentially, a deeper level of caching at the shuffle file layer
> >>> to reduce compute and memory between queries.
> >>>
> >>> Note that Mark is running a slightly modified version of stock
> >>> Spark. (He's mentioned this in prior posts as well.)
> >>>
> >>> And I have to say that I'm, personally, seeing more and more
> >>> slightly modified versions of Spark being deployed to production
> >>> to work around outstanding PRs and JIRAs.
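The shuffle-layer caching tracked in the JIRAs above has the same shape as ordinary memoization: repeated queries reuse an intermediate result instead of recomputing it. A toy Python illustration of the effect only - this is not Spark's actual mechanism, and `shuffle_stage` is a hypothetical stand-in:

```python
from functools import lru_cache

compute_calls = 0

@lru_cache(maxsize=None)
def shuffle_stage(partition_key: int) -> int:
    """Stand-in for an expensive shuffle computation."""
    global compute_calls
    compute_calls += 1  # the recompute work we want to avoid repeating
    return sum(range(partition_key))

# Three "queries" touch the same intermediate data...
for _ in range(3):
    shuffle_stage(1000)

assert compute_calls == 1  # ...but the stage is computed only once
```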
> >>> This may not be what people want to hear, but it's a trend I'm
> >>> seeing lately as more and more teams customize Spark for their
> >>> specific use cases.
> >>>
> >>> Anyway, thanks for the good discussion, everyone! This is why we
> >>> have these lists, right? :)
> >>>
> >>> On Thu, Mar 10, 2016 at 7:51 PM, Evan Chan <velvia.git...@gmail.com>
> >>> wrote:
> >>>> One of the premises here is that if you can restrict your
> >>>> workload to fewer cores - which is easier with FiloDB and
> >>>> careful data modeling - you can make this work at much higher
> >>>> concurrency and lower latency than most typical Spark use cases.
> >>>>
> >>>> The reason it typically does not work in production is that most
> >>>> people are using HDFS and files. These data sources are designed
> >>>> for running queries and workloads on all your cores across many
> >>>> workers, not for filtering your workload down to only one or two
> >>>> cores.
> >>>>
> >>>> There is actually nothing inherent in Spark that prevents people
> >>>> from using it as an app server. However, the insistence on using
> >>>> it with HDFS is what kills concurrency. This is why FiloDB is
> >>>> important.
> >>>>
> >>>> I agree there are more optimized stacks for running app servers,
> >>>> but of the choices you mentioned: ES is targeted at text search;
> >>>> Cassandra and HBase by themselves are not fast enough for the
> >>>> analytical queries the OP wants; and MySQL is great but not
> >>>> scalable. Something like Vectorwise, HANA, or Vertica would
> >>>> probably work well, but those are mostly not free solutions.
> >>>> Druid could work too if the use case is right.
> >>>>
> >>>> Anyway, great discussion!
> >>>>
> >>>> On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly <ch...@fregly.com>
> >>>> wrote:
> >>>> > You are correct, Mark. I misspoke. Apologies for the
> >>>> > confusion.
> >>>> >
> >>>> > So the problem is even worse, given that a typical job
> >>>> > requires multiple tasks/cores.
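Evan's "restrict your workload to fewer cores" approach maps onto standard Spark configuration. A spark-defaults.conf sketch for a long-running, low-latency app - the setting names are real, but the values are illustrative, not recommendations:

```
# Cap the total cores this long-running app takes from the cluster
# (applies in standalone/Mesos modes)
spark.cores.max        4
# Cores per executor
spark.executor.cores   2
# FAIR scheduling so concurrent jobs within the app share those cores
spark.scheduler.mode   FAIR
```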
> >>>> > I have yet to see this particular architecture work in
> >>>> > production. I would love for someone to prove otherwise.
> >>>> >
> >>>> > On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra
> >>>> > <m...@clearstorydata.com> wrote:
> >>>> >>> For example, if you're looking to scale out to 1000
> >>>> >>> concurrent requests, this is 1000 concurrent Spark jobs.
> >>>> >>> This would require a cluster with 1000 cores.
> >>>> >>
> >>>> >> This doesn't make sense. A Spark Job is a driver/DAGScheduler
> >>>> >> concept without any 1:1 correspondence between Worker cores
> >>>> >> and Jobs. Cores are used to run Tasks, not Jobs. So, yes, a
> >>>> >> 1000-core cluster can run at most 1000 simultaneous Tasks,
> >>>> >> but that doesn't really tell you anything about how many Jobs
> >>>> >> are or can be concurrently tracked by the DAGScheduler, which
> >>>> >> will be apportioning the Tasks from those concurrent Jobs
> >>>> >> across the available Executor cores.
> >>>> >>
> >>>> >> On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly <ch...@fregly.com>
> >>>> >> wrote:
> >>>> >>> Good stuff, Evan. Looks like this is utilizing the in-memory
> >>>> >>> capabilities of FiloDB, which is pretty cool. Looking
> >>>> >>> forward to the webcast, as I don't know much about FiloDB.
> >>>> >>>
> >>>> >>> My personal preference here is to remove Spark from the user
> >>>> >>> request/response hot path.
> >>>> >>>
> >>>> >>> I can't tell you how many times I've had to unroll that
> >>>> >>> architecture at clients - and replace it with a real
> >>>> >>> database like Cassandra, Elasticsearch, HBase, or MySQL.
> >>>> >>>
> >>>> >>> Unfortunately, Spark - and Spark Streaming, especially -
> >>>> >>> leads you to believe that Spark could be used as an
> >>>> >>> application server.
> >>>> >>> This is not a good use case for Spark.
> >>>> >>>
> >>>> >>> Remember that every job launched by Spark requires 1 CPU
> >>>> >>> core, some memory, and an available Executor JVM to provide
> >>>> >>> the CPU and memory.
> >>>> >>>
> >>>> >>> Yes, you can scale this horizontally because of the
> >>>> >>> distributed nature of Spark; however, it is not an efficient
> >>>> >>> scaling strategy.
> >>>> >>>
> >>>> >>> For example, if you're looking to scale out to 1000
> >>>> >>> concurrent requests, this is 1000 concurrent Spark jobs.
> >>>> >>> This would require a cluster with 1000 cores. That is just
> >>>> >>> not cost effective.
> >>>> >>>
> >>>> >>> Use Spark for what it's good for - ad hoc, interactive, and
> >>>> >>> iterative (machine learning, graph) analytics. Use an
> >>>> >>> application server for what it's good for - managing a large
> >>>> >>> number of concurrent requests. And use a database for what
> >>>> >>> it's good for - storing and retrieving data.
> >>>> >>>
> >>>> >>> Any serious production deployment will also need failover,
> >>>> >>> throttling, back pressure, auto-scaling, and service
> >>>> >>> discovery.
> >>>> >>>
> >>>> >>> While Spark supports these at varying levels of production
> >>>> >>> readiness, Spark is a batch-oriented system and not meant to
> >>>> >>> be put on the user request/response hot path.
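Mark's correction above - cores run Tasks, not Jobs, so 1000 concurrent Jobs do not require 1000 cores - can be illustrated with a plain thread-pool analogy. No Spark is involved here; the pool stands in for a fixed number of executor cores, and the numbers are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

EXECUTOR_CORES = 8  # the entire "cluster"

def job(i: int) -> int:
    """Each 'job' here is a single lightweight task."""
    return i * i

# 1000 concurrent jobs are apportioned across 8 workers: the number of
# jobs that can be in flight is not bounded by the number of cores,
# only the number of tasks running at any instant is.
with ThreadPoolExecutor(max_workers=EXECUTOR_CORES) as pool:
    results = list(pool.map(job, range(1000)))

assert len(results) == 1000  # all 1000 jobs completed on 8 workers
```

The analogy also shows where latency goes: with more concurrent jobs than cores, jobs queue, which is the DAGScheduler bottleneck Hemant describes.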
> >>>> >>> For the failover, throttling, back pressure, and
> >>>> >>> autoscaling that I mentioned above, it's worth checking out
> >>>> >>> the Netflix OSS suite - particularly Hystrix, Eureka, Zuul,
> >>>> >>> Karyon, etc.: http://netflix.github.io/
> >>>> >>>
> >>>> >>> Here's my GitHub project that incorporates a lot of these:
> >>>> >>> https://github.com/cfregly/fluxcapacitor
> >>>> >>>
> >>>> >>> Here's a Netflix Skunkworks GitHub project that packages
> >>>> >>> these up in Docker images:
> >>>> >>> https://github.com/Netflix-Skunkworks/zerotodocker
> >>>> >>>
> >>>> >>> On Thu, Mar 10, 2016 at 1:40 PM, velvia.github
> >>>> >>> <velvia.git...@gmail.com> wrote:
> >>>> >>>> Hi,
> >>>> >>>>
> >>>> >>>> I just wrote a blog post which might be really useful to
> >>>> >>>> you - I have just benchmarked achieving 700 queries per
> >>>> >>>> second in Spark. So yes, web-speed SQL queries are
> >>>> >>>> definitely possible. Read my new blog post:
> >>>> >>>>
> >>>> >>>> http://velvia.github.io/Spark-Concurrent-Fast-Queries/
> >>>> >>>>
> >>>> >>>> and feel free to email me (at vel...@gmail.com) if you
> >>>> >>>> would like to follow up.
> >>>> >>>>
> >>>> >>>> -Evan
> >>>> >>>>
> >>>> >>>> --
> >>>> >>>> View this message in context:
> >>>> >>>> http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-use-spark-inside-a-web-service-tp26426p26451.html
> >>>> >>>> Sent from the Apache Spark User List mailing list archive
> >>>> >>>> at Nabble.com.
> >>>> >>>> ---------------------------------------------------------------------
> >>>> >>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >>>> >>>> For additional commands, e-mail: user-h...@spark.apache.org
> >>>> >>>
> >>>> >>> --
> >>>> >>> Chris Fregly
> >>>> >>> Principal Data Solutions Engineer
> >>>> >>> IBM Spark Technology Center, San Francisco, CA
> >>>> >>> http://spark.tc | http://advancedspark.com

--
Ing. Ivaldi Andres