Re: Can we use spark inside a web service?

Teng Qiu Thu, 10 Mar 2016 15:27:40 -0800

This really depends on how you define "hot" :) and on your use cases. Spark
is definitely not a one-size-fits-all solution, at least not yet, especially
for heavy joins and full scans.


Maybe Spark alone fits your production workload and analytical
requirements, but in general I agree with Chris: for high-concurrency,
multi-tenant scenarios, there are many better existing solutions.

On Thursday, 10 March 2016, Mark Hamstra wrote:
> The fact that a typical Job requires multiple Tasks is not a problem, but
> rather an opportunity for the Scheduler to interleave the workloads of
> multiple concurrent Jobs across the available cores.
>
> I work every day with such a production architecture with Spark on the
> user request/response hot path.
> On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly <[email protected]> wrote:
>>
>> you are correct, mark.  i misspoke.  apologies for the confusion.
>>
>> so the problem is even worse given that a typical job requires multiple
>> tasks/cores.
>>
>> i have yet to see this particular architecture work in production.  i
>> would love for someone to prove otherwise.
>>
>> On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra <[email protected]> wrote:
>>>>
>>>> For example, if you're looking to scale out to 1000 concurrent
>>>> requests, this is 1000 concurrent Spark jobs.  This would require a
>>>> cluster with 1000 cores.
>>>
>>> This doesn't make sense.  A Spark Job is a driver/DAGScheduler concept
>>> without any 1:1 correspondence between Worker cores and Jobs.  Cores are
>>> used to run Tasks, not Jobs.  So, yes, a 1000 core cluster can run at most
>>> 1000 simultaneous Tasks, but that doesn't really tell you anything about
>>> how many Jobs are or can be concurrently tracked by the DAGScheduler, which
>>> will be apportioning the Tasks from those concurrent Jobs across the
>>> available Executor cores.
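
The distinction Mark draws above can be illustrated with a toy model. This is
a minimal sketch in plain Python, not Spark code: the `simulate` function, its
round-robin policy, and the fixed one-step task duration are all assumptions
made for illustration, and Spark's real scheduler is far more involved. It
only shows that the number of Jobs in flight is independent of the number of
cores, because cores run Tasks.

```python
from collections import deque


def simulate(num_cores, tasks_per_job):
    """Toy round-robin task scheduler (hypothetical, not Spark's scheduler).

    Every task takes exactly one time step, and each step runs at most
    num_cores tasks. Returns {job_id: step in which its last task ran}.
    """
    remaining = dict(enumerate(tasks_per_job))
    # Jobs that still have tasks to run, served in round-robin order.
    queue = deque(job for job, n in remaining.items() if n > 0)
    finish = {}
    step = 0
    while queue:
        step += 1
        free = num_cores
        # Hand out free cores one task at a time, cycling over the jobs.
        while free > 0 and queue:
            job = queue.popleft()
            remaining[job] -= 1
            free -= 1
            if remaining[job] == 0:
                finish[job] = step
            else:
                queue.append(job)
    return finish


if __name__ == "__main__":
    # 1000 concurrent jobs of 4 tasks each, on only 100 cores.
    result = simulate(100, [4] * 1000)
    print(len(result), "jobs finished; last one at step", max(result.values()))
```

In this toy run all 1000 four-task jobs complete on 100 cores; the run takes
40 steps instead of the 4 it would take on 1000 cores. Fewer cores means more
queueing and higher latency, not a hard cap on concurrent jobs.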
>>> On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly <[email protected]> wrote:
>>>>
>>>> Good stuff, Evan.  Looks like this is utilizing the in-memory
>>>> capabilities of FiloDB, which is pretty cool.  Looking forward to the
>>>> webcast as I don't know much about FiloDB.
>>>>
>>>> My personal thoughts here are to remove Spark from the user
>>>> request/response hot path.
>>>>
>>>> I can't tell you how many times I've had to unroll that architecture
>>>> at clients - and replace it with a real database like Cassandra,
>>>> Elasticsearch, HBase, or MySQL.
>>>>
>>>> Unfortunately, Spark - and Spark Streaming, especially - leads you to
>>>> believe that Spark could be used as an application server.  This is not a
>>>> good use case for Spark.
>>>>
>>>> Remember that every job that is launched by Spark requires 1 CPU core,
>>>> some memory, and an available Executor JVM to provide the CPU and memory.
>>>>
>>>> Yes, you can horizontally scale this because of the distributed nature
>>>> of Spark; however, it is not an efficient scaling strategy.
>>>>
>>>> For example, if you're looking to scale out to 1000 concurrent
>>>> requests, this is 1000 concurrent Spark jobs.  This would require a
>>>> cluster with 1000 cores.  This is just not cost effective.
>>>>
>>>> Use Spark for what it's good for - ad-hoc, interactive, and iterative
>>>> (machine learning, graph) analytics.  Use an application server for what
>>>> it's good for - managing a large number of concurrent requests.  And use a
>>>> database for what it's good for - storing/retrieving data.
>>>>
>>>> And any serious production deployment will need failover, throttling,
>>>> back pressure, auto-scaling, and service discovery.
>>>>
>>>> While Spark supports these to varying levels of production-readiness,
>>>> Spark is a batch-oriented system and not meant to be put on the user
>>>> request/response hot path.
>>>>
>>>> For the failover, throttling, back pressure, and auto-scaling that I
>>>> mentioned above, it's worth checking out the Netflix OSS suite -
>>>> particularly Hystrix, Eureka, Zuul, Karyon, etc.:  http://netflix.github.io/
>>>>
>>>> Here's my GitHub project that incorporates a lot of these:
>>>> https://github.com/cfregly/fluxcapacitor
>>>>
>>>> Here's a Netflix Skunkworks GitHub project that packages these up in
>>>> Docker images:  https://github.com/Netflix-Skunkworks/zerotodocker
>>>>
>>>> On Thu, Mar 10, 2016 at 1:40 PM, velvia.github <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I just wrote a blog post which might be really useful to you -- I have
>>>>> just benchmarked being able to achieve 700 queries per second in Spark.
>>>>> So, yes, web speed SQL queries are definitely possible.   Read my new
>>>>> blog post:
>>>>>
>>>>> http://velvia.github.io/Spark-Concurrent-Fast-Queries/
>>>>>
>>>>> and feel free to email me (at [email protected]) if you would like to
>>>>> follow up.
>>>>>
>>>>> -Evan
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Chris Fregly
>>>> Principal Data Solutions Engineer
>>>> IBM Spark Technology Center, San Francisco, CA
>>>> http://spark.tc | http://advancedspark.com
>
