Hi,
I am currently using Spark 1.5.2 and have been able to run benchmarks in
Spark (SQL specifically) in single-user mode. For benchmarking with
multiple users, I have tried the following approaches, but each has its
own disadvantage:
1. Start the thrift server in Spark.
    - Execute queries via JDBC from JMeter. (Disadvantage: it is not
possible to execute custom code to load tables as DataFrames.)
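For completeness, the JDBC side of this approach stays thin; a minimal client sketch, assuming the hive-jdbc driver is on the classpath and the server listens on its default port 10000 (host, credentials, and query are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Minimal JDBC client for the Spark thrift server (the same calls a
// JMeter JDBC sampler issues under the hood).
class ThriftJdbcClient {
    // Builds the HiveServer2-style JDBC URL the thrift server accepts.
    static String jdbcUrl(String host, int port) {
        return "jdbc:hive2://" + host + ":" + port + "/default";
    }

    // Runs one query and returns the first column of each row as strings.
    static List<String> runQuery(String url, String sql) throws Exception {
        List<String> rows = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(url, "anonymous", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                rows.add(rs.getString(1));
            }
        }
        return rows;
    }
}
```
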
2. Start a custom thrift server in Spark. The custom server would
create a HiveContext and load all relevant tables as temp tables (as
DataFrames), then start the thrift server via
"HiveThriftServer2.startWithContext(hiveContext);".
    - Execute queries from JMeter via JDBC. (Disadvantage: it
effectively simulates a single user. When multiple threads submit
queries, they are executed serially.)
    - Increasing the number of executors does not solve this problem
either. With more executors, the response times of small queries tend to
be higher across repeated runs (possibly because consecutive executions
land on different executors where the data wasn't cached).
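A sketch of what such a custom server could look like (Spark 1.5 APIs; the parquet path and table name are placeholders, and spark-hive plus spark-hive-thriftserver must be on the classpath):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2;

// Driver that pre-loads benchmark tables as temp tables, then exposes
// them over JDBC via the thrift server.
public class BenchmarkThriftServer {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("benchmark-thrift-server");
        JavaSparkContext sc = new JavaSparkContext(conf);
        HiveContext hiveContext = new HiveContext(sc.sc());

        // Load each benchmark table as a DataFrame and register it as a
        // temp table visible to JDBC clients of this context.
        DataFrame lineitem = hiveContext.read().parquet("/data/lineitem");
        lineitem.registerTempTable("lineitem");

        // Start the thrift server against this context; JMeter can now
        // connect over JDBC and query the temp tables.
        HiveThriftServer2.startWithContext(hiveContext);
    }
}
```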
3. Create multiple SparkContexts when JMeter initializes the benchmark,
i.e., a pool of SparkContexts so that each simulated user can use a
different SparkContext.
    - This runs into SPARK-2243
<https://issues.apache.org/jira/browse/SPARK-2243>, and
"spark.driver.allowMultipleContexts=true" is not helpful in this case.
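For reference, the pattern in question is roughly the following (hypothetical sketch; as noted above, it does not work, since creating a second context in the same JVM trips SPARK-2243):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical per-user context pool. In practice the second
// JavaSparkContext constructor call fails with SPARK-2243, and setting
// spark.driver.allowMultipleContexts=true only suppresses the check
// without making the contexts safe to use concurrently.
public class ContextPool {
    public static JavaSparkContext newUserContext(int userId) {
        SparkConf conf = new SparkConf()
                .setAppName("benchmark-user-" + userId)
                .set("spark.driver.allowMultipleContexts", "true");
        return new JavaSparkContext(conf);
    }
}
```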
4. Another option could be to launch multiple spark-shells to simulate
multiple users, with dynamic resource allocation enabled. I haven't
tried this yet.
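If this route is explored, each simulated user could be launched along these lines (a sketch using the standard dynamic-allocation settings; the external shuffle service must be set up on the cluster's node managers, and the min/max values are placeholders):

```shell
# One driver per simulated user; dynamic allocation lets the shells
# share the cluster's executors instead of statically splitting them.
spark-shell \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=10
```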
Are there any standard approaches for benchmarking with multiple users in
Spark? Any pointers on this would be helpful.
~Rajesh.B