It’s really hard to answer this, as the comparison is not really fair – Storm
is much lower level than Spark and has less overhead when dealing with
stateless operations. I’d be curious how your colleague is implementing the
average on a “batch”, and what the Storm equivalent of a batch is.
That aside, the Storm implementation seems in the ballpark (we were clocking
~80K events/sec/node in a well-tuned Storm app).
For a Spark app, it really depends on the lineage, checkpointing, level of
parallelism, partitioning, number of tasks, scheduling overhead, etc. – unless
we confirm that the job operates at full capacity, I can’t say for sure whether
it’s good or bad.
Some questions to isolate the issue:
* Can you try a manual average operation? (e.g. reduce the events to a tuple
with the count and the sum, then divide them)
  * Looking at the DoubleRDD.mean implementation, the stats module computes a
lot of things on the side (count, variance, min, max), not just the mean
  * Not sure if it matters, but I’m assuming you’re not doing that on the
Storm side
* Can you confirm that the operation in question is indeed computed in
parallel? (it should have 48 tasks)
  * If yes, how many records per second do you get on average, for this
operation only? You can find this out in the Spark UI by dividing the number of
records allocated to one of the executors by the total task time for that
executor
* Are the executors well balanced? Is each of them processing 16 partitions
in approximately equal time? If not, you can have multiple issues here:
  * Kafka cluster imbalance – happens from time to time; you can monitor
this from the command line with kafka-topics.sh --describe
  * Kafka key partitioning scheme – assuming round-robin distribution,
just checking – you should see this in the 1st stage; all the executors
should have been allocated an equal number of tasks/partitions to process,
with an equal number of messages
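As a concrete sketch of the manual average suggested above (a minimal,
Spark-free illustration with made-up values; on an RDD the same shape would be
a map to (value, 1) pairs followed by a reduce):

```scala
// Minimal sketch of a manual mean: fold to a (sum, count) tuple, then divide.
// On a Spark RDD the equivalent shape would be roughly:
//   rdd.map(v => (v, 1L)).reduce { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
// which avoids the extra statistics DoubleRDD.mean computes on the side.
val values = Seq(2.0, 4.0, 6.0, 8.0) // placeholder data
val (sum, count) = values.foldLeft((0.0, 0L)) {
  case ((s, c), v) => (s + v, c + 1)
}
val mean = sum / count
println(mean) // 5.0
```

If this runs noticeably faster than DoubleRDD.mean on the same batch, that
narrows the problem down to the extra work done by the stats computation.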
Given that you have 7 Kafka brokers and 3 Spark executors, you should try a
.repartition(48) on the Kafka DStream – at the one-time cost of a shuffle you
redistribute the data evenly across all nodes/cores and avoid most of the
issues above; at least you get a guarantee that the load is spread evenly
across the cluster.
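For reference, the repartition would slot in right after the direct stream is
created – a sketch only, not tested against your setup; the app name, broker
list, and topic name below are placeholders, and it assumes the
spark-streaming-kafka artifact on the classpath as in Spark 1.x:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("avg-throughput-sketch") // placeholder name
val ssc = new StreamingContext(conf, Seconds(2))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder brokers
val topics = Set("events")                                      // placeholder topic

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

// One shuffle per batch, but the records from the 48 Kafka partitions are then
// spread evenly across all 3 executors / 48 cores before the heavy work runs.
val evenlySpread = stream.map(_._2).repartition(48)
```

With 48 partitions and 48 cores, each core ends up with exactly one partition
per batch regardless of how the Kafka side is balanced.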
Hope this helps,
-adrian
From: "Young, Matthew T"
Date: Thursday, September 24, 2015 at 11:47 PM
To: "user@spark.apache.org"
Cc: "Krumm, Greg"
Subject: Reasonable performance numbers?
Hello,
I am doing performance testing with Spark Streaming. I want to know if the
throughput numbers I am encountering are reasonable for the power of my cluster
and Spark’s performance characteristics.
My job has the following processing steps:
1. Read 600-byte JSON strings from a 7-broker / 48-partition Kafka cluster
via the Kafka Direct API
2. Parse the JSON with play-json or lift-json (no significant performance
difference)
3. Read one integer value out of the JSON
4. Compute the average of this integer value across all records in the
batch with DoubleRDD.mean
5. Write the average for the batch back to a different Kafka topic
I have tried 2, 4, and 10 second batch intervals. The best throughput I can
sustain is about 75,000 records/second for the whole cluster.
The Spark cluster is in a VM environment with 3 VMs. Each VM has 32 GB of RAM
and 16 cores. The systems are networked with 10 Gb NICs. I started testing with
Spark 1.3.1 and switched to Spark 1.5 to see if there was improvement (none
significant). When I look at the event timeline in the WebUI I see that the
majority of the processing time for each batch is “Executor Computing Time” in
the foreachRDD that computes the average, not the transform that does the JSON
parsing.
CPU util hovers around 40% across the cluster, and RAM has plenty of free space
remaining as well. Network comes nowhere close to being saturated.
My colleague implementing similar functionality in Storm is able to exceed a
quarter million records per second with the same hardware.
Is 75K records/second reasonable for a cluster of this size? What kind of
performance would you expect for this job?
Thanks,
-- Matthew