This article http://www.virdata.com/tuning-spark/ gives you a pretty good
start on the Spark Streaming side. This article
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
covers the Kafka side; it has a nice explanation of how message size and
partition count affect throughput. And this article
https://www.sigmoid.com/creating-sigview-a-real-time-analytics-dashboard/
walks through a concrete use case.
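For the per-partition question below, here is a rough sketch of the
thread-pool approach in plain Python, outside Spark. The `fetch_document`
and `extract_text` helpers are hypothetical placeholders for your real
fetch and text-extraction steps; inside Spark Streaming, something like
`process_partition` would roughly be the function you hand to
`rdd.mapPartitions(...)`:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_document(record):
    # Hypothetical fetch step: stands in for pulling the raw document
    # referenced by a Kafka record from some external store.
    return "document body for %s" % record

def extract_text(doc):
    # Hypothetical text-extraction step (e.g. Tika in a real pipeline).
    return doc.upper()

def process_partition(records, num_threads=4):
    """Run fetch + extract over one partition's records with a thread pool.

    The I/O-bound fetches overlap across threads within the partition,
    while Spark's own parallelism comes from having many partitions
    spread across executors.
    """
    def pipeline(record):
        return extract_text(fetch_document(record))

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # list() forces completion before the partition iterator is consumed.
        return list(pool.map(pipeline, records))

if __name__ == "__main__":
    print(process_partition(["msg-1", "msg-2", "msg-3"]))
```

The point of the sketch is that the two levels of parallelism compose:
partitions give you cluster-wide parallelism, and a small thread pool per
partition can hide I/O latency in the fetch step.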
Thanks
Best Regards
On Tue, May 12, 2015 at 8:25 PM, dgoldenberg dgoldenberg...@gmail.com
wrote:
Hi,
I'm looking at a data ingestion implementation which streams data out of
Kafka with Spark Streaming, then uses a multi-threaded pipeline engine to
process the data in each partition. Have folks looked at ways of speeding
up this type of ingestion?
Let's say the main part of the ingest process is fetching documents from
somewhere and performing text extraction on them. Is this type of
processing best done by expressing the pipelining with Spark RDD
transformations, or by just kicking off a multi-threaded pipeline?
Or is using a multi-threaded pipeliner per partition a decent strategy,
with the performance coming from running in clustered mode?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-speed-up-data-ingestion-with-Spark-tp22859.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.