Re: How to speed up data ingestion with Spark

2015-05-12 Thread Akhil Das
This article http://www.virdata.com/tuning-spark/ gives you a pretty good
start on the Spark Streaming side. And this article
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
covers the Kafka side; it has a nice explanation of how message size and
partition count affect throughput.
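For example, a minimal direct-stream setup touching the knobs those two
articles discuss might look like the sketch below (the app name, brokers,
topic, batch interval, and rate limit are all placeholders to tune for your
workload, not values from your setup):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// All names and values here are placeholders to tune for your workload.
val conf = new SparkConf()
  .setAppName("IngestionApp")
  // Cap the per-partition receive rate so each batch can finish
  // within the batch interval.
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")
val ssc = new StreamingContext(conf, Seconds(5))

// The direct stream maps one Spark partition to each Kafka partition,
// so adding Kafka partitions directly raises ingestion parallelism.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("documents"))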
And this article
https://www.sigmoid.com/creating-sigview-a-real-time-analytics-dashboard/
describes a use case.
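
On the RDD-transformations-vs-threads question, a rough sketch of the
mapPartitions route, continuing from the stream above (fetchDocument and
extractText are hypothetical stand-ins for your fetch and text-extraction
steps, not anything from your code):

// Hypothetical stand-ins for the real fetch and text-extraction steps.
def fetchDocument(id: String): String = "contents of " + id
def extractText(doc: String): String = doc.trim

stream.foreachRDD { rdd =>
  rdd.mapPartitions { records =>
    // One-time per-partition setup (HTTP client, parser instances, etc.)
    // would go here, amortized over every record in the partition.
    records.map { case (_, docId) => extractText(fetchDocument(docId)) }
  }.foreachPartition { texts =>
    texts.foreach(println) // stand-in for your real sink
  }
}

ssc.start()
ssc.awaitTermination()

Since the direct stream gives you one task per Kafka partition running in
parallel across the executors, the transformation route usually provides
cluster-level parallelism on its own; an explicit thread pool inside each
partition mainly pays off when individual records block on I/O, such as the
document fetch, and rdd.repartition(n) is the usual first knob if you need
more parallelism than your Kafka partition count allows.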

Thanks
Best Regards

On Tue, May 12, 2015 at 8:25 PM, dgoldenberg dgoldenberg...@gmail.com
wrote:

 Hi,

 I'm looking at a data ingestion implementation which streams data out of
 Kafka with Spark Streaming, then uses a multi-threaded pipeline engine to
 process the data in each partition.  Have folks looked at ways of speeding
 up this type of ingestion?

 Let's say the main part of the ingest process is fetching documents from
 somewhere and performing text extraction on them. Is this type of processing
 best done by expressing the pipelining with Spark RDD transformations or by
 just kicking off a multi-threaded pipeline?

 Or is using a multi-threaded pipeliner per partition a decent strategy, with
 the performance coming from running in cluster mode?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-speed-up-data-ingestion-with-Spark-tp22859.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
