Based on my understanding, something like this doesn't seem to be possible out of the box, but I thought I would write it up anyway in case someone has any ideas.
We have, conceptually, one high-volume input stream; each streaming job is interested in either a subset of the stream or the entire stream. We would like to get to the point where we could run a large number of streaming jobs concurrently across a cluster.

Putting a buffer like Kafka between the source and the streaming jobs does not seem like a sustainable route, as every job would consume the whole stream while the majority of them are only interested in a small subset of the available data. Since we don't know up front exactly what data each job will be interested in, it would be difficult to separate/partition Kafka topics in advance.

What I think we want is a way to have one streaming job whose output is passed to n other streaming jobs. Has anyone thought about implementing something like this? I don't see a way for the Receiver of one streaming job to be the output of another streaming job.
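To make the fan-out idea concrete, here is a minimal, dependency-free sketch (plain Python, no Spark or Kafka) of the routing a single front "dispatcher" job could do: it consumes the full stream once and forwards each record only to the per-job topics whose subscription predicate matches it, so downstream jobs subscribe at runtime and only ever read their own subset. The `Dispatcher` class, topic names, and predicates are hypothetical illustrations, not an existing API.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, str]

class Dispatcher:
    """Hypothetical single front job that fans one stream out to n topics."""

    def __init__(self) -> None:
        # topic name -> predicate selecting the subset that job wants
        self.subscriptions: Dict[str, Callable[[Record], bool]] = {}

    def subscribe(self, topic: str, predicate: Callable[[Record], bool]) -> None:
        # Interests are registered at runtime, so the subsets need not be
        # known when the topics are laid out.
        self.subscriptions[topic] = predicate

    def route(self, record: Record) -> List[str]:
        # All per-job topics this record should be forwarded to.
        return [t for t, p in self.subscriptions.items() if p(record)]

    def dispatch(self, stream: Iterable[Record]) -> List[Tuple[str, Record]]:
        # One pass over the full stream; each downstream job then reads
        # only its own topic instead of re-consuming everything.
        return [(t, r) for r in stream for t in self.route(r)]

d = Dispatcher()
d.subscribe("job-fx", lambda r: r["asset"] == "fx")   # wants a subset
d.subscribe("job-all", lambda r: True)                # wants everything

out = d.dispatch([{"asset": "fx"}, {"asset": "equity"}])
```

In a real deployment the `dispatch` output would be writes to per-job Kafka topics (or some other queue), and each downstream streaming job would attach an ordinary Kafka receiver to its own topic; the expensive "everyone reads everything" step happens exactly once, in the dispatcher.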