Based on my understanding, something like this doesn't seem to be possible out of the box, but I thought I would write it up anyway in case someone has any ideas.
We have, conceptually, one high-volume input stream; each streaming job is interested in either a subset of the stream or the entire stream. We would like to get to the point where we could run a large number of streaming jobs concurrently across a cluster.

Putting a buffer like Kafka between the source and the streaming jobs does not seem like a sustainable route, as every job would consume the whole stream while the majority of them are only interested in a small subset of the available data. Since we don't know up front exactly what data each job will be interested in, it would be difficult to separate/partition Kafka topics in advance.

What I think we want is a way to have one streaming job whose output is passed to n other streaming jobs. Has anyone thought about implementing something like this? I don't see a way for the Receiver of one streaming job to be the output of another streaming job.
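To make the fan-out idea concrete, here is a minimal, dependency-free sketch (plain Python, no Spark or Kafka) of the routing a single front "dispatcher" job could do: it consumes the full stream once and forwards each record only to the per-job topics whose subscription predicate matches it, so downstream jobs subscribe at runtime and only ever read their own subset. The `Dispatcher` class, topic names, and predicates are hypothetical illustrations, not an existing API.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, str]

class Dispatcher:
    """Hypothetical single front job that fans one stream out to n topics."""

    def __init__(self) -> None:
        # topic name -> predicate selecting the subset that job wants
        self.subscriptions: Dict[str, Callable[[Record], bool]] = {}

    def subscribe(self, topic: str, predicate: Callable[[Record], bool]) -> None:
        # Interests are registered at runtime, so the subsets need not be
        # known when the topics are laid out.
        self.subscriptions[topic] = predicate

    def route(self, record: Record) -> List[str]:
        # All per-job topics this record should be forwarded to.
        return [t for t, p in self.subscriptions.items() if p(record)]

    def dispatch(self, stream: Iterable[Record]) -> List[Tuple[str, Record]]:
        # One pass over the full stream; each downstream job then reads
        # only its own topic instead of re-consuming everything.
        return [(t, r) for r in stream for t in self.route(r)]

d = Dispatcher()
d.subscribe("job-fx", lambda r: r["asset"] == "fx")   # wants a subset
d.subscribe("job-all", lambda r: True)                # wants everything

out = d.dispatch([{"asset": "fx"}, {"asset": "equity"}])
```

In a real deployment the `dispatch` output would be writes to per-job Kafka topics (or some other queue), and each downstream streaming job would attach an ordinary Kafka receiver to its own topic; the expensive "everyone reads everything" step happens exactly once, in the dispatcher.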