Ah, got it. To actually answer your question: the replica assignment tool allows you to assign partitions to specific brokers. Since consumers always read from the leader replica, you can mark a specific replica as the "preferred replica" and it will become the leader whenever it is available.
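For illustration only (not from your setup): a minimal reassignment file might look like the sketch below, assuming a topic named `file-paths` and broker ids 1-3. The first broker listed in each `replicas` array is the preferred replica for that partition:

```json
{
  "version": 1,
  "partitions": [
    {"topic": "file-paths", "partition": 0, "replicas": [1, 2]},
    {"topic": "file-paths", "partition": 1, "replicas": [2, 3]}
  ]
}
```

You'd apply it with `bin/kafka-reassign-partitions.sh --execute`, and `bin/kafka-preferred-replica-election.sh` can later move leadership back to the first-listed replica if it has failed over.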
You'll need to get each SparkStreaming worker to consume from a specific replica, though... and then make sure the files are available on those specific nodes. I'm not sure how feasible this is.

Are you planning on running HDFS, Kafka and SparkStreaming on the same nodes? HDFS and Kafka may end up contending for disk access. If it were my system, I probably wouldn't worry about message locality between Kafka and SparkStreaming. Your messages contain just URLs and will probably be too small to create significant network contention (which is what locality is meant to avoid).

Gwen

On Mon, Mar 16, 2015 at 1:35 PM, Daniel Haviv <[email protected]> wrote:
> Let's say I have 3 different types of algorithms that are implemented by
> three streaming apps (on Yarn).
> They are completely separate, meaning that they can run in parallel on the
> same data and not sequentially.
>
> *Using Kafka:* File X is loaded into HDFS and I want Algorithms A and B to
> process it, so I push its path to TopicA and TopicB, and
> streamingAppA and streamingAppB will be able to consume simultaneously.
>
> *Using Directory Listener:* File X is loaded into HDFS. Algorithm A will
> listen on directoryA, B on directoryB, C on directoryC.
> For the file to be processed I will have to copy the file both to
> directoryA and to directoryB, and only then will the streaming apps
> start processing it.
>
> Daniel
>
> On Mon, Mar 16, 2015 at 10:25 PM, Gwen Shapira <[email protected]> wrote:
>> Probably off-topic for the Kafka list, but why do you think you need
>> multiple copies of the file to parallelize access?
>> You'll have parallel access based on how many containers you have on
>> the machine (if you are using YARN-Spark).
>>
>> On Mon, Mar 16, 2015 at 1:20 PM, Daniel Haviv <[email protected]> wrote:
>>> Hi,
>>> The reason we want to use this method is that this way a file can be
>>> consumed by different streaming apps simultaneously (they just consume
>>> its path from Kafka and open it locally).
>>>
>>> With fileStream, to parallelize the processing of a specific file I
>>> would have to make several copies of it, which is wasteful in terms of
>>> space and time.
>>>
>>> Thanks,
>>> Daniel
>>>
>>>> On 16 March 2015, at 22:12, Gwen Shapira <[email protected]> wrote:
>>>>
>>>> Any reason not to use SparkStreaming directly with HDFS files, so
>>>> you'll get locality guarantees from the Hadoop framework?
>>>> StreamingContext has a textFileStream() method you could use for this.
>>>>
>>>> On Mon, Mar 16, 2015 at 12:46 PM, Daniel Haviv <[email protected]> wrote:
>>>>> Hi,
>>>>> Is it possible to assign specific partitions to specific nodes?
>>>>> I want to upload files to HDFS, find out on which nodes the file
>>>>> resides, and then push their path into a topic and partition it by
>>>>> nodes.
>>>>> This way I can ensure that the consumer (Spark Streaming) will
>>>>> consume both the message and the file locally.
>>>>>
>>>>> Can this be achieved?
>>>>>
>>>>> Thanks,
>>>>> Daniel
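As a hypothetical sketch of the partition-by-node idea from the first message (node names, paths, and the block-location lookup are all made up for illustration; a real producer would ask the HDFS NameNode for the hosting nodes, e.g. via `FileSystem.getFileBlockLocations`):

```python
# Partition-by-node sketch: dedicate one Kafka partition per data node,
# key every file path to the partition of the node that hosts its blocks,
# and let the Spark worker co-located with that node consume only its
# own partition -- so it opens the file locally.

# One partition per data node: partition i is consumed by the worker
# running on NODES[i]. (Hypothetical cluster layout.)
NODES = ["node-a", "node-b", "node-c"]

# Stand-in for the NameNode block-location lookup: which node holds
# (most of) each file. In reality this comes from HDFS, not a dict.
BLOCK_LOCATIONS = {
    "/data/x.parquet": "node-b",
    "/data/y.parquet": "node-a",
}

def partition_for(path: str) -> int:
    """Return the partition whose consumer is local to the file's blocks."""
    node = BLOCK_LOCATIONS[path]
    return NODES.index(node)

if __name__ == "__main__":
    for path in BLOCK_LOCATIONS:
        print(path, "-> partition", partition_for(path))
```

The producer would then send each path with `partition_for(path)` as the explicit partition, and each streaming app would assign its workers to the partition matching their host.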
