Ah, got it. To actually answer your question: the replica assignment tool allows you to assign partitions to specific brokers. Since consumers always read from the leader replica, you can mark a specific replica as the "preferred replica" and it will become the leader whenever it is available.
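For illustration only (not from your setup): a minimal reassignment file might look like the sketch below, assuming a topic named `file-paths` and broker ids 1-3. The first broker listed in each `replicas` array is the preferred replica for that partition:

```json
{
  "version": 1,
  "partitions": [
    {"topic": "file-paths", "partition": 0, "replicas": [1, 2]},
    {"topic": "file-paths", "partition": 1, "replicas": [2, 3]}
  ]
}
```

You'd apply it with `bin/kafka-reassign-partitions.sh --execute`, and `bin/kafka-preferred-replica-election.sh` can later move leadership back to the first-listed replica if it has failed over.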
You'll need to get each SparkStreaming worker to consume from a specific replica, though... and then make sure the files are available on those specific nodes. I'm not sure how feasible this is.

Are you planning on running HDFS, Kafka and SparkStreaming on the same nodes? HDFS and Kafka may end up contending for disk access. If it were my system, I probably wouldn't worry about message locality between Kafka and SparkStreaming. Your messages contain just URLs and will probably be too small to create significant network contention (which is what locality is meant to avoid).

Gwen

On Mon, Mar 16, 2015 at 1:35 PM, Daniel Haviv <[email protected]> wrote:
> Let's say I have 3 different types of algorithms that are implemented by
> three streaming apps (on Yarn).
> They are completely separate, meaning that they can run in parallel on the
> same data and not sequentially.
>
> *Using Kafka:* File X is loaded into HDFS and I want Algorithms A and B to
> process it, so I push its path to TopicA and TopicB, and
> streamingAppA and streamingAppB will be able to consume simultaneously.
>
> *Using Directory Listener:* File X is loaded into HDFS. Algorithm A will
> listen on directoryA, B on directoryB, C on directoryC.
> For the file to be processed I will have to copy the file both to
> directoryA and to directoryB, and only then will the streaming apps
> start processing it.
>
> Daniel
>
> On Mon, Mar 16, 2015 at 10:25 PM, Gwen Shapira <[email protected]> wrote:
>> Probably off-topic for the Kafka list, but why do you think you need
>> multiple copies of the file to parallelize access?
>> You'll have parallel access based on how many containers you have on
>> the machine (if you are using YARN-Spark).
>>
>> On Mon, Mar 16, 2015 at 1:20 PM, Daniel Haviv <[email protected]> wrote:
>>> Hi,
>>> The reason we want to use this method is that this way a file can be
>>> consumed by different streaming apps simultaneously (they just consume
>>> its path from Kafka and open it locally).
>>>
>>> With fileStream, to parallelize the processing of a specific file I
>>> would have to make several copies of it, which is wasteful in terms of
>>> space and time.
>>>
>>> Thanks,
>>> Daniel
>>>
>>>> On 16 March 2015, at 22:12, Gwen Shapira <[email protected]> wrote:
>>>>
>>>> Any reason not to use SparkStreaming directly with HDFS files, so
>>>> you'll get locality guarantees from the Hadoop framework?
>>>> StreamingContext has a textFileStream() method you could use for this.
>>>>
>>>> On Mon, Mar 16, 2015 at 12:46 PM, Daniel Haviv <[email protected]> wrote:
>>>>> Hi,
>>>>> Is it possible to assign specific partitions to specific nodes?
>>>>> I want to upload files to HDFS, find out on which nodes the file
>>>>> resides, and then push their path into a topic and partition it by
>>>>> nodes.
>>>>> This way I can ensure that the consumer (Spark Streaming) will
>>>>> consume both the message and the file locally.
>>>>>
>>>>> Can this be achieved?
>>>>>
>>>>> Thanks,
>>>>> Daniel
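As a hypothetical sketch of the partition-by-node idea from the first message (node names, paths, and the block-location lookup are all made up for illustration; a real producer would ask the HDFS NameNode for the hosting nodes, e.g. via `FileSystem.getFileBlockLocations`):

```python
# Partition-by-node sketch: dedicate one Kafka partition per data node,
# key every file path to the partition of the node that hosts its blocks,
# and let the Spark worker co-located with that node consume only its
# own partition -- so it opens the file locally.

# One partition per data node: partition i is consumed by the worker
# running on NODES[i]. (Hypothetical cluster layout.)
NODES = ["node-a", "node-b", "node-c"]

# Stand-in for the NameNode block-location lookup: which node holds
# (most of) each file. In reality this comes from HDFS, not a dict.
BLOCK_LOCATIONS = {
    "/data/x.parquet": "node-b",
    "/data/y.parquet": "node-a",
}

def partition_for(path: str) -> int:
    """Return the partition whose consumer is local to the file's blocks."""
    node = BLOCK_LOCATIONS[path]
    return NODES.index(node)

if __name__ == "__main__":
    for path in BLOCK_LOCATIONS:
        print(path, "-> partition", partition_for(path))
```

The producer would then send each path with `partition_for(path)` as the explicit partition, and each streaming app would assign its workers to the partition matching their host.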
