Well, there is *some* data locality; it's just not guaranteed. My
understanding (and someone correct me if I'm wrong) is that
ColumnFamilyInputFormat hands out ColumnFamilySplit objects, which extend
InputSplit and implement the getLocations() method.

http://hadoop.apache.org/docs/mapreduce/current/api/org/apache/hadoop/mapreduce/InputSplit.html

ColumnFamilySplit.java contains logic that does its best to work out which
node holds the data for a given mapper.

Obviously, though, there is no guarantee that all of the data will be on
that node.
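
To make the "hint, not guarantee" point concrete, here is a rough sketch in
plain Java. It is not the Hadoop or Cassandra code: SketchSplit,
LocalitySketch and pickWorker are names I made up for illustration, and the
only idea borrowed from the real API is that a split exposes a
getLocations() list the scheduler may or may not be able to honor.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a Hadoop InputSplit: it only carries location hints.
final class SketchSplit {
    private final String[] locations;

    SketchSplit(String... locations) {
        this.locations = locations.clone();
    }

    // Mirrors the idea behind InputSplit.getLocations(): hosts believed to hold the data.
    String[] getLocations() {
        return locations.clone();
    }
}

final class LocalitySketch {
    // Prefer a free worker that is also a replica; otherwise fall back to any free worker.
    static String pickWorker(SketchSplit split, List<String> freeWorkers) {
        if (freeWorkers.isEmpty()) {
            throw new IllegalArgumentException("no free workers");
        }
        List<String> replicas = Arrays.asList(split.getLocations());
        for (String worker : freeWorkers) {
            if (replicas.contains(worker)) {
                return worker; // data-local: the mapper runs next to a replica
            }
        }
        return freeWorkers.get(0); // not local: the mapper reads over the network
    }

    public static void main(String[] args) {
        SketchSplit split = new SketchSplit("node1", "node2", "node3");
        System.out.println(pickWorker(split, Arrays.asList("node4", "node2")));
        System.out.println(pickWorker(split, Arrays.asList("node4", "node5")));
    }
}
```

Running main prints node2 (a replica was free, so the task is local) and
then node4 (no replica was free, so the hint is simply ignored).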

Also, for the sake of completeness, we have RF=3 on the Keyspace in
question.
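
For what it's worth, the arithmetic behind the QUORUM question looks like
this. It is a back-of-the-envelope sketch, not Cassandra code: QuorumSketch
and its methods are hypothetical names, and the only fact it relies on is
that a quorum is floor(RF / 2) + 1 replicas.

```java
final class QuorumSketch {
    // Quorum size for a given replication factor: floor(RF / 2) + 1.
    static int quorum(int replicationFactor) {
        if (replicationFactor < 1) {
            throw new IllegalArgumentException("replication factor must be >= 1");
        }
        return replicationFactor / 2 + 1; // integer division acts as floor here
    }

    // Replicas that must answer over the network for a QUORUM read.
    // If the mapper's node holds a replica, one of the answers can be local.
    static int remoteReplicasNeeded(int replicationFactor, boolean localNodeIsReplica) {
        int needed = quorum(replicationFactor);
        return localNodeIsReplica ? needed - 1 : needed;
    }

    public static void main(String[] args) {
        System.out.println(quorum(3));
        System.out.println(remoteReplicasNeeded(3, true));
        System.out.println(remoteReplicasNeeded(3, false));
    }
}
```

With RF=3 this prints 2, 1 and 2: a QUORUM read needs two replicas, so even
a mapper that lands on a replica still waits on one remote node, and a
mapper that lands elsewhere waits on two.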

On 10/18/12 1:15 PM, "Andrey Ilinykh" <ailin...@gmail.com> wrote:

>On Thu, Oct 18, 2012 at 12:00 PM, Michael Kjellman
><mkjell...@barracuda.com> wrote:
>> Unless you have Brisk (however as far as I know there was one fork that
>> got it working on 1.0 but nothing for 1.1 and is not being actively
>> maintained by Datastax) or go with CFS (which comes with DSE) you are not
>> guaranteed all data is on that hadoop node. You can take a look at the
>> forks if interested here: https://github.com/riptano/brisk/network but I'd
>> personally be afraid to put my eggs in a basket that is certainly not
>> super supported anymore.
>>
>> job.getConfiguration().set("cassandra.consistencylevel.read", "QUORUM");
>> should get you started.
>This is what I don't understand. With QUORUM you read data from at
>least two nodes. If so, you don't benefit from data locality. What's
>the point to use hadoop? I can run application on any machine(s) and
>iterate through column family. What is the difference?
>
>Thank you,
>  Andrey

