It's always had data locality (since Hadoop support was added in 0.6).

You don't need to specify a partition; you specify the input predicate with 
ConfigHelper or the cassandra.input.predicate property.
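
To illustrate what an input predicate does, here is a minimal Java sketch. It 
is a toy in-memory model, not the Cassandra or Hadoop API: the class 
SlicePredicateSketch and the methods applyPredicate and scan are hypothetical 
names, and a column family is modeled as a plain map of row key to columns. It 
assumes the predicate is a set of column names. The point it tries to show is 
that the predicate trims the columns returned for each row; it does not pick 
which rows or partitions are read.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.Collections;

// Hypothetical toy model; none of these names come from Cassandra or Hadoop.
class SlicePredicateSketch {

    // Keep only the columns whose names are listed in the predicate.
    public static Map<String, String> applyPredicate(Map<String, String> row,
                                                     Set<String> columnNames) {
        Map<String, String> kept = new TreeMap<>();
        for (Map.Entry<String, String> column : row.entrySet()) {
            if (columnNames.contains(column.getKey())) {
                kept.put(column.getKey(), column.getValue());
            }
        }
        return kept;
    }

    // Visit every row key; the predicate only narrows each row's columns.
    public static Map<String, Map<String, String>> scan(
            Map<String, Map<String, String>> columnFamily,
            Set<String> columnNames) {
        Map<String, Map<String, String>> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, String>> row : columnFamily.entrySet()) {
            result.put(row.getKey(), applyPredicate(row.getValue(), columnNames));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> row1 = new TreeMap<>();
        row1.put("name", "alice");
        row1.put("age", "30");
        Map<String, String> row2 = new TreeMap<>();
        row2.put("age", "41");

        Map<String, Map<String, String>> columnFamily = new TreeMap<>();
        columnFamily.put("row1", row1);
        columnFamily.put("row2", row2);

        // No partition is named: both rows are visited, row2 just comes back empty.
        System.out.println(scan(columnFamily, Collections.singleton("name")));
    }
}
```

In this model a row with none of the requested columns still shows up, only 
empty, so any row-level filtering would have to happen in the job itself.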

On Oct 2, 2012, at 2:26 PM, "Hiller, Dean" <dean.hil...@nrel.gov> wrote:

> So you're saying that you can access the primary index with a key range, but 
> to access the secondary index, you first need to get all keys and follow up 
> with a multiget, which would use the secondary index to speed the lookup of 
> the matching rows?
> 
> Yes, that is how I "believe" it works.  I am by no means an expert.
> 
> I also wanted to fire off an MR job to process matching rows in the "virtual" 
> CF, ideally running on the nodes where it reads data in.  In 0.7, I thought 
> the M/R jobs did not run locally with the data like Hadoop does.  Does anyone 
> know if that is still true, or does it run locally to the data now?
> 
> Thanks,
> Dean
> 
> From: Ben Hood <0x6e6...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Tuesday, October 2, 2012 1:01 PM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: 1000's of column families
> 
> Dean,
> 
> On Tuesday, October 2, 2012 at 18:52, Hiller, Dean wrote:
> 
> Because the data for an index is not all together (i.e., you need a multiget 
> to get the data). It is not contiguous.
> 
> With a prefix in a partition, they keep the data together, so all data for a 
> prefix is, from what I understand, contiguous.
> 
> QUESTION: What I don't get in the comment is that I assume you are referring 
> to CQL, in which case we would need to specify the partition (in addition to 
> the index), which means all that data is on one node, correct? Or did I miss 
> something there?
> 
> Maybe my question was just silly - I wasn't referring to CQL.
> 
> As for the locality of the data, I was hoping to be able to fire off an MR 
> job to process all matching rows in the CF - I was assuming that this job 
> would get executed on the same node as the data.
> 
> But I think the real confusion in my question has to do with the way the 
> ColumnFamilyInputFormat has been implemented, since it would appear that it 
> ingests the entire (non-OPP) CF into Hadoop, such that the predicate needs to 
> be applied in the job rather than up front in the Cassandra query.
> 
> Cheers,
> 
> Ben
> 
