It has always had data locality (ever since Hadoop support was added in 0.6). You don't need to specify a partition; you specify the input predicate with ConfigHelper or the cassandra.input.predicate property.

On Oct 2, 2012, at 2:26 PM, "Hiller, Dean" <dean.hil...@nrel.gov> wrote:

> So you're saying that you can access the primary index with a key range, but
> to access the secondary index, you first need to get all keys and follow up
> with a multiget, which would use the secondary index to speed the lookup of
> the matching rows?
>
> Yes, that is how I "believe" it works. I am by no means an expert.
>
> I also wanted to fire off a MR to process matching rows in the "virtual" CF
> ideally running on the nodes where it reads data in. In 0.7, I thought the
> M/R jobs did not run locally with the data like hadoop does??? Anyone know
> if that is still true or does it run locally to the data now?
>
> Thanks,
> Dean
>
> From: Ben Hood <0x6e6...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Tuesday, October 2, 2012 1:01 PM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: 1000's of column families
>
> Dean,
>
> On Tuesday, October 2, 2012 at 18:52, Hiller, Dean wrote:
>
> Because the data for an index is not all together(ie. Need a multi get to get
> the data). It is not contiguous.
>
> The prefix in a partition they keep the data so all data for a prefix from
> what I understand is contiguous.
>
> QUESTION: What I don't get in the comment is I assume you are referring to
> CQL in which case we would need to specify the partition (in addition to the
> index)which means all that data is on one node, correct? Or did I miss
> something there.
>
> Maybe my question was just silly - I wasn't referring to CQL.
>
> As for the locality of the data, I was hoping to be able to fire off an MR
> job to process all matching rows in the CF - I was assuming that that this
> job would get executed on the same node as the data.
>
> But I think the real confusion in my question has to do with the way the
> ColumnFamilyInputFormat has been implemented, since it would appear that it
> ingests the entire (non-OPP) CF into Hadoop, such that the predicate needs to
> be applied in the job rather than up front in the Cassandra query.
>
> Cheers,
>
> Ben