For background, you may find the wide row setting useful:
http://www.datastax.com/docs/1.1/cluster_architecture/hadoop_integration

AFAIK all the Hadoop input row readers do range scans, and I think support 
for setting the start and end token exists so that jobs only select data 
that is local to the node. It's not really possible to select individual 
rows by token.
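
To illustrate, here's a rough, untested sketch of the token-range overload. 
The class name, job name, and token values are all made up (the tokens are 
RandomPartitioner-style examples); in practice they would come from the ring 
(e.g. nodetool ring), not from row keys:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

import org.apache.cassandra.hadoop.ConfigHelper;

public class LocalRangeJob
{
    public static void main(String[] args) throws Exception
    {
        Job job = new Job(new Configuration(), "local-range-scan");

        // Hypothetical ring tokens bounding the range one node owns; a job
        // configured this way only reads the splits inside that range.
        String startToken = "85070591730234615865843651857942052864";
        String endToken   = "113427455640312821154458202477256070485";
        ConfigHelper.setInputRange(job.getConfiguration(), startToken, endToken);
    }
}

So the start/end tokens bound which part of the ring a job reads; they are 
not a way to pick out a single key.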

If you had a secondary index on a column you could use the setInputRange 
overload that takes index expressions.
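
Roughly like this (an untested sketch: the "bucket" column name and its 
secondary index are made up, and I'm assuming the 1.1 ConfigHelper / thrift 
API):

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.IndexExpression;
import org.apache.cassandra.thrift.IndexOperator;
import org.apache.cassandra.utils.ByteBufferUtil;

public class IndexedInputJob
{
    public static void main(String[] args) throws Exception
    {
        Job job = new Job(new Configuration(), "hour-bucket-scan");

        // Secondary index queries need at least one EQ clause; this matches
        // rows whose indexed "bucket" column equals the hour we care about.
        IndexExpression hourEquals = new IndexExpression(
                ByteBufferUtil.bytes("bucket"),     // indexed column name
                IndexOperator.EQ,
                ByteBufferUtil.bytes(1353456000));  // hour bucket value
        ConfigHelper.setInputRange(job.getConfiguration(),
                                   Arrays.asList(hourEquals));
    }
}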

Or it may be easier to use Hive.

Hope that helps. 
 
-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 1/12/2012, at 3:04 PM, Jamie Rothfeder <[email protected]> wrote:

> Hey All,
> 
> I have a bunch of time-series data stored in a cluster using a 
> ByteOrderedPartitioner. My keys are time buckets representing events that 
> occurred in an hour. I've been trying to write a MapReduce job that considers 
> only events within a certain time range by specifying an input range, but 
> this doesn't seem to be working.
> 
> I expect the following code to scan data for a single key (1353456000), but 
> it is scanning all keys.
> 
> int key = 1353456000;
> IPartitioner part = ConfigHelper.getInputPartitioner(job.getConfiguration());
> Token token = part.getToken(ByteBufferUtil.bytes(key));
> ConfigHelper.setInputRange(job.getConfiguration(),
>                            part.getTokenFactory().toString(token),
>                            part.getTokenFactory().toString(token));
> 
> Any idea what I'm doing wrong?
> 
> Thanks,
> Jamie
