I need a unique, permanent ID assigned to each new item encountered, with the constraint that the ID falls within a fixed range; for the sake of discussion, say one to one million.
I suppose I could assign a range of usable IDs to each reduce task (where the IDs are assigned) and reconcile those ranges somehow at the end of the job, but this seems clunky too.

Since this is on AWS, ZooKeeper is not a good option. I thought it was part of the Hadoop cluster (and thus easy to access), but I guess I was wrong there. I would think that such a service would most logically run on the taskmaster server. I'm surprised this isn't a common issue.

I guess I could launch a separate job that runs such a sequence service. But that's nontrivial in itself, with failure concerns of its own.

Perhaps there's just a better way of thinking about this?

From: Ted Dunning [mailto:[email protected]]
Sent: Saturday, October 27, 2012 12:23 PM
To: [email protected]
Subject: Re: Cluster wide atomic operations

This is better asked on the Zookeeper lists.

The first answer is that global atomic operations are generally a bad idea.

The second answer is that if you can batch these operations up, then you can cut the evilness of global atomicity by a substantial factor.

Are you sure you need a global counter?

On Fri, Oct 26, 2012 at 11:07 PM, David Parks <[email protected]> wrote:

How can we manage cluster-wide atomic operations, such as maintaining an auto-increment counter?

Does Hadoop provide native support for these kinds of operations?

And in case the ultimate answer involves ZooKeeper, I'd love to work out how to do this in AWS/EMR.
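
The two ideas raised in this thread, giving each reduce task its own slice of the ID range and batching requests against a shared counter, can be sketched roughly as below. This is only an illustration under stated assumptions: `reducer_id_range` and `BlockAllocator` are hypothetical names, not Hadoop or ZooKeeper APIs, and the in-process lock merely stands in for whatever coordination service would actually hold the counter. The sketch also says nothing about permanence across jobs, which would require persisting which IDs have already been used.

```python
import threading


def reducer_id_range(task_index, num_tasks, lo=1, hi=1_000_000):
    """Return the inclusive (first, last) ID slice owned by one reduce task.

    Hypothetical helper: slices are disjoint and together cover lo..hi,
    so tasks can assign IDs without talking to each other.
    """
    if not 0 <= task_index < num_tasks:
        raise ValueError("task_index out of range")
    total = hi - lo + 1
    base, extra = divmod(total, num_tasks)
    # The first `extra` tasks each take one additional ID so nothing is left over.
    first = lo + task_index * base + min(task_index, extra)
    size = base + (1 if task_index < extra else 0)
    return first, first + size - 1


class BlockAllocator:
    """Stand-in for a shared counter that hands out IDs in batches.

    Assumption: in a real cluster the lock and counter would live in an
    external service; here they are local so the sketch is runnable.
    """

    def __init__(self, lo=1, hi=1_000_000, block_size=1000):
        self._next = lo
        self._hi = hi
        self._block_size = block_size
        self._lock = threading.Lock()

    def lease(self):
        """Atomically reserve the next block; returns inclusive (first, last)."""
        with self._lock:
            if self._next > self._hi:
                raise RuntimeError("ID range exhausted")
            first = self._next
            last = min(first + self._block_size - 1, self._hi)
            self._next = last + 1
            return first, last
```

With static slices there is no coordination at all, but IDs a task does not use are stranded inside its slice. With leased blocks there is one atomic operation per block rather than one per ID, which is the batching trade-off mentioned above.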
