That's a very helpful discussion. Thank you.
I'd like to go with assigning blocks of IDs to each reducer. Snowflake would require external changes that are a pain; I'd rather make my job fit our current constraints.

Is there a way to get an index number for each reducer, so that I can identify which block of IDs to assign to each one?

Thanks,
David

From: Ted Dunning [mailto:[email protected]]
Sent: Monday, October 29, 2012 12:58 PM
To: [email protected]
Subject: Re: Cluster wide atomic operations

On Sun, Oct 28, 2012 at 9:15 PM, David Parks <[email protected]> wrote:

> I need a unique & permanent ID assigned to each new item encountered, which has a constraint that it is in the range of, let's say for simple discussion, one to one million.

Having such a limited range may require that you have a central service to generate IDs. The use of a central service can be disastrous for throughput.

> I suppose I could assign a range of usable IDs to each reduce task (where IDs are assigned) and keep those organized somehow at the end of the job, but this seems clunky too.

Yes. Much better.

> Since this is on AWS, zookeeper is not a good option. I thought it was part of the hadoop cluster (and thus easy to access), but guess I was wrong there.

No. This is specifically not part of Hadoop for performance reasons.

> I would think that such a service would run most logically on the taskmaster server. I'm surprised this isn't a common issue. I guess I could launch a separate job that runs such a sequence service perhaps. But that's nontrivial itself, with failure concerns.

The problem is that a serial number service is a major loss of performance in a parallel system. Unless you relax the idea considerably (by allowing blocks, or having lots of bits like Snowflake), you wind up with a round-trip per ID and a critical section on the ID generator. This is bad. Look up Amdahl's Law.

> Perhaps there's just a better way of thinking of this?

Yes.
Use lots of bits and be satisfied with uniqueness rather than perfect ordering and limited range. As the other respondent said, look up Snowflake.
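
For the block-per-reducer approach discussed above, here is a rough sketch of how a fixed ID range might be carved up by reducer index. The class `IdBlocks` and the method `blockFor` are made-up names for illustration, not part of Hadoop. The sketch assumes the 0-based reducer index and the reducer count are already known. As far as I recall, in the `org.apache.hadoop.mapreduce` API they can be read in `Reducer.setup()` via `context.getTaskAttemptID().getTaskID().getId()` and `context.getNumReduceTasks()`, but please check that against your Hadoop version.

```java
// Sketch only: IdBlocks and blockFor are hypothetical names, not Hadoop API.
final class IdBlocks {
    private IdBlocks() {}

    /**
     * Returns {firstId, lastId} (both inclusive) for one reducer.
     * The range [minId, maxId] is split into numReducers contiguous blocks
     * whose sizes differ by at most one. reducerIndex is 0-based.
     */
    static long[] blockFor(int reducerIndex, int numReducers, long minId, long maxId) {
        if (numReducers <= 0 || reducerIndex < 0 || reducerIndex >= numReducers) {
            throw new IllegalArgumentException("bad reducer index or count");
        }
        long total = maxId - minId + 1;
        if (total < numReducers) {
            throw new IllegalArgumentException("fewer IDs than reducers");
        }
        long base = total / numReducers;   // minimum block size
        long extra = total % numReducers;  // the first 'extra' blocks get one more ID
        long first = minId + reducerIndex * base + Math.min(reducerIndex, extra);
        long size = base + (reducerIndex < extra ? 1 : 0);
        return new long[] {first, first + size - 1};
    }
}
```

With four reducers over the range one to one million, reducer 0 would get 1 to 250,000 and reducer 3 would get 750,001 to 1,000,000. Because the block depends on the task index rather than the attempt, a retried attempt of the same reduce task should land on the same block. Each reducer still has to track how much of its block it has used, and one skewed reducer can run out while others have IDs to spare.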
