OK, false alarm - the problem was a misconfiguration in our code that caused multiple processes to update that znode when only one should have.
Apologies for wasting your time.

Ishaaq

On 11 October 2011 13:09, Ishaaq Chandy <ish...@gmail.com> wrote:

> Technically we don't need the contents as we're going to overwrite it
> anyway, we're just asserting the fact that we're the only one writing to
> that node.
>
> Was just checking if it is a known issue - clearly not, so I'll continue
> investigating our code.
>
> Thanks,
> Ishaaq
>
>
> On 11 October 2011 12:21, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> Why do you get the version in the first place without getting the
>> contents?
>>
>> If you don't have the contents, what is the point of enforcing a version?
>>
>> On Mon, Oct 10, 2011 at 8:26 AM, Ishaaq Chandy <ish...@gmail.com> wrote:
>>
>> > Thanks Mahadev,
>> > Yup, I am aware of the fact that 2 is a particularly bad number for
>> > cluster size and hopefully we should fix that soon, I was just hoping
>> > that for some reason that was why the problem is occurring - my
>> > conjecture was, e.g., if the two zk servers disagree about the version
>> > there is no way to decide who is correct without a third tie-breaker
>> > server.
>> >
>> > But, if you say that is not the case, then I need to keep looking
>> > (sigh).
>> >
>> > I am pretty sure that only one thread is touching that znode. We put in
>> > some trace logging to try and pinpoint the problem and noticed that
>> > every time we get the BadVersionException the actual version on the
>> > znode is one more than what we expected it to be based on the previous
>> > "exists()" call.
>> >
>> > As I said, this code gets called once every 2 seconds (or thereabouts).
>> > It seems to fail with a BadVersionException about 3 times an hour (on
>> > average).
>> >
>> > By the way, not sure if it is relevant, but the reason we are using 2
>> > nodes in the cluster and the reason why their version is 3.2.2 is
>> > because they are the ZKs that come embedded inside HBase (we're running
>> > 2 HBase regionservers) - I've been meaning to pull them out and run them
>> > standalone but just haven't got around to it (yet).
>> >
>> > Ishaaq
>> >
>> > On 10 October 2011 17:35, Mahadev Konar <maha...@hortonworks.com> wrote:
>> >
>> > > Ishaaq,
>> > > 2 ZK servers is definitely not the right number for running a ZK
>> > > service but it's no reason to get a BadVersionException because of
>> > > that. For more information on the size of the ZK ensemble take a look
>> > > at:
>> > >
>> > > http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html
>> > >
>> > > As for the version on the znode, can you try reading the version when
>> > > you get a BadVersionException from setData?
>> > >
>> > > Also, is there any chance of a delete on the znode that removes it and
>> > > another create happens for the same path?
>> > >
>> > > I don't think we have seen this version issue in the releases, so I'd
>> > > be inclined to say that there could be something in the code that's
>> > > making some changes to the znode before you set the data.
>> > >
>> > > Hope that helps
>> > > thanks
>> > > mahadev
>> > >
>> > > On Fri, Oct 7, 2011 at 6:47 PM, Ishaaq Chandy <ish...@gmail.com> wrote:
>> > > > Hi all,
>> > > >
>> > > > We're seeing a puzzling error. Here's the scenario:
>> > > >
>> > > > 1. We have a single thread that wakes up every two seconds (give or
>> > > >    take) and does some work
>> > > > 2. As part of that work it updates a node on ZK. When it does this
>> > > >    it first gets the Stat of the existing node and uses the version
>> > > >    retrieved from it to update the value.
>> > > > 3. There are no other processes updating the node
>> > > >
>> > > > The code goes something like this:
>> > > > final Stat stat = zooKeeper.exists(path, false);
>> > > > // do some other work here to create the path if it does not exist -
>> > > > // this code only ever gets called once
>> > > > zooKeeper.setData(path, value, stat.getVersion());
>> > > >
>> > > > What we're seeing is that every so often (once every 5 minutes or
>> > > > so?) that setData() call fails with a BadVersionException. This is
>> > > > very unexpected because, as I mentioned previously, this thread is
>> > > > the sole updater of that node.
>> > > >
>> > > > One possibility I am considering is that we are using the wrong
>> > > > number of ZKs in our cluster - i.e. 2 nodes. I am wondering if 2 is
>> > > > the worst number of nodes possible for ZK as there is no way to
>> > > > resolve a disagreement.
>> > > >
>> > > > Another possibility is that we are using an old version of ZK
>> > > > (3.2.2), perhaps there is a known bug with it? Though I see nothing
>> > > > related to this in the release logs for subsequent versions.
>> > > >
>> > > > Thoughts/suggestions?
>> > > >
>> > > > Thanks,
>> > > > Ishaaq
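
For anyone finding this thread later: the symptom described above (the actual version being exactly one more than the version returned by the earlier exists() call) is what a versioned conditional write produces when a second writer slips in between the read and the write. The sketch below replays that interleaving against a small in-memory stand-in. VersionedNode and TwoWritersDemo are hypothetical names invented for illustration; this is not the ZooKeeper client API, and a boolean return stands in for the BadVersionException the real client would throw.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical in-memory stand-in for a single znode; not the ZooKeeper client API. */
final class VersionedNode {
    private byte[] data = new byte[0];
    private int version = 0;

    /** Returns the current version, analogous to reading Stat.getVersion() after exists(). */
    synchronized int version() {
        return version;
    }

    /**
     * Conditional write: succeeds only if expectedVersion matches the current
     * version, then bumps the version by one. Returns false on a mismatch,
     * which is the point where the real client would report a bad version.
     */
    synchronized boolean setData(byte[] newData, int expectedVersion) {
        if (expectedVersion != version) {
            return false;
        }
        data = Arrays.copyOf(newData, newData.length);
        version++;
        return true;
    }
}

final class TwoWritersDemo {
    public static void main(String[] args) {
        VersionedNode node = new VersionedNode();

        // Both writers read the version before either one writes.
        int seenByA = node.version(); // the intended sole writer
        int seenByB = node.version(); // assumed extra writer from the misconfiguration

        // B writes first, so the version A is holding becomes stale.
        boolean bOk = node.setData("b".getBytes(StandardCharsets.UTF_8), seenByB);
        boolean aOk = node.setData("a".getBytes(StandardCharsets.UTF_8), seenByA);

        System.out.println("B succeeded: " + bOk);
        System.out.println("A succeeded: " + aOk);
        System.out.println("A expected version " + seenByA + ", actual is " + node.version());
    }
}
```

In this replay writer A's update is rejected and the node's version ends up one higher than the version A read, which matches the trace logging described in the thread. Passing the version to setData therefore works as the "we are the only writer" assertion it was meant to be: the failures were the check doing its job.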