Re: BIND 9.7 Serial Number Decrease Problem
Barry Finkel wrote:

> I ran a test this morning on one of the Solaris 10 slave servers. A query to the server showed serial numbers:
>
>     _tcp  1238
>     _udp   842
>
> Both of these match the zone on the MS Windows DNS Server. I checked the zone files on the slave server:
>
>     _tcp  1239
>     _udp   843
>
> Both of these are increased by one from what BIND returns in response to a query. The two zones have NO .jnl files. I did
>
>     ./rndc stop
>     Wait for the exiting message.
>     /etc/init.d/named.anl start;tail -f /var/adm/messages
>
> Once BIND started, the serial numbers were INCREASED, as I expected they would be, given the lack of .jnl files. And a few minutes later BIND complained about the serial number on the master being less than that on the slave for both zones.
>
> I consider this a bug in BIND 9. What further diagnostics do I need to get? I have another Solaris 10 slave on which, I assume, I can duplicate this. And from past experience, in one day, after the zone has expired and been refreshed, I will be in the same state on this slave.

Do BIND slave instances EVER make up or increment serial numbers? This just seems like such an unlikely bug that BIND would start doing that. Could it be that the supposed slave instance is accepting dynamic updates?

I'd be tracing/tracking SOA files on the master, and communications between the DNS instances, very closely before I'd even give such a potential bug much thought.

Perhaps there are BIND functions that I'm not aware of and I'm wrong.

John

___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list
bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users
Re: BIND 9.7 Serial Number Decrease Problem
On 07/06/11 13:51, I wrote:

> I now have this situation on one Solaris 10 slave; the problem probably also exists on the other Sol 10 slave and the two Ubuntu hardy slaves:
>
> The _tcp zone on the master MS DNS Server:
>
>     1238 600 86400 3600
>
> The _tcp zone on the BIND 9.7.3-P1 Solaris 10 server disk:
>
>     1239    ; serial
>     900     ; refresh (15 minutes)
>     600     ; retry (10 minutes)
>     86400   ; expire (1 day)
>     3600    ; minimum (1 hour)
>
> The _udp zone on the master MS DNS Server:
>
>     842 900 600 86400 3600
>
> The _udp zone on the BIND 9.7.3-P1 Solaris 10 server disk:
>
>     843     ; serial
>     900     ; refresh (15 minutes)
>     600     ; retry (10 minutes)
>     86400   ; expire (1 day)
>     3600    ; minimum (1 hour)
>
> Note that the zone serial number for both zones on the master is one LESS than the serial number on the slave. The last messages in /var/adm/messages are
>
> _tcp:
>
>     Jun 4 07:46:57 serial number (1238) received from master ... ours (1239)
>     Jun 4 07:47:35 zone ... expired
>     Jun 4 07:47:35 zone ... transfer started
>     Jun 4 07:47:35 zone ... transferred serial 1238
>     Jun 4 07:47:35 zone ... Transfer completed: ...
>
> _udp:
>
>     Jun 4 07:39:22 serial number (842) received from master ... ours (843)
>     Jun 4 07:42:22 zone ... expired
>     Jun 4 07:42:22 zone ... transfer started
>     Jun 4 07:42:22 zone ... transferred serial 842
>     Jun 4 07:42:22 zone ... Transfer completed
>
> There was a zone serial number mismatch, each zone expired three days ago, and new zones were transferred from the master. But the zone files on disk still have the higher serial numbers. There are no .jnl files on the disk. A dig on the server for the zone serial numbers shows the lower numbers, so BIND has those correct serial numbers.
>
> I assume that if I stopped BIND (rndc stop) and restarted it, then I would again see the serial number mismatches. I can try this during the day, as this server is not heavily used. Is there any debugging I need to run? Thanks.

I ran a test this morning on one of the Solaris 10 slave servers. A query to the server showed serial numbers:

    _tcp  1238
    _udp   842

Both of these match the zone on the MS Windows DNS Server. I checked the zone files on the slave server:

    _tcp  1239
    _udp   843

Both of these are increased by one from what BIND returns in response to a query. The two zones have NO .jnl files. I did

    ./rndc stop
    Wait for the exiting message.
    /etc/init.d/named.anl start;tail -f /var/adm/messages

Once BIND started, the serial numbers were INCREASED, as I expected they would be, given the lack of .jnl files. And a few minutes later BIND complained about the serial number on the master being less than that on the slave for both zones.

I consider this a bug in BIND 9. What further diagnostics do I need to get? I have another Solaris 10 slave on which, I assume, I can duplicate this. And from past experience, in one day, after the zone has expired and been refreshed, I will be in the same state on this slave.

--
Barry S. Finkel
Computing and Information Systems Division
Argonne National Laboratory     Phone:     +1 (630) 252-7277
9700 South Cass Avenue          Facsimile: +1 (630) 252-4601
Building 240, Room 5.B.8        Internet:  bsfin...@anl.gov
Argonne, IL 60439-4828          IBMMAIL:   I1004994
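Editor's sketch: the "serial number on the master being less than that on the slave" complaint follows from how SOA serials are compared. The snippet below is a minimal illustration of 32-bit serial number arithmetic as described in RFC 1982, not BIND's actual code; the function names are hypothetical and chosen only for this example.

```python
# Minimal sketch of RFC 1982 serial number arithmetic for 32-bit SOA serials.
# Function names are illustrative assumptions, not part of BIND.

SERIAL_BITS = 32
MOD = 1 << SERIAL_BITS          # serials wrap at 2**32
HALF = 1 << (SERIAL_BITS - 1)   # comparison window is 2**31


def serial_compare(s1: int, s2: int) -> int:
    """Return -1 if s1 < s2, 0 if equal, 1 if s1 > s2 (RFC 1982 sense).

    Raises ValueError when the two serials are exactly 2**31 apart,
    because the comparison is undefined in that case.
    """
    s1 %= MOD
    s2 %= MOD
    if s1 == s2:
        return 0
    diff = (s2 - s1) % MOD
    if diff == HALF:
        raise ValueError("serial comparison is undefined")
    return -1 if diff < HALF else 1


def slave_would_transfer(master_serial: int, slave_serial: int) -> bool:
    """A slave only refreshes when the master's serial is 'greater'."""
    return serial_compare(master_serial, slave_serial) > 0


# The situation from the post: master at 1238, slave loaded 1239 from disk.
print(slave_would_transfer(1238, 1239))  # the slave sees no newer zone
```

Under this comparison a slave holding 1239 treats a master at 1238 as older, so it keeps its copy until the zone expires, which matches the log sequence quoted in the message.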
Re: BIND 9.7 Serial Number Decrease Problem
On 06/06/2011 08:01 PM, Barry Finkel wrote:

> Phil Mayers suggested a corrupt .jnl file; I am not sure. How do I debug this?

Given what Mark has said, I think it's unlikely; I didn't realise BIND wrote a new journal and did a rename(), which is atomic on every POSIX system that you're likely to be using.

So, ignore what I said!
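Editor's sketch: the write-then-rename pattern mentioned above can be illustrated in a few lines. This is a generic example of atomic file replacement on a POSIX filesystem, not BIND's implementation; `atomic_write` is a hypothetical helper name.

```python
# Generic write-new-file-then-rename pattern (not BIND's code).
import os
import tempfile


def atomic_write(path: str, data: str) -> None:
    """Replace `path` with `data` so readers see the old or the new file,
    never a partially written one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (same filesystem)
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # push the contents to disk first
        os.replace(tmp_path, path)     # rename over the old file
    except BaseException:
        # On failure, remove the temporary file and leave `path` untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the process is killed before the rename, the old file is still intact; if it is killed after, the new file is complete. That is why an interrupted write alone should not leave a half-written file behind.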
Re: BIND 9.7 Serial Number Decrease Problem
In my last posting I was confused as to the .jnl file. I have about 44 AD slave files on my BIND servers, and 40 .jnl files. The two zones in question do not have .jnl files. As I do not look at .jnl files much, I had forgotten about the tool to list them.

I now have this situation on one Solaris 10 slave; the problem probably also exists on the other Sol 10 slave and the two Ubuntu hardy slaves:

The _tcp zone on the master MS DNS Server:

    1238 600 86400 3600

The _tcp zone on the BIND 9.7.3-P1 Solaris 10 server disk:

    1239    ; serial
    900     ; refresh (15 minutes)
    600     ; retry (10 minutes)
    86400   ; expire (1 day)
    3600    ; minimum (1 hour)

The _udp zone on the master MS DNS Server:

    842 900 600 86400 3600

The _udp zone on the BIND 9.7.3-P1 Solaris 10 server disk:

    843     ; serial
    900     ; refresh (15 minutes)
    600     ; retry (10 minutes)
    86400   ; expire (1 day)
    3600    ; minimum (1 hour)

Note that the zone serial number for both zones on the master is one LESS than the serial number on the slave. The last messages in /var/adm/messages are

_tcp:

    Jun 4 07:46:57 serial number (1238) received from master ... ours (1239)
    Jun 4 07:47:35 zone ... expired
    Jun 4 07:47:35 zone ... transfer started
    Jun 4 07:47:35 zone ... transferred serial 1238
    Jun 4 07:47:35 zone ... Transfer completed: ...

_udp:

    Jun 4 07:39:22 serial number (842) received from master ... ours (843)
    Jun 4 07:42:22 zone ... expired
    Jun 4 07:42:22 zone ... transfer started
    Jun 4 07:42:22 zone ... transferred serial 842
    Jun 4 07:42:22 zone ... Transfer completed

There was a zone serial number mismatch, each zone expired three days ago, and new zones were transferred from the master. But the zone files on disk still have the higher serial numbers. There are no .jnl files on the disk. A dig on the server for the zone serial numbers shows the lower numbers, so BIND has those correct serial numbers.

I assume that if I stopped BIND (rndc stop) and restarted it, then I would again see the serial number mismatches. I can try this during the day, as this server is not heavily used. Is there any debugging I need to run? Thanks.

--
Barry S. Finkel
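Editor's sketch: the check described in this message (serial in the on-disk zone file versus serial returned by a query) can be scripted. The snippet assumes the slave file is in text format and uses the commented SOA layout shown in the post (`1239 ; serial`); the function names and the sample text are illustrative assumptions.

```python
# Sketch: pull the SOA serial out of a text-format zone file that uses the
# "NNNN ; serial" comment layout quoted in the post, then compare it with
# the serial the server answers with. Names here are hypothetical.
import re

SERIAL_RE = re.compile(r"(\d+)\s*;\s*serial\b", re.IGNORECASE)


def serial_from_zone_text(text: str):
    """Return the serial found next to a '; serial' comment, or None."""
    match = SERIAL_RE.search(text)
    return int(match.group(1)) if match else None


def disk_matches_served(zone_text: str, served_serial: int) -> bool:
    """True when the on-disk serial equals the serial seen in a query."""
    disk_serial = serial_from_zone_text(zone_text)
    return disk_serial is not None and disk_serial == served_serial


# SOA timer block as quoted in the message (the surrounding record is omitted).
sample = """
    1239    ; serial
    900     ; refresh (15 minutes)
    600     ; retry (10 minutes)
    86400   ; expire (1 day)
    3600    ; minimum (1 hour)
"""
print(serial_from_zone_text(sample), disk_matches_served(sample, 1238))
```

A `False` result here corresponds to the state reported above: the file on disk carries a serial that the running server is not serving.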
Re: BIND 9.7 Serial Number Decrease Problem
On 6/7/11 7:51 AM, Barry Finkel bsfin...@anl.gov wrote:

> There was a zone serial number mismatch, each zone expired three days ago, and new zones were transferred from the master. But the zone files on disk still have the higher serial numbers. There are no .jnl files on the disk. A dig on the server for the zone serial numbers shows the lower numbers, so BIND has those correct serial numbers.

If you have multiple masters for which this server is a slave, then check the serial number on all of the masters. I think you will find that one of them is higher than the other...

> I assume that if I stopped BIND (rndc stop) and restarted it, then I would again see the serial number mismatches. I can try this during the day, as this server is not heavily used. Is there any debugging I need to run? Thanks.

--
Daniel J McDonald, CCIE # 2495, CISSP # 78281
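Editor's sketch: one way to act on the suggestion above is to collect the SOA answer from each master and compare the serial fields. The snippet assumes the answers were gathered separately in the seven-field form that `dig +short <zone> SOA @<server>` prints; the server names and SOA contents below are made up for illustration.

```python
# Sketch: compare SOA serials reported by several masters.
# Input is the short SOA form: mname rname serial refresh retry expire minimum.
# Server names and record contents are hypothetical.


def serial_from_short_soa(answer: str) -> int:
    """Extract the serial (third field) from a short-form SOA answer."""
    fields = answer.split()
    if len(fields) != 7:
        raise ValueError("expected 7 SOA fields, got %d" % len(fields))
    return int(fields[2])


def serials_by_server(answers: dict) -> dict:
    """Map each server name to the serial it reported."""
    return {server: serial_from_short_soa(a) for server, a in answers.items()}


def masters_disagree(answers: dict) -> bool:
    """True when the masters do not all report the same serial."""
    return len(set(serials_by_server(answers).values())) > 1


answers = {
    "dc1.example.test": "dc1.example.test. hostmaster.example.test. 1238 900 600 86400 3600",
    "dc2.example.test": "dc2.example.test. hostmaster.example.test. 1239 900 600 86400 3600",
}
print(serials_by_server(answers), masters_disagree(answers))
```

If two domain controllers report different serials for the same zone, a slave that refreshes from both can end up holding the higher one.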
Re: BIND 9.7 Serial Number Decrease Problem
On 07/06/11 13:51, Barry Finkel wrote:

> In my last posting I was confused as to the .jnl file. I have about 44 AD slave files on my BIND servers, and 40 .jnl files. The two zones in question do not have .jnl files. As I do not look at .jnl files much, I had forgotten about the tool to list them.
>
> I now have this situation on one Solaris 10 slave; the problem probably also exists on the other Sol 10 slave and the two Ubuntu hardy slaves:
>
> The _tcp zone on the master MS DNS Server:
>
>     1238 600 86400 3600
>
> The _tcp zone on the BIND 9.7.3-P1 Solaris 10 server disk:
>
>     1239 ; serial

As Dan McDonald mentioned, the AD-integrated DNS zones do not maintain a stable serial number, and in fact return a per-AD-controller SOA record. Are you sure that isn't the cause of your problem?
RE: BIND 9.7 Serial Number Decrease Problem
McDonald, Dan dan.mcdon...@austinenergy.com replied to my posting:

> I think your root problem is trying to deal with active directory integrated zones. We stopped using them entirely when we found that each domain controller maintains an individual SOA record with its own serial number. The serial numbers rapidly (and purposely) fall out of sync, but active directory doesn't care as they use a different replication method. The only way that we could successfully interact from bind was to set up a forward-only zone and try to cache the results.
>
> When we found that Active directory under windows 2000 was unable to maintain proper synchronization, we switched to bind for all zones and haven't looked back.

If you check the list archives (back to the days when there was bind-users and bind9-users), you will find my postings dealing with MS article 282826. MS details the problem with zone serial numbers, and that is why we run the DNS Server on only ONE Domain Controller (and have since the beginning of AD in Windows 2000). When we run the DNS Server on a second DC (because the Windows admins want to), I tell BIND that there is ONE master server. I do not care what the zone serial number is on the other DC DNS Server, unless we have to switch masters. The only times I have switched are when the master DC is being upgraded, and I switch to another DC as the master.

We have NO machines configured (as far as I know) to use the DNS Servers on the DC as primary DNS servers; all machines are configured to use the BIND slaves.

In the early days of AD, there were serial number decreases in the MS code. I had an open trouble ticket for a long time before the MS DNS development team found the problem. I have not had a serial number decrease on the MS side for a long time. The exception is that occasionally, when patches are being applied to the DC, the serial number on one or more zones will decrease during the patch run; after the DC is rebooted, the serial number goes back to its normal, non-decreased value.

--
Barry S. Finkel
Re: BIND 9.7 Serial Number Decrease Problem
In message 4de9045c.2050...@anl.gov, Barry Finkel writes:

>> I have a problem with BIND 9.7.x on Ubuntu. I have two servers that are running 9.7.3. They slave 332 zones, and they also master 213,750 malware/spyware zones that we have defined to reroute these domains to a local machine. When I was upgrading the BIND to 9.7.3-P1 yesterday, an ./rndc stop command ran over 8 minutes, and named did not stop. A kill command did not work; I had to revert to a kill -9 command. What was BIND doing? Gracefully closing all of the zones?
>
> Most probably. rndc stop ensures that master files are up-to-date before exiting. rndc halt does not try to flush master files before exiting. There could also have been a reference leak causing named to not stop.
>
>> BIND 9.7.3-P1 came up fine, but there are two things that concern me:
>>
>> 1) After BIND began responding to queries, it was using 100% of the CPU for about three minutes. I am not sure what BIND was doing. This is not major because BIND was handling customer queries, and after the three minutes the CPU usage dropped to a normal 1%.
>>
>> 2) Two zones reported serial number decreases. This is bad. I did some research on the two zones - both Microsoft Active Directory zones (one _tcp and one _udp) that are mastered on a Windows Domain Controller and slaved on my BIND boxes. I have around 44 AD zones I slave, and only these two reported problems - on my two internal Ubuntu slaves and my two Solaris 10 slaves. The two Solaris 10 slaves do not run the spyware zones, so I had no problem with ./rndc stop. I therefore am not sure that the serial number problems are due to the kill -9.
>
> They shouldn't be. The handling of master files and journals is designed to have the power be pulled at any time, provided the filesystem supports atomic replacement of files.
>
>> I looked at the serial number issue on these two zones in detail; I capture the serial numbers on all the AD zones each morning at 6:10. Here is information for the _tcp zone:
>>
>>     Date         Zone   Mast  Slav  Slav
>>     20 Oct 2010  _tcp.  1233  1233  1233
>>     21 Oct 2010  _tcp.  1239  1239  1239   The master incremented the serial.
>>     ...
>>     09 Nov 2010  _tcp.  1239  1239  1239
>>     10 Nov 2010  _tcp.  1238  1239  1239   Master decreased due to MS patch
>>     11 Nov 2010  _tcp.  1238  1238  1238
>>     ...
>>     03 Dec 2010  _tcp.  1238  1238  1238
>>     04 Dec 2010  _tcp.  1238  1238  1239   ??
>>     05 Dec 2010  _tcp.  1238  1239  1238   ??
>>     06 Dec 2010  _tcp.  1238  1238  1238
>>     ...
>>     09 Dec 2010  _tcp.  1238  1238  1238
>>     10 Dec 2010  _tcp.  1238  1238  1239   ??
>>     11 Dec 2010  _tcp.  1238  1239  1238   ??
>>     12 Dec 2010  _tcp.  1238  1238  1238
>>     ...
>>     05 Jan 2011  _tcp.  1238  1238  1238
>>     06 Jan 2011  _tcp.  1238  1239  1239   ??
>>     07 Jan 2011  _tcp.  1238  1238  1238
>>     ...
>>     02 Mar 2011  _tcp.  1238  1238  1238   Upgrade 9.7.2-P3 to 9.7.3
>>     03 Mar 2011  _tcp.  1238  1239  1239
>>     04 Mar 2011  _tcp.  1238  1238  1238
>>     ...
>>     16 Apr 2011  _tcp.  1238  1238  1238
>>     17 Apr 2011  _tcp.  1238  1238  1238  1238  1238   Two Sol10 slaves added.
>>     ...
>>     02 Jun 2011  _tcp.  1238  1238  1238  1238  1238   Upgrade 9.7.3 to 9.7.3-P1
>>     03 Jun 2011  _tcp.  1238  1239  1239  1239  1239
>>
>> Both Ubuntu slaves have been up for 149 days (reboot around Jan 15).
>>
>> The zone serial was 1239 until a MS patch run on the Domain Controller decreased the serial by one on the evening of Nov 9. I did nothing to correct the problem; I waited for the two zones to expire, and then new zones were transferred from the Windows master server. The serial number was 1238 on the master and slaves. On a few days, the serial on the slaves increased by one, and I am not sure what happened on those days.
>>
>> On Mar 02 I upgraded BIND from 9.7.2-P3 to 9.7.3, and the serial numbers on the two upgraded BIND slaves reverted to the higher 1239 serial. Again, I did no fixup, and on Mar 04 the serials were the same at the lower value. I think that the serial number decrease was temporary during the patch run.
>>
>> On Apr 17 I added the two Solaris 10 slaves to my morning report, and all five serials were constant at 1238 until I upgraded BIND Tuesday (on the Solaris 10 boxes) and yesterday (on the Ubuntu boxes). Immediately after the upgrade BIND reported the serial number problem on these two zones. The other AD zones have had no serial number problems.
>>
>> I have no idea why BIND would remember the increased 1239 serial number, when the serial number for the zone has been constant at 1238 since Mar 04. I have to assume that between Mar 04 and Jun 03 BIND would have written the zone to disk, either in the base zone file or a .jnl file.
>>
>> --
>> Barry S. Finkel

Phil Mayers suggested a corrupt .jnl file; I am not sure. How do I debug this?

I have the following situation now:

1) The master (on an MS DNS Server) has serial 1238.

2) The zone file on a Solaris 10
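Editor's sketch: the morning serial report quoted in this message lends itself to an automated check for days on which any slave is ahead of the master. The snippet assumes rows of the form `DD Mon YYYY zone master slave...` followed by an optional free-text note, as in the quoted report; the function names are hypothetical.

```python
# Sketch: flag report rows where a slave's serial is ahead of the master's.
# Row layout is assumed from the quoted report; names are illustrative.


def serial_ahead(a: int, b: int) -> bool:
    """True when serial a is 'greater than' b in 32-bit serial arithmetic."""
    return 0 < (a - b) % 2**32 < 2**31


def parse_row(line: str):
    """Return (date, master_serial, [slave_serials]) or None for non-data lines."""
    parts = line.split()
    if len(parts) < 6 or not parts[0].isdigit():
        return None  # header, "..." separators, blank lines
    date = " ".join(parts[:3])
    serials = []
    for token in parts[4:]:      # parts[3] is the zone name
        if not token.isdigit():
            break                # the rest of the line is a free-text note
        serials.append(int(token))
    if len(serials) < 2:
        return None
    return date, serials[0], serials[1:]


def slaves_ahead_of_master(report: str):
    """List the rows in which at least one slave is ahead of the master."""
    flagged = []
    for line in report.splitlines():
        row = parse_row(line)
        if row is None:
            continue
        date, master, slaves = row
        if any(serial_ahead(s, master) for s in slaves):
            flagged.append((date, master, slaves))
    return flagged
```

Run against the rows above, this would flag the 10 Nov, early December, 06 Jan, 03 Mar, and 03 Jun entries, which are the same days marked with notes or `??` in the report.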
Re: BIND 9.7 Serial Number Decrease Problem
Barry Finkel bsfin...@anl.gov wrote:

> I am not sure how to decode the .jnl file; I have not looked at the code in detail.

Try the named-journalprint program. You can also try named-compilezone -j, which applies the journal to the master file.

Tony.
--
f.anthony.n.finch d...@dotat.at http://dotat.at/
Rockall, Malin, Hebrides: Cyclonic, becoming north, 5 to 7, occasionally gale 8 in Rockall. Moderate or rough, occasionally very rough in Rockall. Rain or squally showers. Moderate or good, occasionally poor.
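Editor's sketch: the two tools named above can be driven from a small script. The zone name and file path below are hypothetical, the journal is assumed to sit next to the zone file with `.jnl` appended (BIND's default naming), and `-o -` is assumed to send named-compilezone output to standard output; check the man pages for the installed version before relying on this.

```python
# Sketch: build (and optionally run) the journal inspection commands.
# Zone name and paths are hypothetical placeholders.
import shutil
import subprocess


def journal_inspection_commands(zone: str, zone_file: str) -> list:
    """Return argv lists for printing a journal and for applying it."""
    journal = zone_file + ".jnl"   # assumed default journal file name
    return [
        ["named-journalprint", journal],
        # -j: read the journal if it exists; -o -: write the result to stdout
        ["named-compilezone", "-j", "-o", "-", zone, zone_file],
    ]


def run_if_available(argv: list):
    """Run a command and return its stdout, or None if the tool is missing."""
    if shutil.which(argv[0]) is None:
        return None
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.stdout


commands = journal_inspection_commands(
    "_tcp.example.test", "/var/named/slave/_tcp.example.test.db"
)
for argv in commands:
    print(" ".join(argv))
```

For the zones in this thread there is reportedly no .jnl file, so named-journalprint would have nothing to read; the commands are more useful on the zones that do have journals.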
RE: BIND 9.7 Serial Number Decrease Problem
-----Original Message-----
From: bind-users-bounces+dan.mcdonald=austinenergy@lists.isc.org [mailto:bind-users-bounces+dan.mcdonald=austinenergy@lists.isc.org] On Behalf Of Tony Finch
Sent: Monday, June 06, 2011 2:43 PM
To: Barry Finkel
Cc: bind-users@lists.isc.org
Subject: Re: BIND 9.7 Serial Number Decrease Problem

I think your root problem is trying to deal with Active Directory integrated zones. We stopped using them entirely when we found that each domain controller maintains an individual SOA record with its own serial number. The serial numbers rapidly (and purposely) fall out of sync, but Active Directory doesn't care, as it uses a different replication method. The only way that we could successfully interact from BIND was to set up a forward-only zone and try to cache the results.

When we found that Active Directory under Windows 2000 was unable to maintain proper synchronization, we switched to BIND for all zones and haven't looked back.

__
Daniel J McDonald, CCIE # 2495, CISSP # 78281
Austin Energy
Re: BIND 9.7 Serial Number Decrease Problem
In message 4de9045c.2050...@anl.gov, Barry Finkel writes:

> I have a problem with BIND 9.7.x on Ubuntu. I have two servers that are running 9.7.3. They slave 332 zones, and they also master 213,750 malware/spyware zones that we have defined to reroute these domains to a local machine. When I was upgrading the BIND to 9.7.3-P1 yesterday, an ./rndc stop command ran over 8 minutes, and named did not stop. A kill command did not work; I had to revert to a kill -9 command. What was BIND doing? Gracefully closing all of the zones?

Most probably. rndc stop ensures that master files are up-to-date before exiting. rndc halt does not try to flush master files before exiting. There could also have been a reference leak causing named to not stop.

> BIND 9.7.3-P1 came up fine, but there are two things that concern me:
>
> 1) After BIND began responding to queries, it was using 100% of the CPU for about three minutes. I am not sure what BIND was doing. This is not major because BIND was handling customer queries, and after the three minutes the CPU usage dropped to a normal 1%.
>
> 2) Two zones reported serial number decreases. This is bad. I did some research on the two zones - both Microsoft Active Directory zones (one _tcp and one _udp) that are mastered on a Windows Domain Controller and slaved on my BIND boxes. I have around 44 AD zones I slave, and only these two reported problems - on my two internal Ubuntu slaves and my two Solaris 10 slaves. The two Solaris 10 slaves do not run the spyware zones, so I had no problem with ./rndc stop. I therefore am not sure that the serial number problems are due to the kill -9.

They shouldn't be. The handling of master files and journals is designed to have the power be pulled at any time, provided the filesystem supports atomic replacement of files.

> I looked at the serial number issue on these two zones in detail; I capture the serial numbers on all the AD zones each morning at 6:10. Here is information for the _tcp zone:
>
>     Date         Zone   Mast  Slav  Slav
>     20 Oct 2010  _tcp.  1233  1233  1233
>     21 Oct 2010  _tcp.  1239  1239  1239   The master incremented the serial.
>     ...
>     09 Nov 2010  _tcp.  1239  1239  1239
>     10 Nov 2010  _tcp.  1238  1239  1239   Master decreased due to MS patch
>     11 Nov 2010  _tcp.  1238  1238  1238
>     ...
>     03 Dec 2010  _tcp.  1238  1238  1238
>     04 Dec 2010  _tcp.  1238  1238  1239   ??
>     05 Dec 2010  _tcp.  1238  1239  1238   ??
>     06 Dec 2010  _tcp.  1238  1238  1238
>     ...
>     09 Dec 2010  _tcp.  1238  1238  1238
>     10 Dec 2010  _tcp.  1238  1238  1239   ??
>     11 Dec 2010  _tcp.  1238  1239  1238   ??
>     12 Dec 2010  _tcp.  1238  1238  1238
>     ...
>     05 Jan 2011  _tcp.  1238  1238  1238
>     06 Jan 2011  _tcp.  1238  1239  1239   ??
>     07 Jan 2011  _tcp.  1238  1238  1238
>     ...
>     02 Mar 2011  _tcp.  1238  1238  1238   Upgrade 9.7.2-P3 to 9.7.3
>     03 Mar 2011  _tcp.  1238  1239  1239
>     04 Mar 2011  _tcp.  1238  1238  1238
>     ...
>     16 Apr 2011  _tcp.  1238  1238  1238
>     17 Apr 2011  _tcp.  1238  1238  1238  1238  1238   Two Sol10 slaves added.
>     ...
>     02 Jun 2011  _tcp.  1238  1238  1238  1238  1238   Upgrade 9.7.3 to 9.7.3-P1
>     03 Jun 2011  _tcp.  1238  1239  1239  1239  1239
>
> Both Ubuntu slaves have been up for 149 days (reboot around Jan 15).
>
> The zone serial was 1239 until a MS patch run on the Domain Controller decreased the serial by one on the evening of Nov 9. I did nothing to correct the problem; I waited for the two zones to expire, and then new zones were transferred from the Windows master server. The serial number was 1238 on the master and slaves. On a few days, the serial on the slaves increased by one, and I am not sure what happened on those days.
>
> On Mar 02 I upgraded BIND from 9.7.2-P3 to 9.7.3, and the serial numbers on the two upgraded BIND slaves reverted to the higher 1239 serial. Again, I did no fixup, and on Mar 04 the serials were the same at the lower value. I think that the serial number decrease was temporary during the patch run.
>
> On Apr 17 I added the two Solaris 10 slaves to my morning report, and all five serials were constant at 1238 until I upgraded BIND Tuesday (on the Solaris 10 boxes) and yesterday (on the Ubuntu boxes). Immediately after the upgrade BIND reported the serial number problem on these two zones. The other AD zones have had no serial number problems.
>
> I have no idea why BIND would remember the increased 1239 serial number, when the serial number for the zone has been constant at 1238 since Mar 04. I have to assume that between Mar 04 and Jun 03 BIND would have written the zone to disk, either in the base zone file or a .jnl file.
>
> --
> Barry S. Finkel
> Computing and Information Systems Division
> Argonne National Laboratory     Phone:     +1 (630) 252-7277
> 9700 South Cass Avenue          Facsimile:+1
Re: BIND 9.7 Serial Number Decrease Problem
On 06/03/2011 04:57 PM, Barry Finkel wrote:

> I have a problem with BIND 9.7.x on Ubuntu. I have two servers that are running 9.7.3. They slave 332 zones, and they also master 213,750 malware/spyware zones that we have defined to reroute these domains to a local machine.

That's a hell of a lot of zones. Have you investigated RPZ in the newer versions of BIND?

> I have no idea why BIND would remember the increased 1239 serial number, when the serial number for the zone has been constant at 1238 since Mar 04. I have to assume that between Mar 04 and Jun 03 BIND would have written the zone to disk, either in the base zone file or a .jnl file.

Perhaps the .jnl file was corrupted when you -9ed it?