Re: [DRBD-user] Dual-primary/ Very slow synchronization
Hello again,

today I killed the asynchronous system, installed Xenserver 5.6 SP2 from scratch and compiled the latest version of DRBD. SP2 reserves 4 VCPUs for dom0, and I configured it to reserve 2 GB of RAM as well, instead of the default 768 MB. But when I created the storage on the local DRBD device the system almost hung, and the system log showed kernel timeout error messages from dom0.

Another idea, without having proof: although I am not sure why the issues only happen now and did not occur in the beginning, this reminds me a bit of IRQ resource conflicts back in the Win98 days. I thought today's systems no longer worry about PCI latency and the like. Would it be worth a try to put the 10GbE NIC or the RAID controller card into other PCI Express slots, like we used to do 10 years ago when something behaved strangely?

I am seriously considering leaving Citrix Xenserver behind and trying some other distribution. I mean: this time I did not even sync towards another machine yet, but still worked locally. So it cannot be a network-related issue anymore, but something on the machine itself. :-/

Are problems like mine happening more often, or am I the only black sheep experiencing this?

CU, Mészi.

___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
Re: [DRBD-user] Dual-primary/ Very slow synchronization
On 20.05.2011 10:22, Lars Ellenberg wrote:
> According to you, it had been working before with the exact same hardware
> and configuration. Now if it does not anymore, then I strongly suspect
> network problems.

I am currently checking the network switches together with the Dell support, just to rule out a possible responsibility of the 10GbE switches.

> Did you do some network benchmarks on the replication link recently?
> Packet loss, excessive retransmits, checksum errors, bad cabling? Port
> stats on the switch, any error counters? Duplex or other auto-negotiation
> mismatch? Use flood ping with both large (32k) and small packet sizes,
> iperf, your favorite network integrity checker or benchmarking tool...

Up to now I did tests with:

1. netio: up to 230 MByte/s
2. iperf: always below 80 MBit/s
3. ping: 9k and 18k packets, simultaneously from/to both hosts.

The first two had no significant effect on the CPU usage stats. The ping test, however, led to high CPU peaks; the virtual guests' services were unavailable during that time, and dmesg showed the timeout messages that I already posted.

At the end of this email I attach three logfiles from three machines, starting from the moment when I restarted the resync of the split-brained DRBD.

host1: The machine where the running virtual guests reside. Runs Xenserver 5.6.
host2: The machine that lost the DRBD sync. No running virtual machines. Runs Xenserver 5.6.
guest: The mailserver, as an example for all other virtual guests. Runs CentOS 5.5. Resides on host1.

An explanation of the timeline:

13:41:42: DRBD resync started on host2
13:49:25: DRBD resync aborted by disconnecting on host2
13:51:44: DRBD disconnection still not finished, so I tried stopping the DRBD service on host2. After that was also refused, I rebooted host2.

While the DRBD resync attempt was in progress the mailserver was not reachable on the network - neither its mail services nor the webmail service or the ssh console.
During that time, top showed very little CPU usage on host1 but around 95% on host2 (the drbd_receiver process). Any further ideas?

I am already thinking about installing a fresh CentOS 5.6 or Debian 6 on host2, setting up DRBD as StandAlone, installing Xen and moving everything (system as image, data as copy) onto that system ... then setting up host1 similarly and syncing everything onto it. I'd lose nice things like XenMotion, but I would possibly gain much more control over the resources. What's your opinion on that? Possibly Xenserver, my hardware setup and my virtualization goals simply do not fit together? :-/

CU, Daniel.

--- host1 /var/log/messages
May 24 13:41:45 localhost kernel: block drbd0: Handshake successful: Agreed network protocol version 94
May 24 13:41:45 localhost kernel: block drbd0: conn( WFConnection -> WFReportParams )
May 24 13:41:45 localhost kernel: block drbd0: Starting asender thread (from drbd0_receiver [12355])
May 24 13:41:45 localhost kernel: block drbd0: data-integrity-alg: crc32c
May 24 13:41:45 localhost kernel: block drbd0: drbd_sync_handshake:
May 24 13:41:45 localhost kernel: block drbd0: self 06F5FB89BEF343B1:1B3487AE74F5DBA1:947F913C965D9046:622E658C94927BDE bits:1420811336 flags:0
May 24 13:41:45 localhost kernel: block drbd0: peer 1B3487AE74F5DBA0::6CEB7B1CC746F358:42F748D007A8C8C5 bits:3874219255 flags:0
May 24 13:41:45 localhost kernel: block drbd0: uuid_compare()=1 by rule 70
May 24 13:41:45 localhost kernel: block drbd0: Becoming sync source due to disk states.
May 24 13:41:45 localhost kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
May 24 13:41:47 localhost kernel: block drbd0: peer( Secondary -> Unknown ) conn( WFBitMapS -> TearDown )
May 24 13:41:47 localhost kernel: block drbd0: asender terminated
May 24 13:41:47 localhost kernel: block drbd0: Terminating asender thread
May 24 13:41:51 localhost kernel: block drbd0: sock_sendmsg returned -32
May 24 13:41:51 localhost kernel: block drbd0: short sent ReportBitMap size=4096 sent=408
May 24 13:41:51 localhost kernel: block drbd0: Connection closed
May 24 13:41:51 localhost kernel: block drbd0: conn( TearDown -> Unconnected )
May 24 13:41:51 localhost kernel: block drbd0: receiver terminated
May 24 13:41:51 localhost kernel: block drbd0: Restarting receiver thread
May 24 13:41:51 localhost kernel: block drbd0: receiver (re)started
May 24 13:41:51 localhost kernel: block drbd0: conn( Unconnected -> WFConnection )
May 24 13:42:11 localhost kernel: block drbd0: Handshake successful: Agreed network protocol version 94
May 24 13:42:11 localhost kernel: block drbd0: conn( WFConnection -> WFReportParams )
May 24 13:42:11 localhost kernel: block drbd0: Starting asender thread (from drbd0_receiver [12355])
May 24 13:42:11 localhost kernel: block drbd0: data-integrity-alg: crc32c
May 24 13:42:11 localhost kernel: block drbd0: drbd_sync_handshake:
May 24 13:42:11 localhost kernel: block drbd0: self
Re: [DRBD-user] Dual-primary/ Very slow synchronization
On Fri, May 20, 2011 at 10:12:29AM +0200, Daniel Meszaros wrote:
> Hi!
>
> On 19.05.2011 21:16, Digimer wrote:
>> As Felix stated, try 10M. If it gets up to that speed (and it can take a
>> while, be patient), then bump it to 20M, etc.
>
> I tried syncing with 10M last night and indeed it became faster than before, around 10 MB/s. Using "drbdsetup /dev/drbd0 syncer -r 20M" and so on I could increase the sync speed up to 70M ... then it interrupted and a new bitmap check started.
>
> While doing so I noticed some kernel error messages in the virtual guests of the Xenserver:
>
> Linux:
> [ 2040.064272] INFO: task exim4:2336 blocked for more than 120 seconds.
> [ 2040.064281] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
>
> Windows 2003 SBS:
> NTDS (432) NTDSA: A request to read 8192 (0x2000) bytes from the file C:\WINDOWS\NTDS\ntds.dit at offset 4661248 (0x00472000) succeeded, but took an unusually long time (406 seconds) on the part of the operating system. In addition, 0 other I/O requests to this file have taken an unusually long time since the last message about this problem was sent 539 seconds ago. This problem is probably caused by faulty hardware. Contact your hardware manufacturer for further assistance in diagnosing the problem.
>
> While this is happening these machines are not reachable on the network. Therefore I stopped the synchronization and shut down the machine that is out of sync ... which brought the services back to normal.
>
> For some reason the DRBD sync appears to take too much I/O performance, even when it is running at 10M. I must admit that the last time I remember having done a full sync was before I had these machines set up and running.
>
> When asking Google I found some messages from this list mentioning that SCST's vdisk_blockio should be changed to vdisk_fileio - however, I do not use SCST. Any ideas for non-SCST systems?
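The stepwise rate bumping described above (10M, then 20M, etc.) can be scripted. A minimal sketch - the range and the drbdsetup invocation mirror the commands quoted in this thread, while the sleep interval between steps is an assumption:

```shell
# ramp_rates START MAX: print a doubling sequence of syncer rates (in MB/s)
ramp_rates() {
  r=$1
  while [ "$r" -le "$2" ]; do
    printf '%sM\n' "$r"
    r=$((r * 2))
  done
}

# live usage (requires DRBD; watch /proc/drbd between steps and stop
# as soon as the sync stalls or guest I/O suffers):
#   for rate in $(ramp_rates 10 160); do
#     drbdsetup /dev/drbd0 syncer -r "$rate"
#     sleep 300
#     cat /proc/drbd
#   done
```

The point of doubling rather than jumping straight to the target rate is to find the step at which the sync stops keeping up, which is the symptom reported above at 70M.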
:-/

According to you, it had been working before with the exact same hardware and configuration. Now if it does not anymore, then I strongly suspect network problems.

Did you do some network benchmarks on the replication link recently? Packet loss, excessive retransmits, checksum errors, bad cabling? Port stats on the switch, any error counters? Duplex or other auto-negotiation mismatch?

Use flood ping with both large (32k) and small packet sizes, iperf, your favorite network integrity checker or benchmarking tool...

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

__
please don't Cc me, but send to list -- I'm subscribed
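Lars's checklist can be run from a small script. A sketch, under the assumption that the replication link is eth1 and the peer's address is 10.0.0.2 (both placeholders); the live commands are guarded so nothing runs unless RUN_LIVE=1:

```shell
# parse_loss "PING SUMMARY LINE" -> packet-loss percentage
parse_loss() {
  printf '%s\n' "$1" | sed -n 's/.*[ ,]\([0-9.]*\)% packet loss.*/\1/p'
}

if [ "${RUN_LIVE:-0}" = "1" ]; then
  PEER=10.0.0.2   # placeholder: replication-link address of the other node
  # flood ping with large and small packet sizes (needs root)
  ping -f -s 32768 -c 1000 "$PEER"
  ping -f -s 64 -c 1000 "$PEER"
  # raw TCP throughput over the same link
  iperf -c "$PEER" -t 30
  # NIC error counters before/after the tests
  ethtool -S eth1 | grep -i -E 'err|drop|crc'
fi
```

Any non-zero loss from the flood pings, or growing error/drop counters from ethtool, would point at the link rather than at DRBD.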
Re: [DRBD-user] Dual-primary/ Very slow synchronization
On 05/20/2011 10:12 AM, Daniel Meszaros wrote:
> I tried syncing with 10M last night and indeed it became faster than before, around 10 MB/s. Using "drbdsetup /dev/drbd0 syncer -r 20M" and so on I could increase the sync speed up to 70M ... then it interrupted and a new bitmap check started.

Is it possible that you're facing some sort of weird networking issue? You stated earlier that you were doing netio benchmarks. I trust that those exclusively used the link that DRBD uses as well?

If you can go to 50 MB/s, that should get you somewhere, even if it takes some time, no?

Regards,
Felix
Re: [DRBD-user] Dual-primary/ Very slow synchronization
On 20.05.2011 10:23, Felix Frank wrote:
> On 05/20/2011 10:12 AM, Daniel Meszaros wrote:
>> I tried syncing with 10M last night and indeed it became faster than before, around 10 MB/s. Using "drbdsetup /dev/drbd0 syncer -r 20M" and so on I could increase the sync speed up to 70M ... then it interrupted and a new bitmap check started.
>
> Is it possible that you're facing some sort of weird networking issue? You stated earlier that you were doing netio benchmarks. I trust that those exclusively used the link that DRBD uses as well?

Of course. But AFAIK netio just measures the network throughput, and as netio showed 200 MByte/s I don't expect the network to be to blame.

> If you can go to 50 MB/s, that should get you somewhere, even if it takes some time, no?

The first problem is that the sync interrupted after a while (it didn't just run slowly but continuously - it stalled at some point), and secondly it disturbs the virtual guests' services. Therefore I tend to search for some wrong setting related to the I/O of the disks. ;-)

CU, Daniel.
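Since disk I/O is now the suspect, benchmarking the backing device directly is a common next step. A sketch - the device path is a placeholder and the dd write is destructive, so the live command is left commented out:

```shell
# mb_per_s BYTES SECONDS -> integer MB/s, to compare dd runs between hosts
mb_per_s() {
  echo $(( $1 / 1048576 / $2 ))
}

# live usage (DESTRUCTIVE - point at a scratch device/LV, never at live data):
#   dd if=/dev/zero of=/dev/sdX bs=1M count=1024 oflag=direct
# then feed the byte count and elapsed seconds reported by dd into mb_per_s:
#   mb_per_s 1073741824 4    # -> 256
```

oflag=direct bypasses the page cache, so the number reflects what the RAID controller and disks actually sustain; if it comes out far below the ~200 MB/s the DRBD sync used to reach, the problem is local I/O rather than the network.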
Re: [DRBD-user] Dual-primary/ Very slow synchronization
On 05/19/2011 03:16 PM, Daniel Meszaros wrote:
> syncer {
>   al-extents 511;
>   rate 2048M;
> }

This is a lot. How much bandwidth do you have, anyway? Does this work? Does "drbdsetup /dev/drbd0 show" report that same syncer rate?

Regards,
Felix
Re: [DRBD-user] Dual-primary/ Very slow synchronization
On 19.05.2011 15:19, Felix Frank wrote:
> On 05/19/2011 03:16 PM, Daniel Meszaros wrote:
>> syncer {
>>   al-extents 511;
>>   rate 2048M;
>> }
>
> This is a lot. How much bandwidth do you have, anyway? Does this work? Does "drbdsetup /dev/drbd0 show" report that same syncer rate?

No, it is far less. But it worked at around 200 MB/s before ... with the same setup. And when running "drbdsetup /dev/drbd0 syncer -r 250M" the sync speed still did not improve. :-/

CU, Mészi.
Re: [DRBD-user] Dual-primary/ Very slow synchronization
On 05/19/2011 09:23 AM, Daniel Meszaros wrote:
> On 19.05.2011 15:19, Felix Frank wrote:
>> On 05/19/2011 03:16 PM, Daniel Meszaros wrote:
>>> syncer {
>>>   al-extents 511;
>>>   rate 2048M;
>>> }
>>
>> This is a lot. How much bandwidth do you have, anyway? Does this work? Does "drbdsetup /dev/drbd0 show" report that same syncer rate?
>
> No, it is far less. But it worked at around 200 MB/s before ... with the same setup. And when running "drbdsetup /dev/drbd0 syncer -r 250M" the sync speed still did not improve. :-/
>
> CU, Mészi.

Setting a sync speed greater than the actually possible speed can hurt performance. You said that you sustained ~200M before, right? Try setting the sync rate to 180M and see if that works any better.

--
Digimer
E-Mail: digi...@alteeve.com
AN!Whitepapers: http://alteeve.com
Node Assassin: http://nodeassassin.org
"I feel confined, only free to expand myself within boundaries."
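For reference, the corresponding drbd.conf fragment might look like this - a sketch, not a verified config; the 180M figure is the suggestion above and assumes the ~200 MB/s the link sustained before:

```
syncer {
  al-extents 511;
  rate 180M;   # keep below what the link and disks can actually sustain
}
```

The general guidance in the DRBD documentation is to keep the syncer rate well below the slower of replication bandwidth and disk throughput, so that application I/O is not starved during a resync.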
Re: [DRBD-user] Dual-primary/ Very slow synchronization
On 05/19/2011 03:23 PM, Daniel Meszaros wrote:
> And when running "drbdsetup /dev/drbd0 syncer -r 250M" the sync speed still did not improve.

What Digimer said - try 10MB for starters. Also, what *does* "drbdsetup show" report? Do the logs show any reaction to the syncer call?
Re: [DRBD-user] Dual-primary/ Very slow synchronization
Hi!

On 05/19/2011 03:48 PM, Digimer wrote:
> Setting a sync speed greater than the actually possible speed can hurt performance. You said that you sustained ~200M before, right? Try setting the sync rate to 180M and see if that works any better.

I set it to 180M as suggested, but unfortunately no improvement. The "show" parameter reported the same syncer rate that I had set. Then I switched back to 100M, but the speed is still around 5,500 K/sec. Any other config parameters that I could have forgotten? :-/

As I already wrote: the only thing that changed in the setup are the GBICs that I added to one of the two switches. And since I obviously wasn't informed by the system about the split-brain occurrence, I cannot say how long the issue has been present already. :-(

CU, Mészi.
Re: [DRBD-user] Dual-primary/ Very slow synchronization
On 05/19/2011 02:16 PM, Daniel Meszaros wrote:
> Hi!
>
> On 05/19/2011 03:48 PM, Digimer wrote:
>> Setting a sync speed greater than the actually possible speed can hurt performance. You said that you sustained ~200M before, right? Try setting the sync rate to 180M and see if that works any better.
>
> I set it to 180M as suggested, but unfortunately no improvement. The "show" parameter reported the same syncer rate that I had set. Then I switched back to 100M, but the speed is still around 5,500 K/sec. Any other config parameters that I could have forgotten? :-/
>
> As I already wrote: the only thing that changed in the setup are the GBICs that I added to one of the two switches. And since I obviously wasn't informed by the system about the split-brain occurrence, I cannot say how long the issue has been present already. :-(
>
> CU, Mészi.

As Felix stated, try 10M. If it gets up to that speed (and it can take a while, be patient), then bump it to 20M, etc.

It might be worth adding a log watcher that sends an email when:

block drbd2: Split-Brain detected but unresolved, dropping connection!

or any other occurrence of a split brain is found.

--
Digimer
E-Mail: digi...@alteeve.com
AN!Whitepapers: http://alteeve.com
Node Assassin: http://nodeassassin.org
"I feel confined, only free to expand myself within boundaries."
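A minimal version of such a watcher, assuming a syslog-style messages file and a working local `mail` command (both assumptions; the match string is the exact kernel message quoted above):

```shell
# is_split_brain LINE: succeed when a kernel log line reports a DRBD split brain
is_split_brain() {
  case $1 in
    *"Split-Brain detected"*) return 0 ;;
    *) return 1 ;;
  esac
}

# live usage (placeholders: log path and recipient address):
#   tail -Fn0 /var/log/messages | while read -r line; do
#     is_split_brain "$line" &&
#       printf '%s\n' "$line" | mail -s "DRBD split brain on $(hostname)" admin@example.com
#   done
```

Alternatively, DRBD's split-brain handler hook in drbd.conf can invoke a notification script directly, which avoids tailing the log at all.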