Re: [DRBD-user] Dual-primary/ Very slow synchronization

2011-05-25 Thread Daniel Meszaros

Hello again,

today I wiped the out-of-sync system, installed XenServer 5.6 SP2 from 
scratch, and compiled the latest version of DRBD. SP2 reserves 4 vCPUs 
for dom0, and I configured it to reserve 2 GB of RAM as well, instead of 
the default 768 MB.


But when I created the storage on the local DRBD device, the system 
almost hung, and the system log showed kernel timeout error messages 
from dom0.


Another idea, though without proof: although I am not sure why these 
issues only happen now and did not occur in the beginning, this reminds 
me a bit of IRQ resource conflicts in the Win98 days. I thought today's 
systems no longer worried about PCI latency and the like. Would it be 
worth trying to move the 10GbE NIC or the RAID controller card to 
other PCI Express slots, like we used to do ten years ago when 
something behaved strangely?


I am seriously thinking about leaving Citrix XenServer behind and trying 
some other distribution. I mean: this time I had not even synced to 
another machine yet, but was still working locally. So it cannot be a 
network-related issue; it must be something on the machine itself. :-/


Are problems like mine common, or am I the only black sheep 
experiencing this?


CU,
Mészi.
___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] Dual-primary/ Very slow synchronization

2011-05-24 Thread Daniel Meszaros

On 20.05.2011 10:22, Lars Ellenberg wrote:

According to you, it had been working before with the exact same
hardware and configuration.

Now if it does not anymore, then I strongly suspect network problems.
I am currently checking the network switches together with DELL 
support, just to rule out the 10GbE switches as a possible culprit.



Did you do some network benchmarks on the replication link recently?
Packet loss, excessive retransmits, checksum errors, bad cabling?
Port stats on the switch, any error counters?
Duplex or other auto-negotiating mismatch?

Use flood ping with both large (32k) and small packet sizes, iperf,
your favorite network integrity checker or benchmarking tool...

So far I have run these tests:

1. netio: up to 230 MByte/s
2. iperf: always below 80 Mbit/s

...neither affected the CPU usage stats in any significant way.

3. ping: 9k and 18k packets, simultaneously from/to both hosts.

...this led to high CPU peaks. The virtual guests' services were 
unavailable during that time, and dmesg showed timeout messages, the 
same log entries that I already posted.


At the end of this email I attach three log files from three machines, 
starting from the moment when I restarted the resync of the 
split-brained DRBD.


host1: The machine where the running virtual guests reside. Runs 
Xenserver 5.6.
host2: The machine that lost the DRBD sync. No running virtual machines. 
Runs Xenserver 5.6.
guest: The mailserver as an example for all other virtual guests. Runs 
CentOS 5.5. Resides on host1.


An explanation for the timeline:

13:41:42: DRBD resync started on host2
13:49:25: DRBD resync aborted by disconnecting on host2
13:51:44: DRBD disconnection still not finished, so I tried stopping the 
DRBD service on host2. After that was also refused, I rebooted host2.


While the DRBD resync attempt was in progress, the mailserver was not 
accessible on the network: neither its mail services, nor the webmail 
service, nor the SSH console. During that time, top showed very little 
CPU usage on host1 but around 95% on host2 (the drbd_receiver process).


Any further ideas?

I am already thinking about installing a fresh CentOS 5.6 or Debian 6 on 
host2, setting up DRBD as StandAlone, installing Xen, and moving (system 
as image, data as copy) everything onto that system ... then setting up 
host1 similarly and syncing everything onto it. I'd lose nice things 
like XenMotion, but I would possibly gain much more control over the 
resources. What's your opinion on that? Could it be that XenServer, my 
hardware setup, and my virtualization goals simply do not fit 
together? :-/


CU,
Daniel.

---
host1 /var/log/messages

May 24 13:41:45 localhost kernel: block drbd0: Handshake successful: 
Agreed network protocol version 94
May 24 13:41:45 localhost kernel: block drbd0: conn( WFConnection -> 
WFReportParams )
May 24 13:41:45 localhost kernel: block drbd0: Starting asender thread 
(from drbd0_receiver [12355])

May 24 13:41:45 localhost kernel: block drbd0: data-integrity-alg: crc32c
May 24 13:41:45 localhost kernel: block drbd0: drbd_sync_handshake:
May 24 13:41:45 localhost kernel: block drbd0: self 
06F5FB89BEF343B1:1B3487AE74F5DBA1:947F913C965D9046:622E658C94927BDE 
bits:1420811336 flags:0
May 24 13:41:45 localhost kernel: block drbd0: peer 
1B3487AE74F5DBA0::6CEB7B1CC746F358:42F748D007A8C8C5 
bits:3874219255 flags:0

May 24 13:41:45 localhost kernel: block drbd0: uuid_compare()=1 by rule 70
May 24 13:41:45 localhost kernel: block drbd0: Becoming sync source due 
to disk states.
May 24 13:41:45 localhost kernel: block drbd0: peer( Unknown -> 
Secondary ) conn( WFReportParams -> WFBitMapS )
May 24 13:41:47 localhost kernel: block drbd0: peer( Secondary -> 
Unknown ) conn( WFBitMapS -> TearDown )

May 24 13:41:47 localhost kernel: block drbd0: asender terminated
May 24 13:41:47 localhost kernel: block drbd0: Terminating asender thread
May 24 13:41:51 localhost kernel: block drbd0: sock_sendmsg returned -32
May 24 13:41:51 localhost kernel: block drbd0: short sent ReportBitMap 
size=4096 sent=408

May 24 13:41:51 localhost kernel: block drbd0: Connection closed
May 24 13:41:51 localhost kernel: block drbd0: conn( TearDown -> 
Unconnected )

May 24 13:41:51 localhost kernel: block drbd0: receiver terminated
May 24 13:41:51 localhost kernel: block drbd0: Restarting receiver thread
May 24 13:41:51 localhost kernel: block drbd0: receiver (re)started
May 24 13:41:51 localhost kernel: block drbd0: conn( Unconnected -> 
WFConnection )
May 24 13:42:11 localhost kernel: block drbd0: Handshake successful: 
Agreed network protocol version 94
May 24 13:42:11 localhost kernel: block drbd0: conn( WFConnection -> 
WFReportParams )
May 24 13:42:11 localhost kernel: block drbd0: Starting asender thread 
(from drbd0_receiver [12355])

May 24 13:42:11 localhost kernel: block drbd0: data-integrity-alg: crc32c
May 24 13:42:11 localhost kernel: block drbd0: drbd_sync_handshake:
May 24 13:42:11 localhost kernel: block drbd0: self 

Re: [DRBD-user] Dual-primary/ Very slow synchronization

2011-05-20 Thread Lars Ellenberg
On Fri, May 20, 2011 at 10:12:29AM +0200, Daniel Meszaros wrote:
 Hi!
 
 On 19.05.2011 21:16, Digimer wrote:
 As Felix stated, try 10M. If it gets up to that speed (and it can take
 a while, be patient), then bump it to 20M, etc.
 
 I tried syncing with 10M last night and indeed it became faster than
 before and was around 10MB/s. Using drbdsetup /dev/drbd0 syncer -r
 20M and so on I could increase the sync speed up to 70M ... then it
 interrupted and a new bitmap check started.
 
 While doing so I recognized some kernel error message in my virtual
 guests of the Xenserver:
 
 Linux:
 [ 2040.064272] INFO: task exim4:2336 blocked for more than 120 seconds.
 [ 2040.064281] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
 disables this message.
 
 Windows 2003 SBS:
 NTDS (432) NTDSA: A request to read from the file
 C:\WINDOWS\NTDS\ntds.dit at offset 4661248 (0x00472000) for a
 total of 8192 (0x2000) bytes succeeded, but took an abnormally
 long time (406 seconds) to be serviced by the operating system.
 In addition, 0 other I/O requests to this file have also taken an
 abnormally long time since the last message regarding this problem
 was sent 539 seconds ago. This problem is likely caused by faulty
 hardware. Please contact your hardware vendor for further
 assistance in diagnosing the problem.
 
 While this is happening these machines are not available in the
 network. Therefore I stopped the synchronization and shut down the
 machine that is out of sync ... which led back to normally working
 services.
 
 For some reason the DRBD sync appears to consume too much I/O
 bandwidth, even when running at 10M. I must admit that the last
 time I remember having done a full sync was before I had these
 machines set up and running.
 
 When searching Google I found messages from this list mentioning
 changing scst vdisk_blockio to vdisk_fileio; however, I do not
 use SCST.
 
 Any ideas for non-SCST systems? :-/

According to you, it had been working before with the exact same
hardware and configuration.

Now if it does not anymore, then I strongly suspect network problems.

Did you do some network benchmarks on the replication link recently?
Packet loss, excessive retransmits, checksum errors, bad cabling?
Port stats on the switch, any error counters?
Duplex or other auto-negotiating mismatch?

Use flood ping with both large (32k) and small packet sizes, iperf, 
your favorite network integrity checker or benchmarking tool...
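For concreteness, the checks suggested above might be run roughly like this. This is a sketch: the peer address and NIC name are assumptions, and the `run` helper only prints each command so the example stays side-effect-free; drop `run` to execute the checks for real.

```shell
# Dry-run sketch of the suggested replication-link checks.
# PEER and NIC are hypothetical; adjust for the real setup.
PEER=192.168.1.2   # assumed replication-link peer address
NIC=eth1           # assumed replication NIC

run() { echo "+ $*"; }   # stand-in: print instead of execute

run ping -f -c 1000 -s 56    "$PEER"   # flood ping, small packets
run ping -f -c 1000 -s 32768 "$PEER"   # flood ping, large (32k) packets
run iperf -c "$PEER" -t 30             # throughput (run "iperf -s" on the peer)
run ethtool -S "$NIC"                  # NIC error/drop counters
run ethtool "$NIC"                     # negotiated speed/duplex
```

Switch port statistics would still need to be read on the switch itself.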


-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


Re: [DRBD-user] Dual-primary/ Very slow synchronization

2011-05-20 Thread Felix Frank
On 05/20/2011 10:12 AM, Daniel Meszaros wrote:
 I tried syncing with 10M last night and indeed it became faster than
 before and was around 10MB/s. Using drbdsetup /dev/drbd0 syncer -r 20M
 and so on I could increase the sync speed up to 70M ... then it
 interrupted and a new bitmap check started.

Is it possible that you're facing some sort of weird networking issues?

You stated earlier that you were doing netio benchmarks. I trust that
those exclusively used the link that DRBD uses as well?

If you can go to 50MBps, that should get you somewhere, even if it takes
some time, no?

Regards,
Felix


Re: [DRBD-user] Dual-primary/ Very slow synchronization

2011-05-20 Thread Daniel Meszaros

On 20.05.2011 10:23, Felix Frank wrote:

On 05/20/2011 10:12 AM, Daniel Meszaros wrote:

I tried syncing with 10M last night and indeed it became faster than
before and was around 10MB/s. Using drbdsetup /dev/drbd0 syncer -r 20M
and so on I could increase the sync speed up to 70M ... then it
interrupted and a new bitmap check started.

Is it possible that you're facing some sort of weird networking issues?

You stated earlier that you were doing netio benchmarks. I trust that
those exclusively used the link that DRBD uses as well?
Of course. But AFAIK netio just measures the network throughput. And 
as netio showed > 200 MByte/s, I don't expect the network to be to blame.



If you can go to 50MBps, that should get you somewhere, even if it takes
some time, no?
The first problem is that it interrupted after a while (it did not just 
run slowly but continuously; it actually stalled), and secondly it 
disturbs the virtual guests' services. Therefore I tend to look for 
some wrong setting related to the disks' I/O. ;-)


CU,
Daniel.



Re: [DRBD-user] Dual-primary/ Very slow synchronization

2011-05-19 Thread Felix Frank
On 05/19/2011 03:16 PM, Daniel Meszaros wrote:
 syncer {
 al-extents 511;
 rate 2048M;
 }

This is a lot. How much bandwidth do you have, anyway?

Does this work? Does drbdsetup /dev/drbd0 show report that same syncer
rate?

Regards,
Felix


Re: [DRBD-user] Dual-primary/ Very slow synchronization

2011-05-19 Thread Daniel Meszaros

On 19.05.2011 15:19, Felix Frank wrote:

On 05/19/2011 03:16 PM, Daniel Meszaros wrote:

 syncer {
 al-extents 511;
 rate 2048M;
 }

This is a lot. How much bandwidth do you have, anyway?

Does this work? Does drbdsetup /dev/drbd0 show report that same syncer
rate?


No, it is far less. But it worked at around 200 MB/s before ... with 
the same setup. And when running drbdsetup /dev/drbd0 syncer -r 250M, 
the sync speed still did not improve. :-/


CU,
Mészi.



Re: [DRBD-user] Dual-primary/ Very slow synchronization

2011-05-19 Thread Digimer
On 05/19/2011 09:23 AM, Daniel Meszaros wrote:
 Am 19.05.2011 15:19, schrieb Felix Frank:
 On 05/19/2011 03:16 PM, Daniel Meszaros wrote:
  syncer {
  al-extents 511;
  rate 2048M;
  }
 This is a lot. How much bandwidth do you have, anyway?

 Does this work? Does drbdsetup /dev/drbd0 show report that same syncer
 rate?
 
 No, it is far less. But it worked with around 200 MB/s before ... with
 the same setup. And when running drbdsetup /dev/drbd0 syncer -r 250M
 the sync speed still did not improve. :-/
 
 CU,
 Mészi.

Setting a sync speed greater than the actual possible speed can hurt
performance. You said that you sustained ~200M before, right? Try
setting the sync rate to 180M and see if that works any better.
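In drbd.conf terms, that advice would look something like the fragment below. This is a sketch: only the rate line comes from the thread, and the surrounding syncer section mirrors the one posted earlier.

```
syncer {
    rate 180M;       # a bit below the ~200M the link sustained before
    al-extents 511;
}
```

The rate can also be changed at runtime, as already done in this thread, with `drbdsetup /dev/drbd0 syncer -r 180M`.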

-- 
Digimer
E-Mail: digi...@alteeve.com
AN!Whitepapers: http://alteeve.com
Node Assassin:  http://nodeassassin.org
I feel confined, only free to expand myself within boundaries.


Re: [DRBD-user] Dual-primary/ Very slow synchronization

2011-05-19 Thread Felix Frank
On 05/19/2011 03:23 PM, Daniel Meszaros wrote:
 And when running drbdsetup /dev/drbd0 syncer -r 250M the sync speed
 still did not improve.

What Digimer said - try 10MB for starters.
Also, what *does* drbdsetup show report? Do the logs show any reaction
to the syncer call?
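To make that check concrete, here is a sketch of pulling the effective rate out of the `show` output. The sample fragment below is made up for illustration (the exact format varies by DRBD version), so it is written to a temp file rather than taken from a live device.

```shell
# Illustrative only: a hypothetical `drbdsetup /dev/drbd0 show` fragment,
# saved to a file so the grep can be demonstrated without a DRBD device.
cat > /tmp/drbd-show-sample.txt <<'EOF'
syncer {
        rate             10240k; # effective syncer rate, i.e. 10M
        al-extents       511;
}
EOF

# On a live system: drbdsetup /dev/drbd0 show | grep rate
grep 'rate' /tmp/drbd-show-sample.txt
```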


Re: [DRBD-user] Dual-primary/ Very slow synchronization

2011-05-19 Thread Daniel Meszaros

Hi!

On 05/19/2011 03:48 PM, Digimer wrote:

Setting a sync speed greater than the actual possible speed can hurt
performance. You said that you sustained ~200M before, right? Try
setting the sync rate to 180M and see if that works any better.


I set it to 180M as suggested, but unfortunately there was no improvement.

drbdsetup show reported the same syncer rate that I had set.

Then I switched back to 100M, but the speed is still around 5,500 K/sec.

Any other config parameters that I could have forgotten? :-/

As I already wrote: the only thing that changed in the setup is the 
GBICs that I added to one of the two switches. OK, since I obviously 
was not informed by the system about the split-brain occurrence, I 
cannot say how long the issue has already persisted. :-(


CU,
Mészi.


Re: [DRBD-user] Dual-primary/ Very slow synchronization

2011-05-19 Thread Digimer
On 05/19/2011 02:16 PM, Daniel Meszaros wrote:
 Hi!
 
 On 05/19/2011 03:48 PM, Digimer wrote:
 Setting a sync speed greater than the actual possible speed can hurt
 performance. You said that you sustained ~200M before, right? Try
 setting the sync rate to 180M and see if that works any better.
 
 I set it to 180M like suggested but no improvement unfortunately.
 
 The show parameter showed the same syncer rate that I set up.
 
 Then I switched back to 100M but the speed is still around 5,500 K/sec.
 
 Any other config parameters that I could have forgotten? :-/
 
 As I already wrote: The only thing that changed in the setup are the
 GBICs that I added to one of both switches. Ok, as I obviously haven't
 been informed by the system about the split-brain occurrence I cannot
 say how long the issue has been persisting already. :-(
 
 CU,
 Mészi.

As Felix stated, try 10M. If it gets up to that speed (and it can take
a while, be patient), then bump it to 20M, etc.

It might be worth adding a log watcher that sends an email when

block drbd2: Split-Brain detected but unresolved, dropping connection!

or any other occurrence of split-brain is found.
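A minimal watcher along those lines might look like this. It is a sketch: the alert action and log path are placeholders, and the sample log entry is the message quoted above; a real deployment would run it from cron or a log pipeline.

```shell
# Minimal split-brain watcher sketch. Log path, mail command, and
# recipient are placeholders, not from the thread.
check_splitbrain() {
    # prints matching lines and returns success if any are found
    grep 'Split-Brain detected' "$1"
}

# sample log entry (the message quoted above), used for demonstration
cat > /tmp/kern-sample.log <<'EOF'
block drbd2: Split-Brain detected but unresolved, dropping connection!
EOF

if check_splitbrain /tmp/kern-sample.log >/dev/null; then
    echo "ALERT: DRBD split-brain detected"
    # mail -s 'DRBD split-brain' admin@example.com < /tmp/kern-sample.log  # placeholder
fi
```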

-- 
Digimer
E-Mail: digi...@alteeve.com
AN!Whitepapers: http://alteeve.com
Node Assassin:  http://nodeassassin.org
I feel confined, only free to expand myself within boundaries.