Re: [PVE-User] Cluster is functioning properly but showing all nodes as OFFLINE on web GUI

2016-12-18 Thread Szabolcs F.
Hi,

I've had a similar issue. Someone kindly suggested that I set the 'token'
value to 4000 in corosync.conf.

/etc/pve/corosync.conf

totem {
  cluster_name: x
  config_version: 35
  ip_version: ipv4
  version: 2
  token: 4000

  interface {
bindnetaddr: X.X.X.X
ringnumber: 0
  }

}

Then do this on all nodes:

killall -9 corosync
/etc/init.d/pve-cluster restart
service pveproxy restart


This solved the similar problem for me, and my cluster of 12 nodes has been
working properly ever since.
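
If you want to double-check that all nodes rejoined after the restart,
something like this should be enough (standard PVE 4.x tooling, run on any
node):

pvecm status
corosync-cfgtool -s
systemctl status pve-cluster pveproxy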

On Sun, Dec 18, 2016 at 10:04 AM, Tom  wrote:

> pvecm status runs fine, showing everything is okay, and the only storage
> that's there is the local /var/lib/vz
>
> Thanks
>
>
> On Sun, 18 Dec 2016 at 08:02, Dietmar Maurer  wrote:
>
> > > Does anyone have any solutions/pointers?
> >
> >
> >
> > And "pvesm status" runs without any delay?
> >
> >
> >
> > # pvesm status
> >
> >
> >
> > Or is there a storage which hangs?
> >
> >
> >
___
pve-user mailing list
pve-user@pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user


Re: [PVE-User] Boot issue

2016-11-09 Thread Szabolcs F.
On Wed, Nov 9, 2016 at 10:02 AM, Fabian Grünbichler <
f.gruenbich...@proxmox.com> wrote:

> On Wed, Nov 09, 2016 at 09:46:13AM +0100, Szabolcs F. wrote:
> > It feels as if the disks weren't ready when grub is trying to mount the
> > LVM volumes. Any ideas how to fix this? Maybe adding some other type of
> > wait to the grub config?
> >
>
> "rootdelay" is probably what you are looking for (man bootparam):
> "This parameter sets the delay (in seconds) to pause before attempting
> to mount the root filesystem."
>
Hi Fabian,

thank you! I think the rootdelay was the thing I needed. I'll test &
confirm.
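
For the record, roughly what I plan to try (assuming the usual Debian
/etc/default/grub layout; the 10-second value is just a first guess):

# in /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet rootdelay=10"

# then regenerate the grub config
update-grub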


>
> ___
> pve-user mailing list
> pve-user@pve.proxmox.com
> http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>
___
pve-user mailing list
pve-user@pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user


[PVE-User] Boot issue

2016-11-09 Thread Szabolcs F.
Hello All,

I've got an interesting issue booting my Dell C6220 servers (12 nodes in a
PVE 4.3 cluster, but that is probably unrelated). Some of them fail to
boot with the 'unable to find lvm volume pve/root' message, but if I give
them a few reboots they eventually boot successfully. There's no consistent
way to reproduce the issue: sometimes the servers boot at the first attempt,
sometimes I need to reboot them up to five times.

I've added the lvmwait option to the GRUB_CMDLINE_LINUX_DEFAULT variable in
/etc/default/grub, updated the grub config and the initrd, and confirmed the
option appears in the /boot/grub/grub.cfg file. But this doesn't help; the
same thing keeps happening.
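
For completeness, the steps I used were roughly these (standard Debian 8
grub/initramfs tooling; the exact lvmwait value is left out here):

vi /etc/default/grub         # add lvmwait=... to GRUB_CMDLINE_LINUX_DEFAULT
update-grub
update-initramfs -u -k all
grep lvmwait /boot/grub/grub.cfg   # confirm the option landed in grub.cfg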

It feels as if the disks weren't ready when grub is trying to mount the
LVM volumes. Any ideas how to fix this? Maybe adding some other type of
wait to the grub config?

Thanks,
Szabolcs
___
pve-user mailing list
pve-user@pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user


Re: [PVE-User] How to move VM from dead node?

2016-11-07 Thread Szabolcs F.
Hello again,

moving the /etc/pve/nodes/pve11/qemu-server/*.conf files to another node
worked well. I didn't have to restart any services.
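
For anyone finding this thread later, it boils down to running this on one of
the surviving nodes (pve02 is only an example target):

mv /etc/pve/nodes/pve11/qemu-server/*.conf /etc/pve/nodes/pve02/qemu-server/

Since /etc/pve is the cluster-wide pmxcfs filesystem, the VMs show up under
the target node right away.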

Thanks again!

On Mon, Nov 7, 2016 at 11:41 AM, Szabolcs F. <subc...@gmail.com> wrote:

> Hi All,
>
> thanks for all your comments.
>
> Yes, I've got shared storage.
>
>> All my VMs are stored on NAS servers, so a failing Proxmox node is not an
>> issue from this point of view, I can still access the VM files. All my 12
>> PVE nodes access the storage with NFS.
>
> I'll try to move the .conf files and see how it works. Thanks again!
>
>
> On Mon, Nov 7, 2016 at 11:37 AM, Fabrizio Cuseo <f.cu...@panservice.it>
> wrote:
>
>> You can simply move from any of the running nodes the files in
>> /etc/pve/nodes/name-of-dead-host/qemu-server and move the *conf files to
>> /etc/pve/qemu-server
>> Your VMs will appear in that node.
>>
>> PS: if you can repair your dead node without reinstalling it, delete the
>> files in /etc/pve/qemu-server and 
>> /etc/pve/nodes/name-of-dead-host/qemu-server
>> before connecting to the cluster; i have never done it, but I think that is
>> the right way.
>>
>> Regards, Fabrizio
>>
>>
>>
>> - Il 7-nov-16, alle 11:16, Szabolcs F. subc...@gmail.com ha scritto:
>>
>> > Hello All,
>> >
>> > I've got a Proxmox VE 4.3 cluster (no subscription) of 12 Dell C6220
>> nodes.
>> >
>> > My question is: how do I move a VM from a dead node? Let's say my pve11
>> > dies (hardware issue), but the other 11 nodes are still up In
>> this
>> > case I can't migrate VMs off of pve11, because I get the 'no route to
>> host'
>> > issue. I can only see the VM ID of the VMs that should be running on
>> pve11.
>> > But I want to move the VMs to the working nodes until I can fix the
>> > hardware issue.
>> >
>> > All my VMs are stored on NAS servers, so a failing Proxmox node is not
>> an
>> > issue from this point of view, I can still access the VM files. All my
>> 12
>> > PVE nodes access the storage with NFS.
>> >
>> > Thanks in advance!
>> > ___
>> > pve-user mailing list
>> > pve-user@pve.proxmox.com
>> > http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>>
>> --
>> ---
>> Fabrizio Cuseo - mailto:f.cu...@panservice.it
>> Direzione Generale - Panservice InterNetWorking
>> Servizi Professionali per Internet ed il Networking
>> Panservice e' associata AIIP - RIPE Local Registry
>> Phone: +39 0773 410020 - Fax: +39 0773 470219
>> http://www.panservice.it  mailto:i...@panservice.it
>> Numero verde nazionale: 800 901492
>> ___
>> pve-user mailing list
>> pve-user@pve.proxmox.com
>> http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>>
>
>
___
pve-user mailing list
pve-user@pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user


Re: [PVE-User] How to move VM from dead node?

2016-11-07 Thread Szabolcs F.
Hi All,

thanks for all your comments.

Yes, I've got shared storage.

> All my VMs are stored on NAS servers, so a failing Proxmox node is not an
> issue from this point of view, I can still access the VM files. All my 12
> PVE nodes access the storage with NFS.

I'll try to move the .conf files and see how it works. Thanks again!


On Mon, Nov 7, 2016 at 11:37 AM, Fabrizio Cuseo <f.cu...@panservice.it>
wrote:

> You can simply move from any of the running nodes the files in
> /etc/pve/nodes/name-of-dead-host/qemu-server and move the *conf files to
> /etc/pve/qemu-server
> Your VMs will appear in that node.
>
> PS: if you can repair your dead node without reinstalling it, delete the
> files in /etc/pve/qemu-server and /etc/pve/nodes/name-of-dead-host/qemu-server
> before connecting to the cluster; i have never done it, but I think that is
> the right way.
>
> Regards, Fabrizio
>
>
>
> - Il 7-nov-16, alle 11:16, Szabolcs F. subc...@gmail.com ha scritto:
>
> > Hello All,
> >
> > I've got a Proxmox VE 4.3 cluster (no subscription) of 12 Dell C6220
> nodes.
> >
> > My question is: how do I move a VM from a dead node? Let's say my pve11
> > dies (hardware issue), but the other 11 nodes are still up In
> this
> > case I can't migrate VMs off of pve11, because I get the 'no route to
> host'
> > issue. I can only see the VM ID of the VMs that should be running on
> pve11.
> > But I want to move the VMs to the working nodes until I can fix the
> > hardware issue.
> >
> > All my VMs are stored on NAS servers, so a failing Proxmox node is not an
> > issue from this point of view, I can still access the VM files. All my 12
> > PVE nodes access the storage with NFS.
> >
> > Thanks in advance!
> > ___
> > pve-user mailing list
> > pve-user@pve.proxmox.com
> > http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>
> --
> ---
> Fabrizio Cuseo - mailto:f.cu...@panservice.it
> Direzione Generale - Panservice InterNetWorking
> Servizi Professionali per Internet ed il Networking
> Panservice e' associata AIIP - RIPE Local Registry
> Phone: +39 0773 410020 - Fax: +39 0773 470219
> http://www.panservice.it  mailto:i...@panservice.it
> Numero verde nazionale: 800 901492
> ___
> pve-user mailing list
> pve-user@pve.proxmox.com
> http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>
___
pve-user mailing list
pve-user@pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user


[PVE-User] How to move VM from dead node?

2016-11-07 Thread Szabolcs F.
Hello All,

I've got a Proxmox VE 4.3 cluster (no subscription) of 12 Dell C6220 nodes.

My question is: how do I move a VM from a dead node? Let's say my pve11
dies (hardware issue), but the other 11 nodes are still up. In this
case I can't migrate VMs off of pve11, because I get the 'no route to host'
issue. I can only see the VM ID of the VMs that should be running on pve11.
But I want to move the VMs to the working nodes until I can fix the
hardware issue.

All my VMs are stored on NAS servers, so a failing Proxmox node is not an
issue from this point of view, I can still access the VM files. All my 12
PVE nodes access the storage with NFS.

Thanks in advance!
___
pve-user mailing list
pve-user@pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user


Re: [PVE-User] Proxmox 4.3 cluster issue

2016-11-02 Thread Szabolcs F.
Hi All,

just confirming that since I've added the 'token: 4000' to my
corosync.conf, my cluster has been working fine (4 days so far).
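
If anyone wants to verify that corosync actually picked the new value up at
runtime, dumping its object database and grepping should do (I don't recall
the exact key name off-hand):

corosync-cmapctl | grep -i token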

Thanks again to everyone helping me!

On Sat, Oct 29, 2016 at 2:22 PM, Szabolcs F. <subc...@gmail.com> wrote:

> Hi Alexandre,
>
> thanks so much for the tip about killing corosync and restarting the
> pve-cluster service. Previously I had tried killing many different processes
> and also tried a clean pve-cluster restart (without killing processes), but
> none of these worked.
> Your tip worked: the cluster came back without having to power down all
> of my nodes.
>
> Now I'll try to change the corosync.conf values and see if it makes the
> cluster more stable.
>
> Thanks again!
>
> On Sat, Oct 29, 2016 at 9:26 AM, Alexandre DERUMIER <aderum...@odiso.com>
> wrote:
>
>> Also, you can try to increase token value in
>>
>> /etc/pve/corosync.conf
>>
>> here mine:
>>
>>
>> totem {
>>   cluster_name: x
>>   config_version: 35
>>   ip_version: ipv4
>>   version: 2
>>   token: 4000
>>
>>   interface {
>> bindnetaddr: X.X.X.X
>> ringnumber: 0
>>   }
>>
>> }
>>
>>
>> (increase the config_version +1 before save the file)
>>
>>
>>
>> Without token value, I'm able to reproduce exactly your corosync error.
>>
>>
>>
>> - Mail original -
>> De: "aderumier" <aderum...@odiso.com>
>> À: "proxmoxve" <pve-user@pve.proxmox.com>
>> Envoyé: Samedi 29 Octobre 2016 09:20:19
>> Objet: Re: [PVE-User] Promox 4.3 cluster issue
>>
>> What you can do is to try killing corosync on all nodes && then start
>> it node by node,
>> to see when the problem begins to occur
>>
>>
>> if you want to restart corosync on all nodes, you can do a
>>
>> on each node
>> 
>> #killall -9 corosync
>>
>> then on each node
>> -
>> /etc/init.d/pve-cluster restart (this will restart corosync && pmxcfs to
>> mount /etc/pve)
>>
>>
>> In the past, I have found that 1 "slow" node in my cluster (Opteron, 64 cores,
>> 2.1 GHz), among 15 nodes (Intel, 40 cores, 3.1 GHz),
>> gave me this kind of problem.
>>
>> you have already 12 nodes, so network latency could impact corosync speed.
>>
>> I'm currently running a 16 nodes cluster with this latency
>>
>> rtt min/avg/max/mdev = 0.050/0.070/0.079/0.010 ms
>>
>>
>>
>>
>>
>> - Mail original -
>> De: "Szabolcs F." <subc...@gmail.com>
>> À: "proxmoxve" <pve-user@pve.proxmox.com>
>> Envoyé: Vendredi 28 Octobre 2016 17:30:49
>> Objet: Re: [PVE-User] Promox 4.3 cluster issue
>>
>> Hi Alexandre,
>>
>> please find my logs here. From three different nodes just to see if
>> there's
>> any difference.
>>
>> pve01 node : http://pastebin.com/M14R0WBc
>> pve02 node : http://pastebin.com/q1kW07xs
>> pve09 node (totem) : http://pastebin.com/CpZd6dmn
>>
>> omping gives me similar results on all nodes:
>> http://pastebin.com/s4H92Scg
>>
>>
>> Thanks!
>>
>>
>> On Fri, Oct 28, 2016 at 3:55 PM, Alexandre DERUMIER <aderum...@odiso.com>
>> wrote:
>>
>> > can you send your corosync log in /var/log/daemon.log ?
>> >
>> >
>> > - Mail original -
>> > De: "Szabolcs F." <subc...@gmail.com>
>> > À: "Michael Rasmussen" <m...@miras.org>
>> > Cc: "proxmoxve" <pve-user@pve.proxmox.com>
>> > Envoyé: Vendredi 28 Octobre 2016 15:40:06
>> > Objet: Re: [PVE-User] Promox 4.3 cluster issue
>> >
>> > Hi All,
>> >
>> > my issue came back. So it wasn't related to having Proxmox 4.2 on 4
>> nodes
>> > and Proxmox 4.3 on the other 8 nodes.
>> >
>> > Now for example if I log into the web UI of my first node all the 11
>> other
>> > nodes are marked with the red cross. But if I click on a node I can
>> still
>> > see the summary (uptime, load, etc), still can get a shell on other
>> nodes.
>> > But I can't see the name/status of virtual machines running on the red
>> > crossed nodes (I can only see the VM ID/number). And of course I can't
> > migrate any VM from one host to another.
>> >
>> > Any ideas?
>> >
>> > Thanks!
>> >
>> > On Wed, Oct 26, 2016 at 12:57 PM, Szabolcs F

Re: [PVE-User] Proxmox 4.3 cluster issue

2016-10-29 Thread Szabolcs F.
Hi Alexandre,

thanks so much for the tip about killing corosync and restarting the
pve-cluster service. Previously I had tried killing many different processes
and also tried a clean pve-cluster restart (without killing processes), but
none of these worked.
Your tip worked: the cluster came back without having to power down all
of my nodes.

Now I'll try to change the corosync.conf values and see if it makes the
cluster more stable.

Thanks again!

On Sat, Oct 29, 2016 at 9:26 AM, Alexandre DERUMIER <aderum...@odiso.com>
wrote:

> Also, you can try to increase token value in
>
> /etc/pve/corosync.conf
>
> here mine:
>
>
> totem {
>   cluster_name: x
>   config_version: 35
>   ip_version: ipv4
>   version: 2
>   token: 4000
>
>   interface {
> bindnetaddr: X.X.X.X
> ringnumber: 0
>   }
>
> }
>
>
> (increase the config_version +1 before save the file)
>
>
>
> Without token value, I'm able to reproduce exactly your corosync error.
>
>
>
> - Mail original -
> De: "aderumier" <aderum...@odiso.com>
> À: "proxmoxve" <pve-user@pve.proxmox.com>
> Envoyé: Samedi 29 Octobre 2016 09:20:19
> Objet: Re: [PVE-User] Promox 4.3 cluster issue
>
> What you can do is to try killing corosync on all nodes && then start it
> node by node,
> to see when the problem begins to occur
>
>
> if you want to restart corosync on all nodes, you can do a
>
> on each node
> 
> #killall -9 corosync
>
> then on each node
> -
> /etc/init.d/pve-cluster restart (this will restart corosync && pmxcfs to
> mount /etc/pve)
>
>
> In the past, I have found that 1 "slow" node in my cluster (Opteron, 64 cores,
> 2.1 GHz), among 15 nodes (Intel, 40 cores, 3.1 GHz),
> gave me this kind of problem.
>
> you have already 12 nodes, so network latency could impact corosync speed.
>
> I'm currently running a 16 nodes cluster with this latency
>
> rtt min/avg/max/mdev = 0.050/0.070/0.079/0.010 ms
>
>
>
>
>
> - Mail original -
> De: "Szabolcs F." <subc...@gmail.com>
> À: "proxmoxve" <pve-user@pve.proxmox.com>
> Envoyé: Vendredi 28 Octobre 2016 17:30:49
> Objet: Re: [PVE-User] Promox 4.3 cluster issue
>
> Hi Alexandre,
>
> please find my logs here. From three different nodes just to see if there's
> any difference.
>
> pve01 node : http://pastebin.com/M14R0WBc
> pve02 node : http://pastebin.com/q1kW07xs
> pve09 node (totem) : http://pastebin.com/CpZd6dmn
>
> omping gives me similar results on all nodes: http://pastebin.com/s4H92Scg
>
>
> Thanks!
>
>
> On Fri, Oct 28, 2016 at 3:55 PM, Alexandre DERUMIER <aderum...@odiso.com>
> wrote:
>
> > can you send your corosync log in /var/log/daemon.log ?
> >
> >
> > - Mail original -
> > De: "Szabolcs F." <subc...@gmail.com>
> > À: "Michael Rasmussen" <m...@miras.org>
> > Cc: "proxmoxve" <pve-user@pve.proxmox.com>
> > Envoyé: Vendredi 28 Octobre 2016 15:40:06
> > Objet: Re: [PVE-User] Promox 4.3 cluster issue
> >
> > Hi All,
> >
> > my issue came back. So it wasn't related to having Proxmox 4.2 on 4 nodes
> > and Proxmox 4.3 on the other 8 nodes.
> >
> > Now for example if I log into the web UI of my first node all the 11
> other
> > nodes are marked with the red cross. But if I click on a node I can still
> > see the summary (uptime, load, etc), still can get a shell on other
> nodes.
> > But I can't see the name/status of virtual machines running on the red
> > crossed nodes (I can only see the VM ID/number). And of course I can't
> > migrate any VM from one host to another.
> >
> > Any ideas?
> >
> > Thanks!
> >
> > On Wed, Oct 26, 2016 at 12:57 PM, Szabolcs F. <subc...@gmail.com> wrote:
> >
> > > Hello again,
> > >
> > > sorry for another followup. I just realised that 4 of the 12 cluster
> > nodes
> > > still have PVE Manager version 4.2 and the other 8 nodes have version
> > 4.3.
> > > Can this be the reason of all my troubles?
> > >
> > > I'm in the process of updating these 4 nodes. These 4 nodes were
> > installed
> > > with the Proxmox install media, but the other 8 nodes were installed
> with
> > > Debian 8 first. So the 4 outdated nodes didn't have the 'deb
> > > http://download.proxmox.com/debian jessie pve-no-subscription' repo
> > file.
> > > Adding this repo made the 4.3 updates available.
> > >
> > >
>

Re: [PVE-User] Proxmox 4.3 cluster issue

2016-10-28 Thread Szabolcs F.
Hi Alexandre,

please find my logs here. From three different nodes just to see if there's
any difference.

pve01 node : http://pastebin.com/M14R0WBc
pve02 node : http://pastebin.com/q1kW07xs
pve09 node (totem) : http://pastebin.com/CpZd6dmn

omping gives me similar results on all nodes: http://pastebin.com/s4H92Scg
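
For reference, the invocation is more or less what the multicast
troubleshooting wiki suggests, along these lines (hostnames shortened; run it
on all nodes in parallel):

omping -c 600 -i 1 -q pve01 pve02 pve09 ...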


Thanks!


On Fri, Oct 28, 2016 at 3:55 PM, Alexandre DERUMIER <aderum...@odiso.com>
wrote:

> can you send your corosync log in /var/log/daemon.log ?
>
>
> - Mail original -----
> De: "Szabolcs F." <subc...@gmail.com>
> À: "Michael Rasmussen" <m...@miras.org>
> Cc: "proxmoxve" <pve-user@pve.proxmox.com>
> Envoyé: Vendredi 28 Octobre 2016 15:40:06
> Objet: Re: [PVE-User] Promox 4.3 cluster issue
>
> Hi All,
>
> my issue came back. So it wasn't related to having Proxmox 4.2 on 4 nodes
> and Proxmox 4.3 on the other 8 nodes.
>
> Now for example if I log into the web UI of my first node all the 11 other
> nodes are marked with the red cross. But if I click on a node I can still
> see the summary (uptime, load, etc), still can get a shell on other nodes.
> But I can't see the name/status of virtual machines running on the red
> crossed nodes (I can only see the VM ID/number). And of course I can't
> migrate any VM from one host to another.
>
> Any ideas?
>
> Thanks!
>
> On Wed, Oct 26, 2016 at 12:57 PM, Szabolcs F. <subc...@gmail.com> wrote:
>
> > Hello again,
> >
> > sorry for another followup. I just realised that 4 of the 12 cluster
> nodes
> > still have PVE Manager version 4.2 and the other 8 nodes have version
> 4.3.
> > Can this be the reason of all my troubles?
> >
> > I'm in the process of updating these 4 nodes. These 4 nodes were
> installed
> > with the Proxmox install media, but the other 8 nodes were installed with
> > Debian 8 first. So the 4 outdated nodes didn't have the 'deb
> > http://download.proxmox.com/debian jessie pve-no-subscription' repo
> file.
> > Adding this repo made the 4.3 updates available.
> >
> >
> >
> > On Wed, Oct 26, 2016 at 12:20 PM, Szabolcs F. <subc...@gmail.com> wrote:
> >
> >> Hi Michael,
> >>
> >> I can change to LACP, sure. Would it be better than simple
> active-backup?
> >> I haven't got too much experience with LACP though.
> >>
> >> On Wed, Oct 26, 2016 at 11:55 AM, Michael Rasmussen <m...@miras.org>
> >> wrote:
> >>
> >>> Is it possible to switch to 802.3ad bond mode?
> >>>
> >>> On October 26, 2016 11:12:06 AM GMT+02:00, "Szabolcs F." <
> >>> subc...@gmail.com> wrote:
> >>>
> >>>> Hi Lutz,
> >>>>
> >>>> my bondXX files look like this: http://pastebin.com/GX8x3ZaN
> >>>> and my corosync.conf : http://pastebin.com/2ss0AAEr
> >>>>
> >>>> Multicast is enabled on my switches.
> >>>>
> >>>> The problem is I don't have a way to replicate the problem, it seems to
> >>>> happen randomly, so I'm unsure how to do more tests. At the moment my
> >>>> cluster is working fine for about 16 hours. Any ideas on forcing the
> >>>> issue?
> >>>>
> >>>> Thanks,
> >>>> Szabolcs
> >>>>
> >>>> On Wed, Oct 26, 2016 at 9:17 AM, Lutz Willek <
> l.wil...@science-computing.de>
> >>>> wrote:
> >>>>
> >>>> Am 24.10.2016 um 15:16 schrieb Szabolcs F.:
> >>>>>
> >>>>> Corosync has a lot of
> >>>>>> these in the /var/logs/daemon.log :
> >>>>>> http://pastebin.com/ajhE8Rb9
> >>>>>
> >>>>>
> >>>>>
> >>>>> please carefully check your (node/switch/multicast) network
> configuration,
> >>>>> and please paste your corosync configuration file and output of
> >>>>> /proc/net/bonding/bondXX
> >>>>>
> >>>>> just a guess:
> >>>>>
> >>>>> * powerdown 1/3 - 1/2 of your nodes, adjust quorum (pvecm expect)
> >>>>> --> Problems still occours?
> >>>>>
> >>>>> * during "problem time"
> >>>>> --> omping is still ok?
> >>>>>
> >>>>> https://pve.proxmox.com/wiki/Troubleshooting_multicast,_quor
> >>>>> um_and_cluster_issues
> >>>>>
> >>>>>
> >>>>> Freundliche Grüß

Re: [PVE-User] Proxmox 4.3 cluster issue

2016-10-28 Thread Szabolcs F.
Hi All,

my issue came back. So it wasn't related to having Proxmox 4.2 on 4 nodes
and Proxmox 4.3 on the other 8 nodes.

Now, for example, if I log into the web UI of my first node, all 11 other
nodes are marked with a red cross. But if I click on a node I can still
see the summary (uptime, load, etc.) and can still get a shell on other nodes.
But I can't see the name/status of virtual machines running on the
red-crossed nodes (I can only see the VM ID/number). And of course I can't
migrate any VM from one host to another.

Any ideas?

Thanks!

On Wed, Oct 26, 2016 at 12:57 PM, Szabolcs F. <subc...@gmail.com> wrote:

> Hello again,
>
> sorry for another followup. I just realised that 4 of the 12 cluster nodes
> still have PVE Manager version 4.2 and the other 8 nodes have version 4.3.
> Can this be the reason of all my troubles?
>
> I'm in the process of updating these 4 nodes. These 4 nodes were installed
> with the Proxmox install media, but the other 8 nodes were installed with
> Debian 8 first. So the 4 outdated nodes didn't have the 'deb
> http://download.proxmox.com/debian jessie pve-no-subscription' repo file.
> Adding this repo made the 4.3 updates available.
>
>
>
> On Wed, Oct 26, 2016 at 12:20 PM, Szabolcs F. <subc...@gmail.com> wrote:
>
>> Hi Michael,
>>
>> I can change to LACP, sure. Would it be better than simple active-backup?
>> I haven't got too much experience with LACP though.
>>
>> On Wed, Oct 26, 2016 at 11:55 AM, Michael Rasmussen <m...@miras.org>
>> wrote:
>>
>>> Is it possible to switch to 802.3ad bond mode?
>>>
>>> On October 26, 2016 11:12:06 AM GMT+02:00, "Szabolcs F." <
>>> subc...@gmail.com> wrote:
>>>
>>>> Hi Lutz,
>>>>
>>>> my bondXX files look like this: http://pastebin.com/GX8x3ZaN
>>>> and my corosync.conf : http://pastebin.com/2ss0AAEr
>>>>
>>>> Multicast is enabled on my switches.
>>>>
>>>> The problem is I don't have a way to replicate the problem, it seems to
>>>> happen randomly, so I'm unsure how to do more tests. At the moment my
>>>> cluster is working fine for about 16 hours. Any ideas on forcing the issue?
>>>>
>>>> Thanks,
>>>> Szabolcs
>>>>
>>>> On Wed, Oct 26, 2016 at 9:17 AM, Lutz Willek 
>>>> <l.wil...@science-computing.de>
>>>> wrote:
>>>>
>>>>  Am 24.10.2016 um 15:16 schrieb Szabolcs F.:
>>>>>
>>>>>  Corosync has a lot of
>>>>>> these in the /var/logs/daemon.log :
>>>>>>  http://pastebin.com/ajhE8Rb9
>>>>>
>>>>>
>>>>>
>>>>>  please carefully check your (node/switch/multicast) network 
>>>>> configuration,
>>>>>  and please paste your corosync configuration file and output of
>>>>>  /proc/net/bonding/bondXX
>>>>>
>>>>>  just a guess:
>>>>>
>>>>>  * powerdown 1/3 - 1/2 of your nodes, adjust quorum (pvecm expect)
>>>>>--> Problems still occours?
>>>>>
>>>>>  * during "problem time"
>>>>>--> omping is still ok?
>>>>>
>>>>>  https://pve.proxmox.com/wiki/Troubleshooting_multicast,_quor
>>>>>  um_and_cluster_issues
>>>>>
>>>>>
>>>>>  Freundliche Grüße / Best Regards
>>>>>
>>>>>   Lutz Willek
>>>>>
>>>>>  --
>>>>> --
>>>>> creating IT solutions
>>>>>  Lutz Willek science + computing ag
>>>>>  Senior Systems Engineer Geschäftsstelle Berlin
>>>>>  IT Services Berlin
>>>>>   Friedrichstraße 187
>>>>>  phone +49(0)30 2007697-21   10117 Berlin, Germany
>>>>>  fax   +49(0)30 2007697-11   http://de.atos.net/sc
>>>>>
>>>>>  S/MIME-Sicherheit:
>>>>>  http://www.science-computing.de/cacert.crt
>>>>>  http://www.science-computing.de/cacert-sha512.crt
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>  pve-user mailing list
>>>>>  pve-user@pve.proxmox.com
>>>>>  http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>>>>
>>>>
>>>> --
>>>>
>>>> pve-user mailing list
>>>> pve-user@pve.proxmox.com
>>>> http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>>>>
>>>>
>>> --
>>> Sent from my Android phone with K-9 Mail. Please excuse my brevity.
>>>
>>
>>
>
___
pve-user mailing list
pve-user@pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user


Re: [PVE-User] Proxmox 4.3 cluster issue

2016-10-26 Thread Szabolcs F.
Hello again,

sorry for another follow-up. I just realised that 4 of the 12 cluster nodes
still have PVE Manager version 4.2 and the other 8 nodes have version 4.3.
Can this be the reason for all my troubles?

I'm in the process of updating these 4 nodes. These 4 nodes were installed
with the Proxmox install media, but the other 8 nodes were installed with
Debian 8 first. So the 4 outdated nodes didn't have the 'deb
http://download.proxmox.com/debian jessie pve-no-subscription' repo file.
Adding this repo made the 4.3 updates available.
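
In case it helps someone else, the fix on those 4 nodes was essentially this
(the file name under sources.list.d is arbitrary):

echo "deb http://download.proxmox.com/debian jessie pve-no-subscription" \
  > /etc/apt/sources.list.d/pve-no-subscription.list
apt-get update
apt-get dist-upgrade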



On Wed, Oct 26, 2016 at 12:20 PM, Szabolcs F. <subc...@gmail.com> wrote:

> Hi Michael,
>
> I can change to LACP, sure. Would it be better than simple active-backup?
> I haven't got too much experience with LACP though.
>
> On Wed, Oct 26, 2016 at 11:55 AM, Michael Rasmussen <m...@miras.org> wrote:
>
>> Is it possible to switch to 802.3ad bond mode?
>>
>> On October 26, 2016 11:12:06 AM GMT+02:00, "Szabolcs F." <
>> subc...@gmail.com> wrote:
>>
>>> Hi Lutz,
>>>
>>> my bondXX files look like this: http://pastebin.com/GX8x3ZaN
>>> and my corosync.conf : http://pastebin.com/2ss0AAEr
>>>
>>> Multicast is enabled on my switches.
>>>
>>> The problem is I don't have a way to replicate the problem, it seems to
>>> happen randomly, so I'm unsure how to do more tests. At the moment my
>>> cluster is working fine for about 16 hours. Any ideas on forcing the issue?
>>>
>>> Thanks,
>>> Szabolcs
>>>
>>> On Wed, Oct 26, 2016 at 9:17 AM, Lutz Willek <l.wil...@science-computing.de>
>>> wrote:
>>>
>>>  Am 24.10.2016 um 15:16 schrieb Szabolcs F.:
>>>>
>>>>  Corosync has a lot of
>>>>> these in the /var/logs/daemon.log :
>>>>>  http://pastebin.com/ajhE8Rb9
>>>>
>>>>
>>>>
>>>>  please carefully check your (node/switch/multicast) network configuration,
>>>>  and please paste your corosync configuration file and output of
>>>>  /proc/net/bonding/bondXX
>>>>
>>>>  just a guess:
>>>>
>>>>  * powerdown 1/3 - 1/2 of your nodes, adjust quorum (pvecm expect)
>>>>--> Problems still occours?
>>>>
>>>>  * during "problem time"
>>>>--> omping is still ok?
>>>>
>>>>  https://pve.proxmox.com/wiki/Troubleshooting_multicast,_quor
>>>>  um_and_cluster_issues
>>>>
>>>>
>>>>  Freundliche Grüße / Best Regards
>>>>
>>>>   Lutz Willek
>>>>
>>>>  --
>>>> --
>>>> creating IT solutions
>>>>  Lutz Willek science + computing ag
>>>>  Senior Systems Engineer Geschäftsstelle Berlin
>>>>  IT Services Berlin
>>>>   Friedrichstraße 187
>>>>  phone +49(0)30 2007697-21   10117 Berlin, Germany
>>>>  fax   +49(0)30 2007697-11   http://de.atos.net/sc
>>>>
>>>>  S/MIME-Sicherheit:
>>>>  http://www.science-computing.de/cacert.crt
>>>>  http://www.science-computing.de/cacert-sha512.crt
>>>>
>>>>
>>>> --
>>>>
>>>>  pve-user mailing list
>>>>  pve-user@pve.proxmox.com
>>>>  http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>>>
>>>
>>> --
>>>
>>> pve-user mailing list
>>> pve-user@pve.proxmox.com
>>> http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>>>
>>>
>> --
>> Sent from my Android phone with K-9 Mail. Please excuse my brevity.
>>
>
>
___
pve-user mailing list
pve-user@pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user


Re: [PVE-User] Proxmox 4.3 cluster issue

2016-10-26 Thread Szabolcs F.
Hi Michael,

I can change to LACP, sure. Would it be better than simple active-backup? I
haven't got too much experience with LACP though.
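
If I do go that way, I assume the bond stanza in /etc/network/interfaces would
look roughly like this (untested on my side; eth0/eth1 and the hash policy are
placeholders, and the switch ports need a matching LACP port-channel):

auto bond0
iface bond0 inet manual
    bond-slaves eth0 eth1
    bond-mode 802.3ad
    bond-miimon 100
    bond-xmit-hash-policy layer2+3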

On Wed, Oct 26, 2016 at 11:55 AM, Michael Rasmussen <m...@miras.org> wrote:

> Is it possible to switch to 802.3ad bond mode?
>
> On October 26, 2016 11:12:06 AM GMT+02:00, "Szabolcs F." <
> subc...@gmail.com> wrote:
>
>> Hi Lutz,
>>
>> my bondXX files look like this: http://pastebin.com/GX8x3ZaN
>> and my corosync.conf : http://pastebin.com/2ss0AAEr
>>
>> Multicast is enabled on my switches.
>>
>> The problem is I don't have a way to replicate the problem, it seems to
>> happen randomly, so I'm unsure how to do more tests. At the moment my
>> cluster is working fine for about 16 hours. Any ideas on forcing the issue?
>>
>> Thanks,
>> Szabolcs
>>
>> On Wed, Oct 26, 2016 at 9:17 AM, Lutz Willek <l.wil...@science-computing.de>
>> wrote:
>>
>>  Am 24.10.2016 um 15:16 schrieb Szabolcs F.:
>>>
>>>  Corosync has a lot of
>>>> these in the /var/logs/daemon.log :
>>>>  http://pastebin.com/ajhE8Rb9
>>>
>>>
>>>
>>>  please carefully check your (node/switch/multicast) network configuration,
>>>  and please paste your corosync configuration file and output of
>>>  /proc/net/bonding/bondXX
>>>
>>>  just a guess:
>>>
>>>  * powerdown 1/3 - 1/2 of your nodes, adjust quorum (pvecm expect)
>>>--> Problems still occours?
>>>
>>>  * during "problem time"
>>>--> omping is still ok?
>>>
>>>  https://pve.proxmox.com/wiki/Troubleshooting_multicast,_quor
>>>  um_and_cluster_issues
>>>
>>>
>>>  Freundliche Grüße / Best Regards
>>>
>>>   Lutz Willek
>>>
>>>  --
>>> --
>>> creating IT solutions
>>>  Lutz Willek science + computing ag
>>>  Senior Systems Engineer Geschäftsstelle Berlin
>>>  IT Services Berlin
>>>   Friedrichstraße 187
>>>  phone +49(0)30 2007697-21   10117 Berlin, Germany
>>>  fax   +49(0)30 2007697-11   http://de.atos.net/sc
>>>
>>>  S/MIME-Sicherheit:
>>>  http://www.science-computing.de/cacert.crt
>>>  http://www.science-computing.de/cacert-sha512.crt
>>>
>>>
>>> --
>>>
>>>  pve-user mailing list
>>>  pve-user@pve.proxmox.com
>>>  http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>>
>>
>> --
>>
>> pve-user mailing list
>> pve-user@pve.proxmox.com
>> http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>>
>>
> --
> Sent from my Android phone with K-9 Mail. Please excuse my brevity.
>
___
pve-user mailing list
pve-user@pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user


Re: [PVE-User] Proxmox 4.3 cluster issue

2016-10-26 Thread Szabolcs F.
Hi Lutz,

my bondXX files look like this: http://pastebin.com/GX8x3ZaN
and my corosync.conf : http://pastebin.com/2ss0AAEr

Multicast is enabled on my switches.

The problem is I don't have a way to replicate the problem; it seems to
happen randomly, so I'm unsure how to do more tests. At the moment my
cluster is working fine for about 16 hours. Any ideas on forcing the issue?

Thanks,
Szabolcs

On Wed, Oct 26, 2016 at 9:17 AM, Lutz Willek <l.wil...@science-computing.de>
wrote:

> Am 24.10.2016 um 15:16 schrieb Szabolcs F.:
>
>> Corosync has a lot of these in the /var/logs/daemon.log :
>> http://pastebin.com/ajhE8Rb9
>>
>
> please carefully check your (node/switch/multicast) network configuration,
> and please paste your corosync configuration file and output of
> /proc/net/bonding/bondXX
>
> just a guess:
>
> * powerdown 1/3 - 1/2 of your nodes, adjust quorum (pvecm expect)
>   --> Problems still occours?
>
> * during "problem time"
>   --> omping is still ok?
>
> https://pve.proxmox.com/wiki/Troubleshooting_multicast,_quor
> um_and_cluster_issues
>
>
> Freundliche Grüße / Best Regards
>
>  Lutz Willek
>
> --
> creating IT solutions
> Lutz Willek science + computing ag
> Senior Systems Engineer Geschäftsstelle Berlin
> IT Services Berlin  Friedrichstraße 187
> phone +49(0)30 2007697-21   10117 Berlin, Germany
> fax   +49(0)30 2007697-11   http://de.atos.net/sc
>
> S/MIME-Sicherheit:
> http://www.science-computing.de/cacert.crt
> http://www.science-computing.de/cacert-sha512.crt
>
>
> ___
> pve-user mailing list
> pve-user@pve.proxmox.com
> http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>
___
pve-user mailing list
pve-user@pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user


Re: [PVE-User] Proxmox 4.3 cluster issue

2016-10-26 Thread Szabolcs F.
Hi Alwin,

thanks for the links. Do you mean VLAN tagging on trunk ports or completely
separated, untagged, dedicated ports?

PS: I forgot to ask about jumbo frames. Should I enable them?

Thanks,
Szabolcs

On Tue, Oct 25, 2016 at 6:09 PM, Alwin Antreich <sysadmin-...@cognitec.com>
wrote:

> Hi Szabolcs,
>
> On 10/25/2016 04:07 PM, Szabolcs F. wrote:
> > Hi Alwin,
> >
> > the Cisco 4948 switches don't have jumbo frames enabled. Global Ethernet
> > MTU is 1500 bytes. Port security is not enabled.
> >
> > When the issue happens the hosts are able to ping each other without any
> > packet loss.
> >
> > On Tue, Oct 25, 2016 at 3:02 PM, Alwin Antreich <
> sysadmin-...@cognitec.com>
> > wrote:
> >
> >> Hi Szabolcs,
> >>
> >> On 10/25/2016 12:24 PM, Szabolcs F. wrote:
> >>> Hi Alwin,
> >>>
> >>> bond0 is on two Cisco 4948 switches and bond1 is on two Cisco
> N3K-3064PQ
> >>> switches. They worked fine for about two months in this setup. But last
> >>> week (after I started to have these issues) I powered down one Cisco
> 4948
> >>> and one N3K-3064PQ switch (in both cases the designated backup switches
> >>> were powered down). This is to make sure all servers use the same
> switch
> >> as
> >>> their active link. After that I stopped the Proxmox cluster (all nodes)
> >> and
> >>> started them again, but the issue occurred again.
> >>
> >> Ok, so one thing less to check. How are your remaining switch
> configured,
> >> especially, where the pve cluster is on? Do
> >> they use jumbo frames? Or some network/port security?
> >>
> >>>
> >>> I've just added the 'bond_primary ethX' option to the interfaces file.
> >> I'll
> >>> reboot everything once again and see if it helps.
> >>
> >> That's only going to be used, when you have all links connected and want
> >> to prefer a link to be the primary, eg. 10GbE
> >> as primary and 1GbE as backup.
> >>
> >>>
> >>> syslog: http://pastebin.com/MsuCcNx8
> >>> dmesg: http://pastebin.com/xUPMKDJR
> >>> pveproxy (I can only see access.log for pveproxy, so this is the
> service
> >>> status): http://pastebin.com/gPPb4F3x
> >>
> >> I couldn't find anything unusual, but that doesn't mean there isn't.
> >>
> >>>
> >>> What other logs should I be reading?
> >>>
> >>> Thanks
> >>>
> >>> On Tue, Oct 25, 2016 at 11:23 AM, Alwin Antreich <
> >> sysadmin-...@cognitec.com>
> >>> wrote:
> >>>
> >>>> Hi Szabolcs,
> >>>>
> >>>> On 10/25/2016 10:01 AM, Szabolcs F. wrote:
> >>>>> Hi Alwin,
> >>>>>
> >>>>> thanks for your hints.
> >>>>>
> >>>>>> On which interface is proxmox running on? Are these interfaces
> clogged
> >>>>> because, there is some heavy network IO going on?
> >>>>> I've got my two Intel Gbps network interfaces bonded together (bond0)
> >> as
> >>>>> active-backup and vmbr0 is bridged on this bond, then Proxmox is
> >> running
> >>>> on
> >>>>> this interface. I.e. http://pastebin.com/WZKQ02Qu
> >>>>> All nodes are configured like this. There is no heavy IO on these
> >>>>> interfaces, because the storage network uses the separate 10Gbps
> fiber
> >>>>> Intel NICs (bond1).
> >>>>
> >>>> Is your bond working properly? Is the bond on the same switch or two
> >>>> different?
> >>>>
> >>>> Usually I add the "bond_primary ethX" option to set the interface that
> >>>> should be primarily used in active-backup
> >>>> configuration - side note. :-)
> >>>>
> >>>> What are the logs on the server showing? You know, syslog, dmesg,
> >>>> pveproxy, etc. ;-)
> >>>>
> >>>>>
> >>>>>> Another guess, are all servers synchronizing with a NTP server and
> >> have
> >>>>> the correct time?
> >>>>> Yes, NTP is working properly, the firewall lets all NTP request go
> >>>> through.
> >>>>>
> >>>>>
> >>>>> On Mon, Oct 24, 2016 at 5:19 PM, Alwin Antreich <
> >>>> sysadmin-...@cogni

Re: [PVE-User] Proxmox 4.3 cluster issue

2016-10-25 Thread Szabolcs F.
Hi Alwin,

bond0 is on two Cisco 4948 switches and bond1 is on two Cisco N3K-3064PQ
switches. They worked fine for about two months in this setup. But last
week (after I started to have these issues) I powered down one Cisco 4948
and one N3K-3064PQ switch (in both cases the designated backup switches
were powered down). This is to make sure all servers use the same switch as
their active link. After that I stopped the Proxmox cluster (all nodes) and
started them again, but the issue occurred again.

I've just added the 'bond_primary ethX' option to the interfaces file. I'll
reboot everything once again and see if it helps.
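
Concretely, it is a single extra line inside the existing bond0 stanza (eth0
here stands for whichever port is cabled to the switch that stays powered on):

    # added to the existing 'iface bond0 inet manual' stanza
    bond_primary eth0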

syslog: http://pastebin.com/MsuCcNx8
dmesg: http://pastebin.com/xUPMKDJR
pveproxy (I can only see access.log for pveproxy, so this is the service
status): http://pastebin.com/gPPb4F3x

What other logs should I be reading?

Thanks

On Tue, Oct 25, 2016 at 11:23 AM, Alwin Antreich <sysadmin-...@cognitec.com>
wrote:

> Hi Szabolcs,
>
> On 10/25/2016 10:01 AM, Szabolcs F. wrote:
> > Hi Alwin,
> >
> > thanks for your hints.
> >
> >> On which interface is proxmox running on? Are these interfaces clogged
> > because, there is some heavy network IO going on?
> > I've got my two Intel Gbps network interfaces bonded together (bond0) as
> > active-backup and vmbr0 is bridged on this bond, then Proxmox is running
> on
> > this interface. I.e. http://pastebin.com/WZKQ02Qu
> > All nodes are configured like this. There is no heavy IO on these
> > interfaces, because the storage network uses the separate 10Gbps fiber
> > Intel NICs (bond1).
>
> Is your bond working properly? Is the bond on the same switch or two
> different?
>
> Usually I add the "bond_primary ethX" option to set the interface that
> should be primarily used in active-backup
> configuration - side note. :-)
>
> What are the logs on the server showing? You know, syslog, dmesg,
> pveproxy, etc. ;-)
>
> >
> >> Another guess, are all servers synchronizing with a NTP server and have
> > the correct time?
> > Yes, NTP is working properly, the firewall lets all NTP request go
> through.
> >
> >
> > On Mon, Oct 24, 2016 at 5:19 PM, Alwin Antreich <
> sysadmin-...@cognitec.com>
> > wrote:
> >
> >> Hello Szabolcs,
> >>
> >> On 10/24/2016 03:16 PM, Szabolcs F. wrote:
> >>> Hello,
> >>>
> >>> I've got a Proxmox VE 4.3 cluster of 12 nodes. All of them are Dell
> C6220
> >>> sleds. Each has 2x Intel Xeon E5-2670 CPU and 64GB RAM. I've got two
> >>> separate networks: 1Gbps LAN (Cisco 4948 switch) and 10Gbps storage
> >> (Cisco
> >>> N3K-3064PQ fiber switch). The Dell nodes use the integrated Intel Gbit
> >>> adapters for LAN and Intel PCI-E 10Gbps cards for the fiber network
> >> (ixgbe
> >>> driver). The storage servers are separate, they run FreeNAS and export
> >> the
> >>> shares with NFS. My virtual machines (I've made about 40 of them so
> far)
> >>> are KVM/QCOW2 and they are stored on the FreeNAS storage. So far so
> good.
> >>> I've been using this environment as a test and was almost ready to push
> >>> into production.
> >> On which interface is proxmox running on? Are these interfaces clogged
> >> because, there is some heavy network IO going on?
> >>>
> >>> But I have a problem with the cluster. From time to time the pveproxy
> >>> service dies on the nodes or the web UI lists all nodes (except the one
> >> I'm
> >>> actually logged into) as unreachable (red cross). Sometimes all nodes
> are
> >>> listed as working (green status) but if I try to connect to a virtual
> >>> machine I get a 'connection refused' error. When the cluster acts up I
> >>> can't do any VM migration and any other VM management (i.e. console,
> >>> start/stop/reset, new VM, etc). When it happens the only way to recover
> >> is
> >>> powering down all 12 nodes and starting them one after another. Then
> >>> everything works properly for a random amount of time: sometimes for
> >> weeks,
> >>> sometimes for only a few days.
> >> Another guess, are all servers synchronizing with a NTP server and have
> >> the correct time?
> >>>
> >>> I followed the network troubleshooting guide with omping, multicast,
> etc
> >>> and confirmed I've got multicast enabled and the troubleshooting didn't
> >>> return any error. The /etc/hosts file is configured on all nodes with
> the
> >>> proper hostname/IP list 

Re: [PVE-User] Proxmox 4.3 cluster issue

2016-10-25 Thread Szabolcs F.
Hi Alwin,

thanks for your hints.

>On which interface is proxmox running on? Are these interfaces clogged
>because there is some heavy network IO going on?
I've got my two Intel Gbps network interfaces bonded together (bond0) as
active-backup and vmbr0 is bridged on this bond, then Proxmox is running on
this interface. I.e. http://pastebin.com/WZKQ02Qu
All nodes are configured like this. There is no heavy IO on these
interfaces, because the storage network uses the separate 10Gbps fiber
Intel NICs (bond1).
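
Since the pastebin may not stay up forever, the layout is essentially this
(a from-memory sketch rather than a copy of the real file; interface names
and addresses are placeholders):

auto bond0
iface bond0 inet manual
    bond-slaves eth0 eth1
    bond-mode active-backup
    bond-miimon 100

auto vmbr0
iface vmbr0 inet static
    address  192.0.2.11
    netmask  255.255.255.0
    gateway  192.0.2.1
    bridge_ports bond0
    bridge_stp off
    bridge_fd 0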

>Another guess: are all servers synchronizing with an NTP server and have
>the correct time?
Yes, NTP is working properly; the firewall lets all NTP requests go through.


On Mon, Oct 24, 2016 at 5:19 PM, Alwin Antreich <sysadmin-...@cognitec.com>
wrote:

> Hello Szabolcs,
>
> On 10/24/2016 03:16 PM, Szabolcs F. wrote:
> > Hello,
> >
> > I've got a Proxmox VE 4.3 cluster of 12 nodes. All of them are Dell C6220
> > sleds. Each has 2x Intel Xeon E5-2670 CPU and 64GB RAM. I've got two
> > separate networks: 1Gbps LAN (Cisco 4948 switch) and 10Gbps storage
> (Cisco
> > N3K-3064PQ fiber switch). The Dell nodes use the integrated Intel Gbit
> > adapters for LAN and Intel PCI-E 10Gbps cards for the fiber network
> (ixgbe
> > driver). The storage servers are separate, they run FreeNAS and export
> the
> > shares with NFS. My virtual machines (I've made about 40 of them so far)
> > are KVM/QCOW2 and they are stored on the FreeNAS storage. So far so good.
> > I've been using this environment as a test and was almost ready to push
> > into production.
> On which interface is proxmox running on? Are these interfaces clogged
> because, there is some heavy network IO going on?
> >
> > But I have a problem with the cluster. From time to time the pveproxy
> > service dies on the nodes or the web UI lists all nodes (except the one
> I'm
> > actually logged into) as unreachable (red cross). Sometimes all nodes are
> > listed as working (green status) but if I try to connect to a virtual
> > machine I get a 'connection refused' error. When the cluster acts up I
> > can't do any VM migration and any other VM management (i.e. console,
> > start/stop/reset, new VM, etc). When it happens the only way to recover
> is
> > powering down all 12 nodes and starting them one after another. Then
> > everything works properly for a random amount of time: sometimes for
> weeks,
> > sometimes for only a few days.
> Another guess, are all servers synchronizing with a NTP server and have
> the correct time?
> >
> > I followed the network troubleshooting guide with omping, multicast, etc
> > and confirmed I've got multicast enabled and the troubleshooting didn't
> > return any error. The /etc/hosts file is configured on all nodes with the
> > proper hostname/IP list of all nodes.
> > When trying to do 'service pve-cluster restart' I get these errors:
> > http://pastebin.com/NXnEf4rd (running pmxcfs manually mounts the
> /etc/pve
> > properly, but doesn't fix the cluster/proxy issue)
> > pvecm status : http://pastebin.com/jsDFkqu3 (I powered down one node,
> > that's why it's missing)
> > pvecm nodes : http://pastebin.com/1WR8Yij8
> > Corosync has a lot of these in the /var/logs/daemon.log :
> > http://pastebin.com/ajhE8Rb9
> >
> > Someone please help!
> >
> > Thanks,
> > Szabolcs
> > ___
> > pve-user mailing list
> > pve-user@pve.proxmox.com
> > http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
> >
>
> --
> Cheers,
> Alwin
> ___
> pve-user mailing list
> pve-user@pve.proxmox.com
> http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>
___
pve-user mailing list
pve-user@pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user


[PVE-User] Proxmox 4.3 cluster issue

2016-10-24 Thread Szabolcs F.
Hello,

I've got a Proxmox VE 4.3 cluster of 12 nodes. All of them are Dell C6220
sleds. Each has 2x Intel Xeon E5-2670 CPU and 64GB RAM. I've got two
separate networks: 1Gbps LAN (Cisco 4948 switch) and 10Gbps storage (Cisco
N3K-3064PQ fiber switch). The Dell nodes use the integrated Intel Gbit
adapters for LAN and Intel PCI-E 10Gbps cards for the fiber network (ixgbe
driver). The storage servers are separate, they run FreeNAS and export the
shares with NFS. My virtual machines (I've made about 40 of them so far)
are KVM/QCOW2 and they are stored on the FreeNAS storage. So far so good.
I've been using this environment as a test and was almost ready to push
into production.

But I have a problem with the cluster. From time to time the pveproxy
service dies on the nodes or the web UI lists all nodes (except the one I'm
actually logged into) as unreachable (red cross). Sometimes all nodes are
listed as working (green status) but if I try to connect to a virtual
machine I get a 'connection refused' error. When the cluster acts up I
can't do any VM migration or any other VM management (i.e. console,
start/stop/reset, new VM, etc.). When it happens the only way to recover is
powering down all 12 nodes and starting them one after another. Then
everything works properly for a random amount of time: sometimes for weeks,
sometimes for only a few days.

I followed the network troubleshooting guide with omping, multicast, etc.,
and confirmed I've got multicast enabled; the troubleshooting didn't
return any errors. The /etc/hosts file is configured on all nodes with the
proper hostname/IP list of all nodes.
When trying to do 'service pve-cluster restart' I get these errors:
http://pastebin.com/NXnEf4rd (running pmxcfs manually mounts /etc/pve
properly, but doesn't fix the cluster/proxy issue)
pvecm status : http://pastebin.com/jsDFkqu3 (I powered down one node,
that's why it's missing)
pvecm nodes : http://pastebin.com/1WR8Yij8
Corosync has a lot of these in the /var/logs/daemon.log :
http://pastebin.com/ajhE8Rb9

Someone please help!

Thanks,
Szabolcs
___
pve-user mailing list
pve-user@pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user

