On 2/22/19 9:36 PM, Kevin Keane wrote:
Basically, having two NICs on the same subnet is a Bad Thing (tm) unless you bond them. It's doubly bad if you use DHCP, because then both NICs have the same default gateway, which means that your system has two routes to the same destination.
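To check a node for the condition described above, here is a minimal Python sketch that tests whether two interface addresses fall into overlapping networks. The /20 prefix in the example is an assumption, since the thread never states the data subnet's netmask. The function name is hypothetical.

```python
import ipaddress
from itertools import combinations


def nics_sharing_a_subnet(interfaces):
    """Return (nic_a, nic_b) pairs whose configured networks overlap.

    `interfaces` maps a NIC name to an address in CIDR notation,
    e.g. {"eth0": "192.168.134.250/20"}.
    """
    nets = {
        name: ipaddress.ip_interface(addr).network
        for name, addr in interfaces.items()
    }
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


if __name__ == "__main__":
    # Addresses from the step 1 log; the /20 prefix is an ASSUMPTION.
    nics = {"eth0": "192.168.134.250/20", "eth2": "192.168.134.252/20"}
    for a, b in nics_sharing_a_subnet(nics):
        print(f"{a} and {b} share a subnet (possible duplicate routes)")
```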

Hello,

Sorry for the delay; I was on vacation and then busy.
Yeah, I know it's bad.

A quick reminder:

Nodes are normally physically configured like this:

conf 1:

a) eth0 -> switch A (1G): IPMI (ipmi subnet) traffic allowed
b) eth2 -> switch B (10G): only data (data subnet) traffic allowed

and logically configured to be found by switch-based discovery on switch B.

I deliberately configured one node/port like this:

conf 2:

a) eth0 -> switch A (1G): data (data subnet) + IPMI (ipmi subnet) traffic allowed
b) eth2 -> switch B (10G): only data (data subnet) traffic allowed


precisely so that I could handle a switch misconfiguration that would result in conf 2 above, by understanding what happens and/or by forcing the install on eth2 [and, at the time I wrote my initial message, a BIOS boot order misconfiguration too].

[I'm re-quoting the steps I provided, which you refer to.]

That said, the way I read your log files:

--
Step 1 was:

1) genesis configures the NICs, and each one retrieves an IP from the dynamic range:

eth2 : 2019-02-22T18:16:32.004867+01:00 xcat-tars dhcpd: DHCPACK on 192.168.134.252 to 0c:c4:7a:58:c7:6a via eth1

eth0 : 2019-02-22T18:16:39.003422+01:00 xcat-tars dhcpd: DHCPACK on 192.168.134.250 to 0c:c4:7a:4d:85:a8 via eth1
--
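To follow each NIC through the steps, it helps to group the dhcpd lines by client MAC. Below is a minimal Python sketch. It assumes the syslog format shown in these excerpts (ISO timestamp, host name, then a `dhcpd:` tag). `DHCPD_RE` and `mac_timeline` are hypothetical names, not part of xCAT.

```python
import re
from collections import defaultdict
from datetime import datetime

# ASSUMED format, matching the excerpts in this thread, e.g.
# "<ts> xcat-tars dhcpd: DHCPACK on 192.168.134.252 to 0c:c4:7a:58:c7:6a via eth1"
DHCPD_RE = re.compile(
    r"^(?P<ts>\S+) \S+ dhcpd: (?P<event>DHCP[A-Z]+)"
    r"(?: (?:on|of|for) (?P<ip>\d{1,3}(?:\.\d{1,3}){3}))?"
    r".*?\b(?:to|from) (?P<mac>[0-9a-f]{2}(?::[0-9a-f]{2}){5})",
    re.IGNORECASE,
)


def mac_timeline(lines):
    """Group dhcpd events by client MAC, keeping log order.

    Returns {mac: [(timestamp, event, ip_or_None), ...]}. Lines that are
    not dhcpd lease messages (e.g. xcatd lines) are skipped.
    """
    timeline = defaultdict(list)
    for line in lines:
        m = DHCPD_RE.match(line.strip())
        if m is None:
            continue
        ts = datetime.fromisoformat(m["ts"])
        timeline[m["mac"].lower()].append((ts, m["event"], m["ip"]))
    return dict(timeline)


if __name__ == "__main__":
    log = [
        "2019-02-22T18:16:32.004867+01:00 xcat-tars dhcpd: DHCPACK on 192.168.134.252 to 0c:c4:7a:58:c7:6a via eth1",
        "2019-02-22T18:16:39.003422+01:00 xcat-tars dhcpd: DHCPACK on 192.168.134.250 to 0c:c4:7a:4d:85:a8 via eth1",
        "2019-02-22T18:16:59.664557+01:00 xcat-tars dhcpd: DHCPRELEASE of 192.168.134.252 from 0c:c4:7a:58:c7:6a via eth1 (found)",
    ]
    for mac, events in mac_timeline(log).items():
        print(mac)
        for ts, event, ip in events:
            print(f"  {ts.time()} {event} {ip or '-'}")
```

Feeding it the whole dhcpd log for the boot gives one timeline per NIC, which makes it easier to see which MAC held which address at each step.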

In step 1, genesis shouldn't be involved yet. The BIOS should get IP addresses for the NICs - apparently for both.

I don't see why that would be the case. I was talking about what appears on the console after the 'Beginning discovery process' message: at that point, the console shows one DHCP request per NIC.

Important here: which MAC address it picks is not inherently related to which NIC was used to boot genesis. Normally, only the bootup NIC would have an IP address (because none of the others are configured yet, and they shouldn't be connected to a subnet with DHCP). But in your case, there are two NICs that have IP addresses via DHCP. I don't know how genesis picks the "right" NIC in that scenario. It might actually be random chance. Based on your observation in step 5, genesis might actually try to discover BOTH NICs.

My first guess is also that xCAT tries to discover BOTH NICs, but I'm not sure.

---
Step 2 was:

2) discovery begins on eth2

2019-02-22T18:16:44.320279+01:00 xcat-tars xcat[60918]: xcatd: Processing discovery request from 192.168.134.252
---

Step 3 was:
---
3) discovery determined that this was the tars-113 node:

2019-02-22T18:16:57.162396+01:00 xcat-tars xcat[60918]: Discover info: configure static BMC ip:10.6.96.115 for host_node:tars-113.
---


In step 3, the BMC is changed from DHCP to a static IP address. Note that it's not clear *which* NIC ends up with the static IP.

I don't see the problem: the BMC, even though it physically shares the eth0 port, has its own MAC address.

Step 4 was:
---
4) the eth0 and eth2 leases are released:

2019-02-22T18:16:59.621541+01:00 xcat-tars dhcpd: DHCPRELEASE of 192.168.134.250 from 0c:c4:7a:4d:85:a8 via eth1 (found)

2019-02-22T18:16:59.664557+01:00 xcat-tars dhcpd: DHCPRELEASE of 192.168.134.252 from 0c:c4:7a:58:c7:6a via eth1 (found)
---


In step 4, the leases are released. That's expected since the BMC is no longer using DHCP.

I don't think this is related to the BMC: the MAC addresses in both DHCPRELEASE messages are those of eth0 and eth2, not the BMC's.

Step 5 was:
---
5) I got these messages:

2019-02-22T18:16:59.752892+01:00 xcat-tars xcatd: Discovery worker: switch instance: nodediscover instance[60918]: tars-113 has been discovered
2019-02-22T18:16:59.754145+01:00 xcat-tars xcat[60918]: Discovery Error: Could not find any node.
---

Step 5a: xCAT confirms that for one of the IP addresses, it found a corresponding predefined node object, and completed discovery. Step 5b: xCAT tells you "there is no predefined node object for this IP address". Of course not - because Step 5a used up the predefined node object.

I agree that something like this must be happening. And it must be switch-based discovery, since I removed the tars-113 entry from discoverydata, so I don't think MTMS-based discovery could have matched anything.
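Here is a toy Python model of the step 5a/5b explanation. Each predefined node object can be claimed only once, so a second discovery request from the same machine finds nothing. This is NOT xCAT's discovery code. The MAC-to-node mapping is a hypothetical stand-in for whatever switch/port lookup xCAT actually performs, and all names are hypothetical.

```python
def toy_discovery(request_macs, predefined):
    """Toy model of the step 5a/5b hypothesis; NOT xCAT's real logic.

    `predefined` maps a node name to the MACs that would match it (a
    stand-in for xCAT's switch/port lookup). Each predefined node can
    be claimed by only one request; later requests find nothing.
    """
    claimed = set()
    results = []
    for mac in request_macs:
        node = next(
            (n for n, macs in predefined.items()
             if mac in macs and n not in claimed),
            None,
        )
        if node is None:
            results.append((mac, "Could not find any node"))
        else:
            claimed.add(node)
            results.append((mac, f"{node} has been discovered"))
    return results


if __name__ == "__main__":
    # Hypothetical: assume both NICs' MACs would match tars-113.
    predefined = {"tars-113": {"0c:c4:7a:58:c7:6a", "0c:c4:7a:4d:85:a8"}}
    for mac, outcome in toy_discovery(
        ["0c:c4:7a:58:c7:6a", "0c:c4:7a:4d:85:a8"], predefined
    ):
        print(mac, "->", outcome)
```

In this model, whichever request is processed first claims the node. If the hypothesis holds, request order is one way the "wrong" MAC could end up registered.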

Step 6 was:
---
6) eth0 gets the desired IP (the IP set in the node definition):

2019-02-22T18:17:00.000095+01:00 xcat-tars dhcpd: DHCPDISCOVER from 0c:c4:7a:4d:85:a8 via eth1
2019-02-22T18:17:00.000116+01:00 xcat-tars dhcpd: DHCPOFFER on 192.168.128.115 to 0c:c4:7a:4d:85:a8 via eth1
2019-02-22T18:17:00.000433+01:00 xcat-tars dhcpd: DHCPREQUEST for 192.168.128.115 (192.168.132.2) from 0c:c4:7a:4d:85:a8 via eth1
2019-02-22T18:17:00.000442+01:00 xcat-tars dhcpd: DHCPACK on 192.168.128.115 to 0c:c4:7a:4d:85:a8 via eth1

---


Step 6: eth2 has been reconfigured with a static IP in step 3, so eth0 is the only one that still has DHCP enabled. Remember, these DHCP requests are generated by the BIOS; they have nothing to do with discovery.

I don't see why. The BIOS only does the initial native PXE boot, doesn't it?

Check your /var/lib/dhcpd/dhcpd.leases file to see if there are reservations for one or both MAC addresses. My guess is that xCAT created a reservation for eth0's MAC address with eth2's IP address.

eth0's MAC gets the reservation I'd want for eth2, and eth2's MAC gets:

host tars-113-noip0cc47a58c76a {
  dynamic;
  hardware ethernet 0c:c4:7a:58:c7:6a;
        supersede server.allow-booting = 00;
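The `supersede server.allow-booting = 00;` line in that host entry sets ISC dhcpd's allow-booting option to false for eth2's MAC. That would plausibly explain the "booting disallowed" line in step 7. Below is a minimal Python sketch that lists every host entry in the leases file that disallows booting. It makes two assumptions. First, entries use the simple `host NAME { ... }` layout shown above, with no nested braces. Second, a later entry for the same host, including one marked `deleted;`, overrides earlier ones. The function and pattern names are hypothetical.

```python
import re

# ASSUMES the simple layout quoted above: host NAME { statements; }
# with no nested braces inside the body.
HOST_RE = re.compile(r"\bhost\s+(?P<name>\S+)\s*\{(?P<body>[^}]*)\}")
MAC_RE = re.compile(r"hardware\s+ethernet\s+(?P<mac>[0-9a-f:]{17})\s*;", re.I)
NO_BOOT_RE = re.compile(
    r"supersede\s+server\.allow-booting\s*=\s*(?:00|false)\s*;", re.I
)
DELETED_RE = re.compile(r"\bdeleted\s*;")


def boot_disallowed_hosts(leases_text):
    """Return {host_name: mac} for host entries that disallow booting.

    A later entry for the same host replaces an earlier one, and an entry
    containing `deleted;` removes the host.
    """
    bodies = {}
    for m in HOST_RE.finditer(leases_text):
        if DELETED_RE.search(m["body"]):
            bodies.pop(m["name"], None)
        else:
            bodies[m["name"]] = m["body"]
    result = {}
    for name, body in bodies.items():
        mac = MAC_RE.search(body)
        if mac and NO_BOOT_RE.search(body):
            result[name] = mac["mac"].lower()
    return result


if __name__ == "__main__":
    # The entry quoted above, closed with "}" so that it parses.
    sample = """host tars-113-noip0cc47a58c76a {
  dynamic;
  hardware ethernet 0c:c4:7a:58:c7:6a;
        supersede server.allow-booting = 00;
}
"""
    print(boot_disallowed_hosts(sample))
```

On the xCAT server, you would pass it the contents of the leases file instead of the inline sample. The path depends on the distribution; /var/lib/dhcpd/dhcpd.leases is typical on RHEL-style systems.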


Step 7 was:
---
7) eth2 asks for an IP but booting is disallowed:

2019-02-22T18:17:01.001030+01:00 xcat-tars dhcpd: DHCPDISCOVER from 0c:c4:7a:58:c7:6a via eth1: booting disallowed
---


Step 7: I'm not sure what this means. My hunch is that eth0 may have received via DHCP the same IP address that was statically assigned to eth2 in step 3. eth2 detects the conflict and falls back to DHCP.

Step 8 was:
---
8) I got this message, which I don't fully understand either:

2019-02-22T18:17:01.500387+01:00 xcat-tars xcatd: Discovery worker: switch instance: nodediscover instance[60918]: Failed to notify 192.168.134.252 that it's actually tars-113.
---
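To check which client, if any, held 192.168.134.252 when xCAT tried to notify it in step 8, you can replay the lease events up to that moment. Below is a minimal Python sketch. It assumes a lease is held from its DHCPACK until a DHCPRELEASE from the same MAC, and it ignores lease expiry. `holder_at` is a hypothetical helper. Given the step 1 and step 4 lines for that address, it reports eth2's MAC at the step 2 timestamp and no holder at the step 8 timestamp.

```python
from datetime import datetime


def holder_at(events, ip, when):
    """Return the MAC holding `ip` at time `when`, or None.

    `events` is an iterable of (timestamp, event, ip, mac) tuples taken
    from dhcpd log lines. A lease counts as held from its DHCPACK until
    a DHCPRELEASE from the same MAC; lease expiry is ignored.
    """
    holder = None
    for ts, event, ev_ip, mac in sorted(events, key=lambda e: e[0]):
        if ts > when:
            break
        if ev_ip != ip:
            continue
        if event == "DHCPACK":
            holder = mac
        elif event == "DHCPRELEASE" and mac == holder:
            holder = None
    return holder


if __name__ == "__main__":
    t = datetime.fromisoformat
    # Events for 192.168.134.252, copied from the step 1 and step 4 lines.
    events = [
        (t("2019-02-22T18:16:32.004867+01:00"), "DHCPACK",
         "192.168.134.252", "0c:c4:7a:58:c7:6a"),
        (t("2019-02-22T18:16:59.664557+01:00"), "DHCPRELEASE",
         "192.168.134.252", "0c:c4:7a:58:c7:6a"),
    ]
    for label, when in [("step 2", "2019-02-22T18:16:44.320279+01:00"),
                        ("step 8", "2019-02-22T18:17:01.500387+01:00")]:
        print(label, holder_at(events, "192.168.134.252", t(when)))
```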

Step 8: because of step 7, eth0 got the IP address via DHCP that xCAT thought it had statically assigned to eth2. So xCAT is talking to the wrong NIC on the node.

So, to sum up, what's still unclear is:

- does genesis try to discover BOTH NICs?
- which NIC does genesis register?

What I observe is that the node ends up with the eth0 MAC address in xCAT's mac table.

Again, what I initially needed was a way to make sure that eth2's MAC, and only eth2's, gets discovered, even in the case of a switch misconfiguration or a BIOS boot order misconfiguration (although you've pretty much convinced me that the BIOS boot order may have nothing to do with which MAC address ends up discovered).

Thanks

--
Thomas H.


_______________________________________________
xCAT-user mailing list
xCAT-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user
