[Touch-packages] [Bug 2002445] Re: udev NIC renaming race with mlx5_core driver

Chad Smith Mon, 27 Feb 2023 20:01:51 -0800

** Description changed:

  [Impact]
  On systems with mellanox NICs, udev's NIC renaming races with the mlx5_core 
driver's own configuration of subordinate interfaces. When the kernel wins this 
race, the device cannot be renamed as udev has attempted, and this causes 
systemd-network-online.target to timeout waiting for links to be configured. 
This ultimately results in boot being delayed by about 2 minutes.
  
  [Test Plan]
- Since this is a race condition, we need to boot many instances before we see 
the issue. The Ubuntu Server team will help coordinate the testing at scale to 
confirm the fix.
+ Repeated launches of Standard_D8ds_v5 instance types will generally hit this 
race around 1 in 10 runs. Create a vm snapshot with updated systemd from 
ppa:enr0n/systemd-245. Launch 100 Standard_D8ds_v5 instances with updated 
systemd. Assert not failure in cloud-init status and no 2 minute delay in 
network-online.target.
+ 
+ 
+ To check for failure symptom:
+   - Assert that network-online.target isn't the longest pole from 
systemd-analyze blame. 
+ 
+ To assert success condition during net rename busy race:
+   - assert when "eth1" is still the primary device name, that two altnames 
are listed (preserving the altname due to the primary NIC rename being hit.
+ 
+ 
+ Sample script uses pycloudlib to create modified base image for test and
+ launches 100 VMs of type Standard_D8ds_v5, counting both successes and
+ any failures seen.
+ 
+ #!/usr/bin/env python3
+ 
+ 
+ """Basic examples of various lifecycle with an Azure instance."""
+ 
+ import logging
+ import json
+ 
+ import pycloudlib
+ LOG = logging.getLogger()
+ 
+ base_cfg = """#cloud-config
+ ssh-import-id: [chad.smith, enr0n]
+ """
+ 
+ apt_cfg = """
+ # Add developer PPA
+ apt:
+  sources:
+    systemd-testing:
+      source: "deb [allow-insecure=yes] 
https://ppa.launchpadcontent.net/enr0n/systemd-245/ubuntu focal main"
+ # upgrade systemd after cloud-init is nearly done
+ runcmd:
+  - apt install systemd udev -y --allow-unauthenticated
+ """
+ 
+ debug_systemd_cfg = """
+ # Create systemd-udev debug override.conf in base image
+ write_files:
+ - path: /etc/systemd/system/systemd-networkd.service.d/override.conf
+   owner: root:root
+   defer: {defer}
+   content: |
+     [Service]
+     Environment=SYSTEMD_LOG_LEVEL=debug
+     
+ - path: /etc/systemd/system/systemd-udevd.service.d/override.conf
+   owner: root:root
+   defer: {defer}
+   content: |
+     [Service]
+     Environment=SYSTEMD_LOG_LEVEL=debug
+     LogRateLimitIntervalSec=0
+ """
+ 
+ cloud_config = base_cfg + apt_cfg + debug_systemd_cfg
+ cloud_config2 = base_cfg + debug_systemd_cfg
+ 
+ 
+ def debug_systemd_image_launch_overlake_v5_with_snapshot():
+     """Test overlake v5 timeouts
+     
+     test procedure:
+     - Launch base focal image
+     - enable ppa:enr0n/systemd-245 and systemd/udev debugging
+     - cloud-init clean --logs && deconfigure waalinux agent before shutdown
+     - snapshot a base image
+     - launch v5 system from snapshot
+     - check systemd-analyze for expected timeout
+     """
+     client = pycloudlib.Azure(tag="azure")
+ 
+     image_id = client.daily_image(release="focal")
+     pub_path = "/home/ubuntu/.ssh/id_rsa.pub"
+     priv_path = "/home/ubuntu/.ssh/id_rsa"
+ 
+     client.use_key(pub_path, priv_path)
+ 
+     base_instance = client.launch(
+         image_id=image_id,
+         instance_type="Standard_DS1_v2",
+         user_data=cloud_config.format(defer="true"),
+     )
+ 
+     LOG.info(f"base instance: ssh ubuntu@{base_instance.ip}")
+     base_instance.wait()
+     LOG.info(base_instance.execute("apt cache policy systemd"))
+     snapshotted_image_id = client.snapshot(base_instance)
+ 
+     reproducer = False
+     tries = 0
+     success_count_with_race = 0
+     success_count_no_race = 0
+     failure_count_network_delay = 0
+     failure_count_no_altnames = 0
+     TEST_SUMMARY_TMPL = """
+     ----- Test run complete: {tries} attempted -----
+     Successes without rename race: {success_count_no_race}
+     Successes with rename race and preserved altname: 
{success_count_with_race_persist_altname}
+     Failures due to network delay: {failure_count_network_delay}
+     Failures due to no altnames persisted: {failure_count_no_altnames}
+     ===================================
+     """
+     while tries < 10 and not reproducer:
+         tries += 1
+         new_instance = client.launch(
+             image_id=snapshotted_image_id,
+             instance_type="Standard_D8ds_v5",
+             user_data=cloud_config.format(defer="false"),
+         )
+         # breadcrumb for us pycloudlib/Azure  will reuse available IPs
+         new_instance.wait()
+         blame = new_instance.execute("systemd-analyze blame").splitlines()
+         LOG.info(f"--- Attempt {tries} ssh ubuntu@{new_instance.ip} Blame: 
{blame[0]}")
+         ip_addr = json.loads(new_instance.execute("ip -j addr").stdout)
+         for d in ip_addr:
+             if len(d.get('altnames', [])) > 1:
+                 success_count_with_race_persist_altname += 1
+                 LOG.info(
+                     f"--- SUCCESS persisting altnames {d['altnames']} due to 
rename race on resource busy on {d['ifname']}"
+                 )
+                 LOG.info(new_instance.execute("journalctl -u 
systemd-udevd.service -b 0 --no-pager"))
+             if d['ifname'] == "eth1" and len(d.get('altnames', [])) < 2:
+                 failure_count_no_altnames += 1
+                 LOG.error(f"FAILURE: to preserve altnames for {d['ifname']}. 
Only preserved {d.get('altnames', [])}")
+             else:
+                 success_count_no_race += 1
+         LOG.info('\n'.join([f'{d["ifname"]}: {d.get("altnames")}' for d in 
ip_addr]))
+         if "systemd-networkd-wait-online.service" not in blame[0]:
+             LOG.info(f"Destroying instance, normal boot seen: {blame[0]}")
+             new_instance.delete()
+         else:
+             failure_count_network_delay += 1
+             failure_count_no_altnames_persisted += 1
+             LOG.info(f"--- Attempt {tries} found delayed instance boot: 
{blame[0]}")
+             reproducer = True
+             r = new_instance.execute("journalctl -u systemd-udevd.service -b 
0 --no-pager")
+             LOG.info(r)
+             if "Failure to rename" in str(r):
+                 LOG.info(f"Found rename refusal!: {r[0]}")
+     base_instance.delete()
+     new_instance.delete()
+     LOG.info(
+         TEST_SUMMARY_TMPL.format(
+             success_count_with_race=success_count_with_race,
+             success_count_no_race=success_count_no_race,
+             failure_count_network_delay=failure_count_network_delay,
+             failure_count_no_altnames=failure_count_no_altnames, tries=tries
+         )
+      )
+ 
+ 
+ if __name__ == "__main__":
+     # Avoid polluting the log with azure info
+     logging.getLogger("paramiko").setLevel(logging.WARNING)
+     logging.getLogger("pycloudlib").setLevel(logging.WARNING)
+     logging.getLogger("adal-python").setLevel(logging.WARNING)
+     logging.getLogger("cli.azure.cli.core").setLevel(logging.WARNING)
+     logging.basicConfig(filename="sru-systemd.log", level=logging.INFO)
+     debug_systemd_image_launch_overlake_v5_with_snapshot()
+ 
+ 
  
  [Where problems could occur]
  The patches effectively make it so that if a interface cannot be renamed from 
udev, then the new name is left as an alternative name as a fallback. If 
problems occur, it would be related to device renaming, and particularly 
related to the devices alternative names.


** Description changed:

  [Impact]
  On systems with mellanox NICs, udev's NIC renaming races with the mlx5_core 
driver's own configuration of subordinate interfaces. When the kernel wins this 
race, the device cannot be renamed as udev has attempted, and this causes 
systemd-network-online.target to timeout waiting for links to be configured. 
This ultimately results in boot being delayed by about 2 minutes.
  
  [Test Plan]
  Repeated launches of Standard_D8ds_v5 instance types will generally hit this 
race around 1 in 10 runs. Create a vm snapshot with updated systemd from 
ppa:enr0n/systemd-245. Launch 100 Standard_D8ds_v5 instances with updated 
systemd. Assert not failure in cloud-init status and no 2 minute delay in 
network-online.target.
  
- 
  To check for failure symptom:
-   - Assert that network-online.target isn't the longest pole from 
systemd-analyze blame. 
+   - Assert that network-online.target isn't the longest pole from 
systemd-analyze blame.
  
  To assert success condition during net rename busy race:
-   - assert when "eth1" is still the primary device name, that two altnames 
are listed (preserving the altname due to the primary NIC rename being hit.
- 
+   - assert when "eth1" is still the primary device name, that two altnames 
are listed (preserving the altname due to the primary NIC rename being hit.
  
  Sample script uses pycloudlib to create modified base image for test and
  launches 100 VMs of type Standard_D8ds_v5, counting both successes and
  any failures seen.
  
  #!/usr/bin/env python3
- 
  
  """Basic examples of various lifecycle with an Azure instance."""
  
  import logging
  import json
  
  import pycloudlib
  LOG = logging.getLogger()
  
  base_cfg = """#cloud-config
  ssh-import-id: [chad.smith, enr0n]
  """
  
  apt_cfg = """
  # Add developer PPA
  apt:
-  sources:
-    systemd-testing:
-      source: "deb [allow-insecure=yes] 
https://ppa.launchpadcontent.net/enr0n/systemd-245/ubuntu focal main"
+  sources:
+    systemd-testing:
+      source: "deb [allow-insecure=yes] 
https://ppa.launchpadcontent.net/enr0n/systemd-245/ubuntu focal main"
  # upgrade systemd after cloud-init is nearly done
  runcmd:
-  - apt install systemd udev -y --allow-unauthenticated
+  - apt install systemd udev -y --allow-unauthenticated
  """
  
  debug_systemd_cfg = """
  # Create systemd-udev debug override.conf in base image
  write_files:
  - path: /etc/systemd/system/systemd-networkd.service.d/override.conf
-   owner: root:root
-   defer: {defer}
-   content: |
-     [Service]
-     Environment=SYSTEMD_LOG_LEVEL=debug
-     
+   owner: root:root
+   defer: {defer}
+   content: |
+     [Service]
+     Environment=SYSTEMD_LOG_LEVEL=debug
+ 
  - path: /etc/systemd/system/systemd-udevd.service.d/override.conf
-   owner: root:root
-   defer: {defer}
-   content: |
-     [Service]
-     Environment=SYSTEMD_LOG_LEVEL=debug
-     LogRateLimitIntervalSec=0
+   owner: root:root
+   defer: {defer}
+   content: |
+     [Service]
+     Environment=SYSTEMD_LOG_LEVEL=debug
+     LogRateLimitIntervalSec=0
  """
  
  cloud_config = base_cfg + apt_cfg + debug_systemd_cfg
  cloud_config2 = base_cfg + debug_systemd_cfg
  
+ def debug_systemd_image_launch_overlake_v5_with_snapshot():
+     """Test overlake v5 timeouts
  
- def debug_systemd_image_launch_overlake_v5_with_snapshot():
-     """Test overlake v5 timeouts
-     
-     test procedure:
-     - Launch base focal image
-     - enable ppa:enr0n/systemd-245 and systemd/udev debugging
-     - cloud-init clean --logs && deconfigure waalinux agent before shutdown
-     - snapshot a base image
-     - launch v5 system from snapshot
-     - check systemd-analyze for expected timeout
-     """
-     client = pycloudlib.Azure(tag="azure")
+     test procedure:
+     - Launch base focal image
+     - enable ppa:enr0n/systemd-245 and systemd/udev debugging
+     - cloud-init clean --logs && deconfigure waalinux agent before shutdown
+     - snapshot a base image
+     - launch v5 system from snapshot
+     - check systemd-analyze for expected timeout
+     """
+     client = pycloudlib.Azure(tag="azure")
  
-     image_id = client.daily_image(release="focal")
-     pub_path = "/home/ubuntu/.ssh/id_rsa.pub"
-     priv_path = "/home/ubuntu/.ssh/id_rsa"
+     image_id = client.daily_image(release="focal")
+     pub_path = "/home/ubuntu/.ssh/id_rsa.pub"
+     priv_path = "/home/ubuntu/.ssh/id_rsa"
  
-     client.use_key(pub_path, priv_path)
+     client.use_key(pub_path, priv_path)
  
-     base_instance = client.launch(
-         image_id=image_id,
-         instance_type="Standard_DS1_v2",
-         user_data=cloud_config.format(defer="true"),
-     )
+     base_instance = client.launch(
+         image_id=image_id,
+         instance_type="Standard_DS1_v2",
+         user_data=cloud_config.format(defer="true"),
+     )
  
-     LOG.info(f"base instance: ssh ubuntu@{base_instance.ip}")
-     base_instance.wait()
-     LOG.info(base_instance.execute("apt cache policy systemd"))
-     snapshotted_image_id = client.snapshot(base_instance)
+     LOG.info(f"base instance: ssh ubuntu@{base_instance.ip}")
+     base_instance.wait()
+     LOG.info(base_instance.execute("apt cache policy systemd"))
+     snapshotted_image_id = client.snapshot(base_instance)
  
-     reproducer = False
-     tries = 0
-     success_count_with_race = 0
-     success_count_no_race = 0
-     failure_count_network_delay = 0
-     failure_count_no_altnames = 0
-     TEST_SUMMARY_TMPL = """
-     ----- Test run complete: {tries} attempted -----
-     Successes without rename race: {success_count_no_race}
-     Successes with rename race and preserved altname: 
{success_count_with_race_persist_altname}
-     Failures due to network delay: {failure_count_network_delay}
-     Failures due to no altnames persisted: {failure_count_no_altnames}
-     ===================================
-     """
-     while tries < 10 and not reproducer:
-         tries += 1
-         new_instance = client.launch(
-             image_id=snapshotted_image_id,
-             instance_type="Standard_D8ds_v5",
-             user_data=cloud_config.format(defer="false"),
-         )
-         # breadcrumb for us pycloudlib/Azure  will reuse available IPs
-         new_instance.wait()
-         blame = new_instance.execute("systemd-analyze blame").splitlines()
-         LOG.info(f"--- Attempt {tries} ssh ubuntu@{new_instance.ip} Blame: 
{blame[0]}")
-         ip_addr = json.loads(new_instance.execute("ip -j addr").stdout)
-         for d in ip_addr:
-             if len(d.get('altnames', [])) > 1:
-                 success_count_with_race_persist_altname += 1
-                 LOG.info(
-                     f"--- SUCCESS persisting altnames {d['altnames']} due to 
rename race on resource busy on {d['ifname']}"
-                 )
-                 LOG.info(new_instance.execute("journalctl -u 
systemd-udevd.service -b 0 --no-pager"))
-             if d['ifname'] == "eth1" and len(d.get('altnames', [])) < 2:
-                 failure_count_no_altnames += 1
-                 LOG.error(f"FAILURE: to preserve altnames for {d['ifname']}. 
Only preserved {d.get('altnames', [])}")
-             else:
-                 success_count_no_race += 1
-         LOG.info('\n'.join([f'{d["ifname"]}: {d.get("altnames")}' for d in 
ip_addr]))
-         if "systemd-networkd-wait-online.service" not in blame[0]:
-             LOG.info(f"Destroying instance, normal boot seen: {blame[0]}")
-             new_instance.delete()
-         else:
-             failure_count_network_delay += 1
-             failure_count_no_altnames_persisted += 1
-             LOG.info(f"--- Attempt {tries} found delayed instance boot: 
{blame[0]}")
-             reproducer = True
-             r = new_instance.execute("journalctl -u systemd-udevd.service -b 
0 --no-pager")
-             LOG.info(r)
-             if "Failure to rename" in str(r):
-                 LOG.info(f"Found rename refusal!: {r[0]}")
-     base_instance.delete()
-     new_instance.delete()
-     LOG.info(
-         TEST_SUMMARY_TMPL.format(
-             success_count_with_race=success_count_with_race,
-             success_count_no_race=success_count_no_race,
-             failure_count_network_delay=failure_count_network_delay,
-             failure_count_no_altnames=failure_count_no_altnames, tries=tries
-         )
-      )
- 
+     reproducer = False
+     tries = 0
+     success_count_with_race = 0
+     success_count_no_race = 0
+     failure_count_network_delay = 0
+     failure_count_no_altnames = 0
+     TEST_SUMMARY_TMPL = """
+     ----- Test run complete: {tries} attempted -----
+     Successes without rename race: {success_count_no_race}
+     Successes with rename race and preserved altname: 
{success_count_with_race_persist_altname}
+     Failures due to network delay: {failure_count_network_delay}
+     Failures due to no altnames persisted: {failure_count_no_altnames}
+     ===================================
+     """
+     while tries < 100 and not reproducer:
+         tries += 1
+         new_instance = client.launch(
+             image_id=snapshotted_image_id,
+             instance_type="Standard_D8ds_v5",
+             user_data=cloud_config.format(defer="false"),
+         )
+         # breadcrumb for us pycloudlib/Azure  will reuse available IPs
+         new_instance.wait()
+         blame = new_instance.execute("systemd-analyze blame").splitlines()
+         LOG.info(f"--- Attempt {tries} ssh ubuntu@{new_instance.ip} Blame: 
{blame[0]}")
+         ip_addr = json.loads(new_instance.execute("ip -j addr").stdout)
+         for d in ip_addr:
+             if len(d.get('altnames', [])) > 1:
+                 success_count_with_race_persist_altname += 1
+                 LOG.info(
+                     f"--- SUCCESS persisting altnames {d['altnames']} due to 
rename race on resource busy on {d['ifname']}"
+                 )
+                 LOG.info(new_instance.execute("journalctl -u 
systemd-udevd.service -b 0 --no-pager"))
+             if d['ifname'] == "eth1" and len(d.get('altnames', [])) < 2:
+                 failure_count_no_altnames += 1
+                 LOG.error(f"FAILURE: to preserve altnames for {d['ifname']}. 
Only preserved {d.get('altnames', [])}")
+             else:
+                 success_count_no_race += 1
+         LOG.info('\n'.join([f'{d["ifname"]}: {d.get("altnames")}' for d in 
ip_addr]))
+         if "systemd-networkd-wait-online.service" not in blame[0]:
+             LOG.info(f"Destroying instance, normal boot seen: {blame[0]}")
+             new_instance.delete()
+         else:
+             failure_count_network_delay += 1
+             failure_count_no_altnames_persisted += 1
+             LOG.info(f"--- Attempt {tries} found delayed instance boot: 
{blame[0]}")
+             reproducer = True
+             r = new_instance.execute("journalctl -u systemd-udevd.service -b 
0 --no-pager")
+             LOG.info(r)
+             if "Failure to rename" in str(r):
+                 LOG.info(f"Found rename refusal!: {r[0]}")
+     base_instance.delete()
+     new_instance.delete()
+     LOG.info(
+         TEST_SUMMARY_TMPL.format(
+             success_count_with_race=success_count_with_race,
+             success_count_no_race=success_count_no_race,
+             failure_count_network_delay=failure_count_network_delay,
+             failure_count_no_altnames=failure_count_no_altnames, tries=tries
+         )
+      )
  
  if __name__ == "__main__":
-     # Avoid polluting the log with azure info
-     logging.getLogger("paramiko").setLevel(logging.WARNING)
-     logging.getLogger("pycloudlib").setLevel(logging.WARNING)
-     logging.getLogger("adal-python").setLevel(logging.WARNING)
-     logging.getLogger("cli.azure.cli.core").setLevel(logging.WARNING)
-     logging.basicConfig(filename="sru-systemd.log", level=logging.INFO)
-     debug_systemd_image_launch_overlake_v5_with_snapshot()
- 
- 
+     # Avoid polluting the log with azure info
+     logging.getLogger("paramiko").setLevel(logging.WARNING)
+     logging.getLogger("pycloudlib").setLevel(logging.WARNING)
+     logging.getLogger("adal-python").setLevel(logging.WARNING)
+     logging.getLogger("cli.azure.cli.core").setLevel(logging.WARNING)
+     logging.basicConfig(filename="sru-systemd.log", level=logging.INFO)
+     debug_systemd_image_launch_overlake_v5_with_snapshot()
  
  [Where problems could occur]
  The patches effectively make it so that if a interface cannot be renamed from 
udev, then the new name is left as an alternative name as a fallback. If 
problems occur, it would be related to device renaming, and particularly 
related to the devices alternative names.

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to systemd in Ubuntu.
https://bugs.launchpad.net/bugs/2002445

Title:
  udev NIC renaming race with mlx5_core driver

Status in systemd package in Ubuntu:
  Fix Committed
Status in systemd source package in Focal:
  Triaged
Status in systemd source package in Jammy:
  Triaged
Status in systemd source package in Kinetic:
  New
Status in systemd source package in Lunar:
  Fix Committed

Bug description:
  [Impact]
  On systems with mellanox NICs, udev's NIC renaming races with the mlx5_core 
driver's own configuration of subordinate interfaces. When the kernel wins this 
race, the device cannot be renamed as udev has attempted, and this causes 
systemd-network-online.target to timeout waiting for links to be configured. 
This ultimately results in boot being delayed by about 2 minutes.

  [Test Plan]
  Repeated launches of Standard_D8ds_v5 instance types will generally hit this 
race around 1 in 10 runs. Create a vm snapshot with updated systemd from 
ppa:enr0n/systemd-245. Launch 100 Standard_D8ds_v5 instances with updated 
systemd. Assert not failure in cloud-init status and no 2 minute delay in 
network-online.target.

  To check for failure symptom:
    - Assert that network-online.target isn't the longest pole from 
systemd-analyze blame.

  To assert success condition during net rename busy race:
    - assert when "eth1" is still the primary device name, that two altnames 
are listed (preserving the altname due to the primary NIC rename being hit.

  Sample script uses pycloudlib to create modified base image for test
  and launches 100 VMs of type Standard_D8ds_v5, counting both successes
  and any failures seen.

  #!/usr/bin/env python3

  """Basic examples of various lifecycle with an Azure instance."""

  import logging
  import json

  import pycloudlib
  LOG = logging.getLogger()

  base_cfg = """#cloud-config
  ssh-import-id: [chad.smith, enr0n]
  """

  apt_cfg = """
  # Add developer PPA
  apt:
   sources:
     systemd-testing:
       source: "deb [allow-insecure=yes] 
https://ppa.launchpadcontent.net/enr0n/systemd-245/ubuntu focal main"
  # upgrade systemd after cloud-init is nearly done
  runcmd:
   - apt install systemd udev -y --allow-unauthenticated
  """

  debug_systemd_cfg = """
  # Create systemd-udev debug override.conf in base image
  write_files:
  - path: /etc/systemd/system/systemd-networkd.service.d/override.conf
    owner: root:root
    defer: {defer}
    content: |
      [Service]
      Environment=SYSTEMD_LOG_LEVEL=debug

  - path: /etc/systemd/system/systemd-udevd.service.d/override.conf
    owner: root:root
    defer: {defer}
    content: |
      [Service]
      Environment=SYSTEMD_LOG_LEVEL=debug
      LogRateLimitIntervalSec=0
  """

  cloud_config = base_cfg + apt_cfg + debug_systemd_cfg
  cloud_config2 = base_cfg + debug_systemd_cfg

  def debug_systemd_image_launch_overlake_v5_with_snapshot():
      """Test overlake v5 timeouts

      test procedure:
      - Launch base focal image
      - enable ppa:enr0n/systemd-245 and systemd/udev debugging
      - cloud-init clean --logs && deconfigure waalinux agent before shutdown
      - snapshot a base image
      - launch v5 system from snapshot
      - check systemd-analyze for expected timeout
      """
      client = pycloudlib.Azure(tag="azure")

      image_id = client.daily_image(release="focal")
      pub_path = "/home/ubuntu/.ssh/id_rsa.pub"
      priv_path = "/home/ubuntu/.ssh/id_rsa"

      client.use_key(pub_path, priv_path)

      base_instance = client.launch(
          image_id=image_id,
          instance_type="Standard_DS1_v2",
          user_data=cloud_config.format(defer="true"),
      )

      LOG.info(f"base instance: ssh ubuntu@{base_instance.ip}")
      base_instance.wait()
      LOG.info(base_instance.execute("apt cache policy systemd"))
      snapshotted_image_id = client.snapshot(base_instance)

      reproducer = False
      tries = 0
      success_count_with_race = 0
      success_count_no_race = 0
      failure_count_network_delay = 0
      failure_count_no_altnames = 0
      TEST_SUMMARY_TMPL = """
      ----- Test run complete: {tries} attempted -----
      Successes without rename race: {success_count_no_race}
      Successes with rename race and preserved altname: 
{success_count_with_race_persist_altname}
      Failures due to network delay: {failure_count_network_delay}
      Failures due to no altnames persisted: {failure_count_no_altnames}
      ===================================
      """
      while tries < 100 and not reproducer:
          tries += 1
          new_instance = client.launch(
              image_id=snapshotted_image_id,
              instance_type="Standard_D8ds_v5",
              user_data=cloud_config.format(defer="false"),
          )
          # breadcrumb for us pycloudlib/Azure  will reuse available IPs
          new_instance.wait()
          blame = new_instance.execute("systemd-analyze blame").splitlines()
          LOG.info(f"--- Attempt {tries} ssh ubuntu@{new_instance.ip} Blame: 
{blame[0]}")
          ip_addr = json.loads(new_instance.execute("ip -j addr").stdout)
          for d in ip_addr:
              if len(d.get('altnames', [])) > 1:
                  success_count_with_race_persist_altname += 1
                  LOG.info(
                      f"--- SUCCESS persisting altnames {d['altnames']} due to 
rename race on resource busy on {d['ifname']}"
                  )
                  LOG.info(new_instance.execute("journalctl -u 
systemd-udevd.service -b 0 --no-pager"))
              if d['ifname'] == "eth1" and len(d.get('altnames', [])) < 2:
                  failure_count_no_altnames += 1
                  LOG.error(f"FAILURE: to preserve altnames for {d['ifname']}. 
Only preserved {d.get('altnames', [])}")
              else:
                  success_count_no_race += 1
          LOG.info('\n'.join([f'{d["ifname"]}: {d.get("altnames")}' for d in 
ip_addr]))
          if "systemd-networkd-wait-online.service" not in blame[0]:
              LOG.info(f"Destroying instance, normal boot seen: {blame[0]}")
              new_instance.delete()
          else:
              failure_count_network_delay += 1
              failure_count_no_altnames_persisted += 1
              LOG.info(f"--- Attempt {tries} found delayed instance boot: 
{blame[0]}")
              reproducer = True
              r = new_instance.execute("journalctl -u systemd-udevd.service -b 
0 --no-pager")
              LOG.info(r)
              if "Failure to rename" in str(r):
                  LOG.info(f"Found rename refusal!: {r[0]}")
      base_instance.delete()
      new_instance.delete()
      LOG.info(
          TEST_SUMMARY_TMPL.format(
              success_count_with_race=success_count_with_race,
              success_count_no_race=success_count_no_race,
              failure_count_network_delay=failure_count_network_delay,
              failure_count_no_altnames=failure_count_no_altnames, tries=tries
          )
       )

  if __name__ == "__main__":
      # Avoid polluting the log with azure info
      logging.getLogger("paramiko").setLevel(logging.WARNING)
      logging.getLogger("pycloudlib").setLevel(logging.WARNING)
      logging.getLogger("adal-python").setLevel(logging.WARNING)
      logging.getLogger("cli.azure.cli.core").setLevel(logging.WARNING)
      logging.basicConfig(filename="sru-systemd.log", level=logging.INFO)
      debug_systemd_image_launch_overlake_v5_with_snapshot()

  [Where problems could occur]
  The patches effectively make it so that if a interface cannot be renamed from 
udev, then the new name is left as an alternative name as a fallback. If 
problems occur, it would be related to device renaming, and particularly 
related to the devices alternative names.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/2002445/+subscriptions


-- 
Mailing list: https://launchpad.net/~touch-packages
Post to     : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More help   : https://help.launchpad.net/ListHelp

[Touch-packages] [Bug 2002445] Re: udev NIC renaming race with mlx5_core driver

Reply via email to