------- Comment From [email protected] 2016-05-25 12:17 EDT-------

Hello Canonical,

after talking with different folks here and trying to figure out why this fails, we found the likely bug, and it is in the kernel after all.

When scanning the SCSI LUNs behind a newly found FCP remote port (during activation of an FCP device), Linux also tries to attach a device handler to each LUN it finds. With our storage systems it tries to attach the ALUA device handler (scsi_dh_alua). While the DH is being attached, the scsi-disk driver (sd) is also activated for the same LUN, so the two run in parallel. Schematically this looks roughly like this (I hope this ASCII art survives in Bugzilla):

Thread A                                 | Thread B
                                         |
scsi_report_lun_scan()                   |
  :                                      |
send REPORT LUNS                         |
  :                                      |
for each LUN sequentially:               |
  :                                      |
  +------>| scsi_probe_and_add_lun()     |
  |       |   :                          | Kick SD for this LUN:
  |       | scsi_dh_add_device()         |   :
  |       |   :                          |   :
  |       | alua_bus_attach()            |   :
  |       |   :                          |   :
  |       | alua_vpd_inquiry()           | sd_spinup_disk()
  |       |   :                          |   :
  |       | .............                |   :
  |       |                              |
  | continue with next LUN               |
  +-------+                              |
  :                                      |
exit                                     v

Now it happens that sd_spinup_disk() sends the first command that is neither an INQUIRY nor a REPORT LUNS (a TUR) in parallel to the INQUIRY that is sent by alua_vpd_inquiry(). This TUR raises a Check Condition with sense key Unit Attention because of a reset in the storage. That is normal when the LUN is attached for the first time after the remote port was opened (SAM-4, 5.14).

Because the storage has a default value of QErr=01b set in the Control Mode Page (SPC-4, 7.5.8), this Check Condition causes an abort of all running commands in the same task set (same I_T nexus; SAM-4, 5.6). For this abort, no status is returned for the affected commands in the same I_T nexus as the command that received the Check Condition (also SPC-4, 7.5.8).

So in essence the storage just forgets all commands other than the first TUR, and that TUR gets the Unit Attention (all commands sent before the TUR are either INQUIRY or REPORT LUNS, and those never raise a Unit Attention; SAM-4, 5.14). This is why the INQUIRY sent by alua_vpd_inquiry() ultimately times out after one minute, and that timeout aborts the whole LUN scan in scsi_report_lun_scan().

Of course this depends on specific timings, namely when the INQUIRY arrives and when the TUR does. But as written in this report, we have seen this in multiple completely independent setups.

I looked at the changes made upstream by the maintainer of scsi_dh_alua since the release of 4.4, and there have been some very substantial ones (the recent changes were titled "ALUA device handler update, part II" on LKML). Those changes mitigate this problem.
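
The race described above can be illustrated with a small toy model. This is only a sketch of the reading given in this comment, not kernel or target firmware code: the `TaskSet` class, the `race()` helper and the tag names are hypothetical, and the model simply assumes that with QErr=01b the other commands in the task set are dropped without any status being returned.

```python
from dataclasses import dataclass, field
from typing import Optional

# Commands that, per the description above, never report a pending Unit Attention.
UA_EXEMPT = {"INQUIRY", "REPORT LUNS"}


@dataclass
class TaskSet:
    """Toy model of one task set on a single I_T nexus (illustration only)."""
    qerr: int = 0b01                 # QErr field of the Control Mode Page
    unit_attention: bool = True      # pending after the reset / first attach
    in_flight: dict = field(default_factory=dict)   # tag -> opcode
    dropped: list = field(default_factory=list)     # tags aborted without status

    def submit(self, tag: str, opcode: str) -> None:
        """Queue a command; it stays in flight until finished or aborted."""
        self.in_flight[tag] = opcode

    def finish(self, tag: str) -> Optional[str]:
        """Return the status the initiator sees, or None if it sees nothing."""
        opcode = self.in_flight.pop(tag, None)
        if opcode is None:
            # Aborted earlier: the initiator only ever notices a timeout.
            return None
        if self.unit_attention and opcode not in UA_EXEMPT:
            self.unit_attention = False
            if self.qerr == 0b01:
                # Assumed behaviour: abort everything else, report no status.
                self.dropped.extend(self.in_flight)
                self.in_flight.clear()
            return "CHECK CONDITION"
        return "GOOD"


def race(qerr: int, tur_first: bool) -> dict:
    """Run the INQUIRY and the TUR concurrently; tur_first picks who finishes first."""
    ts = TaskSet(qerr=qerr)
    ts.submit("inquiry", "INQUIRY")        # stands in for alua_vpd_inquiry()
    ts.submit("tur", "TEST UNIT READY")    # stands in for sd_spinup_disk()
    order = ["tur", "inquiry"] if tur_first else ["inquiry", "tur"]
    return {tag: ts.finish(tag) for tag in order}


if __name__ == "__main__":
    print(race(0b01, tur_first=True))    # {'tur': 'CHECK CONDITION', 'inquiry': None}
    print(race(0b01, tur_first=False))   # {'inquiry': 'GOOD', 'tur': 'CHECK CONDITION'}
    print(race(0b00, tur_first=True))    # {'tur': 'CHECK CONDITION', 'inquiry': 'GOOD'}
```

In this model the INQUIRY only goes missing when the TUR is terminated while the INQUIRY is still in flight and QErr is 01b, which matches the timing dependence described above. A `None` status corresponds to the command that eventually times out on the initiator side.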

The upstream changes alter the timing of the commands sent by scsi_dh_alua and completely remove the INQUIRYs it used to send on its own (it uses already gathered information instead). With the INQUIRY removed and the timing changed, there are no overlaps anymore and the LUN scan doesn't abort half-way through (which, BTW, was also changed in 4.5: failures in scsi_dh_add_device() won't abort the whole scan anymore). In my own small tests I couldn't reproduce this problem with 4.6 yet.

I might add, though, that in my opinion the main issue here is not yet fixed: sending commands in parallel with QErr set to 01b might lead to forgotten commands in case of a Check Condition in the same I_T nexus. I have spent quite some time reading the mentioned parts of the SCSI standards and have not reached any other conclusion as of now, but I am also not that fluent with them yet.

I hope this helps you find a good solution here.

- Benjamin

-- 
https://bugs.launchpad.net/bugs/1567602

Title:
  FCP devices are not detected correctly nor deterministically
