> On Nov 4, 2020, at 11:40 PM, Gulam Mohamed <gulam.moha...@oracle.com> wrote:
>
> Description
> =========
> 1. This Kernel panic could be due to a timing issue when there is a race
> between the sync thread and the initiator was processing of a login response
> from the target. The session re-open can be invoked from two places
> a. Sessions sync thread when the iscsid restart
> b. From iscsid through iscsi error handler
> 2. The session reopen sequence is as follows in user-space
> (iscsi-initiator-utils)
> a. Disconnect the connection
> b. Then send the stop connection request to the kernel which
> releases the connection (releases the socket)
> c. Queues the reopen for 2 seconds delay
> d. Once the delay expires, create the TCP connection again by
> calling the connect() call
> e. Poll for the connection
> f. When poll is successful i.e when the TCP connection is
> established, it performs
> i. Creation of session and connection data structures
> ii. Bind the connection to the session. This is the place where we
> assign the sock to tcp_sw_conn->sock
> iii. Sets the parameters like target name, persistent address etc .
> iv. Creates the login pdu
> v. Sends the login pdu to kernel
> vi. Returns to the main loop to process further events. The kernel then
> sends the login request over to the target node
> g. Once login response with success is received, it enters into full
> feature phase and sets the negotiable parameters like max_recv_data_length,
> max_transmit_length, data_digest etc .
> 3. While setting the negotiable parameters by calling
> "iscsi_session_set_neg_params()", kernel panicked as sock was NULL
>
> What happened here is
> --------------------------------
> 1. Before initiator received the login response mentioned in above point
> 2.f.v, another reopen request was sent from the error handler/sync session
> for the same session, as the initiator utils was in main loop to process
> further events (as
> mentioned in point 2.f.vi above).
> 2. While processing this reopen, it stopped the connection which released
> the socket and queued this connection and at this point of time the login
> response was received for the earlier one
To make sure I got this you are saying before iscsi_sw_tcp_conn_stop calls
iscsi_sw_tcp_release_conn that the recv path has called iscsi_recv_pdu right?
> 3. The kernel passed over this response to user-space which then sent the
> set_neg_params request to kernel
> 4. As the connection was stopped, the sock was NULL and hence while the
> kernel was processing the set param request from user-space, it panicked
>
> Fix
> ----
> 1. While setting the set_param in kernel, we need to check if sock is NULL
> 2. If the sock is NULL, then return EPERM (operation not permitted)
> 3. Due to this error handler will be invoked in user-space again to
> recover the session
>
> Signed-off-by: Gulam Mohamed <gulam.moha...@oracle.com>
> Reviewed-by: Junxiao Bi <junxiao...@oracle.com>
> ---
> diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
> index df47557a02a3..fd668a194053 100644
> --- a/drivers/scsi/iscsi_tcp.c
> +++ b/drivers/scsi/iscsi_tcp.c
> @@ -711,6 +711,12 @@ static int iscsi_sw_tcp_conn_set_param(struct
> iscsi_cls_conn *cls_conn,
> struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
> struct iscsi_sw_tcp_conn *tcp_sw_conn = tcp_conn->dd_data;
>
> + if (!tcp_sw_conn->sock) {
> + iscsi_conn_printk(KERN_ERR, conn,
> + "Cannot set param as sock is NULL\n");
> + return -ENOTCONN;
> + }
> +
I think this might have 2 issues:
1. It only fixes iscsi_tcp. What about other drivers that free/null
resources/fields in ep_disconnect that we layer need to access in the set_param
callback (the cxgbi drivers)?
2. What about drivers that do not free/null fields (be2iscsi, bnx2i, ql4xxx and
qedi) so the set_param calls end up just working. What state will userspace be
in where we have logged in, but iscsid also thinks we are in the middle of
trying to login?
I think we could do:
1. On ep_disconnect and stop_conn set some iscsi_cls_conn bit that indicates
set_param calls for connection level settings should fail. scsi_transport_iscsi
could set and check this bit for all drivers.
2. On bind_conn clear the bit.
--
You received this message because you are subscribed to the Google Groups
"open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to open-iscsi+unsubscr...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/open-iscsi/05277786-2E1F-432D-AE73-F39565C6BEA4%40oracle.com.