On 28/01/12 04:50, Itamar Heim wrote:
> top posting since there was a long thread on this anyway.
> some questions/comments:
>
> 1. about the CIDs - it sounds like the engine needs to persist this
> info, so it can resume normally in case of a failure/restart (this is
> different than today, when the persisted info is the connection details,
> rather than some generated identifier)?
This info should be persisted in the engine, in addition to the
connection details.

>
> 2. sounds like the engine needs to block in certain cases after a
> manageConnection to make sure it is there and alive before doing an
> operation.
> this means now engine has to check a host has all relevant connections
> online before choosing it as a target for live migration even for a
> regular VM (all disks on a storage domain).

With the current flow this is not needed for a 'regular VM'. The engine
currently does not monitor the storage domain's connections on a
periodic basis, because the storage domain status represents the
availability of the domain.

> worse/uglier (well, imho), in case of a disk based on a direct LUN, the
> engine needs to actively connect the target host, poll till it's up, and
> only then live migrate (would be much nicer if vdsm migration protocol
> would have taken care of this manageConnection call (preserving the CID?))
>
> 3. in unmanageStorageServer(connectionID) below you finish with
> "Returns:
> Success code if VDSM was able to unmanage the connection.
> It will return an error if the CID is not registered with VDSM.
> Disconnect failures are not reported. Active unmanaged connections can
> be tracked with getStorageServerList()"
>
> it is not clear if vdsm will retry to disconnect, and how races between
> those retries and new manage connection requests will be handled.
> if the connection only becomes unmanaged, there is no way to track and
> clean it up (engine is not supposed to touch the unmanaged connections)
>
> 4. I don't think we handle this today, but while we are planning for the
> future - what if the host needs one of the connections to exist
> regardless of engine for another need (say it does boot from network
> from same iscsi target - this is an unmanaged connection which you will
> disconnect based on the CID refcount concept).
> i.e., what happens if the host has an unmanaged connection, which
> becomes a managed one.
> solving this probably means when adding a connection, need to add an
> unmanaged_existed_before CID for refcount?
>
>
> On 01/23/2012 11:54 PM, Saggi Mizrahi wrote:
>> I have begun work on changing how API clients can control storage
>> connections when interacting with VDSM.
>>
>> Currently there are 2 API calls:
>> connectStorageServer() - Will connect to the storage target if the
>> host is not already connected to it.
>> disconnectStorageServer() - Will disconnect from the storage target if
>> the host is connected to it.
>>
>> This API is very simple but is inappropriate when multiple clients and
>> flows try to access the same storage.
>>
>> This is currently solved by trying to synchronize things inside rhevm.
>> This is hard and convoluted. It also brings out issues with other
>> clients using the VDSM API.
>>
>> Another problem is error recovery. Currently ovirt-engine (OE) has no
>> way of monitoring the connections on all the hosts, and if a connection
>> disappears it's OE's responsibility to reconnect.
>>
>> I suggest a different concept where VDSM 'manages' the connections.
>> VDSM receives a manage request with the connection information and
>> from that point forward VDSM will try to keep this connection alive.
>> If the connection fails VDSM will automatically try and recover.
>>
>> Every manage request will also have a connection ID (CID). This CID
>> will be used when the same client asks to unmanage the connection.
>> When multiple requests for manage are received to the same connection
>> they all have to have their own unique CID. By internally mapping CIDs
>> to actual connections VDSM can properly disconnect when no CID is
>> addressing the connection. This allows each client and even each flow
>> to have its own CID, effectively eliminating connect/disconnect races.
>>
>> The change from (dis)connect to (un)manage also changes the semantics
>> of the calls significantly.
>> Whereas connectStorageServer would have returned when the storage is
>> either connected or failed to connect, manageStorageServer will return
>> once VDSM registered the CID. This means that the connection might not
>> be active immediately as the VDSM tries to connect. The connection
>> might remain down for a long time if the storage target is down or is
>> having issues.
>>
>> This allows for VDSM to receive the manage request even if the storage
>> is having issues and recover as soon as it's operational without user
>> intervention.
>>
>> In order for the client to query the current state of the connections
>> I propose getStorageConnectionList(). This will return a mapping of
>> CID to connection status. The status contains the connection info
>> (excluding credentials), whether the connection is active, whether the
>> connection is managed (unmanaged connections are returned with
>> transient IDs), and, if the connection is down, the last error
>> information.
>>
>> The same actual connection can be returned multiple times, once for
>> each CID.
>>
>> For cases where an operation requires a connection to be active a user
>> can poll the status of the CID. The user can then choose to poll for a
>> certain amount of time or until an error appears in the error field of
>> the status. This will give you either a timeout or a "try once"
>> semantic depending on the flow's needs.
>>
>> All connections that have been managed persist across VDSM restarts
>> and will be managed until a corresponding unmanage command has been
>> issued.
>>
>> There is no concept of temporary connections as "temporary" is flow
>> dependent and VDSM can't accommodate all interpretations of
>> "temporary". An ad-hoc mechanism can be built using the CID field. For
>> instance a client can manage a connection with "ENGINE_FLOW101_CON1".
>> If the flow got interrupted the client can clean all IDs with certain
>> flow IDs.
>>
>> I think this API gives safety, robustness, and implementation freedom.
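To make the CID refcount and the polling semantics above concrete, here
is a rough sketch in Python. It is only a toy model under stated
assumptions, not VDSM code: ConnectionRegistry and wait_until_connected
are hypothetical names, and the 'connected' / 'lastError' keys are
borrowed from the example return value in the proposal (where the actual
key names may differ).

```python
import time
from collections import defaultdict


class ConnectionRegistry:
    """Toy model of CID -> connection bookkeeping (hypothetical, not VDSM)."""

    def __init__(self):
        self._cid_to_uri = {}                 # CID -> connection URI
        self._uri_to_cids = defaultdict(set)  # URI -> CIDs referencing it

    def manage(self, uri, cid):
        # Mirrors the proposal: only sanity-check arguments and CID uniqueness.
        if "/" in cid:
            raise ValueError("CID may not contain '/'")
        if cid in self._cid_to_uri:
            raise ValueError("CID already in use: %s" % cid)
        self._cid_to_uri[cid] = uri
        self._uri_to_cids[uri].add(cid)

    def unmanage(self, cid):
        """Drop a CID; return True if it was the last reference to its URI."""
        if cid not in self._cid_to_uri:
            raise KeyError("CID not registered: %s" % cid)
        uri = self._cid_to_uri.pop(cid)
        cids = self._uri_to_cids[uri]
        cids.discard(cid)
        if not cids:
            del self._uri_to_cids[uri]
            return True   # a real implementation would try to disconnect here
        return False


def wait_until_connected(get_status, cid, timeout, interval=1.0,
                         stop_on_error=True, sleep=time.sleep):
    """Poll a CID until it is connected, reports an error, or times out.

    get_status is any callable returning a {CID: status} mapping.
    stop_on_error=True gives the "try once" semantic; False gives a
    plain timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()[cid]
        if status["connected"]:
            return True
        if stop_on_error and status["lastError"] != 0:
            return False
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

With this model, two flows managing the same target under different CIDs
only trigger a disconnect when the second one calls unmanage, which is
the race-avoidance property the proposal is after.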
>>
>>
>> Nitty Gritty:
>>
>> manageStorageServer
>> ===================
>> Synopsis:
>> manageStorageServer(uri, connectionID):
>>
>> Parameters:
>> uri - a uri pointing to a storage target (eg: nfs://server:export,
>> iscsi://host/iqn;portal=1)
>> connectionID - string with any char except "/".
>>
>> Description:
>> Tells VDSM to start managing the connection. From this moment on VDSM
>> will try and have the connection available when needed. VDSM will
>> monitor the connection and will automatically reconnect on failure.
>>
>> Returns:
>> Success code if VDSM was able to manage the connection.
>> It usually just verifies that the arguments are sane and that the CID
>> is not already in use.
>> This doesn't mean the host is connected.
>> ----
>> unmanageStorageServer
>> =====================
>> Synopsis:
>> unmanageStorageServer(connectionID):
>>
>> Parameters:
>> connectionID - string with any char except "/".
>>
>> Description:
>> Tells VDSM to stop managing the connection. VDSM will try and
>> disconnect from the storage target if this is the last CID referencing
>> the storage connection.
>>
>> Returns:
>> Success code if VDSM was able to unmanage the connection.
>> It will return an error if the CID is not registered with VDSM.
>> Disconnect failures are not reported. Active unmanaged connections can
>> be tracked with getStorageServerList()
>> ----
>> getStorageServerList
>> ====================
>> Synopsis:
>> getStorageServerList()
>>
>> Description:
>> Will return list of all managed and unmanaged connections. Unmanaged
>> connections have temporary IDs and are not guaranteed to be consistent
>> across calls.
>>
>> Results:
>> A mapping between CIDs and the status.
>> example return value (Actual key names may differ)
>>
>> {'conA': {'connected': True, 'managed': True, 'lastError': 0,
>>           'connectionInfo': {
>>               'remotePath': 'server:/export',
>>               'retrans': 3,
>>               'version': 4}},
>>  'iscsi_session_34': {'connected': False, 'managed': False,
>>                       'lastError': 339, 'connectionInfo': {
>>                           'hostname': 'dandylopn',
>>                           'portal': 1}}}
>> _______________________________________________
>> Engine-devel mailing list
>> engine-de...@ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/engine-devel
>
_______________________________________________
vdsm-devel mailing list
vdsm-devel@lists.fedorahosted.org
https://fedorahosted.org/mailman/listinfo/vdsm-devel