On Thu, Dec 06, 2012 at 11:52:01AM -0500, Saggi Mizrahi wrote:
>
>
> ----- Original Message -----
> > From: "Shu Ming" <shum...@linux.vnet.ibm.com>
> > To: "Saggi Mizrahi" <smizr...@redhat.com>
> > Cc: "VDSM Project Development" <vdsm-devel@lists.fedorahosted.org>,
> > "engine-devel" <engine-de...@ovirt.org>
> > Sent: Thursday, December 6, 2012 11:02:02 AM
> > Subject: Re: [vdsm] RFC: New Storage API
> >
> > Saggi,
> >
> > Thanks for sharing your thoughts; I have some comments below.
> >
> >
> > Saggi Mizrahi:
> > > I've been throwing a lot of bits out about the new storage API and
> > > I think it's time to talk a bit.
> > > I will purposefully try to keep implementation details away and
> > > concentrate on how the API looks and how you use it.
> > >
> > > The first major change is in terminology: there is no longer a
> > > storage domain but a storage repository.
> > > This change is made because so many things are already called
> > > domain in the system, and this will make things less confusing for
> > > newcomers with a libvirt background.
> > >
> > > One other change is that repositories no longer have a UUID.
> > > The UUID was only used in the pool members manifest and is no
> > > longer needed.
> > >
> > >
> > > connectStorageRepository(repoId, repoFormat,
> > > connectionParameters={}):
> > > repoId - a transient name that will be used to refer to the
> > > connected domain; it is not persisted and doesn't have to be the
> > > same across the cluster.
> > > repoFormat - Similar to what used to be type (e.g. localfs-1.0,
> > > nfs-3.4, clvm-1.2).
> > > connectionParameters - This is format specific and will be used to
> > > tell VDSM how to connect to the repo.
> >
> >
> > Where does repoID come from? I think repoID doesn't exist before
> > connectStorageRepository() returns. Isn't repoID a return value of
> > connectStorageRepository()?
> No, repoIDs are no longer part of the domain, they are just a transient
> handle.
> The user can put whatever it wants there as long as it isn't already taken by
> another currently connected domain.
> >
> > >
> > > disconnectStorageRepository(self, repoId)
> > >
> > >
> > > In the new API there are only images; some images are mutable and
> > > some are not.
> > > Mutable images are also called VirtualDisks.
> > > Immutable images are also called Snapshots.
> > >
> > > There are no explicit templates; you can create as many images as
> > > you want from any snapshot.
> > >
> > > There are 4 major image operations:
> > >
> > >
> > > createVirtualDisk(targetRepoId, size, baseSnapshotId=None,
> > > userData={}, options={}):
> > >
> > > targetRepoId - ID of a connected repo where the disk will be
> > > created
> > > size - The size of the image you wish to create
> > > baseSnapshotId - the ID of the snapshot you want to base the new
> > > virtual disk on
> > > userData - optional data that will be attached to the new VD, could
> > > be anything that the user desires.
> > > options - options to modify VDSM's default behavior
> > >
> > > returns the ID of the new VD
> >
> > I think we will also need a function to check if a VirtualDisk is
> > based on a specific snapshot.
> > Like: isSnapshotOf(virtualDiskId, baseSnapshotID):
> No, the design is that volume dependencies are an implementation detail.
> There is no reason for you to know that an image is physically a snapshot of
> another.
> Logical snapshots, template information, and any other information can be set
> by the user by using the userData field available for every image.
Statements like this make me start to worry about your userData concept. It's a
sign of a bad API if the user needs to invent a custom metadata scheme for
itself. This reminds me of the abomination that is the 'custom' property in the
vm definition today.

> > > createSnapshot(targetRepoId, baseVirtualDiskId,
> > > userData={}, options={}):
> > > targetRepoId - The ID of a connected repo where the new snapshot
> > > will be created and the original image exists as well.
> > > size - The size of the image you wish to create
> > > baseVirtualDisk - the ID of a mutable image (Virtual Disk) you want
> > > to snapshot
> > > userData - optional data that will be attached to the new Snapshot,
> > > could be anything that the user desires.
> > > options - options to modify VDSM's default behavior
> > >
> > > returns the ID of the new Snapshot
> > >
> > > copyImage(targetRepoId, imageId, baseImageId=None, userData={},
> > > options={})
> > > targetRepoId - The ID of a connected repo where the new image will
> > > be created
> > > imageId - The image you wish to copy
> > > baseImageId - if specified, the new image will contain only the
> > > diff between baseImageId and imageId.
> > > If None, the new image will contain all the bits of
> > > imageId. This can be used to copy partial parts of
> > > images for export.
> > > userData - optional data that will be attached to the new image,
> > > could be anything that the user desires.
> > > options - options to modify VDSM's default behavior
> >
> > Does this function mean that we can copy the image from one
> > repository to another repository? Does it cover the semantics of
> > storage migration, storage backup, storage incremental backup?
> Yes, the main purpose is copying to another repo, and you can even do
> incremental backups.
> Also the following flow:
> 1. Run a VM using imageA
> 2. Write to disk
> 3. Stop VM
> 4. Copy imageA to repoB
> 5. Run a VM using imageA again
> 6. Write to disk
> 7. Stop VM
> 8.
Copy imageA again basing it off imageA_copy1 on repoB creating a diff on
> repo diff without snapshotting the original image.
> >
> > >
> > > returns the ID of the new image. In case of copying an immutable
> > > image the ID will be identical to the original image as they
> > > contain the same data. However, the user should not assume that and
> > > should always use the value returned from the method.
> > >
> > > removeImage(repositoryId, imageId, options={}):
> > > repositoryId - The ID of a connected repo where the image to delete
> > > resides
> > > imageId - The ID of the image you wish to delete.
> > >
> > >
> > > ----
> > > getImageStatus(repositoryId, imageId)
> > > repositoryId - The ID of a connected repo where the image to check
> > > resides
> > > imageId - The ID of the image you wish to check.
> > >
> > > All operations return once the operation has been committed to
> > > disk, NOT when the operation actually completes.
> > > This is done so that:
> > > - operations come to a stable state as quickly as possible.
> > > - In cases where there is an SDM, only a small portion of the
> > > operation actually needs to be performed on the SDM host.
> > > - No matter how many times the operation fails and on how many
> > > hosts, you can always resume the operation and choose when to do
> > > it.
> > > - You can stop an operation at any time and remove the resulting
> > > object, making a distinction between "stop because the host is
> > > overloaded" and "I don't want that image"
> > >
> > > This means that after calling any operation that creates a new
> > > image the user must then call getImageStatus() to check what the
> > > status of the image is.
> > > The status of the image can be either optimized, degraded, or
> > > broken.
> > > "Optimized" means that the image is available and you can run VMs
> > > off it.
> > > "Degraded" means that the image is available and will run VMs but
> > > there might be a better way VDSM can represent the underlying data.
> >
> > What does the "represent" mean here?
> Anything, but mostly image format RAW\QCOW2 when the performance strategy has
> been selected.
> > > "Broken" means that the image can't be used at the moment, probably
> > > because not all the data has been set up on the volume.
> > >
> > > Apart from that VDSM will also return the last persisted status
> > > information which will contain:
> > > hostID - the last host to try to optimize or fix the image
> > Any host can optimize the image? No need to be SDM?
> On anything but lvm based block domains there will not even be an SDM.
> On SDM based domains we will try as hard as we can to have as many operations
> executable on any host.
> >
> > > stage - X/Y (e.g. 1/10) the last persisted stage of the fix.
> > > percent_complete - -1 or 0-100, the last persisted completion
> > > percentage of the aforementioned stage. -1 means that no progress
> > > is available for that operation.
> > > last_error - This will only be filled if the operation failed
> > > because of something other than IO or a VDSM crash, for obvious
> > > reasons.
> > > It will usually be set if the task was manually
> > > stopped
> > >
> > > The user can either be satisfied with that information or ask the
> > > host specified in hostID if it is still working on that image by
> > > checking its running tasks.
> >
> > So we need a function to know what tasks are running on the image
> getImageStatus()
> >
> > >
> > > checkStorageRepository(self, repositoryId, options={}):
> > > A method to go over a storage repository and scan for any existing
> > > problems. This includes degraded\broken images and deleted images
> > > that have not yet been physically deleted\merged.
> > > It returns a list of Fix objects.
> > > Fix objects come in 4 types:
> > > clean - cleans data, run them to get more space.
> > > optimize - run them to optimize a degraded image
> > > merge - Merges two images together.
Doing this sometimes makes more images ready for optimizing or cleaning.
> > > The reason it is different from optimize is that
> > > unmerged images are considered optimized.
> > > mend - mends a broken image
> > >
> > > The user can read these types and prioritize fixes. Fixes also
> > > contain opaque FIX data and they should be sent as received to
> > > fixStorageRepository(self, repositoryId, fix, options={}):
> > >
> > > That will start a fix operation.
> > >
> > >
> > > All major operations automatically start the appropriate "Fix" to
> > > bring the created object to an optimized\degraded state (the one
> > > that is quicker) unless one of the options is
> > > AutoFix=False. This is only useful for repos that might not be able
> > > to create volumes on all hosts (SDM) but would like to have the
> > > actual IO distributed in the cluster.
> > >
> > > Another common option is the strategy option:
> > > It currently has 2 possible values,
> > > space and performance - In case VDSM has 2 ways of completing the
> > > same operation it will tell it to value one over the other. For
> > > example, whether to copy all the data or just create a qcow based
> > > off a snapshot.
> > > The default is space.
> > >
> > > You might have also noticed that it is never explicitly specified
> > > where to look for existing images. This is done purposefully; VDSM
> > > will always look in all connected repositories for existing
> > > objects.
> > > For very large setups this might be problematic. To mitigate the
> > > problem you have these options:
> > > participatingRepositories=[repoId, ...] which tells VDSM to narrow
> > > the search to just these repositories
> > > and
> > > imageHints={imgId: repoId} which will force VDSM to look for those
> > > image IDs just in those repositories and fail if it doesn't find
> > > them there.
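To make the check-and-fix flow quoted above a bit more concrete, here is a
rough client-side sketch in Python. It is only an illustration under stated
assumptions, not VDSM code: FakeVdsm is a hypothetical in-memory stand-in, the
Fix class with its "type" and "data" fields is a guess at what a Fix object
might look like, and the FIX_PRIORITY ordering is just one policy a user could
pick. Only the method names checkStorageRepository() and
fixStorageRepository() and the four fix types come from the proposal.

```python
from dataclasses import dataclass

# Hypothetical policy: the proposal only says the user can read the fix
# types and prioritize them; this particular order is an assumption.
FIX_PRIORITY = {"mend": 0, "merge": 1, "optimize": 2, "clean": 3}


@dataclass(frozen=True)
class Fix:
    """Assumed shape of a Fix object (not defined in the proposal)."""
    type: str          # one of: clean, optimize, merge, mend
    data: bytes = b""  # opaque FIX data, sent back exactly as received


class FakeVdsm:
    """In-memory stand-in for a VDSM connection, for illustration only."""

    def __init__(self, pending):
        self._pending = dict(pending)  # repositoryId -> list of Fix
        self.started = []              # (repositoryId, Fix) in start order

    def checkStorageRepository(self, repositoryId, options=None):
        # Return the fixes currently reported for the repository.
        return list(self._pending.get(repositoryId, []))

    def fixStorageRepository(self, repositoryId, fix, options=None):
        # Record that a fix operation was started.
        self.started.append((repositoryId, fix))


def prioritize(fixes):
    """Sort fixes by the assumed priority table (mend first, clean last)."""
    return sorted(fixes, key=lambda fix: FIX_PRIORITY[fix.type])


def run_fixes(vdsm, repo_id):
    """Scan a repository and start every reported fix in priority order."""
    fixes = prioritize(vdsm.checkStorageRepository(repo_id))
    for fix in fixes:
        vdsm.fixStorageRepository(repo_id, fix)  # fix passed through untouched
    return [fix.type for fix in fixes]


if __name__ == "__main__":
    vdsm = FakeVdsm({"repoA": [Fix("clean"), Fix("mend"), Fix("optimize")]})
    print(run_fixes(vdsm, "repoA"))
```

The point of the sketch is that the client never needs to interpret the opaque
data; it only reads the type, decides on an order, and hands each Fix back
unchanged.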
> > > _______________________________________________
> > > vdsm-devel mailing list
> > > vdsm-devel@lists.fedorahosted.org
> > > https://lists.fedorahosted.org/mailman/listinfo/vdsm-devel
> >
> >
> > --
> > ---
> > 舒明 Shu Ming
> > Open Virtualization Engineerning; CSTL, IBM Corp.
> > Tel: 86-10-82451626 Tieline: 9051626 E-mail: shum...@cn.ibm.com or
> > shum...@linux.vnet.ibm.com
> > Address: 3/F Ring Building, ZhongGuanCun Software Park, Haidian
> > District, Beijing 100193, PRC

-- 
Adam Litke <a...@us.ibm.com>
IBM Linux Technology Center