Re: [vdsm] RFC: New Storage API

Saggi Mizrahi Thu, 06 Dec 2012 08:36:56 -0800


----- Original Message -----
> From: "Tony Asleson" <tasle...@redhat.com>
> To: vdsm-devel@lists.fedorahosted.org
> Sent: Wednesday, December 5, 2012 4:48:34 PM
> Subject: Re: [vdsm] RFC: New Storage API
> 
> On 12/04/2012 03:52 PM, Saggi Mizrahi wrote:
> > I've been throwing a lot of bits out about the new storage API and
> > I think it's time to talk a bit.
> > I will purposefully try and keep implementation details away and
> > concentrate on how the API looks and how you use it.
> > 
> > First major change is in terminology, there is no longer a storage
> > domain but a storage repository.
> > This change is done because so many things are already called
> > domain in the system and this will make things less confusing for
> > newcomers with a libvirt background.
> > 
> > Another change is that repositories no longer have a UUID.
> > The UUID was only used in the pool members manifest and is no
> > longer needed.
> > 
> > 
> > connectStorageRepository(repoId, repoFormat,
> > connectionParameters={}):
> > repoId - is a transient name that will be used to refer to the
> > connected domain, it is not persisted and doesn't have to be the
> > same across the cluster.
> > repoFormat - Similar to what used to be type (eg. localfs-1.0,
> > nfs-3.4, clvm-1.2).
> > connectionParameters - This is format specific and will be used to
> > tell VDSM how to connect to the repo.
> > 
> > disconnectStorageRepository(self, repoId):
> > 
> > 
> > In the new API there are only images, some images are mutable and
> > some are not.
> > mutable images are also called VirtualDisks
> > immutable images are also called Snapshots
> > 
> > There are no explicit templates, you can create as many images as
> > you want from any snapshot.
> > 
> > There are 4 major image operations:
> > 
> > 
> > createVirtualDisk(targetRepoId, size, baseSnapshotId=None,
> >                   userData={}, options={}):
> > 
> > targetRepoId - ID of a connected repo where the disk will be
> > created
> > size - The size of the image you wish to create
> > baseSnapshotId - the ID of the snapshot you want to base the new
> > virtual disk on
> > userData - optional data that will be attached to the new VD, could
> > be anything that the user desires.
> > options - options to modify VDSM's default behavior
> > 
> > returns the id of the new VD
> 
> I'm guessing there will be a way to find out how much space is
> available
> for a specified repo before you try to create a virtual disk on it?
This is in the repo API which is not really detailed here.
In any case, due to the nature of storage, you can never tell how much space an 
image is going to actually take.
You have over-committing, thin provisioning, sparse volumes, native snapshots, 
compression, de-dupe, soft raid (btrfs\zfs), check-summing, metadata backups, 
metadata per-operation (btrfs), and more.
VDSM might also leave the image in degraded mode if there is no room to 
complete the action.


If you want to create an image you should just give it a whirl; you should also 
always leave a certain percentage free.
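As a rough sketch of the "leave a percentage free" advice, in Python. The function name, the 10% default and the idea that the caller already has total and free byte counts from the repo API are all assumptions made for illustration; the repo API itself is not specified here.

```python
def reserve_is_intact(total_bytes, free_bytes, reserve_percent=10.0):
    """Return True if at least reserve_percent of the repo is still free.

    Hypothetical helper: the byte counts would come from whatever the
    repo API ends up exposing. It says nothing about how much space a
    new image will really consume, which cannot be known up front.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Compare free/total against the reserve without dividing.
    return free_bytes * 100 >= total_bytes * reserve_percent
```

A caller could check this after each create and stop creating images (or run clean fixes) once it returns False, rather than trying to predict image sizes.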
> 
> > 
> > createSnapshot(targetRepoId, baseVirtualDiskId,
> >                userData={}, options={}):
> > targetRepoId - The ID of a connected repo where the new snapshot
> > will be created and the original image exists as well.
> > size - The size of the image you wish to create
> > baseVirtualDiskId - the ID of a mutable image (Virtual Disk) you want
> > to snapshot
> > userData - optional data that will be attached to the new Snapshot,
> > could be anything that the user desires.
> > options - options to modify VDSM's default behavior
> > 
> > returns the id of the new Snapshot
> > 
> > copyImage(targetRepoId, imageId, baseImageId=None, userData={},
> > options={})
> > targetRepoId - The ID of a connected repo where the new image will
> > be created
> > imageId - The image you wish to copy
> > baseImageId - if specified, the new image will contain only the
> > diff between imageId and baseImageId.
> >               If None the new image will contain all the bits of
> >               imageId. This can be used to copy partial parts of
> >               images for export.
> > userData - optional data that will be attached to the new image,
> > could be anything that the user desires.
> > options - options to modify VDSM's default behavior
> > 
> > return the Id of the new image. In case of copying an immutable
> > image the ID will be identical to the original image as they
> > contain the same data. However the user should not assume that and
> > always use the value returned from the method.
> 
> Can the target repo id be itself?  The case where a user wants to
> make a
> copy of a virtual disk in the same repo.  A caller could snapshot the
> virtual disk and then create a virtual disk from the snapshot, but if
> the target repo could be the same as source repo then they could use
> this call as long as the returned ID was different.
> 
> Does imageId IO need to be quiesced before calling this or will that
> be
> handled in the implementation (eg. snapshot first)?
Copy of an image is possible to the same repo.
Copy of a snapshot to the same repo will not work, there is also no reason to 
do that as you get the same object and as it's read-only there is never a 
reason to have 2 on the same repo.
> 
> > removeImage(repositoryId, imageId, options={}):
> > repositoryId - The ID of a connected repo where the image to delete
> > resides
> > imageId - The id of the image you wish to delete.
> >
> 
> What is the behavior if you delete snapshots or virtual disks that
> have
> dependencies on one another?  For example, delete the snapshot a
> virtual
> disk is based on or delete the virtual disk a snapshot is based on?
This is an implementation detail.
removeImage in actuality doesn't remove anything from the disk, it just marks 
the image as unavailable to the user.
checkStorageRepository() does a dependency check when generating 
clean\merge fixes.
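To make the "mark now, clean later" idea concrete, here is a toy in-memory model in Python. It is only one possible reading of the behaviour: the class, its methods and the tuple shape of a fix are invented for illustration, and merge fixes are left out entirely.

```python
class RepoModel:
    """Toy model: removal only hides an image, check() finds what can go."""

    def __init__(self):
        self.base = {}        # image id -> id of the image it is based on
        self.removed = set()  # ids marked as unavailable to the user

    def add_image(self, image_id, base_id=None):
        self.base[image_id] = base_id

    def remove_image(self, image_id):
        # Nothing is deleted here, the image is only marked.
        self.removed.add(image_id)

    def visible_images(self):
        return sorted(i for i in self.base if i not in self.removed)

    def check(self):
        """Return ("clean", id) for removed images nothing visible needs."""
        needed = set()
        for image_id in self.visible_images():
            parent = self.base[image_id]
            while parent is not None:  # walk the whole base chain
                needed.add(parent)
                parent = self.base[parent]
        return [("clean", i) for i in sorted(self.removed - needed)]
```

In this model, removing a snapshot that a visible virtual disk is based on yields no clean fix until the disk is removed too; a real implementation would presumably offer a merge fix in that case.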
> 
> > 
> > ----
> > getImageStatus(repositoryId, imageId)
> > repositoryId - The ID of a connected repo where the image to check
> > resides
> > imageId - The id of the image you wish to check.
> > 
> > All operations return once the operation has been committed to
> > disk NOT when the operation actually completes.
> > This is done so that:
> > - operations come to a stable state as quickly as possible.
> > - In case where there is an SDM, only small portion of the
> > operation actually needs to be performed on the SDM host.
> > - No matter how many times the operation fails and on how many
> > hosts, you can always resume the operation and choose when to do
> > it.
> > - You can stop an operation at any time and remove the resulting
> > object making a distinction between "stop because the host is
> > overloaded" to "I don't want that image"
> > 
> > This means that after calling any operation that creates a new
> > image the user must then call getImageStatus() to check what is
> > the status of the image.
> > The status of the image can be either optimized, degraded, or
> > broken.
> > "Optimized" means that the image is available and you can run VMs
> > off it.
> > "Degraded" means that the image is available and will run VMs but
> > there might be a better way VDSM can represent the underlying data.
> > "Broken" means that the image can't be used at the moment, probably
> > because not all the data has been set up on the volume.
> 
> So while an operation is executing asynchronously the state is
> broken?
> How do you distinguish between an operation that ends in error and
> one
> that is currently running?
It's because you don't have operations in the classic sense.
When you do createImage() you are not creating an image, you are actually 
saying "Please try as best as you can to create an image".
VDSM will forever aspire to complete this task. This is done through the 
checkRepo()\fixRepo() verbs.
If the last_error field in the task progress of the image status is 
"operation_complete" or "operation_stopped" and the state of the image is 
either "degraded" or "optimized" you know that you can run a VM.
If last_error is anything else you use the value in the host_id field and check 
the last host to update the task. If there is no task running on that host you 
know the operation isn't running at the moment and if the state is either 
"optimized" or "degraded" you can run a VM.
You can also run a Fix on a host to continue the operation.
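The decision above can be sketched in Python roughly as follows. The ImageStatus shape and the host_has_running_task callback are assumptions for illustration (the real status structure is not defined in this thread), and returning False while a task is still running is my conservative reading, not a stated rule.

```python
from dataclasses import dataclass
from typing import Callable

RUNNABLE_STATES = {"optimized", "degraded"}
FINISHED_MARKERS = {"operation_complete", "operation_stopped"}


@dataclass
class ImageStatus:
    state: str       # "optimized", "degraded" or "broken"
    last_error: str  # last persisted value from the task progress
    host_id: str     # last host to update the task


def can_run_vm(status: ImageStatus,
               host_has_running_task: Callable[[str], bool]) -> bool:
    """Decide whether a VM can be run off the image right now."""
    if status.state not in RUNNABLE_STATES:
        return False  # broken images are never runnable
    if status.last_error in FINISHED_MARKERS:
        return True   # the operation finished or was stopped cleanly
    # Anything else: ask the last host whether it is still working on it.
    return not host_has_running_task(status.host_id)
```

host_has_running_task stands in for whatever call ends up listing the running tasks on a host.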
> 
> > 
> > Apart from that VDSM will also return the last persisted status
> > information which will contain
> > hostID - the last host to try and optimize or fix the image
> > stage - X/Y (eg. 1/10) the last persisted stage of the fix.
> > percent_complete - -1 or 0-100, the last persisted completion
> > percentage of the aforementioned stage. -1 means that no progress
> > is available for that operation.
> > last_error - This will only be filled if the operation failed
> > because of something other than IO or a VDSM crash for obvious
> > reasons.
> >              It will usually be set if the task was manually
> >              stopped
> > 
> > The user can either be satisfied with that information or ask the
> > host specified in hostID whether it is still working on that image
> > by checking its running tasks.
> > 
> > checkStorageRepository(self, repositoryId, options={}):
> > A method to go over a storage repository and scan for any existing
> > problems. This includes degraded\broken images and deleted images
> > that have not yet been physically deleted\merged.
> > It returns a list of Fix objects.
> > Fix objects come in 4 types:
> > clean - cleans data, run them to get more space.
> > optimize - run them to optimize a degraded image
> > merge - Merges two images together. Doing this sometimes
> >         makes more images ready for optimizing or cleaning.
> >         The reason it is different from optimize is that
> >         unmerged images are considered optimized.
> > mend - mends a broken image
> > 
> > The user can read these types and prioritize fixes. Fixes also
> > contain opaque FIX data and they should be sent as received to
> > fixStorageRepository(self, repositoryId, fix, options={}):
> > 
> > That will start a fix operation.
> 
> Just to clarify, you scan a repository that has 3 images on it.  You
> will get a list of 3 fix objects, one for each image or a list of
> what
> fixes should be run across the entire repository?
No you will get X fixes that represent any problems detected.
It will most likely be less than the number of images but it can be more than 
the number of images.
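Since the list is per problem and not per image, the caller would normally sort it by type before running anything. A minimal Python sketch, assuming a fix can be modelled as a (type, data) pair; the Fix tuple and the default ordering are my own choices for illustration, and the data part stays opaque and is passed back untouched.

```python
from collections import namedtuple

# Hypothetical shape: "type" is one of the four fix types, "data" is the
# opaque blob that has to be sent back as received.
Fix = namedtuple("Fix", ["type", "data"])

# One possible policy: repair broken images first, free space last.
DEFAULT_PRIORITY = ("mend", "merge", "optimize", "clean")


def prioritize_fixes(fixes, priority=DEFAULT_PRIORITY):
    """Return the fixes ordered by type, unknown types last.

    sorted() is stable, so fixes of the same type keep the order in
    which the check returned them.
    """
    rank = {fix_type: i for i, fix_type in enumerate(priority)}
    return sorted(fixes, key=lambda fix: rank.get(fix.type, len(rank)))
```

A setup that is short on space could pass priority=("clean", "mend", "merge", "optimize") instead.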
> 
> > 
> > 
> > All major operations automatically start the appropriate "Fix" to
> > bring the created object to an optimized\degraded state (the one
> > that is quicker) unless one of the options is
> > AutoFix=False. This is only useful for repos that might not be able
> > to create volumes on all hosts (SDM) but would like to have the
> > actual IO distributed in the cluster.
> > 
> > Another common option is the strategy option:
> > It currently has 2 possible values
> > space and performance - In case VDSM has 2 ways of completing the
> > same operation it will tell it to value one over the other. For
> > example, whether to copy all the data or just create a qcow based
> > off a snapshot.
> > The default is space.
> > 
> > You might have also noticed that it is never explicitly specified
> > where to look for existing images. This is done purposefully, VDSM
> > will always look in all connected repositories for existing
> > objects.
> > For very large setups this might be problematic. To mitigate the
> > problem you have these options:
> > participatingRepositories=[repoId, ...] which tells VDSM to narrow
> > the search to just these repositories
> > and
> > imageHints={imgId: repoId} which will force VDSM to look for those
> > image IDs just in those repositories and fail if it doesn't find
> > them there.
> 
> I'm guessing that you will be adding methods to query the existing
> images, snapshots etc. ?
Yes, getImageList(repoId) and the like
> 
> Thanks,
> Tony
> _______________________________________________
> vdsm-devel mailing list
> vdsm-devel@lists.fedorahosted.org
> https://lists.fedorahosted.org/mailman/listinfo/vdsm-devel
> 
