Re: [vdsm] RFC: New Storage API

Adam Litke Mon, 10 Dec 2012 10:40:35 -0800

On Thu, Dec 06, 2012 at 11:52:01AM -0500, Saggi Mizrahi wrote:
> 
> 
> ----- Original Message -----
> > From: "Shu Ming" <shum...@linux.vnet.ibm.com>
> > To: "Saggi Mizrahi" <smizr...@redhat.com>
> > Cc: "VDSM Project Development" <vdsm-devel@lists.fedorahosted.org>, 
> > "engine-devel" <engine-de...@ovirt.org>
> > Sent: Thursday, December 6, 2012 11:02:02 AM
> > Subject: Re: [vdsm] RFC: New Storage API
> > 
> > Saggi,
> > 
> > Thanks for sharing your thoughts. I have some comments below.
> > 
> > 
> > Saggi Mizrahi:
> > > I've been throwing a lot of bits out about the new storage API and
> > > I think it's time to talk a bit.
> > > I will purposefully try and keep implementation details away and
> > > concentrate on how the API looks and how you use it.
> > >
> > > The first major change is in terminology: there is no longer a storage
> > > domain but a storage repository.
> > > This change is done because so many things are already called
> > > domain in the system and this will make things less confusing for
> > > newcomers with a libvirt background.
> > >
> > > Another change is that repositories no longer have a UUID.
> > > The UUID was only used in the pool members manifest and is no
> > > longer needed.
> > >
> > >
> > > connectStorageRepository(repoId, repoFormat,
> > > connectionParameters={}):
> > > repoId - is a transient name that will be used to refer to the
> > > connected domain, it is not persisted and doesn't have to be the
> > > same across the cluster.
> > > repoFormat - Similar to what used to be type (eg. localfs-1.0,
> > > nfs-3.4, clvm-1.2).
> > > connectionParameters - This is format specific and will be used to
> > > tell VDSM how to connect to the repo.
> > 
> > 
> > Where does repoID come from? I think repoID doesn't exist before
> > connectStorageRepository() returns.  Isn't repoID a return value of
> > connectStorageRepository()?
> No, repoIDs are no longer part of the domain, they are just a transient 
> handle.
> The user can put whatever it wants there as long as it isn't already taken by 
> another currently connected domain.
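
As a sketch of the transient-handle semantics described above (the caller picks
any repoId, and the only constraint is that it is not already in use by a
connected repo), something like the following would do. The RepoRegistry class,
its method names, and the "export" parameter key are hypothetical and are not
part of the proposal.

```python
class RepoRegistry:
    """Tracks caller-chosen, non-persisted repo handles (hypothetical helper)."""

    def __init__(self):
        self._connected = {}

    def connect(self, repo_id, repo_format, connection_parameters=None):
        # The handle is arbitrary, but it must not collide with a handle
        # that is currently connected.
        if repo_id in self._connected:
            raise ValueError("repoId %r is already in use" % (repo_id,))
        self._connected[repo_id] = (repo_format, dict(connection_parameters or {}))

    def disconnect(self, repo_id):
        # Once disconnected, the handle is free to be reused.
        del self._connected[repo_id]

    def connected(self):
        return sorted(self._connected)


registry = RepoRegistry()
# "export" is a made-up connection parameter used only for illustration.
registry.connect("backup", "nfs-3.4", {"export": "server:/path"})
```

Nothing is persisted here, so two hosts could use different handles for the
same repo.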
> > 
> > >
> > > disconnectStorageRepository(self, repoId)
> > >
> > >
> > > In the new API there are only images, some images are mutable and
> > > some are not.
> > > mutable images are also called VirtualDisks
> > > immutable images are also called Snapshots
> > >
> > > There are no explicit templates, you can create as many images as
> > > you want from any snapshot.
> > >
> > > There are 4 major image operations:
> > >
> > >
> > > createVirtualDisk(targetRepoId, size, baseSnapshotId=None,
> > >                    userData={}, options={}):
> > >
> > > targetRepoId - ID of a connected repo where the disk will be
> > > created
> > > size - The size of the image you wish to create
> > > baseSnapshotId - the ID of the snapshot you want to base the new
> > > virtual disk on
> > > userData - optional data that will be attached to the new VD, could
> > > be anything that the user desires.
> > > options - options to modify VDSM's default behavior
> > >
> > > returns the id of the new VD
> > 
> > I think we will also need a function to check if a VirtualDisk is
> > based on a specific snapshot.
> > Like: isSnapshotOf(virtualDiskId, baseSnapshotID):
> No, the design is that volume dependencies are an implementation detail.
> There is no reason for you to know that an image is physically a snapshot of 
> another.
> Logical snapshots, template information, and any other information can be set 
> by the user by using the userData field available for every image.


Statements like this make me start to worry about your userData concept.  It's a
sign of a bad API if the user needs to invent a custom metadata scheme for
itself.  This reminds me of the abomination that is the 'custom' property in the
vm definition today.
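
To make the concern concrete, here is a minimal sketch of the kind of scheme a
client would have to invent if lineage lives only in userData. The "parent" key
and both helper functions are hypothetical; nothing in the proposal defines
them, which is exactly the problem: every client would pick its own convention.

```python
def tag_lineage(user_data, parent_id):
    """Return a copy of user_data carrying a client-invented 'parent' key."""
    tagged = dict(user_data)
    tagged["parent"] = parent_id
    return tagged


def is_based_on(images, image_id, snapshot_id):
    """Walk the client-maintained 'parent' chain stored in userData.

    `images` maps image IDs to their userData dicts.
    """
    seen = set()
    current = image_id
    while current is not None and current not in seen:
        seen.add(current)
        parent = images.get(current, {}).get("parent")
        if parent == snapshot_id:
            return True
        current = parent
    return False


# userData as one client might record it; another client could not read this
# without knowing the same private convention.
images = {
    "snap-1": {},
    "disk-1": tag_lineage({"owner": "alice"}, "snap-1"),
    "snap-2": tag_lineage({}, "disk-1"),
    "disk-2": tag_lineage({}, "snap-2"),
}
```

With this data, is_based_on(images, "disk-2", "snap-1") follows the chain
disk-2, snap-2, disk-1 and finds snap-1, but only because the client wrote the
"parent" keys itself.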

> > > createSnapshot(targetRepoId, baseVirtualDiskId,
> > >                 userData={}, options={}):
> > > targetRepoId - The ID of a connected repo where the new snapshot
> > > will be created and the original image exists as well.
> > > size - The size of the image you wish to create
> > > baseVirtualDisk - the ID of a mutable image (Virtual Disk) you want
> > > to snapshot
> > > userData - optional data that will be attached to the new Snapshot,
> > > could be anything that the user desires.
> > > options - options to modify VDSM's default behavior
> > >
> > > returns the id of the new Snapshot
> > >
> > > copyImage(targetRepoId, imageId, baseImageId=None, userData={},
> > > options={})
> > > targetRepoId - The ID of a connected repo where the new image will
> > > be created
> > > imageId - The image you wish to copy
> > > baseImageId - if specified, the new image will contain only the
> > > diff between imageId and baseImageId.
> > >                If None the new image will contain all the bits of
> > >                imageId. This can be used to copy partial parts of
> > >                images for export.
> > > userData - optional data that will be attached to the new image,
> > > could be anything that the user desires.
> > > options - options to modify VDSM's default behavior
> > 
> > Does this function mean that we can copy the image from one
> > repository
> > to another repository? Does it cover the semantics of storage
> > migration,
> > storage backup, storage incremental backup?
> Yes, the main purpose is copying to another repo, and you can even do 
> incremental backups.
> The following flow is also possible:
> 1. Run a VM using imageA
> 2. write to disk
> 3. Stop VM
> 4. copy imageA to repoB
> 5. Run a VM using imageA again
> 6. Write to disk
> 7. Stop VM
> 8. Copy imageA again, basing it off imageA_copy1 on repoB, creating a diff on 
> repoB without snapshotting the original image.
> 
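
The eight steps above can be sketched with a toy in-memory model in which an
image is a dict of block offset to data. The copy_image helper below is a
hypothetical stand-in for copyImage(), not its real behavior; in particular it
ignores blocks that exist in the base but were removed from the source.

```python
def copy_image(source_blocks, base_blocks=None):
    """Return the blocks a copy must contain (toy model).

    With no base, every block is copied. With a base, only blocks whose
    content differs from the base are copied.
    """
    if base_blocks is None:
        return dict(source_blocks)
    return {
        offset: data
        for offset, data in source_blocks.items()
        if base_blocks.get(offset) != data
    }


# Steps 1-3: a VM runs on imageA and writes to disk.
image_a = {0: b"boot", 1: b"data-v1"}

# Step 4: full copy of imageA to repoB.
image_a_copy1 = copy_image(image_a)

# Steps 5-7: the VM runs again and changes the image.
image_a[1] = b"data-v2"
image_a[2] = b"new"

# Step 8: copy again, based on the earlier copy; only the diff is stored.
image_a_copy2 = copy_image(image_a, base_blocks=image_a_copy1)
```

Here image_a_copy2 holds only blocks 1 and 2, and imageA itself was never
snapshotted.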
> > 
> > >
> > > Returns the ID of the new image. In case of copying an immutable
> > > image the ID will be identical to the original image as they
> > > contain the same data. However the user should not assume that and
> > > always use the value returned from the method.
> > >
> > > removeImage(repositoryId, imageId, options={}):
> > > repositoryId - The ID of a connected repo where the image to delete
> > > resides
> > > imageId - The id of the image you wish to delete.
> > >
> > >
> > > ----
> > > getImageStatus(repositoryId, imageId)
> > > repositoryId - The ID of a connected repo where the image to check
> > > resides
> > > imageId - The id of the image you wish to check.
> > >
> > > All operations return once the operation has been committed to
> > > disk NOT when the operation actually completes.
> > > This is done so that:
> > > - operations come to a stable state as quickly as possible.
> > > - In cases where there is an SDM, only a small portion of the
> > > operation actually needs to be performed on the SDM host.
> > > - No matter how many times the operation fails and on how many
> > > hosts, you can always resume the operation and choose when to do
> > > it.
> > > - You can stop an operation at any time and remove the resulting
> > > object making a distinction between "stop because the host is
> > > overloaded" and "I don't want that image"
> > >
> > > This means that after calling any operation that creates a new
> > > image the user must then call getImageStatus() to check what is
> > > the status of the image.
> > > The status of the image can be either optimized, degraded, or
> > > broken.
> > > "Optimized" means that the image is available and you can run VMs
> > > off it.
> > > "Degraded" means that the image is available and will run VMs but
> > > there might be a better way VDSM can represent the underlying data.
> > 
> > What does the "represent" mean here?
> Anything, but mostly the image format (RAW\QCOW2) when the performance strategy has 
> been selected.
> > > "Broken" means that the image can't be used at the moment, probably
> > > because not all the data has been set up on the volume.
> > >
> > > Apart from that VDSM will also return the last persisted status
> > > information which will contain:
> > > hostID - the last host to try and optimize or fix the image
> > Any host can optimize the image? No need to be SDM?
> On anything but lvm based block domains there will not even be an SDM.
> On SDM based domains we will try as hard as we can to have as many operations 
> executable on any host.
> > 
> > > stage - X/Y (eg. 1/10) the last persisted stage of the fix.
> > > percent_complete - -1 or 0-100, the last persisted completion
> > > percentage of the aforementioned stage. -1 means that no progress
> > > is available for that operation.
> > > last_error - This will only be filled if the operation failed
> > > because of something other than IO or a VDSM crash for obvious
> > > reasons.
> > >               It will usually be set if the task was manually
> > >               stopped
> > >
> > > The user can either be satisfied with that information or ask the
> > > host specified in hostID whether it is still working on that image by
> > > checking its running tasks.
> > 
> > So we need a function to know what tasks are running on the image
> getImageStatus()
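
Since every creating operation returns before the image is usable, a client
ends up writing a loop like the one below. This is only a sketch: the
get_image_status callable stands in for the proposed getImageStatus(), and the
"status" key and lowercase status strings are assumptions about the return
value, which the proposal does not spell out.

```python
import time


def wait_until_usable(get_image_status, repo_id, image_id,
                      timeout=60.0, interval=1.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until the image is optimized or degraded, or raise on timeout."""
    deadline = clock() + timeout
    while True:
        info = get_image_status(repo_id, image_id)
        # Both states mean a VM can run off the image.
        if info["status"] in ("optimized", "degraded"):
            return info
        if clock() >= deadline:
            raise TimeoutError("image %s is still %s" % (image_id, info["status"]))
        sleep(interval)


# A fake status source for illustration: broken twice, then degraded.
_responses = iter(["broken", "broken", "degraded"])


def fake_get_image_status(repo_id, image_id):
    return {"status": next(_responses), "percent_complete": -1}


result = wait_until_usable(fake_get_image_status, "repoA", "img-1",
                           sleep=lambda seconds: None)
```

The sleep and clock parameters are injectable only so the loop can be exercised
without real delays.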
> > >
> > > checkStorageRepository(self, repositoryId, options={}):
> > > A method to go over a storage repository and scan for any existing
> > > problems. This includes degraded\broken images and deleted images
> > > that have not yet been physically deleted\merged.
> > > It returns a list of Fix objects.
> > > Fix objects come in 4 types:
> > > clean - cleans data, run them to get more space.
> > > optimize - run them to optimize a degraded image
> > > merge - Merges two images together. Doing this sometimes
> > >          makes more images ready for optimizing or cleaning.
> > >          The reason it is different from optimize is that
> > >          unmerged images are considered optimized.
> > > mend - mends a broken image
> > >
> > > The user can read these types and prioritize fixes. Fixes also
> > > contain opaque FIX data and they should be sent as received to
> > > fixStorageRepository(self, repositoryId, fix, options={}):
> > >
> > > That will start a fix operation.
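
A client that reads the four fix types and prioritizes them might look like the
sketch below. The dict shape of a Fix ("type" plus an opaque "data" blob) and
the particular ordering are assumptions made for illustration; the proposal
only says that the types are readable and that the rest must be passed back
unchanged.

```python
# One possible client policy: repair broken images first, reclaim space last.
DEFAULT_PRIORITY = {"mend": 0, "merge": 1, "optimize": 2, "clean": 3}


def prioritize_fixes(fixes, priority=DEFAULT_PRIORITY):
    """Order fixes by type without touching their opaque payload."""
    # sorted() is stable, so fixes of the same type keep the order in which
    # the repository check returned them.
    return sorted(fixes, key=lambda fix: priority[fix["type"]])


# Hypothetical result of a repository check.
fixes = [
    {"type": "clean", "data": "opaque-1"},
    {"type": "mend", "data": "opaque-2"},
    {"type": "optimize", "data": "opaque-3"},
    {"type": "mend", "data": "opaque-4"},
]

ordered = prioritize_fixes(fixes)
```

Each entry of ordered would then be handed, as received, to
fixStorageRepository().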
> > >
> > >
> > > All major operations automatically start the appropriate "Fix" to
> > > bring the created object to an optimized\degraded state (the one
> > > that is quicker) unless one of the options is
> > > AutoFix=False. This is only useful for repos that might not be able
> > > to create volumes on all hosts (SDM) but would like to have the
> > > actual IO distributed in the cluster.
> > >
> > > Another common option is the strategy option.
> > > It currently has 2 possible values:
> > > space and performance - In case VDSM has 2 ways of completing the
> > > same operation it will tell it to value one over the other. For
> > > example, whether to copy all the data or just create a qcow based
> > > off a snapshot.
> > > The default is space.
> > >
> > > You might have also noticed that it is never explicitly specified
> > > where to look for existing images. This is done purposefully, VDSM
> > > will always look in all connected repositories for existing
> > > objects.
> > > For very large setups this might be problematic. To mitigate the
> > > problem you have these options:
> > > participatingRepositories=[repoId, ...] which tell VDSM to narrow
> > > the search to just these repositories
> > > and
> > > imageHints={imgId: repoId} which will force VDSM to look for those
> > > image IDs just in those repositories and fail if it doesn't find
> > > them there.
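
The lookup rules described above can be sketched as follows. The find_image
function and the representation of connected repos as a dict of repoId to a set
of image IDs are hypothetical; only the participatingRepositories and
imageHints semantics come from the proposal.

```python
def find_image(image_id, connected, participating=None, hints=None):
    """Return the repoId holding image_id, following the proposed rules.

    connected      - dict mapping repoId to the set of image IDs it holds
    participating  - optional list of repoIds to narrow the search to
    hints          - optional dict mapping imageId to the repoId it must be in
    """
    hints = hints or {}
    if image_id in hints:
        # A hint is binding: look only there and fail if the image is absent.
        repo_id = hints[image_id]
        if image_id not in connected.get(repo_id, ()):
            raise LookupError("%s not found in hinted repo %s" % (image_id, repo_id))
        return repo_id
    # Default: search every connected repo unless the caller narrowed it.
    candidates = connected if participating is None else participating
    for repo_id in candidates:
        if image_id in connected.get(repo_id, ()):
            return repo_id
    raise LookupError("%s not found" % (image_id,))


connected = {"repoA": {"img-1"}, "repoB": {"img-2"}, "repoC": {"img-3"}}
```

For example, find_image("img-2", connected) searches all three repos, while
passing participating=["repoA"] for the same image raises LookupError.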
> > > _______________________________________________
> > > vdsm-devel mailing list
> > > vdsm-devel@lists.fedorahosted.org
> > > https://lists.fedorahosted.org/mailman/listinfo/vdsm-devel
> > 
> > 
> > --
> > ---
> > 舒明 Shu Ming
> > Open Virtualization Engineerning; CSTL, IBM Corp.
> > Tel: 86-10-82451626  Tieline: 9051626 E-mail: shum...@cn.ibm.com or
> > shum...@linux.vnet.ibm.com
> > Address: 3/F Ring Building, ZhongGuanCun Software Park, Haidian
> > District, Beijing 100193, PRC
> > 
> > 
> > 

-- 
Adam Litke <a...@us.ibm.com>
IBM Linux Technology Center
