On 12/06/2012 10:22 PM, Saggi Mizrahi wrote:

----- Original Message -----
From: "Shu Ming" <shum...@linux.vnet.ibm.com>
To: "Saggi Mizrahi" <smizr...@redhat.com>
Cc: "VDSM Project Development" <vdsm-devel@lists.fedorahosted.org>, "engine-devel" 
Sent: Thursday, December 6, 2012 11:02:02 AM
Subject: Re: [vdsm] RFC: New Storage API


Thanks for sharing your thoughts; I have some comments below.

Saggi Mizrahi:
I've been throwing a lot of bits out about the new storage API and
I think it's time to talk a bit.
I will purposefully try and keep implementation details away and
concentrate on how the API looks and how you use it.

The first major change is in terminology: there is no longer a storage
domain but a storage repository.
This change is done because so many things are already called
domain in the system and this will make things less confusing for
newcomers with a libvirt background.

One other change is that repositories no longer have a UUID.
The UUID was only used in the pool members manifest and is no
longer needed.

connectStorageRepository(repoId, repoFormat,
repoId - is a transient name that will be used to refer to the
connected domain, it is not persisted and doesn't have to be the
same across the cluster.
repoFormat - Similar to what used to be type (eg. localfs-1.0,
nfs-3.4, clvm-1.2).
connectionParameters - This is format specific and will be used to
tell VDSM how to connect to the repo.

Where does repoID come from? I think repoID doesn't exist before
connectStorageRepository() returns.  Isn't repoID a return value of
No, repoIDs are no longer part of the domain, they are just a transient handle.
The user can put whatever it wants there as long as it isn't already taken by 
another currently connected domain.

So what happens when a user mistakenly gives a repoID that is already in use? There should be something in the return value that specifies the error and/or the reason for it, so that the user can retry with a different repoID.
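
A minimal sketch of how that collision could be reported, assuming a hypothetical in-memory table of connected repositories. `RepoRegistry`, `RepoIdInUse` and the connection parameter keys are illustrative names only, not part of the proposal, and nothing here actually connects to storage:

```python
class RepoIdInUse(Exception):
    """Hypothetical error: the transient repoId is already taken."""


class RepoRegistry:
    """Illustrative stand-in for a host's table of connected repositories."""

    def __init__(self):
        self._connected = {}  # repoId -> (repoFormat, connectionParameters)

    def connectStorageRepository(self, repoId, repoFormat, connectionParameters):
        # repoId is only a transient handle, so the sole constraint is that
        # no other currently connected repository uses it on this host.
        if repoId in self._connected:
            raise RepoIdInUse("repoId %r is already connected" % (repoId,))
        self._connected[repoId] = (repoFormat, dict(connectionParameters))

    def disconnectStorageRepository(self, repoId):
        # After disconnecting, the handle is free to be reused.
        del self._connected[repoId]


registry = RepoRegistry()
# The parameter keys below ("export", "path") are made up for the example.
registry.connectStorageRepository("repoA", "nfs-3.4", {"export": "server:/path"})
try:
    registry.connectStorageRepository("repoA", "localfs-1.0", {"path": "/tmp/x"})
    collided = False
except RepoIdInUse:
    collided = True
```

With an error like this the caller can simply retry with another handle, since the ID carries no meaning beyond the current connection.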

disconnectStorageRepository(self, repoId)

In the new API there are only images, some images are mutable and
some are not.
mutable images are also called VirtualDisks
immutable images are also called Snapshots

There are no explicit templates, you can create as many images as
you want from any snapshot.

There are 4 major image operations:

createVirtualDisk(targetRepoId, size, baseSnapshotId=None,
                    userData={}, options={}):

targetRepoId - ID of a connected repo where the disk will be
size - The size of the image you wish to create
baseSnapshotId - the ID of the snapshot you want to base the new
virtual disk on
userData - optional data that will be attached to the new VD, could
be anything that the user desires.
options - options to modify VDSM's default behavior

IIUC, I can use options to do storage offloads? E.g. I can create a LUN that represents this VD on my storage array based on the 'options' parameter? Is this the intended way to use 'options'?

returns the id of the new VD
I think we will also need a function to check if a VirtualDisk is
based on a specific snapshot.
Like: isSnapshotOf(virtualDiskId, baseSnapshotID):
No, the design is that volume dependencies are an implementation detail.
There is no reason for you to know that an image is physically a snapshot of 
Logical snapshots, template information, and any other information can be set 
by the user by using the userData field available for every image.
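
As a sketch of that idea, a caller could record the logical relationship itself in userData and answer the isSnapshotOf question without VDSM's help. The "basedOn" and "template" keys are conventions invented for this example; VDSM would treat userData as opaque:

```python
def is_logical_snapshot_of(image_user_data, base_snapshot_id):
    """Return True if the caller recorded base_snapshot_id as this image's base.

    The "basedOn" key is a caller-side convention made up for this sketch.
    """
    return image_user_data.get("basedOn") == base_snapshot_id


# userData the caller might pass to createVirtualDisk(..., userData=...)
vd_user_data = {"basedOn": "snap-1", "template": "fedora-base"}
```
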
createSnapshot(targetRepoId, baseVirtualDiskId,
                 userData={}, options={}):
targetRepoId - The ID of a connected repo where the new snapshot
will be created and the original image exists as well.
size - The size of the image you wish to create
baseVirtualDisk - the ID of a mutable image (Virtual Disk) you want
to snapshot
userData - optional data that will be attached to the new Snapshot,
could be anything that the user desires.
options - options to modify VDSM's default behavior

returns the id of the new Snapshot

copyImage(targetRepoId, imageId, baseImageId=None, userData={},
targetRepoId - The ID of a connected repo where the new image will
be created
imageId - The image you wish to copy
baseImageId - if specified, the new image will contain only the
diff between image and Id.
                If None the new image will contain all the bits of
                image Id. This can be used to copy partial parts of
                images for export.
userData - optional data that will be attached to the new image,
could be anything that the user desires.
options - options to modify VDSM's default behavior
Does this function mean that we can copy the image from one
to another repository? Does it cover the semantics of storage
storage backup, storage incremental backup?
Yes, the main purpose is copying to another repo, and you can even do 
incremental backups.
Also the following flow
1. Run a VM using imageA
2. write to disk
3. Stop VM
4. copy imageA to repoB
5. Run a VM using imageA again
6. Write to disk
7. Stop VM
8. Copy imageA again basing it off imageA_copy1 on repoB creating a diff on repo 
diff without snapshotting the original image.

Returns the ID of the new image. In case of copying an immutable
image the ID will be identical to the original image as they
contain the same data. However the user should not assume that and
always use the value returned from the method.
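
The incremental flow above can be illustrated with a toy model in which an image is just a mapping of block index to data. This only shows what "contains only the diff" means; it is not how VDSM stores or copies images, and `copy_image` is a hypothetical helper:

```python
def copy_image(image, base_image=None):
    """Return the blocks a copy would carry.

    Images are modelled as {block_index: data}, purely for illustration.
    """
    if base_image is None:
        return dict(image)  # full copy: every block
    # Incremental copy: only blocks that differ from the base.
    return {i: d for i, d in image.items() if base_image.get(i) != d}


image_a = {0: b"boot", 1: b"data-v1"}
copy1 = copy_image(image_a)                     # step 4: full copy to repoB
image_a[1] = b"data-v2"                         # steps 5-7: the VM writes again
image_a[2] = b"new"
copy2 = copy_image(image_a, base_image=copy1)   # step 8: only the changes
```

Here copy2 holds just the two blocks written after the first copy, which is the property that makes incremental backup cheap.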

removeImage(repositoryId, imageId, options={}):
repositoryId - The ID of a connected repo where the image to delete
imageId - The id of the image you wish to delete.

getImageStatus(repositoryId, imageId)
repositoryId - The ID of a connected repo where the image to check
imageId - The id of the image you wish to check.

All operations return once the operation has been committed to
disk NOT when the operation actually completes.
This is done so that:
- operations come to a stable state as quickly as possible.
- In case where there is an SDM, only a small portion of the
operation actually needs to be performed on the SDM host.
- No matter how many times the operation fails and on how many
hosts, you can always resume the operation and choose when to do
- You can stop an operation at any time and remove the resulting
object, making a distinction between "stop because the host is
overloaded" and "I don't want that image"

This means that after calling any operation that creates a new
image the user must then call getImageStatus() to check what is
the status of the image.
The status of the image can be either optimized, degraded, or
"Optimized" means that the image is available and you can run VMs
off it.
"Degraded" means that the image is available and will run VMs but
there might be a better way VDSM can represent the underlying data.

Calling a qcow2-based snapshot degraded (meaning it's degraded in perf, as it's space optimized) and calling raw images optimized (meaning optimized for perf, as they are space inefficient) is confusing. Degraded sounds like a bad thing when seen by the end user :) I think there is scope for some better and less confusing terminology here?

What does the "represent" mean here?
Anything, but mostly image format RAW\QCOW2 when performance strategy has been 
"Broken" means that the image can't be used at the moment, probably
because not all the data has been set up on the volume.
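
A minimal sketch of the resulting "create, then poll" pattern. It assumes getImageStatus() can be reduced to one of the three status strings; the real return shape is not specified here, and `wait_until_usable` and the fake status function are hypothetical:

```python
import time


def wait_until_usable(get_image_status, repo_id, image_id,
                      interval=1.0, max_polls=60, sleep=time.sleep):
    """Poll until the image is no longer "broken"; return the last status seen."""
    status = None
    for _ in range(max_polls):
        status = get_image_status(repo_id, image_id)
        if status in ("optimized", "degraded"):
            return status  # usable: VMs can run off the image
        sleep(interval)
    return status


# Stand-in for the real call: reports "broken" twice, then "degraded".
_answers = iter(["broken", "broken", "degraded"])


def fake_get_image_status(repo_id, image_id):
    return next(_answers)


result = wait_until_usable(fake_get_image_status, "repoA", "img-1",
                           sleep=lambda seconds: None)
```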

Apart from that VDSM will also return the last persisted status
information which will contain
hostID - the last host to try and optimize or fix the image
Any host can optimize the image? No need to be SDM?
On anything but lvm based block domains there will not even be an SDM.
On SDM based domains we will try as hard as we can to have as many operations 
executable on any host.

1) Can you provide more info on why there is an exception for 'lvm based block domain'? It's not coming out clearly.
2) Based on the terminology change, domain is now replaced by repository, so SDM should more aptly be called SRM (storage repo manager) so that we are consistent in the usage of terminology.
3) Can you provide some example flow / scenario to understand how domains with and without an SDM work? Especially how the disk based lock is taken if there is no SDM?

stage - X/Y (eg. 1/10) the last persisted stage of the fix.
percent_complete - -1 or 0-100, the last persisted completion
percentage of the aforementioned stage. -1 means that no progress
is available for that operation.
last_error - This will only be filled if the operation failed
because of something other than IO or a VDSM crash for obvious
               It will usually be set if the task was manually

The user can either be satisfied with that information or ask the
host specified in hostID if it is still working on that image by
checking its running tasks.
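
To make the persisted fields concrete, here is a sketch of how a client might hold them and derive a rough overall progress figure. The `FixProgress` class, its field names and the assumption that X in "X/Y" is the 1-based stage currently in progress are all illustrative, not part of the proposal:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FixProgress:
    host_id: str
    stage: str              # "X/Y", e.g. "1/10"
    percent_complete: int   # -1 or 0-100
    last_error: Optional[str] = None

    def overall_fraction(self):
        """Rough overall progress in [0, 1], or None if unavailable."""
        if self.percent_complete < 0:
            return None  # -1 means no progress is available
        done, total = (int(part) for part in self.stage.split("/"))
        # Assumption: stages before the current one are complete.
        return ((done - 1) + self.percent_complete / 100.0) / total
```
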
So we need a function to know what tasks are running on the image
checkStorageRepository(self, repositoryId, options={}):
A method to go over a storage repository and scan for any existing
problems. This includes degraded\broken images and deleted images
that have not yet been physically deleted\merged.
It returns a list of Fix objects.
Fix objects come in 4 types:
clean - cleans data, run them to get more space.
optimize - run them to optimize a degraded image
merge - Merges two images together. Doing this sometimes
          makes more images ready for optimizing or cleaning.
          The reason it is different from optimize is that
          unmerged images are considered optimized.
mend - mends a broken image

The user can read these types and prioritize fixes. Fixes also
contain opaque FIX data and they should be sent as received to
fixStorageRepository(self, repositoryId, fix, options={}):

That will start a fix operation.
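
A sketch of how a caller might prioritize the returned fixes before sending them back. The dict shape of a Fix object and the priority order are assumptions made for this example; the proposal only says fixes carry a type and opaque data:

```python
# Caller-chosen policy: lower number runs first. The ordering is only an
# example; prioritization is left to the user.
PRIORITY = {"mend": 0, "clean": 1, "merge": 2, "optimize": 3}


def prioritize(fixes):
    """Sort fixes by type; sorted() is stable, so ties keep the received order."""
    return sorted(fixes, key=lambda fix: PRIORITY[fix["type"]])


# Assumed shape of a Fix object: a type plus opaque data passed back untouched.
fixes = [
    {"type": "optimize", "data": "opaque-1"},
    {"type": "mend", "data": "opaque-2"},
    {"type": "clean", "data": "opaque-3"},
]
ordered = prioritize(fixes)
```

Each entry of ordered would then be passed, unmodified, to fixStorageRepository() in turn.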

It would be good if you could provide some example or flow of a "fix" operation.
When and why would somebody want to do it?

Does 'Fix' here mean that I move from raw to qcow2 format or vice versa, or is there more to it?

All major operations automatically start the appropriate "Fix" to
bring the created object to an optimized\degraded state (the one
that is quicker) unless one of the options is
AutoFix=False. This is only useful for repos that might not be able
to create volumes on all hosts (SDM) but would like to have the
actual IO distributed in the cluster.

Another common option is the strategy option.
It currently has 2 possible values:
space and performance - In case VDSM has 2 ways of completing the
same operation it will tell it to value one over the other. For
example, whether to copy all the data or just create a qcow based
off a snapshot.
The default is space.

You might have also noticed that it is never explicitly specified
where to look for existing images. This is done purposefully, VDSM
will always look in all connected repositories for existing
For very large setups this might be problematic. To mitigate the
problem you have these options:
participatingRepositories=[repoId, ...] which tells VDSM to narrow
the search to just these repositories
imageHints={imgId: repoId} which will force VDSM to look for those
image IDs just in those repositories and fail if it doesn't find
them there.
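
The lookup rules can be sketched as follows, using a toy model where connected repositories are a mapping of repoId to the set of image IDs they hold. `find_image` is a hypothetical helper showing only the search semantics of the two options, not VDSM's implementation:

```python
def find_image(image_id, connected, participatingRepositories=None,
               imageHints=None):
    """Return the repoId holding image_id, or raise LookupError.

    connected maps repoId -> set of image IDs (a toy model of connected repos).
    """
    hints = imageHints or {}
    if image_id in hints:
        # A hint is binding: look only there and fail if it is missing.
        repo_id = hints[image_id]
        if image_id in connected.get(repo_id, ()):
            return repo_id
        raise LookupError("%s not in hinted repo %s" % (image_id, repo_id))
    # Default: search every connected repo unless the caller narrowed it.
    if participatingRepositories is None:
        candidates = list(connected)
    else:
        candidates = participatingRepositories
    for repo_id in candidates:
        if image_id in connected.get(repo_id, ()):
            return repo_id
    raise LookupError("%s not found" % image_id)


connected = {"repoA": {"img-1"}, "repoB": {"img-2"}}
```
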
vdsm-devel mailing list

舒明 Shu Ming
Open Virtualization Engineerning; CSTL, IBM Corp.
Tel: 86-10-82451626  Tieline: 9051626 E-mail: shum...@cn.ibm.com or
Address: 3/F Ring Building, ZhongGuanCun Software Park, Haidian
District, Beijing 100193, PRC
