On Mon, Jul 06, 2009 at 12:29:53PM +0200, Tomeu Vizoso wrote:

Agreed. I have to say that your proposal is excellent, congratulations!
Thanks, I'm flattered. :)

Is the asynchronous API design useful enough to warrant more complex
I'm not sure, but I think that whatever decision we take should be
made based on actual usage of the DS. What about proposing an example
of how an existing activity would be modified to use the new API?
OK, will work on one.

 - For save() calls the activity needs to wait for the result (containing the new
   version_id) before it can invoke save() again for the same object,
   which can take quite some time if save() is sync, especially if other
   activities are saving at the same time.
What about having a separate call that returns synchronously a new
tree_id and/or version_id?
Interesting idea, need to think about it. Since we're going to use UUIDs, leaving a requested version unused shouldn't be an issue (for other version number schemes, like the one you propose below, "holes" in the numbering could be troublesome).
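
A minimal sketch of such a synchronous allocation call, assuming random UUIDs as IDs. The name allocate_ids() is hypothetical and not part of the proposal; the point is only that handing out an ID requires no disk access and no coordination, so an ID that is never used does no harm.

```python
import uuid


def allocate_ids(tree_id=None):
    """Return (tree_id, version_id) immediately, without touching the disk.

    Hypothetical helper: random UUIDs need no coordination, so an
    allocated but unused version_id simply never shows up in the history.
    """
    if tree_id is None:
        tree_id = str(uuid.uuid4())   # new object
    version_id = str(uuid.uuid4())    # new version of that object
    return tree_id, version_id


tree_id, first = allocate_ids()
_, second = allocate_ids(tree_id)     # second version of the same object
```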

Making the API fully asynchronous is the cause for much of the complexity of my proposal, but if we eliminate the queueing the response times for write accesses and checkout() can be very long even for unrelated operations.
Why for unrelated operations?
Because we're serializing VCS operations. They are IO bound (more specifically: disk bound) and parallelisation would only lead to IO starvation, especially for HDDs.
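
To illustrate what serializing means here, a rough sketch of a single worker that processes queued operations strictly one after another. SerialWorker is an illustrative name, not code from the prototype. Anything queued behind a slow operation has to wait, even if it concerns a different object.

```python
import queue
import threading


class SerialWorker:
    """Run submitted jobs one at a time, in submission order."""

    def __init__(self):
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, func, *args):
        """Queue a job and return immediately."""
        self._jobs.put((func, args))

    def wait(self):
        """Block until every queued job has finished."""
        self._jobs.join()

    def _run(self):
        while True:
            func, args = self._jobs.get()
            try:
                func(*args)
            finally:
                self._jobs.task_done()


log = []
worker = SerialWorker()
for name in ("save A", "checkout B", "save C"):
    worker.submit(log.append, name)   # stands in for a real VCS operation
worker.wait()
```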

# do we want an optimized way to determine (only) the branch HEADs of
a given tree_id?

This depends on the intended UI. My opinion is that if we branch at
every interesting modification (triggered by the activity detecting an
interesting change or by the user clicking on the Keep button), we
would like to display in the object list all the HEADs of each branch
in each tree_id. In that case yes, we need a way to retrieve that list
that is fast on both the client and the server side.
My imagined usage of branches was to create them automatically upon altering a non-HEAD version. A user basing off an old version could mean the newer version is "broken" (in that case promoting the new version to the HEAD of the current branch makes more sense) or that (s)he uses the older version as a kind of template to create derivatives (so creating a branch would make most sense). But I'm open to alternative suggestions. We'd most likely need a way to explicitly create branches then.
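
A small sketch of the implicit branching described above, assuming each version records only its parent_id. The Tree class and its method names are made up for illustration. A HEAD is then any version that no other version is based on, and saving on top of a non-HEAD version yields a second HEAD without any explicit branch operation.

```python
import uuid


class Tree:
    """Version history of one object: version_id -> parent_id."""

    def __init__(self):
        self.parents = {}

    def save(self, parent_id=None):
        """Record a new version based on parent_id and return its ID."""
        if parent_id is not None and parent_id not in self.parents:
            raise KeyError(parent_id)
        version_id = str(uuid.uuid4())
        self.parents[version_id] = parent_id
        return version_id

    def heads(self):
        """Return the versions that no other version is based on."""
        used_as_parent = set(self.parents.values())
        return {v for v in self.parents if v not in used_as_parent}


tree = Tree()
v1 = tree.save()
v2 = tree.save(v1)
v3 = tree.save(v2)     # linear history so far: v3 is the only HEAD
v2b = tree.save(v2)    # based on a non-HEAD version: implicit branch
```

This linear scan is only meant to show the idea; a fast HEAD lookup would need an index on the server side.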

# using symlink instead of hardlink for "incoming" queue since we want
to support directory trees, not just files

What justifies this new requirement?
That it's
a) of use to activities (IIRC some of them use ZIP files instead right now),
b) easy enough to achieve with the new design and
c) leads to better delta compression and thus disk space efficiency.

# since an index rebuild can take a lot of time we need to provide UI
feedback while doing that

Any I/O operation can potentially take a lot of time, but with the
current version of the DS rebuilding an index with a few thousands of
entries is not so slow on the XO. We should never need to rebuild the
index, so this new requirement might not be justified (given the
current resources, all the other work we need to do, etc).
OK, good to know index rebuilding is fast. So the simple, boolean API I proposed (check_ready() / Ready()) suffices.

# detecting identical files across objects isn't as important since
duplicates are mostly expected to occur as versions of the same object

Based on how current activities are using the DS, this isn't like
The most common case I have heard from the field are children
downloading a PDF for reading several times.
Oh, didn't know that, so it's a new requirement.

An alternative to the current method for detecting duplicates is moving this task to
activities, is that what you suggest?
I'm ambivalent about it. On one hand it's not so easy to achieve in datastore (for various backends) and more indicative of UI deficiencies (why did the children download the file several times in the first place; it's bandwidth wastage as well), on the other hand it might not be easy to do in Browse either. But maybe storing the URL as metadata and looking for that is enough for most cases? I guess it happens during a single session so the URL (even if including a session ID or whatever) should be stable enough?
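
The URL idea could look roughly like this on the activity side. The metadata key 'source_url' and the function find_by_url() are assumptions made for the sketch; the entries are plain dicts rather than real datastore results.

```python
def find_by_url(entries, url):
    """Return the tree_ids of entries whose 'source_url' equals url."""
    return [e["tree_id"] for e in entries if e.get("source_url") == url]


entries = [
    {"tree_id": "a", "title": "Book",
     "source_url": "http://example.org/book.pdf"},
    {"tree_id": "b", "title": "Notes"},          # no URL recorded
]

# Before starting a download, check for an existing entry.
already_there = find_by_url(entries, "http://example.org/book.pdf")
```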

About the benefits of differential compression I would like to note
that if you analyze a real world journal, the biggest files are
videos, mp3, pdfs, etc., so files in formats not easily editable with
the activities we currently have.
Which is neither an argument pro nor contra delta compression as storage requirements should be about the same either way. OTOH most activities that do support modification currently save in a text based format, so for the large number of versions I expect (remember we're autosaving on activity switch) it could be a huge gain (not with git though since AFAICT it stores the entire blob every time, not just the differences).

With that I don't mean it is not an
interesting challenge or something that we won't need in the future,
just that it has a relatively low impact as of today.
Which is why the minimal "delta compression" in git should be sufficient for now. :) What's more of a problem is one of the points mtd raised, though: git potentially choking on large files (mmap should be fine OTOH).

# activities should not submit new entries while the previously
submitted one hasn't been fully committed yet

Why so?
This is the answer I gave before:

Looks like I need to define should/must/etc. for the final version of the
document. It's advice, not a requirement. The intention is to avoid
having an ever-increasing backlog because the activity saves faster than
the datastore can process.

# version_id and parent_id

Have you thought about version_id being of the form of '2.1.4'?
Yes, that's what I intended originally. But someone (Ben?) made a good argument for random IDs in one of the recent threads. Besides, the current prototype already uses the latter. :) Using random IDs and storing the relationship in metadata is easier to implement than constructing and parsing structured IDs. It's not clear the latter would buy us anything real.

would make parent_id unneeded because we could refer to the parent as
(tree_id, 2.1.3). And would also allow us to identify the HEAD of each
But only inside datastore - any API consumer shouldn't make assumptions about version format.

# creator

What is it for?
E.g. to determine the default activity for resuming. Current name of this property is 'activity'.

# activity saves data to a disk, ensuring it has been committed (sync)
and proper access rights for data store

By sync you mean written to disk? Why do activities need to worry about this?
Because activities know best what exactly needs to be synced. We should be able to remove this requirement in exchange for reduced datastore performance (esp. for directory objects). I'm not perfectly sure fdatasync() done in datastore will cause data written by the activity to be written to disk (though I read the POSIX definition of fdatasync() that way) but there are ways to find that out. :)
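
For a single file, the activity-side sync could be sketched as below. write_synced() is a hypothetical helper, and it uses os.fsync() instead of fdatasync() only because os.fdatasync() is not available on every platform. For directory objects each file would need the same treatment.

```python
import os
import tempfile


def write_synced(path, data):
    """Write data to path and ask the OS to commit it before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()               # push Python's buffer to the OS
        os.fsync(f.fileno())    # ask the OS to write the file to disk


path = os.path.join(tempfile.mkdtemp(), "document.txt")
write_synced(path, b"hello")
```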

#    Changes the (unversioned/version-specific) metadata of the given
object to match metadata. Fully synchronous, no return value.

How do we know which properties are version-specific and which aren't?
By treating them accordingly. :)
Datastore is agnostic to this property (of metadata entries). Metadata is bound to each version but modifiable. For "versioned" metadata the API consumer is supposed to call save(), for "unversioned/version-specific" metadata it should call change_metadata() instead. If we decide to make some metadata global (i.e. common to all versions) I'd just hardcode those few names.
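
A toy in-memory model of that distinction, under the assumption that metadata is stored per version. MetadataStore is illustrative only: save() produces a new version with its own metadata, while change_metadata() rewrites the metadata of an existing version and returns nothing.

```python
import uuid


class MetadataStore:
    """Keep one metadata dict per version."""

    def __init__(self):
        self.versions = {}   # version_id -> metadata dict

    def save(self, parent_id, metadata):
        """Create a new version carrying its own copy of metadata."""
        version_id = str(uuid.uuid4())
        self.versions[version_id] = dict(metadata, parent_id=parent_id)
        return version_id

    def change_metadata(self, version_id, metadata):
        """Replace the metadata of an existing version in place."""
        parent_id = self.versions[version_id]["parent_id"]
        self.versions[version_id] = dict(metadata, parent_id=parent_id)


store = MetadataStore()
v1 = store.save(None, {"title": "Draft"})
store.change_metadata(v1, {"title": "Draft", "tags": "school"})  # no new version
v2 = store.save(v1, {"title": "Final"})                          # new version
```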

#     Remove (all versions of) given object from data store. Fully
synchronous. Doesn't return anything, emits signal Deleted(tree_id).

Do we have any operation in the UI that matches this?
Sure. It's exactly the same as delete(uid) in the current API, used by Journal. You might convince me to add a variant to remove single versions, but keep in mind that deleting a single version from a VCS repository can be quite tough. A variant to remove branches might be easier to implement, but we should decide how to use branches before thinking about how useful that would be.

# Get/Got
Maybe we should make it a bit more verbose? Like GetData?
Makes sense as we're only returning data anyway, not metadata so it's not the exact opposite of save(). Changed, thanks for the suggestion.

# Prefixing a key name with '!' inverts the sense of matching

Is this used by the UI?
Not currently, but it's easy to implement (on the datastore side) and, AFAIR, was talked about in one of the current threads.

# prefixing it with '*' enables regular expression search

Is this used by the UI? I think it's good to think now how possibly
interesting new features would be added in the future, but based on
past experiences I think it would be better to only implement what we
need right now.
This is one feature I'm easily convinced to throw out. :)
As I included the textsearch() API call now (since current Journal needs it) we can rely on that instead. Marked as OPTIONAL for now. Will only be implemented if it's just a few SLoCs (as I expect it to be).
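
To give an idea of the size, here is a sketch of both key prefixes applied to a single metadata dict. The function matches() is hypothetical, and the handling of a combined '!*' prefix is an assumption; the proposal doesn't say whether the two can be combined.

```python
import re


def matches(metadata, query):
    """Return True if metadata satisfies every key/value pair in query."""
    for key, wanted in query.items():
        invert = key.startswith("!")       # '!' inverts the sense of matching
        if invert:
            key = key[1:]
        use_regex = key.startswith("*")    # '*' enables regular expressions
        if use_regex:
            key = key[1:]
        value = metadata.get(key, "")
        if use_regex:
            hit = re.search(wanted, value) is not None
        else:
            hit = value == wanted
        if hit == invert:                  # plain miss, or inverted hit
            return False
    return True


entry = {"title": "Physics notes", "mime_type": "text/plain"}
```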

# Arbitrary key names are allowed, but speed may vary (i.e. not
everything is indexed).

Same here, I would return an exception for a non-indexed field before
implementing searches for arbitrary properties.
I think that's crippling potential Journal development / alternatives too much. See Library for example, it uses arbitrary metadata and has to read the whole datastore contents currently.

#     if True returns all matching versions of an object instead of
only the latest one

Where in the UI would we list only the latest versions of several tree_ids?
In the current Journal list view and in the object picker (which I don't particularly like but that's a topic on its own). There's a case to be made to return the HEADs of all branches instead, see also the corresponding TODO entry.

# textsearch(querystring, options)

What if the user has a date filter and enters a fulltext query? I
don't see how this would be implemented with the proposed
find/textsearch split.
That's a tough example. All other filters are easily replaced by prefix terms, but date is a range so it needs to be a value inside Xapian, not a term. How about just adding "query" from find() to it? Then most activities could rely on the stable interface of find() and the few advanced consumers (like Journal) would need to be adapted to a new IR search API anyway in order to provide better user experience (spelling corrections, tag suggestions, ...).

# Stopped()
What is this for?
Tell me. :-P
Maybe to delay shutdown until datastore has finished writing?

#      The internal data structures of datastore or one of its
backends are corrupted. Should only happen in case of hardware defects
or OS bugs.

Is power failure considered a hw defect here, or does the proposed design
protect against that?
The latter option. Actually there's another way to corrupt the data structures, namely improper tuning of filesystem / fs options (e.g. data=writeback on ext3 or using VFAT), but it could be argued that it's just an OS bug since the API contract is broken then. ;)

CU Sascha


Sugar-devel mailing list
