On Mon, Jul 06, 2009 at 12:29:53PM +0200, Tomeu Vizoso wrote:
Agreed. I have to say that your proposal is excellent, congratulations!
Thanks, I'm flattered. :)
Is the asynchronous API design useful enough to warrant a more complex implementation?
I'm not sure, but I think that whatever decision we take should be made based on actual usage of the DS. What about proposing an example of how an existing activity would be modified to use the new API?
OK, will work on one.
- For save() calls the activity needs to wait for the result (containing the new version_id) before it can invoke save() again for the same object, which can take quite some time if save() is sync - especially if other activities are saving at the same time.
What about having a separate call that returns synchronously a new tree_id and/or version_id?
Interesting idea, need to think about it. As we're going to use UUIDs, not using requested versions shouldn't be an issue (for other version number schemes like the one you propose below, "holes" in the numbering could be troublesome).
Making the API fully asynchronous is the cause for much of the complexity of my proposal, but if we eliminate the queueing the response times for write accesses and checkout() can be very long even for unrelated operations.
Why for unrelated operations?
Because we're serializing VCS operations. They are IO bound (more specifically: disk bound) and parallelisation would only lead to IO starvation, especially for HDDs.
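To make the trade-off concrete, here is a minimal sketch of serialized operations behind an asynchronous call. It is an illustration only, not the prototype's code: the `SerialWorker` class, its `submit()`/`close()` methods and the string "operations" are all hypothetical names invented for this example.

```python
import queue
import threading


class SerialWorker:
    """Runs submitted operations one at a time on a single background thread."""

    def __init__(self):
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, operation, on_done):
        """Queue operation() and return immediately; on_done(result) fires later."""
        self._jobs.put((operation, on_done))

    def _run(self):
        while True:
            job = self._jobs.get()
            if job is None:  # sentinel pushed by close()
                break
            operation, on_done = job
            on_done(operation())

    def close(self):
        """Process everything already queued, then stop the worker thread."""
        self._jobs.put(None)
        self._thread.join()


results = []
worker = SerialWorker()
for name in ("save A", "checkout B", "save C"):
    # Each lambda stands in for a disk-bound VCS operation (hypothetical).
    worker.submit(lambda name=name: name + " done", results.append)
worker.close()
print(results)
```

The caller never blocks in submit(), but "checkout B" still cannot start before "save A" has finished, which is why even unrelated operations can see long response times once the queue is removed from the API.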
# do we want an optimized way to determine (only) the branch HEADs of a given tree_id?
This depends on the intended UI. My opinion is that if we branch at every interesting modification (triggered by the activity detecting an interesting change or by the user clicking on the Keep button), we would like to display in the object list all the HEADs of each branch in each tree_id. In that case yes, we need a way to retrieve that list that is fast on both the client and the server side.
My imagined usage of branches was to create them automatically upon altering a non-HEAD version. A user basing off an old version could mean the newer version is "broken" (in that case promoting the new version to the HEAD of the current branch makes more sense) or that (s)he uses the older version as a kind of template to create derivatives (so creating a branch would make most sense). But I'm open to alternative suggestions. We'd most likely need a way to explicitly create branches then.
# using symlink instead of hardlink for "incoming" queue since we want to support directory trees, not just files
What justifies this new requirement?
That it's a) of use to activities (IIRC some of them use ZIP files instead right now), b) easy enough to achieve with the new design and c) leads to better delta compression and thus disk space efficiency.
# since an index rebuild can take a lot of time we need to provide UI feedback while doing that
Any I/O operation can potentially take a lot of time, but with the current version of the DS rebuilding an index with a few thousand entries is not so slow on the XO. We should never need to rebuild the index, so this new requirement might not be justified (given the current resources, all the other work we need to do, etc).
OK, good to know index rebuilding is fast. So the simple, boolean API I proposed (check_ready() / Ready()) suffices.
# detecting identical files across objects isn't as important since duplicates are mostly expected to occur as versions of the same object
Based on how current activities are using the DS, this isn't the case. The most common case I have heard of from the field is children downloading the same PDF several times to read it.
Oh, didn't know that, so it's a new requirement.
An alternative to the current method for detecting duplicates is moving this task to activities, is that what you suggest?
I'm ambivalent about it. On one hand it's not so easy to achieve in datastore (for various backends) and more indicative of UI deficiencies (why did the children download the file several times in the first place; it's bandwidth wastage as well), on the other hand it might not be easy to do in Browse, either. But maybe storing the URL as metadata and looking for that is enough for most cases? I guess it happens during a single session so the URL (even if including a session ID or whatever) should be stable enough?
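A rough sketch of the "store the URL as metadata" idea, from the activity's point of view. The `source_url` key, the `find_by_url()` helper and the list-of-dicts journal are assumptions made up for this example; nothing here is an existing datastore or Browse API.

```python
def find_by_url(entries, url):
    """Return the first entry whose (hypothetical) 'source_url' metadata equals url."""
    for entry in entries:
        if entry.get("source_url") == url:
            return entry
    return None


journal = [{"title": "Story.pdf", "source_url": "http://example.org/story.pdf"}]
url = "http://example.org/story.pdf"
existing = find_by_url(journal, url)
if existing is None:
    # Only here would the activity actually download and save a new entry.
    journal.append({"title": "Story.pdf", "source_url": url})
print(len(journal))
```

This only catches repeated downloads of an identical URL; the same file reached through different URLs would still be stored twice.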
About the benefits of differential compression I would like to note that if you analyze a real world journal, the biggest files are videos, mp3, pdfs, etc., so files in formats not easily editable with the activities we currently have.
Which is neither an argument pro nor contra delta compression as storage requirements should be about the same either way. OTOH most activities that do support modification currently save in a text based format, so for the large number of versions I expect (remember we're autosaving on activity switch) it could be a huge gain (not with git though since AFAICT it stores the entire blob every time, not just the differences).
With that I don't mean it is not an interesting challenge or something that we won't need in the future, just that it has a relatively low impact as of today.
Which is why the minimal "delta compression" in git should be sufficient for now. :) What's more of a problem is one of the points mtd raised, though: git potentially choking on large files (mmap should be fine OTOH).
# activities should not submit new entries while the previously submitted one hasn't been fully committed yet
Why so?
This is the answer I gave before:
Looks like I need to define should/must/etc. for the final version of the document. It's advice, not a requirement. The intention is to avoid having an ever-increasing backlog because the activity saves faster than the datastore can process.
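One way an activity could follow that advice is sketched below, assuming the datastore signals completion of each save in some way. `SaveThrottle`, `request_save()` and `on_saved()` are hypothetical names for this example, and dropping intermediate states is just one possible policy.

```python
class SaveThrottle:
    """Keeps at most one save in flight and remembers only the newest pending one."""

    def __init__(self, submit):
        self._submit = submit      # callable that hands data to the datastore
        self._in_flight = False
        self._pending = None

    def request_save(self, data):
        if self._in_flight:
            self._pending = data   # overwrite: older unsaved states are dropped
        else:
            self._in_flight = True
            self._submit(data)

    def on_saved(self):
        """Call this when the datastore reports the previous save as committed."""
        if self._pending is not None:
            data, self._pending = self._pending, None
            self._submit(data)
        else:
            self._in_flight = False


submitted = []
throttle = SaveThrottle(submitted.append)
throttle.request_save("v1")   # submitted immediately
throttle.request_save("v2")   # datastore still busy: held back
throttle.request_save("v3")   # replaces v2
throttle.on_saved()           # v1 committed, v3 goes out
throttle.on_saved()           # v3 committed, nothing pending
print(submitted)
```

The backlog can never grow beyond one entry, at the cost of "v2" never becoming a version of its own.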
# version_id and parent_id
Have you thought about version_id being of the form of '2.1.4'?
Yes, that's what I intended originally. But someone (Ben?) made a good argument for random IDs in one of the recent threads. Besides: the current prototype already uses the latter ones. :) Using random IDs and storing the relationship in metadata is easier to implement than constructing and parsing structured IDs. It's not clear the latter would buy us anything real.
That would make parent_id unneeded because we could refer to the parent as (tree_id, 2.1.3). And would also allow us to identify the HEAD of each branch.
But only inside datastore - any API consumer shouldn't make assumptions about version format.
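For comparison, branch HEADs can also be derived from random IDs plus parent_id metadata alone. The sketch below assumes the versions of one tree_id are available as a plain mapping; `branch_heads()` and the single-letter IDs are illustrative only.

```python
def branch_heads(parents):
    """Return version_ids that no other version names as its parent.

    parents maps version_id -> parent_id (None for the first version).
    """
    referenced = {p for p in parents.values() if p is not None}
    return sorted(v for v in parents if v not in referenced)


# Hypothetical history: a -> b -> c, plus a branch b -> d.
history = {"a": None, "b": "a", "c": "b", "d": "b"}
print(branch_heads(history))
```

This needs the parent_id of every version in the tree, so whether it is fast enough depends on how the metadata is indexed.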
# creator
What is it for?
E.g. to determine the default activity for resuming. Current name of this property is 'activity'.
# activity saves data to a disk, ensuring it has been committed (sync) and proper access rights for data store
By sync you mean written to disk? Why do activities need to worry about this?
Because activities know best what exactly needs to be synced. We should be able to remove this requirement in exchange for reduced datastore performance (esp. for directory objects). I'm not perfectly sure fdatasync() done in datastore will cause data written by the activity to be written to disk (though I read the POSIX definition of fdatasync() that way) but there are ways to find that out. :)
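What "commit to disk and set access rights" could look like on the activity side, as a sketch for a single file on a POSIX system. `write_and_sync()` and the 0o644 mode are assumptions for this example; the rights actually required depend on how the datastore accesses the file. It uses os.fsync(), which flushes data and metadata; os.fdatasync() is the lighter call on platforms that provide it.

```python
import os
import tempfile


def write_and_sync(path, data, mode=0o644):
    """Write data to path and ask the kernel to flush it to stable storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()                 # push Python's buffer to the OS
        os.fsync(f.fileno())      # ask the OS to push its cache to the disk
    os.chmod(path, mode)          # assumed mode; lets other users read the file
    # Flush the directory entry as well so the new name survives a crash.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "entry.txt")
    write_and_sync(target, b"hello")
    with open(target, "rb") as f:
        content = f.read()
print(content)
```

For directory objects the activity would have to repeat this for every file and subdirectory it wrote, which is the part that is cheaper to do in the activity than to guess in the datastore.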
# Changes the (unversioned/version-specific) metadata of the given object to match metadata. Fully synchronous, no return value.
How do we know which properties are version-specific and which aren't?
By treating them accordingly. :) Datastore is agnostic to this property (of metadata entries). Metadata is bound to each version but modifiable. For "versioned" metadata the API consumer is supposed to call save(), for "unversioned/version-specific" metadata it should call change_metadata() instead. If we decide to make some metadata global (i.e. common to all versions) I'd just hardcode those few names.
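The difference between the two calls can be shown with a toy in-memory store. This is only a sketch of the semantics described above; `ToyStore` and its method signatures are invented for the example and do not match the proposed D-Bus API.

```python
import uuid


class ToyStore:
    """In-memory stand-in for the datastore; not the proposed implementation."""

    def __init__(self):
        self.versions = {}   # (tree_id, version_id) -> metadata dict

    def save(self, tree_id, parent_id, metadata):
        """Record a new version; 'versioned' metadata changes go through here."""
        version_id = str(uuid.uuid4())
        self.versions[(tree_id, version_id)] = dict(metadata, parent_id=parent_id)
        return version_id

    def change_metadata(self, tree_id, version_id, metadata):
        """Replace the metadata of an existing version without creating a new one."""
        parent_id = self.versions[(tree_id, version_id)]["parent_id"]
        self.versions[(tree_id, version_id)] = dict(metadata, parent_id=parent_id)


store = ToyStore()
v1 = store.save("tree-1", None, {"title": "Draft"})
v2 = store.save("tree-1", v1, {"title": "Final"})   # creates a second version
store.change_metadata("tree-1", v2, {"title": "Final", "tags": "homework"})
print(len(store.versions), store.versions[("tree-1", v1)]["title"])
```

The store itself never looks at which keys are "versioned"; the caller decides by choosing which of the two methods to call.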
# Remove (all versions of) given object from data store. Fully synchronous. Doesn't return anything, emits signal Deleted(tree_id).
Do we have any operation in the UI that matches this?
Sure. It's exactly the same as delete(uid) in the current API, used by Journal. You might convince me to add a variant to remove single versions, but keep in mind that deleting a single version from a VCS repository can be quite tough. A variant to remove branches might be easier to implement, but we should decide how to use branches before thinking about how useful that would be.
# Get/Got
Maybe should we make it a bit more verbose? Like GetData?
Makes sense as we're only returning data anyway, not metadata, so it's not the exact opposite of save(). Changed, thanks for the suggestion.
# Prefixing a key name with '!' inverts the sense of matching
Is this used by the UI?
Currently not, but easy to implement (on datastore side) and AFAIR talked about in one of the current threads.
# prefixing it with '*' enables regular expression search
Is this used by the UI? I think it's good to think now how possibly interesting new features would be added in the future, but based on past experiences I think it would be better to only implement what we need right now.
This is one feature I'm easy to convince to throw out. :) As I included the textsearch() API call now (since current Journal needs it) we can rely on that instead. Marked as OPTIONAL for now. Will only be implemented if it's just a few SLoCs (as I expect it to be).
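To give an idea of the size of the optional '!' and '*' prefixes, here is a sketch over plain dicts. The order of the prefixes ('!' before '*'), the treatment of missing keys and the `key_matches()`/`find()` helpers are assumptions of this example, not part of the proposal.

```python
import re


def key_matches(query_key, expected, metadata):
    """Check one query entry against a metadata dict.

    Assumed syntax: optional '!' (invert), then optional '*' (regex), then the key.
    """
    invert = query_key.startswith("!")
    if invert:
        query_key = query_key[1:]
    use_regex = query_key.startswith("*")
    if use_regex:
        query_key = query_key[1:]
    actual = metadata.get(query_key)
    if actual is None:
        hit = False                      # a missing key never matches directly
    elif use_regex:
        hit = re.search(expected, actual) is not None
    else:
        hit = actual == expected
    return hit != invert                 # '!' flips the result


def find(entries, query):
    """Return the entries for which every query item matches."""
    return [e for e in entries
            if all(key_matches(k, v, e) for k, v in query.items())]


entries = [
    {"title": "Report 2009", "mime_type": "text/plain"},
    {"title": "Holiday", "mime_type": "image/png"},
]
print(find(entries, {"*title": r"^Report \d+$"}))
print(find(entries, {"!mime_type": "text/plain"}))
```

A linear scan like this ignores the index entirely, so it says nothing about how fast such queries would be inside the real datastore.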
# Arbitrary key names are allowed, but speed may vary (i.e. not everything is indexed).
Same here, I would return an exception for a non-indexed field before implementing searches for arbitrary properties.
I think that's crippling potential Journal development / alternatives too much. See Library for example, it uses arbitrary metadata and has to read the whole datastore contents currently.
# if True returns all matching versions of an object instead of only the latest one
Where in the UI would we list only the last versions of several tree_ids?
In the current Journal list view and in the object picker (which I don't particularly like but that's a topic on its own). There's a case to be made to return the HEADs of all branches instead, see also the corresponding TODO entry.
# textsearch(querystring, options)
What if the user has a date filter and enters a fulltext query? I don't see how this would be implemented with the proposed find/textsearch split.
That's a tough example. All other filters are easily replaced by prefix terms, but date is a range so it needs to be a value inside Xapian, not a term. How about just adding "query" from find() to it? Then most activities could rely on the stable interface of find() and the few advanced consumers (like Journal) would need to be adapted to a new IR search API anyway in order to provide better user experience (spelling corrections, tag suggestions, ...).
# Stopped()
What is this for?
Tell me. :-P Maybe to delay shutdown until datastore has finished writing?
# The internal data structures of datastore or one of its backends are corrupted. Should only happen in case of hardware defects or OS bugs.
Is power failure considered a hw defect here or does the proposed design protect against that?
The latter option. Actually there's another way to corrupt the data structures, namely improper tuning of filesystem / fs options (e.g. data=writeback on ext3 or using VFAT), but it could be argued that it's just an OS bug since the API contract is broken then. ;)
CU Sascha
--
http://sascha.silbe.org/
http://www.infra-silbe.de/
_______________________________________________ Sugar-devel mailing list Sugaremail@example.com http://lists.sugarlabs.org/listinfo/sugar-devel