Hi,

On 08/12/2013 00:17, Philipp Heckel wrote:
> Now to the topic: While I am really, really happy that you guys are
> discussing so enthusiastically, I think we're drifting a bit into
> philosophical and academic discussions. Please do not get this the wrong
> way, I think discussion is important, but I think that sometimes code is
> easier to understand -- especially when it's a relatively small change
> in code (like with the IDs). That's why I suggest to simply play around
> in code and show us what you mean.
I would say that each of us has a way of discussing and thinking about code which is different and reflects one's background. I'm an academic: I'm more at ease discussing things on a theoretical/philosophical level and then moving to more concrete things. But I will not ask others to follow me there ;-) So the discussion with Gregor was really nice for me, but I totally understand that some code is needed at some point, and also that you, Philipp, and some others will probably wait until things are a little more concrete before commenting. In my opinion, everybody wins by having this discussion in several steps.

> Also -- and again: do not take this the wrong way! -- there are many
> important things to do to get a working piece of software, and I feel
> that the ID question is more of an optimization. Now I know that Fabrice
> likes to get to 1MM files (and believe me we'll get there!), but we
> first need to be able to perform a cleanup of files and file versions,
> and represent the local database in general in a more efficient way. So
> if you will: there are bigger issues to consider when drafting an ID
> solution, and bigger issues to solve in general :-)

Of course, but keeping a long id was a (very small) risk, and moving to a better solution is needed. A simple solution like using ByteArray is clearly possible, but I think the solution proposed by Gregor (and me) is not that complicated.

> [..]
> Next steps:
> - I'm meeting with Gregor tomorrow: My original goal was to talk about
> the database stuff in general, but I guess we'll also talk over the ID
> stuff. Maybe we'll be enlightened then. We'll review all the code and
> suggestions and hopefully implement something. (Btw. I liked the
> ShortId<T> & ArrayId<T> idea)

Ok. I've pushed some additional modifications along the lines of FileId to my branch (longer-file-id), but it's not based on Gregor's design.

> - It would be very valuable to me if you could review the general
> Database in-memory representation.
> My solution to the ever-growing local
> RAM was to simply put everything in a local SQL database, and load it on
> demand, but the JPA stuff is complex and maybe it can be done more
> easily ... Ideas?

My personal problem with that is again a theoretical one, or a conceptual one if you prefer. I think a design document is _absolutely_ needed if you want to obtain something correct for the database (in memory, locally on disk and remotely on storage). I mean that you had a first implementation in the older code base which brought a lot of insights and allowed you to identify two major problems: version control of the database itself, and communication issues around this version control. In the second implementation, you have something quite stable that contains an informal specification of the version control system (based on vector clocks and such) and of the communication (delta based). But you have also identified representation issues.

My recommendation is to use this second implementation as the basis of a design document for the representation, rather than using JPA and hoping for the best (which won't happen: all the benchmarks I've seen show quite bad performance of JPA compared to JDBC).

I think we first need an entity-relationship model of the data. For instance, we have a Chunk entity and a MultiChunk entity, with an "is made of" relation, etc. It would be way simpler to reason on such a model than on a bunch of classes. Then we need to identify scenarios and see what they need in terms of requests to the model. For instance, when one wants to up his/her modifications, a file tree walk will take place. This needs to browse both the file system and the entire last known state of the remote storage (a.k.a. the current version of all files) to compare them.
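Since Syncany is Java, here is a minimal Java sketch of that comparison: diff a preloaded snapshot of the last known state (path -> metadata) against the result of a tree walk, in one pass each. All names here (SnapshotDiff, Diff) are mine and purely illustrative, and the metadata is reduced to a checksum string to keep the sketch short:

```java
import java.util.*;

// Sketch: compare a snapshot of the last known remote state (path -> checksum)
// against the result of a local tree walk. Deleted files fall out naturally:
// they are the snapshot keys never seen during the walk.
public class SnapshotDiff {
    record Diff(Set<String> added, Set<String> modified, Set<String> deleted) {}

    static Diff diff(Map<String, String> dbState, Map<String, String> fsState) {
        Set<String> added = new HashSet<>();
        Set<String> modified = new HashSet<>();
        // Assume everything is deleted until the walk proves otherwise.
        Set<String> deleted = new HashSet<>(dbState.keySet());

        for (Map.Entry<String, String> e : fsState.entrySet()) {
            String known = dbState.get(e.getKey());
            if (known == null) {
                added.add(e.getKey());          // on disk, unknown to the db
            } else {
                deleted.remove(e.getKey());     // seen during the walk
                if (!known.equals(e.getValue()))
                    modified.add(e.getKey());   // checksum changed
            }
        }
        return new Diff(added, modified, deleted);
    }

    public static void main(String[] args) {
        Map<String, String> db = Map.of("a.txt", "h1", "b.txt", "h2");
        Map<String, String> fs = Map.of("a.txt", "h1", "c.txt", "h3");
        // c.txt is added, b.txt is deleted, nothing is modified
        System.out.println(diff(db, fs));
    }
}
```

Note that the whole comparison touches the database exactly once (to load the snapshot), which is the point: no per-file round trips.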
A very bad idea would be to walk the tree and query the database for each file to get its last known state: because of the round trips between the walking code and the database code, this would waste a large amount of time. In addition, one needs to detect deleted files, which can only be done via a full scan of the database. So I think in this case we need to fully load the current state of the database, which is more or less a Map between a file path (a file key in the future, to leverage inodes) and its metadata. This gives a first constraint on the database representation: it needs to be able to produce such a "current state snapshot" efficiently. It should be doable with a "select path, metadata from somewhere where version=current" (or something similar, you see what I mean).

Other scenarios will ask for other things. For instance, in the watcher case, maybe individual queries will make sense. Also, when one loses a race in the upload, the last commit must be rolled back (sort of), so one will need a way to identify the last commit. Another example is the cleanup operation: one needs a way to identify chunks that are no longer used.

Without all of this, I think we are stuck with the current, very complex Java representation of the data. This representation was needed to build a working version of Syncany, and I'm truly impressed by the result. Now that it works, it's time to sort things out without first trying to optimize the storage of this representation as if it were dictated by the data.

Cheers,

Fabrice

--
Mailing list: https://launchpad.net/~syncany-team
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~syncany-team
More help   : https://help.launchpad.net/ListHelp

