Re: [PATCH v3 00/14] ref-transactions-reflog
On 11/20/2014 12:22 AM, Stefan Beller wrote: Sorry for the long delay. Thanks for the explanation and discussion. So do I understand it right, that you are not opposing the introduction of everything should go through transactions but rather the detail and abstraction level of the API? Correct. So starting from Michaels proposal in the first response: 1. Add a reflog entry when a reference is updated in a transaction. ok 2. Rename a reflog file when the corresponding reference is renamed. This should happen within the same transaction as the reference is renamed, right? Yes. Maybe there should be a rename reference operation that can be added to a transaction, and it simply knows to rename any associated reflogs. Then the calling code wouldn't have to worry about reflogs explicitly in this case at all. So we don't have a multistep process here, which may abort in between having the reference updated and a broken reflog or vice versa. We want to either have both the ref and the reflog updated or neither. Yes. 3. Delete the reflog when the corresponding reference is deleted [1]. also as one transaction? It would be a side-effect of committing a transaction that contains a reference deletion. The deletion of the reflog would be done at the same time that the rest of the transaction is committed, and again the calling code wouldn't have to explicitly worry about the reflogs. 4. Configure a reference to be reflogged. 5. Configure a reference to not be reflogged anymore and delete any existing reflog. Why do we need 4 and 5 here? Wouldn't all refs be reflog by default and why do I want to exclude some? 6. Selectively expire old reflog entries, e.g., based on their age. This is the maintenance operation, which you were talking about. In my vision, this also should go into one transaction. So you have the business logic figuring out all the changes (drop reflog entry a b and d) and within one transaction we can perform all of the changes. But if we take the approach described above, AFAICT this operation is the only one that would require the caller to manipulate reflog entries explicitly. And it has to iterate through the old reflog entries, decide which ones to keep, possibly change its neighbors to eliminate gaps in the chain, then stuff each of the reflog entries into a transaction one by one. To allow this to be implemented on the caller side, the transaction API has to be complicated in the following ways: * Add a transaction_update_type (UPDATE_SHA1 vs. UPDATE_LOG). * Add reflog_fd, reflog_lock, and committer members to struct ref_update. * New function transaction_update_reflog(). * A new flag REFLOG_TRUNCATE that allows the reflog file to be truncated before writing. * Machinery that recognizes that a transaction contains multiple reflog updates for the same reference and processes them specially to avoid locking and rewriting the reflog file multiple times. So this design has the caller serializing all reflog entries into separate ref_update structs (which implies that they are held in RAM!) only for ref_transaction_commit() to scan through all ref_updates looking for reflog updates that go together so that they can be processed as a whole. In other words, the caller picks the reflog apart and then ref_transaction_commit() glues it back together. It's all very contrived. I suggest that the caller only be responsible for deciding which reflog entries to keep (by supplying a callback function), and a new expire_reflogs_for_me_please() API function be responsible for taking out a lock, feeding the old reflog entries to the callback, expiring the unwanted entries, optionally eliminating gaps in the chain (for the use of reflog [expire|delete] --rewrite), writing the new reflog entries, and optionally updating the reference itself (for the use of reflog [expire|delete] --updateref). The benefit will be simpler code, a better separation of responsibilities, and a simpler VTABLE that future reference backends have to implement. I would love to work on this but unfortunately have way too much on my plate right now. Michael -- Michael Haggerty mhag...@alum.mit.edu -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 00/14] ref-transactions-reflog
Jonathan Nieder jrnie...@gmail.com writes: I also have some thoughts about how those operations can be implemented without such a performance hit (reading the whole reflog into memory as part of the transaction seems problematic to me), but that should probably wait for a separate message (and I've talked about it a bit in person). Perhaps iterator interface such as for_each_reflog_ent() would help? -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 00/14] ref-transactions-reflog
Michael Haggerty wrote: On 11/20/2014 12:22 AM, Stefan Beller wrote: 3. Delete the reflog when the corresponding reference is deleted [1]. also as one transaction? It would be a side-effect of committing a transaction that contains a reference deletion. The deletion of the reflog would be done at the same time that the rest of the transaction is committed, and again the calling code wouldn't have to explicitly worry about the reflogs. There is git reflog delete ref, for when you have logallrefupdates disabled and had explicitly enabled reflogs for a particular ref and now want to turn it off. [...] So this design has the caller serializing all reflog entries into separate ref_update structs (which implies that they are held in RAM!) only for ref_transaction_commit() to scan through all ref_updates looking for reflog updates that go together so that they can be processed as a whole. In other words, the caller picks the reflog apart and then ref_transaction_commit() glues it back together. It's all very contrived. I think there is a simpler and more efficient way to implement this. transaction_update_reflog() can append to a .lock file. transaction_commit() then would rename it into place. There is some fuss about naming the .lock file to avoid D/F conflicts, which is a topic for a separate message. I suggest that the caller only be responsible for deciding which reflog entries to keep (by supplying a callback function), That could be handy. The basic operations described before would still be needed, though: create a new reflog with one entry, for new refs append an entry to a reflog, for ref updates (and the associated symref reflog update) copy (or rename --- that's a more minor detail) a reflog, for renaming refs delete a reflog, for git reflog delete And the filter reflog operation you are describing is implementable using those four operations, with no performance hit when dealing with reflogs stored in the files backend. Providing these operations doesn't prevent adding filter reflog using callback later if it turns out to be the right operation for other backends. It could turn out that some other primitive operations that are easy as an SQL operation is more useful, like delete reflog entry (without iterating over the others) or expire entries older than date. The nice thing is that adding those wouldn't break any code using the initial four operations described above. So this seems like a good starting point. [...] I would love to work on this but unfortunately have way too much on my plate right now. Of course code is always an easy way to change my mind, when the time comes. ;-) Thanks, Jonathan -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 00/14] ref-transactions-reflog
Sorry for the long delay. Thanks for the explanation and discussion. So do I understand it right, that you are not opposing the introduction of everything should go through transactions but rather the detail and abstraction level of the API? So starting from Michaels proposal in the first response: 1. Add a reflog entry when a reference is updated in a transaction. ok 2. Rename a reflog file when the corresponding reference is renamed. This should happen within the same transaction as the reference is renamed, right? So we don't have a multistep process here, which may abort in between having the reference updated and a broken reflog or vice versa. We want to either have both the ref and the reflog updated or neither. 3. Delete the reflog when the corresponding reference is deleted [1]. also as one transaction? 4. Configure a reference to be reflogged. 5. Configure a reference to not be reflogged anymore and delete any existing reflog. Why do we need 4 and 5 here? Wouldn't all refs be reflog by default and why do I want to exclude some? 6. Selectively expire old reflog entries, e.g., based on their age. This is the maintenance operation, which you were talking about. In my vision, this also should go into one transaction. So you have the business logic figuring out all the changes (drop reflog entry a b and d) and within one transaction we can perform all of the changes. Thanks, Stefan On Tue, Nov 18, 2014 at 1:28 PM, Junio C Hamano gits...@pobox.com wrote: Michael Haggerty mhag...@alum.mit.edu writes: Sorry, but I lost track---which one is inside and which one is outside? By inside I mean the code that would be within the reference-handling library if we had such a thing; i.e., implemented in refs.c. By outside I mean in the code that calls the library; in this case the outside code would live in builtin/reflog.c. In other words, I'd prefer the outside code in builtin/reflog.c to look vaguely like expire_reflogs_for_me_please(refname, should_expire_cb, cbdata, flags) rather than ... (written as a client of the ref API) ... OK, I very much agree with that. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 00/14] ref-transactions-reflog
Hi, Stefan Beller wrote: Sorry for the long delay. Thanks for the explanation and discussion. So do I understand it right, that you are not opposing the introduction of everything should go through transactions but rather the detail and abstraction level of the API? For what it's worth, I don't personally think it makes sense to put the options supported by 'git reflog expire' into the transaction API as top-level functions. Instead, I think it makes sense, to start off, to support the same building block operations that are used in the current file-based code. That may mean having an API that can't express tricks that e.g. an SQL-based backend would be able to optimize (removing some items from a reflog without copying the rest, filtering based on conditions that can be expressed in SQL such as date, etc) but I think it's fine as a starting point. Later we can add new operations, change existing ones, and so on, based on experience with real backends. The write operations for file-based reflog handling are simple: - create a new reflog with a single reflog entry - add an entry to an existing reflog - (optional) copy a reflog wholesale --- this can be implemented in terms of add an entry, but copying in blocks (or making a reflink, on filesystems that support that) can make this faster - remove a reflog The reflog bookkeeping involved in renaming a ref can be implemented as copy + delete. I also have some thoughts about how those operations can be implemented without such a performance hit (reading the whole reflog into memory as part of the transaction seems problematic to me), but that should probably wait for a separate message (and I've talked about it a bit in person). [...] 4. Configure a reference to be reflogged. 5. Configure a reference to not be reflogged anymore and delete any existing reflog. Why do we need 4 and 5 here? Wouldn't all refs be reflog by default and why do I want to exclude some? See --create-reflog in git-branch(1) and core.logallrefupdates in git-config(1). Reflogs are disabled by default in bare repositories, which makes it easier for unnecessary objects on a server to be more promptly removed by gcs after a non-fast-forward push. I prefer to turn on reflogs when setting up a git server for my personal use. It might be worth flipping that default (as an orthogonal change). 6. Selectively expire old reflog entries, e.g., based on their age. This is the maintenance operation, which you were talking about. In my vision, this also should go into one transaction. So you have the business logic figuring out all the changes (drop reflog entry a b and d) and within one transaction we can perform all of the changes. Makes sense. Thanks, Jonathan -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 00/14] ref-transactions-reflog
On 11/18/2014 02:35 AM, Stefan Beller wrote: The following patch series updates the reflog handling to use transactions. This patch series has previously been sent to the list[1]. [...] I was reviewing this patch series (I left some comments in Gerrit about the first few patches) when I realized that I'm having trouble understanding the big picture of where you want to go with this. I have the feeling that the operations that you are implementing are at too low a level of abstraction. What are the elementary write operations that are needed for a reflog? Off the top of my head, 1. Add a reflog entry when a reference is updated in a transaction. 2. Rename a reflog file when the corresponding reference is renamed. 3. Delete the reflog when the corresponding reference is deleted [1]. 4. Configure a reference to be reflogged. 5. Configure a reference to not be reflogged anymore and delete any existing reflog. 6. Selectively expire old reflog entries, e.g., based on their age. Have I forgotten any? The first three should be side-effects of the corresponding reference updates. Aside from the fact that renames are not yet done within a transaction, I think this is already the case. Number 4, I think, currently only happens in conjunction with adding a line to the reflog. So it could be implemented, say, as a FORCE_CREATE_REFLOG flag on a ref_update within a transaction. Number 5 is not very interesting, I think. For example, it could be a separate API function, disconnected from any transactions. Number 6 is more interesting, and from my quick reading, it looks like a lot of the work of this patch series is to allow number 6 to be implemented in builtin/reflog.c:expire_reflog(). But it seems to me that you are building API calls at the wrong level of abstraction. Expiring a reflog should be a single API call to the refs API, and ultimately it should be left up to the refs backend to decide how to implement it. For a filesystem-based backend, it would do what it does now. But (for example) a SQL-based backend might implement this as a single SELECT statement. I also don't have the feeling that reflog expiration has to be done within a ref_transaction. For example, is there ever a reason to combine expiration with other reference updates in a single atomic transaction? I think not. So it seems to me that it would be more practical to have a separate API function that is called to expire selected entries from a reflog [2], unconnected with any transaction. I am not nearly as steeped in this code as you and Ronnie, and it could be that I'm forgetting lots of details that make your design preferable. But other reviewers are probably in the same boat. So I think it would be really helpful if you would provide a high-level description of the API that you are proposing, and some discussion of its design and tradeoffs. A big part of this description could go straight into a file Documentation/technical/api-ref-transactions.txt, which will be a great (and necessary) resource soon anyway. Michael [1] Though hopefully there will be future reference backends that don't have to discard reflogs when a reference is deleted, so let's not bake this behavior too fundamentally into the API. [2] ...and/or possibly one to expire reflogs for multiple references, if performance would benefit significantly. -- Michael Haggerty mhag...@alum.mit.edu -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 00/14] ref-transactions-reflog
On Tue, Nov 18, 2014 at 3:26 AM, Michael Haggerty mhag...@alum.mit.edu wrote: On 11/18/2014 02:35 AM, Stefan Beller wrote: The following patch series updates the reflog handling to use transactions. This patch series has previously been sent to the list[1]. [...] I was reviewing this patch series (I left some comments in Gerrit about the first few patches) when I realized that I'm having trouble understanding the big picture of where you want to go with this. I have the feeling that the operations that you are implementing are at too low a level of abstraction. What are the elementary write operations that are needed for a reflog? Off the top of my head, 1. Add a reflog entry when a reference is updated in a transaction. 2. Rename a reflog file when the corresponding reference is renamed. 3. Delete the reflog when the corresponding reference is deleted [1]. 4. Configure a reference to be reflogged. 5. Configure a reference to not be reflogged anymore and delete any existing reflog. 6. Selectively expire old reflog entries, e.g., based on their age. Have I forgotten any? The first three should be side-effects of the corresponding reference updates. Aside from the fact that renames are not yet done within a transaction, I think this is already the case. Number 4, I think, currently only happens in conjunction with adding a line to the reflog. So it could be implemented, say, as a FORCE_CREATE_REFLOG flag on a ref_update within a transaction. Number 5 is not very interesting, I think. For example, it could be a separate API function, disconnected from any transactions. Number 6 is more interesting, and from my quick reading, it looks like a lot of the work of this patch series is to allow number 6 to be implemented in builtin/reflog.c:expire_reflog(). But it seems to me that you are building API calls at the wrong level of abstraction. Expiring a reflog should be a single API call to the refs API, and ultimately it should be left up to the refs backend to decide how to implement it. For a filesystem-based backend, it would do what it does now. But (for example) a SQL-based backend might implement this as a single SELECT statement. I agree in principle. But things are more difficult since expire_reflog() has very complex semantics. To keep things simple for the reviews at this stage the logic is the same as the original code: loop over all entries: use very complex conditionals to decide which entries to keep/remove optionally modify the sha1 values for the records we keep write records we keep back to the file one record at a time So that as far as possible, we keep the same rules and behavior but we use a different API for the actual write entry to new reflog. We could wrap this inside a new specific transaction_expire_reflog() function so that other types of backends, for example an SQL backend, could optimize, but I think that should be in a separate later patch because expire_reflog is almost impossibly complex. It will not be a simple SELECT unfortunately. The current expire logic is something like : 1, expire all entries older than timestamp 2, optionally, also expire all entries that refer to unreachable objects using a different timestamp This involves actually reading the objects that the sha1 points to and parsing them! 3, optionally, if the sha1 objects can not be referenced, they are not commit objects or if they don't exist, then expire them too. This also involves reading the objects behind the sha1. 4, optionally, delete reflog entry #foo 5, optionally, if any log entries were discarded due to 2,3,4 then we might also re-write and modify some of the reflog entries we keep. or any combination thereof (6, if --dry-run is specified, just print what we would have expired) 2 and 3 requires that we need to read the objects for the entry 4 allows us to delete a specific entry 5 means that even for entries we keep we will need to mutate them. I also don't have the feeling that reflog expiration has to be done within a ref_transaction. For example, is there ever a reason to combine expiration with other reference updates in a single atomic transaction? --updateref In expire_reflog() we not only prune the reflog. When --updateref is used we update the actual ref itself. I think we want to have both the ref update and also the reflog update both be part of a single atomic transaction. I think not. So it seems to me that it would be more practical to have a separate API function that is called to expire selected entries from a reflog [2], unconnected with any transaction. I think it makes the API cleaner if we have a 'you can only update a ref/reflog/other things added in the future/ from within a transaction.' Since we need to do reflog changes within a transaction for the expire reflog case as well as the rename ref case I think it makes sense to enforce that reflog changes must be done
Re: [PATCH v3 00/14] ref-transactions-reflog
On 11/18/2014 07:36 PM, Ronnie Sahlberg wrote: On Tue, Nov 18, 2014 at 3:26 AM, Michael Haggerty mhag...@alum.mit.edu wrote: On 11/18/2014 02:35 AM, Stefan Beller wrote: The following patch series updates the reflog handling to use transactions. This patch series has previously been sent to the list[1]. [...] I was reviewing this patch series (I left some comments in Gerrit about the first few patches) when I realized that I'm having trouble understanding the big picture of where you want to go with this. I have the feeling that the operations that you are implementing are at too low a level of abstraction. What are the elementary write operations that are needed for a reflog? Off the top of my head, 1. Add a reflog entry when a reference is updated in a transaction. 2. Rename a reflog file when the corresponding reference is renamed. 3. Delete the reflog when the corresponding reference is deleted [1]. 4. Configure a reference to be reflogged. 5. Configure a reference to not be reflogged anymore and delete any existing reflog. 6. Selectively expire old reflog entries, e.g., based on their age. Have I forgotten any? The first three should be side-effects of the corresponding reference updates. Aside from the fact that renames are not yet done within a transaction, I think this is already the case. Number 4, I think, currently only happens in conjunction with adding a line to the reflog. So it could be implemented, say, as a FORCE_CREATE_REFLOG flag on a ref_update within a transaction. Number 5 is not very interesting, I think. For example, it could be a separate API function, disconnected from any transactions. Number 6 is more interesting, and from my quick reading, it looks like a lot of the work of this patch series is to allow number 6 to be implemented in builtin/reflog.c:expire_reflog(). But it seems to me that you are building API calls at the wrong level of abstraction. Expiring a reflog should be a single API call to the refs API, and ultimately it should be left up to the refs backend to decide how to implement it. For a filesystem-based backend, it would do what it does now. But (for example) a SQL-based backend might implement this as a single SELECT statement. I agree in principle. But things are more difficult since expire_reflog() has very complex semantics. To keep things simple for the reviews at this stage the logic is the same as the original code: loop over all entries: use very complex conditionals to decide which entries to keep/remove optionally modify the sha1 values for the records we keep write records we keep back to the file one record at a time So that as far as possible, we keep the same rules and behavior but we use a different API for the actual write entry to new reflog. We could wrap this inside a new specific transaction_expire_reflog() function so that other types of backends, for example an SQL backend, could optimize, but I think that should be in a separate later patch because expire_reflog is almost impossibly complex. It will not be a simple SELECT unfortunately. The current expire logic is something like : 1, expire all entries older than timestamp 2, optionally, also expire all entries that refer to unreachable objects using a different timestamp This involves actually reading the objects that the sha1 points to and parsing them! 3, optionally, if the sha1 objects can not be referenced, they are not commit objects or if they don't exist, then expire them too. This also involves reading the objects behind the sha1. 4, optionally, delete reflog entry #foo 5, optionally, if any log entries were discarded due to 2,3,4 then we might also re-write and modify some of the reflog entries we keep. or any combination thereof (6, if --dry-run is specified, just print what we would have expired) 2 and 3 requires that we need to read the objects for the entry 4 allows us to delete a specific entry 5 means that even for entries we keep we will need to mutate them. Thanks for the explanation. I now understand that it might be more than a single SELECT statement. Regarding the complicated rules for expiring reflogs (1, 2, 3, 4): For now I think it would be fine for the new expire_reflog() API function to take a callback function as an argument. Regarding the stitching together of the survivors (5), it seems like the API function would be the right place to handle that. Regarding 6, it sounds like you could run the reflog entries through your callback and report what it *would* have expired. I also don't have the feeling that reflog expiration has to be done within a ref_transaction. For example, is there ever a reason to combine expiration with other reference updates in a single atomic transaction? --updateref In expire_reflog() we not only prune the reflog. When --updateref is used we update the actual ref itself. I think we
Re: [PATCH v3 00/14] ref-transactions-reflog
Michael Haggerty mhag...@alum.mit.edu writes: I'm still not convinced. For me, reflog_expire() is an unusual outlier operation, much like git gc or git pack-refs or git fsck. None of these are part of the beautiful Git data model; they are messy maintenance operations. Forcing reference transactions to be general enough to allow reflog expiration to be implemented *outside* the refs API sacrificies their simplicity for lots of infrastructure that will probably only be used to implement this single operation. Better to implement reflog expiration *inside* the refs API. Sorry, but I lost track---which one is inside and which one is outside? -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 00/14] ref-transactions-reflog
On 11/18/2014 09:30 PM, Junio C Hamano wrote: Michael Haggerty mhag...@alum.mit.edu writes: I'm still not convinced. For me, reflog_expire() is an unusual outlier operation, much like git gc or git pack-refs or git fsck. None of these are part of the beautiful Git data model; they are messy maintenance operations. Forcing reference transactions to be general enough to allow reflog expiration to be implemented *outside* the refs API sacrificies their simplicity for lots of infrastructure that will probably only be used to implement this single operation. Better to implement reflog expiration *inside* the refs API. Sorry, but I lost track---which one is inside and which one is outside? By inside I mean the code that would be within the reference-handling library if we had such a thing; i.e., implemented in refs.c. By outside I mean in the code that calls the library; in this case the outside code would live in builtin/reflog.c. In other words, I'd prefer the outside code in builtin/reflog.c to look vaguely like expire_reflogs_for_me_please(refname, should_expire_cb, cbdata, flags) rather than transaction = ... for_each_reflog_entry { if should_expire() adjust neighbor reflog entries if necessary (actually, they're transaction entries so we would have to preprocess them before putting them in the transaction) else add reflog entry to transaction } ref_transaction_commit() and instead handle as much of the iteration, bookkeeping, and rewriting as possible inside expire_reflogs_for_me_please(). Michael -- Michael Haggerty mhag...@alum.mit.edu -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 00/14] ref-transactions-reflog
Michael Haggerty mhag...@alum.mit.edu writes: Sorry, but I lost track---which one is inside and which one is outside? By inside I mean the code that would be within the reference-handling library if we had such a thing; i.e., implemented in refs.c. By outside I mean in the code that calls the library; in this case the outside code would live in builtin/reflog.c. In other words, I'd prefer the outside code in builtin/reflog.c to look vaguely like expire_reflogs_for_me_please(refname, should_expire_cb, cbdata, flags) rather than ... (written as a client of the ref API) ... OK, I very much agree with that. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html