Re: [CompositeDataStore] pull request to review GC changes

2018-02-22 Thread Matt Ryan
Hi,


On February 21, 2018 at 10:32:48 PM, Amit Jain (am...@ieee.org) wrote:

Hi,


> Now the problem comes when secondary tries to run the sweep phase. It
will
> first try to verify that a references file exists for each repository
file
> in DS_P - and fail. This fails because primary deleted its references
file
> already. Thus secondary will cancel GC and thus blob C never ends up
> getting deleted. Note that secondary must delete C because it is the only
> repository that knows about C.
>
> This same situation exists also if secondary sweeps first. If record D
was
> created by primary after secondary was cloned, then D is deleted by
> primary, secondary never knows about blob D so it cannot delete it during
> the sweep phase - it can only be deleted by primary.
>

The solution for Shared DataStore currently is to require all repositories
to run a Mark phase then run the Sweep phase on one of them.


Yes.  Sorry, I didn’t mention that.  I was trying to be brief and ended up
being unclear.  In the situation I described above it is definitely running
the mark phase first and then the sweep phase.  The problem is still as I
described - no matter which one runs sweep first, it cannot delete all the
binaries that may possibly have been deleted on both systems.


>
> The change I made to the garbage collector is that when a repository
> finishes the sweep phase, it doesn’t necessarily delete the references
> file. Instead it marks the data store with a “sweepComplete” file
> indicating that this repository finished the sweep phase. When there is a
> “sweepComplete” file for every repository (in other words, the last
> repository to sweep), then all the references files are deleted.
>
> Well currently the problem is that all repositories are not required to
run the sweep phase. The solution above would have been ok when the GC is
to be run manually at different times as in your case.

Exactly - in the case I’ve described both have to successfully run a sweep
or not all binaries will be deleted.


But in the real world applications typically there's a cron (e.g. AEM
maintenance task) which could be setup to execute weekly at a particular
time on all repositories. In this case in almost all cases the repository
which finished the Mark phase at the last would only be able to execute the
Sweep phase as it would be the only repository to see all the reference
files for other repos (others executing before it would fail). This is
still Ok for the Shared DataStore use cases we have.
But with the above solution since not all repositories would be able to run
the sweep phase the reference files won't be cleaned up.

A very valid point.  I’ll need to think that one through some more.



Besides there's a problem of the Sweep phase on the primary encountering
blobs it does not know about (from the secondary) and which it cannot
delete creating an unpleasant experience. As I understand the Primary could
be a production system and having these sort of errors crop up would be
problematic.

If they are regarded as errors, yes.  Currently this logs a WARN level
message (not an ERROR) which suggests that sometimes not all the binaries
targeted for deletion will actually be deleted.

So this might be an issue of setting clear expectations.  But I do see the
point.



So, generically the solution would be to use the shared DataStore GC
paradigm we currently have which requires Mark phase to be run on all
repositories before running a Sweep.

Yes - like I said this is being done, it still requires that both repos do
a sweep.



For this specific use case some observations and quick rough sketch of a
possible solution:
* The DataStores for the 2 repositories - Primary & Secondary can be
thought of as Shared & Private
** Primary does not know about Secondary and could be an existing
repository and thus does not know about the DataStore of the Secondary as
well. In other words it could even function as a normal DataStore and need
not be a CompositeDataStore.

** Secondary does need to know about the Primary and thus registers itself
as sharing the Primary DataStore.
* Encode the blobs ids on the Secondary with the DataStore location/type
with which we can distinguish the blob ids belonging to the respective
DataStores.

That’s a solution that only works in this very specific use case of
CompositeDataStore.  In the future if we were ever to want to support
different scenarios we would then have to reconsider how it encodes blobs
for each delegate.  Would that mean that data written to a data store by
the CompositeDataStore could not be read by another CompositeDataStore
referencing the same delegate?


* Secondary's Mark phase only redirects the Primary owned blobids to the
references file in the Primary's DataStore (Primary's DataStore operating
as Shared).

The DataStore has no knowledge of the garbage collection stages.  So IIUC
this would require creating a new garbage collector that is aware of
composite data stores and has the 

Re: [CompositeDataStore] pull request to review GC changes

2018-02-21 Thread Amit Jain
Hi,


> Now the problem comes when secondary tries to run the sweep phase.  It will
> first try to verify that a references file exists for each repository file
> in DS_P - and fail.  This fails because primary deleted its references file
> already.  Thus secondary will cancel GC and thus blob C never ends up
> getting deleted.  Note that secondary must delete C because it is the only
> repository that knows about C.
>
> This same situation exists also if secondary sweeps first.  If record D was
> created by primary after secondary was cloned, then D is deleted by
> primary, secondary never knows about blob D so it cannot delete it during
> the sweep phase - it can only be deleted by primary.
>

The solution for Shared DataStore currently is to require all repositories
to run a Mark phase then run the Sweep phase on one of them.


>
> The change I made to the garbage collector is that when a repository
> finishes the sweep phase, it doesn’t necessarily delete the references
> file.  Instead it marks the data store with a “sweepComplete” file
> indicating that this repository finished the sweep phase.  When there is a
> “sweepComplete” file for every repository (in other words, the last
> repository to sweep), then all the references files are deleted.
>
> Well currently the problem is that all repositories are not required to
run the sweep phase. The solution above would have been ok when the GC is
to be run manually at different times as in your case.
But in the real world applications typically there's a cron (e.g. AEM
maintenance task) which could be setup to execute weekly at a particular
time on all repositories. In this case in almost all cases the repository
which finished the Mark phase at the last would only be able to execute the
Sweep phase as it would be the only repository to see all the reference
files for other repos (others executing before it would fail). This is
still Ok for the Shared DataStore use cases we have.
But with the above solution since not all repositories would be able to run
the sweep phase the reference files won't be cleaned up.

Besides there's a problem of the Sweep phase on the primary encountering
blobs it does not know about (from the secondary) and which it cannot
delete creating an unpleasant experience. As I understand the Primary could
be a production system and having these sort of errors crop up would be
problematic.

So, generically the solution would be to use the shared DataStore GC
paradigm we currently have which requires Mark phase to be run on all
repositories before running a Sweep.

For this specific use case some observations and quick rough sketch of a
possible solution:
*  The DataStores for the 2 repositories - Primary & Secondary can be
thought of as Shared & Private
** Primary does not know about Secondary and could be an existing
repository and thus does not know about the DataStore of the Secondary as
well. In other words it could even function as a normal DataStore and need
not be a CompositeDataStore.
** Secondary does need to know about the Primary and thus registers itself
as sharing the Primary DataStore.
* Encode the blobs ids on the Secondary with the DataStore location/type
with which we can distinguish the blob ids belonging to the respective
DataStores.
* Secondary's Mark phase only redirects the Primary owned blobids to the
references file in the Primary's DataStore (Primary's DataStore operating
as Shared).
* Secondary executes GC for its DataStore independently and does not worry
about the Shared blobids (already taken care of above).

I presume some of the above steps are required to enable a generic or even
some restricted CompositeDataStore solutions.

Thanks
Amit