[ 
https://issues.apache.org/jira/browse/JAMES-3150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411930#comment-17411930
 ] 

Benoit Tellier edited comment on JAMES-3150 at 9/8/21, 1:31 PM:
----------------------------------------------------------------

Here is some feedback from our experience with the BloomFilter scanning algorithm.
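
For context, the principle of the algorithm: insert every *referenced* blob id into a Bloom filter, then scan all stored blobs (in batches, the listing batch size discussed below) and delete those the filter does not contain. A Bloom filter has no false negatives, so a blob reported absent is provably unreferenced; false positives (probability 0.02 here) only cause some garbage to survive until a later run. A minimal, self-contained sketch of this principle (illustrative only, not the actual James implementation; the class and method names are made up):

{code:java}
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy Bloom filter, enough to demonstrate the GC scan. No false negatives:
// every id that was put() will always report mightContain() == true.
class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(long expectedInsertions, double fpp) {
        // Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m/n) ln 2 hashes
        this.numBits = (int) Math.ceil(
            -expectedInsertions * Math.log(fpp) / (Math.log(2) * Math.log(2)));
        this.numHashes = Math.max(1,
            (int) Math.round((double) numBits / expectedInsertions * Math.log(2)));
        this.bits = new BitSet(numBits);
    }

    private int hash(String value, int seed) {
        return Math.floorMod(value.hashCode() * 31 + seed * 0x9E3779B9, numBits);
    }

    void put(String value) {
        for (int i = 0; i < numHashes; i++) bits.set(hash(value, i));
    }

    boolean mightContain(String value) {
        for (int i = 0; i < numHashes; i++) if (!bits.get(hash(value, i))) return false;
        return true;
    }
}

public class BlobGcSketch {
    public static void main(String[] args) {
        List<String> referenced = List.of("blob-1", "blob-2", "blob-3");
        List<String> allBlobs = List.of("blob-1", "blob-2", "blob-3", "blob-4", "blob-5");

        SimpleBloomFilter filter = new SimpleBloomFilter(referenced.size(), 0.02);
        referenced.forEach(filter::put);

        List<String> toDelete = new ArrayList<>();
        for (String blobId : allBlobs) {   // in James this listing is paged (the batch size)
            if (!filter.mightContain(blobId)) {
                toDelete.add(blobId);      // provably unreferenced, safe to delete
            }
        }
        System.out.println("GC candidates: " + toDelete);
    }
}
{code}

(The production code uses a real hash family rather than the ad-hoc one above; the point is only the keep/delete decision.)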

Run 1: many deletions, with a listing batch size of 1,000 blobs at a time.

25 hours for 67M blobs, 35M deletions.

{code:java}
{
  "additionalInformation": {
    "type": "BlobGCTask",
    "timestamp": "2021-09-08T08:14:34.883206Z",
    "referenceSourceCount": 49035372,
    "blobCount": 67546056,
    "gcedBlobCount": 34933325,
    "errorCount": 0,
    "bloomFilterExpectedBlobCount": 100000000,
    "bloomFilterAssociatedProbability": 0.02
  },
  "status": "completed",
  "taskId": "4b981c40-0d1f-4c9a-9bf9-d0aae5779647",
  "startedDate": "2021-09-07T07:50:05.648+0000",
  "completedDate": "2021-09-08T08:14:35.038+0000",
  "executedOn": "james-jmap-988b8f869-cwxwn",
  "submittedFrom": "james-jmap-988b8f869-cwxwn",
  "cancelledFrom": null,
  "submitDate": "2021-09-07T07:50:05.576+0000",
  "type": "BlobGCTask"
}
{code}

Run 2: few deletions, with a listing batch size of 10,000 blobs at a time.

2 hours for 32 million blobs, 3,251 deletions.

{code:java}
{
  "additionalInformation": {
    "type": "BlobGCTask",
    "timestamp": "2021-09-08T12:46:59.008272Z",
    "referenceSourceCount": 49035372,
    "blobCount": 32612766,
    "gcedBlobCount": 3251,
    "errorCount": 0,
    "bloomFilterExpectedBlobCount": 67546056,
    "bloomFilterAssociatedProbability": 0.02
  },
  "status": "completed",
  "type": "BlobGCTask",
  "taskId": "01dd426c-7c03-467e-a25f-5426b618773b",
  "startedDate": "2021-09-08T10:49:58.916+0000",
  "completedDate": "2021-09-08T12:46:59.055+0000",
  "executedOn": "james-jmap-84bb8c66c5-qsdpf",
  "submittedFrom": "james-imap-smtp-c4fdffbdd-vwffh",
  "cancelledFrom": null,
  "submitDate": "2021-09-08T10:49:58.762+0000"
}
{code}
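
As a side note, the filter's memory footprint for these parameters can be estimated with the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m/n) ln 2 hash functions. The helper below is hypothetical (not James code), just the arithmetic:

{code:java}
public class BloomSizing {
    // m = -n * ln(p) / (ln 2)^2 : bits needed for n entries at false-positive rate p
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // k = (m/n) * ln 2 : number of hash functions
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        // Run 1 parameters: bloomFilterExpectedBlobCount = 100,000,000, p = 0.02
        long bits = optimalBits(100_000_000L, 0.02);
        // roughly 8.1e8 bits, i.e. about 100 MB of filter memory, with ~6 hash functions
        System.out.printf("%d bits (%.0f MB), %d hashes%n",
            bits, bits / 8.0 / 1_000_000, optimalHashes(100_000_000L, bits));
    }
}
{code}

So at this scale the filter itself is cheap; the run time is dominated by listing and deleting blobs, which is what the batch-size experiments below probe.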

We will launch a *third run* tomorrow with a 1,000-blob listing batch size, 
expecting no deletions. This will let us discriminate which factor made the 
first run slow: the small page size or the deletions.

We could also plan a *fourth run* to explore whether further increasing the 
blob listing batch size improves performance.



> Implement Garbage Collection for blobs
> --------------------------------------
>
>                 Key: JAMES-3150
>                 URL: https://issues.apache.org/jira/browse/JAMES-3150
>             Project: James Server
>          Issue Type: Improvement
>          Components: Blob
>    Affects Versions: 3.3.0
>            Reporter: Gautier DI FOLCO
>            Priority: Major
>          Time Spent: 8.5h
>  Remaining Estimate: 0h
>
> With the blob store deduplication, dropping a blob in a distributed 
> environment is impossible if we want to keep an acceptable concurrency level.
> A Garbage Collector should be created in order to drop old blobs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
