[jira] [Commented] (OAK-6254) DataStore: API to retrieve approximate storage size

2018-01-24 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-6254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337772#comment-16337772
 ] 

Thomas Mueller commented on OAK-6254:
-

> It may be better to store the unstructured data in the NodeStore itself under a 
> specific node.

For a shared datastore, it would be better to use the "metadata space". That 
way, the info could be retrieved from everywhere (if access rights are in 
place), and even if the repository is stopped.

The data is only updated once per GC sweep cycle (so usually once a week).

Logging the data (at info level) would be nice, as would making it available over 
JMX. That way it can also be stored in an RRD (round-robin database).
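
A minimal sketch of how the value could be exposed over JMX so that an external 
collector can poll it into an RRD; the bean name, attributes, and registration 
below are illustrative assumptions, not an existing Oak interface:

{code:java}
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical MBean: the name and attributes are assumptions, not part of Oak.
public interface BlobStoreSizeMBean {
    long getEstimatedStorageSize();   // estimated on-disk size in bytes
    long getLastUpdatedMillis();      // when the last GC sweep refreshed the value
}

class BlobStoreSize implements BlobStoreSizeMBean {
    private volatile long estimatedSize;
    private volatile long lastUpdated;

    // called once per GC sweep cycle with the freshly computed estimate
    void update(long sizeBytes) {
        estimatedSize = sizeBytes;
        lastUpdated = System.currentTimeMillis();
    }

    @Override
    public long getEstimatedStorageSize() { return estimatedSize; }

    @Override
    public long getLastUpdatedMillis() { return lastUpdated; }

    // register on the platform MBean server so JMX clients (and RRD collectors) can read it
    static BlobStoreSize register() throws Exception {
        BlobStoreSize bean = new BlobStoreSize();
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(bean, new ObjectName("org.apache.jackrabbit.oak:type=BlobStoreSize"));
        return bean;
    }
}
{code}

The GC run would call {{update()}} with the freshly computed size, and the 
monitoring side only ever reads the cached attribute, so polling stays cheap.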

> DataStore: API to retrieve approximate storage size
> ---
>
> Key: OAK-6254
> URL: https://issues.apache.org/jira/browse/OAK-6254
> Project: Jackrabbit Oak
>  Issue Type: Bug
>  Components: blob
>Reporter: Thomas Mueller
>Priority: Major
> Fix For: 1.10
>
>
> The estimated size of the datastore (on disk) is needed to:
> * monitor growth over time, or the growth caused by certain operations
> * monitor whether garbage collection is effective
> * avoid running out of disk space
> * estimate backup size
> * serve statistical purposes (for example, if there are many repositories, to group 
> them by size)
> Datastore size: we could use the following heuristic: read the file sizes in 
> ./datastore/00/00 (if it exists) and multiply by 65536, or in ./datastore/00 and 
> multiply by 256. That would give a rough estimate (within about 20% for 
> repositories with a datastore size > 50 GB).
> I think this is mainly important for the FileDataStore. For the S3 datastore, if 
> there is a simple and fast S3 API to read the size, that would be good as well; 
> if there is none, returning "unknown" is fine for me.
> As for the API, I would use something like this: {{long 
> getEstimatedStorageSize(int accuracyLevel)}}, with accuracyLevel 1 for 
> inaccurate (fastest), 2 for more accurate (slower), ..., 9 for precise (possibly 
> very slow). Similar to 
> [java.util.zip.Deflater.setLevel|https://docs.oracle.com/javase/7/docs/api/java/util/zip/Deflater.html#setLevel(int)].
> I would expect it to take up to 1 second for accuracyLevel 1, up to 5 seconds 
> for accuracyLevel 2, and possibly hours for level 9. With level 1 I would read 
> the files in 00/00, with levels 2-8 the files in 00, and with level 9 all the 
> files. For level 1 I wouldn't stop early; for level 2, if it takes more than 5 
> seconds, I would stop and return the current best estimate (a rough sketch 
> follows below).
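
A rough sketch of the sampling heuristic and the proposed 
{{getEstimatedStorageSize(int accuracyLevel)}} signature described above, assuming 
the usual two-level FileDataStore directory layout; the class name and the 
5-second budget are illustrative, not an existing Oak implementation:

{code:java}
import java.io.File;

// Illustrative sketch: the class name, time budget, and directory layout
// assumptions (two levels of hex-named folders) are mine, not Oak's API.
public class FileDataStoreSizeEstimator {

    private static final long TIME_BUDGET_MILLIS = 5_000;

    private final File root; // the "datastore" directory of a FileDataStore

    public FileDataStoreSizeEstimator(File root) {
        this.root = root;
    }

    public long getEstimatedStorageSize(int accuracyLevel) {
        if (accuracyLevel <= 1) {
            // level 1: sample ./00/00 and scale up; never stops early
            return sizeOf(new File(new File(root, "00"), "00"), Long.MAX_VALUE) * 65536;
        } else if (accuracyLevel <= 8) {
            // levels 2-8: sample ./00, but give up after ~5 seconds
            long deadline = System.currentTimeMillis() + TIME_BUDGET_MILLIS;
            return sizeOf(new File(root, "00"), deadline) * 256;
        } else {
            // level 9: walk the whole datastore (possibly very slow, but precise)
            return sizeOf(root, Long.MAX_VALUE);
        }
    }

    private static long sizeOf(File dir, long deadline) {
        File[] children = dir.listFiles();
        if (children == null) {
            return 0; // directory does not exist, e.g. an empty or small datastore
        }
        long total = 0;
        for (File f : children) {
            if (System.currentTimeMillis() > deadline) {
                break; // time budget exceeded: return the current best estimate
            }
            total += f.isDirectory() ? sizeOf(f, deadline) : f.length();
        }
        return total;
    }
}
{code}

Note that when a level 2-8 scan is cut off by the time budget, scaling the 
partial sum by 256 under-estimates; a refinement could scale by the fraction of 
entries actually visited.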





[jira] [Commented] (OAK-6254) DataStore: API to retrieve approximate storage size

2018-01-23 Thread Amit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-6254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335544#comment-16335544
 ] 

Amit Jain commented on OAK-6254:


[~tmueller]
bq. Much faster would be to calculate the size during garbage collection.
Yes, that is certainly feasible, and the value could be stored in the metadata space 
of the datastore, which is already used for other things.
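
A rough sketch of what writing and reading such a value in the datastore's 
metadata space could look like, assuming the SharedDataStore metadata record API 
({{addMetadataRecord}}/{{getMetadataRecord}}); the record name "repository.size" 
and the null/-1 conventions are assumptions for illustration:

{code:java}
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.IOUtils;
import org.apache.jackrabbit.core.data.DataRecord;
import org.apache.jackrabbit.core.data.DataStoreException;
import org.apache.jackrabbit.oak.plugins.blob.SharedDataStore;

// Sketch only: the record name "repository.size" is an arbitrary choice, and
// wiring this into the GC sweep is not an existing Oak feature.
class DataStoreSizeMetadata {

    private static final String RECORD_NAME = "repository.size";

    // called at the end of a GC sweep with the size computed while scanning blobs
    static void store(SharedDataStore ds, long sizeBytes) throws DataStoreException {
        byte[] data = Long.toString(sizeBytes).getBytes(StandardCharsets.UTF_8);
        ds.deleteMetadataRecord(RECORD_NAME);            // replace any previous value
        ds.addMetadataRecord(new ByteArrayInputStream(data), RECORD_NAME);
    }

    // readable from every repository sharing the datastore, even if others are stopped
    static long read(SharedDataStore ds) throws Exception {
        DataRecord record = ds.getMetadataRecord(RECORD_NAME);  // assumed to return null if absent
        if (record == null) {
            return -1; // unknown: no GC run has stored an estimate yet
        }
        return Long.parseLong(IOUtils.toString(record.getStream(), StandardCharsets.UTF_8).trim());
    }
}
{code}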



[jira] [Commented] (OAK-6254) DataStore: API to retrieve approximate storage size

2018-01-23 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-6254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335543#comment-16335543
 ] 

Chetan Mehrotra commented on OAK-6254:
--

bq. The size could be stored in a file, and updated whenever datastore GC is 
run.

It may be better to store the unstructured data in the NodeStore itself under a 
specific node.



[jira] [Commented] (OAK-6254) DataStore: API to retrieve approximate storage size

2018-01-23 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-6254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335540#comment-16335540
 ] 

Thomas Mueller commented on OAK-6254:
-

Much faster would be to calculate the size during garbage collection. [~amjain], 
do you think it would be feasible? The size could be stored in a file, and 
updated whenever datastore GC is run.

We could also store a histogram over file sizes, as follows (for each bucket: 
total number of files and total file size): up to 2 KB, up to 16 KB, up to 
128 KB, up to 1 MB, up to 8 MB, ...
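
A small sketch of that histogram, populated while the GC sweep is scanning the 
datastore anyway; the bucket boundaries simply continue the "times 8" pattern 
above, and the class name is illustrative, not an existing Oak class:

{code:java}
// Illustrative sketch of the histogram idea; nothing here is an existing Oak class.
public class BlobSizeHistogram {

    // upper bounds in bytes: 2 KB, 16 KB, 128 KB, 1 MB, 8 MB, 64 MB, 512 MB, 4 GB, everything else
    private static final long[] BOUNDS = {
            2L << 10, 16L << 10, 128L << 10, 1L << 20, 8L << 20,
            64L << 20, 512L << 20, 4L << 30, Long.MAX_VALUE
    };

    private final long[] count = new long[BOUNDS.length];
    private final long[] totalSize = new long[BOUNDS.length];

    // called once per blob while the GC sweep is scanning the datastore anyway
    public void add(long sizeBytes) {
        int bucket = 0;
        while (sizeBytes > BOUNDS[bucket]) {
            bucket++;
        }
        count[bucket]++;
        totalSize[bucket] += sizeBytes;
    }

    // one line per bucket: number of files and their combined size
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < BOUNDS.length; i++) {
            sb.append("up to ").append(BOUNDS[i]).append(" bytes: ")
              .append(count[i]).append(" files, ")
              .append(totalSize[i]).append(" bytes total").append('\n');
        }
        return sb.toString();
    }
}
{code}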
