[jira] [Comment Edited] (OAK-10674) DocumentStore: verify that we could use Oak's Bloom filter

2024-03-06 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823917#comment-17823917
 ] 

Thomas Mueller edited comment on OAK-10674 at 3/6/24 8:58 AM:
--

I can add the method "expectedFpp()" to our code as well 
(we already have getEstimatedEntryCount), with documentation that this is 
O ( n ). The implementation is pretty simple: see the Guava implementation here:

https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/BloomFilter.java#L190C17-L190C30

Actually I would suggest this method:

{noformat}
/**
 * Get the expected false positive rate for the current entries in the filter.
 * This will first calculate the estimated entry count, and then calculate the
 * false positive probability from there.
...
 */
public double expectedFpp() {
    return calculateFpp(getEstimatedEntryCount(), getBitCount(), getK());
}
{noformat}
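
For illustration, here is a minimal, self-contained sketch of how such a method could work. It is not Oak's code: the class name FppSketch, the long[] bit layout, and the helper estimateEntryCount are assumptions made for this example, and Oak's actual calculateFpp and getEstimatedEntryCount may differ. It uses the textbook formulas: with m bits, k hash functions and X bits set, the entry count is estimated as n = -(m/k) * ln(1 - X/m), and the false positive probability is (1 - e^(-k*n/m))^k. Counting the set bits visits every word of the filter, so the cost is linear in the filter size.

```java
// Sketch only: names and layout are hypothetical, not Oak's BloomFilter.
final class FppSketch {

    /** Textbook false positive probability for n entries, m bits, k hash functions. */
    static double calculateFpp(long n, long bits, int k) {
        return Math.pow(1 - Math.exp(-k * (double) n / bits), k);
    }

    /**
     * Estimate the number of entries from the number of set bits.
     * Visits every word of the filter, so it is linear in the filter size.
     */
    static long estimateEntryCount(long[] data, int k) {
        long bits = 64L * data.length;
        long set = 0;
        for (long word : data) {
            set += Long.bitCount(word);
        }
        if (set == bits) {
            // Saturated filter: the estimate is unbounded.
            return Long.MAX_VALUE;
        }
        return Math.round(-(double) bits / k * Math.log1p(-(double) set / bits));
    }

    /** Expected false positive rate for the current content of the filter. */
    static double expectedFpp(long[] data, int k) {
        return calculateFpp(estimateEntryCount(data, k), 64L * data.length, k);
    }

    public static void main(String[] args) {
        long[] data = new long[16];
        for (int i = 0; i < data.length; i++) {
            data[i] = 0xFFFFFFFFL; // half of the bits set
        }
        System.out.println("estimated entries: " + estimateEntryCount(data, 4));
        System.out.println("expected fpp: " + expectedFpp(data, 4));
    }
}
```

Guava's expectedFpp takes a shortcut and raises the fraction of set bits to the power k directly; the two-step version above follows the structure suggested in this comment.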


was (Author: tmueller):
I can add the methods "expectedFpp()" and "approximateElementCount()" in our 
code as well, with documentation that this is O ( n ). The implementation is 
pretty simple: see the Guava implementation here:

https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/BloomFilter.java#L190C17-L190C30

> DocumentStore: verify that we could use Oak's Bloom filter
> --
>
> Key: OAK-10674
> URL: https://issues.apache.org/jira/browse/OAK-10674
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: documentmk
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
>
> Test that we can use 
> oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/analysis/utils/BloomFilter.java
>  (for now, by copying it over).
> Then decide about where to move it, and whether API changes are desired.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (OAK-10674) DocumentStore: verify that we could use Oak's Bloom filter

2024-03-06 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823917#comment-17823917
 ] 

Thomas Mueller edited comment on OAK-10674 at 3/6/24 8:52 AM:
--

I can add the methods "expectedFpp()" and "approximateElementCount()" in our 
code as well, with documentation that this is O ( n ). The implementation is 
pretty simple: see the Guava implementation here:

https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/BloomFilter.java#L190C17-L190C30


was (Author: tmueller):
I can add the methods "expectedFpp()" and "approximateElementCount()" in our 
code as well, with documentation that this is O(n). The implementation is 
pretty simple: see the Guava implementation here:

https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/BloomFilter.java#L190C17-L190C30






[jira] [Comment Edited] (OAK-10674) DocumentStore: verify that we could use Oak's Bloom filter

2024-03-01 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822576#comment-17822576
 ] 

Thomas Mueller edited comment on OAK-10674 at 3/1/24 1:44 PM:
--

[~reschke] best would be to move over 
org.apache.jackrabbit.oak.index.indexer.document.flatfile.analysis.utils.Hash 
as well. And then we can add convenience methods:

{noformat}
/**
 * Add an entry. This internally uses the hashCode() method to derive a
 * high-quality hash code.
 *
 * @param obj the object (must not be null)
 */
public void add(@NotNull Object obj) {
    add(Hash.hash64(obj.hashCode()));
}

/**
 * Whether the entry may be in the set. This internally uses the hashCode()
 * method to derive a high-quality hash code.
 * 
 * @param obj the object (must not be null)
 * @return true if the entry was added, or, with a certain false positive
 * probability, even if it was not added
 */
public boolean mayContain(@NotNull Object obj) {
    return mayContain(Hash.hash64(obj.hashCode()));
}
{noformat}
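
To make the idea concrete, here is a small self-contained sketch of such convenience methods on top of a hash-based filter. It is not Oak's implementation: the class BloomSketch, its bit layout, and the double-hashing probe sequence are assumptions for this example, and hash64 below is a Murmur3-style 64-bit finalizer used as a stand-in for Hash.hash64. The point it illustrates is that add(Object) and mayContain(Object) must apply the same mixing function to obj.hashCode(); otherwise the lookup probes different bits than the insert and the filter reports false negatives.

```java
// Sketch only: a stand-in for the real BloomFilter and Hash classes.
final class BloomSketch {
    private final long[] data;
    private final long bits;
    private final int k;

    BloomSketch(int words, int k) {
        this.data = new long[words];
        this.bits = 64L * words;
        this.k = k;
    }

    /** Murmur3-style 64-bit finalizer, assumed here in place of Hash.hash64. */
    static long hash64(long x) {
        x = (x ^ (x >>> 33)) * 0xff51afd7ed558ccdL;
        x = (x ^ (x >>> 33)) * 0xc4ceb9fe1a85ec53L;
        return x ^ (x >>> 33);
    }

    /** Add an entry given a high-quality 64-bit hash. */
    void add(long hash) {
        long a = hash;
        long b = (hash >>> 32) | (hash << 32);
        for (int i = 0; i < k; i++) {
            long bit = Long.remainderUnsigned(a, bits);
            data[(int) (bit >>> 6)] |= 1L << (int) (bit & 63);
            a += b; // double hashing: next probe position
        }
    }

    /** Whether an entry with this hash may have been added. */
    boolean mayContain(long hash) {
        long a = hash;
        long b = (hash >>> 32) | (hash << 32);
        for (int i = 0; i < k; i++) {
            long bit = Long.remainderUnsigned(a, bits);
            if ((data[(int) (bit >>> 6)] & (1L << (int) (bit & 63))) == 0) {
                return false;
            }
            a += b;
        }
        return true;
    }

    /** Convenience: derive the hash from hashCode(). */
    void add(Object obj) {
        add(hash64(obj.hashCode()));
    }

    /** Convenience: must use the same derivation as add(Object). */
    boolean mayContain(Object obj) {
        return mayContain(hash64(obj.hashCode()));
    }

    public static void main(String[] args) {
        BloomSketch filter = new BloomSketch(16, 4);
        filter.add("/content/a");
        System.out.println(filter.mayContain("/content/a"));
    }
}
```

Note that hashCode() yields only 32 bits, so two distinct objects with the same hashCode() are indistinguishable to the filter regardless of how well the value is mixed afterwards.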




was (Author: tmueller):
[~reschke] best would be to move over 
org.apache.jackrabbit.oak.index.indexer.document.flatfile.analysis.utils.Hash 
as well. And then we can add convenience methods:

{noformat}
/**
 * Add an entry. This internally uses the hashCode() method to derive a
 * high-quality hash code.
 *
 * @param obj the object (must not be null)
 */
public void add(@NotNull Object obj) {
    add(Hash.hash64(obj.hashCode()));
}

/**
 * Whether the entry may be in the set.
 * 
 * @param hash the hash value (need to be a high quality hash code, with all
 * bits having high entropy)
 * @return true if the entry was added, or, with a certain false positive
 * probability, even if it was not added
 */
public boolean mayContain(@NotNull Object obj) {
    return mayContain(Hash.hash64(obj.hashCode()));
}
{noformat}



> DocumentStore: verify that we could use Oak's Bloom filter
> --
>
> Key: OAK-10674
> URL: https://issues.apache.org/jira/browse/OAK-10674
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: documentmk
>Reporter: Julian Reschke
>Priority: Major
>
> Test that we can use 
> oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/analysis/utils/BloomFilter.java
>  (for now, by copying it over).
> Then decide about where to move it, and whether API changes are desired.





[jira] [Comment Edited] (OAK-10674) DocumentStore: verify that we could use Oak's Bloom filter

2024-03-01 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822576#comment-17822576
 ] 

Thomas Mueller edited comment on OAK-10674 at 3/1/24 1:44 PM:
--

[~reschke] best would be to move over 
org.apache.jackrabbit.oak.index.indexer.document.flatfile.analysis.utils.Hash 
as well. And then we can add convenience methods:

{noformat}
/**
 * Add an entry. This internally uses the hashCode() method to derive a
 * high-quality hash code.
 *
 * @param obj the object (must not be null)
 */
public void add(@NotNull Object obj) {
    add(Hash.hash64(obj.hashCode()));
}

/**
 * Whether the entry may be in the set. This internally uses the hashCode()
 * method to derive a high-quality hash code.
 * 
 * @param obj the object (must not be null)
 * @return true if the entry was added, or, with a certain false positive
 * probability, even if it was not added
 */
public boolean mayContain(@NotNull Object obj) {
    return mayContain(Hash.hash64(obj.hashCode()));
}
{noformat}

I can work on this, no issue. We also need to move over some tests.




was (Author: tmueller):
[~reschke] best would be to move over 
org.apache.jackrabbit.oak.index.indexer.document.flatfile.analysis.utils.Hash 
as well. And then we can add convenience methods:

{noformat}
/**
 * Add an entry. This internally uses the hashCode() method to derive a
 * high-quality hash code.
 *
 * @param obj the object (must not be null)
 */
public void add(@NotNull Object obj) {
    add(Hash.hash64(obj.hashCode()));
}

/**
 * Whether the entry may be in the set. This internally uses the hashCode()
 * method to derive a high-quality hash code.
 * 
 * @param obj the object (must not be null)
 * @return true if the entry was added, or, with a certain false positive
 * probability, even if it was not added
 */
public boolean mayContain(@NotNull Object obj) {
    return mayContain(Hash.hash64(obj.hashCode()));
}
{noformat}








[jira] [Comment Edited] (OAK-10674) DocumentStore: verify that we could use Oak's Bloom filter

2024-03-01 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822576#comment-17822576
 ] 

Thomas Mueller edited comment on OAK-10674 at 3/1/24 1:43 PM:
--

[~reschke] best would be to move over 
org.apache.jackrabbit.oak.index.indexer.document.flatfile.analysis.utils.Hash 
as well. And then we can add convenience methods:

{noformat}
/**
 * Add an entry. This internally uses the hashCode() method to derive a
 * high-quality hash code.
 *
 * @param obj the object (must not be null)
 */
public void add(@NotNull Object obj) {
    add(Hash.hash64(obj.hashCode()));
}

/**
 * Whether the entry may be in the set.
 * 
 * @param hash the hash value (need to be a high quality hash code, with all
 * bits having high entropy)
 * @return true if the entry was added, or, with a certain false positive
 * probability, even if it was not added
 */
public boolean mayContain(@NotNull Object obj) {
    return mayContain(Hash.hash64(obj.hashCode()));
}
{noformat}




was (Author: tmueller):
[~reschke] best would be to move over 
org.apache.jackrabbit.oak.index.indexer.document.flatfile.analysis.utils.Hash 
as well. And then we can add convenience methods:

{noformat}
/**
 * Add an entry. This internally uses the hashCode() method to derive a
 * high-quality hash code.
 *
 * @param obj the object (must not be null)
 */
public void add(@NotNull Object obj) {
    add(Hash.hash64(obj.hashCode()));
}

/**
 * Whether the entry may be in the set.
 * 
 * @param hash the hash value (need to be a high quality hash code, with all
 * bits having high entropy)
 * @return true if the entry was added, or, with a certain false positive
 * probability, even if it was not added
 */
public boolean mayContain(@NotNull Object obj) {
    return mayContain(obj.hashCode());
}
{noformat}





