[jira] [Updated] (OAK-10713) oak-lucene: add test coverage for stack overflow based on very long and complex regexp

2024-03-28 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10713:
-
Summary: oak-lucene: add test coverage for stack overflow based on very 
long and complex regexp  (was: oak-lucene: add test coverage for stack overflow 
based on complex regexp)

> oak-lucene: add test coverage for stack overflow based on very long and 
> complex regexp
> --
>
> Key: OAK-10713
> URL: https://issues.apache.org/jira/browse/OAK-10713
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: lucene
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
>  Labels: candidate_oak_1_22
> Fix For: 1.62.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (OAK-10713) oak-lucene: add test coverage for stack overflow based on complex regexp

2024-03-28 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10713:
-
Summary: oak-lucene: add test coverage for stack overflow based on complex 
regexp  (was: oak-lucene: add test coverage for potential DoS attack based on 
complex regexp)

> oak-lucene: add test coverage for stack overflow based on complex regexp
> 
>
> Key: OAK-10713
> URL: https://issues.apache.org/jira/browse/OAK-10713
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: lucene
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
>  Labels: candidate_oak_1_22
> Fix For: 1.62.0
>
>






[jira] [Updated] (OAK-10719) oak-lucene uses Lucene version that can throw a StackOverflowException

2024-03-28 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10719:
-
Description: 
See .

Analysis so far:

- oak-lucene uses lucene-core (4.7.2) (see OAK-10716); that version has reached 
EOL a long time ago
- the lucene version can in some cases throw a StackOverflowException, see 
OAK-10713
- oak-lucene *embeds* and *exports* lucene-core
- update to version >= 4.8 non-trivial due to backwards compat breakage

Work in :

- inlined lucene-core as of git tag "releases/lucene-solr/4.7.2" into oak-lucene
- fixed two JDK11 compile issues (potentially uninitialized vars in finally 
block) 
- backported fix from https://github.com/apache/lucene/issues/11537
- enable test added in OAK-10713
- ran Oak integration tests

Open questions:

- Lucene 4.7.2 builds with ant/ivy - does it make sense to try to replicate that
- should we ask Lucene team for a public release (might be hard sell)
- alternatively, as tried here, inline source code into oak-lucene (maybe add 
explainers to all source files)
- do we need to adopt the lucene test suite as well?
- lucene-core dependencies in other Oak modules to be checked (seems mostly for 
tests, or for run modules)
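
The failure mode discussed in OAK-10713 can be illustrated without Lucene. The following is a minimal sketch, assuming a parser that recurses once per '|' branch of an expression; the class and method names are made up for illustration, and none of this is Lucene code or the backported fix. It only shows why a very long input can exhaust the thread stack when the processing is recursive, and why an iterative version of the same logic does not. Note that the JVM reports this condition as java.lang.StackOverflowError.

```java
public final class DeepRecursionSketch {

    /** Builds "a|a|...|a" with the given number of branches. */
    static String alternation(int branches) {
        StringBuilder sb = new StringBuilder(2 * branches);
        for (int i = 0; i < branches; i++) {
            if (i > 0) {
                sb.append('|');
            }
            sb.append('a');
        }
        return sb.toString();
    }

    /** Counts the '|'-separated branches recursively: one stack frame per branch. */
    static int countBranchesRecursive(String expr, int from) {
        int bar = expr.indexOf('|', from);
        return bar < 0 ? 1 : 1 + countBranchesRecursive(expr, bar + 1);
    }

    /** Same result with a loop: stack usage does not depend on the input length. */
    static int countBranchesIterative(String expr) {
        int count = 1;
        for (int i = expr.indexOf('|'); i >= 0; i = expr.indexOf('|', i + 1)) {
            count++;
        }
        return count;
    }

    /** Returns true if the recursive variant runs out of stack for this input size. */
    static boolean overflowsRecursively(int branches) {
        try {
            countBranchesRecursive(alternation(branches), 0);
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(countBranchesIterative(alternation(2_000_000)));
        System.out.println(overflowsRecursively(2_000_000));
    }
}
```

The input size at which the recursive variant fails depends on the JVM and its stack size setting, so a test for this kind of problem needs an input that is long enough for the default configuration.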





  was:
See .

Analysis so far:

- oak-lucene uses lucene-core (4.7.2) (see OAK-10716); that version has reached 
EOL a long time ago
- the version is vulnerable to a DoS attack (regexp stack overflow), see 
OAK-10713
- oak-lucene *embeds* and *exports* lucene-core
- update to version >= 4.8 non-trivial due to backwards compat breakage

Work in :

- inlined lucene-core as of git tag "releases/lucene-solr/4.7.2" into oak-lucene
- fixed two JDK11 compile issues (potentially uninitialized vars in finally 
block) 
- backported fix from https://github.com/apache/lucene/issues/11537
- enable test added in OAK-10713
- ran Oak integration tests

Open questions:

- Lucene 4.7.2 builds with ant/ivy - does it make sense to try to replicate that
- should we ask Lucene team for a public release (might be hard sell)
- alternatively, as tried here, inline source code into oak-lucene (maybe add 
explainers to all source files)
- do we need to adopt the lucene test suite as well?
- lucene-core dependencies in other Oak modules to be checked (seems mostly for 
tests, or for run modules)






> oak-lucene uses Lucene version that can throw a StackOverflowException
> --
>
> Key: OAK-10719
> URL: https://issues.apache.org/jira/browse/OAK-10719
> Project: Jackrabbit Oak
>  Issue Type: Bug
>  Components: lucene
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
>
> See .
> Analysis so far:
> - oak-lucene uses lucene-core (4.7.2) (see OAK-10716); that version has 
> reached EOL a long time ago
> - the lucene version can in some cases throw a StackOverflowException, see 
> OAK-10713
> - oak-lucene *embeds* and *exports* lucene-core
> - update to version >= 4.8 non-trivial due to backwards compat breakage
> Work in :
> - inlined lucene-core as of git tag "releases/lucene-solr/4.7.2" into 
> oak-lucene
> - fixed two JDK11 compile issues (potentially uninitialized vars in finally 
> block) 
> - backported fix from https://github.com/apache/lucene/issues/11537
> - enable test added in OAK-10713
> - ran Oak integration tests
> Open questions:
> - Lucene 4.7.2 builds with ant/ivy - does it make sense to try to replicate 
> that
> - should we ask Lucene team for a public release (might be hard sell)
> - alternatively, as tried here, inline source code into oak-lucene (maybe add 
> explainers to all source files)
> - do we need to adopt the lucene test suite as well?
> - lucene-core dependencies in other Oak modules to be checked (seems mostly 
> for tests, or for run modules)





[jira] [Updated] (OAK-10719) oak-lucene uses lucene version that can throw a StackOverflowException

2024-03-28 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10719:
-
Summary: oak-lucene uses lucene version that can throw a 
StackOverflowException  (was: oak-lucene uses lucene version vulnerable to DoS 
attack)

> oak-lucene uses lucene version that can throw a StackOverflowException
> --
>
> Key: OAK-10719
> URL: https://issues.apache.org/jira/browse/OAK-10719
> Project: Jackrabbit Oak
>  Issue Type: Bug
>  Components: lucene
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
>
> See .
> Analysis so far:
> - oak-lucene uses lucene-core (4.7.2) (see OAK-10716); that version has 
> reached EOL a long time ago
> - the version is vulnerable to a DoS attack (regexp stack overflow), see 
> OAK-10713
> - oak-lucene *embeds* and *exports* lucene-core
> - update to version >= 4.8 non-trivial due to backwards compat breakage
> Work in :
> - inlined lucene-core as of git tag "releases/lucene-solr/4.7.2" into 
> oak-lucene
> - fixed two JDK11 compile issues (potentially uninitialized vars in finally 
> block) 
> - backported fix from https://github.com/apache/lucene/issues/11537
> - enable test added in OAK-10713
> - ran Oak integration tests
> Open questions:
> - Lucene 4.7.2 builds with ant/ivy - does it make sense to try to replicate 
> that
> - should we ask Lucene team for a public release (might be hard sell)
> - alternatively, as tried here, inline source code into oak-lucene (maybe add 
> explainers to all source files)
> - do we need to adopt the lucene test suite as well?
> - lucene-core dependencies in other Oak modules to be checked (seems mostly 
> for tests, or for run modules)





[jira] [Updated] (OAK-10719) oak-lucene uses Lucene version that can throw a StackOverflowException

2024-03-28 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10719:
-
Summary: oak-lucene uses Lucene version that can throw a 
StackOverflowException  (was: oak-lucene uses lucene version that can throw a 
StackOverflowException)

> oak-lucene uses Lucene version that can throw a StackOverflowException
> --
>
> Key: OAK-10719
> URL: https://issues.apache.org/jira/browse/OAK-10719
> Project: Jackrabbit Oak
>  Issue Type: Bug
>  Components: lucene
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
>
> See .
> Analysis so far:
> - oak-lucene uses lucene-core (4.7.2) (see OAK-10716); that version has 
> reached EOL a long time ago
> - the version is vulnerable to a DoS attack (regexp stack overflow), see 
> OAK-10713
> - oak-lucene *embeds* and *exports* lucene-core
> - update to version >= 4.8 non-trivial due to backwards compat breakage
> Work in :
> - inlined lucene-core as of git tag "releases/lucene-solr/4.7.2" into 
> oak-lucene
> - fixed two JDK11 compile issues (potentially uninitialized vars in finally 
> block) 
> - backported fix from https://github.com/apache/lucene/issues/11537
> - enable test added in OAK-10713
> - ran Oak integration tests
> Open questions:
> - Lucene 4.7.2 builds with ant/ivy - does it make sense to try to replicate 
> that
> - should we ask Lucene team for a public release (might be hard sell)
> - alternatively, as tried here, inline source code into oak-lucene (maybe add 
> explainers to all source files)
> - do we need to adopt the lucene test suite as well?
> - lucene-core dependencies in other Oak modules to be checked (seems mostly 
> for tests, or for run modules)





[jira] [Commented] (OAK-10694) Clarify state of oak-search-mt

2024-03-08 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824653#comment-17824653
 ] 

Thomas Mueller commented on OAK-10694:
--

> So what do we do in 1.22? Remove as well?

Yes. I think that is simpler.

> Clarify state of oak-search-mt
> --
>
> Key: OAK-10694
> URL: https://issues.apache.org/jira/browse/OAK-10694
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: search-mt
>Reporter: Manfred Baedke
>Priority: Major
>  Labels: candidate_oak_1_22
>
> oak-search-mt depends on an artifact from the retired Apache Incubator 
> project 
> [Joshua|https://cwiki.apache.org/confluence/display/JOSHUA/Apache+Joshua+Home],
>  which has a dependency to Guava 19.
> May it be deprecated/removed?





[jira] [Commented] (OAK-10694) Clarify state of oak-search-mt

2024-03-06 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824036#comment-17824036
 ] 

Thomas Mueller commented on OAK-10694:
--

[~reschke] I would remove it now already. I don't think anyone uses it, and it 
is not maintained. Deprecating it but keeping it will likely cause more issues 
than removing it.

If someone requires this, then it would be their task to maintain it, in my view.

> Clarify state of oak-search-mt
> --
>
> Key: OAK-10694
> URL: https://issues.apache.org/jira/browse/OAK-10694
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: search-mt
>Reporter: Manfred Baedke
>Priority: Major
>  Labels: candidate_oak_1_22
>
> oak-search-mt depends on an artifact from the retired Apache Incubator 
> project 
> [Joshua|https://cwiki.apache.org/confluence/display/JOSHUA/Apache+Joshua+Home],
>  which has a dependency to Guava 19.
> May it be deprecated/removed?





[jira] [Comment Edited] (OAK-10674) DocumentStore: verify that we could use Oak's Bloom filter

2024-03-06 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823917#comment-17823917
 ] 

Thomas Mueller edited comment on OAK-10674 at 3/6/24 8:58 AM:
--

I can add the method "expectedFpp()" in our code as well 
(getEstimatedEntryCount we already have), with documentation that this is O ( n 
). The implementation is pretty simple: see the Guava implementation here:

https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/BloomFilter.java#L190C17-L190C30

Actually I would suggest this method:

{noformat}
/**
 * Get the expected false positive rate for the current entries in the filter.
 * This will first calculate the estimated entry count, and then calculate
 * the false positive probability from there.
...
 */
public double expectedFpp() {
    return calculateFpp(getEstimatedEntryCount(), getBitCount(), getK());
}
{noformat}
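
For illustration, here is a minimal sketch of what a calculateFpp-style helper could compute, assuming the textbook approximation (1 - e^(-k*n/m))^k for a filter with m bits, k hash functions and n entries. The class name is hypothetical, and this is not necessarily how Oak's or Guava's implementation is written.

```java
public final class BloomFppSketch {

    /**
     * Textbook estimate of the false positive probability of a Bloom filter.
     *
     * @param n number of entries added (possibly an estimate)
     * @param m number of bits in the filter
     * @param k number of hash functions (bits set per entry)
     * @return the estimated probability, between 0 and 1
     */
    public static double calculateFpp(long n, long m, int k) {
        // expected fraction of bits that are set after n insertions
        double fractionOfBitsSet = 1.0 - Math.exp(-(double) k * n / m);
        // a lookup is a false positive if all k probed bits happen to be set
        return Math.pow(fractionOfBitsSet, k);
    }

    public static void main(String[] args) {
        // 1 million entries, 10 bits per entry, 7 hash functions: roughly 0.008
        System.out.println(calculateFpp(1_000_000, 10_000_000, 7));
    }
}
```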


was (Author: tmueller):
I can add the methods "expectedFpp()" and "approximateElementCount()" in our 
code as well, with documentation that this is O ( n ). The implementation is 
pretty simple: see the Guava implementation here:

https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/BloomFilter.java#L190C17-L190C30

> DocumentStore: verify that we could use Oak's Bloom filter
> --
>
> Key: OAK-10674
> URL: https://issues.apache.org/jira/browse/OAK-10674
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: documentmk
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
>
> Test that we can use 
> oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/analysis/utils/BloomFilter.java
>  (for now, by copying it over).
> Then decide about where to move it, and whether API changes are desired.





[jira] [Comment Edited] (OAK-10674) DocumentStore: verify that we could use Oak's Bloom filter

2024-03-06 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823917#comment-17823917
 ] 

Thomas Mueller edited comment on OAK-10674 at 3/6/24 8:52 AM:
--

I can add the methods "expectedFpp()" and "approximateElementCount()" in our 
code as well, with documentation that this is O ( n ). The implementation is 
pretty simple: see the Guava implementation here:

https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/BloomFilter.java#L190C17-L190C30


was (Author: tmueller):
I can add the methods "expectedFpp()" and "approximateElementCount()" in our 
code as well, with documentation that this is O(n). The implementation is 
pretty simple: see the Guava implementation here:

https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/BloomFilter.java#L190C17-L190C30

> DocumentStore: verify that we could use Oak's Bloom filter
> --
>
> Key: OAK-10674
> URL: https://issues.apache.org/jira/browse/OAK-10674
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: documentmk
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
>
> Test that we can use 
> oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/analysis/utils/BloomFilter.java
>  (for now, by copying it over).
> Then decide about where to move it, and whether API changes are desired.





[jira] [Commented] (OAK-10674) DocumentStore: verify that we could use Oak's Bloom filter

2024-03-06 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823917#comment-17823917
 ] 

Thomas Mueller commented on OAK-10674:
--

I can add the methods "expectedFpp()" and "approximateElementCount()" in our 
code as well, with documentation that this is O(n). The implementation is 
pretty simple: see the Guava implementation here:

https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/BloomFilter.java#L190C17-L190C30
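
As a rough sketch of why such an estimate is linear in the size of the filter: one common way to approximate the number of entries is to count the set bits x and compute n = -(m / k) * ln(1 - x / m), which requires a scan over the whole bit array. The class and method names below are hypothetical, and the long[] layout is an assumption made for this example only.

```java
public final class BloomCountSketch {

    /**
     * Estimates how many entries were added, from the number of set bits.
     *
     * @param bits the bit array of the filter (64 bits per array element)
     * @param k the number of hash functions (bits set per entry)
     * @return the estimated entry count (Long.MAX_VALUE if every bit is set)
     */
    public static long estimateEntryCount(long[] bits, int k) {
        long m = (long) bits.length * 64;
        long x = 0;
        for (long word : bits) {
            // this scan over all words is the linear part
            x += Long.bitCount(word);
        }
        // n = -(m / k) * ln(1 - x / m); log1p is more accurate for small x / m
        return Math.round(-Math.log1p(-(double) x / m) * m / k);
    }
}
```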

> DocumentStore: verify that we could use Oak's Bloom filter
> --
>
> Key: OAK-10674
> URL: https://issues.apache.org/jira/browse/OAK-10674
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: documentmk
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
>
> Test that we can use 
> oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/analysis/utils/BloomFilter.java
>  (for now, by copying it over).
> Then decide about where to move it, and whether API changes are desired.





[jira] [Comment Edited] (OAK-10674) DocumentStore: verify that we could use Oak's Bloom filter

2024-03-01 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822576#comment-17822576
 ] 

Thomas Mueller edited comment on OAK-10674 at 3/1/24 1:44 PM:
--

[~reschke] best would be to move over 
org.apache.jackrabbit.oak.index.indexer.document.flatfile.analysis.utils.Hash 
as well. And then we can add convenience methods:

{noformat}
/**
 * Add an entry. This internally uses the hashCode() method to derive a
 * high-quality hash code.
 *
 * @param obj the object (must not be null)
 */
public void add(@NotNull Object obj) {
    add(Hash.hash64(obj.hashCode()));
}

/**
 * Whether the entry may be in the set. This internally uses the hashCode()
 * method to derive a high-quality hash code.
 *
 * @param obj the object (must not be null)
 * @return true if the entry was added, or, with a certain false positive
 *         probability, even if it was not added
 */
public boolean mayContain(@NotNull Object obj) {
    return mayContain(Hash.hash64(obj.hashCode()));
}
{noformat}




was (Author: tmueller):
[~reschke] best would be to move over 
org.apache.jackrabbit.oak.index.indexer.document.flatfile.analysis.utils.Hash 
as well. And then we can add convenience methods:

{noformat}
/**
 * Add an entry. This internally uses the hashCode() method to derive a
 * high-quality hash code.
 *
 * @param obj the object (must not be null)
 */
public void add(@NotNull Object obj) {
    add(Hash.hash64(obj.hashCode()));
}

/**
 * Whether the entry may be in the set.
 *
 * @param hash the hash value (need to be a high quality hash code, with all
 *             bits having high entropy)
 * @return true if the entry was added, or, with a certain false positive
 *         probability, even if it was not added
 */
public boolean mayContain(@NotNull Object obj) {
    return mayContain(Hash.hash64(obj.hashCode()));
}
{noformat}



> DocumentStore: verify that we could use Oak's Bloom filter
> --
>
> Key: OAK-10674
> URL: https://issues.apache.org/jira/browse/OAK-10674
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: documentmk
>Reporter: Julian Reschke
>Priority: Major
>
> Test that we can use 
> oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/analysis/utils/BloomFilter.java
>  (for now, by copying it over).
> Then decide about where to move it, and whether API changes are desired.





[jira] [Comment Edited] (OAK-10674) DocumentStore: verify that we could use Oak's Bloom filter

2024-03-01 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822576#comment-17822576
 ] 

Thomas Mueller edited comment on OAK-10674 at 3/1/24 1:44 PM:
--

[~reschke] best would be to move over 
org.apache.jackrabbit.oak.index.indexer.document.flatfile.analysis.utils.Hash 
as well. And then we can add convenience methods:

{noformat}
/**
 * Add an entry. This internally uses the hashCode() method to derive a
 * high-quality hash code.
 *
 * @param obj the object (must not be null)
 */
public void add(@NotNull Object obj) {
    add(Hash.hash64(obj.hashCode()));
}

/**
 * Whether the entry may be in the set. This internally uses the hashCode()
 * method to derive a high-quality hash code.
 *
 * @param obj the object (must not be null)
 * @return true if the entry was added, or, with a certain false positive
 *         probability, even if it was not added
 */
public boolean mayContain(@NotNull Object obj) {
    return mayContain(Hash.hash64(obj.hashCode()));
}
{noformat}

I can work on this, no issue. We need to also move over some tests.
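
To show why the convenience methods run hashCode() through a mixing function first, here is a minimal, self-contained sketch. It is not Oak's BloomFilter: the class, the addHash / mayContainHash names, the bit layout, and the hash64 stand-in (a SplitMix64-style mixer) are all assumptions made for this example. The point is that hashCode() values often have little entropy (for an Integer it is the value itself), and that add and mayContain must derive the hash in exactly the same way.

```java
public final class MiniBloomFilter {

    private final long[] data;
    private final long bitCount;
    private final int k;

    public MiniBloomFilter(int bitCountHint, int k) {
        this.data = new long[Math.max(1, (bitCountHint + 63) / 64)];
        this.bitCount = (long) data.length * 64;
        this.k = k;
    }

    /** Stand-in for a 64-bit mixing function (SplitMix64-style), for illustration only. */
    static long hash64(long x) {
        x += 0x9e3779b97f4a7c15L;
        x = (x ^ (x >>> 30)) * 0xbf58476d1ce4e5b9L;
        x = (x ^ (x >>> 27)) * 0x94d049bb133111ebL;
        return x ^ (x >>> 31);
    }

    /** Adds a high-quality 64-bit hash by setting k bits. */
    public void addHash(long hash) {
        long step = (hash >>> 32) | 1;
        for (int i = 0; i < k; i++) {
            int index = (int) Long.remainderUnsigned(hash, bitCount);
            data[index >>> 6] |= 1L << index;
            hash += step;
        }
    }

    /** Checks the same k bits that addHash would set for this hash. */
    public boolean mayContainHash(long hash) {
        long step = (hash >>> 32) | 1;
        for (int i = 0; i < k; i++) {
            int index = (int) Long.remainderUnsigned(hash, bitCount);
            if ((data[index >>> 6] & (1L << index)) == 0) {
                return false;
            }
            hash += step;
        }
        return true;
    }

    /** Convenience method: mixes hashCode() before adding. */
    public void add(Object obj) {
        addHash(hash64(obj.hashCode()));
    }

    /** Convenience method: must use the same mixing as add(Object). */
    public boolean mayContain(Object obj) {
        return mayContainHash(hash64(obj.hashCode()));
    }
}
```

If mayContain(Object) skipped the mixing step while add(Object) used it, the two methods would probe different bits, and entries that were added could be reported as missing.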




was (Author: tmueller):
[~reschke] best would be to move over 
org.apache.jackrabbit.oak.index.indexer.document.flatfile.analysis.utils.Hash 
as well. And then we can add convenience methods:

{noformat}
/**
 * Add an entry. This internally uses the hashCode() method to derive a
 * high-quality hash code.
 *
 * @param obj the object (must not be null)
 */
public void add(@NotNull Object obj) {
    add(Hash.hash64(obj.hashCode()));
}

/**
 * Whether the entry may be in the set. This internally uses the hashCode()
 * method to derive a high-quality hash code.
 *
 * @param obj the object (must not be null)
 * @return true if the entry was added, or, with a certain false positive
 *         probability, even if it was not added
 */
public boolean mayContain(@NotNull Object obj) {
    return mayContain(Hash.hash64(obj.hashCode()));
}
{noformat}



> DocumentStore: verify that we could use Oak's Bloom filter
> --
>
> Key: OAK-10674
> URL: https://issues.apache.org/jira/browse/OAK-10674
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: documentmk
>Reporter: Julian Reschke
>Priority: Major
>
> Test that we can use 
> oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/analysis/utils/BloomFilter.java
>  (for now, by copying it over).
> Then decide about where to move it, and whether API changes are desired.





[jira] [Comment Edited] (OAK-10674) DocumentStore: verify that we could use Oak's Bloom filter

2024-03-01 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822576#comment-17822576
 ] 

Thomas Mueller edited comment on OAK-10674 at 3/1/24 1:43 PM:
--

[~reschke] best would be to move over 
org.apache.jackrabbit.oak.index.indexer.document.flatfile.analysis.utils.Hash 
as well. And then we can add convenience methods:

{noformat}
/**
 * Add an entry. This internally uses the hashCode() method to derive a
 * high-quality hash code.
 *
 * @param obj the object (must not be null)
 */
public void add(@NotNull Object obj) {
    add(Hash.hash64(obj.hashCode()));
}

/**
 * Whether the entry may be in the set.
 *
 * @param hash the hash value (need to be a high quality hash code, with all
 *             bits having high entropy)
 * @return true if the entry was added, or, with a certain false positive
 *         probability, even if it was not added
 */
public boolean mayContain(@NotNull Object obj) {
    return mayContain(Hash.hash64(obj.hashCode()));
}
{noformat}




was (Author: tmueller):
[~reschke] best would be to move over 
org.apache.jackrabbit.oak.index.indexer.document.flatfile.analysis.utils.Hash 
as well. And then we can add convenience methods:

{noformat}
/**
 * Add an entry. This internally uses the hashCode() method to derive a
 * high-quality hash code.
 *
 * @param obj the object (must not be null)
 */
public void add(@NotNull Object obj) {
    add(Hash.hash64(obj.hashCode()));
}

/**
 * Whether the entry may be in the set.
 *
 * @param hash the hash value (need to be a high quality hash code, with all
 *             bits having high entropy)
 * @return true if the entry was added, or, with a certain false positive
 *         probability, even if it was not added
 */
public boolean mayContain(@NotNull Object obj) {
    return mayContain(obj.hashCode());
}
{noformat}



> DocumentStore: verify that we could use Oak's Bloom filter
> --
>
> Key: OAK-10674
> URL: https://issues.apache.org/jira/browse/OAK-10674
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: documentmk
>Reporter: Julian Reschke
>Priority: Major
>
> Test that we can use 
> oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/analysis/utils/BloomFilter.java
>  (for now, by copying it over).
> Then decide about where to move it, and whether API changes are desired.





[jira] [Commented] (OAK-10674) DocumentStore: verify that we could use Oak's Bloom filter

2024-03-01 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822576#comment-17822576
 ] 

Thomas Mueller commented on OAK-10674:
--

[~reschke] best would be to move over 
org.apache.jackrabbit.oak.index.indexer.document.flatfile.analysis.utils.Hash 
as well. And then we can add convenience methods:

{noformat}
/**
 * Add an entry. This internally uses the hashCode() method to derive a
 * high-quality hash code.
 *
 * @param obj the object (must not be null)
 */
public void add(@NotNull Object obj) {
    add(Hash.hash64(obj.hashCode()));
}

/**
 * Whether the entry may be in the set.
 *
 * @param hash the hash value (need to be a high quality hash code, with all
 *             bits having high entropy)
 * @return true if the entry was added, or, with a certain false positive
 *         probability, even if it was not added
 */
public boolean mayContain(@NotNull Object obj) {
    return mayContain(obj.hashCode());
}
{noformat}



> DocumentStore: verify that we could use Oak's Bloom filter
> --
>
> Key: OAK-10674
> URL: https://issues.apache.org/jira/browse/OAK-10674
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: documentmk
>Reporter: Julian Reschke
>Priority: Major
>
> Test that we can use 
> oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/analysis/utils/BloomFilter.java
>  (for now, by copying it over).
> Then decide about where to move it, and whether API changes are desired.





[jira] [Comment Edited] (OAK-10648) "IS NULL" (Null Props) Cause Incorrect Query Estimation

2024-02-14 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817235#comment-17817235
 ] 

Thomas Mueller edited comment on OAK-10648 at 2/14/24 3:00 PM:
---

I didn't test this yet, but the following change seems to be necessary:

https://github.com/apache/jackrabbit-oak/blob/trunk/oak-search/src/main/java/org/apache/jackrabbit/oak/plugins/index/search/spi/query/FulltextIndexPlanner.java#L851

{noformat}
oak-search FulltextIndexPlanner

    if (pr.isNotNullRestriction()) {
        // don't use weight for "is not null" restrictions
        weight = 1;
    // ---- missing code start ----
    } else if (pr.isNullRestriction()) {
        // don't use weight for "is null" restrictions
        weight = 1;
    // ---- missing code end ----
    } else {
        if (weight > 1) {
            // for non-equality conditions such as
            // where x > 1, x < 2, x like y,...:
            // use a maximum weight of 3,
            // so assume we read at least 30%
            if (!isEqualityRestriction(pr)) {
                weight = Math.min(3, weight);
            }
        }
    }
{noformat}

We should probably add a feature toggle / system property so that we can switch 
back to the original behavior in case an application relies on it.
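
A minimal sketch of how such a toggle could look, assuming a boolean system property that defaults to the new behavior. The property name, the class, and the method signature are hypothetical; the booleans stand in for pr.isNotNullRestriction(), pr.isNullRestriction() and isEqualityRestriction(pr) from the snippet above.

```java
public final class NullRestrictionWeightSketch {

    /** Hypothetical system property name, chosen for this sketch only. */
    static final String TOGGLE = "oak.sketch.isNullWeightFix";

    /** The new behavior is on unless the property is explicitly set to "false". */
    static boolean fixEnabled() {
        return Boolean.parseBoolean(System.getProperty(TOGGLE, "true"));
    }

    /** Same weight logic as the snippet above, with the new branch behind the toggle. */
    static int adjustWeight(int weight, boolean notNull, boolean isNull,
            boolean equality, boolean fixEnabled) {
        if (notNull) {
            // don't use weight for "is not null" restrictions
            return 1;
        } else if (fixEnabled && isNull) {
            // new: don't use weight for "is null" restrictions either
            return 1;
        } else if (weight > 1 && !equality) {
            // non-equality conditions: use a maximum weight of 3
            return Math.min(3, weight);
        }
        return weight;
    }

    public static void main(String[] args) {
        // run with -Doak.sketch.isNullWeightFix=false to get the old result
        System.out.println(adjustWeight(10, false, true, false, fixEnabled()));
    }
}
```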


was (Author: tmueller):
I didn't test this yet, but the following change seems to be necessary:

{noformat}
oak-search FulltextIndexPlanner

    if (pr.isNotNullRestriction()) {
        // don't use weight for "is not null" restrictions
        weight = 1;
    // ---- missing code start ----
    } else if (pr.isNullRestriction()) {
        // don't use weight for "is null" restrictions
        weight = 1;
    // ---- missing code end ----
    } else {
        if (weight > 1) {
            // for non-equality conditions such as
            // where x > 1, x < 2, x like y,...:
            // use a maximum weight of 3,
            // so assume we read at least 30%
            if (!isEqualityRestriction(pr)) {
                weight = Math.min(3, weight);
            }
        }
    }
{noformat}

We should probably add a feature toggle / system property so that we can switch 
back to the original behavior in case an application relies on it.

> "IS NULL" (Null Props) Cause Incorrect Query Estimation
> ---
>
> Key: OAK-10648
> URL: https://issues.apache.org/jira/browse/OAK-10648
> Project: Jackrabbit Oak
>  Issue Type: Bug
>  Components: indexing
>Reporter: Patrique Legault
>Priority: Major
> Attachments: Non Union Query Plan.json, Non Union With Null 
> Check.json, Screenshot 2024-02-13 at 9.30.43 AM.png, Union Query Plan.json, 
> cqTagLucene.json
>
>
> Using null props in a query can cause the query engine to incorrectly 
> estimate the cost of a query plan, which can lead to a traversal and slow 
> query execution.
> If you look at the query plan below, the number of null props documents is 
> quite high, yet the cost for the query is only 19. The UNION query has a 
> cost of 38, which is why it is not selected, when in reality the cost of 
> the original query should be much higher.
> After removing the null check, the cost estimation is drastically different 
> and correctly reflects the number of documents in the index.
> Queries:
> {noformat}
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND (LOWER([jcr:title.en]) LIKE '%ksb1325bm%' OR LOWER([jcr:title]) LIKE 
> '%ksb1325bm%') 
> {noformat}
>  
> {noformat}
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND LOWER([jcr:title.en]) LIKE '%ksb1325bm%' 
> UNION
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND LOWER([jcr:title]) LIKE '%ksb1325bm%'
> {noformat}
> Index definition for the "cq:movedTo" property:
> {noformat}
> "cqMovedTo": {
>     "notNullCheckEnabled": true,
>     "nullCheckEnabled": true,
>     "propertyIndex": true,
>     "name": "cq:movedTo",
>     "type": "String"
> }
> {noformat}





[jira] [Updated] (OAK-10648) "IS NULL" (Null Props) Cause Incorrect Query Estimation

2024-02-14 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10648:
-
Description: 
Using null props in a query can cause the query engine to incorrectly estimate 
the cost of a query plan, which can lead to a traversal and slow query 
execution.

If you look at the query plan below, the number of null props documents is quite 
high, yet the cost for the query is only 19. The UNION query has a cost of 38, 
which is why it is not selected, when in reality the cost of the original query 
should be much higher.

After removing the null check, the cost estimation is drastically different and 
correctly reflects the number of documents in the index.

Queries:
{noformat}
SELECT * FROM [cq:Tag] 
WHERE [cq:movedTo] IS NULL 
AND (LOWER([jcr:title.en]) LIKE '%ksb1325bm%' OR LOWER([jcr:title]) LIKE 
'%ksb1325bm%') 
{noformat}
 
{noformat}
SELECT * FROM [cq:Tag] 
WHERE [cq:movedTo] IS NULL 
AND LOWER([jcr:title.en]) LIKE '%ksb1325bm%' 
UNION
SELECT * FROM [cq:Tag] 
WHERE [cq:movedTo] IS NULL 
AND LOWER([jcr:title]) LIKE '%ksb1325bm%'
{noformat}

Index definition for the "cq:movedTo" property:

{noformat}
"cqMovedTo": {
    "notNullCheckEnabled": true,
    "nullCheckEnabled": true,
    "propertyIndex": true,
    "name": "cq:movedTo",
    "type": "String"
}
{noformat}

  was:
Using null props in a query can cause the query engine to incorrectly estimate 
the cost of a query plan, which can lead to a traversal and slow query 
execution.

 

If you look at the query plan below, the number of null props documents is quite 
high, yet the cost for the query is only 19. The UNION query has a cost of 38, 
which is why it is not selected, when in reality the cost of the original query 
should be much higher.

 

After removing the null check, the cost estimation is drastically different and 
correctly reflects the number of documents in the index.

Queries:
{noformat}
SELECT * FROM [cq:Tag] 
WHERE [cq:movedTo] IS NULL 
AND (LOWER([jcr:title.en]) LIKE '%ksb1325bm%' OR LOWER([jcr:title]) LIKE 
'%ksb1325bm%') 
{noformat}
 
{noformat}
SELECT * FROM [cq:Tag] 
WHERE [cq:movedTo] IS NULL 
AND LOWER([jcr:title.en]) LIKE '%ksb1325bm%' 
UNION
SELECT * FROM [cq:Tag] 
WHERE [cq:movedTo] IS NULL 
AND LOWER([jcr:title]) LIKE '%ksb1325bm%'
{noformat}



> "IS NULL" (Null Props) Cause Incorrect Query Estimation
> ---
>
> Key: OAK-10648
> URL: https://issues.apache.org/jira/browse/OAK-10648
> Project: Jackrabbit Oak
>  Issue Type: Bug
>  Components: indexing
>Reporter: Patrique Legault
>Priority: Major
> Attachments: Non Union Query Plan.json, Non Union With Null 
> Check.json, Screenshot 2024-02-13 at 9.30.43 AM.png, Union Query Plan.json, 
> cqTagLucene.json
>
>
> Using null props in a query can cause the query engine to incorrectly 
> estimate the cost of a query plan, which can lead to a traversal and slow 
> query execution.
> If you look at the query plan below, the number of null props documents is 
> quite high, yet the cost for the query is only 19. When we execute the UNION 
> query, the cost is 38, which is why it is not selected, when in reality the 
> original cost should be much higher.
> After removing the null check the cost estimation is drastically different 
> and correctly reflects the number of documents in the index.
> Queries:
> {noformat}
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND (LOWER([jcr:title.en]) LIKE '%ksb1325bm%' OR LOWER([jcr:title]) LIKE 
> '%ksb1325bm%') 
> {noformat}
>  
> {noformat}
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND LOWER([jcr:title.en]) LIKE '%ksb1325bm%' 
> UNION
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND LOWER([jcr:title]) LIKE '%ksb1325bm%'
> {noformat}
> Index definition for the "cq:movedTo" property:
> {noformat}
> "cqMovedTo": {
> "notNullCheckEnabled": true,
> "nullCheckEnabled": true,
> "propertyIndex": true,
> "name": "cq:movedTo",
> "type": "String"
> }
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (OAK-10648) "IS NULL" (Null Props) Cause Incorrect Query Estimation

2024-02-14 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10648:
-
Summary: "IS NULL" (Null Props) Cause Incorrect Query Estimation  (was: 
Null Props Cause Incorrect Query Estimation)

> "IS NULL" (Null Props) Cause Incorrect Query Estimation
> ---
>
> Key: OAK-10648
> URL: https://issues.apache.org/jira/browse/OAK-10648
> Project: Jackrabbit Oak
>  Issue Type: Bug
>  Components: indexing
>Reporter: Patrique Legault
>Priority: Major
> Attachments: Non Union Query Plan.json, Non Union With Null 
> Check.json, Screenshot 2024-02-13 at 9.30.43 AM.png, Union Query Plan.json, 
> cqTagLucene.json
>
>
> Using null props in a query can cause the query engine to incorrectly 
> estimate the cost of a query plan, which can lead to a traversal and slow 
> query execution.
>  
> If you look at the query plan below, the number of null props documents is 
> quite high, yet the cost for the query is only 19. When we execute the UNION 
> query, the cost is 38, which is why it is not selected, when in reality the 
> original cost should be much higher.
>  
> After removing the null check the cost estimation is drastically different 
> and correctly reflects the number of documents in the index.
> Queries:
> {noformat}
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND (LOWER([jcr:title.en]) LIKE '%ksb1325bm%' OR LOWER([jcr:title]) LIKE 
> '%ksb1325bm%') 
> {noformat}
>  
> {noformat}
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND LOWER([jcr:title.en]) LIKE '%ksb1325bm%' 
> UNION
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND LOWER([jcr:title]) LIKE '%ksb1325bm%'
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (OAK-10648) Null Props Cause Incorrect Query Estimation

2024-02-14 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10648:
-
Description: 
Using null props in a query can cause the query engine to incorrectly estimate 
the cost of a query plan, which can lead to a traversal and slow query 
execution.

 

If you look at the query plan below, the number of null props documents is quite 
high, yet the cost for the query is only 19. When we execute the UNION query, the 
cost is 38, which is why it is not selected, when in reality the original cost 
should be much higher.

 

After removing the null check the cost estimation is drastically different and 
correctly reflects the number of documents in the index.

Queries:
{noformat}
SELECT * FROM [cq:Tag] 
WHERE [cq:movedTo] IS NULL 
AND (LOWER([jcr:title.en]) LIKE '%ksb1325bm%' OR LOWER([jcr:title]) LIKE 
'%ksb1325bm%') 
{noformat}
 
{noformat}
SELECT * FROM [cq:Tag] 
WHERE [cq:movedTo] IS NULL 
AND LOWER([jcr:title.en]) LIKE '%ksb1325bm%' 
UNION
SELECT * FROM [cq:Tag] 
WHERE [cq:movedTo] IS NULL 
AND LOWER([jcr:title]) LIKE '%ksb1325bm%'
{noformat}


  was:
Using null props in a query can cause the query engine to incorrectly estimate 
the cost of a query plan, which can lead to a traversal and slow query 
execution.

 

If you look at the query plan below, the number of null props documents is quite 
high, yet the cost for the query is only 19. When we execute the UNION query, the 
cost is 38, which is why it is not selected, when in reality the original cost 
should be much higher.

 

After removing the null check the cost estimation is drastically different and 
correctly reflects the number of documents in the index.


> Null Props Cause Incorrect Query Estimation
> ---
>
> Key: OAK-10648
> URL: https://issues.apache.org/jira/browse/OAK-10648
> Project: Jackrabbit Oak
>  Issue Type: Bug
>  Components: indexing
>Reporter: Patrique Legault
>Priority: Major
> Attachments: Non Union Query Plan.json, Non Union With Null 
> Check.json, Screenshot 2024-02-13 at 9.30.43 AM.png, Union Query Plan.json, 
> cqTagLucene.json
>
>
> Using null props in a query can cause the query engine to incorrectly 
> estimate the cost of a query plan, which can lead to a traversal and slow 
> query execution.
>  
> If you look at the query plan below, the number of null props documents is 
> quite high, yet the cost for the query is only 19. When we execute the UNION 
> query, the cost is 38, which is why it is not selected, when in reality the 
> original cost should be much higher.
>  
> After removing the null check the cost estimation is drastically different 
> and correctly reflects the number of documents in the index.
> Queries:
> {noformat}
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND (LOWER([jcr:title.en]) LIKE '%ksb1325bm%' OR LOWER([jcr:title]) LIKE 
> '%ksb1325bm%') 
> {noformat}
>  
> {noformat}
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND LOWER([jcr:title.en]) LIKE '%ksb1325bm%' 
> UNION
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND LOWER([jcr:title]) LIKE '%ksb1325bm%'
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OAK-10648) Null Props Cause Incorrect Query Estimation

2024-02-13 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817235#comment-17817235
 ] 

Thomas Mueller commented on OAK-10648:
--

I didn't test this yet, but the following change seems to be necessary:

{noformat}
oak-search FulltextIndexPlanner

if (pr.isNotNullRestriction()) {
    // don't use weight for "is not null" restrictions
    weight = 1;
// -- missing code start --
} else if (pr.isNullRestriction()) {
    // don't use weight for "is null" restrictions
    weight = 1;
// -- missing code end --
} else {
    if (weight > 1) {
        // for non-equality conditions such as
        // where x > 1, x < 2, x like y,...:
        // use a maximum weight of 3,
        // so assume we read at least 30%
        if (!isEqualityRestriction(pr)) {
            weight = Math.min(3, weight);
        }
    }
}
{noformat}

We should probably add a feature toggle / system property so that we can switch 
back to the original behavior in case an application relies on it.
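
A minimal sketch of how such a toggle could look, in case it helps the 
discussion. This is not the actual FulltextIndexPlanner code: the class name, 
the method signature, and the system property name are hypothetical, and the 
boolean parameters stand in for the checks on the property restriction.

```java
// Hypothetical sketch, not the actual Oak code: the class name, the method
// signature, and the system property name are assumptions for illustration.
final class NullRestrictionWeight {

    // Assumed property name; the new behavior is used when it is not set.
    static final String TOGGLE = "oak.fulltext.ignoreWeightForIsNull";

    static boolean newBehaviorEnabled() {
        return Boolean.parseBoolean(System.getProperty(TOGGLE, "true"));
    }

    static int adjustWeight(int weight, boolean isNotNull, boolean isNull,
            boolean isEquality) {
        if (isNotNull) {
            // don't use weight for "is not null" restrictions
            return 1;
        }
        if (isNull && newBehaviorEnabled()) {
            // don't use weight for "is null" restrictions (new behavior)
            return 1;
        }
        if (weight > 1 && !isEquality) {
            // non-equality conditions: use a maximum weight of 3
            return Math.min(3, weight);
        }
        return weight;
    }
}
```

With the property set to "false", an "is null" restriction falls through to 
the last two branches, which is the original behavior.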

> Null Props Cause Incorrect Query Estimation
> ---
>
> Key: OAK-10648
> URL: https://issues.apache.org/jira/browse/OAK-10648
> Project: Jackrabbit Oak
>  Issue Type: Bug
>  Components: indexing
>Reporter: Patrique Legault
>Priority: Major
> Attachments: Non Union Query Plan.json, Non Union With Null 
> Check.json, Screenshot 2024-02-13 at 9.30.43 AM.png, Union Query Plan.json, 
> cqTagLucene.json
>
>
> Using null props in a query can cause the query engine to incorrectly 
> estimate the cost of a query plan, which can lead to a traversal and slow 
> query execution.
>  
> If you look at the query plan below, the number of null props documents is 
> quite high, yet the cost for the query is only 19. When we execute the UNION 
> query, the cost is 38, which is why it is not selected, when in reality the 
> original cost should be much higher.
>  
> After removing the null check the cost estimation is drastically different 
> and correctly reflects the number of documents in the index.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OAK-10424) Allow Fast Query Size and Insecure Facets to be selectively enabled with query options for permitted principals

2024-01-15 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806843#comment-17806843
 ] 

Thomas Mueller commented on OAK-10424:
--

Documentation (proposal): https://github.com/apache/jackrabbit-oak/pull/1269

https://jackrabbit.apache.org/oak/docs/query/query-engine.html#result-size 
(source code in 
https://github.com/apache/jackrabbit-oak/blob/trunk/oak-doc/src/site/markdown/query/query-engine.md)



> Allow Fast Query Size and Insecure Facets to be selectively enabled with 
> query options for permitted principals 
> 
>
> Key: OAK-10424
> URL: https://issues.apache.org/jira/browse/OAK-10424
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Affects Versions: 1.56.0
>Reporter: Mark Adamcin
>Assignee: Mark Adamcin
>Priority: Major
>  Labels: query
> Fix For: 1.62.0
>
>
> Setting the global QueryEngineSettingsService.getFastQuerySize() value to 
> true is currently the only way to allow service users to leverage JCR query 
> for collecting accurate repository count metrics in a performant way. 
> However, doing so in a multiuser repository may be inadvisable because the 
> fast result size is returned to the caller without considering the caller's 
> read permissions over the paths returned in the result, which may allow less 
> privileged users to discover the presence of nodes that are not otherwise 
> visible to them.
> See 
> [https://jackrabbit.apache.org/oak/docs/query/query-engine.html#result-size]
> As an alternative to the global setting, Oak should provide a query option 
> alongside [TRAVERSAL, OFFSET / LIMIT, and INDEX 
> TAG|https://jackrabbit.apache.org/oak/docs/query/query-engine.html#query-options],
>  such as "INSECURE RESULT SIZE" .
> Similarly, IndexDefinition.SecureFacetConfiguration.MODE.INSECURE (insecure 
> facets) can provide extremely valuable counts for property value distribution 
> in large repositories. At the moment, it can only be defined on an index 
> definition, even though it governs the facet counts at query time and has no 
> effect on the persisted content of the index at all. Like fastQuerySize, Oak 
> should provide a query option such as "INSECURE FACETS", for permitted system 
> users to leverage insecure facets even when the query execution plan uses an 
> index definition that only allows secure or statistical facet security. 
> For example, 
> select a.[jcr:path] from [nt:base] as a where contains(a.[text], 'Hello 
> World') option(insecure result size, insecure facets, offset 10)
> To address the security risk, the application should also provide a 
> configuration of some kind to restrict the ability to effectively leverage 
> this option to permitted system users, which could be implemented as a JCR 
> repository privilege or an allowlist property in the 
> QueryEngineSettingsService configuration.
> I have provided a PR that adds support for an INSECURE RESULT SIZE query 
> option and an INSECURE FACETS query option, as well as an 
> "rep:insecureQueryOptions" repository privilege. I think the JCR 
> privilege-based approach for configuration of this permission is more aligned 
> with how system users are defined in practice, but this approach requires a 
> minor version increase in the following oak-security-spi packages:
>  * org.apache.jackrabbit.oak.spi.security.authorization.permission
>  * org.apache.jackrabbit.oak.spi.security.privilege
> Because all registered permissions are serialized into a long bitset, there 
> is clearly a premium on adding another built-in privilege, so I figured that 
> it would be better to choose a name for the privilege that would make it 
> applicable to both of these new options, and any future query options that 
> may involve a tradeoff between security and performance.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (OAK-10577) Advanced repository statistics

2024-01-03 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10577.
--
Fix Version/s: 1.62.0
   Resolution: Fixed

> Advanced repository statistics
> --
>
> Key: OAK-10577
> URL: https://issues.apache.org/jira/browse/OAK-10577
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: oak-run
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.62.0
>
>
> Currently, we have very few metrics per repository, and most are for the 
> whole repository: total size, the total index sizes, datastore size. The only 
> metric we collect per path is the approximate number of nodes per path.
> I would like to collect more data, first via a "flat file store" (sorted list 
> of node data), e.g.
> * Approximate number of nodes per path.
> * Approximate size of binaries per path.
> * Histograms of binary sizes.
> * The same, but for a filtered set of binaries.
> * Approximate number and size of distinct binaries.
> * Number of distinct values per (indexed) property, and the top values. This 
> is useful to improve cost estimation (the "weight" property of indexes) and 
> estimate index sizes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OAK-10577) Advanced repository statistics

2024-01-03 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802241#comment-17802241
 ] 

Thomas Mueller commented on OAK-10577:
--

Merged on 2024-01-03

> Advanced repository statistics
> --
>
> Key: OAK-10577
> URL: https://issues.apache.org/jira/browse/OAK-10577
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: oak-run
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> Currently, we have very few metrics per repository, and most are for the 
> whole repository: total size, the total index sizes, datastore size. The only 
> metric we collect per path is the approximate number of nodes per path.
> I would like to collect more data, first via a "flat file store" (sorted list 
> of node data), e.g.
> * Approximate number of nodes per path.
> * Approximate size of binaries per path.
> * Histograms of binary sizes.
> * The same, but for a filtered set of binaries.
> * Approximate number and size of distinct binaries.
> * Number of distinct values per (indexed) property, and the top values. This 
> is useful to improve cost estimation (the "weight" property of indexes) and 
> estimate index sizes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (OAK-3583) Replace Guava API for caching

2023-12-20 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-3583:

Issue Type: Improvement  (was: Wish)

> Replace Guava API for caching
> -
>
> Key: OAK-3583
> URL: https://issues.apache.org/jira/browse/OAK-3583
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: cache
>Reporter: Philipp Suter
>Assignee: Thomas Mueller
>Priority: Major
>
> The Guava Cache API that is currently used should be replaced, so that we no 
> longer depend on Guava.
> The JCache API [1] might be an option. The JCache API implementation should 
> be configurable/pluggable so it could support one of the available 
> distributed implementations [2].
> [1] https://jcp.org/en/jsr/detail?id=107
> [2] 
> https://jcp.org/aboutJava/communityprocess/implementations/jsr107/index.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (OAK-3583) Replace Guava API for caching

2023-12-20 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller reassigned OAK-3583:
---

Assignee: Thomas Mueller

> Replace Guava API for caching
> -
>
> Key: OAK-3583
> URL: https://issues.apache.org/jira/browse/OAK-3583
> Project: Jackrabbit Oak
>  Issue Type: Wish
>  Components: cache
>Reporter: Philipp Suter
>Assignee: Thomas Mueller
>Priority: Major
>
> The Guava Cache API that is currently used should be replaced, so that we no 
> longer depend on Guava.
> The JCache API [1] might be an option. The JCache API implementation should 
> be configurable/pluggable so it could support one of the available 
> distributed implementations [2].
> [1] https://jcp.org/en/jsr/detail?id=107
> [2] 
> https://jcp.org/aboutJava/communityprocess/implementations/jsr107/index.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (OAK-3583) Replace Guava API for caching

2023-12-20 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-3583:

Description: 
The Guava Cache API that is currently used should be replaced, so that we no 
longer depend on Guava.

The JCache API [1] might be an option. The JCache API implementation should be 
configurable/pluggable so it could support one of the available distributed 
implementations [2].

[1] https://jcp.org/en/jsr/detail?id=107
[2] https://jcp.org/aboutJava/communityprocess/implementations/jsr107/index.html

  was:
The JCache API [1] was finally released and is ready to be used. 

Ideally the currently used Guava Cache is replaced by the JCache API. The 
JCache API implementation should be configurable/ pluggable so it could support 
one of the available distributed implementations [2].

The default should be a wrapper around the current Guava Cache and LIRSCache 
implementations.

[1] https://jcp.org/en/jsr/detail?id=107
[2] https://jcp.org/aboutJava/communityprocess/implementations/jsr107/index.html


> Replace Guava API for caching
> -
>
> Key: OAK-3583
> URL: https://issues.apache.org/jira/browse/OAK-3583
> Project: Jackrabbit Oak
>  Issue Type: Wish
>  Components: cache
>Reporter: Philipp Suter
>Priority: Major
>
> The Guava Cache API that is currently used should be replaced, so that we no 
> longer depend on Guava.
> The JCache API [1] might be an option. The JCache API implementation should 
> be configurable/pluggable so it could support one of the available 
> distributed implementations [2].
> [1] https://jcp.org/en/jsr/detail?id=107
> [2] 
> https://jcp.org/aboutJava/communityprocess/implementations/jsr107/index.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (OAK-3583) Replace Guava API for caching

2023-12-20 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-3583:

Summary: Replace Guava API for caching  (was: Replace Guava API with JCache 
API)

> Replace Guava API for caching
> -
>
> Key: OAK-3583
> URL: https://issues.apache.org/jira/browse/OAK-3583
> Project: Jackrabbit Oak
>  Issue Type: Wish
>  Components: cache
>Reporter: Philipp Suter
>Priority: Major
>
> The JCache API [1] was finally released and is ready to be used. 
> Ideally the currently used Guava Cache is replaced by the JCache API. The 
> JCache API implementation should be configurable/ pluggable so it could 
> support one of the available distributed implementations [2].
> The default should be a wrapper around the current Guava Cache and LIRSCache 
> implementations.
> [1] https://jcp.org/en/jsr/detail?id=107
> [2] 
> https://jcp.org/aboutJava/communityprocess/implementations/jsr107/index.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OAK-10577) Advanced repository statistics

2023-12-05 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793295#comment-17793295
 ] 

Thomas Mueller commented on OAK-10577:
--

PR (work in progress): https://github.com/apache/jackrabbit-oak/pull/1247

> Advanced repository statistics
> --
>
> Key: OAK-10577
> URL: https://issues.apache.org/jira/browse/OAK-10577
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: oak-run
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> Currently, we have very few metrics per repository, and most are for the 
> whole repository: total size, the total index sizes, datastore size. The only 
> metric we collect per path is the approximate number of nodes per path.
> I would like to collect more data, first via a "flat file store" (sorted list 
> of node data), e.g.
> * Approximate number of nodes per path.
> * Approximate size of binaries per path.
> * Histograms of binary sizes.
> * The same, but for a filtered set of binaries.
> * Approximate number and size of distinct binaries.
> * Number of distinct values per (indexed) property, and the top values. This 
> is useful to improve cost estimation (the "weight" property of indexes) and 
> estimate index sizes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (OAK-10577) Advanced repository statistics

2023-12-04 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10577:
-
Description: 
Currently, we have very few metrics per repository, and most are for the whole 
repository: total size, the total index sizes, datastore size. The only metric 
we collect per path is the approximate number of nodes per path.

I would like to collect more data, first via a "flat file store" (sorted list 
of node data), e.g.

* Approximate number of nodes per path.
* Approximate size of binaries per path.
* Histograms of binary sizes.
* The same, but for a filtered set of binaries.
* Approximate number and size of distinct binaries.
* Number of distinct values per (indexed) property, and the top values. This is 
useful to improve cost estimation (the "weight" property of indexes) and 
estimate index sizes.


  was:
Currently, we have very few metrics per repository, and most are for the whole 
repository: total size, the total index sizes, datastore size. The only metric 
we collect per path is the approximate number of nodes per path.

I would like to collect more data, first via a "flat file store" (sorted list 
of node data), e.g.

* Approximate number of nodes per path.
* Approximate size of binaries per path.
* Histograms of binary sizes.
* The same, but for a filtered set of binaries.
* Number and size of distinct binaries.
* Number of distinct values per (indexed) property, and the top values. This is 
useful to improve cost estimation (the "weight" property of indexes) and 
estimate index sizes.



> Advanced repository statistics
> --
>
> Key: OAK-10577
> URL: https://issues.apache.org/jira/browse/OAK-10577
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: oak-run
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> Currently, we have very few metrics per repository, and most are for the 
> whole repository: total size, the total index sizes, datastore size. The only 
> metric we collect per path is the approximate number of nodes per path.
> I would like to collect more data, first via a "flat file store" (sorted list 
> of node data), e.g.
> * Approximate number of nodes per path.
> * Approximate size of binaries per path.
> * Histograms of binary sizes.
> * The same, but for a filtered set of binaries.
> * Approximate number and size of distinct binaries.
> * Number of distinct values per (indexed) property, and the top values. This 
> is useful to improve cost estimation (the "weight" property of indexes) and 
> estimate index sizes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (OAK-10577) Advanced repository statistics

2023-12-04 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10577:
-
Description: 
Currently, we have very few metrics per repository, and most are for the whole 
repository: total size, the total index sizes, datastore size. The only metric 
we collect per path is the approximate number of nodes per path.

I would like to collect more data, first via a "flat file store" (sorted list 
of node data), e.g.

* Approximate number of nodes per path.
* Approximate size of binaries per path.
* Histograms of binary sizes.
* The same, but for a filtered set of binaries.
* Number and size of distinct binaries.
* Number of distinct values per (indexed) property, and the top values. This is 
useful to improve cost estimation (the "weight" property of indexes) and 
estimate index sizes.


  was:
Currently, we have very few metrics per repository, and most are for the whole 
repository: total size, the total index sizes, datastore size. The only metric 
we collect per path is the approximate number of nodes per path.

I would like to collect more data, first via a "flat file store" (sorted list 
of node data), e.g.

* Approximate number of nodes per path.
* Approximate size of binaries per path.
* Histograms of binary sizes.
* The same, but for a filtered set of binaries.
* Size of distinct binaries.


> Advanced repository statistics
> --
>
> Key: OAK-10577
> URL: https://issues.apache.org/jira/browse/OAK-10577
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: oak-run
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> Currently, we have very few metrics per repository, and most are for the 
> whole repository: total size, the total index sizes, datastore size. The only 
> metric we collect per path is the approximate number of nodes per path.
> I would like to collect more data, first via a "flat file store" (sorted list 
> of node data), e.g.
> * Approximate number of nodes per path.
> * Approximate size of binaries per path.
> * Histograms of binary sizes.
> * The same, but for a filtered set of binaries.
> * Number and size of distinct binaries.
> * Number of distinct values per (indexed) property, and the top values. This 
> is useful to improve cost estimation (the "weight" property of indexes) and 
> estimate index sizes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (OAK-10577) Advanced repository statistics

2023-12-04 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10577:


 Summary: Advanced repository statistics
 Key: OAK-10577
 URL: https://issues.apache.org/jira/browse/OAK-10577
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: oak-run
Reporter: Thomas Mueller
Assignee: Thomas Mueller


Currently, we have very few metrics per repository, and most are for the whole 
repository: total size, the total index sizes, datastore size. The only metric 
we collect per path is the approximate number of nodes per path.

I would like to collect more data, first via a "flat file store" (sorted list 
of node data), e.g.

* Approximate number of nodes per path.
* Approximate size of binaries per path.
* Histograms of binary sizes.
* The same, but for a filtered set of binaries.
* Size of distinct binaries.
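
To illustrate the histogram item in the list above, here is a minimal sketch 
that groups binary sizes into power-of-two buckets. The class and method names 
are hypothetical and are not taken from the oak-run implementation; reading 
the sizes from the flat file store is left out.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: the names are assumptions, not the oak-run code.
final class BinarySizeHistogram {

    // bucket index -> number of binaries; bucket b covers sizes in
    // [2^b, 2^(b+1)), and bucket 0 also contains binaries of size 0
    private final TreeMap<Integer, Long> buckets = new TreeMap<>();

    void add(long sizeInBytes) {
        if (sizeInBytes < 0) {
            throw new IllegalArgumentException("negative size: " + sizeInBytes);
        }
        // index of the highest set bit, i.e. floor(log2(size)) for size >= 1
        int bucket = sizeInBytes <= 1
                ? 0
                : 63 - Long.numberOfLeadingZeros(sizeInBytes);
        buckets.merge(bucket, 1L, Long::sum);
    }

    Map<Integer, Long> buckets() {
        return buckets;
    }
}
```

Power-of-two buckets keep the histogram small (at most 63 entries) no matter 
how many binaries are counted, at the price of coarse resolution.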



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OAK-10549) Improve performance of facet count at scale (Lucene)

2023-11-20 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17787939#comment-17787939
 ] 

Thomas Mueller commented on OAK-10549:
--

To avoid OOME when running the tests, I changed the test case to use only 10 
facets:

https://github.com/apache/jackrabbit-oak/commit/79ac7fd718b1abb495635cec38f9887a4a2b9219

With 200 facets, the test required 190 MB (-mx190m); with 10 facets, only 25 MB.

> Improve performance of facet count at scale (Lucene)
> 
>
> Key: OAK-10549
> URL: https://issues.apache.org/jira/browse/OAK-10549
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene, query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.60.0
>
>
> Currently, reading many facets (e.g. 20) at a time is quite slow when using a 
> Lucene index. We already cache the data, but performance is not all that 
> great. One of the reasons is that we run one Lucene query per facet column. 
> It is possible to speed this up, using eager facet caching.
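
A minimal sketch of the eager caching idea described above: on the first 
request for any facet column, the counts for all facet columns are computed in 
one pass and cached, instead of running one query per column. This does not 
use the Lucene API and is not the code from the PR; documents are modeled as 
plain maps, and all names are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of eager facet caching; it does not use the Lucene API.
final class EagerFacetCache {

    private final List<Map<String, String>> matchingDocs;
    private final List<String> facetColumns;
    // facet column -> (value -> count); filled for all columns at once
    private Map<String, Map<String, Integer>> cache;
    private int passes;

    EagerFacetCache(List<Map<String, String>> matchingDocs,
            List<String> facetColumns) {
        this.matchingDocs = matchingDocs;
        this.facetColumns = facetColumns;
    }

    Map<String, Integer> counts(String column) {
        if (cache == null) {
            // first access: compute every facet column eagerly
            cache = computeAll();
        }
        return cache.getOrDefault(column, Collections.emptyMap());
    }

    // number of passes over the matching documents so far
    int passes() {
        return passes;
    }

    private Map<String, Map<String, Integer>> computeAll() {
        passes++;
        Map<String, Map<String, Integer>> result = new HashMap<>();
        for (String column : facetColumns) {
            result.put(column, new HashMap<>());
        }
        for (Map<String, String> doc : matchingDocs) {
            for (String column : facetColumns) {
                String value = doc.get(column);
                if (value != null) {
                    result.get(column).merge(value, 1, Integer::sum);
                }
            }
        }
        return result;
    }
}
```

Reading 20 facet columns then costs one pass over the matching documents 
rather than 20, at the expense of keeping all counts in memory.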



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (OAK-10549) Improve performance of facet count at scale (Lucene)

2023-11-17 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10549.
--
Fix Version/s: 1.60.0
   Resolution: Fixed

> Improve performance of facet count at scale (Lucene)
> 
>
> Key: OAK-10549
> URL: https://issues.apache.org/jira/browse/OAK-10549
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene, query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.60.0
>
>
> Currently, reading many facets (e.g. 20) at a time is quite slow when using a 
> Lucene index. We already cache the data, but performance is not all that 
> great. One of the reasons is that we run one Lucene query per facet column. 
> It is possible to speed this up, using eager facet caching.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OAK-10549) Improve performance of facet count at scale (Lucene)

2023-11-15 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17786419#comment-17786419
 ] 

Thomas Mueller commented on OAK-10549:
--

PR for review https://github.com/apache/jackrabbit-oak/pull/1215

> Improve performance of facet count at scale (Lucene)
> 
>
> Key: OAK-10549
> URL: https://issues.apache.org/jira/browse/OAK-10549
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene, query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> Currently, reading many facets (e.g. 20) at a time is quite slow when using a 
> Lucene index. We already cache the data, but performance is not all that 
> great. One of the reasons is that we run one Lucene query per facet column. 
> It is possible to speed this up, using eager facet caching.





[jira] [Updated] (OAK-10549) Improve performance of facet count at scale (Lucene)

2023-11-15 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10549:
-
Description: 
Currently, reading many facets (e.g. 20) at a time is quite slow when using a 
Lucene index. We already cache the data, but performance is not all that great. 
One of the reasons is that we run one Lucene query per facet column. 

It is possible to speed this up, using eager facet caching.

  was:Currently, reading many facets (eg. 20) at a time is quite slow when 
using a Lucene index.


> Improve performance of facet count at scale (Lucene)
> 
>
> Key: OAK-10549
> URL: https://issues.apache.org/jira/browse/OAK-10549
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene, query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> Currently, reading many facets (e.g. 20) at a time is quite slow when using a 
> Lucene index. We already cache the data, but performance is not all that 
> great. One of the reasons is that we run one Lucene query per facet column. 
> It is possible to speed this up, using eager facet caching.





[jira] [Commented] (OAK-10549) Improve performance of facet count at scale (Lucene)

2023-11-14 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17785910#comment-17785910
 ] 

Thomas Mueller commented on OAK-10549:
--

The latest change in this area was here:
https://github.com/apache/jackrabbit-oak/compare/trunk...oak-indexing:jackrabbit-oak:OAK-8898

> Improve performance of facet count at scale (Lucene)
> 
>
> Key: OAK-10549
> URL: https://issues.apache.org/jira/browse/OAK-10549
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene, query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> Currently, reading many facets (e.g. 20) at a time is quite slow when using a 
> Lucene index.





[jira] [Created] (OAK-10549) Improve performance of facet count at scale (Lucene)

2023-11-14 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10549:


 Summary: Improve performance of facet count at scale (Lucene)
 Key: OAK-10549
 URL: https://issues.apache.org/jira/browse/OAK-10549
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: lucene, query
Reporter: Thomas Mueller
Assignee: Thomas Mueller


Currently, reading many facets (e.g. 20) at a time is quite slow when using a 
Lucene index.





[jira] [Resolved] (OAK-10527) Improve readability of the explain query output

2023-11-07 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10527.
--
Resolution: Fixed

> Improve readability of the explain query output
> ---
>
> Key: OAK-10527
> URL: https://issues.apache.org/jira/browse/OAK-10527
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.60.0
>
>
> Currently the output "explain query" of Oak (the query plan) is hard to 
> interpret.
> A more human-readable output would be better. Example:
> Old:
> {noformat}
> [nt:base] as [nt:base] /* 
> lucene:slingResourceResolver-1(/oak:index/slingResourceResolver-1) 
> sling:vanityPath:[* TO *] sync:(sling:vanityPath is not null) where 
> ([nt:base].[sling:vanityPath] is not null) and 
> (first([nt:base].[sling:vanityPath]) > '') */
> {noformat}
> New:
> {noformat}
> [nt:base] as [nt:base] /* lucene:slingResourceResolver-1
> indexDefinition: /oak:index/slingResourceResolver-1
> estimatedEntries: 46
> luceneQuery: sling:vanityPath:[* TO *]
> synchronousPropertyCondition: sling:vanityPath is not null
>  */
> {noformat}
> Also, the formatting of the logged query statement should be improved: 
> instead of one single line with the whole statement, the statement should 
> contain line breaks before the important keywords. Example:
> Old:
> {noformat}
> Parsing JCR-SQL2 statement: explain SELECT [sling:vanityPath], 
> [sling:redirect], [sling:redirectStatus] FROM [nt:base] WHERE NOT 
> isdescendantnode('/jcr:system') AND [sling:vanityPath] IS NOT NULL AND 
> FIRST([sling:vanityPath]) > '' ORDER BY FIRST([sling:vanityPath])
> {noformat}
> New:
> {noformat}
> Parsing JCR-SQL2 statement: explain SELECT [sling:vanityPath], 
> [sling:redirect], [sling:redirectStatus]
>   FROM [nt:base]
>   WHERE NOT isdescendantnode('/jcr:system')
>   AND [sling:vanityPath] IS NOT NULL
>   AND FIRST([sling:vanityPath]) > ''
>   ORDER BY FIRST([sling:vanityPath])
> {noformat}





[jira] [Updated] (OAK-10527) Improve readability of the explain query output

2023-11-07 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10527:
-
Fix Version/s: 1.60.0

> Improve readability of the explain query output
> ---
>
> Key: OAK-10527
> URL: https://issues.apache.org/jira/browse/OAK-10527
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.60.0
>
>
> Currently the output "explain query" of Oak (the query plan) is hard to 
> interpret.
> A more human-readable output would be better. Example:
> Old:
> {noformat}
> [nt:base] as [nt:base] /* 
> lucene:slingResourceResolver-1(/oak:index/slingResourceResolver-1) 
> sling:vanityPath:[* TO *] sync:(sling:vanityPath is not null) where 
> ([nt:base].[sling:vanityPath] is not null) and 
> (first([nt:base].[sling:vanityPath]) > '') */
> {noformat}
> New:
> {noformat}
> [nt:base] as [nt:base] /* lucene:slingResourceResolver-1
> indexDefinition: /oak:index/slingResourceResolver-1
> estimatedEntries: 46
> luceneQuery: sling:vanityPath:[* TO *]
> synchronousPropertyCondition: sling:vanityPath is not null
>  */
> {noformat}
> Also, the formatting of the logged query statement should be improved: 
> instead of one single line with the whole statement, the statement should 
> contain line breaks before the important keywords. Example:
> Old:
> {noformat}
> Parsing JCR-SQL2 statement: explain SELECT [sling:vanityPath], 
> [sling:redirect], [sling:redirectStatus] FROM [nt:base] WHERE NOT 
> isdescendantnode('/jcr:system') AND [sling:vanityPath] IS NOT NULL AND 
> FIRST([sling:vanityPath]) > '' ORDER BY FIRST([sling:vanityPath])
> {noformat}
> New:
> {noformat}
> Parsing JCR-SQL2 statement: explain SELECT [sling:vanityPath], 
> [sling:redirect], [sling:redirectStatus]
>   FROM [nt:base]
>   WHERE NOT isdescendantnode('/jcr:system')
>   AND [sling:vanityPath] IS NOT NULL
>   AND FIRST([sling:vanityPath]) > ''
>   ORDER BY FIRST([sling:vanityPath])
> {noformat}





[jira] [Resolved] (OAK-10420) Tool to compare Lucene index content

2023-11-07 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10420.
--
Fix Version/s: 1.60.0
   Resolution: Fixed

> Tool to compare Lucene index content
> 
>
> Key: OAK-10420
> URL: https://issues.apache.org/jira/browse/OAK-10420
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.60.0
>
>
> I want to verify that an Oak Lucene index matches another index. Comparing 
> the number of documents in each index is possible, but this comparison is not 
> sufficient. 
> The main problem is that aggregation order depends on the order in which 
> child nodes are traversed, and this order is not guaranteed to always be the 
> same (e.g. the segment node store returns children in a different order than 
> the document node store). This makes the checksums of the files differ, so 
> checksums can't always be compared.
> I would like to create a tool that makes comparison of index content easy. 
> This tool needs to account for small differences caused by the above problem.
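
A minimal sketch of an order-insensitive comparison, assuming each index can be reduced to a map from node path to the list of aggregated values for that path. This is not the tool that was built for this issue; the class IndexContentComparator and its methods are hypothetical names. The values are sorted before comparing, so a different child traversal order does not produce a mismatch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not part of Oak.
class IndexContentComparator {

    // Normalizes an index (path -> aggregated values) so that value order is irrelevant.
    static Map<String, List<String>> normalize(Map<String, List<String>> index) {
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : index.entrySet()) {
            List<String> sorted = new ArrayList<>(e.getValue());
            Collections.sort(sorted);
            result.put(e.getKey(), sorted);
        }
        return result;
    }

    // Two indexes have the same content if their normalized forms are equal.
    static boolean sameContent(Map<String, List<String>> a, Map<String, List<String>> b) {
        return normalize(a).equals(normalize(b));
    }
}
```

Sorting keeps duplicate values, so two documents that differ only in how often a value occurs are still reported as different.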





[jira] [Created] (OAK-10532) Cost estimation for "not(@x)" calculates cost for "@x='value'" instead

2023-11-03 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10532:


 Summary: Cost estimation for "not(@x)" calculates cost for 
"@x='value'" instead
 Key: OAK-10532
 URL: https://issues.apache.org/jira/browse/OAK-10532
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: lucene
Reporter: Thomas Mueller


The cost estimation for a query that uses a Lucene index calculates the cost 
incorrectly if there is a "not()" condition. Example:

{noformat}
/jcr:root//*[(not(@x)) and (not(@y))]
{noformat}

The Lucene query is then:
{noformat}
+:nullProps:x +:nullProps:y
{noformat}

But the cost estimation seems to take into account the number of documents for 
the fields "x" and "y", instead of the field ":nullProps".
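
A minimal sketch of the expected behavior, assuming document counts are available in a map keyed by field or by "field:value". The class NullPropsCostSketch, its methods, and the statistics map are hypothetical and do not reflect Oak's cost estimation code. For a "not(@x)" condition, the lookup uses the ":nullProps" term for "x" rather than the field "x" itself.

```java
import java.util.Map;

// Hypothetical illustration; the names and the statistics map are assumptions.
class NullPropsCostSketch {

    static final String NULL_PROPS = ":nullProps";

    // Returns the statistics key whose document count should drive the estimate.
    static String statisticsKey(String property, boolean isNullCondition) {
        return isNullCondition ? NULL_PROPS + ":" + property : property;
    }

    // Looks up the estimated number of matching documents; 0 if nothing is known.
    static long estimate(Map<String, Long> docCounts, String property, boolean isNullCondition) {
        return docCounts.getOrDefault(statisticsKey(property, isNullCondition), 0L);
    }
}
```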





[jira] [Commented] (OAK-10527) Improve readability of the explain query output

2023-11-03 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17782462#comment-17782462
 ] 

Thomas Mueller commented on OAK-10527:
--

PR for review https://github.com/apache/jackrabbit-oak/pull/1187

> Improve readability of the explain query output
> ---
>
> Key: OAK-10527
> URL: https://issues.apache.org/jira/browse/OAK-10527
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> Currently the output "explain query" of Oak (the query plan) is hard to 
> interpret.
> A more human-readable output would be better. Example:
> Old:
> {noformat}
> [nt:base] as [nt:base] /* 
> lucene:slingResourceResolver-1(/oak:index/slingResourceResolver-1) 
> sling:vanityPath:[* TO *] sync:(sling:vanityPath is not null) where 
> ([nt:base].[sling:vanityPath] is not null) and 
> (first([nt:base].[sling:vanityPath]) > '') */
> {noformat}
> New:
> {noformat}
> [nt:base] as [nt:base] /* lucene:slingResourceResolver-1
> indexDefinition: /oak:index/slingResourceResolver-1
> estimatedEntries: 46
> luceneQuery: sling:vanityPath:[* TO *]
> synchronousPropertyCondition: sling:vanityPath is not null
>  */
> {noformat}
> Also, the formatting of the logged query statement should be improved: 
> instead of one single line with the whole statement, the statement should 
> contain line breaks before the important keywords. Example:
> Old:
> {noformat}
> Parsing JCR-SQL2 statement: explain SELECT [sling:vanityPath], 
> [sling:redirect], [sling:redirectStatus] FROM [nt:base] WHERE NOT 
> isdescendantnode('/jcr:system') AND [sling:vanityPath] IS NOT NULL AND 
> FIRST([sling:vanityPath]) > '' ORDER BY FIRST([sling:vanityPath])
> {noformat}
> New:
> {noformat}
> Parsing JCR-SQL2 statement: explain SELECT [sling:vanityPath], 
> [sling:redirect], [sling:redirectStatus]
>   FROM [nt:base]
>   WHERE NOT isdescendantnode('/jcr:system')
>   AND [sling:vanityPath] IS NOT NULL
>   AND FIRST([sling:vanityPath]) > ''
>   ORDER BY FIRST([sling:vanityPath])
> {noformat}





[jira] [Created] (OAK-10527) Improve readability of the explain query output

2023-11-02 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10527:


 Summary: Improve readability of the explain query output
 Key: OAK-10527
 URL: https://issues.apache.org/jira/browse/OAK-10527
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: query
Reporter: Thomas Mueller
Assignee: Thomas Mueller


Currently the output "explain query" of Oak (the query plan) is hard to 
interpret.
A more human-readable output would be better. Example:

Old:
{noformat}
[nt:base] as [nt:base] /* 
lucene:slingResourceResolver-1(/oak:index/slingResourceResolver-1) 
sling:vanityPath:[* TO *] sync:(sling:vanityPath is not null) where 
([nt:base].[sling:vanityPath] is not null) and 
(first([nt:base].[sling:vanityPath]) > '') */
{noformat}

New:
{noformat}
[nt:base] as [nt:base] /* lucene:slingResourceResolver-1
indexDefinition: /oak:index/slingResourceResolver-1
estimatedEntries: 46
luceneQuery: sling:vanityPath:[* TO *]
synchronousPropertyCondition: sling:vanityPath is not null
 */
{noformat}

Also, the formatting of the logged query statement should be improved: instead 
of one single line with the whole statement, the statement should contain line 
breaks before the important keywords. Example:

Old:
{noformat}
Parsing JCR-SQL2 statement: explain SELECT [sling:vanityPath], 
[sling:redirect], [sling:redirectStatus] FROM [nt:base] WHERE NOT 
isdescendantnode('/jcr:system') AND [sling:vanityPath] IS NOT NULL AND 
FIRST([sling:vanityPath]) > '' ORDER BY FIRST([sling:vanityPath])
{noformat}

New:
{noformat}
Parsing JCR-SQL2 statement: explain SELECT [sling:vanityPath], 
[sling:redirect], [sling:redirectStatus]
  FROM [nt:base]
  WHERE NOT isdescendantnode('/jcr:system')
  AND [sling:vanityPath] IS NOT NULL
  AND FIRST([sling:vanityPath]) > ''
  ORDER BY FIRST([sling:vanityPath])
{noformat}
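
A minimal sketch of the statement formatting described above, not the formatter used by Oak. The class StatementFormatter and the keyword list are assumptions for illustration. It inserts a line break and two spaces before each listed keyword.

```java
// Hypothetical illustration; not the formatter used by Oak.
class StatementFormatter {

    // Replaces the whitespace before each keyword with a line break and two spaces.
    static String format(String statement) {
        return statement.replaceAll(
                "\\s+\\b(FROM|WHERE|AND|ORDER BY)\\b\\s+", "\n  $1 ");
    }

    public static void main(String[] args) {
        System.out.println(format(
                "explain SELECT [sling:vanityPath] FROM [nt:base] "
                + "WHERE [sling:vanityPath] IS NOT NULL "
                + "ORDER BY FIRST([sling:vanityPath])"));
    }
}
```

This regular expression is deliberately naive: it only matches upper-case keywords and would also break a line inside a string literal that contains one of them. A real formatter would need to skip quoted text.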






[jira] [Updated] (OAK-10265) Oak-run offline reindex - async lane revert not taking place for stored index def after index import

2023-10-30 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10265:
-
Component/s: query
 indexing

> Oak-run offline reindex - async lane revert not taking place for stored index 
> def after index import
> 
>
> Key: OAK-10265
> URL: https://issues.apache.org/jira/browse/OAK-10265
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: indexing, query
>Reporter: Nitin Gupta
>Assignee: Nitin Gupta
>Priority: Major
> Fix For: 1.54.0
>
>
> During offline reindex using oak-run,
> the index import phase first changes the async property to temp-async and 
> keeps the original value in async-previous property.
> This is reverted when the import is done. However it appears that the revert 
> doesn't happen for the stored index definition and leaves that at 
> async = temp-async
> async-previous = [async, nrt]
> By setting "refresh=true", the stored index definition is copied to the 
> regular index definition.





[jira] [Resolved] (OAK-10518) IndexInfo should have a isActive() method

2023-10-26 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10518.
--
Fix Version/s: 1.60.0
   Resolution: Fixed

> IndexInfo should have a isActive() method
> -
>
> Key: OAK-10518
> URL: https://issues.apache.org/jira/browse/OAK-10518
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: indexing
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.60.0
>
>
> With the composite node store, it is a bit hard to find out whether an index 
> is active, as usually only the latest mounted version of an index is active, 
> unless there is a merges property that resolves.
> The IndexInfoService / IndexInfo class should have a method isActive() so 
> it's easy to find out.





[jira] [Commented] (OAK-10518) IndexInfo should have a isActive() method

2023-10-25 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779531#comment-17779531
 ] 

Thomas Mueller commented on OAK-10518:
--

PR for review https://github.com/apache/jackrabbit-oak/pull/1180

> IndexInfo should have a isActive() method
> -
>
> Key: OAK-10518
> URL: https://issues.apache.org/jira/browse/OAK-10518
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: indexing
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> With the composite node store, it is a bit hard to find out whether an index 
> is active, as usually only the latest mounted version of an index is active, 
> unless there is a merges property that resolves.
> The IndexInfoService / IndexInfo class should have a method isActive() so 
> it's easy to find out.





[jira] [Created] (OAK-10518) IndexInfo should have a isActive() method

2023-10-25 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10518:


 Summary: IndexInfo should have a isActive() method
 Key: OAK-10518
 URL: https://issues.apache.org/jira/browse/OAK-10518
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: indexing
Reporter: Thomas Mueller
Assignee: Thomas Mueller


With the composite node store, it is a bit hard to find out whether an index is 
active, as usually only the latest mounted version of an index is active, 
unless there is a merges property that resolves.

The IndexInfoService / IndexInfo class should have a method isActive() so it's 
easy to find out.





[jira] [Resolved] (OAK-10497) Properties order in FFS can be different across runs

2023-10-25 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10497.
--
Resolution: Fixed

> Properties order in FFS can be different across runs
> 
>
> Key: OAK-10497
> URL: https://issues.apache.org/jira/browse/OAK-10497
> Project: Jackrabbit Oak
>  Issue Type: Task
>Reporter: Nitin Gupta
>Assignee: Thomas Mueller
>Priority: Major
>
> While building the FFS, the order of the properties can be different for the 
> same node across different builds/runs.
>  
> This does not have any impact on indexing, but when the FFS built by 
> different strategies needs to be compared for verification, this sometimes 
> leads to false failures.
>  
> We should ensure a sorted order of the properties of every node in the FFS.
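
A minimal sketch of the proposed fix, assuming a simplified "path|{json}" line format and string-only property values. The class SortedPropertiesSketch and the method serialize are hypothetical and not Oak's FFS writer. Copying the properties into a TreeMap makes the output independent of the iteration order of the input map.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; the line format is a simplified assumption.
class SortedPropertiesSketch {

    // Serializes the properties in name order so the output is stable across runs.
    // Note: no JSON escaping is done here, to keep the sketch short.
    static String serialize(String path, Map<String, String> properties) {
        StringBuilder sb = new StringBuilder(path).append("|{");
        boolean first = true;
        for (Map.Entry<String, String> e : new TreeMap<>(properties).entrySet()) {
            if (!first) {
                sb.append(',');
            }
            sb.append('"').append(e.getKey()).append("\":\"")
              .append(e.getValue()).append('"');
            first = false;
        }
        return sb.append('}').toString();
    }
}
```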





[jira] [Commented] (OAK-10497) Properties order in FFS can be different across runs

2023-10-25 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779395#comment-17779395
 ] 

Thomas Mueller commented on OAK-10497:
--

Merged on 2023-10-20

> Properties order in FFS can be different across runs
> 
>
> Key: OAK-10497
> URL: https://issues.apache.org/jira/browse/OAK-10497
> Project: Jackrabbit Oak
>  Issue Type: Task
>Reporter: Nitin Gupta
>Assignee: Thomas Mueller
>Priority: Major
>
> While building the FFS, the order of the properties can be different for the 
> same node across different builds/runs.
>  
> This does not have any impact on indexing, but when the FFS built by 
> different strategies needs to be compared for verification, this sometimes 
> leads to false failures.
>  
> We should ensure a sorted order of the properties of every node in the FFS.





[jira] [Updated] (OAK-10265) Oak-run offline reindex - async lane revert not taking place for stored index def after index import

2023-10-24 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10265:
-
Description: 
During offline reindex using oak-run,
the index import phase first changes the async property to temp-async and keeps 
the original value in async-previous property.

This is reverted when the import is done. However it appears that the revert 
doesn't happen for the stored index definition and leaves that at 
async = temp-async
async-previous = [async, nrt]

By setting "refresh=true", the stored index definition is copied to the regular 
index definition.

  was:
During offline reindex using oak-run,
the index import phase first changes the async property to temp-async and keeps 
the original value in async-previous property.

This is reverted when the import is done. However it appears that the revert 
doesn't happen for the stored index definition and leaves that at 
async = temp-async
async-previous = [async, nrt]

We should probably add refresh=true to avoid this.


> Oak-run offline reindex - async lane revert not taking place for stored index 
> def after index import
> 
>
> Key: OAK-10265
> URL: https://issues.apache.org/jira/browse/OAK-10265
> Project: Jackrabbit Oak
>  Issue Type: Task
>Reporter: Nitin Gupta
>Assignee: Nitin Gupta
>Priority: Major
> Fix For: 1.54.0
>
>
> During offline reindex using oak-run,
> the index import phase first changes the async property to temp-async and 
> keeps the original value in async-previous property.
> This is reverted when the import is done. However it appears that the revert 
> doesn't happen for the stored index definition and leaves that at 
> async = temp-async
> async-previous = [async, nrt]
> By setting "refresh=true", the stored index definition is copied to the 
> regular index definition.





[jira] [Commented] (OAK-10497) Properties order in FFS can be different across runs

2023-10-19 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1393#comment-1393
 ] 

Thomas Mueller commented on OAK-10497:
--

New PRs:
https://github.com/apache/jackrabbit-oak/pull/1174 -- this I merged without 
running the tests :-/
https://github.com/apache/jackrabbit-oak/pull/1175 -- fixes the bug from the 
above PR

> Properties order in FFS can be different across runs
> 
>
> Key: OAK-10497
> URL: https://issues.apache.org/jira/browse/OAK-10497
> Project: Jackrabbit Oak
>  Issue Type: Task
>Reporter: Nitin Gupta
>Assignee: Thomas Mueller
>Priority: Major
>
> While building the FFS, the order of the properties can be different for the 
> same node across different builds/runs.
>  
> This does not have any impact on indexing, but when the FFS built by 
> different strategies needs to be compared for verification, this sometimes 
> leads to false failures.
>  
> We should ensure a sorted order of the properties of every node in the FFS.





[jira] [Assigned] (OAK-10497) Properties order in FFS can be different across runs

2023-10-19 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller reassigned OAK-10497:


Assignee: Thomas Mueller  (was: Nitin Gupta)

> Properties order in FFS can be different across runs
> 
>
> Key: OAK-10497
> URL: https://issues.apache.org/jira/browse/OAK-10497
> Project: Jackrabbit Oak
>  Issue Type: Task
>Reporter: Nitin Gupta
>Assignee: Thomas Mueller
>Priority: Major
>
> While building the FFS, the order of the properties can be different for the 
> same node across different builds/runs.
>  
> This does not have any impact on indexing, but when the FFS built by 
> different strategies needs to be compared for verification, this sometimes 
> leads to false failures.
>  
> We should ensure a sorted order of the properties of every node in the FFS.





[jira] [Resolved] (OAK-10490) Suggest queries return duplicate entries if prefetch is enabled

2023-10-13 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10490.
--
Fix Version/s: 1.60.0
   Resolution: Fixed

> Suggest queries return duplicate entries if prefetch is enabled
> ---
>
> Key: OAK-10490
> URL: https://issues.apache.org/jira/browse/OAK-10490
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.60.0
>
>
> If prefetch is enabled, and prefetch count is larger than 0, then suggest 
> queries return duplicate results.
> This seems to be caused by oak-search FulltextIndex.FulltextPathCursor: 
> FulltextPathCursor.next() returns a new IndexRow that references currentRow, 
> but pathIterator.next() updates currentRow. As a result, the following two 
> code fragments can return different results:
> {noformat}
> // here, excerpt1 and excerpt2 are different:
> IndexRow row1 = fulltextPathCursor.next();
> String excerpt1 = row1.getValue("rep:excerpt");
> IndexRow row2 = fulltextPathCursor.next();
> String excerpt2 = row2.getValue("rep:excerpt");
> // here, excerpt1 is equal to excerpt2:
> IndexRow row1 = fulltextPathCursor.next();
> IndexRow row2 = fulltextPathCursor.next();
> String excerpt1 = row1.getValue("rep:excerpt");
> String excerpt2 = row2.getValue("rep:excerpt");
> {noformat}





[jira] [Commented] (OAK-10490) Suggest queries return duplicate entries if prefetch is enabled

2023-10-12 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17774555#comment-17774555
 ] 

Thomas Mueller commented on OAK-10490:
--

PR https://github.com/apache/jackrabbit-oak/pull/1148

> Suggest queries return duplicate entries if prefetch is enabled
> ---
>
> Key: OAK-10490
> URL: https://issues.apache.org/jira/browse/OAK-10490
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> If prefetch is enabled, and prefetch count is larger than 0, then suggest 
> queries return duplicate results.
> This seems to be caused by oak-search FulltextIndex.FulltextPathCursor: 
> FulltextPathCursor.next() returns a new IndexRow that references currentRow, 
> but pathIterator.next() updates currentRow. As a result, the following two 
> code fragments can return different results:
> {noformat}
> // here, excerpt1 and excerpt2 are different:
> IndexRow row1 = fulltextPathCursor.next();
> String excerpt1 = row1.getValue("rep:excerpt");
> IndexRow row2 = fulltextPathCursor.next();
> String excerpt2 = row2.getValue("rep:excerpt");
> // here, excerpt1 is equal to excerpt2:
> IndexRow row1 = fulltextPathCursor.next();
> IndexRow row2 = fulltextPathCursor.next();
> String excerpt1 = row1.getValue("rep:excerpt");
> String excerpt2 = row2.getValue("rep:excerpt");
> {noformat}





[jira] [Created] (OAK-10490) Suggest queries return duplicate entries if prefetch is enabled

2023-10-12 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10490:


 Summary: Suggest queries return duplicate entries if prefetch is 
enabled
 Key: OAK-10490
 URL: https://issues.apache.org/jira/browse/OAK-10490
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: query
Reporter: Thomas Mueller
Assignee: Thomas Mueller


If prefetch is enabled, and prefetch count is larger than 0, then suggest 
queries return duplicate results.

This seems to be caused by oak-search FulltextIndex.FulltextPathCursor: 
FulltextPathCursor.next() returns a new IndexRow that references currentRow, 
but pathIterator.next() updates currentRow. As a result, the following two 
code fragments can return different results:

{noformat}
// here, excerpt1 and excerpt2 are different:
IndexRow row1 = fulltextPathCursor.next();
String excerpt1 = row1.getValue("rep:excerpt");
IndexRow row2 = fulltextPathCursor.next();
String excerpt2 = row2.getValue("rep:excerpt");

// here, excerpt1 is equal to excerpt2:
IndexRow row1 = fulltextPathCursor.next();
IndexRow row2 = fulltextPathCursor.next();
String excerpt1 = row1.getValue("rep:excerpt");
String excerpt2 = row2.getValue("rep:excerpt");
{noformat}
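
The aliasing problem can be reproduced in isolation. The following is a minimal sketch with hypothetical classes (SharedRowCursorSketch, AliasingCursor, SnapshotCursor), not Oak's FulltextPathCursor. In the first cursor every returned row reads a field that the cursor keeps updating. In the second, each row captures its own value when next() is called.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration of the aliasing problem; these are not Oak's classes.
class SharedRowCursorSketch {

    interface Row {
        String getValue();
    }

    // Problematic variant: every returned row reads the cursor's mutable field.
    static class AliasingCursor {
        private final Iterator<String> it;
        private String current;

        AliasingCursor(List<String> values) {
            this.it = values.iterator();
        }

        Row next() {
            current = it.next();
            return () -> current; // evaluated lazily, so it sees later updates
        }
    }

    // Fixed variant: each row captures its own value at the time next() is called.
    static class SnapshotCursor {
        private final Iterator<String> it;

        SnapshotCursor(List<String> values) {
            this.it = values.iterator();
        }

        Row next() {
            final String value = it.next();
            return () -> value;
        }
    }
}
```

With the aliasing variant, a caller that reads ahead several rows before asking for their values gets the value of the last row for all of them. This matches the duplicate results described above, on the assumption that prefetch reads rows ahead in that way.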






[jira] [Resolved] (OAK-10399) Automatically pick a merged index over multiple levels

2023-08-30 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10399.
--
Fix Version/s: 1.58.0
   Resolution: Fixed

> Automatically pick a merged index over multiple levels
> --
>
> Key: OAK-10399
> URL: https://issues.apache.org/jira/browse/OAK-10399
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Priority: Major
> Fix For: 1.58.0
>
>
> When using the composite node store for blue-green deployments, multiple 
> versions of an index can exist at the same time, for a short period of time 
> (while both blue and green are running at the same time). In OAK-9301 we 
> support merged indexes.
> What we don't support currently is merged indexes over multiple levels. 
> Example:
> * /oak:index/index-1 (first version of the index)
> * /oak:index/index-1-custom-1 (customization of that index)
> * /oak:index/index-2 (new base version)
> * /oak:index/index-2-custom-1 (auto-merged index)
> * /oak:index/index-3 (the second new base version)
> * /oak:index/index-3-custom-1 (auto-merged index)
> In this case, index-3 is used for queries, instead of index-3-custom-1.
> The reason is the following: whenever we auto-merge, we set the merges 
> property to the previous base version, and the previous customization. This 
> works well for index-2-custom-1, but doesn't work for index-3-custom-1.
> We need to change the index picking algorithm, such that only one level of 
> base indexes is checked: only the existence of index-3. The existence of 
> index-2 must not be checked. 
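
A minimal sketch of the one-level check, assuming index names follow the pattern "<base>-<version>" or "<base>-<version>-custom-<n>" and that all names in the set belong to the same base index. The class MergedIndexPickerSketch and the method pick are hypothetical. This is not Oak's index picking code, and it ignores the merges property entirely.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not Oak's index picking algorithm.
class MergedIndexPickerSketch {

    private static final Pattern NAME =
            Pattern.compile("(.+?)-(\\d+)(?:-custom-(\\d+))?");

    // Picks the index with the highest (base version, customization) pair.
    // A customized index only requires its own base version to exist.
    static String pick(Set<String> indexNames) {
        String best = null;
        int bestVersion = -1;
        int bestCustom = -1;
        for (String name : indexNames) {
            Matcher m = NAME.matcher(name);
            if (!m.matches()) {
                continue;
            }
            String base = m.group(1);
            int version = Integer.parseInt(m.group(2));
            int custom = m.group(3) == null ? 0 : Integer.parseInt(m.group(3));
            // One-level check: index-3-custom-1 needs index-3, but not index-2.
            if (custom > 0 && !indexNames.contains(base + "-" + version)) {
                continue;
            }
            if (version > bestVersion
                    || (version == bestVersion && custom > bestCustom)) {
                best = name;
                bestVersion = version;
                bestCustom = custom;
            }
        }
        return best;
    }
}
```

For the six index names listed above, this sketch selects /oak:index/index-3-custom-1, because only the existence of /oak:index/index-3 is checked.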





[jira] [Assigned] (OAK-10399) Automatically pick a merged index over multiple levels

2023-08-30 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller reassigned OAK-10399:


Assignee: Thomas Mueller

> Automatically pick a merged index over multiple levels
> --
>
> Key: OAK-10399
> URL: https://issues.apache.org/jira/browse/OAK-10399
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.58.0
>
>
> When using the composite node store for blue-green deployments, multiple 
> versions of an index can exist at the same time, for a short period of time 
> (while both blue and green are running at the same time). In OAK-9301 we 
> support merged indexes.
> What we don't support currently is merged indexes over multiple levels. 
> Example:
> * /oak:index/index-1 (first version of the index)
> * /oak:index/index-1-custom-1 (customization of that index)
> * /oak:index/index-2 (new base version)
> * /oak:index/index-2-custom-1 (auto-merged index)
> * /oak:index/index-3 (the second new base version)
> * /oak:index/index-3-custom-1 (auto-merged index)
> In this case, index-3 is used for queries, instead of index-3-custom-1.
> The reason is the following: whenever we auto-merge, we set the merges 
> property to the previous base version, and the previous customization. This 
> works well for index-2-custom-1, but doesn't work for index-3-custom-1.
> We need to change the index picking algorithm, such that only one level of 
> base indexes is checked: only the existence of index-3. The existence of 
> index-2 must not be checked. 





[jira] [Commented] (OAK-10420) Tool to compare Lucene index content

2023-08-28 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759579#comment-17759579
 ] 

Thomas Mueller commented on OAK-10420:
--

PR https://github.com/apache/jackrabbit-oak/pull/1086

> Tool to compare Lucene index content
> 
>
> Key: OAK-10420
> URL: https://issues.apache.org/jira/browse/OAK-10420
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> I want to verify that an Oak Lucene index matches another index. Comparing 
> the number of documents in each index is possible, but this comparison is not 
> sufficient. 
> The main problem is that aggregation order depends on the order in which 
> child nodes are traversed, and this order is not guaranteed to always be the 
> same (e.g. the segment node store returns children in a different order than 
> the document node store). As a result, the index files can have different 
> checksums, so file checksums can't always be compared.
> I would like to create a tool that makes comparison of index content easy. 
> This tool needs to account for small differences caused by the above problem.





[jira] [Created] (OAK-10420) Tool to compare Lucene index content

2023-08-28 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10420:


 Summary: Tool to compare Lucene index content
 Key: OAK-10420
 URL: https://issues.apache.org/jira/browse/OAK-10420
 Project: Jackrabbit Oak
  Issue Type: Improvement
Reporter: Thomas Mueller
Assignee: Thomas Mueller


I want to verify that an Oak Lucene index matches another index. Comparing the 
number of documents in each index is possible, but this comparison is not 
sufficient. 

The main problem is that aggregation order depends on the order in which child 
nodes are traversed, and this order is not guaranteed to always be the same 
(e.g. the segment node store returns children in a different order than the 
document node store). As a result, the index files can have different 
checksums, so file checksums can't always be compared.

I would like to create a tool that makes comparison of index content easy. This 
tool needs to account for small differences caused by the above problem.
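
An order-insensitive comparison of the kind described above could be sketched as 
follows. This is only an illustration under stated assumptions, not the tool 
from this issue: each index is modeled as a plain map from node path to the 
list of aggregated values (no Lucene API is used), and the class and method 
names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: compares two "indexes" while ignoring aggregation order. */
final class IndexContentCompare {

    /** Returns a copy in which the values of every document are sorted. */
    static Map<String, List<String>> normalize(Map<String, List<String>> index) {
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : index.entrySet()) {
            List<String> values = new ArrayList<>(e.getValue());
            Collections.sort(values);
            result.put(e.getKey(), values);
        }
        return result;
    }

    /** Lists the paths whose content differs or that exist in only one index. */
    static List<String> diff(Map<String, List<String>> a, Map<String, List<String>> b) {
        Map<String, List<String>> na = normalize(a);
        Map<String, List<String>> nb = normalize(b);
        TreeSet<String> paths = new TreeSet<>(na.keySet());
        paths.addAll(nb.keySet());
        List<String> differences = new ArrayList<>();
        for (String path : paths) {
            if (!Objects.equals(na.get(path), nb.get(path))) {
                differences.add(path);
            }
        }
        return differences;
    }
}
```

Sorting the values per document before comparing makes the result independent 
of the child traversal order, which a plain file checksum is not.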






[jira] [Commented] (OAK-10399) Automatically pick a merged index over multiple levels

2023-08-14 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754144#comment-17754144
 ] 

Thomas Mueller commented on OAK-10399:
--

PR https://github.com/apache/jackrabbit-oak/pull/1066

> Automatically pick a merged index over multiple levels
> --
>
> Key: OAK-10399
> URL: https://issues.apache.org/jira/browse/OAK-10399
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Priority: Major
>
> When using the composite node store for blue-green deployments, multiple 
> versions of an index can exist at the same time, for a short period of time 
> (while both blue and green are running at the same time). In OAK-9301 we 
> support merged indexes.
> What we don't support currently is merged indexes over multiple levels. 
> Example:
> * /oak:index/index-1 (first version of the index)
> * /oak:index/index-1-custom-1 (customization of that index)
> * /oak:index/index-2 (new base version)
> * /oak:index/index-2-custom-1 (auto-merged index)
> * /oak:index/index-3 (the second new base version)
> * /oak:index/index-3-custom-1 (auto-merged index)
> In this case, index-3 is used for queries, instead of index-3-custom-1.
> The reason is the following: whenever we auto-merge, we set the merges 
> property to the previous base version, and the previous customization. This 
> works well for index-2-custom-1, but doesn't work for index-3-custom-1.
> We need to change the index picking algorithm, such that only one level of 
> base indexes is checked: only the existence of index-3. The existence of 
> index-2 must not be checked. 





[jira] [Created] (OAK-10399) Automatically pick a merged index over multiple levels

2023-08-14 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10399:


 Summary: Automatically pick a merged index over multiple levels
 Key: OAK-10399
 URL: https://issues.apache.org/jira/browse/OAK-10399
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: query
Reporter: Thomas Mueller


When using the composite node store for blue-green deployments, multiple 
versions of an index can exist at the same time, for a short period of time 
(while both blue and green are running at the same time). In OAK-9301 we 
support merged indexes.

What we don't support currently is merged indexes over multiple levels. Example:

* /oak:index/index-1 (first version of the index)
* /oak:index/index-1-custom-1 (customization of that index)
* /oak:index/index-2 (new base version)
* /oak:index/index-2-custom-1 (auto-merged index)
* /oak:index/index-3 (the second new base version)
* /oak:index/index-3-custom-1 (auto-merged index)

In this case, index-3 is used for queries, instead of index-3-custom-1.

The reason is the following: whenever we auto-merge, we set the merges property 
to the previous base version, and the previous customization. This works well 
for index-2-custom-1, but doesn't work for index-3-custom-1.

We need to change the index picking algorithm, such that only one level of base 
indexes is checked: only the existence of index-3. The existence of index-2 
must not be checked. 
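
The selection rule proposed above can be sketched as follows. This is a 
simplified illustration, not Oak's actual index selection code: the class and 
method names are hypothetical, index names are assumed to follow the 
name-version[-custom-n] pattern from the example, and the merges property is 
not modeled.

```java
import java.util.Collection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of picking the active index among versioned index names. */
final class MergedIndexPicker {

    // Matches e.g. "/oak:index/index-3" and "/oak:index/index-3-custom-1".
    private static final Pattern NAME =
            Pattern.compile("(.+?)-(\\d+)(?:-custom-(\\d+))?");

    /** Returns the index path that should serve queries, or null if none matches. */
    static String pick(Collection<String> existingPaths) {
        String best = null;
        long bestProduct = -1;
        long bestCustom = -1;
        for (String path : existingPaths) {
            Matcher m = NAME.matcher(path);
            if (!m.matches()) {
                continue;
            }
            long product = Long.parseLong(m.group(2));
            long custom = m.group(3) == null ? 0 : Long.parseLong(m.group(3));
            // One level only: a customized index needs its own base version
            // (index-3 for index-3-custom-1); older base versions are not checked.
            String base = m.group(1) + "-" + product;
            if (custom > 0 && !existingPaths.contains(base)) {
                continue;
            }
            if (product > bestProduct
                    || (product == bestProduct && custom > bestCustom)) {
                best = path;
                bestProduct = product;
                bestCustom = custom;
            }
        }
        return best;
    }
}
```

With the six index paths from the example, this sketch picks 
/oak:index/index-3-custom-1, because only the existence of index-3 is checked.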







[jira] [Commented] (OAK-10333) Improved logging for queries that that traverse more than 10'000 nodes

2023-08-03 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17750630#comment-17750630
 ] 

Thomas Mueller commented on OAK-10333:
--

PR merged on 2023-08-03

> Improved logging for queries that that traverse more than 10'000 nodes
> --
>
> Key: OAK-10333
> URL: https://issues.apache.org/jira/browse/OAK-10333
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.56.0
>
>
> We have many queries that traverse more than 10'000 nodes. A warning is 
> logged in this case. However, we don't know which code runs these queries.
> Logging a stack trace for this case would help a lot.





[jira] [Resolved] (OAK-10333) Improved logging for queries that that traverse more than 10'000 nodes

2023-08-03 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10333.
--
Resolution: Fixed

> Improved logging for queries that that traverse more than 10'000 nodes
> --
>
> Key: OAK-10333
> URL: https://issues.apache.org/jira/browse/OAK-10333
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.56.0
>
>
> We have many queries that traverse more than 10'000 nodes. A warning is 
> logged in this case. However, we don't know which code runs these queries.
> Logging a stack trace for this case would help a lot.





[jira] [Updated] (OAK-10333) Improved logging for queries that that traverse more than 10'000 nodes

2023-08-03 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10333:
-
Fix Version/s: 1.56.0

> Improved logging for queries that that traverse more than 10'000 nodes
> --
>
> Key: OAK-10333
> URL: https://issues.apache.org/jira/browse/OAK-10333
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.56.0
>
>
> We have many queries that traverse more than 10'000 nodes. A warning is 
> logged in this case. However, we don't know which code runs these queries.
> Logging a stack trace for this case would help a lot.





[jira] [Created] (OAK-10341) Indexing: replace FlatFileStore+PersistedLinkedList with a tree store

2023-07-07 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10341:


 Summary: Indexing: replace FlatFileStore+PersistedLinkedList with 
a tree store
 Key: OAK-10341
 URL: https://issues.apache.org/jira/browse/OAK-10341
 Project: Jackrabbit Oak
  Issue Type: Improvement
Reporter: Thomas Mueller


Currently, for indexing large repositories with the document store, we first 
read all nodes and write them to a sorted file (sorting and merging when 
needed). Then we index from that sorted file (called "FlatFileStore").

There are multiple problems with this mechanism:
* The last merging stage of the flat file store is actually not needed: we 
could index from the un-merged streams. It would save one step where we write 
and read all the data.
* It requires knowing the aggregation in the index definition, in order to have 
a set of "preferred children". If this is unknown, then indexing might take 
nearly infinite time. 
* Even if it is known, indexing might be very slow, especially if there are 
many direct child nodes for some of the nodes that require aggregation. 
* It requires a PersistedLinkedList to avoid running out of memory. This 
persisted linked list uses a key-value store internally. This is additional 
overhead: we store and read the data again. However, access to that storage is 
still done using just an iterator, and not with a key lookup. So performance 
can still be quite bad.
* For parallel indexing, we split the flat file. This is not possible unless 
we know the aggregation, so sometimes splitting is not possible.

We want to explore using a tree store that would solve all of the above 
problems.
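
The difference between iterating over a flat file and doing a key lookup can be 
illustrated with a small sketch. It assumes a sorted in-memory map 
(java.util.TreeMap) as a stand-in for a persistent tree store; the class name 
and the way paths are used as keys are hypothetical, and this is not the design 
proposed in this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sketch: a sorted map keyed by path, standing in for a tree store. */
final class TreeStoreSketch {

    private final TreeMap<String, String> nodes = new TreeMap<>();

    void put(String path, String data) {
        nodes.put(path, data);
    }

    /**
     * Returns the paths of all descendants of a (non-root) node with a range
     * lookup, without scanning unrelated entries. All descendant keys start
     * with path + "/", and '0' is the character right after '/', so they lie
     * in the half-open range [path + "/", path + "0").
     */
    List<String> descendants(String path) {
        return new ArrayList<>(nodes.subMap(path + "/", path + "0").keySet());
    }
}
```

Because the children needed for aggregation can be found by key, no set of 
"preferred children" has to be known up front, and any key can serve as a split 
point for parallel indexing.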





[jira] [Comment Edited] (OAK-10338) PipelinedMergeSortTaskTest is failing on Windows due to line end issues

2023-07-04 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17740033#comment-17740033
 ] 

Thomas Mueller edited comment on OAK-10338 at 7/5/23 5:58 AM:
--

(Note: with https://openjdk.org/jeps/400 "JEP 400: UTF-8 by Default" this might 
not be as relevant any more)

[~baedke] (Sorry just read the issue title...) Yes I think we should change to 
UTF-8 in the ExternalSort.

Do you have Java 18? It would be interesting to know if the test on Windows 
fails or not. (I'm not saying everyone should switch to Java 18... but it would 
be interesting to know.)


was (Author: tmueller):
(Note: with https://openjdk.org/jeps/400 "JEP 400: UTF-8 by Default" this might 
not be as relevant any more)

[~baedke] (Sorry just read the issue title...) Yes I think we should change to 
UTF-8 in the ExternalSort.

> PipelinedMergeSortTaskTest is failing on Windows due to line end issues
> ---
>
> Key: OAK-10338
> URL: https://issues.apache.org/jira/browse/OAK-10338
> Project: Jackrabbit Oak
>  Issue Type: Bug
>  Components: oak-run
>Reporter: Manfred Baedke
>Assignee: Manfred Baedke
>Priority: Minor
>
> I think that this might actually be a bug and not a broken test. The problem 
> is that the method java.io.BufferedWriter.newLine() behaves in a 
> platform-dependent way. It's used here:
> https://github.com/apache/jackrabbit-oak/blob/25c01b81768c77e558078a92a31309910902f3a0/oak-commons/src/main/java/org/apache/jackrabbit/oak/commons/sort/ExternalSort.java#L732C29-L732C29
> As a result, the sorted file will have platform-dependent line endings, while 
> the original file might not. I have no idea if that's a problem.
> [~thomasm], [~chetanm], [~nsantos], wdyt?
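
A minimal sketch of writing platform-independent output follows, assuming the 
goal is a fixed "\n" line ending and UTF-8 encoding. The class and method names 
are hypothetical and this is not the actual ExternalSort code. 
BufferedWriter.newLine() writes the value of the line.separator system 
property, so the sketch writes '\n' explicitly instead.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical sketch: line output that does not depend on the platform. */
final class PortableLineWriter {

    /** Writes each line followed by '\n', regardless of the platform. */
    static void writeLines(Writer out, List<String> lines) {
        try {
            for (String line : lines) {
                out.write(line);
                out.write('\n'); // instead of BufferedWriter.newLine()
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the lines to a file using UTF-8, not the platform default charset. */
    static void writeFile(Path file, List<String> lines) {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            writeLines(out, lines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```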





[jira] [Comment Edited] (OAK-10338) PipelinedMergeSortTaskTest is failing on Windows due to line end issues

2023-07-04 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17740033#comment-17740033
 ] 

Thomas Mueller edited comment on OAK-10338 at 7/5/23 5:55 AM:
--

(Note: with https://openjdk.org/jeps/400 "JEP 400: UTF-8 by Default" this might 
not be as relevant any more)

[~baedke] (Sorry just read the issue title...) Yes I think we should change to 
UTF-8 in the ExternalSort.


was (Author: tmueller):
(Note: with https://openjdk.org/jeps/400 "JEP 400: UTF-8 by Default" this might 
not be as relevant any more)

[~baedke] I'm sorry, I don't have the context; is there a test failing on some 
environment? Yes I think we should change to UTF-8 in the ExternalSort, but we 
would need to know which test is broken, where it is broken, and then verify it 
is fixed with the change.


> PipelinedMergeSortTaskTest is failing on Windows due to line end issues
> ---
>
> Key: OAK-10338
> URL: https://issues.apache.org/jira/browse/OAK-10338
> Project: Jackrabbit Oak
>  Issue Type: Bug
>  Components: oak-run
>Reporter: Manfred Baedke
>Assignee: Manfred Baedke
>Priority: Minor
>
> I think that this might actually be a bug and not a broken test. The problem 
> is that the method java.io.BufferedWriter.newLine() behaves in a 
> platform-dependent way. It's used here:
> https://github.com/apache/jackrabbit-oak/blob/25c01b81768c77e558078a92a31309910902f3a0/oak-commons/src/main/java/org/apache/jackrabbit/oak/commons/sort/ExternalSort.java#L732C29-L732C29
> As a result, the sorted file will have platform-dependent line endings, while 
> the original file might not. I have no idea if that's a problem.
> [~thomasm], [~chetanm], [~nsantos], wdyt?





[jira] [Commented] (OAK-10338) PipelinedMergeSortTaskTest is failing on Windows due to line end issues

2023-07-04 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17740033#comment-17740033
 ] 

Thomas Mueller commented on OAK-10338:
--

(Note: with https://openjdk.org/jeps/400 "JEP 400: UTF-8 by Default" this might 
not be as relevant any more)

[~baedke] I'm sorry, I don't have the context; is there a test failing on some 
environment? Yes I think we should change to UTF-8 in the ExternalSort, but we 
would need to know which test is broken, where it is broken, and then verify it 
is fixed with the change.


> PipelinedMergeSortTaskTest is failing on Windows due to line end issues
> ---
>
> Key: OAK-10338
> URL: https://issues.apache.org/jira/browse/OAK-10338
> Project: Jackrabbit Oak
>  Issue Type: Bug
>  Components: oak-run
>Reporter: Manfred Baedke
>Assignee: Manfred Baedke
>Priority: Minor
>
> I think that this might actually be a bug and not a broken test. The problem 
> is that the method java.io.BufferedWriter.newLine() behaves in a 
> platform-dependent way. It's used here:
> https://github.com/apache/jackrabbit-oak/blob/25c01b81768c77e558078a92a31309910902f3a0/oak-commons/src/main/java/org/apache/jackrabbit/oak/commons/sort/ExternalSort.java#L732C29-L732C29
> As a result, the sorted file will have platform-dependent line endings, while 
> the original file might not. I have no idea if that's a problem.
> [~thomasm], [~chetanm], [~nsantos], wdyt?





[jira] [Commented] (OAK-10333) Improved logging for queries that that traverse more than 10'000 nodes

2023-06-30 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17739106#comment-17739106
 ] 

Thomas Mueller commented on OAK-10333:
--

https://github.com/apache/jackrabbit-oak/pull/1012

> Improved logging for queries that that traverse more than 10'000 nodes
> --
>
> Key: OAK-10333
> URL: https://issues.apache.org/jira/browse/OAK-10333
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> We have many queries that traverse more than 10'000 nodes. A warning is 
> logged in this case. However, we don't know which code runs these queries.
> Logging a stack trace for this case would help a lot.





[jira] [Created] (OAK-10333) Improved logging for queries that that traverse more than 10'000 nodes

2023-06-30 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10333:


 Summary: Improved logging for queries that that traverse more than 
10'000 nodes
 Key: OAK-10333
 URL: https://issues.apache.org/jira/browse/OAK-10333
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: query
Reporter: Thomas Mueller
Assignee: Thomas Mueller


We have many queries that traverse more than 10'000 nodes. A warning is logged 
in this case. However, we don't know which code runs these queries.

Logging a stack trace for this case would help a lot.
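
One way to capture the calling code is to pass a Throwable to the logger when 
the limit is first exceeded. The sketch below uses java.util.logging and a 
hypothetical TraversalWarning class; it does not reflect Oak's actual query 
engine or logging setup.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch: warns once, with a stack trace, when a limit is exceeded. */
final class TraversalWarning {

    private static final Logger LOG =
            Logger.getLogger(TraversalWarning.class.getName());

    private final String statement;
    private final long limit;
    private long count;

    TraversalWarning(String statement, long limit) {
        this.statement = statement;
        this.limit = limit;
    }

    /**
     * Call once per traversed node. Returns true only for the call that logs
     * the warning, so the stack trace is written once per query.
     */
    boolean nodeTraversed() {
        count++;
        if (count != limit + 1) {
            return false;
        }
        // The exception is never thrown; it only carries the call stack.
        LOG.log(Level.WARNING,
                "Query traversed more than " + limit + " nodes: " + statement,
                new Exception("call stack of the code running the query"));
        return true;
    }
}
```

Logging the trace only once per query keeps the log readable even when a query 
traverses far more nodes than the limit.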





[jira] [Resolved] (OAK-10262) Document ASCIIFolder and OakAnalyzer

2023-05-25 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10262.
--
Fix Version/s: 1.54.0
   Resolution: Fixed

> Document ASCIIFolder and OakAnalyzer
> 
>
> Key: OAK-10262
> URL: https://issues.apache.org/jira/browse/OAK-10262
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>  Labels: index, lucene
> Fix For: 1.54.0
>
>






[jira] [Resolved] (OAK-10261) Query with OR clause with COALESCE function incorrectly interpreted

2023-05-24 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10261.
--
Fix Version/s: 1.54.0
   Resolution: Fixed

> Query with OR clause with COALESCE function incorrectly interpreted
> ---
>
> Key: OAK-10261
> URL: https://issues.apache.org/jira/browse/OAK-10261
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.54.0
>
>
> The "coalesce" function incorrectly asks the index to do "is not null" for 
> the first property:
> {noformat}
> SELECT a.*
> FROM [dam:Asset] AS a
> WHERE ((COALESCE(a.[jcr:lastModified], a.[jcr:created]) <
>     cast('2023-05-08T20:51:06.239+03:00' AS date))
>   OR (COALESCE(a.[jcr:lastModified], a.[jcr:created]) =
>     cast('2023-05-08T20:51:06.239+03:00' AS date)))
>
> [dam:Asset] as [asset] /* lucene:fragments-9(/oak:index/fragments-9)
> +jcr:lastModified:[-9223372036854775808 TO 9223372036854775807]
>  */
> {noformat}
> This is because the Coalesce implementation uses an incorrect 
> "getPropertyExistence" method. It is implemented as follows, so that it 
> implies the first operand is not null, which is incorrect: the first operand 
> can be null. Even the second operand can be null; just the combination can't 
> be null - but there seems to be no good reason to inform the index to do this.
> {noformat}
> // this is wrong:
> @Override
> public PropertyExistenceImpl getPropertyExistence() {
>     PropertyExistenceImpl pe = operand1.getPropertyExistence();
>     return pe != null ? pe : operand2.getPropertyExistence();
> }
> {noformat}





[jira] [Commented] (OAK-10262) Document ASCIIFolder and OakAnalyzer

2023-05-24 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725812#comment-17725812
 ] 

Thomas Mueller commented on OAK-10262:
--

https://github.com/apache/jackrabbit-oak/pull/955

> Document ASCIIFolder and OakAnalyzer
> 
>
> Key: OAK-10262
> URL: https://issues.apache.org/jira/browse/OAK-10262
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>  Labels: index, lucene
>






[jira] [Created] (OAK-10262) Document ASCIIFolder and OakAnalyzer

2023-05-24 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10262:


 Summary: Document ASCIIFolder and OakAnalyzer
 Key: OAK-10262
 URL: https://issues.apache.org/jira/browse/OAK-10262
 Project: Jackrabbit Oak
  Issue Type: Improvement
Reporter: Thomas Mueller
Assignee: Thomas Mueller








[jira] [Commented] (OAK-10261) Query with OR clause with COALESCE function incorrectly interpreted

2023-05-24 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725793#comment-17725793
 ] 

Thomas Mueller commented on OAK-10261:
--

PR for review https://github.com/apache/jackrabbit-oak/pull/954

> Query with OR clause with COALESCE function incorrectly interpreted
> ---
>
> Key: OAK-10261
> URL: https://issues.apache.org/jira/browse/OAK-10261
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> The "coalesce" function incorrectly asks the index to do "is not null" for 
> the first property:
> {noformat}
> SELECT a.*
> FROM [dam:Asset] AS a
> WHERE ((COALESCE(a.[jcr:lastModified], a.[jcr:created]) <
>     cast('2023-05-08T20:51:06.239+03:00' AS date))
>   OR (COALESCE(a.[jcr:lastModified], a.[jcr:created]) =
>     cast('2023-05-08T20:51:06.239+03:00' AS date)))
>
> [dam:Asset] as [asset] /* lucene:fragments-9(/oak:index/fragments-9)
> +jcr:lastModified:[-9223372036854775808 TO 9223372036854775807]
>  */
> {noformat}
> This is because the Coalesce implementation uses an incorrect 
> "getPropertyExistence" method. It is implemented as follows, so that it 
> implies the first operand is not null, which is incorrect: the first operand 
> can be null. Even the second operand can be null; just the combination can't 
> be null - but there seems to be no good reason to inform the index to do this.
> {noformat}
> // this is wrong:
> @Override
> public PropertyExistenceImpl getPropertyExistence() {
>     PropertyExistenceImpl pe = operand1.getPropertyExistence();
>     return pe != null ? pe : operand2.getPropertyExistence();
> }
> {noformat}





[jira] [Created] (OAK-10261) Query with OR clause with COALESCE function incorrectly interpreted

2023-05-24 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10261:


 Summary: Query with OR clause with COALESCE function incorrectly 
interpreted
 Key: OAK-10261
 URL: https://issues.apache.org/jira/browse/OAK-10261
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: query
Reporter: Thomas Mueller


The "coalesce" function incorrectly asks the index to do "is not null" for the 
first property:

{noformat}
SELECT a.*
FROM [dam:Asset] AS a
WHERE ((COALESCE(a.[jcr:lastModified], a.[jcr:created]) <
    cast('2023-05-08T20:51:06.239+03:00' AS date))
  OR (COALESCE(a.[jcr:lastModified], a.[jcr:created]) =
    cast('2023-05-08T20:51:06.239+03:00' AS date)))

[dam:Asset] as [asset] /* lucene:fragments-9(/oak:index/fragments-9)  
+jcr:lastModified:[-9223372036854775808 TO 9223372036854775807]  
 */ 
{noformat}

This is because the Coalesce implementation uses an incorrect 
"getPropertyExistence" method. It is implemented as follows, so that it implies 
the first operand is not null, which is incorrect: the first operand can be 
null. Even the second operand can be null; just the combination can't be null - 
but there seems to be no good reason to inform the index to do this.

{noformat}
// this is wrong:
@Override
public PropertyExistenceImpl getPropertyExistence() {
    PropertyExistenceImpl pe = operand1.getPropertyExistence();
    return pe != null ? pe : operand2.getPropertyExistence();
}
{noformat}
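
One possible correction, following the reasoning above, is for COALESCE not to 
report any property-existence condition at all. The sketch below models this 
with simplified, hypothetical types (Operand, PropertyOperand, CoalesceOperand) 
rather than Oak's real classes, and it is not necessarily the change made in 
the actual fix.

```java
import java.util.Map;

/** Hypothetical, simplified stand-in for a query operand. */
interface Operand {
    /** Value for the given row, or null if there is none. */
    String evaluate(Map<String, String> row);

    /** Name of a property that must exist for any match, or null if none. */
    String requiredProperty();
}

final class PropertyOperand implements Operand {
    private final String name;

    PropertyOperand(String name) {
        this.name = name;
    }

    @Override
    public String evaluate(Map<String, String> row) {
        return row.get(name);
    }

    @Override
    public String requiredProperty() {
        return name;
    }
}

final class CoalesceOperand implements Operand {
    private final Operand operand1;
    private final Operand operand2;

    CoalesceOperand(Operand operand1, Operand operand2) {
        this.operand1 = operand1;
        this.operand2 = operand2;
    }

    @Override
    public String evaluate(Map<String, String> row) {
        String value = operand1.evaluate(row);
        return value != null ? value : operand2.evaluate(row);
    }

    @Override
    public String requiredProperty() {
        // Either operand may be null on its own, so no single property
        // can be required from the index.
        return null;
    }
}
```

In this model, a node that has jcr:created but no jcr:lastModified still 
matches, which the "is not null" restriction on the first property would have 
excluded.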





[jira] [Assigned] (OAK-10261) Query with OR clause with COALESCE function incorrectly interpreted

2023-05-24 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller reassigned OAK-10261:


Assignee: Thomas Mueller

> Query with OR clause with COALESCE function incorrectly interpreted
> ---
>
> Key: OAK-10261
> URL: https://issues.apache.org/jira/browse/OAK-10261
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> The "coalesce" function incorrectly asks the index to do "is not null" for 
> the first property:
> {noformat}
> SELECT a.*
> FROM [dam:Asset] AS a
> WHERE ((COALESCE(a.[jcr:lastModified], a.[jcr:created]) <
>     cast('2023-05-08T20:51:06.239+03:00' AS date))
>   OR (COALESCE(a.[jcr:lastModified], a.[jcr:created]) =
>     cast('2023-05-08T20:51:06.239+03:00' AS date)))
>
> [dam:Asset] as [asset] /* lucene:fragments-9(/oak:index/fragments-9)
> +jcr:lastModified:[-9223372036854775808 TO 9223372036854775807]
>  */
> {noformat}
> This is because the Coalesce implementation uses an incorrect 
> "getPropertyExistence" method. It is implemented as follows, so that it 
> implies the first operand is not null, which is incorrect: the first operand 
> can be null. Even the second operand can be null; just the combination can't 
> be null - but there seems to be no good reason to inform the index to do this.
> {noformat}
> // this is wrong:
> @Override
> public PropertyExistenceImpl getPropertyExistence() {
>     PropertyExistenceImpl pe = operand1.getPropertyExistence();
>     return pe != null ? pe : operand2.getPropertyExistence();
> }
> {noformat}





[jira] [Commented] (OAK-10214) Expose node counter value as a metric in Oak

2023-05-08 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17720524#comment-17720524
 ] 

Thomas Mueller commented on OAK-10214:
--

Merged on 2023-05-08

> Expose node counter value as a metric in Oak
> 
>
> Key: OAK-10214
> URL: https://issues.apache.org/jira/browse/OAK-10214
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: indexing
>Reporter: Steffen Van
>Priority: Minor
>
> Expose the node counter jmx bean as a metric in Oak.
> This is done so the value can be scraped and displayed in a Grafana dashboard.





[jira] [Resolved] (OAK-10214) Expose node counter value as a metric in Oak

2023-05-08 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10214.
--
Fix Version/s: 1.52.0
   Resolution: Fixed

> Expose node counter value as a metric in Oak
> 
>
> Key: OAK-10214
> URL: https://issues.apache.org/jira/browse/OAK-10214
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: indexing
>Reporter: Steffen Van
>Priority: Minor
> Fix For: 1.52.0
>
>
> Expose the node counter jmx bean as a metric in Oak.
> This is done so the value can be scraped and displayed in a Grafana dashboard.





[jira] [Resolved] (OAK-10225) Utility to rate limit writes in case async indexing is delayed

2023-05-04 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10225.
--
Fix Version/s: 1.52.0
   Resolution: Fixed

> Utility to rate limit writes in case async indexing is delayed
> --
>
> Key: OAK-10225
> URL: https://issues.apache.org/jira/browse/OAK-10225
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: indexing
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.52.0
>
>
> An application might write too much data too quickly to the repository. In 
> the extreme case, it might even write so much that the application might need 
> to be stopped.
> To avoid this, rate-limiting the writes would be good. For this, we can add a 
> utility class with a method rateLimitWrites(), which can be called in order 
> to limit the writes, in case the async indexes are lagging behind badly.
> The method should return immediately if all async indexes are up-to-date 
> (updated in the last 30 seconds).
> If indexing lanes are lagging behind, however, the method should wait (using 
> Thread.sleep) for at most 1 minute. If the method is called more than once 
> per minute, it should sleep for at most the time that has passed since the 
> last call; that is, an application that calls it a lot will be paused for up 
> to 50% of the time. This assumes indexes will be able to catch up in this 
> situation.





[jira] [Commented] (OAK-10225) Utility to rate limit writes in case async indexing is delayed

2023-05-03 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718947#comment-17718947
 ] 

Thomas Mueller commented on OAK-10225:
--

PR https://github.com/apache/jackrabbit-oak/pull/922

> Utility to rate limit writes in case async indexing is delayed
> --
>
> Key: OAK-10225
> URL: https://issues.apache.org/jira/browse/OAK-10225
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: indexing
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> An application might write too much data too quickly to the repository. In 
> the extreme case, it might even write so much that the application might need 
> to be stopped.
> To avoid this, rate-limiting the writes would be good. For this, we can add a 
> utility class with a method rateLimitWrites(), which can be called in order 
> to limit the writes, in case the async indexes are lagging behind badly.
> The method should return immediately if all async indexes are up-to-date 
> (updated in the last 30 seconds).
> If indexing lanes are lagging behind, however, the method should wait (using 
> Thread.sleep) for at most 1 minute. If the method is called more than once 
> per minute, it should sleep for at most the time that has passed since the 
> last call; that is, an application that calls it a lot will be paused for up 
> to 50% of the time. This assumes indexes will be able to catch up in this 
> situation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (OAK-10225) Utility to rate limit writes in case async indexing is delayed

2023-05-03 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10225:


 Summary: Utility to rate limit writes in case async indexing is 
delayed
 Key: OAK-10225
 URL: https://issues.apache.org/jira/browse/OAK-10225
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: indexing
Reporter: Thomas Mueller
Assignee: Thomas Mueller


An application might write too much data too quickly to the repository. In 
the extreme case, it might even write so much that the application needs 
to be stopped.

To avoid this, rate-limiting the writes would be good. For this, we can add a 
utility class with a method rateLimitWrites(), which can be called in order to 
limit the writes, in case the async indexes are lagging behind badly.

The method should return immediately if all async indexes are up-to-date 
(updated in the last 30 seconds).

If indexing lanes are lagging behind, however, the method should wait (using 
Thread.sleep) for at most 1 minute. If the method is called more than once per 
minute, it should sleep for at most the time that has passed since the previous 
call; that is, an application that calls it frequently will be paused for up to 
50% of the time. This assumes indexes will be able to catch up in this situation.
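
The rules above can be sketched roughly as follows. This is only an illustration under assumptions, not the code from the pull request: the class name WriteRateLimiterSketch, the helper sleepMillis(), and the idea of passing in the last update time of the most delayed lane are all hypothetical.

```java
/** Hypothetical sketch of the described behaviour; not the utility class added to Oak. */
final class WriteRateLimiterSketch {

    static final long UP_TO_DATE_MILLIS = 30_000; // lanes updated within 30 s count as up-to-date
    static final long MAX_SLEEP_MILLIS = 60_000;  // never sleep longer than 1 minute

    private long lastReturn = -1; // when the previous call returned; -1 means "never called"

    /**
     * Computes how long to sleep.
     *
     * @param now               current time in milliseconds
     * @param oldestIndexUpdate last update time of the most delayed indexing lane (assumed input)
     * @param lastReturn        time the previous call returned, or -1 if there was none
     */
    static long sleepMillis(long now, long oldestIndexUpdate, long lastReturn) {
        if (now - oldestIndexUpdate <= UP_TO_DATE_MILLIS) {
            return 0; // all lanes are up-to-date: return immediately
        }
        // time the caller spent working since the previous call returned
        long sinceLastCall = lastReturn < 0 ? MAX_SLEEP_MILLIS : now - lastReturn;
        return Math.max(0, Math.min(MAX_SLEEP_MILLIS, sinceLastCall));
    }

    /** Blocks the caller if indexing is delayed. */
    synchronized void rateLimitWrites(long oldestIndexUpdate) throws InterruptedException {
        long sleep = sleepMillis(System.currentTimeMillis(), oldestIndexUpdate, lastReturn);
        if (sleep > 0) {
            Thread.sleep(sleep);
        }
        lastReturn = System.currentTimeMillis();
    }
}
```

Because the sleep is capped by the time since the previous call returned, a caller that invokes the method in a tight loop spends at most about half of its time sleeping.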




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (OAK-10210) Prefetch breaks Fast Result Size

2023-04-26 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10210.
--
Fix Version/s: 1.52.0
   Resolution: Fixed

Merged on 2023-04-26

> Prefetch breaks Fast Result Size
> 
>
> Key: OAK-10210
> URL: https://issues.apache.org/jira/browse/OAK-10210
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Reporter: Thomas Mueller
>Priority: Major
>  Labels: query
> Fix For: 1.52.0
>
>
> If prefetch is enabled, then fast result size won't work.
> The problem is that PrefetchCursor (in oak-core) doesn't implement getSize as 
> it should. It is easy to fix, but we were not aware of the problem until now.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OAK-10210) Prefetch breaks Fast Result Size

2023-04-25 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17716252#comment-17716252
 ] 

Thomas Mueller commented on OAK-10210:
--

PR https://github.com/apache/jackrabbit-oak/pull/911

> Prefetch breaks Fast Result Size
> 
>
> Key: OAK-10210
> URL: https://issues.apache.org/jira/browse/OAK-10210
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Reporter: Thomas Mueller
>Priority: Major
>  Labels: query
>
> If prefetch is enabled, then fast result size won't work.
> The problem is that PrefetchCursor (in oak-core) doesn't implement getSize as 
> it should. It is easy to fix, but we were not aware of the problem until now.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (OAK-10210) Prefetch breaks Fast Result Size

2023-04-25 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10210:


 Summary: Prefetch breaks Fast Result Size
 Key: OAK-10210
 URL: https://issues.apache.org/jira/browse/OAK-10210
 Project: Jackrabbit Oak
  Issue Type: Improvement
Reporter: Thomas Mueller


If prefetch is enabled, then fast result size won't work.

The problem is that PrefetchCursor (in oak-core) doesn't implement getSize as 
it should. It is easy to fix, but we were not aware of the problem until now.
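
To illustrate the kind of problem described, here is a minimal sketch of a wrapping cursor that forgets (or remembers) to forward the size to the cursor it wraps. The interface SimpleCursor and both classes are hypothetical simplifications made up for this example; they are not Oak's Cursor API or the actual PrefetchCursor.

```java
import java.util.Iterator;
import java.util.List;

/** Hypothetical, simplified cursor; -1 means "size unknown". */
interface SimpleCursor extends Iterator<String> {
    default long getSize() {
        return -1;
    }
}

/** A cursor over an in-memory list that knows its size. */
final class ListCursor implements SimpleCursor {
    private final Iterator<String> it;
    private final long size;

    ListCursor(List<String> paths) {
        this.it = paths.iterator();
        this.size = paths.size();
    }

    @Override public boolean hasNext() { return it.hasNext(); }
    @Override public String next() { return it.next(); }
    @Override public long getSize() { return size; }
}

/** Wrapper that stands in for a prefetching cursor. */
final class PrefetchingCursorSketch implements SimpleCursor {
    private final SimpleCursor delegate;

    PrefetchingCursorSketch(SimpleCursor delegate) {
        this.delegate = delegate;
    }

    @Override public boolean hasNext() { return delegate.hasNext(); }

    @Override public String next() {
        // a real implementation would read ahead and prefetch nodes here
        return delegate.next();
    }

    // Without this override the wrapper would inherit the default (-1),
    // and a fast result size would no longer be available.
    @Override public long getSize() { return delegate.getSize(); }
}
```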



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OAK-9625) Support ordered index for first value of a multi-valued property, node name, and path

2023-04-17 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713112#comment-17713112
 ] 

Thomas Mueller commented on OAK-9625:
-

Improved documentation in 
https://github.com/apache/jackrabbit-oak/commit/941050b89d6db8332ecf30ef56b8bb00a9a54ea7

> Support ordered index for first value of a multi-valued property, node name, 
> and path
> -
>
> Key: OAK-9625
> URL: https://issues.apache.org/jira/browse/OAK-9625
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: indexing, lucene, query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.42.0
>
>
> Keyset pagination 
> https://jackrabbit.apache.org/oak/docs/query/query-engine.html#Keyset_Pagination
>  requires ordered indexing on a property. 
> If all we have is a property "x", which is set on "nt:base" (or a similar 
> node type), then an ordered index on the property "x" can be used for 
> pagination. However, if the property is sometimes multi-valued, then it's not 
> possible, because we don't support ordered indexes on multi-valued properties.
> {noformat}
> /jcr:root//element(*, nt:base)
> [jcr:first(@alias) >= $lastEntry]
> order by jcr:first(@alias), @jcr:path
> /oak:index/aliasIndex
>   - type = lucene
>   - compatVersion = 2
>   - async = async
>   - includedPaths = [ "/" ]
>   - queryPaths = [ "/" ]
>   + indexRules
>     + nt:base
>       + properties
>         + firstAlias
>           - function = "first([alias])"
>           - propertyIndex = true
>           - ordered = true
> {noformat}
> If we have a property that is set on a mixin type (or primary node type), 
> then the index can be much smaller, as we only need to index that node type. 
> However, even here we need a property to do pagination. One option is to 
> order by the lower case version of the name. However, this is quite strange. 
> Also, the node name may not be unique, which complicates things further. It 
> would be good if we can define an ordered index on the path itself (which is 
> unique).
> {noformat}
> select [jcr:path], * from [nt:file] as a
> where path() >= $lastEntry
> and isdescendantnode(a, '/content')
> order by path()
> /oak:index/fileIndex
>   - type = lucene
>   - compatVersion = 2
>   - async = async
>   - includedPaths = [ "/content" ]
>   - queryPaths = [ "/content" ]
>   + indexRules
>     + nt:file
>       + properties
>         + path
>           - function = "path()"
>           - propertyIndex = true
>           - ordered = true
> {noformat}
> It would be good if ordering by node name would use the function index. Test 
> case:
> {noformat}
> select [jcr:path], * from [nt:file] as a
> where name(a) >= $lastEntry
> and isdescendantnode(a, '/content')
> order by name(a), [jcr:path]
> /oak:index/fileIndex
>   - type = lucene
>   - compatVersion = 2
>   - async = async
>   - includedPaths = [ "/content" ]
>   - queryPaths = [ "/content" ]
>   + indexRules
>     + nt:file
>       + properties
>         + nodeName
>           - function = "name()"
>           - propertyIndex = true
>           - ordered = true
> {noformat}
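
The way a client pages through such an ordered index can be sketched in plain Java. This is an in-memory illustration only: the class KeysetPaginationSketch and the method page() are hypothetical, and a sorted set stands in for the ordered index. It mimics "where path() >= $lastEntry order by path()".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

/** Hypothetical in-memory illustration of keyset pagination; not Oak code. */
final class KeysetPaginationSketch {

    /**
     * Returns up to pageSize keys that are greater than or equal to lastEntry,
     * in ascending order.
     */
    static List<String> page(NavigableSet<String> orderedIndex, String lastEntry, int pageSize) {
        List<String> result = new ArrayList<>();
        // tailSet(..., true) is the ">=" condition; iteration order is the "order by"
        for (String key : orderedIndex.tailSet(lastEntry, true)) {
            if (result.size() >= pageSize) {
                break;
            }
            result.add(key);
        }
        return result;
    }
}
```

Because the condition is ">=", the last entry of the previous page is returned again as the first entry of the next page. With a unique key such as the path, the caller can simply skip it; with a non-unique key such as the node name, a second sort key (for example the path) is needed.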



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (OAK-9780) Prefetch node states

2023-03-31 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-9780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-9780:

Description: 
Cache warming for DocumentNodeStore. The goal is for indexing to warm up the 
cache for a few selected paths, to allow faster iterating/reading of the same 
paths later on.

Example usage

The common usage is to read a number of query result paths from the index 
(depending on the configured prefetch count), and then prefetch these nodes 
from the node store. To enable prefetch for all queries, use the 
QueryEngineSettings JMX bean, setting "PrefetchCount" (default 0). Also 
supported is enabling prefetch for an individual query, by appending 
"option(prefetches )":
{noformat}
select * from [dam:Asset]
where contains(*, 'admin') 
option(prefetches 100)
{noformat}

Relative Prefetch

We found that existing code often executes queries, and then for each result 
node, it reads some additional nodes (satellite data). To better support this 
common pattern, we support relative prefetch. This is done by appending 
"option(prefetch(,...))" to the query. If we already know that 
the application will read the child node "jcr:content/comments" and 
"jcr:content/metadata" for each of the results, use:
{noformat}
select * from [dam:Asset]
where contains(*, 'admin') 
option (
prefetches 100, 
prefetch (
  'jcr:content/comments', 
  'jcr:content/metadata'
)
)
{noformat}

We found that often, a prefetch count of 20 to 100 is faster than a higher 
prefetch count. (This is because the cache might evict the entries if they are 
accessed too far in the future.) We recommend testing with different prefetch 
options.
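
For applications that build such queries in code, the option clause can be appended with a small helper. The following is only a sketch: the class PrefetchOptionSketch and the method withPrefetch() are hypothetical, and escaping single quotes by doubling them is an assumption based on common SQL-2 string literal rules.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper that appends the prefetch options to a query string. */
final class PrefetchOptionSketch {

    static String withPrefetch(String query, int prefetchCount, List<String> relativePaths) {
        StringBuilder sb = new StringBuilder(query);
        sb.append(" option(prefetches ").append(prefetchCount);
        if (!relativePaths.isEmpty()) {
            // relative prefetch: paths are relative to each result node
            String paths = relativePaths.stream()
                    .map(p -> "'" + p.replace("'", "''") + "'")
                    .collect(Collectors.joining(", "));
            sb.append(", prefetch(").append(paths).append(")");
        }
        return sb.append(")").toString();
    }
}
```

Keeping the prefetch count as a parameter makes it easy to try different values, as recommended above.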

  was:
Cache warming for DocumentNodeStore. Goal is for indexing to warm up the cache 
for a select few eg paths to allow for faster iterating/reading of the same eg 
paths later on.

Example usage

The common usage is to read a number of query result paths from the index 
(depending on the configured prefetch count), and then prefetches these nodes 
from the node store. To enable prefetch for all queries, use the 
QueryEngineSettings JMX bean, setting "PrefetchCount" (default 0). Also 
supported is enabling prefetch for an individual query, by appending 
"option(prefetches )":
{noformat}
select * from [dam:Asset]
where contains(*, 'admin') 
option(prefetches 100)
{noformat}

Relative Prefetch

We found that existing code often executes queries, and then for each result 
node, it reads some additional nodes (satellite data). To better support this 
common pattern, we support relative prefetch. This is done by appending 
"option(prefetch(,...))" to the query. If we already know that 
the application will read the child node "jcr:content/comments" and 
"jcr:content/metadata" for each of the results, use:
{noformat}
select * from [dam:Asset]
where contains(*, 'admin') 
option (
prefetches 100, 
prefetch (
  'jcr:content/comments', 
  'jcr:content/metadata'
)
)
{noformat}


> Prefetch node states
> 
>
> Key: OAK-9780
> URL: https://issues.apache.org/jira/browse/OAK-9780
> Project: Jackrabbit Oak
>  Issue Type: New Feature
>  Components: api, core, documentmk
>Reporter: Stefan Egli
>Assignee: Marcel Reutegger
>Priority: Major
> Fix For: 1.46.0
>
>
> Cache warming for DocumentNodeStore. The goal is for indexing to warm up the 
> cache for a few selected paths, to allow faster iterating/reading of the 
> same paths later on.
> Example usage
> The common usage is to read a number of query result paths from the index 
> (depending on the configured prefetch count), and then prefetch these nodes 
> from the node store. To enable prefetch for all queries, use the 
> QueryEngineSettings JMX bean, setting "PrefetchCount" (default 0). Also 
> supported is enabling prefetch for an individual query, by appending 
> "option(prefetches )":
> {noformat}
> select * from [dam:Asset]
> where contains(*, 'admin') 
> option(prefetches 100)
> {noformat}
> Relative Prefetch
> We found that existing code often executes queries, and then for each result 
> node, it reads some additional nodes (satellite data). To better support this 
> common pattern, we support relative prefetch. This is done by appending 
> "option(prefetch(,...))" to the query. If we already know that 
> the application will read the child node "jcr:content/comments" and 
> "jcr:content/metadata" for each of the results, use:
> {noformat}
> select * from [dam:Asset]
> where contains(*, 'admin') 
> option (
> prefetches 100, 
> prefetch (
>   'jcr:content/comments', 
>   'jcr:content/metadata'
> )
> )
> {noformat}
> We found that often, prefetch 

[jira] [Updated] (OAK-9780) Prefetch node states

2023-03-31 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-9780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-9780:

Description: 
Cache warming for DocumentNodeStore. The goal is for indexing to warm up the 
cache for a few selected paths, to allow faster iterating/reading of the same 
paths later on.

Example usage

The common usage is to read a number of query result paths from the index 
(depending on the configured prefetch count), and then prefetch these nodes 
from the node store. To enable prefetch for all queries, use the 
QueryEngineSettings JMX bean, setting "PrefetchCount" (default 0). Also 
supported is enabling prefetch for an individual query, by appending 
"option(prefetches )":
{noformat}
select * from [dam:Asset]
where contains(*, 'admin') 
option(prefetches 100)
{noformat}

Relative Prefetch

We found that existing code often executes queries, and then for each result 
node, it reads some additional nodes (satellite data). To better support this 
common pattern, we support relative prefetch. This is done by appending 
"option(prefetch(,...))" to the query. If we already know that 
the application will read the child node "jcr:content/comments" and 
"jcr:content/metadata" for each of the results, use:
{noformat}
select * from [dam:Asset]
where contains(*, 'admin') 
option (
prefetches 100, 
prefetch (
  'jcr:content/comments', 
  'jcr:content/metadata'
)
)
{noformat}

  was:
Cache warming for DocumentNodeStore. Goal is for indexing to warm up the cache 
for a select few eg paths to allow for faster iterating/reading of the same eg 
paths later on.

Example usage:
{noformat}
 select * from [dam:Asset]
where contains(*, 'admin') 
option(prefetches 100)

Relative Prefetch:
select * from [dam:Asset]
where contains(*, 'admin') 
option (
prefetches 100, 
prefetch (
  'jcr:content/comments', 
  'jcr:content/metadata'
)
)
{noformat}


> Prefetch node states
> 
>
> Key: OAK-9780
> URL: https://issues.apache.org/jira/browse/OAK-9780
> Project: Jackrabbit Oak
>  Issue Type: New Feature
>  Components: api, core, documentmk
>Reporter: Stefan Egli
>Assignee: Marcel Reutegger
>Priority: Major
> Fix For: 1.46.0
>
>
> Cache warming for DocumentNodeStore. The goal is for indexing to warm up the 
> cache for a few selected paths, to allow faster iterating/reading of the 
> same paths later on.
> Example usage
> The common usage is to read a number of query result paths from the index 
> (depending on the configured prefetch count), and then prefetch these nodes 
> from the node store. To enable prefetch for all queries, use the 
> QueryEngineSettings JMX bean, setting "PrefetchCount" (default 0). Also 
> supported is enabling prefetch for an individual query, by appending 
> "option(prefetches )":
> {noformat}
> select * from [dam:Asset]
> where contains(*, 'admin') 
> option(prefetches 100)
> {noformat}
> Relative Prefetch
> We found that existing code often executes queries, and then for each result 
> node, it reads some additional nodes (satellite data). To better support this 
> common pattern, we support relative prefetch. This is done by appending 
> "option(prefetch(,...))" to the query. If we already know that 
> the application will read the child node "jcr:content/comments" and 
> "jcr:content/metadata" for each of the results, use:
> {noformat}
> select * from [dam:Asset]
> where contains(*, 'admin') 
> option (
> prefetches 100, 
> prefetch (
>   'jcr:content/comments', 
>   'jcr:content/metadata'
> )
> )
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (OAK-9780) Prefetch node states

2023-03-31 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-9780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-9780:

Description: 
Cache warming for DocumentNodeStore. The goal is for indexing to warm up the 
cache for a few selected paths, to allow faster iterating/reading of the same 
paths later on.

Example usage:
{noformat}
 select * from [dam:Asset]
where contains(*, 'admin') 
option(prefetches 100)

Relative Prefetch:
select * from [dam:Asset]
where contains(*, 'admin') 
option (
prefetches 100, 
prefetch (
  'jcr:content/comments', 
  'jcr:content/metadata'
)
)
{noformat}

  was:Proof of concept of cache warming for DocumentNodeStore. Goal is for 
indexing to warm up the cache for a select few eg paths to allow for faster 
iterating/reading of the same eg paths later on


> Prefetch node states
> 
>
> Key: OAK-9780
> URL: https://issues.apache.org/jira/browse/OAK-9780
> Project: Jackrabbit Oak
>  Issue Type: New Feature
>  Components: api, core, documentmk
>Reporter: Stefan Egli
>Assignee: Marcel Reutegger
>Priority: Major
> Fix For: 1.46.0
>
>
> Cache warming for DocumentNodeStore. The goal is for indexing to warm up the 
> cache for a few selected paths, to allow faster iterating/reading of the 
> same paths later on.
> Example usage:
> {noformat}
>  select * from [dam:Asset]
> where contains(*, 'admin') 
> option(prefetches 100)
> Relative Prefetch:
> select * from [dam:Asset]
> where contains(*, 'admin') 
> option (
> prefetches 100, 
> prefetch (
>   'jcr:content/comments', 
>   'jcr:content/metadata'
> )
> )
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (OAK-10145) Rate limit the log messages for IndexUpdate

2023-03-17 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701727#comment-17701727
 ] 

Thomas Mueller edited comment on OAK-10145 at 3/17/23 1:17 PM:
---

PR merged: 2023-03-17


was (Author: tmueller):
The PR is now merged: https://github.com/apache/jackrabbit-oak/pull/871

> Rate limit the log messages for IndexUpdate
> ---
>
> Key: OAK-10145
> URL: https://issues.apache.org/jira/browse/OAK-10145
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Reporter: Roxana-Elena Balasoiu
>Assignee: Thomas Mueller
>Priority: Major
>
> Those who want to use indexes in read-only mode (read from the indexes, but 
> not write to them) can simply omit the Lucene index editor provider when 
> configuring Oak. However, in this case the following warning is logged many 
> times every second:
>  
> Missing provider for nrt/sync index: {} (rootState.async: {}).
> Please note, it means that index data should be trusted only after this index
> is processed in an async indexing cycle.
>  
> The message is correct, and I think it is good, but it shouldn't be logged 
> all that often: it is filling the disk. Instead, the message should be logged 
> only once per minute, or so.
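
Logging a message at most once per interval can be sketched as below. The class RateLimitedLogSketch and its method shouldLog() are hypothetical names for this illustration; this is not the change made in the Oak pull request.

```java
/** Hypothetical sketch: allows one log message per interval. */
final class RateLimitedLogSketch {

    private final long intervalMillis;
    private boolean loggedBefore;
    private long lastLogged;

    RateLimitedLogSketch(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /** Returns true if the caller should write the message now. */
    synchronized boolean shouldLog(long nowMillis) {
        if (loggedBefore && nowMillis - lastLogged < intervalMillis) {
            return false; // still inside the quiet period
        }
        loggedBefore = true;
        lastLogged = nowMillis;
        return true;
    }
}
```

A caller would create one instance per message (for example with an interval of 60000 ms) and wrap the warning in "if (limiter.shouldLog(System.currentTimeMillis())) { ... }".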



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (OAK-10145) Rate limit the log messages for IndexUpdate

2023-03-17 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10145.
--
Fix Version/s: 1.52.0
   Resolution: Fixed

> Rate limit the log messages for IndexUpdate
> ---
>
> Key: OAK-10145
> URL: https://issues.apache.org/jira/browse/OAK-10145
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Reporter: Roxana-Elena Balasoiu
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.52.0
>
>
> Those who want to use indexes in read-only mode (read from the indexes, but 
> not write to them) can simply omit the Lucene index editor provider when 
> configuring Oak. However, in this case the following warning is logged many 
> times every second:
>  
> Missing provider for nrt/sync index: {} (rootState.async: {}).
> Please note, it means that index data should be trusted only after this index
> is processed in an async indexing cycle.
>  
> The message is correct, and I think it is good, but it shouldn't be logged 
> all that often: it is filling the disk. Instead, the message should be logged 
> only once per minute, or so.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (OAK-10145) Rate limit the log messages for IndexUpdate

2023-03-17 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller reassigned OAK-10145:


Assignee: Thomas Mueller

> Rate limit the log messages for IndexUpdate
> ---
>
> Key: OAK-10145
> URL: https://issues.apache.org/jira/browse/OAK-10145
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Reporter: Roxana-Elena Balasoiu
>Assignee: Thomas Mueller
>Priority: Major
>
> Those who want to use indexes in read-only mode (read from the indexes, but 
> not write to them) can simply omit the Lucene index editor provider when 
> configuring Oak. However, in this case the following warning is logged many 
> times every second:
>  
> Missing provider for nrt/sync index: {} (rootState.async: {}).
> Please note, it means that index data should be trusted only after this index
> is processed in an async indexing cycle.
>  
> The message is correct, and I think it is good, but it shouldn't be logged 
> all that often: it is filling the disk. Instead, the message should be logged 
> only once per minute, or so.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OAK-10145) Rate limit the log messages for IndexUpdate

2023-03-17 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701727#comment-17701727
 ] 

Thomas Mueller commented on OAK-10145:
--

The PR is now merged: https://github.com/apache/jackrabbit-oak/pull/871

> Rate limit the log messages for IndexUpdate
> ---
>
> Key: OAK-10145
> URL: https://issues.apache.org/jira/browse/OAK-10145
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Reporter: Roxana-Elena Balasoiu
>Assignee: Thomas Mueller
>Priority: Major
>
> Those who want to use indexes in read-only mode (read from the indexes, but 
> not write to them) can simply omit the Lucene index editor provider when 
> configuring Oak. However, in this case the following warning is logged many 
> times every second:
>  
> Missing provider for nrt/sync index: {} (rootState.async: {}).
> Please note, it means that index data should be trusted only after this index
> is processed in an async indexing cycle.
>  
> The message is correct, and I think it is good, but it shouldn't be logged 
> all that often: it is filling the disk. Instead, the message should be logged 
> only once per minute, or so.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (OAK-10102) Disable lazy index download by default

2023-02-08 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10102.
--
Fix Version/s: 1.50.0
   Resolution: Fixed

> Disable lazy index download by default
> --
>
> Key: OAK-10102
> URL: https://issues.apache.org/jira/browse/OAK-10102
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene, query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.50.0
>
>
> I found that with the default settings (the system property 
> "oak.lucene.nonLazyIndex" not set), and a decent number of Lucene indexes, 
> each query that might use a Lucene index takes around 30 ms. With the 
> property set, each query takes only 2 ms or so.
> The lazy index download was meant to speed up startup, because only indexes 
> are downloaded that are actually needed. But on my side, I have disabled the 
> setting since then.
> I think it makes sense to change the default value, like it was before. 
> So enable the system property "oak.lucene.nonLazyIndex" by default.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OAK-10102) Disable lazy index download by default

2023-02-06 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684665#comment-17684665
 ] 

Thomas Mueller commented on OAK-10102:
--

[~joerghoh] Yes, I think we can remove the code. But I'm not in a hurry to do 
that; possibly there are still some active uses.

> Disable lazy index download by default
> --
>
> Key: OAK-10102
> URL: https://issues.apache.org/jira/browse/OAK-10102
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene, query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> I found that with the default settings (the system property 
> "oak.lucene.nonLazyIndex" not set), and a decent number of Lucene indexes, 
> each query that might use a Lucene index takes around 30 ms. With the 
> property set, each query takes only 2 ms or so.
> The lazy index download was meant to speed up startup, because only indexes 
> are downloaded that are actually needed. But on my side, I have disabled the 
> setting since then.
> I think it makes sense to change the default value, like it was before. 
> So enable the system property "oak.lucene.nonLazyIndex" by default.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OAK-10102) Disable lazy index download by default

2023-02-05 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684452#comment-17684452
 ] 

Thomas Mueller commented on OAK-10102:
--

[~joerghoh] yes - I attached the original issue. In theory, it is an 
interesting feature: large indexes are only downloaded if queries might need 
them. We had this enabled for some time, and it did reduce the startup time. 

However, the effect was not all that big: indexes are also downloaded when 
doing cost evaluation. So indexes that _might_ be used for a query (but in 
fact are not) are still downloaded unnecessarily. It is much better to 

* remove unneeded indexes (e.g. don't use a "generic" fulltext index that 
indexes all data), and 
* shrink the indexes (e.g. avoid indexing properties that are not strictly 
needed), and
* for very large indexes, use Elastic instead of Lucene, and
* speed up index download by doing it in parallel instead of each file 
sequentially.

All those have a much bigger impact on startup time.

Since then, I haven't used this feature. And now, we found out (quite late) 
that query time is much worse if the feature is enabled.

Because of all that, I think it is better to disable the feature by default.

> Disable lazy index download by default
> --
>
> Key: OAK-10102
> URL: https://issues.apache.org/jira/browse/OAK-10102
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene, query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> I found that with the default settings (the system property 
> "oak.lucene.nonLazyIndex" not set), and a decent number of Lucene indexes, 
> each query that might use a Lucene index takes around 30 ms. With the 
> property set, each query takes only 2 ms or so.
> The lazy index download was meant to speed up startup, because only indexes 
> are downloaded that are actually needed. But on my side, I have disabled the 
> setting since then.
> I think it makes sense to change the default value, like it was before. 
> So enable the system property "oak.lucene.nonLazyIndex" by default.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OAK-10102) Disable lazy index download by default

2023-02-03 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683916#comment-17683916
 ] 

Thomas Mueller commented on OAK-10102:
--

https://github.com/apache/jackrabbit-oak/pull/842

> Disable lazy index download by default
> --
>
> Key: OAK-10102
> URL: https://issues.apache.org/jira/browse/OAK-10102
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene, query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> I found that with the default settings (the system property 
> "oak.lucene.nonLazyIndex" not set), and a decent number of Lucene indexes, 
> each query that might use a Lucene index takes around 30 ms. With the 
> property set, each query takes only 2 ms or so.
> The lazy index download was meant to speed up startup, because only indexes 
> are downloaded that are actually needed. But on my side, I have disabled the 
> setting since then.
> I think it makes sense to change the default value, like it was before. 
> So enable the system property "oak.lucene.nonLazyIndex" by default.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (OAK-10102) Disable lazy index download by default

2023-02-03 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10102:


 Summary: Disable lazy index download by default
 Key: OAK-10102
 URL: https://issues.apache.org/jira/browse/OAK-10102
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: lucene, query
Reporter: Thomas Mueller
Assignee: Thomas Mueller


I found that with the default settings (the system property 
"oak.lucene.nonLazyIndex" not set), and a decent number of Lucene indexes, each 
query that might use a Lucene index takes around 30 ms. With the property set, 
each query takes only 2 ms or so.

The lazy index download was meant to speed up startup, because only indexes are 
downloaded that are actually needed. But on my side, I have disabled the 
setting since then.

I think it makes sense to change the default value, like it was before. So 
enable the system property "oak.lucene.nonLazyIndex" by default.
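
A boolean setting with an "enabled unless explicitly disabled" default could be read as in the sketch below. Only the property name "oak.lucene.nonLazyIndex" comes from this issue; the class NonLazyIndexSettingSketch, the method nonLazyIndex(), and the way the value is parsed are assumptions for illustration, not the actual Oak code.

```java
import java.util.Properties;

/** Hypothetical sketch of reading a boolean setting with a configurable default. */
final class NonLazyIndexSettingSketch {

    static final String PROPERTY = "oak.lucene.nonLazyIndex";

    /** Returns the configured value, or defaultValue if the property is not set. */
    static boolean nonLazyIndex(Properties props, boolean defaultValue) {
        String value = props.getProperty(PROPERTY);
        return value == null ? defaultValue : Boolean.parseBoolean(value);
    }

    public static void main(String[] args) {
        // With a default of true, the lazy download is off unless it is requested explicitly.
        System.out.println(nonLazyIndex(System.getProperties(), true));
    }
}
```

Note that Boolean.getBoolean(name) is not suitable for this, because it returns false when the property is not set.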



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OAK-10096) IndexDefinitionBuilder not exported in oak-search

2023-02-01 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17682920#comment-17682920
 ] 

Thomas Mueller commented on OAK-10096:
--

I guess it's just an oversight.

> IndexDefinitionBuilder not exported in oak-search
> -
>
> Key: OAK-10096
> URL: https://issues.apache.org/jira/browse/OAK-10096
> Project: Jackrabbit Oak
>  Issue Type: Bug
>  Components: search
>Reporter: Julian Reschke
>Priority: Major
>
> I was trying to update a project that used IndexDefinitionBuilder from 
> oak-lucene - which is deprecated - to use oak-search instead. However, in 
> that project, the API does not seem to be exported at all.
> Is this just an oversight? 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (OAK-9979) Deprecate Lucene indexes with compatVersion 1

2023-01-11 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-9979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-9979.
-
Resolution: Done

Documented

> Deprecate Lucene indexes with compatVersion 1
> -
>
> Key: OAK-9979
> URL: https://issues.apache.org/jira/browse/OAK-9979
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: indexing, lucene
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Minor
>
> We should deprecate Lucene indexes with compatVersion 1. I don't think they 
> are used widely, and the disadvantages are too big. To allow removing support 
> for them in the future, we need to deprecate them now.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (OAK-10054) Improved trace level logging of JCR method calls

2023-01-09 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10054.
--
  Assignee: Thomas Mueller
Resolution: Fixed

> Improved trace level logging of JCR method calls
> 
>
> Key: OAK-10054
> URL: https://issues.apache.org/jira/browse/OAK-10054
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: jcr
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.48.0
>
> Attachments: OAK-10054.patch
>
>
> To analyze calls to the JCR API we can enable trace level logging. However, 
> currently only the operation is logged (e.g. getNode), but not the path. 
> Also, applications that have thousands of JCR method calls will write too 
> many log messages.
> That means we currently have both too much logging (too many lines), and not 
> enough logging (not logging useful info).
> It would be good if with trace level setting enabled, JCR method calls are 
> logged in more detail (e.g. path), and optionally with sampling.
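
The combination of more detail per message and fewer messages can be sketched as follows. The class SampledTraceSketch and its method trace() are hypothetical; the attached patch may work quite differently.

```java
/** Hypothetical sketch: builds a detailed trace message for every n-th call only. */
final class SampledTraceSketch {

    private final int sampleEvery;
    private long calls;

    SampledTraceSketch(int sampleEvery) {
        if (sampleEvery < 1) {
            throw new IllegalArgumentException("sampleEvery must be >= 1");
        }
        this.sampleEvery = sampleEvery;
    }

    /**
     * Returns the message to log (operation and path), or null if this call
     * is skipped by the sampling.
     */
    synchronized String trace(String operation, String path) {
        long n = calls++;
        if (n % sampleEvery != 0) {
            return null; // skipped: keeps the log small
        }
        return operation + " " + path;
    }
}
```

With sampleEvery set to 1, every call is logged; larger values reduce the number of lines while still showing which operations and paths are typical.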



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

