[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit
[ https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150243#comment-17150243 ]

Ashish Chopra commented on OAK-8523:

Thanks for your summary [~thomasm]! I really appreciate the questions you raised and the suggestions you provided!

> Best Practices - Property Value Length Limit
> --------------------------------------------
>
>                 Key: OAK-8523
>                 URL: https://issues.apache.org/jira/browse/OAK-8523
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: core, jcr
>            Reporter: Thomas Mueller
>            Priority: Major
>
> Right now, Oak supports very large properties (e.g. String). But 1 MB (or larger) properties are problematic in multiple areas, such as indexing. This matters most for software-as-a-service, where we need to guarantee SLOs, but it also helps in other cases. So we should:
> * (1) Document best practices, e.g. "Property values should be smaller than 100 KB".
> * (2) Introduce "softLimit" and "hardLimit", where softLimit is e.g. 100 KB and hardLimit is configurable, initially Integer.MAX_VALUE by default. Setting the hard limit to a lower value by default is problematic, because it can break existing applications. With a default of infinity, customers can set lower limits in tests first and, once they are happy, in production as well.
> * (3) Log a warning if a property is larger than "softLimit". To avoid logging many warnings (if there are many such properties), we then set softLimit = softLimit * 1.1 (reset to 100 KB at the next repository start). Logging is needed to know what _exactly_ is broken (path, stack trace of the actual usage, ...).
> * (4) Add a metric (monitoring) for detected large properties. Just logging warnings might not be enough.
> * (5) Throttling: we could add flow control (pauses; Thread.sleep) after violations, to improve isolation (to prevent affecting other threads that don't violate the contract).
> * (6) We could expose the violation info in the session, so a framework could check that data after executing custom code and add more info (e.g. a log entry).
> * (7) If a value is larger than the configurable hardLimit, fail the commit or reject setProperty (throw an exception).
> * (8) At some point, in a new Oak version, change the default value of hardLimit to some reasonable number, e.g. 1 MB.
>
> The "property length" is just one case. There are multiple candidates:
> * Number of properties per node
> * Number of elements in multi-valued properties
> * Total size of a node (including inlined properties)
> * Number of direct child nodes, for orderable child nodes
> * Number of direct child nodes, for non-orderable child nodes
> * Size of a transaction
> * Adding observation listeners that listen for all changes (global listeners)
>
> For those cases, new Jira issues should be created.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
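[Editorial note] A minimal sketch of how points (2), (3), and (7) of the proposal could interact. All names here (PropertyLimitChecker, check) are hypothetical, invented for illustration - this is not an actual Oak class:

```java
// Sketch of the proposed soft/hard limit behavior (hypothetical names, not Oak API):
// - a value above softLimit only logs a warning, and softLimit is then escalated
//   by 10% so many similar violations do not flood the log (point 3);
// - a value above hardLimit (default "infinity") is rejected (point 7).
public class PropertyLimitChecker {
    static final long INITIAL_SOFT_LIMIT = 100 * 1024; // 100 KB, reset at repository start
    private long softLimit = INITIAL_SOFT_LIMIT;
    private final long hardLimit;                      // Integer.MAX_VALUE initially
    private int warnings;

    public PropertyLimitChecker(long hardLimit) {
        this.hardLimit = hardLimit;
    }

    /** @throws IllegalArgumentException if the value exceeds the hard limit */
    public void check(String path, long valueLength) {
        if (valueLength > hardLimit) {
            throw new IllegalArgumentException(
                "Property at " + path + " exceeds hard limit: " + valueLength);
        }
        if (valueLength > softLimit) {
            warnings++;
            // In Oak this would go to the logger, including path and stack trace
            System.err.println("Large property at " + path + ": " + valueLength + " bytes");
            softLimit = (long) (softLimit * 1.1);      // escalate to avoid log flooding
        }
    }

    public int getWarningCount() {
        return warnings;
    }

    public long getSoftLimit() {
        return softLimit;
    }
}
```

A metric (point 4) would hang off the same check, incrementing a counter instead of (or in addition to) logging.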
[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit
[ https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150231#comment-17150231 ]

Thomas Mueller commented on OAK-8523:

Target: the node that is referenced. Source: the node where the target is referenced. I understand that someone might want to use paths instead of UUIDs. It can still be done using an index.

Storing the *path* of the target node in the source node:
* Easy for humans to read.
* Simple lookup of the target from the source: Session.getNode(absPath).
* Disadvantage: needs to be updated if the target is moved.
* Disadvantage: needs a query to get the list of sources.

Storing the *UUID* of the target node in the source node:
* Disadvantage: hard for humans to read.
* Simple lookup of the target from the source: Session.getNodeByIdentifier(uuid).
* No need to update anything if the target is moved.
* Simple to get the list of sources: Node.getReferences().
* Disadvantage: it is best to add a UUID only if the node is actually referenced, to avoid an unnecessary entry in the UUID index.

Storing the *list of sources* in the target node:
* Potentially a huge number of entries in the target (for example, if a "header" or "footer" is referenced by millions of nodes). The list needs to be updated whenever a reference is added or removed, which can lead to quadratic time and storage requirements.
* Can potentially result in out-of-memory errors during indexing or other operations.
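[Editorial note] The "needs a query to get the list of sources" point for path-based references can be made concrete. A sketch of building the JCR-SQL2 statement such a lookup would run; the property name {{ref}} and the {{/content}} scope are hypothetical, chosen for illustration:

```java
// Builds the JCR-SQL2 query that finds all source nodes holding a given
// target path in a (hypothetical) "ref" property. With UUID references,
// Node.getReferences() would replace this query entirely.
public class SourceLookupQuery {
    public static String forTargetPath(String targetPath, String scope) {
        // Single quotes inside a path must be doubled in SQL-2 string literals
        String escaped = targetPath.replace("'", "''");
        return "select * from [nt:base] as s"
            + " where s.[ref] = '" + escaped + "'"
            + " and isdescendantnode(s, '" + scope + "')";
    }
}
```

For this query to be cheap, the repository would need a property index on {{ref}} - which is exactly the "It can still be done using an index" caveat above.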
[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit
[ https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150079#comment-17150079 ]

Ashish Chopra commented on OAK-8523:

{quote}
bq. webpage-as-primary-key-cache

Sorry I don't know what that means.
{quote}

That was my way of differentiating between the two "types" of structures that can store the reference cache in question. The following is what I called a "resource-as-primary-key-cache":

{noformat}
/var
  + ref-cache
    + a
      + b
        + my-img.jpg
          + my.site.example.com
            - references = {/my/page/1, /my/page/2}
          + myother.site.example.com
            - references = {/my/page/11}
{noformat}

What follows is what I called a "webpage-as-primary-key-cache":

{noformat}
/var
  + ref-cache
    + my.site.example.com
      + my
        + page
          + 1
            - resources = {/a/b/my-img.jpg, ...}
          + 2
            - resources = {/a/b/my-img.jpg, ...}
    + myother.site.example.com
      + my
        + page
          + 11
            - resources = {/a/b/my-img.jpg, ...}
{noformat}

(I didn't use "source"/"target" terminology because I think it is easy to get confused - sorry.)

bq. if you use a JCR "reference", e.g. using [Node.setProperty(String, Node)|https://docs.adobe.com/docs/en/spec/jsr170/javadocs/jcr-2.0/javax/jcr/Node.html#setProperty(java.lang.String,%20javax.jcr.Node)], then you can get the list of references using [Node.getReferences|https://docs.adobe.com/docs/en/spec/jsr170/javadocs/jcr-2.0/javax/jcr/Node.html#getReferences()]... Internally it is running a query btw.

Thanks for this suggestion! As I understand it, these APIs rely on the jcr:uuid of the resource (i.e., the "source"), right? If yes, do you think we should worry about potential bloat of the UUID index?

bq. The second option is to use a query (not a fulltext query, just a regular query).

As in a query on the "resources" property in the webpage-as-primary-key-cache (defined above), right? Something like:

{noformat}
select * from [nt:base] as a
  where a.[resources] = '/a/b/my-img.jpg'
  and isdescendantnode(a, '/var')
{noformat}

{quote}
bq. with resource-as-primary-key-cache, we just traverse right to the resource we need to find the reference for and read the prop - no search required.

On the other hand, you have many downsides, like the risk of having too many references. ... you can also store paths. In which case you can't use the out-of-the-box feature and have to run the query yourself.
{quote}

It seems there's no easy answer here, as there are tradeoffs involved: some things are better if the limits of the system/repository aren't expected to be reached or breached, while others might have better scaling characteristics but come at the cost of slightly increased complexity and increased dependence on the query subsystem (and, of course, an additional index definition plus its data if JCR references can't be used).

bq. in summary, I think the warning for large string properties and many entries in a multi-value property are justified... In my view, the added complexity to run a query, or to use uuids, is worth the trouble to avoid those problems.

Thanks for the discussion and recommendations [~thomasm]! I think I'd like to go back and get a more concrete idea of the scale we'd be operating at, and see whether the warnings start to appear while operating at that scale. I have just one additional question (which I wrote above as well): if I start using [Node.setProperty(String, Node)|https://docs.adobe.com/docs/en/spec/jsr170/javadocs/jcr-2.0/javax/jcr/Node.html#setProperty(java.lang.String,%20javax.jcr.Node)] and retrieve the list of references via [Node.getReferences|https://docs.adobe.com/docs/en/spec/jsr170/javadocs/jcr-2.0/javax/jcr/Node.html#getReferences()] as you recommended, should I worry about potential UUID index bloat?
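[Editorial note] The lookup difference between the two cache shapes can be sketched with plain maps. This is illustrative only - in the repository these are node trees, not Java maps, and the "scan" corresponds to the SQL-2 query:

```java
import java.util.*;

// Models the two cache shapes from the discussion. Finding "which pages
// reference this resource" is a direct lookup in the resource-keyed shape,
// but a full scan (i.e., a query in the repository) in the webpage-keyed shape.
public class ReferenceCacheShapes {
    // resource path -> referencing pages (resource-as-primary-key-cache)
    static final Map<String, Set<String>> byResource = new HashMap<>();
    // page path -> referenced resources (webpage-as-primary-key-cache)
    static final Map<String, Set<String>> byPage = new HashMap<>();

    static void addReference(String page, String resource) {
        byResource.computeIfAbsent(resource, r -> new TreeSet<>()).add(page);
        byPage.computeIfAbsent(page, p -> new TreeSet<>()).add(resource);
    }

    /** Direct lookup: traverse to the resource entry and read the property. */
    static Set<String> pagesReferencing(String resource) {
        return byResource.getOrDefault(resource, Collections.emptySet());
    }

    /** Scan lookup: what the SQL-2 query over the "resources" property does. */
    static Set<String> pagesReferencingByScan(String resource) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : byPage.entrySet()) {
            if (e.getValue().contains(resource)) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

Both shapes return the same answer; the tradeoff is where the cost lands - per-lookup (scan/query) versus per-entry fan-in (a popular resource accumulating a huge {{references}} list).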
[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit
[ https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150039#comment-17150039 ]

Thomas Mueller commented on OAK-8523:

bq. webpage-as-primary-key-cache

Sorry, I don't know what that means.

bq. we didn't go down this road because lookup in the webpage-as-primary-key-cache would need a (full-text?) query

There are two ways:
* If you use a JCR "reference", e.g. using [Node.setProperty(String, Node)|https://docs.adobe.com/docs/en/spec/jsr170/javadocs/jcr-2.0/javax/jcr/Node.html#setProperty(java.lang.String,%20javax.jcr.Node)], then you can get the list of references using [Node.getReferences|https://docs.adobe.com/docs/en/spec/jsr170/javadocs/jcr-2.0/javax/jcr/Node.html#getReferences()], because this type of reference is an out-of-the-box feature of JCR. Internally it runs a query, btw.
* The second option is to use a query (not a fulltext query, just a regular query).

bq. OTOH, with resource-as-primary-key-cache, we just traverse right to the resource we need to find the reference for and read the prop - no search required.

On the other hand, you have many downsides, like the risk of having too many references.

bq. designed when IDs were considered evil

Well, you can also store paths, in which case you can't use the out-of-the-box feature and have to run the query yourself.

So, in summary, I think the warnings for large string properties and for many entries in a multi-valued property are justified. It is easy to make mistakes, in which case performance will be bad and out-of-memory errors can occur. In my view, the added complexity of running a query, or of using UUIDs, is worth the trouble to avoid those problems.
[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit
[ https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149570#comment-17149570 ]

Ashish Chopra commented on OAK-8523:

Thanks for reviewing the use case and providing suggestions [~thomasm]!

bq. If you update the list of references a lot, then storing it as a binary or a multi-valued property is problematic, if the list is large. In this case, the whole list is stored over and over again.

I don't expect the list of references for a given resource to be updated very frequently - perhaps once every 15-20 minutes. The whole subtree (containing the paths of resources whose references are updated in the sites - i.e., the reference cache's "primary key") could see more frequent updates, but still on the order of tens of minutes, no sooner.

bq. It might be better to store a multi-valued property at the place where the reference is (the source). And not in the target.

If I understand correctly, you mean a structure like:

{noformat}
/var
  + my.site.example.com
    + my
      + page
        + 1
          - resources = {/a/b/my-img.jpg, ...}
        + 2
          - resources = {/a/b/my-img.jpg, ...}
  + myother.site.example.com
    + my
      + page
        + 11
          - resources = {/a/b/my-img.jpg, ...}
{noformat}

which is a webpage-as-primary-key-cache, right?

bq. Because usually (I assume) a source only references a small set of targets. On the other hand, a target can be referenced by a huge number of sources. Right? And then have an index on that property. That's it. Indexes are updated efficiently.

That's not entirely true in my experience (i.e., the number wouldn't be "small" in an absolute sense), but yes, it is _expected_ to be _smaller than_ in the previous scenario. However, as you've already touched upon, we didn't go down this road because a lookup in that cache would need a (full-text?) query (irrespective of whether we do it via multi-valued string props or (potentially inlined) binary props). OTOH, with a resource-as-primary-key-cache, we just traverse right to the resource we need to find the references for and read the property - no search required.

Another issue is updating the 'deletions'. We can only get the webpages where the resource is referenced at _any given time_. With a webpage-as-primary-key-cache, we'd need to issue a full-text search through the whole cache to 'invalidate' the pages which _earlier_ had the resource reference but no longer do. With a resource-as-primary-key-cache, we just delete the existing cache entry and recreate it with the new data.

bq. In [case someone moves, copies, or deletes a source or target] the references can easily get outdated. What about using UUIDs instead of paths (jcr:uuid)?

Sadly, using jcr:uuid is not an option, because we're dealing with (somewhat) legacy content - designed when IDs were considered evil [0] - and thus the webpages contain paths as resource references instead of IDs.

To answer the question, though, considering the resource ({{/a/b/my-img.jpg}}):
* For moves: the references (webpages) will be adjusted, and cache entries will be relocated to align with the new resource path (i.e., the key will be 'renamed').
* For copies: this is effectively a new resource - references will be created as content is authored, and cache entries will be built accordingly.
* For deletes: the cache entry will be invalidated.

Considering the webpage ({{https://my.site.example.com/my/page/1.html}}):
* For moves/copies/deletes: the cache entries for resources referenced within the webpage will be invalidated and a rebuild issued.

[0] https://www.slideshare.net/jukka/content-storage-with-apache-jackrabbit/19-Content_modeling_Davids_model1_Data (see rule #7)
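[Editorial note] The move/delete handling described above amounts to renaming or removing a cache key. A sketch over the same map-based model (hypothetical and illustrative - in the repository this would be a node move or removal under the cache root):

```java
import java.util.*;

// Sketch of the invalidation rules described above for a
// resource-as-primary-key-cache: a move relocates the entry to the new
// resource path ("rename the key"); a delete invalidates the entry.
public class RefCacheMoveHandler {
    // resource path -> referencing pages
    final Map<String, Set<String>> cache = new HashMap<>();

    void onResourceMoved(String oldPath, String newPath) {
        Set<String> refs = cache.remove(oldPath);
        if (refs != null) {
            cache.put(newPath, refs);   // relocate the entry, references unchanged
        }
    }

    void onResourceDeleted(String path) {
        cache.remove(path);             // invalidate the entry
    }
}
```

A copy needs no handler at all: the copied resource starts with no cache entry, and entries appear as references to it are authored.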
[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit
[ https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149523#comment-17149523 ]

Thomas Mueller commented on OAK-8523:

Btw, what happens if someone moves, copies, or deletes a source or target? In that case the references can easily get outdated. What about using UUIDs instead of paths (jcr:uuid)?
[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit
[ https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149515#comment-17149515 ]

Thomas Mueller commented on OAK-8523:

If you update the list of references a lot, then storing it as a binary or a multi-valued property is problematic if the list is large: in that case, the whole list is stored over and over again.

It might be better to store a multi-valued property at the place where the reference is (the source), and _not_ in the target, because usually (I assume) a source only references a small set of targets, while a target can be referenced by a huge number of sources. Right? You can then have an index on that property. That's it. Indexes are updated efficiently.
[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit
[ https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149510#comment-17149510 ]

Ashish Chopra commented on OAK-8523:

{quote}
What you could do is store it as a MV string property in the normal case, and as a binary if it's larger. But that would make it more complex, so it's probably easier to always store it as a binary.
{quote}

Hm, thanks!

{quote}
I don't know what that is so I can't comment. Maybe you can give some examples?
{quote}

Sorry, I should have been clearer. A persisted reference cache allows finding where a resource has been _used_. E.g., if an entity {{/a/b/my-img.jpg}} has been used on webpages at {{https://my.site.example.com/my/page/1.html}}, {{https://my.site.example.com/my/page/2.html}}, and {{https://myother.site.example.com/my/page/11.html}}, then I'd create a tree like:

{noformat}
/var
  + a
    + b
      + my-img.jpg
        + my.site.example.com
          - references = {/my/page/1, /my/page/2}
        + myother.site.example.com
          - references = {/my/page/11}
{noformat}

Here {{references}} is the multi-valued property, which can contain multiple entries. Depending on how many pages reference the resource in question, the number of elements in the MV prop can be high, but realistically never higher than on the order of 10^3 in my experience. Such a structure helps locate the complete list of references for a given resource without running any searches (just via traversal and a property read).

bq. maybe you want to convert the string array to JSON. Of course it's a few more lines of code, but shouldn't be that complicated.

bq. Storing large strings or large multi-value properties comes with a high cost if there are many updates, because you have to store them each time. Also, very large nodes are a problem for caches.

I agree, but the indexing issue would still remain, no? (If there's a full-text index on this subtree, then it'd likely suffer the same issue as having MV props?) Not sure how caching would be impacted, but likely in the same way? Or do binary properties scale better in both cases? (Though the indexing one shouldn't matter to us - the caching one might be relevant.)
[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit
[ https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149488#comment-17149488 ] Thomas Mueller commented on OAK-8523: -

> Usually it'd not be that many strings - much lower in fact (order of 10^1) -
> it can go as high as 10^3 in rare cases

What you could do is store it as an MV string property in the normal case, and as a binary if it's larger. But that would make things more complex, so it's probably easier to always store it as a binary.

> a persisted reference-cache

I don't know what that is, so I can't comment. Maybe you can give some examples?

> write binary-prop serialization/deserialization is slightly more involved

Hm, maybe you want to convert the string array to JSON. Of course it's a few more lines of code, but it shouldn't be that complicated.

> concrete reasons

Yes. Storing large strings or large multi-valued properties comes with a high cost if there are many updates, because you have to store them in full each time. Also, very large nodes are a problem for caches.
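Converting the string array to JSON before storing it as a binary is indeed only a few lines. A minimal sketch using just the JDK (no Oak or JCR types; the class and method names here are illustrative, not Oak API):

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper, not Oak API: serialize a String[] to a JSON array so
// the result can be stored as a single binary property instead of a large
// multi-valued string property.
public class StringArrayJson {

    static String toJson(String[] values) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < values.length; i++) {
            if (i > 0) sb.append(',');
            sb.append('"');
            for (char c : values[i].toCharArray()) {
                // Escape per the JSON grammar: quote, backslash, control chars.
                if (c == '"' || c == '\\') sb.append('\\').append(c);
                else if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                else sb.append(c);
            }
            sb.append('"');
        }
        return sb.append(']').toString();
    }

    // The bytes that would then be wrapped in a javax.jcr.Binary
    // (e.g. via Session.getValueFactory()).
    static byte[] toBinary(String[] values) {
        return toJson(values).getBytes(StandardCharsets.UTF_8);
    }
}
```

Deserialization would go through any JSON parser; as noted, that is a few more lines than Property.getValues(), but not complicated.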
[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit
[ https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149464#comment-17149464 ] Ashish Chopra commented on OAK-8523: Thanks for sharing your perspective [~thomasm]!

bq. other problems are e.g. caches: they typically assume one entry is smaller than 100 MB or so, otherwise they don't work well. It can lead to out-of-memory.

Thanks, that's helpful to know (though 100 MB is far higher than the current warning threshold of 100 KiB for string sizes).

bq. If it's so many strings, why don't you store it as a binary?

That's the thing. _Usually_ it'd not be that many strings - much lower, in fact (order of 10^1). It can go _as high as_ 10^3 in rare cases, but anything larger would be all but impossible.

bq. I don't think it's much more complex to store huge lists as a binary instead of a multi-valued property.

That's why I said _slightly_. Given the JCR APIs [0] [1] [2], reading/writing MV props is near-trivial; for a binary property, serialization/deserialization is slightly more involved (and I don't have any numbers, but I expect dealing with non-inlined binary props to be slower than reading/writing MV props).

bq. How can you be sure that nobody ever indexes this data?

This is expected to be an internal data structure (a persisted reference cache, if I may add) under {{/var}} which doesn't need an index - if anyone were to index it, that would be a huge error.

bq. It's still relevant.

Thanks for this. OAK-1454 only mentions that it is a non-goal for Oak to support large MV properties, but doesn't mention concrete reasons. Would you happen to know some?

[0] https://docs.adobe.com/docs/en/spec/jsr170/javadocs/jcr-2.0/javax/jcr/Node.html#getProperty(java.lang.String)
[1] https://docs.adobe.com/docs/en/spec/jsr170/javadocs/jcr-2.0/javax/jcr/Property.html#getValues()
[2] https://docs.adobe.com/docs/en/spec/jsr170/javadocs/jcr-2.0/javax/jcr/Node.html#setProperty(java.lang.String,%20javax.jcr.Value[],%20int)
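To make the "order of 10^3 path strings" concrete, a quick size check in plain Java (no Oak or JCR dependency; the 100 KB threshold is the soft limit proposed in the issue description, and the class name is made up):

```java
import java.nio.charset.StandardCharsets;

// Illustrative back-of-the-envelope check: total UTF-8 size of all values
// of a multi-valued string property, compared against the proposed soft limit.
public class PropertySizeCheck {
    static final int SOFT_LIMIT_BYTES = 100 * 1024; // 100 KB, per the proposal

    static long totalSize(String[] values) {
        long total = 0;
        for (String v : values) {
            total += v.getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    static boolean exceedsSoftLimit(String[] values) {
        return totalSize(values) > SOFT_LIMIT_BYTES;
    }
}
```

For ~1000 path strings of a few dozen characters each, the total lands in the tens of kilobytes, i.e. under the proposed soft limit; it is the rare much-larger cases that would trip it.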
[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit
[ https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149458#comment-17149458 ] Thomas Mueller commented on OAK-8523: -

> Can you please also help understand the other areas where this could be
> problematic?

Well, for indexing it's a very big problem. I think that alone is enough of a problem. But other problems are e.g. caches: they typically assume one entry is smaller than 100 MB or so, otherwise they don't work well. It can lead to out-of-memory.

> single MV prop, possibly in the order of 10^3

If it's so many strings, why don't you store it as a binary?

> slightly more complex code

Sorry, I don't think it's much more complex to store huge lists as a binary instead of a multi-valued property.

> which are not a candidate for indexing

How can you be sure that nobody ever indexes this data? Often it's other people that define indexes, and they might not know that they need to exclude certain properties. E.g. a generic Lucene fulltext index that indexes all the data.

> Introduced such WARNs for 1000 elements in an MV property back in '14.

It's still relevant.
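The soft-limit warning scheme from the issue description (log a warning, then multiply softLimit by 1.1 so repeated violations of similar size don't flood the log, resetting on repository restart) could be sketched like this. The class name is illustrative; in Oak the message would go through slf4j with the path and a stack trace so the offending usage can be located:

```java
// Illustrative sketch of the proposed soft-limit warning, not Oak code.
public class SoftLimitTracker {
    static final long INITIAL_SOFT_LIMIT = 100 * 1024; // 100 KB; reset on restart
    private long softLimit = INITIAL_SOFT_LIMIT;
    private int warnings = 0;

    // Returns true if a warning was logged for this property write.
    synchronized boolean check(String path, long valueSizeBytes) {
        if (valueSizeBytes <= softLimit) {
            return false;
        }
        warnings++;
        System.err.println("Large property at " + path + ": " + valueSizeBytes
                + " bytes exceeds soft limit " + softLimit);
        softLimit = (long) (softLimit * 1.1); // dampen follow-up warnings
        return true;
    }

    synchronized int warningCount() { return warnings; }
}
```

A metric counter (item 4 of the proposal) could be incremented in the same place as `warnings`, so monitoring does not depend on anyone reading the log.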
[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit
[ https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149424#comment-17149424 ] Ashish Chopra commented on OAK-8523:

{quote}
...There are multiple candidates:
...
* Number of elements for multi-valued properties
{quote}

OAK-1454 introduced such WARNs for 1000 elements in an MV property back in '14. Not sure if this is still relevant (or if the value needs to be raised) after ~6 years.
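The check OAK-1454 introduced is tiny in principle; a sketch of the kind of WARN it describes (the 1000-element threshold is from that issue, the class name is made up):

```java
// Illustrative sketch, not Oak code: flag multi-valued properties whose
// element count crosses the threshold OAK-1454 warns about.
public class MvElementCountCheck {
    static final int WARN_THRESHOLD = 1000;

    // Returns true if writing this many values to one MV property
    // should log a WARN.
    static boolean shouldWarn(int elementCount) {
        return elementCount > WARN_THRESHOLD;
    }
}
```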
[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit
[ https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149420#comment-17149420 ] Ashish Chopra commented on OAK-8523: hi [~thomasm], the issue description states:

bq. Oak supports very large properties (e.g. String). But 1 MB (or larger) properties are problematic in multiple areas like indexing.

Can you please also help me understand the other areas where this could be problematic? I ask because if I want to read/write MV String properties (each string a path, so not huge, but multiple such strings as values of a single MV prop, possibly in the order of 10^3) which are not a candidate for indexing, what should be a cause for concern?

Asking because the alternatives I can think of - say, splitting across multiple nodes, or adding a binary property - come with their own consequences (slightly more complex code, possibly harder for the NodeStore to cache, and potential delays), and I'm not sure under what conditions one should be preferred over the others. Can you please provide your insights? Are there alternative patterns for associating possibly large content/strings (again, not indexing candidates)?

\cc: [~tihom88]