[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit

2020-07-02 Thread Ashish Chopra (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150243#comment-17150243
 ] 

Ashish Chopra commented on OAK-8523:


Thanks for your summary [~thomasm]! I really appreciate the questions you 
raised and suggestions you provided!

> Best Practices - Property Value Length Limit
> 
>
> Key: OAK-8523
> URL: https://issues.apache.org/jira/browse/OAK-8523
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: core, jcr
>Reporter: Thomas Mueller
>Priority: Major
>
> Right now, Oak supports very large properties (e.g. String). But 1 MB (or 
> larger) properties are problematic in multiple areas like indexing. This is 
> more important for software-as-a-service, where we need to guarantee SLOs, 
> but it also helps in other cases. So we should:
> * (1) Document best practices, e.g. "Property values should be smaller than 
> 100 KB".
> * (2) Introduce "softLimit" and "hardLimit", where softLimit is e.g. 100 KB 
> and hardLimit is configurable, and (initially) by default Integer.MAX_VALUE. 
> Setting the hard limits to a lower value by default is problematic, because 
> it can break existing applications. With default value infinity, customers 
> can set lower limits e.g. in tests first, and once they are happy, in 
> production as well.
> * (3) Log a warning if a property is larger than "softLimit". To avoid 
> logging many warnings (if there are many such properties) we then set 
> softLimit = softLimit * 1.1 (reset to 100 KB in the next repository start). 
> Logging is needed to know what _exactly_ is broken (path, stack trace of the 
> actual usage...)
> * (4) Add a metric (monitoring) for detected large properties. Just logging 
> warnings might not be enough.
> * (5) Throttling: we could add flow control (pauses; Thread.sleep) after 
> violations, to improve isolation (to prevent affecting other threads that 
> don't violate the contract).
> * (6) We could expose the violation info in the session, so a framework could 
> check that data after executing custom code, and add more info (e.g. log).
> * (7) If larger than the configurable hardLimit, fail the commit or reject 
> setProperty (throw an exception).
> * (8) At some point, in a new Oak version, change the default value for 
> hardLimit to some reasonable number, e.g. 1 MB.
> The "property length" is just one case. There are multiple candidates:
> 
> * Number of properties for a node
> * Number of elements for multi-valued properties
> * Total size of a node (including inlined properties)
> * Number of direct child nodes for orderable child nodes
> * Number of direct child nodes for non-orderable child nodes
> * Size of transaction
> * Adding observation listeners that listen for all changes (global listeners)
> For those cases, new Jira issues should be made.
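
To make items (2), (3) and (7) above more concrete, here is a minimal sketch of what such a soft/hard-limit check could look like; the class, names and the hook point are illustrative, not existing Oak API:
{code:java}
import javax.jcr.nodetype.ConstraintViolationException;

/** Illustrative sketch only - not existing Oak API. */
class PropertyLengthCheck {
    private long softLimit = 100 * 1024;                  // e.g. 100 KB: warn above this
    private final long hardLimit = Integer.MAX_VALUE;     // initially "unlimited", configurable

    void check(String path, long valueLength) throws ConstraintViolationException {
        if (valueLength > hardLimit) {
            // (7) reject the commit / setProperty
            throw new ConstraintViolationException(
                    "Property at " + path + " exceeds hard limit: " + valueLength + " bytes");
        }
        if (valueLength > softLimit) {
            // (3) log a warning including the path, then raise the threshold
            // to avoid flooding the log; reset on the next repository start
            System.err.println("WARN large property at " + path + ": " + valueLength + " bytes");
            softLimit = (long) (softLimit * 1.1);
        }
    }
}
{code}
A metric counter (4) and optional throttling (5) could hook into the same place.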





[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit

2020-07-02 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150231#comment-17150231
 ] 

Thomas Mueller commented on OAK-8523:
-

Target: the node that is referenced. Source: the node that holds the reference 
to the target.

I understand that someone might want to use paths instead of UUIDs. It can 
still be done using an index.

Storing the *path* of the target node in the source node: 
 * Easy to read for humans.
 * Simple lookup to the target from the source: Session.getNode(absPath).
 * Disadvantage: need to update if the target is moved.
 * Disadvantage: need to run a query to get the list of sources.

Storing the *UUID* of the target node in the source node:
 * Disadvantage: Hard to read for humans.
 * Simple to look up the target from the source: 
Session.getNodeByIdentifier(uuid).
 * No need to update anything if the target is moved.
 * Simple to get the list of sources: Node.getReferences() (see the sketch below).
 * Disadvantage: best to only add a UUID if the node is actually referenced, to 
avoid unnecessary entries in the UUID index.
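
For illustration, a minimal JCR sketch of the UUID-based variant (the paths and the property name are invented for the example):
{code:java}
import javax.jcr.Node;
import javax.jcr.Property;
import javax.jcr.PropertyIterator;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

class ReferenceSketch {
    static void linkAndListSources(Session session) throws RepositoryException {
        Node target = session.getNode("/a/b/my-img.jpg");    // hypothetical target
        target.addMixin("mix:referenceable");                // target gets a jcr:uuid

        Node source = session.getNode("/content/my/page/1"); // hypothetical source
        source.setProperty("resourceRef", target);           // stores a REFERENCE (the target's UUID)
        session.save();

        // List all sources that reference the target.
        PropertyIterator refs = target.getReferences();
        while (refs.hasNext()) {
            Property ref = refs.nextProperty();
            System.out.println("referenced from: " + ref.getParent().getPath());
        }
    }
}
{code}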

Storing the *list of sources* in the target node:
 * Potentially a huge number of entries in the target (for example, if a 
"header" or "footer" is referenced by millions of nodes). The list needs to be 
updated whenever a reference is added or removed, which can lead to quadratic 
write cost and storage, because the growing list is rewritten on each update.
 * Potentially can result in out-of-memory during indexing or other operations.

 



[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit

2020-07-02 Thread Ashish Chopra (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150079#comment-17150079
 ] 

Ashish Chopra commented on OAK-8523:


{quote}
bq. webpage-as-primary-key-cache
Sorry I don't know what that means.
{quote}
That was my way of differentiating between the two "types" of structures that 
can store the reference-cache in question.
Following is what I called a "reference-as-primary-key-cache":
{noformat}
/var
  + ref-cache
    + a
      + b
        + my-img.jpg
          + my.site.example.com
            - references = {/my/page/1, my/page/2}
          + myother.site.example.com
            - references = {/my/page/11}
{noformat}

What follows next is what I called "webpage-as-primary-key-cache"
{noformat}
/var
  + ref-cache
    + my.site.example.com
      + my
        + page
          + 1
            - resources = {/a/b/my-img.jpg, ...}
          + 2
            - resources = {/a/b/my-img.jpg, ...}
    + myother.site.example.com
      + my
        + page
          + 11
            - resources = {/a/b/my-img.jpg, ...}
{noformat}

(I didn't use "source"/"target" terminology because I think it is easy to get 
confused - sorry).

bq. if you use a JCR "reference", e.g. using [Node.setProperty(String, 
Node)|https://docs.adobe.com/docs/en/spec/jsr170/javadocs/jcr-2.0/javax/jcr/Node.html#setProperty(java.lang.String,%20javax.jcr.Node)],
 then you can get the list of references using 
[Node.getReferences|https://docs.adobe.com/docs/en/spec/jsr170/javadocs/jcr-2.0/javax/jcr/Node.html#getReferences()]...
 Internally it is running a query btw.
Thanks for this suggestion!
AIUI, these APIs rely on the jcr:uuid of the resource (i.e., the "source"), 
right? If yes, do you think we should worry about potential bloat of the UUID 
index?
bq. The second option is to use a query (not a fulltext query, just a regular 
query).
As in a query on "resources" property in the webpage-as-primary-key-cache 
(defined above), right? Something like:
{noformat}
select * from [nt:base] as a
where a.[resources] = '/a/b/my-img.jpg' and isdescendantnode(a, '/var')
{noformat}
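
For illustration, a hedged sketch of executing that query through the JCR query API (assuming the cache structure above):
{code:java}
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.Row;
import javax.jcr.query.RowIterator;

class ReferenceQuerySketch {
    static void findReferencingEntries(Session session) throws RepositoryException {
        QueryManager qm = session.getWorkspace().getQueryManager();
        Query q = qm.createQuery(
                "select * from [nt:base] as a"
                        + " where a.[resources] = '/a/b/my-img.jpg'"
                        + " and isdescendantnode(a, '/var')",
                Query.JCR_SQL2);
        RowIterator rows = q.execute().getRows();
        while (rows.hasNext()) {
            Row row = rows.nextRow();
            // Path of the cache entry (webpage) that references the resource.
            System.out.println(row.getPath());
        }
    }
}
{code}
Without a suitable index on {{resources}}, such a query would fall back to traversal in Oak.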

{quote}
bq. with resource-as-primary-key-cache, we just traverse right to the resource 
we need to find the reference for and read the prop - no search required.
On the other hand, you have many downsides, like the risk of having too many 
references.
...
you can also store paths. In which case you can't use the out-of-the-box 
feature and have to run the query yourself.
{quote}
It seems there's no easy answer here, as there are tradeoffs involved - some 
things are better if the limits of the system/repository aren't expected to be 
reached/breached, while others might have better scaling characteristics but 
come at the cost of slightly more complexity and an increased dependence on the 
query subsystem (and, of course, an additional index definition + data if JCR 
references can't be used).

bq. in summary, I think the warning for large string properties and many 
entries in a multi-value property are justified... In my view, the added 
complexity to run a query, or to use uuids, is worth the trouble to avoid those 
problems.
Thanks for the discussion and recommendations [~thomasm]!
I think I'd like to go back and get a more concrete idea of the scale we'd be 
operating at, and see if the warnings start to appear at that scale.


I have just one additional question (which I wrote above as well):
If I start using [Node.setProperty(String, 
Node)|https://docs.adobe.com/docs/en/spec/jsr170/javadocs/jcr-2.0/javax/jcr/Node.html#setProperty(java.lang.String,%20javax.jcr.Node)],
 and retrieve the list of references via 
[Node.getReferences|https://docs.adobe.com/docs/en/spec/jsr170/javadocs/jcr-2.0/javax/jcr/Node.html#getReferences()]
 as you recommended, should I worry about potential UUID index bloat?


[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit

2020-07-02 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150039#comment-17150039
 ] 

Thomas Mueller commented on OAK-8523:
-

> webpage-as-primary-key-cache

Sorry I don't know what that means.

> we didn't go down this road because lookup in the 
> webpage-as-primary-key-cache would need a (full-text?) query

There are two ways:
* if you use a JCR "reference", e.g. using [Node.setProperty(String, 
Node)|setProperty(java.lang.String name, Node value)], then you can get the 
list of references using 
[Node.getReferences|https://docs.adobe.com/docs/en/spec/jsr170/javadocs/jcr-2.0/javax/jcr/Node.html#getReferences()].
 Because this type of references is actually an out-of-the-box feature of JCR. 
Internally it is running a query btw.
* The second option is to use a query (not a fulltext query, just a regular 
query).
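
In Oak, such a regular query over a property is typically backed by a property index. A hedged sketch of defining one through plain JCR (the index node name and the indexed property name are assumptions for this example):
{code:java}
import javax.jcr.Node;
import javax.jcr.PropertyType;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

class PropertyIndexSketch {
    // Creates /oak:index/resources, an Oak property index over the "resources" property.
    static void definePropertyIndex(Session session) throws RepositoryException {
        Node indexes = session.getNode("/oak:index");
        Node def = indexes.addNode("resources", "oak:QueryIndexDefinition");
        def.setProperty("type", "property");
        def.setProperty("propertyNames", new String[] {"resources"}, PropertyType.NAME);
        def.setProperty("reindex", true);
        session.save();
    }
}
{code}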

>  OTOH, with resource-as-primary-key-cache, we just traverse right to the 
> resource we need to find the reference for and read the prop - no search 
> required.

On the other hand, you have many downsides, like the risk of having too many 
references.

> designed when IDs were considered evil 

Well, you can also store paths, in which case you can't use the out-of-the-box 
feature and have to run the query yourself.

So, in summary, I think the warnings for large string properties and for many 
entries in a multi-value property are justified. It is easy to make mistakes, 
in which case performance will be bad and out-of-memory can occur. In my view, 
the added complexity of running a query, or of using UUIDs, is worth the 
trouble to avoid those problems.



[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit

2020-07-01 Thread Ashish Chopra (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149570#comment-17149570
 ] 

Ashish Chopra commented on OAK-8523:


Thanks for reviewing the use case and providing suggestions [~thomasm]!
bq. If you update the list of references a lot, then storing it as a binary or 
a multi-valued property is problematic, if the list is large. In this case, the 
whole list is stored over and over again.
I don't expect the list of references to be updated very frequently - perhaps 
once every 15-20 min for a given reference. The whole subtree (containing the 
paths of resources whose references are updated in the sites - i.e., the 
reference-cache's "primary key") could see more frequent updates, though - but 
still on the order of 10^1 minutes, no sooner.
bq. It might be better to store a multi-valued property at the place where the 
reference is (the source). And not in the target.
IIUC, you mean a structure like:
{noformat}
/var
  + my.site.example.com
    + my
      + page
        + 1
          - resources = {/a/b/my-img.jpg, ...}
        + 2
          - resources = {/a/b/my-img.jpg, ...}
  + myother.site.example.com
    + my
      + page
        + 11
          - resources = {/a/b/my-img.jpg, ...}
{noformat}
which is a webpage-as-primary-key-cache, right?
bq. Because usually (I assume) a source only references a small set of targets. 
On the other hand, a target can be referenced by a huge number of sources. 
Right? And then have a index on that property. That's it. Indexes are updated 
efficiently.
That's not entirely true IME (i.e., the number wouldn't be "small" in an 
absolute sense) - but yes, it is _expected_ to be _smaller than_ in the 
previous scenario.
However, as you've already touched upon, we didn't go down this road because a 
lookup in the cache would need a (full-text?) query (irrespective of whether we 
do it via MV string props or (potentially inlined) binary props). OTOH, with 
the resource-as-primary-key-cache, we just traverse right to the resource we 
need to find the references for and read the prop - no search required.
Another issue was updating the 'deletions'. We can only get the webpages where 
the resource is referenced at _any given time_. If we create a 
webpage-as-primary-key-cache, we'd need to issue a full-text search through the 
whole cache to 'invalidate' the pages which _earlier_ had the 
resource-reference but no longer do. With the resource-as-primary-key-cache, we 
just delete the existing cache entry and recreate it with the new data.
bq. In [the case someone moves, copies, or deletes a source or target] the 
references can easily get outdated. What about using UUIDs instead of paths 
(jcr:uuid)?
Sadly, using jcr:uuid is not an option because we're dealing with (somewhat) 
legacy stuff - designed when IDs were considered evil [0] - and thus the 
webpages contain paths as resource references instead of IDs.

To answer the question, though:
Considering the resource ({{/a/b/my-img.jpg}}):
* for moves:
** the references (webpages) will be adjusted and cache entries will be 
relocated to align with the new resource path (i.e., the key will be 'renamed').
* for copies:
** this is effectively a new resource - the references will be created as 
content is authored and cache entries will be built accordingly.
* for deletes:
** the cache entry will be invalidated.

Considering the webpage ({{https://my.site.example.com/my/page/1.html}}):
* for moves/copies/deletes:
** the cache entries for resources referenced within the affected webpage will 
be invalidated and a rebuild issued.


[0] 
https://www.slideshare.net/jukka/content-storage-with-apache-jackrabbit/19-Content_modeling_Davids_model1_Data
 (see rule #7)


[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit

2020-07-01 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149523#comment-17149523
 ] 

Thomas Mueller commented on OAK-8523:
-

Btw, what happens if someone moves, copies, or deletes a source or target? In 
this case the references can easily get outdated. What about using UUIDs 
instead of paths (jcr:uuid)?



[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit

2020-07-01 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149515#comment-17149515
 ] 

Thomas Mueller commented on OAK-8523:
-

If you update the list of references a lot, then storing it as a binary or a 
multi-valued property is problematic, if the list is large. In this case, the 
whole list is stored over and over again. 

It might be better to store a multi-valued property at the place where the 
reference is (the source). And _not_ in the target. Because usually (I assume) 
a source only references a small set of targets. On the other hand, a target 
can be referenced by a huge number of sources. Right?



[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit

2020-07-01 Thread Ashish Chopra (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149510#comment-17149510
 ] 

Ashish Chopra commented on OAK-8523:


{quote}What you could do is store it as a MV string property in the normal 
case, and as a binary if it's larger. But that would make it more complex, so 
it's probably easier to always store it as a binary.
{quote}
Hm, thanks!
{quote}I don't know what that is so I can't comment. Maybe you can give some 
examples?
{quote}
Sorry, I should have been clearer.
 A persisted reference-cache allows finding where a resource has been _used_. 
E.g., if an entity {{/a/b/my-img.jpg}} has been used on a webpage at 
{{https://my.site.example.com/my/page/1.html}}, 
{{https://my.site.example.com/my/page/2.html}} and 
{{https://myother.site.example.com/my/page/11.html}}, then I'd create a tree 
like:
{noformat}
/var
  + a
    + b
      + my-img.jpg
        + my.site.example.com
          - references = {/my/page/1, my/page/2}
        + myother.site.example.com
          - references = {/my/page/11}
{noformat}
{{references}} is the MV property that can contain multiple entries - 
depending on how many pages reference the resource in question, the number of 
elements in the MV prop can be high, but realistically never higher than on 
the order of 10^3 IME.
Such a structure helps locate the complete list of references for a given 
resource without running any searches (just via traversal and a property read).
bq. maybe you want to convert the string array to JSON. Of course it's a few 
more lines of code, but shouldn't be that complicated.
bq. Storing large strings or large multi-value properties comes with a high 
cost if there are many updates, because you have to store them each time. Also, 
very large nodes are a problem for caches.
I agree, but the indexing issue would still remain, no? (If there's a full-text 
index on this subtree, it'd likely suffer the same issue as having MV props?) 
Not sure how caching would be impacted, but likely in the same way?
Or do binary properties scale better in both cases (though the indexing one 
shouldn't matter to us - the caching one might be relevant)?



[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit

2020-07-01 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149488#comment-17149488
 ] 

Thomas Mueller commented on OAK-8523:
-

> Usually it'd not be that many strings - much lower in fact (order of 10^1) - 
> it can go as high as 10^3 in rare cases

What you could do is store it as an MV string property in the normal case, and 
as a binary if it's larger. But that would make it more complex, so it's 
probably easier to always store it as a binary.

> a persisted reference-cache

I don't know what that is so I can't comment. Maybe you can give some examples?

> write binary-prop serialization/deserialization is slightly more involved

Hm, maybe you want to convert the string array to JSON. Of course it's a few 
more lines of code, but shouldn't be that complicated.
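
A hedged sketch of that idea, assuming Jackson is available for the JSON step (the property name is invented for the example):
{code:java}
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import javax.jcr.Binary;
import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

import com.fasterxml.jackson.databind.ObjectMapper;

class JsonBinarySketch {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Serialize the reference list to a JSON array and store it as a binary property.
    static void write(Session session, Node cacheEntry, String[] references)
            throws RepositoryException, IOException {
        byte[] json = MAPPER.writeValueAsBytes(references);
        Binary bin = session.getValueFactory().createBinary(new ByteArrayInputStream(json));
        cacheEntry.setProperty("referencesJson", bin);   // hypothetical property name
        session.save();
    }

    // Read the binary back and deserialize it into a String[].
    static String[] read(Node cacheEntry) throws RepositoryException, IOException {
        Binary bin = cacheEntry.getProperty("referencesJson").getBinary();
        try (InputStream in = bin.getStream()) {
            return MAPPER.readValue(in, String[].class);
        }
    }
}
{code}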

> concrete reasons

Yes. Storing large strings or large multi-value properties comes with a high 
cost if there are many updates, because you have to store them each time. Also, 
very large nodes are a problem for caches.





[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit

2020-07-01 Thread Ashish Chopra (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149464#comment-17149464
 ] 

Ashish Chopra commented on OAK-8523:


Thanks for sharing your perspective [~thomasm]!
bq. other problems are e.g. caches: they typically assume one entry is smaller 
than 100 MB or so, otherwise they don't work well. It can lead to out-of-memory.
Thanks, that's helpful to know (though 100 MB is far higher than the current 
warning threshold of 100 KiB for String sizes).
bq. If it's so many strings, why don't you store it as a binary?
That's the thing. _Usually_ it'd not be that many strings - much lower in fact 
(order of 10^1) - it can go _as high as_ 10^3 in rare cases, but anything 
larger would be well-nigh an impossibility.
bq. I don't think it's much more complex to store huge lists as a binary 
instead of a multi-valued property.
That's why I added _slightly_. Given the JCR APIs [0] [1] [2], reading/writing 
MV props is near-trivial - however, for writing binary props, 
serialization/deserialization is slightly more involved (and I don't have any 
numbers, but I expect dealing with non-inlined binary props to be slower than 
reading/writing MV props).
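
For reference, the MV read/write path via those APIs is indeed short - a minimal sketch (property name and values invented for the example):
{code:java}
import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Value;

class MultiValueSketch {
    static void readWrite(Node cacheEntry) throws RepositoryException {
        // Write: Node.setProperty(String, String[]) creates a multi-valued STRING property.
        cacheEntry.setProperty("references", new String[] {"/my/page/1", "/my/page/2"});

        // Read: Property.getValues() returns one Value per element.
        Value[] values = cacheEntry.getProperty("references").getValues();
        for (Value v : values) {
            System.out.println(v.getString());
        }
    }
}
{code}
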
bq. How can you be sure that nobody ever indexes this data?
This is expected to be an internal data structure (a persisted reference-cache, 
if I may add) under {{/var}} which doesn't need an index - if anyone were to 
index it, it'd be a huge error.
bq. It's still relevant.
Thanks for this. OAK-1454 only mentions it is a non-goal for Oak to provide 
large MV support, but doesn't mention concrete reasons. Would you happen to 
know some?

[0] 
https://docs.adobe.com/docs/en/spec/jsr170/javadocs/jcr-2.0/javax/jcr/Node.html#getProperty(java.lang.String)
[1] 
https://docs.adobe.com/docs/en/spec/jsr170/javadocs/jcr-2.0/javax/jcr/Property.html#getValues()
[2] 
https://docs.adobe.com/docs/en/spec/jsr170/javadocs/jcr-2.0/javax/jcr/Node.html#setProperty(java.lang.String,%20javax.jcr.Value[],%20int)



[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit

2020-07-01 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149458#comment-17149458
 ] 

Thomas Mueller commented on OAK-8523:
-

> Can you please also help understand the other areas where this could be 
> problematic?

Well, for indexing it's a very big problem. I think that alone is enough of a 
problem. But other problems are e.g. caches: they typically assume one entry is 
smaller than 100 MB or so, otherwise they don't work well. It can lead to 
out-of-memory.

> single MV prop, possibly in the order of 10^3

If it's so many strings, why don't you store it as a binary?

> slightly more complex code

Sorry I don't think it's much more complex to store huge lists as a binary 
instead of a multi-valued property.

> which are not a candidate for indexing

How can you be sure that nobody ever indexes this data? Often it's other 
people that define indexes, and they might not know that they need to exclude 
certain properties. E.g. a generic lucene fulltext index that indexes all the 
data.

> Introduced such WARNs for 1000 elements in an MV property back in '14. 

It's still relevant.



[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit

2020-07-01 Thread Ashish Chopra (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149424#comment-17149424
 ] 

Ashish Chopra commented on OAK-8523:


{quote}
...There are multiple candidates:
.
.
.
* Number of elements for multi-valued properties
{quote}
OAK-1454 introduced such WARNs for 1000 elements in an MV property back in '14. 
Not sure if this is still relevant (or if the value needs to be raised) after 
~6 years.



[jira] [Commented] (OAK-8523) Best Practices - Property Value Length Limit

2020-07-01 Thread Ashish Chopra (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149420#comment-17149420
 ] 

Ashish Chopra commented on OAK-8523:


hi [~thomasm], the issue description states:
bq. Oak supports very large properties (e.g. String). But 1 MB (or larger) 
properties are problematic in multiple areas like indexing.
Can you please also help me understand the other areas where this could be 
problematic?
I ask because if I want to read/write MV String properties (each string a path, 
so not huge, but multiple such strings as values of a single MV prop, possibly 
on the order of 10^3) which are not a candidate for indexing, what should be 
the cause for concern?
The alternatives I can think of - say, splitting across multiple nodes, or 
adding a binary property - come with their own consequences (slightly more 
complex code, possibly harder for the NodeStore to cache, and potential 
delays), and I'm not sure under what conditions one should be preferred over 
the other(s).

Can you please share your insights? Are there alternative patterns for 
associating possibly large content/strings (again, not indexing candidates)?

cc: [~tihom88]
