I think the main problem is not really this specific case, but a
more general habit: when people design relational databases, they
routinely use references (or, more properly, joins) to define data that
belongs logically to many entities but should not be duplicated.
Imagine that you have a company tree, with "positions",
"departments", "employees", "health plans" etc.
An employee could belong to a department, have a position and a
health plan, but typically you would not make all those nodes child
nodes of the employee: you would instead define references to the
proper node in the "position" and "health plan" subtrees.
It's easy to see how, in a large company, there could be thousands of
employees holding the same position and health plan, and those
specific nodes ("Secretary" and "Plan A") would have thousands of
references pointing to them.
So, given the issue as explained by Marcel that "whenever a
reference is added that points to a node N the complete set of
references pointing to N is re-written to the persistence manager",
it seems that using references to a very "popular" node is going to
create problems in the long term.
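To get a rough feel for the cost, here is a back-of-the-envelope sketch in Java. It assumes, purely for illustration, that every reference is added in its own save and that each save rewrites the full set of references to the target node. The class and method names are made up; this is not Jackrabbit code.

```java
public class ReferenceRewriteCost {

    // Total reference entries written when `count` references to the same
    // node are added one at a time, assuming (hypothetically) that each
    // addition rewrites the whole set: 1 + 2 + ... + count.
    static long entriesWritten(long count) {
        long total = 0;
        for (long size = 1; size <= count; size++) {
            total += size; // the set holds `size` entries after this addition
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(entriesWritten(1_000));  // 500500
        System.out.println(entriesWritten(80_000)); // 3200040000
    }
}
```

Under these assumptions the work grows quadratically with the number of references, which is why a "popular" node looks worrying. Adding many references in a single save would presumably soften this.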
What would be the right way to model this? Maybe use a "path"
property to point to the node instead? Of course, it would not be as
easy to use as a reference, and it would require global updates
if the target node ever moved, but I can't see other options.
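To make the trade-off concrete, here is a small in-memory sketch of the "path" property idea. It uses plain maps as a stand-in for a repository rather than the JCR API, and all names (PathPointerSketch, positionPathOf, the example paths) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class PathPointerSketch {

    // Stand-in for a repository: absolute path -> node name.
    static final Map<String, String> nodesByPath = new HashMap<>();

    // Stand-in for each employee's "path" property: employee -> target path.
    static final Map<String, String> positionPathOf = new HashMap<>();

    // Resolve the pointer; returns null when the stored path is stale.
    static String resolvePosition(String employee) {
        return nodesByPath.get(positionPathOf.get(employee));
    }

    // Move a node, then fix every pointer that still names the old path.
    // Returns how many pointers had to be updated.
    static int moveNode(String oldPath, String newPath) {
        nodesByPath.put(newPath, nodesByPath.remove(oldPath));
        int updated = 0;
        for (Map.Entry<String, String> e : positionPathOf.entrySet()) {
            if (e.getValue().equals(oldPath)) {
                e.setValue(newPath);
                updated++;
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        nodesByPath.put("/positions/secretary", "Secretary");
        positionPathOf.put("alice", "/positions/secretary");
        positionPathOf.put("bob", "/positions/secretary");

        System.out.println(resolvePosition("alice"));
        System.out.println(moveNode("/positions/secretary", "/roles/secretary"));
        System.out.println(resolvePosition("bob"));
    }
}
```

Adding a pointer only touches the employee, so nothing is rewritten on the target. The price shows up in moveNode: a single move forces an update of every employee that points at the old path.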
Any suggestions?
Alessandro Bologna
On Apr 26, 2007, at 2:38 PM, Jukka Zitting wrote:
Hi,
On 4/26/07, Stefan Kurla <[EMAIL PROTECTED]> wrote:
I would appreciate the thoughts on references though. Reason being
that one of the biggest strengths of JSR-170 is the ability to store
references. I imagine a situation where I could have a nodetype called
docType which is either of the strings pdf or word. Say 80% of my
documents are word documents. Then the docType will have a reference
to 80% of all documents in my repository. If my repository is 100,000
files then docType references 80,000 nodes.
If what you say is correct that at every new reference, the complete
set of references is rewritten, then obviously this is a bottleneck.
Should such a situation be avoided?
Why would you need to use such a reference structure? I would rather
use node types to model such information. A search query like
//element(*,my:wordDocument) will efficiently return all such Word
documents in your workspace.
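As a rough illustration of the idea, here is an in-memory sketch in Java. It is not the JCR query API: a plain map stands in for the workspace, findByType only mimics what a query like //element(*,my:wordDocument) asks for, and my:pdfDocument and the document paths are made-up names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NodeTypeQuerySketch {

    // Stand-in for a workspace: document path -> primary node type name.
    static final Map<String, String> primaryTypeOf = new LinkedHashMap<>();

    // Adding a document records the type on the document itself; no shared
    // "docType" node is touched, so there is no reference set to rewrite.
    static void addDocument(String path, String nodeType) {
        primaryTypeOf.put(path, nodeType);
    }

    // Rough analogue of //element(*, nodeType): all paths with that type.
    // A real repository would presumably answer this from its search index
    // instead of scanning every node as this sketch does.
    static List<String> findByType(String nodeType) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String> e : primaryTypeOf.entrySet()) {
            if (e.getValue().equals(nodeType)) {
                matches.add(e.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        addDocument("/docs/a.doc", "my:wordDocument");
        addDocument("/docs/b.pdf", "my:pdfDocument");
        addDocument("/docs/c.doc", "my:wordDocument");
        System.out.println(findByType("my:wordDocument"));
    }
}
```

The point is that the classification lives on each document, so the cost of adding one does not depend on how many documents of that type already exist.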
BR,
Jukka Zitting