Hi Alessandro,
thanks a lot for your thoughtful mail.
I think you hit the nail right on the head.
I think that the main problem is not really about the specific case,
but in general that when people design relational databases, they
always use references (or more properly, joins) to define data that
belongs logically to many entities, but should not be duplicated.
I completely agree with your statement.
And I think this is one of the biggest challenges that we are
going to face.
People are thinking within the facilities provided by a relational
database and within the data modeling practices that they have
been using for decades now. Which is very understandable.
A content repository offers much richer facilities for content modelling
primarily through features like a hierarchy, multi-value properties or
even features like sorted children which in an RDBMS world have
to be modeled by the application developer.
Imagine that you have a company tree, with "positions",
"departments", "employees", "health plans" etc.
An employee could belong to a department, have a position and a
health plan, but typically you would not make all those nodes child
nodes of the employee: you would instead define references to the
proper node in the "position" and "health plan" subtrees.
I think one-to-many relationships should be modeled as a hierarchy.
So my initial gut feeling would be a datamodel like this:
/bigco
/bigco/marketingdept
/bigco/marketingdept/joeshmoe
and "joeshmoe" would be of nodetype
[bigco:employee]
- position
- healthplan
Now "position", "healthplan" are many-to-many relationships.
I think that those can either be modeled as references, paths,
names or strings.
People that come from a "hard structured" RDBMS background
very often think that a reference is the only option.
For example "position" might very well be a "string" or a "name"
if the application can deal with the fact that information is "dangling".
If we continue to model the above tree with...
/bigco/positions/
/bigco/positions/secretary
/bigco/positions/svp
... I think I would personally choose to store a human-readable
"string" property that is actually the name of the target node in
/bigco/positions.
So I would store "svp" or "secretary" in the position property.
Since I would not use namespaces for the names of the children
in "positions", I would not need the overhead of a true name property in
my employee node.
While this probably rubs a lot of "structure first" people the wrong
way, I prefer this model since the information carried in the
string "secretary" is still valuable even if it is "dangling"
(...as opposed to some UUID).
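To make the "dangling string" idea a bit more concrete, here is a rough
sketch. It is plain Java with a map standing in for the repository, not
the JCR API, and the class and method names are made up for illustration.
It only shows that the stored string stays readable when the target node
under /bigco/positions is gone, and that the application has to be
prepared for the lookup to come back empty.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical stand-in for a content tree: absolute path -> properties.
// This is NOT the JCR API; it only illustrates the modeling choice.
class PositionLookupSketch {
    static final String POSITIONS = "/bigco/positions/";

    final Map<String, Map<String, String>> nodes = new TreeMap<>();

    void addNode(String path, Map<String, String> properties) {
        nodes.put(path, properties);
    }

    // The raw string is readable whether or not a target node exists.
    String positionName(String employeePath) {
        return nodes.get(employeePath).get("position");
    }

    // Resolve the string to a node under /bigco/positions, if it is still there.
    Optional<String> resolvePosition(String employeePath) {
        String target = POSITIONS + positionName(employeePath);
        return nodes.containsKey(target) ? Optional.of(target) : Optional.empty();
    }

    public static void main(String[] args) {
        PositionLookupSketch repo = new PositionLookupSketch();
        String joe = "/bigco/marketingdept/joeshmoe";
        repo.addNode("/bigco/positions/secretary", Map.of());
        repo.addNode(joe, Map.of("position", "secretary"));

        System.out.println(repo.resolvePosition(joe).isPresent()); // true

        // Nothing stops us from removing the target: no referential integrity.
        repo.nodes.remove("/bigco/positions/secretary");
        System.out.println(repo.resolvePosition(joe).isPresent()); // false
        System.out.println(repo.positionName(joe));                // secretary
    }
}
```

The point is the last line: after the target is removed, the employee
still carries "secretary", which a human (or a report) can make sense of.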
I think it is important to understand that there certainly are use cases
where referential integrity is very important, but it comes at a price,
both in performance and, even more importantly, in how much it
constrains the flexibility of your applications from a "data-first"
perspective.
What could be the right way to model things? Maybe using a "path"
property to point to the node instead? Of course, it would not be as
easy to use as a reference, and it would be requiring global updates
if the pointed-to node ever changes position, but I can't see other options.
If you would like to protect against "move" operations but want to avoid
the overhead of referential integrity, you can store the UUID of the target
in a string property. In JSR-283 we are looking at a "weak-reference" to
express, in a more formal way, a reference that can dangle.
It's easy to see how, in a large company, there could be thousands of
employees holding the same position and health plan, and those
specific nodes ("Secretary" and "Plan A") would have thousands of
references pointing to them.
So, given the issue as explained by Marcel that "whenever a
reference is added that points to a node N the complete set of
references pointing to N is re-written to the persistence manager",
it seems that using references to a node that is very "popular" is
really going to be creating problems in the long term.
Agreed. And I think we will not be able to re-educate everybody with
an RDBMS background before they use Jackrabbit, so I think Jackrabbit has
to be able to deal with very large quantities of references in a very
efficient way.
So I would recommend fixing that, as noted by Tom in the last sentence of:
http://issues.apache.org/jira/browse/JCR-657
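To put a rough number on the problem: taking Marcel's description at
face value (every added reference rewrites the complete set of
references pointing to N), adding the k-th reference writes k entries,
so n references cost 1 + 2 + ... + n = n(n+1)/2 entry writes in total.
The sketch below just counts that. The append-only figure is a
hypothetical baseline for comparison, not a description of what
JCR-657 proposes.

```java
// Back-of-the-envelope count, assuming the behavior Marcel describes.
// Names are made up; this does not touch Jackrabbit at all.
class ReferenceRewriteCost {
    // Each add rewrites the whole reference set, which has grown by one.
    static long rewriteAllOnAdd(int references) {
        long entriesWritten = 0;
        for (int setSize = 1; setSize <= references; setSize++) {
            entriesWritten += setSize;
        }
        return entriesWritten;
    }

    // Hypothetical baseline: each add writes only the new entry.
    static long appendOnly(int references) {
        return references;
    }

    public static void main(String[] args) {
        System.out.println(rewriteAllOnAdd(1000)); // 500500
        System.out.println(appendOnly(1000));      // 1000
    }
}
```

So a thousand secretaries already mean roughly half a million entry
writes for a single "popular" node, and the cost grows quadratically.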
regards,
david