Re: Efficient but simple hash for storing lot of unidentifiable contents

Sébastien Launay Thu, 08 Jan 2009 02:31:25 -0800

Hi Thomas,

Thanks for your advices.


Thomas Müller a écrit :
> Hi,
>
> What about:
>
> 1) Start with the node of your choice (let's say /data/)
> 2) Check how many nodes and child nodes are in that node
> 3) If more than 50: pick a child node randomly from the range 00-50 and
> continue at step 2)
> 4) Create a node with a random name, done
>   
With this method a content will contains its applicative child nodes and
eventually other contents.
This implies two problems:
- if i want to delete a content, i must ensure to not delete contents below.
- the versioning of a content implies versioning of the contents below
  and recursively the others contents below.

I guess i can adapt this method with the notion of content holder with one
content and multiple sub content holder.
But if there are holder, there will be holes, at least without additional
management.
> I would make sure there are no same name siblings (for multiple reasons).
> There are at least two solutions:
>
> A) Disallow same name siblings, use node names n00-n50, and let the
> algorithm re-try if there is a clash. I'm not sure if that works well with
> clustering.
>   
I think the transient node creation will always works.
But as the retrieval of the next node is not synchronized between
cluster node,
it's the call to save (which do synchronize and do lock) will failed,
but as you
said i can retry like 3 times and finally the node will be created and
will be unique.
> B) Use a cryptographically secure pseudo random number generator (for
> example a UUID) as the node name. This works well with clustering, the node
> names will be quite long however.
>   
I see, like the storing of version history into
/jcr:root/jcr:system/jcr:versionStorage.
This implies a fixed hash for limiting child nodes.
This can also fails if a node of the hash already exists (especially
with the nodes of
the first level) but we can retry.

In Jackrabbit jcr:versionStorage node there is:
- 00 to ff = 256 child nodes for first level.
- 00 to ff = 256 child nodes for second level.
- 00 to ff = 256 child nodes for third level.
- no limitation for the fourth level.

Ex: fff0b38e-66... => ff/f0/b3/fff0b38e-66...

So, if the number generator distributes UUID with equals frequencies on
the 3 first
segment (ff, b0, b3 in the example) and IIUC there will be more than 10k
nodes in
the fourth level if the number of contents exceeds ((256)^3)*10000 = 167
772M.
The number generated are rarely distributed equally and 10k is the
limit, but with
167 G we can assume it will be working quite well with 500M contents
which is far
enough in our case.

Therefore, i think this algorithm is more simple and much easier to
implement
and will surely produce less bugs :).
I think i am gonna try this by first finding a UUID generator maybe the
same as
the one used in Jackrabbit from Jakarta Commons-Id project.

The only drawback is that the node may have an UUID different than the
name of the node which is another UUID ;).

Thanks again Thomas.
> Disadvantage:
> - Maybe performance. To improve that, you could keep the information 'root
> is full' in memory and start with a random node /data/00 - /data/50 directly
>
> Advantages:
> - No counter required
> - No synchronization required
> - Holes are automatically filled
>
> Regards,
> Thomas
>
>   
--
Sébastien Launay

Re: Efficient but simple hash for storing lot of unidentifiable contents

Reply via email to