Re: node.hasNodes() ,node.getNodes() and removing nodes with node.remove()

Chetan Mehrotra Sun, 06 Aug 2017 21:15:44 -0700

> Every addition of a child node implies a change to the parent node Document


Looks like the parent nodetype is nt:unstructured which requires
orderable children. If you do not require that use a nodetype like
oak:Unstructured. See [1] for some background

Chetan Mehrotra
[1] 
https://jackrabbit.apache.org/oak/docs/dos_and_donts.html#Large_number_of_direct_child_node

On Mon, Aug 7, 2017 at 9:32 AM, Peter Harrison <[email protected]> wrote:
> 1) I knew many nodes under one node was an issue with 2.X but I thought Oak
> was going to address this issue.
>
> To get a better grasp of what is going on I took a look at the data
> structure in Mongo. It seems to be a 'flat' node Collection.  There is a
> Collection called 'nodes'. A document in this collection represents a node.
> Inside the node is a list of the ID's of the child nodes. Every addition of
> a child node implies a change to the parent node Document. Each revision of
> the number of children stores a complete new list of the children. This
> means the document becomes more unmanagable the more nodes are added
> directly under it. When you get the node you MUST also get the entire list
> of children ID's! Not only this, but for every modification a full list of
> all the children is stored. Thus removing a child of a node with lots of
> other nodes actually adds a huge amount of data.
>
> This is *insane*. No. Seriously. This is nuts. If I'm reading this right it
> means that if you have say 10 children you have 10 revisions each with its
> own set of children all in the one Document.
>
>
> 2) I experimented with the number of removes before a save. If you try and
> put too many under a single commit it blows up. The API I wrote had a
> parameter you could override to control the number or removes done for each
> commit. It didn't look like the commit was making much difference in terms
> of performance. I might be wrong on that one - see below.
>
> Now that I know how things work under the covers I have some idea of the
> scope of the problem. Each remove can actually adding a HUGE volume of data
> to the parent node, a copy of all the child id's previously less the
> removed children.
>
> Am I getting all this wrong?
>
>
>
> A sane implementation would have a separate collection for the links
> between nodes or each node would have a parent and finding out the children
> would involve a simple query to return all nodes that have a specific
> parent. This would be easy and fast as you can have an index on the
> parent_id. It would also mean you can perform a query and iterate the list
> without getting all the children at once. This would mean the hasNodes()
> and getNodes() would only need to get the first record. I'm sure there are
> reasons for all this, but nears as I can tell this is a pretty fatal flaw.
>
>
>
> Looks like that Cassandra spike is closer than I thought.
>
>
> On Mon, Aug 7, 2017 at 1:39 PM, Clay Ferguson <[email protected]> wrote:
>
>> Two thoughts:
>>
>> 1) It's a known issue (severe weakness) in the design of Jackrabbit/Oak
>> that it chokes like a dog on large numbers of child nodes all under the
>> same node. Many users have struggled with this, and imo it has been one of
>> the massive flaws that has kept the JCR from really taking off. I mean,
>> probably still only 1% of developers have ever heard of the JCR.
>>
>> 2) About cleaning up the massive child list, be sure you aren't doing a
>> commit (save) after each node. Try to run commits after 100 to 500 deletes
>> at a time.
>>
>> Good luck. That scalability issue is a pretty big problem. I sure wish
>> Adobe would find some people with the requisite skill to get that fixed.
>> Every serious user runs into this problem. I mean the Derby DB is
>> litterally 100x of times more powerful, and most people consider Derby a
>> toy.
>>
>>
>> Best regards,
>> Clay Ferguson
>> [email protected]
>>
>>
>> On Sun, Aug 6, 2017 at 7:38 PM, Peter Harrison <[email protected]>
>> wrote:
>>
>> > Over the last few days I've come across a problem while trying to recover
>> > from a ranaway script that created tens of thousands of nodes under a
>> > single node.
>> >
>> > When I get the parent node to this large number of new nodes and call
>> > hasNodes() things lock up and the Mongo query times out. Similar problem
>> > when you try to call getNodes() to return a nodeIterator.
>> >
>> > I know that one of the key points with Oak was meant to be the ability to
>> > handle a large number of child nodes,
>> >
>> >
>> >
>> > The second problem I have is in removing these nodes. While I was able to
>> > find out the node paths without the above calls to get each node by path
>> > when I call node.remove() it is taking about 20-30 seconds to delete each
>> > node. I wanted to remove about 300,000 nodes, but at 20 seconds a
>> node....
>> > just under 69 days. It took no more than 2 days to add then, probably
>> much
>> > shorter.
>> >
>> > While I'm working on ways around these problems - essentially by
>> rebuilding
>> > the repo - it would be good to see if these problems are known or whether
>> > there is something I'm doing wrong.
>> >
>>

Re: node.hasNodes() ,node.getNodes() and removing nodes with node.remove()

Reply via email to