> Every addition of a child node implies a change to the parent node Document

Looks like the parent nodetype is nt:unstructured, which requires
orderable child nodes; the child order is tracked on the parent, which
is why every addition of a child rewrites the parent document. If you do
not require ordering, use a nodetype like oak:Unstructured. See [1] for
some background
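
A minimal sketch of the difference (untested; assumes an open JCR
Session named "session", and the path is made up):

    // oak:Unstructured behaves like nt:unstructured but does not
    // declare orderable child nodes, so adding a child does not
    // force a rewrite of the parent's child-order information.
    Node parent = session.getRootNode()
                         .addNode("bulk-data", "oak:Unstructured");
    parent.addNode("child-1", "oak:Unstructured");
    session.save();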

Chetan Mehrotra

On Mon, Aug 7, 2017 at 9:32 AM, Peter Harrison <cheetah...@gmail.com> wrote:
> 1) I knew many nodes under one node was an issue with 2.X, but I thought
> Oak was going to address this issue.
>
> To get a better grasp of what is going on I took a look at the data
> structure in Mongo. It seems to be a 'flat' node collection: there is a
> collection called 'nodes', and a document in this collection represents
> a node. Inside the document is a list of the IDs of the child nodes.
> Every addition of a child node implies a change to the parent node's
> document, and each revision of the set of children stores a complete new
> list of the children. This means the document becomes more unmanageable
> the more nodes are added directly under it. When you get the node you
> MUST also get the entire list of child IDs! Not only that, but for every
> modification a full list of all the children is stored, so removing a
> child from a node with lots of other children actually adds a huge
> amount of data.
>
> This is *insane*. No. Seriously. This is nuts. If I'm reading this
> right, it means that if you have, say, 10 children you have 10
> revisions, each with its own copy of the child list, all in the one
> document.
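>
> To see it for yourself, a couple of lines with the Mongo Java driver are
> enough (connection details, database name and path are placeholders; the
> depth-prefixed _id format is what I observed, so treat the rest of the
> shape as my reading, not a spec):
>
>     MongoDatabase db = new MongoClient("localhost").getDatabase("oak");
>     MongoCollection<Document> nodes = db.getCollection("nodes");
>     // documents appear to be keyed by "<depth>:<path>"
>     Document parent = nodes.find(
>             new Document("_id", "2:/content/parent")).first();
>     System.out.println(parent.toJson()); // revision history and all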
> 2) I experimented with the number of removes before a save. If you try
> to put too many under a single commit it blows up. The API I wrote had a
> parameter you could override to control the number of removes done per
> commit. It didn't look like the batch size was making much difference in
> terms of performance, though I might be wrong on that one - see below.
>
> Now that I know how things work under the covers I have some idea of the
> scope of the problem. Each remove can actually add a HUGE volume of data
> to the parent node: a fresh copy of all the child IDs previously stored,
> less the removed children.
>
> Am I getting all this wrong?
> A sane implementation would have a separate collection for the links
> between nodes, or each node would store its parent's ID, so that finding
> the children is a simple query for all nodes with a given parent. This
> would be easy and fast, since you can put an index on parent_id, and it
> would mean you could run the query and iterate the results without
> fetching all the children at once: hasNodes() would only need to fetch
> the first record, and getNodes() could return an iterator that fetches
> the rest lazily. I'm sure there are reasons for all this, but near as I
> can tell this is a pretty fatal flaw.
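>
> Something like this is all it would take (pure illustration of the
> scheme I mean; collection and field names are made up, and the
> com.mongodb.client imports are omitted):
>
>     // Index the child->parent link once.
>     MongoCollection<Document> nodes = db.getCollection("nodes");
>     nodes.createIndex(Indexes.ascending("parent_id"));
>
>     // hasNodes(): fetch at most one child document.
>     boolean hasChildren = nodes.find(Filters.eq("parent_id", parentId))
>                                .limit(1).first() != null;
>
>     // getNodes(): iterate a cursor instead of loading a full child
>     // list out of the parent document.
>     for (Document child : nodes.find(Filters.eq("parent_id", parentId))) {
>         // process each child as it streams in
>     }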
> Looks like that Cassandra spike is closer than I thought.
> On Mon, Aug 7, 2017 at 1:39 PM, Clay Ferguson <wcl...@gmail.com> wrote:
>> Two thoughts:
>> 1) It's a known issue (severe weakness) in the design of Jackrabbit/Oak
>> that it chokes like a dog on large numbers of child nodes all under the
>> same node. Many users have struggled with this, and imo it has been one of
>> the massive flaws that has kept the JCR from really taking off. I mean,
>> probably still only 1% of developers have ever heard of the JCR.
>> 2) About cleaning up the massive child list, be sure you aren't doing a
>> commit (save) after each node. Try to run commits after 100 to 500 deletes
>> at a time.
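>>
>> Roughly like this (untested sketch; the path and the batch size of 500
>> are placeholders, javax.jcr imports omitted):
>>
>>     Node parent = session.getNode("/path/to/parent");
>>     NodeIterator it = parent.getNodes();
>>     long count = 0;
>>     while (it.hasNext()) {
>>         it.nextNode().remove();
>>         if (++count % 500 == 0) {
>>             session.save(); // commit every 500 deletes, not each one
>>         }
>>     }
>>     session.save(); // flush the final partial batch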
>>
>> Good luck. That scalability issue is a pretty big problem. I sure wish
>> Adobe would find some people with the requisite skill to get it fixed.
>> Every serious user runs into this problem. I mean, Derby DB is
>> literally 100x more powerful, and most people consider Derby a toy.
>> Best regards,
>> Clay Ferguson
>> wcl...@gmail.com
>> On Sun, Aug 6, 2017 at 7:38 PM, Peter Harrison <cheetah...@gmail.com>
>> wrote:
>> > Over the last few days I've come across a problem while trying to
>> > recover from a runaway script that created tens of thousands of nodes
>> > under a single node.
>> >
>> > When I get the parent node of this large number of new nodes and call
>> > hasNodes(), things lock up and the Mongo query times out. There is a
>> > similar problem when you try to call getNodes() to return a
>> > NodeIterator.
>> >
>> > I know that one of the key points with Oak was meant to be the
>> > ability to handle a large number of child nodes.
>> >
>> > The second problem I have is in removing these nodes. While I was
>> > able to find the node paths without the above calls, so I can get
>> > each node by path, when I call node.remove() it takes about 20-30
>> > seconds to delete each node. I wanted to remove about 300,000 nodes,
>> > but at 20 seconds a node that works out to roughly 69 days. It took
>> > no more than 2 days to add them, probably much less.
>> >
>> > While I'm working on ways around these problems - essentially by
>> > rebuilding the repo - it would be good to know whether these problems
>> > are known or whether there is something I'm doing wrong.
>> >
