Re: node.hasNodes(), node.getNodes() and removing nodes with node.remove()
> Is using this nextNode (linked list built on Node properties) the best
> practice for when ordering AND large numbers of children are an absolute
> requirement? What do you guys think? Crazy idea or reasonable?

That can be an option. However, concurrent updates would result in conflicts which would need to be resolved. Also, for inserting, finding the last entry would involve iterating over all previous entries. In general, having an orderable list of large size poses problems and should be avoided.

Chetan Mehrotra
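Both of Chetan's objections can be seen in a small sketch. The class below is a toy, in-memory model of the proposed nextNode scheme (all names are invented; nothing here is real JCR or Oak API): appending without a tail pointer must walk the whole chain, and reading in order costs one lookup per child.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model of the hypothetical nextNode linked list.
// Each child records the name of the next child; the parent only
// records the head. Not real JCR/Oak API.
public class LinkedChildren {
    private final Map<String, String> next = new HashMap<>();
    private String head; // first child, as the parent would record it

    // Append at the tail. Returns how many links were followed:
    // without a tail pointer this is O(n) in the number of children,
    // which is the "iterate over all previous entries" cost.
    public int append(String name) {
        next.put(name, null);
        if (head == null) {
            head = name;
            return 0;
        }
        int hops = 0;
        String cur = head;
        while (next.get(cur) != null) {
            cur = next.get(cur);
            hops++;
        }
        next.put(cur, name);
        return hops;
    }

    // In-order traversal: one property lookup per child, which is why
    // iteration against a real store would be slow.
    public List<String> inOrder() {
        List<String> out = new ArrayList<>();
        for (String cur = head; cur != null; cur = next.get(cur)) {
            out.add(cur);
        }
        return out;
    }
}
```

A real implementation would additionally have to resolve write conflicts when two sessions append concurrently, which this single-threaded toy sidesteps entirely.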
Re: node.hasNodes(), node.getNodes() and removing nodes with node.remove()
I was unaware simply making nodes unorderable would allow good scalability. Good to know!

I guess we could always experiment with using a nextNode property to allow iterating in order, and also get good scalability for inserting/deleting, but using that linked-list approach would be slow at iterating, because each node retrieved would have to come from a lookup of its nextNode property. The only thing (afaik) that could significantly improve that performance would be if each node's children happened to be in contiguous storage, so that disk caching at the hardware layer played a role in the speedup.

Is using this nextNode (linked list built on Node properties) the best practice for when ordering AND large numbers of children are an absolute requirement? What do you guys think? Crazy idea or reasonable?

-Clay

On Sun, Aug 6, 2017 at 11:15 PM, Chetan Mehrotra wrote:
> Looks like the parent nodetype is nt:unstructured which requires
> orderable children. If you do not require that use a nodetype like
> oak:Unstructured. See [1] for some background
>
> [1] https://jackrabbit.apache.org/oak/docs/dos_and_donts.html#Large_number_of_direct_child_node
Re: node.hasNodes(), node.getNodes() and removing nodes with node.remove()
On 2017-08-07 03:39, Clay Ferguson wrote:
> Two thoughts:
>
> 1) It's a known issue (severe weakness) in the design of Jackrabbit/Oak
> that it chokes like a dog on large numbers of child nodes all under the
> same node. Many users have struggled with this, and imo it has been one of
> the massive flaws that has kept the JCR from really taking off. I mean,
> probably still only 1% of developers have ever heard of the JCR.
> ...

Jackrabbit yes, Oak no. Oak has been designed to handle bigger flat collections, but it does require a container node type that doesn't need to maintain the collection ordering.

Best regards,
Julian
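The node-type distinction Julian describes can be illustrated with a hypothetical CND (Compact Node Definition) fragment. The standard nt:unstructured type carries the `orderable` attribute, which is what forces the parent to track child order; a flat container type simply omits it. The type name my:FlatContainer below is invented for illustration:

```
/* Hypothetical CND fragment; my:FlatContainer is an invented name.
   Unlike nt:unstructured, this type does NOT declare 'orderable',
   so the repository need not maintain child ordering on the parent. */
[my:FlatContainer]
  - * (undefined) multiple
  - * (undefined)
  + * (nt:base) = oak:Unstructured
```

In practice, simply creating children under a node of type oak:Unstructured (instead of nt:unstructured) achieves the same effect, as Chetan's link below explains.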
Re: node.hasNodes(), node.getNodes() and removing nodes with node.remove()
Thanks for the reference. Much appreciated.

On Mon, Aug 7, 2017 at 4:15 PM, Chetan Mehrotra wrote:
> Looks like the parent nodetype is nt:unstructured which requires
> orderable children. If you do not require that use a nodetype like
> oak:Unstructured. See [1] for some background
>
> [1] https://jackrabbit.apache.org/oak/docs/dos_and_donts.html#Large_number_of_direct_child_node
Re: node.hasNodes(), node.getNodes() and removing nodes with node.remove()
> Every addition of a child node implies a change to the parent node
> Document

Looks like the parent nodetype is nt:unstructured, which requires orderable children. If you do not require that, use a nodetype like oak:Unstructured. See [1] for some background.

Chetan Mehrotra

[1] https://jackrabbit.apache.org/oak/docs/dos_and_donts.html#Large_number_of_direct_child_node

On Mon, Aug 7, 2017 at 9:32 AM, Peter Harrison wrote:
> 1) I knew many nodes under one node was an issue with 2.X but I thought Oak
> was going to address this issue.
>
> To get a better grasp of what is going on I took a look at the data
> structure in Mongo. It seems to be a 'flat' node Collection. There is a
> Collection called 'nodes'. A document in this collection represents a node.
> Inside the node is a list of the IDs of the child nodes. Every addition of
> a child node implies a change to the parent node Document.
Re: node.hasNodes(), node.getNodes() and removing nodes with node.remove()
1) I knew many nodes under one node was an issue with 2.X, but I thought Oak was going to address this issue.

To get a better grasp of what is going on I took a look at the data structure in Mongo. It seems to be a 'flat' node Collection. There is a Collection called 'nodes'. A document in this collection represents a node. Inside the node is a list of the IDs of the child nodes. Every addition of a child node implies a change to the parent node Document. Each revision stores a complete new list of the children. This means the document becomes more unmanageable the more nodes are added directly under it. When you get the node you MUST also get the entire list of children IDs! Not only this, but for every modification a full list of all the children is stored. Thus removing a child of a node with lots of other nodes actually adds a huge amount of data.

This is *insane*. No. Seriously. This is nuts. If I'm reading this right, it means that if you have, say, 10 children you have 10 revisions, each with its own set of children, all in the one Document.

2) I experimented with the number of removes before a save. If you try to put too many under a single commit it blows up. The API I wrote had a parameter you could override to control the number of removes done for each commit. It didn't look like the commit batch size was making much difference in terms of performance. I might be wrong on that one - see below.

Now that I know how things work under the covers I have some idea of the scope of the problem. Each remove can actually add a HUGE volume of data to the parent node: a copy of all the child IDs previously stored, less the removed children.

Am I getting all this wrong?

A sane implementation would have a separate collection for the links between nodes, or each node would have a parent, and finding out the children would involve a simple query to return all nodes that have a specific parent. This would be easy and fast, as you can have an index on the parent_id. It would also mean you can perform a query and iterate the list without getting all the children at once. This would mean hasNodes() and getNodes() would only need to get the first record. I'm sure there are reasons for all this, but near as I can tell this is a pretty fatal flaw.

Looks like that Cassandra spike is closer than I thought.

On Mon, Aug 7, 2017 at 1:39 PM, Clay Ferguson wrote:
> Two thoughts:
>
> 1) It's a known issue (severe weakness) in the design of Jackrabbit/Oak
> that it chokes like a dog on large numbers of child nodes all under the
> same node.
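Peter's reading ("10 children → 10 revisions, each with its own set of children") implies quadratic growth in stored child IDs, and it also implies that a remove *adds* data. Whether or not this matches what Oak's DocumentNodeStore actually persists, the arithmetic of the model in the email is easy to check with a toy class (all names hypothetical):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the parent Document as described in the email: every
// change to the child set snapshots the ENTIRE child list under a new
// revision. This models the email's claim, not Oak's actual format.
public class ParentDoc {
    private final Map<Integer, List<String>> revisions = new LinkedHashMap<>();
    private final List<String> children = new ArrayList<>();
    private int rev = 0;

    public void addChild(String id) {
        children.add(id);
        revisions.put(++rev, new ArrayList<>(children)); // full copy
    }

    public void removeChild(String id) {
        children.remove(id);
        revisions.put(++rev, new ArrayList<>(children)); // another full copy
    }

    // Total child IDs stored across all revisions of this document.
    public int storedIds() {
        return revisions.values().stream().mapToInt(List::size).sum();
    }
}
```

Under this model, adding n children stores 1 + 2 + ... + n = n(n+1)/2 IDs in total, and removing one child from a 10-child node stores 9 more, which is exactly the "removing actually adds data" complaint.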
Re: node.hasNodes(), node.getNodes() and removing nodes with node.remove()
Peter,

Also, as a last resort if absolutely nothing else is workable, you could theoretically run an Export to XML, process that XML with custom code you write, and THEN re-import back into a new empty repo.

Please share your solution with the group if you would, once found. Adobe might benefit from seeing what problems they are creating and how people are working around those problems. Hopefully that's a legit use of this email list also.

Best regards,
Clay Ferguson
wcl...@gmail.com

On Sun, Aug 6, 2017 at 7:38 PM, Peter Harrison wrote:
> Over the last few days I've come across a problem while trying to recover
> from a runaway script that created tens of thousands of nodes under a
> single node.
Re: node.hasNodes(), node.getNodes() and removing nodes with node.remove()
Two thoughts:

1) It's a known issue (severe weakness) in the design of Jackrabbit/Oak that it chokes like a dog on large numbers of child nodes all under the same node. Many users have struggled with this, and imo it has been one of the massive flaws that has kept the JCR from really taking off. I mean, probably still only 1% of developers have ever heard of the JCR.

2) About cleaning up the massive child list, be sure you aren't doing a commit (save) after each node. Try to run commits after 100 to 500 deletes at a time.

Good luck. That scalability issue is a pretty big problem. I sure wish Adobe would find some people with the requisite skill to get that fixed. Every serious user runs into this problem. I mean, the Derby DB is literally 100x more powerful, and most people consider Derby a toy.

Best regards,
Clay Ferguson
wcl...@gmail.com

On Sun, Aug 6, 2017 at 7:38 PM, Peter Harrison wrote:
> Over the last few days I've come across a problem while trying to recover
> from a runaway script that created tens of thousands of nodes under a
> single node.
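The batching advice above (save every few hundred deletes rather than after each one) can be sketched generically. In the sketch below the flush callback stands in for JCR's Session.save() and the per-item operation stands in for node.remove(); the helper class and its names are invented for illustration:

```java
import java.util.List;
import java.util.function.Consumer;

// Generic sketch of batched mutation: apply op to every item, but only
// flush (think Session.save() in JCR) once per batchSize operations.
// Names are illustrative; this is not Oak API.
public class BatchedOps {
    public static int run(List<String> items, int batchSize,
                          Consumer<String> op, Runnable flush) {
        int flushes = 0;
        int pending = 0;
        for (String item : items) {
            op.accept(item);       // e.g. remove the node at this path
            if (++pending >= batchSize) {
                flush.run();       // e.g. session.save()
                flushes++;
                pending = 0;
            }
        }
        if (pending > 0) {         // commit the final partial batch
            flush.run();
            flushes++;
        }
        return flushes;
    }
}
```

With 300,000 nodes and a batch size of 500, that is 600 saves instead of 300,000; the batch size is a trade-off, since the thread also reports that oversized commits blow up.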
node.hasNodes(), node.getNodes() and removing nodes with node.remove()
Over the last few days I've come across a problem while trying to recover from a runaway script that created tens of thousands of nodes under a single node.

When I get the parent node to this large number of new nodes and call hasNodes(), things lock up and the Mongo query times out. Similar problem when you try to call getNodes() to return a NodeIterator. I know that one of the key points with Oak was meant to be the ability to handle a large number of child nodes.

The second problem I have is in removing these nodes. While I was able to find out the node paths without the above calls, getting each node by path and calling node.remove() is taking about 20-30 seconds to delete each node. I wanted to remove about 300,000 nodes, but at 20 seconds a node that is just under 69 days. It took no more than 2 days to add them, probably much shorter.

While I'm working on ways around these problems - essentially by rebuilding the repo - it would be good to see if these problems are known or whether there is something I'm doing wrong.