Re: /oak:index (DocumentNodeStore)
On 2015-07-09 09:15, Marcel Reutegger wrote:
> Hi Ian,
>
> there are mainly two reasons why we cannot use DocumentStore based indexes for this purpose:
>
> - MongoDB only supports a limited number of indexes (64 per collection) and applications usually have a need for more indexes.
> - Data in Oak is multi-versioned. It must be possible to query nodes at a specific revision of the tree.
>
> Lucene indexes are more efficient, but are only updated asynchronously. Whether this is acceptable usually depends on application requirements. Experience so far shows, many indexes can be asynchronous, because there was no hard requirement for synchronous index updates.
>
> Regards
> Marcel

Do the above considerations also apply to the UUID index?

Best regards, Julian
Re: /oak:index (DocumentNodeStore)
On Thu, Jul 9, 2015 at 12:45 PM, Marcel Reutegger mreut...@adobe.com wrote:
> - Data in Oak is multi-versioned. It must be possible to query nodes at a specific revision of the tree.

To add: that also makes it difficult to use Mongo indexes, as the index itself is versioned. So instead of just indexing property 'foo', you need to index it for every revision.

Chetan Mehrotra
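The point about versioned data can be illustrated with a small toy model. This is only a sketch under loud assumptions: revisions are plain integers written in increasing order, and `VersionedProperty` and `query` are hypothetical names, not Oak's document format or API. It shows why an index on a single "current" value of 'foo' cannot answer a query at an older revision.

```python
from bisect import bisect_right


class VersionedProperty:
    """Toy multi-versioned property (hypothetical model, not Oak's format)."""

    def __init__(self):
        self._revisions = []  # revision numbers, kept in increasing order
        self._values = []     # value written at the matching revision

    def set(self, revision, value):
        # Assumption: revisions are integers written in increasing order.
        self._revisions.append(revision)
        self._values.append(value)

    def get(self, read_revision):
        # Latest value written at or before read_revision, or None.
        i = bisect_right(self._revisions, read_revision)
        return self._values[i - 1] if i else None


def query(nodes, name, value, read_revision):
    """Return paths whose property `name` equals `value` as of read_revision."""
    return sorted(
        path
        for path, props in nodes.items()
        if name in props and props[name].get(read_revision) == value
    )


foo = VersionedProperty()
foo.set(1, "red")   # revision 1: foo = red
foo.set(3, "blue")  # revision 3: foo = blue
nodes = {"/content/a": {"foo": foo}}
```

In this model the same node matches `foo = red` when read at revision 2 and `foo = blue` when read at revision 3, so a usable index has to carry an entry per revision rather than one entry per property value.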
Re: /oak:index (DocumentNodeStore)
Hi Ian,

there are mainly two reasons why we cannot use DocumentStore based indexes for this purpose:

- MongoDB only supports a limited number of indexes (64 per collection) and applications usually have a need for more indexes.
- Data in Oak is multi-versioned. It must be possible to query nodes at a specific revision of the tree.

Lucene indexes are more efficient, but are only updated asynchronously. Whether this is acceptable usually depends on application requirements. Experience so far shows that many indexes can be asynchronous, because there was no hard requirement for synchronous index updates.

Regards
Marcel
Re: /oak:index (DocumentNodeStore)
Hi Marcel,

Thanks for the response, that makes sense. I assume that there are already 64 indexes in /oak:index before any custom ones are added, which makes it impossible to remove /oak:index for MongoDB. With that many it's going to be impractical for all RDBMSs.

Would there be any benefit in moving /oak:index out of the main document collection, so that any MongoDB indexes in the collection that are of no relevance to /oak:index don't get bloated?

Or, more generally: is there a different way of storing the data in /oak:index so that it doesn't result in so many MongoDB documents?

Best Regards
Ian
Re: /oak:index (DocumentNodeStore)
On 9 July 2015 at 09:16, Chetan Mehrotra chetan.mehro...@gmail.com wrote:
> On Thu, Jul 9, 2015 at 12:45 PM, Marcel Reutegger mreut...@adobe.com wrote:
> > - Data in Oak is multi-versioned. It must be possible to query nodes at a specific revision of the tree.
>
> To add - That also makes it difficult to use Mongo indexes as the index itself is versioned. So instead of just indexing property 'foo' you need to index it for every revision

Won't compound indexes work?

    { _id : 1, _modified: 1, _revision: 1 }

They are bigger. Average size per entry:

- _id: 211 bytes
- _modified, _id: 233 bytes
- _revision, _modified, _id: probably close to 400 bytes, as _revision is a string

I guess the way of telling is to generate the index on a test database and see what impact it has.

Best Regards
Ian

> Chetan Mehrotra
Re: /oak:index (DocumentNodeStore)
Hi,

Using MongoDB indexes directly doesn't work because of the MVCC model. What we could do is add special collections (basically one collection per index). This would require some work, which would then need to be repeated for RDBMK. It would be quite some work.

> I observe that 60% of the size of the nodes collection is attributable to /oak:index

Could you try to find out which index(es) are responsible for that?

There would be multiple ways to reduce the number of nodes:

0) remove unused indexes
1) convert some indexes to Lucene property indexes
2) convert to unique index if possible (as this uses less space)
3) add a feature to only index a subset of the keys (only index what we need)
4) convert the last x levels of the index structure as a property instead of as a node

3) and 4) would require changes in Oak. For 4), the change should reduce the number of nodes, but might cause merge conflicts (not sure). With level = 1, it would be:

    /content/products/a @color=red
    /content/products/b @color=red

    /oak:index/color/red/content
    /oak:index/color/red/content/products @a = true, @b = true

instead of

    /oak:index/color/red/content
    /oak:index/color/red/content/products
    /oak:index/color/red/content/products/a @match = true
    /oak:index/color/red/content/products/b @match = true

With level > 1, it would require some escaping magic, but we could save some more nodes, and basically it would be:

    level = 2:
    /oak:index/color/red/content @products_a = true, @products_b = true

    level = 3:
    /oak:index/color/red @content_products_a = true, @content_products_b = true

Regards,
Thomas
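Option 4) from Thomas's message (folding the last x levels of the index structure into a property name) can be sketched as follows. This is a minimal illustration, not Oak code: `index_entry` and `build_index` are hypothetical helpers, the `_` separator is taken from the example in the message, and the "escaping magic" needed when a path segment itself contains the separator is deliberately left out.

```python
def index_entry(content_path, value, index_name="color", level=0, sep="_"):
    """Return (index_node_path, property_name) for one indexed content node.

    level = 0 mimics the existing layout (one index node per content node,
    marked with @match); level = x folds the last x path segments into the
    property name. Hypothetical helper, no escaping of `sep` is done.
    """
    segments = content_path.strip("/").split("/")
    if not 0 <= level <= len(segments):
        raise ValueError("level out of range for this path")
    kept = segments[:len(segments) - level]    # segments that stay nodes
    folded = segments[len(segments) - level:]  # segments folded into the property
    node = "/".join(["/oak:index", index_name, value] + kept)
    prop = sep.join(folded) if folded else "match"
    return node, prop


def build_index(content_paths, value, level):
    """Group property names by the index node that would hold them."""
    index = {}
    for path in content_paths:
        node, prop = index_entry(path, value, level=level)
        index.setdefault(node, set()).add(prop)
    return index


paths = ["/content/products/a", "/content/products/b"]
flat = build_index(paths, "red", level=0)    # two leaf nodes, each with @match
folded = build_index(paths, "red", level=1)  # one node carrying @a and @b
```

With level = 1 the two leaf index nodes collapse into properties on a single node, which is where the saving in MongoDB documents would come from. The possible merge conflicts mentioned in the message arise because concurrent writers would then update the same node.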
Re: /oak:index (DocumentNodeStore)
Hi,

On 9 July 2015 at 10:33, Thomas Mueller muel...@adobe.com wrote:
> Hi,
>
> Using MongoDB indexes directly doesn't work because of the MVCC model. What we could do is add special collections (basically one collection per index). This would require some work, which then would need to be repeated for RDBMK. It would be quite some work.

ok, understood.

> > I observe that 60% of the size of the nodes collection is attributable to /oak:index
>
> Could you try to find out which index(es) are responsible for that?

Marcel and Chetan have been working on the repository I was observing. I am sure they can point you to the details offline, if you are not aware of it already. They were able to remove about 25% of the 60% under /oak:index, but IIUC most of the remainder is not local customisations, and perhaps 40% of what remains is not local customisations and must be synchronous, which indicates a 1:2 ratio between real content nodes and MongoDB documents before any MongoDB indexes are considered. That ratio was the motivation for asking the question. Chetan thought I should discuss it on oak-dev.

Marcel and Chetan, who are far more knowledgeable than I am in this area, have executed 0) and 1) below.

Best Regards
Ian

> There would be multiple ways to reduce the number of nodes:
>
> 0) remove unused indexes
> 1) convert some indexes to Lucene property indexes
> 2) convert to unique index if possible (as this uses less space)
> 3) add a feature to only index a subset of the keys (only index what we need)
> 4) convert the last x levels of the index structure as a property instead of as a node
>
> 3) and 4) would require changes in Oak. For 4), the change should reduce the number of nodes, but might cause merge conflicts (not sure).
Re: /oak:index (DocumentNodeStore)
A collection per index (or a separate one for indexes only), especially for the asynchronous ones, will translate into a big benefit if the following occurs:

- when querying on index nodes we don't need to get all related node documents (which is happening)
- the write operations are distinct between indexes and nodes (which I think is also happening)

N.
/oak:index (DocumentNodeStore)
Hi,

I am confused about how /oak:index works and why it is needed in a MongoDB setting, which has native database indexes that appear to cover the same functionality. Could the Oak query engine use DB indexes directly for all indexes that are built into Oak, and Lucene indexes for all custom indexes?

I am asking this because in MongoDB I observe that 60% of the size of the nodes collection is attributable to /oak:index, and that this 60% increases the size of every non-sparse MongoDB index by about 3x. An _id + _modified compound index in MongoDB comes out at about 70GB for 100M documents (in part due to the size of _id). Without the duplication from /oak:index it could be closer to 25GB. Disk space is cheap, but MongoDB working set RAM is not cheap, and neither is page fault IO.

I fully understand why TarMK needs /oak:index, but I can't understand (conceptually) the need to implement an index inside a database table. It's like trying to implement an inverted index in an RDBMS table, which, as everyone who has ever tried (or used) that approach knows, doesn't scale nearly as far as Lucene bitmaps.

Could /oak:index be replaced by something that doesn't generate documents/DB rows as fast as it does?

Best Regards
Ian
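The size figures in this message can be checked with a little arithmetic. The sketch below rests on two assumptions that are not stated in the thread: that the size of a non-sparse MongoDB index scales linearly with the number of documents, and that the 60% share applies to the document count as well as to the collection size. The function names are hypothetical.

```python
def index_size_without_oak_index(total_index_gb, oak_index_fraction):
    """Scale an index size down to the share that covers real content only.

    Assumption: index size grows linearly with the number of documents.
    """
    if not 0 <= oak_index_fraction < 1:
        raise ValueError("oak_index_fraction must be in [0, 1)")
    return total_index_gb * (1 - oak_index_fraction)


def growth_factor(oak_index_fraction):
    """Factor by which /oak:index documents inflate a non-sparse index."""
    if not 0 <= oak_index_fraction < 1:
        raise ValueError("oak_index_fraction must be in [0, 1)")
    return 1 / (1 - oak_index_fraction)


# Figures from the message: 70GB index, 60% of the collection under /oak:index.
estimate_gb = index_size_without_oak_index(70, 0.60)
factor = growth_factor(0.60)
```

Under these assumptions the estimate comes out at roughly 28GB and the inflation factor at roughly 2.5, which is in line with the "closer to 25GB" and "about 3x" figures quoted above.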