Re: /oak:index (DocumentNodeStore)

2015-07-13 Thread Julian Reschke

On 2015-07-09 09:15, Marcel Reutegger wrote:

Hi Ian,

there are mainly two reasons why we cannot use DocumentStore
based indexes for this purpose:

- MongoDB only supports a limited number of indexes (64 per
   collection) and applications usually have a need for more
   indexes.

- Data in Oak is multi-versioned. It must be possible to query
   nodes at a specific revision of the tree.

Lucene indexes are more efficient, but are only updated
asynchronously. Whether this is acceptable usually depends on
application requirements. Experience so far shows that many indexes
can be asynchronous, because there is no hard requirement
for synchronous index updates.

Regards
  Marcel


Do the above considerations also apply to the UUID index?

Best regards, Julian


Re: /oak:index (DocumentNodeStore)

2015-07-09 Thread Chetan Mehrotra
On Thu, Jul 9, 2015 at 12:45 PM, Marcel Reutegger mreut...@adobe.com wrote:
 - Data in Oak is multi-versioned. It must be possible to query
   nodes at a specific revision of the tree.

To add: that also makes it difficult to use MongoDB indexes, because the
index itself is versioned. Instead of just indexing property 'foo',
you would need to index it for every revision.
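
To illustrate the point (this is not Oak code): a minimal Python sketch, assuming hypothetical integer revisions and a toy in-memory store. The class and method names are made up for the example. It shows that an index holding only the current value of a property cannot answer a query at an older revision.

```python
class VersionedStore:
    """Toy multi-versioned store: every write is kept together with its revision."""

    def __init__(self):
        # path -> list of (revision, value) for a single property
        self.history = {}

    def set(self, path, revision, value):
        self.history.setdefault(path, []).append((revision, value))

    def value_at(self, path, revision):
        # Latest value written at or before the requested revision.
        visible = [rv for rv in self.history.get(path, []) if rv[0] <= revision]
        return max(visible, key=lambda rv: rv[0])[1] if visible else None

    def query(self, value, revision):
        # Paths whose property equals `value` as seen at `revision`.
        return sorted(p for p in self.history if self.value_at(p, revision) == value)

    def latest_only_index(self):
        # What a plain, unversioned database index would hold: current values only.
        index = {}
        for path, writes in self.history.items():
            latest = max(writes, key=lambda rv: rv[0])[1]
            index.setdefault(latest, []).append(path)
        return index


store = VersionedStore()
store.set("/content/a", 1, "red")
store.set("/content/b", 1, "red")
store.set("/content/a", 2, "blue")  # /content/a changes in revision 2

print(store.query("red", 1))                 # both nodes match at revision 1
print(store.latest_only_index().get("red"))  # the unversioned index only knows /content/b
```

To serve the revision 1 query from an index, the index entries themselves would have to carry revision information.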

Chetan Mehrotra


Re: /oak:index (DocumentNodeStore)

2015-07-09 Thread Marcel Reutegger
Hi Ian,

there are mainly two reasons why we cannot use DocumentStore
based indexes for this purpose:

- MongoDB only supports a limited number of indexes (64 per
  collection) and applications usually have a need for more
  indexes. 

- Data in Oak is multi-versioned. It must be possible to query
  nodes at a specific revision of the tree.

Lucene indexes are more efficient, but are only updated
asynchronously. Whether this is acceptable usually depends on
application requirements. Experience so far shows that many indexes
can be asynchronous, because there is no hard requirement
for synchronous index updates.

Regards
 Marcel




Re: /oak:index (DocumentNodeStore)

2015-07-09 Thread Ian Boston
Hi Marcel,
Thanks for the response, that makes sense.

I assume that there are already 64 indexes in /oak:index before any custom
ones are added, which makes it impossible to remove /oak:index for
MongoDB. With that many, it's going to be impractical for all RDBMSs.

Would there be any benefit in moving /oak:index out of the main document
collection, so that any MongoDB indexes in the collection that have no
relevance to /oak:index don't get bloated?
Or, more generally:
Is there a different way of storing the data in /oak:index so that it
doesn't result in so many MongoDB documents?


Best Regards
Ian





Re: /oak:index (DocumentNodeStore)

2015-07-09 Thread Ian Boston
On 9 July 2015 at 09:16, Chetan Mehrotra chetan.mehro...@gmail.com wrote:

 On Thu, Jul 9, 2015 at 12:45 PM, Marcel Reutegger mreut...@adobe.com
 wrote:
  - Data in Oak is multi-versioned. It must be possible to query
nodes at a specific revision of the tree.

 To add - That also makes it difficult to use Mongo indexes as the
 index itself is versioned. So instead of just indexing property 'foo'
 you need to index it for every revision


Won't compound indexes work ?

{ _id : 1, _modified: 1, _revision: 1 } ?

They are bigger.
_id: 211 bytes per entry on average
_modified, _id: 233 bytes
_revision, _modified, _id: probably close to 400 bytes, as _revision is a
string.

I guess the way to tell is to generate the index on a test database and
see what impact it has.
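
As a back-of-the-envelope check before building the index, the per-entry sizes above can be multiplied out. This is only a rough sketch: it assumes the 100M document count mentioned earlier in the thread, ignores B-tree and storage overhead, and the 400-byte figure is a guess, not a measurement. The function name is made up for the example.

```python
def estimated_index_size_gb(entries, avg_bytes_per_entry):
    """Rough index size in decimal GB: entry count times average entry size."""
    return entries * avg_bytes_per_entry / 1e9


DOCUMENTS = 100_000_000  # document count quoted earlier in the thread

candidates = [
    ("_id", 211),
    ("_modified, _id", 233),
    ("_revision, _modified, _id", 400),  # guessed size, not measured
]

for name, avg_bytes in candidates:
    size = estimated_index_size_gb(DOCUMENTS, avg_bytes)
    print(f"{name}: ~{size:.1f} GB")
```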

Best Regards
Ian




 Chetan Mehrotra



Re: /oak:index (DocumentNodeStore)

2015-07-09 Thread Thomas Mueller
Hi,

Using MongoDB indexes directly doesn't work because of the MVCC model.
What we could do is add special collections (basically one collection per
index). This would require quite some work, which would then need to be
repeated for RDBMK.

 I observe that 60% of the size of the nodes collection is attributable
to /oak:index

Could you try to find out which index(es) are responsible for that? There
would be multiple ways to reduce the number of nodes:

0) remove unused indexes
1) convert some indexes to Lucene property indexes
2) convert to unique index if possible (as this uses less space)
3) add a feature to only index a subset of the keys (only index what we
need)
4) store the last x levels of the index structure as properties instead
of as nodes


3) and 4) would require changes in Oak. For 4), the change should reduce
the number of nodes, but might cause merge conflicts (not sure). With
level = 1, it would be:

  /content/products/a @color=red
  /content/products/b @color=red

  /oak:index/color/red/content
  /oak:index/color/red/content/products @a = true, @b = true

instead of

  /oak:index/color/red/content
  /oak:index/color/red/content/products
  /oak:index/color/red/content/products/a @match = true
  /oak:index/color/red/content/products/b @match = true

With level > 1, it would require some escaping magic, but we could save
some more nodes, and basically it would be:

level = 2:

  /oak:index/color/red/content @products_a = true, @products_b = true


level = 3:

  /oak:index/color/red @content_products_a = true, @content_products_b = true
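
The layouts above can be sketched in a few lines of Python. This is only an illustration of idea 4), not existing Oak code: the function name and the "_" separator are assumptions, and the escaping magic mentioned above is left out. The node count includes the /oak:index/color/red node itself.

```python
def index_nodes(index_root, value, content_paths, level=0):
    """Return {index node path: {property name: True}} for one indexed value.

    `level` is the number of trailing path segments folded into a property
    name; level=0 is the current one-node-per-match layout.
    """
    nodes = {}
    base = f"{index_root}/{value}"
    for path in content_paths:
        segments = path.strip("/").split("/")
        folded_count = min(level, len(segments))
        kept = segments[:len(segments) - folded_count]
        folded = segments[len(kept):]
        current = base
        nodes.setdefault(current, {})
        for segment in kept:
            current = f"{current}/{segment}"
            nodes.setdefault(current, {})
        # No escaping here: a real implementation would have to handle "_" in names.
        prop = "_".join(folded) if folded else "match"
        nodes[current][prop] = True
    return nodes


paths = ["/content/products/a", "/content/products/b"]
for level in range(4):
    layout = index_nodes("/oak:index/color", "red", paths, level)
    print(f"level = {level}: {len(layout)} index nodes")
    for node, props in layout.items():
        print("  ", node, props)
```

For these two content nodes, the number of index nodes drops from 5 (level = 0) to 3, 2 and 1 as the level increases.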




Regards,
Thomas








Re: /oak:index (DocumentNodeStore)

2015-07-09 Thread Ian Boston
Hi,

On 9 July 2015 at 10:33, Thomas Mueller muel...@adobe.com wrote:

 Hi,

 Using MongoDB indexes directly doesn't work because of the MVCC model.
 What we could do is add special collections (basically one collection per
 index). This would requires some work, which then would need to be
 repeated for RDBMK. It would be quite some work.


ok, understood.



  I observe that 60% of the size of the nodes collection is attributable
 to /oak:index

 Could you try to find out which index(es) are responsible for that?


Marcel and Chetan have been working on the repository I was observing. I am
sure they can point you to the details offline, if you are not aware of them
already. They were able to remove about 25% of the 60% under /oak:index,
but IIUC most of the remainder is not local customisations, and perhaps
40% of what remains is not local customisations and must be synchronous,
which indicates a 1:2 ratio between real content nodes and MongoDB
documents before any MongoDB indexes are considered. That ratio was the
motivation for asking the question. Chetan thought I should discuss it on
oak-dev.

Marcel and Chetan, who are far more knowledgeable than I am in this area,
have executed 0) and 1) below.

Best Regards
Ian



 There
 would be multiple ways to reduce the number of nodes:

 0) remove unused indexes
 1) convert some indexes to Lucene property indexes

2) convert to unique index if possible (as this uses less space)

3) add a feature to only index a subset of the keys (only index what we
 need)
 4) convert the last x levels of the index structure as a property instead
 of as a node






Re: /oak:index (DocumentNodeStore)

2015-07-09 Thread Norberto Leite
A collection per index (or a separate collection for indexes only), especially
for the asynchronous ones, will translate into a big benefit if the following hold:
- when querying on index nodes we don't need to get all related node
documents (which is happening)
- the write operations are distinct between indexes and nodes (which I
think is also happening)

N.





/oak:index (DocumentNodeStore)

2015-07-08 Thread Ian Boston
Hi,
I am confused about how /oak:index works and why it is needed in a MongoDB
setting, which has native database indexes that appear to cover the same
functionality. Could the Oak Query engine use DB indexes directly for all
indexes that are built into Oak, and Lucene indexes for all custom indexes?

I am asking this because in MongoDB I observe that 60% of the size of the
nodes collection is attributable to /oak:index, and that this 60% increases
every non-sparse MongoDB index by about 3x. An _id + _modified compound
index in MongoDB comes out at about 70GB for 100M documents (in part due to
the size of _id). Without the duplication from /oak:index it could be closer
to 25GB. Disk space is cheap, but MongoDB working set RAM is not cheap, and
neither is page fault IO.

I fully understand why TarMK needs /oak:index, but I can't understand
(conceptually) the need to implement an index inside a database table.
It's like trying to implement an inverted index in an RDBMS table, which,
as everyone who has ever tried (or used) that approach knows, doesn't scale
nearly as far as Lucene bitmaps.

Could /oak:index be replaced by something that doesn't generate
Documents/db rows as fast as it does ?

Best Regards
Ian