Re: [Zorba-coders] [Merge] lp:~zorba-coders/zorba/dataguide into lp:zorba

2013-09-19 Thread Matthias Brantner
Superseded by use-dataguide merge proposal.
-- 
https://code.launchpad.net/~zorba-coders/zorba/dataguide/+merge/173026
Your team Zorba Coders is subscribed to branch lp:zorba.

-- 
Mailing list: https://launchpad.net/~zorba-coders
Post to : zorba-coders@lists.launchpad.net
Unsubscribe : https://launchpad.net/~zorba-coders
More help   : https://help.launchpad.net/ListHelp


Re: [Zorba-coders] [Merge] lp:~zorba-coders/zorba/dataguide into lp:zorba

2013-07-23 Thread Nicolae Brinza
I've done some additional testing, and these are the results:

For the xray query, the largest that we have in the testsuite, compilation time 
with --compile-only is pretty much the same with and without the dataguide 
computation, at ~0.08 sec.

With a specially constructed query that looks like this: (see dataguide-29.jq 
test)

let $col := dml:collection()
let $col2 := ($col.cat1, $col.cat2, ... , $col.cat10)
return $col2.category.category.category ... category  (repeated ~2000 times)

the compilation time goes from ~0.7s without the dataguide to ~10s with the 
dataguide enabled, so the slowdown is significant. But this is a worst-case 
scenario. The resulting dataguide is an object 2000 levels deep. 

The compilation time can be improved significantly by:
- keeping track of the leaf nodes in the dataguide tree
- rewriting the dataguide structure a bit to store the trees incrementally 
instead of cloning them
- adding a depth cutoff
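
The three ideas above can be sketched roughly as follows. This is only an illustrative Python model, not the Zorba implementation: GuideNode, Dataguide, navigate and max_depth are hypothetical names, and a truncated node is assumed to mean "keep everything below this point", which stays correct but prunes less.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GuideNode:
    """One level of a (hypothetical) dataguide tree."""
    depth: int = 0
    children: Dict[str, "GuideNode"] = field(default_factory=dict)
    truncated: bool = False  # True once the depth cutoff was hit here


class Dataguide:
    def __init__(self, max_depth: int) -> None:
        self.root = GuideNode()
        # Tracking the leaves means a navigation step never re-walks the tree.
        self.leaves: List[GuideNode] = [self.root]
        self.max_depth = max_depth

    def navigate(self, key: str) -> None:
        """Extend every current leaf by one field lookup, in place (no cloning)."""
        next_leaves: List[GuideNode] = []
        for leaf in self.leaves:
            if leaf.depth >= self.max_depth:
                # Depth cutoff: stop refining and keep the whole subtree.
                leaf.truncated = True
                next_leaves.append(leaf)
                continue
            child = leaf.children.get(key)
            if child is None:
                child = GuideNode(depth=leaf.depth + 1)
                leaf.children[key] = child
            next_leaves.append(child)
        self.leaves = next_leaves


# A 2000-step navigation chain with a cutoff of 100 levels.
guide = Dataguide(max_depth=100)
for _ in range(2000):
    guide.navigate("category")
```

With the cutoff in place, each step past the limit costs one pass over the tracked leaves and allocates nothing.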

What do you think?




Re: [Zorba-coders] [Merge] lp:~zorba-coders/zorba/dataguide into lp:zorba

2013-07-19 Thread Nicolae Brinza

 DataGuides serve as dynamic schemas, generated from the database. What we 
 generate is a schema from the query. 

Still, it is a data schema, not a query schema. The one in the paper would be a 
Database DataGuide and ours would be a Query DataGuide. I would agree to change 
it to QueryDataguide, but I don't think there would be any confusion if it were 
simply called Dataguide.


 I think we will run into a problem. 28msec has only one buffer that is 
 accessed by all db:collection() calls in a query. Hence, the information 
 needs to be the union.

If there is no way of removing that limitation, then we can overcome this by 
taking the union of all db:collection() dataguides, which will ensure 
correctness. But it would be a pity to lose the individually computed 
dataguides for each separate call. Still, if the field names of different 
collections are mostly disjoint sets, then we won't lose much of the 
improvement. 
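
To make the union idea concrete, here is a minimal Python sketch. It assumes a dataguide can be modelled as a nested dict from field name to sub-guide; the ALL sentinel ("the whole value is needed") and the union_guides name are assumptions made for illustration and are not taken from the branch.

```python
from typing import Dict, Union

ALL = "*"  # hypothetical sentinel: everything below this point is needed
Guide = Union[str, Dict[str, "Guide"]]


def union_guides(a: Guide, b: Guide) -> Guide:
    """Return a guide that keeps every field needed by either input."""
    if a == ALL or b == ALL:
        # If one side needs the whole value, the union must keep all of it.
        return ALL
    merged = dict(a)
    for key, sub in b.items():
        merged[key] = union_guides(merged[key], sub) if key in merged else sub
    return merged


# Two db:collection() calls whose field names are mostly disjoint.
first_call = {"name": ALL, "address": {"city": ALL}}
second_call = {"address": {"zip": ALL}, "age": ALL}
combined = union_guides(first_call, second_call)
```

Here combined keeps name, age, address.city and address.zip. The union is never smaller than either input, so each call may receive fields it does not need, but none that it needs are dropped.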

Again, I suggest leaving this until I start implementing the push-down of 
projection info into the db:collection() calls. It has no impact on jn:parse() 
-- these dataguides can still be computed and kept individually for each call 
even if we take the union of the db:collection() dataguides.



Re: [Zorba-coders] [Merge] lp:~zorba-coders/zorba/dataguide into lp:zorba

2013-07-18 Thread Nicolae Brinza
 - I find the name dataguide misleading because it's a guide on the query and
 not on the data. Maybe QueryPruneGuide would be more meaningful

The query itself is not pruned, the data is. I think dataguide is the 
established term -- see for example this paper: 
http://ilpubs.stanford.edu:8090/264/1/1997-50.pdf . 

 - Can the user also use the zann_explores_json annotation?

Yes, the users can use it as well. But does it make sense for them to use it? 
If they have an external function -- it is automatically handled as if it has 
the annotation. For a UDF it doesn't really make any sense to add it. 


 - Why is the dataguide parameter on the Store's getCollection() function?
 Shouldn't it be on the function that returns the iterator? The problem is that
 a Collection object within the simplestore exists only once per collection.
 What's the semantics if multiple queries access the collection (possibly in
 parallel)?

It very much depends on how the collections are handled. Currently for Zorba 
collections it doesn't make sense to have any dataguides at all, because 
they're in-memory collections. I have not taken a look at the Sausalito code 
and have not seen how e.g. the MongoDB collections are managed. 
getCollection() seemed the most logical place where it should be passed, but 
the dataguide parameter could be easily propagated to any Store class, 
including the function that returns the iterator.

Currently each and every db:collection() call has its own dataguide, even if 
they might refer to the same collection. If the collection manager currently 
caches or reuses the collection iterators, then it might make sense to forbid 
that so that the dataguide for each individual db:collection call could be 
used. 

Alternatively, a union of the dataguides that refer to the same 
collection could be performed. But I think it is not always possible to 
determine if that is the case. 

I think this could be investigated and decided upon when implementing the 
Dataguide push-down into MongoDB, or when I take a closer look at 
Sausalito's collection manager code.


 - Did you measure the performance impact of the optimizer on some larger
 queries?


The expression tree is traversed in its entirety once and only once, visiting 
each node, so the performance should not be very different from any other 
dataflow computation (e.g. ignores sorts/order/etc.). If there are no sources, 
i.e. db:collection() or jn:parse() calls, then the dataguide computation just 
propagates NULLs, doing no calculations and almost no memory allocations (at 
most one dataguide_cb allocation per fo_exprs and several others). If there are 
sources in the tree, some union operations will be performed for some of the 
nodes. 
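
As a rough illustration of that single pass, here is a toy Python model. Expr, compute_guide and the (source id, field path) representation are hypothetical and far simpler than the real expression classes; None plays the role of the NULL that is propagated when a subtree contains no sources.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

Path = Tuple[str, Tuple[str, ...]]  # (source id, field path below that source)


@dataclass
class Expr:
    kind: str  # "source", "lookup", or anything else (e.g. "concat", "const")
    name: str = ""  # source id for "source", field name for "lookup"
    children: List["Expr"] = field(default_factory=list)


def compute_guide(expr: Expr, visited: List[Expr]) -> Optional[Set[Path]]:
    """Bottom-up pass that visits every node exactly once."""
    visited.append(expr)
    child_results = [compute_guide(child, visited) for child in expr.children]
    if expr.kind == "source":
        return {(expr.name, ())}
    live = [result for result in child_results if result is not None]
    if not live:
        # No source below this node: propagate None, nothing else to do.
        return None
    if expr.kind == "lookup":
        # A field lookup extends every path coming from its single input.
        return {(src, path + (expr.name,)) for src, path in live[0]}
    # Any other expression: union of what its inputs refer to.
    merged: Set[Path] = set()
    for result in live:
        merged |= result
    return merged


# Toy tree for (source "c1").a.b concatenated with a constant.
tree = Expr("concat", children=[
    Expr("lookup", "b", [Expr("lookup", "a", [Expr("source", "c1")])]),
    Expr("const"),
])
visited: List[Expr] = []
result = compute_guide(tree, visited)
```

The sketch is recursive for brevity, so a navigation chain thousands of levels deep would need an explicit stack or a raised recursion limit in Python.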

I will check if any of our larger queries have longer compilation times, but 
because none of them have db:collection() or jn:parse() calls, I do not expect 
any differences. 

It would make sense to have a specially constructed query that would do a 
stress-test of the dataguide code -- e.g. a 
db:collection().navigation.navigation. ... .navigation chain with the step 
repeated several thousand times, or something similar. I will try that out and 
see if it manages to slow down the compilation.
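
Such a query is easiest to produce with a small generator. The helper below is a hypothetical sketch (stress_query and its parameters are made-up names); it only builds the query text in the shape described above and does not run it.

```python
def stress_query(depth: int, step: str = "navigation") -> str:
    """Build the text of a query that navigates `depth` levels into a collection."""
    # The source expression is copied as written in the message above.
    return "db:collection()" + ("." + step) * depth


# A small instance to show the shape, and a large one for the stress test.
small = stress_query(3)
large = stress_query(2000)
```

The generated text could then be saved to a file and compiled with --compile-only to compare timings with and without the dataguide computation.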



Re: [Zorba-coders] [Merge] lp:~zorba-coders/zorba/dataguide into lp:zorba

2013-07-18 Thread Matthias Brantner
  - I find the name dataguide misleading because it's a guide on the query and
  not on the data. Maybe QueryPruneGuide would be more meaningful
 
 The query itself is not pruned, the data is. I think dataguide is the
 established term -- see for example this paper:
 http://ilpubs.stanford.edu:8090/264/1/1997-50.pdf .
DataGuides serve as dynamic schemas, generated from the database. What we 
generate is a schema from the query.


  - Why is the dataguide parameter on the Store's getCollection() function?
  Shouldn't it be on the function that returns the iterator? The problem is that
  a Collection object within the simplestore exists only once per collection.
  What's the semantics if multiple queries access the collection (possibly in
  parallel)?
 
 It very much depends on how the collections are handled. Currently for Zorba
 collections it doesn't make sense to have any dataguides at all, because
 they're in-memory collections. I have not taken a look at the Sausalito code
 and have not seen how e.g. the MongoDB collections are managed.
 getCollection() seemed the most logical place where it should be passed, but
 the dataguide parameter could be easily propagated to any Store class,
 including the function that returns the iterator.
 
 Currently each and every db:collection() call has its own dataguide, even if
 they might refer to the same collection. If the collection manager currently
 caches or reuses the collection iterators, then it might make sense to
 forbid that so that the dataguide for each individual db:collection call could
 be used.
 
 Or alternatively, an union on the dataguides that refer to the same
 collection could be performed. But I think it is not always possible to
 determine if that is the case.
 
 I think this could be investigated and decided upon when implementing the
 Dataguide push-down into MongoDB or when I would take a better look at the
 Sausalito's collection manager code.
I think we will run into a problem. 28msec has only one buffer that is accessed 
by all db:collection() calls in a query. Hence, the information needs to be the 
union.


Re: [Zorba-coders] [Merge] lp:~zorba-coders/zorba/dataguide into lp:zorba

2013-07-17 Thread Nicolae Brinza
Review: Approve




Re: [Zorba-coders] [Merge] lp:~zorba-coders/zorba/dataguide into lp:zorba

2013-07-17 Thread Matthias Brantner
Review: Needs Fixing

- I find the name dataguide misleading because it's a guide on the query and 
not on the data. Maybe QueryPruneGuide would be more meaningful
- Can the user also use the zann_explores_json annotation?
- Why is the dataguide parameter on the Store's getCollection() function? 
Shouldn't it be on the function that returns the iterator? The problem is that 
a Collection object within the simplestore exists only once per collection. 
What's the semantics if multiple queries access the collection (possibly in 
parallel)?
- Did you measure the performance impact of the optimizer on some larger 
queries?