[Wikidata-bugs] [Maniphest] [Commented On] T88717: Investigate BigData for WDQ

Manybubbles Fri, 13 Feb 2015 14:54:59 -0800

Manybubbles added a comment.

In https://phabricator.wikimedia.org/T88717#1037986, @Beebs.systap wrote:


> In https://phabricator.wikimedia.org/T88717#1037902, @Manybubbles wrote:
>
> > @Beebs.systap, is this still true (from old blog post 
> > <http://blog.blazegraph.com/?p=281>):
> >  For example, this can happen if your query has a large result set and uses 
> > ORDER BY or DISTINCT. Both ORDER BY and DISTINCT force total 
> > materialization of the query result set even if you use OFFSET/LIMIT.
> >
> > It'd be wonderful if the optimizer had the option of walking an index to 
> > make materialization not required.  Assuming that is actually more 
> > efficient.  Is there a way to limit the number of results that are 
> > materialized before any actual order/limit/offset operation?  In our case 
> > we'd probably want to just tell folks that their query isn't selective 
> > enough to allow order/limit/offset rather than keep working on a very slow 
> > query.
>
>
> From Bryan T.
>
> DISTINCT does NOT force total materialization.  That is an error.  DISTINCT 
> is a streaming operator.
>
> ORDER BY generally does force total materialization since the order needs to 
> be respected.  You have a few options here.
>
> 1. We have (and can) optimize things when the ORDER BY corresponds to a 
> natural order of an index.  For example, ORDER BY ?date when ?date is an 
> xsd:dateTime and is inlined into the OSP index.  General approaches to 
> rewrites of such queries are certainly possible, but not all queries are 
> amenable to such a rewrite.
> 2. If you do not need total order, then you can push the SELECT down into a 
> sub-SELECT and put a LIMIT on that sub-SELECT.  You can then put an ORDER BY 
> in the outer SELECT.  This approach lets you sort the first N solutions.  For 
> example, you might limit yourselves to sorting the first 1000 solutions.  
> However, there might be many more than N solutions in which case this begins 
> to look like a random sampling.
> 3. You can not use ORDER BY when you want to avoid total materialization.  
> ORDER BY really requests materialization.  If you don't want it, don't ask 
> for it.
>
>   In general, the query plan generator tries to produce query plans that are 
> left-to-right and that do not require any intermediate result sets to be 
> materialized in hash joins.  If you have queries that are simple conjunctive 
> join plans, then you get your solutions in some unspecified (and unstable) 
> order and the time to the first solution is very fast.
>
>   If you are going to run queries with really large intermediate result sets, 
> then you should use the analytic query mode (just check the box under 
> advanced options or specify the URL Query parameter - see the 
> NanoSparqlServer wiki page).  This will put the hash indices for any solution 
> set joins onto the native C process heap rather than the managed java object 
> heap.  This is supported for nearly all operators - ORDER BY is actually the 
> exception.  However, a native heap ORDER BY could be implemented.  The trick 
> is to partition the value space of the ORDER BY based on the different data 
> types and their ordering semantics. This can then be turned into an external 
> memory sort for each data type and a merge sort across all of those external 
> memory sorted subsets.


Thanks again!  Wonderful.  We're evaluating exposing some kind of query 
directly over a public API.  For a while it looked like SPARQL was safe enough 
that we could simply slap a timeout on it and pass it through.  In this case, 
at least, it looks like we'd have to do some munging (pushing stuff to a 
subquery with a limit, probably).  Looks like we could do that munging on the 
server side with an ASTOptimizer.

Thanks for http://wiki.bigdata.com/wiki/index.php/QueryEvaluation, btw.  It was 
a good read.


TASK DETAIL
  https://phabricator.wikimedia.org/T88717

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
<username>.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, Manybubbles
Cc: Jdouglas, Beebs.systap, Aklapper, Manybubbles, jkroll, Smalyshev, 
Wikidata-bugs, aude, GWicke, daniel, JanZerebecki



_______________________________________________
Wikidata-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

[Wikidata-bugs] [Maniphest] [Commented On] T88717: Investigate BigData for WDQ

Reply via email to