Beebs.systap added a comment.

In https://phabricator.wikimedia.org/T88717#1037902, @Manybubbles wrote:

> @Beebs.systap, is this still true (from old blog post 
> <http://blog.blazegraph.com/?p=281>):
>  For example, this can happen if your query has a large result set and uses 
> ORDER BY or DISTINCT. Both ORDER BY and DISTINCT force total materialization 
> of the query result set even if you use OFFSET/LIMIT.
>
> It'd be wonderful if the optimizer had the option of walking an index to make 
> materialization not required.  Assuming that is actually more efficient.  Is 
> there a way to limit the number of results that are materialized before any 
> actual order/limit/offset operation?  In our case we'd probably want to just 
> tell folks that their query isn't selective enough to allow 
> order/limit/offset rather than keep working on a very slow query.


From Bryan T.

DISTINCT does NOT force total materialization.  That is an error.  DISTINCT is 
a streaming operator.

ORDER BY generally does force total materialization since the order needs to be 
respected.  You have a few options here.

1. We have (and can) optimize things when the ORDER BY corresponds to a natural 
order of an index.  For example, ORDER BY ?date when ?date is an xsd:dateTime 
and is inlined into the OSP index.  General approaches to rewrites of such 
queries are certainly possible, but not all queries are amenable to such a 
rewrite.

2. If you do not need total order, then you can push the SELECT down into a 
sub-SELECT and put a LIMIT on that sub-SELECT.  You can then put an ORDER BY in 
the outer SELECT.  This approach lets you sort the first N solutions.  For 
example, you might limit yourselves to sorting the first 1000 solutions.  
However, there might be many more than N solutions in which case this begins to 
look like a random sampling.

3. You can not use ORDER BY when you want to avoid total materialization.  
ORDER BY really requests materialization.  If you don't want it, don't ask for 
it.

In general, the query plan generator tries to produce query plans that are 
left-to-right and that do not require any intermediate result sets to be 
materialized in hash joins.  If you have queries that are simple conjunctive 
join plans, then you get your solutions in some unspecified (and unstable) 
order and the time to the first solution is very fast.

If you are going to run queries with really large intermediate result sets, 
then you should use the analytic query mode (just check the box under advanced 
options or specify the URL Query parameter - see the NanoSparqlServer wiki 
page).  This will put the hash indices for any solution set joins onto the 
native C process heap rather than the managed java object heap.  This is 
supported for nearly all operators - ORDER BY is actually the exception.  
However, a native heap ORDER BY could be implemented.  The trick is to 
partition the value space of the ORDER BY based on the different data types and 
their ordering semantics. This can then be turned into an external memory sort 
for each data type and a merge sort across all of those external memory sorted 
subsets.


TASK DETAIL
  https://phabricator.wikimedia.org/T88717

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
<username>.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, Beebs.systap
Cc: Jdouglas, Beebs.systap, Aklapper, Manybubbles, jkroll, Smalyshev, 
Wikidata-bugs, aude, GWicke, daniel, JanZerebecki



_______________________________________________
Wikidata-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

Reply via email to