Manybubbles added a comment. In https://phabricator.wikimedia.org/T88717#1037986, @Beebs.systap wrote:
> In https://phabricator.wikimedia.org/T88717#1037902, @Manybubbles wrote: > > > @Beebs.systap, is this still true (from old blog post > > <http://blog.blazegraph.com/?p=281>): > > For example, this can happen if your query has a large result set and uses > > ORDER BY or DISTINCT. Both ORDER BY and DISTINCT force total > > materialization of the query result set even if you use OFFSET/LIMIT. > > > > It'd be wonderful if the optimizer had the option of walking an index to > > make materialization not required. Assuming that is actually more > > efficient. Is there a way to limit the number of results that are > > materialized before any actual order/limit/offset operation? In our case > > we'd probably want to just tell folks that their query isn't selective > > enough to allow order/limit/offset rather than keep working on a very slow > > query. > > > From Bryan T. > > DISTINCT does NOT force total materialization. That is an error. DISTINCT > is a streaming operator. > > ORDER BY generally does force total materialization since the order needs to > be respected. You have a few options here. > > 1. We have (and can) optimize things when the ORDER BY corresponds to a > natural order of an index. For example, ORDER BY ?date when ?date is an > xsd:dateTime and is inlined into the OSP index. General approaches to > rewrites of such queries are certainly possible, but not all queries are > amenable to such a rewrite. > 2. If you do not need total order, then you can push the SELECT down into a > sub-SELECT and put a LIMIT on that sub-SELECT. You can then put an ORDER BY > in the outer SELECT. This approach lets you sort the first N solutions. For > example, you might limit yourselves to sorting the first 1000 solutions. > However, there might be many more than N solutions in which case this begins > to look like a random sampling. > 3. You can not use ORDER BY when you want to avoid total materialization. > ORDER BY really requests materialization. If you don't want it, don't ask > for it. > > In general, the query plan generator tries to produce query plans that are > left-to-right and that do not require any intermediate result sets to be > materialized in hash joins. If you have queries that are simple conjunctive > join plans, then you get your solutions in some unspecified (and unstable) > order and the time to the first solution is very fast. > > If you are going to run queries with really large intermediate result sets, > then you should use the analytic query mode (just check the box under > advanced options or specify the URL Query parameter - see the > NanoSparqlServer wiki page). This will put the hash indices for any solution > set joins onto the native C process heap rather than the managed java object > heap. This is supported for nearly all operators - ORDER BY is actually the > exception. However, a native heap ORDER BY could be implemented. The trick > is to partition the value space of the ORDER BY based on the different data > types and their ordering semantics. This can then be turned into an external > memory sort for each data type and a merge sort across all of those external > memory sorted subsets. Thanks again! Wonderful. We're evaluating exposing some kind of query directly over a public API. For a while it looked like SPARQL was safe enough that we could simply slap a timeout on it and pass it through. In this case, at least, it looks like we'd have to do some munging (pushing stuff to a subquery with a limit, probably). Looks like we could do that munging on the server side with an ASTOptimizer. Thanks for http://wiki.bigdata.com/wiki/index.php/QueryEvaluation, btw. It was a good read. TASK DETAIL https://phabricator.wikimedia.org/T88717 REPLY HANDLER ACTIONS Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign <username>. EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Smalyshev, Manybubbles Cc: Jdouglas, Beebs.systap, Aklapper, Manybubbles, jkroll, Smalyshev, Wikidata-bugs, aude, GWicke, daniel, JanZerebecki _______________________________________________ Wikidata-bugs mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
