Beebs.systap added a comment. In https://phabricator.wikimedia.org/T88717#1037902, @Manybubbles wrote:
> @Beebs.systap, is this still true (from old blog post > <http://blog.blazegraph.com/?p=281>): > For example, this can happen if your query has a large result set and uses > ORDER BY or DISTINCT. Both ORDER BY and DISTINCT force total materialization > of the query result set even if you use OFFSET/LIMIT. > > It'd be wonderful if the optimizer had the option of walking an index to make > materialization not required. Assuming that is actually more efficient. Is > there a way to limit the number of results that are materialized before any > actual order/limit/offset operation? In our case we'd probably want to just > tell folks that their query isn't selective enough to allow > order/limit/offset rather than keep working on a very slow query. From Bryan T. DISTINCT does NOT force total materialization. That is an error. DISTINCT is a streaming operator. ORDER BY generally does force total materialization since the order needs to be respected. You have a few options here. 1. We have (and can) optimize things when the ORDER BY corresponds to a natural order of an index. For example, ORDER BY ?date when ?date is an xsd:dateTime and is inlined into the OSP index. General approaches to rewrites of such queries are certainly possible, but not all queries are amenable to such a rewrite. 2. If you do not need total order, then you can push the SELECT down into a sub-SELECT and put a LIMIT on that sub-SELECT. You can then put an ORDER BY in the outer SELECT. This approach lets you sort the first N solutions. For example, you might limit yourselves to sorting the first 1000 solutions. However, there might be many more than N solutions in which case this begins to look like a random sampling. 3. You can not use ORDER BY when you want to avoid total materialization. ORDER BY really requests materialization. If you don't want it, don't ask for it. In general, the query plan generator tries to produce query plans that are left-to-right and that do not require any intermediate result sets to be materialized in hash joins. If you have queries that are simple conjunctive join plans, then you get your solutions in some unspecified (and unstable) order and the time to the first solution is very fast. If you are going to run queries with really large intermediate result sets, then you should use the analytic query mode (just check the box under advanced options or specify the URL Query parameter - see the NanoSparqlServer wiki page). This will put the hash indices for any solution set joins onto the native C process heap rather than the managed java object heap. This is supported for nearly all operators - ORDER BY is actually the exception. However, a native heap ORDER BY could be implemented. The trick is to partition the value space of the ORDER BY based on the different data types and their ordering semantics. This can then be turned into an external memory sort for each data type and a merge sort across all of those external memory sorted subsets. TASK DETAIL https://phabricator.wikimedia.org/T88717 REPLY HANDLER ACTIONS Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign <username>. EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Smalyshev, Beebs.systap Cc: Jdouglas, Beebs.systap, Aklapper, Manybubbles, jkroll, Smalyshev, Wikidata-bugs, aude, GWicke, daniel, JanZerebecki _______________________________________________ Wikidata-bugs mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
