GWicke added a comment.

Re performance and indexing, from a mail thread:

Earlier today Stas & I were looking a bit into what is happening behind the scenes in some of the slower queries like https://www.mediawiki.org/wiki/Wikibase/Indexing/Benchmarks#List_of_humans_having_occupation_writer_but_not_author
We found that in a query that uses predicates on more than a single property per vertex, the default Titan query strategy is to retrieve a list of candidates based on the first index (fast), and then load the details of each candidate vertex in order to filter on the second predicate. This is pretty slow, as it involves O(candidates) queries to Cassandra.

I think mixed indexes as documented in http://s3.thinkaurelius.com/docs/titan/current/indexes.html#index-mixed should help here, as they support efficient matching on an arbitrary combination of attributes, along with advanced range and full-text queries. They do require an indexing backend like Elasticsearch to be configured. Nik, is there an Elasticsearch setup we could use, or would it be reasonably easy to spin up a new one?

Stas replied:

> I've looked into this matter, and there's one wrinkle in this - mixed indexes
> now don't support Date and SET fields, which means we can not index
> everything - some fields (among them frequently used date start/end fields
> and P*link fields which are SET) would not be indexable with Elastic. They
> say it's a "temporary limitations" but so far 0.9 still has it.

Dates are actually interesting re the ranges we need to represent. A signed 64-bit, millisecond-resolution Unix timestamp (as used by Java Date) reaches from 292269055 BC to 292278994 AD <http://stackoverflow.com/questions/5488038/valid-range-for-java-util-date>. That is not quite enough to represent the current estimates for the Big Bang <http://en.wikipedia.org/wiki/Big_Bang>, for example. ISO 8601 timestamps <http://en.wikipedia.org/wiki/ISO_8601> support much less range, even in the six-digit extended mode. Maybe it makes sense to index years separately as a long? This will also be fun in JSON.

For SET I could imagine a full-text index on a string with each P*link property separated by a space. Elasticsearch should be good at indexing that. Not sure how much more complex that would be to maintain compared to the current SET index.
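To make the date-range arithmetic and the two indexing ideas concrete, here is a rough Python sketch. It is only an illustration under stated assumptions: the 13.8 billion year figure is an approximate Big Bang estimate, and the helper names, field names ("year", "P106link_text") and item IDs are hypothetical, not part of Titan, Elasticsearch or the current importer.

```python
# Years representable on one side of the epoch by a signed 64-bit timestamp.
SECONDS_PER_YEAR = 365.2425 * 24 * 60 * 60  # mean Gregorian year
INT64_MAX = 2**63 - 1

def max_years(ticks_per_second):
    """Approximate number of years a signed 64-bit counter can span."""
    return INT64_MAX / (ticks_per_second * SECONDS_PER_YEAR)

BIG_BANG_YEARS_AGO = 13.8e9  # assumption: approximate current estimate

ms_range = max_years(1000)  # millisecond ticks, as in java.util.Date
s_range = max_years(1)      # second ticks, for comparison

def fits_json_double(n):
    """True if n survives a JSON parser that stores numbers as IEEE doubles."""
    return abs(n) <= 2**53

def links_to_text(values):
    """Join the values of a SET property into one space-separated string,
    so that a full-text index could match individual values as tokens."""
    return " ".join(sorted(values))

def to_index_doc(year, link_values):
    """Hypothetical index document: the year as a plain integer,
    the SET property flattened into a single string."""
    return {"year": int(year), "P106link_text": links_to_text(link_values)}

doc = to_index_doc(-13_800_000_000, {"Q482980", "Q36180"})

print(f"ms resolution covers about {ms_range:.3g} years")
print(f"s resolution covers about {s_range:.3g} years")
print(doc)
```

With millisecond ticks the range comes out at roughly 292 million years, which matches the Java Date limits quoted above and falls short of the Big Bang by well over an order of magnitude. A plain integer year fits easily, and years of this size also stay below 2^53, so they should survive JSON parsers that use doubles. The cost of the flattened string is that every change to the SET means rewriting the whole field.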
TASK DETAIL https://phabricator.wikimedia.org/T76373
