[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

GWicke Mon, 22 Dec 2014 22:36:06 -0800

GWicke added a comment.

Re performance and indexing, from a mail thread:

Earlier today Stas & I looked a bit into what is happening behind the 
scenes in some of the slower queries, such as 
https://www.mediawiki.org/wiki/Wikibase/Indexing/Benchmarks#List_of_humans_having_occupation_writer_but_not_author


We found that in a query that uses predicates on more than one property 
per vertex, Titan's default query strategy is to retrieve a list of 
candidates from the first index (fast), and then load the details of each 
candidate vertex in order to filter on the second predicate. This is pretty 
slow, as it involves O(candidates) queries to Cassandra.
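
To make the cost difference concrete, here is a toy in-memory model in 
Python. It is not Titan code: ToyStore, filter_after_first_index and 
intersect_indexes are made-up names, and properties are assumed to be 
single-valued for simplicity. It only counts how many per-vertex loads each 
strategy would issue.

```python
from collections import defaultdict


class ToyStore:
    """Hypothetical stand-in for a graph backend; counts per-vertex loads."""

    def __init__(self, vertices):
        self.vertices = vertices        # vertex id -> {property: value}
        self.reads = 0                  # number of per-vertex loads issued
        self.index = defaultdict(set)   # (property, value) -> vertex ids
        for vid, props in vertices.items():
            for key, value in props.items():
                self.index[(key, value)].add(vid)

    def load(self, vid):
        # Models one round trip to the storage backend for a single vertex.
        self.reads += 1
        return self.vertices[vid]


def filter_after_first_index(store, first, second):
    """Use one index for candidates, then load each candidate to filter."""
    key, value = second
    candidates = store.index[first]
    return {vid for vid in candidates if store.load(vid).get(key) == value}


def intersect_indexes(store, first, second):
    """Answer both predicates from index data alone, with no vertex loads."""
    return store.index[first] & store.index[second]
```

With 1000 vertices matching the first predicate and 10 matching both, the 
first strategy issues 1000 loads to return 10 results, while the second 
issues none. A real mixed index works differently internally; the sketch 
only illustrates why pushing both predicates into the index avoids the 
O(candidates) round trips.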

I think mixed indexes as documented in 
http://s3.thinkaurelius.com/docs/titan/current/indexes.html#index-mixed should 
help here, as they support efficient matching on an arbitrary combination of 
attributes, along with advanced range and full text queries. They do require an 
indexing backend like Elasticsearch to be configured. Nik, is there an 
Elasticsearch setup we could use, or would it be reasonably easy to spin up a 
new one?

Stas replied:

> I've looked into this matter, and there's one wrinkle in this - mixed indexes 
> now don't support Date and SET fields, which means we can not index 
> everything - some fields (among them frequently used date start/end fields 
> and P*link fields which are SET) would not be indexable with Elastic. They 
> say it's a "temporary limitations" but so far 0.9 still has it.


Dates are actually interesting re the ranges we need to represent. A signed 
64-bit millisecond-resolution Unix timestamp (as used by Java Date) reaches 
from 292269055 BC to 292278994 AD 
<http://stackoverflow.com/questions/5488038/valid-range-for-java-util-date>. 
That is not quite enough to represent the current estimates for the Big Bang 
<http://en.wikipedia.org/wiki/Big_Bang>, for example. ISO 8601 timestamps 
<http://en.wikipedia.org/wiki/ISO_8601> support much less range, even in the 
six-digit extended mode. Maybe it makes sense to index years separately as a 
long? This will also be fun in JSON.
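
A quick back-of-the-envelope check of those ranges, as a Python sketch. The 
names are made up for illustration, the Big Bang figure is a rough 13.8 
billion years, and the JSON part is only one possible reading of "fun": 
parsers that read numbers as IEEE 754 doubles (JavaScript does) represent 
integers exactly only up to 2^53 - 1.

```python
INT64_MAX = 2**63 - 1
SECONDS_PER_YEAR = 365.2425 * 24 * 60 * 60  # mean Gregorian year
BIG_BANG_YEARS_AGO = 13.8e9                 # rough estimate, an assumption
MAX_SAFE_JSON_INT = 2**53 - 1               # exact-integer limit of a double


def years_covered(ticks_per_second):
    """Years reachable on one side of the epoch with a signed 64-bit count."""
    return INT64_MAX / ticks_per_second / SECONDS_PER_YEAR


def json_safe(n):
    """True if n survives a round trip through a double-based JSON parser."""
    return abs(n) <= MAX_SAFE_JSON_INT


ms_years = years_covered(1000)  # millisecond ticks, as in Java Date
s_years = years_covered(1)      # second ticks, for comparison
```

Millisecond ticks cover roughly 2.9e8 years either side of 1970, which 
falls short of the Big Bang; second ticks would cover roughly 2.9e11 years. 
A plain year number such as -13800000000 stays well inside the double-safe 
range, whereas a timestamp near the 64-bit limits does not.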

For SET I could imagine a full-text index on a string with each P*link property 
separated by a space. Elasticsearch should be good at indexing that. Not sure 
how much more complex that would be to maintain compared to the current SET 
index.
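
A minimal sketch of that encoding idea in Python, assuming the values are 
item ids without whitespace and that the full-text field is tokenized on 
whitespace. encode_links and has_link are hypothetical helpers, not part of 
Titan or Elasticsearch.

```python
def encode_links(values):
    """Join a set of link targets into one space-separated string."""
    for value in values:
        if not value or any(ch.isspace() for ch in value):
            raise ValueError("values must be non-empty and whitespace-free")
    # Sorted so the same set always produces the same field value.
    return " ".join(sorted(set(values)))


def has_link(field, wanted):
    """Token match, as a whitespace-tokenized index would do it."""
    return wanted in field.split()
```

Matching has to happen on whole tokens: a plain substring test would 
wrongly find Q5 inside Q50. One maintenance cost compared to a native SET 
is that adding or removing a single member means rewriting the whole string.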


TASK DETAIL
  https://phabricator.wikimedia.org/T76373

To: Smalyshev, GWicke
Cc: Smalyshev, Manybubbles, GWicke, JanZerebecki, aude, Lydia_Pintscher, 
Eloquence, aaron, jkroll, Wikidata-bugs, daniel



