Manybubbles added a comment.

> I think mixed indexes as documented in
> http://s3.thinkaurelius.com/docs/titan/current/indexes.html#index-mixed
> should help here, as they support efficient matching on an arbitrary
> combination of attributes, along with advanced range and full text queries.
> They do require an indexing backend like Elasticsearch to be configured. Nik,
> is there an Elasticsearch setup we could use, or would it be reasonably easy
> to spin up a new one?
I'd spin up a new one, probably just on a single node. I think in the long run we can probably run this on the production search cluster, but for now let's keep it off just in case it does something stupid. I can put together some puppet changes to put a single-node Elasticsearch instance on einsteinium.

> For Date, I wonder if support can't be added to Titan, since Elastic AFAIK
> supports dates.

It sure does. They are parsed and formatted automatically but amount to a Java long of milliseconds since the epoch under the hood. As @GWicke said, that means they can't reach back to the Big Bang. If instead of dealing in dates we dealt in _seconds_ since the epoch, we could reach back to the Big Bang so long as current estimates are right to within an order of magnitude. Instead of 292 million years ago we'd have 292 billion years ago.

> Also, while Java can support negative years, right now the import does not
> support them since the Java parser fails to properly parse them. If we're
> moving to Java 8, there are better date APIs AFAIR, so maybe they allow to
> handle it properly.

Elasticsearch is based on Joda Time, which can handle negative years just fine. It can't handle negative years that far back, though. I've filed an issue <https://github.com/elasticsearch/elasticsearch/issues/9048> for it but I imagine we'll be on our own. I believe they are getting infinite precision numbers at some point, but the kind of dates we handle are probably best stored in floating point instead.

> For SET it won't be more complex to maintain, probably, but I'm not sure if
> the lookups would be fast enough. I could create an additional field for that
> and see how it behaves, and then we could drop the field that is not needed.

Elasticsearch totally supports sets. Like, so supports. We use them for stuff like hastemplate and incategory. Have a look at the dump <http://en.wikipedia.org/wiki/Nikola_Tesla?action=cirrusdump> for a page. Lucene has native support for sets of strings, sorta.
It's a very leaky abstraction around pretending that they are one big string with junk tokens between them. The junk tokens prevent phrase searches from spanning multiple values in the set.

Looking at mixed indexes, I wonder how they are backed by Elasticsearch. Lucene/Elasticsearch pretty much indexes everything independently and then ANDs the results of traversing multiple indexes together to get the answer. That deserves some looking into.

TASK DETAIL
  https://phabricator.wikimedia.org/T76373
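
For anyone who wants to check the 292 million vs. 292 billion figures, here is a rough back-of-the-envelope sketch. It assumes a signed 64-bit counter (a Java long), a 365.25-day year, and an age of the universe of about 13.8 billion years; none of those constants come from Titan or Elasticsearch, and the function name is made up for illustration.

```python
# Largest value a signed 64-bit integer (a Java long) can hold.
LONG_MAX = 2**63 - 1

# Assumption: a year of 365.25 days, expressed in seconds.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

# Assumption: rough current estimate of the age of the universe, in years.
AGE_OF_UNIVERSE_YEARS = 13.8e9


def reach_in_years(ticks_per_second):
    """Years away from the epoch that a signed 64-bit tick count can represent."""
    return LONG_MAX / ticks_per_second / SECONDS_PER_YEAR


millis_reach = reach_in_years(1000)  # the long counts milliseconds
seconds_reach = reach_in_years(1)    # the long counts seconds

print(f"milliseconds: about {millis_reach / 1e6:.0f} million years")
print(f"seconds:      about {seconds_reach / 1e9:.0f} billion years")
print("seconds reach past the Big Bang:", seconds_reach > AGE_OF_UNIVERSE_YEARS)
```

Counting seconds instead of milliseconds buys a factor of 1000 in range at the cost of sub-second resolution, which probably doesn't matter for dates that far back.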

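
To illustrate the "one big string with junk between the values" point, here is a toy model of how a gap between values stops phrase searches from spanning them. This is not Lucene's or Elasticsearch's API: the functions, the gap size, and the sample category names are all made up, and the gap is modelled as skipped token positions rather than literal junk tokens.

```python
POSITION_GAP = 100  # hypothetical gap size, chosen only for illustration


def index_values(values, gap=POSITION_GAP):
    """Map each token to the positions where it occurs across all values."""
    positions = {}
    next_position = 0
    for value in values:
        for token in value.lower().split():
            positions.setdefault(token, []).append(next_position)
            next_position += 1
        # Skip ahead so the next value's tokens are never adjacent to this one's.
        next_position += gap
    return positions


def phrase_match(positions, phrase):
    """True if the phrase's tokens occur at consecutive positions."""
    tokens = phrase.lower().split()
    if not tokens:
        return False
    for start in positions.get(tokens[0], []):
        if all(start + offset in positions.get(token, [])
               for offset, token in enumerate(tokens)):
            return True
    return False


# Made-up sample values standing in for a set-valued field.
categories = ["Serbian inventors", "American physicists"]
index = index_values(categories)

print(phrase_match(index, "serbian inventors"))   # phrase inside one value
print(phrase_match(index, "inventors american"))  # phrase across two values
```

With the gap set to 0 the second query would match, which is the leak the gap is there to plug.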