Manybubbles added a comment.

> I think mixed indexes as documented in 
> http://s3.thinkaurelius.com/docs/titan/current/indexes.html#index-mixed 
> should help here, as they support efficient matching on an arbitrary 
> combination of attributes, along with advanced range and full text queries. 
> They do require an indexing backend like Elasticsearch to be configured. Nik, 
> is there an Elasticsearch setup we could use, or would it be reasonably easy 
> to spin up a new one?


I'd spin up a new one - probably just on a single node.  I think in the long 
run we can probably run this on the production search cluster, but for now 
let's keep it off just in case it does something stupid.  I can put together 
some Puppet changes to put a single-node Elasticsearch instance on einsteinium.

> For Date, I wonder if support can't be added to Titan, since Elastic AFAIK 
> supports dates.


It sure does.  They are parsed and formatted automatically but amount to a Java 
long of milliseconds since epoch under the hood.  As @GWicke said, that means 
they can't reach back to the Big Bang.  If instead of dealing in dates we dealt 
in _seconds_ since epoch, we could reach back to the Big Bang so long as current 
estimates are right to within an order of magnitude.  Instead of 292 million 
years ago we'd have 292 billion years ago.
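
As a rough check of those two figures, here is a small Python sketch.  It 
assumes a signed 64-bit counter and a 365.25-day year; the helper name 
reach_in_years is made up for illustration.

```python
# Range of a signed 64-bit counter, expressed in years, for two tick sizes.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # Julian year (an assumption)
MAX_LONG = 2**63 - 1                      # largest value a Java long can hold


def reach_in_years(ticks_per_second):
    """Years covered by MAX_LONG ticks at the given resolution."""
    return MAX_LONG / ticks_per_second / SECONDS_PER_YEAR


millis_reach = reach_in_years(1000)  # counting milliseconds since epoch
seconds_reach = reach_in_years(1)    # counting seconds since epoch

print(f"milliseconds: about {millis_reach / 1e6:.0f} million years")
print(f"seconds:      about {seconds_reach / 1e9:.0f} billion years")
```

The same range applies in both directions from the epoch, since the counter is 
signed.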

> Also, while Java can support negative years, right now the import does not 
> support them since the Java parser fails to properly parse them. If we're 
> moving to Java 8, there are better date APIs AFAIR, so maybe they allow to 
> handle it properly.


Elasticsearch is based on Joda Time, which can handle negative years just fine.  
It can't handle negative years that far back, though.  I've filed an issue 
<https://github.com/elasticsearch/elasticsearch/issues/9048> for it but I 
imagine we'll be on our own.  I believe they are getting arbitrary-precision 
numbers at some point, but the kind of dates we handle are probably best stored 
in floating point instead.
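
To get a feel for what floating point would cost in precision, here is a 
sketch that assumes dates are stored as a 64-bit double of seconds since epoch 
(my assumption, not a decided format) and a 365.25-day year.  It needs Python 
3.9 or later for math.ulp; spacing_at is a made-up helper name.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # Julian year (an assumption)


def spacing_at(years_before_epoch):
    """Gap in seconds between adjacent doubles at that distance from epoch."""
    seconds = -years_before_epoch * SECONDS_PER_YEAR
    return math.ulp(seconds)  # ulp of a negative value equals ulp of its magnitude


for years in (100, 1e6, 13.8e9):
    print(f"{years:>16,.0f} years back: spacing {spacing_at(years)} s")
```

At 13.8 billion years back the spacing comes out at 64 seconds, while dates 
within a century of the epoch keep sub-microsecond resolution.  That trade is 
probably acceptable for dates whose real uncertainty is far larger.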

> For SET it won't be more complex to maintain, probably, but I'm not sure if 
> the lookups would be fast enough. I could create an additional field for that 
> and see how it behaves, and then we could drop the field that is not needed.


Elasticsearch totally supports sets.  Like, so supports.  We use them for stuff 
like hastemplate and incategory.  Have a look at the dump 
<http://en.wikipedia.org/wiki/Nikola_Tesla?action=cirrusdump> for a page.  
Lucene has native support for sets of strings, sorta.  It's a very leaky 
abstraction around pretending that they are one big string with junk tokens 
between them.  The junk tokens prevent phrase searches from spanning multiple 
values in the set.
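
A toy Python model of that layout, to show why the gap matters.  This is not 
Lucene code: the gap size of 100, the function names and the two sample values 
are all made up for illustration.

```python
# Toy model of a multi-valued string field laid out as one token stream.
# POSITION_GAP stands in for the "junk tokens" between values; 100 is an
# illustrative number, not taken from any real configuration.
POSITION_GAP = 100


def index_values(values):
    """Return {token: [positions]} for a list of string values."""
    positions = {}
    pos = 0
    for value in values:
        for token in value.lower().split():
            positions.setdefault(token, []).append(pos)
            pos += 1
        pos += POSITION_GAP  # jump ahead so tokens of adjacent values never touch
    return positions


def phrase_match(positions, phrase):
    """True if the phrase's tokens occur at consecutive positions."""
    tokens = phrase.lower().split()
    if not tokens:
        return False
    for start in positions.get(tokens[0], []):
        if all(start + i in positions.get(tok, []) for i, tok in enumerate(tokens)):
            return True
    return False


index = index_values(["Serbian inventors", "American physicists"])
print(phrase_match(index, "serbian inventors"))   # phrase inside one value
print(phrase_match(index, "inventors american"))  # phrase spanning two values
```

The first lookup matches and the second does not.  With POSITION_GAP set to 0 
the second would match too, which is the leak the junk tokens plug.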

Looking at mixed indexes I wonder how they are backed by Elasticsearch.  
Lucene/Elasticsearch pretty much indexes everything independently and then ANDs 
the results of traversing multiple indexes together to get the answer.  That 
deserves some looking into.
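
For reference, the "index independently, then AND" approach looks roughly like 
this simplified Python sketch.  It is not how Titan or Lucene actually 
implement it; the field names, values and document ids are invented.

```python
def intersect(postings_a, postings_b):
    """Intersect two sorted lists of document ids with a linear merge."""
    result = []
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result


def and_query(index, terms):
    """AND together the postings of every (field, value) term."""
    lists = sorted((index.get(term, []) for term in terms), key=len)
    if not lists:
        return []
    result = lists[0]  # start from the shortest list to keep the work small
    for other in lists[1:]:
        result = intersect(result, other)
    return result


# Hypothetical per-field indexes: (field, value) -> sorted document ids.
index = {
    ("type", "human"): [1, 3, 4, 7, 9],
    ("occupation", "inventor"): [2, 3, 7, 8],
    ("country", "serbia"): [3, 5, 7],
}

matches = and_query(index, [("type", "human"),
                            ("occupation", "inventor"),
                            ("country", "serbia")])
print(matches)
```

Each attribute gets its own postings list, and a query on an arbitrary 
combination of attributes only needs the lists for the attributes it mentions.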


TASK DETAIL
  https://phabricator.wikimedia.org/T76373

To: Smalyshev, Manybubbles
Cc: Smalyshev, Manybubbles, GWicke, JanZerebecki, aude, Lydia_Pintscher, 
Eloquence, aaron, jkroll, Wikidata-bugs, daniel


