[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-24 Thread Smalyshev
Smalyshev added a comment.

Proposed storage format for dates:

1. Dates are stored as long signed integers, representing number of seconds 
since 1970-01-01 00:00:00 UTC.
2. This gives us a range of 292 bln years 
http://www.wolframalpha.com/input/?i=9223372036854775807+seconds+in+years.
3. When parsing the date, if the year is below 292M (by absolute value) 
we use Java functions, and hope they work right with the calendar, etc.
4. Beyond year 292M, we just count in whole years - i.e. the number stored is 
years*SECONDS_IN_YEAR

This allows us to keep reasonable accuracy (to the second range) within a wide 
range of dates, while still preserving the ordering - i.e. the Big Bang can be 
listed as an event that happened before the year 2000 - and keeping a single 
property and data format, so as to avoid complicating the data handling code with 
special cases.

We would probably want to define helper functions to convert from dates to 
seconds and back, for convenience.
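Such helpers might look like the following minimal sketch (all names here are hypothetical, not from any actual codebase; SECONDS_IN_YEAR assumes an average Gregorian year of 365.2425 days = 31,556,952 seconds):

```java
// Hypothetical helpers for the proposed encoding. Within the cutoff a
// real implementation would delegate to the JDK calendar functions;
// beyond it, whole years are scaled by an average year length.
final class DateCodec {
    static final long SECONDS_IN_YEAR = 31_556_952L; // 365.2425 days
    static final long CUTOFF_YEAR = 292_000_000L;

    /** Encode a far date (beyond the cutoff) given only its year. */
    static long yearToSeconds(long year) {
        return year * SECONDS_IN_YEAR;
    }

    /** Decode a far timestamp back to whole years. */
    static long secondsToYear(long seconds) {
        return seconds / SECONDS_IN_YEAR;
    }
}
```

Round-tripping is exact in the whole-year regime because encoding is a plain multiplication, so ordering is preserved there by construction.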


TASK DETAIL
  https://phabricator.wikimedia.org/T76373

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev
Cc: Smalyshev, Manybubbles, GWicke, JanZerebecki, aude, Lydia_Pintscher, 
Eloquence, aaron, jkroll, Wikidata-bugs, daniel



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-24 Thread GWicke
GWicke added a comment.

In https://phabricator.wikimedia.org/T76373#943426, @Smalyshev wrote:

 Proposed storage format for dates:

 1. Dates are stored as long signed integers, representing number of seconds 
 since 1970-01-01 00:00:00 UTC.
 2. This gives us range of 292 bln years 
 http://www.wolframalpha.com/input/?i=9223372036854775807+seconds+in+years.
 3. When parsing the date, if the year is below 292M (by absolute value) 
 we use Java functions, and hope they work right with the calendar, etc.
 4. Beyond year 292M, we just count in whole years - i.e. the number stored is 
 years*SECONDS_IN_YEAR


We should double-check that this transition is actually monotonic. If in doubt, we 
can use the timestamp for year 292M according to Java + (year - 292M) * 
SECONDS_IN_YEAR.
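That anchored variant could be sketched like this (illustrative names; the cutoff constant below is a placeholder for the actual JDK-computed epoch value):

```java
// Sketch of the anchored encoding suggested here. Starting the
// whole-year branch from the calendar-aware timestamp at the cutoff
// year keeps the mapping monotonic across the transition.
final class AnchoredDateCodec {
    static final long SECONDS_IN_YEAR = 31_556_952L;
    static final long CUTOFF_YEAR = 292_000_000L;
    // Placeholder: a real implementation would put the JDK-computed
    // epoch seconds for year 292M here.
    static final long CUTOFF_EPOCH_SECONDS = CUTOFF_YEAR * SECONDS_IN_YEAR;

    /** Encode a year beyond the cutoff, anchored at the cutoff timestamp. */
    static long encodeFarYear(long year) {
        return CUTOFF_EPOCH_SECONDS + (year - CUTOFF_YEAR) * SECONDS_IN_YEAR;
    }
}
```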




[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-24 Thread Smalyshev
Smalyshev added a comment.

@GWicke Do we really need per-second precision when transitioning between year 292M 
and 292M+1? I'm not sure it is ever required. We should ensure, of course, that 
seconds(292M-12-31T23:59:59) < seconds(292M+1), and the same for low dates, 
but beyond that I'm not sure anything more precise would be of any use, as I 
can imagine no calendar dates with such years that would make any sense.




[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-24 Thread GWicke
GWicke added a comment.

In https://phabricator.wikimedia.org/T76373#943454, @Smalyshev wrote:

 seconds(292M-12-31T23:59:59) < seconds(292M+1)


It's pretty likely that there were more than 365 leap years between -292M and 
1970 (each one adds a day of drift against a fixed-length year), so it's very 
possible for the years to be non-monotonic if we don't use the Java value (which 
should include leap years) as the base.




[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-24 Thread Smalyshev
Smalyshev added a comment.

For dates beyond real Gregorian calendar, the values more precise than years 
have little meaning anyway, so I don't think it matters too much as long as 
comparisons and lookups (i.e. which Greek philosopher was born in 427 BCE) work.




[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-24 Thread GWicke
GWicke added a comment.

In https://phabricator.wikimedia.org/T76373#943459, @Smalyshev wrote:

 For dates beyond real Gregorian calendar, the values more precise than years 
 have little meaning anyway, so I don't think it matters too much as long as 
 comparisons and lookups (i.e. which Greek philosopher was born in 427 BCE) 
 work.


Agreed. But, we need monotonicity at least at the year level. Without that we 
also don't have bijectivity.




[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-23 Thread Manybubbles
Manybubbles added a comment.

 I think mixed indexes as documented in 
 http://s3.thinkaurelius.com/docs/titan/current/indexes.html#index-mixed 
 should help here, as they support efficient matching on an arbitrary 
 combination of attributes, along with advanced range and full text queries. 
 They do require an indexing backend like Elasticsearch to be configured. Nik, 
 is there an Elasticsearch setup we could use, or would it be reasonably easy 
 to spin up a new one?


I'd spin up a new one - probably just on a single node.  I think in the long 
run we probably can run this on the production search cluster, but for now let's 
keep it off just in case it does something stupid.  I can put together some 
puppet changes to put a single-node Elasticsearch instance on einsteinium.

 For Date, I wonder if support can't be added to Titan, since Elastic AFAIK 
 supports dates.


It sure does.  They are parsed and formatted automatically, but amount to a Java 
long of milliseconds since epoch under the hood.  As @GWicke said, that means 
they can't reach back to the big bang.  If instead of dealing in dates we dealt 
in _seconds_ since epoch, we could reach back to the big bang so long as current 
estimates are right to within an order of magnitude.  Instead of 292 million 
years ago we'd have 292 billion years ago.

 Also, while Java can support negative years, right now the import does not 
 support them since the Java parser fails to properly parse them. If we're 
 moving to Java 8, there are better date APIs AFAIR, so maybe they allow to 
 handle it properly.


Elasticsearch is based on Joda Time, which can handle negative years just fine.  
It can't handle negative years that far back, though.  I've filed an issue 
https://github.com/elasticsearch/elasticsearch/issues/9048 for it, but I 
imagine we'll be on our own.  I believe they are getting arbitrary-precision 
numbers at some point, but the kind of dates we handle are probably best stored 
as floating point instead.

 For SET it won't be more complex to maintain, probably, but I'm not sure if 
 the lookups would be fast enough. I could create an additional field for that 
 and see how it behaves, and then we could drop the field that is not needed.


Elasticsearch totally supports sets.  Like, so supports.  We use them for stuff 
like hastemplate and incategory.  Have a look at the dump 
http://en.wikipedia.org/wiki/Nikola_Tesla?action=cirrusdump for a page.  
Lucene has native support for sets of strings, sorta.  It's a very leaky 
abstraction around pretending that they are one big string with junk tokens 
between them.  The junk tokens prevent phrase searches from spanning multiple 
values in the set.

Looking at mixed indexes, I wonder how they are backed by Elasticsearch.  
Lucene/Elasticsearch pretty much indexes everything independently and then ANDs 
the results of traversing multiple indexes together to get the answer.  That 
deserves some looking into.




[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-23 Thread GWicke
GWicke added a comment.

In https://phabricator.wikimedia.org/T76373#941917, @Manybubbles wrote:

 I'd spin up a new one - probably just on a single node.  I think in the long 
 run we probably can run this on the production search cluster but for now 
 lets keep it off just in case it does something stupid.  I can put together 
 some puppet changes to put a single node elasticsearch instance on 
 einsteinium.


Awesome, that would be great! I think storage is not very fast there (disks 
IIRC), but maybe the 64G of memory will compensate.

  For Date, I wonder if support can't be added to Titan, since Elastic AFAIK 
  supports dates.

 It sure does.  They are parsed and formatted automatically but amount to a 
 java long since epoch under the hood.  As @GWicke said that means they can't 
 reach back until the big bang.  If instead of dealing in dates we dealt in 
 _seconds_ since epoch we could reach back to the big bang so long as current 
 estimates are right to within an order of magnitude.  Instead of 292 million 
 years ago we'd have 292 billion years ago.


Seconds wouldn't actually work, but years will, especially if represented as a 
double. The big bang represented as years comfortably fits into the 52-bit 
mantissa of a double, so we wouldn't get weird rounding artifacts like reading 
back 13.834520923424352345234523 billion years after storing 13.8. And we'd 
still have plenty of range for other cosmic dates.
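A quick way to sanity-check the mantissa argument (a hypothetical check, not code from the task) is Math.ulp, which gives the spacing between adjacent representable doubles near a value:

```java
// Integer year counts on cosmic scales sit far below 2^53 (~9.0e15),
// so a double represents them exactly and neighbouring values are
// still microseconds-of-a-year apart.
final class YearsAsDouble {
    /** Spacing between adjacent doubles near 13.8 bln (in years). */
    static double ulpAtBigBang() {
        return Math.ulp(13_800_000_000.0); // ~2e-6 years, roughly a minute
    }
}
```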

 Elasticsearch is based on Joda Time which can handle negative years just 
 fine.  It can't handle negative years that far back though.  I've filed an 
 issue https://github.com/elasticsearch/elasticsearch/issues/9048 for it but 
 I imagine we'll be on our own.  I believe they are getting infinite precision 
 numbers at some point but the kind of dates we handle are probably best 
 stored in floating point instead.


Parsing out the year portion of an ISO 8601 timestamp in the API frontend is 
not all that hard, so we could just go ahead and index on a double 
representation of that in Titan. For dates within the long cut-off, we can also 
convert that to long (or double), and store that in a finer-grained index. 
We'll then need to switch between year-based indexing and finer-grained 
indexing in the front-end after looking at a string representation of a 
timestamp in the query.
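The front-end switch described above could be sketched roughly as follows (helper names are hypothetical, and real ISO 8601 handling has more corner cases than this):

```java
// Sketch: pull the (possibly negative) year out of an ISO 8601 string,
// then decide which index to query based on its magnitude.
final class IsoYear {
    /** Extract the year, honouring a leading '-' for BCE years. */
    static long of(String ts) {
        int start = ts.startsWith("-") ? 1 : 0;
        int dash = ts.indexOf('-', start);
        long y = Long.parseLong(dash < 0 ? ts.substring(start)
                                         : ts.substring(start, dash));
        return start == 1 ? -y : y;
    }

    /** True if the date fits the finer-grained (sub-year) index. */
    static boolean useFineIndex(long year) {
        return Math.abs(year) < 292_000_000L; // within the long cut-off
    }
}
```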

 Looking at mixed indexes I wonder how they are backed to Elasticsearch.  
 Lucene/Elasticsearch pretty much indexes everything independently and then 
 ANDs the results of traversing multiple indexes together to get the answer.  
 That deserves some looking into.


The documentation says that it efficiently supports multi-predicate queries on 
the same index, so I assume that it sends all predicates off to Elasticsearch 
at once and lets it deal with the ANDing efficiently.




[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-23 Thread Smalyshev
Smalyshev added a comment.

 Elasticsearch totally supports sets.


Right, but Titan unfortunately doesn't support mixed indexes on SET properties. 
I would assume it's not a hard limitation, but rather that they haven't gotten 
around to implementing it yet. The mixed index data type support is very limited 
right now 
http://s3.thinkaurelius.com/docs/titan/0.5.2/search-predicates.html#mixeddatatypes
 - they don't support booleans, for example.




[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-23 Thread GWicke
GWicke added a comment.

Another fun article for dates: 
http://en.wikipedia.org/wiki/Timeline_of_the_far_future




[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-23 Thread Manybubbles
Manybubbles added a comment.

I was using that!




[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-22 Thread GWicke
GWicke added a comment.

Re performance and indexing, from a mail thread:
Earlier today Stas and I were looking a bit into what is happening behind the 
scenes in some of the slower queries like 
https://www.mediawiki.org/wiki/Wikibase/Indexing/Benchmarks#List_of_humans_having_occupation_writer_but_not_author

We found that in a query that uses predicates on more than a single property 
per vertex, the default titan query strategy is to retrieve a list of 
candidates based on the first index (fast), and then load details for each 
vertex in order to filter on the second predicate. This is pretty slow, as it 
involves O(candidates) queries to Cassandra.

I think mixed indexes as documented in 
http://s3.thinkaurelius.com/docs/titan/current/indexes.html#index-mixed should 
help here, as they support efficient matching on an arbitrary combination of 
attributes, along with advanced range and full text queries. They do require an 
indexing backend like Elasticsearch to be configured. Nik, is there an 
Elasticsearch setup we could use, or would it be reasonably easy to spin up a 
new one?

Stas replied:

 I've looked into this matter, and there's one wrinkle in this - mixed indexes 
 currently don't support Date and SET fields, which means we cannot index 
 everything - some fields (among them the frequently used date start/end fields 
 and the P*link fields, which are SET) would not be indexable with Elastic. They 
 say it's a temporary limitation, but so far 0.9 still has it.


Dates are actually interesting re the ranges we need to represent. A signed 
long millisecond-resolution unix timestamp (as used by Java Date) reaches from 
292269055 BC to 292278994 AD 
http://stackoverflow.com/questions/5488038/valid-range-for-java-util-date. 
Not quite enough to represent the current estimates for the Big Bang 
http://en.wikipedia.org/wiki/Big_Bang, for example. ISO 8601 timestamps 
http://en.wikipedia.org/wiki/ISO_8601 support much less range, even in the 
six-digit extended mode. Maybe it makes sense to index years separately as a 
long? This will also be fun in JSON.

For SET I could imagine a full-text index on a string with each P*link property 
separated by a space. Elasticsearch should be good at indexing that. Not sure 
how much more complex that would be to maintain compared to the current SET 
index.
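The flattening itself would be trivial; a sketch with hypothetical names:

```java
import java.util.Collection;

// Flatten a SET of P*link property ids into one space-separated string
// that a full-text index can tokenise and match term-by-term.
final class SetFlattener {
    static String flatten(Collection<String> links) {
        return String.join(" ", links);
    }
}
```

A membership query then becomes a plain term match against the flattened field.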




[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-22 Thread Smalyshev
Smalyshev added a comment.

For SET it won't be more complex to maintain, probably, but I'm not sure if the 
lookups would be fast enough. I could create an additional field for that and 
see how it behaves, and then we could drop the field that is not needed.

For Date, I wonder if support can't be added to Titan, since Elastic AFAIK 
supports dates.

Also, while Java can support negative years, right now the import does not 
support them, since the Java parser fails to properly parse them. If we're 
moving to Java 8, there are better date APIs AFAIR, so maybe they allow 
handling this properly.




[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-10 Thread Smalyshev
Smalyshev added a comment.

Maybe worth checking out this: 
http://www.tinkerpop.com/docs/3.0.0.M6/#vertex-properties 
Titan 0.9 has TinkerPop 3, which has significantly expanded property model - in 
particular, the property can have multiple other properties attached to it, and 
itself can have multiple values.




[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-09 Thread Smalyshev
Smalyshev added a comment.

Doing the import with the new model, I see that the import is significantly 
slower when claims have their own vertices. Not sure if it's a big deal or not. 
If it's an issue, we may want to reconsider and go back to the claims-as-edges 
model.



[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-08 Thread Smalyshev
Smalyshev added a comment.

https://www.mediawiki.org/wiki/Wikibase/Indexing has been updated according to 
the comments. Main change - claims are now vertices.



[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-05 Thread Smalyshev
Smalyshev added a comment.

I also put some more comments to the discussion page 
https://www.mediawiki.org/wiki/Talk:Wikibase/Indexing/Data_Model - I think it 
makes sense to discuss/clarify things there, but if anybody thinks there's a 
better place please tell.



[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-04 Thread JanZerebecki
JanZerebecki added a comment.

This is real data. If the system cannot cope with data changes of the 
discussed scope, I suspect it won't be able to cope with normal day-to-day 
changes in the data.

If optimization for specific types of queries on specific properties is 
necessary, that would constrain the system's usefulness. We should try to avoid 
that if possible. There are quite a few properties where it is correct to have 
no preferred and 10 normal-ranked statements (examples from Obama: educated at, 
award received, website account on, described by source).

I would like to know if Titan is capable of computing cases like this in a 
generic way while still being fast enough. If it is, the correctly ranked data 
should result in faster responses. Then wouldn't the current data be enough for 
evaluation?

To write queries that correctly deal with multiple possibilities - and also if 
we were to better optimize the data for answering a specific query - the use 
cases need to be more specific. E.g. for the cities with the biggest population 
there are multiple ways, once the date qualifier is exhausted, to get the latest 
number: a) more accurately represent possibilities (each top spot can have 
multiple possibilities and cities can appear on multiple spots); b) higher 
number wins; c) lower number wins.

Anyway, it seems to me that such a query is not an interactive use case. It is 
more like something to be evaluated in a batch fashion as CPU becomes available 
(think Hadoop), with the result saved for viewing and probably updated at some 
point in the same way. If queries like that are not fast enough to be done in an 
interactive way, we need better, more specific use cases for interactive 
queries.



[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-04 Thread Smalyshev
Smalyshev added a comment.

Here is my initial proposal about the data model:

https://www.mediawiki.org/wiki/Wikibase/Indexing/Data_Model

Please comment.



[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-03 Thread JanZerebecki
JanZerebecki added a comment.

In T76373#799504, @Smalyshev wrote:
 # Right now we completely ignore references. Do we want to keep them in the 
 index too?

Yes, although the other parts of statements are more important. One might query 
for things that are not sourced and/or only have a source in Wikipedia, or one 
might want to find things that refer to a particular source.

 # Which fields we want to search against - i.e., do we want to search against 
 descriptions (we'd probably need full-text index, like Elastic, to make it 
 useful for anything)

We probably want to search against everything in statements and site links. We 
might not need to import labels, descriptions and aliases until we want to have 
them directly in the output of the search, instead of getting them from 
mysql/memcache afterwards.
Full text is not necessary at first: label, description and alias full-text 
search is, I think, the least important feature, as our current Elastic setup 
should already cover these (although not as separate fields). Although I'm sure 
someone could come up with queries they want that would need this. More 
interesting are probably values of monolingual text, but still not absolutely 
necessary to have a full-text index. None of the current example queries we 
collected need a full-text index. For all the other string values, full text 
would probably not be used.
Though in the future it might be beneficial to be able to meet our searching 
needs through one endpoint (like Titan) even for normal full-text queries, 
which would allow us to easily add drill-down or faceted search features based 
on graph queries.



[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-03 Thread JanZerebecki
JanZerebecki added a comment.

In T76373#799504, @Smalyshev wrote:
 # How to handle qualifiers? As an example, such ones as point-in-time and 
 start-date/end-date. Most queries we'd do probably would be interested in 
 current values, but some may need past values. 

For queries not explicitly asking for something besides the best current 
consensus, ranks are used to select the relevant statements. I.e. every time 
one selects a statement, only use those ranked preferred if one of that rank 
exists for that property, otherwise those of rank normal. Statements ranked 
as deprecated should be ignored unless specified otherwise in the query. See 
https://www.wikidata.org/wiki/Help:Ranking .

But yes, for e.g. a query like who was the mayor of Berlin in 1984, you would 
need to look at statements that 1) are ranked preferred or normal, 2) have a 
qualifier start-date which is <= 1984, and 3) do not have a qualifier 
end-date, or whose end-date is >= 1984.
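Those three conditions can be sketched as a predicate (types and names here are hypothetical, with the start/end qualifiers reduced to plain years and null meaning the qualifier is absent):

```java
// Sketch: does a statement apply to a given query year?
// 1) rank is preferred or normal, 2) start-date <= year (or absent),
// 3) end-date >= year (or absent).
final class StatementFilter {
    static boolean matches(String rank, Integer startYear, Integer endYear,
                           int queryYear) {
        boolean rankOk = rank.equals("preferred") || rank.equals("normal");
        boolean startOk = startYear == null || startYear <= queryYear;
        boolean endOk = endYear == null || endYear >= queryYear;
        return rankOk && startOk && endOk;
    }
}
```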

   ## With properties, it's easier because we can just stuff everything into a 
 vertex, but we need some standard format of storing non-current data.
   ## With edges we have a potential to pollute the graph with myriad of data 
 that we will have to filter out on each query. We may distinguish them by 
 edge properties, but that could complicate many queries. Alternatively, we 
 could make special past edges - named with prefix, etc. - so that queries 
 interested in non-current data 

All this should be decided based on ranks, not qualifiers. However, if you 
evaluate on import which statements (based on rank) are to be considered by 
default, then to still answer questions like the historic one above you need a 
way to specifically query for preferred/normal/deprecated statements, ignoring 
the default.

   ## Any such processing would require ad-hoc handling of specific properties, 
 such as point-in-time and start/end, to identify which data is current and 
 which is not. We may need to be careful with this, since wrong handling of 
 this on initial import would require a re-scan of the whole data set.

No, as above. Queries not explicitly asking for something besides the best, 
current consensus do not need to understand qualifiers, only ranks. That should 
remove the possibility of incorrect interpretation during import.



[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-03 Thread Smalyshev
Smalyshev added a comment.

@JanZerebecki The issue here is that many items do not have a preferred 
statement. E.g. take https://www.wikidata.org/wiki/Q30. 

What is the population of the USA? We don't have any number marked as 
preferred. We either have to report that we have no idea about the US 
population, or take the latest value by time point. 

Now, if we go to contains administrative territorial entity, Territory of 
Alaska does not have a deprecated mark, but it does have an end date in 1959, so 
if we are asked what the sub-entities of the USA are, we probably should not 
return it unless we're specifically asked about 1958. So again, we'd need to 
look into qualifiers.

Now, let's consider https://www.wikidata.org/wiki/Q76. Where does Barack Obama 
reside? We have two entries - Capitol Hill and the White House - neither of 
them preferred, but only one of them current. 

But here we have a more complicated case, because if we look at awards 
received, we may want all entries listed, not just the latest one. So how do we 
distinguish that from the US population case, where we do want only the latest? Can 
we do it automatically, or would a human have to write separate queries for these 
two?

Now, a more generic question. If we have one preferred entry and one not, do we 
return only the preferred one? What if there are no preferred ones? I.e., in 
ethnic group for Obama we have African American and Irish American, with 
the former preferred - do we return just that when asked for Obama's ethnic 
group? But for surname we don't have anything preferred - we can't return 
nothing there, right? We could opt for return preferred if it exists, otherwise 
all, but that means slowing things down, as we'd have to scan all the data just to 
find out if we have anything preferred - and we'd have to do it on each step of the 
search, since a match for a non-preferred value then won't count, I assume. And we still 
have cases like US population where getting all of them is no good - i.e. if 
we ask what the 10 most populous countries are, we do want to compare the latest 
figures, right?



[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-03 Thread Lydia_Pintscher
Lydia_Pintscher added a comment.

Do not worry about the current usage of ranks. As soon as their use becomes 
more meaningful, through queries for example, they will be used more.



[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-03 Thread Smalyshev
Smalyshev added a comment.

So we have a chicken-and-egg problem here. Should we code for data that is ranked 
properly (but does not exist yet) and hope the data will catch up, leaving 
querying complicated until then, or should we code for the current data to 
make querying it easy?



[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-03 Thread JanZerebecki
JanZerebecki added a comment.

Write code for data that is ranked properly. But even with properly ranked data 
everything needs to cope with multiple answers that are contradictory. See it 
as multiple possible realities. Being able to cope with this is designed into 
Wikidata and thus needs to carry over to something that queries Wikidata in a 
generic fashion.

Finding the items that can be corrected in an automatic or human-assisted 
fashion would be a good use case for a WDQ service :) . For very clear-cut 
cases, like non-overlapping dates of population counts with no other qualifiers, 
it is probably easy to find someone to do a bot run to correct those on 
wikidata.org, if there were a way to get a list. ( http://wdq.wmflabs.org can 
not express a query for that.) For the cases where someone wants to write code 
to be more clever based on qualifiers, that work should be done in a way that 
leads to correcting the actual data. (Btw. idea added there: 
https://www.wikidata.org/w/index.php?title=Wikidata:Constraint_violation_report_input&diff=prev&oldid=178922838
 ) For cases where one wants a query result to represent a certain view of 
reality, that cleverness (or at least instructing it) needs to be in the query 
and not in the service.



[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-03 Thread Smalyshev
Smalyshev added a comment.

OK, this can be done, but the issue here is that we can't evaluate a solution 
such as Titan/Gremlin (e.g. for performance, fitness to the data, etc.) if we have 
no data to test it on. Meaning, assume we coded up all the queries under the 
assumption that the data is ranked properly. But the current data, as far as I can 
see, is not. So how can we know whether a certain query would be fast or not on real 
data? 

Another issue is that if the cleverness is in the query, we cannot pre-process 
data on import. Which means that if we need to do an operation as simple as select 
the latest relevant population figure, instead of preparing for it on data 
import, we'd be doing it every time the query is run, on every query, for every 
data item. I suspect that may hurt performance, especially if such items are 
combined. I.e. get me the mayors of the 10 most populous capitals - now we have 3 
data sets to process: mayors, which may be past or current; capitals, which can be 
current or old; and population figures. This means that 
instead of considering one data point at each junction, we will have to 
consider multiple ones - and this work will be done anew each time. Of course, 
ideally we'd have all the latest data marked as preferred, but I don't think that 
is the case now even in relatively common data, and it is probably even worse 
once the data sets become more exotic and less visible/visited.

Maybe indeed the solution would be to make a bot that just reads qualifiers and 
automatically puts a preferred rank on the latest figure if there is none; that 
bot could perhaps run more complex queries than the regular user. 
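That bot idea could be sketched roughly like this; the Stmt type and its fields are hypothetical stand-ins, and a real bot would of course work against the Wikidata API rather than in-memory lists:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class RankBot {
    // Hypothetical stand-in for a statement with a point-in-time qualifier.
    record Stmt(String value, int pointInTimeYear, boolean preferred) {}

    // If nothing is marked preferred yet, propose the statement with the
    // latest point-in-time qualifier for promotion; otherwise do nothing.
    static Optional<Stmt> candidateToPromote(List<Stmt> stmts) {
        if (stmts.stream().anyMatch(Stmt::preferred)) {
            return Optional.empty(); // ranks already set, nothing to do
        }
        return stmts.stream()
                .max(Comparator.comparingInt(Stmt::pointInTimeYear));
    }
}
```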

In any case, I'll add rank support to the importer/query prototype, but I think we 
still need to consider optimizing for the common case.



[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-02 Thread GWicke
GWicke added a comment.

! In T76373#799513, @Smalyshev wrote:
 Technical issues:
 
 # On import, titan sometimes slows down and gets into GC loops. 
 # On querying, for vertices with a lot of edges (such as 
 `wd(Q5).in(P31)`, i.e. humans), Titan produces a backend exception: 
 
 ```
 Caused by: org.apache.thrift.transport.TTransportException: Frame size 
 (17555240) larger than max length (16384000)!
 
 ```

Possibly related: 
http://stackoverflow.com/questions/23055507/java-sql-sqlnontransientconnectionexception-org-apache-thrift-transport-ttransp

Bumping up `thrift_framed_transport_size_in_mb` might help.

Generally, the thrift interface is the old, low-level way to interact with 
Cassandra. https://github.com/thinkaurelius/titan/issues/312 discusses 
switching to the native transport, but it seems that it isn't merged yet. Native 
transport has some nifty features like autopaging / streaming.

Another option that might help is 
https://github.com/thinkaurelius/titan/wiki/Using-Cassandra#titan-embedded-mode,
 as that cuts out thrift altogether by running cassandra in-process.

 Example query: 
 
 ```
 g.wd('Q5').in('P31').labelEn[0] 
 
 ```
 However, `g.wd('Q5').in('P31')[0].labelEn` works.



[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-02 Thread Smalyshev
Smalyshev added a comment.

OK, setting `thrift_framed_transport_size_in_mb` in `cassandra.yaml` in both 
Cassandra and Titan (they both have the yaml file) to 256 seems to eliminate 
the Frame size error; now `g.wd('Q5').in('P31').labelEn[0]` works and produces 
'Douglas Adams' as it should.
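For reference, the corresponding cassandra.yaml fragment would look roughly like this; the setting name comes from the error message quoted earlier, and 256 is the value reported to work here:

```yaml
# cassandra.yaml - apply to both the Cassandra and the Titan copy, as noted above
thrift_framed_transport_size_in_mb: 256
```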



[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-02 Thread GWicke
GWicke added a comment.

! In T76373#802449, @Smalyshev wrote:
 Note that running Titan with Cassandra embedded requires GC tuning. While 
 embedded Cassandra can provide lower-latency query answering, its GC behavior 
 under load is less predictable.

Yeah, agreed. GC scaling limits are kind of the Achilles' heel of JVM apps. And 
most other GC'ed languages with less refined GCs do even worse.

! In T76373#802893, @Smalyshev wrote:
 So current technical issues seem to be resolved. 

Cool!
