[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-24 Thread Smalyshev
Smalyshev added a comment. Proposed storage format for dates: 1. Dates are stored as long signed integers, representing number of seconds since 1970-01-01 00:00:00 UTC. 2. This gives us range of 292 bln years http://www.wolframalpha.com/input/?i=9223372036854775807+seconds+in+years. 3. When
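As a rough check of the proposal above, the following sketch converts timestamps to signed epoch seconds and estimates the range of a 64-bit signed integer in years. The helper name `to_epoch_seconds` and the mean Gregorian year length of 365.2425 days are assumptions made for illustration; they do not come from the thread.

```python
from datetime import datetime, timezone

INT64_MAX = 2**63 - 1                    # largest signed 64-bit value
SECONDS_PER_YEAR = 365.2425 * 86400      # assumption: mean Gregorian year

def to_epoch_seconds(dt: datetime) -> int:
    """Seconds since 1970-01-01T00:00:00Z for a timezone-aware datetime.

    Python's datetime only covers years 1..9999, so this helper is only
    useful for checking the encoding on ordinary dates.
    """
    return int(dt.timestamp())

# Approximate number of years representable on either side of the epoch.
range_years = INT64_MAX / SECONDS_PER_YEAR

if __name__ == "__main__":
    print(to_epoch_seconds(datetime(1970, 1, 1, tzinfo=timezone.utc)))
    print(to_epoch_seconds(datetime(1969, 12, 31, 23, 59, 59, tzinfo=timezone.utc)))
    print(f"{range_years:.3e} years")
```

Dates before 1970 come out negative, which is why the proposal calls for a signed integer.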

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-24 Thread GWicke
GWicke added a comment. In https://phabricator.wikimedia.org/T76373#943426, @Smalyshev wrote: Proposed storage format for dates: 1. Dates are stored as long signed integers, representing number of seconds since 1970-01-01 00:00:00 UTC. 2. This gives us range of 292 bln years

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-24 Thread Smalyshev
Smalyshev added a comment. @GWicke Do we really need per-second precision transitioning between year 292M and 292M+1? I'm not sure it is ever required. We should ensure, of course, that seconds(292M-12-31T23:59:59) < seconds(292M+1) and also the same for low dates, but beyond that I'm not sure

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-24 Thread GWicke
GWicke added a comment. In https://phabricator.wikimedia.org/T76373#943454, @Smalyshev wrote: seconds(292M-12-31T23:59:59) < seconds(292M+1) It's pretty likely that there were more than 356 leap years between -292M and 1970, so it's very possible for the years to be non-monotonic if we don't
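One way to keep the ordering monotonic across year boundaries is to count leap days exactly in the proleptic Gregorian calendar instead of approximating a year as 365 days. The sketch below uses the well-known days-from-civil integer algorithm; it is an illustration, not the conversion the thread participants settled on, and the function names and the astronomical year numbering (year 0 = 1 BCE) are assumptions.

```python
def days_from_civil(year: int, month: int, day: int) -> int:
    """Days since 1970-01-01 in the proleptic Gregorian calendar.

    Works for arbitrarily large positive or negative years because
    Python integers are unbounded and // is floor division.
    """
    # Treat January and February as months 11 and 12 of the previous year,
    # so the leap day falls at the end of the shifted year.
    y = year - 1 if month <= 2 else year
    era = y // 400                          # 400-year Gregorian cycle index
    yoe = y - era * 400                     # year of era, 0..399
    mp = month - 3 if month > 2 else month + 9
    doy = (153 * mp + 2) // 5 + day - 1     # day of shifted year, 0..365
    doe = yoe * 365 + yoe // 4 - yoe // 100 + doy
    return era * 146097 + doe - 719468      # 719468 = days from 0000-03-01 to 1970-01-01

def epoch_seconds(year, month, day, hour=0, minute=0, second=0) -> int:
    """Signed seconds since 1970-01-01T00:00:00Z, ignoring leap seconds."""
    return days_from_civil(year, month, day) * 86400 + hour * 3600 + minute * 60 + second

if __name__ == "__main__":
    y = -292_000_000
    last = epoch_seconds(y, 12, 31, 23, 59, 59)
    first = epoch_seconds(y + 1, 1, 1)
    print(last < first, first - last)
```

With exact leap-day counting, the last second of any year is always one less than the first second of the next, so comparisons stay consistent even far outside the historical Gregorian range.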

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-24 Thread Smalyshev
Smalyshev added a comment. For dates beyond real Gregorian calendar, the values more precise than years have little meaning anyway, so I don't think it matters too much as long as comparisons and lookups (i.e. which Greek philosopher was born in 427 BCE) work.

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-24 Thread GWicke
GWicke added a comment. In https://phabricator.wikimedia.org/T76373#943459, @Smalyshev wrote: For dates beyond real Gregorian calendar, the values more precise than years have little meaning anyway, so I don't think it matters too much as long as comparisons and lookups (i.e. which Greek

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-23 Thread Manybubbles
Manybubbles added a comment. I think mixed indexes as documented in http://s3.thinkaurelius.com/docs/titan/current/indexes.html#index-mixed should help here, as they support efficient matching on an arbitrary combination of attributes, along with advanced range and full text queries.

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-23 Thread GWicke
GWicke added a comment. In https://phabricator.wikimedia.org/T76373#941917, @Manybubbles wrote: I'd spin up a new one - probably just on a single node. I think in the long run we probably can run this on the production search cluster but for now let's keep it off just in case it does

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-23 Thread Smalyshev
Smalyshev added a comment. Elasticsearch totally supports sets. Right, but Titan unfortunately doesn't support mixed indexes on SET properties. I would assume it's not a hard limitation but rather that they haven't gotten around to implementing it yet. The mixed index type support is very limited now

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-23 Thread GWicke
GWicke added a comment. Another fun article for dates: http://en.wikipedia.org/wiki/Timeline_of_the_far_future

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-23 Thread Manybubbles
Manybubbles added a comment. I was using that!

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-22 Thread GWicke
GWicke added a comment. Re performance and indexing, from a mail thread: Earlier today Stas and I were looking a bit into what is happening behind the scenes in some of the slower queries like

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-22 Thread Smalyshev
Smalyshev added a comment. For SET it won't be more complex to maintain, probably, but I'm not sure if the lookups would be fast enough. I could create an additional field for that and see how it behaves, and then we could drop the field that is not needed. For Date, I wonder if support can't

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-10 Thread Smalyshev
Smalyshev added a comment. Maybe worth checking out this: http://www.tinkerpop.com/docs/3.0.0.M6/#vertex-properties Titan 0.9 has TinkerPop 3, which has a significantly expanded property model - in particular, the property can have multiple other properties attached to it, and itself can have

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-09 Thread Smalyshev
Smalyshev added a comment. Doing the import with the new model I see that the import is significantly slower when claims have their own vertices. Not sure if it's a big deal or not. If it's an issue we may want to consider going back to the claims-as-edges model.

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-08 Thread Smalyshev
Smalyshev added a comment. https://www.mediawiki.org/wiki/Wikibase/Indexing has been updated according to the comments. Main change - claims are now vertices.

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-05 Thread Smalyshev
Smalyshev added a comment. I also put some more comments to the discussion page https://www.mediawiki.org/wiki/Talk:Wikibase/Indexing/Data_Model - I think it makes sense to discuss/clarify things there, but if anybody thinks there's a better place please tell.

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-04 Thread JanZerebecki
JanZerebecki added a comment. This is real data. If the system cannot cope with data changes of the discussed scope, I have the suspicion it won't be able to cope with normal day to day changes in the data. If optimization for specific types of queries on specific properties is necessary

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-04 Thread Smalyshev
Smalyshev added a comment. Here is my initial proposal about the data model: https://www.mediawiki.org/wiki/Wikibase/Indexing/Data_Model Please comment.

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-03 Thread JanZerebecki
JanZerebecki added a comment. ! In T76373#799504, @Smalyshev wrote: # Right now we completely ignore references. Do we want to keep them in the index too? Yes, although the other parts of statements are more important. One might query for things that are not sourced and/or only have a source

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-03 Thread JanZerebecki
JanZerebecki added a comment. ! In T76373#799504, @Smalyshev wrote: # How to handle qualifiers? As an example, such ones as point-in-time and start-date/end-date. Most queries we'd do probably would be interested in current values, but some may need past values. For queries not explicitly

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-03 Thread Smalyshev
Smalyshev added a comment. @janzerebecki The issue here is that many items do not have preferred state. I.e. take https://www.wikidata.org/wiki/Q30. What is the population of the USA? We don't have any number marked as preferred. We either have to report we have no idea about US population,

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-03 Thread Lydia_Pintscher
Lydia_Pintscher added a comment. Do not worry about the current usage of ranks. As soon as their use becomes more meaningful, through queries for example, they will be used more.

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-03 Thread Smalyshev
Smalyshev added a comment. So we have a chicken-and-egg problem here. Should we code for data that is ranked properly (but does not exist yet) and hope the data will catch up, and the querying will be complicated until then, or should we code for current data to make querying the current data

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-03 Thread JanZerebecki
JanZerebecki added a comment. Write code for data that is ranked properly. But even with properly ranked data everything needs to cope with multiple answers that are contradictory. See it as multiple possible realities. Being able to cope with this is designed into Wikidata and thus needs to
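A minimal sketch of what rank-aware selection that tolerates multiple answers could look like. Wikidata statements carry one of three ranks (preferred, normal, deprecated); the fallback rule used here (return all preferred statements, or all normal ones when none is preferred) is an assumption for illustration, not a rule agreed on in this thread, and the population figures are made-up placeholders.

```python
from typing import List, Tuple

# (value, rank) pairs; rank is "preferred", "normal" or "deprecated".
Statement = Tuple[str, str]

def best_ranked(statements: List[Statement]) -> List[Statement]:
    """Return every statement of the best available rank.

    Deprecated statements are never returned. The result may contain
    several, possibly contradictory, values: callers must handle that.
    """
    live = [s for s in statements if s[1] != "deprecated"]
    preferred = [s for s in live if s[1] == "preferred"]
    # Assumed fallback: with no preferred statement, all normal ones count.
    return preferred or live

if __name__ == "__main__":
    # Hypothetical population statements for one item, none marked preferred.
    unranked = [("300000000", "normal"), ("310000000", "normal"), ("1", "deprecated")]
    print(best_ranked(unranked))
    # The same data once one value has been marked preferred.
    ranked = [("300000000", "normal"), ("310000000", "preferred")]
    print(best_ranked(ranked))
```

The same function serves both the current data, where few statements are marked preferred, and properly ranked data, which is one way around the chicken-and-egg concern raised earlier.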

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-03 Thread Smalyshev
Smalyshev added a comment. OK, this can be done but the issue here is we can't evaluate a solution (e.g. for performance, fitness to data, etc.) such as Titan/Gremlin if we have no data to test it on. Meaning, assume we coded up all the queries under the assumption the data is ranked properly. But

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-02 Thread GWicke
GWicke added a comment. ! In T76373#799513, @Smalyshev wrote: Technical issues: # On import, Titan sometimes slows down and gets into GC loops. # On querying, for vertices with a lot of edges (such as `wd(Q5).in(P31)`, i.e. humans), Titan produces a backend exception: ``` Caused by:

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-02 Thread Smalyshev
Smalyshev added a comment. OK, setting `thrift_framed_transport_size_in_mb` in `cassandra.yaml` in both Cassandra and Titan (they both have the yaml file) to 256 seems to eliminate the Frame size error, now `g.wd('Q5').in('P31').labelEn[0]` works and produces 'Douglas Adams' as it should.

[Wikidata-bugs] [Maniphest] [Commented On] T76373: Evaluate Titan as graph storage/query engine for Wikidata Query service

2014-12-02 Thread GWicke
GWicke added a comment. ! In T76373#802449, @Smalyshev wrote: Note, that running Titan with Cassandra embedded requires GC tuning. While embedded Cassandra can provide lower latency query answering, its GC behavior under load is less predictable. Yeah, agreed. GC scaling limits are kind of