Smalyshev added a comment.
Proposed storage format for dates:
1. Dates are stored as signed long integers, representing the number of seconds
since 1970-01-01 00:00:00 UTC.
2. This gives us a range of about 292 billion years:
http://www.wolframalpha.com/input/?i=9223372036854775807+seconds+in+years.
3. When
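The range claimed in point 2 can be sanity-checked with a quick sketch (plain Python, illustrative only; it assumes an average Gregorian year and ignores leap seconds):

```python
# Quick check of point 2: how many years fit in a signed 64-bit seconds value.
MAX_INT64 = 2**63 - 1                    # 9223372036854775807
SECONDS_PER_YEAR = 365.2425 * 24 * 3600  # average Gregorian year, ~31.6M seconds

years = MAX_INT64 / SECONDS_PER_YEAR
print(f"{years / 1e9:.0f} billion years")  # ~292 in each direction
```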
GWicke added a comment.
In https://phabricator.wikimedia.org/T76373#943426, @Smalyshev wrote:
Proposed storage format for dates:
1. Dates are stored as signed long integers, representing the number of seconds
since 1970-01-01 00:00:00 UTC.
2. This gives us a range of about 292 billion years
Smalyshev added a comment.
@GWicke Do we really need per-second precision when transitioning between year 292M
and 292M+1? I'm not sure it is ever required. We should ensure, of course, that
seconds(292M-12-31T23:59:59) < seconds(292M+1), and the same for low dates,
but beyond that I'm not sure
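The ordering requirement can be illustrated with a toy conversion (Python sketch, not the real converter; it assumes a fixed 365-day year and ignores leap years entirely, which is precisely the complication raised in the thread):

```python
# Toy encoding: (year, second-of-year) -> total seconds since 1970, with a
# fixed-length year. Monotonic by construction; a real calendar needs care.
SECONDS_PER_YEAR = 365 * 24 * 3600  # fixed length, leap years ignored

def to_seconds(year, second_of_year=0):
    return (year - 1970) * SECONDS_PER_YEAR + second_of_year

# The last second of year 292M must sort before the first second of 292M+1...
assert to_seconds(292_000_000, SECONDS_PER_YEAR - 1) < to_seconds(292_000_001)
# ...and symmetrically for large negative years.
assert to_seconds(-292_000_001, SECONDS_PER_YEAR - 1) < to_seconds(-292_000_000)
```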
GWicke added a comment.
In https://phabricator.wikimedia.org/T76373#943454, @Smalyshev wrote:
seconds(292M-12-31T23:59:59) < seconds(292M+1)
It's pretty likely that there were more than 365 leap years between -292M and
1970, so it's very possible for the years to be non-monotonic if we don't
Smalyshev added a comment.
For dates beyond the real Gregorian calendar, values more precise than years
have little meaning anyway, so I don't think it matters too much as long as
comparisons and lookups (e.g. which Greek philosopher was born in 427 BCE) work.
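A year-granularity lookup of that kind could be sketched like this (plain Python, illustrative; BCE years are naively stored as negative numbers, and a real encoding must also settle the year-zero/astronomical-numbering convention):

```python
# Illustrative data: (name, signed birth year), n BCE naively stored as -n.
philosophers = [
    ("Plato", -427),
    ("Aristotle", -384),
    ("Socrates", -470),
]

# Lookup: which Greek philosopher was born in 427 BCE?
born_427_bce = [name for name, year in philosophers if year == -427]
print(born_427_bce)
```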
GWicke added a comment.
In https://phabricator.wikimedia.org/T76373#943459, @Smalyshev wrote:
For dates beyond the real Gregorian calendar, values more precise than years
have little meaning anyway, so I don't think it matters too much as long as
comparisons and lookups (e.g. which Greek
Manybubbles added a comment.
I think mixed indexes as documented in
http://s3.thinkaurelius.com/docs/titan/current/indexes.html#index-mixed
should help here, as they support efficient matching on an arbitrary
combination of attributes, along with advanced range and full text queries.
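Conceptually (plain Python over an in-memory list, not the Titan or Elasticsearch API; the records are made up), a mixed index serves exactly this kind of query, combining exact matches on some attributes with range predicates on others:

```python
# Made-up vertices: exact-match attribute P31 plus a numeric attribute "born".
vertices = [
    {"label": "Douglas Adams", "P31": "Q5", "born": 1952},
    {"label": "Ada Lovelace", "P31": "Q5", "born": 1815},
    {"label": "Berlin", "P31": "Q515", "born": None},
]

def query(instance_of, born_after):
    """Exact match on P31 combined with a range predicate on born."""
    return [v["label"] for v in vertices
            if v["P31"] == instance_of
            and v["born"] is not None and v["born"] > born_after]

print(query("Q5", 1900))  # humans born after 1900
```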
GWicke added a comment.
In https://phabricator.wikimedia.org/T76373#941917, @Manybubbles wrote:
I'd spin up a new one - probably just on a single node. I think in the long
run we probably can run this on the production search cluster, but for now
let's keep it off just in case it does
Smalyshev added a comment.
Elasticsearch totally supports sets.
Right, but Titan unfortunately doesn't support mixed indexes on SET properties.
I would assume it's not a hard limitation, but rather that they haven't gotten
around to implementing it yet. The mixed index type support is very limited now
GWicke added a comment.
Another fun article for dates:
http://en.wikipedia.org/wiki/Timeline_of_the_far_future
TASK DETAIL
https://phabricator.wikimedia.org/T76373
Manybubbles added a comment.
I was using that!
GWicke added a comment.
Re performance and indexing, from a mail thread:
Earlier today Stas and I were looking a bit into what is happening behind the
scenes in some of the slower queries, like
Smalyshev added a comment.
For SET it won't be more complex to maintain, probably, but I'm not sure if the
lookups would be fast enough. I could create an additional field for that and
see how it behaves, and then we could drop the field that is not needed.
For Date, I wonder if support can't
Smalyshev added a comment.
Maybe worth checking out this:
http://www.tinkerpop.com/docs/3.0.0.M6/#vertex-properties
Titan 0.9 has TinkerPop 3, which has a significantly expanded property model: in
particular, a property can have multiple other properties attached to it, and
can itself have
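That property model maps well onto Wikidata statements; a sketch in plain Python (illustrative only, not the TinkerPop API; the field names and the number are made up) of a statement value carrying its own qualifier properties:

```python
# A claim value that itself carries properties (TinkerPop 3 style
# meta-properties), e.g. a population statement with a point-in-time qualifier.
population_claim = {
    "value": 316_000_000,  # made-up number
    "rank": "normal",
    "qualifiers": {"point_in_time": "2013"},  # attached to the value itself
}

# Qualifiers hang off the statement value, not off the item.
print(population_claim["qualifiers"]["point_in_time"])
```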
Smalyshev added a comment.
Doing the import with the new model I see that the import is significantly
slower when claims have their own vertices. Not sure if it's a big deal or not.
If it's an issue we may want to reconsider going back to claims as edges
model.
Smalyshev added a comment.
https://www.mediawiki.org/wiki/Wikibase/Indexing has been updated according to
the comments. Main change: claims are now vertices.
Smalyshev added a comment.
I also put some more comments on the discussion page
https://www.mediawiki.org/wiki/Talk:Wikibase/Indexing/Data_Model - I think it
makes sense to discuss/clarify things there, but if anybody thinks there's a
better place, please tell me.
JanZerebecki added a comment.
This is real data. If the system cannot cope with data changes of the
discussed scope, I suspect it won't be able to cope with normal day-to-day
changes in the data.
If optimization for specific types of queries on specific properties is
necessary
Smalyshev added a comment.
Here is my initial proposal about the data model:
https://www.mediawiki.org/wiki/Wikibase/Indexing/Data_Model
Please comment.
JanZerebecki added a comment.
In T76373#799504, @Smalyshev wrote:
# Right now we completely ignore references. Do we want to keep them in the
index too?
Yes, although the other parts of statements are more important. One might query
for things that are not sourced and/or only have a source
JanZerebecki added a comment.
In T76373#799504, @Smalyshev wrote:
# How to handle qualifiers? As an example, such ones as point-in-time and
start-date/end-date. Most queries we'd do probably would be interested in
current values, but some may need past values.
For queries not explicitly
Smalyshev added a comment.
@JanZerebecki The issue here is that many items do not have a preferred
statement. Take https://www.wikidata.org/wiki/Q30, for example:
what is the population of the USA? We don't have any number marked as
preferred. We either have to report we have no idea about the US population,
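One possible fallback for this, sketched in plain Python (illustrative; the values are made up): use preferred-rank statements when any exist, otherwise fall back to all normal-rank ones.

```python
# "Best rank" fallback: preferred statements win; with no preferred
# statement, all normal-rank values are returned (possibly several).
def best_values(statements):
    preferred = [s for s in statements if s["rank"] == "preferred"]
    return preferred or [s for s in statements if s["rank"] == "normal"]

# Made-up population statements, none marked preferred (the Q30 situation).
population = [
    {"value": 309_000_000, "rank": "normal"},
    {"value": 316_000_000, "rank": "normal"},
]
print([s["value"] for s in best_values(population)])  # both values come back
```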
Lydia_Pintscher added a comment.
Do not worry about the current usage of ranks. As soon as their use becomes
more meaningful, through queries for example, they will be used more.
Smalyshev added a comment.
So we have a chicken-and-egg problem here. Should we code for data that is ranked
properly (but does not exist yet) and hope the data will catch up, with querying
being complicated until then, or should we code for the current data to
make querying the current data
JanZerebecki added a comment.
Write code for data that is ranked properly. But even with properly ranked data
everything needs to cope with multiple answers that are contradictory. See it
as multiple possible realities. Being able to cope with this is designed into
Wikidata and thus needs to
Smalyshev added a comment.
OK, this can be done, but the issue here is that we can't evaluate a solution (e.g.
for performance, fitness to the data, etc.) such as Titan/Gremlin if we have no
data to test it on. Meaning, assume we coded up all the queries under the
assumption that the data is ranked properly. But
GWicke added a comment.
In T76373#799513, @Smalyshev wrote:
Technical issues:
1. On import, Titan sometimes slows down and gets into GC loops.
2. On querying, for vertices with a lot of edges (such as
`wd(Q5).in(P31)`, i.e. humans), Titan produces a backend exception:
```
Caused by:
Smalyshev added a comment.
OK, setting `thrift_framed_transport_size_in_mb` in `cassandra.yaml` in both
Cassandra and Titan (they both have the yaml file) to 256 seems to eliminate
the Frame size error, now `g.wd('Q5').in('P31').labelEn[0]` works and produces
'Douglas Adams' as it should.
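For reference, the fragment would look like this in both copies of `cassandra.yaml`:

```yaml
# cassandra.yaml (both the Cassandra node's copy and the one Titan reads)
thrift_framed_transport_size_in_mb: 256
```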
GWicke added a comment.
In T76373#802449, @Smalyshev wrote:
Note that running Titan with embedded Cassandra requires GC tuning. While
embedded Cassandra can provide lower-latency query answering, its GC behavior
under load is less predictable.
Yeah, agreed. GC scaling limits are kind of