Ladsgroup claimed this task.
Ladsgroup moved this task from To Do to Peer Review on the Item Quality Scoring
Improvement (Item Quality Scoring Improvement - Sprint 1) board.
Ladsgroup added a comment.
Our options and the downsides of each option:
- Using wbgetentities/Special:EntityData (the approach suggested as an alternative to this one)
  - Has the downside of basically not working on dumps: our entity dumps don't have histories, and we can't run it on the XML dumps because there is no easy way to inject the mapping into them.
- The separate data source (this suggestion)
  - Has huge performance downsides: you have to hit something like https://www.wikidata.org/w/api.php?action=wbgetentities&ids=P17&props=datatype for every request (see the lookup sketch below).
  - It also doesn't fully solve the dump problem, because it would still hit the API on every dump history read; that's how ORES handles datasources (it drops them before the next read). Rebuilding the history dump would mean on the order of 1B API hits just for this.
- We could introduce the concept of a local-server cache and hold the mapping there (something ORES should have and use anyway; see the cache sketch below).
  - That would be a lot of work.
  - Also, I'm not sure how it would be wired into the model and features (maybe as a datasource? Then ORES injects the datasource via an extractor? But extractors don't have a fallback chain for when a value isn't in the cache.)
  - We could add a basic cache wrapper around the APIExtractor, so anything it fetches stays there, but that would bloat the memory footprint drastically and still wouldn't fully solve the performance issue (the API would need to be hit again as soon as the really hot cache expires, and also whenever a new combination of properties shows up).
- Hard-code the mapping, which means we would have to manually maintain such a list and bloat the model and its memory footprint (probably around 1 GB per node) just for this.
- Diverge the dump-based model and the API-based model: go with the first option for the API and the fourth option for dumps (so the memory bloat wouldn't apply to the API side).
  - We already do this because of the item completeness issue (the API model has to hit the property suggester API for every item). It doesn't mean we need to drop all features that use data types; it just means we either hard-code the mapping or hit the API for the first part, and then reshape the feature-processing part based on that (same features, different ways of obtaining them; see the last sketch below).
  - It has the downside of duplicating some effort, but not that much.
Honestly, the last option sounds the least hard to achieve. I think we should go that way.
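
For illustration, the per-property lookup the "separate data source" option implies boils down to something like the following. This is only a minimal sketch of hitting the endpoint mentioned above; the helper name is made up, it is not existing ORES/revscoring code, and batching/error handling are omitted.

    import requests

    API = "https://www.wikidata.org/w/api.php"

    def fetch_property_datatypes(property_ids):
        # Look up the datatype of each property via wbgetentities (props=datatype).
        # wbgetentities caps the number of ids per call (50 for normal users);
        # batching and API-level error handling are omitted in this sketch.
        response = requests.get(API, params={
            "action": "wbgetentities",
            "ids": "|".join(property_ids),
            "props": "datatype",
            "format": "json",
        })
        response.raise_for_status()
        entities = response.json().get("entities", {})
        return {pid: info.get("datatype") for pid, info in entities.items()}

    # fetch_property_datatypes(["P17"]) -> {"P17": "wikibase-item"}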
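The local-server cache idea could then sit on top of that lookup, roughly like below. Again just a sketch reusing the fetch helper above: the class name and TTL are invented for illustration and this is not how ORES's cache layer actually works.

    import time

    class DatatypeCache:
        # Process-local property -> datatype cache (illustrative only).
        # Misses fall back to fetch_property_datatypes() from the sketch above;
        # entries expire after `ttl` seconds.
        def __init__(self, ttl=24 * 3600):
            self.ttl = ttl
            self._store = {}  # pid -> (datatype, fetched_at)

        def get_many(self, property_ids):
            now = time.time()
            missing = [pid for pid in property_ids
                       if pid not in self._store
                       or now - self._store[pid][1] > self.ttl]
            if missing:
                for pid, datatype in fetch_property_datatypes(missing).items():
                    self._store[pid] = (datatype, now)
            return {pid: self._store[pid][0]
                    for pid in property_ids if pid in self._store}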
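And for the diverged setup, the point is that the feature stays the same and only where the datatype mapping comes from differs between the dump-based and API-based variants. The property/datatype values below are real; the table name and the example feature are made up for illustration.

    # A hand-maintained slice of the property -> datatype table (the hard-coded
    # option); the real table would have to cover thousands of properties and
    # be kept in sync manually.
    HARDCODED_DATATYPES = {
        "P17": "wikibase-item",  # country
        "P18": "commonsMedia",   # image
        "P569": "time",          # date of birth
    }

    def external_id_statements(statement_property_ids, datatypes):
        # Illustrative feature: how many statements use an external-identifier
        # property. The feature itself is identical in both model variants;
        # only the source of `datatypes` changes.
        return sum(1 for pid in statement_property_ids
                   if datatypes.get(pid) == "external-id")

    # Dump-based run: external_id_statements(pids, HARDCODED_DATATYPES)
    # API-based run:  external_id_statements(pids, DatatypeCache().get_many(pids))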
TASK DETAIL
https://phabricator.wikimedia.org/T260778