dcausse created this task. dcausse added projects: Wikidata, Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION

The rdf-spark-tools project provides a set of tools to import and munge a wikidata dump. This process makes the wikibase RDF graph available in Hive as a simple table with four columns:

- context: generally the entity URI, allowing the triples belonging to an entity to be grouped quickly; the exception is Values and References, which are grouped together under a shared context
- subject
- predicate
- object

Splitting the graph may not be entirely trivial if applied directly to the dump files (given how values and references are managed), so applying a split process on top of the Hive table described above is likely more appropriate.

The goal of this task is to write a new program in rdf-spark-tools that takes a `wikibase_rdf` partition as input and outputs two partitions:

- wikidata_main (name TBD)
- scholarly_articles

These two subgraphs must follow the same rules evaluated in T342111 <https://phabricator.wikimedia.org/T342111>:

- scholarly_articles: `direct instanceof Q13442814`
- wikidata_main: everything else

Special care must be taken to properly carry references and values, which can be shared across subgraphs. Ideally the code should be flexible enough that the rule definitions can change slightly without requiring a full rewrite of the program.

AC:
- a new program to generate this split is available from rdf-spark-tools
- a manual DAG (or an automated one) is available in the search airflow instance

TASK DETAIL
https://phabricator.wikimedia.org/T347989
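The split rule described in the task can be sketched in plain Python over an in-memory list of rows, ignoring Spark entirely. This is only a sketch of the routing logic: the predicate/entity constants and the shared-context marker below are assumptions for illustration (they do not match the actual munged dump's URIs), and it does not follow transitive reference-to-value links, which a real job would need to handle.

```python
# Illustrative sketch, not the actual rdf-spark-tools implementation.
# Rows are (context, subject, predicate, object) tuples, mirroring the
# four-column Hive table described in the task.

P31_DIRECT = "wdt:P31"              # assumed "direct instanceof" predicate
SCHOLARLY_ARTICLE = "wd:Q13442814"  # scholarly article class
SHARED_CONTEXT = "shared:ValuesAndReferences"  # assumed marker context

def split_graph(rows):
    """Return (wikidata_main, scholarly_articles) row lists."""
    # Pass 1: entities that are direct instances of Q13442814.
    scholarly = {s for _, s, p, o in rows
                 if p == P31_DIRECT and o == SCHOLARLY_ARTICLE}

    main, arts = [], []
    needed_main, needed_arts = set(), set()
    shared_rows = []
    # Pass 2: route entity triples by context; remember which
    # value/reference nodes each subgraph points at.
    for row in rows:
        ctx, s, p, o = row
        if ctx == SHARED_CONTEXT:
            shared_rows.append(row)
        elif ctx in scholarly:
            arts.append(row)
            needed_arts.add(o)
        else:
            main.append(row)
            needed_main.add(o)

    # Pass 3: copy each value/reference triple into every subgraph that
    # references its subject -- it may legitimately land in both.
    for row in shared_rows:
        if row[1] in needed_arts:
            arts.append(row)
        if row[1] in needed_main:
            main.append(row)
    return main, arts
```

A flexible rule definition (per the last requirement) could reduce to swapping out the predicate/object pair used in pass 1, or passing in an arbitrary `rows -> set of contexts` classifier function.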
