dcausse created this task.
dcausse added projects: Wikidata, Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.

TASK DESCRIPTION
  rdf-spark-tools provides a set of tools to import and munge a Wikidata dump. 
This process makes the Wikibase RDF graph available in Hive as a simple table 
with four columns:
  
  - context: generally the entity URI, which makes it quick to group the 
triples belonging to an entity; the exceptions are Values and References, 
which are grouped together
  - subject
  - predicate
  - object
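
As an illustration of the table shape described above, here is a minimal 
sketch in plain Python (not Spark or Hive). The `Row` type and the 
`group_by_context` helper are hypothetical names used only for this example; 
they are not part of rdf-spark-tools.

```python
from collections import defaultdict
from typing import NamedTuple


class Row(NamedTuple):
    """One row of the four-column table (hypothetical in-memory model)."""
    context: str
    subject: str
    predicate: str
    object: str


def group_by_context(rows):
    """Group triples by their context column, e.g. one group per entity."""
    groups = defaultdict(list)
    for row in rows:
        groups[row.context].append(row)
    return dict(groups)
```

Because the context is generally the entity URI, a single group-by on that 
column is enough to collect all triples of an entity without walking the graph.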
  
  Splitting the graph might not be entirely trivial if done directly on the 
dump files (given how values and references are managed), so applying a 
split process on top of the Hive table mentioned above might be more 
appropriate.
  
  The object of this task is to write a new program in rdf-spark-tools that 
takes a `wikibase_rdf` partition as input and outputs two partitions:
  
  - wikidata_main (name TBD)
  - scholarly_articles
  
  These two subgraphs must follow the same rules as those evaluated in T342111 
<https://phabricator.wikimedia.org/T342111>:
  
  - scholarly_articles: `direct instanceof Q13442814`
  - wikidata_main: the rest
  
  Special care must be taken to properly carry over references and values, 
which can be shared across subgraphs.
  Ideally the code should be flexible enough that the rule definitions can 
change slightly without requiring a rewrite of the whole program.
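
The split described above could be sketched roughly as follows. This is a 
plain-Python illustration over in-memory tuples, not the Spark implementation 
this task asks for. Assumptions: rows are `(context, subject, predicate, 
object)` tuples; the `VALUES` and `REFERENCES` context markers are 
hypothetical placeholders for however the table groups shared nodes; the two 
URIs are the usual Wikidata prefixes for the direct P31 property and for 
Q13442814. The rule is passed in as a function so that it can change without 
touching the rest of the logic.

```python
from collections import defaultdict

# Assumed URIs for "direct instanceof Q13442814".
P31_DIRECT = "http://www.wikidata.org/prop/direct/P31"
SCHOLARLY_ARTICLE = "http://www.wikidata.org/entity/Q13442814"
# Hypothetical context markers under which shared nodes are grouped.
SHARED_CONTEXTS = frozenset({"VALUES", "REFERENCES"})


def is_scholarly_article(entity_rows):
    """Rule for scholarly_articles: direct instanceof Q13442814."""
    return any(p == P31_DIRECT and o == SCHOLARLY_ARTICLE
               for (_ctx, _s, p, o) in entity_rows)


def split(rows, rule=is_scholarly_article):
    """Split rows into two subgraphs, copying shared nodes where referenced."""
    entities = defaultdict(list)  # entity context -> its rows
    shared = defaultdict(list)    # value/reference subject -> its rows
    for row in rows:
        context, subject = row[0], row[1]
        if context in SHARED_CONTEXTS:
            shared[subject].append(row)
        else:
            entities[context].append(row)

    out = {"scholarly_articles": [], "wikidata_main": []}
    for entity_rows in entities.values():
        target = "scholarly_articles" if rule(entity_rows) else "wikidata_main"
        out[target].extend(entity_rows)

    # Copy every value/reference reachable from a subgraph into that subgraph.
    # References can point at values, so follow links transitively.
    for subgraph in out.values():
        pending = [row[3] for row in subgraph if row[3] in shared]
        copied = set()
        while pending:
            node = pending.pop()
            if node in copied:
                continue
            copied.add(node)
            for row in shared[node]:
                subgraph.append(row)
                if row[3] in shared:
                    pending.append(row[3])
    return out
```

With this shape, a reference used by both a scholarly article and another 
entity ends up duplicated in both outputs, while a value used only on one side 
is written only there.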
  
  AC:
  
  - a new program to generate this split is available from rdf-spark-tools
  - a manual dag (or an automated one) is available in the search airflow 
instance

TASK DETAIL
  https://phabricator.wikimedia.org/T347989

To: dcausse
Cc: dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, AWesterinen, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, 
Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, 
EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, 
jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331