JAllemandou added a comment.
@diego:
This worked for me, though it takes some time to compute and needs a fair amount of resources. I hope it's close enough to what you want :)
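// Raise the number of shuffle partitions for the large join below.
spark.sql("SET spark.sql.shuffle.partitions=512")

// Register the Wikidata Parquet data as a temporary view named `wikidata`.
val wikidataParquetPath = "/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20181001"
spark.read.parquet(wikidataParquetPath).createOrReplaceTempView("wikidata")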

val df = spark.sql("""
WITH namespaced_revisions AS (
  SELECT
    wiki_db,
    revision_id,
    event_timestamp,
    page_title,
    page_namespace,
    CASE WHEN (LENGTH(namespace_localized_name) > 0)
      THEN CONCAT(namespace_localized_name, ':', page_title)
      ELSE page_title
    END AS title_namespace_localized
  FROM wmf.mediawiki_history mwh
    INNER JOIN wmf_raw.mediawiki_project_namespace_map nsm
      ON (
        mwh.wiki_db = nsm.dbname
        AND mwh.page_namespace = nsm.namespace
        AND mwh.snapshot = nsm.snapshot
      )
  WHERE mwh.snapshot = '2019-01'
    AND nsm.snapshot = '2019-01'
    AND event_entity = 'revision'
    AND NOT revision_is_deleted
),

wikidata_sitelinks AS (
  SELECT
    id AS item_id,
    EXPLODE(siteLinks) AS sitelink
  FROM wikidata
  WHERE size(siteLinks) > 0
)

SELECT
  item_id,
  wiki_db,
  revision_id,
  event_timestamp,
  page_title,
  page_namespace
FROM wikidata_sitelinks ws
  INNER JOIN namespaced_revisions nsr
    ON (
      ws.sitelink.site = nsr.wiki_db
      AND ws.sitelink.title = title_namespace_localized
    )
""")
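
To make the join key easier to follow, here is a minimal plain-Scala sketch (no Spark needed) of what the CASE expression in namespaced_revisions computes. The helper name `localizedTitle` and the sample values are hypothetical and only for illustration; they do not come from the real tables.

```scala
// Hypothetical helper mirroring the CASE expression in namespaced_revisions:
// prefix the title with the localized namespace name when one is present,
// otherwise keep the bare page title (this also covers a null namespace name,
// as the SQL ELSE branch does).
def localizedTitle(namespaceLocalizedName: String, pageTitle: String): String =
  Option(namespaceLocalizedName).filter(_.nonEmpty) match {
    case Some(ns) => s"$ns:$pageTitle"
    case None     => pageTitle
  }

// Illustrative values only.
val mainNamespaceTitle = localizedTitle("", "Some_page")     // "Some_page"
val talkNamespaceTitle = localizedTitle("Talk", "Some_page") // "Talk:Some_page"
```

The final join then only matches a sitelink when its site equals wiki_db and its title equals this prefixed title exactly, so any difference in how the two datasets format titles would drop rows from the result.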
