| JAllemandou added a comment. |
Hi @Isaac, I have generated some parquet data here /user/joal/wmf/data/wmf/wikidata/item_page_link/20190204 with the following query:
spark.sql("SET spark.sql.shuffle.partitions=128")
val wikidataParquetPath = "/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20190204"
spark.read.parquet(wikidataParquetPath).createOrReplaceTempView("wikidata")
spark.sql("""
WITH namespaced_revisions AS (
SELECT
wiki_db,
page_id,
page_title,
page_namespace,
CASE WHEN (LENGTH(namespace_localized_name) > 0)
THEN CONCAT(namespace_localized_name, ':', page_title)
ELSE page_title
END AS title_namespace_localized
FROM (
SELECT
wiki_db,
page_id,
page_title,
page_namespace,
row_number() OVER (PARTITION BY wiki_db, page_id ORDER BY start_timestamp DESC) as row_num
FROM wmf.mediawiki_page_history
WHERE snapshot = '2019-01'
AND page_id IS NOT NULL AND page_id > 0
AND page_title IS NOT NULL and LENGTH(page_title) > 0
) ph
INNER JOIN wmf_raw.mediawiki_project_namespace_map nsm
ON (
ph.wiki_db = nsm.dbname
AND ph.page_namespace = nsm.namespace
AND nsm.snapshot = '2019-01'
)
WHERE row_num = 1
),
wikidata_sitelinks AS (
SELECT
id as item_id,
EXPLODE(siteLinks) AS sitelink
FROM wikidata
WHERE size(siteLinks) > 0
)
SELECT
item_id,
wiki_db,
page_id,
page_title,
page_namespace,
title_namespace_localized
FROM wikidata_sitelinks ws
INNER JOIN namespaced_revisions nsr
ON (
ws.sitelink.site = nsr.wiki_db
AND ws.sitelink.title = title_namespace_localized
)
""").repartition(16).write.parquet("/user/joal/wmf/data/wmf/wikidata/item_page_link/20190204")@diego : I can generate similar data for by-revision, but before doing wanted to be sure we agree on the fact that it'll flag every historical revision associated to an item currently linked to a page. Is that what you're after ? Or more of the history of linkagebetween page and item ?
Thanks!
TASK DETAIL
EMAIL PREFERENCES
To: JAllemandou
Cc: Isaac, Tbayer, jcrespo, EBernhardson, Halfak, Nuria, JAllemandou, diego, Nandana, Akovalyov, Banyek, AndyTan, Rayssa-, Lahi, Gq86, GoranSMilovanovic, QZanden, Marostegui, LawExplorer, Avner, Minhnv-2809, _jensen, Luke081515, Wikidata-bugs, aude, Capt_Swing, Dinoguy1000, Mbch331, Jay8g, Krenair, jeremyb
Cc: Isaac, Tbayer, jcrespo, EBernhardson, Halfak, Nuria, JAllemandou, diego, Nandana, Akovalyov, Banyek, AndyTan, Rayssa-, Lahi, Gq86, GoranSMilovanovic, QZanden, Marostegui, LawExplorer, Avner, Minhnv-2809, _jensen, Luke081515, Wikidata-bugs, aude, Capt_Swing, Dinoguy1000, Mbch331, Jay8g, Krenair, jeremyb
_______________________________________________ Wikidata-bugs mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
