Isaac added a comment.
Hey @JAllemandou, some debugging: a number of items aren't showing up and I can't for the life of me figure out why. The few I've looked at are pretty normal articles (for example: https://de.wikipedia.org/wiki/Gregor_Grillemeier) and show up in the original parquet files (`/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20190204`).

But according to this analysis (T209891#4798717 <https://phabricator.wikimedia.org/T209891#4798717>) and ebernhardson's table (`SELECT count(page_id) from ebernhardson.cirrus2hive where wikiid = 'enwiki' and dump_date='20190121';`), there should be ~5.7 million English articles with associated Wikidata items, and I'm only seeing 916 thousand. I went through your query but could not find anything that would be causing this dropout, so I'm at a loss. Thoughts?

Code in case I'm doing something wrong:

    # Register the item_page_link parquet data as a temp view, then count rows per wiki
    wikidataParquetPath = '/user/joal/wmf/data/wmf/wikidata/item_page_link/20190204'
    spark.read.parquet(wikidataParquetPath).createOrReplaceTempView('wikidata')
    count_per_db = sqlContext.sql('SELECT wiki_db, count(*) FROM wikidata GROUP BY wiki_db')

If you then sort the result by count in descending order, you get:

    +--------------+--------+
    |       wiki_db|count(1)|
    +--------------+--------+
    |        zhwiki| 1245854|
    |        jawiki| 1210483|
    |        enwiki|  916393|
    |       cebwiki|  891045|
    |        svwiki|  778952|
    |        dewiki|  656622|
    |        frwiki|  414492|
    |        nlwiki|  414469|
    |        ruwiki|  413733|
    ...

TASK DETAIL
  https://phabricator.wikimedia.org/T215616

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Isaac
Cc: Marostegui, Isaac, Tbayer, jcrespo, EBernhardson, Halfak, Nuria, JAllemandou, diego, Nandana, Akovalyov, Banyek, Rayssa-, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, Avner, _jensen, Wikidata-bugs, aude, Capt_Swing, Dinoguy1000, Mbch331, Jay8g, jeremyb
_______________________________________________ Wikidata-bugs mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
