Isaac added a comment.

  Hey @JAllemandou, some debugging: a number of items aren't showing up and I 
can't for the life of me figure out why. The few I've looked at are pretty 
normal articles (for example: https://de.wikipedia.org/wiki/Gregor_Grillemeier) 
and show up in the original parquet files 
(`/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20190204`).
  
  But according to this analysis (T209891#4798717 
<https://phabricator.wikimedia.org/T209891#4798717>) and ebernhardson's table 
(`SELECT count(page_id) FROM ebernhardson.cirrus2hive WHERE wikiid = 'enwiki' 
AND dump_date = '20190121';`), there should be ~5.7 million English articles 
with associated Wikidata items, and I'm only seeing 916 thousand. I went 
through your query but could not find anything that would cause this dropout, 
so I'm at a loss. Thoughts?
  
  Code in case I'm doing something wrong:
  
    # Load the item/page link table and register it as a temp view
    wikidataParquetPath = '/user/joal/wmf/data/wmf/wikidata/item_page_link/20190204'
    spark.read.parquet(wikidataParquetPath).createOrReplaceTempView('wikidata')
    # Count rows per wiki (the view must be registered before this query runs)
    count_per_db = sqlContext.sql('SELECT wiki_db, count(*) FROM wikidata GROUP BY wiki_db')
  
  If you then sort the result by count, you get:
  
    +--------------+--------+
    |       wiki_db|count(1)|
    +--------------+--------+
    |        zhwiki| 1245854|
    |        jawiki| 1210483|
    |        enwiki|  916393|
    |       cebwiki|  891045|
    |        svwiki|  778952|
    |        dewiki|  656622|
    |        frwiki|  414492|
    |        nlwiki|  414469|
    |        ruwiki|  413733|
    ...
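
  To put the gap in perspective, here is a rough sketch of the arithmetic using only the two enwiki figures quoted above. The 5.7 million is itself an approximation, and the variable names are just illustrative:

```python
# Figures quoted in the comment above; the expected count is approximate.
expected_enwiki = 5_700_000  # ~5.7 million English articles with Wikidata items
observed_enwiki = 916_393    # enwiki rows counted in item_page_link/20190204

# Pages that appear to have dropped out, and the share that survived.
missing = expected_enwiki - observed_enwiki
coverage = observed_enwiki / expected_enwiki

print(f"missing: ~{missing:,} pages")
print(f"coverage: {coverage:.1%}")
```

  So only about 16% of the expected enwiki pages are present, which seems too large a gap to come from the different dump dates alone.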

TASK DETAIL
  https://phabricator.wikimedia.org/T215616

To: Isaac
Cc: Marostegui, Isaac, Tbayer, jcrespo, EBernhardson, Halfak, Nuria, 
JAllemandou, diego, Nandana, Akovalyov, Banyek, Rayssa-, Lahi, Gq86, 
GoranSMilovanovic, QZanden, LawExplorer, Avner, _jensen, Wikidata-bugs, aude, 
Capt_Swing, Dinoguy1000, Mbch331, Jay8g, jeremyb