Addshore added a comment.

  Using the sqooped tables.
  
  Looking at the first 10 million items:
  
    // Find item IDs present in the old wb_terms table but missing from wbt_item_terms
    spark.sql("""
    SELECT
      term_entity_id as old,
      wbit_item_id as new
    FROM (
    
    SELECT
      DISTINCT term_entity_id
    FROM joal.wikibase_wb_terms
    WHERE wiki_db = 'wikidatawiki'
      AND snapshot = '2019-10'
      AND term_entity_id > 0
      AND term_entity_id < 10000000
      AND term_entity_type = 'item'
    
    ) as old
    LEFT JOIN (
    
    SELECT
      DISTINCT wbit_item_id
    FROM joal.wikibase_wbt_item_terms
    WHERE wiki_db = 'wikidatawiki'
      AND snapshot = '2019-10'
      AND wbit_item_id > 0
      AND wbit_item_id < 10000000
    
    ) as new ON term_entity_id = wbit_item_id
    WHERE wbit_item_id IS NULL
    """).repartition(64).createOrReplaceTempView("wd_comparison_2")
    // cache() is lazy: the data is only materialized by the first action on the view
    spark.table("wd_comparison_2").cache()
  
  I find 14054 item IDs that appear in the old table (wb_terms) but not in the new one (wbt_item_terms).
  
  I'll generate a list of the item IDs we need to run over.
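
  A minimal sketch of one way to produce such a list, assuming the same spark-shell session in which the wd_comparison_2 view was registered. The toQid helper, the cast to BIGINT, and the output path are illustrative assumptions, not part of the original query.

```scala
import java.io.PrintWriter

// Hypothetical helper: numeric item ID -> item ID string, e.g. 42 -> "Q42".
def toQid(id: Long): String = s"Q$id"

// Read the missing IDs back from the temp view created by the query.
// The cast assumes term_entity_id is an integral column.
val missingIds: Array[Long] = spark.table("wd_comparison_2")
  .selectExpr("CAST(old AS BIGINT) AS old")
  .orderBy("old")
  .collect()
  .map(_.getLong(0))

println(s"Missing item IDs: ${missingIds.length}")

// A few thousand rows fit comfortably in driver memory, so collecting is fine here.
// The output path is a placeholder.
val out = new PrintWriter("/tmp/wb_terms_missing_items.txt")
try missingIds.map(toQid).foreach(q => out.println(q)) finally out.close()
```

  This writes one item ID per line, sorted numerically, which should be easy to split into batches for whatever script is run over them.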

TASK DETAIL
  https://phabricator.wikimedia.org/T239470

To: Addshore
Cc: Addshore, Aklapper, Iflorez, darthmon_wmde, alaa_wmde, DannyS712, Nandana, 
Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Jonas, Wikidata-bugs, aude, Lydia_Pintscher, Mbch331