AndrewTavis_WMDE added a comment.

  @dcausse, do you have an idea why the direct triples for SAs and their 
subclasses plus the direct triples for non-SAs and their subclasses don't add 
up to the total number of triples? It was working out in the last notebook, as 
you saw. The only major change I've made is that the filter is now 
`.where(col("object").isin(sa_and_sasc_qids))` rather than an equality check, 
where `sa_and_sasc_qids` is the hard-coded list of QIDs from above, including 
scholarly article's (I was getting some papers back when directly querying the 
subclasses).
  
  The important snippets from the code:
  
    from pyspark.sql.functions import col
    
    df_wikidata_rdf = (
        spark.table("discovery.wikibase_rdf")
        .where("wiki='wikidata' AND date = '20230717'")
        .alias("df_wikidata_rdf")
    )
    
    sa_and_sasc_ids = (
        df_wikidata_rdf.select(col("subject").alias("sa_and_sasc_qids"))
        .where(col("predicate") == P31_DIRECT_URL)
        .where(col("object").isin(sa_and_sasc_qids))
        .alias("sa_and_sasc_ids")
    )
    
    sa_and_sasc_direct_triples = (
        df_wikidata_rdf.join(
            other=sa_and_sasc_ids, 
            on=(sa_and_sasc_ids["sa_and_sasc_qids"] == df_wikidata_rdf["context"]),
            how="inner"
        )
        .select("df_wikidata_rdf.*")
        .cache()
    )
    
    non_sa_and_sasc_direct_triples = (
        df_wikidata_rdf.join(
            other=sa_and_sasc_ids, 
            on=(sa_and_sasc_ids["sa_and_sasc_qids"] == df_wikidata_rdf["context"]),
            how="leftanti"
        )
        .select("df_wikidata_rdf.*")
        .cache()
    )
    
    print_num_str_with_commas(total_triples)
    # 15,043,483,216
    
    print_num_str_with_commas(sa_and_sasc_direct_triples.count())
    # 7,778,494,249
    
    print_num_str_with_commas(non_sa_and_sasc_direct_triples.count())
    # 7,847,030,088
    
    print_num_str_with_commas(
        total_sa_and_sasc_direct_triples + total_non_sa_and_sasc_direct_triples
    )
    # 15,625,524,337
  
  Is there something going on with the relationship between the multiple 
classes? Do we need to switch the joins up for this one?
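
  One possibility worth ruling out (an assumption, not verified against the 
table): an item that is an instance of more than one of the listed classes 
shows up once per matching P31 triple in `sa_and_sasc_ids`, so the inner join 
would emit its triples several times, while the left-anti join is unaffected. 
That would be consistent with the sum overshooting the total by 582,041,121. A 
toy pure-Python sketch of those join semantics, with made-up data and 
hypothetical helper names:

```python
from collections import Counter

# Toy triples as (context, predicate, object); every value here is made up.
triples = [
    ("Q1", "P31", "SA"),
    ("Q1", "P31", "SA_SUBCLASS"),  # Q1 is an instance of two listed classes
    ("Q1", "P50", "Q9"),
    ("Q2", "P31", "HUMAN"),
    ("Q2", "P21", "Q8"),
]
listed_classes = {"SA", "SA_SUBCLASS"}

# Same shape as sa_and_sasc_ids: one row per matching P31 triple,
# so Q1 appears twice in this list.
ids = [s for (s, p, o) in triples if p == "P31" and o in listed_classes]


def inner_join_count(rows, keys):
    # Inner join semantics: each row is emitted once per matching key row.
    key_counts = Counter(keys)
    return sum(key_counts[context] for (context, _, _) in rows)


def left_anti_count(rows, keys):
    # Left-anti semantics: a row is kept once, and only if no key matches.
    key_set = set(keys)
    return sum(1 for (context, _, _) in rows if context not in key_set)


with_dupes = inner_join_count(triples, ids) + left_anti_count(triples, ids)

deduped_ids = sorted(set(ids))
without_dupes = (
    inner_join_count(triples, deduped_ids) + left_anti_count(triples, deduped_ids)
)

print(len(triples), with_dupes, without_dupes)  # 5 8 5
```

  If that is the cause here, adding `.distinct()` to `sa_and_sasc_ids` before 
the joins should make the two counts add up to the total again.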

TASK DETAIL
  https://phabricator.wikimedia.org/T342123

To: AndrewTavis_WMDE
Cc: dcausse, Lydia_Pintscher, dr0ptp4kt, Aklapper, Manuel, 
Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, 
ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331