AndrewTavis_WMDE added a comment.
@dcausse, do you have an idea why the direct triples for SAs and their
subclasses plus the direct triples for non-SAs and their subclasses don't add
up to the total number of triples? It was working out in the last notebook, as
you saw. The only major change I've made is that the filter is now
`.where(col("object").isin(sa_and_sasc_qids))` rather than an equality check,
where `sa_and_sasc_qids` is the hard-coded list of QIDs from above, including
scholarly article's own QID (I was getting some papers back when directly
querying the subclasses).
The important snippets from the code:
df_wikidata_rdf = (
    spark.table("discovery.wikibase_rdf")
    .where("wiki='wikidata' AND date = '20230717'")
    .alias("df_wikidata_rdf")
)
sa_and_sasc_ids = (
    df_wikidata_rdf.select(col("subject").alias("sa_and_sasc_qids"))
    .where(col("predicate") == P31_DIRECT_URL)
    .where(col("object").isin(sa_and_sasc_qids))
    .alias("sa_and_sasc_ids")
)
sa_and_sasc_direct_triples = (
    df_wikidata_rdf.join(
        other=sa_and_sasc_ids,
        on=(sa_and_sasc_ids["sa_and_sasc_qids"] == df_wikidata_rdf["context"]),
        how="inner"
    )
    .select("df_wikidata_rdf.*")
    .cache()
)
non_sa_and_sasc_direct_triples = (
    df_wikidata_rdf.join(
        other=sa_and_sasc_ids,
        on=(sa_and_sasc_ids["sa_and_sasc_qids"] == df_wikidata_rdf["context"]),
        how="leftanti"
    )
    .select("df_wikidata_rdf.*")
    .cache()
)
print_num_str_with_commas(total_triples)
# 15,043,483,216
print_num_str_with_commas(sa_and_sasc_direct_triples.count())
# 7,778,494,249
print_num_str_with_commas(non_sa_and_sasc_direct_triples.count())
# 7,847,030,088
print_num_str_with_commas(
    total_sa_and_sasc_direct_triples + total_non_sa_and_sasc_direct_triples
)
# 15,625,524,337
Is there something going on with the relationship between the multiple
classes? Do we need to switch the joins up for this one?
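One possible cause, offered as an assumption rather than a confirmed diagnosis:
with `isin`, an entity that is an instance of more than one of the listed
classes appears more than once in `sa_and_sasc_ids`. An inner join then emits
each of that entity's triples once per duplicate, while a left anti join never
duplicates rows, so the two counts can exceed the total (here by 582,041,121).
The sketch below reproduces the effect in plain Python on made-up toy data. The
QIDs, the helper names, and the simplification that `context` equals the
entity ID are all hypothetical.

```python
from collections import Counter

# Hypothetical toy triples: (context, predicate, object).
# "Q_SA" and "Q_SA_SUBCLASS" are placeholders, not real QIDs.
triples = [
    ("Q1", "P31", "Q_SA"),
    ("Q1", "P31", "Q_SA_SUBCLASS"),  # Q1 is an instance of two listed classes
    ("Q1", "P50", "Q100"),
    ("Q2", "P31", "Q5"),
    ("Q2", "P569", "1900"),
]
class_qids = {"Q_SA", "Q_SA_SUBCLASS"}


def matching_ids(rows, qids):
    """One entry per matching P31 triple; duplicates are kept on purpose."""
    return [ctx for (ctx, pred, obj) in rows if pred == "P31" and obj in qids]


def inner_join_count(rows, ids):
    """Each row is emitted once per matching entry on the right-hand side."""
    multiplicity = Counter(ids)
    return sum(multiplicity[ctx] for (ctx, _, _) in rows)


def left_anti_count(rows, ids):
    """Rows with no match on the right-hand side; never duplicated."""
    id_set = set(ids)
    return sum(1 for (ctx, _, _) in rows if ctx not in id_set)


ids_with_dupes = matching_ids(triples, class_qids)  # "Q1" appears twice
ids_distinct = sorted(set(ids_with_dupes))          # "Q1" appears once

inflated = inner_join_count(triples, ids_with_dupes) + left_anti_count(triples, ids_with_dupes)
consistent = inner_join_count(triples, ids_distinct) + left_anti_count(triples, ids_distinct)

print(len(triples), inflated, consistent)
```

If this is what is happening in the notebook, deduplicating the ID DataFrame
before both joins, for example with `.distinct()` on `sa_and_sasc_ids`, should
make the two counts sum to the total again.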
TASK DETAIL
https://phabricator.wikimedia.org/T342123