dcausse added a comment.
At a glance, I suspect that you might now get duplicated QIDs in:
sa_and_sasc_ids = (
    df_wikidata_rdf.select(col("subject").alias("sa_and_sasc_qids"))
    .where(col("predicate") == P31_DIRECT_URL)
    .where(col("object").isin(sa_and_sasc_qids))
    .alias("sa_and_sasc_ids")
)
This could be explained by entities being tagged with multiple entries found
in `sa_and_sasc_qids`: each matching P31 statement yields its own row, so the
same subject can appear more than once.
What happens if you apply `distinct()` here?
sa_and_sasc_ids = (
    df_wikidata_rdf.select(col("subject").alias("sa_and_sasc_qids"))
    .where(col("predicate") == P31_DIRECT_URL)
    .where(col("object").isin(sa_and_sasc_qids))
    .distinct()  # keep each subject once, even if it matches several classes
    .alias("sa_and_sasc_ids")
)
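To illustrate the suspected cause, here is a minimal sketch in plain Python rather than Spark, so it runs without a cluster. The triples and QIDs below are made up for illustration and are not taken from the actual dump; only the name `sa_and_sasc_qids` comes from the snippet above.

```python
# Hypothetical (subject, predicate, object) triples; all QIDs are invented.
P31 = "P31"
triples = [
    ("Q1", P31, "Q100"),
    ("Q1", P31, "Q200"),  # Q1 is an instance of two classes in the set
    ("Q2", P31, "Q100"),
    ("Q3", P31, "Q999"),  # class not in the set, filtered out
]
sa_and_sasc_qids = {"Q100", "Q200"}

# Same filter as the Spark query: one output row per matching statement.
matched = [s for s, p, o in triples if p == P31 and o in sa_and_sasc_qids]

# Equivalent of .distinct(): each subject is kept once.
deduplicated = sorted(set(matched))

print(matched)       # ['Q1', 'Q1', 'Q2']
print(deduplicated)  # ['Q1', 'Q2']
```

If the real data behaves like this, any later join or count on the subject column would be inflated by the repeated rows unless they are deduplicated first.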
TASK DETAIL
https://phabricator.wikimedia.org/T342123
To: AndrewTavis_WMDE, dcausse
Cc: dcausse, Lydia_Pintscher, dr0ptp4kt, Aklapper, Manuel,
Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja,
ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden,
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331