AndrewTavis_WMDE added a comment.
That's what we were thinking too, @dcausse :) I'm realizing, though, that I had
the `.distinct()` in the wrong place. I think it should go at the end of the
full definition of the PySpark df, like this:
# Got rid of the sa_and_... because it was getting too verbose
df_sasc_ids = (
    df_wikidata_rdf.select(col("subject").alias("distinct_sasc_qids"))
    .where(col("predicate") == P31_DIRECT_URL)
    .where(col("object").isin(sasc_qids))
    .alias("df_sasc_ids")
).distinct()
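To illustrate why the deduplication belongs after the filters, here is a minimal plain-Python sketch (not PySpark) of the same filter-then-distinct logic. The triples, the value of `P31_DIRECT_URL`, and the QIDs in `sasc_qids` are made-up placeholders, not the real data. The point is only that a subject matching more than one class shows up several times until the final distinct step.

```python
# Plain-Python analogue of the select / where / distinct chain (not PySpark).
# All values below are hypothetical placeholders for illustration only.
P31_DIRECT_URL = "<http://www.wikidata.org/prop/direct/P31>"  # assumed format
sasc_qids = {"Q_A", "Q_B"}  # hypothetical class QIDs

# (subject, predicate, object) triples; Q1 is an instance of both classes.
triples = [
    ("Q1", P31_DIRECT_URL, "Q_A"),
    ("Q1", P31_DIRECT_URL, "Q_B"),
    ("Q2", P31_DIRECT_URL, "Q_A"),
    ("Q3", P31_DIRECT_URL, "Q_C"),       # class not in sasc_qids
    ("Q2", "<other-predicate>", "Q_B"),  # wrong predicate
]

# Rough analogue of the two .where() filters, keeping only the subject.
matched_subjects = [
    s for s, p, o in triples if p == P31_DIRECT_URL and o in sasc_qids
]

# Rough analogue of the trailing .distinct(): drop repeated subjects.
distinct_subjects = sorted(set(matched_subjects))

print(matched_subjects)   # Q1 appears twice here
print(distinct_subjects)  # each subject appears once
```

In this toy data, `Q1` is listed twice before deduplication because it matches both placeholder classes, which is the situation the final `.distinct()` is meant to handle.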
Thanks for checking in!
TASK DETAIL
https://phabricator.wikimedia.org/T342123