AndrewTavis_WMDE added a comment.

  The UDF is up and running now, but we may need to discuss my resource limits, 
as running what I'd assume to be a fairly simple UDF over `wmf.wikidata_entity` 
wasn't finishing (@dcausse, @JAllemandou). Even when it does finish, I fairly 
regularly get:
  
    WARN TaskSetManager: Lost task 624.0 in stage 49.0 (TID 8638) 
(an-worker1114.eqiad.wmnet executor 1519): ExecutorLostFailure (executor 1519 
exited caused by one of the running tasks) Reason: Container killed by YARN for 
exceeding physical memory limits. 8.8 GB of 8.8 GB physical memory used. 
Consider boosting spark.executor.memoryOverhead.
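
  The warning itself points at `spark.executor.memoryOverhead`. Python UDF 
workers run outside the JVM heap, so their memory counts against the container 
limit. A minimal sketch of how the overhead could be raised when the session is 
created; the helper names and the "8g"/"2g" values are placeholders I'm 
assuming for illustration, not settings confirmed for this cluster:

```python
def executor_memory_conf(executor_memory="8g", memory_overhead="2g"):
    """Build Spark conf entries for executor heap and off-heap overhead.

    The default values are illustrative assumptions, not recommendations.
    """
    return {
        "spark.executor.memory": executor_memory,
        # Extra non-heap memory per executor; Python UDF workers draw on this.
        "spark.executor.memoryOverhead": memory_overhead,
    }


def build_session(app_name, conf):
    """Create a SparkSession with the given conf (hypothetical helper)."""
    # Imported lazily so the conf helper can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

  These settings only take effect when executors are launched, so they would 
need to be set before the session is created rather than on a running one.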
  
  Maybe there are ways for me to optimize this though, as I'm just learning all 
this through this task. Here's what I've got to check claims for whether an 
entity is a scholarly article (written this way as UDFs return `StringType` by 
default, apparently, and I didn't want to fool around more to get a boolean):
  
    from pyspark.sql import functions as F
    from pyspark.sql.functions import col

    def check_if_sa(claims):
        """
        Check to see if an entity is a scholarly article via the SA
        wikibase-entityid.
        """
        # Serialized wikibase-entityid value for scholarly article (Q13442814).
        sa_id = '{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}'
        if claims is not None and sa_id in f"{claims}":
            return "SA"
        return "Not SA"
    
    spark.udf.register("check_if_sa", check_if_sa)
    udf_check_if_sa = F.udf(lambda z: check_if_sa(z))
    
    df_wikidata_qid_entities = (
        spark.table("wmf.wikidata_entity")
        .where("snapshot = '2023-07-24'")
        .where("id LIKE 'Q%'")
        .alias("df_wikidata_entity")
    )
    
    sa_or_not = df_wikidata_qid_entities.select(
        udf_check_if_sa(col("claims")).alias("sa_or_not")
    )
    
    sa_or_not.limit(100).groupBy("sa_or_not").count().collect()
    # [Row(sa_or_not='SA', count=53), Row(sa_or_not='Not SA', count=47)]
  
  The last line has a `LIMIT` of 100 because without it the job won't finish. 
Happy to discuss this a bit on Monday as well, @dcausse 😊
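
  On the `StringType` default mentioned above: `F.udf` accepts a return type as 
its second argument, so a boolean column is possible. A minimal sketch, 
assuming the same substring check as `check_if_sa`; the names `SA_MARKER`, 
`is_scholarly_article` and `make_is_sa_udf` are made up for illustration:

```python
# Serialized wikibase-entityid for scholarly article (Q13442814),
# copied from the check in the comment above.
SA_MARKER = '{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}'


def is_scholarly_article(claims):
    """Return True if the stringified claims contain the SA entity id."""
    return claims is not None and SA_MARKER in str(claims)


def make_is_sa_udf():
    """Wrap the predicate as a Spark UDF that yields a boolean column."""
    # Imported lazily so the predicate can be tested without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import BooleanType

    # Passing a return type overrides the StringType default.
    return F.udf(is_scholarly_article, BooleanType())
```

  It could then be applied with something like 
`df.select(make_is_sa_udf()(col("claims")).alias("is_sa"))`. This only changes 
the column type; it is still a Python UDF, so it would not by itself address 
the memory problem.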

TASK DETAIL
  https://phabricator.wikimedia.org/T342111

To: AndrewTavis_WMDE
Cc: mpopov, JAllemandou, Lydia_Pintscher, dcausse, Gehel, dr0ptp4kt, 
AndrewTavis_WMDE, Aklapper, Manuel, Danny_Benjafield_WMDE, Astuthiodit_1, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, 
Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, 
Wikidata-bugs, aude, Mbch331