Isaac added a comment.

  Weekly updates:
  
  - I focused on the references component of the model this week. I built 
heavily on Amaral, Gabriel, Alessandro Piscopo, Lucie-Aimée Kaffee, Odinaldo 
Rodrigues, and Elena Simperl. "Assessing the quality of sources in Wikidata 
across languages: a hybrid approach." Journal of Data and Information Quality 
(JDIQ) 13, no. 4 (2021): 1-35. <https://arxiv.org/pdf/2109.09405.pdf>
  - I wrote a Python function (code below) that takes the references for a 
claim and maps them to high-level categories that tell us about the quality of 
the reference -- e.g., has an external URL associated with it vs. referring to 
an internal Wikidata item or an import from another Wikimedia project. I can 
imagine weak and strong recommendations based on this -- e.g., high priority 
would be adding missing references, lower priority might be updating Imported 
from Wikimedia Project to an external URL, and very low priority might be 
adding a second reference.
  - Using that function, I can generate basic descriptive stats on reference 
distributions on Wikidata (table below) and split them by property 
(top-100-most-common properties below). From this data, you can see that we 
might be able to automatically infer which properties definitely need 
references, which ones probably should have references, and which ones probably 
don't, just by setting some basic heuristics. One challenge will be whether we 
use the current state of Wikidata (which is heavily bot-influenced, so for 
certain properties it reflects the choices of a few people) or try to build a 
more nuanced dataset from the edit history, based on which properties have 
references when editors add them.
  
    # Code for categorizing references for a claim per a simple taxonomy that,
    # by proxy, tells us something about the authority/accessibility/usefulness
    # of the reference.
    # Types of references from worst -> best, so if a claim has two references
    # and one is Internal-Stated and one is External-Direct, we keep
    # External-Direct.
    REF_ORDER = {r: i for i, r in enumerate(
        ['Internal-Inferred', 'Internal-Stated', 'Internal-Wikimedia',
         'External-Identifier', 'External-Direct'])}
    
    EXTERNAL_ID_PROPERTIES = set()
    # All Wikidata properties that are external IDs -- used for detecting when
    # one is used as part of a reference.
    # TODO: Maybe update to a SPARQL query that is external identifier
    # properties ONLY with URL formatter properties? (maybe that's essentially
    # the same thing?)
    # https://quarry.wmcloud.org/query/69919
    with open('quarry-69919-wikidata-external-ids-run692643.tsv', 'r') as fin:
        for line in fin:
            EXTERNAL_ID_PROPERTIES.add(f'P{line.strip()}')
    
    def getReferenceType(references):
        """Map references for a claim to different categories.
        
        Heavily inspired by: https://arxiv.org/pdf/2109.09405.pdf
        Also: https://www.wikidata.org/wiki/Help:Sources
        """
        # Treat both None and an empty list as unreferenced
        if not references:
            ref_count = 'unreferenced'
            best_ref_type = None
        else:
            ref_count = 'single' if len(references) == 1 else 'multiple'
            best_ref_types = []
            for ref in references:
                props = ref['snaksOrder']
                # reference URL OR official website OR archive URL OR full work
                # available at URL OR URL OR external data available at
                if ('P854' in props or 'P856' in props or 'P1065' in props
                        or 'P953' in props or 'P2699' in props
                        or 'P1325' in props):
                    best_ref_types.append('External-Direct')
                    break  # already the best category; no need to keep looking
                # any external identifier property
                elif any(p in EXTERNAL_ID_PROPERTIES for p in props):
                    best_ref_types.append('External-Identifier')
                # Wikimedia import URL OR imported from Wikimedia project
                elif 'P4656' in props or 'P143' in props:
                    best_ref_types.append('Internal-Wikimedia')
                # stated in
                elif 'P248' in props:
                    best_ref_types.append('Internal-Stated')
                # inferred from Wikidata item OR based on heuristic OR based on
                elif 'P3452' in props or 'P887' in props or 'P144' in props:
                    best_ref_types.append('Internal-Inferred')
                # title OR published in -- hard to interpret without more info
                # but probably links to Wikidata item
                elif 'P1476' in props or 'P1433' in props:
                    best_ref_types.append('Internal-Stated')
                else:
                    best_ref_types.append(f'Unknown: {props}')
            # Unknown categories rank below every known category
            best_ref_type = max(best_ref_types,
                                key=lambda x: REF_ORDER.get(x, -1))
        return (ref_count, best_ref_type)
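
  A quick way to sanity-check the ranking is to run it on a few toy claims. 
The sketch below is a compact, table-driven restatement of the same rule order 
so that it runs on its own; `RULES`, `RANK`, `classify_one` and 
`classify_claim` are hypothetical names, the two external-ID properties are 
only a stand-in for the full EXTERNAL_ID_PROPERTIES set, and the input shape 
(a list of dicts with a `snaksOrder` key) is assumed from the function above.

```python
# Ordered rules: the first rule whose properties appear in a reference wins.
RULES = [
    ('External-Direct', {'P854', 'P856', 'P1065', 'P953', 'P2699', 'P1325'}),
    # Stand-in for EXTERNAL_ID_PROPERTIES (assumption: two example IDs only).
    ('External-Identifier', {'P214', 'P227'}),
    ('Internal-Wikimedia', {'P4656', 'P143'}),
    ('Internal-Stated', {'P248'}),
    ('Internal-Inferred', {'P3452', 'P887', 'P144'}),
    ('Internal-Stated', {'P1476', 'P1433'}),
]

# Rank of each category, worst -> best; unknown categories rank below all.
RANK = {name: i for i, name in enumerate(
    ['Internal-Inferred', 'Internal-Stated', 'Internal-Wikimedia',
     'External-Identifier', 'External-Direct'])}


def classify_one(ref):
    """Return the category of a single reference."""
    props = set(ref['snaksOrder'])
    for category, rule_props in RULES:
        if props & rule_props:
            return category
    return f'Unknown: {ref["snaksOrder"]}'


def classify_claim(references):
    """Return (ref_count, best_ref_type) for all references of a claim."""
    if not references:
        return ('unreferenced', None)
    ref_count = 'single' if len(references) == 1 else 'multiple'
    best = max((classify_one(r) for r in references),
               key=lambda c: RANK.get(c, -1))
    return (ref_count, best)


toy_claims = [
    None,
    [{'snaksOrder': ['P248', 'P813']}],
    [{'snaksOrder': ['P248']}, {'snaksOrder': ['P854', 'P813']}],
    [{'snaksOrder': ['P813']}],
]
for refs in toy_claims:
    print(classify_claim(refs))
```

  The third toy claim is the case described in the comments: one 
Internal-Stated reference and one External-Direct reference, where the 
External-Direct one should be kept.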
  
  
  
    High-level descriptive stats for every num_refs/best_ref category with more
    than 1,000 claims.
    I manually inspected the top Unknown properties to make sure they shouldn't
    be part of one of the official categories, but otherwise they'd end up
    being mapped to unreferenced.
    
    
    +------------+-----------------------------------------------------+----------+
    |num_refs    |best_ref                                             |num_claims|
    +------------+-----------------------------------------------------+----------+
    |single      |External-Direct                                      |651044816 |
    |unreferenced|null                                                 |339814593 |
    |single      |Internal-Stated                                      |191615754 |
    |single      |External-Identifier                                  |154045142 |
    |single      |Internal-Wikimedia                                   |55315642  |
    |single      |Internal-Inferred                                    |21253250  |
    |multiple    |Internal-Stated                                      |3218113   |
    |multiple    |External-Direct                                      |2825364   |
    |multiple    |Internal-Wikimedia                                   |2791394   |
    |multiple    |External-Identifier                                  |2262353   |
    |single      |Unknown: ['P813']                                    |1243513   |
    |single      |Unknown: ['P1640', 'P813']                           |101331    |
    |multiple    |Internal-Inferred                                    |85786     |
    |single      |Unknown: ['P1810', 'P813']                           |81642     |
    |single      |Unknown: ['P6104']                                   |46210     |
    |multiple    |Unknown: ['P813']                                    |15468     |
    |single      |Unknown: ['P123']                                    |9992      |
    |single      |Unknown: ['P195']                                    |7011      |
    |multiple    |Unknown: ['P1640', 'P813']                           |4594      |
    |single      |Unknown: ['P459']                                    |3949      |
    |single      |Unknown: ['P217', 'P195']                            |3045      |
    |single      |Unknown: ['P217']                                    |3019      |
    |single      |Unknown: ['P373']                                    |2986      |
    |multiple    |Unknown: ['P304']                                    |2812      |
    |single      |Unknown: ['P195', 'P217']                            |2558      |
    |single      |Unknown: ['P1683']                                   |1572      |
    |single      |Unknown: ['P958']                                    |1549      |
    |single      |Unknown: ['P3014']                                   |1509      |
    |single      |Unknown: ['P10253']                                  |1348      |
    |multiple    |Unknown: ['P1343']                                   |1285      |
    |single      |Unknown: ['P304']                                    |1256      |
    |single      |Unknown: ['P973']                                    |1194      |
    |single      |Unknown: ['P407']                                    |1118      |
    |single      |Unknown: ['P1343']                                   |1089      |
    +------------+-----------------------------------------------------+----------+
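
  To get from these fine-grained categories to coarser shares 
(unreferenced / external / internal / unknown), one option is to collapse 
best_ref by its prefix and normalize the claim counts. The sketch below is an 
illustration only: `coarse_bucket` and `bucket_shares` are hypothetical names, 
the prefix-based mapping is an assumption rather than the exact aggregation 
used for the per-property table, and only the rows with more than one million 
claims are copied in.

```python
from collections import defaultdict


def coarse_bucket(best_ref):
    """Collapse a best_ref label into a coarse bucket (assumed mapping)."""
    if best_ref is None:
        return 'unreferenced'
    if best_ref.startswith('External-'):
        return 'external'
    if best_ref.startswith('Internal-'):
        return 'internal'
    return 'unknown'


def bucket_shares(rows):
    """rows: iterable of (num_refs, best_ref, num_claims) -> {bucket: share}."""
    totals = defaultdict(int)
    for _num_refs, best_ref, num_claims in rows:
        totals[coarse_bucket(best_ref)] += num_claims
    grand_total = sum(totals.values())
    return {bucket: count / grand_total for bucket, count in totals.items()}


# Rows with more than one million claims, copied from the table above.
ROWS = [
    ('single', 'External-Direct', 651044816),
    ('unreferenced', None, 339814593),
    ('single', 'Internal-Stated', 191615754),
    ('single', 'External-Identifier', 154045142),
    ('single', 'Internal-Wikimedia', 55315642),
    ('single', 'Internal-Inferred', 21253250),
    ('multiple', 'Internal-Stated', 3218113),
    ('multiple', 'External-Direct', 2825364),
    ('multiple', 'Internal-Wikimedia', 2791394),
    ('multiple', 'External-Identifier', 2262353),
    ('single', "Unknown: ['P813']", 1243513),
]

shares = bucket_shares(ROWS)
for bucket, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f'{bucket:<12} {share:.3f}')
```

  Grouping the same rows by property before normalizing would give one row of 
shares per property.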
  
  
  
    Top-100 most common properties on Wikidata and reference distribution
    
    +--------+----------+-----------------+-------------+-------------+------------+
    |property|num_claims|prop_unreferenced|prop_external|prop_internal|prop_unknown|
    +--------+----------+-----------------+-------------+-------------+------------+
    |P2860   |287780712 |0.001            |0.999        |0.0          |0.0         |
    |P2093   |134861830 |0.118            |0.871        |0.011        |0.0         |
    |P31     |106419606 |0.404            |0.386        |0.21         |0.0         |
    |P1476   |43568811  |0.148            |0.837        |0.014        |0.001       |
    |P577    |41854467  |0.119            |0.846        |0.035        |0.0         |
    |P1433   |39313917  |0.115            |0.862        |0.022        |0.001       |
    |P304    |36279684  |0.101            |0.886        |0.013        |0.0         |
    |P478    |36170462  |0.1              |0.885        |0.014        |0.0         |
    |P1215   |33122905  |0.0              |0.0          |1.0          |0.0         |
    |P433    |33009310  |0.1              |0.897        |0.002        |0.0         |
    |P698    |32069969  |0.019            |0.98         |0.001        |0.0         |
    |P528    |28768139  |0.005            |0.004        |0.991        |0.0         |
    |P356    |28598411  |0.16             |0.831        |0.009        |0.0         |
    |P50     |27852061  |0.372            |0.604        |0.023        |0.0         |
    |P921    |24859815  |0.291            |0.038        |0.671        |0.0         |
    |P407    |16251420  |0.787            |0.123        |0.089        |0.0         |
    |P17     |15428560  |0.541            |0.218        |0.241        |0.0         |
    |P131    |11726405  |0.44             |0.266        |0.294        |0.0         |
    |P106    |10064651  |0.604            |0.199        |0.196        |0.001       |
    |P625    |9602347   |0.281            |0.299        |0.421        |0.0         |
    |P21     |8314129   |0.592            |0.103        |0.306        |0.0         |
    |P3083   |8152578   |0.996            |0.0          |0.004        |0.0         |
    |P6257   |8094420   |0.0              |0.0          |1.0          |0.0         |
    |P6258   |8094293   |0.0              |0.0          |1.0          |0.0         |
    |P6259   |8079526   |0.0              |0.0          |1.0          |0.0         |
    |P2671   |7391570   |0.998            |0.0          |0.002        |0.0         |
    |P59     |7374426   |0.994            |0.0          |0.006        |0.0         |
    |P735    |7089362   |0.888            |0.082        |0.029        |0.001       |
    |P932    |6564521   |0.125            |0.872        |0.003        |0.0         |
    |P569    |6051485   |0.186            |0.386        |0.428        |0.0         |
    |P2214   |5843262   |0.0              |0.0          |1.0          |0.0         |
    |P10752  |5226951   |0.0              |0.0          |1.0          |0.0         |
    |P10751  |5221836   |0.0              |0.0          |1.0          |0.0         |
    |P27     |4720704   |0.691            |0.085        |0.223        |0.0         |
    |P373    |4671599   |0.842            |0.001        |0.156        |0.001       |
    |P18     |4630968   |0.736            |0.016        |0.248        |0.0         |
    |P2216   |4599813   |0.0              |0.0          |1.0          |0.0         |
    |P5875   |4581754   |1.0              |0.0          |0.0          |0.0         |
    |P361    |4507712   |0.454            |0.271        |0.273        |0.003       |
    |P646    |4420934   |0.713            |0.0          |0.286        |0.0         |
    |P684    |4321135   |0.003            |0.0          |0.997        |0.0         |
    |P734    |4262859   |0.765            |0.113        |0.121        |0.001       |
    |P1566   |3750650   |0.136            |0.0          |0.864        |0.0         |
    |P171    |3629882   |0.88             |0.014        |0.106        |0.0         |
    |P225    |3621880   |0.786            |0.022        |0.192        |0.0         |
    |P105    |3618190   |0.864            |0.014        |0.122        |0.0         |
    |P2583   |3489597   |0.0              |0.0          |1.0          |0.0         |
    |P279    |3356545   |0.225            |0.479        |0.296        |0.0         |
    |P2888   |3287592   |0.864            |0.133        |0.002        |0.0         |
    |P214    |3194245   |0.498            |0.172        |0.318        |0.012       |
    |P19     |3188730   |0.217            |0.156        |0.627        |0.0         |
    |P570    |3099086   |0.185            |0.402        |0.413        |0.001       |
    |P1087   |2934424   |0.0              |0.153        |0.847        |0.0         |
    |P703    |2858721   |0.008            |0.488        |0.504        |0.0         |
    |P276    |2709383   |0.447            |0.459        |0.094        |0.0         |
    |P571    |2668156   |0.25             |0.319        |0.43         |0.0         |
    |P846    |2574389   |0.013            |0.001        |0.956        |0.03        |
    |P69     |2527872   |0.286            |0.278        |0.435        |0.0         |
    |P1412   |2499485   |0.71             |0.197        |0.093        |0.0         |
    |P1082   |2468600   |0.066            |0.5          |0.431        |0.004       |
    |P971    |2464447   |0.8              |0.0          |0.199        |0.001       |
    |P953    |2446948   |0.689            |0.278        |0.03         |0.003       |
    |P10585  |2279526   |0.015            |0.985        |0.0          |0.0         |
    |P1435   |2133882   |0.213            |0.344        |0.443        |0.0         |
    |P527    |2092753   |0.473            |0.412        |0.109        |0.006       |
    |P195    |2058009   |0.35             |0.609        |0.041        |0.0         |
    |P421    |2023470   |0.778            |0.04         |0.182        |0.0         |
    |P641    |1978144   |0.638            |0.138        |0.224        |0.0         |
    |P6216   |1966767   |0.758            |0.172        |0.069        |0.0         |
    |P281    |1954649   |0.391            |0.322        |0.287        |0.0         |
    |P7859   |1918647   |0.023            |0.977        |0.0          |0.0         |
    |P496    |1775935   |0.992            |0.006        |0.001        |0.002       |
    |P856    |1756660   |0.375            |0.166        |0.455        |0.004       |
    |P108    |1739754   |0.23             |0.636        |0.133        |0.001       |
    |P1104   |1645473   |0.066            |0.046        |0.887        |0.0         |
    |P136    |1639956   |0.496            |0.104        |0.4          |0.0         |
    |P1448   |1623660   |0.089            |0.258        |0.651        |0.001       |
    |P40     |1598046   |0.173            |0.017        |0.809        |0.0         |
    |P213    |1584397   |0.551            |0.271        |0.176        |0.001       |
    |P54     |1566111   |0.233            |0.011        |0.756        |0.0         |
    |P6179   |1540572   |0.916            |0.084        |0.0          |0.0         |
    |P39     |1528039   |0.359            |0.314        |0.327        |0.0         |
    |P161    |1473134   |0.193            |0.062        |0.745        |0.0         |
    |P495    |1471885   |0.619            |0.129        |0.252        |0.0         |
    |P2326   |1468833   |1.0              |0.0          |0.0          |0.0         |
    |P227    |1459531   |0.457            |0.173        |0.367        |0.004       |
    |P244    |1458838   |0.383            |0.253        |0.362        |0.001       |
    |P186    |1434411   |0.21             |0.576        |0.214        |0.001       |
    |P166    |1427240   |0.562            |0.167        |0.271        |0.001       |
    |P2044   |1374724   |0.265            |0.081        |0.654        |0.0         |
    |P5055   |1362707   |0.003            |0.0          |0.997        |0.0         |
    |P6375   |1347933   |0.612            |0.253        |0.135        |0.0         |
    |P235    |1312921   |0.846            |0.142        |0.012        |0.0         |
    |P234    |1304992   |0.857            |0.13         |0.013        |0.0         |
    |P1343   |1296440   |0.543            |0.28         |0.093        |0.084       |
    |P20     |1280216   |0.295            |0.21         |0.495        |0.0         |
    |P1090   |1279195   |0.0              |0.0          |1.0          |0.0         |
    |P155    |1267723   |0.653            |0.004        |0.343        |0.0         |
    |P156    |1249070   |0.663            |0.005        |0.332        |0.0         |
    |P680    |1206178   |0.001            |0.972        |0.027        |0.0         |
    +--------+----------+-----------------+-------------+-------------+------------+
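
  As a rough illustration of what "basic heuristics" over this table could 
look like, the sketch below sorts properties into three groups using only 
prop_unreferenced. Everything here is an assumption for discussion: the 
function name `reference_expectation`, the 0.05 / 0.95 cut-offs, and the idea 
that a near-zero unreferenced share means references are expected. Because it 
reads the current, bot-influenced state of Wikidata, it inherits the caveat 
raised in the update above.

```python
def reference_expectation(prop_unreferenced, low=0.05, high=0.95):
    """Guess whether a property needs references from its unreferenced share.

    low/high are arbitrary illustrative thresholds, not tuned values.
    """
    if prop_unreferenced <= low:
        return 'definitely needs references'
    if prop_unreferenced >= high:
        return 'probably does not need references'
    return 'probably should have references'


# A few (property, prop_unreferenced) pairs copied from the table above.
SAMPLE = {
    'P2860': 0.001,
    'P1215': 0.0,
    'P31': 0.404,
    'P3083': 0.996,
    'P5875': 1.0,
}

labels = {prop: reference_expectation(share) for prop, share in SAMPLE.items()}
for prop, label in labels.items():
    print(f'{prop:<6} {label}')
```

  A fuller version would probably also look at the external/internal split, 
since a property that is always referenced but only ever via internal 
references might call for a different recommendation.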

TASK DETAIL
  https://phabricator.wikimedia.org/T321224

To: Isaac
Cc: diego, Miriam, Isaac, Astuthiodit_1, karapayneWMDE, Invadibot, Ywats0ns, 
maantietaja, ItamarWMDE, Akuckartz, Nandana, Abdeaitali, Lahi, Gq86, 
GoranSMilovanovic, QZanden, LawExplorer, Avner, _jensen, rosalieper, 
Scott_WUaS, Wikidata-bugs, aude, Capt_Swing, Lydia_Pintscher, Mbch331