Isaac added a comment.
Weekly updates:
- I focused on the references component of the model this week. I built heavily on: Amaral, Gabriel, Alessandro Piscopo, Lucie-Aimée Kaffee, Odinaldo Rodrigues, and Elena Simperl. "Assessing the quality of sources in Wikidata across languages: a hybrid approach." Journal of Data and Information Quality (JDIQ) 13, no. 4 (2021): 1-35. <https://arxiv.org/pdf/2109.09405.pdf>
- I wrote a Python function (code below) that takes the references for a claim and maps them to high-level categories that tell us about the quality of the reference -- e.g., whether it has an external URL associated with it vs. referring to an internal Wikidata item or an import from another Wikimedia project. I can imagine weak and strong recommendations based on this -- e.g., high priority would be adding missing references, lower priority might be updating Imported from Wikimedia Project to an external URL, and very low priority might be adding a second reference.
- Using that function, I can generate basic descriptive stats on reference distributions on Wikidata (table below) and split them by property (top-100 most common properties below). From this data, you can see that we might be able to automatically infer which properties definitely need references, which ones probably should have references, and which ones probably don't, just by setting some basic heuristics. One challenge will be whether we use the current state of Wikidata (which is heavily bot-influenced, so for certain properties it reflects the choices of a few people) or try to build a more nuanced dataset based on the edit history of which properties have references when editors add them.
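
As a rough illustration of the two ideas in the bullets above, here is a minimal sketch of how claim-level recommendation priorities and property-level heuristics could be expressed. The function names, labels, and the 0.8 / 0.2 cutoffs are all hypothetical and have not been tuned on any data; it also assumes that properties that are almost always referenced today are the ones that "definitely need" references.

```python
# Hypothetical sketch: function names, labels, and cutoffs are assumptions
# made for illustration only.

def recommendation_priority(ref_count, best_ref_type):
    """Map a (ref_count, best_ref_type) pair to a (priority, suggestion)."""
    if ref_count == 'unreferenced':
        return ('high', 'add a reference')
    if best_ref_type == 'Internal-Wikimedia':
        return ('low', 'replace the Wikimedia import with an external URL')
    if ref_count == 'single':
        return ('very low', 'add a second reference')
    return ('none', 'no suggestion')


def property_reference_need(prop_unreferenced, high=0.8, low=0.2):
    """Bucket a property by the share of its claims that lack references.

    Assumes current practice is a usable signal: properties that are rarely
    unreferenced are treated as needing references.
    """
    if prop_unreferenced <= low:
        return 'definitely needs references'
    if prop_unreferenced >= high:
        return 'probably does not need references'
    return 'probably should have references'


# Example with prop_unreferenced values taken from the per-property table
for prop, share in [('P2860', 0.001), ('P31', 0.404), ('P3083', 0.996)]:
    print(prop, property_reference_need(share))
```

A purely threshold-based split like this would inherit the bot-driven biases of the current data, which is the challenge noted in the last bullet.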

# Code for categorizing references for a claim per a simple taxonomy that by proxy
# tells us something about authority/accessibility/usefulness of the reference

# types of references from worst -> best
# so if a claim has two references and one is Internal-Stated and one is External-Direct, we keep External-Direct
# (enumerate yields (index, name), so the dict maps name -> rank)
REF_ORDER = {r: i for i, r in enumerate(
    ['Internal-Inferred', 'Internal-Stated', 'Internal-Wikimedia', 'External-Identifier', 'External-Direct'])}

# all Wikidata properties that are external IDs -- used for detecting when used as part of a reference
# TODO: Maybe update to SPARQL query that is external identifier properties ONLY with URL formatter properties? (maybe that's essentially the same thing?)
# https://quarry.wmcloud.org/query/69919
EXTERNAL_ID_PROPERTIES = set()
with open('quarry-69919-wikidata-external-ids-run692643.tsv', 'r') as fin:
    for line in fin:
        EXTERNAL_ID_PROPERTIES.add(f'P{line.strip()}')

def getReferenceType(references):
    """Map references for a claim to different categories.

    Heavily inspired by: https://arxiv.org/pdf/2109.09405.pdf
    Also: https://www.wikidata.org/wiki/Help:Sources
    """
    # None or an empty list both mean the claim has no references
    if not references:
        ref_count = 'unreferenced'
        best_ref_type = None
    else:
        ref_count = 'single' if len(references) == 1 else 'multiple'
        best_ref_types = []
        for ref in references:
            props = ref['snaksOrder']
            # reference URL OR official website OR archive URL OR full work available at URL OR URL OR external data available at
            if ('P854' in props or 'P856' in props or 'P1065' in props
                    or 'P953' in props or 'P2699' in props or 'P1325' in props):
                best_ref_types.append('External-Direct')
                break  # nothing ranks higher, so stop looking
            elif any(p in EXTERNAL_ID_PROPERTIES for p in props):
                best_ref_types.append('External-Identifier')
            # Wikimedia import URL OR imported from Wikimedia project
            elif 'P4656' in props or 'P143' in props:
                best_ref_types.append('Internal-Wikimedia')
            # stated in
            elif 'P248' in props:
                best_ref_types.append('Internal-Stated')
            # inferred from Wikidata item OR based on heuristic OR based on
            elif 'P3452' in props or 'P887' in props or 'P144' in props:
                best_ref_types.append('Internal-Inferred')
            # title OR published in -- hard to interpret without more info but probably links to Wikidata item
            elif 'P1476' in props or 'P1433' in props:
                best_ref_types.append('Internal-Stated')
            else:
                best_ref_types.append(f'Unknown: {props}')
        # unknown types rank below every known type
        best_ref_type = max(best_ref_types, key=lambda x: REF_ORDER.get(x, -1))
    return (ref_count, best_ref_type)

High-level descriptive stats for every num_refs/best_ref category with over 1000 claims. I manually inspect the top Unknown properties to make sure they shouldn't be part of one of the official categories, but otherwise they'd end up being mapped to unreferenced.

+------------+--------------------------+----------+
|num_refs    |best_ref                  |num_claims|
+------------+--------------------------+----------+
|single      |External-Direct           |651044816 |
|unreferenced|null                      |339814593 |
|single      |Internal-Stated           |191615754 |
|single      |External-Identifier       |154045142 |
|single      |Internal-Wikimedia        |55315642  |
|single      |Internal-Inferred         |21253250  |
|multiple    |Internal-Stated           |3218113   |
|multiple    |External-Direct           |2825364   |
|multiple    |Internal-Wikimedia        |2791394   |
|multiple    |External-Identifier       |2262353   |
|single      |Unknown: ['P813']         |1243513   |
|single      |Unknown: ['P1640', 'P813']|101331    |
|multiple    |Internal-Inferred         |85786     |
|single      |Unknown: ['P1810', 'P813']|81642     |
|single      |Unknown: ['P6104']        |46210     |
|multiple    |Unknown: ['P813']         |15468     |
|single      |Unknown: ['P123']         |9992      |
|single      |Unknown: ['P195']         |7011      |
|multiple    |Unknown: ['P1640', 'P813']|4594      |
|single      |Unknown: ['P459']         |3949      |
|single      |Unknown: ['P217', 'P195'] |3045      |
|single      |Unknown: ['P217']         |3019      |
|single      |Unknown: ['P373']         |2986      |
|multiple    |Unknown: ['P304']         |2812      |
|single      |Unknown: ['P195', 'P217'] |2558      |
|single      |Unknown: ['P1683']        |1572      |
|single      |Unknown: ['P958']         |1549      |
|single      |Unknown: ['P3014']        |1509      |
|single      |Unknown: ['P10253']       |1348      |
|multiple    |Unknown: ['P1343']        |1285      |
|single      |Unknown: ['P304']         |1256      |
|single      |Unknown: ['P973']         |1194      |
|single      |Unknown: ['P407']         |1118      |
|single      |Unknown: ['P1343']        |1089      |
+------------+--------------------------+----------+

Top-100 most common properties on Wikidata and reference distribution:

+--------+----------+-----------------+-------------+-------------+------------+
|property|num_claims|prop_unreferenced|prop_external|prop_internal|prop_unknown|
+--------+----------+-----------------+-------------+-------------+------------+
|P2860   |287780712 |0.001            |0.999        |0.0          |0.0         |
|P2093   |134861830 |0.118            |0.871        |0.011        |0.0         |
|P31     |106419606 |0.404            |0.386        |0.21         |0.0         |
|P1476   |43568811  |0.148            |0.837        |0.014        |0.001       |
|P577    |41854467  |0.119            |0.846        |0.035        |0.0         |
|P1433   |39313917  |0.115            |0.862        |0.022        |0.001       |
|P304    |36279684  |0.101            |0.886        |0.013        |0.0         |
|P478    |36170462  |0.1              |0.885        |0.014        |0.0         |
|P1215   |33122905  |0.0              |0.0          |1.0          |0.0         |
|P433    |33009310  |0.1              |0.897        |0.002        |0.0         |
|P698    |32069969  |0.019            |0.98         |0.001        |0.0         |
|P528    |28768139  |0.005            |0.004        |0.991        |0.0         |
|P356    |28598411  |0.16             |0.831        |0.009        |0.0         |
|P50     |27852061  |0.372            |0.604        |0.023        |0.0         |
|P921    |24859815  |0.291            |0.038        |0.671        |0.0         |
|P407    |16251420  |0.787            |0.123        |0.089        |0.0         |
|P17     |15428560  |0.541            |0.218        |0.241        |0.0         |
|P131    |11726405  |0.44             |0.266        |0.294        |0.0         |
|P106    |10064651  |0.604            |0.199        |0.196        |0.001       |
|P625    |9602347   |0.281            |0.299        |0.421        |0.0         |
|P21     |8314129   |0.592            |0.103        |0.306        |0.0         |
|P3083   |8152578   |0.996            |0.0          |0.004        |0.0         |
|P6257   |8094420   |0.0              |0.0          |1.0          |0.0         |
|P6258   |8094293   |0.0              |0.0          |1.0          |0.0         |
|P6259   |8079526   |0.0              |0.0          |1.0          |0.0         |
|P2671   |7391570   |0.998            |0.0          |0.002        |0.0         |
|P59     |7374426   |0.994            |0.0          |0.006        |0.0         |
|P735    |7089362   |0.888            |0.082        |0.029        |0.001       |
|P932    |6564521   |0.125            |0.872        |0.003        |0.0         |
|P569    |6051485   |0.186            |0.386        |0.428        |0.0         |
|P2214   |5843262   |0.0              |0.0          |1.0          |0.0         |
|P10752  |5226951   |0.0              |0.0          |1.0          |0.0         |
|P10751  |5221836   |0.0              |0.0          |1.0          |0.0         |
|P27     |4720704   |0.691            |0.085        |0.223        |0.0         |
|P373    |4671599   |0.842            |0.001        |0.156        |0.001       |
|P18     |4630968   |0.736            |0.016        |0.248        |0.0         |
|P2216   |4599813   |0.0              |0.0          |1.0          |0.0         |
|P5875   |4581754   |1.0              |0.0          |0.0          |0.0         |
|P361    |4507712   |0.454            |0.271        |0.273        |0.003       |
|P646    |4420934   |0.713            |0.0          |0.286        |0.0         |
|P684    |4321135   |0.003            |0.0          |0.997        |0.0         |
|P734    |4262859   |0.765            |0.113        |0.121        |0.001       |
|P1566   |3750650   |0.136            |0.0          |0.864        |0.0         |
|P171    |3629882   |0.88             |0.014        |0.106        |0.0         |
|P225    |3621880   |0.786            |0.022        |0.192        |0.0         |
|P105    |3618190   |0.864            |0.014        |0.122        |0.0         |
|P2583   |3489597   |0.0              |0.0          |1.0          |0.0         |
|P279    |3356545   |0.225            |0.479        |0.296        |0.0         |
|P2888   |3287592   |0.864            |0.133        |0.002        |0.0         |
|P214    |3194245   |0.498            |0.172        |0.318        |0.012       |
|P19     |3188730   |0.217            |0.156        |0.627        |0.0         |
|P570    |3099086   |0.185            |0.402        |0.413        |0.001       |
|P1087   |2934424   |0.0              |0.153        |0.847        |0.0         |
|P703    |2858721   |0.008            |0.488        |0.504        |0.0         |
|P276    |2709383   |0.447            |0.459        |0.094        |0.0         |
|P571    |2668156   |0.25             |0.319        |0.43         |0.0         |
|P846    |2574389   |0.013            |0.001        |0.956        |0.03        |
|P69     |2527872   |0.286            |0.278        |0.435        |0.0         |
|P1412   |2499485   |0.71             |0.197        |0.093        |0.0         |
|P1082   |2468600   |0.066            |0.5          |0.431        |0.004       |
|P971    |2464447   |0.8              |0.0          |0.199        |0.001       |
|P953    |2446948   |0.689            |0.278        |0.03         |0.003       |
|P10585  |2279526   |0.015            |0.985        |0.0          |0.0         |
|P1435   |2133882   |0.213            |0.344        |0.443        |0.0         |
|P527    |2092753   |0.473            |0.412        |0.109        |0.006       |
|P195    |2058009   |0.35             |0.609        |0.041        |0.0         |
|P421    |2023470   |0.778            |0.04         |0.182        |0.0         |
|P641    |1978144   |0.638            |0.138        |0.224        |0.0         |
|P6216   |1966767   |0.758            |0.172        |0.069        |0.0         |
|P281    |1954649   |0.391            |0.322        |0.287        |0.0         |
|P7859   |1918647   |0.023            |0.977        |0.0          |0.0         |
|P496    |1775935   |0.992            |0.006        |0.001        |0.002       |
|P856    |1756660   |0.375            |0.166        |0.455        |0.004       |
|P108    |1739754   |0.23             |0.636        |0.133        |0.001       |
|P1104   |1645473   |0.066            |0.046        |0.887        |0.0         |
|P136    |1639956   |0.496            |0.104        |0.4          |0.0         |
|P1448   |1623660   |0.089            |0.258        |0.651        |0.001       |
|P40     |1598046   |0.173            |0.017        |0.809        |0.0         |
|P213    |1584397   |0.551            |0.271        |0.176        |0.001       |
|P54     |1566111   |0.233            |0.011        |0.756        |0.0         |
|P6179   |1540572   |0.916            |0.084        |0.0          |0.0         |
|P39     |1528039   |0.359            |0.314        |0.327        |0.0         |
|P161    |1473134   |0.193            |0.062        |0.745        |0.0         |
|P495    |1471885   |0.619            |0.129        |0.252        |0.0         |
|P2326   |1468833   |1.0              |0.0          |0.0          |0.0         |
|P227    |1459531   |0.457            |0.173        |0.367        |0.004       |
|P244    |1458838   |0.383            |0.253        |0.362        |0.001       |
|P186    |1434411   |0.21             |0.576        |0.214        |0.001       |
|P166    |1427240   |0.562            |0.167        |0.271        |0.001       |
|P2044   |1374724   |0.265            |0.081        |0.654        |0.0         |
|P5055   |1362707   |0.003            |0.0          |0.997        |0.0         |
|P6375   |1347933   |0.612            |0.253        |0.135        |0.0         |
|P235    |1312921   |0.846            |0.142        |0.012        |0.0         |
|P234    |1304992   |0.857            |0.13         |0.013        |0.0         |
|P1343   |1296440   |0.543            |0.28         |0.093        |0.084       |
|P20     |1280216   |0.295            |0.21         |0.495        |0.0         |
|P1090   |1279195   |0.0              |0.0          |1.0          |0.0         |
|P155    |1267723   |0.653            |0.004        |0.343        |0.0         |
|P156    |1249070   |0.663            |0.005        |0.332        |0.0         |
|P680    |1206178   |0.001            |0.972        |0.027        |0.0         |
+--------+----------+-----------------+-------------+-------------+------------+

TASK DETAIL
  https://phabricator.wikimedia.org/T321224
