Hi Sebastian,

> Is there a list of geodata issues, somewhere? Can you give some example?
My main "pain" points:

- The Cebuano geo duplicates:
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2017/10#Cebuano
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2018/06#A_proposed_course_of_action_for_dealing_with_cebwiki/svwiki_geographic_duplicates
- Detecting anonymous edits of Wikidata labels from the Wikidata JSON dumps. As far as I know this is impossible right now: there is no such information in the JSON dump, so I can't create a score from it. This is a similar problem to the original post (~ quality score), but I would like to use the full editing history and implement/tune my own scoring algorithm. When somebody (trolls) renames city names, my matching algorithm doesn't find them; in those cases I could use the previous, "better" state of Wikidata. This is also important for merging OpenStreetMap place names with Wikidata labels for end users.

> Do you have a reference dataset as well, or would that be NaturalEarth itself?

Sorry, I don't have a reference dataset, and NaturalEarth is only a subset of "reality": it does not contain all cities, rivers, and so on. But maybe you can use OpenStreetMap as the best resource. Sometimes I add Wikidata concordances to the https://www.whosonfirst.org/ (WOF) gazetteer, but that data originates mostly from similar sources (GeoNames, ...), so it can't be used as a quality indicator.

If you need an easy example, the airports are probably a good start for checking Wikidata completeness (p238_iata_airport_code; p239_icao_airport_code; p240_faa_airport_code; p931_place_served; p131_located_in).

> What would help you to measure completeness for adding concordances to NaturalEarth.

I have created my own tools/scripts, because waiting for the community to fix the cebwiki data problems takes too long. I am importing the Wikidata JSON dumps into PostGIS (SPARQL is not flexible/scalable enough for geo matching), adding some scoring based on cebwiki/srwiki/..., and creating some sheets for manual checking.
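The sitelink-based scoring could look roughly like this; a minimal Python sketch, where the bounding box, the list of "big" wikis, and the score labels are illustrative assumptions only, not my actual PostGIS scripts:

```python
# Sketch of a sitelink/BBOX quality heuristic for a Wikidata geo item.
# CEBUANO_BBOX is a rough, assumed lon/lat box around the Philippines.
CEBUANO_BBOX = (116.0, 4.0, 127.0, 21.0)  # (min_lon, min_lat, max_lon, max_lat)
BIG_WIKIS = {"enwiki", "frwiki", "ptwiki", "eswiki", "ruwiki"}

def in_bbox(lon, lat, bbox):
    """True if the point lies inside the (min_lon, min_lat, max_lon, max_lat) box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

def quality_score(sitelinks, lon=None, lat=None):
    """Crude quality label for an item, given its set of Wikipedia sitelinks."""
    if lon is None or lat is None:
        return "no-coordinates"   # geodata without a GPS coordinate
    if sitelinks & BIG_WIKIS:
        return "ok"               # has a big-language Wikipedia page
    if sitelinks == {"cebwiki"} and not in_bbox(lon, lat, CEBUANO_BBOX):
        return "low"              # cebwiki-only item outside the Cebuano area
    return "unknown"              # needs manual checking
```

The same pattern extends to the shwiki/srwiki and huwiki rules by adding more (wiki set, BBOX) pairs.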
But this process is like a ~"fuzzy left join", with a lot of hacky code and manual tuning. If I don't find some NaturalEarth/WOF object in Wikidata, then I have to debug it manually. The most common problems are:
- different transliterations / spellings / English vs. local names;
- some trolling by anonymous users (mostly from mobile phones);
- problems with the GPS coordinates;
- changes in the real data (cities joining/splitting), so a lot of background research is needed.

best,
Imre

Sebastian Hellmann <[email protected]> wrote on Wed, 28 Aug 2019 at 11:11:

> Hi Imre,
>
> we can encode these rules using the JSON MongoDB database we created in
> the GlobalFactSync project
> (https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE)
> as the basis for the GFS Data Browser. The database has open read access.
>
> Is there a list of geodata issues, somewhere? Can you give some example?
> GFS focuses on both: overall quality measures and very domain-specific
> adaptations. We will also try to flag these issues for Wikipedians.
>
> So I see that there is some notion of what is good and what not by source.
> Do you have a reference dataset as well, or would that be NaturalEarth
> itself? What would help you to measure completeness for adding concordances
> to NaturalEarth.
>
> -- Sebastian
>
> On 24.08.19 21:26, Imre Samu wrote:
> > For geodata (human settlements/rivers/mountains/...) (with GPS coordinates) my simple rules:
> > - if it has a local Wikipedia page or any big-language ["EN/FR/PT/ES/RU/.."] Wikipedia page, then it is OK.
> > - if it is only in "cebuano" AND outside of the "cebuano BBOX" -> then this is lower quality
> > - only {shwiki+srwiki} AND outside of the "sh" & "sr" BBOX -> this is lower quality
> > - only {huwiki} AND outside of the CentralEuropeBBOX -> this is lower quality
> > - geodata without GPS coordinates -> ...
> > - ...
> > so my rules are based on Wikipedia pages and language areas, and I prefer Wikidata items with local Wikipedia pages.
> >
> > This is based on my experience adding Wikidata ID concordances to
> > NaturalEarth (https://www.naturalearthdata.com/blog/)
>
> --
> All the best,
> Sebastian Hellmann
>
> Director of Knowledge Integration and Linked Data Technologies (KILT) Competence Center
> at the Institute for Applied Informatics (InfAI) at Leipzig University
> Executive Director of the DBpedia Association
> Projects: http://dbpedia.org, http://nlp2rdf.org, http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
> Homepage: http://aksw.org/SebastianHellmann
> Research Group: http://aksw.org
_______________________________________________
Wikidata mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata
