Hi Sebastian,

>Is there a list of geodata issues, somewhere? Can you give some example?

My main "pain" points:

- the Cebuano geo duplicates:
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2017/10#Cebuano
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2018/06#A_proposed_course_of_action_for_dealing_with_cebwiki/svwiki_geographic_duplicates

- detecting "anonymous" edits of Wikidata labels from the Wikidata JSON
dumps.  As far as I know, this is currently impossible: the JSON dump
contains no editor information, so I can't compute a score.
  This is a similar problem to the original post ( ~ quality score ),
but I would like to use the full editing history and
implement/tune my own scoring algorithm.

  When somebody (a troll) renames some cities, my matching
algorithm does not find them,
  and in those cases I could use the previous, "better" state of Wikidata.
  It is also important for merging OpenStreetMap place names with Wikidata
labels for end users.
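The anonymous-edit problem above can at least be spot-checked per item via the public MediaWiki API, since the dump itself carries no editor data. A minimal Python sketch, assuming the standard `action=query&prop=revisions` endpoint (one request per item, so only suitable for samples, not for a whole dump):

```python
# Sketch: flag items whose latest revision came from an anonymous (IP)
# editor, using the public MediaWiki revision API on wikidata.org.
import json
import urllib.request

API = "https://www.wikidata.org/w/api.php"

def is_anonymous(rev):
    # The API marks IP edits by adding an "anon" key to the revision record.
    return "anon" in rev

def latest_revision(qid):
    # Fetch the newest revision of an item, e.g. latest_revision("Q64").
    url = (API + "?action=query&prop=revisions&rvprop=user"
           "&format=json&titles=" + qid)
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]
```

With the full edit history (e.g. the XML history dumps) the same `is_anonymous` check could feed a per-label trust score.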



> Do you have a reference dataset as well, or would that be NaturalEarth
itself?

Sorry, I don't have a reference dataset, and NaturalEarth is only a
subset of "reality"; it does not contain all cities, rivers, ...
But maybe you can use OpenStreetMap as the best resource.
Sometimes I add Wikidata concordances to the
https://www.whosonfirst.org/ (WOF) gazetteer, but that data originates
mostly from similar sources ( geonames, ... ), so it can't be used as a
quality indicator.

If you need an easy example, "airports" are probably a good start for
checking Wikidata completeness.
(p238_iata_airport_code ; p239_icao_airport_code ; p240_faa_airport_code
; p931_place_served ;  p131_located_in )
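For illustration, a completeness check of this kind reduces to a set difference once the IATA codes (P238) have been extracted from a dump. A minimal sketch; the reference codes below are hypothetical placeholders, not a real registry:

```python
# Sketch: which airports from a reference list are missing from Wikidata?
# `reference` and `wikidata_codes` are sets of IATA codes; in practice the
# first would come from an authoritative airport list and the second from
# the P238 values extracted from a Wikidata JSON dump.

def missing_iata_codes(reference, wikidata_codes):
    """IATA codes present in the reference set but absent from Wikidata."""
    return sorted(set(reference) - set(wikidata_codes))

reference = {"BUD", "DEB", "LHR", "XYZ"}   # hypothetical reference list
wikidata = {"BUD", "LHR"}                  # codes found in the dump
print(missing_iata_codes(reference, wikidata))  # ['DEB', 'XYZ']
```

The same pattern works for the ICAO (P239) and FAA (P240) codes.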

> What would help you to measure completeness for adding concordances to
NaturalEarth.

I have created my own tools/scripts, because waiting for the community
to fix the cebwiki data problems takes a lot of time.

I am importing Wikidata JSON dumps into PostGIS ( SPARQL is not
flexible/scalable enough for geo matching ),
- adding some scoring based on cebwiki/srwiki/...
- creating some sheets for manual checking.
But this process is like a ~ "fuzzy left join", with lots of hacky
code and manual tuning.
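The "fuzzy left join" idea can be sketched in a few lines with stdlib difflib; this is only a toy version of the name-matching step, and the threshold value is an assumption to tune. A real pipeline would additionally compare coordinates in PostGIS:

```python
# Sketch of a fuzzy left join step: for each NaturalEarth/WOF name, keep
# the best-scoring Wikidata label, or None if nothing clears the threshold.
from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.8):
    """Return (candidate, score) for the closest label, or None below threshold."""
    scored = [(c, SequenceMatcher(None, name.lower(), c.lower()).ratio())
              for c in candidates]
    best = max(scored, key=lambda x: x[1], default=None)
    return best if best and best[1] >= threshold else None

print(best_match("Debrecen", ["Debreczen", "Szeged"]))
```

Rows where `best_match` returns None are exactly the ones that fall out of the "left join" and need manual debugging.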

If I don't find some NaturalEarth/WOF object in Wikidata, then I have
to debug manually.
The most common problems are:
- different transliterations / spellings / English vs. local names ...
- some trolling by anonymous users ( mostly from mobile phones ),
- problems with GPS coordinates,
- changes in the real data ( cities joining / splitting ), which needs a
lot of background research.
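The transliteration/spelling problem can often be softened with Unicode normalization before matching; a crude but useful first pass, using only the stdlib:

```python
# Sketch: strip accents and case differences so that spelling variants
# like "Kraków" and "Krakow" compare equal. NFKD decomposes each accented
# character into a base letter plus combining marks, which we then drop.
import unicodedata

def normalize(name):
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return stripped.casefold()

print(normalize("Kraków") == normalize("Krakow"))  # True
```

This does not solve true transliteration differences (e.g. Cyrillic vs. Latin labels), which still need language-aware handling.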

best,
Imre

Sebastian Hellmann <[email protected]> ezt írta (időpont:
2019. aug. 28., Sze, 11:11):

> Hi Imre,
>
> we can encode these rules using the JSON MongoDB database we created in
> GlobalFactSync project (
> https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE).
> As  basis for the GFS Data Browser. The database has open read access.
>
> Is there a list of geodata issues, somewhere? Can you give some example?
> GFS focuses on both: overall quality measures and very domain specific
> adaptations. We will also try to flag these issues for Wikipedians.
>
> So I see that there is some notion of what is good and what not by source.
> Do you have a reference dataset as well, or would that be NaturalEarth
> itself? What would help you to measure completeness for adding concordances
> to NaturalEarth.
>
> -- Sebastian
> On 24.08.19 21:26, Imre Samu wrote:
>
> For geodata ( human settlements/rivers/mountains/... )  ( with GPS
> coordinates ) my simple rules:
> - if it has a  "local wikipedia pages" or  any big
> lang["EN/FR/PT/ES/RU/.."]  wikipedia page ..  than it is OK.
> - if it is only in "cebuano" AND outside of "cebuano BBOX" ->  then ....
> this is lower quality
> - only:{shwiki+srwiki} AND outside of "sh"&"sr" BBOX ->  this is lower
> quality
> - only {huwiki} AND outside of CentralEuropeBBOX -> this is lower quality
> - geodata without GPS coordinate ->  ...
> - ....
> so my rules based on wikipedia pages and languages areas ...  and I prefer
> wikidata - with local wikipedia pages.
>
> This is based on my experience - adding Wikidata ID concordances to
> NaturalEarth ( https://www.naturalearthdata.com/blog/ )
>
> --
> All the best,
> Sebastian Hellmann
>
> Director of Knowledge Integration and Linked Data Technologies (KILT)
> Competence Center
> at the Institute for Applied Informatics (InfAI) at Leipzig University
> Executive Director of the DBpedia Association
> Projects: http://dbpedia.org, http://nlp2rdf.org,
> http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
> <http://www.w3.org/community/ld4lt>
> Homepage: http://aksw.org/SebastianHellmann
> Research Group: http://aksw.org
>
_______________________________________________
Wikidata mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata
