Dear All, it is great that we are having this discussion, but may I suggest that we hold it on the RfP page on Wikidata? People have already asked similar questions there, and, in my experience, on-wiki discussion is likely to lead to a refined request that accommodates all suggestions.
Cheers
Yaroslav

On Mon, Oct 16, 2017 at 5:53 PM, Sebastian Hellmann <hellm...@informatik.uni-leipzig.de> wrote:
> Ah, ok, sorry, I was assuming that Blazegraph would transitively resolve
> this automatically.
>
> Ok, so let's divide the problem:
>
> # Task 1:
>
> Connect all existing organisations with the data from the Handelsregister.
> (No new identifiers added, we can start right now.)
>
> Add a constraint that all German organisations should be connected to a
> court, i.e. the registering organisation, as well as the ID assigned by the
> court.
>
> @all: any properties I can reuse for this?
>
> I will focus on this as it seems quite easy. We can first filter orgs by
> other criteria, i.e. country as a blocking key, and then string match the
> rest.
>
> # Task 2:
>
> Add all missing identifiers for the remaining orgs in the Handelsregister.
> Task 2 can be rediscussed and decided once Task 1 is sufficiently finished.
>
> # Regarding maintenance:
> I find Wikidata as such very hard to maintain, as all data is copied from
> somewhere else eventually, but Wikipedia has the same problem. In the case
> of the German business register, maintenance is especially easy as the orgs
> are stable and uniquely identifiable. Even the fact that a company gets
> shut down should still be in Wikidata, so you have historical information.
> I mean, you also keep the Roman Empire, the Hanse and even finished
> projects in Wikidata. So even if an org ceases to exist, the entry in
> Wikidata should stay.
>
> # Regarding Opencorporates
> I have a critical opinion of Opencorporates. It appears to be open, but
> you actually cannot get the data. If somebody has a data dump, please
> forward it to me. Thanks.
> On top of that, I consider Opencorporates a danger to open data. It appears
> to push open availability of data, but then it is limited to open licenses.
> Usefulness is limited as there are no free dumps and no possibility to
> duplicate it effectively.
> Wikipedia and Wikidata provide dumps and an API
> for exactly this reason. Every time somebody wants to create an open
> organisation dataset with no barriers, the existence of Opencorporates is
> blocking this.
>
> Cheers,
> Sebastian
>
>
> On 16.10.2017 15:34, Antonin Delpeuch (lists) wrote:
>
> And… my own count was wrong too, because I forgot to add DISTINCT in my
> query (if there are multiple paths from the class to "organization
> (Q43229)", items will appear multiple times).
>
> So, I get 1 168 084 now. http://tinyurl.com/yaeqlsnl
>
> It's easy to get these things wrong!
>
> Antonin
>
> On 16/10/2017 14:16, Antonin Delpeuch (lists) wrote:
>
> Thanks Ettore for spotting that!
>
> Wikidata types (P31) only make sense when you consider the "subclass of"
> (P279) property that we use to build the ontology (except in a few cases
> where the community has decided not to use any subclass for a particular
> type).
>
> So, to retrieve all items of a certain type in SPARQL, you need to use
> something like this:
>
> ?item wdt:P31/wdt:P279* ?type
>
> You can also have other variants to accept non-truthy statements.
>
> Just with this truthy version, I currently get 1 208 227 items. But note
> that there are still a lot of items where P31 is not provided, or
> subclasses which have not been connected to "organization (Q43229)"…
>
> So in general, it's very hard to have any "guarantees that there are no
> duplicates", just because you don't have any guarantees that the
> information currently in Wikidata is complete or correct.
>
> I would recommend trying to import something a bit smaller to get
> acquainted with how Wikidata works and what the matching process looks
> like in practice.
> And beyond a one-off import, as Ettore said, it is
> important to think about how the data will be maintained in the future…
>
> Antonin
>
> On 16/10/2017 13:46, Ettore RIZZA wrote:
>
> - Wikidata has 40k organisations:
>
> https://query.wikidata.org/#SELECT
> %3Fitem %3FitemLabel %0AWHERE
> %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
> bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
>
>
> Hi,
>
> I think Wikidata contains many more organizations than that. If we
> choose "instance of Business enterprise", we get 135570 results. And
> I imagine there are many other categories that bring together commercial
> companies.
>
> https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
>
> On the substance, the project to add all companies of a country would
> make Wikidata a kind of totally free clone of Open
> Corporates <https://opencorporates.com/>. I
> would of course be delighted to see
> that, but is it not a challenge to maintain such a database? Companies
> are like humans: they appear and disappear every day.
>
>
>
> 2017-10-16 13:41 GMT+02:00 Sebastian Hellmann
> <hellm...@informatik.uni-leipzig.de>:
>
> Hi all,
>
> the technical challenges are not so difficult.
>
> - 2.2 million is the exact number of German organisations, i.e.
> associations and companies. They are also unique.
>
> - Wikidata has 40k organisations:
>
> https://query.wikidata.org/#SELECT
> %3Fitem %3FitemLabel %0AWHERE
> %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
> bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
>
> so there would be a maximum of 40k duplicates. These are easy to find
> and deduplicate.
>
> - The crawl can be done easily; a colleague has done so before.
>
>
> The issues here are:
>
> - Do you want to upload the data to Wikidata? It would be a really big
> extension. Can I go ahead?
>
> - If the data were available externally as structured data under an
> open license, I would probably not suggest loading it into Wikidata,
> as the data could be retrieved from the official source directly.
> However, here this data will not be published in a decent format.
>
> I thought that the way data is copied from copyrighted sources, i.e.
> only facts, is ok for Wikidata. This is done in a lot of places, I
> guess. Same for Wikipedia, i.e. news articles and copyrighted books
> are referenced. So Wikimedia or the Wikimedia community are experts
> on this.
>
> All the best,
>
> Sebastian
>
>
> On 16.10.2017 10:18, Neubert, Joachim wrote:
>
> Hi Sebastian,
>
> This is huge! It will cover almost all currently existing German
> companies. Many of these will have similar names, so preparing for
> disambiguation is a concern.
>
> A good way for such an approach would be proposing a property for
> an external identifier, loading the data into Mix-n-match,
> creating links for companies already in Wikidata, and adding the
> rest (or perhaps only parts of them - I’m not sure if having all
> of them in Wikidata makes sense, but that’s another discussion),
> preferably with location and/or sector of trade in the description
> field.
>
> I’ve tried to figure out what could be used as the key for an external
> identifier property.
> However, it looks like the registry does not
> offer any (persistent) URL to its entries. So for looking up a
> company, apparently there are two options:
>
> - conducting an extended search for the exact string “A&A
> Dienstleistungsgesellschaft mbH”
>
> - copying the register number “32853”, selecting the
> court (Leipzig) from the corresponding dropdown list, and searching for that
>
> Neither way is very intuitive, even if we can provide a link to
> the search form. This would make a weak connection to the source
> of information. Much more importantly, it makes disambiguation in
> Mix-n-match difficult. This applies to the preparation of your
> initial load (you would not want to create duplicates), but much
> more so to everybody else who wants to match his or her data
> later on. Being forced to search for entries manually in a
> cumbersome way for disambiguation of a new, possibly large and
> rich dataset is, in my eyes, not something we want to impose on
> future contributors.
> And often, the free information they find in
> the registry (formal name, register number, legal form, address)
> will not easily match the information they have (common name,
> location, perhaps founding date, and most importantly sector of
> trade), so disambiguation may still be difficult.
>
> Have you checked which parts of the accessible information shown
> below can be crawled and added legally to external databases such
> as Wikidata?
>
> Cheers, Joachim
>
> --
> Joachim Neubert
>
> ZBW – German National Library of Economics
> Leibniz Information Centre for Economics
> Neuer Jungfernstieg 21
> 20354 Hamburg
> Phone +49-42834-462
>
> From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org]
> On Behalf Of Sebastian Hellmann
> Sent: Sunday, 15 October 2017 09:45
> To: wikidata@lists.wikimedia.org
> Subject: [Wikidata] Kickstartet: Adding 2.2 million German
> organisations to Wikidata
>
> Hi all,
>
> the German business registry contains roughly 2.2 million
> organisations. Some information is paid, but other information is public, i.e.
> the info you get by searching at the link below and clicking on UT (see
> example below):
>
> https://www.handelsregister.de/rp_web/mask.do?Typ=e
>
> I would like to add this to Wikidata, either by crawling or by
> raising money to use crowdsourcing services like CrowdFlower or
> Amazon Mechanical Turk.
>
> It should meet notability criterion 2:
> https://www.wikidata.org/wiki/Wikidata:Notability
>
> 2. It refers to an instance of a *clearly identifiable
> conceptual or material entity*. The entity must be notable, in
> the sense that it *can be described using serious and publicly
> available references*. If there is no item about you yet, you
> are probably not notable.
>
> The reference is the official German business registry, which is
> serious and public. Orgs are also by definition clearly
> identifiable legal entities.
>
> How can I get clearance to proceed on this?
>
> All the best,
> Sebastian
>
>
> Entity data
>
> Saxony District court *Leipzig HRB 32853* – A&A
> Dienstleistungsgesellschaft mbH
>
> Legal status: Gesellschaft mit beschränkter Haftung
>
> Capital: 25.000,00 EUR
>
> Date of entry: 29/08/2016
> (When entering date of entry, wrong data input can occur due to
> system failures!)
>
> Date of removal: -
>
> Balance sheet available: -
>
> Address (subject to correction):
> A&A Dienstleistungsgesellschaft mbH
> Prager Straße 38-40
> 04317 Leipzig
>
> --
> All the best,
> Sebastian Hellmann
>
> Director of Knowledge Integration and Linked Data Technologies
> (KILT) Competence Center
> at the Institute for Applied Informatics (InfAI) at Leipzig University
> Executive Director of the DBpedia Association
> Projects: http://dbpedia.org, http://nlp2rdf.org,
> http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
> Homepage: http://aksw.org/SebastianHellmann
> Research Group: http://aksw.org
>
> _______________________________________________
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
_______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata