And… my own count was wrong too, because I forgot to add DISTINCT in my query (if there are multiple paths from the class to "organization (Q43229)", items will appear multiple times).
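To make the DISTINCT point concrete, here is a toy sketch in Python. It is not the query behind the link, and all identifiers in it are made up rather than real Wikidata items. It assumes the simplest way duplicates can arise: one result row per (item, matching class) pair, so an item with two types that both lead to "organization" is returned twice. The SPARQL string at the end shows what a de-duplicated count could look like; it is only an illustration built from the property path quoted in this thread.

```python
from collections import deque

# Toy data with hypothetical identifiers (not real Wikidata items).
# P31: item -> classes it is an instance of.
INSTANCE_OF = {
    "item_a": {"company", "charity"},  # two types, both are organizations
    "item_b": {"company"},
    "item_c": {"river"},               # not an organization
}
# P279: class -> direct superclasses.
SUBCLASS_OF = {
    "company": {"organization"},
    "charity": {"organization"},
    "river": {"watercourse"},
}


def reaches(cls, target):
    """Return True if target is reachable from cls via zero or more P279 steps."""
    seen, queue = {cls}, deque([cls])
    while queue:
        current = queue.popleft()
        if current == target:
            return True
        for parent in SUBCLASS_OF.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return False


def solutions(target):
    """Return one row per (item, class) pair whose class reaches target."""
    return [
        item
        for item, classes in sorted(INSTANCE_OF.items())
        for cls in sorted(classes)
        if reaches(cls, target)
    ]


rows = solutions("organization")
plain_count = len(rows)          # analogous to COUNT(?item)
distinct_count = len(set(rows))  # analogous to COUNT(DISTINCT ?item)
print(plain_count, distinct_count)

# An illustrative de-duplicated count using the property path from this thread.
COUNT_QUERY = """
SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {
  ?item wdt:P31/wdt:P279* wd:Q43229 .
}
"""
```

In this toy graph the plain count is 3 and the distinct count is 2, because item_a is matched once through each of its two types.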
So, I get 1 168 084 now.
http://tinyurl.com/yaeqlsnl

It's easy to get these things wrong!

Antonin

On 16/10/2017 14:16, Antonin Delpeuch (lists) wrote:
> Thanks Ettore for spotting that!
>
> Wikidata types (P31) only make sense when you consider the "subclass of"
> (P279) property that we use to build the ontology (except in a few cases
> where the community has decided not to use any subclass for a particular
> type).
>
> So, to retrieve all items of a certain type in SPARQL, you need to use
> something like this:
>
> ?item wdt:P31/wdt:P279* ?type
>
> You can also have other variants to accept non-truthy statements.
>
> Just with this truthy version, I currently get 1 208 227 items. But note
> that there are still a lot of items where P31 is not provided, or
> subclasses which have not been connected to "organization (Q43229)"…
>
> So in general, it's very hard to have any "guarantees that there are no
> duplicates", just because you don't have any guarantees that the
> information currently in Wikidata is complete or correct.
>
> I would recommend trying to import something a bit smaller to get
> acquainted with how Wikidata works and what the matching process looks
> like in practice. And beyond a one-off import, as Ettore said it is
> important to think how the data will be maintained in the future…
>
> Antonin
>
> On 16/10/2017 13:46, Ettore RIZZA wrote:
>> - Wikidata has 40k organisations:
>>
>> https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE
>> %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
>> bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
>>
>> Hi,
>>
>> I think Wikidata contains many more organizations than that. If we
>> choose the "instance of Business enterprise", we get 135570 results. And
>> I imagine there are many other categories that bring together commercial
>> companies.
>>
>> https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
>>
>> On the substance, the project to add all companies of a country would
>> make Wikidata a kind of totally free clone of Open Corporates
>> <https://opencorporates.com/>. I would of course be delighted to see
>> that, but is it not a challenge to maintain such a database? Companies
>> are like humans: they appear and disappear every day.
>>
>> 2017-10-16 13:41 GMT+02:00 Sebastian Hellmann
>> <hellm...@informatik.uni-leipzig.de>:
>>
>> Hi all,
>>
>> the technical challenges are not so difficult.
>>
>> - 2.2 million is the exact number of German organisations, i.e.
>> associations and companies. They are also unique.
>>
>> - Wikidata has 40k organisations:
>>
>> https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE
>> %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
>> bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
>>
>> so there would be a maximum of 40k duplicates. These are easy to find
>> and deduplicate.
>>
>> - The crawl can be done easily, a colleague has done so before.
>>
>> The issues here are:
>>
>> - Do you want to upload the data in Wikidata? It would be a really big
>> extension. Can I go ahead?
>>
>> - If the data were available externally as structured data under an
>> open license, I would probably not suggest loading it into Wikidata,
>> as the data can be retrieved from the official source directly.
>> However, here this data will not be published in a decent format.
>>
>> I thought that the way data is copied from copyrighted sources, i.e.
>> only facts, is ok for Wikidata. This is done in a lot of places, I
>> guess. Same for Wikipedia, i.e. news articles and copyrighted books
>> are referenced. So Wikimedia or the Wikimedia community are experts
>> on this.
>>
>> All the best,
>>
>> Sebastian
>>
>> On 16.10.2017 10:18, Neubert, Joachim wrote:
>>>
>>> Hi Sebastian,
>>>
>>> This is huge! It will cover almost all currently existing German
>>> companies. Many of these will have similar names, so preparing for
>>> disambiguation is a concern.
>>>
>>> A good way for such an approach would be proposing a property for
>>> an external identifier, loading the data into Mix-n-match,
>>> creating links for companies already in Wikidata, and adding the
>>> rest (or perhaps only parts of them - I’m not sure if having all
>>> of them in Wikidata makes sense, but that’s another discussion),
>>> preferably with location and/or sector of trade in the description
>>> field.
>>>
>>> I’ve tried to figure out what could be used as key for an external
>>> identifier property. However, it looks like the registry does not
>>> offer any (persistent) URL to its entries. So for looking up a
>>> company, apparently there are two options:
>>>
>>> - conducting an extended search for the exact string “A&A
>>> Dienstleistungsgesellschaft mbH“
>>>
>>> - copying the register number “32853”, selecting the court
>>> (Leipzig) from the corresponding dropdown list, and searching
>>> for that
>>>
>>> Both ways are not very intuitive, even if we can provide a link to
>>> the search form. This would make a weak connection to the source
>>> of information. Much more important, it makes disambiguation in
>>> Mix-n-match difficult. This applies for the preparation of your
>>> initial load (you would not want to create duplicates). But much
>>> more so for everybody else who wants to match his or her data
>>> later on. Being forced to search for entries manually in a
>>> cumbersome way for disambiguation of a new, possibly large and
>>> rich dataset is, in my eyes, not something we want to impose on
>>> future contributors. And often, the free information they find in
>>> the registry (formal name, register number, legal form, address)
>>> will not easily match with the information they have (common name,
>>> location, perhaps founding date, and most important sector of
>>> trade), so disambiguation may still be difficult.
>>>
>>> Have you checked which parts of the accessible information as
>>> below can be crawled and added legally to external databases such
>>> as Wikidata?
>>>
>>> Cheers, Joachim
>>>
>>> --
>>> Joachim Neubert
>>>
>>> ZBW – German National Library of Economics
>>> Leibniz Information Centre for Economics
>>> Neuer Jungfernstieg 21
>>> 20354 Hamburg
>>> Phone +49-42834-462
>>>
>>> *From:* Wikidata [mailto:wikidata-boun...@lists.wikimedia.org]
>>> *On behalf of* Sebastian Hellmann
>>> *Sent:* Sunday, 15 October 2017 09:45
>>> *To:* wikidata@lists.wikimedia.org
>>> *Subject:* [Wikidata] Kickstartet: Adding 2.2 million German
>>> organisations to Wikidata
>>>
>>> Hi all,
>>>
>>> the German business registry contains roughly 2.2 million
>>> organisations. Some information is paid, but other information is
>>> public, i.e. the info you get when searching at the link below and
>>> clicking on UT (see example below):
>>>
>>> https://www.handelsregister.de/rp_web/mask.do?Typ=e
>>>
>>> I would like to add this to Wikidata, either by crawling or by
>>> raising money to use crowdsourcing platforms like CrowdFlower or
>>> Amazon Mechanical Turk.
>>>
>>> It should meet notability criteria 2:
>>> https://www.wikidata.org/wiki/Wikidata:Notability
>>>
>>> 2. It refers to an instance of a *clearly identifiable
>>> conceptual or material entity*. The entity must be notable, in
>>> the sense that it *can be described using serious and publicly
>>> available references*. If there is no item about you yet, you
>>> are probably not notable.
>>>
>>> The reference is the official German business registry, which is
>>> serious and public. Orgs are also per definition clearly
>>> identifiable legal entities.
>>>
>>> How can I get clearance to proceed on this?
>>>
>>> All the best,
>>> Sebastian
>>>
>>> Entity data
>>>
>>> Saxony District court *Leipzig HRB 32853* – A&A
>>> Dienstleistungsgesellschaft mbH
>>>
>>> Legal status: Gesellschaft mit beschränkter Haftung
>>> Capital: 25.000,00 EUR
>>> Date of entry: 29/08/2016
>>> (When entering date of entry, wrong data input can occur due to
>>> system failures!)
>>> Date of removal: -
>>> Balance sheet available: -
>>> Address (subject to correction):
>>> A&A Dienstleistungsgesellschaft mbH
>>> Prager Straße 38-40
>>> 04317 Leipzig
>>>
>>> --
>>> All the best,
>>> Sebastian Hellmann
>>>
>>> Director of Knowledge Integration and Linked Data Technologies
>>> (KILT) Competence Center
>>> at the Institute for Applied Informatics (InfAI) at Leipzig University
>>> Executive Director of the DBpedia Association
>>> Projects: http://dbpedia.org, http://nlp2rdf.org,
>>> http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
>>> Homepage: http://aksw.org/SebastianHellmann
>>> Research Group: http://aksw.org
>>>
>>> _______________________________________________
>>> Wikidata mailing list
>>> Wikidata@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wikidata