Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

Antonin Delpeuch (lists) Mon, 16 Oct 2017 06:36:06 -0700

And… my own count was wrong too, because I forgot to add DISTINCT in my
query (if there are multiple paths from the class to "organization
(Q43229)", items will appear multiple times).


So, I get 1 168 084 now.
http://tinyurl.com/yaeqlsnl

It's easy to get these things wrong!
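
To make the double counting concrete, here is a minimal sketch in Python
rather than a query against the live endpoint. The toy graph, the class
and item names, and the helpers routes_to and path_matches are all
invented for this illustration; only the pattern wdt:P31/wdt:P279* and
the target class "organization (Q43229)" come from this thread. The toy
emits one row per route through the subclass hierarchy, which is a
simplified model of the behaviour described above and not a statement
about the exact semantics of any particular SPARQL engine.

```python
from typing import Dict, Iterator, List

# Toy data, invented for illustration (not real Wikidata content).
# The hierarchy is assumed to be acyclic.
SUBCLASS_OF: Dict[str, List[str]] = {            # stands in for wdt:P279
    "company": ["organization"],
    "employer": ["organization"],
    "family_business": ["company", "employer"],  # two routes to "organization"
    "river": ["watercourse"],
}
INSTANCE_OF: Dict[str, List[str]] = {            # stands in for wdt:P31
    "item_a": ["company"],
    "item_b": ["family_business"],
    "item_c": ["river"],
}


def routes_to(cls: str, target: str) -> Iterator[List[str]]:
    """Yield every subclass chain from cls up to target (zero or more steps)."""
    if cls == target:
        yield [cls]
        return
    for parent in SUBCLASS_OF.get(cls, []):
        for rest in routes_to(parent, target):
            yield [cls] + rest


def path_matches(target: str) -> List[str]:
    """Return one row per route, like a path query written without DISTINCT."""
    rows = []
    for item, classes in INSTANCE_OF.items():
        for cls in classes:
            for _route in routes_to(cls, target):
                rows.append(item)
    return rows


rows = path_matches("organization")
print(len(rows))       # rows without DISTINCT: 3 (item_b is counted twice)
print(len(set(rows)))  # items with DISTINCT: 2
```

In the query itself, the corresponding change would presumably be to
count with COUNT(DISTINCT ?item) instead of COUNT(?item); the exact
query behind the short link above is not reproduced here.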

Antonin

On 16/10/2017 14:16, Antonin Delpeuch (lists) wrote:
> Thanks Ettore for spotting that!
> 
> Wikidata types (P31) only make sense when you consider the "subclass of"
> (P279) property that we use to build the ontology (except in a few cases
> where the community has decided not to use any subclass for a particular
> type).
> 
> So, to retrieve all items of a certain type in SPARQL, you need to use
> something like this:
> 
> ?item wdt:P31/wdt:P279* ?type
> 
> You can also have other variants to accept non-truthy statements.
> 
> Just with this truthy version, I currently get 1 208 227 items. But note
> that there are still a lot of items where P31 is not provided, or
> subclasses which have not been connected to "organization (Q43229)"…
> 
> So in general, it's very hard to have any "guarantees that there are no
> duplicates", just because you don't have any guarantees that the
> information currently in Wikidata is complete or correct.
> 
> I would recommend trying to import something a bit smaller to get
> acquainted with how Wikidata works and what the matching process looks
> like in practice. And beyond a one-off import, as Ettore said it is
> important to think how the data will be maintained in the future…
> 
> Antonin
> 
> On 16/10/2017 13:46, Ettore RIZZA wrote:
>>     - Wikidata has 40k organisations:
>>
>>     https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%0AWHERE%20%0A%7B%0A%20%3Fitem%20wdt%3AP31%20wd%3AQ43229.%0A%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
>>
>>
>> Hi, 
>>
>> I think Wikidata contains many more organizations than that. If we
>> choose the "instance of Business enterprise", we get 135570 results. And
>> I imagine there are many other categories that bring together commercial
>> companies.
>>
>>
>> https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
>>
>> On the substance, the project to add all companies of a country would
>> make Wikidata a kind of totally free clone of Open Corporates
>> <https://opencorporates.com/>. I would of course be delighted to see
>> that, but is it not a challenge to maintain such a database? Companies
>> are like humans: they appear and disappear every day.
>>
>>  
>>
>> 2017-10-16 13:41 GMT+02:00 Sebastian Hellmann
>> <hellm...@informatik.uni-leipzig.de>:
>>
>>     Hi all,
>>
>>     the technical challenges are not so difficult.
>>
>>     - 2.2 million is the exact number of German organisations, i.e.
>>     associations and companies. They are also unique.
>>
>>     - Wikidata has 40k organisations:
>>
>>     https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%0AWHERE%20%0A%7B%0A%20%3Fitem%20wdt%3AP31%20wd%3AQ43229.%0A%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
>>
>>     so there would be a maximum of 40k duplicates. These are easy to
>>     find and deduplicate.
>>
>>     - The crawl can be done easily, a colleague has done so before.  
>>
>>
>>     The issues here are:
>>
>>     - Do you want to upload the data to Wikidata? It would be a really
>>     big extension. Can I go ahead?
>>
>>     - If the data were available externally as structured data under
>>     an open license, I would probably not suggest loading it into
>>     Wikidata, as the data could be retrieved from the official source
>>     directly. However, here this data will not be published in a
>>     decent format.
>>
>>     I thought that the way data is copied from copyrighted sources, i.e.
>>     facts only, is OK for Wikidata. This is done in a lot of places, I
>>     guess. The same goes for Wikipedia, where news articles and
>>     copyrighted books are referenced. So Wikimedia or the Wikimedia
>>     community are the experts on this.
>>
>>     All the best,
>>
>>     Sebastian
>>
>>
>>     On 16.10.2017 10:18, Neubert, Joachim wrote:
>>>
>>>     Hi Sebastian,
>>>
>>>     This is huge! It will cover almost all currently existing German
>>>     companies. Many of these will have similar names, so preparing for
>>>     disambiguation is a concern.
>>>
>>>     A good approach would be to propose a property for an external
>>>     identifier, load the data into Mix-n-match, create links for
>>>     companies already in Wikidata, and add the rest (or perhaps only
>>>     parts of them - I’m not sure if having all of them in Wikidata
>>>     makes sense, but that’s another discussion), preferably with
>>>     location and/or sector of trade in the description field.
>>>
>>>     I’ve tried to figure out what could be used as the key for an
>>>     external identifier property. However, it looks like the registry
>>>     does not offer any (persistent) URL to its entries. So for looking
>>>     up a company, apparently there are two options:
>>>
>>>     - conducting an extended search for the exact string “A&A
>>>       Dienstleistungsgesellschaft mbH”
>>>
>>>     - copying the register number “32853”, selecting the court
>>>       (Leipzig) from the corresponding dropdown list, and searching
>>>       for that
>>>
>>>     Neither way is very intuitive, even if we can provide a link to
>>>     the search form. This would make a weak connection to the source
>>>     of information. Much more important, it makes disambiguation in
>>>     Mix-n-match difficult. This applies to the preparation of your
>>>     initial load (you would not want to create duplicates), but much
>>>     more so to everybody else who wants to match their data later
>>>     on. Being forced to search for entries manually in a cumbersome
>>>     way to disambiguate a new, possibly large and rich dataset is, in
>>>     my eyes, not something we want to impose on future contributors.
>>>     And often, the free information they find in the registry (formal
>>>     name, register number, legal form, address) will not easily match
>>>     the information they have (common name, location, perhaps
>>>     founding date, and most importantly sector of trade), so
>>>     disambiguation may still be difficult.
>>>
>>>     Have you checked which parts of the accessible information shown
>>>     below can be crawled and added legally to external databases such
>>>     as Wikidata?
>>>
>>>     Cheers, Joachim
>>>
>>>     --
>>>     Joachim Neubert
>>>
>>>     ZBW – German National Library of Economics
>>>     Leibniz Information Centre for Economics
>>>     Neuer Jungfernstieg 21
>>>     20354 Hamburg
>>>     Phone +49-42834-462
>>>
>>>     *From:* Wikidata [mailto:wikidata-boun...@lists.wikimedia.org]
>>>     *On behalf of* Sebastian Hellmann
>>>     *Sent:* Sunday, 15 October 2017 09:45
>>>     *To:* wikidata@lists.wikimedia.org
>>>     *Subject:* [Wikidata] Kickstartet: Adding 2.2 million German
>>>     organisations to Wikidata
>>>
>>>     Hi all,
>>>
>>>     the German business registry contains roughly 2.2 million
>>>     organisations. Some information is paid, but some is public, i.e.
>>>     the information you get by searching at the following address and
>>>     clicking on UT (see example below):
>>>
>>>     https://www.handelsregister.de/rp_web/mask.do?Typ=e
>>>
>>>     I would like to add this to Wikidata, either by crawling or by
>>>     raising money to use crowdsourcing platforms like CrowdFlower or
>>>     Amazon Mechanical Turk.
>>>
>>>     It should meet notability criterion 2:
>>>     https://www.wikidata.org/wiki/Wikidata:Notability
>>>
>>>         2. It refers to an instance of a *clearly identifiable
>>>         conceptual or material entity*. The entity must be notable, in
>>>         the sense that it *can be described using serious and publicly
>>>         available references*. If there is no item about you yet, you
>>>         are probably not notable.
>>>
>>>     The reference is the official German business registry, which is
>>>     serious and public. Organisations are also by definition clearly
>>>     identifiable legal entities.
>>>
>>>     How can I get clearance to proceed on this?
>>>
>>>     All the best,
>>>     Sebastian
>>>
>>>     Entity data
>>>
>>>     Saxony District court *Leipzig HRB 32853* – A&A
>>>     Dienstleistungsgesellschaft mbH
>>>
>>>     Legal status: Gesellschaft mit beschränkter Haftung
>>>     Capital: 25.000,00 EUR
>>>     Date of entry: 29/08/2016 (When entering date of entry, wrong
>>>       data input can occur due to system failures!)
>>>     Date of removal: -
>>>     Balance sheet available: -
>>>     Address (subject to correction):
>>>       A&A Dienstleistungsgesellschaft mbH
>>>       Prager Straße 38-40
>>>       04317 Leipzig
>>>
>>>     -- 
>>>     All the best,
>>>     Sebastian Hellmann
>>>
>>>     Director of Knowledge Integration and Linked Data Technologies
>>>     (KILT) Competence Center
>>>     at the Institute for Applied Informatics (InfAI) at Leipzig University
>>>     Executive Director of the DBpedia Association
>>>     Projects: http://dbpedia.org, http://nlp2rdf.org,
>>>     http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
>>>     Homepage: http://aksw.org/SebastianHellmann
>>>     Research Group: http://aksw.org
>>>
>>>
>>>
>>>     _______________________________________________
>>>     Wikidata mailing list
>>>     Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>
>>>     https://lists.wikimedia.org/mailman/listinfo/wikidata
>>>     <https://lists.wikimedia.org/mailman/listinfo/wikidata>
>>
>>
>>
> 
> 
> 


