Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

Yaroslav Blanter Mon, 16 Oct 2017 09:07:31 -0700

Dear All,

it is great that we are having this discussion, but may I please suggest that
we have it on the RfP page on Wikidata? People have already asked similar
questions there, and, in my experience, on-wiki discussion will likely lead to
a refined request which will accommodate all suggestions.


Cheers
Yaroslav

On Mon, Oct 16, 2017 at 5:53 PM, Sebastian Hellmann <
hellm...@informatik.uni-leipzig.de> wrote:

> ah, ok, sorry, I was assuming that Blazegraph would transitively resolve
> this automatically.
>
> Ok, so let's divide the problem:
>
> # Task 1:
>
> Connect all existing organisations with the data from the Handelsregister.
> (No new identifiers added, we can start right now)
>
> Add a constraint that all German organisations should be connected to a
> court, i.e. the registering organisation as well as the id assigned by the
> court.
>
> @all: any properties I can reuse for this?
>
> I will focus on this as it seems quite easy. We can first filter orgs by
> other criteria, i.e. country as a blocking key and then string match the
> rest.
>
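A minimal sketch of the blocking-then-matching idea described above, assuming each record is a plain dict with the hypothetical keys "id", "name" and "country". The similarity measure (Python's difflib ratio) and the 0.9 threshold are illustrative choices, not something proposed in this thread.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def match_by_block(registry, wikidata, threshold=0.9):
    """Pair registry records with Wikidata records that share a blocking key.

    Both inputs are lists of dicts with the hypothetical keys "id", "name"
    and "country". Names are only compared within the same country block.
    """
    blocks = defaultdict(list)
    for item in wikidata:
        blocks[item["country"]].append(item)

    matches = []
    for rec in registry:
        for item in blocks.get(rec["country"], []):
            score = SequenceMatcher(
                None, normalise(rec["name"]), normalise(item["name"])
            ).ratio()
            if score >= threshold:
                matches.append((rec["id"], item["id"], round(score, 2)))
    return matches


# Illustrative data only; the Q-ids are placeholders, not real Wikidata items.
registry = [
    {"id": "Leipzig HRB 32853", "name": "A&A Dienstleistungsgesellschaft mbH", "country": "DE"},
]
wikidata = [
    {"id": "Q-hypothetical-1", "name": "A&A  Dienstleistungsgesellschaft mbH", "country": "DE"},
    {"id": "Q-hypothetical-2", "name": "A&A Dienstleistungsgesellschaft mbH", "country": "AT"},
]
print(match_by_block(registry, wikidata))
```

The second candidate is never compared because it sits in a different country block, which is the point of blocking: it keeps the number of pairwise string comparisons manageable.
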
> # Task 2:
>
> Add all missing identifiers for the remaining orgs in Handelsregister.
> Task 2 can be rediscussed and decided once Task 1 is sufficiently finished.
>
> # regarding maintenance:
> I find Wikidata as such very hard to maintain as all data is copied from
> somewhere else eventually, but Wikipedia has the same problem. In the case
> of the German Business register, maintenance is especially easy as the orgs
> are stable and uniquely identifiable. Even the fact that a company gets
> shut down should still be in Wikidata, so you have historical information.
> I mean, you also keep the Roman Empire, the Hanse and even finished
> projects in Wikidata. So even if an org ceases to exist, the entry in
> Wikidata should stay.
>
> # regarding Opencorporates
> I have a critical opinion of Opencorporates. It appears to be open, but
> you actually cannot get the data. If somebody has a data dump, please
> forward it to me. Thanks.
> On top of that, I consider Opencorporates a danger to open data. It appears
> to push open availability of data, but then it is limited to open licenses.
> Its usefulness is limited, as there are no free dumps and no possibility to
> duplicate it effectively. Wikipedia and Wikidata provide dumps and an API
> for exactly this reason. Every time somebody wants to create an open
> organisation dataset with no barriers, the existence of Opencorporates is
> blocking this.
>
> Cheers,
> Sebastian
>
>
> On 16.10.2017 15:34, Antonin Delpeuch (lists) wrote:
>
> And… my own count was wrong too, because I forgot to add DISTINCT in my
> query (if there are multiple paths from the class to "organization
> (Q43229)", items will appear multiple times).
>
> So, I get 1 168 084 now. http://tinyurl.com/yaeqlsnl
>
> It's easy to get these things wrong!
>
> Antonin
>
> On 16/10/2017 14:16, Antonin Delpeuch (lists) wrote:
>
> Thanks Ettore for spotting that!
>
> Wikidata types (P31) only make sense when you consider the "subclass of"
> (P279) property that we use to build the ontology (except in a few cases
> where the community has decided not to use any subclass for a particular
> type).
>
> So, to retrieve all items of a certain type in SPARQL, you need to use
> something like this:
>
> ?item wdt:P31/wdt:P279* ?type
>
> You can also have other variants to accept non-truthy statements.
>
> Just with this truthy version, I currently get 1 208 227 items. But note
> that there are still a lot of items where P31 is not provided, or
> subclasses which have not been connected to "organization (Q43229)"…
>
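A hedged sketch of how the property-path pattern above could be turned into a complete counting query and a request URL for the Wikidata Query Service. The function names are made up for illustration, and COUNT(DISTINCT ?item) is used so that an item reachable through more than one class is counted once. The snippet only builds strings; it does not send the request, and a count over a class this large may well time out on the public endpoint.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"


def count_instances_query(class_qid: str) -> str:
    """SPARQL that counts items whose P31 class is class_qid or any P279 subclass of it."""
    return (
        "SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {\n"
        f"  ?item wdt:P31/wdt:P279* wd:{class_qid} .\n"
        "}"
    )


def request_url(query: str) -> str:
    """URL for a GET request against the query service, asking for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


query = count_instances_query("Q43229")  # organization
print(query)
print(request_url(query))
```
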
> So in general, it's very hard to have any "guarantees that there are no
> duplicates", just because you don't have any guarantees that the
> information currently in Wikidata is complete or correct.
>
> I would recommend trying to import something a bit smaller to get
> acquainted with how Wikidata works and what the matching process looks
> like in practice. And beyond a one-off import, as Ettore said it is
> important to think how the data will be maintained in the future…
>
> Antonin
>
> On 16/10/2017 13:46, Ettore RIZZA wrote:
>
>     - Wikidata has 40k organisations:
>
>     https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE
>     %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
>     bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
>
>
> Hi,
>
> I think Wikidata contains many more organizations than that. If we
> choose the "instance of Business enterprise", we get 135570 results. And
> I imagine there are many other categories that bring together commercial
> companies.
>
> https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
>
> On the substance, the project to add all companies of a country would
> make Wikidata a kind of totally free clone of Open Corporates
> <https://opencorporates.com/>. I would of course be delighted to see
> that, but is it not a challenge to maintain such a database? Companies
> are like humans: they appear and disappear every day.
>
>
>
> 2017-10-16 13:41 GMT+02:00 Sebastian Hellmann
> <hellm...@informatik.uni-leipzig.de>:
>
>     Hi all,
>
>     the technical challenges are not so difficult.
>
>     - 2.2 million is the exact number of German organisations, i.e.
>     associations and companies. They are also unique.
>
>     - Wikidata has 40k organisations:
>
>     https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE
>     %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
>     bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
>
>     so there would be a maximum of 40k duplicates. These are easy to find
>     and deduplicate.
>
>     - The crawl can be done easily, a colleague has done so before.
>
>
>     The issues here are:
>
>     - Do you want to upload the data in Wikidata? It would be a really big
>     extension. Can I go ahead?
>
>     - If the data were available externally as structured data under
>     an open license, I would probably not suggest loading it into Wikidata,
>     as the data could be retrieved from the official source directly.
>     However, here this data will not be published in a decent format.
>
>     I thought that the way data is copied from copyrighted sources, i.e.
>     only facts, is OK for Wikidata. This is done in a lot of places, I
>     guess. The same goes for Wikipedia, where news articles and copyrighted
>     books are referenced. So Wikimedia or the Wikimedia community are
>     experts on this.
>
>     All the best,
>
>     Sebastian
>
>
>     On 16.10.2017 10:18, Neubert, Joachim wrote:
>
>     Hi Sebastian,
>
>     This is huge! It will cover almost all currently existing German
>     companies. Many of these will have similar names, so preparing for
>     disambiguation is a concern.
>
>     A good way for such an approach would be proposing a property for
>     an external identifier, loading the data into Mix-n-match,
>     creating links for companies already in Wikidata, and adding the
>     rest (or perhaps only parts of them - I’m not sure if having all
>     of them in Wikidata makes sense, but that’s another discussion),
>     preferably with location and/or sector of trade in the description
>     field.
>
>     I’ve tried to figure out what could be used as key for an external
>     identifier property. However, it looks like the registry does not
>     offer any (persistent) URL to its entries. So for looking up a
>     company, apparently there are two options:
>
>     - conducting an extended search for the exact string “A&A
>       Dienstleistungsgesellschaft mbH”
>     - copying the register number “32853”, selecting the court
>       (Leipzig) from the corresponding dropdown list, and searching
>       for that
>
>     Both ways are not very intuitive, even if we can provide a link to
>     the search form. This would make a weak connection to the source
>     of information. Much more important, it makes disambiguation in
>     Mix-n-match difficult. This applies for the preparation of your
>     initial load (you would not want to create duplicates). But much
>     more so for everybody else who wants to match his or her data
>     later on. Being forced to search for entries manually in a
>     cumbersome way for disambiguation of a new, possibly large and
>     rich dataset is, in my eyes, not something we want to impose on
>     future contributors. And often, the free information they find in
>     the registry (formal name, register number, legal form, address)
>     will not easily match with the information they have (common name,
>     location, perhaps founding date, and most important sector of
>     trade), so disambiguation may still be difficult.
>
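Since the registry offers no persistent URL, the second lookup option above (register number plus court) suggests a compound key. The following is only a sketch under stated assumptions: the class name, the field names and the "court type number" layout are hypothetical, modelled on the "Leipzig HRB 32853" line in the example entry quoted in this thread, and are not an official identifier scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterEntry:
    court: str          # registering court, e.g. "Leipzig"
    register_type: str  # register section, e.g. "HRB"
    number: str         # number assigned by the court, e.g. "32853"

    def key(self) -> str:
        """Compound key; the "court type number" layout is an assumption."""
        return (
            f"{self.court.strip()} "
            f"{self.register_type.strip().upper()} "
            f"{self.number.strip()}"
        )


def parse_key(text: str) -> RegisterEntry:
    """Inverse of RegisterEntry.key(); the court name may itself contain spaces."""
    court, register_type, number = text.strip().rsplit(" ", 2)
    return RegisterEntry(court, register_type, number)


entry = RegisterEntry("Leipzig", "hrb", "32853")
print(entry.key())
print(parse_key("Leipzig HRB 32853") == RegisterEntry("Leipzig", "HRB", "32853"))
```

Splitting from the right keeps multi-word court names intact. Whether a register number is unique within a court and register section is something that would need to be checked against the registry itself.
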
>
>     Have you checked which parts of the accessible information as
>     below can be crawled and added legally to external databases such
>     as Wikidata?
>
>     Cheers, Joachim
>
>     --
>     Joachim Neubert
>
>     ZBW – German National Library of Economics
>     Leibniz Information Centre for Economics
>     Neuer Jungfernstieg 21
>     20354 Hamburg
>     Phone +49-42834-462
>
>     From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org]
>     On Behalf Of Sebastian Hellmann
>     Sent: Sunday, 15 October 2017 09:45
>     To: wikidata@lists.wikimedia.org
>     Subject: [Wikidata] Kickstartet: Adding 2.2 million German
>     organisations to Wikidata
>
>     Hi all,
>
>     the German business registry contains roughly 2.2 million
>     organisations. Some information is paid, but some is public, i.e.
>     the information you get by searching at the URL below and clicking
>     on UT (see example below):
>
>     https://www.handelsregister.de/rp_web/mask.do?Typ=e
>
>     I would like to add this to Wikidata, either by crawling or by
>     raising money to use crowdsourcing platforms like CrowdFlower or
>     Amazon Mechanical Turk.
>
>     It should meet notability criterion 2:
>     https://www.wikidata.org/wiki/Wikidata:Notability
>
>         2. It refers to an instance of a *clearly identifiable
>         conceptual or material entity*. The entity must be notable, in
>         the sense that it *can be described using serious and publicly
>         available references*. If there is no item about you yet, you
>         are probably not notable.
>
>     The reference is the official German business registry, which is
>     serious and public. Orgs are also by definition clearly
>     identifiable legal entities.
>
>     How can I get clearance to proceed on this?
>
>     All the best,
>     Sebastian
>
>
>     Entity data
>
>     Saxony District court Leipzig HRB 32853 – A&A
>     Dienstleistungsgesellschaft mbH
>
>     Legal status:             Gesellschaft mit beschränkter Haftung
>     Capital:                  25.000,00 EUR
>     Date of entry:            29/08/2016
>                               (When entering date of entry, wrong data
>                               input can occur due to system failures!)
>     Date of removal:          -
>     Balance sheet available:  -
>     Address (subject to correction):
>                               A&A Dienstleistungsgesellschaft mbH
>                               Prager Straße 38-40
>                               04317 Leipzig
>
>     --
>     All the best,
>     Sebastian Hellmann
>
>     Director of Knowledge Integration and Linked Data Technologies
>     (KILT) Competence Center
>     at the Institute for Applied Informatics (InfAI) at Leipzig University
>     Executive Director of the DBpedia Association
>     Projects: http://dbpedia.org, http://nlp2rdf.org,
>     http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
>     Homepage: http://aksw.org/SebastianHellmann
>     Research Group: http://aksw.org
>
>
>
>     _______________________________________________
>     Wikidata mailing list
>     Wikidata@lists.wikimedia.org
>     https://lists.wikimedia.org/mailman/listinfo/wikidata
>

