Re: [Wikidata] Comparison of Wikidata, DBpedia, and Freebase (draft and invitation)

2019-11-15 Thread Sebastian Hellmann

Hi Denny, all,

here is the second prototype of the new overarching DBpedia approach:

https://databus.dbpedia.org/vehnem/flexifusion/prefusion/2019.11.01

Datasets are grouped by property, and the DBpedia ontology is used where it 
exists. The data contains all Wikipedia languages mapped via DBpedia, Wikidata 
where mapped, and some properties from DNB, MusicBrainz, and GeoNames.


We normalized the subjects based on the sameAs links, with some quality 
control. Datatypes will be normalized by rules plus machine learning in 
the future.


As soon as we make some adjustments, we can load it into the GFS GUI.

We are also working on an export using Wikidata QIDs and PIDs, so it is 
easier to ingest into Wikidata. More datasets from the LOD cloud will follow.
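
For illustration, here is a minimal sketch (in Python) of what such a QID/PID 
rewrite could look like, assuming an equivalence mapping from DBpedia IRIs to 
Wikidata IRIs (derived from the sameAs links) is available as a two-column TSV. 
The file names and the TSV format are hypothetical, not the actual export 
pipeline:

import bz2
import csv

def load_mapping(path):
    # Hypothetical TSV: DBpedia IRI <tab> Wikidata IRI (entity or property)
    with open(path, newline="", encoding="utf-8") as f:
        return {row[0]: row[1] for row in csv.reader(f, delimiter="\t") if len(row) == 2}

def map_term(term, mapping):
    # Rewrite an IRI term like <http://dbpedia.org/...>; literals and blank nodes pass through
    if term.startswith("<") and term.endswith(">"):
        iri = term[1:-1]
        return "<" + mapping.get(iri, iri) + ">"
    return term

def rewrite_ntriples(in_path, out_path, mapping):
    # Stream a line-based N-Triples dump and rewrite subject, predicate and object
    with bz2.open(in_path, "rt", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.rstrip("\n")
            if not line or line.startswith("#") or not line.endswith(" ."):
                dst.write(line + "\n")
                continue
            s, p, o = line[:-2].split(" ", 2)  # the object may contain spaces (literals)
            dst.write(" ".join([map_term(s, mapping),
                                map_term(p, mapping),
                                map_term(o, mapping)]) + " .\n")

# e.g. rewrite_ntriples("prefusion-slice.nt.bz2", "wikidata-ready.nt",
#                       load_mapping("dbpedia-wikidata-sameas.tsv"))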


All the best,

Sebastian



Re: [Wikidata] Comparison of Wikidata, DBpedia, and Freebase (draft and invitation)

2019-10-03 Thread Sebastian Hellmann

Hi Denny,

here are some initial points:

1. There is also the generic dataset from last month: 
https://databus.dbpedia.org/dbpedia/generic/infobox-properties/2019.08.30 
(we still need to copy the documentation to the Databus). This has the 
highest coverage, but the lowest consistency. English has around 50k parent 
properties, maybe more if you count child inverses and other variants. We 
would need to check the mappings at http://mappings.dbpedia.org, which we 
are doing at the moment anyhow. It could take only an hour to map some 
healthy chunks into the mappings dataset.


curl https://downloads.dbpedia.org/repo/lts/generic/infobox-properties/2019.08.30/infobox-properties_lang=en.ttl.bz2 \
  | bzcat | grep "/parent"


http://temporary.dbpedia.org/temporary/parentrel.nt.bz2

Admittedly this dataset is messy, but it is still quite useful, because you 
can write queries with alternatives (e.g. dbo:position|dbp:position) in a way 
that makes them usable, like this query, which has worked for 13 years:


soccer players who were born in a country with more than 10 million 
inhabitants, who played as goalkeeper for a club that has a stadium with 
more than 30,000 seats, and where the club's country is different from the 
birth country
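
To make the alternatives pattern concrete, here is a rough, simplified sketch 
of such a query against the public DBpedia SPARQL endpoint, written in Python 
with SPARQLWrapper. The class and property choices (dbo:SoccerPlayer, 
dbo:birthPlace, dbo:populationTotal) are illustrative; this is not the exact 
13-year-old query:

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT DISTINCT ?player ?birthCountry WHERE {
  ?player a dbo:SoccerPlayer ;
          dbo:position|dbp:position ?pos ;          # alternatives: mapped or raw infobox data
          dbo:birthPlace ?birthCountry .
  ?birthCountry a dbo:Country ;
                dbo:populationTotal ?pop .
  FILTER(?pop > 10000000)
  FILTER(CONTAINS(LCASE(STR(?pos)), "goalkeeper"))  # matches both resource IRIs and literals
}
LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["player"]["value"], row["birthCountry"]["value"])

The stadium-capacity and club-country parts are left out here; the point is 
only the dbo:|dbp: alternation.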

Maybe we could also evaluate some queries that can be answered by one 
dataset or the other? Can you do the query above in Wikidata?


2. We also have an API to get all references from infoboxes now, as a 
partial result of the GFS project. See point 5 here: 
https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE


3. The dataset above (generic/infobox-properties) is also a good measure of 
the non-adoption of Wikidata in Wikipedia. In total, it has over 500 million 
statements across all languages. Having a statement here means that the data 
comes from an infobox template parameter and that no Wikidata is used. The 
dataset is still extracted in the same way, with the same algorithm, so we 
can check whether it got bigger or smaller. The fact that this extraction 
still works and has a decent size indicates that Wikidata adoption by 
Wikipedians is low.
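
If we want to check whether it got bigger or smaller, a back-of-the-envelope 
comparison could look like this, assuming locally downloaded dumps (the file 
names are hypothetical) and that the release files are line-based N-Triples, 
as the DBpedia .ttl.bz2 dumps generally are:

import bz2

def count_statements(path):
    # One triple per non-empty, non-comment line in an N-Triples dump
    with bz2.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip() and not line.startswith("#"))

old = count_statements("infobox-properties_lang=en_2019.08.30.ttl.bz2")
new = count_statements("infobox-properties_lang=en_2019.11.01.ttl.bz2")
print(f"en: {old:,} -> {new:,} ({new - old:+,})")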


4. I need to look at the parent example in detail. However, I have to say 
that the property lends itself well to the Wikidata approach, since it is 
easily understood, has a sort of truthiness, and is easy to research and add.


I am not sure whether it is representative, as e.g. "employer" is more 
difficult to model (it is time-scoped). For instance, my own data here is 
outdated: https://www.wikidata.org/wiki/Q39429171
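
For comparison, here is a minimal sketch (Python plus SPARQLWrapper, against 
the Wikidata Query Service) of how such a time-scoped statement looks when 
queried with its qualifiers; P108 is "employer", P580/P582 are the start/end 
time qualifiers, and the item is the one linked above:

from SPARQLWrapper import SPARQLWrapper, JSON

wdqs = SPARQLWrapper("https://query.wikidata.org/sparql", agent="employer-example-sketch/0.1")
wdqs.setQuery("""
SELECT ?employerLabel ?start ?end WHERE {
  wd:Q39429171 p:P108 ?stmt .
  ?stmt ps:P108 ?employer .
  OPTIONAL { ?stmt pq:P580 ?start . }   # start time qualifier
  OPTIONAL { ?stmt pq:P582 ?end . }     # end time qualifier
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
""")
wdqs.setReturnFormat(JSON)
for b in wdqs.query().convert()["results"]["bindings"]:
    print(b["employerLabel"]["value"],
          b.get("start", {}).get("value", "?"),
          b.get("end", {}).get("value", "?"))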


Also, I don't yet see how this will become a more systematic approach that 
shows where to optimize, but I still need to read it fully.


We can start with this one, however.

-- Sebastian


Re: [Wikidata] Comparison of Wikidata, DBpedia, and Freebase (draft and invitation)

2019-10-03 Thread hellmann
Hi Marco, 

On October 1, 2019 11:48:02 PM GMT+02:00, Marco Fossati wrote:
>Hi Denny,
>
>Thanks for publishing your Colab notebook!
>I went through it and would like to share my first thoughts here. We can
>then move further discussion somewhere else.
>
>1. in general, how can we compare datasets with totally different time
>stamps? Wikidata is alive, Freebase is dead, and the latest DBpedia dump
>is old;

DBpedia has made monthly releases for the past three months, which will 
continue to improve and grow in an agile manner; so far we have focused on 
debugging and integration. The maximum age would be 30 days, which I think is 
OK. Denny validated against the live endpoint. That is OK to drive growth, but 
it is not scientifically reproducible compared to dumps.




-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [Wikidata] Comparison of Wikidata, DBpedia, and Freebase (draft and invitation)

2019-10-01 Thread Gerard Meijssen
Hoi,
As indicated by the DBpedia people, there are two ways in which data gets
into their latest Fusion offering: there is consensus, where all the available
sources agree, and there is the notion where one source is deemed
authoritative. Remember, DBpedia uses sources outside of the Wikimedia
movement, like national libraries!
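
As a toy illustration of those two modes (not DBpedia's actual FlexiFusion 
code, just the decision rule described above, sketched in Python):

def fuse(values_by_source, authoritative=None):
    # Toy decision rule: accept on consensus, else trust an authoritative source,
    # else flag the disagreement for manual curation.
    distinct = {v for vals in values_by_source.values() for v in vals}
    if len(distinct) == 1:                       # consensus: all sources agree
        return ("consensus", distinct.pop())
    if authoritative in values_by_source:        # one source is deemed authoritative
        return ("authoritative", values_by_source[authoritative])
    return ("dissent", distinct)                 # useful for highlighting likely errors

print(fuse({"enwiki": {"Leipzig"}, "dewiki": {"Leipzig"}, "dnb": {"Leipzig"}}))
# -> ('consensus', 'Leipzig')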

What I miss in your paper is purpose: what is the way forward, and how does
it compare with and improve on current practice? Current practice is that
people import data from anywhere; typically it is single-sourced, if sourced
at all, and it introduces the human error that is inherent in a manual
process. The DBpedia folks have a WMF-sponsored project whereby they
facilitate the inclusion of data into Wikidata. Particularly where there is
consensus (no opposing sources), it is an improvement on current practice
and complements the existing Wikidata content nicely. The content where
there is NO consensus is useful because it enables highlighting where these
errors occur. It will really help in finding false friends.

The Freebase data has been abandoned. It did not get the respect it
deserved, and particularly at the time its quality was better than Wikidata's.
The fact that it is dated IS a saving grace, because Wikidata/Wikipedia is
particularly strong on content related to the period of Wikipedia activity.
My preferred way of treating the Freebase data is fusing it in the Fusion
project. All the data that is new or expands on what is known in Fusion is
of relevance. Given that no maintenance is done on the Freebase data, the
dissenting data can at best be used for curating what is in the WMF projects.

In your paper you support the notion of harvesting based on single sources.
Maybe at a later date. First we need to integrate the uncontroversial data,
the data where there is consensus across multiple projects. The biggest
benefit will be that a lot of make-work is prevented: work done only because
the data just did not get into Wikidata.
Thanks,
GerardM




Re: [Wikidata] Comparison of Wikidata, DBpedia, and Freebase (draft and invitation)

2019-10-01 Thread Marco Fossati

Hi Denny,

Thanks for publishing your Colab notebook!
I went through it and would like to share my first thoughts here. We can 
then move further discussion somewhere else.


1. in general, how can we compare datasets with totally different time 
stamps? Wikidata is alive, Freebase is dead, and the latest DBpedia dump 
is old;
2. given that all datasets contain Wikipedia links, perhaps we could use 
them as a bridge for the comparison, instead of Wikidata mappings (a rough 
sketch of this bridging idea follows after this list). I'm assuming that 
Freebase and DBpedia entities with Wikidata mappings are subsets of the 
whole datasets (but this should be verified);
3. we could use record linkage techniques to connect Wikidata entities 
with Freebase and DBpedia ones, then assess the agreement in terms of 
statements per entity. There has been some experimental work (different 
use case and goal) in the soweego project:

https://soweego.readthedocs.io/en/latest/validator.html
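
On point 2, a minimal bridging sketch (assumptions: the Wikidata Query Service 
endpoint, English sitelinks only, and the fact that English DBpedia resource 
IRIs are derived from the article title):

from SPARQLWrapper import SPARQLWrapper, JSON

wdqs = SPARQLWrapper("https://query.wikidata.org/sparql", agent="bridge-sketch/0.1")
wdqs.setQuery("""
SELECT ?item ?article WHERE {
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> .
  ?item wdt:P31 wd:Q5 .          # humans, just to keep the sample small
}
LIMIT 100
""")
wdqs.setReturnFormat(JSON)
for b in wdqs.query().convert()["results"]["bindings"]:
    title = b["article"]["value"].rsplit("/", 1)[-1]
    print(b["item"]["value"], "http://dbpedia.org/resource/" + title)

For non-English chapters the DBpedia IRI scheme differs, so this only 
illustrates the idea.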


On 10/1/19 1:13 AM, Denny Vrandečić wrote:
Marco, I totally agree with what you said - the project has stalled, and 
there is plenty of opportunity to harvest more data from Freebase and 
bring it to Wikidata, and this should be reignited.

Yeah, that would be great.
There is known work to do, but it's hard to sustain such a big project 
without allocated resources:

https://phabricator.wikimedia.org/maniphest/query/CPiqkafGs5G./#R

BTW, there is also version 2 of the Wikidata primary sources tool that 
needs love, although I'm now skeptical that it will be an effective way 
to achieve the Freebase harvesting.
We should probably rethink the whole thing, and restart small with very 
simple use cases, pretty much like the Harvest templates tool you mentioned:

https://tools.wmflabs.org/pltools/harvesttemplates/

Cheers,

Marco

P.S.: I *might* have found the freshest relevant DBpedia datasets:
https://databus.dbpedia.org/dbpedia/mappings/mappingbased-objects
I said *might* because it was really painful to find a download button 
and to guess among multiple versions of the same dataset:

https://downloads.dbpedia.org/repo/lts/mappings/mappingbased-objects/2019.09.01/mappingbased-objects_lang=en.ttl.bz2
@Sebastian may know if it's the good one :-)



[Wikidata] Comparison of Wikidata, DBpedia, and Freebase (draft and invitation)

2019-09-30 Thread Denny Vrandečić
Hi all,

as promised, now that I am back from my trip, here's my draft of the
comparison of Wikidata, DBpedia, and Freebase.

It is a draft, it is obviously potentially biased given my background,
etc., but I hope that we can work on it together to get it into a good
shape.

Markus, amusingly I took pretty much the same example that you went for,
the parent predicate. So yes, I was also surprised by the results, and
would love to have Sebastian or Kingsley look into it and see if I
conducted it fairly.

SJ, Andra, thanks for offering to take a look. I am sure you all can
contribute your own unique background and make suggestions on how to
improve things and whether the results ring true.

Marco, I totally agree with what you said - the project has stalled, and
there is plenty of opportunity to harvest more data from Freebase and bring
it to Wikidata, and this should be reignited. Sebastian, I also agree with
you, and the numbers do so too; the same is true of the extraction results
from DBpedia.

Sebastian, Kingsley, I tried to describe how I understand DBpedia, and all
steps should be reproducible. As it seems that the two of you also have to
discuss one or the other thing about DBpedia's identity, I am relieved that
my confusion is not entirely unjustified. So I tried to use both the last
stable DBpedia release as well as a new-style DBpedia fusion dataset for
the comparison. But I might have gotten the whole procedure wrong. I am
happy to be corrected.

On Sat, Sep 28, 2019 at 12:28 AM  wrote:

> Meanwhile, Google crawls all the references and extracts facts from there.
> We don't have that available, but there is Linked Open Data.

Potentially, not a bad idea, but we don't do that.

Everyone, this is the first time I share a Colab notebook, and I have no
idea if I did it right. So any feedback of the form "oh you didn't switch
on that bit over here" or "yes, this works, thank you" is very welcome,
because I have no clue what I am doing :) Also, I never did this kind of
analysis so transparently, which is kinda both totally cool and rather
scary, because now you can all see how dumb I am :)

So everyone is invited to send Pull Requests (I guess that's how this
works?), and I would love for us to create a result together that we agree
on. I see the result of this exercise as potentially twofold:

1) a publication we can point people to who ask about the differences
between Wikidata, DBpedia, and Freebase

2) to reignite or start projects and processes to reduce these differences

So, here is the link to my Colab notebook:

https://github.com/vrandezo/colabs/blob/master/Comparing_coverage_and_accuracy_of_DBpedia%2C_Freebase%2C_and_Wikidata_for_the_parent_predicate.ipynb

Ideally, the third goal could be to get to a deeper understanding of how
these three projects relate to each other - from my point of view, Freebase
is dead and outdated, Wikidata is the core knowledge base that anyone can
edit, and DBpedia is the core project for weaving together value-adding
workflows on top of Wikidata or other datasets from the linked open data
cloud. But that's just a proposal.

Cheers,
Denny



On Sat, Sep 28, 2019 at 12:28 AM  wrote:

> Hi Gerard,
>
> I was not trying to judge here. I was just saying that it wasn't much data
> in the end.
> For me Freebase was basically cherry-picked.
>
> Meanwhile, the data we extract is more pertinent to the goal of having
> Wikidata cover the info boxes. We still have ~ 500 million statements left.
> But none of it is used yet. Hopefully we can change that.
>
> Meanwhile, Google crawls all the references and extracts facts from there.
> We don't have that available, but there is Linked Open Data.
>
> --
> Sebastian
>
> On September 27, 2019 5:26:43 PM GMT+02:00, Gerard Meijssen <
> gerard.meijs...@gmail.com> wrote:
>>
>> Hoi,
>> I totally reject the assertion that it was so bad. I have always had the opinion
>> that the main issue was an atrocious user interface. Add to this the people
>> that have Wikipedia notions about quality. They have and had a detrimental
>> effect on both the quantity and quality of Wikidata.
>>
>> When you add the functionality that is being built by the data wranglers
>> at DBpedia, it becomes easy/easier to compare the data from the Wikipedias with
>> Wikidata (and why not Freebase), add what has consensus, and curate the
>> differences. This will enable a true, data-based sense of quality and allow us to
>> provide a much improved service.
>> Thanks,
>>   GerardM
>>
>> On Fri, 27 Sep 2019 at 15:54, Marco Fossati 
>> wrote:
>>
>>> Hey Sebastian,
>>>
>>> On 9/20/19 10:22 AM, Sebastian Hellmann wrote:
>>> > Not much of Freebase did end up in Wikidata.
>>>
>>> Dropping here some pointers to shed light on the migration of Freebase
>>> to Wikidata, since I was partially involved in the process:
>>> 1. WikiProject [1];
>>> 2. the paper behind [2];
>>> 3. datasets to be migrated [3].
>>>
>>> I can confirm that the migration has stalled: as of today, *528
>>> thousands*