Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)

2015-10-01 Thread Kingsley Idehen
On 9/28/15 2:36 PM, Paul Houle wrote:
> Anyhow, there is this funny little thing: the gap between "5 cents"
> and free is bigger than the gap between "5 cents" and $1000, so you have
> the Bloombergs and Elseviers of the world charging $1000 for what somebody
> could provide for much less. This problem exists for the human-readable
> web, and so far advertising has been the answer, but it has not been
> solved for open data.

There is a solution for Open Data; the trouble is that attention is
increasingly mercurial.

You need Identity [1], Tickets [2],  and ACLs [3].

All doable using existing Web Architecture.


Links:

[1] http://linkeddata.uriburner.com/c/9G36GVL -- About WebID (Identity)
[2] http://linkeddata.uriburner.com/c/9DV22GPS -- About Tickets
[3] http://linkeddata.uriburner.com/c/9DFX6GKO -- Attribute-Based Access
Controls (ABAC)
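
For illustration, a minimal sketch of how those three pieces could fit
together (Python, purely illustrative; the class and attribute names below
are invented rather than taken from any particular library): a WebID says
who the caller is, a ticket carries the attributes that identity has been
granted, and an attribute-based policy decides whether a request is allowed.

    from dataclasses import dataclass, field
    from typing import Callable, Dict

    @dataclass
    class Ticket:
        # Identity the ticket was issued to (a WebID URI) plus granted attributes.
        webid: str
        attributes: Dict[str, object] = field(default_factory=dict)

    @dataclass
    class Policy:
        # Maps a resource to a predicate over ticket attributes (an ABAC rule).
        rules: Dict[str, Callable[[Dict[str, object]], bool]]

        def allows(self, ticket: Ticket, resource: str) -> bool:
            rule = self.rules.get(resource)
            return bool(rule and rule(ticket.attributes))

    # Example rule: the data endpoint is open to anyone on a paid plan with quota left.
    policy = Policy(rules={
        "/sparql": lambda attrs: attrs.get("plan") == "paid" and attrs.get("quota", 0) > 0,
    })

    ticket = Ticket(webid="https://example.org/profile#me",
                    attributes={"plan": "paid", "quota": 500})
    print(policy.allows(ticket, "/sparql"))  # True

A real deployment would also authenticate the WebID (e.g. via its TLS client
certificate) and sign the ticket; the sketch only shows the decision step.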


-- 
Regards,

Kingsley Idehen   
Founder & CEO 
OpenLink Software 
Company Web: http://www.openlinksw.com
Personal Weblog 1: http://kidehen.blogspot.com
Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
Twitter Profile: https://twitter.com/kidehen
Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen
Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this






Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)

2015-09-29 Thread Federico Leva (Nemo)

Denny Vrandečić, 28/09/2015 23:27:

Actually, my suggestion would be to switch on Primary Sources as a
default tool for everyone.


Yes, it's a desirable aim to have one-click suggested actions (à la
Wikidata game) embedded into items for everyone. As for this tool,
regardless of the data used, at least slowness and misleading
messaging need to be fixed first:
https://www.wikidata.org/wiki/Wikidata_talk:Primary_sources_tool


(Compare: we already have very easy "remove" buttons on all statements 
on all items. So the interface for large-scale easy correction of 
mistakes is already there, while for *insertion* it's still missing. 
Which is also the gist of Gerard's argument, I believe. I agree with 
Lydia we can eventually do both, of course.)


Nemo



Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)

2015-09-29 Thread Federico Leva (Nemo)

Thomas Steiner, 28/09/2015 23:32:

Note: as far as I can tell, the stats available at
https://tools.wmflabs.org/wikidata-primary-sources/status.html so far
do not differentiate between "fact wrong" (as in "Barack Obama is
president of Croatia" [fact wrong]) and "source wrong" ("Barack Obama
is president of the United States", "according to
http://www.theonion.com/" [fact correct, source wrong]).


Indeed. I only briefly tested "primary sources" because it's
frustratingly slow, but the statements I rejected were not wrong, just
ugly: for instance, redundant references where we already had some. I'd
dare call them formatting issues, which a bot can certainly filter.
But maybe I was lucky!


Nemo



Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)

2015-09-29 Thread John Erling Blad
Yes! +1

On Mon, Sep 28, 2015 at 11:27 PM, Denny Vrandečić 
wrote:

> Actually, my suggestion would be to switch on Primary Sources as a default
> tool for everyone. That should increase exposure and turnover, without
> compromising quality of data.
>
>
>
> On Mon, Sep 28, 2015 at 2:23 PM Denny Vrandečić 
> wrote:
>
>> Hi Gerard,
>>
>> given the statistics you cite from
>>
>> https://tools.wmflabs.org/wikidata-primary-sources/status.html
>>
>> I see that 19.6k statements have been approved through the tool, and 5.1k
>> statements have been rejected - which means that about 1 in 5 statements is
>> deemed unsuitable by the users of primary sources.
>>
>> Given that there are 12.4M statements in the tool, this means that about
>> 2.5M statements will turn out to be unsuitable for inclusion in Wikidata
>> (if the current ratio holds). Are you suggesting to upload all of these
>> statements to Wikidata?
>>
>> Tpt already did upload pieces of the data which have sufficient quality
>> outside the primary sources tool, and more is planned. But for the data
>> where the suitability for Wikidata seems questionable, I would not know
>> what other approach to use. Do you have a suggestion?
>>
>> Once you have a suggestion and there is community consensus in doing it,
>> no one will stand in the way of implementing that suggestion.
>>
>> Cheers,
>> Denny
>>
>>
>> On Mon, Sep 28, 2015 at 1:19 PM John Erling Blad 
>> wrote:
>>
>>> Another; make a kind of worklist on Wikidata that reflect the watchlist
>>> on the clients (Wikipedias) but then, we often have items on our watchlist
>>> that we don't know much about. (Digression: Somehow we should be able to
>>> sort out those things we know (the place we live, the persons we have meet)
>>> from those things we have done (edited, copy-pasted).)
>>>
>>> I been trying to get some interest in the past for worklists on
>>> Wikipedia, it isn't much interest to make them. It would speed up tedious
>>> tasks of finding the next page to edit after a given edit is completed. It
>>> is the same problem with imports from Freebase on Wikidata, locate the next
>>> item on Wikidata with the same queued statement from Freebase, but within
>>> some worklist that the user has some knowledge about.
>>>
>>> Imagine "municipalities within a county" or "municipalities that is also
>>> on the users watchlist", and combine that with available unhandled
>>> Freebase-statements.
>>>
>>> On Mon, Sep 28, 2015 at 10:09 PM, John Erling Blad 
>>> wrote:
>>>
 Could it be possible to create some kind of info (notification?) in a
 wikipedia article that additional data is available in a queue ("freebase")
 somewhere?

 If you have the article on your watch-list, then you will get a warning
 that says "You lazy boy, get your ass over here and help us out!" Or
 perhaps slightly rephrased.

 On Mon, Sep 28, 2015 at 4:52 PM, Markus Krötzsch <
 mar...@semantic-mediawiki.org> wrote:

> Hi Gerard, hi all,
>
> The key misunderstanding here is that the main issue with the Freebase
> import would be data quality. It is actually community support. The goal 
> of
> the current slow import process is for the Wikidata community to "adopt"
> the Freebase data. It's not about "storing" the data somewhere, but about
> finding a way to maintain it in the future.
>
> The import statistics show that Wikidata does not currently have
> enough community power for a quick import. This is regrettable, but not
> something that we can fix by dumping in more data that will then be
> orphaned.
>
> Freebase people: this is not a small amount of data for our young
> community. We really need your help to digest this huge amount of data! I
> am absolutely convinced from the emails I saw here that none of the former
> Freebase editors on this list would support low quality standards. They
> have fought hard to fix errors and avoid issues coming into their data for
> a long time.
>
> Nobody believes that either Freebase or Wikidata can ever be free of
> errors, and this is really not the point of this discussion at all [1]. 
> The
> experienced community managers among us know that it is not about the
> amount of data you have. Data is cheap and easy to get, even free data 
> with
> very high quality. But the value proposition of Wikidata is not that it 
> can
> provide storage space for lot of data -- it is that we have a functioning
> community that can maintain it. For the Freebase data donation, we do not
> seem to have this community yet. We need to find a way to engage people to
> do this. Ideas are welcome.
>
> What I can see from the statistics, however, is that some users (and I
> cannot say if they are "Freebase users" or "Wikidata users" ;-) are 
> putting
> 

Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)

2015-09-29 Thread Gerard Meijssen
Hoi,
I have seen the statistics. The quality of Freebase cannot be understood by
simply looking at the problems. People have been looking for problems and
identifying them. As a consequence, more data ended up in the error bucket
than in the good bucket. I have, for instance, marked a lot of statements
as "wrong" because they were exactly the same as the value already present.
Consequently, the error rate is not representative.

Denny, I have a suggestion. It is backed by math, it is backed by how
people think. All the arguments are on my side. I have not heard your
arguments and the "primary sources tool" was announced as a good thing and
the community never agreed to having it. So leave the community out of it
and focus on arguments.

   - why would someone work on data in the primary sources tool when it is
   more effective to add data directly
   - why is data that is over 90% good denied access to Wikidata (i.e. as
   good as Wikidata itself)
   - how do you justify the primary sources tool when so little data was
   included in Wikidata
   - why not have Kian learn from the data set of Freebase and Wikidata and
   have smart suggestions
   - why waste people's time adding one item/statement at a time when you
   can focus on the statements that are in doubt (either in Freebase or in
   Wikidata)

If the notion of having all new data go through the primary sources tool is
ever realised, it will see me leave the project. I will feel that my time
and intelligence are wasted.

Thanks,

  GerardM

On 28 September 2015 at 22:54, Denny Vrandečić  wrote:

> Hi Gerard,
>
> given the statistics you cite from
>
> https://tools.wmflabs.org/wikidata-primary-sources/status.html
>
> I see that 19.6k statements have been approved through the tool, and 5.1k
> statements have been rejected - which means that about 1 in 5 statements is
> deemed unsuitable by the users of primary sources.
>
> Given that there are 12.4M statements in the tool, this means that about
> 2.5M statements will turn out to be unsuitable for inclusion in Wikidata
> (if the current ratio holds). Are you suggesting to upload all of these
> statements to Wikidata?
>
> Tpt already did upload pieces of the data which have sufficient quality
> outside the primary sources tool, and more is planned. But for the data
> where the suitability for Wikidata seems questionable, I would not know
> what other approach to use. Do you have a suggestion?
>
> Once you have a suggestion and there is community consensus in doing it,
> no one will stand in the way of implementing that suggestion.
>
> Cheers,
> Denny
>
>
> On Mon, Sep 28, 2015 at 1:19 PM John Erling Blad  wrote:
>
>> Another; make a kind of worklist on Wikidata that reflect the watchlist
>> on the clients (Wikipedias) but then, we often have items on our watchlist
>> that we don't know much about. (Digression: Somehow we should be able to
>> sort out those things we know (the place we live, the persons we have meet)
>> from those things we have done (edited, copy-pasted).)
>>
>> I been trying to get some interest in the past for worklists on
>> Wikipedia, it isn't much interest to make them. It would speed up tedious
>> tasks of finding the next page to edit after a given edit is completed. It
>> is the same problem with imports from Freebase on Wikidata, locate the next
>> item on Wikidata with the same queued statement from Freebase, but within
>> some worklist that the user has some knowledge about.
>>
>> Imagine "municipalities within a county" or "municipalities that is also
>> on the users watchlist", and combine that with available unhandled
>> Freebase-statements.
>>
>> On Mon, Sep 28, 2015 at 10:09 PM, John Erling Blad 
>> wrote:
>>
>>> Could it be possible to create some kind of info (notification?) in a
>>> wikipedia article that additional data is available in a queue ("freebase")
>>> somewhere?
>>>
>>> If you have the article on your watch-list, then you will get a warning
>>> that says "You lazy boy, get your ass over here and help us out!" Or
>>> perhaps slightly rephrased.
>>>
>>> On Mon, Sep 28, 2015 at 4:52 PM, Markus Krötzsch <
>>> mar...@semantic-mediawiki.org> wrote:
>>>
 Hi Gerard, hi all,

 The key misunderstanding here is that the main issue with the Freebase
 import would be data quality. It is actually community support. The goal of
 the current slow import process is for the Wikidata community to "adopt"
 the Freebase data. It's not about "storing" the data somewhere, but about
 finding a way to maintain it in the future.

 The import statistics show that Wikidata does not currently have enough
 community power for a quick import. This is regrettable, but not something
 that we can fix by dumping in more data that will then be orphaned.

 Freebase people: this is not a small amount of data for our young
 community. We really need your help to digest this huge amount of data! I
 

Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)

2015-09-29 Thread Denny Vrandečić
Tpt did take a few datasets of high enough quality from the Freebase dataset
and uploaded them directly. These numbers do not appear in the Primary
Sources tool, because they were uploaded directly - each set going through
the normal community process.

The Primary Sources Tool is left with the datasets where we were not able
to establish a high enough threshold of quality. For any dataset where this
quality can be demonstrated to the community, I assume they will agree with
a direct upload.

I am not sure what else to do here.

I am very thankful to Nemo for his rephrasing of the discussion and for
pulling it to a constructive and actionable level.




Gerard, regarding your arguments:

   - why would someone work on data in the primary sources tool when it is
   more effective to add data directly

Can you explain what you mean by "add data directly"? I am really not
sure what you mean by this argument. Are you suggesting uploading the
whole dataset without further review?

   - why is data that is over 90% good denied access to Wikidata (i.e. as
   good as Wikidata itself)

But it is not over 90% good! We have a rejection rate of almost 20%. Also,
10% errors means more than 1 million errors. I have yet to see consensus to
upload this.
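
Spelled out as a quick back-of-the-envelope check (the 19.6k / 5.1k / 12.4M
figures are the ones cited above from the status page):

    approved, rejected = 19_600, 5_100
    total_in_tool = 12_400_000

    rejection_rate = rejected / (approved + rejected)      # ~0.21, "almost 20%"
    projected_unsuitable = total_in_tool * rejection_rate  # ~2.6 million statements
    errors_at_10_percent = total_in_tool * 0.10            # 1.24 million, i.e. >1 million

    print(f"rejection rate: {rejection_rate:.1%}")
    print(f"projected unsuitable statements: {projected_unsuitable:,.0f}")
    print(f"errors even at 10%: {errors_at_10_percent:,.0f}")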

   - how do you justify the primary sources tool when so little data was
   included in Wikidata

The tool has been used to add thousands of statements and references to
Wikidata, and that by a rather small set of people (because you need to
install it intentionally). I would think that if we switch it on by
default, the throughput should grow considerably. Nemo identified a few
issues with that, and it would be good if we worked on these. Everyone
is invited to help out with that.

   - why not have Kian learn from the data set of Freebase and Wikidata and
   have smart suggestions

Kian is free to learn from the datasets. The Freebase data has been
available for years, and Kian would be far from the first ML tool to use
it for training purposes. If there is anything hindering Kian from using
the Freebase data, let me know and I will try to fix it.

   - why waste people's time adding one item/statement at a time when you
   can focus on the statements that are in doubt (either in Freebase or in
   Wikidata)

Because we don't know which ones are which. If you could tell me which of
the 12 Million statements are good and which ones are not, and if there is
consensus about that assessment, I'd be happy to upload them.

I hope that this answers your arguments.

Again, I do not understand what your proposal is. I am going through the
process to release the data in an easy to use way. If the community agrees
with that, it can then be directly imported to Wikidata - I certainly won't
stop anyone from doing so and never had.

My feeling is that you are frustrated by what you perceive as slow
progress. You keep yelling at people that their ideas and work are not
good. I remember how much you attacked me about Wikidata and all the things
I have been doing wrong about it. Gerard, if you think you are motivating
me with your constant attacks, I have to tell you, you are not. I am not
speaking for anyone else, but I am getting tired of this. I appreciate a
critical voice, but not in the tone in which you often deliver it.

So, instead of telling everyone how we are supposed to spend our volunteer
time in order to get things done better, and how we are doing things wrong,
why don't you lead by example, and do it right? All the data, all the
tools, for anything you want to get done are available to you for free. It
is a pretty amazing world - all you need is a click away. So go ahead and
do what you want to get done.







On Tue, Sep 29, 2015 at 1:07 AM Federico Leva (Nemo) 
wrote:

> Denny Vrandečić, 28/09/2015 23:27:
> > Actually, my suggestion would be to switch on Primary Sources as a
> > default tool for everyone.
>
> Yes, it's a desirable aim to have one-click suggested actions (à la
> Wikidata game) embedded into items for everyone. As for this tool,
> unrelatedly from the data used, at least slowness and misleading
> messaging need to be fixed first:
> https://www.wikidata.org/wiki/Wikidata_talk:Primary_sources_tool
>
> (Compare: we already have very easy "remove" buttons on all statements
> on all items. So the interface for large-scale easy correction of
> mistakes is already there, while for *insertion* it's still missing.
> Which is also the gist of Gerard's argument, I believe. I agree with
> Lydia we can eventually do both, of course.)
>
> Nemo
>


Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)

2015-09-29 Thread Tom Morris
Thanks for creating a dedicated thread, Markus.  It saddens me to see this
opportunity squandered and I'd love to be able to help, but I find the
project so opaque that it's difficult to find a way to engage.  Perhaps
it's just an artifact of the lack of transparency, but the current approach
seems very ad hoc to me.  It's difficult to tease apart which problems are
due to bad Freebase data, which are due to the way the Freebase data is
being processed for import, and which are due to the attitudes of the
reviewers.

As Jason Douglas said on the other thread, the Freebase data isn't
homogeneous in terms of quality or importance, and the appropriate way to
evaluate and import the data is by segmenting it, whether that be by
property, or data source, or whatever.  The only analysis that seems to
have been done so far is to rank properties by the number of values they
have, which: a) isn't a good proxy for quality and b) isn't even a good
proxy for importance (there are a bunch of high-frequency things which are
basically dead/obsolete).
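
To make that concrete, the kind of segmented view this suggests could look
something like the sketch below (Python; the input file and its columns are
assumptions, since as far as I can tell the status page does not export
per-review records today):

    import csv
    from collections import defaultdict

    # Hypothetical input: one row per reviewed statement, with the Wikidata
    # property it maps to and the reviewer's verdict ("approved" or "rejected").
    counts = defaultdict(lambda: {"approved": 0, "rejected": 0})

    with open("primary_sources_reviews.csv", newline="") as f:
        for row in csv.DictReader(f):  # assumed columns: property, verdict
            counts[row["property"]][row["verdict"]] += 1

    def rejection_rate(c):
        total = c["approved"] + c["rejected"]
        return c["rejected"] / total if total else 0.0

    # Rank properties by rejection rate rather than by raw number of values.
    for prop, c in sorted(counts.items(), key=lambda kv: rejection_rate(kv[1]),
                          reverse=True):
        print(f"{prop}\t{c['approved'] + c['rejected']}\t{rejection_rate(c):.1%} rejected")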

The two things that I think would greatly improve things are:
- document the current process & methodology
- adopt a systematic, iterative, evaluation and improvement feedback loop

Since data is what drives this whole process, understanding how the existing
data has been evaluated, filtered, transformed, etc. before being loaded
into the primary sources tool is critical to understanding what the
starting basis is.  After that, understanding the meaning of the stats (and
fixing them if they don't have the right meanings) is necessary to know how
things need to be improved.

I'm having a hard time understanding the existing stats as well as
correlating them with both people's anecdotal accounts and my understanding
of the strengths and weaknesses of the Freebase data.  Additionally, the
stats represent, as I understand it, a single user's opinion of the quality
of the fact, the property mapping, the source URL and probably other
factors like their mood, how hungry they are, etc.  It's going to include
both false negatives and false positives.

When I look at one recent "approved" Freebase primary sources fact, I see
that it was reverted the next day as a duplicate, but I also see that
Maryse Condé's occupation (P106) has a long and tortured history on
Wikidata, with Dexbot importing "Woman of letters" from Italian Wikipedia,
Brackibot switching it to "Author," then Rezabot, and a few more users all
taking a shot at changing it to what they thought was best.

My gut feeling is that the bulk of the problems people are complaining
about in the Freebase-derived data that's been loaded into the Primary
Sources tool are due to the tool chain that's preparing the data, but
without better stats and insight into the processes it's really impossible
to say.  A systematic analysis is needed, not a bunch of recitations of
anecdotes.

Tom

On Mon, Sep 28, 2015 at 10:52 AM, Markus Krötzsch <
mar...@semantic-mediawiki.org> wrote:

> Hi Gerard, hi all,
>
> The key misunderstanding here is that the main issue with the Freebase
> import would be data quality. It is actually community support. The goal of
> the current slow import process is for the Wikidata community to "adopt"
> the Freebase data. It's not about "storing" the data somewhere, but about
> finding a way to maintain it in the future.
>
> The import statistics show that Wikidata does not currently have enough
> community power for a quick import. This is regrettable, but not something
> that we can fix by dumping in more data that will then be orphaned.
>
> Freebase people: this is not a small amount of data for our young
> community. We really need your help to digest this huge amount of data! I
> am absolutely convinced from the emails I saw here that none of the former
> Freebase editors on this list would support low quality standards. They
> have fought hard to fix errors and avoid issues coming into their data for
> a long time.
>
> Nobody believes that either Freebase or Wikidata can ever be free of
> errors, and this is really not the point of this discussion at all [1]. The
> experienced community managers among us know that it is not about the
> amount of data you have. Data is cheap and easy to get, even free data with
> very high quality. But the value proposition of Wikidata is not that it can
> provide storage space for lot of data -- it is that we have a functioning
> community that can maintain it. For the Freebase data donation, we do not
> seem to have this community yet. We need to find a way to engage people to
> do this. Ideas are welcome.
>
> What I can see from the statistics, however, is that some users (and I
> cannot say if they are "Freebase users" or "Wikidata users" ;-) are putting
> a lot of effort into integrating the data already. This is great, and we
> should thank these people because they are the ones who are now working 

Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)

2015-09-28 Thread Thomas Tanon
Hi!

Thank you Thad for your support!

First some pieces of news about the current progress:

The work on Primary Sources and the Freebase mapping is currently on hold since 
the last day of my Google internship (in late August). We have already a lot 
(13.7M) statements in the Primary Sources tool and I think that we should maybe 
try to make Wikidata adopt them before creating some other ones. 

Some answers:

> First ... it looks like you REALLY need my help to finish the Freebase 
> mapping ? Hardly anything looks done...and I have the time and knowledge to 
> fill it all in completely...  
> https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase/Class_mapping

This page is an attempt to map Freebase types to Wikidata classes. But it
seems to me that it won't lead to any big addition of new good statements:
the class hierarchy of Wikidata is very different from the Freebase type
hierarchy, which makes the mapping difficult. I have already done something
for people by creating a file with the Qids of Wikidata items mapped to
/people/person but without P31 Q5. Something like half of these were not, in
fact, items about a person (a rough estimate), so I decided not to add these
data to Primary Sources. But I have given this file to Magnus, who has
imported them into his "person" game (thank you Magnus :-)).

> It looks like TPT had another page where the WD Properties were being mapped 
> to Freebase here: 
> https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase/Mapping Do you 
> need help in filling that out more ?

I believe that the top properties are now mapped (we have 360 properties
mapped). For example, if I take the dataset of facts tagged as reviewed in
the dump [1] that have a mapped topic as subject, I am able to map 92% of
them to Wikidata claims. So, if you have time to improve the mapping it
would be a very nice task, but I don't think it'll be the most rewarding
one. I believe that a task to improve the mapping between Freebase topics
and Wikidata items will lead to far more additions (the mapping used to
create the current content of the Primary Sources tool has only 4.56M
connections).

>  This is great, and we should thank these people because they are the ones 
> who are now working on what we are just talking about here. In addition, we 
> should think about ways of engaging more community in this. Some ideas:

Thank you very much for all these ideas. I am currently working on these two
fronts in order to move forward with the importation of the already mapped
statements:

1. Import some "good" datasets using my bot. I have already done it for the 
"simple" facts about humans (birth date, birth place...) that are tagged as 
reviewed in the Freebase dump [1]. I have created a wiki page to coordinate 
this work: 
https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase/Good_datasets

2. Optimize the Primary Sources tool in order to make it more usable. I have
done some work to decrease the load time, and my aim now is to avoid
unneeded page reloads.

Cheers,

Thomas

[1] See http://www.freebase.com/freebase/valuenotation/is_reviewed


> On 28 Sept. 2015 at 21:36, Markus Krötzsch  wrote:
> 
> Gerard,
> 
> Why do you spend so much energy on criticising the work of other volunteers 
> and companies that want to help Wikidata? Switching off Primary Sources would 
> not achieve any progress towards what you want. I have made some proposals in 
> my email on what else could be done to speed things up. You could work on 
> realising some of these ideas, you could propose other activities to the 
> community, or you could just help elsewhere on Wikidata. Focussing on a tool 
> you don't like and don't want to use will not make you (or the rest of us) 
> happy.
> 
> Markus
> 
> 
> On 28.09.2015 20:01, Gerard Meijssen wrote:
>> Hoi,
>> 
>> Sorry I disagree with your analysis. The fundamental issue is not
>> quality and it is not the size of our community. The issue is that we
>> have our priorities wrong. As far as I am concerned the "primary sources
>> tool" is a wrong approach for a dataset like Freebase or DBpedia.
>> 
>> What we should concentrate on is find likely issues that exist in
>> Wikidata. Make people aware of them and have a proper workflow that will
>> point people to the things they care about. When I care about "polders"
>> show me content where another source disagrees with what we have. As I
>> care about "polders" I will spend time on it BECAUSE I care and am
>> invited to resolve issues. I will be challenged because every item I
>> touch has an issue. I do not mind to do this when the data in Wikidata
>> differs from DBpedia, Freebase or whatever.. My time is well spend. THAT
>> is why I will be challenged, that is why I will be willing to work on this.
>> 
>> I will not do this for new data in the primary sources tool. At most I
>> will give it a glance and accept it. I would only do this 

Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)

2015-09-28 Thread John Erling Blad
Would it be possible to create some kind of info (notification?) in a
Wikipedia article that additional data is available in a queue ("freebase")
somewhere?

If you have the article on your watch-list, then you will get a warning
that says "You lazy boy, get your ass over here and help us out!" Or
perhaps slightly rephrased.

On Mon, Sep 28, 2015 at 4:52 PM, Markus Krötzsch <
mar...@semantic-mediawiki.org> wrote:

> Hi Gerard, hi all,
>
> The key misunderstanding here is that the main issue with the Freebase
> import would be data quality. It is actually community support. The goal of
> the current slow import process is for the Wikidata community to "adopt"
> the Freebase data. It's not about "storing" the data somewhere, but about
> finding a way to maintain it in the future.
>
> The import statistics show that Wikidata does not currently have enough
> community power for a quick import. This is regrettable, but not something
> that we can fix by dumping in more data that will then be orphaned.
>
> Freebase people: this is not a small amount of data for our young
> community. We really need your help to digest this huge amount of data! I
> am absolutely convinced from the emails I saw here that none of the former
> Freebase editors on this list would support low quality standards. They
> have fought hard to fix errors and avoid issues coming into their data for
> a long time.
>
> Nobody believes that either Freebase or Wikidata can ever be free of
> errors, and this is really not the point of this discussion at all [1]. The
> experienced community managers among us know that it is not about the
> amount of data you have. Data is cheap and easy to get, even free data with
> very high quality. But the value proposition of Wikidata is not that it can
> provide storage space for lot of data -- it is that we have a functioning
> community that can maintain it. For the Freebase data donation, we do not
> seem to have this community yet. We need to find a way to engage people to
> do this. Ideas are welcome.
>
> What I can see from the statistics, however, is that some users (and I
> cannot say if they are "Freebase users" or "Wikidata users" ;-) are putting
> a lot of effort into integrating the data already. This is great, and we
> should thank these people because they are the ones who are now working on
> what we are just talking about here. In addition, we should think about
> ways of engaging more community in this. Some ideas:
>
> (1) Find a way to clean and import some statements using bots. Maybe there
> are cases where Freebase already had a working import infrastructure that
> could be migrated to Wikidata? This would also solve the community support
> problem in one way. We just need to import the maintenance infrastructure
> together with the data.
>
> (2) Find a way to expose specific suggestions to more people. The Wikidata
> Games have attracted so many contributions. Could some of the Freebase data
> be solved in this way, with a dedicated UI?
>
> (3) Organise Freebase edit-a-thons where people come together to work
> through a bunch of suggested statements.
>
> (4) Form wiki projects that discuss a particular topic domain in Freebase
> and how it could be imported faster using (1)-(3) or any other idea.
>
> (5) Connect to existing Wiki projects to make them aware of valuable data
> they might take from Freebase.
>
> Freebase is a much better resource than many other data resources we are
> already using with similar approaches as (1)-(5) above, and yet it seems
> many people are waiting for Google alone to come up with a solution.
>
> Cheers,
>
> Markus
>
> [1] Gerard, if you think otherwise, please let us know which error rates
> you think are typical or acceptable for Freebase and Wikidata,
> respectively. Without giving actual numbers you just produce empty strawman
> arguments (for example: claiming that anyone would think that Wikidata is
> better quality than Freebase and then refuting this point, which nobody is
> trying to make). See https://en.wikipedia.org/wiki/Straw_man
>
>
> On 26.09.2015 18:31, Gerard Meijssen wrote:
>
>> Hoi,
>> When you analyse the statistics, it shows how bad the current state of
>> affairs is. Slightly over one in a thousanths of the content of the
>> primary sources tool has been included.
>>
>> Markus, Lydia and myself agree that the content of Freebase may be
>> improved. Where we differ is that the same can be said for Wikidata. It
>> is not much better and by including the data from Freebase we have a
>> much improved coverage of facts. The same can be said for the content of
>> DBpedia probably other sources as well.
>>
>> I seriously hate this procrastination and the denial of the efforts of
>> others. It is one type of discrimination that is utterly deplorable.
>>
>> We should concentrate on comparing Wikidata with other sources that are
>> maintained. We should do this repeatedly and concentrate on workflows
>> that seek the differences and provide 

Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)

2015-09-28 Thread Markus Krötzsch

Hi Thad,

thanks for your support. I think this can be really useful. Now just to 
clarify: I am not developing or maintaining the Primary Sources tool, I 
just want to see more Freebase data being migrated :-) I think making 
the mapping more complete is clearly necessary and valuable, but maybe 
someone with more insights into the current progress on that level can 
make a more insightful comment.


Markus


On 28.09.2015 20:44, Thad Guidry wrote:

Markus, Lydia...

It looks like TPT had another page where the WD Properties were being
mapped to Freebase here:
https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase/Mapping

Do you need help in filling that out more ?

Thad
+ThadGuidry 





Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)

2015-09-28 Thread Thad Guidry
First ... it looks like you REALLY need my help to finish the Freebase
mapping? Hardly anything looks done... and I have the time and knowledge to
fill it all in completely...
https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase/Class_mapping

Markus, do you want me to start on that? Probably take me this week to
fill it out.

Thad
+ThadGuidry 


Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)

2015-09-28 Thread Thad Guidry
Markus, Lydia...

It looks like TPT had another page where the WD Properties were being
mapped to Freebase here:
https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase/Mapping

Do you need help in filling that out more ?

Thad
+ThadGuidry 


Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)

2015-09-28 Thread Gerard Meijssen
Hoi,

Sorry, I disagree with your analysis. The fundamental issue is not quality
and it is not the size of our community. The issue is that we have our
priorities wrong. As far as I am concerned, the "primary sources tool" is
the wrong approach for a dataset like Freebase or DBpedia.

What we should concentrate on is finding likely issues that exist in
Wikidata. Make people aware of them and have a proper workflow that will
point people to the things they care about. When I care about "polders",
show me content where another source disagrees with what we have. As I care
about "polders" I will spend time on it BECAUSE I care and am invited to
resolve issues. I will be challenged because every item I touch has an
issue. I do not mind doing this when the data in Wikidata differs from
DBpedia, Freebase or whatever. My time is well spent. THAT is why I will be
challenged, that is why I will be willing to work on this.

I will not do this for new data in the primary sources tool. At most I will
give it a glance and accept it. I would only do this where data in the
primary sources tool differs. That, however, is exactly the same scenario
that I just described.

I am not willing to look at data from Wikidata, Freebase or DBpedia in the
primary sources tool one item/statement at a time; we know that they are of
a similar quality to Wikidata. The percentages make it a waste of time.
With iterative comparisons of other sources we will find the booboos easily
enough. We will spend the time of our communities effectively and we will
increase quality and community.

The approach of the primary sources tool is wrong. It should only be about
linking data and defining how this is done.

The problem is indeed with the community. Its time is wasted and it is much
more effective for me to add new data than to work on data that is already
in the primary sources tool.
Thanks,
   GerardM

On 28 September 2015 at 16:52, Markus Krötzsch <
mar...@semantic-mediawiki.org> wrote:

> Hi Gerard, hi all,
>
> The key misunderstanding here is that the main issue with the Freebase
> import would be data quality. It is actually community support. The goal of
> the current slow import process is for the Wikidata community to "adopt"
> the Freebase data. It's not about "storing" the data somewhere, but about
> finding a way to maintain it in the future.
>
> The import statistics show that Wikidata does not currently have enough
> community power for a quick import. This is regrettable, but not something
> that we can fix by dumping in more data that will then be orphaned.
>
> Freebase people: this is not a small amount of data for our young
> community. We really need your help to digest this huge amount of data! I
> am absolutely convinced from the emails I saw here that none of the former
> Freebase editors on this list would support low quality standards. They
> have fought hard to fix errors and avoid issues coming into their data for
> a long time.
>
> Nobody believes that either Freebase or Wikidata can ever be free of
> errors, and this is really not the point of this discussion at all [1]. The
> experienced community managers among us know that it is not about the
> amount of data you have. Data is cheap and easy to get, even free data with
> very high quality. But the value proposition of Wikidata is not that it can
> provide storage space for lot of data -- it is that we have a functioning
> community that can maintain it. For the Freebase data donation, we do not
> seem to have this community yet. We need to find a way to engage people to
> do this. Ideas are welcome.
>
> What I can see from the statistics, however, is that some users (and I
> cannot say if they are "Freebase users" or "Wikidata users" ;-) are putting
> a lot of effort into integrating the data already. This is great, and we
> should thank these people because they are the ones who are now working on
> what we are just talking about here. In addition, we should think about
> ways of engaging more community in this. Some ideas:
>
> (1) Find a way to clean and import some statements using bots. Maybe there
> are cases where Freebase already had a working import infrastructure that
> could be migrated to Wikidata? This would also solve the community support
> problem in one way. We just need to import the maintenance infrastructure
> together with the data.
>
> (2) Find a way to expose specific suggestions to more people. The Wikidata
> Games have attracted so many contributions. Could some of the Freebase data
> be solved in this way, with a dedicated UI?
>
> (3) Organise Freebase edit-a-thons where people come together to work
> through a bunch of suggested statements.
>
> (4) Form wiki projects that discuss a particular topic domain in Freebase
> and how it could be imported faster using (1)-(3) or any other idea.
>
> (5) Connect to existing Wiki projects to make them aware of valuable data
> they might take from Freebase.
>
> Freebase is a much better resource than 

Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)

2015-09-28 Thread Lydia Pintscher
On Sep 28, 2015 20:03, "Gerard Meijssen"  wrote:
>
> Hoi,
>
> Sorry I disagree with your analysis. The fundamental issue is not quality
and it is not the size of our community. The issue is that we have our
priorities wrong. As far as I am concerned the "primary sources tool" is a
wrong approach for a dataset like Freebase or DBpedia.
>
> What we should concentrate on is find likely issues that exist in
Wikidata. Make people aware of them and have a proper workflow that will
point people to the things they care about. When I care about "polders"
show me content where another source disagrees with what we have.

As I have said before, the extension to check against third-party databases
is being worked on. This is not an argument against the primary sources
tool. It is simply something very different.

> As I care about "polders" I will spend time on it BECAUSE I care and am
invited to resolve issues. I will be challenged because every item I touch
has an issue. I do not mind to do this when the data in Wikidata differs
from DBpedia, Freebase or whatever.. My time is well spend. THAT is why I
will be challenged, that is why I will be willing to work on this.
>
> I will not do this for new data in the primary sources tool. At most I
will give it a glance and accept it. I would only do this where data in the
primary sources tool differs. That however is exactly the same scenario
that I just described.
>
> I am not willing to look at data in Wikidata Freebase or DBpedia in the
primary sources tool one item/statement at a time; we know that they are of
a similar quality as Wikidata. The percentages make it a waste of time.
With iterative comparisons of other sources we will find the booboos easy
enough. We will spend the time of our communities effectively and we will
increase quality and quality and community.
>
> The approach of the primary sources tool is wrong. It should only be
about linking data and define how this is done.
>
> The problem is indeed with the community. Its time is wasted and it is
much more effective for me to add new data than work on data that is
already in the primary sources tool.
> Thanks,
>GerardM
>
> On 28 September 2015 at 16:52, Markus Krötzsch <
mar...@semantic-mediawiki.org> wrote:
>>
>> Hi Gerard, hi all,
>>
>> The key misunderstanding here is that the main issue with the Freebase
import would be data quality. It is actually community support. The goal of
the current slow import process is for the Wikidata community to "adopt"
the Freebase data. It's not about "storing" the data somewhere, but about
finding a way to maintain it in the future.
>>
>> The import statistics show that Wikidata does not currently have enough
community power for a quick import. This is regrettable, but not something
that we can fix by dumping in more data that will then be orphaned.
>>
>> Freebase people: this is not a small amount of data for our young
community. We really need your help to digest this huge amount of data! I
am absolutely convinced from the emails I saw here that none of the former
Freebase editors on this list would support low quality standards. They
have fought hard to fix errors and avoid issues coming into their data for
a long time.
>>
>> Nobody believes that either Freebase or Wikidata can ever be free of
errors, and this is really not the point of this discussion at all [1]. The
experienced community managers among us know that it is not about the
amount of data you have. Data is cheap and easy to get, even free data with
very high quality. But the value proposition of Wikidata is not that it can
provide storage space for lot of data -- it is that we have a functioning
community that can maintain it. For the Freebase data donation, we do not
seem to have this community yet. We need to find a way to engage people to
do this. Ideas are welcome.
>>
>> What I can see from the statistics, however, is that some users (and I
cannot say if they are "Freebase users" or "Wikidata users" ;-) are putting
a lot of effort into integrating the data already. This is great, and we
should thank these people because they are the ones who are now working on
what we are just talking about here. In addition, we should think about
ways of engaging more community in this. Some ideas:
>>
>> (1) Find a way to clean and import some statements using bots. Maybe
there are cases where Freebase already had a working import infrastructure
that could be migrated to Wikidata? This would also solve the community
support problem in one way. We just need to import the maintenance
infrastructure together with the data.
>>
>> (2) Find a way to expose specific suggestions to more people. The
Wikidata Games have attracted so many contributions. Could some of the
Freebase data be solved in this way, with a dedicated UI?
>>
>> (3) Organise Freebase edit-a-thons where people come together to work
through a bunch of suggested statements.
>>
>> (4) Form wiki projects that discuss a 

Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)

2015-09-28 Thread Stas Malyshev
Hi!

> I see that 19.6k statements have been approved through the tool, and
> 5.1k statements have been rejected - which means that about 1 in 5
> statements is deemed unsuitable by the users of primary sources.

From my (limited) experience with Primary Sources, there are several
kinds of things there that I had rejected:

- Unsourced statements that contradict what is written in Wikidata
- Duplicate claims already existing in Wikidata
- Duplicate claims with worse data (i.e. less accurate location, less
specific categorization, etc.) or unnecessary qualifiers (such as adding
information which is already contained in the item to the item's qualifiers
- e.g. zip code for a building)
- Source references that do not exist (404, etc.)
- Source references that do exist but either duplicate an existing one (a
number of sources just refer to a different URL for the same data) or do
not contain the information they should (e.g. a link to the newspaper's
homepage instead of the specific article)
- Claims that are almost obviously invalid (e.g. "United Kingdom" as a
genre of a play)

I think at least some of these - esp. references that do not exist and
duplicates with no refs - could be removed automatically, thus raising
the relative quality of the remaining items.

OTOH, some of the entries can be made self-evident - i.e. if we talk
about a movie and Freebase has an IMDB ID or Netflix ID, it may be quite
easy to check whether that ID is valid and refers to a movie with the same
name, which should be enough to merge it.

Not sure if those one-off things are worth bothering with; just putting it
out there to consider.
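
As a rough illustration of the automatic part (a sketch only; the statement
record layout is invented here, and a real pass would need rate limiting and
smarter duplicate detection):

    import requests

    WD_API = "https://www.wikidata.org/w/api.php"

    def source_is_dead(url):
        """True if the reference URL no longer resolves (404 and friends)."""
        try:
            r = requests.head(url, allow_redirects=True, timeout=15)
            return r.status_code >= 400
        except requests.RequestException:
            return True

    def already_in_wikidata(qid, pid, target_qid):
        """True if the item already has an identical item-valued claim."""
        r = requests.get(WD_API, params={"action": "wbgetclaims", "entity": qid,
                                         "property": pid, "format": "json"},
                         timeout=30)
        for claim in r.json().get("claims", {}).get(pid, []):
            value = claim["mainsnak"].get("datavalue", {}).get("value", {})
            if isinstance(value, dict) and value.get("id") == target_qid:
                return True
        return False

    def keep(statement):
        """statement is assumed to look like:
        {"item": "Q42", "property": "P106", "value": "Q36180", "source_url": "..."}"""
        if statement.get("source_url") and source_is_dead(statement["source_url"]):
            return False  # drop: the reference does not exist any more
        if already_in_wikidata(statement["item"], statement["property"],
                               statement["value"]):
            return False  # drop: plain duplicate of an existing claim
        return True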

-- 
Stas Malyshev
smalys...@wikimedia.org



Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)

2015-09-28 Thread John Erling Blad
Another idea: make a kind of worklist on Wikidata that reflects the
watchlists on the clients (the Wikipedias). But then, we often have items on
our watchlist that we don't know much about. (Digression: Somehow we should
be able to sort out those things we know (the place we live, the persons we
have met) from those things we have done (edited, copy-pasted).)

I have tried in the past to get some interest in worklists on Wikipedia,
but there isn't much interest in making them. They would speed up the
tedious task of finding the next page to edit after a given edit is
completed. It is the same problem with imports from Freebase on Wikidata:
locate the next item on Wikidata with the same queued statement from
Freebase, but within some worklist that the user has some knowledge about.

Imagine "municipalities within a county" or "municipalities that are also
on the user's watchlist", and combine that with the available unhandled
Freebase statements.
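
A sketch of what that intersection could look like (the watchlist part uses
the real MediaWiki API; get_pending_freebase_statements is a stand-in for
whatever the primary sources backend would have to expose, not an existing
endpoint):

    import requests

    def watched_titles(session, wiki_api):
        """First batch of the logged-in user's watchlist (continuation omitted
        for brevity); assumes `session` already carries an authenticated login."""
        r = session.get(wiki_api, params={"action": "query", "list": "watchlistraw",
                                          "wrlimit": "max", "format": "json"},
                        timeout=30).json()
        # Depending on the response shape the list sits at the root or under "query".
        entries = r.get("watchlistraw") or r.get("query", {}).get("watchlistraw", [])
        return {e["title"] for e in entries}

    def get_pending_freebase_statements(title):
        """Stand-in: unhandled Freebase statements for the item linked to this title."""
        raise NotImplementedError  # no such public endpoint today

    def build_worklist(session, wiki_api):
        """Intersect the user's watchlist with the queue of unhandled statements."""
        worklist = {}
        for title in watched_titles(session, wiki_api):
            try:
                pending = get_pending_freebase_statements(title)
            except NotImplementedError:
                continue
            if pending:
                worklist[title] = pending
        return worklist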

On Mon, Sep 28, 2015 at 10:09 PM, John Erling Blad  wrote:

> Could it be possible to create some kind of info (notification?) in a
> wikipedia article that additional data is available in a queue ("freebase")
> somewhere?
>
> If you have the article on your watch-list, then you will get a warning
> that says "You lazy boy, get your ass over here and help us out!" Or
> perhaps slightly rephrased.
>
> On Mon, Sep 28, 2015 at 4:52 PM, Markus Krötzsch <
> mar...@semantic-mediawiki.org> wrote:
>
>> Hi Gerard, hi all,
>>
>> The key misunderstanding here is that the main issue with the Freebase
>> import would be data quality. It is actually community support. The goal of
>> the current slow import process is for the Wikidata community to "adopt"
>> the Freebase data. It's not about "storing" the data somewhere, but about
>> finding a way to maintain it in the future.
>>
>> The import statistics show that Wikidata does not currently have enough
>> community power for a quick import. This is regrettable, but not something
>> that we can fix by dumping in more data that will then be orphaned.
>>
>> Freebase people: this is not a small amount of data for our young
>> community. We really need your help to digest this huge amount of data! I
>> am absolutely convinced from the emails I saw here that none of the former
>> Freebase editors on this list would support low quality standards. They
>> have fought hard to fix errors and avoid issues coming into their data for
>> a long time.
>>
>> Nobody believes that either Freebase or Wikidata can ever be free of
>> errors, and this is really not the point of this discussion at all [1]. The
>> experienced community managers among us know that it is not about the
>> amount of data you have. Data is cheap and easy to get, even free data with
>> very high quality. But the value proposition of Wikidata is not that it can
>> provide storage space for lot of data -- it is that we have a functioning
>> community that can maintain it. For the Freebase data donation, we do not
>> seem to have this community yet. We need to find a way to engage people to
>> do this. Ideas are welcome.
>>
>> What I can see from the statistics, however, is that some users (and I
>> cannot say if they are "Freebase users" or "Wikidata users" ;-) are putting
>> a lot of effort into integrating the data already. This is great, and we
>> should thank these people because they are the ones who are now working on
>> what we are just talking about here. In addition, we should think about
>> ways of engaging more community in this. Some ideas:
>>
>> (1) Find a way to clean and import some statements using bots. Maybe
>> there are cases where Freebase already had a working import infrastructure
>> that could be migrated to Wikidata? This would also solve the community
>> support problem in one way. We just need to import the maintenance
>> infrastructure together with the data.
>>
>> (2) Find a way to expose specific suggestions to more people. The
>> Wikidata Games have attracted so many contributions. Could some of the
>> Freebase data be solved in this way, with a dedicated UI?
>>
>> (3) Organise Freebase edit-a-thons where people come together to work
>> through a bunch of suggested statements.
>>
>> (4) Form wiki projects that discuss a particular topic domain in Freebase
>> and how it could be imported faster using (1)-(3) or any other idea.
>>
>> (5) Connect to existing Wiki projects to make them aware of valuable data
>> they might take from Freebase.
>>
>> Freebase is a much better resource than many other data resources we are
>> already using with similar approaches as (1)-(5) above, and yet it seems
>> many people are waiting for Google alone to come up with a solution.
>>
>> Cheers,
>>
>> Markus
>>
>> [1] Gerard, if you think otherwise, please let us know which error rates
>> you think are typical or acceptable for Freebase and Wikidata,
>> respectively. Without giving actual numbers you just produce empty strawman
>> arguments (for example: claiming that anyone 

Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)

2015-09-28 Thread Paul Houle
I think more fundamentally there is the issue that Wikidata doesn't serve
end users well because the end users are not paying for it. (Contrast an
NGO that would be doing things for people in Africa without asking the
people what they want with a commercial operation that is going to fly or
die based on its ability to serve identified needs of Africans.)

I am by no means a market fundamentalist, but when you look at Amazon.com,
you see there is a virtuous circle where small incremental improvements
that make the store better put money on the bottom line, linking career
advancement to customer success, etc. Over time the incremental changes
snowball. (Alternatively we could have exponential convergence instead of
expansion.)

I was looking around for API management solutions, and they all address
things like "creating stubs for the end user", "increasing developer
engagement", "converting XML to JSON and vice versa", and the always
dubious idea that adding a proxy server of some kind on the public internet
would help you meet an SLA. None of them support the minimum viable
product function of 'charging people to use the API' at a basic level,
although if you talk to the sales people maybe they will help you with a
"monetization engine" (who knows if it puts ads in the results), but you
will pay at least as much a month for this feature as the Silk Road spent
on software development (unfortunately earning it back in the form of
marked bitcoins).

And the API management sites are dealing with big-name companies like
Target and Clorox; all of these companies that are avaricious and smart
about money are not charging people for APIs.

If you are not the customer, you are the product.

"End user" is a fuzzy word though because that Dutch guy who is interested
in Polders is not the ordinary end user,  although you practically need to
bring people like that into things like Wikidata because you need their
curation.  Another tough problem is that we all have our specialties,  so
one person really needs a good database of wine regions,  another one ski
areas,  another one cares about books and another couldn't care less about
books but is into video games.  (The person who wants to contribute or pay
for improvements for area Z does not care about area Y)

Freebase was not particularly successful at getting unpaid help to improve
their database because of these fundamental economics; you might make the
case that friction in the form of "this data format is different from
everything else" or "the UI sux" or "the rest of the world hasn't caught
up with us on tooling" is the main problem, but people would overcome
those problems if the motivation existed.

Anyhow, there is this funny little thing: the gap between "5 cents"
and free is bigger than the gap between "5 cents" and $1000, so you have
the Bloombergs and Elseviers of the world charging $1000 for what somebody
could provide for much less. This problem exists for the human-readable
web, and so far advertising has been the answer, but it has not been
solved for open data.



On Mon, Sep 28, 2015 at 2:01 PM, Gerard Meijssen 
wrote:

> Hoi,
>
> Sorry I disagree with your analysis. The fundamental issue is not quality
> and it is not the size of our community. The issue is that we have our
> priorities wrong. As far as I am concerned the "primary sources tool" is a
> wrong approach for a dataset like Freebase or DBpedia.
>
> What we should concentrate on is find likely issues that exist in
> Wikidata. Make people aware of them and have a proper workflow that will
> point people to the things they care about. When I care about "polders"
> show me content where another source disagrees with what we have. As I care
> about "polders" I will spend time on it BECAUSE I care and am invited to
> resolve issues. I will be challenged because every item I touch has an
> issue. I do not mind to do this when the data in Wikidata differs from
> DBpedia, Freebase or whatever.. My time is well spend. THAT is why I will
> be challenged, that is why I will be willing to work on this.
>
> I will not do this for new data in the primary sources tool. At most I
> will give it a glance and accept it. I would only do this where data in the
> primary sources tool differs. That however is exactly the same scenario
> that I just described.
>
> I am not willing to look at data in Wikidata Freebase or DBpedia in the
> primary sources tool one item/statement at a time; we know that they are of
> a similar quality as Wikidata. The percentages make it a waste of time.
> With iterative comparisons of other sources we will find the booboos easy
> enough. We will spend the time of our communities effectively and we will
> increase quality and quality and community.
>
> The approach of the primary sources tool is wrong. It should only be about
> linking data and define how this is done.
>
> The problem is indeed with the community. Its time is