Re: [Wikidata] Genes, proteins, and bad merges in general

2016-06-14 Thread Tony Bowden
On 14 June 2016 at 18:53, Tom Morris  wrote:
> A specific instance of the structural impedance mismatch is enwiki's
> handling of genes & proteins. Sometimes they have a page for each, but often
> they have a single page that deals with both or, worse, a page whose text
> says it's about the protein, but where the page includes a gene infobox.

This is also a problem with pages on elections. It's very common for
national-level elections to be for more than one thing at the same
time — e.g. a Presidential election, and a Parliamentary one. In most
Wikipedias there will only be a single page for, say, "Brazilian
general election, 2014", though occasionally you'll get separate pages
in _some_ languages for (for example) "Brazilian legislative election,
2010" and "Brazilian presidential election, 2010" (also split in pt:),
whereas those will be combined in other languages (de:Wahlen in
Brasilien 2010 / pl:Wybory powszechne w Brazylii w 2010 roku).

Mostly this material hasn't had a lot of attention yet on Wikidata, so
it's not _too_ hard to split out separate pages for each conceptually
different thing, each of which is 'part of' a wider 'general
election' (though for an added twist, the legislative elections are
often themselves for multiple houses (eg the Assembly and Senate)
simultaneously, and almost never have distinct Wikipedia pages).

I have seen at least one case though where someone then merged two of
these, presumably (although I didn't dig in deeply enough to be
sure) because each of the Wikidata pages mapped to a single page in
"their" Wikipedia. Thankfully this doesn't appear to have been too
common an occurrence yet, but that's potentially just because very few
of them have even been split up in the first place yet. (Currently I'm
largely just picking off the low-hanging fruit of making sure
that each of the national elections in the world over the last hundred
years or so even has a basic Wikidata entry *at all*.)  I'm hoping
that such merges would be less likely in cases where each of the
individual Wikidata pages had quite rich information on candidates,
turnout, winners, etc, but as it's comparatively difficult to even
semi-automate the import of statements like that (and many of the
existing pages already have confusing combined data presumably from
Wikipedia infobox imports from such mismatched pages), it's likely
that it'll take quite a while for that to happen, and I fear that
there will thus be quite a long period where it will be tempting for
people to mis-merge pages.

Tony

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] RFC - Primary Sources?

2016-06-14 Thread Lydia Pintscher
On Jun 14, 2016 23:56, "Tom Morris"  wrote:
>
> Thanks for the reminder. So that solves the "asking" part.
>
> Does anyone *not* think that the Wikidata engineering team is the correct
place for this?
>
> Lydia - can you assign someone to come up to speed at whatever level
Denny requires to feel comfortable making the transfer?

I will take care of it with Denny in the next few days.

Cheers
Lydia

>
> Tom
>
> On Tue, Jun 14, 2016 at 4:26 PM, Thomas Steiner  wrote:
>>
>> Hi Tom, all,
>>
>> > Have you considered asking
>> > Google to transfer ownership of the project since they're no longer
doing
>> > anything with it?
>> Denny wrote [1] that we are open to transferring the project to a new
>> owner: "If anyone wants to take over the project, we would invite you
>> to contribute a bit for a while, and then let’s discuss about it. I
>> would be thrilled to see this tool develop.". This still stands :-)
>>
>> Cheers,
>> Tom
>>
>> --
>> [1]
https://lists.wikimedia.org/pipermail/wikidata/2016-February/008316.html
>>
>>
>> --
>> Dr. Thomas Steiner, Employee (http://blog.tomayac.com,
>> https://twitter.com/tomayac)
>>
>> Google Germany GmbH, ABC-Str. 19, 20354 Hamburg, Germany
>> Managing Directors: Matthew Scott Sucherman, Paul Terence Manicle
>> Registration office and registration number: Hamburg, HRB 86891
>>
>> -BEGIN PGP SIGNATURE-
>> Version: GnuPG v2.0.29 (GNU/Linux)
>>
>>
iFy0uwAntT0bE3xtRa5AfeCheCkthAtTh3reSabiGbl0ck0fjumBl3DCharaCTersAttH3b0ttom
>> hTtPs://xKcd.cOm/1181/
>> -END PGP SIGNATURE-


Re: [Wikidata] Genes, proteins, and bad merges in general

2016-06-14 Thread Tom Morris
Hi Gerard. There's often a tension between supporting "power users" and the
regular users, but in this case, I left out a little nuance - if you
flagged an item created by yourself for either deletion or merger and no
one else had edited it in the meantime, the operation was processed
automatically without having to go through the voting process. This allowed
everyone to fix their own mistakes quickly. Finding the right balance for
these processes typically takes a little tuning.
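
The fast path described above could be sketched roughly like this (a hypothetical data model for illustration; the actual Freebase implementation is not reproduced here):

```python
from dataclasses import dataclass, field

@dataclass
class MergeProposal:
    item_id: str
    target_id: str
    proposer: str
    item_creator: str
    other_editors: set = field(default_factory=set)

def route(proposal):
    """Decide how a merge/delete proposal is handled.

    Fast path: if the proposer created the item themselves and nobody
    else has edited it since, process the operation immediately --
    people can clean up their own mistakes without waiting for votes.
    Otherwise the proposal goes to the voting queue.
    """
    if proposal.proposer == proposal.item_creator and not proposal.other_editors:
        return "auto-process"
    return "voting-queue"
```

The balance point is the `other_editors` check: as soon as a second person has touched the item, the safety net of extra eyes kicks back in.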

I forgot to mention another aspect of the current merge process that I
think is dangerous and have seen cause problems: merge "games."
High impact operations like merges seem like a particularly poor fit for
gamification, particularly when there's no safety net such as a second set
of eyes.

Tom

On Tue, Jun 14, 2016 at 4:08 PM, Gerard Meijssen 
wrote:

> Hoi,
> I add "many" entries. As a consequence I make the occasional mistake.
> Typically I find them myself and rectify. When you interfere with that, I
> can no longer sort out the mess I make. That is fine. It is then for
> someone else to fix.
> Thanks,
>  GerardM
>
> On 14 June 2016 at 21:20, Benjamin Good  wrote:
>
>> Hi Tom,
>>
>> I think the example you have there is actually linked up properly at the
>> moment?
>> https://en.wikipedia.org/wiki/SPATA5 is about both the gene and the
>> protein as are most Wikipedia articles of this nature.  And it is linked to
>> the gene the way we encourage modeling
>> https://www.wikidata.org/wiki/Q18052679  - and indeed the protein item
>> is not linked to a Wikipedia article again following our preferred pattern.
>>
>> For the moment...  _our_ merge problem seems to be mostly resolved.
>> Correcting the sitelinks on the non-english Wikipedias in a big batch
>> seemed to help slow the flow dramatically.  We have also introduced some
>> flexibility into the Lua code that produces infobox_gene on Wikipedia.  It
>> can handle most of the possible situations (e.g. wikipedia linked to
>> protein, wikipedia linked to gene) automatically so that helps prevent
>> visible disasters..
>>
>> On the main issue you raise about merges..  I'm a little on the fence.
>> Generally I'm opposed to putting constraints in place that slow people down
>> - e.g. we have a lot of manual merge work that needs to be done in the
>> medical arena and I do appreciate that the current process is pretty fast.
>> I guess I would advocate a focus on making the interface more vehemently
>> educational as a first step.  E.g. lots of 'are you sure' etc. forms to
>> click through but ultimately still letting people get their work done
>> without enforcing an approval process.
>>
>> -Ben
>>
>> On Tue, Jun 14, 2016 at 10:53 AM, Tom Morris  wrote:
>>
>>> Bad merges have been mentioned a couple of times recently and I think
>>> one of the contexts was Ben's gene/protein work.
>>>
>>> I think there are two general issues here which could be improved:
>>>
>>> 1. Merging is too easy. Because splitting/unmerging is much harder than
>>> merging, particularly after additional edits, the process should be biased
>>> to make merging more difficult.
>>>
>>> 2. The impedance mismatch between Wikidata and Wikipedias tempts
>>> wikipedians who are new to wikidata to do the wrong thing.
>>>
>>> The second is a community education issue which will hopefully improve
>>> over time, but the first could be improved, in my opinion, by requiring
>>> more than one person to approve a merge. The Freebase scheme was that
>>> duplicate topics could be flagged for merge by anyone, but instead of
>>> merging, they'd be placed in a queue for voting. Unanimous votes would
>>> cause merges to be automatically processed. Conflicting votes would get
>>> bumped to a second level queue for manual handling. This wasn't foolproof,
>>> but caught a lot of the naive "these two things have the same name, so they
>>> must be the same thing" merge proposals by newbies. There are lots of
>>> variations that could be implemented, but the general idea is to get more
>>> than one pair of eyes involved.
>>>
>>> A specific instance of the structural impedance mismatch is enwiki's
>>> handling of genes & proteins. Sometimes they have a page for each, but
>>> often they have a single page that deals with both or, worse, a page whose
>>> text says it's about the protein, but where the page includes a gene infobox.
>>>
>>> This unanswered RFC from Oct 2015 asks whether protein & gene should be
>>> merged:
>>> https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Oxytocin_and_OXT_gene
>>>
>>> I recently ran across a similar situation where this Wikidata gene
>>> SPATA5 https://www.wikidata.org/wiki/Q18052679 is linked to an enwiki
>>> page about the associated protein https://en.wikipedia.org/wiki/SPATA5,
>>> while the Wikidata protein is not linked to any wikis
>>> https://www.wikidata.org/wiki/Q21207860
>>>
>>> These differences in handling make the reconciliation process very
>>> difficult and the resulting errors encourage erroneous merges. The
>>> gene/protein case probably needs multiple fixes, but making mergers
>>> harder would help.

Re: [Wikidata] Genes, proteins, and bad merges in general

2016-06-14 Thread Tom Morris
Hi Ben. On reflection, I think the SPATA5 page is more about the gene than
the protein, despite the lead (and only) sentence, which has the protein as
its subject. The other example from the RFC (Neurophysin I) seems less clearcut to me
since the text is entirely about the protein, while the infobox is the only
thing talking about the gene. I generally discount infoboxes since they can
have little to do with the main subject of the page (e.g. civil war battle
with infoboxes about the opposing generals or the associated NRHP place).

Despite the textual confusion, I realized that there isn't really much
merge risk here since the "encodes" property linking the gene and the
protein should prevent any attempted merge from happening.
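
That guard could look something like the following (a minimal sketch; the item representation is invented, and P688/P702 are assumed to be Wikidata's "encodes"/"encoded by" properties):

```python
# Minimal sketch of a pre-merge check.  Items are represented here as
# plain dicts ({"id": ..., "claims": {property_id: [target_ids]}}) --
# an invented structure, not the real Wikidata data model.

BLOCKING_PROPERTIES = {"P688", "P702"}  # assumed: "encodes" / "encoded by"

def merge_allowed(item_a, item_b):
    """Refuse a merge when either item links to the other via a
    blocking property: a gene and the protein it encodes are distinct
    items by definition, so merging them is always an error."""
    for src, dst in ((item_a, item_b), (item_b, item_a)):
        for prop in BLOCKING_PROPERTIES:
            if dst["id"] in src["claims"].get(prop, []):
                return False
    return True
```

Checking both directions means the guard works whichever of the pair carries the linking statement.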

Tom

On Tue, Jun 14, 2016 at 3:20 PM, Benjamin Good 
wrote:

> Hi Tom,
>
> I think the example you have there is actually linked up properly at the
> moment?
> https://en.wikipedia.org/wiki/SPATA5 is about both the gene and the
> protein as are most Wikipedia articles of this nature.  And it is linked to
> the gene the way we encourage modeling
> https://www.wikidata.org/wiki/Q18052679  - and indeed the protein item is
> not linked to a Wikipedia article again following our preferred pattern.
>
> For the moment...  _our_ merge problem seems to be mostly resolved.
> Correcting the sitelinks on the non-english Wikipedias in a big batch
> seemed to help slow the flow dramatically.  We have also introduced some
> flexibility into the Lua code that produces infobox_gene on Wikipedia.  It
> can handle most of the possible situations (e.g. wikipedia linked to
> protein, wikipedia linked to gene) automatically so that helps prevent
> visible disasters..
>
> On the main issue you raise about merges..  I'm a little on the fence.
> Generally I'm opposed to putting constraints in place that slow people down
> - e.g. we have a lot of manual merge work that needs to be done in the
> medical arena and I do appreciate that the current process is pretty fast.
> I guess I would advocate a focus on making the interface more vehemently
> educational as a first step.  E.g. lots of 'are you sure' etc. forms to
> click through but ultimately still letting people get their work done
> without enforcing an approval process.
>
> -Ben
>
> On Tue, Jun 14, 2016 at 10:53 AM, Tom Morris  wrote:
>
>> Bad merges have been mentioned a couple of times recently and I think one
>> of the contexts was Ben's gene/protein work.
>>
>> I think there are two general issues here which could be improved:
>>
>> 1. Merging is too easy. Because splitting/unmerging is much harder than
>> merging, particularly after additional edits, the process should be biased
>> to make merging more difficult.
>>
>> 2. The impedance mismatch between Wikidata and Wikipedias tempts
>> wikipedians who are new to wikidata to do the wrong thing.
>>
>> The second is a community education issue which will hopefully improve
>> over time, but the first could be improved, in my opinion, by requiring
>> more than one person to approve a merge. The Freebase scheme was that
>> duplicate topics could be flagged for merge by anyone, but instead of
>> merging, they'd be placed in a queue for voting. Unanimous votes would
>> cause merges to be automatically processed. Conflicting votes would get
>> bumped to a second level queue for manual handling. This wasn't foolproof,
>> but caught a lot of the naive "these two things have the same name, so they
>> must be the same thing" merge proposals by newbies. There are lots of
>> variations that could be implemented, but the general idea is to get more
>> than one pair of eyes involved.
>>
>> A specific instance of the structural impedance mismatch is enwiki's
>> handling of genes & proteins. Sometimes they have a page for each, but
>> often they have a single page that deals with both or, worse, a page whose
>> text says it's about the protein, but where the page includes a gene infobox.
>>
>> This unanswered RFC from Oct 2015 asks whether protein & gene should be
>> merged:
>> https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Oxytocin_and_OXT_gene
>>
>> I recently ran across a similar situation where this Wikidata gene SPATA5
>> https://www.wikidata.org/wiki/Q18052679 is linked to an enwiki page
>> about the associated protein https://en.wikipedia.org/wiki/SPATA5, while
>> the Wikidata protein is not linked to any wikis
>> https://www.wikidata.org/wiki/Q21207860
>>
>> These differences in handling make the reconciliation process very
>> difficult and the resulting errors encourage erroneous merges. The
>> gene/protein case probably needs multiple fixes, but making mergers harder
>> would help.
>>
>> Tom
>>
>>
>>

Re: [Wikidata] RFC - Primary Sources?

2016-06-14 Thread Tom Morris
Thanks for the reminder. So that solves the "asking" part.

Does anyone *not* think that the Wikidata engineering team is the correct
place for this?

Lydia - can you assign someone to come up to speed at whatever level Denny
requires to feel comfortable making the transfer?

Tom

On Tue, Jun 14, 2016 at 4:26 PM, Thomas Steiner  wrote:

> Hi Tom, all,
>
> > Have you considered asking
> > Google to transfer ownership of the project since they're no longer doing
> > anything with it?
> Denny wrote [1] that we are open to transferring the project to a new
> owner: "If anyone wants to take over the project, we would invite you
> to contribute a bit for a while, and then let’s discuss about it. I
> would be thrilled to see this tool develop.". This still stands :-)
>
> Cheers,
> Tom
>
> --
> [1]
> https://lists.wikimedia.org/pipermail/wikidata/2016-February/008316.html
>
>
> --
> Dr. Thomas Steiner, Employee (http://blog.tomayac.com,
> https://twitter.com/tomayac)
>
> Google Germany GmbH, ABC-Str. 19, 20354 Hamburg, Germany
> Managing Directors: Matthew Scott Sucherman, Paul Terence Manicle
> Registration office and registration number: Hamburg, HRB 86891
>
> -BEGIN PGP SIGNATURE-
> Version: GnuPG v2.0.29 (GNU/Linux)
>
>
> iFy0uwAntT0bE3xtRa5AfeCheCkthAtTh3reSabiGbl0ck0fjumBl3DCharaCTersAttH3b0ttom
> hTtPs://xKcd.cOm/1181/
> -END PGP SIGNATURE-
>


Re: [Wikidata] RFC - Primary Sources?

2016-06-14 Thread Thomas Steiner
Hi Tom, all,

> Have you considered asking
> Google to transfer ownership of the project since they're no longer doing
> anything with it?
Denny wrote [1] that we are open to transferring the project to a new
owner: "If anyone wants to take over the project, we would invite you
to contribute a bit for a while, and then let’s discuss about it. I
would be thrilled to see this tool develop.". This still stands :-)

Cheers,
Tom

--
[1] https://lists.wikimedia.org/pipermail/wikidata/2016-February/008316.html


-- 
Dr. Thomas Steiner, Employee (http://blog.tomayac.com,
https://twitter.com/tomayac)

Google Germany GmbH, ABC-Str. 19, 20354 Hamburg, Germany
Managing Directors: Matthew Scott Sucherman, Paul Terence Manicle
Registration office and registration number: Hamburg, HRB 86891

-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.29 (GNU/Linux)

iFy0uwAntT0bE3xtRa5AfeCheCkthAtTh3reSabiGbl0ck0fjumBl3DCharaCTersAttH3b0ttom
hTtPs://xKcd.cOm/1181/
-END PGP SIGNATURE-



Re: [Wikidata] Genes, proteins, and bad merges in general

2016-06-14 Thread Gerard Meijssen
Hoi,
I add "many" entries. As a consequence I make the occasional mistake.
Typically I find them myself and rectify them. When you interfere with that, I
can no longer sort out the mess I make. That is fine. It is then for
someone else to fix.
Thanks,
 GerardM

On 14 June 2016 at 21:20, Benjamin Good  wrote:

> Hi Tom,
>
> I think the example you have there is actually linked up properly at the
> moment?
> https://en.wikipedia.org/wiki/SPATA5 is about both the gene and the
> protein as are most Wikipedia articles of this nature.  And it is linked to
> the gene the way we encourage modeling
> https://www.wikidata.org/wiki/Q18052679  - and indeed the protein item is
> not linked to a Wikipedia article again following our preferred pattern.
>
> For the moment...  _our_ merge problem seems to be mostly resolved.
> Correcting the sitelinks on the non-english Wikipedias in a big batch
> seemed to help slow the flow dramatically.  We have also introduced some
> flexibility into the Lua code that produces infobox_gene on Wikipedia.  It
> can handle most of the possible situations (e.g. wikipedia linked to
> protein, wikipedia linked to gene) automatically so that helps prevent
> visible disasters..
>
> On the main issue you raise about merges..  I'm a little on the fence.
> Generally I'm opposed to putting constraints in place that slow people down
> - e.g. we have a lot of manual merge work that needs to be done in the
> medical arena and I do appreciate that the current process is pretty fast.
> I guess I would advocate a focus on making the interface more vehemently
> educational as a first step.  E.g. lots of 'are you sure' etc. forms to
> click through but ultimately still letting people get their work done
> without enforcing an approval process.
>
> -Ben
>
> On Tue, Jun 14, 2016 at 10:53 AM, Tom Morris  wrote:
>
>> Bad merges have been mentioned a couple of times recently and I think one
>> of the contexts was Ben's gene/protein work.
>>
>> I think there are two general issues here which could be improved:
>>
>> 1. Merging is too easy. Because splitting/unmerging is much harder than
>> merging, particularly after additional edits, the process should be biased
>> to make merging more difficult.
>>
>> 2. The impedance mismatch between Wikidata and Wikipedias tempts
>> wikipedians who are new to wikidata to do the wrong thing.
>>
>> The second is a community education issue which will hopefully improve
>> over time, but the first could be improved, in my opinion, by requiring
>> more than one person to approve a merge. The Freebase scheme was that
>> duplicate topics could be flagged for merge by anyone, but instead of
>> merging, they'd be placed in a queue for voting. Unanimous votes would
>> cause merges to be automatically processed. Conflicting votes would get
>> bumped to a second level queue for manual handling. This wasn't foolproof,
>> but caught a lot of the naive "these two things have the same name, so they
>> must be the same thing" merge proposals by newbies. There are lots of
>> variations that could be implemented, but the general idea is to get more
>> than one pair of eyes involved.
>>
>> A specific instance of the structural impedance mismatch is enwiki's
>> handling of genes & proteins. Sometimes they have a page for each, but
>> often they have a single page that deals with both or, worse, a page whose
>> text says it's about the protein, but where the page includes a gene infobox.
>>
>> This unanswered RFC from Oct 2015 asks whether protein & gene should be
>> merged:
>> https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Oxytocin_and_OXT_gene
>>
>> I recently ran across a similar situation where this Wikidata gene SPATA5
>> https://www.wikidata.org/wiki/Q18052679 is linked to an enwiki page
>> about the associated protein https://en.wikipedia.org/wiki/SPATA5, while
>> the Wikidata protein is not linked to any wikis
>> https://www.wikidata.org/wiki/Q21207860
>>
>> These differences in handling make the reconciliation process very
>> difficult and the resulting errors encourage erroneous merges. The
>> gene/protein case probably needs multiple fixes, but making mergers harder
>> would help.
>>
>> Tom
>>
>>
>>


Re: [Wikidata] RFC - Primary Sources?

2016-06-14 Thread Gerard Meijssen
Hoi,
That would be really welcome.. I notice how much duplication is going on...
SAD... My position on this has been clear.
Thanks,
 GerardM

On 14 June 2016 at 21:18, Thad Guidry  wrote:

> Lydia,
>
> I would contribute.  Most of us from Freebase around would gladly help out
> I'm sure.  We'd like to see the data eventually move into Wikidata rather
> than rot away.
>
>
> Thad
> +ThadGuidry 
>
> On Tue, Jun 14, 2016 at 12:30 PM, Lydia Pintscher <
> lydia.pintsc...@wikimedia.de> wrote:
>
>> On Tue, Jun 14, 2016 at 7:26 PM, Tom Morris  wrote:
>> > OK, Marco Fossati (aka marfox on Github) is a name I recognize.
>> >
>> > Lydia - I'm not upset, just confused by the RFC. Have you considered
>> asking
>> > Google to transfer ownership of the project since they're no longer
>> doing
>> > anything with it? That would do away with the requirement that
>> contributors
>> > sign a Google CLA and open the door for active project leadership.
>>
>> I am happy to handle that if I have someone who says they'd contribute
>> to the tool without the CLA but not with it, yes.
>>
>>
>> Cheers
>> Lydia
>>
>> --
>> Lydia Pintscher - http://about.me/lydia.pintscher
>> Product Manager for Wikidata
>>
>> Wikimedia Deutschland e.V.
>> Tempelhofer Ufer 23-24
>> 10963 Berlin
>> www.wikimedia.de
>>
>> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
>>
>> Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
>> unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das
>> Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.
>>


Re: [Wikidata] Genes, proteins, and bad merges in general

2016-06-14 Thread Benjamin Good
Hi Tom,

I think the example you have there is actually linked up properly at the
moment?
https://en.wikipedia.org/wiki/SPATA5 is about both the gene and the protein
as are most Wikipedia articles of this nature.  And it is linked to the
gene the way we encourage modeling https://www.wikidata.org/wiki/Q18052679
- and indeed the protein item is not linked to a Wikipedia article, again
following our preferred pattern.

For the moment...  _our_ merge problem seems to be mostly resolved.
Correcting the sitelinks on the non-English Wikipedias in a big batch
seemed to help slow the flow dramatically.  We have also introduced some
flexibility into the Lua code that produces infobox_gene on Wikipedia.  It
can handle most of the possible situations (e.g. wikipedia linked to
protein, wikipedia linked to gene) automatically so that helps prevent
visible disasters..
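
The kind of disambiguation the Lua module needs can be illustrated like this (sketched in Python rather than Lua for brevity; the item structure and the use of P688/"encodes" and P702/"encoded by" are assumptions, not the module's actual code):

```python
# Given the item a Wikipedia article is sitelinked to, work out which
# item is the gene and which is the protein, regardless of which one
# the article happens to be linked to.  Items are plain dicts:
# {"id": ..., "claims": {property_id: [target_ids]}} (invented shape).

def resolve_gene_and_protein(linked_item):
    """Return (gene_id, protein_id); either may be None if unknown."""
    claims = linked_item["claims"]
    if "P688" in claims:            # item encodes a protein -> it is the gene
        return linked_item["id"], claims["P688"][0]
    if "P702" in claims:            # item is encoded by a gene -> it is the protein
        return claims["P702"][0], linked_item["id"]
    return linked_item["id"], None  # can't tell; fall back to treating it as the gene
```

With a lookup like this the infobox can render correctly whether the sitelink points at the gene item or the protein item.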

On the main issue you raise about merges..  I'm a little on the fence.
Generally I'm opposed to putting constraints in place that slow people down
- e.g. we have a lot of manual merge work that needs to be done in the
medical arena and I do appreciate that the current process is pretty fast.
I guess I would advocate a focus on making the interface more vehemently
educational as a first step.  E.g. lots of 'are you sure' etc. forms to
click through but ultimately still letting people get their work done
without enforcing an approval process.

-Ben

On Tue, Jun 14, 2016 at 10:53 AM, Tom Morris  wrote:

> Bad merges have been mentioned a couple of times recently and I think one
> of the contexts was Ben's gene/protein work.
>
> I think there are two general issues here which could be improved:
>
> 1. Merging is too easy. Because splitting/unmerging is much harder than
> merging, particularly after additional edits, the process should be biased
> to make merging more difficult.
>
> 2. The impedance mismatch between Wikidata and Wikipedias tempts
> wikipedians who are new to wikidata to do the wrong thing.
>
> The second is a community education issue which will hopefully improve
> over time, but the first could be improved, in my opinion, by requiring
> more than one person to approve a merge. The Freebase scheme was that
> duplicate topics could be flagged for merge by anyone, but instead of
> merging, they'd be placed in a queue for voting. Unanimous votes would
> cause merges to be automatically processed. Conflicting votes would get
> bumped to a second level queue for manual handling. This wasn't foolproof,
> but caught a lot of the naive "these two things have the same name, so they
> must be the same thing" merge proposals by newbies. There are lots of
> variations that could be implemented, but the general idea is to get more
> than one pair of eyes involved.
>
> A specific instance of the structural impedance mismatch is enwiki's
> handling of genes & proteins. Sometimes they have a page for each, but
> often they have a single page that deals with both or, worse, a page whose
> text says it's about the protein, but where the page includes a gene infobox.
>
> This unanswered RFC from Oct 2015 asks whether protein & gene should be
> merged:
> https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Oxytocin_and_OXT_gene
>
> I recently ran across a similar situation where this Wikidata gene SPATA5
> https://www.wikidata.org/wiki/Q18052679 is linked to an enwiki page about
> the associated protein https://en.wikipedia.org/wiki/SPATA5, while the
> Wikidata protein is not linked to any wikis
> https://www.wikidata.org/wiki/Q21207860
>
> These differences in handling make the reconciliation process very
> difficult and the resulting errors encourage erroneous merges. The
> gene/protein case probably needs multiple fixes, but making mergers harder
> would help.
>
> Tom
>
>
>


Re: [Wikidata] RFC - Primary Sources?

2016-06-14 Thread Thad Guidry
Lydia,

I would contribute.  Most of us from Freebase still around would gladly help
out, I'm sure.  We'd like to see the data eventually move into Wikidata rather
than rot away.


Thad
+ThadGuidry 

On Tue, Jun 14, 2016 at 12:30 PM, Lydia Pintscher <
lydia.pintsc...@wikimedia.de> wrote:

> On Tue, Jun 14, 2016 at 7:26 PM, Tom Morris  wrote:
> > OK, Marco Fossati (aka marfox on Github) is a name I recognize.
> >
> > Lydia - I'm not upset, just confused by the RFC. Have you considered
> asking
> > Google to transfer ownership of the project since they're no longer doing
> > anything with it? That would do away with the requirement that
> contributors
> > sign a Google CLA and open the door for active project leadership.
>
> I am happy to handle that if I have someone who says they'd contribute
> to the tool without the CLA but not with it, yes.
>
>
> Cheers
> Lydia
>
> --
> Lydia Pintscher - http://about.me/lydia.pintscher
> Product Manager for Wikidata
>
> Wikimedia Deutschland e.V.
> Tempelhofer Ufer 23-24
> 10963 Berlin
> www.wikimedia.de
>
> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
>
> Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
> unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das
> Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.
>


[Wikidata] Genes, proteins, and bad merges in general

2016-06-14 Thread Tom Morris
Bad merges have been mentioned a couple of times recently and I think one
of the contexts was Ben's gene/protein work.

I think there are two general issues here which could be improved:

1. Merging is too easy. Because splitting/unmerging is much harder than
merging, particularly after additional edits, the process should be biased
to make merging more difficult.

2. The impedance mismatch between Wikidata and Wikipedias tempts
wikipedians who are new to wikidata to do the wrong thing.

The second is a community education issue which will hopefully improve over
time, but the first could be improved, in my opinion, by requiring more
than one person to approve a merge. The Freebase scheme was that duplicate
topics could be flagged for merge by anyone, but instead of merging, they'd
be placed in a queue for voting. Unanimous votes would cause merges to be
automatically processed. Conflicting votes would get bumped to a second
level queue for manual handling. This wasn't foolproof, but caught a lot of
the naive "these two things have the same name, so they must be the same
thing" merge proposals by newbies. There are lots of variations that could
be implemented, but the general idea is to get more than one pair of eyes
involved.
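
As a rough sketch of the voting step described above (hypothetical; Freebase's actual queue code is not reproduced here):

```python
from collections import Counter

def process_votes(votes):
    """Tally votes on a flagged merge proposal.

    votes: list of "merge" / "keep" strings from different reviewers.
    Unanimous "merge" -> process the merge automatically; unanimous
    "keep" -> drop the proposal; any disagreement -> escalate to the
    second-level queue for manual handling.
    """
    tally = Counter(votes)
    if len(tally) == 1:
        return "merge" if "merge" in tally else "reject"
    return "escalate"
```

The point is not the specific mechanics but that no single "these have the same name" click can trigger a merge on its own.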

A specific instance of the structural impedance mismatch is enwiki's
handling of genes & proteins. Sometimes they have a page for each, but
often they have a single page that deals with both or, worse, a page whose
text says it's about the protein, but where the page includes a gene infobox.

This unanswered RFC from Oct 2015 asks whether protein & gene should be
merged:
https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Oxytocin_and_OXT_gene

I recently ran across a similar situation where this Wikidata gene SPATA5
https://www.wikidata.org/wiki/Q18052679 is linked to an enwiki page about
the associated protein https://en.wikipedia.org/wiki/SPATA5, while the
Wikidata protein is not linked to any wikis
https://www.wikidata.org/wiki/Q21207860

These differences in handling make the reconciliation process very
difficult, and the resulting errors encourage erroneous merges. The
gene/protein case probably needs multiple fixes, but making mergers harder
would help.

Tom
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] RFC - Primary Sources?

2016-06-14 Thread Lydia Pintscher
On Tue, Jun 14, 2016 at 7:26 PM, Tom Morris  wrote:
> OK, Marco Fossati (aka marfox on Github) is a name I recognize.
>
> Lydia - I'm not upset, just confused by the RFC. Have you considered asking
> Google to transfer ownership of the project since they're no longer doing
> anything with it? That would do away with the requirement that contributors
> sign a Google CLA and open the door for active project leadership.

I am happy to handle that if I have someone who says they'd contribute
to the tool without the CLA but not with it, yes.


Cheers
Lydia

-- 
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das
Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.



Re: [Wikidata] RFC - Primary Sources?

2016-06-14 Thread Tom Morris
OK, Marco Fossati (aka marfox on Github) is a name I recognize.

Lydia - I'm not upset, just confused by the RFC. Have you considered asking
Google to transfer ownership of the project since they're no longer doing
anything with it? That would do away with the requirement that contributors
sign a Google CLA and open the door for active project leadership.

Marco - Centralizing the discussion is good, but why not pick one of the
three existing channels (issue tracker, project page, this mailing list)
rather than creating a fourth channel? As much as I love playing and
watching soccer, I'm much more interested in the vast trove of identifiers
and other curated information in Freebase than I am in improving Wikidata's
soccer coverage, but the Primary Sources tool could be useful for some
portions of the Freebase data, if it could be usable. I'm sure you've seen
my issues and pull requests on Github.

Tom

On Tue, Jun 14, 2016 at 6:53 AM, Marco Fossati 
wrote:

> Hi Tom and thanks Lydia for the clarification,
>
> that request for comments (RFC) [1] aims at gathering feedback both on the
> primary sources tool and the available datasets (especially StrepHit [2]),
> which are closely intertwined: the dataset is in the tool, so people can
> play with both in one single interaction and leave their thoughts in the
> RFC.
>
> Sorry if the title is misleading: the pipeline is indeed semi-automatic,
> as the StrepHit dataset is generated automatically, while its validation
> requires human attention.
>
> Since I'm trying to centralize the discussion, it would be great if you
> could expand in the RFC the 3 fundamental questions you raised.
>
> Best,
>
> Marco
>
> [1]
> https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Semi-automatic_Addition_of_References_to_Wikidata_Statements
> [2]
> https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References
>
>
> On 6/14/16 08:27, Lydia Pintscher wrote:
>
>> On Tue, Jun 14, 2016 at 1:03 AM Tom Morris wrote:
>>
>> I'm confused by this from today's Wikidata weekly summary:
>>
>>   * New request for comments: Semi-automatic Addition of References
>> to Wikidata Statements - feedback on the Primary Sources Tool
>> <https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Semi-automatic_Addition_of_References_to_Wikidata_Statements>
>>
>> First of all, the title makes no sense because "semi-automatic
>> addition of references to Wikidata statements" is one of the main
>> things that the tool can't currently do. You'll almost always end up
>> with duplicate statements if there's an existing statement, rather
>> than the desired behavior of just adding the statement.
>>
>> Second, I'm not sure who "Hjfocs" is (why does everyone have to make
>> up fake wikinames?), but why are they asking for more feedback when
>> there's been *ample* feedback already? There hasn't been an issue
>> with getting people to test the tool or provide feedback based on
>> the testing. The issue has been with getting anyone to *act* on the
>>
>> feedback. Everything is a) "too hard," or b) "beyond our resources,"
>> or depends on something in category a or b, or is incompatible with
>> the arbitrary implementation scheme chosen, or some other excuse.
>>
>> We're 12-18+ months into the project, depending on how you measure,
>> and not only is the tool not usable yet, but it's no longer
>> improving, so I think it's time to take a step back and ask some
>> fundamental questions.
>>
>> - Is the current data pipeline and front end gadget the right
>> approach and the right technology for this task? Can they be fixed
>> to be suitable for users?
>> - If so, should Google continue to have sole responsibility for it
>> or should it be transferred to the Wikidata team or someone else
>> who'll actually work on it?
>> - If not, what should the data pipeline and tooling look like to
>> make maximum use of the Freebase data?
>>
>> The whole project needs a reboot.
>>
>>
>> I realize you are upset but you are really barking up the wrong tree.
>> Marco is trying to give the whole thing more structure and sort through
>> all the requests to find a way forward. He is actually doing something
>> constructive about the issues you are raising.
>>
>>
>> Cheers
>> Lydia
>> --
>> Lydia Pintscher - http://about.me/lydia.pintscher
>> Product Manager for Wikidata
>>
>> Wikimedia Deutschland e.V.
>> Tempelhofer Ufer 23-24
>> 10963 Berlin
>> www.wikimedia.de 
>>
>> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
>>
>> Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
>> unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das
>> Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.
>>

Re: [Wikidata] RFC - Primary Sources?

2016-06-14 Thread Marco Fossati

Hi Tom and thanks Lydia for the clarification,

that request for comments (RFC) [1] aims at gathering feedback both on 
the primary sources tool and the available datasets (especially StrepHit 
[2]), which are closely intertwined: the dataset is in the tool, so 
people can play with both in one single interaction and leave their 
thoughts in the RFC.


Sorry if the title is misleading: the pipeline is indeed semi-automatic, 
as the StrepHit dataset is generated automatically, while its validation 
requires human attention.


Since I'm trying to centralize the discussion, it would be great if you 
could expand in the RFC the 3 fundamental questions you raised.


Best,

Marco

[1] 
https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Semi-automatic_Addition_of_References_to_Wikidata_Statements
[2] 
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References



On 6/14/16 08:27, Lydia Pintscher wrote:

On Tue, Jun 14, 2016 at 1:03 AM Tom Morris <tfmor...@gmail.com> wrote:

I'm confused by this from today's Wikidata weekly summary:

  * New request for comments: Semi-automatic Addition of References
to Wikidata Statements - feedback on the Primary Sources Tool



First of all, the title makes no sense because "semi-automatic
addition of references to Wikidata statements" is one of the main
things that the tool can't currently do. You'll almost always end up
with duplicate statements if there's an existing statement, rather
than the desired behavior of just adding the statement.

Second, I'm not sure who "Hjfocs" is (why does everyone have to make
up fake wikinames?), but why are they asking for more feedback when
there's been *ample* feedback already? There hasn't been an issue
with getting people to test the tool or provide feedback based on
the testing. The issue has been with getting anyone to *act* on the
feedback. Everything is a) "too hard," or b) "beyond our resources,"
or depends on something in category a or b, or is incompatible with
the arbitrary implementation scheme chosen, or some other excuse.

We're 12-18+ months into the project, depending on how you measure,
and not only is the tool not usable yet, but it's no longer
improving, so I think it's time to take a step back and ask some
fundamental questions.

- Is the current data pipeline and front end gadget the right
approach and the right technology for this task? Can they be fixed
to be suitable for users?
- If so, should Google continue to have sole responsibility for it
or should it be transferred to the Wikidata team or someone else
who'll actually work on it?
- If not, what should the data pipeline and tooling look like to
make maximum use of the Freebase data?

The whole project needs a reboot.


I realize you are upset but you are really barking up the wrong tree.
Marco is trying to give the whole thing more structure and sort through
all the requests to find a way forward. He is actually doing something
constructive about the issues you are raising.


Cheers
Lydia
--
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de 

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das
Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.




