Re: [Wikidata-l] How can I increase the throughput of ProteinBoxBot?

zehetner Sat, 18 Oct 2014 02:16:43 -0700

Great work, Andra!
Would it be possible to record, for the Ensembl-related properties (Gene ID,
Transcript ID, etc.), the Ensembl version from which the IDs were extracted,
perhaps by adding a qualifier to the ID value? Entrez seems to provide this
information. Between Ensembl versions these IDs can change drastically in
some cases: a gene can get a completely different ID, or an ID can point to
a different item in two Ensembl versions. Without the Ensembl version
number these IDs are not that useful, especially when trying to correlate
them with older published IDs.


Thanks,
Günther
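
To make the qualifier idea concrete, here is a minimal sketch of what such
a statement could look like in the JSON that wbeditentity accepts. P594 is
the Ensembl Gene ID property; the qualifier property "P9999" is a
placeholder I made up for "Ensembl release" (no such property is named in
this thread), and the gene ID and release number are only illustrative
values.

```python
import json


def string_snak(prop, value):
    """Return a Wikibase snak holding a plain string value."""
    return {
        "snaktype": "value",
        "property": prop,
        "datavalue": {"value": value, "type": "string"},
    }


def ensembl_claim(ensembl_id, release, id_prop="P594", release_prop="P9999"):
    """Build one statement: an Ensembl ID qualified with its release.

    release_prop is a hypothetical placeholder, not a real property.
    """
    return {
        "mainsnak": string_snak(id_prop, ensembl_id),
        "type": "statement",
        "rank": "normal",
        # Qualifiers are grouped by property, each with a list of snaks.
        "qualifiers": {release_prop: [string_snak(release_prop, str(release))]},
    }


# Illustrative values only.
claim = ensembl_claim("ENSG00000123374", 77)
print(json.dumps(claim, indent=2))
```

With the release stored next to the ID, a consumer could at least tell
which Ensembl version a given ID belongs to before comparing it with an
older published one.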

On Fri, 17 Oct 2014 23:29:20 +0200, Andra Waagmeester <[email protected]>
wrote:
> The suggestion to use wbeditentity was great. It took me some time to get
> used to using that call, but finally I managed and the optimisation was
> great. So great that we also finished including the Mouse genome,
> yesterday. It only took 2 days to complete, in contrast to the weeks with
> the human genome. The suggestion to use wbeditentity really made my day.
> 
> Adding the mouse genome to Wikidata did, however, result in ~1000
> duplicates. [1]
>
> The issue is that items already existed with an identical identifier,
> which resulted in unique value violations. [2]
>
> In our current approach we can't prevent this, since the gene description
> is currently the key. We are looking into ways to use the identifier as
> the key instead of the label, as we do now. The simplest option would be
> to add the identifier as an alias, but it would be ideal if we could run
> the same algorithm as the one generating the constraint violations before
> adding a new item. Is this possible? Can a bot query for a claim P351 with
> a given value (e.g. 1017)?
> 
> Any input would be appreciated.
> 
> Regards,
> 
> Andra
> 
> [1] https://www.wikidata.org/wiki/User_talk:Andrawaag#.7E_1000_duplicates
> [2] https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations/P351#.22Unique_value.22_violations
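
The "identifier as key" idea can be sketched independently of how the
lookup itself is done. The sketch below assumes the bot already holds a
mapping from Entrez Gene ID (the P351 value) to item ID; how to obtain that
mapping is exactly the open question above and is not answered here. The
function name, the record layout and the item ID "Q900001" are
hypothetical.

```python
def plan_edits(existing_by_entrez, incoming_genes):
    """Split incoming genes into updates and creations.

    existing_by_entrez: dict mapping an Entrez Gene ID string to the ID of
        the item that already carries it (assumed to be available).
    incoming_genes: list of dicts, each with at least an "entrez_id" key.
    """
    to_update, to_create = [], []
    for gene in incoming_genes:
        item_id = existing_by_entrez.get(gene["entrez_id"])
        if item_id is not None:
            # The identifier is already on an item: edit that item.
            to_update.append((item_id, gene))
        else:
            # The identifier is unknown: only now create a new item.
            to_create.append(gene)
    return to_update, to_create


# Placeholder data; "Q900001" and "99999" are made up for illustration.
existing = {"1017": "Q900001"}
incoming = [
    {"entrez_id": "1017", "label": "gene A"},
    {"entrez_id": "99999", "label": "gene B"},
]
to_update, to_create = plan_edits(existing, incoming)
print(to_update)
print(to_create)
```

Keying on the identifier rather than on the label means a gene whose label
differs from the existing item's label would still be matched to that item
instead of producing a duplicate.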
> 
> 
> 
> 
> 
> On Tue, Sep 30, 2014 at 9:05 PM, Daniel Kinzler <[email protected]>
> wrote:
> 
>> What makes it so slow?
>>
>> Note that you can use wbeditentity to perform complex edits with a single
>> API call. It's not as straightforward to use as, say, wbaddclaim, but much
>> more powerful and efficient.
>>
>> -- daniel
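
As a rough illustration of the "single API call" point, the sketch below
builds the request parameters for one wbeditentity call that sets a label
and an Entrez Gene ID (P351) claim together. It only constructs the
parameters and does not send anything; the helper name, the label and the
token value are placeholders of mine, and a real bot would POST these to
the wiki's api.php with a valid edit token.

```python
import json


def build_wbeditentity_params(label, entrez_id, token, item_id=None):
    """Return POST parameters for one wbeditentity call.

    If item_id is None a new item is requested, otherwise the given
    item is edited. Nothing is sent over the network here.
    """
    data = {
        "labels": {"en": {"language": "en", "value": label}},
        "claims": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P351",
                    "datavalue": {"value": entrez_id, "type": "string"},
                },
                "type": "statement",
                "rank": "normal",
            }
        ],
    }
    params = {
        "action": "wbeditentity",
        "format": "json",
        "data": json.dumps(data),  # the whole edit travels as one JSON blob
        "token": token,
        "bot": 1,
    }
    if item_id is None:
        params["new"] = "item"
    else:
        params["id"] = item_id
    return params


# Placeholder label and token, for illustration only.
params = build_wbeditentity_params("example gene", "1017", "PLACEHOLDER_TOKEN")
print(params["action"], params["new"])
```

More labels, aliases or claims can be added to the same "data" structure,
so the number of requests per item stays at one regardless of how many
identifiers are attached.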
>>
>> On 30.09.2014 19:00, Andra Waagmeester wrote:
>> > Hi All,
>> >
>> > I have joined the development team of the ProteinBoxBot
>> > (https://www.wikidata.org/wiki/User:ProteinBoxBot). Our goal is to make
>> > Wikidata the canonical resource for referencing and translating
>> > identifiers for genes and proteins from different species.
>> >
>> > Currently, adding all genes from the human genome and their related
>> > identifiers to Wikidata takes more than a month to complete. With the
>> > objective to add other species, as well as having frequent updates for
>> > each of the genomes, it would be convenient if we could increase this
>> > throughput.
>> >
>> > Would it be acceptable to increase the throughput by running multiple
>> > instances of ProteinBoxBot in parallel? If so, what would be an
>> > acceptable number of parallel instances of a bot to run? We can run
>> > multiple instances from different geographical locations if necessary.
>> >
>> > Kind regards,
>> >
>> >
>> > Andra
>> >
>> >
>> >
>> >
>> > _______________________________________________
>> > Wikidata-l mailing list
>> > [email protected]
>> > https://lists.wikimedia.org/mailman/listinfo/wikidata-l
>> >
>>
>>
>> --
>> Daniel Kinzler
>> Senior Software Developer
>>
>> Wikimedia Deutschland
>> Gesellschaft zur Förderung Freien Wissens e.V.
>>
>>

