[Wikidata] Re: WoolNet: New application to find connections between Wikidata entities

2023-07-26 Thread Tom Morris
On Wed, Jul 26, 2023 at 3:58 PM Aidan Hogan  wrote:

>
> Thoughts, comments, questions, etc., very welcome!
>

Surely you could have found more appropriate subjects than Hitler and
Mussolini!

Tom
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/436ZQNT4V7BGUN47TFHS6YFXWTBURTCO/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


Re: [Wikidata] Questions about Mix'n'match tool

2020-05-07 Thread Tom Morris
Speaking of workflow, this Mix'n'Match report page says
"*If you fix something from this list on Wikidata, please fix it on
Mix'n'match as well, if applicable*" without giving any directions or hints
as to how one might accomplish that. I just cleaned up "Multiple Wikidata
items with external ID OL4859603A: Pierre Vidal (Q18002076), Pierre Vidal
(Q3387281)".

How would I tell Mix'n'Match about that? Or, better yet, why can't the tool
be taught not to overwrite human edits (particularly reverts)? To be fair,
the original confusion is understandable, because the two entries/gentlemen
are very confusable.

Tom

On Wed, May 6, 2020 at 4:28 AM Magnus Manske 
wrote:

> Hi,
>
> I am the author of Mix'n'match, so I hope I can answer your questions.
>
> Match mode:
> By default, "match mode" only shows unmatched entries, example:
> https://tools.wmflabs.org/mix-n-match/#/random/473
>
> You can force pre-matched entries, but currently they won't show the
> automatic predictions:
> https://tools.wmflabs.org/mix-n-match/#/random/473/prematched
>
> If you/others like it, I can have predictions show for auto-matched as
> well, and/or mix unmatched and pre-matched in results.
>
> "Next entry" (aka skip) will never change the bucket
> (pre-matched/unmatched).
>
>
> Mobile game:
> The mobile game shows unmatched only. There is currently no override.
> This would be easy to change as well, if desired, though I'd probably have
> to show/highlight the pre-matched entry somehow.
>
> "Skip" will never change the bucket (pre-matched/unmatched).
>
>
> Visual tool:
> The visual tool will show entries from both pre-matched and unmatched.
>
> "Load another one" (aka "skip") will never change the bucket
> (pre-matched/unmatched).
>
> I hope that answers your questions, please let me know if there is
> anything else I can do!
>
> Cheers,
> Magnus
>
> On Wed, May 6, 2020 at 7:26 AM Palonen, Tuomas Y E <
> tuomas.palo...@helsinki.fi> wrote:
>
>> Hello,
>>
>> I am an information specialist at the National Library of Finland, where
>> I am linking our General Finnish Ontology YSO (30,000+ concepts with terms
>> in Finnish-Swedish-English) to Wikidata at the moment. I just joined this
>> mailing list. I am currently using Mix'n'match tool and would have a couple
>> of questions. I would be very happy for any answers or contact info to
>> someone who might have the answers. Thank you very much! Here are my
>> questions:
>>
>>- Is there a connection in Mix'n'match between the Preliminarily
>>matched/Unmatched division and any of the three: Match mode, Mobile
>>matching, Visual tool?
>>- More precisely, are the link suggestions provided by Match mode /
>>Mobile matching / Visual tool created only from Preliminary matched list,
>>only from Unmatched list or from both?
>>- Also, if I reject a link suggestion in Match mode / Mobile matching
>>/ Visual tool, will that concept be added to the Unmatched list?
>>
>> This would be important for my own (and possibly anybody else's)
>> Mix'n'match work flow / method. If it was up to me, I would suggest
>> rejected concepts not be added to the Unmatched list, but that's just me. I
>> am planning to write a report of my linking project in the future, where I
>> will most likely include some development suggestions / wishes. One
>> immediate wish is that it would be great to tag or somehow put aside link
>> suggestions that require more research and cannot be decided on the spot
>> (for example, Wikidata items may include poor Finnish/Swedish terms and may
>> require some corrections before I can do the linking).
>>
>> Thanks for your help!
>> Best regards,
>> Tuomas / National Library of Finland
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata in the LOD Cloud

2018-06-29 Thread Tom Morris
On Wed, Jun 27, 2018 at 4:40 PM Federico Leva (Nemo) 
wrote:

> I don't see any corresponding URL in
> http://lod-cloud.net/versions/2018-30-05/lod-data.json


A complete aside, but who chose the date format for that URL? That's wacky!

Tom
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] NLP text corpus annotated with Wikidata entities?

2017-02-06 Thread Tom Morris
I don't know of such a resource off-hand, but you might want to consider
expanding your search to text corpuses annotated with Freebase or Google
Knowledge Graph IDs (the same IDs are used for both). Wikidata contains
mappings to Freebase IDs, although it is somewhat incomplete (and this
additional mapping adds an extra layer of variability).
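For anyone who wants to exploit those mappings, here's a minimal sketch of
pulling the Freebase ID (P646) links out of the Wikidata Query Service. The
endpoint and property are the standard ones; the LIMIT and output handling
are just illustrative, and it assumes the `requests` library is installed.

    # Sketch: fetch Wikidata <-> Freebase ID (P646) mappings from the public
    # SPARQL endpoint. LIMIT is illustrative; drop it or page through results
    # for a full extract.
    import requests

    QUERY = """
    SELECT ?item ?freebaseId WHERE {
      ?item wdt:P646 ?freebaseId .
    }
    LIMIT 100
    """

    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "freebase-mapping-example/0.1"},
    )
    resp.raise_for_status()

    for row in resp.json()["results"]["bindings"]:
        qid = row["item"]["value"].rsplit("/", 1)[-1]   # e.g. "Q42"
        mid = row["freebaseId"]["value"]                # e.g. "/m/0282x"
        print(qid, mid)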

The other issue is that all of the corpuses that I'm aware of are
automatically annotated, so they're not "gold standard" truth sets, but you
could cherry-pick the high-confidence annotations and/or do additional
human verification.

Two that I know of are:

ClueWeb09 & ClueWeb12 - 800M documents, 11B "clues" -
https://research.googleblog.com/2013/07/11-billion-clues-in-800-million.html
TREC KBA Stream Corpus 2014 - 394M documents, 9.4B mentions -
http://trec-kba.org/data/fakba1/

I haven't seen any recent releases of similar stuff. Not sure what
identifiers Google will use for this kind of work in the future now that
they've shut down Freebase.

Tom


On Sun, Feb 5, 2017 at 9:47 AM, Samuel Printz 
wrote:

> Hello everyone,
>
> I am looking for a text corpus that is annotated with Wikidata entites.
> I need this for the evaluation of an entity linking tool based on
> Wikidata, which is part of my bachelor thesis.
>
> Does such a corpus exist?
>
> Ideal would be a corpus annotated in the NIF format [1], as I want to
> use GERBIL [2] for the evaluation. But it is not necessary.
>
> Thanks for hints!
> Samuel
>
> [1] https://site.nlp2rdf.org/
> [2] http://aksw.org/Projects/GERBIL.html
>
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Confusion about memorials: Q5003624 (Gedenkstätte) and Q1885014 (Mahnmal)

2017-01-07 Thread Tom Morris
Based on the definitions here: https://en.wiktionary.org/wiki/Mahnmal
I'd say the subclass relationship is backwards.
It also sounds like it's a loan word in English, so Mahnmal could be used
as the English label.

Tom

On Sat, Jan 7, 2017 at 11:36 AM, Edward Betts  wrote:

> Q5003624 (memorial) is a subclass of Q1885014 (no English label).
>
> Labels for Q5003624   https://www.wikidata.org/wiki/Q5003624
> English: memorial
> French: mémorial
> German: Gedenkstätte
> Portuguese: Monumento comemorativo
> Spanish: Monumento conmemorativo
>
> Labels for Q1885014   https://www.wikidata.org/wiki/Q1885014
> English: no label
> French: monument commémoratif
> German: Mahnmal
>
> These are the relevant articles on German language Wikipedia:
> Q5003624: https://de.wikipedia.org/wiki/Gedenkst%C3%A4tte
> Q1885014: https://de.wikipedia.org/wiki/Mahnmal
>
> The English description of Q1885014 is "type of monument that serves as a
> warning". Is that right?
>
> Should we invent an English label, for Q1885014? How about "Warning
> memorial"?
>
> It seems like the French label for Q1885014 is the same as the
> Portuguese
> and Spanish labels of Q5003624. Maybe one of these labels is wrong.
>
> Is Q5003624 (Gedenkstätte) genuinely a subclass of Q1885014 (Mahnmal)?
> --
> Edward.
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Genes, proteins, and bad merges in general

2016-06-14 Thread Tom Morris
Hi Gerard. There's often a tension between supporting "power users" and the
regular users, but in this case, I left out a little nuance - if you
flagged an item created by yourself for either deletion or merger and no
one else had edited it in the meantime, the operation was processed
automatically without having to go through the voting process. This allowed
everyone to fix their own mistakes quickly. Finding the right balance for
these processes typically takes a little tuning.
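For what it's worth, here's a rough sketch of the triage logic described
above. The names and data structures are hypothetical and it's not how
Freebase actually implemented it; the point is just the flow: self-flagged
and untouched goes straight through, otherwise votes, with conflicts escalated.

    # Rough sketch of the merge/delete triage described above (hypothetical names).
    def triage(flag, votes):
        """flag: dict with 'proposer', 'creator', 'edited_by_others' (bool).
        votes: list of 'approve'/'reject' strings from other reviewers."""
        if flag["proposer"] == flag["creator"] and not flag["edited_by_others"]:
            return "auto-process"          # let people fix their own mistakes quickly
        if votes and all(v == "approve" for v in votes):
            return "merge"                 # unanimous approval
        if "approve" in votes and "reject" in votes:
            return "escalate"              # conflicting votes go to a second-level queue
        return "await-votes"               # not enough eyes yet

    # Example: a newbie's "same name, must be the same thing" proposal waits for review.
    print(triage({"proposer": "newbie", "creator": "bot", "edited_by_others": True}, []))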

I forgot to mention another aspect of the current merge process that I
think is dangerous and have seen cause problems: merge "games."
High impact operations like merges seem like a particularly poor fit for
gamification, particularly when there's no safety net such as a second set
of eyes.

Tom

On Tue, Jun 14, 2016 at 4:08 PM, Gerard Meijssen <gerard.meijs...@gmail.com>
wrote:

> Hoi,
> I add "many" entries. As a consequence I make the occasional mistake.
> Typically I find them myself and rectify. When you interfere with that, I
> can no longer sort out the mess I make. That is fine. It is then for
> someone else to fix.
> Thanks,
>  GerardM
>
> On 14 June 2016 at 21:20, Benjamin Good <ben.mcgee.g...@gmail.com> wrote:
>
>> Hi Tom,
>>
>> I think the example you have there is actually linked up properly at the
>> moment?
>> https://en.wikipedia.org/wiki/SPATA5 is about both the gene and the
>> protein as are most Wikipedia articles of this nature.  And it is linked to
>> the gene the way we encourage modeling
>> https://www.wikidata.org/wiki/Q18052679  - and indeed the protein item
>> is not linked to a Wikipedia article again following our preferred pattern.
>>
>> For the moment...  _our_ merge problem seems to be mostly resolved.
>> Correcting the sitelinks on the non-english Wikipedias in a big batch
>> seemed to help slow the flow dramatically.  We have also introduced some
>> flexibility into the Lua code that produces infobox_gene on Wikipedia.  It
>> can handle most of the possible situations (e.g. wikipedia linked to
>> protein, wikipedia linked to gene) automatically so that helps prevent
>> visible disasters..
>>
>> On the main issue you raise about merges..  I'm a little on the fence.
>> Generally I'm opposed to putting constraints in place that slow people down
>> - e.g. we have a lot of manual merge work that needs to be done in the
>> medical arena and I do appreciate that the current process is pretty fast.
>> I guess I would advocate a focus on making the interface more vehemently
>> educational as a first step.  E.g. lots of 'are you sure' etc. forms to
>> click through but ultimately still letting people get their work done
>> without enforcing an approval process.
>>
>> -Ben
>>
>> On Tue, Jun 14, 2016 at 10:53 AM, Tom Morris <tfmor...@gmail.com> wrote:
>>
>>> Bad merges have been mentioned a couple of times recently and I think
>>> one of the contexts with Ben's gene/protein work.
>>>
>>> I think there are two general issues here which could be improved:
>>>
>>> 1. Merging is too easy. Because splitting/unmerging is much harder than
>>> merging, particularly after additional edits, the process should be biased
>>> to make merging more difficult.
>>>
>>> 2. The impedance mismatch between Wikidata and Wikipedias tempts
>>> wikipedians who are new to wikidata to do the wrong thing.
>>>
>>> The second is a community education issue which will hopefully improve
>>> over time, but the first could be improved, in my opinion, by requiring
>>> more than one person to approve a merge. The Freebase scheme was that
>>> duplicate topics could be flagged for merge by anyone, but instead of
>>> merging, they'd be placed in a queue for voting. Unanimous votes would
>>> cause merges to be automatically processed. Conflicting votes would get
>>> bumped to a second level queue for manual handling. This wasn't foolproof,
>>> but caught a lot of the naive "these two things have the same name, so they
>>> must be the same thing" merge proposals by newbies. There are lots of
>>> variations that could be implemented, but the general idea is to get more
>>> than one pair of eyes involved.
>>>
>>> A specific instance of the structural impedance mismatch is enwiki's
>>> handling of genes & proteins. Sometimes they have a page for each, but
>>> often they have a single page that deals with both or, worse, a page whose
>>> text says it's about the protein, but where the page includes a gene infobox.

Re: [Wikidata] WDQS URL shortener

2016-06-01 Thread Tom Morris
On Thu, Jun 2, 2016 at 12:22 AM, Julie McMurry 
wrote:

> While I agree the primary aim isn't shortening, the result is usually much
> shorter by virtue of cutting out everything non essential to
> identification.


Except in the case of a giant query string, such as a complex SPARQL query.
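For concreteness, a little sketch of how even a modest SPARQL query balloons
once it's URL-encoded into a WDQS link; the query text here is purely
illustrative.

    # Sketch: URL-encode a modest SPARQL query the way a WDQS permalink does,
    # and see how long the resulting URL gets.
    from urllib.parse import quote

    sparql = """
    SELECT ?person ?personLabel ?birth WHERE {
      ?person wdt:P31 wd:Q5 ;
              wdt:P569 ?birth .
      FILTER(YEAR(?birth) = 1879)
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    LIMIT 100
    """

    url = "https://query.wikidata.org/#" + quote(sparql)
    print(len(url), "characters")   # typically several hundred, even for a short query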
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Status and ETA External ID conversion

2016-05-08 Thread Tom Morris
Has the identifier migration stalled? I was just looking at this page:

https://www.wikidata.org/wiki/Q622828

and the first 9 claims on the page are all identifiers. There are only two
(Freebase & Disease Ontology) in the identifier section at the bottom of
the page.

Tom
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata ontology

2016-05-01 Thread Tom Morris
On Sat, Apr 30, 2016 at 7:34 PM, Jan Macura  wrote:

>
> I've been using the  namespace for
> datatype properties for some time (more than a year).
> Now I can see everywhere only the  ns.
> Was there some reason for change? Are these two somehow compatible? Will
> the first one be deprecated?
>

The original is currently a 404, which isn't very "cool URI"-ish. Shouldn't
the ontology still be
available for those who have used it in the past?

Are there additional URIs for non-Swedish versions of the new ontology?

Tom
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] item / claim usage tracking

2016-04-26 Thread Tom Morris
On Tue, Apr 26, 2016 at 1:01 PM, Dan Garry  wrote:

> On 26 April 2016 at 08:41, Benjamin Good  wrote:
>
>> Perhaps you could use the query log (just the list of SPARQL queries) and
>> utilize an offline installation of the query service to execute them and
>> generate aggregate statistics?
>>
>
> As a rule of thumb, if you think you've found a convenient way around
> needing an NDA... you probably haven't. ;-)
>
> The log of the list of queries would also be covered under the privacy
> policy. The log contains arbitrary, free-form user input and therefore is
> treated as containing personally identifying information until proven
> otherwise. You're correct that aggregates (like the ones that you're after)
> are generally fine to release publicly, but the person creating those
> aggregates would still need an NDA.
>

I understood Ben's "you" in the snippet that you quoted to refer to the
Wikidata team. Does their Wikimedia employment contract not include the
necessary non-disclosure provisions?

Tom
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] StatBank at Statistics Norway will be accessible through new open API

2016-04-23 Thread Tom Morris
I'm not sure what this has to do with Wikidata, but since the topic was
brought up, why did they feel the need to invent a new license? I'm not
sure I'd consider their license "open" despite its name since, although the
original license grant is perpetual and they allow redistribution, they
withhold sublicensing rights and require people receiving redistributions
to get their own license directly from Statistics Norway. This effectively
means that redistributions aren't guaranteed to be granted licenses in
perpetuity since they could decide to stop granting licenses whenever they
want.

In addition to making it impossible for anyone to build a commercial
information product with it, it probably also means that the data can't be
used in Wikidata. (Well, actually even the attribution requirement is in
conflict with Wikidata's CC0).

Tom

On Sat, Apr 23, 2016 at 1:00 PM, John Erling Blad  wrote:

> At Statistics Norway (SSB) there is a service called "StatBank Norway"
> ("Statistikkbanken").[1][2] For some time it has been possible to access
> this through an open API, serving JSON-stat.[4] Now they open up all the
> remaining access and all 5000 tables will be made available.[3]
>
> SSB use NLOD,[5][6] an open license on their published data. (I asked them
> and all they really want is the source to be clearly given so to avoid
> falsified data.)
>
> [1] https://www.ssb.no/en/statistikkbanken
> [2]
> https://www.ssb.no/en/informasjon/om-statistikkbanken/how-to-use-statbank-norway
> [3]
> http://www.ssb.no/omssb/om-oss/nyheter-om-ssb/ssb-gjor-hele-statistikkbanken-tilgjengelig-som-apne-data
> (Norwegian)
> [4] https://json-stat.org/
> [5] http://www.ssb.no/en/informasjon/copyright
> [6] http://data.norge.no/nlod/en/1.0
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wordnet mappings

2016-04-12 Thread Tom Morris
Freebase has 40,760 WordNet synsets mapped to equivalent Freebase topics
and another 2,300 mapped to
broader topics. This could be used to bootstrap any efforts in this space.
Of course, it's out of a total of almost 118,000 synsets, so it's only
35-40% coverage, but it probably covers the most popular/common topics.
Something to be careful of is semantic drift in Wikipedia/Wikidata, since
these alignments were originally done back in 2009.

Here's the schema  for the
types associated with Wordnet in Freebase. (Yes, I know types are evil and
illegal in Wikidata, but just consider them collections of properties
grouped together for human organizational convenience.)

BTW, I agree with Daniel that care is needed here. It's easy for casual
observers to get confused about what should be connected, particularly
since Wikipedia articles aren't necessarily divided along strict semantic
lines.

Tom

On Tue, Apr 12, 2016 at 5:17 AM, Daniel Kinzler  wrote:

> Am 12.04.2016 um 08:42 schrieb Stas Malyshev:
> > Hi!
> >
> >> Is there a property for WordnetId?
>
> More mappings are always good. The case of WordNet is a bit tricky though,
> since
> WordNet is about words, not concepts. Wikidata items can perhaps be mapped
> to
> SynSets, but we still have to be careful not to get confused about the
> semantics.
>
>
> --
> Daniel Kinzler
> Senior Software Developer
>
> Wikimedia Deutschland
> Gesellschaft zur Förderung Freien Wissens e.V.
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata-tech] should MathML dictate a specific graphical rendering

2016-04-07 Thread Tom Morris
I agree that it's worthwhile to take a step back and consider the bigger
picture, but wouldn't a more appropriate discussion for a Wikidata list be
-- is there a critical need to represent mathematical notations in Wikidata
and, if so, what form should that take?

On Thu, Apr 7, 2016 at 7:25 PM, Paul Topping  wrote:

> Rather than discussing whether MathML is a failed standard, web or
> otherwise, I recommend we discuss specific, constructive topics. I suggest
> the discussion be in the context of MathML where appropriate, not because I
> want to defend MathML but because it is an existing standard. It is a place
> to start. If the solutions we reach replace MathML all or in part, so be
> it. Let's not start by throwing it out but by addressing its problems. We
> can certainly create a new standard if MathML can't be fixed. Finally, if
> this is the wrong venue for this topic or any other, please suggest a
> better one. If there are other parties that need to know about the
> discussion, please let them know.
>
> Assuming others agree, let’s start with perhaps an important issue. Should
> Presentation MathML dictate a specific rendering or leave formatting
> choices up to the renderer. Peter says, "I have the impression people
> generally expect consistent rendering across browsers. But anecdotal
> evidence is, well, anecdotal." I would agree with this statement. People do
> expect this. I believe they get that expectation from TeX but it does make
> sense. Why would a user want a different rendering in a different browser?
>
> The reason I said "no" to this before was because the MathML spec leaves a
> lot of rendering decisions up to the implementation. Someone reading the
> MathML spec should NOT expect all renderings to be the same. In fact, the
> spec doesn't specify the rendering at the required level of detail. Doing
> so would be difficult. TeX doesn't specify its rendering in detail either
> except via the code itself. In other words, the only proper rendering of
> TeX is that done by TeX itself.
>
> We could create a MathML 4 in which the graphical rendering is specified
> in writing and in detail. Implementations would be constrained much more
> than by the current spec. Another way to achieve this goal is to create a
> reference implementation. This would be the TeX way, or close to it.
>
> We could even map MathML onto TeX somehow and then defer to TeX's
> rendering. The MathML spec would be annotated by TeX templates (perhaps
> macros) that serve to define the rendering. The reference implementation
> would consist of a MathML-to-TeX convertor and the TeX engine itself.
> Implementations that intend to abide by the MathML 4 spec could use the
> reference implementation or roll their own.
>
> When I say rendering above, I only mean graphical rendering. When we talk
> about audio or braille rendering, things are much less clear. The state of
> the art in MathML-to-speech has certainly not reached a point where
> everyone can agree. Besides, there is personal taste of the reader and
> multiple languages to consider.
>
> Ok, I'll stop there and take a breath.
>
> Paul
> ___
> Wikidata-tech mailing list
> Wikidata-tech@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
>
___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata] Status and ETA External ID conversion

2016-03-10 Thread Tom Morris
On Thu, Mar 10, 2016 at 12:58 PM, Egon Willighagen <
egon.willigha...@gmail.com> wrote:

> On Thu, Mar 10, 2016 at 6:12 PM, Tom Morris <tfmor...@gmail.com> wrote:
> > On Wed, Mar 9, 2016 at 7:37 PM, Stas Malyshev <smalys...@wikimedia.org>
> wrote:
> > From a machine processing point of view, a more interesting statement is
> probably:
> >
> > wd:Q1000336 owl:sameAs <https://rdf.freebase.com/ns/m.03pvzn>
>
> Yes, but this proposal matches part of the discussion...


Actually, it doesn't, but for some reason you chose not to quote the
original URL which showed the difference. The URL
https://www.freebase.com/m/03pvzn is
not the same as the URI above.


> owl:sameAs is
> in many cases not appropriate and likely should not be the goal in the
> first place: in many cases there is not such a clear 1-to-1 relation,
> and even if there is a 1-to-1 relation, the above may still be
> inappropriate.


So choose a predicate that you think is more appropriate. The important
thing is that the URIs match so that computers can tell that they're the
same thing.
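For illustration, a small sketch of the distinction: Wikidata's Freebase ID
property (P646) stores values like "/m/03pvzn", and only one particular
rewrite of that value gives the RDF URI form quoted above, while the
human-facing web URL is something else again. The function below is just a
sketch of that rewrite, not anyone's official conversion code.

    # Sketch: turn a Freebase ID as stored in Wikidata's P646 ("/m/03pvzn")
    # into the two forms discussed above. Only the RDF URI is the identifier
    # that should be matched when generating owl:sameAs links.
    def freebase_forms(mid: str):
        assert mid.startswith("/m/") or mid.startswith("/g/")
        rdf_uri = "https://rdf.freebase.com/ns/" + mid[1:].replace("/", ".")
        web_url = "https://www.freebase.com" + mid   # human-facing page, not a URI to match on
        return rdf_uri, web_url

    rdf_uri, web_url = freebase_forms("/m/03pvzn")
    print(rdf_uri)   # https://rdf.freebase.com/ns/m.03pvzn
    print(web_url)   # https://www.freebase.com/m/03pvzn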

Tom
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Status and ETA External ID conversion

2016-03-06 Thread Tom Morris
On Sun, Mar 6, 2016 at 5:31 PM, Lydia Pintscher <
lydia.pintsc...@wikimedia.de> wrote:

> On Sun, Mar 6, 2016 at 10:56 PM Stas Malyshev 
> wrote:
>
>> Is there a process somewhere of how the checking is done, what are
>> criteria, etc.? I've read
>> https://www.wikidata.org/wiki/User:Addshore/Identifiers but there's a
>> lot of discussion but not clear if it ever come to some end. Also not
>> clear what the process is - should I just move a property I like to
>> "good to convert"? Should I run it through some checklist first? Should
>> I ask somebody?
>>
>
> Yes. Good ones should be moved to good to convert. If no-one disagrees
> we'll convert them.
>

So, no decision criteria? Just whatever we individually like?

>> What are the rules for "disputed" - is some process for review planned?
>
> Let's concentrate on the ones people can agree on for now. We'll tackle
> the ones that are disputed in the next step. If editors can't sort it out I
> will make an executive decision at some point but I don't think this will
> be needed.
>

I think the fact that some obviously good identifiers like IMDb have been
blocked has made potential contributors unsure how to evaluate other
candidates which would also, on the surface, seem obviously good.

Perhaps since the criteria aren't being used, someone could just delete all
the proposed criteria from the page and replace the old text with something
like "Whatever you, personally, think is best" so that people know what's
expected of them? That might help break the logjam. I know it would make me
more comfortable in contributing.

Tom
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Status and ETA External ID conversion

2016-03-06 Thread Tom Morris
If an identifier system provides for merging of entities along with the
retention of both their previous IDs (as all good identifier systems which
guarantee stable identifiers should), duplicate IDs are inevitable. Well-known
examples include Freebase, MusicBrainz, OpenLibrary, and yes, even
Wikipedia & Wikidata. Duplicates may be silently resolved (as with Freebase),
handled via redirects (OpenLibrary and Wiki*), or a hybrid (MusicBrainz,
where some page types redirect and others don't). Merged identities may
be relatively rare (Freebase) or more common (OpenLibrary, MusicBrainz),
but they'll always happen. Mandating uniqueness would force the "losing"
IDs to be deleted from Wikidata, losing the benefit that they bring for
enhancing and strengthening the mesh of identifiers.

I've looked at the identifier list a couple of times with an eye towards
helping with the curation, but I could never make heads nor tails of what
the criteria were, whether there was consensus about the criteria, why some
perfectly acceptably identifiers were being vehemently argued against and
one what grounds, etc. The "community" driving this process on those wiki
pages seems to be just a handful of vocal and opinionated people. Is that
going to generate good results?

Tom

On Sun, Mar 6, 2016 at 4:17 AM, Markus Krötzsch <
mar...@semantic-mediawiki.org> wrote:

> Another reason why "uniqueness" is not such a good criterion: it cannot be
> applied to decide the type of a newly created property (no statements, no
> uniqueness score). In general, the fewer statements there are for a
> property, the more likely they are to be unique. The criterion rewards data
> incompleteness (example: if Luca deletes the six multiple ids he mentioned,
> then the property could be converted -- and he could later add the
> statements again). If you think about it, it does not seem like a very good
> idea to make the datatype of a property depend on its current usage in
> Wikidata.
>
> Markus
>
>
> On 05.03.2016 17:15, Markus Krötzsch wrote:
>
>> Hi,
>>
>> I agree with Egon that the uniqueness requirement is rather weird. What
>> it means is that a thing is only considered an "identifier" if it points
>> to a database that uses a similar granularity for modelling the world as
>> Wikidata. If the external database is more fine-grained than Wikidata
>> (several ids for one item), then it is not a valid "identifier",
>> according to the uniqueness idea. I wonder what good this may do. In
>> particular, anybody who cares about uniqueness can easily determine it
>> from the data without any property type that says this.
>>
>> Markus
>>
>>
>> On 05.03.2016 15:35, Egon Willighagen wrote:
>>
>>> On Sat, Mar 5, 2016 at 3:25 PM, Lydia Pintscher
>>>  wrote:
>>>
 On Sat, Mar 5, 2016 at 3:17 PM Egon Willighagen
 

> What is the exact process? Do you just plan to wait longer to see if
> anyone supports/contradicts my tagging? Should I get other Wikidata
> users and contributors to back up my suggestion?
>

 Add them to the list Katie linked if you think they should be
 converted. We
 wait a bit to see if anyone disagrees and I also do a quick sanity
 check for
 each property myself before conversion.

>>>
>>> I am adding comments for now. I am also looking at the comments for
>>> what it takes to be "identifier":
>>>
>>>
>>> https://www.wikidata.org/wiki/User:Addshore/Identifiers#Characteristics_of_external_identifiers
>>>
>>>
>>> What is the resolution in these? There are some strong, often
>>> contradicting opinions...
>>>
>>> For example, the uniqueness requirement is interesting... if an
>>> identifier must be unique for a single Wikidata entry, this is
>>> effectively disqualifying most identifiers used in the life
>>> sciences... simply because Wikidata rarely has the exact same concept
>>> in Wikidata as it has in the remote database.
>>>
>>> I'm sure we can give examples from any life science field, but
>>> consider a gene: the concept of a gene in Wikidata is not like a gene
>>> sequence in a DNA sequence database. Hence, an identifier from that
>>> database could not be linked as "identifier" to that Wikidata entry.
>>>
>>> Same for most identifiers for small organic compounds (like drugs,
>>> metabolites, etc). I already commented on CAS (P231) and InChI (P234),
>>> both are used as identifier, but none are unique to concepts used as
>>> "types" in Wikidata. The CAS for formaldehyde and formaline is
>>> identical. The InChI may be unique, but only if you strongly type the
>>> definition of a chemical graph instead of a substance (as is now)...
>>> etc.
>>>
>>> So, in order to make a decision which chemical identifiers should be
>>> marked as "identifier" type depends on resolution of those required
>>> characteristics...
>>>
>>> Can you please inform me about the state of those characteristics
>>> (accepted or declined)?
>>>
>>> Egon
>>>
>>> 

Re: [Wikidata] from Freebase to Wikidata: the great migration

2016-02-23 Thread Tom Morris
On Tue, Feb 23, 2016 at 1:52 PM, Stas Malyshev 
wrote:

>
> > As Gerard has pointed out before, he prefers to re-enter statements
> > instead of approving them. This means that the real number of "imported"
> > statements is higher than what is shown in the dashboard (how much so
> > depends on how many statements Gerard and others with this approach have
> > added). It seems that one should rather analyse the number of statements
>
> Yes, I do that sometimes too - if there is a statement saying "spouse:
> X" on wikidata, and statement in Freebase saying the same but with the
> start date, or the Freebase one has more precise date than the Wikidata
> one, such as full date instead of just year, I will modify the original
> statement and reject the Freebase one.


I filed a bug report for this yesterday:
https://github.com/google/primarysources/issues/73
I'll add the information about more precise qualifiers, since I didn't
address that part.


> I'm not sure this is the best
> practice with regard to tracking numbers but it's easiest and even if my
> personal numbers do not matter too much I imagine other people do this
> too. So rejection does not really mean the data was not entered - it may
> mean it was entered in a different way. Sometimes also while the data is
> already there, the reference is not, so the reference gets added.
>

Even if you don't care about your personal numbers, I'd argue that not
being able to track the quality of data sources feeding the Primary Sources
tool is an issue.  It's valuable to measure quality not only for entire
data sets but also for particular slices of them, since data sources, at
least large ones like Freebase, are rarely homogenous in quality.

It's also clearly an issue that the tool is so awkward that people are
working around it instead of having it help them.

Tom
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata stats

2016-02-23 Thread Tom Morris
On Tue, Feb 23, 2016 at 12:21 PM, Magnus Manske <magnusman...@googlemail.com
> wrote:

> Depends on your definition of "real". Wikidata Special Page stats count
> all non-empty (as in "have statements", AFAIK) items, my stats
> (wikidata-todo) count all items. Pick your truth.
>

Thanks, that's very helpful. And labels and descriptions don't count as
statements, right? So, the 3M delta represents items with a label, and
perhaps a description, but no other information. Are Wikipedia links also
in the "not a statement" category?

Tom


> On Tue, Feb 23, 2016 at 5:16 PM Tom Morris <tfmor...@gmail.com> wrote:
>
>> What is the canonical, authoritative source to find statistics about
>> Wikidata?
>>
>> On Sun, Feb 21, 2016 at 11:41 AM, Markus Krötzsch <
>> mar...@semantic-mediawiki.org> wrote:
>>
>>>
>>> Is it possible that you have actually used the flawed statistics from
>>> the Wikidata main page regarding the size of the project? 14.5M items in
>>> Aug 2015 seems far too low a number. Our RDF exports from mid August
>>> already contained more than 18.4M items. It would be nice to get this fixed
>>> at some point. There are currently almost 20M items, and the main page
>>> still shows only 16.5M.
>>
>>
>> I see the following counts:
>>
>> 16.5M (current?) - https://www.wikidata.org/wiki/Special:Statistics
>> 19.2M (December 2015) - https://tools.wmflabs.org/wikidata-todo/stats.php
>>
>> Where do I look for the real number?
>>
>> Tom
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] from Freebase to Wikidata: the great migration

2016-02-23 Thread Tom Morris
On Tue, Feb 23, 2016 at 1:28 AM, Markus Krötzsch <
mar...@semantic-mediawiki.org> wrote:

> On 22.02.2016 18:28, Tom Morris wrote:
>
>> On Sun, Feb 21, 2016 at 4:25 PM, Markus Krötzsch
>> <mar...@semantic-mediawiki.org> wrote:
>>     On 21.02.2016 20:37, Tom Morris wrote:
>> On Sun, Feb 21, 2016 at 11:41 AM, Markus Krötzsch
>> <mar...@semantic-mediawiki.org> wrote:
>>
>>  On 18.02.2016 15:59, Lydia Pintscher wrote:
>>
>>  Thomas, Denny, Sebastian, Thomas, and I have published
>> a paper
>>  which was
>>  accepted for the industry track at WWW 2016. It covers
>> the migration
>>  from Freebase to Wikidata. You can now read it here:
>> http://research.google.com/pubs/archive/44818.pdf
>>
>>  Is it possible that you have actually used the flawed
>> statistics
>>  from the Wikidata main page regarding the size of the
>> project? 14.5M
>>  items in Aug 2015 seems far too low a number. Our RDF
>> exports from
>>  mid August already contained more than 18.4M items. It
>> would be nice
>>  to get this fixed at some point. There are currently almost
>> 20M
>>  items, and the main page still shows only 16.5M.
>>
>> Numbers are off throughout the paper.  They also quote 48M
>> instead of
>> 58M topics for Freebase and mischaracterize some other key
>> points. The
>> key number is that 3.2 billion facts for 58 million topics has
>> generated
>> 106,220 new statements for Wikidata. If my calculator had more
>> decimal
>> places, I could tell you what percentage that is.
>>
>> Obviously, any tool can only import statements for which we have
>> items and properties at all, so the number of importable facts is
>> much lower.
>>
>> Obviously, but "much lower" from 3.2B is probably something like
>> 50M-300M, not 0.1M.
>>
>
> That estimate might be a bit off. The paper contains a detailed discussion
> of this aspect.


Or the paper might be off. Addressing the flaws in the paper would require
a full paper in its own right.

I don't mean to imply that numbers are the only thing that's important;
they're just one measure of how much value has been extracted from the
Freebase data. Still, the relative magnitudes of the numbers are startling.


> The total number of statements that could be translated from Freebase to
> Wikidata is given as 17M, of which only 14M were new. So this seems to be
> the current upper bound of what you could import with PS or any other tool.


That's an upper bound using that particular methodology, in which only 4.5M
of the 20M Wikidata topics were mapped when, given that Wikidata items have
to appear in a Wikipedia and that Freebase includes all of English
Wikipedia, one would expect a much higher percentage to be mappable.


> The authors mention that this already includes more than 90% of the
> "reviewed" content of Freebase that refers to Wikidata items. The paper
> seems to suggest that these mapped+reviewed statements were already
> imported directly -- maybe Lydia could clarify if this was the case.
>

More clarity and information is always welcome, but since this is mentioned
as a possible future work item in Section 7, I'm guessing it wasn't done
yet.

>
> It seems that if you want to go to the dimensions that you refer to
> (50M/300M/3200M) you would need to map more Wikidata items to Freebase
> topics in some way. The paper gives several techniques that were used to
> obtain mappings that are already more than what we have stored in Wikidata
> now. So it is probably not the lack of mappings but the lack of items that
> is the limit here. Data can only be imported if we have a page at all ;-)
>

If it's true that only 25% of Wikidata items appear in Freebase, I'd be
amazed (and I'd like to see an analysis of what makes up that other 75%).


> Btw. where do the 100K imported statements come from that you mentioned
> here? I was also interested in that number but I could not find it in the
> paper.


The paper says in section 4, "At the time of writing (January, 2016), the
tool has been used by more than a hundred users who performed about 90,000
approval or rejection actions." which probably means ~80,000 new statements
(since ~10% get rejected). My 106K number is from the current dashboard
<https://tools.wmflabs.org/wikidata-primary-sources/status.html>.

Tom
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] from Freebase to Wikidata: the great migration

2016-02-22 Thread Tom Morris
On Sun, Feb 21, 2016 at 4:25 PM, Markus Krötzsch <
mar...@semantic-mediawiki.org> wrote:

> On 21.02.2016 20:37, Tom Morris wrote:
>
>> On Sun, Feb 21, 2016 at 11:41 AM, Markus Krötzsch
>> <mar...@semantic-mediawiki.org> wrote:
>>
>> On 18.02.2016 15:59, Lydia Pintscher wrote:
>>
>> Thomas, Denny, Sebastian, Thomas, and I have published a paper
>> which was
>> accepted for the industry track at WWW 2016. It covers the
>> migration
>> from Freebase to Wikidata. You can now read it here:
>> http://research.google.com/pubs/archive/44818.pdf
>>
>> Is it possible that you have actually used the flawed statistics
>> from the Wikidata main page regarding the size of the project? 14.5M
>> items in Aug 2015 seems far too low a number. Our RDF exports from
>> mid August already contained more than 18.4M items. It would be nice
>> to get this fixed at some point. There are currently almost 20M
>> items, and the main page still shows only 16.5M.
>>
>> Numbers are off throughout the paper.  They also quote 48M instead of
>> 58M topics for Freebase and mischaracterize some other key points. The
>> key number is that 3.2 billion facts for 58 million topics has generated
>> 106,220 new statements for Wikidata. If my calculator had more decimal
>> places, I could tell you what percentage that is.
>>
>
> Obviously, any tool can only import statements for which we have items and
> properties at all, so the number of importable facts is much lower.


Obviously, but "much lower" from 3.2B is probably something like 50M-300M,
not 0.1M.
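A quick back-of-the-envelope check, using only the figures quoted in this
thread, for anyone whose calculator also runs out of decimal places:

    # Figures quoted in this thread: 3.2 billion Freebase facts, ~106,220
    # statements imported so far, and 50M-300M suggested as plausibly importable.
    facts = 3.2e9
    imported = 106_220

    print(f"{imported / facts:.5%}")                     # ~0.00332% of all Freebase facts
    print(f"{50e6 / facts:.1%} - {300e6 / facts:.1%}")   # ~1.6% - 9.4% if 50M-300M were importable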

Tom
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Freebase to Wikidata: Results from Tpt internship

2016-02-21 Thread Tom Morris
On Fri, Oct 2, 2015 at 11:59 AM, Tom Morris <tfmor...@gmail.com> wrote:

> Denny/Thomas - Thanks for publishing these artefacts.  I'll look forward
> to the report with the metrics.
>

This is now, finally, available:
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44818.pdf


> Are there plans for next steps or is this the end of the project as far as
> the two of you go?
>

I'm going to assume that the lack of answer to this question over the last
four months, the lack of updates on the project, and the fact that no one is
even bothering to respond to issues
<https://github.com/google/primarysources/issues> means that this project
is dead and abandoned.  That's pretty sad. For an internship, it sounds
like a cool project and a decent result. As an actual serious attempt to
make productive use of the Freebase data, it's a weak, half-hearted effort
by Google.

Is there any interest in the Wikidata community for making use of the
Freebase data now that Google has abandoned their effort, or is there too
much negative sentiment against it to make it worth the effort?

Tom

p.s. I'm surprised that none of the stuff mentioned below is addressed in
the paper. Was it already submitted by the beginning of October?

Comments on individual items inline below:
>
> On Thu, Oct 1, 2015 at 2:09 PM, Denny Vrandečić <vrande...@google.com>
> wrote:
>
>>
>> The scripts that were created and used can be found here:
>>
>> https://github.com/google/freebase-wikidata-converter
>>
>
> Oh no!  Not PHP!! :-)  One thing that concerns me is that the scripts seem
> to work on the Freebase RDF dump, which is a derivative artefact subject to a
> lossy transform.  I assumed that one of the reasons for having this work
> hosted at Google was that it would allow direct access to the Freebase
> graphd quads.  Is that not what happened?  There's a bunch of provenance
> information which is very valuable for quality analysis in the graphd graph
> which gets lost during the RDF transformation.
>

This isn't addressed in the paper and represents a significant loss of
provenance information.

>
>
>> https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-mapped-missing.tsv.gz
>> The actual missing statements, including URLs for sources, are in this
>> file. This was filtered against statements already existing in Wikidata,
>> and the statements are mapped to Wikidata IDs. This contains about 14.3M
>> statements (214MB gzipped, 831MB unzipped). These are created using the
>> mappings below in addition to the mappings already in Wikidata. The quality
>> of these statements is rather mixed.
>>
>
> From my brief exposure and the comments of others, the quality seems
> highly problematic, but the issue seems to mainly be with the URLs
> proposed, which are of unknown provenance.  Presumably in whatever Google
> database these were derived from, they were tagged with the tool/pipeline
> that produced them and some type of probability of relevance.  Including
> this information in the data set would help pick the most relevant URLs to
> present and also help identify low-quality sources as voting feedback is
> collected.  Also, filtering the URLs for known unacceptable citations (485K
> IMDB references, BBC Music entries which consist solely of EN Wikipedia
> snippets, etc) would cut down on a lot of the noise.
>
> Some quick stats in addition to the 14.3M statements: 2.3M entities, 183
> properties, 284K different web sites.
>
> Additional datasets that we know meet a higher quality bar have been
>> previously released and uploaded directly to Wikidata by Tpt, following
>> community consultation.
>>
>
> Is there a pointer to these?
>
>
>> https://tools.wmflabs.org/wikidata-primary-sources/data/additional-mapping.pairs.gz
>> Contains additional mappings between Freebase MIDs and Wikidata QIDs,
>> which are not available in Wikidata. These are mappings based on
>> statistical methods and single interwiki links. Unlike the first set of
>> mappings we had created and published previously (which required multiple
>> interwiki links at least), these mappings are expected to have a lower
>> quality - sufficient for a manual process, but probably not sufficient for
>> an automatic upload. This contains about 3.4M mappings (30 MB gzipped, 64MB
>> unzipped).
>>
>
> I was really excited when I saw this because the first step in the
> Freebase migration project should be to increase the number of topic
> mappings between the two databases and 3.4M would almost double the number
> of existing mappings.  Then I looked at the first 10K Q numbers and found
> of the 7,500 "new" mappings, almost 6,700 were alr

Re: [Wikidata-tech] On interface stability and forward compatibility

2016-02-05 Thread Tom Morris
Sounds a lot like a restatement of Postel's Law

https://en.wikipedia.org/wiki/Robustness_principle
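Something like the tolerant-reader behaviour Daniel describes below, sketched
in Python against a much-simplified snak structure. The field names are the
ones from the Wikibase JSON format, but the dispatch tables and the handler
itself are only an illustration, not anyone's actual client code.

    # Sketch of the forward-compatible snak handling proposed below: unknown
    # data types fall back to the value type, unknown value types are skipped
    # with a warning. The "known" sets are illustrative, not complete.
    import warnings

    KNOWN_VALUE_TYPES = {"string", "wikibase-entityid", "time", "quantity", "globecoordinate"}
    KNOWN_DATA_TYPES = {"string", "url", "wikibase-item", "time", "quantity", "commonsMedia"}

    def handle_snak(snak):
        datatype = snak.get("datatype")            # e.g. "external-id" (maybe unknown to us)
        value = snak.get("datavalue", {})
        value_type = value.get("type")

        if value_type not in KNOWN_VALUE_TYPES:
            warnings.warn(f"skipping snak with unknown value type {value_type!r}")
            return None                            # skip, but keep processing the rest
        if datatype not in KNOWN_DATA_TYPES:
            warnings.warn(f"unknown data type {datatype!r}; falling back to value type")
        return value.get("value")                  # interpret per value type

    # A snak with a new data type but a known value type still yields its value.
    print(handle_snak({"snaktype": "value", "property": "P212",
                       "datatype": "external-id",
                       "datavalue": {"type": "string", "value": "978-3-16-148410-0"}}))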

Tom

On Fri, Feb 5, 2016 at 7:10 AM, Daniel Kinzler 
wrote:

> Hi all!
>
> In the context of introducing the new "math" and "external-id" data types,
> the
> question came up whether this introduction constitutes a breaking change
> to the
> data model. The answer to this depends on whether you take the "English"
> or the
> "German" approach to interpreting the format: According to
> <
> https://en.wikipedia.org/wiki/Everything_which_is_not_forbidden_is_allowed>,
> in
> England, "everything which is not forbidden is allowed", while, in
> Germany, the
> opposite applies, so "everything which is not allowed is forbidden".
>
> In my mind, the advantage of formats like JSON, XML and RDF is that they
> provide
> good discovery by eyeballing, and that they use a mix-and-match approach.
> In
> this context, I favour the English approach: anything not explicitly
> forbidden
> in the JSON or RDF is allowed.
>
> So I think clients should be written in a forward-compatible way: they
> should
> handle unknown constructs or values gracefully.
>
>
> In this vein, I would like to propose a few guiding principles for the
> design of
> client libraries that consume Wikibase RDF and particularly JSON output:
>
> * When encountering an unknown structure, such as an unexpected key in a
> JSON
> encoded object, the consumer SHOULD skip that structure. Depending on
> context
> and use case, a warning MAY be issued to alert the user that some part of
> the
> data was not processed.
>
> * When encountering a malformed structure, such as missing a required key
> in a
> JSON encoded object, the consumer MAY skip that structure, but then a
> warning
> MUST be issued to alert the user that some part of the data was not
> processed.
> If the structure is not skipped, the consumer MUST fail with a fatal error.
>
> * Clients MUST make a clear distinction of data types and values types: A
> Snak's
> data type determines the interpretation of the value, while the type of the
> Snak's data value specifies the structure of the value representation.
>
> * Clients SHOULD be able to process a Snak about a Property of unknown data
> type, as long as the value type is known. In such a case, the client
> SHOULD fall
> back to the behaviour defined for the value type. If this is not possible,
> the
> Snak MUST be skipped and a warning SHOULD be issued to alert the user that
> some
> part of the data could not be interpreted.
>
> * When encountering an unknown type of data value (value type), the client
> MUST
> either ignore the respective Snak, or fail with a fatal error. A warning
> SHOULD
> be issued to alert the user that some part of the data could not be
> processed.
>
>
> Do you think these guidelines are reasonable? It seems to me that adopting
> them
> should save everyone some trouble.
>
> --
> Daniel Kinzler
> Senior Software Developer
>
> Wikimedia Deutschland
> Gesellschaft zur Förderung Freien Wissens e.V.
>
> ___
> Wikidata-tech mailing list
> Wikidata-tech@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
>
___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata] Duplicates in Wikidata

2015-12-31 Thread Tom Morris
I did a survey of all 315 duplicate pairs and almost 270 of them had the
duplicate entity created with P428 Botanist Abbreviation set, which I
consider prima facie evidence that they came from the same source.  For
some of the pairs, each entry was created within hours of the other
(<https://www.wikidata.org/w/index.php?title=Q21516236=history>,
<https://www.wikidata.org/w/index.php?title=Q21515511=history>) by
Reinheitsgebot and Succu, one importing from Species Wiki and the other
importing from IPNI, with neither one checking for pre-existing entries.

Bots (and users) really need to take more care not to create duplicates
because they make future reconciliation tasks much harder for others.

Tom

p.s. Is there a way to get the edit history for an entity in JSON or some
other parseable format?
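Partly answering my own p.s.: the standard MediaWiki API can return an
entity's revision history as JSON. A minimal sketch, using only the standard
library; the rvlimit/rvprop values are just a reasonable starting point.

    # Sketch: pull the edit history of a Wikidata entity as JSON via the
    # standard MediaWiki API (action=query & prop=revisions).
    import json
    import urllib.parse
    import urllib.request

    def revision_history(qid, limit=50):
        params = urllib.parse.urlencode({
            "action": "query",
            "prop": "revisions",
            "titles": qid,
            "rvlimit": limit,
            "rvprop": "ids|timestamp|user|comment",
            "format": "json",
        })
        req = urllib.request.Request(
            "https://www.wikidata.org/w/api.php?" + params,
            headers={"User-Agent": "edit-history-example/0.1"},
        )
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        page = next(iter(data["query"]["pages"].values()))
        return page.get("revisions", [])

    for rev in revision_history("Q21516236"):
        print(rev["timestamp"], rev["user"], rev.get("comment", ""))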


On Mon, Dec 28, 2015 at 2:29 PM, Tom Morris <tfmor...@gmail.com> wrote:

> I think there are at least two uses for information like this.  Fixing the
> actual errors is good, but perhaps more important is looking at why they
> happened in the first place.  Are there infrastructure/process issues which
> need to be improved? Are there systemic problems with particular tool
> chains, users, domains, etc? What patterns does the data show?
>
> I've attached a munged version of the list in a format which I find a
> little easier to work with and added Wikidata links
>
> Looking at the 30 oldest entities, 22 (!) of the duplicates were added by
> a single user (bot?) who was adding botanists, apparently based on data
> from the International Plant Names Index
> <http://www.ipni.org/ipni/authorsearchpage.do>, without first checking to
> see if they already existed.  The user's page indicates that they've had
> 2.5 *million* items deleted or merged (~12% of everything they've added).
> I'd hope to see high volume users/bots/tools in the 99%+ range for quality,
> not <90%.
>
> One pair is not a duplicate, but rather a father
> <https://www.wikidata.org/wiki/Q1228686> & son
> <https://www.wikidata.org/wiki/Q716655> with the same name, apparently
> flagged because they were both born in the 2nd century and died in the 3rd
> century, making them a "match."
>
> A few of remaining the duplicates were created by a variety of bots
> importing Wikipedia entries with incompletely fused sitelinks (not terribly
> surprising when the only structured information is a name and a sitelink).
>
> The last few pairs of duplicates don't really have enough provenance to
> figure out the source of the data.  One was created just a couple of weeks
> ago by a bot
> <https://www.wikidata.org/w/index.php?title=Q18603442=history>
> using "data from the Rijksmuseum" (no link or other provenance given),
> apparently without checking for existing entries first.  A few
> <https://www.wikidata.org/w/index.php?title=Q16825734=history>
> others
> <https://www.wikidata.org/w/index.php?title=Q19619143=history> was 
> created
> by Widar
> <https://www.wikidata.org/w/index.php?title=Q19933200=history>,
> but I can't tell what game, what data source, etc.
>
> Looking at three pairs of entries which were created at nearly the same
> time (min QNumberDelta), each pair was created by a single game/bot,
> indicating inadequate internal duplicate checks on the input data.
>
> It seems like post hoc analysis of merged entries to mine for patterns
> would be a very useful tool to identify systemic issues. Is that something
> that is done currently?
>
> Tom
>
>
>
> On Wed, Dec 23, 2015 at 5:05 PM, Proffitt,Merrilee <proff...@oclc.org>
> wrote:
>
>> Hello colleagues,
>>
>>
>>
>> During the most recent VIAF harvest we encountered a number of duplicate
>> records in Wikidata. Forwarding on in case this is of interest (there is an
>> attached file – not sure if that will go through on this list or not).
>>
>>
>>
>> Some discussion from OCLC colleagues is included below.
>>
>>
>>
>> Merrilee Proffitt, Senior Program Officer
>> OCLC Research
>>
>>
>>
>> *From:* Toves,Jenny
>> *Sent:* Tuesday, December 22, 2015 6:02 AM
>> *To:* Proffitt,Merrilee
>> *Subject:* FW: 201551 vs 201552
>>
>>
>>
>> Good morning Merrilee,
>>
>>
>>
>> You probably know that we harvest wikidata monthly for ingest into VIAF.
>> This month we found 315 pairs of records that appear to be duplicates. That
>> was a jump from previous months. I am not sure who would be interested in
>> this but Thom & I thought you might be. The attached report has 630 lines
>> showing what viaf saw as duplicates. So this pair of lines:
>>
>>

Re: [Wikidata] Place name ambiguation

2015-12-29 Thread Tom Morris
On Tue, Dec 29, 2015 at 1:56 PM, Maarten Dammers <maar...@mdammers.nl>
wrote:

> Hi Tom,
>
> Op 29-12-2015 om 19:33 schreef Tom Morris:
>
>> Thanks Stas & Thomas.  That's unambiguous. :-)  (And thanks to
>> Jdforrester who went through and fixed all my examples)
>>
> Please keep the long label around as an alias. This really helps when you
> enter data.


The guidelines say to move the disambiguation information to the
description (which makes sense since that's the only information the
autocomplete widget shows). Speaking of aliases, I got a chuckle out of the
species alias discussion - "The label of Helianthus annuus (Q171497)
<https://www.wikidata.org/wiki/Q171497> is the common name, while the
scientific name (*Helianthus annuus*) is featured as an alias"


> I wonder if someone ever ran a bot to clean up these disambiguations. The
> US alone must be thousands of items.


From my observations, it appears that most, but not all, of the
parenthetical disambiguation information was removed on import from
Wikipedia, but little or no comma-separated disambiguation was removed. If
P131 "located in the administrative territorial entity" were set correctly,
it would be relatively straightforward for a bot to identify information to
be removed, but unfortunately the bot that set all these, at least locally
(AudeBot), used an administrative entity two levels higher than it should
have, making the property value useless. Fixing P131 first would make it
easier to identify disambiguation information to be removed from labels.

Tom
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Place name ambiguation

2015-12-28 Thread Tom Morris
I'm pretty sure ambiguation is not a word, but what are the guidelines on
removing disambiguation information from names/labels of items? Usually
this disambiguation information was added to a Wikipedia article name to
enforce Wikipedia's uniqueness requirements, but Wikidata has no such need, and
the presence of the information makes it very difficult to construct things
like placename hierarchies which don't look clumsy/weird.

It appears that, in general, disambiguation text has been removed from
names, but this isn't always the case.  Is it safe/recommended to clean up
those that have been missed?

Here are some examples that I've collected:

Linden Park, Massachusetts https://www.wikidata.org/wiki/Q6552338
boulevard carnot cannes https://www.wikidata.org/wiki/Q2921787
Newton Lower Falls, Massachusetts https://www.wikidata.org/wiki/Q7020301
Harrington House (Weston, Massachusetts)
https://www.wikidata.org/wiki/Q14715508
Thomas Fleming House (Sherborn, Massachusetts)
https://www.wikidata.org/wiki/Q14715881
Peabody, Cambridge, Massachusetts https://www.wikidata.org/wiki/Q7157211

Is the standard the "natural" name or the invented name that Wikipedians
came up with?

Tom
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] [Spam] Re: Seasons Greetings and a call to action

2015-12-23 Thread Tom Morris
Thanks for the explanations. I'd forgotten that Wikipedia articles can link
to random images that are only peripherally related to the subject of the
article.

The pointers to the tools are appreciated, since I think tooling is key to
getting widespread participation in tasks like this, but I'd suggest that
the tooling isn't quite "there" yet.  The first tool, suggested by Lydia &
Gerard, gave me a page of text in Swedish about someone I've never heard of
and wanted me to say whether the photo was correct or not.  Sorry, not
playing that game (I don't read Swedish).

The map is a good idea, but the current implementation isn't zoomable and
the resolution is so coarse that it's basically a wild guess to get within
100km or so of a location.  Why not have the default be "near me, sorted by
distance from me" like the Special:Nearby page?  And, as an aside, why does
it use WDQ instead of the official query API?

The list of monuments didn't have anything for my country or the closest
neighboring country, so decided to take a look at the really cool photo of
the Swallow's Nest
<https://commons.wikimedia.org/wiki/File:%D0%97%D0%B0%D0%BC%D0%BE%D0%BA_%22%D0%9B%D0%B0%D1%81%D1%82%D0%BE%D1%87%D0%BA%D0%B8%D0%BD%D0%BE_%D0%B3%D0%BD%D0%B5%D0%B7%D0%B4%D0%BE%22,_%D0%AF%D0%BB%D1%82%D0%B0,_%D0%90%D0%A0_%D0%9A%D1%80%D1%8B%D0%BC.jpg>
on the Crimean Peninsula.  Turns out that the only article that links to it
is the Ukrainian article about the Crimean Peninsula and it isn't linked
from the main Ukrainian
<https://uk.wikipedia.org/wiki/%D0%9B%D0%B0%D1%81%D1%82%D1%96%D0%B2%D1%87%D0%B8%D0%BD%D0%B5_%D0%B3%D0%BD%D1%96%D0%B7%D0%B4%D0%BE>
(or any other) article about it, so there's no easy path to the
correct Wikidata
item <https://www.wikidata.org/wiki/Q1353643>.  That's just waaayyy too
much work to expect someone to do.

It wouldn't take much work on the tooling to significantly increase the
rate of contribution for tasks like this.

Tom

On Wed, Dec 23, 2015 at 3:35 AM, Andrew Gray <andrew.g...@dunelm.org.uk>
wrote:

> The FIST tool is great for this:
>
> https://tools.wmflabs.org/fist/wdfist/
>
> Can be fed with a WDQ query or a Wikipedia category; generates a list
> of all items which don't have a P18 but do have plausible candidate
> images, and lets you one-click add images (so everything is human
> confirmed).
>
> A seasonal example (takes a few seconds to load):
>
> https://tools.wmflabs.org/fist/wdfist/?category=Christmas=3=en=wikipedia_images_only=1=1
>
> If you want to do it geographically, I knocked up a world map
> interface to the tool: http://www.generalist.org.uk/wikidata/
>
> There are now 1,071,000 Wikidata items with main images (broke a
> million about a month ago). 431,000 of these are people, which means
> about one in seven of our Wikidata biographies have a main portrait
> image.
>
> Andrew.
>
> On 22 December 2015 at 21:17, Maxime Lathuilière <gro...@maxlath.eu>
> wrote:
> > What about making an automatic list of Wikipedia articles with a image in
> > Commons without an image in Wikidata, and then maybe turn the list into a
> > Wikidata Game if a human confirmation is required?
> >
> > Maxime
> >
> > Le 22/12/2015 22:01, Gerard Meijssen a écrit :
> >
> > Hoi,
> > Yes, the English Wikipedia allows for "fair use". Image with fair use are
> > not permissible in Wikidata. There is no such thing as automatically it
> > takes people to start a process and think it true.
> > Thanks,
> >  GerardM
> >
> > On 22 December 2015 at 20:52, Tom Morris <tfmor...@gmail.com> wrote:
> >>
> >> Is there a reason that Wikidata can't just use the same images that
> >> Wikipedia does?
> >>
> >> For example, this article:
> >> https://en.wikipedia.org/wiki/Walter_Frost_House has a nice public
> domain
> >> photo that someone contributed back in 2010, but it's not referenced
> from
> >> Wikidata https://www.wikidata.org/wiki/Q7964895
> >>
> >> Why couldn't these all be referenced automatically?
> >>
> >> Tom
> >>
> >> On Tue, Dec 22, 2015 at 1:33 PM, Jane Darnell <jane...@gmail.com>
> wrote:
> >>>
> >>> Hi all,
> >>> I cordially invite you to donate some time and add an image from
> >>> Wikimedia Commons to Wikidata this year. For inspiration you may
> choose one
> >>> of the recent Wiki Loves Monuments prize winning photos from your local
> >>> competition being added to the Wikidata item about that monument, one
> of
> >>> your favorite paintings from your local museum being added to the
> Wikidata
> >>> item about that painting, an image of christmas

Re: [Wikidata] Seasons Greetings and a call to action

2015-12-22 Thread Tom Morris
Is there a reason that Wikidata can't just use the same images that
Wikipedia does?

For example, this article: https://en.wikipedia.org/wiki/Walter_Frost_House
has a nice public domain photo that someone contributed back in 2010, but
it's not referenced from Wikidata https://www.wikidata.org/wiki/Q7964895

Why couldn't these all be referenced automatically?

Tom

On Tue, Dec 22, 2015 at 1:33 PM, Jane Darnell  wrote:

> Hi all,
> I cordially invite you to donate some time and add an image from Wikimedia
> Commons to Wikidata this year. For inspiration you may choose one of the
> recent Wiki Loves Monuments prize winning photos from your local
> competition being added to the Wikidata item about that monument, one of
> your favorite paintings from your local museum being added to the Wikidata
> item about that painting, an image of christmas cookies that you made that
> you could add to an item about that type of cookie, or anything else that
> has an item but no image yet.
>
> To help you overcome the difficulties with editing on Wikidata, here is a
> short film by User:Jan_Ainali_(WMSE) on how to add an image to an item:
> https://commons.wikimedia.org/wiki/File:Add_image_to_Wikidata.webm
>
> Thanks, good luck and let's hope we see some structured data for Commons
> in 2016,
> Jane
>
> https://commons.wikimedia.org/wiki/Wiki_Loves_Monuments_2015_winners
> https://commons.wikimedia.org/wiki/Category:Museums
> https://commons.wikimedia.org/wiki/Category:Christmas_cookies
> https://www.wikidata.org/wiki/Wikidata:WikiProject_Commons
>
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Photographers' Identities Catalog (& WikiData)

2015-12-08 Thread Tom Morris
Can you explain what "indexing" means in this context?  Is there some type
of matching process?  How are duplicates resolved, if at all? Was the
Wikidata info extracted from a dump or one of the APIs?

When I looked at the first person I picked at random, Pierre Berdoy
(ID:269710), I see that both Wikidata and Wikipedia claim that he was born
in Biarritz while the NYPL database claims he was born in Nashua, NH.  So,
it would appear that there are either two different people with the same
name, born in different places, or the birth place is wrong.

http://mgiraldo.github.io/pic/?=2028247=269710|42.7575,-71.4644
https://www.wikidata.org/wiki/Q3383941

Tom




On Tue, Dec 8, 2015 at 7:10 PM, David Lowe  wrote:

> Hello all,
> The Photographers' Identities Catalog (PIC) is an ongoing project of
> visualizing photo history through the lives of photographers and photo
> studios. I have information on 115,000 photographers and studios as of
> tonight. It is still under construction, but as I've almost completed an
> initial indexing of the ~12,000 photographers in WikiData, I thought I'd
> share it with you. We (the New York Public Library) hope to launch it
> officially in mid to late January. This represents about 12 years worth of
> my work of researching in NYPL's photography collection, censuses and
> business directories, and scraping or indexing trusted websites, databases,
> and published biographical dictionaries pertaining to photo history.
> Again, please bear in mind that our programmer is still hard at work (and
> I continue to refine and add to the data*), but we welcome your feedback,
> questions, critiques, etc. To see the WikiData photographers, select
> WikiData from the Source dropdown. Have fun!
>
> *PIC*
> 
>
> Thanks,
> David
>
> *Tomorrow,  for instance, I'll start mining Wikidata for birth & death
> locations.
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] WDQS updates have stopped

2015-11-18 Thread Tom Morris
So, the page that Markus points to describes heeding the replication lag
limit as a recommendation.  Since running a bot is a privilege, not a
right, why isn't the "recommendation" a requirement instead of a
recommendation?

Tom

On Wed, Nov 18, 2015 at 3:30 PM, Markus Krötzsch <
mar...@semantic-mediawiki.org> wrote:

> On 18.11.2015 19:40, Federico Leva (Nemo) wrote:
>
>> Andra Waagmeester, 18/11/2015 19:03:
>>
>>> How do you do add "hunderds (if not thousands)" items per minute?
>>>
>>
>> Usually
>> 1) concurrency,
>> 2) low latency.
>>
>
> In fact, it is not hard to get this. I guess Andra is getting speeds of
> 20-30 items because their bot framework is throttling the speed on purpose.
> If I don't throttle WDTK, I can easily do well over 100 edits per minute in
> a single thread (I did not try the maximum ;-).
>
> Already a few minutes of fast editing might push up the median dispatch
> lag sufficiently for a bot to stop/wait. While the slow edit rate is a
> rough guess (not a strict rule), respecting the dispatch stats is mandatory
> for Wikidata bots, so things will eventually slow down (or your bot be
> blocked ;-). See [1].
>
> Markus
>
> [1] https://www.wikidata.org/wiki/Wikidata:Bots
>
>
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Query Help

2015-11-09 Thread Tom Morris
Freebase has another 18,000 Twitter handles which are linked to IMDB, G+,
etc which don't have English Wikipedia links (as well as 13K which are
linked to English Wikipedia, although those should be in Wikidata too).
http://tinyurl.com/omb6bxf

I know some Wikipedias actively discourage links to social networking sites
like Twitter [1].  What is Wikidata's position on Twitter handles?  Is the
existence of the property P2002 sufficient justification to fill it in?
(Unlike enwiki's "Yes, we have Twitter template, but you shouldn't be using
it" stance [2]).

Tom

[1]
https://en.wikipedia.org/wiki/Wikipedia:External_links#Links_normally_to_be_avoided
[2] https://en.wikipedia.org/wiki/Template:Twitter
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Use of Sparql service is going through the roof

2015-11-06 Thread Tom Morris
Tangential question - is there a similar dashboard for WDQ (no S)?  Or
better yet, one that charts both query services so that they can be
compared?

Tom

On Fri, Nov 6, 2015 at 9:27 AM, James Heald  wrote:

> Does anyone know what's going on with the Sparql service ?
>
> Up until a couple of days ago, the most hits ever in one day was about
> 6000.
>
> But according to
>  http://searchdata.wmflabs.org/wdqs/
>
> two days ago suddenly there were 6.77 *million* requests, and yesterday
> over 21 million.
>
> Does anyone know what sort of requests these are, and whether they are all
> coming from the same place ?
>
>-- James.
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Data model explanation and protection

2015-10-28 Thread Tom Morris
This is a deep-seated semantic confusion going back to at least 2006 [1]
when the Protein Infobox had Entrez and OMIM gene IDs.  Freebase naively
adopted this in its initial protein schema in 2007 when it was importing from
those infoboxes.  Although it made some progress in improving the schema
later, anything not aligned with how Wikipedians want to do things is
shoveling against the tide.  It's also very difficult to manage
equivalences when Wikipedia articles are about multiple things like the
protein/gene articles.

If you look at the recent merge of Reelin [3] you can see that it was done
by the same user who contributed substantially to the article back in 2006
[4], so clearly, as the "owner" of that article, they clearly know what's
best.  :-) It's going to be very difficult to get people to unlearn a
decade of habits.

Another issue is that, as soon as you start trying to split things out into
semantically clean pieces, you immediately run afoul of the notability
restrictions. Because human (and mouse) genes don't have their own
Wikipedia pages, they're clearly not notable, so they can't be added to
Wikidata.

This problem of chunking by notability (or lack thereof), length of text
article, relatedness, and other attributes rather than semantic
individuality is much more widespread than just proteins/genes.  It also
affects things like pairs (or small sets) of people who aren't notable
enough to have an article on their own, articles which contain infoboxes
about people who aren't notable, so they got tacked onto a related article
to give them a home, etc.

The inverse problem exists as well where a single semantic topic is broken
up into multiple articles purely for reasons of length.  Other types of
semantic mismatches include articles along precoordinated facets like
Transportation in New York City (or even History of Transportation in New
York City!), list articles (* Filmography, * Discography, * Videography,
List of *).  Of course, some lists, like the Fortune 500, make sense to
talk about as entities, but most Wikipedia lists are just mechanically
generated things for human browsing which don't really need a semantic
identifier.  Freebase deleted most of this Wikipedia cruft.

Going back to Ben's original problem, one tool that Freebase used to help
manage the problem of incompatible type merges was a set of curated sets of
incompatible types [5] which was used by the merge tools to warn users that
the merge they were proposing probably wasn't a good idea.  People could
ignore the warning in the Freebase implementation, but Wikidata could make
it a hard restriction or just a warning.
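
For illustration, a minimal sketch of how a curated incompatible-type list
might back such a warning (the type names and function are hypothetical, not
Wikidata's or Freebase's actual merge code):

# Curated pairs of classes whose instances should never be merged.
INCOMPATIBLE = {
    frozenset({"gene", "protein"}),
    frozenset({"human", "organization"}),
}

def merge_warning(types_a, types_b):
    """Return the first incompatible type pair found between two items,
    or None if the merge looks plausible."""
    for a in types_a:
        for b in types_b:
            if frozenset({a, b}) in INCOMPATIBLE:
                return (a, b)
    return None

print(merge_warning({"gene"}, {"protein"}))  # ('gene', 'protein') -> warn
print(merge_warning({"human"}, {"human"}))   # None -> allow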

Tom

[1]
https://en.wikipedia.org/w/index.php?title=Reelin=56108806=56101233
[2] http://www.freebase.com/biology/protein/entrez_gene_id
[3]
https://www.wikidata.org/w/index.php?title=Q414043=revision=262778265=262243280
[4]
https://en.wikipedia.org/w/index.php?title=Reelin=prev=history
[5] http://www.freebase.com/dataworld/incompatible_types?instances=


On Wed, Oct 28, 2015 at 1:07 PM, Benjamin Good 
wrote:

> The Gene Wiki team is experiencing a problem that may suggest some areas
> for improvement in the general wikidata experience.
>
> When our project was getting started, we had some fairly long public
> debates about how we should structure the data we wanted to load [1].
> These resulted in a data model that, we think, remains pretty much true to
> the semantics of the data, at the cost of distributing information about
> closely related things (genes, proteins, orthologs) across multiple,
> interlinked items.  Now, as long as these semantic links between the
> different item classes are maintained, this is working out great.  However,
> we are consistently seeing people merging items that our model needs to be
> distinct.  Most commonly, we see people merging items about genes with
> items about the protein product of the gene (e.g. [2]]).  This happens
> nearly every day - especially on items related to the more popular
> Wikipedia articles. (More examples [3])
>
> Merges like this, as well as other semantics-breaking edits, make it very
> challenging to build downstream apps (like the wikipedia infobox) that
> depend on having certain structures in place.  My question to the list is
> how to best protect the semantic models that span multiple entity types in
> wikidata?  Related to this, is there an opportunity for some consistent way
> of explaining these structures to the community when they exist?
>
> I guess the immediate solutions are to (1) write another bot that watches
> for model-breaking edits and reverts them and (2) to create an article on
> wikidata somewhere that succinctly explains the model and links back to the
> discussions that went into its creation.
>
> It seems that anyone that works beyond a single entity type is going to
> face the same kind of problems, so I'm posting this here in hopes that
> generalizable patterns (and perhaps even supporting code) can be 

Re: [Wikidata] Data model explanation and protection

2015-10-28 Thread Tom Morris
BTW, merges aren't the only problem.  For all languages except English,
it's the protein Wikidata item [1] that points to the corresponding
Wikipedia page, while for Engish it's the gene item [2] that points to the
corresponding English article [3].

[1] https://www.wikidata.org/wiki/Q13561329
[2] https://www.wikidata.org/wiki/Q414043
[3] https://en.wikipedia.org/wiki/Reelin


On Wed, Oct 28, 2015 at 3:08 PM, Tom Morris <tfmor...@gmail.com> wrote:

> This is a deep-seated semantic confusion going back to at least 2006 [1]
> when the Protein Infobox had Entrez and OMIM gene IDs.  Freebase naively
> adopted this in its initial protein schema in 2007 when it was importing from
> those infoboxes.  Although it made some progress in improving the schema
> later, anything not aligned with how Wikipedians want to do things is
> shoveling against the tide.  It's also very difficult to manage
> equivalences when Wikipedia articles are about multiple things like the
> protein/gene articles.
>
> If you look at the recent merge of Reelin [3] you can see that it was done
> by the same user who contributed substantially to the article back in 2006
> [4], so clearly, as the "owner" of that article, they clearly know what's
> best.  :-) It's going to be very difficult to get people to unlearn a
> decade of habits.
>
> Another issue is that, as soon as you start trying to split things out
> into semantically clean pieces, you immediately run afoul of the notability
> restrictions. Because human (and mouse) genes don't have their own
> Wikipedia pages, they're clearly not notable, so they can't be added to
> Wikidata.
>
> This problem of chunking by notability (or lack thereof), length of text
> article, relatedness, and other attributes rather than semantic
> individuality is much more widespread than just proteins/genes.  It also
> affects things like pairs (or small sets) of people who aren't notable
> enough to have an article on their own, articles which contain infoboxes
> about people who aren't notable, so they got tacked onto a related article
> to give them a home, etc.
>
> The inverse problem exists as well where a single semantic topic is broken
> up into multiple articles purely for reasons of length.  Other types of
> semantic mismatches include articles along precoordinated facets like
> Transportation in New York City (or even History of Transportation in New
> York City!), list articles (* Filmography, * Discography, * Videography,
> List of *).  Of course, some lists, like the Fortune 500, make sense to
> talk about as entities, but most Wikipedia lists are just mechanically
> generated things for human browsing which don't really need a semantic
> identifier.  Freebase deleted most of this Wikipedia cruft.
>
> Going back to Ben's original problem, one tool that Freebase used to help
> manage the problem of incompatible type merges was a set of curated sets of
> incompatible types [5] which was used by the merge tools to warn users that
> the merge they were proposing probably wasn't a good idea.  People could
> ignore the warning in the Freebase implementation, but Wikidata could make
> it a hard restriction or just a warning.
>
> Tom
>
> [1]
> https://en.wikipedia.org/w/index.php?title=Reelin=56108806=56101233
> [2] http://www.freebase.com/biology/protein/entrez_gene_id
> [3]
> https://www.wikidata.org/w/index.php?title=Q414043=revision=262778265=262243280
> [4]
> https://en.wikipedia.org/w/index.php?title=Reelin=prev=history
> [5] http://www.freebase.com/dataworld/incompatible_types?instances=
>
>
> On Wed, Oct 28, 2015 at 1:07 PM, Benjamin Good <ben.mcgee.g...@gmail.com>
> wrote:
>
>> The Gene Wiki team is experiencing a problem that may suggest some areas
>> for improvement in the general wikidata experience.
>>
>> When our project was getting started, we had some fairly long public
>> debates about how we should structure the data we wanted to load [1].
>> These resulted in a data model that, we think, remains pretty much true to
>> the semantics of the data, at the cost of distributing information about
>> closely related things (genes, proteins, orthologs) across multiple,
>> interlinked items.  Now, as long as these semantic links between the
>> different item classes are maintained, this is working out great.  However,
>> we are consistently seeing people merging items that our model needs to be
>> distinct.  Most commonly, we see people merging items about genes with
>> items about the protein product of the gene (e.g. [2]]).  This happens
>> nearly every day - especially on items related to the more popular
>> Wikipedia articles. (More examples [3])
>>
>> Merges like this,

Re: [Wikidata] Primary Sources Tool Backend Updates

2015-10-02 Thread Tom Morris
Sebastian - thanks for the quick turnaround on my requests.  It'll make the
data analysis much easier.

Tom

On Fri, Oct 2, 2015 at 8:20 AM, Sebastian Schaffert 
wrote:

> I've been reading mostly the archives and the GitHub tickets so far, but
> given the interest in the Primary Sources Tool maybe it's time I joined the
> mailinglist ;-)
>
> Following up on the discussions the last weeks I added two new features to
> the backend:
>
> 1. for any request listing statements it is now possible to filter by
> state, with the default being "unapproved"
>
> For example, you can select 10 random statements that have been marked
> "wrong" (i.e. rejected in the UI) with
>
> curl -i "
> https://tools.wmflabs.org/wikidata-primary-sources/statements/any?state=wrong
> "
>
> or retrieve all approved statements with
>
> curl -i "
> https://tools.wmflabs.org/wikidata-primary-sources/statements/all?state=approved
> "
>
> (NB: the /all endpoint is using paging, use the offset= and limit=
> parameters to control how much is returned)
>
> the different acceptable states are defined in
> https://github.com/google/primarysources/blob/master/backend/Statement.h#L14
>
>
> 2. all statements that already had some form of interaction (e.g. have
> been approved or rejected) now contain a new JSON field "activities"
> listing the activities acting on the statement; even though usually there
> will be at most one activity (i.e. approved or rejected), the system stores
> (and already stored since we launched it) a complete history, e.g. for
> transitions like unapproved -> wrong -> unapproved -> approved.
>
> You can try it out by retrieving a random selection of statements in other
> states than "unapproved", e.g. as before:
>
> curl -i "
> https://tools.wmflabs.org/wikidata-primary-sources/statements/any?state=wrong
> "
>
> will give you results like:
>
> {
> "activities" : [
> {
> "state" : "wrong",
> "timestamp" : "+2015-05-09T14:26:45Z/14",
> "user" : "Hoo man"
> }
> ]
> ,
> "dataset" : "freebase",
> "format" : "v1",
> "id" : 31,
> "state" : "wrong",
> "statement" : "Q1702409\tP27\tQ145\tS854\t\"
> http://www.astrotheme.com/astrology/Warren_Mitchell\";,
> "upload" : 0
> }
>
> Hope it is useful ;-)
>
> Otherwise let me know if you are interested in other analysis data. I'll
> try adding features as time permits.
>
> Cheers!
>
>
>
> --
> Dr. Sebastian Schaffert | GMail Site Reliability Manager |
> schaff...@google.com | +41 44 668 06 25
>
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Freebase to Wikidata: Results from Tpt internship

2015-10-02 Thread Tom Morris
Denny/Thomas - Thanks for publishing these artefacts.  I'll look forward to
the report with the metrics.  Are there plans for next steps or is this the
end of the project as far as the two of you go?

Comments on individual items inline below:

On Thu, Oct 1, 2015 at 2:09 PM, Denny Vrandečić 
wrote:

>
> The scripts that were created and used can be found here:
>
> https://github.com/google/freebase-wikidata-converter
>

Oh no!  Not PHP!! :-)  One thing that concerns me is that the scripts seem
to work on the Freebase RDF dump, which is a derivative artefact subject to a
lossy transform.  I assumed that one of the reasons for having this work
hosted at Google was that it would allow direct access to the Freebase
graphd quads.  Is that not what happened?  There's a bunch of provenance
information which is very valuable for quality analysis in the graphd graph
which gets lost during the RDF transformation.

https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-mapped-missing.tsv.gz
> The actual missing statements, including URLs for sources, are in this
> file. This was filtered against statements already existing in Wikidata,
> and the statements are mapped to Wikidata IDs. This contains about 14.3M
> statements (214MB gzipped, 831MB unzipped). These are created using the
> mappings below in addition to the mappings already in Wikidata. The quality
> of these statements is rather mixed.
>

From my brief exposure and the comments of others, the quality seems highly
problematic, but the issue seems to mainly be with the URLs proposed, which
are of unknown provenance.  Presumably in whatever Google database these
were derived from, they were tagged with the tool/pipeline that produced
them and some type of probability of relevance.  Including this information
in the data set would help pick the most relevant URLs to present and also
help identify low-quality sources as voting feedback is collected.  Also,
filtering the URLs for known unacceptable citations (485K IMDB references,
BBC Music entries which consist solely of EN Wikipedia snippets, etc) would
cut down on a lot of the noise.

Some quick stats in addition to the 14.3M statements: 2.3M entities, 183
properties, 284K different web sites.

Additional datasets that we know meet a higher quality bar have been
> previously released and uploaded directly to Wikidata by Tpt, following
> community consultation.
>

Is there a pointer to these?

https://tools.wmflabs.org/wikidata-primary-sources/data/additional-mapping.pairs.gz
> Contains additional mappings between Freebase MIDs and Wikidata QIDs,
> which are not available in Wikidata. These are mappings based on
> statistical methods and single interwiki links. Unlike the first set of
> mappings we had created and published previously (which required multiple
> interwiki links at least), these mappings are expected to have a lower
> quality - sufficient for a manual process, but probably not sufficient for
> an automatic upload. This contains about 3.4M mappings (30 MB gzipped, 64MB
> unzipped).
>

I was really excited when I saw this because the first step in the Freebase
migration project should be to increase the number of topic mappings
between the two databases and 3.4M would almost double the number of
existing mappings.  Then I looked at the first 10K Q numbers and found that
of the 7,500 "new" mappings, almost 6,700 were already in Wikidata.

Fortunately when I took a bigger sample, things improved.  For a 4% sample,
it looks like just under 30% are already in Wikidata, so if the quality of the
remainder is good, that would yield an additional 2.4M mappings, which is
great!  Interestingly there were also a smattering of Wikidata 404s (25),
redirects (71), and values which conflicted with Wikidata (530), a cursory
analysis of the latter showed that they were mostly the result of merges on
the Freebase end (so the entity now has two MIDs).

https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-new-labels.tsv.gz
> This file includes labels and aliases for Wikidata items which seem to be
> currently missing. The quality of these labels is undetermined. The file
> contains about 860k labels in about 160 languages, with 33 languages having
> more than 10k labels each (14MB gzipped, 32MB unzipped).
>

Their provenance is available in the Freebase graph.  The most likely
source is other language Wikipedias, but this could be easily confirmed.


>
> https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-reviewed-missing.tsv.gz
> This is an interesting file as it includes a quality signal for the
> statements in Freebase. What you will find here are ordered pairs of
> Freebase mids and properties, each indicating that the given pair were
> going through a review process and likely have a higher quality on average.
> This is only for those pairs that are missing from Wikidata. The file
> includes about 1.4M pairs, and this can be used for importing part of the
> data 

[Wikidata] Wikidata & Freebase visualizations (was internship results)

2015-10-02 Thread Tom Morris
On Fri, Oct 2, 2015 at 4:51 PM, Federico Leva (Nemo) 
wrote:

> Thad Guidry, 02/10/2015 21:44:
>
>> ​To my eyes, it shows that the Asia continent is still generally void of
>> any useful machine-readable Knowledge, in either Freebase or Wikidata.
>> (or anywhere else)​  But this is already a known state of affairs and
>> probably will not improve until 1 Million USA students learn Mandarin. :)
>>
>
Extrapolating from Freebase and Wikidata may not be reliable.  For example,
Baidu is extracting structured microdata from web sites:
http://chineseseoshifu.com/blog/baidu-structured-data-wordpress-plugin.html


> It also shows that Wikidata and Freebase have different opinions on what's
> the centre of Europe (or maybe one of the two has tons of statements on
> Cape Town! too lazy to manually calculate labels on the axes).


I agree that labeled axes would be useful.  Both graphs have a strong
vertical lines which line up with each other near the prime meridian
(Paris?), so I don't think there's a shift involved.  The extra line is on
the Wikidata graph.  I'm guessing it's Rome, not Cape Town.  The Italians
are pretty keen Wikidatans, aren't they?

The Wikidata graph also seems to exhibit more and bigger strong horizontal
features than the Freebase graph, in particular one in ~875AD? which spans
half the globe.

If there's an intermediary form of the data that includes lat/lon instead
of just longitude, an interactive visualization overlaid on a world map
with a slider for date and selectable Freebase/Wikidata plots might be fun.

Tom
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Duplicate identifiers (redirects & non-redirects)

2015-10-01 Thread Tom Morris
On Thu, Oct 1, 2015 at 4:19 AM, Markus Krötzsch <
mar...@semantic-mediawiki.org> wrote:

> On 01.10.2015 00:58, Ricordisamoa wrote:
>
>> I think Tom is referring to external identifiers such as MusicBrainz
>> artist ID  etc. and whether
>> Wikidata items should show all of them or 'preferred' ones only as we
>> did for VIAF redirects
>> <
>> https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/SamoaBot_38
>> >.
>>
>
> Now if the external site reconciles the ids, we have these options:
> (1) Keep everything as is (one main id marked as "preferred")
> (2) Make the redirect ids deprecated on Wikidata (show people that we are
> aware of the ids but they should not be used)
> (3) Delete the redirect ids
>
> I think (2) would be cleanest, since it avoids that unaware users re-add
> the old ids. (3) would also be ok once the old id is no longer in
> circulation.
>

I agree #2 is best, although #1 could work too.  The problem with #3 is
that an identifier, once minted, is never "no longer in circulation."  This
is precisely why Wikidata items are never deleted.  There's always the
possibility that someone will hold a reference to it somewhere.  Thad's use
case isn't uncommon.

> Is there any benefit in removing old ids completely? I guess constraint
> reports will work better (but maybe constraint reports should not count
> deprecated statements in single value constraints ...).


The constraint reports definitely need to be fixed.  I recently saw a
reference to a VIAF bot run that deleted a whole bunch of VIAF identifiers
to "fix" things being flagged by some constraint.


> Other than this, I don't see a big reason to spend time on removing some
> ids. It's not wrong to claim that these are ids, just slightly redundant,
> and the old ids might still be useful for integrating with web sources that
> were not updated when the redirect happened.
>

Rather than not wasting time removing, I'd like to see affirmative
statements that keeping them is a good thing.  If people find them annoying
or cluttering, it's because of poor UI design, not because they lack
usefulness.

Tom
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Primary Sources tool backend API

2015-10-01 Thread Tom Morris
The version with support for all states has apparently been deployed.  The
default is "unapproved" and "any" is still the catchall, but you can also
use: approved, wrong, skipped, othersource, duplicate, blacklisted

e.g.
http://tools.wmflabs.org/wikidata-primary-sources/statements/all?state=duplicate=freebase

Is there a description somewhere of what each state means?  I assume that
things that are marked as "blacklisted" never got presented to the user.
Is that true and is it also true of "duplicate"?  If so, these are really
tracking quality of the data prep process (ie this filtering should have
been done before the data was loaded into the backend).
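
A rough breakdown by state can be pulled from the backend itself. A minimal
sketch, assuming the /all endpoint returns a JSON list of statement objects
and honors the offset/limit paging the backend documents (the 100,000-row
cap is arbitrary):

import requests

BASE = "https://tools.wmflabs.org/wikidata-primary-sources/statements/all"
STATES = ["unapproved", "approved", "wrong", "skipped",
          "othersource", "duplicate", "blacklisted"]

for state in STATES:
    total, offset, limit = 0, 0, 1000
    while True:
        page = requests.get(BASE, params={"state": state, "offset": offset,
                                          "limit": limit}).json()
        total += len(page)
        offset += limit
        if len(page) < limit or offset >= 100000:  # stop early; drop the cap
            break                                  # to count everything
    print(state, total)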

In looking at the UI, I only see Accept & Reject buttons.  Does this mean
that Reject (ie state == wrong) is the union of duplicate fact, bad
reference, we've got enough references so I don't feel like adding another
one, bad fact, and any other reason for rejection?  If so, it seems like
a bunch of information that would be useful for analysis is being lost.

Tom



On Tue, Sep 29, 2015 at 6:10 PM, Tom Morris <tfmor...@gmail.com> wrote:

> Thanks for the quick reply, Thomas -- and for creating the issue.
>
> On Tue, Sep 29, 2015 at 4:15 PM, Thomas Tanon <thomaspe...@gmail.com>
> wrote:
>
>>
>> Sadly, the "state" parameter currently only accepts the values
>> "unapproved" and "all".
>
>
> When "all" didn't work, I had a peek at the sources and discovered that
> the only legal value for the state parameter is actually "any".
>
> The following query works for the analysis that I want to do (albeit with
> a *lot* of filtering required):
>
> http://tools.wmflabs.org/wikidata-primary-sources/statements/all?state=any
>
> Tom
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Duplicate identifiers (redirects & non-redirects)

2015-10-01 Thread Tom Morris
A small mechanical note for those not familiar with Wikidata's internals
(since it took me a while to figure this out):

"Preferred" and "Deprecated" are "Ranks" (the third is "Normal") and the
rank can be set by clicking the "Edit" button and then clicking on the
leftmost of the two tiny sets of three stacked buttons near the left of the
input field.

It seems like the constraint checker could check for either only one
"Preferred" or all but one "Deprecated" which would allow editors to evolve
in whichever way they wanted.

Tom

On Thu, Oct 1, 2015 at 12:40 PM, Tom Morris <tfmor...@gmail.com> wrote:

> On Thu, Oct 1, 2015 at 4:19 AM, Markus Krötzsch <
> mar...@semantic-mediawiki.org> wrote:
>
>> On 01.10.2015 00:58, Ricordisamoa wrote:
>>
>>> I think Tom is referring to external identifiers such as MusicBrainz
>>> artist ID <https://www.wikidata.org/wiki/Property:P434> etc. and whether
>>> Wikidata items should show all of them or 'preferred' ones only as we
>>> did for VIAF redirects
>>> <
>>> https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/SamoaBot_38
>>> >.
>>>
>>
>> Now if the external site reconciles the ids, we have these options:
>> (1) Keep everything as is (one main id marked as "preferred")
>> (2) Make the redirect ids deprecated on Wikidata (show people that we are
>> aware of the ids but they should not be used)
>> (3) Delete the redirect ids
>>
>> I think (2) would be cleanest, since it avoids that unaware users re-add
>> the old ids. (3) would also be ok once the old id is no longer in
>> circulation.
>>
>
> I agree #2 is best, although #1 could work too.  The problem with #3 is
> that an identifier, once minted, is never "no longer in circulation."  This
> is precisely why Wikidata items are never deleted.  There's always the
> possibility that someone will hold a reference to it somewhere.  Thad's use
> case isn't uncommon.
>
>> Is there any benefit in removing old ids completely? I guess constraint
>> reports will work better (but maybe constraint reports should not count
>> deprecated statements in single value constraints ...).
>
>
> The constraint reports definitely need to be fixed.  I recently saw a
> reference to a VIAF bot run that deleted a whole bunch of VIAF identifiers
> to "fix" things being flagged by some constraint.
>
>
>> Other than this, I don't see a big reason to spend time on removing some
>> ids. It's not wrong to claim that these are ids, just slightly redundant,
>> and the old ids might still be useful for integrating with web sources that
>> were not updated when the redirect happened.
>>
>
> Rather than not wasting time removing, I'd like to see affirmative
> statements that keeping them is a good thing.  If people find them annoying
> or cluttering, it's because of poor UI design, not because they lack
> usefulness.
>
> Tom
>
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)

2015-09-29 Thread Tom Morris
>> better maintained than others -- identify those that meet the
>> bar for wholesale import and leave the rest to the primary
>> sources tool.
>>
>> On Thu, Sep 24, 2015 at 4:03 PM Markus Krötzsch
>> <mar...@semantic-mediawiki.org
>> <mailto:mar...@semantic-mediawiki.org>> wrote:
>>
>> On 24.09.2015 23:48, James Heald wrote:
>>  > Has anybody actually done an assessment on Freebase and
>> its reliability?
>>  >
>>  > Is it *really* too unreliable to import wholesale?
>>
>>   From experience with the Primary Sources tool proposals,
>> the quality is
>> mixed. Some things it proposes are really very valuable, but
>> other
>> things are also just wrong. I added a few very useful facts
>> and fitting
>> references based on the suggestions, but I also rejected
>> others. Not
>> sure what the success rate is for the cases I looked at, but
>> my feeling
>> is that some kind of "supervised import" approach is really
>> needed when
>> considering the total amount of facts.
>>
>> An issue is that it is often fairly hard to tell if a
>> suggestion is true
>> or not (mainly in cases where no references are suggested to
>> check). In
>> other cases, I am just not sure if a fact is correct for the
>> property
>> used. For example, I recently ended up accepting "architect:
>> Charles
>> Husband" for Lovell Telescope (Q555130), but to be honest I
>> am not sure
>> that this is correct: he was the leading engineer contracted
>> to design
>> the telescope, which seems different from an architect; no
>> official web
>> site uses the word "architect" it seems; I could not find a
>> better
>> property though, and it seemed "good enough" to accept it
>> (as opposed to
>> the post code of the location of this structure, which
>> apparently was
>> just wrong).
>>
>>  >
>>  > Are there any stats/progress graphs as to how the actual
>> import is in
>>  > fact going?
>>
>> It would indeed be interesting to see which percentage of
>> proposals are
>> being approved (and stay in Wikidata after a while), and
>> whether there
>> is a pattern (100% approval on some type of fact that could
>> then be
>> merged more quickly; or very low approval on something else
>> that would
>> maybe better revisited for mapping errors or other
>> systematic problems).
>>
>> Markus
>>
>>
>>  >
>>  >-- James.
>>  >
>>  >
>>  > On 24/09/2015 19:35, Lydia Pintscher wrote:
>>  >> On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris
>> <tfmor...@gmail.com <mailto:tfmor...@gmail.com>> wrote:
>>  >>>> This is to add MusicBrainz to the primary source tool,
>> not anything
>>  >>>> else?
>>  >>>
>>  >>>
>>  >>> It's apparently worse than that (which I hadn't
>> realized until I
>>  >>> re-read the
>>  >>> transcript).  It sounds like it's just going to
>> generate little warning
>>  >>> icons for "bad" facts and not lead to the recording of
>> any new facts
>>  >>> at all.
>>  >>>
>>  >>> 17:22:33  we'll also work on getting the
>> extension
>>  >>> deployed that
>>  >>> will help with checking against 3rd party databases
>>  >>> 17:23:33  the result of constraint checks
>> and checks
>>  >>> against 3rd
>>  >>> party databases will then be used to display little
>> indicators next to a
>>  >>> statement in case it is problematic
>>  >>> 17:23:47  i hope this way more people
>> become aware of
>>  >>> issues and
>>  >>> can help fix them
>>  >>> 17:24:35  Do you have any names of
>> databases that are
>>  >>> supported? :)
>>  >>> 17:24:59  sjoerddebruin: in the first
>> version the german
>>  >>> national library. it can be extended later
>>  >>>
>>  >>>
>>  >>> I know Freebase is deemed to be nasty and unreliable,
>> but is MusicBrainz
>>  >>> considered trustworthy enough to import directly or
>> will its facts
>>  >>> need to
>>  >>> be dripped through the primary source soda straw one at
>> a time too?
>>  >>
>>  >> The primary sources tool and the extension that helps us
>> check against
>>  >> other databases are two independent things.
>>  >> Imports from Musicbrainz have been happening since a
>> very long time
>>  >> already.
>>  >>
>>  >>
>>  >> Cheers
>>  >> Lydia
>>  >>
>>
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Primary Sources tool backend API

2015-09-29 Thread Tom Morris
What am I doing wrong with this API call?

http://tools.wmflabs.org/wikidata-primary-sources/statements/all?state=wrong

I've tried with quotes, without quotes, and with different state names, and
never get anything except statements with a state of "unapproved".

Tom
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] next Wikidata office hour

2015-09-24 Thread Tom Morris
On Thu, Sep 24, 2015 at 5:43 AM, Lydia Pintscher <
lydia.pintsc...@wikimedia.de> wrote:

>
> And here is the log for anyone who missed it yesterday and wants to
> catch up:
> https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-09-23-17.01.log.html


Thanks!  Is there any more information on the issue with MusicBrainz?

17:26:27  sjoerddebruin: yes, we went for MusicBrainz
first, but it turned out to be impractical. you basically have to run
their software in order to use their dumps


MusicBrainz was a major source of information for Freebase, so they appear
to have been able to figure out how to parse the dumps (and they already
have the MusicBrainz & Wikipedia IDs correlated).

Is there more detail, perhaps in a bug somewhere?

Tom
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Naming projects

2015-09-14 Thread Tom Morris
On Mon, Sep 14, 2015 at 11:27 AM, Magnus Manske  wrote:

> My next tool shall be called "hitlerdisneycoke". If everyone is offended,
> no one is!
>

and Godwin's Law  makes
another appearance...
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata-tech] API JSON format for warnings

2015-09-01 Thread Tom Morris
On Tue, Sep 1, 2015 at 7:45 AM, Markus Krötzsch <
mar...@semantic-mediawiki.org> wrote:

> I now identified another format for API warnings.


Obviously such variability in error reporting is going to cause consumers
of the API aggravation.  Perhaps a single, consistent reporting style could
be developed.

Tom
___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata] Help needed for Freebase to Wikidata migration

2015-06-17 Thread Tom Morris
On Tue, Jun 16, 2015 at 2:43 PM, Thomas Pellissier-Tanon
thoma...@google.com wrote:

 I have only added the most used 1000 properties in order not to hide
 important properties with less important ones. But feel free to add
 properties that are not listed there.

Most frequent isn't the same as most important.  Has this been
reviewed by anyone who's familiar with the Freebase schema?  Google
knows all the stuff I've listed below and could save a lot of wasted
effort by people who aren't familiar with the schema.

Most of the stuff in the /base/* domains should probably be ignored
for a first pass, and some, like /base/schemastaging, should be ignored
permanently unless they've also got an alias in the commons namespace
due to being promoted. Ditto for the /user/* domains.  I don't see
anything from the /authority namespace, which is arguably the most
important part of Freebase -- all its reconciled strong identifiers
for IMDB, Library of Congress, New York Times, etc.

Some of the properties which are included have been replaced by keys
in the /authority namespace tree, e.g.

https://www.freebase.com/book/author/openlibrary_id
https://www.freebase.com/user/narphorium/people/nndb_person/nndb_id

These can be identified programmatically by looking at the schema, where
the /type/property/enumeration property will point at the namespace
where the identifier is stored (/authority/openlibrary/author and
/authority/nndb, respectively).  Note that, for historical reasons,
some of the earlier key namespaces have aliases outside of the tree
rooted at /authority.  For example, the property
https://www.freebase.com/biology/organism_classification/itis_tsn
enumerates its identifiers in a namespace which is aliased as both
/biology/itis and /authority/itis.
https://www.freebase.com/biology/itis?keys=
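
For example, something along these lines against the mqlread service would
list the properties that enumerate their keys in a namespace (a sketch only:
it assumes the v1 mqlread endpoint is still reachable, YOUR_API_KEY is a
placeholder, and the cursor parameter would be needed to page past the first
batch):

import json, requests

query = [{"id": None, "type": "/type/property",
          "enumeration": {"id": None}, "limit": 200}]
resp = requests.get("https://www.googleapis.com/freebase/v1/mqlread",
                    params={"query": json.dumps(query), "key": "YOUR_API_KEY"})
for prop in resp.json().get("result", []):
    # e.g. /book/author/openlibrary_id -> /authority/openlibrary/author
    print(prop["id"], "->", prop["enumeration"]["id"])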

All hidden properties should probably be ignored.  Most (all?)
deprecated properties should probably be ignored.  There was a
discussion about ISBN, but these can be identified by introspecting
the schema: https://www.freebase.com/book/book_edition/ISBN

There's a bunch of internal bookkeeping cruft included in that list
that should be excluded, e.g.:

https://www.freebase.com/dataworld/gardening_hint/split_to
https://www.freebase.com/dataworld/mass_data_operation/authority
https://www.freebase.com/dataworld/mass_data_operation/ended_operation
https://www.freebase.com/dataworld/mass_data_operation/estimated_primitive_count
https://www.freebase.com/dataworld/mass_data_operation/operator
https://www.freebase.com/dataworld/mass_data_operation/software_tool_used
https://www.freebase.com/dataworld/mass_data_operation/started_operation
https://www.freebase.com/dataworld/mass_data_operation/using_account
https://www.freebase.com/dataworld/provenance/data_operation
https://www.freebase.com/dataworld/provenance/tool
https://www.freebase.com/dataworld/software_tool/provenances

https://www.freebase.com/freebase/acre_doc/based_on
https://www.freebase.com/freebase/acre_doc/handler
https://www.freebase.com/freebase/domain_profile/expert_group
https://www.freebase.com/freebase/domain_profile/featured_views
https://www.freebase.com/freebase/domain_profile/hidden
https://www.freebase.com/freebase/domain_profile/show_commons
https://www.freebase.com/freebase/flag_judgment/flag
https://www.freebase.com/freebase/flag_judgment/item
https://www.freebase.com/freebase/flag_judgment/vote
https://www.freebase.com/freebase/flag_kind/flags
https://www.freebase.com/freebase/flag_vote/judgments

https://www.freebase.com/freebase/review_flag/item
https://www.freebase.com/freebase/review_flag/judgments
https://www.freebase.com/freebase/review_flag/kind

https://www.freebase.com/freebase/type_profile/instance_count
https://www.freebase.com/freebase/user_activity/primitives_live
https://www.freebase.com/freebase/user_activity/primitives_written
https://www.freebase.com/freebase/user_activity/topics_live
https://www.freebase.com/freebase/user_activity/types_live
https://www.freebase.com/freebase/user_activity/user

https://www.freebase.com/pipeline/delete_task/delete_guid
https://www.freebase.com/pipeline/task/status
https://www.freebase.com/pipeline/task/votes
https://www.freebase.com/pipeline/vote/vote_value

The properties below are for text and images which were uploaded and
are of questionable provenance/rights status, so can probably be
ignored (and aren't made available by Google in current data dumps):

https://www.freebase.com/type/content/blob_id
https://www.freebase.com/type/content_import/content
https://www.freebase.com/type/content_import/header_blob_id
https://www.freebase.com/type/content_import/uri
https://www.freebase.com/type/content/language (language of work (or
name) (P407); see also original language of work (P364))
https://www.freebase.com/type/content/length
https://www.freebase.com/type/content/media_type
https://www.freebase.com/type/content/source
https://www.freebase.com/type/content/text_encoding

Re: [Wikidata-l] freebase id - wikidata id

2015-05-17 Thread Tom Morris
Hi Ed. In addition to the API(s), there's a specific Freebase-Wikidata
mapping dump which is a little old (18 months), but very compact and easy
to work with.  The identifiers are likely to be pretty stable, so if the
entities you are interested in have been around for a while, it might be
easier to work with.

   https://developers.google.com/freebase/data#freebase-wikidata-mappings

The three hits for your sample topic make me suspicious about the quality
of the mapping though

$ curl http://storage.googleapis.com/freebase-public/fb2w.nt.gz | zgrep
'm.04hcw'
<http://rdf.freebase.com/ns/m.04hcw> <http://www.w3.org/2002/07/owl#sameAs> <http://www.wikidata.org/entity/Q9391> .
<http://rdf.freebase.com/ns/m.04hcw4> <http://www.w3.org/2002/07/owl#sameAs> <http://www.wikidata.org/entity/Q1579293> .
<http://rdf.freebase.com/ns/m.04hcwh> <http://www.w3.org/2002/07/owl#sameAs> <http://www.wikidata.org/entity/Q1347932> .

I don't know if this dump was used as the basis for the identifiers loaded
into Wikidata, but you might want to consider building in some safeguards
like double checking that names match -- whatever lookup scheme you decide
to use.
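
One possible safeguard along those lines (a minimal sketch: it parses the
N-Triples dump with an exact-MID match to avoid the prefix-match trap shown
above, and leaves the actual name comparison as a simple eyeball check):

import gzip, re, requests

def load_fb2w(path="fb2w.nt.gz"):
    """Map Freebase MIDs ("/m/...") to Wikidata QIDs from the mapping dump."""
    pattern = re.compile(r"<http://rdf\.freebase\.com/ns/(m\.[^>]+)>.*"
                         r"<http://www\.wikidata\.org/entity/(Q\d+)>")
    mapping = {}
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            m = pattern.search(line)
            if m:
                mapping["/" + m.group(1).replace(".", "/")] = m.group(2)
    return mapping

def wikidata_label(qid, lang="en"):
    r = requests.get("https://www.wikidata.org/w/api.php",
                     params={"action": "wbgetentities", "ids": qid,
                             "props": "labels", "languages": lang,
                             "format": "json"})
    entity = r.json()["entities"][qid]
    return entity.get("labels", {}).get(lang, {}).get("value")

mapping = load_fb2w()
qid = mapping.get("/m/04hcw")     # exact MID, so no m.04hcw4 / m.04hcwh hits
print(qid, wikidata_label(qid))   # compare against the name you have for the MID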

Tom

On Sun, May 17, 2015 at 10:23 AM, Ed Summers e...@pobox.com wrote:

 Hi all,

 I was wondering if anyone had any advice for mapping a set of Freebase
 identifiers to WikiData identifiers. I’m looking to port over a few
 thousand entities in use in an application since Freebase is going to be
 shut off next month.

 My apologies if this is a rudimentary question. I took a quick look at the
 API [1] and didn’t see an obvious way of doing it. Or is the WikiData Query
 API [2] a better fit for this? I was able to figure out how to do a lookup
 based on the FreebaseID,

 https://wdq.wmflabs.org/api?q=string[646:'/m/04hcw']

 Any tips or pointers would be appreciated.

 //Ed

 [1] https://www.wikidata.org/w/api.php
 [2] https://wdq.wmflabs.org/api_documentation.html

 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l


___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] freebase id - wikidata id

2015-05-17 Thread Tom Morris
Oops! Never mind!

On Sun, May 17, 2015 at 5:28 PM, Tom Morris tfmor...@gmail.com wrote:


 The three hits for your sample topic make me suspicious about the quality
 of the mapping though

 $ curl http://storage.googleapis.com/freebase-public/fb2w.nt.gz | zgrep
 'm.04hcw'
 <http://rdf.freebase.com/ns/m.04hcw> <http://www.w3.org/2002/07/owl#sameAs> <http://www.wikidata.org/entity/Q9391> .
 <http://rdf.freebase.com/ns/m.04hcw4> <http://www.w3.org/2002/07/owl#sameAs> <http://www.wikidata.org/entity/Q1579293> .
 <http://rdf.freebase.com/ns/m.04hcwh> <http://www.w3.org/2002/07/owl#sameAs> <http://www.wikidata.org/entity/Q1347932> .


Those are prefix matches, of course.  A better grep is:

$ curl http://storage.googleapis.com/freebase-public/fb2w.nt.gz | zgrep
'm.04hcw>'

Tom
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] OpenStreetMap + Wikidata for light houses

2015-04-23 Thread Tom Morris
On Thu, Apr 23, 2015 at 10:16 AM, Edward Betts edw...@4angle.com wrote:

 Thad Guidry thadgui...@gmail.com wrote:
  I helped with the Lighthouses schema in Freebase.
 
  Some of which is based on List of Lights (NGA) USA.
 
  I have DB conversion data for the PDFs...just never got around to loading
  them all in.
 
  Let me know if I can help.

 I was able to match 407 lighthouses on OSM and Wikidata:


 http://edwardbetts.com/osm-wikidata/2015-04-18/match_results/Lighthouses.html


How can something be an islet, a nature preserve, and a lighthouse all at
once?

http://www.wikidata.org/wiki/Q3372089

Tom
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Suggestions for improvements of Wikidata

2015-04-16 Thread Tom Morris
Any chance you could put this list up on the wiki? Perhaps in your user
space. It'd be interesting to see these issues end up being tracked in
Phabricator and hopefully fixed. :)

Yours,

--
Tom Morris
http://tommorris.org/

___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Names, Aliases, Copyright (and a little OpenStreetMap)

2015-03-11 Thread Tom Morris
On Wed, Mar 11, 2015 at 9:59 AM, Joe Filceolaire filceola...@gmail.com
wrote:

 I suspect these are not wikidata aliases. They are probably labels in
 other languages.

 While wikidata doesn't have a multilingual datatype it does allow you to
 add labels (and aliases) in any language and these labels, if they are
 correct, are the appropriate thing to use to localise osm place names

Do labels provide attribution/sources?  That seems to be the key missing
ingredient in the OP's query.

Tom


 Hope this helps
 On 10 Mar 2015 22:21, Markus Krötzsch mar...@semantic-mediawiki.org
 wrote:

 On 10.03.2015 17:09, Daniel Kinzler wrote:

 Am 10.03.2015 um 16:55 schrieb Markus Krötzsch:

 Hi Serge,

 The short answer to this is that the purpose of aliases in Wikidata is
 to help
 searching for items, and nothing more. Aliases may include nicknames
 that are in
 no way official, and abbreviations that are not valid if used in another
 context. Therefore, they seem to be a poor source of data to import
 into other
 projects.

 Wikidata has properties such as birth name
 (https://www.wikidata.org/wiki/Property:P1477) that are used to
 provide properly
 sourced multi-lingual text data for items.


 Note: Wikidata doesn't yet support multilingual property values, only
 monolingual (language + value). However, multiple statements about the
 respective property can be used to provide values in different
 languages. That's
 actually desirable in this case, since it allows different sources to be
 given
 for different languages.


 Good point; my formulation was ambiguous. In fact, alias-like properties
 are a case where you really want monolingual text data, since there is no
 one-to-one correspondence between aliases in different languages.

 Markus


 For towns and streets, the best property to use would probably be
 official
 name https://www.wikidata.org/wiki/Property:P1448



 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l


 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l


___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] [Dbpedia-discussion] [Dbpedia-developers] DBpedia-based RDF dumps for Wikidata

2015-03-11 Thread Tom Morris
Sebastian,

Thanks very much for the explanation.  It was a single missing word,
"ontology", which led me astray.  If the opening sentence had said "based
on the DBpedia ontology", I probably would have figured it out.  Your
amplification of the underlying motivation helps me better understand
what's driving this though.

I guess I had naively abandoned critical thinking and assumed DBpedia was
dead now that we had WikiData without thinking about how the two could
evolve / compete / cooperate / thrive.

Good luck!

Best regards,
Tom

On Wed, Mar 11, 2015 at 4:29 PM, Sebastian Hellmann 
hellm...@informatik.uni-leipzig.de wrote:

 Your description sounds quite close to what we had in mind. The high level
 group is manifesting quite well, the domain groups are planned as pilots
 for selected domains (e.g. Law or Mobility).

 I have somewhat lost the overview of the data classification. We might
 auto-link or crowdsource; I would need to ask others, however.

 We are aiming to create a structure that allows stability and innovation
 in an economic way -- I see this as the real challenge...

 Jolly good show,
 Sebastian




 On 11 March 2015 20:53:55 CET, John Flynn jflyn...@verizon.net wrote:

 This is a very ambitious, but commendable, goal. To map all data on the
 web to the DBpedia ontology is a huge undertaking that will take many
 years of effort. However, if it can be accomplished the potential payoff is
 also huge and could result in the realization of a true Semantic Web. Just
 as with any very large and complex software development effort, there needs
 to be a structured approach to achieving the desired results. That
 structured approach probably involves a clear requirements analysis and
 resulting requirements documentation. It also requires a design document
 and an implementation document, as well as risk assessment and risk
 mitigation. While there is no bigger believer in the build a little, test
 a little rapid prototyping approach to development, I don't think that is
 appropriate for a project of this size and complexity. Also, the size and
 complexity also suggest the final product will likely be beyond the scope
 of any individual to fully comprehend the overall ontological structure.
 Therefore, a reasonable approach might be to break the effort into smaller,
 comprehensible segments. Since this is a large ontology development effort,
 segmenting the ontology into domains of interest and creating working
 groups to focus on each domain might be a workable approach. There would
 also need to be a working group that focus on the top levels of the
 ontology and monitors the domain working groups to ensure overall
 compatibility and reduce the likelihood of duplicate or overlapping
 concepts in the upper levels of the ontology and treats universal concepts
 such as "space" and "time" consistently. There also needs to be a clear,
 and hopefully simple, approach to mapping data on the web to the DBpedia
 ontology that will accommodate both large data developers and web site
 developers.  It would be wonderful to see the worldwide web community
 get behind such an initiative and make rapid progress in realizing this
 commendable goal. However, just as special interests defeated the goal of
 having a universal software development approach (Ada), I fear the same
 sorts of special interests will likely result in a continuation of the
 current myriad development efforts. I understand the "one size doesn't fit
 all" arguments, but I also think "one size could fit a whole lot" could be
 the case here.



 Respectfully,



 John Flynn

 http://semanticsimulations.com





 *From:* Sebastian Hellmann [mailto:hellm...@informatik.uni-leipzig.de]
 *Sent:* Wednesday, March 11, 2015 3:12 AM
 *To:* Tom Morris; Dimitris Kontokostas
 *Cc:* Wikidata Discussion List; dbpedia-ontology;
 dbpedia-discuss...@lists.sourceforge.net; DBpedia-Developers
 *Subject:* Re: [Dbpedia-discussion] [Dbpedia-developers] DBpedia-based
 RDF dumps for Wikidata



 Dear Tom,

 let me try to answer this question in a more general way.  In the future,
 we are honestly considering mapping all data on the web to the DBpedia
 ontology (extending it where it makes sense). We hope that this will enable
 you to query many data sets on the Web using the same queries.

 As a convenience measure, we will get a huge download server that
 provides all data from a single point in consistent formats and consistent
 metadata, classified by the DBpedia Ontology.  Wikidata is just one
 example; there are also Commons, Wiktionary (hopefully via DBnary), data
 from companies, DBpedia members and EU projects.

 all the best,
 Sebastian

 On 11.03.2015 06:11, Tom Morris wrote:

 Dimitris, Soren, and DBpedia team,



 That sounds like an interesting project, but I got lost between the
 statement of intent, below, and the practical consequences:



 On Tue, Mar 10, 2015 at 5:05 PM, Dimitris Kontokostas 
 kontokos...@informatik.uni-leipzig.de wrote:

 we made some different design

Re: [Wikidata-tech] Thoughts on (not) exposing a SPARQL endpoint

2015-03-10 Thread Tom Morris
On Tue, Mar 10, 2015 at 6:17 PM, Markus Krötzsch 
mar...@semantic-mediawiki.org wrote:

 TL;DR: No concrete issues with SPARQL were mentioned so far; OTOH many
 *simple* SPARQL queries are not possible in WDQ; there is still time to
 restrict ourselves -- let's give SPARQL a chance before going back.


TLDR, so SPARQL is the one true way.


 Nik and Stas have made a careful analysis of the options, ...


citation please

Tom
___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Thoughts on (not) exposing a SPARQL endpoint

2015-03-10 Thread Tom Morris
How long has WDQ been in service?  What proportion of all the Wikidata apps
that will exist over the project's lifetime, presuming it survives, do the
current (as of March 2015) apps represent?

Should the question of premature optimization (or optimisation) be
considered?

Tom

p.s. Since your opinion doesn't represent the official team position, what,
exactly, *IS* the official team position?

p.p.s. I don't disagree that there are strong negative aspects to using
SPARQL, but you weaken your argument by saying that the status quo is the
only way forward.
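
For anyone following along who has not used either language, here is roughly
how the same trivial question -- all items that are instances of human --
looks in the two. The WDQ form is from memory, and the SPARQL form uses the
wd:/wdt: prefix shorthand rather than full entity URIs; the snippet just
prints the two queries side by side without hitting any endpoint.

    # The same query expressed in WDQ and in SPARQL, for comparison only.
    wdq_query = "CLAIM[31:5]"   # WDQ: items with a P31 (instance of) claim whose value is Q5 (human)

    sparql_query = """
    SELECT ?item WHERE {
      ?item wdt:P31 wd:Q5 .    # instance of: human
    }
    """

    print("WDQ:   ", wdq_query)
    print("SPARQL:", sparql_query)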

On Tue, Mar 10, 2015 at 10:31 AM, Daniel Kinzler 
daniel.kinz...@wikimedia.de wrote:

 Hi all!

 After the initial enthusiasm, I have grown increasingly wary of the
 prospect of
 exposing a SPARQL endpoint as Wikidata's canonical query interface. I
 decided to
 share my (personal and unfinished) thoughts about this on this list, as
 food for
 thought and a basis for discussion.

 Basically, I fear that exposing SPARQL will lock us in with respect to the
 backend technology we use. Once it's there, people will rely on it, and
 taking
 it away would be very harsh. That would make it practically impossible to
 move
 to, say, Neo4J in the future. This is even more true if we expose
 vendor-specific extensions like RDR/SPARQL*.

 Also, exposing SPARQL as our primary query interface probably means
 abruptly
 discontinuing support for WDQ. It's pretty clear that the original WDQ
 service
 is not going to be maintained once the WMF offers infrastructure for
 wikidata
 queries. So, when SPARQL appears, WDQ would go away, and dozens of tools
 will
 need major modifications, or would just die.


 So, my proposal is to expose a WDQ-like service as our primary query
 interface.
 This follows the general principle of having narrow interfaces to make it
 easy to
 swap out the implementation.

 But the power of SPARQL should not be lost: A (sandboxed) SPARQL endpoint
 could
 be exposed to Labs, just like we provide access to replicated SQL databases
 there: on Labs, you get raw access, with added performance and
 flexibility,
 but no guarantees about interface stability.

 In terms of development resources and timeline, exposing WDQ may actually
 get us
 a public query endpoint more quickly: sandboxing full SPARQL may likely
 turn out
 to be a lot harder than sandboxing the more limited set of queries WDQ
 allows.

 Finally, why WDQ and not something else, say, MQL? Because WDQ is
 specifically
 tailored to our domain and use case, and there already is an ecosystem of
 tools
 that use it. We'd want to refine it a bit I suppose, but by and large, it's
 pretty much exactly what we need, because it was built around the actual
 demand
 for querying wikidata.


 So far my current thoughts. Note that this is not a decision or
 recommendation
 by the Wikidata team, just my personal take.

 -- daniel


 --
 Daniel Kinzler
 Senior Software Developer

 Wikimedia Deutschland
 Gesellschaft zur Förderung Freien Wissens e.V.

 ___
 Wikidata-tech mailing list
 Wikidata-tech@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-tech

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-l] OpenStreetMap + Wikidata for light houses

2015-03-10 Thread Tom Morris
On Tue, Mar 10, 2015 at 6:41 PM, Markus Krötzsch 
mar...@semantic-mediawiki.org wrote:


 For example, you can see that Portugal has a lot of lighthouses while
 Spain has almost none -- maybe we need to look at our data there ;-)


Perhaps it's a language confusion issue, but does Spain really have few
lighthouses?  That would seem VERY unusual for a territory with an
extensive coastline.

Or am I confusing cyber reality with real reality?

Where does Wikidata sit in that mix?  How does it compare with DBpedia,
Freebase, Wikipedia, or even *real* data sources like official governmental
lists of navigational aids for mariners at sea?

Independent of where it sits now, where does it aspire to sit?

Tom
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] [Dbpedia-discussion] DBpedia-based RDF dumps for Wikidata

2015-03-10 Thread Tom Morris
Dimitris, Soren, and DBpedia team,

That sounds like an interesting project, but I got lost between the
statement of intent, below, and the practical consequences:

On Tue, Mar 10, 2015 at 5:05 PM, Dimitris Kontokostas 
kontokos...@informatik.uni-leipzig.de wrote:

 we made some different design choices and map wikidata data directly into
 the DBpedia ontology.


What, from your point of view, is the practical consequence of these
different design choices?  How do the end results manifest themselves to
the consumers?

Tom
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Kian: The first neural network to serve Wikidata

2015-03-08 Thread Tom Morris
On Sun, Mar 8, 2015 at 7:34 AM, Amir Ladsgroup ladsgr...@gmail.com wrote:

 This is the result for German Wikipedia:
 ... so I got a list of articles in German Wikipedia that don't have an item
 in Wikidata.  There were 16K articles ... When the number is below 0.50, it
 is obvious that they are not human. Between 0.50-0.61 there are 78 articles
 where the bot can't determine whether it's a human or not [1], and articles
 with more than 0.61 are definitely human. I used 0.62 just to be sure and
 created 3600 items with P31:5 in them.


"Definitely human" in this context means that you did 100% verification of
the 3600 items and one (or more?) human(s) agreed with the bot's judgement
in these cases?  Or that you validated a statistically significant sample
of the 3600? Or something else?

Tom
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


[Wikidata-l] Freebase VIAF error rate (was next Wikidata office hour)

2015-01-17 Thread Tom Morris
2015-01-17 4:27 GMT-05:00 Lydia Pintscher lydia.pintsc...@wikimedia.de:


 The log is at
 https://meta.wikimedia.org/wiki/IRC_office_hours/Office_hours_2015-01-16
 for anyone who couldn't make it.


Denny discusses importing all missing VIAF keys from Freebase using
multichill (unclear what that is from the context) on the assumption that
the error rate is low.  It would be worth checking assumptions like that
with folks who are familiar with the Freebase data before acting on them.

Here are some things that I think are true about the VIAF keys in Freebase:

- they were assigned by a user, not by Google/Metaweb (not necessarily a
bad thing since some of the biggest problems in Freebase were created by
G/M and some users have contributed very high quality data)

- the keys were, I believe, assigned based heavily on existing Library of
Congress identifiers that had previously been assigned by Metaweb.  Those
key assignments are not as high quality as other parts of Freebase.  One
easy thing to check for is people with two LC (and thus two VIAF) keys
assigned.  In cases where there is more than one key and the extra keys don't
represent pseudonyms, this is a clear error (a sketch of such a check follows
below, after this list).

- Freebase doesn't create separate entities for pseudonyms, unlike the
library cataloging world.  Depending on what decision Wikidata makes in
this regard, it's something to watch out for when reusing Freebase author
data (including VIAF keys)

- much Freebase author data was imported from OpenLibrary which has its own
set of quality issues.  A bunch of this data was later deleted, leaving
that portion of the graph somewhat thready and moth-eaten.  It's unclear
whether that was a net gain or loss in overall bibliographic data quality
for Freebase.

- I suspect that most VIAF keys which are in Freebase and not Wikidata
represent entities which are not in Wikidata, which means they aren't useful
anyway, since he wants to focus on creating new links, not new entities (a
direction that I'm not sure I agree with, but that's a whole separate
discussion).
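
As a sketch of the "two keys" check mentioned above: scan a dump, collect
identifier values per subject, and flag subjects with more than one. The file
name and the predicate below are placeholders (the exact predicate for LC/VIAF
keys needs to be looked up in the Freebase RDF dump), so treat this as
illustrative rather than something to run as-is.

    import gzip
    from collections import defaultdict

    DUMP = "freebase-rdf-latest.gz"   # placeholder name for a local copy of the Freebase RDF dump
    KEY_PREDICATE = "<http://rdf.freebase.com/key/authority.viaf>"   # placeholder: check the real predicate in the dump

    ids_per_subject = defaultdict(set)
    with gzip.open(DUMP, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            parts = line.rstrip().split("\t")   # the dump I have is tab-separated; adjust the split if yours isn't
            if len(parts) >= 3 and parts[1] == KEY_PREDICATE:
                ids_per_subject[parts[0]].add(parts[2])

    # Subjects with more than one key are candidates for the "clear error" case,
    # modulo pseudonyms, which still need human review.
    for subject, ids in sorted(ids_per_subject.items()):
        if len(ids) > 1:
            print(subject, sorted(ids))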

One of the key inputs to judging the quality of assertions is their
provenance. Fortunately, this is recorded for all assertions in Freebase
and it's possible to trace a given fact back to the user, toolchain, or
process that added it to the database. Unfortunately, this information is
only available through the Freebase API, not the bulk data dump.
Hopefully, this will change before Google completely abandons Freebase.

If any Wikidata folk want to discuss VIAF keys in Freebase (or its author
data in general), feel free to get in touch.

Tom
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] How to deal with bad data / vandalism?

2015-01-05 Thread Tom Morris
On Mon, Jan 5, 2015 at 8:45 PM, Ben McCann b...@benmccann.com wrote:


 I see the page for Google lists Mario Artero as the founder and CEO of
 Google. This is definitely not correct. What is the procedure for fixing
 this? I can't tell from the history page which edit added that.

 http://www.wikidata.org/wiki/Q95


When I click on the Mario Artero founder property value, I end up on Larry
Page's page (sic), so there may be something more fundamental going on than
simple vandalism.

Tom
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Reasonator ignores of qualifier

2014-06-17 Thread Tom Morris
Sad to see the Deletionists taking hold on Wikidata too.

Tom


On Mon, Jun 16, 2014 at 4:31 PM, Thomas Douillard 
thomas.douill...@gmail.com wrote:

 Yeah, there seem to be some cognitive dissonance going on here, it's weird.


 2014-06-16 22:08 GMT+02:00 Derric Atzrott datzr...@alizeepathology.com:

  That's certainly what the policy says. It's not what some admins accept,
 though.
 
  A direct quote from one, from as recently as March this year:
 
*   The general spirit of the notability policy is that Wikipedia
  finds [the subject] notable

 This was also the general vibe that I had gotten that informed my
 understanding of
 notability on Wikidata before someone pointed out that policy actually
 says
 differently.

 Thank you,
 Derric Atzrott


 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l



 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l


___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] weekly summary #94

2014-01-27 Thread Tom Morris
On Mon, Jan 27, 2014 at 2:24 PM, Amir E. Aharoni 
amir.ahar...@mail.huji.ac.il wrote:


 These updates look a lot like blog posts, so they should be blog posts.


You've said that multiple times without getting anyone to agree with you.

Personally, I liked it when we got the full update via email without having
to click through to read a web page.

Bring back email updates, please.

Tom
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] [Wikisource-l] DNB 11M bibliographic records as CC0

2013-12-09 Thread Tom Morris
This doesn't reflect my understanding of the situation at OpenLibrary.

On Mon, Dec 9, 2013 at 5:31 AM, Andrea Zanni zanni.andre...@gmail.comwrote:


 I'd love too to collaborate with openlibrary, but at the beginning of our
 IEG project, me and Micru contacted them, in the person of Karen Coyle
 (User:Kcoyle),
 a very famous and skilled metadata librarian who is somehow in charge of
 the project now.
 She told us that openlibrary is frozen, at the moment, and there is no
 staff nor funds to get that going.
 Openlibrary was previously funded by the Internet Archive.


Karen has worked for OpenLibrary in the past, but I don't know if she
currently does and she's certainly not in charge.

OpenLibrary is owned and funded by the Internet Archive (i.e. Brewster
Kahle).  It is funded at a much lower level than it has been historically.


 If someone could build the tool you proposed, Luiz, that would be awesome,
 but I'm not a technical person and I'm not able to understnd if that is
 feasible or not.


I'm not sure I agree.  There's a lot of good data in OpenLibrary, but
there's also a lot of junk.  Freebase imported a bunch of OpenLibrary data,
after winnowing it to what they thought was the good stuff, and still ended
up deleting a bunch of the supposedly good stuff later because they found
their goodness criteria hadn't been strict enough.

One of the reasons OpenLibrary is such a mess is that *they* arbitrarily
imported junky data (e.g. Amazon-scraped records).  The last thing the
world needs is more duplicate copies of random junk.  We've already got the
DPLA for that. :-)

Another issue with the OpenLibrary metadata is that there's no clear
license associated with it.  IA's position is that they got it from
wherever they got it from and you're on your own if you want to reuse it,
which isn't very helpful.  The provenance for major chunks of it is
traceable and new stuff by users is nominally being contributed under CC0,
so they could probably be sorted out with enough effort (although the same
thing is true of the data quality issues too).

If Ed Summers (or any other capable programmer) is going to sign up to
solve this problem for you guys, I'm happy to help with my knowledge of the
state of play, but it's a *very* sizeable project.

Tom
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Is Wikidata VIAF data being used in Wikipedia?

2013-10-17 Thread Tom Morris
Thanks for the feedback, everyone.  Not the unambiguous "Use this, it's the
best source" answer I was hoping for, but I've got a better understanding
of the issues.

Aubrey - The Italian approach sounds good (and I like the position on the
page where the VIAF et al identifiers are rendered), but appears to still
depend on the inclusion of the {{Controllo di autorità}} template, which is
missing in this case: http://it.wikipedia.org/wiki/Jos%C3%A9_Mujica  Not sure
if that's just a synchronization issue or something where people need to
manually notice that the data is available in VIAF and include it (seems
like a job for a bot).  Also, for Max's example, it only includes one of
the two different VIAF identifiers: http://www.wikidata.org/wiki/Q18391

GerardM - The Occitan approach sounds good in theory, but when I look at
these two pages currently, they not only don't include the VIAF identifier,
but they've got all kinds of other rendering problems.
http://oc.wikipedia.org/wiki/Jos%C3%A9_Alberto_Mujica
http://oc.wikipedia.org/wiki/James_Clerk_Maxwell

Max - I'm not familiar enough with the universe of possible technical
options or the Wikipedians' culture to really have a valuable opinion, but
that sounds like it would probably be too aggressive, judging from some of
the comments that I've read along the lines of "not sure if I want to invest
the time to learn how to edit things in a new (i.e. Wikidata) way".

Clearly data that's not visible isn't going to get reviewed or corrected.
 The trick is to make it visible in a way that makes the local editors
still believe they have control.
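
(For what it's worth, anyone who just wants the value Wikidata currently holds
can bypass the rendering question entirely and ask the API. A minimal sketch,
using the wbgetclaims module against one of the two items above; if the
response layout differs from what the comment describes, adjust the path
accordingly.)

    import json
    import urllib.request

    # Ask Wikidata directly for the P214 (VIAF) claims on an item, e.g. Q9094.
    URL = ("https://www.wikidata.org/w/api.php"
           "?action=wbgetclaims&entity=Q9094&property=P214&format=json")

    with urllib.request.urlopen(URL) as resp:
        data = json.load(resp)

    # Each claim's string value should sit under mainsnak.datavalue.value.
    for claim in data.get("claims", {}).get("P214", []):
        print(claim["mainsnak"]["datavalue"]["value"])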

Tom



On Wed, Oct 16, 2013 at 2:32 PM, Gerard Meijssen
gerard.meijs...@gmail.comwrote:

 Tom,

 On the Occitan Wikipedia they indicate the VIAF (and other) identifiers
 are shown from Wikidata. When a new identifier is added, this new
 identifier will be shown as well. When it is changed it is changed as well.

 When information in both Wikidata and Wikipedia is the same, it would be
 good when the information is removed from Wikipedia (and shown from
 Wikidata). When there is a difference, the information needs to be verified
 and the resolution needs to go to Wikidata and sourced. In this way
 information will gradually become be improved and be available in more
 Wikipedias.

 Yes you can concentrate your efforts on Wikipedia but it will only benefit
 one Wikipedia. It could do so much more good.
 Thanks,
   GerardM


 On 16 October 2013 18:34, Tom Morris tfmor...@gmail.com wrote:

 If I want the most current/accurate VIAF ids, should I be looking at
 Wikidata or Wikipedia?

 When I look at the EN Wikipedia pages for these two topics:

 http://www.wikidata.org/wiki/Q9094
 http://www.wikidata.org/wiki/Q9095

 both of which have property P214 (the VIAF identifier), the second displays
 the VIAF identifier but the first doesn't, and the one that does display
 the identifier appears to be using information from the embedded
 AuthorityControl template, not Wikidata.

 My concern is that if the Wikidata VIAF data isn't being viewed/edit on
 Wikipedia, it can easily be invisibly wrong like the infamous Persondata
 template.

 Tom


 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l



 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l


___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


[Wikidata-l] Is Wikidata VIAF data being used in Wikipedia?

2013-10-16 Thread Tom Morris
If I want the most current/accurate VIAF ids, should I be looking at
Wikidata or Wikipedia?

When I look at the EN Wikipedia pages for these two topics:

http://www.wikidata.org/wiki/Q9094
http://www.wikidata.org/wiki/Q9095

both of which have property P214 (the VIAF identifier), the second displays
the VIAF identifier but the first doesn't, and the one that does display
the identifier appears to be using information from the embedded
AuthorityControl template, not Wikidata.

My concern is that if the Wikidata VIAF data isn't being viewed/edit on
Wikipedia, it can easily be invisibly wrong like the infamous Persondata
template.

Tom
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Application: sexing people by name/research gender bias

2013-10-15 Thread Tom Morris
On Tue, Oct 15, 2013 at 7:50 AM, Markus Krötzsch 
mar...@semantic-mediawiki.org wrote:

 My error margins are far too wide to make any realistic statement about
 minority genders even if I had a method to consider them.


This article: http://journal.code4lib.org/articles/8964
gives them as being in the range 0.002% - 0.006%
so they're unlikely to affect any real-world analysis.

Tom
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Application: sexing people by name/research gender bias

2013-10-15 Thread Tom Morris
So you've got an agenda that's unrelated to Wikidata or analysis thereof.
 Got it.  Perhaps a non-Wikidata list would be a more appropriate forum.

On Tue, Oct 15, 2013 at 2:08 PM, Klein,Max kle...@oclc.org wrote:

Sorry to rant.


Accepted.

Tom
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Application: sexing people by name/research gender bias

2013-10-14 Thread Tom Morris
Naming patterns change over time and geography.  If you're interested in
the gender of current day authors, you should probably constrain your name
sampling to the same timeframe.

There's an app that works off the Freebase data here:
http://namegender.freebaseapps.com/

It also has an API that returns JSON:
http://namegender.freebaseapps.com/gender_api?name=andrea
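
(If you want to poke at it programmatically, the endpoint takes a name
parameter and returns JSON; a minimal call looks like the following. I'm not
assuming anything about the response fields here, just dumping whatever comes
back, and of course the app has to still be up for this to work.)

    import json
    import urllib.request

    # Query the gender-by-name endpoint mentioned above and dump the JSON it returns.
    url = "http://namegender.freebaseapps.com/gender_api?name=andrea"
    with urllib.request.urlopen(url) as resp:
        print(json.dumps(json.load(resp), indent=2))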

Based on the top name stats, it looks like its sample is a little more than
twice the size of Wikidata's.

Tom

On Sun, Oct 13, 2013 at 6:16 PM, Markus Krötzsch 
mar...@semantic-mediawiki.org wrote:

 Hi all,

 I'd like to share a little Wikidata application: I just used Wikidata to
 guess the sex of people based on their (first) name [1]. My goal was to
 determine gender bias among the authors in several research areas. This is
 how some people spend their free time on weekends ;-)

 In the process, I also created a long list of first names with associated
 sex information from Wikidata [2]. It is not super clean but it served its
 purpose. If you are a researcher, then maybe the gender bias of
 journals/conferences is interesting to you as well. Details and some
 discussion of the results are online [1].

 Cheers,

 Markus

 [1] http://korrekt.org/page/Note:Sex_Distributions_in_Research
 [2] https://docs.google.com/spreadsheet/ccc?key=0AstQ5xfO-xXGdE9UVkxNc0JMVWJzNmJqNmhPRjc0cnc&usp=sharing

 _______________________________________________
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l

___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Wikidata RDF Issues

2013-10-01 Thread Tom Morris
On Tue, Oct 1, 2013 at 1:01 PM, Daniel Kinzler
daniel.kinz...@wikimedia.dewrote:

 Ok, I have now found and tackled the issue.

 This was indeed a bug in EasyRDF that got fixed since we forked half a
 year ago.
 []

 Having to maintain the fork is really a pain, I wish there was a better
 way to
 do this.


How about not creating a fork just so you can delete a couple of
directories? The full download is a whopping 260KB.  Is that really too
big/complex to include in its entirety and just ignore the parts you don't
use?

Tom
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] 'Person' or 'human', upper ontologies and migrating 4 million claims

2013-09-24 Thread Tom Morris
On 22 September 2013 at 21:24:48, Antoine Isaac (ais...@few.vu.nl) wrote:

 First, getting a clean hierarchy won't make things easier, if you end up with a
 too static/formal view on the world. Second, the feeling about the W3C
 recommendations is wrong. W3C has actually pushed SKOS to allow 'softer'
 classifications to be represented without having to undergo the ordeals and
 dangers of RDFS/OWL...

 But I realize all this might be regarded as questioning the decision you made
 earlier on using P31 and P279 instead of the GND type, so I'm going to stop
 bothering you ;-)


Agreed with some of that.

The primary problem with GND type is that it tried to reduce the whole world 
into 7 arbitrary categories. I'm not sure how any proposed alternative could be 
as barmy as that.

My general preference is towards simple, unfussy and bubble-up types, looking 
at existing systems that work and following them as far as possible (and, no, 
big formal ontology systems do not satisfy the "that work" part of that 
sentence, nor do indulgent academic thought experiments).

We don't need big-design-up-front. We need to apply common sense, the Pareto 
Principle and avoid the excesses of the academic AI/KR community that have thus 
far made things like the RDF/OWL spec impenetrable to the average person who 
doesn't know what model-theoretic semantics are. (And also left us in a state 
where we have endless specs on ontological minutiae, but nobody seems to be 
bothered about fixing datatypes.)

Indeed, one of the things that's good about RDF is precisely that because you 
use URIs to define properties and classes, you can delegate the creation of 
those classes and properties to subject matter experts. The biology/medicine 
people design the schemata they need to represent genes and drugs and so on; if 
I need a simple property to represent dietary preference, I just coin it and 
start publishing. On Wikidata, rather than trying to suppose that the ontology 
people have solved all the problems, it'd be much better if we followed actual 
usage and unified our semantics with others using things like owl:sameAs and 
equivalentProperty relations.

If I had to suggest some design principles, these would be where I start:

1. Prioritise pragmatism and common sense over theoretical unity.

2. Categorisation schemes are used by humans and implemented by humans. Design 
for humans rather than for hyper-intelligent robots or geniuses.

3. Actual usage takes priority over hypothetical use cases.

4. Use by Wikimedia projects takes priority over use by third parties.

5. Optimise for common use cases per Pareto's Principle.

6. You can apply two different types to something. Avoid creating union types. 
Wikipedia may have Jewish LGBT scientists from Portugal with a cleft lip, but 
we don't need to replicate that kind of silliness.

7. If explaining your proposed category/property/schema to the man on the 
Clapham Omnibus would cause him to laugh to the point where it would disturb 
his fellow travellers, you need to rethink your proposal.

8. Take your necktie off. You are designing a fancy computer index card system, 
not going to meet the Queen of England.

The most amusing thing in the GND discussions (beyond the hilarious defences of 
how the absurd way the GND categorises fictional characters, planets, families 
and so on is actually okay) was people predicting anarchy if we didn't 
strictly follow some kind of schema designed by librarians. It's almost as if 
Wikipedia hadn't happened: the same people would have been saying back in 2001 
that an encyclopedia written by random volunteers on the Internet would be 
impossible and the anarchic dream of pot-smoking hippies.

--
Tom Morris
http://tommorris.org/

___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] The Day the Knowledge Graph Exploded

2013-08-23 Thread Tom Morris
On Fri, Aug 23, 2013 at 8:29 AM, Michael Erdmann erdm...@diqa-pm.comwrote:


 I just stumbled upon this report
 
 http://moz.com/blog/the-day-**the-knowledge-graph-explodedhttp://moz.com/blog/the-day-the-knowledge-graph-exploded
 that tells that Google's use of its Knowledge Graph hast drastically
 increased from one day to the other.


That's an interesting report, but it's not really talking about Knowledge
Graph *use*, which is largely internal, but rather display of KG Cards in
search results -- and primarily those search results which are of interest
to SEOers.  The comments are interesting to read as well, as the readers
immediately pivot into how to SEO entities when pages become irrelevant.


 Does someone know, if this is related to WikiData in any way?!


In a word, no.  In 2010, Google acquired Metaweb, the company that built
Freebase, which forms the core of the Knowledge Graph.  Metaweb was founded
in 2005 (interesting Google search: "Metaweb founding") and started
extracting information from Wikipedia into Freebase in 2006.
https://www.freebase.com/m/0gw0?linkslang=enhistorical=true

The first DBpedia release was in 2007.  Semantic information nets go back
to the 60s. TBL coined the term semantic web back in the 1990s.

WikiData is a great project, but this progress has been building,
excruciatingly slowly, over decades.  One could even make the argument that
WikiData is the result of Knowledge Graph and its antecedents rather than
the other way around.

Tom
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Cooperation between Wikidata and DBpedia and Wikipedia

2013-08-23 Thread Tom Morris
On Fri, Aug 23, 2013 at 5:53 AM, Gerard Meijssen
gerard.meijs...@gmail.comwrote:


 I have been blogging a lot the last two days with DBpedia in mind. My
 understanding is that at DBpedia a lot of effort went into making something
 of a cohesive model of properties. Now that the main type GND is about to
 be deleted, it makes sense to adopt much of the work that has been done at
 DBpedia.

 The benefits are:

- we will get access to academically reviewed data structures
    - we do not have to wait and ponder and get into the business of
    enriching the data content of DBpedia
- we can easily compare the data in DBpedia and Wikidata
- more importantly, DBpedia has spend effort in connecting to other
resources

 Yes, we can import data from DBpedia and we can import data from
 Wikipedia. Actually we can do both. The one thing that needs to be
 considered is that we need data before we can curate it. With more data
 available it becomes more relevant to invest time in tools that compare
 data. We can start doing this now and, over time this will become more
 relevant. But now we need more properties and the associated date.


I think reviewing existing ontologies/schemas like DBpedia (or Freebase)
with an eye towards reusing them or incorporating pieces of them makes a
lot of sense.  I wouldn't take them wholesale without review though.

Importing data from DBpedia, I'd be much more wary of.  It can vary greatly
in quality depending on how it was generated.  I'd much rather see WikiData
take Freebase's approach of quality over quantity and let coverage improve
over time.

Tom
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] The Day the Knowledge Graph Exploded

2013-08-23 Thread Tom Morris
On Fri, Aug 23, 2013 at 10:10 AM, Denny Vrandečić 
denny.vrande...@wikimedia.de wrote:


 I understand Michael's question to be much more concrete: does the
 progress in Wikidata has anything to do with the changes in the Knowledge
 Graph's visibility in Google's searches that happened last month?


So, what's your opinion?

Tom
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Scope of a Wikidata entry

2013-08-11 Thread Tom Morris
On Sun, Aug 11, 2013 at 1:09 AM, Luca Martinelli
martinellil...@gmail.comwrote:

 2013/7/31 Andrew Gray andrew.g...@dunelm.org.uk:
  Hi Nicholas,
 
  a) Yes, it is about the person and the aliases together. As a general
  rule, it's one article per person, not per name.
 
  b) Different names is a quirk of the Wikipedia background - these
  default to the title of the Wikipedia article on that person, and
  there's no agreement on whether to put the article under the person or
  the more famous pseudonym.

 FYI, there is now a property for pseudonyms (
 http://www.wikidata.org/wiki/Property:P742 ).


Is it intentional to restrict the definition to personal pseudonyms?  That
doesn't cover all uses of them.  For example, there are house pseudonyms
used by publishing houses which are associated with a series and the
publishing house contracts with writers to write effectively anonymously
(although it's often known who they are).

Another example of a relatively well known collective pseudonym is
http://en.wikipedia.org/wiki/Nicolas_Bourbaki  There's a whole category of
them here: http://en.wikipedia.org/wiki/Category:Collective_pseudonyms

Tom
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Scope of a Wikidata entry

2013-07-31 Thread Tom Morris
Freebase does the same as Wikidata -- and actually has five different
MusicBrainz IDs associated
https://www.freebase.com/m/01v_pj6?props=lang=enfilter=%2Fcommon%2Ftopic%2Ftopic_equivalent_webpage

It's worth noting, however, that this "one person, one entry" view isn't
universal.  Library cataloging practice is to treat pseudonyms separately,
in the same way that you've already discovered MusicBrainz does.  It's
worth keeping this in mind when interacting with other modeling communities.

Tom




On Wed, Jul 31, 2013 at 9:20 AM, Nicholas Humfrey 
nicholas.humf...@bbc.co.uk wrote:

 That is great, thanks for the clarification Andrew.

 I guess as a rule of thumb, the label of the entry in Wikipedia should
 match label of the linked entry in MusicBrainz.

 nick.


 -Original Message-
 From: Andrew Gray andrew.g...@dunelm.org.uk
 Date: Wednesday, 31 July 2013 12:59
 To: Discussion list for the Wikidata project.
 wikidata-l@lists.wikimedia.org
 Cc: Nicholas Humfrey nicholas.humf...@bbc.co.uk
 Subject: Re: [Wikidata-l] Scope of a Wikidata entry

 Hi Nicholas,
 
 a) Yes, it is about the person and the aliases together. As a general
 rule, it's one article per person, not per name.
 
 b) Different names is a quirk of the Wikipedia background - these
 default to the title of the Wikipedia article on that person, and
 there's no agreement on whether to put the article under the person or
 the more famous pseudonym.
 
 c) At the moment, yes, there would need to be separate Wikipedia
 pages. I think for the specific case of people with pseudonyms,
 Wikidata is likely to continue on a one entity rule even if we relax
 the Wikipedia requirement.
 
 d) I think the initial assumption was that there was a 1=1 match, but
 if there are multiple musicbrainz id's representing facets of the same
 entity, then Wikidata will support adding several.
 
 Andrew.
 
 On 31 July 2013 12:45, Nicholas Humfrey nicholas.humf...@bbc.co.uk
 wrote:
  Hello,
 
  Can you help me understand the scope of a Wikidata entry please?
 
 
  What is this Wikidata entry for?
  http://www.wikidata.org/wiki/Q272619
 
  Is it for the person Norman Cook and all of his aliases?
  Should that title be Fatboy Slim or Norman Cook?
  Is it ok that it has different titles in different languages?
 
  Do there have to be separate Wikpedia pages before we can create
 separate
  Wikidata entities for the separate concepts?
 
 
  In MusicBrainz there are three artists that point to the 'Norman Cook'
  Wikipedia page:
 
  http://musicbrainz.org/artist/3150be04-f42f-43e0-ab5c-77965a4f7a7d
  http://musicbrainz.org/artist/34c63966-445c-4613-afe1-4f0e1e53ae9a
  http://musicbrainz.org/artist/ba81eb4a-0c89-489f-9982-0154b8083a28
 
  Should they all be pointing at the same Wikidata entry too?
 
 
 
 
 
  Is it ok that there is only a single MusicBrainz identifier in Wikidata?
  How is that identifier chosen?
 
 
  The problem that we are experiencing is that our Triplestore is merging
  all these concepts together into a single entity and I am trying to work
  out where to break the equivalence, or if it is even a problem.
 
 
  Thanks!
 
  nick.
 
 
 
  -
  http://www.bbc.co.uk
  This e-mail (and any attachments) is confidential and
  may contain personal views which are not the views of the BBC unless
 specifically stated.
  If you have received it in
  error, please delete it from your system.
  Do not use, copy or disclose the
  information in any way nor act in reliance on it and notify the sender
  immediately.
  Please note that the BBC monitors e-mails
  sent or received.
  Further communication will signify your consent to
  this.
  -
 
  ___
  Wikidata-l mailing list
  Wikidata-l@lists.wikimedia.org
  https://lists.wikimedia.org/mailman/listinfo/wikidata-l
 
 
 
 --
 - Andrew Gray
   andrew.g...@dunelm.org.uk



 -
 http://www.bbc.co.uk
 This e-mail (and any attachments) is confidential and
 may contain personal views which are not the views of the BBC unless
 specifically stated.
 If you have received it in
 error, please delete it from your system.
 Do not use, copy or disclose the
 information in any way nor act in reliance on it and notify the sender
 immediately.
 Please note that the BBC monitors e-mails
 sent or received.
 Further communication will signify your consent to
 this.
 -

 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l

___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Some Wiktionary data in Wikidata

2013-06-19 Thread Tom Morris
If you haven't already, it might be worth looking at the Freebase schema
for Wordnet, especially how it connects synsets to Freebase topics:

https://www.freebase.com/base/wordnet/synset?schema=

Tom


On Wed, Jun 19, 2013 at 9:57 AM, Denny Vrandečić 
denny.vrande...@wikimedia.de wrote:

 Hello,

 I would like all interested in the interaction of Wikidata and Wiktionary
 to take a look at the following proposal. It is trying to serve all use
 cases mentioned so far, and remain still fairly simple to implement.

 http://www.wikidata.org/wiki/Wikidata:Wiktionary

 To the best of our knowledge, we have checked all discussions on this
 topic, and also related work like OmegaWiki, Wordnet, etc., and are
 building on top of that.

 I would extremely appreciate if some liaison editors could reach out to
 the Wiktionaries in order to get a wider discussion base. We are currently
 reading more on related work and trying to improve the proposal.

 It would be great if we could keep the discussion on the discussion page
 on the wiki, so to bundle it a bit. Or at least have pointers there.

 http://www.wikidata.org/wiki/Wikidata_talk:Wiktionary

 Note that we are giving this proposal early. Implementation has not
 started yet (obviously, otherwise the discussion would be a bit moot), and
 this is more a mid-term commitment (i.e. if the discussion goes smoothly,
 it might be implemented and deployed by the end of the year or so, although
 this depends on the results of the discussion obviously).

 Cheers,
 Denny




 --
 Project director Wikidata
 Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
 Tel. +49-30-219 158 26-0 | http://wikimedia.de

 Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
 Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
 der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
 Körperschaften I Berlin, Steuernummer 27/681/51985.

 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l


___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Content card prototype

2013-06-02 Thread Tom Morris
Is there a way to test this without creating an account on Wikipedia?

Tom


On Sun, Jun 2, 2013 at 11:55 AM, David Cuenca dacu...@gmail.com wrote:

 Based on all feedback gathered during the RfC about a possible
 inter-project links interface [1], User:Tpt has created a content card
 prototype (image: [2])

 To activate it, follow these steps:

1. Go to your common.js file. For a user named Test in English
Wikipedia, it would be:
https://en.wikipedia.org/wiki/User:Test/common.js
    2. Modify it and paste this line: mw.loader.load('//www.wikidata.org/w/index.php?title=User:Tpt/interproject.js&action=raw&ctype=text/javascript');
3. Save and go to any Wikipedia page
4. You should see an icon next to the article title, if you don't,
refresh your browser cache. *Instructions*: Internet Explorer: hold
down the Ctrl key and click the Refresh or Reload button. Firefox: hold
down the Shift key while clicking Reload (or press Ctrl-Shift-R). Google
Chrome and Safari users can just click the Reload button.

 What it does:

- It displays an icon next to the article title
- When you hover your mouse over the icon it shows a *content card*.
- The content card displays information from Wikidata: label, image,
link to Commons gallery, and link to edit Wikidata.

 What it is supposed to do in the future when Wikidata supports sister
 projects:

- It will display contents or links to sister projects

 Please leave your feedback on the Request for comments, thanks!


 http://meta.wikimedia.org/wiki/Requests_for_comment/Interproject_links_interface#Comments

 Cheers,
 Micru


 [1]
 http://meta.wikimedia.org/wiki/Requests_for_comment/Interproject_links_interface
 [2] http://commons.wikimedia.org/wiki/File:Content-card-prototype.png

 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l


___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Running Infobox film import script

2013-04-02 Thread Tom Morris
On Tue, Apr 2, 2013 at 12:58 AM, Michael Hale hale.michael...@live.comwrote:

 It will definitely have some errors, but I scanned the results for the
 first 100 movies before I started importing them, and I think the value-add
 will be much greater than the number of errors.


Does Wikidata have a quality goal or error rate threshold?  For example,
Freebase has a nominal quality goal of 99% accuracy, and this is the metric
that new data loads are judged against (they also want 95% confidence in
that measurement, which determines how big a sample you need when doing
evaluations).
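
(For a sense of scale: with the usual normal-approximation formula, checking a
99%-accuracy claim at 95% confidence with, say, a ±1 percentage point margin
of error -- the margin is my assumption, not Freebase's -- only takes a sample
of a few hundred facts.)

    import math

    z = 1.96        # z-score for 95% confidence
    p = 0.99        # hoped-for accuracy
    margin = 0.01   # assumed margin of error (+/- 1 percentage point)

    n = math.ceil(z**2 * p * (1 - p) / margin**2)
    print(n)        # 381 facts to review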

I haven't looked at this bot, but a develop/test/deploy cycle measured in
hours seems, on the surface, to be very aggressive.

Tom
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Running Infobox film import script

2013-04-02 Thread Tom Morris
The original URL is a 400 (looks like it's double URL encoded), but the
version you quoted works fine.  If you end up with %2527 in your browser
bar, correct it to just %27 (apostrophes in the URL aren't such a great
idea to start with though).
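
(For the curious, the %2527 is just an apostrophe that has been percent-encoded
twice; the second pass encodes the % itself as %25. A two-line illustration:)

    from urllib.parse import quote

    once = quote("'")     # "%27"  -- apostrophe encoded once
    twice = quote(once)   # "%2527" -- the "%" itself gets encoded as "%25"
    print(once, twice)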

Tom


On Tue, Apr 2, 2013 at 2:44 PM, Ed Summers e...@pobox.com wrote:

 On Tue, Apr 2, 2013 at 5:25 AM, Michael Hale hale.michael...@live.com
 wrote:
  http://www.wikidata.org/wiki/User:Wakebrdkid%27s_bot/code

 I got a 400 when fetching this (in Chrome).

 //Ed

 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l

___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Expiration date for data

2013-03-15 Thread Tom Morris
On Fri, Mar 15, 2013 at 1:49 AM, Michael Hale hale.michael...@live.comwrote:

 Yes, I think once qualifiers are enabled you would just have something
 like:
 ...
 Property(head of local government)
 ...
 Value(Elizabeth I) - Qualifier(1558-1603) - Sources()
 Value(James VI and I) - Qualifier(1603-1625) - Sources()
 ...
 ...

 There was a discussion about whether qualifiers should have specific
 datatypes other than just string, but I think we should only do that if
 needed.


Clearly the example that you gave is one where non-string datatypes are
critically important.  If you don't know that they're dates, you have no
way of telling when they were in those roles.
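
To make that concrete, here is a schematic of what a statement with typed date
qualifiers might look like. This is deliberately not the real Wikibase JSON
(which hadn't settled at that point); the property and item labels are just
the ones from the example above, and the point is only that typed dates let
you compute with the qualifier instead of treating it as an opaque string.

    # Schematic statement with typed date qualifiers (illustrative only).
    statement = {
        "property": "head of local government",
        "value": "Elizabeth I",
        "qualifiers": {
            "start date": {"type": "date", "value": "1558"},
            "end date":   {"type": "date", "value": "1603"},
        },
        "sources": [],
    }

    # With typed dates you can ask whether the statement applied in a given year.
    year = 1600
    start = int(statement["qualifiers"]["start date"]["value"])
    end = int(statement["qualifiers"]["end date"]["value"])
    print(start <= year <= end)   # True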

Tom
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Data values

2012-12-19 Thread Tom Morris
Wow, what a long thread.  I was just about to chime in to agree with Sven's
point about units when he interjected his comment about blithely ignoring
history, so I feel compelled to comment on that first.  It's fine to ignore
standards *for good reasons*, but doing it out of ignorance or gratuitously
is just silly.  Thinking that WMF is so special it can create a better
solution without even know what others have done before is the height of
arrogance.

Modeling time and units can basically be made arbitrarily complex, so the
trick is in achieving the right balance of complexity vs utility.  Time is
complex enough that I think it deserves its own thread.  The first thing
I'd do is establish some definitions to cover some basics like
durations/intervals, uncertain dates, unknown dates, imprecise dates, etc.,
so that everyone is using the same terminology and concepts.  Much of the
time discussion is difficult for me to follow because I have to guess at
what people mean.  In addition to the ability to handle circa/about dates
already mentioned, it's also useful to be able to represent before/after
dates, e.g. "he died before 1 Dec 1792 when his will was probated."  Long term,
I suspect you'll need support for additional calendars rather than
converting everything to a common calendar, but only supporting Gregorian
is a good way to limit complexity to start with.  Geologic times may
(probably?) need to be modeled differently.

Although I disagree strongly with Sven's sentiments about the
appropriateness of reinventing things, I believe he's right about the need
to support more units than just SI units and to know what units were used
in the original measurement.  It's not just a matter of aesthetics but of
being able to preserve the provenance.  Perhaps this gets saved for a
future iteration, but you may find that you need both display and
computable versions of things stored separately.

Speaking of computable versions, don't underestimate the issues with using
floating-point numbers.  There are numbers that they just can't represent
and their range is not infinite.
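
Two quick illustrations of both problems, exact representation and finite
range:

    # Exact representation: 0.1 has no finite binary expansion, so sums drift.
    print(0.1 + 0.2 == 0.3)          # False
    print(0.1 + 0.2)                 # 0.30000000000000004

    # Finite range: IEEE 754 doubles top out around 1.8e308.
    import sys
    print(sys.float_info.max)        # 1.7976931348623157e+308
    print(sys.float_info.max * 10)   # inf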

Historians and genealogists have many interminable discussions about
date/time representation which can be found in various list archives, but
one recent spec worth reviewing is Extended Date/Time Format (EDTF)
http://www.loc.gov/standards/datetime/pre-submission.html

Another thing worth looking at is the Freebase schema since it not only
represents a bunch of this stuff already, but it's got real world data
stored in the schema and user interface implementations for input and
rendering (although many of the latter could be improved).  In particular,
some of the following might be of interest:

http://www.freebase.com/view/measurement_unit /
http://www.freebase.com/schema/measurement_unit
http://www.freebase.com/schema/time
http://www.freebase.com/schema/astronomy/celestial_object_age
http://www.freebase.com/schema/time/geologic_time_period
http://www.freebase.com/schema/time/geologic_time_period_uncertainty

If you rummage around, you can probably find lots of interesting examples
and decide for yourself whether or not that's a good way to model things.
 I'm reasonably familiar with the schema and happy to answer questions.

There are probably lots of other example vocabularlies that one could
review such as the Pleiades project's: http://pleiades.stoa.org/vocabularies

You're not going to get it right the first time, so I would just start with
a small core that you're reasonably confident in and iterate from there.

Tom

On Wed, Dec 19, 2012 at 12:47 PM, Sven Manguard svenmangu...@gmail.comwrote:

 My philosophy is this: We should do whatever works best for Wikidata and
 Wikidata's needs. If people want to reuse our content, and the choices
 we've made make existing tools unworkable, they can build new tools
 themselves. We should not be clinging to what's been done already if it
 gets in the way of what will make Wikidata better. Everything that we
 make and do is open, including the software we're going to operate the
 database on. Every WMF project has done things differently from the
 standards of the time, and people have developed tools to use our content
 before. Wikidata will be no different in that regard.

 Sven


 On Wed, Dec 19, 2012 at 12:27 PM, Martynas Jusevičius 
 marty...@graphity.org wrote:

 Denny,

 you're sidestepping the main issue here -- every sensible architecture
 should build on as much previous standards as possible, and build own
 custom solution only if a *very* compelling reason is found to do so
 instead of finding a compromise between the requirements and the
 standard. Wikidata seems to be constantly doing the opposite --
 building a custom solution with whatever reason, or even without it.
 This drives the compatibility and reuse towards zero.

 This thread originally discussed datatypes for values such as numbers,
 dates and their intervals -- semantics for all of those are defined in
 XML Schema Datatypes: 

Re: [Wikidata-l] Quick clarification on infobox migration

2012-11-14 Thread Tom Morris
On Wed, Nov 14, 2012 at 7:13 AM, Lydia Pintscher 
lydia.pintsc...@wikimedia.de wrote:

 On Wed, Nov 14, 2012 at 12:48 PM, Roger Hyam r.h...@rbge.ac.uk wrote:

 *If* I were to work on an editor tool, would it be feasible to think in
 terms of editing existing syntax for now and in version two adding a
 migrate to wikidata button? Would all the exiting infoboxes move over to
 wikidata in a 'big bang' approach?

 Potentially but I'd need to see the specific case you'd like to write
 a tool for. I don't think it'll be technically feasible to migrate
 everything with the click of a button.
 Infoboxes will move at the pace the editors migrate them
 (automatically or by hand).


I believe he's talking about these infoboxes:
http://en.wikipedia.org/wiki/Template:Taxobox
They are quite regular in their contents so are good candidates for
automated editing/migration support.

Tom
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] wikidata.org is live (with some caveats)

2012-10-30 Thread Tom Morris
On Tue, Oct 30, 2012 at 7:40 AM, emijrp emi...@gmail.com wrote:
 2012/10/30 Lydia Pintscher lydia.pintsc...@wikimedia.de

 On Tue, Oct 30, 2012 at 11:35 AM, emijrp emi...@gmail.com wrote:
  Ah, cool trick.
 
  So, we can take October 30, 2012 as the day Wikidata was launched? I
  would
  like to add a line here http://en.wikipedia.org/wiki/Wikidata

 Yes that sounds good :)


 Done. I'm not sure if we have to add R.I.P. 30 October 2012 to DBpedia and
 Freebase articles.

That's excellent that you've already surpassed them in quality and
quantity (not to mention classiness of contributors).

Congratulations!

On an unrelated note, can we assume that all data added to
wikidata.org is persistent now?

Tom

___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Hello from the YAGO team

2012-04-18 Thread Tom Morris
On Tue, Apr 17, 2012 at 4:27 PM, Lydia Pintscher
lydia.pintsc...@wikimedia.de wrote:
 On Tue, Apr 17, 2012 at 10:23 PM, Kingsley Idehen
 kide...@openlinksw.com wrote:
 On 4/17/12 11:32 AM, Dario Taraborelli wrote:

 Shall we create a Wikidata vs {Freebase, DBpedia, YAGO} comparison table
 on meta (or enwiki)?

 Not a versus-style table. That sends the wrong signals when these services
 are fundamentally complementary.

 Yes. I'd prefer a short text - something we can add to the existing
 FAQ. (That's also what I have so far.)

I actually like tables myself and don't see them as being adversarial,
but I'm happy to contribute to the comparison in whatever form it
takes.

Having a matrix/table: a) focuses people on choosing a few important
dimensions to summarize and b) makes any holes in the comparison
matrix obvious so they can be filled in (which is much harder when
parsing text).

Tom



Re: [Wikidata-l] Data_model: Metamodel: Wikipedialink

2012-04-18 Thread Tom Morris
This is a good discussion.  The true dynamics won't be known until
you've got live users on the system, but based on what I've seen with
existing Wikipedia edits, the dynamics will be even more complex than
predicted  so far (which is already pretty complex!).

Some other things to consider:

- the focus of Wikipedia articles drifts over time (with good feedback
loops built in to the system, this should hopefully be
self-correcting)

- label/description disagreement occurs: the title says one thing, the first
few sentences (which is often all people scan when working quickly)
say something different, and the article taken as a whole is about a third
thing

- you'll see different behavior depending on whether you track by
article number (internal ID) or article title

- the granularity of Wikipedia articles depends on the length of the
text, not just semantics.  Concepts with lots of text get split across
multiple articles (e.g. WW II), while concepts which don't have much
written about them risk getting combined into composite articles about
multiple concepts.

- redirects are used for: aliases, misspellings, 'see instead'
references to semantically different articles, and probably other
things that I'm not aware of.  This can complicate doing something
meaningful with them.

Another source for data on the current articles and their behavior is
Freebase.  Wikipedia-based topics which have been split or combined
retain an audit trail that lets you figure out what happened.  It only
covers the last 5 years and only English Wikipedia, but within those
limitations it could provide some interesting insights.  I'm happy to
help anyone who wants to work with this data.

Tom



Re: [Wikidata-l] Hello from the YAGO team

2012-04-17 Thread Tom Morris
On Mon, Apr 16, 2012 at 5:19 PM, Kingsley Idehen kide...@openlinksw.com wrote:
 On 4/16/12 2:54 PM, Tom Morris wrote:

 - the refresh cycle is every couple of weeks (ie much faster than
 DBpedia but much slower than DBpedia live)

 Why do you make the comment above? Are you not aware that the DBpedia-Live
 editions have existed for a few years now?

I think my text that you quoted answers the question since I reference
Live -- or do I get points off for incorrect
capitalization/punctuation?

  three months   two weeks   minutes
  DBpedia        Freebase    DBpedia-Live (phew! spelled it correctly this time)

By my calculations though, availability is actually 10 months, not a
few years.
http://blog.aksw.org/2011/official-dbpedia-live-release/

On Tue, Apr 17, 2012 at 8:33 AM, Fabian M. Suchanek
f.m.sucha...@gmail.com wrote:

 I also wanted to ask again about the relationship between Freebase and
 Wikidata: Freebase was bootstrapped from the infoboxes of Wikipedia,

Wikipedia-based data from infoboxes is updated on a regular basis.  It
wasn't just a one-time bootstrap.

 but I think its main selling point is that volunteers can add and
 correct data. Thus, my understanding is that, both in Wikidata and in
 Freebase, volunteers would fill in structured, factual information. Is
 that right?

I outlined most of the major differences that come to mind.  I don't
think there's any one particular selling point and, in particular,
the Freebase team has never really attempted to do much in the way of
selling at all.  I don't really think that there's any overlap or
competition between the two projects.  If Wikidata is successful,
Freebase rips out its infobox parsers and gets cleaner Wikipedia
data to import with less effort.

 My intuition is that Wikidata will have a more principled
 approach, because it can build on the Wikipedia/Wikimedia culture.

To the extent that the Wikidata project is unsuccessful in changing
the current Wikipedia culture, they'll inherit both the good and bad
points of the existing culture.  Personally, I could do with a few
less deletionists and petty tyrants ruling their corner of
Wikipedia.

Tom



Re: [Wikidata-l] Hello from the YAGO team

2012-04-16 Thread Tom Morris
On Mon, Apr 16, 2012 at 10:54 AM, Fabian M. Suchanek
f.m.sucha...@gmail.com wrote:
 From: JFC Morfin jef...@jefsey.com
 Thank you for this detailed explanation.
 How do you see the integration/impact of Wikidata on both projects?

 My intuition is that the impact could be mutual:
 * for YAGO and DBpedia, the impact would be immediate, because
 Wikidata could essentially provide cleaner infobox data for these
 projects. Yet, we have to see how Wikidata will position itself relative to
 Freebase, which seems to pursue a similar goal:
 http://en.wikipedia.org/wiki/Freebase
 (If you have thoughts on distinguishing Wikidata from Freebase, we'd
 be happy to know)

I don't speak for Freebase, but I view Freebase, DBpedia, and YAGO as
all occupying comparable positions relative to Wikipedia/Wikidata.
They currently attempt to reverse-engineer structured data out of
Wikipedia infoboxes, and if Wikidata is successful in providing the
data source for those infoboxes, it'll eliminate a lot of
troublesome, error-prone parsing code.

Some of the ways that Freebase is different include:

- it's editable by anyone, so you don't need to go back to Wikipedia
to correct mistakes.
- it doesn't have a notability requirement like Wikipedia.  If it's
factual and non-spammy, you can include it.
- infobox mappings aren't public and can only be modified by Google employees
- a relatively small number of popular infoboxes are mined (nowhere
near DBpedia's coverage)
- the refresh cycle is every couple of weeks (ie much faster than
DBpedia but much slower than DBpedia live)
- it includes a large amount of non-Wikipedia data from MusicBrainz,
OpenLibrary, Geonames, etc, as well as being linked to a number of
other sources of strong identifiers such as the New York Times, IMDB,
NNDB, U.S. Library of Congress Name Authority File and Subject
Headings, etc.

As far as positioning between Wikidata and Freebase goes, there's
really no way that Freebase (or any other non-Wikimedia Foundation
effort) could ever compete with Wikidata in the context of providing
data to Wikipedia.  The Wikipedia culture is just too insular.
Instead I would expect Freebase to stop parsing infoboxes and consume
data directly from Wikidata in the same way that I would expect
DBpedia, YAGO and other consumers to.

Before that happens though, Wikidata not only needs to get the
technical infrastructure in place, but also change the culture of
Wikipedia editors so that they're not anti-data and care about the
semantics as well as the presentation of the information.  A lot of
today's quality problems are social, not technical.

 - YAGO, e.g., has mappings of infobox data to relations with domains
 and ranges, with a quality guarantee.

Guarantee?  My understanding of the previous post was that a very
small sample of YAGO data had been measured for precision (with good
results), not that there was 100% curation or any type of quality
guarantee.

Freebase has a stated 99% quality goal, but actual quality (as well as
coverage) varies greatly from domain to domain.

Tom



Re: [Wikidata-l] Notability in Wikidata

2012-04-02 Thread Tom Morris
On Sun, Apr 1, 2012 at 10:11 AM, Bináris wikipo...@gmail.com wrote:


 2012/4/1 Markus Krötzsch markus.kroetz...@cs.ox.ac.uk

 We would rather support linking and data integration with external
 databases than suggest that *every* fact in the world be copied to Wikidata.


 An interesting example is the IUCN redlist of threatened species:
 http://www.iucnredlist.org/ Data are modified by IUCN from time to time, so
 copying would be a rather temporary solution, while the redlist status is
 important in infoboxes of animals and plants.

Actually, to do anything with their data you'd need to get written
approval for exceptions to their terms of use, which prohibit both
redistribution and commercial use (i.e. you'd need to include an NC
clause in your license).

http://www.iucnredlist.org/info/terms-of-use#3._No_commercial_use

Tom
