Re: [Wikidata] GraFa: Faceted browser for RDF/Wikidata [thanks!]

2018-03-07 Thread Neubert, Joachim
Hi Aidan,

Thanks for your reply! My suggestion indeed was to feed in a property (e.g.
"P2611") - not a certain value or range of values for that property - and to
restrict the initial set to all items where that property is set (e.g., all
known TED speakers in Wikidata).

That would allow applying further faceting (e.g., by occupation or country) to
that particular subset of items. In effect, it would offer an alternative view
of the original database (e.g., https://www.ted.com/speakers, which is
organized by topic and by event). Thus, the full and often very rich structured
data of Wikidata could be used to explore external datasets which are linked to
WD via external identifiers.
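
For illustration, the facet counts such a seeded view needs boil down to queries like the following minimal WDQS sketch (not GraFa's actual implementation; occupation, P106, is just an example facet):

# Count occupation facets over all items that have any TED speaker ID (P2611)
# set, i.e. the "existential" seed described above. Runnable on
# query.wikidata.org, where the wdt:, wikibase: and bd: prefixes are predefined.
SELECT ?occupation ?occupationLabel (COUNT(DISTINCT ?item) AS ?count) WHERE {
  ?item wdt:P2611 [] ;           # some value for the external identifier
        wdt:P106 ?occupation .   # occupation, used here as the facet
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
GROUP BY ?occupation ?occupationLabel
ORDER BY DESC(?count)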

Being able to browse their own special collections by facets from Wikidata 
could perhaps even offer an incentive to GLAM institutions to contribute to 
Wikidata. It may turn out to be much easier to add some missing data to
Wikidata than to introduce a new field in their own database/search interface
and populate it from scratch.

So I'd suggest that additional work invested into GraFa here could pay off in a
new pattern of use, for both Wikidata and collections linked by external
identifiers.

Cheers, Joachim

-----Original Message-----
From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On behalf of
Aidan Hogan
Sent: Wednesday, 7 March 2018 06:16
To: wikidata@lists.wikimedia.org
Subject: Re: [Wikidata] GraFa: Faceted browser for RDF/Wikidata [thanks!]

Hi Joachim,

On 14-02-2018 7:32, Neubert, Joachim wrote:
> Hi Aidan, hi José,
> 
> I'm a bit late - sorry!

Likewise! :)

> What came to my mind as a perhaps easy extension: Can or could the browser 
> be seeded with an external property (for example P2611, TED speaker ID)?
> 
> That would allow to browse some external dataset (e.g., all known TED 
> speakers) by the facets provided by Wikidata.

Thanks for the suggestion! While it might seem an easy extension, unfortunately 
that would actually require some significant changes since GraFa only considers 
values that have a label/alias we can auto-complete on (which in the case of 
Wikidata means, for the most part, Q* values).

While it would be great to support datatype/external properties, we figured 
that adding them to the system in a general and clean way would not be trivial! 
We assessed that some such properties require ranges (e.g., date-of-birth or
height), some require autocomplete (e.g., first name), etc. ... and in the case
of IDs, it's not clear that these are really useful for faceted browsing, since
they will typically jump to a specific value. Hence it gets messy to handle in
the interface and even messier in the back-end.

(A separate issue is that of existential values ... finding entities that have 
some value for a property as your example requires. That would require some 
work, but would be more feasible!)

Best,
Aidan

>> -----Original Message-----
>> From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On
>> behalf of Aidan Hogan
>> Sent: Thursday, 8 February 2018 21:33
>> To: Discussion list for the Wikidata project.
>> Cc: José Ignacio .
>> Subject: Re: [Wikidata] GraFa: Faceted browser for RDF/Wikidata
>> [thanks!]
>>
>> Hi all,
>>
>> On behalf of José and myself, we would really like to thank the 
>> people who tried out our system and gave us feedback!
>>
>>
>> Some aspects are left to work on (for example, we have not tested for 
>> mobiles, etc.). However, we have made some minor initial changes 
>> reflecting some of the comments we received (adding example text for 
>> the type box, clarifying that the numbers refer to number of results 
>> not Q codes, etc.):
>>
>> http://grafa.dcc.uchile.cl/
>>
>>
>> To summarise some aspects of the work and what we've learnt:
>>
>> * In terms of usability, the principal lesson we have learnt (amongst
>> many) is that it is not clear for users what is a type. For example, 
>> when searching for "popes born in Poland", the immediate response of 
>> users is to type "pope" rather than "human" or "person" in the type box.
>> In a future version of the system, we might thus put less emphasis on 
>> starting the search with type (the original reasoning behind this was 
>> to quickly reduce the number of facets/properties that would be shown).
>> Hence the main conclusion here is to try to avoid interfaces that 
>> centre around "types".
>>
>>
>> * A major design goal is that the user is only ever shown options 
>> that lead to at least one result. All facets computed are exact with 
>> exact numbers. The technical challenge here is displaying these 
>> f

Re: [Wikidata] Metadata about Persistent Identifiers

2018-02-21 Thread Neubert, Joachim
Well - P921 is described as "primary topic of a work", and is an instance of
"WD property for items about works". Two possible issues:

- "domain" is a much clearer restriction than "main subject/primary topic",
which implies that there may be other, secondary subjects. For identifiers,
normally the more formal restriction applies.

- defining an identifier as a work seems a bit stretched - yet, no domain is
given here :-)

So perhaps a new property is needed?

Cheers, Joachim

> -----Original Message-----
> From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On behalf of
> Andy Mabbett
> Sent: Wednesday, 21 February 2018 13:16
> To: Discussion list for the Wikidata project
> Subject: Re: [Wikidata] Metadata about Persistent Identifiers
> 
> On 21 February 2018 at 12:03, Neubert, Joachim <j.neub...@zbw.eu> wrote:
> 
> > So, we should be able to formally specify the "domain" of identifiers.
> > Perhaps that could be derived from the type constraints in linked
> > properties, but I think it would make sense as an explicit property on the
> identifier.
> 
> Main subject (P921) ?
> 
> I certainly don't think users should have to query properties to find metadata
> about concepts.
> 
> --
> Andy Mabbett
> @pigsonthewing
> http://pigsonthewing.org.uk
> 
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Metadata about Persistent Identifiers

2018-02-21 Thread Neubert, Joachim
So, we should be able to formally specify the "domain" of identifiers. Perhaps 
that could be derived from the type constraints in linked properties, but I 
think it would make sense as an explicit property on the identifier.

Some identifiers, e.g., GND and VIAF, require special attention because they
span multiple domains. A person identifier from GND offers different
opportunities for cross-linking than an organization identifier. No idea so far
how to handle that ...
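
One way to get a feeling for how such a multi-domain identifier splits up is a simple class breakdown; a minimal WDQS sketch (it may need a smaller scope or hit the timeout on the public endpoint):

# How items carrying a GND ID (P227) distribute over instance-of (P31) classes.
# Runnable on query.wikidata.org; wdt:, wikibase: and bd: are predefined there.
SELECT ?class ?classLabel (COUNT(DISTINCT ?item) AS ?count) WHERE {
  ?item wdt:P227 [] ;       # has some GND ID
        wdt:P31 ?class .    # instance of
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
GROUP BY ?class ?classLabel
ORDER BY DESC(?count)
LIMIT 20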

Cheers, Joachim


> -----Original Message-----
> From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On behalf of
> Antonin Delpeuch (lists)
> Sent: Wednesday, 21 February 2018 10:57
> To: wikidata@lists.wikimedia.org
> Subject: Re: [Wikidata] Metadata about Persistent Identifiers
> 
> Hi Andy,
> 
> Thanks, there seems to be quite a lot of work to do in this area indeed!
> 
> 
> On 20/02/2018 19:49, Andy Mabbett wrote:
> > As an example, I created 'KoreaMed Unique Identifier':
> >
> >https://www.wikidata.org/wiki/Q47489994
> >
> > How could we improve that? What additional properties might we need?
> > What properties already exist, that we could make use of?
> 
> I have recently proposed to create a "number of records" property to store the
> number of identifiers in a given scheme:
> 
> https://www.wikidata.org/wiki/Wikidata:Property_proposal/number_of_record
> s
> 
> This property could typically apply here. The idea behind this property is 
> that
> we could compare its values to the number of uses of the corresponding
> property in Wikidata.
> 
> One other thing I would love to see happening on Wikidata is keeping track of
> the links between identifier schemes. If identifier X and identifier Y are 
> both
> used by the same database Z, then we can probably use Z to "match" X to Y and
> conversely.
> 
> If we had many "uses (P2283)" and "used by (P1535)" statements to link
> identifiers to databases, we could then draw a graph of identifiers and
> databases using them. Given two identifiers, we could analyze the paths
> between these two identifiers…
> 
> For now the graph is a bit sparse: http://tinyurl.com/y89u3enx
> 
> (And you can already see one issue: even if we have a path from ORCID to ISNI,
> that does not mean that we can convert an ORCID id to an ISNI for the same
> person via this path, as GRID contains ISNIs for organizations
> only…)
> 
> Thanks a lot Andy for adding such statements on
> https://www.wikidata.org/wiki/Q43649390 by the way!
> 
> >
> >
> > Also, this query:
> >
> >http://tinyurl.com/y6wdrbhd
> >
> > returns over 5000 instances/ subclasses of "unique identifier"
> > (Q6545185) but includes both /types/ of identifiers (like the example
> > above) and individual identifier values, like ".ar" as an internet TLD
> > (domain name itself - Q32635 - is a subclass, not an instance, of UID)
> > - how should we distinguish between the two classes?
> 
> Urgh, that's messy. I think I would just change the ontology:
> "domain name" (Q32635) should not be a subclass of "unique identifier"
> (Q6545185), but rather an instance of it. (Actually the uniqueness is 
> debatable,
> I don't think DNS is meant to enforce any uniqueness at all, as it is very
> common for a website to have multiple domain names. So maybe just "domain
> name" "instance of" "identifier (Q853614)" would do).
> 
> Antonin
> 
> 
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] GraFa: Faceted browser for RDF/Wikidata [thanks!]

2018-02-14 Thread Neubert, Joachim
Hi Aidan, hi José,

I'm a bit late - sorry!

What came to my mind as a perhaps easy extension: Can or could the browser be 
seeded with an external property (for example P2611, TED speaker ID)?

That would allow to browse some external dataset (e.g., all known TED speakers) 
by the facets provided by Wikidata.

Cheers, Joachim


> -----Original Message-----
> From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On behalf of
> Aidan Hogan
> Sent: Thursday, 8 February 2018 21:33
> To: Discussion list for the Wikidata project.
> Cc: José Ignacio .
> Subject: Re: [Wikidata] GraFa: Faceted browser for RDF/Wikidata [thanks!]
> 
> Hi all,
> 
> On behalf of José and myself, we would really like to thank the people who
> tried out our system and gave us feedback!
> 
> 
> Some aspects are left to work on (for example, we have not tested for
> mobiles, etc.). However, we have made some minor initial changes
> reflecting some of the comments we received (adding example text for the
> type box, clarifying that the numbers refer to number of results not Q
> codes, etc.):
> 
> http://grafa.dcc.uchile.cl/
> 
> 
> To summarise some aspects of the work and what we've learnt:
> 
> * In terms of usability, the principal lesson we have learnt (amongst
> many) is that it is not clear for users what is a type. For example,
> when searching for "popes born in Poland", the immediate response of
> users is to type "pope" rather than "human" or "person" in the type box.
> In a future version of the system, we might thus put less emphasis on
> starting the search with type (the original reasoning behind this was to
> quickly reduce the number of facets/properties that would be shown).
> Hence the main conclusion here is to try to avoid interfaces that centre
> around "types".
> 
> 
> * A major design goal is that the user is only ever shown options that
> lead to at least one result. All facets computed are exact with exact
> numbers. The technical challenge here is displaying these facets with
> exact numbers and values for large result sizes, such as human:
> 
> http://grafa.dcc.uchile.cl/search?instance=Q5
> 
> This is achieved through caching. We compute all possible queries in the
> data that would yield >50,000 results (e.g., human->gender:male,
> human->gender:male->country:United States, etc.). We then compute their
> facets offline and cache them. In total there's only a couple of hundred
> such queries generating that many results. The facets for other queries
> with fewer than 50,000 results are computed live. Note that we cannot
> cache for keyword queries (instead we just compute facets for the first
> 50,000 most relevant results). Also, if we add other features such as
> range queries or sub-type reasoning, the issue of caching would become
> far more complex to handle.
> 
> 
> In any case, thanks again to all those who provided feedback! Of course
> further comments or questions are welcome (either on- or off-list).
> Likewise we will be writing up a paper describing technical aspects of
> the system soon with some evaluation results. Once it's ready we will of
> course share a link with you.
> 
> Best,
> Aidan and José
> 
> 
>  Forwarded Message 
> Subject: Re: GraFa: Faceted browser for RDF/Wikidata [feedback requested]
> Date: Mon, 15 Jan 2018 11:47:18 -0300
> From: Aidan Hogan 
> To: Discussion list for the Wikidata project. 
> CC: José Ignacio . 
> 
> Hi all!
> 
> Just a friendly reminder that tomorrow we will close the questionnaire
> so if you have a few minutes to help us out (or are just curious to see
> our faceted search system) please see the links and instructions below.
> 
> And many thanks to those who have already provided feedback! :)
> 
> Best,
> José & Aidan
> 
> On 09-01-2018 14:18, Aidan Hogan wrote:
> > Hey all,
> >
> > A Masters student of mine (José Moreno in CC) has been working on a
> > faceted navigation system for (large-scale) RDF datasets called "GraFa".
> >
> > The system is available here loaded with a recent version of Wikidata:
> >
> > http://grafa.dcc.uchile.cl/
> >
> > Hopefully it is more or less self-explanatory for the moment. :)
> >
> >
> > If you have a moment to spare, we would hugely appreciate it if you
> > could interact with the system for a few minutes and then answer a quick
> > questionnaire that should only take a couple more minutes:
> >
> > https://goo.gl/forms/h07qzn0aNGsRB6ny1
> >
> > Just for the moment while the questionnaire is open, we would kindly
> > request to send feedback to us personally (off-list) to not affect
> > others' responses. We will leave the questionnaire open for a week until
> > January 16th, 17:00 GMT. After that time of course we would be happy to
> > discuss anything you might be interested in on the list. :)
> >
> > After completing the questionnaire, please also feel free to visit or
> > list something you noticed on the Issue Tracker:

Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Neubert, Joachim
Hi Sebastian,

This is huge! It will cover almost all currently existing German companies. 
Many of these will have similar names, so preparing for disambiguation is a 
concern.

A good way to approach this would be proposing a property for an external
identifier, loading the data into Mix'n'match, creating links for companies
already in Wikidata, and adding the rest (or perhaps only part of them - I'm
not sure if having all of them in Wikidata makes sense, but that's another
discussion), preferably with location and/or sector of trade in the description
field.

I’ve tried to figure out what could be used as key for a external identifier 
property. However, it looks like the registry does not offer any (persistent) 
URL to its entries. So for looking up a company, apparently there are two 
options:


-  conducting an extended search for the exact string "A
Dienstleistungsgesellschaft mbH"

-  copying the register number "32853", selecting the court (Leipzig) from the
corresponding dropdown list, and searching for that

Neither way is very intuitive, even if we can provide a link to the search
form. That would make for a weak connection to the source of information. More
importantly, it makes disambiguation in Mix'n'match difficult. This applies to
the preparation of your initial load (you would not want to create duplicates),
but even more so to everybody else who wants to match his or her data later on.
Being forced to search for entries manually, in a cumbersome way, in order to
disambiguate against a new, possibly large and rich dataset is, in my eyes, not
something we want to impose on future contributors. And often, the free
information they find in the registry (formal name, register number, legal
form, address) will not easily match the information they have (common name,
location, perhaps founding date and, most importantly, sector of trade), so
disambiguation may still be difficult.
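
As a cheap first-pass check before creating items, the formal name could at least be run through Wikidata's entity search; a minimal sketch against the public query service (the search string is just the example entry below, and of course this only catches companies that already have an item):

# Look up existing Wikidata items whose labels/aliases match a company name,
# via the MWAPI EntitySearch service. Runnable on query.wikidata.org, where
# the wikibase:, bd: and mwapi: prefixes are predefined.
SELECT ?item ?itemLabel WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:api "EntitySearch" .
    bd:serviceParam wikibase:endpoint "www.wikidata.org" .
    bd:serviceParam mwapi:search "A Dienstleistungsgesellschaft mbH" .
    bd:serviceParam mwapi:language "de" .
    ?item wikibase:apiOutputItem mwapi:item .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "de,en" . }
}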

Have you checked which parts of the publicly accessible information (as shown
below) can legally be crawled and added to external databases such as Wikidata?

Cheers, Joachim

--
Joachim Neubert

ZBW – German National Library of Economics
Leibniz Information Centre for Economics
Neuer Jungfernstieg 21
20354 Hamburg
Phone +49-42834-462



From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On behalf of
Sebastian Hellmann
Sent: Sunday, 15 October 2017 09:45
To: wikidata@lists.wikimedia.org
Subject: [Wikidata] Kickstartet: Adding 2.2 million German organisations to
Wikidata


Hi all,

the German business registry contains roughly 2.2 million organisations. Some
of the information is paid, but other parts are public, i.e. the info you find
when searching the registry (linked below) and clicking on "UT" (see example
below):

https://www.handelsregister.de/rp_web/mask.do?Typ=e



I would like to add this to Wikidata, either by crawling or by raising money to
use crowdsourcing platforms like CrowdFlower or Amazon Mechanical Turk.



It should meet notability criterion 2: 
https://www.wikidata.org/wiki/Wikidata:Notability

2. It refers to an instance of a clearly identifiable conceptual or material 
entity. The entity must be notable, in the sense that it can be described using 
serious and publicly available references. If there is no item about you yet, 
you are probably not notable.

The reference is the official German business registry, which is serious and
public. Orgs are also, by definition, clearly identifiable legal entities.

How can I get clearance to proceed on this?

All the best,
Sebastian





Entity data

Saxony District court Leipzig HRB 32853 – A Dienstleistungsgesellschaft mbH

Legal status:             Gesellschaft mit beschränkter Haftung
Capital:                  25.000,00 EUR
Date of entry:            29/08/2016
                          (When entering date of entry, wrong data input can
                          occur due to system failures!)
Date of removal:          -
Balance sheet available:  -
Address (subject to correction):
                          A Dienstleistungsgesellschaft mbH
                          Prager Straße 38-40
                          04317 Leipzig



--
All the best,
Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies (KILT) 
Competence Center
at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, http://linguistics.okfn.org, 
https://www.w3.org/community/ld4lt
Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Which external identifiers are worth covering?

2017-09-11 Thread Neubert, Joachim
Hi Andrew, all,

In my eyes, a large incentive for the maintainers of external databases - I am
one, for the ZBW German National Library of Economics - is the data they can
gain: not only in terms of property values and attached Wikipedia pages, but
also in terms of identifiers and links to other vocabularies.

This can reach the point where Wikidata replaces a custom database for
identifier mappings. We approached that by moving a mapping of GND and RePEc
(P2428) identifiers to Wikidata. Still, that was only 3,100 out of 460,000 GND
IDs and 50,000 RePEc IDs in our EconBiz portal alone, so it's still very sparse
- but an improvement (for details, see https://hackmd.io/p/S1YmXWC0e). Adding,
e.g., the rest of the "most important economists" from RePEc as well as GND is
very tempting, as it will extend the mapping with relatively low effort.
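
And once the mapping lives in Wikidata, anyone can pull it out again with a simple two-triple-pattern query; a minimal WDQS sketch:

# Extract the GND <-> RePEc author mapping now stored in Wikidata.
# Runnable on query.wikidata.org; the wdt: prefix is predefined there.
SELECT ?item ?gndId ?repecId WHERE {
  ?item wdt:P227  ?gndId ;     # GND ID
        wdt:P2428 ?repecId .   # RePEc Short-ID
}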

For vocabularies limited in size, such as the STW Thesaurus for Economics, a
complete mapping can be achieved (if relations beyond equivalence are available
- see
https://www.wikidata.org/wiki/Wikidata:Property_proposal/mapping_relation_type).
The incentive for that is even higher, because it saves the owner of the
vocabulary the cost of maintaining possibly multiple mappings to third-party
vocabularies.

So I think that embracing and extending Wikidata's role as a *universal linking
hub* benefits everybody and will improve total coverage considerably, because
it offers incentives to communities not previously involved in Wikidata.

Cheers, Joachim

PS. Thanks for the hint to P2429 - looks very useful!

> -----Original Message-----
> From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On behalf of
> Andrew Gray
> Sent: Thursday, 7 September 2017 21:26
> To: Discussion list for the Wikidata project.
> Subject: Re: [Wikidata] Which external identifiers are worth covering?
> 
> Hi Marco,
> 
> I guess this depends what you mean by "exhaustive". Exhaustive in that every
> Wikidata item has ID X, or exhaustive in that we have every instance of ID X 
> in
> Wikidata?
> 
> The first is probably not going to happen, as the vast majority of external
> identifiers have a defined scope for what they identify. Some are pretty 
> broad -
> VIAF is essentially "everyone who exists in a library catalogue as an author 
> or
> subject" - but still have a limit.
> We're never really going to reach a situation where there is a single 
> identifier
> type that covers everyone, unless we're linking across to another 
> Wikidata-type
> comprehensive knowledgebase, and even then we'd need to ensure we're in a
> position where they already cover everything in Wikidata.
> 
> The second can (and has) been done - the largest one I know of offhand for
> people is the Oxford DNB (60k items) but for non-people we have complete
> coverage of eg Swedish district codes, P1841 (160k items).
> It's a bit of a slog to get these completed and then maintained, since the 
> last 5-
> 10% tend to be more challenging complicated cases, but one or two
> determined people can make it happen. And of course it's not appropriate for
> many identifiers, as they may issue IDs for things that we don't intend to 
> have
> in Wikidata, so we will never completely cover them.
> 
> I should quickly plug the "expected completeness" property which is really
> useful for identifiers - P2429 - as this can quickly show whether something 
> is a)
> completely on Wikidata; b) not complete yet but eventually might be; or c)
> probably never will be. Not very widely rolled out yet, though...
> 
> Andrew.
> 
> 
> On 7 September 2017 at 19:51, Marco Fossati  wrote:
> > Hi everyone,
> >
> > As a data quality addict, I've been investigating the coverage of
> > external identifiers linked to Wikidata items about people.
> >
> > Given the numbers on SQID [1] and some SPARQL queries [2, 3], it seems
> > that even the second most used ID (VIAF) only covers *25%* of people items
> circa.
> > Then, there is a long tail of IDs that are barely used at all.
> >
> > So here is my question:
> > *which external identifiers deserve an effort to achieve exhaustive
> > coverage?*
> >
> > Looking forward to your valuable feedback.
> > Cheers,
> >
> > Marco
> >
> > [1] https://tools.wmflabs.org/sqid/#/browse?type=properties "Select
> > datatype" set to "ExternalId", "Used for class" set to "human Q5"
> > [2] total people: http://tinyurl.com/ybvcm5uw [3] people with a VIAF
> > link: http://tinyurl.com/ya6dnpr7
> >
> > ___
> > Wikidata mailing list
> > Wikidata@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikidata
> 
> 
> 
> --
> - Andrew Gray
>   and...@generalist.org.uk
> 
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
___
Wikidata mailing list
Wikidata@lists.wikimedia.org

Re: [Wikidata] Some Mix'n'match mappings not stored in Wikidata?

2017-08-28 Thread Neubert, Joachim
Hi Osma,

The instrument we used to avoid duplicates was Mix-n-match. Even when something 
is not "automatically matched", often, on the details page (e.g., 
https://tools.wmflabs.org/mix-n-match/#/entry/22734337), possible matches come 
up.

That covers the case where a (partial) name is present somewhere in Wikidata or
Wikipedia. Unfortunately, I've not yet figured out how I could feed my own
synonyms into Mix'n'match. Providing them in the description field helps for
intellectual identification, but they seem not to be used by the matching
algorithm. Possibly, a separate "catalog" with permuted name variants from
not-yet-matched entries could help, but I'm not sure if Magnus would encourage
that, because it messes up the catalog list. Swedish and Finnish names for the
same locations, however, could perhaps be a valid use case.

Anyway, with the 2,200 missing RePEc authors I decided at that point that the
result was good enough, and created the not-matched entries. Less than a
handful showed up later as duplicates (e.g., as automatically matched against
GND). Of course, some will still linger hidden. But it is very easy to merge
items in Wikidata, so I consider that a much smaller problem than it would be
in library systems, where it is administratively and technically much more
difficult to get rid of duplicates.

Cheers, Joachim (and sorry for the late response)

> -----Original Message-----
> From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On behalf of
> Osma Suominen
> Sent: Monday, 21 August 2017 13:41
> To: wikidata@lists.wikimedia.org
> Subject: Re: [Wikidata] Some Mix'n'match mappings not stored in Wikidata?
> 
> Hi Joachim,
> 
> Thanks for this, indeed this could be a potential strategy for us to add some 
> or
> all of the missing entities. The challenge is that we would need to be
> reasonably sure that the places we want to create actually don't exist in
> Wikidata, for example using an alternate spelling. You said in your question
> that "Of course we make sure that neither of the ids exist in WD so far", but
> how did you do that?
> 
> -Osma
> 
> Neubert, Joachim kirjoitti 21.08.2017 klo 12:36:
> > Hi Osma,
> >
> > re. adding missing items, I've made good experiences with creating
> > input files for Quickstatements2 (see
> > https://github.com/zbw/repec-ras/blob/master/bin/create_missing_wikida
> > ta.pl). I've discussed how to best do this in the Wikidata Project
> > Chat before, and received valuable advice.
> > (https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2017/05#S
> > ource_statements_for_items_syntesized_from_authorities_-_recommendatio
> > ns.3F)
> >
> > Feel free to ask for further information, and all the best, Joachim
> >
> >> -----Original Message-----
> >> From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On
> >> behalf of Osma Suominen
> >> Sent: Monday, 21 August 2017 11:07
> >> To: Discussion list for the Wikidata project.
> >> Subject: [Wikidata] Some Mix'n'match mappings not stored in Wikidata?
> >>
> >> Hi,
> >>
> >> We're more than halfway through mapping YSO places to Wikidata. Most
> >> of the remaining are places that don't exist in Wikidata, and adding
> >> them is quite labor-intensive so we will have to consider our strategy.
> >>
> >> Anyway, I did some checking of what remains unmapped and noticed a
> >> potential problem: some mappings for places that we have mapped using
> >> Mix'n'match have not actually been stored in Wikidata. For example
> >> Q36 Poland ("Puola" in YSO Places) is such a case. In Mix'n'match it
> >> is shown as manually matched (see attached screenshot), but in
> >> Wikidata the corresponding YSO ID property doesn't actually exist for
> >> the entity. I checked the change history of the Q36 entity and
> >> couldn't find anything relevant there, so it seems that the mapping
> >> was never stored in Wikidata. Maybe there was a transient error of some
> kind?
> >>
> >> Another such case was Q1754 Stockholm ("Tukholma" in YSO places). But
> >> for that one we removed the existing mapping in Mix'n'match and set
> >> it again, and now it is properly stored in Wikidata.
> >>
> >> Mix'n'match currently reports 4228 mappings for YSO places, while a
> >> SPARQL query for the Wikidata endpoint returns 4221 such mappings. So
> >> I suspect that this only affects a small number of entities.
> >>
> >> Is it possible to compare the Mix'n'match mappings with what actually
> >> ex

Re: [Wikidata] Some Mix'n'match mappings not stored in Wikidata?

2017-08-21 Thread Neubert, Joachim
Hi Osma,

re. adding missing items, I've had good experiences with creating input files
for QuickStatements2 (see
https://github.com/zbw/repec-ras/blob/master/bin/create_missing_wikidata.pl).
I've discussed how best to do this in the Wikidata Project Chat before, and
received valuable advice
(https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2017/05#Source_statements_for_items_syntesized_from_authorities_-_recommendations.3F).

Feel free to ask for further information, and all the best, Joachim

> -----Original Message-----
> From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On behalf of
> Osma Suominen
> Sent: Monday, 21 August 2017 11:07
> To: Discussion list for the Wikidata project.
> Subject: [Wikidata] Some Mix'n'match mappings not stored in Wikidata?
> 
> Hi,
> 
> We're more than halfway through mapping YSO places to Wikidata. Most of
> the remaining are places that don't exist in Wikidata, and adding them is 
> quite
> labor-intensive so we will have to consider our strategy.
> 
> Anyway, I did some checking of what remains unmapped and noticed a
> potential problem: some mappings for places that we have mapped using
> Mix'n'match have not actually been stored in Wikidata. For example Q36
> Poland ("Puola" in YSO Places) is such a case. In Mix'n'match it is shown as
> manually matched (see attached screenshot), but in Wikidata the
> corresponding YSO ID property doesn't actually exist for the entity. I checked
> the change history of the Q36 entity and couldn't find anything relevant 
> there,
> so it seems that the mapping was never stored in Wikidata. Maybe there was a
> transient error of some kind?
> 
> Another such case was Q1754 Stockholm ("Tukholma" in YSO places). But for
> that one we removed the existing mapping in Mix'n'match and set it again, and
> now it is properly stored in Wikidata.
> 
> Mix'n'match currently reports 4228 mappings for YSO places, while a SPARQL
> query for the Wikidata endpoint returns 4221 such mappings. So I suspect that
> this only affects a small number of entities.
> 
> Is it possible to compare the Mix'n'match mappings with what actually exists 
> in
> Wikidata, and somehow re-sync them? Or just to get the mappings out from
> Mix'n'match and compare them with what exists in Wikidata, so that the few
> missing mappings may be added there manually?
> 
> Thanks,
> Osma
> 
> 
> --
> Osma Suominen
> D.Sc. (Tech), Information Systems Specialist National Library of Finland P.O. 
> Box
> 26 (Kaikukatu 4)
> 00014 HELSINGIN YLIOPISTO
> Tel. +358 50 3199529
> osma.suomi...@helsinki.fi
> http://www.nationallibrary.fi
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] SQID: Lookup by value of external identifier?

2017-08-02 Thread Neubert, Joachim
Hi,

does anybody know if SQID can be invoked with a URL which includes an external
ID property and its value - similar to

https://tools.wmflabs.org/wikidata-todo/resolver.php?prop=P227=120434059

? That could give end users a really nice display.

Cheers, Joachim


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Multilingual and synonym support for M'n'm / was: Mix'n'Match with existing (indirect) mappings

2017-06-14 Thread Neubert, Joachim
Hi Magnus,

the idea was not to search for all labels/synonyms separately, but to 
concatenate everything in one large search string, and let the fulltext search 
do the magic.

E.g., for STW descriptor “CGE model”, search for “CGE model, CGE-Modell, ORANI 
model, MONASH model, Dynamic CGE model, Computable general equilibrium model, 
CGE analysis, Applied general equilibrium model”

When, as in Fuseki, the full-text search tries to match every word in the
string, it may return long lists of results. However, when these can be sorted
by a score value, they can be limited to the 10 (or however many) best-matching
results.

A corresponding example query, which works on a GND endpoint, is here:
http://zbw.eu/beta/sparql-lab/?endpoint=http://zbw.eu/beta/sparql/gnd/query=https://api.github.com/repos/zbw/sparql-queries/contents/gnd/search_subject.rq
I'm pretty sure that would work as well on our currently unavailable internal
WD endpoint on Fuseki. Unfortunately, MWAPI full-text search seems to work
differently.
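
For readers without access to those endpoints, here is a minimal sketch of the pattern with Jena's text index (text:query); the search string is just the CGE example from above, and the label property and index setup are assumptions:

PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# One big search string over all labels/synonyms; the Lucene score ranks the hits.
SELECT ?item ?label ?score WHERE {
  (?item ?score) text:query (rdfs:label
      "CGE model CGE-Modell ORANI model MONASH model computable general equilibrium model") .
  ?item rdfs:label ?label .
}
ORDER BY DESC(?score)
LIMIT 10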

Another pattern, which I have applied in a query that looks up person names and
their name variants from GND and then searches the above-mentioned custom WD
instance, is here:
https://github.com/zbw/sparql-queries/blob/master/wikidata/search_person_by_gnd_names.rq.

For, e.g., "John H. Dunning" (http://d-nb.info/gnd/119094665), all name
variants are bound in a full-text search expression, and a sum of scores is
computed to rank the total result
(http://zbw.eu/beta/sparql-lab/result?resultRef=https://api.github.com/repos/zbw/sparql-queries/contents/wikidata/results/search_person_by_gnd_names.wikidata_2016-11-07.gnd_2016-09.json).

I have experimented a bit, but neither of these patterns seems to work with the 
current MWAPI implementation. Since my understanding is very poor here, and the 
implementation is in an early stage, I cc Stas, who perhaps can contribute 
ideas.

Cheers, Joachim


From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On behalf of
Magnus Manske
Sent: Wednesday, 14 June 2017 09:33
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] Multilingual and synonym support for M'n'm / was:
Mix'n'Match with existing (indirect) mappings


On Tue, Jun 13, 2017 at 6:25 PM Neubert, Joachim
<j.neub...@zbw.eu> wrote:
Hi Magnus, Osma,

I suppose the scenario Osma pointed out is quite common for knowledge 
organization systems and in particular thesauri: Matching could take advantage 
of multilingual labels and also of synonyms, which are defined in the KOS.

For the populating STW Thesaurus for Economics ID (P3911), my preliminary plan 
was to match with all multilingual labels and synonyms as search string in a 
custom WD endpoint (Fuseki, with full text search support), and display in the 
ranked SPARQL results of the search with a column with a valid insert statement 
that can be copied and pasted into QuickStatements2.

Since Stas just announced an extension for WDQS with fulltext search (if I 
haven’t misunderstood his mail of 2017-06-12), it is perhaps now possible to do 
this kind of matching in WDQS.

It would be great if such an extended matching could be integrated into M’n’m.
To clarify, Mix'n'match already searches language-neutral, e.g. for automatch.

Storing multiple labels per entry in the Mix'n'match database, and then 
checking all-against-all, would require some large-scale rewiring.
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Multilingual and synonym support for M'n'm / was: Mix'n'Match with existing (indirect) mappings

2017-06-13 Thread Neubert, Joachim
Hi Magnus, Osma,

I suppose the scenario Osma pointed out is quite common for knowledge 
organization systems and in particular thesauri: Matching could take advantage 
of multilingual labels and also of synonyms, which are defined in the KOS.

For populating the STW Thesaurus for Economics ID (P3911), my preliminary plan
was to match with all multilingual labels and synonyms as one search string
against a custom WD endpoint (Fuseki, with full-text search support), and to
display, in the ranked SPARQL results of the search, a column with a valid
insert statement that can be copied and pasted into QuickStatements2.
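
To make the "insert statement column" concrete, here is a minimal sketch of how such a column can be assembled in the SELECT; the VALUES pair is a purely hypothetical stand-in for the full-text matching step, and only P3911 is taken from above:

PREFIX wd: <http://www.wikidata.org/entity/>
# Turn a matched (Wikidata item, STW descriptor ID) pair into a tab-separated
# QuickStatements command of the form  Q42 <TAB> P3911 <TAB> "12345-6".
SELECT ?wd ?insertStatement WHERE {
  VALUES (?wd ?stwId) { (wd:Q42 "12345-6") }   # hypothetical match, for illustration only
  BIND(CONCAT(STRAFTER(STR(?wd), "/entity/"), "\tP3911\t\"", ?stwId, "\"")
       AS ?insertStatement)
}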

Since Stas just announced an extension for WDQS with fulltext search (if I 
haven’t misunderstood his mail of 2017-06-12), it is perhaps now possible to do 
this kind of matching in WDQS.

It would be great if such an extended matching could be integrated into M’n’m.

Cheers, Joachim

From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On behalf of
Magnus Manske
Sent: Tuesday, 6 June 2017 16:07
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] Mix'n'Match with existing (indirect) mappings


On Tue, Jun 6, 2017 at 2:44 PM Osma Suominen 
> wrote:

By the way, we also have multilingual labels that could perhaps improve
the automatic matching. YSO generally has fi/sv/en, YSO places has
fi/sv. Can you make use of these too if I provided them in additional
columns?
Sorry, mix'n'match only does single language labels.
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Mix'n'Match with existing (indirect) mappings

2017-06-13 Thread Neubert, Joachim
Hi Osma,

sorry for jumping in late. I was at ELAG last week, talking about a very
similar topic (Wikidata as authority linking hub,
https://hackmd.io/p/S1YmXWC0e). Our use case was porting an existing mapping
between RePEc author IDs and GND IDs into Wikidata (and further extending it
there). In the course of that, we had to match as many persons as possible on
the GND as well as on the RePEc side (via Mix'n'match) before creating new
items. The code used for preparing the (QuickStatements2) insert statements is
linked from the slides.

Additionally, I've added ~12,000 GND IDs to Wikidata via their existing VIAF
identifiers (derived from a federated query over a custom VIAF endpoint and the
public WD endpoint -
https://github.com/zbw/sparql-queries/blob/master/viaf/missing_gnd_id_for_viaf.rq).
This sounds very similar to your use case; the same goes for another query
which can derive future STW ID properties from the existing STW-GND mapping
(https://github.com/zbw/sparql-queries/blob/master/stw/wikidata_mapping_candidates_via_gnd.rq
- it currently hits a timeout in the WD subquery, but worked before). I would
be happy if that could be helpful.
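
The gist of the federated pattern, reduced to a sketch (the local predicate names are made up for illustration; the linked queries are the real thing):

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX ex:  <http://example.org/>   # hypothetical predicates of the local VIAF data
# Run on the local endpoint: pick up Wikidata items that already carry a given
# VIAF ID (P214) but still lack a GND ID (P227), together with the GND ID to add.
SELECT ?wd ?viafId ?gndId WHERE {
  ?cluster ex:viafId ?viafId ;
           ex:gndId  ?gndId .
  SERVICE <https://query.wikidata.org/sparql> {
    ?wd wdt:P214 ?viafId .
    FILTER NOT EXISTS { ?wd wdt:P227 [] }
  }
}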

The plan to divide the M'n'm catalogs (places vs. subjects) makes sense to me;
we plan the same for STW. I'm not sure if a restriction to locations
(Q17334923, or something more specific) will also match all subclasses, but
Magnus could perhaps take care of that when you send him the files.

Cheers, Joachim

> -----Original Message-----
> From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On behalf of
> Osma Suominen
> Sent: Tuesday, 6 June 2017 12:19
> To: Discussion list for the Wikidata project.
> Subject: [Wikidata] Mix'n'Match with existing (indirect) mappings
> 
> Hi Wikidatans,
> 
> After several delays we are finally starting to think seriously about mapping 
> the
> General Finnish Ontology YSO [1] to Wikidata. A "YSO ID"
> property (https://www.wikidata.org/wiki/Property:P2347) was added to
> Wikidata some time ago, but it has been used only a few times so far.
> 
> Recently some 6000 places have been added to "YSO Places" [2], a new
> extension of YSO, which was generated from place names in YSA and Allärs,
> our earlier subject indexing vocabularies. It would probably make sense to map
> these places to Wikidata, in addition to the general concepts in YSO. We have
> already manually added a few links from YSA/YSO places to Wikidata for newly
> added places, but this approach does not scale if we want to link the 
> thousands
> of existing places.
> 
> We also have some indirect sources of YSO/Wikidata mappings:
> 
> 1. YSO is mapped to LCSH, and Wikidata also to LCSH (using P244, LC/NACO
> Authority File ID). I digged a bit into both sets of mappings and found that
> approximately 1200 YSO-Wikidata links could be generated from the
> intersection of these mappings.
> 
> 2. The Finnish broadcasting company Yle has also created some mappings
> between KOKO (which includes YSO) and Wikidata. Last time I looked at those,
> we could generate at least 5000 YSO-Wikidata links from them.
> Probably more nowadays.
> 
> 
> Of course, indirect mappings are a bit dangerous. It's possible that there are
> some differences in meaning, especially with LCSH which has a very different
> structure (and cultural context) than YSO. Nevertheless I think these could 
> be a
> good starting point, especially if a tool such as Mix'n'Match could be used to
> verify them.
> 
> Now my question is, given that we already have or could easily generate
> thousands of Wikidata-YSO mappings, but the rest would still have to be semi-
> automatically linked using Mix'n'Match, what would be a good way to
> approach this? Does Mix'n'Match look at existing statements (in this case YSO
> ID / P2347) in Wikidata when you load a new catalog, or ignore them?
> 
> I can think of at least these approaches:
> 
> 1. First import the indirect mappings we already have to Wikidata as
> P2347 statements, then create a Mix'n'Match catalog with the remaining YSO
> concepts. The indirect mappings would have to be verified separately.
> 
> 2. First import the indirect mappings we already have to Wikidata as
> P2347 statements, then create a Mix'n'Match catalog with ALL the YSO
> concepts, including the ones for which we already have imported a mapping.
> Use Mix'n'Match to verify the indirect mappings.
> 
> 3. Forget about the existing mappings and just create a Mix'n'Match catalog
> with all the YSO concepts.
> 
> Any advice?
> 
> Thanks,
> 
> -Osma
> 
> [1] http://finto.fi/yso/
> 
> [2] http://finto.fi/yso-paikat/
> 
> --
> Osma Suominen
> D.Sc. (Tech), Information Systems Specialist National Library of Finland P.O. 
> Box
> 26 (Kaikukatu 4)
> 00014 HELSINGIN YLIOPISTO
> Tel. +358 50 3199529
> osma.suomi...@helsinki.fi
> http://www.nationallibrary.fi
> 
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> 

Re: [Wikidata] Bogous cookie header in federated query

2017-04-25 Thread Neubert, Joachim
Hi Stas,

You are right, the header should be correct. There seems to be a long-standing
issue in Apache HttpClient
(https://issues.apache.org/jira/browse/HTTPCLIENT-923), dating back to an
ambiguity in an old Netscape cookie spec.

Sorry, I'll have to address this to the Apache guys. Cheers, Joachim

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Bogous cookie header in federated query

2017-04-24 Thread Neubert, Joachim
In a federated query on my own (Fuseki) endpoint, which reaches out to the 
Wikidata endpoint, with values already bound, it seems that I get for each 
bound value an entry in the Fuseki log like this:

[2017-04-24 19:43:33] ResponseProcessCookies WARN  Invalid cookie header: 
"Set-Cookie: WMF-Last-Access=24-Apr-2017;Path=/;HttpOnly;secure;Expires=Fri, 26 
May 2017 12:00:00 GMT". Invalid 'expires' attribute: Fri, 26 May 2017 12:00:00 
GMT
[2017-04-24 19:43:33] ResponseProcessCookies WARN  Invalid cookie header: 
"Set-Cookie: 
WMF-Last-Access-Global=24-Apr-2017;Path=/;Domain=.wikidata.org;HttpOnly;secure;Expires=Fri,
 26 May 2017 12:00:00 GMT". Invalid 'expires' attribute: Fri, 26 May 2017 
12:00:00 GMT

The (simplified) query was:

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
select *
where {
  bind("118578537" as ?gndId)
  service <https://query.wikidata.org/sparql> {
?wd wdt:P227 ?gndId .
  }
}

I suppose this header was generated by the Wikidata endpoint - and could be 
fixed?

Cheers, Joachim
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Resolver for Sqid?

2017-02-07 Thread Neubert, Joachim
For Wikidata, there exists a resolver at
https://tools.wmflabs.org/wikidata-todo/resolver.php, which allows me to build
URLs such as

https://tools.wmflabs.org/wikidata-todo/resolver.php?quick=VIAF:12307054 , or
https://tools.wmflabs.org/wikidata-todo/resolver.php?prop=P227=120434059

in order to address Wikidata items directly from their external identifiers.
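
For comparison, the lookup behind such a resolver URL boils down to a one-pattern query on the WDQS endpoint (wdt: is predefined there); a sketch with the GND example from above:

# Find the item that carries GND ID (P227) "120434059".
SELECT ?item WHERE { ?item wdt:P227 "120434059" . }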

SQID is more appealing for viewing the items. Does a similar mechanism exist 
there?

Cheers, Joachim

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SQID: the new "Wikidata classes and properties browser"

2016-04-21 Thread Neubert, Joachim
Just an idea for a further extension: if the lang parameter took a list of
languages, a fallback mechanism could be provided. E.g., "lang=de,en" could
display German labels, or English ones where the former are missing.

Perhaps such a mechanism could also encourage people to translate labels or
descriptions into "their" languages.

Cheers, Joachim

> -----Original Message-----
> From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On behalf of
> Markus Kroetzsch
> Sent: Thursday, 21 April 2016 10:49
> To: Discussion list for the Wikidata project.
> Subject: Re: [Wikidata] SQID: the new "Wikidata classes and properties
> browser"
> 
> On 21.04.2016 10:22, Neubert, Joachim wrote:
> > Hi Markus,
> >
> > Great work!
> >
> > One short question: Is there a way to switch to labels and descriptions in
> another language, by URL or otherwise?
> 
> This feature is not fully implemented yet, but there is partial support for 
> the
> that can be tried out already. To use it, add the "lang="
> parameter to the view URLs, like so:
> 
> http://tools.wmflabs.org/sqid/#/view?id=Q1339&lang=de
> 
> Once you did this, all links will retain this setting, so you can happily 
> browse
> along.
> 
> Limitations:
> * It only works for the entity view perspective, not for the property and 
> class
> browser
> * All labels will be translated to any language (if available), but the 
> application
> interface right now is only available in English and German
> 
> The feature will be extended further in the future, including support for 
> user-
> provided interface translations.
> 
> Markus
> 
> >
> > Cheers, Joachim
> >
> >> -----Original Message-----
> >> From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On
> >> behalf of Markus Kroetzsch
> >> Sent: Tuesday, 19 April 2016 21:45
> >> To: Discussion list for the Wikidata project.
> >> Subject: [Wikidata] SQID: the new "Wikidata classes and properties browser"
> >>
> >> Hi all,
> >>
> >> As promised a while ago, we have reworked our "Wikidata Classes and
> >> Properties" browser. I am happy to introduce the first beta version
> >> of a new app, called SQID:
> >>
> >> http://tools.wmflabs.org/sqid/
> >>
> >> It is a complete rewrite of the earlier application. Much faster,
> >> more usable, much more up-to-date information, supported on all
> >> reasonable browsers, and with tons of new features. Try it yourself,
> >> or read on for the main functions and current dev plans:
> >>
> >>
> >> == Browse classes and properties ==
> >>
> >> You can use it to find properties and class items by all kinds of filtering
> settings:
> >>
> >> http://tools.wmflabs.org/sqid/#/browse?type=properties
> >> http://tools.wmflabs.org/sqid/#/browse?type=classes
> >>
> >> New features:
> >> * Sort results by a criterion of choice
> >> * Powerful, easy-to-use filtering interface
> >> * Search properties by label, datatype, qualifiers used, or
> >> co-occurring properties
> >> * Search classes by label, (indirect) superclass or by properties
> >> used on instances of the class
> >> * All property statistics and some class statistics are updated every
> >> hour
> >>
> >> == View Wikidata entities ==
> >>
> >> The goal was to have a page for every property and every class, but
> >> we ended up having a generic data browser that can show all (live)
> >> data + some additional data for classes and properties (this is our
> >> main goal, but the other data is often helpful to understand the
> >> context). The UI is modelled after Reasonator as the quasi-standard
> >> of how Wikidata should look, but if you look beyond the surface you
> >> can see many differences in what SQID will (or will not) display.
> >>
> >> Examples:
> >> * Dresden, a plain item with a lot of data:
> >> http://tools.wmflabs.org/sqid/#/view?id=Q1731
> >> * Volcano, a class with many subclasses:
> >> http://tools.wmflabs.org/sqid/#/view?id=Q8072
> >> * sex or gender, a frequently used property:
> >> http://tools.wmflabs.org/sqid/#/view?id=P21
> >>
> >> Notable features:
> >> * Fast display
> >> * All statement data with all qualifiers shown
> >> * Extra statistical and live query data embedded
> >>
> >> == General Wikida

Re: [Wikidata] SQID: the new "Wikidata classes and properties browser"

2016-04-21 Thread Neubert, Joachim
Hi Markus,

Thanks a lot. Fortunately, German is the alternate language we are most 
interested in :)

Cheers, Joachim

> -----Original Message-----
> From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On behalf of
> Markus Kroetzsch
> Sent: Thursday, 21 April 2016 10:49
> To: Discussion list for the Wikidata project.
> Subject: Re: [Wikidata] SQID: the new "Wikidata classes and properties
> browser"
> 
> On 21.04.2016 10:22, Neubert, Joachim wrote:
> > Hi Markus,
> >
> > Great work!
> >
> > One short question: Is there a way to switch to labels and descriptions in
> another language, by URL or otherwise?
> 
> This feature is not fully implemented yet, but there is partial support for 
> the
> that can be tried out already. To use it, add the "lang="
> parameter to the view URLs, like so:
> 
> http://tools.wmflabs.org/sqid/#/view?id=Q1339&lang=de
> 
> Once you did this, all links will retain this setting, so you can happily 
> browse
> along.
> 
> Limitations:
> * It only works for the entity view perspective, not for the property and 
> class
> browser
> * All labels will be translated to any language (if available), but the 
> application
> interface right now is only available in English and German
> 
> The feature will be extended further in the future, including support for 
> user-
> provided interface translations.
> 
> Markus
> 
> >
> > Cheers, Joachim
> >
> >> -----Original Message-----
> >> From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On
> >> behalf of Markus Kroetzsch
> >> Sent: Tuesday, 19 April 2016 21:45
> >> To: Discussion list for the Wikidata project.
> >> Subject: [Wikidata] SQID: the new "Wikidata classes and properties browser"
> >>
> >> Hi all,
> >>
> >> As promised a while ago, we have reworked our "Wikidata Classes and
> >> Properties" browser. I am happy to introduce the first beta version
> >> of a new app, called SQID:
> >>
> >> http://tools.wmflabs.org/sqid/
> >>
> >> It is a complete rewrite of the earlier application. Much faster,
> >> more usable, much more up-to-date information, supported on all
> >> reasonable browsers, and with tons of new features. Try it yourself,
> >> or read on for the main functions and current dev plans:
> >>
> >>
> >> == Browse classes and properties ==
> >>
> >> You can use it to find properties and class items by all kinds of filtering
> settings:
> >>
> >> http://tools.wmflabs.org/sqid/#/browse?type=properties
> >> http://tools.wmflabs.org/sqid/#/browse?type=classes
> >>
> >> New features:
> >> * Sort results by a criterion of choice
> >> * Powerful, easy-to-use filtering interface
> >> * Search properties by label, datatype, qualifiers used, or
> >> co-occurring properties
> >> * Search classes by label, (indirect) superclass or by properties
> >> used on instances of the class
> >> * All property statistics and some class statistics are updated every
> >> hour
> >>
> >> == View Wikidata entities ==
> >>
> >> The goal was to have a page for every property and every class, but
> >> we ended up having a generic data browser that can show all (live)
> >> data + some additional data for classes and properties (this is our
> >> main goal, but the other data is often helpful to understand the
> >> context). The UI is modelled after Reasonator as the quasi-standard
> >> of how Wikidata should look, but if you look beyond the surface you
> >> can see many differences in what SQID will (or will not) display.
> >>
> >> Examples:
> >> * Dresden, a plain item with a lot of data:
> >> http://tools.wmflabs.org/sqid/#/view?id=Q1731
> >> * Volcano, a class with many subclasses:
> >> http://tools.wmflabs.org/sqid/#/view?id=Q8072
> >> * sex or gender, a frequently used property:
> >> http://tools.wmflabs.org/sqid/#/view?id=P21
> >>
> >> Notable features:
> >> * Fast display
> >> * All statement data with all qualifiers shown
> >> * Extra statistical and live query data embedded
> >>
> >> == General Wikidata statistics ==
> >>
> >> As a minor feature, we also publish statistics on the weekly full
> >> Wikidata dump (which we process to get some of the statistics):
> >>
> >> http://tools.wmflabs.org/sqid/#/stat

Re: [Wikidata] SQID: the new "Wikidata classes and properties browser"

2016-04-21 Thread Neubert, Joachim
Hi Markus,

Great work!

One short question: Is there a way to switch to labels and descriptions in 
another language, by URL or otherwise?

Cheers, Joachim

> -----Original Message-----
> From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On behalf of
> Markus Kroetzsch
> Sent: Tuesday, 19 April 2016 21:45
> To: Discussion list for the Wikidata project.
> Subject: [Wikidata] SQID: the new "Wikidata classes and properties browser"
> 
> Hi all,
> 
> As promised a while ago, we have reworked our "Wikidata Classes and
> Properties" browser. I am happy to introduce the first beta version of a new
> app, called SQID:
> 
> http://tools.wmflabs.org/sqid/
> 
> It is a complete rewrite of the earlier application. Much faster, more usable,
> much more up-to-date information, supported on all reasonable browsers, and
> with tons of new features. Try it yourself, or read on for the main functions 
> and
> current dev plans:
> 
> 
> == Browse classes and properties ==
> 
> You can use it to find properties and class items by all kinds of filtering 
> settings:
> 
> http://tools.wmflabs.org/sqid/#/browse?type=properties
> http://tools.wmflabs.org/sqid/#/browse?type=classes
> 
> New features:
> * Sort results by a criterion of choice
> * Powerful, easy-to-use filtering interface
> * Search properties by label, datatype, qualifiers used, or co-occurring
> properties
> * Search classes by label, (indirect) superclass or by properties used on
> instances of the class
> * All property statistics and some class statistics are updated every hour
> 
> == View Wikidata entities ==
> 
> The goal was to have a page for every property and every class, but we ended
> up having a generic data browser that can show all (live) data + some
> additional data for classes and properties (this is our main goal, but the 
> other
> data is often helpful to understand the context). The UI is modelled after
> Reasonator as the quasi-standard of how Wikidata should look, but if you look
> beyond the surface you can see many differences in what SQID will (or will 
> not)
> display.
> 
> Examples:
> * Dresden, a plain item with a lot of data:
>http://tools.wmflabs.org/sqid/#/view?id=Q1731
> * Volcano, a class with many subclasses:
>http://tools.wmflabs.org/sqid/#/view?id=Q8072
> * sex or gender, a frequently used property:
>http://tools.wmflabs.org/sqid/#/view?id=P21
> 
> Notable features:
> * Fast display
> * All statement data with all qualifiers shown
> * Extra statistical and live query data embedded
> 
> == General Wikidata statistics ==
> 
> As a minor feature, we also publish statistics on the weekly full Wikidata 
> dump
> (which we process to get some of the statistics):
> 
> http://tools.wmflabs.org/sqid/#/status
> 
> Don't trust the main page -- find out how many entities there really are ;-)
> 
> 
> == Plans, todos, feedback, contributions ==
> 
> We are on github, so please make feature requests and bug reports there:
> 
> https://github.com/Wikidata/WikidataClassBrowser/issues
> 
> Pull requests are welcome too.
> 
> Known limitations of the current version:
> 
> * Data update still a bit shaky. We refresh most statistical data every hour
> (entity data is live anyway), but you may not see this unless you clear your
> browser cache. This will be easier in the future.
> * Entity data browser does not show sitelinks and references yet.
> * Incoming properties not shown yet on entity pages
> * I18N not complete yet (if you would like to try the current dev status, see,
> e.g.: http://tools.wmflabs.org/sqid/#/view?id=Q318&lang=de)
> 
> Moreover, we are also planning to integrate more data displays, better live
> statistics, and editing capabilities. Developers who want to help there are
> welcome. SQID can also be a platform for other data display ideas (it's built
> using AngularJS, so integration is easy).
> 
> == And what about Miga? ==
> 
> The old Miga-based app at
> http://tools.wmflabs.org/wikidata-exports/miga/ will be retired in due course.
> Please update your links.
> 
> == Credits ==
> 
> I have had important development support from Markus Damm, Michael
> Günther and Georg Wild. We are funded by the German Research Foundation
> DFG. All of us are at TU Dresden. Complex statistics are computed with
> Wikidata Toolkit. Live query results come from the Wikidata SPARQL Query
> Service.
> 
> Enjoy,
> 
> Markus
> 
> --
> Markus Kroetzsch
> Faculty of Computer Science
> Technische Universität Dresden
> +49 351 463 38486
> http://korrekt.org/
> 
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Make federated queries possible / was: SPARQL CONSTRUCT results truncated

2016-02-19 Thread Neubert, Joachim
Hi Stas,

Thanks for your explanation! I'll perhaps have to do some tests on my own systems ...

Cheers, Joachim

-Ursprüngliche Nachricht-
Von: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] Im Auftrag von Stas 
Malyshev
Gesendet: Donnerstag, 18. Februar 2016 19:12
An: Discussion list for the Wikidata project.
Betreff: Re: [Wikidata] Make federated queries possible / was: SPARQL CONSTRUCT 
results truncated

Hi!

> Now, obviously endpoints referenced in a federated query via a service 
> clause have to be open - so any attacker could send his queries 
> directly instead of squeezing them through some other endpoint. The 
> only scenario I can think of is that an attacker's IP is already 
> blocked by the attacked site. If (instead of much more common ways to 
> fake an IP) the attacker chose to do it via federated queries 
> through WDQS, this _could_ result in WDQS being blocked by that 
> endpoint.

This is not what we are concerned with. What we are concerned with is that 
federation essentially requires you to run an open proxy - i.e. to allow 
anybody to send requests to any URL. This is not acceptable to us, because it 
means somebody could abuse it both to try to access our internal 
infrastructure and to launch attacks against other sites using our site as a 
platform.

We could allow access to specific whitelisted endpoints if there is enough 
demand, but so far we haven't found any way to allow access to arbitrary SPARQL 
endpoints without essentially allowing anybody to launch arbitrary network 
connections from our servers.

> provide for the linked data cloud. This need not involve the 
> highly-protected production environment, but could be solved by an 
> additional unstable/experimental endpoint under another address.

The problem is that we cannot run a production-quality endpoint in a 
non-production environment. We could set up an endpoint on Labs, but it would 
be underpowered and we would not be able to guarantee any quality of service there. 
To serve the volume of Wikidata data and updates, the machines need 
certain hardware capabilities, which Labs machines currently do not have.

Additionally, I'm not sure running an open proxy even there would be a good idea. 
Unfortunately, in today's internet environment there is no lack of players 
that would want to abuse such a thing for nefarious purposes.

We will keep looking for a solution to this, but so far we haven't found one.

Thanks,
--
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Make federated queries possible / was: SPARQL CONSTRUCT results truncated

2016-02-18 Thread Neubert, Joachim
Dear Ruben,

LDF seems a very promising approach for building a reliable Linked Data 
production environment with high scalability at relatively low cost.

However, I'm not sure the solution works well on queries like the ones 
discussed here (see below). It would be very interesting to learn how exactly 
such a query would be handled in an LDF client/server setting.

To me, a crucial point seems to be that I'm trying to look up a large number of 
distinct entities in two endpoints and join them. In the "real life" case 
discussed here, that means about 430,000 "economists" extracted from GND and about 
320,000 "persons with GND id" from Wikidata. The result of the join is about 30,000 
Wikidata items, for which the German and English Wikipedia site links are 
required.

How could an LDF client get this information effectively?

Cheers, Joachim

> PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
> PREFIX schema: <http://schema.org/>
> #
> construct {
>?gnd schema:about ?sitelink .
> }
> where {
># the relevant wikidata items have already been 
># identified and loaded to the econ_pers endpoint in a 
># previous step
>service <http://zbw.eu/beta/sparql/econ_pers/query> {
>  ?gnd skos:prefLabel [] ;
>   skos:exactMatch ?wd .
>  filter(contains(str(?wd), 'wikidata'))
>}
>?sitelink schema:about ?wd ;
>  schema:inLanguage ?language .
>filter (contains(str(?sitelink), 'wikipedia'))
>    filter (?language in ('en', 'de')) }
>
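
For concreteness, what a client-side version of such a join would boil down to: 
take the Wikidata URIs already known on the econ_pers side in batches, and ask 
the Wikidata endpoint only for the sitelinks of each batch - roughly the 
per-binding lookups an LDF client would issue as triple pattern requests, here 
just bundled into one ordinary SPARQL request per batch (the two QIDs are 
arbitrary placeholders):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX schema: <http://schema.org/>

SELECT ?wd ?sitelink WHERE {
  VALUES ?wd { wd:Q42 wd:Q1339 }   # one batch of items taken from econ_pers
  ?sitelink schema:about ?wd ;
            schema:inLanguage ?language .
  filter (contains(str(?sitelink), 'wikipedia'))
  filter (?language in ('en', 'de'))
}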

-Ursprüngliche Nachricht-
Von: Ruben Verborgh [mailto:ruben.verbo...@ugent.be] 
Gesendet: Donnerstag, 18. Februar 2016 14:02
An: wikidata@lists.wikimedia.org
Cc: Neubert, Joachim
Betreff: Re: [Wikidata] Make federated queries possible / was: SPARQL CONSTRUCT 
results truncated

Dear all,

I don't mean to hijack the thread, but for federation purposes, you might be 
interested in a Triple Pattern Fragments interface [1]. TPF offers lower server 
cost to reach high availability, at the expense of slower queries and higher 
bandwidth [2]. This is possible because the client performs most of the query 
execution.

I noticed the Wikidata SPARQL endpoint has had an excellent track record so far 
(congratulations on this), so the TPF solution might not be necessary for 
server cost / availability reasons.

However, TPF is an excellent solution for federated queries. In (yet to be 
published) experiments, we have verified that the TPF client/server solution 
performs on par with state-of-the-art federation frameworks based on SPARQL 
endpoints for many simple and complex queries. Furthermore, there are no 
security problems etc. ("open proxy"), because all federation is performed by 
the client.

You can see a couple of example queries here with other datasets:
- Works by writers born in Stockholm (VIAF and DBpedia): 
  http://bit.ly/writers-stockholm
- Books by Swedish Nobel prize winners that are in the Harvard Library 
  (VIAF, DBpedia, Harvard): http://bit.ly/swedish-nobel-harvard
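
To give a rough idea of the shape of the first example (the exact terms of the 
demo query may differ; plain DBpedia terms are used here just for illustration), 
the client takes a query like the following, requests each triple pattern as a 
stream of fragments, and performs the join itself:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?writer ?work WHERE {
  ?writer dbo:birthPlace dbr:Stockholm .   # fetched as one triple pattern fragment
  ?work   dbo:author     ?writer .         # then requested per ?writer binding
}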

It might be a quick win to set up a TPF interface on top of the existing SPARQL 
endpoint.
If you want any info, don't hesitate to ask.

Best,

Ruben

[1] http://linkeddatafragments.org/in-depth/
[2] http://linkeddatafragments.org/publications/iswc2014.pdf

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Make federated queries possible / was: SPARQL CONSTRUCT results truncated

2016-02-18 Thread Neubert, Joachim
From Stas' answer to https://phabricator.wikimedia.org/T127070 I learned the 
Wikidata Query Service does not "allow external federated queries ... for 
security reasons (it's basically open proxy)."

Now, obviously endpoints referenced in a federated query via a service clause 
have to be open - so any attacker could send his queries directly instead of 
squeezing them through some other endpoint. The only scenario I can think of is 
that an attacker's IP is already blocked by the attacked site. If (instead of 
much more common ways to fake an IP) the attacker chose to do it via 
federated queries through WDQS, this _could_ result in WDQS being blocked by 
that endpoint.

This is a quite unlikely scenario - in the seven years I have been on SPARQL 
mailing lists I cannot remember this kind of attack ever having been reported - 
but of course it is legitimate to secure production environments against any 
conceivable attack vector.

However, I think it should be possible to query Wikidata with this kind of 
query. Federated SPARQL queries are a basic building block for Linked Open 
Data, and blocking them breaks many uses Wikidata could provide for the linked 
data cloud. This need not involve the highly-protected production environment; 
it could be solved by an additional unstable/experimental endpoint under 
another address.

As an additional illustration: There is an immense difference between 
referencing something in a service clause and getting a result in a few 
seconds, and having to use the Wikidata Toolkit instead. To get the initial query for 
this thread answered by the example program Markus kindly provided at 
https://github.com/Wikidata/Wikidata-Toolkit-Examples/blob/master/src/examples/DataExtractionProcessor.java
 (and which worked perfectly - thanks again!), it took me
- more than five hours to download the dataset (in my work environment, wired to 
the DFN network)
- 20 minutes to execute the query
- considerable time to fiddle with the Java code if I had to adapt the query 
(+ another 20 minutes to execute it again)

For many parts of the world, or even for users in Germany with a slow DSL 
connection, the first point alone would prohibit any use. And even with a good 
internet connection, a new or occasional user would quite probably turn away 
if offered this procedure instead of getting a "normal", LOD-conformant query 
answered in a few seconds.

Again, I very much value your work and your determination to set up a service 
with very high availability and performance. But please make the great Wikidata 
LOD available in less demanding settings, too. It should be possible for users 
to run more advanced SPARQL queries for LOD uses in an environment where you 
cannot guarantee a high level of reliability.

Cheers, Joachim

-Ursprüngliche Nachricht-
Von: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] Im Auftrag von 
Neubert, Joachim
Gesendet: Dienstag, 16. Februar 2016 15:48
An: 'Discussion list for the Wikidata project.'
Betreff: Re: [Wikidata] SPARQL CONSTRUCT results truncated

Thanks Markus, I've created https://phabricator.wikimedia.org/T127070 with the 
details.

-Ursprüngliche Nachricht-
Von: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] Im Auftrag von 
Markus Krötzsch
Gesendet: Dienstag, 16. Februar 2016 14:57
An: Discussion list for the Wikidata project.
Betreff: Re: [Wikidata] SPARQL CONSTRUCT results truncated

Hi Joachim,

I think SERVICE queries should be working, but maybe Stas knows more about 
this. Even if they are disabled, this should result in some error message rather 
than in a NullPointerException. Looks like a bug.

Markus


On 16.02.2016 13:56, Neubert, Joachim wrote:
> Hi Markus,
>
> Great that you checked that out. I can confirm that the simplified query 
> worked for me, too. It took 15.6s and revealed roughly the same number of 
> results (323789).
>
> When I loaded the results into http://zbw.eu/beta/sparql/econ_pers/query, an 
> endpoint for "economics-related" persons, it matched with 36050 persons 
> (supposedly the "most important" 8 percent of our set).
>
> What I normally do to get the corresponding Wikipedia site URLs is a query 
> against the Wikidata endpoint, which references the relevant Wikidata URIs 
> via a "service" clause:
>
> PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
> PREFIX schema: <http://schema.org/>
> #
> construct {
>?gnd schema:about ?sitelink .
> }
> where {
>service <http://zbw.eu/beta/sparql/econ_pers/query> {
>  ?gnd skos:prefLabel [] ;
>   skos:exactMatch ?wd .
>  filter(contains(str(?wd), 'wikidata'))
>}
>?sitelink schema:about ?wd ;
>  schema:inLanguage ?language .
>filter (contains(str(?sitelink), 'wikipedia'))
>filter (lang(?wdLabel) = ?language && ?language in ('en',

Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-16 Thread Neubert, Joachim
Thanks Markus, I've created https://phabricator.wikimedia.org/T127070 with the 
details.

-Ursprüngliche Nachricht-
Von: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] Im Auftrag von 
Markus Krötzsch
Gesendet: Dienstag, 16. Februar 2016 14:57
An: Discussion list for the Wikidata project.
Betreff: Re: [Wikidata] SPARQL CONSTRUCT results truncated

Hi Joachim,

I think SERVICE queries should be working, but maybe Stas knows more about 
this. Even if they are disabled, this should result in some error message rather 
than in a NullPointerException. Looks like a bug.

Markus


On 16.02.2016 13:56, Neubert, Joachim wrote:
> Hi Markus,
>
> Great that you checked that out. I can confirm that the simplified query 
> worked for me, too. It took 15.6s and revealed roughly the same number of 
> results (323789).
>
> When I loaded the results into http://zbw.eu/beta/sparql/econ_pers/query, an 
> endpoint for "economics-related" persons, it matched with 36050 persons 
> (supposedly the "most important" 8 percent of our set).
>
> What I normally do to get the corresponding Wikipedia site URLs is a query 
> against the Wikidata endpoint, which references the relevant Wikidata URIs 
> via a "service" clause:
>
> PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
> PREFIX schema: <http://schema.org/>
> #
> construct {
>?gnd schema:about ?sitelink .
> }
> where {
>service <http://zbw.eu/beta/sparql/econ_pers/query> {
>  ?gnd skos:prefLabel [] ;
>   skos:exactMatch ?wd .
>  filter(contains(str(?wd), 'wikidata'))
>}
>?sitelink schema:about ?wd ;
>  schema:inLanguage ?language .
>filter (contains(str(?sitelink), 'wikipedia'))
>    filter (?language in ('en', 'de')) }
>
> This, however, results in a Java error.
>
> If "service" clauses are supposed to work on the Wikidata endpoint, I'd 
> happily provide additional details in Phabricator.
>
> For now, I'll get the data via your java example code :)
>
> Cheers, Joachim
>
> -Ursprüngliche Nachricht-
> Von: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] Im Auftrag 
> von Markus Kroetzsch
> Gesendet: Samstag, 13. Februar 2016 22:56
> An: Discussion list for the Wikidata project.
> Betreff: Re: [Wikidata] SPARQL CONSTRUCT results truncated
>
> And here is another comment on this interesting topic :-)
>
> I just realised how close the service is to answering the query. It turns out 
> that you can in fact get the whole set of (currently >324000 result items) 
> together with their GND identifiers as a download *within the timeout* (I 
> tried several times without any errors). This is a 63M json result file with 
> >640K individual values, and it downloads in no time on my home network. The 
> query I use is simply this:
>
> PREFIX wd: <http://www.wikidata.org/entity/>
> PREFIX wdt: <http://www.wikidata.org/prop/direct/>
>
> select ?item ?gndId
> where {
>   ?item wdt:P227 ?gndId ; # get gnd ID
>         wdt:P31  wd:Q5  . # instance of human
> } ORDER BY ASC(?gndId) LIMIT 10
>
> (don't run this in vain: even with the limit, the ORDER clause 
> requires the service to compute all results every time someone runs 
> this. Also be careful when removing the limit; your browser may hang 
> on an HTML page that large; better use the SPARQL endpoint directly to 
> download the complete result file.)
>
> It seems that the timeout is only hit when adding more information (labels 
> and wiki URLs) to the result.
>
> So it seems that we are not actually very far away from being able to answer 
> the original query even within the timeout. Certainly not as far away as I 
> first thought. It might not be necessary at all to switch to a different 
> approach (though it would be interesting to know how long LDF takes to answer 
> the above -- our current service takes less than 10sec).
>
> Cheers,
>
> Markus
>
>
> On 13.02.2016 11:40, Peter Haase wrote:
>> Hi,
>>
>> you may want to check out the Linked Data Fragment server in Blazegraph:
>> https://github.com/blazegraph/BlazegraphBasedTPFServer
>>
>> Cheers,
>> Peter
>>> On 13.02.2016, at 01:33, Stas Malyshev <smalys...@wikimedia.org> wrote:
>>>
>>> Hi!
>>>
>>>> The Linked Data Fragments approach Osma mentioned is very 
>>>> interesting (particularly the bit about setting it up on top of a 
>>>> regularly updated existing endpoint), and could provide another 
>>>> alternative, but I have not yet experimented with it.

Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-12 Thread Neubert, Joachim
It's great how this discussion evolves - thanks to everybody!

Technically, I completely agree that in practice it may prove impossible to 
predict the load a query will produce. Relational databases have invested years 
and years in query optimization (e.g., Oracle's cost-based optimizer, which 
relies on extended statistics gathered at runtime), and I can't see 
similar investments being possible for triple stores.

What I could imagine for public endpoints is the SPARQL engine monitoring and 
prioritizing queries: the longer a query has already been running, or the more 
resources it has already used, the lower the priority it is re-scheduled at (up 
to some final limit). But this is just a theoretical consideration; I'm not aware 
of any system that implements anything like this - and it could only be 
implemented in the engine itself.

For ZBW's SPARQL endpoints, I've implemented a much simpler three-level 
strategy, which does not involve the engine at all:

1. Endpoints which drive production-level services (e.g. autosuggest or 
retrieval enhancement functions). These endpoints run on separate machines and 
offer completely encapsulated services via a public API 
(http://zbw.eu/beta/econ-ws), without any direct SPARQL access.

2. Public "beta" endpoints (http://zbw.eu/beta/sparql). These offer 
unrestricted SPARQL access, but without any garanties about performance or 
availability - though of course I do my best to keep these up and running. They 
run on an own virtual machine, and should not hurt any other parts of the 
infrastructure when getting overloaded or out of control.

3. Public "experimental" endpoints. These include in particular an endpoint for 
the GND dataset with 130m triples. It was mainly created for internal use 
because (to my best knowledge) no other public GND endpoint exists. The 
endpoint is not linked from the GND pages of DNB, and I've advertised it very 
low-key on a few mailing lists. For these experimental endpoints, we reserve 
the right to shut them down for the public if they get flooded with more 
requests than they can handle.

It may be of interest that up to now, on none of these public endpoints have we 
come across issues with attacks or evil-minded queries (which were a matter of 
concern when I started this in 2009), nor with longer-lasting massive access. 
Of course, that is different for Wikidata, where the data is of interest to 
_much_ more people. But if at all affordable, I'd like to encourage offering 
some kind of experimental access with really wide limits in an "unstable" 
setting, in addition to the reliable services. For most people who just want to 
check something out, it's not an option to download the whole dataset and set 
up an infrastructure for it. For us, this was an issue even with the much 
smaller GND set.

The Linked Data Fragments approach Osma mentioned is very interesting 
(particularly the bit about setting it up on top of a regularly updated 
existing endpoint) and could provide another alternative, but I have not yet 
experimented with it.

Have a fine weekend - Joachim

-Ursprüngliche Nachricht-
Von: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] Im Auftrag von 
Markus Krötzsch
Gesendet: Freitag, 12. Februar 2016 09:44
An: Discussion list for the Wikidata project.
Betreff: Re: [Wikidata] SPARQL CONSTRUCT results truncated

On 12.02.2016 00:04, Stas Malyshev wrote:
> Hi!
>
>> We basically have two choices: either we offer a limited interface 
>> that only allows for a narrow range of queries to be run at all. Or 
>> we offer a very general interface that can run arbitrary queries, but 
>> we impose limits on time and memory consumption. I would actually 
>> prefer the first option, because it's more predictable, and doesn't get 
>> people's hopes up too far. What do you think?
>
> That would require implementing a pretty smart SPARQL parser... I don't 
> think it's worth the investment of time. I'd rather put caps on runtime 
> and maybe also on parallel queries per IP, to ensure fair access. We 
> may also have a way to run longer queries - in fact, we'll need it 
> anyway if we want to automate lists - but that is longer term, we'll 
> need to figure out infrastructure for that and how we allocate access.

+1

Restricting queries syntactically to be "simpler" is what we did in Semantic 
MediaWiki (because MySQL did not support time/memory limits per query). It is a 
workaround, but it will not prevent long-running queries unless you make the 
syntactic restrictions really severe (and thereby forbid many simple queries, 
too). I would not do it if there is support for time/memory limits instead.

In the end, even the SPARQL engines are not able to predict reliably how 
complicated a query is going to be -- it's an important part of their work (for 
optimising query execution), but it is also very difficult.

Markus

>


___
Wikidata mailing list
Wikidata@lists.wikimedia.org

Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Neubert, Joachim
Hi Markus,

thank you very much - your code will be extremely helpful for solving my current 
need. And though I am not a Java programmer, I may even be able to adjust it to 
similar queries.

On the other hand, it's some steps away from the promises of Linked Data and 
SPARQL endpoints. I greatly value the Wikidata endpoint for having the 
current data: if I add some bit in the user interface, I can query for it 
immediately afterwards, and I can do this in a uniform way via standard SPARQL 
queries. I can imagine how hard that was to achieve.

And I completely agree that it's impossible to build a SPARQL endpoint which 
reliably serves arbitrarily complex queries for multiple users in finite time. 
(This is the reason why all our public endpoints at http://zbw.eu/beta/sparql/ 
are labeled beta.) And you can easily get to a point where some ill-behaved 
query is run over and over again by some stupid program, and you have to be 
quite restrictive to keep your service up.

So an "unstable" endpoint with wider limits, as you suggested in your later 
mail, could be a great solution for this. In both instances, it would be nice 
if the policy and the actual limits could be documented, so users would know 
what to expect (and how to act appropriately as good citizens).

Thanks again for the code, and for taking up the discussion.

Cheers, Joachim

-Ursprüngliche Nachricht-
Von: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] Im Auftrag von 
Markus Krötzsch
Gesendet: Donnerstag, 11. Februar 2016 15:05
An: Discussion list for the Wikidata project.
Betreff: Re: [Wikidata] SPARQL CONSTRUCT results truncated

Hi Joachim,

Here is a short program that solves your problem:

https://github.com/Wikidata/Wikidata-Toolkit-Examples/blob/master/src/examples/DataExtractionProcessor.java

It is in Java, so you need that (and Maven) to run it, but that's the only 
technical challenge ;-). You can run the program in various ways as described 
in the README:

https://github.com/Wikidata/Wikidata-Toolkit-Examples

The program I wrote puts everything into a CSV file, but you can of course also 
write RDF triples if you prefer this, or any other format you wish. The code 
should be easy to modify.

On a first run, the tool will download the current Wikidata dump, which takes a 
while (it's about 6G), but after this you can find and serialise all results in 
less than half an hour (for a processing rate of around 10K items/second). A 
regular laptop is enough to run it.

Cheers,

Markus


On 11.02.2016 01:34, Stas Malyshev wrote:
> Hi!
>
>> I am trying to extract all mappings from Wikidata to the GND authority 
>> file, along with the corresponding Wikipedia pages, expecting roughly 
>> 500,000 to 1m triples as a result.
>
> As a starting note, I don't think extracting 1M triples is the 
> best way to use the query service. If you need to do processing that 
> returns such big result sets - in the millions - maybe processing the dump 
> - e.g. with Wikidata Toolkit at 
> https://github.com/Wikidata/Wikidata-Toolkit - would be a better idea?
>
>> However, with various calls, I get far fewer triples (about 2,000 to 
>> 10,000). The output seems to be truncated in the middle of a statement, e.g.
>
> It may be some kind of timeout because of the quantity of the data 
> being sent. How long does such a request take?
>


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] SPARQL CONSTRUCT results truncated

2016-02-08 Thread Neubert, Joachim
I am trying to extract all mappings from Wikidata to the GND authority file, along 
with the corresponding Wikipedia pages, expecting roughly 500,000 to 1m triples as 
a result.

However, with various calls, I get far fewer triples (about 2,000 to 10,000). 
The output seems to be truncated in the middle of a statement, e.g.:

...
  [truncated example output - the RDF terms in angle brackets were stripped by the list archive]

# the namespace IRIs were stripped by the list archive; the standard
# Wikidata/W3C namespaces are assumed below (wikibase:, p:, v:, q: are unused)
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
PREFIX q: <http://www.wikidata.org/prop/qualifier/>
PREFIX schema: <http://schema.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
#
construct {
  ?gnd skos:exactMatch ?wd ;
       schema:about ?sitelink .
}
#select ?gndId ?wd ?wdLabel ?sitelink ?gnd
where {
  # get all wikidata items and labels linked to GND
  ?wd wdt:P227 ?gndId ;
  rdfs:label ?wdLabel ;
  # restrict to
  wdt:P31 wd:Q5  . # instance of human
  # get site links (only from de/en wikipedia sites)
  ?sitelink schema:about ?wd ;
schema:inLanguage ?language .
  filter (contains(str(?sitelink), 'wikipedia'))
  filter (lang(?wdLabel) = ?language && ?language in ('en', 'de'))
  bind(uri(concat('http://d-nb.info/gnd/', ?gndId)) as ?gnd)
}
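
One workaround I could imagine (untested): split the extraction into 
deterministic slices, e.g. by the leading characters of the GND identifier, so 
that each CONSTRUCT stays well below the endpoint's time and size limits, and 
concatenate the slices afterwards. A single slice would look like this (the 
prefix '10' is just one example; vary it to cover the whole identifier space):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
#
construct {
  ?gnd skos:exactMatch ?wd ;
       schema:about ?sitelink .
}
where {
  ?wd wdt:P227 ?gndId ;
      rdfs:label ?wdLabel ;
      wdt:P31 wd:Q5 .                   # instance of human
  filter (strstarts(?gndId, '10'))      # the slice
  ?sitelink schema:about ?wd ;
            schema:inLanguage ?language .
  filter (contains(str(?sitelink), 'wikipedia'))
  filter (lang(?wdLabel) = ?language && ?language in ('en', 'de'))
  bind(uri(concat('http://d-nb.info/gnd/', ?gndId)) as ?gnd)
}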
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata