[Wikidata] Re: Talk to the Search Platform / Query Service Team—May 8, 2024

2024-05-08 Thread Guillaume Lederrey
This is happening 1h from now.

On Fri, 3 May 2024 at 17:04, Guillaume Lederrey 
wrote:

> Hello all!
>
> The Search Platform Team usually holds an open meeting on the first
> Wednesday of each month. Come talk to us about anything related to
> Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
> Service (WCQS), etc.!
>
> Feel free to add your items to the Etherpad Agenda for the next meeting.
>
> Details for our next meeting:
> Date: Wednesday, May 8, 2024
> Time: 15:00-16:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CEST
> Etherpad:
> https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
>
> Have fun and see you soon!
>
>   Guillaume
>
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/NZWTPFAYQH3456MYUXN3SBA2C5U5W5WH/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Talk to the Search Platform / Query Service Team—May 8, 2024

2024-05-03 Thread Guillaume Lederrey
Hello all!

The Search Platform Team usually holds an open meeting on the first
Wednesday of each month. Come talk to us about anything related to
Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
Service (WCQS), etc.!

Feel free to add your items to the Etherpad Agenda for the next meeting.

Details for our next meeting:
Date: Wednesday, May 8, 2024
Time: 15:00-16:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CEST
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vgj-bbeb-uyi
Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927

Have fun and see you soon!

  Guillaume

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/3VB6I5XR2JEQWDMKME4GGPPIH5OJOYH6/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] WDQS Scaling update

2024-04-17 Thread Guillaume Lederrey
Hello all!

We’ve been moving forward on the WDQS Graph Split [1], so it’s time for an update!

We have new documentation to help the migration to the split graph:
* Federation limits [2]: an explanation of the limitations of SPARQL
federation as used on the graph split. This should help you understand what
is and isn’t possible when you need to federate the main WDQS graph with
the scholarly subgraph.
* Federated queries examples [3]: a document explaining how to rewrite
queries to use SPARQL federation over the split graph. We’ve taken a number
of real-life examples and rewritten them to use federation. While
rewriting queries is not always trivial, we were able to make every example
we tried work over the split graph.
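As a rough illustration of the kind of rewrite described above, the sketch below splits a query across the two graphs with a SERVICE clause. This is a hypothetical example, not one from the documentation: the IDs are standard Wikidata (Q13442814 = scholarly article, P50 = author, P106 = occupation, Q1650915 = researcher), but the scholarly endpoint URL is the experimental one and the `/sparql` path is an assumption; production URLs may differ.

```sparql
# Hedged sketch of a federated rewrite (assumed endpoint URL).
# Main graph: find researchers; scholarly subgraph: count their articles.
SELECT ?author ?authorLabel (COUNT(?article) AS ?papers) WHERE {
  ?author wdt:P106 wd:Q1650915 .            # occupation: researcher (main graph)
  SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
    ?article wdt:P31 wd:Q13442814 ;         # instance of: scholarly article
             wdt:P50 ?author .              # author
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
GROUP BY ?author ?authorLabel
ORDER BY DESC(?papers)
LIMIT 10
```

Before the split, the same query would have had no SERVICE clause, since both triple patterns lived in a single graph.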

We have been reaching out to people who will be impacted by the graph
split. In particular, we have been talking with community members close to
the Scholia and WikiCite projects. In that context, we realized that our
initial split proposal (moving all instances of scholarly articles to a
separate graph: ?entity wdt:P31 wd:Q13442814) is not sufficient. We have
prepared a second and final proposal that refines the split to make it
easier to use. See "WDQS Split Refinement" [4] for details. We are open for
feedback until May 15th, 2024; please send it to the related talk page [5].

While we refine this split, we are starting work on the implementation of
the missing pieces to make the graph split available. This includes
modifying the update pipeline to support the split and better automation of
the data loading process. We are also working on a migration plan, which we
will communicate as soon as it is ready. Our current assumption is that we
will leave ~6 months for the migration once the split services are
available before shutting down the full graph endpoint.

We need your help more than ever!
If you have use cases that need access to scholarly articles, please read
"Federation Limits" [2] and "Federated Queries Examples" [3], rewrite and
test your queries, and add your working examples to "Federated Queries
Examples" [3].
Send your general feedback to the project page [1].

On a side note, WDQS isn’t the only SPARQL endpoint exposing the Wikidata
graph. Have a look at "Alternative endpoints" [6], which lists a number of
endpoints not hosted by the WMF that might be helpful during the
transition.

Thanks!

   Guillaume

[1]
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split
[2]
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split/Federation_Limits
[3]
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split/Federated_Queries_Examples
[4]
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split/WDQS_Split_Refinement
[5]
https://www.wikidata.org/w/index.php?title=Wikidata_talk:SPARQL_query_service/WDQS_graph_split/WDQS_Split_Refinement&action=edit
[6]
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/Alternative_endpoints

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/AHOKYOHFMHHDVOSVTFON3PGB5EAUUPX2/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Talk to the Search Platform / Query Service Team—April 3, 2024

2024-04-03 Thread Guillaume Lederrey
This is happening 1h from now.

On Mon, 1 Apr 2024 at 20:53, Guillaume Lederrey 
wrote:

> Hello all!
>
> The Search Platform Team usually holds an open meeting on the first
> Wednesday of each month. Come talk to us about anything related to
> Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
> Service (WCQS), etc.!
>
> Feel free to add your items to the Etherpad Agenda for the next meeting.
>
> Details for our next meeting:
> Date: Wednesday, April 3, 2024
> Time: 15:00-16:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CEST
> Etherpad:
> https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
>
> Have fun and see you soon!
>
>   Guillaume
>
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/IEHPLDMOLISUKNCCCJ6UC4EGRKDV4F2H/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Talk to the Search Platform / Query Service Team—April 3, 2024

2024-04-01 Thread Guillaume Lederrey
Hello all!

The Search Platform Team usually holds an open meeting on the first
Wednesday of each month. Come talk to us about anything related to
Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
Service (WCQS), etc.!

Feel free to add your items to the Etherpad Agenda for the next meeting.

Details for our next meeting:
Date: Wednesday, April 3, 2024
Time: 15:00-16:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CEST
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vgj-bbeb-uyi
Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927

Have fun and see you soon!

  Guillaume

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/HG3XB4EEJDNTU5QBJQDB2Y6BRL5B44FZ/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Talk to the Search Platform / Query Service Team—March 6, 2024

2024-03-06 Thread Guillaume Lederrey
This is happening in 1h.

On Mon, 4 Mar 2024 at 16:32, Guillaume Lederrey 
wrote:

> Hello all!
>
> The Search Platform Team usually holds an open meeting on the first
> Wednesday of each month. Come talk to us about anything related to
> Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
> Service (WCQS), etc.!
>
> Feel free to add your items to the Etherpad Agenda for the next meeting.
>
> Details for our next meeting:
> Date: Wednesday, March 6, 2024
> Time: 16:00-17:00 UTC / 08:00 PST / 11:00 EST / 17:00 CET
> Etherpad:
> https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
>
> Have fun and see you soon!
>
>   Guillaume
>
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/WNNNXGNV6JRBL6ES2UFOB5ECHRYIO5F6/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Talk to the Search Platform / Query Service Team—March 6, 2024

2024-03-04 Thread Guillaume Lederrey
Hello all!

The Search Platform Team usually holds an open meeting on the first
Wednesday of each month. Come talk to us about anything related to
Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
Service (WCQS), etc.!

Feel free to add your items to the Etherpad Agenda for the next meeting.

Details for our next meeting:
Date: Wednesday, March 6, 2024
Time: 16:00-17:00 UTC / 08:00 PST / 11:00 EST / 17:00 CET
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vgj-bbeb-uyi
Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927

Have fun and see you soon!

  Guillaume

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/OZEO66EKF5SMAH2E5X7V7SXQTG6VJRHO/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Wikidata Query Service - Scaling Update - February 2024

2024-02-07 Thread Guillaume Lederrey
Hello all!

We have been hard at work on our Graph Split experiment [1], and we now
have a working graph split that is loaded onto 3 test servers. We are
running tests on a selection of queries from our logs to help understand
the impact of the split. We need your help to validate the impact of
various use cases and workflows around Wikidata Query Service.

**What is the WDQS Graph Split experiment?**

We want to address the growing size of the Wikidata graph by splitting it
into two subgraphs, each roughly half the size of the full graph, which
should support the growth of Wikidata for the next 5 years. This experiment
is about splitting the full Wikidata graph into a scholarly-articles
subgraph and a “main” graph that contains everything else.

See our previous update for more details [2].

**Who should care?**

Anyone who uses WDQS through the UI or programmatically should check the
impact on their use cases, scripts, bots, code, etc.

**What are those test endpoints?**

We expose 3 test endpoints, for the full, main and scholarly articles
graphs. Those graphs are all created from the same dump and are not live
updated. This allows us to compare queries between the different endpoints
with stable, unchanging data (from mid-October 2023).

The endpoints are:
* https://query-full-experimental.wikidata.org/
* https://query-main-experimental.wikidata.org/
* https://query-scholarly-experimental.wikidata.org/
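One way to get a feel for the split is to run the same query unchanged against each of the three endpoints and compare the answers. The sketch below is a minimal probe, assuming each endpoint's query UI accepts standard SPARQL with the usual Wikidata prefixes:

```sparql
# Sketch: probe which experimental endpoint contains scholarly-article
# triples. Run unchanged against each of the three endpoints above.
ASK {
  ?article wdt:P31 wd:Q13442814 .   # any instance of "scholarly article"
}
```

If the split behaves as described, the full and scholarly endpoints should answer true; whether the main endpoint answers false depends on how completely such triples were moved out, so treat the result as a probe rather than a guarantee.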

Each of the endpoints is backed by a single dedicated server with
performance similar to the production WDQS servers. We don’t expect
performance to be representative of production, due to the different load
and the lack of updates on the test servers.

**What kind of feedback is useful?**

We expect queries that don’t require scholarly articles to work
transparently on the “main” subgraph. We expect queries that require
scholarly articles to need rewriting with SPARQL federation between the
“main” and scholarly subgraphs (federation is already supported for some
external SPARQL servers [3]; here it is simply used for internal
server-to-server communication). We are doing tests and analysis based on a
sample of query logs.

**We want to hear about:**

* General use cases or classes of queries that break under federation
* Bots or applications that need significant query rewrites to work with
federation
* And also use cases that work just fine!

Examples of queries and pointers to code will be helpful in your feedback.

**Where should feedback be sent?**

You can reach out to us using the project’s talk page [1], the Phabricator
ticket for community feedback [4] or by pinging directly Sannita (WMF) [5].

**Will feedback be taken into account?**

Yes! We will review feedback, and it will influence our path forward. That
said, there are limits to what is possible. The size of the Wikidata graph
is a threat to the stability of WDQS, and thus a threat to the whole
Wikidata project. The scholarly-articles split is the only one we know of
that would reduce the graph size sufficiently. We can work together on
supporting the migration and on reviewing the rules used for the graph
split, but we can’t just ignore the problem and continue with a WDQS that
provides transparent access to the full Wikidata graph.

  Have fun!

  Guillaume

[1]
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split
[2]
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update/October_2023_scaling_update
[3]
https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Federation
[4] https://phabricator.wikimedia.org/T356773
[5] https://www.wikidata.org/wiki/User:Sannita_(WMF)
--
Guillaume Lederrey (he/him)
Engineering Manager
Wikimedia Foundation
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/IIA5LVHBYK45FSMLPIVZI6WXA5QSRPF4/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Talk to the Search Platform / Query Service Team—February 7, 2024

2024-02-07 Thread Guillaume Lederrey
This is happening 1h from now.

On Mon, 5 Feb 2024 at 15:11, Guillaume Lederrey 
wrote:

> Hello all!
>
> The Search Platform Team usually holds an open meeting on the first
> Wednesday of each month. Come talk to us about anything related to
> Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
> Service (WCQS), etc.!
>
> Feel free to add your items to the Etherpad Agenda for the next meeting.
>
> Details for our next meeting:
> Date: Wednesday, February 7, 2024
> Time: 16:00-17:00 UTC / 08:00 PST / 11:00 EST / 17:00 CET
> Etherpad:
> https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
>
> Have fun and see you soon!
>
>Guillaume
>
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/UOOLGG34CDNW7JHOMGB7WIDRBI4D2YP2/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Talk to the Search Platform / Query Service Team—February 7, 2024

2024-02-05 Thread Guillaume Lederrey
Hello all!

The Search Platform Team usually holds an open meeting on the first
Wednesday of each month. Come talk to us about anything related to
Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
Service (WCQS), etc.!

Feel free to add your items to the Etherpad Agenda for the next meeting.

Details for our next meeting:
Date: Wednesday, February 7, 2024
Time: 16:00-17:00 UTC / 08:00 PST / 11:00 EST / 17:00 CET
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vgj-bbeb-uyi
Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927

Have fun and see you soon!

   Guillaume

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/WQB2SIGNRN7R5TFEPUJNOLDVF7IKCMC5/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Talk to the Search Platform / Query Service Team—January 10, 2024

2024-01-10 Thread Guillaume Lederrey
This is happening 1h from now.

On Tue, 9 Jan 2024 at 11:27, Guillaume Lederrey 
wrote:

> Hello all!
>
> The Search Platform Team usually holds an open meeting on the first
> Wednesday of each month. Come talk to us about anything related to
> Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
> Service (WCQS), etc.!
>
> Feel free to add your items to the Etherpad Agenda for the next meeting.
>
> Details for our next meeting:
> Date: Wednesday, January 10, 2024
> Time: 16:00-17:00 UTC / 08:00 PST / 11:00 EST / 17:00 CET
> Etherpad:
> https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
>
> Have fun and see you soon!
>
>Guillaume
>
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/IZGUMVMB3TIMTQFD5G7SND3XVLCC3KWP/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Talk to the Search Platform / Query Service Team—January 10, 2024

2024-01-09 Thread Guillaume Lederrey
Hello all!

The Search Platform Team usually holds an open meeting on the first
Wednesday of each month. Come talk to us about anything related to
Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
Service (WCQS), etc.!

Feel free to add your items to the Etherpad Agenda for the next meeting.

Details for our next meeting:
Date: Wednesday, January 10, 2024
Time: 16:00-17:00 UTC / 08:00 PST / 11:00 EST / 17:00 CET
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vgj-bbeb-uyi
Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927

Have fun and see you soon!

   Guillaume

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/36FNFH6P4LTATIGYVZZ4KOOUEUB4P3TG/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Talk to the Search Platform / Query Service Team—December 6, 2023

2023-12-06 Thread Guillaume Lederrey
This is happening 1h from now.

On Mon, 4 Dec 2023 at 14:42, Guillaume Lederrey 
wrote:

> Hello all!
>
> The Search Platform Team usually holds an open meeting on the first
> Wednesday of each month. Come talk to us about anything related to
> Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
> Service (WCQS), etc.!
>
> Feel free to add your items to the Etherpad Agenda for the next meeting.
>
> Details for our next meeting:
> Date: Wednesday, December 6, 2023
> Time: 16:00-17:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CEST
> Etherpad:
> https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
>
> Have fun and see you soon!
>
>Guillaume
>
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/6HM7P2RET3RQJH2PLYZXNVYIZBBS3DDP/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Talk to the Search Platform / Query Service Team—December 6, 2023

2023-12-04 Thread Guillaume Lederrey
Hello all!

The Search Platform Team usually holds an open meeting on the first
Wednesday of each month. Come talk to us about anything related to
Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
Service (WCQS), etc.!

Feel free to add your items to the Etherpad Agenda for the next meeting.

Details for our next meeting:
Date: Wednesday, December 6, 2023
Time: 16:00-17:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CEST
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vgj-bbeb-uyi
Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927

Have fun and see you soon!

   Guillaume

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/BADVPPRWAKA4JGJ7FEU6XVXW4AWJKV3E/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Talk to the Search Platform / Query Service Team—November 1, 2023

2023-11-01 Thread Guillaume Lederrey
This is happening in 1h.

On Mon, 30 Oct 2023 at 17:40, Guillaume Lederrey 
wrote:

> Hello all!
>
> The Search Platform Team usually holds an open meeting on the first
> Wednesday of each month. Come talk to us about anything related to
> Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
> Service (WCQS), etc.!
>
> Feel free to add your items to the Etherpad Agenda for the next meeting.
>
> Details for our next meeting:
> Date: Wednesday, November 1, 2023
> Time: 15:00-16:00 UTC / 08:00 PT / 11:00 EDT / 16:00 CET
> Etherpad:
> https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
>
> Have fun and see you soon!
>
>Guillaume
>
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/NBIWREMCUR4QCL476VUWV7ADRZROISOE/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Talk to the Search Platform / Query Service Team—November 1, 2023

2023-10-30 Thread Guillaume Lederrey
Hello all!

The Search Platform Team usually holds an open meeting on the first
Wednesday of each month. Come talk to us about anything related to
Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
Service (WCQS), etc.!

Feel free to add your items to the Etherpad Agenda for the next meeting.

Details for our next meeting:
Date: Wednesday, November 1, 2023
Time: 15:00-16:00 UTC / 08:00 PT / 11:00 EDT / 16:00 CET
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vgj-bbeb-uyi
Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927

Have fun and see you soon!

   Guillaume

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/OX2QCMVVLZ2JCVOBXYWWULFALD5QOO6W/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Wikidata Query Service - Scaling Update - October 2023

2023-10-17 Thread Guillaume Lederrey
[...] to reduce the complexity of the experiment, we will focus on a static
dump. If the experiment is successful, more work will be done to ensure
that those split graphs can be updated in real time.

* Production implementation of multiple graphs: we will only commit to a
production implementation if the experiment is successful.

Success criteria

Part of the experimentation is understanding the impacts of this split, so
we only have imperfect metrics at this time.

* Blazegraph stability is not threatened by the size of the graph. Our
expectation is that a size reduction of 25% will give us leeway. A proxy
metric for stability is our ability to reload the data from scratch in
less than 10 days.
* Query time is not increased for most queries.
* The number of queries requiring a rewrite due to federation is minimal.
* The number of queries rendered too expensive by federation is minimal.
How to learn more?

We will create a wiki page for the project shortly; it will be the main
focal point for discussions. You are always welcome to join the Search
Platform Office Hours
<https://wikitech.wikimedia.org/wiki/Search_Platform/Contact#Office_Hours>
(first Wednesday of every month) to ask questions and have a direct
discussion with the team.


This communication is also available on wiki
<https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update/October_2023_scaling_update>
.

Thank you all for your help and support!

   Guillaume




-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/A77E57AWD474P5UXE3EX4BKVPHHLATNH/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Talk to the Search Platform / Query Service Team—October 4, 2023

2023-10-04 Thread Guillaume Lederrey
This is happening 1 hour from now.

On Fri, 29 Sept 2023 at 11:58, Guillaume Lederrey 
wrote:

> Hello all!
>
> The Search Platform Team usually holds an open meeting on the first
> Wednesday of each month. Come talk to us about anything related to
> Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
> Service (WCQS), etc.!
>
> Feel free to add your items to the Etherpad Agenda for the next meeting.
>
> Details for our next meeting:
> Date: Wednesday, October 4, 2023
> Time: 15:00-16:00 UTC / 08:00 PT / 11:00 EDT / 17:00 CET
> Etherpad:
> https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
>
> Have fun and see you soon!
>
>Guillaume
>
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/Z6ILGCZFQAWDJRJLOPHOY7IAXE2MR2OS/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Talk to the Search Platform / Query Service Team—October 4, 2023

2023-09-29 Thread Guillaume Lederrey
Hello all!

The Search Platform Team usually holds an open meeting on the first
Wednesday of each month. Come talk to us about anything related to
Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
Service (WCQS), etc.!

Feel free to add your items to the Etherpad Agenda for the next meeting.

Details for our next meeting:
Date: Wednesday, October 4, 2023
Time: 15:00-16:00 UTC / 08:00 PT / 11:00 EDT / 17:00 CET
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vgj-bbeb-uyi
Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927

Have fun and see you soon!

   Guillaume

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/S7OCXSRJQR4ITMOZVJXY6ND3XG4YIGGG/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Talk to the Search Platform / Query Service Team—September 6, 2023

2023-09-06 Thread Guillaume Lederrey
This is happening in 1 hour.

On Fri, 1 Sept 2023 at 17:05, Guillaume Lederrey 
wrote:

> Hello all!
>
> The Search Platform Team usually holds an open meeting on the first
> Wednesday of each month. Come talk to us about anything related to
> Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
> Service (WCQS), etc.!
>
> Feel free to add your items to the Etherpad Agenda for the next meeting.
>
> Details for our next meeting:
> Date: Wednesday, September 6, 2023
> Time: 15:00-16:00 UTC / 08:00 PT / 11:00 EDT / 17:00 CET
> Etherpad:
> https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
>
> Have fun and see you soon!
>
>Guillaume
>
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/WZVXRBJIHJAFXKTD3NIWUNBHLTOGSLF3/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Talk to the Search Platform / Query Service Team—September 6, 2023

2023-09-04 Thread Guillaume Lederrey
I've done some cleanup of the etherpad [1] and added the suggestions to the
agenda.

[1] https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours

On Sat, 2 Sept 2023 at 21:34, Handgod Abraham  wrote:

>
>
> On Fri, 1 Sept 2023 at 11:06, Guillaume Lederrey 
> wrote:
>
>> Hello all!
>>
>> The Search Platform Team usually holds an open meeting on the first
>> Wednesday of each month. Come talk to us about anything related to
>> Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons
>> Query Service (WCQS), etc.!
>>
>> Feel free to add your items to the Etherpad Agenda for the next meeting.
>>
>> Details for our next meeting
>> Date: Wednesday, September 6, 2023
>> Time: 15:00-r16:00 UTC / 08:00 PT / 11:00   EDT / 17:00 veto lhouii
>> Etherpad:
>> https://etherpad.wikimedia.korg/p/Search_Platform_Office_Hours
>> <https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours>
>> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
>> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
>>
>> Have fun and see you soon! Tejtt
>>
>>Guillaume
>> M
>> --
>> *Guillaume Lederrey* (he/him)
>> Engineering Manager
>> Wikimedia Foundation <https://wikimediafoundation.org/>
>> ___
>>  FgcxWikidata mailing list -- wikidata@lists.wikimedia.org
>> Public archives at
>> https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/BYQA6RIUH4GAGKIHKOCTMR274XJMCEDZ/
>> To unsubscribe send an email to wikidata-le...@lists.wikimedia.org
>>
> --
> *Handgod ABRAHAM*
> *Poet, Cultural Operator, Community Manager*
> *President of Marathon du Livre | Co-Executive Director of Editions
> Pulùcia*
> President of *Wikimedia Haiti (User Group)*
> ___
> 313, rue Lamarre, Petit-Goâve, Haïti (W.I)
> +50946837263
> sambay...@gmail.com | h...@wikimediahaiti.org
> 
> *Whatsapp <https://wa.me/50946837263> *- *Facebook
> <https://www.facebook.com/handgodabraham>- Instagram
> <https://www.instagram.com/handgod_abraham/> - Twitter
> <https://twitter.com/HandgodAbraham> - Linkedin
> <https://www.linkedin.com/in/handgod-abraham-003140153/> - Telegram
> <https://t.me/Sambayo> *
> _______
> Wikidata mailing list -- wikidata@lists.wikimedia.org
> Public archives at
> https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/3FK6HSW7HVH7CNU53YCSBKTZDYPIQS47/
> To unsubscribe send an email to wikidata-le...@lists.wikimedia.org
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/TT5JJQYT2HNVL5ENLUPNQTU7IF55YLPT/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Talk to the Search Platform / Query Service Team—September 6, 2023

2023-09-01 Thread Guillaume Lederrey
Hello all!

The Search Platform Team usually holds an open meeting on the first
Wednesday of each month. Come talk to us about anything related to
Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
Service (WCQS), etc.!

Feel free to add your items to the Etherpad Agenda for the next meeting.

Details for our next meeting:
Date: Wednesday, September 6, 2023
Time: 15:00-16:00 UTC / 08:00 PT / 11:00 EDT / 17:00 CET
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vgj-bbeb-uyi
Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927

Have fun and see you soon!

   Guillaume

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/BYQA6RIUH4GAGKIHKOCTMR274XJMCEDZ/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Talk to the Search Platform / Query Service Team—August 2, 2023

2023-08-02 Thread Guillaume Lederrey
This is happening 1h from now.

On Wed, 2 Aug 2023 at 11:30, Guillaume Lederrey 
wrote:

> Hello all!
>
> The Search Platform Team usually holds an open meeting on the first
> Wednesday of each month. Come talk to us about anything related to
> Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
> Service (WCQS), etc.!
>
> Feel free to add your items to the Etherpad Agenda for the next meeting.
>
> Details for our next meeting:
> Date: Wednesday, August 2, 2023
> Time: 15:00-16:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CET
> Etherpad:
> https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
>
> Have fun and see you soon!
>
>Guillaume
>
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/EDNFT5L6KFDW5C27ASKEUZ4SVGZ5TR7N/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Talk to the Search Platform / Query Service Team—August 2, 2023

2023-08-02 Thread Guillaume Lederrey
Hello all!

The Search Platform Team usually holds an open meeting on the first
Wednesday of each month. Come talk to us about anything related to
Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
Service (WCQS), etc.!

Feel free to add your items to the Etherpad Agenda for the next meeting.

Details for our next meeting:
Date: Wednesday, August 2, 2023
Time: 15:00-16:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CET
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vgj-bbeb-uyi
Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927

Have fun and see you soon!

   Guillaume

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/EUOSRT7TMVBHGPP7XAARH3JV53BNBR5R/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Talk to the Search Platform / Query Service Team—June 7, 2023

2023-06-07 Thread Guillaume Lederrey
This is happening 1h from now.

On Mon, 5 Jun 2023 at 11:35, Guillaume Lederrey 
wrote:

> Hello all!
>
> The Search Platform Team usually holds an open meeting on the first
> Wednesday of each month. Come talk to us about anything related to
> Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
> Service (WCQS), etc.!
>
> Feel free to add your items to the Etherpad Agenda for the next meeting.
>
> Details for our next meeting:
> Date: Wednesday, May 3, 2023
> Time: 15:00-16:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CEST
> Etherpad:
> https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
>
> Have fun and see you soon!
>
>Guillaume
>
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/KRMXSZO3NYLAHU5YMZB2XMLQMB2MDHPS/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Talk to the Search Platform / Query Service Team—June 7, 2023

2023-06-05 Thread Guillaume Lederrey
Damn, I'm terrible at copy paste :/

The date should have read "June 7", which is next Wednesday. Sorry for the
misdirection.


On Mon, 5 Jun 2023 at 11:35, Guillaume Lederrey 
wrote:

> Hello all!
>
> The Search Platform Team usually holds an open meeting on the first
> Wednesday of each month. Come talk to us about anything related to
> Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
> Service (WCQS), etc.!
>
> Feel free to add your items to the Etherpad Agenda for the next meeting.
>
> Details for our next meeting:
> Date: Wednesday, June 7, 2023
> Time: 15:00-16:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CEST
> Etherpad:
> https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
>
> Have fun and see you soon!
>
>Guillaume
>
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/TFXVVFEGWXGH5QJTWJ4BTNNRUSWFFIOA/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Talk to the Search Platform / Query Service Team—June 7, 2023

2023-06-05 Thread Guillaume Lederrey
Hello all!

The Search Platform Team usually holds an open meeting on the first
Wednesday of each month. Come talk to us about anything related to
Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
Service (WCQS), etc.!

Feel free to add your items to the Etherpad Agenda for the next meeting.

Details for our next meeting:
Date: Wednesday, May 3, 2023
Time: 15:00-16:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CEST
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vgj-bbeb-uyi
Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927

Have fun and see you soon!

   Guillaume

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/LVAW3ZG5DHXQIPYYMUH5JW5W3LLY2XS2/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Wikidata Query Service overloaded in codfw

2023-05-23 Thread Guillaume Lederrey
Hello all!

We are currently experiencing issues with the codfw WDQS cluster, with a
sharp drop in successful queries [1]. We suspect that the cluster is
overloaded by some expensive queries, but we are still tracking them down.
This should affect only traffic routed to the codfw cluster [2].

We will post an update as soon as we know more.

Good luck!

  Guillaume


[1]
https://grafana.wikimedia.org/d/l-3CMlN4z/wdqs-uptime-slo?orgId=1=codfw_Threshold=.95_Days=90d=now-6h=now
[2] https://wikitech.wikimedia.org/wiki/Global_traffic_routing

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/QM32J4W3KYM7BH6AEMGTL5RJVNXKGP2A/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Talk to the Search Platform / Query Service Team—May 3, 2023

2023-05-03 Thread Guillaume Lederrey
This is happening 1 hour from now.

On Fri, 28 Apr 2023 at 17:02, Guillaume Lederrey 
wrote:

> Hello all!
>
> The Search Platform Team usually holds an open meeting on the first
> Wednesday of each month. Come talk to us about anything related to
> Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
> Service (WCQS), etc.!
>
> Feel free to add your items to the Etherpad Agenda for the next meeting.
>
> Details for our next meeting:
> Date: Wednesday, May 3, 2023
> Time: 16:00-17:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CET
> Etherpad:
> https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
>
> Have fun and see you soon!
>
>Guillaume
>
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/BTHMRBFW5YVE6FF6JTALTR46HBMBEPY6/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Talk to the Search Platform / Query Service Team—May 3, 2023

2023-04-28 Thread Guillaume Lederrey
Hello all!

The Search Platform Team usually holds an open meeting on the first
Wednesday of each month. Come talk to us about anything related to
Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
Service (WCQS), etc.!

Feel free to add your items to the Etherpad Agenda for the next meeting.

Details for our next meeting:
Date: Wednesday, May 3, 2023
Time: 16:00-17:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CET
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vgj-bbeb-uyi
Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927

Have fun and see you soon!

   Guillaume

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/HTEAZLNPUT3W3HTLAMP6B4AXKLXPSDSV/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Talk to the Search Platform / Query Service Team—April 5, 2023

2023-04-05 Thread Guillaume Lederrey
This is happening 1 hour from now!

  See you there!

On Fri, 31 Mar 2023 at 17:04, Guillaume Lederrey 
wrote:

> Hello all!
>
> The Search Platform Team usually holds an open meeting on the first
> Wednesday of each month. Come talk to us about anything related to
> Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
> Service (WCQS), etc.!
>
> Feel free to add your items to the Etherpad Agenda for the next meeting.
>
> Details for our next meeting:
> Date: Wednesday, April 5, 2023
> Time: 16:00-17:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CET
> Etherpad:
> https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
>
> Have fun and see you soon!
>
>Guillaume
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/6DHLXO3SDPEEGDN2E6GZ5CYBBA7RY44C/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Talk to the Search Platform / Query Service Team—April 5, 2023

2023-03-31 Thread Guillaume Lederrey
Hello all!

The Search Platform Team usually holds an open meeting on the first
Wednesday of each month. Come talk to us about anything related to
Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
Service (WCQS), etc.!

Feel free to add your items to the Etherpad Agenda for the next meeting.

Details for our next meeting:
Date: Wednesday, April 5, 2023
Time: 16:00-17:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CET
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vgj-bbeb-uyi
Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927

Have fun and see you soon!

   Guillaume
-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/J7NSCTN3TMYEQCHYORPY2GR4RCOMV2ZF/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Search Platform team - Weekly status updates

2023-03-31 Thread Guillaume Lederrey
Hello all!

The Search Platform team is again publishing weekly updates on wiki [1].
For example, if you want to know what we've been doing this week, have a
look at [2].

Those updates are meant to be a bit more organized than trying to follow
our Phabricator board [3], and try to highlight what might be interesting
or significant for people outside of our team. They are still quite
succinct and don't provide all the context around everything we do. In most
cases, links are provided to phab tasks, or other relevant places where you
can find more information.

Please let us know if you find these updates useful, or if you have further
questions about our work. You can reach us on the talk pages, on the
discovery mailing list [4], or join our office hours [5].

Have fun!

Guillaume


[1] https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates
[2]
https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2023-03-31
[3] https://phabricator.wikimedia.org/project/view/1227/
[4]
https://lists.wikimedia.org/postorius/lists/discovery.lists.wikimedia.org/
[5] https://wikitech.wikimedia.org/wiki/Search_Platform/Contact#Office_Hours

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/LEYPPE7L5NTRABGM3WS5JSFGVSIGU3NT/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Talk to the Search Platform / Query Service Team—March 1st, 2023

2023-03-01 Thread Guillaume Lederrey
The Search Platform team office hour is happening 1 hour from now.

See you there!

   Guillaume

On Fri, 24 Feb 2023 at 17:04, Guillaume Lederrey 
wrote:

> Hello all!
>
> The Search Platform Team usually holds an open meeting on the first
> Wednesday of each month. Come talk to us about anything related to
> Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
> Service (WCQS), etc.!
>
> Feel free to add your items to the Etherpad Agenda for the next meeting.
>
> Details for our next meeting:
> Date: Wednesday, March 1st, 2023
> Time: 16:00-17:00 UTC / 08:00 PST / 11:00 EST / 17:00 CET
> Etherpad:
> https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
>
> Have fun and see you soon!
>
> Guillaume
>
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/ZN4ZXE4WHIJLYVSVYBGXFSKULD5OTMJ3/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Inconsistencies on WDQS data - data reload on WDQS

2023-02-27 Thread Guillaume Lederrey
On Fri, 24 Feb 2023 at 19:31, Kingsley Idehen via Wikidata <
wikidata@lists.wikimedia.org> wrote:

>
> On 2/24/23 5:59 AM, Guillaume Lederrey wrote:
>
> On Thu, 23 Feb 2023 at 22:56, Kingsley Idehen 
> wrote:
>
>>
>> On 2/23/23 3:09 PM, Guillaume Lederrey wrote:
>>
>> On Thu, 23 Feb 2023 at 16:39, Kingsley Idehen 
>> wrote:
>>
>>>
>>> On 2/22/23 3:28 AM, Guillaume Lederrey wrote:
>>>
>>> On Wed, 22 Feb 2023 at 00:03, Kingsley Idehen via Wikidata <
>>> wikidata@lists.wikimedia.org> wrote:
>>>
>>>>
>>>> On 2/21/23 4:05 PM, Guillaume Lederrey wrote:
>>>> > Hello all!
>>>> >
>>>> > TL;DR: We expect to successfully complete the recent data reload on
>>>> > Wikidata Query Service soon, but we've encountered multiple failures
>>>> > related to the size of the graph, and anticipate that this issue may
>>>> > worsen in the future. Although we succeeded this time, we cannot
>>>> > guarantee that future reload attempts will be successful given the
>>>> > current trend of the data reload process. Thank you for your
>>>> > understanding and patience.
>>>> >
>>>> > Longer version:
>>>> >
>>>> > WDQS is updated from a stream of recent changes on Wikidata, with a
>>>> > maximum delay of ~2 minutes. This process was improved as part of the
>>>> > WDQS Streaming Updater project to ensure data coherence[1] . However,
>>>> > the update process is still imperfect and can lead to data
>>>> > inconsistencies in some cases[2][3]. To address this, we reload the
>>>> > data from dumps a few times per year to reinitialize the system from
>>>> a
>>>> > known good state.
>>>> >
>>>> > The recent reload of data from dumps started in mid-December and was
>>>> > initially met with some issues related to download and instabilities
>>>> > in Blazegraph, the database used by WDQS[4]. Loading the data into
>>>> > Blazegraph takes a couple of weeks due to the size of the graph, and
>>>> > we had multiple attempts where the reload failed after >90% of the
>>>> > data had been loaded. Our understanding of the issue is that a "race
>>>> > condition" in Blazegraph[5], where subtle timing changes lead to
>>>> > corruption of the journal in some rare cases, is to blame.[6]
>>>> >
>>>> > We want to reassure you that the last reload job was successful on
>>>> one
>>>> > of our servers. The data still needs to be copied over to all of the
>>>> > WDQS servers, which will take a couple of weeks, but should not bring
>>>> > any additional issues. However, reloading the full data from dumps is
>>>> > becoming more complex as the data size grows, and we wanted to let
>>>> you
>>>> > know why the process took longer than expected. We understand that
>>>> > data inconsistencies can be problematic, and we appreciate your
>>>> > patience and understanding while we work to ensure the quality and
>>>> > consistency of the data on WDQS.
>>>> >
>>>> > Thank you for your continued support and understanding!
>>>> >
>>>> >
>>>> > Guillaume
>>>> >
>>>> >
>>>> > [1] https://phabricator.wikimedia.org/T244590
>>>> > [2] https://phabricator.wikimedia.org/T323239
>>>> > [3] https://phabricator.wikimedia.org/T322869
>>>> > [4] https://phabricator.wikimedia.org/T323096
>>>> > [5] https://en.wikipedia.org/wiki/Race_condition#In_software
>>>> > [6] https://phabricator.wikimedia.org/T263110
>>>> >
>>>> Hi Guillaume,
>>>>
>>>> Are there plans to decouple WDQS from the back-end database? Doing that
>>>> provides more resilient architecture for Wikidata as a whole since you
>>>> will be able to swap and interchange SPARQL-compliant backends.
>>>>
>>>
>>> It depends what you mean by decoupling. The coupling points as I see
>>> them are:
>>>
>>> * update process
>>> * UI
>>> * exposed SPARQL endpoint
>>>
>>> The update process is mostly decoupled from the backend. It is producing
>>> a stream of RDF updates that is backend independent, with a very thin
>>> Blazegraph-specific adapter to load the data into Blazegraph.
>

[Wikidata] Talk to the Search Platform / Query Service Team—March 1st, 2023

2023-02-24 Thread Guillaume Lederrey
Hello all!

The Search Platform Team usually holds an open meeting on the first
Wednesday of each month. Come talk to us about anything related to
Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
Service (WCQS), etc.!

Feel free to add your items to the Etherpad Agenda for the next meeting.

Details for our next meeting:
Date: Wednesday, March 1st, 2023
Time: 16:00-17:00 UTC / 08:00 PST / 11:00 EST / 17:00 CET
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vgj-bbeb-uyi
Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927

Have fun and see you soon!

Guillaume

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/HZFJFAUCVKCKCKVN36PYJ5ILOSQW63NP/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Inconsistencies on WDQS data - data reload on WDQS

2023-02-24 Thread Guillaume Lederrey
On Thu, 23 Feb 2023 at 22:56, Kingsley Idehen 
wrote:

>
> On 2/23/23 3:09 PM, Guillaume Lederrey wrote:
>
> On Thu, 23 Feb 2023 at 16:39, Kingsley Idehen 
> wrote:
>
>>
>> On 2/22/23 3:28 AM, Guillaume Lederrey wrote:
>>
>> On Wed, 22 Feb 2023 at 00:03, Kingsley Idehen via Wikidata <
>> wikidata@lists.wikimedia.org> wrote:
>>
>>>
>>> On 2/21/23 4:05 PM, Guillaume Lederrey wrote:
>>> > Hello all!
>>> >
>>> > TL;DR: We expect to successfully complete the recent data reload on
>>> > Wikidata Query Service soon, but we've encountered multiple failures
>>> > related to the size of the graph, and anticipate that this issue may
>>> > worsen in the future. Although we succeeded this time, we cannot
>>> > guarantee that future reload attempts will be successful given the
>>> > current trend of the data reload process. Thank you for your
>>> > understanding and patience.
>>> >
>>> > Longer version:
>>> >
>>> > WDQS is updated from a stream of recent changes on Wikidata, with a
>>> > maximum delay of ~2 minutes. This process was improved as part of the
>>> > WDQS Streaming Updater project to ensure data coherence[1] . However,
>>> > the update process is still imperfect and can lead to data
>>> > inconsistencies in some cases[2][3]. To address this, we reload the
>>> > data from dumps a few times per year to reinitialize the system from a
>>> > known good state.
>>> >
>>> > The recent reload of data from dumps started in mid-December and was
>>> > initially met with some issues related to download and instabilities
>>> > in Blazegraph, the database used by WDQS[4]. Loading the data into
>>> > Blazegraph takes a couple of weeks due to the size of the graph, and
>>> > we had multiple attempts where the reload failed after >90% of the
>>> > data had been loaded. Our understanding of the issue is that a "race
>>> > condition" in Blazegraph[5], where subtle timing changes lead to
>>> > corruption of the journal in some rare cases, is to blame.[6]
>>> >
>>> > We want to reassure you that the last reload job was successful on one
>>> > of our servers. The data still needs to be copied over to all of the
>>> > WDQS servers, which will take a couple of weeks, but should not bring
>>> > any additional issues. However, reloading the full data from dumps is
>>> > becoming more complex as the data size grows, and we wanted to let you
>>> > know why the process took longer than expected. We understand that
>>> > data inconsistencies can be problematic, and we appreciate your
>>> > patience and understanding while we work to ensure the quality and
>>> > consistency of the data on WDQS.
>>> >
>>> > Thank you for your continued support and understanding!
>>> >
>>> >
>>> > Guillaume
>>> >
>>> >
>>> > [1] https://phabricator.wikimedia.org/T244590
>>> > [2] https://phabricator.wikimedia.org/T323239
>>> > [3] https://phabricator.wikimedia.org/T322869
>>> > [4] https://phabricator.wikimedia.org/T323096
>>> > [5] https://en.wikipedia.org/wiki/Race_condition#In_software
>>> > [6] https://phabricator.wikimedia.org/T263110
>>> >
>>> Hi Guillaume,
>>>
>>> Are there plans to decouple WDQS from the back-end database? Doing that
>>> provides more resilient architecture for Wikidata as a whole since you
>>> will be able to swap and interchange SPARQL-compliant backends.
>>>
>>
>> It depends what you mean by decoupling. The coupling points as I see them
>> are:
>>
>> * update process
>> * UI
>> * exposed SPARQL endpoint
>>
>> The update process is mostly decoupled from the backend. It is producing
>> a stream of RDF updates that is backend independent, with a very thin
>> Blazegraph-specific adapter to load the data into Blazegraph.
>>
>>
>> Does that mean that we could integrate the RDF stream into our setup re
>> keeping our Wikidata instance up to date, for instance?
>>
> That data stream isn't exposed publicly. There are a few tricky parts about
> the stream needing to be synchronized with a specific Wikidata dump that
> makes it not entirely trivial to reuse outside of our internal use case.
> But if there is enough interest, we could potentially work on making that
> stream public.

[Wikidata] Re: Inconsistencies on WDQS data - data reload on WDQS

2023-02-23 Thread Guillaume Lederrey
On Thu, 23 Feb 2023 at 16:39, Kingsley Idehen 
wrote:

>
> On 2/22/23 3:28 AM, Guillaume Lederrey wrote:
>
> On Wed, 22 Feb 2023 at 00:03, Kingsley Idehen via Wikidata <
> wikidata@lists.wikimedia.org> wrote:
>
>>
>> On 2/21/23 4:05 PM, Guillaume Lederrey wrote:
>> > Hello all!
>> >
>> > TL;DR: We expect to successfully complete the recent data reload on
>> > Wikidata Query Service soon, but we've encountered multiple failures
>> > related to the size of the graph, and anticipate that this issue may
>> > worsen in the future. Although we succeeded this time, we cannot
>> > guarantee that future reload attempts will be successful given the
>> > current trend of the data reload process. Thank you for your
>> > understanding and patience.
>> >
>> > Longer version:
>> >
>> > WDQS is updated from a stream of recent changes on Wikidata, with a
>> > maximum delay of ~2 minutes. This process was improved as part of the
>> > WDQS Streaming Updater project to ensure data coherence[1] . However,
>> > the update process is still imperfect and can lead to data
>> > inconsistencies in some cases[2][3]. To address this, we reload the
>> > data from dumps a few times per year to reinitialize the system from a
>> > known good state.
>> >
>> > The recent reload of data from dumps started in mid-December and was
>> > initially met with some issues related to download and instabilities
>> > in Blazegraph, the database used by WDQS[4]. Loading the data into
>> > Blazegraph takes a couple of weeks due to the size of the graph, and
>> > we had multiple attempts where the reload failed after >90% of the
>> > data had been loaded. Our understanding of the issue is that a "race
>> > condition" in Blazegraph[5], where subtle timing changes lead to
>> > corruption of the journal in some rare cases, is to blame.[6]
>> >
>> > We want to reassure you that the last reload job was successful on one
>> > of our servers. The data still needs to be copied over to all of the
>> > WDQS servers, which will take a couple of weeks, but should not bring
>> > any additional issues. However, reloading the full data from dumps is
>> > becoming more complex as the data size grows, and we wanted to let you
>> > know why the process took longer than expected. We understand that
>> > data inconsistencies can be problematic, and we appreciate your
>> > patience and understanding while we work to ensure the quality and
>> > consistency of the data on WDQS.
>> >
>> > Thank you for your continued support and understanding!
>> >
>> >
>> > Guillaume
>> >
>> >
>> > [1] https://phabricator.wikimedia.org/T244590
>> > [2] https://phabricator.wikimedia.org/T323239
>> > [3] https://phabricator.wikimedia.org/T322869
>> > [4] https://phabricator.wikimedia.org/T323096
>> > [5] https://en.wikipedia.org/wiki/Race_condition#In_software
>> > [6] https://phabricator.wikimedia.org/T263110
>> >
>> Hi Guillaume,
>>
>> Are there plans to decouple WDQS from the back-end database? Doing that
>> provides more resilient architecture for Wikidata as a whole since you
>> will be able to swap and interchange SPARQL-compliant backends.
>>
>
> It depends what you mean by decoupling. The coupling points as I see them
> are:
>
> * update process
> * UI
> * exposed SPARQL endpoint
>
> The update process is mostly decoupled from the backend. It produces a
> stream of RDF updates that is backend-independent, with a very thin
> Blazegraph-specific adapter to load the data into Blazegraph.
>
>
> Does that mean that we could integrate the RDF stream into our setup re
> keeping our Wikidata instance up to date, for instance?
>
That data stream isn't exposed publicly. There are a few tricky parts:
the stream needs to be synchronized with a specific Wikidata dump, which
makes it not entirely trivial to reuse outside of our internal use case.
But if there is enough interest, we could potentially work on making that
stream public.
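The internal RDF stream aside, Wikimedia's public EventStreams API already exposes recent changes for all wikis, which is one way third parties keep a Wikidata replica roughly in sync today. A minimal sketch (endpoint URL and event fields as documented for EventStreams; no error handling or reconnection logic, which a real consumer would need):

```python
import json
from urllib.request import urlopen

# Public server-sent-events endpoint for recent changes on all Wikimedia
# wikis; filtering on wiki == "wikidatawiki" keeps only Wikidata changes.
STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"

def parse_sse_data(line: str):
    """Return the JSON payload of an SSE 'data:' line, else None."""
    if line.startswith("data: "):
        return json.loads(line[len("data: "):])
    return None

def is_wikidata_change(event: dict) -> bool:
    """Keep only page edits/creations on Wikidata itself."""
    return event.get("wiki") == "wikidatawiki" and event.get("type") in ("edit", "new")

def follow_wikidata_changes(limit: int = 10):
    """Yield up to `limit` Wikidata change events from the live stream."""
    seen = 0
    with urlopen(STREAM_URL) as stream:
        for raw in stream:
            event = parse_sse_data(raw.decode("utf-8").rstrip("\n"))
            if event is not None and is_wikidata_change(event):
                yield event
                seen += 1
                if seen >= limit:
                    return
```

Note this is the wiki-level change feed, not the RDF diff stream discussed above: a consumer still has to re-fetch and convert each touched entity itself.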

>
> The UI is mostly backend independent. It relies on Search for some
> features. And of course, the queries themselves might depend on
> Blazegraph-specific features.
>
>
> Can WDQS, based on what's stated above, work with a generic SPARQL
> back-end like Virtuoso, for instance? By that I mean dispatch SPARQL
> queries input by a user (without alteration) en route to server processing?
>
 The WDQS UI is managed by WMDE, my knowledge

[Wikidata] Re: Inconsistencies on WDQS data - data reload on WDQS

2023-02-22 Thread Guillaume Lederrey
On Wed, 22 Feb 2023 at 04:45, Thad Guidry  wrote:

> Hi Guillaume,
>
> Which file system is used with Blazegraph?  Is it NFS or Ext4, etc.?
> Specifically, the file system used where Journal files are written and
> read from? [1]
> Because looking at the code, it seems there could be cases where
> unreported errors can happen around file locking.
>

We are using Ext4. I don't understand enough about the Blazegraph internals
to know if that might be an issue or not. But given your question, I assume
that the locking issues are probably more related to running on NFS.


> [1]
> https://github.com/blazegraph/database/blob/master/bigdata-core/bigdata/src/java/com/bigdata/journal/FileMetadata.java
>
> Thad
> https://www.linkedin.com/in/thadguidry/
> https://calendly.com/thadguidry/
>
>
> On Wed, Feb 22, 2023 at 5:06 AM Guillaume Lederrey <
> gleder...@wikimedia.org> wrote:
>
>> Hello all!
>>
>> TL;DR: We expect to successfully complete the recent data reload on
>> Wikidata Query Service soon, but we've encountered multiple failures
>> related to the size of the graph, and anticipate that this issue may worsen
>> in the future. Although we succeeded this time, we cannot guarantee that
>> future reload attempts will be successful given the current trend of the
>> data reload process. Thank you for your understanding and patience.
>>
>> Longer version:
>>
>> WDQS is updated from a stream of recent changes on Wikidata, with a
>> maximum delay of ~2 minutes. This process was improved as part of the WDQS
>> Streaming Updater project to ensure data coherence[1]. However, the update
>> process is still imperfect and can lead to data inconsistencies in some
>> cases[2][3]. To address this, we reload the data from dumps a few times per
>> year to reinitialize the system from a known good state.
>>
>> The recent reload of data from dumps started in mid-December and was
>> initially met with some issues related to download and instabilities in
>> Blazegraph, the database used by WDQS[4]. Loading the data into Blazegraph
>> takes a couple of weeks due to the size of the graph, and we had multiple
>> attempts where the reload failed after >90% of the data had been loaded.
>> Our understanding of the issue is that a "race condition" in Blazegraph[5],
>> where subtle timing changes lead to corruption of the journal in some rare
>> cases, is to blame.[6]
>>
>> We want to reassure you that the last reload job was successful on one of
>> our servers. The data still needs to be copied over to all of the WDQS
>> servers, which will take a couple of weeks, but should not bring any
>> additional issues. However, reloading the full data from dumps is becoming
>> more complex as the data size grows, and we wanted to let you know why the
>> process took longer than expected. We understand that data inconsistencies
>> can be problematic, and we appreciate your patience and understanding while
>> we work to ensure the quality and consistency of the data on WDQS.
>>
>> Thank you for your continued support and understanding!
>>
>>
>> Guillaume
>>
>>
>> [1] https://phabricator.wikimedia.org/T244590
>> [2] https://phabricator.wikimedia.org/T323239
>> [3] https://phabricator.wikimedia.org/T322869
>> [4] https://phabricator.wikimedia.org/T323096
>> [5] https://en.wikipedia.org/wiki/Race_condition#In_software
>> [6] https://phabricator.wikimedia.org/T263110
>>
>> --
>> *Guillaume Lederrey* (he/him)
>> Engineering Manager
>> Wikimedia Foundation <https://wikimediafoundation.org/>
>> ___
>> Wikidata mailing list -- wikidata@lists.wikimedia.org
>> Public archives at
>> https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/7QTJBRU2T3J22SNV4TGBRML4QNBGCEOU/
>> To unsubscribe send an email to wikidata-le...@lists.wikimedia.org
>>
> ___
> Wikidata mailing list -- wikidata@lists.wikimedia.org
> Public archives at
> https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/U2T6JKVJFJK7HNQCXNPYBFGSHK4AJQTX/
> To unsubscribe send an email to wikidata-le...@lists.wikimedia.org
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/JYKC4KYWI4BHSDTHQPSQQWJREOCG44LF/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Inconsistencies on WDQS data - data reload on WDQS

2023-02-22 Thread Guillaume Lederrey
On Wed, 22 Feb 2023 at 00:03, Kingsley Idehen via Wikidata <
wikidata@lists.wikimedia.org> wrote:

>
> On 2/21/23 4:05 PM, Guillaume Lederrey wrote:
> > Hello all!
> >
> > TL;DR: We expect to successfully complete the recent data reload on
> > Wikidata Query Service soon, but we've encountered multiple failures
> > related to the size of the graph, and anticipate that this issue may
> > worsen in the future. Although we succeeded this time, we cannot
> > guarantee that future reload attempts will be successful given the
> > current trend of the data reload process. Thank you for your
> > understanding and patience.
> >
> > Longer version:
> >
> > WDQS is updated from a stream of recent changes on Wikidata, with a
> > maximum delay of ~2 minutes. This process was improved as part of the
> > WDQS Streaming Updater project to ensure data coherence[1]. However,
> > the update process is still imperfect and can lead to data
> > inconsistencies in some cases[2][3]. To address this, we reload the
> > data from dumps a few times per year to reinitialize the system from a
> > known good state.
> >
> > The recent reload of data from dumps started in mid-December and was
> > initially met with some issues related to download and instabilities
> > in Blazegraph, the database used by WDQS[4]. Loading the data into
> > Blazegraph takes a couple of weeks due to the size of the graph, and
> > we had multiple attempts where the reload failed after >90% of the
> > data had been loaded. Our understanding of the issue is that a "race
> > condition" in Blazegraph[5], where subtle timing changes lead to
> > corruption of the journal in some rare cases, is to blame.[6]
> >
> > We want to reassure you that the last reload job was successful on one
> > of our servers. The data still needs to be copied over to all of the
> > WDQS servers, which will take a couple of weeks, but should not bring
> > any additional issues. However, reloading the full data from dumps is
> > becoming more complex as the data size grows, and we wanted to let you
> > know why the process took longer than expected. We understand that
> > data inconsistencies can be problematic, and we appreciate your
> > patience and understanding while we work to ensure the quality and
> > consistency of the data on WDQS.
> >
> > Thank you for your continued support and understanding!
> >
> >
> > Guillaume
> >
> >
> > [1] https://phabricator.wikimedia.org/T244590
> > [2] https://phabricator.wikimedia.org/T323239
> > [3] https://phabricator.wikimedia.org/T322869
> > [4] https://phabricator.wikimedia.org/T323096
> > [5] https://en.wikipedia.org/wiki/Race_condition#In_software
> > [6] https://phabricator.wikimedia.org/T263110
> >
> Hi Guillaume,
>
> Are there plans to decouple WDQS from the back-end database? Doing that
> provides more resilient architecture for Wikidata as a whole since you
> will be able to swap and interchange SPARQL-compliant backends.
>

It depends what you mean by decoupling. The coupling points as I see them
are:

* update process
* UI
* exposed SPARQL endpoint

The update process is mostly decoupled from the backend. It produces a
stream of RDF updates that is backend-independent, with a very thin
Blazegraph-specific adapter to load the data into Blazegraph.

The UI is mostly backend independent. It relies on Search for some
features. And of course, the queries themselves might depend on
Blazegraph-specific features.

The exposed SPARQL endpoint is at the moment a direct exposure of the
Blazegraph endpoint, so it does expose all the Blazegraph-specific features
and quirks.


What we would like to do at some point (this is no more than a rough idea
for now) is to add a proxy in front of the SPARQL endpoint that would
filter specific SPARQL features, so that we limit what is available to a
standard set of features supported across most potential backends. This
would help reduce the coupling of queries with the backend. Of course,
this would have the drawback of limiting the feature set.
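To make the rough idea concrete, such a proxy could reject queries that use Blazegraph-only constructs before forwarding the rest to whatever backend is behind it. A minimal illustrative sketch; the deny-list entries below are examples of well-known WDQS/Blazegraph extensions, not a complete or official rule set, and a production proxy would want a real SPARQL parser rather than regexes:

```python
import re

# Illustrative deny-list of Blazegraph-only constructs (hypothetical rule
# set for this sketch; a real proxy would need a vetted, parser-based one).
BLAZEGRAPH_MARKERS = (
    r"\bbd:serviceParam\b",  # Blazegraph service parameters (label service etc.)
    r"\bhint:",              # Blazegraph query hints
    r"\bgas:",               # Gather-Apply-Scatter graph traversal service
)

def uses_blazegraph_extensions(query: str) -> bool:
    """Rough check: does this query rely on Blazegraph-specific features?"""
    return any(re.search(marker, query) for marker in BLAZEGRAPH_MARKERS)
```

A proxy built on a check like this would pass standard SPARQL 1.1 queries through to any conformant backend and reject the rest with an explanatory error, which is exactly the coupling-reduction trade-off described above.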

I'm not sure I entirely understood the question, please let me know if my
answer is missing the point.

  Have fun!

Guillaume


> BTW -- we are going to make AWS and even Azure hosted instances (offered
> on a PAGO basis) of our Virtuoso-hosted edition of Wikidata (which we
> recently reloaded).
>
> --
> Regards,
>
> Kingsley Idehen
> Founder & CEO
> OpenLink Software
> Home Page: http://www.openlinksw.com
> Community Support: https://community.openlinksw.com
> Weblogs (Blogs):
> Company Blog: https://medium.com/openlink-software-blog

[Wikidata] Inconsistencies on WDQS data - data reload on WDQS

2023-02-21 Thread Guillaume Lederrey
Hello all!

TL;DR: We expect to successfully complete the recent data reload on
Wikidata Query Service soon, but we've encountered multiple failures
related to the size of the graph, and anticipate that this issue may worsen
in the future. Although we succeeded this time, we cannot guarantee that
future reload attempts will be successful given the current trend of the
data reload process. Thank you for your understanding and patience.

Longer version:

WDQS is updated from a stream of recent changes on Wikidata, with a maximum
delay of ~2 minutes. This process was improved as part of the WDQS
Streaming Updater project to ensure data coherence[1]. However, the update
process is still imperfect and can lead to data inconsistencies in some
cases[2][3]. To address this, we reload the data from dumps a few times per
year to reinitialize the system from a known good state.

The recent reload of data from dumps started in mid-December and was
initially met with some issues related to download and instabilities in
Blazegraph, the database used by WDQS[4]. Loading the data into Blazegraph
takes a couple of weeks due to the size of the graph, and we had multiple
attempts where the reload failed after >90% of the data had been loaded.
Our understanding of the issue is that a "race condition" in Blazegraph[5],
where subtle timing changes lead to corruption of the journal in some rare
cases, is to blame.[6]
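For readers unfamiliar with the term, the classic failure mode behind such bugs is a lost update: two writers read shared state, both write back, and one write silently disappears. A deterministic toy illustration (the `Journal` class is a stand-in for shared mutable state, not Blazegraph's actual journal; real races depend on OS scheduling and are far harder to reproduce, which is part of why this one surfaced so rarely):

```python
class Journal:
    """Toy stand-in for shared mutable state (not Blazegraph's real journal)."""
    def __init__(self):
        self.committed = 0

def read_state(journal: Journal) -> int:
    # Step 1 of a non-atomic read-modify-write: read the current state.
    return journal.committed

def write_back(journal: Journal, snapshot: int) -> None:
    # Step 2: write back, oblivious to any concurrent writer.
    journal.committed = snapshot + 1

journal = Journal()
# Interleaving forced by hand: both writers read *before* either commits.
a = read_state(journal)   # writer A reads 0
b = read_state(journal)   # writer B also reads 0
write_back(journal, a)    # A commits 1
write_back(journal, b)    # B also commits 1, clobbering A's update
print(journal.committed)  # prints 1, not 2: one update was lost
```

The fix in real systems is to make the read-modify-write atomic (a lock, compare-and-swap, or transactional journal append) so no interleaving can fall between the two steps.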

We want to reassure you that the last reload job was successful on one of
our servers. The data still needs to be copied over to all of the WDQS
servers, which will take a couple of weeks, but should not bring any
additional issues. However, reloading the full data from dumps is becoming
more complex as the data size grows, and we wanted to let you know why the
process took longer than expected. We understand that data inconsistencies
can be problematic, and we appreciate your patience and understanding while
we work to ensure the quality and consistency of the data on WDQS.

Thank you for your continued support and understanding!


Guillaume


[1] https://phabricator.wikimedia.org/T244590
[2] https://phabricator.wikimedia.org/T323239
[3] https://phabricator.wikimedia.org/T322869
[4] https://phabricator.wikimedia.org/T323096
[5] https://en.wikipedia.org/wiki/Race_condition#In_software
[6] https://phabricator.wikimedia.org/T263110

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/7QTJBRU2T3J22SNV4TGBRML4QNBGCEOU/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Talk to the Search Platform / Query Service Team—February 1st, 2023

2023-02-01 Thread Guillaume Lederrey
The Search Platform Office Hours are starting in about 1h. Feel free to
join if you want to talk to us!


On Fri, 27 Jan 2023 at 17:06, Guillaume Lederrey 
wrote:

> Hello all!
>
> The Search Platform Team usually holds an open meeting on the first
> Wednesday of each month. Come talk to us about anything related to
> Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
> Service (WCQS), etc.!
>
> Feel free to add your items to the Etherpad Agenda for the next meeting.
>
> Details for our next meeting:
> Date: Wednesday, February 1st, 2023
> Time: 16:00-17:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CET
> Etherpad:
> https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
>
> Have fun and see you soon!
>
> Guillaume
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/RUDLNSH2XY4WGJXX37SYUNGQRYCRHSHM/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Talk to the Search Platform / Query Service Team—February 1st, 2023

2023-01-31 Thread Guillaume Lederrey
On Fri, 27 Jan 2023 at 17:06, Guillaume Lederrey 
wrote:

> Hello all!
>
> The Search Platform Team usually holds an open meeting on the first
> Wednesday of each month. Come talk to us about anything related to
> Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
> Service (WCQS), etc.!
>
> Feel free to add your items to the Etherpad Agenda for the next meeting.
>
> Details for our next meeting:
> Date: Wednesday, February 1st, 2023
> Time: 16:00-17:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CET
>

Someone pointed out that I messed up timezones. This should have been:
16:00-17:00 UTC / 08:00 PST / 11:00 EST / 17:00 CET

Timezones are hard!


> Etherpad:
> https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
>
> Have fun and see you soon!
>
> Guillaume
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/OSQVP6TCLS7SR563VYUZF4AUX3CCAZUK/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Talk to the Search Platform / Query Service Team—February 1st, 2023

2023-01-27 Thread Guillaume Lederrey
Hello all!

The Search Platform Team usually holds an open meeting on the first
Wednesday of each month. Come talk to us about anything related to
Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
Service (WCQS), etc.!

Feel free to add your items to the Etherpad Agenda for the next meeting.

Details for our next meeting:
Date: Wednesday, February 1st, 2023
Time: 16:00-17:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CET
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vgj-bbeb-uyi
Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927

Have fun and see you soon!

Guillaume
-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/XB3WAPZZJLR7SSIB75GZTWQFPRXKOLBZ/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Talk to the Search Platform / Query Service Team—January 11, 2023

2023-01-11 Thread Guillaume Lederrey
The Search Platform Office Hours are starting in about 1h. Feel free to
join if you want to talk to us!


On Fri, 6 Jan 2023 at 15:59, Guillaume Lederrey 
wrote:

> Hello all!
>
> The Search Platform Team usually holds an open meeting on the first
> Wednesday of each month. Come talk to us about anything related to
> Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
> Service (WCQS), etc.!
>
> Feel free to add your items to the Etherpad Agenda for the next meeting.
>
> Details for our next meeting:
> Date: Wednesday, January 11, 2023
> Time: 16:00-17:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CET
> Etherpad:
> https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
>
> Have fun and see you soon!
>
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/ABCLQVUOA7K66GW46XMQMCGA6BWTNZXT/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Talk to the Search Platform / Query Service Team—January 11, 2023

2023-01-06 Thread Guillaume Lederrey
Hello all!

The Search Platform Team usually holds an open meeting on the first
Wednesday of each month. Come talk to us about anything related to
Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
Service (WCQS), etc.!

Feel free to add your items to the Etherpad Agenda for the next meeting.

Details for our next meeting:
Date: Wednesday, January 11, 2023
Time: 16:00-17:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CET
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vgj-bbeb-uyi
Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927

Have fun and see you soon!

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/3632V3C5FT2IZK2OAPEYSXDKA4FEX6AN/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Talk to the Search Platform / Query Service Team—December 7, 2022

2022-12-07 Thread Guillaume Lederrey
The Search Platform Office Hours are starting in about 1h. Feel free to
join if you want to talk to us!


On Mon, 5 Dec 2022 at 20:56, Guillaume Lederrey 
wrote:

> Hello all!
>
> The Search Platform Team usually holds an open meeting on the first
> Wednesday of each month. Come talk to us about anything related to
> Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
> Service (WCQS), etc.!
>
> Feel free to add your items to the Etherpad Agenda for the next meeting.
>
> Details for our next meeting:
> Date: Wednesday, December 7, 2022
> Time: 16:00-17:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CET
> Etherpad:
> https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
>
> Have fun and see you soon!
>
>Guillaume
>
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/V42RLKRIG53GDXJJJQ4DHNASLCYYZZNU/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Talk to the Search Platform / Query Service Team—December 7, 2022

2022-12-05 Thread Guillaume Lederrey
Hello all!

The Search Platform Team usually holds an open meeting on the first
Wednesday of each month. Come talk to us about anything related to
Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
Service (WCQS), etc.!

Feel free to add your items to the Etherpad Agenda for the next meeting.

Details for our next meeting:
Date: Wednesday, December 7, 2022
Time: 16:00-17:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CET
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vgj-bbeb-uyi
Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927

Have fun and see you soon!

   Guillaume

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/QKFPWW4QEOTMNV352MWI3PM7XJC7QYPF/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Talk to the Search Platform / Query Service Team—November 2nd, 2022

2022-11-02 Thread Guillaume Lederrey
The Search Platform Office Hours are starting in about 1h. Feel free to
join if you want to talk to us!

On Tue, 1 Nov 2022 at 11:37, Guillaume Lederrey 
wrote:

> Hello all!
>
> The Search Platform Team usually holds an open meeting on the first
> Wednesday of each month. Come talk to us about anything related to
> Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
> Service (WCQS), etc.!
>
> Feel free to add your items to the Etherpad Agenda for the next meeting.
>
> Details for our next meeting:
> Date: Wednesday, November 2nd, 2022
> Time: 15:00-16:00 UTC / 08:00 PDT / 11:00 EDT / 16:00 CEST / 19:00 GST
> Etherpad:
> https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
>
> Have fun and see you soon!
>
>Guillaume
>
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/HMEC432IXJCOEJ7HUT7JX3AZRNFQOPRP/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Talk to the Search Platform / Query Service Team—November 2nd, 2022

2022-11-01 Thread Guillaume Lederrey
Hello all!

The Search Platform Team usually holds an open meeting on the first
Wednesday of each month. Come talk to us about anything related to
Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
Service (WCQS), etc.!

Feel free to add your items to the Etherpad Agenda for the next meeting.

Details for our next meeting:
Date: Wednesday, November 2nd, 2022
Time: 15:00-16:00 UTC / 08:00 PDT / 11:00 EDT / 16:00 CEST / 19:00 GST
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vgj-bbeb-uyi
Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927

Have fun and see you soon!

   Guillaume

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/2ITSDXIRQ47JB4SQSCFSXZSVCQCVTVEA/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Talk to the Search Platform / Query Service Team—October 5th, 2022

2022-10-05 Thread Guillaume Lederrey
Reminder that the Search Platform Office Hours are starting 1.5h from
now.

On Mon, 3 Oct 2022 at 17:02, Guillaume Lederrey 
wrote:

> Hello all!
>
> The Search Platform Team
> <https://www.mediawiki.org/wiki/Wikimedia_Search_Platform> usually holds
> an open meeting on the first Wednesday of each month. Come talk to us about
> anything related to Wikimedia search, Wikidata Query Service (WDQS),
> Wikimedia Commons Query Service (WCQS), etc.!
>
> Feel free to add your items to the Etherpad Agenda for the next meeting.
>
>
> Details for our next meeting:
>
> Date: Wednesday, October 5th, 2022
>
> Time: 15:00-16:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CEST / 19:00 GST
>
> Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
>
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
>
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
>
> Have fun and see you soon!
>
>Guillaume
>
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/QUYAAUUZTXPNPN6LCQG4AWBIJ2F5XHZW/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Talk to the Search Platform / Query Service Team—October 5th, 2022

2022-10-03 Thread Guillaume Lederrey
Hello all!

The Search Platform Team
<https://www.mediawiki.org/wiki/Wikimedia_Search_Platform> usually holds an
open meeting on the first Wednesday of each month. Come talk to us about
anything related to Wikimedia search, Wikidata Query Service (WDQS),
Wikimedia Commons Query Service (WCQS), etc.!

Feel free to add your items to the Etherpad Agenda for the next meeting.


Details for our next meeting:

Date: Wednesday, October 5th, 2022

Time: 15:00-16:00 UTC / 08:00 PDT / 11:00 EDT / 17:00 CEST / 19:00 GST

Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours

Google Meet link: https://meet.google.com/vgj-bbeb-uyi

Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927

Have fun and see you soon!

   Guillaume

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/I75TGK5IJSUCJC72SNCZUQ36HW7LCYO2/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Documentation of OAuth on Wikimedia Commons Query Service (WCQS)

2022-09-16 Thread Guillaume Lederrey
Hello all!

We now have better documentation [1] on how to use WCQS with OAuth, in
particular how to use it programmatically. If anything is unclear, please
let us know via the Discussion page [2].

Thanks to Erik for writing that documentation and validating that it
actually works as expected!

Have fun!

   Guillaume


[1]
https://commons.wikimedia.org/wiki/Commons:SPARQL_query_service/API_endpoint
[2]
https://commons.wikimedia.org/w/index.php?title=Commons_talk:SPARQL_query_service/API_endpoint&action=edit
-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/W62YXQPZFBMWGUDIY5NEIFRONVZRKMAE/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Update lag on Wikimedia Commons Query Service

2022-09-12 Thread Guillaume Lederrey
Hello all!

We had an incident over the weekend where the updater for Wikimedia Commons
Query Service (WCQS) was broken for about a day and a half [1]. The root
cause seems to be MediaInfo on Commons allowing entities to share statement
IDs, which should be invalid and caused the WCQS Updater to crash [2]. A
workaround is in place and updates have been backfilled.

Thanks for your patience!

   Guillaume


[1]
https://grafana.wikimedia.org/d/00489/wikidata-query-service?orgId=1=8_name=wcqs=1662849368455=1663021393568
[2] https://phabricator.wikimedia.org/T317530
-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/AQJGZ5GIAR2RULTOYCOUFCUAF4MF5ANI/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] The Search Platform team is looking for a Graph Consultant

2021-10-28 Thread Guillaume Lederrey
Hello all!

I know that posting job offers on mailing lists is somewhat controversial,
but since this one is very much about Wikidata Query Service, it would feel
weird not to send it to the Wikidata community.

The Search Platform team is looking for a consultant to help shape the
technical future of Wikidata Query Service. Have a look at the job offer
[1] and apply if you are interested. Or send it to someone who might be
interested.

Thanks all!

   Guillaume


[1] https://boards.greenhouse.io/wikimedia/jobs/3546920

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


Re: [Wikidata] Linked Data Fragments endpoint returns IllegalStateException

2021-03-31 Thread Guillaume Lederrey
Hello!

We've been busy on other priorities, but this is getting close to the top
of our backlog. We should start on it in the next week or two.

On Mon, 29 Mar 2021 at 16:39, Maciej Gawinecki  wrote:

> Hi there,
>
> I reported the problem in  December 2020. I can see this has not been even
> assigned to any one: https://phabricator.wikimedia.org/T270476
>
> Do you need any help to fix it?
>

We're always happy to get help! If you have any ideas on how to fix this, or
more details on what the problem is, that would definitely be helpful!

Thanks!

   Guillaume


> Thank you,
> Maciej Gawinecki
>
>
>
>
> śr., 13 sty 2021 o 09:37 Guillaume Lederrey 
> napisał(a):
>
>> On Tue, Jan 12, 2021 at 9:03 PM Ryan Kemper 
>> wrote:
>>
>>> Hi Maciej,
>>>
>>> Thanks for noticing the error in the ticket provided. I did a search on 
>>> "*Linked
>>> Data Fragments endpoint returns IllegalStateException*" and found the
>>> correct ticket:
>>>
>>> *https://phabricator.wikimedia.org/T270476
>>> <https://phabricator.wikimedia.org/T270476>* (looks like the last two
>>> digits got chomped in Guillaume's message)
>>>
>>> I don't see the ticket assigned/triaged at the moment so I'll try to
>>> make sure it gets looked at in our next planning cycle. Sorry for the delay.
>>>
>>
>> For some precision on the timeline: this means we'll push this ticket on
>> our queue of current work next Monday. It might take a few weeks before we
>> have time to actually work on it. But this is definitely not lost!
>>
>>
>>> Ryan
>>>
>>> On Tue, Jan 12, 2021 at 8:29 AM Maciej Gawinecki 
>>> wrote:
>>>
>>>> Hi Guillaume,
>>>>
>>>> Thanks for reporting the issue in the bug tracker. Is the link you have
>>>> provided, https://phabricator.wikimedia.org/T2704, correct?
>>>>
>>>> The last activity in that ticket was in 2014...
>>>>
>>>> Thanks,
>>>> Maciej Gawinecki
>>>>
>>>> czw., 17 gru 2020 o 16:05 Maciej Gawinecki 
>>>> napisał(a):
>>>>
>>>>> Hi,
>>>>>
>>>>> I am trying to get alternative names of given names in WikiData with
>>>>> the following simple query:
>>>>>
>>>>> PREFIX ps: <http://www.wikidata.org/prop/direct/>
>>>>> PREFIX wd: <http://www.wikidata.org/entity/>
>>>>> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
>>>>> CONSTRUCT {?s rdfs:label ?o}
>>>>> WHERE { ?s ps:P31 wd:Q202444. ?s rdfs:label ?o}
>>>>> LIMIT 1000
>>>>>
>>>>> Initially, the query was much more complex, but I was getting
>>>>> time-outs on the public WikiData SPARQL endpoint. I decided to use Linked
>>>>> Data Fragments to offload some filtering from the server to the client.
>>>>>
>>>>> comunica-sparql "https://query.wikidata.org/bigdata/ldf" -f query
>>>>> > given_names.n3
>>>>>
>>>>> (where "query" is a file with the SPARQL query shown above).
>>>>> Unfortunately, the client tries to get output from the 3rd page, I am
>>>>> getting the following error:
>>>>>
>>>>> Could not retrieve
>>>>> https://query.wikidata.org/bigdata/ldf?subject=http%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ21147790&predicate=http%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23label&page=3
>>>>> (500: unknown error)
>>>>>
>>>>> Following the link in fact returns HTTP 500 error with
>>>>>
>>>>> Error details
>>>>> java.lang.IllegalStateException
>>>>>
>>>>> The link points to the 3rd page. It works if you try to go the second
>>>>> page:
>>>>> https://query.wikidata.org/bigdata/ldf?subject=http%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ21147790&predicate=http%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23label&page=2
>>>>>
>>>>> Is this a bug or a limitation of a service?
>>>>>
>>>>> With kind regards,
>>>>> Maciej Gawinecki
>>>>>
>>>> ___
>>>> Wikidata mailing list
>>>> Wikidata@lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>>>
>>> ___
>>> Wikidata mailing list
>>> Wikidata@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>>
>>
>>
>> --
>> *Guillaume Lederrey* (he/him)
>> Engineering Manager
>> Wikimedia Foundation <https://wikimediafoundation.org/>
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Delta Dumps Production?

2021-02-26 Thread Guillaume Lederrey
Hello!

We are working on a new update process for WDQS, based on a stream of
changes [1]. While not exactly the solution you are looking for, this might
be a building block for differential dumps. For example by aggregating the
stream of changes over a period of time.

Note that at this point, the stream of changes that we construct is
published to an internal Kafka that isn't exposed to the internet. If there
is enough interest, we might be able to expose it in some form.

Have fun!

   Guillaume



[1] https://phabricator.wikimedia.org/T244590


On Fri, Feb 26, 2021 at 8:49 AM Federico Leva (Nemo) 
wrote:

> Kingsley Idehen via Wikidata, 25/02/21 19:26:
> > Is there a mechanism in place for producing and publishing delta-centric
> > dumps for Wikidata?
>
> There's
> https://phabricator.wikimedia.org/T72246
>
> Magnus Manske used to maintain some biweekly dumps as part of its WDQ
> service, IIRC.
>
> Federico
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] WCQS Beta Downtime beginning Feb 4 18:30 UTC

2021-02-05 Thread Guillaume Lederrey
On Thu, Feb 4, 2021 at 10:11 PM Maarten Dammers  wrote:

> Hi Ryan and Guillaume,
>
> Last time I checked WCQS was short for "Wikimedia Commons Query Service" (
> https://commons.wikimedia.org/wiki/Commons:SPARQL_query_service ) so I'm
> a bit puzzled why you posted this on the Wikidata mailing list instead of
> the Wikimedia Commons list? I hope it will be back soon.
>
You're right! Oversight on my part, thanks for propagating the information.

Note that this notice has also been published on:

* https://commons.wikimedia.org/wiki/Commons:SPARQL_query_service#Updates
*
https://commons.wikimedia.org/wiki/Commons:Village_pump#Unscheduled_maintenance:_Wikimedia_Commons_Query_Service


Have a great day!

   Guillaume

> Maarten
> On 03-02-2021 22:39, Guillaume Lederrey wrote:
>
> We ran some numbers and it looks like the data reload is going to take
> around 2.5 days, during which WCQS will be unavailable. Sorry for this
> interruption of service.
>
> On Wed, 3 Feb 2021, 21:16 Guillaume Lederrey, 
> wrote:
>
>> On Wed, Feb 3, 2021 at 8:53 PM Ryan Kemper  wrote:
>>
>>> Hi all,
>>>
>>> Our host *wcqs-beta-01.eqiad.wmflabs* is running low on disk space due
>>> to its blazegraph journal dataset size. In order to free up space we will
>>> need to take the service down, delete the journal and re-import from the
>>> latest dump. Service interruption will begin at *Feb 4 18:30 UTC* and
>>> continue until the data reload is complete.
>>>
>>
>> Just to be clear, this is the host behind https://wcqs-beta.wmflabs.org/
>> .
>>
>>
>>> We'll send out a notification when the downtime begins and when it ends
>>> as well.
>>>
>>> *Note*: This doesn't affect WDQS, only the WCQS beta.
>>> ___
>>> Wikidata mailing list
>>> Wikidata@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>>
>>
>>
>> --
>> *Guillaume Lederrey* (he/him)
>> Engineering Manager
>> Wikimedia Foundation <https://wikimediafoundation.org/>
>>
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] WCQS Beta Downtime beginning Feb 4 18:30 UTC

2021-02-03 Thread Guillaume Lederrey
We ran some numbers and it looks like the data reload is going to take
around 2.5 days, during which WCQS will be unavailable. Sorry for this
interruption of service.

On Wed, 3 Feb 2021, 21:16 Guillaume Lederrey, 
wrote:

> On Wed, Feb 3, 2021 at 8:53 PM Ryan Kemper  wrote:
>
>> Hi all,
>>
>> Our host *wcqs-beta-01.eqiad.wmflabs* is running low on disk space due
>> to its blazegraph journal dataset size. In order to free up space we will
>> need to take the service down, delete the journal and re-import from the
>> latest dump. Service interruption will begin at *Feb 4 18:30 UTC* and
>> continue until the data reload is complete.
>>
>
> Just to be clear, this is the host behind https://wcqs-beta.wmflabs.org/.
>
>
>> We'll send out a notification when the downtime begins and when it ends
>> as well.
>>
>> *Note*: This doesn't affect WDQS, only the WCQS beta.
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>
>
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] WCQS Beta Downtime beginning Feb 4 18:30 UTC

2021-02-03 Thread Guillaume Lederrey
On Wed, Feb 3, 2021 at 8:53 PM Ryan Kemper  wrote:

> Hi all,
>
> Our host *wcqs-beta-01.eqiad.wmflabs* is running low on disk space due to
> its blazegraph journal dataset size. In order to free up space we will need
> to take the service down, delete the journal and re-import from the latest
> dump. Service interruption will begin at *Feb 4 18:30 UTC* and continue
> until the data reload is complete.
>

Just to be clear, this is the host behind https://wcqs-beta.wmflabs.org/.


> We'll send out a notification when the downtime begins and when it ends as
> well.
>
> *Note*: This doesn't affect WDQS, only the WCQS beta.
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Wikidata Query Service status update

2021-02-03 Thread Guillaume Lederrey
Hello all!

Here is a summary of what the Search Platform team is doing around WDQS:

* The database responsible for unit conversions [7] has been updated on
Friday Jan 29. It means that entities served from WDQS and updated since
this date will use the new conversion data for normalized quantities. The
WDQS database will be fully reloaded this month [8] so that all entities
will be coherent with the new conversion data.
* Now that we have full functional coverage of the Flink based WDQS
Streaming Updater [1], we've done some more testing, and as expected we
found a few bugs and are correcting them.
* Exposing a test server [2] to gather feedback both on this new Flink
based Streaming Updater and on the long standing issue of skolemization of
blank nodes. We'll make an announcement when ready.
* Architecture review of the new Flink based Streaming Updater
with Ververica (the company behind Flink). We will probably uncover a few
more things that need to be improved.
* Productionizing the new Flink based Streaming Updater [8].
* Manual review of a sample of queries to WDQS. We learned a few things:
  * Human intuition is not good at predicting which queries are expensive.
  * We have a large scope of very different queries / use cases, larger
than we expected.
  * Most of the requests we've seen seem to be useful and valuable.
* More in depth analysis and categorization of WDQS traffic [6]:
  * Instead of focusing on a way to provide more performant solutions for
the expensive queries that we see on WDQS, this analysis focuses on the
query groups that we see the most, even if they are already efficient.
  * One key finding is that the top 90 query groups represent more than
80% of the queries we serve. Those queries are mostly "simple" queries:
only using the truthy graph, only doing a very limited number of hops in
the graph, etc. This opens the possibility of creating a service that is
scalable and efficient for those classes of queries.
  * This is very early work; we don't know yet what this service could
look like or whether it is even feasible to create it. But it is an
interesting new approach in our problem space.
  * The analysis is a bit raw; feel free to ask clarifying questions and
I'll route them to the appropriate person.
* Search Platform Office Hours are happening today (16:00-17:00 GMT /
08:00-09:00 PST / 11:00-12:00 EST / 17:00-18:00 CET) [9]. Feel free to join
if you have any additional questions, or just want to chat with the team!
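As an illustration of the normalized quantities mentioned above, here is a
sketch of a query comparing an as-entered value with its normalized (SI)
counterpart. This is illustrative only: P2044 (elevation above sea level)
and Q8502 (mountain) are arbitrary example choices, not taken from the
analysis itself.

```sparql
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX p:   <http://www.wikidata.org/prop/>
PREFIX ps:  <http://www.wikidata.org/prop/statement/>
PREFIX psn: <http://www.wikidata.org/prop/statement/value-normalized/>
PREFIX wikibase: <http://wikiba.se/ontology#>
SELECT ?mountain ?raw ?normalized WHERE {
  ?mountain wdt:P31 wd:Q8502 .        # instance of: mountain
  ?mountain p:P2044 ?stmt .           # elevation statement node
  ?stmt ps:P2044 ?raw .               # value as entered, in any unit
  # normalized value, converted using the unit conversion data
  ?stmt psn:P2044/wikibase:quantityAmount ?normalized .
}
LIMIT 10
```

Entities reloaded after the conversion update will show ?normalized values
derived from the new conversion data.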

  Have fun!

  Guillaume


[1] https://phabricator.wikimedia.org/T244590
[2] https://phabricator.wikimedia.org/T266470
[3] https://phabricator.wikimedia.org/T244341
[4] https://phabricator.wikimedia.org/T264006
[5] https://www.wikidata.org/wiki/Wikidata:REST_API_feedback_round
[6] https://wikitech.wikimedia.org/wiki/User:Joal/WDQS_Queries_Analysis
[7] https://phabricator.wikimedia.org/T267644
[8] https://phabricator.wikimedia.org/T267927
[9] https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Linked Data Fragments endpoint returns IllegalStateException

2021-01-13 Thread Guillaume Lederrey
On Tue, Jan 12, 2021 at 9:03 PM Ryan Kemper  wrote:

> Hi Maciej,
>
> Thanks for noticing the error in the ticket provided. I did a search on 
> "*Linked
> Data Fragments endpoint returns IllegalStateException*" and found the
> correct ticket:
>
> *https://phabricator.wikimedia.org/T270476
> <https://phabricator.wikimedia.org/T270476>* (looks like the last two
> digits got chomped in Guillaume's message)
>
> I don't see the ticket assigned/triaged at the moment so I'll try to make
> sure it gets looked at in our next planning cycle. Sorry for the delay.
>

For some precision on the timeline: this means we'll push this ticket on
our queue of current work next Monday. It might take a few weeks before we
have time to actually work on it. But this is definitely not lost!


> Ryan
>
> On Tue, Jan 12, 2021 at 8:29 AM Maciej Gawinecki 
> wrote:
>
>> Hi Guillaume,
>>
>> Thanks for reporting the issue in the bug tracker. Is the link you have
>> provided, https://phabricator.wikimedia.org/T2704, correct?
>>
>> The last activity in that ticket was in 2014...
>>
>> Thanks,
>> Maciej Gawinecki
>>
>> czw., 17 gru 2020 o 16:05 Maciej Gawinecki 
>> napisał(a):
>>
>>> Hi,
>>>
>>> I am trying to get alternative names of given names in WikiData with the
>>> following simple query:
>>>
>>> PREFIX ps: <http://www.wikidata.org/prop/direct/>
>>> PREFIX wd: <http://www.wikidata.org/entity/>
>>> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
>>> CONSTRUCT {?s rdfs:label ?o}
>>> WHERE { ?s ps:P31 wd:Q202444. ?s rdfs:label ?o}
>>> LIMIT 1000
>>>
>>> Initially, the query was much more complex, but I was getting time-outs
>>> on the public WikiData SPARQL endpoint. I decided to use Linked Data
>>> Fragments to offload some filtering from the server to the client.
>>>
>>> comunica-sparql "https://query.wikidata.org/bigdata/ldf" -f query >
>>> given_names.n3
>>>
>>> (where "query" is a file with the SPARQL query shown above).
>>> Unfortunately, the client tries to get output from the 3rd page, I am
>>> getting the following error:
>>>
>>> Could not retrieve
>>> https://query.wikidata.org/bigdata/ldf?subject=http%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ21147790&predicate=http%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23label&page=3
>>> (500: unknown error)
>>>
>>> Following the link in fact returns HTTP 500 error with
>>>
>>> Error details
>>> java.lang.IllegalStateException
>>>
>>> The link points to the 3rd page. It works if you try to go the second
>>> page:
>>> https://query.wikidata.org/bigdata/ldf?subject=http%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ21147790&predicate=http%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23label&page=2
>>>
>>> Is this a bug or a limitation of a service?
>>>
>>> With kind regards,
>>> Maciej Gawinecki
>>>
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Linked Data Fragments endpoint returns IllegalStateException

2020-12-18 Thread Guillaume Lederrey
Hello!

This looks like a bug, but we'll need more time to investigate and get to
the bottom of this. I've created https://phabricator.wikimedia.org/T2704 to
track the issue. There aren't that many people around during this end of
year, so this will probably have to wait for January.

Thanks for your patience!

   Guillaume

On Thu, Dec 17, 2020 at 4:06 PM Maciej Gawinecki 
wrote:

> Hi,
>
> I am trying to get alternative names of given names in WikiData with the
> following simple query:
>
> PREFIX ps: <http://www.wikidata.org/prop/direct/>
> PREFIX wd: <http://www.wikidata.org/entity/>
> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
> CONSTRUCT {?s rdfs:label ?o}
> WHERE { ?s ps:P31 wd:Q202444. ?s rdfs:label ?o}
> LIMIT 1000
>
> Initially, the query was much more complex, but I was getting time-outs on
> the public WikiData SPARQL endpoint. I decided to use Linked Data Fragments
> to offload some filtering from the server to the client.
>
> comunica-sparql "https://query.wikidata.org/bigdata/ldf" -f query >
> given_names.n3
>
> (where "query" is a file with the SPARQL query shown above).
> Unfortunately, the client tries to get output from the 3rd page, I am
> getting the following error:
>
> Could not retrieve
> https://query.wikidata.org/bigdata/ldf?subject=http%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ21147790&predicate=http%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23label&page=3
> (500: unknown error)
>
> Following the link in fact returns HTTP 500 error with
>
> Error details
> java.lang.IllegalStateException
>
> The link points to the 3rd page. It works if you try to go the second
> page:
> https://query.wikidata.org/bigdata/ldf?subject=http%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ21147790&predicate=http%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23label&page=2
>
> Is this a bug or a limitation of a service?
>
> With kind regards,
> Maciej Gawinecki
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Wikidata Query Service status update

2020-10-12 Thread Guillaume Lederrey
Hello all!

Here are a few updates from Wikidata Query Service:

* We are getting close to full functional coverage of our Flink based
Streaming Updater [1]. We are starting to work on productionizing it and
having a deployment strategy. The current goal is to deploy on top of
Kubernetes.
* We've been reviewing how we log queries and we've been adding some
context to the logs. In particular, we are adding CPU load and query
concurrency [2], with the hope that we can normalize our analysis: a query
that takes time because the server is overloaded does not have the same
meaning as a query that takes time because it is intrinsically expensive.
* We've been exploring our assumption that expensive queries are more
likely to be human generated queries (via the UI) than bots [3]. That
assumption seems to be wrong.
* We are looking into upgrading to JDK11. We are currently running on JDK8,
we have some time before it is truly end of life. Blazegraph itself has a
number of issues with JDK11.
* We had a few issues with data reload on Wikimedia Commons Query Service.
We are now doing those data reload without interruption, so future issues
should not result in any downtime, but just delays in getting the new data.
The data size of WCQS is growing faster than we expected. We are
tentatively planning on working on a streaming updater for WCQS early 2021.

Have fun!

   Guillaume

[1] https://phabricator.wikimedia.org/T244590
[2] https://phabricator.wikimedia.org/T261937
[3] https://phabricator.wikimedia.org/T261841#6532765

-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] WDQS / WCQS Status update

2020-09-02 Thread Guillaume Lederrey
Hello all!

A quick update on what's going on around our SPARQL endpoints.

* Wikimedia Commons Query Service (WCQS) [1] is available as a beta
service. We've seen a number of people starting to run queries. And a
number of examples have been added [2]. Thanks all for your help!
* We are focusing again on WDQS and improving the update process [3]. So
far, we have an end-to-end working example for simple updates (revision
create) and are working on adding support for more complex updates
(deletes, undeletes, suppressed deletes, etc.). Once this whole process
is complete and working for WDQS, we'll see how we can adapt it for WCQS
and have streaming updates to WCQS.
* We are looking into the deployment constraints for the new WDQS update
process. Managing Flink at scale is non-trivial; we are just starting, and
there is a lot more work needed to make this robust.
* We are planning to spend more time doing some analytics on our data [4].
We want to better understand the use cases and the data we have. We are
still defining exactly what questions we want to answer from the data, but
the main ones are:
** What are the most expensive queries, what are they trying to achieve, and
is that reasonable?
** Do we have performant subgraphs that we could expose independently?
This will also require some work to improve our query logging and to
aggregate more context with the queries we log.

That's all for today!

  Have fun!

 Guillaume


[1] https://wcqs-beta.wmflabs.org/
[2]
https://commons.wikimedia.org/wiki/Commons:SPARQL_query_service/queries/examples
[3] https://phabricator.wikimedia.org/T244590
[4] https://phabricator.wikimedia.org/T257045
-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikimedia Commons Query Service (WCQS)

2020-07-24 Thread Guillaume Lederrey
On Thu, Jul 23, 2020 at 11:26 PM Hay (Husky)  wrote:

> Awesome, i'm really happy we finally have at least a start of a
> functioning query service.
>
> For now, the two things that i guess would be helpful for most query
> writers:
> 1) A way to make ImageGrid work without resorting to the clunky
> Special:FilePath hack
>

I've created https://phabricator.wikimedia.org/T258769 to track this
request. Feel free to add any details on the task. No promise on when we'll
have time to work on it. We now need to focus again on the new streaming
updater for WDQS (and eventually for WCQS) and improving the stability /
scaling of WDQS.


> 2) A nicer way to query Wikidata information without using federation.
>

It is unlikely that we'll be able to remove federation in this case. At
least it is unlikely that we'll merge those 2 graphs in a single Blazegraph
instance. Given the concerns we have about scaling this service, splitting
into more federated graphs seems a better option than merging into larger
graphs. There might be ways to pull a subset of the Wikidata dataset into
WCQS, but that looks like a complex problem.

I guess 2) might be a bit more difficult, but it might definitely be
> something to consider.
>
> Kind regards,
> -- Hay
>
> On Wed, Jul 22, 2020 at 8:03 PM Guillaume Lederrey
>  wrote:
> >
> > Hello all!
> >
> > We are happy to announce the availability of Wikimedia Commons Query
> Service (WCQS): https://wcqs-beta.wmflabs.org/.
> >
> > This is a beta SPARQL endpoint exposing the Structured Data on Commons
> (SDoC) dataset. This endpoint can federate with WDQS. More work is needed
> as we iterate on the service, but feel free to begin using the endpoint.
> Known limitations are listed below:
> >
> > * The service is a beta endpoint that is updated via weekly dumps. Some
> caveats include limited performance, expected downtimes, and no interface,
> naming, or backward compatibility stability guarantees.
> > * The service is hosted on Wikimedia Cloud Services, with limited
> resources and limited monitoring. This means there may be random unplanned
> downtime.
> > The data will be reloaded weekly from dumps. The service will be down
> during data reload. With the current amount of SDoC data, downtime will
> last approximately 4 hours, but this may increase as SDoC data grows.
> > * Due to an issue with the dump format, the data currently only
> dates back to July 5th. We’re working on getting more up-to-date data and
> hope to have a solution soon. (https://phabricator.wikimedia.org/T258507
> and https://phabricator.wikimedia.org/T258474)
> > * The MediaInfo concept URIs (e.g.
> http://commons.wikimedia.org/entity/M37200540) are currently HTTP; we may
> change these to HTTPS in the near future. Please comment on T258590 if you
> have concerns about this change.
> >
> > * The service is restricted behind OAuth authentication, backed by
> Commons. You will need an account on Commons to access the service. This is
> so that we can contact abusive bots and/or users and block them selectively
> as a last resort if needed.
> > * Please note that to correctly logout of the service, you need to
> use the logout link in WCQS - logging out of just Wikimedia Commons will
> not work for WCQS. This limitation will be lifted once we move to
> production.
> >
> > * No documentation on the service is available yet. In particular, no
> examples are provided yet. You can add your own examples at
> https://commons.wikimedia.org/wiki/Commons:SPARQL_query_service/queries/examples
> following the format at
> https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples
> .
> > * Please use the SPARQL template. Note that while there is currently
> a bug that doesn’t allow us to change the “Try it!” link endpoint, the
> examples will be displayed correctly on the WCQS GUI.
> >
> > * WCQS is a work in progress and some bugs are to be expected,
> especially related to generalizing WDQS to fit SDoC data. For example,
> current bugs include:
> > * URI prefixes specific for SDoC data don’t yet work - you need to
> use full URIs if you want to query using them. Relations and Q items are
> defined by Wikidata’s URI prefixes, so they work correctly.
> > * Autocomplete for SDoC items doesn’t work - without prefixes they’d
> be unusable anyway, but additional work will be required after we inject
> SDoC URI prefixes into WCQS GUI.
> > * If you find any additional bugs or issues, please report them via
> Phabricator with the tag wikidata-query-service.
> > * We do plan to move the service to production, but we don’t have a
> timeline on that yet. We want to emphasize that while

[Wikidata] Wikimedia Commons Query Service (WCQS)

2020-07-22 Thread Guillaume Lederrey
Hello all!

We are happy to announce the availability of Wikimedia Commons Query
Service (WCQS): https://wcqs-beta.wmflabs.org/.

This is a beta SPARQL endpoint exposing the Structured Data on Commons
(SDoC) dataset. This endpoint can federate with WDQS. More work is needed
as we iterate on the service, but feel free to begin using the endpoint.
Known limitations are listed below:

* The service is a beta endpoint that is updated via weekly dumps. Some
caveats include limited performance, expected downtimes, and no interface,
naming, or backward compatibility stability guarantees.
* The service is hosted on Wikimedia Cloud Services, with limited
resources and limited monitoring. This means there may be random unplanned
downtime.
* The data will be reloaded weekly from dumps. The service will be down
during data reload. With the current amount of SDoC data, downtime will
last approximately 4 hours, but this may increase as SDoC data grows.
* Due to an issue with the dump format, the data currently only dates
back to July 5th. We’re working on getting more up-to-date data and hope to
have a solution soon. (https://phabricator.wikimedia.org/T258507 and
https://phabricator.wikimedia.org/T258474)
* The MediaInfo concept URIs (e.g.
http://commons.wikimedia.org/entity/M37200540) are currently HTTP; we may
change these to HTTPS in the near future. Please comment on T258590 if you
have concerns about this change.

* The service is restricted behind OAuth authentication, backed by Commons.
You will need an account on Commons to access the service. This is so that
we can contact abusive bots and/or users and block them selectively as a
last resort if needed.
* Please note that to correctly log out of the service, you need to use
the logout link in WCQS - logging out of just Wikimedia Commons will not
work for WCQS. This limitation will be lifted once we move to production.

* No documentation on the service is available yet. In particular, no
examples are provided yet. You can add your own examples at
https://commons.wikimedia.org/wiki/Commons:SPARQL_query_service/queries/examples
following the format at
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples
.
* Please use the SPARQL template. Note that while there is currently a
bug that doesn’t allow us to change the “Try it!” link endpoint, the
examples will be displayed correctly on the WCQS GUI.

* WCQS is a work in progress and some bugs are to be expected, especially
related to generalizing WDQS to fit SDoC data. For example, current bugs
include:
* URI prefixes specific for SDoC data don’t yet work - you need to use
full URIs if you want to query using them. Relations and Q items are
defined by Wikidata’s URI prefixes, so they work correctly.
* Autocomplete for SDoC items doesn’t work - without prefixes they’d be
unusable anyway, but additional work will be required after we inject SDoC
URI prefixes into WCQS GUI.
* If you find any additional bugs or issues, please report them via
Phabricator with the tag wikidata-query-service.
* We do plan to move the service to production, but we don’t have a
timeline on that yet. We want to emphasize that while we do expect a SPARQL
endpoint to be part of a medium to long-term solution, it will only be part
of that solution. Even once the service is production-ready, it will still
have limitations in terms of timeouts, expensive queries, and federation.
Some use cases will need to be migrated, over time, to better solutions -
once those solutions exist.
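To make the federation mentioned above concrete, here is a sketch of a
query run against WCQS that pulls labels from WDQS. It assumes the files
queried carry P180 ("depicts") statements; the SERVICE URL is the public
WDQS SPARQL endpoint.

```sparql
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# Run on WCQS: find files with "depicts" (P180) statements,
# then fetch English labels for the depicted items from WDQS.
SELECT ?file ?item ?itemLabel WHERE {
  ?file wdt:P180 ?item .
  SERVICE <https://query.wikidata.org/sparql> {
    ?item rdfs:label ?itemLabel .
    FILTER(LANG(?itemLabel) = "en")   # labels live in Wikidata
  }
}
LIMIT 10
```

Note that, per the limitations above, SDoC-specific prefixes are not yet
available, so MediaInfo entities need full URIs; Wikidata prefixes such as
wdt: work as usual.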

Have fun!

   Guillaume

--
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] WDQS status

2020-07-09 Thread Guillaume Lederrey
On Thu, Jul 9, 2020 at 4:52 PM Egon Willighagen 
wrote:

>
> Dear Guillaume,
>
> On Thu, Jul 9, 2020 at 3:23 PM Guillaume Lederrey 
> wrote:
>
>> Some very preliminary analysis indicates that less than 2% of the queries
>> on WDQS generate more than 90% of the load. This is definitely something we
>> need to better understand.
>>
>
> Is the data behind that available? I wonder if I recognize any of the top
> 25 queries.
>

No, the data isn't publicly available. Queries can (and do) contain private
information, so we don't publish raw queries. We might publish a subset of
those queries at some point, but only after having reviewed them manually
to ensure they are clean.

(I guess the top 2% can be simple queries run very many times, as well as
> hard queries rarely run, correct?)
>

The analysis at this point is just on individual queries, with no
aggregation of similar queries. This means that these 2% of queries are
individually very expensive. We need to refine that analysis, and aggregating
similar queries is one of the things we should be working on.
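Aggregating similar queries usually means reducing each query to a template
(masking constants and whitespace) and summing runtimes per template. A rough
sketch of that idea; the masking patterns are illustrative, not the team's
actual analysis pipeline:

```python
import re
from collections import defaultdict

def normalize(query: str) -> str:
    """Reduce a SPARQL query to a rough template so that queries with the
    same shape but different constants aggregate together."""
    q = re.sub(r'"[^"]*"', '"?"', query)            # mask string literals
    q = re.sub(r'\b(wd:Q|wdt:P)\d+', r'\1N', q)     # mask entity/property IDs
    return re.sub(r'\s+', ' ', q).strip().lower()   # normalize whitespace/case

def top_templates(log, n=5):
    """log: iterable of (query, runtime_ms) pairs; return the n templates
    with the highest total runtime -- candidates for the small fraction
    of queries that generates most of the load."""
    cost = defaultdict(float)
    for query, runtime_ms in log:
        cost[normalize(query)] += runtime_ms
    return sorted(cost.items(), key=lambda kv: -kv[1])[:n]
```

With aggregation like this, a cheap query run millions of times and a single
pathological query both surface in the same ranking.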


> Egon
>
>
> --
> Hi, do you like citation networks? Already 51% of all citations are
> available <https://i4oc.org/> for innovative new uses
> <https://twitter.com/hashtag/acs2ioc>. Join me in asking the American
> Chemical Society to join the Initiative for Open Citations too
> <https://www.change.org/p/asking-the-american-chemical-society-to-join-the-initiative-for-open-citations>.
>  SpringerNature,
> the RSC and many others already did <https://i4oc.org/#publishers>.
>
> -
> E.L. Willighagen
> Department of Bioinformatics - BiGCaT
> Maastricht University (http://www.bigcat.unimaas.nl/)
> Homepage: http://egonw.github.com/
> Blog: http://chem-bla-ics.blogspot.com/
> PubList: https://www.zotero.org/egonw
> ORCID: -0001-7542-0286 <http://orcid.org/-0001-7542-0286>
> ImpactStory: https://impactstory.org/u/egonwillighagen


-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET


Re: [Wikidata] WDQS status

2020-07-09 Thread Guillaume Lederrey
On Thu, Jul 9, 2020 at 3:35 PM Gerard Meijssen 
wrote:

> Hoi,
> Is this different from Special:MediaSearch?
>

I'm assuming that you are asking if the new WCQS is different from the
Special:MediaSearch prototype [1].

And yes, it is quite different. WCQS is a low-level SPARQL interface,
oriented toward power users and tools, allowing federation with WDQS and
the Wikidata dataset. Special:MediaSearch is a higher-level search
interface, backed by elasticsearch. It is using the same underlying data,
but in a very different way.

Somewhat unrelated: we are also planning some work on Special:MediaSearch
to better integrate it with our current search infrastructure [2].

[1] https://commons.wikimedia.org/wiki/Special:MediaSearch
[2] https://phabricator.wikimedia.org/T257043

Thanks,
>   GerardM
>
> On Thu, 9 Jul 2020 at 15:23, Guillaume Lederrey 
> wrote:
>
>> Hello all!
>>
>> The Search Platform team will join the Wikidata office hours on July 21st
>> 16:00 UTC [1]. We are looking forward to discussing Wikidata Query Service
>> and anything else you might find of interest.
>>
>> We've been hard at work on Wikimedia Commons Query Service (WCQS) [2].
>> This will be a SPARQL endpoint similar to WDQS, but serving the Structured
>> Data on Commons dataset. Our goal is to open a beta service, hosted on
>> Wikimedia Cloud Services (WMCS) by the end of July. The service will require
>> an account on Commons for authentication and will allow federation with
>> WDQS. We don't have a streaming update process ready yet; the data will be
>> reloaded from Commons dumps weekly for a start.
>>
>> As part of that work, the dumps for Structured Data on Commons are now
>> available [3]. Note that the prefix used in the TTL dumps is "wd", which
>> does not make much sense. We are working with WMDE on renaming the
>> prefixes, but this is more complex than expected since "wd" is hardcoded in
>> more places than it should be. Those prefixes should only be valid in the
>> local context of the dumps, so renaming them is technically a non-breaking
>> change. That being said, if you start using those dumps, make sure you
>> don't rely on this prefix, or that you are ready for a rename [4].
>>
>> We are planning to dig more into the data we have to get a better
>> understanding of the use cases around WDQS [5] (not much content on that
>> task yet, but it is coming). Some very preliminary analysis indicates that
>> less than 2% of the queries on WDQS generate more than 90% of the load.
>> This is definitely something we need to better understand. We will be
>> working on defining the kind of questions we need to answer, and improving
>> our data collection to be able to answer those questions.
>>
>> We have started an internal discussion around "planning for disaster"
>> [6]. We want to better understand the potential failure scenarios around
>> WDQS and have a plan if that worst case does happen. This will include some
>> analytics work and some testing to better understand the constraints and
>> what degraded mode we might still be able to provide in case of
>> catastrophic failure.
>>
>> Thanks for reading!
>>
>>Guillaume
>>
>> [1] https://www.wikidata.org/wiki/Wikidata:Events#Office_hours
>> [2] https://phabricator.wikimedia.org/T251488
>> [3] https://dumps.wikimedia.org/other/wikibase/commonswiki/
>> [4]
>> https://dumps.wikimedia.org/other/wikibase/commonswiki/README_commonsrdfdumps.txt
>> [5] https://phabricator.wikimedia.org/T257045
>> [6] https://phabricator.wikimedia.org/T257055
>>
>>
>> --
>> Guillaume Lederrey
>> Engineering Manager, Search Platform
>> Wikimedia Foundation
>> UTC+1 / CET


-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET


[Wikidata] WDQS status

2020-07-09 Thread Guillaume Lederrey
Hello all!

The Search Platform team will join the Wikidata office hours on July 21st
16:00 UTC [1]. We are looking forward to discussing Wikidata Query Service
and anything else you might find of interest.

We've been hard at work on Wikimedia Commons Query Service (WCQS) [2]. This
will be a SPARQL endpoint similar to WDQS, but serving the Structured Data
on Commons dataset. Our goal is to open a beta service, hosted on Wikimedia
Cloud Services (WMCS) by the end of July. The service will require an
account on Commons for authentication and will allow federation with WDQS.
We don't have a streaming update process ready yet; the data will be
reloaded from Commons dumps weekly for a start.
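Federation means a WCQS query can delegate part of its pattern to WDQS with a
SPARQL 1.1 SERVICE clause. A sketch of what such a query could look like; the
WDQS endpoint URL is the public one, but the prefix handling and query shape
on the WCQS side are illustrative, not a confirmed interface:

```python
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"  # public WDQS SPARQL endpoint

def depicts_query(qid: str) -> str:
    """Build a WCQS query: files depicting `qid` matched locally against
    Structured Data on Commons, labels fetched remotely from WDQS via
    SPARQL 1.1 federation."""
    return f"""SELECT ?file ?label WHERE {{
  ?file wdt:P180 wd:{qid} .            # P180 = "depicts" (illustrative prefixes)
  SERVICE <{WDQS_ENDPOINT}> {{         # this block is evaluated by WDQS
    wd:{qid} rdfs:label ?label .
    FILTER(LANG(?label) = "en")
  }}
}}
LIMIT 10"""
```

The SERVICE block runs on the remote endpoint, so the local service only has
to join the remote bindings with its own triples.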

As part of that work, the dumps for Structured Data on Commons are now
available [3]. Note that the prefix used in the TTL dumps is "wd", which
does not make much sense. We are working with WMDE on renaming the
prefixes, but this is more complex than expected since "wd" is hardcoded in
more places than it should be. Those prefixes should only be valid in the
local context of the dumps, so renaming them is technically a non-breaking
change. That being said, if you start using those dumps, make sure you
don't rely on this prefix, or that you are ready for a rename [4].
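One way to stay robust against the planned rename is to never hardcode the
prefix at all: read the @prefix declarations from the dump header and expand
prefixed names through them. A minimal sketch (the base IRI in the test data
below is illustrative, not the actual dump namespace):

```python
import re

def parse_prefixes(ttl_header: str) -> dict:
    """Collect @prefix declarations from the top of a Turtle dump."""
    return dict(re.findall(r'@prefix\s+(\w*):\s*<([^>]+)>\s*\.', ttl_header))

def expand(term: str, prefixes: dict) -> str:
    """Expand a prefixed name to a full IRI using the dump's own
    declarations, so a later rename of "wd" cannot break consumers."""
    prefix, sep, local = term.partition(':')
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return term
```

A consumer working this way sees identical full IRIs before and after any
prefix rename, which is exactly what makes the rename non-breaking.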

We are planning to dig more into the data we have to get a better
understanding of the use cases around WDQS [5] (not much content on that
task yet, but it is coming). Some very preliminary analysis indicates that
less than 2% of the queries on WDQS generate more than 90% of the load.
This is definitely something we need to better understand. We will be
working on defining the kind of questions we need to answer, and improving
our data collection to be able to answer those questions.

We have started an internal discussion around "planning for disaster" [6].
We want to better understand the potential failure scenarios around WDQS
and have a plan if that worst case does happen. This will include some
analytics work and some testing to better understand the constraints and
what degraded mode we might still be able to provide in case of
catastrophic failure.

Thanks for reading!

   Guillaume

[1] https://www.wikidata.org/wiki/Wikidata:Events#Office_hours
[2] https://phabricator.wikimedia.org/T251488
[3] https://dumps.wikimedia.org/other/wikibase/commonswiki/
[4]
https://dumps.wikimedia.org/other/wikibase/commonswiki/README_commonsrdfdumps.txt
[5] https://phabricator.wikimedia.org/T257045
[6] https://phabricator.wikimedia.org/T257055


-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET


[Wikidata] WDQS Status

2020-04-01 Thread Guillaume Lederrey
Hello all!

I hope you are all doing well in these interesting times! We are doing our
best to continue moving forward, but there is no doubt that the COVID-19
situation is affecting our work. Unexpected schedules, kids homeschooling,
loved ones and family affected, or just the overall stress level claiming
some of our brain capacity.

That being said, we are moving forward with our new updater [1]. We almost
have a first working implementation of the simplest use case: creating a
turtle stream from revision create events. More work is needed to include
more complex use cases (visibility changes, renames, ...), but we are
focusing on testing the non functional requirements and making sure this
solution is robust. We are working on metrics, monitoring, and validating
latencies.
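In outline, the simplest case maps each revision-create event to one ordered
update record carrying the entity's RDF at that revision. A toy sketch of that
mapping; the event field names are illustrative, not the actual event schema,
and `fetch_ttl` stands in for however the entity RDF is obtained:

```python
def event_to_update(event: dict, fetch_ttl) -> dict:
    """Turn one revision-create event into an update record for the
    turtle stream. `fetch_ttl(entity_id, rev_id)` abstracts fetching
    the entity's RDF serialization for exactly that revision."""
    entity_id = event["page_title"]   # e.g. "Q42" (illustrative field name)
    rev_id = event["rev_id"]          # revision to emit (illustrative)
    return {
        "entity": entity_id,
        "revision": rev_id,
        "ttl": fetch_ttl(entity_id, rev_id),
    }
```

The harder cases mentioned above (visibility changes, renames, ...) are harder
precisely because they do not fit this one-event-one-record shape.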

We've been mostly focused on this scaling work, but we know we also need to
address some of the shorter term issues as well. We are trying to dedicate
some time to this. In particular:

* fixing a bug about categories reloading [2]
* update to wikipathways federation endpoint (thanks to RhinosF1 for doing
the actual work! we'll just do the deployment next Monday) [3]

We are hoping to be able to address more of those day to day operations
going forward, but we are stretched thin at the moment.

Keep safe!

   Guillaume

[1] https://phabricator.wikimedia.org/T244590
[2] https://phabricator.wikimedia.org/T246568
[3] https://phabricator.wikimedia.org/T249041
-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET


[Wikidata] WDQS status

2020-03-04 Thread Guillaume Lederrey
Hello all!

Here is a short update on what we have been doing around WDQS lately.

The update lag over the last 30 days has been slightly better [1]. We have
not done anything more to improve it, or to analyze why it was less
problematic lately. My guess is that it is a combination of being lucky and
that the self throttling of edit based on the WDQS lag exposed through the
Wikidata API.

We are now collecting more metrics from the WDQS updater [2] and exposing
them through a new dashboard [3]. We are also collecting queries for
analysis. Our hope is that digging into those queries (when we have
time) will allow us to discover patterns of queries that might be better
served with a different solution than Blazegraph.

We have loaded Wikidata dumps in Hadoop. This allows us to run analysis
that would not be possible with Blazegraph. For example, we ran an analysis
of the usage of common qualifiers for “unknown value” [4].

There is an ongoing discussion about the use of blank nodes [5]. Blank
nodes are problematic for our updater, as finding them is by design a
non-trivial operation. The discussion is still ongoing, but it is likely that
we will need to introduce a breaking change in the way we are using blank
nodes. We will provide an update once we know more precisely what we need
to do and we have a migration path for use cases using them.
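The standard technique for eliminating blank nodes is skolemization: replacing
each blank node label with a stable, addressable IRI. Whether that is the exact
change that will land here is not decided in this message, but the mechanics
look roughly like this (the base IRI is a placeholder):

```python
import re

def skolemize(ntriples: str,
              base: str = "http://example.org/.well-known/genid/") -> str:
    """Replace blank node labels (_:b0, _:node17, ...) with stable IRIs,
    making the nodes addressable by an updater without graph matching."""
    return re.sub(r'_:([A-Za-z0-9]+)',
                  lambda m: f'<{base}{m.group(1)}>', ntriples)
```

After skolemization the updater can find and delete a specific node by IRI,
instead of pattern-matching the surrounding graph.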

We are now focused on a complete rewrite of the WDQS Updater [6]. We are
investigating using Flink [7] as a stream processing solution. This should
allow us to both simplify the update process a lot and make it a lot more
efficient. There is still a lot of work to be done before this is complete,
but we think we have a good path forward.

Misc:

* some aliases for Wikidata have been deployed [8]

As always, thank you for your patience!

   Guillaume


[1]
https://grafana.wikimedia.org/d/00489/wikidata-query-service?orgId=1=8=now-30d=now
[2] https://phabricator.wikimedia.org/T239908
[3]
https://grafana.wikimedia.org/d/dSksY08Zk/wikidata-query-service-updater?orgId=1
[4] https://phabricator.wikimedia.org/T246238
[5] https://phabricator.wikimedia.org/T244341
[6] https://phabricator.wikimedia.org/T244590
[7] https://flink.apache.org/
[8] https://phabricator.wikimedia.org/T222321


-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET


Re: [Wikidata] Status of Wikidata Query Service

2020-02-10 Thread Guillaume Lederrey
On Fri, Feb 7, 2020 at 5:18 PM Guillaume Lederrey 
wrote:

> On Fri, Feb 7, 2020 at 2:54 PM Marco Neumann 
> wrote:
>
>> thank you Guillaume, when do you expect a public update on the security
>> incident [1]? Is any of our personal and private data (email, password etc)
>> affected?
>>
>
> It should be made public in the next few days. I'm not going to go into
> any more details until this is made public, but overall, don't worry too
> much.
>

Corrections and apologies on what I said above. We are not actually ready
to make this ticket public. The underlying issue is under control and does
not require any user action to mitigate. Given the security aspect, I'm not
going to do any further communication on this.

Sorry to have been misleading on this.

  Enjoy your day!

 Guillaume


> best,
>> Marco
>>
>> [1] https://phabricator.wikimedia.org/T241410
>>
>> On Fri, Feb 7, 2020 at 1:33 PM Guillaume Lederrey <
>> gleder...@wikimedia.org> wrote:
>>
>>> Hello all!
>>>
>>> First of all, my apologies for the long silence. We need to do better in
>>> terms of communication. I'll try my best to send a monthly update from now
>>> on. Keep me honest, remind me if I fail.
>>>
>>> First, we had a security incident at the end of December, which forced
>>> us to move from our Kafka based update stream back to the RecentChanges
>>> poller. The details are still private, but you will be able to get the full
>>> story soon on phabricator [1]. The RecentChange poller is less efficient
>>> and this is leading to high update lag again (just when we thought we had
>>> things slightly under control). We tried to mitigate this by improving the
>>> parallelism in the updater [2], which helped a bit, but not as much as we
>>> need.
>>>
>>> Another attempt to get update lag under control is to apply back
>>> pressure on edits, by adding the WDQS update lag to the Wikidata maxlag
>>> [6]. This is obviously less than ideal (at least as long as WDQS updates
>>> are lagging as often as they are), but does allow the service to recover
>>> from time to time. We probably need to iterate on this, provide better
>>> granularity, differentiate better between operations that have an impact on
>>> update lag and those which don't.
>>>
>>> On the slightly better news side, we now have a much better
>>> understanding of the update process and of its shortcomings. The current
>>> process does a full diff between each updated entity and what we have in
>>> blazegraph. Even if a single triple needs to change, we still read tons of
>>> data from Blazegraph. While this approach is simple and robust, it is
>>> obviously not efficient. We need to rewrite the updater to take a more
>>> event streaming / reactive approach, and only work on the actual changes.
>>> This is a big chunk of work, almost a complete rewrite of the updater, and
>>> we need a new solution to stream changes with guaranteed ordering
>>> (something that our kafka queues don't offer). This is where we are
>>> focusing our energy at the moment, this looks like the best option to
>>> improve the situation in the medium term. This change will probably have
>>> some functional impacts [3].
>>>
>>> Some misc things:
>>>
>>> We have done some work to get better metrics and better understanding of
>>> what's going on. From collecting more metrics during the update [4] to
>>> loading RDF dumps into Hadoop for further analysis [5] and better logging
>>> of SPARQL requests. We are not focusing on this analysis until we are in a
>>> more stable situation regarding update lag.
>>>
>>> We have a new team member working on WDQS. He is still ramping up, but
>>> we should have a bit more capacity from now on.
>>>
>>> Some longer term thoughts:
>>>
>>> Keeping all of Wikidata in a single graph is most probably not going to
>>> work long term. We have not found examples of public SPARQL endpoints with
>>> > 10 B triples and there is probably a good reason for that. We will
>>> probably need to split the graphs at some point. We don't know how yet
>>> (that's why we loaded the dumps into Hadoop, that might give us some more
>>> insight). We might expose a subgraph with only truthy statements. Or have
>>> language specific graphs, with only language specific labels. Or something
>>> completely different.
>>>
>>> Keeping WDQS / Wikidata as open as they are at the moment might not be
>>> possible in the long term.

Re: [Wikidata] Status of Wikidata Query Service

2020-02-07 Thread Guillaume Lederrey
On Fri, Feb 7, 2020 at 3:12 PM  wrote:

>
> Better update granularity is probably good and may be a good priority.
>
> It is (still) unclear for me as a tool writer whether I can do anything.
> For instance it is not clear to me whether the parallel SPARQL queries
> that comes when a user visit a Scholia page is important for the load on
> WDQS (not likely) or it is minuscule (likely).
>

Sadly, I don't have a good answer to that at the moment. The work we've
done to better log queries and their context should help us to get some of
that understanding.

In the meantime, query run time is a good proxy for resource cost. If your
queries have an aggregate run time of 100ms per minute, don't worry about
it. If your queries have an aggregate runtime of 30 seconds per minute,
there is probably a need to do something. Or if you have individual queries
running regularly for more than 10 seconds.
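That rule of thumb can be checked mechanically against your own query log. A
sketch, with the thresholds taken straight from the paragraph above:

```python
def over_budget(runtimes_s, single_query_limit=10.0, aggregate_limit=30.0):
    """runtimes_s: runtimes (in seconds) of one client's queries within
    one minute. True if the aggregate runtime, or any single query,
    exceeds the rough limits suggested above."""
    return (sum(runtimes_s) > aggregate_limit
            or any(r > single_query_limit for r in runtimes_s))
```

So a tool firing many sub-100 ms queries passes easily, while a handful of
multi-second queries per minute is worth optimizing.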


> As far as I understand from http://ceur-ws.org/Vol-2073/article-03.pdf
> much of the query load comes via Magnus. I presume another big chunk is
> from the genewiki people.
>
> If robotic queries are sources of problems then tool writers/users can
> do something. But fixing issues would require the WMF to tell if it
> really is a problem and what the problems are.
>

Yep, we're working on that! But our highest priority at the moment is
rewriting the updater to be more efficient. Once this is done, we should
have some free cycles for a better analysis.


>
> best regards
> Finn
>
> On 07/02/2020 14:32, Guillaume Lederrey wrote:
> > Hello all!
> >
> > First of all, my apologies for the long silence. We need to do better in
> > terms of communication. I'll try my best to send a monthly update from
> > now on. Keep me honest, remind me if I fail.
> >
> > First, we had a security incident at the end of December, which forced
> > us to move from our Kafka based update stream back to the RecentChanges
> > poller. The details are still private, but you will be able to get the
> > full story soon on phabricator [1]. The RecentChange poller is less
> > efficient and this is leading to high update lag again (just when we
> > thought we had things slightly under control). We tried to mitigate this
> > by improving the parallelism in the updater [2], which helped a bit, but
> > not as much as we need.
> >
> > Another attempt to get update lag under control is to apply back
> > pressure on edits, by adding the WDQS update lag to the Wikidata maxlag
> > [6]. This is obviously less than ideal (at least as long as WDQS updates
> > are lagging as often as they are), but does allow the service to recover
> > from time to time. We probably need to iterate on this, provide better
> > granularity, differentiate better between operations that have an impact
> > on update lag and those which don't.
> >
> > On the slightly better news side, we now have a much better
> > understanding of the update process and of its shortcomings. The current
> > process does a full diff between each updated entity and what we have in
> > blazegraph. Even if a single triple needs to change, we still read tons
> > of data from Blazegraph. While this approach is simple and robust, it is
> > obviously not efficient. We need to rewrite the updater to take a more
> > event streaming / reactive approach, and only work on the actual
> > changes. This is a big chunk of work, almost a complete rewrite of the
> > updater, and we need a new solution to stream changes with guaranteed
> > ordering (something that our kafka queues don't offer). This is where we
> > are focusing our energy at the moment, this looks like the best option
> > to improve the situation in the medium term. This change will probably
> > have some functional impacts [3].
> >
> > Some misc things:
> >
> > We have done some work to get better metrics and better understanding of
> > what's going on. From collecting more metrics during the update [4] to
> > loading RDF dumps into Hadoop for further analysis [5] and better
> > logging of SPARQL requests. We are not focusing on this analysis until
> > we are in a more stable situation regarding update lag.
> >
> > We have a new team member working on WDQS. He is still ramping up, but
> > we should have a bit more capacity from now on.
> >
> > Some longer term thoughts:
> >
> > Keeping all of Wikidata in a single graph is most probably not going to
> > work long term. We have not found examples of public SPARQL endpoints
> > with > 10 B triples and there is probably a good reason for that. We
> > will probably need to split the graphs at some point. We don't know how

Re: [Wikidata] Status of Wikidata Query Service

2020-02-07 Thread Guillaume Lederrey
On Fri, Feb 7, 2020 at 2:54 PM Marco Neumann 
wrote:

> thank you Guillaume, when do you expect a public update on the security
> incident [1]? Is any of our personal and private data (email, password etc)
> affected?
>

It should be made public in the next few days. I'm not going to go into any
more details until this is made public, but overall, don't worry too much.


> best,
> Marco
>
> [1] https://phabricator.wikimedia.org/T241410
>
> On Fri, Feb 7, 2020 at 1:33 PM Guillaume Lederrey 
> wrote:
>
>> Hello all!
>>
>> First of all, my apologies for the long silence. We need to do better in
>> terms of communication. I'll try my best to send a monthly update from now
>> on. Keep me honest, remind me if I fail.
>>
>> First, we had a security incident at the end of December, which forced us
>> to move from our Kafka based update stream back to the RecentChanges
>> poller. The details are still private, but you will be able to get the full
>> story soon on phabricator [1]. The RecentChange poller is less efficient
>> and this is leading to high update lag again (just when we thought we had
>> things slightly under control). We tried to mitigate this by improving the
>> parallelism in the updater [2], which helped a bit, but not as much as we
>> need.
>>
>> Another attempt to get update lag under control is to apply back pressure
>> on edits, by adding the WDQS update lag to the Wikidata maxlag [6]. This is
>> obviously less than ideal (at least as long as WDQS updates are lagging as
>> often as they are), but does allow the service to recover from time to
>> time. We probably need to iterate on this, provide better granularity,
>> differentiate better between operations that have an impact on update lag
>> and those which don't.
>>
>> On the slightly better news side, we now have a much better understanding
>> of the update process and of its shortcomings. The current process does a
>> full diff between each updated entity and what we have in blazegraph. Even
>> if a single triple needs to change, we still read tons of data from
>> Blazegraph. While this approach is simple and robust, it is obviously not
>> efficient. We need to rewrite the updater to take a more event streaming /
>> reactive approach, and only work on the actual changes. This is a big chunk
>> of work, almost a complete rewrite of the updater, and we need a new
>> solution to stream changes with guaranteed ordering (something that our
>> kafka queues don't offer). This is where we are focusing our energy at the
>> moment, this looks like the best option to improve the situation in the
>> medium term. This change will probably have some functional impacts [3].
>>
>> Some misc things:
>>
>> We have done some work to get better metrics and better understanding of
>> what's going on. From collecting more metrics during the update [4] to
>> loading RDF dumps into Hadoop for further analysis [5] and better logging
>> of SPARQL requests. We are not focusing on this analysis until we are in a
>> more stable situation regarding update lag.
>>
>> We have a new team member working on WDQS. He is still ramping up, but we
>> should have a bit more capacity from now on.
>>
>> Some longer term thoughts:
>>
>> Keeping all of Wikidata in a single graph is most probably not going to
>> work long term. We have not found examples of public SPARQL endpoints with
>> > 10 B triples and there is probably a good reason for that. We will
>> probably need to split the graphs at some point. We don't know how yet
>> (that's why we loaded the dumps into Hadoop, that might give us some more
>> insight). We might expose a subgraph with only truthy statements. Or have
>> language specific graphs, with only language specific labels. Or something
>> completely different.
>>
>> Keeping WDQS / Wikidata as open as they are at the moment might not be
>> possible in the long term. We need to think if / how we want to implement
>> some form of authentication and quotas. Potentially increasing quotas for
>> some use cases, but keeping them strict for others. Again, we don't know
>> what this will look like, but we're thinking about it.
>>
>> What you can do to help:
>>
>> Again, we're not sure. Of course, reducing the load (both in terms of
>> edits on Wikidata and of reads on WDQS) will help. But not using those
>> services makes them useless.
>>
>> We suspect that some use cases are more expensive than others (a single
>> property change to a large entity will require a comparatively insane
>> amount of work to update it on the WDQS side).

[Wikidata] Status of Wikidata Query Service

2020-02-07 Thread Guillaume Lederrey
Hello all!

First of all, my apologies for the long silence. We need to do better in
terms of communication. I'll try my best to send a monthly update from now
on. Keep me honest, remind me if I fail.

First, we had a security incident at the end of December, which forced us
to move from our Kafka based update stream back to the RecentChanges
poller. The details are still private, but you will be able to get the full
story soon on phabricator [1]. The RecentChange poller is less efficient
and this is leading to high update lag again (just when we thought we had
things slightly under control). We tried to mitigate this by improving the
parallelism in the updater [2], which helped a bit, but not as much as we
need.

Another attempt to get update lag under control is to apply back pressure
on edits, by adding the WDQS update lag to the Wikidata maxlag [6]. This is
obviously less than ideal (at least as long as WDQS updates are lagging as
often as they are), but does allow the service to recover from time to
time. We probably need to iterate on this, provide better granularity,
differentiate better between operations that have an impact on update lag
and those which don't.
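For bot authors, this means the standard MediaWiki `maxlag` convention now also
reflects WDQS lag: pass `maxlag=N` with write requests and back off when the
API answers with a maxlag error. A sketch of that retry loop; the response
shapes are simplified and the retry policy is illustrative:

```python
import time

def edit_with_maxlag(do_edit, maxlag=5, max_retries=5, sleep=time.sleep):
    """Run one edit, retrying while the API reports lag above `maxlag`.
    `do_edit(maxlag)` performs the API call and returns the parsed JSON;
    {"error": {"code": "maxlag", "lag": 7.3}} means back off and retry."""
    for _ in range(max_retries):
        resp = do_edit(maxlag)
        err = resp.get("error", {})
        if err.get("code") != "maxlag":
            return resp
        sleep(max(err.get("lag", maxlag), 5))  # wait at least the reported lag
    raise RuntimeError("replication lag stayed above maxlag; giving up")
```

Bots that honor this loop are exactly the "back pressure" the paragraph above
describes: they slow themselves down while WDQS catches up.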

On the slightly better news side, we now have a much better understanding
of the update process and of its shortcomings. The current process does a
full diff between each updated entity and what we have in blazegraph. Even
if a single triple needs to change, we still read tons of data from
Blazegraph. While this approach is simple and robust, it is obviously not
efficient. We need to rewrite the updater to take a more event streaming /
reactive approach, and only work on the actual changes. This is a big chunk
of work, almost a complete rewrite of the updater, and we need a new
solution to stream changes with guaranteed ordering (something that our
Kafka queues don't offer). This is where we are focusing our energy at the
moment, this looks like the best option to improve the situation in the
medium term. This change will probably have some functional impacts [3].
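The intended end state is easy to state in set terms: for an edited entity, the
updater should touch only the difference between the old and new triple sets,
instead of re-reading everything from Blazegraph. A minimal sketch:

```python
def triple_diff(old_triples, new_triples):
    """Given an entity's old and new statements (e.g. as N-Triples lines),
    return (to_delete, to_insert) -- the only work a change-driven
    updater actually has to do, however large the entity is."""
    old, new = set(old_triples), set(new_triples)
    return old - new, new - old
```

For a single-statement edit on a huge entity, both returned sets are tiny,
which is exactly the saving the rewrite is after.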

Some misc things:

We have done some work to get better metrics and better understanding of
what's going on. From collecting more metrics during the update [4] to
loading RDF dumps into Hadoop for further analysis [5] and better logging
of SPARQL requests. We are not focusing on this analysis until we are in a
more stable situation regarding update lag.

We have a new team member working on WDQS. He is still ramping up, but we
should have a bit more capacity from now on.

Some longer term thoughts:

Keeping all of Wikidata in a single graph is most probably not going to
work long term. We have not found examples of public SPARQL endpoints with
> 10 B triples and there is probably a good reason for that. We will
probably need to split the graphs at some point. We don't know how yet
(that's why we loaded the dumps into Hadoop, that might give us some more
insight). We might expose a subgraph with only truthy statements. Or have
language specific graphs, with only language specific labels. Or something
completely different.

Keeping WDQS / Wikidata as open as they are at the moment might not be
possible in the long term. We need to think if / how we want to implement
some form of authentication and quotas. Potentially increasing quotas for
some use cases, but keeping them strict for others. Again, we don't know
what this will look like, but we're thinking about it.

What you can do to help:

Again, we're not sure. Of course, reducing the load (both in terms of edits
on Wikidata and of reads on WDQS) will help. But not using those services
makes them useless.

We suspect that some use cases are more expensive than others (a single
property change to a large entity will require a comparatively insane
amount of work to update it on the WDQS side). We'd like to have real data
on the cost of various operations, but we only have guesses at this point.

If you've read this far, thanks a lot for your engagement!

  Have fun!

  Guillaume




[1] https://phabricator.wikimedia.org/T241410
[2] https://phabricator.wikimedia.org/T238045
[3] https://phabricator.wikimedia.org/T244341
[4] https://phabricator.wikimedia.org/T239908
[5] https://phabricator.wikimedia.org/T241125
[6] https://phabricator.wikimedia.org/T221774

-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET


Re: [Wikidata] Wikidata Query Service update lag

2019-11-21 Thread Guillaume Lederrey
Hello all!

As you probably already know, the lag situation on WDQS is not improving as
much as we'd like. Over the past week, we've managed to keep the lag mostly
below 3 hours, but at the cost of a lot of manual work. And yes, we know
that 3 hours of lag is already too much.

Some updates on what we've been doing:

* Testing of our new Merging Updater [1]. This did not go as planned. The
throughput was worse than expected, and it was deleting more data than
expected. We are investigating to see if this new updater has a bug, or if
our previous updater was not cleaning up as much as it should (which would
be good news)!
* WMDE released a patch to expose the lag of WDQS through Wikidata [2].
This should allow edit bots to self throttle in case WDQS lag is climbing.
* We are working on adding more parallelism to the updater [3]. Fingers
crossed, this might help increase throughput a little bit.
* We've moved one server from the internal WDQS cluster to the public one
to provide more resources. This has not had a significant impact. We're
looking at moving one of our test servers into production. It looks like
throwing hardware at the problem isn't really working, but we'll know for
sure once we try.
* Overall, we are still trying to figure out what's going on, adding some
metrics, digging through the code and trying to make sense of all that. We
are lacking knowledge and understanding, but we're learning.
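As an illustration of the self-throttling enabled by the WMDE patch above: an edit bot can pass the MediaWiki `maxlag` parameter with each write and sleep whenever the API reports that the servers (now including WDQS) are lagging. This is a minimal stdlib sketch, not an official client; the bot name, contact address, and retry policy are illustrative assumptions.

```python
import json
import time
import urllib.parse
import urllib.request

API = "https://www.wikidata.org/w/api.php"
UA = "example-bot/0.1 (contact@example.org)"  # placeholder name and contact


def parse_maxlag(payload):
    """Return the reported lag in seconds if the API refused the edit, else None."""
    err = payload.get("error") or {}
    if err.get("code") == "maxlag":
        return float(err.get("lag", 5.0))
    return None


def api_post(params, maxlag=5, max_retries=5):
    """POST an edit with maxlag set, sleeping and retrying while Wikidata lags."""
    data = dict(params, maxlag=str(maxlag), format="json")
    body = urllib.parse.urlencode(data).encode()
    for _ in range(max_retries):
        req = urllib.request.Request(API, data=body, headers={"User-Agent": UA})
        with urllib.request.urlopen(req) as resp:
            payload = json.load(resp)
        lag = parse_maxlag(payload)
        if lag is None:
            return payload
        time.sleep(max(lag, 5.0))  # back off at least 5s while lag is high
    raise RuntimeError("gave up after repeated maxlag responses")
```

A bot structured this way pauses automatically while WDQS catches up, instead of piling more edits onto a lagging service.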

What's coming next:

* We have a new team member starting December 9th. He will need to learn a
lot before being effective, but a new set of eyes and one more brain on the
problem will help in the medium term.
* We are looking at ways to reduce / throttle the query load even more
aggressively than we do at the moment. That could mean limiting the number
of requests per second per user-agent/IP, or limiting the number of
parallel requests or something else.
* We will be looking at alternatives to Blazegraph. We need some quiet time
(which has been seriously lacking lately) to do that. And we need a better
understanding of the problem we are trying to solve to be able to make the
right decision. In any case, this is something that will take time.

What you can do to help:

Honestly, not sure. Less query load on WDQS is likely to help, so if you
have a bot, make sure the queries it makes are useful and optimized.

I'll get back to you when we have something.

Thanks all for your patience!

   Guillaume



[1] https://phabricator.wikimedia.org/T231411
[2] https://phabricator.wikimedia.org/T221774
[3] https://phabricator.wikimedia.org/T238045


On Mon, Nov 18, 2019 at 6:54 PM Denny Vrandečić 
wrote:

> I don't know if there is actually someone who would be capable and have
> the time to do so, I just would hope there are such people - but it
> probably makes sense to check if there are actually volunteers before doing
> work to enable them :)
>
> On Fri, Nov 15, 2019 at 5:17 AM Guillaume Lederrey <
> gleder...@wikimedia.org> wrote:
>
>> On Fri, Nov 15, 2019 at 12:49 AM Denny Vrandečić 
>> wrote:
>>
>>> Just wondering, is there a way to let volunteers look into the issue? (I
>>> guess no because it would give potentially access to the query stream, but
>>> maybe the answer is more optimistic)
>>>
>>
>> There are ways, none of them easy. There are precedents for volunteers
>> having access to our production environment. I'm not really sure what the
>> process looks like. There is at least some NDA to sign and some vetting
>> process. As you pointed out, this would give access to sensitive
>> information, and to the ability to do great damage (power, responsibility
>> and those kind of things).
>>
>> More realistically, we could provide more information for analysis. Heap
>> dumps do contain private information, but thread dumps are pretty safe, so
>> we could publish those. We would need to automate this on our side, but
>> that might be an option. Of course, having access to limited information
>> and no way to experiment on changes seriously limits the ability to
>> investigate.
>>
>> I'll check with the team if that's something we are ready to invest in.
>>
>>
>>> On Thu, Nov 14, 2019 at 2:39 PM Thad Guidry 
>>> wrote:
>>>
>>>> In the enterprise, most folks use either Java Mission Control, or just
>>>> Java VisualVM profiler.  Seeing sleeping Threads is often good to start
>>>> with, and just taking a snapshot or even Heap Dump when things are really
>>>> grinding slow would be useful, you can later share those snapshots/heap
>>>> dump with the community or Java profiling experts to analyze later.
>>>>
>>>> https://visualvm.github.io/index.html
>>>>
>>>> Thad
&

Re: [Wikidata] Wikidata Query Service update lag

2019-11-15 Thread Guillaume Lederrey
On Fri, Nov 15, 2019 at 12:49 AM Denny Vrandečić 
wrote:

> Just wondering, is there a way to let volunteers look into the issue? (I
> guess no because it would give potentially access to the query stream, but
> maybe the answer is more optimistic)
>

There are ways, none of them easy. There are precedents for volunteers
having access to our production environment. I'm not really sure what the
process looks like. There is at least some NDA to sign and some vetting
process. As you pointed out, this would give access to sensitive
information, and the ability to do great damage (power, responsibility,
and those kinds of things).

More realistically, we could provide more information for analysis. Heap
dumps do contain private information, but thread dumps are pretty safe, so
we could publish those. We would need to automate this on our side, but
that might be an option. Of course, having access to limited information
and no way to experiment on changes seriously limits the ability to
investigate.

I'll check with the team if that's something we are ready to invest in.


> On Thu, Nov 14, 2019 at 2:39 PM Thad Guidry  wrote:
>
>> In the enterprise, most folks use either Java Mission Control, or just
>> Java VisualVM profiler.  Seeing sleeping Threads is often good to start
>> with, and just taking a snapshot or even Heap Dump when things are really
>> grinding slow would be useful, you can later share those snapshots/heap
>> dump with the community or Java profiling experts to analyze later.
>>
>> https://visualvm.github.io/index.html
>>
>> Thad
>> https://www.linkedin.com/in/thadguidry/
>>
>>
>> On Thu, Nov 14, 2019 at 1:46 PM Guillaume Lederrey <
>> gleder...@wikimedia.org> wrote:
>>
>>> Hello!
>>>
>>> Thanks for the suggestions!
>>>
>>> On Thu, Nov 14, 2019 at 5:02 PM Thad Guidry 
>>> wrote:
>>>
>>>> Is the Write Retention Queue adequate?
>>>> Is the branching factor for the lexicon indices too large, resulting in
>>>> a non-linear slowdown in the write rate over time?
>>>> Did you look into Small Slot Optimization?
>>>> Are the Write Cache Buffers adequate?
>>>> Is there a lot of Heap pressure?
>>>> Does the MemoryManager have the maximum amount of RAM it can handle?  4TB?
>>>> Is the RWStore handling the recycling well?
>>>> Is the SAIL Buffer Capacity adequate?
>>>> Are you not using exact range counts where you could be using fast
>>>> range counts?
>>>>
>>>>
>>> Start at the Hardware side first however.
>>>> Is the disk activity for writes really low...and CPU is very high?  You
>>>> have identified a bottleneck in that case, discover WHY that would be the
>>>> case looking into any of the above.
>>>>
>>>
>>> Sounds like good questions, but outside of my area of expertise. I've
>>> created https://phabricator.wikimedia.org/T238362 to track it, and I'll
>>> see if someone can have a look. I know that we did multiple passes at
>>> tuning Blazegraph properties, with limited success so far.
>>>
>>>
>>>> and a 100+ other things that should be looked at that all affect WRITE
>>>> performance during UPDATES.
>>>>
>>>> https://wiki.blazegraph.com/wiki/index.php/IOOptimization
>>>> https://wiki.blazegraph.com/wiki/index.php/PerformanceOptimization
>>>>
>>>> I would also suggest you start monitoring some of the internals of
>>>> Blazegraph (JAVA) while in production with tools such as XRebel or
>>>> AppDynamics.
>>>>
>>>
>>> Both XRebel and AppDynamics are proprietary, so no way that we'll deploy
>>> them in our environment. We are tracking a few JMX based metrics, but so
>>> far, we don't really know what to look for.
>>>
>>> Thanks!
>>>
>>>   Guillaume
>>>
>>> Thad
>>>> https://www.linkedin.com/in/thadguidry/
>>>>
>>>>
>>>> On Thu, Nov 14, 2019 at 7:31 AM Guillaume Lederrey <
>>>> gleder...@wikimedia.org> wrote:
>>>>
>>>>> Thanks for the feedback!
>>>>>
>>>>> On Thu, Nov 14, 2019 at 11:11 AM  wrote:
>>>>>
>>>>>>
>>>>>> Besides waiting for the new updater, it may be useful to tell us,
>>>>>> what
>>>>>> we as users can do too. It is unclear to me what the problem is. For
>>>>>> instance, at one point I was worried that the many pa

Re: [Wikidata] Wikidata Query Service update lag

2019-11-14 Thread Guillaume Lederrey
Hello!

Thanks for the suggestions!

On Thu, Nov 14, 2019 at 5:02 PM Thad Guidry  wrote:

> Is the Write Retention Queue adequate?
> Is the branching factor for the lexicon indices too large, resulting in a
> non-linear slowdown in the write rate over time?
> Did you look into Small Slot Optimization?
> Are the Write Cache Buffers adequate?
> Is there a lot of Heap pressure?
> Does the MemoryManager have the maximum amount of RAM it can handle?  4TB?
> Is the RWStore handling the recycling well?
> Is the SAIL Buffer Capacity adequate?
> Are you not using exact range counts where you could be using fast range
> counts?
>
>
Start at the Hardware side first however.
> Is the disk activity for writes really low...and CPU is very high?  You
> have identified a bottleneck in that case, discover WHY that would be the
> case looking into any of the above.
>

Those sound like good questions, but they're outside my area of expertise. I've
created https://phabricator.wikimedia.org/T238362 to track them, and I'll see
if someone can have a look. I know that we did multiple passes at tuning
Blazegraph properties, with limited success so far.


> and a 100+ other things that should be looked at that all affect WRITE
> performance during UPDATES.
>
> https://wiki.blazegraph.com/wiki/index.php/IOOptimization
> https://wiki.blazegraph.com/wiki/index.php/PerformanceOptimization
>
> I would also suggest you start monitoring some of the internals of
> Blazegraph (JAVA) while in production with tools such as XRebel or
> AppDynamics.
>

Both XRebel and AppDynamics are proprietary, so there's no way we'll deploy
them in our environment. We are tracking a few JMX-based metrics, but so
far we don't really know what to look for.

Thanks!

  Guillaume

Thad
> https://www.linkedin.com/in/thadguidry/
>
>
> On Thu, Nov 14, 2019 at 7:31 AM Guillaume Lederrey <
> gleder...@wikimedia.org> wrote:
>
>> Thanks for the feedback!
>>
>> On Thu, Nov 14, 2019 at 11:11 AM  wrote:
>>
>>>
>>> Besides waiting for the new updater, it may be useful to tell us, what
>>> we as users can do too. It is unclear to me what the problem is. For
>>> instance, at one point I was worried that the many parallel requests to
>>> the SPARQL endpoint that we make in Scholia is a problem. As far as I
>>> understand it is not a problem at all. Another issue could be the way
>>> that we use Magnus Manske's Quickstatements and approve bots for high
>>> frequency editing. Perhaps a better overview and constraints on
>>> large-scale editing could be discussed?
>>>
>>
>> To be (again) completely honest, we don't entirely understand the issue
>> either. There are clearly multiple related issues. In high level terms, we
>> have at least:
>>
>> * Some part of the update process on Blazegraph is CPU bound and single
>> threaded. Even with low query load, if we have a high edit rate, Blazegraph
>> can't keep up, and saturates a single CPU (with plenty of available
>> resources on other CPUs). This is a hard issue to fix, requiring either
>> splitting the processing over multiple CPU or sharding the data over
>> multiple servers. Neither of which Blazegraph supports (at least not in our
>> current configuration).
>> * There is a race for resources between edits and queries: a high query
>> load will impact the update rate. This could to some extent be mitigated by
>> reducing the query load: if no one is using the service, it works great!
>> Obviously that's not much of a solution.
>>
>> What you can do (short term):
>>
>> * Keep bots usage well behaved (don't do parallel queries, provide a
>> meaningful user agent, smooth the load over time if possible, ...). As far
>> as I can see, most usage are already well behaved.
>> * Optimize your queries: better queries will use less resources, which
>> should help. Time to completion is a good approximation of the resources
>> used. I don't really have any more specific advice, SPARQL is not my area
>> of expertise.
>>
>> What you can do (longer term):
>>
>> * Help us think out of the box. Can we identify higher level use cases?
>> Could we implement some of our workflows on a higher level API than SPARQL,
>> which might allow for more internal optimizations?
>> * Help us better understand the constraints. Document use cases on [1].
>>
>> Sadly, we don't have the bandwidth right now to engage meaningfully in
>> this conversation. Feel free to send thoughts already, but don't expect any
>> timely response.
>>
>> Yet another thought is the large discrepancy between Virginia and Texas
>>> 

Re: [Wikidata] Wikidata Query Service update lag

2019-11-14 Thread Guillaume Lederrey
Thanks for the feedback!

On Thu, Nov 14, 2019 at 11:11 AM  wrote:

>
> Besides waiting for the new updater, it may be useful to tell us, what
> we as users can do too. It is unclear to me what the problem is. For
> instance, at one point I was worried that the many parallel requests to
> the SPARQL endpoint that we make in Scholia is a problem. As far as I
> understand it is not a problem at all. Another issue could be the way
> that we use Magnus Manske's Quickstatements and approve bots for high
> frequency editing. Perhaps a better overview and constraints on
> large-scale editing could be discussed?
>

To be (again) completely honest, we don't entirely understand the issue
either. There are clearly multiple related issues. In high-level terms, we
have at least:

* Some part of the update process on Blazegraph is CPU-bound and
single-threaded. Even with a low query load, if we have a high edit rate,
Blazegraph can't keep up and saturates a single CPU (with plenty of
available resources on other CPUs). This is a hard issue to fix, requiring
either splitting the processing over multiple CPUs or sharding the data
over multiple servers, neither of which Blazegraph supports (at least not
in our current configuration).
* There is a race for resources between edits and queries: a high query
load will impact the update rate. This could to some extent be mitigated by
reducing the query load: if no one is using the service, it works great!
Obviously that's not much of a solution.

What you can do (short term):

* Keep bot usage well behaved (don't run parallel queries, provide a
meaningful user agent, smooth the load over time if possible, ...). As far
as I can see, most usage is already well behaved.
* Optimize your queries: better queries will use fewer resources, which
should help. Time to completion is a good approximation of the resources
used. I don't really have any more specific advice; SPARQL is not my area
of expertise.
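For illustration, "well behaved" could look roughly like the sketch below: a descriptive User-Agent with contact information, no parallel requests, and a pause between queries. The agent string and pause length are assumptions for the example, not official requirements.

```python
import time
import urllib.parse
import urllib.request

WDQS = "https://query.wikidata.org/sparql"
# A meaningful User-Agent with contact information; the name and address
# here are placeholders.
UA = "my-research-bot/1.0 (https://example.org/bot; bot-owner@example.org)"


def build_request(sparql):
    """Build a GET request for WDQS asking for JSON results."""
    url = WDQS + "?" + urllib.parse.urlencode({"query": sparql, "format": "json"})
    return urllib.request.Request(url, headers={"User-Agent": UA})


def run_batch(queries, pause=1.0):
    """Run queries one at a time, pausing between them to smooth the load."""
    results = []
    for q in queries:
        with urllib.request.urlopen(build_request(q)) as resp:
            results.append(resp.read())
        time.sleep(pause)  # no parallel requests; spread the load over time
    return results
```

Running queries sequentially with a pause trades a longer wall-clock time for a much gentler load profile on the service.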

What you can do (longer term):

* Help us think out of the box. Can we identify higher level use cases?
Could we implement some of our workflows on a higher level API than SPARQL,
which might allow for more internal optimizations?
* Help us better understand the constraints. Document use cases on [1].

Sadly, we don't have the bandwidth right now to engage meaningfully in this
conversation. Feel free to send thoughts already, but don't expect any
timely response.

Yet another thought is the large discrepancy between Virginia and Texas
> data centers as I could see on Grafana [1]. As far as I understand the
> hardware (and software) are the same. So why is there this large
> difference? Rather than editing or BlazeGraph, could the issue be some
> form of network issue?
>

As pointed out by Lucas, this is expected. Due to how our GeoDNS works, we
see more traffic on eqiad than on codfw.

Thanks for the help!

   Guillaume

[1] https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Usage



>
>
> [1]
>
> https://grafana.wikimedia.org/d/00489/wikidata-query-service?panelId=8=1=now-7d=now
>
> /Finn
>
>
>
> On 14/11/2019 10:50, Guillaume Lederrey wrote:
> > Hello all!
> >
> > As you've probably noticed, the update lag on the public WDQS endpoint
> > [1] is not doing well [2], with lag climbing to > 12h for some servers.
> > We are tracking this on phabricator [3], subscribe to that task if you
> > want to stay informed.
> >
> > To be perfectly honest, we don't have a good short term solution. The
> > graph database that we are using at the moment (Blazegraph [4]) does not
> > easily support sharding, so even throwing hardware at the problem isn't
> > really an option.
> >
> > We are working on a few medium term improvements:
> >
> > * A dedicated updater service in Blazegraph, which should help increase
> > the update throughput [5]. Finger crossed, this should be ready for
> > initial deployment and testing by next week (no promise, we're doing the
> > best we can).
> > * Some improvement in the parallelism of the updater [6]. This has just
> > been identified. While it will probably also provide some improvement in
> > throughput, we haven't actually started working on that and we don't
> > have any numbers at this point.
> >
> > Longer term:
> >
> > We are hiring a new team member to work on WDQS. It will take some time
> > to get this person up to speed, but we should have more capacity to
> > address the deeper issues of WDQS by January.
> >
> > The 2 main points we want to address are:
> >
> > * Finding a triple store that scales better than our current solution.
> > * Better understand what are the use cases on WDQS and see if we can
> > provide a technical solution that is better suited. O

[Wikidata] Wikidata Query Service update lag

2019-11-14 Thread Guillaume Lederrey
Hello all!

As you've probably noticed, the update lag on the public WDQS endpoint [1]
is not doing well [2], with lag climbing to > 12h for some servers. We are
tracking this on phabricator [3], subscribe to that task if you want to
stay informed.

To be perfectly honest, we don't have a good short term solution. The graph
database that we are using at the moment (Blazegraph [4]) does not easily
support sharding, so even throwing hardware at the problem isn't really an
option.

We are working on a few medium term improvements:

* A dedicated updater service in Blazegraph, which should help increase the
update throughput [5]. Fingers crossed, this should be ready for initial
deployment and testing by next week (no promises, we're doing the best we
can).
* Some improvement in the parallelism of the updater [6]. This has just
been identified. While it will probably also provide some improvement in
throughput, we haven't actually started working on that and we don't have
any numbers at this point.

Longer term:

We are hiring a new team member to work on WDQS. It will take some time to
get this person up to speed, but we should have more capacity to address
the deeper issues of WDQS by January.

The 2 main points we want to address are:

* Finding a triple store that scales better than our current solution.
* Better understanding the use cases on WDQS, and seeing if we can
provide a technical solution that is better suited. Our intuition is that
some of the use cases that require synchronous (or quasi-synchronous)
updates would be better implemented outside of a triple store. Honestly, we
have no idea yet whether this makes sense or what those alternate solutions
might be.

Thanks a lot for your patience during this tough time!

   Guillaume


[1] https://query.wikidata.org/
[2]
https://grafana.wikimedia.org/d/00489/wikidata-query-service?orgId=1=1571131796906=1573723796906_name=wdqs=8
[3] https://phabricator.wikimedia.org/T238229
[4] https://blazegraph.com/
[5] https://phabricator.wikimedia.org/T212826
[6] https://phabricator.wikimedia.org/T238045

-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Tracking tasks for wikidata query service

2019-11-12 Thread Guillaume Lederrey
Hello all!

Warning: this is fairly administrative; if you don't care about how work is
being tracked for Wikidata Query Service, just ignore this email!

For those of you who follow the work we do on Wikidata Query Service, you
might be using the Wikidata Query Service sprint board [1]. And you might
even have realized that this board is now empty. Don't worry, we have not
stopped working, quite the opposite. The tasks have been moved to the
Search sprint board [2].

The team in charge of WDQS is the "Search Platform team" (yes I know,
pretty bad name since we also take care of WDQS). We are trying to share
the responsibility of WDQS more across the team, which means that we have
people working both on search and on WDQS. Tracking everything on the same
board means less overhead for those people and more visibility for the
whole team about what's going on in the WDQS world.

Note that the WDQS backlog hasn't moved (yet) [3].

Thanks for reading!

   Guillaume

[1] https://phabricator.wikimedia.org/project/view/1239/
[2] https://phabricator.wikimedia.org/project/view/1227/
[3] https://phabricator.wikimedia.org/project/view/891/

-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+2 / CEST
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata-tech] How to respect throttling and retry-after headers on the Wikidata Query Service.

2019-11-06 Thread Guillaume Lederrey
Hello!

Sorry for the late reply.

On Sat, Nov 2, 2019 at 12:31 PM Andra Waagmeester  wrote:

> Thanks for your prompt response. I wasn't filtering for 429, but only for
> 503, so that might explain it.
> This is my current countermeasure against overloading the system:
>
>
> https://github.com/SuLab/WikidataIntegrator/blob/v0.4.3/wikidataintegrator/wdi_core.py#L1179
>

With only a quick look at the code, it looks good enough to me. A few
things you might want to improve:

* L1148 [1]: use a default retry_after of 60 seconds instead of 30. That's
the upper bound of what our throttling will ask you for.
* L1186-L1189 [2]: in case of a 429, you can check the "Retry-After" header
to get a sleep value matching what our throttling will expect.
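Honoring those headers could look roughly like this stdlib sketch (this is not the WikidataIntegrator code itself; the 60-second fallback matches the value suggested above):

```python
import time
import urllib.error
import urllib.request

DEFAULT_RETRY_AFTER = 60  # upper bound of what the WDQS throttling asks for


def retry_after_seconds(headers, default=DEFAULT_RETRY_AFTER):
    """Parse a Retry-After header given in seconds; fall back to the default."""
    value = headers.get("Retry-After")
    try:
        return max(1, int(value))
    except (TypeError, ValueError):
        return default


def fetch_with_backoff(url, max_retries=3):
    """GET a URL, sleeping for the advertised Retry-After on 429/503 responses."""
    for _ in range(max_retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code in (429, 503):
                time.sleep(retry_after_seconds(err.headers))
            else:
                raise
    raise RuntimeError("still throttled after %d attempts" % max_retries)
```

Sleeping for exactly the advertised value (rather than a fixed guess) means the client backs off no more and no less than the server is asking for.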



>
>
> If you follow all that, you should be good. If you still see throttling /
>> ban, let us know. If you give me the User-Agent of your script and the time
>> at which you received the throttling / ban response, and I can have a look
>> into the logs.
>>
>>
> Where do I let you know? Is this email list the right place to do so?
>

This list is the right place. Or you can contact me directly if you want.
But others might benefit from this discussion being public.

[1]
https://github.com/SuLab/WikidataIntegrator/blob/v0.4.3/wikidataintegrator/wdi_core.py#L1148
[2]
https://github.com/SuLab/WikidataIntegrator/blob/v0.4.3/wikidataintegrator/wdi_core.py#L1186-L1189



> Regards,
>
> Andra
>
> ___
> Wikidata-tech mailing list
> Wikidata-tech@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
>


-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+2 / CEST
___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata] Contact for Wikidata Query Service related things

2019-09-12 Thread Guillaume Lederrey
Hello all!

I'm Guillaume, you might have read about me already. At the moment,
I'm the primary contact for things related to Wikidata Query Service.
Feel free to ping me on Phabricator, or via direct email if you think
I can help unblock something.

We are in the process of hiring 2 new engineers to work on WDQS; I'll
keep you posted when that happens!

Thanks all!

   Guillaume

-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+2 / CEST

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Performance and update versus query

2019-06-28 Thread Guillaume Lederrey
On Thu, Jun 27, 2019 at 4:37 PM Gerard Meijssen
 wrote:
>
> Hoi,
> The good news, the issue with my jobs has been isolated. It is a bug in the 
> software that occasionally manifests itself. Good because it has nothing to 
> do with performance at this time. Magnus has a ticket at Crossref [1] so that 
> will be fixed at some stage.
>
> The reason why I need to be certain about the functionality is that when 
> scientists find that their papers are well presented in Wikidata, they will 
> submit jobs for their data to be imported from ORCID into Wikidata.. Given 
> the number of scientists we already know about, this may have as a result of 
> many more jobs updating what we know of science. I have been asked to write 
> for the ORCID blog and that only makes sense when we can accomodate the 
> traffic.\

Predictions are hard, especially when about the future :)

It is difficult to tell you whether Wikidata / WDQS will be able to
handle this additional load without knowing what that load will look
like, both in terms of complexity and in terms of volume. That being
said, for WDQS, the capacity issues we have seen so far are usually
about peak traffic: a badly behaved bot starts sending way more traffic
than we usually have (either read or write traffic) and we start
lagging. Additional well-behaved clients are probably not going to be an
issue short term (but again, I'm just guessing).

Again, we are working on several performance improvements, and we will
hire an additional engineer to work on WDQS. For example, we're working
on custom code to process updates into Blazegraph, hoping that this will
help us reduce the load related to edits and reduce the update lag. We
won't know the actual impact until this is implemented and tested, but
we're doing our best.

I understand that my answer is vague enough to be disappointing, but
at least you have some additional context.

Have fun!

   Guillaume

> Thanks,
>  GerardM
>
> [1] https://github.com/MattsSe/crossref-rs/issues/5
>
>
>
> On Thu, 27 Jun 2019 at 10:39, Guillaume Lederrey  
> wrote:
>>
>> Hello!
>>
>> I'm not familiar with some of the issues you raised, but let me try a
>> few guesses...
>>
>> On Wed, Jun 26, 2019 at 8:02 AM Gerard Meijssen
>>  wrote:
>> >
>> > Hoi,
>> > The performance of the query update is getting worse. Questions about this 
>> > have been raised before. I do remember quality replies like it is not 
>> > exponential so there is no problem. However, here we are and there is a 
>> > problem.
>> >
>> > The problem is that I run batch jobs, batch jobs that do not run [1]. I 
>> > have the impression that they are put in some kind of suspended animation 
>> > by a person. These jobs are submitted by the SourceMD tool by Magnus, 
>> > Magnus is well known for being responsive to suggestions on how he can 
>> > improve them. So do not use as an argument that there is something wrong 
>> > with these job. At most it is acceptable for these run to put on some kind 
>> > of hold for the duration of a crisis and then there has to be a release.
>>
>> I'm not familiar with sourcemd, and the link you provided isn't very
>> clear on what the actual error is. I just guessing, but maybe sourcemd
>> has some assumptions about updates to WDQS being synchronous, or
>> quasi-synchronous. Another guess is that it might be subject to
>> throttling and not backing off appropriately, and maybe it ends up
>> being banned for some time. If anyone knows what user agent is used by
>> sourcemd, I can have a look into the WDQS logs to get more
>> information.
>>
>> > At the same time I notice that the reports indicating multiple items with 
>> > the same ORCiD id include items that should have been picked up by earlier 
>> > reports. I notice that query does not pick up existing items with an ORCid 
>> > id and creates new ones. For me this is an indication that Query is not 
>> > reliable.
>> >
>> > There is talk on the Wiki that there is no point in having fixed 
>> > descriptions in anything but English. What caused this discussion is the 
>> > sheer amount of updates needed just for one language. At the London 
>> > Wikimania this perceived need for fixed descriptions was discussed vis a 
>> > vis automated descriptions and as I recall the only argument for having 
>> > them at all was "standards" in relation to dumps. Yes, automated 
>> > descriptions may be cached and included in a dump.
>> >
>> > I have been asked to write for the ORCiD blog and thereby in effect plug 
>> > the relevance 

Re: [Wikidata] Performance and update versus query

2019-06-27 Thread Guillaume Lederrey
Hello!

I'm not familiar with some of the issues you raised, but let me try a
few guesses...

On Wed, Jun 26, 2019 at 8:02 AM Gerard Meijssen
 wrote:
>
> Hoi,
> The performance of the query update is getting worse. Questions about this 
> have been raised before. I do remember quality replies like it is not 
> exponential so there is no problem. However, here we are and there is a 
> problem.
>
> The problem is that I run batch jobs, batch jobs that do not run [1]. I have 
> the impression that they are put in some kind of suspended animation by a 
> person. These jobs are submitted by the SourceMD tool by Magnus, Magnus is 
> well known for being responsive to suggestions on how he can improve them. So 
> do not use as an argument that there is something wrong with these job. At 
> most it is acceptable for these run to put on some kind of hold for the 
> duration of a crisis and then there has to be a release.

I'm not familiar with sourcemd, and the link you provided isn't very
clear on what the actual error is. I'm just guessing, but maybe sourcemd
has some assumptions about updates to WDQS being synchronous, or
quasi-synchronous. Another guess is that it might be subject to
throttling without backing off appropriately, and maybe it ends up
being banned for some time. If anyone knows what user agent is used by
sourcemd, I can have a look into the WDQS logs to get more
information.

> At the same time I notice that the reports indicating multiple items with the 
> same ORCiD id include items that should have been picked up by earlier 
> reports. I notice that query does not pick up existing items with an ORCid id 
> and creates new ones. For me this is an indication that Query is not reliable.
>
> There is talk on the Wiki that there is no point in having fixed descriptions 
> in anything but English. What caused this discussion is the sheer amount of 
> updates needed just for one language. At the London Wikimania this perceived 
> need for fixed descriptions was discussed vis a vis automated descriptions 
> and as I recall the only argument for having them at all was "standards" in 
> relation to dumps. Yes, automated descriptions may be cached and included in 
> a dump.
>
> I have been asked to write for the ORCiD blog and thereby in effect plug the 
> relevance of the Scholia presentation for scientists. When I do, the number 
> of jobs like the ones I run will mushroom. It is why I have not put anything 
> forward so far because we cannot cope as it is.
>
> The issues I see is,
> * again to what extend can we grow our content, both for query and update for 
> the short medium and long term
> * will batch jobs like mine be able to complete

Honestly, I'm not sure what the issue is, so I can't assure you those
batches will be able to complete. What we can do is work together to
understand the issue and see what needs to be fixed.

> * can we ingest the attention when scholars discover how relevant Scholia is 
> for them, the subject they care for.
> * do we care that motivation of volunteers relies on the availability of 
> sufficient performance to do the tasks they care for.

It depends on who "we" is. I care, and I know that people on my team
care. That does not mean we will be able to magically fix everything,
but we're trying.


In more general terms, scaling Wikidata and Wikidata Query Service
will require challenging some of our assumptions. Workflows that
assume WDQS is updated synchronously will fail more and more often.
Throttling is becoming more and more important to the stability of the
service and to fair access to resources, so clients will need to be
able to smooth their load and back off appropriately.
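One common way for a client to smooth its load and back off is capped exponential backoff with jitter. This is a generic sketch of that pattern, not something WDQS specifically mandates; the base, cap, and retry count are illustrative choices.

```python
import random
import time


def backoff_delays(max_retries=5, base=1.0, cap=60.0, jitter=random.random):
    """Yield exponentially growing delays with jitter, capped at `cap` seconds."""
    for attempt in range(max_retries):
        # Jitter scales each delay between 50% and 100% of its nominal value,
        # which keeps many clients from retrying in lockstep.
        yield min(cap, base * (2 ** attempt)) * (0.5 + 0.5 * jitter())


def call_with_backoff(do_request, is_throttled, max_retries=5):
    """Retry `do_request` while `is_throttled(result)` is true, pausing between tries."""
    result = do_request()
    for delay in backoff_delays(max_retries):
        if not is_throttled(result):
            return result
        time.sleep(delay)
        result = do_request()
    raise RuntimeError("still throttled after backoff")
```

The cap keeps a persistently throttled client from sleeping for unbounded stretches, while the exponential growth quickly reduces its request rate when the service is struggling.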

Sorry not to have a direct solution to your current issues, but let's
try to find one!

  Have fun!

Guillaume

>
> Thanks,
>   Gerard
>
>
>
>
>
> [1] https://tools.wmflabs.org/sourcemd/?action=batches=GerardM
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata



-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+2 / CEST



Re: [Wikidata] Overload of query.wikidata.org

2019-06-17 Thread Guillaume Lederrey
No, there isn't any prioritization. Updates are guaranteed, as they stay in
the update queue if they could not be written, but both reads and writes are
impacted by resource saturation.

On Mon, 17 Jun 2019, 15:35 Gerard Meijssen, 
wrote:

> Hoi,
> Does this mean that the retrieval of data has priority over updates ?
> Thanks,
>   GerardM
> On Mon, 17 Jun 2019 at 14:52, Guillaume Lederrey 
> wrote:
>
>> Hello all!
>>
>> We now have an incident report [1] describing in more detail this
>> overload of Wikidata Query Service. The ban of python-requests is still
>> in effect and will remain so until we have a throttling solution in
>> place for generic user agents.
>>
>> Thanks all for your patience!
>>
>>Guillaume
>>
>>
>> [1]
>> https://wikitech.wikimedia.org/wiki/Incident_documentation/20190613-wdqs
>>
>> On Thu, Jun 13, 2019 at 7:52 PM Guillaume Lederrey
>>  wrote:
>> >
>> > Hello all!
>> >
>> > We are currently dealing with a bot overloading the Wikidata Query
>> > Service. This bot does not look actively malicious, but does create
>> > enough load to disrupt the service. As a stopgap measure, we had to
>> > deny access to all bots using the python-requests user agent.
>> >
>> > As a reminder, any bot should use a user agent that allows us to identify
>> > it [1]. If you have trouble accessing WDQS, please check that you are
>> > following those guidelines.
>> >
>> > More information and a proper incident report will be communicated as
>> > soon as we are on top of things again.
>> >
>> > Thanks for your understanding!
>> >
>> >Guillaume
>> >
>> >
>> > [1] https://meta.wikimedia.org/wiki/User-Agent_policy
>> >
>> > --
>> > Guillaume Lederrey
>> > Engineering Manager, Search Platform
>> > Wikimedia Foundation
>> > UTC+2 / CEST
>>
>>
>>
>> --
>> Guillaume Lederrey
>> Engineering Manager, Search Platform
>> Wikimedia Foundation
>> UTC+2 / CEST
>>


Re: [Wikidata] Overload of query.wikidata.org

2019-06-17 Thread Guillaume Lederrey
Hello all!

We now have an incident report [1] describing in more detail this
overload of Wikidata Query Service. The ban of python-requests is still
in effect and will remain so until we have a throttling solution in
place for generic user agents.

Thanks all for your patience!

   Guillaume


[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20190613-wdqs

On Thu, Jun 13, 2019 at 7:52 PM Guillaume Lederrey
 wrote:
>
> Hello all!
>
> We are currently dealing with a bot overloading the Wikidata Query
> Service. This bot does not look actively malicious, but does create
> enough load to disrupt the service. As a stopgap measure, we had to
> deny access to all bots using the python-requests user agent.
>
> As a reminder, any bot should use a user agent that allows us to identify
> it [1]. If you have trouble accessing WDQS, please check that you are
> following those guidelines.
>
> More information and a proper incident report will be communicated as
> soon as we are on top of things again.
>
> Thanks for your understanding!
>
>Guillaume
>
>
> [1] https://meta.wikimedia.org/wiki/User-Agent_policy
>
> --
> Guillaume Lederrey
> Engineering Manager, Search Platform
> Wikimedia Foundation
> UTC+2 / CEST



-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+2 / CEST



[Wikidata] Overload of query.wikidata.org

2019-06-13 Thread Guillaume Lederrey
Hello all!

We are currently dealing with a bot overloading the Wikidata Query
Service. This bot does not look actively malicious, but does create
enough load to disrupt the service. As a stopgap measure, we had to
deny access to all bots using the python-requests user agent.

As a reminder, any bot should use a user agent that allows us to identify
it [1]. If you have trouble accessing WDQS, please check that you are
following those guidelines.

More information and a proper incident report will be communicated as
soon as we are on top of things again.

Thanks for your understanding!

   Guillaume


[1] https://meta.wikimedia.org/wiki/User-Agent_policy
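For illustration, a header set following the shape the policy recommends (product/version plus a way to contact the operator); the bot name and contact details below are placeholders, and the `Accept` value assumes the standard SPARQL JSON results format:

```python
def wdqs_headers(bot_name, version, contact):
    """Build HTTP headers that identify a client per the Wikimedia
    User-Agent policy. All names and addresses are illustrative."""
    return {
        "User-Agent": f"{bot_name}/{version} ({contact})",
        "Accept": "application/sparql-results+json",
    }
```

Sending these headers (instead of a library default like `python-requests/x.y`) is what lets operators contact a bot's owner rather than block a whole user-agent class.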

-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+2 / CEST



Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Guillaume Lederrey
On Mon, Jun 10, 2019 at 9:03 PM Sebastian Hellmann
 wrote:
>
> Hi Guillaume,
>
> On 10.06.19 16:54, Guillaume Lederrey wrote:
>
> Hello!
>
> On Mon, Jun 10, 2019 at 4:28 PM Sebastian Hellmann
>  wrote:
>
> Hi Guillaume,
>
> On 06.06.19 21:32, Guillaume Lederrey wrote:
>
> Hello all!
>
> There have been a number of concerns raised about the performance and
> scaling of Wikidata Query Service. We share those concerns and we are
> doing our best to address them. Here is some info about what is going
> on:
>
> In an ideal world, WDQS should:
>
> * scale in terms of data size
> * scale in terms of number of edits
> * have low update latency
> * expose a SPARQL endpoint for queries
> * allow anyone to run any queries on the public WDQS endpoint
> * provide great query performance
> * provide a high level of availability
>
> Scaling graph databases is a "known hard problem", and we are reaching
> a scale where there are no obvious easy solutions to address all the
> above constraints. At this point, just "throwing hardware at the
> problem" is not an option anymore. We need to go deeper into the
> details and potentially make major changes to the current architecture.
> Some scaling considerations are discussed in [1]. This is going to take
> time.
>
> I am not sure how to evaluate this correctly. Scaling databases in general is 
> a "known hard problem" and graph databases a sub-field of it, which are 
> optimized for graph-like queries as opposed to column stores or relational 
> databases. If you say that "throwing hardware at the problem" does not help, 
> you are admitting that Blazegraph does not scale for what is needed by 
> Wikidata.
>
> Yes, I am admitting that Blazegraph (at least in the way we are using
> it at the moment) does not scale to our future needs. Blazegraph does
> have support for sharding (what they call "Scale Out"). And yes, we
> need to have a closer look at how that works. I'm not the expert here,
> so I won't even try to assert if that's a viable solution or not.
>
> Yes, sharding is what you need, I think, instead of replication. This is the 
> technique where data is repartitioned into more manageable chunks across 
> servers.

Well, we need sharding for scalability and replication for
availability, so we do need both. The hard problem is sharding.

> Here is a good explanation of it:
>
> http://vos.openlinksw.com/owiki/wiki/VOS/VOSArticleWebScaleRDF

Interesting read. I don't see how Virtuoso addresses data locality, it
looks like sharding of their RDF store is just hash based (I'm
assuming some kind of uniform hash). I'm not enough of an expert on
graph databases, but I doubt that a highly connected graph like
Wikidata will be able to scale reads without some way to address data
locality. Obviously, this needs testing.
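The data-locality concern can be sketched in a few lines (purely illustrative Python, not how Blazegraph or Virtuoso actually place data): with uniform hash placement, a node's neighbours scatter across shards, so a two-hop traversal fans out across the cluster.

```python
import zlib


def shard_of(node, n_shards):
    """Uniform hash placement: cheap and well balanced, but ignores the
    graph structure. crc32 stands in for whatever stable hash a real
    store would use."""
    return zlib.crc32(node.encode("utf-8")) % n_shards


def shards_touched(triples, start, n_shards):
    """Shards a two-hop query from `start` must contact: the shard
    holding `start`'s triples, plus the shard of every neighbour whose
    own triples the second hop needs."""
    touched = {shard_of(start, n_shards)}
    for s, p, o in triples:
        if s == start:
            touched.add(shard_of(o, n_shards))
    return touched
```

For a highly connected node the set of touched shards quickly approaches the whole cluster, which is why data locality matters for read scaling.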

> http://docs.openlinksw.com/virtuoso/ch-clusterprogramming/
>
>
> Sharding, scale-out or repartitioning is a classical enterprise feature for 
> Open-source databases. I am rather surprised that Blazegraph is full GPL 
> without an enterprise edition. But then they really sounded like their goal 
> as a company was to be bought by a bigger fish, in this case Amazon Web 
> Services. What is their deal? They are offering support?
>
> So if you go open-source, I think you will have a hard time finding good free 
> databases sharding/repartition. FoundationDB as proposed in the grant [1]is 
> from Apple
>
> [1] https://meta.wikimedia.org/wiki/Grants:Project/WDQS_On_FoundationDB
>
>
> I mean try the sharding feature. At some point though it might be worth 
> considering to go enterprise. Corporate Open Source often has a twist.

Closed source is not an option. We have strong open source
requirements to deploy anything in our production environment.

> Just a note here: Virtuoso is also a full RDBMS, so you could probably keep 
> wikibase db in the same cluster and fix the asynchronicity. That is also true 
> for any mappers like Sparqlify: http://aksw.org/Projects/Sparqlify.html 
> However, these shift the problem, then you need a sharded/repartitioned 
> relational database

There is no plan to move the Wikibase storage out of MySQL at the
moment. In any case, having a low coupling between the primary storage
for wikidata and a secondary storage for complex querying is a sound
architectural principle. This asynchronous update process is most
probably going to stay in place, just because it makes a lot of sense.

Thanks for the discussion so far! It is always interesting to have outside ideas!

   Have fun!

 Guillaume

>
> All the best,
>
> Sebastian
>
>
>
> From [1]:
>
> At the moment, each WDQS cluster is a group of independent servers, sharing
> nothing, with each server independently updated and each server holding a
> full data set.

Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Guillaume Lederrey
Hello!

On Mon, Jun 10, 2019 at 4:28 PM Sebastian Hellmann
 wrote:
>
> Hi Guillaume,
>
> On 06.06.19 21:32, Guillaume Lederrey wrote:
>
> Hello all!
>
> There have been a number of concerns raised about the performance and
> scaling of Wikidata Query Service. We share those concerns and we are
> doing our best to address them. Here is some info about what is going
> on:
>
> In an ideal world, WDQS should:
>
> * scale in terms of data size
> * scale in terms of number of edits
> * have low update latency
> * expose a SPARQL endpoint for queries
> * allow anyone to run any queries on the public WDQS endpoint
> * provide great query performance
> * provide a high level of availability
>
> Scaling graph databases is a "known hard problem", and we are reaching
> a scale where there are no obvious easy solutions to address all the
> above constraints. At this point, just "throwing hardware at the
> problem" is not an option anymore. We need to go deeper into the
> details and potentially make major changes to the current architecture.
> Some scaling considerations are discussed in [1]. This is going to take
> time.
>
> I am not sure how to evaluate this correctly. Scaling databases in general is 
> a "known hard problem" and graph databases a sub-field of it, which are 
> optimized for graph-like queries as opposed to column stores or relational 
> databases. If you say that "throwing hardware at the problem" does not help, 
> you are admitting that Blazegraph does not scale for what is needed by 
> Wikidata.

Yes, I am admitting that Blazegraph (at least in the way we are using
it at the moment) does not scale to our future needs. Blazegraph does
have support for sharding (what they call "Scale Out"). And yes, we
need to have a closer look at how that works. I'm not the expert here,
so I won't even try to assert if that's a viable solution or not.

> From [1]:
>
> At the moment, each WDQS cluster is a group of independent servers, sharing 
> nothing, with each server independently updated and each server holding a 
> full data set.
>
> Then it is not a "cluster" in the sense of databases. It is more a redundancy 
> architecture like RAID 1. Is this really how BlazeGraph does it? Don't they 
> have a proper cluster solution, where they repartition data across servers? 
> Or are these independent servers a Wikimedia staff homebuild?

It all depends on your definition of a cluster. We have groups of
machine collectively serving some coherent traffic, but each machine
is completely independent from others. So yes, the comparison to RAID1
is adequate.

> Some info here:
>
> - We evaluated some stores according to their performance: 
> http://www.semantic-web-journal.net/content/evaluation-metadata-representations-rdf-stores-0
>   "Evaluation of Metadata Representations in RDF stores"

Thanks for the link! That looks quite interesting!

> - Virtuoso has proven quite useful. I don't want to advertise here, but the 
> thing they have going for DBpedia uses ridiculous hardware, i.e. 64GB RAM and 
> it is also the OS version, not the professional with clustering and 
> repartition capability. So we are playing the game since ten years now: 
> Everybody tries other databases, but then most people come back to virtuoso. 
> I have to admit that OpenLink is maintaining the hosting for DBpedia 
> themselves, so they know how to optimise. They normally do large banks as 
> customers with millions of write transactions per hour. In LOD2 they also 
> implemented column store features with MonetDB and repartitioning in clusters.

I'm not entirely sure how to read the above (and a quick look at
virtuoso website does not give me the answer either), but it looks
like the sharding / partitioning options are only available in the
enterprise version. That probably makes it a non starter for us.

> - I recently heard a presentation from ArangoDB and they had a good cluster 
> concept as well, although I don't know anybody who tried it. The slides 
> seemed to make sense.

Nice, another one to add to our list of options to test.

> All the best,
>
> Sebastian
>
>
>
>
> Realistically, addressing all of the above constraints is unlikely to
> ever happen. Some of the constraints are non-negotiable: if we can't
> keep up with Wikidata in terms of data size or number of edits, it does
> not make sense to address query performance. On some constraints, we
> will probably need to compromise.
>
> For example, the update process is asynchronous. It is by nature
> expected to lag. In the best case, this lag is measured in minutes,
> but can climb to hours occasionally. This is a case of prioritizing
stability and correctness (ingesting all edits) over update latency.

[Wikidata] Scaling Wikidata Query Service

2019-06-06 Thread Guillaume Lederrey
Hello all!

There have been a number of concerns raised about the performance and
scaling of Wikidata Query Service. We share those concerns and we are
doing our best to address them. Here is some info about what is going
on:

In an ideal world, WDQS should:

* scale in terms of data size
* scale in terms of number of edits
* have low update latency
* expose a SPARQL endpoint for queries
* allow anyone to run any queries on the public WDQS endpoint
* provide great query performance
* provide a high level of availability

Scaling graph databases is a "known hard problem", and we are reaching
a scale where there are no obvious easy solutions to address all the
above constraints. At this point, just "throwing hardware at the
problem" is not an option anymore. We need to go deeper into the
details and potentially make major changes to the current architecture.
Some scaling considerations are discussed in [1]. This is going to take
time.

Realistically, addressing all of the above constraints is unlikely to
ever happen. Some of the constraints are non-negotiable: if we can't
keep up with Wikidata in terms of data size or number of edits, it does
not make sense to address query performance. On some constraints, we
will probably need to compromise.

For example, the update process is asynchronous. It is by nature
expected to lag. In the best case, this lag is measured in minutes,
but can climb to hours occasionally. This is a case of prioritizing
stability and correctness (ingesting all edits) over update latency.
And while we can work to reduce the maximum latency, this will still
be an asynchronous process and needs to be considered as such.

We currently have one Blazegraph expert working with us to address a
number of performance and stability issues. We
are planning to hire an additional engineer to help us support the
service in the long term. You can follow our current work in Phabricator [2].

If anyone has experience with scaling large graph databases, please
reach out to us, we're always happy to share ideas!

Thanks all for your patience!

   Guillaume

[1] https://wikitech.wikimedia.org/wiki/Wikidata_query_service/ScalingStrategy
[2] https://phabricator.wikimedia.org/project/view/1239/

-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+2 / CEST



Re: [Wikidata] minimal hardware requirements for loading wikidata dump in Blazegraph

2019-06-04 Thread Guillaume Lederrey
On Tue, Jun 4, 2019 at 3:14 PM Vi to  wrote:
>
> AFAIR it's a double Xeon E5-2620 v3.
> With modern CPUs frequency is not so significant.

Our latest batch of servers are: Intel(R) Xeon(R) CPU E5-2620 v4 @
2.10GHz (so v4 instead of v3, but the difference is probably minimal).

> Vito
>
> On Tue, 4 Jun 2019 at 13:00, Adam Sanchez  wrote:
>>
>> Thanks Guillaume!
>> One question more, what is the CPU frequency (GHz)?
>>
>> On Tue, 4 Jun 2019 at 12:25, Guillaume Lederrey
>>  wrote:
>> >
>> > On Tue, Jun 4, 2019 at 12:18 PM Adam Sanchez  wrote:
>> > >
>> > > Hello,
>> > >
>> > > Does somebody know the minimal hardware requirements (disk size and
>> > > RAM) for loading wikidata dump in Blazegraph?
>> >
>> > The actual hardware requirements will depend on your use case. But for
>> > comparison, our production servers are:
>> >
>> > * 16 cores (hyper threaded, 32 threads)
>> > * 128G RAM
>> > * 1.5T of SSD storage
>> >
>> > > The downloaded dump file wikidata-20190513-all-BETA.ttl is 379G.
>> > > The bigdata.jnl file which stores all the triples data in Blazegraph
>> > > is 478G but still growing.
>> > > I had 1T disk but is almost full now.
>> >
>> > The current size of our jnl file in production is ~670G.
>> >
>> > Hope that helps!
>> >
>> > Guillaume
>> >
>> > > Thanks,
>> > >
>> > > Adam
>> > >
>> >
>> >
>> >
>> > --
>> > Guillaume Lederrey
>> > Engineering Manager, Search Platform
>> > Wikimedia Foundation
>> > UTC+2 / CEST
>> >



-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+2 / CEST



Re: [Wikidata] minimal hardware requirements for loading wikidata dump in Blazegraph

2019-06-04 Thread Guillaume Lederrey
On Tue, Jun 4, 2019 at 12:18 PM Adam Sanchez  wrote:
>
> Hello,
>
> Does somebody know the minimal hardware requirements (disk size and
> RAM) for loading wikidata dump in Blazegraph?

The actual hardware requirements will depend on your use case. But for
comparison, our production servers are:

* 16 cores (hyper threaded, 32 threads)
* 128G RAM
* 1.5T of SSD storage

> The downloaded dump file wikidata-20190513-all-BETA.ttl is 379G.
> The bigdata.jnl file which stores all the triples data in Blazegraph
> is 478G but still growing.
> I had 1T disk but is almost full now.

The current size of our jnl file in production is ~670G.

Hope that helps!

Guillaume

> Thanks,
>
> Adam
>



-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+2 / CEST



Re: [Wikidata] Request

2019-04-10 Thread Guillaume Lederrey
Hello!

It isn't entirely clear from your email what kind of data you are
looking for, or what endpoint you are using to get this data. If you
need to extract large amounts of data from Wikidata, you should
probably start from the dumps [1], not from API calls. Without knowing
more about your context, it is hard to recommend anything.

Good luck for your project!

   Guillaume


[1] https://www.wikidata.org/wiki/Wikidata:Database_download
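As a sketch of the dump-based approach (assuming the line-per-entity layout of the `.json.gz` dumps; what to do with each entity is left to the caller):

```python
import gzip
import json


def iter_dump_entities(path):
    """Stream entities from a Wikidata JSON dump without loading it
    into memory. Assumes the usual layout: an opening '[' line, one
    entity object per line (comma-terminated), and a closing ']'."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            yield json.loads(line)
```

Because the generator yields one entity at a time, even the full multi-hundred-gigabyte dump can be filtered with constant memory, instead of hammering the query endpoints.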

On Wed, Apr 10, 2019 at 9:14 AM Ahmed Mamdouh  wrote:
>
> Greetings All,
>
> Hope this e-mail finds you well. I am currently doing a master's project in NLP 
> at JKU under the supervision of Prof. Bruno Buchberger, the famous Austrian 
> mathematician.
>
> I am facing a problem where I can’t get enough data for my project. So is 
> there anything that can be done to extend the limit of queries, as they 
> time out?
>
> Thanks in advance,
> Mamdouh



-- 
Guillaume Lederrey
Operations Engineer, Search Platform
Wikimedia Foundation
UTC+1 / CET



Re: [Wikidata] Question: Suddenly not able to perform SPARQL query (IP blocked?)

2019-02-22 Thread Guillaume Lederrey
Hello!

You might be hitting our throttling limits [1]. But that throttling
should result in an HTTP 429 (Too Many Requests) and not in an
ECONNRESET. So there is something fishy here. I assume that your
application is following our usual policies [2] and that you are using
a custom user agent string to identify it. If that's the case, can you
tell us what UA you're using, and I can have a look into the logs.

Thanks!

   Guillaume


[1] 
https://www.mediawiki.org/wiki/Wikidata_Query_Service/Implementation#Usage_constraints
[2] https://wikitech.wikimedia.org/wiki/Robot_policy#User_agent

On Fri, Feb 22, 2019 at 9:57 AM Matthew Moy de Vitry
 wrote:
>
> Hello everyone,
> I am developing an application that requires many SPARQL queries to the 
> wikidata server (1k-2k each time I test it). The application front-end is 
> visible here: https://beta.water-fountains.org
>
> I am just making SPARQL queries from a nodeJS server to get information about 
> fountains.
> This has been working fine for about a year, until yesterday, when the 
> connection started returning ECONNRESET or "Socket hang up". If I run the 
> query from the browser instead of the server it works fine, and if I use a 
> vpn, the NodeJS server is able perform the queries without issue. Here is an 
> example query:  link
>
> Has anyone else had such an issue? Thanks!
> Matthew



-- 
Guillaume Lederrey
Operations Engineer, Search Platform
Wikimedia Foundation
UTC+1 / CET



[Wikidata] Data corruption on 2 Wikidata Query Service servers

2019-01-08 Thread Guillaume Lederrey
Hello all!

We are having some issues with 2 of the Wikidata Query Service
servers. So far, the issue looks like data corruption, probably
related to an issue in Blazegraph itself (the database engine behind
Wikidata Query Service). The issue prevents updates to the data, but
reads are unaffected as far as we can tell.

The 2 affected servers are part of the internal WDQS cluster, so the
public wdqs endpoint [1] is not affected. Data is lagging on the
internal eqiad endpoint, so Mediawiki functionalities that use WDQS
are at the moment not seeing the latest updates to Wikidata.

We are reaching out to the Blazegraph team via Github [2] and via
private contacts that we have. We hope to identify the root cause of
the issue so that we can fix it for good, but this looks like a hard
problem. Failing that, we will reimport the full data set.

You can follow the upstream issue on Github [2] and on Phabricator on
our side [3].

Sorry for the inconvenience and thank you for your patience!

   Have fun,

 Guillaume


[1] https://query.wikidata.org/
[2] https://github.com/blazegraph/database/issues/114
[3] https://phabricator.wikimedia.org/T213134

-- 
Guillaume Lederrey
Operations Engineer, Search Platform
Wikimedia Foundation
UTC+1 / CET



Re: [Wikidata] Raising alerting threshold for Wikidata Query Service updater lag

2018-11-05 Thread Guillaume Lederrey
On Sat, Nov 3, 2018 at 5:58 PM Gerard Meijssen
 wrote:
>
> Hoi,
> We have found that duplicate items are created for publications, and so far 
> the only reason identified is that the lag time before new data becomes 
> available is so bad. The notion that there is no practical impact is 
> therefore wrong.
>
> When WDQS results are considered only for use outside Wikidata, fine. However, 
> as WDQS is used for the development of new and improved data, you basically 
> indicate / accept failure by changing / accepting the status quo.
> Thanks,
>   Gerard

Thanks for the precision! There are more discussions on the subject in
the related Phabricator task [1]. We don't have a good solution yet, but your
input on that task would be appreciated, if only to make your use
cases visible.

Have fun!

   Guillaume

[1] https://phabricator.wikimedia.org/T199228


> On Fri, 2 Nov 2018 at 14:34, Guillaume Lederrey  
> wrote:
>>
>> Hello all!
>>
>> TL;DR: alert levels on Wikidata Query Service have been increased; any
>> Icinga alert should now be treated seriously.
>>
>> As you might know already, we're having trouble keeping up with updates
>> on the public Wikidata Query Service cluster. We're working on it, but
>> it is a hard problem. At the same time, known use cases of the public
>> WDQS endpoint don't depend on a short update lag.
>>
>> As such, we have increased the alerting threshold on update lag for
>> this public cluster to 6h / 12h for WARNING / CRITICAL [1]. This does
>> not actually change the quality of service of WDQS public endpoints,
>> but somewhat aligns expectations and reality. It also means that all
>> alerts raised by WDQS should now be treated seriously and not ignored
>> as known issues with no immediate solution.
>>
>> At the same time, we're having a conversation about what the service
>> level of that cluster should be [2]. Feel free to join that
>> conversation if you are impacted (or just if you have interesting
>> thoughts on the subject).
>>
>> Thanks for your patience,
>>
>>Guillaume
>>
>>
>> [1] https://gerrit.wikimedia.org/r/c/operations/puppet/+/470819
>> [2] https://phabricator.wikimedia.org/T199228
>>
>> --
>> Guillaume Lederrey
>> Operations Engineer, Search Platform
>> Wikimedia Foundation
>> UTC+1 / CET
>>



-- 
Guillaume Lederrey
Operations Engineer, Search Platform
Wikimedia Foundation
UTC+2 / CEST



[Wikidata] Raising alerting threshold for Wikidata Query Service updater lag

2018-11-02 Thread Guillaume Lederrey
Hello all!

TL;DR: alert levels on Wikidata Query Service have been increased; any
Icinga alert should now be treated seriously.

As you might know already, we're having trouble keeping up with updates
on the public Wikidata Query Service cluster. We're working on it, but
it is a hard problem. At the same time, known use cases of the public
WDQS endpoint don't depend on a short update lag.

As such, we have increased the alerting threshold on update lag for
this public cluster to 6h / 12h for WARNING / CRITICAL [1]. This does
not actually change the quality of service of WDQS public endpoints,
but somewhat aligns expectations and reality. It also means that all
alerts raised by WDQS should now be treated seriously and not ignored
as known issues with no immediate solution.

At the same time, we're having a conversation about what the service
level of that cluster should be [2]. Feel free to join that
conversation if you are impacted (or just if you have interesting
thoughts on the subject).

Thanks for your patience,

   Guillaume


[1] https://gerrit.wikimedia.org/r/c/operations/puppet/+/470819
[2] https://phabricator.wikimedia.org/T199228

--
Guillaume Lederrey
Operations Engineer, Search Platform
Wikimedia Foundation
UTC+1 / CET



[Wikidata] WDQS timeout and slowdown - Incident report

2018-06-25 Thread Guillaume Lederrey
Hello!

As you might already know, Wikidata Query Service has been misbehaving
in the last 24 hours. Our public SPARQL endpoint [1] was slow and
throwing timeouts. Sadly, exposing a public SPARQL endpoint is a hard
problem and we don't have a final solution to this. Still, we have made some
improvements. Have a look at the incident report [2] if you want
details.

I also started to write a runbook for WDQS [3]. This should be
interesting mostly to our SRE team, but feel free to also have a look
and suggest improvements / clarifications.

Note that our internal WDQS endpoint was stable during that time (as expected).

Thanks for your help and your patience!

  Guillaume

[1] https://query.wikidata.org/
[2] https://wikitech.wikimedia.org/wiki/Incident_documentation/20180625-wdqs
[3] https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook

-- 
Guillaume Lederrey
Operations Engineer, Search Platform
Wikimedia Foundation
UTC+2 / CEST


