Smalyshev added a comment.

Separate service - yes, though ideally I'd just use the same URL

If it will be a separate service, it will have a separate URL. Having two distinct services under the same URL would not really work well with how our LVS is set up, unless we did some complex tricks with redirects, which I don't really want to get into.

LIMIT worked for me; OFFSET, of course, wouldn't work any better than the original query once the numbers get bigger, since OFFSET essentially has to run through the whole data set up to the offset. If you need the whole dataset, LIMIT/OFFSET only makes the issue worse, quadratically so. However, if dealing with a subset of the data is enough for the use case, then LIMIT might help. Of course, it depends on the use case (e.g. I am still not sure what the 2M results are being used for, specifically), but LIMIT/OFFSET is certainly not going to solve this one.
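To illustrate the pattern (and where the quadratic cost comes from), here is a minimal Python sketch of paging a query via LIMIT/OFFSET against the public endpoint. The example query, page size and user agent are purely illustrative; this is just the general shape of the approach, not something I'm proposing to run as-is:

```
# Minimal sketch of LIMIT/OFFSET paging against the public WDQS SPARQL endpoint.
# Each page with OFFSET n still forces the engine to compute and skip the first
# n solutions, so fetching N results in pages of size K touches on the order of
# N^2 / (2K) solutions overall - which is why this gets worse, not better, for
# large result sets.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 }"  # illustrative query
PAGE_SIZE = 10000  # illustrative page size


def fetch_all(query, page_size=PAGE_SIZE):
    offset = 0
    while True:
        paged = "%s LIMIT %d OFFSET %d" % (query, page_size, offset)
        resp = requests.get(
            ENDPOINT,
            params={"query": paged, "format": "json"},
            headers={"User-Agent": "limit-offset-demo/0.1 (example)"},
            timeout=60,
        )
        resp.raise_for_status()
        rows = resp.json()["results"]["bindings"]
        if not rows:
            break
        for row in rows:
            yield row
        offset += page_size
```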

For my own purposes, we could even have a Wikidata page with "named queries"

Making a proposal for something like this has long been on my TODO list - see also https://commons.wikimedia.org/wiki/User:TabulistBot - but I haven't gotten it into working condition due to lack of time. I could pick it up again.

Transparent offlining - queries with a keyword get run with increased timeout (say, 10min), but only one of them at a time

I don't think this is feasible - there's no real way to ensure the "only one" part, since the servers are completely independent, and even "one per server" is not completely trivial, though it can probably be done with some query-parameter magic (rough sketch below). More worrying is that this requires support for very long requests through the whole pipeline, from the frontend down to Blazegraph, and I'm not sure how well that would actually work - HTTP is not really meant for hour-long requests, and it would be a real shame to run one only to discover there's no way to deliver the data back because the connection died in the meantime.
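For what that "query-parameter magic" could look like on a single server, here is a rough, hypothetical sketch: a small proxy that grants the extended timeout only when a per-process slot is free. The backend URL, the "offline=1" parameter and the timeout values are all made up for illustration; this says nothing about how WDQS or LVS are actually set up, and it only gates one process, which is exactly why the cluster-wide "only one" guarantee is the hard part:

```
# Hypothetical "one extended query per server" gate in front of a SPARQL backend.
import threading

import requests
from flask import Flask, Response, request

app = Flask(__name__)

QUERY_BACKEND = "http://localhost:9999/bigdata/namespace/wdq/sparql"  # hypothetical
NORMAL_TIMEOUT = 60       # seconds, illustrative
EXTENDED_TIMEOUT = 600    # the "10 min" from the proposal

# One extended-timeout query at a time - but only within this process.
offline_slot = threading.BoundedSemaphore(1)


@app.route("/sparql")
def sparql():
    query = request.args.get("query", "")
    wants_offline = request.args.get("offline") == "1"  # hypothetical keyword
    timeout = NORMAL_TIMEOUT
    if wants_offline:
        if not offline_slot.acquire(blocking=False):
            return Response("Another long-running query is in progress", status=429)
        timeout = EXTENDED_TIMEOUT
    try:
        resp = requests.get(
            QUERY_BACKEND,
            params={"query": query, "format": "json"},
            timeout=timeout,
        )
        return Response(
            resp.content,
            status=resp.status_code,
            content_type=resp.headers.get("Content-Type", "application/json"),
        )
    finally:
        if wants_offline:
            offline_slot.release()
```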

Additionally, this doesn't really solve the issue, as a single query, provided it's big enough, can take down the whole server (Java handles OOM really badly; this is mostly mitigated by the fact that most queries time out before they can clog enough memory, but if we remove the timeout the risk grows significantly). If we had this in some kind of gated setup, it'd be fine to take that risk, but exposing it to the Internet, where people are not exactly known for refraining from breaking things just for lulz, seems too big a risk to me.

So, in general, this still needs some thought... I'd like to hear more details of what exactly is being done with those 2M results; maybe I'll have some other ideas.


TASK DETAIL
https://phabricator.wikimedia.org/T179879
