On 5/29/24 11:43, Walter Underwood wrote:
I’ve done three kinds of sanity checks/fixes to avoid performance problems.
1. Prevent deep paging. Have to do this every time. When a request comes in for
a page past 50, it gets rewritten to the 50th page.
2. Limit the size of queries. With homework help, we had people pasting in 800-word
queries. Those get trimmed to 40 words. In a test with a few thousand real user
queries, the results for 40 words were nearly the same as those for 80. Google
only does 32 words.
3. Remove all syntax characters (or replace them with spaces). This gets
tricky, because things like “-” are OK inside a word. A more conservative
approach is to remove “*” and “?”, so you prevent script kiddie queries like
“a* b* c* d* e* f* …”
Thanks, everyone.
For #3 I think I'll steal the regexes from solarium, as Thomas suggested.
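(Not something Thomas actually posted, just my rough sketch of what those regexes
boil down to: backslash-escape the Lucene query syntax characters, or, as the more
conservative option above, drop the wildcards outright. A quick Java version, with
the character list written from memory:)

    import java.util.regex.Pattern;

    public final class QueryEscaper {
        // Lucene/Solr query syntax characters; "&&" and "||" are two-character operators.
        private static final Pattern SPECIAL =
            Pattern.compile("(&&|\\|\\||[+\\-!(){}\\[\\]^\"~*?:\\\\/])");

        /** Backslash-escape every syntax character in a user-supplied term. */
        public static String escape(String term) {
            return SPECIAL.matcher(term).replaceAll("\\\\$1");
        }

        /** The more conservative option: just drop the wildcard characters. */
        public static String stripWildcards(String q) {
            return q.replaceAll("[*?]", "");
        }
    }

(If I remember right, SolrJ also ships ClientUtils.escapeQueryChars(), which does
essentially the same escaping, should we ever want it on the Java side.)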
#1 & 2 aren't our problem ATM but are worth adding, while I'm at it.
I have doubts about reconfiguring the logging as per Misha's suggestion:
it'll save some disk space but exceptions themselves will still be there
with all their overhead... and disk is the cheapest part of it all.
And yeah, we are using the standard parser. It may be worth switching to
e.g. edismax, but that comes with lots of regression testing (and
finding all the places to test first), making it a much bigger project.
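(If we ever do try it, edismax can at least be flipped on per request before touching
solrconfig.xml, which would make the regression testing more incremental. A SolrJ
sketch — the qf fields are invented:)

    import java.io.IOException;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public final class EdismaxTrial {
        /** Run a single query through edismax without changing the default parser. */
        public static QueryResponse query(SolrClient client, String userInput)
                throws SolrServerException, IOException {
            SolrQuery q = new SolrQuery(userInput);
            q.set("defType", "edismax");   // per-request parser override
            q.set("qf", "title^2 body");   // hypothetical field boosts
            return client.query(q);
        }
    }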
Thanks again,
Dima