On 5/29/24 11:43, Walter Underwood wrote:
I’ve done three kinds of sanity checks/fixes to avoid performance problems.

1. Prevent deep paging. Have to do this every time. When a request comes in for 
a page past 50, it gets rewritten to the 50th page.

2. Limit the size of queries. With homework help, we had people pasting in 800 
word queries. Those get trimmed to 40 words. The results for 40 words were 
nearly the same as those for 80 words in a test of a few thousand real user 
queries. Google only does 32.

3. Removing all syntax characters (or replacing them with spaces). This gets 
tricky, because things like “-” are OK inside a word. A more conservative 
approach is to remove “*” and “?”, so you prevent script kiddie queries like 
“a* b* c* d* e* f* …”
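
As a rough illustration of checks 1 and 2, here is a minimal Python sketch. The limits of 50 pages and 40 words come from the list above; the function names are made up for the example, and it assumes paging is sent to Solr as the usual start/rows parameters.

```python
MAX_PAGE = 50   # deepest page served (check 1)
MAX_WORDS = 40  # longest query passed on, in words (check 2)


def clamp_paging(page: int, rows: int) -> tuple[int, int]:
    """Return (start, rows), rewriting any page past MAX_PAGE to MAX_PAGE.

    Pages are 1-based here; values below 1 are treated as page 1.
    """
    page = max(1, min(page, MAX_PAGE))
    return (page - 1) * rows, rows


def trim_query(q: str) -> str:
    """Keep only the first MAX_WORDS whitespace-separated words."""
    return " ".join(q.split()[:MAX_WORDS])
```

With 10 rows per page, clamp_paging(200, 10) gives a start offset of 490, the same as a request for page 50.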

Thanks, everyone.

For #3 I think I'll steal the regexes from Solarium, as Thomas suggested. #1 & #2 aren't our problem ATM but are worth adding while I'm at it.
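
For reference, a rough Python sketch of the two flavours of #3. These are not Solarium's regexes, just hand-written patterns based on the query parser's documented special characters, and the function names are invented for the example. The conservative version drops only "*" and "?"; the aggressive one replaces all syntax characters with spaces but keeps a "-" that sits inside a word.

```python
import re

# Conservative: wildcards only.
WILDCARDS = re.compile(r"[*?]")

# Aggressive: query syntax characters except "-", which is handled below.
SYNTAX = re.compile(r'[+!(){}\[\]^"~*?:\\/]|&&|\|\|')

# A hyphen that is not between two word characters (e.g. a leading "-term").
LONE_HYPHEN = re.compile(r"(?<!\w)-|-(?!\w)")


def strip_wildcards(q: str) -> str:
    """Replace '*' and '?' with spaces and collapse whitespace."""
    return " ".join(WILDCARDS.sub(" ", q).split())


def strip_syntax(q: str) -> str:
    """Replace syntax characters with spaces, keeping in-word hyphens."""
    q = SYNTAX.sub(" ", q)
    q = LONE_HYPHEN.sub(" ", q)  # run after SYNTAX so "a-*" loses its hyphen too
    return " ".join(q.split())
```

So strip_wildcards("a* b* c*") yields "a b c", and strip_syntax('title:"foo-bar" -baz') yields "title foo-bar baz".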

I have doubts about reconfiguring the logging as per Misha's suggestion: it'll save some disk space but exceptions themselves will still be there with all their overhead... and disk is the cheapest part of it all.

And yeah, we are using the standard parser. It may be worth switching to e.g. edismax, but that comes with lots of regression testing (and finding all the places to test first), making it a much bigger project.

Thanks again,
Dima
