I’ve done three kinds of sanity checks/fixes to avoid performance problems.
1. Prevent deep paging. This one is always needed: when a request comes in for a page past the 50th, it is rewritten to ask for the 50th page.

2. Limit the size of queries. With homework help, we had people pasting in 800-word queries. Those get trimmed to 40 words. In a test on a few thousand real user queries, the results for 40 words were nearly the same as those for 80 words. Google only allows 32.

3. Remove all syntax characters (or replace them with spaces). This gets tricky, because things like "-" are OK inside a word. A more conservative approach is to remove just "*" and "?", which prevents script-kiddie queries like "a* b* c* d* e* f* ...". (A rough sketch of all three checks follows the quoted message below.)

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 29, 2024, at 7:11 AM, Dmitri Maziuk <dmitri.maz...@gmail.com> wrote:
>
> Hi all,
>
> our website has a search box that essentially passes its contents to Solr
> without any massaging. This works fine 99% of the time; the other 1% is when
> a misbehaving bot hits it and tries stuffing all sorts of crap in there.
>
> Then bad things happen: Java's overly verbose exception stack traces fill up
> the disk faster than the logs are rotated, CPU load spikes, etc.
>
> So, question: does anyone know of a validator/sanitizer we can use to clean
> up the terms before passing them on to Solr? My google-fu fails to find one.
>
> TIA,
> Dima
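A rough sketch of all three checks in plain Java. The class name, method names, and whitespace handling are my own assumptions; the 50-page and 40-word caps are the ones described above. Adapt it to wherever you build the Solr request.

import java.util.Arrays;
import java.util.regex.Pattern;

/** Sketch of the three sanity checks; run on raw input before it reaches Solr. */
public final class QuerySanitizer {

    private static final int MAX_PAGE = 50;   // check 1: deep-paging cap
    private static final int MAX_WORDS = 40;  // check 2: query-length cap

    // Check 3 (conservative version): the wildcard characters that turn a
    // query into an expensive prefix/pattern search.
    private static final Pattern WILDCARDS = Pattern.compile("[*?]");

    /** Check 1: rewrite any request past page 50 to ask for page 50. */
    public static int clampPage(int requestedPage) {
        return Math.min(requestedPage, MAX_PAGE);
    }

    /** Check 2: keep only the first 40 whitespace-separated words. */
    public static String trimWords(String query) {
        String[] words = query.trim().split("\\s+");
        if (words.length <= MAX_WORDS) {
            return String.join(" ", words);
        }
        return String.join(" ", Arrays.copyOf(words, MAX_WORDS));
    }

    /** Check 3: replace "*" and "?" with spaces, then collapse whitespace. */
    public static String stripWildcards(String query) {
        String cleaned = WILDCARDS.matcher(query).replaceAll(" ");
        return cleaned.replaceAll("\\s+", " ").trim();
    }

    public static String sanitize(String query) {
        return trimWords(stripWildcards(query));
    }

    public static void main(String[] args) {
        System.out.println(sanitize("a* b* c* d* e* f*")); // prints "a b c d e f"
        System.out.println(clampPage(5000));               // prints 50
    }
}

Doing this before the text ever reaches the Solr query parser should cut down on both the CPU spikes and the parser stack traces Dima mentions.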