All,

I'm using Solr to index and search a database of user data (username, email, first and last name), so there aren't really "terms" in the data to search for, the way you might search for words that describe products in a catalog.
I have set up my schema to include plain-old text fields for each of the data items mentioned above, plus a copy-field called "all" which includes everything together, plus a "first + last" field which uses a phonetic index and query analyzer.

Since I don't need things such as term replacement (spanner == wrench), stemming (first name 'chris' -> 'chri'), and possibly other features that I don't know about, I'm wondering what might be a recommended set of tokenizer(s), analyzer(s), etc. for such data.

We will definitely want to be able to search by substring (to find 'cschultz' as a username with 'schultz' as input), but some substrings are probably useless (such as @gmail.com for email addresses) and don't need to be supported.

What are some good options to look at for this type of data?

In production, we have fewer than 5M records to handle, so this is more of an academic exercise than an actual performance requirement (since Solr is at least an order of magnitude faster than our current RDBMS-searching implementation).

If it makes any difference, we are trying to keep the index up to date with all user changes made in real time (okay, maybe delayed by a few seconds, but basically real time). We have a few hundred new-user registrations per day and probably half as many changes to user records as that, so perhaps 2 document updates per minute on average (during ~12 business hours in the US on weekdays).

Thanks for any advice anyone may have,
-chris
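
P.S. To make the substring requirement concrete, here is a rough sketch in plain Python (not Solr configuration) of the matching behavior I have in mind. The function names and the 3-to-8 character gram sizes are made up for illustration, and I'm only assuming that something n-gram-like is the right mechanism. The point is that the username and the local part of the email are searchable by substring, while the email domain is not.

```python
def ngrams(token, min_n=3, max_n=8):
    """Return every substring of token whose length is between min_n and max_n."""
    token = token.lower()
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams


def index_terms(username, email):
    """Hypothetical 'indexed' terms: grams of the username and the email local part."""
    local_part = email.split("@", 1)[0]  # drop '@gmail.com' and friends
    return ngrams(username) | ngrams(local_part)


def matches(query, username, email):
    """True if the lowercased query is one of the indexed grams."""
    return query.lower() in index_terms(username, email)


if __name__ == "__main__":
    print(matches("schultz", "cschultz", "chris@gmail.com"))
    print(matches("gmail", "cschultz", "chris@gmail.com"))
```

With these settings a query longer than 8 characters or shorter than 3 would never match, which is one of the trade-offs I'd like advice on.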