Recommendations for non-narrative data

Christopher Schultz Fri, 16 Mar 2018 06:47:23 -0700

All,

I'm using Solr to index and search a database of user data (username,
email, first and last name), so there aren't really "terms" in the data
to search for, the way you might search for words that describe products
in a catalog.


I have set up my schema to include plain-old text fields for each of the
items mentioned above, a copy-field called "all" which combines all of
them, and a first + last name field which uses a phonetic analyzer at
both index and query time.

Since I don't need things such as term-replacement (spanner == wrench),
stemming (first name 'chris' -> 'chri'), and possibly other features
that I don't know about, I'm wondering what might be a recommended set
of tokenizer(s), analyzer(s), etc. for such data.

We will definitely want to be able to search by substring (to find
'cschultz' as a username with 'schultz' as input) but some substrings
are probably useless (such as @gmail.com for email addresses) and don't
need to be supported.
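
To make the substring requirement concrete, here is a minimal Python
sketch of the matching behavior described above. It is not Solr code:
the names ngrams, build_index and search, and the 3-character minimum
gram size, are assumptions made purely for illustration. It only mimics,
roughly, what indexing every n-gram of a lowercased value would allow.

```python
from collections import defaultdict

MIN_GRAM = 3  # assumed minimum substring length worth indexing


def ngrams(value, min_gram=MIN_GRAM):
    """Return every substring of `value` with at least `min_gram` characters."""
    value = value.lower()
    return {
        value[i:j]
        for i in range(len(value))
        for j in range(i + min_gram, len(value) + 1)
    }


def build_index(usernames):
    """Map each n-gram to the set of usernames that contain it."""
    index = defaultdict(set)
    for name in usernames:
        for gram in ngrams(name):
            index[gram].add(name)
    return index


def search(index, query):
    """Exact lookup of the lowercased query among the indexed n-grams."""
    return index.get(query.lower(), set())


index = build_index(["cschultz", "jsmith"])
print(search(index, "schultz"))  # prints {'cschultz'}
```

The trade-off in this sketch is that all the work happens at index time:
each value expands into many grams, but a query becomes a single exact
lookup. Inputs shorter than the minimum gram size match nothing.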

What are some good options to look at for this type of data?

In production, we have fewer than 5M records to handle, so this is more
of an academic exercise than an actual performance requirement (since
Solr is at least an order of magnitude faster than our current
RDBMS-searching implementation).

If it makes any difference, we are trying to keep the index up-to-date
with all user changes made in real time (okay, maybe delayed by a few
seconds, but basically realtime). We have a few hundred new-user
registrations per day and probably half as many changes to user records
as that, so perhaps 2 document-updates per minute on average (during ~12
business hours in the US on weekdays).

Thanks for any advice anyone may have,
-chris
