Hi everybody,

I'm new to Solr, and have been reading through documentation off-and-on for
days, but still have some unanswered basic/fundamental questions that have a
huge impact on my implementation approach.
I am thinking of moving my company's web app's main search engine over to
Solr. My goal is to index 5M user records from a social networking website
(most have a free-form text portion, but the majority of the data is
non-text) and have complex searches against those records return in under
0.5s. I have just under 10 application servers, each running my web app,
which is mostly stateless except for things like users' online status.
Forgive me for asking so many questions in one email; feel free to change
the subject line and reply about individual items. Here are the questions:

1. How best to organize a web app that normally queries a search database
so that it uses Solr instead?
a) Set up an independent Solr instance and make the app search it just like
it used to search the database.
b) Integrate Solr right into the app, so that app+Solr get deployed together
(this is very possible, as our app is Java). But we run several instances
of the app, so we'd be running several Solr instances too.
c) Set up an independent Solr instance plus our code (plugins or whatever?),
have web clients send requests DIRECTLY to Solr, and have Solr return
search results directly.
d) Other configuration...?

2. How to best handle Enums?
We have a bunch of enumerated data (say, for example, shoe types). What
"fieldType" should we use to index them?
Should I index them as text? If somebody searches for the keyword "sandals",
I'd want the documents that have shoeType=Sandals (e.g., stored as
enum-value "07") to show up.
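
For concreteness, this is roughly what such an enum looks like on the
application side. It is a plain-Java sketch, not Solr code; the ShoeType
name, the "03" code, and the fromKeyword helper are made up for illustration.

```java
import java.util.Locale;
import java.util.Optional;

public class ShoeTypeExample {
    /** Hypothetical enum: each value has a stored code and a readable label. */
    enum ShoeType {
        SNEAKERS("03", "sneakers"),   // code invented for illustration
        SANDALS("07", "sandals");     // "07" is the code from the example above

        final String code;
        final String label;

        ShoeType(String code, String label) {
            this.code = code;
            this.label = label;
        }

        /** Case-insensitive keyword lookup: the match I'd want search to make. */
        static Optional<ShoeType> fromKeyword(String keyword) {
            String k = keyword.trim().toLowerCase(Locale.ROOT);
            for (ShoeType t : values()) {
                if (t.label.equals(k)) {
                    return Optional.of(t);
                }
            }
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // A keyword search for "Sandals" should resolve to the value stored as "07".
        System.out.println(
            ShoeType.fromKeyword("Sandals").map(t -> t.code).orElse("none"));
    }
}
```

The question is whether the index should hold the code, the label, or both,
so that a keyword query finds the documents stored under the code.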

3. Enums that are sort-of related:
Sometimes our enumerated data is somewhat related. For example (in the "shoe
types" example), let's say we have "sandals". "Crocs" are not sandals, but
are sort of like sandals, so we'd like them to match but score lower than an
exact sandal match. How do we do this? (Is this "Changing Similarity", or is
that barking up the wrong tree?)
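
To make the desired behaviour concrete, here is a rough plain-Java sketch of
the scoring I have in mind. It is not Solr code; the RELATED table and the
0.4 weight are assumptions I made up purely for illustration.

```java
import java.util.Map;

public class RelatedEnumScoring {
    // Hypothetical relatedness table: query value -> (related value -> weight).
    // The 0.4 weight is an arbitrary illustrative number.
    static final Map<String, Map<String, Double>> RELATED =
        Map.of("sandals", Map.of("crocs", 0.4));

    /** Exact match scores 1.0, a related value scores its weight, else 0.0. */
    static double score(String queryValue, String docValue) {
        if (queryValue.equals(docValue)) {
            return 1.0;
        }
        Map<String, Double> related =
            RELATED.getOrDefault(queryValue, Map.of());
        return related.getOrDefault(docValue, 0.0);
    }

    public static void main(String[] args) {
        System.out.println(score("sandals", "sandals")); // exact match
        System.out.println(score("sandals", "crocs"));   // related, lower score
        System.out.println(score("sandals", "boots"));   // unrelated
    }
}
```

In other words, a search for sandals should rank sandal documents first and
croc documents after them, with unrelated shoe types not matching at all.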

4. How to manage "Tags" data?
Users on my site can enter "tags", and we want to be able to build
tag-clouds, follow tag-links, and whatnot. Should I index tags as just a
fieldType of "text"?

5. How do I load the data?
Loading all the data from the database (to anything!) takes a big chunk of
time. Should I export it from the database once and then load it into Solr
using CSV?
Follow-up: How would I manage loading new data on an ongoing basis? The
site's users are creating data all the time. The bulk of it is old (i.e.
from before today) and could be bulk loaded, but after the initial bulk load
new data keeps arriving. Should I just be building a huge Solr index on the
filesystem and making sure I don't lose it?

6. How do I manage real-time data?
For example, let's say I have users coming online and going offline all the
time, and I need to be able to search my set of "online users". How should I
go about this? Can this just be handled through index updates?

I'd appreciate any advice.

Sincerely,

Daryl.
