Hi everybody, I'm new to Solr and have been reading through the documentation off and on for days, but I still have some unanswered basic questions that have a huge impact on my implementation approach. I am thinking of moving the main search engine of my company's web app over to Solr. My goal is to index 5M user records from a social networking website (most of which have a free-form text portion, but the majority of the data is non-text) and have complex searches against those records come back in under 0.5 s. I have just under 10 application servers, each running my web app, which is mostly stateless except for things like users' online status. Forgive me for asking so many questions in one email; feel free to change the subject line and reply about individual items. Here are the questions:

1. How should I best organize a web app that normally queries a search database so that it uses Solr instead?
   a) Set up an independent Solr instance and have the app search it just as it used to search the database.
   b) Integrate Solr right into the app, so that app + Solr get deployed together (this is very possible, as our app is Java). But we run several instances of the app, so we'd be running several Solr instances too.
   c) Set up an independent Solr instance plus our code (plugins or whatever?), and have web clients send requests DIRECTLY to Solr, with Solr returning search results directly.
   d) Some other configuration...?

2. How should I best handle enums? We have a bunch of enumerated data (say, for example, shoe types). What "fieldType" should we use to index them? Should I index them as text? If I index "sandals", then when somebody searches for the keyword "sandals", I'd want the documents that have shoeType=Sandals (e.g., an enum value of "07") to show up.

3. Related enums: Sometimes our enumerated values are somewhat related to each other. For example (staying with "shoe types"), say we have "sandals". "Crocs" are not sandals but are sort of like sandals, so we'd like them to match but score lower than an exact sandal match. How do we do this? (Is this "Changing Similarity", or is that barking up the wrong tree?)

4. How should I manage "tags" data? Users on my site can enter tags, and we want to be able to build tag clouds, follow tag links, and whatnot. Should I just index tags with a fieldType of "text"?

5. How do I load the data? Loading all the data from the database (into anything!) takes a big chunk of time. Should I export it from the database once and then load it into Solr using CSV? Follow-up: how would I manage loading new data on an ongoing basis? The site's users are creating data all the time. The bulk of it is old (i.e., created before today, so it could be bulk loaded), but after an initial bulk load there will be a steady stream of new data. Should I just be building a huge Solr index on the filesystem and making sure I don't lose it?

6. How do I manage real-time data? For example, say I have users coming online and going offline all the time, and I need to be able to search my set of "online users". How should I go about this? Can this just be handled through index updates?

I'd appreciate any advice.

Sincerely,
Daryl