Hi,

Your questions don't have simple answers, but here are some quick ones.

----- Original Message ----
> I'm new to Solr, and have been reading through documentation off-and-on for
> days, but still have some unanswered basic/fundamental questions that have a
> huge impact on my implementation approach.
> I am thinking of moving my company's web app's main search engine over to
> Solr. My goal is to index 5M user records of a social networking website
> (most of which have a free-form text portion, but the majority of the data is
> non-text) and have complex searches against those records come back in the
> sub-0.5s range of time. I have just under 10 application servers each
> running my web-app, which is mostly stateless except for things like users'
> online status.

How many servers have you got for running Solr? (assuming you don't intend to 
put Solr on the same servers as your webapp, as it sounds like each webapp is 
maxing out its server)

> Forgive me for asking so many in one email; feel free to change subject line
> and reply about individual items. Here are the questions:
> 
> 1. How to best organize a web-app that normally goes to a search-db to use
> Solr instead?
> a) Set up independent Solr instance, make app search it just like it used to
> search database.
> b) Integrate Solr right into app, so that app+solr get deployed together
> (this is very possible, as our app is Java). But we run several instances
> of the app so we'd be running several Solr instances too.
> c) Set up independent Solr instance + our code (plugins or whatever?), have
> web clients request DIRECTLY to the Solr app and have Solr return search
> results directly.
> d) Other configuration...?

a) Set up Solr master + N slaves on a separate set of boxes and access them 
remotely from your webapp.  If your webapp is a Java webapp, use SolrJ.  
Alternatively, if your webapp servers have enough spare CPU cycles and enough 
RAM, you could make those same servers your 10 Solr slaves.
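
If you go the SolrJ route, the client side is only a few lines.  A rough sketch 
(SolrJ 1.x-era class names; the slave host name and the profile_text field are 
made-up placeholders, not something Solr ships with):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocument;

  public class UserSearch {
    public static void main(String[] args) throws Exception {
      // Point SolrJ at one of the slaves (or a load balancer in front of them).
      CommonsHttpSolrServer solr =
          new CommonsHttpSolrServer("http://solr-slave-01.example.com:8983/solr");

      SolrQuery q = new SolrQuery("profile_text:hiking");
      q.setRows(20);

      QueryResponse rsp = solr.query(q);
      for (SolrDocument doc : rsp.getResults()) {
        System.out.println(doc.getFieldValue("id"));
      }
    }
  }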

> 2. How to best handle Enums?
> We have a bunch of enumerated data (say, for example, shoe types). What
> "fieldType" should we use to index them?
> Should I index them as text? If I index "sandals", then if somebody searches
> for the keyword "sandals", I'd want the documents that have shoeType=Sandals
> (e.g., enum-value of "07") to show up.

Sounds like the "string" field type (not tokenized, so enum values match exactly).
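
For the "somebody searches for the keyword sandals" part, you'd also want a 
searchable label next to the raw enum code.  A rough indexing sketch with SolrJ 
(the field names shoe_type_code / shoe_type_label are made up; the code field 
would be a "string" field in your schema, the label field a searchable one):

  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class EnumIndexing {
    public static void main(String[] args) throws Exception {
      CommonsHttpSolrServer solr =
          new CommonsHttpSolrServer("http://solr-master.example.com:8983/solr");

      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "user-12345");
      // Raw enum code in a "string" field: exact filtering and faceting.
      doc.addField("shoe_type_code", "07");
      // Human-readable label in a searchable field so a keyword search
      // for "sandals" matches this document.
      doc.addField("shoe_type_label", "sandals");

      solr.add(doc);
      solr.commit();
    }
  }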

> 3. Enums are related, sort-of:
> Sometimes our enumerated data is somewhat related. For example (in the "shoe
> types" example), let's say we have "sandals", well, "crocs" are not
> sandals, but are SORT-OF like sandals, so we'd like them to match but score
> lower than an exact sandal match. How do we do this? (Is this "Changing
> Similarity" or is that barking up the wrong tree?)

One option is to have a separate sort_of_like field where you stick various 
sort-of-like "synonyms".  If you are using DisMax, you can include that 
sort_of_like field in the qf config but give it a lower boost than the "main" field.  
You could use index-time synonym injection for that sort_of_like field.
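
On the query side that could look roughly like this (SolrJ sketch; the field 
names and the ^4 / ^1 boosts are only illustrative, nothing magic about them):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

  public class SortOfLikeSearch {
    public static void main(String[] args) throws Exception {
      CommonsHttpSolrServer solr =
          new CommonsHttpSolrServer("http://solr-slave-01.example.com:8983/solr");

      SolrQuery q = new SolrQuery("sandals");
      q.set("defType", "dismax");
      // Exact matches in the main field score higher than matches that only
      // come from the synonym-expanded sort_of_like field.
      q.set("qf", "shoe_type_label^4 sort_of_like^1");

      System.out.println(solr.query(q).getResults().getNumFound() + " hits");
    }
  }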

> 4. How to manage "Tags" data?
> Users on my site can enter "tags", and we want to be able to build
> tag-clouds, follow tag-links, and whatnot. Should I index tags as just a
> fieldType of "text"?

"text" is fine if you don't want tags to be exact.  Assume "photography" and 
"photo" have the same stem.  Do you want a user clicking on "photo" to get items 
tagged as "photography", too?  If so, use text, else consider string.  Treat 
multi-word tags as phrases.  Example: 
http://www.simpy.com/user/otis/tag/%22information+retrieval%22
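
Querying a multi-word tag as a phrase from SolrJ would look roughly like this 
(the "tag" field name is assumed):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

  public class TagSearch {
    public static void main(String[] args) throws Exception {
      CommonsHttpSolrServer solr =
          new CommonsHttpSolrServer("http://solr-slave-01.example.com:8983/solr");

      // Quote multi-word tags so they are matched as a phrase,
      // not as two independent terms.
      SolrQuery q = new SolrQuery("tag:\"information retrieval\"");

      System.out.println(solr.query(q).getResults().getNumFound() + " tagged items");
    }
  }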

> 5. How do I load the data?
> Loading all the data from the database (to anything!) takes a big chunk of
> time. Should I export it from the database once and then load it into Solr
> using CSV?

If export is not slow, then upload via CSV should be faster than adding docs to 
Solr "the usual way".  But judging from your question below, you probably don't 
need the CSV approach.
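
If you do index "the usual way", batch the adds instead of sending one document 
per request.  Rough sketch (the batch size, field names, and loop are arbitrary 
stand-ins for iterating over your DB rows):

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BulkLoad {
    public static void main(String[] args) throws Exception {
      CommonsHttpSolrServer solr =
          new CommonsHttpSolrServer("http://solr-master.example.com:8983/solr");

      List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
      for (int i = 0; i < 5000000; i++) {        // stand-in for your DB result set
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "user-" + i);
        doc.addField("profile_text", "text for user " + i);  // real values come from the row

        batch.add(doc);
        if (batch.size() == 1000) {              // send in chunks, not doc by doc
          solr.add(batch);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        solr.add(batch);
      }
      solr.commit();                             // a single commit at the end of the bulk load
    }
  }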

> Follow-up: How would I manage loading this/new data on an ongoing basis? The
> site's users are creating data all the time, the bulk of which is old (i.e.
> before today; could be bulk loaded), but after an initial bulk load it's
> ongoing data. Should I be just building a huge Solr index on the filesystem
> and making sure I don't lose it?

Sounds like one-time bulk indexing followed by continuous incremental indexing.  
You can have 2 masters to make things more fault-tolerant.  Or you can store 
your index on a SAN.  Or you can just count on your N Solr slaves acting as the 
"backup" (replicas) of your index, though they'll always be a little behind the 
master index.

> 6. How do I manage real-time data?
> For example, let's say I have users coming online and offline all the time,
> and I need to be able to search my set of "online users". How should I go
> about this? Can this just be handled through index updates?

Yes, though there is no real-time search in Solr just yet.  There is always a 
bit of delay because of index replication (master => slaves) and index/cache 
warm-ups.
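
For the online/offline flag, that means re-adding the user's document with the 
flipped flag and committing; without real-time/atomic updates, an add with the 
same unique id replaces the whole document.  Rough sketch (field names made up):

  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class OnlineStatusUpdate {
    public static void main(String[] args) throws Exception {
      CommonsHttpSolrServer solr =
          new CommonsHttpSolrServer("http://solr-master.example.com:8983/solr");

      // Re-add the user's full document with the new status flag.
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "user-12345");
      doc.addField("profile_text", "the user's other fields go here too");
      doc.addField("is_online", true);

      solr.add(doc);
      // Only searchable after a commit, and on the slaves only after the
      // next replication + cache warm-up.
      solr.commit();
    }
  }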


Otis 
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
