On Mon, Oct 22, 2012 at 8:12 AM, Dag Lem <[email protected]> wrote:
> I've started playing around a bit with Lucy, and I have to say it's
> really, really nice!

:)

> However I've run into a problem trying to increase performance using
> sharding with SearchServer / ClusterSearcher. In my tests, I get close
> to a tenfold *drop* in performance using a few local shards (say, 3
> shards on a 4 core server).

"Performance" can mean many things.  By design, ClusterSearcher is
going to degrade metrics in some areas relative to a single index,
while improving in others.

For instance, it is perfectly acceptable if queries which return a
single hit are slower under ClusterSearcher than under a single local
IndexSearcher.  Single-hit queries tend to be dominated by per-query
overhead rather than per-hit search processing, and the per-query cost
of ClusterSearcher is much higher than that of IndexSearcher.  A
tenfold degradation is not inconceivable.

If the search profile of an application is dominated by such small
lookup queries -- for instance, if you are using Lucy as a key-value
store -- it would be best to avoid ClusterSearcher until you absolutely
have to use it.  Instead, you would want to invest in either RAM or
SSDs.

ClusterSearcher is intended for a different search query profile,
though: it is optimized for large, computationally expensive queries
which are dominated by per-hit search processing and potentially return
many hits.

> While I would expect some overhead using SearchServer / ClusterSearcher,
> the close to tenfold increase in search time I experience does seem
> rather excessive. I'd need an exorbitant amount of shards just to get
> the same performance as by using a single index, if I'd ever get there...

Is your search query dominated by per-query or per-hit costs -- i.e.
does it return quickly at the level of a single IndexSearcher?

If the costs are mostly per-query, then degradation in ClusterSearcher
is to be expected and arguably less of a concern.  (Theoretically, we
might look into things like changing how we do object serialization if
we wanted to improve matters.)
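
To make the per-query versus per-hit distinction concrete, here is a
rough back-of-the-envelope model in Python.  It is only a sketch: the
function names and all of the cost figures are made up for illustration
and were not measured from Lucy, and it assumes per-hit work divides
evenly across shards.

```python
def search_time_ms(hits, shards, per_query_ms, per_hit_ms):
    """Toy model: fixed per-query overhead plus per-hit work,
    with the per-hit work split evenly across the shards."""
    return per_query_ms + per_hit_ms * hits / shards


# Hypothetical costs, chosen only for illustration (not Lucy measurements).
# The clustered setup pays a much higher fixed cost per query.
LOCAL = dict(shards=1, per_query_ms=0.5, per_hit_ms=0.001)
CLUSTER = dict(shards=3, per_query_ms=5.0, per_hit_ms=0.001)


def slowdown(hits):
    """Clustered search time divided by local search time (>1 means slower)."""
    return search_time_ms(hits, **CLUSTER) / search_time_ms(hits, **LOCAL)


if __name__ == "__main__":
    for hits in (1, 1_000, 1_000_000):
        print(f"{hits:>9} hits: cluster/local time ratio = {slowdown(hits):.2f}")
```

With these invented numbers, a single-hit query comes out close to ten
times slower in the clustered setup, while a query touching a million
hits finishes in about a third of the time.  Where the crossover falls
depends entirely on the real per-query and per-hit costs, which is why
the profile of the query matters.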

If the query is expensive to begin with, though -- because it is
dominated by per-hit costs -- then it would be unexpected to see
ClusterSearcher perform poorly, and we would want to find out why.

> If there is anything I can do to help isolate any possible problem,
> please do tell me so (e.g. strace / perl profiling / ...)

We're not there yet.  If we see expensive queries take longer in
ClusterSearcher, I think some Perl profiling might help.  If, however,
only cheap queries are slower, then we'd want to focus on optimizing
your application first.

Marvin Humphrey
