Re: Solr vs Sphinx
Hi,

Could you please start a new thread?

Thanks,
Otis

----- Original Message -----
From: sunnyfr johanna...@gmail.com
To: solr-user@lucene.apache.org
Sent: Wednesday, June 3, 2009 10:20:06 AM
Subject: Re: Solr vs Sphinx

Hi guys,

I have been working with Solr for several months now, and you really do provide quick answers ... and you're very nice to work with. But I've got a huge issue that I couldn't fix even after a lot of posts.

My indexing takes one or two days to finish, for 8 GB of indexed data and 1.5M docs (OK, I have plenty of links in my table, but it still takes a very long time).

Second, I have to run an update every 20 minutes, and each update represents maybe 20,000 docs. When I use replication I must replicate the whole new optimized index folder, because too much data is updated, too many segments need to be generated, and I have to merge the data. So I lose my cache and my CPU goes mad, and I can't serve more than 20 requests/sec.

Fergus McMenemie-2 wrote:
>> Something that would be interesting is to share Solr configs for various types of indexing tasks. From a Solr configuration aimed at indexing web pages to one doing large amounts of text to one that indexes specific structured data. I could see those being posted on the wiki and helping folks who say "I want to do X, is there an example?".
>>
>> I think most folks start with the example Solr install and tweak from there, which probably isn't the best path...
>>
>> Eric
>
> Yep, a Solr cookbook with lots of different example recipes. However, these would need to be very actively maintained to ensure they always represented best practice. While using Cocoon I made extensive use of the examples section of the Cocoon website. However, most of the massive number of examples represent obsolete Cocoon practice. Or there were four or five examples doing the same thing in different ways, with no text explaining the pros and cons of the different approaches. This held me back as a newcomer and gave a bad impression of Cocoon.
>
> I was wondering about a performance hints page. I was caught by an issue indexing CSV content where the use of overwrite=false made an almost 3x difference to my indexing speed. Still do not really know why!
>
>> On May 15, 2009, at 8:09 AM, Mark Miller wrote:
>>> In the spirit of good defaults: I think we should change the Solr highlighter to highlight phrase queries by default, as well as prefix, range, and wildcard constant-score queries. It's awkward to have to tell people you have to turn those on. I'd certainly prefer to have to turn them off if I have some limitation rather than on.
>
> Yep, I agree: all whizzy new features should ideally be on by default unless there is a significant performance penalty. It is not enough to ship a default solrconfig.xml with the feature on; it has to be on by default inside the code.
>
>>> - Mark
>>
>> Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com
>> Free/Busy: http://tinyurl.com/eric-cal
>
> Fergus

--
View this message in context: http://www.nabble.com/Solr-vs-Sphinx-tp23524676p23852364.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr vs Sphinx
> Something that would be interesting is to share Solr configs for various types of indexing tasks. From a Solr configuration aimed at indexing web pages to one doing large amounts of text to one that indexes specific structured data. I could see those being posted on the wiki and helping folks who say "I want to do X, is there an example?".
>
> I think most folks start with the example Solr install and tweak from there, which probably isn't the best path...
>
> Eric

Yep, a Solr cookbook with lots of different example recipes. However, these would need to be very actively maintained to ensure they always represented best practice. While using Cocoon I made extensive use of the examples section of the Cocoon website. However, most of the massive number of examples represent obsolete Cocoon practice. Or there were four or five examples doing the same thing in different ways, with no text explaining the pros and cons of the different approaches. This held me back as a newcomer and gave a bad impression of Cocoon.

I was wondering about a performance hints page. I was caught by an issue indexing CSV content where the use of overwrite=false made an almost 3x difference to my indexing speed. Still do not really know why!

> On May 15, 2009, at 8:09 AM, Mark Miller wrote:
>> In the spirit of good defaults: I think we should change the Solr highlighter to highlight phrase queries by default, as well as prefix, range, and wildcard constant-score queries. It's awkward to have to tell people you have to turn those on. I'd certainly prefer to have to turn them off if I have some limitation rather than on.

Yep, I agree: all whizzy new features should ideally be on by default unless there is a significant performance penalty. It is not enough to ship a default solrconfig.xml with the feature on; it has to be on by default inside the code.

>> - Mark
>
> Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com
> Free/Busy: http://tinyurl.com/eric-cal

Fergus
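For anyone who wants to repeat the overwrite=false experiment, here is a minimal sketch of how the request URL for Solr's CSV update handler might be assembled. The class and method names are hypothetical, and the base URL is an assumed local install. The `overwrite` and `commit` parameters are the ones documented for the CSV handler; as far as the Solr wiki describes it, overwrite=true makes Solr check for and replace existing documents with the same uniqueKey, so skipping that check is a plausible (not confirmed here) reason for the speedup. It is only safe when the input is known to contain no duplicate keys.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Solr: builds a CSV update URL.
final class CsvUpdateUrl {

    // Appends the parameters to the base URL as an encoded query string.
    static String build(String baseUrl, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(baseUrl);
        char sep = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(sep).append(encode(e.getKey()))
              .append('=').append(encode(e.getValue()));
            sep = '&';
        }
        return sb.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always available
        }
    }

    // solrBase is an assumption, e.g. "http://localhost:8983/solr".
    static String csvUpdateUrl(String solrBase, boolean overwrite) {
        Map<String, String> p = new LinkedHashMap<String, String>();
        p.put("overwrite", Boolean.toString(overwrite)); // false skips the duplicate check
        p.put("commit", "true");                          // commit once the load finishes
        return build(solrBase + "/update/csv", p);
    }
}
```

Timing the same CSV file against the URLs produced with `overwrite` set to true and to false, on an otherwise idle index, would show whether the roughly 3x difference reproduces on a given setup.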
Re: Solr vs Sphinx
On Thu, May 14, 2009 at 8:36 PM, Mark Miller markrmil...@gmail.com wrote:
> Michael McCandless wrote:
>> So why haven't we enabled this by default, already?
>
> Why isn't Lucene done already :)

I hear you :)

Mike
Re: Solr vs Sphinx
In the spirit of good defaults:

I think we should change the Solr highlighter to highlight phrase queries by default, as well as prefix, range, and wildcard constant-score queries. It's awkward to have to tell people you have to turn those on. I'd certainly prefer to have to turn them off if I have some limitation rather than on.

- Mark
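To make the "on unless you turn it off" idea concrete, here is a small sketch of opt-out defaults for highlighting request parameters. The helper class is hypothetical. The parameter names `hl.usePhraseHighlighter` and `hl.highlightMultiTerm` are the ones the Solr wiki lists for phrase and multi-term (prefix, range, wildcard) highlighting, but which release introduced each one and what its shipped default is should be checked against the Solr version in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not Solr code: merges user parameters over opt-out defaults.
final class HighlightParams {

    // Both features start enabled; anything the caller sets explicitly wins.
    static Map<String, String> withDefaults(Map<String, String> userParams) {
        Map<String, String> merged = new LinkedHashMap<String, String>();
        merged.put("hl", "true");
        merged.put("hl.usePhraseHighlighter", "true"); // highlight phrases as phrases
        merged.put("hl.highlightMultiTerm", "true");   // prefix, range, wildcard queries
        merged.putAll(userParams);                     // explicit settings override defaults
        return merged;
    }
}
```

A caller with a real limitation passes `hl.highlightMultiTerm=false` and gets exactly that; everyone else gets the richer behaviour without having to know the switch exists.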
Re: Solr vs Sphinx
Something that would be interesting is to share Solr configs for various types of indexing tasks. From a Solr configuration aimed at indexing web pages to one doing large amounts of text to one that indexes specific structured data. I could see those being posted on the wiki and helping folks who say "I want to do X, is there an example?".

I think most folks start with the example Solr install and tweak from there, which probably isn't the best path...

Eric

On May 15, 2009, at 8:09 AM, Mark Miller wrote:
> In the spirit of good defaults:
>
> I think we should change the Solr highlighter to highlight phrase queries by default, as well as prefix, range, and wildcard constant-score queries. It's awkward to have to tell people you have to turn those on. I'd certainly prefer to have to turn them off if I have some limitation rather than on.
>
> - Mark

Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com
Free/Busy: http://tinyurl.com/eric-cal
Re: Solr vs Sphinx
I agree regarding posting different types of files, because right now if you're just starting out with Solr, taking the sample files from the distro and going from there is the /only path/ =\

Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
mr...@zappos.com - 702-943-7833

On May 15, 2009, at 6:41 AM, Eric Pugh wrote:
> Something that would be interesting is to share Solr configs for various types of indexing tasks. From a Solr configuration aimed at indexing web pages to one doing large amounts of text to one that indexes specific structured data. I could see those being posted on the wiki and helping folks who say "I want to do X, is there an example?".
>
> I think most folks start with the example Solr install and tweak from there, which probably isn't the best path...
>
> Eric
>
> On May 15, 2009, at 8:09 AM, Mark Miller wrote:
>> In the spirit of good defaults:
>>
>> I think we should change the Solr highlighter to highlight phrase queries by default, as well as prefix, range, and wildcard constant-score queries. It's awkward to have to tell people you have to turn those on. I'd certainly prefer to have to turn them off if I have some limitation rather than on.
>>
>> - Mark
>
> Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com
> Free/Busy: http://tinyurl.com/eric-cal
Re: Solr vs Sphinx
On Wed, May 13, 2009 at 12:33 PM, Grant Ingersoll gsing...@apache.org wrote:
> I've contacted others in the past who have done comparisons and after one round of emailing it was almost always clear that they didn't know what best practices are for any given product and thus were doing things sub-optimally.

While I agree that one should properly match and tune all the apps being tested (for a fair comparison), we in turn must set out-of-the-box defaults (in Lucene and Solr) that get you as close to the best practices as possible.

We don't always do that, and I think we should do better.

My most recent example of this is BooleanQuery's performance. It turns out, if you setAllowDocsOutOfOrder(true), it yields a sizable performance gain (27% on my most recent test) for OR queries.

So why haven't we enabled this by default, already? (As far as I can tell it's functionally equivalent, as long as the Collector can accept out-of-order docs, which our core collectors can.)

We can't expect the other camp to discover that this obscure setting must be set to maximize Lucene's OR query performance.

Mike
Re: Solr vs Sphinx
> My most recent example of this is BooleanQuery's performance. It turns out, if you setAllowDocsOutOfOrder(true), it yields a sizable performance gain (27% on my most recent test) for OR queries.

Mike,

Can you please point me to some information concerning allowDocsOutOfOrder? What's this at all?

--
Andrew Klochkov
Re: Solr vs Sphinx
Totally agree on optimizing the out-of-the-box experience; it's just never a one-size-fits-all thing. And we have to be very careful about micro-benchmarks driving these settings. Currently, many of us use Wikipedia, but that's just one doc set and I'd venture to say most Solr users do not have docs that look anything like Wikipedia.

One of the things the Open Relevance project (http://wiki.apache.org/lucene-java/OpenRelevance, see the discussion on gene...@lucene.a.o) should aim to do is bring in a variety of test collections, from lots of different genres. This will help both with relevance and with speed testing.

-Grant

On May 14, 2009, at 6:47 AM, Michael McCandless wrote:
> On Wed, May 13, 2009 at 12:33 PM, Grant Ingersoll gsing...@apache.org wrote:
>> I've contacted others in the past who have done comparisons and after one round of emailing it was almost always clear that they didn't know what best practices are for any given product and thus were doing things sub-optimally.
>
> While I agree, one should properly match tune all apps they are testing (for a fair comparison), we in turn must set out-of-the-box defaults (in Lucene and Solr) that get you as close to the best practices as possible.
>
> We don't always do that, and I think we should do better.
>
> My most recent example of this is BooleanQuery's performance. It turns out, if you setAllowDocsOutOfOrder(true), it yields a sizable performance gain (27% on my most recent test) for OR queries.
>
> So why haven't we enabled this by default, already? (As far as I can tell it's functionally equivalent, as long as the Collector can accept out-of-order docs, which our core collectors can).
>
> We can't expect the other camp to discover that this obscure setting must be set, to maximize Lucene's OR query performance.
>
> Mike

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
Re: Solr vs Sphinx
On Thu, May 14, 2009 at 06:47:01AM -0400, Michael McCandless wrote:
> While I agree, one should properly match tune all apps they are testing (for a fair comparison), we in turn must set out-of-the-box defaults (in Lucene and Solr) that get you as close to the best practices as possible.

So, should Lucene use the non-compound file format by default because some idiot's sloppy benchmarks might run a smidge faster, even though that will cause many users to run out of file descriptors?

Anyone doing comparative benchmarking who doesn't submit their code to the support list for the software under review is either a dolt or a propagandist.

Good benchmarking is extremely difficult, like all experimental science. If there isn't ample evidence that the benchmarker appreciates that, their tests aren't worth a second thought. If you don't avail yourself of the help of experts when assembling your experiment, you are unserious.

Richard Feynman:

"...if you're doing an experiment, you should report everything that you think might make it invalid - not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you've eliminated by some other experiment, and how they worked - to make sure the other fellow can tell they have been eliminated."

Marvin Humphrey
Re: Solr vs Sphinx
On Thu, May 14, 2009 at 6:51 AM, Andrey Klochkov akloch...@griddynamics.com wrote:
> Can you please point me to some information concerning allowDocsOutOfOrder? What's this at all?

There is this cryptic static setter (in Lucene):

BooleanQuery.setAllowDocsOutOfOrder(boolean)

It defaults to false, which means BooleanScorer2 will always be used to compute hits for a BooleanQuery. When set to true, BooleanScorer will instead be used, when possible. BooleanScorer gets better performance, but it collects docs out of order, which for some external collectors might cause a problem.

All of Lucene's core collectors work fine with out-of-order collection (but I'm not sure about Solr's collectors).

If you experiment with this, please post back with your results!

Mike
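Why out-of-order collection can be "functionally equivalent" is easy to see without Lucene at all. The sketch below is plain Java with hypothetical class and method names; it is not Lucene's Collector API. It keeps the best N hits in a priority queue and breaks score ties explicitly by doc id, so the result does not depend on the order in which hits arrive. A collector that instead relies on arrival order to break ties is the kind that could misbehave when docs are delivered out of order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; not Lucene code.
final class TopDocsSketch {

    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Orders the worst hit first: lowest score, and on equal scores the larger doc id.
    private static final Comparator<Hit> WORST_FIRST = new Comparator<Hit>() {
        public int compare(Hit a, Hit b) {
            int c = Float.compare(a.score, b.score);
            return c != 0 ? c : Integer.compare(b.doc, a.doc);
        }
    };

    // Returns the doc ids of the n best hits, best first, for any arrival order.
    static List<Integer> topN(List<Hit> hits, int n) {
        PriorityQueue<Hit> pq = new PriorityQueue<Hit>(n + 1, WORST_FIRST);
        for (Hit h : hits) {
            pq.add(h);
            if (pq.size() > n) {
                pq.poll(); // evict the current worst hit
            }
        }
        List<Hit> best = new ArrayList<Hit>(pq);
        Collections.sort(best, Collections.reverseOrder(WORST_FIRST));
        List<Integer> docs = new ArrayList<Integer>();
        for (Hit h : best) {
            docs.add(h.doc);
        }
        return docs;
    }
}
```

Feeding the same hits in doc-id order and in a shuffled order returns the same list, which is the property a collector needs before the setter described above can be switched on safely.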
Re: Solr vs Sphinx
Yonik Seeley-2 wrote:
> It's probably the case that every search engine out there is faster than Solr at one thing or another, and that Solr is faster or better at some other things.
>
> I prefer to spend my time improving Solr rather than engage in benchmarking wars... and Solr 1.4 will have a ton of speed improvements over Solr 1.3.
>
> -Yonik
> http://www.lucidimagination.com

Solr is very fast even with 1.3, and the developers have done an incredible job.

However, maybe the next Solr improvement should be the creation of a configuration manager and/or automated tuning tool. I know that optimizing Solr performance can be time-consuming and sometimes frustrating.

--
View this message in context: http://www.nabble.com/Solr-vs-Sphinx-tp23524676p23544492.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr vs Sphinx
On Thu, May 14, 2009 at 9:07 AM, Marvin Humphrey mar...@rectangular.com wrote:
> Richard Feynman:
>
> "...if you're doing an experiment, you should report everything that you think might make it invalid - not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you've eliminated by some other experiment, and how they worked - to make sure the other fellow can tell they have been eliminated."

Excellent quote!

> So, should Lucene use the non-compound file format by default because some idiot's sloppy benchmarks might run a smidge faster, even though that will cause many users to run out of file descriptors?

No, I don't think we should change that default. Nor (for example) can we switch to SweetSpotSimilarity by default, even though it seems to improve relevance, because it requires app-dependent configuration. Nor should we set IndexWriter's RAM buffer to 1 GB. Etc.

But when there is a choice that has near zero downside and improves performance (like my example), we should make the switch. Making IndexReader.open return a readOnly reader is another example (... which we plan to do in 3.0).

Every time Lucene or Solr has a default built-in setting, we should think carefully about how to set it.

> Anyone doing comparative benchmarking who doesn't submit their code to the support list for the software under review is either a dolt or a propagandist.
>
> Good benchmarking is extremely difficult, like all experimental science. If there isn't ample evidence that the benchmarker appreciates that, their tests aren't worth a second thought. If you don't avail yourself of the help of experts when assembling your experiment, you are unserious.

Agreed.

Mike
Re: Solr vs Sphinx
On 14-May-09, at 9:46 AM, gdeconto wrote:
> Solr is very fast even with 1.3 and the developers have done an incredible job.
>
> However, maybe the next Solr improvement should be the creation of a configuration manager and/or automated tuning tool. I know that optimizing Solr performance can be time consuming and sometimes frustrating.

Making Solr more self-service has been a theme of ours, and one we should keep striving toward. In some respects, extreme configurability is a liability if considerable tweaking and experimentation is needed to achieve optimum results. You can't expect everyone to put in the investment to develop the expertise.

That said, it is very difficult to come up with appropriate auto-tuning heuristics that don't fail. It almost calls for a level higher than Solr, where you could hint what you want to do with a field (sort, facet, etc.) and it makes the field definitions appropriately. The problem with such abstractions is that they are invariably leaky, so diagnosing problems requires much the same expertise as skipping the abstraction step in the first place.

Getting this trade-off right is one of the central problems of computer science.

-Mike
Re: Solr vs Sphinx
Michael McCandless wrote:
> So why haven't we enabled this by default, already?

Why isn't Lucene done already :)

- Mark
Re: Solr vs Sphinx
It's probably the case that every search engine out there is faster than Solr at one thing or another, and that Solr is faster or better at some other things. I prefer to spend my time improving Solr rather than engage in benchmarking wars... and Solr 1.4 will have a ton of speed improvements over Solr 1.3. -Yonik http://www.lucidimagination.com
Re: Solr vs Sphinx
On May 13, 2009, at 11:55 AM, wojtekpia wrote:
> I came across this article praising Sphinx: http://www.theregister.co.uk/2009/05/08/dziuba_sphinx/. The article specifically mentions Solr as an 'aging' technology,

Solr is the same age as Sphinx (2006), so if Solr is aging, then so is Sphinx. But, hey, aren't we all aging? It sure beats not aging. ;-) That being said, we are always open to suggestions and improvements. Lucene has seen a massive speedup on indexing in the past year that comes through in Solr (and it was fast before), and Solr 1.4 looks to be faster than 1.3 (and it was fast before, too). The Solr community is clearly interested in moving things forward and staying fresh, as is the Lucene community.

> and states that performance on Sphinx is 2x-4x faster than Solr. Has anyone compared Sphinx to Solr? Or used Sphinx in the past? I realize that you can't just say one is faster than the other because it depends so much on configuration, requirements, # docs, size of each doc, etc. I'm just looking for general observations. I've found other articles comparing Solr with Sphinx and most state that performance is similar between the two.

I can't speak to Sphinx, as I haven't used it.

As for performance tests, those are always apples and oranges. If one camp does them, then the other camp says "You don't know how to use our product", and vice versa. I think that applies here. So, when you see things like "Internal tests show", that is always a red flag in my mind. I've contacted others in the past who have done comparisons and after one round of emailing it was almost always clear that they didn't know what best practices are for any given product and thus were doing things sub-optimally.

One thing in the article that is worthwhile to consider is the fact that some (most?) people would likely benefit from not removing stopwords, as they can enhance phrase-based searching and thus improve relevance. Obviously, with Solr, it is easy to keep stopwords by simply removing the StopFilterFactory from the analysis process and then dealing with them appropriately at query time. However, it is likely the case that too many Solr users simply rely on the example schema when it comes to setup instead of actively investigating what the proper choices are for their situation.

Finally, an old baseball saying comes to mind: "Pitchers only bother to throw at .300 hitters". Solr is a pretty darn full-featured search platform with a large and active community, a commercial-friendly license, and it also performs quite well.

-Grant
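The stopword point is easy to demonstrate with a toy analyzer. The sketch below is plain Java with hypothetical names and a short, made-up stopword list; it is not Solr's analysis chain or its shipped stopwords file. It only shows what happens to a phrase when every token in it is a stopword, which is the case where removing the stop filter (solr.StopFilterFactory in the example schema) helps phrase search the most.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not Solr's analyzer.
final class StopwordSketch {

    // A handful of common English stopwords, chosen for the example.
    static final Set<String> STOPWORDS = new HashSet<String>(
        Arrays.asList("a", "an", "and", "be", "not", "of", "or", "the", "to"));

    // Lowercases, splits on whitespace, and optionally drops stopwords.
    static List<String> analyze(String text, boolean removeStopwords) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (t.isEmpty()) {
                continue; // blank input yields no tokens
            }
            if (removeStopwords && STOPWORDS.contains(t)) {
                continue; // this is the step a stop filter performs
            }
            tokens.add(t);
        }
        return tokens;
    }
}
```

With stopword removal on, `analyze("To be or not to be", true)` produces no tokens at all, so the phrase cannot be matched, while `analyze("The Who", true)` keeps only "who" and loses the distinction from the plain word. With removal off, all tokens and their positions survive for phrase matching, at the cost of a larger index and more work at query time.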
Re: Solr vs Sphinx
Our company has a large search deployment serving 50 M search hits per day.

We've been leveraging Lucene for several years and have recently deployed Solr for the distributed search feature. We were hitting scaling limits with Lucene due to our index size.

I did an evaluation of Sphinx and found Solr / Lucene to be more suitable for our needs and much more flexible. Performance in the Solr deployment (especially with 1.4) has been better than expected.

Thanks to all the Solr developers for a great product.

Hopefully we'll have the opportunity to contribute to the project as it moves forward.

Todd

On Wed, May 13, 2009 at 10:33 AM, Grant Ingersoll gsing...@apache.org wrote:
> On May 13, 2009, at 11:55 AM, wojtekpia wrote:
>> I came across this article praising Sphinx: http://www.theregister.co.uk/2009/05/08/dziuba_sphinx/. The article specifically mentions Solr as an 'aging' technology,
>
> Solr is the same age as Sphinx (2006), so if Solr is aging, then so is Sphinx. But, hey, aren't we all aging? It sure beats not aging. ;-) That being said, we are always open to suggestions and improvements. Lucene has seen a massive speedup on indexing in the past year that comes through in Solr (and it was fast before), and Solr 1.4 looks to be faster than 1.3 (and it was fast before, too). The Solr community is clearly interested in moving things forward and staying fresh, as is the Lucene community.
>
>> and states that performance on Sphinx is 2x-4x faster than Solr. Has anyone compared Sphinx to Solr? Or used Sphinx in the past? I realize that you can't just say one is faster than the other because it depends so much on configuration, requirements, # docs, size of each doc, etc. I'm just looking for general observations. I've found other articles comparing Solr with Sphinx and most state that performance is similar between the two.
>
> I can't speak to Sphinx, as I haven't used it.
>
> As for performance tests, those are always apples and oranges. If one camp does them, then the other camp says "You don't know how to use our product", and vice versa. I think that applies here. So, when you see things like "Internal tests show", that is always a red flag in my mind. I've contacted others in the past who have done comparisons and after one round of emailing it was almost always clear that they didn't know what best practices are for any given product and thus were doing things sub-optimally.
>
> One thing in the article that is worthwhile to consider is the fact that some (most?) people would likely benefit from not removing stopwords, as they can enhance phrase-based searching and thus improve relevance. Obviously, with Solr, it is easy to keep stopwords by simply removing the StopFilterFactory from the analysis process and then dealing with them appropriately at query time. However, it is likely the case that too many Solr users simply rely on the example schema when it comes to setup instead of actively investigating what the proper choices are for their situation.
>
> Finally, an old baseball saying comes to mind: "Pitchers only bother to throw at .300 hitters". Solr is a pretty darn full-featured search platform with a large and active community, a commercial-friendly license, and it also performs quite well.
>
> -Grant