Re: Solr vs Sphinx

2009-06-03 Thread Otis Gospodnetic

Hi,

Could you please start a new thread?


Thanks,
Otis


- Original Message 
 From: sunnyfr johanna...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, June 3, 2009 10:20:06 AM
 Subject: Re: Solr vs Sphinx
 
 
 Hi guys,
 
 I have been working with Solr for several months now, and you really do
 provide quick answers ... and you're very nice to work with.
 But I have a huge issue that I haven't been able to fix, even after a lot of posts.
 
 My indexing takes one to two days to complete, for 8 GB of indexed data and
 1.5M docs (OK, I have plenty of links in my table, but it still takes a very long time).
 
 Second, I have to run an update every 20 minutes, and every update covers maybe
 20,000 docs. When I use replication I must replicate the whole new optimized
 index folder, because too much data is updated, too many segments need to be
 generated, and I have to merge the data. So I lose my cache and my CPU goes
 mad.
 
 And I can't get more than 20 requests/sec.
 
 
 
 
 Fergus McMenemie-2 wrote:
  
 Something that would be interesting is to share solr configs for  
 various types of indexing tasks.  From a solr configuration aimed at  
 indexing web pages to one doing large amounts of text to one that  
 indexes specific structured data.  I could see those being posted on  
 the wiki and helping folks who say I want to do X, is there an  
 example?.
 
 I think most folks start with the example Solr install and tweak from  
 there, which probably isn't the best path...
 
 Eric
  
  Yep a solr cookbook with lots of different example recipes. However
  these would need to be very actively maintained to ensure they always
  represented best practice. While using cocoon I made extensive use
  of the examples section of the cocoon website. However most of the,
  massive number of, examples represent obsolete cocoon practise. Or 
  there were four or five examples doing the same thing in different 
  ways with no text explaining the pros/cons of the different approaches.
  This held me, as a newcomer, back and gave a bad impression of cocoon.
  
  I was wondering about a performance hints page. I was caught by an
  issue indexing CSV content where the use of overwrite=false made
  an almost 3x difference to my indexing speed. Still do not really
  know why!
  
 
 On May 15, 2009, at 8:09 AM, Mark Miller wrote:
 
  In the spirit of good defaults:
 
  I think we should change the Solr highlighter to highlight phrase  
  queries by default, as well as prefix,range,wildcard constantscore  
  queries. Its awkward to have to tell people you have to turn those  
  on. I'd certainly prefer to have to turn them off if I have some  
  limitation rather than on.
  
  Yep I agree, all whizzy new features should ideally be on by default
  unless there is a significant performance penalty. It is not enough
  that to issue a default solrconfig.xml with the feature on, it has to
  be on by default inside the code.
   
 
  - Mark
 
 -
 Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 |
 http://www.opensourceconnections.com
 Free/Busy: http://tinyurl.com/eric-cal
  
  Fergus
  
  
 
 -- 
 View this message in context: 
 http://www.nabble.com/Solr-vs-Sphinx-tp23524676p23852364.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr vs Sphinx

2009-05-17 Thread Fergus McMenemie
Something that would be interesting is to share solr configs for  
various types of indexing tasks.  From a solr configuration aimed at  
indexing web pages to one doing large amounts of text to one that  
indexes specific structured data.  I could see those being posted on  
the wiki and helping folks who say I want to do X, is there an  
example?.

I think most folks start with the example Solr install and tweak from  
there, which probably isn't the best path...

Eric

Yep, a Solr cookbook with lots of different example recipes. However,
these would need to be very actively maintained to ensure they always
represented best practice. While using Cocoon I made extensive use
of the examples section of the Cocoon website. However, most of the
massive number of examples represent obsolete Cocoon practice, or
there were four or five examples doing the same thing in different
ways with no text explaining the pros and cons of the different approaches.
This held me back as a newcomer and gave a bad impression of Cocoon.

I was wondering about a performance hints page. I was caught by an
issue indexing CSV content where the use of overwrite=false made
an almost 3x difference to my indexing speed. Still do not really
know why!
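
A likely explanation, going by the documented behaviour of the CSV update handler's overwrite parameter: with overwrite=true (the default) Solr checks for and replaces existing documents that share the same uniqueKey, while overwrite=false skips that check. That is only safe when the input is known to contain no duplicates. The following is a minimal Java sketch of building such a request URL. The host, port, and /solr/update/csv path are assumptions taken from the example distribution, and the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Solr or SolrJ.
final class CsvUpdateUrlSketch {

    /** Appends URL-encoded query parameters to a base URL, in insertion order. */
    static String buildUrl(String base, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(base);
        char sep = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(sep).append(encode(e.getKey()))
              .append('=').append(encode(e.getValue()));
            sep = '&';
        }
        return sb.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Builds the CSV update URL; solrBase is an assumed deployment location. */
    static String csvUpdateUrl(String solrBase, boolean overwrite) {
        Map<String, String> params = new LinkedHashMap<>();
        // overwrite=false skips the duplicate check on the uniqueKey field.
        params.put("overwrite", Boolean.toString(overwrite));
        params.put("commit", "true");
        return buildUrl(solrBase + "/update/csv", params);
    }

    public static void main(String[] args) {
        // prints http://localhost:8983/solr/update/csv?overwrite=false&commit=true
        System.out.println(csvUpdateUrl("http://localhost:8983/solr", false));
    }
}
```

The CSV file itself would then be posted to this URL as the request body. Whether the speed-up is anywhere near 3x will depend on the data and the index.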


On May 15, 2009, at 8:09 AM, Mark Miller wrote:

 In the spirit of good defaults:

 I think we should change the Solr highlighter to highlight phrase  
 queries by default, as well as prefix,range,wildcard constantscore  
 queries. Its awkward to have to tell people you have to turn those  
 on. I'd certainly prefer to have to turn them off if I have some  
 limitation rather than on.

Yep, I agree: all whizzy new features should ideally be on by default
unless there is a significant performance penalty. It is not enough
to issue a default solrconfig.xml with the feature on; it has to
be on by default inside the code.
 

 - Mark

-
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com
Free/Busy: http://tinyurl.com/eric-cal

Fergus


Re: Solr vs Sphinx

2009-05-15 Thread Michael McCandless
On Thu, May 14, 2009 at 8:36 PM, Mark Miller markrmil...@gmail.com wrote:
 Michael McCandless wrote:

 So why haven't we enabled this by default, already?

 Why isn't Lucene done already :)

I hear you :)

Mike


Re: Solr vs Sphinx

2009-05-15 Thread Mark Miller

In the spirit of good defaults:

I think we should change the Solr highlighter to highlight phrase
queries by default, as well as prefix, range, and wildcard constant-score
queries. It's awkward to have to tell people they have to turn those on.
I'd certainly prefer to have to turn them off if I have some limitation
rather than on.


- Mark


Re: Solr vs Sphinx

2009-05-15 Thread Eric Pugh
Something that would be interesting is to share Solr configs for
various types of indexing tasks, from a Solr configuration aimed at
indexing web pages, to one doing large amounts of text, to one that
indexes specific structured data.  I could see those being posted on
the wiki and helping folks who say "I want to do X, is there an
example?"


I think most folks start with the example Solr install and tweak from  
there, which probably isn't the best path...


Eric

On May 15, 2009, at 8:09 AM, Mark Miller wrote:


In the spirit of good defaults:

I think we should change the Solr highlighter to highlight phrase  
queries by default, as well as prefix,range,wildcard constantscore  
queries. Its awkward to have to tell people you have to turn those  
on. I'd certainly prefer to have to turn them off if I have some  
limitation rather than on.


- Mark


-
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com
Free/Busy: http://tinyurl.com/eric-cal






Re: Solr vs Sphinx

2009-05-15 Thread Matthew Runo
I agree regarding posting different types of files - because right now  
if you're just starting out with Solr, taking the sample files from  
the distro and going from there is the /only path/ =\


Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
mr...@zappos.com - 702-943-7833

On May 15, 2009, at 6:41 AM, Eric Pugh wrote:

Something that would be interesting is to share solr configs for  
various types of indexing tasks.  From a solr configuration aimed at  
indexing web pages to one doing large amounts of text to one that  
indexes specific structured data.  I could see those being posted on  
the wiki and helping folks who say I want to do X, is there an  
example?.


I think most folks start with the example Solr install and tweak  
from there, which probably isn't the best path...


Eric

On May 15, 2009, at 8:09 AM, Mark Miller wrote:


In the spirit of good defaults:

I think we should change the Solr highlighter to highlight phrase  
queries by default, as well as prefix,range,wildcard constantscore  
queries. Its awkward to have to tell people you have to turn those  
on. I'd certainly prefer to have to turn them off if I have some  
limitation rather than on.


- Mark


-
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com
Free/Busy: http://tinyurl.com/eric-cal








Re: Solr vs Sphinx

2009-05-14 Thread Michael McCandless
On Wed, May 13, 2009 at 12:33 PM, Grant Ingersoll gsing...@apache.org wrote:
 I've contacted
 others in the past who have done comparisons and after one round of
 emailing it was almost always clear that they didn't know what best
 practices are for any given product and thus were doing things
 sub-optimally.

While I agree, one should properly match & tune all apps they are
testing (for a fair comparison), we in turn must set out-of-the-box
defaults (in Lucene and Solr) that get you as close to the best
practices as possible.

We don't always do that, and I think we should do better.

My most recent example of this is BooleanQuery's performance.  It
turns out, if you setAllowDocsOutOfOrder(true), it yields a sizable
performance gain (27% on my most recent test) for OR queries.

So why haven't we enabled this by default, already?  (As far as I can
tell it's functionally equivalent, as long as the Collector can accept
out-of-order docs, which our core collectors can).

We can't expect the other camp to discover that this obscure setting
must be set, to maximize Lucene's OR query performance.

Mike


Re: Solr vs Sphinx

2009-05-14 Thread Andrey Klochkov


 My most recent example of this is BooleanQuery's performance.  It
 turns out, if you setAllowDocsOutOfOrder(true), it yields a sizable
 performance gain (27% on my most recent test) for OR queries.


Mike,

Can you please point me to some information concerning allowDocsOutOfOrder?
What's this at all?


-- 
Andrew Klochkov


Re: Solr vs Sphinx

2009-05-14 Thread Grant Ingersoll
Totally agree on optimizing the out-of-the-box experience; it's just never
a one-size-fits-all thing.  And we have to be very careful about
micro-benchmarks driving these settings.  Currently, many of us use
Wikipedia, but that's just one doc set, and I'd venture to say most
Solr users do not have docs that look anything like Wikipedia.  One of
the things the Open Relevance project
(http://wiki.apache.org/lucene-java/OpenRelevance; see the discussion on
gene...@lucene.a.o) should aim to do is bring
in a variety of test collections from lots of different genres.  This
will help both with relevance and with speed testing.


-Grant

On May 14, 2009, at 6:47 AM, Michael McCandless wrote:

On Wed, May 13, 2009 at 12:33 PM, Grant Ingersoll  
gsing...@apache.org wrote:

I've contacted
others in the past who have done comparisons and after one round of
emailing it was almost always clear that they didn't know what best
practices are for any given product and thus were doing things
sub-optimally.


While I agree, one should properly match & tune all apps they are
testing (for a fair comparison), we in turn must set out-of-the-box
defaults (in Lucene and Solr) that get you as close to the best
practices as possible.

We don't always do that, and I think we should do better.

My most recent example of this is BooleanQuery's performance.  It
turns out, if you setAllowDocsOutOfOrder(true), it yields a sizable
performance gain (27% on my most recent test) for OR queries.

So why haven't we enabled this by default, already?  (As far as I can
tell it's functionally equivalent, as long as the Collector can accept
out-of-order docs, which our core collectors can).

We can't expect the other camp to discover that this obscure setting
must be set, to maximize Lucene's OR query performance.

Mike


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search



Re: Solr vs Sphinx

2009-05-14 Thread Marvin Humphrey
On Thu, May 14, 2009 at 06:47:01AM -0400, Michael McCandless wrote:
 While I agree, one should properly match & tune all apps they are
 testing (for a fair comparison), we in turn must set out-of-the-box
 defaults (in Lucene and Solr) that get you as close to the best
 practices as possible.

So, should Lucene use the non-compound file format by default because some
idiot's sloppy benchmarks might run a smidge faster, even though that will
cause many users to run out of file descriptors?

Anyone doing comparative benchmarking who doesn't submit their code to the
support list for the software under review is either a dolt or a propagandist.

Good benchmarking is extremely difficult, like all experimental science.  If
there isn't ample evidence that the benchmarker appreciates that, their tests
aren't worth a second thought.  If you don't avail yourself of the help of
experts when assembling your experiment, you are unserious.

Richard Feynman:

...if you're doing an experiment, you should report everything that you
think might make it invalid - not only what you think is right about it:
other causes that could possibly explain your results; and things you
thought of that you've eliminated by some other experiment, and how they
worked - to make sure the other fellow can tell they have been eliminated.

Marvin Humphrey



Re: Solr vs Sphinx

2009-05-14 Thread Michael McCandless
On Thu, May 14, 2009 at 6:51 AM, Andrey Klochkov
akloch...@griddynamics.com wrote:

 Can you please point me to some information concerning allowDocsOutOfOrder?
 What's this at all?

There is this cryptic static setter (in Lucene):

  BooleanQuery.setAllowDocsOutOfOrder(boolean)

It defaults to false, which means BooleanScorer2 will always be used
to compute hits for a BooleanQuery.  When set to true, BooleanScorer
will instead be used, when possible.  BooleanScorer gets better
performance, but it collects docs out of order, which for some
external collectors might cause a problem.

All of Lucene's core collectors work fine with out-of-order collection
(but I'm not sure about Solr's collectors).
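
To illustrate why out-of-order collection is harmless for some collectors and a problem for others, here is a minimal, self-contained Java sketch. It does not use Lucene at all: Hit, topN, and countAssumingInOrder are hypothetical stand-ins, not Lucene classes or methods. A heap-based top-N collector returns the same result whatever order the hits arrive in, while a collector that assumes increasing doc ids silently loses hits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-ins for illustration only; none of these are Lucene classes.
final class OutOfOrderSketch {

    /** A (doc id, score) pair as a scorer might hand it to a collector. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // "Worse" hits sort first: lower score, then higher doc id on ties.
    private static final Comparator<Hit> WORST_FIRST = (a, b) -> {
        int c = Float.compare(a.score, b.score);
        return c != 0 ? c : Integer.compare(b.doc, a.doc);
    };

    /** Order-insensitive collector: keeps the n best hits in a min-heap. */
    static List<Integer> topN(List<Hit> hits, int n) {
        PriorityQueue<Hit> heap = new PriorityQueue<>(WORST_FIRST);
        for (Hit h : hits) {
            heap.add(h);
            if (heap.size() > n) heap.poll(); // evict the current worst hit
        }
        List<Hit> best = new ArrayList<>(heap);
        best.sort(WORST_FIRST.reversed());    // best hit first
        List<Integer> docs = new ArrayList<>();
        for (Hit h : best) docs.add(h.doc);
        return docs;
    }

    /** Order-sensitive collector: assumes doc ids only ever increase. */
    static int countAssumingInOrder(List<Hit> hits) {
        int count = 0;
        int lastDoc = -1;
        for (Hit h : hits) {
            if (h.doc > lastDoc) { // anything arriving "late" is dropped
                count++;
                lastDoc = h.doc;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Hit> inOrder = Arrays.asList(
            new Hit(1, 0.5f), new Hit(2, 0.9f), new Hit(3, 0.1f), new Hit(4, 0.9f));
        List<Hit> outOfOrder = new ArrayList<>(inOrder);
        Collections.reverse(outOfOrder);

        System.out.println(topN(inOrder, 2));                 // [2, 4]
        System.out.println(topN(outOfOrder, 2));              // [2, 4]
        System.out.println(countAssumingInOrder(inOrder));    // 4
        System.out.println(countAssumingInOrder(outOfOrder)); // 1
    }
}
```

The second collector is the kind of external collector that could break once docs arrive out of order; the first is unaffected because ties are broken by doc id rather than by arrival order.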

If you experiment with this, please post back with your results!

Mike


Re: Solr vs Sphinx

2009-05-14 Thread gdeconto



Yonik Seeley-2 wrote:
 
 It's probably the case that every search engine out there is faster
 than Solr at one thing or another, and that Solr is faster or better
 at some other things.
 
 I prefer to spend my time improving Solr rather than engage in
 benchmarking wars... and Solr 1.4 will have a ton of speed
 improvements over Solr 1.3.
 
 -Yonik
 http://www.lucidimagination.com
 
 

Solr is very fast even with 1.3 and the developers have done an incredible
job.

However, maybe the next Solr improvement should be the creation of a
configuration manager and/or automated tuning tool.  I know that optimizing
Solr performance can be time consuming and sometimes frustrating.





Re: Solr vs Sphinx

2009-05-14 Thread Michael McCandless
On Thu, May 14, 2009 at 9:07 AM, Marvin Humphrey mar...@rectangular.com wrote:
 Richard Feynman:

...if you're doing an experiment, you should report everything that you
think might make it invalid - not only what you think is right about it:
other causes that could possibly explain your results; and things you
thought of that you've eliminated by some other experiment, and how they
worked - to make sure the other fellow can tell they have been eliminated.

Excellent quote!

 So, should Lucene use the non-compound file format by default because some
 idiot's sloppy benchmarks might run a smidge faster, even though that will
 cause many users to run out of file descriptors?

No, I don't think we should change that default.

Nor (for example) can we switch to SweetSpotSimilarity by default,
even though it seems to improve relevance, because it requires
app-dependent configuration.

Nor should we set IndexWriter's RAM buffer to 1 GB.  Etc.

But when there is a choice that has near zero downside and improves
performance (like my example), we should make the switch.

Making IndexReader.open return a readOnly reader is another example
(... which we plan to do in 3.0).

Every time Lucene or Solr has a default built-in setting, we should
think carefully about how to set it.

 Anyone doing comparative benchmarking who doesn't submit their code to the
 support list for the software under review is either a dolt or a propagandist.

 Good benchmarking is extremely difficult, like all experimental science.  If
 there isn't ample evidence that the benchmarker appreciates that, their tests
 aren't worth a second thought.  If you don't avail yourself of the help of
 experts when assembling your experiment, you are unserious.

Agreed.

Mike


Re: Solr vs Sphinx

2009-05-14 Thread Mike Klaas


On 14-May-09, at 9:46 AM, gdeconto wrote:


Solr is very fast even with 1.3 and the developers have done an incredible
job.

However, maybe the next Solr improvement should be the creation of a
configuration manager and/or automated tuning tool.  I know that optimizing
Solr performance can be time consuming and sometimes frustrating.


Making Solr more self-service has been a theme we have had and  
should strive to move toward.  In some respects, extreme  
configurability is a liability, if considerable tweaking and  
experimentation is needed to achieve optimum results.  You can't  
expect everyone to put in the investment to develop the expertise.


That said, it is very difficult to come up with appropriate
auto-tuning heuristics that don't fail.  It almost calls for a level higher
than Solr, to which you could hint what you want to do with the field
(sort, facet, etc.), and which makes the field definitions
appropriately.  The problem with such abstractions is that they are
invariably leaky, and thus diagnosing problems requires similar
expertise to omitting the abstraction step in the first place.


Getting this trade-off right is one of the central problems of  
computer science.
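
As a rough illustration of the "level higher than Solr" idea, here is a minimal Java sketch that turns usage hints into schema.xml-style field definitions. Everything in it is hypothetical: the Use enum, the define method, and the _exact suffix are invented for this example, and the rule it encodes (tokenized "text" for searching, untokenized "string" for sorting and faceting, joined by a copyField) is a common convention based on the example schema's type names, not a built-in Solr feature.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical hint layer for illustration; not part of Solr.
final class FieldHintSketch {

    /** What the user intends to do with a field. */
    enum Use { SEARCH, SORT, FACET }

    /** Derives schema.xml-style definitions for one logical field. */
    static List<String> define(String name, Set<Use> uses) {
        List<String> xml = new ArrayList<>();
        boolean needsText = uses.contains(Use.SEARCH);
        boolean needsExact = uses.contains(Use.SORT) || uses.contains(Use.FACET);

        // Full-text search wants a tokenized field; also the fallback when no hint is given.
        if (needsText || !needsExact) {
            xml.add(field(name, "text"));
        }
        // Sorting and faceting want a single untokenized value.
        if (needsExact) {
            String exactName = needsText ? name + "_exact" : name;
            xml.add(field(exactName, "string"));
            if (needsText) {
                // Conflicting hints force a second field: this is where the abstraction leaks.
                xml.add("<copyField source=\"" + name + "\" dest=\"" + exactName + "\"/>");
            }
        }
        return xml;
    }

    private static String field(String name, String type) {
        return "<field name=\"" + name + "\" type=\"" + type
            + "\" indexed=\"true\" stored=\"true\"/>";
    }

    public static void main(String[] args) {
        for (String line : define("title", EnumSet.of(Use.SEARCH, Use.SORT))) {
            System.out.println(line);
        }
    }
}
```

Even in this toy version, a user who hints both SEARCH and SORT ends up with two fields and a copyField, and has to know to sort on title_exact rather than title, which is the kind of leak described above.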


-Mike


Re: Solr vs Sphinx

2009-05-14 Thread Mark Miller

Michael McCandless wrote:
So why haven't we enabled this by default, already? 

Why isn't Lucene done already :)

- Mark




Re: Solr vs Sphinx

2009-05-13 Thread Yonik Seeley
It's probably the case that every search engine out there is faster
than Solr at one thing or another, and that Solr is faster or better
at some other things.

I prefer to spend my time improving Solr rather than engage in
benchmarking wars... and Solr 1.4 will have a ton of speed
improvements over Solr 1.3.

-Yonik
http://www.lucidimagination.com


Re: Solr vs Sphinx

2009-05-13 Thread Grant Ingersoll


On May 13, 2009, at 11:55 AM, wojtekpia wrote:



I came across this article praising Sphinx:
http://www.theregister.co.uk/2009/05/08/dziuba_sphinx/. The article
specifically mentions Solr as an 'aging' technology,


Solr is the same age as Sphinx (2006), so if Solr is aging, then so is  
Sphinx.  But, hey aren't we all aging?  It sure beats not aging.  ;-)   
That being said, we are always open to suggestions and improvements.   
Lucene has seen a massive speedup on indexing that comes through in  
Solr in the past year (and it was fast before), and Solr 1.4 looks to  
be faster than 1.3 (and it was fast before, too.)  The Solr community  
is clearly interested in moving things forward and staying fresh, as  
is the Lucene community.



and states that
performance on Sphinx is 2x-4x faster than Solr. Has anyone compared Sphinx
to Solr? Or used Sphinx in the past? I realize that you can't just say one
is faster than the other because it depends so much on configuration,
requirements, # docs, size of each doc, etc. I'm just looking for general
observations. I've found other articles comparing Solr with Sphinx and most
state that performance is similar between the two.


I can't speak to Sphinx, as I haven't used it.

As for performance tests, those are always apples and oranges.  If one
camp does them, then the other camp says "You don't know how to use
our product", and vice versa.  I think that applies here.  So, when you
see things like "Internal tests show...", that is always a red flag in my
mind.  I've contacted others in the past who have done comparisons
and after one round of emailing it was almost always clear that they
didn't know what best practices are for any given product and thus
were doing things sub-optimally.


One thing in the article that is worthwhile to consider is the fact
that some (most?) people would likely benefit from not removing
stopwords, as they can enhance phrase-based searching and thus improve
relevance.  Obviously, with Solr, it is easy to keep stopwords by
simply removing the StopFilterFactory from the analysis process and
then dealing with them appropriately at query time.  However, it is
likely the case that too many Solr users simply rely on the example
schema when it comes to setup instead of actively investigating what
the proper choices are for their situation.
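
To make the stopword point concrete, here is a small, self-contained Java sketch of how dropping stopwords can blur a phrase search. It is a toy model, not Lucene's analysis chain: the stopword list, analyze, and phraseMatches are assumptions made for this example, and it ignores details such as position increments that a real analyzer can preserve.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy analyzer and phrase matcher for illustration; not Lucene or Solr code.
final class StopwordPhraseSketch {

    // A tiny, assumed stopword list.
    static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("a", "an", "is", "the", "of", "to"));

    /** Lowercases, splits on whitespace, and optionally drops stopwords. */
    static List<String> analyze(String text, boolean removeStopwords) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (t.isEmpty()) continue;
            if (removeStopwords && STOPWORDS.contains(t)) continue;
            tokens.add(t);
        }
        return tokens;
    }

    /** True if the analyzed phrase occurs as a contiguous run in the analyzed doc. */
    static boolean phraseMatches(String doc, String phrase, boolean removeStopwords) {
        List<String> docTokens = analyze(doc, removeStopwords);
        List<String> phraseTokens = analyze(phrase, removeStopwords);
        if (phraseTokens.isEmpty()) return false; // phrase was nothing but stopwords
        return Collections.indexOfSubList(docTokens, phraseTokens) >= 0;
    }

    public static void main(String[] args) {
        String doc = "Vitamin B is a supplement";
        // With stopwords removed, "vitamin a" collapses to "vitamin": a false match.
        System.out.println(phraseMatches(doc, "vitamin a", true));  // true
        // With stopwords kept, the phrase must really appear in the document.
        System.out.println(phraseMatches(doc, "vitamin a", false)); // false
    }
}
```

The same toy model also shows the other failure mode: a phrase made only of stopwords, such as "to the", analyzes to nothing at all once stopwords are removed.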


Finally, an old baseball saying comes to mind: "Pitchers only bother
to throw at .300 hitters."  Solr is a pretty darn full-featured search
platform with a large and active community, a commercial-friendly
license, and it also performs quite well.


-Grant


Re: Solr vs Sphinx

2009-05-13 Thread Todd Benge
Our company has a large search deployment serving 50 M search hits per
day.

We've been leveraging Lucene for several years and have recently deployed
Solr for the distributed search feature.  We were hitting scaling limits
with Lucene due to our index size.

I did an evaluation of Sphinx and found Solr / Lucene to be more suitable
for our needs and much more flexible.  Performance in the Solr deployment
(especially with 1.4) has been better than expected.

Thanks to all the Solr developers for a great product.

Hopefully we'll have the opportunity to contribute to the project as it
moves forward.

Todd

On Wed, May 13, 2009 at 10:33 AM, Grant Ingersoll gsing...@apache.orgwrote:


 On May 13, 2009, at 11:55 AM, wojtekpia wrote:


 I came across this article praising Sphinx:
 http://www.theregister.co.uk/2009/05/08/dziuba_sphinx/. The article
 specifically mentions Solr as an 'aging' technology,


 Solr is the same age as Sphinx (2006), so if Solr is aging, then so is
 Sphinx.  But, hey aren't we all aging?  It sure beats not aging.  ;-)  That
 being said, we are always open to suggestions and improvements.  Lucene has
 seen a massive speedup on indexing that comes through in Solr in the past
 year (and it was fast before), and Solr 1.4 looks to be faster than 1.3 (and
 it was fast before, too.)  The Solr community is clearly interested in
 moving things forward and staying fresh, as is the Lucene community.

  and states that
 performance on Sphinx is 2x-4x faster than Solr. Has anyone compared
 Sphinx
 to Solr? Or used Sphinx in the past? I realize that you can't just say one
 is faster than the other because it depends so much on configuration,
 requirements, # docs, size of each doc, etc. I'm just looking for general
 observations. I've found other articles comparing Solr with Sphinx and
 most
 state that performance is similar between the two.


 I can't speak to Sphinx, as I haven't used it.

 As for performance tests, those are always apples and oranges.  If one camp
 does them, then the other camp says You don't know how to use our product
 and vice versa.  I think that applies here.  So, when you see things like
 Internal tests show that is always a red flag in my mind.  I've contacted
 others in the past who have done comparisons and after one round of
 emailing it was almost always clear that they didn't know what best
 practices are for any given product and thus were doing things
 sub-optimally.

 One thing in the article that is worthwhile to consider is the fact that
 some (most?) people would likely benefit from not removing stopwords, as
 they can enhance phrase based searching and thus improve relevance.
  Obviously, with Solr, it is easy to keep stopwords by simply removing the
 StopFilterFactory from the analysis process and then dealing with them
 appropriately at query time.  However, it is likely the case that too many
 Solr users simply rely on the example schema when it comes to setup instead
 of actively investigating what the proper choices are for their situation.

 Finally, an old baseball saying comes to mind: Pitchers only bother to
 throw at .300 hitters.  Solr is a pretty darn full featured search platform
 with a large and active community, a commercial friendly license, and it
 also performs quite well.

 -Grant