All our queries return the total count as well, and on average a query matches about 10% of the total documents. The index I am talking about has around 13 million documents, so roughly 1.3 million documents match an average query. Of course the matches won't all be overlapping, so I am guessing that around 30-50% of the documents get matched by the daily queries.
I tried hard to find out whether you can tell Solr to stop searching after a certain count - I don't mean the number of rows, but something like MySQL's LIMIT - so that it doesn't have to spend time calculating the total count when it is only returning a few rows to the UI. We are OK with showing the count as "1000+" (if it's more than 1000), but I couldn't find any way to do this. A rough sketch of the kind of workaround I have in mind at the Lucene level is included below.
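The usual trick at the raw Lucene level seems to be a Collector that throws an exception once it has seen N matches, with the caller catching it. This is a minimal, untested sketch against the Lucene 3.x Collector API - the class and exception names are mine, not anything that ships with Lucene, and wiring it into Solr itself would still need a custom search component:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Wraps a real collector and aborts the search after maxHits matches,
// so the count can be reported as "maxHits+" instead of being computed
// exactly across all the matching documents.
public class CappedCollector extends Collector {

    // Unchecked so it can escape collect(); caught by the caller.
    public static final class CapExceededException extends RuntimeException {}

    private final Collector delegate;
    private final int maxHits;
    private int hitCount;

    public CappedCollector(Collector delegate, int maxHits) {
        this.delegate = delegate;
        this.maxHits = maxHits;
    }

    @Override
    public void setScorer(Scorer scorer) throws IOException {
        delegate.setScorer(scorer);
    }

    @Override
    public void collect(int doc) throws IOException {
        if (++hitCount > maxHits) {
            throw new CapExceededException(); // we know it's "maxHits+"; stop here
        }
        delegate.collect(doc);
    }

    @Override
    public void setNextReader(IndexReader reader, int docBase) throws IOException {
        delegate.setNextReader(reader, docBase);
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return delegate.acceptsDocsOutOfOrder();
    }
}

Used roughly like this (searcher and query assumed to exist already):

TopScoreDocCollector top = TopScoreDocCollector.create(20, true);
try {
    searcher.search(query, new CappedCollector(top, 1000));
} catch (CappedCollector.CapExceededException e) {
    // more than 1000 matches - display the count as "1000+"
}
TopDocs firstPage = top.topDocs();

The obvious catch: once the search bails out early, the collected docs are only the best of the first 1000 in index order rather than the global top hits, so this is only fine if approximate ranking is acceptable for these queries.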
On Sat, Feb 5, 2011 at 7:45 AM, Otis Gospodnetic <otis_gospodne...@yahoo.com>
wrote:

> Heh, I'm not sure if this is valid thinking. :)
>
> By *matching* doc distribution I meant: what proportion of your millions of
> documents actually ever get matched and then how many of those make it to
> the UI.
> If you have 1000 queries in a day and they all end up matching only 3 of
> your docs, the system will need less RAM than a system where 1000 queries
> match 50000 different docs.
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> ----- Original Message ----
> > From: Salman Akram <salman.ak...@northbaysolutions.net>
> > To: solr-user@lucene.apache.org
> > Sent: Fri, February 4, 2011 3:38:55 PM
> > Subject: Re: Performance optimization of Proximity/Wildcard searches
> >
> > Well I assume many people out there would have indexes larger than 100GB
> > and I don't think so normally you will have more RAM than 32GB or 64!
> >
> > As I mentioned the queries are mostly phrase, proximity, wildcard and
> > combination of these.
> >
> > What exactly do you mean by distribution of documents? On this index our
> > documents are not more than few hundred KB's on average (file system
> > size) and there are around 14 million documents. 80% of the index size
> > is taken up by position file. I am not sure if this is what you asked?
> >
> > On Fri, Feb 4, 2011 at 5:19 PM, Otis Gospodnetic <
> > otis_gospodne...@yahoo.com> wrote:
> >
> > > Hi,
> > >
> > > > Sharding is an option too but that too comes with limitations so
> > > > want to keep that as a last resort but I think there must be other
> > > > things coz 150GB is not too big for one drive/server with 32GB Ram.
> > >
> > > Hmm.... what makes you think 32 GB is enough for your 150 GB index?
> > > It depends on queries and distribution of matching documents, for
> > > example. What's yours like?
> > >
> > > Otis
> > > ----
> > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > Lucene ecosystem search :: http://search-lucene.com/
> > >
> > >
> > >
> > > ----- Original Message ----
> > > > From: Salman Akram <salman.ak...@northbaysolutions.net>
> > > > To: solr-user@lucene.apache.org
> > > > Sent: Tue, January 25, 2011 4:20:34 AM
> > > > Subject: Performance optimization of Proximity/Wildcard searches
> > > >
> > > > Hi,
> > > >
> > > > I am facing performance issues in three types of queries (and their
> > > > combination). Some of the queries take more than 2-3 mins. Index
> > > > size is around 150GB.
> > > >
> > > > - Wildcard
> > > > - Proximity
> > > > - Phrases (with common words)
> > > >
> > > > I know CommonGrams and Stop words are a good way to resolve such
> > > > issues but they don't fulfill our functional requirements (Common
> > > > Grams seem to have issues with phrase proximity, stop words have
> > > > issues with exact match etc).
> > > >
> > > > Sharding is an option too but that too comes with limitations so
> > > > want to keep that as a last resort but I think there must be other
> > > > things coz 150GB is not too big for one drive/server with 32GB Ram.
> > > >
> > > > Cache warming is a good option too but the index get updated every
> > > > hour so not sure how much would that help.
> > > >
> > > > What are the other main tips that can help in performance
> > > > optimization of the above queries?
> > > >
> > > > Thanks
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > Salman Akram
> >
> > --
> > Regards,
> >
> > Salman Akram
>

--
Regards,

Salman Akram
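P.S. Since CommonGrams came up again in the quoted thread: for anyone who wants to see what it actually does to the token stream, here is a small self-contained sketch using Lucene's CommonGramsFilter. It is untested and written against a recent analyzers-common API (packages and constructors have moved around between versions), and the field name and common-words list are made up for illustration. In Solr the same behaviour is configured in schema.xml via solr.CommonGramsFilterFactory (index side) and solr.CommonGramsQueryFilterFactory (query side).

import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.commongrams.CommonGramsFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CommonGramsDemo {
    public static void main(String[] args) throws Exception {
        // The handful of very frequent words that make phrase/proximity
        // queries crawl, because their position lists are enormous.
        final CharArraySet commonWords =
                new CharArraySet(Arrays.asList("the", "of", "and"), true);

        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer source = new StandardTokenizer();
                // Emits bigrams like "the_house" alongside the plain terms,
                // so a phrase query can hit the rare bigram instead of
                // scanning the huge postings for "the".
                TokenStream sink = new CommonGramsFilter(source, commonWords);
                return new TokenStreamComponents(source, sink);
            }
        };

        // Should print roughly: the, the_house, house, house_of, of, of_cards, cards
        try (TokenStream ts = analyzer.tokenStream("body", "the house of cards")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term);
            }
            ts.end();
        }
    }
}

The point of the bigrams is that a phrase query like "house of cards" can match the comparatively rare "house_of"/"of_cards" terms instead of walking the enormous position list for "of" - which is exactly the 80%-of-index position-file problem mentioned in the quoted thread above - though, as I said, in our tests it interacts badly with sloppy phrase (proximity) queries.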