Re: Solr Pagination

2015-10-10 Thread Salman Ansari
Regarding Solr performance issue I was facing, I upgraded my Solr machine
to have
8 cores
56 GB RAM
8 GB JVM

However, unfortunately, I am still getting delays. I have run

* the query "Football" with start=0 and rows=10 and it took around 7.329
seconds
* the query "Football" with start=1000 and rows=10 and it took around
21.994 seconds

I was looking at the Solr admin UI and saw that the RAM and JVM heap are not
being utilized to the maximum, not even half or a quarter. How do I push data
into the cache once Solr starts? And is pushing data into the cache the right
strategy to solve the issue?
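For reference, a minimal SolrJ sketch that reproduces the two timings above
(the core name sabr102 and field content_text come from the log lines quoted
later in this thread; the URL is a placeholder):

// Hypothetical timing check for the two queries above.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PagingTimer {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client =
            new HttpSolrClient("http://localhost:8983/solr/sabr102");
        for (int start : new int[] {0, 1000}) {
            SolrQuery q = new SolrQuery("content_text:Football");
            q.setStart(start);  // deep paging: Solr must collect start+rows docs
            q.setRows(10);
            QueryResponse rsp = client.query(q);
            System.out.printf("start=%d hits=%d QTime=%dms%n",
                start, rsp.getResults().getNumFound(), rsp.getQTime());
        }
        client.close();
    }
}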

Appreciate your comments.

Regards,
Salman



On Sat, Oct 10, 2015 at 11:55 AM, Salman Ansari 
wrote:

> Thanks Shawn for your response. Based on that
> 1) Can you please direct me where I can get more information about cold
> shard vs hot shard?
>
> 2)  That 10GB number assumes there's no other software on the machine,
> like a database server or a webserver.
> Yes the machine is dedicated for Solr
>
> 3) How much index data is on the machine?
> I have 3 collections 2 for testing (so the aggregate of both of them does
> not exceed 1M document) and the main collection that I am querying now
> which contains around 69M. I have distributed all my collections into 2
> shards each with 2 replicas. The consumption on the hard disk is about 40GB.
>
> 4) A memory size of 14GB would be unusual for a physical machine, and
> makes me wonder if you're using virtual machines
> Yes, I am using a virtual machine, as bare metal would be difficult in my
> case since all of our data center is in the cloud. I can increase its
> capacity though. While testing some edge cases on Solr, I saw in the Solr
> admin UI that memory sometimes reaches its limit (14GB RAM, and 4GB JVM).
>
> 5) Just to confirm, I have combined the lessons from
>
> http://www.slideshare.net/lucidworks/high-performance-solr-and-jvm-tuning-strategies-used-for-map-quests-search-ahead-darren-spehr
> AND
> https://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache
>
> to come up with the following settings
>
> FilterCache
>
> <filterCache size="16384"
>              initialSize="4096"
>              autowarmCount="4096"/>
>
> DocumentCache
>
> <documentCache size="16384"
>                initialSize="16384"
>                autowarmCount="0"/>
>
> NewSearcher and FirstSearcher
>
> <listener event="newSearcher" class="solr.QuerySenderListener">
>   <arr name="queries">
>     <lst>
>       <str name="q">*</str>
>       <str name="sort">score desc id desc</str>
>     </lst>
>   </arr>
> </listener>
> <listener event="firstSearcher" class="solr.QuerySenderListener">
>   <arr name="queries">
>     <lst>
>       <str name="q">*</str>
>       <str name="sort">score desc id desc</str>
>     </lst>
>     <lst>
>       <str name="q">*</str>
>       <str name="facet.field">category</str>
>     </lst>
>   </arr>
> </listener>
>
> Will this make Solr use more of its caches and prepopulate them?
>
> Regards,
> Salman
>
>
>
>
> On Sat, Oct 10, 2015 at 5:10 AM, Shawn Heisey  wrote:
>
>> On 10/9/2015 1:39 PM, Salman Ansari wrote:
>>
>> > INFO  - 2015-10-09 18:46:17.953; [c:sabr102 s:shard1 r:core_node2
>> > x:sabr102_shard1_replica1] org.apache.solr.core.SolrCore;
>> > [sabr102_shard1_replica1] webapp=/solr path=/select
>> > params={start=0&q=(content_text:Football)&rows=10} hits=24408 status=0
>> > QTime=3391
>>
>> Over 3 seconds for a query like this definitely sounds like there's a
>> problem.
>>
>> > INFO  - 2015-10-09 18:47:04.727; [c:sabr102 s:shard1 r:core_node2
>> > x:sabr102_shard1_replica1] org.apache.solr.core.SolrCore;
>> > [sabr102_shard1_replica1] webapp=/solr path=/select
>> > params={start=1000&q=(content_text:Football)&rows=10} hits=24408
>> status=0
>> > QTime=21569
>>
>> Adding a start value of 1000 increases QTime by a factor of more than
>> 6?  Even more evidence of a performance problem.
>>
>> For comparison purposes, I did a couple of simple queries on a large
>> index of mine.  Here are the response headers showing the QTime value
>> and all the parameters (except my shard URLs) for each query:
>>
>>   "responseHeader": {
>> "status": 0,
>> "QTime": 1253,
>> "params": {
>>   "df": "catchall",
>>   "spellcheck.maxCollationEvaluations": "2",
>>   "spellcheck.dictionary": "default",
>>   "echoParams": "all",
>>   "spellcheck.maxCollations": "5",
>>   "q.op": "AND",
>>   "shards.info": "true",
>>   "spellcheck.maxCollationTries": "2",
>>   "rows": "70",
>>   "spellcheck.extendedResults": "false",
>>   "shards": "REDACTED SEVEN SHARD URLS",
>>   "shards.tolerant": "true",
>>   "spellcheck.onlyMorePopular": "false",
>>   "facet.method": "enum",
>>   "spellcheck.count": "9",
>>   "q": "catchall:carriage",
>>   "indent": "true",
>>   "wt": "json",
>>   "_": "120900498"
>> }
>>
>>
>>   "responseHeader": {
>> "status": 0,
>> "QTime": 176,
>> "params": {
>>   "df": "catchall",
>>   "spellcheck.maxCollationEvaluations": "2",
>>   "spellcheck.dictionary": "default",
>>   "echoParams": "all",
>>   "spellcheck.maxCollations": "5",
>>   "q.op": "AND",
>>   "shards.info": "true",
>>   "spellcheck.maxCollationTries": "2",
>>   "rows": "70",
>>   

NullPointerException

2015-10-10 Thread Mark Fenbers

Greetings!

I'm new to Solr Spellchecking...  I have yet to get it to work.

Attached is a snippet from my solrconfig.xml pertaining to my spellcheck 
efforts.


When I use the Admin UI (v5.3.0), and check the spellcheck.build box, I 
get a NullPointerException stacktrace.  The actual stacktrace is at the 
bottom of the attachment.  My spellcheck.q is the following:

Solr will yuse suggestions frum both.

The FileBasedSpellChecker.build method is clearly the problem 
(determined from the stack trace), but I cannot figure out why.
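The same request as a hedged SolrJ sketch (the /spell handler name is an
assumption matching the config below; the URL is a placeholder):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.SpellCheckResponse;

public class SpellCheckBuild {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client =
            new HttpSolrClient("http://localhost:8983/solr/EventLog2");
        SolrQuery q = new SolrQuery();
        q.setRequestHandler("/spell");      // handler name assumed from config
        q.set("spellcheck", true);
        q.set("spellcheck.build", true);    // same box as in the admin UI
        q.set("spellcheck.q", "Solr will yuse suggestions frum both.");
        SpellCheckResponse rsp = client.query(q).getSpellCheckResponse();
        System.out.println(rsp.getSuggestions()); // expect "yuse" and "frum"
        client.close();
    }
}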


Maybe I don't need to do a build on it...(?)  If I don't, the 
spell-checker finds no misspelled words, yet "yuse" and "frum" are not 
stand-alone words in /usr/share/dict/words.


/usr/share/dict/words exists and has global read permissions.  I 
displayed the file and see no issues (i.e., one word per line) although 
some "words" are a string of digits, but that shouldn't matter.


Does my snippet give any clues about why I would get this error? Is my 
stripped-down configuration missing something, perhaps?


Mark

  
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text_en</str>
  <lst name="spellchecker">
    <str name="classname">solr.FileBasedSpellChecker</str>
    <str name="field">logtext</str>
    <str name="name">FileDict</str>
    <str name="sourceLocation">/usr/share/dict/words</str>
    <str name="characterEncoding">UTF-8</str>
    <str name="spellcheckIndexDir">/localapps/dev/EventLog/solr/EventLog2/data/spFile</str>
  </lst>
</searchComponent>

<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="spellcheck.dictionary">FileDict</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.alternativeTermCount">5</str>
    <str name="spellcheck.maxResultsForSuggest">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>
    <str name="spellcheck.maxCollationTries">10</str>
    <str name="spellcheck.maxCollations">5</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>


"trace": "java.lang.NullPointerException\n\tat 
org.apache.lucene.search.spell.SpellChecker.indexDictionary(SpellChecker.java:509)\n\tat
 
org.apache.solr.spelling.FileBasedSpellChecker.build(FileBasedSpellChecker.java:74)\n\tat
 
org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:124)\n\tat
 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:251)\n\tat
 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)\n\tat
 org.apache.solr.core.SolrCore.execute(SolrCore.java:2068)\n\tat 
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:669)\n\tat 
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:462)\n\tat 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:210)\n\tat
 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)\n\tat
 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)\n\tat
 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)\n\tat
 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)\n\tat
 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)\n\tat
 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)\n\tat 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat
 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)\n\tat
 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)\n\tat
 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)\n\tat
 org.eclipse.jetty.server.Server.handle(Server.java:499)\n\tat 
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)\n\tat 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)\n\tat
 
org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)\n\tat
 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)\n\tat
 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)\n\tat
 java.lang.Thread.run(Thread.java:745)\n",


Re: How to show some documents ahead of others - requirements

2015-10-10 Thread Upayavira
I've seen a similar requirement to this recently.

Basically, a sorting requirement that is close to impossible to
implement as a scoring/boosting formula, because the *position* of the
result features in the score, and that's not something I believe can be
done right now.

The way we solved the issue in the similar case I referred to above was
by using a RerankQuery. That query class has a getTopDocsCollector()
function, which you can override, providing your own Collector.

If you then refer to your query (actually your query parser) with the
rerank query param in Solr, rq={!myRerankQuery}, it will trigger your new
collector which, when its topDocs() method is called, will call topDocs on
its parent query, get a list of documents, order them in whatever way you
require, and return them in a non-score order.

Not sure I've made that very clear, but hope it helps a little.

Upayavira

On Sat, Oct 10, 2015, at 03:13 PM, liviuchrist...@yahoo.com.INVALID
wrote:
> Hi Upayavira & Walter & everyone else
> 
> About the requirements:
> 1. I need to return no more than 3 paid results on a page of 12 results.
> 2. Paid results should be sorted like this: let's say a user is searching
> for "chocolate almonds cake". Now, let's say that 2000 results match the
> query and there are about 10 of these that are "paid results". I need to
> list the first 3 (1-2-3) of the paid results (in their decreasing ranking
> order) on the first page (maybe by improving the ranking of the 20 paid
> results over the non-paid ones and listing the first 3 of them), and then
> list the 9 non-paid results on the page in their decreasing ranking order.
> Then, on the second page, I want to list first the next 3 paid results
> (4-5-6) and so on.
>
> Kind regards,
> Christian
> Christian Fotache Tel: 0728.297.207
> 
>   From: Upayavira 
>  To: solr-user@lucene.apache.org 
>  Sent: Thursday, October 8, 2015 7:03 PM
>  Subject: Re: How to show some documents ahead of others
>
> Hence the suggestion to group by the paid field - would give you two
> lists of the number you ask for.
> 
> What I'm trying to say is that the QueryElevationComponent might do it,
> but it is also relatively clunky, so a pure search solution might do it.
> 
> However, the thing we lack right now is a full take on the requirements,
> e.g. how should paid results be sorted, how many paid results do you
> show, etc, etc. Without these details we're all guessing.
> 
> Upayavira
> 
> 
> On Thu, Oct 8, 2015, at 04:45 PM, Walter Underwood wrote:
> > Sorting all paid above all unpaid will give bad results when there are
> > many matches. It will show 1000 paid items, include all the barely
> > relevant ones, before it shows the first highly relevant unpaid recipe.
> > What if that was the only correct result?
> > 
> > Two approaches that work:
> > 
> > 1. Boost paid items using the “boost” parameter in edismax. Adjust it to
> > be a tiebreaker between documents with similar score.
> > 
> > 2. Show two lists, one with the five most relevant paid, the next with
> > the five most relevant unpaid.
> > 
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> > 
> > 
> > > On Oct 8, 2015, at 7:39 AM, Alessandro Benedetti 
> > >  wrote:
> > > 
> > > Is it possible to understand better this : "as it doesn't
> > > allow any meaningful customization " ?
> > > 
> > > Cheers
> > > 
> > > On 8 October 2015 at 15:27, Andrea Roggerone 
> > >  > >> wrote:
> > > 
> > >> Hi guys,
> > >> I don't think that sorting is a good solution in this case as it doesn't
> >> allow any meaningful customization. I believe that the advised
> >> QueryElevationComponent is one of the viable alternatives. Another one
> >> would be to boost at query time a particular field, like for instance
> >> paid. That would allow you to assign different boosts to different
> >> values using a function.
> > >> 
> > >> On Thu, Oct 8, 2015 at 1:48 PM, Upayavira  wrote:
> > >> 
> > >>> Or just have a field in your index -
> > >>> 
> > >>> paid: true/false
> > >>> 
> > >>> Then sort=paid desc, score desc
> > >>> 
> > >>> (you may need to sort paid asc, not sure which way a boolean would sort)
> > >>> 
> > >>> Question is whether you want to show ALL paid posts, or just a set of
> > >>> them. For the latter you could use result grouping on the paid field.
> > >>> 
> > >>> Upayavira
> > >>> 
> > >>> On Thu, Oct 8, 2015, at 01:34 PM, NutchDev wrote:
> >  Hi Christian,
> >  
> >  You can take a look at Solr's  QueryElevationComponent
> >    .
> >  
> >  It will allow you to configure the top results for a given query
> >  regardless
> >  of the normal lucene scoring. Also you can specify exclude document
> > >> list
> >  to
> >  exclude certain 

Re: How to show some documents ahead of others - requirements

2015-10-10 Thread Erick Erickson
Would result grouping work here? If the group key was "paid", then
you'd get two groups back, "paid" and "unpaid". Within each group you'd
have results ordered by rank. This would work for a page or two, but
eventually you'd be in a spot where you'd have to oversample, i.e.
return pages*X in each group to be able to page very deeply.
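As a hedged SolrJ illustration of that grouping approach (the "paid" field,
query text, and URL are assumptions from this thread):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.Group;
import org.apache.solr.client.solrj.response.GroupCommand;

public class PaidGrouping {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client =
            new HttpSolrClient("http://localhost:8983/solr/recipes");
        SolrQuery q = new SolrQuery("chocolate almonds cake");
        q.set("group", true);
        q.set("group.field", "paid");  // boolean field -> two groups
        q.set("group.limit", 12);      // docs kept per group; raise to page deeper
        for (GroupCommand cmd : client.query(q).getGroupResponse().getValues()) {
            for (Group g : cmd.getValues()) {
                System.out.println("paid=" + g.getGroupValue()
                    + " matches=" + g.getResult().getNumFound());
            }
        }
        client.close();
    }
}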

Or you could just fire two queries and have the app assemble the final list.

Best,
Erick

On Sat, Oct 10, 2015 at 8:13 AM, Upayavira  wrote:
> I've seen a similar requirement to this recently.
>
> Basically, a sorting requirement that is close to impossible to
> implement as a scoring/boosting formula, because the *position* of the
> result features in the score, and that's not something I believe can be
> done right now.
>
> The way we solved the issue in the similar case I referred to above was
> by using a RerankQuery. That query class has a getTopDocsCollector()
> function, which you can override, providing your own Collector.
>
> If you then refer to your query (actually your query parser) with the
> rerank query param in Solr, rq={!myRerankQuery}, it will trigger your new
> collector which, when its topDocs() method is called, will call topDocs on
> its parent query, get a list of documents, order them in whatever way you
> require, and return them in a non-score order.
>
> Not sure I've made that very clear, but hope it helps a little.
>
> Upayavira
>
> On Sat, Oct 10, 2015, at 03:13 PM, liviuchrist...@yahoo.com.INVALID
> wrote:
>> Hi Upayavira & Walter & everyone else
>>
>> About the requirements:
>> 1. I need to return no more than 3 paid results on a page of 12 results.
>> 2. Paid results should be sorted like this: let's say a user is searching
>> for "chocolate almonds cake". Now, let's say that 2000 results match the
>> query and there are about 10 of these that are "paid results". I need to
>> list the first 3 (1-2-3) of the paid results (in their decreasing ranking
>> order) on the first page (maybe by improving the ranking of the 20 paid
>> results over the non-paid ones and listing the first 3 of them), and then
>> list the 9 non-paid results on the page in their decreasing ranking order.
>> Then, on the second page, I want to list first the next 3 paid results
>> (4-5-6) and so on.
>>
>> Kind regards,
>> Christian
>> Christian Fotache Tel: 0728.297.207
>>
>>   From: Upayavira 
>>  To: solr-user@lucene.apache.org
>>  Sent: Thursday, October 8, 2015 7:03 PM
>>  Subject: Re: How to show some documents ahead of others
>>
>> Hence the suggestion to group by the paid field - would give you two
>> lists of the number you ask for.
>>
>> What I'm trying to say is that the QueryElevationComponent might do it,
>> but it is also relatively clunky, so a pure search solution might do it.
>>
>> However, the thing we lack right now is a full take on the requirements,
>> e.g. how should paid results be sorted, how many paid results do you
>> show, etc, etc. Without these details we're all guessing.
>>
>> Upayavira
>>
>>
>> On Thu, Oct 8, 2015, at 04:45 PM, Walter Underwood wrote:
>> > Sorting all paid above all unpaid will give bad results when there are
>> > many matches. It will show 1000 paid items, include all the barely
>> > relevant ones, before it shows the first highly relevant unpaid recipe.
>> > What if that was the only correct result?
>> >
>> > Two approaches that work:
>> >
>> > 1. Boost paid items using the “boost” parameter in edismax. Adjust it to
>> > be a tiebreaker between documents with similar score.
>> >
>> > 2. Show two lists, one with the five most relevant paid, the next with
>> > the five most relevant unpaid.
>> >
>> > wunder
>> > Walter Underwood
>> > wun...@wunderwood.org
>> > http://observer.wunderwood.org/  (my blog)
>> >
>> >
>> > > On Oct 8, 2015, at 7:39 AM, Alessandro Benedetti 
>> > >  wrote:
>> > >
>> > > Is it possible to understand better this : "as it doesn't
>> > > allow any meaningful customization " ?
>> > >
>> > > Cheers
>> > >
>> > > On 8 October 2015 at 15:27, Andrea Roggerone 
>> > > > > >> wrote:
>> > >
>> > >> Hi guys,
>> > >> I don't think that sorting is a good solution in this case as it doesn't
>> > >> allow any meaningful customization. I believe that the advised
>> > >> QueryElevationComponent is one of the viable alternatives. Another one
>> > >> would be to boost at query time a particular field, like for instance
>> > >> paid. That would allow you to assign different boosts to different
>> > >> values using a function.
>> > >>
>> > >> On Thu, Oct 8, 2015 at 1:48 PM, Upayavira  wrote:
>> > >>
>> > >>> Or just have a field in your index -
>> > >>>
>> > >>> paid: true/false
>> > >>>
>> > >>> Then sort=paid desc, score desc
>> > >>>
>> > >>> (you may need to sort paid asc, not sure which way a boolean would 
>> > >>> sort)
>> > >>>
>> > >>> Question is whether you want to 

How to use FuzzyQuery in schema.xml

2015-10-10 Thread vit
I am using Solr 4.2.
For some reason I cannot find an example of how to use FuzzyQuery in schema.xml.

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-use-FuzzyQuery-in-schema-xml-tp4233900.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to show some documents ahead of others - requirements

2015-10-10 Thread liviuchristian
Hi Upayavira & Walter & everyone else

About the requirements:
1. I need to return no more than 3 paid results on a page of 12 results.
2. Paid results should be sorted like this: let's say a user is searching for
"chocolate almonds cake". Now, let's say that 2000 results match the query and
there are about 10 of these that are "paid results". I need to list the first
3 (1-2-3) of the paid results (in their decreasing ranking order) on the first
page (maybe by improving the ranking of the 20 paid results over the non-paid
ones and listing the first 3 of them), and then list the 9 non-paid results on
the page in their decreasing ranking order.
Then, on the second page, I want to list first the next 3 paid results (4-5-6)
and so on.

Kind regards,
Christian
Christian Fotache Tel: 0728.297.207

  From: Upayavira 
 To: solr-user@lucene.apache.org 
 Sent: Thursday, October 8, 2015 7:03 PM
 Subject: Re: How to show some documents ahead of others
   
Hence the suggestion to group by the paid field - would give you two
lists of the number you ask for.

What I'm trying to say is that the QueryElevationComponent might do it,
but it is also relatively clunky, so a pure search solution might do it.

However, the thing we lack right now is a full take on the requirements,
e.g. how should paid results be sorted, how many paid results do you
show, etc, etc. Without these details we're all guessing.

Upayavira


On Thu, Oct 8, 2015, at 04:45 PM, Walter Underwood wrote:
> Sorting all paid above all unpaid will give bad results when there are
> many matches. It will show 1000 paid items, include all the barely
> relevant ones, before it shows the first highly relevant unpaid recipe.
> What if that was the only correct result?
> 
> Two approaches that work:
> 
> 1. Boost paid items using the “boost” parameter in edismax. Adjust it to
> be a tiebreaker between documents with similar score.
> 
> 2. Show two lists, one with the five most relevant paid, the next with
> the five most relevant unpaid.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
> > On Oct 8, 2015, at 7:39 AM, Alessandro Benedetti 
> >  wrote:
> > 
> > Is it possible to understand better this : "as it doesn't
> > allow any meaningful customization " ?
> > 
> > Cheers
> > 
> > On 8 October 2015 at 15:27, Andrea Roggerone  >> wrote:
> > 
> >> Hi guys,
> >> I don't think that sorting is a good solution in this case as it doesn't
> >> allow any meaningful customization. I believe that the advised
> >> QueryElevationComponent is one of the viable alternatives. Another one would
> >> be to boost at query time a particular field, like for instance paid. That
> >> would allow you to assign different boosts to different values using a
> >> function.
> >> 
> >> On Thu, Oct 8, 2015 at 1:48 PM, Upayavira  wrote:
> >> 
> >>> Or just have a field in your index -
> >>> 
> >>> paid: true/false
> >>> 
> >>> Then sort=paid desc, score desc
> >>> 
> >>> (you may need to sort paid asc, not sure which way a boolean would sort)
> >>> 
> >>> Question is whether you want to show ALL paid posts, or just a set of
> >>> them. For the latter you could use result grouping on the paid field.
> >>> 
> >>> Upayavira
> >>> 
> >>> On Thu, Oct 8, 2015, at 01:34 PM, NutchDev wrote:
>  Hi Christian,
>  
>  You can take a look at Solr's  QueryElevationComponent
>    .
>  
>  It will allow you to configure the top results for a given query
>  regardless
>  of the normal lucene scoring. Also you can specify exclude document
> >> list
>  to
>  exclude certain results for a particular query.
>  
>  
>  
>  
>  
>  --
>  View this message in context:
>  
> >>> 
> >> http://lucene.472066.n3.nabble.com/How-to-show-some-documents-ahead-of-others-tp4233481p4233490.html
>  Sent from the Solr - User mailing list archive at Nabble.com.
> >>> 
> >> 
> > 
> > 
> > 
> > -- 
> > --
> > 
> > Benedetti Alessandro
> > Visiting card - http://about.me/alessandro_benedetti
> > Blog - http://alexbenedetti.blogspot.co.uk
> > 
> > "Tyger, tyger burning bright
> > In the forests of the night,
> > What immortal hand or eye
> > Could frame thy fearful symmetry?"
> > 
> > William Blake - Songs of Experience -1794 England
> 

  

Re: How to show some documents ahead of others - requirements

2015-10-10 Thread Walter Underwood
By far the easiest solution is to do two queries from the front end.
One requesting three paid results, and one requesting nine unpaid results.
If all the results are in one collection, use “fq” to select paid/unpaid.

That is going to be fast and there is zero doubt that it will do the right 
thing. 
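A hedged SolrJ sketch of that two-query approach (the "paid" field, query
text, and URL are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocumentList;

public class TwoQueryPage {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client =
            new HttpSolrClient("http://localhost:8983/solr/recipes");
        int page = 0;  // zero-based page number
        SolrQuery paid = new SolrQuery("chocolate almonds cake");
        paid.addFilterQuery("paid:true");
        paid.setStart(page * 3);
        paid.setRows(3);    // three paid results per page
        SolrQuery unpaid = new SolrQuery("chocolate almonds cake");
        unpaid.addFilterQuery("paid:false");
        unpaid.setStart(page * 9);
        unpaid.setRows(9);  // nine unpaid results per page
        SolrDocumentList paidDocs = client.query(paid).getResults();
        SolrDocumentList unpaidDocs = client.query(unpaid).getResults();
        // The front end concatenates: 3 paid on top, then 9 unpaid.
        System.out.println(paidDocs.size() + " paid + "
            + unpaidDocs.size() + " unpaid");
        client.close();
    }
}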

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 10, 2015, at 9:31 AM, Erick Erickson  wrote:
> 
> Would result grouping work here? If the group key was "paid", then
> you'd get two groups back, "paid" and "unpaid". Within each group you'd
> have results ordered by rank. This would work for a page or two, but
> eventually you'd be in a spot where you'd have to oversample, i.e.
> return pages*X in each group to be able to page very deeply.
> 
> Or you could just fire two queries and have the app assemble the final list.
> 
> Best,
> Erick
> 
> On Sat, Oct 10, 2015 at 8:13 AM, Upayavira  wrote:
>> I've seen a similar requirement to this recently.
>> 
>> Basically, a sorting requirement that is close to impossible to
>> implement as a scoring/boosting formula, because the *position* of the
>> result features in the score, and that's not something I believe can be
>> done right now.
>> 
>> The way we solved the issue in the similar case I referred to above was
>> by using a RerankQuery. That query class has a getTopDocsCollector()
>> function, which you can override, providing your own Collector.
>> 
>> If you then refer to your query (actually your query parser) with the
>> rerank query param in Solr, rq={!myRerankQuery}, it will trigger your new
>> collector which, when its topDocs() method is called, will call topDocs on
>> its parent query, get a list of documents, order them in whatever way you
>> require, and return them in a non-score order.
>> 
>> Not sure I've made that very clear, but hope it helps a little.
>> 
>> Upayavira
>> 
>> On Sat, Oct 10, 2015, at 03:13 PM, liviuchrist...@yahoo.com.INVALID
>> wrote:
>>> Hi Upayavira & Walter & everyone else
>>> 
>>> About the requirements:
>>> 1. I need to return no more than 3 paid results on a page of 12 results.
>>> 2. Paid results should be sorted like this: let's say a user is searching
>>> for "chocolate almonds cake". Now, let's say that 2000 results match the
>>> query and there are about 10 of these that are "paid results". I need to
>>> list the first 3 (1-2-3) of the paid results (in their decreasing ranking
>>> order) on the first page (maybe by improving the ranking of the 20 paid
>>> results over the non-paid ones and listing the first 3 of them), and then
>>> list the 9 non-paid results on the page in their decreasing ranking order.
>>> Then, on the second page, I want to list first the next 3 paid results
>>> (4-5-6) and so on.
>>>
>>> Kind regards,
>>> Christian
>>> Christian Fotache Tel: 0728.297.207
>>> 
>>>  From: Upayavira 
>>> To: solr-user@lucene.apache.org
>>> Sent: Thursday, October 8, 2015 7:03 PM
>>> Subject: Re: How to show some documents ahead of others
>>> 
>>> Hence the suggestion to group by the paid field - would give you two
>>> lists of the number you ask for.
>>> 
>>> What I'm trying to say is that the QueryElevationComponent might do it,
>>> but it is also relatively clunky, so a pure search solution might do it.
>>> 
>>> However, the thing we lack right now is a full take on the requirements,
>>> e.g. how should paid results be sorted, how many paid results do you
>>> show, etc, etc. Without these details we're all guessing.
>>> 
>>> Upayavira
>>> 
>>> 
>>> On Thu, Oct 8, 2015, at 04:45 PM, Walter Underwood wrote:
 Sorting all paid above all unpaid will give bad results when there are
 many matches. It will show 1000 paid items, include all the barely
 relevant ones, before it shows the first highly relevant unpaid recipe.
 What if that was the only correct result?
 
 Two approaches that work:
 
 1. Boost paid items using the “boost” parameter in edismax. Adjust it to
 be a tiebreaker between documents with similar score.
 
 2. Show two lists, one with the five most relevant paid, the next with
 the five most relevant unpaid.
 
 wunder
 Walter Underwood
 wun...@wunderwood.org
 http://observer.wunderwood.org/  (my blog)
 
 
> On Oct 8, 2015, at 7:39 AM, Alessandro Benedetti 
>  wrote:
> 
> Is it possible to understand better this : "as it doesn't
> allow any meaningful customization " ?
> 
> Cheers
> 
> On 8 October 2015 at 15:27, Andrea Roggerone 
> > wrote:
> 
>> Hi guys,
>> I don't think that sorting is a good solution in this case as it doesn't
> >> allow any meaningful customization. I believe that the advised
> >> QueryElevationComponent is one of the viable alternatives. Another one
> >> would

Using SimpleNaiveBayesClassifier in solr

2015-10-10 Thread Yewint Ko
Hi

I am trying to use SimpleNaiveBayesClassifier in my solr project. Currently
I am looking at its test base, ClassificationTestBase.java.

The sample test code inside seems to show that the classifier reads the whole
index to train the model every time classification happens for an input
document, or am I misunderstanding something here? If I had a large index,
will it impact performance?

protected void checkCorrectClassification(Classifier<T> classifier, String inputDoc,
    T expectedResult, Analyzer analyzer, String textFieldName,
    String classFieldName, Query query) throws Exception {
  AtomicReader atomicReader = null;
  try {
    populateSampleIndex(analyzer);
    atomicReader = SlowCompositeReaderWrapper.wrap(indexWriter.getReader());
    classifier.train(atomicReader, textFieldName, classFieldName, analyzer, query);
    ClassificationResult<T> classificationResult = classifier.assignClass(inputDoc);
    assertNotNull(classificationResult.getAssignedClass());
    assertEquals("got an assigned class of " + classificationResult.getAssignedClass(),
        expectedResult, classificationResult.getAssignedClass());
    assertTrue("got a not positive score " + classificationResult.getScore(),
        classificationResult.getScore() > 0);
  } finally {
    if (atomicReader != null) {
      atomicReader.close();
    }
  }
}


Re: Exclude documents having same data in two fields

2015-10-10 Thread Upayavira
In that case you'd be happy to wait 30s for it to complete, so the func
or frange function query should be fine.
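For reference, a rough SolrJ sketch of Mikhail's terms-intersection suggestion
quoted below (field names M and T, the URL, and the /terms handler wiring are
placeholders):

import java.util.HashSet;
import java.util.Set;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.TermsResponse;

public class SameFieldTerms {
    static Set<String> terms(HttpSolrClient client, String field) throws Exception {
        SolrQuery q = new SolrQuery();
        q.setRequestHandler("/terms");
        q.setTerms(true);
        q.addTermsField(field);
        q.setTermsLimit(-1);  // fetch all terms of the field
        TermsResponse rsp = client.query(q).getTermsResponse();
        Set<String> out = new HashSet<>();
        for (TermsResponse.Term t : rsp.getTerms(field)) out.add(t.getTerm());
        return out;
    }

    public static void main(String[] args) throws Exception {
        HttpSolrClient client =
            new HttpSolrClient("http://localhost:8983/solr/collection1");
        Set<String> common = terms(client, "M");
        common.retainAll(terms(client, "T"));  // intersection, e.g. {c, f}
        StringBuilder fq = new StringBuilder("-(");
        for (String t : common) fq.append("(+M:").append(t).append(" +T:").append(t).append(") ");
        fq.append(")");
        System.out.println("fq=" + fq);  // e.g. fq=-((+M:c +T:c) (+M:f +T:f) )
        client.close();
    }
}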

Upayavira

On Fri, Oct 9, 2015, at 05:55 PM, Aman Tandon wrote:
> Thanks Mikhail for the suggestion. I will try that on Monday and will let
> you know.
> 
> @Walter This was just a random requirement, to find those fields which
> are not the same and then reindex only those. I can do a full index, but I
> was wondering if there might be some function or something.
> 
> With Regards
> Aman Tandon
> 
> On Fri, Oct 9, 2015 at 9:05 PM, Mikhail Khludnev
>  > wrote:
> 
> > Aman,
> >
> > You can invoke the Terms Component for the field M, let it return terms:
> > {a,c,d,f}
> > then invoke it for field T, let it return {b,c,f,e},
> > then intersect both lists (it's quite romantic if they are kept
> > ordered), and you've got {c,f}
> > and then you apply the filter:
> > fq=-((+M:c +T:c) (+M:f +T:f))
> > etc
> >
> >
> > On Thu, Oct 8, 2015 at 8:29 AM, Aman Tandon 
> > wrote:
> >
> > > Hi,
> > >
> > > Is there a way in Solr to remove all those documents from the search
> > > results in which two of the fields, *mapping* and *title*, are exactly
> > > the same?
> > >
> > > With Regards
> > > Aman Tandon
> > >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> > 
> > 
> >


Re: Exclude documents having same data in two fields

2015-10-10 Thread Walter Underwood
After several days, we finally get the real requirement. It really does waste a 
lot of time and energy when people won’t tell us that.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 10, 2015, at 8:19 AM, Upayavira  wrote:
> 
> In that case you'd be happy to wait 30s for it to complete, so the func
> or frange function query should be fine.
> 
> Upayavira
> 
> On Fri, Oct 9, 2015, at 05:55 PM, Aman Tandon wrote:
>> Thanks Mikhail for the suggestion. I will try that on Monday and will let
>> you know.
>> 
>> @Walter This was just a random requirement, to find those fields which
>> are not the same and then reindex only those. I can do a full index, but I
>> was wondering if there might be some function or something.
>> 
>> With Regards
>> Aman Tandon
>> 
>> On Fri, Oct 9, 2015 at 9:05 PM, Mikhail Khludnev
>> >> wrote:
>> 
>>> Aman,
>>> 
>>> You can invoke the Terms Component for the field M, let it return terms:
>>> {a,c,d,f}
>>> then invoke it for field T, let it return {b,c,f,e},
>>> then intersect both lists (it's quite romantic if they are kept
>>> ordered), and you've got {c,f}
>>> and then you apply the filter:
>>> fq=-((+M:c +T:c) (+M:f +T:f))
>>> etc
>>> 
>>> 
>>> On Thu, Oct 8, 2015 at 8:29 AM, Aman Tandon 
>>> wrote:
>>> 
 Hi,
 
 Is there a way in Solr to remove all those documents from the search
 results in which two of the fields, *mapping* and *title*, are exactly
 the same?
 
 With Regards
 Aman Tandon
 
>>> 
>>> 
>>> 
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev
>>> Principal Engineer,
>>> Grid Dynamics
>>> 
>>> 
>>> 
>>> 



Re: Solr Pagination

2015-10-10 Thread Shawn Heisey
On 10/10/2015 2:55 AM, Salman Ansari wrote:
> Thanks Shawn for your response. Based on that
> 1) Can you please direct me where I can get more information about cold
> shard vs hot shard?

I don't know of any information out there about hot/cold shards.  I can
describe it, though:

A split point is determined.  Everything older than the split point gets
divided by some method (usually hashing) between multiple cold shards.
Everything newer than the split point goes into the hot shard.  For my
index, there is only one hot shard, but it is possible to have multiple
hot shards.

On some interval (nightly in my index), the split point is adjusted and
documents are moved from the hot shard to the cold shards according to
that split point.  The hot shard is typically a lot smaller than the
cold shards, which helps increase indexing speed for new documents.
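A minimal sketch of one way such a rotation could look (the collection names,
timestamp field, and date math here are illustrative assumptions; the actual
tooling is not shown):

import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class NightlyRotation {
    public static void main(String[] args) throws Exception {
        // Everything older than the split point moves to the cold shards.
        String splitPoint = "NOW/DAY-7DAYS";
        // Step 1 (application-specific, not shown): reindex docs older than
        // the split point into the cold shards, e.g. hashed by document id.
        // Step 2: remove those docs from the hot shard.
        HttpSolrClient hot =
            new HttpSolrClient("http://localhost:8983/solr/hot_shard");
        hot.deleteByQuery("timestamp:[* TO " + splitPoint + "]");
        hot.commit();
        hot.close();
    }
}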

I am not using SolrCloud. I manage all my own sharding. There is no
capability included in SolrCloud that can do hot/cold sharding.

> 2)  That 10GB number assumes there's no other software on the machine, like
> a database server or a webserver.
> Yes the machine is dedicated for Solr
> 
> 3) How much index data is on the machine?
> I have 3 collections 2 for testing (so the aggregate of both of them does
> not exceed 1M document) and the main collection that I am querying now
> which contains around 69M. I have distributed all my collections into 2
> shards each with 2 replicas. The consumption on the hard disk is about 40GB.

That sounds like a recipe for a performance problem, although I am not
certain why the problem persisted after increasing the memory.  Perhaps
it has something to do with the filterCache, which I will get to further
down.

> 4) A memory size of 14GB would be unusual for a physical machine, and makes me
> wonder if you're using virtual machines
> Yes I am using virtual machine as using a bare metal will be difficult in
> my case as all of our data center is on the cloud. I can increase its
> capacity though. While testing some edge cases on Solr, I realized on Solr
> admin that the memory sometimes reaches to its limit (14GB RAM, and 4GB JVM)

This is how operating systems and Java are designed to work.  When
things are running well, all of physical memory might be allocated, and
the heap will become full on a semi-regular basis.  If it *stays* full,
that usually means it needs to be larger.  The admin UI is a poor tool
for watching JVM memory usage.

> 5) Just to confirm, I have combined the lessons from
> 
> http://www.slideshare.net/lucidworks/high-performance-solr-and-jvm-tuning-strategies-used-for-map-quests-search-ahead-darren-spehr
> AND
> https://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache
> 
> to come up with the following settings
> 
> FilterCache
>
> <filterCache size="16384"
>              initialSize="4096"
>              autowarmCount="4096"/>

That's a very very large cache size.  It is likely to use a VERY large
amount of heap, and autowarming up to 4096 entries at commit time might
take many *minutes*.  Each filterCache entry is maxDoc/8 bytes.  On an
index core with 70 million documents, each filterCache entry is at least
8.75 million bytes.  Multiply that by 16384, and a completely full cache
would need about 140GB of heap memory.  4096 entries will require 35GB.
 I don't think this cache is actually storing that many entries, or you
would most certainly be running into OutOfMemoryError exceptions.
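That arithmetic as a quick self-contained check (decimal GB):

public class FilterCacheMath {
    public static void main(String[] args) {
        long maxDoc = 70_000_000L;        // docs in the index core
        long bytesPerEntry = maxDoc / 8;  // one bit per doc = 8.75 MB per entry
        System.out.println(16384 * bytesPerEntry / 1_000_000_000.0
            + " GB for 16384 entries");   // ~143.36 GB
        System.out.println(4096 * bytesPerEntry / 1_000_000_000.0
            + " GB for 4096 entries");    // ~35.84 GB
    }
}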

> size="16384"
>initialSize="16384"
>autowarmCount="0"/>
> 
> NewSearcher and FirsSearcher
> 
> 
>   
>*score desc id
> desc
>   
> 
> 
>   
>  * score desc id desc 
> 
>  *
>   category
>   
> 
> 
> Will this be using more cache in Solr and prepoupulate it?

The newSearcher entry will result in one entry in the queryResultCache,
and an unknown number of entries in the documentCache -- that depends on
the "rows" parameter on the /select handler (defaults to 10) and the
queryResultMaxDocsCached parameter.

The firstSearcher entry does two queries, but because the "q" parameter
is identical on them, it will only result in one entry in the
queryResultCache.  One of them has facet.field, but you did not include
facet=true, so the facet query will not actually be run.  Without the
facet query, the filterCache will not be populated.

I think the design intent for newSearcher and firstSearcher is to load
critical index data into the OS disk cache.  It's not so much about
warming the Solr caches as it is about priming the system as a whole.

Note that the wildcard query you are running (q=*) is relatively slow,
but is an excellent choice for a warming query, because it actually
reads every single term from the default field.  Because of how slow
this query can run, setting useColdSearcher to true is recommended.

Thanks,
Shawn



Using SimpleNaiveBayesClassifier in solr

2015-10-10 Thread Yewint Ko
Hi

I am trying to use NaiveBayesClassifier in my solr project. Currently I am
looking at its test case, ClassificationTestBase.java.

The code below seems to show that the classifier reads the whole index to
train the model every time classification happens for an input document, or
am I misunderstanding something here? If I had a large index, will it impact
performance?

protected void checkCorrectClassification(Classifier<T> classifier, String inputDoc,
    T expectedResult, Analyzer analyzer, String textFieldName,
    String classFieldName, Query query) throws Exception {
  AtomicReader atomicReader = null;
  try {
    populateSampleIndex(analyzer);
    atomicReader = SlowCompositeReaderWrapper.wrap(indexWriter.getReader());
    classifier.train(atomicReader, textFieldName, classFieldName, analyzer, query);
    ClassificationResult<T> classificationResult = classifier.assignClass(inputDoc);
    assertNotNull(classificationResult.getAssignedClass());
    assertEquals("got an assigned class of " + classificationResult.getAssignedClass(),
        expectedResult, classificationResult.getAssignedClass());
    assertTrue("got a not positive score " + classificationResult.getScore(),
        classificationResult.getScore() > 0);
  } finally {
    if (atomicReader != null) {
      atomicReader.close();
    }
  }
}


Re: Solr Pagination

2015-10-10 Thread Salman Ansari
Thanks Shawn for your response. Based on that
1) Can you please direct me where I can get more information about cold
shard vs hot shard?

2)  That 10GB number assumes there's no other software on the machine, like
a database server or a webserver.
Yes the machine is dedicated for Solr

3) How much index data is on the machine?
I have 3 collections 2 for testing (so the aggregate of both of them does
not exceed 1M document) and the main collection that I am querying now
which contains around 69M. I have distributed all my collections into 2
shards each with 2 replicas. The consumption on the hard disk is about 40GB.

4) A memory size of 14GB would be unusual for a physical machine, and makes me
wonder if you're using virtual machines
Yes, I am using a virtual machine, as bare metal would be difficult in my
case since all of our data center is in the cloud. I can increase its
capacity though. While testing some edge cases on Solr, I saw in the Solr
admin UI that memory sometimes reaches its limit (14GB RAM, and 4GB JVM).

5) Just to confirm, I have combined the lessons from

http://www.slideshare.net/lucidworks/high-performance-solr-and-jvm-tuning-strategies-used-for-map-quests-search-ahead-darren-spehr
AND
https://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache

to come up with the following settings

FilterCache

<filterCache size="16384"
             initialSize="4096"
             autowarmCount="4096"/>

DocumentCache

<documentCache size="16384"
               initialSize="16384"
               autowarmCount="0"/>

NewSearcher and FirstSearcher

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*</str>
      <str name="sort">score desc id desc</str>
    </lst>
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*</str>
      <str name="sort">score desc id desc</str>
    </lst>
    <lst>
      <str name="q">*</str>
      <str name="facet.field">category</str>
    </lst>
  </arr>
</listener>

Will this make Solr use more of its caches and prepopulate them?

Regards,
Salman




On Sat, Oct 10, 2015 at 5:10 AM, Shawn Heisey  wrote:

> On 10/9/2015 1:39 PM, Salman Ansari wrote:
>
> > INFO  - 2015-10-09 18:46:17.953; [c:sabr102 s:shard1 r:core_node2
> > x:sabr102_shard1_replica1] org.apache.solr.core.SolrCore;
> > [sabr102_shard1_replica1] webapp=/solr path=/select
> > params={start=0&q=(content_text:Football)&rows=10} hits=24408 status=0
> > QTime=3391
>
> Over 3 seconds for a query like this definitely sounds like there's a
> problem.
>
> > INFO  - 2015-10-09 18:47:04.727; [c:sabr102 s:shard1 r:core_node2
> > x:sabr102_shard1_replica1] org.apache.solr.core.SolrCore;
> > [sabr102_shard1_replica1] webapp=/solr path=/select
> > params={start=1000&q=(content_text:Football)&rows=10} hits=24408 status=0
> > QTime=21569
>
> Adding a start value of 1000 increases QTime by a factor of more than
> 6?  Even more evidence of a performance problem.
>
> For comparison purposes, I did a couple of simple queries on a large
> index of mine.  Here are the response headers showing the QTime value
> and all the parameters (except my shard URLs) for each query:
>
>   "responseHeader": {
> "status": 0,
> "QTime": 1253,
> "params": {
>   "df": "catchall",
>   "spellcheck.maxCollationEvaluations": "2",
>   "spellcheck.dictionary": "default",
>   "echoParams": "all",
>   "spellcheck.maxCollations": "5",
>   "q.op": "AND",
>   "shards.info": "true",
>   "spellcheck.maxCollationTries": "2",
>   "rows": "70",
>   "spellcheck.extendedResults": "false",
>   "shards": "REDACTED SEVEN SHARD URLS",
>   "shards.tolerant": "true",
>   "spellcheck.onlyMorePopular": "false",
>   "facet.method": "enum",
>   "spellcheck.count": "9",
>   "q": "catchall:carriage",
>   "indent": "true",
>   "wt": "json",
>   "_": "120900498"
> }
>
>
>   "responseHeader": {
> "status": 0,
> "QTime": 176,
> "params": {
>   "df": "catchall",
>   "spellcheck.maxCollationEvaluations": "2",
>   "spellcheck.dictionary": "default",
>   "echoParams": "all",
>   "spellcheck.maxCollations": "5",
>   "q.op": "AND",
>   "shards.info": "true",
>   "spellcheck.maxCollationTries": "2",
>   "rows": "70",
>   "spellcheck.extendedResults": "false",
>   "shards": "REDACTED SEVEN SHARD URLS",
>   "shards.tolerant": "true",
>   "spellcheck.onlyMorePopular": "false",
>   "facet.method": "enum",
>   "spellcheck.count": "9",
>   "q": "catchall:wibble",
>   "indent": "true",
>   "wt": "json",
>   "_": "121001024"
> }
>
> The first query had a numFound of 120906, the second a numFound of 32.
> When I re-executed the first  query (the one with a QTime of 1253) so it
> would use the Solr caches, QTime was 17.
>
> This is an index that has six cold shards with 38.8 million documents
> each and a hot shard with 1.5 million documents.  Total document count
> for the index is over 234 million documents, and the total size of the
> index is about 272GB.  Each copy of the index has its shards split
> between two servers that each have 64GB of RAM, with an 8GB max Java
> heap.  I do not have enough memory to cache all the index contents in
> RAM, but I can get a little less than half of it in the cache -- each
> machine has about 56GB of cache available and contains