Elevation and core create

2014-03-02 Thread David Stuart
Hi, sorry for the cross post, but I got no response in the dev group, so I assumed I 
posted in the wrong place.



I am using Solr 3.6 and am trying to automate the deployment of cores with a 
custom elevate file. It is proving to be difficult: while most of the files (schema, 
stop words etc.) support an absolute path, the elevate file seems to need to be in 
either a conf directory as a sibling to data or in the data directory itself. I am able 
to achieve my goal by having a secondary process that places the file, but I 
thought I would ask the group just in case I have missed the obvious. Should I 
move to Solr 4; is it fixed there? I could also go down the route of extending the 
SolrCore create function to accept additional params and move the file into the 
defined data directory.
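
(For reference, this is roughly how the component is wired up in solrconfig.xml on 3.x;
the config-file value below is what seems to get resolved against conf/ or the data
directory rather than an absolute path:)

<searchComponent name="elevator" class="solr.QueryElevationComponent">
  <str name="queryFieldType">string</str>
  <str name="config-file">elevate.xml</str>
</searchComponent>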

Ideas?

Thanks for your help
David Stuart
M  +44(0) 778 854 2157
T   +44(0) 845 519 5465
www.axistwelve.com
Axis12 Ltd | The Ivories | 6/18 Northampton Street, London | N1 2HY | UK

AXIS12 - Enterprise Web Solutions

Reg Company No. 7215135
VAT No. 997 4801 60






Re: SolrCloud plugin

2014-03-02 Thread Shalin Shekhar Mangar
Perhaps you just need StatsComponent?

https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
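
For example, a single request like the one below returns the mean (plus min, max,
sum, count, etc.) for a numeric field, and in SolrCloud the per-shard stats are
merged for you; the field name here is only an illustration:

http://localhost:8983/solr/collection1/select?q=*:*&rows=0&stats=true&stats.field=price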

On Sun, Mar 2, 2014 at 6:32 AM, Soumitra Kumar kumar.soumi...@gmail.com wrote:
 In general, yes.

 I don't know how SolrCloud serves a distributed query. What does it do on the
 shards, and what on the server serving the query?
 On Mar 1, 2014 2:58 PM, Furkan KAMACI furkankam...@gmail.com wrote:

 Hi;

 Ok, I see that your aim is different. Do you want to implement something
 similar to Map/Reduce paradigm?

 Thanks;
 Furkan KAMACI


 2014-03-02 0:09 GMT+02:00 Soumitra Kumar kumar.soumi...@gmail.com:

  I want to add a command to calculate the average of some numeric field. How
  do I efficiently do this when data is split across multiple shards? I would
  like to do the computation on each shard, and then aggregate the results.
 
 
  On Sat, Mar 1, 2014 at 1:51 PM, Furkan KAMACI furkankam...@gmail.com
  wrote:
 
   Hi;
  
    I've written a dashboard for this kind of purpose and I will make it open
    source soon. You can get information about SolrCloud via SolrJ or you can
    interact with Zookeeper. Could you explain more about what you want to do?
    What kind of results do you want to aggregate for your SolrCloud installation?
  
   Thanks;
   Furkan KAMACI
  
  
   2014-03-01 23:39 GMT+02:00 Soumitra Kumar kumar.soumi...@gmail.com:
  
Hello,
   
I want to write a plugin for a SolrCloud installation.
   
I could not find where and how to aggregate the results from all
  shards,
please give some pointers.
   
Thanks,
-Soumitra.
   
  
 




-- 
Regards,
Shalin Shekhar Mangar.


Re: Elevation and core create

2014-03-02 Thread Erick Erickson
Hmmm, you _ought_ to be able to specify a relative path
in <str name="confFiles">solrconfig_slave.xml:solrconfig.xml,x.xml,y.xml</str>

But there's certainly the chance that this is hard-coded in
the query elevation component so I can't say that this'll work
with assurance.
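
For reference, that element normally lives on the master side of the replication
handler in solrconfig.xml, so a sketch of pushing the elevate file to slaves might
look like this (the file list is only illustrative):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt,elevate.xml</str>
  </lst>
</requestHandler>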

Best,
Erick

On Sun, Mar 2, 2014 at 6:14 AM, David Stuart d...@axistwelve.com wrote:
 Hi, sorry for the cross post, but I got no response in the dev group, so I assumed
 I posted in the wrong place.



 I am using Solr 3.6 and am trying to automate the deployment of cores with a 
 custom elevate file. It is proving to be difficult: while most of the files 
 (schema, stop words etc.) support an absolute path, the elevate file seems to need 
 to be in either a conf directory as a sibling to data or in the data directory itself. 
 I am able to achieve my goal by having a secondary process that places the 
 file, but I thought I would ask the group just in case I have missed the obvious. 
 Should I move to Solr 4; is it fixed there? I could also go down the route of 
 extending the SolrCore create function to accept additional params and move 
 the file into the defined data directory.

 Ideas?

 Thanks for your help
 David Stuart
 M  +44(0) 778 854 2157
 T   +44(0) 845 519 5465
 www.axistwelve.com
 Axis12 Ltd | The Ivories | 6/18 Northampton Street, London | N1 2HY | UK

 AXIS12 - Enterprise Web Solutions

 Reg Company No. 7215135
 VAT No. 997 4801 60






Re: Date query not returning results only some time

2014-03-02 Thread Arun Rangarajan
Erick,
Thanks a lot for the detailed explanation. That clarified things for me
better.


On Sun, Mar 2, 2014 at 10:04 AM, Erick Erickson erickerick...@gmail.comwrote:

 Well, in M/S setups the master shouldn't be searching at all,
 but that's a nit.

 That aside, whether the master has opened a new
 searcher or not is irrelevant to what the slave replicates.
 What _is_ relevant is whether any of the files on disk that
 comprise the index (i.e. the segment files) have been
 changed. Really, if any of them have been closed/merged
 whatever since the last sync. Imagine it like this (this isn't
 quite what happens, but it's a useful model). The slave
 says "here's a list of my segments; is it the same as the
 list of closed segments on the master?" If the answer
 is no, a replication is performed. Actually, this is done
 much more efficiently, but that's the idea.

 You seem to be really asking about the whole issue of whether
 searches on the various nodes (master + slaves) are
 consistent. This is one of the problems with M/S setups: they
 can be different by whatever has happened in the polling interval.

 The state of the master's searchers just doesn't enter the picture.

 Glad the problem is solved no matter what.

 Erick

 On Sat, Mar 1, 2014 at 10:26 PM, Arun Rangarajan
 arunrangara...@gmail.com wrote:
  The slave is polling the master after the interval specified in
  solrconfig.xml. The slave essentially asks has anything changed? If
 so, the
  changes are brought down to the slave.
  Yes, I understand this, but if master does not open a new searcher after
  auto commits (which would indicate that the new index is not quite ready
  yet) and if master is still using the old index to serve search
 requests, I
  would expect the slave to do the same as well. Or the slave should at
 least
  not replicate or not open a new searcher, until the master opened a new
  searcher. But that is just the way I see it and it may be wrong.
 
  What's your polling interval on the slave anyway? Sounds like it's quite
  frequent if you notice this immediately after the DIH starts.
  No, polling interval is set to 1 hour, but the full import was set to run
  at 1 AM. I believe a delete followed by few docs got replicated after the
  first few auto commits when the slave probably polled around 1:10 AM and
  slave index had few docs for an hour before the next polling happened,
  which is why the date query was returning empty results for exactly that
  one hour. (The full index takes about 1.5 hours to finish.)
 
  Anyway the problem is now solved by specifying clean=false in the DIH
  full import command.
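  
   (For reference, the full-import request with that flag looks roughly like this,
   assuming the handler is registered at /dataimport; the host and core name are
   placeholders:)
  
   http://localhost:8983/solr/yourcore/dataimport?command=full-import&clean=false&commit=true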
 
 
  On Sat, Mar 1, 2014 at 9:12 AM, Erick Erickson erickerick...@gmail.com
 wrote:
 
  bq: the slave anyway replicates the index after auto commits! (Is this
  desired behavior?)
 
  Absolutely it's desired behavior. The slave is polling the master
  after the interval
  specified in solrconfig.xml. The slave essentially asks has anything
  changed? If so,
  the changes are brought down to the slave. And by definition, commits
  change the index,
  especially if all docs have been deleted
 
  What's your polling interval on the slave anyway? Sounds like it's
  quite frequent if you
  notice this immediately after the DIH starts.
 
  Best,
  Erick
 
  On Fri, Feb 28, 2014 at 9:04 PM, Arun Rangarajan
  arunrangara...@gmail.com wrote:
   I believe I figured out what the issue is. Even though we do not open
 a
  new
   searcher on master during full import, the slave anyway replicates the
   index after auto commits! (Is this desired behavior?) Since
 clean=true
   this meant all the docs were deleted on slave and a partial index got
   replicated! The reason only the date query did not return any results
 is
   because recently created docs have higher doc IDs and we index by
  ascending
   order of IDs!
  
   I believe I have two options:
   - as Chris suggested I have to use clean=false so the existing docs
 are
   not deleted first on the slave. Since we have primary keys, newly
 added
   docs will overwrite old docs as they get added.
   - disable replication after commits. Replicate only after optimize.
  
   Thx all for your help.
  
  
  
  
  
   On Fri, Feb 28, 2014 at 8:06 PM, Arun Rangarajan
   arunrangara...@gmail.comwrote:
  
   Thx, Erick and Chris.
  
   This is indeed very strange. Other queries which do not restrict by
 the
   date field are returning results, so the index is definitely not
 empty.
  Has
   it got something to do with the date query part, with NOW/DAY or
  something
   in here?
   first_publish_date:[NOW/DAY-33DAYS TO NOW/DAY-3DAYS]
  
   For now, I have set up a script to just log the number of docs on the
   slave every minute. Will monitor and report the findings.
  
  
   On Fri, Feb 28, 2014 at 6:49 PM, Chris Hostetter 
  hossman_luc...@fucit.org
wrote:
  
  
   : This is odd. The full import, I think, deletes the
   : docs in the index when it starts.
  
   Yeah, if you 

Re: How to best handle search like Dave David

2014-03-02 Thread Arun Rangarajan
If you are trying to serve results as users are typing, then you can use
EdgeNGramFilter (see
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
).

Let's say you configure your field like this, as shown in the Solr wiki:

<fieldType name="text_general_edge_ngram" class="solr.TextField"
           positionIncrementGap="100">
   <analyzer type="index">
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
              maxGramSize="15" side="front"/>
   </analyzer>
   <analyzer type="query">
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
   </analyzer>
</fieldType>

Then this is what happens at index time for your tokens:

David --> | LowerCaseTokenizerFactory | --> david --> | EdgeNGramFilterFactory | --> da dav davi david
Dave  --> | LowerCaseTokenizerFactory | --> dave  --> | EdgeNGramFilterFactory | --> da dav dave

And at query time, when your user enters 'Dav' it will match both those
tokens. Note that the moment your user starts typing more, say 'davi' it
won't match 'Dave' since you are doing edge N gramming only at index time
and not at query time. You can also do edge N gramming at query time if you
want 'Dave' to match 'David', probably keeping a larger minGramSize (in
this case 3) to avoid noise (like say 'Dave' matching 'Dana' though with a
lower score), but it will be expensive to do n-gramming at query time.
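
For completeness, a sketch of that query-time variant (using the minGramSize of 3
suggested above) would just add the same filter to the query analyzer:

   <analyzer type="query">
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="3"
              maxGramSize="15" side="front"/>
   </analyzer>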




On Fri, Feb 28, 2014 at 3:22 PM, Susheel Kumar 
susheel.ku...@thedigitalgroup.net wrote:

 Hi,

 We have name searches on Solr for millions of documents. One user may search
 for Morrison Dave while another may search for Morrison David. What's the
 best way to handle it so that both bring back similar results? Adding synonyms
 is the option we are using right now.

 But we may need to add around 50,000+ such synonyms for different names;
 for each specific name there can be a couple of synonyms, e.g. for Richard it
 can be Rich, Rick, Richie etc.

 Any experience adding so many synonyms, or any other thoughts? Stemming may
 help in a few situations but not with names like Dave and David.
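
 (For reference, a sketch of what that setup might look like; the entries and the
 filter line below are only illustrative:)

 synonyms.txt:
    richard,rich,rick,richie
    david,dave

 and in the field type's query analyzer:
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>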

 Thanks,
 Susheel



Re: Cluster state ranges are all null after reboot

2014-03-02 Thread Greg Pendlebury
Thanks again for the info. Hopefully we find some more clues if it
continues to occur. The ops team are looking at alternative deployment
methods as well, so we might end up avoiding the issue altogether.

Ta,
Greg


On 28 February 2014 02:42, Shalin Shekhar Mangar shalinman...@gmail.comwrote:

 I think it is just a side-effect of the current implementation that
 the ranges are assigned linearly. You can also verify this by choosing
  a document from each shard and running its uniqueKey against the
 CompositeIdRouter's sliceHash method and verifying that it is included
 in the range.
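
 A minimal sketch of recomputing those default ranges (assuming the Solr 4.x
 solrj jar on the classpath); which range belongs to which shard number is the
 open question in this thread:

   import java.util.List;
   import org.apache.solr.common.cloud.CompositeIdRouter;
   import org.apache.solr.common.cloud.DocRouter.Range;

   public class PrintRanges {
     public static void main(String[] args) {
       CompositeIdRouter router = new CompositeIdRouter();
       // 15 shards, no shard splitting assumed
       List<Range> ranges = router.partitionRange(15, router.fullRange());
       for (int i = 0; i < ranges.size(); i++) {
         System.out.println("shard" + (i + 1) + " -> " + ranges.get(i));
       }
     }
   }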

 I couldn't reproduce this but I didn't try too hard either. If you are
 able to isolate a reproducible example then please do report back.
 I'll spend some time to review the related code again to see if I can
 spot the problem.

 On Thu, Feb 27, 2014 at 2:19 AM, Greg Pendlebury
 greg.pendleb...@gmail.com wrote:
  Thanks Shalin, that code might be helpful... do you know if there is a
  reliable way to line up the ranges with the shard numbers? When the
 problem
  occurred we had 80 million documents already in the index, and could not
  issue even a basic 'deleteById' call. I'm tempted to assume they are just
  assigned linearly since our Test and Prod clusters both look to work that
  way now, but I can't be sure whether that is by design or just
 happenstance
  of boot order.
 
  And no, unfortunately we have not been able to reproduce this issue
  consistently despite trying a number of different things such as
 graceless
  stop/start and screwing with the underlying WAR file (which is what we
  thought puppet might be doing). The problem has occurred twice since, but
  always in our Test environment. The fact that Test has only a single
  replica per shard is the most likely culprit for me, but as mentioned,
 even
  gracelessly killing the last replica in the cluster seems to leave the
  range set correctly in clusterstate when we test it in isolation.
 
  In production (45 JVMs, 15 shards with 3 replicas each) we've never seen
  the problem, despite a similar number of rollouts for version changes
 etc.
 
  Ta,
  Greg
 
 
 
 
  On 26 February 2014 23:46, Shalin Shekhar Mangar shalinman...@gmail.com
 wrote:
 
  If you have 15 shards and assuming that you've never used shard
  splitting, you can calculate the shard ranges by using new
  CompositeIdRouter().partitionRange(15, new
  CompositeIdRouter().fullRange())
 
  This gives me:
  [8000-9110, 9111-a221, a222-b332,
  b333-c443, c444-d554, d555-e665,
  e666-f776, f777-887, 888-1998,
  1999-2aa9, 2aaa-3bba, 3bbb-4ccb,
  4ccc-5ddc, 5ddd-6eed, 6eee-7fff]
 
  Have you done any more investigation into why this happened? Anything
  strange in the logs? Are you able to reproduce this in a test
  environment?
 
  On Wed, Feb 19, 2014 at 5:16 AM, Greg Pendlebury
  greg.pendleb...@gmail.com wrote:
   We've got a 15 shard cluster spread across 3 hosts. This morning our
  puppet
   software rebooted them all and afterwards the 'range' for each shard
 has
   become null in zookeeper. Is there any way to restore this value
 short of
   rebuilding a fresh index?
  
   I've read various questions from people with a similar problem,
 although
  in
   those cases it is usually a single shard that has become null allowing
  them
   to infer what the value should be and manually fix it in ZK. In this
  case I
   have no idea what the ranges should be. This is our test cluster, and
   checking production I can see that the ranges don't appear to be
   predictable based on the shard number.
  
   I'm also not certain why it even occurred. Our test cluster only has a
   single replica per shard, so when a JVM is rebooted the cluster is
   unavailable... would that cause this? Production has 3 replicas so we
 can
   do rolling reboots.
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.
 



 --
 Regards,
 Shalin Shekhar Mangar.



SolrCloud: heartbeat succeeding while node has failing SSD?

2014-03-02 Thread Gregg Donovan
We had a brief SolrCloud outage this weekend when a node's SSD began to
fail but the node still appeared to be up to the rest of the SolrCloud
cluster (i.e. still green in clusterstate.json). Distributed queries that
reached this node would fail but whatever heartbeat keeps the node in the
clusterstate.json must have continued to succeed.

We eventually had to power the node down to get it to be removed from
clusterstate.json.

This is our first foray into SolrCloud, so I'm still somewhat fuzzy on what
the default heartbeat mechanism is and how we may augment it to be sure
that the disk is checked as part of the heartbeat and/or we verify that it
can serve queries.

Any pointers would be appreciated.

Thanks!

--Gregg


Re: Solr is NoSQL database or not?

2014-03-02 Thread Michael Sokolov

On 3/1/2014 6:53 PM, Jack Krupansky wrote:

NoSQL? To me it's just a marketing term, like Big Data.

Data store? That does imply support for persistence, as opposed to 
mere caching, but mere persistence doesn't assure that the store is 
suitable for use as a System of Record which is a requirement in my 
view for a true database. So, I wouldn't assert that a data store is a 
database.

I agree, Jack.

Our experience has been that we don't actually need everything a true 
ACID database has to offer.  In particular we don't care all that much 
about the I (isolation) part since we don't use Solr to store 
transactional data, just documents, which are loaded by a small number 
of writers that we coordinate. If I had to pick one thing though that 
would make you have to say "well, um, not really a database", it would be 
the transactional model: anyone commits, everyone sees the updates.


-Mike


SEVERE: org.apache.solr.common.SolrException: no field name specified in query and no default specified via 'df' param

2014-03-02 Thread eShard
Hi,
I'm using Solr 4.0 Final (yes, I know I need to upgrade)

I'm getting this error:
SEVERE: org.apache.solr.common.SolrException: no field name specified in
query and no default specified via 'df' param

And I applied this fix: https://issues.apache.org/jira/browse/SOLR-3646 
And unfortunately, the error persists.
I'm using a multi shard environment and the error is only happening on one
of the shards.
I've already updated about half of the other shards with the missing default
text in /browse but the error persists on that one shard.
Can anyone tell me how to make the error go away?
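
For reference, one usual fix is to give the handler a default field in its
solrconfig.xml defaults, e.g. (assuming a catch-all field named text exists in
your schema):

<requestHandler name="/browse" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="df">text</str>
  </lst>
</requestHandler>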

Thanks,





Re: SolrCloud: heartbeat succeeding while node has failing SSD?

2014-03-02 Thread Mark Miller
The heartbeat that keeps the node alive is the connection it maintains with 
ZooKeeper.

We don’t currently have anything built in that will actively make sure each 
node can serve queries and remove it from clusterstate.json if it cannot. If a 
replica is maintaining its connection with ZooKeeper and, in most cases, if it 
is accepting updates, it will appear up. Load balancing should handle the 
failures, but I guess it depends on how sticky the failing requests are.

In the past, I’ve seen this handled on a different search engine by having a 
variety of external agent scripts that would occasionally attempt to do a 
query, and if things did not go right, it killed the process to cause it to try 
and startup again (supervised process).
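
A rough sketch of that kind of external check (assuming SolrJ 4.x; the core URL is
just a placeholder); the agent runs it periodically and treats a non-zero exit as
"restart or fence this node":

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class PingCheck {
  public static void main(String[] args) {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    try {
      // /admin/ping exercises a real query path, unlike the ZooKeeper session
      System.exit(solr.ping().getStatus() == 0 ? 0 : 1);
    } catch (Exception e) {
      System.exit(1); // the supervising agent decides whether to restart the node
    }
  }
}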

I’m not sure what the right long term feature for Solr is here, but feel free 
to start a JIRA issue around it.

One simple improvement might even be a background thread that periodically 
checks some local readings and depending on the results, pulls itself out of 
the mix as best it can (remove itself from clusterstate.json or simply closes 
its ZK connection).

- Mark

http://about.me/markrmiller

On Mar 2, 2014, at 3:42 PM, Gregg Donovan gregg...@gmail.com wrote:

 We had a brief SolrCloud outage this weekend when a node's SSD began to
 fail but the node still appeared to be up to the rest of the SolrCloud
 cluster (i.e. still green in clusterstate.json). Distributed queries that
 reached this node would fail but whatever heartbeat keeps the node in the
 clusterstate.json must have continued to succeed.
 
 We eventually had to power the node down to get it to be removed from
 clusterstate.json.
 
 This is our first foray into SolrCloud, so I'm still somewhat fuzzy on what
 the default heartbeat mechanism is and how we may augment it to be sure
 that the disk is checked as part of the heartbeat and/or we verify that it
 can serve queries.
 
 Any pointers would be appreciated.
 
 Thanks!
 
 --Gregg



Solr Heap, MMaps and Garbage Collection

2014-03-02 Thread KNitin
Hi

I have very large indexes for a few collections, and when they are being
queried, I see the old gen space close to 100% usage all the time. The
system becomes extremely slow due to GC activity right after that, and it
gets into this cycle very often.

I have given Solr close to 30 GB of heap on a 65 GB RAM machine, and the rest is
left to the OS. I have a lot of hits in the filter, query result and document
caches, and the size of all the caches is around 512 entries per
collection. Are all the caches used by Solr on or off heap?


Given this scenario where GC is the primary bottleneck, what are the
recommended memory settings for Solr? Should I increase the heap memory
(that will only postpone the problem until the heap becomes full again
after a while)? Will memory maps help at all in this scenario?


Kindly advise on the best practices
Thanks
Nitin


Re: Solr Heap, MMaps and Garbage Collection

2014-03-02 Thread Walter Underwood
An LRU cache will always fill up the old generation. Old objects are ejected, 
and those are usually in the old generation.

Increasing the heap size will not eliminate this. It will make major, stop-the-world 
collections longer.

Increase the new generation size until the rate of old gen increase slows down. 
Then choose a total heap size to control the frequency (and duration) of major 
collections.

We run with the new generation at about 25% of the heap, so 8GB total and a 2GB 
newgen.

A 512 entry cache is very small for query results or docs. We run with 10K or 
more entries for those. The filter cache size depends on your usage. We have 
only a handful of different filter queries, so a tiny cache is fine.
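
For comparison, a sketch of the corresponding solrconfig.xml cache settings (the
sizes here are only illustrative, not recommendations for your data):

<filterCache      class="solr.FastLRUCache" size="512"   autowarmCount="64"/>
<queryResultCache class="solr.LRUCache"     size="10240" autowarmCount="256"/>
<documentCache    class="solr.LRUCache"     size="10240"/>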

What is your hit rate on the caches?

wunder

On Mar 2, 2014, at 7:42 PM, KNitin nitin.t...@gmail.com wrote:

 Hi
 
 I have very large index for a few collections and when they are being
 queried, i see the Old gen space close to 100% Usage all the time. The
 system becomes extremely slow due to GC activity right after that and it
 gets into this cycle very often
 
 I have given solr close to 30G of heap in a 65 GB ram machine and rest is
 given to RAm. I have a lot of hits in filter,query result and document
 caches and the size of all the caches is around 512 entries per
 collection.Are all the caches used by solr on or off heap ?
 
 
 Given this scenario where GC is the primary bottleneck what is a good
 recommended memory settings for solr? Should i increase the heap memory
 (that will only postpone the problem before the heap becomes full again
 after a while) ? Will memory maps help at all in this scenario?
 
 
 Kindly advise on the best practices
 Thanks
 Nitin




Re: stopwords issue with edismax

2014-03-02 Thread sureshrk19
Jack,

Thanks for the reply.

Yes, your observation is right. I see that stopwords are not being ignored at
query time.
Say I'm searching for 'bank of america'. I'm expecting 'of' not to be
part of the search.
But here I see 'of' is being sent. The query syntax is the same for the 'OR' and
'AND' operators, and 'OR' returns results as expected. But in my case, I
want to use 'AND'.

Here is debug query information...

parsedquery:(+((DisjunctionMaxQuery((ent_name:bank^7.0 | all_text:bank |
number:bank^3.0 | party:bank^3.0 | all_code:bank^2.0 | name:bank^5.0))
DisjunctionMaxQuery((number:of^3.0 | all_code:of^2.0))
DisjunctionMaxQuery((ent_name:america^7.0 | all_text:america |
number:america^3.0 | party:america^3.0 | all_code:america^2.0 |
name:america^5.0)))~3))/no_coord,
parsedquery_toString:+(((ent_name:bank^7.0 | all_text:bank |
number:bank^3.0 | party:bank^3.0 | all_code:bank^2.0 | name:bank^5.0)
(number:of^3.0 | all_code:of^2.0) (ent_name:america^7.0 | all_text:america |
number:america^3.0 | party:america^3.0 | all_code:america^2.0 |
name:america^5.0))~3)

Is there any reason why stopwords are not being ignored? I checked
schema.xml and the filter is present:
<filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt" />






Re: stopwords issue with edismax

2014-03-02 Thread Jack Krupansky
As I suggested, you have a couple of fields that do not ignore stop words, so 
the stop word must be present in at least one of those fields:


(number:of^3.0 | all_code:of^2.0)

The solution would be to remove the number and all_code fields from qf.

-- Jack Krupansky

-Original Message- 
From: sureshrk19

Sent: Monday, March 3, 2014 1:05 AM
To: solr-user@lucene.apache.org
Subject: Re: stopwords issue with edismax

Jack,

Thanks for the reply.

Yes. your observation is right. I see, stopwords are not being ignore at
query time.
Say, I'm searching for 'bank of america'. I'm expecting 'of' should not be
the part of search.
But, here I see 'of' is being sent. Same is the query syntax for 'OR' and
'AND' operators and 'OR' is returning results as expected. But in my case, I
want to use 'AND'.

Here is debug query information...

parsedquery:(+((DisjunctionMaxQuery((ent_name:bank^7.0 | all_text:bank |
number:bank^3.0 | party:bank^3.0 | all_code:bank^2.0 | name:bank^5.0))
DisjunctionMaxQuery((number:of^3.0 | all_code:of^2.0))
DisjunctionMaxQuery((ent_name:america^7.0 | all_text:america |
number:america^3.0 | party:america^3.0 | all_code:america^2.0 |
name:america^5.0)))~3))/no_coord,
   parsedquery_toString:+(((ent_name:bank^7.0 | all_text:bank |
number:bank^3.0 | party:bank^3.0 | all_code:bank^2.0 | name:bank^5.0)
(number:of^3.0 | all_code:of^2.0) (ent_name:america^7.0 | all_text:america |
number:america^3.0 | party:america^3.0 | all_code:america^2.0 |
name:america^5.0))~3)

Is there any reason why 'stopwords' are not being ignored. I checked
schema.xml for filter and the same is present:
<filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt" />







Re: Solr Heap, MMaps and Garbage Collection

2014-03-02 Thread KNitin
Thanks, Walter

The hit rate on the document caches is close to 70-80%, and the filter caches
have a 100% hit rate (since most of our queries filter on the same fields but
have a different q parameter). The query result cache is not of great
importance to me since the hit rate there is almost negligible.

Does it mean I need to increase the size of my filter and document caches
for large indices?

My 25 GB heap usage is split up as follows:

1. 19 GB - Old Gen (100% pool utilization)
2.  3 GB - New Gen (50% pool utilization)
3. 2.8 GB - Perm Gen (I am guessing this is because of interned strings)
4. Survivor space is on the order of 300-400 MB and is almost always 100%
full. (Is this a major issue?)

We are also currently using the parallel GC collector but are planning to move to
CMS for shorter stop-the-world GC times. If I increase the filter cache and
document cache entry sizes, they would also go to the old gen, right?

A very naive question: how is increasing the young gen going to help if we
know that Solr is already pushing major caches and other objects to the old gen
because of their nature? My young gen pool utilization is still well under
50%.


Thanks
Nitin


On Sun, Mar 2, 2014 at 9:31 PM, Walter Underwood wun...@wunderwood.orgwrote:

 An LRU cache will always fill up the old generation. Old objects are
 ejected, and those are usually in the old generation.

 Increasing the heap size will not eliminate this. It will make major, stop
 the world collections longer.

 Increase the new generation size until the rate of old gen increase slows
 down. Then choose a total heap size to control the frequency (and duration)
 of major collections.

 We run with the new generation at about 25% of the heap, so 8GB total and
 a 2GB newgen.

 A 512 entry cache is very small for query results or docs. We run with 10K
 or more entries for those. The filter cache size depends on your usage. We
 have only a handful of different filter queries, so a tiny cache is fine.

 What is your hit rate on the caches?

 wunder

 On Mar 2, 2014, at 7:42 PM, KNitin nitin.t...@gmail.com wrote:

  Hi
 
  I have very large index for a few collections and when they are being
  queried, i see the Old gen space close to 100% Usage all the time. The
  system becomes extremely slow due to GC activity right after that and it
  gets into this cycle very often
 
  I have given solr close to 30G of heap in a 65 GB ram machine and rest is
  given to RAm. I have a lot of hits in filter,query result and document
  caches and the size of all the caches is around 512 entries per
  collection.Are all the caches used by solr on or off heap ?
 
 
  Given this scenario where GC is the primary bottleneck what is a good
  recommended memory settings for solr? Should i increase the heap memory
  (that will only postpone the problem before the heap becomes full again
  after a while) ? Will memory maps help at all in this scenario?
 
 
  Kindly advise on the best practices
  Thanks
  Nitin





Re: Solr Heap, MMaps and Garbage Collection

2014-03-02 Thread Bernd Fehling
Actually, I haven't ever seen a PermGen with 2.8 GB.
So you must have a very special use case with SOLR.

For my little index with 60 million docs and 170GB index size I gave
PermGen 82 MB and it is only using 50.6 MB for a single VM.

Permanent Generation (PermGen) is completely separate from the heap.

Permanent Generation (non-heap):
The pool containing all the reflective data of the virtual machine itself,
such as class and method objects. With Java VMs that use class data sharing,
this generation is divided into read-only and read-write areas.

Regards
Bernd


Am 03.03.2014 07:54, schrieb KNitin:
 Thanks, Walter
 
 Hit rate on the document caches is close to 70-80% and the filter caches
 are a 100% hit (since most of our queries filter on the same fields but
 have a different q parameter). Query result cache is not of great
 importance to me since the hit rate their is almost negligible.
 
 Does it mean i need to increase the size of my filter and document cache
 for large indices?
 
 The split up of my 25Gb heap usage is split as follows
 
 1. 19 GB - Old Gen (100% pool utilization)
 2.  3 Gb - New Gen (50% pool utilization)
 3. 2.8 Gb - Perm Gen (I am guessing this is because of interned strings)
 4. Survivor space is in the order of 300-400 MB and is almost always 100%
 full.(Is this a major issue?)
 
 We are also currently using Parallel GC collector but planning to move to
 CMS for lesser stop-the-world gc times. If i increase the filter cache and
 document cache entry sizes, they would also go to the Old gen right?
 
 A very naive question: How does increasing young gen going to help if we
 know that solr is already pushing major caches and other objects to old gen
 because of their nature? My young gen pool utilization is still well under
 50%
 
 
 Thanks
 Nitin
 
 
 On Sun, Mar 2, 2014 at 9:31 PM, Walter Underwood wun...@wunderwood.orgwrote:
 
 An LRU cache will always fill up the old generation. Old objects are
 ejected, and those are usually in the old generation.

 Increasing the heap size will not eliminate this. It will make major, stop
 the world collections longer.

 Increase the new generation size until the rate of old gen increase slows
 down. Then choose a total heap size to control the frequency (and duration)
 of major collections.

 We run with the new generation at about 25% of the heap, so 8GB total and
 a 2GB newgen.

 A 512 entry cache is very small for query results or docs. We run with 10K
 or more entries for those. The filter cache size depends on your usage. We
 have only a handful of different filter queries, so a tiny cache is fine.

 What is your hit rate on the caches?

 wunder

 On Mar 2, 2014, at 7:42 PM, KNitin nitin.t...@gmail.com wrote:

 Hi

 I have very large index for a few collections and when they are being
 queried, i see the Old gen space close to 100% Usage all the time. The
 system becomes extremely slow due to GC activity right after that and it
 gets into this cycle very often

 I have given solr close to 30G of heap in a 65 GB ram machine and rest is
 given to RAm. I have a lot of hits in filter,query result and document
 caches and the size of all the caches is around 512 entries per
 collection.Are all the caches used by solr on or off heap ?


 Given this scenario where GC is the primary bottleneck what is a good
 recommended memory settings for solr? Should i increase the heap memory
 (that will only postpone the problem before the heap becomes full again
 after a while) ? Will memory maps help at all in this scenario?


 Kindly advise on the best practices
 Thanks
 Nitin



 

-- 
*
Bernd Fehling              Bielefeld University Library
Dipl.-Inform. (FH)         LibTec - Library Technology
Universitätsstr. 25        and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060   bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*


Re: Solr Heap, MMaps and Garbage Collection

2014-03-02 Thread Walter Underwood
New gen should be big enough to handle all allocations that have a lifetime of 
a single request, considering that you'll have multiple concurrent requests. If 
new gen routinely overflows, short-lived objects end up in the old gen.

Yes, you need to go to CMS.

I have usually seen the hit rates on query results and doc caches to be fairly 
similar, with doc cache somewhat higher.

Cache hit rates depend on the number of queries between updates. If you update 
once per day and get a million queries or so, your hit rates can get pretty 
good.

70-80% seems typical for doc cache on an infrequently updated index. We stay 
around 75% on our busiest 4m doc index. 

The query result cache is the most important, because it saves the most work. 
Ours stays around 20%, but I should spend some time improving that.

The perm gen size is very big. I think we run with 128 Meg.
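
Roughly, the JVM settings for the kind of layout discussed in this thread would look
like the following (sizes are only illustrative, Java 6/7 era flags):

-Xms8g -Xmx8g -Xmn2g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:MaxPermSize=128m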

wunder

On Mar 2, 2014, at 10:54 PM, KNitin nitin.t...@gmail.com wrote:

 Thanks, Walter
 
 Hit rate on the document caches is close to 70-80% and the filter caches
 are a 100% hit (since most of our queries filter on the same fields but
 have a different q parameter). Query result cache is not of great
 importance to me since the hit rate their is almost negligible.
 
 Does it mean i need to increase the size of my filter and document cache
 for large indices?
 
 The split up of my 25Gb heap usage is split as follows
 
 1. 19 GB - Old Gen (100% pool utilization)
 2.  3 Gb - New Gen (50% pool utilization)
 3. 2.8 Gb - Perm Gen (I am guessing this is because of interned strings)
 4. Survivor space is in the order of 300-400 MB and is almost always 100%
 full.(Is this a major issue?)
 
 We are also currently using Parallel GC collector but planning to move to
 CMS for lesser stop-the-world gc times. If i increase the filter cache and
 document cache entry sizes, they would also go to the Old gen right?
 
 A very naive question: How does increasing young gen going to help if we
 know that solr is already pushing major caches and other objects to old gen
 because of their nature? My young gen pool utilization is still well under
 50%
 
 
 Thanks
 Nitin
 
 
 On Sun, Mar 2, 2014 at 9:31 PM, Walter Underwood wun...@wunderwood.orgwrote:
 
 An LRU cache will always fill up the old generation. Old objects are
 ejected, and those are usually in the old generation.
 
 Increasing the heap size will not eliminate this. It will make major, stop
 the world collections longer.
 
 Increase the new generation size until the rate of old gen increase slows
 down. Then choose a total heap size to control the frequency (and duration)
 of major collections.
 
 We run with the new generation at about 25% of the heap, so 8GB total and
 a 2GB newgen.
 
 A 512 entry cache is very small for query results or docs. We run with 10K
 or more entries for those. The filter cache size depends on your usage. We
 have only a handful of different filter queries, so a tiny cache is fine.
 
 What is your hit rate on the caches?
 
 wunder
 
 On Mar 2, 2014, at 7:42 PM, KNitin nitin.t...@gmail.com wrote:
 
 Hi
 
 I have very large index for a few collections and when they are being
 queried, i see the Old gen space close to 100% Usage all the time. The
 system becomes extremely slow due to GC activity right after that and it
 gets into this cycle very often
 
 I have given solr close to 30G of heap in a 65 GB ram machine and rest is
 given to RAm. I have a lot of hits in filter,query result and document
 caches and the size of all the caches is around 512 entries per
 collection.Are all the caches used by solr on or off heap ?
 
 
 Given this scenario where GC is the primary bottleneck what is a good
 recommended memory settings for solr? Should i increase the heap memory
 (that will only postpone the problem before the heap becomes full again
 after a while) ? Will memory maps help at all in this scenario?
 
 
 Kindly advise on the best practices
 Thanks
 Nitin
 
 
 

--
Walter Underwood
wun...@wunderwood.org