Re: Solr memory use, jmap and TermInfos/tii

2010-09-12 Thread Simon Willnauer
On Sun, Sep 12, 2010 at 1:51 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 On Sat, Sep 11, 2010 at 11:07 AM, Burton-West, Tom tburt...@umich.edu wrote:
  Is there an example of how to set up the divisor parameter in 
 solrconfig.xml somewhere?

 Alas I don't know how to configure terms index divisor from Solr...

You can set the termIndexInterval via

<indexDefaults>
  ...
  <termIndexInterval>128</termIndexInterval>
  ...
</indexDefaults>

which has the same effect but requires reindexing. I don't see that
the index divisor is exposed, but maybe we should expose it!

simon
In 4.0, w/ flex indexing, the RAM efficiency is much better -- we use large
parallel arrays instead of separate objects, and we hold much less in RAM.
Simply upgrading to 4.0 and re-indexing will show this gain...

 I'm looking forward to a number of the developments in 4.0, but am a bit 
 wary of using it in production.   I've wanted to work in some tests with 
 4.0, but other more pressing issues have so far prevented this.

 Understood.

 What about LUCENE-2205?  Would that be a way to get some of the benefit 
 similar to the changes in flex without the rest of the changes in flex and 
 4.0?

 2205 was a similar idea (don't create tons of small objects), but it
 was never committed...

I'd be really curious to test the RAM reduction in 4.0 on your terms
dict/index -- is there any way I could get a copy of just the tii/tis
files in your index?  Your index is a great test for Lucene!

 We haven't been able to make much data available due to copyright and other 
 legal issues.  However, since there is absolutely no way anyone could 
 reconstruct copyrighted works from the tii/tis index alone, that should be 
 ok on that front.  On Monday I'll try to get legal/administrative clearance 
 to provide the data and also ask around and see if I can get the ok to 
 either find a spare hard drive to ship, or make some kind of sftp 
 arrangement.  Hopefully we will find a way to be able to do this.

 That would be awesome, thanks!

 BTW, most of the terms are probably the result of dirty OCR, and the impact 
 is probably increased by our present punctuation filter.  When we re-index 
 we plan to use a more intelligent filter that will truncate extremely long 
 tokens on punctuation, and we also plan to do some minimal prefiltering prior 
 to sending documents to Solr for indexing.  However, since we now have 
 over 400 languages, we will have to be conservative in our filtering, since 
 we would rather index dirty OCR than risk not indexing legitimate content.

 Got it... it's a great test case for Lucene :)

 Mike



RE: multivalued fields in result

2010-09-12 Thread Jason Chaffee
But it doesn't seem to be returning multivalued fields that are stored.  It is 
returning all of the single-value fields, though.


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@buyways.nl]
Sent: Sat 9/11/2010 4:19 AM
To: solr-user@lucene.apache.org
Subject: RE: multivalued fields in result
 
Yes, you'll get what is stored and asked for. 
 
-Original message-
From: Jason Chaffee jchaf...@ebates.com
Sent: Sat 11-09-2010 05:27
To: solr-user@lucene.apache.org; 
Subject: multivalued fields in result

Is it possible to return multivalued fields in the result?  

I would like to have a multivalued field that is stored and not indexed (I also 
copy the same field into another field where it is tokenized and indexed).  I 
would then like all the values of this field returned in the result set.  Is 
there a way to do this?

If it is not possible, could someone elaborate on why that is, so that I may 
see if I can make it work?

thanks,

Jason



Re: Solr memory use, jmap and TermInfos/tii

2010-09-12 Thread Michael McCandless
One thing that the Codec API makes possible (in theory, anyway)...
is a variable-gap terms index.

Ie, Lucene today makes an indexed term at regular (every N -- 128 in
3.x, 32 in 4.0) intervals.

But this is rather silly.  Imagine the terms you are going through are
all singletons (they occur in only one doc, eg if they are OCR noise or
whatever).  Maybe you have 500 such terms in sequence and then you hit
a real term with a high freq.  In this case, you don't really need
to add any indexed terms from those 500; instead, make the real term
an indexed term.

Because... a TermQuery against those singleton terms is going to be
wicked fast, so you can afford the extra term-seek time.  Whereas a
TermQuery against a high-frequency term will be costly, so you want to
minimize term-seek time.

Such an approach could tremendously reduce the RAM required by the
terms index w/ no appreciable hit to the worst-case queries (and
possibly a slight improvement).
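
A minimal sketch of such a selection policy (hypothetical names -- this is
not Lucene's actual API, just the idea expressed in code):

  // Decide whether the current term should go into the terms index.
  static boolean isIndexedTerm(int docFreq, int termsSinceLastIndexed,
                               int freqThreshold, int maxGap) {
    if (docFreq >= freqThreshold) {
      return true;  // high-freq term: seeks against it are costly, so index it
    }
    // singleton-ish term: a TermQuery on it is cheap anyway, so only index
    // it if the gap since the last indexed term has grown too large
    return termsSinceLastIndexed >= maxGap;
  }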

Mike

On Sat, Sep 11, 2010 at 7:51 PM, Michael McCandless
luc...@mikemccandless.com wrote:
 [quoted text trimmed -- it repeats the exchange quoted in Simon Willnauer's message, above]



RE: Delta Import with something other than Date

2010-09-12 Thread Ephraim Ofir
Alternatively, you could use the deltaQuery to retrieve the last indexed
id from the DB (you'd have to save it there on your previous import).
Your entity would look something like:
<entity name="my_entity"
        deltaQuery="SELECT MAX(id) AS last_id_value FROM last_id_table"
        deltaImportQuery="SELECT * FROM my_table WHERE id &gt;
                          ${dataimporter.delta.last_id_value}"
        ... >
  <field ... />
</entity>

You could implement your deltaImportQuery as a stored procedure which
would store the appropriate id in last_id_table (for the next
delta-import) in addition to returning the data from the query.
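
A sketch of such a procedure (assuming MySQL; my_table and last_id_table are
the placeholder tables from above):

  DELIMITER //
  CREATE PROCEDURE delta_import(IN last_id BIGINT)
  BEGIN
    -- remember the new high-water mark for the next delta-import
    UPDATE last_id_table SET id = (SELECT MAX(id) FROM my_table);
    -- return the rows added since the previous import
    SELECT * FROM my_table WHERE id > last_id;
  END //
  DELIMITER ;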

Ephraim Ofir


-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Friday, September 10, 2010 4:54 AM
To: solr-user@lucene.apache.org
Subject: Re: Delta Import with something other than Date

  On 9/9/2010 1:23 PM, Vladimir Sutskever wrote:
 Shawn,

 Can you provide a sample of passing the parameter via URL? And how would
using it look in the data-config.xml?


Here's the URL that I send to do a full build on my last shard:

http://idxst5-a:8983/solr/build/dataimport?command=full-import&optimize=true&commit=true&dataTable=ncdat&numShards=6&modVal=5&minDid=0&maxDid=242895591

If I want to do a delta, I just change the command to delta-import and 
give it a proper minDid value, rather than 0.

Below is the entity from my data-config.xml.  You have to have a 
deltaQuery defined for delta-import to work, but if you're going to use 
your own placeholders, just put something in that returns a single value 
very quickly.  In my case, my query and deltaImportQuery are actually 
identical.

<entity name="dataTable" pk="did"
   query="SELECT *,FROM_UNIXTIME(post_date) as pd FROM
${dataimporter.request.dataTable} WHERE did &gt;
${dataimporter.request.minDid} AND did &lt;=
${dataimporter.request.maxDid} AND (did %
${dataimporter.request.numShards}) IN (${dataimporter.request.modVal})"
   deltaQuery="SELECT MAX(did) FROM ${dataimporter.request.dataTable}"
   deltaImportQuery="SELECT *,FROM_UNIXTIME(post_date) as pd FROM
${dataimporter.request.dataTable} WHERE did &gt;
${dataimporter.request.minDid} AND did &lt;=
${dataimporter.request.maxDid} AND (did %
${dataimporter.request.numShards}) IN (${dataimporter.request.modVal})"
/>




Re: Solr memory use, jmap and TermInfos/tii

2010-09-12 Thread Robert Muir
On Sat, Sep 11, 2010 at 7:51 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 On Sat, Sep 11, 2010 at 11:07 AM, Burton-West, Tom tburt...@umich.edu
 wrote:
   Is there an example of how to set up the divisor parameter in
 solrconfig.xml somewhere?

 Alas I don't know how to configure terms index divisor from Solr...


To change the divisor in your solrconfig, for example to 4, it looks like
you need to do this.

  <indexReaderFactory name="IndexReaderFactory"
      class="org.apache.solr.core.StandardIndexReaderFactory">
    <int name="setTermIndexInterval">4</int>
  </indexReaderFactory>

This parameter was added in SOLR-1296, so it's in Solr 1.4.

Tom, I would recommend altering this parameter instead of using the default
(1)... especially since you don't have to reindex to take advantage of it.

-- 
Robert Muir
rcm...@gmail.com


Invalid version or the data in not in 'javabin' format

2010-09-12 Thread h00kpub...@gmail.com
 hi... currently I am integrating Nutch (release 1.2) into Solr 
(trunk). If I index to the Solr index with Nutch, I get this exception:


java.lang.RuntimeException: Invalid version or the data in not in 'javabin' format
        at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
        at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:39)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:466)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
        at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:98)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2010-09-12 11:44:55,101 ERROR solr.SolrIndexer - java.io.IOException: Job failed!


Can you tell me what's wrong, or how I can fix this?

best regards marcel :)





Re: Invalid version or the data in not in 'javabin' format

2010-09-12 Thread Peter Sturge
Could be a solrj .jar version compat issue. Check that the client and
server's solrj version jars match up.

Peter


On Sun, Sep 12, 2010 at 1:16 PM, h00kpub...@gmail.com
h00kpub...@googlemail.com wrote:
  [quoted message and stack trace trimmed -- see the original message, above]






Re: Solr memory use, jmap and TermInfos/tii

2010-09-12 Thread Simon Willnauer
On Sun, Sep 12, 2010 at 12:42 PM, Robert Muir rcm...@gmail.com wrote:
 On Sat, Sep 11, 2010 at 7:51 PM, Michael McCandless 
 luc...@mikemccandless.com wrote:

 On Sat, Sep 11, 2010 at 11:07 AM, Burton-West, Tom tburt...@umich.edu
 wrote:
   Is there an example of how to set up the divisor parameter in
 solrconfig.xml somewhere?

 Alas I don't know how to configure terms index divisor from Solr...


 To change the divisor in your solrconfig, for example to 4, it looks like
 you need to do this.

  <indexReaderFactory name="IndexReaderFactory"
      class="org.apache.solr.core.StandardIndexReaderFactory">
    <int name="setTermIndexInterval">4</int>
  </indexReaderFactory>

Ah, thanks robert! I didn't know about that one either!

simon

 This parameter was added in SOLR-1296, so it's in Solr 1.4.

 Tom, I would recommend altering this parameter instead of using the default
 (1)... especially since you don't have to reindex to take advantage of it.

 --
 Robert Muir
 rcm...@gmail.com



Re: mm=0?

2010-09-12 Thread Erick Erickson
Could you explain the use-case a bit? Because the very
first response I would have is: why in the world did
product management make this a requirement? I'd try
to get the requirement changed...

As a user, I'm having a hard time imagining being well
served by getting a document in response to a search that
had no relation to my search -- just a random doc
selected from the corpus.

All that said, I don't think a single query would do the trick.
You could include a very special document with a field
that no other document has, filled with very special text. Say
the field is named bogusmatch and filled with the text bogustext;
then the second query would match one and only
one document and would take minimal time. Or you could
tack on to each and every query OR bogusmatch:bogustext^0.001
(which would really be inexpensive) and filter it out if there
was more than one response. By boosting it really low, it should
always appear at the end of the list, which wouldn't be a bad thing.
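
For example, the user's query with the fallback clause tacked on might look
like (a sketch -- bogusmatch/bogustext are the placeholder field and term
from above):

  q=(alpha) OR bogusmatch:bogustext^0.001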

DisMax might help you here...

But do ask if it is really a requirement, or just something nobody's
objected to, before bothering, IMO...

Best
Erick

On Sat, Sep 11, 2010 at 1:10 PM, Satish Kumar 
satish.kumar.just.d...@gmail.com wrote:

 Hi,

 We have a requirement to show at least one result every time -- i.e., even
 if the user-entered term is not found in any of the documents. I was hoping
 setting mm to 0 would return results in all cases, but it does not.

 For example, if the user entered the term alpha and it is *not* in any of the
 documents in the index, any document in the index can be returned. If the term
 alpha is in the document set, only documents having the term alpha must
 be returned.

 My idea so far is to perform a search using the user-entered term. If there are
 any results, return them. If there are no results, perform another search
 without the query term -- this means doing two searches. Any suggestions on
 implementing this requirement using only one search?


 Thanks,
 Satish



Re: Invalid version or the data in not in 'javabin' format

2010-09-12 Thread h00kpub...@gmail.com
 That was the solution!! I packaged the current Lucene and solrj 
repositories (dev 4.0), copied the necessary jars to Nutch's libs (after 
removing the old ones), built Nutch and ran it - it works!! Thank you Peter :)


marcel

On 09/12/2010 03:40 PM, Peter Sturge wrote:

Could be a solrj .jar version compat issue. Check that the client and
server's solrj version jars match up.

Peter


On Sun, Sep 12, 2010 at 1:16 PM, h00kpub...@gmail.com
h00kpub...@googlemail.com  wrote:

 [quoted message and stack trace trimmed -- see the original message, above]








Re: multivalued fields in result

2010-09-12 Thread Erick Erickson
Can we see your schema file? Because on the face of it, it sounds like
you didn't really declare your field multivalued=true.

But if it is multivalued AND you changed it, did you reindex after
you changed the schema?

Best
Erick

On Sun, Sep 12, 2010 at 4:21 AM, Jason Chaffee jchaf...@ebates.com wrote:

 [quoted text trimmed -- it repeats the messages earlier in this thread]




Re: Solr memory use, jmap and TermInfos/tii

2010-09-12 Thread Robert Muir
On Sun, Sep 12, 2010 at 9:57 AM, Simon Willnauer 
simon.willna...@googlemail.com wrote:

  To change the divisor in your solrconfig, for example to 4, it looks like
  you need to do this.
 
   <indexReaderFactory name="IndexReaderFactory"
       class="org.apache.solr.core.StandardIndexReaderFactory">
     <int name="setTermIndexInterval">4</int>
   </indexReaderFactory>

 Ah, thanks robert! I didn't know about that one either!

 simon


Actually I'm wrong: for Solr 1.4, use setTermIndexDivisor.

I was looking at 3.1/trunk, and there is a bug in the name of this parameter:
https://issues.apache.org/jira/browse/SOLR-2118
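
So on 1.4 the element would presumably look like this (a sketch based on the
correction above):

  <indexReaderFactory name="IndexReaderFactory"
      class="org.apache.solr.core.StandardIndexReaderFactory">
    <int name="setTermIndexDivisor">4</int>
  </indexReaderFactory>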

-- 
Robert Muir
rcm...@gmail.com


Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Peter Sturge
Hi,

Below are some notes regarding Solr cache tuning that should prove
useful for anyone who uses Solr with frequent commits (e.g. < 5min).

Environment:
Solr 1.4.1 or branch_3x trunk.
Note the 4.x trunk has lots of neat new features, so the notes here
are likely less relevant to the 4.x environment.

Overview:
Our Solr environment makes extensive use of faceting, we perform
commits every 30secs, and the indexes tend to be on the large-ish side
(20million docs).
Note: For our data, when we commit, we are always adding new data,
never changing existing data.
This type of environment can be tricky to tune, as Solr is more geared
toward fast reads than frequent writes.

Symptoms:
If anyone has used faceting in searches where you are also performing
frequent commits, you've likely encountered the dreaded OutOfMemory or
GC Overhead Exceeded errors.
In high commit rate environments, this is almost always due to
multiple 'onDeck' searchers and autowarming - i.e. new searchers don't
finish autowarming their caches before the next commit()
comes along and invalidates them.
Once this starts happening on a regular basis, it is likely your
Solr's JVM will run out of memory eventually, as the number of
searchers (and their cache arrays) will keep growing until the JVM
dies of thirst.
To check if your Solr environment is suffering from this, turn on INFO
level logging, and look for: 'PERFORMANCE WARNING: Overlapping
onDeckSearchers=x'.

In tests, we've only ever seen this problem when using faceting, and
facet.method=fc.

Some solutions to this are:
    Reduce the commit rate to allow searchers to fully warm before the
    next commit
    Reduce or eliminate the autowarming in caches
    Both of the above

The trouble is, if you're doing NRT commits, you likely have a good
reason for it, and reducing/eliminating autowarming will very
significantly impact search performance in high commit rate
environments.

Solution:
Here are some setup steps we've used that allow lots of faceting (we
typically search with at least 20-35 different facet fields, and date
faceting/sorting) on large indexes, and still keep decent search
performance:

1. Firstly, you should consider using the enum method for facet
searches (facet.method=enum) unless you've got A LOT of memory on your
machine. In our tests, this method uses a lot less memory and
autowarms more quickly than fc. (Note, I've not tried the new
segment-based 'fcs' option, as I can't find support for it in
branch_3x - looks nice for 4.x though)
Admittedly, for our data, enum is not quite as fast for searching as
fc, but short of purchasing a Taiwanese RAM factory, it's a worthwhile
tradeoff.
If you do have access to LOTS of memory, AND you can guarantee that
the index won't grow beyond the memory capacity (i.e. you have some
sort of deletion policy in place), fc can be a lot faster than enum
when searching with lots of facets across many terms.
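
(For reference, the method can be selected per-request, e.g.
facet=true&facet.method=enum&facet.field=myfield -- where myfield is a
placeholder -- or set as a default in your request handler.)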

2. Secondly, we've found that LRUCache is faster at autowarming than
FastLRUCache - in our tests, about 20% faster. Maybe this is just our
environment - your mileage may vary.

So, our filterCache section in solrconfig.xml looks like this:
    <filterCache
      class="solr.LRUCache"
      size="3600"
      initialSize="1400"
      autowarmCount="3600"/>

For a 28GB index, running in a quad-core x64 VMWare instance, with 30
warmed facet fields, Solr is running at ~4GB. The filterCache size stat
usually shows in the region of ~2400.

3. It's also a good idea to have some sort of
firstSearcher/newSearcher event listener queries to allow new data to
populate the caches.
Of course, what you put in these is dependent on the facets you need/use.
We've found a good combination is a firstSearcher with as many facets
in the search as your environment can handle, then a subset of the
most common facets for the newSearcher.
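
A sketch of what these listeners look like in solrconfig.xml (the queries and
facet fields are placeholders for your own):

    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="facet">true</str>
          <str name="facet.method">enum</str>
          <str name="facet.field">field1</str>
          <str name="facet.field">field2</str>
        </lst>
      </arr>
    </listener>
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="facet">true</str>
          <str name="facet.field">field1</str>
        </lst>
      </arr>
    </listener>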

4. We also set:
   <useColdSearcher>true</useColdSearcher>
just in case.

5. Another key area for search performance with high commits is to use
2 Solr instances - one for the high commit rate indexing, and one for
searching.
The read-only searching instance can be a remote replica, or a local
read-only instance that reads the same core as the indexing instance
(for the latter, you'll need something that periodically refreshes -
i.e. runs commit()).
This way, you can tune the indexing instance for writing performance
and the searching instance as above for max read performance.

Using the setup above, we get fantastic searching speed for small
facet sets (well under 1sec), and really good searching for large
facet sets (a couple of secs depending on index size, number of
facets, unique terms etc. etc.),
even when searching against largeish indexes (20million docs).
We have yet to see any OOM or GC errors using the techniques above,
even in low memory conditions.

I hope there are people that find this useful. I know I've spent a lot
of time looking for stuff like this, so hopefully this will save
someone some time.


Peter


Re: Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Erick Erickson
Peter:

This kind of information is extremely useful to document, thanks! Do you
have the time/energy to put it up on the Wiki? Anyone can edit it by
creating a logon. If you don't, would it be OK if someone else did it
(with attribution, of course)? I guess that by bringing it up I'm
volunteering :)...

Best
Erick

On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge peter.stu...@gmail.com wrote:

 [quoted text trimmed -- it repeats Peter Sturge's original post, above]

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Dennis Gearon
Wow! Thanks for that. This email is DEFINITELY being filed.

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Sun, 9/12/10, Peter Sturge peter.stu...@gmail.com wrote:

 From: Peter Sturge peter.stu...@gmail.com
 Subject: Tuning Solr caches with high commit rates (NRT)
 To: solr-user@lucene.apache.org
 Date: Sunday, September 12, 2010, 9:26 AM
 [quoted text trimmed -- it repeats Peter Sturge's original post, above]

Re: Solr and jvm Garbage Collection tuning

2010-09-12 Thread Grant Ingersoll

On Sep 10, 2010, at 7:01 PM, Burton-West, Tom wrote:

 We have noticed that when the first query hits Solr after starting it up, 
 memory use increases significantly, from about 1GB to about 16GB, and then as 
 queries are received it goes up to about 19GB at which point there is a Full 
 Garbage Collection which takes about 30 seconds and then memory use drops 
 back down to 16GB.  Under a relatively heavy load, the full GC happens about 
 every 10-20 minutes.
 
 We are running 3 Solr shards under one Tomcat with 20GB allocated to the jvm. 
  Each shard has a total index size of about 400GB on disk and a tii size of about 
 600MB, and indexes about 650,000 full-text books. (The server has a total of 
 72GB of memory, so we are leaving quite a bit of memory for the OS disk 
 cache).
 
 Is there some argument we could give the jvm so that it would collect garbage 
 more frequently? Or some other JVM tuning action that might reduce the amount 
 of time where Solr is waiting on GC?
 
 If we could get the time for each GC to take under a second, with the 
 trade-off being that GC  would occur much more frequently, that would help us 
 avoid the occasional query taking more than 30 seconds at the cost of a 
 larger number of queries taking at least a second.
 

What are your current GC settings?  Also, I guess I'd look at ways you can 
reduce the heap size needed: caching, field type choices, faceting choices.  
You could also try playing with the termIndexInterval, which will load fewer 
terms into memory at the cost of longer seeks.  At some point, though, you may 
just need more shards and the resulting smaller indexes.  How many CPU cores 
do you have on each machine?
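
For example, CMS-style collector settings along these lines were a common
starting point (a sketch only -- not a recommendation tuned to your setup):

    java -Xms20g -Xmx20g \
         -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
         -XX:CMSInitiatingOccupancyFraction=70 \
         -XX:+UseCMSInitiatingOccupancyOnly \
         -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps ...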

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Peter Karich
Peter,

thanks a lot for your in-depth explanations!
Your findings will definitely be helpful for my next performance
improvement tests :-)

Two questions:

1. How would I do that:

 or a local read-only instance that reads the same core as the indexing 
 instance (for the latter, you'll need something that periodically refreshes - 
 i.e. runs commit()).


2. Did you try sharding with your current setup (e.g. one big,
nearly-static index and a tiny write+read index)?

Regards,
Peter.

 [quoted text trimmed -- it repeats Peter Sturge's original post, above]


Re: Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Jason Rutherglen
Peter,

Are you using per-segment faceting, eg, SOLR-1617?  That could help
your situation.

On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge peter.stu...@gmail.com wrote:
 [quoted text trimmed -- it repeats Peter Sturge's original post, above]

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Peter Sturge
Hi Jason,

I've tried some limited testing with the 4.x trunk using fcs, and I
must say, I really like the idea of per-segment faceting.
I was hoping to see it in 3.x, but I don't see this option in the
branch_3x trunk. Is your SOLR-1606 patch referred to in SOLR-1617 the
one to use with 3.1?
There seem to be a number of Solr issues tied to this - one of them
being LUCENE-1785. Can the per-segment faceting patch work with Lucene
2.9/branch_3x?

Thanks,
Peter



On Mon, Sep 13, 2010 at 12:05 AM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 Peter,

 Are you using per-segment faceting, eg, SOLR-1617?  That could help
 your situation.

 On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge peter.stu...@gmail.com wrote:
 [quoted text trimmed -- it repeats Peter Sturge's original post, above]

Re: No more trunk support for 2.9 indexes

2010-09-12 Thread Ryan McKinley
 I suppose an index 'remaker' might be something like a DIH reader for
 a Solr index - streams everything out of the existing index, writing
 it into the new one?

This works fine if all fields are stored (and copyField does not go
to a stored field); otherwise you would need/want to start with the
original source.

ryan


Re: Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Lance Norskog

Bravo!

Other tricks: here is a policy for deciding when to merge segments that 
attempts to balance merging with performance. It was contributed by 
LinkedIn - they also run indexing and searching in the same instance (not 
Solr, a different Lucene app).


lucene/contrib/misc/src/java/org/apache/lucene/index/BalancedSegmentMergePolicy.java

The optimize command now includes a partial optimize option, so you can 
do larger controlled merges.
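
For example (a sketch -- maxSegments is the parameter that makes the optimize
partial):

    <optimize maxSegments="10"/>

or via HTTP: /solr/update?optimize=true&maxSegments=10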


Peter Sturge wrote:

Hi,

Below are some notes regarding Solr cache tuning that should prove
useful for anyone who uses Solr with frequent commits (e.g.5min).

Environment:
Solr 1.4.1 or branch_3x trunk.
Note the 4.x trunk has lots of neat new features, so the notes here
are likely less relevant to the 4.x environment.

Overview:
Our Solr environment makes extensive use of faceting, we perform
commits every 30secs, and the indexes tend be on the large-ish side
(20million docs).
Note: For our data, when we commit, we are always adding new data,
never changing existing data.
This type of environment can be tricky to tune, as Solr is more geared
toward fast reads than frequent writes.

Symptoms:
If anyone has used faceting in searches where you are also performing
frequent commits, you've likely encountered the dreaded OutOfMemory or
GC Overhead Exeeded errors.
In high commit rate environments, this is almost always due to
multiple 'onDeck' searchers and autowarming - i.e. new searchers don't
finish autowarming their caches before the next commit()
comes along and invalidates them.
Once this starts happening on a regular basis, it is likely your
Solr's JVM will run out of memory eventually, as the number of
searchers (and their cache arrays) will keep growing until the JVM
dies of thirst.
To check if your Solr environment is suffering from this, turn on INFO
level logging, and look for: 'PERFORMANCE WARNING: Overlapping
onDeckSearchers=x'.

In tests, we've only ever seen this problem when using faceting, and
facet.method=fc.

Some solutions to this are:
 Reduce the commit rate to allow searchers to fully warm before the
next commit
 Reduce or eliminate the autowarming in caches
 Both of the above

The trouble is, if you're doing NRT commits, you likely have a good
reason for it, and reducing/elimintating autowarming will very
significantly impact search performance in high commit rate
environments.

Solution:
Here are some setup steps we've used that allow lots of faceting (we
typically search with at least 20-35 different facet fields, and date
faceting/sorting) on large indexes, and still keep decent search
performance:

1. Firstly, you should consider using the enum method for facet
searches (facet.method=enum) unless you've got A LOT of memory on your
machine. In our tests, this method uses a lot less memory and
autowarms more quickly than fc. (Note, I've not tried the new
segement-based 'fcs' option, as I can't find support for it in
branch_3x - looks nice for 4.x though)
Admittedly, for our data, enum is not quite as fast for searching as
fc, but short of purchsing a Thaiwanese RAM factory, it's a worthwhile
tradeoff.
If you do have access to LOTS of memory, AND you can guarantee that
the index won't grow beyond the memory capacity (i.e. you have some
sort of deletion policy in place), fc can be a lot faster than enum
when searching with lots of facets across many terms.

2. Secondly, we've found that LRUCache is faster at autowarming than
FastLRUCache - in our tests, about 20% faster. Maybe this is just our
environment - your mileage may vary.

So, our filterCache section in solrconfig.xml looks like this:
  <filterCache
    class="solr.LRUCache"
    size="3600"
    initialSize="1400"
    autowarmCount="3600"/>

For a 28GB index running in a quad-core x64 VMWare instance, with 30
warmed facet fields, Solr runs at ~4GB. The filterCache size stat
usually shows in the region of ~2400.

3. It's also a good idea to have some sort of
firstSearcher/newSearcher event listener queries to allow new data to
populate the caches.
Of course, what you put in these is dependent on the facets you need/use.
We've found a good combination is a firstSearcher with as many facets
in the search as your environment can handle, then a subset of the
most common facets for the newSearcher.
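
A sketch of what these listeners look like in solrconfig.xml (the queries
and facet fields here are placeholders - use your own):

  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="facet">true</str>
        <str name="facet.method">enum</str>
        <str name="facet.field">host</str>
        <str name="facet.field">user</str>
      </lst>
    </arr>
  </listener>
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="facet">true</str>
        <str name="facet.field">host</str>
      </lst>
    </arr>
  </listener>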

4. We also set:
<useColdSearcher>true</useColdSearcher>
just in case.

5. Another key area for search performance with high commits is to use
2 Solr instances - one for the high commit rate indexing, and one for
searching.
The read-only searching instance can be a remote replica, or a local
read-only instance that reads the same core as the indexing instance
(for the latter, you'll need something that periodically refreshes -
i.e. runs commit()).
This way, you can tune the indexing instance for writing performance
and the searching instance as above for max read performance.
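
For the shared-core case, the periodic refresh can be as simple as a cron
job firing an empty commit at the read-only instance (host/port are
placeholders):

  * * * * * curl -s 'http://localhost:8984/solr/update' -H 'Content-Type: text/xml' --data-binary '<commit/>' >/dev/null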

Using the setup above, we get fantastic searching speed for small
facet sets (well under 1sec), and really good searching 

Re: multivalued fields in result

2010-09-12 Thread Lance Norskog
Also, the 'v' is capitalized: multiValued. (This is one reason why 
posting your schema helps.)
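
For reference, the kind of declaration being described looks like this in
schema.xml (field names and types here are made up for illustration):

  <field name="label" type="string" indexed="false" stored="true" multiValued="true"/>
  <field name="label_t" type="text" indexed="true" stored="false" multiValued="true"/>
  <copyField source="label" dest="label_t"/>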


Erick Erickson wrote:

Can we see your schema file? Because it sounds like you didn't
really declare your field multivalued=true on the face of things.

But if it is multivalued AND you changed it, did you reindex after
you changed the schema?

Best
Erick

On Sun, Sep 12, 2010 at 4:21 AM, Jason Chaffeejchaf...@ebates.com  wrote:

   

But it doesn't seem to be returning multivalued fields that are stored.  It
is returning all of the single value fields though.
[...]


Re: Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Chris Haggstrom
Thanks, Peter.  This is really great info.

One setting I've found to be very useful for the problem of overlapping 
onDeckSearchers is to reduce the value of maxWarmingSearchers in 
solrconfig.xml.  I've reduced this to 1, so if a slave is already busy doing 
pre-warming, it won't try to also pre-warm additional updates.  This has 
greatly reduced our time to incorporate updates, with no visible downsides 
other than an uglier snapinstaller.log (we're still using 1.3 w/rsync-based 
replication).
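
In solrconfig.xml that's simply:

  <maxWarmingSearchers>1</maxWarmingSearchers>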

-Chris

On Sep 12, 2010, at 9:26 AM, Peter Sturge wrote:

 Hi,
 
 Below are some notes regarding Solr cache tuning that should prove
 useful for anyone who uses Solr with frequent commits (e.g. <5min).
 [...]

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Jason Rutherglen
Yeah, there's no patch... I think Yonik can write it. :-)  The Lucene
version shouldn't matter.  Distributed faceting could, in theory, easily
be applied to multiple segments, but the way it's written makes it a
challenge for me to untangle and apply successfully to a working patch.
Also, I don't have this as an itch to scratch at the moment.
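
For anyone wanting to try the per-segment faceting that did land on the 4.x
trunk (SOLR-1617), it's selected per request with facet.method=fcs - note it
applies to single-valued fields (field name below is a placeholder):

  http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=host&facet.method=fcs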

On Sun, Sep 12, 2010 at 7:18 PM, Peter Sturge peter.stu...@gmail.com wrote:
 Hi Jason,

 I've tried some limited testing with the 4.x trunk using fcs, and I
 must say, I really like the idea of per-segment faceting.
 I was hoping to see it in 3.x, but I don't see this option in the
 branch_3x trunk. Is your SOLR-1606 patch referred to in SOLR-1617 the
 one to use with 3.1?
 There seem to be a number of Solr issues tied to this - one of them
 being Lucene-1785. Can the per-segment faceting patch work with Lucene
 2.9/branch_3x?

 Thanks,
 Peter



 On Mon, Sep 13, 2010 at 12:05 AM, Jason Rutherglen
 jason.rutherg...@gmail.com wrote:
 Peter,

 Are you using per-segment faceting, e.g. SOLR-1617?  That could help
 your situation.

 On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge peter.stu...@gmail.com 
 wrote:
 Hi,

 Below are some notes regarding Solr cache tuning that should prove
 useful for anyone who uses Solr with frequent commits (e.g. <5min).
 [...]

RE: multivalued fields in result

2010-09-12 Thread Jason Chaffee
My schema.xml was fine.  The problem was that the top 10 documents returned by 
my test queries didn't have any data in those fields.  Once I increased the 
rows, I saw the results.
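
For the record, all it took was raising rows on the test query (host and
params here are placeholders):

  http://localhost:8983/solr/select?q=*:*&rows=100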

Definitely user error.  :)

Thanks for the help, though.

Jason


-Original Message-
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Sun 9/12/2010 6:23 PM
To: solr-user@lucene.apache.org
Subject: Re: multivalued fields in result
 
Also, the 'v' is capitalized: multiValued. (This is one reason why 
posting your schema helps.)

Erick Erickson wrote:
 Can we see your schema file? Because it sounds like you didn't
 really declare your field multivalued=true on the face of things.
 [...]