solr4 performance question
Hi, We have a 10-node SolrCloud cluster (5 shards, 2 replicas) with a 30 GB JVM heap on 60 GB machines and a 40 GB index. We consistently notice that Solr queries take longer while an update (with commit=false) is in progress. A query that usually takes 0.5 seconds can take up to 2 minutes while updates are running, and it isn't every query; the behavior is very sporadic. Any pointers for nailing down this issue would be appreciated. Is there a way to find out how much of a query result came from cache? Can we enable any log settings to start printing what came from cache vs. what was actually queried? Thanks!
Re: solr4 performance question
What do you have for your _softcommit_ settings in solrconfig.xml? I'm guessing you're using SolrJ or similar, but the solrconfig settings will trip a commit as well. For that matter, what are all your commit settings in solrconfig.xml, both hard and soft?

Best, Erick

On Tue, Apr 8, 2014 at 10:28 AM, Joshi, Shital shital.jo...@gs.com wrote:
Re: solr4 performance question
Hi Joshi; Go to the Plugins/Stats section under your collection in the Solr Admin UI. You will see the cache statistics for the different types of caches; hitratio and evictions are good statistics to look at first. You should also read here: https://wiki.apache.org/solr/SolrPerformanceFactors

Thanks; Furkan KAMACI

2014-04-08 20:28 GMT+03:00 Joshi, Shital shital.jo...@gs.com:
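For scripted checks (rather than clicking through the Admin UI), the same cache statistics can be fetched over HTTP from the core's admin/mbeans handler. A small sketch; the host and core names are placeholders for your own deployment:

```shell
# Hypothetical host/core names -- substitute your own. The admin/mbeans
# handler returns the same statistics shown on the Plugins/Stats screen.
SOLR_HOST="localhost:8983"
SOLR_CORE="collection1"
STATS_URL="http://${SOLR_HOST}/solr/${SOLR_CORE}/admin/mbeans?stats=true&cat=CACHE&wt=json"
echo "$STATS_URL"
# Fetch it with:  curl -s "$STATS_URL"
# then look for hitratio / evictions under queryResultCache, filterCache,
# and documentCache in the JSON response.
```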
RE: solr4 performance question
We don't do any soft commits. This is our hard commit setting:

    <autoCommit>
      <maxTime>${solr.autoCommit.maxTime:600000}</maxTime>
      <maxDocs>100000</maxDocs>
      <openSearcher>true</openSearcher>
    </autoCommit>

We use this update command:

    solr_command=$(cat<<EnD
    time zcat --force $file2load | /usr/bin/curl --proxy --silent --show-error --max-time 3600 \
    "http://$solr_url/solr/$solr_core/update/csv?\
    commit=false&\
    separator=|&\
    escape=\\&\
    trim=true&\
    header=false&\
    skipLines=2&\
    overwrite=true&\
    _shard_=$shardid&\
    fieldnames=$fieldnames&\
    f.cs_rep.split=true&\
    f.cs_rep.separator=%5E" \
    --data-binary @- -H 'Content-type:text/plain; charset=utf-8'
    EnD
    )

-----Original Message----- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Tuesday, April 08, 2014 2:21 PM To: solr-user@lucene.apache.org Subject: Re: solr4 performance question
Re: solr4 performance question
bq:
    <maxTime>${solr.autoCommit.maxTime:600000}</maxTime>
    <maxDocs>100000</maxDocs>
    <openSearcher>true</openSearcher>

Every 100K documents or 10 minutes (whichever comes first), your current searchers will be closed and a new searcher opened, and all the warmup queries etc. might happen. I suspect you're not doing much with autowarming and/or newSearcher queries, so occasionally your search has to wait for caches to be read, terms to be populated, etc. Some possibilities to test this:

1. Create some newSearcher queries in solrconfig.xml.
2. Specify a reasonable autowarm count for queryResultCache (don't go crazy here; start with 16 or something similar).
3. Set openSearcher to false above. In this case you won't be able to see the documents until either a hard or soft commit happens; you could cure this with a single hard commit at the end of your indexing run. It all depends on what latency you can tolerate in terms of searching newly indexed documents.

Here's a reference: http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Best, Erick

On Tue, Apr 8, 2014 at 12:11 PM, Joshi, Shital shital.jo...@gs.com wrote:
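Erick's three suggestions could be sketched in solrconfig.xml like so; the warm-up query, sort field, and autowarm size below are illustrative placeholders, not recommendations:

```xml
<!-- Sketch of the suggestions above; values are illustrative placeholders. -->
<query>
  <!-- 2: a modest autowarm count for the query result cache -->
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="16"/>
  <!-- 1: newSearcher queries to pre-warm caches and term dictionaries -->
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">*:*</str><str name="sort">id asc</str></lst>
    </arr>
  </listener>
</query>
<!-- 3: keep hard commits from opening a new searcher -->
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:600000}</maxTime>
  <maxDocs>100000</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>
```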
Performance Question: 'facets.missing'
I'm debating whether or not to set the 'facets.missing' parameter to true by default when faceting. What is the performance impact of setting 'facets.missing' to true? -- View this message in context: http://lucene.472066.n3.nabble.com/Performance-Question-facets-missing-tp4099602.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Performance Question: 'facets.missing'
On Wed, Nov 6, 2013 at 12:07 PM, andres and...@octopart.com wrote:

It really depends on the faceting method. For some faceting methods (like enum), the first use on a new view of the index can be somewhat expensive, but after that the set of docs that have a value in the field should be cached and it will be very cheap. Other facet methods should be cheap regardless.

-Yonik http://heliosearch.com -- making solr shine
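For reference, the parameter is spelled facet.missing in Solr's faceting API (it can also be set per field as f.&lt;field&gt;.facet.missing). A sketch of a request that turns it on for one field; the host, core, and field names are placeholders:

```shell
# Placeholders: adjust host, core, and field name to your deployment.
SOLR="http://localhost:8983/solr/collection1"
FACET_FIELD="category"
QUERY_URL="${SOLR}/select?q=*:*&rows=0&facet=true&facet.field=${FACET_FIELD}&facet.missing=true"
echo "$QUERY_URL"
# The facet counts in the response gain one extra unnamed bucket counting
# documents that have no value in the faceted field.
```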
Re: Solr4 update and query performance question
bq: There is no batching while updating/inserting documents in Solr3

Correct, but all the updates only went to the server you targeted them at. The batching you're seeing is the automatic distribution of docs to the various shards, a whole different animal. Keep an eye on https://issues.apache.org/jira/browse/SOLR-4816; you might prompt Joel to see if this is testable. That JIRA routes the docs directly to the leader of the shard they should go to, i.e. it does the routing on the client side. There will still be batching from the leader to the replicas, but it should help.

It is usually a Bad Thing to commit after every batch from the client, in either Solr 3 or Solr 4. I suspect you're right that waiting for all the searchers on all the shards is one of your problems. Try configuring autocommit (both hard and soft) in solrconfig.xml and dropping the commit bits from the client; this is the usual pattern in Solr4. Your soft commit (which may be commented out) controls when documents become searchable; it is less expensive than a hard commit with openSearcher=true and makes docs visible. A hard commit closes the current segment and opens a new one. So my recommendation would be openSearcher=false for your hard commit, plus a soft commit interval of whatever latency you can stand.

Final note: if you set your hard commit with openSearcher=false, do it fairly often, since it truncates the transaction logs and is quite inexpensive. If you let your tlog grow huge and then kill your server and restart Solr, you can get into a situation where Solr replays the tlog; if it has a bazillion docs in it, that can take a very long time at startup.

Best, Erick

On Wed, Aug 14, 2013 at 4:39 PM, Joshi, Shital shital.jo...@gs.com wrote:
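The commit pattern Erick recommends (hard commits with openSearcher=false to truncate the tlog, soft commits for visibility) might look roughly like this in solrconfig.xml; the interval values are illustrative placeholders only:

```xml
<!-- Sketch only: the maxTime values below are placeholders, not tuning advice. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>15000</maxTime>            <!-- hard commit: flushes segments, truncates the tlog -->
    <openSearcher>false</openSearcher>  <!-- ...without opening a new searcher -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>60000</maxTime>            <!-- soft commit: controls when docs become searchable -->
  </autoSoftCommit>
</updateHandler>
```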
RE: Solr4 update and query performance question
We didn't copy/paste the Solr3 config to Solr4; we started with the Solr4 config and only updated the new searcher queries and a few other things. There is no batching while updating/inserting documents in Solr3, is that correct?

Committing 1000 documents in Solr3 takes 19 seconds, while in Solr4 it takes about 3-4 minutes. We noticed in the Solr4 logs that commit only returns after a new searcher is created across all nodes. This is possibly because waitSearcher=true by default in Solr4; that was not the case with Solr3, where commit would return without waiting for new searcher creation.

To improve performance with Solr4, we first changed from commit=true to commit=false in the update URL and added an autoCommit (hard commit) setting in solrconfig.xml. This improved load time from 3-4 minutes to 1-2 minutes, but that is not good enough. Then we changed the maxBufferedAddsPerServer value in the SolrCmdDistributor class from 10 to 1000, deployed the class under $JETTY_TEMP_FOLDER/solr-webapp/webapp/WEB-INF/classes, and restarted the Solr4 nodes, but we still see a batch size of 10 being used. Did we change the correct variable/class?

Next we will try softCommit=true in the update URL and check whether it gives us the desired performance. Thanks for looking into this; appreciate your help.

-----Original Message----- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Tuesday, August 13, 2013 8:12 AM To: solr-user@lucene.apache.org Subject: Re: Solr4 update and query performance question
Re: Solr4 update and query performance question
1. That's hard-coded at present. There's anecdotal evidence of throughput improvements with larger batch sizes, but no action yet.
2. Yep, all searchers are also re-opened, caches re-warmed, etc.
3. Odd. I'm assuming your Solr3 was a master/slave setup? Seeing the queries would help diagnose this. Also, did you try to copy/paste the configuration from your Solr3 to Solr4? I'd start with the Solr4 config and copy/paste only the parts needed from your Solr3 setup.

Best, Erick

On Mon, Aug 12, 2013 at 11:38 AM, Joshi, Shital shital.jo...@gs.com wrote:
Solr4 update and query performance question
Hi, We have a SolrCloud (4.4.0) cluster (5 shards and 2 replicas) on 10 boxes with about 450 million documents (~90 million per shard). We load 1000 or fewer documents in CSV format every few minutes. In Solr3, with 300 million documents, it used to take 30 seconds to load 1000 documents, while in Solr4 it's taking up to 3 minutes. We use custom sharding: we include the _shard_=shardid parameter in the update command. Looking at the Solr4 log files we found:

1. Documents are added in batches of 10 records. How do we increase this batch size from 10 to 1000 documents?
2. We do a hard commit after loading 1000 documents. Every hard commit refreshes the searcher on all nodes. Are all caches also refreshed when a hard commit happens? We're planning to change to soft commits and do an automatic hard commit every 10-15 minutes.
3. We're not seeing improved query performance compared to Solr3. Queries which took 3-5 seconds in Solr3 (300 million docs) are taking 20 seconds with Solr4. We think this could be due to the frequent hard commits and searcher refreshes. Do you think that when we change to soft commits and increase the batch size, we will see better query performance?

Thanks!
Re: Performance question on Spatial Search
So after re-feeding our data with a new boolean field that is true when the data exists and false when it doesn't, our search times have gone from an average of about 20s to around 150ms... a pretty amazing change in perf. It seems like https://issues.apache.org/jira/browse/SOLR-5093 might alleviate many people's pain in doing this kind of query (if I have some time I may take a look at it). Anyway, we are in pretty good shape at this point. The only remaining issue is that the first queries after commits are taking 5-6s. This is caused by the loading of two FieldCaches (one long and one int, via uninversion) that are used for sorting. I'm suspecting that DocValues will greatly help this load performance?

thanks, steve

On Wed, Jul 31, 2013 at 4:32 PM, Steven Bower smb-apa...@alcyon.net wrote:
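The workaround Steven describes, indexing a boolean "data exists" flag and filtering on it instead of running the expensive open-ended query, would look something like this as a filter query. The host, core, and field names here are hypothetical:

```shell
# Hypothetical deployment details; has_data is an assumed boolean field
# populated at index time.
SOLR="http://localhost:8983/solr/collection1"
FILTERED_URL="${SOLR}/select?q=*:*&fq=has_data:true"
echo "$FILTERED_URL"
# Because fq results are cached in the filterCache, repeated queries with
# this filter stay cheap after the first execution.
```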
Re: Performance question on Spatial Search
On 8/5/2013 7:13 AM, Steven Bower wrote:

I would handle this by using newSearcher events in the config to run a search for all documents (*:*) with your desired sort parameters. That way the FieldCache will be pre-populated before the new searcher accepts any queries; the old searcher will continue to handle queries while this is happening. Be aware that this will increase your commit time, which might mean you need to decrease the autowarmCount values on your Solr caches to compensate. If you have removed this section from your solrconfig.xml file, see the example config.

Thanks, Shawn
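A sketch of Shawn's suggestion in solrconfig.xml; the sort field names below are placeholders for the long and int fields being uninverted for sorting:

```xml
<!-- Sketch only; my_long_field / my_int_field are placeholder names. -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- Match all docs with the production sort so the FieldCache is
         populated before the new searcher goes live. -->
    <lst>
      <str name="q">*:*</str>
      <str name="sort">my_long_field asc, my_int_field asc</str>
    </lst>
  </arr>
</listener>
```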
Re: Performance question on Spatial Search
From: Steven Bower-2 [via Lucene] ml-node+s472066n4082569...@n3.nabble.com Date: Monday, August 5, 2013 9:14 AM To: Smiley, David W. dsmi...@mitre.org Subject: Re: Performance question on Spatial Search

So after re-feeding our data with a new boolean field that is true when data exists and false when it doesn't, our search times have gone from an avg of about 20s to around 150ms... pretty amazing change in perf.

Awesome performance improvement!

Anyway, we are in pretty good shape at this point... the only remaining issue is that the first queries after commits are taking 5-6s, caused by the loading of two FieldCaches (one long and one int, via uninvert) used for sorting. I'm suspecting that docvalues will greatly help this load performance?

DocValues will help a lot. I'd love to see the before/after times on that conversion. I'm surprised it's taking as long as it is... but then you have a ton of data in one index, so it's plausible. Lucene 4.4 has some compression improvements there: LUCENE-5035.

~ David
Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
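Moving the sort fields onto DocValues, as discussed above, is a schema.xml change followed by a full re-index. A sketch; the field and type names are placeholders:

```xml
<!-- schema.xml sketch; names are placeholders. With docValues="true" the
     column-oriented values are written at index time, so sorting no longer
     has to uninvert the field into the FieldCache after each commit. -->
<field name="my_long_field" type="tlong" indexed="true" stored="true" docValues="true"/>
<field name="my_int_field"  type="tint"  indexed="true" stored="true" docValues="true"/>
```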
Re: Performance question on Spatial Search
On Wed, Jul 31, 2013 at 1:10 AM, Steven Bower sbo...@alcyon.net wrote: not sure what you mean by good hit ratio?

I mean such queries are really expensive (even on a cache hit), so if the list of ids changes every time, it never hits the cache and hence executes these heavy queries every time. It's a well-known performance problem.

Here are the stacks... they look like hotspots, and show index reading, which is reasonable. But I can't see what caused these reads; to get that I need the whole stack of the hot thread.

    Name                                                                        Time (ms)  Own Time (ms)
    org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(AtomicReaderContext, Bits)  300879  203478
    org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.nextDoc()  45539  19
    org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.refillDocs()  45519  40
    org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readVIntBlock(IndexInput, int[], int[], int, boolean)  24352  0
    org.apache.lucene.store.DataInput.readVInt()  24352  24352
    org.apache.lucene.codecs.lucene41.ForUtil.readBlock(IndexInput, byte[], int[])  21126  14976
    org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)  6150  0
    java.nio.DirectByteBuffer.get(byte[], int, int)  6150  0
    java.nio.Bits.copyToArray(long, Object, long, long, long)  6150  6150
    org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.docs(Bits, DocsEnum, int)  35342  421
    org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData()  34920  27939
    org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.nextTerm(FieldInfo, BlockTermState)  6980  6980
    org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.next()  14129  1053
    org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadNextFloorBlock()  5948  261
    org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()  5686  199
    org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)  3606  0
    java.nio.DirectByteBuffer.get(byte[], int, int)  3606  0
    java.nio.Bits.copyToArray(long, Object, long, long, long)  3606  3606
    org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock(IndexInput, FieldInfo, BlockTermState)  1879  80
    org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)  1798  0
    java.nio.DirectByteBuffer.get(byte[], int, int)  1798  0
    java.nio.Bits.copyToArray(long, Object, long, long, long)  1798  1798
    org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.next()  4010  3324
    org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.nextNonLeaf()  685  685
    org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()  3117  144
    org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)  1861  0
    java.nio.DirectByteBuffer.get(byte[], int, int)  1861  0
    java.nio.Bits.copyToArray(long, Object, long, long, long)  1861  1861
    org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock(IndexInput, FieldInfo, BlockTermState)  1090  19
    org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)  1070  0
    java.nio.DirectByteBuffer.get(byte[], int, int)  1070  0
    java.nio.Bits.copyToArray(long, Object, long, long, long)  1070  1070
    org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.initIndexInput()  20  0
    org.apache.lucene.store.ByteBufferIndexInput.clone()  20  0
    org.apache.lucene.store.ByteBufferIndexInput.clone()  20  0
    org.apache.lucene.store.ByteBufferIndexInput.buildSlice(long, long)  20  0
    org.apache.lucene.util.WeakIdentityMap.put(Object, Object)  20  0
    org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference.init(Object, ReferenceQueue)  20  0
    java.lang.System.identityHashCode(Object)  20  20
    org.apache.lucene.index.FilteredTermsEnum.docs(Bits, DocsEnum, int)  1485  527
    org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.docs(Bits, DocsEnum, int)  957  0
    org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData()  957  513
    org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.nextTerm(FieldInfo, BlockTermState)  443  443
    org.apache.lucene.index.FilteredTermsEnum.next()  874  324
    org.apache.lucene.search.NumericRangeQuery$NumericRangeTermsEnum.accept(BytesRef)  368  0
    org.apache.lucene.util.BytesRef$UTF8SortedAsUnicodeComparator.compare(Object, Object)  368  368
    org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.next()  160  0
    org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadNextFloorBlock()  160  0
    org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()  160  0
    org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)  120  0
Re: Performance question on Spatial Search
The list of IDs does change relatively frequently, but this doesn't seem to have much impact on the performance of the query as far as I can tell. Attached are the stacks.

thanks, steve

On Wed, Jul 31, 2013 at 6:33 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

On Wed, Jul 31, 2013 at 1:10 AM, Steven Bower sbo...@alcyon.net wrote: not sure what you mean by good hit ratio?

I mean such queries are really expensive (even on a cache hit), so if the list of ids changes every time, it never hits the cache and hence executes these heavy queries every time. It's a well-known performance problem.

Here are the stacks... they look like hotspots and show index reading, which is reasonable. But I can't see what caused these reads; to get that I need the whole stack of the hot thread.

[quoted profiler output trimmed; it repeats the trace from Steven's earlier message in full]
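Mikhail's point can be seen in miniature with a memoized function: if the fq string is identical across requests, the second request is a cache hit; if the id list changes every request, every request is a miss and the expensive filter is rebuilt from scratch. This is only an analogy to Solr's filterCache (the cache size and fq strings below are made up), not Solr code:

```python
from functools import lru_cache

@lru_cache(maxsize=512)  # stand-in for Solr's filterCache; size is illustrative
def run_filter(fq: str) -> frozenset:
    # Pretend this walks the term dictionary the way
    # MultiTermQueryWrapperFilter.getDocIdSet does in the profile above.
    return frozenset(range(hash(fq) % 100))

# The same geo fq repeated: the second call is a hit (the fast 10-20ms case).
run_filter("{!geofilt sfield=geopoint pt=40.7,-74.0 d=50}")
run_filter("{!geofilt sfield=geopoint pt=40.7,-74.0 d=50}")

# An id list that changes per request: a new cache key every time, all misses.
for ids in ("id:(1 2)", "id:(3 4)", "id:(5 6)"):
    run_filter(ids)

info = run_filter.cache_info()
print(info.hits, info.misses)  # 1 hit, 4 misses
```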
Re: Performance question on Spatial Search
looking to ensure you are not using IsWithin, which is not meant for point data. If your query shape is a circle or the bounding box of a circle, you should use the geofilt query parser; otherwise use the quirky syntax that allows you to specify the spatial predicate with Intersects. (2) Do you actually need JTS? i.e., are you using polygons, etc.? (3) How dense would you estimate the data is at the 50m resolution you've configured? If it's very dense then I'll tell you how to raise the prefix grid scan level to a number closer to max-levels. (4) Do all of your searches find less than a million points, considering all filters? If so then it's worth comparing the results with LatLonType.

~ David Smiley

Steven Bower wrote:

@Erick it is a lot of hw, but basically trying to create a best-case scenario to take HW out of the question. Will try increasing heap size tomorrow.. I haven't seen it get close to the max heap size yet.. but it's worth trying...

Note that these queries look something like: q=*:* fq=[date range] fq=geo query

On the fq for the geo query I've added {!cache=false} to prevent it from ending up in the filter cache.. once it's in the filter cache queries come back in 10-20ms. For my use case I need the first unique geo search query to come back in a more reasonable time, so I am currently ignoring the cache.

@Bill will look into that; I'm not certain it will support the particular queries that are being executed, but I'll investigate..

steve

On Mon, Jul 29, 2013 at 6:25 PM, Erick Erickson wrote:

This is very strange. I'd expect slow queries on the first few queries while these caches were warmed, but after that I'd expect things to be quite fast. For a 12G index and 256G RAM, you have on the surface a LOT of hardware to throw at this problem. You can _try_ giving the JVM, say, 18G, but that really shouldn't be a big issue; your index files should be MMapped. Let's try the crude thing first and give the JVM more memory.

FWIW, Erick

On Mon, Jul 29, 2013 at 4:45 PM, Steven Bower wrote:

I've been doing some performance analysis of a spatial search use case I'm implementing in Solr 4.3.0. Basically I'm seeing search times a lot higher than I'd like them to be, and I'm hoping people may have some suggestions for how to optimize further.

Here are the specs of what I'm doing now:

Machine: 16 cores @ 2.8GHz, 256GB RAM, 1TB (RAID 1+0 on 10 SSD)

Content: 45M docs (not very big, only a few fields with no large textual content), 1 geo field (using the config below), index is 12GB, 1 shard, using MMapDirectory

Field config:

<fieldType name="geo" class="solr.SpatialRecursivePrefixTreeFieldType" distErrPct="0.025" maxDistErr="0.00045" spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory" units="degrees"/>
<field name="geopoint" indexed="true" multiValued="false" required="false" stored="true" type="geo"/>

What I've figured out so far:
- Most of my time (98%) is being spent in java.nio.Bits.copyToByteArray(long, Object, long, long), which is being driven by BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock(), which from what I gather is basically reading terms from the .tim file in blocks
- I moved from Java 1.6 to 1.7 based upon what I read here: http://blog.vlad1.com/2011/10/05/looking-at-java-nio-buffer-performance/ and it definitely had some positive impact (I haven't been able to measure this independently yet)
- I changed maxDistErr from 0.09 (which is 1m precision per the docs) to 0.00045 (50m precision)
- It looks to me that the .tim files are being memory-mapped fully (i.e. they show up in pmap output); the virtual size of the JVM is ~18GB (heap is 6GB)
- I've optimized the index, but this doesn't have a dramatic impact on performance

Changing the precision and the JVM upgrade yielded a drop from ~18s avg query time to ~9s avg query time.. This is fantastic, but I want to get this down into the 1-2 second range.

At this point it seems that I am basically bottlenecked on copying memory out of the mapped .tim file, which leads me to think that the only solution to my problem would be to read less data or somehow read it more efficiently..

If anyone has any suggestions of where to go with this I'd love to know

thanks, steve

-
Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: http://lucene.472066.n3.nabble.com/Performance-question-on-Spatial-Search-tp4081150p4081309.html
Sent from the Solr - User mailing list archive at Nabble.com.
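The query shape Steve describes (match-all q, a date-range fq, and a geo fq carrying {!cache=false}) can be sketched as a plain HTTP request. The host, collection, field names, point and distance below are hypothetical; only the shape of the parameters mirrors the thread:

```python
from urllib.parse import urlencode

# Hypothetical endpoint and values; the key detail is that the geo fq
# carries {!geofilt cache=false ...} so Solr skips the filterCache for it.
params = [
    ("q", "*:*"),
    ("fq", "date:[2013-01-01T00:00:00Z TO 2013-08-01T00:00:00Z]"),
    ("fq", "{!geofilt cache=false sfield=geopoint pt=40.75,-73.99 d=50}"),
    ("rows", "10"),
]
url = "http://localhost:8983/solr/collection1/select?" + urlencode(params)
print(url)
```

Issuing this with curl against a real collection would exercise the uncached path on every request, which is exactly the first-query cost the thread is trying to bring down.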
Re: Performance question on Spatial Search
Changing the precision and the JVM upgrade yielded a drop from ~18s avg query time to ~9s avg query time.. This is fantastic, but I want to get this down into the 1-2 second range. At this point it seems that I am basically bottlenecked on copying memory out of the mapped .tim file, which leads me to think that the only solution to my problem would be to read less data or somehow read it more efficiently.. If anyone has any suggestions of where to go with this I'd love to know thanks, steve - Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/Performance-question-on-Spatial-Search-tp4081150p4081309.html Sent from the Solr - User mailing list archive at Nabble.com. -- - Luis Cappa
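The precision change discussed above (maxDistErr 0.09 → 0.00045 degrees) is what drives the extra term-dictionary work: a prefix tree has to index enough levels that its smallest cells are no coarser than maxDistErr, and each added level multiplies the number of indexed grid terms to scan. A rough back-of-the-envelope for a quadtree-style grid — an illustration only, not Lucene's exact SpatialPrefixTree bookkeeping:

```python
import math

def quad_levels(max_dist_err_degrees: float) -> int:
    # A quadtree halves cell width at each level, starting from the
    # 360-degree world extent; pick the shallowest depth whose cells
    # are at least as fine as the requested error.
    return max(1, math.ceil(math.log2(360.0 / max_dist_err_degrees)))

# The finer setting needs a noticeably deeper tree (and far more terms).
print(quad_levels(0.09), quad_levels(0.00045))
```

Under this toy model the 0.00045-degree setting needs roughly eight more levels than 0.09, so raising distErrPct or the prefix grid scan level (as David suggests) trades precision for fewer term-dictionary reads.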
Re: Performance question on Spatial Search
Very good read... Already using MMap... verified using pmap and vsz from top..

Not sure what you mean by good hit ratio?

Here are the stacks (one frame per line: name, time in ms, own time in ms):

Name  Time (ms)  Own Time (ms)
org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(AtomicReaderContext, Bits)  300879  203478
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.nextDoc()  45539  19
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.refillDocs()  45519  40
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readVIntBlock(IndexInput, int[], int[], int, boolean)  24352  0
org.apache.lucene.store.DataInput.readVInt()  24352  24352
org.apache.lucene.codecs.lucene41.ForUtil.readBlock(IndexInput, byte[], int[])  21126  14976
org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)  6150  0
java.nio.DirectByteBuffer.get(byte[], int, int)  6150  0
java.nio.Bits.copyToArray(long, Object, long, long, long)  6150  6150
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.docs(Bits, DocsEnum, int)  35342  421
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData()  34920  27939
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.nextTerm(FieldInfo, BlockTermState)  6980  6980
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.next()  14129  1053
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadNextFloorBlock()  5948  261
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()  5686  199
org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)  3606  0
java.nio.DirectByteBuffer.get(byte[], int, int)  3606  0
java.nio.Bits.copyToArray(long, Object, long, long, long)  3606  3606
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock(IndexInput, FieldInfo, BlockTermState)  1879  80
org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)  1798  0
java.nio.DirectByteBuffer.get(byte[], int, int)  1798  0
java.nio.Bits.copyToArray(long, Object, long, long, long)  1798  1798
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.next()  4010  3324
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.nextNonLeaf()  685  685
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()  3117  144
org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)  1861  0
java.nio.DirectByteBuffer.get(byte[], int, int)  1861  0
java.nio.Bits.copyToArray(long, Object, long, long, long)  1861  1861
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock(IndexInput, FieldInfo, BlockTermState)  1090  19
org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)  1070  0
java.nio.DirectByteBuffer.get(byte[], int, int)  1070  0
java.nio.Bits.copyToArray(long, Object, long, long, long)  1070  1070
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.initIndexInput()  20  0
org.apache.lucene.store.ByteBufferIndexInput.clone()  20  0
org.apache.lucene.store.ByteBufferIndexInput.clone()  20  0
org.apache.lucene.store.ByteBufferIndexInput.buildSlice(long, long)  20  0
org.apache.lucene.util.WeakIdentityMap.put(Object, Object)  20  0
org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference.init(Object, ReferenceQueue)  20  0
java.lang.System.identityHashCode(Object)  20  20
org.apache.lucene.index.FilteredTermsEnum.docs(Bits, DocsEnum, int)  1485  527
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.docs(Bits, DocsEnum, int)  957  0
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData()  957  513
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.nextTerm(FieldInfo, BlockTermState)  443  443
org.apache.lucene.index.FilteredTermsEnum.next()  874  324
org.apache.lucene.search.NumericRangeQuery$NumericRangeTermsEnum.accept(BytesRef)  368  0
org.apache.lucene.util.BytesRef$UTF8SortedAsUnicodeComparator.compare(Object, Object)  368  368
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.next()  160  0
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadNextFloorBlock()  160  0
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()  160  0
org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)  120  0
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock(IndexInput, FieldInfo, BlockTermState)  39  0
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekCeil(BytesRef, boolean)  19  0
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.initIndexInput()  19  0
org.apache.lucene.store.ByteBufferIndexInput.clone()  19  0
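The hottest frames in that trace (java.nio.Bits.copyToArray under ByteBufferIndexInput.readBytes) are the cost of copying bytes out of the memory-mapped .tim file. A tiny sketch of the same mechanism using Python's mmap module, with a throwaway temp file standing in for the index file:

```python
import mmap
import os
import tempfile

# Write a small stand-in for a Lucene .tim file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"term-block-" * 1024)
    path = f.name

# Read through an OS memory map: no explicit read() per access; the kernel
# pages data in, and slicing copies bytes out of the mapping, which is the
# analogue of the copyToArray frames in the profile above.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        block = mm[0:11]
        print(block)  # b'term-block-'

os.unlink(path)
```

This is why the discussion turns to reading fewer terms (coarser precision) rather than faster I/O: once the file is resident in the page cache, the remaining cost is the copy itself.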
Re: Performance question on Spatial Search
@David I will certainly update when we get the data re-fed... and if you have things you'd like to investigate or try out, please let me know.. I'm happy to eval things at scale here... we will be taking this index from its current 45M records to 6-700M over the next few months as well..

steve

On Tue, Jul 30, 2013 at 5:10 PM, Steven Bower sbo...@alcyon.net wrote:

Very good read... Already using MMap... verified using pmap and vsz from top.. not sure what you mean by good hit ratio? Here are the stacks... [quoted profiler output trimmed; identical to the trace in the previous message]
Re: Performance question on Spatial Search
Can you compare with the old geo handler as a baseline? Bill Bell Sent from mobile

On Jul 29, 2013, at 4:25 PM, Erick Erickson erickerick...@gmail.com wrote:

This is very strange. I'd expect slow queries on the first few queries while these caches were warmed, but after that I'd expect things to be quite fast. For a 12GB index and 256GB of RAM, you have on the surface a LOT of hardware to throw at this problem. You can _try_ giving the JVM, say, 18GB, but that really shouldn't be a big issue; your index files should be MMapped. Let's try the crude thing first and give the JVM more memory. FWIW, Erick

On Mon, Jul 29, 2013 at 4:45 PM, Steven Bower smb-apa...@alcyon.net wrote:

I've been doing some performance analysis of a spatial search use case I'm implementing in Solr 4.3.0. Basically I'm seeing search times a lot higher than I'd like them to be, and I'm hoping people may have some suggestions for how to optimize further. Here are the specs of what I'm doing now:

Machine:
- 16 cores @ 2.8GHz
- 256GB RAM
- 1TB (RAID 1+0 on 10 SSDs)

Content:
- 45M docs (not very big; only a few fields, with no large textual content)
- 1 geo field (using the config below)
- index is 12GB
- 1 shard
- using MMapDirectory

Field config:

<fieldType name="geo" class="solr.SpatialRecursivePrefixTreeFieldType"
           distErrPct="0.025" maxDistErr="0.00045"
           spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
           units="degrees"/>
<field name="geopoint" type="geo" indexed="true" stored="true" multiValued="false" required="false"/>

What I've figured out so far:
- Most of my time (98%) is being spent in java.nio.Bits.copyToByteArray(long,Object,long,long), which is being driven by BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock(), which from what I gather is basically reading terms from the .tim file in blocks.
- I moved from Java 1.6 to 1.7 based on what I read here: http://blog.vlad1.com/2011/10/05/looking-at-java-nio-buffer-performance/ and it definitely had some positive impact (I haven't been able to measure this independently yet).
- I changed maxDistErr from 0.000009 (which is 1m precision, per the docs) to 0.00045 (50m precision).
- It looks to me like the .tim files are being memory-mapped fully (i.e. they show up in pmap output); the virtual size of the JVM is ~18GB (heap is 6GB).
- I've optimized the index, but this doesn't have a dramatic impact on performance.

Changing the precision and the JVM upgrade yielded a drop from ~18s avg query time to ~9s avg query time. This is fantastic, but I want to get it down into the 1-2 second range. At this point it seems that I am basically bottlenecked on copying memory out of the mapped .tim file, which leads me to think that the only solution to my problem would be to read less data or somehow read it more efficiently. If anyone has any suggestions of where to go with this, I'd love to know. Thanks, Steve
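A side note on the units in this thread: maxDistErr is expressed in degrees, and the 1m/50m figures follow from the rough equatorial conversion of about 111.32 km per degree. A quick sketch of that arithmetic (the conversion constant is an approximation that varies with latitude, and the class name is mine, not from the thread):

```java
// Sketch: converting SpatialRecursivePrefixTreeFieldType's maxDistErr
// (given in degrees) to an approximate distance in metres.
public class MaxDistErr {
    // ~111.32 km per degree at the equator; an approximation only
    static final double METRES_PER_DEGREE = 111_320.0;

    static double degreesToMetres(double deg) {
        return deg * METRES_PER_DEGREE;
    }

    public static void main(String[] args) {
        // 0.000009 deg is roughly 1 m; 0.00045 deg is roughly 50 m
        System.out.printf("0.000009 deg = %.2f m%n", degreesToMetres(0.000009));
        System.out.printf("0.00045 deg  = %.2f m%n", degreesToMetres(0.00045));
    }
}
```

Coarsening maxDistErr shrinks the prefix-tree depth, which is why Steve saw the query-time drop when moving from 1m to 50m precision.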
SOLR Performance question
Hi everybody. I have stored 42 fields in Solr, 34 of them indexed, and I am going to store 4-6 more columns, with 3-5 of those indexed. The total number of documents stored is about 250, and it may reach up to 500. So the question is: will I run into any problems? My machine is an m1.small on Amazon EC2. Should I shift to m1.large for 250 documents, or for 500, or will it work as it is for now? -- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-Performance-question-tp4041245.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: SOLR Performance question
Hi Anurag, We are running Solr with almost the same number of documents, with more than 50 indexed fields and 78 stored fields, and we have had no issues so far, so I can say that you won't face any problem. Regards Harshvardhan Ojha

-----Original Message----- From: anurag.jain [mailto:anurag.k...@gmail.com] Sent: Tuesday, February 19, 2013 1:46 PM To: solr-user@lucene.apache.org Subject: SOLR Performance question Hi everybody. I stored 42 field in solr. and indexed 34 field. ...

The contents of this email, including the attachments, are PRIVILEGED AND CONFIDENTIAL to the intended recipient at the email address to which it has been addressed. If you receive it in error, please notify the sender immediately by return email and then permanently delete it from your system. The unauthorized use, distribution, copying or alteration of this email, including the attachments, is strictly forbidden. Please note that neither MakeMyTrip nor the sender accepts any responsibility for viruses and it is your responsibility to scan the email and attachments (if any). No contracts may be concluded on behalf of MakeMyTrip by means of email communications.
Solr 4.0 indexing performance question
I am having some difficulty migrating our Solr indexing scripts from 3.5 to Solr 4.0. Notably, I am trying to track down why our indexing performance in Solr 4.0 is about 5-10 times slower. Querying is still quite fast. The code adds documents in groups of 1000 and adds each group to Solr in a thread. The documents are somewhat large, including maybe 30-40 different field types, mostly multivalued. Here are some snippets of the code we used in 3.5:

MultiThreadedHttpConnectionManager mgr = new MultiThreadedHttpConnectionManager();
HttpClient client = new HttpClient(mgr);
CommonsHttpSolrServer server = new CommonsHttpSolrServer( some url for our index, client );
server.setRequestWriter(new BinaryRequestWriter());

Then, we delete the index and proceed to generate documents and load the groups in a thread that looks kind of like this (I've omitted some overhead for handling exceptions and retry attempts):

class DocWriterThread implements Runnable {
    CommonsHttpSolrServer server;
    Collection<SolrInputDocument> docs;
    private int commitWithin = 50000; // 50 seconds (commitWithin is in milliseconds)

    public DocWriterThread(CommonsHttpSolrServer server, Collection<SolrInputDocument> docs) {
        this.server = server;
        this.docs = docs;
    }

    public void run() {
        // set the commitWithin feature
        server.add(docs, commitWithin);
    }
}

Now, I've had to change some things to get this to compile with the Solr 4.0 libraries. Here is what I tried to convert the above code to. I don't know if these are the correct equivalents, as I am not familiar with Apache HttpComponents:

ThreadSafeClientConnManager mgr = new ThreadSafeClientConnManager();
DefaultHttpClient client = new DefaultHttpClient(mgr);
HttpSolrServer server = new HttpSolrServer( some url for our solr index, client );
server.setRequestWriter(new BinaryRequestWriter());

The thread method is the same, but uses HttpSolrServer instead of CommonsHttpSolrServer.

We also had an old solrconfig (not sure what version, but it is pre-3.x and had mostly default values) that I had to replace with a 4.0-style solrconfig.xml. I don't want to post the entire file (as it is large), but I copied one from the Solr 4.0 examples and made a couple of changes. First, I wanted to turn off transaction logging, so essentially I have a line like this (everything inside is commented out):

<updateHandler class="solr.DirectUpdateHandler2"></updateHandler>

And I added a handler for javabin:

<requestHandler name="/update/javabin" class="solr.BinaryUpdateRequestHandler">
  <lst name="defaults">
    <str name="stream.contentType">application/javabin</str>
  </lst>
</requestHandler>

I'm not sure what other configurations I should look at. I would think that there should be a big obvious reason why the indexing performance would drop nearly 10-fold. Against our 3.5 instance I timed our index load, and it adds roughly 40,000 documents every 3-8 seconds. Against our 4.0 instance it adds 40,000 documents every 70-75 seconds. This isn't the end of the world, and I would love to use the new join feature in Solr 4.0. However, we have many different indexes with millions of documents, and this kind of increase in load time is troubling. Thanks for your help. -Kevin The information in this email, including attachments, may be confidential and is intended solely for the addressee(s). If you believe you received this email by mistake, please notify the sender by return email as soon as possible.
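For reference, one way to lay out a 4.x updateHandler with the transaction log disabled is to omit the updateLog element entirely rather than leave it commented out inside the element. This is only a sketch; the autoCommit values are illustrative placeholders, not settings from the thread:

```xml
<!-- Sketch: transaction log disabled by leaving out <updateLog/> entirely.
     The autoCommit values below are illustrative placeholders. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>60000</maxTime>            <!-- hard commit at most every 60s -->
    <openSearcher>false</openSearcher>  <!-- don't open a new searcher on hard commit -->
  </autoCommit>
</updateHandler>
```

An empty `<updateHandler .../>` element as in the message above also works; the point is simply that no `<updateLog/>` child is present.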
Re: Solr 4.0 indexing performance question
It's hard to guess, but I might start by looking at what the new UpdateLog is costing you. Take its definition out of solrconfig.xml and try your test again. Then let's take it from there. - Mark

On Jan 23, 2013, at 11:00 AM, Kevin Stone kevin.st...@jax.org wrote: I am having some difficulty migrating our solr indexing scripts from using 3.5 to solr 4.0. Notably, I am trying to track down why our performance in solr 4.0 is about 5-10 times slower when indexing documents. ...
Re: Solr 4.0 indexing performance question
Do you mean commenting out the <updateLog>...</updateLog> tag? Because that I already commented out. Or do I also need to remove the entire <updateHandler> tag? Sorry, I am not too familiar with everything in the solrconfig file. I have a tag that essentially looks like this (everything inside is commented out):

<updateHandler class="solr.DirectUpdateHandler2"></updateHandler>

-Kevin

On 1/23/13 11:21 AM, Mark Miller markrmil...@gmail.com wrote: It's hard to guess, but I might start by looking at what the new UpdateLog is costing you. Take its definition out of solrconfig.xml and try your test again. Then let's take it from there. - Mark ...
Re: Solr 4.0 indexing performance question
I'm still poking around trying to find the differences. I found a couple of things that may or may not be relevant. First, when I start up my 3.5 Solr, I get all sorts of warnings that my solrconfig is old and will run using 2.4 emulation. Of course I had to upgrade the solrconfig for the 4.0 instance (which I already described). I am curious whether there could be some feature I was taking advantage of in 2.4 that doesn't exist now in 4.0. I don't know. Second, when I look at the console logs for my servers (3.5 and 4.0) and run the indexer against each, I see a subtle difference in this printout when it connects to the Solr core.

The 3.5 version prints this out:

webapp=/solr path=/update params={waitSearcher=true&wt=javabin&commit=true&softCommit=false&version=2} {commit=} 0 2722

The 4.0 version prints this out:

webapp=/solr path=/update/javabin params={wt=javabin&commit=true&waitFlush=true&waitSearcher=true&version=2} status=0 QTime=1404

The params for the update handler seem ever so slightly different. The 3.5 version (the one that runs fast) has the setting softCommit=false. The 4.0 version does not print that setting, but instead prints waitFlush=true. These could be irrelevant, but I thought I should add the information. -Kevin

On 1/23/13 11:42 AM, Kevin Stone kevin.st...@jax.org wrote: Do you mean commenting out the updateLog tag? Because that I already commented out. ...

On 1/23/13 11:21 AM, Mark Miller markrmil...@gmail.com wrote: It's hard to guess, but I might start by looking at what the new UpdateLog is costing you. ...
Re: Solr 4.0 indexing performance question
Another revelation... I can see that there is a time difference in the Solr output for adding these documents when I watch it in real time. Here are some rows from the 3.5 Solr server:

Jan 23, 2013 11:57:23 AM org.apache.solr.core.SolrCore execute
INFO: [gxdResult] webapp=/solr path=/update/javabin params={wt=javabin&version=2} status=0 QTime=6196
Jan 23, 2013 11:57:23 AM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[RNA in situ-1386104, RNA in situ-1351487, RNA in situ-1363917, RNA in situ-1377125, RNA in situ-1371738, RNA in situ-1378746, RNA in situ-1383410, RNA in situ-1362712, ... (1001 adds)]} 0 6266
Jan 23, 2013 11:57:23 AM org.apache.solr.core.SolrCore execute
INFO: [gxdResult] webapp=/solr path=/update/javabin params={wt=javabin&version=2} status=0 QTime=6266
Jan 23, 2013 11:57:24 AM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[RNA in situ-1371578, RNA in situ-1377716, RNA in situ-1378151, RNA in situ-1360580, RNA in situ-1391657, RNA in situ-1370288, RNA in situ-1388236, RNA in situ-1361465, ... (1001 adds)]} 0 6371
Jan 23, 2013 11:57:24 AM org.apache.solr.core.SolrCore execute
INFO: [gxdResult] webapp=/solr path=/update/javabin params={wt=javabin&version=2} status=0 QTime=6371
Jan 23, 2013 11:57:24 AM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[RNA in situ-1350555, RNA in situ-1350887, RNA in situ-1379699, RNA in situ-1373773, RNA in situ-1374004, RNA in situ-1372265, RNA in situ-1373027, RNA in situ-1380691, ... (1001 adds)]} 0 6440
Jan 23, 2013 11:57:24 AM org.apache.solr.core.SolrCore execute

And here are rows from the 4.0 Solr:

Jan 23, 2013 3:40:22 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2} {add=[RNA in situ-115650, RNA in situ-4109, RNA in situ-107614, RNA in situ-86038, RNA in situ-19647, RNA in situ-1422, RNA in situ-119536, RNA in situ-5, RNA in situ-86825, RNA in situ-91009, ... (1001 adds)]} 0 3105
Jan 23, 2013 3:40:23 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2} {add=[RNA in situ-38103, RNA in situ-15797, RNA in situ-79946, RNA in situ-124877, RNA in situ-62025, RNA in situ-67908, RNA in situ-70527, RNA in situ-20581, RNA in situ-107574, RNA in situ-96497, ... (1001 adds)]} 0 2689
Jan 23, 2013 3:40:24 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2} {add=[RNA in situ-35518, RNA in situ-50512, RNA in situ-109961, RNA in situ-113025, RNA in situ-33729, RNA in situ-116967, RNA in situ-133871, RNA in situ-55287, RNA in situ-67367, RNA in situ-8617, ... (1001 adds)]} 0 2367
Jan 23, 2013 3:40:28 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2} {add=[RNA in situ-105749, RNA in situ-125415, RNA in situ-14667, RNA in situ-41067, RNA in situ-1099, RNA in situ-86169, RNA in situ-90834, RNA in situ-114639, RT-PCR-26160, RNA in situ-79745, ... (1001 adds)]} 0 3401
Jan 23, 2013 3:40:28 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2} {add=[RNA in situ-82061, RNA in situ-96965, RNA in situ-22677, RNA in situ-52637, RNA in situ-131842, RNA in situ-31863, RNA in situ-111656, RNA in situ-120509, RNA in situ-29659, RNA in situ-63579, ... (1001 adds)]} 0 3580
Jan 23, 2013 3:40:31 PM org.apache.solr.update.processor.LogUpdateProcessor finish

I know that they aren't the same exact documents (like I said, there are millions to load), but the times look pretty much like this for all of them. Can someone help me parse out these times? It *appears* to me that the inserts are happening just as fast, if not faster, in 4.0 than in 3.5, BUT the timestamps between the LogUpdateProcessor calls are much longer in 4.0.

I do not have the updateLog tag anywhere in my solrconfig.xml. So why does it look to me like it is spending a lot of time logging? It shouldn't really be logging anything, right? Bear in mind that these inserts happen in threads that are pushing to Solr concurrently. So if 4.0 is logging somewhere that 3.5 didn't, then the file-locking on that log file could be slowing me down. -Kevin

On 1/23/13 12:03 PM, Kevin Stone kevin.st...@jax.org wrote: I'm still poking around trying to find the differences. I found a couple things that may or may not be relevant. ...
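To make the "timestamps between LogUpdateProcessor calls" comparison concrete, here is a small, hypothetical helper (not part of Solr) that extracts the wall-clock time from two consecutive log lines so the gap can be compared with the reported QTime; a gap much larger than QTime means time is being spent outside what Solr measures for the request:

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pull the h:mm:ss AM/PM timestamp out of a JDK-logging
// line and measure the wall-clock gap between two consecutive add batches.
public class LogGaps {
    static final Pattern TS = Pattern.compile("(\\d{1,2}:\\d{2}:\\d{2}) (AM|PM)");

    static LocalTime parse(String line) {
        Matcher m = TS.matcher(line);
        if (!m.find()) throw new IllegalArgumentException("no timestamp: " + line);
        return LocalTime.parse(m.group(1) + " " + m.group(2),
                DateTimeFormatter.ofPattern("h:mm:ss a", Locale.US));
    }

    public static void main(String[] args) {
        // Two consecutive LogUpdateProcessor lines from the 4.0 output above
        String a = "Jan 23, 2013 3:40:24 PM org.apache.solr.update.processor.LogUpdateProcessor finish";
        String b = "Jan 23, 2013 3:40:28 PM org.apache.solr.update.processor.LogUpdateProcessor finish";
        long gapMs = Duration.between(parse(a), parse(b)).toMillis();
        // 4000 ms of wall clock for a batch whose reported QTime was 2367 ms
        System.out.println("gap between adds: " + gapMs + " ms");
    }
}
```

On the numbers quoted in the message, the 4.0 batches report QTimes of 2-3.5 s but arrive 1-4 s apart with threads feeding Solr concurrently, which is what makes the per-request QTime look fine while overall throughput drops.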
Re: Performance Question
Mikhail, Thanks for the response. Just to be clear: you're saying that the size of the index does not matter; it's more the size of the results? On Fri, Mar 16, 2012 at 2:43 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Hello, Frankly speaking, the computational complexity of a Lucene search depends on the size of the search result, numFound*log(start+rows), not on the size of the index. Regards ...
Re: Performance Question
Exactly. That's what I mean. On Mon, Mar 19, 2012 at 6:15 PM, Jamie Johnson jej2...@gmail.com wrote: Mikhail, Thanks for the response. Just to be clear you're saying that the size of the index does not matter, it's more the size of the results? ... -- Sincerely yours Mikhail Khludnev Lucid Certified Apache Lucene/Solr Developer Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Performance Question
The size of the index does matter practically speaking. Bill Bell Sent from mobile On Mar 19, 2012, at 11:41 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Exactly. That's what I mean. ...
Performance Question
I'm curious if anyone can tell me how Solr/Lucene performs in a situation where you have 100,000 documents each with 100 tokens versus 1,000,000 documents each with 10 tokens. Should I expect the performance to be the same? Any information would be greatly appreciated.
Re: Performance Question
Hello, Frankly speaking, the computational complexity of a Lucene search depends on the size of the search result, numFound*log(start+rows), not on the size of the index. Regards On Fri, Mar 16, 2012 at 9:34 PM, Jamie Johnson jej2...@gmail.com wrote: I'm curious if anyone can tell me how Solr/Lucene performs in a situation where you have 100,000 documents each with 100 tokens vs having 1,000,000 documents each with 10 tokens. ... -- Sincerely yours Mikhail Khludnev Lucid Certified Apache Lucene/Solr Developer Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
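Mikhail's numFound*log(start+rows) figure comes from the way top results are collected: each matching document costs one update of a bounded priority queue of size start+rows, and documents that don't match never enter the loop at all. An illustrative sketch of that collection step (not Lucene's actual collector code):

```java
import java.util.Arrays;
import java.util.PriorityQueue;

// Illustrative sketch: collecting the top-k scores from a stream of
// numFound matching documents with a bounded min-heap. Each match costs
// O(log k); documents in the index that do not match cost nothing, which
// is why the work tracks the result size rather than the index size.
public class TopK {
    static double[] topK(double[] scores, int k) {
        PriorityQueue<Double> heap = new PriorityQueue<>(k); // min-heap, size <= k
        for (double s : scores) {                  // one O(log k) step per match
            if (heap.size() < k) heap.add(s);
            else if (s > heap.peek()) { heap.poll(); heap.add(s); }
        }
        double[] out = new double[heap.size()];
        for (int i = out.length - 1; i >= 0; i--) out[i] = heap.poll();
        return out;                                // descending by score
    }

    public static void main(String[] args) {
        double[] scores = {0.2, 0.9, 0.1, 0.7, 0.5}; // hypothetical match scores
        System.out.println(Arrays.toString(topK(scores, 3))); // [0.9, 0.7, 0.5]
    }
}
```

Bill's follow-up is also fair: in practice a larger index means more terms to scan per query and more pressure on the OS page cache, so index size is not entirely free even if the collection step itself scales with numFound.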
Re: performance question
Strictly speaking there is some insignificant distinction in performance related to how a field name is resolved ... Thanks for the follow-up. I've converted our schema to required fields only, with every other field being a dynamic field. The only negative that I've found so far is that you lose the copyField capability, so it makes my ingest a little bigger, since I have to copy the values myself manually. -- A. Steven Anderson Independent Consultant st...@asanderson.com
Re: performance question
You don't lose copyField capability with dynamic fields. You can copy dynamic fields into a fixed field name, like *_s => text, or dynamic fields into another dynamic field, like *_s => *_t. Erik On Jan 6, 2010, at 9:35 AM, A. Steven Anderson wrote: ... Thanks for the follow up. I've converted our schema to required fields only with every other field being a dynamic field. The only negative that I've found so far is that you lose the copyField capability, so it makes my ingest a little bigger, since I have to manually copy the values myself. -- A. Steven Anderson Independent Consultant st...@asanderson.com
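Erik's two patterns, written out as schema.xml fragments; the *_s/*_t names follow the stock example schema's dynamic-field conventions and are used here illustratively:

```xml
<!-- Sketch: copyField with dynamic-field sources, per Erik's examples. -->
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text"   indexed="true" stored="true"/>

<!-- dynamic source copied into a fixed destination field -->
<copyField source="*_s" dest="text"/>

<!-- dynamic source copied into a matching dynamic destination:
     my_field_s is copied to my_field_t -->
<copyField source="*_s" dest="*_t"/>
```

With the wildcard-to-wildcard form, the part of the source name matched by `*` is substituted into the destination pattern, so the copies happen at index time without any client-side duplication.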
Re: performance question
You don't lose copyField capability with dynamic fields. You can copy dynamic fields into a fixed field name, like *_s => text, or dynamic fields into another dynamic field, like *_s => *_t. Ahhh... I missed that little detail. Nice! OK, so there are no negatives to using dynamic fields then. ;-) Thanks for all the info! -- A. Steven Anderson Independent Consultant st...@asanderson.com
Re: performance question
: So, in general, there is no *significant* performance difference with using : dynamic fields. Correct? : : Correct. There's not even really an insignificant performance difference. : A dynamic field is the same as a regular field in practically every way on the : search side of things. Strictly speaking there are some insignificant distinctions in performance related to how a field name is resolved -- Grant alluded to this earlier in this thread -- but it only comes into play when you actually refer to that field by name and Solr has to look it up in the metadata. So for example if your request referred to 100 different field names in the q, fq, and facet.field params, there would be a small overhead for any of those 100 fields that existed because of dynamicField/ declarations, that would not exist for any of those fields that were declared using field/ -- but there would be no added overhead to that query if there were 999 other fields that existed in your index because of that same dynamicField/ declaration. But frankly: we're talking about seriously ridiculous pico-optimizing at this point ... if you find yourself with performance concerns, there are probably 500 other things worth worrying about before this should ever cross your mind. -Hoss
Re: performance question
On Jan 4, 2010, at 12:04 AM, A. Steven Anderson wrote: dynamic fields don't make it worse ... the number of actual field names you sort on makes it worse. If you sort on 100 fields, the cost is the same regardless of whether all 100 of those fields exist because of a single dynamicField/ declaration, or 100 distinct field/ declarations. Ahh...thanks for the clarification. So, in general, there is no *significant* performance difference with using dynamic fields. Correct? Correct. There's not even really an insignificant performance difference. A dynamic field is the same as a regular field in practically every way on the search side of things. Erik
Re: performance question
Sorting and index norms have space penalties. Sorting on a field creates an array of Java ints, one for every document in the index. Index norms (used for boosting documents and other things) create an array of bytes in the Lucene index files, one for every document in the index. If you sort on many of your dynamic fields your memory use will explode, and the same with index norms and disk space. Thanks for the info. In general, I knew sorting was expensive, but I didn't realize that dynamic fields made it worse. -- A. Steven Anderson Independent Consultant st...@asanderson.com
Re: performance question
: If you sort on many of your dynamic fields your memory use will : explode, and the same with index norms and disk space. : Thanks for the info. In general, I knew sorting was expensive, but I didn't : realize that dynamic fields made it worse. dynamic fields don't make it worse ... the number of actual field names you sort on makes it worse. If you sort on 100 fields, the cost is the same regardless of whether all 100 of those fields exist because of a single dynamicField/ declaration, or 100 distinct field/ declarations. -Hoss
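A rough back-of-the-envelope sketch of the cost Lance and Hoss describe, assuming the classic Lucene field cache layout of one Java int (4 bytes) per document per sorted field name; actual overhead varies by field type and Lucene version:

```python
def sort_cache_bytes(num_docs, num_sorted_fields, bytes_per_entry=4):
    """Estimate heap used by sort caches: one int per doc per sorted field."""
    return num_docs * num_sorted_fields * bytes_per_entry

# 2 million documents, sorting on 100 distinct field names:
# the cost is the same whether those names come from one dynamicField/
# declaration or 100 separate field/ declarations.
print(round(sort_cache_bytes(2_000_000, 100) / 2**20), "MiB")  # ~763 MiB
```

This is why the number of field names actually sorted on, not the number of dynamicField declarations, is what drives memory use.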
Re: performance question
dynamic fields don't make it worse ... the number of actual field names you sort on makes it worse. If you sort on 100 fields, the cost is the same regardless of whether all 100 of those fields exist because of a single dynamicField/ declaration, or 100 distinct field/ declarations. Ahh...thanks for the clarification. So, in general, there is no *significant* performance difference with using dynamic fields. Correct? -- A. Steven Anderson Independent Consultant st...@asanderson.com
Re: performance question
Sorting and index norms have space penalties. Sorting on a field creates an array of Java ints, one for every document in the index. Index norms (used for boosting documents and other things) create an array of bytes in the Lucene index files, one for every document in the index. If you sort on many of your dynamic fields your memory use will explode, and the same with index norms and disk space. On Wed, Dec 30, 2009 at 6:54 AM, A. Steven Anderson a.steven.ander...@gmail.com wrote: There can be an impact if you are searching against a lot of fields or if you are indexing a lot of fields on every document, but for the most part in most applications it is negligible. We index a lot of fields at one time, but we can tolerate the performance impact at index time. It probably can't hurt to be more streamlined, but without knowing more about your model, it's hard to say. I've built apps that were totally dynamic field based and they worked just fine, but these were more for discovery than just pure search. In other words, the user was interacting with the system in a reflective model that selected which fields to search on. Our application is as much about discovery as search, so this is good to know. Thanks for the feedback. It was very helpful. -- A. Steven Anderson Independent Consultant st...@asanderson.com -- Lance Norskog goks...@gmail.com
Re: performance question
On Dec 29, 2009, at 2:19 PM, A. Steven Anderson wrote: Greetings! Is there any significant negative performance impact of using a dynamicField? There can be an impact if you are searching against a lot of fields or if you are indexing a lot of fields on every document, but for the most part in most applications it is negligible. Likewise for multivalued fields? No. Multivalued fields are just concatenated together with a large position gap underneath the hood. The reason why I ask is that our system basically aggregates data from many disparate data sources (structured, unstructured, and semi-structured), and the management of the schema.xml has become unwieldy; i.e. we currently have dozens of fields which grows every time we add a new data source. I was considering redefining the domain model outside of Solr which would be used to generate the fields for the indexing process and the metadata (e.g. display names) for the search process. Thoughts? It probably can't hurt to be more streamlined, but without knowing more about your model, it's hard to say. I've built apps that were totally dynamic field based and they worked just fine, but these were more for discovery than just pure search. In other words, the user was interacting with the system in a reflective model that selected which fields to search on. -Grant -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
Re: performance question
There can be an impact if you are searching against a lot of fields or if you are indexing a lot of fields on every document, but for the most part in most applications it is negligible. We index a lot of fields at one time, but we can tolerate the performance impact at index time. It probably can't hurt to be more streamlined, but without knowing more about your model, it's hard to say. I've built apps that were totally dynamic field based and they worked just fine, but these were more for discovery than just pure search. In other words, the user was interacting with the system in a reflective model that selected which fields to search on. Our application is as much about discovery as search, so this is good to know. Thanks for the feedback. It was very helpful. -- A. Steven Anderson Independent Consultant st...@asanderson.com
performance question
Greetings! Is there any significant negative performance impact of using a dynamicField? Likewise for multivalued fields? The reason why I ask is that our system basically aggregates data from many disparate data sources (structured, unstructured, and semi-structured), and the management of the schema.xml has become unwieldy; i.e. we currently have dozens of fields which grows every time we add a new data source. I was considering redefining the domain model outside of Solr which would be used to generate the fields for the indexing process and the metadata (e.g. display names) for the search process. Thoughts? -- A. Steven Anderson Independent Consultant st...@asanderson.com
Re: Performance question: Solr 64 bit java vs 32 bit mode.
Solr runs equally well on both 64-bit and 32-bit systems. Your 15 second problem could be caused by an IO bottleneck (not likely if your index is small and fits in RAM), could be concurrency (esp. if you are using compound index format), could be something else on production killing your CPU, could be the JVM being busy sweeping the garbage out, etc. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Robert Purdy [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Thursday, November 15, 2007 4:05:00 PM Subject: Performance question: Solr 64 bit java vs 32 bit mode. Would anyone know if Solr runs better in 64-bit Java vs 32-bit, and could anyone answer another possibly related question? I currently have two servers running Solr under identical Tomcat installations. One is the production server and is under heavy user load; the other is under no load at all because it is a test box. I was looking in the logs on the production server and noticed some queries were taking about 15 seconds, and this is after auto-warming. So I decided to execute that same query on the other server with nothing in the caches and found that it only took 2 seconds to complete. My question is: why would a Dual Intel Core Duo Xserve server in 64-bit Java mode with 8GB of RAM allocated to the Tomcat server be slower than a Dual Power PC G5 server running in 32-bit mode with only 2GB of RAM allocated? Is it because of the load/concurrency issues on the production server that made the time next to the query in the log greater on the production server? If so, what is the best way to configure Tomcat to deal with that issue? Thanks Robert. -- View this message in context: http://www.nabble.com/Performance-question%3A-Solr-64-bit-java-vs-32-bit-mode.-tf4817186.html#a13781791 Sent from the Solr - User mailing list archive at Nabble.com.
Performance question: Solr 64 bit java vs 32 bit mode.
Would anyone know if Solr runs better in 64-bit Java vs 32-bit, and could anyone answer another possibly related question? I currently have two servers running Solr under identical Tomcat installations. One is the production server and is under heavy user load; the other is under no load at all because it is a test box. I was looking in the logs on the production server and noticed some queries were taking about 15 seconds, and this is after auto-warming. So I decided to execute that same query on the other server with nothing in the caches and found that it only took 2 seconds to complete. My question is: why would a Dual Intel Core Duo Xserve server in 64-bit Java mode with 8GB of RAM allocated to the Tomcat server be slower than a Dual Power PC G5 server running in 32-bit mode with only 2GB of RAM allocated? Is it because of the load/concurrency issues on the production server that made the time next to the query in the log greater on the production server? If so, what is the best way to configure Tomcat to deal with that issue? Thanks Robert. -- View this message in context: http://www.nabble.com/Performance-question%3A-Solr-64-bit-java-vs-32-bit-mode.-tf4817186.html#a13781791 Sent from the Solr - User mailing list archive at Nabble.com.
Re: Phrase Query Performance Question and score threshold
On 11/5/07, Haishan Chen [EMAIL PROTECTED] wrote: As for the first issue, the number of different phrase queries that have performance issues I found so far is about 10. If these are normal phrase queries (no slop), a good solution might be to simply index and query these phrases as a single token. One could do this with a SynonymFilter. Oh, and no, a score threshold won't help performance. I believe there will be a lot more I just haven't tried. It can be solved by using faster hardware though. Also I believe it would help if SOLR had a similar distributed search architecture to NUTCH so that it can scale out instead of scale up. It's coming... -Yonik
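A schema.xml sketch of Yonik's suggestion of indexing a known phrase as a single token via a SynonymFilter; the field type name and the phrases.txt filename are illustrative assumptions, not from the thread:

```xml
<!-- collapse known multi-word phrases into single terms at analysis time -->
<fieldType name="text_phrase" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- phrases.txt maps e.g. "auto repair => auto_repair" -->
    <filter class="solr.SynonymFilterFactory" synonyms="phrases.txt"
            ignoreCase="true" expand="false"/>
  </analyzer>
</fieldType>
```

With a mapping like `auto repair => auto_repair` applied at both index and query time, the expensive phrase query becomes a cheap single-term lookup for those known phrases.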
Re: Phrase Query Performance Question
He means extremely frequent and I agree. --wunder On 11/2/07 1:51 AM, Haishan Chen [EMAIL PROTECTED] wrote: Thanks for the advice. You certainly have a point. I believe you mean a query term that appears in 5-10% of an index in a natural language corpus is extremely INFREQUENT?
RE: Phrase Query Performance Question
From: [EMAIL PROTECTED] Subject: Re: Phrase Query Performance Question Date: Thu, 1 Nov 2007 11:25:26 -0700 To: solr-user@lucene.apache.org On 31-Oct-07, at 11:54 PM, Haishan Chen wrote: Date: Wed, 31 Oct 2007 17:54:53 -0700 Subject: Re: Phrase Query Performance Question From: [EMAIL PROTECTED] To: solr- [EMAIL PROTECTED] hurricane katrina is a very expensive query against a collection focused on Hurricane Katrina. There will be many matches in many documents. If you want to measure worst-case, this is fine. I'd try other things, like: * ninth ward * Ray Nagin * Audubon Park * Canal Street * French Quarter * FEMA mistakes * storm surge * Jackson Square Of course, real query logs are the only real test. wunder These terms are not frequent in my index. I believe they are going to be fast. The thing is that I feel 2 million documents is a small index. 100,000 or 200,000 hits is a small set and should always have sub-second query performance. Now I am only querying one field and the response is almost one second. I feel I can't achieve sub-second performance if I add a bit more complexity to the query. Many of the category terms in my index will appear in more than 5% of the documents, and those category terms are very popular search terms. So the examples I gave were not extreme cases for my index. I think that you are somewhat misguided about what constitutes a small set. A query term that appears in 5-10% of the index in a natural language corpus is _extremely_ frequent. Not quite on the order of stopwords, but getting there. As a comparison, on an extremely large corpus that I have handy, documents containing both the word 'auto' and 'repair' (not necessarily adjacent) constitute 0.1% of the index. The frequency of the phrase auto repair is 0.025%. @200k docs would be the response rate from an 800million-doc corpus. What data are you indexing, and what is the intended effect of the phrase queries you are performing?
Perhaps getting at the issue from this end would be more productive than hammering at the phrase-query performance question. Thanks for the advice. You certainly have a point. I believe you mean a query term that appears in 5-10% of an index in a natural language corpus is extremely INFREQUENT? When I start Tomcat I saw this message: The Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path Does that mean that if I use the Apache Tomcat Native library the query performance will be better? Anyone have experience with that? Unlikely, though it might help you slightly at a high query rate with high cache hit ratios. -Mike I have tried the Apache Tomcat Native library on my Windows machine and you are right: no obvious difference in query performance. I have tried the index on a Linux machine. The Windows machine: Windows 2003, one Intel(R) Xeon(TM) CPU 3.00 GHz (quad-core CPU), 4G RAM. The Linux machine: (not sure what version of Linux), two Intel(R) Xeon(R) CPU E5310 1.6 GHz (quad-core CPU), 4G RAM. Both systems have RAID5 but I don't know the difference. I found a substantial indexing performance improvement on the Linux machine. On the Windows machine it took more than 5 hours, but it took only one hour to index 2 million documents on the Linux system. I am really happy to see that. I guess both Linux and the extra CPU contributed to the improvement. Query performance is almost the same though. The CPU on the Linux machine is slower, so I think if the Linux system were using the same CPU as the Windows system, query performance would improve too. Both indexing and querying are CPU bound, if I am right. I guess I got enough on this question, but I still want to try the solr-trunk. Will update everyone later. Thanks -Haishan
Re: Phrase Query Performance Question
On 2-Nov-07, at 10:03 AM, Haishan Chen wrote: Date: Fri, 2 Nov 2007 07:32:30 -0700 Subject: Re: Phrase Query Performance Question From: [EMAIL PROTECTED] To: solr- [EMAIL PROTECTED] He means extremely frequent and I agree. --wunder Then it means a PHRASE (a combination of terms, except stopwords) appearing in 5% to 10% of an index should NOT be that frequent? I guess I get the idea. Phrases should be rarer than individual keywords. 5-10% is moderately high even for a _single_ keyword, let alone the conjunction of two keywords, let alone the _exact phrase_ of two keywords (non-stopwords in all of this discussion). As I mentioned, the 'natural' rate of 'auto'+'repair' on a corpus 100's of times bigger than yours (web documents) is .1%, and the rate of the phrase 'auto repair' is .025%. It still feels to me that you are trying to do something unique with your phrase queries. Unfortunately, you still haven't said what you are trying to do in general terms, which makes it very difficult for people to help you. -Mike
Re: Phrase Query Performance Question
: It still feels to me that you are trying to do something unique with your : phrase queries. Unfortunately, you still haven't said what you are trying to : do in general terms, which makes it very difficult for people to help you. Agreed. This seems very special case, but we don't know what the case is. If there are specific phrases you know in advance that you will care about, and those phrases occur as frequently as the individual words, then the best way to deal with them is to index each phrase as a single Term (and ignore the individual words). Speaking more generally to Mike's point... http://people.apache.org/~hossman/#xyproblem Your question appears to be an XY Problem ... that is: you are dealing with X, you are assuming Y will help you, and you are asking about Y without giving more details about the X so that we can understand the full issue. Perhaps the best solution doesn't involve Y at all? See Also: http://www.perlmonks.org/index.pl?node_id=542341 -Hoss
RE: Phrase Query Performance Question
Date: Fri, 2 Nov 2007 12:31:29 -0700 From: [EMAIL PROTECTED] To: solr-user@lucene.apache.org Subject: Re: Phrase Query Performance Question : It still feels to me that you are trying to do something unique with your : phrase queries. Unfortunately, you still haven't said what you are trying to : do in general terms, which makes it very difficult for people to help you. Agreed. This seems very special case, but we don't know what the case is. If there are specific phrases you know in advance that you will care about, and those phrases occur as frequently as the individual words, then the best way to deal with them is to index each phrase as a single Term (and ignore the individual words). Speaking more generally to Mike's point... http://people.apache.org/~hossman/#xyproblem Your question appears to be an XY Problem ... that is: you are dealing with X, you are assuming Y will help you, and you are asking about Y without giving more details about the X so that we can understand the full issue. Perhaps the best solution doesn't involve Y at all? See Also: http://www.perlmonks.org/index.pl?node_id=542341 -Hoss I think the documents I was indexing cannot be considered natural language documents. They are constructed following certain rules and then fed into the indexing process. I guess because of the rules many target search terms have high document frequency. I am under no obligation to achieve the quarter-second performance; I am just interested to see whether it is achievable. Thanks everyone for offering advice -Haishan
Re: Phrase Query Performance Question
On 31-Oct-07, at 11:54 PM, Haishan Chen wrote: Date: Wed, 31 Oct 2007 17:54:53 -0700 Subject: Re: Phrase Query Performance Question From: [EMAIL PROTECTED] To: solr- [EMAIL PROTECTED] hurricane katrina is a very expensive query against a collection focused on Hurricane Katrina. There will be many matches in many documents. If you want to measure worst-case, this is fine. I'd try other things, like: * ninth ward * Ray Nagin * Audubon Park * Canal Street * French Quarter * FEMA mistakes * storm surge * Jackson Square Of course, real query logs are the only real test. wunder These terms are not frequent in my index. I believe they are going to be fast. The thing is that I feel 2 million documents is a small index. 100,000 or 200,000 hits is a small set and should always have sub-second query performance. Now I am only querying one field and the response is almost one second. I feel I can't achieve sub-second performance if I add a bit more complexity to the query. Many of the category terms in my index will appear in more than 5% of the documents, and those category terms are very popular search terms. So the examples I gave were not extreme cases for my index. I think that you are somewhat misguided about what constitutes a small set. A query term that appears in 5-10% of the index in a natural language corpus is _extremely_ frequent. Not quite on the order of stopwords, but getting there. As a comparison, on an extremely large corpus that I have handy, documents containing both the word 'auto' and 'repair' (not necessarily adjacent) constitute 0.1% of the index. The frequency of the phrase auto repair is 0.025%. @200k docs would be the response rate from an 800million-doc corpus. What data are you indexing, and what is the intended effect of the phrase queries you are performing? Perhaps getting at the issue from this end would be more productive than hammering at the phrase-query performance question.
When I start Tomcat I saw this message: The Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path Does that mean that if I use the Apache Tomcat Native library the query performance will be better? Anyone have experience with that? Unlikely, though it might help you slightly at a high query rate with high cache hit ratios. -Mike
RE: Phrase Query Performance Question
From: [EMAIL PROTECTED] Subject: Re: Phrase Query Performance Question Date: Tue, 30 Oct 2007 11:22:17 -0700 To: solr-user@lucene.apache.org On 30-Oct-07, at 6:09 AM, Yonik Seeley wrote: On 10/30/07, Haishan Chen [EMAIL PROTECTED] wrote: Thanks a lot for replying Yonik! I am running Solr on a Windows 2003 server (standard version), Intel Xeon CPU 3.00GHz, with 4.00 GB RAM. The index is located on RAID5 with 2 million documents. Is there any way to improve query performance without moving to a more powerful computer? I understand that the query performance of the phrase query (auto repair) has to do with the number of documents containing the two words. In fact the number of documents that have auto and repair are about 10. It is like 5% of the documents containing auto and repair. It seems to me 937 ms is too slow. Chen, that does seem slow; I'm not sure why. 1) was this the first search on the index? if so, try running some other searches to warm things up first. Indeed--phrase matching uses a completely different part of the index, so that needs to be warmed too. One thing to try is solr trunk: it contains some speedups for phrase queries (though perhaps not as substantial as you hope for). -Mike Thanks for replying. The statistics I collected were not on the first query. And I believe I was running the JVM in server mode; I configured Tomcat to use the server version of jvm.dll. I guess that is the way to set it on Windows. I executed the same phrase query (auto repair) over and over again and that is the best performance I observed. Also, when I did the test I disabled all Solr caches; I want to see the performance without the Solr cache. I am currently trying to test the index on a Linux system with similar hardware. It will take me some time to set it up. I read a discussion between Doug Cutting and Andrzej Bialecki about Lucene performance.
http://mail-archives.apache.org/mod_mbox/lucene-java-user/200512.mbox/[EMAIL PROTECTED] It mentioned that http://websearch.archive.org/katrina/ (in Nutch) had 10M documents and a search of hurricane katrina was able to return in 1.35 seconds with 600,867 hits. Although the computer it was using might be more powerful than mine, I feel 937ms for a phrase query on a single field is kind of slow. Nutch actually expands a search to more complex queries. My index and the number of hits on my query (auto repair) are about one fifth of websearch.archive.org and its testing query. So I feel a reasonable performance for my query should be less than 300 ms. I am not sure if I am right on that logic. Anyway I will collect the statistics on Linux first and try out other options. Thanks a lot Haishan
Re: Phrase Query Performance Question
On 31-Oct-07, at 2:40 PM, Haishan Chen wrote: http://mail-archives.apache.org/mod_mbox/lucene-java-user/ 200512.mbox/[EMAIL PROTECTED] It mentioned that http://websearch.archive.org/katrina/ (in Nutch) had 10M documents and a search of hurricane katrina was able to return in 1.35 seconds with 600,867 hits. Although the computer it was using might be more powerful than mine, I feel 937ms for a phrase query on a single field is kind of slow. Nutch actually expands a search to more complex queries. My index and the number of hits on my query (auto repair) are about one fifth of websearch.archive.org and its testing query. So I feel a reasonable performance for my query should be less than 300 ms. I am not sure if I am right on that logic. I'm not sure that it is reasonable, but I'm not sure that it isn't. However, have you tried other queries? 937ms seems a little high, even for phrase queries. Anyway I will collect the statistics on Linux first and try out other options. Have you tried using the performance enhancements present in solr-trunk? -Mike
RE: Phrase Query Performance Question
From: [EMAIL PROTECTED] Subject: Re: Phrase Query Performance Question Date: Wed, 31 Oct 2007 15:25:42 -0700 To: solr-user@lucene.apache.org On 31-Oct-07, at 2:40 PM, Haishan Chen wrote: http://mail-archives.apache.org/mod_mbox/lucene-java-user/ 200512.mbox/[EMAIL PROTECTED] It mentioned that http://websearch.archive.org/katrina/ (in Nutch) had 10M documents and a search of hurricane katrina was able to return in 1.35 seconds with 600,867 hits. Although the computer it was using might be more powerful than mine, I feel 937ms for a phrase query on a single field is kind of slow. Nutch actually expands a search to more complex queries. My index and the number of hits on my query (auto repair) are about one fifth of websearch.archive.org and its testing query. So I feel a reasonable performance for my query should be less than 300 ms. I am not sure if I am right on that logic. I'm not sure that it is reasonable, but I'm not sure that it isn't. However, have you tried other queries? 937ms seems a little high, even for phrase queries. Anyway I will collect the statistics on Linux first and try out other options. Have you tried using the performance enhancements present in solr-trunk? -Mike Here are some query statistics. The phrase queries look slow to me. These are queries that have more than 10 hits. For those returning a couple thousand hits the response time is quite fast. But this is a query on one field only.
(auto repair) 100384 hits 946 ms
(auto repair) 100384 hits 31 ms
(car repair~100) 112183 hits 766 ms
(car repair) 112183 hits 63 ms
(business service~100) 1209751 hits 1500 ms
(business service) 1209751 hits 234 ms
(shopping center~100) 119481 hits 359 ms
(shopping center~100) 119481 hits 63 ms
I don't know what solr-trunk is yet but I will find out. Thank you Haishan
Re: Phrase Query Performance Question
hurricane katrina is a very expensive query against a collection focused on Hurricane Katrina. There will be many matches in many documents. If you want to measure worst-case, this is fine. I'd try other things, like: * ninth ward * Ray Nagin * Audubon Park * Canal Street * French Quarter * FEMA mistakes * storm surge * Jackson Square Of course, real query logs are the only real test. wunder On 10/31/07 3:25 PM, Mike Klaas [EMAIL PROTECTED] wrote: On 31-Oct-07, at 2:40 PM, Haishan Chen wrote: http://mail-archives.apache.org/mod_mbox/lucene-java-user/ 200512.mbox/[EMAIL PROTECTED] It mentioned that http://websearch.archive.org/katrina/ (in Nutch) had 10M documents and a search of hurricane katrina was able to return in 1.35 seconds with 600,867 hits. Although the computer it was using might be more powerful than mine, I feel 937ms for a phrase query on a single field is kind of slow. Nutch actually expands a search to more complex queries. My index and the number of hits on my query (auto repair) are about one fifth of websearch.archive.org and its testing query. So I feel a reasonable performance for my query should be less than 300 ms. I am not sure if I am right on that logic. I'm not sure that it is reasonable, but I'm not sure that it isn't. However, have you tried other queries? 937ms seems a little high, even for phrase queries. Anyway I will collect the statistics on Linux first and try out other options. Have you tried using the performance enhancements present in solr-trunk? -Mike
RE: Phrase Query Performance Question
: (auto repair) 100384 hits 946 ms
: (auto repair) 100384 hits 31 ms
: (car repair~100) 112183 hits 766 ms
: (car repair) 112183 hits 63 ms
: (business service~100) 1209751 hits 1500 ms
: (business service) 1209751 hits 234 ms
: (shopping center~100) 119481 hits 359 ms
: (shopping center~100) 119481 hits 63 ms
if i'm reading those numbers right, every document in your corpus containing the words auto or repair also contains the exact phrase auto repair with no slop ... this seems HIGHLY unlikely. can you show us *exactly* what the query URLs you are using look like, and show us what the request handler section of your solrconfig.xml looks like. also: where are you getting these times from? are these from the logging output solr produces, or from the client you have hitting solr? : I don't know what solr-trunk is yet but I will find out he's referring to the unreleased development code, which you can check out from the trunk of the Solr subversion repository... http://lucene.apache.org/solr/version_control.html -Hoss
RE: Phrase Query Performance Question
Date: Wed, 31 Oct 2007 19:19:07 -0700 From: [EMAIL PROTECTED] To: solr-user@lucene.apache.org Subject: RE: Phrase Query Performance Question : (auto repair) 100384 hits 946 ms
: (auto repair) 100384 hits 31 ms
: (car repair~100) 112183 hits 766 ms
: (car repair) 112183 hits 63 ms
: (business service~100) 1209751 hits 1500 ms
: (business service) 1209751 hits 234 ms
: (shopping center~100) 119481 hits 359 ms
: (shopping center~100) 119481 hits 63 ms
if i'm reading those numbers right, every document in your corpus containing the words auto or repair also contains the exact phrase auto repair with no slop ... this seems HIGHLY unlikely. can you show us *exactly* what the query URLs you are using look like, and show us what the request handler section of your solrconfig.xml looks like. Yes, that's exactly what the documents are like. The documents are categorized. I indexed the category with the content of the documents using the text field type. The URL I used is select?q=content:(auto repair~100)&fl=title. All other options like faceting and highlighting are not used. also: where are you getting these times from? are these from the logging output solr produces, or from the client you have hitting solr? : I don't know what solr-trunk is yet but I will find out he's referring to the unreleased development code, which you can check out from the trunk of the Solr subversion repository... http://lucene.apache.org/solr/version_control.html -Hoss I am getting the time from the client browser. Thanks -Haishan
RE: Phrase Query Performance Question
Date: Wed, 31 Oct 2007 17:54:53 -0700
Subject: Re: Phrase Query Performance Question
From: [EMAIL PROTECTED]
To: solr-user@lucene.apache.org

: "hurricane katrina" is a very expensive query against a collection
: focused on Hurricane Katrina. There will be many matches in many
: documents. If you want to measure worst-case, this is fine. I'd try
: other things, like:
: * ninth ward
: * Ray Nagin
: * Audubon Park
: * Canal Street
: * French Quarter
: * FEMA mistakes
: * storm surge
: * Jackson Square
: Of course, real query logs are the only real test.
: wunder

These terms are not frequent in my index. I believe they are going to be fast. The thing is that I feel 2 million documents is a small index. 100,000 or 200,000 hits is a small set and should always have sub-second query performance. Now I am only querying one field and the response is almost one second. I feel I can't achieve sub-second performance if I add a bit more complexity to the query. Many of the category terms in my index will appear in more than 5% of the documents, and those category terms are very popular search terms. So the examples I gave were not extreme cases for my index.

When I start Tomcat I see this message: "The Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path". Does that mean query performance will be better if I use the Apache Tomcat Native library? Does anyone have experience with that?

Thanks a lot
-Haishan

On 10/31/07 3:25 PM, Mike Klaas [EMAIL PROTECTED] wrote:

: On 31-Oct-07, at 2:40 PM, Haishan Chen wrote:
: : http://mail-archives.apache.org/mod_mbox/lucene-java-user/200512.mbox/[EMAIL PROTECTED]
: : It mentioned that http://websearch.archive.org/katrina/ (in Nutch) had
: : 10M documents and a search of "hurricane katrina" was able to return in
: : 1.35 seconds with 600,867 hits. Although the computer it was using might
: : be more powerful than mine, I feel 937 ms for a phrase query on a single
: : field is kind of slow. Nutch actually expands a search to more complex
: : queries. My index and the number of hits on my query ("auto repair") are
: : about one fifth of websearch.archive.org and its testing query. So I
: : feel a reasonable performance for my query should be less than 300 ms.
: : I am not sure if I am right on that logic.
:
: I'm not sure that it is reasonable, but I'm not sure that it isn't.
: However, have you tried other queries? 937 ms seems a little high, even
: for phrase queries.
:
: : Anyway I will collect the statistics on Linux first and try out other
: : options.
:
: Have you tried using the performance enhancements present in solr-trunk?
: -Mike
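An aside for anyone replaying the queries in this thread: the archive has stripped quote characters and ampersands out of the URLs as quoted. A hedged sketch of building a correctly escaped version of the sloppy phrase query (the host, port, and handler path are assumptions, matching a default Solr install):

```python
from urllib.parse import urlencode

# Hypothetical base URL for a default standalone Solr; adjust to taste.
base = "http://localhost:8983/solr/select"
params = {
    "q": 'content:("auto repair"~100)',  # phrase query with a slop of 100
    "fl": "title",                       # return only the title field
}
# urlencode percent-escapes the quotes, colon, and parentheses so the
# phrase syntax survives the trip over HTTP.
url = base + "?" + urlencode(params)
print(url)
```

Without the escaping, `content:(auto repair~100)` parses as two separate term queries (one with a fuzzy/slop suffix), not as a sloppy phrase, which silently changes both the hit counts and the timings being compared.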
RE: Phrase Query Performance Question
Thanks a lot for replying Yonik!

I am running Solr on a Windows 2003 Server (standard version), Intel Xeon CPU 3.00GHz, with 4.00 GB RAM. The index is located on RAID 5 with 2 million documents. Is there any way to improve query performance without moving to a more powerful computer?

I understand that the performance of the phrase query ("auto repair") has to do with the number of documents containing the two words. In fact the number of documents that have "auto" and "repair" is about 10. It is like 5% of the documents containing "auto" and "repair". It seems to me 937 ms is too slow. Would it be faster if I ran Solr on a Linux system? If so, how much faster would it generally be? My performance target for this kind of phrase query is a quarter of a second or so. Any advice on how to achieve this on the above hardware?

Thanks a lot
Haishan

Re: phrase query performance — Yonik Seeley, Fri, 26 Oct 2007 08:09:52 -0700

: The differences lie in Lucene. Instead of thinking of phrase queries as
: slow, think of term queries as fast :-)  Phrase queries need to read and
: consider position information that term queries do not.
: -Yonik
:
: On 10/26/07, Haishan Chen [EMAIL PROTECTED] wrote:
: : I am a new Solr user and wonder if anyone can help me with these
: : questions. I used Solr to index about two million documents and query
: : it using the standard request handler. I disabled all caches. I found
: : phrase queries were substantially slower than the usual queries. The
: : statistics I collected are as follows. I was querying on one field only.
: :
: : content:(auto repair) 47 ms, repeatable
: : content:("auto repair") 937 ms, repeatable
: : content:("auto repair"~1) 766 ms, repeatable
: :
: : What are the factors affecting phrase query performance? How come the
: : phrase query content:("auto repair") is almost 20 times slower than
: : content:(auto repair)? I also notice that a phrase query with a slop is
: : always faster than one without a slop. Is the difference I observe here
: : a performance problem of Lucene or Solr?
: : It will be appreciated if anyone can help.
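Yonik's point about position information can be made concrete: a term query only has to intersect lists of document IDs, while an exact phrase query must additionally read per-document position lists and check adjacency. A toy illustration of the extra work (this is a simplified sketch, not Lucene's actual data structures):

```python
# Toy inverted index with positions: term -> {doc_id: [positions]}.
# The terms and documents are invented for illustration.
postings = {
    "auto":   {1: [0, 7], 2: [3], 4: [5]},
    "repair": {1: [1, 9], 2: [8], 3: [0]},
}

def term_query(term):
    # A term query only needs the document IDs.
    return set(postings.get(term, {}))

def phrase_query(first, second):
    # A phrase query must also read and intersect position lists:
    # 'second' must occur exactly one position after 'first'.
    docs = set()
    for doc in term_query(first) & term_query(second):
        positions_a = postings[first][doc]
        positions_b = set(postings[second][doc])
        if any(p + 1 in positions_b for p in positions_a):
            docs.add(doc)
    return docs

print(sorted(term_query("auto") & term_query("repair")))  # [1, 2]
print(sorted(phrase_query("auto", "repair")))             # [1]
```

Document 2 contains both words but not adjacently, so the term intersection matches it while the phrase does not; the position scan is the extra cost Yonik describes.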
Re: Phrase Query Performance Question
On 10/30/07, Haishan Chen [EMAIL PROTECTED] wrote:
: Thanks a lot for replying Yonik! I am running Solr on a Windows 2003
: Server (standard version), Intel Xeon CPU 3.00GHz, with 4.00 GB RAM.
: The index is located on RAID 5 with 2 million documents. Is there any
: way to improve query performance without moving to a more powerful
: computer? It seems to me 937 ms is too slow.

Chen, that does seem slow, I'm not sure why.

1) Was this the first search on the index? If so, try running some other searches to warm things up first.
2) Was the JVM in server mode? (start with -server)
3) Shut down unrelated things on the system so that there is more memory available to the OS to cache the index files.

: Would it be faster if I ran Solr on a Linux system?

Maybe... Lucene does rely on the OS caching often-used parts of the index, so this can differ the most between Windows and Linux. If you have a Linux box lying around, trying it out quickly to remove that variable would be a good idea.

-Yonik
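Warming can also be wired into solrconfig.xml so it happens automatically whenever a searcher opens, rather than by hand. A sketch using Solr's QuerySenderListener (the queries themselves are just examples, picked from this thread):

```xml
<!-- In solrconfig.xml: fire warming queries when the first searcher opens. -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- Include a phrase query so position data is pulled into the OS cache too. -->
    <lst><str name="q">content:("auto repair")</str><str name="rows">10</str></lst>
    <lst><str name="q">content:(auto repair)</str><str name="rows">10</str></lst>
  </arr>
</listener>
```

A matching `newSearcher` listener warms searchers opened after commits as well, which matters once the index starts being updated.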
Re: Phrase Query Performance Question
On 30-Oct-07, at 6:09 AM, Yonik Seeley wrote:

: On 10/30/07, Haishan Chen [EMAIL PROTECTED] wrote:
: : Thanks a lot for replying Yonik! I am running Solr on a Windows 2003
: : Server (standard version), Intel Xeon CPU 3.00GHz, with 4.00 GB RAM.
: : The index is located on RAID 5 with 2 million documents. It seems to
: : me 937 ms is too slow.
:
: Chen, that does seem slow, I'm not sure why.
: 1) Was this the first search on the index? If so, try running some
: other searches to warm things up first.

Indeed--phrase matching uses a completely different part of the index, so that needs to be warmed too.

One thing to try is solr trunk: it contains some speedups for phrase queries (though perhaps not as substantial as you hope for).

-Mike
Phrase Query Performance Question
I am a new Solr user and wonder if anyone can help me with these questions. I used Solr to index about two million documents and query it using the standard request handler. I disabled all caches. I found phrase queries were substantially slower than the usual queries. The statistics I collected are as follows. I was querying on one field only.

content:(auto repair) 47 ms, repeatable
content:("auto repair") 937 ms, repeatable
content:("auto repair"~1) 766 ms, repeatable

What are the factors affecting phrase query performance? How come the phrase query content:("auto repair") is almost 20 times slower than content:(auto repair)? I also notice that a phrase query with a slop is always faster than one without a slop. Is the performance difference I observed here between phrase queries and regular queries a performance problem of Lucene or Solr?

I was having trouble starting a new discussion thread earlier. Hopefully I did it right this time. It will be appreciated if anyone can help.

Haishan
Re: Dynamic fields performance question
On 3/26/07, climbingrose [EMAIL PROTECTED] wrote:
: I'm developing an application that potentially creates thousands of
: dynamic fields. Does anyone know if a large number of dynamic fields
: will degrade Solr performance?

Thousands of fields won't be a problem if:
- you don't sort on most of them (sorting by a field takes up memory)
- you can omit norms on most of them

Provided the above is true, differences in searching and indexing performance shouldn't be noticeable.

-Yonik
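Yonik's second condition translates directly into schema.xml. A hedged sketch of a dynamic field with norms omitted (the field name and type are invented for illustration):

```xml
<!-- In schema.xml: a catch-all dynamic field with norms omitted.
     omitNorms="true" drops the per-field, per-document norm byte, which
     is what makes thousands of fields affordable; the trade-off is no
     length normalization or index-time boosts on these fields. -->
<dynamicField name="attr_*" type="string" indexed="true" stored="true"
              omitNorms="true"/>
```

Sorting should then be confined to a handful of known fields, since each sorted-on field builds an in-memory field cache entry.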
Re: Dynamic fields performance question
Thanks Yonik. I think both of the conditions hold true for our application ;).

On 3/27/07, Yonik Seeley [EMAIL PROTECTED] wrote:
: On 3/26/07, climbingrose [EMAIL PROTECTED] wrote:
: : I'm developing an application that potentially creates thousands of
: : dynamic fields. Does anyone know if a large number of dynamic fields
: : will degrade Solr performance?
:
: Thousands of fields won't be a problem if:
: - you don't sort on most of them (sorting by a field takes up memory)
: - you can omit norms on most of them
:
: Provided the above is true, differences in searching + indexing
: performance shouldn't be noticeable.
: -Yonik

--
Regards,
Cuong Hoang
Dynamic fields performance question
Hi all,

I'm developing an application that potentially creates thousands of dynamic fields. Does anyone know if a large number of dynamic fields will degrade Solr performance? Thanks.

--
Regards,
Cuong Hoang