Re: Partial Counts in SOLR
Anyone?

-- Regards, Salman Akram
Re: Partial Counts in SOLR
Yes, that'll be slow. Wildcards are, at best, interesting and, at worst, resource-consumptive, especially when you're using this kind of positional information as well.

Consider looking at the problem sideways. That is, what is your purpose in searching for, say, buy*? You want to find buy, buying, buyers, etc.? Would you get better results if you just stemmed and omitted the wildcards? Do you have a restricted vocabulary that would allow you to define synonyms for the important words and all their variants at index time, and use that?

Finally, of course, you could shard your index (or add more shards if you're already sharding) if you really _must_ support these kinds of queries and can't work around the problem.

Best,
Erick
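Erick's stem-and-synonym suggestion could look roughly like this in schema.xml. This is only a sketch under assumed names (the field type name and synonyms.txt are hypothetical); the filter classes are standard Solr 4.x analysis factories:

```xml
<!-- Hypothetical field type: stems at index and query time, and expands
     synonyms at index time only, so wildcards become unnecessary for
     simple morphological variants (buy, buys, buying, ...). -->
<fieldType name="text_stemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" expand="true"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, a plain search for "buy" matches the stemmed variants without any leading or trailing wildcard expansion at query time.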
Re: Partial Counts in SOLR
This was one example. Users can even add phrase searches with wildcards, proximity, etc., so we can't really use stemming. Sharding is definitely something we are already looking into.

-- Regards, Salman Akram
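The year-based split mentioned earlier in the thread is one way to shard: a query restricted to recent documents then only has to touch the shards that can contain them. A sketch of such a distributed request follows; the host and core names are hypothetical:

```python
from urllib.parse import urlencode

# Hypothetical per-year cores: a query that only needs 2013-2014 data
# lists only those shards, so older segments are never searched.
params = {
    "q": "contents:(stock OR share)",
    "sort": "date desc",
    "shards": "solr1:8983/solr/docs_2013,solr1:8983/solr/docs_2014",
}
url = "http://solr1:8983/solr/docs_2014/select?" + urlencode(params)
print(url)
```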
Re: Partial Counts in SOLR
Below is a sample slow query that takes minutes!

((stock or share*) w/10 (sale or sell* or sold or bought or buy* or purchase* or repurchase*)) w/10 (executive or director)

If a filter is used it goes into fq, but what can be done about a plain keyword search?

-- Regards, Salman Akram
Re: Partial Counts in SOLR
What are your complex queries? You say that your app will very rarely see the same query, thus you aren't using caches... But if you can move some of your clauses into fq clauses, then the filterCache might well be used to good effect.
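Moving clauses into fq, as Erick suggests, might look like this when building the request; the field names (contents, doc_type, date) are hypothetical:

```python
from urllib.parse import urlencode

# Sketch: the recurring, structured constraints go into fq so the
# filterCache can serve them; only the free-text part stays in q.
params = [
    ("q", "contents:(stock OR share)"),          # varies per user
    ("fq", "doc_type:filing"),                   # recurring -> filterCache
    ("fq", "date:[2013-01-01T00:00:00Z TO *]"),  # recurring -> filterCache
    ("sort", "date desc"),
]
query_string = urlencode(params)
print(query_string)
```

Each distinct fq value is cached as a bitset, so even if the q part is always unique, the filters themselves are reused across queries.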
Re: Partial Counts in SOLR
Well, some of the searches take minutes. Below are some stats about this particular index:

Index size = 400GB (using CommonGrams; without it the index is around 180GB)
Position file = 280GB
Total docs = 170 million (indexed only for searching; for highlighting, the contents are stored in another index)
Avg doc size = a few hundred KBs
RAM = 384GB (the machine hosts other indexes too, but the OS cache can still hold 60-80% of this index)

Phrase queries run pretty fast with CommonGrams, but complex wildcard and proximity queries can be really slow. I know CommonGrams makes them slower, but they just take too long. By default sorting is on date, but users have a few other parameters they can sort on. I wanted to avoid creating multiple indexes (maybe based on years), but it seems that is the only feasible way to search on partial data.

-- Regards, Salman Akram
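For context, the CommonGrams setup described in the stats above is an analysis-filter pair in the field type. This fragment is a sketch with a hypothetical word-list file name:

```xml
<!-- Hypothetical CommonGrams configuration: frequent words are fused with
     their neighbours into bigram tokens (e.g. "the_stock"), trading index
     size (400GB vs ~180GB here) for much faster phrase queries that
     involve common words. -->
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt"/>
</analyzer>
```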
Re: Partial Counts in SOLR
1. What is your Solr version? In the 4.x family, proximity searches have been optimized, among other query types.
2. Do you use filter queries? What are your cache utilization ratios? Optimize (i.e. bump up the respective cache sizes) if you have low hit ratios and many evictions.
3. Can you avoid storing some fields and only index them? When a field is stored and retrieved in the result, there are a couple of disk seeks per field, so the search slows down. Consider SSD disks.
4. Do you monitor your system in terms of RAM / cache stats / GC? Do you observe stop-the-world GC pauses?
5. How often do you commit? Do you have autowarming / external warming configured?
6. If you use faceting, consider enabling DocValues for the facet fields.

Some Solr wiki docs: https://wiki.apache.org/solr/SolrPerformanceProblems?highlight=%28%28SolrPerformanceFactors%29%29

-- Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
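Point 6 (DocValues for facet fields) is a one-attribute schema change; the field name here is hypothetical:

```xml
<!-- Hypothetical facet field: docValues keeps the per-field data in
     column-oriented on-disk structures instead of the in-heap fieldCache,
     which reduces heap pressure and GC pauses during faceting. -->
<field name="category" type="string" indexed="true" stored="false" docValues="true"/>
```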
Re: Partial Counts in SOLR
1- SOLR 4.6.
2- We do, but right now I am talking about plain keyword queries just sorted by date. Once this is better we will start looking into the caches, which we have already changed a little.
3- As I said, the contents are not stored in this index. Some other metadata fields are, but normal queries on them are super fast, so I guess any change there would make only a minor difference. We have SSDs, and quite fast ones too.
4- That's something we need to do, but even under a low workload those queries take a lot of time.
5- Every 10 mins, and currently no autowarming, as user queries are rarely the same; also, even once the index is fully warmed those queries are still slow.
6- No.

-- Regards, Salman Akram
Re: Partial Counts in SOLR
As Hoss pointed out above, different projects have different requirements. Some want to sort by date of ingestion in reverse, which means that having posting lists organized in reverse order with early termination is the way to go (no such feature in Solr directly). Other projects want to collect all docs matching a query and then sort by rank, but you cannot guarantee that the most recently inserted document is the most relevant in terms of your ranking. Do your current searches take too long?

-- Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
Re: Partial Counts in SOLR
It's a long video and I will definitely go through it, but it seems this is not possible with SOLR as it is? I just thought it would be quite a common issue; I mean, for search engines it is generally more important to show the first page of results quickly, rather than using timeAllowed, which might not even return a single result. Thanks!

-- Regards, Salman Akram
Re: Partial Counts in SOLR
Salman,

It looks like what you describe has been implemented at Twitter. Presentation from the recent Lucene / Solr Revolution conference in Dublin: http://www.youtube.com/watch?v=AguWva8P_DI

On Sat, Mar 8, 2014 at 4:16 PM, Salman Akram salman.ak...@northbaysolutions.net wrote:

The issue with timeAllowed is that you never know if it will return a minimum number of docs or not. I do want docs sorted by date, but it seems it is not possible for Solr to start searching from the most recent docs and stop after finding a certain number of docs... any other tweak? Thanks

On Saturday, March 8, 2014, Chris Hostetter hossman_luc...@fucit.org wrote:

: Reason: In an index with millions of documents I don't want to know that a
: certain query matched 1 million docs (of course it will take time to
: calculate that). Why not just stop looking for more results, say,
: after it finds 100 docs? Possible??

But if you care about sorting, i.e. you want the top 100 documents sorted by score or sorted by date, you still have to collect all 1 million matches in order to know what the first 100 are.

If you really don't care about sorting, you can use the timeAllowed option to tell the searching method to do the best job it can in an (approximate) limited amount of time, and then pretend that the docs collected so far represent the total number of matches...

https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-ThetimeAllowedParameter

-Hoss
http://www.lucidworks.com/

-- Regards, Salman Akram
Project Manager - Intelligize
NorthBay Solutions
410-G4 Johar Town, Lahore
Off: +92-42-35290152
Cell: +92-302-8495621

-- Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
Re: Partial Counts in SOLR
The issue with timeAllowed is that you never know if it will return a minimum number of docs or not. I do want docs sorted by date, but it seems it is not possible for Solr to start searching from the most recent docs and stop after finding a certain number of docs... any other tweak? Thanks

On Saturday, March 8, 2014, Chris Hostetter hossman_luc...@fucit.org wrote:

: Reason: In an index with millions of documents I don't want to know that a
: certain query matched 1 million docs (of course it will take time to
: calculate that). Why not just stop looking for more results, say,
: after it finds 100 docs? Possible??

But if you care about sorting, i.e. you want the top 100 documents sorted by score or sorted by date, you still have to collect all 1 million matches in order to know what the first 100 are.

If you really don't care about sorting, you can use the timeAllowed option to tell the searching method to do the best job it can in an (approximate) limited amount of time, and then pretend that the docs collected so far represent the total number of matches...

https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-ThetimeAllowedParameter

-Hoss
http://www.lucidworks.com/

-- Regards, Salman Akram
Re: Partial Counts in SOLR
On 7 March 2014 15:18, Salman Akram salman.ak...@northbaysolutions.net wrote:

All,

Is it possible to get partial counts in SOLR? The idea is to get the count, but if it is above a certain limit then just return that limit.

Reason: In an index with millions of documents I don't want to know that a certain query matched 1 million docs (of course it will take time to calculate that). Why not just stop looking for more results, say, after it finds 100 docs? Possible??

e.g. something similar to what we can do in MySQL:

SELECT COUNT(*) FROM ((SELECT * FROM table WHERE 1 = 1) LIMIT 100) Alias

The response to the /select Solr URL has a numFound attribute that is the number of matches.

Regards,
Gora
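[Editor's note: the bounded-count trick from the MySQL example in the original question can be demonstrated with Python's built-in sqlite3; the table schema and data below are invented for the sketch.]

```python
import sqlite3

# In-memory table with far more rows than the cap.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
con.executemany("INSERT INTO docs (body) VALUES (?)",
                [("doc %d" % i,) for i in range(1000)])

# Bounded count: the inner LIMIT stops the scan at 100 rows, so the
# outer COUNT(*) can never exceed the cap however many rows match.
(capped,) = con.execute(
    "SELECT COUNT(*) FROM (SELECT id FROM docs LIMIT 100)"
).fetchone()
print(capped)  # 100
```

This is the behaviour Salman is asking Solr to reproduce: the database stops scanning once the cap is reached instead of counting every matching row.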
Re: Partial Counts in SOLR
You limit the number of results by using the rows parameter. Your query, however, may match more documents (reported in numFound of the response) than what is returned to you, as rows prescribes.

On Fri, Mar 7, 2014 at 11:48 AM, Salman Akram salman.ak...@northbaysolutions.net wrote:

All,

Is it possible to get partial counts in SOLR? The idea is to get the count, but if it is above a certain limit then just return that limit.

Reason: In an index with millions of documents I don't want to know that a certain query matched 1 million docs (of course it will take time to calculate that). Why not just stop looking for more results, say, after it finds 100 docs? Possible??

e.g. something similar to what we can do in MySQL:

SELECT COUNT(*) FROM ((SELECT * FROM table WHERE 1 = 1) LIMIT 100) Alias

-- Regards, Salman Akram

-- Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
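[Editor's note: a minimal sketch of the request Dmitry describes. q, sort, rows, and wt are standard Solr query parameters; the field names and query text are made up, and no live server is contacted here, only the query string is built.]

```python
from urllib.parse import urlencode

# Build the query string for a /select request: only 100 docs come
# back in the response, but numFound still reports the total match count.
params = {
    "q": "body:(stock AND sale)",   # example query, not from a real schema
    "sort": "date desc",            # assumes a 'date' field exists
    "rows": 100,                    # cap on returned docs, not on counting
    "wt": "json",
}
qs = urlencode(params)
print(qs)
```

The point of the thread is precisely that rows bounds only the returned page; the count behind numFound is still computed over every match.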
Re: Partial Counts in SOLR
I know about numFound. That's where the issue is. On a complex query that takes minutes, I think a major chunk of that time is spent calculating numFound, which I don't need. Let's say I just need the first 100 docs and then want SOLR to STOP looking further to populate numFound; or let's say I just don't want SOLR to return numFound at all. Is that possible? And would it really help performance? In MySQL you can simply stop it from counting past a certain limit, which gives a considerable improvement for complex queries, but that's not an inverted index, so I am not sure how it works in SOLR...

On Fri, Mar 7, 2014 at 3:17 PM, Gora Mohanty g...@mimirtech.com wrote:

On 7 March 2014 15:18, Salman Akram salman.ak...@northbaysolutions.net wrote:

All,

Is it possible to get partial counts in SOLR? The idea is to get the count, but if it is above a certain limit then just return that limit.

Reason: In an index with millions of documents I don't want to know that a certain query matched 1 million docs (of course it will take time to calculate that). Why not just stop looking for more results, say, after it finds 100 docs? Possible??

e.g. something similar to what we can do in MySQL:

SELECT COUNT(*) FROM ((SELECT * FROM table WHERE 1 = 1) LIMIT 100) Alias

The response to the /select Solr URL has a numFound attribute that is the number of matches.

Regards,
Gora

-- Regards, Salman Akram
Re: Partial Counts in SOLR
: Reason: In an index with millions of documents I don't want to know that a
: certain query matched 1 million docs (of course it will take time to
: calculate that). Why not just stop looking for more results, say,
: after it finds 100 docs? Possible??

But if you care about sorting, i.e. you want the top 100 documents sorted by score or sorted by date, you still have to collect all 1 million matches in order to know what the first 100 are.

If you really don't care about sorting, you can use the timeAllowed option to tell the searching method to do the best job it can in an (approximate) limited amount of time, and then pretend that the docs collected so far represent the total number of matches...

https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-ThetimeAllowedParameter

-Hoss
http://www.lucidworks.com/
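[Editor's note: the timeAllowed behaviour Hoss describes can be sketched as a time-budgeted collector. This is a deliberate simplification of what Solr actually does internally; all names below are invented for the sketch.]

```python
import time

def collect_with_budget(candidate_docs, matches, time_allowed_s):
    """Collect matching docs until the time budget runs out.
    Returns (hits, partial): partial=True means the scan was cut short,
    so len(hits) is only a lower bound on the true match count."""
    deadline = time.monotonic() + time_allowed_s
    hits = []
    for doc in candidate_docs:
        if time.monotonic() > deadline:
            return hits, True       # budget exhausted: partial result
        if matches(doc):
            hits.append(doc)
    return hits, False              # scanned everything: exact count

docs = range(10_000)
hits, partial = collect_with_budget(docs, lambda d: d % 7 == 0, 5.0)
print(len(hits), partial)
```

With a generous budget the scan completes and the count is exact; with a tight one you get back whatever was collected before the deadline, which is exactly why, as Salman notes upthread, timeAllowed gives no guarantee of a minimum number of docs.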