Re: Partial Counts in SOLR

2014-03-19 Thread Salman Akram
Anyone?



Re: Partial Counts in SOLR

2014-03-19 Thread Erick Erickson
Yes, that'll be slow. Wildcards are, at best, interesting and at worst
resource-intensive, especially when you're using this kind of
positional information as well.

Consider looking at the problem sideways. That is, what is your
purpose in searching for, say, buy*? Do you want to find buy, buying,
buyers, etc.? Would you get better results if you just stemmed and
omitted the wildcards?
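
To make the stemming-versus-wildcard point concrete, here is a toy sketch in Python. It is not how Lucene expands wildcards internally; the vocabulary and the `toy_stem` function are made up for illustration, and a real stemmer is far more careful. It only shows that a trailing wildcard has to enumerate every matching term at query time, while stemming collapses the variants to one term at index time.

```python
from bisect import bisect_left


def expand_prefix(sorted_terms, prefix):
    """Return every term starting with `prefix`, i.e. what a trailing
    wildcard such as buy* has to enumerate at query time."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


def toy_stem(term):
    """Hypothetical, deliberately naive suffix stripper (illustration only)."""
    for suffix in ("ing", "ers", "er", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


# Made-up vocabulary standing in for the indexed terms of one field.
vocabulary = sorted(
    ["buy", "buyer", "buyers", "buying", "buys", "buzz", "sell", "seller", "selling"]
)

wildcard_terms = expand_prefix(vocabulary, "buy")
stemmed_terms = sorted({toy_stem(term) for term in wildcard_terms})

print(wildcard_terms)  # the terms the wildcard query has to visit
print(stemmed_terms)   # the single term a stemmed index would hold instead
```

Note that irregular forms such as "bought" would not be handled by a suffix stripper like this one; they would still need a synonym entry or a smarter stemmer.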

Do you have a restricted vocabulary that would allow you to define
synonyms for the important words and all their variants at index
time and use that?

Finally, of course, you could shard your index (or add more shards if
you're already sharding) if you really _must_ support these kinds of
queries and can't work around the problem.

Best,
Erick


Re: Partial Counts in SOLR

2014-03-19 Thread Salman Akram
This was one example. Users can even add phrase searches with
wildcards/proximity etc., so we can't really use stemming.

Sharding is definitely something we are already looking into.



Re: Partial Counts in SOLR

2014-03-17 Thread Salman Akram
Below is one of the sample slow queries; it takes minutes!

((stock or share*) w/10 (sale or sell* or sold or bought or buy* or
purchase* or repurchase*)) w/10 (executive or director)

If a filter is used it comes in fq but what can be done about plain keyword
search?
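
A rough way to get a feel for why this query is heavy is to count how many term combinations the three groups can produce once the wildcards are expanded. The sketch below is only back-of-the-envelope arithmetic in Python: the number of terms each wildcard expands to (50 here) is an assumption, and the engine does not literally enumerate combinations, but the work does grow with the number of expanded terms and their positions.

```python
from math import prod

# The three proximity groups from the query above.
groups = [
    ["stock", "share*"],
    ["sale", "sell*", "sold", "bought", "buy*", "purchase*", "repurchase*"],
    ["executive", "director"],
]


def group_size(terms, expansions_per_wildcard):
    """Number of concrete terms in a group after wildcard expansion."""
    return sum(expansions_per_wildcard if t.endswith("*") else 1 for t in terms)


def combinations(expansions_per_wildcard):
    """Product of the group sizes: one term picked from each group."""
    return prod(group_size(g, expansions_per_wildcard) for g in groups)


print(combinations(1))   # wildcards counted as single terms: 2 * 7 * 2
print(combinations(50))  # assuming each wildcard expands to 50 terms
```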



-- 
Regards,

Salman Akram


Re: Partial Counts in SOLR

2014-03-15 Thread Erick Erickson
What are your complex queries? You
say that your app will very rarely see the
same query thus you aren't using caches...
But, if you can move some of your
clauses to fq clauses, then the filterCache
might well be used to good effect.
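
As a minimal sketch of what moving clauses to fq looks like on the request side, the Python below builds a query string with the keyword part in q and each restriction in its own fq parameter. The field names (`content`, `doc_type`, `doc_date`) and the filter values are hypothetical; only the `q`, `fq`, `sort` and `rows` parameters themselves are standard Solr request parameters.

```python
from urllib.parse import urlencode, parse_qs


def build_select_params(keyword_query, filters, rows=10):
    """Build a Solr select query string: keyword part in q,
    every restriction in a separate fq parameter."""
    params = [
        ("q", keyword_query),
        ("sort", "doc_date desc"),  # hypothetical date field
        ("rows", str(rows)),
    ]
    # One fq per clause, so each filter can be cached and reused on its own.
    params.extend(("fq", clause) for clause in filters)
    return urlencode(params)


query_string = build_select_params(
    'content:"stock sale"~10',  # hypothetical field name
    ["doc_type:filing", "doc_date:[2013-01-01T00:00:00Z TO *]"],
)
print(query_string)
```

Keeping each restriction in a separate fq means a filter that recurs across otherwise different user queries can still be served from the filterCache, even when the q part is never repeated.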




Re: Partial Counts in SOLR

2014-03-13 Thread Salman Akram
Well, some of the searches take minutes.

Below are some stats about this particular index that I am talking about:

Index size = 400GB (Using CommonGrams so without that the index is around
180GB)
Position File = 280GB
Total Docs = 170 million (just indexed for searching - for highlighting
contents are stored in another index)
Avg Doc Size = Few hundred KBs
RAM = 384GB (it has other indexes too but still OS cache can have 60-80% of
the total index cached)
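
For reference, the arithmetic behind these figures can be written out as below. This is only a sketch in Python; it assumes decimal gigabytes and takes the numbers above at face value.

```python
GB = 10**9  # assumption: decimal gigabytes

index_gb = 400
positions_gb = 280
total_docs = 170_000_000

# 60-80% of the index held in the OS cache.
cached_low_gb = index_gb * 60 // 100
cached_high_gb = index_gb * 80 // 100

# Share of the index taken by the position file, and index bytes per document.
positions_share = positions_gb / index_gb
index_bytes_per_doc = index_gb * GB / total_docs

print(cached_low_gb, cached_high_gb)
print(round(positions_share, 2))
print(round(index_bytes_per_doc))
```

So somewhere between 80GB and 160GB of the index may not be in the OS cache at a given time, and the position file, which proximity queries depend on, makes up most of the index.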

Phrase queries run pretty fast with CG, but complex versions of wildcard and
proximity queries can be really slow. I know using CG will make them slow,
but they just take too long. By default sorting is on date, but users have
a few other parameters too on which they can sort.

I wanted to avoid creating multiple indexes (maybe based on years), but it
seems that to search on partial data that's the only feasible way.
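
For comparison, the timeAllowed route mentioned earlier in this thread can be sketched as below (Python). `timeAllowed` is a standard Solr request parameter given in milliseconds and, as far as I know, Solr marks a timed-out response with `partialResults` in the response header. The field names and the sample response are hand-written assumptions, not real output.

```python
def with_time_limit(params, limit_ms):
    """Return a copy of the request parameters with timeAllowed set."""
    limited = dict(params)
    limited["timeAllowed"] = str(limit_ms)
    return limited


def summarize(response):
    """Report whether a parsed JSON response was cut short and how many
    documents it actually carries."""
    header = response.get("responseHeader", {})
    docs = response.get("response", {}).get("docs", [])
    return {
        "partial": bool(header.get("partialResults", False)),
        "returned": len(docs),
    }


params = with_time_limit(
    {"q": "content:stock", "sort": "doc_date desc", "rows": "10"},  # hypothetical fields
    5000,
)

# Hand-written sample of a timed-out response, for illustration only.
sample_response = {
    "responseHeader": {"status": 0, "QTime": 5003, "partialResults": True},
    "response": {"numFound": 0, "start": 0, "docs": []},
}

print(params["timeAllowed"], summarize(sample_response))
```

The sample shows the drawback: the search stops on time but may have nothing to show for the first page.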




On Wed, Mar 12, 2014 at 2:47 PM, Dmitry Kan solrexp...@gmail.com wrote:

 As Hoss pointed out above, different projects have different requirements.
 Some want to sort by date of ingestion reverse, which means that having
 posting lists organized in a reverse order with the early termination is
 the way to go (no such feature in Solr directly). Some other projects want
 to collect all docs matching a query, and then sort by rank, but you cannot
 guarantee, that the most recently inserted document is the most relevant in
 terms of your ranking.


 Do your current searches take too long?


 On Tue, Mar 11, 2014 at 11:51 AM, Salman Akram 
 salman.ak...@northbaysolutions.net wrote:

  It's a long video and I will definitely go through it, but it seems this is
  not possible with SOLR as it is?
 
  I just thought it would be quite a common issue; I mean, generally for
  search engines it's more important to show the first page of results, rather
  than using timeAllowed, which might not even return a single result.
 
  Thanks!
 
 
  --
  Regards,
 
  Salman Akram
 



 --
 Dmitry
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan




-- 
Regards,

Salman Akram


Re: Partial Counts in SOLR

2014-03-13 Thread Dmitry Kan
1. What is your Solr version? In the 4.x family, proximity searches have
been optimized, among other query types.
2. Do you use filter queries? What is the situation with the cache
utilization ratios? Optimize (i.e. bump up the respective cache sizes) if
you have low hit ratios and many evictions.
3. Can you avoid storing some fields and only index them? When a field is
stored and retrieved in the result, there are a couple of disk seeks
per field => search slows down. Consider SSD disks.
4. Do you monitor your system in terms of RAM / cache stats / GC? Do you
observe STW GC pauses?
5. How often do you commit, and do you have autowarming / external warming
configured?
6. If you use faceting, consider storing DocValues for facet fields.

some solr wiki docs:
https://wiki.apache.org/solr/SolrPerformanceProblems?highlight=%28%28SolrPerformanceFactors%29%29
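
The check in point 2 can be sketched roughly as below (Python). The stat names follow what Solr reports for its caches (lookups, hits, inserts, evictions), but the thresholds and the sample numbers are arbitrary assumptions; pick values that fit your own workload.

```python
def cache_needs_attention(stats, min_hit_ratio=0.5, max_eviction_share=0.1):
    """Flag a cache with a low hit ratio AND a high share of evicted entries.
    Both thresholds are illustrative assumptions, not recommended values."""
    lookups = stats["lookups"]
    if lookups == 0:
        return False  # unused cache, nothing to judge
    hit_ratio = stats["hits"] / lookups
    inserts = stats["inserts"]
    eviction_share = stats["evictions"] / inserts if inserts else 0.0
    return hit_ratio < min_hit_ratio and eviction_share > max_eviction_share


# Made-up numbers for two caches.
struggling = {"lookups": 1000, "hits": 120, "inserts": 880, "evictions": 700}
healthy = {"lookups": 1000, "hits": 900, "inserts": 100, "evictions": 0}

print(cache_needs_attention(struggling))
print(cache_needs_attention(healthy))
```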






-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan


Re: Partial Counts in SOLR

2014-03-13 Thread Salman Akram
1- SOLR 4.6
2- We do, but right now I am talking about plain keyword queries just sorted
by date. Once this is better we will start looking into caches, which we
have already changed a little.
3- As I said, the contents are not stored in this index. Some other metadata
fields are, but normal queries are super fast, so I guess even if I
change that it will make a minor difference. We have SSDs, and quite fast ones too.
4- That's something we need to do, but even under low workload those queries
take a lot of time.
5- Every 10 mins, and currently no autowarming, as user queries are rarely
the same, and even once it's fully warmed those queries are still slow.
6- Nope.

On Thu, Mar 13, 2014 at 5:38 PM, Dmitry Kan solrexp...@gmail.com wrote:

 1. What is your solr version? In 4.x family the proximity searches have
 been optimized among other query types.
 2. Do you use the filter queries? What is the situation with the cache
 utilization ratios? Optimize (= i.e. bump up the respective cache sizes) if
 you have low hitratios and many evictions.
 3. Can you avoid storing some fields and only index them? When the field is
 stored and it is retrieved in the result, there are couple of disk seeks
 per field= search slows down. Consider SSD disks.
 4. Do you monitor your system in terms of RAM / cache stats / GC? Do you
 observe STW GC pauses?
 5. How often do you commit  do you have the autowarming / external warming
 configured?
 6. If you use faceting, consider storing DocValues for facet fields.

 some solr wiki docs:

 https://wiki.apache.org/solr/SolrPerformanceProblems?highlight=%28%28SolrPerformanceFactors%29%29





 On Thu, Mar 13, 2014 at 8:52 AM, Salman Akram 
 salman.ak...@northbaysolutions.net wrote:

  Well some of the searches take minutes.
 
  Below are some stats about this particular index that I am talking about:
 
  Index size = 400GB (Using CommonGrams so without that the index is around
  180GB)
  Position File = 280GB
  Total Docs = 170 million (just indexed for searching - for highlighting
  contents are stored in another index)
  Avg Doc Size = Few hundred KBs
  RAM = 384GB (it has other indexes too but still OS cache can have 60-80%
 of
  the total index cached)
 
  Phrase queries run pretty fast with CG but complex versions of wildcard and
  proximity queries can be really slow. I know using CG will make them slow
  but they just take too long. By default sorting is on date but users have
  few other parameters too on which they can sort.
 
  I wanted to avoid creating multiple indexes (maybe based on years) but
  seems that to search on partial data that's the only feasible way.
 
 
 
 
  On Wed, Mar 12, 2014 at 2:47 PM, Dmitry Kan solrexp...@gmail.com wrote:
 
   As Hoss pointed out above, different projects have different requirements.
   Some want to sort by date of ingestion reverse, which means that having
   posting lists organized in a reverse order with the early termination is
   the way to go (no such feature in Solr directly). Some other projects want
   to collect all docs matching a query, and then sort by rank, but you cannot
   guarantee, that the most recently inserted document is the most relevant in
   terms of your ranking.
  
  
   Do your current searches take too long?
  
  
   On Tue, Mar 11, 2014 at 11:51 AM, Salman Akram 
   salman.ak...@northbaysolutions.net wrote:
  
    It's a long video and I will definitely go through it but it seems this is
    not possible with SOLR as it is?

    I just thought it would be quite a common issue; I mean generally for
    search engines it's more important to show the first page results, rather
    than using timeAllowed which might not even return a single result.

    Thanks!


    --
    Regards,

    Salman Akram
   
  
  
  
   --
   Dmitry
   Blog: http://dmitrykan.blogspot.com
   Twitter: http://twitter.com/dmitrykan
  
 
 
 
  --
  Regards,
 
  Salman Akram
 



 --
 Dmitry
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan




-- 
Regards,

Salman Akram


Re: Partial Counts in SOLR

2014-03-12 Thread Dmitry Kan
As Hoss pointed out above, different projects have different requirements.
Some want to sort by reverse date of ingestion, which means that having
posting lists organized in reverse order, with early termination, is the
way to go (there is no such feature in Solr directly). Other projects want
to collect all docs matching a query and then sort by rank, but you cannot
guarantee that the most recently inserted document is the most relevant in
terms of your ranking.
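
To make the early-termination idea concrete, here is a minimal sketch in
plain Python. It is not Solr or Lucene code: the document tuples, the
newest_matches helper and the sample data are all made up for illustration.
It assumes the documents can be walked newest first, in which case the scan
can stop as soon as k matches have been collected.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical document shape: (ingestion timestamp, text).
Doc = Tuple[int, str]


def newest_matches(docs_newest_first: Iterable[Doc],
                   matches: Callable[[Doc], bool],
                   k: int) -> Tuple[List[Doc], int]:
    """Return the first k matching docs and how many docs were examined.

    Only valid when the input is already ordered newest first; otherwise
    the first k matches are not the k newest ones.
    """
    hits: List[Doc] = []
    examined = 0
    for doc in docs_newest_first:
        if len(hits) >= k:
            break  # early termination: enough results collected
        examined += 1
        if matches(doc):
            hits.append(doc)
    return hits, examined


# Made-up data: 1000 docs, newest first, every second one matches.
docs = [(ts, "stock sale" if ts % 2 == 0 else "other")
        for ts in range(1000, 0, -1)]
hits, examined = newest_matches(docs, lambda d: "stock" in d[1], k=5)
print(len(hits), "hits after examining", examined, "of", len(docs), "docs")
```

The point of the sketch is only that the amount of work depends on k and on
how dense the matches are, not on the total number of matching docs.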


Do your current searches take too long?


On Tue, Mar 11, 2014 at 11:51 AM, Salman Akram 
salman.ak...@northbaysolutions.net wrote:

 It's a long video and I will definitely go through it but it seems this is
 not possible with SOLR as it is?

 I just thought it would be quite a common issue; I mean generally for
 search engines it's more important to show the first page results, rather
 than using timeAllowed which might not even return a single result.

 Thanks!


 --
 Regards,

 Salman Akram




-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan


Re: Partial Counts in SOLR

2014-03-11 Thread Salman Akram
It's a long video and I will definitely go through it, but it seems this is
not possible with SOLR as it is?

I just thought it would be quite a common issue; I mean generally for
search engines it's more important to show the first page of results, rather
than using timeAllowed, which might not even return a single result.

Thanks!


-- 
Regards,

Salman Akram


Re: Partial Counts in SOLR

2014-03-10 Thread Dmitry Kan
Salman,

It looks like what you describe has been implemented at Twitter.

Presentation from the recent Lucene / Solr Revolution conference in Dublin:
http://www.youtube.com/watch?v=AguWva8P_DI


On Sat, Mar 8, 2014 at 4:16 PM, Salman Akram 
salman.ak...@northbaysolutions.net wrote:

 The issue with timeAllowed is that you never know whether it will return a
 minimum number of docs or not.

 I do want docs to be sorted by date, but it seems it's not possible for
 Solr to start searching from the most recent docs and stop after finding a
 certain number of docs... any other tweak?

 Thanks


 On Saturday, March 8, 2014, Chris Hostetter hossman_luc...@fucit.org
 wrote:

 
  : Reason: In an index with millions of documents I don't want to know that a
  : certain query matched 1 million docs (of course it will take time to
  : calculate that). Why not just stop looking for more results, let's say
  : after it finds 100 docs? Possible??
 
  but if you care about sorting, ie: you want the top 100 documents sorted
  by score, or sorted by date, you still have to collect all 1 million
  matches in order to know what the first 100 are.
 
  if you really don't care about sorting, you can use the timeAllowed
  option to tell the searching method to do the best job it can in an
  (approximated) limited amount of time, and then pretend that the docs
  collected so far represent the total number of matches...
 
 
 
 https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-ThetimeAllowedParameter
 
 
  -Hoss
  http://www.lucidworks.com/
 


 --
 Regards,

 Salman Akram
 Project Manager - Intelligize
 NorthBay Solutions
 410-G4 Johar Town, Lahore
 Off: +92-42-35290152

 Cell: +92-302-8495621




-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan


Re: Partial Counts in SOLR

2014-03-08 Thread Salman Akram
The issue with timeAllowed is that you never know whether it will return a
minimum number of docs or not.

I do want docs to be sorted by date, but it seems it's not possible for
Solr to start searching from the most recent docs and stop after finding a
certain number of docs... any other tweak?

Thanks


On Saturday, March 8, 2014, Chris Hostetter hossman_luc...@fucit.org
wrote:


 : Reason: In an index with millions of documents I don't want to know that a
 : certain query matched 1 million docs (of course it will take time to
 : calculate that). Why not just stop looking for more results, let's say
 : after it finds 100 docs? Possible??

 but if you care about sorting, ie: you want the top 100 documents sorted
 by score, or sorted by date, you still have to collect all 1 million
 matches in order to know what the first 100 are.

 if you really don't care about sorting, you can use the timeAllowed
 option to tell the searching method to do the best job it can in an
 (approximated) limited amount of time, and then pretend that the docs
 collected so far represent the total number of matches...


 https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-ThetimeAllowedParameter


 -Hoss
 http://www.lucidworks.com/



-- 
Regards,

Salman Akram
Project Manager - Intelligize
NorthBay Solutions
410-G4 Johar Town, Lahore
Off: +92-42-35290152

Cell: +92-302-8495621


Re: Partial Counts in SOLR

2014-03-07 Thread Gora Mohanty
On 7 March 2014 15:18, Salman Akram salman.ak...@northbaysolutions.net wrote:
 All,

 Is it possible to get partial counts in SOLR? The idea is to get the count,
 but if it's above a certain limit then just return that limit.

 Reason: In an index with millions of documents I don't want to know that a
 certain query matched 1 million docs (of course it will take time to
 calculate that). Why not just stop looking for more results, let's say
 after it finds 100 docs? Possible??

 e.g. Something similar that we can do in MySQL:

 SELECT COUNT(*) FROM ( (SELECT * FROM table where 1 = 1) LIMIT 100) Alias

The response to the /select Solr URL has a numFound attribute that
is the number of matches.

Regards,
Gora


Re: Partial Counts in SOLR

2014-03-07 Thread Dmitry Kan
You limit the number of results by using the rows parameter. Your query,
however, may hit more documents (reported in numFound of the response) than
will be returned to you, as rows prescribes.
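
As a small illustration of the difference between rows and numFound, the
sketch below parses a JSON response of the shape /select returns with
wt=json. The sample payload, its numbers and the summarize helper are
invented for the example; only the response/numFound/docs field names come
from Solr.

```python
import json

# Invented sample of a /select?wt=json response for a request with rows=3.
sample_response = json.loads("""
{
  "responseHeader": {"status": 0, "QTime": 12},
  "response": {
    "numFound": 1000000,
    "start": 0,
    "docs": [{"id": "1"}, {"id": "2"}, {"id": "3"}]
  }
}
""")


def summarize(resp: dict) -> tuple:
    """Return (total matches, number of docs actually returned)."""
    body = resp["response"]
    return body["numFound"], len(body["docs"])


total, returned = summarize(sample_response)
print(f"query matched {total} docs, {returned} returned")
```

So rows only caps what is sent back; numFound still reflects every match.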


On Fri, Mar 7, 2014 at 11:48 AM, Salman Akram 
salman.ak...@northbaysolutions.net wrote:

 All,

 Is it possible to get partial counts in SOLR? The idea is to get the count,
 but if it's above a certain limit then just return that limit.

 Reason: In an index with millions of documents I don't want to know that a
 certain query matched 1 million docs (of course it will take time to
 calculate that). Why not just stop looking for more results, let's say
 after it finds 100 docs? Possible??

 e.g. Something similar that we can do in MySQL:

 SELECT COUNT(*) FROM ( (SELECT * FROM table where 1 = 1) LIMIT 100) Alias


 --
 Regards,

 Salman Akram




-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan


Re: Partial Counts in SOLR

2014-03-07 Thread Salman Akram
I know about numFound. That's where the issue is.

On a complex query that takes minutes, I think a major chunk of that time
is spent calculating numFound, which I don't need. Let's say I just need
the first 100 docs and then want SOLR to STOP looking further to populate
numFound.

Let's say I just don't want SOLR to return numFound to me. Is that possible?
Also, would it really help performance?

In MySQL you can simply stop it from looking beyond a certain count when
computing the total, and that gives a considerable improvement for complex
queries, but that's not an inverted index so I'm not sure how it works in
SOLR...
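
The kind of capped count meant here can be sketched in plain Python. This
only illustrates the idea and is not something Solr exposes as far as this
thread establishes; CountingSource and capped_count are hypothetical names,
and the "matches" are just integers.

```python
from itertools import islice


class CountingSource:
    """Hypothetical stream of matching doc ids that records how many
    ids were actually produced."""

    def __init__(self, total: int) -> None:
        self.total = total
        self.consumed = 0

    def __iter__(self):
        for doc_id in range(self.total):
            self.consumed += 1
            yield doc_id


def capped_count(matches, limit: int) -> int:
    """Count matches, but stop pulling from the source after `limit`."""
    return sum(1 for _ in islice(matches, limit))


source = CountingSource(1_000_000)
count = capped_count(source, 100)
print(count, "counted;", source.consumed, "matches actually visited")
```

Note that this only helps when the order of the first 100 matches does not
matter; with a sort in play, every match still has to be visited.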


On Fri, Mar 7, 2014 at 3:17 PM, Gora Mohanty g...@mimirtech.com wrote:

 On 7 March 2014 15:18, Salman Akram salman.ak...@northbaysolutions.net
 wrote:
  All,
 
  Is it possible to get partial counts in SOLR? The idea is to get the count,
  but if it's above a certain limit then just return that limit.

  Reason: In an index with millions of documents I don't want to know that a
  certain query matched 1 million docs (of course it will take time to
  calculate that). Why not just stop looking for more results, let's say
  after it finds 100 docs? Possible??
 
  e.g. Something similar that we can do in MySQL:
 
  SELECT COUNT(*) FROM ( (SELECT * FROM table where 1 = 1) LIMIT 100) Alias

 The response to the /select Solr URL has a numFound attribute that
 is the number
 of matches.

 Regards,
 Gora




-- 
Regards,

Salman Akram


Re: Partial Counts in SOLR

2014-03-07 Thread Chris Hostetter

: Reason: In an index with millions of documents I don't want to know that a
: certain query matched 1 million docs (of course it will take time to
: calculate that). Why not just stop looking for more results, let's say
: after it finds 100 docs? Possible??

but if you care about sorting, ie: you want the top 100 documents sorted 
by score, or sorted by date, you still have to collect all 1 million 
matches in order to know what the first 100 are.
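
A tiny Python sketch of why that is, using made-up (doc id, date) pairs
listed in index order, which is assumed here to be unrelated to date. The
top_k_newest helper is hypothetical and stands in for the collector:

```python
import heapq

# Made-up matches in index order; the order says nothing about the date.
matches = [
    (1, "2011-05-02"),
    (2, "2014-03-07"),
    (3, "2009-11-30"),
    (4, "2013-08-15"),
    (5, "2014-03-19"),
]


def top_k_newest(candidates, k):
    """Keep the k newest candidates; every candidate must be examined."""
    return heapq.nlargest(k, candidates, key=lambda m: m[1])


full = top_k_newest(matches, 2)        # looks at all five matches
early = top_k_newest(matches[:2], 2)   # stops after the first two matches

print("all matches examined:", full)
print("stopped early:       ", early)
```

Stopping after the first two matches misses doc 5, the newest one, because
it happens to come last in index order.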

if you really don't care about sorting, you can use the timeAllowed
option to tell the searching method to do the best job it can in an
(approximated) limited amount of time, and then pretend that the docs 
collected so far represent the total number of matches...

https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-ThetimeAllowedParameter
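
A minimal sketch of what such a request could look like from a Python
client. The host, port and core name (localhost:8983, collection1), the
sample query and both helper names are assumptions for illustration; q,
rows, wt and timeAllowed are the standard query parameters. To my knowledge
Solr flags a timed-out search with partialResults=true in the
responseHeader, which is what is_partial checks.

```python
from urllib.parse import urlencode


def build_select_url(base_url: str, query: str,
                     rows: int = 100, time_allowed_ms: int = 5000) -> str:
    """Build a /select URL that caps search time via timeAllowed (in ms)."""
    params = {
        "q": query,
        "rows": rows,
        "timeAllowed": time_allowed_ms,
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


def is_partial(response: dict) -> bool:
    """True if the response header marks the results as partial."""
    header = response.get("responseHeader", {})
    return bool(header.get("partialResults", False))


# Assumed host and core name; adjust to the actual installation.
url = build_select_url("http://localhost:8983/solr/collection1",
                       "stock AND sale")
print(url)
```

When is_partial returns True, numFound and the returned docs only cover what
was collected before the time limit, so they should be treated as a lower
bound rather than the real totals.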


-Hoss
http://www.lucidworks.com/