Re: SOLR Sizing

2016-10-14 Thread Shawn Heisey
On 10/14/2016 12:18 AM, Vasu Y wrote:
> Thank you all for the insight and help. Our SOLR instance has multiple collections.
> Do you know if the spreadsheet at LucidWorks (https://lucidworks.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/) is meant to be used to calculate

Re: SOLR Sizing

2016-10-14 Thread Vasu Y
Thank you all for the insight and help. Our SOLR instance has multiple collections. Do you know if the spreadsheet at LucidWorks (https://lucidworks.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/) is meant to be used to calculate sizing per collection or is it meant to be used

Re: SOLR Sizing

2016-10-06 Thread Walter Underwood
The square-root rule comes from a short paper draft (unpublished) that I can’t find right now. But this paper gets the same result: http://nflrc.hawaii.edu/rfl/April2005/chujo/chujo.html Perfect OCR would follow this rule, but even great

Re: SOLR Sizing

2016-10-06 Thread Erick Erickson
OCR _without errors_ wouldn't break it. That comment assumed the OCR was dirty, I thought. Honest, I once was trying to index an OCR'd image of a "family tree" that was a stylized tree where the most remote ancestor was labeled in vertical text on the trunk, and descendants at various angles

Re: SOLR Sizing

2016-10-06 Thread Rick Leir
I am curious to know where the square-root assumption is from, and why OCR (without errors) would break it. TIA, cheers -- Rick
On 2016-10-04 10:51 AM, Walter Underwood wrote:
> No, we don’t have OCR’ed text. But if you do, it breaks the assumption that vocabulary size is the square root of

Re: SOLR Sizing

2016-10-04 Thread Walter Underwood
No, we don’t have OCR’ed text. But if you do, it breaks the assumption that vocabulary size is the square root of the text size.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Oct 4, 2016, at 7:14 AM, Rick Leir wrote:

Re: SOLR Sizing

2016-10-04 Thread Rick Leir
OCR’ed text can have large amounts of garbage such as '';,-d'." particularly when there is poor image quality or embedded graphics. Is that what is causing your huge vocabularies? I filtered the text, removing any word with fewer than 3 alphanumerics or more than 2 non-alphas. On 2016-10-03
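A minimal sketch of that kind of pre-index filter (hypothetical Python, not from the original post; it assumes the same thresholds, keeping a token only if it has at least 3 alphanumeric characters and at most 2 non-alphanumeric characters):

    import re

    ALNUM = re.compile(r'[A-Za-z0-9]')
    NON_ALNUM = re.compile(r'[^A-Za-z0-9]')

    def keep_token(token):
        # Keep a token only if it has >= 3 alphanumerics and <= 2 non-alphanumerics.
        return (len(ALNUM.findall(token)) >= 3
                and len(NON_ALNUM.findall(token)) <= 2)

    def clean_ocr_text(text):
        # Drop OCR garbage tokens such as '';,-d'." before sending text to Solr.
        return ' '.join(t for t in text.split() if keep_token(t))

    print(clean_ocr_text("""family tree '';,-d'." ancestor"""))  # -> family tree ancestor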

Re: SOLR Sizing

2016-10-03 Thread Walter Underwood
Dropping ngrams also makes the index 5X smaller on disk.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Oct 3, 2016, at 9:02 PM, Walter Underwood wrote:
> I did not believe the benchmark results the first time, but it

Re: SOLR Sizing

2016-10-03 Thread Walter Underwood
I did not believe the benchmark results the first time, but it seems to hold up. Nobody gets a speedup of over a thousand (unless you are going from that Oracle search thing to Solr). It probably won’t help for most people. We have one service with very, very long queries, up to 1000 words of

Re: SOLR Sizing

2016-10-03 Thread Erick Erickson
Walter: What did you change? I might like to put that in my bag of tricks ;) Erick
On Mon, Oct 3, 2016 at 6:30 PM, Walter Underwood wrote:
> That approach doesn’t work very well for estimates.
> Some parts of the index size and speed scale with the vocabulary instead

Re: SOLR Sizing

2016-10-03 Thread Walter Underwood
That approach doesn’t work very well for estimates. Some parts of the index size and speed scale with the vocabulary instead of the number of documents. Vocabulary usually grows at about the square root of the total amount of text in the index. OCR’ed text breaks that estimate badly, with huge
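A rough illustration of that square-root rule (a sketch only; the exponent and scaling constant are assumptions in the spirit of Heaps' law, not values measured on any real index):

    def estimated_vocabulary(total_terms, k=1.0, beta=0.5):
        # Heaps'-law-style estimate: distinct terms ~= k * (total terms) ** beta.
        # beta = 0.5 is the square-root rule discussed above; k and beta are
        # placeholder assumptions, not numbers taken from a real corpus.
        return k * total_terms ** beta

    for n in (1_000_000, 4_000_000, 16_000_000):
        print(f"{n:>10} total terms -> ~{estimated_vocabulary(n):,.0f} distinct terms")
    # Quadrupling the amount of text only doubles the vocabulary under this rule;
    # dirty OCR adds near-unique garbage tokens, so vocabulary grows much faster.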

Re: SOLR Sizing

2016-10-03 Thread Susheel Kumar
In short, if you want your estimate to be closer, run some actual ingestion for, say, 1-5% of your total docs and extrapolate, since every search product may have a different schema, different set of fields, different indexed vs. stored fields, copy fields, different analysis chain, etc. If you want
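A back-of-the-envelope version of that extrapolation (hypothetical numbers; real growth is not perfectly linear, since vocabulary-dependent structures grow sublinearly as noted elsewhere in the thread):

    def extrapolate_index_size(sample_docs, sample_index_bytes, total_docs):
        # Linear extrapolation from a small trial ingestion (e.g. 1-5% of the corpus).
        bytes_per_doc = sample_index_bytes / sample_docs
        return bytes_per_doc * total_docs

    # Hypothetical example: 50,000 sample docs produced a 2 GiB index; 5,000,000 docs planned.
    size = extrapolate_index_size(50_000, 2 * 1024**3, 5_000_000)
    print(f"~{size / 1024**3:.0f} GiB estimated")  # -> ~200 GiB, before merges, deletes, etc.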

RE: SOLR Sizing

2016-10-03 Thread Allison, Timothy B.
This doesn't answer your question, but Erick Erickson's blog on this topic is invaluable: https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
-----Original Message-----
From: Vasu Y [mailto:vya...@gmail.com]
Sent: Monday, October 3, 2016

Re: solr sizing

2013-07-29 Thread Shawn Heisey
On 7/29/2013 2:18 PM, Torsten Albrecht wrote:
> we have
> - 70 mio documents to 100 mio documents
> and we want
> - 800 requests per second
> How many servers (Amazon EC2 / real hardware) do we need for this? Solr 4.x with SolrCloud, or better shards with a load balancer? Is anyone here who can give me some