Thank you all for the insight and help. Our Solr instance has multiple
collections.
Do you know if the spreadsheet at LucidWorks (
is meant to be used to calculate sizing per collection, or for the
whole Solr instance (which contains multiple collections)?
The reason I am asking is that the "Disk Space Estimator" sheet
specifies some defaults, such as "Transient (MB)" (with a value of 10 MB);
I am not sure whether these default values apply per collection or to the
whole Solr instance.
On Thu, Oct 6, 2016 at 9:42 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> The square-root rule comes from a short paper draft (unpublished) that I
> can’t find right now. But this paper gets the same result:
> http://nflrc.hawaii.edu/rfl/April2005/chujo/chujo.html
> Perfect OCR would follow this rule, but even great OCR has lots of errors.
> 95% accuracy is good OCR performance, but that makes a huge, pathological
> long tail of non-language terms.
> I learned about the OCR problems from the Hathi Trust. They hit the Solr
> vocabulary limit of 2.4 billion terms, then when that was raised, they hit
> memory management issues.
> https://www.hathitrust.org/blogs/large-scale-search/too-many-words
> https://www.hathitrust.org/blogs/large-scale-search/too-many-words-again
> Walter Underwood
> http://observer.wunderwood.org/ (my blog)
> > On Oct 6, 2016, at 8:05 AM, Rick Leir <rl...@leirtech.com> wrote:
> > I am curious to know where the square-root assumption is from, and why
> OCR (without errors) would break it. TIA
> > cheers - - Rick
> > On 2016-10-04 10:51 AM, Walter Underwood wrote:
> >> No, we don’t have OCR’ed text. But if you do, it breaks the assumption
> that vocabulary size
> >> is the square root of the text size.
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/ (my blog)
> >>> On Oct 4, 2016, at 7:14 AM, Rick Leir <rl...@leirtech.com> wrote:
> >>> OCR’ed text can have large amounts of garbage such as '';,-d'."
> particularly when there is poor image quality or embedded graphics. Is that
> what is causing your huge vocabularies? I filtered the text, removing any
> word with fewer than 3 alphanumerics or more than 2 non-alphas.
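[Editor's note: one possible reading of that filter, sketched in Python. The exact thresholds match the description above ("fewer than 3 alphanumerics or more than 2 non-alphas"), but the token examples and function name are illustrative, not from the original post.]

```python
def keep_token(tok: str) -> bool:
    """Keep a token only if it has at least 3 alphanumeric characters
    and at most 2 non-alphabetic characters (digits count as non-alpha here)."""
    alnum = sum(c.isalnum() for c in tok)
    non_alpha = sum(not c.isalpha() for c in tok)
    return alnum >= 3 and non_alpha <= 2

# OCR garbage like '';,-d'." is dropped; ordinary words survive.
tokens = ["'';,-d'.\"", "word", "OCR2016", "a1"]
print([t for t in tokens if keep_token(t)])  # prints ['word']
```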
> >>> On 2016-10-03 09:30 PM, Walter Underwood wrote:
> >>>> That approach doesn’t work very well for estimates.
> >>>> Some parts of the index size and speed scale with the vocabulary
> instead of the number of documents.
> >>>> Vocabulary usually grows at about the square root of the total amount
> of text in the index. OCR’ed text
> >>>> breaks that estimate badly, with huge vocabularies.
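[Editor's note: the square-root estimate above corresponds to Heaps' law with an exponent near 0.5. A minimal sketch follows; the constant k is corpus-dependent and the value used here is a placeholder, not a Solr figure.]

```python
def estimate_vocabulary(total_terms: int, k: float = 1.0, beta: float = 0.5) -> float:
    """Heaps'-law estimate: V ~= k * N**beta.
    beta ~ 0.5 gives the square-root rule discussed above;
    OCR error noise effectively inflates beta (or breaks the fit entirely)."""
    return k * total_terms ** beta

# With k = 1, a billion tokens of clean text suggests about 31,623 distinct terms.
print(round(estimate_vocabulary(1_000_000_000)))  # prints 31623
```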