OCR’ed text can contain large amounts of garbage such as '';,-d'.", particularly
when there is poor image quality or embedded graphics. Is that what is causing your
huge vocabularies? I filtered the text, removing any word with fewer than 3
alphanumerics or more than 2 non-alphas.
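A minimal sketch of that filter in Python, assuming whitespace tokenization and
treating "non-alphas" as non-alphanumeric characters (both assumptions on my part):

def keep_word(word: str) -> bool:
    """Keep a token only if it has at least 3 alphanumeric characters
    and no more than 2 non-alphanumeric characters."""
    alnum = sum(c.isalnum() for c in word)
    return alnum >= 3 and (len(word) - alnum) <= 2

def filter_text(text: str) -> str:
    # Whitespace tokenization is an assumption; the original post does not say
    # how words were split.
    return " ".join(w for w in text.split() if keep_word(w))

print(filter_text("OCR noise like ;,-d. mixed with real words"))
# -> "OCR noise like mixed with real words"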
On 2016-10-03 09:30 PM, Walter Underwood wrote:
That approach doesn’t work very well for estimates.
Some parts of the index size and speed scale with the vocabulary instead of the
number of documents. Vocabulary usually grows at about the square root of the
total amount of text in the index. OCR’ed text breaks that estimate badly, with
huge vocabularies.
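As a rough illustration of that square-root rule of thumb (essentially Heaps' law
with an exponent of 0.5); the constant K below is purely hypothetical:

from math import sqrt

K = 40.0  # corpus-dependent constant; this value is made up for illustration

def estimated_vocabulary(total_tokens: int) -> int:
    # Square-root growth: doubling the text grows the vocabulary by only ~41%.
    return int(K * sqrt(total_tokens))

for tokens in (1_000_000, 2_000_000, 100_000_000):
    print(f"{tokens:>12,} tokens -> ~{estimated_vocabulary(tokens):,} distinct terms")

OCR garbage tokens are mostly unique, so they add to the vocabulary almost
linearly and push it far above this curve.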
Also, it is common to find non-linear jumps in performance. I’m benchmarking a
change in a 12 million document index. It improves the 95th percentile response
time for one style of query from 3.8 seconds to 2 milliseconds. I’m testing with
a log of 200k queries from a production host, so I’m pretty sure that is accurate.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
On Oct 3, 2016, at 6:02 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
In short, if you want your estimate to be closer, run an actual ingestion for,
say, 1-5% of your total docs and extrapolate, since every search product may
have a different schema, different set of fields, different indexed vs. stored
fields, copy fields, different analysis chain, etc.
If you just want a very quick rough estimate, create a few flat JSON sample
files (like the one below) with field names and key values (actual data gives a
better estimate). Put in all the field names which you are going to index/store
in Solr and check the JSON file size. This gives you the average size of a doc;
multiply it by the number of docs to get a rough index size (a quick sketch of
that arithmetic follows the example).
{
  "id":"product12345",
  "name":"productA",
  "category":"xyz",
  ...
}
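A minimal sketch of that back-of-the-envelope arithmetic in Python; the sample
doc, file name, and doc count are made up for illustration, and the real number
still depends on your schema and analysis chain:

import json
import os

SAMPLE_FILE = "sample_doc.json"   # hypothetical file holding one representative doc
TOTAL_DOCS = 10_000_000           # hypothetical corpus size

# Write one flat sample doc (use real field values for a better estimate).
sample = {"id": "product12345", "name": "productA", "category": "xyz"}
with open(SAMPLE_FILE, "w") as f:
    json.dump(sample, f)

avg_doc_bytes = os.path.getsize(SAMPLE_FILE)
rough_index_bytes = avg_doc_bytes * TOTAL_DOCS
print(f"avg doc size:     {avg_doc_bytes} bytes")
print(f"rough index size: {rough_index_bytes / 2**30:.2f} GiB")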
Thanks,
Susheel
On Mon, Oct 3, 2016 at 3:19 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:
This doesn't answer your question, but Erick Erickson's blog on this topic
is invaluable:
https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
-----Original Message-----
From: Vasu Y [mailto:vya...@gmail.com]
Sent: Monday, October 3, 2016 2:09 PM
To: solr-user@lucene.apache.org
Subject: SOLR Sizing
Hi,
I am trying to estimate disk space requirements for the documents indexed
to SOLR.
I went through the LucidWorks blog
(https://lucidworks.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/)
and am using it as the template. I have a question regarding estimating
"Avg. Document Size (KB)".
When calculating disk storage requirements, can we use the Java type sizes
(https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html)
and come up with an average document size?
Please let me know if the following assumptions are correct.
Data Type           Size
-----------------   ------------------------------------------------------
long                8 bytes
tint                4 bytes
tdate               8 bytes (stored as a long?)
string              1 byte per char for ASCII chars, 2 bytes per char for
                    non-ASCII (double-byte) chars
text                1 byte per char for ASCII chars, 2 bytes per char for
                    non-ASCII (double-byte) chars (for both with and
                    without norms?)
ICUCollationField   2 bytes per char for non-ASCII (double-byte) chars
boolean             1 bit?
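For illustration only, a rough sketch of how those per-type sizes might be
summed into an average document size; the field names, character counts, and
doc count are hypothetical:

# Hypothetical per-field byte estimates based on the table above.
FIELD_BYTES = {
    "id": 20,        # string, ~20 ASCII chars
    "price": 8,      # long
    "quantity": 4,   # tint
    "created": 8,    # tdate (if stored as a long)
    "title": 60,     # text, ~60 ASCII chars
    "in_stock": 1,   # boolean, rounded up to a whole byte here
}
TOTAL_DOCS = 10_000_000  # hypothetical corpus size

avg_doc_bytes = sum(FIELD_BYTES.values())
print(f"avg doc: {avg_doc_bytes} bytes, "
      f"rough total: {avg_doc_bytes * TOTAL_DOCS / 2**30:.2f} GiB")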
Thanks,
Vasu