In short, if you want a closer estimate, run an actual ingestion of, say,
1-5% of your total docs and extrapolate, since every search product has a
different schema, a different set of fields, different indexed vs. stored
fields, copy fields, a different analysis chain, etc.
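
To make the extrapolation concrete, here is a minimal Python sketch. It
assumes you have already indexed a sample and know where that core's index
directory lives on disk; the function names and the example path are
hypothetical, and the scaling is purely linear, so treat the result as a
ballpark figure rather than a prediction.

```python
import os

def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (e.g. an index directory)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

def extrapolate_index_size(sample_index_bytes, sample_docs, total_docs):
    """Linearly scale a measured sample index size up to the full corpus."""
    if sample_docs <= 0:
        raise ValueError("sample_docs must be positive")
    return sample_index_bytes * total_docs / sample_docs

# Example usage (hypothetical path and counts):
# sample_bytes = dir_size_bytes("/path/to/core/data/index")
# estimate = extrapolate_index_size(sample_bytes, 50_000, 1_000_000)
# print(f"Estimated index size: {estimate / 1024**3:.1f} GiB")
```

Index size does not have to grow strictly linearly with document count, so
a larger or more representative sample should give a better number.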

If you just want a very quick, rough estimate, create a few flat JSON
sample files (like the one below) with the field names and values (use
actual data for a better estimate). Include every field you are going to
index/put into Solr and check the JSON file size. That gives you the
average size of a doc, which you then multiply by the number of docs to
get a rough index size.

{
"id":"product12345",
"name":"productA",
"category":"xyz",
...
...
}
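
The arithmetic for this quick estimate can be sketched in a few lines of
Python. This assumes the sample files sit together in one directory and
match a glob pattern; the function names and the example pattern are
hypothetical, and the raw JSON size is only a proxy for what Solr actually
writes to disk.

```python
import glob
import os

def average_doc_bytes(pattern):
    """Average on-disk size of the sample files matching a glob pattern."""
    paths = glob.glob(pattern)
    if not paths:
        raise ValueError("no sample files match " + pattern)
    return sum(os.path.getsize(p) for p in paths) / len(paths)

def rough_index_bytes(avg_doc_bytes, num_docs):
    """Rough index size: average doc size multiplied by the number of docs."""
    return avg_doc_bytes * num_docs

# Example usage (hypothetical pattern and doc count):
# avg = average_doc_bytes("samples/*.json")
# print(f"~{rough_index_bytes(avg, 10_000_000) / 1024**3:.1f} GiB")
```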

Thanks,
Susheel

On Mon, Oct 3, 2016 at 3:19 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> This doesn't answer your question, but Erick Erickson's blog on this topic
> is invaluable:
>
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> -----Original Message-----
> From: Vasu Y [mailto:vya...@gmail.com]
> Sent: Monday, October 3, 2016 2:09 PM
> To: solr-user@lucene.apache.org
> Subject: SOLR Sizing
>
> Hi,
>  I am trying to estimate disk space requirements for the documents indexed
> to SOLR.
> I went through the LucidWorks blog (
> https://lucidworks.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/)
> and am using it as the template. I have a question regarding estimating
> "Avg. Document Size (KB)".
>
> When calculating disk storage requirements, can we use the Java types
> sizing (
> https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html)
> and come up with an average document size?
>
> Please let me know if the following assumptions are correct.
>
>  Data Type          Size
>  -----------------  ------
>  long               8 bytes
>  tint               4 bytes
>  tdate              8 bytes (stored as long?)
>  string             1 byte per char for ASCII chars and 2 bytes per char
>                     for non-ASCII chars (double-byte chars)
>  text               1 byte per char for ASCII chars and 2 bytes per char
>                     for non-ASCII chars (double-byte chars) (for both with
>                     and without norms?)
>  ICUCollationField  2 bytes per char for non-ASCII chars (double-byte chars)
>  boolean            1 bit?
>
>  Thanks,
>  Vasu
>
