Re: SOLR Sizing

2016-10-14 Thread Shawn Heisey
On 10/14/2016 12:18 AM, Vasu Y wrote:
> Thank you all for the insight and help. Our SOLR instance has multiple
> collections.
> Do you know if the spreadsheet at LucidWorks (
> https://lucidworks.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/)
> is meant to be used to calculate sizing per collection, or is it meant to be
> used for the whole SOLR instance (which contains multiple collections)?
>
> The reason I am asking this question is that there are some defaults, like
> "Transient (MB)" (with a value of 10 MB), specified in the "Disk Space Estimator"
> sheet; I am not sure if these default values are per collection or for the
> whole SOLR instance.

You would need to include info for everything that would live on the
Solr instance ... but even that estimator can only provide you with a
*guess* as to how much heap size you need, and depending on how you
actually use Solr, it might be a completely incorrect guess.  You've
already been given the following URL in response to the initial
question, and gotten other replies about why there's no way for us to
give you an actual answer:

https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Testing and adjusting the actual live production system is the only way
to be absolutely sure what your requirements are.  Anything before that
is guesswork.

Thanks,
Shawn



Re: SOLR Sizing

2016-10-14 Thread Vasu Y
Thank you all for the insight and help. Our SOLR instance has multiple
collections.
Do you know if the spreadsheet at LucidWorks (
https://lucidworks.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/)
is meant to be used to calculate sizing per collection, or is it meant to be
used for the whole SOLR instance (which contains multiple collections)?

The reason I am asking this question is that there are some defaults, like
"Transient (MB)" (with a value of 10 MB), specified in the "Disk Space Estimator"
sheet; I am not sure if these default values are per collection or for the
whole SOLR instance.

Thanks,
Vasu

On Thu, Oct 6, 2016 at 9:42 PM, Walter Underwood 
wrote:

> The square-root rule comes from a short paper draft (unpublished) that I
> can’t find right now. But this paper gets the same result:
>
> http://nflrc.hawaii.edu/rfl/April2005/chujo/chujo.html
>
> Perfect OCR would follow this rule, but even great OCR has lots of errors.
> 95% accuracy is good OCR performance, but that makes a huge, pathological
> long tail of non-language terms.
>
> I learned about the OCR problems from the Hathi Trust. They hit the Solr
> vocabulary limit of 2.4 billion terms, then when that was raised, they hit
> memory management issues.
>
> https://www.hathitrust.org/blogs/large-scale-search/too-many-words
> https://www.hathitrust.org/blogs/large-scale-search/too-many-words-again
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Oct 6, 2016, at 8:05 AM, Rick Leir  wrote:
> >
> > I am curious to know where the square-root assumption is from, and why
> OCR (without errors) would break it. TIA
> >
> > cheers - - Rick
> >
> > On 2016-10-04 10:51 AM, Walter Underwood wrote:
> >> No, we don’t have OCR’ed text. But if you do, it breaks the assumption
> that vocabulary size
> >> is the square root of the text size.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>
> >>> On Oct 4, 2016, at 7:14 AM, Rick Leir  wrote:
> >>>
> >>> OCR’ed text can have large amounts of garbage such as '';,-d'."
> particularly when there is poor image quality or embedded graphics. Is that
> what is causing your huge vocabularies? I filtered the text, removing any
> word with fewer than 3 alphanumerics or more than 2 non-alphas.
> >>>
> >>>
> >>> On 2016-10-03 09:30 PM, Walter Underwood wrote:
>  That approach doesn’t work very well for estimates.
> 
>  Some parts of the index size and speed scale with the vocabulary
> instead of the number of documents.
>  Vocabulary usually grows at about the square root of the total amount
> of text in the index. OCR’ed text
>  breaks that estimate badly, with huge vocabularies.
> 
> 
> >
>
>


Re: SOLR Sizing

2016-10-06 Thread Walter Underwood
The square-root rule comes from a short paper draft (unpublished) that I can’t 
find right now. But this paper gets the same result:

http://nflrc.hawaii.edu/rfl/April2005/chujo/chujo.html 


Perfect OCR would follow this rule, but even great OCR has lots of errors. 95% 
accuracy is good OCR performance, but that makes a huge, pathological long tail 
of non-language terms.
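
[Editor's note: to make the scale of that long tail concrete, here is a minimal
sketch with purely hypothetical numbers (not Hathi Trust's data) of what the
square-root rule predicts versus how many tokens a 95%-accurate OCR pass can
garble. Even if only a fraction of the misread tokens are distinct strings, the
vocabulary can grow far beyond the square-root estimate.]

public class VocabEstimate {
    public static void main(String[] args) {
        long totalTokens = 1_000_000_000L;            // hypothetical corpus size, in tokens
        double cleanVocab = Math.sqrt(totalTokens);   // ~31,623 distinct terms if the rule holds
        long garbledTokens = (long) (totalTokens * 0.05); // tokens misread at 95% OCR accuracy
        System.out.printf("clean vocab ~ %.0f, OCR-garbled tokens = %,d%n",
                cleanVocab, garbledTokens);
    }
}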

I learned about the OCR problems from the Hathi Trust. They hit the Solr
vocabulary limit of 2.4 billion terms, then when that was raised, they hit
memory management issues.

https://www.hathitrust.org/blogs/large-scale-search/too-many-words 

https://www.hathitrust.org/blogs/large-scale-search/too-many-words-again 


wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 6, 2016, at 8:05 AM, Rick Leir  wrote:
> 
> I am curious to know where the square-root assumption is from, and why OCR 
> (without errors) would break it. TIA
> 
> cheers - - Rick
> 
> On 2016-10-04 10:51 AM, Walter Underwood wrote:
>> No, we don’t have OCR’ed text. But if you do, it breaks the assumption that 
>> vocabulary size
>> is the square root of the text size.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Oct 4, 2016, at 7:14 AM, Rick Leir  wrote:
>>> 
>>> OCR’ed text can have large amounts of garbage such as '';,-d'." 
>>> particularly when there is poor image quality or embedded graphics. Is that 
>>> what is causing your huge vocabularies? I filtered the text, removing any 
>>> word with fewer than 3 alphanumerics or more than 2 non-alphas.
>>> 
>>> 
>>> On 2016-10-03 09:30 PM, Walter Underwood wrote:
 That approach doesn’t work very well for estimates.
 
 Some parts of the index size and speed scale with the vocabulary instead 
 of the number of documents.
 Vocabulary usually grows at about the square root of the total amount of 
 text in the index. OCR’ed text
 breaks that estimate badly, with huge vocabularies.
 
 
> 



Re: SOLR Sizing

2016-10-06 Thread Erick Erickson
OCR _without errors_ wouldn't break it. That comment assumed the OCR was
dirty, I thought.

Honestly, I once was trying to index an OCR'd image of a "family tree": a
stylized tree where the most remote ancestor was labeled in vertical text on the
trunk, and descendants were labeled at various angles as the trunk branched, the
branches branched, and on and on...

And as far as cleaning up the text is concerned, if it's dirty, anything you do
is wrong. For instance, again using the genealogy example, throwing out
unrecognized words removes the data that's important when those words are names.

But leaving nonsense characters in is wrong too...

And hand-correcting all of the data is almost always far too expensive.

If your OCR is, indeed, perfect, then I envy you ;)...

On a different note, I thought the captcha-image way of correcting OCR
text was brilliant.

Erick

On Thu, Oct 6, 2016 at 8:05 AM, Rick Leir  wrote:
> I am curious to know where the square-root assumption is from, and why OCR
> (without errors) would break it. TIA
>
> cheers - - Rick
>
> On 2016-10-04 10:51 AM, Walter Underwood wrote:
>>
>> No, we don’t have OCR’ed text. But if you do, it breaks the assumption
>> that vocabulary size
>> is the square root of the text size.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>>> On Oct 4, 2016, at 7:14 AM, Rick Leir  wrote:
>>>
>>> OCR’ed text can have large amounts of garbage such as '';,-d'."
>>> particularly when there is poor image quality or embedded graphics. Is that
>>> what is causing your huge vocabularies? I filtered the text, removing any
>>> word with fewer than 3 alphanumerics or more than 2 non-alphas.
>>>
>>>
>>> On 2016-10-03 09:30 PM, Walter Underwood wrote:

 That approach doesn’t work very well for estimates.

 Some parts of the index size and speed scale with the vocabulary instead
 of the number of documents.
 Vocabulary usually grows at about the square root of the total amount of
 text in the index. OCR’ed text
 breaks that estimate badly, with huge vocabularies.


>


Re: SOLR Sizing

2016-10-06 Thread Rick Leir
I am curious to know where the square-root assumption is from, and why 
OCR (without errors) would break it. TIA


cheers - - Rick

On 2016-10-04 10:51 AM, Walter Underwood wrote:

No, we don’t have OCR’ed text. But if you do, it breaks the assumption that 
vocabulary size
is the square root of the text size.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Oct 4, 2016, at 7:14 AM, Rick Leir  wrote:

OCR’ed text can have large amounts of garbage such as '';,-d'." particularly 
when there is poor image quality or embedded graphics. Is that what is causing your 
huge vocabularies? I filtered the text, removing any word with fewer than 3 
alphanumerics or more than 2 non-alphas.


On 2016-10-03 09:30 PM, Walter Underwood wrote:

That approach doesn’t work very well for estimates.

Some parts of the index size and speed scale with the vocabulary instead of the 
number of documents.
Vocabulary usually grows at about the square root of the total amount of text 
in the index. OCR’ed text
breaks that estimate badly, with huge vocabularies.






Re: SOLR Sizing

2016-10-04 Thread Walter Underwood
No, we don’t have OCR’ed text. But if you do, it breaks the assumption that 
vocabulary size
is the square root of the text size.

wunder 
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 4, 2016, at 7:14 AM, Rick Leir <rl...@leirtech.com> wrote:
> 
> OCR’ed text can have large amounts of garbage such as '';,-d'." particularly 
> when there is poor image quality or embedded graphics. Is that what is 
> causing your huge vocabularies? I filtered the text, removing any word with 
> fewer than 3 alphanumerics or more than 2 non-alphas.
> 
> 
> On 2016-10-03 09:30 PM, Walter Underwood wrote:
>> That approach doesn’t work very well for estimates.
>> 
>> Some parts of the index size and speed scale with the vocabulary instead of 
>> the number of documents.
>> Vocabulary usually grows at about the square root of the total amount of 
>> text in the index. OCR’ed text
>> breaks that estimate badly, with huge vocabularies.
>> 
>> Also, it is common to find non-linear jumps in performance. I’m benchmarking 
>> a change in a 12 million
>> document index. It improves the 95th percentile response time for one style 
>> of query from 3.8 seconds
>> to 2 milliseconds. I’m testing with a log of 200k queries from a production 
>> host, so I’m pretty sure that
>> is accurate.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Oct 3, 2016, at 6:02 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>> 
>>> In short, if you want your estimate to be closer then run some actual
>>> ingestion for say 1-5% of your total docs and extrapolate since every
>>> search product may have different schema,different set of fields, different
>>> index vs. stored fields,  copy fields, different analysis chain etc.
>>> 
>>> If you want to just have a very quick rough estimate, create few flat json
>>> sample files (below) with field names and key values(actual data for better
>>> estimate). Put all the fields names which you are going to index/put into
>>> Solr and check the json file size. This will give you average size of a doc
>>> and then multiply with # docs to get a rough index size.
>>> 
>>> {
>>> "id":"product12345"
>>> "name":"productA",
>>> "category":"xyz",
>>> ...
>>> ...
>>> }
>>> 
>>> Thanks,
>>> Susheel
>>> 
>>> On Mon, Oct 3, 2016 at 3:19 PM, Allison, Timothy B. <talli...@mitre.org>
>>> wrote:
>>> 
>>>> This doesn't answer your question, but Erick Erickson's blog on this topic
>>>> is invaluable:
>>>> 
>>>> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-
>>>> the-abstract-why-we-dont-have-a-definitive-answer/
>>>> 
>>>> -Original Message-
>>>> From: Vasu Y [mailto:vya...@gmail.com]
>>>> Sent: Monday, October 3, 2016 2:09 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: SOLR Sizing
>>>> 
>>>> Hi,
>>>> I am trying to estimate disk space requirements for the documents indexed
>>>> to SOLR.
>>>> I went through the LucidWorks blog (
>>>> https://lucidworks.com/blog/2011/09/14/estimating-memory-
>>>> and-storage-for-lucenesolr/)
>>>> and using this as the template. I have a question regarding estimating
>>>> "Avg. Document Size (KB)".
>>>> 
>>>> When calculating Disk Storage requirements, can we use the Java Types
>>>> sizing (
>>>> https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html)
>>>> & come up average document size?
>>>> 
>>>> Please let know if the following assumptions are correct.
>>>> 
>>>> Data Type   Size
>>>> --  --
>>>> long   8 bytes
>>>> tint   4 bytes
>>>> tdate 8 bytes (Stored as long?)
>>>> string 1 byte per char for ASCII chars and 2 bytes per char for
>>>> Non-ASCII chars (Double byte chars)
>>>> text   1 byte per char for ASCII chars and 2 bytes per char for
>>>> Non-ASCII (Double byte chars) (For both with & without norm?)
>>>> ICUCollationField 2 bytes per char for Non-ASCII (Double byte chars)
>>>> boolean 1 bit?
>>>> 
>>>> Thanks,
>>>> Vasu
>>>> 
>> 
> 



Re: SOLR Sizing

2016-10-04 Thread Rick Leir

OCR’ed text can have large amounts of garbage such as '';,-d'." particularly 
when there is poor image quality or embedded graphics. Is that what is causing your 
huge vocabularies? I filtered the text, removing any word with fewer than 3 
alphanumerics or more than 2 non-alphas.
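
[Editor's note: a minimal sketch of the kind of filter Rick describes (not his
actual code), interpreting "non-alphas" as non-alphanumeric characters: keep a
token only if it has at least 3 alphanumeric characters and at most 2
non-alphanumeric ones.]

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class OcrTokenFilter {
    // Keep a token only if it has >= 3 alphanumeric chars and <= 2 other chars.
    static boolean keep(String token) {
        long alnum = token.chars().filter(Character::isLetterOrDigit).count();
        long other = token.length() - alnum;
        return alnum >= 3 && other <= 2;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("'';,-d'.\"", "genealogy", "Sm1th", "a-b");
        List<String> kept = tokens.stream().filter(OcrTokenFilter::keep).collect(Collectors.toList());
        System.out.println(kept);  // prints [genealogy, Sm1th]
    }
}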


On 2016-10-03 09:30 PM, Walter Underwood wrote:

That approach doesn’t work very well for estimates.

Some parts of the index size and speed scale with the vocabulary instead of the 
number of documents.
Vocabulary usually grows at about the square root of the total amount of text 
in the index. OCR’ed text
breaks that estimate badly, with huge vocabularies.

Also, it is common to find non-linear jumps in performance. I’m benchmarking a 
change in a 12 million
document index. It improves the 95th percentile response time for one style of 
query from 3.8 seconds
to 2 milliseconds. I’m testing with a log of 200k queries from a production 
host, so I’m pretty sure that
is accurate.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Oct 3, 2016, at 6:02 PM, Susheel Kumar <susheel2...@gmail.com> wrote:

In short, if you want your estimate to be closer then run some actual
ingestion for say 1-5% of your total docs and extrapolate since every
search product may have different schema,different set of fields, different
index vs. stored fields,  copy fields, different analysis chain etc.

If you want to just have a very quick rough estimate, create few flat json
sample files (below) with field names and key values(actual data for better
estimate). Put all the fields names which you are going to index/put into
Solr and check the json file size. This will give you average size of a doc
and then multiply with # docs to get a rough index size.

{
"id":"product12345"
"name":"productA",
"category":"xyz",
...
...
}

Thanks,
Susheel

On Mon, Oct 3, 2016 at 3:19 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:


This doesn't answer your question, but Erick Erickson's blog on this topic
is invaluable:

https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-
the-abstract-why-we-dont-have-a-definitive-answer/

-Original Message-
From: Vasu Y [mailto:vya...@gmail.com]
Sent: Monday, October 3, 2016 2:09 PM
To: solr-user@lucene.apache.org
Subject: SOLR Sizing

Hi,
I am trying to estimate disk space requirements for the documents indexed
to SOLR.
I went through the LucidWorks blog (
https://lucidworks.com/blog/2011/09/14/estimating-memory-
and-storage-for-lucenesolr/)
and using this as the template. I have a question regarding estimating
"Avg. Document Size (KB)".

When calculating Disk Storage requirements, can we use the Java Types
sizing (
https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html)
& come up average document size?

Please let know if the following assumptions are correct.

Data Type   Size
--  --
long   8 bytes
tint   4 bytes
tdate 8 bytes (Stored as long?)
string 1 byte per char for ASCII chars and 2 bytes per char for
Non-ASCII chars (Double byte chars)
text   1 byte per char for ASCII chars and 2 bytes per char for
Non-ASCII (Double byte chars) (For both with & without norm?)
ICUCollationField 2 bytes per char for Non-ASCII (Double byte chars)
boolean 1 bit?

Thanks,
Vasu







Re: SOLR Sizing

2016-10-03 Thread Walter Underwood
Dropping ngrams also makes the index 5X smaller on disk.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 3, 2016, at 9:02 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> 
> I did not believe the benchmark results the first time, but it seems to hold 
> up.
> Nobody gets a speedup of over a thousand (unless you are going from that
> Oracle search thing to Solr).
> 
> It probably won’t help for most people. We have one service with very, very 
> long
> queries, up to 1000 words of free text. We also do as-you-type instant 
> results,
> so we have been using edge ngrams. Not using edge ngrams made the huge
> speedup.
> 
> Query results cache hit rate almost doubled, which is part of the non-linear 
> speedup.
> 
> We already trim the number of terms passed to Solr to a reasonable amount.
> Google cuts off at 32; we use a few more.
> 
> We’re running a relevance A/B test for dropping the ngrams. If that doesn’t 
> pass,
> we’ll try something else, like only ngramming the first few words. Or 
> something.
> 
> I wanted to use MLT to extract the best terms out of the long queries. 
> Unfortunately,
> you can’t highlight and MLT (MLT was never moved to the new component system)
> and the MLT handler was really slow. Dang.
> 
> I still might do an outboard MLT with a snapshot of high-idf terms.
> 
> The queries are for homework help. I’ve only found one other search that had 
> to
> deal with this. I was talking with someone who worked on Encarta, and they had
> the same challenge.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
>> On Oct 3, 2016, at 8:06 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> 
>> Walter:
>> 
>> What did you change? I might like to put that in my bag of tricks ;)
>> 
>> Erick
>> 
>> On Mon, Oct 3, 2016 at 6:30 PM, Walter Underwood <wun...@wunderwood.org> 
>> wrote:
>>> That approach doesn’t work very well for estimates.
>>> 
>>> Some parts of the index size and speed scale with the vocabulary instead of 
>>> the number of documents.
>>> Vocabulary usually grows at about the square root of the total amount of 
>>> text in the index. OCR’ed text
>>> breaks that estimate badly, with huge vocabularies.
>>> 
>>> Also, it is common to find non-linear jumps in performance. I’m 
>>> benchmarking a change in a 12 million
>>> document index. It improves the 95th percentile response time for one style 
>>> of query from 3.8 seconds
>>> to 2 milliseconds. I’m testing with a log of 200k queries from a production 
>>> host, so I’m pretty sure that
>>> is accurate.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> 
>>>> On Oct 3, 2016, at 6:02 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>> 
>>>> In short, if you want your estimate to be closer then run some actual
>>>> ingestion for say 1-5% of your total docs and extrapolate since every
>>>> search product may have different schema,different set of fields, different
>>>> index vs. stored fields,  copy fields, different analysis chain etc.
>>>> 
>>>> If you want to just have a very quick rough estimate, create few flat json
>>>> sample files (below) with field names and key values(actual data for better
>>>> estimate). Put all the fields names which you are going to index/put into
>>>> Solr and check the json file size. This will give you average size of a doc
>>>> and then multiply with # docs to get a rough index size.
>>>> 
>>>> {
>>>> "id":"product12345"
>>>> "name":"productA",
>>>> "category":"xyz",
>>>> ...
>>>> ...
>>>> }
>>>> 
>>>> Thanks,
>>>> Susheel
>>>> 
>>>> On Mon, Oct 3, 2016 at 3:19 PM, Allison, Timothy B. <talli...@mitre.org>
>>>> wrote:
>>>> 
>>>>> This doesn't answer your question, but Erick Erickson's blog on this topic
>>>>> is invaluable:
>>>>> 
>>>>> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-
>>>>> the-abstract-why-we-dont-have-a-definitive-answer/
>>>>> 
>>>>> -Original Message-
>>>>> From: Vasu Y [mailto:

Re: SOLR Sizing

2016-10-03 Thread Walter Underwood
I did not believe the benchmark results the first time, but it seems to hold up.
Nobody gets a speedup of over a thousand (unless you are going from that
Oracle search thing to Solr).

It probably won’t help for most people. We have one service with very, very long
queries, up to 1000 words of free text. We also do as-you-type instant results,
so we have been using edge ngrams. Not using edge ngrams made the huge
speedup.

Query results cache hit rate almost doubled, which is part of the non-linear 
speedup.

We already trim the number of terms passed to Solr to a reasonable amount.
Google cuts off at 32; we use a few more.
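
[Editor's note: as an illustration only, a sketch of that kind of term trimming,
assuming simple whitespace tokenization and a made-up cutoff; Walter's actual
implementation is not shown in the thread.]

import java.util.Arrays;
import java.util.Collections;
import java.util.stream.Collectors;

public class QueryTrimmer {
    static final int MAX_TERMS = 40;  // hypothetical cutoff ("a few more" than Google's 32)

    // Keep only the first MAX_TERMS whitespace-separated terms of a free-text query.
    static String trim(String freeText) {
        return Arrays.stream(freeText.trim().split("\\s+"))
                     .limit(MAX_TERMS)
                     .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String longQuery = String.join(" ", Collections.nCopies(1000, "term"));
        System.out.println(trim(longQuery).split("\\s+").length);  // prints 40
    }
}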

We’re running a relevance A/B test for dropping the ngrams. If that doesn’t 
pass,
we’ll try something else, like only ngramming the first few words. Or something.

I wanted to use MLT to extract the best terms out of the long queries. 
Unfortunately,
you can’t highlight and MLT (MLT was never moved to the new component system)
and the MLT handler was really slow. Dang.

I still might do an outboard MLT with a snapshot of high-idf terms.

The queries are for homework help. I’ve only found one other search that had to
deal with this. I was talking with someone who worked on Encarta, and they had
the same challenge.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 3, 2016, at 8:06 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> Walter:
> 
> What did you change? I might like to put that in my bag of tricks ;)
> 
> Erick
> 
> On Mon, Oct 3, 2016 at 6:30 PM, Walter Underwood <wun...@wunderwood.org> 
> wrote:
>> That approach doesn’t work very well for estimates.
>> 
>> Some parts of the index size and speed scale with the vocabulary instead of 
>> the number of documents.
>> Vocabulary usually grows at about the square root of the total amount of 
>> text in the index. OCR’ed text
>> breaks that estimate badly, with huge vocabularies.
>> 
>> Also, it is common to find non-linear jumps in performance. I’m benchmarking 
>> a change in a 12 million
>> document index. It improves the 95th percentile response time for one style 
>> of query from 3.8 seconds
>> to 2 milliseconds. I’m testing with a log of 200k queries from a production 
>> host, so I’m pretty sure that
>> is accurate.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Oct 3, 2016, at 6:02 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>> 
>>> In short, if you want your estimate to be closer then run some actual
>>> ingestion for say 1-5% of your total docs and extrapolate since every
>>> search product may have different schema,different set of fields, different
>>> index vs. stored fields,  copy fields, different analysis chain etc.
>>> 
>>> If you want to just have a very quick rough estimate, create few flat json
>>> sample files (below) with field names and key values(actual data for better
>>> estimate). Put all the fields names which you are going to index/put into
>>> Solr and check the json file size. This will give you average size of a doc
>>> and then multiply with # docs to get a rough index size.
>>> 
>>> {
>>> "id":"product12345"
>>> "name":"productA",
>>> "category":"xyz",
>>> ...
>>> ...
>>> }
>>> 
>>> Thanks,
>>> Susheel
>>> 
>>> On Mon, Oct 3, 2016 at 3:19 PM, Allison, Timothy B. <talli...@mitre.org>
>>> wrote:
>>> 
>>>> This doesn't answer your question, but Erick Erickson's blog on this topic
>>>> is invaluable:
>>>> 
>>>> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-
>>>> the-abstract-why-we-dont-have-a-definitive-answer/
>>>> 
>>>> -Original Message-
>>>> From: Vasu Y [mailto:vya...@gmail.com]
>>>> Sent: Monday, October 3, 2016 2:09 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: SOLR Sizing
>>>> 
>>>> Hi,
>>>> I am trying to estimate disk space requirements for the documents indexed
>>>> to SOLR.
>>>> I went through the LucidWorks blog (
>>>> https://lucidworks.com/blog/2011/09/14/estimating-memory-
>>>> and-storage-for-lucenesolr/)
>>>> and using this as the template. I have a question regarding estimating
>>>> "Avg. Document Size (KB)".
>>>> 
>>>> When calculating Disk Storage requirements, can we use the Java Types
>>>> sizing (
>>>> https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html)
>>>> & come up average document size?
>>>> 
>>>> Please let know if the following assumptions are correct.
>>>> 
>>>> Data Type   Size
>>>> --  --
>>>> long   8 bytes
>>>> tint   4 bytes
>>>> tdate 8 bytes (Stored as long?)
>>>> string 1 byte per char for ASCII chars and 2 bytes per char for
>>>> Non-ASCII chars (Double byte chars)
>>>> text   1 byte per char for ASCII chars and 2 bytes per char for
>>>> Non-ASCII (Double byte chars) (For both with & without norm?)
>>>> ICUCollationField 2 bytes per char for Non-ASCII (Double byte chars)
>>>> boolean 1 bit?
>>>> 
>>>> Thanks,
>>>> Vasu
>>>> 
>> 



Re: SOLR Sizing

2016-10-03 Thread Erick Erickson
Walter:

What did you change? I might like to put that in my bag of tricks ;)

Erick

On Mon, Oct 3, 2016 at 6:30 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> That approach doesn’t work very well for estimates.
>
> Some parts of the index size and speed scale with the vocabulary instead of 
> the number of documents.
> Vocabulary usually grows at about the square root of the total amount of text 
> in the index. OCR’ed text
> breaks that estimate badly, with huge vocabularies.
>
> Also, it is common to find non-linear jumps in performance. I’m benchmarking 
> a change in a 12 million
> document index. It improves the 95th percentile response time for one style 
> of query from 3.8 seconds
> to 2 milliseconds. I’m testing with a log of 200k queries from a production 
> host, so I’m pretty sure that
> is accurate.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
>> On Oct 3, 2016, at 6:02 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>
>> In short, if you want your estimate to be closer then run some actual
>> ingestion for say 1-5% of your total docs and extrapolate since every
>> search product may have different schema,different set of fields, different
>> index vs. stored fields,  copy fields, different analysis chain etc.
>>
>> If you want to just have a very quick rough estimate, create few flat json
>> sample files (below) with field names and key values(actual data for better
>> estimate). Put all the fields names which you are going to index/put into
>> Solr and check the json file size. This will give you average size of a doc
>> and then multiply with # docs to get a rough index size.
>>
>> {
>> "id":"product12345"
>> "name":"productA",
>> "category":"xyz",
>> ...
>> ...
>> }
>>
>> Thanks,
>> Susheel
>>
>> On Mon, Oct 3, 2016 at 3:19 PM, Allison, Timothy B. <talli...@mitre.org>
>> wrote:
>>
>>> This doesn't answer your question, but Erick Erickson's blog on this topic
>>> is invaluable:
>>>
>>> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-
>>> the-abstract-why-we-dont-have-a-definitive-answer/
>>>
>>> -Original Message-
>>> From: Vasu Y [mailto:vya...@gmail.com]
>>> Sent: Monday, October 3, 2016 2:09 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: SOLR Sizing
>>>
>>> Hi,
>>> I am trying to estimate disk space requirements for the documents indexed
>>> to SOLR.
>>> I went through the LucidWorks blog (
>>> https://lucidworks.com/blog/2011/09/14/estimating-memory-
>>> and-storage-for-lucenesolr/)
>>> and using this as the template. I have a question regarding estimating
>>> "Avg. Document Size (KB)".
>>>
>>> When calculating Disk Storage requirements, can we use the Java Types
>>> sizing (
>>> https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html)
>>> & come up average document size?
>>>
>>> Please let know if the following assumptions are correct.
>>>
>>> Data Type   Size
>>> --  --
>>> long   8 bytes
>>> tint   4 bytes
>>> tdate 8 bytes (Stored as long?)
>>> string 1 byte per char for ASCII chars and 2 bytes per char for
>>> Non-ASCII chars (Double byte chars)
>>> text   1 byte per char for ASCII chars and 2 bytes per char for
>>> Non-ASCII (Double byte chars) (For both with & without norm?)
>>> ICUCollationField 2 bytes per char for Non-ASCII (Double byte chars)
>>> boolean 1 bit?
>>>
>>> Thanks,
>>> Vasu
>>>
>


Re: SOLR Sizing

2016-10-03 Thread Walter Underwood
That approach doesn’t work very well for estimates.

Some parts of the index size and speed scale with the vocabulary instead of the 
number of documents.
Vocabulary usually grows at about the square root of the total amount of text 
in the index. OCR’ed text
breaks that estimate badly, with huge vocabularies.

Also, it is common to find non-linear jumps in performance. I’m benchmarking a 
change in a 12 million
document index. It improves the 95th percentile response time for one style of 
query from 3.8 seconds
to 2 milliseconds. I’m testing with a log of 200k queries from a production 
host, so I’m pretty sure that
is accurate.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 3, 2016, at 6:02 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
> 
> In short, if you want your estimate to be closer then run some actual
> ingestion for say 1-5% of your total docs and extrapolate since every
> search product may have different schema,different set of fields, different
> index vs. stored fields,  copy fields, different analysis chain etc.
> 
> If you want to just have a very quick rough estimate, create few flat json
> sample files (below) with field names and key values(actual data for better
> estimate). Put all the fields names which you are going to index/put into
> Solr and check the json file size. This will give you average size of a doc
> and then multiply with # docs to get a rough index size.
> 
> {
> "id":"product12345"
> "name":"productA",
> "category":"xyz",
> ...
> ...
> }
> 
> Thanks,
> Susheel
> 
> On Mon, Oct 3, 2016 at 3:19 PM, Allison, Timothy B. <talli...@mitre.org>
> wrote:
> 
>> This doesn't answer your question, but Erick Erickson's blog on this topic
>> is invaluable:
>> 
>> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-
>> the-abstract-why-we-dont-have-a-definitive-answer/
>> 
>> -Original Message-
>> From: Vasu Y [mailto:vya...@gmail.com]
>> Sent: Monday, October 3, 2016 2:09 PM
>> To: solr-user@lucene.apache.org
>> Subject: SOLR Sizing
>> 
>> Hi,
>> I am trying to estimate disk space requirements for the documents indexed
>> to SOLR.
>> I went through the LucidWorks blog (
>> https://lucidworks.com/blog/2011/09/14/estimating-memory-
>> and-storage-for-lucenesolr/)
>> and using this as the template. I have a question regarding estimating
>> "Avg. Document Size (KB)".
>> 
>> When calculating Disk Storage requirements, can we use the Java Types
>> sizing (
>> https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html)
>> & come up average document size?
>> 
>> Please let know if the following assumptions are correct.
>> 
>> Data Type   Size
>> --  --
>> long   8 bytes
>> tint   4 bytes
>> tdate 8 bytes (Stored as long?)
>> string 1 byte per char for ASCII chars and 2 bytes per char for
>> Non-ASCII chars (Double byte chars)
>> text   1 byte per char for ASCII chars and 2 bytes per char for
>> Non-ASCII (Double byte chars) (For both with & without norm?)
>> ICUCollationField 2 bytes per char for Non-ASCII (Double byte chars)
>> boolean 1 bit?
>> 
>> Thanks,
>> Vasu
>> 



Re: SOLR Sizing

2016-10-03 Thread Susheel Kumar
In short, if you want your estimate to be closer, then run an actual ingestion
for, say, 1-5% of your total docs and extrapolate, since every search product
may have a different schema, different set of fields, different indexed vs.
stored fields, copy fields, different analysis chain, etc.

If you just want a very quick rough estimate, create a few flat JSON sample
files (below) with field names and key values (actual data gives a better
estimate). Put in all the field names which you are going to index/store in
Solr and check the JSON file size. This will give you the average size of a
doc; then multiply by the # of docs to get a rough index size.

{
"id":"product12345"
"name":"productA",
"category":"xyz",
...
...
}
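
[Editor's note: a minimal sketch of the back-of-the-envelope calculation
Susheel describes above; the file names and document count are made up for
illustration, and the flat-JSON footprint is only a rough proxy for the real
Lucene index size.]

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class RoughIndexEstimate {
    public static void main(String[] args) throws IOException {
        // Hypothetical sample files, each holding one flat JSON doc with the
        // same fields you plan to send to Solr.
        List<Path> samples = Arrays.asList(
                Paths.get("sample1.json"), Paths.get("sample2.json"), Paths.get("sample3.json"));

        long totalBytes = 0;
        for (Path p : samples) {
            totalBytes += Files.size(p);
        }
        double avgDocBytes = (double) totalBytes / samples.size();

        long plannedDocs = 50_000_000L;  // hypothetical corpus size
        double roughIndexGb = avgDocBytes * plannedDocs / (1024.0 * 1024 * 1024);
        // The real index can be smaller (compression, unstored fields) or larger
        // (copyFields, term dictionaries), which is why the 1-5% test ingest is better.
        System.out.printf("avg doc ~ %.0f bytes, rough index ~ %.1f GB%n", avgDocBytes, roughIndexGb);
    }
}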

Thanks,
Susheel

On Mon, Oct 3, 2016 at 3:19 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> This doesn't answer your question, but Erick Erickson's blog on this topic
> is invaluable:
>
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-
> the-abstract-why-we-dont-have-a-definitive-answer/
>
> -Original Message-
> From: Vasu Y [mailto:vya...@gmail.com]
> Sent: Monday, October 3, 2016 2:09 PM
> To: solr-user@lucene.apache.org
> Subject: SOLR Sizing
>
> Hi,
>  I am trying to estimate disk space requirements for the documents indexed
> to SOLR.
> I went through the LucidWorks blog (
> https://lucidworks.com/blog/2011/09/14/estimating-memory-
> and-storage-for-lucenesolr/)
> and using this as the template. I have a question regarding estimating
> "Avg. Document Size (KB)".
>
> When calculating Disk Storage requirements, can we use the Java Types
> sizing (
> https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html)
> & come up average document size?
>
> Please let know if the following assumptions are correct.
>
>  Data Type   Size
>  --  --
>  long   8 bytes
>  tint   4 bytes
>  tdate 8 bytes (Stored as long?)
>  string 1 byte per char for ASCII chars and 2 bytes per char for
> Non-ASCII chars (Double byte chars)
>  text   1 byte per char for ASCII chars and 2 bytes per char for
> Non-ASCII (Double byte chars) (For both with & without norm?)
> ICUCollationField 2 bytes per char for Non-ASCII (Double byte chars)
> boolean 1 bit?
>
>  Thanks,
>  Vasu
>


RE: SOLR Sizing

2016-10-03 Thread Allison, Timothy B.
This doesn't answer your question, but Erick Erickson's blog on this topic is 
invaluable:

https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

-----Original Message-----
From: Vasu Y [mailto:vya...@gmail.com] 
Sent: Monday, October 3, 2016 2:09 PM
To: solr-user@lucene.apache.org
Subject: SOLR Sizing

Hi,
 I am trying to estimate disk space requirements for the documents indexed to 
SOLR.
I went through the LucidWorks blog (
https://lucidworks.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/)
and using this as the template. I have a question regarding estimating "Avg. 
Document Size (KB)".

When calculating Disk Storage requirements, can we use the Java Types sizing (
https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html) &
come up with an average document size?

Please let me know if the following assumptions are correct.

Data Type          Size
-----------------  ------------------------------------------------------------
long               8 bytes
tint               4 bytes
tdate              8 bytes (stored as a long?)
string             1 byte per char for ASCII chars, 2 bytes per char for
                   non-ASCII (double-byte) chars
text               1 byte per char for ASCII chars, 2 bytes per char for
                   non-ASCII (double-byte) chars (both with & without norms?)
ICUCollationField  2 bytes per char for non-ASCII (double-byte) chars
boolean            1 bit?

 Thanks,
 Vasu


SOLR Sizing

2016-10-03 Thread Vasu Y
Hi,
 I am trying to estimate disk space requirements for the documents indexed
to SOLR.
I went through the LucidWorks blog (
https://lucidworks.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/)
and using this as the template. I have a question regarding estimating
"Avg. Document Size (KB)".

When calculating Disk Storage requirements, can we use the Java Types
sizing (
https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html) &
come up with an average document size?

Please let me know if the following assumptions are correct.

Data Type          Size
-----------------  ------------------------------------------------------------
long               8 bytes
tint               4 bytes
tdate              8 bytes (stored as a long?)
string             1 byte per char for ASCII chars, 2 bytes per char for
                   non-ASCII (double-byte) chars
text               1 byte per char for ASCII chars, 2 bytes per char for
                   non-ASCII (double-byte) chars (both with & without norms?)
ICUCollationField  2 bytes per char for non-ASCII (double-byte) chars
boolean            1 bit?

 Thanks,
 Vasu
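
[Editor's note: for what it's worth, a sketch of the per-field arithmetic Vasu
is describing, using the sizes assumed in the table above. The field names and
character counts are hypothetical, and the actual on-disk size also depends on
stored vs. indexed fields, norms, docValues, and codec compression, as the
replies point out.]

public class AvgDocSizeEstimate {
    public static void main(String[] args) {
        // Hypothetical document; byte sizes follow the assumptions in the table above.
        long longField  = 8;          // long: 8 bytes
        long intField   = 4;          // tint: 4 bytes
        long dateField  = 8;          // tdate: stored as a long, 8 bytes
        long titleField = 60 * 1;     // string: ~60 ASCII chars at 1 byte/char
        long bodyField  = 4_000 * 1;  // text: ~4,000 ASCII chars at 1 byte/char
        long sortField  = 30 * 2;     // ICUCollationField: ~30 chars at 2 bytes/char
        long boolField  = 1;          // boolean: round "1 bit?" up to 1 byte

        long avgDocBytes = longField + intField + dateField
                + titleField + bodyField + sortField + boolField;
        System.out.println("estimated avg doc size ~ " + avgDocBytes + " bytes ("
                + (avgDocBytes / 1024.0) + " KB)");
    }
}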


solr sizing

2013-07-29 Thread Torsten Albrecht
Hi all,

we have

- 70 million to 100 million documents

and we want

- 800 requests per second


How many servers (Amazon EC2 or real hardware) do we need for this?

Solr 4.x with SolrCloud, or better to use shards with a load balancer?

Is there anyone here who can give me some information, or who operates a
similar system themselves?


Regards,

Torsten


Re: solr sizing

2013-07-29 Thread Shawn Heisey

On 7/29/2013 2:18 PM, Torsten Albrecht wrote:

we have

- 70 million to 100 million documents

and we want

- 800 requests per second


How many servers (Amazon EC2 or real hardware) do we need for this?

Solr 4.x with SolrCloud, or better to use shards with a load balancer?

Is there anyone here who can give me some information, or who operates a
similar system themselves?


Your question is impossible to answer, aside from generalities that 
won't really help all that much.


I have a similarly sized system (82 million docs), but I don't have 
query volume anywhere near what yours is.  I've got less than 10 queries 
per second.  I have two copies of my index.  I use a load balancer with 
traditional sharding.


I don't do replication; my two index copies are completely independent.
I set it up this way long before SolrCloud was released.  Having two
completely independent indexes lets me do a lot of experimentation that
a typical SolrCloud setup won't let me do.


One copy of the index is running 3.5.0 and is about 142GB if you add up 
all the shards.  The other copy of the index is running 4.2.1 and is 
about 87GB on disk.  Each copy of the index runs on two servers, six 
large cold shards and one small hot shard.  Each of those servers has 
two quad-core processors (Xeon E5400 series, so fairly old now) and 64GB 
of RAM.  I can get away with multiple shards per host because my query 
volume is so low.


Here's a screenshot of a status servlet that I wrote for my index. 
There's tons of info here about my index stats:


https://dl.dropboxusercontent.com/u/97770508/statuspagescreenshot.png

If I needed to start over from scratch with your higher query volume, I 
would probably set up two independent SolrCloud installs, each with a 
replicationFactor of at least two, and I'd use 4-8 shards.  I would put 
a load balancer in front of it so that I could bring one cloud down and 
have everything still work, though with lower performance.  Because of 
the query volume, I'd only have one shard per host.  Depending on how 
big the index ended up being, I'd want 16-32GB (or possibly more) RAM 
per host.


You might not need the flexibility of two independent clouds, and it 
would require additional complexity in your indexing software.  If you 
only went with one cloud, you'd just need a higher replicationFactor.


I'd also want to have another set of servers (not as beefy) to have 
another independent SolrCloud with a replicationFactor of 1 or 2 for dev 
purposes.


That's a LOT of hardware, and it would NOT be cheap.  Can I be sure that
you'd really need that much hardware?  Not really.  To be quite
honest, you'll just have to set up a proof-of-concept system and be
prepared to make it bigger.


Thanks,
Shawn