Dropping ngrams also makes the index 5X smaller on disk.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 3, 2016, at 9:02 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> 
> I did not believe the benchmark results the first time, but they seem to hold up.
> Nobody gets a speedup of over a thousand (unless you are going from that
> Oracle search thing to Solr).
> 
> It probably won’t help for most people. We have one service with very, very long
> queries, up to 1000 words of free text. We also do as-you-type instant results,
> so we have been using edge ngrams. Dropping the edge ngrams is what produced the
> huge speedup.
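> 
> As a rough illustration of why the edge ngrams were so expensive (a sketch only,
> not our actual analysis chain), each token expands into one index term per prefix
> length:
> 
> # Sketch only: edge ngram expansion of a single token.
> # The min_gram/max_gram values are made up for illustration.
> def edge_ngrams(token, min_gram=1, max_gram=15):
>     return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]
> 
> print(len(edge_ngrams("photosynthesis")))  # 14 index terms for one word
> 
> Multiply that across 1000-word queries and a large corpus and the extra index
> size and query time add up fast.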
> 
> Query results cache hit rate almost doubled, which is part of the non-linear speedup.
> 
> We already trim the number of terms passed to Solr to a reasonable amount.
> Google cuts off at 32; we use a few more.
> 
> We’re running a relevance A/B test for dropping the ngrams. If that doesn’t pass,
> we’ll try something else, like only ngramming the first few words. Or something.
> 
> I wanted to use MoreLikeThis (MLT) to extract the best terms from the long queries.
> Unfortunately, you can’t combine highlighting with MLT (MLT was never moved to the
> new component system), and the MLT handler was really slow. Dang.
> 
> I still might do an outboard MLT with a snapshot of high-idf terms.
> 
> The queries are for homework help. I’ve only found one other search that had to
> deal with this. I was talking with someone who worked on Encarta, and they had
> the same challenge.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
>> On Oct 3, 2016, at 8:06 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> 
>> Walter:
>> 
>> What did you change? I might like to put that in my bag of tricks ;)
>> 
>> Erick
>> 
>> On Mon, Oct 3, 2016 at 6:30 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>>> That approach doesn’t work very well for estimates.
>>> 
>>> Some parts of the index size and speed scale with the vocabulary instead of the
>>> number of documents. Vocabulary usually grows at about the square root of the
>>> total amount of text in the index. OCR’ed text breaks that estimate badly, with
>>> huge vocabularies.
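>>> 
>>> A back-of-the-envelope sketch of that square-root scaling (the coefficient is
>>> made up; you would fit it from a sample of your own corpus):
>>> 
>>> # Sketch of the square-root vocabulary estimate; K is a placeholder constant
>>> # that depends on language, tokenization, and how noisy (e.g. OCR'ed) the text is.
>>> import math
>>> 
>>> K = 40
>>> 
>>> def estimated_vocab(total_tokens):
>>>     return int(K * math.sqrt(total_tokens))
>>> 
>>> print(estimated_vocab(1_000_000))    # ~40,000 unique terms
>>> print(estimated_vocab(100_000_000))  # ~400,000 unique terms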
>>> 
>>> Also, it is common to find non-linear jumps in performance. I’m benchmarking a
>>> change in a 12 million document index. It improves the 95th percentile response
>>> time for one style of query from 3.8 seconds to 2 milliseconds. I’m testing with
>>> a log of 200k queries from a production host, so I’m pretty sure that is accurate.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> 
>>>> On Oct 3, 2016, at 6:02 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>> 
>>>> In short, if you want your estimate to be closer, run an actual ingestion for,
>>>> say, 1-5% of your total docs and extrapolate, since every search product may
>>>> have a different schema, a different set of fields, different indexed vs. stored
>>>> fields, copy fields, a different analysis chain, etc.
>>>> 
>>>> If you just want a very quick, rough estimate, create a few flat JSON sample
>>>> files (like the one below) with field names and key values (actual data gives a
>>>> better estimate). Put in all the field names you are going to index into Solr
>>>> and check the JSON file size. This gives you the average size of a doc; multiply
>>>> it by the number of docs to get a rough index size.
>>>> 
>>>> {
>>>> "id":"product12345",
>>>> "name":"productA",
>>>> "category":"xyz",
>>>> ...
>>>> ...
>>>> }
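>>>> 
>>>> A tiny script along those lines (all numbers are placeholders; plug in whatever
>>>> you measure from your own sample files or test ingestion):
>>>> 
>>>> # Sketch: extrapolate total size from a small measured sample.
>>>> # sample_docs / sample_bytes are hypothetical measurements, not real numbers.
>>>> sample_docs = 50_000
>>>> sample_bytes = 2_500_000_000   # size measured after ingesting the sample
>>>> total_docs = 5_000_000
>>>> 
>>>> avg_bytes_per_doc = sample_bytes / sample_docs
>>>> estimated_total = avg_bytes_per_doc * total_docs
>>>> print(f"~{avg_bytes_per_doc / 1024:.1f} KB per doc, "
>>>>       f"~{estimated_total / 1024**3:.1f} GB total (very rough)")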
>>>> 
>>>> Thanks,
>>>> Susheel
>>>> 
>>>> On Mon, Oct 3, 2016 at 3:19 PM, Allison, Timothy B. <talli...@mitre.org> wrote:
>>>> 
>>>>> This doesn't answer your question, but Erick Erickson's blog on this topic
>>>>> is invaluable:
>>>>> 
>>>>> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Vasu Y [mailto:vya...@gmail.com]
>>>>> Sent: Monday, October 3, 2016 2:09 PM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: SOLR Sizing
>>>>> 
>>>>> Hi,
>>>>> I am trying to estimate disk space requirements for the documents indexed to Solr.
>>>>> I went through the LucidWorks blog
>>>>> (https://lucidworks.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/)
>>>>> and am using it as the template. I have a question about estimating
>>>>> "Avg. Document Size (KB)".
>>>>> 
>>>>> When calculating disk storage requirements, can we use the Java primitive type
>>>>> sizes (https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html)
>>>>> to come up with an average document size?
>>>>> 
>>>>> Please let me know if the following assumptions are correct.
>>>>> 
>>>>> Data Type          Size
>>>>> -----------------  ----------------------------------------------------------
>>>>> long               8 bytes
>>>>> tint               4 bytes
>>>>> tdate              8 bytes (stored as a long?)
>>>>> string             1 byte per char for ASCII chars, 2 bytes per char for
>>>>>                    non-ASCII (double-byte) chars
>>>>> text               1 byte per char for ASCII chars, 2 bytes per char for
>>>>>                    non-ASCII (double-byte) chars (both with and without norms?)
>>>>> ICUCollationField  2 bytes per char for non-ASCII (double-byte) chars
>>>>> boolean            1 bit?
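>>>>> 
>>>>> For example, this is the kind of arithmetic I have in mind (the field list and
>>>>> sizes below are invented, and I know it ignores Lucene's compression, term
>>>>> dictionaries, norms, etc., so it would only be a starting point):
>>>>> 
>>>>> # Hypothetical schema: field -> (type, avg chars for text/string, or fixed bytes).
>>>>> # Names and sizes are invented, just to show the calculation.
>>>>> fields = {
>>>>>     "id":       ("string", 20),    # 1 byte/char, assuming ASCII
>>>>>     "price":    ("long", 8),
>>>>>     "title":    ("text", 60),
>>>>>     "body":     ("text", 2000),
>>>>>     "in_stock": ("boolean", 1),    # 1 bit in the table; rounded up to a byte
>>>>> }
>>>>> 
>>>>> avg_doc_bytes = sum(size for _type, size in fields.values())
>>>>> print(f"~{avg_doc_bytes / 1024:.2f} KB per document (raw field data only)")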
>>>>> 
>>>>> Thanks,
>>>>> Vasu
>>>>> 
>>> 
> 
