Dropping ngrams also makes the index 5X smaller on disk.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
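For a sense of why edge ngrams inflate an index this much, here is a small sketch of the tokens an edge-ngram filter (such as Solr's EdgeNGramFilterFactory) emits for a single word; the gram sizes are illustrative, since the thread does not show the actual analyzer settings:

def edge_ngrams(token, min_gram=2, max_gram=15):
    """Leading-edge ngrams for one token, with made-up gram sizes
    (the real schema settings are not given in the thread)."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

print(edge_ngrams("photosynthesis"))
# ['ph', 'pho', 'phot', 'photo', 'photos', 'photosy', 'photosyn',
#  'photosynt', 'photosynth', 'photosynthe', 'photosynthes',
#  'photosynthesi', 'photosynthesis']

With settings like these, a 14-character word is indexed as 13 terms instead of one, which is consistent with both the disk savings and the query-time change described below.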
> On Oct 3, 2016, at 9:02 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>
> I did not believe the benchmark results the first time, but it seems to hold up.
> Nobody gets a speedup of over a thousand (unless you are going from that
> Oracle search thing to Solr).
>
> It probably won’t help for most people. We have one service with very, very
> long queries, up to 1000 words of free text. We also do as-you-type instant
> results, so we have been using edge ngrams. Not using edge ngrams made the
> huge speedup.
>
> Query results cache hit rate almost doubled, which is part of the non-linear
> speedup.
>
> We already trim the number of terms passed to Solr to a reasonable amount.
> Google cuts off at 32; we use a few more.
>
> We’re running a relevance A/B test for dropping the ngrams. If that doesn’t
> pass, we’ll try something else, like only ngramming the first few words. Or
> something.
>
> I wanted to use MLT to extract the best terms out of the long queries.
> Unfortunately, you can’t combine highlighting and MLT (MLT was never moved to
> the new component system), and the MLT handler was really slow. Dang.
>
> I still might do an outboard MLT with a snapshot of high-idf terms.
>
> The queries are for homework help. I’ve only found one other search that had
> to deal with this. I was talking with someone who worked on Encarta, and they
> had the same challenge.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>
>> On Oct 3, 2016, at 8:06 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>> Walter:
>>
>> What did you change? I might like to put that in my bag of tricks ;)
>>
>> Erick
>>
>> On Mon, Oct 3, 2016 at 6:30 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>>> That approach doesn’t work very well for estimates.
>>>
>>> Some parts of the index size and speed scale with the vocabulary instead of
>>> the number of documents. Vocabulary usually grows at about the square root
>>> of the total amount of text in the index. OCR’ed text breaks that estimate
>>> badly, with huge vocabularies.
>>>
>>> Also, it is common to find non-linear jumps in performance. I’m benchmarking
>>> a change in a 12 million document index. It improves the 95th percentile
>>> response time for one style of query from 3.8 seconds to 2 milliseconds.
>>> I’m testing with a log of 200k queries from a production host, so I’m
>>> pretty sure that is accurate.
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/ (my blog)
>>>
>>>
>>>> On Oct 3, 2016, at 6:02 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>>
>>>> In short, if you want your estimate to be closer, run an actual ingestion
>>>> for, say, 1-5% of your total docs and extrapolate, since every search
>>>> product may have a different schema, different set of fields, different
>>>> indexed vs. stored fields, copy fields, different analysis chain, etc.
>>>>
>>>> If you just want a very quick rough estimate, create a few flat JSON
>>>> sample files (below) with field names and key values (actual data for a
>>>> better estimate). Put in all the field names which you are going to
>>>> index/put into Solr and check the JSON file size. This will give you the
>>>> average size of a doc; multiply by the number of docs to get a rough
>>>> index size.
>>>>
>>>> {
>>>>   "id":"product12345",
>>>>   "name":"productA",
>>>>   "category":"xyz",
>>>>   ...
>>>>   ...
>>>> }
>>>>
>>>> Thanks,
>>>> Susheel
>>>>
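A sketch of Susheel's quick estimate in a few lines of Python (the sample directory and document count are made-up values for illustration):

import glob
import os

# Average size of a few representative flat JSON sample docs, as suggested
# above. The "samples/*.json" path and the doc count are hypothetical.
samples = glob.glob("samples/*.json")
avg_doc_bytes = sum(os.path.getsize(p) for p in samples) / len(samples)

total_docs = 50_000_000  # hypothetical corpus size
rough_size_gb = avg_doc_bytes * total_docs / 1024**3
print(f"rough index size: ~{rough_size_gb:.1f} GB")

Walter's reply above points out that parts of the index scale with vocabulary rather than document count, so the 1-5% trial ingestion Susheel mentions first is the safer estimate.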
>>>> On Mon, Oct 3, 2016 at 3:19 PM, Allison, Timothy B. <talli...@mitre.org> wrote:
>>>>
>>>>> This doesn't answer your question, but Erick Erickson's blog on this
>>>>> topic is invaluable:
>>>>>
>>>>> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>>>>
>>>>> -----Original Message-----
>>>>> From: Vasu Y [mailto:vya...@gmail.com]
>>>>> Sent: Monday, October 3, 2016 2:09 PM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: SOLR Sizing
>>>>>
>>>>> Hi,
>>>>> I am trying to estimate disk space requirements for the documents
>>>>> indexed to SOLR.
>>>>> I went through the LucidWorks blog
>>>>> (https://lucidworks.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/)
>>>>> and am using it as the template. I have a question about estimating
>>>>> "Avg. Document Size (KB)".
>>>>>
>>>>> When calculating disk storage requirements, can we use the Java data
>>>>> type sizes
>>>>> (https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html)
>>>>> to come up with an average document size?
>>>>>
>>>>> Please let me know if the following assumptions are correct.
>>>>>
>>>>> Data Type          Size
>>>>> -----------------  ---------------------------------------------------
>>>>> long               8 bytes
>>>>> tint               4 bytes
>>>>> tdate              8 bytes (stored as a long?)
>>>>> string             1 byte per char for ASCII chars, 2 bytes per char
>>>>>                    for non-ASCII (double-byte) chars
>>>>> text               1 byte per char for ASCII chars, 2 bytes per char
>>>>>                    for non-ASCII (double-byte) chars (for both with
>>>>>                    and without norms?)
>>>>> ICUCollationField  2 bytes per char for non-ASCII (double-byte) chars
>>>>> boolean            1 bit?
>>>>>
>>>>> Thanks,
>>>>> Vasu
>>>>>
>>>
>
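For completeness, the per-field arithmetic Vasu is asking about would look something like this (the field list and per-value sizes are hypothetical, using the assumptions from the table above and ASCII-only text):

# Hypothetical schema: bytes per value, using the per-type assumptions
# from the table above.
field_sizes = {
    "id": 20,             # string, ~20 ASCII chars
    "quantity": 4,        # tint
    "price": 8,           # long
    "modified": 8,        # tdate, stored as a long
    "in_stock": 1,        # boolean, rounded up to a byte
    "description": 2000,  # text, ~2000 ASCII chars
}

avg_doc_bytes = sum(field_sizes.values())
total_docs = 10_000_000   # hypothetical corpus size
print(f"naive estimate: ~{avg_doc_bytes * total_docs / 1024**3:.2f} GB")

As Walter points out in his reply above, a tally like this misses the parts of the index that scale with vocabulary rather than document count, so the sample-file and trial-ingestion approaches earlier in the thread are the more reliable checks.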