Re: edge ngram/find as you type sorting

Erick Erickson Thu, 26 Mar 2020 06:38:12 -0700

From other mails, it looks like you’re inheriting something you had
no input in building. My sympathies ;)


Unless you’ve explicitly changed the heap size by specifying -Xmx and -Xms
at startup, you’re operating with 512M of heap, which is far too small
for most Solr installations. The -m parameter to bin/solr at startup sets
both values at once (for example, -m 4g).

The admin UI will also show you how much memory Solr is running with.

Best,
Erick

> On Mar 26, 2020, at 8:52 AM, matthew sporleder <msporle...@gmail.com> wrote:
> 
> That explains the OOMs I've been getting in the initial test cycle.
> I'm working with about 50M (small) documents.
> 
> On Thu, Mar 26, 2020 at 7:58 AM Erick Erickson <erickerick...@gmail.com> 
> wrote:
>> 
>> The ngramming is a time/space tradeoff. Typically,
>> if you restrict the wildcards to have three or more
>> “real” characters, performance is fine. One real
>> character (i.e. a*) will be your worst case. I’ve
>> seen requiring two characters in the prefix work well
>> too. It Depends (tm).
>> 
>> Conceptually what happens here is that Lucene has
>> to enumerate all of the terms that start with the prefix
>> and create a ginormous OR clause. The term
>> enumeration will take longer the more terms there are.
>> Things are more efficient than that, but still...
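A rough sketch of the enumeration idea described above, not Lucene's actual implementation: a sorted Python list stands in for the term dictionary, and `terms_with_prefix` is a hypothetical helper. The toy terms are made up for illustration. The point is only that a shorter prefix expands to more terms.

```python
from bisect import bisect_left

def terms_with_prefix(sorted_terms, prefix):
    """Return the slice of a sorted term list whose entries start with `prefix`."""
    lo = bisect_left(sorted_terms, prefix)
    # Any string starting with `prefix` sorts below prefix + the largest code point.
    hi = bisect_left(sorted_terms, prefix + "\U0010ffff")
    return sorted_terms[lo:hi]

# Hypothetical toy term dictionary; a real index can hold millions of terms.
terms = sorted([
    "apple", "apply", "apricot", "banana", "band", "bandit",
    "lova", "lovable", "lovage",
])

for prefix in ("a", "ap", "app", "lova"):
    matches = terms_with_prefix(terms, prefix)
    # Each match would become one clause in the conceptual OR query.
    print(f"{prefix}* expands to {len(matches)} term(s): {matches}")
```

With a realistic corpus the one-character case can expand to a very large number of terms, which is why testing against a tiny index is misleading.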
>> 
>> So make sure you’re testing with a real corpus. Having
>> a test index with just a few terms will be misleading.
>> 
>> Best,
>> Erick
>> 
>>> On Mar 25, 2020, at 9:37 PM, matthew sporleder <msporle...@gmail.com> wrote:
>>> 
>>> Okay, confirmed.
>>> I am getting a more predictable result set after adding an additional 
>>> field:
>>> <fieldType name="string_alpha" class="solr.TextField"
>>> sortMissingLast="true" omitNorms="true">
>>>    <analyzer>
>>>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>         <filter class="solr.LowerCaseFilterFactory" />
>>>         <filter class="solr.PatternReplaceFilterFactory"
>>> pattern="\p{Punct}" replacement=""/>
>>>    </analyzer>
>>> </fieldType>
>>> 
>>> q=slug:what_is_lo*&fl=slug&rows=1000&wt=csv&sort=slug_alpha%20asc
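To see what that analyzer chain does to the sort order, here is a small approximation in Python. It is only a sketch: `sort_key` is a hypothetical helper that mimics keyword tokenizing, lowercasing, and stripping `\p{Punct}`, and it assumes the pattern filter replaces every match. Python's `string.punctuation` lists the same 32 ASCII characters as Java's `\p{Punct}`. The slugs are taken from the listing earlier in the thread.

```python
import string

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def sort_key(slug):
    """Approximate the string_alpha chain: whole value, lowercased, punctuation removed."""
    return slug.lower().translate(STRIP_PUNCT)

slugs = [
    "What_is_Lov_Holtz_known_for",
    "What_is_lova_after_it_harddens",
    "What_is_Lova_Moor's_birthday",
    "What_is_lovable_in_Spanish",
    "What_is_lovage",
]

# Sort by the normalized key, as sort=slug_alpha asc would.
by_alpha = sorted(slugs, key=sort_key)
for slug in by_alpha:
    print(f"{sort_key(slug):32} <- {slug}")
```

Because underscores and apostrophes are gone from the key, "Lov_Holtz" now sorts after the "lova..." entries instead of before them.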
>>> 
>>> So it appears I can skip edge ngram entirely using this method, as
>>> slug:foo* appears to return exactly the same results as fayt:foo, but I
>>> have the cost of the alphaOnly field :)
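The equivalence observed above can be illustrated with a toy model. This is a sketch under assumptions, not how Solr stores anything: `edge_ngrams`, the gram limits, and the three sample terms are all made up for illustration. An exact lookup in the gram index returns the same documents as a prefix scan over the original terms, at the cost of indexing many more entries.

```python
def edge_ngrams(term, min_gram=1, max_gram=25):
    """Return the leading substrings of `term` (hypothetical gram limits)."""
    return [term[:n] for n in range(min_gram, min(len(term), max_gram) + 1)]

# Hypothetical documents: doc id -> single lowercased term.
docs = {1: "lovage", 2: "lovable", 3: "london"}

# Index time: every edge n-gram points back to its documents (more space).
ngram_index = {}
for doc_id, term in docs.items():
    for gram in edge_ngrams(term):
        ngram_index.setdefault(gram, set()).add(doc_id)

def ngram_lookup(prefix):
    """Query time on the n-gram field: one exact term lookup."""
    return ngram_index.get(prefix, set())

def prefix_scan(prefix):
    """Query time without n-grams: check every term for the prefix."""
    return {doc_id for doc_id, term in docs.items() if term.startswith(prefix)}

print(ngram_lookup("lov") == prefix_scan("lov"))
print(len(ngram_index), "indexed grams for", len(docs), "original terms")
```

Which side of the tradeoff wins depends on the real corpus, so benchmarking both is the sensible next step.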
>>> 
>>> I will try to figure out some benchmarks or something to decide how to go.
>>> 
>>> Thanks again for the help so far.
>>> 
>>> 
>>> On Wed, Mar 25, 2020 at 2:39 PM Erick Erickson <erickerick...@gmail.com> 
>>> wrote:
>>>> 
>>>> You’re getting the correct sorted order… The underscore character is 
>>>> confusing you.
>>>> 
>>>> The ASCII code for underscore is 0x5F, which sorts after the uppercase 
>>>> letters but before any lowercase letter.
>>>> 
>>>> See the alphaOnlySort type for a way to remove this, although the output 
>>>> there can also
>>>> be confusing.
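The ordering can be checked directly from the code points. A minimal sketch using two slugs from the listing in this thread, compared after lowercasing (the assumption being that the sort field is lowercased but keeps its punctuation):

```python
# Code points of the characters involved, lowest first.
for ch in "-Z_a":
    print(f"{ch!r} = 0x{ord(ch):02X}")

# Underscore (0x5F) sits between 'Z' (0x5A) and 'a' (0x61), so in a
# lowercased value "lov_" sorts before "lova".
lowered = sorted(s.lower() for s in [
    "What_is_lovage",
    "What_is_Lov_Holtz_known_for",
])
print(lowered)
```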
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>>> On Mar 25, 2020, at 1:30 PM, matthew sporleder <msporle...@gmail.com> 
>>>>> wrote:
>>>>> 
>>>>> What_is_Lov_Holtz_known_for
>>>>> What_is_lova_after_it_harddens
>>>>> What_is_Lova_Moor's_birthday
>>>>> What_is_lovable_in_Spanish
>>>>> What_is_lovage
>>>>> What_is_Lovagny's_population
>>>>> What_is_lovan_for
>>>>> What_is_lovanox
>>>>> What_is_lovarstan_for
>>>>> What_is_Lovasatin
>>>> 
>>
