Give it a try. The first time I tried ngramming I was surprised: the actual increase in my index size was much less than I feared.
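For reference, a minimal sketch of what such an n-gram analysis chain might look like (the field-type name and gram sizes here are illustrative, not taken from this thread):

```xml
<!-- Sketch of a substring-search field type using n-grams at index time.
     minGramSize should match the shortest substring you need to find;
     maxGramSize is the main driver of index growth. -->
<fieldType name="text_ngram" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- No n-gramming at query time: the query term is matched
         against the grams produced at index time. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Keeping the gram-size range tight is usually what keeps the index growth smaller than feared.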
Best
Erick

On Wed, Feb 8, 2012 at 11:41 AM, Robert Brown <r...@intelcompute.com> wrote:
> Attempting to reproduce legacy behaviour (I know!) of simple SQL
> substring searching, with and without phrases.
>
> I feel simply NGram'ing 4m CVs may be pushing it?
>
> ---
> IntelCompute
> Web Design & Local Online Marketing
> http://www.intelcompute.com
>
> On Wed, 8 Feb 2012 11:27:24 -0500, Erick Erickson
> <erickerick...@gmail.com> wrote:
>> You'll probably have to index them in separate fields to
>> get what you want. The question is always whether it's
>> worth it: is the use case really well served by having a
>> variant that keeps dots and the like? But that's always more
>> a question for your product manager...
>>
>> Best
>> Erick
>>
>> On Wed, Feb 8, 2012 at 9:23 AM, Robert Brown <r...@intelcompute.com> wrote:
>>> Thanks Erick,
>>>
>>> I didn't get confused with multiple tokens vs multiValued :)
>>>
>>> Before I go ahead and re-index 4m docs (and believe me, I'm using the
>>> analysis page like a madman!), what do I need to configure to have the
>>> following indexed both with and without the dots?
>>>
>>> .net
>>> sales manager.
>>> £12.50
>>>
>>> Currently...
>>>
>>> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>> <filter class="solr.WordDelimiterFilterFactory"
>>>         generateWordParts="1"
>>>         generateNumberParts="1"
>>>         catenateWords="1"
>>>         catenateNumbers="1"
>>>         catenateAll="1"
>>>         splitOnCaseChange="1"
>>>         splitOnNumerics="1"
>>>         types="wdftypes.txt"
>>> />
>>>
>>> with nothing specific in wdftypes.txt for full stops.
>>>
>>> Should there also be any difference when quoting my searches?
>>>
>>> The analysis page seems to just drop the quotes, but surely actual
>>> calls don't do this?
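One likely answer to the question above (a sketch, not verified against this exact schema): WordDelimiterFilterFactory's `preserveOriginal` attribute emits the unsplit token alongside the generated parts, so ".net" would be indexed both as-is and as "net":

```xml
<!-- Same filter as in the quoted config, with preserveOriginal added
     so the original token survives next to the split/catenated parts. -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="1"
        catenateNumbers="1"
        catenateAll="1"
        splitOnCaseChange="1"
        splitOnNumerics="1"
        preserveOriginal="1"
        types="wdftypes.txt"/>
```

With `preserveOriginal="1"` the analysis page should show ".net" surviving as a token in addition to "net"; existing documents still need re-indexing to pick it up.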
>>>
>>> ---
>>> IntelCompute
>>> Web Design & Local Online Marketing
>>> http://www.intelcompute.com
>>>
>>> On Wed, 8 Feb 2012 07:38:42 -0500, Erick Erickson
>>> <erickerick...@gmail.com> wrote:
>>>> Yes, WDF creates multiple tokens. But that has
>>>> nothing to do with the multiValued suggestion.
>>>>
>>>> You can get exactly what you want by:
>>>> 1> setting multiValued="true" in your schema file and re-indexing. Say
>>>>    positionIncrementGap is set to 100.
>>>> 2> adding the field once per sentence when you index, so your doc
>>>>    looks something like:
>>>>    <doc>
>>>>      <field name="sentences">i am a sales-manager in here</field>
>>>>      <field name="sentences">using asp.net and .net daily</field>
>>>>      .....
>>>>    </doc>
>>>> 3> searching like "sales manager"~100
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Wed, Feb 8, 2012 at 3:05 AM, Rob Brown <r...@intelcompute.com> wrote:
>>>>> Apologies if things were a little vague.
>>>>>
>>>>> Given the example snippet to index (numbered to show the searches
>>>>> that need to match)...
>>>>>
>>>>> 1: i am a sales-manager in here
>>>>> 2: using asp.net and .net daily
>>>>> 3: working in design.
>>>>> 4: using something called sage 200. and i'm fluent
>>>>> 5: german sausages.
>>>>> 6: busy A&E dept earning £10,000 annually
>>>>>
>>>>> ... all with newlines in place.
>>>>>
>>>>> These should match...
>>>>>
>>>>> 1. sales
>>>>> 1. "sales manager"
>>>>> 1. sales-manager
>>>>> 1. "sales-manager"
>>>>> 2. .net
>>>>> 2. asp.net
>>>>> 3. design
>>>>> 4. sage 200
>>>>> 6. A&E
>>>>> 6. £10,000
>>>>>
>>>>> But "fluent german" should NOT match across 4 + 5, since there's a
>>>>> newline between them when indexed, but not when searched.
>>>>>
>>>>> Don't the filters (WDF in this case) create multiple tokens? So
>>>>> splitting on the period in "asp.net" would create tokens for all of
>>>>> "asp", "asp.", "asp.net", ".net", "net".
>>>>>
>>>>> Cheers,
>>>>> Rob
>>>>>
>>>>> --
>>>>> IntelCompute
>>>>> Web Design and Online Marketing
>>>>> http://www.intelcompute.com
>>>>>
>>>>> -----Original Message-----
>>>>> From: Chris Hostetter <hossman_luc...@fucit.org>
>>>>> Reply-to: solr-user@lucene.apache.org
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: Which Tokeniser (and/or filter)
>>>>> Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)
>>>>>
>>>>> : This all seems a bit too much work for such a real-world scenario?
>>>>>
>>>>> You haven't really told us what your scenario is.
>>>>>
>>>>> You said you want to split tokens on whitespace, full stop (aka
>>>>> period) and comma only, but then in response to some suggestions you
>>>>> added comments about other things that you never mentioned
>>>>> previously...
>>>>>
>>>>> 1) evidently you don't want the "." in foo.net to cause a split in
>>>>>    tokens?
>>>>> 2) evidently you not only want token splits on newlines, but also
>>>>>    position gaps to prevent phrases matching across newlines.
>>>>>
>>>>> ...these are kind of important details that affect the suggestions
>>>>> people might give you.
>>>>>
>>>>> Can you please provide some concrete examples of the types of data
>>>>> you have, the types of queries you want them to match, and the types
>>>>> of queries you *don't* want to match?
>>>>>
>>>>> -Hoss
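Erick's multiValued suggestion earlier in the thread can be sketched in schema terms as follows (the type and field names here are assumed, not from the thread):

```xml
<!-- positionIncrementGap inserts 100 phantom term positions between
     successive values of a multiValued field, so phrases cannot match
     across a value (i.e. sentence/newline) boundary. -->
<fieldType name="text_sentences" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="sentences" type="text_sentences" indexed="true"
       stored="true" multiValued="true"/>
```

A phrase query whose slop stays below the gap, e.g. "sales manager"~99, can then match words within one value but not a phrase straddling two values.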