Re: Solr Reference Guide issue for simplified tokenizers

Nikolay Khitrin Mon, 16 Apr 2018 06:43:11 -0700

Yes, the Lucene RegExp javadoc seems a bit complicated, and even the tests do
not cover all syntax variants. But the whole point is this: the parser doesn't
mangle any characters, and it uses backslashes only to distinguish syntax
symbols from raw characters.
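
To illustrate the raw-character point, here is a minimal Python sketch (not Solr or Lucene code; the names GUIDE_XML, FIXED_XML and pattern_chars are made up for this illustration). It only shows which characters an XML parser hands over for each form of the pattern attribute, assuming Solr passes the parsed attribute value to the tokenizer factory unchanged:

```python
import xml.etree.ElementTree as ET

# Pattern as written in the reference guide example: backslash sequences.
GUIDE_XML = r'<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>'
# Pattern written with XML character references instead.
FIXED_XML = '<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ &#x9;&#xA;&#xD;]+"/>'


def pattern_chars(xml_text):
    """Return the pattern attribute as the list of characters the XML parser yields."""
    return list(ET.fromstring(xml_text).get("pattern"))


guide = pattern_chars(GUIDE_XML)
fixed = pattern_chars(FIXED_XML)

# Print code points so that invisible characters are easy to tell apart.
print([hex(ord(c)) for c in guide])  # contains 0x5c (backslash), no 0x9/0xa/0xd
print([hex(ord(c)) for c in fixed])  # contains real 0x9, 0xa, 0xd
```

With the backslash form the string still holds literal backslashes followed by the letters t, r and n, so, if I read RegExp correctly, those end up as plain letters rather than tab, CR and LF. With character references the real control characters arrive in the pattern.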


The example might not be the best possible one, but I think the reference
guide should be corrected (maybe with an additional note about character
escaping), because it is difficult for end users who are not familiar with the
Lucene codebase to find the correct solution.


Unfortunately, fine-grained tokenizing control is sometimes the only
workaround for weird issues like LUCENE-7766.
For example, as a temporary solution I have to strip quotes at the tokenizer
stage to obtain WDGF offsets on parts (for strings like &quot;Foo-Bar&quot;
with HTMLStripCharFilter before the tokenizer).


2018-04-15 21:08 GMT+03:00 Shawn Heisey <apa...@elyograg.org>:

> On 4/15/2018 5:42 AM, Nikolay Khitrin wrote:
>
>> Given example is <analyzer> <tokenizer 
>> class="solr.SimplePatternSplitTokenizerFactory"
>> pattern="[ \t\r\n]+"/></analyzer> but Lucene's RegExp constructor consumes
>> raw unicode characters instead of \t\r\n form, so correct configuration is
>> <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[
>> &#x9;&#xA;&#xD;]+"/>
>>
>
> Looks like you're right about that example not working.  I also tried it
> with double backslashes -- something that would be required if the string
> were found in actual java code.  Your suggested replacement DOES work --
> the characters are encoded with XML syntax and passed as ascii/unicode to
> the constructor for the tokenizer.
>
> I cannot make any sense out of the Lucene RegExp javadoc.  I think it
> needs some full string examples to illustrate what it is trying to say.
>
> I don't think this is a good example for this particular tokenizer, even
> if it's changed to your replacement that does work.  For what the example
> is TRYING to do, WhitespaceTokenizerFactory is a better choice.  It will
> match more whitespace characters than spaces, tabs, and newlines.
>
> Here's an example using that tokenizer that will split on semicolon and
> eliminate leading/trailing whitespace from each token:
>
> <analyzer>
>   <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern=";"/>
>   <filter class="solr.TrimFilterFactory"/>
> </analyzer>
>
> Thanks,
> Shawn
>
>


-- 
Николай Хитрин
