Re: Solr Reference Guide issue for simplified tokenizers

2018-04-16 Thread Nikolay Khitrin
Yes, the Lucene RegExp javadoc seems a bit complicated, and even the tests do
not cover all syntax variants. But the whole point is: the parser doesn't
mangle any characters, and backslashes serve only to distinguish syntax symbols
from raw characters.
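
To illustrate (a sketch, not taken from the guide; it assumes the pattern
attribute is handed to Lucene's RegExp constructor unchanged):

<!-- splits on literal '[' characters; without the backslash, '[' would start a
     character class and the pattern would not parse -->
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="\["/>

<!-- splits on spaces and on the letters 't', 'r' and 'n', because \t, \r and \n
     are just escaped letters here, not tab, carriage return and newline -->
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>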

The example might not be the best possible one, but I think the reference guide
should be corrected (maybe with an additional note about character escaping),
because it is difficult for end users who are not familiar with the Lucene
codebase to work out the correct solution.


Unfortunately, fine-grained tokenizing control is sometimes the only workaround
for weird issues like LUCENE-7766. For example, as a temporary solution I have
to strip quotes at the tokenizer stage to obtain WDGF offsets on the parts (for
strings like Foo-Bar, with HTMLStripCharFilter before the tokenizer).
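
Very roughly, the kind of chain meant here could look like the following (a
hypothetical sketch, not the actual configuration; the pattern and the filter
options are only illustrative):

<analyzer>
  <!-- strip markup before tokenizing -->
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <!-- split on whitespace and on double quotes, so the quote characters never
       reach WordDelimiterGraphFilter -->
  <tokenizer class="solr.SimplePatternSplitTokenizerFactory"
             pattern="[ \&quot;&#9;&#10;&#13;]+"/>
  <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1"/>
</analyzer>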


2018-04-15 21:08 GMT+03:00 Shawn Heisey:

> On 4/15/2018 5:42 AM, Nikolay Khitrin wrote:
>
>> The given example is <tokenizer class="solr.SimplePatternSplitTokenizerFactory"
>> pattern="[ \t\r\n]+"/> but Lucene's RegExp constructor consumes raw
>> Unicode characters instead of the \t\r\n form, so a correct configuration
>> has to pass the actual whitespace characters (for example as XML character
>> references).
>>
>
> Looks like you're right about that example not working.  I also tried it
> with double backslashes -- something that would be required if the string
> were found in actual java code.  Your suggested replacement DOES work --
> the characters are encoded with XML syntax and passed as ascii/unicode to
> the constructor for the tokenizer.
>
> I cannot make any sense out of the Lucene RegExp javadoc.  I think it
> needs some full string examples to illustrate what it is trying to say.
>
> I don't think this is a good example for this particular tokenizer, even
> if it's changed to your replacement that does work.  For what the example
> is TRYING to do, WhitespaceTokenizerFactory is a better choice.  It will
> match more whitespace characters than spaces, tabs, and newlines.
>
> Here's an example using that tokenizer that will split on semicolon and
> eliminate leading/trailing whitespace from each token:
>
>
> Thanks,
> Shawn
>
>


-- 
Николай Хитрин


Re: Solr Reference Guide issue for simplified tokenizers

2018-04-15 Thread Shawn Heisey

On 4/15/2018 5:42 AM, Nikolay Khitrin wrote:
The given example is <tokenizer class="solr.SimplePatternSplitTokenizerFactory"
pattern="[ \t\r\n]+"/> but Lucene's RegExp constructor consumes raw
Unicode characters instead of the \t\r\n form, so a correct configuration
has to pass the actual whitespace characters (for example as XML character
references).


Looks like you're right about that example not working.  I also tried it 
with double backslashes -- something that would be required if the 
string were found in actual java code.  Your suggested replacement DOES 
work -- the characters are encoded with XML syntax and passed as 
ascii/unicode to the constructor for the tokenizer.


I cannot make any sense out of the Lucene RegExp javadoc.  I think it 
needs some full string examples to illustrate what it is trying to say.
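
For instance, a couple of complete pattern strings (a sketch, assuming the
attribute value reaches the RegExp constructor unchanged) would already go a
long way:

<!-- '.' is the "any character" operator at the top level, but inside a
     character class it is just a dot: this splits on runs of digits and dots -->
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[0-9.]+"/>

<!-- the double-quoted form matches a literal string: this splits on the exact
     two-character sequence of a semicolon followed by a space -->
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="&quot;; &quot;"/>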


I don't think this is a good example for this particular tokenizer, even 
if it's changed to your replacement that does work.  For what the 
example is TRYING to do, WhitespaceTokenizerFactory is a better choice.  
It will match more whitespace characters than spaces, tabs, and newlines.
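
A sketch of that alternative: no pattern is needed at all, and the default
"java" rule matches everything Character.isWhitespace() treats as whitespace:

<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory" rule="java"/>
</analyzer>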


Here's an example using that tokenizer that will split on semicolon and
eliminate leading/trailing whitespace from each token:
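
One way to write it (a sketch; solr.TrimFilterFactory is one option for the
whitespace part):

<analyzer>
  <!-- every ';' is a split point; the semicolon itself is discarded -->
  <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern=";"/>
  <!-- remove leading and trailing whitespace from each token -->
  <filter class="solr.TrimFilterFactory"/>
</analyzer>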

Thanks,
Shawn



Solr Reference Guide issue for simplified tokenizers

2018-04-15 Thread Nikolay Khitrin
I think I've found an issue in the Solr Reference Guide section on the
Simplified Regular Expression Pattern Splitting Tokenizer
(https://lucene.apache.org/solr/guide/7_3/tokenizers.html#simplified-regular-expression-pattern-splitting-tokenizer).

The given example is

<analyzer>
  <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
</analyzer>

but Lucene's RegExp constructor consumes raw Unicode characters instead of the
\t\r\n form, so a correct configuration has to pass the actual whitespace
characters (for example as XML character references).
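
For example (a sketch; decimal XML character references for tab, line feed and
carriage return are shown, hex forms work just as well):

<analyzer>
  <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ &#9;&#10;&#13;]+"/>
</analyzer>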



-- 
Nikolay Khitrin