Re: Unicode Quotes in query parser

2019-01-28 Thread John Ryan
Thanks Michael,

The dismax handler does indeed run and escape all non-standard characters 
before handing the query off to the analysers and tokenisers. This fix looks 
like it belongs in the handler rather than the parser. I wrote a SearchComponent 
handler to do the same thing at that level and can drop it in as a plugin, 
although it seems like something that should work out of the box.

I think I’ll close it and reconsider the implementation.

John

> On 22 Jan 2019, at 16:20, Michael Sokolov wrote:
> 
> Right - QueryParsers generally do a first pass, parsing incoming Strings 
> using their operator characters to tokenize the input, and only after that do 
> they pass the tokens (or phrases) to an Analyzer. I haven't checked Dismax - 
> not sure how it does its parsing exactly, but I doubt you can just "turn on 
> the right Analyzer" to get it to recognize curly quotes as phrase operators, 
> e.g.
> 
> On Tue, Jan 22, 2019 at 10:39 AM Mikhail Khludnev wrote:
> My impression is that these quotes are part of the dismax query 
> syntax, i.e. they should be handled before the analysis happens. 
> 
> On Mon, Jan 21, 2019 at 8:09 PM Walter Underwood wrote:
> First, check which transforms are already handled by Unicode normalization. 
> Put this in all of your analyzer chains:
> 
> <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
> 
> Probably need this in solrconfig.xml:
> 
>   <lib dir="…" regex=".*\.jar" />
>   <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
> 
> I really cannot think of a reason to use unnormalized Unicode in Solr. That 
> should be in all the sample files.
> 
> For search character matching, yes, all spaces should be normalized. I have 
> too many hacks fixing non-breaking spaces spread around the code. When 
> matching, there is zero use for stuff like ideographic space (U+3000).
> 
> I’m not sure if quotes are normalized. I did some searching around without 
> success. That might come under character folding. There was a draft, now 
> withdrawn, for standard character folding. I’d probably start there for a 
> Unicode folding char filter.
> 
> https://www.unicode.org/reports/tr30/tr30-4.html 
> 
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org 
> http://observer.wunderwood.org/   (my blog)
> 
>> On Jan 21, 2019, at 7:43 AM, Michael Sokolov wrote:
>> 
>> I think this is probably better to discuss on solr-user, or maybe solr-dev, 
>> since it is the dismax parser you are talking about, which really lives in 
>> Solr. However, my 2c - this seems somewhat dubious. Maybe people want to 
>> include those in their terms? Also, it leads to a kind of slippery slope: 
>> would you also want to convert all the various white space characters 
>> (no-break space, thin space, em space, etc.) to vanilla ASCII 32? How about 
>> all the other "operator" characters like brackets?
>> 
>> On Mon, Jan 21, 2019 at 9:50 AM John Ryan wrote:
>> I'm looking to create an issue to add support for Unicode Double Quotes to 
>> the dismax parser. 
>> 
>> I want to replace all types of double quotes with standard ones before they 
>> get stripped 
>> 
>> i.e.
>> “ ” „ “ „ « » ‟ ❝ ❞ ⹂ "
>> 
>> With 
>> "
>> I presume this has been discussed before?
>> 
>> I have a POC here: 
>> https://github.com/apache/lucene-solr/compare/branch_7x...jnyryan:branch_7x 
>> 
>> 
>> Thanks, 
>> 
>> John
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org 
>> 
>> For additional commands, e-mail: dev-h...@lucene.apache.org 
>> 
>> 
> 
> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev



Re: Unicode Quotes in query parser

2019-01-22 Thread Michael Sokolov
Right - QueryParsers generally do a first pass, parsing incoming Strings
using their operator characters to tokenize the input, and only after that
do they pass the tokens (or phrases) to an Analyzer. I haven't checked
Dismax - not sure how it does its parsing exactly, but I doubt you can just
"turn on the right Analyzer" to get it to recognize curly quotes as phrase
operators, e.g.
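
To make that ordering concrete, here is a minimal sketch in Java of a toy first pass. It is not dismax's actual parsing code; the class name ToyFirstPass and its split method are hypothetical, and the only assumption is that the first pass recognises ASCII " as the phrase operator and nothing else.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy parser: illustrates the "operators first, analysis later" ordering. */
final class ToyFirstPass {

    /** Splits on whitespace, treating only the ASCII double quote as a phrase operator. */
    static List<String> split(String query) {
        List<String> parts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inPhrase = false;
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '"') {
                // Operator character: closes the current chunk and toggles phrase mode.
                if (cur.length() > 0) { parts.add(cur.toString()); cur.setLength(0); }
                inPhrase = !inPhrase;
            } else if (Character.isWhitespace(c) && !inPhrase) {
                // Outside a phrase, whitespace separates terms.
                if (cur.length() > 0) { parts.add(cur.toString()); cur.setLength(0); }
            } else {
                // Everything else, curly quotes included, is ordinary term text.
                cur.append(c);
            }
        }
        if (cur.length() > 0) parts.add(cur.toString());
        return parts;
    }
}
```

With ASCII quotes the input yields one phrase chunk ("new york") plus "hotel". With curly quotes it yields three separate terms, two of which still carry a quote character. An Analyzer that runs afterwards could strip those characters from the terms, but by then the phrase grouping is already lost.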

On Tue, Jan 22, 2019 at 10:39 AM Mikhail Khludnev wrote:

> My impression is that these quotes are part of the dismax query
> syntax, i.e. they should be handled before the analysis happens.
>
> On Mon, Jan 21, 2019 at 8:09 PM Walter Underwood 
> wrote:
>
>> First, check which transforms are already handled by Unicode
>> normalization. Put this in all of your analyzer chains:
>>
>> <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
>>
>> Probably need this in solrconfig.xml:
>>
>>   <lib dir="…" regex=".*\.jar" />
>>   <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
>>
>> I really cannot think of a reason to use unnormalized Unicode in Solr.
>> That should be in all the sample files.
>>
>> For search character matching, yes, all spaces should be normalized. I
>> have too many hacks fixing non-breaking spaces spread around the code. When
>> matching, there is zero use for stuff like ideographic space (U+3000).
>>
>> I’m not sure if quotes are normalized. I did some searching around
>> without success. That might come under character folding. There was a
>> draft, now withdrawn, for standard character folding. I’d probably start
>> there for a Unicode folding char filter.
>>
>> https://www.unicode.org/reports/tr30/tr30-4.html
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>> On Jan 21, 2019, at 7:43 AM, Michael Sokolov wrote:
>>
>> I think this is probably better to discuss on solr-user, or maybe
>> solr-dev, since it is the dismax parser you are talking about, which really
>> lives in Solr. However, my 2c - this seems somewhat dubious. Maybe people
>> want to include those in their terms? Also, it leads to a kind of slippery
>> slope: would you also want to convert all the various white space
>> characters (no-break space, thin space, em space, etc.) to vanilla ASCII
>> 32? How about all the other "operator" characters like brackets?
>>
>> On Mon, Jan 21, 2019 at 9:50 AM John Ryan 
>> wrote:
>>
>>> I'm looking to create an issue to add support for Unicode Double Quotes
>>> to the dismax parser.
>>>
>>> I want to replace all types of double quotes with standard ones before
>>> they get stripped
>>>
>>> i.e.
>>> “ ” „ “ „ « » ‟ ❝ ❞ ⹂ "
>>>
>>> With
>>> "
>>> I presume this has been discussed before?
>>>
>>> I have a POC here:
>>> https://github.com/apache/lucene-solr/compare/branch_7x...jnyryan:branch_7x
>>>
>>> Thanks,
>>>
>>> John
>>>
>>>
>>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


Re: Unicode Quotes in query parser

2019-01-22 Thread Mikhail Khludnev
My impression is that these quotes are part of the dismax query
syntax, i.e. they should be handled before the analysis happens.

On Mon, Jan 21, 2019 at 8:09 PM Walter Underwood 
wrote:

> First, check which transforms are already handled by Unicode
> normalization. Put this in all of your analyzer chains:
>
> <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
>
> Probably need this in solrconfig.xml:
>
>   <lib dir="…" regex=".*\.jar" />
>   <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
>
> I really cannot think of a reason to use unnormalized Unicode in Solr.
> That should be in all the sample files.
>
> For search character matching, yes, all spaces should be normalized. I
> have too many hacks fixing non-breaking spaces spread around the code. When
> matching, there is zero use for stuff like ideographic space (U+3000).
>
> I’m not sure if quotes are normalized. I did some searching around without
> success. That might come under character folding. There was a draft, now
> withdrawn, for standard character folding. I’d probably start there for a
> Unicode folding char filter.
>
> https://www.unicode.org/reports/tr30/tr30-4.html
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> On Jan 21, 2019, at 7:43 AM, Michael Sokolov wrote:
>
> I think this is probably better to discuss on solr-user, or maybe
> solr-dev, since it is the dismax parser you are talking about, which really
> lives in Solr. However, my 2c - this seems somewhat dubious. Maybe people
> want to include those in their terms? Also, it leads to a kind of slippery
> slope: would you also want to convert all the various white space
> characters (no-break space, thin space, em space, etc.) to vanilla ASCII
> 32? How about all the other "operator" characters like brackets?
>
> On Mon, Jan 21, 2019 at 9:50 AM John Ryan 
> wrote:
>
>> I'm looking to create an issue to add support for Unicode Double Quotes
>> to the dismax parser.
>>
>> I want to replace all types of double quotes with standard ones before
>> they get stripped
>>
>> i.e.
>> “ ” „ “ „ « » ‟ ❝ ❞ ⹂ "
>>
>> With
>> "
>> I presume this has been discussed before?
>>
>> I have a POC here:
>> https://github.com/apache/lucene-solr/compare/branch_7x...jnyryan:branch_7x
>>
>> Thanks,
>>
>> John
>>
>>
>

-- 
Sincerely yours
Mikhail Khludnev


Re: Unicode Quotes in query parser

2019-01-22 Thread John Ryan
Thanks Walter,

The testing and research I have done with solr.ICUNormalizer2CharFilterFactory 
lead me to believe that quotes are not normalised.

I attempted to do this with character folding. There are many implementations 
out there, but none actually seem to work. 

I’ll look into the draft.

Thanks
--
John  

> On 21 Jan 2019, at 17:09, Walter Underwood wrote:
> 
> First, check which transforms are already handled by Unicode normalization. 
> Put this in all of your analyzer chains:
> 
> <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
> 
> Probably need this in solrconfig.xml:
> 
>   <lib dir="…" regex=".*\.jar" />
>   <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
> 
> I really cannot think of a reason to use unnormalized Unicode in Solr. That 
> should be in all the sample files.
> 
> For search character matching, yes, all spaces should be normalized. I have 
> too many hacks fixing non-breaking spaces spread around the code. When 
> matching, there is zero use for stuff like ideographic space (U+3000).
> 
> I’m not sure if quotes are normalized. I did some searching around without 
> success. That might come under character folding. There was a draft, now 
> withdrawn, for standard character folding. I’d probably start there for a 
> Unicode folding char filter.
> 
> https://www.unicode.org/reports/tr30/tr30-4.html 
> 
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org 
> http://observer.wunderwood.org/  (my blog)
> 
>> On Jan 21, 2019, at 7:43 AM, Michael Sokolov wrote:
>> 
>> I think this is probably better to discuss on solr-user, or maybe solr-dev, 
>> since it is the dismax parser you are talking about, which really lives in 
>> Solr. However, my 2c - this seems somewhat dubious. Maybe people want to 
>> include those in their terms? Also, it leads to a kind of slippery slope: 
>> would you also want to convert all the various white space characters 
>> (no-break space, thin space, em space, etc.) to vanilla ASCII 32? How about 
>> all the other "operator" characters like brackets?
>> 
>> On Mon, Jan 21, 2019 at 9:50 AM John Ryan wrote:
>> I'm looking to create an issue to add support for Unicode Double Quotes to 
>> the dismax parser. 
>> 
>> I want to replace all types of double quotes with standard ones before they 
>> get stripped 
>> 
>> i.e.
>> “ ” „ “ „ « » ‟ ❝ ❞ ⹂ "
>> 
>> With 
>> "
>> I presume this has been discussed before?
>> 
>> I have a POC here: 
>> https://github.com/apache/lucene-solr/compare/branch_7x...jnyryan:branch_7x 
>> 
>> 
>> Thanks, 
>> 
>> John
>> 
>> 
> 



Re: Unicode Quotes in query parser

2019-01-21 Thread Walter Underwood
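First, check which transforms are already handled by Unicode normalization. Put 
this in all of your analyzer chains:

<charFilter class="solr.ICUNormalizer2CharFilterFactory"/>

Probably need this in solrconfig.xml:

  <lib dir="…" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />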

I really cannot think of a reason to use unnormalized Unicode in Solr. That 
should be in all the sample files.

For search character matching, yes, all spaces should be normalized. I have too 
many hacks fixing non-breaking spaces spread around the code. When matching, 
there is zero use for stuff like ideographic space (U+3000).

I’m not sure if quotes are normalized. I did some searching around without 
success. That might come under character folding. There was a draft, now 
withdrawn, for standard character folding. I’d probably start there for a 
Unicode folding char filter.

https://www.unicode.org/reports/tr30/tr30-4.html
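
One quick way to check which transforms normalization already covers is to run sample characters through a normalizer outside Solr. The sketch below is only an approximation: it assumes plain NFKC from the JDK's java.text.Normalizer, which is not the same code path as the ICU char filter in Solr, and the class and method names (NormalizationCheck, nfkc, unchangedByNfkc) are made up for illustration.

```java
import java.text.Normalizer;

/** Hypothetical helper for probing what NFKC does to individual characters. */
final class NormalizationCheck {

    /** Applies NFKC compatibility normalization using the JDK, not ICU4J. */
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    /** Returns true when NFKC leaves the text exactly as it was. */
    static boolean unchangedByNfkc(String s) {
        return nfkc(s).equals(s);
    }
}
```

Under these assumptions, no-break space (U+00A0), thin space (U+2009) and ideographic space (U+3000) all come back as an ordinary space, while curly quotes (U+201C, U+201D) and guillemets (U+00AB, U+00BB) pass through unchanged. Normalization alone therefore takes care of the space variants but not the quotes; those would need a separate folding step.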

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 21, 2019, at 7:43 AM, Michael Sokolov wrote:
> 
> I think this is probably better to discuss on solr-user, or maybe solr-dev, 
> since it is the dismax parser you are talking about, which really lives in 
> Solr. However, my 2c - this seems somewhat dubious. Maybe people want to 
> include those in their terms? Also, it leads to a kind of slippery slope: 
> would you also want to convert all the various white space characters 
> (no-break space, thin space, em space, etc.) to vanilla ASCII 32? How about 
> all the other "operator" characters like brackets?
> 
> On Mon, Jan 21, 2019 at 9:50 AM John Ryan  
> wrote:
> I'm looking to create an issue to add support for Unicode Double Quotes to 
> the dismax parser. 
> 
> I want to replace all types of double quotes with standard ones before they 
> get stripped 
> 
> i.e.
> “ ” „ “ „ « » ‟ ❝ ❞ ⹂ "
> 
> With 
> "
> I presume this has been discussed before?
> 
> I have a POC here: 
> https://github.com/apache/lucene-solr/compare/branch_7x...jnyryan:branch_7x 
> 
> 
> Thanks, 
> 
> John
> 
> 



Re: Unicode Quotes in query parser

2019-01-21 Thread Michael Sokolov
I think this is probably better to discuss on solr-user, or maybe solr-dev,
since it is the dismax parser you are talking about, which really lives in
Solr. However, my 2c - this seems somewhat dubious. Maybe people want to
include those in their terms? Also, it leads to a kind of slippery slope:
would you also want to convert all the various white space characters
(no-break space, thin space, em space, etc.) to vanilla ASCII 32? How about
all the other "operator" characters like brackets?

On Mon, Jan 21, 2019 at 9:50 AM John Ryan 
wrote:

> I'm looking to create an issue to add support for Unicode Double Quotes to
> the dismax parser.
>
> I want to replace all types of double quotes with standard ones before
> they get stripped
>
> i.e.
> “ ” „ “ „ « » ‟ ❝ ❞ ⹂ "
>
> With
> "
> I presume this has been discussed before?
>
> I have a POC here:
> https://github.com/apache/lucene-solr/compare/branch_7x...jnyryan:branch_7x
>
> Thanks,
>
> John
>
>


Unicode Quotes in query parser

2019-01-21 Thread John Ryan
I'm looking to create an issue to add support for Unicode Double Quotes to the 
dismax parser. 

I want to replace all types of double quotes with standard ones before they get 
stripped 

i.e.
“ ” „ “ „ « » ‟ ❝ ❞ ⹂ "

With 
"
I presume this has been discussed before?

I have a POC here: 
https://github.com/apache/lucene-solr/compare/branch_7x...jnyryan:branch_7x
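
For readers who do not want to open the diff, the replacement described above can be sketched roughly as follows. This is not the code in the linked POC; QuoteNormalizer and normalize are hypothetical names, and the character set is simply the list given in this message.

```java
/** Hypothetical helper: maps the listed Unicode double quotes to the ASCII double quote. */
final class QuoteNormalizer {

    // Code points from the list above: U+201C, U+201D, U+201E, U+201F,
    // U+00AB, U+00BB, U+275D, U+275E, U+2E42. All are in the BMP, so
    // iterating char by char is safe here.
    private static final String DOUBLE_QUOTES =
        "\u201C\u201D\u201E\u201F\u00AB\u00BB\u275D\u275E\u2E42";

    /** Returns the query with every listed quote replaced by the ASCII double quote. */
    static String normalize(String query) {
        if (query == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(query.length());
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            sb.append(DOUBLE_QUOTES.indexOf(c) >= 0 ? '"' : c);
        }
        return sb.toString();
    }
}
```

The replacement only helps if it runs on the raw query string before the parser looks for phrase operators, so that a query typed with curly quotes reaches the parser as an ordinary quoted phrase.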

Thanks, 

John