Re: Unicode Quotes in query parser

2019-01-28 Thread John Ryan
Thanks Michael. The dismax handler does indeed run and escape all non-standard characters before handing it off to the analysers and tokenisers. This fix looks like it belongs in the handler rather than the parser. I wrote a SearchComponent handler to do the same thing at that level and can
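
For illustration only, the idea of rewriting typographic quotes before the query string reaches the parser can be sketched as below. This is a client-side Python sketch, not the SearchComponent mentioned in the message and not a Solr or Lucene API; the names QUOTE_MAP and normalize_quotes are hypothetical, and the set of characters mapped is an assumption that covers only the four common curly quotes.

```python
# Hypothetical helper, not part of Solr or Lucene: maps typographic
# quotes to their ASCII equivalents so that a query parser sees the
# quote characters it treats as phrase operators.
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',  # RIGHT DOUBLE QUOTATION MARK
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
})


def normalize_quotes(query: str) -> str:
    """Return the query with curly quotes replaced by ASCII quotes."""
    return query.translate(QUOTE_MAP)
```

Applied to a string such as a phrase wrapped in U+201C and U+201D, the function returns the same phrase wrapped in ASCII double quotes; all other characters pass through unchanged.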

Re: Unicode Quotes in query parser

2019-01-22 Thread Michael Sokolov
Right - QueryParsers generally do a first pass, parsing incoming Strings using their operator characters to tokenize the input, and only after that do they pass the tokens (or phrases) to an Analyzer. I haven't checked Dismax - not sure how it does its parsing exactly, but I doubt you can just "tur

Re: Unicode Quotes in query parser

2019-01-22 Thread Mikhail Khludnev
My impression is that these quotes are ones which are part of dismax query syntax, i.e. they should be handled before the analysis happens. On Mon, Jan 21, 2019 at 8:09 PM Walter Underwood wrote: > First, check which transforms are already handled by Unicode > normalization. Put this in all of your an

Re: Unicode Quotes in query parser

2019-01-22 Thread John Ryan
Thanks Walter. The testing and research I have done with solr.ICUNormalizer2CharFilterFactory leads me to believe that quotes are not normalised. I attempted to do this with character folding; there are many implementations out there, but none actually seem to work. I’ll look into the draft. Thank
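
The observation that normalisation leaves quotes alone can be checked outside Solr. The sketch below uses Python's standard unicodedata module rather than ICU, so it is only an approximation of what solr.ICUNormalizer2CharFilterFactory does; the helper name survives_normalization is hypothetical. It relies on the fact that the curly quotes have no compatibility decomposition in Unicode, whereas a character such as the fullwidth quotation mark (U+FF02) does.

```python
import unicodedata


def survives_normalization(ch: str, form: str = "NFKC") -> bool:
    """Return True if the given normalization form leaves ch unchanged."""
    return unicodedata.normalize(form, ch) == ch


# Curly quotes have no decomposition mapping, so NFKC keeps them as-is.
CURLY_QUOTES = ["\u201c", "\u201d", "\u2018", "\u2019"]
unchanged = [survives_normalization(q) for q in CURLY_QUOTES]

# By contrast, FULLWIDTH QUOTATION MARK (U+FF02) folds to ASCII '"'.
fullwidth_folded = unicodedata.normalize("NFKC", "\uff02")
```

Under these assumptions every entry in unchanged is True, which is consistent with the report that normalisation alone does not turn typographic quotes into ASCII quotes.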

Re: Unicode Quotes in query parser

2019-01-21 Thread Walter Underwood
First, check which transforms are already handled by Unicode normalization. Put this in all of your analyzer chains: Probably need this in solrconfig.xml: I really cannot think of a reason to use unnormalized Unicode in Solr. That should be in all the sample files. For searc

Re: Unicode Quotes in query parser

2019-01-21 Thread Michael Sokolov
I think this is probably better to discuss on solr-user, or maybe solr-dev, since it is dismax parser you are talking about, which really lives in Solr. However, my 2c - this seems somewhat dubious. Maybe people want to include those in their terms? Also, it leads to a kind of slippery slope: woul