Re: Indexing word with plus sign

Fundera Developer Tue, 23 May 2017 10:17:50 -0700

I have also tried this option, by using a PatternReplaceFilterFactory, like 
this:


<filter class="solr.PatternReplaceFilterFactory" pattern="i\+d" 
replacement="investigación y desarrollo"/>

but it gets processed AFTER the Tokenizer, so when it executes there is no 
longer an "i+d" token, but two "i" and "d" independent tokens.

Is there a way I could make the filter execute before the Tokenizer? I have 
tried to place it first in the Analyzer definition like this:

     <analyzer type="index">
       <charFilter class="solr.MappingCharFilterFactory" 
mapping="mapping-FoldToASCII.txt"/>
       <filter class="solr.PatternReplaceFilterFactory" pattern="i\+d" 
replacement="investigación y desarrollo"/>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true" 
words="stopwords.txt" />
     </analyzer>

But I had no luck.

Are there any other approaches I could be missing?

Thanks!


El 22/05/17 a las 20:50, Rick Leir escribió:

Fundera,
You need a regex which matches a '+' with non-blank chars before and after. It 
should not replace a  '+' preceded by white space, that is important in Solr. 
This is not a perfect solution, but might improve matters for you.
Cheers -- Rick

On May 22, 2017 1:58:21 PM EDT, Fundera Developer 
<funderadevelo...@outlook.com><mailto:funderadevelo...@outlook.com> wrote:


Thank you Zahid and Erik,

I was going to try the CharFilter suggestion, but then I doubted. I see
the indexing process, and how the appearance of 'i+d' would be handled,
but, what happens at query time? If I use the same filter, I could
remove '+' chars that are added by the user to identify compulsory
tokens in the search results, couldn't I?  However, if i do not use the
CharFilter I would not be able to match the 'i+d' search tokens...

Thanks all!



El 22/05/17 a las 16:39, Erick Erickson escribió:

You can also use any of the other tokenizers. WhitespaceTokenizer for
instance. There are a couple that use regular expressions. Etc. See:
https://cwiki.apache.org/confluence/display/solr/Tokenizers

Each one has it's considerations. WhitespaceTokenizer won't, for
instance, separate out punctuation so you might then have to use a
filter to remove those. Regex's can be tricky to get right ;). Etc....

Best,
Erick

On Mon, May 22, 2017 at 5:26 AM, Muhammad Zahid Iqbal
<zahid.iq...@northbaysolutions.net><mailto:zahid.iq...@northbaysolutions.net><mailto:zahid.iq...@northbaysolutions.net><mailto:zahid.iq...@northbaysolutions.net>
wrote:


Hi,


Before applying tokenizer, you can replace your special symbols with
some
phrase to preserve it and after tokenized you can replace it back.

For example:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\+)"
replacement="xxx" />


Thanks,
Zahid iqbal

On Mon, May 22, 2017 at 12:57 AM, Fundera Developer <
funderadevelo...@outlook.com<mailto:funderadevelo...@outlook.com><mailto:funderadevelo...@outlook.com><mailto:funderadevelo...@outlook.com>>
wrote:



Hi all,

I am a bit stuck at a problem that I feel must be easy to solve. In
Spanish it is usual to find the term 'i+d'. We are working with Solr
5.5,
and StandardTokenizer splits 'i' and 'd' and sometimes, as we have in
the
index documents both in Spanish and Catalan, and in Catalan it is
frequent
to find 'i' as a word, when a user searches for 'i+d' it gets Catalan
documents as results.

I have tried to use the SynonymFilter, with something like:

i+d => investigacionYdesarrollo

But it does not seem to change anything.

Is there a way I could set an exception to the Tokenizer so that it
does
not split this word?

Thanks in advance!

Re: Indexing word with plus sign

Reply via email to