
Re: Indexing word with plus sign

2017-05-24 Thread Fundera Developer

Thank you very much Erick!  You're right!

The "Char" part in PatternReplaceCharFilterFactory misled me and I thought it 
was just for single-character replacements. Once I had gone through the 
documentation of CharFilters (my fault...) I realized that I could use the very 
same regex I was using with the PatternReplaceFilterFactory to replace the whole 
"i+d" expression, and nothing more than that, and it is working like a charm now.

Thanks again!!


On 23/05/17 at 19:41, Erick Erickson wrote:

You need to distinguish between

PatternReplaceCharFilterFactory

and

PatternReplaceFilterFactory

The first one is applied to the entire input _before_ tokenization.
The second is applied _after_ tokenization to individual tokens; by
that time "i+d" has already been split, so it's too late.

It's an easy thing to miss.

And at query time you'll have to be careful to keep the + sign from
being interpreted as an operator.
Best,
Erick


Re: Indexing word with plus sign

2017-05-23 Thread Walter Underwood

That was on Solr 1.3, so I’m pretty sure it was the whitespace tokenizer.

The synonym substitution for “+/-” was done in client code and indexing code, 
outside of Solr. We also sanitized queries to remove all query syntax 
characters. 
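
A rough sketch of that kind of client-side normalization, in Python. The mapping table, the function name and the exact set of characters stripped are assumptions for illustration; the thread does not show the original code. The point it illustrates is ordering: special names are translated first, and query syntax characters are removed only afterwards, so "+/-" survives as a searchable word.

```python
import re

# Hypothetical table of names that consist only of syntax characters.
SPECIAL_NAMES = {"+/-": "plusminus"}

# Characters treated as query syntax by the classic Lucene/Solr parser
# (assumed list; adjust to the parser actually in use).
QUERY_SYNTAX = re.compile(r'[+\-!(){}\[\]^"~*?:\\/|&]')

def normalize(text):
    """Translate special names, then drop query syntax characters."""
    for name, word in SPECIAL_NAMES.items():
        text = text.replace(name, word)
    text = QUERY_SYNTAX.sub(" ", text)
    # Collapse the whitespace left behind by the removed characters.
    return " ".join(text.split()).lower()

print(normalize("+/- (live)"))
print(normalize("+rock -jazz"))
```

The same function would have to run on both the indexed text and the user's query for the two sides to match.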

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On May 23, 2017, at 11:21 AM, Fundera Developer 
>  wrote:
> 
> Thanks Walter!!
> 
> For the sake of curiosity, do you remember which Tokenizer you were using in 
> that case?
> 
> Thanks!
> 
> 

Re: Indexing word with plus sign

2017-05-23 Thread Fundera Developer

Thanks Walter!!

For the sake of curiosity, do you remember which Tokenizer you were using in 
that case?

Thanks!


On 23/05/17 at 20:02, Walter Underwood wrote:

Years ago at Netflix, I had to deal with a DVD from a band named “+/-”. I gave 
up and translated that to “plusminus” at index and query time.

http://plusmin.us/ 

Luckily, “.hack//Sign” and other related dot-hack anime matched if I just 
deleted all the punctuation. And everyone searched for “[•REC]²” as “rec2”. The 
middot is supposed to be red. Movie studios are clueless about searchable 
strings.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)





Re: Indexing word with plus sign

2017-05-23 Thread Walter Underwood

Years ago at Netflix, I had to deal with a DVD from a band named “+/-”. I gave 
up and translated that to “plusminus” at index and query time.

http://plusmin.us/ 

Luckily, “.hack//Sign” and other related dot-hack anime matched if I just 
deleted all the punctuation. And everyone searched for “[•REC]²” as “rec2”. The 
middot is supposed to be red. Movie studios are clueless about searchable 
strings.
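
A small Python sketch of the "delete all the punctuation" idea, under the assumption that compatibility normalization (NFKC) is applied first so that the superscript two folds to a plain digit. The function name is hypothetical and this is not the code used at Netflix, only an illustration of how such titles can be reduced to what people actually type.

```python
import unicodedata

def searchable_key(title):
    """Fold compatibility characters, drop punctuation, lowercase."""
    # NFKC turns characters such as the superscript two into plain '2'.
    folded = unicodedata.normalize("NFKC", title)
    # Keep letters, digits and whitespace; everything else is dropped.
    kept = "".join(ch for ch in folded if ch.isalnum() or ch.isspace())
    return kept.lower()

print(searchable_key("[•REC]²"))
print(searchable_key(".hack//Sign"))
```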

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On May 23, 2017, at 10:41 AM, Erick Erickson  wrote:
> 
> You need to distinguish between
> 
> PatternReplaceCharFilterFactory
> 
> and
> 
> PatternReplaceFilterFactory
> 
> The first one is applied to the entire input _before_ tokenization.
> The second is applied _after_ tokenization to individual tokens; by
> that time "i+d" has already been split, so it's too late.
> 
> It's an easy thing to miss.
> 
> And at query time you'll have to be careful to keep the + sign from
> being interpreted as an operator.
> Best,
> Erick
> 

Re: Indexing word with plus sign

2017-05-23 Thread Erick Erickson

You need to distinguish between

PatternReplaceCharFilterFactory

and

PatternReplaceFilterFactory

The first one is applied to the entire input _before_ tokenization.
The second is applied _after_ tokenization to individual tokens; by
that time "i+d" has already been split, so it's too late.

It's an easy thing to miss.
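
The ordering difference can be simulated outside Solr. The Python sketch below is not Solr code: `tokenize` is a crude stand-in for a word tokenizer that treats '+' as a separator, and the replacement token is borrowed from the synonym attempt earlier in the thread. It only shows why the same regex succeeds on the raw text and never fires on individual tokens.

```python
import re

# The same pattern is used in both stages.
I_PLUS_D = re.compile(r"\bi\+d\b", re.IGNORECASE)
REPLACEMENT = "investigacionYdesarrollo"

def tokenize(text):
    # Stand-in tokenizer: runs of word characters become lowercase tokens,
    # so '+' splits "i+d" into "i" and "d".
    return re.findall(r"\w+", text.lower())

def char_filter_then_tokenize(text):
    # Char-filter order: rewrite the raw input, then tokenize.
    return tokenize(I_PLUS_D.sub(REPLACEMENT, text))

def tokenize_then_token_filter(text):
    # Token-filter order: the pattern only ever sees one token at a time.
    return [I_PLUS_D.sub(REPLACEMENT, tok) for tok in tokenize(text)]

text = "Proyectos de I+D en Barcelona"
before = char_filter_then_tokenize(text)
after = tokenize_then_token_filter(text)
print(before)
print(after)
```

In the second list the tokens "i" and "d" are still separate, which is the behaviour reported with PatternReplaceFilterFactory.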

And at query time you'll have to be careful to keep the + sign from
being interpreted as an operator.
Best,
Erick
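
One way to handle the query-time caveat on the client side is to backslash-escape query syntax characters in user-supplied terms before building the request. The Python sketch below assumes the character list of the classic Lucene query parser and uses a hypothetical helper name; it also URL-encodes the result, since a literal '+' in a URL query string is decoded as a space.

```python
import re
from urllib.parse import quote_plus

# Assumed list of characters the classic Lucene/Solr query parser
# treats as syntax; each one is prefixed with a backslash.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/|&])')

def escape_term(term):
    """Backslash-escape query syntax characters in a single term."""
    return _SPECIAL.sub(r"\\\1", term)

escaped = escape_term("i+d")
print(escaped)

# Percent-encode before placing the term in a URL, otherwise the '+'
# would arrive at the server as a space.
url_value = quote_plus(escaped)
print(url_value)
```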

On Tue, May 23, 2017 at 10:12 AM, Fundera Developer
 wrote:
> I have also tried this option, by using a PatternReplaceFilterFactory, like 
> this:
>
>  replacement="investigación y desarrollo"/>
>
> but it gets processed AFTER the Tokenizer, so when it executes there is no 
> longer an "i+d" token, but two "i" and "d" independent tokens.
>
> Is there a way I could make the filter execute before the Tokenizer? I have 
> tried to place it first in the Analyzer definition like this:
>
>  
> mapping="mapping-FoldToASCII.txt"/>
> replacement="investigación y desarrollo"/>
>
>
> words="stopwords.txt" />
>  
>
> But I had no luck.
>
> Are there any other approaches I could be missing?
>
> Thanks!
>
>

Re: Indexing word with plus sign

2017-05-23 Thread Fundera Developer

I have also tried this option, by using a PatternReplaceFilterFactory, like 
this:



but it gets processed AFTER the Tokenizer, so when it executes there is no 
longer an "i+d" token, but two "i" and "d" independent tokens.

Is there a way I could make the filter execute before the Tokenizer? I have 
tried to place it first in the Analyzer definition like this:

 
   
   
   
   
   
 

But I had no luck.

Are there any other approaches I could be missing?

Thanks!


On 22/05/17 at 20:50, Rick Leir wrote:

Fundera,
You need a regex which matches a '+' with non-blank chars before and after. It 
should not replace a '+' preceded by white space, because that form is the 
required-term operator in Solr queries. 
This is not a perfect solution, but might improve matters for you.
Cheers -- Rick


Re: Indexing word with plus sign

2017-05-22 Thread Rick Leir

Fundera,
You need a regex which matches a '+' with non-blank chars before and after. It 
should not replace a '+' preceded by white space, because that form is the 
required-term operator in Solr queries. 
This is not a perfect solution, but might improve matters for you.
Cheers -- Rick
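
A possible reading of that regex, sketched in Python with lookarounds: the '+' must have a non-blank character immediately before and after it. The function name and the replacement word "plus" are placeholders chosen for the example. As noted above it is not perfect; a doubled "++x", for instance, would still have its second '+' rewritten.

```python
import re

# '+' with a non-whitespace character on both sides.
INNER_PLUS = re.compile(r"(?<=\S)\+(?=\S)")

def protect_inner_plus(text, replacement="plus"):
    """Rewrite a '+' inside a word, leaving operator-style '+' alone."""
    return INNER_PLUS.sub(replacement, text)

print(protect_inner_plus("i+d"))
print(protect_inner_plus("+solr i+d"))
print(protect_inner_plus("solr +lucene"))
```

A '+' at the start of the input or after a space is left untouched, so it can still act as the required-term operator.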

On May 22, 2017 1:58:21 PM EDT, Fundera Developer 
 wrote:
>Thank you Zahid and Erick,
>
>I was going to try the CharFilter suggestion, but then I had a doubt. I see
>how the indexing process would handle an occurrence of 'i+d',
>but what happens at query time? If I use the same filter, I could
>remove '+' chars that the user adds to mark compulsory
>tokens in the search, couldn't I?  However, if I do not use the
>CharFilter I would not be able to match the 'i+d' search tokens...
>
>Thanks all!
>
>
>

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com

Re: Indexing word with plus sign

2017-05-22 Thread Fundera Developer

Thank you Zahid and Erick,

I was going to try the CharFilter suggestion, but then I had a doubt. I see how 
the indexing process would handle an occurrence of 'i+d', but what happens at 
query time? If I use the same filter, I could remove '+' chars that the user 
adds to mark compulsory tokens in the search, couldn't I?  However, if I do not 
use the CharFilter I would not be able to match the 'i+d' search tokens...

Thanks all!



On 22/05/17 at 16:39, Erick Erickson wrote:

You can also use any of the other tokenizers. WhitespaceTokenizer for
instance. There are a couple that use regular expressions. Etc. See:
https://cwiki.apache.org/confluence/display/solr/Tokenizers

Each one has its considerations. WhitespaceTokenizer won't, for
instance, separate out punctuation so you might then have to use a
filter to remove it. Regexes can be tricky to get right ;). Etc

Best,
Erick


Re: Indexing word with plus sign

2017-05-22 Thread Erick Erickson

You can also use any of the other tokenizers. WhitespaceTokenizer for
instance. There are a couple that use regular expressions. Etc. See:
https://cwiki.apache.org/confluence/display/solr/Tokenizers

Each one has its considerations. WhitespaceTokenizer won't, for
instance, separate out punctuation so you might then have to use a
filter to remove it. Regexes can be tricky to get right ;). Etc

Best,
Erick
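
The trade-off can be seen in a few lines of Python. This is only a simulation, with `str.split` standing in for a whitespace tokenizer and a hypothetical cleanup step standing in for a punctuation-removing filter: "i+d" stays whole, but so do commas and brackets attached to neighbouring words, which then have to be trimmed.

```python
import string

def whitespace_tokens(text):
    # Split on whitespace only; punctuation stays attached to the words.
    return text.split()

def strip_edge_punctuation(tokens):
    # Trim punctuation from both ends of each token and drop tokens
    # that end up empty; punctuation inside a token is kept.
    cleaned = (tok.strip(string.punctuation) for tok in tokens)
    return [tok for tok in cleaned if tok]

raw = whitespace_tokens("Inversión en i+d, innovación (2017).")
clean = strip_edge_punctuation(raw)
print(raw)
print(clean)
```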

On Mon, May 22, 2017 at 5:26 AM, Muhammad Zahid Iqbal
 wrote:
> Hi,
>
>
> Before applying the tokenizer, you can replace your special symbols with some
> phrase to preserve them, and after tokenizing you can replace it back.
>
> For example:
>  replacement="xxx" />
>
>
> Thanks,
> Zahid iqbal
>

Re: Indexing word with plus sign

2017-05-22 Thread Muhammad Zahid Iqbal

Hi,


Before applying the tokenizer, you can replace your special symbols with some
phrase to preserve them, and after tokenizing you can replace it back.

For example:



Thanks,
Zahid iqbal
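
The placeholder round trip described above might look like this in a Python simulation. The XML from the original message did not survive in the archive, so nothing here reproduces it; the placeholder "xxx" mirrors the surviving `replacement="xxx"` fragment quoted elsewhere in the thread, while the tokenizer and function names are assumptions. A real placeholder should be a string that cannot occur in normal text.

```python
import re

PLACEHOLDER = "xxx"  # would collide with any real word containing "xxx"

def tokenize(text):
    # Stand-in tokenizer that splits on anything that is not a word character.
    return re.findall(r"\w+", text)

def index_tokens(text):
    # 1. Before tokenizing: protect a '+' that sits between word characters.
    protected = re.sub(r"(?<=\w)\+(?=\w)", PLACEHOLDER, text)
    # 2. Tokenize the protected text.
    tokens = tokenize(protected)
    # 3. After tokenizing: put the '+' back and lowercase.
    return [tok.replace(PLACEHOLDER, "+").lower() for tok in tokens]

print(index_tokens("Ayudas a la I+D"))
```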

On Mon, May 22, 2017 at 12:57 AM, Fundera Developer <
funderadevelo...@outlook.com> wrote:

> Hi all,
>
> I am a bit stuck on a problem that I feel must be easy to solve. In
> Spanish it is usual to find the term 'i+d'. We are working with Solr 5.5,
> and StandardTokenizer splits it into 'i' and 'd'. Our index holds documents
> in both Spanish and Catalan, and 'i' is a frequent word in Catalan, so when
> a user searches for 'i+d' the results sometimes include Catalan
> documents.
>
> I have tried to use the SynonymFilter, with something like:
>
> i+d => investigacionYdesarrollo
>
> But it does not seem to change anything.
>
> Is there a way I could set an exception to the Tokenizer so that it does
> not split this word?
>
> Thanks in advance!
>
>

Indexing word with plus sign

2017-05-21 Thread Fundera Developer

Hi all,

I am a bit stuck on a problem that I feel must be easy to solve. In Spanish it 
is usual to find the term 'i+d'. We are working with Solr 5.5, and 
StandardTokenizer splits it into 'i' and 'd'. Our index holds documents in both 
Spanish and Catalan, and 'i' is a frequent word in Catalan, so when a user 
searches for 'i+d' the results sometimes include Catalan documents.

I have tried to use the SynonymFilter, with something like:

i+d => investigacionYdesarrollo

But it does not seem to change anything.

Is there a way I could set an exception to the Tokenizer so that it does not 
split this word?

Thanks in advance!
