Re: Is it possible to specify only one-character term synonym for 2-gram tokenizer?

Erick Erickson Fri, 23 Oct 2015 07:22:25 -0700

Scott:

The Apache spam filters are quite aggressive and sometimes reject e-mails
that are formatted any way other than "plain text" so that may have been
what happened to your e-mails.


Best,
Erick

On Fri, Oct 23, 2015 at 3:23 AM, Emir Arnautovic
<emir.arnauto...@sematext.com> wrote:
> Hi Scott,
> This replacement affects only the indexed terms, not the stored field, so you
> are fine. The problem you mention arises only when you do the replacement in
> the raw text. Here it would be part of the analysis chain (both index and
> query), so it has no effect on presentation (unless you are using the index to
> reconstruct your text, which I assume you don't).
>
> Thanks,
> Emir
>
> On 23.10.2015 03:26, Scott Chu wrote:
>>
>> Hi Emir,
>> Very strange: I replied to your email from home many times yesterday, but
>> the replies never showed up on the solr-user mailing list. I don't know why,
>> so I am replying again from the office. I hope this one shows up.
>> Thanks for your explanation. I'll look at PatternReplaceCharFilter as a
>> workaround. (As far as I know, character filters process the input stream
>> before the tokenizer, so in some way the indexed data no longer has the
>> original C1 if I do the replacement.) What I deal with are published news
>> articles, and I don't know how the authors of these articles would feel when
>> they see C1 in their articles become C2, since some terms containing C1 are
>> proper nouns or terminology. I'll talk to them to see if this is OK. Thanks
>> anyway.
>> Scott Chu，scott....@udngroup.com
>> 2015/10/23
>>
>>     ----- Original Message -----
>>     *From: *Emir Arnautovic <mailto:emir.arnauto...@sematext.com>
>>     *To: *solr-user <mailto:solr-user@lucene.apache.org>
>>     *Date: *2015-10-22, 18:20:38
>>     *Subject: *Re: Is it possible to specify only one-character term
>>     synonym for 2-gram tokenizer?
>>
>>     Hi Scott,
>>     Using PatternReplaceCharFilter is not the same as replacing the raw data
>>     (replacing raw data is not a proper solution, as it does not solve the
>>     issue when searching with the "other" character). This is part of token
>>     standardization, no different from lowercasing. It is a standard
>>     approach for Latin characters as well:
>>     <charFilter class="solr.MappingCharFilterFactory"
>>     mapping="mapping-ISOLatin1Accent.txt"/>
>>
>>     A quick search for "MappingCharFilterFactory chinese" shows it is used;
>>     you should check whether it suits your case.
>>
>>     Thanks,
>>     Emir
>>
>>     On 22.10.2015 11:48, Scott Chu wrote:
>>     > Hi solr-user,
>>     > Yes, I thought about replacing C1 with C2 in the underlying raw data.
>>     > However, it's a huge data set (over 10M news articles), so I gave up
>>     > on this strategy earlier. My current temporary solution is to go back
>>     > to a 1-gram tokenizer (i.e. StandardTokenizer) so that I only need to
>>     > set one rule. But it is kind of ugly, especially when applying
>>     > highlighting, e.g. a search for "C1C2" returns a highlight snippet
>>     > such as "...<em>C1</em><em>C2</em>...".
>>     > Scott Chu，scott....@udngroup.com
>>     > 2015/10/22
>>     >
>>     > ----- Original Message -----
>>     > *From: *Emir Arnautovic <mailto:emir.arnauto...@sematext.com>
>>     > *To: *solr-user <mailto:solr-user@lucene.apache.org>
>>     > *Date: *2015-10-22, 17:08:26
>>     > *Subject: *Re: Is it possible to specify only one-character term
>>     > synonym for 2-gram tokenizer?
>>     >
>>     > Hi Scott,
>>     > I don't have experience with Chinese, but SynonymFilter works on
>>     > tokens, so if CJKTokenizer recognizes C1 and Cm as tokens, it should
>>     > work. If not, then you can try configuring PatternReplaceCharFilter
>>     > to replace C1 with C2 during indexing and searching and get a match.
>>     >
>>     > Thanks,
>>     > Emir
>>     >
>>     > On 22.10.2015 10:53, Scott Chu wrote:
>>     > > Hi solr-user,
>>     > > I always use CJKTokenizer on an appropriate amount of Chinese news
>>     > > articles. Say that in Chinese, character C1 has the same meaning as
>>     > > character C2 (e.g. 台=臺). Is it possible to add only this line to
>>     > > synonym.txt:
>>     > > C1,C2 (in a real example: 台, 臺)
>>     > > so that, by applying CJKTokenizer and SynonymFilter, I only have to
>>     > > query "C1Cm..." (say Cm is an arbitrary Chinese character) and Solr
>>     > > will return documents that match either "C1Cm" or "C2Cm"?
>>     > > Scott Chu，scott....@udngroup.com
>>     > > 2015/10/22
>>     > >
>>     >
>>     > --
>>     > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>>     > Solr & Elasticsearch Support * http://sematext.com/
>>     >
>
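
For readers following the thread, here is a minimal sketch of the two approaches discussed above. It assumes a tokenizer that emits only overlapping 2-grams, in which case a single-character synonym rule has no whole token to match, while a character mapping applied before tokenization (the idea behind PatternReplaceCharFilter or MappingCharFilterFactory) lets either spelling produce the same terms. This is plain Python, not Solr code: map_chars, bigrams, apply_synonyms and analyze are hypothetical stand-ins, and bigrams is only a simplification of what CJKTokenizer does.

```python
# Illustrative stand-ins for an analysis chain; none of these names are Solr APIs.

# Assumed mapping: fold the variant character onto one canonical form.
CHAR_MAP = {"台": "臺"}

# Assumed token-level synonym rule, as it might come from "台, 臺" in synonym.txt.
SYNONYMS = {"台": "臺"}


def map_chars(text):
    """Character-filter step: rewrite characters before tokenization."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)


def bigrams(text):
    """Simplified 2-gram tokenizer: overlapping pairs of characters."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def apply_synonyms(tokens):
    """Token-filter step: a rule applies only when a whole token equals its key."""
    return [SYNONYMS.get(tok, tok) for tok in tokens]


def analyze(text):
    """Same chain at index and query time: character mapping, then bigrams."""
    return bigrams(map_chars(text))


stored_value = "台北市"  # what a stored field would keep
synonym_only = apply_synonyms(bigrams(stored_value))
index_terms = analyze(stored_value)
query_terms = analyze("臺北")

print(synonym_only)  # the bigrams pass through the synonym step unchanged
print(index_terms)   # terms after character mapping
print(query_terms)   # query terms built with the same chain
print(stored_value)  # the original text is not modified by analysis
```

Because the same mapping runs at index time and at query time, a query typed with either character ends up as the same term, and the stored text (what is shown to readers) keeps the author's original character.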
