Scott: The Apache spam filters are quite aggressive and sometimes reject e-mails that are formatted any way other than "plain text" so that may have been what happened to your e-mails.
Best,
Erick

On Fri, Oct 23, 2015 at 3:23 AM, Emir Arnautovic <emir.arnauto...@sematext.com> wrote:
> Hi Scott,
> This replacement will only be in index terms and not in the stored field, so you
> are fine - the problem you mention relates to the case where you do the
> replacement in raw text. However, this would be part of the analysis chain (both
> index and query), so it has no effect on presentation (unless you are using the
> index to reconstruct your text - which I assume you don't).
>
> Thanks,
> Emir
>
> On 23.10.2015 03:26, Scott Chu wrote:
>> Hi Emir,
>> Very weird. I replied to your email from home many times yesterday, but
>> the replies never showed up on the solr-user email list. I don't know why, so I
>> am replying again from the office. I hope this one shows up.
>> Thanks for your explanation. I'll look at PatternReplaceCharFilter as a
>> workaround. (As far as I know, character filters process the input stream before
>> the tokenizer, so in a sense the indexed data no longer has the original C1 if I
>> do the replacement.) What I deal with are published news articles, and I don't
>> know how the authors of these articles will feel when they see C1 in their
>> articles become C2, since some terms containing C1 are proper nouns or
>> terminology. I'll talk to them to see if this is OK. Thanks anyway.
>> Scott Chu, scott....@udngroup.com
>> 2015/10/23
>>
>> ----- Original Message -----
>> From: Emir Arnautovic <emir.arnauto...@sematext.com>
>> To: solr-user <solr-user@lucene.apache.org>
>> Date: 2015-10-22, 18:20:38
>> Subject: Re: Is it possible to specify only one-character term
>> synonym for 2-gram tokenizer?
>>
>> Hi Scott,
>> Using PatternReplaceCharFilter is not the same as replacing raw data
>> (replacing raw data is not a proper solution, as it does not solve the issue
>> when searching with the "other" character).
>> This is part of token
>> standardization, no different from lowercasing - it is the standard
>> approach for Latin characters as well:
>> <charFilter class="solr.MappingCharFilterFactory"
>> mapping="mapping-ISOLatin1Accent.txt"/>
>>
>> A quick search for "MappingCharFilterFactory chinese" shows it is used -
>> you should check whether it suits your case.
>>
>> Thanks,
>> Emir
>>
>> On 22.10.2015 11:48, Scott Chu wrote:
>> > Hi solr-user,
>> > Yes, I thought about replacing C1 with C2 in the underlying raw data.
>> > However, it's a huge data set (over 10M news articles), so I gave up
>> > on this strategy earlier. My current temporary solution is going back to
>> > a 1-gram tokenizer (i.e. StandardTokenizer) so I only have to set 1
>> > rule. But it is kind of ugly, especially when applying highlighting, e.g.
>> > for a search on "C1C2" Solr returns a highlight snippet such as
>> > "...<em>C1</em><em>C2</em>...".
>> > Scott Chu, scott....@udngroup.com
>> > 2015/10/22
>> >
>> > ----- Original Message -----
>> > From: Emir Arnautovic <emir.arnauto...@sematext.com>
>> > To: solr-user <solr-user@lucene.apache.org>
>> > Date: 2015-10-22, 17:08:26
>> > Subject: Re: Is it possible to specify only one-character term
>> > synonym for 2-gram tokenizer?
>> >
>> > Hi Scott,
>> > I don't have experience with Chinese, but SynonymFilter works on tokens,
>> > so if CJKTokenizer recognizes C1 and Cm as tokens, it should work. If
>> > not, then you can try configuring PatternReplaceCharFilter to replace C1
>> > with C2 during indexing and searching and get a match.
>> >
>> > Thanks,
>> > Emir
>> >
>> > On 22.10.2015 10:53, Scott Chu wrote:
>> > > Hi solr-user,
>> > > I always use CJKTokenizer on an appropriate amount of Chinese news
>> > > articles.
>> > > Say in Chinese, character C1 has the same meaning as
>> > > character C2 (e.g. 台=臺). Is it possible to add only this line in
>> > > synonym.txt:
>> > > C1,C2 (in a real example: 台, 臺)
>> > > and, by applying CJKTokenizer and SynonymFilter, only have to query
>> > > "C1Cm..." (say Cm is an arbitrary Chinese character) and have Solr
>> > > return documents that match either "C1Cm" or "C2Cm"?
>> > > Scott Chu, scott....@udngroup.com
>> > > 2015/10/22
>> >
>> > --
>> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>> > Solr & Elasticsearch Support * http://sematext.com/
>> >
>> > -----
>> > No virus found in this message.
>> > Checked by AVG - www.avg.com
>> > Version: 2015.0.6172 / Virus Database: 4450/10867 - Release Date: 10/21/15
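
For reference, the char-filter approach suggested in this thread could be sketched as a field type like the one below. This is only a rough sketch under assumptions, not a configuration anyone in the thread posted: the field type name "text_cjk_norm" is hypothetical, the direction of the replacement (臺 to 台) is arbitrary, and StandardTokenizer plus CJKBigramFilter is used here as a stand-in for the CJKTokenizer mentioned above, since which tokenizer factories are available depends on the Solr version.

```xml
<!-- Hypothetical field type; the name "text_cjk_norm" is illustrative only. -->
<fieldType name="text_cjk_norm" class="solr.TextField" positionIncrementGap="100">
  <!-- A single <analyzer> with no type attribute applies at both index and query time. -->
  <analyzer>
    <!-- Runs on the character stream before tokenization: rewrite 臺 to 台. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="臺" replacement="台"/>
    <!-- Assumption: this pair stands in for the CJKTokenizer used in the thread;
         it emits overlapping bigrams of adjacent CJK characters. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the same replacement runs on both the indexed text and the query text, a query containing either character should produce the same bigram terms, while the stored field value (what is returned and displayed) keeps the author's original character, as Emir notes above. Existing documents would need to be reindexed for the change to take effect.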