Hi,

For your information, this issue also occurs in Solr 7.7.0.

Regards,
Edwin

On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

> Hi,
>
> Should we report this as a bug in Solr?
>
> Regards,
> Edwin
>
> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>
>> Hi Paul,
>>
>> Regarding the regex (\n\s*){2,} that we are using: when we try it on
>> https://regex101.com/, it gives the correct result for all the examples
>> (i.e. each of them ends up with only one <br><br>, not more than that as
>> we are getting in Solr in our earlier examples).
>>
>> Could there be a possibility of a bug in Solr?
>>
>> Regards,
>> Edwin
>>
>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>>
>>> Hi Paul,
>>>
>>> We have tried it with the space preceding the \n, i.e.
>>> <str name="pattern">(\s*\n){2,}</str>, using the following configuration:
>>>
>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>   <str name="fieldName">content</str>
>>>   <str name="pattern">(\s*\n){2,}</str>
>>>   <str name="replacement"><br><br></str>
>>> </processor>
>>>
>>> However, we are getting exactly the same results as in the earlier
>>> Examples 1, 2 and 3.
>>>
>>> As for your point 2, that the data might contain other (non-printing)
>>> characters than \n: we have found no non-printing characters. It is just
>>> a new line with a space. You can refer to the original content in the
>>> same examples below.
>>>
>>> Example 1: The sentence where the above regex pattern works correctly
>>> *Original content in EML file:*
>>> Dear Sir,
>>>
>>>
>>> I am terminating
>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>>> *Index content:* Dear Sir, <br><br>I am terminating
>>>
>>> Example 2: The sentence where the above regex pattern works only partially
>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> *Original content in EML file:*
>>>
>>> *exalted*
>>>
>>> *Psalm 89:17*
>>>
>>>
>>> 3 Choa Chu Kang Avenue 4
>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
>>> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>>
>>> Example 3: The sentence where the above regex pattern works only partially
>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> *Original content in EML file:*
>>>
>>> http://www.concordpri.moe.edu.sg/
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Dec 18, 2018 at 10:07 AM
>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
>>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>
>>> We would appreciate any other ideas or suggestions that you may have.
>>>
>>> Thank you.
>>>
>>> Regards,
>>> Edwin
>>>
>>> On Thu, 7 Feb 2019 at 22:49, <paul.d...@ub.unibe.ch> wrote:
>>>
>>>> Hi Edwin
>>>>
>>>> 1. Sorry, the pattern was wrong; the space should precede the \n,
>>>>    i.e. <str name="pattern">(\s*\n){2,}</str>
>>>> 2. Perhaps the data contains other (non-printing) characters than \n?
>>>>
>>>> From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>>>> Sent: Thursday, 7 February 2019 15:23
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>
>>>> Hi Paul,
>>>>
>>>> We have tried the suggested regex pattern as follows:
>>>>
>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>   <str name="fieldName">content</str>
>>>>   <str name="pattern">(\n\s*){2,}</str>
>>>>   <str name="replacement"><br><br></str>
>>>> </processor>
>>>>
>>>> But we still have exactly the same problem with Examples 1, 2 and 3 below.
>>>>
>>>> Example 1: The sentence where the above regex pattern works correctly
>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>>>> *Index content:* Dear Sir, <br><br>I am terminating
>>>>
>>>> Example 2: The sentence where the above regex pattern works only partially
>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
>>>> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>>>
>>>> Example 3: The sentence where the above regex pattern works only partially
>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
>>>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>>
>>>> Any further suggestions?
>>>>
>>>> Thank you.
>>>>
>>>> Regards,
>>>> Edwin
>>>>
>>>> On Thu, 7 Feb 2019 at 22:20, <paul.d...@ub.unibe.ch> wrote:
>>>>
>>>> > To avoid the «\n+\s*» matching too many \n and then failing on the
>>>> > {2,} part, you could try
>>>> >
>>>> > <str name="pattern">(\n\s*){2,}</str>
>>>> >
>>>> > If you also want to match CRLF, then
>>>> >
>>>> > <str name="pattern">(\r?\n\s*){2,}</str>
>>>> >
>>>> > From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>>>> > Sent: Thursday, 7 February 2019 15:10
>>>> > To: solr-user@lucene.apache.org
>>>> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>> >
>>>> > Hi Paul,
>>>> >
>>>> > Thanks for your reply.
>>>> >
>>>> > When I use this pattern:
>>>> >
>>>> > <processor class="solr.RegexReplaceProcessorFactory">
>>>> >   <str name="fieldName">content</str>
>>>> >   <str name="pattern">(\n+\s*){2,}</str>
>>>> >   <str name="replacement"><br><br></str>
>>>> > </processor>
>>>> >
>>>> > it works for some sentences within the same content and not for others.
>>>> > Please see below for one that works and others that do not (they
>>>> > work only partially):
>>>> >
>>>> > Example 1: The sentence where the above regex pattern works correctly
>>>> > *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>>>> > *Index content:* Dear Sir, <br><br>I am terminating
>>>> >
>>>> > Example 2: The sentence where the above regex pattern works only partially
>>>> > (as you can see, instead of 2 <br>, there are 4 <br>)
>>>> > *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
>>>> > *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>>> >
>>>> > Example 3: The sentence where the above regex pattern works only partially
>>>> > (as you can see, instead of 2 <br>, there are 4 <br>)
>>>> > *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
>>>> > *Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>> >
>>>> > We would appreciate your help in finding out what is wrong.
>>>> >
>>>> > Thank you.
>>>> >
>>>> > Regards,
>>>> > Edwin
>>>> >
>>>> > On Thu, 7 Feb 2019 at 21:24, <paul.d...@ub.unibe.ch> wrote:
>>>> >
>>>> > > You don't say what happens, just that it is not working. I assume
>>>> > > nothing is replaced? Perhaps the pattern should be
>>>> > >
>>>> > > <str name="pattern">"(\n\s*){2,}"</str>
>>>> > >
>>>> > > ??
>>>> > >
>>>> > > From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>>>> > > Sent: Thursday, 7 February 2019 14:08
>>>> > > To: solr-user@lucene.apache.org
>>>> > > Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>> > >
>>>> > > Hi,
>>>> > >
>>>> > > I am trying to use the RegexReplaceProcessorFactory to find runs of
>>>> > > two or more \n with any number of spaces between them (e.g. \n\n,
>>>> > > \n \n, \n \n \n \n) and replace each run with two <br>.
>>>> > >
>>>> > > I use the following regex pattern, and it works when I test it on
>>>> > > regex101.com. But it does not work when I put it inside the
>>>> > > RegexReplaceProcessorFactory as below:
>>>> > >
>>>> > > <updateRequestProcessorChain name="removeCode">
>>>> > >   <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >     <str name="fieldName">content</str>
>>>> > >     <str name="pattern">"(\\n\s*){2,}"</str>
>>>> > >     <str name="replacement"><br><br></str>
>>>> > >   </processor>
>>>> > > </updateRequestProcessorChain>
>>>> > >
>>>> > > To explain my regex pattern further: \s* tells the regex to match
>>>> > > any \n that is followed by whitespace, and {2,} tells it to match
>>>> > > 2 or more occurrences of that group (\n plus trailing whitespace).
>>>> > >
>>>> > > Please kindly let me know what is wrong and how I should do it.
>>>> > >
>>>> > > I am using Solr 7.6.0.
>>>> > >
>>>> > > Regards,
>>>> > > Edwin
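
One way to narrow down whether the data or Solr is at fault is to run the same pattern through java.util.regex directly, since Solr is a Java application. The sketch below is only an illustration under stated assumptions: the class name BlankLineCollapse, the helper collapse, and the sample strings are hypothetical, and the U+00A0 (no-break space) in the second sample is an assumed cause that this thread has not confirmed. By default, Java's \s matches only [ \t\n\x0B\f\r], so a no-break space sitting between newlines would split one run into two matches and leave two <br><br> pairs with a space-like character between them, similar to Examples 2 and 3.

```java
import java.util.regex.Pattern;

final class BlankLineCollapse {
    // Pattern from the thread: two or more newlines, each optionally
    // followed by whitespace as Java defines \s by default.
    static final Pattern ASCII_WS = Pattern.compile("(\\n\\s*){2,}");

    // Hypothetical variant (assumption): also accept Unicode separator
    // characters (category Z), which includes the no-break space U+00A0.
    static final Pattern UNICODE_WS = Pattern.compile("(\\n[\\s\\p{Z}]*){2,}");

    // Replace every run matched by the given pattern with exactly two <br>.
    static String collapse(Pattern pattern, String content) {
        return pattern.matcher(content).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // Sample built from Example 2 with ordinary spaces only.
        String plain = "exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4";
        // Assumed variant of the same text with a no-break space inside the run.
        String withNbsp = "Psalm 89:17 \n\n\u00A0\n\n 3 Choa Chu Kang Avenue 4";

        System.out.println(collapse(ASCII_WS, plain));
        System.out.println(collapse(ASCII_WS, withNbsp));
        System.out.println(collapse(UNICODE_WS, withNbsp));
    }
}
```

If the first pattern produces the doubled <br><br> only on the second sample, inspecting the indexed field for non-ASCII whitespace (for example with a hex dump of the extracted content) would be a reasonable next step. The \p{Z} variant is one possible workaround, not a confirmed fix.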