If the second step is executed first, then you will get the unwanted four <br>.
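
In case it helps to check this outside Solr, here is a minimal sketch using plain java.util.regex. The class name RegexChainCheck, the helper names and the sample strings are made up for illustration, and I am assuming that literalReplacement=true corresponds to Matcher.quoteReplacement; it is not meant to reproduce Solr's internals exactly. The second sample puts a non-breaking space (U+00A0) between the newline pairs, purely as a hypothetical input, since neither [ \t] nor Java's default \s matches that character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexChainCheck {
    // {pattern, replacement} pairs copied from the two processor definitions in this thread.
    static final String[] MULTI = {"([ \\t]*\\r?\\n){2,}", "<br><br>"};
    static final String[] SINGLE = {"([ \\t]*\\r?\\n){1,}", "<br>"};

    // Applies each {pattern, replacement} step in the order given,
    // treating the replacement as literal text (assumed equivalent of literalReplacement=true).
    static String applySteps(String input, String[]... steps) {
        String result = input;
        for (String[] step : steps) {
            result = Pattern.compile(step[0])
                    .matcher(result)
                    .replaceAll(Matcher.quoteReplacement(step[1]));
        }
        return result;
    }

    // Makes newlines, tabs and non-ASCII characters visible when printing.
    static String show(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '\n') sb.append("\\n");
            else if (c == '\r') sb.append("\\r");
            else if (c == '\t') sb.append("\\t");
            else if (c < 0x20 || c > 0x7e) sb.append(String.format("\\u%04x", (int) c));
            else sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Plain spaces between the newline pairs.
        String plain = "Psalm 89:17 \n\n \n\n 3 Choa Chu Kang";
        // Hypothetical variant: a non-breaking space between the newline pairs.
        String withNbsp = "Psalm 89:17 \n\n\u00a0\n\n 3 Choa Chu Kang";
        for (String input : new String[] {plain, withNbsp}) {
            System.out.println("input:        " + show(input));
            System.out.println("multi,single: " + show(applySteps(input, MULTI, SINGLE)));
            System.out.println("single,multi: " + show(applySteps(input, SINGLE, MULTI)));
        }
    }
}
```

Swapping the arguments to applySteps lets you compare both orderings on your own data. If a field comes out with the <br><br> <br><br> shape, passing the raw field value through show() may be a quick way to see whether something other than a plain space or tab sits between the newlines.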

Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10

From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
Sent: Wednesday, 20 February 2019 09:29
To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Hi Jörn,

Do you mean the regex is not correct? We are already using two
RegexReplaceProcessorFactory steps, like the ones shown below. The output
that we get is still the same.

<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">([ \t]*\r?\n){2,}</str>
  <str name="replacement"><br><br></str>
  <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">([ \t]*\r?\n){1,}</str>
  <str name="replacement"><br></str>
  <bool name="literalReplacement">true</bool>
</processor>

Regards,
Edwin

On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jornfra...@gmail.com> wrote:

> Then you need two RegexReplaceProcessorFactory steps
>
> > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> >
> > Hi,
> >
> > Thanks for the reply.
> >
> > Do you know of any online regex tool that works correctly for Java regex?
> > I tried to find some, but they are not working properly.
> >
> > Yes, our plan is to replace more than one \n with <br><br>, and a single \n
> > with a single <br>.
> >
> > Regards,
> > Edwin
> >
> >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfra...@gmail.com> wrote:
> >>
> >> Solr uses Java regex matching, so I doubt there is a bug - it would then
> >> be in the JDK. Try out your solution in an online regex tool that
> >> supports Java regex.
> >>
> >> I believe you want to have 2 regex process factories:
> >> One that deals with single \n and one that deals with more than one \n
> >>
> >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> We have tried with the following pattern ([ \t]*\r?\n){2,} and
> >>> configuration:
> >>>
> >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>   <str name="fieldName">content</str>
> >>>   <str name="pattern">([ \t]*\r?\n){2,}</str>
> >>>   <str name="replacement"><br><br></str>
> >>>   <bool name="literalReplacement">true</bool>
> >>> </processor>
> >>>
> >>> However, the issue is still occurring.
> >>>
> >>> Is anyone else able to help?
> >>>
> >>> Regards,
> >>> Edwin
> >>>
> >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
> >>>>
> >>>> Regards,
> >>>> Edwin
> >>>>
> >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> Should we report this as a bug in Solr?
> >>>>>
> >>>>> Regards,
> >>>>> Edwin
> >>>>>
> >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Paul,
> >>>>>>
> >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
> >>>>>> https://regex101.com/, it gives us the correct result for all
> >>>>>> the examples (i.e. all of them have only <br><br>, and not more than
> >>>>>> that, unlike what we are getting in Solr in our earlier examples).
> >>>>>>
> >>>>>> Could there be a possibility of a bug in Solr?
> >>>>>>
> >>>>>> Regards,
> >>>>>> Edwin
> >>>>>>
> >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi Paul,
> >>>>>>>
> >>>>>>> We have tried it with the space preceding the \n, i.e.
> >>>>>>> <str name="pattern">(\s*\n){2,}</str>, with the following
> >>>>>>> configuration:
> >>>>>>>
> >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>>>   <str name="fieldName">content</str>
> >>>>>>>   <str name="pattern">(\s*\n){2,}</str>
> >>>>>>>   <str name="replacement"><br><br></str>
> >>>>>>> </processor>
> >>>>>>>
> >>>>>>> However, we are getting exactly the same results as in the earlier
> >>>>>>> Examples 1, 2 and 3.
> >>>>>>>
> >>>>>>> As for your point 2, that the data may contain other (non-printing)
> >>>>>>> characters than \n, we have found that there are no non-printing
> >>>>>>> characters. It is just a new line with a space. You can refer to the
> >>>>>>> original content in the same examples below.
> >>>>>>>
> >>>>>>>
> >>>>>>> Example 1: A sentence for which the above regex pattern works
> >>>>>>> correctly
> >>>>>>> *Original content in EML file:*
> >>>>>>> Dear Sir,
> >>>>>>>
> >>>>>>>
> >>>>>>> I am terminating
> >>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
> >>>>>>> *Index content: * Dear Sir, <br><br>I am terminating
> >>>>>>>
> >>>>>>> Example 2: A sentence for which the above regex pattern only partially
> >>>>>>> works (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>> *Original content in EML file:*
> >>>>>>>
> >>>>>>> *exalted*
> >>>>>>>
> >>>>>>> *Psalm 89:17*
> >>>>>>>
> >>>>>>>
> >>>>>>> 3 Choa Chu Kang Avenue 4
> >>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
> >>>>>>> *Index content: *exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
> >>>>>>>
> >>>>>>> Example 3: A sentence for which the above regex pattern only partially
> >>>>>>> works (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>> *Original content in EML file:*
> >>>>>>>
> >>>>>>> http://www.concordpri.moe.edu.sg/
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>>
> >>>>>>>
> >>>>>>> We would appreciate any other ideas or suggestions that you may have.
> >>>>>>>
> >>>>>>> Thank you.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Edwin
> >>>>>>>
> >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <paul.d...@ub.unibe.ch> wrote:
> >>>>>>>>
> >>>>>>>> Hi Edwin
> >>>>>>>>
> >>>>>>>> 1. Sorry, the pattern was wrong, the space should precede the \n,
> >>>>>>>>    i.e. <str name="pattern">(\s*\n){2,}</str>
> >>>>>>>> 2. Perhaps in the data you have other (non-printing) characters
> >>>>>>>>    than \n?
> >>>>>>>>
> >>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> >>>>>>>> Windows 10
> >>>>>>>>
> >>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
> >>>>>>>> Sent: Thursday, 7 February 2019 15:23
> >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>>>>>>
> >>>>>>>> Hi Paul,
> >>>>>>>>
> >>>>>>>> We have tried this suggested regex pattern as follows:
> >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>>>>   <str name="fieldName">content</str>
> >>>>>>>>   <str name="pattern">(\n\s*){2,}</str>
> >>>>>>>>   <str name="replacement"><br><br></str>
> >>>>>>>> </processor>
> >>>>>>>>
> >>>>>>>> But we still have exactly the same problem with Examples 1, 2 and 3
> >>>>>>>> below.
> >>>>>>>>
> >>>>>>>> Example 1: A sentence for which the above regex pattern works
> >>>>>>>> correctly
> >>>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
> >>>>>>>> *Index content: * Dear Sir, <br><br>I am terminating
> >>>>>>>>
> >>>>>>>> Example 2: A sentence for which the above regex pattern only
> >>>>>>>> partially works (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
> >>>>>>>> *Index content: *exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
> >>>>>>>>
> >>>>>>>> Example 3: A sentence for which the above regex pattern only
> >>>>>>>> partially works (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>>>
> >>>>>>>> Any further suggestions?
> >>>>>>>>
> >>>>>>>> Thank you.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Edwin
> >>>>>>>>
> >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.d...@ub.unibe.ch> wrote:
> >>>>>>>>>
> >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the
> >>>>>>>>> {2,} part you could try
> >>>>>>>>>
> >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
> >>>>>>>>>
> >>>>>>>>> If you also want to match CRLF then
> >>>>>>>>>
> >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
> >>>>>>>>>
> >>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> >>>>>>>>> Windows 10
> >>>>>>>>>
> >>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
> >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
> >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>>>>>>>
> >>>>>>>>> Hi Paul,
> >>>>>>>>>
> >>>>>>>>> Thanks for your reply.
> >>>>>>>>>
> >>>>>>>>> When I use this pattern:
> >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>>>>>   <str name="fieldName">content</str>
> >>>>>>>>>   <str name="pattern">(\n+\s*){2,}</str>
> >>>>>>>>>   <str name="replacement"><br><br></str>
> >>>>>>>>> </processor>
> >>>>>>>>>
> >>>>>>>>> It works for some sentences within the same content but not for
> >>>>>>>>> others.
> >>>>>>>>> Please see below for one that is working and another that is not
> >>>>>>>>> working (partially working):
> >>>>>>>>>
> >>>>>>>>> Example 1: A sentence for which the above regex pattern works
> >>>>>>>>> correctly
> >>>>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
> >>>>>>>>> *Index content: * Dear Sir, <br><br>I am terminating
> >>>>>>>>>
> >>>>>>>>> Example 2: A sentence for which the above regex pattern only
> >>>>>>>>> partially works (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
> >>>>>>>>> *Index content: *exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
> >>>>>>>>>
> >>>>>>>>> Example 3: A sentence for which the above regex pattern only
> >>>>>>>>> partially works (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>>>>
> >>>>>>>>> We would appreciate your help in finding out what is wrong.
> >>>>>>>>>
> >>>>>>>>> Thank you.
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Edwin
> >>>>>>>>>
> >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.d...@ub.unibe.ch> wrote:
> >>>>>>>>>>
> >>>>>>>>>> You don’t say what happens, just that it is not working. I assume
> >>>>>>>>>> nothing is replaced? Perhaps the pattern should be
> >>>>>>>>>>
> >>>>>>>>>> <str name="pattern">"(\n\s*){2,}"</str>
> >>>>>>>>>>
> >>>>>>>>>> ??
> >>>>>>>>>>
> >>>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> >>>>>>>>>> Windows 10
> >>>>>>>>>>
> >>>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
> >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
> >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to find two or
> >>>>>>>>>> more \n with any number of spaces between them (e.g. \n\n, \n \n,
> >>>>>>>>>> \n \n \n \n), and replace them with two <br>.
> >>>>>>>>>>
> >>>>>>>>>> I use the following regex pattern and it works when I test it on
> >>>>>>>>>> regex101.com. But it does not work when I put it inside the
> >>>>>>>>>> RegexReplaceProcessorFactory as below:
> >>>>>>>>>>
> >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
> >>>>>>>>>>   <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>>>>>>     <str name="fieldName">content</str>
> >>>>>>>>>>     <str name="pattern">"(\\n\s*){2,}"</str>
> >>>>>>>>>>     <str name="replacement"><br><br></str>
> >>>>>>>>>>   </processor>
> >>>>>>>>>> </updateRequestProcessorChain>
> >>>>>>>>>>
> >>>>>>>>>> To explain my regex pattern further: \s* tells the regex to match
> >>>>>>>>>> any \n that has whitespace after it, and {2,} tells the regex to
> >>>>>>>>>> match 2 or more occurrences of that pattern (\n).
> >>>>>>>>>>
> >>>>>>>>>> Please kindly let me know what is wrong and how I should do it.
> >>>>>>>>>>
> >>>>>>>>>> I am using Solr 7.6.0.
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Edwin