Hi,

For your information, this issue also occurs in the new Solr 7.7.1.

Regards,
Edwin

On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

> Hi,
>
> Has anyone else faced the same problem, or does anyone have other suggestions?
>
> Regards,
> Edwin
>
> On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>
>> Hi Paul,
>>
>> If I execute the second step first, then I only get a single <br> for
>> those with 2 <br>.
>> For those where we originally got 4 <br>, there are now 2 <br> with a
>> space in between.
>>
>> This just changes the 2 <br> into a single <br>, since the second step
>> replaces with a single <br>.
>> It has not solved the underlying problem yet.
>>
>> Regards,
>> Edwin
>>
>> On Wed, 20 Feb 2019 at 16:41, <paul.d...@ub.unibe.ch> wrote:
>>
>>> If the second step is executed first, then you will get the unwanted 4 <br>
>>>
>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>>>
>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>>> Sent: Wednesday, 20 February 2019 09:29
>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>
>>> Hi Jörn,
>>>
>>> Do you mean the regex is not correct?
>>>
>>> We are already using two RegexReplaceProcessorFactory steps, like the ones
>>> shown below. The output that we get is still the same.
>>>
>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>   <str name="fieldName">content</str>
>>>   <str name="pattern">([ \t]*\r?\n){2,}</str>
>>>   <str name="replacement"><br><br></str>
>>>   <bool name="literalReplacement">true</bool>
>>> </processor>
>>>
>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>   <str name="fieldName">content</str>
>>>   <str name="pattern">([ \t]*\r?\n){1,}</str>
>>>   <str name="replacement"><br></str>
>>>   <bool name="literalReplacement">true</bool>
>>> </processor>
>>>
>>> Regards,
>>> Edwin
>>>
>>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jornfra...@gmail.com> wrote:
>>>
>>> > Then you need two RegexReplaceProcessorFactory steps
>>> >
>>> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>>> > >
>>> > > Hi,
>>> > >
>>> > > Thanks for the reply.
>>> > >
>>> > > Do you know of any online regex tool that works correctly for Java regex?
>>> > > I tried to find some, but they are not working properly.
>>> > >
>>> > > Yes, our plan is to replace more than one \n with <br><br>, and a single \n
>>> > > with a single <br>.
>>> > >
>>> > > Regards,
>>> > > Edwin
>>> > >
>>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfra...@gmail.com> wrote:
>>> > >>
>>> > >> Solr uses Java regex matching, so I doubt there is a bug - it would then
>>> > >> be in the JDK. Try out your solution in an online regex tool that
>>> > >> supports Java regex.
>>> > >>
>>> > >> I believe you want to have 2 regex processor factories:
>>> > >> One that deals with a single \n and one that deals with more than one \n
>>> > >>
>>> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>>> > >>>
>>> > >>> Hi,
>>> > >>>
>>> > >>> We have tried the following pattern ([ \t]*\r?\n){2,} and
>>> > >>> configuration:
>>> > >>>
>>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
>>> > >>>   <str name="fieldName">content</str>
>>> > >>>   <str name="pattern">([ \t]*\r?\n){2,}</str>
>>> > >>>   <str name="replacement"><br><br></str>
>>> > >>>   <bool name="literalReplacement">true</bool>
>>> > >>> </processor>
>>> > >>>
>>> > >>> However, the issue is still occurring.
>>> > >>>
>>> > >>> Is anyone else able to help?
>>> > >>>
>>> > >>> Regards,
>>> > >>> Edwin
>>> > >>>
>>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>>> > >>>
>>> > >>>> Hi,
>>> > >>>>
>>> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
>>> > >>>>
>>> > >>>> Regards,
>>> > >>>> Edwin
>>> > >>>>
>>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>>> > >>>>
>>> > >>>>> Hi,
>>> > >>>>>
>>> > >>>>> Should we report this as a bug in Solr?
>>> > >>>>>
>>> > >>>>> Regards,
>>> > >>>>> Edwin
>>> > >>>>>
>>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>>> > >>>>>
>>> > >>>>>> Hi Paul,
>>> > >>>>>>
>>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
>>> > >>>>>> https://regex101.com/, it gives us the correct result for all
>>> > >>>>>> the examples (i.e. all of them only have <br><br>, and not more than
>>> > >>>>>> that, unlike what we are getting in Solr in our earlier examples).
>>> > >>>>>>
>>> > >>>>>> Could there be a possibility of a bug in Solr?
>>> > >>>>>>
>>> > >>>>>> Regards,
>>> > >>>>>> Edwin
>>> > >>>>>>
>>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>>> > >>>>>>
>>> > >>>>>>> Hi Paul,
>>> > >>>>>>>
>>> > >>>>>>> We have tried it with the space preceding the \n, i.e. <str
>>> > >>>>>>> name="pattern">(\s*\n){2,}</str>, with the following configuration:
>>> > >>>>>>>
>>> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> > >>>>>>>   <str name="fieldName">content</str>
>>> > >>>>>>>   <str name="pattern">(\s*\n){2,}</str>
>>> > >>>>>>>   <str name="replacement"><br><br></str>
>>> > >>>>>>> </processor>
>>> > >>>>>>>
>>> > >>>>>>> However, we are getting exactly the same results as in the earlier
>>> > >>>>>>> Examples 1, 2 and 3.
>>> > >>>>>>>
>>> > >>>>>>> As for your point 2, that the data may contain other (non-printing)
>>> > >>>>>>> characters than \n, we have found that there are no non-printing
>>> > >>>>>>> characters. It is just a new line with a space. You can refer to the
>>> > >>>>>>> original content in the same examples below.
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>> Example 1: The sentence that the above regex pattern is working
>>> > >>>>>>> correctly
>>> > >>>>>>> *Original content in EML file:*
>>> > >>>>>>> Dear Sir,
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>> I am terminating
>>> > >>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>>> > >>>>>>> *Index content: * Dear Sir, <br><br>I am terminating
>>> > >>>>>>>
>>> > >>>>>>> Example 2: The sentence that the above regex pattern is partially
>>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > >>>>>>> *Original content in EML file:*
>>> > >>>>>>>
>>> > >>>>>>> *exalted*
>>> > >>>>>>>
>>> > >>>>>>> *Psalm 89:17*
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>> 3 Choa Chu Kang Avenue 4
>>> > >>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
>>> > >>>>>>> *Index content: *exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>> > >>>>>>>
>>> > >>>>>>> Example 3: The sentence that the above regex pattern is partially
>>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > >>>>>>> *Original content in EML file:*
>>> > >>>>>>>
>>> > >>>>>>> http://www.concordpri.moe.edu.sg/
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
>>> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>> Appreciate any other ideas or suggestions that you may have.
>>> > >>>>>>>
>>> > >>>>>>> Thank you.
>>> > >>>>>>>
>>> > >>>>>>> Regards,
>>> > >>>>>>> Edwin
>>> > >>>>>>>
>>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <paul.d...@ub.unibe.ch> wrote:
>>> > >>>>>>>>
>>> > >>>>>>>> Hi Edwin
>>> > >>>>>>>>
>>> > >>>>>>>> 1. Sorry, the pattern was wrong, the space should precede the \n
>>> > >>>>>>>>    i.e. <str name="pattern">(\s*\n){2,}</str>
>>> > >>>>>>>> 2. Perhaps in the data you have other (non-printing) characters
>>> > >>>>>>>>    than \n?
>>> > >>>>>>>>
>>> > >>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>>> > >>>>>>>>
>>> > >>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>>> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
>>> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> > >>>>>>>>
>>> > >>>>>>>> Hi Paul,
>>> > >>>>>>>>
>>> > >>>>>>>> We have tried this suggested regex pattern as follows:
>>> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> > >>>>>>>>   <str name="fieldName">content</str>
>>> > >>>>>>>>   <str name="pattern">(\n\s*){2,}</str>
>>> > >>>>>>>>   <str name="replacement"><br><br></str>
>>> > >>>>>>>> </processor>
>>> > >>>>>>>>
>>> > >>>>>>>> But we still have exactly the same problem as in Examples 1, 2 and 3 below.
>>> > >>>>>>>>
>>> > >>>>>>>> Example 1: The sentence that the above regex pattern is working
>>> > >>>>>>>> correctly
>>> > >>>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>>> > >>>>>>>> *Index content: * Dear Sir, <br><br>I am terminating
>>> > >>>>>>>>
>>> > >>>>>>>> Example 2: The sentence that the above regex pattern is partially
>>> > >>>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > >>>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
>>> > >>>>>>>> *Index content: *exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>> > >>>>>>>>
>>> > >>>>>>>> Example 3: The sentence that the above regex pattern is partially
>>> > >>>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
>>> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>> > >>>>>>>>
>>> > >>>>>>>> Any further suggestion?
>>> > >>>>>>>>
>>> > >>>>>>>> Thank you.
>>> > >>>>>>>>
>>> > >>>>>>>> Regards,
>>> > >>>>>>>> Edwin
>>> > >>>>>>>>
>>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.d...@ub.unibe.ch> wrote:
>>> > >>>>>>>>>
>>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the
>>> > >>>>>>>>> {2,} part you could try
>>> > >>>>>>>>>
>>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>>> > >>>>>>>>>
>>> > >>>>>>>>> If you also want to match CRLF then
>>> > >>>>>>>>>
>>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>>> > >>>>>>>>>
>>> > >>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>>> > >>>>>>>>>
>>> > >>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>>> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
>>> > >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> > >>>>>>>>>
>>> > >>>>>>>>> Hi Paul,
>>> > >>>>>>>>>
>>> > >>>>>>>>> Thanks for your reply.
>>> > >>>>>>>>>
>>> > >>>>>>>>> When I use this pattern:
>>> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> > >>>>>>>>>   <str name="fieldName">content</str>
>>> > >>>>>>>>>   <str name="pattern">(\n+\s*){2,}</str>
>>> > >>>>>>>>>   <str name="replacement"><br><br></str>
>>> > >>>>>>>>> </processor>
>>> > >>>>>>>>>
>>> > >>>>>>>>> It works for some sentences within the same content and does not work
>>> > >>>>>>>>> for other sentences.
>>> > >>>>>>>>> Please see below for one that is working and another
>>> > >>>>>>>>> that is not working (partially working):
>>> > >>>>>>>>>
>>> > >>>>>>>>> Example 1: The sentence that the above regex pattern is working
>>> > >>>>>>>>> correctly
>>> > >>>>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>>> > >>>>>>>>> *Index content: * Dear Sir, <br><br>I am terminating
>>> > >>>>>>>>>
>>> > >>>>>>>>> Example 2: The sentence that the above regex pattern is partially
>>> > >>>>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > >>>>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
>>> > >>>>>>>>> *Index content: *exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>> > >>>>>>>>>
>>> > >>>>>>>>> Example 3: The sentence that the above regex pattern is partially
>>> > >>>>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
>>> > >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>> > >>>>>>>>>
>>> > >>>>>>>>> We would appreciate your help in seeing what is wrong.
>>> > >>>>>>>>>
>>> > >>>>>>>>> Thank you.
>>> > >>>>>>>>>
>>> > >>>>>>>>> Regards,
>>> > >>>>>>>>> Edwin
>>> > >>>>>>>>>
>>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.d...@ub.unibe.ch> wrote:
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> You don’t say what happens, just that it is not working. I assume
>>> > >>>>>>>>>> nothing is replaced?
>>> > >>>>>>>>>> Perhaps the pattern should be
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> <str name="pattern">"(\n\s*){2,}"</str>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> ??
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>>> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>>> > >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Hi,
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to remove two or
>>> > >>>>>>>>>> more \n with any number of spaces between them (e.g. \n\n, \n \n,
>>> > >>>>>>>>>> \n \n \n \n), and replace them with two <br>.
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> I use the following regex pattern and it is working when I test it in
>>> > >>>>>>>>>> regex101.com.
>>> > >>>>>>>>>> But it is not working when I put it inside the
>>> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
>>> > >>>>>>>>>>   <processor class="solr.RegexReplaceProcessorFactory">
>>> > >>>>>>>>>>     <str name="fieldName">content</str>
>>> > >>>>>>>>>>     <str name="pattern">"(\\n\s*){2,}"</str>
>>> > >>>>>>>>>>     <str name="replacement"><br><br></str>
>>> > >>>>>>>>>>   </processor>
>>> > >>>>>>>>>> </updateRequestProcessorChain>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> To explain further about my regex pattern, \s* instructs the regex to
>>> > >>>>>>>>>> match any \n that has whitespace after it, and {2,} instructs the
>>> > >>>>>>>>>> regex to match 2 or more occurrences of such a pattern (\n).
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Please kindly let me know what is wrong and how I should do it.
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> I am using Solr 7.6.0.
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Regards,
>>> > >>>>>>>>>> Edwin
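
Since the thread asks for a way to test Java regex directly, here is a minimal
standalone sketch that runs the pattern ([ \t]*\r?\n){2,} through
java.util.regex outside Solr. The class name RegexCheck and the helper names
are made up for illustration, and Matcher.quoteReplacement is only my
understanding of what literalReplacement=true amounts to. The second input
assumes a non-breaking space (U+00A0) inside the whitespace run; that is a
hypothesis in line with Paul's point 2, not something confirmed from the
actual EML data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone check of the thread's pattern with java.util.regex,
// the regex engine Solr relies on according to Jörn's reply.
final class RegexCheck {
    // Two or more line ends, each optionally preceded by spaces or tabs.
    static final Pattern BLANK_LINES = Pattern.compile("([ \\t]*\\r?\\n){2,}");

    // quoteReplacement makes the replacement a literal string
    // (assumed equivalent of literalReplacement=true in the config).
    static String collapse(String text) {
        return BLANK_LINES.matcher(text).replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    // Makes newlines and non-breaking spaces visible when printing.
    static String visible(String text) {
        return text.replace("\n", "\\n").replace("\u00A0", "[NBSP]");
    }

    public static void main(String[] args) {
        // Example 2 from the thread, assuming only ordinary spaces.
        String plain = "exalted \n \n\n Psalm 89:17 \n\n   \n\n 3 Choa Chu Kang Avenue 4";
        // Same text with a hypothetical U+00A0 in the whitespace run.
        String nbsp = "exalted \n \n\n Psalm 89:17 \n\n \u00A0 \n\n 3 Choa Chu Kang Avenue 4";
        System.out.println(visible(collapse(plain)));
        System.out.println(visible(collapse(nbsp)));
    }
}
```

With only ordinary spaces, the run after "Psalm 89:17" collapses to a single
<br><br>. With the non-breaking space in the middle, the pattern matches on
either side of it and leaves two <br><br> separated by whitespace, which
resembles the index content in Example 2. Dumping the code points of the
indexed field would show whether this is really what is happening.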