Hi Paul,

Sorry, I realized there is an extra ']' in the pattern provided, which is why there are so many <br> in the output.

With the extra ']' removed, as in the configuration below, the output is exactly the same as before (the previous index result).

<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">[ \t\x0b\f]*\r?\n</str>
  <str name="replacement"><br></str>
  <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">(<br>[ \t\x0b\f]*){3,}</str>
  <str name="replacement"><br><br></str>
  <bool name="literalReplacement">true</bool>
</processor>

Regards,
Edwin

On Thu, 7 Mar 2019 at 22:51, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

Hi Paul,

Thanks for the reply.

For the 2nd pattern, if we use <str name="pattern">(<br>[ \t\x0b\f]]*){3,}</str>, as in the configuration below:

<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">[ \t\x0b\f]*\r?\n</str>
  <str name="replacement"><br></str>
  <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">(<br>[ \t\x0b\f]]*){3,}</str>
  <str name="replacement"><br><br></str>
  <bool name="literalReplacement">true</bool>
</processor>

it is not able to reduce runs of 3 or more <br> to 2 <br>.

We end up with many <br> in the output, as in the example below:

http://www.concorded.com/<br><br>
<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>
On Tue, Dec 18, 2018

Regards,
Edwin

On Thu, 7 Mar 2019 at 20:44, <paul.d...@ub.unibe.ch> wrote:

Hi Edwin

I can't understand why the pattern is not working and where the spaces between the <br> are coming from. It should, however, be possible to allow for spaces between the <br> in the second match pattern, i.e. 2nd pattern

<str name="pattern">(<br>[ \t\x0b\f]]*){3,}</str>

/Paul

From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Sent: Wednesday, 6 March 2019 16:28
To: solr-user@lucene.apache.org
Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Hi Paul,

I have tried with the first match pattern set to <str name="pattern">[ \t\x0b\f]*\r?\n</str>, as in the configuration below:

<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">[ \t\x0b\f]*\r?\n</str>
  <str name="replacement"><br></str>
  <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">(<br>){3,}</str>
  <str name="replacement"><br><br></str>
  <bool name="literalReplacement">true</bool>
</processor>

However, the result is still the same as before (previous index results), with the 4 <br>.

Regards,
Edwin

On Wed, 6 Mar 2019 at 18:23, <paul.d...@ub.unibe.ch> wrote:

Hi Edwin

You are correct re the 2nd pattern - my bad. Looking at the 4 <br>, it's actually the sequence «<br><br> <br><br>»? So perhaps the first match pattern could be <str name="pattern">[ \t\x0b\f]*\r?\n</str>

i.e. [space tab vertical-tab formfeed]

Regards,
Paul

From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Sent: Wednesday, 6 March 2019 07:44
To: solr-user@lucene.apache.org
Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Hi Paul,

I have modified the second pattern to be (<br>){3,} instead of (<br><br>){3,}. The pattern (<br><br>){3,} actually looks for 6 or more <br> instead of 3 <br>, because <br> appears twice inside the group. That is why there are more <br> in the result: runs of fewer than 6 <br> are not replaced, so we ended up with up to 5 <br> in the index.

Modified configuration:

<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">(<br>){3,}</str>
  <str name="replacement"><br><br></str>
  <bool name="literalReplacement">true</bool>
</processor>

This brings us back to the result of the previous index content, meaning the issue of having the 4 <br> is still there.

Regards,
Edwin

On Wed, 6 Mar 2019 at 11:37, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

Hi Paul,

Further to my previous email, in which there was an extra "}" in the configuration, I have changed to the configuration below, based on your suggestion.

<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">[ \t]*\r?\n</str>
  <str name="replacement"><br></str>
  <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">(<br><br>){3,}</str>
  <str name="replacement"><br><br></str>
  <bool name="literalReplacement">true</bool>
</processor>

However, the result that I get still has more than 2 <br>. In fact, the result has become worse, as you can see from the comparison below.

Example 1: The sentence where the regex pattern used to work correctly. With the latest pattern, it has changed from 2 <br> to 5 <br>, which is wrong.
*Original content in EML file:*
Dear Sir,


I am terminating
*Original content:* Dear Sir, \n\n \n \n\n I am terminating
*Previous Index content:* Dear Sir, <br><br>I am terminating
*Current Index content:* Dear Sir, <br><br><br><br><br> I am terminating

Example 2: The sentence where the above regex pattern is partially working (as you can see, instead of 2 <br>, there are 4 <br>)
*Original content in EML file:*

*exalted*

*Psalm 89:17*


3 Choa Chu Kang Avenue 4
*Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
*Previous Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
*Current Index content:* <br><br><br> Psalm 89:17<br><br> <br><br> 3 Choa Chu Kang Avenue 4, Singapore

Example 3: The sentence where the above regex pattern is partially working (as you can see, instead of 2 <br>, there are 4 <br>). With the latest configuration, there are now 5 <br>.
*Original content in EML file:*

http://www.concorded.com/








On Tue, Dec 18, 2018 at 10:07 AM
*Original content:* http://www.concorded.com/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
*Previous Index content:* http://www.concorded.com/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
*Current Index content:* http://www.concorded.com/<br><br> <br><br><br> On Tue, Dec 18, 2018 at 10:07 AM

Regards,
Edwin

On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

Hi Paul,

Thank you for the reply.

I have tried to add the following configuration according to your suggestion:

<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">[ \t]*\r?\n}</str>
  <str name="replacement"><br></str>
  <bool name="literalReplacement">true</bool>
</processor>

<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">(<br><br>){3,}</str>
  <str name="replacement"><br><br></str>
  <bool name="literalReplacement">true</bool>
</processor>

However, none of the \n is being removed this time round.
Is the order and/or the pattern correct?

Regards,
Edwin

On Tue, 5 Mar 2019 at 19:54, <paul.d...@ub.unibe.ch> wrote:

Hi Edwin

Try for the first pattern/replacement

<str name="pattern">[ \t]*\r?\n</str>
<str name="replacement"><br></str>

Now all line endings and preceding whitespace characters should be changed to '<br>'.

The second pattern replacement should replace 3 or more '<br>' sequences with 2 '<br>' sequences:

<str name="pattern">(<br><br>){3,}</str>
<str name="replacement"><br><br></str>

Hope this approach works. Sorry for not replying earlier and best regards,

Paul

From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Sent: Tuesday, 5 March 2019 03:35
To: solr-user@lucene.apache.org
Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Hi,

For your info, this issue is occurring in the new Solr 7.7.1 as well.

Regards,
Edwin

On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

Hi,

Does anyone else have other suggestions, or has anyone faced the same problem?

Regards,
Edwin

On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

Hi Paul,

If I execute the second step first, then I only get a single <br> for those with 2 <br>.
For those where we originally got 4 <br>, there will be 2 <br> with a space in between.

This just changes the 2 <br> to a single <br>, since the second step replaces with a single <br>.
But it has not solved the underlying problem yet.

Regards,
Edwin

On Wed, 20 Feb 2019 at 16:41, <paul.d...@ub.unibe.ch> wrote:

If the second step is executed first, then you will get the unwanted 4 <br>

From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Sent: Wednesday, 20 February 2019 09:29
To: solr-user@lucene.apache.org
Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Hi Jörn,

Do you mean the regex is not correct?

We are already using two RegexReplaceProcessorFactory steps, like the ones shown below. The output that we get is still the same.

<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">([ \t]*\r?\n){2,}</str>
  <str name="replacement"><br><br></str>
  <bool name="literalReplacement">true</bool>
</processor>

<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">([ \t]*\r?\n){1,}</str>
  <str name="replacement"><br></str>
  <bool name="literalReplacement">true</bool>
</processor>

Regards,
Edwin

On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jornfra...@gmail.com> wrote:

Then you need two RegexReplaceProcessorFactory steps

On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

Hi,

Thanks for the reply.

Do you know of any online regex tool that works correctly for Java regex?
I tried to find some, but they are not working properly.

Yes, our plan is to replace more than one \n with <br><br>, and a single \n with a single <br>.

Regards,
Edwin

On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfra...@gmail.com> wrote:

Solr uses Java regex matching, so I doubt there is a bug - it would then be in the JDK. Try out your solution in an online regex tool that supports Java regex.

I believe you want to have 2 regex process factories:
One that deals with single \n and one that deals with more than one \n

On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

Hi,

We have tried the following pattern ([ \t]*\r?\n){2,} and configuration:

<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">([ \t]*\r?\n){2,}</str>
  <str name="replacement"><br><br></str>
  <bool name="literalReplacement">true</bool>
</processor>

However, the issue is still occurring.

Is anyone else able to help?

Regards,
Edwin

On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

Hi,

For your info, this issue is occurring in Solr 7.7.0 as well.

Regards,
Edwin

On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

Hi,

Should we report this as a bug in Solr?

Regards,
Edwin

On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

Hi Paul,

Regarding the regex (\n\s*){2,} that we are using: when we try it on https://regex101.com/, it gives us the correct result for all the examples (i.e. all of them have only <br><br>, and not more than that, unlike what we are getting in Solr in our earlier examples).

Could there be a bug in Solr?

Regards,
Edwin

On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

Hi Paul,

We have tried it with the space preceding the \n, i.e. <str name="pattern">(\s*\n){2,}</str>, with the following configuration:

<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">(\s*\n){2,}</str>
  <str name="replacement"><br><br></str>
</processor>

However, we are getting exactly the same results as in the earlier Examples 1, 2 and 3.

As for your point 2, that the data may contain other (non-printing) characters than \n: we have found that there are no non-printing characters. It is just a new line with a space. You can refer to the original content in the same examples below.

Example 1: The sentence where the above regex pattern is working correctly
*Original content in EML file:*
Dear Sir,


I am terminating
*Original content:* Dear Sir, \n\n \n \n\n I am terminating
*Index content:* Dear Sir, <br><br>I am terminating

Example 2: The sentence where the above regex pattern is partially working (as you can see, instead of 2 <br>, there are 4 <br>)
*Original content in EML file:*

*exalted*

*Psalm 89:17*


3 Choa Chu Kang Avenue 4
*Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
*Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore

Example 3: The sentence where the above regex pattern is partially working (as you can see, instead of 2 <br>, there are 4 <br>)
*Original content in EML file:*

http://www.concordpri.moe.edu.sg/








On Tue, Dec 18, 2018 at 10:07 AM
*Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
*Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM

Appreciate any other ideas or suggestions that you may have.

Thank you.

Regards,
Edwin

On Thu, 7 Feb 2019 at 22:49, <paul.d...@ub.unibe.ch> wrote:

Hi Edwin

1. Sorry, the pattern was wrong, the space should precede the \n, i.e. <str name="pattern">(\s*\n){2,}</str>
2. Perhaps in the data you have other (non-printing) characters than \n?

From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Sent: Thursday, 7 February 2019 15:23
To: solr-user@lucene.apache.org
Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Hi Paul,

We have tried the suggested regex pattern as follows:

<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">(\n\s*){2,}</str>
  <str name="replacement"><br><br></str>
</processor>

But we still have exactly the same problem with Examples 1, 2 and 3 below.

Example 1: The sentence where the above regex pattern is working correctly
*Original content:* Dear Sir, \n\n \n \n\n I am terminating
*Index content:* Dear Sir, <br><br>I am terminating

Example 2: The sentence where the above regex pattern is partially working (as you can see, instead of 2 <br>, there are 4 <br>)
*Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
*Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore

Example 3: The sentence where the above regex pattern is partially working (as you can see, instead of 2 <br>, there are 4 <br>)
*Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
*Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM

Any further suggestions?

Thank you.

Regards,
Edwin

On Thu, 7 Feb 2019 at 22:20, <paul.d...@ub.unibe.ch> wrote:

To avoid the «\n+\s*» matching too many \n and then failing on the {2,} part you could try

<str name="pattern">(\n\s*){2,}</str>

If you also want to match CRLF then

<str name="pattern">(\r?\n\s*){2,}</str>

From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Sent: Thursday, 7 February 2019 15:10
To: solr-user@lucene.apache.org
Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Hi Paul,

Thanks for your reply.

When I use this pattern:

<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">(\n+\s*){2,}</str>
  <str name="replacement"><br><br></str>
</processor>

It works for some sentences within the same content and not for others. Please see below for one that is working and others that are only partially working:

Example 1: The sentence where the above regex pattern is working correctly
*Original content:* Dear Sir, \n\n \n \n\n I am terminating
*Index content:* Dear Sir, <br><br>I am terminating

Example 2: The sentence where the above regex pattern is partially working (as you can see, instead of 2 <br>, there are 4 <br>)
*Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
*Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore

Example 3: The sentence where the above regex pattern is partially working (as you can see, instead of 2 <br>, there are 4 <br>)
*Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
*Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM

We would appreciate your help in finding out what is wrong.

Thank you.

Regards,
Edwin

On Thu, 7 Feb 2019 at 21:24, <paul.d...@ub.unibe.ch> wrote:

You don't say what happens, just that it is not working. I assume nothing is replaced? Perhaps the pattern should be

<str name="pattern">"(\n\s*){2,}"</str>

??

From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Sent: Thursday, 7 February 2019 14:08
To: solr-user@lucene.apache.org
Subject: RegexReplaceProcessorFactory pattern to detect multiple \n

Hi,

I am trying to use the RegexReplaceProcessorFactory to replace two or more \n, with any number of spaces between them (e.g. \n\n, \n \n, \n \n \n \n), with two <br>.

I use the following regex pattern, and it works when I test it on regex101.com. But it does not work when I put it inside the RegexReplaceProcessorFactory as below:

<updateRequestProcessorChain name="removeCode">
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content</str>
    <str name="pattern">"(\\n\s*){2,}"</str>
    <str name="replacement"><br><br></str>
  </processor>
</updateRequestProcessorChain>

To explain my regex pattern further: \s* tells the regex to match any \n followed by whitespace, and {2,} tells it to match 2 or more occurrences of that group.

Please kindly let me know what is wrong and how I should do it.

I am using Solr 7.6.0.

Regards,
Edwin
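
Since the thread notes that Solr uses Java regex matching, the two-step patterns from the latest configuration can be tried outside Solr with plain java.util.regex. The sketch below is not from the thread: the class name BrCollapseCheck and the sample strings are made up for illustration. The second sample assumes, purely as a guess, that a non-breaking space (U+00A0) sits between the newline pairs. The thread never establishes which character is actually there, so this only shows how a character outside the class [ \t\x0b\f] would leave a «<br><br> <br><br>» style result.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Solr: applies the thread's two patterns
// with plain java.util.regex so they can be checked outside of Solr.
class BrCollapseCheck {
    // Step 1 (pattern from the thread): optional horizontal whitespace plus one line ending.
    static final Pattern LINE_END = Pattern.compile("[ \\t\\x0b\\f]*\\r?\\n");
    // Step 2 (pattern from the thread): three or more <br>, each optionally followed by whitespace.
    static final Pattern BR_RUN = Pattern.compile("(<br>[ \\t\\x0b\\f]*){3,}");

    static String normalize(String content) {
        // quoteReplacement makes the replacement text literal (no $1 group references);
        // this is meant to mirror literalReplacement=true in the thread's configuration.
        String withBreaks = LINE_END.matcher(content)
                .replaceAll(Matcher.quoteReplacement("<br>"));
        return BR_RUN.matcher(withBreaks)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        // Only spaces between the newlines: the whole run collapses to two <br>.
        // prints: Psalm 89:17<br><br>3 Choa Chu Kang Avenue 4
        System.out.println(normalize("Psalm 89:17 \n\n  \n\n 3 Choa Chu Kang Avenue 4"));

        // ASSUMPTION for illustration only: a non-breaking space (U+00A0) between the
        // newline pairs. It is not in the character class, so the run is split into
        // two groups of two <br>, and step 2 (which needs three or more) never matches.
        System.out.println(normalize("Psalm 89:17\n\n\u00a0\n\n3 Choa Chu Kang Avenue 4"));
    }
}
```

If a check like this reproduces the four <br>, dumping the code points of the field value between the two <br><br> groups would show which character the first pattern is skipping.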