Hi Paul,

I have tried setting the first match pattern to <str name="pattern">[ \t\x0b\f]*\r?\n</str>, as in the configuration below:

<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">[ \t\x0b\f]*\r?\n</str>
  <str name="replacement"><br></str>
  <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">(<br>){3,}</str>
  <str name="replacement"><br><br></str>
  <bool name="literalReplacement">true</bool>
</processor>

However, the result is still the same as the previous index results, with the 4 <br>.

Regards,
Edwin

On Wed, 6 Mar 2019 at 18:23, <paul.d...@ub.unibe.ch> wrote:

> Hi Edwin
>
> You are correct about the 2nd pattern, my bad. Looking at the 4 <br>, is it actually the sequence «<br><br> <br><br>»? So perhaps the first match pattern could be <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>
> i.e. [space tab vertical-tab formfeed]
>
> Regards,
> Paul
>
> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
> Sent: Wednesday, 6 March 2019 07:44
> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>
> Hi Paul,
>
> I have modified the second pattern to be (<br>){3,} instead of (<br><br>){3,}. The pattern (<br><br>){3,} actually looks for 6 or more <br> instead of 3 <br>, because <br> appears twice inside the group. That is why there were more <br> in the result: runs of fewer than 6 <br> were not replaced, so we ended up with up to 5 <br> in the index.
>
> Modified configuration:
> <processor class="solr.RegexReplaceProcessorFactory">
>   <str name="fieldName">content</str>
>   <str name="pattern">(<br>){3,}</str>
>   <str name="replacement"><br><br></str>
>   <bool name="literalReplacement">true</bool>
> </processor>
>
> This brings us back to the result of the previous index content, meaning the issue of having the 4 <br> is still there.
>
> Regards,
> Edwin
>
> On Wed, 6 Mar 2019 at 11:37, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>
> > Hi Paul,
> >
> > Further to my previous email, in which there was an extra "}" in the configuration, I have changed to the configuration below, based on your suggestion.
> >
> > <processor class="solr.RegexReplaceProcessorFactory">
> >   <str name="fieldName">content</str>
> >   <str name="pattern">[ \t]*\r?\n</str>
> >   <str name="replacement"><br></str>
> >   <bool name="literalReplacement">true</bool>
> > </processor>
> > <processor class="solr.RegexReplaceProcessorFactory">
> >   <str name="fieldName">content</str>
> >   <str name="pattern">(<br><br>){3,}</str>
> >   <str name="replacement"><br><br></str>
> >   <bool name="literalReplacement">true</bool>
> > </processor>
> >
> > However, the result that I get still has more than 2 <br>. In fact, the result has become worse, as you can see from the comparison below.
> >
> > Example 1: The sentence for which the regex pattern used to work correctly. With the latest pattern, it has changed from 2 <br> to 5 <br>, which is wrong.
> > *Original content in EML file:*
> > Dear Sir,
> >
> >
> > I am terminating
> > *Original content:* Dear Sir, \n\n \n \n\n I am terminating
> > *Previous Index content:* Dear Sir, <br><br>I am terminating
> > *Current Index content:* Dear Sir, <br><br><br><br><br> I am terminating
> >
> > Example 2: The sentence that the above regex pattern is partially working (as you can see, instead of 2 <br>, there are 4 <br>)
> > *Original content in EML file:*
> >
> > *exalted*
> >
> > *Psalm 89:17*
> >
> >
> > 3 Choa Chu Kang Avenue 4
> > *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
> > *Previous Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
> > *Current Index content:* <br><br><br> Psalm 89:17<br><br> <br><br> 3 Choa Chu Kang Avenue 3, Singapor4
> >
> > Example 3: The sentence that the above regex pattern is partially working (as you can see, instead of 2 <br>, there are 4 <br>). For the latest code, there are now 5 <br>
> > *Original content in EML file:*
> >
> > http://www.concorded.com/
> >
> >
> >
> >
> >
> >
> >
> >
> > On Tue, Dec 18, 2018 at 10:07 AM
> > *Original content:* http://www.concorded.com/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
> > *Previous Index content:* http://www.concorded.com/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> > *Current Index content:* http://www.concorded.com/<br><br> <br><br><br> On Tue, Dec 18, 2018 at 10:07 AM
> >
> > Regards,
> > Edwin
> >
> > On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> >
> >> Hi Paul,
> >>
> >> Thank you for the reply.
> >>
> >> I have tried to add the following configuration according to your suggestion:
> >>
> >> <processor class="solr.RegexReplaceProcessorFactory">
> >>   <str name="fieldName">content</str>
> >>   <str name="pattern">[ \t]*\r?\n}</str>
> >>   <str name="replacement"><br></str>
> >>   <bool name="literalReplacement">true</bool>
> >> </processor>
> >>
> >> <processor class="solr.RegexReplaceProcessorFactory">
> >>   <str name="fieldName">content</str>
> >>   <str name="pattern">(<br><br>){3,}</str>
> >>   <str name="replacement"><br><br></str>
> >>   <bool name="literalReplacement">true</bool>
> >> </processor>
> >>
> >> However, none of the \n is being removed this time round.
> >> Is the order and/or the pattern correct?
> >>
> >> Regards,
> >> Edwin
> >>
> >> On Tue, 5 Mar 2019 at 19:54, <paul.d...@ub.unibe.ch> wrote:
> >>
> >>> Hi Edwin
> >>>
> >>> Try for the first pattern/replacement
> >>>
> >>> <str name="pattern">[ \t]*\r?\n</str>
> >>> <str name="replacement"><br></str>
> >>>
> >>> Now all line endings and preceding whitespace characters should be changed to ‘<br>’.
> >>>
> >>> The second pattern replacement should replace 3 or more ‘<br>’ sequences to 2 ‘<br>’ sequences:
> >>>
> >>> <str name="pattern">(<br><br>){3,}</str>
> >>> <str name="replacement"><br><br></str>
> >>>
> >>> Hope this approach works. Sorry for not replying earlier and best regards,
> >>>
> >>> Paul
> >>>
> >>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
> >>> Sent: Tuesday, 5 March 2019 03:35
> >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>
> >>> Hi,
> >>>
> >>> For your info, this issue is occurring in the new Solr 7.7.1 as well.
> >>>
> >>> Regards,
> >>> Edwin
> >>>
> >>> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> > Anyone else has other suggestions or have faced the same problem?
> >>> >
> >>> > Regards,
> >>> > Edwin
> >>> >
> >>> > On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> >>> >
> >>> >> Hi Paul,
> >>> >>
> >>> >> If I tried to execute the second step first, then I will only get a single <br> for those with 2 <br>.
> >>> >> For those that we originally get 4 <br>, there will be 2 <br> with a space in between.
> >>> >>
> >>> >> This is just changing the 2 <br> to be a single <br>, since the second step is to replace with a single <br>.
> >>> >> But it has not solved the underlying problem yet.
> >>> >>
> >>> >> Regards,
> >>> >> Edwin
> >>> >>
> >>> >> On Wed, 20 Feb 2019 at 16:41, <paul.d...@ub.unibe.ch> wrote:
> >>> >>
> >>> >>> If the second step is executed first, then you will get the unwanted 4 <br>
> >>> >>>
> >>> >>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
> >>> >>> Sent: Wednesday, 20 February 2019 09:29
> >>> >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>> >>>
> >>> >>> Hi Jörn,
> >>> >>>
> >>> >>> Do you mean the regex is not correct?
> >>> >>>
> >>> >>> We are already using two RegexReplaceProcessorFactory steps, like the ones shown below. The output that we get is still the same.
> >>> >>>
> >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> >>>   <str name="fieldName">content</str>
> >>> >>>   <str name="pattern">([ \t]*\r?\n){2,}</str>
> >>> >>>   <str name="replacement"><br><br></str>
> >>> >>>   <bool name="literalReplacement">true</bool>
> >>> >>> </processor>
> >>> >>>
> >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> >>>   <str name="fieldName">content</str>
> >>> >>>   <str name="pattern">([ \t]*\r?\n){1,}</str>
> >>> >>>   <str name="replacement"><br></str>
> >>> >>>   <bool name="literalReplacement">true</bool>
> >>> >>> </processor>
> >>> >>>
> >>> >>> Regards,
> >>> >>> Edwin
> >>> >>>
> >>> >>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jornfra...@gmail.com> wrote:
> >>> >>>
> >>> >>> > Then you need two RegexReplaceProcessorFactory steps
> >>> >>> >
> >>> >>> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> >>> >>> > >
> >>> >>> > > Hi,
> >>> >>> > >
> >>> >>> > > Thanks for the reply.
> >>> >>> > >
> >>> >>> > > Do you know of any online regex tool that works correctly for Java regex?
> >>> >>> > > I tried to find some, but they are not working properly.
> >>> >>> > >
> >>> >>> > > Yes, our plan is to replace more than one \n with <br><br>, and a single \n with a single <br>.
> >>> >>> > >
> >>> >>> > > Regards,
> >>> >>> > > Edwin
> >>> >>> > >
> >>> >>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfra...@gmail.com> wrote:
> >>> >>> > >>
> >>> >>> > >> Solr uses Java regex matching, so I doubt there is a bug; it would then be in the JDK. Try it out in an online regex tool that supports Java regex for your solution.
> >>> >>> > >> > >>> >>> > >> I believe you want to have 2 regex process factories: > >>> >>> > >> One that deals with single \n and one that deals with more > than > >>> one > >>> >>> \n > >>> >>> > >> > >>> >>> > >>> Am 20.02.2019 um 06:17 schrieb Zheng Lin Edwin Yeo < > >>> >>> > edwinye...@gmail.com > >>> >>> > >>> : > >>> >>> > >>> > >>> >>> > >>> Hi, > >>> >>> > >>> > >>> >>> > >>> We have tried with the following pattern ([ \t]*\r?\n){2,} > and > >>> >>> > >>> configuration: > >>> >>> > >>> > >>> >>> > >>> <processor class="solr.RegexReplaceProcessorFactory"> > >>> >>> > >>> <str name="fieldName">content</str> > >>> >>> > >>> <str name="pattern">([ \t]*\r?\n){2,}</str> > >>> >>> > >>> <str name="replacement"><br><br></str> > >>> >>> > >>> <bool name="literalReplacement">true</bool> > >>> >>> > >>> </processor> > >>> >>> > >>> > >>> >>> > >>> However, the issue is still occurring. > >>> >>> > >>> > >>> >>> > >>> Anyone else is able to help? > >>> >>> > >>> > >>> >>> > >>> Regards, > >>> >>> > >>> Edwin > >>> >>> > >>> > >>> >>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo < > >>> >>> > edwinye...@gmail.com> > >>> >>> > >>> wrote: > >>> >>> > >>> > >>> >>> > >>>> Hi, > >>> >>> > >>>> > >>> >>> > >>>> For your info, this issue is occurring in Solr 7.7.0 as > well. > >>> >>> > >>>> > >>> >>> > >>>> Regards, > >>> >>> > >>>> Edwin > >>> >>> > >>>> > >>> >>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo < > >>> >>> > edwinye...@gmail.com > >>> >>> > >>> > >>> >>> > >>>> wrote: > >>> >>> > >>>> > >>> >>> > >>>>> Hi, > >>> >>> > >>>>> > >>> >>> > >>>>> Should we report this as a bug in Solr? 
> >>> >>> > >>>>> > >>> >>> > >>>>> Regards, > >>> >>> > >>>>> Edwin > >>> >>> > >>>>> > >>> >>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo < > >>> >>> > edwinye...@gmail.com > >>> >>> > >>> > >>> >>> > >>>>> wrote: > >>> >>> > >>>>> > >>> >>> > >>>>>> Hi Paul, > >>> >>> > >>>>>> > >>> >>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we > >>> try > >>> >>> in on > >>> >>> > >>>>>> https://regex101.com/, it is able to give us the correct > >>> >>> result for > >>> >>> > >> all > >>> >>> > >>>>>> the examples (ie: All of them will only have <br><br>, and > >>> not > >>> >>> more > >>> >>> > >> than > >>> >>> > >>>>>> that like what we are getting in Solr in our earlier > >>> examples). > >>> >>> > >>>>>> > >>> >>> > >>>>>> Could there be a possibility of a bug in Solr? > >>> >>> > >>>>>> > >>> >>> > >>>>>> Regards, > >>> >>> > >>>>>> Edwin > >>> >>> > >>>>>> > >>> >>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo < > >>> >>> > >> edwinye...@gmail.com> > >>> >>> > >>>>>> wrote: > >>> >>> > >>>>>> > >>> >>> > >>>>>>> Hi Paul, > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> We have tried it with the space preceeding the \n i.e. > <str > >>> >>> > >>>>>>> name="pattern">(\s*\n){2,}</str>, with the following > regex > >>> >>> pattern: > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory"> > >>> >>> > >>>>>>> <str name="fieldName">content</str> > >>> >>> > >>>>>>> <str name="pattern">(\s*\n){2,}</str> > >>> >>> > >>>>>>> <str name="replacement"><br><br></str> > >>> >>> > >>>>>>> </processor> > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> However, we are also getting the exact same results as > the > >>> >>> earlier > >>> >>> > >>>>>>> Example 1, 2 and 3. 
> >>> >>> > >>>>>>> > >>> >>> > >>>>>>> As for your point 2 on perhaps in the data you have other > >>> (non > >>> >>> > >>>>>>> printing) characters than \n, we have find that there are > >>> no > >>> >>> non > >>> >>> > >> printing > >>> >>> > >>>>>>> characters. It is just next line with a space. You can > >>> refer > >>> >>> to the > >>> >>> > >>>>>>> original content in the same examples below. > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> Example 1: The sentence that the above regex pattern is > >>> working > >>> >>> > >>>>>>> correctly > >>> >>> > >>>>>>> *Original content in EML file:* > >>> >>> > >>>>>>> Dear Sir, > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> I am terminating > >>> >>> > >>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am > >>> terminating > >>> >>> > >>>>>>> *Index content: * Dear Sir, <br><br>I am terminating > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> Example 2: The sentence that the above regex pattern is > >>> >>> partially > >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4 > >>> <br>) > >>> >>> > >>>>>>> *Original content in EML file:* > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> *exalted* > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> *Psalm 89:17* > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> 3 Choa Chu Kang Avenue 4 > >>> >>> > >>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n > >>> >>> \n\n 3 > >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore > >>> >>> > >>>>>>> *Index content: *exalted <br><br>Psalm 89:17 <br><br> > >>> >>> <br><br>3 > >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> Example 3: The sentence that the above regex pattern is > >>> >>> partially > >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4 > >>> <br>) > >>> >>> > >>>>>>> *Original content in EML file:* > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> http://www.concordpri.moe.edu.sg/ > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> > >>> >>> > 
>>>>>>> > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM > >>> >>> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ > >>> \n\n > >>> >>> > \n\n > >>> >>> > >> \n > >>> >>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n > >>> \n\n\n On > >>> >>> Tue, > >>> >>> > >> Dec 18, > >>> >>> > >>>>>>> 2018 at 10:07 AM > >>> >>> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/ > >>> <br><br> > >>> >>> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> Appreciate any other ideas or suggestions that you may > >>> have. > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> Thank you. > >>> >>> > >>>>>>> > >>> >>> > >>>>>>> Regards, > >>> >>> > >>>>>>> Edwin > >>> >>> > >>>>>>> > >>> >>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <paul.d...@ub.unibe.ch> > >>> wrote: > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>>> Hi Edwin > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>>> 1. Sorry, the pattern was wrong, the space should > preceed > >>> >>> the \n > >>> >>> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str> > >>> >>> > >>>>>>>> 2. Perhaps in the data you have other (non printing) > >>> >>> characters > >>> >>> > >>>>>>>> than \n? > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>>> Gesendet von Mail< > >>> >>> https://go.microsoft.com/fwlink/?LinkId=550986> > >>> >>> > >> für > >>> >>> > >>>>>>>> Windows 10 > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>>> Von: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com> > >>> >>> > >>>>>>>> Gesendet: Donnerstag, 7. 
Februar 2019 15:23 > >>> >>> > >>>>>>>> An: solr-user@lucene.apache.org<mailto: > >>> >>> > solr-user@lucene.apache.org> > >>> >>> > >>>>>>>> Betreff: Re: RegexReplaceProcessorFactory pattern to > >>> detect > >>> >>> > >> multiple \n > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>>> Hi Paul, > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>>> We have tried this suggested regex pattern as follow: > >>> >>> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory"> > >>> >>> > >>>>>>>> <str name="fieldName">content</str> > >>> >>> > >>>>>>>> <str name="pattern">(\n\s*){2,}</str> > >>> >>> > >>>>>>>> <str name="replacement"><br><br></str> > >>> >>> > >>>>>>>> </processor> > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>>> But we still have exactly the same problem of Example > 1,2 > >>> and > >>> >>> 3 > >>> >>> > >> below. > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>>> Example 1: The sentence that the above regex pattern is > >>> >>> working > >>> >>> > >>>>>>>> correctly > >>> >>> > >>>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am > >>> >>> terminating > >>> >>> > >>>>>>>> *Index content: * Dear Sir, <br><br>I am terminating > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>>> Example 2: The sentence that the above regex pattern is > >>> >>> partially > >>> >>> > >>>>>>>> working > >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>) > >>> >>> > >>>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 > \n\n > >>> >>> \n\n > >>> >>> > 3 > >>> >>> > >>>>>>>> Choa > >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore > >>> >>> > >>>>>>>> *Index content: *exalted <br><br>Psalm 89:17 <br><br> > >>> >>> > <br><br>3 > >>> >>> > >>>>>>>> Choa > >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>>> Example 3: The sentence that the above regex pattern is > >>> >>> partially > >>> >>> > >>>>>>>> working > >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>) > >>> >>> > >>>>>>>> *Original 
content:* http://www.concordpri.moe.edu.sg/ > >>> \n\n > >>> >>> > \n\n > >>> >>> > >>>>>>>> \n \n\n > >>> >>> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n > On > >>> >>> Tue, Dec > >>> >>> > >> 18, > >>> >>> > >>>>>>>> 2018 > >>> >>> > >>>>>>>> at 10:07 AM > >>> >>> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/ > >>> <br><br> > >>> >>> > >>>>>>>> <br><br>On > >>> >>> > >>>>>>>> Tue, Dec 18, 2018 at 10:07 AM > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>>> Any further suggestion? > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>>> Thank you. > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>>> Regards, > >>> >>> > >>>>>>>> Edwin > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.d...@ub.unibe.ch> > >>> wrote: > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then > >>> failing > >>> >>> on > >>> >>> > the > >>> >>> > >>>>>>>> {2,} > >>> >>> > >>>>>>>>> part you could try > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str> > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> If you also want to match CRLF then > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str> > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> Gesendet von Mail< > >>> >>> https://go.microsoft.com/fwlink/?LinkId=550986 > >>> >>> > > > >>> >>> > >>>>>>>> für > >>> >>> > >>>>>>>>> Windows 10 > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> Von: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com> > >>> >>> > >>>>>>>>> Gesendet: Donnerstag, 7. 
Februar 2019 15:10 > >>> >>> > >>>>>>>>> An: solr-user@lucene.apache.org<mailto: > >>> >>> > solr-user@lucene.apache.org > >>> >>> > >>> > >>> >>> > >>>>>>>>> Betreff: Re: RegexReplaceProcessorFactory pattern to > >>> detect > >>> >>> > >> multiple > >>> >>> > >>>>>>>> \n > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> Hi Paul, > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> Thanks for your reply. > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> When I use this pattern: > >>> >>> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory"> > >>> >>> > >>>>>>>>> <str name="fieldName">content</str> > >>> >>> > >>>>>>>>> <str name="pattern">(\n+\s*){2,}</str> > >>> >>> > >>>>>>>>> <str name="replacement"><br><br></str> > >>> >>> > >>>>>>>>> </processor> > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> It is working for some sentence within the same content > >>> and > >>> >>> not > >>> >>> > >>>>>>>> working for > >>> >>> > >>>>>>>>> some sentences. Please see below for the one that is > >>> working > >>> >>> and > >>> >>> > >>>>>>>> another > >>> >>> > >>>>>>>>> that is not working (partially working): > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> Example 1: The sentence that the above regex pattern is > >>> >>> working > >>> >>> > >>>>>>>> correctly > >>> >>> > >>>>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am > >>> >>> terminating > >>> >>> > >>>>>>>>> *Index content: * Dear Sir, <br><br>I am > terminating > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> Example 2: The sentence that the above regex pattern is > >>> >>> partially > >>> >>> > >>>>>>>> working > >>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>) > >>> >>> > >>>>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 > \n\n > >>> >>> > \n\n 3 > >>> >>> > >>>>>>>> Choa > >>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore > >>> >>> > >>>>>>>>> *Index content: *exalted <br><br>Psalm 89:17 > <br><br> > >>> >>> > <br><br>3 > >>> >>> > >>>>>>>> Choa > >>> 
>>> > >>>>>>>>> Chu Kang Avenue 4, Singapore > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> Example 3: The sentence that the above regex pattern is > >>> >>> partially > >>> >>> > >>>>>>>> working > >>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>) > >>> >>> > >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ > >>> \n\n > >>> >>> > >> \n\n > >>> >>> > >>>>>>>> \n > >>> >>> > >>>>>>>>> \n\n > >>> >>> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n > On > >>> >>> Tue, > >>> >>> > Dec > >>> >>> > >>>>>>>> 18, 2018 > >>> >>> > >>>>>>>>> at 10:07 AM > >>> >>> > >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/ > >>> >>> <br><br> > >>> >>> > >>>>>>>> <br><br>On > >>> >>> > >>>>>>>>> Tue, Dec 18, 2018 at 10:07 AM > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> We would appreciate your help to see what is wrong? > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> Thank you. > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>> Regards, > >>> >>> > >>>>>>>>> Edwin > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.d...@ub.unibe.ch> > >>> wrote: > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> You don’t say what happens, just that it is not > >>> working. I > >>> >>> > assume > >>> >>> > >>>>>>>> nothing > >>> >>> > >>>>>>>>>> is replaced? Perhaps the pattern should be > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> <str name="pattern">"(\n\s*){2,}"</str> > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> ?? > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> Gesendet von Mail< > >>> >>> > https://go.microsoft.com/fwlink/?LinkId=550986> > >>> >>> > >>>>>>>> für > >>> >>> > >>>>>>>>>> Windows 10 > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> Von: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com> > >>> >>> > >>>>>>>>>> Gesendet: Donnerstag, 7. 
Februar 2019 14:08 > >>> >>> > >>>>>>>>>> An: solr-user@lucene.apache.org<mailto: > >>> >>> > >> solr-user@lucene.apache.org > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>>>> Betreff: RegexReplaceProcessorFactory pattern to > detect > >>> >>> multiple > >>> >>> > >> \n > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> Hi, > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to > >>> >>> remove > >>> >>> > more > >>> >>> > >>>>>>>> than > >>> >>> > >>>>>>>>> two > >>> >>> > >>>>>>>>>> \n with any number of spaces between them (Eg: \n\n, > \n > >>> \n, > >>> >>> \n > >>> >>> > \n > >>> >>> > >>>>>>>> \n > >>> >>> > >>>>>>>>> \n), > >>> >>> > >>>>>>>>>> and replace it with two <br>. > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> I use the following regex pattern and it is working > >>> when I > >>> >>> test > >>> >>> > it > >>> >>> > >>>>>>>> in > >>> >>> > >>>>>>>>>> regex101.com. But it is not working when I put it > >>> inside > >>> >>> the > >>> >>> > >>>>>>>>>> RegexReplaceProcessorFactory as below: > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode"> > >>> >>> > >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory"> > >>> >>> > >>>>>>>>>> <str name="fieldName">content</str> > >>> >>> > >>>>>>>>>> <str name="pattern">"(\\n\s*){2,}"</str> > >>> >>> > >>>>>>>>>> <str name="replacement"><br><br></str> > >>> >>> > >>>>>>>>>> </processor> > >>> >>> > >>>>>>>>>> </updateRequestProcessorChain> > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> To explain further about my regex pattern, \s* is > >>> >>> instructing > >>> >>> > the > >>> >>> > >>>>>>>> regex > >>> >>> > >>>>>>>>> to > >>> >>> > >>>>>>>>>> match any \n that have space after and {2,} is > >>> instructing > >>> >>> the > >>> >>> > >>>>>>>> regex to > >>> >>> > >>>>>>>>>> match 2 or more occurrence of such pattern (\n). 
> >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> Please kindly let me know what is wrong and how should > >>> I do > >>> >>> it? > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> I am using Solr 7.6.0. > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>>> Regards, > >>> >>> > >>>>>>>>>> Edwin > >>> >>> > >>>>>>>>>> > >>> >>> > >>>>>>>>> > >>> >>> > >>>>>>>> > >>> >>> > >>>>>>> > >>> >>> > >> > >>> >>> > > >>> >>> > >>> >> > >>> > >> >
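
For anyone reading this thread later: the two-step approach discussed above can be tried outside Solr with plain java.util.regex, which is the regex engine Jörn refers to. The following is only a minimal sketch, not Solr's own code. The class name BrCollapseSketch and the helpers toBr and collapse are made up for illustration, and Matcher.quoteReplacement merely stands in for literalReplacement=true. The first input is Example 1 from the thread. The second input is purely hypothetical: it assumes the "space" between the blank lines is a non-breaking space (U+00A0), which nobody in the thread has confirmed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BrCollapseSketch {

    // Step 1 from the thread: each line ending, together with any spaces or
    // tabs directly before it, becomes a single <br>.
    static String toBr(String text) {
        return Pattern.compile("[ \\t]*\\r?\\n")
                .matcher(text)
                .replaceAll(Matcher.quoteReplacement("<br>"));
    }

    // Step 2: every match of the given pattern is replaced by exactly two <br>.
    static String collapse(String text, String regex) {
        return Pattern.compile(regex)
                .matcher(text)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        // Example 1 as quoted in the thread.
        String example1 = "Dear Sir, \n\n \n \n\n I am terminating";
        String step1 = toBr(example1);
        System.out.println(step1); // five consecutive <br>

        // (<br><br>){3,} needs at least six <br>, so five are left untouched.
        System.out.println(collapse(step1, "(<br><br>){3,}"));

        // (<br>){3,} matches the run of five and reduces it to two.
        System.out.println(collapse(step1, "(<br>){3,}"));

        // HYPOTHETICAL input: a non-breaking space between the blank lines.
        // [ \t] does not match U+00A0, so two separate runs of two <br> remain.
        String hypothetical = "exalted\n\n\u00A0\n\nPsalm 89:17";
        System.out.println(collapse(toBr(hypothetical), "(<br>){3,}"));
    }
}
```

If the hypothetical case matches what is really in the EML content, it would explain why a character survives between the two pairs of <br>. Java's \s and the class [ \t\x0b\f] both exclude U+00A0, so the first pattern would need to list that character explicitly. Dumping the code points of the raw field value would be one way to check before changing the configuration.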