Hi Paul,

Thanks for your reply.

So far we have not found any cases of punctuation being removed. Our aim is to collapse the runs of line breaks (\n) into 2 <br>, and they are not likely to have any punctuation in between.

Do you know if the pattern <str name="pattern">(\n\W*){2,}</str> that we are using is OK? Or would the other pattern, <str name="pattern">[ \t\x0b\f]*\r?\n</str>, be better?

Regards,
Edwin

On Wed, 13 Mar 2019 at 20:08, <paul.d...@ub.unibe.ch> wrote:

> Hi Edwin,
>
> With \W you will also replace non-word characters such as punctuation. If
> that's OK, fine. Otherwise you need to identify the white space characters
> that are causing the problem.
>
> ________________________________
> From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> Sent: Wednesday, 13 March 2019 03:25:39
> To: solr-user@lucene.apache.org
> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>
> Hi,
>
> We have managed to resolve the issue by changing the \s to \W. The reason
> could be that some of the gaps contain other whitespace characters instead
> of just a plain space. Using \s removes only the plain spaces and not those
> other whitespace characters, but using \W removes them as well.
>
> We have used this config, and it works.
>
> <processor class="solr.RegexReplaceProcessorFactory">
>   <str name="fieldName">content</str>
>   <str name="pattern">(\n\W*){2,}</str>
>   <str name="replacement"><br><br></str>
>   <bool name="literalReplacement">true</bool>
> </processor>
> <processor class="solr.RegexReplaceProcessorFactory">
>   <str name="fieldName">content</str>
>   <str name="pattern">(\n\W*){1,}</str>
>   <str name="replacement"><br></str>
>   <bool name="literalReplacement">true</bool>
> </processor>
>
> Regards,
> Edwin
>
> On Tue, 12 Mar 2019 at 10:49, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> wrote:
>
> > Hi,
> >
> > Has anyone else faced the same issue before?
> > So far, none of the regex patterns that we tried in this thread has been
> > able to resolve the issue.
> >
> > Regards,
> > Edwin
> >
> > On Fri, 8 Mar 2019 at 12:17, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > wrote:
> >
> >> Hi Paul,
> >>
> >> Sorry, I realized there is an extra ']' in the pattern provided, which is
> >> why there are so many <br> in the output.
> >>
> >> The output is exactly the same as previously (previous index result) if
> >> we remove the extra ']', as shown in the configuration below.
> >>
> >> <processor class="solr.RegexReplaceProcessorFactory">
> >>   <str name="fieldName">content</str>
> >>   <str name="pattern">[ \t\x0b\f]*\r?\n</str>
> >>   <str name="replacement"><br></str>
> >>   <bool name="literalReplacement">true</bool>
> >> </processor>
> >> <processor class="solr.RegexReplaceProcessorFactory">
> >>   <str name="fieldName">content</str>
> >>   <str name="pattern">(<br>[ \t\x0b\f]*){3,}</str>
> >>   <str name="replacement"><br><br></str>
> >>   <bool name="literalReplacement">true</bool>
> >> </processor>
> >>
> >> Regards,
> >> Edwin
> >>
> >> On Thu, 7 Mar 2019 at 22:51, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >> wrote:
> >>
> >>> Hi Paul,
> >>>
> >>> Thanks for the reply.
> >>>
> >>> For the 2nd pattern, if we put this pattern
> >>> <str name="pattern">(<br>[ \t\x0b\f]]*){3,}</str>, which is like the
> >>> configurations below:
> >>>
> >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>   <str name="fieldName">content</str>
> >>>   <str name="pattern">[ \t\x0b\f]*\r?\n</str>
> >>>   <str name="replacement"><br></str>
> >>>   <bool name="literalReplacement">true</bool>
> >>> </processor>
> >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>   <str name="fieldName">content</str>
> >>>   <str name="pattern">(<br>[ \t\x0b\f]]*){3,}</str>
> >>>   <str name="replacement"><br><br></str>
> >>>   <bool name="literalReplacement">true</bool>
> >>> </processor>
> >>>
> >>> It will not be able to change all those more than 3 <br> to 2 <br>.
> >>>
> >>> We will end up with many <br> in the output, like the example below:
> >>>
> >>> http://www.concorded.com/<br><br>
> >>> <br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>
> >>> On Tue, Dec 18, 2018
> >>>
> >>> Regards,
> >>> Edwin
> >>>
> >>> On Thu, 7 Mar 2019 at 20:44, <paul.d...@ub.unibe.ch> wrote:
> >>>
> >>>> Hi Edwin
> >>>>
> >>>> I can't understand why the pattern is not working and where the spaces
> >>>> between the <br> are coming from. It should be possible, however, to
> >>>> allow for spaces between the <br> in the second match pattern, i.e.
> >>>> 2nd pattern
> >>>>
> >>>> <str name="pattern">(<br>[ \t\x0b\f]]*){3,}</str>
> >>>>
> >>>> /Paul
> >>>>
> >>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> >>>> Windows 10
> >>>>
> >>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
> >>>> Sent: Wednesday, 6 March 2019 16:28
> >>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>>
> >>>> Hi Paul,
> >>>>
> >>>> I have tried with the first match pattern set to
> >>>> <str name="pattern">[ \t\x0b\f]*\r?\n</str>, like the configuration below:
> >>>>
> >>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>   <str name="fieldName">content</str>
> >>>>   <str name="pattern">[ \t\x0b\f]*\r?\n</str>
> >>>>   <str name="replacement"><br></str>
> >>>>   <bool name="literalReplacement">true</bool>
> >>>> </processor>
> >>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>   <str name="fieldName">content</str>
> >>>>   <str name="pattern">(<br>){3,}</str>
> >>>>   <str name="replacement"><br><br></str>
> >>>>   <bool name="literalReplacement">true</bool>
> >>>> </processor>
> >>>>
> >>>> However, the result is still the same as before (previous index
> >>>> results), with the 4 <br>.
> >>>>
> >>>> Regards,
> >>>> Edwin
> >>>>
> >>>> On Wed, 6 Mar 2019 at 18:23, <paul.d...@ub.unibe.ch> wrote:
> >>>>
> >>>> > Hi Edwin
> >>>> >
> >>>> > You are correct re the 2nd pattern, my bad. Looking at the 4 <br>,
> >>>> > it's actually the sequence «<br><br> <br><br>»? So perhaps the first
> >>>> > match pattern could be <str name="pattern">[ \t\x0b\f]*\r?\n</str>
> >>>> >
> >>>> > i.e. [space tab vertical-tab formfeed]
> >>>> >
> >>>> > Regards,
> >>>> >
> >>>> > Paul
> >>>> >
> >>>> > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> >>>> > Windows 10
> >>>> >
> >>>> > From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
> >>>> > Sent: Wednesday, 6 March 2019 07:44
> >>>> > To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>> >
> >>>> > Hi Paul,
> >>>> >
> >>>> > I have modified the second pattern to be (<br>){3,}, instead of
> >>>> > (<br><br>){3,}. The pattern (<br><br>){3,} actually looks for 6 or
> >>>> > more <br> instead of 3 <br>, as we have put the <br> two times in the
> >>>> > pattern. That is the reason there are more <br> in the result: cases
> >>>> > where there are fewer than 6 <br> are not being replaced, so we ended
> >>>> > up having up to 5 <br> in the index.
> >>>> >
> >>>> > Modified configuration:
> >>>> > <processor class="solr.RegexReplaceProcessorFactory">
> >>>> >   <str name="fieldName">content</str>
> >>>> >   <str name="pattern">(<br>){3,}</str>
> >>>> >   <str name="replacement"><br><br></str>
> >>>> >   <bool name="literalReplacement">true</bool>
> >>>> > </processor>
> >>>> >
> >>>> > This will bring us back to the result of the previous index content,
> >>>> > meaning the issue of having the 4 <br> is still there.
> >>>> >
> >>>> > Regards,
> >>>> > Edwin
> >>>> >
> >>>> > On Wed, 6 Mar 2019 at 11:37, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >>>> > wrote:
> >>>> >
> >>>> > > Hi Paul,
> >>>> > >
> >>>> > > Further to my previous email, in which there was an extra "}" in the
> >>>> > > configuration, I have changed to the configuration below based on
> >>>> > > your suggestion.
> >>>> > > > >>>> > > <processor class="solr.RegexReplaceProcessorFactory"> > >>>> > > <str name="fieldName">content</str> > >>>> > > <str name="pattern">[ \t]*\r?\n</str> > >>>> > > <str name="replacement"><br></str> > >>>> > > <bool name="literalReplacement">true</bool> > >>>> > > </processor> > >>>> > > <processor class="solr.RegexReplaceProcessorFactory"> > >>>> > > <str name="fieldName">content</str> > >>>> > > <str name="pattern">(<br><br>){3,}</str> > >>>> > > <str name="replacement"><br><br></str> > >>>> > > <bool name="literalReplacement">true</bool> > >>>> > > </processor> > >>>> > > > >>>> > > However, the result that I get still has more than 2 <br>. In > fact, > >>>> the > >>>> > > result become worse, as you can see from the comparison below. > >>>> > > > >>>> > > Example 1: The sentence that the regex pattern used to work > >>>> correctly. > >>>> > But > >>>> > > with the latest pattern, it has now changed from 2 <br> to become > 5 > >>>> <br>, > >>>> > > which is wrong. > >>>> > > *Original content in EML file:* > >>>> > > Dear Sir, > >>>> > > > >>>> > > > >>>> > > I am terminating > >>>> > > *Original content:* Dear Sir, \n\n \n \n\n I am terminating > >>>> > > *Previous Index content: * Dear Sir, <br><br>I am terminating > >>>> > > *Current Index content*: Dear Sir, <br><br><br><br><br> I am > >>>> > terminating > >>>> > > > >>>> > > Example 2: The sentence that the above regex pattern is partially > >>>> working > >>>> > > (as you can see, instead of 2 <br>, there are 4 <br>) > >>>> > > *Original content in EML file:* > >>>> > > > >>>> > > *exalted* > >>>> > > > >>>> > > *Psalm 89:17* > >>>> > > > >>>> > > > >>>> > > 3 Choa Chu Kang Avenue 4 > >>>> > > *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n > 3 > >>>> Choa > >>>> > > Chu Kang Avenue 4, Singapore > >>>> > > *Previous Index content: *exalted <br><br>Psalm 89:17 <br><br> > >>>> > > <br><br>3 Choa Chu Kang Avenue 4, Singapore > >>>> > > *Current Index content*: <br><br><br> Psalm 
89:17<br><br> > >>>> <br><br> 3 > >>>> > > Choa Chu Kang Avenue 3, Singapor4 > >>>> > > > >>>> > > Example 3: The sentence that the above regex pattern is partially > >>>> working > >>>> > > (as you can see, instead of 2 <br>, there are 4 <br>). For the > >>>> latest > >>>> > code, > >>>> > > there are now 5 <br> > >>>> > > *Original content in EML file:* > >>>> > > > >>>> > > http://www.concorded.com/ > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > On Tue, Dec 18, 2018 at 10:07 AM > >>>> > > *Original content:* http://www.concorded.com/ \n\n \n\n \n > >>>> \n\n \n\n > >>>> > > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, > >>>> 2018 at > >>>> > > 10:07 AM > >>>> > > *Previous Index content: *http://www.concorded.com/ <br><br> > >>>> > > <br><br>On Tue, Dec 18, 2018 at 10:07 AM > >>>> > > *Current Index content:* http://www.concorded.com/<br><br> > >>>> <br><br><br> > >>>> > > On Tue, Dec 18, 2018 at 10:07 AM > >>>> > > > >>>> > > > >>>> > > Regards, > >>>> > > Edwin > >>>> > > > >>>> > > On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo < > >>>> edwinye...@gmail.com> > >>>> > > wrote: > >>>> > > > >>>> > >> Hi Paul, > >>>> > >> > >>>> > >> Thank you for the reply. 
> >>>> > >> > >>>> > >> I have tried to add the following configuration according to your > >>>> > >> suggestion: > >>>> > >> > >>>> > >> <processor class="solr.RegexReplaceProcessorFactory"> > >>>> > >> <str name="fieldName">content</str> > >>>> > >> <str name="pattern">[ \t]*\r?\n}</str> > >>>> > >> <str name="replacement"><br></str> > >>>> > >> <bool name="literalReplacement">true</bool> > >>>> > >> </processor> > >>>> > >> > >>>> > >> <processor class="solr.RegexReplaceProcessorFactory"> > >>>> > >> <str name="fieldName">content</str> > >>>> > >> <str name="pattern">(<br><br>){3,}</str> > >>>> > >> <str name="replacement"><br><br></str> > >>>> > >> <bool name="literalReplacement">true</bool> > >>>> > >> </processor> > >>>> > >> > >>>> > >> However, none of the \n is being removed this time round. > >>>> > >> Is the order and/or the pattern correct? > >>>> > >> > >>>> > >> Regards, > >>>> > >> Edwin > >>>> > >> > >>>> > >> On Tue, 5 Mar 2019 at 19:54, <paul.d...@ub.unibe.ch> wrote: > >>>> > >> > >>>> > >>> Hi Edwin > >>>> > >>> > >>>> > >>> > >>>> > >>> > >>>> > >>> Try for the first pattern/replacement > >>>> > >>> > >>>> > >>> > >>>> > >>> > >>>> > >>> <str name="pattern">[ \t]*\r?\n</str> > >>>> > >>> > >>>> > >>> <str name="replacement"><br></str> > >>>> > >>> > >>>> > >>> > >>>> > >>> > >>>> > >>> Now all line endings and preceding whitespace characters should > be > >>>> > >>> changed to ‘<br>’. > >>>> > >>> > >>>> > >>> > >>>> > >>> > >>>> > >>> The second pattern replacement should replace 3 or more ‘<br>’ > >>>> > sequences > >>>> > >>> to 2 ‘<br>’ sequences: > >>>> > >>> > >>>> > >>> > >>>> > >>> > >>>> > >>> <str name="pattern">(<br><br>){3,}</str> > >>>> > >>> > >>>> > >>> <str name="replacement"><br><br></str> > >>>> > >>> > >>>> > >>> > >>>> > >>> > >>>> > >>> Hope this approach works. 
Sorry for not replying earlier and > best > >>>> > >>> regards, > >>>> > >>> > >>>> > >>> Paul > >>>> > >>> > >>>> > >>> > >>>> > >>> > >>>> > >>> > >>>> > >>> > >>>> > >>> Gesendet von Mail< > https://go.microsoft.com/fwlink/?LinkId=550986> > >>>> für > >>>> > >>> Windows 10 > >>>> > >>> > >>>> > >>> > >>>> > >>> > >>>> > >>> Von: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com> > >>>> > >>> Gesendet: Dienstag, 5. März 2019 03:35 > >>>> > >>> An: solr-user@lucene.apache.org<mailto: > >>>> solr-user@lucene.apache.org> > >>>> > >>> Betreff: Re: RegexReplaceProcessorFactory pattern to detect > >>>> multiple \n > >>>> > >>> > >>>> > >>> > >>>> > >>> > >>>> > >>> Hi, > >>>> > >>> > >>>> > >>> For your info, this issue is occurring in the new Solr 7.7.1 as > >>>> well. > >>>> > >>> > >>>> > >>> Regards, > >>>> > >>> Edwin > >>>> > >>> > >>>> > >>> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo < > >>>> > edwinye...@gmail.com> > >>>> > >>> wrote: > >>>> > >>> > >>>> > >>> > Hi, > >>>> > >>> > > >>>> > >>> > Anyone else has other suggestions or have faced the same > >>>> problem? > >>>> > >>> > > >>>> > >>> > Regards, > >>>> > >>> > Edwin > >>>> > >>> > > >>>> > >>> > On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo < > >>>> > >>> edwinye...@gmail.com> > >>>> > >>> > wrote: > >>>> > >>> > > >>>> > >>> >> Hi Paul, > >>>> > >>> >> > >>>> > >>> >> If I tried to execute the second step first, then I will only > >>>> get a > >>>> > >>> >> single <br> for those with 2 <br>. > >>>> > >>> >> For those that we originally get 4 <br>, there will be 2 <br> > >>>> with a > >>>> > >>> >> space in between. > >>>> > >>> >> > >>>> > >>> >> This is just changing the 2 <br> to be a single <br>, since > the > >>>> > second > >>>> > >>> >> step is to replace with a single <br>. > >>>> > >>> >> But it has not solved the underlying problem yet. 
> >>>> > >>> >> > >>>> > >>> >> Regards, > >>>> > >>> >> Edwin > >>>> > >>> >> > >>>> > >>> >> > >>>> > >>> >> On Wed, 20 Feb 2019 at 16:41, <paul.d...@ub.unibe.ch> wrote: > >>>> > >>> >> > >>>> > >>> >>> If the second step is executed first, then you will get the > >>>> > unwanted > >>>> > >>> 4 > >>>> > >>> >>> <br> > >>>> > >>> >>> > >>>> > >>> >>> > >>>> > >>> >>> > >>>> > >>> >>> Gesendet von Mail< > >>>> https://go.microsoft.com/fwlink/?LinkId=550986> > >>>> > >>> für > >>>> > >>> >>> Windows 10 > >>>> > >>> >>> > >>>> > >>> >>> > >>>> > >>> >>> > >>>> > >>> >>> Von: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com> > >>>> > >>> >>> Gesendet: Mittwoch, 20. Februar 2019 09:29 > >>>> > >>> >>> An: solr-user@lucene.apache.org<mailto: > >>>> solr-user@lucene.apache.org > >>>> > > > >>>> > >>> >>> Betreff: Re: RegexReplaceProcessorFactory pattern to detect > >>>> > multiple > >>>> > >>> \n > >>>> > >>> >>> > >>>> > >>> >>> > >>>> > >>> >>> > >>>> > >>> >>> Hi Jörn , > >>>> > >>> >>> > >>>> > >>> >>> Do you mean the regex is not correct? > >>>> > >>> >>> > >>>> > >>> >>> We are already using two RegexReplaceProcessorFactory steps, > >>>> like > >>>> > >>> the one > >>>> > >>> >>> shown below. The output that we get is still the same. 
> >>>> > >>> >>> > >>>> > >>> >>> <processor class="solr.RegexReplaceProcessorFactory"> > >>>> > >>> >>> <str name="fieldName">content</str> > >>>> > >>> >>> <str name="pattern">([ \t]*\r?\n){2,}</str> > >>>> > >>> >>> <str name="replacement"><br><br></str> > >>>> > >>> >>> <bool name="literalReplacement">true</bool> > >>>> > >>> >>> <processor> > >>>> > >>> >>> > >>>> > >>> >>> <processor class="solr.RegexReplaceProcessorFactory"> > >>>> > >>> >>> <str name="fieldName">content</str> > >>>> > >>> >>> <str name="pattern">([ \t]*\r?\n){1,}</str> > >>>> > >>> >>> <str name="replacement"><br></str> > >>>> > >>> >>> <bool name="literalReplacement">true</bool> > >>>> > >>> >>> <processor> > >>>> > >>> >>> > >>>> > >>> >>> Regards, > >>>> > >>> >>> Edwin > >>>> > >>> >>> > >>>> > >>> >>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke < > >>>> jornfra...@gmail.com> > >>>> > >>> wrote: > >>>> > >>> >>> > >>>> > >>> >>> > Then you need two regexprocessfactory steps > >>>> > >>> >>> > > >>>> > >>> >>> > > Am 20.02.2019 um 08:12 schrieb Zheng Lin Edwin Yeo < > >>>> > >>> >>> edwinye...@gmail.com > >>>> > >>> >>> > >: > >>>> > >>> >>> > > > >>>> > >>> >>> > > Hi, > >>>> > >>> >>> > > > >>>> > >>> >>> > > Thanks for the reply. > >>>> > >>> >>> > > > >>>> > >>> >>> > > Do you know of any regex online tool that works > correctly > >>>> for > >>>> > >>> Java > >>>> > >>> >>> regex? > >>>> > >>> >>> > > I tried to find some, but they are not working properly. > >>>> > >>> >>> > > > >>>> > >>> >>> > > Yes, our plan is to replace more than one \n with > >>>> <br><br>, and > >>>> > >>> >>> single \n > >>>> > >>> >>> > > with single <br>. 
> >>>> > >>> >>> > > > >>>> > >>> >>> > > Regards, > >>>> > >>> >>> > > Edwin > >>>> > >>> >>> > > > >>>> > >>> >>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke < > >>>> > jornfra...@gmail.com > >>>> > >>> > > >>>> > >>> >>> wrote: > >>>> > >>> >>> > >> > >>>> > >>> >>> > >> Solr uses Java regex matching, so i doubt there is a > bug > >>>> - it > >>>> > >>> would > >>>> > >>> >>> then > >>>> > >>> >>> > >> be in the JDK. Try out in a regex online Tool that > >>>> supports > >>>> > Java > >>>> > >>> >>> regex > >>>> > >>> >>> > for > >>>> > >>> >>> > >> your solution. > >>>> > >>> >>> > >> > >>>> > >>> >>> > >> I believe you want to have 2 regex process factories: > >>>> > >>> >>> > >> One that deals with single \n and one that deals with > >>>> more > >>>> > than > >>>> > >>> one > >>>> > >>> >>> \n > >>>> > >>> >>> > >> > >>>> > >>> >>> > >>> Am 20.02.2019 um 06:17 schrieb Zheng Lin Edwin Yeo < > >>>> > >>> >>> > edwinye...@gmail.com > >>>> > >>> >>> > >>> : > >>>> > >>> >>> > >>> > >>>> > >>> >>> > >>> Hi, > >>>> > >>> >>> > >>> > >>>> > >>> >>> > >>> We have tried with the following pattern ([ > >>>> \t]*\r?\n){2,} > >>>> > and > >>>> > >>> >>> > >>> configuration: > >>>> > >>> >>> > >>> > >>>> > >>> >>> > >>> <processor class="solr.RegexReplaceProcessorFactory"> > >>>> > >>> >>> > >>> <str name="fieldName">content</str> > >>>> > >>> >>> > >>> <str name="pattern">([ \t]*\r?\n){2,}</str> > >>>> > >>> >>> > >>> <str name="replacement"><br><br></str> > >>>> > >>> >>> > >>> <bool name="literalReplacement">true</bool> > >>>> > >>> >>> > >>> </processor> > >>>> > >>> >>> > >>> > >>>> > >>> >>> > >>> However, the issue is still occurring. > >>>> > >>> >>> > >>> > >>>> > >>> >>> > >>> Anyone else is able to help? 
> >>>> > >>> >>> > >>> > >>>> > >>> >>> > >>> Regards, > >>>> > >>> >>> > >>> Edwin > >>>> > >>> >>> > >>> > >>>> > >>> >>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo < > >>>> > >>> >>> > edwinye...@gmail.com> > >>>> > >>> >>> > >>> wrote: > >>>> > >>> >>> > >>> > >>>> > >>> >>> > >>>> Hi, > >>>> > >>> >>> > >>>> > >>>> > >>> >>> > >>>> For your info, this issue is occurring in Solr 7.7.0 > as > >>>> > well. > >>>> > >>> >>> > >>>> > >>>> > >>> >>> > >>>> Regards, > >>>> > >>> >>> > >>>> Edwin > >>>> > >>> >>> > >>>> > >>>> > >>> >>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo < > >>>> > >>> >>> > edwinye...@gmail.com > >>>> > >>> >>> > >>> > >>>> > >>> >>> > >>>> wrote: > >>>> > >>> >>> > >>>> > >>>> > >>> >>> > >>>>> Hi, > >>>> > >>> >>> > >>>>> > >>>> > >>> >>> > >>>>> Should we report this as a bug in Solr? > >>>> > >>> >>> > >>>>> > >>>> > >>> >>> > >>>>> Regards, > >>>> > >>> >>> > >>>>> Edwin > >>>> > >>> >>> > >>>>> > >>>> > >>> >>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo < > >>>> > >>> >>> > edwinye...@gmail.com > >>>> > >>> >>> > >>> > >>>> > >>> >>> > >>>>> wrote: > >>>> > >>> >>> > >>>>> > >>>> > >>> >>> > >>>>>> Hi Paul, > >>>> > >>> >>> > >>>>>> > >>>> > >>> >>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, > >>>> when we > >>>> > >>> try > >>>> > >>> >>> in on > >>>> > >>> >>> > >>>>>> https://regex101.com/, it is able to give us the > >>>> correct > >>>> > >>> >>> result for > >>>> > >>> >>> > >> all > >>>> > >>> >>> > >>>>>> the examples (ie: All of them will only have > >>>> <br><br>, and > >>>> > >>> not > >>>> > >>> >>> more > >>>> > >>> >>> > >> than > >>>> > >>> >>> > >>>>>> that like what we are getting in Solr in our > earlier > >>>> > >>> examples). > >>>> > >>> >>> > >>>>>> > >>>> > >>> >>> > >>>>>> Could there be a possibility of a bug in Solr? 
> >>>> > >>> >>> > >>>>>> > >>>> > >>> >>> > >>>>>> Regards, > >>>> > >>> >>> > >>>>>> Edwin > >>>> > >>> >>> > >>>>>> > >>>> > >>> >>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo < > >>>> > >>> >>> > >> edwinye...@gmail.com> > >>>> > >>> >>> > >>>>>> wrote: > >>>> > >>> >>> > >>>>>> > >>>> > >>> >>> > >>>>>>> Hi Paul, > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> We have tried it with the space preceeding the \n > >>>> i.e. > >>>> > <str > >>>> > >>> >>> > >>>>>>> name="pattern">(\s*\n){2,}</str>, with the > following > >>>> > regex > >>>> > >>> >>> pattern: > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> <processor > >>>> class="solr.RegexReplaceProcessorFactory"> > >>>> > >>> >>> > >>>>>>> <str name="fieldName">content</str> > >>>> > >>> >>> > >>>>>>> <str name="pattern">(\s*\n){2,}</str> > >>>> > >>> >>> > >>>>>>> <str > name="replacement"><br><br></str> > >>>> > >>> >>> > >>>>>>> </processor> > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> However, we are also getting the exact same > results > >>>> as > >>>> > the > >>>> > >>> >>> earlier > >>>> > >>> >>> > >>>>>>> Example 1, 2 and 3. > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> As for your point 2 on perhaps in the data you > have > >>>> other > >>>> > >>> (non > >>>> > >>> >>> > >>>>>>> printing) characters than \n, we have find that > >>>> there are > >>>> > >>> no > >>>> > >>> >>> non > >>>> > >>> >>> > >> printing > >>>> > >>> >>> > >>>>>>> characters. It is just next line with a space. You > >>>> can > >>>> > >>> refer > >>>> > >>> >>> to the > >>>> > >>> >>> > >>>>>>> original content in the same examples below. 
> >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> Example 1: The sentence that the above regex > >>>> pattern is > >>>> > >>> working > >>>> > >>> >>> > >>>>>>> correctly > >>>> > >>> >>> > >>>>>>> *Original content in EML file:* > >>>> > >>> >>> > >>>>>>> Dear Sir, > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> I am terminating > >>>> > >>> >>> > >>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I > am > >>>> > >>> terminating > >>>> > >>> >>> > >>>>>>> *Index content: * Dear Sir, <br><br>I am > >>>> terminating > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> Example 2: The sentence that the above regex > >>>> pattern is > >>>> > >>> >>> partially > >>>> > >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there > >>>> are 4 > >>>> > >>> <br>) > >>>> > >>> >>> > >>>>>>> *Original content in EML file:* > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> *exalted* > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> *Psalm 89:17* > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> 3 Choa Chu Kang Avenue 4 > >>>> > >>> >>> > >>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 > >>>> \n\n > >>>> > >>> >>> \n\n 3 > >>>> > >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore > >>>> > >>> >>> > >>>>>>> *Index content: *exalted <br><br>Psalm 89:17 > >>>> <br><br> > >>>> > >>> >>> <br><br>3 > >>>> > >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> Example 3: The sentence that the above regex > >>>> pattern is > >>>> > >>> >>> partially > >>>> > >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there > >>>> are 4 > >>>> > >>> <br>) > >>>> > >>> >>> > >>>>>>> *Original content in EML file:* > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> http://www.concordpri.moe.edu.sg/ > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> > >>>> > 
>>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM > >>>> > >>> >>> > >>>>>>> *Original content:* > >>>> http://www.concordpri.moe.edu.sg/ > >>>> > >>> \n\n > >>>> > >>> >>> > \n\n > >>>> > >>> >>> > >> \n > >>>> > >>> >>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n > \n\n\n > >>>> > >>> \n\n\n On > >>>> > >>> >>> Tue, > >>>> > >>> >>> > >> Dec 18, > >>>> > >>> >>> > >>>>>>> 2018 at 10:07 AM > >>>> > >>> >>> > >>>>>>> *Index content: * > http://www.concordpri.moe.edu.sg/ > >>>> > >>> <br><br> > >>>> > >>> >>> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> Appreciate any other ideas or suggestions that you > >>>> may > >>>> > >>> have. > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> Thank you. > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>> Regards, > >>>> > >>> >>> > >>>>>>> Edwin > >>>> > >>> >>> > >>>>>>> > >>>> > >>> >>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, < > >>>> paul.d...@ub.unibe.ch> > >>>> > >>> wrote: > >>>> > >>> >>> > >>>>>>>> > >>>> > >>> >>> > >>>>>>>> Hi Edwin > >>>> > >>> >>> > >>>>>>>> > >>>> > >>> >>> > >>>>>>>> > >>>> > >>> >>> > >>>>>>>> > >>>> > >>> >>> > >>>>>>>> 1. Sorry, the pattern was wrong, the space > should > >>>> > preceed > >>>> > >>> >>> the \n > >>>> > >>> >>> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str> > >>>> > >>> >>> > >>>>>>>> 2. Perhaps in the data you have other (non > >>>> printing) > >>>> > >>> >>> characters > >>>> > >>> >>> > >>>>>>>> than \n? 
> >>>> > >>> >>> > >>>>>>>> > >>>> > >>> >>> > >>>>>>>> > >>>> > >>> >>> > >>>>>>>> > >>>> > >>> >>> > >>>>>>>> Gesendet von Mail< > >>>> > >>> >>> https://go.microsoft.com/fwlink/?LinkId=550986> > >>>> > >>> >>> > >> für > >>>> > >>> >>> > >>>>>>>> Windows 10 > >>>> > >>> >>> > >>>>>>>> > >>>> > >>> >>> > >>>>>>>> > >>>> > >>> >>> > >>>>>>>> > >>>> > >>> >>> > >>>>>>>> Von: Zheng Lin Edwin Yeo<mailto: > >>>> edwinye...@gmail.com> > >>>> > >>> >>> > >>>>>>>> Gesendet: Donnerstag, 7. Februar 2019 15:23 > >>>> > >>> >>> > >>>>>>>> An: solr-user@lucene.apache.org<mailto: > >>>> > >>> >>> > solr-user@lucene.apache.org> > >>>> > >>> >>> > >>>>>>>> Betreff: Re: RegexReplaceProcessorFactory pattern > >>>> to > >>>> > >>> detect > >>>> > >>> >>> > >> multiple \n > >>>> > >>> >>> > >>>>>>>> > >>>> > >>> >>> > >>>>>>>> > >>>> > >>> >>> > >>>>>>>> > >>>> > >>> >>> > >>>>>>>> Hi Paul, > >>>> > >>> >>> > >>>>>>>> > >>>> > >>> >>> > >>>>>>>> We have tried this suggested regex pattern as > >>>> follow: > >>>> > >>> >>> > >>>>>>>> <processor > >>>> class="solr.RegexReplaceProcessorFactory"> > >>>> > >>> >>> > >>>>>>>> <str name="fieldName">content</str> > >>>> > >>> >>> > >>>>>>>> <str name="pattern">(\n\s*){2,}</str> > >>>> > >>> >>> > >>>>>>>> <str > name="replacement"><br><br></str> > >>>> > >>> >>> > >>>>>>>> </processor> > >>>> > >>> >>> > >>>>>>>> > >>>> > >>> >>> > >>>>>>>> But we still have exactly the same problem of > >>>> Example > >>>> > 1,2 > >>>> > >>> and > >>>> > >>> >>> 3 > >>>> > >>> >>> > >> below. 
> >>>> > >>> >>> > >>>>>>>> > >>>> > >>> >>> > >>>>>>>> Example 1: The sentence that the above regex > >>>> pattern is > >>>> > >>> >>> working > >>>> > >>> >>> > >>>>>>>> correctly > >>>> > >>> >>> > >>>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I > am > >>>> > >>> >>> terminating > >>>> > >>> >>> > >>>>>>>> *Index content: * Dear Sir, <br><br>I am > >>>> terminating > >>>> > >>> >>> > >>>>>>>> > >>>> > >>> >>> > >>>>>>>> Example 2: The sentence that the above regex > >>>> pattern is > >>>> > >>> >>> partially > >>>> > >>> >>> > >>>>>>>> working > >>>> > >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 > >>>> <br>) > >>>> > >>> >>> > >>>>>>>> *Original content:* exalted \n \n\n Psalm > 89:17 > >>>> > \n\n > >>>> > >>> >>> \n\n > >>>> > >>> >>> > 3 > >>>> > >>> >>> > >>>>>>>> Choa > >>>> > >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore > >>>> > >>> >>> > >>>>>>>> *Index content: *exalted <br><br>Psalm 89:17 > >>>> <br><br> > >>>> > >>> >>> > <br><br>3 > >>>> > >>> >>> > >>>>>>>> Choa > >>>> > >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore > >>>> > >>> >>> > >>>>>>>> > >>>> > >>> >>> > >>>>>>>> Example 3: The sentence that the above regex > >>>> pattern is > >>>> > >>> >>> partially > >>>> > >>> >>> > >>>>>>>> working > >>>> > >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 > >>>> <br>) > >>>> > >>> >>> > >>>>>>>> *Original content:* > >>>> http://www.concordpri.moe.edu.sg/ > >>>> > >>> \n\n > >>>> > >>> >>> > \n\n > >>>> > >>> >>> > >>>>>>>> \n \n\n > >>>> > >>> >>> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n > >>>> \n\n\n > >>>> > On > >>>> > >>> >>> Tue, Dec > >>>> > >>> >>> > >> 18, > >>>> > >>> >>> > >>>>>>>> 2018 > >>>> > >>> >>> > >>>>>>>> at 10:07 AM > >>>> > >>> >>> > >>>>>>>> *Index content: * > http://www.concordpri.moe.edu.sg/ > >>>> > >>> <br><br> > >>>> > >>> >>> > >>>>>>>> <br><br>On > >>>> > >>> >>> > >>>>>>>> Tue, Dec 18, 2018 at 10:07 AM > >>>> > >>> >>> > >>>>>>>> > >>>> > >>> >>> > 
>>>>>>>> Any further suggestions?
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Edwin
>>>>>>>>
>>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.d...@ub.unibe.ch> wrote:
>>>>>>>>>
>>>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the
>>>>>>>>> {2,} part, you could try
>>>>>>>>>
>>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>>>>>>>>>
>>>>>>>>> If you also want to match CRLF, then
>>>>>>>>>
>>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>>>>>>>>>
>>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>>>>>>> Windows 10
>>>>>>>>>
>>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>>>>>>>>> Sent: Thursday, 7 February 2019 15:10
>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>>>>>
>>>>>>>>> Hi Paul,
>>>>>>>>>
>>>>>>>>> Thanks for your reply.
>>>>>>>>>
>>>>>>>>> When I use this pattern:
>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>>>>   <str name="fieldName">content</str>
>>>>>>>>>   <str name="pattern">(\n+\s*){2,}</str>
>>>>>>>>>   <str name="replacement"><br><br></str>
>>>>>>>>> </processor>
>>>>>>>>>
>>>>>>>>> It works for some sentences within the same content but not for others.
>>>>>>>>> Please see below for one that works and others that work only
>>>>>>>>> partially:
>>>>>>>>>
>>>>>>>>> Example 1: A sentence for which the above regex pattern works correctly
>>>>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>>>>>>>>> *Index content:* Dear Sir, <br><br>I am terminating
>>>>>>>>>
>>>>>>>>> Example 2: A sentence for which the above regex pattern works only partially
>>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa
>>>>>>>>> Chu Kang Avenue 4, Singapore
>>>>>>>>> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa
>>>>>>>>> Chu Kang Avenue 4, Singapore
>>>>>>>>>
>>>>>>>>> Example 3: A sentence for which the above regex pattern works only partially
>>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n
>>>>>>>>> \n\n
>>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018
>>>>>>>>> at 10:07 AM
>>>>>>>>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On
>>>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>>>>>>>>>
>>>>>>>>> We would appreciate your help in finding out what is wrong.
>>>>>>>>>
>>>>>>>>> Thank you.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Edwin
>>>>>>>>>
>>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.d...@ub.unibe.ch> wrote:
>>>>>>>>>>
>>>>>>>>>> You don’t say what happens, just that it is not working. I assume
>>>>>>>>>> nothing is replaced? Perhaps the pattern should be
>>>>>>>>>>
>>>>>>>>>> <str name="pattern">"(\n\s*){2,}"</str>
>>>>>>>>>>
>>>>>>>>>> ??
>>>>>>>>>>
>>>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>>>>>>>> Windows 10
>>>>>>>>>>
>>>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to match two or
>>>>>>>>>> more \n with any number of spaces between them (e.g. \n\n, \n \n,
>>>>>>>>>> \n \n \n \n) and replace them with two <br>.
>>>>>>>>>>
>>>>>>>>>> I use the following regex pattern, and it works when I test it on
>>>>>>>>>> regex101.com.
>>>>>>>>>> But it does not work when I put it inside the
>>>>>>>>>> RegexReplaceProcessorFactory as below:
>>>>>>>>>>
>>>>>>>>>> <updateRequestProcessorChain name="removeCode">
>>>>>>>>>>   <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>>>>>     <str name="fieldName">content</str>
>>>>>>>>>>     <str name="pattern">"(\\n\s*){2,}"</str>
>>>>>>>>>>     <str name="replacement"><br><br></str>
>>>>>>>>>>   </processor>
>>>>>>>>>> </updateRequestProcessorChain>
>>>>>>>>>>
>>>>>>>>>> To explain my regex pattern further: \s* tells the regex to match any
>>>>>>>>>> \n followed by whitespace, and {2,} tells it to match 2 or more
>>>>>>>>>> occurrences of that pattern (\n).
>>>>>>>>>>
>>>>>>>>>> Please let me know what is wrong and how I should do it.
>>>>>>>>>>
>>>>>>>>>> I am using Solr 7.6.0.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Edwin
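
The behaviour of the three patterns discussed in this thread can be reproduced outside Solr. The following is only a rough sketch: it uses Python's re module as a stand-in for the Java regex engine that Solr uses, and passes re.ASCII so that \s and \W are ASCII-only, which approximates Java's default character classes. The non-breaking space (U+00A0) in the second sample is an assumption made for illustration; the thread never identifies which character was actually in the content. The names collapse, plain and nbsp are hypothetical helpers, not part of Solr.

```python
import re

# re.ASCII restricts \s to [ \t\n\r\f\v] and \W to [^a-zA-Z0-9_],
# which approximates the default (non-Unicode) classes of java.util.regex.
FLAGS = re.ASCII

# As first posted: the quotes are literal, and \\n means a backslash followed by "n".
QUOTED = re.compile(r'"(\\n\s*){2,}"', FLAGS)
# The \s variant suggested in the thread.
WHITESPACE = re.compile(r'(\n\s*){2,}', FLAGS)
# The \W variant that was reported to work.
NON_WORD = re.compile(r'(\n\W*){2,}', FLAGS)


def collapse(pattern: re.Pattern, text: str) -> str:
    """Replace every match of `pattern` in `text` with two <br> tags."""
    return pattern.sub("<br><br>", text)


# Example 1 from the thread: newlines separated by plain ASCII spaces only.
plain = "Dear Sir, \n\n \n \n\n I am terminating"

# Example 2, with a HYPOTHETICAL U+00A0 between the two newline groups.
nbsp = "Psalm 89:17 \n\n\u00a0\n\n 3 Choa Chu Kang"

if __name__ == "__main__":
    for name, pat in [("quoted", QUOTED), ("\\s", WHITESPACE), ("\\W", NON_WORD)]:
        print(name, repr(collapse(pat, plain)), repr(collapse(pat, nbsp)))
```

Under these assumptions the quoted pattern leaves real newlines untouched, the \s pattern stops at the non-breaking space and so produces two separate <br><br> runs, and the \W pattern collapses the whole run into one. Since \W also consumes punctuation, a character class that lists only the whitespace characters to be tolerated may be a narrower alternative.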